Predicting discovery rates of genomic features

Simon Gravel, NHLBI GO Exome Sequencing Project
(Submitted on 13 Mar 2014)

Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict “omics” variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require about 15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and sub-sampled 1000 Genomes Project data. Extrapolating based on the NHLBI Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African-Americans, and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types.


Increased genetic diversity improves crop yield stability under climate variability: a computational study on sunflower

Pierre Casadebaig (1), Ronan Trépos (2), Victor Picheny (2), Nicolas B. Langlade (3), Patrick Vincourt (3), Philippe Debaeke (1) ((1) INRA, UMR1248 AGIR, Castanet-Tolosan, France, (2) INRA, UR875 MIAT, Castanet-Tolosan, France, (3) INRA, UMR441 LIPM, Castanet-Tolosan, France)
(Submitted on 12 Mar 2014)

A crop can be represented as a biotechnical system in which components are either chosen (cultivar, management) or given (soil, climate) and whose combination generates highly variable stress patterns and yield responses. Here, we used modeling and simulation to predict the crop phenotypic plasticity resulting from the interaction of plant traits (G), climatic variability (E) and management actions (M). We designed two in silico experiments that compared existing and virtual sunflower cultivars (Helianthus annuus L.) in a target population of cropping environments by simulating a range of indicators of crop performance. Optimization methods were then used to search for GEM combinations that matched desired crop specifications. Computational experiments showed that the fit of particular cultivars to specific environments gradually increases with knowledge of pedo-climatic conditions. At the regional scale, tuning the choice of cultivar affected crop performance with the same magnitude as the effect of yearly genetic progress made by breeding. When considering virtual genetic material, designed by recombining plant traits, cultivar choice had a greater positive impact on crop performance and stability. Results suggested that breeding for key traits conferring plant plasticity improved cultivar global adaptation capacity, whereas increasing genetic diversity made it possible to choose cultivars with distinctive traits that were better adapted to specific conditions. Consequently, breeding genetic material that is both plastic and diverse may improve yield stability of agricultural systems exposed to climatic variability. We argue that process-based modeling could help enhance spatial management of cultivated genetic diversity and could be integrated in functional breeding approaches.

Substitution and site-specific selection driving B cell affinity maturation is consistent across individuals

Connor O. McCoy, Trevor Bedford, Vladimir N. Minin, Harlan Robins, Frederick A. Matsen IV
(Submitted on 12 Mar 2014)

The antibody repertoire of each individual is continuously updated by the evolutionary process of B cell receptor mutation and selection. It has recently become possible to gain detailed information concerning this process through high-throughput sequencing. Here, we develop modern statistical molecular evolution methods for the analysis of B cell sequence data, and then apply them to a very deep short-read data set of B cell receptors. We find that the substitution process is conserved across individuals but varies significantly across gene segments. We investigate selection on B cell receptors using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection, which we find is dominated by purifying selection, though not uniformly so.

Mapping quantitative trait loci underlying function-valued phenotypes

Il-Youp Kwak, Candace R. Moore, Edgar P. Spalding, Karl W. Broman
(Submitted on 12 Mar 2014)

Most statistical methods for QTL mapping focus on a single phenotype. However, multiple phenotypes are commonly measured, and recent technological advances have greatly simplified the automated acquisition of numerous phenotypes, including function-valued phenotypes, such as growth measured over time. While there exist methods for QTL mapping with function-valued phenotypes, they are generally computationally intensive and focus on single-QTL models. We propose two simple, fast methods that maintain high power and precision and are amenable to extensions with multiple-QTL models using a penalized likelihood approach. After identifying multiple QTL by these approaches, we can view the function-valued QTL effects to provide a deeper understanding of the underlying processes. Our methods have been implemented as a package for R, funqtl.

Genealogy of a Wright Fisher model with strong seed bank component

Jochen Blath, Bjarki Eldon, Adrián González Casanova, Noemi Kurt
(Submitted on 12 Mar 2014)

We investigate the behaviour of the genealogy of a Wright-Fisher population model under the influence of a strong seed-bank effect. More precisely, we consider a simple seed-bank age distribution with two atoms, leading to either classical or long genealogical jumps (the latter modeling the effect of seed-dormancy). We assume that the length of these long jumps scales like a power N^β of the original population size N, thus giving rise to a ‘strong’ seed-bank effect. For a certain range of β, we prove that the ancestral process of a sample of n individuals converges under a non-classical time-scaling to Kingman’s n-coalescent. Further, for a wider range of parameters, we analyze the time to the most recent common ancestor of two individuals analytically and by simulation.
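To make the two-atom jump structure concrete, here is a minimal Python sketch of a backward-in-time simulation of the time to the most recent common ancestor of two individuals. It is not the authors' code. The parameterization is an assumption for illustration: each lineage makes a long jump of round(N^β) generations with a hypothetical probability `eps` and a classical one-generation jump otherwise, and parents are chosen uniformly among the N individuals of the target generation.

```python
import random


def tmrca_two_lineages(N, beta, eps, rng):
    """Backward time (in generations) at which two sampled lineages coalesce.

    Assumed model (illustrative only): a lineage jumps back round(N**beta)
    generations with probability eps, otherwise one generation, and picks
    its parent uniformly among the N individuals of that generation.
    """
    long_jump = max(1, round(N ** beta))
    t = [0, 0]  # backward time currently reached by each lineage
    while True:
        # Move the lineage that is closer to the present (ties: lineage 0).
        i = 0 if t[0] <= t[1] else 1
        t[i] += long_jump if rng.random() < eps else 1
        # When the moved lineage lands in the generation where the other
        # lineage sits, it hits the same individual with probability 1/N.
        if t[0] == t[1] and rng.randrange(N) == 0:
            return t[0]


if __name__ == "__main__":
    rng = random.Random(1)
    samples = [tmrca_two_lineages(100, 0.5, 0.1, rng) for _ in range(2000)]
    print(sum(samples) / len(samples))
```

Setting `eps = 0` recovers the classical Wright-Fisher case, in which the coalescence time of two lineages is geometric with success probability 1/N.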

A chromatin structure based model accurately predicts DNA replication timing in human cells

Yevgeniy Gindin, Manuel S. Valenzuela, Mirit I. Aladjem, Paul S. Meltzer, Sven Bilke
Subjects: Subcellular Processes (q-bio.SC); Genomics (q-bio.GN)

The metazoan genome is replicated in a precise, cell-lineage-specific temporal order. However, the mechanism controlling this orchestrated process is poorly understood, as no molecular mechanisms have been identified that actively regulate the firing sequence of genome replication. Here we develop a mechanistic model of genome replication capable of predicting, with accuracy rivaling experimental repeats, the empirical replication timing program observed in humans. In our model, replication is initiated in an uncoordinated (time-stochastic) manner at well-defined sites. The model contains, in addition to the choice of the genomic landmark that localizes initiation, only a single adjustable parameter of direct biological relevance: the number of replication forks. We find that DNase hypersensitive sites are optimal and independent determinants of DNA replication initiation. We demonstrate that the DNA replication timing program in human cells is a robust emergent phenomenon that, by its very nature, does not require a regulatory mechanism determining a proper replication initiation firing sequence.

Alignathon: A competitive assessment of whole genome alignment methods.

Dent Earl, Ngan K Nguyen, Glenn Hickey, Robert S. Harris, Stephen Fitzgerald, Kathryn Beal, Igor Seledtsov, Vladimir Molodtsov, Brian Raney, Hiram Clawson, Jaebum Kim, Carsten Kemena, Jia-Ming Chang, Ionas Erb, Alexander Poliakov, Minmei Hou, Javier Herrero, Victor Solovyev, Aaron E. Darling, Jian Ma, Cedric Notredame, Michael Brudno, Inna Dubchak, David Haussler, Benedict Paten

Background: Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark datasets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole genome alignment (WGA). Results: Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments, and assessments were performed collectively after all the submissions were received. Three datasets were used: two of simulated primate and mammalian phylogenies, and one of 20 real fly genomes. In total 35 submissions were assessed, submitted by ten teams using 12 different alignment pipelines. Conclusions: We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable difference in the alignment quality of differently annotated regions, and found few tools aligned the duplications analysed. We found many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all datasets, submissions and assessment programs for further study, and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.

Adaptive evolution of molecular phenotypes

Torsten Held, Armita Nourmohammad, Michael Lässig
(Submitted on 7 Mar 2014)

Molecular phenotypes link genomic information with organismic functions, fitness, and evolution. Quantitative traits are complex phenotypes that depend on multiple genomic loci. In this paper, we study the adaptive evolution of a quantitative trait under time-dependent selection, which arises from environmental changes or through fitness interactions with other co-evolving phenotypes. We analyze a model of trait evolution under mutations and genetic drift in a single-peak fitness seascape. The fitness peak performs a constrained random walk in the trait amplitude, which determines the time-dependent trait optimum in a given population. We derive analytical expressions for the distribution of the time-dependent trait divergence between populations and of the trait diversity within populations. Based on this solution, we develop a method to infer adaptive evolution of quantitative traits. Specifically, we show that the ratio of the average trait divergence and the diversity is a universal function of evolutionary time, which predicts the stabilizing strength and the driving rate of the fitness seascape. From an information-theoretic point of view, this function measures the macro-evolutionary entropy in a population ensemble, which determines the predictability of the evolutionary process. Our solution also quantifies two key characteristics of adapting populations: the cumulative fitness flux, which measures the total amount of adaptation, and the adaptive load, which is the fitness cost due to a population’s lag behind the fitness peak.

An improved sequence measure used to scan genomes for regions of recent gene flow

Anthony J. Geneva, Christina A. Muirhead, LeAnne M. Lovato, Sarah B. Kingan, Daniel Garrigan
(Submitted on 6 Mar 2014)

The study of complex speciation, or speciation with gene flow, requires the identification of genomic regions that are either unusually divergent or that have experienced recent gene flow. Furthermore, the rapid growth of population genomic datasets relevant to studying complex speciation requires that analytical tools be scalable to the level of whole-genome analysis. We present a simple sequence measure, Gmin, which is specifically designed to identify regions of diverging genomes as candidates for experiencing recent gene flow. Gmin is defined as the ratio of the minimum number of nucleotide differences between sequences from two different populations to the average number of between-population differences. We compare the sensitivity of Gmin to that of the widely used index of population differentiation, Fst. Extensive computer simulations demonstrate that Gmin has greater sensitivity and specificity to detect gene flow than Fst. Additionally, the sensitivity of Gmin to detect gene flow is robust with respect to both the population mutation and recombination rates, suggesting that it is flexible and can be applied to a variety of biological scenarios. Finally, a scan of Gmin across the X chromosome of Drosophila melanogaster identifies candidate regions of introgression between sub-Saharan African and cosmopolitan populations that were previously missed by other methods. These results demonstrate that Gmin is a biologically straightforward, yet powerful, alternative to Fst, as well as to more computationally intensive model-based methods for detecting gene flow.
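The definition of Gmin given in the abstract translates directly into a few lines of code. The Python sketch below is illustrative rather than the authors' implementation: the function names `hamming` and `g_min` are hypothetical, sequences are assumed to be pre-aligned strings of equal length from a single genomic window, and missing data and windowing across a chromosome are ignored.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def g_min(pop1, pop2):
    """Ratio of the minimum to the mean number of between-population differences.

    pop1 and pop2 are lists of aligned sequences (strings) from the same
    window. Returns NaN when all between-population pairs are identical.
    """
    dists = [hamming(s, t) for s, t in product(pop1, pop2)]
    mean = sum(dists) / len(dists)
    if mean == 0:
        return float("nan")
    return min(dists) / mean


if __name__ == "__main__":
    # Pairwise differences are 2, 4, 1, 3: minimum 1, mean 2.5.
    print(g_min(["AAAA", "AAAT"], ["AATT", "TTTT"]))
```

Because the minimum can never exceed the mean, the ratio lies between 0 and 1. Values well below 1 indicate that at least one between-population pair is much more similar than average, which is the signature of recent gene flow that the measure is designed to flag.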

A renewal theory approach to IBD sharing

Shai Carmi, Itsik Pe’er
(Submitted on 6 Mar 2014)

Long genomic segments that are nearly identical between a pair of individuals and are inherited from a recent common ancestor without recombination are called identical-by-descent (IBD) segments. IBD sharing has numerous applications in genetics, from demographic inference to phasing, imputation, pedigree reconstruction, and disease mapping. Here, we provide a theoretical analysis of IBD sharing under Markovian approximations of the coalescent with recombination. We describe a general framework for the IBD process along the chromosome under the Markovian models (SMC/SMC’), as well as introduce and justify a new model, which we term the renewal approximation, under which lengths of successive segments are independent. Then, considering the infinite-chromosome limit of the IBD process, we recover previous results (for SMC) and derive new results (for SMC’) for the average fraction of the chromosome found in long shared segments and the average number of such segments. A number of new results for tree heights in SMC’ are proved as lemmas. We then use renewal theory to derive an expression (in Laplace space) for the distribution of the number of shared segments and demonstrate implications for demographic inference. We also use renewal theory to compute the distribution of the fraction of the chromosome shared. While the expression is again in Laplace space, we could invert the first two moments and compare a number of approximations. Finally, we generalized all results to populations with variable historical effective size.