Genealogy of a Wright Fisher model with strong seed bank component

Jochen Blath, Bjarki Eldon, Adrián González Casanova, Noemi Kurt
(Submitted on 12 Mar 2014)

We investigate the behaviour of the genealogy of a Wright-Fisher population model under the influence of a strong seed-bank effect. More precisely, we consider a simple seed-bank age distribution with two atoms, leading to either classical or long genealogical jumps (the latter modeling the effect of seed-dormancy). We assume that the length of these long jumps scales like a power N^β of the original population size N, thus giving rise to a ‘strong’ seed-bank effect. For a certain range of β, we prove that the ancestral process of a sample of n individuals converges under a non-classical time-scaling to Kingman’s n-coalescent. Further, for a wider range of parameters, we analyze the time to the most recent common ancestor of two individuals analytically and by simulation.
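
The two-atom jump mechanism described above can be illustrated with a small simulation of the time to the most recent common ancestor of two individuals. This is only a sketch under stated assumptions, not the authors' code: the abstract does not give the weights of the two atoms, so the long-jump probability `eps` is a hypothetical parameter, the long jump length is taken to be N^β rounded to an integer, and each ancestor is assumed to be drawn uniformly from the N individuals of the target generation.

```python
import random


def tmrca_two_lineages(N, beta, eps, rng):
    """Simulate the generation (counted backwards) of the first common ancestor.

    Assumptions (not taken from the abstract): each ancestral jump has
    length 1 with probability 1 - eps and length round(N**beta) with
    probability eps; the ancestor is uniform among the N individuals.
    """
    if N < 2:
        raise ValueError("need at least two individuals")
    long_jump = max(1, round(N ** beta))
    gen = [0, 0]   # how far back each lineage currently sits
    ind = [0, 1]   # label of the individual each lineage occupies
    while True:
        # Always move the lineage that is less far back, so that no
        # shared (generation, individual) pair can be skipped over.
        k = 0 if gen[0] <= gen[1] else 1
        step = long_jump if rng.random() < eps else 1
        gen[k] += step
        ind[k] = rng.randrange(N)
        if gen[0] == gen[1] and ind[0] == ind[1]:
            return gen[0]


if __name__ == "__main__":
    rng = random.Random(2014)
    runs = [tmrca_two_lineages(100, 0.5, 0.1, rng) for _ in range(1000)]
    print("mean simulated TMRCA (generations):", sum(runs) / len(runs))
```

With `eps = 0` the sketch reduces to the classical Wright-Fisher case, in which the two lineages coalesce after a geometrically distributed number of generations with mean N.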

Alignathon: A competitive assessment of whole genome alignment methods.

Dent Earl, Ngan K Nguyen, Glenn Hickey, Robert S. Harris, Stephen Fitzgerald, Kathryn Beal, Igor Seledtsov, Vladimir Molodtsov, Brian Raney, Hiram Clawson, Jaebum Kim, Carsten Kemena, Jia-Ming Chang, Ionas Erb, Alexander Poliakov, Minmei Hou, Javier Herrero, Victor Solovyev, Aaron E. Darling, Jian Ma, Cedric Notredame, Michael Brudno, Inna Dubchak, David Haussler, Benedict Paten

Background: Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark datasets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole genome alignment (WGA). Results: Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments, and assessments were performed collectively after all the submissions were received. Three datasets were used: two of simulated primate and mammalian phylogenies, and one of 20 real fly genomes. In total 35 submissions were assessed, submitted by ten teams using 12 different alignment pipelines. Conclusions: We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable difference in the alignment quality of differently annotated regions, and found few tools aligned the duplications analysed. We found many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all datasets, submissions and assessment programs for further study, and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.

Adaptive evolution of molecular phenotypes

Torsten Held, Armita Nourmohammad, Michael Lässig
(Submitted on 7 Mar 2014)

Molecular phenotypes link genomic information with organismic functions, fitness, and evolution. Quantitative traits are complex phenotypes that depend on multiple genomic loci. In this paper, we study the adaptive evolution of a quantitative trait under time-dependent selection, which arises from environmental changes or through fitness interactions with other co-evolving phenotypes. We analyze a model of trait evolution under mutations and genetic drift in a single-peak fitness seascape. The fitness peak performs a constrained random walk in the trait amplitude, which determines the time-dependent trait optimum in a given population. We derive analytical expressions for the distribution of the time-dependent trait divergence between populations and of the trait diversity within populations. Based on this solution, we develop a method to infer adaptive evolution of quantitative traits. Specifically, we show that the ratio of the average trait divergence and the diversity is a universal function of evolutionary time, which predicts the stabilizing strength and the driving rate of the fitness seascape. From an information-theoretic point of view, this function measures the macro-evolutionary entropy in a population ensemble, which determines the predictability of the evolutionary process. Our solution also quantifies two key characteristics of adapting populations: the cumulative fitness flux, which measures the total amount of adaptation, and the adaptive load, which is the fitness cost due to a population’s lag behind the fitness peak.

An improved sequence measure used to scan genomes for regions of recent gene flow

Anthony J. Geneva, Christina A. Muirhead, LeAnne M. Lovato, Sarah B. Kingan, Daniel Garrigan
(Submitted on 6 Mar 2014)

The study of complex speciation, or speciation with gene flow, requires the identification of genomic regions that are either unusually divergent or that have experienced recent gene flow. Furthermore, the rapid growth of population genomic datasets relevant to studying complex speciation requires that analytical tools be scalable to the level of whole-genome analysis. We present a simple sequence measure, Gmin, which is specifically designed to identify regions of diverging genomes as candidates for experiencing recent gene flow. Gmin is defined as the ratio of the minimum number of nucleotide differences between sequences from two different populations to the average number of between-population differences. We compare the sensitivity of Gmin to that of the widely used index of population differentiation, Fst. Extensive computer simulations demonstrate that Gmin has greater sensitivity and specificity to detect gene flow than Fst. Additionally, the sensitivity of Gmin to detect gene flow is robust with respect to both the population mutation and recombination rates, suggesting that it is flexible and can be applied to a variety of biological scenarios. Finally, a scan of Gmin across the X chromosome of Drosophila melanogaster identifies candidate regions of introgression between sub-Saharan African and cosmopolitan populations that were previously missed by other methods. These results demonstrate that Gmin is a biologically straightforward, yet powerful, alternative to Fst, as well as to more computationally intensive model-based methods for detecting gene flow.
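
The definition of Gmin given above translates directly into a few lines of code. The following is a minimal sketch for a single genomic window, assuming aligned sequences of equal length represented as strings; the function names `hamming` and `gmin` are illustrative and are not taken from the authors' software.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def gmin(pop1, pop2):
    """Minimum between-population difference divided by the mean one."""
    if not pop1 or not pop2:
        raise ValueError("both populations need at least one sequence")
    # All pairwise comparisons with one sequence from each population.
    dists = [hamming(a, b) for a, b in product(pop1, pop2)]
    mean = sum(dists) / len(dists)
    if mean == 0:
        raise ValueError("no between-population differences; ratio undefined")
    return min(dists) / mean


if __name__ == "__main__":
    pop_a = ["AAAA", "AAAT"]
    pop_b = ["AATT", "TTTT"]
    # Pairwise differences are 2, 4, 1 and 3: minimum 1, mean 2.5.
    print(gmin(pop_a, pop_b))
```

A value well below 1 means that at least one pair of sequences from the two populations is much more similar than the average pair, which is the pattern the abstract associates with recent gene flow.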

A renewal theory approach to IBD sharing

Shai Carmi, Itsik Pe’er
(Submitted on 6 Mar 2014)

Long genomic segments that are nearly identical between a pair of individuals and are inherited from a recent common ancestor without recombination are called identical-by-descent (IBD) segments. IBD sharing has numerous applications in genetics, from demographic inference to phasing, imputation, pedigree reconstruction, and disease mapping. Here, we provide a theoretical analysis of IBD sharing under Markovian approximations of the coalescent with recombination. We describe a general framework for the IBD process along the chromosome under the Markovian models (SMC/SMC’), as well as introduce and justify a new model, which we term the renewal approximation, under which lengths of successive segments are independent. Then, considering the infinite-chromosome limit of the IBD process, we recover previous results (for SMC) and derive new results (for SMC’) for the average fraction of the chromosome found in long shared segments and the average number of such segments. A number of new results for tree heights in SMC’ are proved as lemmas. We then use renewal theory to derive an expression (in Laplace space) for the distribution of the number of shared segments and demonstrate implications for demographic inference. We also use renewal theory to compute the distribution of the fraction of the chromosome shared. While the expression is again in Laplace space, we could invert the first two moments and compare a number of approximations. Finally, we generalized all results to populations with variable historical effective size.

Mycobiome of the Bat White Nose Syndrome (WNS) Affected Caves and Mines reveals High Diversity of Fungi and Local Adaptation by the Fungal Pathogen Pseudogymnoascus (Geomyces) destructans

Tao Zhang, Tanya R. Victor, Sunanda S. Rajkumar, Xiaojiang Li, Joseph C. Okoniewski, Alan C. Hicks, April D. Davis, Kelly Broussard, Shannon L. LaDeau, Sudha Chaturvedi, Vishnu Chaturvedi
(Submitted on 3 Mar 2014)

The investigations of the bat White Nose Syndrome (WNS) have yet to provide answers as to how the causative fungus Pseudogymnoascus (Geomyces) destructans (Pd) first appeared in the Northeast and how a single clone has spread rapidly in the US and Canada. We aimed to catalogue Pd and all other fungi (mycobiome) by the culture-dependent (CD) and culture-independent (CI) methods in four mines and two caves from the epicenter of WNS zoonotic. Six hundred sixty-five fungal isolates were obtained by the CD method, including the live recovery of Pd. Seven hundred three nucleotide sequences that met the definition of operational taxonomic units (OTUs) were recovered by CI methods. Most OTUs belonged to unidentified clones deposited in the databases as environmental nucleic acid sequences (ENAS). The core mycobiome of WNS-affected sites comprised 46 species of fungi from 31 genera recovered in culture, and 17 fungal genera and 31 ENAS identified from clone libraries. Fungi such as Arthroderma spp., Geomyces spp., Kernia spp., Mortierella spp., Penicillium spp., and Verticillium spp. were predominant in culture, while Ganoderma spp., Geomyces spp., Mortierella spp., Penicillium spp. and Trichosporon spp. were abundant in clone libraries. Alpha diversity analyses from CI data revealed that fungal community structure was highly diverse. However, the true species diversity remains undetermined due to undersampling. The frequent recovery of Pd indicated that the pathogen has adapted to WNS-afflicted habitats. Further, this study supports the hypothesis that Pd is an introduced species. These findings underscore the need for integrated WNS control measures that target both bats and the fungal pathogen.

Decoding coalescent hidden Markov models in linear time

Kelley Harris, Sara Sheehan, John A. Kamm, Yun S. Song
(Submitted on 4 Mar 2014)

In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computation resources. We also apply the method to data from the 1000 Genomes project, inferring a high-resolution history of size changes in the European population.
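
To make the quadratic cost mentioned above concrete, here is the textbook forward recursion for a generic HMM with d hidden states. It is a sketch of the standard algorithm only, not the authors' linear-time method and not the diCal or PSMC implementation; the function name and the plain-list representation of the matrices are illustrative choices.

```python
def forward_likelihood(init, trans, emit, obs):
    """Likelihood of an observation sequence under a generic HMM.

    init[s]     : probability of starting in hidden state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][o]  : probability that state s emits observation o
    obs         : sequence of observation indices, one per locus
    """
    if len(obs) == 0:
        raise ValueError("need at least one observation")
    d = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(d)]
    for o in obs[1:]:
        # For each of the d target states we sum over all d source
        # states, so every locus costs on the order of d * d operations.
        alpha = [
            emit[t][o] * sum(alpha[s] * trans[s][t] for s in range(d))
            for t in range(d)
        ]
    return sum(alpha)


if __name__ == "__main__":
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.2, 0.8]]
    emit = [[0.7, 0.3], [0.4, 0.6]]
    print(forward_likelihood(init, trans, emit, [0, 1]))
```

The inner sum over source states is the step that makes the runtime quadratic in the number of hidden time states; the abstract reports that the authors bring this down to linear without additional approximations, by means not shown here.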

Most viewed on Haldane’s Sieve: February 2014

The most viewed posts on Haldane’s Sieve last month were:

Local description of phylogenetic group-based models

Marta Casanellas, Jesús Fernández-Sánchez, Mateusz Michałek
(Submitted on 27 Feb 2014)

Motivated by phylogenetics, our aim is to obtain a system of equations that define a phylogenetic variety on an open set containing the biologically meaningful points. In this paper we consider phylogenetic varieties defined via group-based models. For any finite abelian group G, we provide an explicit construction of codim X phylogenetic invariants (polynomial equations) of degree at most |G| that define the variety X on a Zariski open set U. The set U contains all biologically meaningful points when G is the group of the Kimura 3-parameter model. In particular, our main result confirms a conjecture by the third author and, on the set U, a couple of conjectures by Bernd Sturmfels and Seth Sullivant.

DNA methylation modulates transcription factor occupancy chiefly at sites of high intrinsic cell-type variability

Matthew Maurano, Hao Wang, Sam John, Anthony Shafer, Theresa Canfield, Kristen Lee, John A Stamatoyannopoulos

The nuclear genome of every cell harbors millions of unoccupied transcription factor (TF) recognition sequences that harbor methylated cytosines. Although DNA methylation is commonly invoked as a repressive mechanism, the extent to which it actively silences specific TF occupancy sites is unknown. To define the role of DNA methylation in modulating TF binding, we quantified the effect of DNA methyltransferase abrogation on the occupancy patterns of a ubiquitous TF capable of autonomous binding to its target sites in chromatin (CTCF). Here we show that the vast majority of unoccupied, methylated CTCF recognition sequences remain unbound upon depletion of DNA methylation. Rather, methylation-regulated binding is restricted to a small fraction of elements that exhibit high intrinsic variability in CTCF occupancy across cell types. Our results suggest that DNA methylation is not a major groundskeeper of genomic transcription factor occupancy landscapes, but rather a specialized mechanism for stabilizing epigenetically labile sites.