Coalescent-based species tree inference has become widely used in the analysis of genome-scale multilocus and SNP datasets when the goal is inference of a species-level phylogeny. However, numerous evolutionary processes are known to violate the assumptions of a coalescence-only model and complicate inference of the species tree. One such process is hybrid speciation, in which a species shares its ancestry with two distinct species. Although many methods have been proposed to detect hybrid speciation, only a few have considered both hybridization and coalescence in a unified framework, and these are generally limited to the setting in which putative hybrid species must be identified in advance. Here we propose a method that can examine genome-scale data for a large number of taxa and detect those taxa that may have arisen via hybridization, as well as their potential “parental” taxa. The method is based on a model that considers both coalescence and hybridization together, and uses phylogenetic invariants to construct a test that scales well in terms of computational time for both the number of taxa and the amount of sequence data. We test the method using simulated data for up 20 taxa and 100,000bp, and find that the method accurately identifies both recent and ancient hybrid species in less than 30 seconds. We apply the method to two empirical datasets, one composed of Sistrurus rattlesnakes for which hybrid speciation is not supported by previous work, and one consisting of several species of Heliconius butterflies for which some evidence of hybrid speciation has been previously found.
Independent evolution of ab- and adaxial stomatal density enables adaptation
Evidence of adoption, monozygotic twinning, and low inbreeding rates in a large genetic pedigree of polar bears
An evolutionary hourglass of herbivore-induced transcriptomic responses in Nicotiana attenuata
What kind of maternal effects are selected for in fluctuating environments?
BrowseVCF: a web-based application and workflow to quickly prioritise disease-causative variants in VCF files.
Ancestral genome reconstruction reveals the history of ecological diversification in Agrobacterium.
Efficient Bayesian species tree inference under the multi-species coalescent
Efficient Bayesian species tree inference under the multi-species coalescent
Bruce Rannala, Ziheng Yang
A method was developed for Bayesian inference of species phylogeny using the multi-species coalescent model. To improve the mixing properties of the Markov chain Monte Carlo (MCMC) algorithm that traverses the space of species trees, we implement two efficient MCMC proposals: the first is based on the Subtree Pruning and Regrafting (SPR) algorithm and the second is based on a novel node-slider algorithm. Like the Nearest-Neighbor Interchange (NNI) algorithm we implemented previously, both algorithms propose changes to the species tree, while simultaneously altering the gene trees at multiple genetic loci to automatically avoid conflicts with the newly-proposed species tree. The method integrates over gene trees, naturally taking account of the uncertainty of gene tree topology and branch lengths given the sequence data. A simulation study was performed to examine the statistical properties of the new method. We found that it has excellent statistical performance, inferring the correct species tree with near certainty when analyzing 10 loci. The prior on species trees has some impact, particularly for small numbers of loci. An empirical dataset (for rattlesnakes) was reanalyzed. While the 18 nuclear loci and one mitochondrial locus support largely consistent species trees under the multi-species coalescent model estimates of parameters suggest drastically different evolutionary dynamics between the nuclear and mitochondrial loci.
The non-equilibrium allele frequency spectrum in a Poisson random field framework
The non-equilibrium allele frequency spectrum in a Poisson random field framework
Ingemar Kaj, Carina F. Mugal
In population genetic studies, the allele frequency spectrum (AFS) efficiently summarizes genome-wide polymorphism data and shapes a variety of allele frequency-based summary statistics. While existing theory typically features equilibrium conditions, emerging methodology requires an analytical understanding of the build-up of the allele frequencies over time. In this work, we use the framework of Poisson random fields to derive new representations of the non-equilibrium AFS for the case of a Wright-Fisher population model with selection. In our approach, the AFS is a scaling-limit of the expectation of a Poisson stochastic integral and the representation of the non-equilibrium AFS arises in terms of a fixation time probability distribution. The known duality between the Wright-Fisher diffusion process and a birth and death process generalizing Kingman’s coalescent yields an additional representation. The results carry over to the setting of a random sample drawn from the population and provide the non-equilibrium behavior of sample statistics. Our findings are consistent with and extend a previous approach where the non-equilibrium AFS solves a partial differential forward equation with a non-traditional boundary condition. Moreover, we provide a bridge to previous coalescent-based work, and hence tie several frameworks together. Since frequency-based summary statistics are widely used in population genetics, for example, to identify candidate loci of adaptive evolution, to infer the demographic history of a population, or to improve our understanding of the underlying mechanics of speciation events, the presented results are potentially useful for a broad range of topics.
Score distributions of gapped multiple sequence alignments down to the low-probability tail
Score distributions of gapped multiple sequence alignments down to the low-probability tail
Pascal Fieth, Alexander K. Hartmann
Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain result for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in case of finite sequence lengths. Here we extend the studies to the for practical applications in Molecular Biology much more relevant case of multiple sequence alignments with gaps. We study the distributions of scores over a large range of the support, reaching probabilities as small as 10^-160, for global and local (sum-of-pair scores) multiple alignments. We find that even after suitable rescaling, eliminating the sequence-length dependence, the distributions for multiple alignment differ from the pairwise alignment case. Furthermore, we also show that the previously discussed Gaussian correction to the Gumbel distribution needs to be refined, also for the case of pairwise alignments.