Ancestral genome reconstruction reveals the history of ecological diversification in Agrobacterium.

Florent Lassalle, Remi Planel, Simon Penel, David Chapulliot, Valerie Barbe, Audrey Dubost, Alexandra Calteau, David Vallenet, Damien Mornico, Laurent Gueguen, Ludovic Vial, Daniel Muller, Vincent Daubin, Xavier Nesme

Efficient Bayesian species tree inference under the multi-species coalescent

Bruce Rannala, Ziheng Yang

A method was developed for Bayesian inference of species phylogeny using the multi-species coalescent model. To improve the mixing properties of the Markov chain Monte Carlo (MCMC) algorithm that traverses the space of species trees, we implement two efficient MCMC proposals: the first is based on the Subtree Pruning and Regrafting (SPR) algorithm and the second is based on a novel node-slider algorithm. Like the Nearest-Neighbor Interchange (NNI) algorithm we implemented previously, both algorithms propose changes to the species tree, while simultaneously altering the gene trees at multiple genetic loci to automatically avoid conflicts with the newly proposed species tree. The method integrates over gene trees, naturally taking account of the uncertainty of gene tree topology and branch lengths given the sequence data. A simulation study was performed to examine the statistical properties of the new method. We found that it has excellent statistical performance, inferring the correct species tree with near certainty when analyzing 10 loci. The prior on species trees has some impact, particularly for small numbers of loci. An empirical dataset (for rattlesnakes) was reanalyzed. While the 18 nuclear loci and one mitochondrial locus support largely consistent species trees under the multi-species coalescent model, estimates of parameters suggest drastically different evolutionary dynamics between the nuclear and mitochondrial loci.

The non-equilibrium allele frequency spectrum in a Poisson random field framework

Ingemar Kaj, Carina F. Mugal

In population genetic studies, the allele frequency spectrum (AFS) efficiently summarizes genome-wide polymorphism data and shapes a variety of allele frequency-based summary statistics. While existing theory typically features equilibrium conditions, emerging methodology requires an analytical understanding of the build-up of the allele frequencies over time. In this work, we use the framework of Poisson random fields to derive new representations of the non-equilibrium AFS for the case of a Wright-Fisher population model with selection. In our approach, the AFS is a scaling-limit of the expectation of a Poisson stochastic integral and the representation of the non-equilibrium AFS arises in terms of a fixation time probability distribution. The known duality between the Wright-Fisher diffusion process and a birth and death process generalizing Kingman’s coalescent yields an additional representation. The results carry over to the setting of a random sample drawn from the population and provide the non-equilibrium behavior of sample statistics. Our findings are consistent with and extend a previous approach where the non-equilibrium AFS solves a partial differential forward equation with a non-traditional boundary condition. Moreover, we provide a bridge to previous coalescent-based work, and hence tie several frameworks together. Since frequency-based summary statistics are widely used in population genetics, for example, to identify candidate loci of adaptive evolution, to infer the demographic history of a population, or to improve our understanding of the underlying mechanics of speciation events, the presented results are potentially useful for a broad range of topics.
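
As a small illustration of the object this abstract studies, the sketch below tabulates the sample allele frequency spectrum from a set of haplotypes. It is not the authors' method and says nothing about non-equilibrium dynamics; the function name `sample_afs` and the 0/1 encoding (0 = ancestral, 1 = derived) are assumptions made for the example.

```python
from collections import Counter

def sample_afs(haplotypes):
    """Return the sample AFS as a list [xi_1, ..., xi_{n-1}].

    haplotypes: list of n equal-length lists of 0/1 alleles
    (assumed encoding: 0 = ancestral, 1 = derived).
    xi_k is the number of sites where exactly k of the n
    sampled haplotypes carry the derived allele.
    """
    n = len(haplotypes)
    # zip(*haplotypes) iterates over sites (columns of the matrix).
    derived_counts = Counter(sum(site) for site in zip(*haplotypes))
    # Sites with 0 or n derived copies are not polymorphic in the sample.
    return [derived_counts.get(k, 0) for k in range(1, n)]

# Hypothetical toy data: 3 haplotypes, 4 sites.
toy = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
afs = sample_afs(toy)
```

Here `afs[k - 1]` holds the number of sites at which the derived allele appears in exactly `k` sampled haplotypes.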

Score distributions of gapped multiple sequence alignments down to the low-probability tail

Pascal Fieth, Alexander K. Hartmann

Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution is known analytically to follow a Gumbel distribution. Distributions for gapped local alignments and global alignments of finite lengths can only be obtained numerically. To obtain results for the small-probability region, specific statistical mechanics-based rare-event algorithms can be applied. In previous studies, this was achieved for pairwise alignments. They showed that, contrary to results from previous simple sampling studies, strong deviations from the Gumbel distribution occur in the case of finite sequence lengths. Here we extend these studies to multiple sequence alignments with gaps, a case that is much more relevant for practical applications in molecular biology. We study the distributions of scores over a large range of the support, reaching probabilities as small as 10^-160, for global and local (sum-of-pair scores) multiple alignments. We find that even after suitable rescaling, eliminating the sequence-length dependence, the distributions for multiple alignment differ from the pairwise alignment case. Furthermore, we also show that the previously discussed Gaussian correction to the Gumbel distribution needs to be refined, also for the case of pairwise alignments.

Efficient Quartet Representations of Trees and Applications to Supertree and Summary Methods

Ruth Davidson, MaLyn Lawhorn, Joseph Rusinko, Noah Weber

Quartet trees which are displayed by larger phylogenetic trees have long been used as inputs for species tree and supertree reconstruction. Computational constraints prevent the use of all displayed quartets when the number of genes or number of taxa is large. We introduce the Efficient Quartet System (EQS) to represent a phylogenetic tree with a subset of the quartets displayed by the tree. We show mathematically that the set of quartets obtained from a tree via EQS contains all of the combinatorial information of the tree itself. We also demonstrate via performance tests on some simulated datasets that the use of EQS to reduce the number of quartets input to quartet-based species tree methods (including summary methods) and supertree methods only corresponds to small reductions in accuracy.
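
To make the notion of "quartets displayed by a tree" concrete, the following sketch enumerates every displayed quartet of a small tree by brute force. It is not the EQS construction, which the abstract does not specify; it is the exhaustive enumeration whose cost, one check per 4-taxon subset, motivates using a subset of quartets instead. The nested-tuple tree encoding and the function names are assumptions for illustration.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf_set, clade_list) for a nested-tuple tree.

    Leaves are any non-tuple labels; each internal node contributes
    the set of leaves below it, which defines one side of a split.
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def displayed_quartets(tree):
    """Return all quartets ((p, q), (s, t)), read pq|st, displayed by tree.

    A quartet pq|st is displayed when some split of the tree has
    p and q on one side and s and t on the other.
    """
    taxa, splits = clades(tree)
    quartets = set()
    for a, b, c, d in combinations(sorted(taxa), 4):
        pairings = (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c)))
        for (p, q), (s, t) in pairings:
            for side in splits:
                membership = [x in side for x in (p, q, s, t)]
                if membership in ([True, True, False, False],
                                  [False, False, True, True]):
                    quartets.add(((p, q), (s, t)))
                    break
    return quartets

# Hypothetical five-taxon example tree.
example = ((("A", "B"), "C"), ("D", "E"))
quartets = displayed_quartets(example)
```

For a fully resolved tree each 4-taxon subset contributes exactly one quartet, so the number of displayed quartets grows as the binomial coefficient "n choose 4" in the number of taxa n.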

Amplifiers for the Moran Process

Andreas Galanis, Andreas Göbel, Leslie Ann Goldberg, John Lapinskas, David Richerby

The Moran process, as studied by Lieberman, Hauert and Nowak, is a stochastic process modelling the spread of genetic mutations in populations. It has an underlying graph in which vertices correspond to individuals. Initially, one individual (chosen uniformly at random) possesses a mutation, with fitness r>1. All other individuals have fitness 1. At each step of the discrete-time process, an individual is chosen with probability proportional to its fitness, and its state (mutant or non-mutant) is passed on to an out-neighbour chosen uniformly at random. If the underlying graph is strongly connected, the process will eventually reach fixation (all individuals are mutants) or extinction (no individuals are mutants). We define an infinite family of directed graphs to be strongly amplifying if, for every r>1, the extinction probability tends to 0 as the number n of vertices increases. Strong amplification is a rather surprising property – the initial mutant only has a fixed selective advantage, independent of n, which is “amplified” to give a fixation probability tending to 1. Strong amplifiers have received quite a bit of attention. Lieberman et al. proposed two potential families of them: superstars and metafunnels. It has been argued heuristically that some infinite families of superstars are strongly amplifying. The same has been claimed for metafunnels. We give the first rigorous proof that there is a strongly amplifying family of directed graphs which we call “megastars”. We show that the extinction probability of n-vertex graphs in this family of megastars is roughly n^(-1/2), up to logarithmic factors, and that all infinite families of superstars and metafunnels have larger extinction probabilities as a function of n. Our analysis of megastars is tight, up to logarithmic factors. We also clarify the literature on the isothermal theorem of Lieberman et al.
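
The process described in this abstract can be simulated directly. The sketch below is a minimal Monte Carlo illustration, assuming the graph is given as a list of out-neighbour lists and is strongly connected; the function names are hypothetical, and the code does not construct superstars, metafunnels, or megastars.

```python
import random

def moran_fixates(out_neighbours, r, rng):
    """Run one Moran process; return True on fixation, False on extinction.

    out_neighbours[v] lists the vertices that v can pass its state to.
    The graph is assumed strongly connected, so the loop terminates.
    """
    n = len(out_neighbours)
    mutant = [False] * n
    mutant[rng.randrange(n)] = True  # initial mutant, uniformly at random
    count = 1
    while 0 < count < n:
        # Reproducer is chosen with probability proportional to fitness:
        # r for mutants, 1 for non-mutants.
        weights = [r if is_mutant else 1.0 for is_mutant in mutant]
        v = rng.choices(range(n), weights=weights)[0]
        # Its state is copied to an out-neighbour chosen uniformly at random.
        w = rng.choice(out_neighbours[v])
        if mutant[w] != mutant[v]:
            count += 1 if mutant[v] else -1
            mutant[w] = mutant[v]
    return count == n

def fixation_probability(out_neighbours, r, trials, seed=0):
    """Estimate the fixation probability from independent runs."""
    rng = random.Random(seed)
    wins = sum(moran_fixates(out_neighbours, r, rng) for _ in range(trials))
    return wins / trials

# Example: complete directed graph on 5 vertices (no self-loops).
complete = [[w for w in range(5) if w != v] for v in range(5)]
estimate = fixation_probability(complete, r=2.0, trials=2000)
```

On the complete graph the estimate can be compared with the classical well-mixed value (1 - 1/r) / (1 - 1/r^n); an amplifying graph is one whose fixation probability exceeds that baseline.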

Inferring protein-protein interaction networks from inter-protein sequence co-evolution

Christoph Feinauer, Hendrik Szurmant, Martin Weigt, Andrea Pagnani

Interaction between proteins is a fundamental mechanism that underlies virtually all biological processes. Many important interactions are conserved across a large variety of species. The need to maintain interaction leads to a high degree of co-evolution between residues in the interface between partner proteins. The inference of protein-protein interaction networks from the rapidly growing sequence databases is one of the most formidable tasks in systems biology today. We propose here a novel approach based on the Direct-Coupling Analysis of the co-evolution between inter-protein residue pairs. We use ribosomal and trp operon proteins as test cases: for the small and large ribosomal subunits, our approach predicts protein-interaction partners at true-positive rates of 70% and 90%, respectively, within the first 10 predictions, with areas under the ROC curves of 0.69 and 0.81, respectively, for all predictions. In the trp operon, it assigns the two largest interaction scores to the only two interactions experimentally known. On the level of residue interactions, we show that for the small and the large ribosomal subunit our approach predicts interacting residues in the system with true-positive rates of 60% and 85%, respectively, in the first 20 predictions. We use artificial data to show that the performance of our approach depends crucially on the size of the joint multiple sequence alignments and analyze how many sequences would be necessary for a perfect prediction if the sequences were sampled from the same model that we use for prediction. Given the performance of our approach on the test data, we speculate that it can be used to detect new interactions, especially in the light of the rapid growth of available sequence data.

The impact of natural selection on the distribution of cis-regulatory variation across the genome of an outcrossing plant

Kim A Steige, Benjamin Laenen, Johan Reimegård, Douglas Scofield, Tanja Slotte

Evolutionary dynamics of cytoplasmic segregation and fusion: Mitochondrial mixing facilitated the evolution of sex at the origin of eukaryotes

Arunas L Radzvilavicius

Changes in the relative abundance of two Saccharomyces species from oak forests to wine fermentations

Sofia Dashko, Ping Liu, Helena Volk, Lorena Butinar, Jure Piskur, Justin C. Fay