Demography-adjusted tests of neutrality based on genome-wide SNP data

Marina Rafajlović (1), Alexander Klassmann (2), Anders Eriksson (3), Thomas Wiehe (2), Bernhard Mehlig (1) ((1) Department of Physics, University of Gothenburg, Sweden, (2) Institut für Genetik, Universität zu Köln, Germany, (3) Department of Zoology, University of Cambridge, U.K.)
(Submitted on 1 Jul 2013)

Tests of the neutral evolution hypothesis are usually built on the standard null model, which assumes that mutations are neutral and that population size remains constant over time. However, it is unclear how such tests are affected if the latter assumption is dropped. Here, we extend the unifying framework for tests based on the site frequency spectrum, introduced by Achaz and Ferretti, to populations of varying size. A key ingredient is to specify the first two moments of the frequency spectrum. We show that these moments can be determined analytically if a population has experienced two instantaneous size changes in the past. We apply our method to data from ten human populations gathered in the 1000 Genomes Project, estimate their demographies and define demography-adjusted versions of Tajima’s $D$, Fay & Wu’s $H$, and Zeng’s $E$. The adjusted test statistics facilitate direct comparison between populations, and they show that most of the differences among populations seen in the original tests can be explained by demography. We carried out whole-genome screens for deviation from neutrality and identified candidate regions of recent positive selection. We provide track files with values of the adjusted and original tests for upload to the UCSC genome browser.

Lateral Gene Transfer, Rearrangement and Reconciliation

Murray Patterson, Gergely J Szöllősi, Vincent Daubin, Eric Tannier
(Submitted on 27 Jun 2013)

Background.
Models of ancestral gene order reconstruction have progressively integrated different evolutionary patterns and processes such as unequal gene content, gene duplications, and, implicitly, sequence evolution via reconciled gene trees. In unicellular organisms, these models have so far ignored lateral gene transfer, even though it can have an important confounding effect on such models and can also be a rich source of information on the function of genes through the detection of transfers of entire clusters of genes.
Result.
We report an algorithm together with its implementation, DeCoLT, that reconstructs ancestral genome organization based on reconciled gene trees which summarize information on sequence evolution, gene origination, duplication, loss, and lateral transfer. DeCoLT finds in polynomial time the minimum number of rearrangements, computed as the number of gains and breakages of adjacencies between pairs of genes. We apply DeCoLT to 1099 gene families from 36 cyanobacteria genomes.
Conclusion.
DeCoLT is able to reconstruct adjacencies in 35 ancestral bacterial genomes with a thousand gene families in a few hours, and detects clusters of co-transferred genes. As there is no constraint on genome organization, adjacencies can be generalized to any relationship between genes to reconstruct ancestral interactions, functions or complexes with the same framework.
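The cost that the abstract describes, gains and breakages of adjacencies between pairs of genes, can be illustrated on a single pair of gene orders. The sketch below is a deliberately simplified assumption: it compares two linear gene orders directly, ignores gene orientation, and does not use reconciled gene trees, so it is not the DeCoLT algorithm. The function names are hypothetical.

```python
def adjacencies(order):
    """Return the set of unordered neighbouring gene pairs in a linear gene order."""
    return {frozenset(pair) for pair in zip(order, order[1:])}

def gains_and_breaks(ancestor, descendant):
    """Count adjacencies gained and broken between two gene orders.

    A gain is an adjacency present in the descendant but not the ancestor;
    a break is an adjacency present in the ancestor but not the descendant.
    """
    anc, desc = adjacencies(ancestor), adjacencies(descendant)
    return len(desc - anc), len(anc - desc)
```

For example, going from the order a, b, c, d to a, c, b, d keeps the b-c adjacency, breaks a-b and c-d, and gains a-c and b-d.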

A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data

John O’Brien, Xavier Didelot, Zamin Iqbal, Lucas Amenga-Etego, Bartu Ahiska, Daniel Falush
(Submitted on 26 Jun 2013)

Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of strong mixing among samples. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples.

Genome-wide inference of ancestral recombination graphs

Matthew D. Rasmussen, Adam Siepel
(Submitted on 21 Jun 2013)

The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the “ancestral recombination graph” (ARG), a complete record of all coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are extremely computationally intensive, depend on fairly crude approximations, or are limited to small numbers of samples. As a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to be applied on the scale of dozens of complete human genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call “threading”. Using techniques based on hidden Markov models, this threading operation can be performed exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG, for twenty or more sequences generated under realistic parameters for human populations. We also report initial results from applications of ARGweaver to high-coverage individual human genome sequences from Complete Genomics. Work is in progress on further applications of these methods to genome-wide sequence data.

Phylogenetic analysis accounting for age-dependent death and sampling with applications to epidemics

Amaury Lambert, Helen K. Alexander, Tanja Stadler
(Submitted on 14 Jun 2013)

The reconstruction of phylogenetic trees based on viral genetic sequence data sequentially sampled from an epidemic provides estimates of the past transmission dynamics, by fitting epidemiological models to these trees. To our knowledge, none of the epidemiological models currently used in phylogenetics can account for recovery rates and sampling rates dependent on the time elapsed since transmission.
Here we introduce an epidemiological model where infectives leave the epidemic, either by recovery or sampling, after some random time which may follow an arbitrary distribution.
We derive an expression for the likelihood of the phylogenetic tree of sampled infectives under our general epidemiological model. The analytic concept developed in this paper will facilitate inference of past epidemiological dynamics and provide an analytical framework for performing very efficient simulations of phylogenetic trees under our model. The main idea of our analytic study is that the non-Markovian epidemiological model giving rise to phylogenetic trees growing vertically as time goes by, can be represented by a Markovian “coalescent point process” growing horizontally by the sequential addition of pairs of coalescence and sampling times.
As examples, we discuss two special cases of our general model, namely an application to influenza and an application to HIV. Though phrased in epidemiological terms, our framework can also be used for instance to fit macroevolutionary models to phylogenies of extant and extinct species, accounting for general species lifetime distributions.

mendelFix: a Perl script for checking Mendelian errors in high density SNP data of trio designs

Yuri Tani Utsunomiya, Rodrigo Vitorio Alonso, Adriana Santana do Carmo, Francine Campagnari, José Antonio Vinsintin, José Fernando Garcia
(Submitted on 10 Jun 2013)

Here we present mendelFix, a Perl script for checking Mendelian errors in genome-wide SNP data of trio designs. The program takes 12-recoded PLINK PED and MAP files as input to calculate a series of summary statistics for Mendelian errors, sets to missing those offspring genotypes that present Mendelian inconsistencies, and implements a simplistic procedure to infer missing genotypes using parent information. The program can be easily incorporated in any pipeline for family-based SNP data analysis, and is distributed as free software under the GNU General Public License.
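The core check described above can be sketched in a few lines. The following is an illustrative Python sketch, not mendelFix itself (which is a Perl script whose internals are not given here); the function names and the per-SNP tuple representation are assumptions. Genotypes follow the 12-recoded PLINK convention, with alleles "1" and "2" and "0" for missing.

```python
MISSING = ("0", "0")

def is_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele of each parent.

    Any trio with a missing allele ("0") is treated as uninformative
    and therefore not flagged as an error.
    """
    if "0" in father + mother + child:
        return True
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

def blank_mendelian_errors(fathers, mothers, children):
    """Set inconsistent offspring genotypes to missing and count them.

    Each argument is a list of genotype tuples, one per SNP, for one trio.
    Returns the cleaned offspring genotypes and the number of errors.
    """
    cleaned, errors = [], 0
    for f, m, c in zip(fathers, mothers, children):
        if is_consistent(f, m, c):
            cleaned.append(c)
        else:
            cleaned.append(MISSING)
            errors += 1
    return cleaned, errors
```

With both parents homozygous for allele 1, a heterozygous child is flagged and its genotype is blanked; a child that is homozygous for an allele carried by only one parent is flagged in the same way.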

Enhancement of a Novel Method for Mutational Disease Prediction using Bioinformatics Techniques and Backpropagation Algorithm

Ayad Ghany Ismaeel, Anar Auda Ablahad
(Submitted on 7 Jun 2013)

The novel method for mutational disease prediction uses bioinformatics tools and datasets to diagnose malignant mutations, together with a powerful artificial neural network (backpropagation network) to classify the malignant mutations related to gene(s) (such as BRCA1 and BRCA2) that cause a disease (breast cancer). That method did not consider the approaches adopted for handling and analyzing gene sequences to extract useful information from the sequence, and it also passed over the environmental factors, which play important roles in determining and calculating some gene features in order to view their functional parts and relations to diseases. This paper proposes an enhancement of the novel method as a first way to diagnose and predict disease from mutations, by considering and introducing multiple additional features that reflect alterations and changes in the environment as well as in genes, by comparing sequences to gain information about the structure or function of a query sequence, and by proposing an optimal and more accurate system for classifying and handling a specific disorder using backpropagation with a mean square rate of 0.000000001. Index Terms: homology sequence, GC content and AT content, bioinformatics, backpropagation network, BLAST, DNA sequence, protein sequence.

Hide and seek: placing and finding an optimal tree for thousands of homoplasy-rich sequences

Dietrich Radel, Andreas Sand, Mike Steel
(Submitted on 6 Jun 2013)

Finding optimal evolutionary trees from sequence data is typically an intractable problem, and there is usually no way of knowing how close to optimal the best tree from some search truly is. The problem would seem to be particularly acute when we have many taxa and when the data has high levels of homoplasy, in which the individual characters require many changes to fit on the best tree. However, a recent mathematical result has provided a precise tool to generate a small number of high-homoplasy characters for any given tree, so that this tree is provably the optimal tree under the maximum parsimony criterion. This provides, for the first time, a rigorous way to test tree search algorithms on homoplasy-rich data, where we know in advance what the ‘best’ tree is. In this short note we consider just one search program (TNT) but show that it is able to locate the globally optimal tree correctly for 32,768 taxa, even though the characters in the dataset require, on average, 1148 state-changes each to fit on this tree, and the number of characters is only 57.
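The number of state-changes a character requires on a given tree is its parsimony score, which for a fixed binary tree can be computed with Fitch's algorithm. The sketch below is a minimal illustration under assumed conventions (a rooted binary tree as nested 2-tuples of leaf names, and a dictionary of leaf states); it scores a single character on a single tree and is unrelated to how TNT searches tree space.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree   : a leaf name (str) or a 2-tuple of subtrees
    states : dict mapping each leaf name to its character state
    """
    def visit(node):
        # Leaves contribute their observed state and no changes.
        if isinstance(node, str):
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children can agree on a state: no extra change needed.
            return common, left_cost + right_cost
        # The children disagree: take the union and count one change.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]
```

On the tree ((A,B),(C,D)), a character with states 0,0,1,1 needs one change, whereas the pattern 0,1,0,1 needs two and is therefore homoplastic on that tree.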

Populations in statistical genetic modelling and inference

Daniel John Lawson
(Submitted on 4 Jun 2013)

What is a population? This review considers how a population may be defined in terms of understanding the structure of the underlying genetics of the individuals involved. The main approach is to consider statistically identifiable groups of randomly mating individuals, which is well defined in theory for any type of (sexual) organism. We discuss generative models using drift, admixture and spatial structure, and the ancestral recombination graph. These are contrasted with statistical models for inference, principal component analysis and other ‘non-parametric’ methods. The relationships between these approaches are explored with both simulated and real-data examples. The state-of-the-art practical software tools are discussed and contrasted. We conclude that populations are a useful theoretical construct that can be well defined in theory and often approximately exist in practice.

Computing the posterior expectation of phylogenetic trees

Philipp Benner, Miroslav Bačák
(Submitted on 16 May 2013)

Inferring phylogenetic trees from multiple sequence alignments often relies upon Markov chain Monte Carlo (MCMC) methods to generate tree samples from a posterior distribution. To give a rigorous approximation of the posterior expectation, one needs to compute the mean of the tree samples, and therefore a sound definition of a mean and algorithms for its computation are much needed. To the best of our knowledge, no existing method of phylogenetic inference can handle the full set of sample trees, because such trees typically have different topologies. We develop a novel statistical model for the inference of phylogenetic trees based on the tree space due to Billera et al. [2001]. Since it is a Hadamard space, the mean and median are well defined, which we also motivate from a decision-theoretic perspective. The actual approximation of the posterior expectation relies on some recent developments in Hadamard spaces (Bačák [2013a], Miller et al. [2012]) and the fast computation of geodesics in tree space (Owen and Provan [2011]), which together make it possible to compute medians and means of trees with different topologies. Our intention is to give a full self-contained description of the methods required to approximate posterior expectations. We demonstrate these methods on the small ribosomal subunit rRNA sequence alignment. The posterior expectations obtained on this data set are a meaningful summary of the posterior distribution and the uncertainty about the tree topology.
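As a reminder of what mean and median usually denote in a metric space of this kind, the standard definitions are written out below. The notation is an assumption made here for illustration: $T_1, \dots, T_N$ are the sampled trees, $\mathcal{T}_n$ is the tree space, and $d$ is the geodesic distance. The abstract itself does not state these formulas.

```latex
\bar{T} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{T}_n} \; \sum_{i=1}^{N} d(x, T_i)^2 ,
\qquad
\tilde{T} \;\in\; \operatorname*{arg\,min}_{x \in \mathcal{T}_n} \; \sum_{i=1}^{N} d(x, T_i) .
```

The first expression is the Fréchet mean, which minimizes the sum of squared distances to the samples; the second is a median, which minimizes the sum of distances. Neither requires the sampled trees to share a topology, since $d$ is defined between any two points of the space.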