The Fossilized Birth-Death Process: A Coherent Model of Fossil Calibration for Divergence Time Estimation

Tracy A. Heath, John P. Huelsenbeck, Tanja Stadler
(Submitted on 10 Oct 2013)

Time-calibrated species phylogenies are critical for addressing a wide range of questions in evolutionary biology, such as those that elucidate historical biogeography or uncover patterns of coevolution and diversification. Because molecular sequence data are not informative on absolute time, external data, most commonly fossil age estimates, are required to calibrate estimates of species divergence dates. For Bayesian divergence-time methods, the common practice for calibration using fossil information involves placing arbitrarily chosen parametric distributions on internal nodes, often disregarding most of the information in the fossil record. We introduce the ‘fossilized birth-death’ (FBD) process, a model for calibrating divergence-time estimates in a Bayesian framework, explicitly acknowledging that extant species and fossils are part of the same macroevolutionary process. Under this model, absolute node age estimates are calibrated by a single diversification model and arbitrary calibration densities are not necessary. Moreover, the FBD model allows for inclusion of all available fossils. We performed analyses of simulated data and show that node-age estimation under the FBD model results in robust and accurate estimates of species divergence times with realistic measures of statistical uncertainty, overcoming major limitations of standard divergence time estimation methods. We then used this model to estimate the speciation times for a dataset composed of all living bears, indicating that the genus Ursus diversified in the late Miocene to mid Pliocene.

Routes for breaching and protecting genetic privacy

Yaniv Erlich, Arvind Narayanan
(Submitted on 11 Oct 2013)

We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats to genetic privacy and discuss potential mitigation strategies for privacy-preserving dissemination of genetic data.

Application of compressed sensing to genome wide association studies and genomic selection

Shashaank Vattikuti, James J. Lee, Stephen D. H. Hsu, Carson C. Chow
(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show, using CS methods and theory, that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h^2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur, although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h^2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
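The idea of selecting a few nonzero loci from many markers can be illustrated with a small sketch. It assumes L1-penalized regression (the lasso) solved by coordinate descent as a stand-in for the CS recovery step; the abstract does not say which algorithm the authors used. The simulated Gaussian "genotypes", the penalty `lam`, the marker indices in `causal`, and all function names are hypothetical choices for illustration. The phenotype is noiseless, which corresponds to the h^2 = 1 case.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=100):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    columns[j] holds the n values of marker j (X stored column by column).
    """
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)  # residual = y - X @ beta, kept up to date
    col_sq = [sum(x * x for x in col) / n for col in columns]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Covariance of marker j with the residual that excludes its own fit.
            rho = sum(x * r for x, r in zip(col, residual)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Hypothetical toy problem: 200 markers, 100 individuals, 3 causal loci.
rng = random.Random(1)
n_samples, n_markers = 100, 200
causal = {5: 1.0, 50: -1.0, 120: 1.0}  # marker index -> effect size (assumed)
columns = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_markers)]
# Noiseless phenotype, i.e. heritability h^2 = 1.
y = [sum(effect * columns[j][i] for j, effect in causal.items())
     for i in range(n_samples)]

beta = lasso_coordinate_descent(columns, y, lam=0.2)
selected = sorted(j for j, b in enumerate(beta) if abs(b) > 1e-8)
print("selected markers:", selected)
```

The estimated effects are shrunk toward zero by the penalty, so the sketch is about which markers are selected, not about unbiased effect sizes.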

IQRray, a new method for Affymetrix microarray quality control, and the homologous organ conservation score, a new benchmark method for quality control metrics

Marta Rosikiewicz, Marc Robinson-Rechavi
(Submitted on 8 Oct 2013)

Motivation: Microarray results accumulated in public repositories are widely re-used in meta-analytical studies and secondary databases. The quality of the data obtained with this technology varies from experiment to experiment, and an efficient method for quality assessment is necessary to ensure their reliability. Results: The lack of a good benchmark has hampered evaluation of existing methods for quality control. In this study we propose a new independent quality metric that is based on evolutionary conservation of expression profiles. We show, using 11 large organ-specific datasets, that IQRray, a new quality metric developed by us, exhibits the highest correlation with this reference metric among the 14 metrics tested. IQRray outperforms other methods in identifying poor-quality arrays in datasets composed of arrays from many independent experiments. In contrast, the performance of methods designed for detecting outliers in a single experiment, such as NUSE and RLE, was low because of the inability of these methods to detect datasets containing only low-quality arrays, and because their scores cannot be directly compared between experiments. Availability: The R implementation of IQRray is available at: this ftp URL

Let my people go (home) to Spain: a genealogical model of Jewish identities since 1492

Joshua S. Weitz
(Submitted on 7 Oct 2013)

The Spanish government recently announced an official fast-track path to citizenship for any individual who is Jewish and whose ancestors were expelled from Spain during the inquisition-related dislocation of Spanish Jews in 1492. It would seem that this policy targets a small subset of the global Jewish population, i.e., restricted to individuals who retain cultural practices associated with ancestral origins in Spain. However, the central contribution of this manuscript is to demonstrate how and why the policy is far more likely to apply to a very large fraction (i.e., the vast majority) of Jews. This claim is supported using a series of genealogical models that include transmissible “identities” and preferential intra-group mating. Model analysis reveals that even when intra-group mating is strong and even if only a small subset of a present-day population retains cultural practices typically associated with those of an ancestral group, it is highly likely that nearly all members of that population have direct genealogical links to that ancestral group, given that a sufficient number of generations have elapsed. The basis for this conclusion is that not having a link to an ancestral group must be a property of all of an individual’s ancestors, the probability of which declines (nearly) superexponentially with each successive generation. These findings highlight unexpected incongruities induced by genealogical dynamics between present-day and ancestral identities.
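The superexponential decline can be seen in a back-of-the-envelope sketch. It assumes, as a deliberate simplification that is not the paper's model, that an individual has 2^g distinct ancestors g generations back and that each is independently linked to the ancestral group with probability `fraction_linked`. It ignores intra-group mating and shared ancestors. The function name and the example fraction of 1% are hypothetical.

```python
def prob_no_link(fraction_linked, generations):
    """Probability that none of the 2**generations ancestors is linked.

    Simplifying assumption: ancestors are distinct and independently linked
    with probability fraction_linked.
    """
    n_ancestors = 2 ** generations
    return (1.0 - fraction_linked) ** n_ancestors


# Even if only 1% of the ancestral-era population carries the link,
# the chance of having no linked ancestor collapses within a few generations.
for g in (0, 5, 10, 15):
    print(g, prob_no_link(0.01, g))
```

Because the exponent itself doubles every generation, the probability falls faster than any fixed exponential rate, which is the sense of "superexponentially" in the abstract.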

Neighbor Joining Plus – algorithm for phylogenetic tree reconstruction with proper nodes assignment

Piotr Plonski, Jan P. Radomski
(Submitted on 8 Oct 2013)

Most major algorithms for phylogenetic tree reconstruction assume that sequences in the analyzed set either do not have any offspring, or that parent sequences can mutate into at most two descendants. The graph resulting from such assumptions therefore forms a binary tree, with all the nodes labeled as leaves. However, these constraints are unduly restrictive, as there are numerous data sets with multiple offspring of the same ancestors. Here we propose a solution to analyze and visualize such sets in a more intuitive manner. The method reconstructs a phylogenetic tree by assigning the sequences with offspring as internal nodes, and the sequences without offspring as leaf nodes. In the resulting tree there is no constraint on the number of adjacent nodes, which means that the solution tree need not be binary. The subsequent derivation of evolutionary pathways and pair-wise mutations is then algorithmically straightforward, with each edge's length corresponding directly to the number of mutations. Other tree reconstruction algorithms can be extended in the proposed manner, to also give unbiased topologies.

A novel spectral method for inferring general selection from time series genetic data

Matthias Steinrücken, Anand Bhaskar, Yun S. Song
(Submitted on 3 Oct 2013)

Recently there has been growing interest in using time series genetic variation data, either from experimental evolution studies or ancient DNA samples, to make inference about evolutionary processes. While such temporal data can facilitate identifying genomic regions under selective pressure and estimating associated fitness parameters, it is a challenging problem to compute the likelihood of the underlying selection model given DNA samples obtained at several time points. Here, we develop an efficient algorithm to tackle this challenge. The key methodological advance in our work is the development of a novel spectral method to analytically and efficiently integrate over all trajectories of the population allele frequency between consecutive time points. This advance circumvents the limitations of existing methods which require fine-tuning the discretization of the allele frequency space to approximate certain integrals using numerical schemes. Furthermore, our method is flexible enough to handle general diploid models of selection where the heterozygote and homozygote fitness parameters can take any values, while previous methods focused on only a few restricted models of selection. We demonstrate the utility of our method on simulated data and apply the method to analyze time series ancient DNA data from genetic loci (ASIP and MC1R) associated with coat coloration in horses. In contrast to the conclusions of previous studies which considered only a few special selection schemes, our exploration of the full fitness parameter space reveals that balancing selection (in the form of heterozygote advantage) may have been acting on these loci.

Some mathematical tools for the Lenski experiment

Bernard Ycart (LJK), Agnès Hamon (LJK), Joël Gaffé (LAPM), Dominique Schneider (LAPM)
(Submitted on 2 Oct 2013)

The Lenski experiment is a long-term daily reproduction of Escherichia coli that has evidenced phenotypic and genetic evolution over the years. Some mathematical models that could be useful in understanding the results of that experiment are reviewed here: stochastic and deterministic growth, mutation appearance and fixation, competition of species.

Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible

Paul J. McMurdie, Susan Holmes
(Submitted on 1 Oct 2013)

The interpretation of count data originating from the current generation of DNA sequencing platforms requires special attention. In particular, the per-sample library sizes often vary by orders of magnitude from the same sequencing run, and the counts are overdispersed relative to a simple Poisson model. These challenges can be addressed using an appropriate mixture model that simultaneously accounts for library size differences and biological variability. This approach is already well-characterized and implemented for RNA-Seq data in R packages such as edgeR and DESeq.
We use statistical theory, extensive simulations, and empirical data to show that variance stabilizing normalization using a mixture model like the negative binomial is appropriate for microbiome count data. In simulations detecting differential abundance, normalization procedures based on a Gamma-Poisson mixture model provided systematic improvement in performance over crude proportions or rarefied counts — both of which led to a high rate of false positives. In simulations evaluating clustering accuracy, we found that the rarefying procedure discarded samples that were nevertheless accurately clustered by alternative methods, and that the choice of minimum library size threshold was critical in some settings, but with an optimum that is unknown in practice. Techniques that use variance stabilizing transformations by modeling microbiome count data with a mixture distribution, such as those implemented in edgeR and DESeq, substantially improved upon techniques that attempt to normalize by rarefying or crude proportions. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
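To make concrete what rarefying throws away, here is a minimal sketch of the procedure as it is commonly described: every library is randomly subsampled without replacement down to the smallest library size. This is an illustration only, not the phyloseq implementation; the function `rarefy`, the sample names, and the toy counts are hypothetical.

```python
import random


def rarefy(counts, depth, rng):
    """Subsample depth reads without replacement from a vector of taxon counts."""
    if depth > sum(counts):
        raise ValueError("library is smaller than the rarefying depth")
    # Expand the counts into one taxon label per read, then draw a subset.
    reads = [taxon for taxon, count in enumerate(counts) for _ in range(count)]
    rarefied = [0] * len(counts)
    for taxon in rng.sample(reads, depth):
        rarefied[taxon] += 1
    return rarefied


# Two toy libraries with identical proportions but a tenfold size difference.
libraries = {"sample_A": [900, 90, 10], "sample_B": [9000, 900, 100]}
depth = min(sum(counts) for counts in libraries.values())
rng = random.Random(0)
rarefied = {name: rarefy(counts, depth, rng) for name, counts in libraries.items()}

total_reads = sum(sum(counts) for counts in libraries.values())
discarded = total_reads - depth * len(libraries)
print(rarefied)
print("reads discarded:", discarded, "of", total_reads)
```

In this toy case most of the reads in the larger library are dropped, and its rarefied counts carry added sampling noise.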

Identical inferences about correlated evolution arise from ancestral state reconstruction and independent contrasts

Michael G. Elliot
(Submitted on 30 Sep 2013)

Inferences about the evolution of continuous traits based on reconstruction of ancestral states have often been considered more error-prone than analysis of independent contrasts. Here we show that both methods in fact yield identical estimators for the correlation coefficient and regression gradient of correlated traits, indicating that reconstructed ancestral states are a valid source of information about correlated evolution. We show that the independent contrast associated with a pair of sibling nodes on a phylogenetic tree can be expressed in terms of the maximum likelihood ancestral state function at those nodes and their common parent. This expression gives rise to novel formulae for independent contrasts for any model of evolution admitting of a local likelihood function. We thus derive new formulae for independent contrasts applicable to traits evolving under directional drift, and use simulated data to show that these directional contrasts provide better estimates of evolutionary model parameters than standard independent contrasts, when traits in fact evolve with a directional tendency.
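For readers unfamiliar with the quantities involved, the sketch below computes the standard independent contrast for a single pair of sibling nodes under Brownian motion, following the usual textbook formulas, together with the weighted-average estimate at their parent. It does not reproduce the new directional-drift formulae from the paper. The function name and the example values are hypothetical.

```python
import math


def sibling_contrast(x1, v1, x2, v2):
    """Standardized contrast and parent estimate for two sibling nodes.

    x1, x2: trait values at the siblings; v1, v2: their branch lengths
    (the variances accumulated under Brownian motion).
    """
    # Difference between siblings, scaled by its expected standard deviation.
    contrast = (x1 - x2) / math.sqrt(v1 + v2)
    # Parent value: average of the siblings weighted by inverse branch length.
    parent = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    # Extra length added to the parent's branch to reflect estimation error.
    extra_length = v1 * v2 / (v1 + v2)
    return contrast, parent, extra_length


contrast, parent, extra_length = sibling_contrast(3.0, 1.0, 1.0, 1.0)
print(contrast, parent, extra_length)
```

Applying this step repeatedly from the tips toward the root gives one contrast per internal node.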