General triallelic frequency spectrum under demographic models with variable population size

Paul A. Jenkins, Jonas W. Mueller, Yun S. Song
(Submitted on 13 Oct 2013)

It is becoming routine to obtain datasets on DNA sequence variation across several thousands of chromosomes, providing an unprecedented opportunity to infer the underlying biological and demographic forces. Such data make it vital to study summary statistics which offer enough compression to be tractable, while preserving a great deal of information. One well-studied summary is the site frequency spectrum: the empirical distribution, across segregating sites, of the sample frequency of the derived allele. However, most previous theoretical work has assumed that each site has experienced at most one mutation event in its genealogical history, which becomes less tenable for very large sample sizes. In this work we obtain, in closed form, the predicted frequency spectrum of a site that has experienced at most two mutation events, under very general assumptions about the distribution of branch lengths in the underlying coalescent tree. Among other applications, we obtain the frequency spectrum of a triallelic site in a model of historically varying population size. We demonstrate the utility of our formulas in two settings: First, we show that triallelic sites are more sensitive to the parameters of a population that has experienced historical growth, suggesting that they will have use if they can be incorporated into demographic inference. Second, we investigate a recently proposed alternative mechanism of mutation in which the two derived alleles of a triallelic site are created simultaneously within a single individual, and we develop a test to determine whether it is responsible for the excess of triallelic sites in the human genome.
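To make the definition of the site frequency spectrum concrete, here is a minimal sketch for the simpler biallelic case (not the triallelic spectrum derived in the paper). The function name and the 0/1 input encoding (0 = ancestral, 1 = derived; rows are chromosomes, columns are sites) are assumptions made for illustration.

```python
import numpy as np

def site_frequency_spectrum(haplotypes):
    """Unfolded biallelic SFS from a 0/1 matrix (rows = chromosomes, columns = sites).

    Returns an array of length n - 1 whose entry i - 1 is the number of
    segregating sites at which the derived allele appears in exactly i of
    the n sampled chromosomes. (Hypothetical helper, for illustration only.)
    """
    haplotypes = np.asarray(haplotypes, dtype=int)
    n = haplotypes.shape[0]
    derived_counts = haplotypes.sum(axis=0)
    # Keep only segregating sites: derived allele neither absent nor fixed.
    segregating = derived_counts[(derived_counts > 0) & (derived_counts < n)]
    # bincount index i holds the number of sites with i derived copies.
    return np.bincount(segregating, minlength=n)[1:n]

# Four chromosomes, four sites; only the first and third sites segregate.
example = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
sfs = site_frequency_spectrum(example)
```

Dividing the returned counts by their sum gives the empirical distribution described in the abstract.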

Non-identifiability of identity coefficients at biallelic loci

Miklós Csűrös
(Submitted on 13 Oct 2013)

Shared genealogies introduce allele dependencies in diploid genotypes, as alleles within an individual or between different individuals will likely match when they originate from a recent common ancestor. At a locus shared by a pair of diploid individuals, there are nine combinatorially distinct modes of identity-by-descent (IBD), capturing all possible combinations of coancestry and inbreeding. A distribution over the IBD modes is described by the nine associated probabilities, known as (Jacquard’s) identity coefficients. The genetic relatedness between two individuals can be succinctly characterized by the identity coefficients corresponding to the joint genealogy. The identity coefficients (together with allele frequencies) determine the distribution of joint genotypes at a locus. At a locus with two possible alleles, identity coefficients are not identifiable because different coefficients can generate the same genotype distribution.
We analyze precisely how different IBD modes combine into identical genotype distributions at diallelic loci. In particular, we describe IBD mode mixtures that result in identical genotype distributions at all allele frequencies, implying the non-identifiability of the identity coefficients from independent loci. Our analysis yields an exhaustive characterization of relatedness statistics that are always identifiable. Importantly, we show that identifiable relatedness statistics include the kinship coefficient (probability that a random pair of alleles are identical by descent between individuals) and inbreeding-related measures, which can thus be estimated from genotype distributions at independent loci.
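As an illustration of how the kinship coefficient relates to the nine identity coefficients, the sketch below uses the standard textbook relation, assuming Jacquard's usual ordering of the condensed coefficients Δ1 to Δ9 (Δ1 to Δ6 involve inbreeding in at least one individual; Δ7, Δ8, Δ9 correspond to sharing two, one, or zero alleles IBD between non-inbred individuals). The function name is hypothetical and the formula is not taken from the paper itself.

```python
def kinship_from_jacquard(delta):
    """Kinship coefficient from the nine condensed identity coefficients.

    Assumes `delta` is (Delta_1, ..., Delta_9) in Jacquard's standard order
    and uses the textbook relation
        phi = Delta_1 + (Delta_3 + Delta_5 + Delta_7) / 2 + Delta_8 / 4.
    """
    if len(delta) != 9:
        raise ValueError("expected nine identity coefficients")
    if abs(sum(delta) - 1.0) > 1e-9:
        raise ValueError("identity coefficients must sum to 1")
    d1, d2, d3, d4, d5, d6, d7, d8, d9 = delta
    return d1 + 0.5 * (d3 + d5 + d7) + 0.25 * d8

# Non-inbred full siblings: Delta_7 = 1/4, Delta_8 = 1/2, Delta_9 = 1/4.
full_sibs = kinship_from_jacquard((0, 0, 0, 0, 0, 0, 0.25, 0.5, 0.25))
# Non-inbred parent and offspring: exactly one allele shared IBD.
parent_offspring = kinship_from_jacquard((0, 0, 0, 0, 0, 0, 0, 1, 0))
```

Both examples give a kinship of 1/4 from different coefficient vectors, a small reminder that a single summary statistic can be shared by distinct IBD mode distributions.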

forqs: Forward-in-time Simulation of Recombination, Quantitative Traits, and Selection

Darren Kessner, John Novembre
(Submitted on 11 Oct 2013)

forqs is a forward-in-time simulation of recombination, quantitative traits, and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits. forqs is implemented as a command-line C++ program. Source code and binary executables for Linux, OSX, and Windows are freely available under a permissive BSD license.

The Fossilized Birth-Death Process: A Coherent Model of Fossil Calibration for Divergence Time Estimation

Tracy A. Heath, John P. Huelsenbeck, Tanja Stadler
(Submitted on 10 Oct 2013)

Time-calibrated species phylogenies are critical for addressing a wide range of questions in evolutionary biology, such as those that elucidate historical biogeography or uncover patterns of coevolution and diversification. Because molecular sequence data are not informative on absolute time, external data, most commonly fossil age estimates, are required to calibrate estimates of species divergence dates. For Bayesian divergence-time methods, the common practice for calibration using fossil information involves placing arbitrarily chosen parametric distributions on internal nodes, often disregarding most of the information in the fossil record. We introduce the ‘fossilized birth-death’ (FBD) process, a model for calibrating divergence-time estimates in a Bayesian framework, explicitly acknowledging that extant species and fossils are part of the same macroevolutionary process. Under this model, absolute node age estimates are calibrated by a single diversification model and arbitrary calibration densities are not necessary. Moreover, the FBD model allows for inclusion of all available fossils. We performed analyses of simulated data and show that node-age estimation under the FBD model results in robust and accurate estimates of species divergence times with realistic measures of statistical uncertainty, overcoming major limitations of standard divergence time estimation methods. We then used this model to estimate the speciation times for a dataset composed of all living bears, indicating that the genus Ursus diversified in the late Miocene to mid Pliocene.

Routes for breaching and protecting genetic privacy

Yaniv Erlich, Arvind Narayanan
(Submitted on 11 Oct 2013)

We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats to genetic privacy and discuss potential mitigation strategies for privacy-preserving dissemination of genetic data.

Application of compressed sensing to genome wide association studies and genomic selection

Shashaank Vattikuti, James J. Lee, Stephen D. H. Hsu, Carson C. Chow
(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h² = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h² = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
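The sparse-recovery idea can be sketched with L1-penalized regression (the lasso), a common compressed-sensing estimator, solved here by plain iterative soft-thresholding (ISTA). This is an illustrative toy under stated assumptions, not the authors' pipeline: the design matrix is Gaussian rather than real genotypes, the phenotype is noise-free (the h² = 1 case), and the sizes, penalty, and selection threshold are arbitrary choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of the L1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1 / 2n) * ||y - X b||^2 + lam * ||b||_1 by ISTA."""
    n, p = X.shape
    # Step size 1 / L, where L = ||X||_2^2 / n bounds the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy setting: many more markers (p) than individuals (n), few causal loci.
rng = np.random.default_rng(0)
n, p = 100, 300
X = rng.standard_normal((n, p))          # assumption: Gaussian stand-in for genotypes
true_loci = np.array([10, 120, 250])     # hypothetical causal marker indices
true_beta = np.zeros(p)
true_beta[true_loci] = [1.0, -1.0, 1.0]
y = X @ true_beta                        # noise-free phenotype (h^2 = 1)

beta_hat = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(beta_hat) > 0.5)  # arbitrary selection threshold
```

Repeating the experiment while shrinking `n` toward the number of causal loci is one way to see selection break down as the sample size falls.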

IQRray, a new method for Affymetrix microarray quality control, and the homologous organ conservation score, a new benchmark method for quality control metrics

Marta Rosikiewicz, Marc Robinson-Rechavi
(Submitted on 8 Oct 2013)

Motivation: Microarray results accumulated in public repositories are widely re-used in meta-analytical studies and secondary databases. The quality of the data obtained with this technology varies from experiment to experiment, and an efficient method for quality assessment is necessary to ensure their reliability. Results: The lack of a good benchmark has hampered evaluation of existing methods for quality control. In this study we propose a new independent quality metric that is based on evolutionary conservation of expression profiles. We show, using 11 large organ-specific datasets, that IQRray, a new quality metric developed by us, exhibits the highest correlation with this reference metric, among 14 metrics tested. IQRray outperforms other methods in identification of poor quality arrays in a dataset composed of arrays from many independent experiments. In contrast, the performance of methods designed for detecting outliers in a single experiment, like NUSE and RLE, was low because of the inability of these methods to detect datasets containing only low quality arrays, and the fact that the scores cannot be directly compared between experiments. Availability: The R implementation of IQRray is available at: this ftp URL

Let my people go (home) to Spain: a genealogical model of Jewish identities since 1492

Joshua S. Weitz
(Submitted on 7 Oct 2013)

The Spanish government recently announced an official fast-track path to citizenship for any individual who is Jewish and whose ancestors were expelled from Spain during the inquisition-related dislocation of Spanish Jews in 1492. It would seem that this policy targets a small subset of the global Jewish population, i.e., restricted to individuals who retain cultural practices associated with ancestral origins in Spain. However, the central contribution of this manuscript is to demonstrate how and why the policy is far more likely to apply to a very large fraction (i.e., the vast majority) of Jews. This claim is supported using a series of genealogical models that include transmissible “identities” and preferential intra-group mating. Model analysis reveals that even when intra-group mating is strong and even if only a small subset of a present-day population retains cultural practices typically associated with that of an ancestral group, it is highly likely that nearly all members of that population have direct genealogical links to that ancestral group, given that a sufficient number of generations have elapsed. The basis for this conclusion is that not having a link to an ancestral group must be a property of all of an individual’s ancestors, the probability of which declines (nearly) superexponentially with each successive generation. These findings highlight unexpected incongruities induced by genealogical dynamics between present-day and ancestral identities.
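The superexponential decline can be illustrated with a deliberately crude calculation. The sketch below assumes that an individual has 2^g distinct ancestors g generations back and that each is independently linked to the ancestral group with probability p. It ignores pedigree collapse and the preferential intra-group mating that the manuscript's models include, so it conveys only the shape of the argument; the function name and parameters are hypothetical.

```python
def prob_no_ancestral_link(p_linked, generations):
    """Probability that none of an individual's ancestors is linked to the group.

    Simplifying assumptions (not the paper's model): 2**generations distinct
    ancestors, each independently linked with probability p_linked.
    """
    n_ancestors = 2 ** generations
    return (1.0 - p_linked) ** n_ancestors

# Even if only 1% of the ancestral population belonged to the group,
# the chance of having no link at all collapses within a few generations.
for g in (0, 5, 10):
    print(g, prob_no_ancestral_link(0.01, g))
```

Because the exponent doubles every generation, the probability of having no link falls faster than any fixed exponential rate.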

Neighbor Joining Plus – algorithm for phylogenetic tree reconstruction with proper nodes assignment

Piotr Plonski, Jan P. Radomski
(Submitted on 8 Oct 2013)

Most major algorithms for phylogenetic tree reconstruction assume that sequences in the analyzed set either do not have any offspring, or that parent sequences can mutate into at most two descendants. The graph resulting from such assumptions therefore forms a binary tree, with all the nodes labeled as leaves. However, these constraints are unduly restrictive as there are numerous data sets with multiple offspring of the same ancestors. Here we propose a solution to analyze and visualize such sets in a more intuitive manner. The method reconstructs a phylogenetic tree by assigning the sequences with offspring as internal nodes, and the sequences without offspring as leaf nodes. In the resulting tree there is no constraint on the number of adjacent nodes, which means that the solution tree need not be binary. The subsequent derivation of evolutionary pathways and pair-wise mutations is then algorithmically straightforward, with each edge’s length corresponding directly to the number of mutations. Other tree reconstruction algorithms can be extended in the proposed manner, to also give unbiased topologies.

A novel spectral method for inferring general selection from time series genetic data

Matthias Steinrücken, Anand Bhaskar, Yun S. Song
(Submitted on 3 Oct 2013)

Recently there has been growing interest in using time series genetic variation data, either from experimental evolution studies or ancient DNA samples, to make inference about evolutionary processes. While such temporal data can facilitate identifying genomic regions under selective pressure and estimating associated fitness parameters, it is a challenging problem to compute the likelihood of the underlying selection model given DNA samples obtained at several time points. Here, we develop an efficient algorithm to tackle this challenge. The key methodological advance in our work is the development of a novel spectral method to analytically and efficiently integrate over all trajectories of the population allele frequency between consecutive time points. This advance circumvents the limitations of existing methods which require fine-tuning the discretization of the allele frequency space to approximate certain integrals using numerical schemes. Furthermore, our method is flexible enough to handle general diploid models of selection where the heterozygote and homozygote fitness parameters can take any values, while previous methods focused on only a few restricted models of selection. We demonstrate the utility of our method on simulated data and apply the method to analyze time series ancient DNA data from genetic loci (ASIP and MC1R) associated with coat coloration in horses. In contrast to the conclusions of previous studies which considered only a few special selection schemes, our exploration of the full fitness parameter space reveals that balancing selection (in the form of heterozygote advantage) may have been acting on these loci.