Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle
Jared E. Decker, Stephanie D. McKay, Megan M. Rolf, JaeWoo Kim, Antonio Molina Alcalá, Tad S. Sonstegard, Olivier Hanotte, Anders Götherström, Christopher M. Seabury, Lisa Praharani, Masroor Ellahi Babar, Luciana Correia de Almeida Regitano, Mehmet Ali Yildiz, Michael P. Heaton, Wansheng Lui, Chu-Zhao Lei, James M. Reecy, Muhammad Saif-Ur-Rehman, Robert D. Schnabel, Jeremy F. Taylor
(Submitted on 19 Sep 2013)

The domestication and development of cattle has considerably impacted human societies, but the histories of cattle breeds have been poorly understood especially for African, Asian, and American breeds. Using genotypes from 43,043 autosomal single nucleotide polymorphism markers scored in 1,543 animals, we evaluate the population structure of 134 domesticated bovid breeds. Regardless of the analytical method or sample subset, the three major groups of Asian indicine, Eurasian taurine, and African taurine were consistently observed. Patterns of geographic dispersal resulting from co-migration with humans and exportation are recognizable in phylogenetic networks. All analytical methods reveal patterns of hybridization which occurred after divergence. Using 19 breeds, we map the cline of indicine introgression into Africa. We infer that African taurine possess a large portion of wild African auroch ancestry, causing their divergence from Eurasian taurine. We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia. We also identify the influence of species other than Bos taurus in the formation of Asian breeds. We detect the pronounced influence of Shorthorn cattle in the formation of European breeds. Iberian and Italian cattle possess introgression from African taurine. American Criollo cattle are shown to be of Iberian, and not African, decent. Indicine introgression into American cattle occurred in the Americas, and not Europe. We argue that cattle migration, movement and trading followed by admixture have been important forces in shaping modern bovine genomic variation.

Change point analysis of histone modifications reveals epigenetic blocks with distinct regulatory activity and biological functions

Change point analysis of histone modifications reveals epigenetic blocks with distinct regulatory activity and biological functions
Mengjie Chen, Haifan Lin, Hongyu Zhao
(Submitted on 20 Sep 2013)

Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume one combination of histone modification invariantly translates to one transcriptional output regardless of local chromatin environment. In this study we hypothesize that, the genome is organized into local domains that manifest similar enrichment pattern of histone modification, which leads to orchestrated regulation of expression of genes with relevant biological functions. We propose a multivariate Bayesian Change Point (BCP) model to segment the Drosophila melanogaster genome into consecutive blocks on the basis of combinatorial patterns of histone marks. By modeling the sparse distribution of histone marks across the chromosome with a zero-inflated Gaussian mixture, our partitions capture local BLOCKs manifest relatively homogeneous enrichment pattern of histone modifications. We further characterized BLOCKs by their transcription levels, distribution of genes, binding profiles of a broad panel of chromatin proteins, degree of co-expression and GO enrichment. Our results demonstrate that these blocks, although inferred merely from histone modifications, reveal strong relevance with transcription events and chromatin organization, which suggest their important roles in coordinated gene regulation.

A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics
Sean Whalen, Gaurav Pandey
(Submitted on 19 Sep 2013)

The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.

The identifiability of piecewise demographic models from the sample frequency spectrum

The identifiability of piecewise demographic models from the sample frequency spectrum
Anand Bhaskar, Yun S. Song
(Submitted on 19 Sep 2013)

The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different demographic models can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this non-identifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant and piecewise-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the “folded” SFS, which is often used when there is ambiguity as to which allelic type is ancestral.

The effect of paternal age on offspring intelligence and personality when controlling for paternal trait level

The effect of paternal age on offspring intelligence and personality when controlling for paternal trait level

Ruben C. Arslan, Lars Penke, Wendy Johnson, William G. Iacono, Matt McGue
(Submitted on 18 Sep 2013)

Paternal age at conception has been found to predict the number of new genetic mutations. We examined the effect of father’s age at birth on offspring intelligence, head circumference and personality traits. Using the Minnesota Twin Family Study sample we tested paternal age effects while controlling for parents’ trait levels measured with the same precision as offspring’s. From evolutionary genetic considerations we predicted a negative effect of paternal age on offspring intelligence, but not on other traits. Controlling for parental IQ had the effect of turning a positive-zero order association negative. We found paternal age effects on offspring IQ and MPQ Absorption, but they were not robustly significant, nor replicable with additional covariates. No other noteworthy effects were found. Parents’ intelligence and personality correlated with their ages at twin birth, which may have obscured a small negative effect of advanced paternal age (< 1% of variance explained) on intelligence. We discuss future avenues for studies of paternal age effects and suggest that stronger research designs are needed to rule out confounding factors involving birth order and the Flynn effect.

Accurate Computation of Survival Statistics in Genome-wide Studies

Accurate Computation of Survival Statistics in Genome-wide Studies
Fabio Vandin, Alexandra Papoutsaki, Benjamin J. Raphael, Eli Upfal
(Submitted on 17 Sep 2013)

A key challenge in genomics is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications. This is because: the two populations determined by a genetic variant may have very different sizes; and the evaluation of many possible variants demands highly accurate computation of very small p-values. We demonstrate this problem for cancer genomics data where the standard log-rank test leads to many false positive associations between somatic mutations and survival time. We develop and analyze a novel algorithm, Exact Log-rank Test (ExaLT), that accurately computes the p-value of the log-rank statistic under an exact distribution that is appropriate for any size populations. We demonstrate the advantages of ExaLT on data from published cancer genomics studies, finding significant differences from the reported p-values. We analyze somatic mutations in six cancer types from The Cancer Genome Atlas (TCGA), finding mutations with known association to survival as well as several novel associations. In contrast, standard implementations of the log-rank test report dozens-hundreds of likely false positive associations as more significant than these known associations.

Watterson estimators for Next Generation Sequencing: from trios to autopolyploids

Watterson estimators for Next Generation Sequencing: from trios to autopolyploids
Luca Ferretti, Sebástian E. Ramos-Onsins
(Submitted on 17 Sep 2013)

Several variation of the Watterson estimator of variability for Next Generation Sequencing (NGS) data have been proposed in the literature. We present a unified framework for generalized Watterson estimators based on Maximum Composite Likelihood, which encompasses most of the existing estimators. We propose this class of unbiased estimators as generalized Watterson estimators for a large class of NGS data, including pools and trios. We also discuss the relation with the estimators that have been proposed in the literature and show that they admit two equivalent but seemingly different forms, deriving a set of combinatorial identities as a byproduct. Finally, we give a detailed treatment of Watterson estimators for single or multiple autopolyploid individuals.

Author Post: The Population Genetic Signature of Polygenic Local Adaptation

This guest post is by Jeremy Berg [@JeremyJBerg] and Graham Coop [@Graham_coop] on their paper The Population Genetic Signature of Polygenic Local Adaptation arXived here

The field of population genetics has devoted a lot time to identifying signals of adaptation. These tests are usually predicated on the fact that local adaptation can drive large allele frequency changes between populations. However, we’ve known for almost a century that many traits are highly polygenic, so that adaptation can occur through subtle shifts in allele frequencies at many loci. Until now we’ve been unable to detect such signals, but genome-wide association studies (GWAS) now give us a way of potentially learning about selection on quantitative traits from population genetic data. In this paper we develop a set of approaches to do this in a robust population genetic framework.

GWAS usually assume a simple additive model, i.e. no epistasis/dominance, to test for and estimate effect sizes for a genome-wide set of loci. To test whether local adaptation has shaped the genetic basis of the trait, we do the perhaps boneheaded thing of taking the GWAS results at face value. For each population we simply sum up the product of the frequency at each GWAS SNP and the effect size of that SNP. This gives us an estimate of the mean additive genetic value for the phenotype in each population. This is not the mean phenotype of the population as it ignores the fact that we don’t know all the variants affecting our trait; environmental change across populations, gene by environment interactions, and changes in allele frequencies that have altered the dominance and epistatic relationships between alleles (i.e. all that good stuff that makes life interesting). However, these additive genetic values do have the very useful property that they are simple linear functions of the allele frequencies, which means that we can construct a simple and robust model of genetic drift causing these phenotypes to diverge across populations.


In Figure A we show our estimated genetic values using the human height GWAS of Lango Allen et al (2010). As you can see, populations show deviations around the global mean genetic value, and populations from the same geographic regions covary somewhat in the deviation they take, reflecting the fact that allele frequencies at each GWAS locus tend to covary in their shared genetic drift due to population history and migration. For example in Figure B we show allele frequencies at one of the GWAS height loci.


We can approximately model the allele frequencies at a single locus by assuming that they are multivariate normally distributed around the global mean. The covariance matrix of this distribution is given by a matrix closely related to the kinship matrix of our populations, which can be calculated from a genome-wide sample of putatively neutral loci. As our vector of phenotypic genetic values across populations is simply a weighted sum of the individual allele frequencies, our vector of genetic values is also follows a multivariate Normal distribution. Given that we are summing up lot of loci, even if the multivariate normal model is a poor approximation to drift at one locus, the central limit theorem suggests that it should still be a good fit to the distribution of the genetic values.

This simple neutral model framework, based on multivariate normal distributions, gives us a strong framework to develop tests of selection. Our most basic test is a test for the over-dispersion of the variance of genetic values (i.e. too great an among population variance, once population structure has been accounted for). We also develop a test for an environmental correlations and a way to identify outlier populations and regions to further understand the signal of local adaptation.

We apply our tests to six different GWAS datasets using the HGDP as our set of populations. Our tests reveal wide-spread evidence of selection shaping polygenic traits across populations, although many of the signals are quite subtle. Somewhat surprisingly, we find little evidence for selection on the loci involved in Type 2 diabetes, somewhat of a poster-child for adaptation shaping the genetic basis of a disease thanks to the thrifty gene hypothesis.

We think our approach is a promising way forward to look for selection on the genetic basis of quantitative traits as view by GWAS. However, it also highlights some concerns. In developing our tests we found that we had developed a set of methods that already have equivalents in the quantative trait community– in particular QST, a phenotypic analogy of FST (and its extensions by a number of authors). This raises the question of whether in systems where common garden experiments are possible there is a need to do GWAS if we are only interested in how local adaptation has shaped traits, or if QST style approaches are the best that one can do. We do think that there is much more that could be learnt by our style of approach, but it should also give researchers pause to consider why they want to “find the genes” for local adaptation.

We’ve already gotten some very helpful comments via Haldane’s sieve. We’d love more comments, particularly about points of confusion that could be clarified, other datasets that might be good to apply this to, or other applications we could develop.

A Gene Regulatory Model of heterosis and speciation

A Gene Regulatory Model of heterosis and speciation
Peter M. F. Emmrich, Vera Pancaldi, Hannah E. Roberts, Krystyna A. Kelly, David C. Baulcombe
(Submitted on 15 Sep 2013)

Crossing individuals from genetically distinct populations often results in improvements in quantitative traits, such as growth rate, biomass production and stress resistance; this phenomenon is known as heterosis. We have taken a computational approach to explore the mechanisms underlying heterosis, developing a simulation of evolution and hybridization of Gene Regulatory Networks (GRNs) in a Boolean framework. These artificial regulatory networks exhibit biologically realistic topological properties and fitness is measured as the ability of a network to respond to external inputs in the correct way. Our model reproduced experimental observations from the literature on heterosis using only biologically meaningful parameters, such as mutation rates. Hybrid vigor was observed, its extent was seen to increase as parental populations diverged until it collapses when the two populations have become incompatible. Thus, the model also describes a process of speciation and links it to collapsing hybrid fitness due to genetic incompatibility of the separated populations. We also reproduce for the first time in a model the fact that hybrid vigor cannot easily be fixed by crossing hybrids, which is currently an important drawback of the use of hybrid crops. The simulation allows us to study the effects of three standard models for the genetic basis of heterosis, dominance, over-dominance, and epistasis. In our simulation over-dominance is the main factor contributing to hybrid vigour, whereas under-dominance and epistatic incompatibility are responsible for the fitness collapse. As the parental populations diverge, a single mutation can determine an almost sudden incompatibility leading to low fitness hybrids.

Universality and predictability in the evolution of molecular quantitative traits

Universality and predictability in the evolution of molecular quantitative traits
Armita Nourmohammad, Torsten Held, Michael Lässig
(Submitted on 12 Sep 2013)

Molecular traits, such as gene expression levels or protein binding affinities, are increasingly accessible to quantitative measurement by modern high-throughput techniques. Such traits measure molecular functions and, from an evolutionary point of view, are important as targets of natural selection. Here we discuss recent developments in the evolutionary theory of quantitative traits that reach beyond classical quantitative genetics. We focus on universal evolutionary characteristics: these are largely independent of a trait’s genetic basis, which is often at least partially unknown. We show that universal measurements can be used to infer selection on a quantitative trait, which determines its evolutionary mode of conservation or adaptation. Furthermore, universality is closely linked to predictability of trait evolution across lineages. We argue that universal trait statistics extends over a range of cellular scales and opens new avenues of quantitative evolutionary systems biology.