How to infer relative fitness from a sample of genomic sequences

Adel Dayarian, Boris I Shraiman
(Submitted on 29 Aug 2012)

Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman’s coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we shall demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness-diverse, but otherwise unstructured asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico using simulations of a Wright-Fisher model for a realistic range of mutation rates and selection strength. Our inferred fitness ranking is based on a linear discriminator which identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with the actual fitness, with the top 10% ranked being in the top 20% fittest, with a false discovery rate of 0.1-0.3 depending on the mutation/selection parameters. The ranking also makes it possible to predict the common genotype of the future population. While the inference accuracy increases monotonically with sample size, sample sizes of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks.
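
For readers unfamiliar with the kind of simulation mentioned in the abstract, here is a minimal sketch of an asexual Wright-Fisher model with selection. It is not the authors' code: the function name `wright_fisher`, the parameter values, and the simplification that every mutation is beneficial with the same fixed effect `s` are all assumptions made for illustration.

```python
import random


def wright_fisher(pop_size=200, generations=100, mut_rate=0.01, s=0.05, seed=0):
    """Toy asexual Wright-Fisher simulation (illustrative assumptions only).

    Each individual is represented by the number of beneficial mutations
    it carries; its fitness is (1 + s) ** count.
    Returns the list of mutation counts in the final generation.
    """
    rng = random.Random(seed)
    pop = [0] * pop_size
    for _ in range(generations):
        # Selection and drift: parents are drawn with replacement,
        # with probability proportional to their fitness.
        weights = [(1.0 + s) ** k for k in pop]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        # Mutation: each offspring gains one mutation with probability mut_rate.
        pop = [k + 1 if rng.random() < mut_rate else k for k in parents]
    return pop


if __name__ == "__main__":
    final = wright_fisher()
    print("mean mutation count:", sum(final) / len(final))
```

The population size stays constant by construction, so fitness diversity comes only from the spread in mutation counts among individuals.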

The impact of deleterious passenger mutations on cancer progression. (arXiv:1208.6068v1 [q-bio.PE])
by Christopher D McFarland, Gregory V Kryukov, Shamil Sunyaev, Leonid Mirny

Cancer progression is driven by a small number of genetic alterations accumulating in a neoplasm. These few driver alterations reside in a cancer genome alongside tens of thousands of other mutations that are widely believed to have no role in cancer and are termed passengers. Many passengers, however, fall within protein coding genes and other functional elements and can possibly have deleterious effects on cancer cells. Here we investigate the potential of mildly deleterious passengers to accumulate and alter the course of neoplastic progression. Our approach combines evolutionary simulations of cancer progression with the analysis of cancer sequencing data. In our simulations, individual cells stochastically divide, acquire advantageous driver and deleterious passenger mutations, or die. Surprisingly, despite selection against them, passengers accumulate and largely evade selection during progression. Although individually weak, the collective burden of passengers alters the course of progression leading to several phenomena observed in oncology that cannot be explained by a traditional driver-centric view. We tested predictions of the model using cancer genomic data. We find that many passenger mutations are likely to be damaging and that, in agreement with the model, they have largely evaded purifying selection. Finally, we used our model to explore cancer treatments that exploit the load of passengers by either 1) increasing the mutation rate; or 2) exacerbating their deleterious effects. While both approaches lead to cancer regression, the latter leads to less frequent relapse. Our results suggest a new framework for understanding cancer progression as a balance of driver and passenger mutations.
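
The simulation scheme described in the abstract (cells that stochastically divide, mutate, or die) can be sketched roughly as follows. This is a toy discrete-time version written for illustration, not the authors' model: the function name `simulate_tumor`, the density-dependent death rule, the multiplicative fitness form, and every parameter value are assumptions.

```python
import random


def simulate_tumor(steps=50, n0=100, capacity=1000,
                   s_driver=0.1, s_passenger=0.001,
                   mu_driver=1e-3, mu_passenger=0.05, seed=0):
    """Toy birth-death simulation with drivers and passengers (assumed rules).

    Each cell is a (drivers, passengers) tuple. Returns the surviving cells.
    """
    rng = random.Random(seed)
    cells = [(0, 0)] * n0
    for _ in range(steps):
        n = len(cells)
        if n == 0:
            break  # the population went extinct
        # Assumed rule: death probability grows with population size.
        death_prob = min(1.0, 0.5 * n / capacity)
        next_cells = []
        for drivers, passengers in cells:
            if rng.random() < death_prob:
                continue  # this cell dies
            next_cells.append((drivers, passengers))
            # Assumed rule: drivers raise and passengers lower the division probability.
            birth_prob = min(
                1.0,
                0.5 * (1 + s_driver) ** drivers / (1 + s_passenger) ** passengers,
            )
            if rng.random() < birth_prob:
                # The daughter cell may pick up one new driver and/or passenger.
                d = drivers + (1 if rng.random() < mu_driver else 0)
                p = passengers + (1 if rng.random() < mu_passenger else 0)
                next_cells.append((d, p))
        cells = next_cells
    return cells


if __name__ == "__main__":
    cells = simulate_tumor()
    if cells:
        print("cells:", len(cells))
        print("mean passengers:", sum(p for _, p in cells) / len(cells))
```

Because passengers are assumed to arrive far more often than drivers, they can pile up even though each one slightly lowers the division probability, which is the qualitative effect the abstract describes.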

The genetic prehistory of southern Africa

Joseph K. Pickrell, Nick Patterson, Chiara Barbieri, Falko Berthold, Linda Gerlach, Mark Lipson, Po-Ru Loh, Tom Güldemann, Blesswell Kure, Sununguko Wata Mpoloka, Hirosi Nakagawa, Christfried Naumann, Joanna L. Mountain, Carlos D. Bustamante, Bonnie Berger, Brenna M. Henn, Mark Stoneking, David Reich, Brigitte Pakendorf
(Submitted on 23 Jul 2012)

The hunter-gatherer populations of southern and eastern Africa are known to harbor some of the most ancient human lineages, but their historical relationships are poorly understood. We report data from 22 populations analyzed at over half a million single nucleotide polymorphisms (SNPs), using a genome-wide array designed for studies of history. The southern Africans, here called Khoisan, fall into two groups, loosely corresponding to the northwestern and southeastern Kalahari, which we show separated within the last 30,000 years. All individuals derive at least a few percent of their genomes from admixture with non-Khoisan populations that began 1,200 years ago. In addition, the Hadza, an east African hunter-gatherer population that speaks a language with click consonants, derive about a quarter of their ancestry from admixture with a population related to the Khoisan, implying an ancient genetic link between southern and eastern Africa.

The geography of recent genetic ancestry across Europe

Peter Ralph, Graham Coop
(Submitted on 16 Jul 2012 (v1), last revised 19 Jul 2012 (this version, v2))

The recent genealogical history of human populations is a complex mosaic formed by individual migration, large-scale population movements, and other demographic events. Population genomics datasets can provide a window into this recent history, as rare traces of recent shared genetic ancestry are detectable due to long segments of shared genomic material. We make use of genomic data for 2,257 Europeans (the POPRES dataset) to conduct one of the first surveys of recent genealogical ancestry over the past three thousand years at a continental scale. We detected 1.9 million shared genomic segments, and used the lengths of these to infer the distribution of shared ancestors across time and geography. We find that a pair of modern Europeans living in neighboring populations share around 10-50 genetic common ancestors from the last 1500 years, and upwards of 500 genetic ancestors from the previous 1000 years. These numbers drop off exponentially with geographic distance, but since genetic ancestry is rare, individuals from opposite ends of Europe are still expected to share millions of common genealogical ancestors over the last 1000 years. There is substantial regional variation in the number of shared genetic ancestors: especially high numbers of common ancestors between many eastern populations likely date to the Slavic and/or Hunnic expansions, while much lower levels of common ancestry in the Italian and Iberian peninsulas may indicate weaker demographic effects of Germanic expansions into these areas and/or more stably structured populations. Recent shared ancestry in modern Europeans is ubiquitous, and clearly shows the impact of both small-scale migration and large historical events. Population genomic datasets have considerable power to uncover recent demographic history, and will allow a much fuller picture of the close genealogical kinship of individuals across the world.
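
To see why segment lengths carry information about time, a common back-of-the-envelope approximation (not necessarily the estimator used in the paper) is that a segment inherited by two people from a common ancestor `t` generations ago has passed through about `2t` meioses, so its length is roughly exponential with mean `100 / (2t)` centimorgans. The function names and the 30-year generation time below are assumptions for illustration.

```python
def mean_ibd_length_cm(generations_ago):
    """Approximate mean length (in cM) of a segment shared through a
    common ancestor living `generations_ago` generations back.

    Uses the standard approximation of 2 * t meioses on the path between
    the two descendants, giving a mean of 100 / (2 * t) centimorgans.
    """
    if generations_ago <= 0:
        raise ValueError("generations_ago must be positive")
    return 100.0 / (2 * generations_ago)


def years_to_generations(years, generation_years=30.0):
    """Convert years to generations; the 30-year default is an assumption."""
    return years / generation_years


if __name__ == "__main__":
    for years in (500, 1500, 2500):
        t = years_to_generations(years)
        print(years, "years ago:", round(mean_ibd_length_cm(t), 2), "cM on average")
```

Under these assumptions, ancestors from about 1500 years ago leave shared segments averaging around 1 cM, which is why older shared ancestry shows up as many short segments rather than a few long ones.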

Our paper: Blood ties: ABO is a trans-species polymorphism in primates

[This author post is by Laure Ségurel [a postdoc in the Przeworski Lab] on the paper Blood ties: ABO is a trans-species polymorphism in primates, posted on the arXiv here]

The mysteries of the ABO blood group were first brought to our attention by Carole Ober. When we started working on it, we were mostly surprised by how little was known about the function of such a heavily studied gene and such an important clinical phenotype. Indeed, the expression of A, B and/or O antigens at the surface of some cells is a polymorphic phenotype shared by species as diverse as macaques and baboons in Africa, gibbons in Asia, squirrel monkeys in the Americas and, of course, humans throughout most of the world. Yet many questions remain unanswered, such as: What is the biological role of ABO in different cell types? Why did Hominoids evolve toward its expression on blood cells whereas other primates express it only on epithelial/endothelial cells? Why is the O allele at such high frequency only in humans? What are the selective agents responsible for the maintenance of this polymorphism? And why did chimpanzees and bonobos apparently lose the polymorphism?

One question that we became interested in answering with population genetic tools was that of the origin of such blood types. When did the genetic polymorphism first emerge and which species share it identical by descent (as opposed to by convergent evolution)? Answers to these questions could tell us where and when having multiple alleles at this locus became advantageous. We therefore sequenced as many Hominoids, Old World monkeys and New World monkeys as we could get our hands on, and, even more interestingly, we started thinking about the expectations under a model of convergent evolution, i.e., one where the AB genetic polymorphism was created independently multiple times in different species (and then maintained by balancing selection in these lineages) versus under a model of trans-species polymorphism, i.e., in which the AB genetic polymorphism arose early in time and was transmitted identical by descent to distinct species. Key to distinguishing the two predictions is the age of different selected alleles within a polymorphic population.

We therefore compared alleles within humans, orangutans, gibbons, macaques, baboons and colobus monkeys (all polymorphic species for the A and B alleles), and showed that, at least among Hominoids and among Old World monkeys, the observed genetic pattern is not compatible with a model of convergent evolution but on the contrary matches the expectations under a model of a trans-species polymorphism maintained by multi-allelic balancing selection. In other words, the data indicate that the AB polymorphism was present at least around 20 million years ago, if not earlier. Also, interestingly, it seems that the A, B and O functional classes do not provide a complete description of the allelic classes natural selection is acting on, which underscores the need for more detailed functional studies of ABO sub-groups.

By submitting the paper to arXiv, we hope to circulate it to a diverse audience and without delay. In particular, we hope that the study will motivate more experimental/functional work about the role of this polymorphism in immune response, e.g., to pathogen infections.

Laure Ségurel

Haplotype-based variant detection from short-read sequencing

Erik Garrison, Gabor Marth
(Submitted on 17 Jul 2012 (v1), last revised 20 Jul 2012 (this version, v2))

The direct detection of haplotypes from short-read DNA sequencing data requires changes to existing small-variant detection methods. Here, we develop a Bayesian statistical framework which is capable of modeling multiallelic loci in sets of individuals with non-uniform copy number. We then describe our implementation of this framework in a haplotype-based variant detector, FreeBayes.

Our paper: Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals

Our next “our paper” guest post is by Vincent Lynch [@VinJLynch] who’s just joined the UChicago faculty from a postdoc at Yale. He’s posting about his recently arXived paper:

Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals. ArXived here.
Explaining how morphology evolves is a major challenge in biology. While it’s clear that changes in gene regulation are ultimately responsible for the development and evolution of complex characters, we are only just beginning to understand the molecular mechanisms of gene regulatory evolution. This is largely due to the emergence of new technologies, such as mRNA-Seq and ChIP-Seq, which give biologists the tools to explore evolution across the genome and in non-model species.

We took advantage of these methods to explore the evolution of gene expression in the uterus during the origin of pregnancy in mammals. Using mRNA-Seq, we show that gene expression evolved extremely rapidly during major stages in the evolution of pregnancy, for example during the origin of maternal resource provisioning in the stem-lineage of Mammalia, placentation in the stem-lineage of Theria, and implantation in the stem-lineage of Eutheria. Using ChIP-Seq to identify the cis-regulatory elements of genes recruited into uterine expression in mammals, we find evidence suggesting that the majority of enhancers and promoters are derived from mammalian lineage-specific transposons.

While recent technological advances are changing the way we do biology (see Wagner 2013), as these emerging methods come into the mainstream we must collectively define our new standards of evidence. What experiments and methods build a convincing case for X? Is it sufficient, for example, to conclude that a transposon donated a novel promoter to a gene if a ChIP-Seq peak for a histone mark associated with promoters lies within the transposon? If we then expand that observation across the genome, can we reasonably conclude that transposons are causally responsible for gene regulatory change? For these reasons we chose to post our manuscript as a work-in-progress to arXiv, both as our contribution to the larger discussion of what constitutes the standards of evidence in this emerging field of biology and as an opportunity to receive feedback from our colleagues to complement formal peer-review.

Vincent Lynch

An analytical comparison of coalescent-based multilocus methods: The three-taxon case

Sebastien Roch
(Submitted on 17 Jul 2012)

Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. A large number of methods have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. Our analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths.
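
As background on the three-taxon setting, a standard result of the multispecies coalescent is that a gene tree matches the species tree topology with probability 1 - (2/3)e^(-T), where T is the internal branch length in coalescent units. The sketch below computes this and checks it by simulation. It is an illustration of that textbook result, not of the analysis in the paper, and the function names are made up for this example.

```python
import math
import random


def concordance_probability(t):
    """Probability that a three-taxon gene tree matches the species tree,
    given an internal branch of length t in coalescent units."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def simulate_concordance(t, n_loci=20000, seed=0):
    """Monte Carlo estimate of the same probability.

    The two sister lineages coalesce at rate 1 in the internal branch.
    If they fail to coalesce there (incomplete lineage sorting), all three
    lineages enter the root population and each of the three pairs is
    equally likely to coalesce first.
    """
    rng = random.Random(seed)
    concordant = 0
    for _ in range(n_loci):
        if rng.expovariate(1.0) < t:
            concordant += 1  # coalesced within the internal branch
        elif rng.random() < 1.0 / 3.0:
            concordant += 1  # matched the species tree by chance
    return concordant / n_loci


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        print(t, concordance_probability(t), simulate_concordance(t))
```

With a very short internal branch the three topologies become nearly equally likely, which is what makes gene tree incongruence so common in rapid radiations.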

A sequentially Markov conditional sampling distribution for structured populations with migration and recombination

Matthias Steinrücken, Joshua S. Paul, Yun S. Song
(Submitted on 25 Aug 2012)

Conditional sampling distributions (CSDs), sometimes referred to as copying models, underlie numerous practical tools in population genomic analyses. Though an important application that has received much attention is the inference of population structure, the explicit exchange of migrants at specified rates has not hitherto been incorporated into the CSD in a principled framework. Recently, in the case of a single panmictic population, a sequentially Markov CSD has been developed as an accurate, efficient approximation to a principled CSD derived from the diffusion process dual to the coalescent with recombination. In this paper, the sequentially Markov CSD framework is extended to incorporate subdivided population structure, thus providing an efficiently computable CSD that admits a genealogical interpretation related to the structured coalescent with migration and recombination. As a concrete application, it is demonstrated empirically that the CSD developed here can be employed to yield accurate estimation of a wide range of migration rates.

An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection

Matthias Steinrücken, Y. X. Rachel Wang, Yun S. Song
(Submitted on 25 Aug 2012)

Characterizing time-evolution of allele frequencies in a population is a fundamental problem in population genetics. In the Wright-Fisher diffusion, such dynamics is captured by the transition density function, which satisfies well-known partial differential equations. For a multi-allelic model with general diploid selection, various theoretical results exist on representations of the transition density, but finding an explicit formula has remained a difficult problem. In this paper, a technique recently developed for a diallelic model is extended to find an explicit transition density for an arbitrary number of alleles, under a general diploid selection model with recurrent parent-independent mutation. Specifically, the method finds the eigenvalues and eigenfunctions of the generator associated with the multi-allelic diffusion, thus yielding an accurate spectral representation of the transition density. Furthermore, this approach allows for efficient, accurate computation of various other quantities of interest, including the normalizing constant of the stationary distribution and the rate of convergence to this distribution.