Author post: Inferring human population size and separation history from multiple genome sequences

This guest post is by Stephan Schiffels (@stschiff) on his paper with Richard Durbin, "Inferring human population size and separation history from multiple genome sequences", biorxived here.

In our paper, we study genome sequences to learn about human history and how human populations are related to each other. Remarkably, we only need a few individuals for this, because once we look sufficiently many generations into the past, every single genome contains fragments from a very large number of ancestors. This means that given only two genomes, say one individual from Africa and one individual from Europe, we typically find shared fragments from common ancestors (great great … great grandparents) from 2,000 or more generations ago. This trace of shared segments in our genomes can be detected and enables us to make inferences about human history.

A few years ago, Heng Li and Richard Durbin introduced the PSMC method, which estimates this shared common ancestry within a single diploid genome to infer population sizes. We have now introduced a major extension to this approach, called MSMC (Multiple Sequentially Markovian Coalescent), which is able to find and date traces of shared ancestry across multiple genome sequences. This is generally a hard problem because of the complex way in which sequences relate to each other through recombination and mutation (see an excellent blog post by Adam Siepel). In our method, we therefore chose to focus only on the pair of segments that coalesces first, i.e. the pair that shares the most recent common ancestor among all pairs. Because of ancestral recombinations, this pair changes along the sequences.

Consider again the example of an African and a European individual, each of them carrying two copies of a chromosome. In one part of their genomes, the most recent ancestor of any two chromosomes may be shared between the two European chromosomes, in other parts it may be shared between the two African chromosomes, and in some cases it may actually be found across a European and an African chromosome. The relative frequency of how often we observe each of the three cases, and the distribution of times to the most recent common ancestor, give information about when the separation happened, and how long it took for the ancestral people to part fully from each other. In the case of West-Africans and Europeans, we found that the two populations started to separate from each other (at least genetically) long before the known out-of-Africa emigration 50,000 years ago. And we see the same thing if we compare West-Africans to Asians or Americans instead of Europeans. We can also see clearly how ancestors of Native Americans separated from Asians around 20,000 years ago, consistently preceding the known first arrival of people in the New World around 15,000 years ago.
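As a rough illustration of the three cases described above, here is a toy sketch. It is not MSMC itself: it assumes a single well-mixed population with no separation at all, in which each of the six possible pairs among the four chromosomes is equally likely to coalesce first. Under that assumption the first coalescence should fall across the two individuals about 4 times out of 6, so a clearly lower fraction in real data would hint at separation. The function name and chromosome labels are hypothetical.

```python
import itertools
import random
from collections import Counter

def first_coalescence_counts(n_loci, seed=0):
    """Toy model, assuming one panmictic population: at each locus, pick
    uniformly which pair of the four chromosomes coalesces first."""
    rng = random.Random(seed)
    chromosomes = ["AFR1", "AFR2", "EUR1", "EUR2"]  # hypothetical labels
    pairs = list(itertools.combinations(chromosomes, 2))  # 6 possible pairs
    counts = Counter()
    for _ in range(n_loci):
        a, b = rng.choice(pairs)
        pop_a, pop_b = a[:3], b[:3]
        if pop_a != pop_b:
            counts["across"] += 1
        elif pop_a == "AFR":
            counts["within_AFR"] += 1
        else:
            counts["within_EUR"] += 1
    return counts

counts = first_coalescence_counts(60000)
for case in ("within_AFR", "within_EUR", "across"):
    print(case, counts[case] / 60000)
```

The real method also uses the distribution of coalescence times, which this sketch ignores entirely.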

Our method can also estimate changes in effective population size through time. One consequence of looking only for the first common ancestor is that we can now see much further into the recent past than was previously possible with similar methods, such as PSMC. For example, we can now see a deep bottleneck in Native American ancestors around 15,000 years ago, which fits with the separation and immigration history described above, and we can see recent expansions that are consistent with the spread of agriculture in Africa.

We believe that MSMC is a useful tool for estimating population history from whole genome sequences. But more ideas and development are still needed to expand this approach to more genomes and to look into the past even more recently than 2,000 years ago, which is our current limit with MSMC. Closely related approaches are currently being developed by Yun Song, Thomas Mailund and others, and these will complement MSMC. This is a great time to work in this field, given that many more high quality individual genome sequences are being generated, in many cases from populations that we have not covered at all in our paper. All of this will help to greatly expand our knowledge of human population history.

Epidemic reconstruction in a phylogenetics framework: transmission trees as partitions

Matthew Hall, Andrew Rambaut
(Submitted on 2 Jun 2014)

The reconstruction of transmission trees for epidemics from genetic data has been the subject of some recent interest. It has been demonstrated that the transmission tree structure can be investigated by augmenting internal nodes of a phylogenetic tree constructed using pathogen sequences from the epidemic with information about the host that held the corresponding lineage. In this paper, we note that this augmentation is equivalent to a correspondence between transmission trees and partitions of the phylogenetic tree into connected subtrees each containing one tip, and provide a framework for Markov Chain Monte Carlo inference of phylogenies that are partitioned in this way, giving a new method to co-estimate both trees. The procedure is integrated in the existing phylogenetic inference package BEAST.

Target enrichment of ultraconserved elements from arthropods provides a genomic perspective on relationships among Hymenoptera

Brant C. Faircloth, Michael G. Branstetter, Noor D. White, Seán G. Brady
(Submitted on 2 Jun 2014)

Gaining a genomic perspective on phylogeny requires the collection of data from many putatively independent loci collected across the genome. Among insects, an increasingly common approach to collecting this class of data involves transcriptome sequencing, because few insects have high-quality genome sequences available; assembling new genomes remains a limiting factor; the transcribed portion of the genome is a reasonable, reduced subset of the genome to target; and the data collected from transcribed portions of the genome are similar in composition to the types of data with which biologists have traditionally worked (e.g., exons). However, molecular techniques requiring RNA as a template are limited to using very high quality source materials, which are often unavailable from a large proportion of biologically important insect samples. Recent research suggests that DNA-based target enrichment of conserved genomic elements offers another path to collecting phylogenomic data across insect taxa, provided that conserved elements are present in and can be collected from insect genomes. Here, we identify a large set (n=1510) of ultraconserved elements (UCE) shared among the insect order Hymenoptera. We use in silico analyses to show that these loci accurately reconstruct relationships among genome-enabled Hymenoptera, and we design a set of baits for enriching these loci that researchers can use with DNA templates extracted from a variety of sources. We use our UCE bait set to enrich an average of 721 UCE loci from 30 hymenopteran taxa, and we use these UCE loci to reconstruct phylogenetic relationships spanning very old (≥220 MYA) to very young (≤1 MYA) divergences among hymenopteran lineages. In contrast to a recent study addressing hymenopteran phylogeny using transcriptome data, we found ants to be sister to all remaining aculeate lineages with complete support.

The most parsimonious tree for random data

Mareike Fischer, Michelle Galla, Lina Herbst, Mike Steel
(Submitted on 1 Jun 2014)

Applying a method to reconstruct a phylogenetic tree from random data provides a way to detect whether that method has an inherent bias towards certain tree ‘shapes’. For maximum parsimony, applied to a sequence of random 2-state data, each possible binary phylogenetic tree has exactly the same distribution for its parsimony score. Despite this pleasing and slightly surprising symmetry, some binary phylogenetic trees are more likely than others to be a most parsimonious (MP) tree for a sequence of k such characters, as we show. For k=2, and unrooted binary trees on six taxa, any tree with a caterpillar shape has a higher chance of being an MP tree than any tree with a symmetric shape. On the other hand, if we take any two binary trees, on any number of taxa, we prove that this bias between the two trees vanishes as the number of characters grows. However, again there is a twist: MP trees on six taxa are more likely to have certain shapes than a uniform distribution on binary phylogenetic trees predicts, and this difference does not appear to dissipate as k grows.
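For readers unfamiliar with parsimony scores, the following minimal sketch computes the score of 2-state characters on a binary tree using Fitch's algorithm, a standard technique that the abstract does not name. The nested-tuple tree encoding, the taxon names and the function names are assumptions made for illustration; the trees are rooted on an arbitrary edge, which does not affect the score.

```python
import random

def fitch_score(tree, states):
    """Fitch's algorithm for one character on a rooted binary tree.
    `tree` is a leaf name or a pair (left, right); `states` maps each
    leaf name to 0 or 1. Returns (candidate_state_set, score)."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_score = fitch_score(tree[0], states)
    right_set, right_score = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_score + right_score
    # The children disagree: one state change is required on this node.
    return left_set | right_set, left_score + right_score + 1

def total_score(tree, characters):
    """Sum of Fitch scores over a sequence of characters."""
    return sum(fitch_score(tree, ch)[1] for ch in characters)

# Two shapes on six taxa (hypothetical taxon names a..f).
caterpillar = ((((("a", "b"), "c"), "d"), "e"), "f")
symmetric = ((("a", "b"), ("c", "d")), ("e", "f"))

# k = 2 random 2-state characters, as in the case discussed above.
rng = random.Random(1)
characters = [{taxon: rng.randint(0, 1) for taxon in "abcdef"} for _ in range(2)]
print(total_score(caterpillar, characters), total_score(symmetric, characters))
```

Repeating this over many random draws and over all trees on six taxa would be one way to explore the shape bias empirically, though the paper's results are analytical.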

A field test for frequency-dependent selection on mimetic colour patterns in Heliconius butterflies

Patricio Alejandro Salazar Carrión, Martin Stevens, Robert T. Jones, Imogen Ogilvie, Chris Jiggins

Müllerian mimicry, the similarity among unpalatable species, is thought to evolve by frequency-dependent selection. Accordingly, phenotypes that become established in an area are positively selected because predators have learnt to avoid these forms, while introduced phenotypes are eliminated because predators have not yet learnt to associate these other forms with unprofitability. We tested this prediction in two areas where different colour morphs of the mimetic species Heliconius erato and H. melpomene have become established, as well as in the hybrid zone between these morphs. In each area we tested for selection on three colour patterns: the two parental and the most common hybrid. We recorded bird predation on butterfly models with paper wings, matching the appearance of each morph to bird vision, and plasticine bodies. We did not detect differences in survival between colour morphs, but all morphs were more highly attacked in the hybrid zone. This finding is consistent with recent evidence from controlled experiments with captive birds, which suggest that the effectiveness of warning signals decreases when a large signal diversity is available to predators. This is likely to occur in the hybrid zone where over twenty hybrid phenotypes coexist.

Phylogenetic Identification and Functional Characterization of Orthologs and Paralogs across Human, Mouse, Fly, and Worm

Yi-Chieh Wu, Mukul S Bansal, Matthew D Rasmussen, Javier Herrero, Manolis Kellis

Model organisms can serve the biological and medical community by enabling the study of conserved gene families and pathways in experimentally-tractable systems. Their use, however, hinges on the ability to reliably identify evolutionary orthologs and paralogs with high accuracy, which can be a great challenge at both small and large evolutionary distances. Here, we present a phylogenomics-based approach for the identification of orthologous and paralogous genes in human, mouse, fly, and worm, which forms the foundation of the comparative analyses of the modENCODE and mouse ENCODE projects. We study a median of 16,101 genes across 2 mammalian genomes (human, mouse), 12 Drosophila genomes, 5 Caenorhabditis genomes, and an outgroup yeast genome, and demonstrate that accurate inference of evolutionary relationships and events across these species must account for frequent gene-tree topology errors due to both incomplete lineage sorting and insufficient phylogenetic signal. Furthermore, we show that integration of two separate phylogenomic pipelines yields increased accuracy, suggesting that their sources of error are independent, and finally, we leverage the resulting annotation of homologous genes to study the functional impact of gene duplication and loss in the context of rich gene expression and functional genomic datasets of the modENCODE, mouse ENCODE, and human ENCODE projects.

Most viewed on Haldane’s Sieve: May 2014

The most viewed posts on Haldane’s Sieve in May 2014 were:

High performance computation of landscape genomic models integrating local indices of spatial association


Sylvie Stucki, Pablo Orozco-terWengel, Michael W. Bruford, Licia Colli, Charles Masembe, Riccardo Negrini, Pierre Taberlet, Stéphane Joost, the NEXTGEN Consortium
Comments: 1 figure in text, 1 figure in supplementary material
Subjects: Populations and Evolution (q-bio.PE)

Motivation: The increasing availability of high-throughput datasets requires powerful methods to support the detection of signatures of selection in landscape genomics. Results: We present an integrated approach to study signatures of local adaptation, providing rapid processing of whole genome data and enabling assessment of spatial association using molecular markers. Availability: Samβada is open source software written in C++, available at http://lasig.epfl.ch/sambada (under the GNU GPL 3 license). Compiled versions are provided for Windows, Linux and MacOS X. Contact: stephane.joost@epfl.ch, sylvie.stucki@a3.epfl.ch. Supplementary material is available online.

High-resolution transcriptome analysis with long-read RNA sequencing


Hyunghoon Cho, Joe Davis, Xin Li, Kevin S. Smith, Alexis Battle, Stephen B. Montgomery
Comments: 29 pages, 8 figures, 11 supplementary figures
Subjects: Genomics (q-bio.GN)

RNA sequencing (RNA-seq) enables characterization and quantification of individual transcriptomes as well as detection of patterns of allelic expression and alternative splicing. Current RNA-seq protocols depend on high-throughput short-read sequencing of cDNA. However, as ongoing advances are rapidly yielding increasing read lengths, a technical hurdle remains in identifying the degree to which differences in read length influence various transcriptome analyses. In this study, we generated two paired-end RNA-seq datasets of differing read lengths (2×75 bp and 2×262 bp) for lymphoblastoid cell line GM12878 and compared the effect of read length on transcriptome analyses, including read-mapping performance, gene and transcript quantification, and detection of allele-specific expression (ASE) and allele-specific alternative splicing (ASAS) patterns. Our results indicate that, while the current long-read protocol is considerably more expensive than short-read sequencing, there are important benefits that can only be achieved with longer read length, including lower mapping bias and reduced ambiguity in assigning reads to genomic elements, such as mRNA transcript. We show that these benefits ultimately lead to improved detection of cis-acting regulatory and splicing variation effects within individuals.

Cis-regulatory elements and human evolution

Adam Siepel, Leonardo Arbiza

Modification of gene regulation has long been considered an important force in human evolution, particularly through changes to cis-regulatory elements (CREs) that function in transcriptional regulation. For decades, however, the study of cis-regulatory evolution was severely limited by the available data. New data sets describing the locations of CREs and genetic variation within and between species have now made it possible to study CRE evolution much more directly on a genome-wide scale. Here, we review recent research on the evolution of CREs in humans based on large-scale genomic data sets. We consider inferences based on primate divergence, human polymorphism, and combinations of divergence and polymorphism. We then consider “new frontiers” in this field stemming from recent research on transcriptional regulation.