A phylogeny of birds based on over 1,500 loci collected by target enrichment and high-throughput sequencing

John E. McCormack, Michael G. Harvey, Brant C. Faircloth, Nicholas G. Crawford, Travis C. Glenn, Robb T. Brumfield
(Submitted on 4 Oct 2012)

Evolutionary relationships among birds in Neoaves, a clade including the vast majority of avian diversity, have vexed systematists due to the ancient, rapid radiation of numerous lineages. We applied a new phylogenomic approach to resolve relationships in Neoaves using target enrichment (sequence capture) and high-throughput sequencing of ultraconserved elements (UCEs) in avian genomes. We collected sequence data from UCE loci for 32 members of Neoaves and one outgroup (chicken) and analyzed data sets that differed in amount of missing data. An alignment of 1,541 loci that allowed missing data was 87% complete and resulted in a highly resolved phylogeny with broad agreement between the Bayesian and maximum-likelihood (ML) trees. Although the 100% complete matrix of 416 UCE loci was broadly similar, the Bayesian and ML trees differed to a greater extent in this analysis, suggesting that increasing from 416 to 1,541 loci led to increased stability and resolution of the tree. Novel results of our study include surprisingly close relationships between phenotypically divergent bird families, such as tropicbirds (Phaethontidae) and the sunbittern (Eurypygidae) as well as a sister relationship between bustards (Otididae) and turacos (Musophagidae). This phylogeny bolsters support for monophyletic waterbird and landbird clades and also strongly supports controversial relationships from previous studies, including the sister relationship between passerines and parrots and the non-monophyly of raptorial birds in the hawk and falcon families. Although significant challenges remain to fully resolving some of the deep relationships in Neoaves, especially among lineages outside the waterbirds and landbirds, this study suggests that increased data will yield an increasingly resolved avian phylogeny.

Our paper: The geography of recent genetic ancestry across Europe

This guest post is by Peter Ralph and Graham Coop (@graham_coop) on their paper The geography of recent genetic ancestry across Europe arXived here

In this paper we look at the genetic traces of very recent common ancestry between pairs of individuals from across Europe. We’ll likely write a few more accessible posts on this work when the paper is closer to publication, but for now (in the spirit of Haldane’s sieve) we offer a slightly more technical post [the full details are in the paper].

We started this project wanting to estimate recent migration rates across continents like Europe: if we could learn how far apart distant cousins live, then, all else equal, we could estimate typical migration distances. This isn’t where we ended up (that’s another project we are working on), but the basic idea of looking at the geographic distribution of close relatives led us to some interesting places.

As most populations lack amazing pedigrees like the one worked on by Decode in Iceland [e.g. here], we can’t actually know the true relationships between the samples (other than a few obvious siblings and full cousins). However, long segments of chromosome shared (almost) identically by descent (IBD) between two people have probably been inherited from a recent common ancestor. The length of these IBD segments tells us something about how long ago the ancestor lived, since the older the ancestor, the more opportunity for recombination to whittle down the segment.
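
As a rough illustration of why length tracks age, here is a minimal sketch under a simplified model that is an assumption of this example, not the method used in the paper: a block passed down through m meioses is treated as having an exponentially distributed length with mean 100/m cM, and two people who share an ancestor n generations back are separated by 2n meioses. Chromosome ends and detection power are ignored, and the function names are hypothetical.

```python
import math


def mean_ibd_length_cm(generations: int) -> float:
    """Mean block length (cM) for an ancestor `generations` back.

    Assumes 2 * generations meioses separate the two descendants and an
    exponential length distribution (a simplification, see the text above).
    """
    meioses = 2 * generations
    return 100.0 / meioses


def prob_block_longer_than(length_cm: float, generations: int) -> float:
    """P(block length > length_cm) under the same exponential model."""
    meioses = 2 * generations
    return math.exp(-length_cm * meioses / 100.0)


# Older ancestors give shorter blocks and a smaller chance of a long one.
for n in (5, 20, 100):
    print(n, mean_ibd_length_cm(n), prob_block_longer_than(1.0, n))
```

Under these assumptions, the mean block length halves every time the age of the ancestor doubles, which is the sense in which block length carries information about time.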

This has been worked on by a bunch of different groups, but the historical inference has usually been applied to small or relatively isolated populations. To really push the boundaries of these approaches we used the European subset of the POPRES dataset, which consists of thousands of human individuals. This is currently one of the best genome-scale, geographically indexed datasets, and represents a huge outbred population where we’d expect patterns of variation to be (at least partly) due to continuous migration, rather than, say, recent mixing of diverged populations or bottlenecks. So, we ran BEAGLE on the dataset to find IBD segments, and got lots of wonderful signal — it turned out that most pairs of people in the European sample (around 75%) shared IBD segments that were megabases long (i.e. longer than 1 centi-Morgan, cM). After a bunch of power and false-positive simulations, we were convinced that most of those blocks of IBD had been inherited from single common ancestors.

You could think of our results in two pieces: first, doing descriptive statistics on the distribution of IBD abundances and lengths across geography; and second, doing some inference on this distribution to see what we can learn about when those common ancestors lived.

As we hoped, there was a nice relationship to geography – people nearer to each other typically shared more and longer IBD than people farther away, in a nice monotonic relationship. This convinced us that continuous, local migration had played an important role in shaping current patterns of relatedness across Europe. Geographic distance was definitely not the only factor – superimposed on top of this was distinctive regional variation. For example, one of the strongest signals we saw was that there are higher levels of IBD sharing in Eastern Europe. As you’ll see from the paper, after further work, we think this is a potential signal of the Slavic or Hunnic expansions.

There were also some surprises to us along the way, like people in the UK sharing more IBD with Irish than with other people in the UK, that turned out to make sense after thinking about rapidly growing populations with directional migration (although there are other explanations). We correlated the patterns we saw with historical events, but (as with most genomic studies of human history) there was a lot of uncertainty. Sure, the patterns we see are consistent with the story we told, but there could potentially be a lot of other explanations, especially given the complicated and often unknown demographic history of European populations. What if all the IBD we saw came from the Neolithic expansion rather than the last two or three thousand years? This turns out to be a bigger worry than you might think: it’s fairly unlikely that two people inherited a 3cM block from a single common ancestor from 6000 years ago, but if they have enough common ancestors from back then (e.g. a strong enough bottleneck), it turns out to be reasonably likely.
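
The arithmetic behind this worry can be sketched as follows. The per-ancestor probability used here is a made-up placeholder, not an estimate from the paper, and the sketch assumes that each shared ancestor contributes a block independently of the others.

```python
import math


def prob_at_least_one_block(per_ancestor_prob: float, n_shared_ancestors: int) -> float:
    """P(at least one long block) given many independent shared ancestors.

    Computes 1 - (1 - p)^k using log1p/expm1 so that very small p stays accurate.
    """
    return -math.expm1(n_shared_ancestors * math.log1p(-per_ancestor_prob))


# Hypothetical value: chance that any ONE ancient ancestor passes a long
# block to both people. Chosen only to illustrate the shape of the effect.
p = 1e-6

for k in (1, 1_000, 1_000_000, 10_000_000):
    print(k, prob_at_least_one_block(p, k))
```

With a single ancestor the chance is negligible, but it approaches one as the number of shared ancestors grows, which is why old common ancestry cannot be ruled out on block length alone.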

So, we did some coalescent theory to work out the relationship between numbers of shared ancestors back through time (closely related to coalescent rate) and the observed distribution of IBD block lengths. We could then invert this relationship to estimate, from the observed distribution of IBD blocks, the mean number of ancestors that pairs of people from different parts of Europe share with each other, as a function of time. Unfortunately, this turns out to have a lot of unavoidable uncertainty: the inversion problem is “ill-conditioned” (in other terms, the likelihood surface is ridged), meaning that there were a lot of different histories that gave the same IBD length distribution.
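
A toy illustration of this kind of non-identifiability, which is not the paper's model: assume each block from an ancestor n generations back has an exponential length with rate 2n/100 per cM, and compare two quite different hypothetical histories. The history dictionaries, the weights, and the function name are all assumptions made for this sketch.

```python
import math


def tail_fraction(length_cm: float, history: dict) -> float:
    """Fraction of blocks longer than length_cm.

    `history` maps generations-ago to a weight (weights sum to 1).
    """
    return sum(
        weight * math.exp(-length_cm * 2 * generations / 100.0)
        for generations, weight in history.items()
    )


# History A: every shared ancestor lived 50 generations ago.
history_a = {50: 1.0}
# History B: half lived 40 generations ago, half lived 60 generations ago.
history_b = {40: 0.5, 60: 0.5}

# Compare the two length distributions on a grid from 1 cM to 10 cM.
lengths = [1.0 + 0.5 * i for i in range(19)]
max_gap = max(
    abs(tail_fraction(x, history_a) - tail_fraction(x, history_b)) for x in lengths
)
print(max_gap)
```

In this toy setting the two curves never differ by much more than about 0.01 over the whole range, so noisy data would struggle to tell a single pulse of ancestry apart from one spread over several hundred years.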

Despite this, we could still rigorously learn a lot: in particular, nearly all the IBD blocks we found did actually come from ancestors living during the last 3,000 years. Although we could only tie down the ages of the common ancestors to within a few hundred years, the major patterns can likely be tied to known historical events. There is quite a bit of uncertainty about the specific interpretations (it is still not straightforward to go from pairwise numbers of shared ancestors 1,500 years ago to conclusions about demographic events at the time), but used in conjunction with other sources of information, this approach has the promise to conclusively resolve some longstanding debates about recent history.

Finally, two addendums (addenda?) about the methods: The first is that we took an empirical approach to estimating the relationship between coalescent time distribution and observed IBD block length, by simulating a bunch (actually copying over blocks and re-running BEAGLE). We did this because BEAGLE is effectively a black box, for our purposes. This sort of approach is more common in experimental physics, where the empirical properties of detectors have to be worked out (and the problem of inferring the signal is known as “data unfolding”).

Second, we should emphasize that the uncertainty we came across in inferring dates is theoretically unavoidable, using IBD block length data. We think this is a common issue for many sorts of population genetics data — situations in which, even though we have a ton of data, getting specific, tightly constrained inferences requires making fairly strong assumptions (or equivalently, working in a specific set of parametric models). This has been highlighted in some cases [like this one], but more work is needed on this to ensure that we represent the inherent uncertainties in population genetics inferences correctly.

We’d love feedback from the popgen community about what aspects of the paper they’d like to see clarified/improved [obviously it is a pretty involved paper already, so concentrate on specific suggestions]. We have a tonne more ideas of how to improve this inference technology and extend it to other applications. But we’d love to hear your thoughts too.

Best Practices for Scientific Computing

Outside our usual remit, but likely of interest to many of our readers. See here for online peer review.

D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian Mitchell, Mark Plumbley, Ben Waugh, Ethan P. White, Greg Wilson, Paul Wilson
(Submitted on 1 Oct 2012)
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

A phylogenomic perspective on the radiation of ray-finned fishes based upon targeted sequencing of ultraconserved elements

Michael E. Alfaro, Brant C. Faircloth, Laurie Sorenson, Francesco Santini
(Submitted on 29 Sep 2012)

Ray-finned fishes constitute the dominant radiation of vertebrates with over 30,000 species. Although molecular phylogenetics has begun to disentangle major evolutionary relationships within this vast section of the Tree of Life, there is no widely available approach for efficiently collecting phylogenomic data within fishes, leaving much of the enormous potential of massively parallel sequencing technologies for resolving major radiations in ray-finned fishes unrealized. Here, we provide a genomic perspective on longstanding questions regarding the diversification of major groups of ray-finned fishes through targeted enrichment of ultraconserved nuclear DNA elements (UCEs) and their flanking sequence. Our workflow efficiently and economically generates data sets that are orders of magnitude larger than those produced by traditional approaches and is well-suited to working with museum specimens. Analysis of the UCE data set recovers a well-supported phylogeny at both shallow and deep time-scales that supports a monophyletic relationship between Amia and Lepisosteus (Holostei) and reveals elopomorphs and then osteoglossomorphs to be the earliest diverging teleost lineages. Divergence time estimation based upon 14 fossil calibrations reveals that crown teleosts appeared ~270 Ma at the end of the Permian and that elopomorphs, osteoglossomorphs, ostarioclupeomorphs, and euteleosts diverged from one another by 205 Ma during the Triassic. Our approach additionally reveals that sequence capture of UCE regions and their flanking sequence offers enormous potential for resolving phylogenetic relationships within ray-finned fishes.

Thoughts on: Finding the sources of missing heritability in a yeast cross

[This commentary post is by Joe Pickrell on Finding the sources of missing heritability in a yeast cross by Bloom et al., available from the arXiv here]

For decades, human geneticists have used twin studies and related methods to show that a considerable amount of the phenotypic variation in humans is driven by genetic variation (without any knowledge of the underlying loci). More recently, genome-wide association studies have made incredible progress in identifying specific genetic variants that are important in various traits. However, the loci identified in the latter studies often have small effects, and the sums of their effects rarely come close to the genetic effects known from the former. The difference between the genetic effects on a trait known from heritability studies and the effects estimated from individual loci has come to be known as the “missing heritability”, and much ink has been spilled on speculation as to its cause.

Bloom et al. take an elegant and straightforward approach to this question using a model system, the budding yeast Saccharomyces cerevisiae. The insight is that in order to make progress, one needs an experimental design that isolates some of the possible causes of the “missing heritability”. To achieve this, Bloom et al. use a cross between two different yeast strains and grow the segregants in identical conditions. They thus remove much of the environmental variation in the phenotypes, while also removing the effect of allele frequency (since all alleles are at frequency 0.5 in the cross). While this means that they cannot address some controversies about potential sources of heritability in humans (for example, rare versus common variants), they are able to estimate how much phenotypic variation is due to detectable additive effects. The authors additionally develop high-throughput assays to measure phenotypes (in their case, growth rates in 46 different conditions) and genotypes in the segregants from the cross, so that they can perform high-powered mapping studies.

The main result relevant to the issue of “missing heritability” is presented in their Figure 3a (reproduced above). After performing a well-powered mapping study, the authors compare the effects from their identified loci to the narrow-sense heritability of each trait. As it turns out, the heritability is not missing! In this cross, the identified loci, though of small effect, add up to a substantial fraction of the overall (narrow-sense) heritability for most traits. The authors additionally identify some gene-gene interactions that contribute to the broad-sense heritability (but by definition not the narrow-sense heritability) of many traits.

The authors provocatively interpret their results as supporting a model in which the majority of the “missing heritability” lies in a large number of variants with small effect sizes (in line with the model proposed most notably by Peter Visscher and colleagues, though the authors here make no claims about the allele frequencies of the relevant variants). While this seems to be true in yeast, it remains to be seen in humans. It’s of course easy to come up with reasons why this might not hold in humans (our species is a special snowflake and so forth), but this paper should be in the back of the mind of anyone who is thinking about this problem.

Horizontal gene transfer may explain variation in θs

Rohan Maddamsetti, Philip J. Hatcher, Stéphane Cruveiller, Claudine Médigue, Jeffrey E. Barrick, Richard E. Lenski
(Submitted on 28 Sep 2012)

Martincorena et al. estimated synonymous diversity (θs = 2Nμ) across 2,930 orthologous gene alignments from 34 Escherichia coli genomes, and found substantial variation among genes in the density of synonymous polymorphisms. They argue that this pattern reflects variation in the mutation rate per nucleotide (μ) among genes. However, the effective population size (N) is not necessarily constant across the genome. In particular, different genes may have different histories of horizontal gene transfer (HGT), whereas Martincorena et al. used a model with random recombination to calculate θs. They did filter alignments in an effort to minimize the effects of HGT, but we doubt that any procedure can completely eliminate HGT among closely related genomes, such as E. coli living in the complex gut community.
Here we show that there is no significant variation among genes in rates of synonymous substitutions in a long-term evolution experiment with E. coli and that the per-gene rates are not correlated with θs estimates from genome comparisons. However, there is a significant association between θs and HGT events. Together, these findings imply that θs variation reflects different histories of HGT, not local optimization of mutation rates to reduce the risk of deleterious mutations as proposed by Martincorena et al.
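
The confound at the heart of this argument follows directly from the formula θs = 2Nμ: the same diversity can arise from different combinations of N and μ. The sketch below uses invented, purely illustrative values for N and μ (none are taken from either study), and the function name is hypothetical.

```python
import math


def theta_s(effective_pop_size: float, mutation_rate: float) -> float:
    """Synonymous diversity under the formula theta_s = 2 * N * mu."""
    return 2.0 * effective_pop_size * mutation_rate


# Two hypothetical genes with the same mutation rate but different
# effective population sizes (for example, through different HGT histories).
gene_a = theta_s(effective_pop_size=1e8, mutation_rate=1e-10)
gene_b = theta_s(effective_pop_size=2e8, mutation_rate=1e-10)

# A third hypothetical gene that matches gene_b through a higher mutation
# rate instead of a larger effective population size.
gene_c = theta_s(effective_pop_size=1e8, mutation_rate=2e-10)

print(gene_a, gene_b, gene_c)
print(math.isclose(gene_b, gene_c))
```

Because gene_b and gene_c have the same θs, the diversity estimate alone cannot say whether N or μ differs between genes; additional evidence, such as directly measured substitution rates, is needed to separate the two.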

Most viewed on Haldane’s Sieve: August-September 2012

Haldane’s Sieve has now been operating for a little over a month, and we’ve enjoyed reading the steadily growing stream of interesting manuscripts posted to preprint servers. In this post, we revisit the most viewed preprints since the site began. Perhaps unsurprisingly, these are all papers that the authors described to the community with “Our paper” posts: