Model adequacy and the macroevolution of angiosperm functional traits

Matthew Pennell, Richard G FitzJohn, William K Cornwell, Luke J Harmon

All models are wrong and sometimes even the best of a set of models is useless. Modern phylogenetic comparative methods (PCMs) are almost exclusively model–based and therefore making robust inferences from PCMs requires using a model of trait evolution that is a good explanation for the data. To date, researchers using PCMs have evaluated the explanatory power of a model only in terms of relative, not absolute, fit. Here we develop a general statistical framework for assessing the absolute fit, or adequacy, of phylogenetic models for the evolution of quantitative traits. We use our approach to test whether commonly used models are adequate descriptors of the macroevolutionary dynamics of real comparative data. We fit models of trait evolution to 337 comparative datasets covering three key Angiosperm functional traits and evaluated the absolute fit of the models to each dataset. Overall, the models we used are very inadequate for the evolution of these traits; this was true for many different groups and at many different scales. Furthermore, the relative support for a model had very little to do with its absolute adequacy. We argue that assessing model adequacy should be a key step in comparative analyses.

A chromatin structure based model accurately predicts DNA replication timing in human cells

Yevgeniy Gindin, Manuel S. Valenzuela, Mirit I. Aladjem, Paul S. Meltzer, Sven Bilke
Subjects: Subcellular Processes (q-bio.SC); Genomics (q-bio.GN)

The metazoan genome is replicated in a precise, cell-lineage-specific temporal order. However, the mechanism controlling this orchestrated process is poorly understood, as no molecular mechanisms have been identified that actively regulate the firing sequence of genome replication. Here we develop a mechanistic model of genome replication capable of predicting, with accuracy rivaling experimental repeats, the observed replication timing program in humans. In our model, replication is initiated in an uncoordinated (time-stochastic) manner at well-defined sites. The model contains, in addition to the choice of the genomic landmark that localizes initiation, only a single adjustable parameter of direct biological relevance: the number of replication forks. We find that DNase hypersensitive sites are optimal and independent determinants of DNA replication initiation. We demonstrate that the DNA replication timing program in human cells is a robust emergent phenomenon that, by its very nature, does not require a regulatory mechanism determining a proper replication initiation firing sequence.

Population genetics on islands connected by an arbitrary network: An analytic approach

George W A Constable, Alan J McKane
(Submitted on 11 Feb 2014)

We analyse a model consisting of a population of individuals which is subdivided into a finite set of demes, each of which has a fixed but differing number of individuals. The individuals can reproduce, die and migrate between the demes according to an arbitrary migration network. They are haploid, with two alleles present in the population; frequency independent selection is also incorporated, where the strength and direction of selection can vary from deme to deme. The system is formulated as an individual-based model, and the diffusion approximation systematically applied to express it as a set of nonlinear coupled stochastic differential equations. These can be made amenable to analysis through the elimination of fast-time variables. The resulting reduced model is analysed in a number of situations, including migration-selection balance leading to a polymorphic equilibrium of the two alleles, and an illustration of how the subdivision of the population can lead to non-trivial behaviour in the case where the network is a simple hub. The method we develop is systematic, may be applied to any network, and agrees well with the results of simulations in all cases studied and across a wide range of parameter values.

Discovering functional DNA elements using population genomic information: A proof of concept using human mtDNA

Daniel R. Schrider, Andrew D. Kern
Subjects: Populations and Evolution (q-bio.PE); Genomics (q-bio.GN)

Identifying the complete set of functional elements within the human genome would be a windfall for multiple areas of biological research including medicine, molecular biology, and evolution. Complete knowledge of function would aid in the prioritization of loci when searching for the genetic bases of disease or adaptive phenotypes. Because mutations that disrupt function are disfavored by natural selection, purifying selection leaves a detectable signature within functional elements; accordingly, this signal has been exploited through the use of genomic comparisons of distantly related species. However, the functional complement of the genome changes extensively across time and between lineages; therefore, evidence of the current action of purifying selection is essential. Because the removal of deleterious mutations by natural selection also reduces within-species genetic diversity within functional loci, dense population genetic data have the potential to reveal genomic elements that are currently functional. Here we assess the potential of this approach using 16,411 human mitochondrial genomes. We show that the high density of polymorphism in this dataset precisely delineates regions experiencing purifying selection. Further, we show that the number of segregating alleles at a site is strongly correlated with its divergence across species after accounting for known mutational biases in human mtDNA. These two measures track one another at a remarkably fine scale across many loci, a correlation that is purely the result of natural selection. Our results demonstrate that genetic variation has the potential to reveal exactly which nucleotides in the genome are currently performing important functions and likely to have deleterious fitness effects when mutated. As more complete genomes are sequenced, similar power to reveal purifying selection may be achievable in the human nuclear genome.

Author post: Dynamic DNA Processing: A Microcode Model of Cell Differentiation

The following guest post is by Barry Jacobson on his preprint “Dynamic DNA Processing: A Microcode Model of Cell Differentiation”, arXived here.

The paper suggests that DNA should be viewed as a processor that operates by means of base-pairing with remote regions of the genome. If one sequence matches another (or is complementary to it) it will set up a structural loop, or other interaction. However, the paper postulates that at least one region of the genome of every cell will have a unique clock sequence that is shared by no other cell. Therefore, the clock of one cell may not match the same distant sequences as the clock of another. Thus, the pattern of loops that is formed, and the overall 3-D DNA structure, may differ from cell to cell. This will either assist or hinder binding of transcription factors in one type of cell, as compared to another, thus providing a mechanism of differential gene expression.

We discuss a method by which these differing clock sequences could be generated in cell division, so that the daughters each end up with a unique identifier. The identifier then unlocks certain conformations only for those cell types for which it is relevant. SNPs may function in a similar manner, by modifying 3-D configurations and thus altering TF activity.

We further postulate that if a clock or target is errantly mutated, so that it matches a target farther away than was intended, it may stretch the chromosome to the breaking point, and this is the cause of chromosomal breakage or translocations in cancer.

Finally, we allow for the possibility that a cell can modify its clock in response to the environment, such as when healing from trauma, or accepting a graft, in which case it needs to coordinate with neighboring cells. We suggest that perhaps chemical analogs of cell surface proteins may occasionally mistrigger such a clock modification, when none is necessary, and thereby cause incorrect matches and conformations in that cell, which can damage DNA, and lead to cancer, as before.

We realize this is all purely speculative. However, we originally submitted this model to Nature without success 16 years ago, and since then a number of its assumptions have been verified, as detailed in the recent submission to arXiv. We therefore believe it deserves a second look.

Author post: The evolution of sex differences in disease genetics


This guest post is by Ted Morrow, Jessica Abbott, and Will Gilks on their review paper Gilks et al. “The evolution of sex differences in disease genetics”

Our paper forms part of a research project (2Sexes_1Genome, 2012-16) devoted to investigating how sex-specific and sexually antagonistic selection influences the genome, and in particular whether genetic variants that are maintained as a result of these forms of selection could contribute to disease risk. We had three main aims with our paper, which we outline below together with a motivation for each.

Our first aim was to summarise evidence for sex-dependent genetic architecture in complex traits that were otherwise shared between the sexes. We focused particularly on disease phenotypes in humans, although a range of complex traits from diverse taxa were considered. The motivation for this was to establish a baseline for how widespread or rare sex-specific genetic architecture is. An important paper in this respect, published in Nature Reviews Genetics (Ober et al., 2008), specifically addressed the question of sex-specific genetic architecture in human diseases. It reviewed selected examples within the human disease genetics literature for sex-specific effects on a range of phenotypes. The authors concluded that studies where sex was ignored would miss some important variants that contribute to disease risk. While the Ober et al. (2008) paper makes a robust case for investigating sex as a factor in genetic analyses, several other genome-wide association studies in the primary literature have been published since, suggesting that an up-to-date review of these would be worthwhile. We did not intend to conduct a full-scale meta-analysis, although that would probably be a very informative exercise given potential problems in terms of reporting bias, non-independence of traits, and selection of traits with known sexual dimorphism. Nonetheless, a clear pattern emerges of widespread evidence of sex-specific genetic architecture based on heritability estimates (see Figure 1 in our paper), eQTLs, gene manipulations, expression studies, and SNPs with sex-by-genotype effects (see Table 1 in our paper). A recently published paper (not included in our review) even reports 10 out of 13 loci reaching genome-wide significance for recombination rate having sex-specific effects (Kong et al., 2013).

The second aim was to show how evolutionary theory could provide ultimate explanations for the origins of sex-specific genetic architecture. In this way, we propose that a deeper understanding of why genes cause disease, and why some common diseases show sexually dimorphic expression, may emerge. The evolutionary theory of why the sexes may differ phenotypically goes back to Darwin’s observations (1871) of how selection acts in males and females. He characterized males as active competitors, engaging in physical battles with rivals or investing in costly signals with which to woo potential mates. Females, on the other hand, were characterized as being coy and choosy. There is now good evidence that mate choice is not limited to females, and that sexual selection also operates well after copulation (i.e. sperm competition and cryptic female choice). The key point is that fundamental differences between the sexes occur in terms of investment in reproduction, and as a consequence the routes by which males and females may maximize their fitness are often different. In other words, both natural and sexual selection frequently take sex-specific forms in terms of strength and/or direction. The latter possibility, that selection acts antagonistically between the sexes, is well established in several laboratory and wild populations, including humans. From a human disease perspective, disease may occur as a result of an individual’s phenotypic difference (or departure) from an optimal phenotype (where a particular trait value has the greatest fitness). This difference could be the result of a genetic constraint imposed by an intersexual genetic correlation for that trait, or indirectly (i.e. pleiotropically) through genetic correlations with other traits. Sex-specific or sexually antagonistic selection could therefore maintain genetic variation within a population that is either less favourable or actually deleterious for one sex.
A recent model (Morrow & Connallon, 2013) shows how alleles with sex-specific or sexually antagonistic effects will contribute more to genetic variation for disease predisposition than alleles that are deleterious to both sexes in equal measure, and achieve higher allele frequencies. As a result, sexual dimorphism in the genetic architecture of complex polygenic diseases would emerge within the population. This evolutionary model clearly indicates that the search for loci contributing to disease risk in humans would benefit from exploring sex-specific genetic effects.

The final aim was to provide readers with an overview of the analytical options available for detecting sex-specific associations in genome-wide studies of complex diseases and phenotypes. As we show, more studies are investigating and discovering sex-dependent effects using GWAS data. Common strategies are to separate or stratify the samples within case and control groups by sex, or to model sex as a covariate. The first approach reduces the statistical power to detect sex-dependent effects, and thus only strong ones will be detected. The second simply controls for any sex-specific effects; it is not intended to identify them. We instead advocate the inclusion of a genotype-by-sex interaction term in statistical models, available as an option in some of the commonly used analytical platforms such as GenABEL and PLINK.
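To make the idea of a genotype-by-sex interaction term concrete, here is a minimal sketch in plain Python. It is not how GenABEL or PLINK implement the option, and for simplicity it fits a linear model to a quantitative trait rather than a case-control outcome. The function names, the 0/1 sex coding, and the noise-free toy data are all assumptions made up for illustration.

```python
def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on the augmented system.
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def design_row(genotype, sex):
    # Columns: intercept, genotype (0/1/2 allele copies), sex (0/1),
    # and the genotype-by-sex interaction.
    return [1.0, float(genotype), float(sex), float(genotype * sex)]


# Toy, noise-free data (hypothetical): the allele raises the trait by 0.5 per
# copy in sex 0 and by 0.5 + 1.0 per copy in sex 1.
samples = [(g, s) for g in (0, 1, 2) for s in (0, 1)]
phenotypes = [2.0 + 0.5 * g + 0.3 * s + 1.0 * g * s for g, s in samples]
rows = [design_row(g, s) for g, s in samples]
intercept, b_geno, b_sex, b_interaction = fit_least_squares(rows, phenotypes)
```

Here `b_interaction` estimates how much the per-allele effect differs between the sexes, which is the quantity a sex-as-covariate model (only `b_sex`) cannot capture. A real analysis would add noise, standard errors, and a significance test on that coefficient.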

Overall, we hope our article raises the profile of sex-specific genetic effects, a topic that is already apparently receiving increasing interest judging by the recent crop of sex-specific associations appearing in the GWAS literature. This forms a more general theme within the field of human disease genetics, of exploring the impact of interaction effects, such as genotype-by-environment interactions. The identification of strong main effects has had successes, but the debate over the ‘missing heritability’ of complex traits has prompted researchers to look beyond them to more complex processes such as epistasis and environmental effects. We welcome any comments either here on Haldane’s Sieve or in the comments section of bioRxiv, where our article is currently posted.

References
2Sexes_1Genome. 2012-16. Edward H. Morrow. FP7 ERC Starting Grant – Evolutionary, population and environmental biology. http://www.2020-horizon.com/2SEXES-1GENOME-Sex-specific-genetic-effects-on-fitness-and-human-disease(2SEXES-1GENOME)-s2903.html
Darwin, C. 1871. The Descent of Man. Prometheus Books, New York.
Kong, A., Thorleifsson, G., Frigge, M.L., Masson, G., Gudbjartsson, D.F., Villemoes, R., et al. 2013. Common and low-frequency variants associated with genome-wide recombination rate. Nat. Genet. doi:10.1038/ng.2833.
Morrow, E.H. & Connallon, T. 2013. Implications of sex-specific selection for the genetic basis of disease. Evol. Appl. doi:10.1111/eva.12097.
Ober, C., Loisel, D.A. & Gilad, Y. 2008. Sex-specific genetic architecture of human disease. Nat Rev Genet 9: 911–922.

Author post: The identifiability of piecewise demographic models from the sample frequency spectrum

This guest post is by Anand Bhaskar and Yun Song on their paper: “The identifiability of piecewise demographic models from the sample frequency spectrum”. arXived here.

With the advent of high-throughput sequencing technologies, it has been of great interest to use genomic data to understand human demographic history. For example, we now estimate that modern humans migrated out of Africa around 60K-120K years ago [1,2], and that Neandertals may have admixed with modern humans in Europe as recently as 47,000 years ago [3]. Apart from satisfying curiosity about our anthropological history, the inference of demography is important for several scientific reasons. Most importantly, demographic processes influence genetic variation, and understanding the interplay between natural selection, genetic drift, and demography is a key question in population genetics. Also, controlling for demography is important for practical applications. For example, the demography inferred from neutrally evolving genomic regions can serve as a null model when searching for regions under selection. Demographic models could also be used to circumvent the problem of spurious associations in case-control studies induced by population substructure.

A summary of whole haplotypes that is commonly used in population genetic analyses is the sample frequency spectrum (SFS). For a sample of n haplotypes from a panmictic (i.e. without substructure) population, the SFS is an (n-1)-dimensional vector where the i-th entry is the proportion of SNPs with i copies of the mutant allele in the sample. One can talk about a mutant/derived allele because most analyses assume that mutations are rare enough that the observed SNPs are dimorphic. The first few entries of the SFS capture the proportion of rare SNPs in the sample and are especially useful for inferring recent population history. Several recent large sample sequencing studies [4-6] have found that humans have many more putatively neutral rare SNPs compared to predictions from a constant population size model. Using the SFS from their data, these studies all infer demographic models with recent exponential population expansion.
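As a concrete illustration of this definition, the following sketch computes the SFS from a small matrix of haplotypes. The function name, the 0/1 coding (1 for the derived allele), and the toy data are assumptions made for this example, not anything taken from the paper.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Return the unfolded SFS as a list of n-1 proportions.

    haplotypes: list of n equal-length sequences of 0/1, where 1 marks
    the derived (mutant) allele at a site.
    """
    n = len(haplotypes)
    n_sites = len(haplotypes[0])
    # Number of derived copies at each site (column sums).
    derived_counts = [sum(h[s] for h in haplotypes) for s in range(n_sites)]
    # Only segregating sites (1 to n-1 derived copies) count as SNPs.
    counts = Counter(c for c in derived_counts if 0 < c < n)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no segregating sites in the sample")
    # Entry i (1-based) is the proportion of SNPs with i derived copies.
    return [counts.get(i, 0) / total for i in range(1, n)]


# Toy data (hypothetical): 4 haplotypes, 5 sites.
toy = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
sfs = sample_frequency_spectrum(toy)
```

In this toy sample three of the four SNPs are singletons, so the first entry of `sfs` dominates, which is the kind of excess of rare variants that points toward recent population growth.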

However, until fairly recently, it was not known whether the SFS of a sample uniquely determines the underlying demographic model. Could it be possible that two different demographic models produce the exact same expected SFS for all sample sizes? In 2008, Simon Myers, Charles Fefferman, and Nick Patterson came up with an elegant mathematical argument [7] to show that there are infinitely many population size histories that generate the same expected SFS for all sample sizes. They even provided an explicit example of a population size history which produced the same expected SFS as a constant population size model. However, their example history had increasingly rapid oscillations in the population size in the recent past, something that we might not expect to find in real biological populations. After all, even though we commonly use continuous-time models of evolution like coalescent theory and diffusion processes, biological populations evolve in discrete events of birth and death.

Our research group has been working on demographic inference from the SFS and from full sequence data for the last several years, and so it was natural for us to ask whether the class of population size histories that are commonly inferred using statistical algorithms might also suffer from this non-identifiability problem. Most statistical methods infer piecewise population size histories, where the pieces come from some biologically-motivated family of functions. In particular, piecewise constant and piecewise exponential models commonly appear in the literature. And if one can indeed uniquely identify piecewise demographic models from the SFS, what sample sizes are needed to do so?

In our paper, we address this question by proving that if the underlying population size function is piecewise with at most K pieces, then the expected SFS of a random sample of size n uniquely determines the demography as long as n is larger than some function of K that depends on the type of pieces of the population size function. For example, if the underlying demographic model was piecewise constant with at most K pieces (i.e. described by at most 2K – 1 parameters), then the expected SFS of a sample of size 2K uniquely determines the demographic model. In other words, no two piecewise constant population size functions with at most K pieces can generate the same expected SFS for a sample of size 2K or larger. For piecewise exponential demographic models with at most K pieces, a sample size of 4K – 1 is sufficient to uniquely determine the demographic model. When one doesn’t know which allele is ancestral and which is derived (for example, if outgroup information is lacking at the relevant SNPs), demographic analysis can still be carried out using the SFS by “folding” it. The folded SFS has floor(n/2) entries, where the i-th entry is the proportion of SNPs with i copies of the minor allele (which might be an ancestral or derived allele). Since the folded SFS has only roughly half the dimension of the full SFS, one might expect to require twice as many samples to uniquely determine the demographic model from the folded SFS compared to the full SFS. We formally prove in our paper that this intuition is indeed correct.
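The folding operation and the two sample-size bounds quoted above can be sketched as follows. The function names are hypothetical, and the bounds encoded are only the two stated in this post for the unfolded SFS; the exact folded-SFS bounds are in the paper and are not reproduced here.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS (n-1 entries) into floor(n/2) entries.

    Entry i of the folded SFS pools SNPs with i and n-i derived copies,
    since either class has i copies of the minor allele.
    """
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        value = sfs[i - 1]
        if j != i:  # avoid double counting the middle class when n is even
            value += sfs[j - 1]
        folded.append(value)
    return folded


def sufficient_sample_size(num_pieces, piece_type):
    """Sample size stated in the post as sufficient for identifiability
    from the unfolded expected SFS."""
    if piece_type == "constant":
        return 2 * num_pieces
    if piece_type == "exponential":
        return 4 * num_pieces - 1
    raise ValueError("piece_type must be 'constant' or 'exponential'")


folded = fold_sfs([0.75, 0.0, 0.25])  # unfolded SFS for n = 4
n_constant = sufficient_sample_size(3, "constant")
n_exponential = sufficient_sample_size(3, "exponential")
```

For example, a three-piece constant history needs a sample of 6 haplotypes by the first bound, while a three-piece exponential history needs 11.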

It is important to stress that this identifiability result is statistical rather than algorithmic in that one would need to have perfect information about the expected SFS of a random sample in order to uniquely determine the underlying piecewise demography. In practice, one can get good estimates of the expected SFS by considering a large number of SNPs in the inference procedure, and by considering SNPs that are farther apart along the chromosomes so that the coalescent trees for the sample at different SNPs will be roughly independent of each other. More work is certainly needed to understand how much genomic data (measured both in terms of the number of SNPs and the sample size) would be needed in practice to robustly infer realistic demographic models.

Works cited:

[1] Li, H. and Durbin, R. (2011) Inference of human population history from individual whole-genome sequences. Nature 475, 493–496.

[2] Scally, A. and Durbin, R. (2012). Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics, 13(10), 745-753.

[3] Sankararaman, S., Patterson, N., Li, H., Pääbo, S., and Reich, D. (2012) The date of interbreeding between Neandertals and modern humans. PLoS Genetics 8, e1002947.

[4] Nelson, Matthew R., et al. (2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100–104.

[5] Tennessen, Jacob A., et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69.

[6] Fu, Wenqing, et al. (2012) Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220.

[7] Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.

Author Post: The Population Genetic Signature of Polygenic Local Adaptation

This guest post is by Jeremy Berg [@JeremyJBerg] and Graham Coop [@Graham_coop] on their paper The Population Genetic Signature of Polygenic Local Adaptation arXived here

The field of population genetics has devoted a lot of time to identifying signals of adaptation. These tests are usually predicated on the fact that local adaptation can drive large allele frequency changes between populations. However, we’ve known for almost a century that many traits are highly polygenic, so that adaptation can occur through subtle shifts in allele frequencies at many loci. Until now we’ve been unable to detect such signals, but genome-wide association studies (GWAS) now give us a way of potentially learning about selection on quantitative traits from population genetic data. In this paper we develop a set of approaches to do this in a robust population genetic framework.

GWAS usually assume a simple additive model, i.e. no epistasis/dominance, to test for and estimate effect sizes for a genome-wide set of loci. To test whether local adaptation has shaped the genetic basis of the trait, we do the perhaps boneheaded thing of taking the GWAS results at face value. For each population we simply sum up the product of the frequency at each GWAS SNP and the effect size of that SNP. This gives us an estimate of the mean additive genetic value for the phenotype in each population. This is not the mean phenotype of the population as it ignores the fact that we don’t know all the variants affecting our trait; environmental change across populations, gene by environment interactions, and changes in allele frequencies that have altered the dominance and epistatic relationships between alleles (i.e. all that good stuff that makes life interesting). However, these additive genetic values do have the very useful property that they are simple linear functions of the allele frequencies, which means that we can construct a simple and robust model of genetic drift causing these phenotypes to diverge across populations.
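The sum described above is simple enough to write down directly. This is a minimal sketch following the wording of the post; the function name, population labels, frequencies, and effect sizes are invented for illustration, and any constant scaling the paper may apply is omitted.

```python
def genetic_values(frequencies, effects):
    """Estimated mean additive genetic value per population.

    frequencies: dict mapping population name to a list of allele
                 frequencies, one per GWAS SNP.
    effects:     list of GWAS effect sizes, in the same SNP order.
    """
    return {
        pop: sum(p * a for p, a in zip(freqs, effects))
        for pop, freqs in frequencies.items()
    }


# Hypothetical toy inputs: three GWAS SNPs, two populations.
effects = [0.5, -0.2, 0.1]
frequencies = {
    "pop_A": [0.2, 0.5, 0.9],
    "pop_B": [0.6, 0.5, 0.1],
}
values = genetic_values(frequencies, effects)
```

Because each value is a linear function of the allele frequencies, any model of how drift moves frequencies carries over directly to the genetic values.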

[Figure A: estimated genetic values for height across populations]

In Figure A we show our estimated genetic values using the human height GWAS of Lango Allen et al. (2010). As you can see, populations show deviations around the global mean genetic value, and populations from the same geographic regions covary somewhat in the deviation they take, reflecting the fact that allele frequencies at each GWAS locus tend to covary in their shared genetic drift due to population history and migration. For example, in Figure B we show allele frequencies at one of the GWAS height loci.

[Figure B: allele frequencies at one of the GWAS height loci]

We can approximately model the allele frequencies at a single locus by assuming that they are multivariate normally distributed around the global mean. The covariance matrix of this distribution is given by a matrix closely related to the kinship matrix of our populations, which can be calculated from a genome-wide sample of putatively neutral loci. As our vector of phenotypic genetic values across populations is simply a weighted sum of the individual allele frequencies, our vector of genetic values also follows a multivariate normal distribution. Given that we are summing up a lot of loci, even if the multivariate normal model is a poor approximation to drift at one locus, the central limit theorem suggests that it should still be a good fit to the distribution of the genetic values.

This simple neutral model framework, based on multivariate normal distributions, gives us a strong framework to develop tests of selection. Our most basic test is a test for the over-dispersion of the variance of genetic values (i.e. too great an among population variance, once population structure has been accounted for). We also develop a test for environmental correlations and a way to identify outlier populations and regions to further understand the signal of local adaptation.
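One way to picture an over-dispersion test under a multivariate normal null is a quadratic form that weights the deviations of the genetic values by the inverse of their neutral covariance matrix. The sketch below is a generic statistic of that kind, not necessarily the exact statistic used in the paper; the function names, the assumption of a known global mean, and the toy covariance matrix are all made up for illustration.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def dispersion_statistic(values, covariance, global_mean):
    """Quadratic form d' C^{-1} d, where d holds deviations from the mean."""
    deviations = [v - global_mean for v in values]
    weighted = solve(covariance, deviations)
    return sum(d * w for d, w in zip(deviations, weighted))


# Hypothetical neutral covariance for two populations sharing some drift.
shared_drift = [[1.0, 0.5], [0.5, 1.0]]
same_direction = dispersion_statistic([1.0, 1.0], shared_drift, 0.0)
opposite_direction = dispersion_statistic([1.0, -1.0], shared_drift, 0.0)
```

With this covariance, two related populations deviating in the same direction give a smaller statistic than the same populations deviating in opposite directions, because shared drift already explains part of the first pattern. If the values really were multivariate normal with this covariance and a known mean, the statistic would follow a chi-square distribution with one degree of freedom per population, so unusually large values suggest over-dispersion.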

We apply our tests to six different GWAS datasets using the HGDP as our set of populations. Our tests reveal widespread evidence of selection shaping polygenic traits across populations, although many of the signals are quite subtle. Somewhat surprisingly, we find little evidence for selection on the loci involved in Type 2 diabetes, somewhat of a poster-child for adaptation shaping the genetic basis of a disease thanks to the thrifty gene hypothesis.

We think our approach is a promising way forward to look for selection on the genetic basis of quantitative traits as viewed by GWAS. However, it also highlights some concerns. In developing our tests we found that we had developed a set of methods that already have equivalents in the quantitative trait community, in particular QST, a phenotypic analogue of FST (and its extensions by a number of authors). This raises the question of whether in systems where common garden experiments are possible there is a need to do GWAS if we are only interested in how local adaptation has shaped traits, or if QST style approaches are the best that one can do. We do think that there is much more that could be learnt by our style of approach, but it should also give researchers pause to consider why they want to “find the genes” for local adaptation.

We’ve already gotten some very helpful comments via Haldane’s sieve. We’d love more comments, particularly about points of confusion that could be clarified, other datasets that might be good to apply this to, or other applications we could develop.

Some preprint comment streams at Haldane’s sieve and related sites

Given our one-year anniversary, I thought I’d collect together a few examples of preprint commenting at work. These have taken place in the comment boxes of Haldane’s sieve and/or across a range of other blogs.

These are somewhat isolated cases, as the majority of preprints pass without any comment. It would be great to see more of this level of commentary. Remember comments can be simple inquiries about methods/figures/references, etc., and don’t have to be super involved. In general we’ve found authors to be very responsive to comments, perhaps in part because they can take place as a more informal conversation without the pressures of publication concerns.

Genome sequencing highlights genes under selection and the dynamic early history of dogs
Reconstructing the population genetic history of the Caribbean
The population genetic signal of polygenic adaptation
The geography of recent genetic ancestry across Europe
Loss and Recovery of Genetic Diversity in Adapting Populations of HIV
Sailfish RNA-seq quantification
Genome-wide inference of ancestral recombination graphs
The date of interbreeding between Neandertals and modern humans
Ancient west Eurasian ancestry in southern and eastern Africa
The identifiability of piecewise demographic models from the sample frequency spectrum

Our paper: Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales

This guest post is by Mike Harvey on his paper (with coauthors), Tilston-Smith and Harvey et al., “Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales”, arXived here.

This paper is a result of work on developing markers and methods for generating genomic data for species without available genomes (I’ll refer to these as “non-model” species). The work is a collaborative effort between some researchers who are really on top of developments in sequencing technologies (and are also a blast to work with) – Travis Glenn at UGA, Brant Faircloth at UCLA, and John McCormack at Occidental – and our lab here at LSU. We think the marker sets we have been developing (ultraconserved elements) and more generally the method we are using (sequence capture) have the potential to make the genomic revolution more accessible to researchers studying the population genetics of diverse non-model organisms.

Background

Although genomic resources for humans and other model systems are increasing rapidly, the bottleneck for those of us working on the population genetics of non-model systems is simply our ability to generate data. Many of us are still struggling to take advantage of the increase in sequencing capacity provided by next-generation platforms. For many projects, sequencing entire genomes is neither feasible (yet) nor necessary, so researchers have focused on finding reasonable methods of subsampling the genome in a repeatable way such that the same subset of genomic regions can be sampled for many individuals. We often have to do this, however, with little to no prior genomic information from our particular study organism.

Most methods for subsampling the genome thus far have involved “random” sampling from across the genome by using restriction enzymes to digest genomic DNA and then sequencing fragments that fall in a particular part of the fragment size distribution. Drawbacks of these methods include (1) the fact that the researcher has no prior knowledge of where in the genome sequences will be coming from or what function the genomic region might serve, and (2) that the repeatability of the method, specifically the ability to generate data from the same loci across samples, depends on the conservation of the enzyme cut sites, and these often are not conserved at deeper timescales. Sequencing transcriptomes is also a popular method for subsampling the genome, but this simply isn’t an option for those of us working with museum specimens and tissues or old blood samples in which RNA hasn’t been properly preserved.
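To make drawback (2) concrete, here is a minimal sketch of an in-silico digest followed by size selection. It is only an illustration, not any lab's actual pipeline: the toy sequences, the function names `digest` and `size_select`, and the 20 to 30 bp size window are all made up for the example, and the enzyme is assumed to behave like EcoRI (recognition site GAATTC, cutting after the first base).

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Cut seq at every occurrence of the recognition site.

    cut_offset is the position within the site where the cut falls
    (1 mimics EcoRI, which cuts between G and AATTC).
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])  # whatever remains after the last cut
    return fragments


def size_select(fragments, min_len, max_len):
    """Keep only fragments inside the chosen size window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Hypothetical toy "genomes": sample B carries one substitution in the second cut site.
sample_a = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 10
sample_b = "A" * 10 + "GAATTC" + "C" * 20 + "GAGTTC" + "T" * 10

selected_a = size_select(digest(sample_a), 20, 30)
selected_b = size_select(digest(sample_b), 20, 30)

print(len(selected_a), "locus recovered in sample A")
print(len(selected_b), "loci recovered in sample B")
```

With the second cut site intact, sample A yields one fragment in the size window. The single substitution in sample B merges that fragment with its neighbor, pushing it out of the window, so the locus is not recovered in that sample.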

Sequence capture, a molecular technique involving genome enrichment by hybridization to RNA or DNA ‘probes’, is a flexible alternative that allows researchers to subsample whatever portions of the genome they like. The drawback of sequence capture, however, is that you need enough prior genomic information to design the synthetic oligos used as probes. This is not a problem for e.g. exome capture in humans in which the targeted genes are well characterized, but it is a challenge for non-model systems without sequenced genomes.

This is where ultraconserved elements come in. Ultraconserved elements (UCEs) are short genomic regions that are highly conserved across widely divergent species (e.g. all amniotes). Because they are so conserved, UCE sequences can be easily used as probes for sequence capture in diverse non-model organisms, even if the organisms themselves have little or no genomic information available. If you are not working on amniotes or fishes (for which we have already designed probe arrays), all you may need to find UCEs is a couple of genomes from species that diverged from your study organism within the last few hundred million years. Of course, this general approach is not specific to loci that fall into our narrow definition of UCEs; it is limited only by the availability of genomic information that can be used to design probes. As additional genomic information becomes available for a given group, additional loci, including protein-coding regions, can easily be added to capture arrays.

Our question for this paper – does sequence capture of UCEs work for population genetics?

We have previously used sequence capture of UCEs to address deeper-level phylogenetic questions. We’ve found that at deep timescales, the flanking regions of UCEs contain a large amount of informative variation. The goals of the present study were (1) to see if sufficient information existed in UCEs to enable studies at shallow evolutionary (read "population genetic or phylogeographic") timescales, and (2) to explore some of the analyses that might be possible with population genetic data from non-model organisms. For our study, we sampled two individuals from each of four populations in five different species of non-model Neotropical birds. We conducted sequence capture using probes designed from 2,386 UCEs shared by amniotes and we sequenced the resulting libraries using an Illumina HiSeq. We then examined the number of loci recovered and the amount of informative variation in those loci for each of the five species. We also conducted some standard analyses – species tree estimation, demographic modeling, and species delimitation – for each species.

We were able to recover between 776 and 1,516 UCE regions across the five species, and these contained sufficient variation to conduct population genetic analyses in each species. Species tree estimates, demographic parameters, and species limits mostly corresponded with prior estimates based on morphology or mitochondrial DNA sequences. Confidence intervals around demographic parameter estimates from the UCEs were much narrower than those around estimates from mitochondrial DNA obtained using similar methods, supporting the idea that larger datasets will allow more precise estimates of species histories.
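For a rough sense of scale, the numbers above can be turned into a sample count and a recovery fraction. This is back-of-the-envelope arithmetic rather than an analysis from the paper: it assumes the 2,386 UCEs used for probe design are the right denominator, and the variable and function names are invented for the example. Only the range endpoints are given here, so per-species values are not shown.

```python
# Sampling design described in the post.
INDIVIDUALS_PER_POPULATION = 2
POPULATIONS_PER_SPECIES = 4
SPECIES = 5
TOTAL_INDIVIDUALS = INDIVIDUALS_PER_POPULATION * POPULATIONS_PER_SPECIES * SPECIES

# Probes were designed from this many UCEs (assumed denominator for recovery).
TARGETED_UCES = 2386

# Lowest and highest numbers of UCE regions recovered across the five species.
RECOVERED_MIN = 776
RECOVERED_MAX = 1516


def recovery_fraction(recovered, targeted=TARGETED_UCES):
    """Fraction of targeted UCE loci that were recovered."""
    return recovered / targeted


low = recovery_fraction(RECOVERED_MIN)
high = recovery_fraction(RECOVERED_MAX)

print(f"Individuals sequenced: {TOTAL_INDIVIDUALS}")
print(f"Recovery fraction: {low:.1%} to {high:.1%}")
```

Under that assumption, roughly a third to just under two thirds of the targeted loci were recovered, depending on the species.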

Some conclusions

Pending faster and cheaper methods for sequencing and de novo assembling whole genomes, methods for sampling a subset of the genome will be a practical necessity for population genetic studies in non-model organisms. Sequence capture is both intuitively appealing and practical in that it allows researchers to select a priori the regions of the genome in which they are interested. Ultraconserved elements pair nicely with sequence capture because they allow us to collect data from the same loci shared across a very broad spectrum of organisms (e.g. all amniotes or all fishes). As genomic data for diverse groups accumulate, UCE capture probes will likely be augmented with additional genomic regions. In the meantime, sequence capture of UCEs has a lot to offer for population genetic studies of non-model organisms. See our paper for more information, or visit ultraconserved.org, where our probe sets, protocols, code, and other information are available under open-source licenses (BSD-style and Creative Commons) for anyone to use.