Inferring selective constraint and recent gain and loss of function from population genomic data

Daniel R. Schrider, Andrew D. Kern
(Submitted on 10 Sep 2013)

The comparative genomics revolution of the past decade has enabled the discovery of functional elements in the human genome via sequence comparison. However, an important class of elements, those specific to humans, is entirely missed by searching for sequence conservation across species. Here we present an analysis based on variation data among human genomes that utilizes a supervised machine learning approach for the identification of human-specific function in the genome. Using only allele frequency information from the complete low-coverage 1000 Genomes Project dataset in conjunction with a support vector machine trained on known functional and non-functional portions of the genome, we are able to identify functional portions of the genome with high accuracy (~88%). Our method identifies previously known human-specific gains or losses of function and uncovers many novel candidates. Candidate targets for gain of function along the human lineage include a novel isoform of a killer cell immunoglobulin-like receptor, while loss of function candidates include many members of a gene cluster involved in shaping the complexity of synaptic connections in the brain. Finally, we show that the majority of the genome is currently unconstrained by natural selection, in agreement with what has been estimated from phylogenetic methods but in sharp contrast to estimates based on transcriptomics or other high-throughput functional methods.

The role of mutation rate variation and genetic diversity in the architecture of human disease

Ying Chen Eyre-Walker, Adam Eyre-Walker
(Submitted on 29 Aug 2013)

We have investigated the role that the mutation rate and the structure of genetic variation at a locus play in determining whether a gene is involved in disease. We predict that the mutation rate and genetic diversity should be higher in genes associated with disease, unless all genes that could cause disease have already been identified. Consistent with our predictions, we find that genes associated with Mendelian and complex disease are substantially longer than non-disease genes. However, we find that both Mendelian and complex disease genes are found in regions of the genome with relatively low mutation rates, as inferred from intron divergence between humans and chimpanzees. Complex disease genes are predicted to have higher rates of non-synonymous mutation than non-disease genes, but the opposite pattern is found in Mendelian disease genes. Finally, we find that disease genes are in regions of significantly elevated genetic diversity, even when variation in the rate of mutation is controlled for. However, the effect is small. Our results suggest that variation in the genic mutation rate and the genetic architecture of the locus play a minor role in determining whether a gene is associated with disease.

Our paper: Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales

This guest post is by Mike Harvey on his paper (along with coauthors), Tilston-Smith and Harvey et al., Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales, arXived here.

This paper is a result of work on developing markers and methods for generating genomic data for species without available genomes (I’ll refer to these as “non-model” species). The work is a collaborative effort between some researchers who are really on top of developments in sequencing technologies (and are also a blast to work with) – Travis Glenn at UGA, Brant Faircloth at UCLA, and John McCormack at Occidental – and our lab here at LSU. We think the marker sets we have been developing (ultraconserved elements) and more generally the method we are using (sequence capture) have the potential to make the genomic revolution more accessible to researchers studying the population genetics of diverse non-model organisms.

Background

Although genomic resources for humans and other model systems are increasing rapidly, the bottleneck for those of us working on the population genetics of non-model systems is simply our ability to generate data. Many of us are still struggling to take advantage of the increase in sequencing capacity provided by next-generation platforms. For many projects, sequencing entire genomes is neither feasible (yet) nor necessary, so researchers have focused on finding reasonable methods of subsampling the genome in a repeatable way such that the same subset of genomic regions can be sampled for many individuals. We often have to do this, however, with little to no prior genomic information from our particular study organism.

Most methods for subsampling the genome thus far have involved “random” sampling from across the genome by using restriction enzymes to digest genomic DNA and then sequencing fragments that fall in a particular part of the fragment size distribution. Drawbacks of these methods include (1) the fact that the researcher has no prior knowledge of where in the genome sequences will be coming from or what function the genomic region might serve, and (2) that the repeatability of the method, specifically the ability to generate data from the same loci across samples, depends on the conservation of the enzyme cut sites, and these often are not conserved at deeper timescales. Sequencing transcriptomes is also a popular method for subsampling the genome, but this simply isn’t an option for those of us working with museum specimens and tissues or old blood samples in which RNA hasn’t been properly preserved.
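To make drawback (2) concrete, here is a minimal sketch of an in silico digest followed by size selection. It is only an illustration: the toy sequences, the size window, and the function names `digest` and `size_select` are assumptions made for this example, and the recognition site used is that of EcoRI (GAATTC, cut after the first G).

```python
def digest(seq: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut seq at every occurrence of site, cut_offset bases into the site."""
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])  # trailing fragment after the last cut
    return fragments


def size_select(fragments: list[str], lo: int, hi: int) -> list[str]:
    """Keep only fragments whose length falls inside the selected window."""
    return [f for f in fragments if lo <= len(f) <= hi]


# Toy "genomes" (hypothetical): sample B carries a point mutation in the
# second cut site, so the enzyme no longer cuts there.
sample_a = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TTTT"
sample_b = "AAAA" + "GAATTC" + "CCCCCCCC" + "GACTTC" + "TTTT"

loci_a = size_select(digest(sample_a), lo=10, hi=20)
loci_b = size_select(digest(sample_b), lo=10, hi=20)
```

In this toy case the fragment recovered from sample A falls outside the size window in sample B, so the locus is simply missing from B's data. The same kind of dropout becomes more common as cut sites diverge between more distantly related samples.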

Sequence capture, a molecular technique involving genome enrichment by hybridization to RNA or DNA ‘probes’, is a flexible alternative that allows researchers to subsample whatever portions of the genome they like. The drawback of sequence capture, however, is that you need enough prior genomic information to design the synthetic oligos used as probes. This is not a problem for e.g. exome capture in humans in which the targeted genes are well characterized, but it is a challenge for non-model systems without sequenced genomes.

This is where ultraconserved elements come in. Ultraconserved elements (UCEs) are short genomic regions that are highly conserved across widely divergent species (e.g. all amniotes). Because they are so conserved, UCE sequences can be easily used as probes for sequence capture in diverse non-model organisms, even if the organisms themselves have little or no genomic information available. If you are not working on amniotes or fishes (for which we have already designed probe arrays), all you may need to find UCEs is a couple of genomes from species that diverged from your study organism within the last few hundred million years. Of course, this general approach is not specific to loci that fall into our narrow definition of UCEs, but is limited merely by the availability of genomic information that can be used to design probes. As additional genomic information becomes available from a given group, additional loci, including protein-coding regions, can easily be added to capture arrays.

Our question for this paper – does sequence capture of UCEs work for population genetics?

We have previously used sequence capture of UCEs to understand deeper-level phylogenetic questions. We’ve found that at deep timescales, the flanking regions of UCEs contain a large amount of informative variation. The goals of the present study were (1) to see if sufficient information existed in UCEs to enable studies at shallow evolutionary (read "population genetic or phylogeographic") timescales, and (2) to explore some of the analyses that might be possible with population genetic data from non-model organisms. For our study, we sampled two individuals from each of four populations in five different species of non-model Neotropical birds. We conducted sequence capture using probes designed from 2,386 UCEs shared by amniotes and we sequenced the resulting libraries using an Illumina HiSeq. We then examined the number of loci recovered and the amount of informative variation in those loci for each of the five species. We also conducted some standard analyses – species tree estimation, demographic modeling, and species delimitation – for each species.

We were able to recover between 776 and 1,516 UCE regions across the five species, and these contained sufficient variation to conduct population genetic analyses in each species. Species tree estimates, demographic parameters, and species limits mostly corresponded with prior estimates based on morphology or mitochondrial DNA sequences. Confidence intervals around demographic parameter estimates from the UCEs were much narrower than estimates from mitochondrial DNA using similar methods, supporting the idea that larger datasets will allow more precise estimates of species histories.

Some conclusions

Pending faster and cheaper methods for sequencing and de novo assembling whole genomes, methods for sampling a subset of the genome will be a practical necessity for population genetic studies in non-model organisms. Sequence capture is both intuitively appealing and practical in that it allows researchers to select a priori the regions of the genome in which they are interested. Ultraconserved elements pair nicely with sequence capture because they allow us to collect data from the same loci shared across a very broad spectrum of organisms (e.g. all amniotes or all fishes). As genomic data for diverse groups increases, UCE capture probes will certainly be augmented with additional genomic regions. In the meantime, sequence capture of UCEs has a lot to offer for population genetic studies of non-model organisms. See our paper for more information, or visit ultraconserved.org, where our probe sets, protocols, code, and other information are available under open-source licenses (BSD-style and Creative Commons) for anyone to use.

A network approach to analyzing highly recombinant malaria parasite genes

Daniel B. Larremore, Aaron Clauset, Caroline O. Buckee
(Submitted on 23 Aug 2013)

The var genes of the human malaria parasite Plasmodium falciparum present a challenge to population geneticists due to their extreme diversity, which is generated by high rates of recombination. These genes encode a primary antigen protein called PfEMP1, which is expressed on the surface of infected red blood cells and elicits protective immune responses. Var gene sequences are characterized by pronounced mosaicism, precluding the use of traditional phylogenetic tools that require bifurcating tree-like evolutionary relationships. We present a new method that identifies highly variable regions (HVRs), and then maps each HVR to a complex network in which each sequence is a node and two nodes are linked if they share an exact match of significant length. Here, networks of var genes that recombine freely are expected to have a uniformly random structure, but constraints on recombination will produce network communities that we identify using a stochastic block model. We validate this method on synthetic data, showing that it correctly recovers populations of constrained recombination, before applying it to the Duffy Binding Like-α (DBLα) domain of var genes. We find nine HVRs whose network communities map in distinctive ways to known DBLα classifications and clinical phenotypes. We show that the recombinational constraints of some HVRs are correlated, while others are independent. These findings suggest that this micromodular structuring facilitates independent evolutionary trajectories of neighboring mosaic regions, allowing the parasite to retain protein function while generating enormous sequence diversity. Our approach therefore offers a rigorous method for analyzing evolutionary constraints in var genes, and is also flexible enough to be easily applied more generally to any highly recombinant sequences.
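The network construction described in the abstract can be sketched roughly as follows. This is a toy illustration rather than the authors' code: the fixed threshold `min_len` stands in for their "significant length" criterion (which the abstract does not specify), the sequences are made up, and the community detection step (the stochastic block model) is not shown.

```python
from itertools import combinations


def shares_block(a: str, b: str, min_len: int) -> bool:
    """True if a and b share an exact substring of at least min_len characters.

    Any shared substring longer than min_len contains a shared substring of
    exactly min_len, so comparing min_len-mers is sufficient.
    """
    kmers = {a[i:i + min_len] for i in range(len(a) - min_len + 1)}
    return any(b[j:j + min_len] in kmers for j in range(len(b) - min_len + 1))


def build_network(seqs: dict[str, str], min_len: int) -> set[frozenset[str]]:
    """Undirected edges: one node per sequence, linked if they share a block."""
    return {
        frozenset((u, v))
        for u, v in combinations(seqs, 2)
        if shares_block(seqs[u], seqs[v], min_len)
    }


# Hypothetical amino acid fragments from one highly variable region.
toy_hvr = {
    "s1": "ACDEFGHIK",
    "s2": "LMDEFGHPQ",  # shares the block DEFGH with s1
    "s3": "RSTVWYACD",  # shares only ACD with s1, which is too short
}
edges = build_network(toy_hvr, min_len=5)
```

With `min_len=5` only s1 and s2 end up linked. Lowering the threshold to 3 would also connect s1 and s3, which is why the choice of what counts as a significant match matters for the resulting community structure.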

Genome wide signals of pervasive positive selection in human evolution

David Enard, Philipp W. Messer, Dmitri Petrov
(Submitted on 22 Aug 2013)

The role of positive selection in human evolution remains controversial. On the one hand, scans for positive selection have identified hundreds of candidate loci and the genome-wide patterns of polymorphism show signatures consistent with frequent positive selection. On the other hand, recent studies have argued that many of the candidate loci are false positives and that most apparent genome-wide signatures of adaptation are in fact due to reduction of neutral diversity by linked recurrent deleterious mutations, known as background selection. Here we analyze human polymorphism data from the 1000 Genomes Project (Abecasis et al. 2012) and detect signatures of pervasive positive selection once we correct for the effects of background selection. We show that levels of neutral polymorphism are lower near amino acid substitutions, with the strongest reduction observed specifically near functionally consequential amino acid substitutions. Furthermore, amino acid substitutions are associated with signatures of recent adaptation that should not be generated by background selection, such as the presence of unusually long and frequent haplotypes and specific distortions in the site frequency spectrum. We use forward simulations to show that the observed signatures require a high rate of strongly adaptive substitutions in the vicinity of the amino acid changes. We further demonstrate that the observed signatures of positive selection correlate more strongly with the presence of regulatory sequences, as predicted by ENCODE (Gerstein et al. 2012), than the positions of amino acid substitutions. Our results establish that adaptation was frequent in human evolution and provide support for the hypothesis of King and Wilson (King and Wilson 1975) that adaptive divergence is primarily driven by regulatory changes.

The standing pool of genomic structural variation in a natural population of Mimulus guttatus

Lex E. Flagel, John H. Willis, Todd J. Vision
(Submitted on 19 Aug 2013)

Major unresolved questions in evolutionary genetics include determining the contributions of different mutational sources to the total pool of genetic variation in a species, and understanding how these different forms of genetic variation interact with natural selection. Recent work has shown that structural variants (insertions, deletions, inversions and transpositions) are a major source of genetic variation, often out-numbering single nucleotide variants in terms of total bases affected. Despite the near ubiquity of structural variants, major questions about their interaction with natural selection remain. For example, how does the allele frequency spectrum of structural variants differ when compared to single nucleotide variants? How often do structural variants affect genes, and what are the consequences? To begin to address these questions, we have systematically identified and characterized a large set of submicroscopic insertion and deletion (indel) variants (between 1 kb and 200 kb in length) among ten individuals from a single natural population of the plant species Mimulus guttatus. After extensive computational filtering, we focused on a set of 4,142 high-confidence indels that showed an experimental validation rate of 73%. All but one of these indels were < 200 kb. While the largest were generally at lower frequencies in the population, a surprising number of large indels are at intermediate frequencies. While indels overlapping with genes were much rarer than expected by chance, nearly 600 genes were affected by an indel. NBS-LRR defense response genes were the most enriched among the gene families affected. Most indels associated with genes were rare and appeared to be under purifying selection, though we do find four high-frequency derived insertion alleles that show signatures of recent positive selection.

Our Paper: The genomic impacts of drift and selection for hybrid performance in maize

This guest post is by Jeff Ross-Ibarra (@jrossibarra) on his paper (along with coauthors), Gerke et al., The genomic impacts of drift and selection for hybrid performance in maize, arXived here.

Iowa recurrent selection as an evolutionary experiment in hybrid vigor

Maize is an outcrossing species, and was cultivated as such up through the first quarter of the 20th century. Starting in the 1920’s, however, breeders began to abandon open-pollinated maize in favor of hybrid varieties resulting from crosses between inbred lines. Hybrids are often more robust and higher yielding than either inbred parent, a phenomenon known as hybrid vigor or heterosis.

Breeding for hybrid varieties – and presumably increased heterosis – has had a profound impact on diversity across the maize genome. There are at least two important differences from previous breeding efforts: first, breeders select on and work with inbred maize lines rather than mass selection on open-pollinated populations. This results in much smaller effective population sizes, and has implications for recessive traits and deleterious alleles that could be masked in heterozygotes. The second difference is that instead of selecting the best plants per se, breeders now select for inbreds that make high-yielding hybrids. This means a breeder might favor an inbred that itself is not high-yielding if it consistently makes good hybrids when paired with other inbreds.

We set out to study the effects of these breeding strategies on patterns of diversity across the maize genome. We took advantage of one of the longest-running ongoing experiments on selection for hybrid performance, started in the late 1940’s by the US Dept. of Agriculture’s Agricultural Research Service. Two small (12 and 16) sets of founder inbred lines were randomly mated to create two base populations: the Iowa Stiff Stalk Synthetic (BSSS) and the Iowa Corn Borer Synthetic No. 1 (BSCB1). In addition to its role as an important selection experiment, multiple maize breeding lines have come out of the BSSS population, including the line used for the maize reference genome.

Diversity in the BSSS and BSCB1 is patterned predominantly by drift

Over the course of the experiment we studied, the two base populations underwent 16 cycles of recurrent selection, in which lines from each population were crossed to each other and evaluated for both hybrid and per-se performance. Selected lines were intermated within each population to form the next generation. To investigate the genomic impact of this selection scheme, we genotyped progenitor lines and over 600 individuals from multiple selection cycles using the Illumina MaizeSNP50 SNP array. And because we know the exact crossing and selection scheme used, we can compare the observed changes in genome-wide diversity with strictly neutral crossing simulations using the genotypes of the starting populations.

Both populations steadily lost genetic diversity as they became more diverged from one another, but diversity and divergence between BSSS and BSCB1 can be largely reproduced by simulation without any selection. In fact, principal component analysis clearly reveals changes in population structure and diversity that mirror alterations in rates of inbreeding and effective population size that occurred over the course of the experiment. This indicates the structure is not necessarily related to the phenotypic improvement, but might be a by-product of the breeding scheme. Similar population structure is reflected in a recent broad comparison of US maize germplasm and suggests that much of the diversity and structure of modern maize germplasm has been shaped by genetic drift.
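For readers who want a feel for how much diversity drift alone can remove over 16 cycles, here is a minimal sketch. It is a simplified stand-in and not the simulation used in the paper, which followed the actual crossing scheme and founder genotypes: it uses an idealized Wright-Fisher population, and the effective size `N_E = 20` and the starting values are hypothetical numbers chosen only for illustration.

```python
import random


def expected_heterozygosity(h0: float, n_e: float, generations: int) -> float:
    """Expected heterozygosity under drift alone: H_t = H_0 * (1 - 1/(2*N_e))**t."""
    return h0 * (1.0 - 1.0 / (2.0 * n_e)) ** generations


def simulate_drift(p0: float, n_e: int, generations: int,
                   rng: random.Random) -> list[float]:
    """Wright-Fisher trajectory of one allele frequency with no selection."""
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Binomial sampling of 2*N_e gene copies from the previous generation.
        copies = sum(rng.random() < p for _ in range(2 * n_e))
        p = copies / (2 * n_e)
        freqs.append(p)
    return freqs


N_E = 20     # hypothetical effective population size (assumption)
CYCLES = 16  # number of selection cycles mentioned in the post

h_final = expected_heterozygosity(0.5, N_E, CYCLES)
trajectory = simulate_drift(0.5, N_E, CYCLES, random.Random(1))
```

Under these assumed numbers, roughly a third of the starting heterozygosity is expected to be lost without any selection at all, which is the kind of baseline the observed data have to be compared against before invoking selection.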

Selection efficacy and fixation at regions of low recombination

But genetic drift can’t be the whole story in these populations. Numerous experiments have shown that the later populations are superior to their progenitors in terms of hybrid yield and traits important to increased planting density (more plants per acre = more yield). These same trends are observed across North American maize as a whole, suggesting common themes in how maize has improved over time. Selection is difficult to detect in the face of strong genetic drift, especially when the selection has been on traits with complex genetic architectures. However, our simulations do detect regions of low heterozygosity in each population that are longer than expected given their genetic distance.

The most striking pattern of these regions is their lack of overlap between the two populations. In simple cases, classic overdominance models of heterosis predict that at a single locus, two distinct alleles confer heterozygote advantage when combined. In this case, selection should lead to decreased heterozygosity at a locus in both populations as complementary alleles rise in frequency. We don’t observe this, and neither did a different study that used other populations.

A popular alternative to the overdominance model is the dominance model, which predicts that heterosis is caused by the complementation of linked recessive deleterious alleles. In this case, multiple haplotypes in the other population may complement a fixed region if most deleterious alleles in maize are rare. Evidence from numerous studies supports a dominance model of heterosis, including findings of excess residual heterozygosity in low recombination regions of a maize mapping population. In regions of low recombination, heterozygosity (and thus complementation) becomes important due to an inability to efficiently select for new recombinants in these regions, especially with low effective population sizes. And because of low rates of recombination, a small genetic interval in these regions becomes massive in physical space and encompasses the composite effects of many deleterious loci. We observe fixation in these regions in the BSSS and BSCB1 populations. They are short genetically (1-2 centimorgans), but make up very large fractions of the chromosome. We find that in many cases, these regions have been inherited largely intact from the original population founders, indicating that selection for new haplotype combinations in these regions has been ineffective. Large haplotypes in some cases may have fixed early on in the formation of many breeding programs, and the combination of limited exchange between breeding pools and small effective population sizes has provided little opportunity for selective removal of deleterious alleles. Complementation and the inefficiency of selection in these pericentromeric regions, which span a large portion of the physical genome, may thus explain the difference between hybrid and inbred yield and why it has remained fairly constant.

Our paper: Inferring HIV escape rates from multi-locus genotype data

This guest post is by Richard Neher on his paper with Taylor Kessinger and Alan Perelson: Kessinger et al. Inferring HIV escape rates from multi-locus genotype data. arXived here.
This is cross-posted from the Neher lab website.

We have a new preprint on the arXiv (here on Haldane’s sieve). This work is the result of a collaboration between us and Alan Perelson, LANL, and explores methods to estimate parameters of the HIV-immune system interaction from time-resolved sequence data. The focus of this paper is on early infection dominated by a few rapid substitutions that fix because they prevent or reduce recognition of infected cells by the immune system via cytotoxic T-lymphocytes (CTL). CTL escape is one of the fastest instances of evolution I have come across. 4-6 mutations spread within a few weeks. It happens in most HIV infections and is partly predictable based on the HLA genotype of the infected person. These substitutions are so rapid that clonal interference has to be modeled. Our method fits a reduced model of clonal interference to the typically very sparse data and thereby estimates the selection coefficients, aka escape rates.

Why do we want to know these numbers?
The number of viruses in the blood of an infected person peaks 2-3 weeks after infection and thereafter drops by 2-3 orders of magnitude. This drop is partly due to a response by the adaptive immune system. However, it has proved difficult to attribute this drop to specific parts of the immune response. The rates at which different mutations sweep through the population give us information about the pressure exerted by the T-cell clones that target the epitope containing this mutation.

How do we do it?
Early in infection, the viral population is large and selection is strong. In these conditions, recombination is of minor importance since most double/triple… mutants are more efficiently produced by recurrent mutation than by recombination. This implies that mutations accumulate sequentially, always on a background on which all previous mutations are already present. The time at which a novel mutation arises is tightly constrained by the trajectory of the preceding genotype. These constraints regularize the fitting problem to some degree, and the multi-locus fitting is more robust than single-locus fitting.
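For contrast, the simpler single-locus estimate can be sketched in a few lines. This is not the multi-locus method of the paper, only a baseline under the assumption that one escape variant grows logistically, dx/dt = εx(1 − x), so that logit(x) increases linearly in time with slope ε. The function names and the numbers below are made up for illustration.

```python
import math


def logit(x: float) -> float:
    """Log-odds of a frequency strictly between 0 and 1."""
    return math.log(x / (1.0 - x))


def logistic_frequency(x0: float, rate: float, t: float) -> float:
    """Frequency at time t of a variant starting at x0 and growing at `rate`."""
    return 1.0 / (1.0 + (1.0 - x0) / x0 * math.exp(-rate * t))


def escape_rate(t1: float, x1: float, t2: float, x2: float) -> float:
    """Single-locus escape rate from two (time, frequency) samples."""
    return (logit(x2) - logit(x1)) / (t2 - t1)


# Hypothetical example: a variant at 1% on day 0 that grows at 0.3 per day.
x_day20 = logistic_frequency(0.01, 0.3, 20.0)
estimate = escape_rate(0.0, 0.01, 20.0, x_day20)
```

With sparse sampling, a variant is often observed at 0% at one time point and 100% at the next, where the logit is undefined and this estimate breaks down. That is one practical reason why the constraints coming from the other loci help.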

What do we learn about evolution in general?
In addition to the intrinsic interest in the HIV/CTL interaction, CTL escape is an ideal setting to study rapidly evolving populations. This evolution happens in its “natural” habitat and the selective pressure as well as the functional consequences of the observed molecular changes can be quantified via immunological data, protein structure, and replication assays. In addition, we have ample cross-sectional data (HIV sequences from many different patients) that allows us to look at prevalence of the escape mutations and potential compensatory mutations. None of this is done in this paper, but studying HIV/immune-system coevolution is a fascinating showcase of rapid evolution.

The pattern and distribution of deleterious mutations in maize

Sofiane Mezmouk, Jeffrey Ross-Ibarra
(Submitted on 2 Aug 2013)

Most non-synonymous mutations are thought to be deleterious because of their effect on protein sequence. These polymorphisms are expected to be removed or kept at low frequency by the action of natural selection, and rare deleterious variants have been implicated as a possible explanation for the “missing heritability” seen in many studies of complex traits. Nonetheless, the effect of positive selection on linked sites or drift in small or inbred populations may also impact the evolution of deleterious alleles. Here, we made use of genome-wide genotyping data to characterize deleterious variants in a large panel of maize inbred lines. We show that, in spite of small effective population sizes and inbreeding, most putatively deleterious SNPs are indeed at low frequencies within individual genetic groups. We find that genes showing associations with a number of complex traits are enriched for deleterious variants. Together these data are consistent with the dominance model of heterosis, in which complementation of numerous low-frequency, weakly deleterious variants contributes to hybrid vigor.

Maximum likelihood evidence for Neandertal admixture in Eurasian populations from three genomes

Konrad Lohse, Laurent A.F. Frantz
(Submitted on 31 Jul 2013)

Although there has been much interest in estimating divergence and admixture from genomic data, it has proven difficult to distinguish gene flow after divergence from alternative histories involving structure in the ancestral population. The lack of a formal test to distinguish these scenarios has sparked recent controversy about the possibility of interbreeding between Neandertals and modern humans in Eurasia. We derive the probability of mutational configurations in non-recombining sequence blocks under alternative histories of divergence with admixture and ancestral structure. Dividing the genome into short blocks makes it possible to compute maximum likelihood estimates of parameters under both models. We apply this method to triplets of human and Neandertal genomes and quantify the relative support for models of long-term population structure in the ancestral African population and admixture from Neandertals into Eurasian populations after their expansion out of Africa. Our analysis allows us, for the first time, to formally reject a history of ancestral population structure and instead reveals strong support for admixture from Neandertals into Eurasian populations at a higher rate (3.4%-7.9%) than suggested previously.