Genome wide signals of pervasive positive selection in human evolution

David Enard, Philipp W. Messer, Dmitri Petrov
(Submitted on 22 Aug 2013)

The role of positive selection in human evolution remains controversial. On the one hand, scans for positive selection have identified hundreds of candidate loci and the genome-wide patterns of polymorphism show signatures consistent with frequent positive selection. On the other hand, recent studies have argued that many of the candidate loci are false positives and that most apparent genome-wide signatures of adaptation are in fact due to reduction of neutral diversity by linked recurrent deleterious mutations, known as background selection. Here we analyze human polymorphism data from the 1,000 Genomes project (Abecasis et al. 2012) and detect signatures of pervasive positive selection once we correct for the effects of background selection. We show that levels of neutral polymorphism are lower near amino acid substitutions, with the strongest reduction observed specifically near functionally consequential amino acid substitutions. Furthermore, amino acid substitutions are associated with signatures of recent adaptation that should not be generated by background selection, such as the presence of unusually long and frequent haplotypes and specific distortions in the site frequency spectrum. We use forward simulations to show that the observed signatures require a high rate of strongly adaptive substitutions in the vicinity of the amino acid changes. We further demonstrate that the observed signatures of positive selection correlate more strongly with the presence of regulatory sequences, as predicted by ENCODE (Gerstein et al. 2012), than the positions of amino acid substitutions. Our results establish that adaptation was frequent in human evolution and provide support for the hypothesis of King and Wilson (King and Wilson 1975) that adaptive divergence is primarily driven by regulatory changes.

The standing pool of genomic structural variation in a natural population of Mimulus guttatus

Lex E. Flagel, John H. Willis, Todd J. Vision
(Submitted on 19 Aug 2013)

Major unresolved questions in evolutionary genetics include determining the contributions of different mutational sources to the total pool of genetic variation in a species, and understanding how these different forms of genetic variation interact with natural selection. Recent work has shown that structural variants (insertions, deletions, inversions and transpositions) are a major source of genetic variation, often out-numbering single nucleotide variants in terms of total bases affected. Despite the near ubiquity of structural variants, major questions about their interaction with natural selection remain. For example, how does the allele frequency spectrum of structural variants differ when compared to single nucleotide variants? How often do structural variants affect genes, and what are the consequences? To begin to address these questions, we have systematically identified and characterized a large set of submicroscopic insertion and deletion (indel) variants (between 1 kb and 200 kb in length) among ten individuals from a single natural population of the plant species Mimulus guttatus. After extensive computational filtering, we focused on a set of 4,142 high-confidence indels that showed an experimental validation rate of 73%. All but one of these indels were < 200 kb. While the largest were generally at lower frequencies in the population, a surprising number of large indels are at intermediate frequencies. While indels overlapping with genes were much rarer than expected by chance, nearly 600 genes were affected by an indel. NBS-LRR defense response genes were the most enriched among the gene families affected. Most indels associated with genes were rare and appeared to be under purifying selection, though we do find four high-frequency derived insertion alleles that show signatures of recent positive selection.

Gene and Gene-Set Analysis for Genome-Wide Association Studies

Inti Pedroso
(Submitted on 19 Aug 2013)

Genome-wide association studies (GWAS) have identified hundreds of loci at very stringent levels of statistical significance across many different human traits. However, it is now clear that very large samples (n~10^4-10^5) are needed to find the majority of genetic variants underlying risk for most human diseases. Therefore, the field has engaged in a race to increase study sample sizes, with some studies yielding very successful results but others providing little or no new insight. This project started early on in this new wave of studies, and I decided to use an alternative approach that uses prior biological knowledge to improve both interpretation and power of GWAS. The project aimed to a) implement and develop new gene-based methods to derive gene-level statistics so that GWAS can be used in well-established systems biology tools; b) use these gene-level statistics in network and gene-set analyses of GWAS data; c) mine GWAS of neuropsychiatric disorders using gene, gene-set and integrative biology analyses with gene-expression studies; and d) explore the ability of these methods to improve the analysis of GWAS on disease sub-phenotypes, which usually suffer from very small sample sizes.

Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms

Rob Patro (1), Stephen M. Mount (2), Carl Kingsford (1) ((1) Lane Center for Computational Biology, School of Computer Science, Carnegie Mellon University, (2) Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland)
(Submitted on 16 Aug 2013)

RNA-seq has rapidly become the de facto technique to measure gene expression. However, the time required for analysis has not kept up with the pace of data generation. Here we introduce Sailfish, a novel computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Sailfish entirely avoids mapping reads, which is a time-consuming step in all current methods. Sailfish provides quantification estimates much faster than existing approaches (typically 20-times faster) without loss of accuracy.
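
To give a feel for what quantifying without mapping reads can look like, here is a minimal Python sketch based on k-mer counting. The abstract does not describe Sailfish's internals, so everything here is an illustrative assumption: the function names, the toy sequences, and the simplification of counting only k-mers that occur in exactly one transcript. It is not the Sailfish implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """Return every overlapping substring of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_transcript_kmers(
    transcripts: Dict[str, str], reads: Iterable[str], k: int
) -> Counter:
    """Toy alignment-free tally (hypothetical helper, not Sailfish itself).

    Builds a k-mer -> transcript index, then scans reads and credits a
    transcript each time a read contains a k-mer unique to that transcript.
    K-mers shared by several transcripts are ignored in this sketch.
    """
    index: Dict[str, Set[str]] = {}
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)

    counts: Counter = Counter()
    for read in reads:
        for km in kmers(read, k):
            owners = index.get(km)
            if owners is not None and len(owners) == 1:
                counts[next(iter(owners))] += 1
    return counts


# Toy data, invented purely for illustration.
toy_transcripts = {"t1": "ACGTAC", "t2": "TTGGCC"}
toy_reads = ["ACGT", "GGCC", "ACGT"]
toy_counts = count_transcript_kmers(toy_transcripts, toy_reads, k=4)
```

No read is ever placed at a position on a transcript; only dictionary lookups are performed, which is why approaches of this kind can be fast. A real quantifier would also need to handle k-mers shared between isoforms and normalize by transcript length.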

Realistic simulations reveal extensive sample-specificity of RNA-seq biases

Botond Sipos, Greg Slodkowicz, Tim Massingham, Nick Goldman
(Submitted on 14 Aug 2013)

In line with the importance of RNA-seq, the bioinformatics community has produced numerous data analysis tools incorporating methods to correct sample-specific biases. However, few advanced simulation tools exist to enable benchmarking of competing correction methods. We introduce the first framework to reproduce the properties of individual RNA-seq runs and, by applying it to several datasets, we demonstrate the importance of accounting for sample-specificity in realistic simulations.

On the sympatric evolution of coexistence by relative nonlinearity of competition

Florian Hartig, Tamara Münkemüller, Karin Johst, Ulf Dieckmann
(Submitted on 14 Aug 2013)

If two species show different nonlinear responses to a single shared resource, and if each species modifies resource dynamics such that it favors its competitor, they may stably coexist. While the mechanism behind this phenomenon, known as relative nonlinearity of competition, is well understood, less is known about its evolutionary properties and its prevalence in real communities. We address this challenge by using the adaptive dynamics framework as well as individual-based simulations to compare dynamic and evolutionary stability of communities coexisting through relative nonlinearity. Evolution operates on the species’ density compensation strategies, and a trade-off between growth at high versus low resource availability (population density) is assumed. We confirm previous findings that, irrespective of the particular model of density-dependence, there are usually broad ranges of coexistence between overcompensating and undercompensating density-compensation strategies. We show that most of these strategies, however, are not evolutionarily stable and will be outcompeted by a single compensatory strategy. Only very specific evolutionary trade-offs allow evolutionary stability of strategies that coexist through relative nonlinearity. As we find no reason why these particular trade-offs should be abundant in nature, we conclude that sympatric evolution of relative nonlinearity seems possible, but rather unlikely. We speculate that this may explain why relative nonlinearity has seldom been observed, although we note that a low probability of sympatric evolution does not exclude the possibility that this mechanism of coexistence might still frequently occur when species with different evolutionary histories meet in the same community. Our results highlight the need for combining ecological and evolutionary perspectives for understanding community assembly and biogeographical patterns.

Cell-cycle regulated transcription associates with DNA replication timing in yeast and human

Hunter B. Fraser
(Submitted on 8 Aug 2013)

Eukaryotic DNA replication follows a specific temporal program, with some genomic regions consistently replicating earlier than others, yet what determines this program is largely unknown. Highly transcribed regions have been observed to replicate in early S-phase in all plant and animal species studied to date, but this relationship is thought to be absent from both budding yeast and fission yeast. No association between cell-cycle regulated transcription and replication timing has been reported for any species. Here I show that in budding yeast, fission yeast, and human, the genes most highly transcribed during S-phase replicate early, whereas those repressed in S-phase replicate late. Transcription during other cell-cycle phases shows either the opposite correlation with replication timing, or no relation. The relationship is strongest near late-firing origins of replication, which is not consistent with a previously proposed model — that replication timing may affect transcription — and instead suggests a potential mechanism involving the recruitment of limiting replication initiation factors during S-phase. These results suggest that S-phase transcription may be an important determinant of DNA replication timing across eukaryotes, which may explain the well-established association between transcription and replication timing.

Our Paper: The genomic impacts of drift and selection for hybrid performance in maize

This next post is by Jeff Ross-Ibarra (@jrossibarra) on his paper with coauthors, Gerke et al., “The genomic impacts of drift and selection for hybrid performance in maize,” arXived here.

Iowa recurrent selection as an evolutionary experiment in hybrid vigor

Maize is an outcrossing species, and was cultivated as such up through the first quarter of the 20th century. Starting in the 1920’s, however, breeders began to abandon open-pollinated maize in favor of hybrid varieties resulting from crosses between inbred lines. Hybrids are often more robust and higher yielding than either inbred parent, a phenomenon known as hybrid vigor or heterosis.

Breeding for hybrid varieties – and presumably increased heterosis – has had a profound impact on diversity across the maize genome. There are at least two important differences from previous breeding efforts: first, breeders select on and work with inbred maize lines rather than mass selection on open-pollinated populations. This results in much smaller effective population sizes, and has implications for recessive traits and deleterious alleles that could be masked in heterozygotes. The second difference is that instead of selecting the best plants per se, breeders now select for inbreds that make high-yielding hybrids. This means a breeder might favor an inbred that itself is not high-yielding if it consistently makes good hybrids when paired with other inbreds.

We set out to study the effects of these breeding strategies on patterns of diversity across the maize genome. We took advantage of one of the longest-running ongoing experiments on selection for hybrid performance, started in the late 1940’s by the US Dept. of Agriculture’s Agricultural Research Service. Two small sets of founder inbred lines (12 and 16 lines) were randomly mated to create two base populations: the Iowa Stiff Stalk Synthetic (BSSS) and the Iowa Corn Borer Synthetic No. 1 (BSCB1). In addition to its role as an important selection experiment, multiple maize breeding lines have come out of the BSSS population, including the line used for the maize reference genome.

Diversity in the BSSS and BSCB1 is patterned predominantly by drift

Over the course of the experiment we studied, the two base populations underwent 16 cycles of recurrent selection, in which lines from each population were crossed to each other and evaluated for both hybrid and per-se performance. Selected lines were intermated within each population to form the next generation. To investigate the genomic impact of this selection scheme, we genotyped progenitor lines and over 600 individuals from multiple selection cycles using the Illumina MaizeSNP50 SNP array. And because we know the exact crossing and selection scheme used, we can compare the observed changes in genome-wide diversity with strictly neutral crossing simulations using the genotypes of the starting populations.

Both populations steadily lost genetic diversity as they became more diverged from one another, but diversity and divergence between BSSS and BSCB1 can be largely reproduced by simulation without any selection. In fact, principal component analysis clearly reveals changes in population structure and diversity that mirror alterations in rates of inbreeding and effective population size that occurred over the course of the experiment. This indicates the structure is not necessarily related to the phenotypic improvement, but might be a by-product of the breeding scheme. Similar population structure is reflected in a recent broad comparison of US maize germplasm and suggests that much of the diversity and structure of modern maize germplasm has been shaped by genetic drift.
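
As a rough illustration of how drift alone erodes diversity in a small population, here is a minimal Python sketch of an idealized Wright-Fisher model. It is not the pedigree-based crossing simulation used in the paper, which followed the actual crossing scheme and founder genotypes. The effective size of 10 and the starting frequency of 0.5 are arbitrary assumptions chosen for illustration; only the 16 cycles come from the text above.

```python
import random
from typing import List


def expected_heterozygosity(h0: float, n_e: int, generations: int) -> float:
    """Expected heterozygosity under pure drift in an idealized diploid
    Wright-Fisher population: H_t = H_0 * (1 - 1/(2*N_e))**t."""
    return h0 * (1.0 - 1.0 / (2.0 * n_e)) ** generations


def simulate_drift(
    p0: float, n_e: int, generations: int, rng: random.Random
) -> List[float]:
    """Simulate one neutral allele-frequency trajectory.

    Each generation, 2*N_e gene copies are drawn at random from the
    previous generation's allele frequency (binomial sampling).
    """
    p = p0
    trajectory = [p]
    for _ in range(generations):
        copies = sum(1 for _ in range(2 * n_e) if rng.random() < p)
        p = copies / (2 * n_e)
        trajectory.append(p)
    return trajectory


# Hypothetical parameters: N_e = 10 and p0 = 0.5 are assumptions.
rng = random.Random(1)
trajectory = simulate_drift(p0=0.5, n_e=10, generations=16, rng=rng)
h_start = expected_heterozygosity(0.5, n_e=10, generations=0)
h_end = expected_heterozygosity(0.5, n_e=10, generations=16)
```

With these assumed values, more than half of the initial heterozygosity is expected to be lost over 16 generations with no selection at all, which is the kind of neutral baseline that observed diversity has to be compared against.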

Selection efficacy and fixation at regions of low recombination

But genetic drift can’t be the whole story in these populations. Numerous experiments have shown that the later populations are superior to their progenitors in terms of hybrid yield and traits important to increased planting density (more plants per acre = more yield). These same trends are observed across North American maize as a whole, suggesting common themes in how maize has improved over time. Selection is difficult to detect in the face of strong genetic drift, especially when the selection has been on traits with complex genetic architectures. However, our simulations do detect regions of low heterozygosity in each population that are longer than expected given their genetic distance.

The most striking pattern of these regions is their lack of overlap between the two populations. In simple cases, classic overdominance models of heterosis predict that at a single locus, two distinct alleles confer heterozygote advantage when combined. In this case, selection should lead to decreased heterozygosity at a locus in both populations as complementary alleles rise in frequency. We don’t observe this, and neither did a different study that used other populations.

A popular alternative to the overdominance model is the dominance model, which predicts that heterosis is caused by the complementation of linked recessive deleterious alleles. In this case, multiple haplotypes in the other population may complement a fixed region if most deleterious alleles in maize are rare. Evidence from numerous studies supports a dominance model of heterosis, including findings of excess residual heterozygosity in low recombination regions of a maize mapping population. In regions of low recombination, heterozygosity (and thus complementation) becomes important due to an inability to efficiently select for new recombinants in these regions, especially with low effective population sizes. And because of low rates of recombination, a small genetic interval in these regions becomes massive in physical space and encompasses the composite effects of many deleterious loci. We observe fixation in these regions in the BSSS and BSCB1 populations. They are short genetically (1-2 centimorgans), but make up very large fractions of the chromosome. We find that in many cases, these regions have been inherited largely intact from the original population founders, indicating that selection for new haplotype combinations in these regions has been ineffective. Large haplotypes in some cases may have fixed early on in the formation of many breeding programs, and the combination of limited exchange between breeding pools and small effective population sizes has provided little opportunity for selective removal of deleterious alleles. Complementation and the inefficiency of selection in these pericentromeric regions, which span a large portion of the physical genome, may thus explain the difference between hybrid and inbred yield and why it has remained fairly constant.

Predicting protein contact map using evolutionary and physical constraints by integer programming

Zhiyong Wang, Jinbo Xu
(Submitted on 8 Aug 2013)

Motivation. A protein contact map describes the pairwise spatial and functional relationship of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains very challenging to predict a contact map using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and physical feasibility of the whole contact map. A couple of recent methods predict contact maps based upon residue co-evolution, taking into consideration contact correlation and enforcing a sparsity restraint, but these methods require a very large number of sequence homologs for the protein under consideration and the resultant contact map may still be physically unfavorable.
Results. This paper presents a novel method, PhyCMAP, for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming (ILP). The evolutionary restraints include sequence profile, residue co-evolution and context-specific statistical potential. The physical restraints specify more concrete relationships among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and thus significantly improves prediction accuracy. Experimental results confirm that PhyCMAP outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration. PhyCMAP can predict contacts within minutes after the PSIBLAST search for sequence homologs is done, much faster than the two recent methods PSICOV and EvFold.

How Population Growth Affects Linkage Disequilibrium

Alan R. Rogers
(Submitted on 8 Aug 2013)

Linkage disequilibrium (LD) is often summarized using the “LD curve,” which relates the LD between pairs of sites to the distance that separates them along the chromosome. This paper shows how the LD curve responds to changes in population size. An expansion of population size generates an LD curve that declines steeply, especially if that expansion has followed a bottleneck. A reduction in size generates an LD curve that is high but relatively flat. In European data, the curve is steep, suggesting a history of population expansion.
These conclusions emerge from the study of $\sigma_d^2$, a measure of LD that has never played a central role. It has been seen merely as an approximation to another measure, $r^2$. Yet $\sigma_d^2$ has different dynamical behavior and provides deeper time depth. Furthermore, it is easily estimated from data and can be predicted from population history using a fast, deterministic algorithm.
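
The difference between the two LD measures can be made concrete with a short Python sketch. It assumes the standard textbook definitions: for a pair of biallelic sites, $r^2 = D^2 / [p_A(1-p_A)p_B(1-p_B)]$, whereas $\sigma_d^2$ is the ratio of the expectation of $D^2$ to the expectation of $p_A(1-p_A)p_B(1-p_B)$, approximated here by summing over site pairs before dividing. These are naive plug-in estimates with no sampling-bias correction, and they are not necessarily the estimator used in the paper. The function names and toy haplotypes are invented for illustration.

```python
from typing import Iterable, Sequence, Tuple

Site = Sequence[int]  # one biallelic site, coded 0/1 across haplotypes


def pair_stats(a: Site, b: Site) -> Tuple[float, float]:
    """Return (D^2, pA(1-pA)pB(1-pB)) for two sites on the same haplotypes."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    return d * d, p_a * (1 - p_a) * p_b * (1 - p_b)


def r_squared(a: Site, b: Site) -> float:
    """Per-pair r^2: a ratio computed separately for each pair of sites."""
    num, den = pair_stats(a, b)
    return num / den


def sigma_d_squared(pairs: Iterable[Tuple[Site, Site]]) -> float:
    """Plug-in sigma_d^2: a ratio of sums (numerators and denominators
    are accumulated over all site pairs before dividing)."""
    nums, dens = zip(*(pair_stats(a, b) for a, b in pairs))
    return sum(nums) / sum(dens)


# Toy haplotype data (four haplotypes), invented for illustration.
site1 = [1, 1, 0, 0]
site2 = [1, 1, 0, 0]  # perfectly associated with site1
site3 = [1, 0, 1, 0]  # independent of site1 in this sample

r2_linked = r_squared(site1, site2)
r2_unlinked = r_squared(site1, site3)
sd2 = sigma_d_squared([(site1, site2), (site1, site3)])
```

In practice the pairs passed to `sigma_d_squared` would be binned by the distance separating the two sites, giving one value per bin of the LD curve.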