Detecting recent selective sweeps while controlling for mutation rate and background selection

Christian D. Huber, Michael DeGiorgio, Ines Hellmann, Rasmus Nielsen
doi: http://dx.doi.org/10.1101/018697

A composite likelihood ratio test implemented in the program SweepFinder is a commonly used method for scanning a genome for recent selective sweeps. SweepFinder uses information on the spatial pattern of the site frequency spectrum (SFS) around the selected locus. To avoid confounding effects of background selection and variation in the mutation process along the genome, the method is typically applied only to sites that are variable within species. However, the power to detect and localize selective sweeps can be greatly improved if invariable sites are also included in the analysis. In the spirit of a Hudson-Kreitman-Aguadé test, we suggest adding fixed differences relative to an outgroup to account for variation in mutation rate, thereby facilitating more robust and powerful analyses. We also develop a method for including background selection modeled as a local reduction in the effective population size. Using simulations, we show that these advances lead to a gain in power while maintaining robustness to mutation rate variation. Furthermore, the new method also provides more precise localization of the causative mutation than methods using the spatial pattern of segregating sites alone.

Surveying the relative impact of mRNA features on local ribosome profiling read density in 28 datasets.

Patrick O’Connor, Dmitry Andreev, Pavel Baranov
doi: http://dx.doi.org/10.1101/018762

Ribosome profiling is a promising technology for exploring gene expression. However, ribosome profiling data are characterized by a substantial number of outliers due to technical and biological factors. Here we introduce a simple computational method, Ribo-seq Unit Step Transformation (RUST) for the characterization of ribosome profiling data. We show that RUST is robust and outperforms conventional normalization techniques in the presence of sporadic noise. We used RUST to analyse 28 publicly available ribosome profiling datasets obtained from mammalian cells and tissues and from yeast. This revealed substantial protocol dependent variation in the composition of footprint libraries. We selected a high quality dataset to explore the mRNA features that affect local decoding rates and found that the amino acid identity encoded by the codon in the A-site is the major contributing factor followed by the identity of the codon itself and then the amino acid in the P-site. We also found that bulky amino acids slow down ribosome movement when they occur within the peptide tunnel and Proline residues may decrease or increase ribosome velocities depending on the context in which they occur. Moreover we show that a few parameters obtained with RUST are sufficient for predicting experimental densities with high accuracy. Due to its robustness and low computational demand, RUST could be used for quick routine characterization of ribosome profiling datasets to assess their quality as well as for the analysis of the relative impact of mRNA sequence features on local decoding rates.

Most viewed on Haldane’s Sieve: April 2014

The most viewed posts this month were:

Distinct nucleosome distribution patterns in two structurally and functionally differentiated nuclei of a unicellular eukaryote

Jie Xiong, Shan Gao, Wen Dui, Wentao Yang, Xiao Chen, Sean D Taverna, Ronald E. Pearlman, Wendy Ashlock, Wei Miao, Yifan Liu
doi: http://dx.doi.org/10.1101/018754

The ciliate protozoan Tetrahymena thermophila contains two types of structurally and functionally differentiated nuclei: the transcriptionally active somatic macronucleus (MAC) and the transcriptionally silent germ-line micronucleus (MIC). Here we demonstrate that MAC features well-positioned nucleosomes downstream of transcription start sites (TSS), likely connected with promoter proximal pausing of RNA polymerase II, as well as in exonic regions flanking both the 5′ and 3′ splice sites. In contrast, nucleosomes in MIC are more delocalized. Nucleosome occupancy in MAC and in MIC is nonetheless highly correlated between the two nuclei and with predictions based upon DNA sequence features. Arrays of well-positioned nucleosomes are often correlated with GC content oscillations, suggesting significant contributions from cis-determinants. We propose that cis- and trans-determinants may coordinately accommodate some well-positioned nucleosomes with important functions, driven by a process in which positioned nucleosomes shape the mutational landscape of associated DNA sequences, while the DNA sequences in turn reinforce nucleosome positioning.

Standing genetic variation as a major contributor to adaptation in the Virginia chicken lines selection experiment

Zheya Sheng, Mats E Pettersson, Christa F Honaker, Paul B Siegel, Örjan Carlborg
doi: http://dx.doi.org/10.1101/018721

Artificial selection has, for decades, provided a powerful approach to study the genetics of adaptation. Using selective-sweep mapping, it is possible to identify genomic regions in populations where the allele-frequencies have diverged during selection. To avoid misleading signatures of selection, it is necessary to show that a sweep has an effect on the selected trait before it can be considered adaptive. Here, we confirm candidate selective-sweeps on a genome-wide scale in one of the longest, on-going bi-directional selection experiments in vertebrates, the Virginia high and low body-weight selected chicken lines. The candidate selective-sweeps represent standing genetic variants originating from the common base-population. Using a deep-intercross between the selected lines, 16 of 99 evaluated regions were confirmed to contain adaptive selective-sweeps based on their association with the selected trait, 56-day body-weight. Although individual additive effects were small, the fixation for alternative alleles in the high and low body-weight lines across these loci contributed at least 40% of the divergence between them and about half of the additive genetic variance present within and between the lines after 40 generations of selection. The genetic variance contributed by the sweeps corresponds to about 85% of the additive genetic variance of the base-population, illustrating that these loci were major contributors to the realised selection-response. Thus, the gradual, continued, long-term selection response in the Virginia lines was likely due to a considerable standing genetic variation in a highly polygenic genetic architecture in the base-population with contributions from a steady release of selectable genetic variation from new mutations and epistasis throughout the course of selection.

The complex admixture history and recent southern origins of Siberian populations

Irina Pugach, Rostislav Matveev, Viktor Spitsyn, Sergey Makarov, Innokentiy Novgorodov, Vladimir Osakovsky, Mark Stoneking, Brigitte Pakendorf
doi: http://dx.doi.org/10.1101/018770

Although Siberia was inhabited by modern humans at an early stage, there is still debate over whether this area remained habitable during the extremely cold period of the Last Glacial Maximum or whether it was subsequently repopulated by peoples with a recent shared ancestry. Previous studies of the genetic history of Siberian populations were hampered by the extensive admixture that appears to have taken place among these populations, since commonly used methods assume a tree-like population history and at most single admixture events. We therefore developed a new method based on the covariance of ancestry components, which we validated with simulated data, in order to investigate this potentially complex admixture history and to distinguish the effects of shared ancestry from prehistoric migrations and contact. We furthermore adapted a previously devised method of admixture dating for use with multiple events of gene flow, and applied these methods to whole-genome genotype data from over 500 individuals belonging to 20 different Siberian ethnolinguistic groups. The results of these analyses indicate that there have indeed been multiple layers of admixture detectable in most of the Siberian populations, with considerable differences in the admixture histories of individual populations, and with the earliest events dated to not more than 4500 years ago. Furthermore, most of the populations of Siberia included here, even those settled far to the north, can be shown to have a southern origin. These results provide support for a recent population replacement in this region, with the northward expansions of different populations possibly being driven partly by the advent of pastoralism, especially reindeer domestication. These newly developed methods to analyse multiple admixture events should aid in the investigation of similarly complex population histories elsewhere.

Selection for Intermediate Genotypes Enables a Key Innovation in Phage Lambda

Alita Burmeister, Richard Lenski, Justin Meyer
doi: http://dx.doi.org/10.1101/018606

The evolution of qualitatively new functions is fundamental for shaping the diversity of life. Such innovations are rare because they require multiple coordinated changes. We sought to understand the evolutionary processes involved in a particular key innovation, whereby phage λ evolved the ability to exploit a novel receptor, OmpF, on the surface of Escherichia coli cells. Previous work has shown that this transition repeatedly evolves in the laboratory, despite requiring four mutations in specific regions of a single gene. Here we examine how this innovation evolved by studying six intermediate genotypes that arose during independent transitions to use OmpF. In particular, we tested whether these genotypes were favored by selection, and how a coevolved change in the hosts influenced the fitness of the phage genotypes. To do so, we measured the fitness of the intermediate types relative to the ancestral λ when competing for either ancestral or coevolved host cells. All six intermediates had improved fitness on at least one host, and four had higher fitness on the coevolved host than on the ancestral host. These results show that the evolution of the phage’s new ability to use OmpF was repeatable because the intermediate genotypes were adaptive and, in many cases, because coevolution of the host favored their emergence.

Rapid antibiotic resistance predictions from genome sequence data for S. aureus and M. tuberculosis.

Phelim Bradley, N Claire Gordon, Timothy M Walker, Laura Dunn, Simon Heys, Bill Huang, Sarah Earle, Louise J Pankhurst, Luke Anson, Mariateresa de Cesare, Paolo Piazza, Antonina A Votintseva, Tanya Golubchik, Daniel J Wilson, David H Wyllie, Roland Diel, Stefan Niemann, Silke Feuerriegel, Thomas A Kohl, Nazir Ismail, Shaheed V Omar, E Grace Smith, David Buck, Gil McVean, A Sarah Walker, Tim Peto, Derrick Crook, Zamin Iqbal
doi: http://dx.doi.org/10.1101/018564

Rapid and accurate detection of antibiotic resistance in pathogens is an urgent need, affecting both patient care and population-scale control. Microbial genome sequencing promises much, but many barriers exist to its routine deployment. Here, we address these challenges, using a de Bruijn graph comparison of clinical isolate and curated knowledge-base to identify species and predict resistance profile, including minor populations. This is implemented in a package, Mykrobe predictor, for S. aureus and M. tuberculosis, running in under three minutes on a laptop from raw data. For S. aureus, we train and validate in 495/471 samples respectively, finding error rates comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.3%/99.5% across 12 drugs. For M. tuberculosis, we identify species and predict resistance with specificity of 98.5% (training/validating on 1920/1609 samples). Sensitivity of 82.6% is limited by current understanding of genetic mechanisms. We also show that analysis of minor populations increases power to detect phenotypic resistance in second-line drugs without appreciable loss of specificity. Finally, we demonstrate feasibility of an emerging single-molecule sequencing technique.

Author post: Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing

This guest post is by William Wen on his preprint “Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing”, arXived here.

Our paper “Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing” has been published in the journal Biostatistics, and the preprint on the arXiv has been updated. The paper discusses linear mixed models (LMM) and the models commonly used for (rare variant) SNP set testing in a unified Bayesian framework, where fixed and random effects are naturally treated as corresponding prior distributions. Based on this general Bayesian representation, we derive Bayes factors as our primary inference device and demonstrate their use in solving problems of hypothesis testing (e.g., single SNP and SNP set testing) and variable selection (e.g., multiple SNP analysis) in genetic association analysis. Here, we take the opportunity to summarize our main findings for a general audience.

Agreement with Frequentist Inference

We derive several analytic approximations of the Bayes factor from the unified Bayesian model, and we find that these approximations are connected to commonly used frequentist test statistics, namely the Wald and score statistics in LMM and the variance component score statistic in SNP set testing.

In the case of LMM-based single SNP testing, we find that under a specific prior specification of genetic effects, the approximate Bayes factors become monotonic transformations of the Wald or score test statistics (and hence of their corresponding p-values) obtained from LMM. This connection is very similar to what Wakefield (2009) reports in the context of simple logistic regression. It should be noted that this specific prior specification (following Wakefield (2009), we call it the implicit p-value prior) essentially assumes a larger a priori effect for SNPs that are less informative (because of either a smaller sample size or a lower minor allele frequency). Although, from the Bayesian point of view, such prior assumptions seem to lack proper justification in general, we often find that the overall effect of the implicit p-value prior on the final inference is negligible in practice, especially when the sample size is large (we demonstrate this point with numerical experiments in the paper).
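To make the monotonicity concrete, here is a minimal sketch that uses a Wakefield-style normal approximation rather than the exact LMM expressions from the paper. It assumes an effect estimate with sampling variance V, a Wald statistic z, and a normal prior with variance W on the effect; the SNP names and numbers are hypothetical. When the prior variance is tied to the sampling variance (W = K·V, which plays the role of the implicit p-value prior here), the Bayes factor depends on the data only through z², so ranking SNPs by Bayes factor is the same as ranking them by p-value. With a fixed W the two rankings need not agree.

```python
import math

def log_bf_alt(z, V, W):
    """Log approximate Bayes factor (alternative vs. null) for an effect
    estimate with sampling variance V, Wald statistic z, and a N(0, W)
    prior on the effect under the alternative."""
    shrink = W / (V + W)
    # BF = sqrt(V / (V + W)) * exp(z^2 / 2 * W / (V + W))
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink

# Hypothetical SNPs: name -> (Wald statistic z, sampling variance V)
snps = {"rs_a": (2.0, 0.010), "rs_b": (3.0, 0.200), "rs_c": (-2.8, 0.002)}

K = 4.0  # assumed ratio W / V for the implicit p-value prior
implicit = {s: log_bf_alt(z, V, K * V) for s, (z, V) in snps.items()}
fixed = {s: log_bf_alt(z, V, 0.04) for s, (z, V) in snps.items()}

rank_by_abs_z = sorted(snps, key=lambda s: abs(snps[s][0]), reverse=True)
rank_implicit = sorted(implicit, key=implicit.get, reverse=True)
rank_fixed = sorted(fixed, key=fixed.get, reverse=True)

print(rank_by_abs_z, rank_implicit, rank_fixed)
```

In this toy example the least informative SNP (rs_b, largest V) ranks first under the implicit p-value prior because that prior assigns it a larger a priori effect, whereas the fixed prior ranks it last.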

For SNP set testing, we show that the variance component score statistic in the popular SKAT model (Wu et al. (2011)) is also monotonically related to the approximate Bayes factor in our unified model if the prior effect size under the alternative scenario is assumed to be small. Interestingly, such a prior assumption represents a “local” alternative scenario for which score tests are known to be most powerful.

The above connections are to be expected: after all, the frequentist models and the Bayesian representation share exactly the same likelihood functions. From the Bayesian point of view, the connections reveal the implicit prior assumptions in frequentist inference. These connections also provide a principled way to “translate” the relevant frequentist statistics/p-values into Bayes factors for fine mapping analysis using Bayesian hierarchical models, as demonstrated by Maller et al. (2012) and Pickrell (2014).
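As a rough illustration of how per-SNP Bayes factors feed into fine mapping, the sketch below converts log Bayes factors into posterior probabilities and a credible set. It assumes exactly one causal SNP in the region and equal prior probability for every SNP, which is a simplification and not the full hierarchical models of the cited papers; the function names and input values are hypothetical.

```python
import math

def posterior_probs(log_bfs):
    """Posterior probability that each SNP is the causal one, assuming a
    single causal SNP and equal priors: proportional to its Bayes factor."""
    m = max(log_bfs.values())  # subtract the max for numerical stability
    weights = {s: math.exp(v - m) for s, v in log_bfs.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def credible_set(probs, level=0.95):
    """Smallest set of top-ranked SNPs whose probabilities sum to >= level."""
    chosen, cumulative = [], 0.0
    for snp in sorted(probs, key=probs.get, reverse=True):
        chosen.append(snp)
        cumulative += probs[snp]
        if cumulative >= level:
            break
    return chosen

# Hypothetical log Bayes factors for four SNPs in one region
log_bfs = {"rs_1": 6.0, "rs_2": 5.0, "rs_3": 1.0, "rs_4": 0.0}
probs = posterior_probs(log_bfs)
cs = credible_set(probs)
print(probs, cs)
```

Working on the log scale avoids overflow when Bayes factors are very large, which is common for strongly associated regions.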

Advantages of Bayesian Inference

Bayesian Model Averaging

Bayesian model averaging allows simultaneous modeling of multiple alternative scenarios, which need not be nested or compatible with each other. One interesting example is rare variant SNP set testing, where two primary classes of competing models target complementary alternative scenarios: the burden model (which assumes that most rare causal variants in a SNP set are either consistently deleterious or consistently protective) and the SKAT model (which assumes that the rare causal variants in a SNP set can have bi-directional effects). In our Bayesian framework, we show that these two types of alternative models correspond to different prior specifications, and a Bayes factor jointly considering the two types of models can be trivially computed by model averaging. A frequentist approach, SKAT-O, proposed by Lee et al. (2012), achieves a similar goal by using a mixture kernel (or prior, from the Bayesian perspective). We discuss the subtle theoretical difference between Bayesian model averaging and the use of the SKAT-O prior, and show by simulation that the two approaches have similar performance. Moreover, we find that Bayesian model averaging is more general and flexible. To this end, we demonstrate a Bayesian SNP set testing example in which three categories of alternative scenarios are jointly considered: in addition to the two aforementioned rare SNP association models, a common SNP association model is also included in the averaging. Such an application can be useful in eQTL studies to identify genes harboring cis-eQTLs.
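The averaging step itself is simple: with prior weights over the alternative models that sum to one, the averaged Bayes factor is the weighted mean of the model-specific Bayes factors. The sketch below shows this on the log scale; the function name, weights, and Bayes factor values are illustrative assumptions, not numbers from the paper.

```python
import math

def averaged_log_bf(log_bfs, weights):
    """Log of the model-averaged Bayes factor sum_k w_k * BF_k, computed
    with the log-sum-exp trick. Weights are prior model probabilities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("prior model weights must sum to 1")
    terms = [math.log(w) + lb for w, lb in zip(weights, log_bfs) if w > 0]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Hypothetical log Bayes factors for one SNP set under two alternatives
log_bf_burden = 1.2  # effects assumed to share one direction
log_bf_skat = 4.6    # effects allowed to be bi-directional

avg = averaged_log_bf([log_bf_burden, log_bf_skat], [0.5, 0.5])
print(avg)
```

Because the averaged Bayes factor is at least the weight times the best single-model Bayes factor, averaging over two equally weighted models costs at most a factor of two relative to knowing the right model in advance. Adding a third category, such as a common SNP model, only requires another entry in each list.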

Prior Specification for Genetic Effects

The explicit specification of prior distributions on genetic effects for alternative models is seemingly a distinct feature of Bayesian inference. However, as we have shown, even the most commonly applied frequentist test statistics can be viewed as resulting from some implicit Bayesian priors. Therefore, it is only natural to regard the prior specification as an integral component of modeling alternative scenarios. Many authors have shown that it is effective to incorporate functional annotations into genetic association analysis through prior specifications. In addition, we show that in many practical settings, the desired priors can be adequately “learned” from the data with the help of Bayes factors.

Multiple SNP Association Analysis

Building on the Bayes factor results, we demonstrate an example of multiple SNP fine mapping analysis via Bayesian variable selection in the context of LMM. The advantages of Bayesian variable selection and its comparison to the popular conditional analysis approach are discussed thoroughly in another recent paper of ours (Wen et al. 2015).

Take-home Message

If single SNP/SNP set association testing is the end point of the analysis, the Bayesian and the commonly applied frequentist approaches yield similar results with very little practical difference. However, going beyond simple hypothesis testing in genetic association analysis, we believe that the Bayesian approaches possess many unique advantages and are conceptually simple to apply in rather complicated practical settings.

The software/scripts and the simulated and real data sets used in the paper are publicly available on GitHub.

References

1. Wakefield, J. (2009). Bayes factors for genome-wide association studies: comparison with P-values. Genetic Epidemiology, 33(1), 79-86.
2. Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1), 82-93.
3. Maller, J. B., McVean, G., Byrnes, J., Vukcevic, D., Palin, K., Su, Z., Wellcome Trust Case Control Consortium, et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature Genetics, 44(12), 1294-1301.
4. Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. The American Journal of Human Genetics, 94(4), 559-573.
5. Lee, S., Wu, M. C., & Lin, X. (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13(4), 762-775.
6. Wen, X., Luca, F., & Pique-Regi, R. (2014). Cross-population meta-analysis of eQTLs: fine mapping and functional study. bioRxiv, 008797.

A high-throughput RNA-seq approach to profile transcriptional responses

Gregory A Moyerbrailean, Gordon O Davis, Chris T Harvey, Donovan Watza, Xiaoquan Wen, Roger Pique-Regi, Francesca Luca
doi: http://dx.doi.org/10.1101/018416

In recent years, different technologies have been used to measure genome-wide gene expression levels and to study the transcriptome across many types of tissues and in response to in vitro treatments. However, a full understanding of gene regulation in any given cellular and environmental context combination is still missing. This is partly because analyzing tissue/environment-specific gene expression generally implies screening a large number of cellular conditions and samples, without prior knowledge of which conditions are most informative (e.g. some cell types may not respond to certain treatments). To circumvent these challenges, we have established a new two-step high-throughput and cost-effective RNA-seq approach: the first step consists of gene expression screening of a large number of conditions, while the second step focuses on deep sequencing of the most relevant conditions (e.g. largest number of differentially expressed genes). This study design allows for a fast and economical screen in step one, with a more profitable allocation of resources for the deep sequencing of re-pooled libraries in step two. We have applied this approach to study the response to 26 treatments in three lymphoblastoid cell line samples and we show that it is applicable for other high-throughput transcriptome profiling requiring iterative refinement or screening.