FermiKit: assembly-based variant calling for Illumina resequencing data

FermiKit: assembly-based variant calling for Illumina resequencing data
Heng Li
Subjects: Genomics (q-bio.GN)

Summary: FermiKit is a variant calling pipeline for Illumina data. It de novo assembles short reads and then maps the assembly against a reference genome to call SNPs, short insertions/deletions (INDELs) and structural variations (SVs). FermiKit takes about one day to assemble 30-fold human whole-genome data on a modern 16-core server with 85GB RAM at the peak, and calls variants in half an hour to an accuracy comparable to the current practice. FermiKit assembly is a reduced representation of raw data while retaining most of the original information.
Availability and implementation: https://github.com/lh3/fermikit
Contact: hengli@broadinstitute.org

Integrative analysis of RNA, translation and protein levels reveals distinct regulatory variation across humans

Integrative analysis of RNA, translation and protein levels reveals distinct regulatory variation across humans
Can Cenik , Elif Sarinay Cenik , Gun W Byeon , Fabian Grubert , Sophie I Candille , Damek Spacek , Bilal Alsallakh , Hagen Tilgner , Carlos L Araya , Hua Tang , Emiliano Ricci , Michael P Snyder
doi: http://dx.doi.org/10.1101/018572

Elucidating the consequences of genetic differences between humans is essential for understanding phenotypic diversity and personalized medicine. Although variation in RNA levels, transcription factor binding and chromatin have been explored, little is known about global variation in translation and its genetic determinants. We used ribosome profiling, RNA sequencing, and mass spectrometry to perform an integrated analysis in lymphoblastoid cell lines from a diverse group of individuals. We find significant differences in RNA, translation, and protein levels suggesting diverse mechanisms of personalized gene expression control. Combined analysis of RNA expression and ribosome occupancy improves the identification of individual protein level differences. Finally, we identify genetic differences that specifically modulate ribosome occupancy – many of these differences lie close to start codons and upstream ORFs. Our results reveal a new level of gene expression variation among humans and indicate that genetic variants can cause changes in protein levels through effects on translation.

Author post: Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing

This guest post is by William Wen on his preprint “Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing”, arXived here.

Our paper “Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing” has been published in the journal Biostatistics, the preprint is also updated on the arXiv. The paper discusses linear mixed models (LMM) and the models commonly used for (rare variants) SNP set testing in a unified Bayesian framework, where fixed and random effects are naturally treated as corresponding prior distributions. Based on this general Bayesian representation, we derive Bayes factors as our primary inference device and demonstrate their usage in solving problems of hypothesis testing (e.g., single SNP and SNP set testing) and variable selection (e.g., multiple SNP analysis) in genetic association analysis. Here, we take the opportunity to summarize our main findings for a general audience.

Agreement with Frequentist Inference

We are able to derive various forms of analytic approximations of the Bayes factor based on the unified Bayesian model, and we find that these analytic approximations are connected to the commonly used frequentist test statistics, namely the Wald statistic, score statistic in LMM and the variance component score statistic in SNP set testing.

In the case of LMM-based single SNP testing, we find that under a specific prior specification of genetic effects, the approximate Bayes factors become monotonic transformations of the Wald or score test statistics (hence their corresponding p-values) obtained from LMM. This connection is very similar to what is reported by Wakefield (2008) in the context of simple logistic regression. It should be noted the specific prior specification (following Wakefield (2008), we call it implicit p-value prior) essentially assumes a larger a priori effect for SNPs that are less informative (either due to a smaller sample size or minor allele frequency). Although, from the Bayesian point of view, there seems to be a lack of proper justification for such prior assumptions in general, we often note that the overall effect of the implicit p-value prior on the final inference may be negligible in practice, especially when the sample size is large (We demonstrate this point with numerical experiments in the paper).

For SNP set testing, we show that the variance component score statistic in the popular SKAT model (Wu et al. (2011)) is also monotonic to the approximate Bayes factor in our unified model if the prior effect size under the alternative scenario is assumed small. Interestingly, such prior assumption represents a “local” alternative scenario for which score tests are known to be most powerful.

The above connections are well expected: after all, the frequentist models and the Bayesian representation share the exact same likelihood functions. From the Bayesian point of view, the connections reveal the implicit prior assumptions in the Frequentist inference. These connections also provide a principled way to “translate” the relevant frequentist statistics/p-values into Bayes factors for fine mapping analysis using Bayesian hierarchical models as demonstrated by Maller et al. (2012) and Pickrell (2014).

Advantages of Bayesian Inference

Bayesian Model Averaging

Bayesian model averaging allows simultaneous modeling of multiple alternative scenarios which may not be nested or compatible with each other. One interesting example is in the application of rare variants SNP set testing, where two primary classes of competing models, burden model (assuming most rare causal variants in a SNP set are either consistently deleterious or consistently protective) and SKAT model (assuming the rare causal variants in a SNP set can have bi-directional effects), target complimentary alternative scenarios. In our Bayesian framework, we show that these two types of alternative models correspond to different prior specifications, and a Bayes factor jointly considering the two types of models can be trivially computed by model averaging. A frequentist approach, SKAT-O, proposed by Lee et al. (2012) achieves the similar goal by using a mixture kernel (or prior from the Bayesian perspective). We discuss the subtle theoretical difference between the Bayesian model averaging and the use of SKAT-O prior and show by simulations, the two approaches have similar performance. Moreover, we find that the Bayesian model averaging is more general and flexible. To this end, we demonstrate a Bayesian SNP set testing example where three categories of alternative scenarios are jointly considered: in addition to the two aforementioned rare SNP association models, a common SNP association model is also included for averaging. Such application can be useful for eQTL studies to identify genes harboring cis-eQTLs.

Prior Specification for Genetic Effects

The explicit specification of the prior distributions on genetic effects for alternative models is seemingly a distinct feature of Bayesian inference. However, as we have shown, even the most commonly applied frequentist test statistics can be viewed as resulting from some implicit Bayesian priors. Therefore, it is only natural to regard the prior specification as an integrative component in modeling alternative scenarios. Many authors have shown that it is effective to incorporate functional annotations into genetic association analysis through prior specifications. In addition, we also show that in many practical settings, the desired priors can be sufficiently “learned” from data facilitated by Bayes factors.

Multiple SNP Association Analysis

Built upon the Bayes factor results, we demonstrate an example of multiple SNP fine mapping analysis via Bayesian variable selection in the context of LMM. The advantages of Bayesian variable selection and its comparison to the popular conditional analysis approach have been thoroughly discussed in another recent paper of ours (Wen et al. 2015).

Take-home Message

If single SNP/SNP set association testing is the end point of the analysis, the Bayesian and the commonly applied frequentist approaches yield similar results with very little practical difference. However, going beyond the simple hypothesis testing in genetic association analysis, we believe that the Bayesian approaches possess many unique advantages and is conceptually simple to apply in rather complicated practical settings.

The software/scripts, simulated and real data sets used in the paper are publicly available at github.


1. Wakefield, J. (2009). Bayes factors for genome‐wide association studies: comparison with P‐values. Genetic epidemiology, 33(1), 79-86.
2. Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M., & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1), 82-93.
3. Maller, J. B., McVean, G., Byrnes, J., Vukcevic, D., Palin, K., Su, Z., Wellcome Trust Case Control Consortium., et al. (2012). Bayesian refinement of association signals for 14 loci in 3 common diseases. Nature genetics, 44(12), 1294-1301.
4. Pickrell, J. K. (2014). Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. The American Journal of Human Genetics, 94(4), 559-573.
5. Lee, S., Wu, M. C., & Lin, X. (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13(4), 762-775.
6. Wen, X., Luca, F., & Pique-Regi, R. (2014). Cross-population meta-analysis of eQTLs: fine mapping and functional study. bioRxiv, 008797.

Site-specific amino-acid preferences are mostly conserved in two closely related protein homologs

Site-specific amino-acid preferences are mostly conserved in two closely related protein homologs
Michael B Doud , Orr Ashenberg , Jesse Bloom
doi: http://dx.doi.org/10.1101/018457

Evolution drives changes in a protein’s sequence over time. The extent to which these changes in sequence affect the underlying preferences for each amino acid at each site is an important question with implications for comparative sequence-analysis methods such as molecular phylogenetics. To quantify the extent that site-specific amino-acid preferences change during evolution, we performed deep mutational scanning on two homologs of human influenza nucleoprotein with 94% amino-acid identity. We found that only a small fraction of sites (14 out of 497) exhibited changes in their amino-acid preferences that exceeded the noise in our experiments. Given the limited change in amino-acid preferences between these close homologs, we tested whether our measurements could be used to build site-specific substitution models that describe the evolution of nucleoproteins from more diverse influenza viruses. We found that site-specific evolutionary models informed by our experiments greatly outperformed non-site-specific alternatives in fitting the phylogenies of nucleoproteins from human, swine, equine, and avian influenza. Combining the experimental data from both nucleoprotein homologs improved phylogenetic fit, in part because measurements in multiple genetic contexts better captured the evolutionary average of the amino-acid preferences for sites with changing preferences. Overall, our results show that site-specific amino-acid preferences are sufficiently conserved during evolution that measuring mutational effects in one protein provides information that can improve quantitative evolutionary modeling of nearby homologs.

Elephantid genomes reveal the molecular bases of Woolly Mammoth adaptations to the arctic

Elephantid genomes reveal the molecular bases of Woolly Mammoth adaptations to the arctic
Vincent Lynch , Oscar C. Bedoya-Reina , Aakrosh Ratan , Michael Sulak , Daniela I. Drautz-Moses , George H. Perry , Webb Miller , Stephan C. Schuster
doi: http://dx.doi.org/10.1101/018366

Woolly mammoths and the living elephants are characterized by major phenotypic differences that allowed them to live in very different environments. To identify the genetic changes that underlie the suite of adaptations in woolly mammoths to life in extreme cold, we sequenced the nuclear genome from three Asian elephants and two woolly mammoths, identified and functionally annotated genetic changes unique to the woolly mammoth lineage. We find that genes with mammoth specific amino acid changes are enriched in functions related to circadian biology, skin and hair development and physiology, lipid metabolism, adipose development and physiology, and temperature sensation. Finally we resurrect and functionally test the mammoth and ancestral elephant TRPV3 gene, which encodes a temperature sensitive transient receptor potential (thermoTRP) channel involved in thermal sensation and hair growth, and show that a single mammoth-specific amino acid substitution in an otherwise highly conserved region of the TRPV3 channel strongly affected its temperature sensitivity. Our results have identified a set of genetic changes that likely played important roles in the adaptation of woolly mammoths to life in the high artic.

A high-throughput RNA-seq approach to profile transcriptional responses

A high-throughput RNA-seq approach to profile transcriptional responses

Gregory A Moyerbrailean , Gordon O Davis , Chris T Harvey , Donovan Watza , Xiaoquan Wen , Roger Pique-Regi , Francesca Luca
doi: http://dx.doi.org/10.1101/018416

In recent years, different technologies have been used to measure genome-wide gene expression levels and to study the transcriptome across many types of tissues and in response to in vitro treatments. However, a full understanding of gene regulation in any given cellular and environmental context combination is still missing. This is partly because analyzing tissue/environment-specific gene expression generally implies screening a large number of cellular conditions and samples, without prior knowledge of which conditions are most informative (e.g. some cell types may not respond to certain treatments). To circumvent these challenges, we have established a new two-step high-throughput and cost-effective RNA-seq approach: the first step consists of gene expression screening of a large number of conditions, while the second step focuses on deep sequencing of the most relevant conditions (e.g. largest number of differentially expressed genes). This study design allows for a fast and economical screen in step one, with a more profitable allocation of resources for the deep sequencing of re-pooled libraries in step two. We have applied this approach to study the response to 26 treatments in three lymphoblastoid cell line samples and we show that it is applicable for other high-throughput transcriptome profiling requiring iterative refinement or screening.

Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing

Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing

Saulo A. Aflitos, Elio Schijlen, Richard Finkers, Sandra Smit, Jun Wang, Gengyun Zhang, Ning Li, Likai Mao, Hans de Jong, Freek Bakker, Barbara Gravendeel, Timo Breit, Rob Dirks, Henk Huits, Darush Struss, Ruth Wagner, Hans van Leeuwen, Roeland van Ham, Laia Fito, Laëtitia Guigner, Myrna Sevilla, Philippe Ellul, Eric W. Ganko, Arvind Kapur, Emmanuel Reclus, Bernard de Geus, Henri van de Geest, Bas te Lintel Hekkert, Jan C. Van Haarst, Lars Smits, Andries Koops, Gabino Sanchez Perez, Dick de Ridder, Sjaak van Heusden, Richard Visser, Zhiwu Quan, Jiumeng Min, Li Liao, Xiaoli Wang, Guangbiao Wang, Zhen Yue, Xinhua Yang, Na Xu, Eric Schranz, Eric F. Smets, Rutger A. Vos, Han Rauwerda, Remco Ursem, Cees Schuit, Mike Kerns, Jan van den Berg, Wim H. Vriezen, Antoine Janssen, Torben Jahrman, Frederic Moquet, Julien Bonnet, Sander A. Peters
(Submitted on 21 Apr 2015)

Genetic variation in the tomato clade was explored by sequencing a selection of 84 tomato accessions and related wild species representative for the Lycopersicon, Arcanum, Eriopersicon, and Neolycopersicon groups. We present a reconstruction of three new reference genomes in support of our comparative genome analyses. Sequence diversity in commercial breeding lines appears extremely low, indicating the dramatic genetic erosion of crop tomatoes. This is reflected by the SNP count in wild species which can exceed 10 million i.e. 20 fold higher than in crop accessions. Comparative sequence alignment reveals group, species, and accession specific polymorphisms, which explain characteristic fruit traits and growth habits in tomato accessions. Using gene models from the annotated Heinz reference genome, we observe a bias in dN/dS ratio in fruit and growth diversification genes compared to a random set of genes, which probably is the result of a positive selection. We detected highly divergent segments in wild S. lycopersicum species, and footprints of introgressions in crop accessions originating from a common donor accession. Phylogenetic relationships of fruit diversification and growth specific genes from crop accessions show incomplete resolution and are dependent on the introgression donor. In contrast, whole genome SNP information has sufficient power to resolve the phylogenetic placement of each accession in the four main groups in the Lycopersicon clade using Maximum Likelihood analyses. Phylogenetic relationships appear correlated with habitat and mating type and point to the occurrence of geographical races within these groups and thus are of practical importance for introgressive hybridization breeding. Our study illustrates the need for multiple reference genomes in support of tomato comparative genomics and Solanum genome evolution studies.