Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James Drake, Jane M Landolin, Adam M Phillippy
doi: http://dx.doi.org/10.1101/008003

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
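The MinHash idea behind this kind of overlap detection can be sketched as follows. This is a minimal illustration of estimating k-mer Jaccard similarity from fixed-size sketches, not the MHAP implementation: the function names, the use of `blake2b` as the hash family, and the default values of `k` and `num_hashes` are all assumptions made for the example.

```python
import hashlib

def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed (assumed hash family)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_sketch(seq, k=16, num_hashes=256):
    """For each seeded hash function, keep the minimum hash value over all k-mers.

    Requires len(seq) >= k; otherwise there are no k-mers to hash.
    """
    ks = kmers(seq, k)
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]

def estimate_jaccard(sketch_a, sketch_b):
    """The fraction of positions where two sketches agree estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)
```

Two reads whose sketches agree at many positions are likely to share many k-mers, so candidate overlaps can be found by comparing short sketches instead of full sequences. A real overlapper would also need to handle reverse complements and sequencing errors, which this sketch ignores.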

Changes in epistatic interactions in the long-term evolution of HIV-1 protease

Aditi Gupta, Christoph Adami
(Submitted on 12 Aug 2014)

The human immunodeficiency virus type 1 (HIV-1) is evolving to keep up with a changing fitness landscape, due to the various drugs introduced to stop the virus’s replication. As the virus adapts, the information the virus encodes about its environment must change, and this change is reflected in the amino-acid composition of proteins, as well as changes in viral RNAs, binding sites, and splice sites. Information can also be encoded in the interaction between residues in a single protein as well as across proteins, leading to a change in the epistatic patterns that can affect how the virus can change in the future. Measuring epistasis usually requires fitness measurements that are difficult to obtain at high throughput. Here we show that epistasis can be inferred from the pair-wise information between residues, and study how epistasis and information have changed over the long term. Using HIV-1 protease sequence data from public databases covering the years 1998-2006 (from both treated and untreated subjects), we show that drug treatment has increased the protease’s per-site entropies on average. At the same time, the sum of mutual entropies across all pairs of residues within the protease shows a significant increase over the years, indicating an increase in epistasis in response to treatment, a trend not seen within sequences from untreated subjects. Our findings suggest that information theory can be an important tool to study long-term trends in the evolution of macromolecules.
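The per-site entropy and pairwise mutual information described above can be sketched for a small alignment as follows. This is a plain plug-in estimator written for illustration; the function names are hypothetical, and the sketch applies no finite-sample bias correction, which an analysis of real sequence data would probably need.

```python
from collections import Counter
from math import log2

def column(alignment, i):
    """Residues observed at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]

def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def mutual_information(alignment, i, j):
    """I(i; j) = H(i) + H(j) - H(i, j) for two alignment columns."""
    col_i, col_j = column(alignment, i), column(alignment, j)
    joint = list(zip(col_i, col_j))
    return entropy(col_i) + entropy(col_j) - entropy(joint)

def total_pairwise_information(alignment):
    """Sum of mutual information over all unordered pairs of positions."""
    length = len(alignment[0])
    return sum(
        mutual_information(alignment, i, j)
        for i in range(length)
        for j in range(i + 1, length)
    )
```

Two positions that always vary together carry high mutual information, while positions that vary independently carry none, which is why the sum over pairs can serve as a proxy for the amount of epistasis.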

A codon model of nucleotide substitution with selection on synonymous codon usage

Laura Kubatko, Premal Shah, Radu Herbei, Michael Gilchrist
doi: http://dx.doi.org/10.1101/007849

The quality of phylogenetic inference made from protein-coding genes depends, in part, on the realism with which the codon substitution process is modeled. Here we propose a new mechanistic model that combines the standard M0 substitution model of Yang (1997) with a simplified model from Gilchrist (2007) that includes selection on synonymous substitutions as a function of codon-specific nonsense error rates. We tested the newly proposed model by applying it to 104 protein-coding genes in brewer’s yeast, and compared the fit of the new model to the standard M0 model and to the mutation-selection model of Yang and Nielsen (2008) using the AIC. Our new model provided significantly better fit in approximately 85% of the cases considered for the basic M0 model and in approximately 25% of the cases for the M0 model with estimated codon frequencies, but only in a few cases when the mutation-selection model was considered. However, our model includes a parameter that can be interpreted as a measure of the rate of protein production, and the estimates of this parameter were highly correlated with an independent measure of protein production for the yeast genes considered here. Finally, we found that in some cases the new model led to the preference of a different phylogeny for a subset of the genes considered, indicating that substitution model choice may have an impact on the estimated phylogeny.
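For readers unfamiliar with the model comparison used here, the AIC is computed as 2k - 2 ln L, where k is the number of free parameters and L is the maximised likelihood; the model with the lowest AIC is preferred. A minimal sketch, with hypothetical function names and a made-up input layout:

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood

def preferred_model(fits):
    """Return the name of the model with the lowest AIC.

    fits maps a model name to a (log_likelihood, num_params) tuple.
    """
    return min(fits, key=lambda name: aic(*fits[name]))
```

The penalty term means a model with more parameters is preferred only if it improves the log-likelihood by more than one unit per added parameter.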

Efficient Bayesian mixed model analysis increases association power in large cohorts

Po-Ru Loh, George Tucker, Brendan K Bulik-Sullivan, Bjarni J Vilhjalmsson, Hilary K Finucane, Daniel I Chasman, Paul M Ridker, Benjamin M Neale, Bonnie Berger, Nick Patterson, Alkes L Price
doi: http://dx.doi.org/10.1101/007799

Linear mixed models are a powerful statistical tool for identifying genetic associations and avoiding confounding. However, existing methods are computationally intractable in large cohorts, and may not optimize power. All existing methods require time cost O(MN^2) (where N = #samples and M = #SNPs) and implicitly assume an infinitesimal genetic architecture in which effect sizes are normally distributed, which can limit power. Here, we present a far more efficient mixed model association method, BOLT-LMM, which requires only a small number of O(MN) iterations and increases power by modeling more realistic, non-infinitesimal genetic architectures via a Bayesian mixture prior on marker effect sizes. We applied BOLT-LMM to nine quantitative traits in 23,294 samples from the Women’s Genome Health Study (WGHS) and observed significant increases in power, consistent with simulations. Theory and simulations show that the boost in power increases with cohort size, making BOLT-LMM appealing for GWAS in large cohorts.

Genome-wide predictability of restriction sites across the eukaryotic tree of life

Santiago Herrera, Paula H. Reyes-Herrera, Timothy M. Shank
doi: http://dx.doi.org/10.1101/007781

High-throughput sequencing of reduced representation libraries obtained through digestion with restriction enzymes, generally known as restriction-site associated DNA sequencing (RAD-seq), is now one of the most commonly used strategies to generate single nucleotide polymorphism data in eukaryotes. The choice of restriction enzyme is critical for the design of any RAD-seq study as it determines the number of genetic markers that can be obtained for a given species, and ultimately the success of a project. In this study we tested the hypothesis that genome composition, in terms of GC content, mono-, di- and trinucleotide compositions, can be used to predict the number of restriction sites for a given combination of restriction enzyme and genome. We performed systematic in silico genome-wide surveys of restriction sites across the eukaryotic tree of life and compared them with expectations generated from stochastic models based on genome compositions using the newly developed software pipeline PredRAD (https://github.com/phrh/PredRAD). Our analyses reveal that in most cases the trinucleotide genome composition model is the best predictor, and the GC content and mononucleotide models are the worst predictors of the expected number of restriction sites in a eukaryotic genome. However, we argue that the predictability of restriction site frequencies in eukaryotic genomes needs to be treated on a case-specific basis, because the phylogenetic position of the taxon of interest and the specific recognition sequence of the selected restriction enzyme are the most determinant factors. The results from this study, and the software developed, will help guide the design of any study using RAD sequencing and related methods.
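The simplest of the composition models mentioned above, the mononucleotide model, can be sketched as follows: the probability of a recognition sequence is taken as the product of its base frequencies, and the expected count is that probability times the number of possible start positions. This is an illustration only and is not taken from PredRAD; the function names are hypothetical, and the sketch scans a single strand, which is sufficient only for palindromic recognition sequences.

```python
from collections import Counter

def base_frequencies(genome):
    """Empirical frequency of each of A, C, G, T, ignoring any other symbols."""
    counts = Counter(b for b in genome.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def expected_sites(genome_length, site, freqs):
    """Expected number of occurrences of site under independent base draws."""
    p = 1.0
    for b in site.upper():
        p *= freqs[b]
    return p * (genome_length - len(site) + 1)

def observed_sites(genome, site):
    """Count occurrences of site in genome, allowing overlaps."""
    genome, site = genome.upper(), site.upper()
    return sum(genome.startswith(site, i) for i in range(len(genome) - len(site) + 1))
```

Comparing `observed_sites` with `expected_sites` for a given enzyme and genome gives a rough sense of how far the genome departs from the independent-base assumption. Di- and trinucleotide models would replace the product of single-base frequencies with a Markov chain over adjacent bases.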

Postmating reproductive barriers contribute to the incipient sexual isolation of US and Caribbean Drosophila melanogaster

Joyce Y Kao, Seana Lymer, Sea H Hwang, Albert Sung, Sergey V Nuzhdin
doi: http://dx.doi.org/10.1101/007765

The nascent stages of speciation start with the emergence of sexual isolation. Understanding how reproductive barriers influence this evolutionary process is an ongoing effort. We present here a study of Drosophila melanogaster populations from the southeast United States and Caribbean islands undergoing incipient sexual isolation. The existence of premating reproductive barriers has been previously established, but they do not fully account for the degree of isolation present. To assess the influence of postmating barriers, we investigated putative postmating barriers of female remating and egg laying behavior, as well as hatchability of eggs laid and female longevity after mating. While we did not find any effects in female remating or egg laying, we did observe lower hatchability in the central region of our geographical spread as well as shortened female life spans after mating to genetically different males in females originating from the northern- and southernmost locations of those surveyed. These results serve as evidence that long-term consequences after mating such as the fitness of offspring and shortened lifespan have a stronger effect than short-term postmating behaviors.

Benchmarking undedicated cloud computing providers for analysis of genomic datasets.

Seyhan Yazar, George EC Gooden, David A Mackey, Alex Hewitt
doi: http://dx.doi.org/10.1101/007724

A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E. coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95%CI: 27.5-78.2) for E. coli and 53.5% (95%CI: 34.4-72.6) for the human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with the costs on EMR being 257.3% (95%CI: 211.5-303.1) and 173.9% (95%CI: 134.6-213.1) more expensive for E. coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

Can the site-frequency spectrum distinguish exponential population growth from multiple-merger coalescents?

Matthias Birkner, Jochen Blath, Bjarki Eldon, Fabian Freund
doi: http://dx.doi.org/10.1101/007690

The ability of the site-frequency spectrum (SFS) to reflect the particularities of gene genealogies exhibiting multiple mergers of ancestral lines as opposed to those obtained in the presence of exponential population growth is our focus. An excess of singletons is a well-known characteristic of both population growth and multiple mergers. Other aspects of the SFS, in particular the weight of the right tail, are, however, affected in specific ways by the two model classes. Using minimum-distance statistics, and an approximate likelihood method, our estimates of statistical power indicate that exponential growth can indeed be distinguished from multiple merger coalescents, even for moderate sample size, if the number of segregating sites is high enough. Additionally, we use a normalised version of the SFS as a summary statistic in an approximate Bayesian computation (ABC) approach to distinguish multiple mergers from exponential population growth. The ABC approach gives further positive evidence as to the general eligibility of the SFS to distinguish between the different histories, but also reveals that suitable weighting of parts of the SFS can improve the distinction ability. The important issue of the difference in timescales between different coalescent processes (and their implications for the scaling of mutation parameters) is also discussed.
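As a concrete picture of the summary statistic discussed here, the following sketch computes an unfolded SFS from a 0/1 haplotype matrix, normalises it, and gives the normalised expectation under the standard constant-size Kingman coalescent, where the expected number of sites with i derived copies is proportional to 1/i. The function names and the input layout are assumptions made for the example; neither exponential growth nor multiple-merger expectations are implemented.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS from a list of equal-length 0/1 lists, one per sampled chromosome.

    Entry i-1 of the result counts sites where the derived allele (1) appears
    in exactly i of the n samples, for i = 1..n-1. Fixed sites are excluded.
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):
        count = sum(site)
        if 0 < count < n:
            sfs[count - 1] += 1
    return sfs

def normalised(sfs):
    """Divide each class by the total number of segregating sites."""
    total = sum(sfs)
    return [x / total for x in sfs] if total else [0.0] * len(sfs)

def kingman_expected_normalised(n):
    """Normalised expected SFS under the constant-size Kingman coalescent (1/i weights)."""
    weights = [1 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]
```

An excess in the first entry of `normalised(sfs)` relative to `kingman_expected_normalised(n)` corresponds to the excess of singletons mentioned above; distinguishing the two model classes depends on how the remaining entries, especially the right tail, deviate.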

Calling genotypes from public RNA-sequencing data enables identification of genetic variants that affect gene-expression levels

Patrick Deelen, Daria Zhernakova, Mark de Haan, Marijke van der Sijde, Marc Jan Bonder, Juha Karjalainen, K. Joeri van der Velde, Kristin M. Abbott, Jingyuan Fu, Cisca Wijmenga, Richard J. Sinke, Morris A. Swertz, Lude Franke
doi: http://dx.doi.org/10.1101/007633

Given increasing numbers of RNA-seq samples in the public domain, we studied to what extent expression quantitative trait loci (eQTLs) and allele-specific expression (ASE) can be identified in public RNA-seq data while also deriving the genotypes from the RNA-seq reads. 4,978 human RNA-seq runs, representing many different tissues and cell-types, passed quality control. Even though these data originated from many different laboratories, samples reflecting the same cell-type clustered together, suggesting that technical biases due to different sequencing protocols were limited. We derived genotypes from the RNA-seq reads and imputed non-coding variants. In a joint analysis on 1,262 samples combined, we identified cis-eQTL effects for 8,034 unique genes. Additionally, we observed strong ASE effects for 34 rare pathogenic variants, corroborating previously observed effects on the corresponding protein levels. Given the exponential growth of the number of publicly available RNA-seq samples, we expect this approach will become relevant for studying tissue-specific effects of rare pathogenic genetic variants.

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences

N. Ari Wijetunga, Fabien Delahaye, Yong Mei Zhao, Aaron Golden, Jessica C Mar, Francine H. Einstein, John M. Greally
doi: http://dx.doi.org/10.1101/007591

The mechanism and significance of epigenetic variability in the same cell type between healthy individuals are not clear. Here, we purify human CD34+ hematopoietic stem and progenitor cells (HSPCs) from different individuals and find that there is increased variability of DNA methylation at loci with properties of promoters and enhancers. The variability is especially enriched at candidate enhancers near genes transitioning between silent and expressed states, and encoding proteins with leukocyte differentiation properties. Our findings of increased variability at loci with intermediate DNA methylation values, at candidate “poised” enhancers, and at genes involved in HSPC lineage commitment suggest that CD34+ cell subtype heterogeneity between individuals is a major mechanism for the variability observed. Epigenomic studies performed on cell populations, even when purified, are testing collections of epigenomes, or meta-epigenomes. Our findings show that meta-epigenomic approaches to data analysis can provide insights into cell subpopulation structure.