Error correction and assembly complexity of single molecule sequencing reads.
Hayan Lee, James Gurtowski, Shinjae Yoo, Shoshana Marcus, W. Richard McCombie, Michael Schatz
Third generation single molecule sequencing technology is poised to revolutionize genomics by enabling the sequencing of long, individual molecules of DNA and RNA. These technologies now routinely produce reads exceeding 5,000 basepairs, and can achieve reads as long as 50,000 basepairs. Here we evaluate the limits of single molecule sequencing by assessing the impact of long read sequencing in the assembly of the human genome and 25 other important genomes across the tree of life. From this, we develop a new data-driven model using support vector regression that can accurately predict assembly performance. We also present a novel hybrid error correction algorithm for long PacBio sequencing reads that uses pre-assembled Illumina sequences for the error correction. We apply it to several prokaryotic and eukaryotic genomes, and show it can achieve near-perfect assemblies of small genomes (< 100 Mbp) and substantially improved assemblies of larger ones. All source code and the assembly model are available open-source.
Bayesian inference of sampled ancestor trees for epidemiology and fossil calibration
Alexandra Gavryushkina, David Welch, Tanja Stadler, Alexei Drummond
(Submitted on 18 Jun 2014)
Phylogenetic analyses which include fossils or molecular sequences that are sampled through time require models that allow one sample to be a direct ancestor of another sample. As previously available phylogenetic inference tools assume that all samples are tips, they do not allow for this possibility. We have developed and implemented a Bayesian Markov Chain Monte Carlo (MCMC) algorithm to infer what we call sampled ancestor trees, that is, trees in which sampled individuals can be direct ancestors of other sampled individuals. We use a family of birth-death models where individuals may remain in the tree process after sampling; in particular, we extend the birth-death skyline model [Stadler et al., 2013] to sampled ancestor trees. This method allows the detection of sampled ancestors as well as estimation of the probability that an individual will be removed from the process when it is sampled. We show that sampled ancestor birth-death models where all samples come from different time points are non-identifiable and thus require one parameter to be known in order to infer other parameters. We apply this method to epidemiological data, where the possibility of sampled ancestors enables us to identify individuals that infected other individuals after being sampled and to infer fundamental epidemiological parameters. We also apply the method to infer divergence times and diversification rates when fossils are included among the species samples, so that fossilisation events are modelled as a part of the tree branching process. Such modelling has many advantages, as argued in the literature. The sampler is available as an open-source BEAST2 package (this https URL ancestors/).
Nanopore Sequencing of the phi X 174 genome
Andrew H. Laszlo, Ian M. Derrington, Brian C. Ross, Henry Brinkerhoff, Andrew Adey, Ian C. Nova, Jonathan M. Craig, Kyle W. Langford, Jenny Mae Samson, Riza Daza, Kenji Doering, Jay Shendure, Jens H. Gundlach
(Submitted on 17 Jun 2014)
Nanopore sequencing of DNA is a single-molecule technique that may achieve long reads, low cost, and high speed with minimal sample preparation and instrumentation. Here, we build on recent progress with respect to nanopore resolution and DNA control to interpret the procession of ion current levels observed during the translocation of DNA through the pore MspA. As approximately four nucleotides affect the ion current of each level, we measured the ion current corresponding to all 256 four-nucleotide combinations (quadromers). This quadromer map is highly predictive of ion current levels of previously unmeasured sequences derived from the bacteriophage phi X 174 genome. Furthermore, we show nanopore sequencing reads of phi X 174 up to 4,500 bases in length that can be unambiguously aligned to the phi X 174 reference genome, and demonstrate proof-of-concept utility with respect to hybrid genome assembly and polymorphism detection. All methods and data are made fully available.
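The quadromer idea in the abstract above can be illustrated with a minimal sketch: enumerate all 4^4 = 256 four-nucleotide combinations, then predict one ion current level per 4-nt window of a sequence by table lookup. The function names, the example sequence, and the placeholder current values are assumptions made for illustration; they are not the authors' code or measured data.

```python
from itertools import product

BASES = "ACGT"


def all_quadromers():
    """Return every four-nucleotide combination (4**4 = 256 of them)."""
    return ["".join(p) for p in product(BASES, repeat=4)]


def predict_levels(sequence, quadromer_map):
    """Slide a 4-nt window along the sequence, looking up one level per window."""
    return [quadromer_map[sequence[i:i + 4]] for i in range(len(sequence) - 3)]


quadromers = all_quadromers()

# Hypothetical placeholder values standing in for measured ion currents.
placeholder_map = {q: float(i) for i, q in enumerate(quadromers)}

# Arbitrary example sequence of 12 bases, giving 12 - 3 = 9 windows.
levels = predict_levels("GAGTTTTATCGC", placeholder_map)
```

A sequence of length n yields n - 3 predicted levels under this simplification, which assumes exactly one level per single-nucleotide step.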
Assessing phenotypic correlation through the multivariate phylogenetic latent liability model
Gabriela B. Cybis, Janet S. Sinsheimer, Trevor Bedford, Alison E. Mather, Philippe Lemey, Marc A. Suchard
(Submitted on 15 Jun 2014)
Understanding which phenotypic traits are consistently correlated throughout evolution is a highly pertinent problem in modern evolutionary biology. Here, we propose a multivariate phylogenetic latent liability model for assessing the correlation between multiple types of data, while simultaneously controlling for their unknown shared evolutionary history informed through molecular sequences. The latent formulation enables us to consider in a single model combinations of continuous traits, discrete binary traits, and discrete traits with multiple ordered and unordered states. Previous approaches have entertained a single data type generally along a fixed history, precluding estimation of correlation between traits and ignoring uncertainty in the history. We implement our model in a Bayesian phylogenetic framework, and discuss inference techniques for hypothesis testing. Finally, we showcase the method through applications to columbine flower morphology, antibiotic resistance in Salmonella, and epitope evolution in influenza.
Identifying the Genetic Basis of Functional Protein Evolution Using Reconstructed Ancestors
Victor Hanson-Smith, Christopher Baker, Alexander Johnson
(Submitted on 11 Jun 2014)
A central challenge in the study of protein evolution is the identification of historic amino acid sequence changes responsible for creating novel functions observed in present-day proteins. To address this problem, we developed a new method to identify and rank amino acid mutations in ancestral protein sequences according to their function-shifting potential. Our approach scans the changes between two reconstructed ancestral sequences in order to find (1) sites with sequence changes that significantly deviate from our model-based probabilistic expectations, (2) sites that demonstrate extreme changes in mutual information, and (3) sites with extreme gains or losses of information content. By taking the overlaps of these statistical signals, the method accurately identifies cryptic evolutionary patterns that are often not obvious when examining only the conservation of modern-day protein sequences. We validated this method with a training set of previously-discovered function-shifting mutations in three essential protein families in animals and fungi, whose evolutionary histories were the prior subject of systematic molecular biological investigation. Our method identified the known function-shifting mutations in the training set with a very low rate of false positive discovery. Further, our approach significantly outperformed other methods that use variability in evolutionary rates to detect functional loci. The accuracy of our approach indicates it could be a useful tool for generating specific testable hypotheses regarding the acquisition of new functions across a wide range of protein families.
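As a rough illustration of signal (3) in the abstract above, the sketch below ranks sites by the change in information content between two reconstructed ancestors. It assumes each site is given as a dictionary of amino acid probabilities and defines information content as log2(20) minus the Shannon entropy; both choices, and all names, are assumptions for illustration and may differ from the authors' implementation.

```python
import math


def information_content(probs, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def rank_sites_by_information_change(anc1, anc2):
    """Return (site index, change in bits), largest absolute change first."""
    deltas = [
        (i, information_content(b) - information_content(a))
        for i, (a, b) in enumerate(zip(anc1, anc2))
    ]
    return sorted(deltas, key=lambda t: abs(t[1]), reverse=True)


# Hypothetical two-site example: site 1 goes from ambiguous (A or G) to certain (G).
older = [{"A": 1.0}, {"A": 0.5, "G": 0.5}]
younger = [{"A": 1.0}, {"G": 1.0}]
ranked = rank_sites_by_information_change(older, younger)
```

Here site 1 gains one bit of information and ranks first, while site 0 is unchanged.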
Accounting for biases in riboprofiling data indicates a major role for proline and not positive amino acids in stalling translation
Carlo G. Artieri, Hunter B. Fraser
The recent advent of ribosome profiling (sequencing of short ribosome-bound fragments of mRNA) has offered an unprecedented opportunity to interrogate the sequence features responsible for modulating translational rates. Nevertheless, numerous analyses of the first riboprofiling dataset have produced equivocal and often incompatible results. Here we analyze three independent yeast riboprofiling data sets, including two with much higher coverage than previously available, and find that all three show substantial technical sequence biases that confound interpretations of ribosomal occupancy. After accounting for these biases, we find no effect of previously implicated factors on ribosomal pausing. Rather, we find that incorporation of proline, whose unique side-chain stalls peptide synthesis in vitro, also slows the ribosome in vivo. We also reanalyze a recent method that reported positively charged amino acids as the major determinant of ribosomal stalling and demonstrate that its assumptions lead to false signals of stalling in low-coverage data. Our results suggest that any analysis of riboprofiling data should account for sequencing biases and sparse coverage. To this end, we establish a robust methodology that enables analysis of ribosome profiling data without prior assumptions regarding which positions spanned by the ribosome cause stalling.
The rugged adaptive landscape of an emerging plant RNA virus
Jasna Lalic, Santiago F. Elena
RNA viruses are the main source of emerging infectious diseases owing to the evolutionary potential bestowed by their fast replication, large population sizes and high mutation and recombination rates. However, an equally important parameter, which is usually neglected, is the topography of the fitness landscape, that is, how many fitness maxima exist and how well connected they are, which determines the number of accessible evolutionary pathways. To address this question, we have reconstructed the fitness landscape describing the adaptation of Tobacco etch potyvirus to its new host, Arabidopsis thaliana. Fitness was measured for most of the genotypes in the landscape, showing the existence of peaks and holes. We found prevailing epistatic effects between mutations, with cases of reciprocal sign epistasis being common at later stages. Our results therefore suggest that the landscape was rugged and holey, with several local fitness peaks and a very limited number of potential neutral paths. The viral genotype fixed at the end of the evolutionary process was not on the global fitness optimum but stuck on a suboptimal peak.
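The two landscape features named above, local fitness peaks and reciprocal sign epistasis, can be made concrete with a small sketch. Genotypes are encoded as tuples of 0/1 alleles, and the fitness values are invented for illustration; none of this is the authors' data or code.

```python
def neighbors(genotype):
    """Yield every genotype that differs from the input at exactly one locus."""
    for i in range(len(genotype)):
        yield genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]


def local_peaks(fitness):
    """Genotypes fitter than all of their measured single-mutation neighbors."""
    return [
        g for g, w in fitness.items()
        if all(w > fitness[n] for n in neighbors(g) if n in fitness)
    ]


def reciprocal_sign_epistasis(w00, w10, w01, w11):
    """True if each mutation's effect changes sign with the other locus's state."""
    a_flips = (w10 - w00) * (w11 - w01) < 0
    b_flips = (w01 - w00) * (w11 - w10) < 0
    return a_flips and b_flips


# Hypothetical two-locus landscape: both single mutants are less fit than
# the wild type (0, 0) and the double mutant (1, 1).
toy = {(0, 0): 1.0, (1, 0): 0.8, (0, 1): 0.7, (1, 1): 1.2}
peaks = local_peaks(toy)
is_rse = reciprocal_sign_epistasis(toy[(0, 0)], toy[(1, 0)], toy[(0, 1)], toy[(1, 1)])
```

In this toy landscape both (0, 0) and (1, 1) are local peaks separated by a fitness valley, so a population sitting on the lower peak cannot reach the higher one through single beneficial mutations.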
Genome-wide Identification of Zero Nucleotide Recursive Splicing in Drosophila
Michael O Duff, Sara Olson, Xintao Wei, Ahmad Osman, Alex Plocik, Mohan Bolisetty, Susan Celniker, Brenton Graveley
Recursive splicing is a process in which large introns are removed in multiple steps by resplicing at ratchet points – 5′ splice sites recreated after splicing. Recursive splicing was first identified in the Drosophila Ultrabithorax (Ubx) gene and only three additional Drosophila genes have since been experimentally shown to undergo recursive splicing. Here, we identify 196 zero nucleotide exon ratchet points in 130 introns of 115 Drosophila genes from total RNA sequencing data generated from developmental time points, dissected tissues, and cultured cells. Recursive splicing events were identified by splice junctions that map to annotated 5′ splice sites and unannotated intronic 3′ splice sites, the presence of the sequence AG/GT at the 3′ splice site, and a 5′ to 3′ gradient of decreasing RNA-Seq read density indicative of co-transcriptional splicing. The sequential nature of recursive splicing was confirmed by identification of lariat introns generated by splicing to and from the ratchet points. We also show that recursive splicing is a constitutive process, and that the sequence and function of ratchet points are evolutionarily conserved. Together these results indicate that recursive splicing is commonly used in Drosophila and provides insight into the mechanisms by which some introns are removed.
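One of the criteria above, the AG/GT sequence at the intronic 3′ splice site, can be sketched as a simple motif scan. This covers only the sequence criterion; the abstract also relies on splice junction reads and a read density gradient, which are not modelled here. The function name and example sequence are hypothetical.

```python
import re


def candidate_ratchet_points(intron_seq):
    """Return 0-based junction positions where AG is immediately followed by GT.

    The reported position is the index of the first base after the AG,
    i.e. the G of the regenerated GT. A lookahead pattern is used so that
    overlapping occurrences would also be reported.
    """
    seq = intron_seq.upper()
    return [m.start() + 2 for m in re.finditer(r"(?=AGGT)", seq)]


# Hypothetical intron fragment with two AG/GT occurrences.
positions = candidate_ratchet_points("cccAGGTAAGTTTAGGTcc")
```

A motif match alone is a weak signal, since AGGT occurs frequently by chance in long introns, which is presumably why the abstract combines it with the other two lines of evidence.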
Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels.
Nicholas E Banovich, Xun Lan, Graham McVicker, Bryce Van de Geijn, Jacob F Degner, John D. Blischak, Jonathan K. Pritchard, Yoav Gilad
DNA methylation is an important epigenetic regulator of gene expression. Recent studies have revealed widespread associations between genetic variation and methylation levels. However, the mechanistic links between genetic variation and methylation remain unclear. To begin addressing this gap, we collected methylation data at ~300,000 loci in lymphoblastoid cell lines (LCLs) from 64 HapMap Yoruba individuals, and genome-wide bisulfite sequence data in ten of these individuals. We identified (at an FDR of 10%) 11,752 methylation QTLs (meQTLs), i.e., loci in which genetic variation is associated with changes in DNA methylation. We found that meQTLs are frequently associated with changes in methylation at multiple CpGs across regions of up to 3 kb. Interestingly, meQTLs are also frequently associated with variation in other properties of gene regulation, including histone modifications, DNase I accessibility, chromatin accessibility, and expression levels of nearby genes. These observations suggest that genetic variants may lead to coordinated molecular changes in all of these regulatory phenotypes. One plausible driver of coordinated changes in different regulatory mechanisms is variation in transcription factor (TF) binding. Indeed, we found that SNPs that change predicted TF binding affinities are significantly enriched for associations with DNA methylation at nearby CpGs. Taken together, our observations are consistent with a model whereby changes in TF binding may frequently drive coordinated changes in DNA methylation, histone modification, and gene expression levels.
Validation of methods for Low-volume RNA-seq
Peter Acuña Combs, Michael B Eisen
Recently, a number of protocols extending RNA-sequencing to the single-cell regime have been published. However, we were concerned that the additional steps to deal with such minute quantities of input sample would introduce serious biases that would make analysis of the data using existing approaches invalid. In this study, we performed a critical evaluation of several of these low-volume RNA-seq protocols, and found that they performed slightly less well in metrics of interest to us than a more standard protocol, but with at least two orders of magnitude less sample required. We also explored a simple modification to one of these protocols that, for many samples, reduced the cost of library preparation to approximately $20/sample.