Sequence capture of ultraconserved elements from bird museum specimens

John McCormack, Whitney L.E. Tsai, Brant C. Faircloth
doi: http://dx.doi.org/10.1101/020271

New DNA sequencing technologies are allowing researchers to explore the genomes of the millions of natural history specimens collected prior to the molecular era. Yet, we know little about how well specific next-generation sequencing (NGS) techniques work with the degraded DNA typically extracted from museum specimens. Here, we use one type of NGS approach, sequence capture of ultraconserved elements (UCEs), to collect data from bird museum specimens as old as 120 years. We targeted approximately 5,000 UCE loci in 27 Western Scrub-Jays (Aphelocoma californica) representing three evolutionary lineages, and we collected an average of 3,749 UCE loci containing 4,460 single nucleotide polymorphisms (SNPs). Although older specimens generally produced fewer and shorter loci, we collected thousands of markers from even the oldest specimens. Additional sequencing reads per individual helped to boost the number of UCE loci we recovered from older specimens, but were less effective at increasing locus length. We detected contamination in some samples and found that contamination was more prevalent in older samples that received less sequencing. In the phylogeny generated from concatenated UCE loci, contamination led to the incorrect placement of some individuals. In contrast, a species tree constructed from SNPs called within UCE loci correctly placed individuals into three monophyletic groups, perhaps because of the stricter analytical procedures we used for SNP calling. This study and other recent studies on the genomics of museum specimens have profound implications for natural history collections, where millions of older specimens should now be considered genomic resources.
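
As a point of reference for the concatenated analysis mentioned above, here is a minimal sketch of building a supermatrix from per-locus alignments. This is not the authors' pipeline (toolkits such as phyluce handle this step in practice), and the directory and file names are hypothetical:

```python
# Illustrative sketch: build a concatenated supermatrix from per-locus
# FASTA alignments, padding taxa missing from a locus with gaps.
import glob
from collections import defaultdict

def read_fasta(path):
    """Parse a FASTA alignment into {taxon: sequence}."""
    seqs, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name:
                seqs[name].append(line)
    return {taxon: "".join(parts) for taxon, parts in seqs.items()}

loci = [read_fasta(p) for p in sorted(glob.glob("uce_loci/*.fasta"))]
taxa = sorted({taxon for locus in loci for taxon in locus})

supermatrix = defaultdict(str)
for locus in loci:
    if not locus:
        continue
    length = len(next(iter(locus.values())))
    for taxon in taxa:
        # Pad taxa absent from this locus with gaps to keep rows aligned
        supermatrix[taxon] += locus.get(taxon, "-" * length)

with open("supermatrix.fasta", "w") as out:
    for taxon in taxa:
        out.write(f">{taxon}\n{supermatrix[taxon]}\n")
```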

The next 20 years of genome research

Michael Schatz
doi: http://dx.doi.org/10.1101/020289

The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look ahead over the next 20 years, we see the very real potential for sequencing more than one billion genomes, bringing with it even deeper insights into human genetics as well as the genetics of millions of other species on the planet. Realizing this great potential, though, will require the integration and development of highly scalable computational and quantitative approaches that can keep pace with rapid improvements in biotechnology. In this perspective, we aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world.

Pangenome-wide and molecular evolution analyses of the Pseudomonas aeruginosa species

Jeanneth Mosquera-Rendón, Ana M. Rada-Bravo, Sonia Cárdenas-Brito, Mauricio Corredor, Eliana Restrepo-Pineda, Alfonso Benítez-Páez
doi: http://dx.doi.org/10.1101/020305

Background. Drug treatments and vaccine designs against the opportunistic human pathogen Pseudomonas aeruginosa face multiple obstacles, all associated with the diverse genetic traits present in this pathogen, ranging from multi-drug resistance genes to the molecular machinery for biofilm biosynthesis. Several candidate vaccines against P. aeruginosa have been developed that target outer membrane proteins; however, major issues arise when attempting to establish complete protection against this pathogen because of its presumed genotypic variation at the strain level. To shed light on this concern, we designed this study to assess the P. aeruginosa pangenome and its molecular evolution across multiple strains. Results. The P. aeruginosa pangenome was estimated to contain almost 17,000 non-redundant genes, of which approximately 15% constituted the core genome. Functional analyses of the accessory genome indicated a wide presence of genetic elements directly associated with pathogenicity. An in-depth molecular evolution analysis revealed the full landscape of selection forces acting on the P. aeruginosa pangenome, in which purifying selection drives the evolution of this human pathogen's genome. We also detected distinctive positive selection in a wide variety of outer membrane proteins, with the data supporting the concept of substantial genetic variation in proteins probably recognized as antigens. Drawing on the evolutionary information of genes under strong positive selection, we designed a new Multi-Locus Sequence Typing assay for informative, rapid, and cost-effective genotyping of P. aeruginosa clinical isolates. Conclusions. We report an unprecedented large-scale pangenome characterization of P. aeruginosa, comprising almost 200 bacterial genomes from a single species, together with a molecular evolutionary analysis at the pangenome scale. The evolutionary information presented here provides a clear explanation of the issues associated with using protein conjugates from pili, flagella, or secretion systems as antigens for vaccine design, as these proteins exhibit high genetic variation in terms of non-synonymous substitutions among P. aeruginosa strains.
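
To make the core/accessory split above concrete, here is a minimal sketch that classifies gene clusters from a presence/absence matrix. The file name, format, and the 95% core threshold are common conventions assumed for illustration, not details from the paper:

```python
# Illustrative sketch: split pangenome gene clusters into core and
# accessory sets from a tab-separated presence/absence matrix
# (rows = gene clusters, columns = genomes).
import csv

def classify_pangenome(matrix_path, core_fraction=0.95):
    core, accessory = [], []
    with open(matrix_path) as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        n_genomes = len(header) - 1   # first column holds the cluster ID
        for row in reader:
            cluster_id, presence = row[0], row[1:]
            n_present = sum(cell not in ("", "0") for cell in presence)
            if n_present >= core_fraction * n_genomes:
                core.append(cluster_id)
            else:
                accessory.append(cluster_id)
    return core, accessory

core, accessory = classify_pangenome("gene_presence_absence.tsv")
print(f"core: {len(core)}  accessory: {len(accessory)}")
```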

Pyvolve: a flexible Python module for simulating sequences along phylogenies

Stephanie J Spielman, Claus O Wilke
doi: http://dx.doi.org/10.1101/020214
We introduce Pyvolve, a flexible Python module for simulating genetic data along a phylogeny according to continuous-time Markov models of sequence evolution. Pyvolve incorporates most standard models of nucleotide, amino-acid, and codon sequence evolution, and it allows users to fully customize all model parameters. Pyvolve additionally allows users to specify custom evolutionary models and incorporates several novel features, including a rate-matrix scaling algorithm and branch-length perturbations. Easily incorporated into Python bioinformatics pipelines, Pyvolve represents a convenient and flexible alternative to third-party simulation software. Pyvolve is an open-source project available, along with a detailed user manual, under a FreeBSD license from https://github.com/sjspielman/pyvolve. API documentation is available from http://sjspielman.org/pyvolve.
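
A minimal example in the spirit of the module's documented quick-start (argument names follow the user manual at the time of this release; consult the current documentation if they have changed):

```python
# Simulate a 250-column nucleotide alignment along a four-taxon tree
# under an HKY85-style model.
import pyvolve

# Newick tree with branch lengths along which sequences will evolve
tree = pyvolve.read_tree(tree="(((t1:0.1,t2:0.1):0.2,t3:0.3):0.1,t4:0.4);")

# Nucleotide model; the kappa value is an arbitrary illustrative choice
model = pyvolve.Model("nucleotide", {"kappa": 2.75})

# A partition of 250 evolving positions under this model
partition = pyvolve.Partition(models=model, size=250)

# Run the simulation; by default the alignment is written to
# simulated_alignment.fasta in the working directory
evolver = pyvolve.Evolver(tree=tree, partitions=partition)
evolver()
```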

OPERA-LG: Efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees

Song Gao, Denis Bertrand, Niranjan Nagarajan
doi: http://dx.doi.org/10.1101/020230

The assembly of large, repeat-rich eukaryotic genomes continues to represent a significant challenge in genomics. While long-read technologies have made the high-quality assembly of small, microbial genomes increasingly feasible, data generation can be prohibitively expensive for larger genomes. Fundamental advances in assembly algorithms are thus essential to exploit the characteristics of short- and long-read sequencing technologies and to consistently and reliably provide high-quality assemblies in a cost-efficient manner. Here we present a scalable, exact algorithm (OPERA-LG) for the scaffold assembly of large, repeat-rich genomes that exhibits almost an order of magnitude improvement over state-of-the-art programs in both correctness (>5X on average) and contiguity (>10X). This provides a systematic approach for combining data from different sequencing technologies, as well as a rigorous framework for scaffolding of repetitive sequences. OPERA-LG represents the first in a new class of algorithms that can efficiently assemble large genomes while providing formal guarantees about assembly quality, providing an avenue for the systematic augmentation and improvement of thousands of existing draft eukaryotic genome assemblies.
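
For readers unfamiliar with scaffolding, the toy sketch below shows the shape of the problem: contigs become nodes, and mate pairs spanning two contigs vote for adjacencies. This greedy heuristic is emphatically not OPERA-LG's algorithm, which searches scaffold orderings exactly with formal guarantees; the support threshold is an arbitrary illustrative choice:

```python
# Toy sketch of scaffolding from paired-read links.
from collections import Counter, defaultdict

def pick_adjacencies(links, min_support=3):
    """links: iterable of (contig_a, contig_b) pairs implied by mate pairs."""
    support = Counter(tuple(sorted(pair)) for pair in links)
    degree = defaultdict(int)
    kept = []
    for (a, b), n in support.most_common():
        if n < min_support:
            break                     # most_common() is sorted, so stop here
        if degree[a] < 2 and degree[b] < 2:
            kept.append((a, b, n))    # a contig joins at most two neighbors
            degree[a] += 1
            degree[b] += 1
    return kept

links = [("c1", "c2")] * 5 + [("c2", "c3")] * 4 + [("c1", "c3")]
print(pick_adjacencies(links))  # [('c1', 'c2', 5), ('c2', 'c3', 4)]
```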

Approximately independent linkage disequilibrium blocks in human populations

Tomaz Berisa, Joseph K. Pickrell
doi: http://dx.doi.org/10.1101/020255
We present a method to identify approximately independent blocks of linkage disequilibrium (LD) in the human genome. These blocks enable automated analysis of multiple genome-wide association studies.
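
As a rough illustration of the underlying idea (not the authors' method), one can scan along a chromosome and start a new block wherever LD with the current block has decayed below a cutoff; the cutoff and window size below are arbitrary:

```python
# Hypothetical sketch of finding approximately independent LD blocks.
import numpy as np

def ld_blocks(genotypes, r2_cutoff=0.01, window=50):
    """genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts.

    Returns indices of SNPs that start a new, approximately independent
    block. Assumes no monomorphic SNPs (corrcoef would yield NaN there).
    """
    r2 = np.corrcoef(genotypes.T) ** 2   # pairwise SNP r^2
    breakpoints, start = [0], 0
    for j in range(1, genotypes.shape[1]):
        recent = range(max(start, j - window), j)
        if max(r2[j, i] for i in recent) < r2_cutoff:
            breakpoints.append(j)        # LD has decayed; open a new block
            start = j
    return breakpoints
```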

Learning quantitative sequence-function relationships from high-throughput biological data

Gurinder S Atwal, Justin B Kinney
doi: http://dx.doi.org/10.1101/020172

Understanding the transcriptional regulatory code, as well as other types of information encoded within biomolecular sequences, will require learning biophysical models of sequence-function relationships from high-throughput data. Controlling and characterizing the noise in such experiments, however, is notoriously difficult. The unpredictability of such noise creates problems for standard likelihood-based methods in statistical learning, which require that the quantitative form of experimental noise be known precisely. However, when this unpredictability is properly accounted for, important theoretical aspects of statistical learning which remain hidden in standard treatments are revealed. Specifically, one finds a close relationship between the standard inference method, based on likelihood, and an alternative inference method based on mutual information. Here we review and extend this relationship. We also describe its implications for learning sequence-function relationships from real biological data. Finally, we detail an idealized experiment in which these results can be demonstrated analytically.
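
A toy sketch of the mutual-information-based objective discussed above: score model predictions against noisy measurements by the plug-in mutual information of a two-dimensional histogram. The binning scheme is an illustrative assumption, not a prescription from the paper:

```python
# Plug-in mutual information estimate from a 2-D histogram, the
# information-based alternative to likelihood discussed above.
import numpy as np

def mutual_information(predictions, measurements, bins=20):
    joint, _, _ = np.histogram2d(predictions, measurements, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal over measurements
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal over predictions
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
noisy = x + rng.normal(scale=1.0, size=x.size)  # simulated noisy readout
print(mutual_information(x, noisy), "bits")
```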

Most viewed on Haldane’s Sieve: May 2015

The most viewed posts on Haldane’s Sieve in May 2015 were:

Optimizing error correction of RNAseq reads

Matthew D MacManes
doi: http://dx.doi.org/10.1101/020123

Motivation: The correction of sequencing errors contained in Illumina reads derived from genomic DNA is a common pre-processing step in many de novo genome assembly pipelines and has been shown to improve the quality of resultant assemblies. In contrast, the correction of errors in transcriptome sequence data is much less common, but can potentially yield similar improvements in mapping and assembly quality. This manuscript evaluates the ability of several popular read-correction tools to correct sequence errors common in transcriptome-derived Illumina reads. Results: I evaluated the efficacy of correction of transcriptome-derived sequencing reads using several metrics across a variety of sequencing depths. This evaluation demonstrates a complex relationship between the quality of the correction, depth of sequencing, and hardware availability, which results in variable recommendations depending on the goals of the experiment, tolerance for false positives, and depth of coverage. Overall, read error correction is an important step in read quality control, and should become a standard part of analytical pipelines. Availability: Results are non-deterministically repeatable using AMI:ami-3dae4956 (MacManes EC 2015) and the Makefile available here: https://goo.gl/oVIuE0
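
One common way to score a corrector when the true sequence is known (e.g., simulated reads) is to count errors repaired versus new errors introduced; the sketch below assumes hypothetical parallel lists of equal-length read strings and is not tied to any particular tool in the evaluation:

```python
# Per-base benchmarking metric for read correction with known truth.
def correction_stats(true_reads, raw_reads, corrected_reads):
    tp = fp = missed = 0
    for truth, raw, fixed in zip(true_reads, raw_reads, corrected_reads):
        for t, r, c in zip(truth, raw, fixed):
            if r != t and c == t:
                tp += 1       # sequencing error the corrector repaired
            elif r == t and c != t:
                fp += 1       # correct base the corrector corrupted
            elif r != t and c != t:
                missed += 1   # error left as-is (or changed to another error)
    return tp, fp, missed

print(correction_stats(["ACGT"], ["ACTT"], ["ACGT"]))  # (1, 0, 0)
```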

Mixed Models for Meta-Analysis and Sequencing

Brendan Bulik-Sullivan
doi: http://dx.doi.org/10.1101/020115

Mixed models are an effective statistical method for increasing power and avoiding confounding in genetic association studies. Existing mixed model methods have been designed for “pooled” studies where all individual-level genotype and phenotype data are simultaneously visible to a single analyst. Many studies follow a “meta-analysis” design, wherein a large number of independent cohorts share only summary statistics with a central meta-analysis group, and no one person can view individual-level data for more than a small fraction of the total sample. When using linear regression for GWAS, there is no difference in power between pooled studies and meta-analyses (Lin and Zeng, 2010); however, we show that when using mixed models, standard meta-analysis is much less powerful than mixed model association on a pooled study of equal size. We describe a method that allows meta-analyses to capture almost all of the power available to mixed model association on a pooled study without sharing individual-level genotype data. The added computational cost and analytical complexity of this method is minimal, but the increase in power can be large: based on the predictive performance of polygenic scoring reported in Wood et al. (2014) and Locke et al. (2015), we estimate that the next height and BMI studies could see increases in effective sample size of ≈15% and ≈8%, respectively. Last, we describe how a related technique can be used to increase power in sequencing, targeted sequencing, and exome array studies. Note that these techniques are presently only applicable to randomly ascertained studies and will sometimes result in loss of power in ascertained case/control studies. We are developing similar methods for case/control studies, but this is more complicated.
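
The linear-regression equivalence cited from Lin and Zeng (2010) is easy to see numerically: for a single variant, inverse-variance-weighted meta-analysis of per-cohort estimates closely matches the pooled-data estimate. A self-contained sketch on simulated data (all parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def gwas_beta(g, y):
    """Simple-regression effect estimate and standard error for one variant."""
    g = g - g.mean()
    y = y - y.mean()
    beta = (g @ y) / (g @ g)
    resid = y - beta * g
    se = np.sqrt((resid @ resid) / (len(y) - 2) / (g @ g))
    return beta, se

# Three cohorts sharing a true effect of 0.05
cohorts = []
for _ in range(3):
    g = rng.binomial(2, 0.3, 5000).astype(float)
    y = 0.05 * g + rng.normal(size=g.size)
    cohorts.append((g, y))

# Inverse-variance-weighted meta-analysis of the per-cohort estimates
betas, ses = zip(*(gwas_beta(g, y) for g, y in cohorts))
weights = 1.0 / np.array(ses) ** 2
meta_beta = float((weights * np.array(betas)).sum() / weights.sum())

# Pooled analysis on the combined individual-level data
pooled_beta, _ = gwas_beta(np.concatenate([g for g, _ in cohorts]),
                           np.concatenate([y for _, y in cohorts]))
print(meta_beta, pooled_beta)  # nearly identical, per Lin and Zeng (2010)
```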