Tools and Methods from the Anopheles 16 Genome Project

Aaron Steele, Michael C. Fontaine, Andres Martin, Scott J Emrich
doi: http://dx.doi.org/10.1101/011205

The dramatic reduction in sequencing costs has resulted in many initiatives to sequence certain organisms and populations. These initiatives aim not only to sequence and assemble genomes but also to perform a broader analysis of population structure. As part of the Anopheline Genome Consortium, which has a vested interest in studying anopheline mosquitoes, we developed novel methods and tools to further the community's goals. We provide a brief description of these methods and tools and assess the contributions that each offers to the broader study of comparative genomics.

Recombination and peak jumping

Kristina Crona
(Submitted on 7 Nov 2014)

We find an advantage of recombination for a category of complex fitness landscapes. Recent studies of empirical fitness landscapes reveal complex gene interactions and multiple peaks, and recombination can be a powerful mechanism for escaping suboptimal peaks. However, classical work on recombination largely ignores the effect of complex gene interactions. The advantage we find has no correspondence for 2-locus systems or for smooth landscapes. The effect is sometimes extreme, in the sense that shutting off recombination could cause the organism to fail to adapt. A standard question about recombination is whether the mechanism tends to accelerate or decelerate adaptation. However, we argue that extreme effects may be more important than how the majority falls.

Network Methods for Pathway Analysis of Genomic Data (Review)

Rosemary Braun, Sahil Shah
(Submitted on 7 Nov 2014)

Rapid advances in high-throughput technologies have led to considerable interest in analyzing genome-scale data in the context of biological pathways, with the goal of identifying functional systems that are involved in a given phenotype. In the most common approaches, biological pathways are modeled as simple sets of genes, neglecting the network of interactions comprising the pathway and treating all genes as equally important to the pathway’s function. Recently, a number of new methods have been proposed to integrate pathway topology in the analyses, harnessing existing knowledge and enabling more nuanced models of complex biological systems. However, there is little guidance available to researchers choosing between these methods. In this review, we discuss eight topology-based methods, comparing their methodological approaches and appropriate use cases. In addition, we present the results of the application of these methods to a curated set of ten gene expression profiling studies using a common set of pathway annotations. We report the computational efficiency of the methods and the consistency of the results across methods and studies to help guide users in choosing a method. We also discuss the challenges and future outlook for improved network analysis methodologies.
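
For readers unfamiliar with the "simple sets of genes" baseline that the abstract contrasts with topology-based methods, here is a minimal sketch of a set-based over-representation test using the hypergeometric tail probability. It is a generic illustration, not one of the eight methods in the review; the function names and the toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_tail(universe_size, pathway_size, hits_size, overlap):
    """P(X >= overlap) when hits_size genes are drawn without replacement
    from universe_size genes, pathway_size of which belong to the pathway."""
    total = comb(universe_size, hits_size)
    upper = min(pathway_size, hits_size)
    favourable = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, hits_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


def enrichment_p(universe, pathway, hits):
    """Over-representation p-value for a pathway treated as a plain gene set.

    Every gene counts equally and interactions are ignored, which is exactly
    the simplification that topology-aware methods try to move beyond.
    """
    universe = set(universe)
    pathway = set(pathway) & universe
    hits = set(hits) & universe
    return hypergeom_tail(len(universe), len(pathway), len(hits), len(pathway & hits))


if __name__ == "__main__":
    # Toy data: hypothetical gene identifiers, not from any study.
    genes = [f"g{i}" for i in range(10)]
    print(enrichment_p(genes, ["g0", "g1", "g2"], ["g0", "g1", "g2"]))
```

Because the test only sees set membership, a hit on a hub gene and a hit on a peripheral gene contribute identically.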

A multivariate genome-wide association analysis of 10 LDL subfractions, and their response to statin treatment, in 1868 Caucasians

Heejung Shim, Daniel I Chasman, Joshua D Smith, Samia Mora, Paul M Ridker, Deborah A Nickerson, Ronald M Krauss, Matthew Stephens
doi: http://dx.doi.org/10.1101/011270

We conducted a genome-wide association analysis of 7 subfractions of low density lipoproteins (LDLs) and 3 subfractions of intermediate density lipoproteins (IDLs) measured by gradient gel electrophoresis, and their response to statin treatment, in 1868 individuals of European ancestry from the Pharmacogenomics and Risk of Cardiovascular Disease study. Our analyses identified four previously-implicated loci (SORT1, APOE, LPA, and CETP) as containing variants that are very strongly associated with lipoprotein subfractions (log10 Bayes Factor > 15). Subsequent conditional analyses suggest that three of these (APOE, LPA and CETP) likely harbor multiple independently associated SNPs. Further, while different variants typically showed different characteristic patterns of association with combinations of subfractions, the two SNPs in CETP show strikingly similar patterns – both in our original data and in a replication cohort – consistent with a common underlying molecular mechanism. Notably, the CETP variants are very strongly associated with LDL subfractions, despite showing no association with total LDLs in our study, illustrating the potential value of the more detailed phenotypic measurements. In contrast with these strong subfraction associations, genetic association analysis of subfraction response to statins showed much weaker signals (none exceeding log10 Bayes Factor of 6). However, two SNPs (in APOE and LPA) previously-reported to be associated with LDL statin response do show some modest evidence for association in our data, and the subfraction response profiles at the LPA SNP are consistent with the LPA association, with response likely being due primarily to resistance of Lp(a) particles to statin therapy. An additional important feature of our analysis is that, unlike most previous analyses of multiple related phenotypes, we analyzed the subfractions jointly, rather than one at a time. 
Comparisons of our multivariate analyses with standard univariate analyses demonstrate that multivariate analyses can substantially increase power to detect associations. Software implementing our multivariate analysis methods is available at http://stephenslab.uchicago.edu/software.html.

WASP: allele-specific software for robust discovery of molecular quantitative trait loci

Bryce van de Geijn, Graham McVicker, Yoav Gilad, Jonathan Pritchard
doi: http://dx.doi.org/10.1101/011221

Allele-specific sequencing reads provide a powerful signal for identifying molecular quantitative trait loci (QTLs); however, they are challenging to analyze and prone to technical artefacts. Here we describe WASP, a suite of tools for unbiased allele-specific read mapping and discovery of molecular QTLs. Using simulated reads, RNA-seq reads and ChIP-seq reads, we demonstrate that our approach has a low error rate and is far more powerful than existing QTL mapping approaches.

Differential gene co-expression networks via Bayesian biclustering models

Chuan Gao, Shiwen Zhao, Ian C. McDowell, Christopher D. Brown, Barbara E. Engelhardt
(Submitted on 7 Nov 2014)

Identifying latent structure in large data matrices is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are locally co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-regulated genes whose covariation may be observed in only a subset of the samples. Our biclustering method, BicMix, has desirable properties, including allowing overcomplete representations of the data, computational tractability, and jointly modeling unknown confounders and biological signals. Compared with related biclustering methods, BicMix recovers latent structure with higher precision across diverse simulation scenarios. Further, we develop a method to recover gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and recover a gene co-expression network that is differential across ER+ and ER- samples.

A Hidden Markov Model for Investigating Recent Positive Selection through Haplotype Structure

Hua Chen, Jody Hey, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/011247

Recent positive selection can increase the frequency of an advantageous mutant rapidly enough that a relatively long ancestral haplotype will remain intact around it. We present a hidden Markov model (HMM) to identify such haplotype structures. With HMM-identified haplotype structures, a population genetic model for the extent of ancestral haplotypes is then adopted for parameter inference of the selection intensity and the allele age. Simulations show that this method can detect selection under a wide range of conditions and has higher power than the existing frequency spectrum-based method. In addition, it provides good estimates of the selection coefficients and allele ages for strong selection. The method analyzes large data sets in a reasonable amount of running time. This method is applied to HapMap III data for a genome scan, and identifies a list of candidate regions putatively under recent positive selection. It is also applied to several genes known to be under recent positive selection, including the LCT, KITLG and TYRP1 genes in Northern Europeans, and OCA2 in East Asians, to estimate their allele ages and selection coefficients.
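
To make the idea of using an HMM to delimit an intact ancestral haplotype concrete, here is a minimal two-state Viterbi sketch. It is not the authors' model: the states, the absorbing "background" state, and all probabilities (`p_stay`, `p_match_anc`, `p_match_bg`) are assumptions chosen for illustration. The input is a list of booleans, ordered by increasing distance from the selected site, saying whether a chromosome carries the ancestral-haplotype allele at each SNP.

```python
import math


def viterbi_haplotype(matches, p_stay=0.95, p_match_anc=0.99, p_match_bg=0.5):
    """Return a string of 'A' (ancestral haplotype) / 'B' (background) labels.

    Assumptions of this toy model (hypothetical, not from the paper):
    - the chain starts in state A at the SNP nearest the selected site;
    - A -> A with probability p_stay, A -> B otherwise;
    - B is absorbing, so a haplotype never returns to A once it is broken;
    - a SNP matches the ancestral allele with probability p_match_anc in A
      and p_match_bg in B.
    """
    if not matches:
        return ""

    def emit(state, is_match):
        p = p_match_anc if state == "A" else p_match_bg
        return math.log(p if is_match else 1.0 - p)

    score = {"A": emit("A", matches[0]), "B": float("-inf")}
    backpointers = []
    for is_match in matches[1:]:
        stay_a = score["A"] + math.log(p_stay)
        a_to_b = score["A"] + math.log(1.0 - p_stay)
        b_to_b = score["B"]  # log(1) = 0 because B is absorbing
        best_prev_for_b = "A" if a_to_b > b_to_b else "B"
        backpointers.append({"A": "A", "B": best_prev_for_b})
        score = {
            "A": stay_a + emit("A", is_match),
            "B": max(a_to_b, b_to_b) + emit("B", is_match),
        }

    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    observed = [True] * 5 + [False] * 5
    print(viterbi_haplotype(observed))
```

The length of the leading run of `A` labels is a crude stand-in for the extent of the ancestral haplotype, the quantity the abstract says is then fed into a population genetic model.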

Resolving microbial microdiversity with high accuracy full length 16S rRNA Illumina sequencing

Catherine Burke, Aaron E Darling
doi: http://dx.doi.org/10.1101/010967

We describe a method for sequencing full-length 16S rRNA gene amplicons using the high throughput Illumina MiSeq platform. The resulting sequences have about 100-fold higher accuracy than standard Illumina reads and are chimera filtered using information from a single molecule dual tagging scheme that boosts the signal available for chimera detection. We demonstrate that the data provides fine scale phylogenetic resolution not available from Illumina amplicon methods targeting smaller variable regions of the 16S rRNA gene.

Epidemiological and evolutionary analysis of the 2014 Ebola virus outbreak

Marta Łuksza, Trevor Bedford, Michael Lässig
Subjects: Populations and Evolution (q-bio.PE)

The 2014 epidemic of the Ebola virus is governed by a genetically diverse viral population. In the early Sierra Leone outbreak, a recent study has identified new mutations that generate genetically distinct sequence clades. Here we find evidence that major Sierra Leone clades have systematic differences in growth rate and reproduction number. If this growth heterogeneity remains stable, it will generate major shifts in clade frequencies and influence the overall epidemic dynamics on time scales within the current outbreak. Our method is based on simple summary statistics of clade growth, which can be inferred from genealogical trees with an underlying clade-specific birth-death model of the infection dynamics. This method can be used to perform real-time tracking of an evolving epidemic and identify emerging clades of epidemiological or evolutionary significance.
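
The claim that stable growth differences shift clade frequencies can be illustrated with a small calculation. The sketch below assumes two clades that each grow exponentially at a constant rate. It does not reproduce the paper's genealogy-based inference, and the rates and starting frequency are hypothetical numbers, not estimates from the outbreak.

```python
import math


def clade_frequency(x0, r_a, r_b, t):
    """Frequency of clade A at time t.

    Assumes clade A starts at frequency x0 and grows as exp(r_a * t),
    while clade B starts at 1 - x0 and grows as exp(r_b * t).
    """
    a = x0 * math.exp(r_a * t)
    b = (1.0 - x0) * math.exp(r_b * t)
    return a / (a + b)


if __name__ == "__main__":
    # Hypothetical growth rates per unit time, for illustration only.
    for t in (0, 10, 20, 40):
        print(t, round(clade_frequency(0.2, 0.12, 0.08, t), 3))
```

Only the difference `r_a - r_b` matters for the frequency trajectory, which follows a logistic curve. Even a modest rate advantage therefore compounds into a large frequency shift if it persists.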

Annotating RNA motifs in sequences and alignments

Paul P Gardner, Hisham Eldai
doi: http://dx.doi.org/10.1101/011197

RNA performs a diverse array of important functions across all cellular life. These functions include important roles in translation, building translational machinery and maturing messenger RNA. More recent discoveries include the miRNAs and bacterial sRNAs that regulate gene expression, the thermosensors, riboswitches and other cis-regulatory elements that help prokaryotes sense their environment and eukaryotic piRNAs that suppress transposition. However, there can be a long period between the initial discovery of an RNA and determining its function. We present a bioinformatic approach to characterise RNA motifs, which are the central building blocks of RNA structure. These motifs can, in some instances, provide researchers with functional hypotheses for uncharacterised RNAs. Moreover, we introduce a new profile-based database of RNA motifs – RMfam – and illustrate its application for investigating the evolution and functional characterisation of RNA. All the data and scripts associated with this work are available from: https://github.com/ppgardne/RMfam