WASP: allele-specific software for robust discovery of molecular quantitative trait loci

Bryce van de Geijn, Graham McVicker, Yoav Gilad, Jonathan Pritchard
doi: http://dx.doi.org/10.1101/011221

Allele-specific sequencing reads provide a powerful signal for identifying molecular quantitative trait loci (QTLs); however, they are challenging to analyze and prone to technical artefacts. Here we describe WASP, a suite of tools for unbiased allele-specific read mapping and discovery of molecular QTLs. Using simulated reads, RNA-seq reads and ChIP-seq reads, we demonstrate that our approach has a low error rate and is far more powerful than existing QTL mapping approaches.

Differential gene co-expression networks via Bayesian biclustering models

Chuan Gao, Shiwen Zhao, Ian C. McDowell, Christopher D. Brown, Barbara E. Engelhardt
(Submitted on 7 Nov 2014)

Identifying latent structure in large data matrices is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are locally co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-regulated genes whose covariation may be observed in only a subset of the samples. Our biclustering method, BicMix, has desirable properties, including allowing overcomplete representations of the data, computational tractability, and jointly modeling unknown confounders and biological signals. Compared with related biclustering methods, BicMix recovers latent structure with higher precision across diverse simulation scenarios. Further, we develop a method to recover gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and recover a gene co-expression network that is differential across ER+ and ER- samples.

A Hidden Markov Model for Investigating Recent Positive Selection through Haplotype Structure

Hua Chen, Jody Hey, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/011247

Recent positive selection can increase the frequency of an advantageous mutant rapidly enough that a relatively long ancestral haplotype remains intact around it. We present a hidden Markov model (HMM) to identify such haplotype structures. With the HMM-identified haplotype structures, a population genetic model for the extent of ancestral haplotypes is then adopted to infer the selection intensity and the allele age. Simulations show that this method can detect selection under a wide range of conditions and has higher power than the existing frequency spectrum-based method. In addition, it provides good estimates of the selection coefficients and allele ages for strong selection. The method analyzes large data sets in a reasonable amount of running time. This method is applied to HapMap III data for a genome scan, and identifies a list of candidate regions putatively under recent positive selection. It is also applied to several genes known to be under recent positive selection, including the LCT, KITLG and TYRP1 genes in Northern Europeans, and OCA2 in East Asians, to estimate their allele ages and selection coefficients.

Resolving microbial microdiversity with high accuracy full length 16S rRNA Illumina sequencing

Catherine Burke, Aaron E Darling
doi: http://dx.doi.org/10.1101/010967

We describe a method for sequencing full-length 16S rRNA gene amplicons using the high-throughput Illumina MiSeq platform. The resulting sequences have about 100-fold higher accuracy than standard Illumina reads and are chimera-filtered using information from a single-molecule dual-tagging scheme that boosts the signal available for chimera detection. We demonstrate that the data provide fine-scale phylogenetic resolution not available from Illumina amplicon methods targeting smaller variable regions of the 16S rRNA gene.

Epidemiological and evolutionary analysis of the 2014 Ebola virus outbreak

Marta Łuksza, Trevor Bedford, Michael Lässig
Subjects: Populations and Evolution (q-bio.PE)

The 2014 epidemic of the Ebola virus is governed by a genetically diverse viral population. In the early Sierra Leone outbreak, a recent study has identified new mutations that generate genetically distinct sequence clades. Here we find evidence that major Sierra Leone clades have systematic differences in growth rate and reproduction number. If this growth heterogeneity remains stable, it will generate major shifts in clade frequencies and influence the overall epidemic dynamics on time scales within the current outbreak. Our method is based on simple summary statistics of clade growth, which can be inferred from genealogical trees with an underlying clade-specific birth-death model of the infection dynamics. This method can be used to perform real-time tracking of an evolving epidemic and identify emerging clades of epidemiological or evolutionary significance.
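
The link between stable growth differences and shifting clade frequencies can be illustrated with a minimal sketch. It assumes two clades that each grow exponentially, with growth rate equal to birth rate minus death rate and reproduction number equal to birth rate divided by death rate, as in a standard birth-death model. The function names and the rates are hypothetical and are not taken from the study, which infers its statistics from genealogical trees.

```python
import math


def growth_rate(birth, death):
    """Exponential growth rate of a birth-death process (assumed model)."""
    return birth - death


def reproduction_number(birth, death):
    """Expected number of secondary infections per infection (assumed model)."""
    return birth / death


def clade_frequency(x0, rate_a, rate_b, t):
    """Frequency of clade A at time t when clades A and B both grow exponentially.

    x0 is the frequency of clade A at time 0.
    """
    size_a = x0 * math.exp(rate_a * t)
    size_b = (1.0 - x0) * math.exp(rate_b * t)
    return size_a / (size_a + size_b)


if __name__ == "__main__":
    # Hypothetical per-day rates, chosen only for illustration.
    rate_a = growth_rate(birth=0.30, death=0.20)
    rate_b = growth_rate(birth=0.25, death=0.20)
    for day in (0, 30, 60, 90):
        print(day, round(clade_frequency(0.1, rate_a, rate_b, day), 3))
```

Under these assumptions the frequency follows a logistic curve whose speed depends only on the difference between the two growth rates, so even a small, persistent difference moves a minority clade towards dominance over a few months.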

Annotating RNA motifs in sequences and alignments

Paul P Gardner, Hisham Eldai
doi: http://dx.doi.org/10.1101/011197

RNA performs a diverse array of important functions across all cellular life. These functions include important roles in translation, building translational machinery and maturing messenger RNA. More recent discoveries include the miRNAs and bacterial sRNAs that regulate gene expression, the thermosensors, riboswitches and other cis-regulatory elements that help prokaryotes sense their environment and eukaryotic piRNAs that suppress transposition. However, there can be a long period between the initial discovery of an RNA and determining its function. We present a bioinformatic approach to characterise RNA motifs, which are the central building blocks of RNA structure. These motifs can, in some instances, provide researchers with functional hypotheses for uncharacterised RNAs. Moreover, we introduce a new profile-based database of RNA motifs, RMfam, and illustrate its application for investigating the evolution and functional characterisation of RNA. All the data and scripts associated with this work are available from: https://github.com/ppgardne/RMfam

Estimating the Relative Rate of Recombination to Mutation in Bacteria from Single-Locus Variants using Composite Likelihood Methods

Paul Fearnhead, Shoukai Yu, Patrick Biggs, Barbara Holland, Nigel French
(Submitted on 5 Nov 2014)

A number of studies have suggested using comparisons between DNA sequences of closely related bacterial isolates to estimate the relative rate of recombination to mutation for that bacterial species. We consider such an approach which uses single locus variants: pairs of isolates whose DNA differs at a single gene locus. One way of deriving point estimates for the relative rate of recombination to mutation from such data is to use composite likelihood methods. We extend recent work in this area so as to be able to construct confidence intervals for our estimates, without needing to resort to computationally intensive bootstrap procedures, and to develop a test for whether the relative rate varies across loci. Both our test and method for constructing confidence intervals are obtained by modelling the dependence structure in the data, and then applying asymptotic theory regarding the distribution of estimators obtained using a composite likelihood. We applied these methods to multi-locus sequence typing (MLST) data from eight bacteria, finding strong evidence for considerable rate variation in three of these: Bacillus cereus, Enterococcus faecium and Klebsiella pneumoniae.

GC-content evolution in bacterial genomes: the biased gene conversion hypothesis expands.

Florent Lassalle, Séverine Périan, Thomas Bataillon, Xavier Nesme, Laurent Duret, Vincent Daubin
doi: http://dx.doi.org/10.1101/011023

The characterization of functional elements in genomes relies on the identification of the footprints of natural selection. In this quest, taking into account neutral evolutionary processes such as mutation and genetic drift is crucial because these forces can generate patterns that may obscure or mimic signatures of selection. In mammals, and probably in many eukaryotes, another such confounding factor called GC-Biased Gene Conversion (gBGC) has been documented. This mechanism generates patterns identical to what is expected under selection for higher GC-content, specifically in highly recombining genomic regions. Recent results have suggested that a mysterious selective force favouring higher GC-content exists in Bacteria but the possibility that it could be gBGC has been excluded. Here, we show that gBGC is probably at work in most if not all bacterial species. First we find a consistent positive relationship between the GC-content of a gene and evidence of intra-genic recombination throughout a broad spectrum of bacterial clades. Second, we show that the evolutionary force responsible for this pattern is acting independently from selection on codon usage, and could potentially interfere with selection in favor of optimal AU-ending codons. A comparison with data from human populations shows that the intensity of gBGC in Bacteria is comparable to what has been reported in mammals. We propose that gBGC is not restricted to sexual Eukaryotes but also widespread among Bacteria and could therefore be an ancestral feature of cellular organisms. We argue that if gBGC occurs in bacteria, it can account for previously unexplained observations, such as the apparent non-equilibrium of base substitution patterns and the heterogeneity of gene composition within bacterial genomes. Because gBGC produces patterns similar to positive selection, it is essential to take this process into account when studying the evolutionary forces at work in bacterial genomes.

Ancestries of a Recombining Diploid Population

R. Sainudiin, B. Thatte and A. Veber, UCDMS Research Report 2014/3, 42 pages, 2014

We derive the exact one-step transition probabilities of the number of lineages that are ancestral to a random sample from the current generation of a bi-parental population that is evolving under the discrete Wright-Fisher model with n diploid individuals. Our model allows for a per-generation recombination probability of r. When r = 1, our model is equivalent to Chang's model [4] for the karyotic pedigree. When r = 0, our model is equivalent to Kingman's discrete coalescent model [16] for the cytoplasmic tree or sub-karyotic tree containing a DNA locus that is free of intra-locus recombination. When 0 < r < 1, our model can be thought to track a sub-karyotic ancestral graph containing a DNA sequence from an autosomal chromosome that has an intra-locus recombination probability r. Thus, our family of models indexed by r ∈ [0, 1] connects Kingman's discrete coalescent to Chang's pedigree in a continuous way as r goes from 0 to 1. For large populations, we also study three properties of the r-specific ancestral process: the time T_n to a most recent common ancestor (MRCA) of the population, the time U_n at which all individuals are either common ancestors to all present day individuals or ancestral to none of them, and the fraction of individuals that are common ancestors at time U_n. These results generalize the three main results in [4]. When we appropriately rescale time and recombination probability by the population size, our model leads to the continuous time Markov chain called the ancestral recombination graph of Hudson [12] and Griffiths [9].
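
How the number of ancestral lineages behaves between the two extremes of r can be explored with a small forward-in-code, backward-in-time simulation. This is only a sketch under stated assumptions, not the exact transition probabilities derived in the report: each ancestral individual is assumed to pass its lineage to one uniformly chosen parent among the n individuals of the previous generation, and with probability r the lineage also reaches a second, independently chosen parent. The function names and parameter values are hypothetical.

```python
import random


def step_back(ancestors, n, r, rng):
    """Go one generation back and return the set of ancestral individuals.

    Assumption: each parent is drawn uniformly, with replacement, from the
    n individuals of the previous generation.
    """
    parents = set()
    for _ in ancestors:
        parents.add(rng.randrange(n))
        # With probability r the lineage splits and also reaches a second parent.
        if rng.random() < r:
            parents.add(rng.randrange(n))
    return parents


def lineage_counts(n, sample_size, r, generations, seed=0):
    """Number of ancestral individuals at each generation, starting from a sample."""
    rng = random.Random(seed)
    current = set(rng.sample(range(n), sample_size))
    counts = [len(current)]
    for _ in range(generations):
        current = step_back(current, n, r, rng)
        counts.append(len(current))
    return counts


if __name__ == "__main__":
    for r in (0.0, 0.5, 1.0):
        print(r, lineage_counts(n=100, sample_size=10, r=r, generations=20))
```

With r = 0 every lineage follows a single parent, so the count can only stay the same or fall, as in a discrete coalescent. With r = 1 every ancestor contributes both parents, so the count can grow, as in a pedigree.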

Tackling drug resistant infection outbreaks of global pandemic Escherichia coli ST131 using evolutionary and epidemiological genomics

Tackling drug resistant infection outbreaks of global pandemic Escherichia coli ST131 using evolutionary and epidemiological genomics
Tim Downing
(Submitted on 4 Nov 2014)

High-throughput molecular approaches are required to investigate the origin and diffusion of antimicrobial resistance in rapidly radiating pathogen outbreaks. The most frequent cause of human infection is Escherichia coli, which is dominated by ST131, a single pandemic clone. This epidemic subtype possesses an extensive array of virulence elements and tolerates many drugs. Frequent global sweeps of new dominant ST131 varieties necessitate deep genomic scrutiny of their spread, evolution and lateral transfer of drug resistance genes. Phylogenetic methods that decipher past events can predict future patterns of virulence and transmission based on genetic signatures of adaptation and recombination. Antibiotic tolerance is controlled by natural variation in gene expression levels, which can initiate delayed cell growth. This dormancy allows survival despite drug exposure, and yet may only be present in part of the infecting cell population. Consequently, genomic epidemiology needs to explore the scale of phenotypic regulatory control acting on RNA. A multi-faceted approach can comprehensively assess antimicrobial resistance in E. coli ST131 in terms of within-host genetic heterogeneity, regulation of gene expression, and transmission dynamics between hosts to achieve a goal of pre-empting resistance before it emerges by optimising drug treatment protocols.