A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data

David Coil, Guillaume Jospin, Aaron E. Darling
(Submitted on 21 Jan 2014)

Motivation: Open-source bacterial genome assembly remains inaccessible to many biologists due to its complexity. Few software solutions exist that are capable of automating all steps in the process of de novo genome assembly from Illumina data.
Results: A5-miseq can produce high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5-miseq does this by automating the process of adapter trimming, quality filtering, error correction, contig and scaffold generation, and detection of misassemblies. Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, can use read pairing information during contig generation, and includes several improvements to read trimming. Together these changes result in substantially improved assemblies that recover a more complete set of reference genes than previous methods.
Availability: A5-miseq is licensed under the GPL open source license. Source code and precompiled binaries for Mac OS X 10.6+ and Linux 2.6.15+ are available from this http URL

A C++ template library for efficient forward-time population genetic simulation of large populations

Kevin R. Thornton
(Submitted on 15 Jan 2014)

fwdpp is a C++ library of routines intended to facilitate the development of forward-time simulations under arbitrary mutation and fitness models. The library design provides a combination of speed, low memory overhead, and modeling flexibility not currently available from other forward simulation tools. The library is particularly useful when the simulation of large populations is required, as programs implemented using the library are much more efficient than other available forward simulation programs.
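
To illustrate what a forward-time simulation does, here is a minimal sketch in Python of a haploid, two-allele Wright-Fisher model with one-way mutation and selection. It is not fwdpp code and does not use the fwdpp API; the function name `wright_fisher` and all of its parameters are assumptions made for this illustration only.

```python
import random


def wright_fisher(pop_size, generations, mutation_rate, selection_coeff, seed=0):
    """Return the derived-allele count per generation (hypothetical helper).

    Assumptions for this sketch: haploid population of constant size,
    two alleles, one-way mutation (ancestral -> derived), and relative
    fitness 1 + selection_coeff for the derived allele.
    """
    rng = random.Random(seed)
    count = 0  # copies of the derived allele in the current generation
    history = [count]
    for _ in range(generations):
        freq = count / pop_size
        # Selection: weight the derived allele by its relative fitness.
        weighted = freq * (1.0 + selection_coeff)
        p_selected = weighted / (weighted + (1.0 - freq))
        # Mutation: ancestral copies become derived at rate mutation_rate.
        p_next = p_selected + (1.0 - p_selected) * mutation_rate
        # Drift: binomial sampling of the next generation.
        count = sum(1 for _ in range(pop_size) if rng.random() < p_next)
        history.append(count)
    return history


if __name__ == "__main__":
    trajectory = wright_fisher(pop_size=500, generations=200,
                               mutation_rate=1e-3, selection_coeff=0.02)
    print(trajectory[-1] / 500)  # final derived-allele frequency
```

The per-individual sampling loop costs time proportional to population size in every generation, which is the kind of overhead that matters when large populations are simulated.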

Integrative genomics analysis identifies pericentromeric regions of human chromosomes affecting patterns of inter-chromosomal interactions

Gennadi V. Glinsky
(Submitted on 10 Jan 2014)

Genome-wide analysis of distributions of densities of long-range interactions of human chromosomes with each other, nucleoli, nuclear lamina, and binding sites of chromatin state regulatory proteins, CTCF and STAT1, identifies non-random, highly correlated patterns of density distributions along the chromosome length for all these features. Marked co-enrichments and clustering of all these interactions are detected at discrete genomic regions on selected chromosomes, which are located within pericentromeric heterochromatin and designated Centromeric Regions of Interphase Chromatin Homing (CENTRICH). CENTRICH manifest 199- to 716-fold higher density of inter-chromosomal binding sites compared to genome-wide or chromosomal averages (p = 2.10E-101 to 1.08E-292). Sequence alignment analysis shows that CENTRICH represent unique DNA sequences of 3.9 to 22.4 Kb in size which: 1) are associated with the nucleolus; 2) exhibit a highly diverse set of DNA-bound chromatin state regulators, including marked enrichment of CTCF and STAT1 binding sites; 3) bind multiple intergenic disease-associated genomic loci (IDAGL) with documented long-range enhancer activities and established links to increased risk of developing epithelial malignancies and other common human disorders. Using distances of SNP loci homing sites within genomic coordinates of CENTRICH as a proxy for the likelihood of disease-linked SNP loci binding to CENTRICH, we demonstrate statistically significant correlations between the probability of SNP loci binding to CENTRICH and GWAS-defined odds ratios of increased risk of a disease for cancer, coronary artery disease, and type 2 diabetes. Our analysis suggests that centromeric sequences and pericentromeric heterochromatin may play an important role in human cells beyond the critical functions in chromosome segregation.

VCF2Networks: applying Genotype Networks to Single Nucleotide Variants data

Giovanni Marco Dall’Olio, Ali R. Vahdat, Bertranpetit Jaume, Wagner Andreas, Laayouni Hafid
(Submitted on 9 Jan 2014)

Summary: Genotype networks are a method used in systems biology to study the innovability of a given phenotype, determining whether the phenotype is robust to mutations and how the genotypes associated with it are distributed in the genotype space. Here we developed VCF2Networks, a tool to apply this method to population genetics data, and in particular to Single Nucleotide Variants data encoded in the Variant Call Format (VCF). A complete summary of the properties of the genotype network that can be calculated by VCF2Networks is given in the Supplementary Materials 1.
Availability and Implementation: The home page of the project is this https URL . VCF2Networks is also available directly from the Python Package Index (PyPI), under the name vcf2networks.
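
As a rough illustration of the genotype network idea described in the summary, the sketch below builds a network in which each distinct genotype is a node and two genotypes are connected when they differ at exactly one site. It does not use the vcf2networks package or reproduce its implementation. The encoding of genotypes as strings of 0 and 1 (reference or alternate allele per site) and the helper names are assumptions made for this example.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def genotype_network(genotypes):
    """Adjacency sets linking genotypes that differ at exactly one site."""
    nodes = sorted(set(genotypes))
    edges = {g: set() for g in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def connected_components(edges):
    """Group nodes into connected components with a depth-first search."""
    seen = set()
    components = []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        components.append(component)
    return components


if __name__ == "__main__":
    # Hypothetical observed genotypes over three variant sites.
    observed = ["000", "001", "011", "110", "001"]
    network = genotype_network(observed)
    print(len(connected_components(network)))
```

Properties such as the number of components or the degree of each node are examples of network summaries; which properties VCF2Networks itself reports is listed in the paper's supplementary materials, not here.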

Distribution of population averaged observables in stochastic gene expression

Bhaswati Bhattacharyya, Ziya Kalay
(Submitted on 9 Jan 2014)

Observation of phenotypic diversity in a population of genetically identical cells is often linked to the stochastic nature of chemical reactions involved in gene regulatory networks. We investigate the distribution of population averaged gene expression levels as a function of population, or sample, size for several stochastic gene expression models to find out to what extent population averaged quantities reflect the underlying mechanism of gene expression. We consider three basic gene regulation networks corresponding to transcription with and without gene state switching and translation. Using analytical expressions for the probability generating function of observables and Large Deviation Theory, we calculate the distribution and first two moments of the population averaged mRNA and protein levels as a function of model parameters, population size and number of measurements contained in a data set. We validate our results using stochastic simulations and also report exact results on the asymptotic properties of population averages, which show qualitative differences among different models.
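
The kind of stochastic simulation mentioned above can be sketched for the simplest of the three networks, transcription without gene state switching. The example below uses the Gillespie algorithm to simulate mRNA counts in individual cells and then averages over a population of a given size. It is an illustration under stated assumptions, not the authors' code: the function names, rate constants, and simulation time are all hypothetical choices.

```python
import random
import statistics


def simulate_mrna(k_tx, k_deg, t_end, rng):
    """Gillespie simulation of constitutive transcription and degradation.

    Returns the mRNA copy number of one cell at time t_end, starting from 0.
    """
    t, m = 0.0, 0
    while True:
        total_rate = k_tx + k_deg * m
        t += rng.expovariate(total_rate)  # waiting time to the next reaction
        if t > t_end:
            return m
        if rng.random() * total_rate < k_tx:
            m += 1  # transcription event
        else:
            m -= 1  # degradation event (only reachable when m > 0)


def population_averages(pop_size, n_measurements,
                        k_tx=10.0, k_deg=1.0, t_end=5.0, seed=1):
    """Return n_measurements averages, each over pop_size simulated cells."""
    rng = random.Random(seed)
    return [
        sum(simulate_mrna(k_tx, k_deg, t_end, rng) for _ in range(pop_size)) / pop_size
        for _ in range(n_measurements)
    ]


if __name__ == "__main__":
    for n in (1, 4, 16):
        averages = population_averages(pop_size=n, n_measurements=100)
        print(n, statistics.mean(averages), statistics.variance(averages))
```

For independent cells the variance of the population average shrinks roughly in proportion to one over the population size, so averaging hides much of the single-cell variability.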

Physical constraints determine the logic of bacterial promoter architectures

Daphne Ezer, Nicolae Radu Zabet, Boris Adryan
(Submitted on 27 Dec 2013)

Site-specific transcription factors (TFs) bind to their target sites on the DNA, where they regulate the rate at which genes are transcribed. Bacterial TFs undergo facilitated diffusion (a combination of 3D diffusion around and 1D random walk on the DNA) when searching for their target sites. Using computer simulations of this search process, we show that the organisation of the binding sites, in conjunction with TF copy number and binding site affinity, plays an important role in determining not only the steady state of promoter occupancy, but also the order in which TFs bind. These effects can be captured by facilitated diffusion-based models, but not by standard thermodynamics. We show that the spacing of binding sites encodes complex logic, which can be derived from combinations of three basic building blocks: switches, barriers and clusters, whose response alone and in higher orders of organisation we characterise in detail. Effective promoter organizations are commonly found in the E. coli genome and are highly conserved between strains. This will allow studies of gene regulation at an unprecedented level of detail, where our framework can create testable hypotheses of promoter logic.

Fast and accurate alignment of long bisulfite-seq reads

Brent S. Pedersen, Kenneth Eyring, Subhajyoti De, Ivana V. Yang, David A. Schwartz
(Submitted on 6 Jan 2014)

Summary: Longer sequencing reads, with at least 200 bases per template, are now common. While traditional aligners have adopted new strategies to improve the mapping of longer reads, aligners specific to bisulfite-sequencing were optimized when much shorter reads were the norm. We sought to perform the first comparison using longer reads to determine which aligners were most accurate and efficient and to evaluate a novel software tool, bwa-meth, built on a traditional mapper that supports insertions, deletions and clipped alignments. We gauge accuracy by comparing the number of on- and off-target reads from a targeted sequencing project and by simulations.
Availability and Implementation: The benchmarking scripts and the bwa-meth software are available at this https URL under the MIT License.

Massively differential bias between two widely used Illumina library preparation methods for small RNA sequencing

Jeanette Baran-Gale, Michael R Erdos, Christina Sison, Alice Young, Emily E Fannin, Peter S Chines, Praveen Sethupathy

Recent advances in sequencing technology have helped unveil the unexpected complexity and diversity of small RNAs. A critical step in small RNA library preparation for sequencing is the ligation of adapter sequences to both the 5’ and 3’ ends of small RNAs. Two widely used protocols for small RNA library preparation, Illumina v1.5 and Illumina TruSeq, use different pairs of adapter sequences. In this study, we compare the results of small RNA-sequencing between v1.5 and TruSeq and observe a striking differential bias. Nearly 100 highly expressed microRNAs (miRNAs) are >5-fold differentially detected and 48 miRNAs are >10-fold differentially detected between the two methods of library preparation. In fact, some miRNAs, such as miR-24-3p, are over 30-fold differentially detected. The results are reproducible across different sequencing centers (NIH and UNC) and both major Illumina sequencing platforms, GAIIx and HiSeq. While some level of bias in library preparation is not surprising, the apparent massive differential bias between these two widely used adapter sets is not well appreciated. As more laboratories transition to the newer TruSeq-based library preparation for small RNAs, researchers should be aware of the extent to which the results may differ from previously published results using v1.5.

CONCOCT: Clustering cONtigs on COverage and ComposiTion

Johannes Alneberg, Brynjar Smari Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z. Ijaz, Nicholas J. Loman, Anders F. Andersson, Christopher Quince
(Submitted on 14 Dec 2013)

Metagenomics enables the reconstruction of microbial genomes in complex microbial communities without the need for culturing. Since assembly typically results in fragmented genomes, the grouping of genome fragments (contigs) belonging to the same genome, a process referred to as binning, remains a major informatics challenge. Here we present CONCOCT, a computer program that combines three types of information – sequence composition, coverage across multiple samples, and read-pair linkage – to automatically bin contigs into genomes. We demonstrate high recall and precision rates of the program on artificial as well as real human gut metagenome datasets.
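
To make two of the three information types concrete, the sketch below builds one feature vector per contig by concatenating a k-mer frequency profile (sequence composition) with a normalised coverage profile across samples. This is an illustration only and not CONCOCT's implementation: the function names, the choice of k, and the normalisation are assumptions, and neither the clustering step nor the read-pair linkage information is shown.

```python
from itertools import product


def kmer_frequencies(sequence, k=4):
    """Relative frequency of every DNA k-mer, in lexicographic ACGT order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def contig_features(sequence, coverages, k=4):
    """Concatenate composition and per-sample coverage into one vector.

    coverages holds the mean read depth of the contig in each sample;
    it is scaled to sum to 1 so that the profile shape is compared.
    """
    total_cov = sum(coverages)
    cov_profile = [c / total_cov if total_cov else 0.0 for c in coverages]
    return kmer_frequencies(sequence, k) + cov_profile


if __name__ == "__main__":
    # Hypothetical contig with coverage measured in two samples.
    features = contig_features("ACGTACGTAC", [10.0, 30.0], k=2)
    print(len(features), features[-2:])
```

Contigs from the same genome would be expected to have similar vectors, so any standard clustering method could be applied to them as a stand-in for the binning step.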

Bayesian inference of infectious disease transmission from whole genome sequence data

Xavier Didelot, Jennifer Gardy, Caroline Colijn

Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered: how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via Markov chain Monte Carlo. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.