VCF2Networks: applying Genotype Networks to Single Nucleotide Variants data

Giovanni Marco Dall’Olio, Ali R. Vahdat, Jaume Bertranpetit, Andreas Wagner, Hafid Laayouni
(Submitted on 9 Jan 2014)

Summary: Genotype networks are a method used in systems biology to study the innovability of a given phenotype, determining whether the phenotype is robust to mutations and how the genotypes associated with it are distributed in genotype space. Here we present VCF2Networks, a tool that applies this method to population genetics data, in particular to Single Nucleotide Variant data encoded in the Variant Call Format (VCF). A complete summary of the genotype network properties that VCF2Networks can calculate is given in Supplementary Materials 1.
Availability and Implementation: The home page of the project is this https URL . VCF2Networks is also available directly from the Python Package Index (PyPI), under the name vcf2networks.
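
To make the idea of a genotype network concrete, here is a minimal sketch in plain Python. It does not use the vcf2networks package or reproduce its API. The haplotype strings, the function names and the choice to link genotypes that differ at exactly one site are illustrative assumptions, and reading genotypes from a VCF file is left out.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def genotype_network(genotypes):
    """Build a genotype network as an adjacency dict.

    Nodes are the distinct genotypes observed; two nodes are linked
    when they differ at exactly one variant site.
    """
    nodes = sorted(set(genotypes))
    graph = {g: set() for g in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return the connected components of the network as a list of sets."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical haplotypes over four SNV sites (0 = reference, 1 = alternate).
observed = ["0000", "0001", "0011", "1100", "0001"]
network = genotype_network(observed)
components = connected_components(network)
```

Properties such as the number of nodes, the degree of each genotype and the number of connected components can then be read directly from `network` and `components`.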

Author post: Physical constraints determine the logic of bacterial promoter architectures

Our next guest post is by Radu Zabet on his manuscript (with co-workers) Physical constraints determine the logic of bacterial promoter architectures, arXived here

Earlier last year we explored the possibility of understanding ‘real biology’ using our stochastic simulation framework GRiP (http://logic.sysbiol.cam.ac.uk/grip). That software simulates how transcription factors (TFs) find their target sites in the genome, using a combination of three-dimensional diffusion around the DNA and a one-dimensional walk along it. This biophysical mechanism is quite well studied and is commonly termed ‘facilitated diffusion’. Unlike the flight of a homing missile, the path of a TF molecule to its target site is somewhat erratic, and with many other factors around, even ‘traffic jams’ on the DNA seem possible. That and other interesting phenomena were the subject of two other arXiv contributions we put online last year; see Haldane’s Sieve for more, https://haldanessieve.org/2013/04/09/our-paper-the-effects-of-transcription-factor-competition-on-gene-regulation/ and the two publications: http://dx.doi.org/10.3389/fgene.2013.00197 and http://dx.doi.org/10.1371/journal.pone.0073714.

Often, TF binding sites are closely packed or even overlapping. In our latest paper, we explore how the spacing of binding sites along the DNA can influence the probability of a “TF traffic jam” occurring, and thereby the length of a TF’s “commute” to its binding site (http://arxiv.org/abs/1312.7262). We notice that one of the promoter organisations that we predict would cause massive traffic jams is underrepresented among E. coli promoters, suggesting that this phenomenon may have an important biological role.

One of the most common approaches to predicting TF occupancy is statistical thermodynamics, which assumes that the system is in steady state. Here we show that under biologically relevant parameters, a TF might take longer than a cell cycle to arrive at its binding site when the promoter is organised in a way that induces “traffic jams”. Therefore, it is important to consider the dynamics of TF binding, rather than just the steady state.
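
The gap between steady state and dynamics can be illustrated with a toy calculation. The sketch below is not the facilitated-diffusion simulation in GRiP. It assumes a generic two-state binding model (site empty or occupied) with hypothetical rate constants and a hypothetical 20-minute cell cycle, and compares the occupancy reached within one cell cycle to the steady-state value.

```python
import math


def steady_state_occupancy(k_on, k_off):
    """Equilibrium probability that the site is occupied."""
    return k_on / (k_on + k_off)


def occupancy_at(t, k_on, k_off):
    """Occupancy at time t for a site that starts empty.

    Solves dp/dt = k_on * (1 - p) - k_off * p with p(0) = 0.
    """
    rate = k_on + k_off
    return steady_state_occupancy(k_on, k_off) * (1.0 - math.exp(-rate * t))


cell_cycle = 1200.0  # seconds; hypothetical 20-minute cell cycle
k_off = 1e-4         # per second; hypothetical unbinding rate

# Fast arrival: occupancy relaxes to steady state well within one cell cycle.
fast = occupancy_at(cell_cycle, k_on=1e-2, k_off=k_off)

# Slow arrival (e.g. a congested promoter): occupancy is still far from
# steady state when the cell divides.
slow = occupancy_at(cell_cycle, k_on=1e-4, k_off=k_off)
```

With these assumed rates, `fast` is essentially at its steady-state value, while `slow` remains well below it, which is the regime in which a steady-state prediction would be misleading.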

Usually, transcriptional logic refers to the idea that the specific combinations of TFs that bind to a gene promoter control the expression level of that gene. We extend this notion of transcriptional logic by proposing that the response to multiple regulatory inputs can also depend on the dynamics of TF binding. In other words, not only the final combinatorial pattern matters, but also the order in which these sites are occupied. In this context, we suggest that the spatial organisation of the promoter encodes the logic, influenced by TF concentrations that modulate promoter occupancy dynamics on biologically relevant time scales.

Using computer simulations of the search process, we show that the logic of complex bacterial promoters can be explained by the combinatorial action of three commonly found basic building blocks: switches, barriers and clusters, whose characteristics we analyse in detail.

The precise spacing of TF binding sites plays a key role in our model, and we show that physically constrained promoter organizations are commonly found in bacterial genomes and are conserved.

Finally, we also developed a new web-based computational tool (faster GRiP, or fGRIP), which is able to generate the dynamics of promoter occupancy for bacterial systems. This tool is available at http://logic.sysbiol.cam.ac.uk/fgrip/

Distribution of population averaged observables in stochastic gene expression

Bhaswati Bhattacharyya, Ziya Kalay
(Submitted on 9 Jan 2014)

Observation of phenotypic diversity in a population of genetically identical cells is often linked to the stochastic nature of chemical reactions involved in gene regulatory networks. We investigate the distribution of population averaged gene expression levels as a function of population, or sample, size for several stochastic gene expression models to find out to what extent population averaged quantities reflect the underlying mechanism of gene expression. We consider three basic gene regulation networks corresponding to transcription with and without gene state switching and translation. Using analytical expressions for the probability generating function of observables and Large Deviation Theory, we calculate the distribution and first two moments of the population averaged mRNA and protein levels as a function of model parameters, population size and number of measurements contained in a data set. We validate our results using stochastic simulations and also report exact results on the asymptotic properties of population averages, which show qualitative differences among different models.
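
As a rough illustration of how population size shapes the distribution of a population average, the sketch below uses the simplest case: constitutive transcription with first-order degradation, for which the steady-state mRNA count per cell is Poisson distributed. It is not the authors' analysis. The mean of 10 mRNAs per cell, the population sizes and the function names are assumptions chosen for the example.

```python
import math
import random
import statistics


def sample_poisson(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def population_averages(mean_mrna, population_size, n_measurements, rng):
    """Return n_measurements averages, each taken over population_size cells."""
    averages = []
    for _ in range(n_measurements):
        cells = [sample_poisson(mean_mrna, rng) for _ in range(population_size)]
        averages.append(statistics.fmean(cells))
    return averages


rng = random.Random(0)
mean_mrna = 10.0  # hypothetical ratio of transcription to degradation rate

small = population_averages(mean_mrna, population_size=10, n_measurements=300, rng=rng)
large = population_averages(mean_mrna, population_size=100, n_measurements=300, rng=rng)

# For independent Poisson cells, theory gives variance = mean_mrna / population_size.
var_small = statistics.variance(small)
var_large = statistics.variance(large)
```

In this simple model the spread of the population average shrinks as one over the population size. Models with gene state switching would need a different per-cell sampler.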

Physical constraints determine the logic of bacterial promoter architectures

Daphne Ezer, Nicolae Radu Zabet, Boris Adryan
(Submitted on 27 Dec 2013)

Site-specific transcription factors (TFs) bind to their target sites on the DNA, where they regulate the rate at which genes are transcribed. Bacterial TFs undergo facilitated diffusion (a combination of 3D diffusion around and 1D random walk on the DNA) when searching for their target sites. Using computer simulations of this search process, we show that the organisation of the binding sites, in conjunction with TF copy number and binding site affinity, plays an important role in determining not only the steady state of promoter occupancy, but also the order in which TFs bind. These effects can be captured by facilitated diffusion-based models, but not by standard thermodynamics. We show that the spacing of binding sites encodes complex logic, which can be derived from combinations of three basic building blocks: switches, barriers and clusters, whose response alone and in higher orders of organisation we characterise in detail. Effective promoter organisations are commonly found in the E. coli genome and are highly conserved between strains. This will allow studies of gene regulation at an unprecedented level of detail, where our framework can create testable hypotheses of promoter logic.

Reconstructing transmission networks for communicable diseases using densely sampled genomic data: a generalized approach

Colin J. Worby, Philip D. O’Neill, Theodore Kypraios, Julie V. Robotham, Daniela De Angelis, Edward J. P. Cartwright, Sharon J. Peacock, Ben S. Cooper
(Submitted on 8 Jan 2014)

Probabilistic reconstruction of transmission networks for communicable diseases can provide important insights into epidemic dynamics, the effectiveness of infection control measures, and contact patterns in an at-risk population. Whole genome sequencing of pathogens from multiple hosts provides an opportunity to investigate who infected whom with unparalleled resolution. We considered disease outbreaks in a community with high frequency genomic sampling, and formulated stochastic epidemic models to investigate person-to-person transmission, based on genomic and epidemiological data. Our approach, which combines a stochastic epidemic transmission model with a genetic distance model, overcomes key limitations of previous methods by providing a framework with the flexibility to allow for unobserved infection times, multiple independent introductions of the pathogen, and within-host genetic diversity, as well as allowing forward simulation. We defined two genetic models: a transmission diversity model, in which genetic diversity increases along a transmission chain, and an importation structure model, which groups isolates into genetically similar clusters. We evaluated their predictive performance using simulated data, demonstrating high sensitivity and specificity, particularly for rapidly mutating pathogens with low transmissibility. We then analyzed data collected during an outbreak of MRSA in a hospital. We identified three probable transmission events (posterior probability > 0.5) among the twenty observed cases. We estimated that genetic diversity across transmission links was approximately the same as within-host, with an expected 3.9 (95% CrI: 3.3, 4.6) single nucleotide polymorphisms between isolates. Our methodology avoids restrictive assumptions required in many analyses, and has broad applicability to epidemics with densely sampled genomic data.
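
The raw quantity behind genetic distance models of this kind is the number of single nucleotide polymorphisms separating pairs of isolates. The sketch below computes pairwise SNP distances from aligned sequences. It is only an illustration and is not the authors' transmission model. The isolate names, the sequences and the choice to skip ambiguous `N` calls are assumptions.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count sites where two aligned sequences differ, skipping ambiguous 'N' calls."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        a != b for a, b in zip(seq_a, seq_b) if a != "N" and b != "N"
    )


def pairwise_distances(isolates):
    """Map each unordered pair of isolate IDs to its SNP distance."""
    return {
        (id_a, id_b): snp_distance(isolates[id_a], isolates[id_b])
        for id_a, id_b in combinations(sorted(isolates), 2)
    }


# Hypothetical aligned sequences, one per sampled host.
isolates = {
    "patient_1": "ACGTACGTAC",
    "patient_2": "ACGTACGTAT",
    "patient_3": "ACGAACGNAT",
}
distances = pairwise_distances(isolates)
```

A probabilistic model would then combine such distances with epidemiological data, for example admission and sampling times, rather than treating the closest pair as a transmission link.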

Author post: Extensive Phenotypic Changes Associated with Large-scale Horizontal Gene Transfer

Our next guest post is by David Baltrus (@surt_lab) on his group’s preprint Extensive Phenotypic Changes Associated with Large-scale Horizontal Gene Transfer, posted on bioRxiv here.

The function of modern pickup trucks is usually to haul heavy loads from point A to point B. However, the F-150 sitting in my driveway right now looks very different from its Model T ancestor from ~100 years ago. Over the years, as truck design has been modified and improved, all of the parts (brakes, air conditioning systems, doors, wheels, etc.) have been crafted to fit and work efficiently together. In the process, each of the parts you see on a pickup truck today has been selectively co-evolving with all of the other design elements on the truck. The function of a house is to provide shelter. You can easily extend the co-evolutionary metaphor from above to explain how different aspects of the house I live in relate to one another.

Some time ago, someone had the brilliant idea to merge houses and pickup trucks into a camper. These hybrids between pickups and houses provide the functionality of being able to drive around, while also maintaining the ability to provide shelter. However, in the beginning, these hybrids likely didn’t accelerate as fast and consumed more energy and resources than unweighted pickups. They were likely a little taller than unweighted pickups, and as such might not have been able to use certain bridges or tunnels. The brakes probably didn’t work as well. I could go on and on, but that would belabor the point I’m trying to make. In the beginning, if you just place two independently designed systems together Rube Goldberg style, the result will likely be functional but inefficient. Over the years, as engineers have worked to smoothly merge all of the systems of pickup and house together, campers have gotten much better at doing both jobs simultaneously.

Fig. 1: A truck-house hybrid is born. Images from Wikipedia


Fast and accurate alignment of long bisulfite-seq reads

Brent S. Pedersen, Kenneth Eyring, Subhajyoti De, Ivana V. Yang, David A. Schwartz
(Submitted on 6 Jan 2014)

Summary: Longer sequencing reads, with at least 200 bases per template, are now common. While traditional aligners have adopted new strategies to improve the mapping of longer reads, aligners specific to bisulfite-sequencing were optimized when much shorter reads were the norm. We sought to perform the first comparison using longer reads to determine which aligners were most accurate and efficient and to evaluate a novel software tool, bwa-meth, built on a traditional mapper that supports insertions, deletions and clipped alignments. We gauge accuracy by comparing the number of on- and off-target reads from a targeted sequencing project and by simulations. Availability and Implementation: The benchmarking scripts and the bwa-meth software are available at this https URL under the MIT License.
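
A common strategy for running a traditional mapper on bisulfite data is to convert every C to T in silico in both the reads and the reference, align in this reduced alphabet, and then compare each read with the original reference to call methylation. The abstract does not describe bwa-meth's internals, so treating this as its approach is an assumption. The sketch below is a toy version of the idea for forward-strand reads only. Exact substring search stands in for a real aligner, and the sequences are made up.

```python
def convert_c_to_t(sequence):
    """Fully convert a sequence in silico: every C becomes T."""
    return sequence.upper().replace("C", "T")


def methylation_calls(reference, read, offset):
    """Compare an aligned bisulfite read to the original reference.

    At each reference C, a read C means the base was protected from
    conversion (methylated); a read T means it was converted (unmethylated).
    Returns a list of (reference_position, is_methylated) tuples.
    """
    calls = []
    for i, base in enumerate(read.upper()):
        if reference[offset + i].upper() == "C" and base in ("C", "T"):
            calls.append((offset + i, base == "C"))
    return calls


reference = "ACGTCCGA"
read = "GTTCG"  # hypothetical bisulfite read covering reference[2:7] = "GTCCG"

# "Align" in the converted alphabet; a real pipeline would call a mapper here.
offset = convert_c_to_t(reference).find(convert_c_to_t(read))
calls = methylation_calls(reference, read, offset)
```

Here the read places at offset 2, and the two reference cytosines it covers receive one unmethylated and one methylated call.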

Sifting through 2013 with Haldane’s Sieve

2013 was the first full year of Haldane’s Sieve, which we started in 2012 to bring attention to preprints in evolutionary and population genetics. Perhaps the most exciting development of the year was the expansion of preprint server options–instead of arXiv, some biologists are now using bioRxiv or PeerJ Preprints. This year at Haldane’s Sieve, we received over 100,000 visitors from all over the world. Our most viewed posts of the year were:

Most viewed on Haldane’s Sieve: December 2013

The most viewed posts on Haldane’s Sieve in December 2013 were:

Happy New Year Homo erectus? More evidence for interbreeding with archaics predating the modern human/Neanderthal split

Peter J. Waddell
(Submitted on 30 Dec 2013)

A range of a priori hypotheses about the evolution of modern and archaic genomes are further evaluated and tested. In addition to the well-known splits/introgressions involving Neanderthal genes into out-of-Africa people, or Denisovan genes into Oceanians, a further series of archaic splits and hypotheses proposed in Waddell et al. (2011) are considered in detail. These include signals of Denisovans with something markedly more archaic and possibly something more archaic into Papuans as well. These are compared and contrasted with some well-advertised introgressions such as Denisovan genes across East Asia, archaic genes into San or non-tree mixing between Oceanians, East Asians and Europeans. The general result is that these less appreciated and surprising archaic splits have just as much or more support in genome sequence data. Further, evaluation confirms the hypothesis that archaic genes are much rarer on modern X chromosomes, and may even be near totally absent, suggesting strong selection against their introgression. Modeling of relative split weights allows an inference of the proportion of the genome the Denisovan seems to have gotten from an older archaic, and the best estimate is around 2%. Using a mix of quantitative and qualitative morphological data and novel phylogenetic methods, robust support is found for multiple distinct middle Pleistocene lineages. Of these, fossil hominids such as SH5, Petralona, and Dali, in particular, look like prime candidates for contributing pre-Neanderthal/Modern archaic genes to Denisovans, while the Jinniu-Shan fossil looks like the best candidate for a close relative of the Denisovan. That the Papuans might have received some truly archaic genes appears a good possibility and they might even be from Homo erectus.