Global Epistasis Makes Adaptation Predictable Despite Sequence-Level Stochasticity

Sergey Kryazhimskiy, Daniel Paul Rice, Elizabeth Jerison, Michael M Desai

Epistasis can make adaptation highly unpredictable, rendering evolutionary trajectories contingent on the chance effects of initial mutations. We used experimental evolution in Saccharomyces cerevisiae to quantify this effect, finding dramatic differences in adaptability between 64 closely related genotypes. Despite these differences, sequencing of 105 evolved clones showed no significant effect of initial genotype on future sequence-level evolution. Instead, reconstruction experiments revealed a consistent pattern of diminishing returns epistasis. Our results suggest that many beneficial mutations affecting a variety of biological processes are globally coupled: they interact strongly, but only through their combined effect on fitness. Sequence-level adaptation is thus highly stochastic. Nevertheless, fitness evolution is strikingly predictable because differences in adaptability are determined only by global fitness-mediated epistasis, not by the identity of individual mutations.

Human paternal and maternal demographic histories: insights from high-resolution Y chromosome and mtDNA sequences

Sebastian Lippold, Hongyang Xu, Albert Ko, Mingkun Li, Gabriel Renaud, Anne Butthof, Roland Schroeder, Mark Stoneking

To investigate in detail the paternal and maternal demographic histories of humans, we obtained ~500 kb of non-recombining Y chromosome (NRY) sequences and complete mtDNA genome sequences from 623 males from 51 populations in the CEPH Human Genome Diversity Panel (HGDP). Our results: confirm the controversial assertion that genetic differences between human populations on a global scale are bigger for the NRY than for mtDNA; suggest very small ancestral effective population sizes (<100) for the out-of-Africa migration as well as for many human populations; and indicate that the ratio of female effective population size to male effective population size (Nf/Nm) has been greater than one throughout the history of modern humans, and has recently increased due to faster growth in Nf. However, we also find substantial differences in patterns of mtDNA vs. NRY variation in different regional groups; thus, global patterns of variation are not necessarily representative of specific geographic regions.

Entropy Rates of the Multidimensional Moran Processes and Generalizations

Marc Harper
(Submitted on 13 Jan 2014)

The interrelationships of the fundamental biological processes natural selection, mutation, and stochastic drift are quantified by the entropy rate of Moran processes with mutation, measuring the long-run variation of a Markov process. The entropy rate is shown to behave intuitively with respect to evolutionary parameters such as monotonicity with respect to mutation probability (for the neutral landscape), relative fitness, and strength of selection. Strict upper bounds, depending only on the number of replicating types, for the entropy rate are given and the neutral fitness landscape attains the maximum in the large population limit. Various additional limits are computed including small mutation, weak and strong selection, and large population holding the other parameters constant, revealing the individual contributions and dependences of each evolutionary parameter on the long-run outcomes of the processes.
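
As a rough illustration of the quantity discussed above, the following sketch computes the entropy rate of a two-type Moran process with mutation, using the standard definition H = -sum_i pi_i sum_j P_ij log P_ij, where pi is the stationary distribution and P the transition matrix. The parametrisation (population size n, relative fitness r of type A, symmetric mutation probability mu) and all function names are assumptions made for this example; it is not the paper's code and covers only the two-type case, not the multidimensional one.

```python
import math


def moran_transitions(n, r, mu):
    """Rows (down, stay, up) for states i = 0..n, where i counts type-A individuals.

    Assumes fitness-proportional reproduction, symmetric mutation with
    probability mu in (0, 1), and uniform random death.
    """
    rows = []
    for i in range(n + 1):
        total_fitness = r * i + (n - i)
        # Probability that the offspring is type A once mutation is applied.
        p_a = (r * i * (1 - mu) + (n - i) * mu) / total_fitness
        up = p_a * (n - i) / n        # A is born, B dies
        down = (1 - p_a) * i / n      # B is born, A dies
        rows.append((down, 1 - up - down, up))
    return rows


def stationary(rows):
    """Stationary distribution of a birth-death chain via detailed balance."""
    weights = [1.0]
    for i in range(len(rows) - 1):
        weights.append(weights[-1] * rows[i][2] / rows[i + 1][0])
    total = sum(weights)
    return [w / total for w in weights]


def entropy_rate(n, r, mu):
    """Entropy rate in nats: stationary average of the per-state row entropy."""
    rows = moran_transitions(n, r, mu)
    pi = stationary(rows)
    return -sum(
        p * sum(q * math.log(q) for q in row if q > 0)
        for p, row in zip(pi, rows)
    )


if __name__ == "__main__":
    for mu in (0.01, 0.1):
        print(mu, entropy_rate(20, 1.0, mu))
```

With the neutral landscape (r = 1) and n = 20, this sketch gives a smaller entropy rate for mu = 0.01 than for mu = 0.1, in line with the monotonicity in mutation probability described in the abstract.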

Integrative genomics analysis identifies pericentromeric regions of human chromosomes affecting patterns of inter-chromosomal interactions

Gennadi V. Glinsky
(Submitted on 10 Jan 2014)

Genome-wide analysis of distributions of densities of long-range interactions of human chromosomes with each other, nucleoli, nuclear lamina, and binding sites of chromatin state regulatory proteins, CTCF and STAT1, identifies non-random, highly correlated patterns of density distributions along the chromosome length for all these features. Marked co-enrichments and clustering of all these interactions are detected at discrete genomic regions on selected chromosomes, which are located within pericentromeric heterochromatin and designated Centromeric Regions of Interphase Chromatin Homing (CENTRICH). CENTRICH manifest a 199- to 716-fold higher density of inter-chromosomal binding sites compared to genome-wide or chromosomal averages (p = 2.10E-101 to 1.08E-292). Sequence alignment analysis shows that CENTRICH represent unique DNA sequences of 3.9 to 22.4 Kb in size which: 1) are associated with the nucleolus; 2) exhibit a highly diverse set of DNA-bound chromatin state regulators, including marked enrichment of CTCF and STAT1 binding sites; 3) bind multiple intergenic disease-associated genomic loci (IDAGL) with documented long-range enhancer activities and established links to increased risk of developing epithelial malignancies and other common human disorders. Using distances of SNP loci homing sites within genomic coordinates of CENTRICH as a proxy for the likelihood of disease-linked SNP loci binding to CENTRICH, we demonstrate statistically significant correlations between the probability of SNP loci binding to CENTRICH and GWAS-defined odds ratios of increased risk of a disease for cancer, coronary artery disease, and type 2 diabetes. Our analysis suggests that centromeric sequences and pericentromeric heterochromatin may play an important role in human cells beyond the critical functions in chromosome segregation.

Palaeosymbiosis revealed by genomic fossils of Wolbachia in a strongyloidean nematode

Georgios Koutsovoulos, Benjamin Makepeace, Vincent N. Tanya, Mark Blaxter
(Submitted on 10 Jan 2014)

Wolbachia are common endosymbionts of terrestrial arthropods, and are also found in nematodes: the animal-parasitic filaria, and the plant-parasite Radopholus similis. Lateral transfer of Wolbachia DNA to the host genome is common. We generated a draft genome sequence for the strongyloidean nematode parasite Dictyocaulus viviparus, the cattle lungworm. In the assembly, we identified nearly 1 Mb of sequence with similarity to Wolbachia. The fragments were unlikely to derive from a live Wolbachia infection: most were short, and the genes were disabled through inactivating mutations. Many fragments were co-assembled with definitively nematode-derived sequence. We found limited evidence of expression of the Wolbachia-derived genes. The D. viviparus Wolbachia genes were most similar to filarial strains, and strains from the host-promiscuous clade F. We conclude that D. viviparus was infected by Wolbachia in the past. Genome sequence based surveys are a powerful tool for revealing the genome archaeology of infection and symbiosis.

VCF2Networks: applying Genotype Networks to Single Nucleotide Variants data

Giovanni Marco Dall’Olio, Ali R. Vahdat, Jaume Bertranpetit, Andreas Wagner, Hafid Laayouni
(Submitted on 9 Jan 2014)

Summary: Genotype networks are a method used in systems biology to study the innovability of a given phenotype, determining whether the phenotype is robust to mutations and how the genotypes associated with it are distributed in genotype space. Here we developed VCF2Networks, a tool to apply this method to population genetics data, in particular to Single Nucleotide Variant data encoded in the Variant Call Format (VCF). A complete summary of the properties of the genotype network that can be calculated by VCF2Networks is given in Supplementary Materials 1.
Availability and Implementation: The home page of the project is this https URL . VCF2Networks is also available directly from the Python Package Index (PyPI), under the name vcf2networks.
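
To make the idea of a genotype network concrete, here is a minimal sketch in plain Python. It does not use or reproduce the VCF2Networks API. It assumes that each genotype has already been encoded as a string of 0s and 1s (0 for the reference allele, 1 for the alternate allele at each variant site), takes the distinct genotypes as nodes, and links two nodes when they differ at exactly one site. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length genotype strings differ."""
    return sum(x != y for x, y in zip(a, b))


def genotype_network(genotypes):
    """Adjacency dict: nodes are distinct genotypes, edges join one-site neighbours."""
    nodes = sorted(set(genotypes))
    adjacency = {g: set() for g in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Connected components of the network, found by depth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


if __name__ == "__main__":
    # Toy sample of five observed genotypes over three variant sites.
    sample = ["000", "001", "011", "110", "000"]
    network = genotype_network(sample)
    print(len(network), "nodes,", len(connected_components(network)), "components")
```

Summary properties such as the number of components or the average degree can then be read off the adjacency dict; whether these match the properties reported by the tool itself is not something this sketch establishes.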

Author post: Physical constraints determine the logic of bacterial promoter architectures

Our next guest post is by Radu Zabet on his manuscript (with co-workers) Physical constraints determine the logic of bacterial promoter architectures, arXived here

Earlier last year we explored the possibility of understanding ‘real biology’ using our stochastic simulation framework GRiP. That software simulates how transcription factors (TFs) find their target sites in the genome, using a combination of three-dimensional diffusion around the DNA and a one-dimensional walk along it. This biophysical mechanism is quite well studied and is commonly termed ‘facilitated diffusion’. Unlike that of a homing missile, the path of a TF molecule to its target site is somewhat erratic, and with many other factors around, even ‘traffic jams’ on the DNA seem possible (these and other interesting phenomena were the subject of two other arXiv contributions we put online last year; see Haldane’s Sieve for more, and the two publications).
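
For readers who want a feel for the mechanism, the following toy simulation sketches a single TF searching a circular lattice of DNA sites: at each step it either slides one site left or right (the 1D walk) or, with some probability, unbinds and lands at a uniformly random site (a crude stand-in for 3D diffusion). This is not GRiP and ignores other molecules, binding affinities and real time scales; the function names, the lattice size and the parameter `p_unbind` are assumptions made only for this example.

```python
import random


def search_time(dna_length, target, p_unbind, rng, max_steps=10**6):
    """Steps until a single TF first reaches `target` on a circular lattice."""
    position = rng.randrange(dna_length)  # initial landing site
    steps = 0
    while position != target and steps < max_steps:
        steps += 1
        if rng.random() < p_unbind:
            # 3D excursion: unbind and re-land at a random site.
            position = rng.randrange(dna_length)
        else:
            # 1D sliding: move one site left or right along the DNA.
            position = (position + rng.choice((-1, 1))) % dna_length
    return steps


def mean_search_time(dna_length, target, p_unbind, runs, seed=0):
    """Average search time over independent runs, with a fixed seed."""
    rng = random.Random(seed)
    total = sum(search_time(dna_length, target, p_unbind, rng) for _ in range(runs))
    return total / runs


if __name__ == "__main__":
    for p_unbind in (0.001, 0.05, 1.0):
        print(p_unbind, mean_search_time(200, 100, p_unbind, runs=200))
```

Varying `p_unbind` shows how the balance between sliding and hopping changes the search time in this toy setting. Adding obstacles on the lattice would be the natural next step towards the ‘traffic jam’ scenario, but that is beyond this sketch.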

Oftentimes, TF binding sites are closely packed or even overlapping. In our latest paper, we explore how the spacing of binding sites along the DNA can influence the probability of a “TF traffic jam” occurring, and thereby the length of a TF’s “commute” to its binding site. We notice that one of the promoter organisations that we predict would cause massive traffic jams is underrepresented among E. coli promoters, suggesting that this phenomenon may have an important biological role.

One of the most common approaches to predicting TF occupancy is statistical thermodynamics, which assumes that the system is in steady state. Here we show that under biologically relevant parameters, a TF might take longer than a cell cycle to arrive at its binding site when the promoter is organized in a way that induces “traffic jams”. Therefore, it is important to consider the dynamics of TF binding, rather than just the steady state.

Usually, transcriptional logic refers to the idea that the specific combinations of TFs that bind to a gene promoter control the expression level of that gene. We extend this notion of transcriptional logic by proposing that the response to multiple regulatory inputs can also depend on the dynamics of TF binding. In other words: not only the final combinatorial pattern, but also the order in which these sites are occupied matters. In this context, we suggest that the spatial organisation of the promoter encodes the logic, influenced by TF concentrations that modulate promoter occupancy dynamics on biologically relevant time scales.

Using computer simulations of the search process, we show that the logic of complex bacterial promoters can be explained by the combinatorial action of three commonly found basic building blocks: switches, barriers and clusters, whose characteristics we analyse in detail.

The precise spacing of TF binding sites plays a key role in our model, and we show that physically constrained promoter organizations are commonly found in bacterial genomes and are conserved.

Finally, we also developed a new web-based computational tool (faster GRiP, or fGRIP), which is able to generate the dynamics of promoter occupancy for bacterial systems. This tool is available at

Distribution of population averaged observables in stochastic gene expression

Bhaswati Bhattacharyya, Ziya Kalay
(Submitted on 9 Jan 2014)

Observation of phenotypic diversity in a population of genetically identical cells is often linked to the stochastic nature of chemical reactions involved in gene regulatory networks. We investigate the distribution of population averaged gene expression levels as a function of population, or sample, size for several stochastic gene expression models to find out to what extent population averaged quantities reflect the underlying mechanism of gene expression. We consider three basic gene regulation networks corresponding to transcription with and without gene state switching and translation. Using analytical expressions for the probability generating function of observables and Large Deviation Theory, we calculate the distribution and first two moments of the population averaged mRNA and protein levels as a function of model parameters, population size and number of measurements contained in a data set. We validate our results using stochastic simulations and also report exact results on the asymptotic properties of population averages, which show qualitative differences among different models.
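
The simplest case can be checked numerically. For transcription without gene state switching (constant production and first-order degradation), the steady-state mRNA copy number per cell is Poisson distributed, so the average over n independent cells has the same mean and a variance smaller by a factor of n. The sketch below is a plain Monte Carlo check of that scaling, not the paper's analytical method; the mean `lam` (production rate divided by degradation rate), the sample sizes and the function names are assumptions chosen for illustration.

```python
import math
import random
import statistics


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def population_averages(lam, n_cells, n_measurements, seed=0):
    """Simulate repeated measurements of the mRNA level averaged over n_cells cells."""
    rng = random.Random(seed)
    return [
        sum(poisson_sample(lam, rng) for _ in range(n_cells)) / n_cells
        for _ in range(n_measurements)
    ]


if __name__ == "__main__":
    lam = 5.0  # hypothetical mean mRNA copy number per cell
    for n_cells in (1, 10, 100):
        averages = population_averages(lam, n_cells, n_measurements=1000)
        # Columns: n_cells, empirical mean, empirical variance, expected variance lam / n_cells
        print(n_cells, statistics.mean(averages), statistics.variance(averages), lam / n_cells)
```

Models with gene state switching or translation would need a different per-cell sampler (for example a Gillespie simulation), and it is in those cases that the abstract reports qualitative differences between models.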

Physical constraints determine the logic of bacterial promoter architectures

Daphne Ezer, Nicolae Radu Zabet, Boris Adryan
(Submitted on 27 Dec 2013)

Site-specific transcription factors (TFs) bind to their target sites on the DNA, where they regulate the rate at which genes are transcribed. Bacterial TFs undergo facilitated diffusion (a combination of 3D diffusion around and 1D random walk on the DNA) when searching for their target sites. Using computer simulations of this search process, we show that the organisation of the binding sites, in conjunction with TF copy number and binding site affinity, plays an important role in determining not only the steady state of promoter occupancy, but also the order in which TFs bind. These effects can be captured by facilitated diffusion-based models, but not by standard thermodynamics. We show that the spacing of binding sites encodes complex logic, which can be derived from combinations of three basic building blocks: switches, barriers and clusters, whose response alone and in higher orders of organisation we characterise in detail. Effective promoter organizations are commonly found in the E. coli genome and are highly conserved between strains. This will allow studies of gene regulation at an unprecedented level of detail, where our framework can create testable hypotheses of promoter logic.

Reconstructing transmission networks for communicable diseases using densely sampled genomic data: a generalized approach

Colin J. Worby, Philip D. O’Neill, Theodore Kypraios, Julie V. Robotham, Daniela De Angelis, Edward J. P. Cartwright, Sharon J. Peacock, Ben S. Cooper
(Submitted on 8 Jan 2014)

Probabilistic reconstruction of transmission networks for communicable diseases can provide important insights into epidemic dynamics, the effectiveness of infection control measures, and contact patterns in an at-risk population. Whole genome sequencing of pathogens from multiple hosts provides an opportunity to investigate who infected whom with unparalleled resolution. We considered disease outbreaks in a community with high frequency genomic sampling, and formulated stochastic epidemic models to investigate person-to-person transmission, based on genomic and epidemiological data. Our approach, which combines a stochastic epidemic transmission model with a genetic distance model, overcomes key limitations of previous methods by providing a framework with the flexibility to allow for unobserved infection times, multiple independent introductions of the pathogen, and within-host genetic diversity, as well as allowing forward simulation. We defined two genetic models: a transmission diversity model, in which genetic diversity increases along a transmission chain, and an importation structure model, which groups isolates into genetically similar clusters. We evaluated their predictive performance using simulated data, demonstrating high sensitivity and specificity, particularly for rapidly mutating pathogens with low transmissibility. We then analyzed data collected during an outbreak of MRSA in a hospital. We identified three probable transmission events (posterior probability > 0.5) among the twenty observed cases. We estimated that genetic diversity across transmission links was approximately the same as within-host, with an expected 3.9 (95% CrI: 3.3, 4.6) single nucleotide polymorphisms between isolates. Our methodology avoids restrictive assumptions required in many analyses, and has broad applicability to epidemics with densely sampled genomic data.