Cell-cycle regulated transcription associates with DNA replication timing in yeast and human

Hunter B. Fraser
(Submitted on 8 Aug 2013)

Eukaryotic DNA replication follows a specific temporal program, with some genomic regions consistently replicating earlier than others, yet what determines this program is largely unknown. Highly transcribed regions have been observed to replicate in early S-phase in all plant and animal species studied to date, but this relationship is thought to be absent from both budding yeast and fission yeast. No association between cell-cycle regulated transcription and replication timing has been reported for any species. Here I show that in budding yeast, fission yeast, and human, the genes most highly transcribed during S-phase replicate early, whereas those repressed in S-phase replicate late. Transcription during other cell-cycle phases shows either the opposite correlation with replication timing, or no relation. The relationship is strongest near late-firing origins of replication, which is not consistent with a previously proposed model — that replication timing may affect transcription — and instead suggests a potential mechanism involving the recruitment of limiting replication initiation factors during S-phase. These results suggest that S-phase transcription may be an important determinant of DNA replication timing across eukaryotes, which may explain the well-established association between transcription and replication timing.

Our Paper: The genomic impacts of drift and selection for hybrid performance in maize

This guest post is by Jeff Ross-Ibarra (@jrossibarra) on his paper with coauthors, Gerke et al., The genomic impacts of drift and selection for hybrid performance in maize, arXived here.

Iowa recurrent selection as an evolutionary experiment in hybrid vigor

Maize is an outcrossing species, and was cultivated as such up through the first quarter of the 20th century. Starting in the 1920’s, however, breeders began to abandon open-pollinated maize in favor of hybrid varieties resulting from crosses between inbred lines. Hybrids are often more robust and higher yielding than either inbred parent, a phenomenon known as hybrid vigor or heterosis.

Breeding for hybrid varieties – and presumably increased heterosis – has had a profound impact on diversity across the maize genome. There are at least two important differences from previous breeding efforts. First, breeders select on and work with inbred maize lines rather than practicing mass selection on open-pollinated populations. This results in much smaller effective population sizes, and has implications for recessive traits and deleterious alleles that could be masked in heterozygotes. Second, instead of selecting the best plants per se, breeders now select for inbreds that make high-yielding hybrids. This means a breeder might favor an inbred that itself is not high-yielding if it consistently makes good hybrids when paired with other inbreds.

We set out to study the effects of these breeding strategies on patterns of diversity across the maize genome. We took advantage of one of the longest-running ongoing experiments on selection for hybrid performance, started in the late 1940’s by the US Dept. of Agriculture’s Agricultural Research Service. Two small (12 and 16) sets of founder inbred lines were randomly mated to create two base populations: the Iowa Stiff Stalk Synthetic (BSSS) and the Iowa Corn Borer Synthetic No. 1 (BSCB1). In addition to its role as an important selection experiment, multiple maize breeding lines have come out of the BSSS population, including the line used for the maize reference genome.

Diversity in the BSSS and BSCB1 is patterned predominantly by drift

Over the course of the experiment we studied, the two base populations underwent 16 cycles of recurrent selection, in which lines from each population were crossed to each other and evaluated for both hybrid and per-se performance. Selected lines were intermated within each population to form the next generation. To investigate the genomic impact of this selection scheme, we genotyped progenitor lines and over 600 individuals from multiple selection cycles using the Illumina MaizeSNP50 SNP array. And because we know the exact crossing and selection scheme used, we can compare the observed changes in genome-wide diversity with strictly neutral crossing simulations using the genotypes of the starting populations.
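To make the idea of a neutral baseline concrete, here is a minimal sketch of a drift-only simulation. It is not the crossing simulation used in the paper, which followed the actual pedigree and founder genotypes; it is a plain Wright-Fisher model, and the population size, number of loci, and starting allele frequencies are hypothetical values chosen only for illustration. Only the 16 cycles come from the text.

```python
import random


def mean_heterozygosity(freqs):
    """Average expected heterozygosity 2p(1-p) across loci."""
    return sum(2 * p * (1 - p) for p in freqs) / len(freqs)


def simulate_drift(founder_freqs, n_individuals, n_cycles, seed=0):
    """Follow allele frequencies under pure drift (no selection).

    Each cycle, 2 * n_individuals gametes are drawn at random from the
    current allele frequencies, independently at every locus.
    Returns the mean heterozygosity after each cycle, starting with cycle 0.
    """
    rng = random.Random(seed)
    freqs = list(founder_freqs)
    n_gametes = 2 * n_individuals
    history = [mean_heterozygosity(freqs)]
    for _ in range(n_cycles):
        freqs = [
            sum(rng.random() < p for _ in range(n_gametes)) / n_gametes
            for p in freqs
        ]
        history.append(mean_heterozygosity(freqs))
    return history


# Hypothetical setup: 200 unlinked loci starting at frequency 0.5,
# 10 parents per cycle, 16 cycles of random mating.
history = simulate_drift([0.5] * 200, n_individuals=10, n_cycles=16, seed=1)
print(f"heterozygosity: start {history[0]:.3f}, end {history[-1]:.3f}")
```

Observed diversity that falls within the range produced by many such neutral replicates does not require selection to explain it; regions that fall well outside that range are the candidates for selection.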

Both populations steadily lost genetic diversity as they became more diverged from one another, but diversity and divergence between BSSS and BSCB1 can be largely reproduced by simulation without any selection. In fact, principal component analysis clearly reveals changes in population structure and diversity that mirror alterations in rates of inbreeding and effective population size that occurred over the course of the experiment. This indicates the structure is not necessarily related to the phenotypic improvement, but might be a by-product of the breeding scheme. Similar population structure is reflected in a recent broad comparison of US maize germplasm and suggests that much of the diversity and structure of modern maize germplasm has been shaped by genetic drift.

Selection efficacy and fixation at regions of low recombination

But genetic drift can’t be the whole story in these populations. Numerous experiments have shown that the later populations are superior to their progenitors in terms of hybrid yield and traits important to increased planting density (more plants per acre = more yield). These same trends are observed across North American maize as a whole, suggesting common themes in how maize has improved over time. Selection is difficult to detect in the face of strong genetic drift, especially when the selection has been on traits with complex genetic architectures. However, our simulations do detect regions of low heterozygosity in each population that are longer than expected given their genetic distance.

The most striking pattern of these regions is their lack of overlap between the two populations. In simple cases, classic overdominance models of heterosis predict that at a single locus, two distinct alleles confer heterozygote advantage when combined. In this case, selection should lead to decreased heterozygosity at a locus in both populations as complementary alleles rise in frequency. We don’t observe this, and neither did a different study that used other populations.

A popular alternative to the overdominance model is the dominance model, which predicts that heterosis is caused by the complementation of linked recessive deleterious alleles. In this case, multiple haplotypes in the other population may complement a fixed region if most deleterious alleles in maize are rare. Evidence from numerous studies supports a dominance model of heterosis, including findings of excess residual heterozygosity in low recombination regions of a maize mapping population. In regions of low recombination, heterozygosity (and thus complementation) becomes important due to an inability to efficiently select for new recombinants in these regions, especially with low effective population sizes. And because of low rates of recombination, a small genetic interval in these regions becomes massive in physical space and encompasses the composite effects of many deleterious loci. We observe fixation in these regions in the BSSS and BSCB1 populations. They are short genetically (1-2 centimorgans), but make up very large fractions of the chromosome. We find that in many cases, these regions have been inherited largely intact from the original population founders, indicating that selection for new haplotype combinations in these regions has been ineffective. Large haplotypes in some cases may have fixed early on in the formation of many breeding programs, and the combination of limited exchange between breeding pools and small effective population sizes has provided little opportunity for selective removal of deleterious alleles. Complementation and the inefficiency of selection in these pericentromeric regions, which span a large portion of the physical genome, may thus explain the difference between hybrid and inbred yield and why it has remained fairly constant.

Predicting protein contact map using evolutionary and physical constraints by integer programming

Zhiyong Wang, Jinbo Xu
(Submitted on 8 Aug 2013)

Motivation. Protein contact map describes the pairwise spatial and functional relationship of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains very challenging to predict contact map using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and physical feasibility of the whole contact map. A couple of recent methods predict contact map based upon residue co-evolution, taking into consideration contact correlation and enforcing a sparsity restraint, but these methods require a very large number of sequence homologs for the protein under consideration and the resultant contact map may be still physically unfavorable.
Results. This paper presents a novel method PhyCMAP for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming (ILP). The evolutionary restraints include sequence profile, residue co-evolution and context-specific statistical potential. The physical restraints specify more concrete relationship among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and thus, significantly improves prediction accuracy. Experimental results confirm that PhyCMAP outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration. PhyCMAP can predict contacts within minutes after PSIBLAST search for sequence homologs is done, much faster than the two recent methods PSICOV and EvFold.

How Population Growth Affects Linkage Disequilibrium

Alan R. Rogers
(Submitted on 8 Aug 2013)

Linkage disequilibrium (LD) is often summarized using the “LD curve,” which relates the LD between pairs of sites to the distance that separates them along the chromosome. This paper shows how the LD curve responds to changes in population size. An expansion of population size generates an LD curve that declines steeply, especially if that expansion has followed a bottleneck. A reduction in size generates an LD curve that is high but relatively flat. In European data, the curve is steep, suggesting a history of population expansion.
These conclusions emerge from the study of $\sigma_d^2$, a measure of LD that has never played a central role. It has been seen merely as an approximation to another measure, $r^2$. Yet $\sigma_d^2$ has different dynamical behavior and provides deeper time depth. Furthermore, it is easily estimated from data and can be predicted from population history using a fast, deterministic algorithm.
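For readers unfamiliar with the two measures, the sketch below contrasts them using their standard definitions: $r^2$ is computed per pair of sites as $D^2 / (p_A(1-p_A)p_B(1-p_B))$, while $\sigma_d^2$ is the ratio of the averaged numerator to the averaged denominator across pairs of sites. The function names and toy haplotypes are illustrative assumptions, no sample-size bias correction is applied, and this is not the estimator or the deterministic algorithm from the paper.

```python
def ld_terms(haplotypes):
    """Return (D^2, pA(1-pA)pB(1-pB)) for one pair of biallelic sites.

    haplotypes: list of (a, b) tuples with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d, p_a * (1 - p_a) * p_b * (1 - p_b)


def r_squared(haplotypes):
    """Per-pair r^2: the ratio is taken within a single pair of sites."""
    numerator, denominator = ld_terms(haplotypes)
    return numerator / denominator


def sigma_d_squared(site_pairs):
    """sigma_d^2: ratio of summed numerators to summed denominators."""
    terms = [ld_terms(haplotypes) for haplotypes in site_pairs]
    return sum(num for num, _ in terms) / sum(den for _, den in terms)


# Toy data: one pair of sites in complete LD, one pair with no LD.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(r_squared(linked), r_squared(unlinked))
print(sigma_d_squared([linked, unlinked]))
```

Because $\sigma_d^2$ is a ratio of averages rather than an average of ratios, pairs of sites with low-frequency alleles contribute less to it than they do to a mean of per-pair $r^2$ values.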

The dynamics of alternative pathways to compensatory substitution

Chris A. Nasrallah
(Submitted on 9 Aug 2013)

The role of epistatic interactions among loci is a central question in evolutionary biology and is increasingly relevant in the genomic age. While the population genetics of compensatory substitution have received considerable attention, most studies have focused on the case when natural selection is very strong against deleterious intermediates. In the biologically-plausible scenario of weak to moderate selection there exist two alternate pathways for compensatory substitution. In one pathway, a deleterious mutation becomes fixed prior to occurrence of the compensatory mutation. In the other, the two loci are simultaneously polymorphic. The rates of compensatory substitution along these two pathways and their relative probabilities are functions of the population size, selection strength, mutation rate, and recombination rate. In this paper these rates and path probabilities are derived analytically and verified using population genetic simulations. The expected time durations of these two paths are similar when selection is moderate, but not when selection is weak. The effect of recombination on the dynamics of the substitution process are explored using simulation. Using the derived rates, a phylogenetic substitution model of the compensatory evolution process is presented that could be used for inference of population genetic parameters from interspecific data.

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

Binghang Liu, Yujian Shi, Jianying Yuan, Xuesong Hu, Hao Zhang, Nan Li, Zhenyu Li, Yanxiang Chen, Desheng Mu, Wei Fan
(Submitted on 9 Aug 2013)

Background: With the fast development of next generation sequencing technologies, increasing numbers of genomes are being de novo sequenced and assembled. However, most are in fragmental and incomplete draft status, and thus it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems to current assemblers utilizing short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics. Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating the genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and coverage bias, the estimation accuracy of our method is significantly improved over existing methods. We also studied how the various genomic and sequencing characteristics affect the estimation accuracy using simulated sequencing data, and discussed the limitations on applying our method to real sequencing data. Conclusion: Based on this research, we show that the k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve our understanding of a species genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms. The programs developed in this research are written using C/C++ and freely accessible at this ftp URL
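As a rough illustration of the general idea, the sketch below applies the simplest textbook k-mer estimate: count k-mers in the reads, find the peak depth of the k-mer frequency histogram, and divide the total number of k-mers by that depth. This is not the authors' model, which additionally handles sequencing error, coverage bias, repeats, and heterozygosity; the function names, the `min_depth` cutoff, and the toy reads are assumptions made for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(counts, min_depth=2):
    """Estimate genome size as (total k-mers) / (peak k-mer depth).

    K-mers seen fewer than min_depth times are treated as likely
    sequencing errors and ignored (an assumed, simplistic filter).
    """
    histogram = Counter(counts.values())  # depth -> number of distinct k-mers
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# Toy example: a 12-base "genome" sequenced error-free five times over.
genome = "ACGTAGCTTGCA"
reads = [genome] * 5
print(estimate_genome_size(kmer_counts(reads, k=4)))
```

The estimate is in units of k-mer positions (genome length minus k plus 1), which is effectively the genome length when the genome is much longer than k.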

Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements

Weiwei Zhang, Tim D Spector, Panos Deloukas, Jordana T Bell, Barbara E Engelhardt
(Submitted on 9 Aug 2013)

Background: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is important, but current approaches tackle average methylation within a genomic locus and are often limited to specific genomic regions. Results: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict CpG site methylation levels using as features neighboring CpG site methylation levels and genomic distance, and co-localization with coding regions, CGIs, and regulatory elements from the ENCODE project, among others. Our approach achieves 91% — 94% prediction accuracy of genome-wide methylation levels at single CpG site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs. Our classifier outperforms state-of-the-art methylation classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation status, CpG island status, co-localized DNase I hypersensitive sites, and specific transcription factor binding sites were found to be most predictive of methylation levels. Conclusions: Our observations of DNA methylation patterns led us to develop a classifier to predict site-specific methylation levels that achieves the best DNA methylation predictive accuracy to date. Furthermore, our method identified genomic features that interact with DNA methylation, elucidating mechanisms involved in DNA methylation modification and regulation, and linking different epigenetic processes.

A novel test for selection on cis-regulatory elements reveals positive and negative selection acting on mammalian transcriptional enhancers

Justin D. Smith, Kimberly F. McManus, Hunter B. Fraser
(Submitted on 7 Aug 2013)

Measuring natural selection on genomic elements involved in the cis-regulation of gene expression — such as transcriptional enhancers and promoters — is critical for understanding the evolution of genomes, yet it remains a major challenge. Many studies have attempted to detect positive or negative selection in these noncoding elements by searching for those with the fastest or slowest rates of evolution, but this can be problematic. Here we introduce a new approach to this issue, and demonstrate its utility on three mammalian transcriptional enhancers. Using results from saturation mutagenesis studies of these enhancers, we classified all possible point mutations as up-regulating, down-regulating, or silent, and determined which of these mutations have occurred on each branch of a phylogeny. Applying a framework analogous to Ka/Ks in protein-coding genes, we measured the strength of selection on up-regulating and down-regulating mutations, in specific branches as well as entire phylogenies. We discovered distinct modes of selection acting on different enhancers: while all three have experienced negative selection against down-regulating mutations, the selection pressures on up-regulating mutations vary. In one case we detected positive selection for up-regulation, while the other two had no detectable selection on up-regulating mutations. Our methodology is applicable to the growing number of saturation mutagenesis data sets, and provides a detailed picture of the mode and strength of natural selection acting on cis-regulatory elements.
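One way to picture a Ka/Ks-like statistic for an enhancer is sketched below: the number of substitutions observed in a mutation class is normalized by the number of possible point mutations in that class, and the result is divided by the same quantity for silent mutations. This normalization is an assumption made by direct analogy with Ka/Ks, not the exact statistic from the paper, and the counts are invented toy numbers rather than data.

```python
def selection_ratio(observed, possible, mutation_class, neutral_class="silent"):
    """Ka/Ks-style ratio for one class of regulatory mutations.

    observed: substitutions seen on a branch or phylogeny, per class.
    possible: number of possible point mutations, per class.
    Values near 1 suggest neutrality, below 1 negative selection,
    and above 1 positive selection (under this simplified analogy).
    """
    class_rate = observed[mutation_class] / possible[mutation_class]
    neutral_rate = observed[neutral_class] / possible[neutral_class]
    return class_rate / neutral_rate


# Invented toy counts for a single hypothetical enhancer.
observed = {"up": 6, "down": 2, "silent": 20}
possible = {"up": 100, "down": 200, "silent": 400}

print(selection_ratio(observed, possible, "up"))
print(selection_ratio(observed, possible, "down"))
```

In this toy case the down-regulating class has a ratio well below 1, the pattern one would expect under negative selection against down-regulating mutations.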

The molecular mechanism of a cis-regulatory adaptation in yeast

Jessica Chang, Yiqi Zhou, Xiaoli Hu, Lucia Lam, Cameron Henry, Erin M. Green, Ryosuke Kita, Michael S. Kobor, Hunter B. Fraser
(Submitted on 7 Aug 2013)

Despite recent advances in our ability to detect adaptive evolution involving the cis-regulation of gene expression, our knowledge of the molecular mechanisms underlying these adaptations has lagged far behind. Across all model organisms the causal mutations have been discovered for only a handful of gene expression adaptations, and even for these, mechanistic details (e.g. the trans-regulatory factors involved) have not been determined. We previously reported a polygenic gene expression adaptation involving down-regulation of the ergosterol biosynthesis pathway in the budding yeast Saccharomyces cerevisiae. Here we investigate the molecular mechanism of a cis-acting mutation affecting a member of this pathway, ERG28. We show that the causal mutation is a two-base deletion in the promoter of ERG28 that strongly reduces the binding of two transcription factors, Sok2 and Mot3, thus abolishing their regulation of ERG28. This down-regulation increases resistance to a widely used antifungal drug targeting ergosterol, similar to mutations disrupting this pathway in clinical yeast isolates. The identification of the causal genetic variant revealed that the selection likely occurred after the deletion was already present at high frequency in the population, rather than when it was a new mutation. These results provide a detailed view of the molecular mechanism of a cis-regulatory adaptation, and underscore the importance of this view to our understanding of evolution at the molecular level.

Our paper: Inferring HIV escape rates from multi-locus genotype data

This guest post is by Richard Neher on his paper with Taylor Kessinger and Alan Perelson: Kessinger et al. Inferring HIV escape rates from multi-locus genotype data. arXived here.
This is cross posted from the Neher lab website.

We have a new preprint on the arXiv (here on Haldane’s sieve). This work is the result of a collaboration between us and Alan Perelson, LANL, and explores methods to estimate parameters of the HIV-immune system interaction from time-resolved sequence data. The focus of this paper is on early infection, which is dominated by a few rapid substitutions that fix because they prevent or reduce recognition of infected cells by the immune system via cytotoxic T-lymphocytes (CTL). CTL escape is one of the fastest instances of evolution I have come across: 4-6 mutations spread within a few weeks. It happens in most HIV infections and is partly predictable based on the HLA genotype of the infected person. These substitutions are so rapid that clonal interference has to be modeled. Our method fits a reduced model of clonal interference to the typically very sparse data and thereby estimates the selection coefficients, aka escape rates.

Why do we want to know these numbers?
The number of viruses in the blood of an infected person peaks 2-3 weeks after infection and thereafter drops by 2-3 orders of magnitude. This drop is partly due to a response by the adaptive immune system. However, it has proved difficult to attribute this drop to specific parts of the immune response. The rates at which different mutations sweep through the population give us information about the pressure exerted by the T-cell clones that target the epitope containing this mutation.

How do we do it?
Early in infection, the viral population is large and selection is strong. In these conditions, recombination is of minor importance since most double/triple… mutants are more efficiently produced by recurrent mutation than by recombination. This implies that mutations accumulate sequentially, always on a background on which all previous mutations are already present. The time at which a novel mutation happens is tightly constrained by the trajectory of the preceding genotype. These constraints regularize the fitting problem to some degree, and the multi-locus fitting is more robust than single-locus fitting.
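For orientation, the single-locus baseline mentioned above can be sketched in a few lines. If an escape mutant with a constant selective advantage grows logistically, the log-odds of its frequency increase linearly in time, so two frequency measurements give an estimate of the escape rate. This is only that naive single-locus estimate under an assumed logistic model, not the multi-locus method of the paper; the function names and example numbers are hypothetical.

```python
import math


def logit(x):
    """Log-odds of a frequency strictly between 0 and 1."""
    return math.log(x / (1 - x))


def escape_rate(t1, x1, t2, x2):
    """Single-locus escape rate from two (time, frequency) samples.

    Under logistic growth, logit(x) rises linearly with slope equal to
    the escape rate, so the rate is the slope between the two points.
    """
    return (logit(x2) - logit(x1)) / (t2 - t1)


def logistic_frequency(t, rate, x0):
    """Frequency at time t of a mutant starting at x0 with a given rate."""
    odds = x0 / (1 - x0) * math.exp(rate * t)
    return odds / (1 + odds)


# Hypothetical example: a mutant at 1% frequency with rate 0.3 per day.
x_day10 = logistic_frequency(10, rate=0.3, x0=0.01)
x_day20 = logistic_frequency(20, rate=0.3, x0=0.01)
print(escape_rate(10, x_day10, 20, x_day20))  # recovers roughly 0.3
```

With sparse sampling a mutant is often observed at frequency 0 at one time point and near 1 at the next, where the log-odds are undefined or poorly determined; this is one reason the constraints from the other loci help.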

What do we learn about evolution in general?
In addition to the intrinsic interest in the HIV/CTL interaction, CTL escape is an ideal setting to study rapidly evolving populations. This evolution happens in its “natural” habitat and the selective pressure as well as the functional consequences of the observed molecular changes can be quantified via immunological data, protein structure, and replication assays. In addition, we have ample cross-sectional data (HIV sequences from many different patients) that allow us to look at the prevalence of the escape mutations and potential compensatory mutations. None of this is done in this paper, but studying HIV/immune-system coevolution is a fascinating showcase of rapid evolution.