Identifying and Mapping Cell-type Specific Chromatin Programming of Gene Expression

Troels T. Marstrand, John D. Storey
(Submitted on 11 Oct 2012)

A problem of substantial interest is to systematically map variation in chromatin structure to gene expression regulation across conditions, environments, or differentiated cell types. We developed and applied a quantitative framework for determining the existence, strength, and type of relationship between high-resolution chromatin structure in terms of DNaseI hypersensitivity (DHS) and genome-wide gene expression levels in 20 diverse human cell lines. We show that ~25% of genes show cell-type specific expression explained by alterations in chromatin structure. We find that distal regions of chromatin structure (e.g., +/- 200kb) capture more genes with this relationship than local regions (e.g., +/- 2.5kb), yet the local regions show a more pronounced effect. By exploiting variation across cell-types, we were capable of pinpointing the most likely hypersensitive sites related to cell-type specific expression, which we show have a range of contextual usages. This quantitative framework is likely applicable to other settings aimed at relating continuous genomic measurements to gene expression variation.

LUMPY: A probabilistic framework for structural variant discovery

Ryan M. Layer, Ira M. Hall, Aaron R. Quinlan
(Submitted on 8 Oct 2012)
Comprehensive discovery of structural variation (SV) in human genomes from DNA sequencing requires the integration of multiple alignment signals including read-pair, split-read and read-depth. However, owing to inherent technical challenges, most existing SV discovery approaches utilize only one signal and consequently suffer from reduced sensitivity, especially at low sequence coverage and for smaller SVs. We present a novel and extremely flexible probabilistic SV discovery framework that is capable of integrating any number of SV detection signals including those generated from read alignments or prior evidence. We demonstrate improved sensitivity over extant methods by combining paired-end and split-read alignments and emphasize the utility of our framework for comprehensive studies of structural variation in heterogeneous tumor genomes. We further discuss the broader utility of this approach for probabilistic integration of diverse genomic interval datasets.

LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data

Alison F. Feder, Dmitri A. Petrov, Alan O. Bergland
(Submitted on 8 Oct 2012)
High-throughput pooled resequencing offers significant potential for whole genome population sequencing. However, its main drawback is the loss of haplotype information. In order to regain some of this information, we present LDx, a computational tool for estimating linkage disequilibrium (LD) from pooled resequencing data. LDx uses an approximate maximum likelihood approach to estimate LD (r2) between pairs of SNPs that can be observed within and among single reads. LDx also reports r2 estimates derived solely from observed genotype counts. We demonstrate that the LDx estimates are highly correlated with r2 estimated from individually resequenced strains. We discuss the performance of LDx using more stringent quality conditions and infer via simulation the degree to which performance can improve based on read depth. Finally we demonstrate two possible uses of LDx with real and simulated pooled resequencing data. First, we use LDx to infer genomewide patterns of decay of LD with physical distance in D. melanogaster population resequencing data. Second, we demonstrate that r2 estimates from LDx are capable of distinguishing alternative demographic models representing plausible demographic histories of D. melanogaster.
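
For readers unfamiliar with the statistic, r2 between two biallelic SNPs can be computed directly from counts of the four two-locus haplotypes, which is the information a read spanning both SNPs provides. The sketch below illustrates only that direct-count calculation. It is not the LDx implementation (the approximate maximum likelihood step is not reproduced here), and the function name and argument names are hypothetical.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic SNPs from two-locus haplotype counts.

    n_AB, n_Ab, n_aB, n_ab are counts of observations (e.g. reads covering
    both SNPs) carrying each allele combination. Names are illustrative.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no observations")
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # allele frequency at the first SNP
    p_B = (n_AB + n_aB) / total    # allele frequency at the second SNP
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("at least one SNP is monomorphic in this sample")
    d = p_AB - p_A * p_B           # the LD coefficient D
    return d * d / denom
```

Complete association between the two SNPs gives a value of 1, and independent alleles give 0.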

Birth and death processes with neutral mutations

Nicolas Champagnat, Amaury Lambert, Mathieu Richard
(Submitted on 27 Sep 2012)

In this paper, we review recent results of ours concerning branching processes with general lifetimes and neutral mutations, under the infinitely many alleles model, where mutations can occur either at birth of individuals or at a constant rate during their lives.
In both models, we study the allelic partition of the population at time t. We give closed formulae for the expected frequency spectrum at t and prove pathwise convergence to an explicit limit, as t goes to infinity, of the relative numbers of types younger than some given age and carried by a given number of individuals (small families). We also provide convergences in distribution of the sizes or ages of the largest families and of the oldest families.
In the case of exponential lifetimes, population dynamics are given by linear birth and death processes, and we can most of the time provide general formulations of our results unifying both models.

Forward Simulation of Fisher-Wright Populations with Stochastic Population Size and Neutral Single Step Mutations in Haplotypes

Efficient Forward Simulation of Fisher-Wright Populations with Stochastic Population Size and Neutral Single Step Mutations in Haplotypes
Mikkel Meyer Andersen, Poul Svante Eriksen
(Submitted on 5 Oct 2012)

In both population genetics and forensic genetics it is important to know how haplotypes are distributed in a population. Simulation of population dynamics helps facilitating research on the distribution of haplotypes. In forensic genetics, the haplotypes can for example consist of lineage markers such as short tandem repeat loci on the Y chromosome (Y-STR). A dominating model for describing population dynamics is the simple, yet powerful, Fisher-Wright model. We describe an efficient algorithm for exact forward simulation of exact Fisher-Wright populations (and not approximative such as the coalescent model). The efficiency comes from convenient data structures by changing the traditional view from individuals to haplotypes. The algorithm is implemented in the open-source R package ‘fwsim’ and is able to simulate very large populations. We focus on a haploid model and assume stochastic population size with flexible growth specification, no selection, a neutral single step mutation process, and self-reproducing individuals. These assumptions make the algorithm ideal for studying lineage markers such as Y-STR.
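
To make the "haplotypes rather than individuals" idea concrete, here is a minimal sketch of one Fisher-Wright generation in which the population is stored as a table of haplotype counts. It is written in Python for illustration and is not the 'fwsim' code. It assumes a constant population size (the paper allows stochastic size), and the function name and the mutation rate `mu` are hypothetical.

```python
import random
from collections import Counter


def next_generation(pop, mu, rng):
    """Advance a haploid Fisher-Wright population by one generation.

    pop: Counter mapping a haplotype (tuple of integer repeat counts) to the
         number of individuals carrying it.
    mu:  per-locus probability of a single-step (+1 or -1) mutation.
    rng: a random.Random instance.
    """
    n = sum(pop.values())
    haplotypes = list(pop)
    weights = [pop[h] for h in haplotypes]
    # Resampling step: every child draws its parent at random, so a haplotype
    # is chosen with probability proportional to its current count.
    children = Counter(rng.choices(haplotypes, weights=weights, k=n))
    new_pop = Counter()
    for hap, count in children.items():
        # Mutation is applied child by child here for clarity; a faster
        # version would draw the number of mutants per haplotype instead.
        for _ in range(count):
            mutated = tuple(
                allele + rng.choice((-1, 1)) if rng.random() < mu else allele
                for allele in hap
            )
            new_pop[mutated] += 1
    return new_pop


rng = random.Random(2012)
population = Counter({(14, 30, 11): 1000})   # everyone starts identical
for _ in range(50):
    population = next_generation(population, mu=0.003, rng=rng)
```

After the loop, `population` holds one entry per distinct haplotype, which is usually far fewer entries than there are individuals.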

A phylogeny of birds based on over 1,500 loci collected by target enrichment and high-throughput sequencing

John E. McCormack, Michael G. Harvey, Brant C. Faircloth, Nicholas G. Crawford, Travis C. Glenn, Robb T. Brumfield
(Submitted on 4 Oct 2012)

Evolutionary relationships among birds in Neoaves, a clade including the vast majority of avian diversity, have vexed systematists due to the ancient, rapid radiation of numerous lineages. We applied a new phylogenomic approach to resolve relationships in Neoaves using target enrichment (sequence capture) and high-throughput sequencing of ultraconserved elements (UCEs) in avian genomes. We collected sequence data from UCE loci for 32 members of Neoaves and one outgroup (chicken) and analyzed data sets that differed in amount of missing data. An alignment of 1,541 loci that allowed missing data was 87% complete and resulted in a highly resolved phylogeny with broad agreement between the Bayesian and maximum-likelihood (ML) trees. Although the 100% complete matrix of 416 UCE loci was broadly similar, the Bayesian and ML trees differed to a greater extent in this analysis, suggesting that increasing from 416 to 1,541 loci led to increased stability and resolution of the tree. Novel results of our study include surprisingly close relationships between phenotypically divergent bird families, such as tropicbirds (Phaethontidae) and the sunbittern (Eurypygidae) as well as a sister relationship between bustards (Otididae) and turacos (Musophagidae). This phylogeny bolsters support for monophyletic waterbird and landbird clades and also strongly supports controversial relationships from previous studies, including the sister relationship between passerines and parrots and the non-monophyly of raptorial birds in the hawk and falcon families. Although significant challenges remain to fully resolving some of the deep relationships in Neoaves, especially among lineages outside the waterbirds and landbirds, this study suggests that increased data will yield an increasingly resolved avian phylogeny.

Our paper: The geography of recent genetic ancestry across Europe

This guest post is by Peter Ralph and Graham Coop (@graham_coop) on their paper, The geography of recent genetic ancestry across Europe, arXived here.

In this paper we look at the genetic traces of very recent common ancestry between pairs of individuals from across Europe. We’ll likely write a few more accessible posts on this work when the paper is closer to publication, but for now (in the spirit of Haldane’s sieve) we offer a somewhat more technical post [the full details are in the paper].

We started this project wanting to estimate recent migration rates across continents like Europe: if we could learn how far away distant cousins are from each other, then, all else equal, we could estimate typical migration distances. This isn’t where we ended up (that’s another project we are working on), but the basic idea of looking at the geographic distribution of close relatives led us to some interesting places.

As most populations lack amazing pedigrees like the one worked on by Decode in Iceland [e.g. here], we can’t actually know the true relationships between the samples (other than a few obvious siblings and full cousins). However, long segments of chromosome shared (almost) identically by descent (IBD) between two people have probably been inherited from a recent common ancestor. The length of these IBD segments tells us something about how long ago the ancestor lived, since the older the ancestor, the more opportunity for recombination to whittle down the segment.
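
To give a rough sense of scale, here is a back-of-the-envelope sketch. It rests on a standard simplification that is assumed here rather than taken from the paper: an ancestor g generations back is separated from the two people by 2g meioses in total, and a block inherited along that path has a length that is approximately exponentially distributed with mean 100/(2g) cM. Chromosome ends, interference, and detection power are all ignored, and the function names are hypothetical.

```python
import math


def mean_block_length_cm(generations):
    """Approximate mean IBD block length (cM) for a common ancestor
    living `generations` ago, under the exponential approximation."""
    meioses = 2 * generations          # one lineage down to each person
    return 100.0 / meioses


def prob_block_longer_than(length_cm, generations):
    """Approximate probability that a block inherited from an ancestor
    `generations` ago is at least `length_cm` long."""
    meioses = 2 * generations
    return math.exp(-meioses * length_cm / 100.0)


for g in (5, 25, 100):
    print(g, mean_block_length_cm(g), prob_block_longer_than(3.0, g))
```

Under these assumptions the mean block length shrinks in proportion to 1/g, which is why long blocks point to recent ancestors.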

This has been worked on by a bunch of different groups, but the historical inference has usually been applied to small or relatively isolated populations. To really push the boundaries of these approaches we used the European subset of the POPRES dataset, which consists of thousands of human individuals. This is currently one of the best genome-scale, geographically indexed datasets, and represents a huge outbred population where we’d expect patterns of variation to be (at least partly) due to continuous migration, rather than, say, recent mixing of diverged populations or bottlenecks. So, we ran BEAGLE on the dataset to find IBD segments, and got lots of wonderful signal — it turned out that most pairs of people in the European sample (around 75%) shared IBD segments that were megabases long (i.e. longer than 1 centi-Morgan, cM). After a bunch of power and false-positive simulations, we were convinced that most of those blocks of IBD had been inherited from single common ancestors.

You could think of our results in two pieces: first, doing descriptive statistics on the distribution of IBD abundances and lengths across geography; and second, doing some inference on this distribution to see what we can learn about when those common ancestors lived.

As we hoped, there was a nice relationship to geography – people nearer to each other typically shared more and longer IBD than people farther away, in a nice monotonic relationship. This convinced us that continuous, local migration had played an important role in shaping current patterns of relatedness across Europe. Geographic distance was definitely not the only factor – superimposed on top of this was distinctive regional variation. For example, one of the strongest signals we saw was that there are higher levels of IBD sharing in Eastern Europe. As you’ll see from the paper, after further work, we think this is a potential signal of the Slavic or Hunnic expansions.

There were also some surprises to us along the way (like people in the UK sharing more IBD with Irish than with other people in the UK) that turned out to make sense after thinking about rapidly growing populations with directional migration (although there are other explanations). We correlated the patterns we saw with historical events, but (as with most genomic studies of human history) there was a lot of uncertainty. Sure, the patterns we see are consistent with the story we told, but there could potentially be a lot of other explanations, especially given the complicated and often unknown demographic history of European populations. What if all the IBD we saw came from the Neolithic expansion rather than the last two or three thousand years? This turns out to be a bigger worry than you might think: it’s fairly unlikely that two people inherited a 3cM block from a single common ancestor from 6000 years ago, but if they have enough common ancestors from back then (e.g. a strong enough bottleneck), it turns out to be reasonably likely.
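
The arithmetic behind that worry is simple to sketch. The example below assumes, purely for illustration, that each shared ancestor independently contributes a long block with the same small probability; the value of `p_single` is hypothetical and is not an estimate from the paper.

```python
def prob_at_least_one_block(p_single, n_ancestors):
    """Probability that at least one of `n_ancestors` shared ancestors
    contributes a long block, if each does so independently with
    probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** n_ancestors


p_single = 1e-4   # hypothetical per-ancestor probability, for illustration
for n in (1, 1_000, 10_000, 100_000):
    print(n, prob_at_least_one_block(p_single, n))
```

A tiny per-ancestor chance stops being negligible once the number of shared ancestors approaches 1/p_single.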

So, we did some coalescent theory to work out the relationship between numbers of shared ancestors back through time (closely related to coalescent rate) and the observed distribution of IBD block lengths. We could then invert this relationship to estimate from the observed distribution of IBD blocks the mean number of ancestors that pairs of people from different parts of Europe share with each other, as a function of time. Unfortunately, this turns out to have a lot of unavoidable uncertainty: the inversion problem is “ill-conditioned” (in other terms, the likelihood surface is ridged), meaning that there were a lot of different histories that gave nearly the same IBD length distribution.
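
A toy calculation shows why such an inversion is hard. It is not the method used in the paper. It assumes the simplified model in which a block from an ancestor g generations ago has an exponentially distributed length with mean 100/(2g) cM, and it compares two made-up histories: one with all shared ancestors 50 generations ago, and one split evenly between 40 and 60 generations ago.

```python
import math


def tail_fraction(length_cm, history):
    """Fraction of blocks at least `length_cm` long.

    history: dict mapping generations-ago -> share of blocks contributed
             by ancestors of that age (shares sum to 1). Toy model only.
    """
    return sum(
        share * math.exp(-2 * g * length_cm / 100.0)
        for g, share in history.items()
    )


single_pulse = {50: 1.0}
two_pulses = {40: 0.5, 60: 0.5}

# Largest difference between the two tail curves over 1..10 cM.
max_gap = max(
    abs(tail_fraction(x, single_pulse) - tail_fraction(x, two_pulses))
    for x in range(1, 11)
)
print(max_gap)
```

In this toy setting the two length distributions differ by only about 0.01 at most, so telling the histories apart would take a great deal of data even before any measurement error is considered.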

Despite this, we could still rigorously learn a lot of good information. In particular, nearly all the IBD blocks we found did actually come from ancestors living during the last 3,000 years. Although we could only tie down the ages of the common ancestors to within a few hundred years, the major patterns can likely be tied to known historical events. There is quite a bit of uncertainty about the specific interpretations (it is still not straightforward to go from pairwise numbers of shared ancestors 1,500 years ago to conclusions about demographic events at the time), but used in conjunction with other sources of information, this approach has the promise to conclusively resolve some longstanding debates about recent history.

Finally, two addendums (addenda?) about the methods: The first is that we took an empirical approach to estimating the relationship between coalescent time distribution and observed IBD block length, by simulating a bunch (actually copying over blocks and re-running BEAGLE). We did this because BEAGLE is effectively a black box, for our purposes. This sort of approach is more common in experimental physics, where the empirical properties of detectors have to be worked out (and the problem of inferring the signal is known as “data unfolding”).

Second, we should emphasize that the uncertainty we came across in inferring dates is theoretically unavoidable, using IBD block length data. We think this is a common issue for many sorts of population genetics data — situations in which, even though we have a ton of data, getting specific, tightly constrained inferences requires making fairly strong assumptions (or equivalently, working in a specific set of parametric models). This has been highlighted in some cases [like this one], but more work is needed on this to ensure that we represent the inherent uncertainties in population genetics inferences correctly.

We’d love feedback from the popgen community about what aspects of the paper they’d like to see clarified/improved [obviously it is a pretty involved paper already, so concentrate on specific suggestions]. We have a tonne more ideas of how to improve this inference technology and extend it to other applications. But we’d love to hear your thoughts too.

Best Practices for Scientific Computing

Outside our usual remit, but likely of interest to many of our readers. See here for online peer review.

D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian Mitchell, Mark Plumbley, Ben Waugh, Ethan P. White, Greg Wilson, Paul Wilson
(Submitted on 1 Oct 2012)
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Horizontal gene transfer may explain variation in θs

Rohan Maddamsetti, Philip J. Hatcher, Stéphane Cruveiller, Claudine Médigue, Jeffrey E. Barrick, Richard E. Lenski
(Submitted on 28 Sep 2012)

Martincorena et al. estimated synonymous diversity (θs = 2Nμ) across 2,930 orthologous gene alignments from 34 Escherichia coli genomes, and found substantial variation among genes in the density of synonymous polymorphisms. They argue that this pattern reflects variation in the mutation rate per nucleotide (μ) among genes. However, the effective population size (N) is not necessarily constant across the genome. In particular, different genes may have different histories of horizontal gene transfer (HGT), whereas Martincorena et al. used a model with random recombination to calculate θs. They did filter alignments in an effort to minimize the effects of HGT, but we doubt that any procedure can completely eliminate HGT among closely related genomes, such as E. coli living in the complex gut community.
Here we show that there is no significant variation among genes in rates of synonymous substitutions in a long-term evolution experiment with E. coli and that the per-gene rates are not correlated with θs estimates from genome comparisons. However, there is a significant association between θs and HGT events. Together, these findings imply that θs variation reflects different histories of HGT, not local optimization of mutation rates to reduce the risk of deleterious mutations as proposed by Martincorena et al.

Our paper: Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution

This guest post is by Daniël Melters [@DPMelters] and Keith Bradnam [@kbradnam] on their paper [along with co-authors]: Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. ArXived here.

The centromere poses an interesting paradox; although its function is essential, its molecular components are fast evolving. Centromeres in many animal and plant genomes have been characterized by the presence of large tandem repeat arrays. Numerous studies have suggested that the composition and length of the repeat units that comprise these arrays vary between species.
In this paper we tried to answer three main questions:
1) Can we identify the candidate centromere repeat sequences in genomes from hundreds of different species?
2) Do candidate centromere repeat sequences from different species share any common properties (sequence composition, length, GC% etc)?
3) How do these tandem repeats evolve?
To answer these questions, we took advantage of the large number of species with publicly available whole genome shotgun sequence data from various sequencing platforms. In total we analyzed 282 animal and plant genomes for the presence of high copy tandem repeat sequences, with the assumption that the most abundant tandem repeat is a good candidate for the centromere repeat.

We found high copy tandem repeats in the vast majority of the 282 genomes that we analyzed. For the smaller number of species with published cytology data, we correctly identified the published repeat sequence in 38 out of 43 cases. This supports our assumption that the most abundant tandem repeat in any genome is likely to be the centromere repeat. In the five cases where we did not find the published centromere tandem repeats, we did not have data from sequencing platforms that would have allowed us to identify these repeats.

If an individual sequencing read contains at least four tandem repeats, then there is the possibility of detecting higher order repeat (HOR) structure, i.e. a tandem array made up of two alternating types of related sequence (A and B) that produce an A->B->A->B structure. In these cases, the AB dimer is more similar to other AB dimers than A is to B. We found that HOR structure was surprisingly common in the candidate centromere repeats of many different species. The very long reads from Pacific Biosciences (PacBio) sequencing allowed us to further characterize repeat structure in great detail (for a few selected species), and this revealed additional levels of HOR structure.
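
The A->B->A->B idea can be sketched in a few lines. This is an illustration of the concept only, not the pipeline used in the paper: it assumes the monomers from one read have already been extracted and have equal length, it scores similarity by simple position-wise identity, and all function names are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy comparison needs equal-length monomers")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_identity_at_offset(monomers, offset):
    """Average identity between monomers that are `offset` units apart."""
    pairs = [
        (monomers[i], monomers[i + offset])
        for i in range(len(monomers) - offset)
    ]
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


def looks_like_dimer_hor(monomers):
    """True if monomers two units apart are more alike than neighbours,
    the signature of an alternating A->B->A->B array."""
    if len(monomers) < 4:              # need at least four repeats in a read
        return False
    return mean_identity_at_offset(monomers, 2) > mean_identity_at_offset(monomers, 1)


unit_a = "ACGTACGTAC"
unit_b = "ACGAACCTAC"   # differs from unit_a at two positions
print(looks_like_dimer_hor([unit_a, unit_b, unit_a, unit_b]))
print(looks_like_dimer_hor([unit_a, unit_a, unit_a, unit_a]))
```

Real reads would need an alignment-based similarity measure rather than position-wise identity, since monomers often differ by indels.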

To address the important question of ‘how similar are centromere repeats in different species?’, we performed an all-vs-all comparison between the most abundant tandem repeat in every species. Surprisingly, we found only 26 groups of species that shared any significant sequence similarity in their candidate centromere repeat sequence. The species that make up these 26 groups were always closely related species which had diverged less than 50 million years ago. When comparing the repeat sequences in these groups of closely related species, we found that repeats evolve not only by accumulation of mutations, but also by the spread of indels or by repeat doubling.

These results are in line with the ‘library’ hypothesis, which aims to describe how ratios of repeat variants can change over time. In addition, PacBio sequencing found very long tandem repeats (~1,500 bp). Furthermore, in switchgrass (Panicum virgatum) we identified several centromere repeat variants, but PacBio sequences did not show any mixing of these repeat variants. In summary, tandem repeats are frequently associated with the centromere function and most probably evolve according to the “library” hypothesis (a.k.a. molecular drive).

This paper is dedicated to the late Simon Chan, who passed away on the 22nd of August 2012 at the young age of 38 (see here for more information).

Daniël Melters and Keith Bradnam
PS. Supplementary table can be provided upon email request.