Modeling the Clonal Evolution of Cancer from Next Generation Sequencing Data

Wei Jiao, Shankar Vembu, Amit G. Deshwar, Lincoln Stein, Quaid Morris
(Submitted on 11 Oct 2012)

We consider the problem of inferring the clonal evolutionary structure of cancer cells from high-throughput next generation sequencing data. We address this problem using statistical machine learning to infer a relational clustering of objects, where the clusters are connected in the form of a rooted tree. We present a hierarchical Bayesian mixture model that uses a non-parametric prior over trees to automatically estimate the number of clones (clusters) and their clonal frequencies (cluster means) in the population, and to identify the phylogenetic relationship between these subclones. Experiments on three real data sets comprising 12 tumor samples from triple-negative breast cancer, acute myeloid leukemia and chronic lymphocytic leukemia patients demonstrate the efficacy of our method.

A mixed model approach for joint genetic analysis of alternatively spliced transcript isoforms using RNA-Seq data

Barbara Rakitsch, Christoph Lippert, Hande Topa, Karsten Borgwardt, Antti Honkela, Oliver Stegle
(Submitted on 10 Oct 2012)

RNA-Seq technology allows for studying the transcriptional state of the cell at an unprecedented level of detail. Beyond quantification of whole-gene expression, it is now possible to disentangle the abundance of individual alternatively spliced transcript isoforms of a gene. A central question is to understand the regulatory processes that lead to differences in relative abundance variation due to external and genetic factors. Here, we present a mixed model approach that allows for (i) joint analysis and genetic mapping of multiple transcript isoforms and (ii) mapping of isoform-specific effects. Central to our approach is to comprehensively model the causes of variation and correlation between transcript isoforms, including the genomic background and technical quantification uncertainty. As a result, our method allows accurate testing for shared as well as transcript-specific genetic regulation of transcript isoforms and achieves substantially improved calibration of these statistical tests. Experiments on genotype and RNA-Seq data from 126 human HapMap individuals demonstrate that our model can help to obtain a more fine-grained picture of the genetic basis of gene expression variation.

Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance

Ruchi Chaudhary, J. Gordon Burleigh, David Fernández-Baca
(Submitted on 9 Oct 2012)

We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa.
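For readers unfamiliar with the Robinson-Foulds distance, the following is a minimal sketch of the ordinary version for two rooted, singly-labeled trees on the same leaf set. It is not the MulRF generalization described in the abstract. The nested-tuple tree encoding and the function names `clades` and `rf_distance` are assumptions made for illustration only.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf labels.

    A tree is either a leaf label (str) or a tuple of subtrees. This
    nested-tuple encoding is an assumption of this sketch, not a format
    taken from the paper.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    # The root clade is shared by any two trees on the same leaves, so drop it.
    found.discard(all_leaves)
    return found


def rf_distance(tree_a, tree_b):
    """Rooted Robinson-Foulds distance: number of clades found in exactly one tree."""
    return len(clades(tree_a) ^ clades(tree_b))


t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "D")
print(rf_distance(t1, t2))  # {A,B} and {A,C} each occur in only one tree
```

When several leaves can carry the same label, as in a mul-tree, a clade is no longer identified by its set of labels, which is one intuition for why the mul-tree case needs the separate treatment the authors give it.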

A phylogenomic perspective on the radiation of ray-finned fishes based upon targeted sequencing of ultraconserved elements

Michael E. Alfaro, Brant C. Faircloth, Laurie Sorenson, Francesco Santini
(Submitted on 29 Sep 2012)

Ray-finned fishes constitute the dominant radiation of vertebrates with over 30,000 species. Although molecular phylogenetics has begun to disentangle major evolutionary relationships within this vast section of the Tree of Life, there is no widely available approach for efficiently collecting phylogenomic data within fishes, leaving much of the enormous potential of massively parallel sequencing technologies for resolving major radiations in ray-finned fishes unrealized. Here, we provide a genomic perspective on longstanding questions regarding the diversification of major groups of ray-finned fishes through targeted enrichment of ultraconserved nuclear DNA elements (UCEs) and their flanking sequence. Our workflow efficiently and economically generates data sets that are orders of magnitude larger than those produced by traditional approaches and is well-suited to working with museum specimens. Analysis of the UCE data set recovers a well-supported phylogeny at both shallow and deep time-scales that supports a monophyletic relationship between Amia and Lepisosteus (Holostei) and reveals elopomorphs and then osteoglossomorphs to be the earliest diverging teleost lineages. Divergence time estimation based upon 14 fossil calibrations reveals that crown teleosts appeared ~270 Ma at the end of the Permian and that elopomorphs, osteoglossomorphs, ostarioclupeomorphs, and euteleosts diverged from one another by 205 Ma during the Triassic. Our approach additionally reveals that sequence capture of UCE regions and their flanking sequence offers enormous potential for resolving phylogenetic relationships within ray-finned fishes.

Thoughts on: Finding the sources of missing heritability in a yeast cross

[This commentary post is by Joe Pickrell on Finding the sources of missing heritability in a yeast cross by Bloom et al., available from the arXiv here]

For decades, human geneticists have used twin studies and related methods to show that a considerable amount of the phenotypic variation in humans is driven by genetic variation (without any knowledge of the underlying loci). More recently, genome-wide association studies have made incredible progress in identifying specific genetic variants that are important in various traits. However, the loci identified in the latter studies often have small effects, and the sums of their effects rarely come close to the genetic effects known from the former. The difference between the genetic effects on a trait known from heritability studies and the effects estimated from individual loci has come to be known as the “missing heritability”, and much ink has been spilled on speculation as to its cause.

Bloom et al. take an elegant and straightforward approach to this question using a model system, the budding yeast Saccharomyces cerevisiae. The insight is that in order to make progress, one needs an experimental design that isolates some of the possible causes of the “missing heritability”. To achieve this, Bloom et al. use a cross between two different yeast strains and grow the segregants in identical conditions. They thus remove much of the environmental variation in the phenotypes, while also removing the effect of allele frequency (since all alleles are at frequency 0.5 in the cross). While this means that they cannot address some controversies about potential sources of heritability in humans (for example, rare versus common variants), they are able to estimate how much phenotypic variation is due to detectable additive effects. The authors additionally develop high-throughput assays to measure phenotypes (in their case, growth rates in 46 different conditions) and genotypes in the segregants from the cross, so that they can perform high-powered mapping studies.

The main result relevant to the issue of “missing heritability” is presented in their Figure 3a (reproduced above). After performing a well-powered mapping study, the authors compare the effects from their identified loci to the narrow-sense heritability of each trait. As it turns out, the heritability is not missing! In this cross, the identified loci, though of small effect, add up to a substantial fraction of the overall (narrow-sense) heritability for most traits. The authors additionally identify some gene-gene interactions that contribute to the broad-sense heritability (but by definition not the narrow-sense heritability) of many traits.

The authors provocatively interpret their results as supporting a model in which the majority of the “missing heritability” lies in a large number of variants with small effect sizes (in line with the model proposed most notably by Peter Visscher and colleagues, though the authors here make no claims about the allele frequencies of the relevant variants). While this seems to be true in yeast, it remains to be seen in humans. It’s of course easy to come up with reasons why this might not hold in humans–our species is a special snowflake and so forth–but this paper should be in the back of the mind of anyone who is thinking about this problem.

Most viewed on Haldane’s Sieve: August-September 2012

Haldane’s Sieve has now been operating for a little over a month, and we’ve enjoyed reading the steadily growing stream of interesting manuscripts posted to preprint servers. In this post, we revisit the most viewed preprints since the site began. Perhaps unsurprisingly, these are all papers that the authors described to the community with “Our paper” posts:

Our paper: An age-of-allele test of neutrality for transposable element insertions not at equilibrium

[This author post is by Justin Blumenstiel and Casey Bergman on An age-of-allele test of neutrality for transposable element insertions not at equilibrium, available from the arXiv here]

Studies over the past several decades in Drosophila melanogaster have demonstrated that TE insertion alleles in natural populations tend to segregate at low frequency, particularly in regions of the genome that have a high recombination rate where natural selection is most effective. These results have largely supported a model where natural selection acts to remove deleterious TE insertions from the genome. The prevailing model of why TE insertions are deleterious is that they lead to chromosomal aberrations that occur when dispersed, non-allelic repeated sequences cross over with one another. This model is known as the ectopic recombination model and it has an important feature. Since each new insertion has the potential to recombine with all the other copies in the genome, fitness will go down faster and faster with each new copy. This yields a stable equilibrium in TE copy number.
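The "faster and faster" point can be made concrete with a small counting sketch. It assumes, purely for illustration, that the fitness cost scales with the number of distinct pairs of copies that could recombine with each other. The function name `ectopic_pairs` is hypothetical and this is not the model from the paper.

```python
def ectopic_pairs(copies):
    """Number of distinct pairs among `copies` TE insertions (n choose 2)."""
    return copies * (copies - 1) // 2


# Each new copy can pair with every copy already present, so the number of
# pairs it adds grows with copy number and the total grows quadratically.
for n in range(1, 6):
    added = ectopic_pairs(n) - ectopic_pairs(n - 1)
    print(f"copies={n} total_pairs={ectopic_pairs(n)} pairs_added_by_newest={added}")
```

Under this simplifying assumption the n-th copy adds n - 1 new pairs, so the marginal cost of an insertion rises with copy number, which is what allows selection to balance transposition at a stable equilibrium.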

But, are TEs at equilibrium in natural populations? Genome sequencing studies have shown that the rate of TE proliferation can vary widely over time and any given TE family may demonstrate non-equilibrium “boom and bust” behavior. How do we reconcile studies that assume equilibrium with the fact that we know TE dynamics are not at equilibrium? To deal with this problem, I began developing this model out of a class project with John Wakeley while I was a graduate student over a decade ago. This model arose out of some work I published in 2002 with Hartl and Lozovsky on the age structure of non-LTR elements in D. melanogaster. I wrote this model up for my Ph.D. thesis and presented a preliminary version in a paper with Neafsey and Hartl in 2004, but it sat on the back burner until I reviewed a paper by Bergman and Bensasson in 2007 that showed many TE families in D. melanogaster have recently inserted in the genome and may not be at equilibrium.

Shortly after their paper came out I contacted Casey with the model from my thesis and we decided to push this idea forward as a collaboration, which has taken a few years to come to fruition (both being busy with other projects and starting our labs). Things started to really move ahead when Miaomiao He in Casey’s lab generated a crucial data set that could be specifically applied to the model: strain-specific presence/absence data for a very large number of TE insertions ascertained from the D. melanogaster genome sequence. After a few more years with it on simmer, working out several kinks in the meantime (e.g. incorporating host demography, trying many different methods for estimating the posterior distribution of TE ages), Casey and I finally wrapped it up just as Haldane’s Sieve is starting to hit its stride. I expect that all my papers in the future will be pre-released on arXiv.

I could speak at length on the specific results, but I would just be saying what is already in the abstract. So, I would like to bring up three points for potential conversation.

First, what does it mean for TEs to be at transposition-selection balance when we know different TE families show a signature of “boom and bust” in genome sequences? There may be one way to reconcile this apparent problem. Any particular TE family may in fact not be at transposition-selection balance. For example, the P element, which invaded Drosophila melanogaster only a few decades ago, is hardly at transposition-selection balance. Therefore, one must be careful in using insertion frequencies for P elements to describe general TE dynamics. However, by integrating over all TE families in the genome, one may in fact reach an approximation that might be reasonable for assuming equilibrium transposition-selection balance. But one must be careful of something I call “family ascertainment bias”. Sometimes the most recently activated TEs are the ones easiest to discover and annotate because these ones are easily cloned from insertion mutations or are most frequent in genome sequences.

Second, in this paper, we derive the probability distribution for each individual TE insertion frequency based on its age. We demonstrate that this provides a method for identifying TE insertions that are either positively or negatively selected. In the case where we show allele frequencies are less than expected (i.e. predicted to be negatively selected), many of these are copies that have zero substitutions. In principle, all of these could have inserted one generation before the reference strain was collected for genome sequencing. The inference that selection is acting against these TEs implicitly assumes either: 1) this wasn’t the case for many of these insertions, and the posterior distribution of ages is a good representation of the true age distribution, or 2) it may have been the case, but natural selection has already acted to remove slightly older TEs from the population, therefore making them absent from the genome sequence.

Third, when putting the finishing touches on our analysis of TE insertion data in North America, we ran up against the issue that nobody has yet published an explicit demographic scenario for North American populations of D. melanogaster, similar to those that have been developed by Wolfgang Stephan’s Lab and others for European and African populations. We found one paper by Yukilevich et al (2010) from John True’s Lab that generated similar findings to the demography of European populations, which is consistent with the idea that North American populations of D. melanogaster are mainly derived from European ancestors. However, Yukilevich et al (2010) didn’t explicitly model the admixture with African populations, which is known to occur in North American populations as shown by Caracristi and Schlötterer in 2003. We were surprised that an explicit admixture scenario has not been published yet, especially since this is crucial for interpreting the data from population genomic projects like the Drosophila Genetic Reference Panel. This should be an important line of work for someone to pursue (if it isn’t being done already) and if anyone has information about a demographic model for North American populations of D. melanogaster, we’d be keen to know more so we can see if it might improve our analysis.

Justin and Casey

Genomic tests of variation in inbreeding among individuals and among chromosomes

Joshua G. Schraiber, Stephannie Shih, Montgomery Slatkin
(Submitted on 26 Sep 2012)

We examine the distribution of heterozygous sites in nine European and nine Yoruban individuals whose genomic sequences were made publicly available by Complete Genomics. We show that it is possible to obtain detailed information about inbreeding when a relatively small set of whole-genome sequences is available. Rather than focus on testing for deviations from Hardy-Weinberg genotype frequencies at each site, we analyze the entire distribution of heterozygotes conditioned on the number of copies of the derived (non-chimpanzee) allele. Using Levene’s exact test, we reject Hardy-Weinberg in both populations. We generalized Levene’s distribution to obtain the exact distribution of the number of heterozygous individuals given that every individual has the same inbreeding coefficient, F. We estimated F to be 0.0026 in Europeans and 0.0005 in Yorubans, but we could also reject the hypothesis that F was the same in each individual. We used a composite likelihood method to estimate F in each individual and within each chromosome. Variation in F across chromosomes within individuals was too large to be consistent with sampling effects alone. Furthermore, estimates of F for each chromosome in different populations were not correlated. Our results show how detailed comparisons of population genomic data can be made to theoretical predictions. The application of methods to the Complete Genomics data set shows that the extent of apparent inbreeding varies across chromosomes and across individuals, and estimates of inbreeding coefficients are subject to unexpected levels of variation which might be partly accounted for by selection.
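To make the role of F concrete: under inbreeding, the expected heterozygote frequency at a biallelic site is 2p(1 - p)(1 - F), so a textbook moment estimator is F = 1 - H_obs / H_exp. The sketch below applies that relationship to genotype counts at a single site. It is not the exact or composite-likelihood machinery described in the abstract, and the function name and arguments are hypothetical.

```python
def inbreeding_coefficient(n_hom_ref, n_het, n_hom_alt):
    """Moment estimate of F from genotype counts at one biallelic site.

    Uses F = 1 - H_obs / H_exp with H_exp = 2p(1 - p). This is an
    illustrative textbook estimator, not the authors' method.
    """
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_hom_alt + n_het) / (2 * n)  # frequency of the alternate allele
    expected_het = 2 * p * (1 - p)
    if expected_het == 0:
        raise ValueError("site is monomorphic; F is undefined")
    observed_het = n_het / n
    return 1 - observed_het / expected_het


# Hardy-Weinberg proportions give F = 0; a deficit of heterozygotes gives F > 0.
print(inbreeding_coefficient(25, 50, 25))
print(inbreeding_coefficient(30, 40, 30))
```

A single site gives a very noisy estimate. The values reported above (0.0026 and 0.0005) are small enough that many sites have to be combined to detect them, which is presumably why the authors work with the whole distribution of heterozygotes rather than site-by-site tests.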

Protein function influences frequency of encoded regions containing VNTRs and number of unique interactions

Suzanne Bowen
(Submitted on 25 Sep 2012)

Proteins encoded by genes containing regions of variable number tandem repeats (VNTRs) are known to be polymorphic within species but the influence of their instability on molecular interactions remains unclear. VNTRs are overrepresented in the coding sequence of particular functional groups where their presence could influence protein interactions. Using human consensus coding sequence, this work examines whether genomic instability, determined by regions of VNTRs, influences the number of protein interactions. Findings reveal that, in relation to protein function, the frequency of unique interactions in human proteins increases with the number of repeated regions. This supports experimental evidence that repeat expansion may lead to an increase in molecular interactions. Genetic diversity, estimated by Ka/Ks, appeared to decrease as the number of protein-protein interactions increased. Additionally, G+C and CpG content were negatively correlated with increasing occurrence of VNTRs. This may indicate that nucleotide composition along with selective processes can increase genomic stability and thereby restrict the expansion of repeated regions. Proteins involved in acetylation are associated with a high number of repeated regions and interactions but a low G+C and CpG content. In contrast, less interactive membrane proteins contain a lower number of repeated regions but higher levels of G+C and CpG. This work provides further evidence that VNTRs may provide the genetic variability to generate unique interactions between proteins.

Non-stationary patterns of isolation-by-distance: inferring measures of genetic friction

Nicolas Duforet-Frebourg, Michael G. B. Blum
(Submitted on 24 Sep 2012)

The pattern of isolation-by-distance arises when population differentiation increases with increasing geographic distances. This pattern is usually caused by local spatial dispersal which explains why differences of allele frequencies between populations accumulate with distance. However, the pattern of isolation-by-distance can mask complex variations of demographic parameters. Spatial variations of demographic parameters such as migration rate or population density generate non-stationary patterns of isolation-by-distance where the rate at which genetic differentiation accumulates varies across space. Barriers to gene flow are particularly well studied examples that generate non-stationary patterns of isolation-by-distance. Using the concept of genetic friction, we develop a statistical method that characterizes non-stationary patterns of isolation-by-distance. Genetic friction at a sampled site corresponds to the local genetic differentiation between the sampled population and fictive populations living in the neighborhood of the sampling site. To avoid defining populations in advance, the method can also be applied at the scale of individuals. The proposed framework is appropriate for dealing with massive data because it relies on a pairwise similarity matrix, which can be obtained with computationally efficient methods. A simulation study shows that maps of genetic friction can detect barriers to gene flow but also other patterns such as continuous variations of gene flow across habitat. The potential of the method is illustrated with 2 data sets: genome-wide SNP data for the human Swedish populations, and AFLP markers for alpine plant species. The software FRICTION implementing the method is available at this http URL