Multi Loci Phylogenetic Analysis with Gene Tree Clustering

Ruriko Yoshida, Kenji Fukumizu
(Submitted on 26 Jun 2015)

Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display precisely matched topologies. The main reason for such phylogenetic incongruence is the reticulated evolutionary history of most species, due to meiotic sexual recombination in eukaryotes or horizontal transfers of genetic material in prokaryotes. Nevertheless, most genes should display topologically related phylogenies, and should group into one or more (for genetic hybrids) clusters in the “tree space.” In this paper we propose to apply the normalized-cut (Ncut) clustering algorithm to the set of gene trees with the geodesic distance between trees over the Billera-Holmes-Vogtmann (BHV) tree space. We first show with simulated data sets that the Ncut algorithm accurately clusters the set of gene trees given a species tree under the coalescent process, and show that the Ncut algorithm works better on the gene trees reconstructed via the neighbor-joining method than on those reconstructed via the maximum likelihood estimator under the evolutionary models. Moreover, we apply the methods to a genome-wide data set (1290 genes encoding 690,838 amino acid residues) on coelacanths, lungfishes, and tetrapods. The result suggests that there are two clusters in the data set. Finally we reconstruct the consensus trees from these two clusters; the consensus tree constructed from one cluster has the tree topology in which coelacanths are most closely related to the tetrapods, and the consensus tree from the other includes an irresolvable trichotomy over the coelacanth, lungfish, and tetrapod lineages, suggesting divergence within a very short time interval.
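
To make the clustering criterion concrete, here is a minimal sketch of the normalized-cut objective applied to a matrix of pairwise tree distances. It is not the authors' implementation: the distances below are invented toy values rather than BHV geodesics, the Gaussian-kernel affinity and its `sigma` are assumptions, and the exhaustive search stands in for the spectral relaxation that practical Ncut implementations use.

```python
import math
from itertools import combinations

def affinity(dist, sigma=1.0):
    """Turn a symmetric distance matrix into Gaussian-kernel similarities
    (the kernel and sigma are illustrative assumptions)."""
    n = len(dist)
    return [[math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2)) for j in range(n)]
            for i in range(n)]

def ncut_value(w, part_a):
    """Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)."""
    n = len(w)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(w[i][j] for i in a for j in b)
    assoc_a = sum(w[i][j] for i in a for j in range(n))
    assoc_b = sum(w[i][j] for i in b for j in range(n))
    return cut / assoc_a + cut / assoc_b

def best_bipartition(w):
    """Exhaustively score every two-way split; feasible only for tiny inputs."""
    n = len(w)
    best = None
    for k in range(1, n // 2 + 1):
        for part in combinations(range(n), k):
            score = ncut_value(w, part)
            if best is None or score < best[0]:
                best = (score, set(part))
    return best

# Toy distances between five hypothetical gene trees:
# trees 0-2 lie close together, trees 3-4 form a second group.
toy_dist = [
    [0.0, 0.5, 0.6, 3.0, 3.2],
    [0.5, 0.0, 0.4, 3.1, 2.9],
    [0.6, 0.4, 0.0, 2.8, 3.0],
    [3.0, 3.1, 2.8, 0.0, 0.3],
    [3.2, 2.9, 3.0, 0.3, 0.0],
]
score, cluster = best_bipartition(affinity(toy_dist))
print(sorted(cluster), score)
```

On this toy input the lowest-Ncut split separates trees 3 and 4 from the rest, mirroring how two clusters of gene trees would be recovered from real distances.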

Mitochondrial DNA Copy Number Variation Across Human Cancers

Ed Reznik, Martin Miller, Yasin Senbabaoglu, Nadeem Riaz, William Lee, Chris Sander
doi: http://dx.doi.org/10.1101/021535

In cancer, mitochondrial dysfunction, through mutations, deletions, and changes in copy number of mitochondrial DNA (mtDNA), contributes to the malignant transformation and progression of tumors. Here, we report the first large-scale survey of mtDNA copy number variation across 21 distinct solid tumor types, examining over 13,000 tissue samples profiled with next-generation sequencing methods. We find a tendency for cancers, especially of the bladder and kidney, to be significantly depleted of mtDNA relative to matched normal tissue. We show that mtDNA copy number is correlated with the expression of mitochondrially-localized metabolic pathways, suggesting that mtDNA copy number variation reflects gross changes in mitochondrial metabolic activity. Finally, we identify a subset of tumor-type-specific somatic alterations, including IDH1 and NF1 mutations in gliomas, whose incidence is strongly correlated with mtDNA copy number. Our findings suggest that modulation of mtDNA copy number may play a role in the pathology of cancer.

TransRate: reference free quality assessment of de-novo transcriptome assemblies

Richard D Smith-Unna, Chris Boursnell, Rob Patro, Julian M Hibberd, Steven Kelly
doi: http://dx.doi.org/10.1101/021626

TransRate is a tool for reference-free quality assessment of de novo transcriptome assemblies. Using only sequenced reads as the input, TransRate measures the quality of individual contigs and whole assemblies, enabling assembly optimization and comparison. TransRate can accurately evaluate assemblies of conserved and novel RNA molecules of any kind in any species. We show that it is more accurate than comparable methods and demonstrate its use on a variety of data.

Detecting adaptive evolution in phylogenetic comparative analysis using the Ornstein-Uhlenbeck model

Clayton E. Cressler, Marguerite A. Butler, Aaron A. King
(Submitted on 25 Jun 2015)

Phylogenetic comparative analysis is an approach to inferring evolutionary process from a combination of phylogenetic and phenotypic data. The last few years have seen increasingly sophisticated models employed in the evaluation of more and more detailed evolutionary hypotheses, including adaptive hypotheses with multiple selective optima and hypotheses with rate variation within and across lineages. The statistical performance of these sophisticated models has received relatively little systematic attention, however. We conducted an extensive simulation study to quantify the statistical properties of a class of models toward the simpler end of the spectrum that model phenotypic evolution using Ornstein-Uhlenbeck processes. We focused on identifying where, how, and why these methods break down so that users can apply them with greater understanding of their strengths and weaknesses. Our analysis identifies three key determinants of performance: a discriminability ratio, a signal-to-noise ratio, and the number of taxa sampled. Interestingly, we find that model-selection power can be high even in regions that were previously thought to be difficult, such as when tree size is small. On the other hand, we find that model parameters are in many circumstances difficult to estimate accurately, indicating a relative paucity of information in the data relative to these parameters. Nevertheless, we note that accurate model selection is often possible when parameters are only weakly identified. Our results have implications for more sophisticated methods inasmuch as the latter are generalizations of the case we study.
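
For readers unfamiliar with the process, the following sketch simulates a trait evolving along a single lineage under an Ornstein-Uhlenbeck model, dX = alpha (theta - X) dt + sigma dW. The symbols `alpha` (strength of selection), `theta` (optimum) and `sigma` (noise intensity) follow common convention and are not taken from the abstract; the function names are hypothetical, and the paper's own analysis works on whole phylogenies rather than one branch.

```python
import math
import random

def ou_step(x, dt, alpha, theta, sigma, rng):
    """Advance the trait by dt using the exact OU transition distribution."""
    decay = math.exp(-alpha * dt)
    mean = theta + (x - theta) * decay
    var = sigma ** 2 / (2 * alpha) * (1 - decay ** 2)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

def simulate_lineage(x0, total_time, n_steps, alpha, theta, sigma, seed=0):
    """Return the trait path along one lineage, including the starting value."""
    rng = random.Random(seed)
    dt = total_time / n_steps
    path = [x0]
    for _ in range(n_steps):
        path.append(ou_step(path[-1], dt, alpha, theta, sigma, rng))
    return path

def stationary_variance(alpha, sigma):
    """Long-run trait variance around the optimum: sigma^2 / (2 * alpha)."""
    return sigma ** 2 / (2 * alpha)

# Illustrative parameter values only.
path = simulate_lineage(x0=0.0, total_time=10.0, n_steps=100,
                        alpha=1.0, theta=2.0, sigma=0.5)
print(path[-1], stationary_variance(alpha=1.0, sigma=0.5))
```

With strong selection relative to noise the trait settles near `theta`, which is the kind of signal that model selection among optima relies on.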

SSCM: A method to analyze and predict the pathogenicity of sequence variants

Sharad Vikram, Matthew D Rasmussen, Eric A Evans, Imran S Haque
doi: http://dx.doi.org/10.1101/021527

The advent of cost-effective DNA sequencing has provided clinics with high-resolution information about patients’ genetic variants, which has resulted in the need for efficient interpretation of this genomic data. Traditionally, variant interpretation has been dominated by many manual, time-consuming processes due to the disparate forms of relevant information in clinical databases and literature. Computational techniques promise to automate much of this, and while they currently play only a supporting role, their continued improvement for variant interpretation is necessary to tackle the problem of scaling genetic sequencing to ever larger populations. Here, we present SSCM-Pathogenic, a genome-wide, allele-specific score for predicting variant pathogenicity. The score, generated by a semi-supervised clustering algorithm, shows predictive power on clinically relevant mutations, while also displaying predictive ability in noncoding regions of the genome.

Adaptive evolution is substantially impeded by Hill-Robertson interference in Drosophila

David Castellano, Marta Coronado, Jose Campos, Antonio Barbadilla, Adam Eyre-Walker
doi: http://dx.doi.org/10.1101/021600

It is known that rates of mutation and recombination vary across the genome in many species. Here we investigate whether these factors, individually and in combination, affect the rate at which genes undergo adaptive evolution, and we quantify the degree to which Hill-Robertson interference (HRi) impedes the rate of adaptive evolution. To do this we compiled a dataset of 6,141 autosomal protein coding genes from Drosophila, for which we have polymorphism data from D. melanogaster and divergence out to D. yakuba. We estimated the rate of adaptive evolution using a derivative of the McDonald-Kreitman test that controls for slightly deleterious mutations. We find that the rate of adaptive amino acid substitution is positively correlated with both the rate of recombination and the rate of mutation. We also find that these correlations are robust to controlling for each other, synonymous codon bias and gene functions related to immune response and testes. We estimate that HRi reduces the rate of adaptive evolution by ~27%. We also show that this fraction depends on a gene’s mutation rate; genes with low mutation rates lose ~11% of their adaptive substitutions while genes with high mutation rates lose ~43%. In conclusion, we show that the mutation rate and the rate of recombination are important modifiers of the rate of adaptive evolution in Drosophila.
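
As background, the basic McDonald-Kreitman estimate of the proportion of adaptive substitutions can be written as alpha = 1 - (Ds * Pn) / (Dn * Ps). The sketch below implements only this textbook form; the abstract uses a derivative that additionally controls for slightly deleterious mutations, which is not reproduced here. The function name and the counts are invented for illustration.

```python
def mk_alpha(dn, ds, pn, ps):
    """Basic McDonald-Kreitman estimate of the fraction of adaptive
    nonsynonymous substitutions.

    dn, ds: nonsynonymous and synonymous divergence counts
    pn, ps: nonsynonymous and synonymous polymorphism counts
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero for the ratio to be defined")
    return 1.0 - (ds * pn) / (dn * ps)

# Hypothetical counts for a single gene (not data from the paper).
print(mk_alpha(dn=40, ds=100, pn=20, ps=100))  # 0.5
```

A value of 0.5 here means that, under the assumptions of the basic test, half of the nonsynonymous substitutions would be attributed to positive selection.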

Approaches to estimating inbreeding coefficients in clinical isolates of Plasmodium falciparum from genomic sequence data

John D O’Brien, Lucas Amenga-Etego, Ruiqi Li
doi: http://dx.doi.org/10.1101/021519

A recent genomic characterization of more than 200 Plasmodium falciparum samples isolated from the bloodstreams of clinical patients across three continents further supports the presence of significant strain mixture within infections. Consistent with previous studies, these data suggest that the degree of genetic strain admixture within infections varies significantly both within and across populations. The life cycle of the parasite implies that the mixture of multiple genotypes within an infected individual controls the outcrossing rate across populations, making methods for measuring this process in situ central to understanding the genetic epidemiology of the disease. Peculiar features of the P. falciparum genome mean that standard methods for assessing structure within a population (inbreeding coefficients and related F-statistics) cannot be used directly. Here we review an initial effort to estimate the degree of mixture within clinical isolates of P. falciparum using these statistics, and provide several generalizations using both frequentist and Bayesian approaches. Using the Bayesian approach, based on the Balding-Nichols model, we provide estimates of inbreeding coefficients for 168 samples from northern Ghana and find significant admixture in more than 70% of samples, and characterize the model fit using posterior predictive checks. We also compare this approach to a recently introduced mixture model and find that for a significant minority of samples the F-statistic-based approach provides a significantly better explanation for the data. We show how to extend this model to a multi-level testing framework that can integrate other data types and use it to demonstrate that transmission intensity significantly associates with degree of structure of within-sample mixture in northern Ghana.
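
The Balding-Nichols model mentioned above is usually stated as follows: given a population allele frequency p and a coefficient F, the allele frequency in a subpopulation (here, loosely, within a sample) is Beta-distributed with shape parameters p(1-F)/F and (1-p)(1-F)/F, giving mean p and variance F p (1-p). The sketch below illustrates only that standard parameterization; the function names are hypothetical and the authors' full Bayesian model for sequence read data is not reproduced.

```python
import random

def bn_shape_params(p, f):
    """Beta shape parameters of the Balding-Nichols model."""
    if not (0.0 < p < 1.0 and 0.0 < f < 1.0):
        raise ValueError("p and f must lie strictly between 0 and 1")
    scale = (1.0 - f) / f
    return p * scale, (1.0 - p) * scale

def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1.0))

def sample_frequencies(p, f, n, seed=0):
    """Draw n within-sample allele frequencies around population frequency p."""
    rng = random.Random(seed)
    a, b = bn_shape_params(p, f)
    return [rng.betavariate(a, b) for _ in range(n)]

# Illustrative values: larger f spreads frequencies further from p.
a, b = bn_shape_params(p=0.2, f=0.1)
print(beta_variance(a, b))            # equals f * p * (1 - p) up to rounding
print(sample_frequencies(0.2, 0.1, 5))
```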

A novel test for detecting gene-gene interactions in trio studies

Brunilda Balliu, Noah Zaitlen
doi: http://dx.doi.org/10.1101/021469

Epistasis plays a significant role in the genetic architecture of many complex phenotypes in model organisms. To date, there have been very few interactions replicated in human studies due in part to the multiple hypothesis burden implicit in genome-wide tests of epistasis. Therefore, it is of paramount importance to develop the most powerful tests possible for detecting interactions. In this work we develop a new gene-gene interaction test for use in trio studies called the trio correlation (TC) test. The TC test computes the expected joint distribution of marker pairs in offspring conditional on parental genotypes. This distribution is then incorporated into a standard one degree of freedom correlation test of interaction. We show via extensive simulations that our test substantially outperforms existing tests of interaction in trio studies. The gain in power under standard models of phenotype is large, with previous tests requiring more than twice the number of trios to obtain the power of our test. We also demonstrate a bias in a previous trio interaction test and identify its origin. We conclude that the TC test shows improved power to identify interactions in existing, as well as emerging, trio association studies. The method is publicly available at http://www.github.com/BrunildaBalliu/TrioEpi.
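
The central ingredient described above, the expected joint distribution of a marker pair in the offspring given the parents' genotypes, can be sketched under two simplifying assumptions that are not stated in the abstract: Mendelian transmission and unlinked markers (so the joint distribution factorizes). Genotypes are coded as minor-allele counts 0, 1 or 2, and all function names are hypothetical; this is not the TC test itself.

```python
from itertools import product

def transmit(g):
    """Probability that a parent with genotype g (0, 1 or 2 minor alleles)
    passes on 0 or 1 copies of the minor allele."""
    return {0: 1.0 - g / 2.0, 1: g / 2.0}

def offspring_dist(mother, father):
    """Offspring genotype distribution at one marker."""
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for (am, pm), (af, pf) in product(transmit(mother).items(),
                                      transmit(father).items()):
        dist[am + af] += pm * pf
    return dist

def joint_offspring_dist(mother, father):
    """Joint distribution over two markers, assuming they are unlinked.
    mother and father are (genotype_marker1, genotype_marker2) tuples."""
    d1 = offspring_dist(mother[0], father[0])
    d2 = offspring_dist(mother[1], father[1])
    return {(g1, g2): d1[g1] * d2[g2] for g1 in d1 for g2 in d2}

# Both parents heterozygous at marker 1; opposite homozygotes at marker 2.
joint = joint_offspring_dist(mother=(1, 0), father=(1, 2))
print(joint[(1, 1)])  # 0.5
```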

On enhancing variation detection through pan-genome indexing

Daniel Valenzuela, Niko Välimäki, Esa Pitkänen, Veli Mäkinen
doi: http://dx.doi.org/10.1101/021444

Detection of genomic variants is commonly conducted by aligning a set of reads sequenced from an individual to the reference genome of the species and analyzing the resulting read pileup. Typically, this process finds a subset of variants already reported in databases and additional novel variants characteristic of the sequenced individual. Most of the effort in the literature has been devoted to the alignment problem on a single reference sequence, although our gathered knowledge on species such as human is pan-genomic: we know most of the common variation in addition to the reference sequence. There have been some efforts to exploit pan-genome indexing, where the most widely adopted approach is to build an index structure on a set of reference sequences containing observed variation combinations. The enhancement in alignment accuracy when using pan-genome indexing has been demonstrated in experiments, but so far this multiple-reference pan-genome indexing approach has not been tested on its final goal, that is, enhancing variation detection. This is the focus of this article: we study a generic approach to adding variation detection support on top of the multiple-reference pan-genome indexing approach. Namely, we study the read pileup on a multiple alignment of reference genomes, and propose a heaviest path algorithm to extract a new recombined reference sequence. This recombined reference sequence can then be utilized in any standard read alignment and variation detection workflow. We demonstrate that the approach enhances variation detection on realistic data sets.
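
One plausible reading of the heaviest-path idea is a dynamic program over the columns of the multiple alignment: each cell holds the read support for one reference at one column, and the path picks one reference per column so as to maximize total support. The sketch below follows that reading only; the `jump_penalty` for switching references, the coverage matrix and the function name are assumptions for illustration, not details from the paper.

```python
def heaviest_path(weights, jump_penalty=0.0):
    """weights[r][c] is the read support for reference r at alignment column c.
    Returns (row chosen per column, total weight of the path)."""
    n_rows, n_cols = len(weights), len(weights[0])
    score = [weights[r][0] for r in range(n_rows)]
    back = []  # back[c-1][r] = row used at column c-1 on the best path to (r, c)
    for c in range(1, n_cols):
        best_prev = max(range(n_rows), key=lambda r: score[r])
        new_score, pointers = [], []
        for r in range(n_rows):
            stay = score[r]
            jump = score[best_prev] - jump_penalty
            if stay >= jump:
                new_score.append(stay + weights[r][c])
                pointers.append(r)
            else:
                new_score.append(jump + weights[r][c])
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    row = max(range(n_rows), key=lambda r: score[r])
    total = score[row]
    path = [row]
    for pointers in reversed(back):
        row = pointers[row]
        path.append(row)
    path.reverse()
    return path, total

# Two toy references; reads support the first on the left, the second on the right.
coverage = [[5, 5, 0, 0],
            [0, 0, 5, 5]]
references = ["ACGT", "ACTA"]
rows, total = heaviest_path(coverage, jump_penalty=1.0)
recombined = "".join(references[r][c] for c, r in enumerate(rows))
print(rows, total, recombined)  # [0, 0, 1, 1] 19.0 ACTA
```

The recombined string is what would then be handed to an ordinary read-alignment and variant-calling workflow in place of the single reference.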

The structure of the genotype-phenotype map strongly constrains the evolution of non-coding RNA

Kamaludin Dingle, Steffen Schaper, Ard A. Louis
(Submitted on 17 Jun 2015)

The prevalence of neutral mutations implies that biological systems typically have many more genotypes than phenotypes. But can the way that genotypes are distributed over phenotypes determine evolutionary outcomes? Answering such questions is difficult because the number of genotypes can be hyper-astronomically large. By solving the genotype-phenotype (GP) map for RNA secondary structure for systems up to length L=126 nucleotides (where the set of all possible RNA strands would weigh more than the mass of the visible universe) we show that the GP map strongly constrains the evolution of non-coding RNA (ncRNA). Remarkably, simple random sampling over genotypes accurately predicts the distribution of properties such as the mutational robustness or the number of stems per secondary structure found in naturally occurring ncRNA. Since we ignore natural selection, this close correspondence with the mapping suggests that structures allowing for functionality are easily discovered, despite the enormous size of the genetic spaces. The mapping is extremely biased: the majority of genotypes map to an exponentially small portion of the morphospace of all biophysically possible structures. Such strong constraints provide a non-adaptive explanation for the convergent evolution of structures such as the hammerhead ribozyme. ncRNA presents a particularly clear example of bias in the arrival of variation strongly shaping evolutionary outcomes.