Twisted trees and inconsistency of tree estimation when gaps are treated as missing data — the impact of model mis-specification in distance corrections

Emily Jane McTavish, Mike Steel, Mark T. Holder
Comments: 29 pages, 3 figures
Subjects: Populations and Evolution (q-bio.PE)

Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two “long branch repulsion” trees will be preferred over the true tree, though these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a “twisted Farris-zone” tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.
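
The two effects attributed to Susko et al. (2004) in this abstract can be illustrated on a four-taxon tree. The Python sketch below is an illustration under stated assumptions, not the authors' analysis. It uses the four-point criterion (choose the pairing of taxa with the smallest sum of distances) as a stand-in for a distance method. The branch lengths, the square-root (concave) and squaring (convex) transforms, and all function names are hypothetical choices made for this example. It does not attempt the "twisted Farris-zone" tree, whose shape the abstract does not specify.

```python
import math

def quartet_distances(a, b, c, d, internal):
    """Path-length distances on the true tree ((A,B),(C,D)).

    a, b, c, d are the terminal branch lengths of taxa A, B, C, D and
    `internal` is the length of the branch separating (A,B) from (C,D).
    """
    return {
        ("A", "B"): a + b,
        ("C", "D"): c + d,
        ("A", "C"): a + internal + c,
        ("A", "D"): a + internal + d,
        ("B", "C"): b + internal + c,
        ("B", "D"): b + internal + d,
    }

def four_point_scores(dist):
    """Sum of the two within-pair distances for each of the three topologies."""
    return {
        "AB|CD": dist[("A", "B")] + dist[("C", "D")],
        "AC|BD": dist[("A", "C")] + dist[("B", "D")],
        "AD|BC": dist[("A", "D")] + dist[("B", "C")],
    }

def preferred(dist, correction):
    """Topologies with the smallest score after applying `correction`
    (a hypothetical mis-specified distance function) to every distance."""
    transformed = {pair: correction(x) for pair, x in dist.items()}
    scores = four_point_scores(transformed)
    best = min(scores.values())
    return sorted(t for t, s in scores.items() if math.isclose(s, best))

# Illustrative branch lengths (assumptions, not values from the paper).
# Felsenstein zone: the long branches (A and C) are not sisters.
felsenstein = quartet_distances(2.0, 0.1, 2.0, 0.1, 0.1)
# Farris zone: the long branches (A and B) are sisters.
farris = quartet_distances(2.0, 2.0, 0.1, 0.1, 0.1)

print(preferred(felsenstein, lambda x: x))    # ['AB|CD']  exact distances: true tree
print(preferred(felsenstein, math.sqrt))      # ['AC|BD']  concave: long branch attraction
print(preferred(farris, lambda x: x ** 2))    # ['AC|BD', 'AD|BC']  convex: two tied wrong trees
```

With exact distances the true tree AB|CD always has the smallest score. The concave transform joins the two long branches in the Felsenstein-zone example, and the convex transform separates them in the Farris-zone example, leaving the two incorrect trees tied.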

Selection for Intermediate Genotypes Enables a Key Innovation in Phage Lambda

Alita Burmeister, Richard Lenski, Justin Meyer
doi: http://dx.doi.org/10.1101/018606

The evolution of qualitatively new functions is fundamental for shaping the diversity of life. Such innovations are rare because they require multiple coordinated changes. We sought to understand the evolutionary processes involved in a particular key innovation, whereby phage λ evolved the ability to exploit a novel receptor, OmpF, on the surface of Escherichia coli cells. Previous work has shown that this transition repeatedly evolves in the laboratory, despite requiring four mutations in specific regions of a single gene. Here we examine how this innovation evolved by studying six intermediate genotypes that arose during independent transitions to use OmpF. In particular, we tested whether these genotypes were favored by selection, and how a coevolved change in the hosts influenced the fitness of the phage genotypes. To do so, we measured the fitness of the intermediate types relative to the ancestral λ when competing for either ancestral or coevolved host cells. All six intermediates had improved fitness on at least one host, and four had higher fitness on the coevolved host than on the ancestral host. These results show that the evolution of the phage’s new ability to use OmpF was repeatable because the intermediate genotypes were adaptive and, in many cases, because coevolution of the host favored their emergence.

Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic rare missense mutation distribution patterns

Tychele Turner, Christopher Douville, Dewey Kim, Peter D Stenson, David N Cooper, Aravinda Chakravarti, Rachel Karchin
doi: http://dx.doi.org/10.1101/018648

The role of rare missense variants in disease causation remains difficult to interpret. We explore whether the clustering pattern of rare missense variants (MAF < 0.01) in a protein is associated with mode of inheritance. Mutations in genes associated with autosomal dominant (AD) conditions are known to result in either loss or gain of function, whereas mutations in genes associated with autosomal recessive (AR) conditions invariably result in loss of function. Loss-of-function mutations tend to be distributed uniformly along protein sequence, while gain-of-function mutations tend to localize to key regions. It has not previously been ascertained whether these patterns hold in general for rare missense mutations. We consider the extent to which rare missense variants are located within annotated protein domains and whether they form clusters, using a new unbiased method called CLUstering by Mutation Position (CLUMP). These approaches quantified a significant difference in clustering between AD and AR diseases. Proteins linked to AD diseases exhibited more clustering of rare missense mutations than those linked to AR diseases (Wilcoxon P=5.7×10^-4, permutation P=8.4×10^-4). Rare missense mutations in proteins linked to either AD or AR diseases were more clustered than controls (1000G) (Wilcoxon P=2.8×10^-15 for AD and P=4.5×10^-4 for AR, permutation P=3.1×10^-12 for AD and P=0.03 for AR). Differences in clustering patterns persisted even after removal of the most prominent genes. Testing for such non-random patterns may reveal novel aspects of disease etiology in large sample studies.

FermiKit: assembly-based variant calling for Illumina resequencing data

Heng Li
Subjects: Genomics (q-bio.GN)

Summary: FermiKit is a variant calling pipeline for Illumina data. It de novo assembles short reads and then maps the assembly against a reference genome to call SNPs, short insertions/deletions (INDELs) and structural variations (SVs). FermiKit takes about one day to assemble 30-fold human whole-genome data on a modern 16-core server with 85GB RAM at the peak, and calls variants in half an hour to an accuracy comparable to the current practice. FermiKit assembly is a reduced representation of raw data while retaining most of the original information.
Availability and implementation: https://github.com/lh3/fermikit
Contact: hengli@broadinstitute.org

Integrative analysis of RNA, translation and protein levels reveals distinct regulatory variation across humans

Can Cenik, Elif Sarinay Cenik, Gun W Byeon, Fabian Grubert, Sophie I Candille, Damek Spacek, Bilal Alsallakh, Hagen Tilgner, Carlos L Araya, Hua Tang, Emiliano Ricci, Michael P Snyder
doi: http://dx.doi.org/10.1101/018572

Elucidating the consequences of genetic differences between humans is essential for understanding phenotypic diversity and personalized medicine. Although variation in RNA levels, transcription factor binding and chromatin has been explored, little is known about global variation in translation and its genetic determinants. We used ribosome profiling, RNA sequencing, and mass spectrometry to perform an integrated analysis in lymphoblastoid cell lines from a diverse group of individuals. We find significant differences in RNA, translation, and protein levels, suggesting diverse mechanisms of personalized gene expression control. Combined analysis of RNA expression and ribosome occupancy improves the identification of individual protein level differences. Finally, we identify genetic differences that specifically modulate ribosome occupancy; many of these differences lie close to start codons and upstream ORFs. Our results reveal a new level of gene expression variation among humans and indicate that genetic variants can cause changes in protein levels through effects on translation.

Elephantid genomes reveal the molecular bases of Woolly Mammoth adaptations to the arctic

Vincent Lynch, Oscar C. Bedoya-Reina, Aakrosh Ratan, Michael Sulak, Daniela I. Drautz-Moses, George H. Perry, Webb Miller, Stephan C. Schuster
doi: http://dx.doi.org/10.1101/018366

Woolly mammoths and the living elephants are characterized by major phenotypic differences that allowed them to live in very different environments. To identify the genetic changes that underlie the suite of adaptations in woolly mammoths to life in extreme cold, we sequenced the nuclear genomes of three Asian elephants and two woolly mammoths, and identified and functionally annotated genetic changes unique to the woolly mammoth lineage. We find that genes with mammoth-specific amino acid changes are enriched in functions related to circadian biology, skin and hair development and physiology, lipid metabolism, adipose development and physiology, and temperature sensation. Finally, we resurrect and functionally test the mammoth and ancestral elephant TRPV3 gene, which encodes a temperature-sensitive transient receptor potential (thermoTRP) channel involved in thermal sensation and hair growth, and show that a single mammoth-specific amino acid substitution in an otherwise highly conserved region of the TRPV3 channel strongly affected its temperature sensitivity. Our results have identified a set of genetic changes that likely played important roles in the adaptation of woolly mammoths to life in the high Arctic.

Site-specific amino-acid preferences are mostly conserved in two closely related protein homologs

Michael B Doud, Orr Ashenberg, Jesse Bloom
doi: http://dx.doi.org/10.1101/018457

Evolution drives changes in a protein’s sequence over time. The extent to which these changes in sequence affect the underlying preferences for each amino acid at each site is an important question with implications for comparative sequence-analysis methods such as molecular phylogenetics. To quantify the extent that site-specific amino-acid preferences change during evolution, we performed deep mutational scanning on two homologs of human influenza nucleoprotein with 94% amino-acid identity. We found that only a small fraction of sites (14 out of 497) exhibited changes in their amino-acid preferences that exceeded the noise in our experiments. Given the limited change in amino-acid preferences between these close homologs, we tested whether our measurements could be used to build site-specific substitution models that describe the evolution of nucleoproteins from more diverse influenza viruses. We found that site-specific evolutionary models informed by our experiments greatly outperformed non-site-specific alternatives in fitting the phylogenies of nucleoproteins from human, swine, equine, and avian influenza. Combining the experimental data from both nucleoprotein homologs improved phylogenetic fit, in part because measurements in multiple genetic contexts better captured the evolutionary average of the amino-acid preferences for sites with changing preferences. Overall, our results show that site-specific amino-acid preferences are sufficiently conserved during evolution that measuring mutational effects in one protein provides information that can improve quantitative evolutionary modeling of nearby homologs.

Efficient compression and analysis of large genetic variation datasets

Ryan M Layer, Neil Kindlon, Konrad J Karczewski, Exome Aggregation Consortium (ExAC), Aaron R Quinlan
doi: http://dx.doi.org/10.1101/018259

The economy of human genome sequencing has catalyzed ambitious efforts to interrogate the genomes of large cohorts in search of deeper insight into the genetic basis of disease. This manuscript introduces Genotype Query Tools (GQT) as a new indexing strategy and powerful toolset that enables interactive analyses based on genotypes, phenotypes and sample relationships. Speed improvements are achieved by operating directly on a compressed index without decompression. GQT’s data compression ratios increase favorably with cohort size and therefore, by avoiding data inflation, relative analysis performance improves in kind. We demonstrate substantial query performance improvements over state-of-the-art tools using datasets from the 1000 Genomes Project (46 fold), the Exome Aggregation Consortium (443 fold), and simulated datasets of up to 100,000 genomes (218 fold). Moreover, our genotype indexing strategy complements existing formats and toolsets to provide a powerful framework for current and future analyses of massive genome datasets.

Efficient Privacy-Preserving String Search and an Application in Genomics

Kana Shimizu, Koji Nuida, Gunnar Rätsch
doi: http://dx.doi.org/10.1101/018267

Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g., an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private, and the server would like to keep the database private. Approach: We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queries. In an experiment based on 2,184 aligned haploid genomes from the 1,000 Genomes Project, our algorithm was able to perform typical queries within ≈2 seconds and ≈20 seconds for client and server side, respectively, on a laptop computer. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm.
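
The "efficient iterative query operations" mentioned in this abstract can be pictured with the standard backward search over a Burrows-Wheeler transform, where each query character narrows an interval of the sorted suffixes. The Python sketch below is a plaintext illustration only: it omits the additive homomorphic encryption and oblivious transfer entirely, it uses the ordinary rather than the positional transform, and it is not the authors' implementation. All function names and the toy sequence are hypothetical, and the text is assumed not to contain the sentinel character `$`.

```python
def bwt_index(text):
    """Build the BWT of `text` and the table C, where C[ch] is the number
    of characters in the text that sort strictly before ch."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character is the one preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return bwt, C

def occ(bwt, ch, i):
    """Number of occurrences of ch in bwt[:i] (naive scan, for clarity)."""
    return bwt[:i].count(ch)

def count_matches(bwt, C, pattern):
    """Backward search: one interval update per pattern character,
    processed from the last character to the first."""
    lo, hi = 0, len(bwt)  # half-open interval over sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ(bwt, ch, lo)
        hi = C[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt, C = bwt_index("ACGTACGA")
print(count_matches(bwt, C, "ACG"))  # 2
print(count_matches(bwt, C, "GG"))   # 0
```

Each loop iteration depends only on the current interval and one query character, which is the kind of step-by-step lookup that a privacy-preserving protocol would need to perform without revealing the character or the interval.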

Relationship between LD Score and Haseman-Elston Regression

Brendan Bulik-Sullivan
doi: http://dx.doi.org/10.1101/018283

Estimating SNP-heritability from summary statistics using LD Score regression provides a convenient alternative to standard variance component models, because LD Score regression is computationally very fast and does not require individual genotype data. However, the mathematical relationship between variance component methods and LD Score regression is not clear; in particular, it is not known in general how much of an increase in standard error one incurs by working with summary data instead of individual genotypes. In this paper, I show that in samples of unrelated individuals, LD Score regression with constrained intercept is essentially the same as Haseman-Elston (HE) regression, which is currently the state-of-the-art method for estimating SNP-heritability from ascertained case/control samples. Similar results hold for SNP-genetic correlation.
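
For readers unfamiliar with Haseman-Elston regression, its basic form regresses the product of two individuals' phenotypes on their estimated genetic relatedness, and the slope serves as the heritability estimate. The Python sketch below shows a no-intercept version under stated assumptions: phenotypes are taken to be standardized (mean 0, variance 1), the relatedness matrix is supplied by the caller, and the function name and toy data are hypothetical. It does not reproduce the derivation in the paper or any correction for case/control ascertainment.

```python
def he_regression(phenotypes, grm):
    """No-intercept Haseman-Elston regression.

    Regresses y_i * y_j on grm[i][j] over all pairs i < j and returns the
    least-squares slope, sum(A_ij * y_i * y_j) / sum(A_ij ** 2).
    Assumes `phenotypes` are standardized and `grm` is a symmetric
    relatedness matrix given as a list of lists.
    """
    numerator = 0.0
    denominator = 0.0
    n = len(phenotypes)
    for i in range(n):
        for j in range(i + 1, n):
            relatedness = grm[i][j]
            numerator += relatedness * phenotypes[i] * phenotypes[j]
            denominator += relatedness ** 2
    if denominator == 0.0:
        raise ValueError("all off-diagonal relatedness values are zero")
    return numerator / denominator

# Toy check (constructed data, not real genotypes): if every phenotype
# product equals exactly half the pair's relatedness, the slope is 0.5.
y = [1.0, -0.5, 0.25, -1.5]
toy_grm = [[2.0 * a * b for b in y] for a in y]
print(he_regression(y, toy_grm))
```

Only the off-diagonal entries of the relatedness matrix enter the estimate, so the sketch never uses an individual's relatedness to themselves.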