Efficient compression and analysis of large genetic variation datasets

Ryan M Layer, Neil Kindlon, Konrad J Karczewski, Exome Aggregation Consortium (ExAC), Aaron R Quinlan
doi: http://dx.doi.org/10.1101/018259

The economy of human genome sequencing has catalyzed ambitious efforts to interrogate the genomes of large cohorts in search of deeper insight into the genetic basis of disease. This manuscript introduces Genotype Query Tools (GQT) as a new indexing strategy and powerful toolset that enables interactive analyses based on genotypes, phenotypes and sample relationships. Speed improvements are achieved by operating directly on a compressed index without decompression. GQT’s data compression ratios increase favorably with cohort size and therefore, by avoiding data inflation, relative analysis performance improves in kind. We demonstrate substantial query performance improvements over state-of-the-art tools using datasets from the 1000 Genomes Project (46-fold), the Exome Aggregation Consortium (443-fold), and simulated datasets of up to 100,000 genomes (218-fold). Moreover, our genotype indexing strategy complements existing formats and toolsets to provide a powerful framework for current and future analyses of massive genome datasets.
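
To make the idea of answering genotype queries with bitwise operations on an index more concrete, here is a minimal sketch. It is not GQT's index format: the bitmaps are uncompressed Python integers, and the names `build_bitmaps` and `variants_shared` are hypothetical. The abstract's point is that such operations can run directly on a compressed index; this sketch only shows the uncompressed version of that kind of query.

```python
def build_bitmaps(genotypes):
    """Build one bitmap per (sample, genotype state).

    genotypes[s][v] is the alternate-allele count (0, 1 or 2) of sample s
    at variant v. bitmaps[s][g] is an int whose bit v is set when sample s
    has genotype g at variant v.
    """
    bitmaps = []
    for row in genotypes:
        per_state = [0, 0, 0]
        for v, g in enumerate(row):
            per_state[g] |= 1 << v
        bitmaps.append(per_state)
    return bitmaps


def variants_shared(bitmaps, samples, state, n_variants):
    """Return indices of variants at which every listed sample has `state`."""
    acc = (1 << n_variants) - 1          # start with all variants selected
    for s in samples:
        acc &= bitmaps[s][state]         # one bitwise AND per sample
    return [v for v in range(n_variants) if (acc >> v) & 1]


# Toy data (illustrative only): 3 samples, 4 variants.
genotypes = [
    [0, 1, 2, 1],   # sample 0
    [1, 1, 0, 1],   # sample 1
    [2, 1, 1, 0],   # sample 2
]
bitmaps = build_bitmaps(genotypes)
# Variants that are heterozygous (state 1) in both sample 0 and sample 1.
print(variants_shared(bitmaps, [0, 1], 1, 4))
```

Storing one bitmap per sample rather than per variant means a query over a subset of samples touches only those samples' rows, which is the kind of access pattern a sample-oriented index is meant to serve.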

Efficient Privacy-Preserving String Search and an Application in Genomics

Kana Shimizu, Koji Nuida, Gunnar Rätsch
doi: http://dx.doi.org/10.1101/018267

Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g., an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private, and the server would like to keep the database private. Approach: We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queries. In an experiment based on 2,184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within ≈2 seconds and ≈20 seconds for client and server side, respectively, on a laptop computer. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm.
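
The "iterative query operations" over a Burrows-Wheeler index can be illustrated with plain backward search. The sketch below is only the unencrypted string-search building block: it uses the ordinary (not positional) Burrows-Wheeler transform, contains no oblivious transfer or homomorphic encryption, and the function names are hypothetical. Each loop iteration is one table lookup per query character, which is the kind of step the paper's protocol would have to conceal.

```python
def bwt_index(text):
    """Build a suffix array, C table and occurrence table for text + '$'."""
    text += "$"                                   # sentinel, sorts before A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)        # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c]: number of characters in the text strictly smaller than c
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of pattern, one lookup step per character."""
    lo, hi = 0, len(sa)                           # current suffix-array interval
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = bwt_index("ACGTACGA")
print(backward_search("ACG", sa, C, occ))
```

The suffix array is built here by naive sorting, which is fine for a toy string but not for genome-scale data.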

Relationship between LD Score and Haseman-Elston Regression

Brendan Bulik-Sullivan
doi: http://dx.doi.org/10.1101/018283

Estimating SNP-heritability from summary statistics using LD Score regression provides a convenient alternative to standard variance component models, because LD Score regression is computationally very fast and does not require individual genotype data. However, the mathematical relationship between variance component methods and LD Score regression is not clear; in particular, it is not known in general how much of an increase in standard error one incurs by working with summary data instead of individual genotypes. In this paper, I show that in samples of unrelated individuals, LD Score regression with constrained intercept is essentially the same as Haseman-Elston (HE) regression, which is currently the state-of-the-art method for estimating SNP-heritability from ascertained case/control samples. Similar results hold for SNP-genetic correlation.
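
For readers unfamiliar with Haseman-Elston regression, the following is a minimal sketch of one common cross-product formulation: regress the phenotype products y_i * y_j on the genetic relatedness A_ij over all pairs i < j, here through the origin. Whether this matches the exact estimator analysed in the paper is an assumption, and the function names `grm` and `he_regression` are hypothetical.

```python
def grm(X):
    """Genetic relatedness matrix A = X X^T / M.

    X[i][k] is the standardized genotype of individual i at SNP k
    (each SNP assumed to have mean 0 and variance 1 across individuals).
    """
    n, m = len(X), len(X[0])
    return [[sum(X[i][k] * X[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]


def he_regression(A, y):
    """Slope of y_i * y_j on A_ij over pairs i < j, with no intercept.

    y is assumed to be standardized (mean 0, variance 1); under that
    assumption the slope is an estimate of SNP-heritability.
    """
    num = den = 0.0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            num += A[i][j] * y[i] * y[j]
            den += A[i][j] ** 2
    return num / den


# Toy numbers for illustration only; not data from the paper.
A = [[1.0, 0.2, -0.1],
     [0.2, 1.0, 0.1],
     [-0.1, 0.1, 1.0]]
y = [1.0, -1.0, 0.5]
print(he_regression(A, y))
```

The estimator only uses off-diagonal pairs, so it needs the full set of pairwise relatedness values and therefore individual-level genotypes, in contrast to the summary-statistic input of LD Score regression.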

Low levels of transposable element activity in Drosophila mauritiana: causes and consequences

Robert Kofler, Christian Schlötterer
doi: http://dx.doi.org/10.1101/018218

Transposable elements (TEs) are major drivers of genomic and phenotypic evolution, yet many questions about their biology remain poorly understood. Here, we compare TE abundance between populations of the two sister species D. mauritiana and D. simulans and relate it to the more distantly related D. melanogaster. The low population frequency of most TE insertions in D. melanogaster and D. simulans has been a key feature of several models of TE evolution. In D. mauritiana, however, the majority of TE insertions are fixed (66%). We attribute this to a lower transposition activity of up to 47 TE families in D. mauritiana, rather than stronger purifying selection. Only three families, including the extensively studied Mariner, may have a higher activity in D. mauritiana. This remarkable difference in TE activity between two recently diverged Drosophila species (≈ 250,000 years) also supports the hypothesis that TE copy numbers in Drosophila may not reflect a stable equilibrium where the rate of TE gains equals the rate of TE losses by negative selection. We propose that the transposition rate heterogeneity results from the contrasting ecology of the two species: the extent of vertical extinction of TE families and horizontal acquisition of active TE copies may be very different between the colonizing D. simulans and the island endemic D. mauritiana. Our findings provide novel insights into the evolution of TEs in Drosophila and suggest that the ecology of the host species could be a major, yet underappreciated, factor governing the evolutionary dynamics of TEs.

When the mean is not enough: Calculating fixation time distributions in birth-death processes

Peter Ashcroft, Arne Traulsen, Tobias Galla
(Submitted on 16 Apr 2015)

Studies of fixation dynamics in Markov processes predominantly focus on the mean time to absorption. This may be inadequate if the distribution is broad and skewed. We compute the distribution of fixation times in one-step birth-death processes with two absorbing states. These are expressed in terms of the spectrum of the process, and we provide different representations as forward-only processes in eigenspace. These allow efficient sampling of fixation time distributions. As an application we study evolutionary game dynamics, where invading mutants can reach fixation or go extinct. We also highlight the median fixation time as a possible analog of mixing times in systems with small mutation rates and no absorbing states, whereas the mean fixation time has no such interpretation.
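
To illustrate why the mean can mislead, the sketch below computes the full fixation-time distribution of a discrete-time one-step birth-death chain with absorbing states 0 and N. It iterates the master equation directly rather than using the spectral representation described in the abstract, so it is a brute-force stand-in and not the authors' method. The function names and the neutral Moran rates in the usage example are assumptions chosen for illustration.

```python
def fixation_time_distribution(N, t_plus, t_minus, start, t_max):
    """Return (fix, ext) for a chain on states 0..N started at 0 < start < N.

    fix[t] is the probability of being absorbed at state N exactly at
    step t, and ext[t] is the same for state 0. t_plus(i) and t_minus(i)
    are the probabilities of moving from i to i + 1 and to i - 1.
    """
    p = [0.0] * (N + 1)
    p[start] = 1.0
    fix = [0.0] * (t_max + 1)
    ext = [0.0] * (t_max + 1)
    for t in range(1, t_max + 1):
        new = [0.0] * (N + 1)
        for i in range(1, N):                 # only transient states move
            up, down = t_plus(i), t_minus(i)
            new[i + 1] += p[i] * up
            new[i - 1] += p[i] * down
            new[i] += p[i] * (1.0 - up - down)
        fix[t], ext[t] = new[N], new[0]       # record newly absorbed mass
        new[0] = new[N] = 0.0                 # and remove it from the chain
        p = new
    return fix, ext


def conditional_mean_and_median(dist):
    """Mean and median absorption time, conditioned on that outcome."""
    total = sum(dist)
    mean = sum(t * q for t, q in enumerate(dist)) / total
    cumulative = 0.0
    for t, q in enumerate(dist):
        cumulative += q
        if cumulative >= 0.5 * total:
            return mean, t


# Neutral Moran process (illustrative choice): a single mutant among N = 10.
N = 10
rate = lambda i: i * (N - i) / N ** 2
fix, ext = fixation_time_distribution(N, rate, rate, start=1, t_max=5000)
print(sum(fix), conditional_mean_and_median(fix))
```

Comparing the conditional mean with the median in the printed result gives a quick sense of how skewed the distribution is for a given process.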

Fast principal components analysis reveals independent evolution of ADH1B gene in Europe and East Asia

Kevin J Galinsky, Gaurav Bhatia, Po-Ru Loh, Stoyan Georgiev, Sayan Mukherjee, Nick J Patterson, Alkes L Price
doi: http://dx.doi.org/10.1101/018143

Principal components analysis (PCA) is a widely used tool for inferring population structure and correcting confounding in genetic data. We introduce a new algorithm, FastPCA, that leverages recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using a new test for natural selection based on population differentiation along these PCs, we replicate previously known selected loci and identify three new signals of selection, including selection in Europeans at the ADH1B gene. The coding variant rs1229984 has previously been associated with alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents.
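
As a rough illustration of how top PCs can be approximated without forming the individuals-by-individuals matrix, here is a generic randomized range-finder with power iterations, written with NumPy. It is a textbook-style sketch and is not claimed to be the FastPCA algorithm. The function name, the oversampling and iteration parameters, and the simulated low-rank data are all assumptions.

```python
import numpy as np


def randomized_pca(X, k, oversample=10, n_iter=4, seed=0):
    """Approximate the top-k left singular vectors and singular values.

    X has one row per individual and one column per SNP. Every step is a
    product of the centered data with a thin matrix, so the cost grows
    linearly with the number of individuals.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # center each SNP column
    n, m = Xc.shape
    width = min(k + oversample, n, m)
    Q = rng.standard_normal((m, width))          # random test matrix
    for _ in range(n_iter):                      # power iterations sharpen the basis
        Q, _ = np.linalg.qr(Xc @ Q)              # n x width
        Q, _ = np.linalg.qr(Xc.T @ Q)            # m x width
    Q, _ = np.linalg.qr(Xc @ Q)                  # orthonormal basis for the range
    B = Q.T @ Xc                                 # small width x m projection
    U_small, s, _ = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k]


# Simulated rank-3 data for illustration only.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))
U, s = randomized_pca(X, k=3)
print(s)
```

On real genotype data the spectrum decays more slowly than in this toy example, so the number of power iterations and the amount of oversampling affect accuracy.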

Fulfilling the promise of Mendelian randomization

Joseph Pickrell
doi: http://dx.doi.org/10.1101/018150

Many important questions in medicine involve questions about causality. For example, do low levels of high-density lipoproteins (HDL) cause heart disease? Does high body mass index (BMI) cause type 2 diabetes? Or are these traits simply correlated in the population for other reasons? A popular approach to answering these questions using human genetics is called “Mendelian randomization”. We discuss the prospects and limitations of this approach, and some ways forward.

Is there such a thing as Landscape Genetics?

Rodney J Dyer
doi: http://dx.doi.org/10.1101/018192

For a scientific discipline to be interdisciplinary it must satisfy two conditions: it must consist of contributions from at least two existing disciplines and it must be able to provide insights, through this interaction, that neither progenitor discipline could address. In this paper, I examine the complete body of peer-reviewed literature self-identified as landscape genetics using the statistical approaches of text mining and natural language processing. The goal here is to quantify the kinds of questions being addressed in landscape genetic studies, the ways in which questions are evaluated mechanistically, and how they are differentiated from the progenitor disciplines of landscape ecology and population genetics. I then circumscribe the main factions within published landscape genetic papers, examining the extent to which emergent questions are being addressed and highlighting a deep bifurcation between existing individual- and population-based approaches. I close by providing some suggestions on where theoretical and analytical work is needed if landscape genetics is to serve as a real bridge connecting evolution and ecology sensu lato.

The design and analysis of binary variable traits in common garden genetic experiments of highly fecund species to assess heritability

Sarah W Davies, Samuel Scarpino, Thanapat Pongwarin, James Scott, Mikhail V Matz
doi: http://dx.doi.org/10.1101/018044

Many biologically important traits are binomially distributed, with their key phenotypes being presence or absence. Despite their prevalence, estimating the heritability of binomial traits presents both experimental and statistical challenges. Here we develop both an empirical and computational methodology for estimating the narrow-sense heritability of binary traits for highly fecund species. Our experimental approach controls for undesirable culturing effects while minimizing culture numbers, increasing feasibility in the field. Our statistical approach accounts for known issues with model selection by using a permutation test to calculate significance values and includes both fitting and power calculation methods. We illustrate our methodology by estimating the narrow-sense heritability for larval settlement, a key life-history trait, in the reef-building coral Orbicella faveolata. The experimental, statistical and computational methods, along with all of the data from this study, were deployed in the R package multiDimBio.
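
A permutation test for a family effect on a binary trait can be sketched as follows. This is a generic illustration and not the procedure implemented in multiDimBio. The test statistic (variance of per-family proportions), the function names, and the toy data are all assumptions.

```python
import random


def family_variance(values, families):
    """Population variance of the per-family proportions of a 0/1 trait."""
    totals, counts = {}, {}
    for x, f in zip(values, families):
        totals[f] = totals.get(f, 0) + x
        counts[f] = counts.get(f, 0) + 1
    props = [totals[f] / counts[f] for f in counts]
    mean = sum(props) / len(props)
    return sum((p - mean) ** 2 for p in props) / len(props)


def permutation_p_value(values, families, n_perm=999, seed=0):
    """Share of label shuffles whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = family_variance(values, families)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                 # break any trait-family link
        if family_variance(shuffled, families) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)          # add-one keeps p above zero


# Toy data for illustration: 4 families of 10 larvae, settled (1) or not (0).
values = [1] * 20 + [0] * 20
families = [i // 10 for i in range(40)]
print(permutation_p_value(values, families))
```

Shuffling trait values across individuals keeps family sizes and the overall settlement rate fixed, so the null distribution reflects only the absence of a family effect.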

A pooling-based approach to mapping genetic variants associated with DNA methylation

Irene Miriam Kaplow, Julia L MacIsaac, Sarah M Mah, Lisa M McEwen, Michael S Kobor, Hunter B Fraser
doi: http://dx.doi.org/10.1101/013649

DNA methylation is an epigenetic modification that plays a key role in gene regulation. Previous studies have investigated its genetic basis by mapping genetic variants that are associated with DNA methylation at specific sites, but these have been limited to microarrays that cover less than 2% of the genome and cannot account for allele-specific methylation (ASM). Other studies have performed whole-genome bisulfite sequencing on a few individuals, but these lack statistical power to identify variants associated with DNA methylation. We present a novel approach in which bisulfite-treated DNA from many individuals is sequenced together in a single pool, resulting in a truly genome-wide map of DNA methylation. Compared to methods that do not account for ASM, our approach increases statistical power to detect associations while sharply reducing cost, effort, and experimental variability. As a proof of concept, we generated deep sequencing data from a pool of 60 human cell lines; we evaluated almost twice as many CpGs as the largest microarray studies and identified over 2,000 genetic variants associated with DNA methylation. We found that these variants are highly enriched for associations with chromatin accessibility and CTCF binding but are less likely to be associated with traits indirectly linked to DNA, such as gene expression and disease phenotypes. In summary, our approach allows genome-wide mapping of genetic variants associated with DNA methylation in any tissue of any species, without the need for individual-level genotype or methylation data.