Nonparametric inference of the distribution of fitness effects across functional categories in humans

Fernando Racimo, Joshua G Schraiber

Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental importance to our understanding of evolution, disease genetics and the maintenance of variation genome-wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral, or rely on fitting the DFE of new nonsynonymous mutations to a particular parametric probability distribution, which is poorly motivated on a biological level. We rely on a previously developed method that utilizes a variety of published annotations (including conservation scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome based on how likely they are to be affected by negative selection, controlling for mutation rate. We map this score to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson random field model. We then use our coefficient mapping to quantify the distribution of all scored single-nucleotide polymorphisms in Yoruba and Europeans. Our method serves to approximate the DFE of any type of segregating mutations, regardless of its genomic consequence, and so allows us to compare the proportion of mutations that are negatively selected or neutral across various genomic categories, including different types of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly leptokurtic, with a strong peak at neutrality, while the distribution of nonsynonymous polymorphisms is bimodal, with a neutral peak and a second peak at s ≈ −10^(−4). Other types of polymorphisms have shapes that fall roughly in between these two.

Identifying Keystone Species in the Human Gut Microbiome from Metagenomic Timeseries using Sparse Linear Regression

Charles K. Fisher, Pankaj Mehta
(Submitted on 3 Feb 2014)

Human-associated microbial communities exert tremendous influence over human health and disease. With modern metagenomic sequencing methods it is possible to follow the relative abundance of microbes in a community over time. These microbial communities exhibit rich ecological dynamics and an important goal of microbial ecology is to infer the interactions between species from sequence data. Any algorithm for inferring species interactions must overcome three obstacles: 1) a correlation between the abundances of two species does not imply that those species are interacting, 2) the sum constraint on the relative abundances obtained from metagenomic studies makes it difficult to infer the parameters in timeseries models, and 3) errors due to experimental uncertainty, or mis-assignment of sequencing reads into operational taxonomic units, bias inferences of species interactions. Here we introduce an approach, Learning Interactions from MIcrobial Time Series (LIMITS), that overcomes these obstacles. LIMITS uses sparse linear regression with bootstrap aggregation to infer a discrete-time Lotka-Volterra model for microbial dynamics. We tested LIMITS on synthetic data and showed that it could reliably infer the topology of the inter-species ecological interactions. We then used LIMITS to characterize the species interactions in the gut microbiomes of two individuals and found that the interaction networks varied significantly between individuals. Furthermore, we found that the interaction networks of the two individuals are dominated by distinct “keystone species”, Bacteroides fragilis and Bacteroides stercoris, that have a disproportionate influence on the structure of the gut microbiome even though they are only found in moderate abundance. Based on our results, we hypothesize that the abundances of certain keystone species may be responsible for individuality in the human gut microbiome.
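
As a rough illustration of the regression idea in this abstract, the sketch below assumes a Ricker-type discrete-time Lotka-Volterra form, x_i(t+1) = x_i(t) * exp(sum_j c_ij * (x_j(t) - mean_j) + noise), which the abstract itself does not spell out. Under that assumption the log abundance ratios are linear in the abundances, so the interaction coefficients can be estimated by least squares and stabilized by taking the median over bootstrap resamples. The sparse variable-selection step is omitted, so this is plain bagged least squares and not the LIMITS algorithm; the function names, parameter values and synthetic data are all made up for the example.

```python
import numpy as np

def simulate(C, mu, T, sigma, rng):
    """Simulate x_i(t+1) = x_i(t) * exp(sum_j C[i, j] * (x_j(t) - mu[j]) + noise)."""
    x = np.empty((T, len(mu)))
    x[0] = mu
    for t in range(T - 1):
        noise = rng.normal(0.0, sigma, size=len(mu))
        x[t + 1] = x[t] * np.exp(C @ (x[t] - mu) + noise)
    return x

def bagged_interactions(x, n_boot, rng):
    """Median over bootstrap replicates of least-squares interaction estimates."""
    y = np.log(x[1:] / x[:-1])         # response: log abundance ratios
    d = x[:-1] - x[:-1].mean(axis=0)   # predictors: deviations from the mean
    fits = []
    for _ in range(n_boot):
        # Resample time points with replacement and refit.
        idx = rng.integers(0, len(y), size=len(y))
        coef, *_ = np.linalg.lstsq(d[idx], y[idx], rcond=None)
        fits.append(coef.T)            # row i holds the effects on species i
    return np.median(fits, axis=0)

# Synthetic two-species example (hypothetical interaction matrix).
rng = np.random.default_rng(0)
C_true = np.array([[-0.5, 0.2],
                   [-0.3, -0.5]])
x = simulate(C_true, mu=np.array([1.0, 1.0]), T=2000, sigma=0.1, rng=rng)
C_hat = bagged_interactions(x, n_boot=50, rng=rng)
```

Here `C_hat[i, j]` estimates the effect of species j on the growth of species i. A fuller treatment would add a selection step so that only well-supported interactions enter each fit.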

Genetic variants associated with motion sickness point to roles for inner ear development, neurological processes, and glucose homeostasis

Bethann S Hromatka, Joyce Y Tung, Amy K Kiefer, Chuong B Do, David A Hinds, Nicholas Eriksson

Roughly one in three individuals is highly susceptible to motion sickness and yet the underlying causes of this condition are not well understood. Despite high heritability, no associated genetic factors have been discovered to date. Here, we conducted the first genome-wide association study on motion sickness in 80,494 individuals from the 23andMe database who were surveyed about car sickness. Thirty-five single-nucleotide polymorphisms (SNPs) were associated with motion sickness at a genome-wide-significant level (p < 5e-8). Many of these SNPs are near genes involved in balance, and eye, ear, and cranial development (e.g., PVRL3, TSHZ1, MUTED, HOXB3, HOXD3). Other SNPs may affect motion sickness through nearby genes with roles in the nervous system, glucose homeostasis, or hypoxia. We show that several of these SNPs display sex-specific effects, with as much as three times stronger effects in women. We searched for comorbid phenotypes with motion sickness, confirming associations with known comorbidities including migraines, postoperative nausea and vomiting (PONV), vertigo, and morning sickness, and observing new associations with altitude sickness and many gastrointestinal conditions. We also show that two of these related phenotypes (PONV and migraines) share underlying genetic factors with motion sickness. These results point to the importance of the nervous system in motion sickness and suggest a role for glucose levels in motion-induced nausea and vomiting, a finding that may provide insight into other nausea-related phenotypes such as PONV. They also highlight personal characteristics (e.g., being a poor sleeper) that correlate with motion sickness, findings that could help identify risk factors or treatments.

motifDiverge: a model for assessing the statistical significance of gene regulatory motif divergence between two DNA sequences

Dennis Kostka, Tara Friedrich, Alisha K. Holloway, Katherine S. Pollard
(Submitted on 1 Feb 2014)

Next-generation sequencing technology enables the identification of thousands of gene regulatory sequences in many cell types and organisms. We consider the problem of testing if two such sequences differ in their number of binding site motifs for a given transcription factor (TF) protein. Binding site motifs impart regulatory function by providing TFs the opportunity to bind to genomic elements and thereby affect the expression of nearby genes. Evolutionary changes to such functional DNA are hypothesized to be major contributors to phenotypic diversity within and between species; but despite the importance of TF motifs for gene expression, no method exists to test for motif loss or gain. Assuming that motif counts are Binomially distributed, and allowing for dependencies between motif instances in evolutionarily related sequences, we derive the probability mass function of the difference in motif counts between two nucleotide sequences. We provide a method to numerically estimate this distribution from genomic data and show through simulations that our estimator is accurate. Finally, we introduce the R package motifDiverge that implements our methodology and illustrate its application to gene regulatory enhancers identified by a mouse developmental time course experiment. While this study was motivated by analysis of regulatory motifs, our results can be applied to any problem involving two correlated Bernoulli trials.
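
The closing remark about correlated Bernoulli trials can be made concrete with a small sketch. It assumes a simplified setting that is not taken from the paper: n independent aligned positions, where each position yields a pair of motif indicators (X, Y) for the two sequences with joint probabilities p10 = P(X=1, Y=0) and p01 = P(X=0, Y=1). The difference in motif counts is then a sum of n steps in {-1, 0, +1}, and its probability mass function follows by repeated convolution. The function name and the numbers are hypothetical, and this is not the motifDiverge API.

```python
def difference_pmf(n, p10, p01):
    """PMF of D = sum_i (X_i - Y_i) over n i.i.d. pairs of Bernoulli trials.

    p10 = P(X=1, Y=0) and p01 = P(X=0, Y=1); the concordant outcomes
    (0, 0) and (1, 1) contribute 0 to the difference.
    Returns a dict mapping each difference d to P(D = d).
    """
    if p10 < 0 or p01 < 0 or p10 + p01 > 1:
        raise ValueError("p10 and p01 must be non-negative and sum to at most 1")
    step = {-1: p01, 0: 1.0 - p10 - p01, 1: p10}
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for d, p in pmf.items():
            for s, q in step.items():
                nxt[d + s] = nxt.get(d + s, 0.0) + p * q
        pmf = nxt
    return pmf

# Hypothetical example: marginal motif probabilities 0.30 and 0.20 with
# P(X=1, Y=1) = 0.15, so the two trials are positively correlated.
p11 = 0.15
pmf = difference_pmf(n=20, p10=0.30 - p11, p01=0.20 - p11)

# Upper-tail probability of a count difference of at least 6.
tail = sum(p for d, p in pmf.items() if d >= 6)
```

The correlation enters only through p11: raising it with the marginals held fixed shrinks both p10 and p01, which concentrates the distribution of the difference around its mean.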

Most viewed on Haldane’s Sieve: January 2014

The most viewed preprints on Haldane’s Sieve this month were:

Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data

Richard W Lusk
(Submitted on 30 Jan 2014)

Background: Trace quantities of contaminating DNA are widespread in the laboratory environment, but their presence has received little attention in the context of high throughput sequencing. This issue is highlighted by recent works that have rested controversial claims upon sequencing data that appear to support the presence of unexpected exogenous species.
Results: I used reads that preferentially aligned to alternate genomes to infer the distribution of potential contaminant species in a set of independent sequencing experiments. I confirmed that dilute samples are more exposed to contaminating DNA, and, focusing on four single-cell sequencing experiments, found that these contaminants appear to originate from a wide diversity of clades. Although negative control libraries prepared from “blank” samples recovered the highest-frequency contaminants, low-frequency contaminants, which appeared to make heterogeneous contributions to samples prepared in parallel within a single experiment, were not well controlled for. I used these results to show that, despite heavy replication and plausible controls, contamination can explain all of the observations used to support a recent claim that complete genes pass from food to human blood.
Conclusions: Contamination must be considered a potential source of signals of exogenous species in sequencing data, even if these signals are replicated in independent experiments, vary across conditions, or indicate a species which seems a priori unlikely to contaminate. Negative control libraries processed in parallel are essential to control for contaminant DNAs, but their limited ability to recover low-frequency contaminants must be recognized.

Impact of RNA degradation on measurements of gene expression

Irene Gallego Romero, Athma A. Pai, Jenny Tung, Yoav Gilad

The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. It is unclear if transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of degradation can be normalized, or whether different transcripts are degraded at different rates, potentially biasing measurements of expression levels. This concern has rendered the use of low quality RNA samples in whole-genome expression profiling problematic. Yet, low quality samples are at times the sole means of addressing specific questions – e.g., samples collected in the course of fieldwork. We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of RNA Integrity Number (RIN) values (a quality metric commonly used to assess RNA quality). We observed widespread effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library complexity in more degraded samples. While standard normalizations failed to account for the effects of degradation, we found that a simple linear model that controls for the effects of RIN can correct for the majority of these effects. We conclude that in instances where RIN and the effect of interest are not associated, this approach can help recover biologically meaningful signals in data from degraded RNA samples.
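
One simple way to picture the correction described above is to regress each gene's expression on RIN and keep what the regression does not explain. The sketch below does this for a single gene with ordinary least squares; the helper name and the toy numbers are invented for illustration and are not the authors' pipeline or data.

```python
def regress_out(values, covariate):
    """Remove the linear effect of a covariate from a list of values.

    Fits values = a + b * covariate by least squares and returns the
    residuals shifted back to the original mean of the values.
    """
    n = len(values)
    mean_x = sum(covariate) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate has no variation")
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(covariate, values)) / sxx
    return [y - slope * (x - mean_x) for x, y in zip(covariate, values)]

# Toy data (made up): RIN per sample and log expression of one gene
# that tracks RIN exactly.
rin = [9.5, 8.0, 6.2, 4.1, 2.3]
expr = [2.0 + 0.5 * r for r in rin]
corrected = regress_out(expr, rin)
```

After the correction the values no longer covary with RIN. As the abstract notes, this only helps when RIN is not associated with the effect of interest; otherwise the regression removes real signal along with the degradation effect.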

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Gad Abraham, Michael Inouye

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy compared with existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
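
For readers unfamiliar with randomized PCA, the following is a generic sketch of one well-known approach: project the centered matrix onto a small random subspace, refine it with a few power iterations, and take an exact SVD of the much smaller projected matrix. It is not flashpca's implementation, and the function name, defaults and toy genotype matrix are assumptions made for the example.

```python
import numpy as np

def randomized_pca(X, k, oversample=10, n_iter=4, seed=0):
    """Approximate the top-k principal components of X (samples x features).

    Requires k + oversample <= min(X.shape).
    Returns (scores, singular_values, loadings).
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                       # center each feature
    omega = rng.normal(size=(Xc.shape[1], k + oversample))
    Q, _ = np.linalg.qr(Xc @ omega)               # basis for the sampled range
    for _ in range(n_iter):                       # power iterations sharpen the basis
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)
    B = Q.T @ Xc                                  # small (k + oversample) x features matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    scores = (Q @ U_small[:, :k]) * s[:k]
    return scores, s[:k], Vt[:k]

# Toy genotype matrix (made-up data): 60 individuals x 200 SNPs coded 0/1/2,
# drawn from two groups with different allele frequencies.
rng = np.random.default_rng(1)
freqs = np.where(np.arange(60)[:, None] < 30, 0.2, 0.6)
genotypes = rng.binomial(2, freqs, size=(60, 200)).astype(float)
scores, singular_values, loadings = randomized_pca(genotypes, k=2)
```

The saving comes from running the expensive decomposition on a matrix with only `k + oversample` rows, while the full data matrix is touched only through matrix products.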

Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans

Rebekah L. Rogers, Julie M. Cridland, Ling Shao, Tina T. Hu, Peter Andolfatto, Kevin R. Thornton
(Submitted on 28 Jan 2014)

We have used whole genome paired-end Illumina sequence data to identify tandem duplications in 20 isofemale lines of D. yakuba, and 20 isofemale lines of D. simulans and performed genome wide validation with PacBio long molecule sequencing. We identify 1,415 tandem duplications that are segregating in D. yakuba as well as 975 duplications in D. simulans, indicating greater variation in D. yakuba. Additionally, we observe high rates of secondary deletions at duplicated sites, with 8% of duplicated sites in D. simulans and 17% of sites in D. yakuba modified with deletions. These secondary deletions are consistent with the action of the large loop mismatch repair system acting to remove polymorphic tandem duplication, resulting in rapid dynamics of gain and loss in duplicated alleles and a richer substrate of genetic novelty than has been previously reported. Most duplications are present in only single strains, suggesting deleterious impacts are common. However, we do observe signals consistent with adaptive evolution. D. simulans shows an excess of whole gene duplications and an excess of high frequency variants on the X chromosome, consistent with adaptive evolution through duplications on the D. simulans X. We identify 79 chimeric genes in D. yakuba and 38 chimeric genes in D. simulans, as well as 143 cases of recruited non-coding sequence in D. yakuba and 96 in D. simulans, in agreement with rates of chimeric gene origination in D. melanogaster. Together, these results suggest that tandem duplications often result in complex variation beyond whole gene duplications that offers a rich substrate of standing variation that is likely to contribute both to detrimental phenotypes and disease, as well as to adaptive evolutionary change.

Footprints of ancient balanced polymorphisms in genetic variation data

Ziyue Gao, Molly Przeworski, Guy Sella
(Submitted on 29 Jan 2014)

When long-lived, balancing selection can lead to trans-species polymorphisms that are shared by two or more species identical by descent. In this case, the gene genealogies at the selected sites cluster by allele instead of by species and, because of linkage, nearby neutral sites also have unusual genealogies. Although it is clear that this scenario should lead to discernible footprints in genetic variation data, notably the presence of additional neutral polymorphisms shared between species and the absence of fixed differences, the effects remain poorly characterized. We focus on the case of a single site under long-lived balancing selection and derive approximations for summaries of the data that are sensitive to a trans-species polymorphism: the length of the segment that carries most of the signals, the expected number of shared neutral SNPs within the segment and the patterns of allelic associations among them. Coalescent simulations of ancient balancing selection confirm the accuracy of our approximations. We further show that for humans and chimpanzees, and more generally for pairs of species with low genetic diversity levels, the patterns of genetic variation on which we focus are highly unlikely to be generated by neutral recurrent mutations, so these statistics are specific as well as sensitive. We discuss the implications of our results for the design and interpretation of genome scans for ancient balancing selection in apes and other taxa.