Multiple Quantitative Trait Analysis Using Bayesian Networks

Marco Scutari, Phil Howell, David J. Balding, Ian Mackay
(Submitted on 12 Feb 2014)

Models for genome-wide prediction and association studies usually target a single phenotypic trait. However, in animal and plant genetics it is common to record information on multiple phenotypes for each individual that will be genotyped. Modeling traits individually disregards the fact that they are most likely associated due to pleiotropy and shared biological basis, thus providing only a partial, confounded view of genetic effects and phenotypic interactions. In this paper we use data from a Multiparent Advanced Generation Inter-Cross (MAGIC) winter wheat population to explore Bayesian networks as a convenient and interpretable framework for the simultaneous modeling of multiple quantitative traits. We show that they are equivalent to multivariate genetic best linear unbiased prediction (GBLUP), and that they outperform single-trait elastic net and single-trait GBLUP in predictive performance. Finally, we discuss their relationship with other additive-effects models and their advantages in inference and interpretation. MAGIC populations provide an ideal setting for this kind of investigation because the very low population structure and large sample size result in predictive models with good power and limited confounding due to relatedness.

The arrival of the frequent: how bias in genotype-phenotype maps can steer populations to local optima

Ard A Louis, Steffen Schaper
(Submitted on 6 Feb 2014)

Genotype-phenotype (GP) maps specify how the random mutations that change genotypes generate variation by altering phenotypes, which, in turn, can trigger selection. Many GP maps share the following general properties: 1) The number of genotypes NG is much larger than the number of selectable phenotypes; 2) Neutral exploration changes the variation that is accessible to the population; 3) The distribution of phenotype frequencies Fp=Np/NG, with Np the number of genotypes mapping onto phenotype p, is highly biased: the majority of genotypes map to only a small minority of the phenotypes. Here we explore how these properties affect the evolutionary dynamics of haploid Wright-Fisher models that are coupled to a simplified and general random GP map or to a more complex RNA sequence to secondary structure map. For both maps the probability of a mutation leading to a phenotype p scales to first order as Fp, although for the RNA map there are further correlations as well. By using mean-field theory, supported by computer simulations, we show that the discovery time Tp of a phenotype p similarly scales to first order as 1/Fp for a wide range of population sizes and mutation rates in both the monomorphic and polymorphic regimes. The rates at which variation arises can thus differ by many orders of magnitude. Phenotypic variation with a larger Fp is therefore much more likely to arise than variation with a small Fp. We show, using the RNA model, that frequent phenotypes (with larger Fp) can fix in a population even when alternative, but less frequent, phenotypes with much higher fitness are potentially accessible. In other words, if the fittest never ‘arrive’ on the timescales of evolutionary change, then they can’t fix. We call this highly non-ergodic effect the ‘arrival of the frequent’.
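The first-order scaling of discovery time with 1/Fp can be illustrated with a toy calculation. The sketch below is not the authors' model: it assumes, purely for illustration, that each mutation independently produces phenotype p with probability Fp, and it ignores population dynamics and the extra correlations mentioned for the RNA map. All function names are hypothetical.

```python
import random


def expected_discovery_time(f_p: float) -> float:
    """Mean number of mutations until phenotype p first appears.

    Assumption: each mutation hits phenotype p independently with
    probability f_p, so the waiting time is geometric with mean 1 / f_p.
    """
    if not 0.0 < f_p <= 1.0:
        raise ValueError("f_p must lie in (0, 1]")
    return 1.0 / f_p


def simulate_discovery_time(f_p: float, rng: random.Random) -> int:
    """Count mutations until one of them produces phenotype p."""
    n_mutations = 1
    while rng.random() >= f_p:
        n_mutations += 1
    return n_mutations


def mean_simulated_time(f_p: float, n_runs: int = 20000, seed: int = 0) -> float:
    """Average the simulated waiting time over many independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_discovery_time(f_p, rng) for _ in range(n_runs))
    return total / n_runs
```

Under this assumption, a phenotype that is 100 times rarer takes on average 100 times as many mutations to appear, which is the sense in which rare but fitter phenotypes may never "arrive" on the relevant timescale.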

The disruption of trace element homeostasis due to aneuploidy as a unifying theme in the etiology of cancer

Johannes Engelken, Matthias Altmeyer, Renty Franklin

Abstract for Scientists: While decades of cancer research have firmly established multiple “hallmarks of cancer”, cancer’s genomic landscape remains to be fully understood. Particularly, the phenomenon of aneuploidy – gains and losses of large genomic regions, i.e. whole chromosomes or chromosome arms – and why most cancer cells are aneuploid remains enigmatic. This is despite the achievements of cytogenomics and whole genome sequencing which have successfully pinpointed focal amplifications and focal deletions as well as point mutations affecting numerous genes involved in carcinogenesis. A characteristic of many different cancers is the deregulation of the homeostasis of trace elements, such as copper (Cu), zinc (Zn) and iron (Fe). Concentrations of copper are markedly increased in cancer tissue and the blood plasma of cancer patients, while zinc levels are typically decreased. Here we discuss the hypothesis that the disruption of trace element homeostasis and the phenomenon of aneuploidy might be linked. Our tentative analysis of genomic data from diverse tumor types, mainly from The Cancer Genome Atlas (TCGA) project, suggests that gains and losses of metal transporter genes occur frequently and correlate well with transporter gene expression levels. Thereby they may confer a cancer-driving selective growth advantage at early and possibly also later stages during cancer development. This idea is consistent with recent observations in yeast, which suggest that through chromosomal gains and losses cells can adapt quickly to new carbon sources, nutrient starvation as well as to copper toxicity. In human cancer development, candidate driving events may include, among others, the gains of zinc transporter genes SLC39A1 and SLC39A4 on chromosome arms 1q and 8q, respectively, and the losses of zinc transporter genes SLC30A5, SLC39A14 and SLC39A6 on 5q, 8p and 18q. The recurrent gain of 3q might be associated with the iron transporter gene TFRC and the loss of 13q with the copper transporter gene ATP7B. By altering cellular trace element homeostasis (especially fluctuations in labile and total zinc) such events might contribute to the initiation of the malignant transformation. Consistently, it has been shown that zinc affects a number of the observed hallmark characteristics including DNA repair, inflammation and apoptosis. We term this model the “aneuploidy metal transporter cancer” (AMTC) hypothesis. While the AMTC hypothesis does not contradict the cancer-promoting role of point and focal mutations in established tumor suppressor genes and oncogenes (e.g. MYC, MYCN, TP53, PIK3CA, BRCA1, ERBB2), it seems possible that some of these mutations may be a response to the prior disruption of trace element homeostasis. We suggest a number of approaches for how this hypothesis could be tested experimentally and briefly touch on possible implications for cancer etiology, metastasis, drug resistance and therapy.

Nonparametric inference of the distribution of fitness effects across functional categories in humans

Fernando Racimo, Joshua G Schraiber

Quantifying the proportion of polymorphic mutations that are deleterious or neutral is of fundamental importance to our understanding of evolution, disease genetics and the maintenance of variation genome-wide. Here, we develop an approximation to the distribution of fitness effects (DFE) of segregating single-nucleotide mutations in humans. Unlike previous methods, we do not assume that synonymous mutations are neutral, or rely on fitting the DFE of new nonsynonymous mutations to a particular parametric probability distribution, which is poorly motivated on a biological level. We rely on a previously developed method that utilizes a variety of published annotations (including conservation scores, protein deleteriousness estimates and regulatory data) to score all mutations in the human genome based on how likely they are to be affected by negative selection, controlling for mutation rate. We map this score to a scale of fitness coefficients via maximum likelihood using diffusion theory and a Poisson random field model. We then use our coefficient mapping to quantify the distribution of all scored single-nucleotide polymorphisms in Yoruba and Europeans. Our method serves to approximate the DFE of any type of segregating mutation, regardless of its genomic consequence, and so allows us to compare the proportion of mutations that are negatively selected or neutral across various genomic categories, including different types of regulatory sites. We observe that the distribution of intergenic polymorphisms is highly leptokurtic, with a strong peak at neutrality, while the distribution of nonsynonymous polymorphisms is bimodal, with a neutral peak and a second peak at s ≈ −10^(−4). Other types of polymorphisms have shapes that fall roughly in between these two.

Identifying Keystone Species in the Human Gut Microbiome from Metagenomic Timeseries using Sparse Linear Regression

Charles K. Fisher, Pankaj Mehta
(Submitted on 3 Feb 2014)

Human-associated microbial communities exert tremendous influence over human health and disease. With modern metagenomic sequencing methods it is possible to follow the relative abundance of microbes in a community over time. These microbial communities exhibit rich ecological dynamics and an important goal of microbial ecology is to infer the interactions between species from sequence data. Any algorithm for inferring species interactions must overcome three obstacles: 1) a correlation between the abundances of two species does not imply that those species are interacting, 2) the sum constraint on the relative abundances obtained from metagenomic studies makes it difficult to infer the parameters in timeseries models, and 3) errors due to experimental uncertainty, or mis-assignment of sequencing reads into operational taxonomic units, bias inferences of species interactions. Here we introduce an approach, Learning Interactions from MIcrobial Time Series (LIMITS), that overcomes these obstacles. LIMITS uses sparse linear regression with bootstrap aggregation to infer a discrete-time Lotka-Volterra model for microbial dynamics. We tested LIMITS on synthetic data and showed that it could reliably infer the topology of the inter-species ecological interactions. We then used LIMITS to characterize the species interactions in the gut microbiomes of two individuals and found that the interaction networks varied significantly between individuals. Furthermore, we found that the interaction networks of the two individuals are dominated by distinct “keystone species”, Bacteroides fragilis and Bacteroides stercoris, that have a disproportionate influence on the structure of the gut microbiome even though they are only found in moderate abundance. Based on our results, we hypothesize that the abundances of certain keystone species may be responsible for individuality in the human gut microbiome.
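To make the regression idea concrete, here is a minimal sketch of fitting a discrete-time Lotka-Volterra model by bootstrap-aggregated linear regression. It is not the LIMITS algorithm itself: the abstract does not give the model equations, so the Ricker-type update below is an assumption, the data are absolute rather than relative abundances (the sum constraint is ignored), and a simple hard threshold on the median estimate stands in for the sparse variable selection. All names and parameter values are hypothetical.

```python
import numpy as np


def simulate_ricker(C, r, n_steps, noise_sd, rng):
    """Simulate x_i(t+1) = x_i(t) * exp(r_i + sum_j C_ij x_j(t) + noise)."""
    n_species = len(r)
    x = np.ones((n_steps, n_species))  # start at the equilibrium x = 1
    for t in range(n_steps - 1):
        growth = r + C @ x[t] + rng.normal(0.0, noise_sd, size=n_species)
        x[t + 1] = x[t] * np.exp(growth)
    return x


def bagged_interactions(x, n_boot=200, threshold=0.2, seed=0):
    """Estimate the interaction matrix by bootstrap-aggregated least squares.

    The log growth ratio ln x_i(t+1) - ln x_i(t) is linear in x(t), so each
    bootstrap replicate is an ordinary least-squares fit with an intercept.
    """
    rng = np.random.default_rng(seed)
    y = np.log(x[1:]) - np.log(x[:-1])
    design = np.column_stack([np.ones(len(y)), x[:-1]])
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))  # resample time points
        coef, *_ = np.linalg.lstsq(design[idx], y[idx], rcond=None)
        estimates.append(coef[1:].T)  # drop intercept; row i = effects on species i
    c_hat = np.median(estimates, axis=0)  # aggregate replicates by the median
    c_hat[np.abs(c_hat) < threshold] = 0.0  # crude stand-in for sparsity
    return c_hat


# Synthetic example with a known, sparse interaction matrix.
C_true = np.array([[-1.0, 0.5, 0.0],
                   [0.0, -1.0, 0.0],
                   [-0.5, 0.0, -1.0]])
r_true = -C_true.sum(axis=1)  # chosen so that x = 1 is the equilibrium
x_sim = simulate_ricker(C_true, r_true, 400, 0.1, np.random.default_rng(1))
c_hat = bagged_interactions(x_sim)
```

Aggregating by the median rather than the mean is one way to keep a few unlucky resamples from introducing spurious interactions; the threshold value here is arbitrary and would need tuning on real data.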

Genetic variants associated with motion sickness point to roles for inner ear development, neurological processes, and glucose homeostasis

Bethann S Hromatka, Joyce Y Tung, Amy K Kiefer, Chuong B Do, David A Hinds, Nicholas Eriksson

Roughly one in three individuals is highly susceptible to motion sickness and yet the underlying causes of this condition are not well understood. Despite high heritability, no associated genetic factors have been discovered to date. Here, we conducted the first genome-wide association study on motion sickness in 80,494 individuals from the 23andMe database who were surveyed about car sickness. Thirty-five single-nucleotide polymorphisms (SNPs) were associated with motion sickness at a genome-wide-significant level (p < 5e-8). Many of these SNPs are near genes involved in balance, and eye, ear, and cranial development (e.g., PVRL3, TSHZ1, MUTED, HOXB3, HOXD3). Other SNPs may affect motion sickness through nearby genes with roles in the nervous system, glucose homeostasis, or hypoxia. We show that several of these SNPs display sex-specific effects, with as much as three times stronger effects in women. We searched for comorbid phenotypes with motion sickness, confirming associations with known comorbidities including migraines, postoperative nausea and vomiting (PONV), vertigo, and morning sickness, and observing new associations with altitude sickness and many gastrointestinal conditions. We also show that two of these related phenotypes (PONV and migraines) share underlying genetic factors with motion sickness. These results point to the importance of the nervous system in motion sickness and suggest a role for glucose levels in motion-induced nausea and vomiting, a finding that may provide insight into other nausea-related phenotypes such as PONV. They also highlight personal characteristics (e.g., being a poor sleeper) that correlate with motion sickness, findings that could help identify risk factors or treatments.

Most viewed on Haldane’s Sieve: January 2014

The most viewed preprints on Haldane’s Sieve this month were:

Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data

Richard W Lusk
(Submitted on 30 Jan 2014)

Background: Trace quantities of contaminating DNA are widespread in the laboratory environment, but their presence has received little attention in the context of high throughput sequencing. This issue is highlighted by recent works that have rested controversial claims upon sequencing data that appear to support the presence of unexpected exogenous species.
Results: I used reads that preferentially aligned to alternate genomes to infer the distribution of potential contaminant species in a set of independent sequencing experiments. I confirmed that dilute samples are more exposed to contaminating DNA, and, focusing on four single-cell sequencing experiments, found that these contaminants appear to originate from a wide diversity of clades. Although negative control libraries prepared from “blank” samples recovered the highest-frequency contaminants, low-frequency contaminants, which appeared to make heterogeneous contributions to samples prepared in parallel within a single experiment, were not well controlled for. I used these results to show that, despite heavy replication and plausible controls, contamination can explain all of the observations used to support a recent claim that complete genes pass from food to human blood.
Conclusions: Contamination must be considered a potential source of signals of exogenous species in sequencing data, even if these signals are replicated in independent experiments, vary across conditions, or indicate a species which seems a priori unlikely to contaminate. Negative control libraries processed in parallel are essential to control for contaminant DNAs, but their limited ability to recover low-frequency contaminants must be recognized.

Impact of RNA degradation on measurements of gene expression

Irene Gallego Romero, Athma A. Pai, Jenny Tung, Yoav Gilad

The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. It is unclear if transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of degradation can be normalized, or whether different transcripts are degraded at different rates, potentially biasing measurements of expression levels. This concern has rendered the use of low quality RNA samples in whole-genome expression profiling problematic. Yet, low quality samples are at times the sole means of addressing specific questions – e.g., samples collected in the course of fieldwork. We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of RNA Integrity Number (RIN) values (a quality metric commonly used to assess RNA quality). We observed widespread effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library complexity in more degraded samples. While standard normalizations failed to account for the effects of degradation, we found that a simple linear model that controls for the effects of RIN can correct for the majority of these effects. We conclude that in instances where RIN and the effect of interest are not associated, this approach can help recover biologically meaningful signals in data from degraded RNA samples.
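The correction described above can be sketched as a per-gene linear regression on RIN. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes log-scale expression values in a genes-by-samples array, one RIN value per sample, and a purely linear RIN effect. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def regress_out_rin(expr, rin):
    """Remove the linear effect of RIN from each gene's expression.

    expr: array of shape (n_genes, n_samples), assumed to be on a log scale.
    rin:  array of shape (n_samples,), one RNA Integrity Number per sample.
    Returns an array of the same shape as expr with the fitted RIN slope
    subtracted, leaving each gene's mean expression unchanged.
    """
    expr = np.asarray(expr, dtype=float)
    rin = np.asarray(rin, dtype=float)
    # Intercept plus mean-centered RIN, so subtracting the slope term
    # does not shift a gene's average level.
    design = np.column_stack([np.ones_like(rin), rin - rin.mean()])
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)  # shape (2, n_genes)
    rin_effect = np.outer(design[:, 1], coef[1])  # shape (n_samples, n_genes)
    return expr - rin_effect.T


# Synthetic example: three genes with different sensitivity to degradation.
rng = np.random.default_rng(0)
rin = rng.uniform(2.0, 10.0, size=12)
true_slopes = np.array([0.3, -0.2, 0.0])
expr = 5.0 + true_slopes[:, None] * rin + rng.normal(0.0, 0.1, size=(3, 12))
corrected = regress_out_rin(expr, rin)
```

As the abstract notes, this kind of adjustment is only appropriate when RIN is not associated with the effect of interest; otherwise regressing out RIN would also remove part of the biological signal.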

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Gad Abraham, Michael Inouye

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers accuracy identical to that of existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
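The abstract does not specify which randomized algorithm flashpca uses, so the following is only a generic sketch of randomized PCA (a random projection followed by a few power iterations and a small exact SVD), not flashpca's implementation. The function name, parameter defaults and synthetic data are all assumptions made for illustration.

```python
import numpy as np


def randomized_pca(X, n_components, n_oversample=10, n_power_iter=2, seed=0):
    """Approximate the top principal components of X (samples x features)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)  # center each feature (e.g. each SNP column)
    n_samples, n_features = Xc.shape
    k = min(n_components + n_oversample, n_samples, n_features)
    # Project onto k random directions to capture the dominant column space.
    omega = rng.normal(size=(n_features, k))
    Y = Xc @ omega
    # Power iterations sharpen the separation between leading and trailing
    # singular values; re-orthonormalize each time for numerical stability.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(Y)
        Y = Xc @ (Xc.T @ Q)
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small (k x n_features) projected matrix.
    B = Q.T @ Xc
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    scores = U[:, :n_components] * s[:n_components]
    return scores, s[:n_components], Vt[:n_components]


# Synthetic low-rank data plus a little noise (not real genotype data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
loadings = rng.normal(size=(5, 500))
X = latent @ loadings + 0.01 * rng.normal(size=(200, 500))
scores, singular_values, components = randomized_pca(X, n_components=5)
```

The cost is dominated by a handful of matrix products with the full data matrix, rather than a full eigen-decomposition, which is why approaches of this kind scale to much larger sample sizes. How close the approximation is depends on how quickly the singular values decay.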