XWAS: a toolset for genetic data analysis and association studies of the X chromosome

Diana Chang, Feng Gao, Alon Keinan
doi: http://dx.doi.org/10.1101/009795

Summary: We present XWAS (chromosome X-Wide Analysis tool-Set), a toolset specially designed for analysis of the X chromosome in association studies, at the level of both single markers and entire genes. It further offers other X-specific analysis tools, including quality control (QC) procedures for X-linked data. We have applied and tested this software by carrying out several X-wide association studies of autoimmune diseases. Availability and Implementation: The XWAS software package, which includes scripts, the binary executable PLINK/XWAS, and all source code, is freely available for download from http://keinanlab.cb.bscb.cornell.edu/content/tools-data. PLINK/XWAS is implemented in C++, and other features are implemented in shell scripts and Perl. This software package is designed for Linux systems.


Estimating gene expression and codon specific translational efficiencies, mutation biases, and selection coefficients from genomic data

Michael Gilchrist, Wei-Chen Chen, Premal Shah, Russell Zaretzki
doi: http://dx.doi.org/10.1101/009670

The time and cost of generating a genomic dataset is expected to continue to decline dramatically in the upcoming years. As a result, extracting biologically meaningful information from this continuing flood of data is a major challenge in biology. In response, we present a powerful Bayesian MCMC method based on a nested model of protein synthesis and population genetics. Analyzing the patterns of codon usage observed within a genome, our algorithm extracts and decouples information on codon specific translational efficiencies and mutation biases as well as gene specific expression levels for all coding sequences. This information can be combined to generate gene and codon specific estimates of selection on synonymous substitutions. One major advance over previous work is that our method can be used without independent measurements of gene expression. Using the Saccharomyces cerevisiae S288c genome, we compare our model fits with and without independent gene expression measurements and observe an exceptionally high correlation between our codon specific parameters and gene specific expression levels (ρ > 0.99 in all cases). We also observe robust correlations between our predictions generated without independent expression measurements and previously published estimates of mutation bias, ribosome pausing time, and empirical estimates of mRNA abundance (ρ=0.53-0.72). Our results indicate that failing to take mutation bias into account can lead to the misidentification of an amino acid’s ‘optimal’ codon. In conclusion, our method demonstrates that an enormous amount of biologically important information is encoded within genome scale patterns of codon usage and this information can be accessed through carefully formulated, biologically based models.

WFABC: a Wright-Fisher ABC-based approach for inferring effective population sizes and selection coefficients from time-sampled data

Matthieu Foll, Hyunjin Shim, Jeffrey D. Jensen
doi: http://dx.doi.org/10.1101/009696

With novel developments in sequencing technologies, time-sampled data are becoming more available and accessible. Naturally, there have been efforts in parallel to infer population genetic parameters from these datasets. Here, we compare and analyze four recent approaches based on the Wright-Fisher model for inferring selection coefficients (s) given effective population size (Ne), with simulated temporal datasets. Furthermore, we demonstrate the advantage of a recently proposed ABC-based method that is able to correctly infer genome-wide average Ne from time-serial data, which is then set as a prior for inferring per-site selection coefficients accurately and precisely. We implement this ABC method in new software and apply it to a classical time-serial dataset of the medionigra genotype in the moth Panaxia dominula. We show that a recessive lethal model is the best explanation for the observed variation in allele frequency by implementing an estimator of the dominance ratio (h).
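
The general idea of ABC inference under a Wright-Fisher model can be illustrated with a minimal rejection-sampling sketch. This is not the WFABC implementation: it assumes a haploid population, treats Ne as known rather than estimating it, uses a uniform prior on s, and compares raw sampled allele frequencies by Euclidean distance instead of the summary statistics used by the authors. All function names and parameter values are hypothetical.

```python
import random


def binomial(rng, n, p):
    """Draw a Binomial(n, p) count using only the standard library."""
    return sum(rng.random() < p for _ in range(n))


def simulate_samples(rng, ne, s, x0, sample_times, sample_size):
    """Simulate a haploid Wright-Fisher trajectory with selection.

    Returns the allele frequencies observed in samples of `sample_size`
    individuals taken at the generations listed in `sample_times`.
    """
    x = x0
    sampled = []
    for t in range(max(sample_times) + 1):
        if t in sample_times:
            sampled.append(binomial(rng, sample_size, x) / sample_size)
        x_sel = x * (1 + s) / (1 + s * x)  # deterministic selection step
        x = binomial(rng, ne, x_sel) / ne  # binomial drift step
    return sampled


def distance(a, b):
    """Euclidean distance between two vectors of sampled frequencies."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


def abc_rejection(rng, observed, ne, x0, sample_times, sample_size,
                  n_sims=1000, keep_fraction=0.05, prior=(-0.5, 0.5)):
    """Keep the prior draws of s whose simulations lie closest to the data."""
    scored = []
    for _ in range(n_sims):
        s = rng.uniform(*prior)  # assumed uniform prior on s
        sim = simulate_samples(rng, ne, s, x0, sample_times, sample_size)
        scored.append((distance(sim, observed), s))
    scored.sort()
    n_keep = max(1, int(keep_fraction * n_sims))
    return [s for _, s in scored[:n_keep]]


if __name__ == "__main__":
    rng = random.Random(42)
    times = (0, 5, 10, 15, 20)
    # Pseudo-observed data simulated with a known s, for illustration only.
    observed = simulate_samples(rng, 100, 0.3, 0.2, times, 50)
    posterior = abc_rejection(rng, observed, 100, 0.2, times, 50)
    print("approximate posterior mean of s:", sum(posterior) / len(posterior))
```

The retained values of s form an approximate posterior sample; a smaller keep_fraction tightens the approximation at the cost of needing more simulations.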

Thinking too positive? Revisiting current methods of population-genetic selection inference

Claudia Bank, Gregory B Ewing, Anna Ferrer-Admettla, Matthieu Foll, Jeffrey D Jensen
doi: http://dx.doi.org/10.1101/009654

In the age of next-generation sequencing, the availability of increasing amounts and quality of data at decreasing cost ought to allow for a better understanding of how natural selection is shaping the genome than ever before. Yet, alternative forces such as demography and background selection obscure the footprints of positive selection that we would like to identify. Here, we illustrate recent developments in this area, and outline a roadmap for improved selection inference. We argue (1) that the development and obligatory use of advanced simulation tools is necessary for improved identification of selected loci, (2) that genomic information from multiple time points will enhance the power of inference, and (3) that results from experimental evolution should be utilized to better inform population-genomic studies.

On the prospect of identifying adaptive loci in recently bottlenecked populations

Yu-Ping Poh, Vera S Domingues, Hopi Hoekstra, Jeffrey Jensen
doi: http://dx.doi.org/10.1101/009456

Identifying adaptively important loci in recently bottlenecked populations—be it natural selection acting on a population following the colonization of novel habitats in the wild, or artificial selection during the domestication of a breed—remains a major challenge. Here we report the results of a simulation study examining the performance of available population-genetic tools for identifying genomic regions under selection. To illustrate our findings, we examined the interplay between selection and demography in two species of Peromyscus mice, for which we have independent evidence of selection acting on phenotype as well as functional evidence identifying the underlying genotype. With this unusual information, we tested whether population-genetic-based approaches could have been utilized to identify the adaptive locus. Contrary to published claims, we conclude that the use of the background site frequency spectrum as a null model is largely ineffective in bottlenecked populations. Results are quantified both for site frequency spectrum and linkage disequilibrium-based predictions, and are found to hold true across a large parameter space that encompasses many species and populations currently under study. These results suggest that the genomic footprint left by selection on both new and standing variation in strongly bottlenecked populations will be difficult, if not impossible, to find using current approaches.

On the unfounded enthusiasm for soft selective sweeps

Jeffrey D. Jensen
doi: http://dx.doi.org/10.1101/009563

Underlying any understanding of the mode, tempo, and relative importance of the adaptive process in the evolution of natural populations is the notion of whether adaptation is mutation-limited. Two very different population genetic models have recently been proposed in which the rate of adaptation is not strongly limited by the rate at which newly arising beneficial mutations enter the population. This review discusses the theoretical underpinnings and requirements of these models, as well as the experimental insights on the parameters of relevance. Importantly, empirical and experimental evidence to date challenges the recent enthusiasm for invoking these models to explain observed patterns of variation in humans and Drosophila.

Author post: Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study

This guest post is by Xiaoquan (William) Wen, Francesca Luca, and Roger Pique-Regi on their preprint Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study, bioRxived here.

Our paper presents an integrative analysis framework to perform fine mapping and functional analysis of cis-eQTLs. In particular, we consider a setting where eQTL data are collected from multiple population groups. Although the details of our methods and analysis results are described in the manuscript, here we’d like to take the opportunity to discuss some of the main features and interesting findings of this work.

From the methodological perspective, the Bayesian inference framework that we present in this paper enables efficient multiple SNP analysis in the presence of multiple heterogeneous (population) groups. This framework is a natural extension of our previous work on heterogeneous genetic association data in single SNP analysis (Flutre et al 2013, Wen and Stephens 2014). The output from our multiple cis-eQTL analysis fully characterizes the uncertainty of eQTL calls, which becomes critical in the downstream functional analysis. This represents a significant advantage over the commonly applied conditional analysis approach, which is non-trivial to generalize when there are multiple heterogeneous subgroups. Taking advantage of these features, we further extend our analysis framework to incorporate functional genomic annotations and assess their levels of enrichment in association signals. Although in this paper we focus solely on eQTLs, it should be noted that our statistical methods are completely general, and applicable in other contexts of genetic association analysis.

Applying this analysis framework, we re-analyzed the eQTL data from the GEUVADIS project, which consists of samples from five population groups. Importantly, a key motivation is to identify eQTL signals that are consistently present in all population groups. This analysis yields some interesting findings, which we highlight below:

  1. Cross-population meta-analysis greatly improves the power of eQTL discovery. The power gain by integrating data across population groups is well expected: with a combined sample size ~400 in five population groups, we are able to identify 6,555 genes that harbor at least one cis-eQTL (which we refer to as “eGenes”) from 11,838 tested protein coding and lincRNA genes at 5% FDR level; in comparison, the union set from the population-by-population analysis yields 3,447 eGenes.
  2. Cross-population samples provide unique resources to fine map cis-eQTLs. We perform multiple SNP analysis for each identified eGene, and find that for a non-trivial proportion of genes (7% of all genes analyzed, or 14% of identified eGenes), two or more independent cis-eQTL signals can be confidently identified in the GEUVADIS data. In most of those cases, we are relatively certain about the existence of multiple eQTL signals, but cannot pinpoint the causal variants by fully resolving the LD. Nevertheless, we find that by utilizing cross-population samples, the population heterogeneity in local patterns of LD can be effectively leveraged to narrow down the genomic regions that harbor causal eQTLs, a phenomenon that we refer to as “LD filtering”. Using the GEUVADIS data, we are able to quantify the effect of LD filtering. More specifically, we select a set of genes that are identified as harboring exactly one cis-eQTL with high confidence, and construct credible regions for each eQTL signal based on both the population-by-population and the cross-population analyses. We find that for the majority of the genes tested (92% of 526 selected genes), the joint analysis yields a smaller credible region compared to the minimum credible region length from separate population analyses. The median reduction in region length from the separate analysis to the joint analysis is close to 50% in the set of genes examined.

    On the other hand, there are cases where population specific LD patterns cause some SNPs to display a large degree of heterogeneity across populations in their estimated effect sizes from single SNP analysis. In some extreme cases, a SNP may appear to possess strong “population specific” effects. While we acknowledge that genuine population specific eQTLs are certainly interesting phenomena and very likely exist, we suggest interpreting highly heterogeneous eQTL signals from single SNP analysis with caution. In the paper, we demonstrate one such example where a set of SNPs in LD, when analyzed alone, appear to show strong but opposite effects on expression levels in European and African populations. The multiple SNP analysis yields a seemingly much more plausible alternative explanation: it identifies two independent eQTL signals in the region, and the “opposite effect” eQTLs tag one signal in the African population and the other signal in the European populations. This example, we believe, fully demonstrates the necessity and benefit of multiple SNP analysis using cross-population samples.

  3. Genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs. This point is demonstrated by our functional analysis approach based on the fine mapping results of cis-eQTLs. In brief, we classify every cis-SNP into three mutually exclusive categories based on the computational predictions of the CENTIPEDE model: 1) SNPs strongly affecting TF binding, 2) SNPs residing in a DNase I footprint region but with little or no effect on TF binding, and 3) all other SNPs, or baseline SNPs. We find that SNPs in the first category are 1.49-fold more likely than baseline SNPs to be eQTLs, and this enrichment is statistically highly significant (p-value = 4.93 × 10^-22). The SNPs in category 2 are also enriched, but with a much smaller fold change (1.15) and weaker statistical significance (p-value = 0.0035). Interestingly, this finding seems to be in agreement with the results reported in our recent work (Moyerbrailean et al 2014), where other cellular and organismal phenotype QTLs are examined.

Overall, the ability of our method to disentangle multiple eQTL signals represents a significant step forward towards fully comprehending the complex mechanisms regulating gene expression. The natural interventions represented by genetic polymorphisms can be used in future studies to identify multiple functional regulatory elements for a gene. The computational methods used in this paper are implemented in the software packages FM-eQTL and eQTLBMA. Our analysis results are also available for browsing and downloading at this site.


1. T Flutre, X Wen, J Pritchard, M Stephens (2013). A statistical framework for joint eQTL analysis in multiple tissues. PLoS genetics 9 (5), e1003486

2. X Wen, M Stephens (2014). Bayesian methods for genetic association analysis with heterogeneous subgroups: From meta-analyses to gene–environment interactions. The Annals of Applied Statistics 8 (1), 176-203

3. T Lappalainen et al (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511

4. Moyerbrailean et al (2014) Are all genetic variants in DNase I sensitivity regions functional? bioRxiv, 007559

Molecular phenotypes that are causal to complex traits can have low heritability and are expected to have small influence.

Leopold Parts
doi: http://dx.doi.org/10.1101/009506

Work on genetic makeup of complex traits has led to some unexpected findings. Molecular trait heritability estimates have consistently been lower than those of common diseases, even though it is intuitively expected that the genotype signal weakens as it becomes more dissociated from DNA. Further, results from very large studies have not been sufficient to explain most of the heritable signal, and suggest hundreds if not thousands of responsible alleles. Here, I demonstrate how trait heritability depends crucially on the definition of the phenotype, and is influenced by the variability of the assay, measurement strategy, and the quantification approach used. For a phenotype downstream of many molecular traits, it is possible that its heritability is larger than for any of its upstream determinants. I also rearticulate via models and data that if a phenotype has many dependencies, a large number of small effect alleles are expected. However, even if these alleles do drive highly heritable causal intermediates that can be modulated, it does not imply that large changes in phenotype can be obtained.
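
One way a downstream phenotype can show higher heritability than any of its measured upstream determinants is sketched below. This is a toy variance-components model, not the model from the preprint. It assumes k independent upstream molecular traits, each with genetic variance var_g and environmental variance var_e, each measured with independent assay noise of variance var_assay, and a downstream phenotype equal to the plain sum of the true (noise-free) upstream traits plus a residual of variance var_resid. All names and numbers are hypothetical.

```python
def h2_measured_upstream(var_g, var_e, var_assay):
    """Heritability of one upstream trait as measured, with assay noise."""
    return var_g / (var_g + var_e + var_assay)


def h2_downstream(k, var_g, var_e, var_resid):
    """Heritability of a phenotype that sums k independent upstream traits.

    Assay noise does not appear here because, under the stated assumption,
    it affects only the measurement of the upstream traits and not the
    biology that produces the downstream phenotype.
    """
    return k * var_g / (k * (var_g + var_e) + var_resid)


if __name__ == "__main__":
    upstream = h2_measured_upstream(var_g=1.0, var_e=1.0, var_assay=2.0)
    downstream = h2_downstream(k=10, var_g=1.0, var_e=1.0, var_resid=1.0)
    print("measured upstream h2:", upstream)
    print("downstream h2:", downstream)
```

With these illustrative values, each measured upstream trait has a heritability of 0.25, while the downstream phenotype has a heritability of 10/21, or about 0.48, because the noisy assay dilutes the genetic signal only in the upstream measurements.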

An Approximate Bayesian Computation Approach to Examining the Phylogenetic Relationships among the Four Gibbon Genera using Whole Genome Sequence Data

Krishna Veeramah, August E Woerner, Laurel Johnstone, Ivo Gut, Marta Gut, Tomas Marques-Bonet, Lucia Carbone, Jeff D Wall, Michael F Hammer
doi: http://dx.doi.org/10.1101/009498

Gibbons are believed to have diverged from the larger great apes ~16.8 Mya and today reside in the rainforests of Southeast Asia. Based on their diploid chromosome number, the family Hylobatidae is divided into four genera, Nomascus, Symphalangus, Hoolock and Hylobates. Genetic studies attempting to elucidate the phylogenetic relationships among gibbons using karyotypes, mtDNA, the Y chromosome, and short autosomal sequences have been inconclusive. To examine the relationships among gibbon genera in more depth, we performed 2nd generation whole genome sequencing to a mean of ~15X coverage in two individuals from each genus. We developed a coalescent-based Approximate Bayesian Computation method incorporating a model of sequencing error generated by high coverage exome validation to infer the branching order, divergence times, and effective population sizes of gibbon taxa. Although Hoolock and Symphalangus are likely sister taxa, we could not confidently resolve a single bifurcating tree despite the large amount of data analyzed. Our combined results support the hypothesis that all four gibbon genera diverged at approximately the same time. Assuming an autosomal mutation rate of 1×10^-9/site/year, this speciation process occurred ~5 Mya during a period in the Early Pliocene characterized by climatic shifts and fragmentation of the Sunda shelf forests. Whole genome sequencing of additional individuals will be vital for inferring the extent of gene flow among species after the separation of the gibbon genera.

Virulence genes are a signature of the microbiome in the colorectal tumor microenvironment

Michael B Burns, Joshua Lynch, Timothy K Starr, Dan Knights, Ran Blekhman
doi: http://dx.doi.org/10.1101/009431

Background: The human gut microbiome is associated with the development of colon cancer, and recent studies have found changes in the composition of the microbial communities in cancer patients compared to healthy controls. However, host-bacteria interactions are mainly expected to occur in the cancer microenvironment, whereas current studies primarily use stool samples to survey the microbiome. Here, we highlight the major shifts in the colorectal tumor microbiome relative to that of matched normal colon tissue from the same individual, allowing us to survey the microbial communities at the tumor microenvironment, and providing an intrinsic control for environmental and host genetic effects on the microbiome. Results: We characterized the microbiome in 44 primary tumor and 44 patient-matched normal colon tissues. We find that tumors harbor distinct microbial communities compared to nearby healthy tissue. Our results show increased microbial diversity at the tumor microenvironment, with changes in the abundances of commensal and pathogenic bacterial taxa, including Fusobacterium and Providencia. While Fusobacterium has previously been implicated in CRC, Providencia is a novel tumor-associated agent, and has several features that make it a potential cancer driver, including a strong immunogenic LPS and an ability to damage colorectal tissue. Additionally, we identified a significant enrichment of virulence-associated genes in the colorectal cancer microenvironment. Conclusions: This work identifies bacterial taxa significantly correlated with colorectal cancer, including a novel finding of an elevated abundance of Providencia in the tumor microenvironment. We also describe several metabolic pathways and enzymes differentially present in the tumor-associated microbiome, and show that the bacterial genes in the tumor microenvironment are enriched for virulence-associated genes from the aggregate microbial community. This virulence enrichment indicates that the microbiome likely plays an active role in colorectal cancer development and/or progression. These results provide a starting point for future prognostic and therapeutic research with the potential to improve patient outcomes.