Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants

Aziz Belkadi, Alexandre Bolze, Yuval Itan, Quentin B Vincent, Alexander Antipenko, Bertrand Boisson, Jean-Laurent Casanova, Laurent Abel

doi: http://dx.doi.org/10.1101/010363

We compared whole-exome sequencing (WES) and whole-genome sequencing (WGS) for the detection of single-nucleotide variants (SNVs) in the exomes of six unrelated individuals. In the regions targeted by exome capture, the mean number of SNVs detected was 84,192 for WES and 84,968 for WGS. Only 96% of the variants were detected by both methods, with the same genotype identified for 99.2% of them. The distributions of coverage depth (CD), genotype quality (GQ), and minor read ratio (MRR) were much more homogeneous for WGS than for WES data. Most variants with discordant genotypes were filtered out when we used thresholds of CD≥8X, GQ≥20, and MRR≥0.2. However, a substantial number of coding variants were identified exclusively by WES (105 on average) or WGS (692). We Sanger sequenced a random selection of 170 of these exclusive variants, and estimated the mean number of false-positive coding variants per sample at 79 for WES and 36 for WGS. Importantly, the mean number of real coding variants identified by WGS and missed by WES (656) was much larger than the number of real coding variants identified by WES and missed by WGS (26). A substantial proportion of these exclusive variants (32%) were predicted to be damaging. In addition, about 380 genes were poorly covered (~27% of base pairs with CD<8X) by WES for all samples, including 49 genes underlying Mendelian disorders. We conclude that WGS is more powerful and reliable than WES for detecting potential disease-causing mutations in the exome.

# Category Archives: statistical genomics

# Bayesian Structured Sparsity from Gaussian Fields

Bayesian Structured Sparsity from Gaussian Fields

Barbara E. Engelhardt, Ryan P. Adams

(Submitted on 8 Jul 2014)

Substantial research on structured sparsity has contributed to analysis of many different applications. However, there have been few Bayesian procedures among this work. Here, we develop a Bayesian model for structured sparsity that uses a Gaussian process (GP) to share parameters of the sparsity-inducing prior in proportion to feature similarity as defined by an arbitrary positive definite kernel. For linear regression, this sparsity-inducing prior on regression coefficients is a relaxation of the canonical spike-and-slab prior that flattens the mixture model into a scale mixture of normals. This prior retains the explicit posterior probability on inclusion parameters—now with GP probit prior distributions—but enables tractable computation via elliptical slice sampling for the latent Gaussian field. We motivate development of this prior using the genomic application of association mapping, or identifying genetic variants associated with a continuous trait. Our Bayesian structured sparsity model produced sparse results with substantially improved sensitivity and precision relative to comparable methods. Through simulations, we show that three properties are key to this improvement: i) modeling structure in the covariates, ii) significance testing using the posterior probabilities of inclusion, and iii) model averaging. We present results from applying this model to a large genomic dataset to demonstrate computational tractability.

# Improved genome inference in the MHC using a population reference graph

Improved genome inference in the MHC using a population reference graph

Alexander Dilthey, Charles J Cox, Zamin Iqbal, Matthew R Nelson, Gil McVean

In humans and many other species, while much is known about the extent and structure of genetic variation, such information is typically not used in assembling novel genomes. Rather, a single reference is used against which to map reads, which can lead to poor characterisation of regions of high sequence or structural diversity. Here, we introduce a population reference graph, which combines multiple reference sequences as well as catalogues of SNPs and short indels. The genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov Model, allowing for recombination between different haplotypes and variants. By applying the method to the 4.5Mb extended MHC region on chromosome 6, combining eight assembled haplotypes, sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate, using simulations, SNP genotyping, short-read and long-read data, how the method improves the accuracy of genome inference. Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference-quality genome sequences.

# A Simple Data-Adaptive Probabilistic Variant Calling Model

A Simple Data-Adaptive Probabilistic Variant Calling Model

Steve Hoffmann, Peter F. Stadler, Korbinian Strimmer

(Submitted on 20 May 2014)

Background: Several sources of noise obfuscate the identification of single nucleotide variation in next generation sequencing data. Not only errors introduced during library construction and sequencing steps but also the quality of the reference genome and the algorithms used for the alignment of the reads play an influential role. It is not trivial to estimate the influence these factors for individual sequencing experiments.

Results: We introduce a simple data-adaptive model for variant calling. Several characteristics are sampled from sites with low mismatch rates and uses to estimate empirical log-likelihoods. These likelihoods are then combined to a score that typically gives rise to a mixture distribution. From these we determine a decision threshold to separate potentially variant sites from the noisy background.

Conclusions: In simulations we show that the proposed model is at par with frequently used SNV calling algorithms in terms of sensitivity and specificity. The application to next-generation sequencing data reveals stark differences of the score distributions indicating a strong influence of data specific sources of noise. The proposed model is specifically designed to adjust to these differences.

# Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing

Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing

Xiaoquan Wen

(Submitted on 29 Apr 2014)

We consider the problems of hypothesis testing and model comparison under a flexible Bayesian linear regression model whose formulation is closely connected with the linear mixed effect model and the parametric models for SNP set analysis in genetic association studies. We derive a class of analytic approximate Bayes factors and illustrate their connections with a variety of frequentist test statistics, including the Wald statistic and the variance component score statistic. Taking advantage of Bayesian model averaging and hierarchical modeling, we demonstrate some distinct advantages and flexibilities in the approaches utilizing the derived Bayes factors in the context of genetic association studies. We demonstrate our proposed methods using real or simulated numerical examples in applications of single SNP association testing, multi-locus fine-mapping and SNP set association testin

# Principal component gene set enrichment (PCGSE)

Principal component gene set enrichment (PCGSE)

H. Robert Frost, Zhigang Li, Jason H. Moore

Motivation: Although principal component analysis (PCA) is widely used for the dimensional reduction of biomedical data, interpretation of PCA results remains daunting. Most existing methods attempt to explain each principal component (PC) in terms of a small number of variables by generating approximate PCs with few non-zero loadings. Although useful when just a few variables dominate the population PCs, these methods are often inadequate for characterizing the PCs of high-dimensional genomic data. For genomic data, reproducible and biologically meaningful PC interpretation requires methods based on the combined signal of functionally related sets of genes. While gene set testing methods have been widely used in supervised settings to quantify the association of groups of genes with clinical outcomes, these methods have seen only limited application for testing the enrichment of gene sets relative to sample PCs. Results: We describe a novel approach, principal component gene set enrichment (PCGSE), for computing the statistical association between gene sets and the PCs of genomic data. The PCGSE method performs a two-stage competitive gene set test using the correlation between each gene and each PC as the gene-level test statistic with flexible choice of both the gene set test statistic and the method used to compute the null distribution of the gene set statistic. Using simulated data with simulated gene sets and real gene expression data with curated gene sets, we demonstrate that biologically meaningful and computationally efficient results can be obtained from a simple parametric version of the PCGSE method that performs a correlation-adjusted two-sample t-test between the gene-level test statistics for gene set members and genes not in the set. Availability: this http URL Contact: rob.frost@dartmouth.edu or jason.h.moore@dartmouth.edu

# Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible

Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible

Paul J. McMurdie, Susan Holmes

(Submitted on 1 Oct 2013)

The interpretation of count data originating from the current generation of DNA sequencing platforms requires special attention. In particular, the per-sample library sizes often vary by orders of magnitude from the same sequencing run, and the counts are overdispersed relative to a simple Poisson model These challenges can be addressed using an appropriate mixture model that simultaneously accounts for library size differences and biological variability. This approach is already well-characterized and implemented for RNA-Seq data in R packages such as edgeR and DESeq.

We use statistical theory, extensive simulations, and empirical data to show that variance stabilizing normalization using a mixture model like the negative binomial is appropriate for microbiome count data. In simulations detecting differential abundance, normalization procedures based on a Gamma-Poisson mixture model provided systematic improvement in performance over crude proportions or rarefied counts — both of which led to a high rate of false positives. In simulations evaluating clustering accuracy, we found that the rarefying procedure discarded samples that were nevertheless accurately clustered by alternative methods, and that the choice of minimum library size threshold was critical in some settings, but with an optimum that is unknown in practice. Techniques that use variance stabilizing transformations by modeling microbiome count data with a mixture distribution, such as those implemented in edgeR and DESeq, substantially improved upon techniques that attempt to normalize by rarefying or crude proportions. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.