Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution
Eric Y Durand, Chuong B Do, Joanna L Mountain, J. Michael Macpherson
Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are limited to two or three ancestral populations and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
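The recalibration idea in the third stage can be illustrated with a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm. This is not the authors' implementation: the function name and the toy data are assumptions made for illustration. The input is a list of 0/1 outcomes (was the call correct?) ordered by increasing raw confidence, and the output is a non-decreasing sequence of calibrated confidences.

```python
def isotonic_fit(outcomes):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `outcomes`.

    `outcomes` must already be ordered by increasing raw confidence score.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in outcomes:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Toy data (made up): 1 = call was correct, 0 = call was wrong,
# listed in order of increasing raw confidence.
calibrated = isotonic_fit([0, 1, 0, 1, 1])
print(calibrated)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

The second and third calls violate monotonicity (a correct call followed by an incorrect one at higher raw confidence), so they are pooled into a single calibrated value of 0.5.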
Transcriptome Sequencing Reveals Widespread Gene-Gene and Gene-Environment Interactions
Alfonso Buil, Andrew A Brown, Tuuli Lappalainen, Ana Viñuela, Matthew N Davies, Houfeng F Zheng, Brent J Richards, Daniel Glass, Kerrin S Small, Richard Durbin, Timothy D Spector, Emmanouil T Dermitzakis
Understanding the genetic architecture of gene expression is an intermediate step toward understanding the genetic architecture of complex diseases. RNA-seq technologies have improved the quantification of gene expression and make it possible to measure allele-specific expression (ASE) [1-3]. ASE is hypothesized to result from the direct effect of cis regulatory variants, but a proper estimation of the causes of ASE has not been performed to date. In this study we take advantage of a sample of twins to measure the relative contribution of genetic and environmental effects on ASE, and we find substantial effects of gene x gene (GxG) and gene x environment (GxE) interactions. We propose a model where ASE requires genetic variability in cis, that is, a difference in sequence between the two alleles, but the magnitude of the ASE effect depends on trans genetic and environmental factors that interact with the cis genetic variants. We uncover large GxG and GxE effects on gene expression and likely complex phenotypes that currently remain elusive.
Second-generation PLINK: rising to the challenge of larger and richer datasets
Christopher C. Chang, Carson C. Chow, Laurent C.A.M. Tellier, Shashaank Vattikuti, Shaun M. Purcell, James J. Lee
Comments: 2 figures, 1 additional file
Subjects: Genomics (q-bio.GN); Computation (stat.CO)
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information.
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
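As a rough illustration of what bit-level parallelism means for genotype data, the sketch below packs genotypes into 2-bit fields of a 64-bit word and counts genotype classes for 32 samples at once with masks and population counts. It is not PLINK's code: the 2-bit encoding, the constant `LOW_BITS`, and the functions `pack`, `popcount`, and `count_genotypes` are all assumptions chosen for the example.

```python
# Hypothetical 2-bit genotype codes (an assumption for illustration only):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate, 3 = missing
LOW_BITS = 0x5555555555555555  # low bit of each 2-bit field in a 64-bit word


def pack(genotypes):
    """Pack up to 32 genotype codes into one integer, first sample in the lowest bits."""
    if len(genotypes) > 32:
        raise ValueError("a 64-bit word holds at most 32 two-bit genotypes")
    word = 0
    for i, code in enumerate(genotypes):
        word |= (code & 3) << (2 * i)
    return word


def popcount(x):
    """Number of set bits (typically one hardware instruction in compiled code)."""
    return bin(x).count("1")


def count_genotypes(word):
    """Return (het, hom_alt, missing) counts for all 32 fields in one pass."""
    low = word & LOW_BITS            # low bit of every field
    high = (word >> 1) & LOW_BITS    # high bit of every field, shifted into place
    het = low & ~high                # code 1: low set, high clear
    hom_alt = high & ~low            # code 2: high set, low clear
    missing = low & high             # code 3: both set
    return popcount(het), popcount(hom_alt), popcount(missing)


word = pack([0, 1, 2, 3, 1, 2, 2, 0])
print(count_genotypes(word))  # (2, 3, 1)
```

Unused fields in the word are zero and therefore look like homozygous-reference genotypes, so a homozygous-reference count would have to be derived from the known sample count rather than from the word itself.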
An extended reply to Mendez et al.: The ‘extremely ancient’ chromosome that still isn’t
Eran Elhaik, Tatiana V. Tatarinova, Anatole A. Klyosov, Dan Graur
(Submitted on 15 Oct 2014)
Earlier this year, we published a scathing critique of a paper by Mendez et al. (2013) in which the claim was made that a Y chromosome was 237,000-581,000 years old. Elhaik et al. (2014) also attacked a popular article in Scientific American by the senior author of Mendez et al. (2013), whose title was “Sex with other human species might have been the secret of Homo sapiens’s [sic] success” (Hammer 2013). Five of the 11 authors of Mendez et al. (2013) have now written a “rebuttal,” and we were allowed to reply.
Unfortunately, our reply was censored for being “too sarcastic and inflamed.” References were removed, meanings were castrated, and a dedication in the Acknowledgments was deleted. Now that the so-called rebuttal by 45% of the authors of Mendez et al. (2013) has been published together with our vasectomized reply, we decided to make public our entire reply to the so-called “rebuttal.” In fact, we go one step further, and publish a version of the reply that has not even been self-censored.
Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants
Aziz Belkadi, Alexandre Bolze, Yuval Itan, Quentin B Vincent, Alexander Antipenko, Bertrand Boisson, Jean-Laurent Casanova, Laurent Abel
We compared whole-exome sequencing (WES) and whole-genome sequencing (WGS) for the detection of single-nucleotide variants (SNVs) in the exomes of six unrelated individuals. In the regions targeted by exome capture, the mean number of SNVs detected was 84,192 for WES and 84,968 for WGS. Only 96% of the variants were detected by both methods, with the same genotype identified for 99.2% of them. The distributions of coverage depth (CD), genotype quality (GQ), and minor read ratio (MRR) were much more homogeneous for WGS than for WES data. Most variants with discordant genotypes were filtered out when we used thresholds of CD≥8X, GQ≥20, and MRR≥0.2. However, a substantial number of coding variants were identified exclusively by WES (105 on average) or WGS (692). We Sanger sequenced a random selection of 170 of these exclusive variants, and estimated the mean number of false-positive coding variants per sample at 79 for WES and 36 for WGS. Importantly, the mean number of real coding variants identified by WGS and missed by WES (656) was much larger than the number of real coding variants identified by WES and missed by WGS (26). A substantial proportion of these exclusive variants (32%) were predicted to be damaging. In addition, about 380 genes were poorly covered (~27% of base pairs with CD<8X) by WES for all samples, including 49 genes underlying Mendelian disorders. We conclude that WGS is more powerful and reliable than WES for detecting potential disease-causing mutations in the exome.
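The quality thresholds and the false-positive arithmetic in this abstract can be sketched as follows. The function `passes_filters` and its parameter names are assumptions made for illustration; only the threshold values and the per-sample means are taken from the abstract. Subtracting the estimated false positives from the exclusive calls reproduces the reported numbers of real exclusive coding variants.

```python
def passes_filters(cd, gq, mrr, min_cd=8, min_gq=20, min_mrr=0.2):
    """Apply the abstract's thresholds: CD >= 8X, GQ >= 20, MRR >= 0.2."""
    return cd >= min_cd and gq >= min_gq and mrr >= min_mrr


# Mean counts per sample, as reported in the abstract.
exclusive_calls = {"WES": 105, "WGS": 692}   # coding variants seen by one method only
false_positives = {"WES": 79, "WGS": 36}     # estimated from Sanger sequencing

# Real exclusive variants = exclusive calls minus estimated false positives.
real_exclusive = {
    method: exclusive_calls[method] - false_positives[method]
    for method in exclusive_calls
}
print(real_exclusive)  # {'WES': 26, 'WGS': 656}
```

In other words, about 656 real coding variants per sample are found only by WGS, against about 26 found only by WES.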
Similar efficacies of selection shape mitochondrial and nuclear genes in Drosophila melanogaster and Homo sapiens
Brandon S. Cooper, Chad Burrus, Chao Ji, Matthew W. Hahn, Kristi L. Montooth
Deleterious mutations contribute to polymorphism even when selection effectively prevents their fixation. The efficacy of selection in removing deleterious mitochondrial mutations from populations depends on the effective population size (Ne) of the mtDNA, and the degree to which a lack of recombination magnifies the effects of linked selection. Using complete mitochondrial genomes from Drosophila melanogaster and nuclear data available from the same samples, we re-examine the hypothesis that non-recombining animal mtDNA harbors an excess of deleterious polymorphisms relative to the nuclear genome. We find no evidence of recombination in the mitochondrial genome, and the much-reduced level of mitochondrial synonymous polymorphism relative to nuclear genes is consistent with a reduction in Ne. Nevertheless, we find that the neutrality index (NI), a measure of the excess of nonsynonymous polymorphism relative to the neutral expectation, is not significantly different between mitochondrial and nuclear loci. Reanalysis of published data from Homo sapiens reveals the same lack of a difference between the two genomes, though small samples in previous studies had suggested a strong difference in both species. Thus, despite a smaller Ne, mitochondrial loci of both flies and humans appear to experience similar efficacies of selection as do loci in the recombining nuclear genome.
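For readers unfamiliar with the neutrality index, the sketch below computes it in its standard form from the four counts of a McDonald-Kreitman table: NI = (Pn/Ps) / (Dn/Ds), where P denotes polymorphisms within a species, D denotes fixed differences between species, and the subscripts n and s denote nonsynonymous and synonymous sites. The counts used here are made up for illustration and are not taken from the study.

```python
def neutrality_index(pn, ps, dn, ds):
    """Neutrality index from McDonald-Kreitman counts.

    pn, ps: nonsynonymous and synonymous polymorphisms within a species
    dn, ds: nonsynonymous and synonymous fixed differences between species
    NI > 1 indicates an excess of nonsynonymous polymorphism;
    NI near 1 is the neutral expectation.
    """
    return (pn / ps) / (dn / ds)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
ni = neutrality_index(pn=10, ps=20, dn=5, ds=20)
print(ni)  # 2.0
```

A real analysis would also need to handle zero counts, which make the ratio undefined, and to test whether NI differs significantly between sets of loci.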
Recent evolution of the mutation rate and spectrum in Europeans
As humans dispersed out of Africa, they adapted to new environmental challenges including changes in exposure to mutagenic solar radiation. This raises the possibility that different populations experienced different selective pressures affecting genome integrity. Prior work has uncovered divergent selection in tropical versus temperate latitudes on eQTLs that regulate the DNA damage response, as well as evidence that the human mutation rate per year has changed at least 2-fold since we shared a common ancestor with chimpanzees. Here, I present evidence that the rate of a particular mutation type has recently increased in the European lineage, rising in frequency by 50% during the 30,000–50,000 years since Europeans diverged from Asians. A comparison of single nucleotide polymorphisms (SNPs) private to Africa, Asia, and Europe in the 1000 Genomes data reveals that private European variation is enriched for the transition 5’-TCC-3’→5’-TTC-3’. Although it is not clear whether UV played a causal role in changing the European mutational spectrum, 5’-TCC-3’→5’-TTC-3’ is known to be the most common somatic mutation present in melanoma skin cancers, as well as the mutation most frequently induced in vitro by UV. Regardless of its causality, this change indicates that DNA replication fidelity has not remained stable even since the origin of modern humans and might have changed numerous times during our recent evolutionary history.
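One way to tabulate a mutation type such as TCC→TTC is to label each SNP by its trinucleotide context and count the labels. The sketch below is an illustration under stated assumptions, not the author's pipeline: it folds the two DNA strands together so that GGA→GAA on one strand is counted as TCC→TTC on the other, and the SNP list is made up. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_mutation(context, alt):
    """Label a SNP by its trinucleotide context, e.g. 'TCC>TTC'.

    `context` is the 3-base reference sequence centred on the SNP and `alt`
    is the new middle base. If the reference middle base is a purine (A or G),
    the mutation is reported on the opposite strand (an assumed convention).
    """
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    return context + ">" + context[0] + alt + context[2]


def fraction_of_type(snps, mutation):
    """Fraction of SNPs in `snps` whose canonical label equals `mutation`."""
    labels = [canonical_mutation(context, alt) for context, alt in snps]
    return labels.count(mutation) / len(labels)


# Made-up private SNPs as (reference context, alternate middle base).
snps = [("TCC", "T"), ("GGA", "A"), ("ACG", "T"), ("TTA", "C")]
print(canonical_mutation("GGA", "A"))     # TCC>TTC
print(fraction_of_type(snps, "TCC>TTC"))  # 0.5
```

Comparing this fraction between sets of SNPs private to different populations is the kind of contrast the abstract describes; a real analysis would also need to determine which allele is ancestral.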