Conflations of short IBD blocks can bias inferred length of IBD

Charleston W.K. Chiang, Peter Ralph, John Novembre
Comments: 12 figures, 1 table
Subjects: Populations and Evolution (q-bio.PE)

Identity-by-descent (IBD) is a fundamental concept in genetics with many applications. Often, segments between two haplotypes are said to be IBD if they are inherited from a recent shared common ancestor without intervening recombination. Long IBD blocks (> 1cM) can be efficiently detected by a number of computer programs using high-density SNP array data from a population sample. However, all programs detect IBD based on contiguous segments of identity-by-state, so an inferred block can in fact be the conflation of smaller, nearby IBD blocks. We quantified this effect using coalescent simulations, finding that nearly 40% of inferred blocks 1-2cM long are false conflations of two or more shorter blocks, under demographic scenarios typical for modern humans. This biases the inferred IBD block length distribution, and so can affect downstream inferences. We observed this conflation effect universally across different IBD detection programs and human demographic histories, and found inference of segments longer than 2cM to be much more reliable (less than 5% conflation rate). We then present and analyze a novel estimator of the de novo mutation rate using IBD blocks, and demonstrate that the biased length distribution of the IBD segments due to conflation can strongly affect this estimator if the conflation is not modeled. Thus, the conflation effect should be carefully considered, especially as methods to detect shorter IBD blocks using sequencing data are being developed.
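The conflation effect can be illustrated with a toy model. The sketch below is not the authors' simulation: it assumes a hypothetical detector that cannot see non-IBD gaps shorter than `max_gap_cm` (an invented parameter standing in for identity-by-state running across a short gap), and the segment coordinates are made up for illustration.

```python
def conflate(segments, max_gap_cm):
    """Merge true IBD segments separated by gaps no longer than max_gap_cm.

    segments: list of (start_cm, end_cm) tuples on one chromosome.
    Returns the merged (start_cm, end_cm) blocks that the toy detector
    would report.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_cm:
            # The gap is too short to break the reported block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical true segments: two short neighbours and one long block.
true_segments = [(10.0, 10.6), (10.7, 11.4), (30.0, 32.5)]
reported = conflate(true_segments, max_gap_cm=0.2)
lengths = [round(end - start, 2) for start, end in reported]
print(reported, lengths)
```

In this example the two sub-1cM segments are reported as a single 1.4cM block, which is how short blocks can inflate the 1-2cM bin of the inferred length distribution.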

Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution

Eric Y Durand, Chuong B Do, Joanna L Mountain, J. Michael Macpherson
doi: http://dx.doi.org/10.1101/010512

Ancestry deconvolution, the task of identifying the ancestral origin of chromosomal segments in admixed individuals, has important implications, from mapping disease genes to identifying candidate loci under natural selection. To date, however, most existing methods for ancestry deconvolution are limited to two or three ancestral populations, and cannot resolve contributions from populations related at a sub-continental scale. We describe Ancestry Composition, a modular three-stage pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals. It assumes the genotype data have been phased. In the first stage, a support vector machine classifier assigns tentative ancestry labels to short local phased genomic regions. In the second stage, an autoregressive pair hidden Markov model simultaneously corrects phasing errors and produces reconciled local ancestry estimates and confidence scores based on the tentative ancestry labels. In the third stage, confidence estimates are recalibrated using isotonic regression. We compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and over 8,000 individuals reporting four grandparents with the same country-of-origin from the member database of the personal genetics company, 23andMe, Inc., and excluding outliers identified through principal components analysis (PCA). In cross-validation experiments, Ancestry Composition achieves high precision and recall for labeling chromosomal segments across over 25 different populations worldwide.
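The third stage, recalibration by isotonic regression, can be sketched with the standard pool-adjacent-violators algorithm. This is a generic textbook version, not the pipeline's own code; the function name, the raw scores, and the correctness labels are all hypothetical.

```python
def isotonic_fit(scores, outcomes):
    """Pool-adjacent-violators fit of 0/1 outcomes against raw scores.

    Returns (sorted_scores, calibrated), where calibrated is a
    non-decreasing sequence of recalibrated confidence values.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = [float(outcomes[i]) for i in order]

    # Each block stores [sum of outcomes, count]. Whenever a new block has
    # a lower mean than the block before it, the two are pooled.
    blocks = []
    for y in ys:
        blocks.append([y, 1])
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    calibrated = []
    for total, count in blocks:
        calibrated.extend([total / count] * count)
    return xs, calibrated


# Hypothetical raw confidences and whether each label was correct (1) or not (0).
raw_scores = [0.2, 0.4, 0.6, 0.8, 0.9]
was_correct = [0, 1, 0, 1, 1]
xs, calibrated = isotonic_fit(raw_scores, was_correct)
print(list(zip(xs, calibrated)))
```

The fit maps each raw score to the observed accuracy of its pooled neighbourhood while keeping the mapping monotone, so a higher raw confidence never receives a lower recalibrated confidence.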

Transcriptome Sequencing Reveals Widespread Gene-Gene and Gene-Environment Interactions

Alfonso Buil, Andrew A Brown, Tuuli Lappalainen, Ana Viñuela, Matthew N Davies, Houfeng F Zheng, Brent J Richards, Daniel Glass, Kerrin S Small, Richard Durbin, Timothy D Spector, Emmanouil T Dermitzakis
doi: http://dx.doi.org/10.1101/010546

Understanding the genetic architecture of gene expression is an intermediate step toward understanding the genetic architecture of complex diseases. RNA-seq technologies have improved the quantification of gene expression and allow measurement of allele-specific expression (ASE). ASE is hypothesized to result from the direct effect of cis regulatory variants, but a proper estimation of the causes of ASE has not been performed to date. In this study we take advantage of a sample of twins to measure the relative contribution of genetic and environmental effects on ASE, and we find substantial effects of gene x gene (GxG) and gene x environment (GxE) interactions. We propose a model where ASE requires genetic variability in cis, i.e., a difference in sequence between the two alleles, but the magnitude of the ASE effect depends on trans genetic and environmental factors that interact with the cis genetic variants. We uncover large GxG and GxE effects on gene expression, and likely on complex phenotypes, that currently remain elusive.

Second-generation PLINK: rising to the challenge of larger and richer datasets

Christopher C. Chang, Carson C. Chow, Laurent C.A.M. Tellier, Shashaank Vattikuti, Shaun M. Purcell, James J. Lee
Comments: 2 figures, 1 additional file
Subjects: Genomics (q-bio.GN); Computation (stat.CO)

PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format.
To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information.
The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
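To give a flavor of what bit-level parallelism means for genotype data, the sketch below packs genotypes into 2-bit fields and sums alternate-allele counts with two population counts instead of a per-sample loop. The encoding used here (the field value equals the alternate-allele count) is a hypothetical one chosen for clarity; it is not PLINK's file format, and a Python integer merely stands in for a 64-bit machine word.

```python
LOW_BITS = 0x5555555555555555   # selects the low bit of every 2-bit field
HIGH_BITS = 0xAAAAAAAAAAAAAAAA  # selects the high bit of every 2-bit field


def pack_genotypes(genotypes):
    """Pack up to 32 genotypes (alt-allele counts 0, 1, or 2) into one word."""
    if len(genotypes) > 32:
        raise ValueError("one 64-bit word holds at most 32 genotypes")
    word = 0
    for i, g in enumerate(genotypes):
        if g not in (0, 1, 2):
            raise ValueError("genotype must be 0, 1, or 2")
        word |= g << (2 * i)
    return word


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def alt_allele_count(word):
    """Sum all packed genotypes using two popcounts.

    A low bit contributes 1 allele (heterozygote) and a high bit
    contributes 2 alleles (alternate homozygote).
    """
    return popcount(word & LOW_BITS) + 2 * popcount(word & HIGH_BITS)


# Hypothetical genotypes for eight samples at one variant.
genotypes = [0, 1, 2, 2, 1, 0, 0, 1]
word = pack_genotypes(genotypes)
print(alt_allele_count(word), sum(genotypes))
```

On real hardware the two popcounts process 32 genotypes per instruction pair, which is the kind of constant-factor gain that word-level operations make possible.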

Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants

Aziz Belkadi, Alexandre Bolze, Yuval Itan, Quentin B Vincent, Alexander Antipenko, Bertrand Boisson, Jean-Laurent Casanova, Laurent Abel
doi: http://dx.doi.org/10.1101/010363

We compared whole-exome sequencing (WES) and whole-genome sequencing (WGS) for the detection of single-nucleotide variants (SNVs) in the exomes of six unrelated individuals. In the regions targeted by exome capture, the mean number of SNVs detected was 84,192 for WES and 84,968 for WGS. Only 96% of the variants were detected by both methods, with the same genotype identified for 99.2% of them. The distributions of coverage depth (CD), genotype quality (GQ), and minor read ratio (MRR) were much more homogeneous for WGS than for WES data. Most variants with discordant genotypes were filtered out when we used thresholds of CD≥8X, GQ≥20, and MRR≥0.2. However, a substantial number of coding variants were identified exclusively by WES (105 on average) or WGS (692). We Sanger sequenced a random selection of 170 of these exclusive variants, and estimated the mean number of false-positive coding variants per sample at 79 for WES and 36 for WGS. Importantly, the mean number of real coding variants identified by WGS and missed by WES (656) was much larger than the number of real coding variants identified by WES and missed by WGS (26). A substantial proportion of these exclusive variants (32%) were predicted to be damaging. In addition, about 380 genes were poorly covered (~27% of base pairs with CD<8X) by WES for all samples, including 49 genes underlying Mendelian disorders. We conclude that WGS is more powerful and reliable than WES for detecting potential disease-causing mutations in the exome.
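The counts of real exclusive variants follow from the other figures in the abstract. The short check below assumes, as the numbers suggest, that each "real" count is the mean exclusive count minus the estimated mean false-positive count; the variable names are ours.

```python
# Mean per-sample counts quoted in the abstract.
exclusive = {"WES": 105, "WGS": 692}      # coding variants called by one method only
false_positives = {"WES": 79, "WGS": 36}  # estimated from Sanger sequencing

# Real exclusive variants = exclusive calls minus estimated false positives.
real = {m: exclusive[m] - false_positives[m] for m in exclusive}

# Share of each method's exclusive calls estimated to be false positives.
fp_share = {m: false_positives[m] / exclusive[m] for m in exclusive}

for method in exclusive:
    print(method, real[method], round(fp_share[method], 2))
```

The subtraction reproduces the 26 (WES) and 656 (WGS) real variants reported, and shows that roughly three quarters of WES-exclusive calls were estimated to be false positives, against about 5% of WGS-exclusive calls.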

Similar efficacies of selection shape mitochondrial and nuclear genes in Drosophila melanogaster and Homo sapiens

Brandon S. Cooper, Chad Burrus, Chao Ji, Matthew W. Hahn, Kristi L. Montooth
doi: http://dx.doi.org/10.1101/010355

Deleterious mutations contribute to polymorphism even when selection effectively prevents their fixation. The efficacy of selection in removing deleterious mitochondrial mutations from populations depends on the effective population size (Ne) of the mtDNA, and the degree to which a lack of recombination magnifies the effects of linked selection. Using complete mitochondrial genomes from Drosophila melanogaster and nuclear data available from the same samples, we re-examine the hypothesis that non-recombining animal mtDNA harbors an excess of deleterious polymorphisms relative to the nuclear genome. We find no evidence of recombination in the mitochondrial genome, and the much-reduced level of mitochondrial synonymous polymorphism relative to nuclear genes is consistent with a reduction in Ne. Nevertheless, we find that the neutrality index (NI), a measure of the excess of nonsynonymous polymorphism relative to the neutral expectation, is not significantly different between mitochondrial and nuclear loci. Reanalysis of published data from Homo sapiens reveals the same lack of a difference between the two genomes, though small samples in previous studies had suggested a strong difference in both species. Thus, despite a smaller Ne, mitochondrial loci of both flies and humans appear to experience similar efficacies of selection as do loci in the recombining nuclear genome.
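For readers unfamiliar with the neutrality index, the sketch below uses its commonly cited definition, NI = (Pn/Ps) / (Dn/Ds), where P counts polymorphisms within a species, D counts fixed differences between species, and n and s denote nonsynonymous and synonymous sites. The abstract does not state which estimator the authors used, so this is an assumption, and the counts are invented for illustration.

```python
def neutrality_index(pn, ps, dn, ds):
    """Neutrality index NI = (Pn / Ps) / (Dn / Ds).

    pn, ps: nonsynonymous and synonymous polymorphism counts.
    dn, ds: nonsynonymous and synonymous fixed-difference counts.
    NI > 1 indicates an excess of nonsynonymous polymorphism relative
    to the neutral expectation; NI = 1 matches that expectation.
    """
    if ps <= 0 or dn <= 0 or ds <= 0:
        raise ValueError("ps, dn, and ds must be positive for NI to be defined")
    return (pn / ps) / (dn / ds)


# Hypothetical counts for a single locus.
ni = neutrality_index(pn=20, ps=40, dn=10, ds=40)
print(ni)
```

With these made-up counts the nonsynonymous-to-synonymous ratio is twice as high within the species as between species, giving NI = 2.0, the pattern expected when mildly deleterious variants segregate but rarely fix.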

Recent evolution of the mutation rate and spectrum in Europeans

Kelley Harris
doi: http://dx.doi.org/10.1101/010314

As humans dispersed out of Africa, they adapted to new environmental challenges including changes in exposure to mutagenic solar radiation. This raises the possibility that different populations experienced different selective pressures affecting genome integrity. Prior work has uncovered divergent selection in tropical versus temperate latitudes on eQTLs that regulate the DNA damage response, as well as evidence that the human mutation rate per year has changed at least 2-fold since we shared a common ancestor with chimpanzees. Here, I present evidence that the rate of a particular mutation type has recently increased in the European lineage, rising in frequency by 50% during the 30,000–50,000 years since Europeans diverged from Asians. A comparison of single nucleotide polymorphisms (SNPs) private to Africa, Asia, and Europe in the 1000 Genomes data reveals that private European variation is enriched for the transition 5’-TCC-3’→5’-TTC-3’. Although it is not clear whether UV played a causal role in changing the European mutational spectrum, 5’-TCC-3’→5’-TTC-3’ is known to be the most common somatic mutation present in melanoma skin cancers, as well as the mutation most frequently induced in vitro by UV. Regardless of its causality, this change indicates that DNA replication fidelity has not remained stable even since the origin of modern humans and might have changed numerous times during our recent evolutionary history.