A comparison of control samples for ChIP-seq of histone modifications

Christoffer Flensburg, Sarah A Kinkel, Andrew Keniry, Marnie Blewitt, Alicia Oshlack
doi: http://dx.doi.org/10.1101/007609

The advent of high-throughput sequencing has allowed genome-wide profiling of histone modifications by Chromatin ImmunoPrecipitation (ChIP) followed by sequencing (ChIP-seq). In this assay the histone mark of interest is enriched through a chromatin pull-down assay using an antibody for the mark. Due to imperfect antibodies and other factors, many of the sequenced fragments do not originate from the histone mark of interest, and are referred to as background reads. Background reads are not uniformly distributed and therefore control samples are usually used to estimate the background distribution at any given genomic position. The Encyclopedia of DNA Elements (ENCODE) Consortium guidelines suggest sequencing a whole cell extract (WCE, or “input”) sample, or a mock ChIP reaction such as an IgG control, as a background sample. However, for a histone modification ChIP-seq investigation it is also possible to use a Histone H3 (H3) pull-down to map the underlying distribution of histones. In this paper we generated data from a hematopoietic stem and progenitor cell population isolated from mouse foetal liver to compare WCE and H3 ChIP-seq as control samples. The quality of the control samples is estimated by a comparison to pull-downs of histone modifications and to expression data. We find minor differences between WCE and H3 ChIP-seq, such as coverage in mitochondria and behaviour close to transcription start sites. Where the two controls differ, the H3 pull-down is generally more similar to the ChIP-seq of histone modifications. However, the differences between H3 and WCE have a negligible impact on the quality of a standard analysis.
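
To make the role of a control sample concrete, here is a minimal sketch of one common way to use it: a per-bin log2 ratio of ChIP reads over control reads after scaling both libraries to the same depth. This is not the analysis used in the paper. The function name, the pseudocount, and the read counts are all hypothetical and chosen only for illustration.

```python
import math


def log2_enrichment(chip_counts, control_counts, pseudocount=1.0):
    """Per-bin log2 ratio of ChIP over control after library-size scaling.

    Both inputs are lists of read counts over the same genomic bins.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    if len(chip_counts) != len(control_counts):
        raise ValueError("bin vectors must have the same length")
    chip_total = sum(chip_counts)
    control_total = sum(control_counts)
    if chip_total == 0 or control_total == 0:
        raise ValueError("both samples need at least one read")
    # Scale the control so that both libraries have the same total depth.
    scale = chip_total / control_total
    return [
        math.log2((c + pseudocount) / (k * scale + pseudocount))
        for c, k in zip(chip_counts, control_counts)
    ]


# Hypothetical read counts in four bins.
chip = [10, 50, 10, 10]  # histone-mark ChIP
wce = [5, 5, 5, 5]       # whole cell extract control
h3 = [4, 8, 4, 4]        # H3 pull-down control

print(log2_enrichment(chip, wce))
print(log2_enrichment(chip, h3))
```

Swapping `wce` for `h3` changes the background estimate in each bin, which is the kind of difference between control samples that the abstract describes.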

Leveraging local identity-by-descent increases the power of case/control GWAS with related individuals

Joshua N. Sampson, Bill Wheeler, Peng Li, Jianxin Shi
(Submitted on 31 Jul 2014)

Large case/control Genome-Wide Association Studies (GWAS) often include groups of related individuals with known relationships. When testing for associations at a given locus, current methods incorporate only the familial relationships between individuals. Here, we introduce the chromosome-based Quasi Likelihood Score (cQLS) statistic that incorporates local Identity-By-Descent (IBD) to increase the power to detect associations. In studies robust to population stratification, such as those with case/control sibling pairs, simulations show that the study power can be increased by over 50%. In our example, a GWAS examining late-onset Alzheimer’s disease, the p-values among the most strongly associated SNPs in the APOE gene tend to decrease, with the smallest p-value decreasing from 1.23×10^-8 to 7.70×10^-9. Furthermore, as a part of our simulations, we reevaluate our expectations about the use of families in GWAS. We show that, although adding only half as many unique chromosomes, genotyping affected siblings is more efficient than genotyping randomly ascertained cases. We also show that genotyping cases with a family history of disease will be less beneficial when searching for SNPs with smaller effect sizes.

Most viewed on Haldane’s Sieve: July 2014

The most viewed posts on Haldane’s Sieve this month were:

Fast Genome-Wide QTL Association Mapping on Pedigree and Population Data

Hua Zhou, John Blangero, Thomas D Dyer, Kei-hang K Chan, Eric M Sobel, Kenneth Lange
(Submitted on 31 Jul 2014)

Since most analysis software for genome-wide association studies (GWAS) currently exploits only unrelated individuals, there is a need for efficient applications that can handle general pedigree data or mixtures of both population and pedigree data. Even data sets thought to consist of only unrelated individuals may include cryptic relationships that can lead to false positives if not discovered and controlled for. In addition, family designs possess compelling advantages. They are better equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Pedigrees selected for extreme trait values often segregate a single gene with strong effect. Finally, many pedigrees are available as an important legacy from the era of linkage analysis. Unfortunately, pedigree likelihoods are notoriously hard to compute. In this paper we re-examine the computational bottlenecks and implement ultra-fast pedigree-based GWAS analysis. Kinship coefficients can either be based on explicitly provided pedigrees or automatically estimated from dense markers. Our strategy (a) works for random sample data, pedigree data, or a mix of both; (b) entails no loss of power; (c) allows for any number of covariate adjustments, including correction for population stratification; (d) allows for testing SNPs under additive, dominant, and recessive models; and (e) accommodates both univariate and multivariate quantitative traits. On a typical personal computer (6 CPU cores at 2.67 GHz), analyzing a univariate HDL (high-density lipoprotein) trait from the San Antonio Family Heart Study (935,392 SNPs on 1357 individuals in 124 pedigrees) takes less than 2 minutes and 1.5 GB of memory. Complete multivariate QTL analysis of the three time-points of the longitudinal HDL multivariate trait takes less than 5 minutes and 1.5 GB of memory.
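
As an illustration of kinship coefficients "based on explicitly provided pedigrees", the sketch below uses the standard textbook recursion for pedigree kinship. It is not the implementation described in the paper. The function name and the input layout (a list of `(id, father, mother)` tuples with parents listed before their children) are assumptions made for this example.

```python
def kinship_matrix(pedigree):
    """Return (ids, phi), where phi[i][j] is the kinship coefficient.

    pedigree: list of (id, father, mother) tuples. Founders use None for
    both parents. Parents must appear before their children (an assumption
    of this sketch). An unknown parent is treated as an unrelated founder.
    """
    ids = [person[0] for person in pedigree]
    index = {name: i for i, name in enumerate(ids)}
    n = len(ids)
    phi = [[0.0] * n for _ in range(n)]

    for j, (_, father, mother) in enumerate(pedigree):
        f = index[father] if father is not None else None
        m = index[mother] if mother is not None else None
        for parent in (f, m):
            if parent is not None and parent >= j:
                raise ValueError("parents must be listed before children")

        # Kinship of j with every earlier individual i is the average of
        # i's kinship with each of j's parents.
        for i in range(j):
            value = 0.0
            if f is not None:
                value += 0.5 * phi[i][f]
            if m is not None:
                value += 0.5 * phi[i][m]
            phi[i][j] = phi[j][i] = value

        # Self-kinship is (1 + inbreeding coefficient) / 2, where the
        # inbreeding coefficient equals the kinship between the parents.
        inbreeding = phi[f][m] if f is not None and m is not None else 0.0
        phi[j][j] = 0.5 * (1.0 + inbreeding)

    return ids, phi


# Hypothetical nuclear family: two unrelated founders and two children.
family = [
    ("dad", None, None),
    ("mom", None, None),
    ("sib1", "dad", "mom"),
    ("sib2", "dad", "mom"),
]
ids, phi = kinship_matrix(family)
print(ids)
for row in phi:
    print(row)
```

For this family the recursion gives 0.5 for each individual with themselves, 0.25 for parent and child, 0.25 for full siblings, and 0 between the two founders.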

Fast Genome-Wide QTL Analysis Using Mendel

Hua Zhou, Jin Zhou, Tao Hu, Eric M Sobel, Kenneth Lange
(Submitted on 31 Jul 2014)

Pedigree GWAS (Option 29) in the current version of the Mendel software is an optimized subroutine for performing large scale genome-wide QTL analysis. This analysis (a) works for random sample data, pedigree data, or a mix of both, (b) is highly efficient in both run time and memory requirement, (c) accommodates both univariate and multivariate traits, (d) works for autosomal and X-linked loci, (e) correctly deals with missing data in traits, covariates, and genotypes, (f) allows for covariate adjustment and constraints among parameters, (g) uses either a theoretical or a SNP-based empirical kinship matrix for additive polygenic effects, (h) allows extra variance components such as dominant polygenic effects and household effects, (i) detects and reports outlier individuals and pedigrees, and (j) allows for robust estimation via the t-distribution. The current paper assesses these capabilities on the Genetic Analysis Workshop 19 (GAW19) sequencing data. We analyzed simulated and real phenotypes for both family and random sample data sets. For instance, when jointly testing the 8 longitudinally measured systolic blood pressure (SBP) and diastolic blood pressure (DBP) traits, it takes Mendel 78 minutes on a standard laptop computer to read, quality check, and analyze a data set with 849 individuals and 8.3 million SNPs. Genome-wide eQTL analysis of 20,643 expression traits on 641 individuals with 8.3 million SNPs takes 30 hours using 20 parallel runs on a cluster. Mendel is freely available at this http URL.

Fast Bayesian Feature Selection for High Dimensional Linear Regression in Genomics via the Ising Approximation

Charles K. Fisher, Pankaj Mehta
(Submitted on 30 Jul 2014)

Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach — the Bayesian Ising Approximation (BIA) — to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high dimensional regression by analyzing a gene expression dataset with nearly 30,000 features.
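
The mean field step mentioned in the abstract can be illustrated with the generic self-consistency equation for an Ising model with spins of +1 or -1, m_i = tanh(h_i + sum_j J_ij m_j). The sketch below solves it by damped fixed-point iteration. It does not reproduce the paper's mapping from the regression problem to the fields and couplings: the values of `h` and `J`, the function name, and the damping scheme are all hypothetical. Reading (1 + m_i) / 2 as the probability that a spin is +1 is standard for such spins; treating +1 as "feature is relevant" is an assumption of this example.

```python
import math


def mean_field_magnetizations(h, J, n_iter=200, damping=0.5, tol=1e-10):
    """Solve m_i = tanh(h_i + sum_j J[i][j] * m_j) by damped iteration.

    h: list of external fields, one per spin.
    J: square list-of-lists of couplings (the diagonal is ignored).
    Returns the list of magnetizations m_i in (-1, 1).
    """
    n = len(h)
    if n == 0:
        return []
    m = [0.0] * n
    for _ in range(n_iter):
        proposal = []
        for i in range(n):
            field = h[i] + sum(J[i][j] * m[j] for j in range(n) if j != i)
            proposal.append(math.tanh(field))
        # Damping mixes old and new values to stabilise the iteration.
        updated = [damping * old + (1.0 - damping) * new
                   for old, new in zip(m, proposal)]
        delta = max(abs(a - b) for a, b in zip(updated, m))
        m = updated
        if delta < tol:
            break
    return m


# Hypothetical fields and couplings for three spins.
h = [0.8, -0.5, 0.1]
J = [
    [0.0, 0.2, 0.0],
    [0.2, 0.0, -0.3],
    [0.0, -0.3, 0.0],
]
m = mean_field_magnetizations(h, J)
# Probability that each spin is +1 (here read as "feature is relevant").
probabilities = [(1.0 + mi) / 2.0 for mi in m]
print(probabilities)
```

With all couplings set to zero, the iteration reduces to m_i = tanh(h_i), which is a quick sanity check on the update rule.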

Reproductive isolation of hybrid populations driven by genetic incompatibilities

Molly Schumer, Rongfeng Cui, Gil G Rosenthal, Peter Andolfatto
doi: http://dx.doi.org/10.1101/007518

Despite its role in homogenizing populations, hybridization has also been proposed as a means to generate new species. The conceptual basis for this idea is that hybridization can result in novel phenotypes through recombination between the parental genomes, allowing a hybrid population to occupy ecological niches unavailable to parental species. A key feature of these models is that these novel phenotypes ecologically isolate hybrid populations from parental populations, precipitating speciation. Here we present an alternative model of the evolution of reproductive isolation in hybrid populations that occurs as a simple consequence of selection against incompatibilities. Unlike previous models, our model does not require small population sizes, the availability of new niches for hybrids or ecological or sexual selection on hybrid traits. We show that reproductive isolation between hybrids and parents evolves frequently and rapidly under this model, even in the presence of ongoing migration with parental species and strong selection against hybrids. Our model predicts that multiple distinct hybrid species can emerge from replicate hybrid populations formed from the same parental species, potentially generating patterns of species diversity and relatedness that mimic adaptive radiations.

YFitter: Maximum likelihood assignment of Y chromosome haplogroups from low-coverage sequence data

Luke Jostins, Yali Xu, Shane McCarthy, Qasim Ayub, Richard Durbin, Jeff Barrett, Chris Tyler-Smith
(Submitted on 30 Jul 2014)

Low-coverage short-read resequencing experiments have the potential to expand our understanding of Y chromosome haplogroups. However, the uncertainty associated with these experiments means that haplogroups must be assigned probabilistically to avoid false inferences. We propose an efficient dynamic programming algorithm that can assign haplogroups by maximum likelihood, and represent the uncertainty in assignment. We apply this to both genotype and low-coverage sequencing data, and show that it can assign haplogroups accurately and with high resolution. The method is implemented as the program YFitter, which can be downloaded from this http URL.

Inferring the Clonal Structure of Viral Populations from Time Series Sequencing

Donatien Fotso-Chedom, Pablo R. Murcia, Chris D. Greenman
(Submitted on 30 Jul 2014)

RNA virus populations undergo processes of mutation and selection, resulting in a mixed population of viral particles. High-throughput sequencing of a viral population therefore contains a mixed signal of the underlying clones. We would like to identify the underlying evolutionary structures. We utilize two sources of information to attempt this: within-segment linkage information and mutation prevalence. We demonstrate that clone haplotypes, their prevalence, and maximum parsimony reticulate evolutionary structures can be identified, although the solutions may not be unique, even for complete sets of information. This is applied to a chain of influenza infection, where we infer evolutionary structures, including reassortment, and demonstrate some of the difficulties of interpretation that arise from deep sequencing due to artifacts such as template switching during PCR amplification.

The Genetic Legacy of the Expansion of Turkic-Speaking Nomads Across Eurasia

Bayazit Yunusbayev, Mait Metspalu, Ene Metspalu, Albert Valeev, Sergei Litvinov, Ruslan Valiev, Vita Akhmetova, Elena Balanovska, Oleg Balanovsky, Shahlo Turdikulova, Dilbar Dalimova, Pagbajabyn Nymadawa, Ardeshir Bahmanimehr, Hovhannes Sahakyan, Kristiina Tambets, Sardana Fedorova, Nikolay Barashkov, Irina Khidiatova, Evelin Mihailov, Rita Khusainova, Larisa Damba, Miroslava Derenko, Boris Malyarchuk, Ludmila Osipova, Mikhail Voevoda, Levon Yepiskoposyan, Toomas Kivisild, Elza Khusnutdinova, Richard Villems
doi: http://dx.doi.org/10.1101/005850

The Turkic peoples represent a diverse collection of ethnic groups defined by the Turkic languages. These groups have dispersed across a vast area, including Siberia, Northwest China, Central Asia, East Europe, the Caucasus, Anatolia, the Middle East, and Afghanistan. The origin and early dispersal history of the Turkic peoples is disputed, with candidates for their ancient homeland ranging from the Transcaspian steppe to Manchuria in Northeast Asia. Previous genetic studies have not identified a clear-cut unifying genetic signal for the Turkic peoples, which lends support for language replacement rather than demic diffusion as the model for the Turkic language's expansion. We addressed the genetic origin of 373 individuals from 22 Turkic-speaking populations, representing their current geographic range, by analyzing genome-wide high-density genotype data. Most of the Turkic peoples studied, except those in Central Asia, genetically resembled their geographic neighbors, in agreement with the elite dominance model of language expansion. However, western Turkic peoples sampled across West Eurasia shared an excess of long chromosomal tracts that are identical by descent (IBD) with populations from present-day South Siberia and Mongolia (SSM), an area where historians center a series of early Turkic and non-Turkic steppe polities. The observed excess of long chromosomal tracts IBD (> 1 cM) between populations from SSM and Turkic peoples across West Eurasia was statistically significant. Finally, we used the ALDER method and inferred admixture dates (~9th–17th centuries) that overlap with the Turkic migrations of the 5th–16th centuries. Thus, our results indicate historical admixture among Turkic peoples, and the recent shared ancestry with modern populations in SSM supports one of the hypothesized homelands for their nomadic Turkic and related Mongolic ancestors.