Scalable Genomics with R and Bioconductor

Scalable Genomics with R and Bioconductor
Michael Lawrence, Martin Morgan
Journal-ref: Statistical Science 2014, Vol. 29, No. 2, 214-226
Subjects: Genomics (q-bio.GN); Distributed, Parallel, and Cluster Computing (cs.DC)

This paper reviews strategies for solving problems encountered when analyzing large genomic data sets and describes the implementation of those strategies in R by packages from the Bioconductor project. We treat the scalable processing, summarization and visualization of big genomic data. The general ideas are well established and include restrictive queries, compression, iteration and parallel computing. We demonstrate the strategies by applying Bioconductor packages to the detection and analysis of genetic variants from a whole genome sequencing experiment.

Genome-wide characterization of RNA editing in chicken: lack of evidence for non-A-to-I events

Genome-wide characterization of RNA editing in chicken: lack of evidence for non-A-to-I events

Laure Frésard, Sophie Leroux, Pierre-François Roux, C Klopp, Stéphane Fabre, Diane Esquerré, Patrice Dehais, Anis Djari, David Gourichon, Sandrine Lagarrigue, Frédérique Pitel
doi: http://dx.doi.org/10.1101/008912

RNA editing corresponds to a post-transcriptional nucleotide change in the RNA sequence, creating an alternative nucleotide, not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is more and more described in mammals, notably since the availability of next generation sequencing technologies allowing a whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken are still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, to understand to what extent this phenomenon is conserved in vertebrates. We performed RNA and DNA sequencing from 8 embryos. Being aware of common pitfalls inherent to sequence analyses leading to false positive discovery, we stringently filtered our datasets and found less than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual editing level increase with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, and provide support of an absence of non A-to-I events from the chicken transcriptome.

Accurate Liability Estimation Substantially Improves Power in Ascertained Case Control Studies

Accurate Liability Estimation Substantially Improves Power in Ascertained Case Control Studies

Omer Weissbrod, Christoph Lippert, Dan Geiger, David Heckerman
(Submitted on 8 Sep 2014)

Future genome wide association studies (GWAS) of diseases will include hundreds of thousands of individuals in order to detect risk variants with small effect sizes. Such samples are susceptible to confounding, which can lead to spurious results. Recently, linear mixed models (LMMs) have emerged as the method of choice for GWAS, due to their robustness to confounding. However, the performance of LMMs in case-control studies deteriorates with increasing sample size, resulting in reduced power. This loss of power can be remedied by transforming observed case-control status to liability space, wherein each individual is assigned a score corresponding to severity of phenotype. We propose a novel method for estimating liabilities, and demonstrate that testing for associations with estimated liabilities by way of an LMM leads to a substantial power increase. The proposed framework enables testing for association in ascertained case-control studies, without suffering from reduced power, while remaining resilient to confounding. Extensive experiments on synthetic and real data demonstrate that the proposed framework can lead to an average increase of over 20 percent for test statistics of causal variants, thus dramatically improving GWAS power.

Biallelic Mutation-Drift Diffusion in the Limit of Small Scaled Mutation Rates

Biallelic Mutation-Drift Diffusion in the Limit of Small Scaled Mutation Rates

Claus Vogl
(Submitted on 8 Sep 2014)

The evolution of the allelic proportion x of a biallelic locus subject to the forces of mutation and drift is investigated in a diffusion model, assuming small scaled mutation rates. The overall scaled mutation rate is parametrized with θ=(μ1+μ0)N and the ratio of mutation rates with α=μ1/(μ1+μ0)=1−β. The equilibrium density of this process is beta with parameters αθ and βθ. Away from equilibrium, the transition density can be expanded into a series of modified Jacobi polynomials. If the scaled mutation rates are small, i.e., θ≪1, it may be assumed that polymorphism derives from mutations at the boundaries. A model, where the interior dynamics conform to the pure drift diffusion model and the mutations are entering from the boundaries is derived. In equilibrium, the density of the proportion of polymorphic alleles, \ie\ x within the polymorphic region [1/N,1−1/N], is αβθ(1x+11−x)=αβθx(1−x), while the mutation bias α influences the proportion of monomorphic alleles at 0 and 1. Analogous to the expansion with modified Jacobi polynomials, a series expansion of the transition density is derived, which is connected to Kimura’s well known solution of the pure drift model using Gegenbauer polynomials. Two temporal and two spatial regions are separated. The eigenvectors representing the spatial component within the polymorphic region depend neither on the on the scaled mutation rate θ nor on the mutation bias α. Therefore parameter changes, e.g., growing or shrinking populations or changes in the mutation bias, can be modeled relatively easily, without the change of the eigenfunctions necessary for the series expansion with Jacobi polynomials.

Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

Likelihood-based tree reconstruction on a concatenation of alignments can be positively misleading

Sebastien Roch, Mike Steel
(Submitted on 6 Sep 2014)

The reconstruction of a species tree from genomic data faces a double hurdle. First, the (gene) tree describing the evolution of each gene may differ from the species tree, for instance, due to incomplete lineage sorting. Second, the aligned genetic sequences at the leaves of each gene tree provide merely an imperfect estimate of the topology of the gene tree. In this note, we demonstrate formally that a basic statistical problem arises if one tries to avoid accounting for these two processes and analyses the genetic data directly via a concatenation approach. More precisely, we show that, under the multi-species coalescent with a standard site substitution model, maximum likelihood estimation on sequence data that has been concatenated across genes and performed under the incorrect assumption that all sites have evolved independently and identically on a fixed tree is a statistically inconsistent estimator of the species tree. Our results provide a formal justification of simulation results described of Kubatko and Degnan (2007) and others, and complements recent theoretical results by DeGorgio and Degnan (2010) and Chifman and Kubtako (2014).

Functional analysis and co-evolutionary model of chromatin and DNA methylation networks in embryonic stem cells

Functional analysis and co-evolutionary model of chromatin and DNA methylation networks in embryonic stem cells
Enrique Carrillo de Santa Pau, Juliane Perner, David Juan, Simone Marsili, David Ochoa, Ho-Ryun Chung, Daniel Rico, Martin Vingron, Alfonso Valencia
doi: http://dx.doi.org/10.1101/008821
We have analyzed publicly available epigenomic data of mouse embryonic stem cells (ESCs) combining diverse next-generation sequencing (NGS) studies (139 experiments from 30 datasets with a total of 77 epigenomic features) into a homogeneous dataset comprising various cytosine modifications (5mC, 5hmC and 5fC), histone marks and Chromatin related Proteins (CrPs). We applied a set of newly developed statistical analysis methods with the goal of understanding the associations between chromatin states, detecting co-occurrence of DNA-protein binding and epigenetic modification events, as well as detecting coevolution of core CrPs. The resulting networks reveal the complex relations between cytosine modifications and protein complexes and their dependence on defined ESC chromatin contexts. A detailed analysis allows us to detect proteins associated to particular chromatin states whose functions are related to the different cytosine modifications, i.e. RYBP with 5fC and 5hmC, NIPBL with 5hmC and OGT with 5hmC. Moreover, in a co-evolutionary analysis suggesting a central role of the Cohesin complex in the evolution of the epigenomic network, as well as strong co-evolutionary links between proteins that co-locate in the ESC epigenome with DNA methylation (MBD2 and CBX3) and hydroxymethylation (TET1 and KDM2A). In summary, the new application of computational methodologies reveals the complex network of relations between cytosine modifications and epigenomic players that is essential in shaping the molecular state of ESCs.

Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics

Generation of a Panel of Induced Pluripotent Stem Cells From Chimpanzees: a Resource for Comparative Functional Genomics
Irene Gallego Romero, Bryan J Pavlovic, Irene Hernando-Herraez, Nicholas E Banovich, Courtney L Kagan, Jonathan E Burnett, Constance H Huang, Amy Mitrano, Claudia I Chavarria, Inbar F Ben-Nun, Yingchun Li, Karen Sabatini, Trevor Leonardo, Mana Parast, Tomas Marques-Bonet, Louise C Laurent, Jeanne F Loring, Yoav Gilad
doi: http://dx.doi.org/10.1101/008862

Comparative genomics studies in primates are extremely restricted because we only have access to a few types of cell lines from non-human apes and to a limited collection of frozen tissues. In order to gain better insight into regulatory processes that underlie variation in complex phenotypes, we must have access to faithful model systems for a wide range of tissues and cell types. To facilitate this, we have generated a panel of 7 fully characterized chimpanzee (Pan troglodytes) induced pluripotent stem cell (iPSC) lines derived from fibroblasts of healthy donors. All lines appear to be free of integration from exogenous reprogramming vectors, can be maintained using standard iPSC culture techniques, and have proliferative and differentiation potential similar to human and mouse lines. To begin demonstrating the utility of comparative iPSC panels, we collected RNA sequencing data and methylation profiles from the chimpanzee iPSCs and their corresponding fibroblast precursors, as well as from 7 human iPSCs and their precursors, which were of multiple cell type and population origins. Overall, we observed much less regulatory variation within species in the iPSCs than in the somatic precursors, indicating that the reprogramming process has erased many of the differences observed between somatic cells of different origins. We identified 4,918 differentially expressed genes and 3,598 differentially methylated regions between iPSCs of the two species, many of which are novel inter-species differences that were not observed between the somatic cells of the two species. Our panel will help realise the potential of iPSCs in primate studies, and in combination with genomic technologies, transform studies of comparative evolution.

Mixed Model with Correction for Case-Control Ascertainment Increases Association Power

Mixed Model with Correction for Case-Control Ascertainment Increases Association Power

tristan hayeck, Noah Zaitlen, Po-Ru Loh, Bjarni Vilhjalmsson, Samuela Pollack, Alexander Gusev, Jian Yang, Guo-Bo Chen, Michael E. Goddard, Peter M. Visscher, Nick Patterson, Alkes Price
doi: http://dx.doi.org/10.1101/008755

We introduce a Liability Threshold Mixed Linear Model (LTMLM) association statistic for ascertained case-control studies that increases power vs. existing mixed model methods, with a well-controlled false-positive rate. Recent work has shown that existing mixed model methods suffer a loss in power under case-control ascertainment, but no solution has been proposed. Here, we solve this problem using a chi-square score statistic computed from posterior mean liabilities (PML) under the liability threshold model. Each individual’s PML is conditional not only on that individual’s case-control status, but also on every individual’s case-control status and on the genetic relationship matrix obtained from the data. The PML are estimated using a multivariate Gibbs sampler, with the liability-scale phenotypic covariance matrix based on the genetic relationship matrix (GRM) and a heritability parameter estimated via Haseman-Elston regression on case-control phenotypes followed by transformation to liability scale. In simulations of unrelated individuals, the LTMLM statistic was correctly calibrated and achieved higher power than existing mixed model methods in all scenarios tested, with the magnitude of the improvement depending on sample size and severity of case-control ascertainment. In a WTCCC2 multiple sclerosis data set with >10,000 samples, LTMLM was correctly calibrated and attained a 4.1% improvement (P=0.007) in chi-square statistics (vs. existing mixed model methods) at 75 known associated SNPs, consistent with simulations. Larger increases in power are expected at larger sample sizes. In conclusion, an increase in power over existing mixed model methods is available for ascertained case-control studies of diseases with low prevalence.

Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study

Cross-population Meta-analysis of eQTLs: Fine Mapping and Functional Study

Xiaoquan Wen, Francesca Luca, Roger Pique-Regi
doi: http://dx.doi.org/10.1101/008797

Mapping expression quantitative trait loci (eQTLs) has been shown as a powerful tool to uncover the genetic underpinnings of many complex traits at the molecular level. In this paper, we present an integrative analysis approach that leverages eQTL data collected from multiple population groups. In particular, our approach effectively identifies multiple independent cis-eQTL signals that are consistently presented across populations, accounting for population heterogeneity in allele frequencies and linkage disequilibrium patterns. Furthermore, by integrating genomic annotations, our analysis framework enables high-resolution functional analysis of eQTLs. We applied our statistical approach to analyze the GEUVADIS data consisting of samples from five population groups. From this analysis, we concluded that i) jointly analysis across population groups greatly improves the power of eQTL discovery and the resolution of fine mapping of causal eQTL. ii) many genes harbor multiple independent eQTLs in their cis regions iii) genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs (p-value = 4.93 × 10-22).

The projection of a test genome onto a reference population and applications to humans and archaic hominins

The projection of a test genome onto a reference population and applications to humans and archaic hominins

Melinda A Yang, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/008805

We introduce a method for comparing a test genome with numerous genomes from a reference population. Sites in the test genome are given a weight w that depends on the allele frequency x in the reference population. The projection of the test genome onto the reference population is the average weight for each x, w(x). The weight is assigned in such a way that if the test genome is a random sample from the reference population, w(x)=1. Using analytic theory, numerical analysis, and simulations, we show how the projection depends on the time of population splitting, the history of admixture and changes in past population size. The projection is sensitive to small amounts of past admixture, the direction of admixture and admixture from a population not sampled (a ghost population). We compute the projection of several human and two archaic genomes onto three reference populations from the 1000 Genomes project, Europeans (CEU), Han Chinese (CHB) and Yoruba (YRI) and discuss the consistency of our analysis with previously published results for European and Yoruba demographic history. Including higher amounts of admixture between Europeans and Yoruba soon after their separation and low amounts of admixture more recently can resolve discrepancies between the projections and demographic inferences from some previous studies.