Bayesian Methods for Genetic Association Analysis with Heterogeneous Subgroups: from Meta-Analyses to Gene-Environment Interactions

Xiaoquan Wen, Matthew Stephens
(Submitted on 4 Nov 2011 (v1), last revised 8 Nov 2011 (this version, v2))

In genetic association analyses, it is often desired to analyze data from multiple potentially-heterogeneous subgroups. The amount of expected heterogeneity can vary from modest (as might typically be expected in a meta-analysis of multiple studies of the same phenotype, for example), to large (e.g. a strong gene-environment interaction, where the environmental exposure defines discrete subgroups). Here, we consider a flexible set of Bayesian models and priors that can capture these different levels of heterogeneity. We provide accurate numerical approaches to compute approximate Bayes Factors for these different models, and also some simple analytic forms which have natural interpretations and, in some cases, close connections with standard frequentist test statistics. These approximations also have the convenient feature that they require only summary-level data from each subgroup (in the simplest case, a point estimate for the genetic effect, and its standard error, from each subgroup). We illustrate the flexibility of these approaches on three examples: an analysis of a potential gene-environment interaction for a recombination phenotype, a large scale meta-analysis of genome-wide association data from the Global Lipids consortium, and a cross-population analysis for expression quantitative trait loci (eQTLs).
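To make the idea of a Bayes factor computed from summary-level data concrete, here is a minimal sketch. It assumes the simplest possible setting: a normal prior on the effect with standard deviation `prior_sd`, and no heterogeneity between subgroups, so the subgroup estimates are pooled by inverse-variance weighting. This is the widely used normal-approximation form, not necessarily any of the specific forms derived in the paper, and the function names and the prior scale are hypothetical.

```python
import math


def approx_bayes_factor(beta_hat, se, prior_sd):
    """Approximate Bayes factor for H1 (effect ~ N(0, prior_sd^2)) against H0 (effect = 0).

    Uses only a point estimate and its standard error, treating the
    estimate as normally distributed around the true effect.
    """
    v = se ** 2                 # sampling variance of the estimate
    w = prior_sd ** 2           # prior variance of the effect under H1
    z2 = (beta_hat / se) ** 2   # squared Wald statistic
    shrink = w / (v + w)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)


def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1.0 / s ** 2 for s in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Example: two subgroups, each reporting an effect estimate and a standard error.
pooled, pooled_se = fixed_effect_pool([0.1, 0.3], [0.1, 0.1])
bf = approx_bayes_factor(pooled, pooled_se, prior_sd=0.2)
```

Because the Bayes factor depends on the data only through the squared Wald statistic, this simple case also shows one way such a quantity can connect to a standard frequentist test statistic.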

Inference of population splits and mixtures from genome-wide allele frequency data

Joseph K. Pickrell, Jonathan K. Pritchard
(Submitted on 11 Jun 2012)

Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In this model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication, and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and “ancient” Asian breeds. Software implementing the model described here, called TreeMix, is available at this http URL
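As a rough illustration of the kind of summary that genome-wide allele frequency data provide for a model like this, the sketch below computes a sample covariance matrix of allele frequencies between populations, after centering each SNP on its mean across populations. It is a simplified sketch under stated assumptions: it ignores sample-size corrections and linkage between SNPs, it is not the TreeMix implementation, and the function name is hypothetical.

```python
def allele_frequency_covariance(freqs):
    """Covariance of allele frequencies between populations.

    freqs[i][k] is the allele frequency of SNP k in population i.
    Each SNP is centered on its mean frequency across populations,
    and the centered values are averaged over SNPs.
    """
    n_pops = len(freqs)
    n_snps = len(freqs[0])
    centered = [[0.0] * n_snps for _ in range(n_pops)]
    for k in range(n_snps):
        mean_k = sum(freqs[i][k] for i in range(n_pops)) / n_pops
        for i in range(n_pops):
            centered[i][k] = freqs[i][k] - mean_k
    return [
        [
            sum(centered[i][k] * centered[j][k] for k in range(n_snps)) / n_snps
            for j in range(n_pops)
        ]
        for i in range(n_pops)
    ]


# Toy example: three populations, two SNPs.
cov = allele_frequency_covariance([[0.1, 0.5], [0.3, 0.5], [0.2, 0.5]])
```

Populations that share more drift history have larger covariance entries, so a matrix of this kind is a natural object to compare against the covariance implied by a candidate tree or graph.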

Our paper: Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster

Casey Bergman [@caseybergman and @bergmanlab] kindly wrote a post about his recently arXived paper:
Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster
ArXived here.
As part of the Drosophila 12 Genome Project, Steve Salzberg and colleagues published a pioneering paper in 2005 showing that complete genomes of the bacterial endosymbiont Wolbachia pipientis can be extracted from the whole-genome shotgun sequence assemblies of Drosophila species. This paper always left an impression on me as a very clever way of extracting new biology from existing genomic data, and when the era of resequencing multiple strains of D. melanogaster kicked off a few years ago, it seemed like a natural extension to ask if this approach could be adapted to next-generation sequencing data to study the co-evolution of Wolbachia and Drosophila using whole genome data.

In the current work, we used short-read next generation sequencing data from two major resequencing efforts in D. melanogaster — the Drosophila Genetic Reference Panel (DGRP) and Drosophila Population Genomics Project (DPGP) — together with the reference Wolbachia genome published by Wu et al. (2005) and extracted over 175 complete Wolbachia genomes and nearly 300 complete mitochondrial genomes. Readers can find the main results in the paper, which is currently in review. I’d like to discuss here the social context of the project and some of the reasons we submitted to arXiv.

This project started out in 2010 as a summer project for a masters student, Mark Richardson, who did an amazing job developing the initial pipeline and made most of the initial discoveries in the paper. Mark and I started a collaboration with Frank Jiggins and Mike Magwire shortly after to verify that our in silico genotyping results were making sense, and they suggested bringing in Lucy Weinert and John Welch to help with the more sophisticated Bayesian phylogenetic analysis. Another PhD student in my lab, Raquel Linheiro, adapted her transposable element detection pipeline to identify particular Wolbachia sublineages, which was crucial to linking our data with previous results. This was a great collaboration, where everyone made significant contributions, and I would collaborate with everyone again (and I hope to!).

At the time (summer 2010), we only had access to the North American strains from the DGRP sample; knowing that North American D. melanogaster are derived populations, we were cautious about the impact that population structure had on our results. We planned in early 2011 to publish on only the DGRP dataset since Mark was going off to do a PhD in Australia and I didn’t have anyone else in the group working on this project. In the summer of 2011, the African DPGP data came online and I decided to take a peek and run the pipeline on the African strains as well. This led to a major overhaul of the project and set us back a year, since all the data had to be reanalyzed together and the interpretation of the biogeography results was substantially altered. This was in some ways lucky because our initial interpretation of evidence for a selective sweep on one of the cytoplasmic lineages was probably wrong, and it saved us from having to backpedal on this misinterpretation in a later publication.

As we plugged away at trying to finish this project, we had inquiries about the status of the project from several other groups working in the Wolbachia field. Honestly this stressed me out quite a bit, since some of the inquiries were coming from post-docs in big labs. But instead of just sitting on the data, after we finalized the dataset we decided to release these data openly on our lab blog in April 2012. We decided on an open release as a way to help these teams (and others we didn’t know about), but also to get some priority in this area by providing the “gold standard” that other groups could use (and cite!). For the record, I will note that we asked two teams who contacted us about our project if they would reciprocate by sharing unpublished genomic data or in one case published genomic data that was not submitted to GenBank; both declined.

After making the decision to release the data pre-publication, it was a natural step to submit the manuscript to arXiv. I’m an open science advocate and used the Nature Preprint server occasionally in the past. I never really liked the Nature Preprint server, though, since I thought people posted there to give their manuscript the stink of being “Nature (in prep)” on their CV. And I never posted to arXiv in the past, since I always thought it was for more hardcore computational or mathematical biology. But recently, I was convinced by Rosie Redfield, Leonid Kruglyak and colleagues putting their Arsenic Life paper on arXiv that more empirical work in quantitative biology was arXiv-able. And just as with releasing our data early, it seemed like the best way to prevent being scooped was to get our results out as quickly as possible and let people know about them.

So we went for it. And I have to say the experience has been thoroughly rewarding. Submitting was a piece of cake, easier than any journal I’ve ever submitted to. Having a URL to point to allowed me to tweet about it, which got the paper some exposure and got me some new colleagues on twitter. It also allowed me to send a submitted manuscript around to colleagues for informal review, without cluttering up their inboxes with big attachments or creating a moral dilemma about who they can share the manuscript with. And somehow submitting to arXiv pushed the “it’s submitted” button in my brain, which made me a whole lot less stressed about the possibility of being scooped, and I’ve been more relaxed throughout the formal submission process. Finally, I know that the pre-publication release of the data and posting of the manuscript has led to a group in Russia incorporating these sequences into their work, and I’ve just gotten a manuscript to review from this group citing our arXiv manuscript and extending our results before our paper is even published! This is what research is all about, right: doing science, getting it out, and letting others build on it. I’ll definitely submit all papers from my lab to arXiv, and look forward to the Haldane’s Sieve readership giving us a hard time about our manuscripts while they evolve into formal publications.

Casey Bergman

Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture

John E. Pool, Russell B. Corbett-Detig, Ryuichi P. Sugino, Kristian A. Stevens, Charis M. Cardeno, Marc W. Crepeau, Pablo Duchen, J. J. Emerson, Perot Saelao, David J. Begun, Charles H. Langley
(Submitted on 23 Aug 2012)

(ABRIDGED) We report the genome sequencing of 139 wild-derived strains of D. melanogaster, representing 22 population samples from the sub-Saharan ancestral range of this species, along with one European population. Most genomes were sequenced above 25X depth from haploid embryos. Results indicated a pervasive influence of non-African admixture in many African populations, motivating the development and application of a novel admixture detection method. Admixture proportions varied among populations, with greater admixture in urban locations. Admixture levels also varied across the genome, with localized peaks and valleys suggestive of a non-neutral introgression process. Genomes from the same location differed starkly in ancestry, suggesting that isolation mechanisms may exist within African populations. After removing putatively admixed genomic segments, the greatest genetic diversity was observed in southern Africa (e.g. Zambia), while diversity in other populations was largely consistent with a geographic expansion from this potentially ancestral region. The European population showed different levels of diversity reduction on each chromosome arm, and some African populations displayed chromosome arm-specific diversity reductions. Inversions in the European sample were associated with strong elevations in diversity across chromosome arms. Genomic scans were conducted to identify loci that may represent targets of positive selection. A disproportionate number of candidate selective sweep regions were located near genes with varied roles in gene regulation. Outliers for Europe-Africa FST were found to be enriched in genomic regions of locally elevated cosmopolitan admixture, possibly reflecting a role for some of these loci in driving the introgression of non-African alleles into African populations.

Detection of correlation between genotypes and environmental variables. A fast computational approach for genomewide studies

Gilles Guillot
(Submitted on 5 Jun 2012)

Genomic regions displaying outstanding correlation with some environmental variables are likely to be under selection, and this is the rationale of recent methods for identifying selected loci and retrieving functional information about them. To be efficient, such methods need to be able to disentangle the potential effect of environmental variables from the confounding effect of population history. For the routine analysis of genomewide data-sets, one also needs fast inference and model selection algorithms. We describe a method based on an explicit spatial model that builds on the theoretical and computational framework developed by Rue et al. (2009) and Lindgren et al. (2011). The method allows one to quantify correlation between genotypes and environmental variables and to rank loci accordingly. It works for SNP and AFLP data obtained either at the individual or at the population level. We provide R scripts with detailed comments that can be used readily for the analysis of real data without specific prior knowledge of the R language.

Lineage-specific transposons drove massive gene expression recruitments during the evolution of pregnancy in mammals

Vincent J. Lynch, Mauris Nnamani, Kathryn J. Brayer, Deena Emera, Joel O. Wertheim, Sergei L. Kosakovsky Pond, Frank Grützner, Stefan Bauersachs, Alexander Graf, Aurélie Kapusta, Cédric Feschotte, Günter P. Wagner
(Submitted on 22 Aug 2012)

A major challenge in biology is explaining how novel characters originate; however, the molecular mechanisms that underlie the emergence of evolutionary innovations are unclear. Here we show that while gene expression in the uterus evolves at a slow and relatively constant rate, it has been punctuated by periods of rapid change associated with the recruitment of thousands of genes into uterine expression during the evolution of pregnancy in mammals. We found that numerous genes and signaling pathways essential for the establishment of pregnancy and maternal-fetal communication evolved uterine expression in mammals. Remarkably, the majority of genes recruited into endometrial expression have cis-regulatory elements derived from lineage-specific transposons, suggesting that bursts of transposition facilitate adaptation and speciation through genomic and regulatory reorganization.

Blood ties: ABO is a trans-species polymorphism in primates

Laure Ségurel, Emma E. Thompson, Timothée Flutre, Jessica Lovstad, Aarti Venkat, Susan W. Margulis, Jill Moyse, Steve Ross, Kathryn Gamble, Guy Sella, Carole Ober, Molly Przeworski
(Submitted on 22 Aug 2012)

The ABO histo-blood group, the critical determinant of transfusion incompatibility, was the first genetic polymorphism discovered in humans. Remarkably, ABO antigens are also polymorphic in many other primates, with the same two amino acid changes responsible for A and B specificity in all species sequenced to date. Whether this recurrence of A and B antigens is the result of an ancient polymorphism maintained across species or due to numerous, more recent instances of convergent evolution has been debated for decades, with a current consensus in support of convergent evolution. We show instead that genetic variation data in humans and gibbons as well as in Old World Monkeys are inconsistent with a model of convergent evolution and support the hypothesis of an ancient, multi-allelic polymorphism of which some alleles are shared by descent among species. These results demonstrate that the ABO polymorphism is a trans-species polymorphism among distantly related species and has remained under balancing selection for tens of millions of years, to date, the only such example in Hominoids and Old World Monkeys outside of the Major Histocompatibility Complex.

Our paper: The Genomic Signature of Crop-Wild Introgression in Maize

Our inaugural author post is by Matt Hufford and Jeff Ross-Ibarra [@lab_ri] on their paper:
The Genomic Signature of Crop-Wild Introgression in Maize
ArXived here.

Evolutionary biologists have long been fascinated by introgressive hybridization. Numerous examples in which introgression has played an important evolutionary role are known, but genetic characterization has typically focused on only a handful of loci.

We took advantage of the recent development of inexpensive genotyping to address a long-standing question of introgression in maize evolution. Maize was domesticated in the warm low elevations of southwest Mexico, and likely colonized the highlands of central Mexico only thousands of years later. Maize is frequently cultivated in sympatry with its wild relatives the teosintes and is known to hybridize with them. Hybridization is especially common in the highlands, where maize and teosinte share several derived morphological features thought to be adaptive to high elevation.

We set out to discover the genomic extent of introgression in highland maize and teosinte populations and the degree to which this has been adaptive. We genotyped 9 sympatric population pairs of maize and teosinte at ~39,000 SNPs. We used two different algorithms (in the software STRUCTURE and HAPMIX) to model chromosomes as mosaics of maize and teosinte, and characterized regions of putative introgression. Surprisingly, we found shared regions of introgression across many populations and primarily only from teosinte into maize. To test whether this introgression may have facilitated maize adaptation to the highlands, we conducted a growth chamber experiment that revealed significant differences in putatively adaptive morphological traits between maize populations with and without introgression.

We submitted the paper to arXiv because this is a fast-moving area for empirical evolutionary genomics and we hoped to start the dialogue early on how to move forward with our results. We’d like feedback on the paper and specifically the following questions:

Are there recent advances in modeling admixture and introgression that we should apply?

Are our main findings surprising considering the putative history of maize diffusion?

Matt Hufford and Jeff Ross-Ibarra

The variance of identity-by-descent sharing in the Wright-Fisher model

Shai Carmi, Pier Francesco Palamara, Vladimir Vacic, Todd Lencz, Ariel Darvasi, Itsik Pe’er
(Submitted on 21 Jun 2012)

Widespread sharing of long, identical-by-descent (IBD) genetic segments is a hallmark of populations that have experienced a recent bottleneck. The detection of these IBD segments is now feasible, enabling a wide range of applications from phasing and imputation to demographic inference. Here, we study the distribution of IBD sharing in the Wright-Fisher model. Using coalescent theory, we calculate the mean and variance of the total sharing between arbitrary pairs of individuals. We then study the cohort-averaged sharing: the average total sharing between one individual and the rest of the cohort. We find that for large cohorts, the cohort-averaged sharing is distributed approximately normally. Surprisingly, the variance of this distribution remains large even for large cohorts, implying the existence of “hyper-sharing” individuals. The presence of such individuals bears important consequences for the design of sequencing studies, since, if they are selected for whole-genome sequencing, a larger fraction of the cohort can be subsequently imputed. We calculate the expected gain in power of imputation by IBD, and subsequently, in power to detect an association, when individuals are either randomly selected or are specifically the hyper-sharing individuals. Finally, we study the distribution of pairwise sharing and cohort-averaged sharing in the Ashkenazi Jewish population.
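The cohort-averaged sharing defined above is easy to compute once pairwise totals are available. The sketch below assumes a symmetric matrix `pairwise` whose entry `[i][j]` is the total IBD sharing between individuals `i` and `j` (in any consistent unit); the function names and the toy numbers are hypothetical, and ranking by cohort-averaged sharing is just one simple way to prioritize candidates for sequencing.

```python
def cohort_averaged_sharing(pairwise):
    """Average total sharing between each individual and the rest of the cohort.

    pairwise[i][j] is the total IBD sharing between individuals i and j;
    the diagonal is ignored.
    """
    n = len(pairwise)
    return [
        sum(pairwise[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def rank_by_sharing(pairwise):
    """Indices of individuals, ordered from highest to lowest cohort-averaged sharing."""
    averaged = cohort_averaged_sharing(pairwise)
    return sorted(range(len(averaged)), key=lambda i: averaged[i], reverse=True)


# Toy cohort of three individuals.
toy = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
]
averaged = cohort_averaged_sharing(toy)
order = rank_by_sharing(toy)
```

Individuals at the front of `order` are the ones that would play the role of “hyper-sharing” individuals in a selection scheme of this kind.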

Integrated analysis of variants and pathways in genome-wide association studies using polygenic models of disease

Peter Carbonetto, Matthew Stephens
(Submitted on 21 Aug 2012)

Many common diseases are highly polygenic, modulated by a large number of genetic factors with small effects on susceptibility to disease. These small effects are difficult to map reliably in genetic association studies. To address this problem, researchers have developed methods that aggregate information over sets of related genes, such as biological pathways, to identify gene sets that are enriched for genetic variants associated with disease. However, these methods fail to answer a key question: which genes and genetic variants are associated with disease risk? We develop a method based on sparse multiple regression that simultaneously identifies enriched pathways, and prioritizes the variants within these pathways, to locate additional variants associated with disease susceptibility. A central feature of our approach is an estimate of the strength of enrichment, which yields a coherent way to prioritize variants in enriched pathways. We illustrate the benefits of our approach in a genome-wide association study of Crohn’s disease with ~440,000 genetic variants genotyped for ~4700 study subjects. We obtain strong support for enrichment of IL-12, IL-23 and other cytokine signaling pathways. Furthermore, prioritizing variants in these enriched pathways yields support for additional disease-association variants, all of which have been independently reported in other case-control studies for Crohn’s disease.
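One way to picture how a strength-of-enrichment estimate can prioritize variants is to let it shift the prior odds that a variant is included in the regression. The sketch below assumes a log10-odds parameterization with a genome-wide baseline `theta0` and an enrichment parameter `theta` added for variants in the pathway. This parameterization, the function name, and the example values are assumptions made for illustration, not details taken from the abstract.

```python
def prior_inclusion_probability(in_pathway, theta0, theta):
    """Prior probability that a variant has a nonzero effect.

    theta0 is the baseline log10 prior odds of inclusion, and theta is
    the extra log10 odds for variants in the enriched pathway.
    """
    log10_odds = theta0 + (theta if in_pathway else 0.0)
    odds = 10.0 ** log10_odds
    return odds / (1.0 + odds)


# With baseline odds of 1 in 10,000 and a 100-fold enrichment:
outside = prior_inclusion_probability(False, theta0=-4.0, theta=2.0)
inside = prior_inclusion_probability(True, theta0=-4.0, theta=2.0)
```

Under this sketch, a variant with modest evidence of association gets more posterior support when it lies in an enriched pathway, because it starts from higher prior odds.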