Testing for genetic associations in arbitrarily structured populations
Minsun Song, Wei Hao, John D. Storey
We present a new statistical test of association between a trait (either quantitative or binary) and genetic markers, which we theoretically and practically prove to be robust to arbitrarily complex population structure. The statistical test involves a set of parameters that can be directly estimated from large-scale genotyping data, such as that measured in genome-wide association studies (GWAS). We also derive a new set of methodologies, called a genotype-conditional association test (GCAT), shown to provide accurate association tests in populations with complex structures, manifested in both the genetic and environmental contributions to the trait. We demonstrate the proposed method on a large simulation study and on the Northern Finland Birth Cohort study. In the Finland study, we identify several new significant loci that other methods do not detect. Our proposed framework provides a substantially different approach to the problem from existing methods. We provide some discussion on its similarities and differences with the linear mixed model and principal component approaches.
Explicit modeling of ancestry improves polygenic risk scores and BLUP prediction
Chia-Yen Chen, Jiali Han, David J. Hunter, Peter Kraft, Alkes L. Price
Polygenic prediction using genome-wide SNPs can provide high prediction accuracy for complex traits. Here, we investigate the question of how to account for genetic ancestry when conducting polygenic prediction. We show that the accuracy of polygenic prediction in structured populations may be partly due to genetic ancestry. However, we hypothesized that explicitly modeling ancestry could improve polygenic prediction accuracy. We analyzed three GWAS of hair color, tanning ability and basal cell carcinoma (BCC) in European Americans (sample size from 7,440 to 9,822) and considered two widely used polygenic prediction approaches: polygenic risk scores (PRS) and Best Linear Unbiased Prediction (BLUP). We compared polygenic prediction without correction for ancestry to polygenic prediction with ancestry as a separate component in the model. In 10-fold cross-validation using the PRS approach, the R² for hair color increased by 66% (0.0456 to 0.0755; p < 10⁻¹⁶), the R² for tanning ability increased by 123% (0.0154 to 0.0344; p < 10⁻¹⁶) and the liability-scale R² for BCC increased by 68% (0.0138 to 0.0232; p < 10⁻¹⁶) when explicitly modeling ancestry, which prevents ancestry effects from entering into each SNP effect and being over-weighted. Surprisingly, explicitly modeling ancestry produces a similar improvement when using the BLUP approach, which fits all SNPs simultaneously in a single variance component and causes ancestry to be under-weighted. We validate our findings via simulations, which show that the differences in prediction accuracy will increase in magnitude as sample sizes increase. In summary, our results show that explicitly modeling ancestry can be important in both PRS and BLUP prediction.
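The relative gains quoted in this abstract can be checked directly from the before and after values it reports. The sketch below is only an arithmetic check, not part of the authors' analysis; the function name `relative_increase` and the dictionary layout are illustrative assumptions.

```python
def relative_increase(before: float, after: float) -> float:
    """Return the percentage increase going from `before` to `after`."""
    return (after - before) / before * 100.0


# (before, after) prediction accuracies as reported in the abstract,
# without and with ancestry modeled as a separate component (PRS approach).
reported = {
    "hair color": (0.0456, 0.0755),
    "tanning ability": (0.0154, 0.0344),
    "BCC (liability scale)": (0.0138, 0.0232),
}

# Round to whole percentages, matching the precision used in the abstract.
gains = {
    trait: round(relative_increase(before, after))
    for trait, (before, after) in reported.items()
}

for trait, pct in gains.items():
    print(f"{trait}: +{pct}%")
```

The rounded values agree with the 66%, 123% and 68% increases stated in the abstract.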
A multivariate genome-wide association analysis of 10 LDL subfractions, and their response to statin treatment, in 1868 Caucasians
Heejung Shim, Daniel I Chasman, Joshua D Smith, Samia Mora, Paul M Ridker, Deborah A Nickerson, Ronald M Krauss, Matthew Stephens
We conducted a genome-wide association analysis of 7 subfractions of low density lipoproteins (LDLs) and 3 subfractions of intermediate density lipoproteins (IDLs) measured by gradient gel electrophoresis, and their response to statin treatment, in 1868 individuals of European ancestry from the Pharmacogenomics and Risk of Cardiovascular Disease study. Our analyses identified four previously implicated loci (SORT1, APOE, LPA, and CETP) as containing variants that are very strongly associated with lipoprotein subfractions (log10 Bayes Factor > 15). Subsequent conditional analyses suggest that three of these (APOE, LPA and CETP) likely harbor multiple independently associated SNPs. Further, while different variants typically showed different characteristic patterns of association with combinations of subfractions, the two SNPs in CETP show strikingly similar patterns – both in our original data and in a replication cohort – consistent with a common underlying molecular mechanism. Notably, the CETP variants are very strongly associated with LDL subfractions, despite showing no association with total LDLs in our study, illustrating the potential value of the more detailed phenotypic measurements. In contrast with these strong subfraction associations, genetic association analysis of subfraction response to statins showed much weaker signals (none exceeding log10 Bayes Factor of 6). However, two SNPs (in APOE and LPA) previously reported to be associated with LDL statin response do show some modest evidence for association in our data, and the subfraction response profiles at the LPA SNP are consistent with the LPA association, with response likely being due primarily to resistance of Lp(a) particles to statin therapy. An additional important feature of our analysis is that, unlike most previous analyses of multiple related phenotypes, we analyzed the subfractions jointly, rather than one at a time. Comparisons of our multivariate analyses with standard univariate analyses demonstrate that multivariate analyses can substantially increase power to detect associations. Software implementing our multivariate analysis methods is available at http://stephenslab.uchicago.edu/software.html.
Exploring the phenotypic space and the evolutionary history of a natural mutation in Drosophila melanogaster
Anna Ullastres, Natalia Petit, Josefa González
A major challenge of modern biology is elucidating the functional consequences of natural mutations. While we have a good understanding of the effects of lab-induced mutations on the molecular- and organismal-level phenotypes, the study of natural mutations has lagged behind. In this work, we explore the phenotypic space and the evolutionary history of a previously identified adaptive transposable element insertion. We first combined several tests that capture different signatures of selection to show that there is evidence of positive selection in the regions flanking the FBti0019386 insertion. We then explored several phenotypes related to known phenotypic effects of nearby genes, and having plausible connections to fitness variation in nature. We found that flies with the FBti0019386 insertion had a shorter developmental time and were more sensitive to stress, which are likely to be the adaptive effect and the cost of selection of this mutation, respectively. Interestingly, these phenotypic effects are not consistent with a role of FBti0019386 in temperate adaptation as has been previously suggested. Indeed, a global analysis of the population frequency of FBti0019386 showed that clinal frequency patterns are found in North America and Australia but not in Europe. Finally, we showed that FBti0019386 is associated with down-regulation of sra, most likely because it induces the formation of heterochromatin by recruiting HP1a protein. Overall, our integrative approach allowed us to shed light on the evolutionary history, the relevant fitness effects and the likely molecular mechanisms of an adaptive mutation and highlights the complexity of natural genetic variants.
Genome-wide association study of carbon and nitrogen metabolism in the maize nested association mapping population
Nengyi Zhang, Yves Gibon, Nicholas Lepak, Pinghua Li, Lauren Dedow, Charles Chen, Yoon-Sup So, Jason Wallace, Karl Kremling, Peter Bradbury, Thomas Brutnell, Mark Stitt, Edward Buckler
Carbon (C) and nitrogen (N) metabolism are critical to plant growth and development and at the basis of yield and adaptation. We have applied high throughput metabolite analyses to over 12,000 diverse field grown samples from the maize nested association mapping population. This allowed us to identify natural variation controlling the levels of twelve key C and N metabolites, often with single gene resolution. In addition to expected genes like invertases, critical natural variation was identified in key C4 metabolism genes like carbonic anhydrases and a malate transporter. Unlike prior maize studies, extensive pleiotropy was found for C and N metabolites. This integration of field-derived metabolite data with powerful mapping and genomics resources allows dissection of key metabolic pathways, providing avenues for future genetic improvement.
A 22,403 marker composite genetic linkage map for cassava (Manihot esculenta Crantz) derived from ten populations
International Cassava Genetic Map Consortium
Cassava (Manihot esculenta Crantz) is a major staple crop in Africa, Asia, and South America, and its starchy roots provide nourishment for 800 million people worldwide. Although native to South America, cassava was brought to Africa approximately 400 years ago and is now widely cultivated across sub-Saharan Africa. The widespread use of clonal planting material, however, aids the spread of disease. Breeding for disease resistance and improved yield began in the 1920s and has accelerated in the last 45 years. To assist in the rapid identification of markers for pathogen resistance and crop traits, and to accelerate breeding programs, we generated a framework map for M. esculenta Crantz derived from reduced representation sequencing (genotyping-by-sequencing [GBS]). The composite 2,412 cM map integrates ten biparental maps (comprising 3,480 meioses) and organizes 22,403 genetic markers on 18 chromosomes, in agreement with the observed karyotype. The map anchors 71.9% of the draft genome assembly and 90.7% of the predicted protein-coding genes. The resulting chromosome-anchored genome sequence provides an essential framework for identification of trait markers and causal genes as well as genomics-enhanced breeding of this important crop.
DNA methylation variation in Arabidopsis has a genetic basis and shows evidence of local adaptation
Manu J. Dubin, Pei Zhang, Dazhe Meng, Marie-Stanislas Remigereau, Edward J. Osborne, Francesco Paolo Casale, Phillip Drewe, André Kahles, Bjarni Vilhjálmsson, Joanna Jagoda, Selen Irez, Viktor Voronin, Qiang Song, Quan Long, Gunnar Rätsch, Oliver Stegle, Richard M. Clark, Magnus Nordborg
(Submitted on 21 Oct 2014)
Epigenome modulation in response to the environment potentially provides a mechanism for organisms to adapt, both within and between generations. However, neither the extent to which this occurs, nor the molecular mechanisms involved are known. Here we investigate DNA methylation variation in Swedish Arabidopsis thaliana accessions grown at two different temperatures. Environmental effects on DNA methylation were limited to transposons, where CHH methylation was found to increase with temperature. Genome-wide association mapping revealed that the extensive CHH methylation variation was strongly associated with genetic variants in both cis and trans, including a major trans-association close to the DNA methyltransferase CMT2. Unlike CHH methylation, CpG gene body methylation (GBM) on the coding region of genes was not affected by growth temperature, but was instead strongly correlated with the latitude of origin. Accessions from colder regions had higher levels of GBM for a significant fraction of the genome, and this was correlated with elevated transcription levels for the genes affected. Genome-wide association mapping revealed that this effect was largely due to trans-acting loci, a significant fraction of which showed evidence of local adaptation. These findings constitute the first direct link between DNA methylation and adaptation to the environment, and provide a basis for further dissecting how environmentally driven and genetically determined epigenetic variation interact and influence organismal fitness.