Joint analysis of functional genomic data and genome-wide association studies of 18 human traits

Joseph Pickrell

Annotations of gene structures and regulatory elements can inform genome-wide association studies (GWAS). However, choosing the relevant annotations for interpreting an association study of a given trait remains challenging. We describe a statistical model that uses association statistics computed across the genome to identify classes of genomic element that are enriched or depleted for loci that influence a trait. The model naturally incorporates multiple types of annotations. We applied the model to GWAS of 18 human traits, including red blood cell traits, platelet traits, glucose levels, lipid levels, height, BMI, and Crohn’s disease. For each trait, we evaluated the relevance of 450 different genomic annotations, including protein-coding genes, enhancers, and DNase-I hypersensitive sites in over a hundred tissues and cell lines. We show that the fraction of phenotype-associated SNPs that influence protein sequence ranges from around 2% (for platelet volume) up to around 20% (for LDL cholesterol); that repressed chromatin is significantly depleted for SNPs associated with several traits; and that cell type-specific DNase-I hypersensitive sites are enriched for SNPs associated with several traits (for example, fibroblasts in Crohn’s disease and muscle tissue in bone density). Finally, by re-weighting each GWAS using information from functional genomics, we increase the number of loci with high-confidence associations by around 5%.

Data Mining of Online Genealogy Datasets for Revealing Lifespan Patterns in Human Population

Michael Fire, Yuval Elovici
(Submitted on 18 Nov 2013)

Online genealogy datasets contain extensive information about millions of people and their past and present family connections. This vast amount of data can assist in identifying various patterns in human population. In this study, we present methods and algorithms which can assist in identifying variations in lifespan distributions of human population in the past centuries, in detecting social and genetic features which correlate with human lifespan, and in constructing predictive models of human lifespan based on various features which can easily be extracted from genealogy datasets.
We have evaluated the presented methods and algorithms on a large online genealogy dataset with over a million profiles and over 8.8 million connections, all of which were collected from the WikiTree website. Our findings indicate that significant but small positive correlations exist between the parents’ lifespan and their children’s lifespan. Additionally, we found slightly higher and significant correlations between the lifespans of spouses. We also discovered a very small positive and significant correlation between longevity and reproductive success in males, and a small and significant negative correlation between longevity and reproductive success in females. Moreover, our machine learning algorithms presented better than random classification results in predicting which people who outlive the age of 50 will also outlive the age of 80.
We believe that this study will be the first of many studies which utilize the wealth of data on human populations, existing in online genealogy datasets, to better understand factors which influence human lifespan. Understanding these factors can assist scientists in providing solutions for successful aging.

The evolution of sex differences in disease genetics

William P Gilks, Jessica K Abbott, Edward H Morrow
There are significant differences in the biology of males and females, ranging from biochemical pathways to behavioural responses, which are relevant to modern medicine. Broad-sense heritability estimates differ between the sexes for many common medical disorders, indicating that genetic architecture can be sex-dependent. Recent genome-wide association studies (GWAS) have successfully identified sex-specific and sex-biased effects, where in addition to sex-specific effects on gene expression, twenty-two medical traits have sex-specific or sex-biased loci. Sex-specific genetic architecture of complex traits is also extensively documented in model organisms using genome-wide linkage or association mapping, and in gene disruption studies. The evolutionary origins of sex-specific genetic architecture and sexual dimorphism lie in the fact that males and females share most of their genetic variation yet experience different selection pressures. At the extreme is sexual antagonism, where selection on an allele acts in opposite directions between the sexes. Sexual antagonism has been repeatedly identified via a number of experimental methods in a range of different taxa. Although the molecular basis remains to be identified, mathematical models predict the maintenance of deleterious variants that experience selection in a sex-dependent manner. There are multiple mechanisms by which sexual antagonism and alleles under sex-differential selection could contribute toward the genetics of common, complex disorders. The evidence we review clearly indicates that further research into sex-dependent selection and the sex-specific genetic architecture of diseases would be rewarding. This would be aided by studies of laboratory and wild animal populations, and by modelling sex-specific effects in genome-wide association data with joint, gene-by-sex interaction tests. We predict that even sexually monomorphic diseases may harbour cryptic sex-specific genetic architecture. Furthermore, empirical evidence suggests that investigating sex-dependent epistasis may be especially rewarding. Finally, the prevalent nature of sex-specific genetic architecture in disease offers scope for the development of more effective, sex-specific therapies.

Mutant epigenetic machinery mediates climate adaptation in Arabidopsis thaliana

Xia Shen, Simon Forsberg, Mats Pettersson, Zheya Sheng, Orjan Carlborg
(Submitted on 16 Oct 2013)

The genetic basis of adaptation to climate is largely unknown. We explored the genetic regulation of climate plasticity and its contribution to adaptation using publicly available data from two collections of natural Arabidopsis thaliana accessions from a wide range of habitats. Sixteen loci with plastic alleles were mapped and many of these contained candidate genes with amino acid changes. The Chromomethylase 2 (CMT2) genotype influenced adaptation to seasonal temperature variability and accessions carrying a mutant CMT2 allele disrupting the genome-wide CHH-methylation pattern displayed a more plastic response to climate. We conclude that genetic regulation of plasticity appears to be important for climate adaptation and that genetic variation in the epigenetic machinery, leading to altered genome-wide epigenetic modifications, is one of the underlying molecular mechanisms.

forqs: Forward-in-time Simulation of Recombination, Quantitative Traits, and Selection

Darren Kessner, John Novembre
(Submitted on 11 Oct 2013)

forqs is a forward-in-time simulation of recombination, quantitative traits, and selection. It was designed to investigate haplotype patterns resulting from scenarios where substantial evolutionary change has taken place in a small number of generations due to recombination and/or selection on polygenic quantitative traits. forqs is implemented as a command-line C++ program. Source code and binary executables for Linux, OS X, and Windows are freely available under a permissive BSD license.

Application of compressed sensing to genome wide association studies and genomic selection

Shashaank Vattikuti, James J. Lee, Stephen D. H. Hsu, Carson C. Chow
(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability $h^2 = 1$, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For $h^2 = 0.5$, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
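
As a rough illustration of the sparse-recovery idea described in this abstract (and not the authors' own code), the sketch below fits an L1-penalized regression (the lasso) by coordinate descent, which is one common compressed-sensing estimator. Everything specific here is an assumption made for the example: the standard-normal "genotypes" (real genotypes are allele counts), the noise-free phenotype (corresponding to a heritability of 1), the penalty value, the selection threshold, and all function names.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, y, penalty, sweeps=60):
    """Minimise (1/2n)*||y - X b||^2 + penalty*||b||_1.

    columns[j] holds the n values of marker j (X stored column-wise).
    """
    n = len(y)
    beta = [0.0] * len(columns)
    residual = list(y)  # y - X b, kept up to date as b changes
    col_sq = [sum(v * v for v in col) / n for col in columns]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Covariance of marker j with the residual that excludes its own fit.
            rho = sum(c * r for c, r in zip(col, residual)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_beta
    return beta


def simulate(n, p, causal, seed=0):
    """Hypothetical data: Gaussian markers and a noise-free phenotype."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
    y = [sum(columns[j][i] * effect for j, effect in causal.items())
         for i in range(n)]
    return columns, y


# 3 causal markers among 200, with 90 samples (thirty per nonzero locus).
causal_effects = {5: 1.0, 42: -1.0, 137: 1.0}
columns, phenotype = simulate(n=90, p=200, causal=causal_effects)
beta_hat = lasso_coordinate_descent(columns, phenotype, penalty=0.05)
selected = sorted(j for j, b in enumerate(beta_hat) if abs(b) > 0.25)
print("selected markers:", selected)
```

The number of markers (200) exceeds the sample size (90), which is the regime the abstract is concerned with; shrinking `n` in `simulate` is a simple way to probe where selection of the causal markers starts to fail.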

Integrating diverse datasets improves developmental enhancer prediction

Genevieve D. Erwin, Rebecca M. Truty, Dennis Kostka, Katherine S. Pollard, John A. Capra
(Submitted on 27 Sep 2013)

Gene-regulatory enhancers have been identified by many lines of evidence, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a novel method for predicting developmental enhancers and their tissue specificity. EnhancerFinder uses a two-step multiple-kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and thousands of diverse functional genomics datasets from a variety of cell types and developmental stages. We trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser, in contrast to histone mark or sequence-based enhancer definitions commonly used. We comprehensively evaluated EnhancerFinder, and found that our integrative approach improves enhancer prediction accuracy over previous approaches that consider a single type of data. Our evaluation highlights the importance of considering information from many tissues when predicting specific types of enhancers. We find that VISTA enhancers active in embryonic heart are easier to predict than enhancers active in several other tissues due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and hits from genome-wide association studies. We demonstrate the utility of our enhancer predictions by identifying and validating a novel cranial nerve enhancer in the ZEB2 locus. Our genome-wide developmental enhancer predictions will be freely available as a UCSC Genome Browser track.

The effect of paternal age on offspring intelligence and personality when controlling for paternal trait level

Ruben C. Arslan, Lars Penke, Wendy Johnson, William G. Iacono, Matt McGue
(Submitted on 18 Sep 2013)

Paternal age at conception has been found to predict the number of new genetic mutations. We examined the effect of father’s age at birth on offspring intelligence, head circumference and personality traits. Using the Minnesota Twin Family Study sample we tested paternal age effects while controlling for parents’ trait levels measured with the same precision as offspring’s. From evolutionary genetic considerations we predicted a negative effect of paternal age on offspring intelligence, but not on other traits. Controlling for parental IQ had the effect of turning a positive zero-order association negative. We found paternal age effects on offspring IQ and MPQ Absorption, but they were not robustly significant, nor replicable with additional covariates. No other noteworthy effects were found. Parents’ intelligence and personality correlated with their ages at twin birth, which may have obscured a small negative effect of advanced paternal age (< 1% of variance explained) on intelligence. We discuss future avenues for studies of paternal age effects and suggest that stronger research designs are needed to rule out confounding factors involving birth order and the Flynn effect.

Fast and accurate imputation of summary statistics enhances evidence of functional enrichment

Bogdan Pasaniuc, Noah Zaitlen, Huwenbo Shi, Gaurav Bhatia, Alexander Gusev, Joseph Pickrell, Joel Hirschhorn, David P Strachan, Nick Patterson, Alkes L. Price
(Submitted on 12 Sep 2013)

Imputation using external reference panels is a widely used approach for increasing power in GWAS and meta-analysis. Existing HMM-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. In simulations using 1000 Genomes (1000G) data, this method recovers 84% (54%) of the effective sample size for common (>5%) and low-frequency (1-5%) variants (increasing to 87% (60%) when summary LD information is available from target samples) versus 89% (67%) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and is computationally very fast. As an empirical demonstration, we apply our method to 7 case-control phenotypes from the WTCCC data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95% (105%) of the effective sample size (as quantified by the ratio of $\chi^2$ association statistics) compared to HMM-based imputation from individual-level genotypes at the 227 (176) published SNPs in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of 4 lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic vs. non-genic loci for these traits, as compared to an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses.
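
The abstract does not spell out the imputation formula, so the following is only a minimal sketch of the general idea, assuming the standard conditional mean of a multivariate normal: the z-score at an untyped SNP is predicted as a weighted sum of z-scores at typed SNPs, with weights derived from reference-panel LD. The ridge term added to the LD matrix is an assumed stand-in for the adjustment for limited reference panel size, and the function names and toy LD values are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def impute_z(ld_typed, ld_cross, z_typed, ridge=0.0):
    """Conditional-mean z-score at one untyped SNP.

    ld_typed: LD (correlation) matrix among typed SNPs.
    ld_cross: LD between the untyped SNP and each typed SNP.
    ridge:    assumed regularisation added to the diagonal of ld_typed.
    """
    regularised = [
        [value + (ridge if i == j else 0.0) for j, value in enumerate(row)]
        for i, row in enumerate(ld_typed)
    ]
    weights = solve(regularised, ld_cross)
    return sum(w * z for w, z in zip(weights, z_typed))


# Toy example: two typed SNPs and one untyped SNP.
ld_typed = [[1.0, 0.5], [0.5, 1.0]]
ld_cross = [0.8, 0.4]
z_typed = [4.0, 2.0]
print(impute_z(ld_typed, ld_cross, z_typed))
print(impute_z(ld_typed, ld_cross, z_typed, ridge=0.1))
```

In this toy example the ridge term pulls the imputed statistic toward zero, which is the conservative direction when the LD estimates themselves are noisy.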

An integrative genomic approach illuminates the causes and consequences of genetic background effects

Christopher H. Chandler, Sudarshan Chari, David Tack, Ian Dworkin
(Submitted on 2 Sep 2013)

(abridged) – The phenotypic consequences of mutations are modulated by the wild type genetic background in which they occur, sometimes dramatically so. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist, nor do we understand the mechanisms underlying it. We also lack knowledge on how mutations interact with the genetic background to influence gene expression patterns, and how gene expression may in turn mediate mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scallopedE3 allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets, from a mapping-by-introgression experiment, as well as a tagged RNA gene expression dataset. In addition we used whole genome re-sequencing of the parental lines (two commonly used laboratory strains) to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutational screen. By searching for genes showing a congruent signal in multiple datasets, we identified candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers are caused by higher-order epistasis, not quantitative non-complementation of alleles. Our results also suggest that cis-regulatory variation contributes little to the background dependence of this mutant phenotype. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well.