Effect of Genetic Variation in a Drosophila Model of Diabetes-Associated Misfolded Human Proinsulin

Bin Z. He, Michael Z. Ludwig, Desiree A. Dickerson, Levi Barse, Bharath Arun, Soo Young Park, Natalia A. Tamarina, Scott B. Selleck, Patricia Wittkopp, Graeme I. Bell, Martin Kreitman
(Submitted on 23 May 2013)

The identification and validation of gene-gene interactions is a major challenge in human studies. Here, we explore an approach for studying epistasis in humans using a Drosophila melanogaster model of neonatal diabetes mellitus. Expression of mutant preproinsulin, hINSC96Y, in the eye imaginal disc mimics the human disease, activating conserved cell stress response pathways that lead to cell death and a reduction in eye area. Dominant-acting variants in wild-derived inbred lines from the Drosophila Genetics Reference Panel produce a continuous, highly heritable distribution of eye degeneration phenotypes. A genome-wide association study (GWAS) in 154 sequenced lines identified 29 candidate SNPs in 16 loci with P 7.62). RNAi knock-downs of sfl enhanced the eye degeneration phenotype in a mutant-hINS-dependent manner. sfl encodes a protein required for sulfation of the glycosaminoglycan heparan sulfate. Two additional genes in the heparan sulfate (HS) biosynthetic pathway (tout velu, ttv, and brother of tout velu, botv) also modified the eye phenotype, suggesting a link between HS-modified proteins and cellular responses to misfolded proteins. Finally, intronic variants marking the QTL were associated with decreased sfl expression, a result consistent with that predicted by RNAi studies. The ability to create a model of human genetic disease in the fly, map a QTL by GWAS to a specific gene (and noncoding variant), validate its contribution to disease with available genetic resources, and experimentally link the variant to a molecular mechanism demonstrates the many advantages Drosophila holds in determining the genetic underpinnings of human disease.

Narrowing the gap on heritability of common disease by direct estimation in case-control GWAS

David Golan, Saharon Rosset
(Submitted on 23 May 2013)

One of the major developments in recent years in the search for missing heritability of human phenotypes is the adoption of linear mixed-effects models (LMMs) to estimate heritability due to genetic variants which are not significantly associated with the phenotype. A variant of the LMM approach has been adapted to case-control studies and applied to many major diseases by Lee et al. (2011), successfully accounting for a considerable portion of the missing heritability. For example, for Crohn’s disease their estimated heritability was 22% compared to 50-60% from family studies. In this letter we propose to estimate heritability of disease directly by regression of phenotype similarities on genotype correlations, corrected to account for ascertainment. We refer to this method as genetic correlation regression (GCR). Using GCR we estimate the heritability of Crohn’s disease at 34% using the same data. We demonstrate through extensive simulation that our method yields unbiased heritability estimates, which are consistently higher than LMM estimates. Moreover, we develop a heuristic correction to LMM estimates, which can be applied to published LMM results. Applying our heuristic correction increases the estimated heritability of multiple sclerosis from 30% to 52.6%.
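
The core idea in the abstract above, regressing phenotype similarities on genotype correlations, can be illustrated with a minimal sketch. This is not the authors' GCR estimator: it omits the ascertainment correction that the abstract says is essential for case-control data, and it assumes a quantitative phenotype that has already been standardized. The function names `genetic_correlations` and `regress_similarity` are hypothetical and chosen only for illustration.

```python
def genetic_correlations(genotypes):
    """Return the n x n matrix of genotype correlations between individuals.

    genotypes: list of n rows, each a list of m allele counts (0, 1 or 2).
    Each SNP column is standardized, then products are averaged over SNPs.
    """
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for k in range(m):
        col = [row[k] for row in genotypes]
        mean = sum(col) / n
        sd = (sum((g - mean) ** 2 for g in col) / n) ** 0.5
        if sd > 0:  # monomorphic SNPs carry no information, so skip them
            cols.append([(g - mean) / sd for g in col])
    used = len(cols)
    return [[sum(c[i] * c[j] for c in cols) / used for j in range(n)]
            for i in range(n)]


def regress_similarity(K, y):
    """OLS slope of phenotype products y_i * y_j on K[i][j] over pairs i < j.

    Assumption: y is already standardized (mean 0, variance 1), so the slope
    is a naive heritability estimate with no ascertainment correction.
    """
    xs, zs = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            xs.append(K[i][j])
            zs.append(y[i] * y[j])
    mx = sum(xs) / len(xs)
    mz = sum(zs) / len(zs)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxz = sum((x - mx) * (z - mz) for x, z in zip(xs, zs))
    return sxz / sxx
```

Only off-diagonal pairs enter the regression, so each individual's own phenotype variance does not drive the slope. In an ascertained case-control sample this naive slope would not be an estimate of heritability on the liability scale; the correction for that is the subject of the paper.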

Simpsonian ‘Evolution by Jumps’ in an Adaptive Radiation of Anolis Lizards

Jonathan M. Eastman, Daniel Wegmann, Christoph Leuenberger, Luke J. Harmon
(Submitted on 18 May 2013)

In his highly influential view of evolution, G. G. Simpson hypothesized that clades of species evolve in adaptive zones, defined as collections of niches occupied by species with similar traits and patterns of habitat use. Simpson hypothesized that species enter new adaptive zones in one of three ways: extinction of competitor species, dispersal to a new geographic region, or the evolution of a key trait that allows species to exploit resources in a new way. However, direct tests of Simpson’s hypotheses for the entry into new adaptive zones remain elusive. Here we evaluate the fit of a Simpsonian model of jumps between adaptive zones to phylogenetic comparative data. We use a novel statistical approach to show that anoles, a well-studied adaptive radiation of Caribbean lizards, have evolved by a series of evolutionary jumps in trait evolution. Furthermore, as Simpson predicted, trait axes strongly tied to habitat specialization show jumps that correspond with the evolution of key traits and/or dispersal between islands in the Greater Antilles. We conclude that jumps are commonly associated with major adaptive shifts in the evolutionary radiation of anoles.

Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies

Xiang Zhou, Matthew Stephens
(Submitted on 19 May 2013)

Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, existing methods for calculating the likelihood ratio test statistics in mvLMMs are time consuming, and, without approximations, cannot be directly applied to analyze even two traits jointly in a typical-size GWAS. Here, we present a novel algorithm for computing parameter estimates and test statistics (Likelihood ratio and Wald) in mvLMMs that i) reduces per-iteration optimization complexity from cubic to linear in the number of samples; and ii) in GWAS analyses, reduces per-marker complexity from cubic to approximately quadratic (or linear if the relatedness matrix is of low rank) in the number of samples. The new method effectively generalizes both the EMMA (Efficient Mixed Model Association) algorithm and the GEMMA (Genome-wide EMMA) algorithm to the multivariate case, making the likelihood ratio tests in GWASs with mvLMM possible, for the first time, for tens of thousands of samples and a moderate number of phenotypes (<10). With real examples, we show that, as expected, the new method is orders of magnitude faster than competing methods in both variance component estimation in a single mvLMM, and in GWAS applications. The method is implemented in the GEMMA software package, freely available at this http URL
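
To give a feel for why a per-iteration cost that is linear in the number of samples is achievable, here is a sketch of the simpler univariate case, not the multivariate algorithm of the paper. It assumes the relatedness matrix K has already been eigendecomposed once (K = U D U^T) and the phenotype rotated (y_rot = U^T y), with no covariates. After that rotation the covariance is diagonal, so each likelihood evaluation is a single pass over the samples. The function name and argument names are hypothetical.

```python
import math


def rotated_loglik(eigenvalues, y_rot, sigma_g2, sigma_e2):
    """Log-likelihood of y ~ N(0, sigma_g2 * K + sigma_e2 * I) after rotation.

    eigenvalues: eigenvalues d_i of K (assumed precomputed once)
    y_rot:       phenotype vector rotated by the eigenvectors of K
    In the rotated basis the i-th entry has variance sigma_g2 * d_i + sigma_e2,
    so the likelihood is a sum of n independent normal terms: O(n) work.
    """
    total = 0.0
    for d_i, y_i in zip(eigenvalues, y_rot):
        var_i = sigma_g2 * d_i + sigma_e2
        total += math.log(2.0 * math.pi * var_i) + y_i * y_i / var_i
    return -0.5 * total


def best_variance_ratio(eigenvalues, y_rot, grid):
    """Pick the (sigma_g2, sigma_e2) pair from `grid` with the highest likelihood.

    A crude grid search stands in for a real optimizer; each candidate
    costs one linear-time likelihood evaluation.
    """
    return max(grid, key=lambda pair: rotated_loglik(eigenvalues, y_rot, *pair))
```

The one-off eigendecomposition is still cubic in the number of samples; the saving comes from paying that cost once rather than at every optimization step or every marker.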

Variable-length haplotype construction for gene-gene interaction studies

Anunchai Assawamakin, Nachol Chaiyaratana, Chanin Limwongse, Saravudh Sinsomros, Pa-thai Yenchitsomanus, Prakarnkiat Youngkong
(Submitted on 19 May 2013)

This paper presents a non-parametric classification technique for identifying a candidate bi-allelic genetic marker set that best describes disease susceptibility in gene-gene interaction studies. The developed technique functions by creating a mapping between inferred haplotypes and case/control status. The technique cycles through all possible marker combination models generated from the available marker set where the best interaction model is determined from prediction accuracy and two auxiliary criteria including low-to-high order haplotype propagation capability and model parsimony. Since variable-length haplotypes are created during the best model identification, the developed technique is referred to as a variable-length haplotype construction for gene-gene interaction (VarHAP) technique. VarHAP has been benchmarked against a multifactor dimensionality reduction (MDR) program and a haplotype interaction technique embedded in a FAMHAP program in various two-locus interaction problems. The results reveal that VarHAP is suitable for all interaction situations with the presence of weak and strong linkage disequilibrium among genetic markers.

Our paper: Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

This guest post is by Vincent Plagnol on his group’s paper Bayesian test for co-localisation between pairs of genetic association studies using summary statistics, arXived here. This has been cross-posted from the Plagnol Lab web site.

In this paper we want to answer the following question: given two genetic association studies both showing some association signal at a locus, how likely is it that the same variant is responsible for both associations?

We care about this because a shared causal variant is likely to imply an etiological link between the traits being considered. An obvious application consists of comparing a gene expression study and a disease trait. If one can show that the same variant is affecting both measurements, then it is very likely that the expression of this gene is affecting disease pathogenesis. It also provides information about the tissue type where the effect is mediated. This is key information for a drug design process.

Previous work that led to this manuscript

A while back, I started a discussion with my colleague (and co-author on this manuscript) Eric Schadt about the involvement of a gene named RPS26 in type 1 diabetes. We came up with tests of co-localisation, which were later improved by my colleague (and co-author as well) Chris Wallace, based in Cambridge. These tests are somewhat dated now. The earliest version considered situations with a very small number of SNPs, and was not well suited for densely typed regions, in particular those produced by imputation procedures.

This SNP density problem can be overcome to some extent, and Chris Wallace discusses how to do this here. However, a more fundamental issue is the Bayesian/frequentist difference. These earlier tests were testing the null hypothesis of a shared causal variant. Failing to reject the null could be the result of either a lack of power, or a true shared causal variant. In this newer Bayesian framework, the probability of each scenario is computed, including the “lack of power” case. It then becomes easier to interpret the outcome of the test. The tests are about to be released in the latest version of the coloc package (which is maintained by Chris Wallace).

In this latest paper, the underlying model is closely linked to the one proposed by Matthew Stephens and colleagues in a recent PLoS Genetics paper. However, co-localisation was more a side story in this paper, whereas it is the central point of our work. In particular, we show that it is possible to use single SNP P-values to obtain a very good approximation of the correct answer. As discussed below, this has important practical applications.
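
For readers who want to see what "probabilities for each scenario from single SNP P-values" might look like in practice, here is a rough sketch. It is not the coloc package code. It assumes Wakefield-style approximate Bayes factors per SNP, at most one causal variant per trait in the region, and five scenarios (no association, trait 1 only, trait 2 only, two distinct variants, one shared variant). The prior values `p1`, `p2`, `p12`, the prior effect variance `w`, and the per-SNP estimate variances `v1`, `v2` are all assumptions supplied by the caller; in a real analysis the variances would be derived from sample size and allele frequency.

```python
import math
from statistics import NormalDist


def z_from_p(p):
    """Absolute Z-score for a two-sided P-value (the sign is not needed)."""
    return abs(NormalDist().inv_cdf(p / 2.0))


def log_abf(z, v, w):
    """Log approximate Bayes factor: v = estimate variance, w = prior variance."""
    r = w / (v + w)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def logsumexp(xs):
    m = max(xs)
    if m == -math.inf:
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))


def logdiffexp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return -math.inf
    return a + math.log1p(-math.exp(b - a))


def scenario_posteriors(pvals1, pvals2, v1, v2, w=0.04,
                        p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [H0, H1, H2, H3, H4] for one region.

    H3 = two distinct causal variants, H4 = one shared causal variant.
    """
    l1 = [log_abf(z_from_p(p), v, w) for p, v in zip(pvals1, v1)]
    l2 = [log_abf(z_from_p(p), v, w) for p, v in zip(pvals2, v2)]
    s1, s2 = logsumexp(l1), logsumexp(l2)
    s12 = logsumexp([a + b for a, b in zip(l1, l2)])  # same SNP for both traits
    log_h = [
        0.0,
        math.log(p1) + s1,
        math.log(p2) + s2,
        math.log(p1) + math.log(p2) + logdiffexp(s1 + s2, s12),  # different SNPs
        math.log(p12) + s12,
    ]
    total = logsumexp(log_h)
    return [math.exp(h - total) for h in log_h]
```

A usage example: with three SNPs, `scenario_posteriors([2e-9, 1.0, 1.0], [2e-9, 1.0, 1.0], [0.01] * 3, [0.01] * 3)` puts most of the posterior mass on the shared-variant scenario, while moving the second trait's signal to a different SNP shifts it to the distinct-variants scenario. Working in log space keeps the sums stable when P-values are very small.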

Another closely related work is the software Sherlock. Sherlock also uses P-values, and also tries to match a gene expression dataset with another GWAS. However, Sherlock does not really perform a co-localisation test but rather a general matching between a gene and a GWAS. In particular, in the Sherlock framework, only the variants significantly associated with gene expression contribute to the final test statistic. In contrast, a variant flat for the expression trait but strongly associated with disease provides strong support against co-localisation. Our work incorporates this information, by adding support to the “distinct association peaks” scenario.

A warning about the interpretation

As always in statistics, correlation does not imply causality. And what we quantify here are correlations. We can find very strong evidence that the same variant is affecting two traits, but what we cannot conclude without doubt is that the two traits (say, expression of a gene and disease outcome) are causally related. It may be likely, but we are not testing this.

An illustration of the complexity of this is the commonly observed case where a single variant (or haplotype) appears to affect the expression of a group of genes in the same chromosome region. Our test may, in such a situation, provide strong evidence of co-localisation for several of these genes with a disease GWAS. However, most of the time the expression of only one of these genes will actually causally affect the disease trait of interest. This does not mean that the test is wrong; one just has to understand what it is actually testing. To be precise, two traits affected by the same causal variant may suggest a causal link between them, but that does not have to be the case.

Two limitations of this approach

There are two additional limitations to mention. One is that the causal variant should be typed or imputed. We use simulations to show that if this is not the case, the behaviour of the test becomes very conservative.

A second issue is the presence of more than one association for the same trait at a locus. If both associations have approximately the same level of significance, the test can misbehave. In addition, identifying co-localisation with the secondary association requires conditional P-values. We give a nice example of this in the paper. However, if only P-values are available (which is key for what we want to do), this requires using approximate methods. Things are much easier if genotype-level data are available and a proper conditional regression can be implemented.

Why it is important to use summary statistics

Data sharing is always a contentious issue in human genetics. I am incredibly frustrated by the lack of willingness displayed by some groups to share data, even though they claim that they do. It is a topic for another post. Eric Schadt has been extremely helpful by sharing the liver gene expression dataset with us, but this is rather uncommon behaviour. In most cases, data are hidden behind various “regulations” and “data access committees” that rarely meet and extensively delay the process of data sharing.

Given this frustration, being able to base tests on P-values makes it much easier to interact with other groups and share data. The success of large scale meta-analyses is an example of this. This is why we worked out the statistics so that P-values alone are sufficient to derive the probabilities for each scenario.

A practical implication is that it becomes possible to build a web-based server that will take P-values uploaded by users, compare these P-values with a set of GWAS datasets stored on the server (typically expression studies but perhaps other data types) and return statistics about the overlapping association signals.

We have initiated that process and the coloc server is now live (http://coloc.cs.ucl.ac.uk/coloc/), with a lot of help from the Computer Science department at UCL. We have only loaded the liver dataset that we used in this preprint as of now, but we are in the process of adding a brain gene expression study, led by my colleagues Mike Weale, John Hardy and Mina Ryten. We very much welcome collaborations, and if other datasets, for gene expression or any other relevant traits, are available, we would love to collaborate and incorporate these data into our server.

From genome-wide to “phenome-wide”

What we really want to do with this tool in the near future is mine dozens of GWAS using single-variant P-value summary data, and search for connections that have been missed by previous investigators. Perhaps there are lipid traits that can be linked to neurodegenerative conditions, like the well known APOE result? Perhaps some T cell genes have an unexpected effect on a cardiovascular trait? Obviously any single such connection is unlikely, but the genome-wide analysis of many association studies is likely to show several results of this type. The idea is to not only work genome-wide but also “phenome-wide”, comparing as many pairs of traits as possible. Again, this is definitely a collaborative work and we would be excited if we could bring in more datasets to make these comparisons more powerful. So don’t hesitate to get in touch.

The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

Kentaro Yoshida, Verena J. Schuenemann, Liliana M. Cano, Marina Pais, Bagdevi Mishra, Rahul Sharma, Christa Lanz, Frank N. Martin, Sophien Kamoun, Johannes Krause, Marco Thines, Detlef Weigel, Hernán A. Burbano
(Submitted on 17 May 2013)

Phytophthora infestans, the cause of potato late blight, is infamous for having triggered the Irish Great Famine in the 1840s. Until the late 1970s, P. infestans diversity outside of its Mexican center of origin was low, and one scenario held that a single strain, US-1, had dominated the global population for 150 years; this was later challenged based on DNA analysis of historical herbarium specimens. We have compared the genomes of 11 herbarium and 15 modern strains. We conclude that the nineteenth century epidemic was caused by a unique genotype, HERB-1, that persisted for over 50 years. HERB-1 is distinct from all examined modern strains, but it is a close relative of US-1, which replaced it outside of Mexico in the twentieth century. We propose that HERB-1 and US-1 emerged from a metapopulation that was established in the early 1800s outside of the species’ center of diversity.

Computing the posterior expectation of phylogenetic trees

Philipp Benner, Miroslav Bačák
(Submitted on 16 May 2013)

Inferring phylogenetic trees from multiple sequence alignments often relies upon Markov chain Monte Carlo (MCMC) methods to generate tree samples from a posterior distribution. To give a rigorous approximation of the posterior expectation, one needs to compute the mean of the tree samples, so a sound definition of a mean and algorithms for computing it are needed. To the best of our knowledge, no existing method of phylogenetic inference can handle the full set of sample trees, because such trees typically have different topologies. We develop a novel statistical model for the inference of phylogenetic trees based on the tree space due to Billera et al. [2001]. Since it is a Hadamard space, the mean and median are well defined, which we also motivate from a decision theoretic perspective. The actual approximation of the posterior expectation relies on some recent developments in Hadamard spaces (Bačák [2013a], Miller et al. [2012]) and the fast computation of geodesics in tree space (Owen and Provan [2011]), which together make it possible to compute medians and means of trees with different topologies. Our intention is to give a full, self-contained description of the methods required to approximate posterior expectations. We demonstrate these methods on the small ribosomal subunit rRNA sequence alignment. The posterior expectations obtained on this data set are a meaningful summary of the posterior distribution and the uncertainty about the tree topology.

Small ancestry informative marker panels for complete classification between the original four HapMap populations

Damrongrit Setsirichok, Theera Piroonratana, Anunchai Assawamakin, Touchpong Usavanarong, Chanin Limwongse, Waranyu Wongseree, Chatchawit Aporntewan, Nachol Chaiyaratana
(Submitted on 16 May 2013)

A protocol for the identification of ancestry informative markers (AIMs) from genome-wide single nucleotide polymorphism (SNP) data is proposed. The protocol consists of three main steps: (a) identification of potential positive selection regions via Fst extremity measurement, (b) SNP screening via two-stage attribute selection and (c) classification model construction using a naive Bayes classifier. The two-stage attribute selection is composed of a newly developed round robin symmetrical uncertainty ranking technique and a wrapper embedded with a naive Bayes classifier. The protocol has been applied to the HapMap Phase II data. Two AIM panels, which consist of 10 and 16 SNPs that lead to complete classification between CEU, CHB, JPT and YRI populations, are identified. Moreover, the panels are at least four times smaller than those reported in previous studies. The results suggest that the protocol could be useful in a scenario involving a larger number of populations.
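
Step (a) of the protocol above relies on Fst to flag SNPs whose allele frequencies differ strongly between populations. The sketch below shows one simple way such a ranking could be done. It uses a basic heterozygosity-based Fst with equal weight per population, which is an assumption and not necessarily the estimator used by the authors, and it ranks individual SNPs instead of detecting regions. The function names and the input layout (a dictionary from SNP identifier to per-population allele frequencies) are hypothetical.

```python
def fst(freqs):
    """Heterozygosity-based Fst for one biallelic SNP.

    freqs: allele frequency of the same allele in each population.
    Fst = (H_total - H_within) / H_total, with populations weighted equally.
    """
    mean_p = sum(freqs) / len(freqs)
    h_total = 2.0 * mean_p * (1.0 - mean_p)
    if h_total == 0.0:  # allele fixed or absent everywhere: no differentiation
        return 0.0
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def top_fst_snps(freq_table, k):
    """Return the k SNP identifiers with the largest Fst.

    freq_table: dict mapping SNP id -> list of per-population frequencies.
    """
    ranked = sorted(freq_table, key=lambda snp: fst(freq_table[snp]), reverse=True)
    return ranked[:k]
```

SNPs selected this way would only be candidates; in the protocol described above they still pass through the two-stage attribute selection and the naive Bayes classifier before a panel is chosen.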

SISRS: SNP Identification from Short Read Sequences

Rachel S. Schwartz, Kelly Harkins, Anne C. Stone, Reed A. Cartwright
(Submitted on 16 May 2013)

One of the important challenges in modern phylogenetics is to identify data that can be used to resolve species relationships accurately. Whole-genome shotgun sequencing provides large amounts of data from which to identify phylogenetically informative sites; however, previous studies have required genome assembly or alignment to a reference genome, which is difficult when species are not closely related.
We have developed a pipeline to extract potentially informative sites directly from raw short-read sequence data. Reads are assembled into conserved genome fragments, reads are then aligned to these fragments, and informative sites are identified. This pipeline produced >14000 informative sites from reads for 12 species of Leishmania and a reference genome. When analyzed using standard phylogenetic methods, these data resulted in a fully bifurcating tree with strongly supported nodes.
Our procedure is implemented in the software SISRS (pronounced “scissors”) which is freely available at this https URL.