Effect of Genetic Variation in a Drosophila Model of Diabetes-Associated Misfolded Human Proinsulin

Bin Z. He, Michael Z. Ludwig, Desiree A. Dickerson, Levi Barse, Bharath Arun, Soo Young Park, Natalia A. Tamarina, Scott B. Selleck, Patricia Wittkopp, Graeme I. Bell, Martin Kreitman
(Submitted on 23 May 2013)

The identification and validation of gene-gene interactions is a major challenge in human studies. Here, we explore an approach for studying epistasis in humans using a Drosophila melanogaster model of neonatal diabetes mellitus. Expression of mutant preproinsulin, hINSC96Y, in the eye imaginal disc mimics the human disease, activating conserved cell stress response pathways that lead to cell death and a reduction in eye area. Dominant-acting variants in wild-derived inbred lines from the Drosophila Genetics Reference Panel produce a continuous, highly heritable distribution of eye degeneration phenotypes. A genome-wide association study (GWAS) in 154 sequenced lines identified 29 candidate SNPs in 16 loci with P 7.62). RNAi knock-downs of sfl enhanced the eye degeneration phenotype in a mutant-hINS-dependent manner. sfl encodes a protein required for sulfation of the glycosaminoglycan heparan sulfate. Two additional genes in the heparan sulfate (HS) biosynthetic pathway (tout velu, ttv and brother of tout velu, botv) also modified the eye phenotype, suggesting a link between HS-modified proteins and cellular responses to misfolded proteins. Finally, intronic variants marking the QTL were associated with decreased sfl expression, a result consistent with that predicted by the RNAi studies. The ability to create a model of human genetic disease in the fly, map a QTL by GWAS to a specific gene (and noncoding variant), validate its contribution to disease with available genetic resources, and experimentally link the variant to a molecular mechanism demonstrates the many advantages Drosophila holds in determining the genetic underpinnings of human disease.

Narrowing the gap on heritability of common disease by direct estimation in case-control GWAS

David Golan, Saharon Rosset
(Submitted on 23 May 2013)

One of the major developments in recent years in the search for missing heritability of human phenotypes is the adoption of linear mixed-effects models (LMMs) to estimate heritability due to genetic variants which are not significantly associated with the phenotype. A variant of the LMM approach has been adapted to case-control studies and applied to many major diseases by Lee et al. (2011), successfully accounting for a considerable portion of the missing heritability. For example, for Crohn’s disease their estimated heritability was 22% compared to 50-60% from family studies. In this letter we propose to estimate heritability of disease directly by regression of phenotype similarities on genotype correlations, corrected to account for ascertainment. We refer to this method as genetic correlation regression (GCR). Using GCR we estimate the heritability of Crohn’s disease at 34% using the same data. We demonstrate through extensive simulation that our method yields unbiased heritability estimates, which are consistently higher than LMM estimates. Moreover, we develop a heuristic correction to LMM estimates, which can be applied to published LMM results. Applying our heuristic correction increases the estimated heritability of multiple sclerosis from 30% to 52.6%.
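The core idea of regressing phenotype similarities on genotype correlations can be illustrated with a minimal Python sketch. It is a plain Haseman-Elston-style regression for a quantitative trait in an unascertained sample, so it does not include the ascertainment correction that defines GCR. The function names, the simulated data and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def similarity_regression_h2(genotypes, phenotype):
    """Hypothetical helper: slope of phenotype products on genotype correlations."""
    # Standardize each SNP column (assumes no monomorphic SNPs).
    Z = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    n, m = Z.shape
    K = Z @ Z.T / m                      # genotype correlation between individuals
    y = (phenotype - phenotype.mean()) / phenotype.std()
    upper = np.triu_indices(n, k=1)      # each pair of distinct individuals once
    x = K[upper]                         # genotype correlations
    z = np.outer(y, y)[upper]            # phenotype similarities
    return float(x @ z / (x @ x))        # least-squares slope through the origin

def simulate(n=1000, m=500, h2=0.5, seed=0):
    """Simulate independent SNPs and an additive trait with heritability h2."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=m)
    G = rng.binomial(2, freqs, size=(n, m)).astype(float)
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    beta = rng.normal(0.0, np.sqrt(h2 / m), size=m)
    y = Z @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
    return G, y

G, y = simulate()
h2_hat = similarity_regression_h2(G, y)
print(f"estimated heritability: {h2_hat:.2f}")
```

In a case-control sample, cases are over-represented relative to the population, so a naive regression like this one would need the correction described in the paper before its slope could be read as heritability.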

Our paper: The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

This guest post is by Detlef Weigel (@WeigelWorld) and Hernán A. Burbano on their arXived paper [with coauthors] Yoshida et al. The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine. arXived here and in press at eLife [to appear here].

This paper is the result of a great collaboration between a lab that specializes in ancient DNA (that of Johannes Krause from the University of Tübingen), an expert in pathogen systematics (the group of Marco Thines from the Senckenberg Museum and Goethe University in Frankfurt), two pathogen genomics labs (those of Sophien Kamoun from the Sainsbury Laboratory in Norwich and Frank Martin from the USDA in California), and our evolutionary genomics group at the Max Planck Institute in Tübingen (Hernán A. Burbano and Detlef Weigel).


Phytophthora infestans made history when it destroyed large parts of the European potato crop, beginning in 1845. Potato has its origin in the Andes, in the Southeast of modern Peru and Northwest of Bolivia, while the center of diversity of P. infestans is several thousand kilometers further north, in Mexico’s Toluca Valley. There, other Phytophthora species live on a broad range of host plants. At some point in its history, evolutionary events associated with repeat-driven genome expansion [1,2] endowed P. infestans with the genetic arsenal required to infect potato. The pathogen was introduced to Europe in 1845 via infected potato tubers from the United States, where potato blight had made its first appearance in 1843. In the ensuing European blight epidemic, Ireland was hit especially hard, because the virtual absence of independent farmers and a restrictive customs policy conspired with the disease caused by P. infestans, potato blight, to have disproportionately devastating effects. The Great Famine that struck Ireland was a decisive event in both European and American history. One million Irish died of starvation, and at least another million left the country, most of them for the USA.


This part of P. infestans history has been clear, but the relationship of the strain(s) that caused the nineteenth century epidemic to modern strains has been controversial. Before a range of genetically quite distinct P. infestans strains made their debut throughout the world some 40 years ago, the global population outside Mexico was dominated by a single strain, called US-1. Because of its prevalence, US-1 was long thought to have been the cause of the fatal outbreak in the nineteenth century. From the analysis of a single SNP in the mitochondrial genome, it was, however, concluded in 2001 that the nineteenth century strains were more closely related to the modern strains that prevail today [3].


In our new paper, we resolve this paradoxical view: While the historical pathogen strain, which we call HERB-1, indeed differs at this one position from US-1, which has a derived allele, HERB-1 is far more closely related to US-1 than to other modern strains. Molecular clock analyses show that both strains probably separated from each other only a few years before the major European outbreak. HERB-1 seems to have dominated the global population without many genetic changes, and only in the twentieth century, after new potato varieties were introduced, was HERB-1 replaced by US-1 as the most successful P. infestans strain. We do not know for sure why HERB-1 was replaced, but we noted that the modern strains tend to be polyploid, while HERB-1 was diploid. We speculate that the increased genetic diversity in polyploid lineages was important for the success of US-1 (and other modern strains).


Our conclusions are based on Illumina sequencing of 11 herbarium samples of infected potato and tomato leaves collected in Ireland, the UK, Continental Europe and North America and preserved in the herbaria of the Botanical State Collection Munich and the Kew Gardens in London. Both herbaria placed a great deal of confidence in our abilities and were very generous in providing the dried plants. The degree of DNA preservation in the herbarium samples was impressive, much higher than in other examples of ancient DNA, and the majority of recovered DNA was from the host plant, with some samples having in addition over 20% pathogen DNA. In contrast to recent studies of historic human pathogens, no target DNA enrichment was required. We compared the historic samples with modern strains from Europe, Africa and North and South America as well as two closely related Phytophthora species. Due to the 150-year long period over which the individual samples had been collected, we were able to estimate with great confidence when the various P. infestans strains had emerged during evolutionary time. Here, too, we found connections with historic events: the first contact between Europeans and Americans in Mexico falls exactly into the time window in which the genetic diversity of P. infestans experienced a remarkable increase. Presumably, the social upheaval following the arrival of the Europeans somehow led to a spread of the pathogen at the beginning of the sixteenth century, which in turn accelerated its evolution.


The historical HERB-1 type is so far not known from modern collections, but we now have many diagnostic markers with which we can type the hundreds of modern isolates to determine whether perhaps there is somewhere a reservoir of HERB-1. In addition, our work highlights that herbaria constitute a rich, so far untapped source for investigating real-time evolution.


Detlef Weigel, weigel@weigelworld.org

Hernán A. Burbano, hernan.burbano@tuebingen.mpg.de


Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany



1. Haas BJ, Kamoun S, Zody MC, Jiang RH, Handsaker RE, et al. (2009) Genome sequence and analysis of the Irish potato famine pathogen Phytophthora infestans. Nature 461: 393-398.

2. Raffaele S, Farrer RA, Cano LM, Studholme DJ, MacLean D, et al. (2010) Genome evolution following host jumps in the Irish potato famine pathogen lineage. Science 330: 1540-1543.

3. Ristaino JB, Groves CT, Parra GR (2001) PCR amplification of the Irish potato famine pathogen from historic specimens. Nature 411: 695-697.



Simpsonian ‘Evolution by Jumps’ in an Adaptive Radiation of Anolis Lizards

Jonathan M. Eastman, Daniel Wegmann, Christoph Leuenberger, Luke J. Harmon
(Submitted on 18 May 2013)

In his highly influential view of evolution, G. G. Simpson hypothesized that clades of species evolve in adaptive zones, defined as collections of niches occupied by species with similar traits and patterns of habitat use. Simpson hypothesized that species enter new adaptive zones in one of three ways: extinction of competitor species, dispersal to a new geographic region, or the evolution of a key trait that allows species to exploit resources in a new way. However, direct tests of Simpson’s hypotheses for the entry into new adaptive zones remain elusive. Here we evaluate the fit of a Simpsonian model of jumps between adaptive zones to phylogenetic comparative data. We use a novel statistical approach to show that anoles, a well-studied adaptive radiation of Caribbean lizards, have evolved by a series of evolutionary jumps in trait evolution. Furthermore, as Simpson predicted, trait axes strongly tied to habitat specialization show jumps that correspond with the evolution of key traits and/or dispersal between islands in the Greater Antilles. We conclude that jumps are commonly associated with major adaptive shifts in the evolutionary radiation of anoles.

Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies

Xiang Zhou, Matthew Stephens
(Submitted on 19 May 2013)

Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, existing methods for calculating the likelihood ratio test statistics in mvLMMs are time consuming, and, without approximations, cannot be directly applied to analyze even two traits jointly in a typical-size GWAS. Here, we present a novel algorithm for computing parameter estimates and test statistics (Likelihood ratio and Wald) in mvLMMs that i) reduces per-iteration optimization complexity from cubic to linear in the number of samples; and ii) in GWAS analyses, reduces per-marker complexity from cubic to approximately quadratic (or linear if the relatedness matrix is of low rank) in the number of samples. The new method effectively generalizes both the EMMA (Efficient Mixed Model Association) algorithm and the GEMMA (Genome-wide EMMA) algorithm to the multivariate case, making the likelihood ratio tests in GWASs with mvLMM possible, for the first time, for tens of thousands of samples and a moderate number of phenotypes (<10). With real examples, we show that, as expected, the new method is orders of magnitude faster than competing methods in both variance component estimation in a single mvLMM, and in GWAS applications. The method is implemented in the GEMMA software package, freely available at this http URL
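The complexity reduction mentioned in the abstract builds on a well-known device from the univariate EMMA/GEMMA setting: eigendecompose the relatedness matrix once, then evaluate each likelihood in time linear in the number of samples. The Python sketch below illustrates only that univariate building block for a zero-mean model without covariates; it is not the multivariate algorithm of the paper, and the function names and toy data are assumptions made for illustration.

```python
import numpy as np

def rotate(y, K):
    """One-off O(n^3) step: eigendecompose K and rotate the phenotype."""
    eigvals, eigvecs = np.linalg.eigh(K)
    return eigvals, eigvecs.T @ y

def loglik_rotated(eigvals, y_rot, sigma_g2, sigma_e2):
    """O(n) log-likelihood of y ~ N(0, sigma_g2*K + sigma_e2*I) after rotation."""
    d = sigma_g2 * eigvals + sigma_e2    # eigenvalues of the covariance matrix
    n = y_rot.size
    return -0.5 * (n * np.log(2 * np.pi) + np.sum(np.log(d)) + np.sum(y_rot ** 2 / d))

def loglik_direct(y, K, sigma_g2, sigma_e2):
    """Reference O(n^3) evaluation, used here only to check the shortcut."""
    n = y.size
    V = sigma_g2 * K + sigma_e2 * np.eye(n)
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(V, y))

# Toy data: a relatedness matrix built from random standardized "genotypes".
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 50))
K = Z @ Z.T / Z.shape[1]
y = rng.normal(size=200)

eigvals, y_rot = rotate(y, K)
fast = loglik_rotated(eigvals, y_rot, sigma_g2=0.7, sigma_e2=0.5)
slow = loglik_direct(y, K, sigma_g2=0.7, sigma_e2=0.5)
print(fast, slow)
```

Because the rotation is done once, an optimizer can evaluate many candidate variance components at linear cost each, which is the kind of saving the abstract describes for the per-iteration step.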

Variable-length haplotype construction for gene-gene interaction studies

Anunchai Assawamakin, Nachol Chaiyaratana, Chanin Limwongse, Saravudh Sinsomros, Pa-thai Yenchitsomanus, Prakarnkiat Youngkong
(Submitted on 19 May 2013)

This paper presents a non-parametric classification technique for identifying a candidate bi-allelic genetic marker set that best describes disease susceptibility in gene-gene interaction studies. The developed technique functions by creating a mapping between inferred haplotypes and case/control status. The technique cycles through all possible marker combination models generated from the available marker set where the best interaction model is determined from prediction accuracy and two auxiliary criteria including low-to-high order haplotype propagation capability and model parsimony. Since variable-length haplotypes are created during the best model identification, the developed technique is referred to as a variable-length haplotype construction for gene-gene interaction (VarHAP) technique. VarHAP has been benchmarked against a multifactor dimensionality reduction (MDR) program and a haplotype interaction technique embedded in a FAMHAP program in various two-locus interaction problems. The results reveal that VarHAP is suitable for all interaction situations with the presence of weak and strong linkage disequilibrium among genetic markers.

Our paper: Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

This guest post is by Vincent Plagnol on his group’s paper Bayesian test for co-localisation between pairs of genetic association studies using summary statistics, arXived here. This has been cross-posted from the Plagnol Lab web site.

In this paper we want to answer the following question: given two genetic association studies both showing some association signal at a locus, how likely is it that the same variant is responsible for both associations?

We care about this because a shared causal variant is likely to imply an etiological link between the traits being considered. An obvious application consists of comparing a gene expression study and a disease trait. If one can show that the same variant is affecting both measurements, then it is very likely that the expression of this gene is affecting disease pathogenesis. It also provides information about the tissue type where the effect is mediated. This is key information for a drug design process.

Previous work that led to this manuscript

A while back, I started a discussion with my colleague (and co-author on this manuscript) Eric Schadt about the involvement of a gene named RPS26 in type 1 diabetes. We came up with tests of co-localisation, which were later improved by my colleague (and co-author as well) Chris Wallace, based in Cambridge. These tests are somewhat dated now. The earliest version considered situations with a very small number of SNPs, and was not well suited to densely typed regions, in particular those produced by imputation procedures.

This SNP density problem can be overcome to some extent, and Chris Wallace discusses how to do this here. However, a more fundamental issue is the Bayesian/frequentist difference. These earlier tests were testing the null hypothesis of a shared causal variant. Failing to reject the null could be the result of either a lack of power, or a true shared causal variant. In this newer Bayesian framework, the probability of each scenario is computed, including the “lack of power” case. It then becomes easier to interpret the outcome of the test. The tests are about to be released in the latest version of the coloc package (which is maintained by Chris Wallace).

In this latest paper, the underlying model is closely linked to the one proposed by Matthew Stephens and colleagues in a recent PLoS Genetics paper. However, co-localisation was more a side story in this paper, whereas it is the central point of our work. In particular, we show that it is possible to use single SNP P-values to obtain a very good approximation of the correct answer. As discussed below, this has important practical applications.
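To make the idea of computing a probability for each scenario concrete, here is a minimal Python sketch. It assumes at most one causal variant per trait in the region, takes per-SNP z-scores and standard errors as inputs (rather than the raw P-values discussed above), uses Wakefield-style approximate Bayes factors, and plugs in illustrative prior values. The function names, the priors and the toy numbers are all assumptions, and this is not the coloc package's code.

```python
import numpy as np

def wakefield_abf(z, se, prior_sd=0.15):
    """Approximate Bayes factor (association vs. no association) for each SNP."""
    z = np.asarray(z, dtype=float)
    V = np.asarray(se, dtype=float) ** 2     # sampling variance of the effect estimate
    W = prior_sd ** 2                        # assumed prior variance of the true effect
    r = W / (V + W)
    return np.sqrt(1.0 - r) * np.exp(0.5 * r * z ** 2)

def scenario_posteriors(abf1, abf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of five scenarios (illustrative priors).

    Index 0: no association, 1: trait 1 only, 2: trait 2 only,
    3: both traits, distinct causal variants, 4: both traits, shared variant.
    Works on the raw scale, so extremely large z-scores could overflow.
    """
    shared = np.sum(abf1 * abf2)
    support = np.array([
        1.0,
        p1 * np.sum(abf1),
        p2 * np.sum(abf2),
        p1 * p2 * (np.sum(abf1) * np.sum(abf2) - shared),
        p12 * shared,
    ])
    return support / support.sum()

se = np.full(5, 0.05)
z_trait1 = [0.5, 6.0, 0.3, -0.2, 0.1]        # peak at the second SNP
z_same = [0.2, 5.5, 0.1, 0.4, -0.3]          # second trait peaks at the same SNP
z_diff = [0.2, 0.1, 0.1, 5.5, -0.3]          # second trait peaks at another SNP

abf1 = wakefield_abf(z_trait1, se)
pp_same = scenario_posteriors(abf1, wakefield_abf(z_same, se))
pp_diff = scenario_posteriors(abf1, wakefield_abf(z_diff, se))
print(pp_same.round(3), pp_diff.round(3))
```

With these toy inputs, the shared-variant scenario dominates when both traits peak at the same SNP, and the distinct-variant scenario dominates when the peaks differ. When neither trait shows a clear signal, most of the probability stays on the first three scenarios, which is how a lack of power shows up.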

Another closely related work is the software Sherlock. Sherlock also uses P-values, and also tries to match a gene expression dataset with another GWAS. However, Sherlock does not really perform a co-localisation test but rather a general matching between a gene and a GWAS. In particular, in the Sherlock framework, only the variants significantly associated with gene expression contribute to the final test statistic. In contrast, a variant flat for the expression trait but strongly associated with disease provides strong support against co-localisation. Our work incorporates this information, by adding support to the “distinct association peaks” scenario.

A warning about the interpretation

As always in statistics, correlation does not imply causality. And what we quantify here are correlations. We can find very strong evidence that the same variant is affecting two traits, but what we cannot conclude without doubt is that the two traits (say, expression of a gene and disease outcome) are causally related. It may be likely, but we are not testing this.

An illustration of the complexity of this is the commonly observed case where a single variant (or haplotype) appears to affect the expression of a group of genes in the same chromosome region. Our test may, in such a situation, provide strong evidence of co-localisation for several of these genes with a disease GWAS. However, most of the time the expression of only one of these genes will actually causally affect the disease trait of interest. This does not mean that the test is wrong, but one has to understand what it is actually testing. To be precise, two traits affected by the same causal variant may suggest a causal link between them, but this does not have to be the case.

Two limitations of this approach

There are two additional limitations to mention. One is that the causal variant should be typed or imputed. We use simulations to show that if this is not the case, the behaviour of the test becomes very conservative.

A second issue is the presence of more than one association for the same trait at a locus. If both associations have approximately the same level of significance, the test can misbehave. In addition, identifying co-localisation with the secondary association requires conditional P-values. We give a nice example of this in the paper. However, if only P-values are available (which is key for what we want to do), this requires using approximate methods. Things are much easier if the genotype level data are available and a proper conditional regression can be implemented.

Why it is important to use summary statistics

Data sharing is always a contentious issue in human genetics. I am incredibly frustrated by the lack of willingness displayed by some groups to share data, even though they claim that they do. It is a topic for another post. Eric Schadt has been extremely helpful by sharing the liver gene expression dataset with us, but this is rather uncommon behaviour. In most cases, data are hidden behind various “regulations” and “data access committees” that rarely meet and extensively delay the process of data sharing.

Given this frustration, being able to base tests on P-values makes it much easier to interact with other groups and share data. The success of large scale meta-analyses is an example of this. This is why we worked out the statistics so that P-values alone are sufficient to derive the probabilities for each scenario.

A practical implication is that it becomes possible to build a web-based server that will take P-values uploaded by users, compare these P-values with a set of GWAS datasets stored on the server (typically expression studies but perhaps other data types) and return statistics about the overlapping association signals.

We have initiated that process and the coloc server is now live (http://coloc.cs.ucl.ac.uk/coloc/), with a lot of help from the Computer Science department at UCL. We have only loaded the liver dataset that we used in this preprint as of now, but we are in the process of adding a brain gene expression study, led by my colleagues Mike Weale, John Hardy and Mina Ryten. We very much welcome collaborations, and if other datasets, for gene expression or any other relevant traits, are available, we would love to collaborate and incorporate these data into our server.

From genome-wide to “phenome-wide”

What we really want to do with this tool in the near future is mine dozens of GWAS using single-variant P-value summary data, and search for connections that have been missed by previous investigators. Perhaps there are lipid traits that can be linked to neurodegenerative conditions, like the well-known APOE result? Perhaps some T cell genes have an unexpected effect on a cardiovascular trait? Obviously any single such link is unlikely, but a genome-wide analysis of many association studies is likely to turn up several results of this type. The idea is to work not only genome-wide but also “phenome-wide”, comparing as many pairs of traits as possible. Again, this is definitely a collaborative effort, and we would be excited to bring in more datasets to make these comparisons more powerful. So don’t hesitate to get in touch.