Simpsonian ‘Evolution by Jumps’ in an Adaptive Radiation of Anolis Lizards

Jonathan M. Eastman, Daniel Wegmann, Christoph Leuenberger, Luke J. Harmon
(Submitted on 18 May 2013)

In his highly influential view of evolution, G. G. Simpson hypothesized that clades of species evolve in adaptive zones, defined as collections of niches occupied by species with similar traits and patterns of habitat use. Simpson hypothesized that species enter new adaptive zones in one of three ways: extinction of competitor species, dispersal to a new geographic region, or the evolution of a key trait that allows species to exploit resources in a new way. However, direct tests of Simpson’s hypotheses for the entry into new adaptive zones remain elusive. Here we evaluate the fit of a Simpsonian model of jumps between adaptive zones to phylogenetic comparative data. We use a novel statistical approach to show that anoles, a well-studied adaptive radiation of Caribbean lizards, have evolved by a series of evolutionary jumps in trait evolution. Furthermore, as Simpson predicted, trait axes strongly tied to habitat specialization show jumps that correspond with the evolution of key traits and/or dispersal between islands in the Greater Antilles. We conclude that jumps are commonly associated with major adaptive shifts in the evolutionary radiation of anoles.

Efficient Algorithms for Multivariate Linear Mixed Models in Genome-wide Association Studies

Xiang Zhou, Matthew Stephens
(Submitted on 19 May 2013)

Multivariate linear mixed models (mvLMMs) have been widely used in many areas of genetics, and have attracted considerable recent interest in genome-wide association studies (GWASs). However, existing methods for calculating the likelihood ratio test statistics in mvLMMs are time consuming, and, without approximations, cannot be directly applied to analyze even two traits jointly in a typical-size GWAS. Here, we present a novel algorithm for computing parameter estimates and test statistics (Likelihood ratio and Wald) in mvLMMs that i) reduces per-iteration optimization complexity from cubic to linear in the number of samples; and ii) in GWAS analyses, reduces per-marker complexity from cubic to approximately quadratic (or linear if the relatedness matrix is of low rank) in the number of samples. The new method effectively generalizes both the EMMA (Efficient Mixed Model Association) algorithm and the GEMMA (Genome-wide EMMA) algorithm to the multivariate case, making the likelihood ratio tests in GWASs with mvLMM possible, for the first time, for tens of thousands of samples and a moderate number of phenotypes (<10). With real examples, we show that, as expected, the new method is orders of magnitude faster than competing methods in both variance component estimation in a single mvLMM, and in GWAS applications. The method is implemented in the GEMMA software package, freely available at this http URL

Variable-length haplotype construction for gene-gene interaction studies

Anunchai Assawamakin, Nachol Chaiyaratana, Chanin Limwongse, Saravudh Sinsomros, Pa-thai Yenchitsomanus, Prakarnkiat Youngkong
(Submitted on 19 May 2013)

This paper presents a non-parametric classification technique for identifying a candidate bi-allelic genetic marker set that best describes disease susceptibility in gene-gene interaction studies. The developed technique functions by creating a mapping between inferred haplotypes and case/control status. The technique cycles through all possible marker combination models generated from the available marker set where the best interaction model is determined from prediction accuracy and two auxiliary criteria including low-to-high order haplotype propagation capability and model parsimony. Since variable-length haplotypes are created during the best model identification, the developed technique is referred to as a variable-length haplotype construction for gene-gene interaction (VarHAP) technique. VarHAP has been benchmarked against a multifactor dimensionality reduction (MDR) program and a haplotype interaction technique embedded in a FAMHAP program in various two-locus interaction problems. The results reveal that VarHAP is suitable for all interaction situations with the presence of weak and strong linkage disequilibrium among genetic markers.

Our paper: Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

This guest post is by Vincent Plagnol on his group’s paper Bayesian test for co-localisation between pairs of genetic association studies using summary statistics, arXived here. This has been cross-posted from the Plagnol Lab web site.

In this paper we want to answer the following question: given two genetic association studies both showing some association signal at a locus, how likely is it that the same variant is responsible for both associations?

We care about this because a shared causal variant is likely to imply an etiological link between the traits being considered. An obvious application consists of comparing a gene expression study and a disease trait. If one can show that the same variant is affecting both measurements, then it is very likely that the expression of this gene is affecting disease pathogenesis. It also provides information about the tissue type where the effect is mediated. This is key information for a drug design process.

Previous work that led to this manuscript

A while back, I started a discussion with my colleague (and co-author on this manuscript) Eric Schadt about the involvement of a gene named RPS26 in type 1 diabetes. We came up with tests of co-localisation, which were later improved by my colleague (and co-author as well) Chris Wallace, based in Cambridge. These tests are somewhat dated now. The earliest version considered situations with a very small number of SNPs, and was not well suited to densely typed regions, such as those produced by imputation procedures.

This SNP density problem can be overcome to some extent, and Chris Wallace discusses how to do this here. However, a more fundamental issue is the Bayesian/frequentist difference. These earlier tests were testing the null hypothesis of a shared causal variant. Failing to reject the null could be the result of either a lack of power, or a true shared causal variant. In this newer Bayesian framework, the probability of each scenario is computed, including the “lack of power” case. It then becomes easier to interpret the outcome of the test. The tests are about to be released in the latest version of the coloc package (which is maintained by Chris Wallace).

In this latest paper, the underlying model is closely linked to the one proposed by Matthew Stephens and colleagues in a recent PLoS Genetics paper. However, co-localisation was more a side story in this paper, whereas it is the central point of our work. In particular, we show that it is possible to use single SNP P-values to obtain a very good approximation of the correct answer. As discussed below, this has important practical applications.
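
To make the idea of "a probability for each scenario" concrete, here is a minimal sketch of how per-SNP summary statistics could be turned into scenario probabilities. It is an illustration under stated assumptions, not the implementation in the coloc package: it assumes Wakefield-style approximate Bayes factors, at most one causal variant per trait in the region, and five scenarios (H0: no association, H1: trait 1 only, H2: trait 2 only, H3: both traits with distinct variants, H4: both traits with a shared variant). The function names, the prior effect-size standard deviation `prior_sd`, and the prior probabilities `p1`, `p2`, `p12` are illustrative choices.

```python
import math
from statistics import NormalDist


def z_from_p(p_value):
    """Absolute z-score implied by a two-sided P-value."""
    return abs(NormalDist().inv_cdf(p_value / 2.0))


def log_abf(z, se, prior_sd=0.15):
    """Log approximate Bayes factor for one SNP (Wakefield-style).

    prior_sd is an assumed prior standard deviation of the effect size.
    """
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def log_sum(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference is not positive."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def scenario_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [H0, H1, H2, H3, H4] for one region.

    labf1 and labf2 hold the log Bayes factors of the same SNPs, in the
    same order, for trait 1 and trait 2. The priors are assumptions.
    """
    both = [a + b for a, b in zip(labf1, labf2)]
    s1, s2, s12 = log_sum(labf1), log_sum(labf2), log_sum(both)
    log_h = [
        0.0,                                                   # H0
        math.log(p1) + s1,                                     # H1
        math.log(p2) + s2,                                     # H2
        math.log(p1) + math.log(p2) + log_diff(s1 + s2, s12),  # H3
        math.log(p12) + s12,                                   # H4
    ]
    norm = log_sum(log_h)
    return [math.exp(h - norm) for h in log_h]
```

With three SNPs and a standard error of 0.05 throughout, z-scores of (0.5, 6.0, 0.3) for trait 1 and (0.2, 5.5, 0.4) for trait 2 put nearly all the posterior mass on H4, whereas moving the second trait's peak to the third SNP, (0.2, 0.4, 5.5), shifts the mass to H3, the "distinct association peaks" scenario.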

Another closely related work is the software Sherlock. Sherlock also uses P-values, and also tries to match a gene expression dataset with another GWAS. However, Sherlock does not really perform a co-localisation test but rather a general matching between a gene and a GWAS. In particular, in the Sherlock framework, only the variants significantly associated with gene expression contribute to the final test statistic. In contrast, a variant flat for the expression trait but strongly associated with disease provides strong support against co-localisation. Our work incorporates this information, by adding support to the “distinct association peaks” scenario.

A warning about the interpretation

As always in statistics, correlation does not imply causality. And what we quantify here are correlations. We can find very strong evidence that the same variant is affecting two traits, but what we cannot conclude without doubt is that the two traits (say, expression of a gene and disease outcome) are causally related. It may be likely, but we are not testing this.

An illustration of the complexity of this is the commonly observed case where a single variant (or haplotype) appears to affect the expression of a group of genes in the same chromosome region. Our test may, in such a situation, provide strong evidence of co-localisation for several of these genes with a disease GWAS. However, most of the time the expression of only one of these genes will actually causally affect the disease trait of interest. This does not mean that the test is wrong; one just has to understand what it is actually testing. To be precise, two traits affected by the same causal variant may suggest a causal link between them, but this does not have to be the case.

Two limitations of this approach

There are two additional limitations to mention. One is that the causal variant should be typed or imputed. We use simulations to show that if this is not the case, the behaviour of the test becomes very conservative.

A second issue is the presence of more than one association for the same trait at a locus. If both associations have approximately the same level of significance, the test can misbehave. In addition, identifying co-localisation with the secondary association requires conditional P-values. We give a nice example of this in the paper. However, if only P-values are available (which is key for what we want to do), this requires using approximate methods. Things are much easier if the genotype level data are available and a proper conditional regression can be implemented.

Why it is important to use summary statistics

Data sharing is always a contentious issue in human genetics. I am incredibly frustrated by the lack of willingness displayed by some groups to share data, even though they claim that they do. It is a topic for another post. Eric Schadt has been extremely helpful by sharing the liver gene expression dataset with us, but this is a rather uncommon behaviour. In most cases, data are hidden behind various “regulations” and “data access committees” that rarely meet and extensively delay the process of data sharing.

Given this frustration, being able to base tests on P-values makes it much easier to interact with other groups and share data. The success of large scale meta-analyses is an example of this. This is why we worked out the statistics so that P-values alone are sufficient to derive the probabilities for each scenario.

A practical implication is that it becomes possible to build a web-based server that will take P-values uploaded by users, compare these P-values with a set of GWAS datasets stored on the server (typically expression studies but perhaps other data types) and return statistics about the overlapping association signals.

We have initiated that process and the coloc server is now live (http://coloc.cs.ucl.ac.uk/coloc/), with a lot of help from the Computer Science department at UCL. We have only loaded the liver dataset that we used in this preprint as of now, but we are in the process of adding a brain gene expression study, led by my colleagues Mike Weale, John Hardy and Mina Ryten. We very much welcome collaborations, and if other datasets, for gene expression or any other relevant traits, are available, we would love to collaborate and incorporate these data into our server.

From genome-wide to “phenome-wide”

What we really want to do with this tool in the near future is mine dozens of GWAS datasets using single-variant P-value summary data, and search for connections that have been missed by previous investigators. Perhaps there are lipid traits that can be linked to neurodegenerative conditions, like the well known APOE result? Perhaps some T cell genes have an unexpected effect on a cardiovascular trait? Obviously these are not likely events but the genome-wide analysis of many association studies is likely to show several results of this type. The idea is to not only work genome-wide but also “phenome-wide”, comparing as many pairs of traits as possible. Again, this is definitely a collaborative work and we would be excited if we could bring more datasets to make these comparisons more powerful. So don’t hesitate to get in touch.

The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

Kentaro Yoshida, Verena J. Schuenemann, Liliana M. Cano, Marina Pais, Bagdevi Mishra, Rahul Sharma, Christa Lanz, Frank N. Martin, Sophien Kamoun, Johannes Krause, Marco Thines, Detlef Weigel, Hernán A. Burbano
(Submitted on 17 May 2013)

Phytophthora infestans, the cause of potato late blight, is infamous for having triggered the Irish Great Famine in the 1840s. Until the late 1970s, P. infestans diversity outside of its Mexican center of origin was low, and one scenario held that a single strain, US-1, had dominated the global population for 150 years; this was later challenged based on DNA analysis of historical herbarium specimens. We have compared the genomes of 11 herbarium and 15 modern strains. We conclude that the nineteenth century epidemic was caused by a unique genotype, HERB-1, that persisted for over 50 years. HERB-1 is distinct from all examined modern strains, but it is a close relative of US-1, which replaced it outside of Mexico in the twentieth century. We propose that HERB-1 and US-1 emerged from a metapopulation that was established in the early 1800s outside of the species’ center of diversity.

Bayesian test for co-localisation between pairs of genetic association studies using summary statistics

Claudia Giambartolomei (1), Damjan Vukcevic (2), Eric E. Schadt (3), Aroon D. Hingorani (1), Chris Wallace (4), Vincent Plagnol (1) ((1) University College London (UCL), London, UK, (2) Royal Children’s Hospital, Melbourne, Australia, (3) Mount Sinai School of Medicine, New York USA, (4) University of Cambridge, Cambridge, UK)
(Submitted on 17 May 2013)

Genetic association studies, in particular the genome-wide association study (GWAS) design, have provided a wealth of novel insights into the aetiology of a wide range of human diseases and traits, in particular cardiovascular diseases and lipid biomarkers. The next challenge consists of understanding the molecular basis of these associations. The integration of multiple association datasets, including gene expression datasets, can contribute to this goal. We have developed a novel statistical methodology to assess whether two association signals are consistent with a shared causal variant. An application is the integration of disease scans with expression quantitative trait locus (eQTL) studies, but any pair of GWAS datasets can be integrated in this framework. A key feature of the method is the ability to derive the key output statistics from single SNP summary statistics, hence making it possible to perform systematic meta-analysis type comparisons across multiple GWAS datasets (implemented online at this http URL). We demonstrate the value of the approach by re-analysing a gene expression dataset in 966 liver samples with a published meta-analysis of lipid traits including > 100,000 individuals of European ancestry. Our co-localisation results are broadly consistent with the conclusion from the published meta-analysis. Combining all lipid biomarkers, our re-analysis supported 29 out of 38 reported co-localisation results with eQTLs. Two clearly discordant findings (IFT172, CPNE1), as well as multiple new co-localisation results, highlight the value of a formal systematic statistical test. Our findings provide information about the causal gene in associated intervals and have direct implications for the understanding of complex diseases as well as the design of drugs to target disease pathways.

Computing the posterior expectation of phylogenetic trees

Philipp Benner, Miroslav Bačák
(Submitted on 16 May 2013)

Inferring phylogenetic trees from multiple sequence alignments often relies upon Markov chain Monte Carlo (MCMC) methods to generate tree samples from a posterior distribution. To give a rigorous approximation of the posterior expectation, one needs to compute the mean of the tree samples, and therefore a sound definition of a mean and algorithms for its computation are needed. To the best of our knowledge, no existing method of phylogenetic inference can handle the full set of sample trees, because such trees typically have different topologies. We develop a novel statistical model for the inference of phylogenetic trees based on the tree space due to Billera et al. [2001]. Since it is a Hadamard space, the mean and median are well defined, which we also motivate from a decision theoretic perspective. The actual approximation of the posterior expectation relies on some recent developments in Hadamard spaces (Bačák [2013a], Miller et al. [2012]) and the fast computation of geodesics in tree space (Owen and Provan [2011]), which together make it possible to compute medians and means of trees with different topologies. Our intention is to give a full self-contained description of the methods required to approximate posterior expectations. We demonstrate these methods on the small ribosomal subunit rRNA sequence alignment. The posterior expectations obtained on this data set are a meaningful summary of the posterior distribution and the uncertainty about the tree topology.

Small ancestry informative marker panels for complete classification between the original four HapMap populations

Damrongrit Setsirichok, Theera Piroonratana, Anunchai Assawamakin, Touchpong Usavanarong, Chanin Limwongse, Waranyu Wongseree, Chatchawit Aporntewan, Nachol Chaiyaratana
(Submitted on 16 May 2013)

A protocol for the identification of ancestry informative markers (AIMs) from genome-wide single nucleotide polymorphism (SNP) data is proposed. The protocol consists of three main steps: (a) identification of potential positive selection regions via Fst extremity measurement, (b) SNP screening via two-stage attribute selection and (c) classification model construction using a naive Bayes classifier. The two-stage attribute selection is composed of a newly developed round robin symmetrical uncertainty ranking technique and a wrapper embedded with a naive Bayes classifier. The protocol has been applied to the HapMap Phase II data. Two AIM panels, which consist of 10 and 16 SNPs that lead to complete classification between CEU, CHB, JPT and YRI populations, are identified. Moreover, the panels are at least four times smaller than those reported in previous studies. The results suggest that the protocol could be useful in a scenario involving a larger number of populations.
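
As a rough illustration of step (a), the sketch below scores each SNP by a simple heterozygosity-based Fst and keeps the most extreme ones. This is an assumption-laden example rather than the estimator used in the paper: it assumes bi-allelic SNPs, equally weighted populations, and allele frequencies that are already known. The names `fst`, `top_fst_snps` and the `fraction` cut-off are hypothetical.

```python
def fst(freqs):
    """Fst for one bi-allelic SNP from per-population allele frequencies.

    Uses (Ht - Hs) / Ht with equal population weights (an assumption).
    """
    p_bar = sum(freqs) / len(freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # monomorphic site: no differentiation to measure
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_total - h_within) / h_total


def top_fst_snps(freq_table, fraction=0.01):
    """Return the SNP ids with the highest Fst (at least one SNP).

    freq_table maps a SNP id to its list of per-population frequencies.
    """
    scored = sorted(((fst(f), snp) for snp, f in freq_table.items()),
                    reverse=True)
    keep = max(1, int(len(scored) * fraction))
    return [snp for _, snp in scored[:keep]]
```

SNPs retained by such a filter would then be passed on to the attribute selection and naive Bayes steps described in the abstract.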

SISRS: SNP Identification from Short Read Sequences

Rachel S. Schwartz, Kelly Harkins, Anne C. Stone, Reed A. Cartwright
(Submitted on 16 May 2013)

One of the important challenges in modern phylogenetics is to identify data that can be used to resolve species relationships accurately. Whole-genome shotgun sequencing provides large amounts of data from which to identify phylogenetically informative sites; however, previous studies have required genome assembly or alignment to a reference genome, which is difficult when species are not closely related.
We have developed a pipeline to extract potentially informative sites directly from raw short-read sequence data. Reads are assembled into conserved genome fragments, reads are then aligned to these fragments, and informative sites are identified. This pipeline produced >14000 informative sites from reads for 12 species of Leishmania and a reference genome. When analyzed using standard phylogenetic methods, these data resulted in a fully bifurcating tree with strongly supported nodes.
Our procedure is implemented in the software SISRS (pronounced “scissors”) which is freely available at this https URL.
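
The final step of such a pipeline, picking out informative sites once reads have been aligned, can be sketched as follows. This is a hedged example, not the SISRS code: it assumes one base call per taxon per position and uses the standard parsimony-informative criterion (at least two different bases, each seen in at least two taxa), which may differ from the filters SISRS applies. The function names are hypothetical.

```python
from collections import Counter


def is_informative(column, missing="N-"):
    """True if at least two bases each occur in at least two taxa."""
    counts = Counter(b for b in column.upper() if b not in missing)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(alignment):
    """Return 0-based positions of parsimony-informative sites.

    alignment maps a taxon name to its aligned sequence; all sequences
    must have the same length.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i in range(length)
            if is_informative("".join(s[i] for s in seqs))]
```

For four taxa with sequences ACGTA, ACGTC, ATGAC and ATGNA, positions 1 and 4 are reported: position 3 is skipped because, after the missing call is ignored, only one base appears in two or more taxa.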

Meta-Analysis of Gene Level Association Tests

Dajiang J. Liu, Gina M. Peloso, Xiaowei Zhan, Oddgeir Holmen, Matthew Zawistowski, Shuang Feng, Majid Nikpay, Paul L. Auer, Anuj Goel, He Zhang, Ulrike Peters, Martin Farrall, Marju Orho-Melander, Charles Kooperberg, Ruth McPherson, Hugh Watkins, Cristen J. Willer, Kristian Hveem, Olle Melander, Sekar Kathiresan, Gonçalo R. Abecasis
(Submitted on 6 May 2013)

The vast majority of connections between complex disease and common genetic variants were identified through meta-analysis, a powerful approach that enables large sample sizes while protecting against common artifacts due to population structure, repeated small sample analyses, and/or limitations with sharing individual level data. As the focus of genetic association studies shifts to rare variants, genes and other functional units are becoming the unit of analysis. Here, we propose and evaluate new approaches for meta-analysis of rare variant association. We show that our approach retains useful features of single variant meta-analytic approaches and demonstrate its utility in a study of blood lipid levels in ~18,500 individuals genotyped with exome arrays.