The rise and fall of the Phytophthora infestans lineage that triggered the Irish potato famine

Kentaro Yoshida, Verena J. Schuenemann, Liliana M. Cano, Marina Pais, Bagdevi Mishra, Rahul Sharma, Christa Lanz, Frank N. Martin, Sophien Kamoun, Johannes Krause, Marco Thines, Detlef Weigel, Hernán A. Burbano
(Submitted on 17 May 2013)

Phytophthora infestans, the cause of potato late blight, is infamous for having triggered the Irish Great Famine in the 1840s. Until the late 1970s, P. infestans diversity outside of its Mexican center of origin was low, and one scenario held that a single strain, US-1, had dominated the global population for 150 years; this was later challenged based on DNA analysis of historical herbarium specimens. We have compared the genomes of 11 herbarium and 15 modern strains. We conclude that the nineteenth century epidemic was caused by a unique genotype, HERB-1, that persisted for over 50 years. HERB-1 is distinct from all examined modern strains, but it is a close relative of US-1, which replaced it outside of Mexico in the twentieth century. We propose that HERB-1 and US-1 emerged from a metapopulation that was established in the early 1800s outside of the species’ center of diversity.

Computing the posterior expectation of phylogenetic trees

Philipp Benner, Miroslav Bačák
(Submitted on 16 May 2013)

Inferring phylogenetic trees from multiple sequence alignments often relies upon Markov chain Monte Carlo (MCMC) methods to generate tree samples from a posterior distribution. To give a rigorous approximation of the posterior expectation, one needs to compute the mean of the tree samples, so a sound definition of a mean and algorithms for its computation are in high demand. To the best of our knowledge, no existing method of phylogenetic inference can handle the full set of sample trees, because such trees typically have different topologies. We develop a novel statistical model for the inference of phylogenetic trees based on the tree space due to Billera et al. [2001]. Since it is a Hadamard space, the mean and median are well defined, which we also motivate from a decision theoretic perspective. The actual approximation of the posterior expectation relies on some recent developments in Hadamard spaces (Bačák [2013a], Miller et al. [2012]) and the fast computation of geodesics in tree space (Owen and Provan [2011]), which together make it possible to compute medians and means of trees with different topologies. Our intention is to give a full self-contained description of the methods required to approximate posterior expectations. We demonstrate these methods on the small ribosomal subunit rRNA sequence alignment. The posterior expectations obtained on this data set are a meaningful summary of the posterior distribution and the uncertainty about the tree topology.
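
For readers unfamiliar with means in metric spaces, the mean and median mentioned in this abstract are presumably the usual Fréchet mean and geometric median. The notation below is an assumption made here for illustration, not taken from the paper: T_1, ..., T_N are the sampled trees, d is the geodesic distance in the tree space of Billera et al., and \mathcal{T}_n is that space for n leaves.

```latex
\bar{T} = \operatorname*{arg\,min}_{T \in \mathcal{T}_n} \sum_{i=1}^{N} d(T, T_i)^2,
\qquad
\tilde{T} = \operatorname*{arg\,min}_{T \in \mathcal{T}_n} \sum_{i=1}^{N} d(T, T_i)
```

Because both quantities are defined only through distances, they make sense even when the sampled trees have different topologies, which is the point the abstract emphasizes.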

Small ancestry informative marker panels for complete classification between the original four HapMap populations

Damrongrit Setsirichok, Theera Piroonratana, Anunchai Assawamakin, Touchpong Usavanarong, Chanin Limwongse, Waranyu Wongseree, Chatchawit Aporntewan, Nachol Chaiyaratana
(Submitted on 16 May 2013)

A protocol for the identification of ancestry informative markers (AIMs) from genome-wide single nucleotide polymorphism (SNP) data is proposed. The protocol consists of three main steps: (a) identification of potential positive selection regions via Fst extremity measurement, (b) SNP screening via two-stage attribute selection and (c) classification model construction using a naive Bayes classifier. The two-stage attribute selection is composed of a newly developed round robin symmetrical uncertainty ranking technique and a wrapper embedded with a naive Bayes classifier. The protocol has been applied to the HapMap Phase II data. Two AIM panels, which consist of 10 and 16 SNPs that lead to complete classification between CEU, CHB, JPT and YRI populations, are identified. Moreover, the panels are at least four times smaller than those reported in previous studies. The results suggest that the protocol could be useful in a scenario involving a larger number of populations.
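
As a rough illustration of the ranking criterion named in step (b), the sketch below computes plain symmetrical uncertainty between one SNP and the population label, using the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). It does not reproduce the authors' round robin variant or the wrapper stage; the function names and toy genotypes are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so nothing to share
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2.0 * mutual_information / (hx + hy)


# Toy data (hypothetical): genotypes coded as allele counts, two populations.
labels = ["CEU", "CEU", "YRI", "YRI"]
su_perfect = symmetrical_uncertainty([0, 0, 1, 1], labels)  # SNP tracks the label
su_none = symmetrical_uncertainty([0, 1, 0, 1], labels)     # SNP is uninformative
print(su_perfect, su_none)
```

A SNP that separates the two populations perfectly scores 1, and one that is independent of the label scores 0, so ranking SNPs by this score favours ancestry informative markers.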

SISRS: SNP Identification from Short Read Sequences

Rachel S. Schwartz, Kelly Harkins, Anne C. Stone, Reed A. Cartwright
(Submitted on 16 May 2013)

One of the important challenges in modern phylogenetics is to identify data that can be used to resolve species relationships accurately. Whole-genome shotgun sequencing provides large amounts of data from which to identify phylogenetically informative sites; however, previous studies have required genome assembly or alignment to a reference genome, which is difficult when species are not closely related.
We have developed a pipeline to extract potentially informative sites directly from raw short-read sequence data. Reads are assembled into conserved genome fragments, reads are then aligned to these fragments, and informative sites are identified. This pipeline produced >14000 informative sites from reads for 12 species of Leishmania and a reference genome. When analyzed using standard phylogenetic methods, these data resulted in a fully bifurcating tree with strongly supported nodes.
Our procedure is implemented in the software SISRS (pronounced “scissors”) which is freely available at this https URL.
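
The last stage of such a pipeline, picking informative sites out of aligned data, can be sketched as follows. This is not the SISRS code. It assumes the common parsimony-informative criterion (at least two different bases, each present in at least two taxa), which may differ from the criterion SISRS applies, and the species names and sequences are made up.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two distinct bases each occur in at least two taxa."""
    counts = Counter(base for base in column.upper() if base in "ACGT")
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return 0-based positions of informative columns.

    `alignment` maps a taxon name to its aligned sequence; all sequences
    must have the same length. Gaps and ambiguity codes are ignored.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [
        i for i in range(length)
        if is_parsimony_informative("".join(s[i] for s in seqs))
    ]


# Toy alignment (hypothetical data).
toy = {
    "sp1": "ACGTA",
    "sp2": "ACGTC",
    "sp3": "ATGAC",
    "sp4": "ATGAA",
}
sites = informative_sites(toy)
print(sites)  # → [1, 3, 4]
```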

Meta-Analysis of Gene Level Association Tests

Dajiang J. Liu, Gina M. Peloso, Xiaowei Zhan, Oddgeir Holmen, Matthew Zawistowski, Shuang Feng, Majid Nikpay, Paul L. Auer, Anuj Goel, He Zhang, Ulrike Peters, Martin Farrall, Marju Orho-Melander, Charles Kooperberg, Ruth McPherson, Hugh Watkins, Cristen J. Willer, Kristian Hveem, Olle Melander, Sekar Kathiresan, Gonçalo R. Abecasis
(Submitted on 6 May 2013)

The vast majority of connections between complex disease and common genetic variants were identified through meta-analysis, a powerful approach that enables large sample sizes while protecting against common artifacts due to population structure, repeated small sample analyses, and/or limitations with sharing individual level data. As the focus of genetic association studies shifts to rare variants, genes and other functional units are becoming the unit of analysis. Here, we propose and evaluate new approaches for meta-analysis of rare variant association. We show that our approach retains useful features of single variant meta-analytic approaches and demonstrate its utility in a study of blood lipid levels in ~18,500 individuals genotyped with exome arrays.

SARS-CoV originated from bats in 1998 and may still exist in humans

Ailin Tao, Yuyi Huang, Peilu Li, Jun Liu, Nanshan Zhong, Chiyu Zhang
(Submitted on 13 May 2013)

SARS-CoV is believed to originate from civets and was thought to have been eliminated as a threat after the 2003 outbreak. Here, we show that human SARS-CoV (huSARS-CoV) originated directly from bats, rather than civets, by a cross-species jump in 1991, and formed a human-adapted strain in 1998. Since then huSARS-CoV has evolved further into highly virulent strains with genotype T and a 29-nt deletion mutation, and weakly virulent strains with genotype C but without the 29-nt deletion. The former can cause pneumonia in humans and could be the major causative pathogen of the SARS outbreak, whereas the latter might not cause pneumonia in humans, but evolved the ability to co-utilize civet ACE2 as an entry receptor, leading to interspecies transmission between humans and civets. Three crucial time points – 1991, for the cross-species jump from bats to humans; 1998, for the formation of the human-adapted SARS-CoV; and 2003, when there was an outbreak of SARS in humans – were found to associate with anomalously low annual precipitation and high temperatures in Guangdong. Anti-SARS-CoV sero-positivity was detected in 20% of all the samples tested from Guangzhou children who were born after 2005, suggesting that weakly virulent huSARS-CoVs might still exist in humans. These existing but undetected SARS-CoVs have a large potential to evolve into highly virulent strains when favorable climate conditions occur, highlighting a potential risk for the reemergence of SARS.

The new science of metagenomics and the challenges of its use in both developed and developing countries

Edi Prifti (MICA), Jean-Daniel Zucker (MSI, UMMISCO, Nutriomique, Eq. 7)
(Submitted on 10 May 2013)

Our view of the microbial world and its impact on human health is changing radically with the ability to sequence uncultured or unculturable microbes sampled directly from their habitats, an ability made possible by fast and cheap next generation sequencing technologies. Such recent developments represent a paradigmatic shift in the analysis of habitat biodiversity, be it the human, soil or ocean microbiome. We review here some research examples and results that indicate the importance of the microbiome in our lives and then discuss some of the challenges faced by metagenomic experiments and the subsequent analysis of the generated data. We then analyze the economic and social impact on genomic medicine and research in both developing and developed countries. We support the idea that there are significant benefits in building capacities for developing high-level scientific research in metagenomics in developing countries. Indeed, the notion that developing countries should wait for developed countries to make advances in science and technology that they later import at great cost has recently been challenged.

Inference in Kingman’s Coalescent with Particle Markov Chain Monte Carlo Method

Yifei Chen, Xiaohui Xie
(Submitted on 3 May 2013)

We propose a new algorithm for posterior sampling of Kingman’s coalescent, based upon the Particle Markov Chain Monte Carlo methodology. Specifically, the algorithm is an instantiation of the Particle Gibbs Sampling method, which alternately samples coalescent times conditioned on coalescent tree structures, and tree structures conditioned on coalescent times via the conditional Sequential Monte Carlo procedure. We implement our algorithm as a C++ package, and demonstrate its utility via a parameter estimation task in population genetics on both single- and multiple-locus data. The experimental results show that the proposed algorithm performs comparably to or better than several well-developed methods.
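
To make the two objects being sampled concrete (coalescent times and tree structure), here is a minimal sketch that draws a genealogy from the Kingman coalescent prior. It is not the authors' Particle Gibbs sampler and involves no data or conditioning. It relies only on the standard facts that, with k lineages, the waiting time to the next merger is exponential with rate k(k-1)/2 in coalescent time units and that the merging pair is chosen uniformly. The function name is hypothetical.

```python
import random


def simulate_kingman(n, rng):
    """Draw one genealogy of n samples from the Kingman coalescent prior.

    Returns a list of (time, left, right) merge events, where `left` and
    `right` are tuples of the leaf labels under the two merging lineages.
    """
    lineages = [(i,) for i in range(n)]
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        rate = k * (k - 1) / 2.0           # number of pairs that can merge
        t += rng.expovariate(rate)         # exponential waiting time
        i, j = rng.sample(range(k), 2)     # uniformly chosen pair
        left, right = lineages[i], lineages[j]
        merged = tuple(sorted(left + right))
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        events.append((t, left, right))
    return events


events = simulate_kingman(5, random.Random(1))
for time, left, right in events:
    print(f"t={time:.3f}: {left} + {right}")
```

A Particle Gibbs scheme of the kind the abstract describes would replace these unconditional draws with draws conditioned on sequence data, alternating between the times and the topology.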

Isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard

Tianyang Li, Rui Jiang, Xuegong Zhang
(Submitted on 4 May 2013)

Maximum likelihood is a popular technique for isoform reconstruction. Here, we show that isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard.

Abundance-weighted phylogenetic diversity measures distinguish microbial community states and are robust to sampling depth

Connor O. McCoy, Frederick A. Matsen IV
(Submitted on 1 May 2013)

In microbial ecology studies, the most commonly used ways of investigating alpha (within-sample) diversity are either to apply count-only measures such as Simpson’s index to Operational Taxonomic Unit (OTU) groupings, or to use classical phylogenetic diversity (PD), which is not abundance-weighted. Alpha diversity measures that use abundance information in a phylogenetic framework do exist, but they are not widely used within the microbial ecology community. The performance of abundance-weighted phylogenetic diversity measures compared to classical discrete measures has not been explored, and the behavior of these measures under rarefaction (sub-sampling) is not yet clear. In this paper we compare the ability of various alpha diversity measures to distinguish between different community states in the human microbiome for three different data sets. We also present and compare a novel one-parameter family of alpha diversity measures, BWPD_\theta, that interpolates between classical phylogenetic diversity (PD) and an abundance-weighted extension of PD. Additionally, we examine the sensitivity of these phylogenetic diversity measures to sampling, via computational experiments and by deriving a closed form solution for the expectation of phylogenetic quadratic entropy under re-sampling. In all three of the data sets considered, an abundance-weighted measure is the best differentiator between community states. OTU-based measures, on the other hand, are less effective in distinguishing community types. In addition, abundance-weighted phylogenetic diversity measures are less sensitive to differing sampling intensity than their unweighted counterparts. Based on these results we encourage the use of abundance-weighted phylogenetic diversity measures, especially for cases such as microbial ecology where species delimitation is difficult.
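
The count-only baseline and the rarefaction step mentioned in this abstract can be sketched in a few lines. The code below is not the authors' implementation and does not compute BWPD_\theta or any phylogenetic measure. It shows Simpson's index in its Gini-Simpson form (1 minus the sum of squared OTU proportions) and rarefaction as sampling reads without replacement. The OTU counts and function names are hypothetical.

```python
import random


def gini_simpson(counts):
    """Gini-Simpson diversity: 1 - sum(p_i^2) over OTU proportions p_i."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def rarefy(counts, depth, rng):
    """Sub-sample `depth` reads without replacement; return per-OTU counts."""
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    sub = [0] * len(counts)
    for otu in rng.sample(pool, depth):
        sub[otu] += 1
    return sub


# Toy OTU table for one sample (hypothetical data).
sample = [50, 30, 15, 4, 1]
rng = random.Random(0)
sub = rarefy(sample, 20, rng)
print("full sample :", gini_simpson(sample), "OTUs observed:", len(sample))
print("rarefied    :", gini_simpson(sub), "OTUs observed:", sum(c > 0 for c in sub))
```

Repeating the rarefaction at several depths and comparing the observed OTU count with an abundance-weighted index gives a small-scale view of the sampling-depth sensitivity the abstract discusses.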