Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2

Michael I Love, Wolfgang Huber, Simon Anders

In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-Seq data, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data. DESeq2 uses shrinkage estimation for dispersions and fold changes to improve stability and interpretability of the estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression and facilitates downstream tasks such as gene ranking and visualization. DESeq2 is available as an R/Bioconductor package.
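The idea of shrinking a noisy fold-change estimate can be illustrated with a minimal sketch. It assumes a textbook normal-normal model in which the estimate is pulled toward zero in proportion to its uncertainty. This is not DESeq2's actual procedure; the function names, the pseudocount, and the prior standard deviation are all hypothetical choices made for illustration.

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudocount=0.5):
    """Naive log2 fold change between two mean counts.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


def shrink_toward_zero(estimate, standard_error, prior_sd):
    """Posterior mean under a zero-centered normal prior (illustrative only).

    Estimates with a large standard error are pulled strongly toward zero;
    precise estimates are left almost unchanged.
    """
    prior_var = prior_sd ** 2
    weight = prior_var / (prior_var + standard_error ** 2)
    return weight * estimate


# A gene with few reads (large standard error) is shrunk more than a
# well-measured gene with the same raw estimate.
noisy = shrink_toward_zero(estimate=2.0, standard_error=1.0, prior_sd=1.0)
precise = shrink_toward_zero(estimate=2.0, standard_error=0.1, prior_sd=1.0)
```

In this toy model the shrunken values are directly comparable across genes, which is what makes ranking by effect size meaningful.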

Lighter: fast and memory-efficient error correction without counting

Li Song, Liliana Florea, Ben Langmead

Lighter is a fast and memory-efficient tool for correcting sequencing errors in high-throughput sequencing datasets. Lighter avoids counting k-mers in the sequencing reads. Instead, it uses a pair of Bloom filters, one populated with a sample of the input k-mers and the other populated with k-mers likely to be correct based on a simple test. As long as the sampling fraction is adjusted in inverse proportion to the depth of sequencing, the Bloom filter size can be held constant while maintaining near-constant accuracy. Lighter is easily applied to very large sequencing datasets. It is parallelized, uses no secondary storage, and is both faster and more memory-efficient than competing approaches while achieving comparable accuracy. Lighter is free open source software available from https://github.com/mourisl/Lighter/.
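The sampling step can be sketched roughly as follows. This is a simplified illustration, not Lighter's implementation: the `BloomFilter` class, the hash construction, and the helper names are assumptions, and only the first filter (the sampled k-mers) is shown, because the abstract does not spell out the test used to fill the second one.

```python
import hashlib
import random


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not Lighter's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per hash function by salting BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def sampling_fraction(base_fraction, base_depth, depth):
    """Scale the sampling fraction in inverse proportion to sequencing depth."""
    return base_fraction * base_depth / depth


def sample_kmers(reads, k, fraction, bloom, rng):
    """Insert each k-mer into the filter with probability `fraction`."""
    for read in reads:
        for kmer in kmers(read, k):
            if rng.random() < fraction:
                bloom.add(kmer)
```

Doubling the depth halves the fraction, so the expected number of sampled k-mers, and therefore the load on a fixed-size filter, stays roughly constant.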

Reconstructing Austronesian population history in Island Southeast Asia

Mark Lipson, Po-Ru Loh, Nick Patterson, Priya Moorjani, Ying-Chin Ko, Mark Stoneking, Bonnie Berger, David Reich

Austronesian languages are spread across half the globe, from Easter Island to Madagascar. Evidence from linguistics and archaeology indicates that the “Austronesian expansion,” which began 4-5 thousand years ago, likely had roots in Taiwan, but the ancestry of present-day Austronesian-speaking populations remains controversial. Here, focusing primarily on Island Southeast Asia, we analyze genome-wide data from 56 populations using new methods for tracing ancestral gene flow. We show that all sampled Austronesian groups harbor ancestry that is more closely related to aboriginal Taiwanese than to any present-day mainland population. Surprisingly, western Island Southeast Asian populations have also inherited ancestry from a source nested within the variation of present-day populations speaking Austro-Asiatic languages, which have historically been nearly exclusive to the mainland. Thus, either there was once a substantial Austro-Asiatic presence in Island Southeast Asia, or Austronesian speakers migrated to and through the mainland, admixing there before continuing to western Indonesia.

The Dawn of Open Access to Phylogenetic Data

Andrew F. Magee, Michael R. May, Brian R. Moore
(Submitted on 23 May 2014)

The scientific enterprise depends critically on the preservation of and open access to published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Increasingly, phylogenies are estimated from increasingly large, genome-scale datasets using increasingly complex statistical methods that require increasing levels of expertise and computational investment. Moreover, the resulting phylogenetic data provide an explicit historical perspective that critically informs research in a vast and growing number of scientific disciplines. One such use is the study of changes in rates of lineage diversification (speciation minus extinction) through time. As part of a meta-analysis in this area, we sought to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 studies published in 46 journals over a 13-year period. We document our attempts to procure those data (from online archives and by direct request to corresponding authors), and report results of analyses (using Bayesian logistic regression) to assess the impact of various factors on the success of our efforts. Overall, complete phylogenetic data for ~60% of these studies are effectively lost to science. Our study indicates that phylogenetic data are more likely to be deposited in online archives and/or shared upon request when: (1) the publishing journal has a strong data-sharing policy; (2) the publishing journal has a higher impact factor; and (3) the data are requested from faculty rather than students. Although the situation appears dire, our analyses suggest that it is far from hopeless: recent initiatives by the scientific community, including policy changes by journals and funding agencies, are improving the state of affairs.

Genomic variation in a widespread Neotropical bird (Xenops minutus) reveals divergence, population expansion, and gene flow

Michael G. Harvey, Robb T. Brumfield
(Submitted on 26 May 2014)

Elucidating the demographic and phylogeographic histories of species provides insight into the processes responsible for generating biological diversity, and genomic datasets are now permitting the estimation of histories and demographic parameters with unprecedented accuracy. We used a genomic single nucleotide polymorphism (SNP) dataset generated using a RAD-Seq method to investigate the historical demography and phylogeography of a widespread lowland Neotropical bird (Xenops minutus). As expected, we found that prominent landscape features that act as dispersal barriers, such as Amazonian rivers and the Andes Mountains, are associated with the deepest phylogeographic breaks, and also that isolation by distance is limited in areas between these barriers. In addition, we inferred positive population growth for most populations and detected evidence of historical gene flow between populations that are now physically isolated. Even with genomic estimates of historical demographic parameters, we found the prominent diversification hypotheses to be untestable. We conclude that investigations into the multifarious processes shaping species histories, aided by genomic datasets, will provide greater resolution of diversification in the Neotropics, but that future efforts should focus on understanding the processes shaping the histories of lineages rather than trying to reconcile these histories with landscape and climatic events in Earth history.

Human genomic regions with exceptionally high or low levels of population differentiation identified from 911 whole-genome sequences

Vincenza Colonna, Qasim Ayub, Yuan Chen, Luca Pagani, Pierre Luisi, Marc Pybus, Erik Garrison, Yali Xue, Chris Tyler-Smith

Background: Population differentiation has proved to be effective for identifying loci under geographically-localized positive selection, and has the potential to identify loci subject to balancing selection. We have previously investigated the pattern of genetic differentiation among human populations at 36.8 million genomic variants to identify sites in the genome showing high frequency differences. Here, we extend this dataset to include additional variants, survey sites with low levels of differentiation, and evaluate the extent to which highly differentiated sites are likely to result from selective or other processes. Results: We demonstrate that while sites of low differentiation represent sampling effects rather than balancing selection, sites showing extremely high population differentiation are enriched for positive selection events and that one half may be the result of classic selective sweeps. Among these, we rediscover known examples, where we actually identify the established functional SNP, and discover novel examples including the genes ABCA12, CALD1 and ZNF804, which we speculate may be linked to adaptations in skin, calcium metabolism and defense, respectively. Conclusions: We have identified known and many novel candidate regions for geographically restricted positive selection, and suggest several directions for further research.
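As a rough illustration of what scanning for sites with exceptionally high or low differentiation involves, the sketch below scores each variant by the absolute difference in allele frequency between two populations and picks out both extremes. The statistic, the thresholds, and all names here are assumptions made for illustration; the abstract does not specify the measure or cutoffs used in the study.

```python
def allele_frequency(alt_alleles, total_alleles):
    """Frequency of the alternate allele in one population sample."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    return alt_alleles / total_alleles


def frequency_difference(counts_a, counts_b):
    """Absolute allele-frequency difference between two populations.

    Each argument is an (alt_alleles, total_alleles) pair.
    """
    return abs(allele_frequency(*counts_a) - allele_frequency(*counts_b))


def split_by_differentiation(sites, low_threshold, high_threshold):
    """Return (low_sites, high_sites) according to hypothetical thresholds.

    `sites` maps a site identifier to a pair of per-population counts.
    """
    low_sites, high_sites = [], []
    for site_id, (counts_a, counts_b) in sites.items():
        delta = frequency_difference(counts_a, counts_b)
        if delta <= low_threshold:
            low_sites.append(site_id)
        elif delta >= high_threshold:
            high_sites.append(site_id)
    return low_sites, high_sites


# Toy data: made-up counts for two populations at two sites.
example_sites = {
    "site_1": ((90, 100), (10, 100)),  # strongly differentiated
    "site_2": ((50, 100), (51, 100)),  # nearly identical frequencies
}
low, high = split_by_differentiation(example_sites, 0.05, 0.7)
```

A score like this only flags candidates. Deciding whether a flagged site reflects selection or sampling effects requires the kind of follow-up evaluation the abstract describes.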

Powerful tests for multi-marker association analysis using ensemble learning

Badri Padhukasahasram

Multi-marker approaches are attracting growing interest in genome-wide association studies and can enhance power to detect new associations under certain conditions. Gene- and pathway-based association tests are increasingly viewed as useful complements to the more widely used single-marker association analyses, which have successfully uncovered numerous disease variants. A major drawback of single-marker methods is that they do not consider pairwise and higher-order interactions between variants. Here, we describe multivariate methods for gene- and pathway-based association analyses using phenotype predictions based on machine learning algorithms. Instead of utilizing only a linear or logistic regression model, we propose the use of ensembles of diverse machine learning algorithms for testing multivariate associations. As the true mathematical relationship between a phenotype and any group of genetic and clinical variables is unknown in advance and may be complex, such a strategy gives us a general and flexible framework to approximate this relationship across different sets of SNPs. We show how phenotype prediction based on our method can be used for constructing tests for SNP set association analysis. We first apply our method to simulated datasets to demonstrate its power and correctness. Then, we apply our method to previously studied asthma-related genes in two independent asthma cohorts to conduct association tests.