ReproPhylo: An Environment for Reproducible Phylogenomics

Amir Szitenberg, Max John, Mark L Blaxter, David H Lunt
doi: http://dx.doi.org/10.1101/019349

The reproducibility of experiments is key to the scientific process, and particularly necessary for accurate reporting of analyses in data-rich fields such as phylogenomics. We present ReproPhylo, a phylogenomic analysis environment developed to ensure experimental reproducibility, to facilitate the handling of large-scale data, and to assist methodological experimentation. Reproducibility and instantaneous repeatability are built into the ReproPhylo system and do not require user intervention or configuration, because the system stores the experimental workflow as a single, serialized Python object containing explicit provenance and environment information. This ‘single file’ approach ensures the persistence of provenance across iterations of the analysis, with changes automatically managed by the version control program Git. ReproPhylo produces an extensive human-readable report and generates a comprehensive experimental archive file, both of which are suitable for submission with publications. The system facilitates thorough experimental exploration of both parameters and data. ReproPhylo is a platform-independent, CC0-licensed Python module, and is easily installed as a Docker image, with a Jupyter GUI, or as a slimmer version in a Galaxy distribution.

FIQT: a simple, powerful method to accurately estimate effect sizes in genome scans

Tim B Bigdeli, Donghyung Lee, Brien P Riley, Vladimir I Vladimirov, Ayman H Fanous, Kenneth S Kendler, Silviu-Alin Bacanu
doi: http://dx.doi.org/10.1101/019299

Genome scans, including both genome-wide association studies and deep sequencing, continue to discover a growing number of significant association signals for various traits. However, variants meeting genome-wide significance criteria often explain far less of the overall trait variance than “sub-threshold” association signals. To extract these sub-threshold signals, there is a need for methods that accurately estimate the mean of all (normally-distributed) test statistics from a genome scan (i.e., Z-scores). This is currently achieved by the difficult procedures of adjusting all Z-score (χ_1^2) statistics for “winner’s curse” (multiple testing). Given that multiple testing adjustments are much simpler for p-values, we propose a method for estimating Z-score means by i) first adjusting their p-values for multiple testing and then ii) transforming the adjusted p-values to upper tail Z-scores with the sign of the original statistics. Because a False Discovery Rate (FDR) procedure is used for multiple testing adjustment, we denote this method FDR Inverse Quantile Transformation (FIQT). When compared to competitors, e.g. Empirical Bayes (including proposed improvements), FIQT is i) more accurate and ii) more computationally efficient by orders of magnitude. Its accuracy advantage is substantial at larger sample sizes and/or moderate numbers of association signals. Practical application of FIQT to Z-scores from the first Psychiatric Genetic Consortium (PGC) schizophrenia study predicts a non-trivial fraction of the significant signal regions from the subsequent published PGC schizophrenia studies. Finally, we suggest that FIQT might be i) used to improve subject level risk prediction and ii) further improved by modelling the noncentrality of χ_1^2 statistics.
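
The two steps described in this abstract (adjust the p-values, then map them back to signed Z-scores) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the FDR procedure, so the Benjamini-Hochberg adjustment is assumed here, p-values are taken to be two-sided, and the names `bh_adjust`, `fiqt` and `min_p` are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(p_values)
    # Walk from the largest p-value down, keeping a running minimum so
    # the adjusted values stay monotone in the original ordering.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = m - offset  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def fiqt(z_scores, min_p=1e-300):
    """Shrink Z-scores: two-sided p -> FDR adjustment -> signed Z."""
    # Step i: two-sided p-values, floored to avoid an exact zero.
    p_values = [max(2.0 * _STD_NORMAL.cdf(-abs(z)), min_p) for z in z_scores]
    adjusted = bh_adjust(p_values)
    # Step ii: upper-tail quantile of the adjusted p-value, carrying
    # the sign of the original statistic.
    return [
        math.copysign(-_STD_NORMAL.inv_cdf(p / 2.0), z)
        for p, z in zip(adjusted, z_scores)
    ]
```

For example, `fiqt([5.0, -4.0, 2.5, -1.0, 0.3])` returns values with the same signs as the inputs and magnitudes no larger than the originals, because an adjusted p-value is never smaller than the raw one.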

Roary: Rapid large-scale prokaryote pan genome analysis

Andrew J Page, Carla A Cummins, Martin Hunt, Vanessa K Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Jacqueline A Keane, Julian Parkhill
doi: http://dx.doi.org/10.1101/019315

A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and dispensable accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU, Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors.
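
The distinction between core and accessory genes can be illustrated with plain set operations. The sketch below is not Roary's algorithm or interface: it assumes genes have already been grouped into clusters, treats a cluster as core only if it occurs in every isolate (a strict definition chosen for illustration), and uses the hypothetical name `split_pan_genome`.

```python
def split_pan_genome(presence):
    """Split gene clusters into core and accessory sets.

    presence maps an isolate name to the set of gene cluster
    identifiers found in that isolate (hypothetical input format).
    """
    gene_sets = [set(genes) for genes in presence.values()]
    if not gene_sets:
        return set(), set()
    pan = set().union(*gene_sets)          # every cluster seen anywhere
    core = set.intersection(*gene_sets)    # clusters shared by all isolates
    accessory = pan - core                 # clusters missing from some isolate
    return core, accessory
```

With three isolates carrying `{"g1", "g2"}`, `{"g1", "g3"}` and `{"g1", "g2", "g4"}`, only `g1` is core and the remaining three clusters are accessory.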

Sequencing ultra-long DNA molecules with the Oxford Nanopore MinION

John M Urban, Jacob Bliss, Charles E Lawrence, Susan A Gerbi
doi: http://dx.doi.org/10.1101/019281

Oxford Nanopore Technologies’ nanopore sequencing device, the MinION, holds the promise of sequencing ultra-long DNA fragments >100kb. An obstacle to realizing this promise is delivering ultra-long DNA molecules to the nanopores. We present our progress in developing cost-effective ways to overcome this obstacle and our resulting MinION data, including multiple reads >100kb.

Ecological and evolutionary adaptations shape the gut microbiome of BaAka African rainforest hunter-gatherers

Andres Gomez, Klara Petrzelkova, Carl J Yeoman, Michael B Burns, Katherine R Amato, Klara Vlckova, David Modry, Angelique Todd, Carolyn A Jost Robbinson, Melissa Remis, Manolito Torralba, Karen E Nelson, Franck Carbonero, H Rex Gaskins, Brenda A Wilson, Rebecca M Stumpf, Bryan A White, Steven R Leigh, Ran Blekhman
doi: http://dx.doi.org/10.1101/019232

The gut microbiome provides access to otherwise unavailable metabolic and immune functions, likely affecting mammalian fitness and evolution. To investigate how this microbial ecosystem impacts evolutionary adaptation of humans to particular habitats, we explore the gut microbiome and metabolome of the BaAka rainforest hunter-gatherers from Central Africa. The data demonstrate that the BaAka harbor a colonic ecosystem dominated by Prevotellaceae and other taxa likely related to an increased capacity to metabolize plant structural polysaccharides, phenolics, and lipids. A comparative analysis shows that the BaAka gut microbiome shares similar patterns with that of the Hadza, another hunter-gatherer population from Tanzania. Nevertheless, the BaAka harbor significantly higher bacterial diversity and pathogen load compared to the Hadza, as well as to Western populations. We show that the traits unique to the BaAka microbiome and metabolome likely reflect adaptations to hunter-gatherer lifestyles and particular subsistence patterns. We hypothesize that the observed increase in microbial diversity and potential pathogenicity in the BaAka microbiome has been facilitated by evolutionary adaptations in immunity genes, resulting in a more tolerant immune system.

Bayesian Nonparametric Inference of Population Size Changes from Sequential Genealogies

Julia A Palacios, John Wakeley, Sohini Ramachandran
doi: http://dx.doi.org/10.1101/019216

Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Accurate methods are available for data from a single locus or from independent loci. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model which allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method’s credible intervals for population size as a function of time cover 90 percent of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with ARGweaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.
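
The two evaluation measures mentioned in this abstract, the sum of absolute differences between true and estimated values and the coverage of credible intervals, are simple to compute. The sketch below is an illustration only, assuming the population size trajectory has been evaluated on a shared grid of time points; the function names and the example numbers are hypothetical and do not come from the paper.

```python
def sum_absolute_error(truth, estimate):
    """Sum of absolute differences between true and estimated values."""
    return sum(abs(t - e) for t, e in zip(truth, estimate))


def interval_coverage(truth, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    inside = sum(1 for t, lo, hi in zip(truth, lower, upper) if lo <= t <= hi)
    return inside / len(truth)


# Made-up population sizes at four time points, for illustration only.
truth = [100, 200, 300, 400]
estimate = [110, 190, 300, 420]
lower = [90, 210, 250, 350]
upper = [120, 250, 350, 450]

error = sum_absolute_error(truth, estimate)        # 10 + 10 + 0 + 20 = 40
coverage = interval_coverage(truth, lower, upper)  # 3 of 4 covered = 0.75
```

A lower error total means point estimates closer to the truth, and a coverage near the nominal level of the credible interval indicates well-calibrated uncertainty.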

Near-optimal RNA-Seq quantification

Nicolas Bray, Harold Pimentel, Páll Melsted, Lior Pachter
Subjects: Quantitative Methods (q-bio.QM); Computational Engineering, Finance, and Science (cs.CE); Data Structures and Algorithms (cs.DS); Genomics (q-bio.GN)

We present a novel approach to RNA-Seq quantification that is near optimal in speed and accuracy. Software implementing the approach, called kallisto, can be used to analyze 30 million unaligned RNA-Seq reads in less than 5 minutes on a standard laptop computer while providing results as accurate as those of the best existing tools. This removes a major computational bottleneck in RNA-Seq analysis.

Integration of experiments across diverse environments identifies the genetic determinants of variation in Sorghum bicolor seed element composition

Nadia Shakoor, Greg Ziegler, Brian P Dilkes, Zachary Brenton, Richard Boyles, Erin L Connolly, Stephen Kresovich, Ivan Baxter

Seedling establishment and seed nutritional quality require the sequestration of sufficient mineral nutrients. Identification of genes and alleles that modify element content in the grains of cereals, including Sorghum bicolor, is fundamental to developing breeding and selection methods aimed at increasing bioavailable mineral content and improving crop growth. We have developed a high-throughput workflow for the simultaneous measurement of multiple elements in Sorghum seeds. We measured seed element levels in the genotyped Sorghum Association Panel (SAP), representing all major cultivated sorghum races from diverse geographic and climatic regions, and mapped alleles contributing to seed element variation across three environments by genome-wide association. We observed significant phenotypic and genetic correlation between several elements across multiple years and diverse environments. The power of combining high-precision measurements with genome-wide association was demonstrated by implementing rank transformation and a multilocus mixed model (MLMM) to map alleles controlling 20 element traits, identifying 255 loci affecting the sorghum seed ionome. Sequence similarity to genes characterized in previous studies identified likely causative genes for the accumulation of zinc (Zn), manganese (Mn), nickel (Ni), calcium (Ca) and cadmium (Cd) in sorghum seed. In addition to strong candidates for these elements, we provide a list of candidate loci for several other elements. Our approach enabled identification of SNPs in strong LD with causative polymorphisms that can be used directly in plant breeding and improvement.

Coalescent times and patterns of genetic diversity in species with facultative sex: effects of gene conversion, population structure and heterogeneity

Matthew Hartfield, Stephen I. Wright, Aneil F. Agrawal

Many diploid organisms undergo facultative sexual reproduction. However, little is currently known concerning the distribution of neutral genetic variation amongst facultative sexuals except in very simple cases. Understanding this distribution is important when making inferences about rates of sexual reproduction, effective population size and demographic history. Here, we extend coalescent theory in diploids with facultative sex to consider gene conversion, selfing, population subdivision, and temporal and spatial heterogeneity in rates of sex. In addition to analytical results for two-sample coalescent times, we outline a coalescent algorithm that accommodates the complexities arising from partial sex; this algorithm can be used to generate multi-sample coalescent distributions. A key result is that when sex is rare, gene conversion becomes a significant force in reducing diversity within individuals, which can remove genomic signatures of infrequent sex (the ‘Meselson Effect’) or entirely reverse the predictions. Our models offer improved methods for assessing the null model (i.e. neutrality) of patterns of molecular variation in facultative sexuals.

Bayesian Inference of Divergence Times and Feeding Evolution in Grey Mullets (Mugilidae)

Francesco Santini, Michael R. May, Giorgio Carnevale, Brian R. Moore
doi: http://dx.doi.org/10.1101/019075

Grey mullets (Mugilidae, Ovalentariae) are coastal fishes found in near-shore environments of tropical, subtropical, and temperate regions within marine, brackish, and freshwater habitats throughout the world. This group is noteworthy both for the highly conserved morphology of its members (which complicates species identification and delimitation) and for the uncommon herbivorous or detritivorous diet of most mullets. In this study, we first attempt to identify the number of mullet species, and then, for the resulting species, estimate a densely sampled time-calibrated phylogeny using three mitochondrial gene regions and three fossil calibrations. Our results identify two major subgroups of mullets that diverged in the Paleocene/Early Eocene, followed by an Eocene/Oligocene radiation across both tropical and subtropical habitats. We use this phylogeny to explore the evolution of feeding preference in mullets, which indicates multiple independent origins of both herbivorous and detritivorous diets within this group. We also explore correlations between feeding preference and other variables, including body size, habitat (marine, brackish, or freshwater), and geographic distribution (tropical, subtropical, or temperate). Our analyses reveal: (1) a positive correlation between trophic index and habitat (with herbivorous and/or detritivorous species predominantly occurring in marine habitats); (2) a negative correlation between trophic index and geographic distribution (with herbivorous species occurring predominantly in subtropical and temperate regions); and (3) a negative correlation between body size and geographic distribution (with larger species occurring predominantly in subtropical and temperate regions).