A robust statistical framework for reconstructing genomes from metagenomic data

Dongwan Don Kang, Jeff Froula, Rob Egan, Zhong Wang
doi: http://dx.doi.org/10.1101/011460

We present software that reconstructs genomes from shotgun metagenomic sequences using a reference-independent approach. This method permits the identification of OTUs in large, complex communities where many species are unknown. Binning reduces the complexity of a metagenomic dataset, enabling many downstream analyses that were previously unavailable. In this study we developed MetaBAT, a robust statistical framework that integrates probabilistic distances of genome abundance with sequence composition for automatic binning. Applying MetaBAT to a human gut microbiome dataset identified 173 highly specific genome bins, including many representing previously unidentified species.
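
To make the idea of combining sequence composition with abundance concrete, here is a minimal sketch in Python. It is not MetaBAT's actual statistical model: the tetranucleotide profile, the Euclidean distances, the equal weighting, and all names (kmer_profile, combined_distance, the "seq" and "cov" fields) are assumptions chosen for illustration only.

```python
from itertools import product
from math import sqrt

K = 4  # tetranucleotides; an assumed choice for this illustration
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return normalized k-mer frequencies for one contig.

    Windows containing characters other than A, C, G, T are skipped.
    """
    counts = dict.fromkeys(KMERS, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in KMERS]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def combined_distance(contig_a, contig_b, weight=0.5):
    """Weighted sum of a composition distance and an abundance distance.

    Each contig is a dict with a "seq" string and a "cov" list holding
    per-sample coverage (hypothetical field names). The weighting is an
    assumption, not the published method.
    """
    composition = euclidean(kmer_profile(contig_a["seq"]),
                            kmer_profile(contig_b["seq"]))
    abundance = euclidean(contig_a["cov"], contig_b["cov"])
    return weight * composition + (1 - weight) * abundance


# Hypothetical contigs with coverage measured in two samples.
a = {"seq": "ACGTACGTACGTTTGA", "cov": [10.0, 2.0]}
b = {"seq": "ACGTACGTACGTTTGA", "cov": [10.0, 2.0]}
c = {"seq": "GGGGCCCCGGGGCCCC", "cov": [1.0, 30.0]}
print(combined_distance(a, b), combined_distance(a, c))
```

Contigs with small pairwise distances would be candidates for the same bin; a real binner replaces these ad hoc distances with calibrated probabilities and a clustering step.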

Bayesian analyses of Yemeni mitochondrial genomes suggest multiple migration events with Africa and Western Eurasia

Deven Nikunj Vyas, Andrew Kitchen, Aida Teresa Miró-Herrans, Laurel Nichole Pearson, Ali Al-Meeri, Connie Jo Mulligan
doi: http://dx.doi.org/10.1101/010629

Anatomically modern humans (AMHs) left Africa ~60,000 years ago, marking the first of multiple dispersal events by AMHs between Africa and the Arabian Peninsula. The southern dispersal route (SDR) out of Africa (OOA) posits that early AMHs crossed the Bab el-Mandeb strait from the Horn of Africa into what is now Yemen and followed the coast of the Indian Ocean into eastern Eurasia. If AMHs followed the SDR and left modern descendants in situ, Yemeni populations should retain old autochthonous mitogenome lineages. Alternatively, if AMHs did not follow the SDR or did not leave modern descendants in the region, only young autochthonous lineages will remain as evidence of more recent dispersals. We sequenced 113 whole mitogenomes from multiple Yemeni regions with a focus on haplogroups M, N, and L3(xM,N) as they are considered markers of the initial OOA migrations. We performed Bayesian evolutionary analyses to generate time-measured phylogenies calibrated by Neanderthal and Denisovan mitogenome sequences in order to determine the age of Yemeni-specific clades in our dataset. Our results indicate that the M1, N1, and L3(xM,N) sequences in Yemen are the product of recent migration from Africa and western Eurasia. Although these data suggest that modern Yemeni mitogenomes are not markers of the original OOA migrants, we hypothesize that recent population dynamics may obscure any genetic signature of an ancient SDR migration.

A general condition for adaptive genetic polymorphism in temporally and spatially heterogeneous environments

Hannes Svardal, Claus Rueffler, Joachim Hermisson
Comments: Accepted for publication in Theoretical Population Biology
Subjects: Populations and Evolution (q-bio.PE)

Both evolution and ecology have long been concerned with the impact of variable environmental conditions on observed levels of genetic diversity within and between species. We model the evolution of a quantitative trait under selection that fluctuates in space and time, and derive an analytical condition for when these fluctuations promote genetic diversification. As an ecological scenario, we use a generalized island model with soft selection within patches in which we incorporate generation overlap. We allow for arbitrary fluctuations in the environment, including spatio-temporal correlations and any functional form of selection on the trait. Using the concepts of invasion fitness and evolutionary branching, we derive a simple and transparent condition for the adaptive evolution and maintenance of genetic diversity. This condition relates the strength of selection within patches to expectations and variances in the environmental conditions across space and time. Our results unify, clarify, and extend a number of previous results on the evolution and maintenance of genetic variation under fluctuating selection. Individual-based simulations show that our results are independent of the details of the genetic architecture and of whether reproduction is clonal or sexual. The onset of increased genetic variance is also predicted accurately in small populations in which alleles can go extinct due to environmental stochasticity.

The developmental transcriptome of contrasting Arctic charr (Salvelinus alpinus) morphs

Jóhannes Gudbrandsson, Ehsan P Ahi, Kalina H Kapralova, Sigrídur R Franzdottir, Bjarni K Kristjánsson, Sophie S Steinhaeuser, Ísak M Jóhannesson, Valerie H Maier, Sigurdur S Snorrason, Zophonías O Jónsson, Arnar Pálsson
doi: http://dx.doi.org/10.1101/011361

Species showing repeated evolution of similar traits can help illuminate the molecular and developmental basis of diverging traits and specific adaptations. Following the last glacial period, dwarfism and specialized bottom feeding morphology evolved rapidly in several landlocked Arctic charr (Salvelinus alpinus) populations in Iceland. In order to study the genetic divergence between small benthic morphs and larger morphs with limnetic morphotype, we conducted an RNA-seq transcriptome analysis of developing charr. We sequenced mRNA from whole embryos at four stages in early development of two stocks with very different morphologies, the small benthic (SB) charr from Lake Thingvallavatn and Holar aquaculture (AC) charr. The data reveal significant differences in expression of several biological pathways during charr development. There are also differences between SB- and AC-charr in mitochondrial genes involved in energy metabolism and in blood coagulation genes. We confirmed expression differences for five genes in whole embryos with qPCR, including lysozyme and natterin, which was previously identified as a fish toxin of a lectin family and may be an immunopeptide. We verified differential expression of 7 genes in developing heads, and the expression was consistently associated with benthic vs. limnetic charr (studied in 4 morphs in total). Comparison of single nucleotide polymorphism (SNP) frequencies reveals extensive genetic differentiation between the SB- and AC-charr (60 fixed SNPs and around 1300 differing more than 50% in frequency). In SB-charr the high frequency derived SNPs are in genes related to translation and oxidative processes. Curiously, several derived SNPs reside in the 12S and 16S mitochondrial ribosomal RNA genes, including a base highly conserved among fishes. The data implicate multiple genes and molecular pathways in divergence of small benthic charr and/or the response of aquaculture charr to domestication. Functional, genetic and population genetic studies on more freshwater and anadromous populations are needed to confirm the specific loci and mutations relating to specific ecological or domestication traits in Arctic charr.

A Composite Genome Approach to Identify Phylogenetically Informative Data from Next-Generation Sequencing

Rachel S. Schwartz, Kelly Harkins, Anne C. Stone, Reed A. Cartwright
(Submitted on 16 May 2013 (v1), last revised 12 Nov 2014 (this version, v3))

We have developed a novel method to rapidly obtain homologous genomic data for phylogenetics directly from next-generation sequencing reads without the use of a reference genome. This software, called SISRS, avoids the time-consuming steps of de novo whole genome assembly, genome-genome alignment, and annotation. In simulations, SISRS is able to identify large numbers of loci containing variable sites with phylogenetic signal. For genomic data from apes, SISRS identified thousands of variable sites, from which we produced an accurate phylogeny. Finally, we used SISRS to identify phylogenetic markers that we used to estimate the phylogeny of placental mammals. We recovered phylogenies from multiple datasets that were consistent with previous conflicting estimates of the relationships among mammals. SISRS is open source and freely available at this https URL

What to compare and how: comparative transcriptomics for Evo-Devo

Julien Roux, Marta Rosikiewicz, Marc Robinson-Rechavi
doi: http://dx.doi.org/10.1101/011213

Evolutionary developmental biology has grown historically from the capacity to relate patterns of evolution in anatomy to patterns of evolution of expression of specific genes, whether between very distantly related species, or very closely related species or populations. Scaling up such studies by taking advantage of modern transcriptomics brings promising improvements, allowing us to estimate the overall impact and molecular mechanisms of convergence, constraint or innovation in anatomy and development. But it also presents major challenges, including the computational definitions of anatomical homology and of organ function, the criteria for the comparison of developmental stages, the annotation of transcriptomics data to proper anatomical and developmental terms, and the statistical methods to compare transcriptomic data between species to highlight significant conservation or changes. In this article, we review these challenges, and the ongoing efforts to address them, which are emerging from bioinformatics work on ontologies, evolutionary statistics, and data curation, with a focus on their implementation in the context of the development of our database Bgee (http://bgee.org).

Tools and Methods from the Anopheles 16 Genome Project

Aaron Steele, Michael C. Fontaine, Andres Martin, Scott J Emrich
doi: http://dx.doi.org/10.1101/011205

The dramatic reduction in sequencing costs has resulted in many initiatives to sequence certain organisms and populations. These initiatives aim not only to sequence and assemble genomes but also to perform a broader analysis of the population structure. As part of the Anopheline Genome Consortium, which has a vested interest in studying anopheline mosquitoes, we developed novel methods and tools to further the community's goals. We provide a brief description of these methods and tools and assess the contributions that each offers to the broader study of comparative genomics.

Recombination and peak jumping

Kristina Crona
(Submitted on 7 Nov 2014)

We find an advantage of recombination for a category of complex fitness landscapes. Recent studies of empirical fitness landscapes reveal complex gene interactions and multiple peaks, and recombination can be a powerful mechanism for escaping suboptimal peaks. However, classical work on recombination largely ignores the effect of complex gene interactions. The advantage we find has no correspondence for 2-locus systems or for smooth landscapes. The effect is sometimes extreme, in the sense that shutting off recombination could cause the organism to fail to adapt. A standard question about recombination is whether the mechanism tends to accelerate or decelerate adaptation. However, we argue that extreme effects may be more important than how the majority falls.

Network Methods for Pathway Analysis of Genomic Data (Review)

Rosemary Braun, Sahil Shah
(Submitted on 7 Nov 2014)

Rapid advances in high-throughput technologies have led to considerable interest in analyzing genome-scale data in the context of biological pathways, with the goal of identifying functional systems that are involved in a given phenotype. In the most common approaches, biological pathways are modeled as simple sets of genes, neglecting the network of interactions comprising the pathway and treating all genes as equally important to the pathway’s function. Recently, a number of new methods have been proposed to integrate pathway topology in the analyses, harnessing existing knowledge and enabling more nuanced models of complex biological systems. However, there is little guidance available to researchers choosing between these methods. In this review, we discuss eight topology-based methods, comparing their methodological approaches and appropriate use cases. In addition, we present the results of the application of these methods to a curated set of ten gene expression profiling studies using a common set of pathway annotations. We report the computational efficiency of the methods and the consistency of the results across methods and studies to help guide users in choosing a method. We also discuss the challenges and future outlook for improved network analysis methodologies.
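
The contrast between treating a pathway as a plain gene set and taking its topology into account can be illustrated with a toy Python sketch. It does not implement any of the eight reviewed methods: degree weighting is just one simple, assumed way to let network structure influence a pathway score, and the gene names, scores, and function names are hypothetical.

```python
def set_score(scores, pathway_genes):
    """Gene-set view: every gene contributes equally (simple mean)."""
    vals = [scores[g] for g in pathway_genes if g in scores]
    return sum(vals) / len(vals) if vals else 0.0


def topology_score(scores, edges):
    """Topology-aware view: weight each gene by its number of interactions.

    `edges` is a list of undirected (gene, gene) pairs. Degree weighting
    is an assumption made for illustration, not a published method.
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    weighted = [(degree[g], scores[g]) for g in degree if g in scores]
    total = sum(w for w, _ in weighted)
    return sum(w * s for w, s in weighted) / total if total else 0.0


# Hypothetical pathway: hub gene "A" interacts with "B", "C" and "D".
edges = [("A", "B"), ("A", "C"), ("A", "D")]
# Hypothetical per-gene differential expression scores.
scores = {"A": 2.0, "B": 0.0, "C": 0.0, "D": 0.0}

print(set_score(scores, ["A", "B", "C", "D"]))  # mean over four genes
print(topology_score(scores, edges))            # hub gene counts three times
```

In this toy case the change is concentrated in the hub gene, so the degree-weighted score is twice the plain mean; the same data would look less striking to a method that ignores the interactions.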

A multivariate genome-wide association analysis of 10 LDL subfractions, and their response to statin treatment, in 1868 Caucasians

Heejung Shim, Daniel I Chasman, Joshua D Smith, Samia Mora, Paul M Ridker, Deborah A Nickerson, Ronald M Krauss, Matthew Stephens
doi: http://dx.doi.org/10.1101/011270

We conducted a genome-wide association analysis of 7 subfractions of low density lipoproteins (LDLs) and 3 subfractions of intermediate density lipoproteins (IDLs) measured by gradient gel electrophoresis, and their response to statin treatment, in 1868 individuals of European ancestry from the Pharmacogenomics and Risk of Cardiovascular Disease study. Our analyses identified four previously implicated loci (SORT1, APOE, LPA, and CETP) as containing variants that are very strongly associated with lipoprotein subfractions (log10 Bayes Factor > 15). Subsequent conditional analyses suggest that three of these (APOE, LPA and CETP) likely harbor multiple independently associated SNPs. Further, while different variants typically showed different characteristic patterns of association with combinations of subfractions, the two SNPs in CETP show strikingly similar patterns, both in our original data and in a replication cohort, consistent with a common underlying molecular mechanism. Notably, the CETP variants are very strongly associated with LDL subfractions, despite showing no association with total LDLs in our study, illustrating the potential value of the more detailed phenotypic measurements. In contrast with these strong subfraction associations, genetic association analysis of subfraction response to statins showed much weaker signals (none exceeding log10 Bayes Factor of 6). However, two SNPs (in APOE and LPA) previously reported to be associated with LDL statin response do show some modest evidence for association in our data, and the subfraction response profiles at the LPA SNP are consistent with the LPA association, with response likely being due primarily to resistance of Lp(a) particles to statin therapy. An additional important feature of our analysis is that, unlike most previous analyses of multiple related phenotypes, we analyzed the subfractions jointly, rather than one at a time. Comparisons of our multivariate analyses with standard univariate analyses demonstrate that multivariate analyses can substantially increase power to detect associations. Software implementing our multivariate analysis methods is available at http://stephenslab.uchicago.edu/software.html.