CONCOCT: Clustering cONtigs on COverage and ComposiTion
Johannes Alneberg, Brynjar Smari Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z. Ijaz, Nicholas J. Loman, Anders F. Andersson, Christopher Quince
(Submitted on 14 Dec 2013)
Metagenomics enables the reconstruction of microbial genomes in complex microbial communities without the need for culturing. Since assembly typically results in fragmented genomes, the grouping of genome fragments (contigs) belonging to the same genome, a process referred to as binning, remains a major informatics challenge. Here we present CONCOCT, a computer program that combines three types of information – sequence composition, coverage across multiple samples, and read-pair linkage – to automatically bin contigs into genomes. We demonstrate high recall and precision rates of the program on artificial as well as real human gut metagenome datasets.
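To make the idea of combining composition and coverage concrete, here is a minimal sketch of how a per-contig feature vector could be built. It is not CONCOCT's actual code: the function names, the choice of k = 4, the pseudocount, and the log transform of coverage are all assumptions made for illustration, and read-pair linkage is left out entirely.

```python
from itertools import product
import math


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies for one contig (hypothetical helper).

    A pseudocount of 1 per k-mer is an assumption; it keeps short contigs
    from producing zero frequencies.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 1)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total for m in kmers]


def contig_features(seq, coverages, k=4):
    """Concatenate composition with log-scaled per-sample coverage."""
    composition = kmer_profile(seq, k)
    coverage = [math.log(c + 1.0) for c in coverages]
    return composition + coverage


# One contig observed in three samples (made-up numbers).
features = contig_features("ACGTACGTAC", [3.0, 0.0, 12.5])
print(len(features))
```

Vectors like this, one per contig, could then be passed to any clustering method; contigs from the same genome are expected to share both a composition signature and a coverage pattern across samples.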
Bayesian inference of infectious disease transmission from whole genome sequence data
Xavier Didelot, Jennifer Gardy, Caroline Colijn
Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered — how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via a Monte-Carlo Markov Chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.
FPCB : a simple and swift strategy for mirror repeat identification
Bhardwaj Vikash, Gupta Swapni, Meena Sitaram, Sharma Kulbhushan
(Submitted on 13 Dec 2013)
Following recent advances in sequencing, mirror repeats have been found in the gene sequences of many organisms. Their presence in most sequences points to an important functional role. However, a simple and quick strategy to search for these repeats in a given sequence is not available. In this manuscript we propose a simple and swift strategy, named FPCB, to identify mirror repeats in a given sequence. The strategy includes three simple steps: downloading the sequence in FASTA format (F), making its parallel complement (PC), and finally performing a homology search against the original sequence (B). At least twenty genes were analyzed using the proposed strategy, and mirror repeats of varying number and type were observed. We have also tried to give nomenclature to these repeats. We hope that the proposed FPCB strategy will be helpful for the identification of mirror repeats in DNA or mRNA sequences. The strategy may also help in unraveling the functional role of mirror repeats in various processes, including evolution.
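The reason the parallel complement is useful can be shown in a few lines: taking the reverse complement of the parallel complement gives the original sequence read backwards, so a minus-strand homology hit against the original marks a stretch that matches its own reverse. The sketch below is an illustration only. It swaps the homology search step for a direct scan that expands around each centre, it reports only even-length repeats with no spacer, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def parallel_complement(seq):
    """Complement each base without reversing the sequence."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complement each base and reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def find_mirror_repeats(seq, min_len=6):
    """Return (start, end, repeat) for maximal even-length mirror repeats.

    A mirror repeat here is a stretch that reads the same forwards and
    backwards on one strand. This direct scan stands in for the homology
    search of the FPCB strategy.
    """
    seq = seq.upper()
    hits = []
    for center in range(1, len(seq)):
        left, right = center - 1, center
        while left >= 0 and right < len(seq) and seq[left] == seq[right]:
            left -= 1
            right += 1
        start, end = left + 1, right
        if end - start >= min_len:
            hits.append((start, end, seq[start:end]))
    return hits


seq = "GGAGTCCTGATT"
# The reverse complement of the parallel complement is the plain reverse.
print(reverse_complement(parallel_complement(seq)) == seq[::-1])  # True
print(find_mirror_repeats(seq))  # [(2, 10, 'AGTCCTGA')]
```

Coordinates are 0-based and half-open. A real search over whole genes would also need to tolerate mismatches and spacers, which is what a homology search tool provides.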
The following post is by Joe Pickrell [@joe_pickrell] on his preprint Joint analysis of functional genomic data and genome-wide association studies of 18 human traits, available on bioRxiv here.
Until recently, the field of human genetics struggled to identify genetic variants that influence complex traits and diseases like height or diabetes. With the arrival of genome-wide association studies (GWAS), studies now regularly identify tens to hundreds of genomic regions that contain such variants. The question going forward is clear: how do these variants influence traits?
One way to answer this question involves annotating variants according to their potential functions–does a given variant change the sequence of a protein? Or does it disrupt the splicing of a gene? Or does it fall in a regulatory region in an important cell type? Many groups (like those that are part of the ENCODE project) are generating hundreds of datasets that are potentially informative about these types of questions. But which of these hundreds of datasets are relevant when studying a given trait?
In this preprint, I develop a statistical method (an empirical Bayes hierarchical model) that takes summary statistics from a genome-wide association study of a given trait and identifies the types of genomic annotation that are relevant for the trait; software implementing this method is available here. I then applied this method to a set of 18 traits and 450 genomic annotations. Feedback on the method itself is of course welcome, but I’d also like to highlight what I think are the most interesting biological results:
The relative importance of protein-coding versus regulatory variants varies across traits. The fraction of GWAS hits driven by changes in protein sequence depends on the trait, ranging from a low of around 2% up to around 20% (see above).
Repressed chromatin is depleted for loci that influence traits. I was surprised to find that the most informative type of information for interpreting a GWAS is often repressed chromatin, which is depleted for loci influencing traits. This type of chromatin covers up to 70% of the genome.
Cell type-specific DNase-I hypersensitive sites are enriched for loci that influence traits. A “hypothesis-free” scan across all regulatory regions in many tissues can identify unexpected connections between traits and tissues. For example, loci that influence bone density are enriched in gene regulatory regions in muscle tissue, and loci that influence Crohn’s disease are enriched in regulatory regions in fibroblasts.
Incorporating functional information into a GWAS increases power to detect loci. Finally, re-weighting a GWAS using this method increases the number of loci identified in each GWAS by around 5%; many of the loci identified with this method have been replicated in larger studies.
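As a rough illustration of what re-weighting a GWAS with functional information can mean, the sketch below turns a Z-score into an approximate Bayes factor and combines it with a prior that depends on a region's annotations. This is not the preprint's software. The normal-prior Bayes factor, the logistic form of the prior, and every parameter value (`W`, `intercept`, `effects`) are assumptions chosen for the example; in an empirical Bayes setting the enrichment effects would be estimated from the data rather than fixed by hand.

```python
import math


def bayes_factor(z, V, W=0.1):
    """Approximate Bayes factor for association versus no association.

    Assumes the effect estimate is N(0, V) under the null and N(0, V + W)
    under the alternative, where V is the squared standard error and W is
    an assumed prior variance on the true effect.
    """
    shrink = W / (V + W)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z * z * shrink)


def prior_prob(annotations, effects, intercept):
    """Annotation-dependent prior: logistic in the sum of enrichment effects."""
    linear = intercept + sum(e for a, e in zip(annotations, effects) if a)
    return 1.0 / (1.0 + math.exp(-linear))


def posterior_prob(z, V, annotations, effects, intercept):
    """Posterior probability of association after re-weighting by annotation."""
    pi = prior_prob(annotations, effects, intercept)
    bf = bayes_factor(z, V)
    return pi * bf / (pi * bf + (1.0 - pi))


# Same association signal, with and without an enriched annotation.
effects = [1.5]      # hypothetical log-enrichment for one annotation
intercept = -6.0     # hypothetical baseline log-odds of association
print(posterior_prob(4.0, 0.01, [1], effects, intercept))
print(posterior_prob(4.0, 0.01, [0], effects, intercept))
```

With the same Z-score, a variant in an enriched annotation gets a higher posterior than one outside it, which is how borderline signals in relevant regulatory regions can be promoted and signals in depleted regions (such as repressed chromatin) demoted.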
I’m hopeful that this method (and others like it) will be useful in making the transition from identifying statistical associations in a GWAS to understanding the underlying biology; comments and criticisms are welcome.
Extensive Phenotypic Changes Associated with Large-scale Horizontal Gene Transfer
Kevin Dougherty, Brian A Smith, Autum F Moore, Shannon Maitland, Chris Fanger, Rachel Murillo, David A Baltrus
Horizontal gene transfer often leads to phenotypic changes within recipient organisms independent of any immediate evolutionary benefits. While secondary phenotypic effects of horizontal transfer (e.g., changes in growth rates) have been demonstrated and studied across a variety of systems using relatively small plasmids and phage, little is known about how the size of the acquired region affects the magnitude or number of such costs. Here we describe an amazing breadth of phenotypic changes which occur after a large-scale horizontal transfer event (~1Mb megaplasmid) within Pseudomonas stutzeri, including sensitization to various stresses as well as changes in bacterial behavior. These results highlight the power of horizontal transfer to shift pleiotropic relationships and cellular networks within bacterial genomes. They also provide an important context for how secondary effects of transfer can bias evolutionary trajectories and interactions between species. Lastly, these results and system provide a foundation to investigate evolutionary consequences in real time as newly acquired regions are ameliorated and integrated into new genomic contexts.
Robustly detecting differential expression in RNA sequencing data using observation weights
Xiaobei Zhou, Helen Lindsay, Mark D. Robinson
(Submitted on 12 Dec 2013)
A popular approach for comparing gene expression levels between (replicated) conditions of RNA sequencing data relies on counting reads that map to features of interest. Within such count-based methods, many flexible and advanced statistical approaches now exist and offer the ability to adjust for covariates (e.g., batch effects). Often, these methods include some sort of "sharing of information" across features to improve inferences in small samples. It is important to achieve an appropriate tradeoff between statistical power and protection against outliers. Here, we study the robustness of existing approaches for count-based differential expression analysis and propose a new strategy based on observation weights that can be used within existing frameworks. The results suggest that outliers can have a global effect on differential analyses. We demonstrate the effectiveness of our new approach with real data and simulated data that reflect properties of real datasets (e.g., dispersion-mean trend) and develop an extensible framework for comprehensive testing of current and future methods. In addition, we explore the origin of such outliers, in some cases highlighting additional biological or technical factors within the experiment. Further details can be downloaded from the project website: this http URL
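The abstract does not say how the observation weights are computed, so the following is only one plausible sketch: Huber-type weights applied to Pearson residuals under a negative binomial mean-variance relationship. The function names, the tuning constant `k`, and the example counts are assumptions. A full method would re-estimate the mean and dispersion using the weights and iterate.

```python
import math


def nb_variance(mu, dispersion):
    """Negative binomial variance: mu + dispersion * mu^2."""
    return mu + dispersion * mu * mu


def pearson_residual(count, mu, dispersion):
    """Scaled distance of an observed count from its fitted mean."""
    return (count - mu) / math.sqrt(nb_variance(mu, dispersion))


def huber_weight(residual, k=1.345):
    """Weight 1 inside [-k, k]; shrink as k / |residual| outside."""
    size = abs(residual)
    return 1.0 if size <= k else k / size


def observation_weights(counts, mu, dispersion, k=1.345):
    """One weight per observation for a single gene in a single condition."""
    return [huber_weight(pearson_residual(y, mu, dispersion), k) for y in counts]


# Four replicates of one gene; the last count is an outlier (made-up data).
weights = observation_weights([95, 102, 110, 900], mu=100.0, dispersion=0.01)
print(weights)
```

Typical counts keep a weight of 1, while the extreme replicate is strongly down-weighted, so it contributes little to the dispersion estimate and the test statistic without being discarded outright.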
A Tale of Two Hypotheses: Genetics and the Ethnogenesis of Ashkenazi Jewry
The debate over the ethnogenesis of Ashkenazi Jewry is longstanding, and has been hampered by a lack of Jewish historiographical work between the Biblical and the early Modern eras. Most historians, as well as geneticists, situate them as the descendants of Israelite tribes whose presence in Europe is owed to deportations during the Roman conquest of Palestine, as well as migration from Babylonia, and eventual settlement along the Rhine. By contrast, a few historians and other writers, most famously Arthur Koestler, have looked to migrations following the decline of the little-understood Medieval Jewish kingdom of Khazaria as the main source for Ashkenazi Jewry. A recent study of genetic variation in southeastern European populations (Elhaik 2012) also proposed a Khazarian origin for Ashkenazi Jews, eliciting considerable criticism from other scholars investigating Jewish ancestry who favor a Near Eastern origin of Ashkenazi populations. This paper re-examines the genetic data and analytical approaches used in these studies of Jewish ancestry, and situates them in the context of historical, linguistic, and archaeological evidence from the Caucasus, Europe and the Near East. Based on this reanalysis, it appears not only that the Khazar Hypothesis per se is without serious merit, but also that the veracity of the Rhineland Hypothesis may be questionable.
Genomic architecture of human neuroanatomical diversity
Roberto Toro, Jean-Baptiste Poline, Guillaume Huguet, Eva Loth, Vincent Frouin, Tobias Banaschewski, Gareth J Barker, Arun Bokde, Christian Büchel, Fabiana Carvalho, Patricia Conrod, Mira Fauth-Bühler, Herta Flor, Jürgen Gallinat, Hugh Garavan, Penny Gowland, Andreas Heinz, Bernd Ittermann, Claire Lawrence, Hervé Lemaître, Karl Mann, Frauke Nees, Tomáš Paus, Zdenka Pausova, Marcella Rietschel, Trevor Robbins, Michael Smolka, Andreas Ströhle, Gunter Schumann, Thomas Bourgeron
Human brain anatomy is strikingly diverse and highly heritable: genetic factors may explain up to 80% of its variability. Prior studies have tried to detect genetic variants with a large effect on neuroanatomical diversity, but those currently identified account for <5% of the variance. Here we show, based on our analyses of neuroimaging and whole-genome genotyping data from 1,765 subjects, that up to 54% of this heritability is captured by large numbers of single nucleotide polymorphisms of small effect spread throughout the genome, especially within genes and close regulatory regions. The genetic bases of neuroanatomical diversity appear to be relatively independent of those of body size (height), but shared with those of verbal intelligence scores. The study of this genomic architecture should help us better understand brain evolution and disease.
Sex-biased expression of microRNAs in Drosophila melanogaster
(Submitted on 11 Dec 2013)
Most animals have separate sexes. The differential expression of gene products, in particular that of gene regulators, underlies sexual dimorphism. Analyses of sex-biased expression have focused mostly on protein-coding genes. Several lines of evidence indicate that microRNAs, a class of major gene regulators, are likely to have a significant role in sexual dimorphism. This role has not been systematically explored so far. Here I study the sex-biased expression pattern of microRNAs in the model species Drosophila melanogaster. As with protein-coding genes, sex-biased microRNAs are associated with the reproductive function. Strikingly, contrary to protein-coding genes, male-biased microRNAs are enriched in the X chromosome whilst female-biased microRNAs are mostly autosomal. I propose that the chromosomal distribution is a consequence of high rates of de novo emergence, and a preference of new microRNAs to be expressed in the testis. I also suggest that demasculinization of the X chromosome may not affect microRNAs. Interestingly, female-biased microRNAs are often encoded within protein-coding genes that are also expressed in females. These results strongly suggest that the sex-biased expression of microRNAs is mainly a consequence of high rates of microRNA emergence in the X (male bias) or hitch-hiked expression by host genes (female bias).
Evolution of female choice and age-dependent male traits with paternal germ-line mutation
Joel James Adamson
(Submitted on 11 Dec 2013)
Several studies question the adaptive value of female preferences for older males. Theory and evidence show that older males carry more deleterious mutations in their sperm than younger males carry. These mutations are not visible to females choosing mates. Germ-line mutations could oppose preferences for “good genes.” Choosy females run the risk that offspring of older males will be no more attractive or healthy than offspring of younger males. Germ-line mutations could pose a particular problem when females can only judge male trait size, rather than assessing age directly. I ask whether or not females will prefer extreme traits, despite reduced offspring survival due to age-dependent mutation. I use a quantitative genetic model to examine the evolution of female preferences, an age-dependent male trait, and overall health (“condition”). My dynamical equation includes mutation bias that depends on the generation time of the population. I focus on the case where females form preferences for older males because male trait size depends on male age. My findings agree with good genes theory. Females at equilibrium always select above-average males. The trait size preferred by females directly correlates with the direct costs of the preference. Direct costs can accentuate the equilibrium preference at a higher rate than mutational parameters. Females can always offset direct costs by mating with older, more ornamented males. Age-dependent mutation in condition maintains genetic variation in condition and thereby maintains the selective value of female preferences. Rather than eliminating female preferences, germ-line mutations provide an essential ingredient in sexual selection.