Author post: Dynamic DNA Processing: A Microcode Model of Cell Differentiation

The following guest post is by Barry Jacobson on his preprint “Dynamic DNA Processing: A Microcode Model of Cell Differentiation”, arXived here.

The paper suggests that DNA should be viewed as a processor that operates by means of base-pairing with remote regions of the genome. If one sequence matches another (or is complementary to it) it will set up a structural loop, or other interaction. However, the paper postulates that at least one region of the genome of every cell will have a unique clock sequence that is shared by no other cell. Therefore, the clock of one cell may not match the same distant sequences as the clock of another. Thus, the pattern of loops that is formed, and the overall 3-D DNA structure, may differ from cell to cell. This will either assist or hinder binding of transcription factors in one type of cell, as compared to another, thus providing a mechanism of differential gene expression.

We discuss a method by which these differing clock sequences could be generated in cell division, so that the daughters each end up with a unique identifier. The identifier then unlocks certain conformations only for those cell types for which it is relevant. SNPs may function in a similar manner, by modifying 3-D configurations, thus altering TF activity.

We further postulate that if a clock or target is errantly mutated, so that it matches a target farther away than was intended, it may stretch the chromosome to the breaking point, and this is the cause of chromosomal breakage or translocations in cancer.

Finally, we allow for the possibility that a cell can modify its clock in response to the environment, such as when healing from trauma, or accepting a graft, in which case it needs to coordinate with neighboring cells. We suggest that perhaps chemical analogs of cell surface proteins may occasionally mistrigger such a clock modification, when none is necessary, and thereby cause incorrect matches and conformations in that cell, which can damage DNA, and lead to cancer, as before.

We realize this is all purely speculative, but we mention that we originally submitted this model to Nature without success 16 years ago. Since then, a number of its assumptions have been verified, as detailed in the recent submission to arXiv, and we therefore believe it deserves a second look.

The availability of research data declines rapidly with article age

Timothy Vines, Arianne Albert, Rose Andrew, Florence Débarre, Dan Bock, Michelle Franklin, Kimberly Gilbert, Jean-Sébastien Moore, Sébastien Renaut, Diana J. Rennison
(Submitted on 19 Dec 2013)

Policies ensuring that research data are available on public archives are increasingly being implemented at the government [1], funding agency [2-4], and journal [5,6] level. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [7], and indeed many studies have found that authors are often unable or unwilling to share their data [8-11]. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested datasets from a relatively homogeneous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a dataset being extant fell by 17% per year. In addition, the odds that we could find a working email address for the first, last or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.
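To make the "17% per year" figure concrete, here is a minimal sketch of how a constant yearly decline in odds compounds over time. It assumes the decline is applied multiplicatively each year, which is the usual reading of an odds ratio from a logistic model; the function names and the starting odds in the usage note are hypothetical and are not taken from the paper.

```python
def remaining_odds_fraction(annual_decline, years):
    """Fraction of the starting odds left after `years`, assuming the odds
    shrink by `annual_decline` (e.g. 0.17) multiplicatively every year."""
    return (1.0 - annual_decline) ** years


def odds_to_probability(odds):
    """Convert odds (p / (1 - p)) back to a probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical starting odds of 4:1 (probability 0.8); not from the paper.
    start_odds = 4.0
    for years in (0, 5, 10, 20):
        odds = start_odds * remaining_odds_fraction(0.17, years)
        print(years, round(odds, 3), round(odds_to_probability(odds), 3))
```

Under this assumption the odds fall to well under a fifth of their starting value after ten years, whatever the starting value is; the corresponding probability depends on the starting odds, which is why the example labels them as hypothetical.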

Massively differential bias between two widely used Illumina library preparation methods for small RNA sequencing

Jeanette Baran-Gale, Michael R Erdos, Christina Sison, Alice Young, Emily E Fannin, Peter S Chines, Praveen Sethupathy

Recent advances in sequencing technology have helped unveil the unexpected complexity and diversity of small RNAs. A critical step in small RNA library preparation for sequencing is the ligation of adapter sequences to both the 5’ and 3’ ends of small RNAs. Two widely used protocols for small RNA library preparation, Illumina v1.5 and Illumina TruSeq, use different pairs of adapter sequences. In this study, we compare the results of small RNA-sequencing between v1.5 and TruSeq and observe a striking differential bias. Nearly 100 highly expressed microRNAs (miRNAs) are >5-fold differentially detected and 48 miRNAs are >10-fold differentially detected between the two methods of library preparation. In fact, some miRNAs, such as miR-24-3p, are over 30-fold differentially detected. The results are reproducible across different sequencing centers (NIH and UNC) and both major Illumina sequencing platforms, GAIIx and HiSeq. While some level of bias in library preparation is not surprising, the apparent massive differential bias between these two widely used adapter sets is not well appreciated. As increasingly more laboratories transition to the newer TruSeq-based library preparation for small RNAs, researchers should be aware of the extent to which the results may differ from previously published results using v1.5.

CONCOCT: Clustering cONtigs on COverage and ComposiTion

Johannes Alneberg, Brynjar Smari Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z. Ijaz, Nicholas J. Loman, Anders F. Andersson, Christopher Quince
(Submitted on 14 Dec 2013)

Metagenomics enables the reconstruction of microbial genomes in complex microbial communities without the need for culturing. Since assembly typically results in fragmented genomes, the grouping of genome fragments (contigs) belonging to the same genome, a process referred to as binning, remains a major informatics challenge. Here we present CONCOCT, a computer program that combines three types of information – sequence composition, coverage across multiple samples, and read-pair linkage – to automatically bin contigs into genomes. We demonstrate high recall and precision rates of the program on artificial as well as real human gut metagenome datasets.
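As a rough illustration of what "combining composition and coverage" can mean in practice, the sketch below builds one feature vector per contig from k-mer frequencies and per-sample coverage values. This is not CONCOCT's code: the k-mer length, the function names, and the simple concatenation are assumptions made for the example, and the clustering step and read-pair linkage are left out entirely.

```python
from collections import Counter
from itertools import product

K = 4  # k-mer length; an assumption for this sketch


def composition_vector(seq, k=K):
    """Normalized k-mer frequencies of `seq`, in a fixed alphabetical order.
    k-mers containing characters other than A, C, G, T are ignored."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def contig_features(seq, coverages, k=K):
    """Concatenate composition and per-sample coverage into one vector
    (hypothetical layout: 4**k composition values, then one value per sample)."""
    return composition_vector(seq, k) + [float(c) for c in coverages]
```

Vectors of this kind could then be handed to any clustering method, so that contigs with similar composition and similar coverage profiles across samples end up in the same bin.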

Bayesian inference of infectious disease transmission from whole genome sequence data

Xavier Didelot, Jennifer Gardy, Caroline Colijn

Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered — how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via a Monte-Carlo Markov Chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.

FPCB : a simple and swift strategy for mirror repeat identification

Bhardwaj Vikash, Gupta Swapni, Meena Sitaram, Sharma Kulbhushan
(Submitted on 13 Dec 2013)

With recent advances in sequencing strategies, mirror repeats have been found in the gene sequences of many organisms and species. The presence of mirror repeats in most of these sequences points to some important functional role for them. However, a simple and quick strategy to search for these repeats in a given sequence is not available. In this manuscript we propose a simple and swift strategy, named the FPCB strategy, to identify mirror repeats in a given sequence. The strategy includes three simple steps: downloading the sequence in FASTA format (F), making its parallel complement (PC), and finally performing a homology search with the original sequence (B). At least twenty genes were analyzed using the proposed strategy. A number of mirror repeats, of several types, were observed. We have also tried to give nomenclature to these repeats. We hope that the proposed FPCB strategy will be quite helpful for the identification of mirror repeats in DNA or mRNA sequences. The strategy may also help in unraveling the functional role of mirror repeats in various processes, including evolution.
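The idea behind the three steps can be sketched in a few lines. The parallel complement is the base-by-base complement without reversal, so its reverse complement is simply the original sequence read backwards; a homology search of the parallel complement against the original should therefore surface mirror repeats as opposite-strand hits. The sketch below is only an illustration of that idea and not the authors' procedure: it replaces the homology search with a direct scan, the function names and the `min_half` threshold are hypothetical, and it reports only perfect mirror repeats with no spacer between the two halves.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def parallel_complement(seq):
    """Complement each base without reversing the sequence."""
    return "".join(COMPLEMENT[base] for base in seq.upper())


def mirror_repeats(seq, min_half=4):
    """Return (start, end) half-open spans where the left half reads as the
    reverse of the right half, extended as far as possible around each center."""
    seq = seq.upper()
    hits = []
    for center in range(1, len(seq)):
        left, right = center - 1, center
        # Grow outward while the bases on both sides of the center agree.
        while left >= 0 and right < len(seq) and seq[left] == seq[right]:
            left -= 1
            right += 1
        half = center - 1 - left  # number of matched bases on each side
        if half >= min_half:
            hits.append((left + 1, right))
    return hits


if __name__ == "__main__":
    example = "GATTACCATTAG"  # "GATTAC" followed by its mirror image "CATTAG"
    print(parallel_complement(example))
    print(mirror_repeats(example))
```

A direct scan like this is quadratic in the worst case and ignores mismatches and spacers, which is where a proper homology search has the advantage.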

Author post: Joint analysis of functional genomic data and genome-wide association studies of 18 human traits

The following post is by Joe Pickrell [@joe_pickrell] on his preprint Joint analysis of functional genomic data and genome-wide association studies of 18 human traits, available on bioRxiv here.

Until recently, the field of human genetics struggled to identify genetic variants that influence complex traits and diseases like height or diabetes. With the arrival of genome-wide association studies (GWAS), studies now regularly identify tens to hundreds of genomic regions that contain such variants. The question going forward is clear: how do these variants influence traits?

One way to answer this question involves annotating variants according to their potential functions–does a given variant change the sequence of a protein? Or does it disrupt the splicing of a gene? Or does it fall in a regulatory region in an important cell type? Many groups (like those that are part of the ENCODE project) are generating hundreds of datasets that are potentially informative about these types of questions. But which of these hundreds of datasets are relevant when studying a given trait?

In this preprint, I develop a statistical method (an empirical Bayes hierarchical model) that takes summary statistics from a genome-wide association study of a given trait and identifies the types of genomic annotation that are relevant for the trait; software implementing this method is available here. I then applied this method to a set of 18 traits and 450 genomic annotations. Feedback on the method itself is of course welcome, but I’d also like to highlight what I think are the most interesting biological results:

  1. The relative importance of protein-coding versus regulatory variants varies across traits. The fraction of GWAS hits driven by changes in protein sequence depends on the trait, ranging from a low of around 2% up to around 20% (see above).
  2. Repressed chromatin is depleted for loci that influence traits. I was surprised to find that the most informative type of annotation for interpreting a GWAS is often repressed chromatin, which is depleted for loci influencing traits. This type of chromatin covers up to 70% of the genome.
  3. Cell type-specific DNase-I hypersensitive sites are enriched for loci that influence traits. A “hypothesis-free” scan across all regulatory regions in many tissues can identify unexpected connections between traits and tissues. For example, loci that influence bone density are enriched in gene regulatory regions in muscle tissue, and loci that influence Crohn’s disease are enriched in regulatory regions in fibroblasts.
  4. Incorporating functional information into a GWAS increases power to detect loci. Finally, re-weighting a GWAS using this method increases the number of loci identified in each GWAS by around 5%; many of the loci identified with this method have been replicated in larger studies.

I’m hopeful that this method (and others like it) will be useful in making the transition from identifying statistical associations in a GWAS to understanding the underlying biology; comments and criticisms are welcome.

Extensive Phenotypic Changes Associated with Large-scale Horizontal Gene Transfer

Kevin Dougherty, Brian A Smith, Autum F Moore, Shannon Maitland, Chris Fanger, Rachel Murillo, David A Baltrus

Horizontal gene transfer often leads to phenotypic changes within recipient organisms independent of any immediate evolutionary benefits. While secondary phenotypic effects of horizontal transfer (i.e. changes in growth rates) have been demonstrated and studied across a variety of systems using relatively small plasmid and phage, little is known about how size of the acquired region affects the magnitude or number of such costs. Here we describe an amazing breadth of phenotypic changes which occur after a large-scale horizontal transfer event (~1Mb megaplasmid) within Pseudomonas stutzeri including sensitization to various stresses as well as changes in bacterial behavior. These results highlight the power of horizontal transfer to shift pleiotropic relationships and cellular networks within bacterial genomes. They also provide an important context for how secondary effects of transfer can bias evolutionary trajectories and interactions between species. Lastly, these results and system provide a foundation to investigate evolutionary consequences in real time as newly acquired regions are ameliorated and integrated into new genomic contexts.

Robustly detecting differential expression in RNA sequencing data using observation weights

Xiaobei Zhou, Helen Lindsay, Mark D. Robinson
(Submitted on 12 Dec 2013)

A popular approach for comparing gene expression levels between (replicated) conditions of RNA sequencing data relies on counting reads that map to features of interest. Within such count-based methods, many flexible and advanced statistical approaches now exist and offer the ability to adjust for covariates (e.g., batch effects). Often, these methods include some sort of "sharing of information" across features to improve inferences in small samples. It is important to achieve an appropriate tradeoff between statistical power and protection against outliers. Here, we study the robustness of existing approaches for count-based differential expression analysis and propose a new strategy based on observation weights that can be used within existing frameworks. The results suggest that outliers can have a global effect on differential analyses. We demonstrate the effectiveness of our new approach with real data and simulated data that reflect properties of real datasets (e.g., dispersion-mean trend) and develop an extensible framework for comprehensive testing of current and future methods. In addition, we explore the origin of such outliers, in some cases highlighting additional biological or technical factors within the experiment. Further details can be downloaded from the project website: this http URL

A Tale of Two Hypotheses: Genetics and the Ethnogenesis of Ashkenazi Jewry

Aram Yardumian

The debate over the ethnogenesis of Ashkenazi Jewry is longstanding, and has been hampered by a lack of Jewish historiographical work between the Biblical and the early Modern eras. Most historians, as well as geneticists, situate them as the descendants of Israelite tribes whose presence in Europe is owed to deportations during the Roman conquest of Palestine, as well as migration from Babylonia, and eventual settlement along the Rhine. By contrast, a few historians and other writers, most famously Arthur Koestler, have looked to migrations following the decline of the little-understood Medieval Jewish kingdom of Khazaria as the main source for Ashkenazi Jewry. A recent study of genetic variation in southeastern European populations (Elhaik 2012) also proposed a Khazarian origin for Ashkenazi Jews, eliciting considerable criticism from other scholars investigating Jewish ancestry who favor a Near Eastern origin of Ashkenazi populations. This paper re-examines the genetic data and analytical approaches used in these studies of Jewish ancestry, and situates them in the context of historical, linguistic, and archaeological evidence from the Caucasus, Europe and the Near East. Based on this reanalysis, it appears not only that the Khazar Hypothesis per se is without serious merit, but also that the veracity of the ‘Rhineland Hypothesis’ may be questionable.