FPCB : a simple and swift strategy for mirror repeat identification

Bhardwaj Vikash, Gupta Swapni, Meena Sitaram, Sharma Kulbhushan
(Submitted on 13 Dec 2013)

With recent advances in sequencing strategies, mirror repeats have been found in the gene sequences of many organisms and species. Their presence in most sequences points to an important functional role for these repeats. However, no simple and quick strategy is available for searching for these repeats in a given sequence. In this manuscript we propose a simple and swift strategy, named the FPCB strategy, to identify mirror repeats in a given sequence. The strategy comprises three simple steps: downloading the sequence in FASTA format (F), making its parallel complement (PC), and finally performing a homology search against the original sequence (B). At least twenty genes were analyzed using the proposed strategy, and a number of mirror repeats of various types were observed. We have also tried to give a nomenclature to these repeats. We hope that the proposed FPCB strategy will be helpful for identifying mirror repeats in DNA or mRNA sequences. The strategy may also help in unraveling the functional role of mirror repeats in various processes, including evolution.
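The parallel-complement step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the homology search (B) is replaced here by a naive direct scan for perfect, spacer-free mirror repeats. The sketch relies on one observation: the reverse complement of the parallel complement is simply the original sequence read backwards, which is presumably why a homology search between the two can expose mirror repeats.

```python
# Hypothetical helper names; a sketch of the idea, not the published FPCB pipeline.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def parallel_complement(seq):
    """Complement every base WITHOUT reversing the sequence."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq):
    """Standard reverse complement, shown for comparison."""
    return seq.translate(COMPLEMENT)[::-1]

def find_mirror_repeats(seq, min_arm=4):
    """Return (start, end) slices of perfect mirror repeats without a spacer.

    This naive scan stands in for the homology search: it expands outwards
    from every position between two bases while the bases on both sides match.
    """
    hits = []
    n = len(seq)
    for centre in range(1, n):
        left, right = centre - 1, centre
        while left >= 0 and right < n and seq[left] == seq[right]:
            left -= 1
            right += 1
        arm = centre - (left + 1)  # number of matched bases on each side
        if arm >= min_arm:
            hits.append((left + 1, right))
    return hits
```

For example, `find_mirror_repeats("GCAGGTCCTGGATA")` reports the slice covering `AGGTCCTGGA`, whose two arms `AGGTC` and `CTGGA` are mirror images of each other.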

Robustly detecting differential expression in RNA sequencing data using observation weights

Xiaobei Zhou, Helen Lindsay, Mark D. Robinson
(Submitted on 12 Dec 2013)

A popular approach for comparing gene expression levels between (replicated) conditions of RNA sequencing data relies on counting reads that map to features of interest. Within such count-based methods, many flexible and advanced statistical approaches now exist and offer the ability to adjust for covariates (e.g., batch effects). Often, these methods include some sort of "sharing of information" across features to improve inferences in small samples. It is important to achieve an appropriate tradeoff between statistical power and protection against outliers. Here, we study the robustness of existing approaches for count-based differential expression analysis and propose a new strategy based on observation weights that can be used within existing frameworks. The results suggest that outliers can have a global effect on differential analyses. We demonstrate the effectiveness of our new approach with real data and simulated data that reflect properties of real datasets (e.g., dispersion-mean trend) and develop an extensible framework for comprehensive testing of current and future methods. In addition, we explore the origin of such outliers, in some cases highlighting additional biological or technical factors within the experiment. Further details can be downloaded from the project website: this http URL

Evolution at two levels of gene expression in yeast

Carlo G. Artieri, Hunter B. Fraser
(Submitted on 27 Nov 2013)

Despite the greater functional importance of protein levels, our knowledge of gene expression evolution is based almost entirely on studies of mRNA levels. In contrast, our understanding of how translational regulation evolves has lagged far behind. Here we have applied ribosome profiling – which measures both global mRNA levels and their translation rates – to two species of Saccharomyces yeast and their interspecific hybrid in order to assess the relative contributions of changes in mRNA abundance and translation to regulatory evolution. We report that both cis and trans-acting regulatory divergence in translation are abundant, affecting at least 35% of genes. The majority of translational divergence acts to buffer changes in mRNA abundance, suggesting a widespread role for stabilizing selection acting across regulatory levels. Nevertheless, we observe evidence of lineage-specific selection acting on a number of yeast functional modules, including instances of reinforcing selection acting at both levels of regulation. Finally, we also uncover multiple instances of stop-codon readthrough that are conserved between species. Our analysis reveals the under-appreciated complexity of post-transcriptional regulatory divergence and indicates that partitioning the search for the locus of selection into the binary categories of ‘coding’ vs. ‘regulatory’ may overlook a significant source of selection, acting at multiple regulatory levels along the path from genotype to phenotype.

Data Mining of Online Genealogy Datasets for Revealing Lifespan Patterns in Human Population

Michael Fire, Yuval Elovici
(Submitted on 18 Nov 2013)

Online genealogy datasets contain extensive information about millions of people and their past and present family connections. This vast amount of data can assist in identifying various patterns in human population. In this study, we present methods and algorithms which can assist in identifying variations in lifespan distributions of human population in the past centuries, in detecting social and genetic features which correlate with human lifespan, and in constructing predictive models of human lifespan based on various features which can easily be extracted from genealogy datasets.
We have evaluated the presented methods and algorithms on a large online genealogy dataset with over a million profiles and over 8.8 million connections, all of which were collected from the WikiTree website. Our findings indicate that significant but small positive correlations exist between the parents’ lifespan and their children’s lifespan. Additionally, we found slightly higher and significant correlations between the lifespans of spouses. We also discovered a very small positive and significant correlation between longevity and reproductive success in males, and a small and significant negative correlation between longevity and reproductive success in females. Moreover, our machine learning algorithms achieved better-than-random classification results in predicting which people who outlive the age of 50 will also outlive the age of 80.
We believe that this study will be the first of many studies which utilize the wealth of data on human populations, existing in online genealogy datasets, to better understand factors which influence human lifespan. Understanding these factors can assist scientists in providing solutions for successful aging.

On the optimal trimming of high-throughput mRNA sequence data

Matthew D MacManes

The widespread and rapid adoption of high-throughput sequencing technologies has changed the face of modern studies of evolutionary genetics. Indeed, newer sequencing technologies, like Illumina sequencing, have afforded researchers the opportunity to gain a deep understanding of genome level processes that underlie evolutionary change. In particular, researchers interested in functional biology and adaptation have used these technologies to sequence mRNA transcriptomes of specific tissues, which in turn are often compared to other tissues, or other individuals with different phenotypes. While these techniques are extremely powerful, careful attention to data quality is required. In particular, because high-throughput sequencing is more error-prone than traditional Sanger sequencing, quality trimming of sequence reads should be an important step in all data processing pipelines. While several software packages for quality trimming exist, no general guidelines for the specifics of trimming have been developed. Here, using empirically derived sequence data, I provide general recommendations regarding the optimal strength of trimming, specifically in mRNA-Seq studies. Although very aggressive quality trimming is common, this study suggests that a more gentle trimming, specifically of those nucleotides whose Phred score < 2 or < 5, is optimal for most studies across a wide variety of metrics.
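To make the idea of gentle trimming concrete, here is a minimal Python sketch. It is a simplification, not the procedure used in the study: it assumes Phred+33 quality encoding, trims only from the 3' end, and stops at the first base that meets the threshold. The function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]

def trim_3prime(seq, quality, threshold=5, offset=33):
    """Remove trailing bases whose Phred score is below `threshold`.

    Trimming stops at the first base, scanning from the 3' end, whose score
    is at least `threshold`; interior low-quality bases are left untouched.
    """
    scores = phred_scores(quality, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < threshold:
        end -= 1
    return seq[:end], quality[:end]
```

With the read `ACGTACGT` and quality string `IIIII#$!` (scores 40, 40, 40, 40, 40, 2, 3, 0), a threshold of 5 removes the last three bases, whereas a threshold of 2 removes only the final base.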

Functional Annotation Signatures of Disease Susceptibility Loci Improve SNP Association Analysis

Edwin S Iversen, Gary Lipton, Merlise A. Clyde, Alvaro N. A. Monteiro
doi: 10.1101/000158

We describe the development and application of a Bayesian statistical model for the prior probability of phenotype-genotype association that incorporates data from past association studies and publicly available functional annotation data regarding the susceptibility variants under study. The model takes the form of a binary regression of association status on a set of annotation variables whose coefficients were estimated through an analysis of associated SNPs housed in the GWAS Catalog (GC). The set of functional predictors we examined includes measures that have been demonstrated to correlate with the association status of SNPs in the GC and some whose utility in this regard is speculative: summaries of the UCSC Human Genome Browser ENCODE super-track data, dbSNP function class, sequence conservation summaries, proximity to genomic variants included in the Database of Genomic Variants (DGV) and known regulatory elements included in the Open Regulatory Annotation database (ORegAnno), PolyPhen-2 probabilities and RegulomeDB categories. Because we expected that only a fraction of the annotation variables would contribute to predicting association, we employed a penalized likelihood method to reduce the impact of non-informative predictors and evaluated the model’s ability to predict GC SNPs not used to construct the model. We show that the functional data alone are predictive of a SNP’s presence in the GC. Further, using data from a genome-wide study of ovarian cancer, we demonstrate that their use as prior data when testing for association is practical at the genome-wide scale and improves power to detect associations.

Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis

Eric Y. Durand, Nicholas Eriksson, Cory Y. McLean
(Submitted on 5 Nov 2013)

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, but IBD detection accuracy in non-simulated data is largely unknown. Using 25,432 genotyped European individuals, and exploiting known familial relationships in 2,952 father-mother-child trios contained therein, we identify a false positive rate over 67% for short (2-4 centiMorgan) segments. We introduce a novel, computationally-efficient, haplotype-based metric that enables accurate IBD detection on population-scale datasets.
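One way trios could be used to flag false positives is sketched below in Python. The abstract does not spell out the procedure, so the rule used here is an assumption: an IBD segment that a child shares with an unrelated individual should be inherited from a parent, so it counts as supported only if a parent shares an overlapping segment with the same individual. The function names and the 0.8 overlap cut-off are hypothetical.

```python
def overlap_fraction(segment, other):
    """Fraction of `segment` (start, end in cM) that is covered by `other`."""
    start, end = segment
    other_start, other_end = other
    shared = min(end, other_end) - max(start, other_start)
    return max(0.0, shared) / (end - start)

def is_supported(child_segment, parent_segments, min_fraction=0.8):
    """True if some parental segment covers enough of the child's segment."""
    return any(
        overlap_fraction(child_segment, parent) >= min_fraction
        for parent in parent_segments
    )

def unsupported_rate(child_segments, parent_segments, min_fraction=0.8):
    """Share of the child's segments that no parental segment supports."""
    if not child_segments:
        return 0.0
    unsupported = sum(
        not is_supported(seg, parent_segments, min_fraction)
        for seg in child_segments
    )
    return unsupported / len(child_segments)
```

All segments passed to these functions are assumed to be on the same chromosome and shared with the same third individual.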

SMASH: A Benchmarking Toolkit for Variant Calling

Ameet Talwalkar, Jesse Liptrap, Julie Newcomb, Christopher Hartl, Jonathan Terhorst, Kristal Curtis, Ma’ayan Bresler, Yun S. Song, Michael I. Jordan, David Patterson
(Submitted on 31 Oct 2013)

Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers.
Results: We propose a benchmarking methodology for evaluating variant calling algorithms called the SMASH toolkit. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMASH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms.
Availability: We provide free and open access online to the SMASH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.
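The abstract does not name the accuracy metrics, so the following Python sketch only assumes a common choice: precision and recall of a call set against a validated truth set. It is not the SMASH API; the function name and the (chromosome, position, ref, alt) variant key are hypothetical.

```python
def precision_recall(called, truth):
    """Compare two sets of variants keyed by (chrom, pos, ref, alt).

    Precision is the fraction of calls that are in the truth set;
    recall is the fraction of true variants that were called.
    """
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Toy example with made-up variants.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")}
called = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 90, "T", "C")}
precision, recall = precision_recall(called, truth)
```

Exact matching on the key is a simplification; indels and structural variants usually need normalisation or tolerance in position before they can be compared.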

Discriminative Measures for Comparison of Phylogenetic Trees

Omur Arslan, Dan P. Guralnik, Daniel E. Koditschek
(Submitted on 19 Oct 2013)

Efficient and informative comparison of trees is a common essential interest of both computational biology and pattern classification. In this paper, we introduce a novel dissimilarity measure on non-degenerate hierarchies (rooted binary trees), called the NNI navigation distance, that counts the steps along the trajectory of a discrete dynamical system defined over the Nearest Neighbor Interchange (NNI) graph of binary hierarchies. The NNI navigation distance has a unique unifying nature of combining both edge comparison methods and edit operations for comparison of trees and is an efficient approximation to the (NP-hard) NNI distance. It is given by a closed form expression which simply generalizes to nondegenerate hierarchies as well. A relaxation of the closed form of the NNI navigation distance results in a simpler dissimilarity measure on all trees, named the crossing dissimilarity, which counts pairwise cluster incompatibilities of trees. Both of our dissimilarity measures on nondegenerate hierarchies are positive definite (they vanish only between identical trees) and symmetric, but they are not true metrics because they do not satisfy the triangle inequality. Nevertheless, they are both linearly bounded below by the widely used Robinson-Foulds metric and above by a new tree metric, called the cluster-cardinality distance, which is the pullback metric of a matrix norm along an embedding of hierarchies into the space of matrices. All of these proposed tree measures can be computed efficiently in O(n^2) time, where n is the number of leaves.
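The notion of counting pairwise cluster incompatibilities can be illustrated with a short Python sketch. It rests on assumptions: each tree is represented by its set of non-trivial clusters (leaf sets of internal nodes), two clusters count as incompatible when they overlap without one containing the other, and the Robinson-Foulds value is taken as the size of the symmetric difference of the cluster sets (some conventions halve it). The function names are hypothetical, and this naive version is not the O(n^2) algorithm of the paper.

```python
def clusters(*leaf_sets):
    """Build a tree representation as a set of clusters (frozensets of leaves)."""
    return {frozenset(leaves) for leaves in leaf_sets}

def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a

def crossing_dissimilarity(tree1, tree2):
    """Count cluster pairs, one from each tree, that are incompatible."""
    return sum(1 for a in tree1 for b in tree2 if not compatible(a, b))

def robinson_foulds(tree1, tree2):
    """Number of clusters present in exactly one of the two trees."""
    return len(tree1 ^ tree2)

# Two hierarchies on the leaves {1, 2, 3, 4}: ((1,2),(3,4)) and ((1,3),(2,4)).
t1 = clusters({1, 2}, {3, 4})
t2 = clusters({1, 3}, {2, 4})
```

For these two trees every cluster of one crosses every cluster of the other, so both measures evaluate to 4, while comparing a tree with itself gives 0 for both.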

Application of compressed sensing to genome wide association studies and genomic selection

Shashaank Vattikuti, James J. Lee, Stephen D. H. Hsu, Carson C. Chow
(Submitted on 8 Oct 2013)

We show that the signal-processing paradigm known as compressed sensing (CS) is applicable to genome-wide association studies (GWAS) and genomic selection (GS). The aim of GWAS is to isolate trait-associated loci, whereas GS attempts to predict the phenotypic values of new individuals on the basis of training data. CS addresses a problem common to both endeavors, namely that the number of genotyped markers often greatly exceeds the sample size. We show using CS methods and theory that all loci of nonzero effect can be identified (selected) using an efficient algorithm, provided that they are sufficiently few in number (sparse) relative to sample size. For heritability h^2 = 1, there is a sharp phase transition to complete selection as the sample size is increased. For heritability values less than one, complete selection can still occur although the transition is smoothed. The transition boundary is only weakly dependent on the total number of genotyped markers. The crossing of a transition boundary provides an objective means to determine when true effects are being recovered. For h^2 = 0.5, we find that a sample size that is thirty times the number of nonzero loci is sufficient for good recovery.
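A small simulation can show what selecting sparse loci with more markers than samples looks like. The Python sketch below is illustrative only: the abstract does not name the algorithm, so L1-penalised regression solved by iterative soft-thresholding (ISTA) is an assumption, as are the Gaussian stand-in for genotypes, the noiseless phenotype (h^2 = 1), the penalty value and all function names.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry of v towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by ISTA."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Simulated data: more markers (p) than individuals (n), few causal loci (s).
rng = np.random.default_rng(0)
n, p, s = 100, 200, 3
X = rng.standard_normal((n, p))            # stand-in for standardised genotypes
true_beta = np.zeros(p)
causal = rng.choice(p, size=s, replace=False)
true_beta[causal] = 1.0
y = X @ true_beta                          # no noise, i.e. h^2 = 1

beta_hat = lasso_ista(X, y, lam=0.05)
selected = set(np.flatnonzero(np.abs(beta_hat) > 0.5).tolist())
```

Here `selected` holds the indices whose estimated effect exceeds 0.5 in magnitude, which can be compared with `causal`; adding noise to `y` would correspond to heritability below one.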