SInC: An accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data

Swetansu Pattnaik, Saurabh Gupta, Arjun A Rao, Binay Panda
(Submitted on 12 Jul 2013)

We report SInC (SNV, Indel and CNV) simulator and read generator, an open-source tool capable of simulating biological variants taking into account a platform-specific error model. SInC simulates and generates single- and paired-end reads with a user-defined insert size, and does so more efficiently than other existing tools. Because read generation is multi-threaded, SInC has a low time footprint. SInC is currently optimised to work in a limited infrastructure setup and can efficiently exploit the commonly used quad-core desktop architecture to simulate short sequence reads with deep coverage for large genomes. SInC can be downloaded from this https URL

kruX: Matrix-based non-parametric eQTL discovery

Jianlong Qi, Hassan Foroughi Asl, Johan Bjorkegren, Tom Michoel
(Submitted on 12 Jul 2013)

The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of marker-trait combinations one-by-one can become computationally prohibitive. We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to calculate the Kruskal-Wallis test statistic for several millions of marker-trait combinations at once. kruX is more than 3,000 times faster than computing associations one-by-one on a typical human dataset.
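
To illustrate the matrix idea in the abstract above, here is a minimal sketch in plain Python. It is not the kruX implementation: the function names, the 0/1/2 genotype coding, and the no-ties, no-missing-data assumptions are all choices made for this example. It relies on the standard form of the statistic, H = 12 / (N (N + 1)) * sum_g (R_g^2 / n_g) - 3 (N + 1), where R_g is the rank sum of genotype group g. The rank sums for every trait-marker pair come from one matrix product per genotype group.

```python
def rank(values):
    """Return ranks 1..N for one trait (this sketch assumes no tied values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def matmul(a, b):
    """Plain-Python matrix product of a (p x n) and b (n x q)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def kruskal_wallis_matrix(expr, geno, n_groups=3):
    """Kruskal-Wallis H for every trait-marker pair.

    expr: traits x samples expression values (hypothetical input layout).
    geno: markers x samples genotype codes in 0..n_groups-1 (assumed coding).
    Returns a traits x markers matrix of H statistics without tie correction.
    """
    n = len(expr[0])
    ranks = [rank(row) for row in expr]          # traits x samples
    acc = [[0.0] * len(geno) for _ in expr]      # accumulates sum_g R_g^2 / n_g
    for g in range(n_groups):
        # Indicator matrix (samples x markers): 1 where a sample has genotype g.
        ind = [[1 if marker[s] == g else 0 for marker in geno]
               for s in range(n)]
        sizes = [sum(col) for col in zip(*ind)]  # group size n_g per marker
        rank_sums = matmul(ranks, ind)           # traits x markers, all pairs at once
        for t, row in enumerate(rank_sums):
            for m, r_g in enumerate(row):
                if sizes[m] > 0:                 # skip empty genotype groups
                    acc[t][m] += r_g * r_g / sizes[m]
    scale = 12.0 / (n * (n + 1))
    return [[scale * v - 3 * (n + 1) for v in row] for row in acc]


if __name__ == "__main__":
    expr = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
    geno = [[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]
    print(kruskal_wallis_matrix(expr, geno))
```

With an array library the per-group products would be single calls; the pure-Python `matmul` here only keeps the sketch dependency-free.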

Cloudbreak: Accurate and Scalable Genomic Structural Variation Detection in the Cloud with MapReduce

Christopher W. Whelan, Jeffrey Tyner, Alberto L’Abbate, Clelia Tiziana Storlazzi, Lucia Carbone, Kemal Sönmez
(Submitted on 9 Jul 2013)

The detection of genomic structural variations (SV) remains a difficult challenge in analyzing sequencing data, and the growing size and number of sequenced genomes have rendered SV detection a bona fide big data problem. MapReduce is a proven, scalable solution for distributed computing on huge data sets. We describe a conceptual framework for SV detection algorithms in MapReduce based on computing local genomic features, and use it to develop a deletion and insertion detection algorithm, Cloudbreak. On simulated and real data sets, Cloudbreak achieves accuracy improvements over popular SV detection algorithms, and genotypes variants from diploid samples. It provides dramatically shorter runtimes and the ability to scale to big data volumes on large compute clusters. Cloudbreak includes tools to set up and configure MapReduce (Hadoop) clusters on cloud services, enabling on-demand cluster computing. Our implementation and source code are available at this http URL

Inference of chromosomal inversion dynamics from Pool-Seq data in natural and laboratory populations of Drosophila melanogaster

Martin Kapun, Hester van Schalkwyk, Bryant McAllister, Thomas Flatt, Christian Schlötterer
(Submitted on 9 Jul 2013)

Sequencing of pools of individuals (Pool-Seq) represents a reliable and cost-effective approach for estimating genome-wide SNP and transposable element insertion frequencies. However, Pool-Seq does not provide direct information on haplotypes so that for example obtaining inversion frequencies has not been possible until now. Here, we have developed a new set of diagnostic marker SNPs for 7 cosmopolitan inversions in Drosophila melanogaster that can be used to infer inversion frequencies from Pool-Seq data. We applied our novel marker set to Pool-Seq data from an experimental evolution study and from North American and Australian latitudinal clines. In the experimental evolution data, we find evidence that positive selection has driven the frequencies of In(3R)C and In(3R)Mo to increase over time. In the clinal data, we confirm the existence of frequency clines for In(2L)t, In(3L)P and In(3R)Payne in both North America and Australia and detect a previously unknown latitudinal cline for In(3R)Mo in North America. The inversion markers developed here provide a versatile and robust tool for characterizing inversion frequencies and their dynamics in Pool-Seq data from diverse D. melanogaster populations.

The Changing Geometry of a Fitness Landscape Along an Adaptive Walk

Devin Greene, Kristina Crona
(Submitted on 7 Jul 2013)

It has recently been noted that the relative prevalence of the various kinds of epistasis varies along an adaptive walk. This has been explained as a result of mean regression in NK model fitness landscapes. Here we show that this phenomenon occurs quite generally in fitness landscapes. We propose a simple and general explanation for this phenomenon, confirming the role of mean regression. We provide support for this explanation with simulations, and discuss the empirical relevance of our findings.

RNA secondary structure prediction from multi-aligned sequences

Michiaki Hamada
(Submitted on 8 Jul 2013)

It has been well accepted that the RNA secondary structures of most functional non-coding RNAs (ncRNAs) are closely related to their functions and are conserved during evolution. Hence, prediction of conserved secondary structures from evolutionarily related sequences is one important task in RNA bioinformatics; the methods are useful not only to further functional analyses of ncRNAs but also to improve the accuracy of secondary structure predictions and to find novel functional RNAs from the genome. In this review, I focus on common secondary structure prediction from a given aligned RNA sequence, in which one secondary structure whose length is equal to that of the input alignment is predicted. I systematically review and classify existing tools and algorithms for the problem, by utilizing the information employed in the tools and by adopting a unified viewpoint based on maximum expected gain (MEG) estimators. I believe that this classification will allow a deeper understanding of each tool and provide users with useful information for selecting tools for common secondary structure predictions.

Evaluating strategies of phylogenetic analyses by the coherence of their results

Blaise Li
(Submitted on 5 Jul 2013)

I propose an approach to identify, among several strategies of phylogenetic analysis, those producing the most accurate results. This approach is based on the hypothesis that the more a result is reproduced from independent data, the more it reflects the historical signal common to the analysed data. Under this hypothesis, the capacity of an analytical strategy to extract historical signal should correlate positively with the coherence of the obtained results. I apply this approach to a series of analyses on empirical data, basing the coherence measure on the Robinson-Foulds distances between the obtained trees. At first approximation, the analytical strategies most suitable for the data produce the most coherent results. However, risks of false positives and false negatives are identified, which are difficult to rule out.
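
The coherence measure above rests on Robinson-Foulds distances between trees. As a rough illustration only, the sketch below computes the unrooted Robinson-Foulds distance as the number of non-trivial bipartitions found in exactly one of two trees. The nested-tuple tree encoding and the function names are assumptions made for this example; the abstract does not say how the trees were represented or which software was used.

```python
def leaves(tree):
    """Set of leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the unrooted tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = leaves(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


if __name__ == "__main__":
    t1 = (("A", "B"), ("C", ("D", "E")))
    t2 = (("A", "C"), ("B", ("D", "E")))
    print(robinson_foulds(t1, t2))
```

Averaging such distances over the trees obtained from independent data sets would give one possible coherence score, with lower averages indicating more coherent results.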

Evolution on genotype networks leads to phenotypic entrapment

Susanna Manrubia, José A. Cuesta
(Submitted on 3 Jul 2013)

Large sets of genotypes give rise to the same phenotype because phenotypic expression is highly redundant. Accordingly, a population can accept mutations without altering its phenotype, as long as they transform its genotype into another one on the same set. By linking every pair of genotypes that are mutually accessible through mutation, genotypes organize themselves into genotype networks (GN). These networks are known to be heterogeneous and assortative. As these features condition the probability that mutations keep the phenotype unchanged—hence becoming blind to natural selection—it follows that the topology of the GN will influence the evolutionary dynamics of the population. In this letter we analyze this effect by studying the dynamics of random walks (RW) on assortative networks with arbitrary topology. We find that the probability that a RW leaves the network is smaller the longer the time spent in it—i.e., the process is not Markovian. From the biological viewpoint, this “phenotypic entrapment” entails an acceleration in the fixation of neutral mutations, thus implying a non-uniform increase in the ticking rate of the molecular clock with the age of branches in phylogenetic trees. We also show that this effect is stronger the larger the fitness of the current phenotype relative to that of neighboring phenotypes.

Systematic identification of gene families for use as markers for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups

Dongying Wu, Guillaume Jospin, Jonathan A. Eisen
(Submitted on 2 Jul 2013)

Given the astonishing rate at which genomic and metagenomic sequence data sets are accumulating, there are many reasons to constrain the data analyses. One approach to such constrained analyses is to focus on select subsets of gene families that are particularly well suited for the tasks at hand. Such gene families have generally been referred to as marker genes. We are particularly interested in identifying and using such marker genes for phylogenetic and phylogeny-driven ecological studies of microbes and their communities. We therefore refer to these as PhyEco (for phylogenetic and phylogenetic ecology) markers. The dual use of these PhyEco markers means that we needed to develop and apply a set of somewhat novel criteria for identification of the best candidates for such markers. The criteria we focused on included universality across the taxa of interest, ability to be used to produce robust phylogenetic trees that reflect as much as possible the evolution of the species from which the genes come, and low variation in copy number across taxa. We describe here an automated protocol for identifying potential PhyEco markers from a set of complete genome sequences. The protocol combines rapid searching, clustering and phylogenetic tree building algorithms to generate protein families that meet the criteria listed above. We report here the identification of PhyEco markers for different taxonomic levels including 40 for all bacteria and archaea, 114 for all bacteria, and many more for some of the individual phyla of bacteria. This new list of PhyEco markers should allow much more detailed automated phylogenetic and phylogenetic ecology analyses of these groups than was previously possible.

A hierarchical network heuristic for solving the orientation problem in genome assembly

Karl R. B. Schmitt, Aleksey V. Zimin, Guillaume Marçais, James A. Yorke, Michelle Girvan
(Submitted on 1 Jul 2013)

In the past several years, the problem of genome assembly has received considerable attention from both biologists and computer scientists. An important component of current assembly methods is the scaffolding process. This process involves building ordered and oriented linear collections of contigs (continuous overlapping sequence reads) called scaffolds and relies on the use of mate pair data. A mate pair is a set of two reads that are sequenced from the ends of a single fragment of DNA, and therefore have opposite mutual orientations. When two reads of a mate pair are placed into two different contigs, one can infer the mutual orientation of these contigs. While several orientation algorithms exist as part of assembly programs, all encounter challenges while solving the orientation problem due to errors from mis-assemblies in contigs or errors in read placements. In this paper we present an algorithm based on hierarchical clustering that independently solves the orientation problem and is robust to errors. We show that our algorithm can correctly solve the orientation problem for both faux (generated) assembly data and real assembly data for R. sphaeroides bacteria. We demonstrate that our algorithm is stable to both changes in the initial orientations as well as noise in the data, making it advantageous compared to traditional approaches.
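
To make the orientation problem concrete, the sketch below states it as a graph problem: contigs are nodes, and each mate-pair link says whether two contigs should have the same or the opposite orientation. This is a naive breadth-first propagation baseline, not the hierarchical clustering heuristic described in the abstract. The function name, the `(contig, contig, same)` link format, and the +1/-1 orientation coding are all assumptions for this example. Links that contradict the propagated orientations are reported as conflicts, which is where errors in read placement show up.

```python
from collections import defaultdict, deque


def orient_contigs(contigs, links):
    """Assign +1/-1 orientations by propagating mate-pair constraints.

    contigs: iterable of contig names (every linked contig must be listed).
    links: (contig_u, contig_v, same) tuples; same=True means the mate pairs
           suggest both contigs share an orientation (hypothetical format).
    Returns (orientation dict, list of links violated by that assignment).
    """
    adjacency = defaultdict(list)
    for u, v, same in links:
        adjacency[u].append((v, same))
        adjacency[v].append((u, same))

    orientation = {}
    for start in contigs:
        if start in orientation:
            continue
        orientation[start] = 1            # arbitrary choice per connected component
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, same in adjacency[u]:
                if v not in orientation:
                    # First link to reach v decides its orientation.
                    orientation[v] = orientation[u] if same else -orientation[u]
                    queue.append(v)

    conflicts = [(u, v, same) for u, v, same in links
                 if (orientation[u] == orientation[v]) != same]
    return orientation, conflicts


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4"]
    links = [("c1", "c2", True), ("c2", "c3", False), ("c1", "c3", False)]
    print(orient_contigs(contigs, links))
    # One erroneous link makes the constraints inconsistent:
    print(orient_contigs(contigs, links + [("c2", "c3", True)]))
```

Because the first link to reach a contig fixes its orientation, a single bad link can flip everything downstream of it. That sensitivity to errors is the weakness a more robust method has to address.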