Synteny in Bacterial Genomes: Inference, Organization and Evolution
Ivan Junier, Olivier Rivoire
(Submitted on 16 Jul 2013)

Genes are not located randomly along genomes. Synteny, the conservation of their relative positions in genomes of different species, reflects fundamental constraints on natural evolution. We present approaches to infer pairs of co-localized genes from multiple genomes, describe their organization, and study their evolutionary history. In bacterial genomes, we thus identify synteny units, or “syntons”, which are clusters of proximal genes that encompass and extend operons. The size distribution of these syntons divides them into large syntons, which correspond to fundamental macro-molecular complexes of bacteria, and smaller ones, which display a remarkable exponential distribution of sizes. This distribution is “universal” in two respects: it holds for vastly different genomes, and for functionally distinct genes. Similar statistical laws have been reported previously in studies of bacterial genomes, and generally attributed to purifying selection or neutral processes. Here, we perform a new analysis based on the concept of parsimony, and find that the prevailing evolutionary mechanism behind the formation of small syntons is a selective process of gene aggregation. Altogether, our results imply a common evolutionary process that selectively shapes the organization and diversity of bacterial genomes.

Migration-selection balance at multiple loci and selection on dominance and recombination
Alexey Yanchukov, Stephen R. Proulx
(Submitted on 15 Jul 2013)

A steady influx of a single deleterious multilocus genotype will impose genetic load on the resident population and leave multiple descendants carrying various numbers of the foreign alleles. Provided that the foreign types are rare at equilibrium, and that all immigrant genes will eventually be eliminated by selection, the population structure can be inferred explicitly from the deterministic branching process taking place within a single immigrant lineage. Unless the migration and recombination rates are high, this simple method closely approximates the simulated migration-selection balance with all possible multilocus genotypes considered.

Genomic identification of founding haplotypes reveals the history of the selfing species Capsella rubella
Yaniv Brandvain, Tanja Slotte, Khaled Hazzouri, Stephen Wright, Graham Coop
(Submitted on 15 Jul 2013)

The shift from outcrossing to self-fertilization is among the most common transitions in plants. Until recently, however, a genome-wide view of this transition has been obscured by a dearth of appropriate data and the lack of appropriate population genomic methods to interpret such data. Here, we present novel analyses detailing the origin of the selfing species, Capsella rubella, which recently split from its outcrossing sister, Capsella grandiflora. Due to the recency of the split, most variation within C. rubella is found within C. grandiflora. We can therefore identify genomic regions where two C. rubella individuals have inherited the same or different segments of ancestral diversity (i.e. founding haplotypes) present in C. rubella’s founder(s). Based on this analysis, we show that C. rubella was founded by multiple individuals drawn from a diverse ancestral population closely related to extant C. grandiflora, that drift and selection have rapidly homogenized most of this ancestral variation since C. rubella’s founding, and that little novel variation has accumulated within this time. Despite the extensive loss of ancestral variation, the approximately 25% of the genome for which two C. rubella individuals have inherited different founding haplotypes makes up roughly 90% of the genetic variation between them. To extend these findings, we develop a coalescent model that utilizes the inferred frequency of founding haplotypes and variation within founding haplotypes to estimate that C. rubella was founded by a potentially large number of individuals 50-100 kya, and has subsequently experienced a 20X reduction in its effective population size. As population genomic data from an increasing number of outcrossing/selfing pairs are generated, analyses like the one presented here will facilitate a fine-scaled view of the evolutionary and demographic impact of the transition to self-fertilization.

QuorUM: an error corrector for Illumina reads
Guillaume Marçais, James A. Yorke, Aleksey Zimin
(Submitted on 12 Jul 2013)

Motivation: Illumina sequencing data can provide high coverage of a genome by relatively short (100 bp to 150 bp) reads at a low cost. Our goal is to produce trimmed and error-corrected reads to improve genome assemblies. Our error correction procedure aims at producing a set of error-corrected reads (1) minimizing the number of distinct false k-mers, i.e. that are not present in the genome, in the set of reads and (2) maximizing the number that are true, i.e. that are present in the genome. Because coverage of a genome by Illumina reads varies greatly from point to point, we cannot simply eliminate k-mers that occur rarely.
Results: Our software, called QuorUM, provides reasonably accurate correction and is suitable for large data sets (1 billion bases checked and corrected per day per core).
Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at this http URL
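
The two criteria in the Motivation paragraph (few distinct false k-mers, many distinct true k-mers) can be illustrated on toy data. The sketch below is not QuorUM's algorithm; it only counts k-mers and compares them with a known reference, which is an assumption made for illustration since a real reference is normally unavailable. The function names, the toy genome and the reads are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_false_and_true(reads, genome, k):
    """Split the distinct k-mers of the reads into false (absent from the
    genome) and true (present in the genome) sets."""
    genome_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    read_kmers = set(count_kmers(reads, k))
    return read_kmers - genome_kmers, read_kmers & genome_kmers


# Toy reference and reads; the last read carries one substitution error.
genome = "ACGTACGGTCA"
reads = ["ACGTACG", "CGTACGG", "GTACGGT", "ACGGTCA", "ACGTTCG"]

counts = count_kmers(reads, 4)
false_kmers, true_kmers = split_false_and_true(reads, genome, 4)

# k-mers seen only once: a mix of false k-mers and poorly covered true ones.
rare = {kmer for kmer, n in counts.items() if n == 1}
rare_but_true = rare & true_kmers
```

In this toy data set, `rare` contains the three false k-mers introduced by the erroneous read, but also two true k-mers from the thinly covered end of the genome. Discarding every rare k-mer would therefore remove genuine sequence, which is the difficulty the abstract points out.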

SInC: An accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data
Swetansu Pattnaik, Saurabh Gupta, Arjun A Rao, Binay Panda
(Submitted on 12 Jul 2013)

We report SInC (SNV, Indel and CNV) simulator and read generator, an open-source tool capable of simulating biological variants taking into account a platform-specific error model. SInC is capable of simulating and generating single- and paired-end reads with user-defined insert size with high efficiency compared to the other existing tools. SInC, due to its multi-threaded capability during read generation, has a low time footprint. SInC is currently optimised to work in limited infrastructure setup and can efficiently exploit the commonly used quad-core desktop architecture to simulate short sequence reads with deep coverage for large genomes. SInC can be downloaded from this https URL

kruX: Matrix-based non-parametric eQTL discovery
Jianlong Qi, Hassan Foroughi Asl, Johan Bjorkegren, Tom Michoel
(Submitted on 12 Jul 2013)

The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of marker-trait combinations one-by-one can become computationally prohibitive. We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to simultaneously calculate the Kruskal-Wallis test statistic for several millions of marker-trait combinations at once. KruX is more than 3,000 times faster than computing associations one-by-one on a typical human dataset.
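
The idea of obtaining many Kruskal-Wallis statistics from one matrix product can be sketched as follows. This is not kruX's implementation: it is a minimal illustration that assumes a single marker, no tied expression values (so no tie correction) and genotypes coded 0, 1, 2. All function names are hypothetical, and the plain-Python `matmul` stands in for the optimized linear-algebra routine a real tool would call. It uses the standard statistic H = 12 / (N(N+1)) * sum_g(R_g^2 / n_g) - 3(N+1), where R_g is the rank sum and n_g the size of genotype group g.

```python
def rank(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def matmul(a, b):
    """Plain-Python product of matrices a (m x n) and b (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def kruskal_wallis_all_traits(traits, genotypes, n_groups=3):
    """Kruskal-Wallis H for every trait against one marker.

    traits: list of expression vectors, one per trait (traits x samples).
    genotypes: genotype group (0 .. n_groups-1) of each sample.
    """
    n = len(genotypes)
    rank_matrix = [rank(t) for t in traits]            # traits x samples
    indicator = [[1 if g == k else 0 for k in range(n_groups)]
                 for g in genotypes]                   # samples x groups
    group_sizes = [sum(col) for col in zip(*indicator)]
    # One matrix product yields the rank sums of all traits at once.
    rank_sums = matmul(rank_matrix, indicator)         # traits x groups
    stats = []
    for sums in rank_sums:
        s = sum(r * r / size for r, size in zip(sums, group_sizes) if size > 0)
        stats.append(12.0 / (n * (n + 1)) * s - 3 * (n + 1))
    return stats


genotypes = [0, 0, 1, 1, 2, 2]
traits = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # increases with genotype
    [3.1, 5.2, 1.3, 6.4, 2.5, 4.6],   # no clear genotype effect
]
stats = kruskal_wallis_all_traits(traits, genotypes)
# stats is approximately [4.571, 0.286]
```

The per-trait loop at the end is cheap; the rank sums, which dominate the cost, come from a single product of the rank matrix with the genotype indicator matrix. Extending the indicator matrix with columns for further markers would cover several markers in the same product.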

Cloudbreak: Accurate and Scalable Genomic Structural Variation Detection in the Cloud with MapReduce
Christopher W. Whelan, Jeffrey Tyner, Alberto L’Abbate, Clelia Tiziana Storlazzi, Lucia Carbone, Kemal Sönmez
(Submitted on 9 Jul 2013)

The detection of genomic structural variations (SV) remains a difficult challenge in analyzing sequencing data, and the growing size and number of sequenced genomes have rendered SV detection a bona fide big data problem. MapReduce is a proven, scalable solution for distributed computing on huge data sets. We describe a conceptual framework for SV detection algorithms in MapReduce based on computing local genomic features, and use it to develop a deletion and insertion detection algorithm, Cloudbreak. On simulated and real data sets, Cloudbreak achieves accuracy improvements over popular SV detection algorithms, and genotypes variants from diploid samples. It provides dramatically shorter runtimes and the ability to scale to big data volumes on large compute clusters. Cloudbreak includes tools to set up and configure MapReduce (Hadoop) clusters on cloud services, enabling on-demand cluster computing. Our implementation and source code are available at this http URL

Inference of chromosomal inversion dynamics from Pool-Seq data in natural and laboratory populations of Drosophila melanogaster
Martin Kapun, Hester van Schalkwyk, Bryant McAllister, Thomas Flatt, Christian Schlötterer
(Submitted on 9 Jul 2013)

Sequencing of pools of individuals (Pool-Seq) represents a reliable and cost-effective approach for estimating genome-wide SNP and transposable element insertion frequencies. However, Pool-Seq does not provide direct information on haplotypes, so that, for example, obtaining inversion frequencies has not been possible until now. Here, we have developed a new set of diagnostic marker SNPs for 7 cosmopolitan inversions in Drosophila melanogaster that can be used to infer inversion frequencies from Pool-Seq data. We applied our novel marker set to Pool-Seq data from an experimental evolution study and from North American and Australian latitudinal clines. In the experimental evolution data, we find evidence that positive selection has driven the frequencies of In(3R)C and In(3R)Mo to increase over time. In the clinal data, we confirm the existence of frequency clines for In(2L)t, In(3L)P and In(3R)Payne in both North America and Australia and detect a previously unknown latitudinal cline for In(3R)Mo in North America. The inversion markers developed here provide a versatile and robust tool for characterizing inversion frequencies and their dynamics in Pool-Seq data from diverse D. melanogaster populations.
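
One way diagnostic marker SNPs could yield an inversion frequency from pooled reads is sketched below. The abstract does not state the estimator, so this is an assumption: the frequency is taken as the unweighted mean, over marker SNPs, of the fraction of reads carrying the inversion-specific allele. The function name, the `min_coverage` parameter and the read counts are hypothetical.

```python
def inversion_frequency(marker_counts, min_coverage=1):
    """Estimate an inversion frequency from diagnostic marker SNPs.

    marker_counts: one (inversion_allele_reads, total_reads) pair per marker.
    Markers with fewer than min_coverage reads are skipped.
    """
    freqs = [alt / total for alt, total in marker_counts
             if total >= min_coverage]
    if not freqs:
        raise ValueError("no marker SNP has sufficient coverage")
    # Unweighted mean of the per-marker allele frequencies (an assumption).
    return sum(freqs) / len(freqs)


# Hypothetical read counts at four markers; the last one is uncovered.
counts = [(12, 60), (9, 50), (15, 75), (0, 0)]
estimate = inversion_frequency(counts)
```

Averaging over several markers dampens the sampling noise of any single SNP in the pool. A coverage-weighted mean would be an equally plausible choice.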

The Changing Geometry of a Fitness Landscape Along an Adaptive Walk
Devin Greene, Kristina Crona
(Submitted on 7 Jul 2013)

It has recently been noted that the relative prevalence of the various kinds of epistasis varies along an adaptive walk. This has been explained as a result of mean regression in NK model fitness landscapes. Here we show that this phenomenon occurs quite generally in fitness landscapes. We propose a simple and general explanation for this phenomenon, confirming the role of mean regression. We provide support for this explanation with simulations, and discuss the empirical relevance of our findings.

RNA secondary structure prediction from multi-aligned sequences
Michiaki Hamada
(Submitted on 8 Jul 2013)

It has been well accepted that the RNA secondary structures of most functional non-coding RNAs (ncRNAs) are closely related to their functions and are conserved during evolution. Hence, prediction of conserved secondary structures from evolutionarily related sequences is one important task in RNA bioinformatics; the methods are useful not only to further functional analyses of ncRNAs but also to improve the accuracy of secondary structure predictions and to find novel functional RNAs from the genome. In this review, I focus on common secondary structure prediction from a given aligned RNA sequence, in which one secondary structure whose length is equal to that of the input alignment is predicted. I systematically review and classify existing tools and algorithms for the problem, by utilizing the information employed in the tools and by adopting a unified viewpoint based on maximum expected gain (MEG) estimators. I believe that this classification will allow a deeper understanding of each tool and provide users with useful information for selecting tools for common secondary structure predictions.