Sibelia: A scalable and comprehensive synteny block generation tool for closely related microbial genomes

Sibelia: A scalable and comprehensive synteny block generation tool for closely related microbial genomes
Ilya Minkin, Anand Patel, Mikhail Kolmogorov, Nikolay Vyahhi, Son Pham
(Submitted on 30 Jul 2013)

Comparing strains within the same microbial species has proven effective in the identification of genes and genomic regions responsible for virulence, as well as in the diagnosis and treatment of infectious diseases. In this paper, we present Sibelia, a tool for finding synteny blocks in multiple closely related microbial genomes using iterative de Bruijn graphs. Unlike most other tools, Sibelia can find synteny blocks that are repeated within genomes as well as blocks shared by multiple genomes. It represents synteny blocks in a hierarchy structure with multiple layers, each of which representing a different granularity level. Sibelia has been designed to work efficiently with a large number of microbial genomes; it finds synteny blocks in 31 S. aureus genomes within 31 minutes and in 59 E.coli genomes within 107 minutes on a standard desktop. Sibelia software is distributed under the GNU GPL v2 license and is available at: this https URL Sibelia’s web-server is available at: this http URL

Exploring Genome Characteristics and Sequence Quality Without a Reference

Exploring Genome Characteristics and Sequence Quality Without a Reference
Jared T. Simpson
(Submitted on 30 Jul 2013)

The de novo assembly of large, complex genomes is a significant challenge with currently available DNA sequencing technology. While many de novo assembly software packages are available, comparatively little attention has been paid to assisting the user with the assembly. This paper addresses the practical aspects of de novo assembly by introducing new ways to perform quality assessment on a collection of DNA sequence reads. The software implementation calculates per-base error rates, paired-end fragment size histograms and coverage metrics in the absence of a reference genome. Additionally, the software will estimate characteristics of the sequenced genome, such as repeat content and heterozygosity, that are key determinants of assembly difficulty. The software described is freely available and open source under the GNU Public License.

Agalma: an automated phylogenomics workflow

Agalma: an automated phylogenomics workflow
Casey W. Dunn, Mark Howison, Felipe Zapata
(Submitted on 24 Jul 2013)

In the past decade, transcriptome data have become an important component of many phylogenetic studies. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies. We present Agalma, an automated tool that conducts phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, and enables rich HTML reports for all stages of the analysis. Agalma includes a test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at this https URL. Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended.

Guidelines for the design of evolve and resequencing studies

Guidelines for the design of evolve and resequencing studies
Robert Kofler, Christian Schlötterer
(Submitted on 18 Jul 2013)

Standing genetic variation provides a rich reservoir of potentially useful mutations facilitating the adaptation to novel environments. Experimental evolution studies have demonstrated that rapid and strong phenotypic responses to selection can also be obtained in the laboratory. When combined with the Next Generation Sequencing technology, these experiments promise to identify the individual loci contributing to adaption. Nevertheless, until now, very little is known about the design of such evolve and resequencing (E&R) studies. Here, we use forward simulations of entire genomes to evaluate different experimental designs that aim to maximize the power to detect selected variants. We show that low linkage disequilibrium in the starting population, population size, duration of the experiment and the number of replicates are the key factors in determining the power and accuracy of E&R studies. Furthermore, replication of E&R is more important for detecting the targets of selection than increasing the population size. Using an optimized design beneficial loci with a selective advantage as low as s=0.005 can be identified at the nucleotide level. Even when a large number of loci are selected simultaneously, up to 56% can be reliably detected without incurring large numbers of false positives. Our computer simulations suggest that, with an adequate experimental design, E&R studies are a powerful tool to identify adaptive mutations from standing genetic variation and thereby provide an excellent means to analyze the trajectories of selected alleles in evolving populations

Computational aspects of DNA mixture analysis

Computational aspects of DNA mixture analysis
Therese Graversen, Steffen Lauritzen
(Submitted on 18 Jul 2013)

Statistical analysis of DNA mixtures is known to pose computational challenges due to the enormous state space of possible DNA profiles. We propose a Bayesian network representation for genotypes, allowing computations to be performed locally involving only a few alleles at each step. In addition, we describe a general method for computing the expectation of a product of discrete random variables using auxiliary variables and probability propagation in a Bayesian network, which in combination with the genotype network allows efficient computation of the likelihood function and various other quantities relevant to the inference. Lastly, we introduce a set of diagnostic tools for assessing the adequacy of the model for describing a particular dataset.

A model-based approach for identifying signatures of balancing selection in genetic data

A model-based approach for identifying signatures of balancing selection in genetic data
Michael DeGiorgio, Kirk E. Lohmueller, Rasmus Nielsen
(Submitted on 16 Jul 2013)

While much effort has focused on detecting positive and negative directional selection in the human genome, relatively little work has been devoted to balancing selection. This lack of attention is likely due to the paucity of sophisticated methods for identifying sites under balancing selection. Here we develop two composite likelihood ratio tests for detecting balancing selection. Using simulations, we show that these methods outperform competing methods under a variety of assumptions and demographic models. We apply the new methods to whole-genome human data, and find a number of previously-identified loci with strong evidence of balancing selection, including several HLA genes. Additionally, we find evidence for many novel candidates, the strongest of which is FANK1, an imprinted gene that suppresses apoptosis, is expressed during meiosis in males, and displays marginal signs of segregation distortion. We hypothesize that balancing selection acts on this locus to stabilize the segregation distortion and negative fitness effects of the distorter allele. Thus, our methods are able to reproduce many previously-hypothesized signals of balancing selection, as well as discover novel interesting candidates.

Synteny in Bacterial Genomes: Inference, Organization and Evolution

Synteny in Bacterial Genomes: Inference, Organization and Evolution
Ivan Junier, Olivier Rivoire
(Submitted on 16 Jul 2013)

Genes are not located randomly along genomes. Synteny, the conservation of their relative positions in genomes of different species, reflects fundamental constraints on natural evolution. We present approaches to infer pairs of co-localized genes from multiple genomes, describe their organization, and study their evolutionary history. In bacterial genomes, we thus identify synteny units, or “syntons”, which are clusters of proximal genes that encompass and extend operons. The size distribution of these syntons divide them into large syntons, which correspond to fundamental macro-molecular complexes of bacteria, and smaller ones, which display a remarkable exponential distribution of sizes. This distribution is “universal” in two respects: it holds for vastly different genomes, and for functionally distinct genes. Similar statistical laws have been reported previously in studies of bacterial genomes, and generally attributed to purifying selection or neutral processes. Here, we perform a new analysis based on the concept of parsimony, and find that the prevailing evolutionary mechanism behind the formation of small syntons is a selective process of gene aggregation. Altogether, our results imply a common evolutionary process that selectively shapes the organization and diversity of bacterial genomes.

Migration-selection balance at multiple loci and selection on dominance and recombination

Migration-selection balance at multiple loci and selection on dominance and recombination
Alexey Yanchukov, Stephen R. Proulx
(Submitted on 15 Jul 2013)

A steady influx of a single deleterious multilocus genotype will impose genetic load on the resident population and leave multiple descendants carrying various numbers of the foreign alleles. Provided that the foreign types are rare at equilibrium, and that all immigrant genes will eventually be eliminated by selection, the population structure can be inferred explicitly from the deterministic branching process taking place within a single immigrant lineage. Unless the migration and recombination rates were high, this simple method was a very close approximation to the simulated migration-selection balance with all possible multilocus genotypes considered.

Genomic identification of founding haplotypes reveals the history of the selfing species Capsella rubella

Genomic identification of founding haplotypes reveals the history of the selfing species Capsella rubella
Yaniv Brandvain, Tanja Slotte, Khaled Hazzouri, Stephen Wright, Graham Coop
(Submitted on 15 Jul 2013)

The shift from outcrossing to self-fertilization is among the most common transitions in plants. Until recently, however, a genome-wide view of this transition has been obscured by a dearth of appropriate data and the lack of appropriate population genomic methods to interpret such data. Here, we present novel analyses detailing the origin of the selfing species, Capsella rubella, which recently split from its outcrossing sister, Capsella grandiflora. Due to the recency of the split, most variation within C. rubella is found within C. grandiflora. We can therefore identify genomic regions where two C. rubella individuals have inherited the same or different segments of ancestral diversity (i.e. founding haplotypes) present in C. rubella’s founder(s). Based on this analysis, we show that C. rubella was founded by multiple individuals drawn from a diverse ancestral population closely related to extant C. grandiflora, that drift and selection have rapidly homogenized most of this ancestral variation since C. rubella’s founding, and that little novel variation has accumulated within this time. Despite the extensive loss of ancestral variation, the approximately 25% of the genome for which two C. rubella individuals have inherited different founding haplotypes makes up roughly 90% of the genetic variation between them. To extend these findings, we develop a coalescent model that utilizes the inferred frequency of founding haplotypes and variation within founding haplotypes to estimate that C. rubella was founded by a potentially large number of individuals 50-100 kya, and has subsequently experienced a 20X reduction in its effective population size. As population genomic data from an increasing number of outcrossing/selfing pairs are generated, analyses like this here will facilitate a fine-scaled view of the evolutionary and demographic impact of the transition to self-fertilization.

QuorUM: an error corrector for Illumina reads

QuorUM: an error corrector for Illumina reads
Guillaume Marçais, James A. Yorke, Aleksey Zimin
(Submitted on 12 Jul 2013)

Motivation: Illumina Sequencing data can provide high coverage of a genome by relatively short (100 bp150 bp) reads at a low cost. Our goal is to produce trimmed and error-corrected reads to improve genome assemblies. Our error correction procedure aims at producing a set of error-corrected reads (1) minimizing the number of distinct false k-mers, i.e. that are not present in the genome, in the set of reads and (2) maximizing the number that are true, i.e. that are present in the genome. Because coverage of a genome by Illumina reads varies greatly from point to point, we cannot simply eliminate k-mers that occur rarely.
Results: Our software, called QuorUM, provides reasonably accurate correction and is suitable for large data sets (1 billion bases checked and corrected per day per core).
Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at this http URL