Agalma: an automated phylogenomics workflow

Casey W. Dunn, Mark Howison, Felipe Zapata
(Submitted on 24 Jul 2013)

In the past decade, transcriptome data have become an important component of many phylogenetic studies. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies. We present Agalma, an automated tool that conducts phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, and enables rich HTML reports for all stages of the analysis. Agalma includes a test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at this https URL. Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended.

An Arrow-type result for inferring a species tree from gene trees

Mike Steel
(Submitted on 19 Jul 2013)

The reconstruction of a central tendency ‘species tree’ from a large number of conflicting gene trees is a central problem in systematic biology. Moreover, it becomes particularly problematic when taxon coverage is patchy, so that not all taxa are present in every gene tree. Here, we list four desirable properties that a method for estimating a species tree from gene trees should have. We show that while these can be achieved when taxon coverage is complete (by the Adams consensus method), they cannot all be satisfied in the more general setting of partial taxon coverage.

Guidelines for the design of evolve and resequencing studies

Robert Kofler, Christian Schlötterer
(Submitted on 18 Jul 2013)

Standing genetic variation provides a rich reservoir of potentially useful mutations facilitating the adaptation to novel environments. Experimental evolution studies have demonstrated that rapid and strong phenotypic responses to selection can also be obtained in the laboratory. When combined with Next Generation Sequencing technology, these experiments promise to identify the individual loci contributing to adaptation. Nevertheless, until now, very little has been known about the design of such evolve and resequencing (E&R) studies. Here, we use forward simulations of entire genomes to evaluate different experimental designs that aim to maximize the power to detect selected variants. We show that low linkage disequilibrium in the starting population, population size, duration of the experiment and the number of replicates are the key factors in determining the power and accuracy of E&R studies. Furthermore, replication of E&R is more important for detecting the targets of selection than increasing the population size. Using an optimized design, beneficial loci with a selective advantage as low as s=0.005 can be identified at the nucleotide level. Even when a large number of loci are selected simultaneously, up to 56% can be reliably detected without incurring large numbers of false positives. Our computer simulations suggest that, with an adequate experimental design, E&R studies are a powerful tool to identify adaptive mutations from standing genetic variation and thereby provide an excellent means to analyze the trajectories of selected alleles in evolving populations.
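To give a feel for how weak a selective advantage of s=0.005 is, the following is a minimal sketch of the expected frequency trajectory of a beneficial allele. It is not the authors' method: it assumes a deterministic haploid model with no drift, no linkage and no replication, whereas the abstract describes forward simulations of entire genomes. The function name and the starting frequency are hypothetical choices made for illustration.

```python
def allele_trajectory(p0, s, generations):
    """Deterministic frequency of a beneficial allele over time.

    Assumption (not from the abstract): haploid selection in an
    infinite population, so each generation p -> p(1+s) / (1 + s*p).
    """
    freqs = [p0]
    p = p0
    for _ in range(generations):
        p = p * (1 + s) / (1 + s * p)
        freqs.append(p)
    return freqs


# Hypothetical example: an allele starting at 10% under s = 0.005.
trajectory = allele_trajectory(p0=0.1, s=0.005, generations=60)
print(f"start: {trajectory[0]:.4f}, after 60 generations: {trajectory[-1]:.4f}")
```

Under these simplifying assumptions the odds p/(1-p) grow by a factor of (1+s) per generation, so the change over a few dozen generations is small, which is consistent with the abstract's emphasis on replication and experiment duration.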

Computational aspects of DNA mixture analysis

Therese Graversen, Steffen Lauritzen
(Submitted on 18 Jul 2013)

Statistical analysis of DNA mixtures is known to pose computational challenges due to the enormous state space of possible DNA profiles. We propose a Bayesian network representation for genotypes, allowing computations to be performed locally involving only a few alleles at each step. In addition, we describe a general method for computing the expectation of a product of discrete random variables using auxiliary variables and probability propagation in a Bayesian network, which in combination with the genotype network allows efficient computation of the likelihood function and various other quantities relevant to the inference. Lastly, we introduce a set of diagnostic tools for assessing the adequacy of the model for describing a particular dataset.

Integrating sequencing datasets to form highly confident SNP and indel genotype calls for a whole human genome

Justin M. Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, Marc Salit
(Submitted on 17 Jul 2013)

Clinical adoption of human genome sequencing requires methods with known accuracy of genotype calls at millions or billions of positions across a genome. Previous work showing discordance amongst sequencing methods and algorithms has made clear the need for a highly accurate set of genotypes across a whole genome that could be used as a benchmark. We present methods we used to make highly confident SNP, indel, and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. To minimize bias towards any sequencing method, we integrate 9 whole genome and 3 exome datasets from 5 different sequencing platforms (Illumina, Complete Genomics, SOLiD, 454, and Ion Torrent), 7 mappers, and 3 variant callers. The resulting genotype calls are highly sensitive and specific, and allow performance assessment of more difficult variants than typically investigated using microarrays as a benchmark. Regions for which no confident genotype call could be made are identified as uncertain, and classified into different reasons for uncertainty (e.g. low coverage, mapping/alignment bias, etc.). As a community resource, we have integrated our highly confident genotype calls into the GCAT website for interactive assessment of false positive and negative rates of different datasets and bioinformatics methods using our highly confident calls. Application of the concepts of our integration process may be interesting beyond whole genome sequencing, for other measurement problems with large datasets from multiple methods, where none of the methods is a Reference Method that can be relied upon as highly sensitive and specific.
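The kind of assessment that a benchmark call set makes possible can be sketched as follows. This is an illustrative assumption, not the GCAT implementation or the authors' integration method: calls are modeled as a mapping from position to genotype string, comparison is restricted to positions where the benchmark is confident, and a genotype mismatch is counted as both a false positive and a false negative. All names are hypothetical.

```python
def benchmark_rates(calls, truth, confident_positions):
    """Compare variant calls against a benchmark within confident regions.

    calls, truth: dict mapping position -> genotype string for
        non-reference sites (hypothetical, simplified representation).
    confident_positions: set of positions where the benchmark genotype
        (variant or homozygous reference) is considered reliable.
    """
    tp = fp = fn = 0
    for pos, genotype in calls.items():
        if pos not in confident_positions:
            continue  # uncertain region: excluded from the assessment
        if truth.get(pos) == genotype:
            tp += 1
        else:
            fp += 1  # benchmark is reference here, or genotype differs
    for pos, genotype in truth.items():
        if pos in confident_positions and calls.get(pos) != genotype:
            fn += 1  # benchmark variant missed or called incorrectly
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Hypothetical toy data: position 500 lies outside the confident set.
truth = {100: "0/1", 200: "1/1", 300: "0/1"}
confident = {100, 200, 300, 400}
calls = {100: "0/1", 200: "0/1", 400: "0/1", 500: "1/1"}
print(benchmark_rates(calls, truth, confident))
```

Restricting the comparison to confident positions mirrors the abstract's point that regions without a confident genotype call are flagged as uncertain rather than scored.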

A model-based approach for identifying signatures of balancing selection in genetic data

Michael DeGiorgio, Kirk E. Lohmueller, Rasmus Nielsen
(Submitted on 16 Jul 2013)

While much effort has focused on detecting positive and negative directional selection in the human genome, relatively little work has been devoted to balancing selection. This lack of attention is likely due to the paucity of sophisticated methods for identifying sites under balancing selection. Here we develop two composite likelihood ratio tests for detecting balancing selection. Using simulations, we show that these methods outperform competing methods under a variety of assumptions and demographic models. We apply the new methods to whole-genome human data, and find a number of previously-identified loci with strong evidence of balancing selection, including several HLA genes. Additionally, we find evidence for many novel candidates, the strongest of which is FANK1, an imprinted gene that suppresses apoptosis, is expressed during meiosis in males, and displays marginal signs of segregation distortion. We hypothesize that balancing selection acts on this locus to stabilize the segregation distortion and negative fitness effects of the distorter allele. Thus, our methods are able to reproduce many previously-hypothesized signals of balancing selection, as well as discover novel interesting candidates.

Synteny in Bacterial Genomes: Inference, Organization and Evolution

Ivan Junier, Olivier Rivoire
(Submitted on 16 Jul 2013)

Genes are not located randomly along genomes. Synteny, the conservation of their relative positions in genomes of different species, reflects fundamental constraints on natural evolution. We present approaches to infer pairs of co-localized genes from multiple genomes, describe their organization, and study their evolutionary history. In bacterial genomes, we thus identify synteny units, or “syntons”, which are clusters of proximal genes that encompass and extend operons. The size distribution of these syntons divides them into large syntons, which correspond to fundamental macro-molecular complexes of bacteria, and smaller ones, which display a remarkable exponential distribution of sizes. This distribution is “universal” in two respects: it holds for vastly different genomes, and for functionally distinct genes. Similar statistical laws have been reported previously in studies of bacterial genomes, and generally attributed to purifying selection or neutral processes. Here, we perform a new analysis based on the concept of parsimony, and find that the prevailing evolutionary mechanism behind the formation of small syntons is a selective process of gene aggregation. Altogether, our results imply a common evolutionary process that selectively shapes the organization and diversity of bacterial genomes.

Migration-selection balance at multiple loci and selection on dominance and recombination

Alexey Yanchukov, Stephen R. Proulx
(Submitted on 15 Jul 2013)

A steady influx of a single deleterious multilocus genotype will impose genetic load on the resident population and leave multiple descendants carrying various numbers of the foreign alleles. Provided that the foreign types are rare at equilibrium, and that all immigrant genes will eventually be eliminated by selection, the population structure can be inferred explicitly from the deterministic branching process taking place within a single immigrant lineage. Unless the migration and recombination rates were high, this simple method was a very close approximation to the simulated migration-selection balance with all possible multilocus genotypes considered.

Genomic identification of founding haplotypes reveals the history of the selfing species Capsella rubella

Yaniv Brandvain, Tanja Slotte, Khaled Hazzouri, Stephen Wright, Graham Coop
(Submitted on 15 Jul 2013)

The shift from outcrossing to self-fertilization is among the most common transitions in plants. Until recently, however, a genome-wide view of this transition has been obscured by a dearth of appropriate data and the lack of appropriate population genomic methods to interpret such data. Here, we present novel analyses detailing the origin of the selfing species, Capsella rubella, which recently split from its outcrossing sister, Capsella grandiflora. Due to the recency of the split, most variation within C. rubella is found within C. grandiflora. We can therefore identify genomic regions where two C. rubella individuals have inherited the same or different segments of ancestral diversity (i.e. founding haplotypes) present in C. rubella’s founder(s). Based on this analysis, we show that C. rubella was founded by multiple individuals drawn from a diverse ancestral population closely related to extant C. grandiflora, that drift and selection have rapidly homogenized most of this ancestral variation since C. rubella’s founding, and that little novel variation has accumulated within this time. Despite the extensive loss of ancestral variation, the approximately 25% of the genome for which two C. rubella individuals have inherited different founding haplotypes makes up roughly 90% of the genetic variation between them. To extend these findings, we develop a coalescent model that utilizes the inferred frequency of founding haplotypes and variation within founding haplotypes to estimate that C. rubella was founded by a potentially large number of individuals 50-100 kya, and has subsequently experienced a 20X reduction in its effective population size. As population genomic data from an increasing number of outcrossing/selfing pairs are generated, analyses like the one presented here will facilitate a fine-scaled view of the evolutionary and demographic impact of the transition to self-fertilization.

QuorUM: an error corrector for Illumina reads

Guillaume Marçais, James A. Yorke, Aleksey Zimin
(Submitted on 12 Jul 2013)

Motivation: Illumina sequencing data can provide high coverage of a genome by relatively short (100 bp to 150 bp) reads at a low cost. Our goal is to produce trimmed and error-corrected reads to improve genome assemblies. Our error correction procedure aims at producing a set of error-corrected reads (1) minimizing the number of distinct false k-mers, i.e. that are not present in the genome, in the set of reads and (2) maximizing the number that are true, i.e. that are present in the genome. Because coverage of a genome by Illumina reads varies greatly from point to point, we cannot simply eliminate k-mers that occur rarely.
Results: Our software, called QuorUM, provides reasonably accurate correction and is suitable for large data sets (1 billion bases checked and corrected per day per core).
Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at this http URL
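The notions of true and false k-mers in the Motivation paragraph can be illustrated with a small sketch. This is not QuorUM's algorithm: it only counts k-mers in a set of reads and, assuming a known reference sequence is available for the toy example, splits them into those present in the genome and those that are not. Reverse complements are ignored for simplicity, and all names and sequences are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_true_false(counts, genome, k):
    """Split the distinct k-mers seen in reads into true and false sets.

    A k-mer is 'true' if it occurs in the genome and 'false' otherwise.
    Assumption: forward strand only, no reverse complements.
    """
    genome_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}
    seen = set(counts)
    return seen & genome_kmers, seen - genome_kmers


# Hypothetical toy data: the second read carries one sequencing error.
genome = "ACGTACGGA"
reads = ["ACGTA", "ACGAA", "TACGG"]
counts = count_kmers(reads, k=3)
true_kmers, false_kmers = split_true_false(counts, genome, k=3)
print(sorted(true_kmers), sorted(false_kmers))
```

In this toy example several true k-mers occur only once, exactly as often as the false ones, which illustrates why a simple count threshold is not enough when coverage is uneven.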