Patching holes in the Chlamydomonas genome

Posted on October 29, 2015 by schraib

Frej Tulin, Frederick R. Cross

bioRxiv doi: http://dx.doi.org/10.1101/030163

The Chlamydomonas genome has been sequenced, assembled and annotated to produce a rich resource for genetics and molecular biology in this well-studied model organism. However, the current reference genome contains ~1000 blocks of unknown sequence (‘N-islands’), which are frequently placed in introns of annotated gene models. We developed a strategy, using careful bioinformatics analysis of short-sequence cDNA and genomic DNA reads, to search for previously unknown exons hidden within such blocks, and to determine the sequence and exon/intron boundaries of such exons. These methods are based on assembly and alignment completely independent of prior reference assembly or reference annotation. Our evidence indicates that about one-quarter of the annotated intronic N-islands actually contain hidden exons. For most of these, our algorithm recovers full exonic sequence with associated splice junctions and exon-adjacent intron sequence, which can be joined to the reference genome assembly and annotated transcript models. These new exons represent de novo sequence generally present nowhere in the assembled genome, and the added sequence can be shown in many cases to greatly improve evolutionary conservation of the predicted encoded peptides. At the same time, our results confirm the purely intronic status of a substantial majority of N-islands annotated as intronic in the reference annotated genome, increasing confidence in this valuable resource.

Decomposing the site frequency spectrum: the impact of tree topology on neutrality tests

Posted on October 29, 2015 by schraib

Alice Ledda, Guillaume Achaz, Thomas Wiehe, Luca Ferretti

We investigate the dependence of the site frequency spectrum (SFS) on the topological structure of coalescent trees. We show that basic population genetic statistics – for instance estimators of theta or neutrality tests such as Tajima’s D – can be decomposed into components of waiting times between coalescent events and of tree topology. Our results clarify the relative impact of the two components on these statistics. We provide a rigorous interpretation of positive or negative values of neutrality tests in terms of the underlying tree shape. In particular, we show that values of Tajima’s D and Fay and Wu’s H depend in a direct way on a measure of tree balance which is mostly determined by the root balance of the tree. We also compute the maximum and minimum values for neutrality tests as a function of sample size.
Focusing on the standard coalescent model of neutral evolution, we discuss how waiting times between coalescent events are related to derived allele frequencies and thereby to the frequency spectrum. Finally, we show how tree balance affects the frequency spectrum. In particular, we derive the complete SFS conditioned on the root imbalance. We show that the conditional spectrum is peaked at frequencies corresponding to the root imbalance and strongly biased towards rare alleles.
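
As an illustration of the statistics named in this abstract, the following Python sketch computes Watterson's estimator, nucleotide diversity, and Tajima's D from an unfolded SFS using the standard textbook formulas (Tajima 1989). It is not the decomposition derived in the preprint; the function name `tajimas_d` and the example spectra are assumptions made purely for illustration.

```python
from math import sqrt


def tajimas_d(sfs):
    """Return (theta_W, pi, D) for an unfolded site frequency spectrum.

    sfs[i - 1] is the number of segregating sites at which the derived
    allele appears in exactly i of the n sampled sequences, i = 1..n-1.
    """
    n = len(sfs) + 1          # sample size implied by the spectrum
    s = sum(sfs)              # number of segregating sites
    if n < 4 or s == 0:
        raise ValueError("need at least 4 sequences and 1 segregating site")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))

    # Watterson's estimator of theta.
    theta_w = s / a1

    # Mean pairwise differences: a site with derived count i contributes
    # i * (n - i) differing pairs out of n * (n - 1) / 2 pairs.
    pairs = n * (n - 1) / 2.0
    pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs

    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    d = (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return theta_w, pi, d


# Hypothetical spectra for n = 7 sequences.
neutral_like = [60, 30, 20, 15, 12, 10]      # proportional to 1/i
singleton_excess = [100, 0, 0, 0, 0, 0]      # only rare variants
intermediate_excess = [0, 0, 50, 50, 0, 0]   # only mid-frequency variants
```

A spectrum proportional to 1/i, the neutral expectation, gives a D close to zero. An excess of singletons pulls D negative, and an excess of intermediate-frequency variants pushes it positive.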

On the Balance of Unrooted Trees

Posted on October 29, 2015 by schraib

Mareike Fischer, Volkmar Liebscher

We solve a class of optimization problems for (phylogenetic) X-trees or their shapes. These problems have recently appeared in different contexts, e.g. in the context of the impact of tree shapes on the size of TBR neighborhoods, but so far these problems have not been characterized and solved in a systematic way. In this work we generalize the concept and also present several applications. Moreover, our results give rise to a nice notion of balance for trees. Unsurprisingly, so-called caterpillars are the most unbalanced tree shapes, but it turns out that balanced tree shapes cannot be described so easily as they need not even be unique.

Estimation of the True Evolutionary Distance under the Fragile Breakage Model

Posted on October 29, 2015 by schraib

Nikita Alexeev, Max A. Alekseyev

The ability to estimate the evolutionary distance between extant genomes plays a crucial role in many phylogenomic studies. Often such estimation is based on the parsimony assumption, implying that the distance between two genomes can be estimated as the minimal number of genome rearrangements required to transform one genome into the other. However, in reality the parsimony assumption may not always hold, emphasizing the need for estimation that does not rely on the minimal number of genome rearrangements. While a method for such estimation exists, it assumes that rearrangements are equally likely to break genomes at any position in the course of evolution. This assumption, known as the random breakage model, has recently been refuted in favor of the more rigorous fragile breakage model, which postulates that only certain “fragile” genomic regions are prone to rearrangements. We propose a new method for estimating the evolutionary distance between two genomes with high accuracy under the fragile breakage model.
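
To make the parsimony distance mentioned in this abstract concrete, here is a minimal Python sketch that counts the minimal number of rearrangements between two circular, single-chromosome genomes. It assumes the double-cut-and-join (DCJ) model, under which the distance equals the number of genes minus the number of cycles in the breakpoint graph. The abstract does not name a rearrangement model, so this choice, along with the function names, is an assumption, and the sketch is not the estimator proposed in the preprint.

```python
def adjacencies(genome):
    """Pair up gene extremities that are adjacent in a circular signed genome.

    Each gene g has a tail (g, "t") and a head (g, "h"). A positive gene is
    read tail-to-head, a negative gene head-to-tail.
    """
    def ends(g):
        name = abs(g)
        if g > 0:
            return (name, "t"), (name, "h")   # (left end, right end)
        return (name, "h"), (name, "t")

    match = {}
    for i, g in enumerate(genome):
        nxt = genome[(i + 1) % len(genome)]   # wrap around: circular genome
        right = ends(g)[1]
        left = ends(nxt)[0]
        match[right] = left
        match[left] = right
    return match


def dcj_distance(p, q):
    """Parsimony (DCJ) distance between two circular genomes on the same genes."""
    if sorted(map(abs, p)) != sorted(map(abs, q)):
        raise ValueError("genomes must contain the same genes")
    mp, mq = adjacencies(p), adjacencies(q)

    # Count cycles that alternate between adjacencies of p and of q.
    seen = set()
    cycles = 0
    for start in mp:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = mp[v]        # follow an adjacency of p
            seen.add(w)
            v = mq[w]        # then an adjacency of q
    return len(p) - cycles
```

For example, `dcj_distance([1, 2, 3], [1, -2, 3])` counts the single inversion of gene 2. A distance of this kind is a lower bound on the number of rearrangements that actually occurred, which is the gap the abstract refers to.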

A general approximation for the dynamics of quantitative traits

Posted on October 29, 2015 by schraib

Katarína Boďová, Gašper Tkačik, Nicholas H. Barton

Selection, mutation and random drift affect the dynamics of allele frequencies and consequently of quantitative traits. While the macroscopic dynamics of quantitative traits can be measured, the underlying allele frequencies are typically unobserved. Can we understand how the macroscopic observables evolve without following these microscopic processes? The problem has previously been studied by analogy with statistical mechanics: the allele frequency distribution at each time is approximated by the stationary form, which maximises entropy. We explore the limitations of this method when mutation is small (4Nμ<1) so that populations are typically close to fixation and we extend the theory in this regime to account for changes in mutation strength. We consider a single diallelic locus under either directional selection, or with over-dominance, and then generalise to multiple unlinked biallelic loci with unequal effects. We find that the maximum entropy approximation is remarkably accurate, even when mutation and selection change rapidly.

Rawcopy: Improved copy number analysis with Affymetrix arrays

Posted on October 23, 2015 by schraib

Markus Mayrhofer, Bjorn Viklund, Anders Isaksson

bioRxiv doi: http://dx.doi.org/10.1101/027409

Rawcopy is an R package for processing of Affymetrix CytoScan HD, CytoScan 750k and SNP 6.0 microarray raw intensities (CEL files). It uses data from a large number of reference samples to produce log ratio for total copy number analysis and B-allele frequency for allele-specific copy number and heterozygosity analysis. Rawcopy achieves higher signal-to-noise ratio than commonly used free and proprietary alternatives, leading to improved identification of copy number alterations. In addition, Rawcopy visualises each microarray sample for assessment of technical quality, patient identity and genome-wide absolute copy number states.
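
The two quantities named in this abstract can be sketched in a few lines of Python. This is a deliberately simplified illustration: it uses a single reference intensity and raw allele signals, whereas Rawcopy is an R package whose actual normalization against many reference samples is not described in the abstract. The function names and numbers below are hypothetical.

```python
from math import log2


def log_ratio(sample_intensity, reference_intensity):
    """Log2 ratio of a probe's intensity to its reference intensity.

    Values near 0 suggest the same total copy number as the reference;
    positive values suggest a gain and negative values a loss.
    """
    return log2(sample_intensity / reference_intensity)


def b_allele_frequency(a_signal, b_signal):
    """Naive B-allele frequency: the B allele's share of the total signal.

    Values near 0 or 1 suggest homozygosity; values near 0.5 suggest a
    balanced heterozygous site.
    """
    return b_signal / (a_signal + b_signal)


# Hypothetical probe measurements.
gain = log_ratio(200.0, 100.0)                   # twice the reference signal
loss = log_ratio(50.0, 100.0)                    # half the reference signal
heterozygous = b_allele_frequency(30.0, 30.0)    # balanced A and B signal
homozygous_a = b_allele_frequency(60.0, 0.0)     # no B signal
```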

Homomorphic ZW Chromosomes in a Wild Strawberry Show Distinctive Recombination Heterogeneity but a Small Sex-Determining Region

Posted on October 23, 2015 by schraib

Jacob Tennessen, Rajanikanth Govindarajulu, Aaron Liston, Tia-Lynn Ashman

bioRxiv doi: http://dx.doi.org/10.1101/029611

Sex chromosomes play a prominent role in development and evolution and have several characteristic features that distinguish them from autosomes. Across diverse taxa, recombination is typically suppressed at the sex-determining region (SDR) and proportionally elevated in the remainder of the chromosome or pseudoautosomal region (PAR). However, in most model taxa the sex chromosomes are ancient and highly differentiated from autosomes, and thus little is known about recombination dynamics of homomorphic sex chromosomes with incipient sex-determining mechanisms. Here we examine male function (pollen production) and female function (fruit production) in crosses of the dioecious octoploid strawberry Fragaria chiloensis in order to map the small and recently evolved SDR controlling both traits and to examine recombination patterns on the young ZW chromosome. The SDR occurs in a narrow 280 kb window, in which the maternal recombination rate is lower than in the orthologous paternal region and the genome-wide average rate, but within the range of autosomal rate variation. In contrast to the SDR, the ZW recombination rate in the PAR is much higher than the rates of the ZZ or autosomal linkage groups, substantially overcompensating for the SDR rate. By extensively sequencing sections of the SDR vicinity in several crosses and unrelated plants, we show that W-specific divergence is elevated within a portion of the SDR and find only a single SNP to be in high linkage disequilibrium with sex, suggesting that any W-specific haplotype protected from recombination is not large. We hypothesize that selection for recombination suppression within the small SDR may be weak, but that fluctuating sex ratios could favor elevated recombination in the PAR to remove deleterious mutations on the W. Thus these results illuminate the recombination dynamics of a nascent sex chromosome with a modestly diverged SDR, which could be typical of other dioecious plants.

Inferring chimpanzee Y chromosome history and amplicon diversity from whole genome sequencing

Posted on October 23, 2015 by schraib

Matthew Oetjens, Feichen Shen, Zhengting Zou, Jeffrey Kidd

bioRxiv doi: http://dx.doi.org/10.1101/029702

Due to the lack of recombination, the male-specific region of the Y chromosome (MSY) is a unique resource for tracking the genetic history of populations. The MSY is also enriched for large, nearly identical repetitive regions known as amplicons, which harbor many of the genes essential for spermatogenesis. In humans, sequence diversity on the unique segment of the MSY is greatly reduced compared to the autosomes, an observation consistent with the action of strong selection. Here, we analyze nine chimpanzee (representing three subspecies: Pan troglodytes schweinfurthii, Pan troglodytes ellioti, and Pan troglodytes verus) and two Pan paniscus male whole-genome sequences to assess Y chromosome nucleotide and ampliconic copy-number diversity across the Pan genus. In total, we identified 23,946 Pan spp. SNVs across 4.2 million callable sites. Comparisons with autosomal, X chromosome, and mitochondrial sequences from the same samples indicate that nucleotide diversity on the chimpanzee MSY is reduced relative to neutral expectations with an equal sex ratio. Additionally, the estimated common chimpanzee Y chromosome TMRCA (0.44 mya [0.31-0.56]) is half the age of the mitochondrial TMRCA (0.97 mya [0.65-1.35]), indicating an unequal sex ratio or Y chromosome selection in the common chimpanzee ancestral population. We observe that the copy number of Y chromosome amplicons is variable amongst chimpanzees and bonobos, and identify several lineage-specific patterns, including variable copy number of the testis-expressed genes RBMY and DAZ. We detect recurrent switchpoints of copy-number change along the ampliconic tracts across chimpanzee populations, which may be the result of localized genome instability or selective forces.

Flowr: Robust and efficient pipelines using a simple language-agnostic approach

Posted on October 23, 2015 by schraib

Sahil Seth, Samir Amin, Xingzhi Song, Xizeng Mao, Huandong Sun, Andrew Futreal, Jianhua Zhang

bioRxiv doi: http://dx.doi.org/10.1101/029710

Motivation: Bioinformatics analyses have become increasingly intensive computing processes, with lowering costs and increasing numbers of samples. Each laboratory spends time creating and maintaining a set of pipelines, which may not be robust, scalable, or efficient. Further, the existence of different computing environments across institutions hinders both collaboration and the portability of analysis pipelines. Results: Flowr is a robust and scalable framework for designing and deploying computing pipelines in an easy-to-use fashion. It implements a scatter-gather approach using computing clusters, simplifying the concept to the use of five simple terms (in submission and dependency types). Most importantly, it is flexible, such that customizing existing pipelines is easy, and since it works across several computing environments (LSF, SGE, Torque, and SLURM), it is portable. Availability: http://docs.flowr.space

Machine learning for metagenomics: methods and tools

Posted on October 23, 2015 by schraib

Hayssam Soueidan, Macha Nikolski

While genomics is the research field concerned with the study of the genome of an organism, metagenomics is the term for research that focuses on many genomes at the same time, as is typical in some areas of environmental study. Metagenomics recognizes the need to develop computational methods that enable understanding of the genetic composition and activities of communities of species so complex that they can only be sampled, never completely characterized.
Machine learning currently offers some of the most computationally efficient tools for building predictive models for classification of biological data. Various biological applications cover the entire spectrum of machine learning problems including supervised learning, unsupervised learning (or clustering), and model construction. Moreover, most biological data (and this is the case for metagenomics) are both unbalanced and heterogeneous, thus meeting the current challenges of machine learning in the era of Big Data.
The goal of this review is to examine the contribution of machine learning techniques to metagenomics, that is, to answer the question “to what extent does machine learning contribute to the study of microbial communities and environmental samples?” We will first briefly introduce the scientific fundamentals of machine learning. In the following sections we will illustrate how these techniques are helpful in answering questions of metagenomic data analysis. We will describe a number of methods and tools to this end, though we will not cover them exhaustively. Finally, we will speculate on the possible future directions of this research.

Haldane's Sieve

Discussing preprints in population and evolutionary genetics

Yearly Archives: 2015
