TreeOTU: Operational Taxonomic Unit Classification Based on Phylogenetic Trees

Dongying Wu, Ladan Doroud, Jonathan A. Eisen
(Submitted on 28 Aug 2013)

Our current understanding of the taxonomic and phylogenetic diversity of cellular organisms, especially the bacteria and archaea, is mostly based upon studies of sequences of the small-subunit rRNAs (ssu-rRNAs). To address the limitations of ssu-rRNA as a phylogenetic marker, such as copy number variation among organisms and complications introduced by horizontal gene transfer, convergent evolution, or evolution rate variations, we have identified protein-coding gene families as alternative Phylogenetic and Phylogenetic Ecology markers (PhyEco). Current nucleotide sequence similarity based Operational Taxonomic Unit (OTU) classification methods are not readily applicable to amino acid sequences of PhyEco markers. We report here the development of TreeOTU, a phylogenetic tree structure based OTU classification method that takes into account differences in rates of evolution between taxa and between genes. OTU sets built by TreeOTU are more faithful to phylogenetic tree structures than sequence clustering (non-phylogenetic) methods for ssu-rRNAs. OTUs built from phylogenetic trees of protein-coding PhyEco markers are comparable to our current taxonomic classification at different levels. With the included OTU comparison tools, TreeOTU is robust in phylogenetic referencing with different phylogenetic markers and trees.

The role of mutation rate variation and genetic diversity in the architecture of human disease

Ying Chen Eyre-Walker, Adam Eyre-Walker
(Submitted on 29 Aug 2013)

We have investigated the role that the mutation rate and the structure of genetic variation at a locus play in determining whether a gene is involved in disease. We predict that the mutation rate and its genetic diversity should be higher in genes associated with disease, unless all genes that could cause disease have already been identified. Consistent with our predictions, we find that genes associated with Mendelian and complex disease are substantially longer than non-disease genes. However, we find that both Mendelian and complex disease genes are found in regions of the genome with relatively low mutation rates, as inferred from intron divergence between humans and chimpanzees. Complex disease genes are predicted to have higher rates of non-synonymous mutation than non-disease genes, but the opposite pattern is found in Mendelian disease genes. Finally, we find that disease genes are in regions of significantly elevated genetic diversity, even when variation in the rate of mutation is controlled for. Nevertheless, the effect is small. Our results suggest that variation in the genic mutation rate and the genetic architecture of the locus play a minor role in determining whether a gene is associated with disease.

Exploration and retrieval of whole-metagenome sequencing samples

Sohan Seth, Niko Välimäki, Samuel Kaski, Antti Honkela
(Submitted on 28 Aug 2013)

In recent years, the field of whole metagenome shotgun sequencing has witnessed significant growth due to next generation sequencing technologies that allow genomic samples to be sequenced more cheaply, faster, and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. In this paper, we develop a content-based retrieval method for whole metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence k-mers from a pool of metagenomic samples, and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome data sets and observe significant enrichment for diseased samples in the results of queries with another diseased sample.
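
To make the retrieval idea concrete, here is a minimal Python sketch of ranking samples by k-mer dissimilarity. It is not the authors' method: the abstract describes a distributed string mining framework that selects informative k-mers, whereas this sketch assumes a single fixed k, uses every k-mer, and uses Euclidean distance between relative-frequency profiles. The function names (kmer_profile, dissimilarity, rank_samples) are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_profile(reads, k):
    """Relative frequency of every k-mer seen across a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def dissimilarity(profile_a, profile_b):
    """Euclidean distance between two k-mer profiles (an assumed measure)."""
    kmers = set(profile_a) | set(profile_b)
    return sqrt(sum((profile_a.get(x, 0.0) - profile_b.get(x, 0.0)) ** 2
                    for x in kmers))


def rank_samples(query_reads, database, k=4):
    """Return database sample names ordered from most to least similar."""
    query = kmer_profile(query_reads, k)
    scored = [(dissimilarity(query, kmer_profile(reads, k)), name)
              for name, reads in database.items()]
    return [name for _, name in sorted(scored)]


if __name__ == "__main__":
    # Toy data only; real samples hold millions of reads.
    database = {"same": ["ACGTACGT"], "other": ["TTTTTTTT"]}
    print(rank_samples(["ACGTACGT"], database))
```

In a real setting the choice of k, the k-mer filtering, and the distance measure all matter a great deal, and those are the parts the paper addresses.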

Using Volcano Plots and Regularized-Chi Statistics in Genetic Association Studies

Wentian Li, Jan Freudenberg, Young Ju Suh, Yaning Yang
(Submitted on 28 Aug 2013)

Labor-intensive experiments are typically required to identify the causal disease variants from a list of disease-associated variants in the genome. For designing such experiments, candidate variants are ranked by their strength of genetic association with the disease. However, the two commonly used measures of genetic association, the odds-ratio (OR) and p-value, may rank variants in different order. To integrate these two measures into a single analysis, here we transfer the volcano plot methodology from gene expression analysis to genetic association studies. In its original setting, volcano plots are scatter plots of fold-change and t-test statistic (or -log of the p-value), with the latter being more sensitive to sample size. In genetic association studies, the OR and Pearson’s chi-square statistic (or equivalently its square root, chi; or the standardized log(OR)) can be analogously used in a volcano plot, allowing for their visual inspection. Moreover, the geometric interpretation of these plots leads to an intuitive method for filtering results by a combination of both OR and chi-square statistic, which we term “regularized-chi”. This method selects associated markers by a smooth curve in the volcano plot instead of the right-angled lines that correspond to independent cutoffs for OR and chi-square statistic. The regularized-chi incorporates relatively more signals from variants with lower minor-allele-frequencies than the chi-square test statistic. As rare variants tend to have stronger functional effects, regularized-chi is better suited to the task of prioritization of candidate genes.
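
As a rough illustration of the two axes of such a volcano plot, the Python sketch below computes log(OR) and a signed chi (the square root of Pearson's chi-square, without continuity correction) from a 2x2 allele-count table, and applies the conventional right-angled filter with independent cutoffs. The regularized-chi statistic itself is not reproduced, because the abstract does not give its formula. The function names, the cutoff values, and the assumption that no cell of the table is zero are all illustrative.

```python
from math import log, sqrt


def volcano_point(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (log odds ratio, signed chi) for one variant.

    Assumes all four counts are positive; zero cells would need a
    correction that is not handled here.
    """
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square for a 2x2 table, no continuity correction.
    chi_square = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d))
    chi = sqrt(chi_square)
    signed_chi = chi if odds_ratio >= 1 else -chi
    return log(odds_ratio), signed_chi


def passes_independent_cutoffs(log_or, chi, min_abs_log_or, min_abs_chi):
    """Right-angled filter: both measures must clear their own cutoff."""
    return abs(log_or) >= min_abs_log_or and abs(chi) >= min_abs_chi


if __name__ == "__main__":
    x, y = volcano_point(30, 70, 10, 90)
    print(x, y, passes_independent_cutoffs(x, y, log(1.5), 3.0))
```

Plotting x against y for every variant gives the volcano plot; the paper's contribution is to replace the rectangular acceptance region used here with a smooth curve.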

Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome

Wentian Li, Jan Freudenberg, Pedro Miramontes
(Submitted on 28 Aug 2013)

The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a greater length increases the chance of reads being uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 to 1000 basepairs. We use the proportion of non-singleton k-mers to evaluate the mappability of reads for a corresponding read length. We observe that the proportion of non-singletons decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different k ranges. A faster decay at smaller values for k indicates more limited gains for read lengths > 200 basepairs. The frequency distributions of k-mers exhibit long tails in a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The location of the most frequent 1000-mers comprises 172 kilobase-ranged regions, including four large stretches on chromosomes 1 and X, containing genes with biomedical implications. Even a read length of 1000 would be insufficient to reliably sequence these specific regions.
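
The central quantity here, the proportion of non-singleton k-mers, can be sketched in a few lines of Python. This is a toy version under stated assumptions: it counts k-mer positions on one strand of a single in-memory string, and it defines the proportion as the share of positions whose k-mer occurs more than once. The paper's exact definition and its handling of reverse complements or ambiguous bases may differ, and a genome-scale analysis would need a dedicated k-mer counter. The function name is hypothetical.

```python
from collections import Counter


def non_singleton_fraction(sequence, k):
    """Share of k-mer positions whose k-mer occurs more than once."""
    counts = Counter(sequence[i:i + k]
                     for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


if __name__ == "__main__":
    toy_genome = "ACGTACGTTTGACGTACGAA"
    # Longer k-mers are repeated less often, so the fraction should fall.
    for k in (2, 4, 8):
        print(k, non_singleton_fraction(toy_genome, k))
```

Evaluating this fraction over a range of k and plotting it on log-log axes is the kind of curve the abstract describes fitting with piecewise power laws.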

Our paper: Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales

This guest post is by Mike Harvey on his paper (with coauthors), Tilston-Smith and Harvey et al., “Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales”, arXived here.

This paper is a result of work on developing markers and methods for generating genomic data for species without available genomes (I’ll refer to these as “non-model” species). The work is a collaborative effort between some researchers who are really on top of developments in sequencing technologies (and are also a blast to work with) – Travis Glenn at UGA, Brant Faircloth at UCLA, and John McCormack at Occidental – and our lab here at LSU. We think the marker sets we have been developing (ultraconserved elements) and more generally the method we are using (sequence capture) have the potential to make the genomic revolution more accessible to researchers studying the population genetics of diverse non-model organisms.


Although genomic resources for humans and other model systems are increasing rapidly, the bottleneck for those of us working on the population genetics of non-model systems is simply our ability to generate data. Many of us are still struggling to take advantage of the increase in sequencing capacity provided by next-generation platforms. For many projects, sequencing entire genomes is neither feasible (yet) nor necessary, so researchers have focused on finding reasonable methods of subsampling the genome in a repeatable way such that the same subset of genomic regions can be sampled for many individuals. We often have to do this, however, with little to no prior genomic information from our particular study organism.

Most methods for subsampling the genome thus far have involved “random” sampling from across the genome by using restriction enzymes to digest genomic DNA and then sequencing fragments that fall in a particular part of the fragment size distribution. Drawbacks of these methods include (1) the fact that the researcher has no prior knowledge of where in the genome sequences will be coming from or what function the genomic region might serve, and (2) that the repeatability of the method, specifically the ability to generate data from the same loci across samples, depends on the conservation of the enzyme cut sites, and these often are not conserved at deeper timescales. Sequencing transcriptomes is also a popular method for subsampling the genome, but this simply isn’t an option for those of us working with museum specimens and tissues or old blood samples in which RNA hasn’t been properly preserved.
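
For readers who have not used these methods, a minimal in silico sketch in Python shows why the recovered loci depend on the cut sites. It cuts a sequence at every occurrence of a recognition site and keeps fragments within a size window. The toy sequence and the helper names (digest, size_select) are made up for illustration; the site GAATTC with the cut after its first base corresponds to EcoRI, but real protocols (often with two enzymes, adapters, and both strands) are more involved.

```python
def digest(sequence, site, cut_offset=0):
    """Cut sequence at each occurrence of site, cut_offset bases into it."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Drop empty fragments, which arise if a cut falls at a sequence end.
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length lies inside the selected window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    reference = "AAGAATTCTTTTGAATTCCC"
    mutated = "AAGAATTCTTTTGAGTTCCC"  # second cut site lost to one mutation
    for name, seq in (("reference", reference), ("mutated", mutated)):
        fragments = digest(seq, "GAATTC", cut_offset=1)
        print(name, size_select(fragments, 5, 8))
```

With the second site mutated, the fragment that fell in the size window for the reference is no longer produced, which is the repeatability problem described in point (2).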

Sequence capture, a molecular technique involving genome enrichment by hybridization to RNA or DNA ‘probes’, is a flexible alternative that allows researchers to subsample whatever portions of the genome they like. The drawback of sequence capture, however, is that you need enough prior genomic information to design the synthetic oligos used as probes. This is not a problem for e.g. exome capture in humans in which the targeted genes are well characterized, but it is a challenge for non-model systems without sequenced genomes.

This is where ultraconserved elements come in. Ultraconserved elements (UCEs) are short genomic regions that are highly conserved across widely divergent species (e.g. all amniotes). Because they are so conserved, UCE sequences can be easily used as probes for sequence capture in diverse non-model organisms, even if the organisms themselves have little or no genomic information available. If you are not working on amniotes or fishes (for which we have already designed probe arrays), all you may need to find UCEs is a couple of genomes from species that diverged from your study organism within the last few hundred million years. Of course, this general approach is not specific to loci that fall into our narrow definition of UCEs, but is limited merely by the availability of genomic information that can be used to design probes. As additional genomic information becomes available from a given group, additional loci, including protein-coding regions, can easily be added to capture arrays.

Our question for this paper – does sequence capture of UCEs work for population genetics?

We have previously used sequence capture of UCEs to understand deeper-level phylogenetic questions. We’ve found that at deep timescales, the flanking regions of UCEs contain a large amount of informative variation. The goals of the present study were (1) to see if sufficient information existed in UCEs to enable studies at shallow evolutionary (read “population genetic or phylogeographic”) timescales, and (2) to explore some of the analyses that might be possible with population genetic data from non-model organisms. For our study, we sampled two individuals from each of four populations in five different species of non-model Neotropical birds. We conducted sequence capture using probes designed from 2,386 UCEs shared by amniotes and we sequenced the resulting libraries using an Illumina HiSeq. We then examined the number of loci recovered and the amount of informative variation in those loci for each of the five species. We also conducted some standard analyses – species tree estimation, demographic modeling, and species delimitation – for each species.

We were able to recover between 776 and 1,516 UCE regions across the five species, and these contained sufficient variation to conduct population genetic analyses in each species. Species tree estimates, demographic parameters, and species limits mostly corresponded with prior estimates based on morphology or mitochondrial DNA sequences. Confidence intervals around demographic parameter estimates from the UCEs were much narrower than estimates from mitochondrial DNA using similar methods, supporting the idea that larger datasets will allow more precise estimates of species histories.

Some conclusions

Pending faster and cheaper methods for sequencing and de novo assembling whole genomes, methods for sampling a subset of the genome will be a practical necessity for population genetic studies in non-model organisms. Sequence capture is both intuitively appealing and practical in that it allows researchers to select a priori the regions of the genome in which they are interested. Ultraconserved elements pair nicely with sequence capture because they allow us to collect data from the same loci shared across a very broad spectrum of organisms (e.g. all amniotes or all fishes). As genomic data for diverse groups increases, UCE capture probes will certainly be augmented with additional genomic regions. In the meantime, sequence capture of UCEs has a lot to offer for population genetic studies of non-model organisms. See our paper for more information, or visit, where our probe sets, protocols, code, and other information are available under open-source licenses (BSD-style and Creative Commons) for anyone to use.

Fast Approximate Inference of Transcript Expression Levels from RNA-seq Data

James Hensman, Peter Glaus, Antti Honkela, Magnus Rattray
(Submitted on 27 Aug 2013)

Motivation: The mapping of RNA-seq reads to their transcripts of origin is a fundamental task in transcript expression estimation and differential expression scoring. Where ambiguities in mapping exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem becomes an instance of non-trivial probabilistic inference. Bayesian inference in such a problem is intractable and approximate methods must be used such as Markov chain Monte Carlo (MCMC) and Variational Bayes. Standard implementations of these methods can be prohibitively slow for large datasets and complex gene models.
Results: We propose an approximate inference scheme based on Variational Bayes applied to an existing model of transcript expression inference from RNA-seq data. We apply recent advances in Variational Bayes algorithmics to improve the convergence of the algorithm beyond the standard variational expectation-maximisation approach. We apply our algorithm to simulated and biological datasets, demonstrating that the increase in speed requires only a small trade-off in accuracy of expression level estimation.
Availability: The methods were implemented in R and C++, and are available as part of the BitSeq project at this https URL. The methods will be made available through the BitSeq Bioconductor package at the next stable release.