High-speed and accurate color-space short-read alignment with CUSHAW2

Yongchao Liu, Bernt Popp, Bertil Schmidt
(Submitted on 17 Apr 2013)

Summary: We present an extension of CUSHAW2 for fast and accurate alignments of SOLiD color-space short-reads. Our extension introduces a double-seeding approach to improve mapping sensitivity, by combining maximal exact match seeds and variable-length seeds derived from local alignments. We have compared the performance of CUSHAW2 to SHRiMP2 and BFAST by aligning both simulated and real color-space mate-paired reads to the human genome. The results show that CUSHAW2 achieves comparable or better alignment quality compared to SHRiMP2 and BFAST at an order-of-magnitude faster speed and significantly smaller peak resident memory size. Availability: CUSHAW2 and all simulated datasets are available at this http URL Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de
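For readers unfamiliar with color space: SOLiD reads encode each pair of adjacent bases as one of four colors rather than reporting the bases directly. The sketch below illustrates the standard di-base encoding, in which a color is the XOR of the two-bit codes of neighbouring bases. It is not taken from CUSHAW2; the function names and the simplified "first base plus colors" representation are assumptions made for illustration.

```python
# Illustrative sketch of SOLiD di-base (color-space) encoding.
# Not CUSHAW2 code: function names and read layout are assumptions.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def to_color(seq):
    """Encode a base-space sequence as a string of colors (0-3).

    Each color is the XOR of the 2-bit codes of two adjacent bases.
    """
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(seq, seq[1:]))

def from_color(first_base, colors):
    """Decode colors back to bases, given the known first base."""
    bases = [first_base]
    for c in colors:
        # XOR is its own inverse, so the next base is prev XOR color.
        bases.append(BASE[CODE[bases[-1]] ^ int(c)])
    return "".join(bases)

colors = to_color("ATGGC")
decoded = from_color("A", colors)
```

Because every base is decoded from the previous one, a single color error changes all downstream bases, which is one reason color-space reads are usually aligned in color space rather than translated first.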

XORRO: Rapid Paired-End Read Overlapper

Russell J. Dickson, Gregory B. Gloor
(Submitted on 16 Apr 2013)

Background: Computational analysis of next-generation sequencing data is outpaced by data generation in many cases. In one such case, paired-end reads can be produced from the Illumina sequencing method faster than they can be overlapped by downstream analysis. The advantages in read length and accuracy provided by overlapping paired-end reads demonstrate the necessity for software to efficiently solve this problem.
Results: XORRO is an extremely efficient paired-end read overlapping program. XORRO can overlap millions of short paired-end reads in a few minutes. It uses 64-bit registers with a two bit alphabet to represent sequences and does comparisons using low-level logical operations like XOR, AND, bitshifting and popcount.
Conclusions: As of the writing of this manuscript, XORRO provides the fastest solution to the paired-end read overlap problem. XORRO is available for download at: sourceforge.net/projects/xorro-overlap/
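The bit-level idea described in the Results can be sketched as follows. This is not XORRO's source code: it is a minimal illustration, assuming the encoding A=0, C=1, G=2, T=3, of how two sequences packed at two bits per base can be compared with XOR, a shift, an AND mask and a popcount. The function names, the 32-base limit (mimicking one 64-bit register) and the simple overlap scan are all assumptions.

```python
# Illustrative sketch only; names and the overlap scan are hypothetical.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    word = 0
    for base in seq:
        word = (word << 2) | CODE[base]
    return word

def mismatches(a, b):
    """Count mismatching bases between two equal-length sequences."""
    assert len(a) == len(b) <= 32  # 32 bases fit one 64-bit register
    x = pack(a) ^ pack(b)          # non-zero bit pair = mismatch
    mask = int("01" * len(a), 2) if a else 0
    # Fold the high bit of each pair onto the low bit, keep low bits only,
    # so each mismatching base contributes exactly one set bit.
    collapsed = (x | (x >> 1)) & mask
    return bin(collapsed).count("1")  # popcount

def best_overlap(fwd, rev_rc, min_len=3, max_mismatch=0):
    """Return the longest overlap length where the suffix of `fwd`
    matches the prefix of `rev_rc` (the reverse read, already
    reverse-complemented) within `max_mismatch`, or 0 if none."""
    for n in range(min(len(fwd), len(rev_rc), 32), min_len - 1, -1):
        if mismatches(fwd[-n:], rev_rc[:n]) <= max_mismatch:
            return n
    return 0

overlap = best_overlap("ACGTACGT", "ACGTTTTT")
```

A production tool would operate on fixed-width machine words and hardware popcount instructions; Python integers are used here only to keep the sketch short.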

GEMINI: integrative exploration of genetic variation and genome annotations

Uma Paila, Brad Chapman, Rory Kirchner, Aaron Quinlan
(Submitted on 17 Apr 2013)

Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or variant prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate the utility of GEMINI for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility, and our goal is to provide researchers with a standard framework for medical genomics.

Our paper: Clusters of microRNAs emerge by new hairpins in existing transcripts

This guest post is by Antonio Marco (@antonio_marco_c) on his paper Marco et al. Clusters of microRNAs emerge by new hairpins in existing transcripts arXived here.


MicroRNAs are short regulatory sequences involved in virtually all biological processes. MicroRNAs are often organized in genomic clusters that produce polycistronic transcripts. It is well-known that protein-coding polycistronic transcripts are almost absent in animals (with a few exceptions in nematodes and ascidians). So where do these microRNA clusters come from, and why are they so prevalent? We tackle these questions in our paper “Clusters of microRNAs emerge by new hairpins in existing transcripts”, recently deposited in arXiv.

We envisioned several possible scenarios for the origin of polycistronic microRNAs: First, polycistronic microRNAs can emerge by genomic rearrangements that bring together pre-existing microRNAs. As in bacterial operons, the clustering of microRNAs with related functions can be advantageous, and the fusion of related microRNAs may be positively selected. We call this the ‘put together’ model. Alternatively, multiple microRNAs could become polycistronic as a by-product of genome reduction (this is analogous to Caenorhabditis elegans operons). This is the ‘left together’ model. A third model, called ‘tandem duplication’, implies that polycistronic microRNAs emerge by tandem duplication of single sequences. Lastly, new microRNAs can emerge de novo in already existing microRNA transcripts. We named this the ‘new hairpin’ model, since a novel microRNA first requires the formation of a hairpin-like structure in the transcript.

By reconstructing the evolutionary history of Drosophila melanogaster microRNAs we observed that the majority of microRNA clusters emerged by the formation of new microRNA precursors in existing transcribed microRNA genes (‘new hairpin’ model). We also found that gene duplication generated a minority of the clusters (‘tandem duplication’). However, we didn’t see any instance of fusion of pre-existing microRNA genes. Moreover, clusters rarely split or suffer rearrangements. Once a microRNA cluster is formed, it stays as a cluster or it is lost as a whole.

We propose a model for the origin and evolution of microRNA clusters. Polycistronic microRNAs are an extreme case of genetic linkage, in which a microRNA is typically a few nucleotides away from another microRNA. Once a cluster is formed, the linkage is so tight that recombination is dramatically reduced between these loci. We suggest that, because of strong selective interference between loci (Hill-Robertson effect), a microRNA under selective pressure strongly influences the evolutionary fate of any neighbouring microRNA. Even slightly deleterious microRNAs may be maintained in a population if selection in one microRNA of the cluster is strong enough. Currently, we are analysing polymorphism data to test the validity of our model in actual Drosophila populations.

In summary, we suggest that clusters of microRNAs emerge by non-adaptive mechanisms and they are maintained as a consequence of tight linkage.

Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish

Mikhail Spivakov, Thomas O. Auer, Ravindra Peravali, Ian Dunham, Dirk Dolle, Asao Fujiyama, Atsushi Toyoda, Tomoyuki Aizu, Yohei Minakuchi, Felix Loosli, Kiyoshi Naruse, Ewan Birney, Joachim Wittbrodt
(Submitted on 16 Apr 2013)

Background: Oryzias latipes (Medaka) has been established as a vertebrate genetic model for over a century, and has recently been rediscovered outside its native Japan. The power of new sequencing methods now makes it possible to reinvigorate Medaka genetics, in particular by establishing a near-isogenic panel derived from a single wild population.
Results: Here we characterise the genomes of wild Medaka catches obtained from a single Southern Japanese population in Kiyosu as a precursor for the establishment of a near-isogenic panel of wild lines. The population is free of significant detrimental population structure, and has advantageous linkage disequilibrium properties suitable for establishment of the proposed panel. Analysis of morphometric traits in five representative inbred strains suggests phenotypic mapping will be feasible in the panel. In addition, high-throughput genome sequencing of these Medaka strains confirms their evolutionary relationships on lines of geographic separation and provides further evidence that there has been little significant interbreeding between the Southern and Northern Medaka populations since the Southern/Northern population split. The sequence data suggest that the Southern Japanese Medaka existed as a larger, older population which went through a relatively recent bottleneck around 10,000 years ago. In addition, we detect patterns of recent positive selection in the Southern population.
Conclusions: These data indicate that the genetic structure of the Kiyosu Medaka samples is suitable for the establishment of a vertebrate near-isogenic panel, and therefore inbreeding of 200 lines based on this population has commenced. Progress of this project can be tracked at this http URL

Reducing assembly complexity of microbial genomes with single-molecule sequencing

Sergey Koren, Gregory P Harhay, Timothy PL Smith, James L Bono, Dayna M Harhay, D. Scott Mcvey, Diana Radune, Nicholas H Bergman, Adam M Phillippy
(Submitted on 13 Apr 2013)

Background: The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem.
Results: To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These assemblies are also comparable in accuracy to hybrid assemblies including second-generation data.
Conclusions: Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to below $2,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of complete genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.

Integrating influenza antigenic dynamics with molecular evolution

Trevor Bedford, Marc A. Suchard, Philippe Lemey, Gytis Dudas, Victoria Gregory, Alan J. Hay, John W. McCauley, Colin A. Russell, Derek J. Smith, Andrew Rambaut
(Submitted on 12 Apr 2013)

Influenza viruses undergo continual antigenic evolution allowing mutant viruses to evade immunity acquired by the host population to previous virus strains. Antigenic phenotype is often assessed through pairwise measurement of cross-reactivity between influenza strains using the hemagglutination inhibition (HI) assay. Here, we extend previous approaches to antigenic cartography, which seeks to place strains on an antigenic map, such that distances on this map best recapitulate titers observed across multiple HI assays. In our model, we simultaneously characterize antigenic and genetic evolution by including an evolutionary model in which antigenic location diffuses over a shared virus phylogeny. Using HI data for four lineages of influenza, encompassing influenza A subtypes H3N2 and H1N1, and influenza B lineages Victoria and Yamagata, we determine average rates of antigenic drift for each lineage, as well as year-to-year variability in the rate of drift. Through comparison with epidemiological data, we demonstrate a year-to-year correlation between drift and incidence and present evidence that antigenic drift mediates interference between influenza lineages. We investigate the selective underpinnings for differing antigenic dynamics across lineages and show that A/H3N2 benefits from both a higher influx of new antigenic mutations and also from more efficient conversion of antigenic variation into fixed differences. This work does much to elucidate the antigenic dynamics of influenza lineages, but also allows for substantial future advances in investigating the dynamics of influenza and other antigenically-variable pathogens by providing a model that intimately combines molecular and antigenic evolution.

Identifiability of a Coalescent-based Population Tree Model

Arindam RoyChoudhury
(Submitted on 12 Apr 2013)

Identifiability of evolutionary tree models has been a recent topic of discussion and some models have been shown to be non-identifiable. A coalescent-based rooted population tree model, originally proposed by Nielsen et al. 1998 [2], has been used by many authors in the last few years and is a simple tool to accurately model the changes in allele frequencies in the tree. However, the identifiability of this model has never been proven. Here we prove this model to be identifiable by showing that the model parameters can be expressed as functions of the probability distributions of subsamples. This is a step toward proving the consistency of the maximum likelihood estimator of the population tree based on this model.

The influence of relatives on the efficiency and error rate of familial searching

Rori V. Rohlfs, Erin Murphy, Yun S. Song, Montgomery Slatkin
(Submitted on 10 Apr 2013)

We investigate the consequences of adopting the criteria used by the state of California, as described by Myers et al. (2011), for conducting familial searches. We carried out a simulation study of randomly generated profiles of related and unrelated individuals with 13-locus CODIS genotypes and YFiler Y-chromosome haplotypes, on which the Myers protocol for relative identification was carried out. For Y-chromosome sharing first degree relatives, the Myers protocol has a high probability (80 – 99%) of identifying their relationship. For unrelated individuals, there is a low probability that an unrelated person in the database will be identified as a first-degree relative. For more distant Y-haplotype sharing relatives (half-siblings, first cousins, half-first cousins or second cousins) there is a substantial probability that the more distant relative will be incorrectly identified as a first-degree relative. For example, there is a 3 – 18% probability that a first cousin will be identified as a full sibling, with the probability depending on the population background. Although the California familial search policy is likely to identify a first degree relative if his profile is in the database, and it poses little risk of falsely identifying an unrelated individual in a database as a first-degree relative, there is a substantial risk of falsely identifying a more distant Y-haplotype sharing relative in the database as a first-degree relative, with the consequence that their immediate family may become the target for further investigation. This risk falls disproportionately on those ethnic groups that are currently overrepresented in state and federal databases.

YHap: software for probabilistic assignment of Y haplogroups from population re-sequencing data

Fan Zhang, Ruoyan Chen, Dongbing Liu, Xiaotian Yao, Guoqing Li, Yabin Jin, Chang Yu, Yingrui Li, Lachlan Coin
(Submitted on 11 Apr 2013)

Y haplogroup analyses are an important component of genealogical reconstruction, population genetic analyses, medical genetics and forensics. These fields are increasingly moving towards use of low-coverage, high-throughput sequencing. However, there is as yet no software available for using sequence data to assign Y haplogroups probabilistically, such that the posterior probability of assignment fully reflects the information present in the data, and borrows information across all samples sequenced from a population. YHap addresses this problem.