GEMINI: integrative exploration of genetic variation and genome annotations

Uma Paila, Brad Chapman, Rory Kirchner, Aaron Quinlan
(Submitted on 17 Apr 2013)

Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or variant prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate the utility of GEMINI for exploring variation in personal genomes and family-based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility, and our goal is to provide researchers with a standard framework for medical genomics.
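
To make the idea of composing a query across genotypes, an inheritance pattern, and annotations concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not GEMINI's schema or interface: the table names, column names, sample names, and the "clinical"/"pathogenic" annotation values are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a HYPOTHETICAL schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, gene TEXT);
        CREATE TABLE annotations (variant_id INTEGER, source TEXT, value TEXT);
        CREATE TABLE genotypes (variant_id INTEGER, sample TEXT, alt_count INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO variants VALUES (?, ?)",
        [(1, "GENE_A"), (2, "GENE_B"), (3, "GENE_C")],
    )
    conn.executemany(
        "INSERT INTO annotations VALUES (?, ?, ?)",
        [(1, "clinical", "pathogenic"),
         (2, "clinical", "pathogenic"),
         (3, "clinical", "benign")],
    )
    # alt_count = number of alternate alleles carried (0, 1, or 2).
    conn.executemany(
        "INSERT INTO genotypes VALUES (?, ?, ?)",
        [(1, "child", 2), (1, "mother", 1), (1, "father", 1),
         (2, "child", 2), (2, "mother", 1), (2, "father", 0),
         (3, "child", 2), (3, "mother", 1), (3, "father", 1)],
    )
    return conn


def recessive_candidates(conn, child, parents):
    """Variants homozygous in the child, heterozygous in both parents,
    and carrying the (hypothetical) 'pathogenic' clinical annotation."""
    query = """
        SELECT v.variant_id, v.gene
        FROM variants v
        JOIN genotypes c
          ON c.variant_id = v.variant_id AND c.sample = ? AND c.alt_count = 2
        JOIN annotations a
          ON a.variant_id = v.variant_id
         AND a.source = 'clinical' AND a.value = 'pathogenic'
        WHERE (SELECT COUNT(*) FROM genotypes p
               WHERE p.variant_id = v.variant_id
                 AND p.sample IN (?, ?) AND p.alt_count = 1) = 2
        ORDER BY v.variant_id
    """
    return conn.execute(query, (child, parents[0], parents[1])).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(recessive_candidates(db, "child", ("mother", "father")))
```

In this toy data only the GENE_A variant passes all three conditions; the GENE_B variant fails the inheritance filter and the GENE_C variant fails the annotation filter.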

Our paper: Clusters of microRNAs emerge by new hairpins in existing transcripts

This guest post is by Antonio Marco (@antonio_marco_c) on his paper Marco et al. Clusters of microRNAs emerge by new hairpins in existing transcripts arXived here.

MicroRNAs are short regulatory sequences involved in virtually all biological processes. MicroRNAs are often organized in genomic clusters that produce polycistronic transcripts. It is well known that protein-coding polycistronic transcripts are almost absent in animals (with a few exceptions in nematodes and ascidians). So where do these microRNA clusters come from, and why are they so prevalent? We tackle these questions in our paper “Clusters of microRNAs emerge by new hairpins in existing transcripts”, recently deposited in arXiv.

We envisioned several possible scenarios for the origin of polycistronic microRNAs: First, polycistronic microRNAs can emerge by genomic rearrangements that bring together pre-existing microRNAs. As in bacterial operons, the clustering of microRNAs with related functions can be advantageous, and the fusion of related microRNAs may be positively selected. We call this the ‘put together’ model. Alternatively, multiple microRNAs could become polycistronic as a by-product of genome reduction (this is analogous to Caenorhabditis elegans operons). This is the ‘left together’ model. A third model, called ‘tandem duplication’, implies that polycistronic microRNAs emerge by tandem duplication of single sequences. Lastly, new microRNAs can emerge de novo in already existing microRNA transcripts. We named this the ‘new hairpin’ model, since a novel microRNA first requires the formation of a hairpin-like structure in the transcript.

By reconstructing the evolutionary history of Drosophila melanogaster microRNAs, we observed that the majority of microRNA clusters emerged by the formation of new microRNA precursors in existing transcribed microRNA genes (‘new hairpin’ model). We also found that gene duplication generated a minority of the clusters (‘tandem duplication’). However, we didn’t see any instance of fusion of pre-existing microRNA genes. Moreover, clusters rarely split or suffer rearrangements. Once a microRNA cluster is formed, it stays as a cluster or it is lost as a whole.

We propose a model for the origin and evolution of microRNA clusters. Polycistronic microRNAs are an extreme case of genetic linkage, in which a microRNA is typically a few nucleotides away from another microRNA. Once a cluster is formed, the linkage is so tight that recombination is dramatically reduced between these loci. We suggest that, because of strong selective interference between loci (Hill-Robertson effect), a microRNA under selective pressure strongly influences the evolutionary fate of any neighbouring microRNA. Even slightly deleterious microRNAs may be maintained in a population if selection in one microRNA of the cluster is strong enough. Currently, we are analysing polymorphism data to test the validity of our model in actual Drosophila populations.

In summary, we suggest that clusters of microRNAs emerge by non-adaptive mechanisms and they are maintained as a consequence of tight linkage.

Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish

Mikhail Spivakov, Thomas O. Auer, Ravindra Peravali, Ian Dunham, Dirk Dolle, Asao Fujiyama, Atsushi Toyoda, Tomoyuki Aizu, Yohei Minakuchi, Felix Loosli, Kiyoshi Naruse, Ewan Birney, Joachim Wittbrodt
(Submitted on 16 Apr 2013)

Background: Oryzias latipes (Medaka) has been established as a vertebrate genetic model for over a century, and has recently been rediscovered outside its native Japan. The power of new sequencing methods now makes it possible to reinvigorate Medaka genetics, in particular by establishing a near-isogenic panel derived from a single wild population. Results: Here we characterise the genomes of wild Medaka catches obtained from a single Southern Japanese population in Kiyosu as a precursor for the establishment of a near-isogenic panel of wild lines. The population is free of significant detrimental population structure, and has advantageous linkage disequilibrium properties suitable for establishment of the proposed panel. Analysis of morphometric traits in five representative inbred strains suggests phenotypic mapping will be feasible in the panel. In addition, high-throughput genome sequencing of these Medaka strains confirms their evolutionary relationships on lines of geographic separation and provides further evidence that there has been little significant interbreeding between the Southern and Northern Medaka populations since the Southern/Northern population split. The sequence data suggest that the Southern Japanese Medaka existed as a larger, older population which went through a relatively recent bottleneck around 10,000 years ago. In addition, we detect patterns of recent positive selection in the Southern population. Conclusions: These data indicate that the genetic structure of the Kiyosu Medaka samples is suitable for the establishment of a vertebrate near-isogenic panel, and therefore inbreeding of 200 lines based on this population has commenced. Progress of this project can be tracked at this http URL

Reducing assembly complexity of microbial genomes with single-molecule sequencing

Sergey Koren, Gregory P Harhay, Timothy PL Smith, James L Bono, Dayna M Harhay, D. Scott Mcvey, Diana Radune, Nicholas H Bergman, Adam M Phillippy
(Submitted on 13 Apr 2013)

Background: The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem.
Results: To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These assemblies are also comparable in accuracy to hybrid assemblies including second-generation data.
Conclusions: Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to below $2,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of complete genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.

Integrating influenza antigenic dynamics with molecular evolution

Trevor Bedford, Marc A. Suchard, Philippe Lemey, Gytis Dudas, Victoria Gregory, Alan J. Hay, John W. McCauley, Colin A. Russell, Derek J. Smith, Andrew Rambaut
(Submitted on 12 Apr 2013)

Influenza viruses undergo continual antigenic evolution allowing mutant viruses to evade immunity acquired by the host population to previous virus strains. Antigenic phenotype is often assessed through pairwise measurement of cross-reactivity between influenza strains using the hemagglutination inhibition (HI) assay. Here, we extend previous approaches to antigenic cartography, which seeks to place strains on an antigenic map, such that distances on this map best recapitulate titers observed across multiple HI assays. In our model, we simultaneously characterize antigenic and genetic evolution by including an evolutionary model in which antigenic location diffuses over a shared virus phylogeny. Using HI data for four lineages of influenza, encompassing influenza A subtypes H3N2 and H1N1, and influenza B lineages Victoria and Yamagata, we determine average rates of antigenic drift for each lineage, as well as year-to-year variability in the rate of drift. Through comparison with epidemiological data, we demonstrate a year-to-year correlation between drift and incidence and present evidence that antigenic drift mediates interference between influenza lineages. We investigate the selective underpinnings for differing antigenic dynamics across lineages and show that A/H3N2 benefits from both a higher influx of new antigenic mutations and also from more efficient conversion of antigenic variation into fixed differences. This work does much to elucidate the antigenic dynamics of influenza lineages, but also allows for substantial future advances in investigating the dynamics of influenza and other antigenically-variable pathogens by providing a model that intimately combines molecular and antigenic evolution.

Identifiability of a Coalescent-based Population Tree Model

Arindam RoyChoudhury
(Submitted on 12 Apr 2013)

Identifiability of evolutionary tree models has been a recent topic of discussion and some models have been shown to be non-identifiable. A coalescent-based rooted population tree model, originally proposed by Nielsen et al. 1998 [2], has been used by many authors in the last few years and is a simple tool to accurately model the changes in allele frequencies in the tree. However, the identifiability of this model has never been proven. Here we prove this model to be identifiable by showing that the model parameters can be expressed as functions of the probability distributions of subsamples. This is a step toward proving the consistency of the maximum likelihood estimator of the population tree based on this model.

Clusters of microRNAs emerge by new hairpins in existing transcripts

Antonio Marco, Maria Ninova, Matthew Ronshaugen, Sam Griffiths-Jones
(Submitted on 9 Apr 2013)

Genetic linkage may result in the expression of multiple products from a single polycistronic transcript, under the control of a single promoter. In animals, protein-coding polycistronic transcripts are rare. However, microRNAs are frequently clustered in the genomes of animals and plants, and these clusters are often transcribed as a single unit. The evolution of microRNA clusters has been the subject of much speculation, and a selective advantage of clusters of functionally related microRNAs is often proposed. However, the origin of microRNA clusters has not so far been systematically explored. Here we study the evolution of all microRNA clusters in Drosophila melanogaster, and suggest a number of models for their emergence. We observed that a majority of microRNA clusters arose by the de novo formation of new microRNA-like hairpins in existing microRNA transcripts. Some clusters also emerged by tandem duplication of a single microRNA. Comparative genomics shows that these clusters, once formed, are unlikely to split or undergo rearrangements. We did not find any instances of clusters appearing by rearrangement of pre-existing microRNA genes. We propose a model for microRNA cluster origin and evolution in which selection over one of the microRNAs in the cluster interferes with the evolution of the other tightly linked microRNAs. Our analysis suggests that the evolutionary study of microRNAs and other small RNAs must consider and account for linkage associations.

An algebraic framework to sample the rearrangement histories of a cancer metagenome with double cut and join, duplication and deletion events

Daniel R. Zerbino, Benedict Paten, Glenn Hickey, David Haussler
(Submitted on 22 Mar 2013)

Algorithms to study structural variants (SV) in whole genome sequencing (WGS) cancer datasets are currently unable to sample the entire space of rearrangements while allowing for copy number variations (CNV). In addition, rearrangement theory has up to now focused on fully assembled genomes, not on fragmentary observations on mixed genome populations. This affects the applicability of current methods to actual cancer datasets, which are produced from short read sequencing of a heterogeneous population of cells. We show how basic linear algebra can be used to describe and sample the set of possible sequences of SVs, extending the double cut and join (DCJ) model into the analysis of metagenomes. We also describe a functional pipeline which was run on simulated as well as experimental cancer datasets.

Natural selection reduced diversity on human Y chromosomes

Melissa A. Wilson Sayres, Kirk E. Lohmueller, Rasmus Nielsen
(Submitted on 20 Mar 2013)

The human Y chromosome exhibits surprisingly low levels of genetic diversity. This could result from neutral processes if the effective population size of males is reduced relative to females due to a higher variance in the number of offspring from males than from females. Alternatively, selection acting on new mutations, and affecting linked neutral sites, could reduce variability on the Y chromosome. Here, using genome-wide analyses of X, Y, autosomal and mitochondrial DNA, in combination with extensive population genetic simulations, we show that low observed Y chromosome variability is not consistent with a purely neutral model. Instead, we show that models of purifying selection are consistent with observed Y diversity. Further, the number of sites estimated to be under purifying selection greatly exceeds the number of Y-linked coding sites, suggesting the importance of the highly repetitive ampliconic regions. Because the functional significance of the ampliconic regions is poorly understood, our findings should motivate future research in this area.
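
The neutral baseline mentioned above can be made concrete with the standard textbook formulas for effective population size under unequal numbers of breeding males and females. The sketch below is only an illustration of that baseline, not the authors' simulations; the function names and the example numbers are assumptions made for this example.

```python
def effective_sizes(n_males, n_females):
    """Textbook effective population sizes for each chromosome type,
    given the effective numbers of breeding males and females."""
    nm, nf = float(n_males), float(n_females)
    return {
        "autosome": 4 * nm * nf / (nm + nf),
        "X": 9 * nm * nf / (4 * nm + 2 * nf),
        "Y": nm / 2,      # haploid, transmitted only through males
        "mtDNA": nf / 2,  # haploid, transmitted only through females
    }


def relative_to_autosome(n_males, n_females):
    """Expected neutral diversity of each chromosome type relative to
    autosomes, assuming equal mutation rates (a simplification)."""
    sizes = effective_sizes(n_males, n_females)
    return {name: size / sizes["autosome"] for name, size in sizes.items()}


if __name__ == "__main__":
    # Equal effective numbers of males and females.
    print(relative_to_autosome(500, 500))
    # Fewer effective males, e.g. from higher variance in male offspring number.
    print(relative_to_autosome(250, 750))
```

With equal effective numbers of males and females, the Y is expected to carry one quarter of autosomal diversity; reducing the effective number of males lowers that ratio further, which is the neutral explanation the abstract weighs against selection.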

Genomic evidence of rapid and stable adaptive oscillations over seasonal time scales in Drosophila

Alan O. Bergland, Emily L. Behrman, Katherine R. O’Brien, Paul S. Schmidt, Dmitri A. Petrov
(Submitted on 20 Mar 2013)

In many species, genomic data have revealed pervasive adaptive evolution indicated by the near fixation of beneficial alleles. However, when selection pressures are highly variable along a species range or through time, adaptive alleles may persist at intermediate frequencies for long periods. So-called balanced polymorphisms have long been understood to be an important component of standing genetic variation, yet direct evidence of the ubiquity of balancing selection has remained elusive. We hypothesized that environmental fluctuations between seasons in a North American orchard would impose temporally variable selection on Drosophila melanogaster and consequently maintain allelic variation at polymorphisms adaptively evolving in response to climatic variation. We identified hundreds of polymorphisms whose frequency oscillates among seasons and argue that these loci are subject to strong, temporally variable selection. We show that adaptively oscillating polymorphisms are often millions of years old, predating the divergence between D. melanogaster and D. simulans, and that a subset of these polymorphisms respond predictably to an acute frost event. Taken together, our results demonstrate that rapid temporal fluctuations in climate over generational scales are a predominant force that maintains adaptive alleles and promotes genetic diversity.
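
As a rough illustration of how seasonally alternating selection can make an allele frequency oscillate, here is a deterministic haploid toy model. It is not the authors' analysis: the selection coefficient, the number of generations per season, and the function names are assumptions made for this sketch, and the model ignores drift, dominance, and linkage.

```python
def next_frequency(p, relative_fitness):
    """One generation of haploid selection: the focal allele has the given
    fitness relative to the alternative allele (fitness 1)."""
    weighted = p * relative_fitness
    return weighted / (weighted + (1.0 - p))


def simulate(p0, s, gens_per_season, n_years):
    """Allele-frequency trajectory when the focal allele is favoured in one
    season (fitness 1 + s) and equally disfavoured in the other (1 / (1 + s))."""
    trajectory = [p0]
    p = p0
    for _ in range(n_years):
        for fitness in (1.0 + s, 1.0 / (1.0 + s)):
            for _ in range(gens_per_season):
                p = next_frequency(p, fitness)
                trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    traj = simulate(p0=0.5, s=0.1, gens_per_season=5, n_years=3)
    # Print the frequency at the end of each season.
    print([round(x, 3) for x in traj[::5]])
```

Because the two seasons are exactly symmetric here, the allele returns to its starting frequency at the end of every year. That symmetry is a modelling convenience; it shows the oscillation but does not by itself demonstrate long-term maintenance of the polymorphism in a finite population.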