Our paper: Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish

This guest post is by Ewan Birney on Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish, arXived here.

Our lab is part of a collaboration spanning Japan, two groups in Germany and EMBL-EBI in the UK, which put the paper “Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish” on the arXiv preprint server. Here I’ve taken up a kind invitation to post about this paper.

Before getting into the details of the paper, I should introduce Medaka, scientific name Oryzias latipes. Medaka is a small fish which lives in both fresh and brackish water across the majority of the Japanese archipelago (the exception is Hokkaido, the large northern island in Japan), the Korean peninsula and the eastern China region. It has been kept as a garden pet (and then in aquaria) in Japan for a long time, with colourful strains documented in Japanese art since the 17th century (this is a woodblock from the celebrated artist Ando Hiroshige (安藤 広重) depicting Goldfish and the much smaller Medaka fish – the medaka fish are the small horizontal shoals, not the colourful goldfish, sadly). It is widespread in the wild, in particular in rice paddies, hence its other common name, “the Japanese rice paddy fish”. After the rediscovery of Mendel’s work at the turn of the twentieth century, scientists in Japan started to use the established colourful strains to explore genetics. The most famous paper from this era is the first discovery in any species of crossover of the sex chromosomes, X-Y. Rather brilliantly this is an open access paper from Genetics (Aida T: On the Inheritance of Color in a Fresh-Water Fish, APLOCHEILUS LATIPES Temmick and Schlegel, with Special Reference to Sex-Linked Inheritance. Genetics 1921, 6:554-573. http://europepmc.org/articles/PMC1200522 Note: at the time, the systematics in this area of fish was different, hence the different genus name). Since this early genetics work, Medaka has been used for research both inside and outside Japan throughout the 20th century, with a well established linkage map, transgenic procedures, genome sequence, and considerable probing of different phenotypes.

In my experience of describing Medaka to European and American audiences, this is the point to answer the usual questions about similarities and contrasts with Zebrafish, the most commonly used laboratory fish in the West. The first important thing to realise is that Medaka is on a very different branch of the Teleost (bony fish) lineage from Zebrafish, separated by an estimated 250-300 million years of divergence – so the Medaka fish genome is only marginally closer to the Zebrafish genome than either is to mammals. One should expect quite different biological details in the two systems, and each system to be equally applicable to mammalian systems. Medaka is somewhat smaller than Zebrafish and will nearly always live in the same tank format as Zebrafish and in the same water system (many labs co-culture both Medaka and Zebrafish). Generation time is similar (6 weeks to 3 months). Zebrafish lay around 1,000 eggs in a single mating, which is a distinct advantage over Medaka’s clutch of around 30 eggs in a single mating, held to the female. However, whereas zebrafish mate only once per week, requiring a ‘recovery phase’, medaka mate every day. Thus the difference in fecundity over time is small. Both zebrafish and medaka have transparent chorions (egg shells); the embryo itself is also completely transparent in both species, making them ideal model systems to study development. Generally all techniques that have been established for one species are also applicable to the other, such as transgenesis by simple injection of suitable DNA vectors, or antisense morpholinos. Medaka fish genetics is cleaner, with many inbred laboratory lines maintained by single brother-sister mating, and thus very homozygous throughout. Finally, these medaka inbred lines are often made from wild-caught individuals, with an established breeding protocol to achieve homozygosity from wild individuals.

It is this last feature that we would like to leverage here. With an inbreeding protocol from the wild, one can set up a near-isogenic wild panel, similar to the panels that have already been developed in Arabidopsis and Drosophila. These panels are proving very informative for quantitative genetics in both of these fields. Once such a panel is established it is a powerful mapping resource and is one of the few ways to study gene-environment effects, as one can repeat phenotyping experiments over the same genotypes but in differing laboratory environments (say, high to low calorie diets, or different temperatures, or different small molecules added to the water). Being able to have such a panel with a vertebrate will be very powerful (Medaka fish has all the common cell types and tissues for a vertebrate – brain, heart, liver, muscle, gut, pancreas (both endocrine and exocrine), kidney). The main purpose of this paper is to find and characterise the source wild population for such a panel.

A good genetic panel ideally needs to be free of population structure. One also needs the right linkage disequilibrium (LD) properties. In previous decades there was a sweet spot for LD in the genome; the longer the LD the cheaper it was to genotype, but shorter LD gave better resolution of where a functional variant is. With the advent of cheap sequencing, this trade-off logic has changed to finding a population with short LD to have the best possible mapping resolution. Finally there are also practical aspects to the choice of population – one would like to be able to resample from the same region easily (for example, to add to the panel in the future). After looking at a number of sites where we assessed population properties via mitochondrial typing, we chose a site, Kiyosu (https://maps.google.com/maps?q=34.78113,+137.347928333056(Kiyosu%20Sampling%20Site)&iwloc=A&hl=en), close to Nagoya where the NIBB Medaka resource group under Kiyoshi Naruse is housed. From this population we caught a number of individuals and set up 8 breeding pairs. For each of the 8 pairs, we sequenced the two parents and one child (a “trio” in the parlance of genetics). We chose this sequencing structure as it means we can phase the parental genotypes using the child’s genotype, and in effect sample 16 haplotypes from this population.
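To make the trio logic concrete, here is a minimal sketch of how a child's genotype can phase its parents at a single biallelic site. It is an illustration only, not the pipeline used in the paper; the function names and the 0/1/2 genotype coding (count of alternative alleles) are assumptions made for this example.

```python
def alleles(genotype):
    """Alleles a parent could transmit, given its alt-allele count (0, 1 or 2)."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def phase_trio_site(mother, father, child):
    """Phase one biallelic site in a parent-parent-child trio.

    Returns ((mother_transmitted, mother_untransmitted),
             (father_transmitted, father_untransmitted)),
    or None when the site is ambiguous (e.g. all three heterozygous)
    or inconsistent with Mendelian inheritance.
    """
    # Every (maternal allele, paternal allele) pair that explains the child.
    options = {
        (m, f)
        for m in alleles(mother)
        for f in alleles(father)
        if m + f == child
    }
    if len(options) != 1:
        return None
    m_t, f_t = options.pop()
    # The untransmitted allele is whatever is left of the parental genotype.
    return (m_t, mother - m_t), (f_t, father - f_t)


# A heterozygous mother, homozygous-reference father and heterozygous child:
# the mother must have transmitted her alternative allele.
print(phase_trio_site(1, 0, 1))
# The triple-heterozygote case cannot be resolved from the trio alone.
print(phase_trio_site(1, 1, 1))
```

Sites that stay ambiguous at this step would in practice need read-based or statistical phasing, which this sketch does not attempt.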

From this we show we have a good population for an isogenic panel. There is no discernible population structure, both from a distance matrix perspective across all individuals, and from the lack of long LD in the population. As expected for a large teleost population, there are a relatively large number of variants, with a segregating SNP every 150 bp on average. In this limited sample the LD is, as expected, quite tight, with the correlation between SNPs (expressed as r²) dropping to “baseline” levels between 5 and 10 kb. From even this limited panel we estimate that almost 40% of SNPs would be mappable to a single exon. We expect this will improve in a more complete panel.
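For readers less familiar with the LD measure quoted above, the following is a small sketch of the standard r² calculation between two SNPs from phased haplotypes. The 0/1 allele coding and the function name are assumptions for the example, and the toy haplotypes are made up, not data from the paper.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a and hap_b hold the allele (0 or 1) carried at each SNP by the
    same ordered set of phased haplotypes.
    Returns None if either SNP is monomorphic in the sample.
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("need two equal-length, non-empty allele lists")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                 # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                                 # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                                 # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


# Toy haplotypes: the first pair of SNPs is perfectly correlated,
# the second pair is independent.
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))
```

An LD decay curve of the kind described above would come from averaging this quantity over SNP pairs binned by physical distance.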

To augment this population characterisation, we also performed some high throughput phenotyping, showing that the population has the expected phenotypic diversity driven by genetics. To do this we took advantage of a small number of existing inbred southern line strains. As the numbers are low here, we cannot map any phenotypes (this will require the panel), but we can get an estimate of broad sense heritability, i.e., the proportion of variance of a measurement explained by the differences between inbred lines compared to the differences between individuals of the same line. This is the sort of calculation which is easy to do on inbred panels, and harder to deconvolute in family or outbred cases. For 6 out of 7 traits we chose to measure we get reasonably high broad sense heritability measures.
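As a rough illustration of that calculation, here is a sketch of a broad sense heritability estimate from a balanced one-way ANOVA across inbred lines. This is a textbook estimator chosen for the example, and may differ from the exact model used in the paper; the function name, the requirement of equal numbers of fish per line, and the toy measurements are all assumptions.

```python
def broad_sense_heritability(measurements_by_line):
    """Estimate H^2 from a dict mapping inbred line name -> list of trait values.

    Uses balanced one-way ANOVA variance components:
    H^2 = var_between / (var_between + var_within).
    """
    groups = list(measurements_by_line.values())
    k = len(groups)                       # number of inbred lines
    n = len(groups[0])                    # individuals measured per line
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need >= 2 lines with the same number (>= 2) of individuals")
    grand_mean = sum(sum(g) for g in groups) / (k * n)
    line_means = [sum(g) / n for g in groups]
    ss_between = n * sum((m - grand_mean) ** 2 for m in line_means)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, line_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (k * n - k)
    # Between-line variance component; clamp at zero if noise dominates.
    var_between = max((ms_between - ms_within) / n, 0.0)
    total = var_between + ms_within
    return var_between / total if total > 0 else 0.0


# Hypothetical body-length measurements (arbitrary units) for three lines.
toy = {"line1": [10.1, 10.3, 9.9], "line2": [12.0, 12.2, 11.8], "line3": [11.0, 10.8, 11.2]}
print(broad_sense_heritability(toy))
```

Because every fish within a line is (nearly) genetically identical, the within-line variance is a direct read-out of non-genetic variation, which is why this is so much simpler than in family or outbred designs.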

We conclude that we have a good source wild population that is likely to lead to a successful near isogenic panel, and have started inbreeding. We are somewhere between the 3rd and 4th rounds of brother-sister inbreeding for 200 founder pairs, and all the lines look healthy. Traditionally one considers lines to be inbred after 8 brother-sister matings, so we’re almost half way through. Although in theory the generation time is 3 months, when one does this on 200 lines, the logistics mean it is closer to 4 to 5 months. Felix Loosli is overseeing the inbreeding at the KIT (Karlsruhe Institute of Technology), in Germany.
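To give a feel for why 8 rounds is the traditional threshold, here is a sketch of the classical recursion for the inbreeding coefficient F under repeated full-sib mating, F(t) = (1 + 2 F(t-1) + F(t-2)) / 4. It assumes the wild-caught founders are unrelated and not themselves inbred, which is an assumption of this example and not something measured in the paper.

```python
def full_sib_inbreeding(rounds):
    """Expected inbreeding coefficient after each round of brother-sister mating.

    Returns a list f where f[t] is the expected F of individuals produced by
    the t-th round of full-sib mating (f[0] = 0 for the offspring of the
    founder pair, assumed unrelated and non-inbred).
    """
    f_prev2, f_prev1 = 0.0, 0.0   # F two generations back, F one generation back
    f = [0.0]
    for _ in range(rounds):
        f_now = (1 + 2 * f_prev1 + f_prev2) / 4
        f.append(f_now)
        f_prev2, f_prev1 = f_prev1, f_now
    return f


for t, value in enumerate(full_sib_inbreeding(8)):
    print(t, round(value, 3))
```

Under these assumptions the expected F is about 0.5 to 0.6 after 3 to 4 rounds and a little over 0.8 after 8 rounds, so "almost half way" in rounds corresponds to somewhat more than half of the eventual homozygosity gain.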

In addition to the main thrust of this paper, we also look at the population genetics of Medaka, as this is a large wild catch with complete genome wide coverage of SNPs. For me the most interesting thing is the relationship with the Northern strain of Medaka. Japan has a large mountain range roughly running down the middle of the main island of Japan (called the “Japanese Alps” in English – 日本アルプス, Nihon Arupusu, in Japanese). The Medaka fish found to the north of this are phenotypically different (more heavily pigmented, and prefer to live in shallower tanks), but do interbreed in the laboratory with Southern strains. An open question is whether there has been any partial interbreeding (called introgression) in the wild. Using sensitive tests for introgression between lineages, first developed to detect the Human/Neanderthal interbreeding, we do not see evidence for wild interbreeding between the Northern and Southern strains. This lends support to the view that these two “strains” are best thought of as separate species, and one might expect there to have been specific selection events for the phenotypic properties in both strains.
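The best-known test in the family developed for the Human/Neanderthal question is Patterson's D (the "ABBA-BABA" statistic). The sketch below shows the basic counting version; whether this exact form matches the statistic in the paper is an assumption of this example, as are the function name and the toy site patterns. Each site is coded as (P1, P2, P3, outgroup) with 0 for the ancestral and 1 for the derived allele; for the question above, P1 and P2 would be two Southern samples and P3 a Northern sample.

```python
def patterson_d(sites):
    """Patterson's D from per-site allele patterns (P1, P2, P3, outgroup).

    ABBA sites (0, 1, 1, 0): P2 shares the derived allele with P3.
    BABA sites (1, 0, 1, 0): P1 shares the derived allele with P3.
    Without gene flow the two patterns are expected in equal numbers,
    so D is expected to be close to 0.
    Returns None if there are no informative sites.
    """
    abba = sum(1 for s in sites if tuple(s) == (0, 1, 1, 0))
    baba = sum(1 for s in sites if tuple(s) == (1, 0, 1, 0))
    if abba + baba == 0:
        return None
    return (abba - baba) / (abba + baba)


# Made-up patterns: two ABBA sites, one BABA site, one uninformative site.
toy_sites = [(0, 1, 1, 0), (1, 0, 1, 0), (0, 1, 1, 0), (1, 1, 0, 0)]
print(patterson_d(toy_sites))
```

In a real analysis the significance of D is judged with a block jackknife across the genome, which is omitted here.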

We welcome comments on this paper, but also more generally on the use of the future Kiyosu near isogenic wild panel. We will be completely open about data collection and distribution of data for this panel, wherever possible using global archives to minimise the complications in getting access to the data.

If you are a teleost biologist, the fact that one can co-culture Medaka and Zebrafish means that many phenotypic assays set up for Zebrafish are probably quite easily transferable to Medaka; there is an existing (small) set of inbred lines which you could look to develop assays on. If you are a molecular quantitative geneticist, this panel should have better mapping properties than human or mouse, but a full range of vertebrate cell types. I am looking forward to using some of the statistical techniques developed on Arabidopsis and Drosophila, in particular the gene/environment partitioning components, and if you are interested in gene/environment interactions, this panel will offer some unique opportunities. Of course, this is not a panacea – it takes investment in husbandry details and logistics to bring another species into any system, and quantitative genetics is just one way of exploring genetic effects – forward and reverse genetics are very powerful techniques (which, incidentally, have both been used in Medaka).

If you are interested, do contact Ewan Birney (birney@ebi.ac.uk) or Ian Dunham (dunham@ebi.ac.uk) on aspects of the genomics/variation, and Felix Loosli (felix.loosli@kit.edu), Jochen Wittbrodt (Jochen.Wittbrodt@cos.uni-Heidelberg.de) or Kiyoshi Naruse (naruse@nibb.ac.jp) on Medaka husbandry and molecular biology.


Our paper: The influence of relatives on the efficiency and error rate of familial searching

This guest post is by Rori Rohlfs on her paper (along with coauthors): Rohlfs et al. The influence of relatives on the efficiency and error rate of familial searching. arXived here.

One of the ways we in the U.S. (and elsewhere) are likely to encounter genetic technologies in our lives is through forensic DNA identification.  Without knowing a specific quantity, clearly a huge number of us encounter forensic uses of DNA through court cases using genetic evidence (as survivors, defendants, jury members, etc.), DNA sample seizure during a stop or arrest (currently being considered by the U.S. Supreme Court), or by being genetically related to someone in an offender or arrestee DNA database (>11 million profiles in U.S. national database).  Despite the social relevance of forensic uses of DNA, it seems to me that forensic genetics isn’t much discussed by the population and evolutionary genetics crowd these days.

A while back, I became interested in a newer forensic technique known as familial searching, particularly in how some pop gen assumptions affect outcomes.  Familial searching is performed in cases where police have some DNA evidence from an unknown individual they want to identify but have no leads.  First, they’ll search offender/arrestee DNA database(s) for someone with a matching genetic profile (which is verrrry unlikely between unrelated individuals with complete profiles), who they’d then investigate.  In some jurisdictions (where familial searching is legal or practiced without explicit policy), if there’s no complete profile match, they’ll search the database again for a partially matching profile.  The idea is that the partial match may be due to a close genetic relationship.  (Of course, two unrelated individuals could reasonably have partially matching profiles by chance.  More on that later.)  Again, depending on the policies of the jurisdiction, the relatives of some number of partially matching individuals are investigated.  In the most high-profile case of familial searching in the U.S., the suspected genetic relative was subject to surreptitious DNA collection (i.e. being followed until leaving a DNA sample (in that case, a pizza crust)).  Then this sample was tested directly against the original unknown sample and, since it showed a complete profile match, a suspect was identified.

Because familial searching effectively extends offender/arrestee databases to the genetic kin of people in the databases, it raises important questions like:

For a population geneticist, attempting to identify unknown genetic relatives of individuals in the database (rather than the known individuals in the database) introduces more uncertainty and some additional questions come up like:

  • With the genetic information in forensic profiles (typically 13-15 autosomal STRs, sometimes with 17 Y-chromosome STRs), what’s the chance that an unrelated individual coincidentally has a partially matching profile resembling a genetic relative?
  • What background allele and haplotype frequencies are considered in profile likelihood calculations?
  • What statistical methodology will be used to identify [specific?  non-specific?] genetic relatives?

All these questions are especially relevant when considering intense multiple testing introduced by the relevant databases (>1.4 million profiles in the California offender database).  It can be challenging to get a handle on these questions because of widely varying policies and methodology between jurisdictions.  In New York City, it seems that an error-prone ‘allelic matching’ technique has been used to attempt to identify relatives in at least one case of robbery, leading to investigations of unrelated individuals.  While in California, familial searching is used specifically in cold cases of violent crimes with a continuing threat to public safety and in 2011 Myers et al. published the likelihood ratio-based test statistic and procedure used in practice.

When I arrived at U.C. Berkeley for my postdoc, I met Monty Slatkin and Yun Song who, along with Erin Murphy, had attempted to estimate some error rates of familial searching, but were stymied by a lack of well-described methods currently used in practice.  When the statistical procedure used by California was published, we were excited to collaborate using practically relevant methodology.  Specifically, we estimated the false positive rate and power of familial searching using the California state procedure.  Generally, we found high power to detect a specified first-degree relationship (.79 to .99) and low (but still substantial in a multiple testing context) false positive rates of calling unrelated individuals as first-degree relatives (<5e-9 to 1e-5).  We got to thinking about more distant Y-chromosome-sharing relatives (half-siblings, cousins, second cousins) who (barring mutation) share Y-haplotypes and some portion of their autosomal STRs IBD.  We estimated that these distant relatives could be mistaken for close relatives fairly often: in our simulations, 14-42% of half-sibs and 3-18% of first cousins were misidentified as siblings.

These rates are non-trivial, especially if you consider the size of databases and the fact that there are more distant relatives than near (so distant relatives are more likely to be present in databases).  Further, some of these genetic relationships are not known (even to the individuals themselves) so are not useful to investigation, but may still be interpreted as evidence of familial involvement, leading to investigation of uninvolved individuals.  Lucky for us, our collaborator Erin Murphy has a background in law and thoughtfully outlined some of the practical ramifications in the introduction and discussion of our paper.  Not the least of which is how extended families and communities in groups which are over-represented in databases (perhaps most obviously African Americans and Latinos) would be disproportionately impacted by misidentification of distant relatives as near relatives.
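To see why a tiny per-comparison false positive rate still matters at database scale, here is a back-of-the-envelope sketch. It plugs in the figures quoted above (rates between 5e-9 and 1e-5, and a database of 1.4 million profiles) as round numbers; treating every database profile as an independent unrelated comparison is a simplifying assumption of this example, not a claim from the paper.

```python
def expected_false_positives(rate, database_size):
    """Expected number of unrelated profiles flagged as relatives in one search."""
    return rate * database_size


def prob_at_least_one(rate, database_size):
    """Chance that one search flags at least one unrelated profile,
    assuming independent comparisons."""
    return 1 - (1 - rate) ** database_size


database_size = 1_400_000  # round figure for the California offender database
for rate in (5e-9, 1e-5):
    print(rate,
          expected_false_positives(rate, database_size),
          prob_at_least_one(rate, database_size))
```

At the upper end of the range this works out to roughly 14 unrelated profiles flagged per search, while at the lower end a false hit in any single search is rare.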

We hope that this interdisciplinary manuscript broadens sorely needed technical and policy discussions of familial searching.

The Expected Linkage Disequilibrium in Finite Populations Revisited

Ulrike Ober, Alexander Malinowski, Martin Schlather, Henner Simianer
(Submitted on 17 Apr 2013)

The expected level of linkage disequilibrium (LD) in a finite ideal population at equilibrium is of relevance for many applications in population and quantitative genetics. Several recursion formulae have been proposed during the last decades, whose derivations mostly contain heuristic parts and therefore remain mathematically questionable. We propose a more justifiable approach, including an alternative recursion formula for the expected LD. Since the exact formula depends on the distribution of allele frequencies in a very complicated manner, we suggest an approximate solution and analyze its validity extensively in a simulation study. Compared to the widely used formula of Sved, the proposed formula performs better for all parameter constellations considered. We then analyze the expected LD at equilibrium using the theory on discrete-time Markov chains based on the linear recursion formula, with equilibrium being defined as the steady-state of the chain, which finally leads to a formula for the effective population size N_e. An additional analysis considers the effect of non-exactness of a recursion formula on the steady-state, demonstrating that the resulting error in expected LD can be substantial. In an application to the HapMap data of two human populations we illustrate the dependency of the N_e-estimate on the distribution of minor allele frequencies (MAFs), showing that the estimated N_e can vary by up to 30% when a uniform instead of a skewed distribution of MAFs is taken as a basis to select SNPs for the analyses. Our analyses provide new insights into the mathematical complexity of the problem studied.
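For context, the widely used formula of Sved mentioned in the abstract is usually written E[r²] ≈ 1 / (1 + 4 N_e c), with c the recombination fraction between the two loci. The sketch below implements that classical approximation and its inversion to estimate N_e; it is not the new recursion proposed in the paper, and the function names and example numbers are assumptions for illustration.

```python
def sved_expected_r2(n_e, c):
    """Sved's approximation for expected r^2 at equilibrium.

    n_e: effective population size; c: recombination fraction between loci.
    """
    return 1.0 / (1.0 + 4.0 * n_e * c)


def n_e_from_r2(r2, c):
    """Invert Sved's approximation to estimate N_e from an observed mean r^2."""
    if not 0 < r2 <= 1 or c <= 0:
        raise ValueError("need 0 < r2 <= 1 and c > 0")
    return (1.0 / r2 - 1.0) / (4.0 * c)


# With N_e = 1000 and c = 0.001, 4 * N_e * c = 4, so the expected r^2 is 1/5.
print(sved_expected_r2(1000, 0.001))
print(n_e_from_r2(0.2, 0.001))
```

Because N_e enters only through the observed mean r², anything that shifts that mean, such as the MAF distribution of the SNPs selected, feeds directly into the estimate.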

Thoughts on: Loss and Recovery of Genetic Diversity in Adapting Populations of HIV

This guest post is an exchange between Richard Neher and the authors of the preprint “Loss and Recovery of Genetic Diversity in Adapting Populations of HIV” (Pleuni Pennings, Sergey Kryazhimskiy, and John Wakeley). Below is the comment from Richard Neher, and then appended is the response from the authors of the study.

In this paper, the authors use a data set from a study of the anti-HIV drug efavirenz. This drug has a fairly stereotypic resistance profile which in most cases involves a mutation at amino acid 103 of the reverse transcriptase of HIV (K103N). The authors examine sequences from patients after treatment failure (drug resistant virus) and observe that in a large fraction of the cases, the drug resistance mutation K103N is present on multiple genetic backgrounds or in both of the possible codons for asparagine. This suggests frequent soft sweeps, i.e., evolution of drug resistance is not limited by the waiting time for a point mutation.

The observation of frequent soft sweeps allows one to put a lower bound on the product of population size and mutation rate. Since the mutation rate is on the order of 1e-5, the lower bound for the population size is around N>1e5. The authors suggest that the fact that not all patients exhibit obvious soft sweeps can be used to deduce an upper bound on N. However, one has to realize that the patient sample is heterogeneous, that additional drugs are used along with efavirenz, and that most likely additional mutations have swept through the population. Multiple soft sweeps in rapid succession will look like hard sweeps. The lower bound makes a lot of sense and does away with a long held erroneous belief that the “effective” HIV population within an infected individual is small.

The debate about the size of the HIV population has some interesting history. In the mid-1990s, it was estimated that roughly 1e7 cells are infected by HIV within a chronically infected patient every day. Virologists studying HIV evolution concluded that every point mutation is explored many times every day (see Coffin, Science, 1995, http://www.ncbi.nlm.nih.gov/pubmed/7824947), which was consistent with the frequent failure of mono-therapy, i.e., therapy with only one drug. Around the same time, it was observed that HIV sequences within a patient typically have a common ancestor about 3 years ago, which translates into roughly 500-1000 generations. Population geneticists then started to convince people that this rapid coalescence corresponds to an “effective” population size of the order of 1000, and that this explains the observed stochasticity of HIV evolution. Not everybody was convinced and some went to great trouble to show that very rare alleles matter and that the population size is large, see for example http://jvi.asm.org/content/86/23/12525. In this paper, failure of efavirenz therapy is studied in monkeys. Despite the fact that the resistance mutations were at frequencies below 1e-5 before treatment, both codons for asparagine at position 103 are observed a few days after treatment. Via a similar argument to that in the above paper, the authors conclude that the population size is large.
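The arithmetic behind "every point mutation is explored many times every day" and "500-1000 generations" can be spelled out in a few lines. This sketch uses the order-of-magnitude figures quoted in this comment (1e7 newly infected cells per day, a mutation rate of about 1e-5, 3 years to the common ancestor); using a single rate for any given point mutation is a simplifying assumption of the example.

```python
def new_mutants_per_day(infected_cells_per_day, mutation_rate):
    """Expected number of newly infected cells per day carrying a given point mutation."""
    return infected_cells_per_day * mutation_rate


def days_per_generation(years, generations):
    """Generation time implied by a coalescence time in years and in generations."""
    return years * 365.0 / generations


# Roughly 100 fresh copies of any given point mutation per day.
print(new_mutants_per_day(1e7, 1e-5))
# 3 years spanning 1000 or 500 generations implies about 1 to 2 days per generation.
print(days_per_generation(3, 1000), days_per_generation(3, 500))
```

Seen this way, the tension is clear: a census-scale supply of about 100 new mutants per day sits awkwardly next to an "effective" size of about 1000.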

There is very little reason to believe that coalescence in HIV is driven by short term offspring number fluctuations (drift). Instead, the coalescence is most likely driven by selection in which case the time scale of coalescence depends weakly on the population size (see e.g. http://arxiv.org/abs/1302.1148).

The tendency of population geneticists to map everything to a neutral model has in this case of HIV produced much confusion. This confusion would be easily avoided if people were willing to give up the concept of effective population size and simply call the time scale of coalescence what it is.

Response from Pennings et al.

Hi Richard,

Thanks a lot for your detailed comments.
We agree with most of your analysis, but let us explain why we believe that our estimate of the effective population size is not just a lower bound.
We use the observation of “soft” and “hard” sweeps to estimate the effective population size of the virus in a patient.
In 7 patients, a mixture of AAC and AAT alleles at the 103rd codon of RT replaced the wildtype, whereas in 10 patients either the AAT or the AAC allele replaced the wildtype.
When we only observe the AAC allele, it is still possible that this allele has several origins (the mutation from AAA to AAC occurred multiple times). This possibility is included in our analysis.
In addition, it is possible that originally, a mixture of AAC and AAT replaced the wildtype, but a subsequent sweep (at another site) removed one of the two alleles from the population (or reduced its frequency so that it doesn’t appear in the sample). You suggest that this process can explain our observation of hard sweeps.
We agree that this is a theoretical possibility, but we believe that our original interpretation is more parsimonious.

First, sweeps may be occurring regularly in all patients. In this case we do not expect any differences in diversity reduction in patients where the last sweep happened at a drug resistance codon versus patients in which another sweep was the last. Our data do not support this picture, because ongoing sweeps in all patients are not compatible with a significant and substantial reduction in diversity in the patients whose virus fixed a resistance mutation. Hence non-drug resistance related sweeps with a strong effect on diversity must be relatively rare in the viral populations.

We have plotted the reduction in diversity in intervals without the fixation of a resistance mutation and in intervals with the fixation of a resistance mutation. The 10 patients in which only the AAC or AAT allele was observed are highlighted in red. The reduction of diversity in these intervals is quite severe and such severe reductions are not observed in the intervals without the fixation of a resistance mutation.


Second, it may be possible that, for some reason, the patients in which we see a hard sweep at site 103 actually had two or more sweeps (with the sweep at site 103 not being the last one) while patients in which we see a soft sweep had only one soft sweep at site 103. Then, indeed, the former set of patients would have a larger reduction in diversity than the latter set of patients, and this difference in reduction would NOT be due to the fact that the former patients received only one resistance mutation. One potential scenario in which this could happen is if the time intervals during which the sweep occurred are systematically longer in patients in which we observe hard sweeps. However, this is not the case, see figure.


Another scenario is if there is a specific structure of epistasis among mutations in HIV. In particular, after the 103 mutation has fixed, another mutation or mutations become available which were not available before the K103N swept. These could be compensatory mutations, for example. In this case, in all patients there was a soft sweep at site 103. Following that, in some patients, the secondary mutation occurred and swept quickly, but in others it didn’t (just by chance). In those patients where it did occur and sweep, we see a larger reduction in diversity (including site 103) due to this secondary sweep. However, this would mean that the populations are limited by the supply of this secondary mutation rather than the K103N mutation, which seems unlikely (especially considering that after the K103N mutation the population size would likely have gone up). Also, if this were the case, the mutations that lead to the second sweep must occur relatively far away from the K103N site, otherwise they would likely have been discovered.

Finally, it can be that what looks like hard sweeps are indeed hard sweeps. We believe that this is, with our current knowledge, the most parsimonious explanation of our observations. Hence, the effective population size of the virus cannot be very large. This explanation is also compatible with the observation that resistance does not evolve in all patients.

Pleuni and Sergey (John is on vacation)

High-speed and accurate color-space short-read alignment with CUSHAW2

Yongchao Liu, Bernt Popp, Bertil Schmidt
(Submitted on 17 Apr 2013)

Summary: We present an extension of CUSHAW2 for fast and accurate alignments of SOLiD color-space short-reads. Our extension introduces a double-seeding approach to improve mapping sensitivity, by combining maximal exact match seeds and variable-length seeds derived from local alignments. We have compared the performance of CUSHAW2 to SHRiMP2 and BFAST by aligning both simulated and real color-space mate-paired reads to the human genome. The results show that CUSHAW2 achieves comparable or better alignment quality compared to SHRiMP2 and BFAST at an order-of-magnitude faster speed and significantly smaller peak resident memory size. Availability: CUSHAW2 and all simulated datasets are available at this http URL Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de

XORRO: Rapid Paired-End Read Overlapper

Russell J. Dickson, Gregory B. Gloor
(Submitted on 16 Apr 2013)

Background: Computational analysis of next-generation sequencing data is outpaced by data generation in many cases. In one such case, paired-end reads can be produced from the Illumina sequencing method faster than they can be overlapped by downstream analysis. The advantages in read length and accuracy provided by overlapping paired-end reads demonstrate the necessity for software to efficiently solve this problem.
Results: XORRO is an extremely efficient paired-end read overlapping program. XORRO can overlap millions of short paired-end reads in a few minutes. It uses 64-bit registers with a two bit alphabet to represent sequences and does comparisons using low-level logical operations like XOR, AND, bitshifting and popcount.
Conclusions: As of the writing of this manuscript, XORRO provides the fastest solution to the paired-end read overlap problem. XORRO is available for download at: sourceforge.net/projects/xorro-overlap/
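As a rough illustration of the bit-level idea described in the Results, the sketch below packs bases into a two-bit alphabet and counts mismatches between two equal-length sequences using XOR, a shift, a mask and a population count. It is not XORRO's code: the particular base-to-bits mapping and the function names are assumptions, and Python's unbounded integers stand in for the 64-bit registers (32 bases each) the abstract mentions.

```python
# Assumed two-bit encoding; any fixed assignment of distinct codes works.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]
    return value


def mismatches(seq_a, seq_b):
    """Count mismatching positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 0
    diff = pack(seq_a) ^ pack(seq_b)          # non-zero bit pair wherever bases differ
    low_bits = int("01" * len(seq_a), 2)      # mask selecting the low bit of each pair
    collapsed = (diff | (diff >> 1)) & low_bits  # one set bit per mismatching base
    return bin(collapsed).count("1")          # popcount


print(mismatches("ACGT", "ACGA"))
print(mismatches("AAAA", "TTTT"))
```

A read overlapper would apply a comparison like this at each candidate offset between one read and the reverse complement of its mate, keeping the offset with the fewest mismatches; that search loop is left out here.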

GEMINI: integrative exploration of genetic variation and genome annotations

Uma Paila, Brad Chapman, Rory Kirchner, Aaron Quinlan
(Submitted on 17 Apr 2013)

Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or variant prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate the utility of GEMINI for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics.

Our paper: Clusters of microRNAs emerge by new hairpins in existing transcripts

This guest post is by Antonio Marco (@antonio_marco_c) on his paper Marco et al. Clusters of microRNAs emerge by new hairpins in existing transcripts arXived here.

Our paper:

MicroRNAs are short regulatory sequences involved in virtually all biological processes. MicroRNAs are often organized in genomic clusters that produce polycistronic transcripts. It is well-known that protein-coding polycistronic transcripts are almost absent in animals (with a few exceptions in nematodes and ascidians). So where do these microRNA clusters come from, and why are they so prevalent? We tackle these questions in our paper “Clusters of microRNAs emerge by new hairpins in existing transcripts”, recently deposited in arXiv.

We envisioned several possible scenarios for the origin of polycistronic microRNAs: First, polycistronic microRNAs can emerge by genomic rearrangements that bring together pre-existing microRNAs. As in bacterial operons, the clustering of microRNAs with related functions can be advantageous, and the fusion of related microRNAs may be positively selected. We call this the ‘put together’ model. Alternatively, multiple microRNAs could become polycistronic as a by-product of genome reduction (this is analogous to Caenorhabditis elegans operons). This is the ‘left together’ model. A third model, called ‘tandem duplication’, implies that polycistronic microRNAs emerge by tandem duplication of single sequences. Lastly, new microRNAs can emerge de novo in already existing microRNA transcripts. We named this the ‘new hairpin’ model, since a novel microRNA first requires the formation of a hairpin-like structure in the transcript.

By reconstructing the evolutionary history of Drosophila melanogaster microRNAs we observed that the majority of microRNA clusters emerged by the formation of new microRNA precursors in existing transcribed microRNA genes (‘new hairpin’ model). We also found that gene duplication generated a minority of the clusters (‘tandem duplication’). However, we didn’t see any instance of fusion of pre-existing microRNA genes. Moreover, clusters rarely split or suffer rearrangements. Once a microRNA cluster is formed, it stays as a cluster or it is lost as a whole.

We propose a model for the origin and evolution of microRNA clusters. Polycistronic microRNAs are an extreme case of genetic linkage, in which a microRNA is typically a few nucleotides away from another microRNA. Once a cluster is formed, the linkage is so tight that recombination is dramatically reduced between these loci. We suggest that, because of strong selective interference between loci (Hill-Robertson effect), a microRNA under selective pressure strongly influences the evolutionary fate of any neighbouring microRNA. Even slightly deleterious microRNAs may be maintained in a population if selection in one microRNA of the cluster is strong enough. Currently, we are analysing polymorphism data to test the validity of our model in actual Drosophila populations.

In summary, we suggest that clusters of microRNAs emerge by non-adaptive mechanisms and they are maintained as a consequence of tight linkage.

Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish

Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish
Mikhail Spivakov, Thomas O. Auer, Ravindra Peravali, Ian Dunham, Dirk Dolle, Asao Fujiyama, Atsushi Toyoda, Tomoyuki Aizu, Yohei Minakuchi, Felix Loosli, Kiyoshi Naruse, Ewan Birney, Joachim Wittbrodt
(Submitted on 16 Apr 2013)

Background: Oryzias latipes (Medaka) has been established as a vertebrate genetic model for over a century, and has recently been rediscovered outside its native Japan. The power of new sequencing methods now makes it possible to reinvigorate Medaka genetics, in particular by establishing a near-isogenic panel derived from a single wild population.
Results: Here we characterise the genomes of wild Medaka catches obtained from a single Southern Japanese population in Kiyosu as a precursor for the establishment of a near-isogenic panel of wild lines. The population is free of significant detrimental population structure, and has advantageous linkage disequilibrium properties suitable for establishment of the proposed panel. Analysis of morphometric traits in five representative inbred strains suggests phenotypic mapping will be feasible in the panel. In addition, high-throughput genome sequencing of these Medaka strains confirms their evolutionary relationships on lines of geographic separation and provides further evidence that there has been little significant interbreeding between the Southern and Northern Medaka populations since the Southern/Northern population split. The sequence data suggest that the Southern Japanese Medaka existed as a larger older population which went through a relatively recent bottleneck around 10,000 years ago. In addition, we detect patterns of recent positive selection in the Southern population.
Conclusions: These data indicate that the genetic structure of the Kiyosu Medaka samples is suitable for the establishment of a vertebrate near-isogenic panel and therefore inbreeding of 200 lines based on this population has commenced. Progress of this project can be tracked at this http URL

Reducing assembly complexity of microbial genomes with single-molecule sequencing

Reducing assembly complexity of microbial genomes with single-molecule sequencing
Sergey Koren, Gregory P Harhay, Timothy PL Smith, James L Bono, Dayna M Harhay, D. Scott Mcvey, Diana Radune, Nicholas H Bergman, Adam M Phillippy
(Submitted on 13 Apr 2013)

Background: The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem.
Results: To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These assemblies are also comparable in accuracy to hybrid assemblies including second-generation data.
Conclusions: Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to below $2,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of complete genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.