Our paper: Integrating influenza antigenic dynamics with molecular evolution

This guest post is by Trevor Bedford (@trvrb) on his paper (along with coauthors): Bedford et al. Integrating influenza antigenic dynamics with molecular evolution arXived here.

The influenza virus shows a remarkable capacity to evolve to escape human immunity. Many other viruses, like measles, do not have this capacity. After infection with measles, a person gains life-long immunity to the virus, and hence measles has become constrained to be a childhood infection. Continual antigenic evolution in influenza necessitates frequent vaccine updates to provide sufficient protection to circulating strains.

Antigenic differences between strains are commonly quantified using the hemagglutination inhibition (HI) assay, which measures the ability of antibodies created against one strain to interfere with virus from another strain. The resulting HI data is represented as a sparse matrix of comparisons between viruses from strains A, B, C… and sera from strains X, Y, Z… Taken by itself, this matrix is difficult to work with. Experienced virologists can pick up the loss of reactivity between groups of viruses in the noisy HI data, but these patterns are not fully quantified.

In our new paper, available on the arXiv, we extend techniques of multidimensional scaling (MDS) pioneered by Derek Smith and colleagues for the analysis of influenza antigenic data. Here, we attempted to bring the MDS antigenic model into a fully Bayesian framework and refer to the revised technique as Bayesian MDS (BMDS). In this model, viruses and sera are represented as 2D coordinates on an antigenic map in which their pairwise distances yield expectations for the HI titers, with antigenically similar viruses lying close to one another and antigenically distant viruses lying far apart.

By placing antigenic cartography in a Bayesian context, we are able to integrate other data sources, most notably sequence data. In this case, genetic sequences provide an evolutionary tree relating virus strains and we assume that antigenic location evolves along this tree in a 2D diffusion process. This process imposes a prior on antigenic locations in which evolutionary similar viruses have a prior expectation of lying close to one another on the map. In the paper, we use this BMDS / diffusion model to investigate patterns of antigenic evolution in 4 circulating lineages of influenza and show that antigenic drift determines to a large degree incidence patterns across time and across lineages.

The paper is also up on GitHub, which I’ll keep updating as it goes through the review process. The BMDS model is implemented in the software package BEAST and is available in the latest source code. I hope to provide tutorials on running the BMDS model in the not-to-distant future.

Our paper: Inferring non-neutral regulatory change in pathways from transcriptional profiling data

This post is by Josh Schraiber on his paper (along with coauthors): Schraiber et al. Inferring non-neutral regulatory change in pathways from transcriptional profiling data arXived here.

We’ve known for a long time now that gene sequence alone does not determine phenotype. From the trivial example of differentiated cell types (which all have the same DNA) to now-common examples where species adapt to their environment by changing something other than protein-coding sequence, it’s clear that the expression level of a gene plays just as important a role in phenotypic development as does its sequence. Despite this fact, we still lack the kinds of tools that are widely available for detecting non-neutral evolution at the level of gene expression (in packages like PAML). Part of this problem lies in a fundamental lack of power. A single gene may have hundreds of sites, and the patterns that occur at all of those sites give us plenty of information to learn about accelerated substation rates and the like. But a gene (in a given environment) has just one expression level, so the sample size is often small and power is reduced.

This same problem occurs, of course, in phylogenetic studies of quantitative characters at the organismal level. The difference is that in those cases, researchers typically have access to tens, if not hundreds, of species with good quality measurements. Unfortunately, transcriptome-wide gene expression data can be difficult and costly to collect, so large-scale studies are few and far between.

Instead of trying to leverage large collections of species, we sought to utilize one of the benefits of transcriptome-wide profiles: data from lots and lots of genes. A common practice in molecular evolution is to run tests for selection on a gene-by-gene basis and then look for functional groups that are overrepresented (e.g. Gene Ontology enrichment). We turned that around and instead started with a priori defined gene groups (in our case, from Gene Ontology), looking to detect signal for a history of lineage-specific gene expression evolution, by jointly analyzing all the genes in a group simultaneously.

Doing this would potentially run into a problem of overfitting: should we try to fit a separate rate of evolution for each gene in the group? Instead, we borrowed a page from Ziheng Yang’s book and assumed that the rate of evolution across genes was inverse-gamma distributed. We chose this distribution mostly for for computational convenience, but it is important to note that it can cover a wide range of possibilities—from a model in which every gene evolves at the same rate to a distribution so fat-tailed that there is no average rate of evolution across the group! By fitting a distribution of rates across genes in a group, we are able to look for examples of lineage-specific evolution without being confounded by outlying genes.

We encourage you to check out our paper and let us know what you think
of our approach. In addition, our method will soon be available as an
R package (once I get around to doing all the documentation…) and we
would love to see people using it. If you are interested in getting an
early version of our package, please don’t hesitate to contact me:
jgschraiber@berkeley.edu.

Our paper: Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish

This guest post is by Ewan Birney on Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish, arXived here.

Our lab is part of a collaboration spanning Japan, two groups in Germany and EMBL-EBI in the UK which put the paper “Genomic and phenotypic characterisation of a wild Medaka population: Establishing an isogenic population genetic resource in fish” on to the arXive pre-print server. Here I’ve taken up a kind invitation to post about this paper.

birney_post
Before getting into the details of the paper, I should introduce Medaka, scientific name, Oryzias latipes. Medaka is a small fish which lives in both fresh and brackish water across the majority of the Japanese archipelago (the exception is Hokkaido, the large northern island in Japan), the Korean peninsular and the eastern China region. It has been kept as a garden pet (and then in aquaria) in Japan for a long time, with documented colourful strains in Japanese art since 17th Century (this is a woodblock from the celebrated artist, Ando Hiroshige, (安藤 広重) depicting Goldfish and, the much smaller, Medaka fish – the medaka fish are the small horizontal shoals, not the colourful goldfish sadly). It is widespread in the wild, in particular in rice paddies, hence its other common name, “the Japanese rice paddy fish”. After the rediscovery of Mendel’s work at the turn of the twentieth century scientists in Japan started to use the established colourful strains to explore genetics. The most famous paper from this era is the first discovery in any species of crossover of the sex chromosomes, X-Y. Rather brilliantly this is an open access paper from Genetics (Aida T: On the Inheritance of Color in a Fresh-Water Fish, APLOCHEILUS LATIPES Temmick and Schlegel, with Special Reference to Sex-Linked Inheritance. Genetics 1921, 6:554-573. http://europepmc.org/articles/PMC1200522 Note: at the time, the systematics in this area of fish was different, hence the different genus name). Since this early genetics, Medaka has been used for research both inside and outside Japan over the 20th Century, with a well established linkage map, transgenic procedures, genome sequence, and considerable probing of different phenotypes.

My experience in describing Medaka to European and American audiences is to now answer the usual questions about similarity and contrasts with Zebrafish, the most commonly used laboratory fish in the West. The first important thing to realise is that Medaka is on a very different branch of the Teleost (bony fish) lineage from Zebrafish, separated by an estimated 250-300 million years of divergence – so the Medaka fish genome is only marginally closer to the Zebrafish genome than either are to mammals. One should expect quite different biological details in the two systems, and each system to be equally applicable to mammalian systems. Medaka is somewhat smaller than Zebrafish and nearly always will live in the same tank format as Zebrafish and the same water system (many labs co-culture both Medaka and Zebrafish). Generation time is similar (6 weeks ~ 3 months). Zebrafish lay around 1,000 eggs in a single mating, which is a distinct advantage to Medaka’s clutch of around 30 eggs in a single mating, held to the female. However whereas zebrafish mate only once per week requiring a ‘recovery phase’ , medaka mate every day. Thus the difference in fecundity over time is small. Both zebrafish and medaka have transparent chorions (egg shells); also the embryo itself is completely transparent in both species rendering them ideal model systems to study development. Generally all techniques that have been established for one species are also applicable for the other, such as transgenesis by simple injection of suitable DNA vectors or antisense morphlinos. Medaka fish genetics is cleaner, with many inbred laboratory lines maintained by single brother-sister mating, and thus very homozygous throughout. Finally these medaka inbred lines are often made from wild-catch individuals, with an established breeding protocol to achieve homozygosity from wild individuals.

It is this last feature that we would like leverage here. With an inbreeding protocol from the wild, one can set up a near-isogenic wild panel, similar to the panels that have already been developed in Arabidopsis and Drosophila. These panels are proving very informative for quantitative genetics in both of these fields. Once such a panel is established it is a powerful mapping resource and is one of the few ways to study gene-environment effects as one can repeat phenotyping experiments over the same genotypes but in differing laboratory environments (say, high to low calorie diets, or different temperatures, or different small molecules added to the water). Being able to have such a panel with a vertebrate will be very powerful (Medaka fish has all the common cell types and tissues for a vertebrate – brain, heart, liver, muscle, gut, pancreas – both endocrine and exocrine, kidney.) The main purpose of this paper is to find and characterise the source wild population for such a panel.

A good genetic panel needs ideally to be free of population structure. One also needs the right linkage disequilibrium (LD) properties. In previous decades there was a sweet spot for LD in the genome; the longer the LD the cheaper it was to genotype, but shorter LD gave better resolution of where a functional variant is. With the advent of cheap sequencing, this trade off logic has changed to finding a population with short LD to have the best possible mapping resolution. Finally there are also practical aspects for choice of population – one would like to be able to resample from the same region easily (for example, to add to the panel in the future). After looking at a number of sites where we assessed population properties via mitochondrial typing, we choose a site, Kiyosu (https://maps.google.com/maps?q=34.78113,+137.347928333056(Kiyosu%20Sampling%20Site)&iwloc=A&hl=en), close to Nagoya where the NIBB Medaka resource group under Kiyoshi Naruse is housed. From this population we caught a number of individuals and set up 8 breeding pairs. For each of the 8 pairs, we sequenced the two parents and one child (a “trio” in the parlance of genetics). We choose this sequencing structure as it means we can phase the parental genotypes using the child’s genotype, and in effect sample 16 haplotypes from this population.

From this we show we have a good population for an isogenic panel. There is no discernable population structure, both from a distance matrix perspective across all individuals, and the lack of long LD in the population. As expected for a large teleost population, there are a relatively large number of variants, with a segregating SNP every 150bp on average. In this limited sample the LD is, as expected, quite tight, with the correlation between SNPs (expressed as r2) dropping to “baseline” levels between 5 – 10 KB . From even this limited panel we estimate that almost 40% of SNPs would be mappable to a single exon. We expect this will improve in a more complete panel.

To augment this population characterisation, we also performed some high throughput phenotyping, showing that the population has expected phenotypic diversity driven by genetics. To do this we took advantage of a small number of existing inbred southern line strains. As the numbers are low here, we cannot map any phenotypes (this will require the panel), but we can get an estimate of broad sense heritability, i.e., the proportion of variance of a measurement explained by the differences between inbred lines compared to the differences between individuals of the same line. This is the sort of calculation which is easy to do on inbred panels, and harder to deconvolute on family or outbred cases. For 6 out 7 traits we chose to measure we get reasonably high broad sense heritability measures.

We conclude that we have a good source wild population that is likely to lead to a successful near isogenic panel, and have started inbreeding. We are somewhere between the 3rd to the 4th round of brother-sister inbreeding for 200 founder pairs, and all the lines look healthy. Traditionally one considers lines to have inbred after 8 brother-sister matings, so we’re almost half way through. Although in theory the generation time is 3 months, when one does this on 200 lines, the logistics means it is closer to 4 to 5 months. Felix Loosli is overseeing the inbreeding at the KIT (Karlsruhe Institute for Technology), in Germany.

In addition to the main thrust of this paper, we also look at the population genetics of Medaka, as this is a large wild catch with complete genome wide coverage of SNPs. For me the most interesting thing is the relationship with the Northern strain of Medaka. Japan has a large mountain range roughly running down the middle of the main island of Japan (called the “Japanese Alps” in English –日本アルプス Nihon Arupusu in Japanese). The Medaka fish found to the north of this are phenotypically different (more heavily pigmented, and prefer to live in shallower tanks), but do interbreed in the laboratory with Southern strains. An open question is whether there has been any partial interbreeding (called introgression) in the wild. Using sensitive tests for introgression between lineages, first developed to detect the Human/Neanderthal interbreeding, we do not see evidence for wild interbreeding between the Northern and Southern strains. This lends support that these two “strains” are best thought of as separate species, and one might expect there to have been specific selection events for the phenotypic properties in both strains.

We welcome comments on this paper, but also more generally on the use of the future Kiyosu near isogenic wild panel. We will be completely open about data collection and distribution of data for this panel, wherever possible using global archives to minimise the complications in getting access to the data.

If you are a teleost biologist, the fact that one can co-culture Medaka and Zebrafish means that many phenotypic assays set up for Zebrafish are probably quite easily transferable to Medaka; there is an existing (small) set of inbred lines which you could look to develop assays on. If you are a molecular quantitative geneticist, this panel should have better mapping properties than human or mouse, but a full range of vertebrate cell types. I am looking forward to using some of the statistical techniques developed on Arabidopsis and Drosophila, in particular the gene/environment partitioning components, and if you are interested in gene/environment interactions, this panel will have some unique opportunities. Of course, this is not a panacea – it takes investment in husbandry details and logistics to bring in another species into any system, and quantitative genetics is just one way of exploring genetic effects – forward and reverse genetics are very powerful techniques (which, incidentally have both been used in Medaka).

If you are interested do contact Ewan Birney (birney@ebi.ac.uk) or Ian Dunham (dunham@ebi.ac.uk) on aspects of the genomics/variation, and Felix Loosli (felix.loosli@kit.edu) , Jochen Wittbrodt (Jochen.Wittbrodt@cos.uni-Heidelberg.de) or Kiyoshi Naruse (naruse@nibb.ac.jp) on Medaka husbandry and molecular biology.

Our paper: The influence of relatives on the efficiency and error rate of familial searching

This guest post is by Rori Rohlfs on her paper (along with coauthors): Rohlfs et al. The influence of relatives on the efficiency and error rate of familial searching. arXived here.

One of the ways we in the U.S. (and elsewhere) are likely to encounter genetic technologies in our lives is through forensic DNA identification.  Without knowing a specific quantity, clearly a huge number of us encounter forensic uses of DNA through court cases using genetic evidence (as survivors, defendants, jury members, etc.), DNA sample seizure during a stop or arrest (currently being considered by the U.S. Supreme Court), or by being genetically related to someone in an offender or arrestee DNA database (>11 million profiles in U.S. national database).  Despite the social relevance of forensic uses of DNA, it seems to me that forensic genetics isn’t much discussed by the population and evolutionary genetics crowd these days.

A while back, I became interested in a newer forensic technique known as familial searching, particularly in how some pop gen assumptions affect outcomes.  Familial searching is performed in cases where police have some DNA evidence from an unknown individual they want to identify but have no leads.  First, they’ll search offender/arrestee DNA database(s) for someone with a matching genetic profile (which is verrrry unlikely between unrelated individuals with complete profiles), who they’d then investigate.  In some jurisdictions (where familial searching is legal or practiced without explicit policy), if there’s no complete profile match, they’ll search the database again for a partially matching profile.  The idea being that the partial match may be due to a close genetic relationship.  (Of course, two unrelated individuals could reasonably have partially matching profiles by chance.  More on that later.)  Again, depending on the policies of the jurisdiction, the relatives of some number of partially matching individuals are investigated.  In the most high-profile case of familial searching in the U.S., the suspected genetic relative was subject to surreptitious DNA collection (i.e. being followed until leaving a DNA sample (in that case, a pizza crust)).  Then this sample was tested directly against the original unknown sample, and showing a complete profile match a suspect was identified.

Because familial searching effectively extends offender/arrestee databases to the genetic kin of people in the databases, it raises important questions like:

For a population geneticist, attempting to identify unknown genetic relatives of individuals in the database (rather than the known individuals in the database) introduces more uncertainty and some additional questions come up like:

  • With the genetic information in forensic profiles (typically 13-15 autosomal STRs, sometimes with 17 Y-chromosome STRs), what’s the chance that an unrelated individual coincidentally has a partially matching profile resembling a genetic relative?
  • What background allele and haplotype frequencies are considered in profile likelihood calculations?
  • What statistical methodology will be used to identify [specific?  non-specific?] genetic relatives?

All these questions are especially relevant when considering intense multiple testing introduced by the relevant databases (>1.4 million profiles in the California offender database).  It can be challenging to get a handle on these questions because of widely varying policies and methodology between jurisdictions.  In New York City, it seems that an error-prone ‘allelic matching’ technique has been used to attempt to identify relatives in at least one case of robbery, leading to investigations of unrelated individuals.  While in California, familial searching is used specifically in cold cases of violent crimes with a continuing threat to public safety and in 2011 Myers et al. published the likelihood ratio-based test statistic and procedure used in practice.

When I arrived at the U.C. Berkeley for my postdoc, I met Monty Slatkin and Yun Song who, along with Erin Murphy, had attempted to estimate some error rates of familial searching, but were stymied by a lack of a well-described methods currently used in practice.  When the statistical procedure used by California was published, we were excited to collaborate using practically relevant methodology.  Specifically, we estimated the false positive rate and power of familial searching using the California state procedure.  Generally, we found high power to detect a specified first-degree relationship (.79 to .99) and low (but still substantial in a multiple testing context) false positive rates of calling unrelated individuals as first-degree relatives (<5e-9 to 1e-5).  We got thinking about more distant Y-chromosome-sharing relatives (half-siblings, cousins, second cousins) who (barring mutation) share Y-haplotypes and some portion of their autosomal STRs IBD.  We estimated that these distant relatives could be mistaken for close relatives fairly often, like in our simulations 14-42% of half-sibs and 3-18% of first cousins were misidentified as siblings.

These rates are non-trivial, especially if you consider the size of databases and the fact that there are more distant relatives than near (so distant relatives are more likely to be present in databases).  Further, some of these genetic relationships are not known (even to the individuals themselves) so are not useful to investigation, but may still be interpreted as evidence of familial involvement, leading to investigation of uninvolved individuals.  Lucky for us, our collaborator Erin Murphy has a background in law and thoughtfully outlined some of the practical ramifications in the introduction and discussion of our paper.  Not the least of which is how extended families and communities in groups which are over-represented in databases (perhaps most obviously African Americans and Latinos) would be disproportionately impacted by misidentification of distant relatives as near relatives.

We hope that this interdisciplinary manuscript broadens sorely needed technical and policy discussions of familial searching.

Our paper: Clusters of microRNAs emerge by new hairpins in existing transcripts

This guest post is by Antonio Marco (@antonio_marco_c) on his paper Marco et al. Clusters of microRNAs emerge by new hairpins in existing transcripts arXived here.

Our paper:

MicroRNAs are short regulatory sequences involved in virtually all biological processes. MicroRNAs are often organized in genomic clusters that produce polycistronic transcripts. It is well-known that protein-coding polycistronic transcripts are almost absent in animals (with a few exceptions in nematodes and ascidians). So where do these microRNA clusters come from, and why are they so prevalent? We tackle these questions in our paper “Clusters of microRNAs emerge by new hairpins in existing transcripts”, recently deposited in arXiv.

We envisioned several possible scenarios for the origin of polycistronic microRNAs: First, polycistronic microRNAs can emerge by genomic rearrangements that bring together pre-existing microRNAs. As in bacterial operons, the clustering of microRNAs with related functions can be advantageous, and the fusion of related microRNAs may be positively selected. We call this the ‘put together’ model. Alternatively, multiple microRNAs could become polycistronic as a by-product of genome reduction (this is analogous to Caenorhabditis elegans operons). This is the ‘left together’ model. A third model, called ‘tandem duplication’, implies that polycistronic microRNAs emerge by tandem duplication of single sequences. Lastly, new microRNAs can emerge de novo in already existing microRNA transcripts. We named this the ‘new hairpin’ model, since a novel microRNA first requires the formation of a hairpin-like structure in the transcript.

By reconstructing the evolutionary history of Drosophila melanogaster microRNAs we observed that the majority of microRNA clusters emerged by the formation of new microRNA precursors in existing transcribed microRNA genes (‘new hairpin’ model). We also find that gene duplication generated a minority of the clusters (‘tandem duplication’). However, we didn’t see any instance of fusion of pre-existing microRNA genes. Moreover, clusters rarely split or suffer rearrangements. Once a microRNA cluster is formed, it stays as a cluster or it is lost a a whole.

We propose a model for the origin and evolution of microRNA clusters. Polycistronic microRNAs are an extreme case of genetic linkage, in which a microRNA is typically a few nucleotides away from another microRNA. Once a cluster is formed, the linkage is so tight that recombination is dramatically reduced between these loci. We suggest that, because of strong selective interference between loci (Hill-Robertson effect), a microRNA under selective pressure strongly influences the evolutionary fate of any neighbouring microRNA. Even slightly deleterious microRNAs may be maintained in a population if selection in one microRNA of the cluster is strong enough. Currently, we are analysing polymorphism data to test the validity of our model in actual Drosophila populations.

In summary, we suggest that clusters of microRNAs emerge by non-adaptive mechanisms and they are maintained as a consequence of tight linkage.

Our paper: The causal meaning of Fisher’s average effect

This guest post is by James Lee on his paper with Carson Chow, “The causal meaning of Fisher’s average effect“, arXived here

Early in graduate school, I took it upon myself to read Reinhard Burger’s excellent treatise The Mathematical Theory of Selection, Recombination, and Mutation. Here I encountered the concepts of “average excess” and “average effect,” which were defined (rather unclearly to the casual reader) by Ronald Fisher in his presentation of the Fundamental Theorem of Natural Selection. Finding some of the distinctions made between these two concepts rather confusing, I directed some questions about them to the Yahoo quantitative genetics group. A respondent told me to consult Falconer (1985), which would “make things as clear as mud.”

My school did not have electronic access to Genetics Research at the time, so I did things the old-fashioned way and got my hands on a bound copy of the journal volume containing Falconer’s article. This masterpiece of exposition impressed me so much that I copied it down by hand; since the paper was at the end of the bound volume, the librarian was not able to scan it for me.

Falconer set out four distinct concepts that at various times have been put forth as definitions of the average excess, average effect, or both:

(A) Divide the population into two groups, one containing all A1A1 homozygotes and half of the heterozygotes, the other containing all A2A2 homozygotes and half of the heterozygotes. Take the difference between the conditional mean phenotypes of these two groups.

(B) Choose gametes bearing A1 and A2 at random. Measure the phenotypes of the mature organisms to which these gametes ultimately give rise. Take the difference between the conditional mean phenotypes of the A1 and A2 gametes.

(C) Regress the phenotype on the count (0, 1, or 2) of an arbitrarily chosen allele (A1 or A2). Take the regression coefficient of gene count.

(D) Take the average change in phenotype resulting from experimentally “zapping” one allele into the other, as if by mutation, in a zygote immediately after fertilization but before the onset of any developmental events.

Implicitly assuming that genotypes and environments are independent, Falconer then showed that all four concepts are equivalent under random mating. Now suppose that mating is not random. Then (A) and (B) are still equal and correspond to what Fisher called the average excess. The numerical value of this quantity is generally not equal to either (C) or (D), and in turn (C) and (D) are generally not equal to each other. Falconer concluded that (C) was what Fisher really meant by the average effect.

This conclusion disturbed me a great deal. As any GWAS researcher knows, the (partial) regression of phenotype on gene count does not necessarily pick out any biologically meaningful quantity if genotypes and environments are dependent (“population stratification”). The fundamental issue here is that (C) is merely a statistical definition, appealing only to passive observations of a static population, whereas (D) is a causal definition turning on the result of a hypothetical experimental intervention. I no longer remember now whether I had read Pearl (2009) by this point, but regardless my Spider Sense was unambiguously telling me that (D) was deeper and more meaningful than (C). Furthermore, if Fisher was not the one who coined the slogan “correlation is not causation,” he was certainly one of its first and most vocal promoters. How could Fisher, who invented randomization in experimental design, have preferred a correlational definition over a causal one when setting forth one of the key concepts in his evolutionary theory? Could it be because of the difficult in translating (D) from words into mathematical symbols without something like Pearl’s do operator, which was not available in Fisher’s time?

This paradox continued to bother me over the next several years. Soon after my daughter was born, I indulged one of those wild impulses that strike the sleepless: I emailed my questions regarding this matter to Anthony W. F. Edwards, the last student of the great Fisher himself. Anthony very generously sent me some of his unpublished work and also his correspondence with Falconer about the very article that had spurred my thoughts. This correspondence spanned a period of more than 20 years, and it provided a very poignant portrait of Douglas Falconer as a scientist (Hill and Mackay, 2004). I did not immediately find the answers to my questions in the materials that Anthony sent to me, but they set me on the path toward finding the answers. These are presented in the paper, which will shortly appear in Genetics Research.

It turns out that Fisher’s average effect must be given a causal interpretation after all. For the detailed story of the reconciliation between (C) and (D), you will have to read the paper, written in collaboration with my supervisor Carson Chow. I am particularly pleased with our proof that the frequency-weighted mean of the (experimental) average effects at any locus is equal to zero. In most texts this relation is extrinsically applied to the multilocus case without any motivation except that it holds automatically for the (regression) average effects in the case of a single locus. The fact that this identity, which otherwise is an arbitrary constraint, can be derived from a definition positing the experimental replacement of a homologous gene is rather striking evidence for the importance of a causal interpretation.

Our investigation unexpectedly turned up many connections to other parts of population genetics. I like to think that in the pages of our paper one can hear many masters of population and quantitative genetics–Hardy, Fisher, Wright, Kimura, Falconer, Price, Ewens, Lessard–engaging in a deep conversation.

There are some issues raised in the paper that I am still contemplating. First, there is a complication when one considers randomly sampling a zygote and experimentally changing its genotype to the one whose value needs to be known; such an experiment inevitably changes the frequencies of the genotypes, and for theoretical reasons any ensuing frequency-dependent changes in the phenotypic means of the genotypes needs to be excluded. I believe that one way to do this properly is by partition of the effects of the experiment according to Wright’s path analysis–which would be rather ironic given the well-known antagonism between Wright and Fisher. Second, in the multilocus case it might be possible to mathematically describe special subsets of possible gene substitutions defining a given average effect that satisfy the property that all changes in Hardy-Weinberg and linkage disequilibria are “small.” We look forward to future work (by ourselves?) on these questions.

Note: The bibliography gives the name of the journal in which Falconer (1985) appears as Genetical Research. This is the same journal as Genetics Research; the name was changed about ten years ago.

Our paper: The effects of transcription factor competition on gene regulation

This guest post is by Radu Zabet on his papers “The effects of transcription factor competition on gene regulation” and “The influence of transcription factor competition on the relationship between occupancy and affinity”

Transcription factors (TFs) find their genomic target sites by a combination of three-dimensional diffusion and one-dimensional translocation on the DNA. We previously developed the stochastic simulation framework GRiP (http://logic.sysbiol.cam.ac.uk/grip/) that allows the realistic representation of the target finding process. The following two papers show our application of GRiP to address a few interesting phenomena:

The effects of transcription factor competition on gene regulation
arXiv:1303.6793

The binding of site-specific TFs to their genomic target sites controls the transcription rate of the target genes. In this manuscript, we discuss the influence of TF abundance on the arrival time of TFs on their target sites as well as the time they stay bound to the DNA. We investigate the TF search process using stochastic simulations and found that molecular crowding on the DNA always leads to longer times required by TF molecules to locate their target sites as well as to lower occupancy. There is also an “emergent property” in cases where many molecules compete in some sort of molecular traffic jam on the DNA. This newly identified noise component may be a contributor to transcriptional noise, by affecting both the size of the fluctuations and the distribution of the arrival times (unimodal or bimodal).

The influence of transcription factor competition on the relationship between occupancy and affinity
arXiv:1303.6869

This manuscript deals with the discrepancy between “predicted occupancy” of a TF to a binding site on the basis of, say, a PWM, in contrast to a “measured occupancy” when we simulate the system with our GRiP framework. Again, we can show that absolute TF abundances play an important role in gene expression, and also provide a compelling case where selecting “the highest peaks” from a ChIP experiment may not necessarily identify the most affine binding sites. Our results showed that for medium and high affinity sites, TF competition does not play a significant role for genomic occupancy except in cases when the abundance of the TF is significantly increased, or when the PWM displays relatively low information content. Nevertheless, for medium and low affinity sites, an increase in TF abundance (for both cognate and non-cognate molecules) leads to an increase in occupancy at several sites.

Our paper: Improving transcriptome assembly through error correction of high-throughput sequence reads

This guest post is by Matt MacManes on his preprint with Michael Eisen, “Improving transcriptome assembly through error correction of high-throughput sequence reads“, arXived here. This is cross-posted from his blog.

I am writing this blog post in support of a paper that I have just submitted to arXiv: Improving transcriptome assembly through error correction of high-throughput sequence reads. My goal is not to talk about the nuts and bolts of the paper so much as it is to ramble about its motivation and the writing process.

First, a little bit about me, as this is my 1st paper with my postdoctoral advisor, Mike Eisen. In short, I am a evolutionary biologist by training, having done my PhD on the relationship between mating systems and immunogenes in wild rodents. My postdoc work focuses on adaptation to desert life in rodents- I work on Peromyscus rodents in the Southern California deserts, combining field work and genomics. My overarching goals include the ability to operate in multiple domains– genomics, field biology, evolutionary biology to better understand basic questions– the links between genotype and phenotype, adaptation, etc… OK, enough.. on the the paper.

Abstract:

The study of functional genomics–particularly in non-model organisms has been dramatically improved over the last few years by use of transcriptomes and RNAseq. While these studies are potentially extremely powerful, a computationally intensive procedure–the de novo construction of a reference transcriptome must be completed as a prerequisite to further analyses. The accurate reference is critically important as all downstream steps, including estimating transcript abundance are critically dependent on the construction of an accurate reference. Though a substantial amount of research has been done on assembly, only recently have the pre-assembly procedures been studied in detail. Specifically, several stand-alone error correction modules have been reported on, and while they have shown to be effective in reducing errors at the level of sequencing reads, how error correction impacts assembly accuracy is largely unknown. Here, we show via use of a simulated dataset, that applying error correction to sequencing reads has significant positive effects on assembly accuracy, by reducing assembly error by nearly 50%, and therefore should be applied to all datasets.

For the past couple of years, I have had an interest in better understanding the dynamics of de novo transcriptome assembly.. I had mostly selfish/practical reasons for wanting to understand–a large amount of my work depends on getting these assemblies ‘right’.. It was quickly evident that much of the computational research is directed at assembly itself, and very little on the pre- and post-assembly processes.. We know these things are important, but often an understanding of their effects is lacking…

How error correction of sequencing reads affects assembly accuracy has been one of the specific ideas I’ve been interested in thinking about for the past several months. The idea of simulating RNAseq reads, applying various error corrections, then understanding their effects is logical– so much so that I was really surprised that this has not been done before. So off I went..

I wrote this paper over the coarse of a couple of weeks. It is a short and simple paper, and was quite easy to write. Of note, about 75% of the paper was written on the playground in the UC Berkeley University Village, while (loosely) providing supervision for my 2 youngest daughters. How is that for work-life balance!

The read data will be available on Figshare, and I owe thanks to those guys for lifting the upload limit- the read file is 2.6Gb with .bz2 compression, so not huge, but not small either. The winning (AllPathsLG corrected) assembly is there as well.

This type of work is inspired, in a very real sense, by C. Titus Brown, who is quickly becoming to be the go-to guy for understanding the nuts and bolts of genome assembly (and also got tenure based on his klout score HA!). His post and paper on The challenges of mRNAseq analysis is the type of stuff that I aspire to…

Anyway, I’d be really interested in hearing what you all think of the paper, so read, enjoy, commentand get to error correcting those reads!

Our paper: RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates

This guest post is by Titus Brown on his group’s preprint, RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates, arXived here. This post is cross-posted from his blog

I’m worried about our current mRNAseq analysis strategies.

I recently posted a draft paper of ours to arXiv entitled RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates; the link is to the Haldane’s Sieve discussion of the paper. Graham Coop said I should write something quick, and so I am (sort of — I mean to write this post a few days ago, but teaching. family. etc.)

I decided to post the paper — although we haven’t submitted it yet — because I wanted to solicit feedback and gauge where the disagreements or misunderstandings were likely to pop up. I suspect reviewers are going to have a hate-hate relationship with this paper, too, so I want to get rid of any obvious mistakes by crowdsourcing some early reviews ;) .

Before I talk about the paper, let me mention a few related factoids that crossed my ‘net firehose and that tie into the paper; I’ll weave them into the paper discussion below.

  1. Dan Zerbino (one of the Assembly Gods?) wrote a good e-mail about how Oases can’t really leverage coverage information when reconstructing transcripts
    on the oases-users mailing list.
  2. I read and reviewed the SASeq paper on mRNAseq quantification, which talks about how many of our transcript models are almost certainly not expressed.
  3. The lamprey genome paper came out, pointing out that the genome is probably missing ~20-30% of the germline content.
  4. A paper on evaluating mRNAseq expression analysis software hit arXiv.

OK, so what did our paper say?

The paper’s genesis is this: for years I’d been explaining to collaborators that mRNAseq was unreliable on any organism without a pretty good reference transcriptome, and that constructing the reference transcriptome relied on either good mRNAseq assembly tools OR on having a good reference genome. So that was what my lab was working on doing with their data.

Since we tend to work on weird organisms like chick (which has an incomplete genome and a notoriously poor gene annotation), lamprey (see above — missing 20-30% of its genome), and Molgulid ascidians (for which we are constructing the genome), we needed to go the de novo mRNAseq assembly route.

However, we also quickly realized that there were a few problems with de novo mRNAseq assembly and expression evaluation: lots of mRNAseq needed to be crunched in order to get good reconstruction, which was computationally intensive (see: diginorm); splice variants would confuse mapping (see: this paper); we were uncertain of how well splice variants could be assembled with or without a reference (see: this paper); and we weren’t sure how to evaluate mapping tools.

When a new postdoc (Alexis Black Pyrkosz, the first author) joined the lab and wanted a fairly simple project to get started, I suggested that she evaluate the extent to which an incomplete reference transcriptome would screw things up, and perhaps try out a few different mappers. This took her about a year to work through, in addition to all of her other projects. She built a straightforward simulation system, tried it out on chick and mouse, and got some purely computational results that place (in my view) fundamental limits on what you can accomplish with certainty using current mRNAseq technology.

Incidentally, one piece of feedback she got at GLBio from a prof was (paraphrased) "I hope this isn’t your real project, because it’s not very interesting."

The results, in short, are:

  1. What mapper you use doesn’t really matter, except to within a few percent; they all perform fine, and they all degrade fairly fast with sequencing errors.

  2. Incomplete reference transcriptomes matter, a lot. There are two entirely obvious reasons: if you have a splice variant A that is in your reference but not present in your mRNAseq, and a splice variant B that is not in your reference but is actually transcribed and in your mRNAseq, the reads for B will get mapped to the wrong transcript; and (the even more obvious one) you can’t measure the expression of something that’s not in your reference via mapping.

    The SASeq paper does a nice job of pointing out that there’s something rather seriously wrong with current mRNAseq references, even in mouse, and they provide a way to minimize misestimation in the context of the reference.

  3. Direct splice variant reconstruction and measurement is, technically, impossible for about 30% of the transcripts in mouse. For standard paired-end sequencing, it turns out that you cannot claim that exon A and exon Z are present in the same isoform for about 30% of the isoforms.

  4. The slightly surprising conclusion that we reached from this is that mRNAseq assembly is also, generally speaking, impossible: you cannot unambiguously construct a reasonably large proportion of observed isoforms via assembly, since the information to connect the exons is not there.

    Until recently, I had held out a forlorn hope that Clever Things were being done with coverage. Then I saw Dan Zerbino’s e-mail, point A, above.

And yet, judging by the Oases and Trinity publications, assembly works! What’s up?

There’s something a bit weird going on. Tools like Oases and Trinity can clearly construct a fair proportion of previously observed transcripts, even though the information to do so from direct observation isn’t there and they can’t necessarily use coverage inference reliably. My guess (see paper D, above) is that this is because biology is mostly cooperating with us by giving us one dominant isoform in many or most circumstances; this matches what Joe Pickrell et al. observed in their truly excellent noisy splicing paper. But I’d be interested in hearing alternate theories.

At this point, my friend and colleague Erich Schwarz tends to get unhappy with me and say "what would you have me do with my mRNAseq, then? Ignore it until you come up with a solution, which you claim is impossible anyway?" Good question! My answer is (a) "explore the extent to which we can place error bars or uncertainty on isoform abundance calculations", (b) "figure out where interesting isoform misestimation is likely to lie in the data sets", and (c) "look at exon-exon junctions and exon presence/absence instead of whole isoform abundance." But the tools to do this are still rather immature, I think, and people mostly ignore the issue or hope it doesn’t bugger up their results. (Please correct me if I’m wrong – I’d love pointers!)

In my lab, we are starting to explore ways to determine what mis- or un-assembled isoforms there might be in a given transcriptome. We’re also looking at non-reference-based ways of doing mRNAseq quantification and differential expression (technically, graph-based methods for mRNAseq). We are also deeply skeptical of many of the normalization approaches being used, largely because every time we evaluate them in the context of our actual data, our data seems to violate a number of their assumptions… Watch This Space.

Paper reactions

What’s the reaction to the paper been, so far?

Well, on Twitter, I’ve been getting a fair bit of "ahh, a good reference genome will solve your problems!" But I think points #3 and #4 above stand. Plus, invoking solving a harder and more expensive problem to solve what you think is a simpler problem is an interesting approach :) . And since we don’t have good references (and won’t for a while) it’s not really a solution for us.

I’ve also been getting "this is a well known problem and not worth publishing!" from a few people. Well, OK, fair enough. I’ve been skimming papers with an eye to this for a while, but it’s entirely possible I’ve missed this part of the literature. I’d love to read and cite such a paper in this one (and even rely on it to make our points, if it truly has significant overlap). Please post links in the comments, I’d really appreciate it!

It is clear that the paper needs some reshaping in light of some of the comments, and I’d like to especially thank Mick Watson for his open comments.

Concluding thoughts

If our results are right, then our current approaches to mRNAseq have some potentially serious problems, especially in the area of isoform expression analysis. Worse, these problems aren’t readily addressible by doing qPCR confirmation or replicate sequencing. I’m not entirely sure what the ramifications are but it seems like a worthwhile thing that someone should point out.

–titus

p.s. Our simulations are fairly simple, BTW. We’ll put the code out there soon for you to play with.

p.p.s. Likit Preeyanon, a graduate student in my lab, was one of the first people in my lab to look really critically at mRNAseq. Watch for his paper soon.

Our paper: Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster

This guest post is by Nandita R. Garud, Philipp W. Messer, Erkan O. Buzbas, and Dmitri A. Petrov, on their paper
 Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster, arXived here

We typically think of adaptive events as arising from single de novo mutations that sweep through the population one at a time. In this scenario, one expects to observe the signatures of hard selective sweeps, where a single haplotype rises to very high frequencies, removing variation in linked genomic regions. It is also possible, however, that adaptation could lead to signatures of soft sweeps. Soft sweeps are generated by multiple adaptive haplotypes rising in frequency at the same time, either because (i) the adaptive mutation comes from standing variation and thus had time to recombine onto multiple haplotypes, or (ii) because multiple de novo mutations arise virtually simultaneously. The second mode is likely in large populations or when the adaptive mutation rate per locus is high.

Soft sweeps have generally been considered a mere curiosity and most scans for adaptation focus on the hard sweep scenario. Despite this prevailing view, the three best-studied cases of adaptation in Drosophila at the loci Ace, CHKov1, and Cyp6g1 all show signatures of soft sweeps. In two cases (Ace and Cyp6g1), soft sweeps were generated by de novo mutations indicating that the population size in D. melanogaster relevant to adaptation is on the order of billions or larger. In one case (CHKov1), soft sweeps arose from standing variation. Surprisingly, we do not have very convincing cases of recent adaptation in Drosophila that generated hard sweeps.

Nevertheless, it remained an open question of whether these three cases were the exception or the norm. They are all related to pesticide or viral resistance and it is entirely possible that much adaptation unrelated to human disturbance or immunity proceeds differently and might generate hard sweeps.

In this paper, we developed two haplotype statistics that allowed us to systematically identify hard and soft sweeps with similar power and then to differentiate them from each other. We applied these statistics to the Drosophila polymorphism data of ~150 fully sequenced, inbred strains available through the Drosophila Genetic Reference Panel (DGRP).

We found abundant signatures of recent and strong sweeps in the Drosophila genome with haplotype structure often extending over tens or even hundreds of kb. However, to our surprise, when we looked at the top 50 peaks, all of them showed signatures of soft sweeps, while we could not convincingly demonstrate the existence of any hard sweeps.

Our results suggest that hard sweeps might be exceedingly rare in Drosophila. Instead, it appears that adaptation in Drosophila primarily proceeds via soft sweeps and thus often involves standing genetic variation or recurrent de novo mutations. There are two caveats, however: One is that we were only able to study strong and recent adaptation. Such strong adaptation should “feel” recent population sizes that are close to the census size, whereas it should be insensitive to bottlenecks that have occurred in the distant past. Weaker adaptation, on the other hand, might take longer and thus would be sensitive to ancient bottlenecks or interference from other sweeps. Whether weak adaptation thus proceeds via hard sweeps remains to be seen. The second caveat is that much of adaptation might involve sweeps that are so soft and move so many haplotypes up in frequency that we cannot detect them. Similarly, adaptation could often be polygenic involving very subtle shifts in allele frequency at many loci. These modes would hardly leave any signatures of sweeps at all. Whichever way it is, it is becoming increasingly clear that adaptation in Drosophila and many other organisms is likely to be much more complex, much more common, and in many ways a much more turbulent process than we usually tend to think.