Author post: Evolution at two levels of gene expression in yeast

This guest post is by Carlo Arteri and Hunter Fraser on their preprint Evolution at two levels of gene expression in yeast, arXived here

Taking studies of regulatory evolution to the next level: translation

Understanding the molecular basis of regulatory variation within and between species has become a major focus of modern genetics. For instance, the majority of identified human disease-risk alleles lie in non-coding regions of the genome, suggesting that they affect gene regulation (Epstein 2009). Furthermore, it has been argued that regulatory changes have played a dominant role in explaining uniquely human attributes (King and Wilson 1975). However, our knowledge of gene regulatory evolution is based almost entirely on studies of mRNA levels, despite both the greater functional importance of protein abundance, and evidence that post-transcriptional regulation is pervasive. The availability of high-throughput methods for measuring mRNA abundance coupled to the lack of comparable methods at the protein level have contributed to this focus; however, a new method known as ribosome profiling (Ingolia et al. 2009) has enabled us to study divergence in the regulation of translation.

‘Riboprofiling’ involves the construction of two RNA-seq libraries: one measuring mRNA abundance (the ‘mRNA’ fraction), and the second capturing the portion of the transcriptome that is actively being translated by ribosomes (the ‘Ribo’ fraction). We performed riboprofiling on interspecific hybrids of two closely related species of budding yeast, Saccharomyces cerevisiae and S. paradoxus, (~5 million years diverged) as well as the parental strains. As both parental alleles at a locus share the same trans cellular environment in the hybrid, differences in the relative allelic abundance (termed allele-specific expression, or ASE) reveal cis-regulatory divergence. Consequently, interspecies differences not attributable to cis-effects indicate trans divergence. By measuring differences in the magnitudes of ASE between the two hybrid riboprofiling fractions, we identified independent cis and trans regulatory changes in both mRNA abundance and translational efficiency.

We found that both cis and trans regulatory divergence in translation are widespread, and of comparable magnitude to divergence at the mRNA level – indicating that we miss much regulatory evolution by focusing on mRNA in isolation. Moreover, we observed an overwhelming bias towards divergence in opposing parental directions, suggesting the action of stabilizing selection in order to maintain more similar protein levels between species than would be expected by comparing mRNA abundances alone. Interestingly, while we confirmed the results of previous studies indicating that both cis and trans regulatory divergence at the mRNA level are associated with the presence of TATA boxes and nucleosome free regions in promoters, no such relationship was found for translational divergence, indicating that these regulatory systems have different underlying architectures.

We also searched for evidence of polygenic selection in and between both regulatory levels by applying a recently developed modification of Orr’s sign test (Orr 1998; Fraser et al. 2010; Bullard et al. 2010). Under neutral divergence, no pattern is expected with regards to the parental direction of up or down-regulating alleles among orthologs within a functional group (e.g., a pathway or multi-gene complex). However, a significant bias towards one parental lineage is evidence of lineage-specific selection. This analysis uncovered evidence of polygenic selection at both regulatory levels in a number of functional groups. In particular, genes involved in tolerance to heavy metals were enriched for reinforcing divergence in mRNA abundance and translation favoring S. cerevisiae. Increased tolerance to these metals has been observed in S. cerevisiae (Warringer et al. 2011), suggesting that domesticated yeasts have experienced a history of polygenic adaptation across regulatory levels allowing them to grow on metals such as copper. Finally, we also uncovered multiple instances of stop-codon readthrough that are conserved between species, highlighting yet another post-transcriptional mechanism leading to increased proteomic diversity.

By applying a novel approach to a long-standing question, our analysis has revealed the underappreciated complexity of post-transcriptional regulatory divergence. We argue that partitioning the search for the locus of selection into the binary categories of ‘coding’ vs. ‘regulatory’ overlooks the many opportunities for selection to act at multiple regulatory levels along the path from genotype to phenotype.

References:

Bullard JH, Mostovoy Y, Dudoit S, Brem RB. 2010. Polygenic and directional regulatory evolution across pathways in Saccharomyces. Proc Natl Acad Sci USA 107: 5058-5063.

Epstein DJ. 2009. Cis-regulatory mutations in human disease. Brief Funct Genomic Proteomic 8: 310–316.

Fraser HB, Moses AM, Schadt EE. 2010. Evidence for widespread adaptive evolution of gene expression in budding yeast. Proc Natl Acad Sci USA 107: 2977-2982.

Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. 2009. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324:218-223.

King MC, Wilson AC. 1975. Evolution at two levels in humans and chimpanzees. Science 188: 107-116.

Orr HA. 1998. Testing natural selection vs. genetic drift in phenotypic evolution using quantitative trait locus data. Genetics 149: 2099-2104.

Warringer J, Zörgö E, Cubillos FA, Zia A, Gjuvsland A, Simpson JT, Forsmark A, Durbin R, Omholt SW, Louis EJ, Liti G, Moses A, Blomberg A. 2011. Trait variation in yeast is defined by population history. PLoS Genet 7 :e1002111.

Author post: Patterns of positive selection in seven ant genomes

This guest post is by Julien Roux on Roux et al. “Patterns of positive selection in seven ant genomes“, arXived here.

The publication of the honeybee genome in 2006 can be considered the birth date of "sociogenomics", a research field whose agenda is to understand social life in molecular terms. Recently, this field has entered a period of rapid discovery with the publication of full genome sequences of multiple Hymenoptera species. In particular, the release of seven ant genome sequences gave us the opportunity to look for the molecular origins of some of the spectacular adaptations of the ant lineage, through patterns of positive selection on amino-acid substitutions in ant genes. We used rigorous methods to detect episodic positive selection while controlling for false positives inspired by the database Selectome. All data is publicly available for people to reuse.

An original aspect of our paper is that we analyzed not only ant genomes, but also data from 10 species of bees and 12 species of flies with the same methods to permit an unbiased comparison of positive selection patterns between lineages. For example, immune genes were enriched for positive selection signal in all three lineages. This may not look surprising since these are classical hits of positive selection scans, but it was previously hypothesized that the evolution of social hygienic behaviors in ants and bees may have relaxed the selective pressure on immune genes. Our analysis indicates that this effect is either absent or relatively small.

Other hypotheses have been put forward in relation to the evolution of sociality in Hymenoptera. Notably, it was proposed that the challenges of social life in the colonies should be reflected by increased positive selection signal on neurogenesis genes. Similarly, because communication is mostly based on chemical signals in colonies of social insects, it was suggested that increased positive selection should be observed on olfactory receptors compared to non-social insects. Our results question both these hypotheses, since we observed that increased positive selection on these classes of genes does not coincide with (but predated) the evolution of sociality in Hymenoptera.

Finally, the comparison between the three lineages allowed us to pinpoint some patterns that were most likely specific to the ant lineage. We found less positive selection on genes related to metabolism in ants compared to bees and flies. We think this could be the sign of relaxed selection on these genes, possibly in relation to the important reduction on metabolic needs with the loss of flight in ant workers. By contrast, we identified a robust pattern of directional selection specific to the ant lineage on genes functioning in the mitochondria. Several pieces of evidence suggest that this pattern might be linked to the remarkable lifespan extension that evolved in the ant lineage. Queens of some ant species can indeed live up to 100 times longer than solitary insects, (that is up to 30 years!). Positive selection possibly played a role in optimizing the activity of mitochondria, where the respiratory chain is the primary source of production of Reactive Oxidative Species (ROS), an important proximal cause for aging, thus contributing to the evolution of increased lifespan in ants.

In conclusion, protein level episodic positive selection appears to have played an important role in the evolution of social insects, notably regarding strong mitochondrial adaptation in ants.

Author post: A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

This author post is by Cyrus Maher and Ryan Hernandez on their preprint A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity, arXived here.

Rigorous evolutionary analysis of protein coding regions often requires high-quality multiple sequence alignments. These alignments can only be generated after the identification of orthologous sequences. In our pre-print, “A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity”, we present a novel method that substantially improves the number and quality of detected orthologs, especially in the presence of sequencing error and complex evolutionary processes.

This endeavor grew out of our forthcoming work on the evolutionary impact of ancient pathogens on the human genome. Early on, we observed the decisive influence ortholog quality exerted on our downstream conclusions. As one might imagine, accurate sequence analysis is a fool’s errand if the sequences are, in fact, the wrong ones! Such experiences have impelled us to take a keen interest in orthologs, much as a bad case of gastroenteritis might inspire a sushi chef to become thoroughly attentive to the quality of his or her fish.

Identifying orthologous sequences is referred to as ortholog detection (OD). In brief, existing OD methods can be classified as tree-based, graph-based, or a hybrid of the two. Tree-based methods may use reconciliation techniques between gene and species trees or may rely on the gene tree alone. Graph-based methods can employ a variety of metrics to quantify similarity between sequences. Popular measures include sequence identity and matrix-weighted similarity scores. Syntenic information may also be incorporated in this context.

Here we consider alignments from UCSC (MZ), MultiParanoid (MP), translated BLAT (BL), and OMA. To briefly summarize the strengths of the considered methods: MZ utilizes syntenic similarity, MP includes all-by-all similarity in its calculations, OMA considers phylogenetic information directly, and BL does not require an accurately predicted proteome. In figure 1A of our paper, we illustrate the head-to-head performance of four popular methods for OD. Interestingly, we find striking complementarity between methods, motivating a search for a practical way to integrate ortholog predictions from methodologically diverse sources.

Comparison of sequence identity levels between methods A.) Heat map of the percent of orthologs for which BLAT (BL), OMA (OMA),  MultiParanoid (MP),, and MultiZ (MZ) outperform one another. Performance is based on percent identity of each method’s orthologs to the human sequence. One method is considered to outperform another method if it improves percent identity by at least five percentage points. Text in diagonal cells shows the number of orthologs identified by each method, colored by the percent of transcripts at which a given method outperforms all the others

Figure 1: Comparison of sequence identity levels between methods A.) Heat map of the percent of orthologs for which BLAT (BL), OMA (OMA), MultiParanoid (MP),, and MultiZ (MZ) outperform one another. Performance is based on percent identity of each method’s orthologs to the human sequence. One method is considered to outperform another method if it improves percent identity by at least five percentage points. Text in diagonal cells shows the number of orthologs identified by each method, colored by the percent of transcripts at which a given method outperforms all the others

These efforts culminate in the presentation of MOSAIC, or Multiple Orthologous Sequence Analysis and Integration by Cluster optimization. MOSAIC is a well-documented python package that can flexibly integrate ortholog predictions from an arbitrary number of sources. We compare integrated MOSAIC alignments to those generated using each constituent method alone. Relative to the best-performing single method, we show that MOSAIC more than quintuples the number of sequences for which all orthologs of interest are successfully identified (see figure below). However, this increase in putative orthologs could be the result of, e.g. the improper inclusion low-quality or paralogous sequences. This does not appear to be the case for MOSAIC. Crucially, improvements in power are secured while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality.

OD power and the effect of pooling methods A.) The cumulative number of human transcripts as a function of the maximum number of missing species allowed

Figure 2: OD power and the effect of pooling methods A.) The cumulative number of human transcripts as a function of the maximum number of missing species allowed

These results are obtained from alignments between the human proteome and orthologs from nine species encompassing a range of primates and closely related mammals. For other sequence sets, the best strategy for method integration may differ slightly depending on, e.g. the level of divergence between species of interest. To account for this, MOSAIC provides several options for scoring and optimization, and even facilitates the specification of user-defined metrics for sequence similarity and cluster optimality.

In the future, we would also like to add functionality to automatically fetch relevant alignments from major ortholog databases. In the meantime, we hope that this tool will prove a useful addition to a variety of evolutionary analysis pipelines. We of course welcome feedback on how we might improve the performance and practical utility of the method. Thank you in advance for your input!

Author Post: The Population Genetic Signature of Polygenic Local Adaptation

This guest post is by Jeremy Berg [@JeremyJBerg] and Graham Coop [@Graham_coop] on their paper The Population Genetic Signature of Polygenic Local Adaptation arXived here

The field of population genetics has devoted a lot time to identifying signals of adaptation. These tests are usually predicated on the fact that local adaptation can drive large allele frequency changes between populations. However, we’ve known for almost a century that many traits are highly polygenic, so that adaptation can occur through subtle shifts in allele frequencies at many loci. Until now we’ve been unable to detect such signals, but genome-wide association studies (GWAS) now give us a way of potentially learning about selection on quantitative traits from population genetic data. In this paper we develop a set of approaches to do this in a robust population genetic framework.

GWAS usually assume a simple additive model, i.e. no epistasis/dominance, to test for and estimate effect sizes for a genome-wide set of loci. To test whether local adaptation has shaped the genetic basis of the trait, we do the perhaps boneheaded thing of taking the GWAS results at face value. For each population we simply sum up the product of the frequency at each GWAS SNP and the effect size of that SNP. This gives us an estimate of the mean additive genetic value for the phenotype in each population. This is not the mean phenotype of the population as it ignores the fact that we don’t know all the variants affecting our trait; environmental change across populations, gene by environment interactions, and changes in allele frequencies that have altered the dominance and epistatic relationships between alleles (i.e. all that good stuff that makes life interesting). However, these additive genetic values do have the very useful property that they are simple linear functions of the allele frequencies, which means that we can construct a simple and robust model of genetic drift causing these phenotypes to diverge across populations.

Height-genetic-value

In Figure A we show our estimated genetic values using the human height GWAS of Lango Allen et al (2010). As you can see, populations show deviations around the global mean genetic value, and populations from the same geographic regions covary somewhat in the deviation they take, reflecting the fact that allele frequencies at each GWAS locus tend to covary in their shared genetic drift due to population history and migration. For example in Figure B we show allele frequencies at one of the GWAS height loci.

OneSNP

We can approximately model the allele frequencies at a single locus by assuming that they are multivariate normally distributed around the global mean. The covariance matrix of this distribution is given by a matrix closely related to the kinship matrix of our populations, which can be calculated from a genome-wide sample of putatively neutral loci. As our vector of phenotypic genetic values across populations is simply a weighted sum of the individual allele frequencies, our vector of genetic values is also follows a multivariate Normal distribution. Given that we are summing up lot of loci, even if the multivariate normal model is a poor approximation to drift at one locus, the central limit theorem suggests that it should still be a good fit to the distribution of the genetic values.

This simple neutral model framework, based on multivariate normal distributions, gives us a strong framework to develop tests of selection. Our most basic test is a test for the over-dispersion of the variance of genetic values (i.e. too great an among population variance, once population structure has been accounted for). We also develop a test for an environmental correlations and a way to identify outlier populations and regions to further understand the signal of local adaptation.

We apply our tests to six different GWAS datasets using the HGDP as our set of populations. Our tests reveal wide-spread evidence of selection shaping polygenic traits across populations, although many of the signals are quite subtle. Somewhat surprisingly, we find little evidence for selection on the loci involved in Type 2 diabetes, somewhat of a poster-child for adaptation shaping the genetic basis of a disease thanks to the thrifty gene hypothesis.

We think our approach is a promising way forward to look for selection on the genetic basis of quantitative traits as view by GWAS. However, it also highlights some concerns. In developing our tests we found that we had developed a set of methods that already have equivalents in the quantative trait community– in particular QST, a phenotypic analogy of FST (and its extensions by a number of authors). This raises the question of whether in systems where common garden experiments are possible there is a need to do GWAS if we are only interested in how local adaptation has shaped traits, or if QST style approaches are the best that one can do. We do think that there is much more that could be learnt by our style of approach, but it should also give researchers pause to consider why they want to “find the genes” for local adaptation.

We’ve already gotten some very helpful comments via Haldane’s sieve. We’d love more comments, particularly about points of confusion that could be clarified, other datasets that might be good to apply this to, or other applications we could develop.

Our paper: The inevitability of unconditionally deleterious substitutions during adaptation

This author post is by Joshua B. Plotkin and David McCandlish on their preprint “The inevitability of unconditionally deleterious substitutions during adaptation”, arXived here.

The idea for this paper came to us while we were re-reading an earlier study by Sergey Kryazhimskiy and others, on the dynamics of adapting populations (Kryazhimskiy et al. 2009). Kryazhimskiy et al. studied the “fitness trajectory” — that is, the mean fitness across an ensemble of populations, as a function of time. The basic idea of their study was to infer the structure of an underlying fitness landscape by observing the fitness trajectory in experimental populations of evolving organisms (such as the ones from Lenski’s long-term experiments).

In re-reading Sergey’s paper, we noticed that the fitness trajectories were always monotonic, that is, the expected fitness would either always decrease or always increase. Indeed Kryazhimskiy et al. 2009 had presented a detailed analytical theory for how the fitness trajectory should behave (at least for a large class of models), and according to this theory the fitness trajectories should always be monotonic. However, when we looked more carefully at how this analytical theory was derived, we saw that the apparent impossibility of non-monotonic fitness trajectories was actually an unintended consequence of a seemingly innocuous technical assumption. The theory had been thoroughly tested against simulations for the examples explored in the paper, and it had performed quite well. But still, we wondered, could we construct fitness landscapes with non-monotonic fitness trajectories?

The answer was yes. In fact, we found conditions that produced non-monotonic fitness trajectories in one of the simplest and widely used models of a fitness landscape: the house of cards model, where the fitness of each new mutation is drawn from some fixed probability distribution. We also noticed an interesting pattern. If the population starts at a very low fitness then the fitness trajectory is be monotonically increasing. But if the starting fitness of the population is closer to the equilibrium mean fitness (that is, the value that the fitness trajectory would eventually tend to) the fitness trajectories will become non-monotonic: fitness will initially decrease, and then, eventually, increase to
its asymptotic value.

After much coffee, we eventually proved that this basic pattern must occur for any house of cards model whose equilibrium fitness distribution has a finite mean (at least under a Moran process in the limit of weak mutation). That result was the germ that eventually developed into our paper, which includes further results on the house of cards model, and on Fisher’s geometric model.

Why are non-monotonic fitness trajectories interesting? On the one hand, this is a population-genetic curiosity in a vein similar to McVean and Charlesworth (1999)’s observation that increasing the strength of purifying selection can sometimes increase the nucleotide site diversity. It’s somewhat counter-intuitive that the expected selection coefficient of the first mutation to fix in an adapting population can be negative, even on a fitness landscapes that contains no local maxima!

On the other hand, we think that this result has important implications studying adaptive evolution. It is common in such studies to assume that deleterious mutations can never fix (e.g. by approximating the probability of fixation for a new mutation as 2s). Our results on the surprising prevalence of deleterious substitutions during adapation should hopefully spur others to consider carefully the circumstances under which ignoring deleterious fixations is justified.

Joshua B. Plotkin and David McCandlish

Works cited:

Kryazhimskiy S, Tkacik G, Plotkin JB. The dynamics of adaptation on correlated fitness landscapes. PNAS 106: 18638-18643 (2009)

McVean, G. A., and Charlesworth, B. (1999). A population genetic model for the evolution of synonymous codon usage: patterns and predictions. Genetical research, 74:145-158.

Our paper: An integrative genomic approach illuminates the causes and consequences of genetic background effects

This is a guest post by Dr. Chris Chandler on “An integrative genomic approach illuminates the causes and consequences of genetic background effects“, arXived here.

Biologists have long recognized that a mutation can have variable effects on an organism’s phenotype; even introductory genetics classes often make this observation by introducing the concepts of penetrance and expressivity. More mysterious, however, are the factors that influence the phenotypic expression of a mutation or allele. We know, for instance, that introducing the same mutation into two different but otherwise wild-type genetic backgrounds can result in vastly different phenotypes. But what specific differences between these two genetic backgrounds interact with the mutation, and how? And how does gene expression fit into this puzzle? Answering these questions has not been an easy task, which is not too surprising when you realize that penetrance and expressivity are, in reality, complex quantitative traits. We therefore adopted a multi-pronged genetic and genomic approach to tease apart the mechanisms mediating background dependence in a mutation affecting wing development in the fly Drosophila melanogaster.

The phenotypic patterns seen in our model trait have already been characterized: the scalloped[E3] (sd[E3]) mutation has strong effects in the Oregon-R (ORE) background, resulting in a tiny, underdeveloped wing, while its effects in the Samarkand (SAM) background are still obvious but much less extreme, resulting in a blade-like wing.

To try to find out what causes these differences, we generated and combined a variety of datasets: whole-genome re-sequencing of the parental strains and a panel of introgression lines to map the background modifiers of the sd[E3] phenotype; transcription profiling (using two microarray datasets and one RNA-seq-like dataset), including analyses of allele-specific expression in flies carrying a “hybrid” genetic background; predictions of binding sites for the SD protein, which is a transcription factor; and a screen for deletion alleles that enhance or suppress the sd[E3] phenotype in a background-dependent fashion.

Our results point to a complex genetic basis for this background dependence. We found evidence for a number of loci that are likely to modulate the effects of the sd[E3] allele. However, some unexpected inconsistencies provide a cautionary tale for those intending to take a similar mapping-by-introgression approach for their trait of interest: do multiple replicates, and introgress in both directions, or you may inadvertently end up mapping some other trait! Although the number of candidate genes we identified were generally large, by combining those results with data from our other datasets, we were able to narrow our focus to those showing a consistent signal, yielding a robust set of candidate genes for further study. Without getting into too much detail, we also used a novel approach to show that background-dependent modifier deletions of the sd[E3] phenotype (of which there are many) involve higher-order epistatic interactions between the sd[E3] mutation, the deletion, and the genetic background, rather than quantitative non-complementation (so more than two genes were involved).

Overall, we think that an integrative approach like this could be useful for others trying to understand complex traits, including genetic background-dependence of mutations. In addition, if you’re a Drosophila researcher working with the commonly used Samarkand or Oregon-R strains, our genome re-sequencing data (raw and assembled), including SNPs, will soon be available in public repositories for genetic data.

Our Paper: The genomic impacts of drift and selection for hybrid performance in maize

This next paper is by Jeff Ross-Ibarra (@jrossibarra) on his paper (along with coauthors) Gerke et al The genomic impacts of drift and selection for hybrid performance in maize arXived here.

Iowa recurrent selection as an evolutionary experiment in hybrid vigor

Maize is an outcrossing species, and was cultivated as such up through the first quarter of the 20th century. Starting in the 1920’s, however, breeders began to abandon open-pollinated maize in favor of hybrid varieties resulting from crosses between inbred lines. Hybrids are often more robust and higher yielding than either inbred parent, a phenomenon known as hybrid vigor or heterosis.

Breeding for hybrid varieties – and presumably increased heterosis – has had a profound impact on diversity across the maize genome. There are at least two important differences from previous breeding efforts: first, breeders select on and work with inbred maize lines rather than mass selection on open-pollinated populations. This results in much smaller effective population sizes, and has implications for recessive traits and deleterious alleles that could be masked in heterozygotes. The second difference is that instead of selecting the best plants per se, breeders now select for inbreds that make high-yielding hybrids. This means a breeder might favor an inbred that itself is not high-yielding if it consistently makes good hybrids when paired with other inbreds.

We set out to study the effects of these breeding strategies on patterns of diversity across the maize genome. We took addvantage of one of the longest-running ongoing experiments on selection for hybrid performance, started in the late 1940’s by the US Dept. of Agriculture’s Agricutural Research Service. Two small (12 and 16) sets of founder inbred lines were randomly mated to create two base populations: the Iowa Stiff Stalk Synthetic (BSSS) and the Iowa Corn Borer Synthetic No. 1 (BSCB1). In addition to its role as an important selection experiment, multiple maize breeding lines have come out of the BSSS population, including the line used for the maize reference genome.

Diversity in the BSSS and BSCB1 is patterned predominantly by drift

Over the course of the experiment we studied, the two base populations underwent 16 cycles of recurrent selection, in which lines from each population were crossed to each other and evaluated for both hybrid and per-se performance. Selected lines were intermated within each population to form the next generation. To investigate the genomic impact of this selection scheme, we genotyped progenitor lines and over 600 individuals from multiple selection cycles using the Illumina MaizeSNP50 SNP array. And because we know the exact crossing and selection scheme used, we can compare the observed changes in genome-wide diversity with strictly neutral crossing simulations using the genotypes of the starting populations.

Both populations steadily lost genetic diversity as they became more diverged from one another, but diversity and divergence between BSSS and BSCB1 can be largely reproduced by simulation without any selection. In fact, principal component analysis clearly reveals changes in population structure and diversity that mirror alterations in rates of inbreeding and effective population size that occurred over the course of the experiment. This indicates the structure is not necessarily related to the phenotypic improvement, but might be a by-product of the breeding scheme. Similar population structure is reflected in a recent broad comparison of US maize germplasm and suggests that much of the diversity and structure of modern maize germplasm has been effected by genetic drift.

Selection efficacy and fixation at regions of low-recombination.

But genetic drift can’t be the whole story in these populations. Numerous experiments have shown that the later populations are superior to their progenitors in terms of hybrid yield and traits important to increased planting density (more plants per acre = more yield). These same trends are observed across North American maize as a whole, suggesting common themes in how maize has improved over time. Selection is difficult to detect in the face of strong genetic drift, especially when the selection has been on traits with complex genetic architectures. However our simulations do detect regions of low heterozygosity in each population that are longer than expected given their genetic distance.

The most striking pattern of these regions is their lack overlap between the two populations. In simple cases, classic overdominance models of heterosis predict that at a single locus, two distinct alleles confer heterozygote advantage when combined. In this case, selection should lead to decreased heterozygosity at a locus in both populations as complementary alleles rise in frequency. We don’t observe this, and neither did a different study that used other populations.

A popular alternative to the over-dominance model is the dominance model, which predicts that heterosis is caused by the complementation of linked recessive deleterious alleles. In this case, multiple haplotypes in the other population may complement a fixed region if most deleterious alleles in maize are rare. Evidence from numerous studies supports a dominance model of heterosis, including findings of excess residual heterozygosity in low recombination regions of a maize mapping population. In regions of low recombination, heterozygosity (and thus complementation) becomes important due to an inabilty to efficiently select for new recombinants in these regions, especially with low effective population sizes. And because of low rates of recombination, a small genetic interval in these regions becomes massive in physical space and encompasses the composite effects of many deleterious loci. We observe fixation in these regions in the BSSS and BSCB1 populations. They are short genetically (1-2 centimorgans), but make up very large fractions of the chromosome. We find that in many cases, these regions have been inherited largely intact from the original population founders, indicating that selection for new haplotype combinations in these regions has been ineffective. Large haplotypes in some cases may have fixed early on in the formation of many breeding programs, and the combination of limited exchange between breeding pools and small effective population sizes has provided little opportunity for selective removal of deleterious alleles. Complementation and the inefficiency of selection in these pericentromeric regions, which span a large portion of the physical genome, may thus explain the difference between hybrid and inbred yield and why it has remained fairly constant.

Our Paper: Genome-wide inference of ancestral recombination graphs

This guest post is by Adam Siepel (@asiepel) on his paper with Matthew Rasmussen (@mattrasmus): Rasmussen and Siepel “Genome-wide inference of ancestral recombination graphs” , arXived here.

Inference of Ancestral Recombination Graphs

 As the title indicates, our paper is about the problem of inferring an “ancestral recombination graph,” or ARG, from sequence data.  This is a topic that may strike many readers as impenetrably obscure and technical, so I will first try to explain, in plain language, what the ARG describes and why it has so much potential to be useful in many kinds of genetic analysis.  Then, I will tell the story of how I and members of my research group have become increasingly fascinated by this problem over the years, how we have struggled with it, and how we finally achieved the conceptual breakthrough that is described in our paper.  As will become evident, Matt Rasmussen, a former postdoc in the group and lead author of our paper, was central in this achievement.

What is the ARG?

The ARG is an elegantly simple yet superbly rich data structure that describes the complete evolutionary history of a collection of genetic sequences drawn from individuals in one or more populations.  It was invented in the mid 1990s by the mathematicians Bob Griffiths and Paul Marjoram.  The ARG captures essentially all evolutionary information relevant for genetic analysis of such sequences.  Statisticians say that it fully defines the “correlation structure” of the sequences, meaning that it explains most similarities and differences among the sequences in terms of their patterns of shared ancestry.

The ARG is something like a family tree, only richer, because it not only defines the relationships among individuals, but it also traces the histories of specific segments of DNA sequences.  For example, if you were to replace your family tree with an ARG, you could tell exactly which pieces of your genome came from your eccentric great grandmother and which pieces you share with your charming, intelligent, and handsome third cousin.

Now, this is obviously all a bit vague and superficial.  What does the ARG actually describe?  First, the ARG is a “graph” in the mathematical sense, meaning that it consists of a set of “nodes” and a set of “edges” between nodes.  Traditionally, nodes and edges are depicted with circles and connecting lines, respectively, making graphs look a bit like those molecular modeling kits you may have used in your high school chemistry class.  In the case of the ARG, there are two types of nodes, representing two key classes of events in the history of the sequences: (i) recombination nodes, which describe how a sequence is assembled by concatenating two ancestral sequences when DNA is mixed-and-matched during the production of sperm and egg cells or other gametes (meiosis); and (ii) coalescence nodes, which describe how two sequences trace back to a common ancestor. Edges between these nodes represent lines of descent, or lineages, for segments of DNA.  Each node is annotated with the time of the corresponding event.  In addition, for reasons that will become apparent in a moment, recombination nodes are annotated with the position along the sequences at which the ancestral sequences were cut and then joined together.

The significance of recombinations and coalescences comes from the fact that these are the two ways in which lineages can join or split over time.  The best way to understand them is to think about the behavior of lineages as one looks backward in time.  The graph is typically laid out with time on the vertical axis, so that the bottom of the graph represents the present time and each node is assigned a height above this baseline indicating the time before the present at which the associated event occurred. Therefore, to look backward in time, we look upward in the graph.  As we do so, we see that recombination events cause a single lineage to split into two ancestral lineages (representing the two sequence fragments that were joined together by the recombination in forward time), and coalescence events cause two lineages to join into one. Therefore, recombination nodes have one edge coming in and two going out, and coalescence nodes have two edges coming in and one going out.  One way of thinking about it is that, given a fragment of modern DNA, recombinations have the effect of increasing its set of ancestors, while coalescences have the effect of decreasing its set of ancestors.

All of this joining and splitting results in a complex network that, to the uninitiated, might vaguely resemble a map of the London Tube.  For the ARG-literate, however, this graph is richly informative about evolutionary history.  The key to interpreting the graph is to make use of the positional information associated with the recombination nodes (the positions at which the recombinations occurred). For any position along the sequence, one can extract an evolutionary tree, or genealogy, for the samples in question, simply by following their lineages upward on the page and taking a left or a right turn at each recombination node based on whether the genomic position in question falls to the left or the right, respectively, of the recombination event.  It is easy to see that a subgraph extracted in this way will contain only joins (coalescences) and no splits, and must eventually reduce to a single lineage, and therefore will define a tree.  Thus, the ARG has embedded within it a “local tree,” or genealogy, for every position in the sequence!

I have to pause here to say that I find this property of the ARG incredibly beautiful.  In a highly compact manner, the graph manages to capture both a complete record of all recombination events and a genealogy for every position in the sequence, and it does so in a way that shows how neighboring genealogies can be reconciled in terms of historical recombination events!  Figure 1 of our paper illustrates it with a simple ARG for four sequences.  Don’t be confused by the absence of explicit nodes (circles) in the diagram.  By tradition, nodes are not drawn as circles in ARGs, but are simply understood to be the points at which edges meet.

It is worth emphasizing that this representation is very general.  It can be used to describe the history of a particular gene of interest in individuals from a single well-defined population, or the history of whole genomes (with one ARG per chromosome) for individuals from many diverse populations.  It can even be used to describe the histories of sequences from representatives of different species, such as humans, chimpanzees, and gorillas.  As long as the sequences in question are orthologous and collinear—meaning, essentially, that they are derived from a common ancestral sequence in the absence of duplication and rearrangement events—then the coalescence and recombination events defined by the ARG are sufficient for describing precisely how the sequences derive from their common ancestor, and, hence, how they are correlated with each other.

Why is the ARG useful?

Many biologists are familiar with the value of “tree-thinking” in understanding evolutionary relationships.  For example, a diagram of a tree intuitively and precisely captures the fact that humans and chimpanzees are more closely related to each other than either humans or chimpanzees are to rhesus macaques.  Similarly, it explains why even neutrally evolving sequences (“junk DNA,” if the term is not too loaded) will tend to be more similar between humans and chimpanzees, and between rats and mice, than between humans and mice (this is essentially what we mean by “correlation structure”).  Against the background of neutral evolution, trees help us to identify genes and noncoding functional elements, to detect evolutionarily conserved and adaptively evolving sequences, to date speciation events, and so on.

The problem with trees on population genetic time scales, however, is that they change along the sequence, due to recombination.  As noted above, the ARG precisely describes these trees and the way they change.  Therefore, it enables tree-thinking with population genetic data.

Viewing population genetics in terms of the ARG can clarify one’s thinking about many problems of interest.  For instance, the ARG makes it clear that divergence times for genetically isolated populations can be estimated by looking across the ARG for the most recent coalescences that cross population boundaries.  Similarly, given an estimated divergence time, the rate of gene flow or migration between populations can be estimated, in a fairly straightforward manner, in terms of the rates of inter-population coalescence events across the ARG.  Ancestral effective population sizes can be estimated from the density of coalescence events in the ARG over time. Signatures of natural selection, including hitchhiking and background selection, can be detected by various kinds of local distortion of the ARG.  In general, the ARG provides a unifying framework for the field, and many challenging statistical problems in population genetics can properly be seen as problems of revealing relevant features of the ARG.

What would a reconstructed ARG mean in practical terms?  First, I should be clear that we have no intension of actually drawing an ARG for dozens of complete human genome sequences.  Such a drawing would be far too large and complex to be useful.  Rather, the value of a reconstructed ARG is as a rich data structure that could be interrogated for many features of interest, such as local trees, recombination events, mutation ages, or regions of identity by descent.  Because these features would be derived from a unified description of the evolutionary history of the sample, they would be guaranteed to be internally consistent, unlike ones based on simpler estimators.  In this way, the ARG would be useful in many problems of interest in statistical genetics, ranging from demography inference (e.g., estimation of population divergence times or rates of gene flow between populations), to the detection of regions influenced by natural selection, to the detection of genotype/phenotype associations.

Why is it so difficult to find a good ARG?

In practice, most population geneticists do not work with ARGs, but instead work with surrogates such as principle components, site frequency spectra, and spectra of identity by descent.  The reason people work with these simpler, lower-dimensional summaries of genetic data, of course, is that explicit ARG reconstruction is forbiddingly difficult.  From a statistical and computational perspective, there are two major issues in reconstructing the ARG.  First, the problem of searching all possible ARGs for one that best fits the data is computationally intractable, even in a restricted, parsimony-based formulation of the problem (it belongs to the class of problems computer scientists call “NP-hard”).  Second, and perhaps more importantly, in most cases of interest there is simply not enough information in the data to reconstruct a single ARG with high confidence.  Rather, in general, a large family of ARGs will be more or less equally compatible with observed sequences.

For these reasons, it would be misleading to suggest that there is any hope of producing a magical computer program that will allow the user to input a collection of sequences and obtain the true ARG for those sequences as output.  Instead, we must consider many possible ARGs, weighting them in some way by their plausibility.  In other words, we must consider a statistical distribution of ARGs given the data.

Because of the awkwardness of the space of ARGs (each ARG is a complex, combinatorial object, difficult to summarize in terms of low-dimensional features), we and others have come to the conclusion that the best way to get at these distributions is by making use of statistical sampling methods.  In our case, we use an approach, called Markov chain Monte Carlo (MCMC), that chooses samples that are guaranteed to be representative of the distribution of ARGs given the data and the model, provided the sampling program is run long enough.  After collecting fairly large numbers of samples, we can make useful statements about general features of the ARG even if we have limited confidence in each individual sample.  For example, the average of the times to most recent common ancestry (TMRCA) in the sampled ARGs at a particular position along the sequence can be used as an estimator of the true TMRCA at that position.  We show that our methods can be used to summarize various useful features of this kind, including recombination and coalescence rates, and the ages of mutations that are polymorphic in the sample, as well as TMRCAs.

How we became interested in the problem of ARG inference—or, the evolution of an obsession

The ARG seems to have a unique ability to inspire obsession in a particular kind of mathematically inclined geneticist.  As discussed above, it is potentially richly informative but extremely difficult to obtain, which makes it a natural “Holy Grail” for evolutionary genetics.  At the same time, the ARG is fairly simple to describe, it is straightforward to simulate, and sequence data are strongly informative about many of its features, making solutions to the inference problem seem tantalizingly within reach.

I first encountered the ARG and the problem of ARG inference in 2005 when I picked up the book Gene genealogies, variation, and evolution: a primer in coalescent theory, by Hein, Schierup, and Wiuf.  The authors of this book had obviously been bitten by the ARG bug (in fact, the lead author, Jotun Hein had already been working on variations on the ARG inference problem for more than 15 years) and I found their enthusiasm contagious.  At the time I was primarily a phylogeneticist, new to population genetics, so the book was a perfect introduction to the topic for me.  It is written largely from a phylogenetic perspective with a strong emphasis on combinatorial analysis and algorithms and without a lot of intimidating population genetics jargon.  Not long afterward I got to know Jotun Hein at leisurely meeting in Barbados and became further interested in the topic through my conversations with him.

As time went on, and I attended more and more talks on population genetics (they are hard to avoid at Cornell!), I became increasingly convinced that the ARG provided the right framework for thinking about a wide variety of problems in population genetics.  While population mixture models, principle components analysis, copy-and-paste models, models of the site-frequency spectrum, and similar approaches clearly have their place, for a committed tree thinker like me, there is something deeply unsatisfying about the use of these techniques in evolutionary modeling and inference.  I had a strong feeling that it must be possible to do better by more directly modeling the true evolution process that gives rise to the observed sequences.

In 2009, I was invited to write a review article on primate comparative genomics for Genome Research.  This exercise led me to carry out a broad review of the literature and reflect on many of the problems of current interest in population genetics.  Perhaps inevitably, the ARG ended up playing a central role in my discussion of the field.  Not many people have cited this rather long-winded review but the act of writing it was extremely useful in clarifying my thinking about population genetics and the ARG.

Soon afterward, I began to try to interest students and postdocs in tackling the problem of explicit ARG inference in earnest, with an eye toward applications involving dozens or possibly hundreds of complete genomes.  I had a hunch that it should be possible to improve on existing methods by making carefully selected approximations and drawing from our standard bag of tricks for probabilistic modeling.  I did manage to convince a couple of students to work on the problem but progress was slow and limited, and they moved onto other problems before obtaining publishable results.

When Matt Rasmussen joined the group as a postdoc in 2011, he jumped at the chance to tackle the ARG inference problem.  It quickly became clear that Matt was an ideal match for the problem, given his background in molecular evolution and phylogenetics, his strong skills in algorithms and probabilistic modeling, and his programming talents.  He made early progress by setting up a forward simulator for ARGs, studying the properties of simulated ARGs in the presence of selection, and developing intuition about which features in the data were most informative for inference.

Still, the inference problem proved challenging.  Matt was focusing on an approach I had suggested involving the use of a small set of “representative genealogies” that were selected to span, as effectively as possible, the (infinitely large) space of all genealogies.  His plan was to use these genealogies to define a phylogenetic hidden Markov model, which could be used for approximate inference.  He had devised a number of clever heuristics for selecting these genealogies, for example, based on the collection of “splits” (bipartitions of sequences) implied by informative sites in the data, but it gradually became clear that this approach was limited in effectiveness and had poor scaling properties.  It was just too hard to pick a good set of representative genealogies.  Almost a year into the project, we faced a choice between settling for an approach that was clearly suboptimal or going back to the drawing board and beginning anew.

Sequence “threading”—exaptation of an idea

Our breakthrough came some time during the winter of 2012.  I remember the occasion vividly.  It was one of those rare moments when the outlines of a solution to a difficult puzzle suddenly come into focus at all once—as close to a mythical “eureka” moment as I have experienced.

Matt and I were reviewing possible solutions to the problem of selecting representative genealogies, when I was reminded of an idea I had discussed with a student a couple of years earlier.  Could there be a way to build the ARG up one sequence at a time, by “threading” an nth sequence into an ARG of size n–1?  This idea had intuitive appeal but the previous student and I had never been able to make it work with full genealogies.  However, Matt immediately picked up the idea and ran with it.  It fit naturally with the way he was representing the local genealogies and thinking about how they were related and he guessed that it ought to be possible to think of it as a problem of adding one branch to each local tree, position by position along the sequence.

As we discussed further how one might set up a solution to the threading problem, I was suddenly struck by a parallel to a problem my group had worked on several years earlier.  That problem had to do with gene prediction in a multi-species setting, where the exon-intron boundaries for genes were allowed to differ from one species to the next.  The approach we had settled on was to repeatedly sample the gene annotations for one species conditional on those for all of the others, using a probabilistic model called a phylogenetic hidden Markov model.  It turned out that this sampling problem could be elegantly solved by dynamic programming, using an algorithm for hidden Markov models known as the stochastic traceback algorithm.

Matt’s threading problem, it was evident, had the same essential structure, although it differed in many details.  Therefore, it ought to be possible to carry out the threading operation exactly, in time proportional to the sequence length, using the stochastic traceback algorithm.  Furthermore, it ought to be possible to structure this operation so that repeated applications would guarantee that we sampled from the true distribution of ARGs given the data and our model (i.e., so that we had a proper Gibbs sampler, in statistical jargon).  It was immediately clear to both of us that this was the right path forward.

As usual, there was a good deal of work in taking these ideas from white-board scratchings to a fully defined model and a working implementation.  Not least among these was the formidable task of generalizing the threading method to multiple sequences, which proved necessary for good MCMC convergence properties with more than a few sequences.  Matt was exceptionally diligent, resourceful, and creative in working through all of these steps, and eventually demonstrated that the method performs remarkably well, perhaps better than either of us anticipated.  Nevertheless, it was in that one exciting discussion about the threading operation that the project came into focus, making it clear that we had fundamentally new way of tackling this longstanding problem that was worth the additional months that would be needed to see it through to a full-fledged implementation.

Most evolutionary biologists are familiar with the concept of “exaptation,” popularized by Stephen Jay Gould and Elizabeth Vrba.  Exaptation refers to the process by which evolution takes a phenotypic characteristic developed for one purpose, and co-opts it for use in another way.  Examples include the gas bladders used for buoyancy in some fish (derived from lungs) and the use of bird feathers for flight (thought to have evolved for temperature regulation).  The phenomenon, of course, is also common in the evolution of ideas.  To me, the threading idea was a perfect example of an algorithmic exaptation, the reuse of an idea developed for one problem in evolutionary sequence analysis and co-opted for use in another.  In this case, the threading operation proved considerably more powerful and useful for ARG inference than for gene prediction.

Concluding notes

In this post, I have tried to convey some of my excitement about genome-wide ARG inference, a topic that might strike the casual reader as dry and obscure.  Space does not allow for a discussion of all of the possible ways in which we are thinking of making use of Matt’s ARG sampling program (called ARGweaver), but our manuscript does include a fairly lengthy Discussion section, which lays out some of our ideas for future work.  Clearly there is much still to be done, but I am convinced that this is going to be an enormously powerful tool for population genomic analysis.

Finally, I should say that, while I have focused on our own efforts in this post, we are far from the only researchers actively working on this topic.  As described in our paper, there is a great deal of closely related recent work by Yun Song, Richard Durbin, Thomas Mailund, Asger Hobolth, Mikkel Schierup, and others.  In short, there is no shortage of opportunities for exaptation of ideas in this vibrant field.

Adam Siepel

Our paper: Sashimi plots: Quantitative visualization of RNA sequencing read alignments

This is a guest post by Yarden Katz [@yardenkatz] on his paper (along with coauthors): katz et al. Sashimi plots: Quantitative visualization of RNA sequencing read alignments arXived here

A first draft of our paper Sashimi plots: Quantitative visualization of RNA sequencing read alignments is now available. Sashimi plots are a simple visualization of RNA sequencing data, intended to make it easier to detect differentially spliced exons across multiple RNA-Seq samples. In a Sashimi plot, RNA-Seq reads are summarized as read densities, and junction reads are collapsed into arcs whose width is proportional to the number of reads spanning the exons connected by the arc. See the paper for examples.

We call it a Sashimi plot in part because of the impeccable resemblance of bumpy RNA-Seq read densities in exons to small pieces of Sashimi, and also because we tried to keep the plots as close to the “raw” data as possible. While Sashimi plots can display estimates of isoform abundance levels from programs like MISO, the goal here was to summarize the read alignments as they are, without further processing or inference, so that conclusions from probabilistic models can be visually verified.

The original Sashimi plot program is a command line utility that makes customizable Sashimi plots using Python (using the matplotlib library). Recently, the IGV genome browser team implemented a version of Sashimi plots in their browser (see installation instructions.) This allows Sashimi plots to be made dynamically for any genomic region of interest, at a resolution set by the zoom in/out features of the browser. The plot can be made for all or a subset of the tracks loaded, and the scales can be adjusted by the user as in the main IGV window. Both the static, Python-based version of Sashimi plots and the dynamic version within IGV are available and actively maintained, and code bases for both are available on GitHub.

Sashimi plots still have important limitations. First, the junction arcs can get messy for genes with many alternative isoforms. This can be partially addressed by looking at simplified event annotations (e.g. ones containing only two isoforms, or a handful of isoforms, as in these annotations) rather than making plots for the full set of isoforms of a gene. The second limitation is that sometimes subtle differences are not readily seen from junction arc widths. We’re considering alternative representations (such as circle area or diameter) for quantitatively representing junction read counts.

The paper is meant primarily as advertisement for the software. We hope that other members of the RNA processing/sequencing community will find this useful and come up with their own variants of these plots.

Relevant links:

Our paper: Effect of Genetic Variation in a Drosophila Model of Misfolded Human Proinsulin

This guest post is by Bin He on two preprints, Genetic Complexity in a Drosophila Model of Diabetes-Associated Misfolded Human Proinsulin and Effect of Genetic Variation in a Drosophila Model of Misfolded Human Proinsulin, arXived here and here, respectively. This is a cross-post from Bin’s blog

Here we describe a pair of papers, both of which have been posted by Joe on this blog in the past month. But since they are intimately connected, we would like to write an additional post to explain the rationales behind them and the major findings therein.

The central questions in these two papers concern the genetic architecture of complex traits, such as those in human common disorders. We took a model organism approach in order to complement human studies, which are getting more and more powerful because of the successful community collaboration, but are still limited in several aspects, including mapping resolution and the ability to perform experimental validations.

Another important thinking underlying this project is the idea that decanalization of a trait may have caused a release of genetic variation, which subsequently contributed to the disease variability we see today. To this end, our fly model of misfolded human proinsulin may be viewed as an external perturbation, which, by exhausting the organism’s buffering capacity, reveals normally cryptic genetic variation. Under this view, our model will have general relevance in many human disorders.

To perform this study, we first established a fly model of a disease-associated human mutant proinsulin, which was the subject of our first paper “Genetic Complexity in a Drosophila Model of Diabetes-Associated Misfolded Human Proinsulin”.

We’d like to bring out several points. First, regarding the etiology of the disease phenotype in our fly model, we believe it is mainly due to the physical property of the mutant protein, rather than the biological function of the human proinsulin. Although Drosophila also has insulin-like proteins, their sequence similarity and functions differ substantially from the human homolog. Consistent with this view, when we made a transgenic fly expressing the wild-type human proinsulin, what we observed is that, at both phenotype and transcription level, expressing the wild-type human proinsulin in developing eye and other imaginal discs do not cause any visible changes. We thus propose that our fly model is for a general class of human disease associated with unfolded or misfolded protein.

In the first paper, we also described the phenomenon of variable phenotypic severity when put on different wild-derived genetic background. A series of experiments ruled out possible confounding factors, such as correlations induced by natural variability in eye size, or different levels of transgene expression.

We were exploring the idea of using natural variation in the fly to identify associated loci underlying a complex disease trait. We did so by crossing the transgenic, Mendelian disease carrying line to a panel of wild-derived inbred lines, and asked whether the severity of the disease is dependent on the genetic background. The answer is a definite yes: the range of phenotype quantified by the size of the eye span from 10% to 80% of wildtype (the mutant human proinsulin was expressed in the eye disc during development, causing neurodegeneration. We used eye because it is dispensable in lab conditions, and easy to measure the phenotype). We then conducted a GWAS, which led to the identification of sfl, as described above, and also the HS biosynthetic pathway by genetic test. One unique advantage of our system is its ultra-high resolution in mapping: we localized the association signal to ~400bp LD block within one of the introns of sfl, allowing us to test specific hypotheses about the molecular mechanisms of the associated variants. Pyro-sequencing analysis revealed allele-specific expression difference due to the intronic variation, but also highlighted the genetic heterogeneity even within that locus, with additional cis-variants present to influence the expression level. Overall, we believe that our fly model system is a powerful complementary approach to the genetic study of complex traits. Its high mapping resolution and rich molecular/genetic toolkits allow faster and in-depth characterization of disease-associated variation, which is a unique advantage.

Bin Z. He
Kreitman Lab, Dept of Ecology and Evolution, University of Chicago
current address: O’Shea Lab, FAS Center for Systems Biology, Harvard University / HHMI