The role of twitter in the life cycle of a scientific publication

The role of twitter in the life cycle of a scientific publication
Emily S. Darling, David Shiffman, Isabelle M. Côté, Joshua A. Drew
(Submitted on 2 May 2013)

Twitter is a micro-blogging social media platform for short messages that can have a long-term impact on how scientists create and publish ideas. We investigate the usefulness of twitter in the development and distribution of scientific knowledge. At the start of the life cycle of a scientific publication, twitter provides a large virtual department of colleagues that can help to rapidly generate, share and refine new ideas. As ideas become manuscripts, twitter can be used as an informal arena for the pre-review of works in progress. Finally, tweeting published findings can communicate research to a broad audience of other researchers, decision makers, journalists and the general public that can amplify the scientific and social impact of publications. However, there are limitations, largely surrounding issues of intellectual property and ownership, inclusiveness and misrepresentations of science sound bites. Nevertheless, we believe twitter is a useful social media tool that can provide a valuable contribution to scientific publishing in the 21st century.

Thoughts on “Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes”

Our next guest post is by Pleuni Pennings [@pleunipennings] with her thoughts on:
Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes, Duncan Palmer, John Frater, Rodney Phillips, Angela McLean, Gil McVean, arXived here

[UPDATED]

Last week, a group of people from Oxford University published an interesting paper on the arXiv. The paper is about using genealogical data (from HIV sequences), in combination with cross-sectional data (on patient and HIV phenotypes), to infer rates of evolution in HIV.

My conclusion: the approach is very interesting, and it makes total sense to use genealogical data to improve the inference from cross-sectional data. In fact, it is quite surprising to me that inferring rates from cross-sectional data works at all. However, in a previous paper by (partly) the same authors, they showed that it is possible to infer rates from cross-sectional data alone, and the estimates they obtained were very similar to the estimates from longitudinal data. The current paper provides a new and improved method, whose results are consistent with the previous paper.

The biological conclusion of the paper is that HIV adaptation is slower than many previous studies suggested. Case studies of fast evolution of the virus suffer from extreme publication bias and give the impression that evolution in HIV is always fast, whereas cross-sectional and longitudinal data show that evolution is often slow. Waiting times for CTL-escape and reversion are on the order of years.

1. What rates are they interested in?

The rates of interest here are the rate of escape from CTL pressure and the rate of reversion if there is no CTL pressure.

When someone is infected with HIV, the CTL response by the immune system of the patient can reduce the amount of virus in the patient. CTL stands for cytotoxic T lymphocyte. Which amino-acid sequences (epitopes) can be recognized by the host’s CTL response depends on the HLA genotype of the host.
Suppose I have a certain HLA genotype X, such that my CTLs can recognize virus with a specific sequence of about 9 amino acids, let’s call this sequence Y. To escape from the pressure of these CTLs, the virus can mutate sequence Y to sequence Y’. A virus with sequence Y’ is called an escape mutant. The host (patient) with HLA X is referred to as a “matched host” and hosts without HLA X are referred to as “unmatched.” The escape mutations are thought to be costly for the virus.
So, for each CTL epitope there are 4 possible combinations of host and virus:
1. matched host and wildtype virus (there is selection pressure on the virus to “escape”)
2. matched host and escape mutant virus
3. unmatched host and wildtype virus
4. unmatched host and escape mutant virus (there is selection pressure on the virus to revert)

The question is “how fast does the virus escape if it is in a matched host and how fast does it revert if it is in an unmatched host?”

2. Why do we want to know these rates?

First of all, just out of curiosity, it is interesting to study how fast things evolve – it is surprising how little we know about rates of adaptive evolution. Secondly, escape rates are relevant for the success of a potential HIV vaccine: if escape rates are high, then vaccines will probably not be very successful.

3. What are cross-sectional data and how can we infer rates from them?

Cross-sectional data are snapshots of the population, with information on hosts and their virus. Here, it is the number of matched and unmatched hosts with wildtype and escape virus at a given point in time.

So how do these data tell us what the escape and reversion rates are? Intuitively, it is easy to see how very high or very low rates would shape the data. For example, if escape and reversion happened very fast, then the virus would always be perfectly adapted: we’d only find wildtype virus in unmatched hosts and only escape mutant virus in matched hosts. Conversely, if escape and reversion were extremely slow, then the fraction of escape mutant virus would not differ between matched and unmatched hosts. Everyone would be infected with a random virus and this would never change.
The real situation is somewhere in between: the fraction of escape mutant virus is higher in matched hosts than in unmatched hosts. With the help of a standard epidemiological SI model (an ODE model) and an estimate of the age of the epidemic, the fraction of escape mutant virus in the two types of hosts translates into estimates of the rates of escape and reversion. In the earlier paper, this is exactly what the authors did, and the results make a lot of sense. Rates range from months to years, reversion is always slower than escape, and there are large differences between CTL epitopes. The results also matched well with data from longitudinal studies. In a longitudinal study, the patients are followed over time and evolution of the virus can be observed more directly. This is much more costly, but a much better way to estimate rates.
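To make this intuition concrete, here is a minimal sketch (in Python) of how two observed fractions can pin down two rates. It is not the authors' SI model: it assumes a single, hypothetical infection age t, a transmitted escape frequency q, and made-up counts, and it simply fits the escape and reversion rates by maximum likelihood.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import binom

    # Toy cross-sectional inference (not the authors' SI model): assume every
    # sampled host was infected t years ago, that a fraction q of new infections
    # already carry the escape mutant, and that the virus then escapes at rate
    # esc in matched hosts and reverts at rate rev in unmatched hosts.

    def p_escape_matched(esc, t, q):
        return 1.0 - (1.0 - q) * np.exp(-esc * t)

    def p_escape_unmatched(rev, t, q):
        return q * np.exp(-rev * t)

    def neg_log_lik(params, data, t=10.0):
        esc, rev, q = params
        k_m, n_m, k_u, n_u = data  # escape counts / totals in matched and unmatched hosts
        return -(binom.logpmf(k_m, n_m, p_escape_matched(esc, t, q)) +
                 binom.logpmf(k_u, n_u, p_escape_unmatched(rev, t, q)))

    # Hypothetical counts: 30 of 50 matched hosts and 8 of 40 unmatched hosts carry escape virus.
    data = (30, 50, 8, 40)
    fit = minimize(neg_log_lik, x0=[0.2, 0.05, 0.1], args=(data,),
                   bounds=[(1e-4, 5.0), (1e-4, 5.0), (1e-3, 0.99)])
    esc_hat, rev_hat, q_hat = fit.x  # per-year escape and reversion rate estimates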

4. Why are the estimates from cross-sectional data not good enough?

Unfortunately, the estimates from cross-sectional data are only point estimates, and maybe not very good ones. The problem is that the method (implicitly) assumes that each virus is independently derived from an ancestor at the beginning of the epidemic. For example, if there are a lot of escape mutant viruses in the dataset, then the estimated rate of escape will be high. However, the high number of escape mutant virus may be due to one or a few escape events early on in the epidemic that got transmitted to a lot of other patients. It is a classical case of non-independence of data. It could lead us to believe that we can have more confidence in the estimates than we should have.

5. Genealogical data to the rescue!

Fortunately, the authors have viral sequences that provide much more information than just whether or not the virus is an escape mutant. The sequences of the virus can inform us about the underlying genealogical tree and can tell us how non-independent the data really are (two escape mutants that are very close to each other in the tree are not very independent). The goal of the current paper is to use the genealogical data to get better estimates of the escape and reversion rates.

A large part of the paper deals with the nuts and bolts of how to combine all the data, but in essence, this is what they do: they first estimate the genealogical tree for the viruses of the patients for which they have data (while allowing for uncertainty in the estimated tree). Then they add information on the states of the tips (wildtype vs escape for the virus and matched vs unmatched for the patient), and use the tree with the tip labels to estimate the rates. This seems to be a very useful new method that may give better estimates and a natural way to get credible intervals for the estimates.
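For readers who want to see the tree part in miniature, the sketch below computes the likelihood of wildtype/escape tip labels under a two-state Markov process along the branches, using Felsenstein's pruning algorithm. It is only a caricature of the paper's method: the real model lets the rates depend on whether each lineage sits in a matched or unmatched host and averages over uncertainty in the tree, and the tree, rates and root state here are hypothetical.

    import numpy as np
    from scipy.linalg import expm

    # Caricature of the tree-based likelihood: tips are labelled wildtype (0) or
    # escape (1), a two-state Markov process with escape rate esc and reversion
    # rate rev runs along the branches, and the likelihood of the tip labels is
    # computed with Felsenstein's pruning algorithm.

    class Node:
        def __init__(self, children=None, branch_length=0.0, state=None):
            self.children = children or []   # empty list for tips
            self.branch_length = branch_length
            self.state = state               # 0 or 1 at tips, None internally

    def partial_likelihood(node, Q):
        if not node.children:                # tip: indicator on the observed state
            L = np.zeros(2)
            L[node.state] = 1.0
            return L
        L = np.ones(2)
        for child in node.children:
            P = expm(Q * child.branch_length)        # transition probabilities
            L *= P @ partial_likelihood(child, Q)
        return L

    def log_likelihood(root, esc, rev, root_freqs=(1.0, 0.0)):
        Q = np.array([[-esc, esc], [rev, -rev]])     # wildtype <-> escape
        return np.log(np.dot(root_freqs, partial_likelihood(root, Q)))

    # Hypothetical tree: two closely related escaped tips plus one wildtype tip.
    tree = Node(children=[
        Node(children=[Node(branch_length=0.5, state=1),
                       Node(branch_length=0.5, state=1)], branch_length=2.0),
        Node(branch_length=3.0, state=0)])
    print(log_likelihood(tree, esc=0.3, rev=0.05))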

The results they obtain with the new method are similar to the previous results for three CTL epitopes, and give slower rates for one CTL epitope. The credible intervals are quite wide, which shows that the data (from 84 patients) really don’t contain a whole lot of information about the rates, possibly because the trees are rather star-shaped, due to the exponential growth of the epidemic. Interestingly, the fact that the tree is rather star-shaped could explain why the older approach (based only on cross-sectional data) worked quite well. However, this will not necessarily be the case for other datasets.

Question for the authors

Do you use the information about the specific escape mutations in the data? Surely not all sequences that are considered “escape mutants” carry exactly the same nucleotide changes? Whenever they carry different mutations, you know they must be independent.

Assembling large, complex environmental metagenomes

Assembling large, complex environmental metagenomes
Adina Chuang Howe, Janet Jansson, Stephanie A. Malfatti, Susannah G. Tringe, James M. Tiedje, C. Titus Brown
(Submitted on 12 Dec 2012)

The large volumes of sequencing data required to deeply sample complex environments pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two data reduction approaches, digital normalization and partitioning, to this challenge. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license.
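The digital normalization step the abstract refers to (implemented in the khmer software) has a simple core idea, sketched below in Python: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. The parameter values and the pure-Python implementation are illustrative only; the actual tools are far more memory-efficient.

    from collections import defaultdict

    # Core idea of digital normalization: keep a read only if the median abundance
    # of its k-mers, counted over the reads kept so far, is below a coverage cutoff.
    # Reads from already well-covered regions are discarded, which shrinks the data
    # while retaining most of the information needed for assembly.

    def kmers(seq, k):
        return [seq[i:i + k] for i in range(len(seq) - k + 1)]

    def digital_normalization(reads, k=20, cutoff=20):
        counts = defaultdict(int)
        kept = []
        for read in reads:
            kms = kmers(read, k)
            if not kms:
                continue
            abundances = sorted(counts[km] for km in kms)
            if abundances[len(abundances) // 2] < cutoff:   # median k-mer abundance
                kept.append(read)
                for km in kms:
                    counts[km] += 1
        return kept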

HIV drug resistance: problems and perspectives

HIV drug resistance: problems and perspectives
Pleuni S Pennings
(Submitted on 25 Nov 2012)

Many HIV patients now have access to combination antiretroviral treatment (ART). At the end of 2011, more than eight million people were receiving antiretroviral therapy in low-income and middle-income countries. ART generally works well in keeping the virus suppressed and the patient healthy. However, treatment only works as long as the virus is not resistant against the drugs used. In the last decades HIV treatments have become better and better at slowing down the evolution of drug resistance, so that some patients are treated for many years without having any resistance problems. However, for some patients, especially in low-income countries, drug resistance is still a serious threat to their health. This essay will review what is known about transmitted and acquired drug resistance, multi-class drug resistance, resistance to newer drugs, resistance due to treatment for the prevention of mother-to-child transmission, the role of minority variants (low-frequency drug-resistance mutations), and resistance due to pre-exposure prophylaxis.

The evolution of genetic architectures underlying quantitative traits

The evolution of genetic architectures underlying quantitative traits
Etienne Rajon, Joshua B. Plotkin
(Submitted on 31 Oct 2012)

In the classic view introduced by R.A. Fisher, a quantitative trait is encoded by many loci with small, additive effects. Recent advances in QTL mapping have begun to elucidate the genetic architectures underlying vast numbers of phenotypes across diverse taxa, producing observations that sometimes contrast with Fisher’s blueprint. Despite these considerable empirical efforts to map the genetic determinants of traits, it remains poorly understood how the genetic architecture of a trait should evolve, or how it depends on the selection pressures on the trait. Here we develop a simple, population-genetic model for the evolution of genetic architectures. Our model predicts that traits under moderate selection should be encoded by many loci with highly variable effects, whereas traits under either weak or strong selection should be encoded by relatively few loci. We compare these theoretical predictions to qualitative trends in the genetics of human traits, and to systematic data on the genetics of gene expression levels in yeast. Our analysis provides an evolutionary explanation for broad empirical patterns in the genetic basis of traits, and it introduces a single framework that unifies the diversity of observed genetic architectures, ranging from Mendelian to Fisherian.

Our paper: Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs


This guest post is by Christopher Brown, Lara Mangravite, and Barbara Engelhardt on their paper: Integrative modeling of eQTLs and cis-regulatory elements suggest mechanisms underlying cell type specificity of eQTLs arXived here.

Why do we study eQTLs? Why don’t we count bristles?

The genetic dissection of complex traits, independent of the particular phenotype, is useful for improving our understanding of the genetic architecture underlying the biochemical function that regulates complex traits in general. In the last ten years, gene expression levels themselves have emerged as useful phenotypes amenable to genetic dissection, with several advantages, most notably that it is easy to accurately quantify tens of thousands of traits simultaneously (indeed even more when we address splicing and promoter usage). While the identification of SNPs that are associated with variation in gene expression (eQTLs) is certainly interesting at this basic level, an additional critical use for eQTL data has emerged. Because the majority of common human phenotypic variation appears to be driven by non-coding sequence variants, eQTL analyses are beginning to help with the mechanistic interpretation of GWAS results. In light of these interests and applications, we believe that eQTL analyses are hampered by (at least) three important limitations, which we have attempted to address in our recent preprint:

(1) Methodological (non-)uniformity. Most eQTL studies have been performed by different groups, on different genotyping and gene expression platforms, with different association methods, and using different criteria for defining significance. This lack of uniformity complicates even simple cross-study comparisons; for example, what fraction of genes has one or more independently associated eQTLs when analyzed across tissues? We address this issue by testing for eQTL associations across a diverse set of cell types using a uniform pipeline with standardized analysis parameters to perform all analytical steps starting from raw data. As a fairly trivial example, our analyses across the eleven studies demonstrated that nearly all of the variation in the proportion of genes with significant eQTL associations identified within each study can be explained by just two factors: study size and replicate gene expression measurements. The proportion of genes with one or more independently associated eQTLs, then, is probably not 5-10% as has been hypothesized, but most or all of them; we will get a better picture of this once studies are designed with sufficient power.
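As a rough illustration of what a uniform per-gene, per-SNP association test looks like, here is a minimal Python sketch. It is not the pipeline described in the preprint, which additionally handles normalization, covariates, and multiple-testing control, and the toy data are made up.

    import numpy as np
    from scipy import stats

    # Bare-bones single-SNP eQTL test: regress expression on minor-allele dosage
    # and report the slope and p-value.

    def eqtl_test(expression, dosage):
        """expression: per-individual expression values; dosage: 0/1/2 allele counts."""
        slope, intercept, rvalue, pvalue, stderr = stats.linregress(dosage, expression)
        return slope, pvalue

    # Hypothetical toy data for one gene and one SNP in 100 individuals.
    rng = np.random.default_rng(0)
    dosage = rng.integers(0, 3, size=100)
    expression = 0.3 * dosage + rng.normal(size=100)
    print(eqtl_test(expression, dosage))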

(2) Undercharacterized cell specificity. It is generally agreed that some eQTLs regulate gene expression in a cell-type-specific manner. When using eQTLs to interpret the genetic contribution to complex clinical traits, it is important to consider the cell type(s) most relevant to the trait of interest. However, if we don’t know which cell type is responsible for a phenotype, or if we don’t have eQTL data for the cell type of interest, we are forced to extrapolate inferences about eQTLs derived from other cell types. By enabling the simultaneous comparison of within- and between-cell-type eQTL replication for multiple cell type combinations, and integrating these results with cis-regulatory element (CRE) mapping data from ENCODE, we have addressed several unresolved questions concerning the nature of cell-type-specific and ubiquitous eQTL SNPs. We find that eQTL-CRE overlap is frequently cell type specific and that this information can be used to predict the cell specificity of eQTLs in the absence of additional gene expression data from the cell type of interest. While these results are certainly preliminary (and indeed we see many possible improvements), we hope this will improve the utility of eQTL-GWAS comparisons, particularly in situations where the GWAS cell type of interest lacks eQTL data.

(3) Resolution, causality, and mechanism. Lead tag SNPs are probably causal variants less than 30% of the time. While larger and more diverse genomic sample sets are essential to improve the resolution for identifying causal variants, this is not always possible due to time or budget constraints. However, the application of orthogonal genomic data also has the potential to considerably refine resolution with the added benefit of providing insight into the mechanism through which a causal variant acts. We approach this (as a few other groups have – notably Dan Gaffney et al.) by integrating CRE data into our analyses, because it appears that genetic variants that overlap certain types of CREs are much more likely to be functional than those that do not. We believe that this hypothesis, and the methods used to address it, need to be validated with directed functional assays, but we see no reason to doubt the principle of understanding heritable phenotypes using genotype functional analyses. Furthermore, the analysis of cell specific eQTL data in the context of cell specific CRE data, which is now possible, enables predictions about the regulatory mechanisms that are affected by a specific eQTL, which will allow us to place GWAS hits into pathways or provide other meaningful biological insights.
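The CRE-overlap idea itself is simple to sketch: ask, for each candidate eQTL SNP, whether it falls inside a cis-regulatory element interval mapped in the relevant cell type. The SNP names and coordinates below are hypothetical; a real analysis would use ENCODE annotations and compare the overlap rate against matched background SNPs.

    # Sketch of the CRE-overlap step: does a candidate eQTL SNP fall inside a
    # cis-regulatory element interval mapped in the relevant cell type?

    def overlaps_cre(snp_pos, cre_intervals):
        """cre_intervals: list of (start, end) tuples on the SNP's chromosome."""
        return any(start <= snp_pos < end for start, end in cre_intervals)

    cres = [(1200, 1450), (5000, 5600)]              # hypothetical CREs (e.g. DNase peaks)
    lead_snps = {"rsA": 1300, "rsB": 3000}           # hypothetical lead eQTL SNPs
    in_cre = {name: overlaps_cre(pos, cres) for name, pos in lead_snps.items()}
    # Compare this overlap rate against matched background SNPs to assess enrichment.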

Why did we submit the paper to arXiv and Haldane’s Sieve?

We are big proponents of open access publication, open data, and transparent methods and analysis. At least part of what we’ve done here is to create a resource that we hope will be useful to the broader community. We are open to pre- and post-publication review of and commentary on our motivations and methods. Furthermore, we have submitted all of the eQTLs we identify to a database of eQTLs (eqtl.uchicago.edu), and we are currently securing funding to develop open access, online tools to help GWAS researchers follow up specific functional variants using our methods.

Christopher Brown, Lara Mangravite, Barbara Engelhardt

Species Identification and Unbiased Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences

Species Identification and Unbiased Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences

Swee Hoe Ong, Vinutha Uppoor Kukkillaya, Andreas Wilm, Christophe Lay, Eliza Xin Pei Ho, Louie Low, Martin Lloyd Hibberd, Niranjan Nagarajan
(Submitted on 12 Oct 2012)

The high throughput and cost-effectiveness afforded by short-read sequencing technologies, in principle, enable researchers to perform 16S rRNA profiling of complex microbial communities at unprecedented depth and resolution. Existing Illumina sequencing protocols are, however, limited by the fraction of the 16S rRNA gene that is interrogated and therefore limit the resolution and quality of the profiling. To address this, we present the design of a novel protocol for shotgun Illumina sequencing of the bacterial 16S rRNA gene, optimized to capture more than 90% of sequences in the Greengenes database and with nearly twice the resolution of existing protocols. Using several in silico and experimental datasets, we demonstrate that despite the presence of multiple variable and conserved regions, the resulting shotgun sequences can be used to accurately quantify the diversity of complex microbial communities. The reconstruction of a significant fraction of the 16S rRNA gene also enabled high precision (>90%) in species-level identification thereby opening up potential application of this approach for clinical microbial characterization.

A mixed model approach for joint genetic analysis of alternatively spliced transcript isoforms using RNA-Seq data

A mixed model approach for joint genetic analysis of alternatively spliced transcript isoforms using RNA-Seq data

Barbara Rakitsch, Christoph Lippert, Hande Topa, Karsten Borgwardt, Antti Honkela, Oliver Stegle
(Submitted on 10 Oct 2012)

RNA-Seq technology allows for studying the transcriptional state of the cell at an unprecedented level of detail. Beyond quantification of whole-gene expression, it is now possible to disentangle the abundance of individual alternatively spliced transcript isoforms of a gene. A central question is to understand the regulatory processes that lead to differences in relative abundance variation due to external and genetic factors. Here, we present a mixed model approach that allows for (i) joint analysis and genetic mapping of multiple transcript isoforms and (ii) mapping of isoform-specific effects. Central to our approach is to comprehensively model the causes of variation and correlation between transcript isoforms, including the genomic background and technical quantification uncertainty. As a result, our method allows us to accurately test for shared as well as transcript-specific genetic regulation of transcript isoforms and achieves substantially improved calibration of these statistical tests. Experiments on genotype and RNA-Seq data from 126 human HapMap individuals demonstrate that our model can help to obtain a more fine-grained picture of the genetic basis of gene expression variation.
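A stripped-down illustration of the shared-versus-isoform-specific distinction (not the authors' mixed model, which also accounts for genomic background and quantification uncertainty): stack two isoforms of one gene, fit a genotype effect common to both plus a genotype-by-isoform interaction, and compare against the shared-effect-only model. The data and effect sizes below are simulated purely for illustration.

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    # Toy shared-vs-specific test using ordinary least squares and a
    # likelihood-ratio comparison between nested models.

    rng = np.random.default_rng(1)
    n = 126                                          # e.g. number of HapMap individuals
    g = rng.integers(0, 3, size=n)                   # SNP dosage
    iso1 = 0.5 * g + rng.normal(size=n)              # isoform 1: shared effect only
    iso2 = (0.5 + 0.4) * g + rng.normal(size=n)      # isoform 2: extra isoform-specific effect

    y = np.concatenate([iso1, iso2])
    geno = np.concatenate([g, g])
    iso = np.concatenate([np.zeros(n), np.ones(n)])  # isoform indicator

    X_full = sm.add_constant(np.column_stack([geno, iso, geno * iso]))
    X_shared = sm.add_constant(np.column_stack([geno, iso]))
    full = sm.OLS(y, X_full).fit()
    shared = sm.OLS(y, X_shared).fit()
    lr_stat = 2 * (full.llf - shared.llf)
    print(chi2.sf(lr_stat, df=1))                    # p-value for an isoform-specific effect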

Forward Simulation of Fisher-Wright Populations with Stochastic Population Size and Neutral Single Step Mutations in Haplotypes

Efficient Forward Simulation of Fisher-Wright Populations with Stochastic Population Size and Neutral Single Step Mutations in Haplotypes
Mikkel Meyer Andersen, Poul Svante Eriksen
(Submitted on 5 Oct 2012)

In both population genetics and forensic genetics it is important to know how haplotypes are distributed in a population. Simulation of population dynamics helps facilitate research on the distribution of haplotypes. In forensic genetics, the haplotypes can for example consist of lineage markers such as short tandem repeat loci on the Y chromosome (Y-STR). A dominating model for describing population dynamics is the simple, yet powerful, Fisher-Wright model. We describe an efficient algorithm for exact forward simulation of Fisher-Wright populations (that is, simulating the exact model rather than an approximation such as the coalescent). The efficiency comes from convenient data structures, obtained by changing the traditional view from individuals to haplotypes. The algorithm is implemented in the open-source R package ‘fwsim’ and is able to simulate very large populations. We focus on a haploid model and assume stochastic population size with flexible growth specification, no selection, a neutral single step mutation process, and self-reproducing individuals. These assumptions make the algorithm ideal for studying lineage markers such as Y-STR.
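For intuition, here is a conceptual Python sketch of the kind of simulation the abstract describes (fwsim itself is an R package and is far more efficient): haploid Fisher-Wright reproduction with a stochastically growing population size and neutral single-step mutation on each locus of a Y-STR-like haplotype. All parameter values are illustrative.

    import numpy as np

    # Conceptual forward simulation: each generation, the population size is drawn
    # stochastically around a growth factor, offspring copy a random parent's
    # haplotype, and each locus can mutate up or down by one repeat unit.

    def simulate(n0=1000, growth=1.02, mu=0.003, loci=5, generations=100, seed=0):
        rng = np.random.default_rng(seed)
        pop = np.full((n0, loci), 15, dtype=int)        # every locus starts at allele 15
        for _ in range(generations):
            n_next = rng.poisson(growth * len(pop))     # stochastic population size
            parents = rng.integers(0, len(pop), size=n_next)
            pop = pop[parents]                          # offspring copy parental haplotypes
            mutate = rng.random(pop.shape) < mu         # neutral single-step mutations
            steps = rng.choice([-1, 1], size=pop.shape)
            pop[mutate] += steps[mutate]
        return pop

    haplotypes = simulate()
    # Haplotype frequency spectrum, e.g. for forensic match-probability calculations.
    unique, counts = np.unique(haplotypes, axis=0, return_counts=True)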

A phylogeny of birds based on over 1,500 loci collected by target enrichment and high-throughput sequencing

A phylogeny of birds based on over 1,500 loci collected by target enrichment and high-throughput sequencing
John E. McCormack, Michael G. Harvey, Brant C. Faircloth, Nicholas G. Crawford, Travis C. Glenn, Robb T. Brumfield
(Submitted on 4 Oct 2012)

Evolutionary relationships among birds in Neoaves, a clade including the vast majority of avian diversity, have vexed systematists due to the ancient, rapid radiation of numerous lineages. We applied a new phylogenomic approach to resolve relationships in Neoaves using target enrichment (sequence capture) and high-throughput sequencing of ultraconserved elements (UCEs) in avian genomes. We collected sequence data from UCE loci for 32 members of Neoaves and one outgroup (chicken) and analyzed data sets that differed in amount of missing data. An alignment of 1,541 loci that allowed missing data was 87% complete and resulted in a highly resolved phylogeny with broad agreement between the Bayesian and maximum-likelihood (ML) trees. Although the 100% complete matrix of 416 UCE loci was broadly similar, the Bayesian and ML trees differed to a greater extent in this analysis, suggesting that increasing from 416 to 1,541 loci led to increased stability and resolution of the tree. Novel results of our study include surprisingly close relationships between phenotypically divergent bird families, such as tropicbirds (Phaethontidae) and the sunbittern (Eurypygidae) as well as a sister relationship between bustards (Otididae) and turacos (Musophagidae). This phylogeny bolsters support for monophyletic waterbird and landbird clades and also strongly supports controversial relationships from previous studies, including the sister relationship between passerines and parrots and the non-monophyly of raptorial birds in the hawk and falcon families. Although significant challenges remain to fully resolving some of the deep relationships in Neoaves, especially among lineages outside the waterbirds and landbirds, this study suggests that increased data will yield an increasingly resolved avian phylogeny.