Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome

Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome
Wentian Li, Jan Freudenberg, Pedro Miramontes
(Submitted on 28 Aug 2013)

The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a greater read length increases the chance that reads map uniquely to the reference genome, a quantitative analysis of the influence of read length on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 to 1000 basepairs. We use the proportion of non-singleton k-mers to evaluate the mappability of reads of the corresponding length. We observe that the proportion of non-singletons decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents over different ranges of k. A faster decay at smaller values of k indicates more limited gains for read lengths > 200 basepairs. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank-frequency plots exhibit a concave Zipf's curve. The locations of the most frequent 1000-mers comprise 172 kilobase-scale regions, including four large stretches on chromosomes 1 and X that contain genes with biomedical implications. Even a read length of 1000 basepairs would be insufficient to reliably sequence these specific regions.
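For readers who want to experiment, here is a minimal sketch of the paper's core measurement, under the assumption that the non-singleton proportion is taken over k-mer positions (the fraction of positions whose k-mer occurs elsewhere). Genome-scale counting needs a dedicated k-mer counter such as Jellyfish; this toy version only conveys the idea.

```python
# Toy sketch: fraction of k-mer positions whose k-mer is repeated
# elsewhere (a proxy for reads that cannot be uniquely mapped).
# Assumes the "non-singleton proportion" is taken over positions;
# a dedicated counter (e.g. Jellyfish) is needed at genome scale.
from collections import Counter

def nonunique_fraction(seq, k):
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for km in kmers if counts[km] > 1) / len(kmers)

# Longer k-mers are more often unique, so the fraction decays with k.
seq = "ACGT" * 300 + "GATTACAGATTACA" * 40 + "TTGCA" * 200
for k in (20, 50, 100, 200):
    print(k, round(nonunique_fraction(seq, k), 3))
```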

Our paper: Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales

This guest post is by Mike Harvey on the paper by Tilston Smith, Harvey, and coauthors, Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales, arXived here.

This paper is a result of work on developing markers and methods for generating genomic data for species without available genomes (I’ll refer to these as “non-model” species). The work is a collaborative effort between some researchers who are really on top of developments in sequencing technologies (and are also a blast to work with) – Travis Glenn at UGA, Brant Faircloth at UCLA, and John McCormack at Occidental – and our lab here at LSU. We think the marker sets we have been developing (ultraconserved elements) and more generally the method we are using (sequence capture) have the potential to make the genomic revolution more accessible to researchers studying the population genetics of diverse non-model organisms.

Background

Although genomic resources for humans and other model systems are increasing rapidly, the bottleneck for those of us working on the population genetics of non-model systems is simply our ability to generate data. Many of us are still struggling to take advantage of the increase in sequencing capacity provided by next-generation platforms. For many projects, sequencing entire genomes is neither feasible (yet) nor necessary, so researchers have focused on finding reasonable methods of subsampling the genome in a repeatable way such that the same subset of genomic regions can be sampled for many individuals. We often have to do this, however, with little to no prior genomic information from our particular study organism.

Most methods for subsampling the genome thus far have involved "random" sampling from across the genome: restriction enzymes are used to digest genomic DNA, and fragments that fall in a particular part of the fragment size distribution are then sequenced. Drawbacks of these methods include (1) that the researcher has no prior knowledge of where in the genome sequences will come from or what function the genomic region might serve, and (2) that the repeatability of the method, specifically the ability to generate data from the same loci across samples, depends on the conservation of the enzyme cut sites, which often are not conserved at deeper timescales. Sequencing transcriptomes is also a popular method for subsampling the genome, but this simply isn't an option for those of us working with museum specimens and tissues or old blood samples in which RNA hasn't been properly preserved.
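To make the repeatability issue concrete, here is a hypothetical in-silico digest, a sketch rather than any published pipeline: cut at every recognition site, then keep only fragments inside a size window. The enzyme motif and window are illustrative, and the fixed cut offset within a real recognition site is ignored.

```python
# Hypothetical in-silico digest: cut at each recognition-site match,
# then size-select. Motif and window are illustrative; real enzymes
# cut at a fixed offset within the site, which is ignored here.
import re

def size_selected_fragments(seq, site="GAATTC", window=(200, 400)):
    cuts = [m.start() for m in re.finditer(site, seq)]
    bounds = [0] + cuts + [len(seq)]
    frags = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    lo, hi = window
    return [f for f in frags if lo <= len(f) <= hi]

# A mutation that destroys a cut site merges two fragments, usually
# pushing the locus out of the size window: the repeatability
# problem described above.
```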

Sequence capture, a molecular technique involving genome enrichment by hybridization to RNA or DNA ‘probes’, is a flexible alternative that allows researchers to subsample whatever portions of the genome they like. The drawback of sequence capture, however, is that you need enough prior genomic information to design the synthetic oligos used as probes. This is not a problem for, e.g., exome capture in humans, where the targeted genes are well characterized, but it is a challenge for non-model systems without sequenced genomes.

This is where ultraconserved elements come in. Ultraconserved elements (UCEs) are short genomic regions that are highly conserved across widely divergent species (e.g. all amniotes). Because they are so conserved, UCE sequences can easily be used as probes for sequence capture in diverse non-model organisms, even if the organisms themselves have little or no genomic information available. If you are not working on amniotes or fishes (for which we have already designed probe arrays), all you may need to find UCEs is a couple of genomes from species that diverged from your study organism within the last few hundred million years. Of course, this general approach is not specific to loci that fall into our narrow definition of UCEs, but is limited merely by the availability of genomic information that can be used to design probes. As additional genomic information becomes available for a given group, additional loci, including protein-coding regions, can easily be added to capture arrays.
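As a toy illustration of that last point, the sketch below scans a pairwise alignment for perfectly conserved runs, which is roughly how candidate ultraconserved regions can be found. Real workflows operate on whole-genome alignments and apply further filters; the length threshold and names here are illustrative only.

```python
# Toy UCE discovery: given two aligned sequences of equal length
# (gaps as '-'), report runs of perfect identity of at least
# `min_len` bases. Real pipelines work on whole-genome alignments
# and filter candidates further; this only conveys the idea.
def conserved_runs(aln_a, aln_b, min_len=60):
    runs, start = [], None
    for i, (a, b) in enumerate(zip(aln_a, aln_b)):
        if a == b and a != "-":
            if start is None:
                start = i          # open a conserved run
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None           # mismatch or gap closes the run
    if start is not None and len(aln_a) - start >= min_len:
        runs.append((start, len(aln_a)))
    return runs
```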

Our question for this paper – does sequence capture of UCEs work for population genetics?

We have previously used sequence capture of UCEs to address deeper-level phylogenetic questions. We’ve found that at deep timescales, the flanking regions of UCEs contain a large amount of informative variation. The goals of the present study were (1) to see if sufficient information existed in UCEs to enable studies at shallow evolutionary (read "population genetic or phylogeographic") timescales, and (2) to explore some of the analyses that might be possible with population genetic data from non-model organisms. For our study, we sampled two individuals from each of four populations in five different species of non-model Neotropical birds. We conducted sequence capture using probes designed from 2,386 UCEs shared by amniotes, and we sequenced the resulting libraries on an Illumina HiSeq. We then examined the number of loci recovered and the amount of informative variation in those loci for each of the five species. We also conducted some standard analyses – species tree estimation, demographic modeling, and species delimitation – for each species.

We were able to recover between 776 and 1,516 UCE regions across the five species, and these contained sufficient variation to conduct population genetic analyses in each species. Species tree estimates, demographic parameters, and species limits mostly corresponded with prior estimates based on morphology or mitochondrial DNA sequences. Confidence intervals around demographic parameter estimates from the UCEs were much narrower than estimates from mitochondrial DNA using similar methods, supporting the idea that larger datasets will allow more precise estimates of species histories.
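For concreteness, here is a minimal sketch of the kind of per-locus summary reported above, assuming each locus is an alignment of equal-length sequences (one per individual); the function names are ours, not the authors'.

```python
# Per-locus polymorphism summary: a column is variable if it contains
# more than one base among A/C/G/T (gaps and Ns ignored). Input: each
# locus is a list of equal-length sequences. Names are illustrative.
def variable_sites(alignment):
    count = 0
    for column in zip(*alignment):
        bases = {c for c in column if c in "ACGT"}
        if len(bases) > 1:
            count += 1
    return count

def summarize_loci(loci):
    per_locus = [variable_sites(a) for a in loci]
    polymorphic = [v for v in per_locus if v > 0]
    frac_poly = len(polymorphic) / len(loci)
    mean_var = sum(polymorphic) / len(polymorphic) if polymorphic else 0.0
    return frac_poly, mean_var   # cf. the 53-77% and 2.0-3.2 figures
```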

Some conclusions

Pending faster and cheaper methods for sequencing and de novo assembly of whole genomes, methods for sampling a subset of the genome will be a practical necessity for population genetic studies in non-model organisms. Sequence capture is both intuitively appealing and practical in that it allows researchers to select a priori the regions of the genome in which they are interested. Ultraconserved elements pair nicely with sequence capture because they allow us to collect data from the same loci shared across a very broad spectrum of organisms (e.g. all amniotes or all fishes). As genomic data for diverse groups increase, UCE capture probes will certainly be augmented with additional genomic regions. In the meantime, sequence capture of UCEs has a lot to offer for population genetic studies of non-model organisms. See our paper for more information, or visit ultraconserved.org, where our probe sets, protocols, code, and other information are available under open-source licenses (BSD-style and Creative Commons) for anyone to use.

Fast Approximate Inference of Transcript Expression Levels from RNA-seq Data


Fast Approximate Inference of Transcript Expression Levels from RNA-seq Data

James Hensman, Peter Glaus, Antti Honkela, Magnus Rattray
(Submitted on 27 Aug 2013)

Motivation: The mapping of RNA-seq reads to their transcripts of origin is a fundamental task in transcript expression estimation and differential expression scoring. Where ambiguities in mapping exist because transcripts share sequence, e.g. alternative isoforms or alleles, the problem becomes an instance of non-trivial probabilistic inference. Exact Bayesian inference in such a problem is intractable, and approximate methods such as Markov chain Monte Carlo (MCMC) and Variational Bayes must be used. Standard implementations of these methods can be prohibitively slow for large datasets and complex gene models.
Results: We propose an approximate inference scheme based on Variational Bayes applied to an existing model of transcript expression inference from RNA-seq data. We apply recent advances in Variational Bayes algorithmics to improve the convergence of the algorithm beyond the standard variational expectation-maximisation approach. We apply our algorithm to simulated and biological datasets, demonstrating that the increase in speed requires only a small trade-off in accuracy of expression level estimation.
Availability: The methods were implemented in R and C++, and are available as part of the BitSeq project at this https URL. The methods will be made available through the BitSeq Bioconductor package at the next stable release.
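To give a flavour of the algorithm class involved (a generic sketch, not BitSeq's actual implementation), here is a variational-Bayes update for a read-assignment mixture: reads carry known alignment likelihoods to transcripts, and a Dirichlet posterior over expression proportions is refined iteratively.

```python
# Generic VB for a read-assignment mixture (a sketch, not BitSeq's
# code). a[n, k] is the likelihood of read n given transcript k
# (every read is assumed to align to at least one transcript). We
# alternate soft assignments q(z) with a Dirichlet update on theta.
import numpy as np
from scipy.special import digamma

def vb_expression(a, alpha0=1.0, iters=200):
    n_reads, n_tr = a.shape
    alpha = np.full(n_tr, alpha0 + n_reads / n_tr)      # rough init
    for _ in range(iters):
        elog = digamma(alpha) - digamma(alpha.sum())    # E[log theta]
        r = a * np.exp(elog)                            # unnormalised q(z)
        r /= r.sum(axis=1, keepdims=True)
        alpha = alpha0 + r.sum(axis=0)                  # Dirichlet posterior
    return alpha   # posterior mean expression: alpha / alpha.sum()
```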

Fluctuating selection models and McDonald-Kreitman type analyses

Fluctuating selection models and McDonald-Kreitman type analyses
Toni I. Gossmann, David Waxman, Adam Eyre-Walker
(Submitted on 25 Aug 2013)

It is likely that the strength of selection acting upon a mutation varies through time due to changes in the environment. However, most population genetic theory assumes that the strength of selection remains constant. Here we investigate the consequences of fluctuating selection pressures for the quantification of adaptive evolution using McDonald-Kreitman (MK) style approaches. In agreement with previous work, we show that fluctuating selection can generate evidence of adaptive evolution even when the expected strength of selection on a mutation is zero. However, we also find that under fluctuating selection models, the mutations that contribute to both polymorphism and divergence tend, on average, to be positively selected during their lifetime. This is because mutations that fluctuate, by chance, to positively selected values tend to reach higher frequencies in the population than those that fluctuate towards negative values. Hence the evidence of adaptive evolution detected by MK type approaches under a fluctuating selection model is genuine, since fixed mutations tend to be advantageous on average during their lifetime. Nevertheless, we show that these methods tend to underestimate the rate of adaptive evolution when selection fluctuates.
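A toy Wright-Fisher simulation, with purely illustrative parameters, makes the key intuition easy to check: even when each generation's selection coefficient has mean zero, the mutations that reach fixation have, on average, experienced positive selection along the way.

```python
# Toy Wright-Fisher with fluctuating selection: s is redrawn each
# generation with mean zero; we record the mean s experienced by
# mutations that fix. All parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def run_mutation(N=500, sigma=0.05):
    p, s_total, gens = 1.0 / (2 * N), 0.0, 0
    while 0.0 < p < 1.0:
        s = rng.normal(0.0, sigma)                  # fluctuating s, E[s] = 0
        s_total, gens = s_total + s, gens + 1
        w = p * (1 + s) / (p * (1 + s) + (1 - p))   # selection shift
        p = rng.binomial(2 * N, w) / (2 * N)        # binomial drift
    return p == 1.0, s_total / gens

means = [m for fixed, m in (run_mutation() for _ in range(2000)) if fixed]
print("mean s along the path of fixed mutations:",
      sum(means) / max(len(means), 1))              # comes out positive
```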

Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales

Target capture and massively parallel sequencing of ultraconserved elements (UCEs) for comparative studies at shallow evolutionary time scales
Brian Tilston Smith, Michael G. Harvey, Brant C. Faircloth, Travis C. Glenn, Robb T. Brumfield
(Submitted on 24 Aug 2013)

Comparative genetic studies of non-model organisms are transforming rapidly due to major advances in sequencing technology. A limiting factor in these studies has been the identification and screening of orthologous loci across an evolutionarily distant set of taxa. Here, we evaluate the efficacy of genomic markers targeting ultraconserved DNA elements (UCEs) for analyses at shallow evolutionary timescales. Using sequence capture and massively parallel sequencing to generate UCE data for five co-distributed Neotropical rainforest bird species, we recovered 776-1,516 UCE loci across the five species. Across species, 53-77 percent of the loci were polymorphic, containing between 2.0 and 3.2 variable sites per polymorphic locus, on average. We performed species tree construction, coalescent modeling, and species delimitation, and we found that the five co-distributed species exhibited discordant phylogeographic histories. We also found that species trees and divergence times estimated from UCEs were similar to those obtained from mtDNA. The species that inhabit the understory had older divergence times across barriers, contained a higher number of cryptic species, and exhibited larger effective population sizes relative to species inhabiting the canopy. Because orthologous UCEs can be obtained from a wide array of taxa, are polymorphic at shallow evolutionary time scales, and can be generated rapidly at low cost, they are effective genetic markers for studies investigating evolutionary patterns and processes at shallow time scales.

A network approach to analyzing highly recombinant malaria parasite genes

A network approach to analyzing highly recombinant malaria parasite genes
Daniel B. Larremore, Aaron Clauset, Caroline O. Buckee
(Submitted on 23 Aug 2013)

The var genes of the human malaria parasite Plasmodium falciparum present a challenge to population geneticists due to their extreme diversity, which is generated by high rates of recombination. These genes encode a primary antigen protein called PfEMP1, which is expressed on the surface of infected red blood cells and elicits protective immune responses. Var gene sequences are characterized by pronounced mosaicism, precluding the use of traditional phylogenetic tools that require bifurcating tree-like evolutionary relationships. We present a new method that identifies highly variable regions (HVRs), and then maps each HVR to a complex network in which each sequence is a node and two nodes are linked if they share an exact match of significant length. Here, networks of var genes that recombine freely are expected to have a uniformly random structure, but constraints on recombination will produce network communities that we identify using a stochastic block model. We validate this method on synthetic data, showing that it correctly recovers populations of constrained recombination, before applying it to the Duffy Binding Like-α (DBLα) domain of var genes. We find nine HVRs whose network communities map in distinctive ways to known DBLα classifications and clinical phenotypes. We show that the recombinational constraints of some HVRs are correlated, while others are independent. These findings suggest that this micromodular structuring facilitates independent evolutionary trajectories of neighboring mosaic regions, allowing the parasite to retain protein function while generating enormous sequence diversity. Our approach therefore offers a rigorous method for analyzing evolutionary constraints in var genes, and is also flexible enough to be easily applied more generally to any highly recombinant sequences.
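In outline, and with illustrative thresholds, the network construction can be sketched as follows. The paper fits a stochastic block model to find communities; networkx's label propagation appears below purely as a stand-in for that step.

```python
# Sketch of the HVR network: sequences are nodes, linked when they
# share an exact substring of length >= k. The paper uses a stochastic
# block model for community detection; label propagation here is only
# a stand-in, and k = 10 is an illustrative threshold.
import networkx as nx

def shares_kmer(s1, s2, k):
    kmers = {s1[i:i + k] for i in range(len(s1) - k + 1)}
    return any(s2[j:j + k] in kmers for j in range(len(s2) - k + 1))

def hvr_network(seqs, k=10):
    g = nx.Graph()
    g.add_nodes_from(range(len(seqs)))
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if shares_kmer(seqs[i], seqs[j], k):
                g.add_edge(i, j)
    return g

# communities = nx.community.label_propagation_communities(hvr_network(seqs))
```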

Complexity of evolutionary equilibria in static fitness landscapes

Complexity of evolutionary equilibria in static fitness landscapes
Artem Kaznatcheev
(Submitted on 23 Aug 2013)

A fitness landscape is a genetic space — with two genotypes adjacent if they differ in a single locus — and a fitness function. Evolutionary dynamics produce a flow on this landscape from lower fitness to higher; reaching equilibrium only if a local fitness peak is found. I use computational complexity to question the common assumption that evolution on static fitness landscapes can quickly reach a local fitness peak. I do this by showing that the popular NK model of rugged fitness landscapes is PLS-complete for K >= 2; the reduction from Weighted 2SAT is a bijection on adaptive walks, so there are NK fitness landscapes where every adaptive path from some vertices is of exponential length. Alternatively — under the standard complexity theoretic assumption that there are problems in PLS not solvable in polynomial time — this means that there are no evolutionary dynamics (known, or to be discovered, and not necessarily following adaptive paths) that can converge to a local fitness peak on all NK landscapes with K = 2. Applying results from the analysis of simplex algorithms, I show that there exist single-peaked landscapes with no reciprocal sign epistasis where the expected length of an adaptive path following strong selection weak mutation dynamics is $e^{O(n^{1/3})}$ even though an adaptive path to the optimum of length less than n is available from every vertex. The technical results are written to be accessible to mathematical biologists without a computer science background, and the biological literature is summarized for the convenience of non-biologists with the aim to open a constructive dialogue between the two disciplines.
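To make the landscape objects concrete, here is a small NK implementation with an adaptive walk. It illustrates the model only: random instances like these usually have short walks, whereas the abstract's exponential-path results concern specially constructed worst-case landscapes.

```python
# Small NK landscape plus an adaptive walk. Each locus's fitness
# contribution depends on its own state and K random neighbours.
# Random instances usually yield short walks; the hardness results
# above concern worst-case constructions, not instances like these.
import itertools, random

def make_nk(n=12, k=2, seed=0):
    rng = random.Random(seed)
    nbrs = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    tabs = [{bits: rng.random()
             for bits in itertools.product((0, 1), repeat=k + 1)}
            for _ in range(n)]
    def fitness(g):
        return sum(tabs[i][(g[i], *(g[j] for j in nbrs[i]))]
                   for i in range(n))
    return fitness

def adaptive_walk(fitness, n=12, seed=1):
    rng = random.Random(seed)
    g, steps = tuple(rng.randint(0, 1) for _ in range(n)), 0
    while True:
        f = fitness(g)
        ups = [g[:i] + (1 - g[i],) + g[i + 1:]
               for i in range(n)
               if fitness(g[:i] + (1 - g[i],) + g[i + 1:]) > f]
        if not ups:
            return steps           # local peak reached
        g, steps = rng.choice(ups), steps + 1

print(adaptive_walk(make_nk()))    # number of uphill steps taken
```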

Journal policy change: MBE will consider preprints

Molecular biology and evolution (MBE) has updated its policy to allow the submission of papers previously submitted to the arXiv:

All manuscripts published in arXiv are considered unpublished works. Manuscripts that appear on arXiv may be submitted to MBE for consideration for publication.

It is unclear whether this policy extends to other preprint sites, but presumably it does. It is great to see this policy change, and well done to Melissa Wilson Sayres, Antonio Marco [@amarcobio], and others for encouraging MBE to effect this change.

However, one less encouraging feature of this change is that MBE has also implemented a policy under which preprints must be cited as unpublished data in the text rather than with a full citation in the reference section. It is as yet unclear how citation search engines, such as Google Scholar, will handle this form of reference. Will such mentions be registered and counted as citations, or will they go unnoticed?

One of the many appealing features of preprints is that they allow papers to begin to be acknowledged and cited earlier. It is unclear why MBE feels that this policy is necessary, but in our view it seems counter-productive. Hopefully, this is something that can be changed, given time and encouragement from MBE’s community.

Thoughts on MBE’s preprint citation policy

This guest post is by Graham Coop [@graham_coop] on the journal Molecular Biology and Evolution’s new preprint policy.

We had an interesting discussion via Twitter on the potential reasons for MBE’s policy of not allowing a full citation of preprint articles. I thought I’d write up some of my thoughts as shaped by that conversation.

Below I lay out some of the arguments we discussed and my thoughts on each. We do not know MBE’s reasoning on this, so I may have missed some obvious practical reason for the citation policy (if so, it would be great if it could be explained). I also note that other journals may well have similar policies about preprint citations, so this is not an argument specifically against MBE. It is great that MBE is now allowing preprints, and this is a somewhat minor quibble compared to that step.

One of my main reasons for disliking this policy, other than its singling out of preprints for special treatment, is that it may well disrupt how preprints accumulate citations (via tools like Google Scholar). In my view, one of the key advantages of preprints is that they allow the early recognition and acknowledgement of good ideas (with bad ones being allowed to sink out of view). This is particularly important for young researchers, as preprints can allow people on the job market to escape some of the randomness in how long the publication process takes. Allowing young scholars to have their work critiqued, and cited, early seems to me an important step in giving young researchers a head start in an increasingly difficult job market.

Potential arguments against treating preprint citations like any other citation:
1) Allowing full citation of preprints may cost the journal (or the authors) citations.

It is slightly hard to see the logic of (1). If I cite a preprint that has yet to appear in a journal, then by its very nature the journal could not possibly have benefited from that citation. I’m hardly going to delay my own submission or publication to wait for a paper to appear merely so I can cite it (unless I have some prior commitment to a colleague). The same argument seems to hold for the author: citations of the preprint are citations that you would not have received had you not distributed the article early. Now, a fair concern is that journals and authors may lose citations of the published article if, after the article appears, people accidentally cite the arXived paper instead of the final article. However, MBE’s system doesn’t avoid this problem, and it could be addressed simply by asking authors to do a PubMed search for each arXived paper they cite to avoid this oversight.

2) Another potential concern is that preprints are, by their nature, subject to change.

Preprints can be updated, so the information contained in them can change or even be removed. However, preprint sites like arXiv (as well as PeerJ and figshare) keep all previous versions of a paper, and these are clearly labeled and can be cited separately. So I can clearly indicate which version I am citing, and this citation is a permanent entry. While the information I cite may have changed in subsequent versions, this is really no different from the fact that subsequent publications can overturn existing results. What is different with versioning of preprints is that we get to see more of this process in the open, which feels like a good thing overall.

3) Authors should acknowledge that arXived preprints have not been through peer review.

At first sight there is more validity to this point, but I think it is also weak. As an author, as a reviewer, and indeed as a reader, you have a responsibility to question whether a citation really supports a particular point. As an author I invest a lot of time in trying to track down the right citations and to carefully read, and test, the papers I rely on heavily. As a reviewer I regularly question authors’ use of particular citations and point them toward additional work or ask them to change the wording around a citation. Published papers are not immune from problems any more than preprints are. If I, and the reviewers of my article, think it is appropriate for me to cite a preprint, then I should be allowed to do so as I would any other article.

This argument also seems somewhat strange: MBE already allows the normal citation of PhD theses and [potentially non-peer-reviewed] books (as pointed out by Antonio Marco). So it is really quite unclear why preprints have been singled out in this way.

All of my articles have benefited greatly from the comments of colleagues and from peer review. I also have a lot of respect for the work done by the editors of various journals, including MBE. However, it is unclear to me whom this policy serves. Journal policies should be applied with a light hand; ideally they should allow authors the freedom to fully acknowledge their sources. I see no strong argument for this policy other than that it prevents the further blurring of the line between journals and preprints. In my view the only sustainable way forward for journals and scientific societies is to be innovative focal points for collating peer review and peer recognition. Only by adapting quickly can journals hope to stay relevant in an age where, increasingly (to steal Mike Eisen’s phrase), publishing is pushing a button.

Graham Coop

Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR model

Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth-death SIR model
Denise Kühnert, Tanja Stadler, Timothy G. Vaughan, Alexei J. Drummond
(Submitted on 23 Aug 2013)

Evolution of RNA viruses such as HIV, hepatitis C virus, and influenza virus occurs so rapidly that the viruses’ genomes contain information on past ecological dynamics. The interaction of ecological and evolutionary processes demands their joint analysis. Here we adapt a birth-death-sampling model, which accommodates serially sampled data and rate changes over time, to estimate epidemiological parameters of the underlying population dynamics in terms of a compartmental susceptible-infected-removed (SIR) model. Our proposed approach results in a phylodynamic method that enables the joint estimation of epidemiological parameters and phylogenetic history. In contrast to standard coalescent approaches, this method provides separate information on the incidence and prevalence of infections. Detailed information on the interaction of host population dynamics and evolutionary history can inform decisions on how to contain or entirely avoid disease outbreaks.

We apply our Birth-Death SIR method (BDSIR) to five human immunodeficiency virus type 1 clusters sampled in the United Kingdom (UK) between 1999 and 2003. The estimated basic reproduction ratio ranges from 1.9 to 3.2 among the clusters. Our results imply that these local epidemics arose from the introduction of infected individuals into populations of between 900 and 3000 effectively susceptible individuals, albeit with wide margins of uncertainty. All clusters show a decline in the growth rate of the local epidemic in the mid- to late 1990s. The effective reproduction ratio of cluster 1 drops below one around 1994, with the local epidemic having almost run its course by the end of the sampling period. For the other four clusters the effective reproduction ratio also decreases over time but stays above 1. The method is implemented as a BEAST2 package.
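As a back-of-the-envelope illustration of the quantity being tracked (not the BDSIR model itself), a deterministic SIR recursion shows how the effective reproduction ratio R_e(t) = R0 * S(t)/S(0) declines as susceptibles are depleted; all parameters below are illustrative, not the paper's estimates.

```python
# Deterministic SIR sketch (not BDSIR): track the effective
# reproduction ratio R_e(t) = R0 * S(t) / S(0), which falls below 1
# as susceptibles are depleted. Parameters are illustrative only.
def sir_effective_r(R0=2.5, S0=2000, I0=1, gamma=0.1, days=400):
    beta = R0 * gamma / S0
    S, I = float(S0), float(I0)
    series = []
    for t in range(days):                 # one-day time steps
        new_inf = beta * S * I
        S -= new_inf
        I += new_inf - gamma * I
        series.append((t, R0 * S / S0))   # effective reproduction ratio
    return series

for t, re in sir_effective_r()[::100]:
    print(t, round(re, 2))
```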