Butter: High-precision genomic alignment of small RNA-seq data

Butter: High-precision genomic alignment of small RNA-seq data
Michael J Axtell

Eukaryotes produce large numbers of small non-coding RNAs that act as specificity determinants for various gene-regulatory complexes. These include microRNAs (miRNAs), endogenous short interfering RNAs (siRNAs), and Piwi-associated RNAs (piRNAs). These RNAs can be discovered, annotated, and quantified using small RNA-seq, a variant RNA-seq method based on highly parallel sequencing. Alignment to a reference genome is a critical step in analysis of small RNA-seq data. Because of their small size (20-30 nts depending on the organism and sub-type) and tendency to originate from multi-gene families or repetitive regions, reads that align equally well to more than one genomic location are very common. Typical methods to deal with multi-mapped small RNA-seq reads sacrifice either precision or sensitivity. The tool ‘butter’ balances precision and sensitivity by placing multi-mapped reads using an iterative approach, where the decision between possible locations is dictated by the local densities of more confidently aligned reads. Butter displays superior performance relative to other small RNA-seq aligners. Treatment of multi-mapped small RNA-seq reads has substantial impacts on downstream analyses, including quantification of MIRNA paralogs, and discovery of endogenous siRNA loci. Butter is freely available under a GNU general public license.

Concerning RNA-Guided Gene Drives for the Alteration of Wild Populations

Concerning RNA-Guided Gene Drives for the Alteration of Wild Populations
Kevin M Esvelt, Andrea L Smidler, Flaminia Catteruccia, George M Church

Gene drives may be capable of addressing ecological problems by altering entire populations of wild organisms, but their use has remained largely theoretical due to technical constraints. Here we consider the potential for RNA-guided gene drives based on the CRISPR nuclease Cas9 to serve as a general method for spreading altered traits through wild populations over many generations. We detail likely capabilities, discuss limitations, and provide novel precautionary strategies to control the spread of gene drives and reverse genomic changes. The ability to edit populations of sexual species would offer substantial benefits to humanity and the environment. For example, RNA-guided gene drives could potentially prevent the spread of disease, support agriculture by reversing pesticide and herbicide resistance in insects and weeds, and control damaging invasive species. However, the possibility of unwanted ecological effects and near-certainty of spread across political borders demand careful assessment of each potential application. We call for thoughtful, inclusive, and well-informed public discussions to explore the responsible use of this currently theoretical technology.

No evidence that sex and transposable elements drive genome size variation in evening primroses

No evidence that sex and transposable elements drive genome size variation in evening primroses
J Arvid Agren, Stephan Greiner, Marc TJ Johnson, Stephen I Wright

Genome size varies dramatically across species, but despite an abundance of attention there is little agreement on the relative contributions of selective and neutral processes in governing this variation. The rate of sexual reproduction can potentially play an important role in genome size evolution because of its effect on the efficacy of selection and transmission of transposable elements. Here, we used a phylogenetic comparative approach and whole genome sequencing to investigate the contribution of sex and transposable element content to genome size variation in the evening primrose (Oenothera) genus. We determined genome size using flow cytometry from 30 Oenothera species of varying reproductive system and find that variation in sexual/asexual reproduction cannot explain the almost two-fold variation in genome size. Moreover, using whole genome sequences of three species of varying genome sizes and reproductive system, we found that genome size was not associated with transposable element abundance; instead the larger genomes had a higher abundance of simple sequence repeats. Although it has long been clear that sexual reproduction may affect various aspects of genome evolution in general and transposable element evolution in particular, it does not appear to have played a major role in the evening primroses.

Interpretation and approximation tools for big, dense Markov chain transition matrices in ecology and evolution

Interpretation and approximation tools for big, dense Markov chain transition matrices in ecology and evolution
Katja Reichel, Valentin Bahier, C├ędric Midoux, Jean-Pierre Masson, Solenn Stoeckel
Comments: 8 pages, 4 figures, supplement: 2 figures, visual abstract, highlights, source code
Subjects: Quantitative Methods (q-bio.QM); Populations and Evolution (q-bio.PE)

Markov chains are a common framework for individual-based state and time discrete models in ecology and evolution. Their use, however, is largely limited to systems with a low number of states, since the transition matrices involved pose considerable challenges as their size and their density increase. Big, dense transition matrices may easily defy both the computer’s memory and the scientists’ ability to interpret them, due to the very high amount of information they contain; yet approximations using other types of models are not always the best solution.
We propose a set of methods to overcome the difficulties associated with big, dense Markov chain transition matrices. Using a population genetic model as an example, we demonstrate how big matrices can be transformed into clear and easily interpretable graphs with the help of network analysis. Moreover, we describe an algorithm to save computer memory by substituting the original matrix with a sparse approximate while preserving all its mathematically important properties. In the same model example, we manage to store about 90% less data while keeping more than 99% of the information contained in the matrix and a closely corresponding dominant eigenvector.
Our approach is an example how numerical limitations for the number of states in a Markov chain can be overcome. By facilitating the use of state-rich Markov chain models, they may become a valuable supplement to the diversity of models currently employed in biology.

Redefining Genomic Privacy: Trust and Empowerment

Redefining Genomic Privacy: Trust and Empowerment

Arvind Narayanan, Kenneth Yocum, David Glazer, Nita Farahany, Maynard Olson, Lincoln D. Stein, James B. Williams, Jan A. Witkowski, Robert C. Kain, Yaniv Erlich

Fulfilling the promise of the genetic revolution requires the analysis of large datasets containing information from thousands to millions of participants. However, sharing human genomic data requires protecting subjects from potential harm. Current models rely on de-identification techniques that treat privacy versus data utility as a zero-sum game. Instead we propose using trust-enabling techniques to create a solution where researchers and participants both win. To do so we introduce three principles that facilitate trust in genetic research and outline one possible framework built upon those principles. Our hope is that such trust-centric frameworks provide a sustainable solution that reconciles genetic privacy with data sharing and facilitates genetic research.

Author post: Inferring human population size and separation history from multiple genome sequences

This guest post is by Stephan Schiffels (@stschiff) on his paper with Richard Durbin Inferring human population size and separation history from multiple genome sequences biorxived here

In our paper, we study genome sequences to learn about human history and how human populations are related to each other. Remarkably, we only need a few individuals for this, because once we look sufficiently many generations into the past, every single genome contains fragments from a very large number of ancestors. This means that given only two genomes, say one individual from Africa and one individual from Europe, we typically find shared fragments from common ancestors (great great … great grandparents) from 2,000 or more generations ago. This trace of shared segments in our genomes can be detected and enables us to make inference about human history.

A few years ago, Heng Li and Richard Durbin introduced the PSMC method which is based on estimating this shared common ancestry in a single diploid genome to infer population sizes. We now introduced a major extension to this approach, called MSMC (Multiple Sequentially Markovian Coalescent), which is able to find and date traces of shared ancestry across multiple genome sequences. This is generally a hard problem because of the complex way of how sequences relate with each other through recombination and mutation (see an excellent blog post by Adam Siepel). In our method, we therefore made a choice to focus only on the pair of segments which coalesce first, i.e. share the most recent common ancestor of all pairs. Because of ancestral recombinations, this changes along the sequences.

Consider again the example of an African and a European individual, each of them carrying two copies of a chromosome. In one part of their genomes, the most recent ancestor of any two chromosomes may be shared between the two European chromosomes, in other parts it may be shared between the two African chromosomes, and in some cases it may actually be found across a European and an African chromosome. The relative frequency of how often we observe each of the three cases, and the distribution of times to the most recent common ancestor, give information about when the separation happened, and how long it took for the ancestral people to part fully from each other. In the case of West-Africans and Europeans, we found that the two populations started to separate from each other (at least genetically) long before the known out-of-Africa emigration 50,000 years ago. And we see the same thing if we compare West-Africans to Asians or Americans instead of Europeans. We can also see clearly how ancestors of Native Americans separated from Asians around 20,000 years ago, consistently preceding the known first arrival of people in the New World around 15,000 years ago.

Our method can also estimate effective population size changes through time. One consequence of our approach to look only for the first common ancestor is that we can now look into the much more recent past than was previously possible with similar methods, such as PSMC. For example, we can now see a deep bottleneck in Native American ancestors around 15,000 years ago which fits with the separation and immigration history described above, and we can see recent expansions that are consistent with the spread of agriculture in Africa.

We believe that MSMC is a useful tool for estimating population history from whole genome sequences. But more ideas and development are still needed in the future to expand this approach to more genomes and to look into the past even more recently than 2,000 years ago, which is our current limit with MSMC. Closely related approaches are currently developed by Yun Song, Thomas Mailund and others, which will complement MSMC. This is a great time to work in this field, given that many more high quality individual genome sequences are being generated, and in many cases from populations that we have not covered at all in our paper. All of this will help to greatly expand our knowledge of human population history.

Reconstructing Austronesian population history in Island Southeast Asia

Reconstructing Austronesian population history in Island Southeast Asia
Mark Lipson, Po-Ru Loh, Nick Patterson, Priya Moorjani, Ying-Chin Ko, Mark Stoneking, Bonnie Berger, David Reich

Austronesian languages are spread across half the globe, from Easter Island to Madagascar. Evidence from linguistics and archaeology indicates that the “Austronesian expansion,” which began 4-5 thousand years ago, likely had roots in Taiwan, but the ancestry of present-day Austronesian-speaking populations remains controversial. Here, focusing primarily on Island Southeast Asia, we analyze genome-wide data from 56 populations using new methods for tracing ancestral gene flow. We show that all sampled Austronesian groups harbor ancestry that is more closely related to aboriginal Taiwanese than to any present-day mainland population. Surprisingly, western Island Southeast Asian populations have also inherited ancestry from a source nested within the variation of present-day populations speaking Austro-Asiatic languages, which have historically been nearly exclusive to the mainland. Thus, either there was once a substantial Austro-Asiatic presence in Island Southeast Asia, or Austronesian speakers migrated to and through the mainland, admixing there before continuing to western Indonesia.