Simultaneous estimation of transcript abundances and transcript specific fragment distributions of RNA-Seq data with the Mix2 model

Andreas Tuerk, Gregor Wiktorin

Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragmentation bias, which is not represented appropriately by current statistical models of RNA-Seq data. Another, less investigated, source of error is the inaccuracy of transcript start and end annotations. This article introduces the Mix2 (read "mixquare") model, which uses a mixture of probability distributions to model the transcript-specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the EM algorithm and are tied between similar transcripts. Transcript-specific shift and scale parameters allow the Mix2 model to automatically correct inaccurate transcript start and end annotations. Experiments are conducted on synthetic data covering 7 genes of different complexity, 4 types of fragment bias, and correct as well as incorrect transcript start and end annotations. Abundance estimates obtained by Cufflinks 2.2.0, PennSeq and the Mix2 model show superior performance of the Mix2 model in the vast majority of test conditions.
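
To make the idea of training a positional mixture with EM concrete, here is a minimal sketch. It is not the Mix2 implementation: it assumes fragment positions are rescaled to [0, 1] along a single transcript, that the mixture components are Gaussians with fixed, hypothetical means and a shared standard deviation, and that only the mixture weights are learned. The function names and all parameter values are illustrative.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_mixture_weights(positions, means, sd, n_iter=50):
    """Estimate mixture weights for fixed Gaussian components with EM.

    positions: relative fragment positions in [0, 1] (assumed input format).
    means, sd: fixed component locations and shared width (assumptions).
    """
    k = len(means)
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        expected_counts = [0.0] * k
        for x in positions:
            # E-step: posterior responsibility of each component for x.
            joint = [w * gaussian_pdf(x, m, sd) for w, m in zip(weights, means)]
            total = sum(joint)
            for j in range(k):
                expected_counts[j] += joint[j] / total
        # M-step: new weights are the normalised expected counts.
        weights = [c / len(positions) for c in expected_counts]
    return weights


if __name__ == "__main__":
    # Toy data with most fragments near the 3' end of the transcript.
    toy_positions = [0.80, 0.85, 0.90, 0.95, 0.90, 0.10]
    print(em_mixture_weights(toy_positions, means=[0.1, 0.5, 0.9], sd=0.15))
```

In this toy setting the weight of the component near 0.9 grows to reflect the 3' bias in the data. According to the abstract, the full model additionally estimates transcript abundances, ties parameters between similar transcripts, and fits transcript-specific shift and scale parameters, none of which is attempted here.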

Complete plastid genome assembly of invasive plant, Centaurea diffusa

Kathryn G Turner, Christopher J Grassa

Invasive plants present both problems and possibilities for discovery, which may be addressed utilizing new genomic tools. Here we present the completed plastome assembly for the problematic invasive weed, Centaurea diffusa. This new tool represents a significant contribution to future studies of the ecological genomics of invasive plants, particularly this weedy genus, and studies of the Asteraceae in general.

Power analysis of artificial selection experiments using efficient whole genome simulation of quantitative traits

Darren Kessner, John Novembre

Evolve and resequence studies combine artificial selection experiments with massively parallel sequencing technology to study the genetic basis for complex traits. In these experiments, individuals are selected for extreme values of a trait, causing alleles at quantitative trait loci (QTLs) to increase or decrease in frequency in the experimental population. We present a new analysis of the power of artificial selection experiments to detect and localize quantitative trait loci. This analysis uses a simulation framework that explicitly models whole genomes of individuals, quantitative traits, and selection based on individual trait values. We find that explicitly modeling QTLs produces qualitatively different insights than considering independent loci with constant selection coefficients. Specifically, we observe how interference between QTLs under selection impacts the trajectories and lengthens the fixation times of selected alleles. We also show that a substantial portion of the genetic variance of the trait (50–100%) can be explained by detected QTLs in as little as 20 generations of selection, depending on the trait architecture and experimental design. Furthermore, we show that power depends crucially on the opportunity for recombination during the experiment. Finally, we show that an increase in power is obtained by leveraging founder haplotype information to obtain allele frequency estimates.
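
The kind of experiment described above can be illustrated with a toy forward simulation. This is a sketch under strong simplifying assumptions, not the authors' framework: individuals are diploid, the "genome" consists only of a handful of linked QTLs with additive effects, adjacent loci recombine with a fixed probability `r`, and the top fraction of individuals by trait value is kept as parents each generation (truncation selection). All names and default values are hypothetical.

```python
import random


def allele_freqs(pop):
    """Frequency of the '1' allele at each locus across a diploid population."""
    n_loci = len(pop[0][0])
    return [
        sum(ind[0][i] + ind[1][i] for ind in pop) / (2 * len(pop))
        for i in range(n_loci)
    ]


def make_gamete(parent, r, rng):
    """Build one gamete, switching haplotypes with probability r between loci."""
    hap = rng.randrange(2)
    gamete = []
    for locus in range(len(parent[0])):
        if locus > 0 and rng.random() < r:
            hap = 1 - hap
        gamete.append(parent[hap][locus])
    return gamete


def trait_value(ind, effects, env_sd, rng):
    """Additive genetic value plus Gaussian environmental noise."""
    genetic = sum(e * (ind[0][i] + ind[1][i]) for i, e in enumerate(effects))
    return genetic + rng.gauss(0.0, env_sd)


def simulate_selection(n_ind=100, effects=(1.0, 0.5, 0.5, 0.25), r=0.1,
                       top_fraction=0.2, generations=20, env_sd=1.0, seed=1):
    """Return per-generation allele frequencies under truncation selection."""
    rng = random.Random(seed)
    n_loci = len(effects)
    # Founders: each allele is drawn independently with frequency 0.5.
    pop = [
        ([rng.randrange(2) for _ in range(n_loci)],
         [rng.randrange(2) for _ in range(n_loci)])
        for _ in range(n_ind)
    ]
    trajectory = [allele_freqs(pop)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda ind: trait_value(ind, effects, env_sd, rng),
                        reverse=True)
        parents = ranked[:max(2, int(top_fraction * n_ind))]
        # Random mating among selected parents (selfing is not excluded here).
        pop = [
            (make_gamete(rng.choice(parents), r, rng),
             make_gamete(rng.choice(parents), r, rng))
            for _ in range(n_ind)
        ]
        trajectory.append(allele_freqs(pop))
    return trajectory


if __name__ == "__main__":
    for generation, freqs in enumerate(simulate_selection(generations=5)):
        print(generation, [round(f, 2) for f in freqs])
```

Varying `r` in a sketch like this is one simple way to explore how the opportunity for recombination affects the trajectories of linked QTL alleles. A power analysis would additionally need neutral loci, replicate populations, and a detection statistic, which are omitted here.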

Genomic, transcriptomic and phenomic variation reveals the complex adaptation of modern maize breeding

Haijun Liu, Xiaqing Wang, Marilyn Warburton, Weiwei Wen, Minliang Jin, Min Deng, Jie Liu, Hao Tong, Qingchun Pan, Xiaohong Yang, Jianbing Yan

The temperate-tropical division of early maize germplasm to different agricultural environments was arguably the greatest adaptation process associated with the success and near ubiquitous importance of global maize production. Deciphering this history is challenging, but new insight has been gained from the genomic, transcriptomic and phenotypic variation collected from 368 diverse temperate and tropical maize inbred lines in this study. This is the first attempt to systematically explore the mechanisms of the adaptation process. Our results indicate that divergence between tropical and temperate lines seems to have occurred 3,400-6,700 years ago. A number of genomic selection signals and transcriptomic variants, including differentially expressed individual genes and rewired co-expression networks of genes, were identified. These candidate signals were found to be functionally related to stress response, and most were associated with directionally selected traits, which may have been an advantage under the widely varying environmental conditions faced by maize as it migrated away from its domestication center. It is also clear in our study that such stress adaptation could involve evolution of protein-coding sequences as well as transcriptome-level regulatory changes. This latter process may be a more flexible and dynamic way for maize to adapt to environmental changes over this dramatically short evolutionary time frame.

Natural variation in teosinte at the domestication locus teosinte branched1 (tb1)

Laura Vann, Thomas Kono, Tanja Pyhäjärvi, Matthew B Hufford, Jeffrey Ross-Ibarra

Premise of the study: The teosinte branched1 (tb1) gene is a major QTL controlling branching differences between maize and its wild progenitor, teosinte. The insertion of a transposable element (Hopscotch) upstream of tb1 is known to enhance the gene’s expression, causing reduced tillering in maize. Observations of the maize tb1 allele in teosinte and estimates of an insertion age of the Hopscotch that predates domestication led us to investigate its prevalence and potential role in teosinte. Methods: Prevalence of the Hopscotch element was assessed across an Americas-wide sample of 1110 maize and teosinte individuals using a co-dominant PCR assay. Population genetic summaries were calculated for a subset of individuals from four teosinte populations in central Mexico. Phenotypic data were also collected from a single teosinte population where Hopscotch was found segregating. Key results: Genotyping results suggest the Hopscotch element is at higher than expected frequency in teosinte. Analysis of linkage disequilibrium near tb1 does not support recent introgression of the Hopscotch allele from maize into teosinte. Population genetic signatures are consistent with selection on this locus revealing a potential ecological role for Hopscotch in teosinte. Finally, two greenhouse experiments with teosinte do not suggest tb1 controls tillering in natural populations. Conclusions: Our findings suggest the role of Hopscotch differs between maize and teosinte. Future work should assess tb1 expression levels in teosinte with and without the Hopscotch and more comprehensively phenotype teosinte to assess the ecological significance of the Hopscotch insertion and, more broadly, the tb1 locus in teosinte. Key words: domestication; maize; teosinte; teosinte branched1; transposable element

How the tortoise beats the hare: Slow and steady adaptation in structured populations suggests a rugged fitness landscape in bacteria

Joshua R. Nahum, Peter Godfrey-Smith, Brittany N. Harding, Joseph H. Marcus, Jared Carlson-Stevermer, Benjamin Kerr

In the context of Wright’s adaptive landscape, genetic epistasis can yield a multi-peaked or “rugged” topography. In an unstructured population, a lineage with selective access to multiple peaks is expected to rapidly fix on one, which may not be the highest peak. Conversely, beneficial mutations in a population with spatially restricted migration take longer to fix, allowing distant parts of the population to explore the landscape semi-independently. Such a population can simultaneously discover multiple peaks, and the genotype at the highest discovered peak is expected to fix eventually. Thus, structured populations sacrifice initial speed of adaptation for breadth of search. As in the Tortoise-Hare fable, the structured population (Tortoise) starts relatively slow, but eventually surpasses the unstructured population (Hare) in average fitness. In contrast, on single-peak landscapes (e.g., systems lacking epistasis), all uphill paths converge. Given such “smooth” topography, breadth of search is devalued, and a structured population only lags behind an unstructured population in average fitness (ultimately converging). Thus, the Tortoise-Hare pattern is an indicator of ruggedness. After verifying these predictions in simulated populations where ruggedness is manipulable, we then explore average fitness in metapopulations of Escherichia coli. Consistent with a rugged landscape topography, we find a Tortoise-Hare pattern. Further, we find that structured populations accumulate more mutations, suggesting that distant peaks are higher. This approach can be used to unveil landscape topography in other systems, and we discuss its application for antibiotic resistance, engineering problems, and elements of Wright’s Shifting Balance Process.

Predicting evolution from the shape of genealogical trees

Richard A. Neher, Colin A. Russell, Boris I. Shraiman
(Submitted on 3 Jun 2014)

Given a sample of genome sequences from an asexual population, can one predict its evolutionary future? Here we demonstrate that the branching pattern of reconstructed genealogical trees contains information about the relative fitness of the sampled sequences and that this information can be used to infer the closest extant relative of future populations. Our approach is based on the assumption that evolution proceeds predominantly by accumulation of small effect mutations and does not require any species-specific input. Hence, the resulting inference algorithm can be applied to any asexual population under persistent selection pressure. We demonstrate its performance using historical data on seasonal influenza A/H3N2 virus. We predict the progenitor lineage of the upcoming influenza season with near optimal performance in 30% of cases and make informative predictions in 16 out of 18 years. Beyond providing a practical tool for prediction, our results suggest that continuous adaptation by small effect mutations is a major component of influenza virus evolution.

Author post: Inferring human population size and separation history from multiple genome sequences

This guest post is by Stephan Schiffels (@stschiff) on his paper with Richard Durbin, "Inferring human population size and separation history from multiple genome sequences", available on bioRxiv.

In our paper, we study genome sequences to learn about human history and how human populations are related to each other. Remarkably, we only need a few individuals for this, because once we look sufficiently many generations into the past, every single genome contains fragments from a very large number of ancestors. This means that given only two genomes, say one individual from Africa and one individual from Europe, we typically find shared fragments from common ancestors (great great … great grandparents) from 2,000 or more generations ago. This trace of shared segments in our genomes can be detected and enables us to make inference about human history.

A few years ago, Heng Li and Richard Durbin introduced the PSMC method, which is based on estimating this shared common ancestry in a single diploid genome to infer population sizes. We have now introduced a major extension to this approach, called MSMC (Multiple Sequentially Markovian Coalescent), which is able to find and date traces of shared ancestry across multiple genome sequences. This is generally a hard problem because of the complex way in which sequences relate to each other through recombination and mutation (see an excellent blog post by Adam Siepel). In our method, we therefore made a choice to focus only on the pair of segments which coalesce first, i.e. share the most recent common ancestor of all pairs. Because of ancestral recombinations, this pair changes along the sequences.

Consider again the example of an African and a European individual, each of them carrying two copies of a chromosome. In one part of their genomes, the most recent ancestor of any two chromosomes may be shared between the two European chromosomes, in other parts it may be shared between the two African chromosomes, and in some cases it may actually be found across a European and an African chromosome. The relative frequency of how often we observe each of the three cases, and the distribution of times to the most recent common ancestor, give information about when the separation happened, and how long it took for the ancestral people to part fully from each other. In the case of West-Africans and Europeans, we found that the two populations started to separate from each other (at least genetically) long before the known out-of-Africa emigration 50,000 years ago. And we see the same thing if we compare West-Africans to Asians or Americans instead of Europeans. We can also see clearly how ancestors of Native Americans separated from Asians around 20,000 years ago, consistently preceding the known first arrival of people in the New World around 15,000 years ago.

Our method can also estimate effective population size changes through time. One consequence of looking only for the first common ancestor is that we can now look into a much more recent past than was previously possible with similar methods, such as PSMC. For example, we can now see a deep bottleneck in Native American ancestors around 15,000 years ago, which fits with the separation and immigration history described above, and we can see recent expansions that are consistent with the spread of agriculture in Africa.

We believe that MSMC is a useful tool for estimating population history from whole genome sequences. But more ideas and development are still needed to expand this approach to more genomes and to look into times more recent than 2,000 years ago, which is our current limit with MSMC. Closely related approaches are currently being developed by Yun Song, Thomas Mailund and others, which will complement MSMC. This is a great time to work in this field, given that many more high quality individual genome sequences are being generated, in many cases from populations that we have not covered at all in our paper. All of this will help to greatly expand our knowledge of human population history.

Epidemic reconstruction in a phylogenetics framework: transmission trees as partitions

Matthew Hall, Andrew Rambaut
(Submitted on 2 Jun 2014)

The reconstruction of transmission trees for epidemics from genetic data has been the subject of some recent interest. It has been demonstrated that the transmission tree structure can be investigated by augmenting internal nodes of a phylogenetic tree constructed using pathogen sequences from the epidemic with information about the host that held the corresponding lineage. In this paper, we note that this augmentation is equivalent to a correspondence between transmission trees and partitions of the phylogenetic tree into connected subtrees each containing one tip, and provide a framework for Markov Chain Monte Carlo inference of phylogenies that are partitioned in this way, giving a new method to co-estimate both trees. The procedure is integrated in the existing phylogenetic inference package BEAST.
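
The correspondence between host-annotated phylogenies and partitions can be sketched in a few lines. The example below is only an illustration and has nothing to do with the BEAST implementation: it assumes a rooted tree stored as a hypothetical `parent` dictionary, one sampled tip per host, and a `host` dictionary labelling every node. It checks that each host's nodes form a connected subtree containing exactly one tip, and then reads off who infected whom from the edges that cross between blocks.

```python
def transmission_tree(parent, host):
    """Derive infector-of-host from a host-labelled rooted phylogeny.

    parent: maps each node to its parent (None for the root).
    host:   maps each node to the host assumed to hold that lineage.
    Raises ValueError if the labelling is not a partition into connected
    subtrees that each contain exactly one tip.
    """
    # Tips are the nodes that never appear as a parent.
    internal = {p for p in parent.values() if p is not None}
    tips = {n for n in parent if n not in internal}

    # Group nodes into blocks by host label.
    blocks = {}
    for node, h in host.items():
        blocks.setdefault(h, set()).add(node)

    infector = {}
    for h, block in blocks.items():
        block_tips = [n for n in block if n in tips]
        # In a rooted tree, a node set is connected exactly when it has a
        # single "top" node, i.e. one whose parent lies outside the set.
        tops = [n for n in block if parent[n] is None or parent[n] not in block]
        if len(block_tips) != 1 or len(tops) != 1:
            raise ValueError(f"nodes labelled {h!r} do not form a valid block")
        above = parent[tops[0]]
        # The host holding the lineage just above the block is the infector.
        infector[h] = None if above is None else host[above]
    return infector


if __name__ == "__main__":
    # Toy phylogeny: root R has children X and tip C; X has tips A and B.
    toy_parent = {"R": None, "X": "R", "C": "R", "A": "X", "B": "X"}
    toy_host = {"R": "a", "X": "a", "A": "a", "B": "b", "C": "c"}
    print(transmission_tree(toy_parent, toy_host))
```

In the toy example, host `a` holds the root lineage and is therefore the index case, while `b` and `c` are both infected by `a`. An MCMC scheme such as the one the abstract describes would propose changes to both the tree and the partition, which this sketch does not attempt.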

Target enrichment of ultraconserved elements from arthropods provides a genomic perspective on relationships among Hymenoptera

Brant C. Faircloth, Michael G. Branstetter, Noor D. White, Seán G. Brady
(Submitted on 2 Jun 2014)

Gaining a genomic perspective on phylogeny requires the collection of data from many putatively independent loci collected across the genome. Among insects, an increasingly common approach to collecting this class of data involves transcriptome sequencing, because few insects have high-quality genome sequences available; assembling new genomes remains a limiting factor; the transcribed portion of the genome is a reasonable, reduced subset of the genome to target; and the data collected from transcribed portions of the genome are similar in composition to the types of data with which biologists have traditionally worked (e.g., exons). However, molecular techniques requiring RNA as a template are limited to using very high quality source materials, which are often unavailable from a large proportion of biologically important insect samples. Recent research suggests that DNA-based target enrichment of conserved genomic elements offers another path to collecting phylogenomic data across insect taxa, provided that conserved elements are present in and can be collected from insect genomes. Here, we identify a large set (n=1510) of ultraconserved elements (UCEs) shared among the insect order Hymenoptera. We use in silico analyses to show that these loci accurately reconstruct relationships among genome-enabled Hymenoptera, and we design a set of baits for enriching these loci that researchers can use with DNA templates extracted from a variety of sources. We use our UCE bait set to enrich an average of 721 UCE loci from 30 hymenopteran taxa, and we use these UCE loci to reconstruct phylogenetic relationships spanning very old (≥220 MYA) to very young (≤1 MYA) divergences among hymenopteran lineages. In contrast to a recent study addressing hymenopteran phylogeny using transcriptome data, we found ants to be sister to all remaining aculeate lineages with complete support.