Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing

Saulo A. Aflitos, Elio Schijlen, Richard Finkers, Sandra Smit, Jun Wang, Gengyun Zhang, Ning Li, Likai Mao, Hans de Jong, Freek Bakker, Barbara Gravendeel, Timo Breit, Rob Dirks, Henk Huits, Darush Struss, Ruth Wagner, Hans van Leeuwen, Roeland van Ham, Laia Fito, Laëtitia Guigner, Myrna Sevilla, Philippe Ellul, Eric W. Ganko, Arvind Kapur, Emmanuel Reclus, Bernard de Geus, Henri van de Geest, Bas te Lintel Hekkert, Jan C. Van Haarst, Lars Smits, Andries Koops, Gabino Sanchez Perez, Dick de Ridder, Sjaak van Heusden, Richard Visser, Zhiwu Quan, Jiumeng Min, Li Liao, Xiaoli Wang, Guangbiao Wang, Zhen Yue, Xinhua Yang, Na Xu, Eric Schranz, Eric F. Smets, Rutger A. Vos, Han Rauwerda, Remco Ursem, Cees Schuit, Mike Kerns, Jan van den Berg, Wim H. Vriezen, Antoine Janssen, Torben Jahrman, Frederic Moquet, Julien Bonnet, Sander A. Peters
(Submitted on 21 Apr 2015)

Genetic variation in the tomato clade was explored by sequencing a selection of 84 tomato accessions and related wild species representative of the Lycopersicon, Arcanum, Eriopersicon, and Neolycopersicon groups. We present a reconstruction of three new reference genomes in support of our comparative genome analyses. Sequence diversity in commercial breeding lines appears extremely low, indicating the dramatic genetic erosion of crop tomatoes. This is reflected by the SNP count in wild species, which can exceed 10 million, i.e. 20-fold higher than in crop accessions. Comparative sequence alignment reveals group-, species-, and accession-specific polymorphisms, which explain characteristic fruit traits and growth habits in tomato accessions. Using gene models from the annotated Heinz reference genome, we observe a bias in dN/dS ratio in fruit and growth diversification genes compared to a random set of genes, which is probably the result of positive selection. We detected highly divergent segments in wild S. lycopersicum species, and footprints of introgressions in crop accessions originating from a common donor accession. Phylogenetic relationships of fruit diversification and growth-specific genes from crop accessions show incomplete resolution and are dependent on the introgression donor. In contrast, whole-genome SNP information has sufficient power to resolve the phylogenetic placement of each accession in the four main groups in the Lycopersicon clade using Maximum Likelihood analyses. Phylogenetic relationships appear correlated with habitat and mating type and point to the occurrence of geographical races within these groups, and thus are of practical importance for introgressive hybridization breeding. Our study illustrates the need for multiple reference genomes in support of tomato comparative genomics and Solanum genome evolution studies.

Introgression Browser: High throughput whole-genome SNP visualization

Saulo Alves Aflitos, Gabino Sanchez-Perez, Dick de Ridder, Paul Fransz, Eric Schranz, Hans de Jong, Sander Peters
(Submitted on 21 Apr 2015)

Breeding by introgressive hybridization is a pivotal strategy to broaden the genetic basis of crops. Usually, the desired traits are monitored in consecutive crossing generations by marker-assisted selection, but such analyses fail in chromosome regions where crossover recombinants are rare or not viable. Here, we present the Introgression Browser (IBROWSER), a novel bioinformatics tool aimed at visualizing introgressions at nucleotide or SNP accuracy. The software selects homozygous SNPs from Variant Call Format (VCF) information and filters out heterozygous SNPs, Multi-Nucleotide Polymorphisms (MNPs) and insertion-deletions (InDels). For data analysis IBROWSER makes use of sliding windows, but if needed it can generate any desired fragmentation pattern through General Feature Format (GFF) information. In an example of tomato (Solanum lycopersicum) accessions we visualize SNP patterns and elucidate both position and boundaries of the introgressions. We also show that our tool is capable of identifying alien DNA in a panel of the closely related S. pimpinellifolium by examining phylogenetic relationships of the introgressed segments in tomato. In a third example, we demonstrate the power of the IBROWSER in a panel of 600 Arabidopsis accessions, detecting the boundaries of a SNP-free region around a polymorphic 1.17 Mbp inverted segment on the short arm of chromosome 4. The architecture and functionality of IBROWSER make the software appropriate for a broad set of analyses including SNP mining, genome structure analysis, and pedigree analysis. Its functionality, together with the capability to process large data sets and efficient visualization of sequence variation, makes IBROWSER a valuable breeding tool.
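
The filtering step described in this abstract (keep homozygous SNPs, drop heterozygous calls, MNPs and InDels, then summarize per window) can be illustrated with a minimal Python sketch. This is not IBROWSER's code: the function names `is_homozygous_snp` and `snps_per_window` are hypothetical, the input is assumed to be a single-sample VCF with the genotype (GT) as the first entry of the sample column, and fixed non-overlapping windows stand in for the tool's sliding windows.

```python
from collections import Counter


def is_homozygous_snp(vcf_line):
    """Return True if a single-sample VCF data line is a homozygous SNP.

    Assumption: tab-separated VCF with exactly one sample in column 10
    and GT as the first colon-separated entry of that column.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]
    # Single-base REF and ALT only: this drops MNPs, InDels,
    # multi-allelic sites ("A,C") and missing ALT (".").
    if len(ref) != 1 or len(alt) != 1:
        return False
    if ref not in "ACGT" or alt not in "ACGT":
        return False
    # Treat phased ("|") and unphased ("/") genotypes alike.
    alleles = fields[9].split(":")[0].replace("|", "/").split("/")
    # Homozygous for a non-reference allele: both alleles equal,
    # neither missing (".") nor reference ("0").
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in (".", "0")
    )


def snps_per_window(vcf_lines, window_size=50_000):
    """Count homozygous SNPs per (chromosome, window index).

    Windows are fixed and non-overlapping here, a simplification
    chosen for illustration.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta and header lines
            continue
        if is_homozygous_snp(line):
            chrom, pos = line.split("\t")[:2]
            counts[(chrom, (int(pos) - 1) // window_size)] += 1
    return counts
```

Calling `snps_per_window(open("sample.vcf"))` on an uncompressed single-sample file (the file name is a placeholder) returns a `Counter` keyed by chromosome and window index; a window with a count that departs sharply from its neighbours is the kind of pattern the abstract describes visualizing.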

The generalised quasispecies

Raphaël Cerf, Joseba Dalmau
(Submitted on 22 Apr 2015)

We study Eigen’s quasispecies model in the asymptotic regime where the length of the genotypes goes to infinity and the mutation probability goes to 0. We give several explicit formulas for the stationary solutions of the limiting system of differential equations.

Genetic Basis of Transcriptome Diversity in Drosophila melanogaster

Wen Huang, Mary Anna Carbone, Michael Magwire, Jason Peiffer, Richard Lyman, Eric Stone, Robert Anholt, Trudy Mackay
doi: http://dx.doi.org/10.1101/018325

Understanding how DNA sequence variation is translated into variation for complex phenotypes has remained elusive, but is essential for predicting adaptive evolution, selecting agriculturally important animals and crops, and personalized medicine. Here, we quantified genome-wide variation in gene expression in the sequenced inbred lines of the Drosophila melanogaster Genetic Reference Panel (DGRP). We found that a substantial fraction of the Drosophila transcriptome is genetically variable and organized into modules of genetically correlated transcripts, which provide functional context for newly identified transcribed regions. We identified regulatory variants for the mean and variance of gene expression, the latter of which could often be explained by an epistatic model. Expression quantitative trait loci for the mean, but not the variance, of gene expression were concentrated near genes. This comprehensive characterization of population scale diversity of transcriptomes and its genetic basis in the DGRP is critically important for a systems understanding of quantitative trait variation.

Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 Genomes data

Nicolas Duforet-Frebourg, Guillaume Laval, Eric Bazin, Michael G.B. Blum
(Submitted on 8 Apr 2015)

Large-scale genomic data offer the prospect of deciphering the genetic architecture of natural selection. To characterize natural selection, various analytical methods for detecting candidate genomic regions have been developed. We propose to perform genome-wide scans of natural selection using principal component analysis. We show that the common Fst index of genetic differentiation between populations can be viewed as a proportion of variance explained by the principal components. Looking at the correlations between genetic variants and each principal component provides a conceptual framework to detect genetic variants involved in local adaptation without any prior definition of populations. To validate the PCA-based approach, we consider the 1000 Genomes data (phase 1) after removal of recently admixed individuals, resulting in 850 individuals coming from Africa, Asia, and Europe. The number of genetic variants is of the order of 36 million, obtained with a low-coverage sequencing depth (3X). The correlations between genetic variation and each principal component provide well-known targets for positive selection (EDAR, SLC24A5, SLC45A2, DARC), and also new candidate genes (APPBPP2, TP1A1, RTTN, KCNMA, MYO5C) and non-coding RNAs. In addition to identifying genes involved in biological adaptation, we identify two biological pathways involved in polygenic adaptation that are related to the innate immune system (beta defensins) and to lipid metabolism (fatty acid omega oxidation). PCA-based statistics retrieve well-known signals of human adaptation, which is encouraging for future whole-genome sequencing projects, especially in non-model species for which defining populations can be difficult. The PCA-based genome scan is implemented in the open-source and freely available PCAdapt software.
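
The core idea of correlating each variant with each principal component can be sketched in a few lines of NumPy. This is only an illustration, not the PCAdapt implementation: the function name `snp_pc_correlations` is hypothetical, genotypes are assumed to be coded as 0/1/2 allele counts with no missing data, and the squared correlation is simply the share of a variant's variance explained by one component.

```python
import numpy as np


def snp_pc_correlations(genotypes, n_components=2):
    """Pearson correlation between every SNP and the top principal components.

    genotypes: (individuals x SNPs) array of 0/1/2 allele counts.
    Returns an (SNPs x n_components) array; monomorphic SNPs get NaN.
    Assumption: n_components does not exceed the rank of the data.
    """
    G = np.asarray(genotypes, dtype=float)
    n = G.shape[0]
    centered = G - G.mean(axis=0)
    sd = centered.std(axis=0)
    keep = sd > 0  # monomorphic SNPs carry no signal
    Z = centered[:, keep] / sd[keep]  # each column: mean 0, variance 1

    # PCA via SVD: columns of U are the unit-norm PC scores.
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components]

    # Each Z column has norm sqrt(n) and each score column has norm 1
    # and mean 0, so the dot product divided by sqrt(n) is the correlation.
    corr = np.full((G.shape[1], n_components), np.nan)
    corr[keep] = Z.T @ scores / np.sqrt(n)
    return corr
```

Variants whose squared correlation with a component is unusually large compared with the rest of the genome would be the candidates for local adaptation along that axis of population structure; how "unusually large" is calibrated is left out of this sketch.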

Low levels of transposable element activity in Drosophila mauritiana: causes and consequences

Robert Kofler, Christian Schlötterer
doi: http://dx.doi.org/10.1101/018218

Transposable elements (TEs) are major drivers of genomic and phenotypic evolution, yet many questions about their biology remain poorly understood. Here, we compare TE abundance between populations of the two sister species D. mauritiana and D. simulans and relate it to the more distantly related D. melanogaster. The low population frequency of most TE insertions in D. melanogaster and D. simulans has been a key feature of several models of TE evolution. In D. mauritiana, however, the majority of TE insertions are fixed (66%). We attribute this to a lower transposition activity of up to 47 TE families in D. mauritiana, rather than stronger purifying selection. Only three families, including the extensively studied Mariner, may have a higher activity in D. mauritiana. This remarkable difference in TE activity between two recently diverged Drosophila species (≈ 250,000 years) also supports the hypothesis that TE copy numbers in Drosophila may not reflect a stable equilibrium where the rate of TE gains equals the rate of TE losses by negative selection. We propose that the transposition rate heterogeneity results from the contrasting ecology of the two species: the extent of vertical extinction of TE families and horizontal acquisition of active TE copies may be very different between the colonizing D. simulans and the island endemic D. mauritiana. Our findings provide novel insights into the evolution of TEs in Drosophila and suggest that the ecology of the host species could be a major, yet underappreciated, factor governing the evolutionary dynamics of TEs.

When the mean is not enough: Calculating fixation time distributions in birth-death processes

Peter Ashcroft, Arne Traulsen, Tobias Galla
(Submitted on 16 Apr 2015)

Studies of fixation dynamics in Markov processes predominantly focus on the mean time to absorption. This may be inadequate if the distribution is broad and skewed. We compute the distribution of fixation times in one-step birth-death processes with two absorbing states. These are expressed in terms of the spectrum of the process, and we provide different representations as forward-only processes in eigenspace. These allow efficient sampling of fixation time distributions. As an application we study evolutionary game dynamics, where invading mutants can reach fixation or go extinct. We also highlight the median fixation time as a possible analog of mixing times in systems with small mutation rates and no absorbing states, whereas the mean fixation time has no such interpretation.
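
To make the difference between the mean and the full distribution concrete, here is a small sketch that computes the fixation time distribution of a one-step birth-death chain by direct iteration of the probability vector. It is a brute-force alternative, not the spectral representation the abstract describes. The discrete-time Moran process with relative fitness `r` is used as an assumed example, and the names `moran_rates`, `fixation_time_distribution` and `median_fixation_time` are hypothetical.

```python
import numpy as np


def moran_rates(N, r=1.0):
    """Birth (i -> i+1) and death (i -> i-1) probabilities of a Moran
    process with N individuals and mutant relative fitness r."""
    i = np.arange(N + 1)
    total = r * i + (N - i)
    t_plus = (r * i / total) * ((N - i) / N)
    t_minus = ((N - i) / total) * (i / N)
    return t_plus, t_minus


def fixation_time_distribution(t_plus, t_minus, start, t_max):
    """dist[t] = probability that state N is first reached at step t.

    States 0 and N are absorbing; probability mass that reaches either
    one is removed from the vector after being recorded.
    """
    N = len(t_plus) - 1
    p = np.zeros(N + 1)
    p[start] = 1.0
    dist = np.zeros(t_max + 1)
    stay = 1.0 - t_plus - t_minus
    for t in range(1, t_max + 1):
        inner = p[1:N]
        new = np.zeros(N + 1)
        new[2:N + 1] += inner * t_plus[1:N]   # one step up
        new[0:N - 1] += inner * t_minus[1:N]  # one step down
        new[1:N] += inner * stay[1:N]         # no change
        dist[t] = new[N]                      # mass fixing at step t
        new[0] = new[N] = 0.0                 # drop absorbed mass
        p = new
    return dist


def median_fixation_time(dist):
    """Median of the fixation time, conditional on fixation occurring
    within the simulated horizon."""
    cdf = np.cumsum(dist) / dist.sum()
    return int(np.searchsorted(cdf, 0.5))
```

For example, `fixation_time_distribution(*moran_rates(10), start=1, t_max=5000)` gives the distribution for a single neutral mutant in a population of 10; summing it recovers the fixation probability, and comparing `median_fixation_time` with the conditional mean shows how far a skewed distribution pulls the two apart.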

Fast principal components analysis reveals independent evolution of ADH1B gene in Europe and East Asia

Kevin J Galinsky, Gaurav Bhatia, Po-Ru Loh, Stoyan Georgiev, Sayan Mukherjee, Nick J Patterson, Alkes L Price
doi: http://dx.doi.org/10.1101/018143

Principal components analysis (PCA) is a widely used tool for inferring population structure and correcting confounding in genetic data. We introduce a new algorithm, FastPCA, that leverages recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using a new test for natural selection based on population differentiation along these PCs, we replicate previously known selected loci and identify three new signals of selection, including selection in Europeans at the ADH1B gene. The coding variant rs1229984 has previously been associated with alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents.
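
The general idea of approximating the top PCs with random projections can be sketched as follows. This is a generic randomized SVD with power iterations, written as an illustration of the family of techniques the abstract alludes to. It is not the FastPCA algorithm itself, and the function name `randomized_pcs` and its parameters (`oversample`, `n_iter`, `seed`) are assumptions made for this example.

```python
import numpy as np


def randomized_pcs(X, k, oversample=10, n_iter=4, seed=0):
    """Approximate the top-k principal components of X.

    X: (individuals x SNPs) data matrix.
    Returns (U, S): unit-norm PC scores (individuals x k) and the
    corresponding singular values of the column-centered matrix.
    """
    rng = np.random.default_rng(seed)
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)  # center each SNP
    n, m = Xc.shape
    width = min(k + oversample, n, m)

    # Project onto a random low-dimensional subspace, then refine it
    # with a few power iterations (re-orthonormalizing for stability).
    Q = np.linalg.qr(Xc @ rng.standard_normal((m, width)))[0]
    for _ in range(n_iter):
        Q = np.linalg.qr(Xc.T @ Q)[0]
        Q = np.linalg.qr(Xc @ Q)[0]

    # Exact SVD of the small projected matrix, mapped back to sample space.
    B = Q.T @ Xc
    Ub, S, _ = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k]
```

The full data matrix is only touched through a handful of products with thin matrices, and the exact decomposition is done on a matrix with `k + oversample` rows; that is what makes this kind of approach attractive when the number of individuals is large.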

Fulfilling the promise of Mendelian randomization

Joseph Pickrell
doi: http://dx.doi.org/10.1101/018150

Many important questions in medicine involve questions about causality. For example, do low levels of high-density lipoproteins (HDL) cause heart disease? Does high body mass index (BMI) cause type 2 diabetes? Or are these traits simply correlated in the population for other reasons? A popular approach to answering these questions using human genetics is called “Mendelian randomization”. We discuss the prospects and limitations of this approach, and some ways forward.

Is there such a thing as Landscape Genetics?

Rodney J Dyer
doi: http://dx.doi.org/10.1101/018192

For a scientific discipline to be interdisciplinary it must satisfy two conditions: it must consist of contributions from at least two existing disciplines, and it must be able to provide insights, through this interaction, that neither progenitor discipline could address. In this paper, I examine the complete body of peer-reviewed literature self-identified as landscape genetics using the statistical approaches of text mining and natural language processing. The goal here is to quantify the kinds of questions being addressed in landscape genetic studies, the ways in which questions are evaluated mechanistically, and how they are differentiated from the progenitor disciplines of landscape ecology and population genetics. I then circumscribe the main factions within published landscape genetic papers, examining the extent to which emergent questions are being addressed and highlighting a deep bifurcation between existing individual- and population-based approaches. I close by providing some suggestions on where theoretical and analytical work is needed if landscape genetics is to serve as a real bridge connecting evolution and ecology sensu lato.