Elephantid genomes reveal the molecular bases of Woolly Mammoth adaptations to the arctic

Vincent Lynch, Oscar C. Bedoya-Reina, Aakrosh Ratan, Michael Sulak, Daniela I. Drautz-Moses, George H. Perry, Webb Miller, Stephan C. Schuster
doi: http://dx.doi.org/10.1101/018366

Woolly mammoths and the living elephants are characterized by major phenotypic differences that allowed them to live in very different environments. To identify the genetic changes that underlie the suite of adaptations in woolly mammoths to life in extreme cold, we sequenced the nuclear genomes of three Asian elephants and two woolly mammoths, and identified and functionally annotated genetic changes unique to the woolly mammoth lineage. We find that genes with mammoth-specific amino acid changes are enriched in functions related to circadian biology, skin and hair development and physiology, lipid metabolism, adipose development and physiology, and temperature sensation. Finally, we resurrect and functionally test the mammoth and ancestral elephant TRPV3 gene, which encodes a temperature-sensitive transient receptor potential (thermoTRP) channel involved in thermal sensation and hair growth, and show that a single mammoth-specific amino acid substitution in an otherwise highly conserved region of the TRPV3 channel strongly affected its temperature sensitivity. Our results have identified a set of genetic changes that likely played important roles in the adaptation of woolly mammoths to life in the high Arctic.


A high-throughput RNA-seq approach to profile transcriptional responses

Gregory A Moyerbrailean, Gordon O Davis, Chris T Harvey, Donovan Watza, Xiaoquan Wen, Roger Pique-Regi, Francesca Luca
doi: http://dx.doi.org/10.1101/018416

In recent years, different technologies have been used to measure genome-wide gene expression levels and to study the transcriptome across many types of tissues and in response to in vitro treatments. However, a full understanding of gene regulation in any given cellular and environmental context combination is still missing. This is partly because analyzing tissue/environment-specific gene expression generally implies screening a large number of cellular conditions and samples, without prior knowledge of which conditions are most informative (e.g. some cell types may not respond to certain treatments). To circumvent these challenges, we have established a new two-step high-throughput and cost-effective RNA-seq approach: the first step consists of gene expression screening of a large number of conditions, while the second step focuses on deep sequencing of the most relevant conditions (e.g. largest number of differentially expressed genes). This study design allows for a fast and economical screen in step one, with a more profitable allocation of resources for the deep sequencing of re-pooled libraries in step two. We have applied this approach to study the response to 26 treatments in three lymphoblastoid cell line samples and we show that it is applicable for other high-throughput transcriptome profiling requiring iterative refinement or screening.

Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing

Saulo A. Aflitos, Elio Schijlen, Richard Finkers, Sandra Smit, Jun Wang, Gengyun Zhang, Ning Li, Likai Mao, Hans de Jong, Freek Bakker, Barbara Gravendeel, Timo Breit, Rob Dirks, Henk Huits, Darush Struss, Ruth Wagner, Hans van Leeuwen, Roeland van Ham, Laia Fito, Laëtitia Guigner, Myrna Sevilla, Philippe Ellul, Eric W. Ganko, Arvind Kapur, Emmanuel Reclus, Bernard de Geus, Henri van de Geest, Bas te Lintel Hekkert, Jan C. Van Haarst, Lars Smits, Andries Koops, Gabino Sanchez Perez, Dick de Ridder, Sjaak van Heusden, Richard Visser, Zhiwu Quan, Jiumeng Min, Li Liao, Xiaoli Wang, Guangbiao Wang, Zhen Yue, Xinhua Yang, Na Xu, Eric Schranz, Eric F. Smets, Rutger A. Vos, Han Rauwerda, Remco Ursem, Cees Schuit, Mike Kerns, Jan van den Berg, Wim H. Vriezen, Antoine Janssen, Torben Jahrman, Frederic Moquet, Julien Bonnet, Sander A. Peters
(Submitted on 21 Apr 2015)

Genetic variation in the tomato clade was explored by sequencing a selection of 84 tomato accessions and related wild species representative of the Lycopersicon, Arcanum, Eriopersicon, and Neolycopersicon groups. We present a reconstruction of three new reference genomes in support of our comparative genome analyses. Sequence diversity in commercial breeding lines appears extremely low, indicating the dramatic genetic erosion of crop tomatoes. This is reflected by the SNP count in wild species, which can exceed 10 million, i.e. 20-fold higher than in crop accessions. Comparative sequence alignment reveals group-, species-, and accession-specific polymorphisms, which explain characteristic fruit traits and growth habits in tomato accessions. Using gene models from the annotated Heinz reference genome, we observe a bias in dN/dS ratio in fruit and growth diversification genes compared to a random set of genes, which is probably the result of positive selection. We detected highly divergent segments in wild S. lycopersicum species, and footprints of introgressions in crop accessions originating from a common donor accession. Phylogenetic relationships of fruit diversification and growth-specific genes from crop accessions show incomplete resolution and are dependent on the introgression donor. In contrast, whole-genome SNP information has sufficient power to resolve the phylogenetic placement of each accession in the four main groups in the Lycopersicon clade using Maximum Likelihood analyses. Phylogenetic relationships appear correlated with habitat and mating type and point to the occurrence of geographical races within these groups, and thus are of practical importance for introgressive hybridization breeding. Our study illustrates the need for multiple reference genomes in support of tomato comparative genomics and Solanum genome evolution studies.

Introgression Browser: High throughput whole-genome SNP visualization

Saulo Alves Aflitos, Gabino Sanchez-Perez, Dick de Ridder, Paul Fransz, Eric Schranz, Hans de Jong, Sander Peters
(Submitted on 21 Apr 2015)

Breeding by introgressive hybridization is a pivotal strategy to broaden the genetic basis of crops. Usually, the desired traits are monitored in consecutive crossing generations by marker-assisted selection, but their analyses fail in chromosome regions where crossover recombinants are rare or not viable. Here, we present the Introgression Browser (IBROWSER), a novel bioinformatics tool aimed at visualizing introgressions at nucleotide or SNP accuracy. The software selects homozygous SNPs from Variant Call Format (VCF) information and filters out heterozygous SNPs, Multi-Nucleotide Polymorphisms (MNPs) and insertion-deletions (InDels). For data analysis IBROWSER makes use of sliding windows, but if needed it can generate any desired fragmentation pattern through General Feature Format (GFF) information. In an example of tomato (Solanum lycopersicum) accessions we visualize SNP patterns and elucidate both position and boundaries of the introgressions. We also show that our tool is capable of identifying alien DNA in a panel of the closely related S. pimpinellifolium by examining phylogenetic relationships of the introgressed segments in tomato. In a third example, we demonstrate the power of the IBROWSER in a panel of 600 Arabidopsis accessions, detecting the boundaries of a SNP-free region around a polymorphic 1.17 Mbp inverted segment on the short arm of chromosome 4. The architecture and functionality of IBROWSER make the software appropriate for a broad set of analyses including SNP mining, genome structure analysis, and pedigree analysis. Its functionality, together with the capability to process large data sets and efficient visualization of sequence variation, makes IBROWSER a valuable breeding tool.
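
The filtering step described in this abstract (keep homozygous SNPs, drop heterozygous calls, MNPs and InDels) can be sketched in a few lines of Python. This is a minimal illustration for single-sample VCF data lines, not IBROWSER's own code: the function names `is_homozygous_snp` and `snps_per_window` are hypothetical, and fixed non-overlapping windows stand in for the sliding windows mentioned in the abstract.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}

def is_homozygous_snp(vcf_line):
    """Return True if a single-sample VCF data line is a homozygous SNP.

    Hypothetical helper: assumes tab-separated columns with FORMAT in
    column 9 and one sample in column 10.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, fmt, sample = fields[3], fields[4], fields[8], fields[9]
    # Single-base REF and ALT alleles only: this drops MNPs and InDels.
    if ref not in BASES or any(a not in BASES for a in alt.split(",")):
        return False
    # Find the GT value, treating phased (|) and unphased (/) calls alike.
    gt = sample.split(":")[fmt.split(":").index("GT")]
    alleles = gt.replace("|", "/").split("/")
    # Homozygous non-reference: one distinct allele, not missing, not REF.
    return len(set(alleles)) == 1 and alleles[0] not in (".", "0")

def snps_per_window(positions, window_size):
    """Count SNPs per non-overlapping window, keyed by 1-based window start."""
    counts = Counter()
    for pos in positions:
        counts[(pos - 1) // window_size * window_size + 1] += 1
    return dict(counts)
```

Counting the retained SNPs per window gives the kind of per-segment summary that a browser could then compare across accessions.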

The generalised quasispecies

Raphaël Cerf, Joseba Dalmau
(Submitted on 22 Apr 2015)

We study Eigen’s quasispecies model in the asymptotic regime where the length of the genotypes goes to infinity and the mutation probability goes to 0. We give several explicit formulas for the stationary solutions of the limiting system of differential equations.

Genetic Basis of Transcriptome Diversity in Drosophila melanogaster

Wen Huang, Mary Anna Carbone, Michael Magwire, Jason Peiffer, Richard Lyman, Eric Stone, Robert Anholt, Trudy Mackay
doi: http://dx.doi.org/10.1101/018325

Understanding how DNA sequence variation is translated into variation for complex phenotypes has remained elusive, but is essential for predicting adaptive evolution, selecting agriculturally important animals and crops, and personalized medicine. Here, we quantified genome-wide variation in gene expression in the sequenced inbred lines of the Drosophila melanogaster Genetic Reference Panel (DGRP). We found that a substantial fraction of the Drosophila transcriptome is genetically variable and organized into modules of genetically correlated transcripts, which provide functional context for newly identified transcribed regions. We identified regulatory variants for the mean and variance of gene expression, the latter of which could often be explained by an epistatic model. Expression quantitative trait loci for the mean, but not the variance, of gene expression were concentrated near genes. This comprehensive characterization of population scale diversity of transcriptomes and its genetic basis in the DGRP is critically important for a systems understanding of quantitative trait variation.

Detecting genomic signatures of natural selection with principal component analysis: application to the 1000 Genomes data

Nicolas Duforet-Frebourg, Guillaume Laval, Eric Bazin, Michael G.B. Blum
(Submitted on 8 Apr 2015)

Large-scale genomic data offer the prospect of deciphering the genetic architecture of natural selection. To characterize natural selection, various analytical methods for detecting candidate genomic regions have been developed. We propose to perform genome-wide scans of natural selection using principal component analysis. We show that the common Fst index of genetic differentiation between populations can be viewed as a proportion of variance explained by the principal components. Looking at the correlations between genetic variants and each principal component provides a conceptual framework to detect genetic variants involved in local adaptation without any prior definition of populations. To validate the PCA-based approach, we consider the 1000 Genomes data (phase 1) after removal of recently admixed individuals, resulting in 850 individuals coming from Africa, Asia, and Europe. The number of genetic variants is of the order of 36 million, obtained with a low-coverage sequencing depth (3X). The correlations between genetic variation and each principal component provide well-known targets for positive selection (EDAR, SLC24A5, SLC45A2, DARC), and also new candidate genes (APPBPP2, TP1A1, RTTN, KCNMA, MYO5C) and non-coding RNAs. In addition to identifying genes involved in biological adaptation, we identify two biological pathways involved in polygenic adaptation that are related to the innate immune system (beta defensins) and to lipid metabolism (fatty acid omega oxidation). PCA-based statistics retrieve well-known signals of human adaptation, which is encouraging for future whole-genome sequencing projects, especially in non-model species for which defining populations can be difficult. Genome scan based on PCA is implemented in the open-source and freely available PCAdapt software.
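
The core statistic in this abstract, the correlation between each genetic variant and each principal component, can be illustrated with a short NumPy sketch. It is an assumption-laden illustration rather than the PCAdapt implementation: the function name `pc_correlations` is hypothetical, genotypes are taken to be allele counts (0, 1, 2) in an individuals-by-variants matrix, and the requested components are assumed to have non-zero singular values.

```python
import numpy as np

def pc_correlations(genotypes, n_components):
    """Correlate each variant with each of the leading principal components.

    genotypes: (n_individuals, n_variants) array of allele counts.
    Returns an (n_variants, n_components) array of Pearson correlations.
    """
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)  # centre each variant
    # PCA via SVD: columns of u are unit-norm scores of the individuals.
    u, _, _ = np.linalg.svd(g, full_matrices=False)
    scores = u[:, :n_components]
    norms = np.linalg.norm(g, axis=0)
    norms[norms == 0] = np.nan  # monomorphic variants: correlation undefined
    # Variants and scores are both centred, so the Pearson correlation
    # reduces to a dot product divided by the two norms (score norm is 1).
    return (g.T @ scores) / norms[:, None]
```

Variants with unusually large absolute correlations on a given component would be the candidates such a scan flags; how a significance threshold is chosen is not covered by this sketch.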

Efficient compression and analysis of large genetic variation datasets

Ryan M Layer, Neil Kindlon, Konrad J Karczewski, Exome Aggregation Consortium ExAC, Aaron R Quinlan
doi: http://dx.doi.org/10.1101/018259

The economy of human genome sequencing has catalyzed ambitious efforts to interrogate the genomes of large cohorts in search of deeper insight into the genetic basis of disease. This manuscript introduces Genotype Query Tools (GQT) as a new indexing strategy and powerful toolset that enables interactive analyses based on genotypes, phenotypes and sample relationships. Speed improvements are achieved by operating directly on a compressed index without decompression. GQT’s data compression ratios increase favorably with cohort size and therefore, by avoiding data inflation, relative analysis performance improves in kind. We demonstrate substantial query performance improvements over state-of-the-art tools using datasets from the 1000 Genomes Project (46 fold), the Exome Aggregation Consortium (443 fold), and simulated datasets of up to 100,000 genomes (218 fold). Moreover, our genotype indexing strategy complements existing formats and toolsets to provide a powerful framework for current and future analyses of massive genome datasets.
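
To make the idea of "operating directly on a compressed index without decompression" concrete, here is a small Python sketch that ANDs two run-length-encoded bit vectors run by run, never expanding them to one bit per variant. The abstract does not specify GQT's encoding, so run-length encoding is used here purely as a stand-in, and the names `rle_and` and `count_set_bits` are hypothetical.

```python
def rle_and(a, b):
    """AND two run-length-encoded bit vectors without expanding them.

    Each input is a list of (bit, run_length) pairs; both inputs are
    assumed to cover the same total number of positions.
    """
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)      # overlap of the two current runs
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1] = (bit, out[-1][1] + step)  # extend the previous run
        else:
            out.append((bit, step))
        rem_a -= step
        rem_b -= step
        if rem_a == 0:                # current run of a is used up
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:                # current run of b is used up
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out

def count_set_bits(rle):
    """Number of 1 bits in a run-length-encoded vector."""
    return sum(length for bit, length in rle if bit)
```

If each vector marked the variants at which one sample carries a given genotype, the AND would give the variants shared by both samples, at a cost proportional to the number of runs rather than the number of variants.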

Efficient Privacy-Preserving String Search and an Application in Genomics

Kana Shimizu, Koji Nuida, Gunnar Rätsch
doi: http://dx.doi.org/10.1101/018267

Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g., an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private, and the server would like to keep the database private. Approach: We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queries. In an experiment based on 2,184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within ≈2 seconds and ≈20 seconds for client and server side, respectively, on a laptop computer. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm.
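
The "efficient iterative query operations" mentioned in this abstract can be illustrated with ordinary, non-private backward search over a Burrows-Wheeler transform. The sketch below is deliberately simplified: it uses the classic BWT rather than the positional variant, it includes none of the encryption or oblivious transfer, and the function names `bwt` and `count_occurrences` are hypothetical. Its only purpose is to show that each search step is a lookup indexed by the current interval and the next query character, which is the kind of lookup the abstract says is concealed from the server.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for short texts)."""
    text += "$"  # sentinel, assumed absent from the text and smallest in order
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(bwt_string, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_string)  # current interval of sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # One iterative step: narrow the interval using rank queries.
        lo = first[ch] + bwt_string[:lo].count(ch)
        hi = first[ch] + bwt_string[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo
```

A production index would precompute the rank tables instead of calling `count` on a prefix at every step; the linear scan is kept here only for brevity.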

Relationship between LD Score and Haseman-Elston Regression

Brendan Bulik-Sullivan
doi: http://dx.doi.org/10.1101/018283

Estimating SNP-heritability from summary statistics using LD Score regression provides a convenient alternative to standard variance component models, because LD Score regression is computationally very fast and does not require individual genotype data. However, the mathematical relationship between variance component methods and LD Score regression is not clear; in particular, it is not known in general how much of an increase in standard error one incurs by working with summary data instead of individual genotypes. In this paper, I show that in samples of unrelated individuals, LD Score regression with constrained intercept is essentially the same as Haseman-Elston (HE) regression, which is currently the state-of-the-art method for estimating SNP-heritability from ascertained case/control samples. Similar results hold for SNP-genetic correlation.
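
For readers unfamiliar with Haseman-Elston regression, the following Python sketch shows one common formulation: the products of standardized phenotypes for each pair of individuals are regressed on the corresponding off-diagonal entries of a genetic relationship matrix, and the slope is read as the SNP-heritability estimate. This is an illustration under stated assumptions, not the derivation from the paper: the function name `he_regression` is hypothetical, the relationship matrix is assumed to be precomputed, and ordinary least squares with a free intercept is used (fitting through the origin is another choice).

```python
def he_regression(phenotypes, grm):
    """Slope of phenotype cross-products on relatedness (one HE formulation).

    phenotypes: list of n numbers.
    grm: n x n genetic relationship matrix (list of lists), assumed given.
    """
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    sd = (sum((y - mean) ** 2 for y in phenotypes) / (n - 1)) ** 0.5
    z = [(y - mean) / sd for y in phenotypes]  # standardized phenotypes
    xs, ys = [], []
    for i in range(n):
        for j in range(i + 1, n):              # each unordered pair once
            xs.append(grm[i][j])
            ys.append(z[i] * z[j])
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance               # OLS slope
```

The number of pairs grows quadratically with sample size, so this direct loop is only practical for small examples.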