Genetic Analysis of Transformed Phenotypes

Nicolo Fusi, Christoph Lippert, Neil D. Lawrence, Oliver Stegle
(Submitted on 21 Feb 2014)

Linear mixed models (LMMs) are a powerful and established tool for studying the genetics of phenotypic variation. A limiting assumption of LMMs is that the phenotype is Gaussian distributed under the model, a requirement that rarely holds in practice. Since violations of this assumption can lead to false conclusions and losses in power, it’s common practice to pre-process the phenotypic values, for instance by applying logarithmic transformations. Unfortunately, these are not appropriate in every situation, and choosing a “good” transformation is in general challenging and subjective. Here, we present an extension of the LMM that estimates an optimal transformation from the data. We show in extensive simulations and real data from human, mouse and yeast that application of these optimal transformations leads to increased power in genome-wide association studies and higher accuracy in heritability estimates and phenotype predictions.
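The idea of estimating a transformation from the data, rather than picking one by hand, can be illustrated with a much simpler stand-in than the method in the abstract. The sketch below assumes a one-parameter Box-Cox family and a plain Gaussian model with no genetic random effect, so it is not the authors' LMM extension; the function names and the grid of candidate values are hypothetical.

```python
import math
import random

def box_cox(y, lam):
    """Box-Cox transform of positive values; lam == 0 is the log transform."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_log_likelihood(y, lam):
    """Gaussian log-likelihood of the transformed data (up to a constant),
    including the Jacobian term of the transformation."""
    z = box_cox(y, lam)
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in y)

def best_lambda(y, grid):
    """Pick the candidate lambda with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_log_likelihood(y, lam))

# Simulated log-normal "phenotype": the log transform (lambda near 0) should win.
random.seed(0)
phenotype = [math.exp(random.gauss(0.0, 1.0)) for _ in range(2000)]
grid = [i / 10 for i in range(-20, 21)]
print(best_lambda(phenotype, grid))
```

In a mixed-model setting the Gaussian likelihood above would be replaced by the LMM likelihood, so that the transformation and the variance components are fitted together.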

Extensive translation of small ORFs revealed by polysomal ribo-Seq

Julie L Aspden, Ying Chen Eyre-Walker, Rose J. Phillips, Michele Brocard, Unum Amin, Juan Couso

Thousands of small Open Reading Frames (smORFs) encoding small peptides of fewer than 100 amino acids exist in our genomes. Examples of functional smORFs have been characterised in a few species but the actual number of translated smORFs, and their molecular, functional and evolutionary features are not known. Here we present a genome-wide assessment of smORF translation by ribosomal profiling of polysomal fractions. This ‘polysomal ribo-Seq’ suggests that smORFs are translated at the same level and in the same relative numbers (80%) as normal proteins. The smORF peptides appear widely conserved, show activity in cells, and display a putative amino acid signature. These findings reinforce the idea that smORFs are an abundant and fundamental genome component, displaying features usually attributed to canonical proteins, including high translation levels, biological function, amino acid sequence specificity and cross-species conservation.

Mapping eQTL networks with mixed graphical Markov models

Inma Tur, Alberto Roverato, Robert Castelo
(Submitted on 19 Feb 2014 (v1), last revised 29 Oct 2014 (this version, v5))

Expression quantitative trait loci (eQTL) mapping constitutes a challenging problem due to, among other reasons, the high-dimensional multivariate nature of gene-expression traits. Next to the expression heterogeneity produced by confounding factors and other sources of unwanted variation, indirect effects spread throughout genes as a result of genetic, molecular and environmental perturbations. From a multivariate perspective one would like to adjust for the effect of all of these factors to end up with a network of direct associations connecting the path from genotype to phenotype. In this paper we approach this challenge with mixed graphical Markov models, higher-order conditional independences and q-order correlation graphs. These models show that additive genetic effects propagate through the network as a function of gene-gene correlations. Our estimation of the eQTL network underlying a well-studied yeast data set leads to a sparse structure with more direct genetic and regulatory associations that enable a straightforward comparison of the genetic control of gene expression across chromosomes. Interestingly, it also reveals that eQTLs explain most of the expression variability of network hub genes.

A Novel Approach for Multi-Domain and Multi-Gene Family Identification Provides Insights into Evolutionary Dynamics of Disease Resistance Genes in Core Eudicot Plants

Johannes A. Hofberger, Beifei Zhou, Haibao Tang, Jonathan DG Jones, M. Eric Schranz

Recent advances in DNA sequencing techniques have resulted in more than forty sequenced plant genomes representing a diverse set of taxa of agricultural, energy, medicinal and ecological importance. However, gene family curation is often only inferred from DNA sequence homology and lacks insights into evolutionary processes contributing to gene family dynamics. In a comparative genomics framework, we integrated multiple lines of evidence provided by gene synteny, sequence homology and protein-based Hidden Markov Modelling to extract homologous super-clusters composed of multi-domain resistance (R)-proteins of the NB-LRR type (for NUCLEOTIDE BINDING/LEUCINE-RICH REPEATS), that are involved in plant innate immunity. To assess the diversity of R-proteins within and between species, we screened twelve eudicot plant genomes including six major crops and found a total of 2,363 NB-LRR genes. Our curated R-proteins set shows a 50% average for tandem duplicates and a 22% fraction of gene copies retained from ancient polyploidy events (ohnologs). We provide evidence for strong positive selection acting on all identified genes and show significant differences in molecular evolution rates (Ka/Ks-ratio) among tandem- (mean=1.59), ohnolog (mean=1.36) and singleton (mean=1.22) R-gene duplicates. To foster the process of gene-edited plant breeding, we report species-specific presence/absence of all 140 NB-LRR genes present in the model plant Arabidopsis and describe four distinct clusters of NB-LRR “gatekeeper” loci sharing syntelogs across all analyzed genomes. In summary, we designed and implemented an easy-to-follow computational framework for super-gene family identification, and provide the most curated set of NB-LRR genes whose genetic versatility among twelve lineages can underpin crop improvement.

Cell specific eQTL analysis without sorting cells

Harm-Jan Westra, Danny Arends, Tõnu Esko, Marjolein J. Peters, Claudia Schurmann, Katharina Schramm, Johannes Kettunen, Hanieh Yaghootkar, Benjamin Fairfax, Anand Kumar Andiappan, Yang Li, Jingyuan Fu, Juha Karjalainen, Mathieu Platteel, Marijn Visschedijk, Rinse Weersma, Silva Kasela, Lili Milani, Liina Tserel, Pärt Peterson, Eva Reinmaa, Albert Hofman, André G. Uitterlinden, Fernando Rivadeneira, Georg Homuth, Astrid Petersmann, Roberto Lorbeer, Holger Prokisch, Thomas Meitinger, Christian Herder, Michael Roden, Harald Grallert, Samuli Ripatti, Markus Perola, Andrew R. Wood, David Melzer, Luigi Ferrucci, Andrew B. Singleton, Dena G. Hernandez, Julian C. Knight, Rossella Melchiotti, Bernett Lee, Michael Poidinger, Francesca Zolezzi, Anis Larbi, De Yun Wang, Leonard H. van den Berg, Jan H. Veldink, Olaf Rotzschke, Seiko Makino, Timothy Frayling, Veikko Salomaa, Konstantin Strauch, Uwe Völker, Joyce B.J. van Meurs, Andres Metspalu, Cisca Wijmenga, Ritsert C. Jansen, Lude Franke

Expression quantitative trait locus (eQTL) mapping on tissue, organ or whole organism data can detect associations that are generic across cell types. We describe a new method to focus upon specific cell types without first needing to sort cells. We applied the method to whole blood data from 5,683 samples and demonstrate that SNPs associated with Crohn’s disease preferentially affect gene expression within neutrophils.

The disruption of trace element homeostasis due to aneuploidy as a unifying theme in the etiology of cancer

Johannes Engelken, Matthias Altmeyer, Renty Franklin

Abstract for Scientists: While decades of cancer research have firmly established multiple “hallmarks of cancer”, cancer’s genomic landscape remains to be fully understood. Particularly, the phenomenon of aneuploidy – gains and losses of large genomic regions, i.e. whole chromosomes or chromosome arms – and why most cancer cells are aneuploid remains enigmatic. This is despite the achievements of cytogenomics and whole genome sequencing which have successfully pinpointed focal amplifications and focal deletions as well as point mutations affecting numerous genes involved in carcinogenesis. A characteristic of many different cancers is the deregulation of the homeostasis of trace elements, such as copper (Cu), zinc (Zn) and iron (Fe). Concentrations of copper are markedly increased in cancer tissue and the blood plasma of cancer patients, while zinc levels are typically decreased. Here we discuss the hypothesis that the disruption of trace element homeostasis and the phenomenon of aneuploidy might be linked. Our tentative analysis of genomic data from diverse tumor types mainly from The Cancer Genome Atlas (TCGA) project suggests that gains and losses of metal transporter genes occur frequently and correlate well with transporter gene expression levels. Hereby they may confer a cancer-driving selective growth advantage at early and possibly also later stages during cancer development. This idea is consistent with recent observations in yeast, which suggest that through chromosomal gains and losses cells can adapt quickly to new carbon sources, nutrient starvation as well as to copper toxicity. In human cancer development, candidate driving events may include, among others, the gains of zinc transporter genes SLC39A1 and SLC39A4 on chromosome arms 1q and 8q, respectively, and the losses of zinc transporter genes SLC30A5, SLC39A14 and SLC39A6 on 5q, 8p and 18q.
The recurrent gain of 3q might be associated with the iron transporter gene TFRC and the loss of 13q with the copper transporter gene ATP7B. By altering cellular trace element homeostasis (especially fluctuations in labile and total zinc) such events might contribute to the initiation of the malignant transformation. Consistently, it has been shown that zinc affects a number of the observed hallmark characteristics including DNA repair, inflammation and apoptosis. We term this model the “aneuploidy metal transporter cancer” (AMTC) hypothesis. While the AMTC hypothesis does not contradict the cancer-promoting role of point and focal mutations in established tumor suppressor genes and oncogenes (e.g. MYC, MYCN, TP53, PIK3CA, BRCA1, ERBB2), it seems possible that some of these mutations may be a response to the prior disruption of trace element homeostasis. We suggest a number of approaches for how this hypothesis could be tested experimentally and briefly touch on possible implications for cancer etiology, metastasis, drug resistance and therapy.

Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data

Richard W Lusk
(Submitted on 30 Jan 2014)

Background: Trace quantities of contaminating DNA are widespread in the laboratory environment, but their presence has received little attention in the context of high throughput sequencing. This issue is highlighted by recent works that have rested controversial claims upon sequencing data that appear to support the presence of unexpected exogenous species.
Results: I used reads that preferentially aligned to alternate genomes to infer the distribution of potential contaminant species in a set of independent sequencing experiments. I confirmed that dilute samples are more exposed to contaminating DNA, and, focusing on four single-cell sequencing experiments, found that these contaminants appear to originate from a wide diversity of clades. Although negative control libraries prepared from “blank” samples recovered the highest-frequency contaminants, low-frequency contaminants, which appeared to make heterogeneous contributions to samples prepared in parallel within a single experiment, were not well controlled for. I used these results to show that, despite heavy replication and plausible controls, contamination can explain all of the observations used to support a recent claim that complete genes pass from food to human blood.
Conclusions: Contamination must be considered a potential source of signals of exogenous species in sequencing data, even if these signals are replicated in independent experiments, vary across conditions, or indicate a species which seems a priori unlikely to contaminate. Negative control libraries processed in parallel are essential to control for contaminant DNAs, but their limited ability to recover low-frequency contaminants must be recognized.

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Gad Abraham, Michael Inouye

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy compared with existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
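To give a feel for what "PCA based on randomized algorithms" can mean, here is a minimal NumPy sketch of a generic randomized range finder followed by a small exact SVD. It is not flashpca's code or interface: the function name, the oversampling and iteration defaults, and the choice to center (but not scale) each SNP column are all assumptions made for illustration.

```python
import numpy as np

def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top principal components of X (samples x SNPs)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # center each SNP column
    k = min(n_components + n_oversamples, min(Xc.shape))
    Q = rng.standard_normal((Xc.shape[1], k))    # random test matrix (SNPs x k)
    for _ in range(n_iter):                      # power iterations, re-orthonormalized
        Q, _ = np.linalg.qr(Xc @ Q)              # samples x k
        Q, _ = np.linalg.qr(Xc.T @ Q)            # SNPs x k
    Q, _ = np.linalg.qr(Xc @ Q)                  # orthonormal basis for the range of Xc
    B = Q.T @ Xc                                 # small k x SNPs matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small                              # lift left vectors back to sample space
    scores = U[:, :n_components] * s[:n_components]
    return scores, s[:n_components], Vt[:n_components]

# Toy genotype matrix coded 0/1/2 (simulated, not real data).
rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(100, 500)).astype(float)
scores, singular_values, loadings = randomized_pca(genotypes, n_components=2)
print(scores.shape, loadings.shape)
```

The expensive decomposition is only ever applied to the small matrix `B`, which is why approaches of this kind scale to many more individuals than a full SVD of the genotype matrix.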

On the representation of de Bruijn graphs

Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared Simpson, Paul Medvedev
(Submitted on 21 Jan 2014)

The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers proposed a navigational data structure approach in order to improve memory usage. We prove several theoretical space lower bounds to show the limitation of these types of approaches. We further design and implement a general data structure (DBGFM) and demonstrate its use on a human whole-genome dataset, achieving space usage of 1.5 GB and a 46% improvement over previous approaches. As part of DBGFM, we develop the notion of frequency-based minimizers and show how it can be used to enumerate all maximal simple paths of the de Bruijn graph using only 43 MB of memory. Finally, we demonstrate that our approach can be integrated into an existing assembler by modifying the ABySS software to use DBGFM.
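One plausible reading of frequency-based minimizers is sketched below: each k-mer is assigned the rarest of its l-mers (ties broken lexicographically), and k-mers are grouped into buckets by that minimizer. This is an illustrative assumption rather than the DBGFM implementation; it ignores reverse complements, does not enumerate simple paths, and all function names are hypothetical.

```python
from collections import Counter

def kmers(seq, k):
    """All substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def lmer_frequencies(reads, l):
    """Count how often each l-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, l))
    return counts

def frequency_minimizer(kmer, l, counts):
    """Rarest l-mer inside the k-mer; ties are broken lexicographically."""
    return min(kmers(kmer, l), key=lambda m: (counts[m], m))

def partition_by_minimizer(reads, k, l):
    """Group the distinct k-mers (de Bruijn graph nodes) by their minimizer."""
    counts = lmer_frequencies(reads, l)
    buckets = {}
    for read in reads:
        for km in kmers(read, k):
            buckets.setdefault(frequency_minimizer(km, l, counts), set()).add(km)
    return buckets

buckets = partition_by_minimizer(["ACGTACGT"], k=4, l=2)
for minimizer, nodes in sorted(buckets.items()):
    print(minimizer, sorted(nodes))
# AC ['ACGT']
# TA ['CGTA', 'GTAC', 'TACG']
```

Ordering l-mers by frequency rather than alphabetically tends to spread k-mers more evenly across buckets, so each bucket can be processed with a small memory footprint.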

Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly repetitive transposable elements

Rajiv C McCoy, Ryan W Taylor, Timothy A Blauwkamp, Joanna L Kelley, Michael Kertesz, Dmitry Pushkarev, Dmitri A Petrov, Anna-Sophie Fiston-Lavier

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, mostly due to the presence of repeats, which cannot be reconstructed unambiguously with short read data alone. One class of repeats, called transposable elements (TEs), is particularly problematic due to high sequence identity, high copy number, and a capacity to induce complex genomic rearrangements. Despite their importance to genome function and evolution, most current de novo assembly approaches cannot resolve TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly parallel library preparation and local assembly of short read data and achieve lengths of 2-15 Kbp with an extremely low error rate (0.05%). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain yw;cn,bw,sp) achieving an NG50 contig size of 77.9 Kbp and covering 97.2% of the current reference genome (including heterochromatin). TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recover and accurately place 80.4% of annotated transposable elements with perfect identity to the current reference genome. As TEs are complex and highly repetitive features that are ubiquitous in genomes across the tree of life, TruSeq synthetic long-read technology offers a powerful and inexpensive approach to drastically improve de novo assemblies of whole genomes.
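For readers unfamiliar with the NG50 statistic quoted above, the following sketch shows how it is usually computed: contigs are sorted from longest to shortest and accumulated until they cover half of the genome size, and the length of the contig that crosses that threshold is reported. The contig lengths and genome size here are made-up toy values, and the function name is hypothetical.

```python
def ng50(contig_lengths, genome_size):
    """Length L such that contigs of length >= L cover at least half of
    genome_size. Returns 0 if the assembly covers less than half of it."""
    target = genome_size / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0

contigs = [80, 70, 50, 40, 30, 20]        # toy contig lengths
print(ng50(contigs, genome_size=400))     # 50: 80 + 70 + 50 reaches 200
print(ng50(contigs, genome_size=sum(contigs)))  # 70: this is the ordinary N50
```

Unlike N50, which is relative to the total assembly length, NG50 is relative to a fixed genome size, which makes it comparable across assemblies of the same genome.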