Genomic and transcriptomic insights into the regulation of snake venom production

Adam D Hargreaves, Martin T Swain, Matthew J Hegarty, Darren W Logan, John F Mulley
doi: http://dx.doi.org/10.1101/008474

The gene regulatory mechanisms underlying the rapid replenishment of snake venom following expenditure are currently unknown. Using a comparative transcriptomic approach we find that venomous and non-venomous species produce similar numbers of secreted products in their venom or salivary glands and that only one transcription factor (Tbx3) is expressed in venom glands but not salivary glands. We also find evidence for temporal variation in venom production. We have generated a draft genome sequence for the painted saw-scaled viper, Echis coloratus, and identified conserved transcription factor binding sites in the upstream regions of venom genes. We find binding sites to be conserved across members of the same gene family, but not between gene families, indicating that multiple gene regulatory networks are involved in venom production. Finally, we suggest that negative regulation may be important for rapid activation of the venom replenishment cycle.

Sexual dimorphism in epigenomic responses of stem cells to extreme fetal growth

Fabien Delahaye, Neil Ari Wijetunga, Hye J Heo, Jessica N Tozour, Yong Mei Zhao, John M Greally, Francine H Einstein
doi: http://dx.doi.org/10.1101/008482

Extreme fetal growth is associated with increased susceptibility to a range of adult diseases through an unknown mechanism of cellular memory. We tested whether heritable epigenetic processes in long-lived CD34+ hematopoietic stem/progenitor cells (HSPCs) showed evidence for re-programming associated with the extremes of fetal growth. Here we show that both fetal growth restriction and over-growth are associated with global shifts towards DNA hypermethylation, targeting cis-regulatory elements in proximity to genes involved in glucose homeostasis and stem cell function. The response was sexually dimorphic: intrauterine growth restriction (IUGR) was associated with substantially greater epigenetic dysregulation in males, whereas large for gestational age (LGA) growth predominantly affected females. The findings are consistent with extreme fetal growth interacting with variable fetal susceptibility to influence cellular aging and metabolic characteristics through epigenetic mechanisms, potentially generating biomarkers that could identify infants at higher risk for chronic disease later in life.

DISEASES: Text mining and data integration of disease–gene associations

Sune Pletscher-Frankild, Albert Pallejà, Kalliopi Tsafou, Janos X Binder, Lars Juhl Jensen
doi: http://dx.doi.org/10.1101/008425

Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease–gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease–gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a user-friendly web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.

IPED2: Inheritance Path based Pedigree Reconstruction Algorithm for Complicated Pedigrees

Dan He, Zhanyong Wang, Laxmi Parida, Eleazar Eskin
(Submitted on 23 Aug 2014)

Reconstruction of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. The problem is known to be NP-hard even for datasets known to contain only siblings. Some recent methods have been developed to reconstruct pedigrees accurately and efficiently. These methods, however, still consider relatively simple pedigrees; for example, they cannot handle half-sibling situations in which a pair of individuals shares only one parent. In this work, we propose an efficient method, IPED2, based on our previous work, which specifically targets reconstruction of complicated pedigrees that include half-siblings. We note that the presence of half-siblings makes the reconstruction problem significantly more challenging, which is why previous methods exclude the possibility of half-siblings. We propose a novel model as well as an efficient graph algorithm, and experiments show that our algorithm achieves relatively accurate reconstruction. To our knowledge, this is the first method that is able to handle pedigree reconstruction based on genotype data only when half-siblings exist in any generation of the pedigree.

Sources of PCR-induced distortions in high-throughput sequencing datasets

Justus M Kebschull, Anthony M Zador
doi: http://dx.doi.org/10.1101/008375

PCR allows the exponential and sequence-specific amplification of DNA, even from minute starting quantities. Today, PCR is at the core of the most successful DNA sequencing technologies and is a fundamental step in preparing DNA samples for high-throughput sequencing. Despite its importance, we have little comprehensive understanding of the biases and errors that PCR introduces into pools of DNA molecules. Understanding PCR's imperfections and their impact on the amplification of different sequences in a complex mixture is particularly important for a proper understanding of high-throughput sequencing data. We examined the effects of bias, stochasticity, template switches and polymerase errors introduced during PCR on sequence representation in next-generation sequencing libraries. Using Illumina sequencing results of a pool of diverse PCR amplicons with a defined structure, we searched for signatures of each process. We further developed quantitative models for each process and compared predictions of these models to our experimental data. We find that PCR stochasticity is the major force skewing sequence representation after amplification of a pool of unique DNA amplicons. PCR errors become very common in later cycles of PCR but have little impact on the overall sequence distribution as they are confined to small copy numbers. PCR template switches are rare and confined to low copy numbers. Our results will have particular relevance to single cell sequencing, in which sequences are represented by only one or a few molecules.
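
To give a feel for how stochasticity alone can skew representation, the sketch below simulates amplification as a simple branching process in which every molecule is copied in each cycle with a fixed probability. This is an illustrative assumption, not the quantitative model developed in the paper; the function names (`amplify`, `simulate_pool`) and the parameters `efficiency` and `seed` are hypothetical.

```python
import random


def amplify(start_copies, cycles, efficiency, rng):
    """Simulate one sequence through PCR as a branching process.

    Assumption: in each cycle every molecule is duplicated independently
    with probability `efficiency` (a simplification for illustration).
    """
    copies = start_copies
    for _ in range(cycles):
        # Count how many of the current molecules were copied this cycle.
        new_copies = sum(1 for _ in range(copies) if rng.random() < efficiency)
        copies += new_copies
    return copies


def simulate_pool(num_sequences, cycles=10, efficiency=0.8, seed=0):
    """Amplify a pool of unique sequences, each starting from one molecule."""
    rng = random.Random(seed)
    return [amplify(1, cycles, efficiency, rng) for _ in range(num_sequences)]


if __name__ == "__main__":
    counts = simulate_pool(num_sequences=200)
    # Every sequence started from a single molecule, so any spread in the
    # final copy numbers comes from chance events in the early cycles.
    print("min copies:", min(counts))
    print("max copies:", max(counts))
```

Because all sequences start from one molecule and share the same efficiency, the gap between the minimum and maximum final counts reflects stochastic effects only, which matter most when starting copy numbers are small.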

The genomic landscape of polymorphic human nuclear mitochondrial insertions

Gargi Dayama, Sarah B Emery, Jeffrey M Kidd, Ryan E Mills
doi: http://dx.doi.org/10.1101/008144

The transfer of mitochondrial genetic material into the nuclear genomes of eukaryotes is a well-established phenomenon. Many studies over the past decade have utilized reference genome sequences of numerous species to characterize the prevalence and contribution of nuclear mitochondrial insertions to human diseases. The recent advancement of high throughput sequencing technologies has enabled the interrogation of genomic variation at a much finer scale, and now allows for an exploration into the diversity of polymorphic nuclear mitochondrial insertions (NumtS) in human populations. We have developed an approach to discover and genotype previously undiscovered Numt insertions using whole genome, paired-end sequencing data. We have applied this method to almost a thousand individuals in twenty populations from the 1000 Genomes Project and other data sets and identified 138 novel sites of Numt insertions, extending our current knowledge of existing Numt locations in the human genome by almost 20%. Most of the newly identified NumtS were found in less than 1% of the samples we examined, suggesting that they occur infrequently in nature or have been rapidly removed by purifying selection. We find that recent Numt insertions are derived from throughout the mitochondrial genome, including the D-loop, and have integration biases consistent with previous studies on older, fixed NumtS in the reference genome. We have further determined the complete inserted sequence for a subset of these events to define their age and origin of insertion as well as their potential impact on studies of mitochondrial heteroplasmy.

CloudSTRUCTURE: infer population STRUCTURE on the cloud

Liya Wang, Doreen Ware
(Submitted on 18 Aug 2014)

We present CloudSTRUCTURE, an application for running parallel analyses with the population genetics program STRUCTURE. The HPC-ready application, powered by iPlant cyber-infrastructure, provides a fast (by parallelization) and convenient (through a user-friendly GUI) way to calculate likelihood values across multiple values of K (number of genetic groups) and numbers of iterations. The results are automatically summarized for easier determination of the K value that best fits the data. In addition, CloudSTRUCTURE will reformat STRUCTURE output for use in downstream programs, such as TASSEL for association analysis with population structure effects stratified.

Matchmaker, Matchmaker, Make Me a Match: Migration of Populations via Marriages in the Past

Sang Hoon Lee, Robyn Ffrancon, Daniel M. Abrams, Beom Jun Kim, Mason A. Porter
doi: http://dx.doi.org/10.1101/000257

The study of human mobility is both of fundamental importance and of great potential value. For example, it can be leveraged to facilitate efficient city planning and improve prevention strategies when faced with epidemics. The newfound wealth of rich sources of data (including banknote flows, mobile phone records, and transportation data) has led to an explosion of attempts to characterize modern human mobility. Unfortunately, the dearth of comparable historical data makes it much more difficult to study human mobility patterns from the past. In this paper, we present such an analysis: we demonstrate that the data record from Korean family books (called “jokbo”) can be used to estimate migration patterns via marriages from the past 750 years. We apply two generative models of long-term human mobility to quantify the relevance of geographical information to human marriage records in the data, and we find that the wide variety in the geographical distributions of the clans poses interesting challenges for the direct application of these models. Using the different geographical distributions of clans, we quantify the “ergodicity” of clans in terms of how widely and uniformly they have spread across Korea, and we compare these results to those obtained using surname data from the Czech Republic. To examine population flow in more detail, we also construct and examine a population-flow network between regions. Based on the correlation between ergodicity and migration patterns in Korea, we identify two different types of migration patterns: diffusive and convective. We expect the analysis of diffusive versus convective effects in population flows to be widely applicable to the study of mobility and migration patterns across different cultures.

Long-read, whole genome shotgun sequence data for five model organisms

Kristi E Kim, Paul Peluso, Primo Baybayan, Patricia Jane Yeadon, Charles Yu, William Fisher, Chen-Shan Chin, Nicole A Rapicavoli, David R Rank, Joachim Li, David Catcheside, Susan E Celniker, Adam M Phillippy, Casey M Bergman, Jane M Landolin
doi: http://dx.doi.org/10.1101/008037

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4-C2 and P5-C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely studied model systems in biological research.

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

Konstantin Berlin, Sergey Koren, Chen-Shan Chin, James Drake, Jane M Landolin, Adam M Phillippy
doi: http://dx.doi.org/10.1101/008003

We report reference-grade de novo assemblies of four model organisms and the human genome from single-molecule, real-time (SMRT) sequencing. Long-read SMRT sequencing is routinely used to finish microbial genomes, but the available assembly methods have not scaled well to larger genomes. Here we introduce the MinHash Alignment Process (MHAP) for efficient overlapping of noisy, long reads using probabilistic, locality-sensitive hashing. Together with Celera Assembler, MHAP was used to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.
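
The general idea behind MinHash-based overlap detection can be sketched in a few lines: each read is reduced to a short sketch of minimum hash values over its k-mers, and the fraction of matching sketch positions estimates the Jaccard similarity of the two k-mer sets. The code below is a generic MinHash illustration, not the MHAP implementation; the function names, the choice of `blake2b` as the hash, and the default `k` and `num_hashes` values are assumptions made for this example.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Seeded 64-bit hash of a k-mer (hash choice is illustrative)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Keep the minimum hash value of the k-mer set for each seeded hash."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates k-mer set Jaccard."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_a = "ACGTTGCAAGGCTTAACCGGTATCGATCGGATTACAGCTTGACCATGGCAATCGTTAGCA"
    # read_b overlaps the last 40 bases of read_a, then continues.
    read_b = read_a[20:] + "GGATCCTTAGGCATGCAAGT"
    similarity = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
    print("estimated Jaccard similarity:", similarity)
```

Comparing fixed-size sketches instead of full k-mer sets keeps the cost of each read-to-read comparison independent of read length, which is what makes this kind of hashing attractive for long reads.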