YFitter: Maximum likelihood assignment of Y chromosome haplogroups from low-coverage sequence data

Luke Jostins, Yali Xu, Shane McCarthy, Qasim Ayub, Richard Durbin, Jeff Barrett, Chris Tyler-Smith
(Submitted on 30 Jul 2014)

Low-coverage short-read resequencing experiments have the potential to expand our understanding of Y chromosome haplogroups. However, the uncertainty associated with these experiments means that haplogroups must be assigned probabilistically to avoid false inferences. We propose an efficient dynamic programming algorithm that can assign haplogroups by maximum likelihood, and represent the uncertainty in assignment. We apply this to both genotype and low-coverage sequencing data, and show that it can assign haplogroups accurately and with high resolution. The method is implemented as the program YFitter, which is available for download.

QuASAR: Quantitative Allele Specific Analysis of Reads

Chris Harvey, Gregory A Moyerbrailean, Omar Davis, Xiaoquan Wen, Francesca Luca, Roger Pique-Regi

Expression quantitative trait loci (eQTL) studies have discovered thousands of genetic variants that regulate gene expression and have been crucial to enable a better understanding of the functional role of non-coding sequences. However, eQTL studies are generally quite expensive, requiring a large sample size and genome-wide genotyping. On the other hand, allele specific expression (ASE) is becoming a very popular approach to detect the effect of a genetic variant on gene expression, even with a single individual. This is typically achieved by counting the number of RNA-seq reads for each allele at heterozygous sites and rejecting the null hypothesis of a 1:1 allelic ratio. When genotype information is not readily available it could be inferred from the RNA-seq reads directly, but there are no methods available that can incorporate the uncertainty in the genotype call into the ASE inference step. Here, we present QuASAR, Quantitative Allele Specific Analysis of Reads, a novel statistical learning method for jointly detecting heterozygous genotypes and inferring ASE. The proposed ASE inference step takes into consideration the uncertainty in the genotype calls while including parameters that model base-call errors in sequencing and allelic over-dispersion. We validated our method with experimental data for which high quality genotypes are available. Results on an additional dataset with multiple replicates at different sequencing depths demonstrate that QuASAR is a powerful tool for ASE analysis when genotypes are not available.
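
The conventional test mentioned in this abstract (read counts at a heterozygous site against a 1:1 null) can be illustrated with an exact binomial test. The following is only a minimal sketch of that baseline and is not QuASAR's model: it assumes the site is known to be heterozygous and ignores genotype uncertainty, base-call errors and over-dispersion. The function name `binomial_two_sided_p` and its parameters are hypothetical, chosen for illustration.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one site.

    Hypothetical helper: assumes a known heterozygous site and a null
    reference-allele fraction of p_null (0.5 corresponds to a 1:1 ratio).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence against the null
    # Probability of every possible reference-read count under the null.
    pmf = [comb(n, k) * p_null**k * (1 - p_null) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are no more likely than the observed count;
    # the small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Example: 10 reference reads and 0 alternate reads at one site.
p_value = binomial_two_sided_p(10, 0)
```

A small p-value suggests allelic imbalance under these simplifying assumptions. In practice such a test tends to be anti-conservative when read counts are over-dispersed, which is one of the effects the abstract says QuASAR models explicitly.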

Are all genetic variants in DNase I sensitivity regions functional?

Gregory A Moyerbrailean, Chris T Harvey, Cynthia A Kalita, Xiaoquan Wen, Francesca Luca, Roger Pique-Regi

A detailed mechanistic understanding of the direct functional consequences of DNA variation on gene regulatory mechanisms is critical for a complete understanding of complex trait genetics and evolution. Here, we present a novel approach that integrates sequence information and DNase I footprinting data to predict the impact of a sequence change on transcription factor binding. Applying this approach to 653 DNase-seq samples, we identified 3,831,862 regulatory variants predicted to affect active regulatory elements for a panel of 1,372 transcription factor motifs. Using QuASAR, we validated the non-coding variants predicted to be functional by examining allele-specific binding (ASB). Combining the predictive model and the ASB signal, we identified 3,217 binding variants within footprints that are significantly imbalanced (20% FDR). Even though most variants in DNase I hypersensitive regions may not be functional, we estimate that 56% of our annotated functional variants show actual evidence of ASB. To assess the effect these variants may have on complex phenotypes, we examined their association with complex traits using GWAS and observed that ASB-SNPs are enriched 1.22-fold for complex trait variants. Furthermore, we show that integrating footprint annotations into GWAS meta-study results improves identification of likely causal SNPs and provides a putative mechanism by which the phenotype is affected.

Long non-coding RNA discovery in Anopheles gambiae using deep RNA sequencing

Adam M Jenkins, Robert M Waterhouse, Alan S Kopin, Marc A.T. Muskavitch

Long non-coding RNAs (lncRNAs) are mRNA-like transcripts longer than 200 bp that have no protein-coding potential. lncRNAs have recently been implicated in epigenetic regulation, transcriptional and post-transcriptional gene regulation, and regulation of genomic stability in mammals, Caenorhabditis elegans, and Drosophila melanogaster. Using deep RNA sequencing of multiple Anopheles gambiae life stages, we have identified over 600 novel lncRNAs and more than 200 previously unannotated putative protein-coding genes. The lncRNAs exhibit differential expression profiles across life stages and adult genders. Those lncRNAs that are antisense to known protein-coding genes or are contained within intronic regions of protein-coding genes may mediate transcriptional repression or stabilization of associated mRNAs. lncRNAs exhibit faster rates of sequence evolution across anophelines compared to previously known and newly identified protein-coding genes. This initial description of lncRNAs in An. gambiae offers the first genome-wide insights into long non-coding RNAs in this vector mosquito and defines a novel set of potential targets for the development of vector-based interventions that may curb the human malaria burden in disease-endemic countries.

Mito-seek enables deep analysis of mitochondrial DNA, revealing ubiquitous, stable heteroplasmy maintained by intercellular exchange

Ravi Sachidanandam, Anitha D Jayaprakash, Erica Benson, Raymond Liang, Jaehee Shim, Luca Lambertini, Mike Wigler, Stuart Aaronson
doi: http://dx.doi.org/10.1101/007005

Eukaryotic cells carry two genomes, nuclear (nDNA) and mitochondrial (mtDNA), which are ostensibly decoupled in their replication, segregation and inheritance. It is increasingly appreciated that heteroplasmy, the occurrence of multiple mtDNA haplotypes in a cell, plays an important biological role, but its features are not well understood. Until now, accurately determining the diversity of mtDNA has been difficult due to the relatively small amount of mtDNA in each cell ( 98%) mtDNA and its ability to detect rare variants is limited only by sequencing depth, providing unprecedented sensitivity and specificity. Using Mito-seek, we confirmed the ubiquity of heteroplasmy by analyzing mtDNA from a diverse set of cell lines and human samples. By applying Mito-seek to colonies derived from single cells, we showed that heteroplasmy is stably maintained in individual daughter cells over multiple cell divisions. Our simulations indicate that the stability of heteroplasmy can be facilitated by the exchange of mtDNA between cells. We also explicitly demonstrate this exchange by co-culturing cell lines with distinct mtDNA haplotypes. Our results shed new light on the maintenance of heteroplasmy and provide a novel platform to investigate various features of heteroplasmy in normal and diseased tissues.

inPHAP: Interactive visualization of genotype and phased haplotype data

Günter Jäger, Alexander Peltzer, Kay Nieselt
Comments: BioVis 2014 conference
Subjects: Graphics (cs.GR); Genomics (q-bio.GN)

Background: To understand individual genomes it is necessary to look at the variations that lead to changes in phenotype and possibly to disease. However, genotype information alone is often not sufficient and additional knowledge regarding the phase of the variation is needed to make correct interpretations. Interactive visualizations that allow the user to explore the data in various ways can be of great assistance in the process of making well informed decisions. However, there is currently a lack of visualizations that are able to deal with phased haplotype data. Results: We present inPHAP, an interactive visualization tool for genotype and phased haplotype data. inPHAP features a variety of interaction possibilities such as zooming, sorting, filtering and aggregation of rows in order to explore patterns hidden in large genetic data sets. As a proof of concept, we apply inPHAP to the phased haplotype data set of Phase 1 of the 1000 Genomes Project. Thereby, inPHAP’s ability to show genetic variations at the population level as well as at the individual level is demonstrated for several disease related loci. Conclusions: As of today, inPHAP is the only visual analytical tool that allows the user to explore unphased and phased haplotype data interactively. Due to its highly scalable design, inPHAP can be applied to large datasets with up to 100 GB of data, enabling users to visualize even large scale input data. inPHAP closes the gap between common visualization tools for unphased genotype data and introduces several new features, such as the visualization of phased data.

Single haplotype assembly of the human genome from a hydatidiform mole

Karyn Meltz Steinberg, Valerie K Schneider, Tina A Graves-Lindsay, Robert S Fulton, Richa Agarwala, John Huddleston, Sergey A Shiryayev, Aleksandr Morgulis, Urvashi Surti, Wesley C Warren, Deanna M Church, Evan E Eichler, Richard K Wilson

An accurate and complete reference human genome sequence assembly is essential for accurately interpreting individual genomes and associating sequence variation with disease phenotypes. While the current reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can help overcome these problems, even the longest available reads do not resolve all regions of the human genome. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones, an optical map, and 100X whole genome shotgun (WGS) sequence coverage using short (Illumina) read pairs. We used the WGS sequence and the GRCh37 reference assembly to create a sequence assembly of the CHM1 genome. We subsequently incorporated 382 finished CHORI-17 BAC clone sequences to generate a second draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene and repeat content shows this assembly to be of excellent quality and contiguity, and comparisons to ClinVar and the NHGRI GWAS catalog show that the CHM1 genome does not harbor an excess of deleterious alleles. However, comparison to assembly-independent resources, such as BAC clone end sequences and long reads generated by a different sequencing technology (PacBio), indicates misassembled regions. The great majority of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future by sequencing BAC clone tiling paths. This publicly available first generation assembly will be integrated into the Genome Reference Consortium (GRC) curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity

Corey T Watson, Karyn Meltz Steinberg, Tina A Graves-Lindsay, Rene L Warren, Maika Malig, Jacqueline E Schein, Richard K Wilson, Rob Holt, Evan Eichler, Felix Breden

Germline variation at immunoglobulin gene (IG) loci is critical for pathogen-mediated immunity, but establishing complete reference sequences in these regions is problematic because of segmental duplications and somatically rearranged source DNA. We sequenced BAC clones from the essentially haploid hydatidiform mole, CHM1, across the light chain IG loci, kappa (IGK) and lambda (IGL), creating single haplotype representations of these regions. The IGL haplotype is 1.25 Mb of contiguous sequence with four novel V gene alleles, one novel C gene allele, and an 11.9 kbp insertion. The IGK haplotype consists of two contigs, a 644 kbp proximal contig and a 466 kbp distal contig, separated by a gap also present in the reference genome sequence. Our effort added 49 kbp of unique sequence extending into this gap. The IGK haplotype contains six novel V gene alleles, one novel J gene allele, and a 16.7 kbp region with increased sequence identity between the two IGK contigs, exhibiting signatures of interlocus gene conversion. Our data facilitated the first comparison of nucleotide diversity between the light and IG heavy (IGH) chain haplotypes within a single genome, revealing a three- to six-fold enrichment in the IGH locus, supporting the theory that the heavy chain may be more important in determining antigenic specificity.

Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling


Stinus Lindgreen, Sinan Ugur Umu, Alicia Sook-Wei Lai, Hisham Eldai, Wenting Liu, Stephanie McGimpsey, Nicole Wheeler, Patrick J. Biggs, Nick R. Thomson, Lars Barquist, Anthony M. Poole, Paul P. Gardner
Comments: 16 pages, 4 figures
Subjects: Genomics (q-bio.GN)

Noncoding RNAs are increasingly recognized as integral to a wide range of biological processes, including translation, gene regulation, host-pathogen interactions and environmental sensing. While genomics is now a mature field, our capacity to identify noncoding RNA elements in bacterial and archaeal genomes is hampered by the difficulty of de novo identification. The emergence of new technologies for characterizing transcriptome outputs, notably RNA-seq, is improving noncoding RNA identification and expression quantification. However, a major challenge is to robustly distinguish functional outputs from transcriptional noise. To establish whether annotation of existing transcriptome data has effectively captured all functional outputs, we analysed over 400 publicly available RNA-seq datasets spanning 37 different Archaea and Bacteria. Using comparative tools, we identify close to a thousand highly-expressed candidate noncoding RNAs. However, our analyses reveal that the capacity to identify noncoding RNA outputs is strongly dependent on phylogenetic sampling. Surprisingly, and in stark contrast to protein-coding genes, the phylogenetic window for effective use of comparative methods is perversely narrow: aggregating public datasets only produced one phylogenetic cluster where these tools could be used to robustly separate unannotated noncoding RNAs from a null hypothesis of transcriptional noise. Our results show that for the full potential of transcriptomics data to be realized, a change in experimental design is paramount: effective transcriptomics requires phylogeny-aware sampling.

svaseq: removing batch effects and other unwanted noise from sequencing data

Jeffrey Leek

It is now well known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. We introduced surrogate variable analysis for estimating these artifacts by (1) identifying the part of the genomic data only affected by artifacts and (2) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors. Here I describe an update to the sva approach that can be applied to analyze count data or FPKMs from sequencing experiments. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. These updates are available through the surrogate variable analysis (sva) Bioconductor package.
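
Step (2) of the abstract, estimating an artifact from singular vectors of the data matrix, can be sketched in a few lines. This is a heavily simplified illustration and not the sva or svaseq algorithm: it assumes a single known covariate, treats every feature as belonging to the artifact-affected subset (so step (1) is skipped), and extracts only the leading singular vector by power iteration. All function names here are hypothetical and do not correspond to the Bioconductor package's API.

```python
import random


def residualize(row, covariate):
    """Remove the intercept and the linear covariate effect from one feature."""
    n = len(row)
    mean_y = sum(row) / n
    mean_x = sum(covariate) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, row))
    slope = sxy / sxx if sxx else 0.0
    return [y - mean_y - slope * (x - mean_x) for x, y in zip(covariate, row)]


def leading_sample_vector(matrix, iterations=100, seed=0):
    """Leading right singular vector (one value per sample) by power iteration."""
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    # Random start: a constant start would be orthogonal to centered residuals.
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(iterations):
        u = [sum(m * x for m, x in zip(row, v)) for row in matrix]  # u = M v
        w = [sum(row[j] * u_i for row, u_i in zip(matrix, u))       # w = M^T u
             for j in range(n_cols)]
        norm = sum(x * x for x in w) ** 0.5
        if norm < 1e-12:  # residuals are numerically zero; nothing to estimate
            break
        v = [x / norm for x in w]
    return v


def estimate_surrogate(matrix, covariate):
    """Rows are features, columns are samples; returns one surrogate variable."""
    residuals = [residualize(row, covariate) for row in matrix]
    return leading_sample_vector(residuals)
```

The returned vector has one entry per sample and, under these assumptions, could be added as an adjustment factor in a downstream model. Its sign is arbitrary, as with any singular vector.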