Mito-seek enables deep analysis of mitochondrial DNA, revealing ubiquitous, stable heteroplasmy maintained by intercellular exchange

Ravi Sachidanandam, Anitha D Jayaprakash, Erica Benson, Raymond Liang, Jaehee Shim, Luca Lambertini, Mike Wigler, Stuart Aaronson
doi: http://dx.doi.org/10.1101/007005

Eukaryotic cells carry two genomes, nuclear (nDNA) and mitochondrial (mtDNA), which are ostensibly decoupled in their replication, segregation and inheritance. It is increasingly appreciated that heteroplasmy, the occurrence of multiple mtDNA haplotypes in a cell, plays an important biological role, but its features are not well understood. Until now, accurately determining the diversity of mtDNA has been difficult due to the relatively small amount of mtDNA in each cell. Mito-seek yields highly pure (>98%) mtDNA, and its ability to detect rare variants is limited only by sequencing depth, providing unprecedented sensitivity and specificity. Using Mito-seek, we confirmed the ubiquity of heteroplasmy by analyzing mtDNA from a diverse set of cell lines and human samples. By applying Mito-seek to colonies derived from single cells, we showed that heteroplasmy is stably maintained in individual daughter cells over multiple cell divisions. Our simulations indicate that the stability of heteroplasmy can be facilitated by the exchange of mtDNA between cells. We also explicitly demonstrate this exchange by co-culturing cell lines with distinct mtDNA haplotypes. Our results shed new light on the maintenance of heteroplasmy and provide a novel platform to investigate various features of heteroplasmy in normal and diseased tissues.
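The abstract's simulations are not reproduced here, but the core idea can be sketched in a toy model (all parameters hypothetical): random segregation of mtDNA copies at each cell division drives isolated cells toward homoplasmy, while modest intercellular exchange pulls every cell back toward the population average, stabilizing heteroplasmy.

```python
import random

def generation(cells, copies, exchange=0.0):
    """One cell generation: each cell's haplotype-A fraction first moves toward
    the population average (intercellular mtDNA exchange), then drifts by
    binomial resampling of its mtDNA copies (random segregation)."""
    pool = sum(cells) / len(cells)
    out = []
    for p in cells:
        p = (1 - exchange) * p + exchange * pool
        out.append(sum(random.random() < p for _ in range(copies)) / copies)
    return out

def variance(xs):
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

random.seed(1)
isolated = [0.5] * 100     # 100 cells, each starting at 50% heteroplasmy
mixing = [0.5] * 100
for _ in range(60):
    isolated = generation(isolated, copies=100, exchange=0.0)
    mixing = generation(mixing, copies=100, exchange=0.2)

# Drift alone pushes cells toward homoplasmy (high variance across cells);
# exchange keeps every cell near 50/50 (low variance).
print(variance(isolated) > 5 * variance(mixing))
```

With drift alone the inter-cell variance grows toward p(1-p); exchange caps it near the per-division resampling noise, which is the qualitative behavior the abstract describes.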

Phylogenomic analyses of deep gastropod relationships reject Orthogastropoda

Felipe Zapata, Nerida G Wilson, Mark Howison, Sónia CS Andrade, Katharina M Jörger, Michael Schrödl, Freya E Goetz, Gonzalo Giribet, Casey W Dunn
doi: http://dx.doi.org/10.1101/007039

Gastropods are a highly diverse clade of molluscs that includes many familiar animals, such as limpets, snails, slugs, and sea slugs. It is one of the most abundant groups of animals in the sea and the only molluscan lineage that has successfully colonised land. Yet the relationships among and within its constituent clades have remained in flux for over a century of morphological, anatomical and molecular study. Here we re-evaluate gastropod phylogenetic relationships by collecting new transcriptome data for 40 species and analysing them in combination with publicly available genomes and transcriptomes. Our datasets include all five main gastropod clades: Patellogastropoda, Vetigastropoda, Neritimorpha, Caenogastropoda and Heterobranchia. We use two different methods to assign orthology, subsample each of these matrices into three increasingly dense subsets, and analyse all six of these supermatrices with two different models of molecular evolution. All twelve analyses yield the same unrooted network connecting the five major gastropod lineages. This reduces deep gastropod phylogeny to three alternative rooting hypotheses. These results reject the prevalent hypothesis of gastropod phylogeny, Orthogastropoda. Our dated tree is congruent with a possible end-Permian recovery of some gastropod clades, namely Caenogastropoda and some Heterobranchia subclades.

Single haplotype assembly of the human genome from a hydatidiform mole

Karyn Meltz Steinberg, Valerie K Schneider, Tina A Graves-Lindsay, Robert S Fulton, Richa Agarwala, John Huddleston, Sergey A Shiryayev, Aleksandr Morgulis, Urvashi Surti, Wesley C Warren, Deanna M Church, Evan E Eichler, Richard K Wilson

An accurate and complete reference human genome sequence assembly is essential for accurately interpreting individual genomes and associating sequence variation with disease phenotypes. While the current reference genome sequence is of very high quality, gaps and misassemblies remain due to biological and technical complexities. Large repetitive sequences and complex allelic diversity are the two main drivers of assembly error. Although increasing the length of sequence reads and library fragments can help overcome these problems, even the longest available reads do not resolve all regions of the human genome. In order to overcome the issue of allelic diversity, we used genomic DNA from an essentially haploid hydatidiform mole, CHM1. We utilized several resources from this DNA including a set of end-sequenced and indexed BAC clones, an optical map, and 100X whole genome shotgun (WGS) sequence coverage using short (Illumina) read pairs. We used the WGS sequence and the GRCh37 reference assembly to create a sequence assembly of the CHM1 genome. We subsequently incorporated 382 finished CHORI-17 BAC clone sequences to generate a second draft assembly, CHM1_1.1 (NCBI AssemblyDB GCA_000306695.2). Analysis of gene and repeat content shows this assembly to be of excellent quality and contiguity, and comparisons to ClinVar and the NHGRI GWAS catalog show that the CHM1 genome does not harbor an excess of deleterious alleles. However, comparison to assembly-independent resources, such as BAC clone end sequences and long reads generated by a different sequencing technology (PacBio), indicates misassembled regions. The great majority of these regions are enriched for structural variation and segmental duplication, and can be resolved in the future by sequencing BAC clone tiling paths.
This publicly available first generation assembly will be integrated into the Genome Reference Consortium (GRC) curation framework for further improvement, with the ultimate goal being a completely finished gap-free assembly.

Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity

Corey T Watson, Karyn Meltz Steinberg, Tina A Graves-Lindsay, Rene L Warren, Maika Malig, Jacqueline E Schein, Richard K Wilson, Rob Holt, Evan Eichler, Felix Breden

Germline variation at immunoglobulin gene (IG) loci is critical for pathogen-mediated immunity, but establishing complete reference sequences in these regions is problematic because of segmental duplications and somatically rearranged source DNA. We sequenced BAC clones from the essentially haploid hydatidiform mole, CHM1, across the light chain IG loci, kappa (IGK) and lambda (IGL), creating single haplotype representations of these regions. The IGL haplotype is 1.25Mb of contiguous sequence with four novel V gene and one novel C gene alleles and an 11.9kbp insertion. The IGK haplotype consists of two 644kbp proximal and 466kbp distal contigs separated by a gap also present in the reference genome sequence. Our effort added an additional 49kbp of unique sequence extending into this gap. The IGK haplotype contains six novel V gene and one novel J gene alleles and a 16.7kbp region with increased sequence identity between the two IGK contigs, exhibiting signatures of interlocus gene conversion. Our data facilitated the first comparison of nucleotide diversity between the light and IG heavy (IGH) chain haplotypes within a single genome, revealing a three- to six-fold enrichment in the IGH locus, supporting the theory that the heavy chain may be more important in determining antigenic specificity.
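As a reminder of the statistic behind that three- to six-fold comparison, here is a minimal sketch of nucleotide diversity (pi, the average number of pairwise differences per aligned site) on made-up haplotypes; the sequences below are purely illustrative, not IG data.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Pi: average pairwise differences per site over aligned sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return diffs / (len(pairs) * len(seqs[0]))

# Toy aligned haplotypes (hypothetical, not real IG sequence)
light_like = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]   # one variable site
heavy_like = ["ACGTACGT", "ATGAACTT", "ACGAACGT"]   # several variable sites

pi_light = nucleotide_diversity(light_like)
pi_heavy = nucleotide_diversity(heavy_like)
print(round(pi_heavy / pi_light, 1))  # → 3.0, within the reported 3-6x range
```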

Most viewed on Haldane’s Sieve: June 2014

The most viewed posts on Haldane’s Sieve in June 2014 were:

Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data

Anand Bhaskar, Y.X. Rachel Wang, Yun S. Song

With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal which is difficult to pick up with small sample sizes. Lastly, we apply our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing dataset of tens of thousands of individuals assayed at a few hundred genic regions.
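To illustrate why automatic differentiation matters here, a self-contained sketch using a hand-rolled forward-mode Dual type: the expected-SFS model below, e_i(r) = (theta/i)·exp(-r·i), is a deliberately simplified stand-in for the paper's coalescent expectation, but it shows how exact gradients of a Poisson frequency-spectrum likelihood are obtained without choosing a finite-difference step size.

```python
import math

class Dual:
    """Forward-mode automatic differentiation: each value carries its
    exact derivative alongside it."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __sub__(self, o):
        o = self._lift(o)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __mul__(self, o):
        o = self._lift(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__
    def exp(self):
        e = math.exp(self.val)
        return Dual(e, e * self.dot)
    def log(self):
        return Dual(math.log(self.val), self.dot / self.val)

def loglik(r, sfs, theta=10.0):
    """Poisson log-likelihood of observed frequency-spectrum counts under a
    toy expected-SFS model e_i(r) = (theta / i) * exp(-r * i) -- a simplified
    stand-in for the coalescent expectation used in the paper."""
    ll = Dual(0.0)
    for i, n_i in enumerate(sfs, start=1):
        e_i = (theta / i) * ((-i) * r).exp()
        ll = ll + n_i * e_i.log() - e_i
    return ll

obs = [42, 18, 9, 6, 4]          # hypothetical observed SFS counts
g = loglik(Dual(0.5, 1.0), obs)  # seed dr/dr = 1 to get d(loglik)/dr
eps = 1e-6
fd = (loglik(Dual(0.5 + eps), obs).val
      - loglik(Dual(0.5 - eps), obs).val) / (2 * eps)
print(abs(g.dot - fd) < 1e-4)    # exact gradient agrees with finite difference
```

The same gradient would feed directly into a standard optimizer when fitting piecewise-exponential size histories, which is the role automatic differentiation plays in the method.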

SRST2: Rapid genomic surveillance for public health and hospital microbiology labs

Michael Inouye, Harriet Dashnow, Lesley Raven, Mark B Schultz, Bernard J Pope, Takehiro Tomita, Justin Zobel, Kathryn E Holt

Rapid molecular typing of bacterial pathogens is critical for public health epidemiology, surveillance and infection control, yet routine use of whole genome sequencing (WGS) for these purposes poses significant challenges. Here we present SRST2, a tool for fast and accurate detection of genes, alleles and multi-locus sequence types from WGS data, which outperforms assembly-based methods. Using >900 genomes from common pathogens, we demonstrate SRST2’s utility for rapid genome surveillance in public health laboratory and hospital infection control settings.
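SRST2 itself scores read alignments against allele databases; as a rough illustration of the idea of calling alleles directly from reads without assembly, here is a toy k-mer-overlap scorer (the database and reads below are invented, and this is not SRST2's actual algorithm).

```python
def kmers(s, k):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def best_allele(reads, alleles, k=5):
    """Score each allele by the fraction of its k-mers observed in the reads
    -- a crude stand-in for SRST2's alignment-based allele scoring."""
    seen = set().union(*(kmers(r, k) for r in reads))
    scores = {name: len(kmers(seq, k) & seen) / len(kmers(seq, k))
              for name, seq in alleles.items()}
    return max(scores, key=scores.get), scores

# Hypothetical two-allele database for one MLST-style locus
alleles = {"abc_1": "ACGTACGGTACCTTGAACGT",
           "abc_2": "ACGTACGGTACGTTGAACGT"}   # differs from abc_1 by one SNP
reads = ["ACGTACGGTA", "CGGTACGTTG", "ACGTTGAACGT"]  # reads drawn from abc_2
call, scores = best_allele(reads, alleles)
print(call)  # → abc_2
```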

svaseq: removing batch effects and other unwanted noise from sequencing data

Jeffrey Leek

It is now well known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. We introduced surrogate variable analysis for estimating these artifacts by (1) identifying the part of the genomic data only affected by artifacts and (2) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors. Here I describe an update to the sva approach that can be applied to analyze count data or FPKMs from sequencing experiments. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. These updates are available through the surrogate variable analysis (sva) Bioconductor package.
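The two-step recipe in the abstract -- isolate the part of the data affected only by artifacts, then estimate the artifacts with singular vectors -- can be sketched on simulated data. This is an illustration of the idea, not the sva package's actual algorithm; all data and parameters are invented.

```python
import random

def group_residuals(X, groups):
    """Subtract per-group means gene by gene: the modeled biological signal
    is removed, so what remains is artifacts plus noise."""
    R = []
    for row in X:
        mean = {g: sum(v for v, h in zip(row, groups) if h == g) / groups.count(g)
                for g in set(groups)}
        R.append([v - mean[g] for v, g in zip(row, groups)])
    return R

def top_right_singular_vector(R, iters=50):
    """Power iteration for the dominant right singular vector of the residual
    matrix -- the surrogate-variable estimate of the hidden artifact."""
    n = len(R[0])
    v = [1.0] + [0.0] * (n - 1)   # start vector not orthogonal to the signal
    for _ in range(iters):
        u = [sum(r * x for r, x in zip(row, v)) for row in R]               # u = R v
        v = [sum(R[i][j] * u[i] for i in range(len(R))) for j in range(n)]  # v = R^T u
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

def corr(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ca, cb = [x - ma for x in a], [y - mb for y in b]
    return sum(x * y for x, y in zip(ca, cb)) / (
        sum(x * x for x in ca) ** 0.5 * sum(y * y for y in cb) ** 0.5)

random.seed(0)
groups = [0, 0, 0, 1, 1, 1]   # modeled condition
batch  = [0, 1, 0, 1, 0, 1]   # hidden batch, deliberately left out of the model
X = []
for _ in range(40):           # 40 genes
    grp_eff, bat_eff = random.gauss(0, 1), random.gauss(0, 2)
    X.append([grp_eff * g + bat_eff * b + random.gauss(0, 0.3)
              for g, b in zip(groups, batch)])

sv = top_right_singular_vector(group_residuals(X, groups))
# The recovered surrogate variable tracks the hidden batch labels
print(abs(corr(sv, batch)) > 0.8)
```

The estimated vector `sv` would then enter downstream model fits as an adjustment covariate, exactly the role surrogate variables play in sva.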

Redefining Genomic Privacy: Trust and Empowerment

Arvind Narayanan, Kenneth Yocum, David Glazer, Nita Farahany, Maynard Olson, Lincoln D. Stein, James B. Williams, Jan A. Witkowski, Robert C. Kain, Yaniv Erlich

Fulfilling the promise of the genetic revolution requires the analysis of large datasets containing information from thousands to millions of participants. However, sharing human genomic data requires protecting subjects from potential harm. Current models rely on de-identification techniques that treat the tradeoff between privacy and data utility as a zero-sum game. Instead we propose using trust-enabling techniques to create a solution where researchers and participants both win. To do so we introduce three principles that facilitate trust in genetic research and outline one possible framework built upon those principles. Our hope is that such trust-centric frameworks provide a sustainable solution that reconciles genetic privacy with data sharing and facilitates genetic research.

Efficient Algorithms for de novo Assembly of Alternative Splicing Events from RNA-seq Data

Gustavo Sacomoto
(Submitted on 23 Jun 2014)

In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events. Finally, we show that it identifies more correct events than general-purpose transcriptome assemblers.
In order to deal with ever-increasing volumes of NGS data, we put extra effort into making KisSplice as scalable as possible. First, to improve its running time, we propose a new polynomial-delay algorithm to enumerate bubbles. We show that it is several orders of magnitude faster than previous approaches. Then, to reduce its memory consumption, we propose a new compact way to build and represent a de Bruijn graph. We show that our approach uses 30% to 40% less memory than the state of the art, with an insignificant impact on the construction time.
Additionally, we apply the techniques developed to list bubbles to two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson's algorithm. This is the first improvement to this problem in almost 40 years. We then consider a different parameterization of the classical K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms with the same time complexities but using exponentially less memory than previous approaches.
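The bubble idea from the abstract's first paragraph can be made concrete in a few lines: in a de Bruijn graph built from reads, a SNP (or splicing variant) shows up as two paths that leave the same node and reconverge. The toy reads and k value below are invented for illustration.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer observed in a read is an edge."""
    g = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k + 1):
            kmer = r[i:i + k]
            g[kmer[:-1]].add(kmer[1:])
    return dict(g)

def walk(g, node, limit=20):
    """Follow unambiguous edges from node until a branch, dead end, or limit."""
    path = [node]
    while len(path) < limit and len(g.get(node, ())) == 1:
        node = next(iter(g[node]))
        path.append(node)
    return path

def find_bubble(g):
    """Return (source, path_a, path_b) for the first simple bubble: two
    paths leaving the same node that reconverge at a common sink."""
    for s, outs in g.items():
        if len(outs) == 2:
            a, b = sorted(outs)
            pa, pb = walk(g, a), walk(g, b)
            meet = next((x for x in pa if x in set(pb)), None)
            if meet:
                return s, pa[:pa.index(meet) + 1], pb[:pb.index(meet) + 1]
    return None

# Hypothetical reads covering two variants of the same region (one SNP)
reads = ["AAGCTT", "GCTTGG", "TTGGC",    # variant 1: ...CTT...
         "AAGCTA", "GCTAGG", "TAGGC"]    # variant 2: ...CTA...
g = de_bruijn(reads, k=4)
source, p1, p2 = find_bubble(g)
print(source, p1[-1])  # → GCT GGC : the SNP appears as a bubble from GCT to GGC
```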