QuASAR: Quantitative Allele Specific Analysis of Reads

QuASAR: Quantitative Allele Specific Analysis of Reads

Chris Harvey, Gregory A Moyebrailean, Omar Davis, Xiaoquan Wen, Francesca Luca, Roger Pique-Regi

Expression quantitative trait loci (eQTL) studies have discovered thousands of genetic variants that regulate gene expression and have been crucial to enable a better understanding of the functional role of non-coding sequences. However, eQTL studies are generally quite expensive, requiring a large sample size and genome-wide genotyping. On the other hand, allele specific expression (ASE) is becoming a very popular approach to detect the effect of a genetic variant on gene expression, even with a single individual. This is typically achieved by counting the number of RNA-seq reads for each allele at heterozygous sites and rejecting the null hypothesis of 1:1 ratio. When genotype information is not readily available it could be inferred from the RNA-seq reads directly, but there are no methods available that can incorporate the uncertainty on the genotype call with the ASE inference step. Here, we present QuASAR, Quantitative Allele Specific Analysis of Reads, a novel statistical learning method for jointly detecting heterozygote genotypes and inferring ASE. The proposed ASE inference step takes into consideration the uncertainty in the genotype calls while including parameters that model base-call errors in sequencing and allelic over-dispersion. We validated our method with experimental data for which high quality genotypes are available. Results on an additional dataset with multiple replicates at different sequencing depths demonstrate that QuASAR is a powerful tool for ASE analysis when genotypes are not available.

A Statistical Test for Clades in Phylogenies

A Statistical Test for Clades in Phylogenies

Thurston H. Y. Dang, Elchanan Mossel
(Submitted on 29 Jul 2014)

We investigated testing the likelihood of a phylogenetic tree by comparison to its subtree pruning and regrafting (SPR) neighbors, with or without re-optimizing branch lengths. This is inspired by aspects of Bayesian significance tests, and the use of SPRs for heuristically finding maximum likelihood trees. Through a number of simulations with the Jukes-Cantor model on various topologies, it is observed that the SPR tests are informative, and reasonably fast compared to searching for the maximum likelihood tree. This suggests that the SPR tests would be a useful addition to the suite of existing statistical tests, for identifying potential inaccuracies of inferred topologies.

Are all genetic variants in DNase I sensitivity regions functional?

Are all genetic variants in DNase I sensitivity regions functional?

Gregory A Moyerbrailean, Chris T Harvey, Cynthia A Kalita, Xiaoquan Wen, Francesca Luca, Roger Pique-Regi

A detailed mechanistic understanding of the direct functional consequences of DNA variation on gene regulatory mechanism is critical for a complete understanding of complex trait genetics and evolution. Here, we present a novel approach that integrates sequence information and DNase I footprinting data to predict the impact of a sequence change on transcription factor binding. Applying this approach to 653 DNase-seq samples, we identified 3,831,862 regulatory variants predicted to affect active regulatory elements for a panel of 1,372 transcription factor motifs. Using QuASAR, we validated the non-coding variants predicted to be functional by examining allele-specific binding (ASB). Combining the predictive model and the ASB signal, we identified 3,217 binding variants within footprints that are significantly imbalanced (20% FDR). Even though most variants in DNase I hypersensitive regions may not be functional, we estimate that 56% of our annotated functional variants show actual evidence of ASB. To assess the effect these variants may have on complex phenotypes, we examined their association with complex traits using GWAS and observed that ASB-SNPs are enriched 1.22-fold for complex traits variants. Furthermore, we show that integrating footprint annotations into GWAS meta-study results improves identification of likely causal SNPs and provides a putative mechanism by which the phenotype is affected.

Bayesian mixture analysis for metagenomic community profiling.

Bayesian mixture analysis for metagenomic community profiling.

Sofia Morfopoulou, Vincent Plagnol

Deep sequencing of clinical samples is now an established tool for the detection of infectious pathogens, with direct medical applications. The large amount of data generated provides an opportunity to detect species even at very low levels, provided that computational tools can effectively interpret potentially complex metagenomic mixtures. Data interpretation is complicated by the fact that short sequencing reads can match multiple organisms and by the lack of completeness of existing databases, in particular for viral pathogens. This interpretation problem can be formulated statistically as a mixture model, where the species of origin of each read is missing, but the complete knowledge of all species present in the mixture helps with the individual reads assignment. Several analytical tools have been proposed to approximately solve this computational problem. Here, we show that the use of parallel Monte Carlo Markov chains (MCMC) for the exploration of the species space enables the identification of the set of species most likely to contribute to the mixture. The added accuracy comes at a cost of increased computation time. Our approach is useful for solving complex mixtures involving several related species. We designed our method specifically for the analysis of deep transcriptome sequencing datasets and with a particular focus on viral pathogen detection, but the principles are applicable more generally to all types of metagenomics mixtures. The code is available on github (http://github.com/smorfopoulou/metaMix) and the process is currently being implemented in a user friendly R package (metaMix, to be submitted to CRAN).

Long non-coding RNA discovery in Anopheles gambiae using deep RNA sequencing

Long non-coding RNA discovery in Anopheles gambiae using deep RNA sequencing

Adam M Jenkins, Robert M Waterhouse, Alan S Kopin, Marc A.T. Muskavitch

Long non-coding RNAs (lncRNAs) are mRNA-like transcripts longer than 200 bp that have no protein-coding potential. lncRNAs have recently been implicated in epigenetic regulation, transcriptional and post-transcriptional gene regulation, and regulation of genomic stability in mammals, Caenorhabditis elegans, and Drosophila melanogaster. Using deep RNA sequencing of multiple Anopheles gambiae life stages, we have identified over 600 novel lncRNAs and more than 200 previously unannotated putative protein-coding genes. The lncRNAs exhibit differential expression profiles across life stages and adult genders. Those lncRNAs that are antisense to known protein-coding genes or are contained within intronic regions of protein-coding genes may mediate transcriptional repression or stabilization of associated mRNAs. lncRNAs exhibit faster rates of sequence evolution across anophelines compared to previously known and newly identified protein-coding genes. This initial description of lncRNAs in An. gambiae offers the first genome-wide insights into long non-coding RNAs in this vector mosquito and defines a novel set of potential targets for the development of vector-based interventions that may curb the human malaria burden in disease-endemic countries.

Author post: Facilitated diffusion buffers noise in gene expression

This guest post is by Radu Zabet on his preprint (with Armin Schoech) Facilitated diffusion buffers noise in gene expression, arXived here.

How does the binding dynamics of transcription factors affect the noise in gene expression?

Transcription factors (TFs) are proteins that bind to DNA and control gene activity. Gene regulation can be modelled as a chemical reaction, which is fundamentally a stochastic process. Given the importance of an accurate control of the gene regulatory program in the cell, significant efforts have been made in understanding the noise properties of gene expression.

Why can noise in gene expression be modelled assuming an ON/OFF gene model?

With few exceptions, previous studies investigated the noise in gene expression assuming that the regulatory process is a two-state Markov model (genes switch stochastically between ON and OFF states). However, it is known that, mechanistically, transcription factors find their genomic target sites through facilitated diffusion, a combination of 3D diffusion in the cytoplasm/nucleoplasm and 1D random walk along the DNA, and this is likely to influence the noise properties of the gene regulation process. Previous experimental studies (e.g. see http://www.nature.com/ng/journal/v43/n6/full/ng.821.html) successfully modelled the noise measured experimentally by assuming an ON/OFF gene model (two-state Markov model) in bacterial and animal cells. In this manuscript, we built a three-state Markov model that accurately models the facilitated diffusion and we showed that for biologically relevant parameters, at least in bacteria (we assumed lac repressor system http://www.sciencemag.org/content/336/6088/1595), noise in gene expression can be modelled assuming the ON/OFF gene model, but only if the binding/unbinding rates are adjusted accordingly. This explains why in many cases the experimental noise in gene regulation can be modelled assuming an ON/OFF gene model. Note that there are several exceptions where the noise in gene expression does not seem to be accounted by the ON/OFF gene model (e.g. http://genome.cshlp.org/content/early/2014/07/16/gr.168773.113 or http://www.pnas.org/content/111/29/10598).

What is the effect of facilitated diffusion on the noise in gene expression?

Next, assuming the ON/OFF gene model we investigated the evolutionary advantage that a TF, which performs facilitated diffusion, has on noise in gene expression compared to an equivalent TF that only performs the 3D diffusion (and does not perform 1D random walk on the DNA). Our results show that the noise in gene expression can be reduced significantly when the TF performs facilitated diffusion compared to its equivalent TF that only performs 3D diffusion in the cell. This is important, because while the majority of the studies identify the speedup in the binding site search process as the main evolutionary advantage of why facilitated diffusion exists, we show that, in addition to this speedup in binding kinetics, facilitated diffusion also reduces the noise in gene expression. Interestingly, it seems that the noise level in gene expression is reduced close to the noise level of an unregulated gene (the lowest noise level in gene expression that could be achieved), while the noise of an equivalent TF that performs only 3D diffusion is significantly higher.

Finally, to test our model, we parameterise it with values measured experimentally in the case of lac repressor in E. coli and we estimated the mean mRNA level to be 0.16 and the Fano factor (variance divided by mean) to be 1.3 (as opposed to 2.0 in the case of TF performing only 3D diffusion). These values are similar to the values measured experimentally in the low inducer case of Plac by http://www.nature.com/ng/journal/v43/n6/full/ng.821.html (mean mRNA level of 0.15 and Fano factor of 1.25) and shows that facilitated diffusion is essential in explaining the experimentally measured noise in mRNA.

Author post: Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans

This guest post is by Gundula Povysil and Sepp Hochreiter on their preprint Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans, bioRxived here.

We completed our preprint Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans in bioRxiv by presenting results not only for chromosome 1 but now for all autosomes and chromosome X.

In this manuscript we analyze the sharing of very short identity by descent (IBD) segments between humans, Neandertals, and Denisovans to gain new insights into their demographic history. In the updated version we included a separate chromosome X analysis (both IBD segment sharing and length of segments). We identified IBD segments in the 1000 Genomes Project sequencing data using our recently published method HapFABIA, many of which are shared with Neandertals or Denisovans.

Here we highlight the most interesting findings of our analysis:

Introgression from Denisovans into ancestors of Asians:

The Denisova genome most prominently matches IBD segments that are shared by Asians and on average these segments are longer than segments shared between other continental populations and the Denisova genome. Therefore, we could confirm an introgression from Denisovans into ancestors of Asians after their migration out of Africa.

Introgression from Neandertals into ancestors of Europeans and Asians:

While Neandertal-matching IBD segments are most often shared by Asians, Europeans share a considerably higher percentage of IBD segments with Neandertals compared to other populations, too. Neandertal-matching IBD segments that are shared by Asians or Europeans are longer than those observed in Africans. These IBD segments hint at a gene flow from Neandertals into ancestors of Asians and Europeans after they left Africa.

Ancient Neandertal and Denisova IBD segments survived only in Africans

Interestingly, many Neandertal- and/or Denisova-matching IBD segments are predominantly observed in Africans – some of them even exclusively. IBD segments shared between Africans and Neandertals or Denisovans are strikingly short, therefore we assume that they are very old. Consequently, we conclude that DNA regions from ancestors of humans, Neandertals, and Denisovans have survived in Africans.

Neandertal but no Denisova introgression on the X chromosome

Neandertal-matching IBD segments on chromosome X confirm gene flow from Neandertals into ancestors of Asians and Europeans outside Africa. Interestingly, there is hardly any signal of Denisova introgression on the X chromosome.

We highly appreciate any comments, discussions, or thoughts on our results.

Facilitated diffusion buffers noise in gene expression

Facilitated diffusion buffers noise in gene expression

Armin Schoech, Nicolae Radu Zabet
(Submitted on 22 Jul 2014)

Transcription factors perform facilitated diffusion (3D diffusion in the cytosol and 1D diffusion on the DNA) when binding to their target sites to regulate gene expression. Here, we investigated the influence of this binding mechanism on the noise in gene expression. Our results showed that, for biologically relevant parameters, the binding process can be represented by a two-state Markov model and that the accelerated target finding due to facilitated diffusion leads to a reduction in both the mRNA and the protein noise.

Statistical and conceptual challenges in the comparative analysis of principal components

Statistical and conceptual challenges in the comparative analysis of principal components

Josef C Uyeda, Daniel S. Caetano, Matthew W Pennell

Quantitative geneticists long ago recognized the value of studying evolution in a multivariate framework (Pearson, 1903). Due to linkage, pleiotropy, coordinated selection and mutational covariance, the evolutionary response in any phenotypic trait can only be properly understood in the context of other traits (Lande, 1979; Lynch and Walsh, 1998). This is of course also well?appreciated by comparative biologists. However, unlike in quantitative genetics, most of the statistical and conceptual tools for analyzing phylogenetic comparative data (recently reviewed in Pennell and Harmon, 2013) are designed for analyzing a single trait (but see, for example Revell and Harmon, 2008; Revell and Harrison, 2008; Hohenlohe and Arnold, 2008; Revell and Collar, 2009; Schmitz and Motani, 2011; Adams, 2014b). Indeed, even classical approaches for testing for correlated evolution between two traits (e.g., Felsenstein, 1985; Grafen, 1989; Harvey and Pagel, 1991) are not actually multivariate as each trait is assumed to have evolved under a process that is independent of the state of the other (Hansen and Orzack, 2005; Hansen and Bartoszek, 2012). As a result of these limitations, researchers with multivariate datasets are often faced with a choice: analyze each trait as if they were independent or else decompose the dataset into statistically independent set of traits, such that each set can be analyzed with the univariate methods.

Author post: Single haplotype assembly of the human genome from a hydatidiform mole

This guest post is by Karyn Meltz Steinberg on her preprint (with coauthors) Single haplotype assembly of the human genome from a hydatidiform mole, bioRxived here.

The human reference sequence is a mosaic of many DNA sources patched together to create the (mostly) contiguous chromosomes we all use every day in genomics labs. This mélange of haplotypes can result in reference representations that do not exist. For example, in GRCh37 at the MRC1 locus mixed haplotypes led to the presence of two gene models that represent false duplications and a gap that affected alignments of short reads. The problem of diploid source DNA was even worse in regions of structural variation where it was difficult to distinguish allelic variation from paralogous variation. The assembly structure in these regions was often wrong with either a collapsed assembly leading to missing sequence or with haplotype expanse, meaning 2, 3 or more haplotypes were represented on the chromosome sequence. The genomic resources associated with the essentially haploid complete hydatidiform mole, CHM1, have opened the door to allow us to address these issues.

What is essentially haploid? Well, what usually happens is that a sperm (usually bearing an X chromosome) fertilizes an egg that doesn’t have a nucleus (thus no DNA), the sperm DNA doubles, giving the correct chromosome complement, but with both pairs of chromosomes being identical. This cell divides and grows but does not form a normal embryo. In the early 1990s, Dr. Urvashi Surti was able to make an immortalized cell line from CHM1 tissue using hTERT. She karyotyped each passage to check that it maintained ploidy and that there were no gross somatic rearrangements.

Dr. Pieter de Jong then created an indexed BAC library from this high fidelity material. These BACs could then be used to resolve structurally complex regions such as 17q21.311. We continued working on sequencing more tiling paths across structurally complex regions; however, it was not practical or cost-efficient to Sanger sequence every single clone in the BAC library. As work continued, it became clear that developing the Primary Assembly using a single haplotype resource could be very powerful. This was possible due to the efforts of the Genome Reference Consortium (GRC) to extend the assembly model to include more than one sequence representation for a give region. We used Illumina sequencing technology and a reference based assembly algorithm developed at NCBI to produce an initial assembly. We then integrated the BAC sequences into the assembly to improve regions that are nearly impossible to assembly using whole genome strategies. The result is the highest quality whole genome sequence human genome assembly that is publicly available to date as assessed by metrics including contig and scaffold N50, repetitive element content and gene annotation.

So what–how can this help us, you ask? For the first time, one single haplotype of the human genome is represented. The fact that CHM1 is haploid means that we are able to finally go into the messy regions of the genome and resolve the genomic architecture as well as put any structural variation in the context of surrounding linked allelic variation. These are often biologically interesting regions; for example genes related to immune response and metabolism that are probably associated with complex traits are usually members of large gene families in segmentally duplicated sequence.

A fine example of the power of this haploid resource is also on BioRxiv, “Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity” (disclosure: this is also co-authored by me). The IG light chain genes encode for one part of immunoglobulin molecules that are expressed by B cells in response to antigenic stimulation (the heavy chain, IGH, was also resolved using CHM1 resources last year2). They are part of large gene families formed by duplication at three loci in the human genome. In previous versions of the reference assembly, these loci were comprised of sequence from multiple DNA sources that may have undergone somatic rearrangement. By sequencing BAC clones in a tiling path across these loci we now have a single haplotype representation of germline DNA sequence that allows us to perform accurate analyses of variation.

REFERENCES

1 Itsara, A. et al. Resolving the breakpoints of the 17q21.31 microdeletion syndrome with next-generation sequencing. American journal of human genetics 90, 599-613, doi:10.1016/j.ajhg.2012.02.013 (2012).
2 Watson, C. T. et al. Complete haplotype sequence of the human immunoglobulin heavy-chain variable, diversity, and joining genes and characterization of allelic and copy-number variation. American journal of human genetics 92, 530-546 (2013).