Phen-Gen: Combining Phenotype and Genotype to Analyze Rare Disorders

Asif Javed, Saloni Agrawal, Pauline Ng
doi: http://dx.doi.org/10.1101/015727

We introduce Phen-Gen, a method that combines a patient’s disease symptoms and sequencing data with prior domain knowledge to identify the causative gene(s) for rare disorders. Simulations reveal that the causal variant is ranked first in 88% of cases when it is coding, which is a 52% advantage over a genotype-only approach and outperforms existing methods by 13-58%. If disease etiology is unknown, the causal variant is assigned top rank in 71% of simulations.

Chromosome-scale shotgun assembly using an in vitro method for long-range linkage

Nicholas H. Putnam, Brendan O’Connell, Jonathan C. Stites, Brandon J. Rice, Andrew Fields, Paul D. Hartley, Charles W. Sugnet, David Haussler, Daniel S. Rokhsar, Richard E. Green
Subjects: Genomics (q-bio.GN); Biomolecules (q-bio.BM)

Long-range and highly accurate de novo assembly from short-read data is one of the most pressing challenges in genomics. Recently, it has been shown that read pairs generated by proximity ligation of DNA in chromatin of living tissue can address this problem. These data dramatically increase the scaffold contiguity of assemblies and provide haplotype phasing information. Here, we describe a simpler approach (“Chicago”) based on in vitro reconstituted chromatin. We generated two Chicago datasets with human DNA and used a new software pipeline (“HiRise”) to construct a highly accurate de novo assembly and scaffolding of a human genome with scaffold N50 of 30 Mb. We also demonstrated the utility of Chicago for improving existing assemblies by re-assembling and scaffolding the genome of the American alligator. With a single library and one lane of Illumina HiSeq sequencing, we increased the scaffold N50 of the American alligator from 508 kb to 10 Mb. Our method uses established molecular biology procedures and can be used to analyze any genome, as it requires only about 5 micrograms of DNA as the starting material.
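The scaffold N50 figures quoted in this abstract follow the usual definition: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. The following is a minimal sketch of that calculation, not code from the HiRise pipeline; the function name `n50` and the example lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Illustrative helper (not part of HiRise): sort scaffolds from longest
    to shortest and return the length at which the running total first
    reaches half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Made-up scaffold lengths (arbitrary units), for demonstration only.
    example = [80, 70, 50, 40, 30, 20]
    print(n50(example))
```

Because N50 depends only on the sorted length distribution, joining a few mid-sized scaffolds into one long scaffold can raise it sharply, which is why it is a common summary of scaffolding gains.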

ViennaNGS: A toolbox for building efficient next-generation sequencing analysis pipelines

Michael T. Wolfinger, Jörg Fallmann, Florian Eggenhofer, Fabian Amman
doi: http://dx.doi.org/10.1101/013011

Recent achievements in next-generation sequencing (NGS) technologies have led to a high demand for reusable software components to easily compile customized analysis workflows for big genomics data. We present ViennaNGS, an integrated collection of Perl modules focused on building efficient pipelines for NGS data processing. It comes with functionality for extracting and converting features from common NGS file formats, computation and evaluation of read mapping statistics, as well as normalization of RNA abundance. Moreover, ViennaNGS provides software components for identification and characterization of splice junctions from RNA-seq data, parsing and condensing sequence motif data, automated construction of Assembly and Track Hubs for the UCSC genome browser, as well as wrapper routines for a set of commonly used NGS command line tools.

Dissection of a complex disease susceptibility region using a Bayesian stochastic search approach to fine mapping

Chris Wallace, Antony J Cutler, Nikolas Pontikos, Marcin L Pekalski, Oliver S Burren, Jason D Cooper, Arcadio Rubio Garcia, Ricardo C Ferreira, Hui Guo, Neil M Walker, Deborah J Smyth, Stephen S Rich, Suna Onengut-Gumuscu, Stephen S Sawcer, Maria Ban, Sylvia Richardson, John Todd, Linda Wicker
doi: http://dx.doi.org/10.1101/015164

Identification of candidate causal variants in regions associated with risk of common diseases is complicated by linkage disequilibrium (LD) and multiple association signals. Nonetheless, accurate maps of these variants are needed, both to fully exploit detailed cell-specific chromatin annotation data and highlight disease causal mechanisms and cells, and for design of the functional studies that will ultimately be required to confirm causal mechanisms. We adapted a Bayesian evolutionary stochastic search algorithm to the fine mapping problem, and demonstrated its improved performance over conventional stepwise and regularised regression through simulation studies. We then applied it to fine map the established multiple sclerosis (MS) and type 1 diabetes (T1D) associations in the IL-2RA (CD25) gene region. For T1D, both stepwise and stochastic search approaches identified four T1D association signals, with the major effect tagged by the single nucleotide polymorphism rs12722496. In contrast, for MS, the stochastic search found two distinct competing models: a single candidate causal variant, tagged by rs2104286 and reported previously using conditional analysis; and a more complex model with two association signals, one of which was tagged by the major T1D-associated rs12722496 and the other by rs56382813. There is low to moderate LD between rs2104286 and both rs12722496 and rs56382813 (r² ≈ 0.3), and our two-SNP model could not be recovered through a forward stepwise search after conditioning on rs2104286. Both signals in the two-variant model for MS affect CD25 expression on distinct subpopulations of CD4+ T cells, which are key cells in the autoimmune process. The results support a shared causal variant for T1D and MS. Our study illustrates the benefit of using a purposely designed model search strategy for fine mapping and the advantage of combining disease and protein expression data.
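The LD value quoted in this abstract is the squared correlation between alleles at two sites. As a rough illustration only, here is how that statistic can be computed from phased haplotypes coded 0/1. The function name `ld_r2` and the toy haplotypes are assumptions made for this sketch; they are not taken from the preprint or its data.

```python
def ld_r2(haplotypes):
    """Squared allelic correlation between two biallelic sites.

    `haplotypes` is a list of (a, b) pairs, one per phased haplotype,
    with each allele coded 0 or 1. Illustrative helper only.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n          # allele-1 frequency, site A
    p_b = sum(b for _, b in haplotypes) / n          # allele-1 frequency, site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


if __name__ == "__main__":
    # Toy haplotypes, for demonstration only.
    linked = [(0, 0), (0, 0), (1, 1), (1, 1)]
    unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(ld_r2(linked), ld_r2(unlinked))
```

When r² between two SNPs is well below 1, as here, conditioning on one of them does not remove the signal at the other, which is one reason stepwise and stochastic searches can settle on different models.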

Correcting Illumina sequencing errors for human data

Heng Li
(Submitted on 12 Feb 2015)

Summary: We present a new tool to correct sequencing errors in Illumina data produced from high-coverage whole-genome shotgun resequencing. It uses a non-greedy algorithm and shows comparable performance and higher accuracy in an evaluation on real human data. This evaluation has the most complete collection of high-performance error correctors so far.
Availability and implementation: this https URL

Inferring processes underlying B-cell repertoire diversity

Yuval Elhanati, Zachary Sethna, Quentin Marcou, Curtis G. Callan Jr., Thierry Mora, Aleksandra M. Walczak
(Submitted on 10 Feb 2015)

We quantify the VDJ recombination and somatic hypermutation processes in human B-cells using probabilistic inference methods on high-throughput DNA sequence repertoires of human B-cell receptor heavy chains. Our analysis captures the statistical properties of the naive repertoire, first after its initial generation via VDJ recombination and then after selection for functionality. We also infer statistical properties of the somatic hypermutation machinery (exclusive of subsequent effects of selection). Our main results are the following: the B-cell repertoire is substantially more diverse than T-cell repertoires, due to longer junctional insertions; sequences that pass initial selection are distinguished by having a higher probability of being generated in a VDJ recombination event; somatic hypermutations have a non-uniform distribution along the V gene that is well explained by an independent site model for the sequence context around the hypermutation site.

Discovery of large genomic inversions using pooled clone sequencing

Marzieh Eslami Rasekh, Giorgia Chiatante, Mattia Miroballo, Joyce Tang, Mario Ventura, Chris T Amemiya, Evan E. Eichler, Francesca Antonacci, Can Alkan
doi: http://dx.doi.org/10.1101/015156

There are many different forms of genomic structural variation that can be broadly classified as copy number variation (CNV) and balanced rearrangements. Although many algorithms are now available in the literature that aim to characterize CNVs, discovery of balanced rearrangements (inversions and translocations) remains an open problem. This is mainly because the breakpoints of such events typically lie within segmental duplications and common repeats, which reduce the mappability of short reads. The 1000 Genomes Project spearheaded the development of several methods to identify inversions; however, they are limited to relatively short inversions, and there are currently no available algorithms to discover large inversions using high throughput sequencing technologies (HTS). Here we propose to use a sequencing method (Kitzman et al., 2011), originally developed to improve haplotype resolution, to characterize large genomic inversions. This method, called pooled clone sequencing, merges the advantages of the clone-based sequencing approach with the speed and cost efficiency of HTS technologies. Using data generated with the pooled clone sequencing method, we developed a novel algorithm, dipSeq, to discover large inversions (>500 Kbp). We show the power of dipSeq first on simulated data, and then apply it to the genome of a HapMap individual (NA12878). We were able to accurately discover all previously known and experimentally validated large inversions in the same genome. We also identified a novel inversion, and confirmed it using fluorescent in situ hybridization. Availability: Implementation of the dipSeq algorithm is available at https://github.com/BilkentCompGen/dipseq

Coevolutionary analyses require phylogenetically deep alignments and better null models to accurately detect inter-protein contacts within and between species

Aram Avila-Herrera, Katherine Pollard
doi: http://dx.doi.org/10.1101/014902

When biomolecules physically interact, natural selection operates on them jointly. Contacting positions in protein and RNA structures exhibit correlated patterns of sequence evolution due to constraints imposed by the interaction, and molecular arms races can develop between interacting proteins in pathogens and their hosts. To evaluate how well methods developed to detect coevolving residues within proteins can be adapted for cross-species, inter-protein analysis, we used statistical criteria to quantify the performance of these methods in detecting inter-protein residues within 8 angstroms of each other in the co-crystal structures of 33 bacterial protein interactions. We also evaluated their performance for detecting known residues at the interface of a host-virus protein complex with a partially solved structure. Our quantitative benchmarking showed that all coevolutionary methods clearly benefit from alignments with many sequences. Methods that aim to detect direct correlations generally outperform other approaches. However, faster mutual information based methods are occasionally competitive in small alignments and with relaxed false positive rates. All commonly used null distributions are anti-conservative and have high false positive rates in some scenarios, although the empirical distribution of scores performs reasonably well with deep alignments. We conclude that coevolutionary analysis of cross-species protein interactions holds great promise but requires sequencing many more species pairs.
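For readers unfamiliar with the mutual information based methods mentioned in this abstract, the core score is the mutual information between two alignment columns, estimated from observed residue frequencies. The following is a minimal sketch of that plug-in estimate, with no corrections for phylogeny or small samples; the function name `mutual_information` and the toy columns are assumptions for illustration and are not the benchmarked implementations.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Plug-in mutual information (in bits) between two alignment columns.

    `col_x` and `col_y` hold one residue per sequence, in the same
    sequence order. Illustrative helper only, without any correction
    for phylogenetic or sampling bias.
    """
    if len(col_x) != len(col_y) or len(col_x) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # p_xy / (p_x * p_y), written in terms of raw counts
        mi += p_xy * log2(p_xy * n * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Toy columns, for demonstration only.
    print(mutual_information("AACC", "AACC"))  # perfectly covarying columns
    print(mutual_information("AACC", "ACAC"))  # independent columns
```

With only a handful of sequences the estimate is noisy and biased upward, which is consistent with the abstract's finding that all methods benefit from alignments with many sequences.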

Global determinants of mRNA degradation rates in Saccharomyces cerevisiae

Benjamin Neymotin, Victoria Ettorre, David Gresham
doi: http://dx.doi.org/10.1101/014845

Degradation of mRNA contributes to variation in transcript abundance. Studies of individual mRNAs show that cis and trans factors control mRNA degradation rates. However, transcriptome-wide studies have failed to identify global relationships between transcript properties and mRNA degradation. We investigated the contribution of cis and trans factors to transcriptome-wide degradation rate variation in the budding yeast, Saccharomyces cerevisiae, using multiple regression analysis. We find that multiple transcript properties are associated with mRNA degradation rates and that a model incorporating these factors explains ~50% of the genome-wide variance. Predictors of mRNA degradation rates include transcript length, abundance, ribosome density, codon adaptation index (CAI) and GC content of the third position in codons. To validate these factors we studied individual transcripts expressed from identical promoters. We find that decreasing ribosome density by mutating the translational start site of the GAP1 transcript increases its degradation rate. Using variants of GFP that differ at synonymous sites, we show that increased GC content of the third position of codons results in decreased mRNA degradation rate. Thus, in steady-state conditions, a large fraction of genome-wide variation in mRNA degradation rates is determined by inherent properties of transcripts related to protein translation rather than specific regulatory mechanisms.

Permutation Testing in the Presence of Polygenic Variation

Mark Abney
doi: http://dx.doi.org/10.1101/014571

This article discusses problems with and solutions to performing valid permutation tests for quantitative trait loci in the presence of polygenic effects. Although permutation testing is a popular approach for determining statistical significance of a test statistic with an unknown distribution (for instance, the maximum of multiple correlated statistics or some omnibus test statistic for a gene, gene-set or pathway), naive application of permutations may result in an invalid test. The risk of performing an invalid permutation test is particularly acute in complex trait mapping where polygenicity may combine with a structured population resulting from the presence of families, cryptic relatedness, admixture or population stratification. I give both analytical derivations and a conceptual understanding of why typical permutation procedures fail and suggest an alternative permutation based algorithm, MVNpermute, that succeeds. In particular, I examine the case where a linear mixed model is used to analyze a quantitative trait and show that both phenotype and genotype permutations may result in an invalid permutation test. I provide a formula that predicts the amount of inflation of the type 1 error rate depending on the degree of misspecification of the covariance structure of the polygenic effect and the heritability of the trait. I validate this formula by doing simulations, showing that the permutation distribution matches the theoretical expectation, and that my suggested permutation based test obtains the correct null distribution. Finally, I discuss situations where naive permutations of the phenotype or genotype are valid and the applicability of the results to other test statistics.
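To make the "naive" procedure concrete, the sketch below shuffles phenotypes across individuals and recomputes a genotype-phenotype correlation. This is the simple phenotype permutation that the abstract says can be invalid when individuals are related or the population is structured; it is not MVNpermute. The function names, the choice of Pearson correlation as the test statistic, and the example data are all assumptions made for illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def naive_permutation_pvalue(genotype, phenotype, n_perm=999, seed=0):
    """Two-sided p-value from naive phenotype permutation.

    Assumes individuals are exchangeable under the null, which is the
    assumption that fails with polygenic effects in structured samples.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(genotype, phenotype))
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point ties.
        if abs(pearson_r(genotype, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Made-up data, for demonstration only.
    g = list(range(12))
    y = [2 * v + 1 for v in g]
    print(naive_permutation_pvalue(g, y))
```

The shuffle treats every individual as interchangeable with every other; relatedness breaks that exchangeability, which is the failure mode the article analyzes.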