Dissection of a complex disease susceptibility region using a Bayesian stochastic search approach to fine mapping

Chris Wallace, Antony J Cutler, Nikolas Pontikos, Marcin L Pekalski, Oliver S Burren, Jason D Cooper, Arcadio Rubio Garcia, Ricardo C Ferreira, Hui Guo, Neil M Walker, Deborah J Smyth, Stephen S Rich, Suna Onengut-Gumuscu, Stephen S Sawcer, Maria Ban, Sylvia Richardson, John Todd, Linda Wicker
doi: http://dx.doi.org/10.1101/015164

Identification of candidate causal variants in regions associated with risk of common diseases is complicated by linkage disequilibrium (LD) and multiple association signals. Nonetheless, accurate maps of these variants are needed, both to fully exploit detailed cell-specific chromatin annotation data and highlight disease causal mechanisms and cells, and for design of the functional studies that will ultimately be required to confirm causal mechanisms. We adapted a Bayesian evolutionary stochastic search algorithm to the fine mapping problem, and demonstrated its improved performance over conventional stepwise and regularised regression through simulation studies. We then applied it to fine map the established multiple sclerosis (MS) and type 1 diabetes (T1D) associations in the IL-2RA (CD25) gene region. For T1D, both stepwise and stochastic search approaches identified four T1D association signals, with the major effect tagged by the single nucleotide polymorphism, rs12722496. In contrast, for MS, the stochastic search found two distinct competing models: a single candidate causal variant, tagged by rs2104286 and reported previously using conditional analysis; and a more complex model with two association signals, one of which was tagged by the major T1D associated rs12722496 and the other by rs56382813. There is low to moderate LD between rs2104286 and both rs12722496 and rs56382813 (r² ≈ 0.3) and our two-SNP model could not be recovered through a forward stepwise search after conditioning on rs2104286. Both signals in the two-variant model for MS affect CD25 expression on distinct subpopulations of CD4+ T cells, which are key cells in the autoimmune process. The results support a shared causal variant for T1D and MS. Our study illustrates the benefit of using a purposely designed model search strategy for fine mapping and the advantage of combining disease and protein expression data.
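
For readers unfamiliar with the LD measure quoted in the abstract above, the following is a minimal sketch of how the squared correlation between two biallelic SNPs can be computed from phased haplotypes. It is not the authors' code: the function name `ld_r2` and the 0/1 allele encoding are assumptions made for illustration, and the abstract does not say how its LD estimates were obtained.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a and hap_b are equal-length sequences of 0/1 alleles, one entry
    per phased haplotype (hypothetical encoding: 1 = alternate allele).
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n  # frequency of allele 1 at SNP B
    # frequency of haplotypes carrying allele 1 at both SNPs
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom
```

Values near 1 mean that two SNPs are almost interchangeable as tags for a signal, while values near 0.3, as reported between rs2104286 and the other two SNPs, mean that each SNP captures only part of the other's association.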

Correcting Illumina sequencing errors for human data

Heng Li
(Submitted on 12 Feb 2015)

Summary: We present a new tool to correct sequencing errors in Illumina data produced from high-coverage whole-genome shotgun resequencing. It uses a non-greedy algorithm and shows comparable performance and higher accuracy in an evaluation on real human data. This evaluation has the most complete collection of high-performance error correctors so far.
Availability and implementation: this https URL

Inferring processes underlying B-cell repertoire diversity

Yuval Elhanati, Zachary Sethna, Quentin Marcou, Curtis G. Callan Jr., Thierry Mora, Aleksandra M. Walczak
(Submitted on 10 Feb 2015)

We quantify the VDJ recombination and somatic hypermutation processes in human B-cells using probabilistic inference methods on high-throughput DNA sequence repertoires of human B-cell receptor heavy chains. Our analysis captures the statistical properties of the naive repertoire, first after its initial generation via VDJ recombination and then after selection for functionality. We also infer statistical properties of the somatic hypermutation machinery (exclusive of subsequent effects of selection). Our main results are the following: the B-cell repertoire is substantially more diverse than T-cell repertoires, due to longer junctional insertions; sequences that pass initial selection are distinguished by having a higher probability of being generated in a VDJ recombination event; somatic hypermutations have a non-uniform distribution along the V gene that is well explained by an independent site model for the sequence context around the hypermutation site.

Discovery of large genomic inversions using pooled clone sequencing

Marzieh Eslami Rasekh, Giorgia Chiatante, Mattia Miroballo, Joyce Tang, Mario Ventura, Chris T Amemiya, Evan E. Eichler, Francesca Antonacci, Can Alkan
doi: http://dx.doi.org/10.1101/015156

There are many different forms of genomic structural variation that can be broadly classified as copy number variation (CNV) and balanced rearrangements. Although many algorithms are now available in the literature that aim to characterize CNVs, discovery of balanced rearrangements (inversions and translocations) remains an open problem. This is mainly because the breakpoints of such events typically lie within segmental duplications and common repeats, which reduce the mappability of short reads. The 1000 Genomes Project spearheaded the development of several methods to identify inversions; however, these are limited to relatively short inversions, and there are currently no available algorithms to discover large inversions using high-throughput sequencing (HTS) technologies. Here we propose to use a sequencing method (Kitzman et al., 2011) originally developed to improve haplotype resolution to characterize large genomic inversions. This method, called pooled clone sequencing, merges the advantages of clone-based sequencing approaches with the speed and cost efficiency of HTS technologies. Using data generated with the pooled clone sequencing method, we developed a novel algorithm, dipSeq, to discover large inversions (>500 Kbp). We show the power of dipSeq first on simulated data, and then apply it to the genome of a HapMap individual (NA12878). We were able to accurately discover all previously known and experimentally validated large inversions in the same genome. We also identified a novel inversion, and confirmed it using fluorescence in situ hybridization. Availability: Implementation of the dipSeq algorithm is available at https://github.com/BilkentCompGen/dipseq

Locating a Tree in a Phylogenetic Network in Quadratic Time

Philippe Gambette, Andreas D. M. Gunawan, Anthony Labarre, Stéphane Vialette, Louxin Zhang
(Submitted on 11 Feb 2015)

A fundamental problem in the study of phylogenetic networks is to determine whether or not a given phylogenetic network contains a given phylogenetic tree. We develop a quadratic-time algorithm for this problem for binary nearly-stable phylogenetic networks. We also show that the number of reticulations in a reticulation visible or nearly stable phylogenetic network is bounded from above by a function linear in the number of taxa.

Mitochondrial Genomes of Giant Deers Suggest their Late Survival in Central Europe

Alexander Immel, Dorothée G. Drucker, Tina K. Jahnke, Susanne C. Münzel, Verena J. Schuenemann, Marion Bonazzi, Alexander Herbig, Claus-Joachim Kind, Johannes Krause
doi: http://dx.doi.org/10.1101/014944

The giant deer Megaloceros giganteus is among the most fascinating Late Pleistocene Eurasian megafauna that became extinct at the end of the last ice age. Important questions persist regarding its phylogenetic relationship to contemporary taxa and the reasons for its extinction. We analyzed two large ancient cervid bone fragments recovered from cave sites in the Swabian Jura (Baden-Württemberg, Germany) dated to 12,000 years ago. Using hybridization capture in combination with next generation sequencing, we were able to reconstruct nearly complete mitochondrial genomes from both specimens. Both mtDNAs cluster phylogenetically with fallow deer and show high similarity to previously studied partial Megaloceros giganteus DNA from Kamyshlov in western Siberia and Killavullen in Ireland. The unexpected presence of Megaloceros giganteus in Southern Germany after the Ice Age suggests a later survival in Central Europe than previously proposed. The complete mtDNAs provide strong phylogenetic support for a Dama-Megaloceros clade. Furthermore, isotope analyses support an increasing competition between giant deer, red deer, and reindeer after the Last Glacial Maximum, which might have contributed to the extinction of Megaloceros in Central Europe.

Coevolutionary analyses require phylogenetically deep alignments and better null models to accurately detect inter-protein contacts within and between species

Aram Avila-Herrera, Katherine Pollard
doi: http://dx.doi.org/10.1101/014902

When biomolecules physically interact, natural selection operates on them jointly. Contacting positions in protein and RNA structures exhibit correlated patterns of sequence evolution due to constraints imposed by the interaction, and molecular arms races can develop between interacting proteins in pathogens and their hosts. To evaluate how well methods developed to detect coevolving residues within proteins can be adapted for cross-species, inter-protein analysis, we used statistical criteria to quantify the performance of these methods in detecting inter-protein residues within 8 angstroms of each other in the co-crystal structures of 33 bacterial protein interactions. We also evaluated their performance for detecting known residues at the interface of a host-virus protein complex with a partially solved structure. Our quantitative benchmarking showed that all coevolutionary methods clearly benefit from alignments with many sequences. Methods that aim to detect direct correlations generally outperform other approaches. However, faster mutual information based methods are occasionally competitive in small alignments and with relaxed false positive rates. All commonly used null distributions are anti-conservative and have high false positive rates in some scenarios, although the empirical distribution of scores performs reasonably well with deep alignments. We conclude that coevolutionary analysis of cross-species protein interactions holds great promise but requires sequencing many more species pairs.
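
As a rough illustration of the mutual information family of scores mentioned in the abstract above, the sketch below computes plain mutual information (in bits) between two alignment columns. It is a simplified assumption, not the benchmark's implementation: the function name `mutual_information` is hypothetical, gaps are treated as ordinary symbols, and no null model or background correction is applied.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (bits) between two alignment columns.

    col_x and col_y hold one residue per sequence, in the same sequence
    order, so position i in both columns comes from the same paired row.
    """
    if len(col_x) != len(col_y) or len(col_x) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

With only a handful of sequences, empirical frequencies like these are noisy, which is consistent with the abstract's finding that all methods benefit from alignments with many sequences.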

Assembly of polymorphic Alu repeat sequences from whole genome sequence data in diverse humans

Julia H Wildschutte, Alayna A Baron, Nicolette M Diroff, Jeffrey M Kidd
doi: http://dx.doi.org/10.1101/014977

Alu insertions have contributed to >11% of the human genome. About 30-35 Alu subfamilies remain actively mobile, and are recognized as major drivers of genetic variation and disease. Sophisticated computational methods permit identification of non-reference insertions based on specific signatures from whole genome sequencing (WGS) data, but reporting of entire insertion sequences is limited. We build on existing methods and develop an approach that combines Alu detection and de novo assembly of WGS data to reconstruct the full sequence of insertion events. Using this approach, we generate a highly accurate call set of 1,614 completely assembled Alu variants from 53 samples from the Human Genome Diversity Project panel. Experimental validation of 30 sites confirmed 100% of them, indicating that the method accurately reconstructs insertion sequences. We utilize the reconstructed alternative insertion haplotypes to genotype 1,010 fully assembled insertions, obtaining >99% accuracy. We find evidence of insertion by non-classical mechanisms and observe 5′ truncation in 16% of AluYa5 and AluYb8 insertions. The sites of truncation coincide with stem-loop structures and SRP9/14 binding sites in the Alu RNA, implicating L1 ORF2p pausing in the generation of 5′ truncations.

Improving access to endogenous DNA in ancient bones and teeth

Peter de Barros Damgaard, Ashot Margaryan, Hannes Schroeder, Ludovic Orlando, Eske Willerslev, Morten E Allentoft
doi: http://dx.doi.org/10.1101/014985

Poor DNA preservation is the most limiting factor in ancient genomic research. In the vast majority of ancient bones and teeth, endogenous DNA molecules only represent a minor fraction of the whole DNA extract, rendering traditional shotgun sequencing approaches cost-ineffective for whole-genome characterization. Based on ancient human bone samples from temperate and tropical environments, we show that an initial EDTA-based enzymatic ‘pre-digestion’ of powdered bone increases the proportion of endogenous DNA several-fold. By performing the pre-digestion step between 30 min and 6 hours on five bones, we identify the optimal pre-digestion time and document an average increase of 2.7 times in the endogenous DNA fraction after 1 hour of pre-digestion. With longer pre-digestion times, the increase is asymptotic while molecular complexity decreases. We repeated the experiment with n=21 and t=15-30′, and document a significant increase in endogenous DNA content (one-sided paired t-test: p=0.009). We advocate the implementation of a short pre-digestion step as a standard procedure in ancient DNA extractions from bone material. Finally, we demonstrate on 14 ancient teeth that crushed cementum of the roots contains up to 14 times more endogenous DNA than the dentine. The methodological guidelines presented here considerably advance the ability to characterize ancient genomes.
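
The paired comparison reported in the abstract above rests on a paired t statistic, which can be sketched as follows. This is only an illustration under stated assumptions: the function name `paired_t_statistic` is hypothetical, the study's measurements are not reproduced here, and converting the statistic to a one-sided p-value requires the t distribution, which is left to a statistics library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired samples.

    before[i] and after[i] must be measurements on the same specimen,
    e.g. endogenous DNA fraction without and with pre-digestion.
    A large positive t supports the one-sided alternative after > before.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(after, before)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    n = len(diffs)
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1
```

Pairing matters here because endogenous DNA content varies widely between bones; testing within-bone differences removes that between-sample variation from the comparison.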

Global determinants of mRNA degradation rates in Saccharomyces cerevisiae

Benjamin Neymotin, Victoria Ettorre, David Gresham
doi: http://dx.doi.org/10.1101/014845

Degradation of mRNA contributes to variation in transcript abundance. Studies of individual mRNAs show that cis and trans factors control mRNA degradation rates. However, transcriptome-wide studies have failed to identify global relationships between transcript properties and mRNA degradation. We investigated the contribution of cis and trans factors to transcriptome-wide degradation rate variation in the budding yeast, Saccharomyces cerevisiae, using multiple regression analysis. We find that multiple transcript properties are associated with mRNA degradation rates and that a model incorporating these factors explains ~50% of the genome-wide variance. Predictors of mRNA degradation rates include transcript length, abundance, ribosome density, codon adaptation index (CAI) and GC content of the third position in codons. To validate these factors we studied individual transcripts expressed from identical promoters. We find that decreasing ribosome density by mutating the translational start site of the GAP1 transcript increases its degradation rate. Using variants of GFP that differ at synonymous sites, we show that increased GC content of the third position of codons results in decreased mRNA degradation rate. Thus, in steady-state conditions, a large fraction of genome-wide variation in mRNA degradation rates is determined by inherent properties of transcripts related to protein translation rather than specific regulatory mechanisms.
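
One of the predictors named in the abstract above, GC content of the third codon position (often abbreviated GC3), is simple to compute from a coding sequence. The sketch below assumes an in-frame DNA sequence whose length is a multiple of three; the function name `gc3_content` is hypothetical and this is not the authors' pipeline.

```python
def gc3_content(cds):
    """Fraction of codons whose third position is G or C.

    cds is an in-frame coding DNA sequence (assumed: length divisible by 3).
    """
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("coding sequence must be non-empty and a multiple of 3 long")
    third_positions = cds[2::3]  # every third base, starting at codon position 3
    return sum(base in "GC" for base in third_positions) / len(third_positions)
```

Because most synonymous codons differ only at the third position, GC3 can be changed without altering the encoded protein, which is the property the synonymous GFP variants in the abstract exploit.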