Inferring processes underlying B-cell repertoire diversity

Inferring processes underlying B-cell repertoire diversity

Yuval Elhanati, Zachary Sethna, Quentin Marcou, Curtis G. Callan Jr., Thierry Mora, Aleksandra M. Walczak
(Submitted on 10 Feb 2015)

We quantify the VDJ recombination and somatic hypermutation processes in human B-cells using probabilistic inference methods on high-throughput DNA sequence repertoires of human B-cell receptor heavy chains. Our analysis captures the statistical properties of the naive repertoire, first after its initial generation via VDJ recombination and then after selection for functionality. We also infer statistical properties of the somatic hypermutation machinery (exclusive of subsequent effects of selection). Our main results are the following: the B-cell repertoire is substantially more diverse than T-cell repertoires, due to longer junctional insertions; sequences that pass initial selection are distinguished by having a higher probability of being generated in a VDJ recombination event; somatic hypermutations have a non-uniform distribution along the V gene that is well explained by an independent site model for the sequence context around the hypermutation site.

Discovery of large genomic inversions using pooled clone sequencing

Discovery of large genomic inversions using pooled clone sequencing

Marzieh Eslami Rasekh , Giorgia Chiatante , Mattia Miroballo , Joyce Tang , Mario Ventura , Chris T Amemiya , Evan E. Eichler , Francesca Antonacci , Can Alkan
doi: http://dx.doi.org/10.1101/015156

There are many different forms of genomic structural variation that can be broadly classified as copy number variation (CNV) and balanced rearrangements. Although many algorithms are now available in the literature that aim to characterize CNVs, discovery of balanced rearrangements (inversions and translocations) remains an open problem. This is mainly because the breakpoints of such events typically lie within segmental duplications and common repeats, which reduce the mappability of short reads. The 1000 Genomes Project spearheaded the development of several methods to identify inversions, however, they are limited to relatively short inversions, and there are currently no available algorithms to discover large inversions using high throughput sequencing technologies (HTS). Here we propose to use a sequencing method (Kitzman et al., 2011) originally developed to improve haplotype resolution to characterize large genomic inversions. This method, called pooled clone sequencing, merges the advantages of clone based sequencing approach with the speed and cost efficiency of HTS technologies. Using data generated with pooled clone sequencing method, we developed a novel algorithm, dipSeq, to discover large inversions (>500 Kbp). We show the power of dipSeq first on simulated data, and then apply it to the genome of a HapMap individual (NA12878). We were able to accurately discover all previously known and experimentally validated large inversions in the same genome. We also identified a novel inversion, and confirmed using fluorescent in situ hybridization. Availability: Implementation of the dipSeq algorithm is available at https://github.com/BilkentCompGen/dipseq

Locating a Tree in a Phylogenetic Network in Quadratic Time

Locating a Tree in a Phylogenetic Network in Quadratic Time

Philippe Gambette, Andreas D. M. Gunawan, Anthony Labarre, Stéphane Vialette, Louxin Zhang
(Submitted on 11 Feb 2015)

A fundamental problem in the study of phylogenetic networks is to determine whether or not a given phylogenetic network contains a given phylogenetic tree. We develop a quadratic-time algorithm for this problem for binary nearly-stable phylogenetic networks. We also show that the number of reticulations in a reticulation visible or nearly stable phylogenetic network is bounded from above by a function linear in the number of taxa.

Massive migration from the steppe is a source for Indo-European languages in Europe

Massive migration from the steppe is a source for Indo-European languages in Europe
Wolfgang Haak , Iosif Lazaridis , Nick Patterson , Nadin Rohland , Swapan Mallick , Bastien Llamas , Guido Brandt , Susanne Nordenfelt , Eadaoin Harney , Kristin Stewardson , Qiaomei Fu , Alissa Mittnik , Eszter Bánffy , Christos Economou , Michael Francken , Susanne Friederich , Rafael Garrido Pena , Fredrik Hallgren , Valery Khartanovich , Aleksandr Khokhlov , Michael Kunst , Pavel Kuznetsov , Harald Meller , Oleg Mochalov , Vayacheslav Moiseyev , Nicole Nicklisch , Sandra L. Pichler , Roberto Risch , Manuel A. Rojo Guerra , Christina Roth , Anna Szécsényi-Nagy , Joachim Wahl , Matthias Meyer , Johannes Krause , Dorcas Brown , David Anthony , Alan Cooper , Kurt Werner Alt , David Reich
doi: http://dx.doi.org/10.1101/013433

We generated genome-wide data from 69 Europeans who lived between 8,000-3,000 years ago by enriching ancient DNA libraries for a target set of almost four hundred thousand polymorphisms. Enrichment of these positions decreases the sequencing required for genome-wide ancient DNA analysis by a median of around 250-fold, allowing us to study an order of magnitude more individuals than previous studies and to obtain new insights about the past. We show that the populations of western and far eastern Europe followed opposite trajectories between 8,000-5,000 years ago. At the beginning of the Neolithic period in Europe, ~8,000-7,000 years ago, closely related groups of early farmers appeared in Germany, Hungary, and Spain, different from indigenous hunter-gatherers, whereas Russia was inhabited by a distinctive population of hunter-gatherers with high affinity to a ~24,000 year old Siberian6. By ~6,000-5,000 years ago, a resurgence of hunter-gatherer ancestry had occurred throughout much of Europe, but in Russia, the Yamnaya steppe herders of this time were descended not only from the preceding eastern European hunter-gatherers, but from a population of Near Eastern ancestry. Western and Eastern Europe came into contact ~4,500 years ago, as the Late Neolithic Corded Ware people from Germany traced ~3/4 of their ancestry to the Yamnaya, documenting a massive migration into the heartland of Europe from its eastern periphery. This steppe ancestry persisted in all sampled central Europeans until at least ~3,000 years ago, and is ubiquitous in present-day Europeans. These results provide support for the theory of a steppe origin of at least some of the Indo-European languages of Europe.

Mitochondrial Genomes of Giant Deers Suggest their Late Survival in Central Europe

Mitochondrial Genomes of Giant Deers Suggest their Late Survival in Central Europe

Alexander Immel , Dorothée G. Drucker , Tina K. Jahnke , Susanne C. Münzel , Verena J. Schuenemann , Marion Bonazzi , Alexander Herbig , Claus-Joachim Kind , Johannes Krause
doi: http://dx.doi.org/10.1101/014944

The giant deer Megaloceros giganteus is among the most fascinating Late Pleistocene Eurasian megafauna that became extinct at the end of the last ice age. Important questions persist regarding its phylogenetic relationship to contemporary taxa and the reasons for its extinction. We analyzed two large ancient cervid bone fragments recovered from cave sites in the Swabian Jura (Baden-Württemberg, Germany) dated to 12,000 years ago. Using hybridization capture in combination with next generation sequencing, we were able to reconstruct nearly complete mitochondrial genomes from both specimens. Both mtDNAs cluster phylogenetically with fallow deer and show high similarity to previously studied partial Megaloceros giganteus DNA from Kamyshlov in western Siberia and Killavullen in Ireland. The unexpected presence of Megaloceros giganteus in Southern Germany after the Ice Age suggests a later survival in Central Europe than previously proposed. The complete mtDNAs provide strong phylogenetic support for a Dama-Megaloceros clade. Furthermore, isotope analyses support an increasing competition between giant deer, red deer, and reindeer after the Last Glacial Maximum, which might have contributed to the extinction of Megaloceros in Central Europe.

Coevolutionary analyses require phylogenetically deep alignments and better null models to accurately detect inter-protein contacts within and between species

Coevolutionary analyses require phylogenetically deep alignments and better null models to accurately detect inter-protein contacts within and between species

Aram Avila-Herrera , Katherine Pollard
doi: http://dx.doi.org/10.1101/014902

When biomolecules physically interact, natural selection operates on them jointly. Contacting positions in protein and RNA structures exhibit correlated patterns of sequence evolution due to constraints imposed by the interaction, and molecular arms races can develop between interacting proteins in pathogens and their hosts. To evaluate how well methods developed to detect coevolving residues within proteins can be adapted for cross-species, inter-protein analysis, we used statistical criteria to quantify the performance of these methods in detecting inter-protein residues within 8 angstroms of each other in the co-crystal structures of 33 bacterial protein interactions. We also evaluated their performance for detecting known residues at the interface of a host-virus protein complex with a partially solved structure. Our quantitative benchmarking showed that all coevolutionary methods clearly benefit from alignments with many sequences. Methods that aim to detect direct correlations generally outperform other approaches. However, faster mutual information based methods are occasionally competitive in small alignments and with relaxed false positive rates. All commonly used null distributions are anti-conservative and have high false positive rates in some scenarios, although the empirical distribution of scores performs reasonably well with deep alignments. We conclude that coevolutionary analysis of cross-species protein interactions holds great promise but requires sequencing many more species pairs.

Assembly of polymorphic Alu repeat sequences from whole genome sequence data in diverse humans

Assembly of polymorphic Alu repeat sequences from whole genome sequence data in diverse humans

Julia H Wildschutte , Alayna A Baron , Nicolette M Diroff , Jeffrey M Kidd
doi: http://dx.doi.org/10.1101/014977

Alu insertions have contributed to >11% of the human genome. About ~30-35 Alu subfamilies remain actively mobile, and are recognized as major drivers of genetic variation and disease. Sophisticated computational methods permit identification of non-reference insertions based on specific signatures from whole genome sequencing data, but reporting of entire insertion sequences is limited. We build on existing methods and develop an approach that combines Alu detection and de novo assembly of WGS data to reconstruct the full sequence of insertion events. Using this approach, we generate a highly accurate call set of 1,614 completely assembled Alu variants from 53 samples from the Human Genome Diversity Project panel. Experimental validation of 30 sites shows 100% this method produces a highly accurate call set that accurately reconstructs insertion sequence. We utilize the reconstructed alternative insertion haplotypes to genotype 1,010 fully assembled insertions, obtaining >99% accuracy. We find evidence of insertion by non-classical mechanisms and observe 5??? truncation in 16% of AluYa5 and AluYb8 insertions. The sites of truncation coincide with stem-loop structures and SRP9/14 binding sites in the Alu RNA, implicating L1 ORF2p pausing in the generation of 5??? truncations.

Improving access to endogenous DNA in ancient bones and teeth

Improving access to endogenous DNA in ancient bones and teeth

Peter de Barros Damgaard , Ashot Margaryan , Hannes Schroeder , Ludovic Orlando , Eske Willerslev , Morten E Allentoft
doi: http://dx.doi.org/10.1101/014985

Poor DNA preservation is the most limiting factor in ancient genomic research. In the vast majority of ancient bones and teeth, endogenous DNA molecules only represent a minor fraction of the whole DNA extract, rendering traditional shot-gun sequencing approaches cost-ineffective for whole-genome characterization. Based on ancient human bone samples from temperate and tropical environments, we show that an initial EDTA-based enzymatic ‘pre-digestion’ of powdered bone increases the proportion of endogenous DNA several fold. By performing the pre-digestion step between 30 min and 6 hours on five bones, we identify the optimal pre-digestion time and document an average increase of 2.7 times in the endogenous DNA fraction after 1 hour of pre-digestion. With longer pre-digestion times, the increase is asymptotic while molecular complexity decreases. We repeated the experiment with n=21 and t=15-30′, and document a significant increase in endogenous DNA content (one-sided paired t-test: p=0.009). We advocate the implementation of a short pre-digestion step as a standard procedure in ancient DNA extractions from bone material. Finally, we demonstrate on 14 ancient teeth that crushed cementum of the roots contains up to 14 times more endogenous DNA than the dentine. Our presented methodological guidelines considerably advance the ability to characterize ancient genomes.

Linkage Disequilibrium and Inversion-Typing of the Drosophila melanogaster Genome Reference Panel

Linkage Disequilibrium and Inversion-Typing of the Drosophila melanogaster Genome Reference Panel
David Houle , Eladio J. Marquez
doi: http://dx.doi.org/10.1101/014936

We calculated the linkage disequilibrium between all pairs of variants in the Drosophila Genome Reference Panel, and make available the list of all highly correlated SNPs for use in association studies. Seventy-three percent of variant SNPs are correlated at r2>0.5 with at least one other SNP, and the mean number of correlated SNPs per variant over the whole genome is 64.9. Disequilibrium between distant SNPs is also common when minor allele frequency (MAF) is low: 24% of SNPs with MAF<0.1 are highly correlated with SNPs more than 100kb distant. While SNPs within regions with polymorphic inversions are highly correlated with somewhat larger numbers of SNPs, and these correlated SNPs are on average farther away, the probability that a SNP in such regions is highly correlated with at least one other SNP is very similar to SNPs outside inversions. Previous karyotyping of the DGRP lines has been inconsistent, and we used LD and genotype to investigate these discrepancies. When previous studies agreed on inversion karyotype, our analysis was almost perfectly concordant with those assignments. In discordant cases, and for inversion heterozygotes, our results suggest errors in two previous analyses, or discordance between genotype and karyotype. Heterozygosities of chromosome arms are in many cases surprisingly highly correlated, suggesting strong epsistatic selection during the inbreeding and maintenance of the DGRP lines.

Evolution of selenophosphate synthetases: emergence and relocation of function through independent duplications and recurrent subfunctionalization

Evolution of selenophosphate synthetases: emergence and relocation of function through independent duplications and recurrent subfunctionalization
Marco Mariotti , Didac Santesmasses , Salvador Capella-Gutierrez , Andrea Mateo , Carme Arnan , Rory Johnson , Salvatore D’Aniello , Sun Hee Yim , Vadim N Gladyshev , Florenci Serras , Montserrat Corominas , Toni Gabaldon , Roderic Guigo
doi: http://dx.doi.org/10.1101/014928

SPS catalyzes the synthesis of selenophosphate, the selenium donor for the synthesis of the amino acid selenocysteine (Sec), incorporated in selenoproteins in response to the UGA codon. SPS is unique among proteins of the selenoprotein biosynthesis machinery in that it is, in many species, a selenoprotein itself, although, as in all selenoproteins, Sec is often replaced by cysteine (Cys). In metazoan genomes we found, however, SPS genes with lineage specific substitutions other than Sec or Cys. Our results show that these non-Sec, non-Cys SPS genes originated through a number of independent gene duplications of diverse molecular origin from an ancestral selenoprotein SPS gene. Although of independent origin, complementation assays in fly mutants show that these genes share a common function, which most likely emerged in the ancestral metazoan gene. This function appears to be unrelated to selenophosphate synthesis, since all genomes encoding selenoproteins contain Sec or Cys SPS genes (SPS2), but those containing only non-Sec, non-Cys SPS genes (SPS1) do not encode selenoproteins. Thus, in SPS genes, through parallel duplications and subsequent convergent subfunctionalization, two functions initially carried by a single gene are recurrently segregated at two different loci. RNA structures enhancing the readthrough of the Sec-UGA codon in SPS genes, which may be traced back to prokaryotes, played a key role in this process. The SPS evolutionary history in metazoans constitute a remarkable example of the emergence and evolution of gene function. We have been able to trace this history with unusual detail thanks to the singular feature of SPS genes, wherein the amino acid at a single site determines protein function, and, ultimately, the evolutionary fate of an entire class of genes.