Massive migration from the steppe is a source for Indo-European languages in Europe

Wolfgang Haak , Iosif Lazaridis , Nick Patterson , Nadin Rohland , Swapan Mallick , Bastien Llamas , Guido Brandt , Susanne Nordenfelt , Eadaoin Harney , Kristin Stewardson , Qiaomei Fu , Alissa Mittnik , Eszter Bánffy , Christos Economou , Michael Francken , Susanne Friederich , Rafael Garrido Pena , Fredrik Hallgren , Valery Khartanovich , Aleksandr Khokhlov , Michael Kunst , Pavel Kuznetsov , Harald Meller , Oleg Mochalov , Vayacheslav Moiseyev , Nicole Nicklisch , Sandra L. Pichler , Roberto Risch , Manuel A. Rojo Guerra , Christina Roth , Anna Szécsényi-Nagy , Joachim Wahl , Matthias Meyer , Johannes Krause , Dorcas Brown , David Anthony , Alan Cooper , Kurt Werner Alt , David Reich
doi: http://dx.doi.org/10.1101/013433

We generated genome-wide data from 69 Europeans who lived between 8,000-3,000 years ago by enriching ancient DNA libraries for a target set of almost four hundred thousand polymorphisms. Enrichment of these positions decreases the sequencing required for genome-wide ancient DNA analysis by a median of around 250-fold, allowing us to study an order of magnitude more individuals than previous studies and to obtain new insights about the past. We show that the populations of western and far eastern Europe followed opposite trajectories between 8,000-5,000 years ago. At the beginning of the Neolithic period in Europe, ~8,000-7,000 years ago, closely related groups of early farmers appeared in Germany, Hungary, and Spain, different from indigenous hunter-gatherers, whereas Russia was inhabited by a distinctive population of hunter-gatherers with high affinity to a ~24,000 year old Siberian. By ~6,000-5,000 years ago, a resurgence of hunter-gatherer ancestry had occurred throughout much of Europe, but in Russia, the Yamnaya steppe herders of this time were descended not only from the preceding eastern European hunter-gatherers, but also from a population of Near Eastern ancestry. Western and Eastern Europe came into contact ~4,500 years ago, as the Late Neolithic Corded Ware people from Germany traced ~3/4 of their ancestry to the Yamnaya, documenting a massive migration into the heartland of Europe from its eastern periphery. This steppe ancestry persisted in all sampled central Europeans until at least ~3,000 years ago, and is ubiquitous in present-day Europeans. These results provide support for the theory of a steppe origin of at least some of the Indo-European languages of Europe.

Mitochondrial Genomes of Giant Deers Suggest their Late Survival in Central Europe

Alexander Immel , Dorothée G. Drucker , Tina K. Jahnke , Susanne C. Münzel , Verena J. Schuenemann , Marion Bonazzi , Alexander Herbig , Claus-Joachim Kind , Johannes Krause
doi: http://dx.doi.org/10.1101/014944

The giant deer Megaloceros giganteus is among the most fascinating Late Pleistocene Eurasian megafauna that became extinct at the end of the last ice age. Important questions persist regarding its phylogenetic relationship to contemporary taxa and the reasons for its extinction. We analyzed two large ancient cervid bone fragments recovered from cave sites in the Swabian Jura (Baden-Württemberg, Germany) dated to 12,000 years ago. Using hybridization capture in combination with next generation sequencing, we were able to reconstruct nearly complete mitochondrial genomes from both specimens. Both mtDNAs cluster phylogenetically with fallow deer and show high similarity to previously studied partial Megaloceros giganteus DNA from Kamyshlov in western Siberia and Killavullen in Ireland. The unexpected presence of Megaloceros giganteus in Southern Germany after the Ice Age suggests a later survival in Central Europe than previously proposed. The complete mtDNAs provide strong phylogenetic support for a Dama-Megaloceros clade. Furthermore, isotope analyses support an increasing competition between giant deer, red deer, and reindeer after the Last Glacial Maximum, which might have contributed to the extinction of Megaloceros in Central Europe.

Coevolutionary analyses require phylogenetically deep alignments and better null models to accurately detect inter-protein contacts within and between species

Aram Avila-Herrera , Katherine Pollard
doi: http://dx.doi.org/10.1101/014902

When biomolecules physically interact, natural selection operates on them jointly. Contacting positions in protein and RNA structures exhibit correlated patterns of sequence evolution due to constraints imposed by the interaction, and molecular arms races can develop between interacting proteins in pathogens and their hosts. To evaluate how well methods developed to detect coevolving residues within proteins can be adapted for cross-species, inter-protein analysis, we used statistical criteria to quantify the performance of these methods in detecting inter-protein residues within 8 angstroms of each other in the co-crystal structures of 33 bacterial protein interactions. We also evaluated their performance for detecting known residues at the interface of a host-virus protein complex with a partially solved structure. Our quantitative benchmarking showed that all coevolutionary methods clearly benefit from alignments with many sequences. Methods that aim to detect direct correlations generally outperform other approaches. However, faster mutual information based methods are occasionally competitive in small alignments and with relaxed false positive rates. All commonly used null distributions are anti-conservative and have high false positive rates in some scenarios, although the empirical distribution of scores performs reasonably well with deep alignments. We conclude that coevolutionary analysis of cross-species protein interactions holds great promise but requires sequencing many more species pairs.
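As a rough illustration of the mutual information based scoring mentioned above, the sketch below computes plain mutual information (in bits) between two alignment columns, for example one column from each of two interacting proteins. It is a minimal sketch and not the authors' benchmark code: the function name `column_mutual_information` is hypothetical, gaps are treated like any other symbol, and no null model or correction is applied.

```python
import math
from collections import Counter


def column_mutual_information(col_a, col_b):
    """Mutual information (bits) between two equally long alignment columns.

    col_a[k] and col_b[k] are the residues observed in the k-th paired
    sequence (hypothetical input layout, assumed for this example).
    """
    if len(col_a) != len(col_b) or len(col_a) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (x, y), c in count_ab.items():
        p_xy = c / n
        p_x = count_a[x] / n
        p_y = count_b[y] / n
        # Each observed residue pair contributes p(x,y) * log2(p(x,y) / (p(x) p(y))).
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Perfectly covarying columns versus independent columns.
covarying = column_mutual_information("AACC", "GGTT")
independent = column_mutual_information("AACC", "GTGT")
```

With only a handful of sequences such scores are noisy, which is consistent with the abstract's point that all methods benefit from alignments with many sequences.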

Assembly of polymorphic Alu repeat sequences from whole genome sequence data in diverse humans

Julia H Wildschutte , Alayna A Baron , Nicolette M Diroff , Jeffrey M Kidd
doi: http://dx.doi.org/10.1101/014977

Alu insertions have contributed to >11% of the human genome. About 30-35 Alu subfamilies remain actively mobile, and are recognized as major drivers of genetic variation and disease. Sophisticated computational methods permit identification of non-reference insertions based on specific signatures from whole genome sequencing (WGS) data, but reporting of entire insertion sequences is limited. We build on existing methods and develop an approach that combines Alu detection and de novo assembly of WGS data to reconstruct the full sequence of insertion events. Using this approach, we generate a highly accurate call set of 1,614 completely assembled Alu variants from 53 samples from the Human Genome Diversity Project panel. Experimental validation of 30 sites (100% validated) shows that this method produces a highly accurate call set that accurately reconstructs the insertion sequence. We utilize the reconstructed alternative insertion haplotypes to genotype 1,010 fully assembled insertions, obtaining >99% accuracy. We find evidence of insertion by non-classical mechanisms and observe 5′ truncation in 16% of AluYa5 and AluYb8 insertions. The sites of truncation coincide with stem-loop structures and SRP9/14 binding sites in the Alu RNA, implicating L1 ORF2p pausing in the generation of 5′ truncations.

Improving access to endogenous DNA in ancient bones and teeth

Peter de Barros Damgaard , Ashot Margaryan , Hannes Schroeder , Ludovic Orlando , Eske Willerslev , Morten E Allentoft
doi: http://dx.doi.org/10.1101/014985

Poor DNA preservation is the most limiting factor in ancient genomic research. In the vast majority of ancient bones and teeth, endogenous DNA molecules only represent a minor fraction of the whole DNA extract, rendering traditional shotgun sequencing approaches cost-ineffective for whole-genome characterization. Based on ancient human bone samples from temperate and tropical environments, we show that an initial EDTA-based enzymatic ‘pre-digestion’ of powdered bone increases the proportion of endogenous DNA several fold. By performing the pre-digestion step for between 30 min and 6 hours on five bones, we identify the optimal pre-digestion time and document an average 2.7-fold increase in the endogenous DNA fraction after 1 hour of pre-digestion. With longer pre-digestion times, the increase is asymptotic while molecular complexity decreases. We repeated the experiment with 21 samples and pre-digestion times of 15-30 min, and document a significant increase in endogenous DNA content (one-sided paired t-test: p=0.009). We advocate the implementation of a short pre-digestion step as a standard procedure in ancient DNA extractions from bone material. Finally, we demonstrate on 14 ancient teeth that crushed cementum of the roots contains up to 14 times more endogenous DNA than the dentine. Our presented methodological guidelines considerably advance the ability to characterize ancient genomes.

Linkage Disequilibrium and Inversion-Typing of the Drosophila melanogaster Genome Reference Panel

David Houle , Eladio J. Marquez
doi: http://dx.doi.org/10.1101/014936

We calculated the linkage disequilibrium (LD) between all pairs of variants in the Drosophila Genome Reference Panel (DGRP), and make available the list of all highly correlated SNPs for use in association studies. Seventy-three percent of variant SNPs are correlated at r² > 0.5 with at least one other SNP, and the mean number of correlated SNPs per variant over the whole genome is 64.9. Disequilibrium between distant SNPs is also common when minor allele frequency (MAF) is low: 24% of SNPs with MAF < 0.1 are highly correlated with SNPs more than 100 kb distant. While SNPs within regions with polymorphic inversions are highly correlated with somewhat larger numbers of SNPs, and these correlated SNPs are on average farther away, the probability that a SNP in such regions is highly correlated with at least one other SNP is very similar to that for SNPs outside inversions. Previous karyotyping of the DGRP lines has been inconsistent, and we used LD and genotype to investigate these discrepancies. When previous studies agreed on inversion karyotype, our analysis was almost perfectly concordant with those assignments. In discordant cases, and for inversion heterozygotes, our results suggest errors in two previous analyses, or discordance between genotype and karyotype. Heterozygosities of chromosome arms are in many cases surprisingly highly correlated, suggesting strong epistatic selection during the inbreeding and maintenance of the DGRP lines.
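For readers unfamiliar with the r² statistic used above, the following minimal sketch computes the standard squared correlation between two biallelic SNPs. It assumes, purely for illustration, that each inbred line can be treated as a single haplotype coded 0 (major allele) or 1 (minor allele); the function name `ld_r2` and this input coding are hypothetical and are not taken from the preprint.

```python
def ld_r2(site_a, site_b):
    """Squared correlation r^2 between two biallelic sites.

    site_a[k] and site_b[k] are the alleles (0 or 1) carried by line k.
    Returns NaN when either site is monomorphic, since r^2 is undefined.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n  # frequency of allele 1 at site A
    p_b = sum(site_b) / n  # frequency of allele 1 at site B
    # Frequency of lines carrying allele 1 at both sites.
    p_ab = sum(1 for x, y in zip(site_a, site_b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    d = p_ab - p_a * p_b  # disequilibrium coefficient D
    return d * d / denom


# A pair would count as "highly correlated" under the abstract's threshold.
is_high_ld = ld_r2([0, 0, 1, 1], [0, 0, 1, 1]) > 0.5
```

Low-MAF sites can reach high r² with distant SNPs by chance in a panel of limited size, which may help explain the long-range disequilibrium reported for SNPs with MAF < 0.1.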

Evolution of selenophosphate synthetases: emergence and relocation of function through independent duplications and recurrent subfunctionalization

Marco Mariotti , Didac Santesmasses , Salvador Capella-Gutierrez , Andrea Mateo , Carme Arnan , Rory Johnson , Salvatore D’Aniello , Sun Hee Yim , Vadim N Gladyshev , Florenci Serras , Montserrat Corominas , Toni Gabaldon , Roderic Guigo
doi: http://dx.doi.org/10.1101/014928

Selenophosphate synthetase (SPS) catalyzes the synthesis of selenophosphate, the selenium donor for the synthesis of the amino acid selenocysteine (Sec), incorporated in selenoproteins in response to the UGA codon. SPS is unique among proteins of the selenoprotein biosynthesis machinery in that it is, in many species, a selenoprotein itself, although, as in all selenoproteins, Sec is often replaced by cysteine (Cys). In metazoan genomes we found, however, SPS genes with lineage-specific substitutions other than Sec or Cys. Our results show that these non-Sec, non-Cys SPS genes originated through a number of independent gene duplications of diverse molecular origin from an ancestral selenoprotein SPS gene. Although of independent origin, complementation assays in fly mutants show that these genes share a common function, which most likely emerged in the ancestral metazoan gene. This function appears to be unrelated to selenophosphate synthesis, since all genomes encoding selenoproteins contain Sec or Cys SPS genes (SPS2), but those containing only non-Sec, non-Cys SPS genes (SPS1) do not encode selenoproteins. Thus, in SPS genes, through parallel duplications and subsequent convergent subfunctionalization, two functions initially carried by a single gene are recurrently segregated at two different loci. RNA structures enhancing the readthrough of the Sec-UGA codon in SPS genes, which may be traced back to prokaryotes, played a key role in this process. The SPS evolutionary history in metazoans constitutes a remarkable example of the emergence and evolution of gene function. We have been able to trace this history with unusual detail thanks to the singular feature of SPS genes, wherein the amino acid at a single site determines protein function, and, ultimately, the evolutionary fate of an entire class of genes.

Global determinants of mRNA degradation rates in Saccharomyces cerevisiae

Benjamin Neymotin, Victoria Ettorre, David Gresham
doi: http://dx.doi.org/10.1101/014845

Degradation of mRNA contributes to variation in transcript abundance. Studies of individual mRNAs show that cis and trans factors control mRNA degradation rates. However, transcriptome-wide studies have failed to identify global relationships between transcript properties and mRNA degradation. We investigated the contribution of cis and trans factors to transcriptome-wide degradation rate variation in the budding yeast, Saccharomyces cerevisiae, using multiple regression analysis. We find that multiple transcript properties are associated with mRNA degradation rates and that a model incorporating these factors explains ~50% of the genome-wide variance. Predictors of mRNA degradation rates include transcript length, abundance, ribosome density, codon adaptation index (CAI) and GC content of the third position in codons. To validate these factors we studied individual transcripts expressed from identical promoters. We find that decreasing ribosome density by mutating the translational start site of the GAP1 transcript increases its degradation rate. Using variants of GFP that differ at synonymous sites, we show that increased GC content of the third position of codons results in decreased mRNA degradation rate. Thus, in steady-state conditions, a large fraction of genome-wide variation in mRNA degradation rates is determined by inherent properties of transcripts related to protein translation rather than specific regulatory mechanisms.

A Pleiotropy-Informed Bayesian False Discovery Rate adapted to a Shared Control Design Finds New Disease Associations From GWAS Summary Statistics

James Liley, Chris Wallace
doi: http://dx.doi.org/10.1101/014886

Genome-wide association studies (GWAS) have been successful in identifying single nucleotide polymorphisms (SNPs) associated with many traits and diseases. However, at existing sample sizes, these variants explain only part of the estimated heritability. Leverage of GWAS results from related phenotypes may improve detection without the need for larger datasets. The Bayesian conditional false discovery rate (cFDR) constitutes an upper bound on the expected false discovery rate (FDR) across a set of SNPs whose p values for two diseases are both less than two disease-specific thresholds. Calculation of the cFDR requires only summary statistics and has several advantages over traditional GWAS analysis. However, existing methods require distinct control samples between studies. Here, we extend the technique to allow for some or all controls to be shared, increasing applicability. Several different SNP sets can be defined with the same cFDR value, and we show that the expected FDR across the union of these sets may exceed expected FDR in any single set. We describe a procedure to establish an upper bound for the expected FDR among the union of such sets of SNPs. We apply our technique to pairwise analysis of p values from ten autoimmune diseases with variable sharing of controls, enabling discovery of 59 SNP-disease associations which do not reach GWAS significance after genomic control in individual datasets. Most of the SNPs we highlight have previously been confirmed using replication studies or larger GWAS, a useful validation of our technique; we report eight SNP-disease associations across five diseases not previously declared. Our technique extends and strengthens the previous algorithm, and establishes robust limits on the expected FDR. This approach can improve SNP detection in GWAS, and give insight into shared aetiology between phenotypically related conditions.
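To make the cFDR idea above more concrete, the sketch below implements the commonly used empirical estimator for the case of distinct controls: the p value threshold for the principal disease, divided by the fraction of SNPs passing it among those that pass the threshold for the conditional disease. This is an illustrative assumption about the basic estimator only; it does not include the shared-control adjustment or the bound over unions of SNP sets that the preprint introduces, and the function name `cfdr_estimate` is hypothetical.

```python
def cfdr_estimate(p_i, p_j, pvals_i, pvals_j):
    """Empirical conditional FDR estimate at thresholds (p_i, p_j).

    pvals_i[k] and pvals_j[k] are the p values of SNP k for the principal
    and the conditional disease (hypothetical input layout).
    Estimate: p_i * #{P_j <= p_j} / #{P_i <= p_i and P_j <= p_j}, capped at 1.
    """
    if len(pvals_i) != len(pvals_j):
        raise ValueError("p value lists must have equal length")
    n_j = sum(1 for q in pvals_j if q <= p_j)
    n_ij = sum(1 for a, b in zip(pvals_i, pvals_j) if a <= p_i and b <= p_j)
    if n_ij == 0:
        raise ValueError("no SNP passes both thresholds")
    return min(1.0, p_i * n_j / n_ij)


# Toy data: four SNPs, two of which pass the conditional threshold of 0.02.
toy_i = [0.001, 0.2, 0.5, 0.9]
toy_j = [0.01, 0.02, 0.6, 0.8]
estimate = cfdr_estimate(0.001, 0.02, toy_i, toy_j)
```

In the toy data one of the two SNPs passing the conditional threshold also passes the principal threshold, so the estimate is 0.001 × 2 / 1 = 0.002.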

Natural Selection Shapes the Mosaic Ancestry of the Drosophila Genetic Reference Panel and the D. melanogaster Reference Genome

John E Pool
doi: http://dx.doi.org/10.1101/014837

North American populations of Drosophila melanogaster are thought to derive from both European and African source populations, but despite their importance for genetic research, patterns of admixture along their genomes are essentially undocumented. Here, I infer geographic ancestry along genomes of the Drosophila Genetic Reference Panel (DGRP) and the D. melanogaster reference genome. Overall, the proportion of African ancestry was estimated to be 20% for the DGRP and 9% for the reference genome. Based on the size of admixture tracts and the approximate timing of admixture, I estimate that the DGRP population underwent roughly 13.9 generations per year. Notably, ancestry levels varied strikingly among genomic regions, with significantly less African introgression on the X chromosome, in regions of high recombination, and at genes involved in specific processes such as circadian rhythm. An important role for natural selection during the admixture process was further supported by a genome-wide signal of ancestry disequilibrium, in that many between-chromosome pairs of loci showed a deficiency of Africa-Europe allele combinations. These results support the hypothesis that admixture between partially genetically isolated Drosophila populations led to natural selection against incompatible genetic variants, and that this process is ongoing. The ancestry blocks inferred here may be relevant for the performance of reference alignment in this species, and may bolster the design and interpretation of many population genetic and association mapping studies.