Quality assessment for different haplotyping methods and GWAS sensitivity to phasing errors


Giovanni Busonera , Marco Cogoni , Gianluigi Zanetti
doi: http://dx.doi.org/10.1101/015669

In this report we present a multimarker association tool (Flash) based on a novel algorithm to generate haplotypes from raw genotype data. It belongs to the entropy minimization class of methods and is composed of a two-stage deterministic-heuristic part and an optional stochastic optimization. The algorithm scales well to huge datasets and runs faster than competing technologies such as BEAGLE and MACH while maintaining comparable accuracy. A quality assessment of the results is carried out by comparing the switch error. Finally, the haplotypes are used to perform a haplotype-based Genome-wide Association Study (GWAS). The association results are compared with a multimarker and a single SNP association test performed with Plink. Our experiments confirm that the multimarker association test can be more powerful than the single SNP one, as stated in the literature. Moreover, Flash and Plink show similar results for the multimarker association test, but Flash reduces the computation time by about an order of magnitude when using 5-SNP haplotypes.
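The switch error mentioned in the abstract is commonly computed as the fraction of consecutive heterozygous sites whose relative phase differs between the inferred and the true haplotypes. The sketch below illustrates that common definition only; it is not taken from Flash, and the function name, argument names, and the 0/1 allele encoding are assumptions made for this example.

```python
def switch_error_rate(true_h1, true_h2, inferred_h1):
    """Switch error rate for one individual (illustrative, hypothetical helper).

    Alleles are encoded as 0/1. Only sites that are heterozygous in the
    true haplotype pair carry phase information, so only those are used.
    """
    # Positions where the two true haplotypes differ (heterozygous sites).
    het_sites = [i for i, (a, b) in enumerate(zip(true_h1, true_h2)) if a != b]

    # For each heterozygous site: does the inferred haplotype follow true_h1?
    same_phase = [inferred_h1[i] == true_h1[i] for i in het_sites]

    # With fewer than two heterozygous sites there is nothing to switch.
    if len(same_phase) < 2:
        return 0.0

    # A switch occurs whenever the phase orientation changes between
    # two consecutive heterozygous sites.
    switches = sum(x != y for x, y in zip(same_phase, same_phase[1:]))
    return switches / (len(same_phase) - 1)


true_h1 = [0, 1, 0, 1, 0]
true_h2 = [1, 0, 1, 0, 1]
inferred_h1 = [0, 1, 1, 0, 0]  # follows true_h1, then true_h2, then true_h1 again
rate = switch_error_rate(true_h1, true_h2, inferred_h1)
print(rate)
```

Note that an inferred haplotype equal to `true_h2` throughout gives a rate of zero under this definition, because labelling the two haplotypes in the opposite order is not a phasing error.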

Calibrating the Human Mutation Rate via Ancestral Recombination Density in Diploid Genomes

Mark Lipson , Po-Ru Loh , Sriram Sankararaman , Nick Patterson , Bonnie Berger , David Reich
doi: http://dx.doi.org/10.1101/015560

The human mutation rate is an essential parameter for studying the evolution of our species, interpreting present-day genetic variation, and understanding the incidence of genetic disease. Nevertheless, our current estimates of the rate are uncertain. Classical methods based on sequence divergence have yielded significantly larger values than more recent approaches based on counting de novo mutations in family pedigrees. Here, we propose a new method that uses the fine-scale human recombination map to calibrate the rate of accumulation of mutations. By comparing local heterozygosity levels in diploid genomes to the genetic distance scale over which these levels change, we are able to estimate a long-term mutation rate averaged over hundreds or thousands of generations. We infer a rate of 1.65 +/- 0.10 x 10^(-8) mutations per base per generation, which falls in between phylogenetic and pedigree-based estimates, and we suggest possible mechanisms to reconcile our estimate with previous studies. Our results support intermediate-age divergences among human populations and between humans and other great apes.

Genetic evidence for an origin of the Armenians from Bronze Age mixing of multiple populations

Marc Haber , Massimo Mezzavilla , Yali Xue , David Comas , Paolo Gasparini , Pierre Zalloua , Chris Tyler-Smith
doi: http://dx.doi.org/10.1101/015396

The Armenians are a culturally isolated population who historically inhabited a region in the Near East bounded by the Mediterranean and Black seas and the Caucasus, but remain underrepresented in genetic studies and have a complex history including a major geographic displacement during World War One. Here, we analyse genome-wide variation in 173 Armenians and compare them to 78 other worldwide populations. We find that Armenians form a distinctive cluster linking the Near East, Europe, and the Caucasus. We show that Armenian diversity can be explained by several mixtures of Eurasian populations that occurred between ~3,000 and ~2,000 BCE, a period characterized by major population migrations after the domestication of the horse, appearance of chariots, and the rise of advanced civilizations in the Near East. However, genetic signals of population mixture cease after ~1,200 BCE when Bronze Age civilizations in the Eastern Mediterranean world suddenly and violently collapsed. Armenians have since remained isolated and genetic structure within the population developed ~500 years ago when Armenia was divided between the Ottomans and the Safavid Empire in Iran. Finally, we show that Armenians have higher genetic affinity to Neolithic Europeans than other present-day Near Easterners, and that 29% of the Armenian ancestry may originate from an ancestral population best represented by Neolithic Europeans.

Massive migration from the steppe is a source for Indo-European languages in Europe

Wolfgang Haak , Iosif Lazaridis , Nick Patterson , Nadin Rohland , Swapan Mallick , Bastien Llamas , Guido Brandt , Susanne Nordenfelt , Eadaoin Harney , Kristin Stewardson , Qiaomei Fu , Alissa Mittnik , Eszter Bánffy , Christos Economou , Michael Francken , Susanne Friederich , Rafael Garrido Pena , Fredrik Hallgren , Valery Khartanovich , Aleksandr Khokhlov , Michael Kunst , Pavel Kuznetsov , Harald Meller , Oleg Mochalov , Vayacheslav Moiseyev , Nicole Nicklisch , Sandra L. Pichler , Roberto Risch , Manuel A. Rojo Guerra , Christina Roth , Anna Szécsényi-Nagy , Joachim Wahl , Matthias Meyer , Johannes Krause , Dorcas Brown , David Anthony , Alan Cooper , Kurt Werner Alt , David Reich
doi: http://dx.doi.org/10.1101/013433

We generated genome-wide data from 69 Europeans who lived between 8,000-3,000 years ago by enriching ancient DNA libraries for a target set of almost four hundred thousand polymorphisms. Enrichment of these positions decreases the sequencing required for genome-wide ancient DNA analysis by a median of around 250-fold, allowing us to study an order of magnitude more individuals than previous studies and to obtain new insights about the past. We show that the populations of western and far eastern Europe followed opposite trajectories between 8,000-5,000 years ago. At the beginning of the Neolithic period in Europe, ~8,000-7,000 years ago, closely related groups of early farmers appeared in Germany, Hungary, and Spain, different from indigenous hunter-gatherers, whereas Russia was inhabited by a distinctive population of hunter-gatherers with high affinity to a ~24,000 year old Siberian. By ~6,000-5,000 years ago, a resurgence of hunter-gatherer ancestry had occurred throughout much of Europe, but in Russia, the Yamnaya steppe herders of this time were descended not only from the preceding eastern European hunter-gatherers, but also from a population of Near Eastern ancestry. Western and Eastern Europe came into contact ~4,500 years ago, as the Late Neolithic Corded Ware people from Germany traced ~3/4 of their ancestry to the Yamnaya, documenting a massive migration into the heartland of Europe from its eastern periphery. This steppe ancestry persisted in all sampled central Europeans until at least ~3,000 years ago, and is ubiquitous in present-day Europeans. These results provide support for the theory of a steppe origin of at least some of the Indo-European languages of Europe.

Mitochondrial Genomes of Giant Deers Suggest their Late Survival in Central Europe

Alexander Immel , Dorothée G. Drucker , Tina K. Jahnke , Susanne C. Münzel , Verena J. Schuenemann , Marion Bonazzi , Alexander Herbig , Claus-Joachim Kind , Johannes Krause
doi: http://dx.doi.org/10.1101/014944

The giant deer Megaloceros giganteus is among the most fascinating Late Pleistocene Eurasian megafauna that became extinct at the end of the last ice age. Important questions persist regarding its phylogenetic relationship to contemporary taxa and the reasons for its extinction. We analyzed two large ancient cervid bone fragments recovered from cave sites in the Swabian Jura (Baden-Württemberg, Germany) dated to 12,000 years ago. Using hybridization capture in combination with next generation sequencing, we were able to reconstruct nearly complete mitochondrial genomes from both specimens. Both mtDNAs cluster phylogenetically with fallow deer and show high similarity to previously studied partial Megaloceros giganteus DNA from Kamyshlov in western Siberia and Killavullen in Ireland. The unexpected presence of Megaloceros giganteus in Southern Germany after the Ice Age suggests a later survival in Central Europe than previously proposed. The complete mtDNAs provide strong phylogenetic support for a Dama-Megaloceros clade. Furthermore, isotope analyses support an increasing competition between giant deer, red deer, and reindeer after the Last Glacial Maximum, which might have contributed to the extinction of Megaloceros in Central Europe.

Assembly of polymorphic Alu repeat sequences from whole genome sequence data in diverse humans

Julia H Wildschutte , Alayna A Baron , Nicolette M Diroff , Jeffrey M Kidd
doi: http://dx.doi.org/10.1101/014977

Alu insertions have contributed to >11% of the human genome. About 30-35 Alu subfamilies remain actively mobile, and are recognized as major drivers of genetic variation and disease. Sophisticated computational methods permit identification of non-reference insertions based on specific signatures from whole genome sequencing data, but reporting of entire insertion sequences is limited. We build on existing methods and develop an approach that combines Alu detection and de novo assembly of WGS data to reconstruct the full sequence of insertion events. Using this approach, we generate a highly accurate call set of 1,614 completely assembled Alu variants from 53 samples from the Human Genome Diversity Project panel. Experimental validation of 30 sites shows 100% this method produces a highly accurate call set that accurately reconstructs insertion sequence. We utilize the reconstructed alternative insertion haplotypes to genotype 1,010 fully assembled insertions, obtaining >99% accuracy. We find evidence of insertion by non-classical mechanisms and observe 5′ truncation in 16% of AluYa5 and AluYb8 insertions. The sites of truncation coincide with stem-loop structures and SRP9/14 binding sites in the Alu RNA, implicating L1 ORF2p pausing in the generation of 5′ truncations.

Linkage Disequilibrium and Inversion-Typing of the Drosophila melanogaster Genome Reference Panel

David Houle , Eladio J. Marquez
doi: http://dx.doi.org/10.1101/014936

We calculated the linkage disequilibrium between all pairs of variants in the Drosophila Genome Reference Panel, and make available the list of all highly correlated SNPs for use in association studies. Seventy-three percent of variant SNPs are correlated at r^2>0.5 with at least one other SNP, and the mean number of correlated SNPs per variant over the whole genome is 64.9. Disequilibrium between distant SNPs is also common when minor allele frequency (MAF) is low: 24% of SNPs with MAF<0.1 are highly correlated with SNPs more than 100kb distant. While SNPs within regions with polymorphic inversions are highly correlated with somewhat larger numbers of SNPs, and these correlated SNPs are on average farther away, the probability that a SNP in such regions is highly correlated with at least one other SNP is very similar to SNPs outside inversions. Previous karyotyping of the DGRP lines has been inconsistent, and we used LD and genotype to investigate these discrepancies. When previous studies agreed on inversion karyotype, our analysis was almost perfectly concordant with those assignments. In discordant cases, and for inversion heterozygotes, our results suggest errors in two previous analyses, or discordance between genotype and karyotype. Heterozygosities of chromosome arms are in many cases surprisingly highly correlated, suggesting strong epistatic selection during the inbreeding and maintenance of the DGRP lines.
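For readers unfamiliar with the correlation measure used above, the following sketch computes the standard pairwise statistic r^2 = D^2 / (pA(1-pA) pB(1-pB)), where D = pAB - pA pB. It is a textbook illustration, not the authors' pipeline: the function name is hypothetical, and it assumes each inbred line contributes a single 0/1 allele per SNP with no missing data.

```python
def ld_r_squared(snp_a, snp_b):
    """Squared allelic correlation between two biallelic SNPs (illustrative).

    snp_a and snp_b are equal-length lists of 0/1 alleles, one entry per
    line. Returns None when either SNP is monomorphic, because r^2 is
    undefined in that case.
    """
    if len(snp_a) != len(snp_b) or not snp_a:
        raise ValueError("SNP vectors must be non-empty and of equal length")

    n = len(snp_a)
    p_a = sum(snp_a) / n                                  # frequency of allele 1 at SNP A
    p_b = sum(snp_b) / n                                  # frequency of allele 1 at SNP B
    p_ab = sum(a * b for a, b in zip(snp_a, snp_b)) / n   # frequency of the 1-1 haplotype

    d = p_ab - p_a * p_b                                  # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    return d * d / denom


perfect = ld_r_squared([0, 0, 1, 1], [0, 0, 1, 1])      # identical allele patterns
unlinked = ld_r_squared([0, 0, 1, 1], [0, 1, 0, 1])     # statistically independent
print(perfect, unlinked)
```

Under this definition, a SNP pair would count as highly correlated in the sense of the abstract when the returned value exceeds 0.5.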

Natural Selection Shapes the Mosaic Ancestry of the Drosophila Genetic Reference Panel and the D. melanogaster Reference Genome

John E Pool
doi: http://dx.doi.org/10.1101/014837

North American populations of Drosophila melanogaster are thought to derive from both European and African source populations, but despite their importance for genetic research, patterns of admixture along their genomes are essentially undocumented. Here, I infer geographic ancestry along genomes of the Drosophila Genetic Reference Panel (DGRP) and the D. melanogaster reference genome. Overall, the proportion of African ancestry was estimated to be 20% for the DGRP and 9% for the reference genome. Based on the size of admixture tracts and the approximate timing of admixture, I estimate that the DGRP population underwent roughly 13.9 generations per year. Notably, ancestry levels varied strikingly among genomic regions, with significantly less African introgression on the X chromosome, in regions of high recombination, and at genes involved in specific processes such as circadian rhythm. An important role for natural selection during the admixture process was further supported by a genome-wide signal of ancestry disequilibrium, in that many between-chromosome pairs of loci showed a deficiency of Africa-Europe allele combinations. These results support the hypothesis that admixture between partially genetically isolated Drosophila populations led to natural selection against incompatible genetic variants, and that this process is ongoing. The ancestry blocks inferred here may be relevant for the performance of reference alignment in this species, and may bolster the design and interpretation of many population genetic and association mapping studies.

Molecular evolutionary consequences of island colonisation

Jennifer James, Robert Lanfear, Adam Eyre-Walker
doi: http://dx.doi.org/10.1101/014811

Island endemics are likely to experience population bottlenecks; they also have restricted ranges. Therefore we expect island species to have small effective population sizes (Ne) and reduced genetic diversity compared to their mainland counterparts. As a consequence, island species may have inefficient selection and reduced adaptive potential. We used both polymorphisms and substitutions to address these predictions, improving on the approach of recent studies that only used substitution data. This allowed us to directly test the assumption that island species have small values of Ne. We found that island species had significantly less genetic diversity than mainland species; however, this pattern could be attributed to a subset of island species that had undergone a recent population bottleneck. When these species were excluded from the analysis, island and mainland species had similar levels of genetic diversity, despite island species occupying considerably smaller areas than their mainland counterparts. We also found no overall difference between island and mainland species in terms of effectiveness of selection or mutation rate. Our evidence suggests that island colonisation has no lasting impact on molecular evolution. This surprising result highlights gaps in our knowledge of the relationship between census and effective population size.

Ebola virus is evolving but not changing: no evidence for functional change in EBOV from 1976 to the 2014 outbreak

Abayomi S Olabode, Xiaowei Jiang, David L Robertson, Simon C Lovell
doi: http://dx.doi.org/10.1101/014480

The Ebola epidemic is having a devastating impact in West Africa. Sequencing of Ebola viruses from infected individuals has revealed extensive genetic variation, leading to speculation that the virus may be adapting to the human host and accounting for the scale of the 2014 outbreak. We show that so far there is no evidence for adaptation of EBOV to humans. We analyze the putatively functional changes associated with the current and previous Ebola outbreaks, and find no significant molecular changes. Observed amino acid replacements have minimal effect on protein structure, being neither stabilizing nor destabilizing. Replacements are not found in regions of the proteins associated with known functions and tend to occur in disordered regions. This observation indicates that the difference between the current and previous outbreaks is not due to the observed evolutionary change of the virus. Instead, epidemiological factors must be responsible for the unprecedented spread of EBOV.