Bias in Estimators of Archaic Admixture

Alan R. Rogers, Ryan J. Bohlender
(Submitted on 20 Dec 2014)

This article evaluates bias in one class of methods used to estimate archaic admixture in modern humans. These methods study the pattern of allele sharing among modern and archaic genomes. They are sensitive to “ghost” admixture, which occurs when a population receives archaic DNA from sources not acknowledged by the statistical model. The effect of ghost admixture depends on two factors: branch-length bias and population-size bias. Branch-length bias occurs because a given amount of admixture has a larger effect if the two populations have been separated for a long time. Population-size bias occurs because differences in population size distort branch lengths in the gene genealogy. In the absence of ghost admixture, these effects are small. They become important, however, in the presence of ghost admixture. Estimators differ in the pattern of response. Increasing a given parameter may inflate one estimator but deflate another. For this reason, comparisons among estimators are informative. Using such comparisons, this article supports previous findings that the archaic population was small and that Europeans received little gene flow from archaic populations other than Neanderthals. It also identifies an inconsistency in estimates of archaic admixture into Melanesia.

Scaling probabilistic models of genetic variation to millions of humans

Prem Gopalan, Wei Hao, David M. Blei, John D. Storey

A major goal of population genetics is to quantitatively understand variation of genetic polymorphisms among individuals. Researchers have developed sophisticated statistical methods to capture the complex population structure that underlies observed genotypes in humans. The number of humans that have been densely genotyped across the genome has grown significantly in recent years. In aggregate about 1M individuals have been densely genotyped to date, and if we could analyze this data then we would have a nearly complete picture of human genetic variation. Existing state-of-the-art methods, however, cannot scale to data of this size. To this end, we have developed TeraStructure. TeraStructure is a new algorithm to fit Bayesian models of genetic variation in human populations on tera-sample-sized data sets (10^12 observed genotypes, e.g., 1M individuals at 1M SNPs). It is a principled approach to approximate Bayesian inference that iterates between subsampling locations of the genome and updating an estimate of the latent population structure. On real and simulated data sets of up to 10K individuals, TeraStructure is twice as fast as existing methods and recovers the latent population structure with equal accuracy. On genomic data simulated at the tera-sample-size scales, TeraStructure continues to be accurate and is the only method that can complete its analysis.

A FISH-based chromosome map for the European corn borers yields insights into ancient chromosomal fusions in the silkworm.

Yuji Yasukochi, Mizuki Ohno, Fukashi Shibata, Akiya Jouraku, Ryo Nakano, Yukio Ishikawa, Ken Sahara

A significant feature of the genomes of Lepidoptera, butterflies and moths, is the high conservation of chromosome organization. Recent remarkable progress in genome sequencing of Lepidoptera has revealed that syntenic gene order is extensively conserved across phylogenetically distant species. The ancestral karyotype of Lepidoptera is thought to be n = 31; however, that of the most well studied moth, Bombyx mori, is n = 28, suggesting that three chromosomal fusion events occurred in this lineage. To identify the boundaries between predicted ancient fusions involving B. mori chromosomes 11, 23 and 24, we constructed FISH-based chromosome maps of the European corn borer, Ostrinia nubilalis (n = 31). We first determined 511 Mb genomic sequence of the Asian corn borer, Ostrinia furnacalis, a congener of O. nubilalis, and isolated BAC and fosmid clones that were expected to localize in candidate regions for the boundaries using these sequences. Combined with FISH and genetic analysis, we narrowed down the candidate regions to 40 kb to 1.5 Mb, in strong agreement with a previous estimate based on the genome of a butterfly, Melitaea cinxia. The significant difference in the lengths of the candidate regions where no functional genes were observed may reflect the evolutionary time after fusion events.

Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data

Debora Yoshihara Caldeira Brandt, Vitor Rezende da Costa Aguiar, Bárbara Domingues Bitarello, Kelly Nunes, Jérôme Goudet, Diogo Meyer

Next Generation Sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, such as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the Human Leukocyte Antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the SNPs reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1,092 1000G samples, and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect, and that allele frequencies are estimated with an error higher than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias towards overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates, and discuss the outcomes of including those sites in different kinds of analyses. Since the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.
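
The kind of per-site benchmark described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the genotype coding (count of reference alleles per diploid individual) and the toy data are all assumptions made for illustration. It compares called genotypes against a gold standard at one SNP and flags the site when the reference allele frequency differs by more than 0.1.

```python
def ref_allele_frequency(genotypes):
    """Reference allele frequency from diploid genotypes coded 0, 1 or 2
    (the number of reference alleles carried by each individual)."""
    return sum(genotypes) / (2 * len(genotypes))


def compare_site(called, gold, threshold=0.1):
    """Compare called genotypes with gold-standard genotypes at one SNP.

    Both lists must cover the same individuals in the same order.
    """
    mismatches = sum(c != g for c, g in zip(called, gold))
    diff = ref_allele_frequency(called) - ref_allele_frequency(gold)
    return {
        "genotype_error_rate": mismatches / len(gold),
        # Positive values mean the reference allele is overestimated.
        "frequency_difference": diff,
        "flagged": abs(diff) > threshold,
    }


# Hypothetical toy data for eight individuals (not taken from 1000G).
gold = [0, 1, 1, 2, 0, 1, 2, 0]
called = [1, 1, 2, 2, 0, 2, 2, 1]
result = compare_site(called, gold)
print(result)
```

In this toy case every miscalled genotype gains a reference allele, so the frequency difference is positive, which is the direction of error a mapping bias toward the reference would produce.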

Genome-engineering with CRISPR-Cas9 in the mosquito Aedes aegypti

Kathryn E Kistler, Leslie B Vosshall, Benjamin J Matthews

The mosquito Aedes aegypti is a potent vector of the Chikungunya, yellow fever, and Dengue viruses, which result in hundreds of millions of infections and over 50,000 human deaths per year. Loss-of-function mutagenesis in Ae. aegypti has been established with TALENs, ZFNs, and homing endonucleases, which require the engineering of DNA-binding protein domains to generate target specificity for a particular stretch of genomic DNA. Here, we describe the first use of the CRISPR-Cas9 system to generate targeted, site-specific mutations in Ae. aegypti. CRISPR-Cas9 relies on RNA-DNA base-pairing to generate targeting specificity, resulting in cheaper, faster, and more flexible genome-editing reagents. We investigate the efficiency of reagent concentrations and compositions, demonstrate the ability of CRISPR-Cas9 to generate several different types of mutations via disparate repair mechanisms, and show that stable germ-line mutations can be readily generated at the vast majority of genomic loci tested. This work offers a detailed exploration into the optimal use of CRISPR-Cas9 in Ae. aegypti that should be applicable to non-model organisms previously out of reach of genetic modification.

Expansion of the HSFY gene family in pig lineages

Benjamin M Skinner, Kim Lachani, Carole A Sargent, Fengtang Yang, Peter JI Ellis, Toby Hunt, Beiyuan Fu, Sandra Louzada, Carol Churcher, Chris Tyler-Smith, Nabeel A Affara

Amplified gene families on sex chromosomes can harbour genes with important biological functions, especially relating to fertility. The HSFY family has amplified on the Y chromosome of the domestic pig (Sus scrofa), in an apparently independent event to an HSFY expansion on the Y chromosome of cattle (Bos taurus). Although the biological functions of HSFY genes are poorly understood, they appear to be involved in gametogenesis in a number of mammalian species, and, in cattle, HSFY gene copy number correlates with levels of fertility. We have investigated the HSFY family in domestic pigs, and other suid species including warthogs, bushpigs, babirusas and peccaries. The domestic pig contains at least two amplified variants of HSFY, distinguished predominantly by presence or absence of a SINE within the intron. Both these variants are expressed in testis, and both are present in approximately 50 copies each in a single cluster on the short arm of the Y. The longer form has multiple nonsense mutations rendering it likely non-functional, but many of the shorter forms still have coding potential. Other suid species also have these two variants of HSFY, and estimates of copy number suggest the HSFY family may have amplified independently twice during suid evolution. Given the association of HSFY gene copy number with fertility in cattle, HSFY is likely to play an important role in spermatogenesis in pigs also.

Stationary solutions for metapopulation Moran models with mutation and selection

George W. A. Constable, Alan J. McKane
(Submitted on 19 Dec 2014)

We construct an individual-based metapopulation model of population genetics featuring migration, mutation, selection and genetic drift. In the case of a single 'island', the model reduces to the Moran model. Using the diffusion approximation and timescale separation arguments, an effective one-variable description of the model is developed. The effective description bears similarities to the well-mixed Moran model with effective parameters which depend on the network structure and island sizes, and is amenable to analysis. Predictions from the reduced theory match the results from stochastic simulations across a range of parameters. The nature of the fast-variable elimination technique we adopt is further studied by applying it to a linear system, where it provides a precise description of the slow dynamics in the limit of large timescale separation.
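
For readers unfamiliar with the single-island limit mentioned in this abstract, the following is a minimal sketch of a well-mixed, two-allele Moran model with selection and mutation. It is not the authors' metapopulation model: the parameter names (`s` for the selective advantage of allele A, `u` for a symmetric mutation probability), the update order, and the choice to let the dying individual be drawn from the whole population are assumptions of this illustration.

```python
import random


def moran_step(n_a, n_total, s, u, rng):
    """One birth-death event; returns the new number of A alleles.

    Assumes s > -1 so that fitness weights are positive.
    """
    # Parent chosen in proportion to fitness: A has 1 + s, B has 1.
    weight_a = (1.0 + s) * n_a
    weight_b = float(n_total - n_a)
    parent_is_a = rng.random() < weight_a / (weight_a + weight_b)
    # Offspring switches allele with probability u (symmetric mutation).
    child_is_a = parent_is_a if rng.random() >= u else not parent_is_a
    # The individual that dies is chosen uniformly at random.
    dead_is_a = rng.random() < n_a / n_total
    return n_a + int(child_is_a) - int(dead_is_a)


def simulate(n_a, n_total, s, u, steps, seed=0):
    """Trajectory of the A-allele count over a number of birth-death events."""
    rng = random.Random(seed)
    trajectory = [n_a]
    for _ in range(steps):
        n_a = moran_step(n_a, n_total, s, u, rng)
        trajectory.append(n_a)
    return trajectory


trajectory = simulate(n_a=5, n_total=10, s=0.1, u=0.01, steps=1000, seed=1)
print(trajectory[-1])
```

Because each event replaces exactly one individual, the population size stays fixed and the allele count changes by at most one per step; without mutation, the states 0 and `n_total` are absorbing.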