Estimating heterozygosity from a low-coverage genome sequence, leveraging data from other individuals sequenced at the same sites

Katarzyna Bryc, Nick Patterson, David Reich
(Submitted on 17 Dec 2012)

High-throughput shotgun sequence data makes it possible in principle to accurately estimate population genetic parameters without confounding by SNP ascertainment bias. One such statistic of interest is the proportion of heterozygous sites within an individual’s genome, which is informative about inbreeding and effective population size. However, in many cases, the available sequence data of an individual is limited to low coverage, preventing the confident calling of genotypes necessary to directly count the proportion of heterozygous sites. Here, we present a method for estimating an individual’s genome-wide rate of heterozygosity from low-coverage sequence data, without an intermediate step calling genotypes. Our method jointly learns the shared allele distribution between the individual and a panel of other individuals, together with the sequencing error distributions and the reference bias. We show our method works well, first by its performance on simulated sequence data, and secondly on real sequence data where we obtain estimates using low coverage data consistent with those from higher coverage. We apply our method to obtain estimates of the rate of heterozygosity for 11 humans from diverse world-wide populations, and through this analysis reveal the complex dependency of local sequencing coverage on the true underlying heterozygosity, which complicates the estimation of heterozygosity from sequence data. We show filters can correct for the confounding by sequencing depth. We find in practice that ratios of heterozygosity are more interpretable than absolute estimates, and show that we obtain excellent conformity of ratios of heterozygosity with previous estimates from higher coverage data.
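The idea of estimating heterozygosity without first calling genotypes can be illustrated with a toy sketch. This is not the authors' method: it ignores the reference panel, reference bias and homozygous non-reference sites, and it assumes a fixed per-base error rate `err`. The function names and the grid search are illustrative assumptions. Each site contributes a mixture likelihood over two genotypes, and the heterozygosity rate `h` is chosen to maximize the total log-likelihood.

```python
import math

def site_likelihoods(ref_reads, alt_reads, err):
    """Likelihood of the read counts at one site under two genotypes.

    Toy assumption: a site is either homozygous reference or heterozygous.
    """
    n = ref_reads + alt_reads
    ways = math.comb(n, alt_reads)
    # Homozygous reference: every non-reference read is a sequencing error.
    hom_ref = ways * err ** alt_reads * (1 - err) ** ref_reads
    # Heterozygous: each read shows either allele with probability 1/2.
    het = ways * 0.5 ** n
    return hom_ref, het

def estimate_heterozygosity(sites, err=0.01, grid=1000):
    """Grid-search maximum likelihood estimate of the heterozygosity rate h."""
    liks = [site_likelihoods(r, a, err) for r, a in sites]
    best_h, best_ll = None, -math.inf
    for i in range(1, grid):
        h = i / grid
        ll = sum(math.log((1 - h) * hom + h * het) for hom, het in liks)
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h

# 90 sites with four reference reads, 10 sites with two reads of each allele.
sites = [(4, 0)] * 90 + [(2, 2)] * 10
print(estimate_heterozygosity(sites))
```

In this toy example the estimate comes out slightly above 0.10, because at 4x coverage a truly heterozygous site can show only reference reads by chance, which is exactly the uncertainty that hard genotype calls would discard.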

The GenoChip: A New Tool for Genetic Anthropology

Eran Elhaik, Elliott Greenspan, Sean Staats, Thomas Krahn, Chris Tyler-Smith, Yali Xue, Sergio Tofanelli, Paolo Francalacci, Francesco Cucca, Luca Pagani, Li Jin, Hui Li, Theodore G. Schurr, Bennett Greenspan, R. Spencer Wells, the Genographic Consortium
(Submitted on 17 Dec 2012)

The Genographic Project is an international effort using genetic data to chart human migratory history. The project is non-profit and non-medical, and through its Legacy Fund supports locally led efforts to preserve indigenous and traditional cultures. In its second phase, the project is focusing on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide SNP genotyping, they were designed for medical genetic studies and contain medically related markers that are not appropriate for global population genetic studies. GenoChip, the Genographic Project’s new genotyping array, was designed to resolve these issues and enable higher-resolution research into outstanding questions in genetic anthropology. We developed novel methods to identify AIMs and genomic regions that may be enriched with alleles shared with ancestral hominins. Overall, we collected and ascertained AIMs from over 450 populations. Containing an unprecedented number of Y-chromosomal and mtDNA SNPs and over 130,000 SNPs from the autosomes and X-chromosome, the chip was carefully vetted to avoid inclusion of medically relevant markers. The GenoChip results were successfully validated. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays for three continental populations. While all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The GenoChip is a dedicated genotyping platform for genetic anthropology and promises to be the most powerful tool available for assessing population structure and migration history.
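For readers unfamiliar with FST, the sketch below computes a simple per-SNP value from allele frequencies in two populations and averages it across SNPs. The abstract does not say which FST estimator was used, so the heterozygosity-based definition here (equal population weights, no sample-size correction) and the function names are assumptions for illustration only.

```python
def wright_fst(p1, p2):
    """FST for one biallelic SNP from allele frequencies in two populations.

    Assumes equal weighting of the two populations. Returns None when the
    SNP is monomorphic overall, where FST is undefined.
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return None
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_total - h_within) / h_total

def mean_fst(freq_pairs):
    """Average FST over SNPs, skipping monomorphic ones."""
    values = [wright_fst(p1, p2) for p1, p2 in freq_pairs]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None

print(mean_fst([(0.5, 0.5), (0.0, 1.0), (0.0, 0.0)]))
```

A SNP with identical frequencies in both populations gives 0 and a SNP fixed for different alleles gives 1, so an array enriched for ancestry-informative markers would be expected to shift the distribution toward higher values.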

Comment on “Evidence of Abundant and Purifying Selection in Humans for Recently Acquired Regulatory Functions”

Nicolas Bray, Lior Pachter
(Submitted on 13 Dec 2012)

Ward and Kellis (Reports, September 5 2012) identify regulatory regions in the human genome exhibiting lineage-specific constraint and estimate the extent of purifying selection. There is no statistical rationale for the examples they highlight, and their estimates of the fraction of the genome under constraint are biased by arbitrary designations of completely constrained regions.

Assembling large, complex environmental metagenomes

Adina Chuang Howe, Janet Jansson, Stephanie A. Malfatti, Susannah G. Tringe, James M. Tiedje, C. Titus Brown
(Submitted on 12 Dec 2012)

The large volumes of sequencing data required to deeply sample complex environments pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two data reduction approaches, digital normalization and partitioning, to this challenge. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license.
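The abstract does not spell out how digital normalization works. As commonly described, it streams through the reads and keeps a read only if the median abundance of its k-mers, among reads kept so far, is still below a coverage cutoff. The sketch below follows that description under simplifying assumptions: exact counting in a `Counter` instead of a memory-efficient probabilistic structure, a tiny `k`, and hypothetical function names. It is not the authors' implementation.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalization(reads, k=4, cutoff=3):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to counts
    return kept

reads = ["ACGGTCATTGCA"] * 10 + ["TTTTAAAACCCC"]
print(len(digital_normalization(reads)))
```

Redundant high-coverage reads are dropped once the cutoff is reached, while reads from low-coverage regions pass through, which is why the data volume shrinks without much loss of assembly content.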

Compensatory evolution and the origins of innovations.

Compensatory evolution and the origins of innovations. (arXiv:1212.2658v1 [q-bio.PE])
by Etienne Rajon, Joanna Masel

Cryptic genetic sequences have attenuated effects on phenotypes. In the classic view, relaxed selection allows cryptic genetic diversity to build up across individuals in a population, providing alleles that may later contribute to adaptation when co-opted – e.g. following a mutation increasing expression from a low, attenuated baseline. This view is described, for example, by the metaphor of the spread of a population across a neutral network in genotype space. As an alternative view, consider the fact that most phenotypic traits are affected by multiple sequences, including cryptic ones. Even in a strictly clonal population, the co-option of cryptic sequences at different loci may have different phenotypic effects and offer the population multiple adaptive possibilities. Here, we model the evolution of quantitative phenotypic characters encoded by cryptic sequences, and compare the relative contributions of genetic diversity and of variation across sites to the phenotypic potential of a population. We show that most of the phenotypic variation accessible through co-option would exist even in populations with no polymorphism. This is made possible by a history of compensatory evolution, whereby the phenotypic effect of a cryptic mutation at one site was balanced by mutations elsewhere in the genome, leading to a diversity of cryptic effect sizes across sites rather than across individuals. Cryptic sequences might accelerate adaptation and facilitate large phenotypic changes even in the absence of genetic diversity, as traditionally defined in terms of alternative alleles.

Efficient moment-based inference of admixture parameters and sources of gene flow

Mark Lipson, Po-Ru Loh, Alex Levin, David Reich, Nick Patterson, Bonnie Berger
(Submitted on 11 Dec 2012)

The recent explosion in available genetic data has led to significant advances in understanding the demographic histories of and relationships among human populations. It is still a challenge, however, to infer reliable parameter values for complicated models involving many populations. Here we present MixMapper, an efficient, interactive method for constructing phylogenetic trees including admixture events using single nucleotide polymorphism (SNP) genotype data. MixMapper implements a novel two-phase approach to admixture inference using moment statistics, first building an unadmixed scaffold tree and then adding admixed populations by solving systems of equations that express allele frequency divergences in terms of mixture parameters. Importantly, all features of the tree, including topology, sources of gene flow, branch lengths, and mixture proportions, are optimized automatically from the data and include estimates of statistical uncertainty. MixMapper also uses a new method to express branch lengths in easily interpretable drift units. We apply MixMapper to recently published data for HGDP individuals genotyped on a SNP array designed especially for use in population genetics studies, obtaining confident results for 30 populations, 20 of them admixed. Notably, we confirm a signal of ancient admixture in European populations—including previously undetected admixture in Sardinians and Basques—involving a proportion of 20-40% ancient northern Eurasian ancestry.
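To give a feel for how allele frequency divergences relate to mixture parameters, here is a deliberately simplified sketch. It is not MixMapper's algorithm: it ignores genetic drift, sampling noise and tree structure, and assumes the admixed population's frequencies are an exact linear mix of two known source populations. The function names are hypothetical.

```python
def f2(p_a, p_b):
    """Mean squared allele frequency difference (no sample-size correction)."""
    return sum((a - b) ** 2 for a, b in zip(p_a, p_b)) / len(p_a)

def mixture_proportion(p_mixed, p_a, p_b):
    """Least-squares alpha in p_mixed = alpha * p_a + (1 - alpha) * p_b."""
    num = sum((m - b) * (a - b) for m, a, b in zip(p_mixed, p_a, p_b))
    den = sum((a - b) ** 2 for a, b in zip(p_a, p_b))
    return num / den

# Hypothetical allele frequencies at three SNPs in two source populations.
p_a = [0.1, 0.5, 0.9]
p_b = [0.7, 0.2, 0.4]
alpha = 0.3
p_c = [alpha * a + (1 - alpha) * b for a, b in zip(p_a, p_b)]
print(mixture_proportion(p_c, p_a, p_b), f2(p_a, p_b))
```

With real data the sources are not observed directly and drift since admixture adds extra divergence, which is why a full method has to solve for branch lengths and mixture proportions jointly.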

Detection of selective sweeps in cattle using genome-wide SNP data

Holly R. Ramey, Jared E. Decker, Stephanie D. McKay, Megan M. Rolf, Robert D. Schnabel, Jeremy F. Taylor
(Submitted on 11 Dec 2012)

The domestication and subsequent selection by humans to create breeds of cattle undoubtedly altered the patterning of variation within their genomes. Strong selection to fix advantageous large-effect mutations underlying domesticability, breed characteristics or productivity created selective sweeps in which variation was lost in the chromosomal region flanking the selected allele. Selective sweeps have been identified in the genomes of many species including humans, dogs, horses, and chickens. We attempt to identify regions of the bovine genome that have been subjected to selective sweeps. Two datasets were used for the discovery and validation of selective sweeps via the fixation of alleles at a series of contiguous SNP loci. BovineSNP50 data were used to identify 28 putative sweep regions among 14 cattle breeds. Affymetrix BOS 1 prescreening assay data for five breeds were used to identify 114 regions and validate 5 regions identified using the BovineSNP50 data. Many genes are located within these regions; however, phenotypes that we predict to have historically been under strong selection include horned-polled, coat color, stature, ear morphology, and behavior. The identified selective sweeps represent recent events associated with breed formation rather than ancient events associated with domestication. No sweep regions were shared between indicine and taurine breeds reflecting their divergent selection histories. A primary finding of this study is the sensitivity of results to assay resolution. Despite the bias towards common SNPs in the BovineSNP50 design, false positive sweep regions appear to be common due to the limited resolution of the assay. This assay design bias leads to the detection of breed-specific sweep regions, or regions shared by a small number of breeds, restricting the suite of selected phenotypes detected to primarily those associated with breed characteristics.
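Detecting sweeps "via the fixation of alleles at a series of contiguous SNP loci" can be sketched as a scan for runs of consecutive SNPs with no variation within a breed. The minimum run length and the function name below are assumptions for illustration, and the authors' actual thresholds and filters are not given in the abstract.

```python
def fixed_runs(minor_allele_freqs, min_run=5):
    """Return (start, end) index pairs of runs of at least min_run fixed SNPs.

    A SNP counts as fixed when its minor allele frequency is exactly 0.
    Indices are inclusive and refer to positions in the input list.
    """
    runs, start = [], None
    for i, maf in enumerate(minor_allele_freqs):
        if maf == 0:
            if start is None:
                start = i  # a new run of fixed SNPs begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last SNP.
    if start is not None and len(minor_allele_freqs) - start >= min_run:
        runs.append((start, len(minor_allele_freqs) - 1))
    return runs

print(fixed_runs([0.2, 0, 0, 0, 0, 0.1, 0, 0, 0.3], min_run=3))
```

On a sparse assay a short run of fixed SNPs can arise by chance, which is consistent with the abstract's point that low marker resolution inflates false positives.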

Our paper: Oh sister, where art thou? Indirect fitness benefit could maintain a host defense trait

This guest post is by Pleuni Pennings on the paper “Oh sister, where art thou? Indirect fitness benefit could maintain a host defense trait”, available from the arXiv (http://arxiv.org/abs/1212.0790). It is cross-posted from her website.

Tobias Pamminger, Susanne Foitzik, Dirk Metzler and I analyzed the small scale spatial structure of ants of the species Temnothorax longispinosus. These ants are the host of a slavemaking ant. The slavemakers go on raids, and steal young from the host species to work as slaves in their nests. We wanted to know whether the slaves still have relatives in the nearby nests. If they do, then their behavior – which influences the slavemakers – could have an effect on their relatives and therefore on their indirect fitness.

To find out if slaves are related to their neighbours, we collected lots of ant nests (they nest in acorns), both in New York and in West Virginia, marked exactly where we found them and genotyped them at six microsatellites.


Photograph by Andreas Gros
Temnothorax longispinosus in acorn


We put little flags at the exact location of an ant nest to measure the distances between the nests.

Microsat Data

This is one of the figures from the manuscript. Plot R (from West Virginia) is shown to demonstrate the distribution of colonies within a plot and to show the distribution of alleles of one of the six microsatellite loci (GT1) among colonies. Each colony is represented by a pie diagram with the frequencies of different GT1 alleles amongst the genotyped individuals of the colony. R3 is a slavemaker nest (we genotyped the slaves, not the slavemakers) and shares most of its alleles with the free nest R7. R13 and R15 are free-living host colonies in close proximity and appear to be related.

Our main conclusion is that the enslaved ants are indeed related to their neighbors. The manuscript can be found on the arXiv here: http://arxiv.org/abs/1212.0790

The manuscript was peer-reviewed at Peerage of Science, a new and very useful community of scientists who agree to review each other's papers fairly. See http://www.peerageofscience.org/

The manuscript is part of Tobias Pamminger’s PhD thesis. Tobias defends his thesis this week in Mainz!! Congrats Tobias!

Tobias came up with the awesome title for the paper “Oh sister, where art thou? Indirect fitness benefit could maintain a host defense trait.”

Age of an allele and gene genealogies of nested subsamples for populations admitting large offspring numbers

Bjarki Eldon
(Submitted on 8 Dec 2012)

Coalescent processes, including mutation, are derived from Moran type population models admitting large offspring numbers. Including mutation in the coalescent process allows for quantifying the turnover of alleles by computing the distribution of the number of original alleles still segregating in the population at a given time in the past. The turnover of alleles is considered for specific classes of the Moran model admitting large offspring numbers. Versions of the Kingman coalescent are also derived whose rates are functions of the mean and variance of the offspring distribution. High variance in the offspring distribution results in higher turnover and younger age of alleles than predicted by the usual Kingman coalescent.
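As a point of reference for the "usual Kingman coalescent", the sketch below computes the expected time to the most recent common ancestor of a sample. It covers only the standard Kingman case, in which each pair of lineages coalesces at rate 1 in coalescent time units, and says nothing about the large-offspring-number models derived in the paper. The function name is hypothetical.

```python
def expected_tmrca(n):
    """Expected time to the most recent common ancestor of n lineages.

    While k lineages remain, the waiting time to the next coalescence has
    mean 2 / (k * (k - 1)); the sum over k = 2..n equals 2 * (1 - 1 / n).
    """
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))

for n in (2, 10, 100):
    print(n, expected_tmrca(n), 2 * (1 - 1 / n))
```

Genealogies that are shallower than this baseline correspond to the faster allele turnover the abstract attributes to high variance in offspring number.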

Fast Algorithms for Reconciliation under Hybridization and Incomplete Lineage Sorting

Yun Yu, Luay Nakhleh
(Submitted on 9 Dec 2012)

Reconciling a gene tree with a species tree is an important task that reveals much about the evolution of genes, genomes, and species, as well as about the molecular function of genes. A wide array of computational tools has been devised for this task under certain evolutionary events such as hybridization, gene duplication/loss, or incomplete lineage sorting. Work on reconciling gene trees with species phylogenies under two or more of these events has also begun to emerge. Our group recently devised both parsimony and probabilistic frameworks for reconciling a gene tree with a phylogenetic network, thus allowing for the detection of hybridization in the presence of incomplete lineage sorting. While the frameworks are general and can handle any topology, they are computationally intensive, rendering their application to large datasets infeasible. In this paper, we present two novel approaches, based on the concept of ancestral configurations, that address the computational challenges of the two frameworks. Our approaches still compute exact solutions while improving the computational time by up to five orders of magnitude. These substantial gains in speed extend the applicability of these unified reconciliation frameworks to much larger data sets. We discuss how the topological features of the gene tree and phylogenetic network may affect the performance of the new algorithms. We have implemented the algorithms in our PhyloNet software package, which is publicly available as open source.