Inference in Kingman’s Coalescent with Particle Markov Chain Monte Carlo Method

Yifei Chen, Xiaohui Xie
(Submitted on 3 May 2013)

We propose a new algorithm for posterior sampling of Kingman’s coalescent, based on the Particle Markov Chain Monte Carlo methodology. Specifically, the algorithm is an instantiation of the Particle Gibbs sampling method, which alternately samples coalescent times conditioned on coalescent tree structures, and tree structures conditioned on coalescent times via the conditional Sequential Monte Carlo procedure. We implement our algorithm as a C++ package and demonstrate its utility via a parameter estimation task in population genetics on both single- and multiple-locus data. The experimental results show that the proposed algorithm performs comparably to or better than several well-developed methods.
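The alternating scheme described here is two-block Gibbs sampling; the coalescent-specific machinery (conditional SMC over tree structures) is beyond a short sketch, but the alternation itself can be illustrated on a toy target where both conditionals are available exactly. Everything below (the bivariate normal target, the parameter values) is illustrative and not from the paper.

```python
import random

def gibbs_bivariate_normal(rho, n_iter, rng):
    # Two-block Gibbs sampler on a toy bivariate standard normal with
    # correlation rho: sample x | y, then y | x, alternately.
    # (In the paper, the two blocks are coalescent times given the tree
    # and the tree given the times; the latter uses conditional SMC.)
    x, y = 0.0, 0.0
    s = (1.0 - rho * rho) ** 0.5  # conditional standard deviation
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, s)  # draw block 1 given block 2
        y = rng.gauss(rho * x, s)  # draw block 2 given block 1
        samples.append((x, y))
    return samples

rng = random.Random(0)
draws = gibbs_bivariate_normal(0.8, 20000, rng)
burn = draws[2000:]  # discard burn-in
mean_x = sum(x for x, _ in burn) / len(burn)
corr = sum(x * y for x, y in burn) / len(burn)
print(mean_x, corr)  # should approach 0 and rho
```

Replacing the exact conditional draw for one block with a conditional SMC update is what turns this skeleton into Particle Gibbs.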

YHap: software for probabilistic assignment of Y haplogroups from population re-sequencing data

Fan Zhang, Ruoyan Chen, Dongbing Liu, Xiaotian Yao, Guoqing Li, Yabin Jin, Chang Yu, Yingrui Li, Lachlan Coin
(Submitted on 11 Apr 2013)

Y haplogroup analyses are an important component of genealogical reconstruction, population genetic analyses, medical genetics and forensics. These fields are increasingly moving towards use of low-coverage, high-throughput sequencing. However, there is as yet no software available for using sequence data to assign Y haplogroups probabilistically, such that the posterior probability of assignment fully reflects the information present in the data, and borrows information across all samples sequenced from a population. YHap addresses this problem.

Detecting the structure of haplotypes, local ancestry and excessive local European ancestry in Mexicans

Yongtao Guan
(Submitted on 5 Apr 2013)

We present a two-layer hidden Markov model to detect the structure of haplotypes for unrelated individuals. This allows modeling two scales of linkage disequilibrium (one within a group of haplotypes and one between groups), thereby taking advantage of rich haplotype information to infer local ancestry for admixed individuals. Our method outperforms competing state-of-the-art methods, particularly for regions of small ancestral tract length. Applying our method to Mexican samples in HapMap3, we found five coding regions, ranging from 0.3 to 1.3 megabases (Mb) in length, that exhibit excessive European ancestry (average dosage > 1.6). A particularly interesting region of 1.1 Mb (with average dosage 1.95) is located on chromosome 2p23 and harbors two genes, PXDN and MYT1L, both of which are associated with autism and schizophrenia. In light of the low prevalence of autism in Hispanics, this region warrants special attention. We confirmed our findings using Mexican samples from the 1000 Genomes Project. A software package implementing the methods described in the paper is freely available at this http URL.

A new approach to estimate directional genetic differentiation and asymmetric migration patterns

Lisa Sundqvist, Martin Zackrisson, David Kleinhans
(Submitted on 30 Mar 2013)

In the field of population genetics, measures of genetic differentiation are widely used to gather information on the structure of populations and the amount of gene flow between them. These indirect measures are based on a number of simplifying assumptions, for instance equal population sizes and symmetric migration. Structured populations with asymmetric migration patterns frequently occur in nature, and information about directional gene flow would be of great interest in such systems. Nevertheless, current measures of genetic differentiation cannot be used there without violating the assumptions. To extract information on asymmetric migration patterns from genetic data, rather complex models using maximum likelihood or Bayesian approaches generally need to be applied, in which a large number of parameters are estimated simultaneously using complex optimization algorithms. We here introduce a new approach that intends to fill the gap between these complex approaches and the symmetric measures of genetic differentiation. Our approach makes it possible to calculate a directional component of genetic differentiation at low computational effort using any of the classical measures of genetic differentiation. The approach is based on defining a pool of migrants for any pair of populations and calculating measures of genetic differentiation between each of the populations and the respective pool. The directional measures of genetic differentiation can further be used to estimate asymmetric migration. The procedure is demonstrated with a simulated data set with known migration pattern. A comparison of the estimation results with the migration pattern used for simulation suggests that our method captures relevant properties of migration patterns even at low migration frequencies and with few marker loci.
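As a rough illustration of the pooling idea, here is a toy sketch: for a pair of populations we form a hypothetical migrant pool (here, arbitrarily, the normalized geometric mean of their allele frequencies; the paper's actual definition may differ) and compute Nei's G_ST between each population and that pool, yielding two directional values instead of one symmetric one.

```python
import math

def expected_het(freqs):
    # Expected heterozygosity 1 - sum(p_i^2) at one locus
    return 1.0 - sum(p * p for p in freqs)

def migrant_pool(pa, pb):
    # Hypothetical migrant pool between A and B: normalized geometric
    # mean of allele frequencies (an illustrative choice, not
    # necessarily the paper's definition).
    g = [math.sqrt(x * y) for x, y in zip(pa, pb)]
    s = sum(g)
    return [x / s for x in g]

def gst(p1, p2):
    # Nei's G_ST for two demes at a single locus
    hs = 0.5 * (expected_het(p1) + expected_het(p2))
    pt = [0.5 * (x + y) for x, y in zip(p1, p2)]
    ht = expected_het(pt)
    return (ht - hs) / ht if ht > 0 else 0.0

# Two populations, one biallelic locus
pa, pb = [0.9, 0.1], [0.4, 0.6]
pool = migrant_pool(pa, pb)
d_a = gst(pa, pool)  # directional differentiation: A vs. the pool
d_b = gst(pb, pool)  # directional differentiation: B vs. the pool
print(d_a, d_b)      # the two values differ, i.e. the measure is directional
```

The point is only that the pool construction splits a single symmetric measure into an asymmetric pair, which can then be turned into asymmetric migration estimates.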

Detecting range expansions from genetic data

Benjamin M Peter, Montgomery Slatkin
(Submitted on 29 Mar 2013)

We propose a method that uses genetic data to test for the occurrence of a recent range expansion and to infer the location of the origin of the expansion. We introduce a statistic for pairs of populations \psi (the directionality index) that detects asymmetries in the two-dimensional allele frequency spectrum caused by the series of founder events that happen during an expansion. Such asymmetry arises because low frequency alleles tend to be lost during founder events, thus creating clines in the frequencies of surviving low-frequency alleles. Using simulations, we further show that \psi is more powerful for detecting range expansions than both F_{ST} and clines in heterozygosity. We illustrate the utility of \psi by applying it to a data set from modern humans and show how we can include more complicated scenarios such as multiple expansion origins or barriers to migration in the model.
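A toy illustration of the idea behind \psi: compare derived-allele frequencies in two samples at sites where the derived allele segregates in both. The actual estimator in the paper is defined on the joint two-dimensional allele frequency spectrum; the simple averaging below is a simplified stand-in.

```python
def directionality_index(f1, f2):
    # Toy stand-in for the directionality index psi: mean difference
    # in derived-allele frequencies at sites where the derived allele
    # segregates in both samples. Positive values indicate that sample
    # 2 carries surviving low-frequency alleles at higher frequency,
    # as expected after serial founder events.
    shared = [(a, b) for a, b in zip(f1, f2) if 0 < a < 1 and 0 < b < 1]
    if not shared:
        return 0.0
    return sum(b - a for a, b in shared) / len(shared)

# Made-up frequencies: the recently founded "edge" population has lost
# some rare alleles, and the survivors drifted up in frequency.
f_origin = [0.05, 0.10, 0.20, 0.50, 0.02]
f_edge   = [0.15, 0.00, 0.35, 0.55, 0.10]
print(directionality_index(f_origin, f_edge))  # positive for this toy data
```

Note the statistic is antisymmetric: swapping the two samples flips its sign, which is what makes it directional.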

Soft selective sweeps are the primary mode of recent adaptation in Drosophila melanogaster

Nandita R. Garud, Philipp W. Messer, Erkan O. Buzbas, Dmitri A. Petrov
(Submitted on 5 Mar 2013)

Adaptation is often thought to leave the signature of a hard selective sweep, in which a single haplotype bearing the beneficial allele reaches high population frequency. However, an alternative and often-overlooked scenario is that of a soft selective sweep, in which multiple adaptive haplotypes sweep through the population simultaneously. Soft selective sweeps are likely either when adaptation proceeds from standing genetic variation or in large populations where adaptation is not mutation-limited. Because they look for characteristic reductions in heterozygosity, current statistical methods are not well designed to test for soft sweeps and are thus likely to miss these possibly numerous adaptive events. Here, we developed a statistical test based on a haplotype statistic, H12, capable of detecting both hard and soft sweeps with similar power. We used H12 to identify multiple genomic regions that have undergone recent and strong adaptation using a population sample of fully sequenced Drosophila melanogaster strains (DGRP). We then developed a second statistical test based on a statistic H2/H1 | H12, to test whether a given selective sweep detected by H12 is hard or soft. Surprisingly, when applying the test based on H2/H1 | H12 to the top 50 most extreme H12 candidates in the DGRP data, we reject the hard sweep hypothesis in every case. In contrast, all 50 cases show strong support (Bayes Factor >10) for a soft sweep model. Our results suggest that recent adaptation in North American populations of D. melanogaster has led primarily to soft sweeps either because it utilized standing genetic variation or because the short-term effective population size in D. melanogaster is on the order of billions or larger.
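The haplotype statistics named here have simple closed forms (H1 is haplotype homozygosity, H12 pools the two most common haplotypes, and H2 drops the most common one), so a small sketch is easy to write; the haplotype samples below are made up for illustration.

```python
from collections import Counter

def haplotype_homozygosity_stats(haplotypes):
    # Haplotype frequencies, sorted most- to least-common
    n = len(haplotypes)
    p = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    h1 = sum(x * x for x in p)       # H1: haplotype homozygosity
    h2 = h1 - p[0] ** 2              # H2: H1 without the top haplotype
    # H12 pools the two most common haplotypes:
    # (p1 + p2)^2 + sum_{i>2} p_i^2 = H1 + 2 p1 p2
    h12 = h1 + (2 * p[0] * p[1] if len(p) > 1 else 0.0)
    return h1, h2, h12

# A hard-sweep-like sample (one dominant haplotype) vs. a
# soft-sweep-like sample (two common haplotypes)
hard = ["A"] * 8 + ["B", "C"]
soft = ["A"] * 5 + ["B"] * 4 + ["C"]
h1_hard, h2_hard, h12_hard = haplotype_homozygosity_stats(hard)
h1_soft, h2_soft, h12_soft = haplotype_homozygosity_stats(soft)
# H12 is high in both cases, while H2/H1 is much larger for the soft sweep
print(h12_hard, h12_soft, h2_hard / h1_hard, h2_soft / h1_soft)
```

In this toy example H12 is identical for the two samples (both look like sweeps), while H2/H1 cleanly separates them, which is exactly the division of labor between the two tests described in the abstract.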

Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees

Sha Zhu, James H Degnan, Bjarki Eldon
(Submitted on 4 Mar 2013)

Hybrid-Lambda is a software package that simulates gene trees under Kingman or two Lambda-coalescent processes within species networks or species trees. It is written in C++ and released under the GNU General Public License (GPL) version 3. Users can modify it and make new distributions under the terms of this license. For details of this license, visit this http URL. Hybrid-Lambda is available at this https URL.

Inferring ancestral states without assuming neutrality or gradualism using a stable model of continuous character evolution

Michael G. Elliot, Arne O. Mooers
(Submitted on 20 Feb 2013)

The value of a continuous character evolving on a phylogenetic tree is commonly modelled as the location of a particle moving under one-dimensional Brownian motion with constant rate. The Brownian motion model is best suited to characters evolving under neutral drift or tracking an optimum that drifts neutrally. We present a generalization of the Brownian motion model which relaxes assumptions of neutrality and gradualism by considering increments to evolving characters to be drawn from a heavy-tailed stable distribution (of which the normal distribution is a specialized form). We describe Markov chain Monte Carlo methods for fitting the model to biological data, paying special attention to ancestral state reconstruction, and study the performance of the model in comparison with a selection of existing comparative methods, using both simulated data and a database of body mass in 1,679 mammalian species. We discuss hypothesis testing and model selection. The new model is well suited to a stochastic process with a volatile rate of change in which biological characters undergo a mixture of neutral drift and occasional evolutionary events of large magnitude.
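A minimal simulation sketch of the modelling idea: increments drawn from a stable distribution, shown here only for the two special cases with elementary samplers (alpha = 2, the normal/Brownian case, and alpha = 1, the heavy-tailed Cauchy case). Step counts and scales are illustrative.

```python
import math
import random

def stable_increment(alpha, scale, rng):
    # Samplers for two special cases of the stable family:
    # alpha = 2 recovers Brownian (normal) increments, alpha = 1 gives
    # heavy-tailed Cauchy increments permitting occasional large jumps.
    if alpha == 2:
        return rng.gauss(0.0, scale)
    if alpha == 1:
        # Inverse-CDF sampler for the standard Cauchy distribution
        return scale * math.tan(math.pi * (rng.random() - 0.5))
    raise NotImplementedError("general alpha needs a dedicated sampler")

def evolve_character(x0, branch_length, alpha, scale, rng, steps=100):
    # Evolve a continuous character along one branch by summing
    # independent stable increments; per-step scale is chosen so the
    # total scale over the branch is scale * branch_length**(1/alpha).
    dt = branch_length / steps
    x = x0
    for _ in range(steps):
        x += stable_increment(alpha, scale * dt ** (1.0 / alpha), rng)
    return x

rng = random.Random(42)
brownian = [evolve_character(0.0, 1.0, 2, 1.0, rng) for _ in range(500)]
heavy = [evolve_character(0.0, 1.0, 1, 1.0, rng) for _ in range(500)]
print(max(abs(x) for x in brownian), max(abs(x) for x in heavy))
```

The heavy-tailed endpoints include occasional very large displacements, the "evolutionary events of large magnitude" the abstract refers to, while most draws stay near the origin.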

Thoughts on “Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes”

Our next guest post is by Pleuni Pennings [@pleunipennings] with her thoughts on:
Integrating genealogical and dynamical modelling to infer escape and reversion rates in HIV epitopes, Duncan Palmer, John Frater, Rodney Philips, Angela McLean, Gil McVean, arXived here

[UPDATED]

Last week, a group of people from Oxford University published an interesting paper on the arXiv. The paper is about using genealogical data (from HIV sequences) in combination with cross-sectional data (on patient and HIV phenotypes) to infer rates of evolution in HIV.

My conclusion: the approach is very interesting, and it makes total sense to use genealogical data to improve the inference from cross-sectional data. In fact, it is quite surprising to me that inferring rates from cross-sectional data works at all. However, in a previous paper by (partly) the same people, they show that it is possible to infer rates from cross-sectional data only, and the estimates they get are very similar to the estimates from longitudinal data. The current paper provides a new and improved method, whose results are consistent with the previous papers.

The biological conclusion of the paper is that HIV adaptation is slower than many previous studies suggested. Case studies of fast evolution of the virus suffer from extreme publication bias and give the impression that evolution in HIV is always fast, whereas cross-sectional and longitudinal data show that evolution is often slow. Waiting times for CTL-escape and reversion are on the order of years.

1. What rates are they interested in?

The rates of interest here are the rate of escape from CTL pressure and the rate of reversion if there is no CTL pressure.

When someone is infected with HIV, the CTL (cytotoxic T lymphocyte) response by the immune system of the patient can reduce the amount of virus in the patient. Which amino-acid sequences (epitopes) can be recognized by the host’s CTL response depends on the HLA genotype of the host.
Suppose I have a certain HLA genotype X, such that my CTLs can recognize virus with a specific sequence of about 9 amino acids, let’s call this sequence Y. To escape from the pressure of these CTLs, the virus can mutate sequence Y to sequence Y’. A virus with sequence Y’ is called an escape mutant. The host (patient) with HLA X is referred to as a “matched host” and hosts without HLA X are referred to as “unmatched.” The escape mutations are thought to be costly for the virus.
So, for each CTL epitope there are 4 possible combinations of host and virus:
1. matched host and wildtype virus (there is selection pressure on the virus to “escape”)
2. matched host and escape mutant virus
3. unmatched host and wildtype virus
4. unmatched host and escape mutant virus (there is selection pressure on the virus to revert)

The question is “how fast does the virus escape if it is in a matched host and how fast does it revert if it is in an unmatched host?”

2. Why do we want to know these rates?

First of all, just out of curiosity, it is interesting to study how fast things evolve – it is surprising how little we know about rates of adaptive evolution. Secondly, escape rates are relevant for the success of a potential HIV vaccine: if escape rates are high, then vaccines will probably not be very successful.

3. What are cross-sectional data and how can we infer rates from them?

Cross-sectional data are snap-shots of the population, with information on hosts and their virus. Here, it is the number of matched and unmatched hosts with wildtype and escape virus at a given point in time.

So how do these data tell us what escape rates and reversion rates are? Intuitively, it is easy to see how very high or very low rates would shape the data. For example, if escape and reversion happened very fast, then the virus would always be perfectly adapted: we’d only find wildtype virus in unmatched hosts and only escape mutant virus in matched hosts. Conversely, if escape and reversion were extremely slow, then the fraction of escape mutant virus would not differ between matched and unmatched hosts. Everyone would be infected with a random virus and this would never change.
The real situation is somewhere in between: the fraction of escape mutant virus is higher in matched hosts than in unmatched hosts. With the help of a standard epidemiological SI model (an ODE model) and an estimate of the age of the epidemic, the fraction of escape mutant virus in the two types of hosts translates into estimates of the rates of escape and reversion. In the earlier paper, this is exactly what the authors did, and the results make a lot of sense. Rates range from months to years, reversion is always slower than escape, and there are large differences between CTL epitopes. The results also matched well with data from longitudinal studies. In a longitudinal study, the patients are followed over time and evolution of the virus can be observed more directly. This is much more costly, but a much better way to estimate rates.
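To make the inversion concrete, here is a deliberately simplified toy version (not the authors' SI model): assume escape occurs at a constant per-year rate in every matched host and that hosts were infected uniformly over the epidemic's history, then back out the rate that matches an observed cross-sectional fraction. All numbers below are made up.

```python
import math

def expected_escaped_fraction(rate, epidemic_age):
    # If escape occurs at a constant rate in every matched host, and
    # hosts were infected uniformly over the epidemic's history, the
    # expected fraction of matched hosts carrying escape virus at the
    # time of sampling is the average of 1 - exp(-rate * t) over
    # infection times t (approximated here by a midpoint sum).
    steps = 1000
    total = 0.0
    for i in range(steps):
        t = epidemic_age * (i + 0.5) / steps
        total += 1.0 - math.exp(-rate * t)
    return total / steps

def infer_rate(observed_fraction, epidemic_age):
    # Invert the (monotone) relationship by bisection for a point estimate
    lo, hi = 1e-6, 100.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expected_escaped_fraction(mid, epidemic_age) < observed_fraction:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# E.g. 60% of matched hosts carry escape virus, epidemic ~30 years old
rate = infer_rate(0.60, 30.0)
print(rate, 1.0 / rate)  # rate per year, and mean waiting time in years
```

Even this crude version shows the key feature of the cross-sectional logic: a single observed fraction plus an epidemic age pins down one rate, with no information about uncertainty, which is exactly the limitation the genealogical approach addresses.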

4. Why are the estimates from cross-sectional data not good enough?

Unfortunately, the estimates from cross-sectional data are only point estimates, and maybe not very good ones. The problem is that the method (implicitly) assumes that each virus is independently derived from an ancestor at the beginning of the epidemic. For example, if there are a lot of escape mutant viruses in the dataset, then the estimated rate of escape will be high. However, the high number of escape mutant viruses may be due to one or a few escape events early in the epidemic that got transmitted to many other patients. It is a classic case of non-independence of data. It could lead us to believe that we can have more confidence in the estimates than we should have.

5. Genealogical data to the rescue!

Fortunately, the authors have viral sequences that provide much more information than just whether or not the virus is an escape mutant. The sequences of the virus can inform us about the underlying genealogical tree and can tell us how non-independent the data really are (two escape mutants that are very close to each other in the tree are not very independent). The goal of the current paper is to use the genealogical data to get better estimates of the escape and reversion rates.

A large part of the paper deals with the nuts and bolts of how to combine all the data, but in essence, this is what they do: They first estimate the genealogical tree for the viruses of the patients for which they have data (while allowing for uncertainty in the estimated tree). Then they add information on the states of the tips (wildtype vs escape for the virus and matched vs unmatched for the patient), and use the tree with the tip-labels to estimate the rates. This seems to be a very useful new method that may give better estimates and a natural way to get credible intervals for the estimates.

The results they obtain with the new method are similar to the previous results for three of the CTL epitopes, with slower rates for one CTL epitope. The credible intervals are quite wide, which shows that the data (from 84 patients) really don’t contain a whole lot of information about the rates, possibly because the trees are rather star-shaped, due to the exponential growth of the epidemic. Interestingly, the fact that the tree is rather star-shaped could explain why the older approach (based only on cross-sectional data) worked quite well. However, this will not necessarily be the case for other datasets.

Question for the authors

Do you use the information about the specific escape mutations in the data? Surely not all sequences that are considered “escape mutants” carry exactly the same nucleotide changes? Whenever they carry different mutations, you know they must be independent.

Disentangling the effects of geographic and ecological isolation on genetic differentiation

Gideon Bradburd, Peter Ralph, Graham Coop
(Submitted on 13 Feb 2013)

Populations can be genetically isolated by geographic distance and by differences in their ecology or environment that decrease the rate of successful migration. Empirical studies often seek to investigate the relationship between genetic differentiation and some ecological variable(s) while accounting for geographic distance, but common approaches to this problem (e.g. the partial Mantel test) have a number of drawbacks. In this article, we present a Bayesian method that enables users to quantify the relative contributions of geographic distance and ecological distance to genetic differentiation between sampled populations or individuals. We model the allele frequencies in populations at a set of unlinked loci as spatial Gaussian processes, and model the covariance structure of pairs of populations as a decreasing function of both geographic and ecological distance between that pair. Parameters of the model are estimated using a Markov chain Monte Carlo algorithm. We have implemented this method, Bayesian Estimation of Differentiation in Alleles by Spatial Structure and Local Ecology (BEDASSLE), in a user-friendly format in the statistical platform R, and we demonstrate its utility with a simulation study and empirical applications to human and teosinte datasets.
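A toy sketch of the covariance idea: allele-frequency covariance between demes decays with a weighted combination of geographic and ecological distance. The functional form and parameter names below are illustrative, not BEDASSLE's actual parameterization; the relative weights play the role of the "relative contributions" the method estimates.

```python
import math

def covariance(geo_dist, eco_dist, a0=1.0, a_geo=0.5, a_eco=2.0):
    # Toy covariance kernel in the spirit of the model described:
    # allele-frequency covariance between two demes decays with a
    # weighted sum of geographic and ecological distance. The ratio
    # a_eco / a_geo quantifies the relative contribution of ecological
    # vs. geographic isolation. (Illustrative, not BEDASSLE's kernel.)
    return a0 * math.exp(-(a_geo * geo_dist + a_eco * eco_dist))

# Demes as (geographic position, ecological value); build the
# pairwise covariance matrix used by the Gaussian process model
demes = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
cov = [[covariance(abs(gi - gj), abs(ei - ej))
        for (gj, ej) in demes] for (gi, ei) in demes]
for row in cov:
    print(["%.3f" % c for c in row])
```

In this toy setup, demes 1 and 2 are equally far apart geographically from deme 0, but deme 2 is also ecologically distinct, so its covariance with deme 0 is much lower; inference then amounts to estimating the weights from observed allele-frequency covariances.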