Soft selective sweeps in complex demographic scenarios

Soft selective sweeps in complex demographic scenarios

Benjamin A Wilson, Dmitri Petrov, Philipp W Messer

Recent studies have shown that adaptation from de novo mutation often produces so-called soft selective sweeps, where adaptive mutations of independent mutational origin sweep through the population at the same time. Population genetic theory predicts that soft sweeps should be likely if the product of the population size and the mutation rate towards the adaptive allele is sufficiently large, such that multiple adaptive mutations can establish before one has reached fixation; however, it remains unclear how demographic processes affect the probability of observing soft sweeps. Here we extend the theory of soft selective sweeps to realistic demographic scenarios that allow for changes in population size over time. We first show that population bottlenecks can lead to the removal of all but one adaptive lineage from an initially soft selective sweep. The parameter regime under which such ‘hardening’ of soft selective sweeps is likely is determined by a simple heuristic condition. We further develop a generalized analytical framework, based on an extension of the coalescent process, for calculating the probability of soft sweeps under arbitrary demographic scenarios. Two important limits emerge within this analytical framework: In the limit where population size fluctuations are fast compared to the duration of the sweep, the likelihood of soft sweeps is determined by the harmonic mean of the variance effective population size estimated over the duration of the sweep; in the opposing slow fluctuation limit, the likelihood of soft sweeps is determined by the instantaneous variance effective population size at the onset of the sweep. We show that as a consequence of this finding the probability of observing soft sweeps becomes a function of the strength of selection. Specifically, in species with sharply fluctuating population size, strong selection is more likely to produce soft sweeps than weak selection. Our results highlight the importance of accurate demographic estimates over short evolutionary timescales for understanding the population genetics of adaptation from de novo mutation.


Author post: VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

This guest post is by Olly Burren and Chris Wallace on their preprint, VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes, arXived here.

The idea for this paper came from reading a study by Liu et al. ( and the fact that summary p values from genome wide association studies are increasingly becoming publicly available. In the field of human disease, genome-wide association studies have been very successful in isolating regions of the genome that confer disease susceptibility. The next step however, is to understand mechanistically exactly how variation in these loci gives rise to this susceptibility. There are a myriad of pre-existing methods available for integrating genetic and genomic datasets, however things are complicated by the high degree of linkage disequilibrium that exists, which causes substantial inflation in the variance of any test statistic. This inter-SNP correlation must be taken into account, classically by permuting case/control status and recomputing association, requiring access to raw genotyping data. Indeed, this approach was taken in our previously published method see Heing et al. ( which uses a non-parametric test to compare distribution of GWAS p values from two sets of SNPs (“test” and “control”). As most researchers working with GWAS know gaining access to raw genotyping data is often difficult, and then how to include meta-analysis and imputed data? Liu et al., got around this by estimating the inter-SNP correlation using public datasets and sampling from a multivariate normal to generate simulated p values, analogous to the permuted p values possible with permuting phenotype status when raw data are available. VEGAS uses genotype data publicly available through the International HapMap project and aims to integrate GWAS results with trans eQTLs to identify causal disease genes.

Our thought was that by combining our previously published method, with the VEGAS approach, we could create a novel approach that would allow the integration of genetic information from GWAS with functional information from for example a set of micro-array experiments, crucially without the need for genotype information. The rationale being that it would help to prioritise future mechanistic studies, which can be costly and time-consuming to conduct. We also upped the stakes, and decided to use 1000 Genomes Project genotyping information for our estimations, to allow application to dense-genotyping technologies. The result was a software pipeline that takes as input a gene set of interest, a matched ‘control’ set and a summary set of GWAS statistics and computes an enrichment score.

Note that this approach differs from the Bayesian model suggested by Pickrell ( as it focuses on comparing broad regions, rather than on considering more targeted genomic annotation, and in that sense is perhaps more akin to pathway analysis, although we do suggest that functionally defined genes sets, such as those found by knock down experiments in cell lines, may be more productive than using manually annotated pathways whose completeness can vary considerably.

To illustrate the method we applied it to a large meta-analysis GWAS study of type 1 diabetes (8000 case vs 8000 controls), and an interesting dataset examining the effect on gene-expression of knocking down a series of 59 transcription factors in a lymphoblastoid cell line see Cusanovich et al ( We identified three transcription factors, IKZF3, BATF and ESRRA, whose putative targets are significantly enriched for variation associated with type 1 diabetes susceptibility. IKZF3 overlaps a known type 1 diabetes susceptibility region, whereas BATF and ESRRA overlap other autoimmune susceptibility regions, validating our approach. Of course there are caveats interpreting results derived from cell lines, however we think it’s promising that our top hit lies in a region already associated with type 1 diabetes susceptibility.
Using the quantities already computed, once enrichment is detected, we implemented a simple technique to prioritise genes within the set. This allows the generation of a succinct list of genes that are responsible for the enrichment detected on the global level. Cross referenced with other information these can either be informative in their own right or be used to inform future studies.

This study is also an example of the preprint process speeding up scientific discovery. We knew about the Cusanovich dataset because they released a preprint on arXiv, which was caught by Haldane’s Sieve ( in October 2013. One email, and the authors kindly shared their complete results. Had we waited for it to be published in PLoS Genetics in March 2014, we’d have been five months behind where we are.

The major benefit is that all of the datasets employed are within the public domain. Our hope is that either this or other methods in the same vein will help to bridge the gap between GWAS and disease mechanisms, ultimately fuelling the development of new therapeutics.

Estimating transcription factor abundance and specificity from genome-wide binding profiles

Estimating transcription factor abundance and specificity from genome-wide binding profiles

Nicolae Radu Zabet, Boris Adryan
Comments: 39 pages, 25 figures, 10 tables
Subjects: Quantitative Methods (q-bio.QM)

The binding of transcription factors (TFs) is essential for gene expression. One important characteristic is the actual occupancy of a putative binding site in the genome. In this study, we propose an analytical model to predict genomic occupancy that incorporates the preferred target sequence of a TF in the form of a position weight matrix (PWM), DNA accessibility data (in case of eukaryotes), the number of TF molecules expected to be bound to the DNA and a parameter that modulates the specificity of the TF. Given actual occupancy data in form of ChIP-seq profiles, we backwards inferred copy number and specificity for five Drosophila TFs during early embryonic development: Bicoid, Caudal, Giant, Hunchback and Kruppel. Our results suggest that these TFs display a lower number of DNA-bound molecules than previously assumed (in the range of tens and hundreds) and that, while Bicoid and Caudal display a higher specificity, the other three transcription factors (Giant, Hunchback and Kruppel) display lower specificity in their binding (despite having PWMs with higher information content). This study gives further weight to earlier investigations into TF copy numbers that suggest a significant proportion of molecules are not bound to the DNA.

Are we able to detect mass extinction events using phylogenies ?

Are we able to detect mass extinction events using phylogenies ?
Sacha S.J. Laurent, Marc Robinson-Rechavi, Nicolas Salamin
Comments: 14 pages, 8 figures
Subjects: Populations and Evolution (q-bio.PE)

The estimation of the rates of speciation and extinction provides important information on the macro-evolutionary processes shaping biodiversity through time (Ricklefs 2007). Since the seminal paper by Nee et al. (1994), much work have been done to extend the applicability of the birth-death process, which now allows us to test a wide range of hypotheses on the dynamics of the diversification process. Several approaches have been developed to identify the changes in rates of diversification occurring along a phylogenetic tree. Among them, we can distinguish between lineage-dependent, trait-dependent, time-dependent and density-dependent changes. Lineage specific methods identify changes in speciation and extinction rates — {\lambda} and {\mu}, respectively — at inner nodes of a phylogenetic tree (Rabosky et al. 2007; Alfaro et al. 2009; Silvestro et al. 2011). We can also identify trait-dependence in macro-evolutionary rates if the states of the particular trait of interest are known for the species under study (Maddison et al. 2007; FitzJohn et al. 2009; Mayrose et al. 2011). It is also possible to look for concerted changes in rates on independent branches of the phylogenetic tree by dividing the tree into time slices (Stadler 2011a). Finally, density-dependent effects can be detected when changes of diversification are correlated with overall species number (Etienne et al. 2012). Most methods can correct for incomplete taxon sampling, by assigning species numbers at tips of the phylogeny (Alfaro et al. 2009; Stadler and Bokma 2013), or by introducing a sampling parameter (Nee et al. 1994). By taking into account this sampling parameter at time points in the past, one can also look for events of mass extinction (Stadler 2011a).

Mapping to a Reference Genome Structure

Mapping to a Reference Genome Structure
Benedict Paten, Adam Novak, David Haussler
Comments: 25 pages
Subjects: Genomics (q-bio.GN)

To support comparative genomics, population genetics, and medical genetics, we propose that a reference genome should come with a scheme for mapping each base in any DNA string to a position in that reference genome. We refer to a collection of one or more reference genomes and a scheme for mapping to their positions as a reference structure. Here we describe the desirable properties of reference structures and give examples. To account for natural genetic variation, we consider the more general case in which a reference genome is represented by a graph rather than a set of phased chromosomes; the latter is treated as a special case.

Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene

Genome-wide association of foraging behavior in Drosophila melanogaster fails to support large-effect alleles at the foraging gene
Thomas Turner, Christopher C Giauque, Daniel R Schrider, Andrew D Kern

Thirty four years ago, it was postulated that natural populations of Drosophila melanogaster are comprised of two behavioral morphs termed “rover” and “sitter”, and that this variation is caused mainly by large-effect alleles at a single locus. Since that time, considerable data has been amassed that compares the behavior and physiology of these morphs. Contrary to common assertions, however, published support for the existence of common large effect alleles in nature is quite limited. To further investigate, we quantified the foraging behavior of 36 natural strains, performed a genome-wide association study, and described patterns of molecular evolution at the foraging locus. Though there was significant variation in foraging behavior among genotypes, this variation was continuously distributed and not significantly associated with genetic variation at the foraging gene. Patterns of molecular population genetic variation at this gene also provide no support for the hypothesis that for is a target of long term balancing selection We propose that additional data is required to support a hypothesis of common alleles of large effect on foraging behavior in nature. Genome-wide association does support a role for natural variation at several other loci, including the sulfateless gene, though these associations should be considered preliminary until validated with a larger sample size.

Identifying adaptive and plastic gene expression levels using a unified model for expression variance between and within species

Identifying adaptive and plastic gene expression levels using a unified model for expression variance between and within species
Rori Rohlfs, Rasmus Nielsen

Thanks to the reduced cost of RNA-Sequencing and other advanced methods for quantifying expression levels, accurate and expansive comparative expression data sets including data from multiple individuals per species are emerging. Comparative genomics has been greatly facilitated by the availability of statistical methods considering both between and within species variation for testing hypotheses regarding the evolution of DNA sequences. Similar methods are now needed to fully leverage comparative expression data. In this paper, we describe the β model which parameterizes the ratio of population to evolutionary expression variance, facilitating a wide variety of analyses, including a test for expression divergence or diversity for a single gene or a class of genes. The β model can also be used to test for lineage-specific shifts in expression level, amongst other applications. We use simulations to explore the functionality of these tests under a variety of circumstances. We then apply them to a mammalian phylogeny of 15 species typed in liver tissue. We identify genes with high expression divergence between species as candidates for expression level adaptation, and genes with high expression diversity within species as candidates for expression level conservation and plasticity. Using the test for lineage-specific expression shifts, we identify several candidate genes for expression level adaptation on the catarrhine and human lineages, including genes possibly related to dietary changes in humans. We compare these results to those reported previously using the species mean model which ignores population expression variance, uncovering important differences in performance.

Regulatory variants explain much more heritability than coding variants across 11 common diseases

Regulatory variants explain much more heritability than coding variants across 11 common diseases
Alexander Gusev, S Hong Lee, Benjamin M Neale, Gosia Trynka, Bjarni J Vilhjalmsson, Hilary Finucane, Han Xu, Chongzhi Zang, Stephan Ripke, Eli Stahl, n/a Schizophrenia Working Group of the PGC, n/a SWE-SCZ Consortium, Anna K Kahler, Christina M Hultman, Shaun M Purcell, Steven A McCarroll, Mark Daly, Bogdan Pasaniuc, Patrick F Sullivan, Naomi R Wray, Soumya Raychaudhuri, Alkes L Price

Common variants implicated by genome-wide association studies (GWAS) of complex diseases are known to be enriched for coding and regulatory variants. We applied methods to partition the heritability explained by genotyped SNPs (h2g) across functional categories (while accounting for shared variance due to linkage disequilibrium) to genotype and imputed data for 11 common diseases. DNaseI Hypersensitivity Sites (DHS) from 218 cell-types, spanning 16% of the genome, explained an average of 79% of h2g (5.1× enrichment; P < 10−20); further enrichment was observed at enhancer and cell-type specific DHS elements. The enrichments were much smaller in analyses that did not use imputed data or were restricted to GWAS- associated SNPs. In contrast, coding variants, spanning 1% of the genome, explained only 8% of h2g (13.8× enrichment; P = 5 × 10−4). We replicated these findings but found no significant contribution from rare coding variants in an independent schizophrenia cohort genotyped on GWAS and exome chips.

VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

VSEAMS: A pipeline for variant set enrichment analysis using summary GWAS data identifies IKZF3, BATF and ESRRA as key transcription factors in type 1 diabetes

Oliver S Burren, Hui Guo, Chris Wallace
(Submitted on 17 Apr 2014)

Motivation: Genome-wide association studies (GWAS) have identified many loci implicated in disease susceptibility. Integration of GWAS summary statistics (p values) and functional genomic datasets should help to elucidate mechanisms. Results: We describe the extension of a previously described non-parametric method to test whether GWAS signals are enriched in functionally defined loci to a situation where only GWAS p values are available. The approach is implemented in VSEAMS, a freely available software pipeline. We use VSEAMS to integrate functional gene sets defined via transcription factor knock down experiments with GWAS results for type 1 diabetes and find variant set enrichment in gene sets associated with IKZF3, BATF and ESRRA. IKZF3 lies in a known T1D susceptibility region, whilst BATF and ESRRA overlap other immune disease susceptibility regions, validating our approach and suggesting novel avenues of research for type 1 diabetes. Availability and implementation: VSEAMS is available for download this http URL

Gradual divergence and diversification of mammalian duplicate gene functions

Gradual divergence and diversification of mammalian duplicate gene functions

Raquel Assis, Doris Bachtrog

Gene duplication provides raw material for the evolution of functional innovation. We recently developed a phylogenetic method to classify the evolutionary processes underlying the retention and functional evolution of duplicate genes by quantifying divergence of their gene expression profiles. Here, we apply our method to pairs of duplicate genes in eight mammalian genomes, using data from 11 distinct tissues to construct spatial gene expression profiles. We find that young mammalian duplicates are often functionally conserved, and that functional divergence gradually increases with evolutionary distance between species. Examination of expression patterns in genes with conserved and new functions supports the ?out-of-testes? hypothesis, in which new genes arise with testis-specific functions and acquire functions in other tissues over time. While new functions tend to be tissue-specific, there is no bias toward expression in any particular tissue. Thus, duplicate genes acquire a diversity of functions outside of the testes, possibly contributing to the origin of a multitude of complex phenotypes during mammalian evolution.