Distribution of gene tree histories under the coalescent model with gene flow

Yuan Tian, Laura Kubatko
doi: http://dx.doi.org/10.1101/023937

We propose a coalescent model for three species that allows gene flow between both pairs of sister populations. The model is designed to analyze multilocus genomic sequence alignments, with one sequence sampled from each of the three species. The model is formulated using a Markov chain representation, which allows use of matrix exponentiation to compute analytical expressions for the probability density of gene tree genealogies. The gene tree history distribution as well as the gene tree topology distribution under this coalescent model with gene flow are then calculated via numerical integration. We analyze the model to compare the distributions of gene tree topologies and gene tree histories for species trees with differing effective population sizes and gene flow rates. Our results suggest conditions under which the species tree and associated parameters are not identifiable from the gene tree topology distribution when gene flow is present, but indicate that the gene tree history distribution may identify the species tree and associated parameters. Thus, the gene tree history distribution can be used to infer parameters such as the ancestral effective population sizes and the rates of gene flow in a maximum likelihood (ML) framework. We conduct computer simulations to evaluate the performance of our method in estimating these parameters, and we apply our method to an Afrotropical mosquito data set (Fontaine et al., 2015) to demonstrate the usefulness of our method for the analysis of empirical data.

Key words: coalescent, gene flow, migration, hybridization, gene tree, topology, history, maximum likelihood, speciation.
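The Markov chain and matrix exponentiation idea mentioned in the abstract can be illustrated with a much smaller toy model than the one in the paper. The sketch below is an assumption-laden simplification, not the authors' three-species model: it tracks two lineages in two populations, with a hypothetical per-lineage migration rate `m` and coalescence rate `c`, and computes the transition probabilities exp(Qt) with a plain scaling-and-squaring routine. The state layout and the function names are invented for illustration.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=20, squarings=10):
    """Approximate exp(Q * t) by scaling and squaring with a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds A^k / k! after this step
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def two_lineage_rate_matrix(m, c):
    """Rate matrix for a toy model (an assumption, not the paper's model).

    States: 0 = both lineages in population A, 1 = one lineage in each
    population, 2 = both in population B, 3 = coalesced (absorbing).
    Each lineage migrates at rate m; two lineages in the same population
    coalesce at rate c.
    """
    return [
        [-(2 * m + c), 2 * m, 0.0, c],
        [m, -2 * m, m, 0.0],
        [0.0, 2 * m, -(2 * m + c), c],
        [0.0, 0.0, 0.0, 0.0],
    ]
```

For example, `mat_exp(two_lineage_rate_matrix(0.5, 1.0), 2.0)[1][3]` gives the probability that two lineages sampled from different populations have coalesced by time 2 under these hypothetical rates. With `m = 0` and both lineages in the same population, the result reduces to the familiar `1 - exp(-c * t)`.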

More efficacious drugs lead to harder selective sweeps in the evolution of drug resistance in HIV-1

Alison F Feder, Soo-Yon Rhee, Robert W Shafer, Dmitri A Petrov, Pleuni S Pennings
doi: http://dx.doi.org/10.1101/024109

In the early days of HIV treatment, drug resistance occurred rapidly and predictably in all patients, but under modern treatments, resistance arises slowly, if at all. The probability of resistance should be controlled by the rate of generation of resistant mutations. If many adaptive mutations arise simultaneously, then adaptation proceeds by soft selective sweeps in which multiple adaptive mutations spread concomitantly, but if adaptive mutations occur rarely in the population, then a single adaptive mutation should spread alone in a hard selective sweep. Here we use 6,717 HIV-1 consensus sequences from patients treated with first-line therapies between 1989 and 2013 to confirm that the transition from fast to slow evolution of drug resistance was indeed accompanied by the expected transition from soft to hard selective sweeps. This suggests more generally that evolution proceeds via hard sweeps if resistance is unlikely and via soft sweeps if it is likely.

A method to estimate the contribution of regional genetic associations to complex traits from summary association statistics

Guillaume Pare, Shihong Mao, Wei Deng
doi: http://dx.doi.org/10.1101/024067

Despite considerable efforts, known genetic associations only explain a small fraction of predicted heritability. Regional associations combine information from multiple contiguous genetic variants and can improve variance explained at established association loci. However, regional associations are not easily amenable to estimation using summary association statistics because of sensitivity to linkage disequilibrium (LD). We now propose a novel method to estimate phenotypic variance explained by regional associations using summary statistics while accounting for LD. Our method is asymptotically equivalent to multiple regression models when no interaction or haplotype effects are present. It has multiple applications, such as ranking of genetic regions according to variance explained and derivation of regional gene scores (GS). We show that most genetic variance lies in a small proportion of the genome, and that GS derived from regional associations can improve trait prediction above optimal polygenic scores. Our results also suggest regional associations underlie known linkage peaks.
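To make the role of LD concrete, the sketch below uses a standard regression identity rather than the authors' estimator, which the abstract does not spell out. It assumes standardized genotypes and phenotype, so that each variant's marginal effect is its correlation with the trait. Under that assumption the variance explained by a joint (multiple regression) model is r' R^{-1} r, where r is the vector of marginal correlations and R is the LD correlation matrix. The function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def regional_variance_explained(marginal_r, ld):
    """Joint variance explained, r' R^{-1} r, from marginal correlations.

    marginal_r: per-variant correlation with the trait (summary statistics).
    ld: LD correlation matrix among the variants in the region.
    Assumes standardized data and an invertible LD matrix.
    """
    joint_effects = solve(ld, marginal_r)
    return sum(r * b for r, b in zip(marginal_r, joint_effects))
```

With two variants that each have a marginal correlation of 0.3 and an LD correlation of 0.8 between them, the joint variance explained is 0.1, whereas naively summing the squared marginal correlations would give 0.18. Ignoring LD therefore double-counts the shared signal.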

Decreased transcription factor binding levels nearby primate pseudogenes suggests regulatory degeneration

Gavin M Douglas, Michael D Wilson, Alan M Moses
doi: http://dx.doi.org/10.1101/024026

Characteristics of pseudogene degeneration at the coding level are well-known, such as a shift towards neutral rates of nonsynonymous substitutions and gain of frameshift mutations. In contrast, degeneration of pseudogene transcriptional regulation is not well understood. Here, we test two predictions of regulatory degeneration along the pseudogenized lineage: (1) decreased transcription factor binding and (2) accelerated evolution in putative cis-regulatory regions. We find evidence for decreased TF binding levels nearby two primate pseudogenes compared to functional liver genes. We also find evidence for pseudogene-lineage-specific relaxation of sequence constraint on a fragment of the promoter of the primate pseudogene urate oxidase (Uox) and a nearby cis-regulatory module (CRM). However, the majority of TF-bound sequences nearby pseudogenes do not show evidence for lineage-specific accelerated rates of evolution. We conclude that decreases in TF binding level could be a marker for regulatory degeneration, while sequence degeneration in most CRMs may be obscured by background rates of TF binding site turnover.

Inference and analysis of population structure using genetic data and network theory

Gili Greenbaum, Alan R. Templeton, Shirli Bar-David
doi: http://dx.doi.org/10.1101/024042

Clustering individuals based on genetic data has become commonplace in many genetic and ecological studies. Most often, statistical inference of population structure is done by applying model-based approaches, such as Bayesian clustering, aided by visualization using distance-based approaches, such as PCA (Principal Component Analysis). While existing distance-based approaches suffer from a lack of statistical rigour, model-based approaches entail assumptions of prior conditions, such as that the subpopulations are at Hardy-Weinberg equilibrium. Here we present a distance-based approach for inference of population structure using genetic data based on the network theory concept of community, a dense subgraph within a network. A network is constructed using the pairwise genetic-distance matrix of all sampled individuals, and community detection algorithms are used to partition the network into communities, interpreted as a partition of the population into subpopulations. The statistical significance of the structure can be estimated by using permutation tests to evaluate the significance of the partition’s modularity, a network theory concept measuring the strength with which partitions divide the network. In order to further characterize population structure, a measure of the Strength of Association (SA) for an individual to its assigned community is calculated, and the Strength of Association Distribution (SAD) of the communities is analysed to provide additional population structure details. The approach presented here provides a novel, computationally efficient method for inference of population structure that assumes neither an underlying model nor prior conditions, making inference potentially more robust. The method is implemented in the software NetStruct, available at https://github.com/GiliG/NetStruct.
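The modularity-based significance test described in the abstract can be sketched as follows. This is an illustrative simplification and not NetStruct's code: the partition is taken as given rather than found by a community detection algorithm, edge weights are assumed to be similarities (larger means more closely related), and the null distribution is built by shuffling community labels among individuals, which may differ from the permutation scheme the authors use. The function names are hypothetical.

```python
import random


def modularity(weights, labels):
    """Newman modularity of a partition of a weighted undirected network.

    weights: symmetric matrix of edge weights with a zero diagonal.
    labels: community label for each node.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]
    total = sum(strength)  # equals twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += weights[i][j] - strength[i] * strength[j] / total
    return q / total


def permutation_p_value(weights, labels, n_perm=999, seed=0):
    """Compare observed modularity with label-shuffled partitions.

    Label shuffling is an assumed null model chosen for illustration.
    Returns the observed modularity and a one-sided permutation p-value.
    """
    rng = random.Random(seed)
    observed = modularity(weights, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if modularity(weights, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

For a toy network of two groups of four individuals, with weight 1.0 within groups and 0.1 between them, the true grouping has a modularity of about 0.38, and only a small fraction of shuffled labelings match or exceed it.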

The impact of partitioning on phylogenomic accuracy

Diego Darriba, David Posada
doi: http://dx.doi.org/10.1101/023978

Several strategies have been proposed to assign substitution models in phylogenomic datasets, or partitioning. The accuracy of these methods and, most importantly, their impact on phylogenetic estimation have not been thoroughly assessed using computer simulations. We simulated multiple partitioning scenarios to benchmark two a priori partitioning schemes (one model for the whole alignment, one model for each data block), and two statistical approaches (hierarchical clustering and greedy) implemented in PartitionFinder and in our new program, PartitionTest. Most methods were able to identify optimal partitioning schemes closely related to the true one. Greedy algorithms identified the true partitioning scheme more frequently than the clustering algorithms, but selected slightly less accurate partitioning schemes and tended to underestimate the number of partitions. PartitionTest was several times faster than PartitionFinder, with equal or better accuracy. Importantly, maximum likelihood phylogenetic inference was very robust to the partitioning scheme. Best-fit partitioning schemes resulted in optimal phylogenetic performance, without appreciable differences compared to the use of the true partitioning scheme. However, accurate trees were also obtained by a “simple” strategy consisting of assigning independent GTR+G models to each data block. In contrast, leaving the data unpartitioned always diminished the quality of the trees inferred, to a greater or lesser extent depending on the simulated scenario. The analysis of empirical data confirmed these trends, although suggesting a stronger influence of the partitioning scheme. Overall, our results suggest that statistical partitioning, but also the a priori assignment of independent GTR+G models, maximizes phylogenomic performance.

Learning the human chromatin network from all ENCODE ChIP-seq data

Scott M. Lundberg, William B. Tu, Brian Raught, Linda Z. Penn, Michael M. Hoffman, Su-In Lee
doi: http://dx.doi.org/10.1101/023911

Introduction: A cell’s epigenome arises from interactions among chromatin factors — transcription factors, histones, and other DNA-associated proteins — co-localized at particular genomic regions. Identifying the network of interactions among chromatin factors, the chromatin network, is of paramount importance in understanding epigenome regulation. Methods: We developed a novel computational approach, ChromNet, to infer the chromatin network from a set of ChIP-seq datasets. ChromNet has three key features that enable its use on large collections of ChIP-seq data. First, rather than using pairwise co-localization of factors along the genome, ChromNet identifies conditional dependence relationships that better discriminate direct and indirect interactions. Second, our novel statistical technique, the group graphical model, improves inference of conditional dependence on tightly correlated datasets. These datasets include transcription factors that form a complex or the same transcription factor assayed in different laboratories. Third, ChromNet’s computationally efficient method allows network learning among thousands of factors, and efficient relearning as new data is added. Results: We applied ChromNet to all available ChIP-seq data from the ENCODE Project, consisting of 1,415 ChIP-seq datasets, which revealed previously known chromatin factor interactions better than alternative approaches. ChromNet also identified previously unreported chromatin factor interactions. We experimentally validated one of these interactions, between the MYC and HCFC1 transcription factors. Discussion: ChromNet provides a useful tool for understanding the interactions among chromatin factors and identifying novel interactions. We have provided an interactive web-based visualization of the full ENCODE chromatin network and the ability to incorporate custom datasets at http://chromnet.cs.washington.edu.

Infinitely Long Branches and an Informal Test of Common Ancestry

Leonardo de Oliveira Martins, David Posada
doi: http://dx.doi.org/10.1101/023903

The evidence for universal common ancestry (UCA) is vast and persuasive, and a phylogenetic test was proposed for quantifying its odds against independently originated sequences based on the comparison between one and several trees. This test was successfully applied to a well-supported homologous sequence alignment, but was criticized once simulations showed that even alignments without any phylogenetic structure could mislead its conclusions. Despite claims to the contrary, we believe that the counterexample successfully showed a drawback of the test: its reliance on good alignments. Here we present a simplified version of this counterexample, which can be interpreted as a tree with arbitrarily long branches, and where the test again fails. We also present another simulation showing circumstances whereby any sufficiently similar alignment will favor UCA irrespective of the true independent origins for the sequences. We therefore conclude that the test should not be trusted unless convergence has already been ruled out a priori. Finally, we present a class of frequentist tests that perform better than the purportedly formal UCA test.

Trait Evolution on Phylogenetic Networks

Dwueng-Chwuan Jhwueng, Brian O’Meara
doi: http://dx.doi.org/10.1101/023986

Species may evolve on a reticulate network due to hybridization or other gene flow rather than on a strictly bifurcating tree, but comparative methods to deal with trait evolution on a network are lacking. We create such a method, which uses a Brownian motion model. Our method seeks to separately or jointly detect a bias in trait value coming from hybridization (β) and a burst of variation at the time of hybridization (v_H), along with the traditional Brownian motion parameters of ancestral state (μ) and rate of evolution (σ^2) and the measurement error of the tips (SE). We test the method with extensive simulations. We also apply the model to two empirical examples, cichlid body size and Nicotiana drought tolerance, and find substantial measurement error and a hint that hybrids have greater drought tolerance in the latter case. The new methods are available in the CRAN R package BMhyd.

Disentangling sources of selection on exonic transcriptional enhancers

Rachel M Agoglia, Hunter B Fraser
doi: http://dx.doi.org/10.1101/024000

In addition to coding for proteins, exons can also impact transcription by encoding regulatory elements such as enhancers. It has been debated whether such features confer heightened selective constraint, or evolve neutrally. We have addressed this question by developing a new approach to disentangle the sources of selection acting on exonic enhancers, in which we model the evolutionary rates of every possible substitution as a function of their effects on both protein sequence and enhancer activity. In three exonic enhancers, we found no significant association between evolutionary rates and effects on enhancer activity. This suggests that despite having biochemical activity, these exonic enhancers have no detectable selective constraint, and thus are unlikely to play a major role in protein evolution.