Estimating phylogenetic trees from genome-scale data

Liang Liu, Zhenxiang Xi, Shaoyuan Wu, Charles Davis, Scott V. Edwards
Comments: 39 pages, 3 figures
Subjects: Populations and Evolution (q-bio.PE)

As researchers collect increasingly large molecular data sets to reconstruct the Tree of Life, the heterogeneity of signals in the genomes of diverse organisms poses challenges for traditional phylogenetic analysis. A class of phylogenetic methods known as “species tree methods” has been proposed to directly address one important source of gene tree heterogeneity, namely the incomplete lineage sorting or deep coalescence that occurs when evolving lineages radiate rapidly, resulting in a diversity of gene trees from a single underlying species tree. Although such methods are gaining in popularity, they are being adopted with caution in some quarters, in part because of an increasing number of examples of strong phylogenetic conflict between concatenation or supermatrix methods and species tree methods. Here we review theory and empirical examples that help clarify these conflicts. Thinking of concatenation as a special case of the more general model provided by the multispecies coalescent can help explain a number of differences in the behavior of the two methods on phylogenomic data sets. Recent work suggests that species tree methods are more robust than concatenation approaches to some of the classic challenges of phylogenetic analysis, including rapidly evolving sites in DNA sequences, base compositional heterogeneity and long branch attraction. We show that approaches such as binning, designed to augment the signal in species tree analyses, can distort the distribution of gene trees and are inconsistent. Computationally efficient species tree methods that incorporate biological realism are a key to phylogenetic analysis of whole genome data.

A Spatial Framework for Understanding Population Structure and Admixture

Gideon Bradburd, Peter L. Ralph, Graham Coop
doi: http://dx.doi.org/10.1101/013474

Geographic patterns of genetic variation within modern populations, produced by complex histories of migration, can be difficult to infer and visually summarize. A general consequence of geographically limited dispersal is that samples from nearby locations tend to be more closely related than samples from distant locations, and so genetic covariance often recapitulates geographic proximity. We use genome-wide polymorphism data to build “geogenetic maps”, which, when applied to stationary populations, produce a map of the geographic positions of the populations, but with distances distorted to reflect historical rates of gene flow. In the underlying model, allele frequency covariance is a decreasing function of geogenetic distance, and nonlocal gene flow such as admixture can be identified as anomalously strong covariance over long distances. This admixture is explicitly co-estimated and depicted as arrows, from the source of admixture to the recipient, on the geogenetic map. We demonstrate the utility of this method on a circum-Tibetan sampling of the greenish warbler (Phylloscopus trochiloides), in which we find evidence for gene flow between the adjacent, terminal populations of the ring species. We also analyze a global sampling of human populations, for which we largely recover the geography of the sampling, with support for significant histories of admixture in many samples. This new tool for understanding and visualizing patterns of population structure is implemented in a Bayesian framework in the program SpaceMix.

The effect of the dispersal kernel on isolation-by-distance in a continuous population


Tara N. Furstenau, Reed A. Cartwright
Comments: 18 pages (main); 4 pages (supp)
Subjects: Populations and Evolution (q-bio.PE)

Under models of isolation-by-distance, population structure is determined by the probability of identity-by-descent between pairs of genes according to the geographic distance between them. Well-established analytical results indicate that the relationship between geographical and genetic distance depends mostly on the neighborhood size of the population, $N_b = 4{\pi}{\sigma}^2 D_e$, which represents a standardized measure of dispersal. To test this prediction, we model local dispersal of haploid individuals on a two-dimensional torus using four dispersal kernels: Rayleigh, exponential, half-normal and triangular. When neighborhood size is held constant, the distributions produce similar patterns of isolation-by-distance, confirming predictions. Considering this, we propose that the triangular distribution is the appropriate null distribution for isolation-by-distance studies. Under the triangular distribution, dispersal is uniform within an area of $4{\pi}{\sigma}^2$ (i.e. the neighborhood area), which suggests that the common description of neighborhood size as a measure of a local panmictic population is valid for popular families of dispersal distributions. We further show how to draw from the triangular distribution efficiently and argue that it should be utilized in other studies in which computational efficiency is important.
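
As a rough illustration of the neighborhood-size formula and of uniform dispersal within the neighborhood area, the Python sketch below assumes that "triangular" refers to the radial distance of a displacement drawn uniformly from a disk of radius 2σ (area 4πσ²), whose density rises linearly with distance. That reading, the inverse-transform sampler, and the function names `neighborhood_size`, `draw_displacement` and `disperse_on_torus` are assumptions made here for illustration; the abstract does not describe the authors' own sampling procedure.

```python
import math
import random


def neighborhood_size(sigma, density):
    """Neighborhood size N_b = 4 * pi * sigma^2 * D_e."""
    return 4.0 * math.pi * sigma ** 2 * density


def draw_displacement(sigma, rng=random):
    """Draw a (dx, dy) displacement uniformly from a disk of radius 2*sigma.

    Assumption: the radial distance r then has the linearly increasing
    density f(r) = 2r / R^2 on [0, R] with R = 2*sigma, which is sampled
    here by inverse transform as r = R * sqrt(U).
    """
    radius = 2.0 * sigma
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return r * math.cos(theta), r * math.sin(theta)


def disperse_on_torus(x, y, sigma, width, height, rng=random):
    """Move a point by one dispersal step, wrapping around a torus."""
    dx, dy = draw_displacement(sigma, rng)
    return (x + dx) % width, (y + dy) % height
```

With this parameterization the variance of the displacement along each axis is R²/4 = σ², so holding σ and the density D_e fixed holds N_b fixed as well.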

Geographic range size is predicted by plant mating system

Dena Grossenbacher, Ryan Briscoe Runquist, Emma Goldberg, Yaniv Brandvain
doi: http://dx.doi.org/10.1101/013417

Species ranges vary enormously, and even closest relatives may differ in range size by several orders of magnitude. With data from hundreds of species spanning 20 genera and generic sections, we show that plant species that autonomously reproduce via self-pollination consistently have larger geographic ranges than their close relatives that generally require two parents for reproduction. Further analyses strongly implicate autonomous fertilization in causing this relationship, as it is not driven by traits such as polyploidy or annual life history whose evolution is sometimes correlated with the transition to autonomous self-fertilization. Furthermore, we find that selfers occur at higher maximum latitudes and that disparity in range size between selfers and outcrossers increases with time since their separation. Together, these results show that autonomous reproduction – a critical biological trait that eliminates mate limitation and thus potentially increases the probability of establishment – increases range size.

Too packed to change: site-specific substitution rates and side-chain packing in protein evolution

María Laura Marcos, Julian Echave
doi: http://dx.doi.org/10.1101/013359

In protein evolution, due to functional and biophysical constraints, the rates of amino acid substitution differ from site to site. Among the best predictors of site-specific rates is packing density. The packing density measure that best correlates with rates is the weighted contact number (WCN), the sum of inverse square distances between the site’s Cα and the other Cαs. According to a mechanistic stress model proposed recently, rates are determined by packing because mutating packed sites stresses and destabilizes the protein’s active conformation. While WCN is a measure of Cα packing, mutations replace side chains, which prompted us to consider whether a site’s evolutionary divergence is constrained by main-chain packing or side-chain packing. To address this issue, we extended the stress theory to model side chains explicitly. The theory predicts that rates should depend solely on side-chain packing. We tested these predictions on a data set of structurally and functionally diverse monomeric enzymes. We found that, on average, side-chain contact density (WCNρ) explains 39.1% of among-sites rate variation, more than main-chain contact density (WCNα), which explains 32.1%. More importantly, the independent contribution of WCNα is only 0.7%. Thus, as predicted by the stress theory, site-specific evolutionary rates are determined by side-chain packing.
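
The WCN definition above (a sum of inverse square distances from one site to all others) can be written out in a few lines of Python. This is a minimal sketch on made-up coordinates: the function name `weighted_contact_numbers` and the toy input are hypothetical, and no file parsing is shown. A side-chain variant would pass one representative point per side chain instead of the Cα positions; which point the authors use is not stated in the abstract.

```python
import math


def weighted_contact_numbers(coords):
    """Return the WCN of each site from a list of (x, y, z) coordinates.

    WCN_i = sum over j != i of 1 / d_ij^2, where d_ij is the distance
    between the representative atoms of sites i and j.
    """
    wcn = []
    for i, a in enumerate(coords):
        total = 0.0
        for j, b in enumerate(coords):
            if i != j:
                total += 1.0 / math.dist(a, b) ** 2
        wcn.append(total)
    return wcn


# Toy input: three hypothetical C-alpha positions on a line, 1 unit apart.
toy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(weighted_contact_numbers(toy))  # [1.25, 2.0, 1.25]
```

The middle site has the highest value because it has two neighbors at distance 1, whereas each end site has one neighbor at distance 1 and one at distance 2.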

Limits to adaptation along environmental gradients

Jitka Polechová, Nick Barton
doi: http://dx.doi.org/10.1101/012690

Why do species not adapt to ever-wider ranges of conditions, gradually expanding their ecological niche? Theories of niche evolution typically omit spatial context, yet all species experience spatially variable conditions. Gene flow across environments has two conflicting effects on adaptation: while it increases genetic variation, which is a prerequisite for adaptation, gene flow may swamp adaptation to local conditions. We show that genetic drift can generate a sharp margin to a species’ range, by reducing genetic variance below the level needed for adaptation to spatially variable conditions. Dimensional arguments and separation of ecological and evolutionary time scales reveal a simple threshold that predicts when adaptation at the range margin fails. Two observable parameters describe the threshold: i) the effective environmental gradient, which can be measured by the loss of fitness due to dispersal to a different environment, and ii) the efficacy of selection relative to genetic drift. The theory predicts sharp range margins even in the absence of abrupt changes in the environment. Furthermore, it implies that gradual worsening of conditions across a species’ habitat may suddenly lead to range fragmentation – as adaptation to a wide span of conditions within a single species becomes impossible.

An annotated consensus genetic map for Pinus taeda L. and extent of linkage disequilibrium in three genotype-phenotype discovery populations

Jared W. Westbrook, Vikram E. Chhatre, Le-Shin Wu, Srikar Chamala, Leandro Gomide Neves, Patricio Muñoz, Pedro J Martínez-García, David B. Neale, Matias Kirst, Keithanne Mockaitis, C. Dana Nelson, Gary F. Peter, John M. Davis, Craig S. Echt
doi: http://dx.doi.org/10.1101/012625

A consensus genetic map for Pinus taeda (loblolly pine) was constructed by merging three previously published maps with a map from a pseudo-backcross between P. taeda and P. elliottii (slash pine). The consensus map positioned 4981 markers via genotyping of 1251 individuals from four pedigrees. It is the densest linkage map for a conifer to date. Average marker spacing was 0.48 centiMorgans and total map length was 2372 centiMorgans. Functional predictions for 4762 markers for expressed sequence tags were improved by alignment to full-length P. taeda transcripts. Alignments to the P. taeda genome mapped 4225 scaffold sequences onto linkage groups. The consensus genetic map was used to compare the extent of genome-wide linkage disequilibrium in an association population of distantly related P. taeda individuals (ADEPT2), a multiple-family pedigree used for genomic selection studies (CCLONES), and a full-sib quantitative trait locus mapping population (BC1). Weak linkage disequilibrium was observed in CCLONES and ADEPT2. The average squared correlation, R², between genotypes at SNPs less than one centiMorgan apart was less than 0.05 in both populations, and R² did not decay substantially with genetic distance. By contrast, strong and extended linkage disequilibrium was observed among BC1 full-sibs, where average R² decayed from 0.8 to less than 0.1 over 53 centiMorgans. The consensus map and analysis of linkage disequilibrium establish a foundation for comparative association and quantitative trait locus mapping between genotype-phenotype discovery populations.
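
For readers who want to see the linkage disequilibrium statistic concretely, the Python sketch below computes the squared Pearson correlation between two genotype vectors and checks the reported marker spacing against the reported map length and marker count. The 0/1/2 allele-count coding, the function name `r_squared` and the toy genotypes are assumptions for illustration; the abstract does not specify how genotypes were coded.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors.

    Genotypes are assumed to be coded as allele counts (0, 1 or 2).
    """
    n = len(g1)
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return float("nan")  # monomorphic marker: correlation undefined
    return cov * cov / (v1 * v2)


# Hypothetical genotypes at two SNPs in four individuals (uncorrelated).
snp_a = [0, 0, 2, 2]
snp_b = [0, 2, 0, 2]
print(r_squared(snp_a, snp_b))  # 0.0

# Rough consistency check of the reported figures: map length / marker count.
print(round(2372 / 4981, 2))  # 0.48
```

Averaging `r_squared` over all SNP pairs within a genetic-distance bin would give a decay curve of the kind the abstract summarizes.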

Testing for genetic associations in arbitrarily structured populations

Minsun Song, Wei Hao, John D. Storey
doi: http://dx.doi.org/10.1101/012682

We present a new statistical test of association between a trait (either quantitative or binary) and genetic markers, which we theoretically and practically prove to be robust to arbitrarily complex population structure. The statistical test involves a set of parameters that can be directly estimated from large-scale genotyping data, such as that measured in genome-wide association studies (GWAS). We also derive a new set of methodologies, called a genotype-conditional association test (GCAT), shown to provide accurate association tests in populations with complex structures, manifested in both the genetic and environmental contributions to the trait. We demonstrate the proposed method on a large simulation study and on the Northern Finland Birth Cohort study. In the Finland study, we identify several new significant loci that other methods do not detect. Our proposed framework provides a substantially different approach to the problem from existing methods. We provide some discussion on its similarities and differences with the linear mixed model and principal component approaches.

The competition between simple and complex evolutionary trajectories in asexual populations


Ian E. Ochs, Michael M. Desai
Comments: 8 pages, 3 figures
Subjects: Populations and Evolution (q-bio.PE)

On rugged fitness landscapes where sign epistasis is common, adaptation can often involve either individually beneficial “uphill” mutations or more complex mutational trajectories involving fitness valleys or plateaus. The dynamics of the evolutionary process determine the probability that evolution will take any specific path among a variety of competing possible trajectories. Understanding this evolutionary choice is essential if we are to understand the outcomes and predictability of adaptation on rugged landscapes. We present a simple model to analyze the probability that evolution will eschew immediately uphill paths in favor of crossing fitness valleys or plateaus that lead to higher fitness but less accessible genotypes. We calculate how this probability depends on the population size, mutation rates, and relevant selection pressures, and compare our analytical results to Wright-Fisher simulations. We find that the probability of valley crossing depends nonmonotonically on population size: intermediate size populations are most likely to follow a “greedy” strategy of acquiring immediately beneficial mutations even if they lead to evolutionary dead ends, while larger and smaller populations are more likely to cross fitness valleys to reach distant advantageous genotypes. We explicitly identify the boundaries between these different regimes in terms of the relevant evolutionary parameters. Above a certain threshold population size, we show that the degree of evolutionary “foresight” depends only on a single simple combination of the relevant parameters.
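
To make the competing-trajectories setup concrete, here is a minimal Wright-Fisher sketch in Python. It is not the authors' model: the four-genotype layout (wild type, an "uphill" dead end, a valley intermediate, and a valley-crossed genotype), the one-way mutation scheme, and the names `next_generation` and `simulate` are all assumptions chosen for illustration, and any fitness values or mutation rates passed in are hypothetical.

```python
import random

# Genotype indices (an assumed layout, not taken from the paper):
# 0 = wild type, 1 = "uphill" dead end, 2 = valley intermediate,
# 3 = valley-crossed genotype.


def next_generation(counts, fitness, mut, rng):
    """Advance a haploid Wright-Fisher population by one generation."""
    n = sum(counts)
    # Selection: expected frequency proportional to count * fitness.
    weights = [c * w for c, w in zip(counts, fitness)]
    total = sum(weights)
    f0, f1, f2, f3 = (w / total for w in weights)
    # Mutation: one-way steps 0 -> 1, 0 -> 2 and 2 -> 3.
    mu_up, mu_valley, mu_cross = mut
    post = [
        f0 * (1.0 - mu_up - mu_valley),
        f1 + f0 * mu_up,
        f2 * (1.0 - mu_cross) + f0 * mu_valley,
        f3 + f2 * mu_cross,
    ]
    # Drift: multinomial sampling of n offspring.
    offspring = rng.choices(range(4), weights=post, k=n)
    return [offspring.count(g) for g in range(4)]


def simulate(n, fitness, mut, rng, max_gen=100_000):
    """Run until the dead-end or the valley-crossed genotype fixes."""
    counts = [n, 0, 0, 0]
    for _ in range(max_gen):
        counts = next_generation(counts, fitness, mut, rng)
        if counts[1] == n:
            return "uphill"
        if counts[3] == n:
            return "crossed"
    return "undecided"
```

Repeating `simulate` many times across a range of population sizes, for example with hypothetical fitnesses such as `[1.0, 1.05, 0.95, 1.2]`, would give an empirical estimate of how often the valley is crossed, which is the kind of quantity the abstract compares against analytical results.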

The genomic signature of social interactions regulating honey bee caste development

Svjetlana Vojvodic, Brian R Johnson, Brock Harpur, Clement Kent, Amro Zayed, Kirk E Anderson, Timothy Linksvayer
doi: http://dx.doi.org/10.1101/012385

Social evolution theory posits the existence of genes expressed in one individual that affect the traits and fitness of social partners. The archetypal example of reproductive altruism, honey bee reproductive caste, involves strict social regulation of larval caste fate by care-giving nurses. However, the contribution of nurse-expressed genes, which are prime socially-acting candidate genes, to the caste developmental program and to caste evolution remains mostly unknown. We experimentally induced new queen production by removing the current colony queen, and we used RNA sequencing to study the gene expression profiles of both developing larvae and their care-giving nurses before and after queen removal. By comparing the gene expression profiles of queen-destined larvae and their nurses with those of worker-destined larvae and their nurses in queen-present and queen-absent conditions, we identified larval and nurse genes associated with larval caste development and with queen presence. Of 950 differentially-expressed genes associated with larval caste development, 82% were expressed in larvae and 18% were expressed in nurses. Behavioral and physiological evidence suggests that nurses may specialize in the short term feeding queen- versus worker-destined larvae. Estimated selection coefficients indicated that both nurse and larval genes associated with caste are rapidly evolving, especially those genes associated with worker development. Of the 1863 differentially-expressed genes associated with queen presence, 90% were expressed in nurses. Altogether, our results suggest that socially-acting genes play important roles in both the expression and evolution of socially-influenced traits like caste.