Recent technological developments allow investigation of the repeatability of evolution at the genomic level. Such investigation is particularly powerful when applied to a ring species, in which spatial variation can be used to represent the evolutionary changes that occurred during the evolution of two species from one. We examined patterns of genomic variation among three populations of the greenish warbler ring species, using genotypes at 13,013,950 nucleotide sites along a new greenish warbler consensus genome assembly. Genomic regions of low within-group variation are remarkably consistent between the three populations. These regions show high relative differentiation but surprisingly low absolute differentiation between populations. We propose that these regions underwent selective sweeps over a broad geographic area followed by within-population selection-induced reductions in variation. A surprising implication of this “sweep-before-differentiation” model is that genomic regions of high relative differentiation may have moved among populations more recently than regions elsewhere in the genome.
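The contrast between relative and absolute differentiation can be made concrete with a small sketch. The code below is illustrative only: it assumes biallelic sites with known allele frequencies in two populations, uses simple textbook estimators (within-group diversity 2p(1-p), absolute differentiation d_xy, and F_ST as 1 minus the ratio of within-group to between-group diversity), and the function name `window_stats` and all frequencies are hypothetical rather than taken from the study.

```python
def window_stats(variable_sites, window_length):
    """Return within-group diversity, d_xy and F_ST for one genomic window.

    variable_sites: list of (p1, p2) allele frequencies in two populations.
    window_length: total number of genotyped sites, including invariant ones.
    """
    pi_1 = sum(2 * p1 * (1 - p1) for p1, _ in variable_sites) / window_length
    pi_2 = sum(2 * p2 * (1 - p2) for _, p2 in variable_sites) / window_length
    # Absolute differentiation: chance that one allele drawn from each
    # population differs, averaged over every site in the window.
    d_xy = sum(p1 * (1 - p2) + p2 * (1 - p1)
               for p1, p2 in variable_sites) / window_length
    pi_within = (pi_1 + pi_2) / 2
    # Relative differentiation: share of between-group diversity that is
    # not also found within groups.
    f_st = 1 - pi_within / d_xy if d_xy > 0 else float("nan")
    return {"pi_within": pi_within, "d_xy": d_xy, "f_st": f_st}


# Hypothetical low-variation window: two fixed differences, one rare variant.
sweep_window = window_stats([(1.0, 0.0), (1.0, 0.0), (0.05, 0.0)], 1000)
# Hypothetical background window: many polymorphisms shared by both groups.
background_window = window_stats([(0.5, 0.4)] * 20, 1000)

print("sweep window     :", sweep_window)
print("background window:", background_window)
```

With these made-up inputs, the low-variation window has a much higher F_ST than the background window even though its d_xy is lower, which is the pattern described above.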
Background. Copy number variants (CNVs) are a type of polymorphism found to underlie phenotypic variation in both humans and livestock. Most surveys of CNV in livestock have been conducted in the cattle genome, and often utilise only a single approach for the detection of copy number differences. Here we performed a study of CNV in sheep, using multiple methods to identify and characterise copy number changes. Comprehensive information from small pedigrees (trios) was collected using multiple platforms (array CGH, SNP chip and whole genome sequence data), and these data were then analysed via multiple approaches to identify and verify CNVs. Results. In total, 3,488 autosomal CNV regions (CNVRs) were identified from 30 sheep. The average length of the identified CNVRs was 19 kb (range of 1 kb to 3.6 Mb), with shorter CNVRs being more frequent than longer CNVRs. The total length of all CNVRs was 67.6 Mb, which equates to 2.7% of the sheep autosomes. For individuals this value ranged from 0.24% to 0.55%, and the majority of CNVRs were identified in single animals. Rather than being uniformly distributed throughout the genome, CNVRs tended to be clustered. Application of three independent approaches for CNVR detection facilitated a comparison of validation rates. CNVs identified on the Roche-NimbleGen 2.1M CGH array generally had low validation rates, while whole genome sequence data had the highest validation rate. Conclusions. This study represents the first comprehensive survey of the distribution, prevalence and characteristics of CNVR in sheep. Multiple approaches were used to detect CNV regions and it appears that the best method for verifying CNVR on a large scale involves using a combination of detection methodologies. The characteristics of the 3,488 autosomal CNV regions identified in this study are comparable to other CNV regions reported in the literature and provide a valuable addition to the small subset of published sheep CNVs.
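As an illustration of how CNV calls can be collapsed into CNV regions and summarised by total length, here is a minimal sketch. It assumes the common convention of merging overlapping calls on the same chromosome. The abstract does not state how CNVRs were constructed, so the function names (`merge_cnvs`, `total_length`), the half-open coordinates and the example calls are all hypothetical.

```python
from collections import defaultdict


def merge_cnvs(calls):
    """Merge overlapping CNV calls into CNV regions (CNVRs).

    calls: iterable of (chromosome, start, end) with half-open coordinates.
    Returns a sorted list of merged (chromosome, start, end) regions.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in calls:
        by_chrom[chrom].append((start, end))

    regions = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or abutting call: extend the open region.
                current_end = max(current_end, end)
            else:
                regions.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        regions.append((chrom, current_start, current_end))
    return regions


def total_length(regions):
    """Total number of bases covered by the given regions."""
    return sum(end - start for _, start, end in regions)


# Hypothetical calls pooled across animals and detection platforms.
example_calls = [("1", 100, 200), ("1", 150, 300), ("1", 400, 500), ("2", 100, 200)]
example_regions = merge_cnvs(example_calls)
print(example_regions)
print(total_length(example_regions))
```

Dividing the total CNVR length by the autosome length gives the covered fraction reported as a percentage above.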
Background: In the context of a master-level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed and make available open-source code for testing all 203 possible nucleotide substitution models in the Maximum Likelihood (ML) setting under the common Akaike, corrected Akaike, and Bayesian information criteria. We address the question of whether model selection matters topologically, that is, whether conducting ML inferences under the optimal model, instead of a standard General Time Reversible (GTR) model, yields different tree topologies. We also assess the degree to which models selected and trees inferred under the three standard criteria (AIC, AICc, BIC) differ. Finally, we assess whether the definition of the sample size (#sites versus #sites × #taxa) yields different models and, as a consequence, different tree topologies. Results: We find that all three factors (in order of impact: nucleotide model selection, information criterion used, sample size definition) can yield substantially different final tree topologies (topological difference exceeding 10%) for approximately 5% of the tree inferences conducted on the 39 empirical datasets used in our study. Conclusions: We find that using the best-fit nucleotide substitution model may change the final ML tree topology compared to an inference under a default GTR model. The effect is less pronounced when comparing distinct information criteria. Nonetheless, in some cases we did obtain substantial topological differences.
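The count of 203 models corresponds to the number of ways of grouping the six exchangeability rates of a time-reversible model into classes of equal rates (the Bell number B6). The sketch below enumerates these groupings and applies the standard AIC, AICc and BIC formulas. It is not the authors' code: the function names, the rate ordering (AC, AG, AT, CG, CT, GT), the log-likelihoods and the parameter counts are assumptions chosen only to show that the sample size definition can change the selected model.

```python
import math


def rate_partitions(n_rates=6):
    """Yield every grouping of n_rates rates as a restricted growth string.

    Position i holds the class label of rate i; "012345" means six distinct
    rates (GTR) and "000000" means one shared rate.
    """
    def extend(prefix, max_label):
        if len(prefix) == n_rates:
            yield "".join(str(label) for label in prefix)
            return
        # A new rate may join an existing class or open the next new class.
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))

    yield from extend([0], 0)


def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood, k, n):
    return aic(log_likelihood, k) + 2 * k * (k + 1) / (n - k - 1)


def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood


models = list(rate_partitions())
print(len(models))  # number of time-reversible rate groupings

# Hypothetical fits: (log-likelihood, number of free parameters).
candidates = {"simpler": (-10020.0, 20), "GTR": (-10000.0, 25)}


def best_by_bic(sample_size):
    return min(candidates, key=lambda m: bic(*candidates[m], sample_size))


n_sites, n_taxa = 1000, 10
print(best_by_bic(n_sites))           # sample size = #sites
print(best_by_bic(n_sites * n_taxa))  # sample size = #sites x #taxa
```

In this made-up case the larger sample size definition increases the BIC penalty per parameter enough to favour the simpler model.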
A number of methods have been developed to use genetic sequence data to identify and delineate species. Some methods are based on heuristics, such as DNA barcoding which is based on a sequence-distance threshold, while others use Bayesian model comparison under the multispecies coalescent model. Here we use mathematical analysis and computer simulation to demonstrate large differences in statistical performance of species identification between DNA barcoding and Bayesian inference under the multispecies coalescent model as implemented in the bpp program. We show that a fixed genetic-distance threshold as used in DNA barcoding is problematic for delimiting species, even if the threshold is “optimized”, because different species have different population sizes and different divergence times, and therefore display different amounts of intra-species versus inter-species variation. In contrast, bpp can reliably delimit species in such situations with only one locus and rarely supports a wrong assignment with high posterior probability. While under-sampling or rare specimens may pose problems for heuristic methods, bpp can delimit species with high power when multi-locus data are used, even if the species is represented by a single specimen. Finally we demonstrate that bpp may be powerful for delimiting cryptic species using specimens that are misidentified as a single species in the barcoding library.
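The argument against a fixed distance threshold can be illustrated with expected pairwise distances under the coalescent: two sequences from one species differ on average by θ, and two sequences from sister species by 2τ + θ_A, where θ_A is the ancestral population-size parameter and τ the divergence time in expected substitutions per site. The sketch below uses hypothetical parameter values and function names; it does not reproduce the bpp analysis.

```python
def expected_within(theta):
    """Expected pairwise distance between two sequences of one species."""
    return theta


def expected_between(tau, theta_ancestor):
    """Expected pairwise distance between sequences of two sister species."""
    return 2 * tau + theta_ancestor


def barcode_call(distance, threshold):
    """Threshold heuristic: distances above the threshold imply two species."""
    return "different species" if distance > threshold else "same species"


# Hypothetical species X: one large population, hence high diversity.
within_x = expected_within(theta=0.03)
# Hypothetical sister species Y and Z: small populations, recent split.
between_yz = expected_between(tau=0.005, theta_ancestor=0.001)

# Scan thresholds from 0.001 to 0.050 and keep those that get both cases right.
thresholds = [t / 1000 for t in range(1, 51)]
working = [
    t for t in thresholds
    if barcode_call(within_x, t) == "same species"
    and barcode_call(between_yz, t) == "different species"
]
print(within_x, between_yz, working)
```

Because the within-species distance in X exceeds the between-species distance for Y and Z, no single threshold classifies both cases correctly with these values.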
Understanding both the role of selection in driving phenotypic change and its underlying genetic basis remains a major challenge in evolutionary biology. Here we focus on a classic system of local adaptation in the North American deer mouse, Peromyscus maniculatus, which occupies two main habitat types, prairie and forest. Using historical collections we demonstrate that forest-dwelling mice have longer tails than those from non-forested habitats, even when we account for individual and population relatedness. Based on genome-wide SNP capture data, we find that mice from forested habitats in the eastern and western parts of their range form separate clades, suggesting that increased tail length evolved independently from a short-tailed ancestor. Two major changes in skeletal morphology can give rise to longer tails (an increased number and an increased length of vertebrae), and we find that forest mice in the east and west have both more and longer caudal vertebrae, but not trunk vertebrae, than nearby prairie forms. Using a second-generation intercross between a prairie and forest pair, we show that the number and length of caudal vertebrae are not correlated in this recombinant population, suggesting that variation in these traits is controlled by separate genetic loci. Together, these results demonstrate convergent evolution of the long-tailed forest phenotype through multiple, distinct genetic mechanisms (controlling vertebral length and vertebral number), thus suggesting that these morphological changes, either independently or together, are adaptive.
A crucial component of major transitions theory is that after the transition, adaptation occurs primarily at the level of the new, higher-level unit. For collective-level adaptations to occur, though, collective-level traits must be heritable. Since collective-level traits are functions of lower-level traits, collective-level heritability is related to particle-level heritability. However, the nature of this relationship has rarely been explored in the context of major transitions. We examine relationships between particle-level heritability and collective-level heritability for several functions that express collective-level traits in terms of particle-level traits. When this relationship is linear, the heritability of a collective-level trait is never less than that of the corresponding particle-level trait and is higher under most conditions. For more complicated functions, collective-level heritability is higher under most conditions, but can be lower when the function relating particle-level to collective-level traits is sensitive to small fluctuations in the state of the particles within the collective. Rather than being an impediment to major transitions, we show that collective-level heritability superior to that of the lower-level units can often arise ‘for free’, simply as a byproduct of collective formation.
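The linear case can be explored with a small simulation. The sketch below makes strong simplifying assumptions that are not taken from the study: collectives are clonal, every particle's trait is the collective's genetic value plus independent environmental noise, the collective-level trait is the mean of its particles, and heritability is estimated as the slope of an offspring-on-parent regression. All names and parameter values are hypothetical.

```python
import random


def regression_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate_heritabilities(n_collectives=20000, n_particles=10,
                            sd_genetic=1.0, sd_env=1.0, seed=1):
    """Estimate particle-level and collective-level heritability."""
    rng = random.Random(seed)
    parent_particle, offspring_particle = [], []
    parent_collective, offspring_collective = [], []
    for _ in range(n_collectives):
        g = rng.gauss(0.0, sd_genetic)  # genetic value shared by the clone
        parents = [g + rng.gauss(0.0, sd_env) for _ in range(n_particles)]
        offspring = [g + rng.gauss(0.0, sd_env) for _ in range(n_particles)]
        # Particle level: one parent particle and one offspring particle.
        parent_particle.append(parents[0])
        offspring_particle.append(offspring[0])
        # Collective level: mean trait of each collective (a linear function).
        parent_collective.append(sum(parents) / n_particles)
        offspring_collective.append(sum(offspring) / n_particles)
    h2_particle = regression_slope(parent_particle, offspring_particle)
    h2_collective = regression_slope(parent_collective, offspring_collective)
    return h2_particle, h2_collective


h2_particle, h2_collective = simulate_heritabilities()
print(h2_particle, h2_collective)
```

Under these assumptions the expected values are σg²/(σg² + σe²) for particles and σg²/(σg² + σe²/n) for collectives of n particles, so averaging over particles reduces the environmental variance and raises heritability at the collective level.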
Correctly estimating the age of a gene or gene family is important for a variety of fields, including molecular evolution, comparative genomics, and phylogenetics, and increasingly for systems biology and disease genetics. However, most studies use only a point estimate of a gene’s age, neglecting the substantial uncertainty involved in this estimation. Here, we characterize this uncertainty by investigating the effect of algorithm choice on gene-age inference and calculate consensus gene ages with attendant error distributions for a variety of model eukaryotes. We use thirteen orthology inference algorithms to create gene-age datasets and then characterize the error around each age-call on a per-gene and per-algorithm basis. Systematic error was found to be a large factor in estimating gene age, suggesting that simple consensus algorithms are not enough to give a reliable point estimate. We also found that different sources of error can affect downstream analyses, such as gene ontology enrichment. Our consensus gene-age datasets, with associated error terms, are made fully available at https://github.com/marcottelab/Gene-Ages so that researchers can propagate this uncertainty through their analyses.
Genotypic fitness landscapes are constructed by assessing the fitness of all possible combinations of a given number of mutations. In recent years, several experimental fitness landscapes have been completely resolved. As fitness landscapes are high-dimensional, simple measures of their structure are used as statistics in empirical applications. Epistasis is one of the most relevant features of fitness landscapes. Here we propose a new natural measure of the amount of epistasis based on the correlation of fitness effects of mutations. This measure has a natural interpretation, captures the interaction between mutations well and can be obtained analytically for most landscape models. We discuss how this measure is related to previous measures of epistasis (number of peaks, roughness/slope, fraction of sign epistasis, Fourier-Walsh spectrum) and how it can be easily extended to landscapes with missing data or with fitness ranks only. Furthermore, the dependence of the correlation of fitness effects on mutational distance contains interesting information about the patterns of epistasis. This dependence can be used to uncover the amount and nature of epistatic interactions in a landscape or to discriminate between different landscape models.
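One way to read a correlation-of-fitness-effects measure is sketched below for a complete biallelic landscape. For every genotype g and locus j, the fitness effect of mutating j in g is paired with the effect of the same mutation in each neighbour of g that differs at one other locus, and the Pearson correlation is taken over all pairs. This is an interpretation under stated assumptions, not necessarily the exact estimator of the paper. The function names and the two example landscapes are hypothetical.

```python
import math
import random
from itertools import product


def flip(genotype, locus):
    """Return the genotype with the allele at `locus` switched."""
    return genotype[:locus] + (1 - genotype[locus],) + genotype[locus + 1:]


def fitness_effect(landscape, genotype, locus):
    """Fitness change caused by mutating `locus` in `genotype`."""
    return landscape[flip(genotype, locus)] - landscape[genotype]


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def effect_correlation(landscape, n_loci):
    """Correlation of a mutation's effect between neighbouring backgrounds."""
    in_genotype, in_neighbour = [], []
    for g in product((0, 1), repeat=n_loci):
        for j in range(n_loci):
            for i in range(n_loci):
                if i != j:
                    in_genotype.append(fitness_effect(landscape, g, j))
                    in_neighbour.append(fitness_effect(landscape, flip(g, i), j))
    return pearson(in_genotype, in_neighbour)


# Additive landscape: each locus contributes a fixed amount (no epistasis).
coefficients = (1.0, 2.0, 0.5, 4.0)
additive = {g: sum(c * a for c, a in zip(coefficients, g))
            for g in product((0, 1), repeat=4)}

# Uncorrelated random landscape: independent fitness for every genotype.
rng = random.Random(7)
uncorrelated = {g: rng.gauss(0.0, 1.0) for g in product((0, 1), repeat=8)}

gamma_additive = effect_correlation(additive, 4)
gamma_uncorrelated = effect_correlation(uncorrelated, 8)
print(gamma_additive, gamma_uncorrelated)
```

Without epistasis a mutation has the same effect in every background, so the correlation is 1. For independent random fitness values the correlation is expected to be close to 0.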
Conserved genes evolve slowly in nature, by definition, but we find that some conserved genes are among the fastest-evolving genes in the long-term evolution experiment with Escherichia coli (LTEE). We identified the set of almost 2000 core genes shared among sixty clinical, environmental, and laboratory strains of E. coli. During the LTEE, these core genes accumulated significantly more nonsynonymous mutations than did flexible (i.e., noncore) genes after accounting for the mutational target size. Furthermore, the core genes under strongest positive selection in the LTEE are more conserved in nature than the average core gene based both on sequence diversity among E. coli strains and divergence between E. coli and Salmonella enterica. We conclude that the conditions of the LTEE are novel for E. coli, at least in relation to the long sweep of its evolution in nature. We suggest that what is most novel about the LTEE for the bacteria is the constancy of the environment, its biophysical simplicity, and the absence of microbial competitors, predators, and parasites.
We investigate the dependence of the site frequency spectrum (SFS) on the topological structure of genealogical trees. We show that basic population genetic statistics — for instance estimators of θ or neutrality tests such as Tajima’s D — can be decomposed into components of waiting times between coalescent events and of tree topology. Our results clarify the relative impact of the two components on these statistics. We provide a rigorous interpretation of positive or negative values of neutrality tests in terms of the underlying tree shape. In particular, we show that values of Tajima’s D and Fay and Wu’s H depend in a direct way on a measure of tree balance which is mostly determined by the root balance of the tree. We also compute the maximum and minimum values for neutrality tests as a function of sample size. Focusing on the standard coalescent model of neutral evolution, we discuss how waiting times between coalescent events are related to derived allele frequencies and thereby to the frequency spectrum. Finally, we show how tree balance affects the frequency spectrum. In particular, we derive the complete SFS conditioned on the root imbalance. We show that the conditional spectrum is peaked at frequencies corresponding to the root imbalance and strongly biased towards rare alleles.
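For reference, the sketch below computes Watterson's estimator, the pairwise-diversity estimator and Tajima's D directly from an unfolded site frequency spectrum, using the standard constants from Tajima (1989). The function name and the three example spectra are hypothetical, and the code does not implement the decomposition into waiting times and tree topology described above.

```python
import math


def tajimas_d(sfs):
    """Return (theta_W, theta_pi, D) from an unfolded SFS.

    sfs[i - 1] is the number of sites whose derived allele is carried by
    i of the n sampled sequences, for i = 1 .. n - 1.
    """
    n = len(sfs) + 1
    s = sum(sfs)  # number of segregating sites
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1
    # A site with derived count i contributes i(n - i) / C(n, 2) to pi.
    theta_pi = sum(count * 2 * i * (n - i) / (n * (n - 1))
                   for i, count in enumerate(sfs, start=1))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (theta_pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return theta_w, theta_pi, d


n = 10
# Expected neutral spectrum, proportional to 1/i (non-integer counts are
# used here only to check the algebra).
neutral = [10 / i for i in range(1, n)]
# Excess of rare alleles: singletons only.
rare = [10] + [0] * (n - 2)
# Excess of intermediate-frequency alleles: every site at frequency 5/10.
intermediate = [0, 0, 0, 0, 10, 0, 0, 0, 0]

for name, spectrum in [("neutral", neutral), ("rare", rare),
                       ("intermediate", intermediate)]:
    print(name, tajimas_d(spectrum))
```

D is zero for the expected neutral spectrum, negative when rare alleles are in excess and positive when intermediate-frequency alleles are in excess, which is the sign pattern the tree-shape interpretation refers to.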