A major goal of speciation research is to reveal the genomic signatures that accompany the speciation process. Genome scans are routinely used to explore genome-wide variation and identify highly differentiated loci that may contribute to ecological divergence, but they do not incorporate spatial, phenotypic, or environmental data that might enhance outlier detection. Geographic cline analysis provides a potential framework for integrating diverse forms of data in a spatially explicit manner, but it has not been used to study genome-wide patterns of divergence. Aided by a first-draft genome assembly, we combine an FCT scan and geographic cline analysis to characterize patterns of genome-wide divergence between pollination ecotypes of Mimulus aurantiacus. FCT analysis of 58,872 SNPs generated via RADseq revealed little ecotypic differentiation (mean FCT = 0.041), though a small number of loci were moderately to highly diverged. Consistent with our previous results from the gene MaMyb2, which contributes to differences in flower color, 130 loci have cline shapes that recapitulate the spatial pattern of trait divergence, suggesting that they reside in or near the genomic regions that contribute to pollinator isolation. In the narrow hybrid zone between the ecotypes, extensive admixture among individuals and low linkage disequilibrium between markers indicate that outlier loci are scattered throughout the genome, rather than being restricted to one or a few regions. In addition to revealing the genomic consequences of ecological divergence in this system, we discuss how geographic cline analysis is a powerful but under-utilized framework for studying genome-wide patterns of divergence.
Read-Based Phasing of Related Individuals
Motivation: Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information — reads and pedigree — has the potential to deliver results better than each individually. Results: We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2x for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15x coverage per individual.
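The pedigree half of this idea can be illustrated with a minimal sketch of Mendelian phasing at a single biallelic site in a mother-father-child trio. This is not the fixed-parameter algorithm described above; the function names and the genotype encoding (alternative-allele counts 0, 1, 2) are assumptions made for illustration only. The sketch shows the case where genotypes alone cannot decide the phase (all three individuals heterozygous), which is where reads spanning several variants could add information.

```python
from typing import Optional, Tuple


def can_transmit(parent_gt: int, allele: int) -> bool:
    """True if a parent with alt-allele count parent_gt (0, 1 or 2)
    can pass on the given allele (0 = reference, 1 = alternative)."""
    return parent_gt == 1 or parent_gt == 2 * allele


def phase_child(child: int, mother: int, father: int) -> Optional[Tuple[int, int]]:
    """Return (maternal_allele, paternal_allele) for the child.

    Returns None when the genotypes are Mendelian-inconsistent or when
    more than one assignment is possible, i.e. the pedigree alone is
    not informative at this site.
    """
    candidates = [
        (m, p)
        for m in (0, 1)
        for p in (0, 1)
        if m + p == child and can_transmit(mother, m) and can_transmit(father, p)
    ]
    return candidates[0] if len(candidates) == 1 else None
```

For example, a heterozygous child with a homozygous-reference mother must carry the alternative allele on the paternal haplotype, so `phase_child(1, 0, 1)` resolves, whereas `phase_child(1, 1, 1)` does not.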
Genetic regulation of transcriptional variation in wild-collected Arabidopsis thaliana accessions
An increased understanding of how genetics contributes to expression variation in natural Arabidopsis thaliana populations is of fundamental importance for understanding adaptation. Here, we reanalyse two publicly available datasets containing genome-wide data on genetic and transcript variation from whole-genome and RNA-sequencing in populations of wild-collected A. thaliana accessions. We found transcripts from more than half of all genes (55%) in the leaf of all accessions. In the population with higher RNA-sequencing coverage, transcripts from nearly all annotated genes were present in the leaf of at least one of the accessions. Thousands of genes, however, were found to have high transcript levels in some accessions and no detectable transcripts in others. The presence or absence of particular gene transcripts within the accessions was correlated with the genome-wide genotype, suggesting that part of this variability was due to genetically controlled, accession-specific expression. This was confirmed using the data from the largest collection of accessions, where cis-eQTLs with a major influence on the presence or absence of transcripts were detected for 349 genes. Transcripts from 172 of these genes were present in the second, smaller collection of accessions and there, 81 of the eQTLs for these genes could be replicated. Twelve of the replicated genes, including HAC1, are particularly interesting candidate adaptive loci, as earlier studies have shown that lack-of-function alleles at these genes have measurable phenotypic effects on the plant. In the larger collection, we also mapped 2,320 eQTLs regulating the expression of 2,240 genes that were expressed in nearly all accessions, and 636 of these replicated in the smaller collection. This study thus provides new insights into the genetic regulation of global gene-expression diversity in the leaf of wild-collected A. thaliana accessions and in particular illustrates that strong cis-acting polymorphisms are an important genetic mechanism leading to the presence or absence of transcripts in individual accessions.
AdmixSim: A Forward-Time Simulator for Various and Complex Scenarios of Population Admixture
Background: Population admixture is a common phenomenon in humans, animals, and plants, and plays a very important role in shaping individual genetic architecture and population genetic diversity. Inference of population admixture, however, is challenging and typically relies on in silico simulation. Computer tools for this purpose are lacking; in particular, no simulator is available for generating data under diverse and complex admixture scenarios. Results: Here we developed a forward-time simulator (AdmixSim) under the standard Wright-Fisher model, which can simulate admixed populations with: 1) multiple ancestral populations; 2) multiple waves of admixture events; 3) fluctuating population size; and 4) fluctuating admixture proportions. Analysis of data simulated by AdmixSim shows that the simulator can quickly and accurately generate data that resemble real data. We included in AdmixSim all the parameters users need to modify and simulate any kind of admixture scenario easily, making it very flexible. AdmixSim records recombination breakpoints and traces each chromosomal segment back to its ancestral population, so users can easily carry out further analyses and comparisons with empirical data. Conclusions: AdmixSim is expected to facilitate the study of population admixture by providing a simulation framework with flexible implementation of various admixture models and parameters.
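The general idea of a forward-time admixture simulation that tracks ancestry segments can be sketched as follows. This is not AdmixSim's implementation or interface: the function names, the `schedule` format, and the simplifications (haplotypes sampled directly rather than through diploid individuals, exactly one crossover per new haplotype, a chromosome scaled to the interval [0, 1]) are all assumptions made for illustration.

```python
import random


def merge(segments):
    """Fuse adjacent (start, end, ancestry) segments that share an ancestry."""
    out = []
    for start, end, anc in segments:
        if out and out[-1][2] == anc and out[-1][1] == start:
            out[-1] = (out[-1][0], end, anc)
        else:
            out.append((start, end, anc))
    return out


def recombine(hap_a, hap_b, x):
    """Build a haplotype from hap_a on [0, x) and hap_b on [x, 1)."""
    left = [(s, min(e, x), anc) for s, e, anc in hap_a if s < x]
    right = [(max(s, x), e, anc) for s, e, anc in hap_b if e > x]
    return merge(left + right)


def simulate(schedule, seed=0):
    """Simulate ancestry segments forward in time.

    schedule: one (size, influx) pair per generation, where size is the
    number of haplotypes and influx maps an ancestral-population label to
    the proportion of haplotypes arriving from it in that generation.
    The first generation must be founded entirely by influx.
    """
    rng = random.Random(seed)
    pop = []
    for size, influx in schedule:
        labels = list(influx)
        weights = [influx[label] for label in labels]
        stay = 1.0 - sum(weights)  # share descending from the previous generation
        if not pop and not labels:
            raise ValueError("the first generation needs at least one source")
        new_pop = []
        for _ in range(size):
            if pop and rng.random() < stay:
                # Wright-Fisher sampling: parents drawn with replacement.
                a, b = rng.choice(pop), rng.choice(pop)
                new_pop.append(recombine(a, b, rng.random()))
            else:
                label = rng.choices(labels, weights)[0]
                new_pop.append([(0.0, 1.0, label)])
        pop = new_pop
    return pop


def ancestry_proportions(pop):
    """Average fraction of the chromosome derived from each ancestry."""
    totals = {}
    for hap in pop:
        for start, end, anc in hap:
            totals[anc] = totals.get(anc, 0.0) + (end - start)
    return {anc: length / len(pop) for anc, length in totals.items()}
```

A schedule such as `[(50, {"A": 0.5, "B": 0.5}), (50, {}), (50, {"C": 0.2}), (50, {})]` would describe two founding populations, a later wave from a third, and constant size; varying the sizes or proportions per generation covers the fluctuating cases listed above.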
A high-quality reference panel reveals the complexity and distribution of structural genome changes in a human population
Structural variation (SV) represents a major source of differences between individual human genomes and has been linked to disease phenotypes. However, current studies on SVs have failed to provide a global view of the full spectrum of SVs and to integrate them into reference panels of genetic variation. Here, we analyzed 769 individuals from 250 Dutch families, whole-genome sequenced at an average coverage of 14.5x, and provide a haplotype-resolved map of 1.9 million genome variants across 9 different variant classes, including novel forms of complex indels and retrotransposition-mediated insertions of mobile elements and processed RNAs. A large proportion of the structural variants (36%) were discovered in the size range of 21 to 100 bp, a size range that remains underreported in many studies. Furthermore, we detected 4 megabases of novel sequence, extending the human pangenome with 11 new active transcripts. Finally, we show 191 known, trait-associated SNPs to be in strong linkage disequilibrium with a structural variant and demonstrate that our panel facilitates accurate imputation of SVs into unrelated individuals, which is essential for future genome-wide association studies.
A novel nuclear genetic code alteration in yeasts and the evolution of codon reassignment in eukaryotes
The genetic code is the universal cellular translation table to convert nucleotide into amino acid sequences. Changes to sense codons are expected to be highly detrimental. However, reassignments of single or multiple codons in mitochondrial and nuclear genomes demonstrated that the code can evolve. Still, alterations of nuclear genetic codes are extremely rare, leaving hypotheses to explain these variations, such as the ‘codon capture’, the ‘genome streamlining’ and the ‘ambiguous intermediate’ theories, in strong debate. Here, we report on a novel sense codon reassignment in Pachysolen tannophilus, a yeast related to the Pichiaceae. By generating proteomics data and using tRNA sequence comparisons, we show that in Pachysolen CUG codons are translated as alanine and not as the universal leucine. The polyphyly of the CUG-decoding tRNAs in yeasts is best explained by a tRNA loss driven codon reassignment mechanism. Loss of the CUG-tRNA in the ancient yeast is followed by a gradual decrease of the respective codons and subsequent codon capture by tRNAs whose anticodon is outside the aminoacyl-tRNA synthetase recognition region. Our hypothesis applies to all nuclear genetic code alterations and provides several testable predictions. We anticipate more codon reassignments to be uncovered in existing and upcoming genome projects.
Fine-scale crossover rate variation on the C. elegans X chromosome
Meiotic recombination creates genotypic diversity within species. Recombination rates vary substantially across taxa and the distribution of crossovers can differ significantly among populations and between sexes. Crossover locations within species have been found to vary by chromosome and by position within chromosomes, where most crossover events occur in small regions known as recombination hotspots. However, several species appear to lack hotspots despite significant crossover heterogeneity. The nematode Caenorhabditis elegans was previously found to have the least fine-scale variation in crossover distribution among organisms studied to date. It is unclear whether this pattern extends to the X chromosome given its unique compaction through the pachytene stage of meiotic prophase in hermaphrodites. We generated 798 recombinant nested near-isogenic lines (NILs) with crossovers in a 1.41 Mb region on the left arm of the X chromosome to determine if its recombination landscape is similar to that of the autosomes. We find that the fine-scale variation in crossover rate is lower than that of other model species and is inconsistent with hotspots. The relationship of genomic features to crossover rate is dependent on scale, with GC content, histone modifications, and nucleosome occupancy being negatively associated with crossovers. We also find that the abundances of 4-6 base pair DNA motifs significantly explain crossover density. These results are consistent with recombination occurring at unevenly distributed sites of open chromatin.
Genome-wide patterns of regulatory divergence revealed by introgression lines
Understanding the genetic basis for changes in transcriptional regulation is an important aspect of understanding phenotypic evolution. Using interspecific introgression lines, we infer the mechanisms of divergence in genome-wide patterns of gene expression between the nightshades Solanum pennellii and S. lycopersicum (domesticated tomato). We find that cis- and trans-regulatory changes have had qualitatively similar contributions to divergence in this clade, unlike results from other systems. Additionally, expression data from four tissues (shoot apex, ripe fruit, pollen, and seed) suggest that introgressed regions in these hybrid lines tend to be down-regulated, while background (non-introgressed) genes tend to be up-regulated. Finally, we find no evidence for an association between the magnitude of differential expression in NILs and previously determined sterility phenotypes. Our results contradict previous predictions of the predominant role of cis- over trans-regulatory divergence between species, and do not support a major role for gross genome-wide misregulation in reproductive isolation between these species.
Bayesian inference of natural selection from allele frequency time series
The advent of accessible ancient DNA technology now allows the direct ascertainment of allele frequencies in ancestral populations, thereby enabling the use of allele frequency time series to detect and estimate natural selection. Such direct observations of allele frequency dynamics are expected to be more powerful than inferences made using patterns of linked neutral variation obtained from modern individuals. We developed a Bayesian method to make use of allele frequency time series data and infer the parameters of general diploid selection, along with allele age, in non-equilibrium populations. We introduce a novel path augmentation approach, in which we use Markov chain Monte Carlo to integrate over the space of allele frequency trajectories consistent with the observed data. Using simulations, we show that this approach has good power to estimate selection coefficients and allele age. Moreover, when applying our approach to data on horse coat color, we find that ignoring a relevant demographic history can significantly bias the results of inference. Our approach is made available in a C++ software package.
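To make the notion of an allele frequency time series under general diploid selection concrete, the following minimal sketch simulates one trajectory forward in time under a discrete Wright-Fisher model. It is not the path augmentation MCMC method described above, nor the interface of the C++ package; the fitness parameterization (1, 1 + s_het, 1 + s_hom for the three genotypes), the constant population size, and all function names are assumptions made for illustration.

```python
import random


def expected_next_freq(p, s_het, s_hom):
    """Allele frequency after selection, before drift.

    Genotype fitnesses are assumed to be 1 (aa), 1 + s_het (Aa) and
    1 + s_hom (AA), with p the current frequency of allele A.
    """
    q = 1.0 - p
    w_bar = p * p * (1 + s_hom) + 2 * p * q * (1 + s_het) + q * q
    return (p * p * (1 + s_hom) + p * q * (1 + s_het)) / w_bar


def simulate_trajectory(p0, s_het, s_hom, n_diploid, generations, seed=0):
    """Simulate allele frequencies with selection followed by binomial drift."""
    rng = random.Random(seed)
    n_copies = 2 * n_diploid
    p = p0
    trajectory = [p0]
    for _ in range(generations):
        p_sel = expected_next_freq(p, s_het, s_hom)
        # Binomial sampling of 2N gene copies, written as Bernoulli draws
        # so that only the standard library is needed.
        count = sum(rng.random() < p_sel for _ in range(n_copies))
        p = count / n_copies
        trajectory.append(p)
    return trajectory
```

Sampling a handful of individuals from such a trajectory at a few time points gives data of the kind an inference method would have to explain; a changing `n_diploid` per generation would be one way to represent a non-equilibrium demographic history.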
Divorcing strain classification from species names
Confusion about strain classification and nomenclature permeates modern microbiology. Although taxonomists have traditionally acted as gatekeepers of order, the number of new strains and the speed at which they are identified have outpaced the opportunity for professional classification for many lineages. Furthermore, the growth of bioinformatics and database-fueled investigations has placed metadata curation in the hands of researchers with little taxonomic experience. Here I describe practical challenges facing modern microbial taxonomy, provide an overview of the complexities of classification for environmentally ubiquitous taxa like Pseudomonas syringae, and emphasize that classification and nomenclature need not be one and the same. A move toward implementation of relational classification schemes based on inherent properties of whole genomes could provide sorely needed continuity in how strains are referenced across manuscripts and data sets.