Comparing Evolutionary Rates Using An Exact Test for 2×2 Tables with Continuous Cell Entries
A. Morgan Thompson, M. Cyrus Maher, Lawrence H. Uricchio, Zachary A. Szpiech, Ryan D. Hernandez
(Submitted on 11 Apr 2014)
Assessing the statistical significance of an observed 2×2 contingency table can easily be accomplished using Fisher’s exact test (FET). However, if the cell entries are continuous or represent values inferred from a continuous parametric model, then FET cannot be applied. Such tables arise frequently in areas of biostatistical research including population genetics and evolutionary genomics, where cell entries are estimated by computational methods and result in cell entries drawn from the non-negative real line R+. Simply rounding cell entries to conform to the assumptions of FET is an ill-suited approach that we show creates problems related to both type-I and type-II errors. Pearson’s χ² test for independence, while technically applicable, is often ineffective for these tables, as the test has several limiting assumptions that make application of this method inadvisable in many common instances (particularly with small cell entries). Here we develop a novel method for tables with continuous entries, which we term continuous Fisher’s Exact Test (cFET). Through simulations, we show that cFET has a close-to-uniform distribution of p-values under the null hypothesis of independence, and more power when applied to tables where the null hypothesis is false (compared to FET applied to rounded cell entries). We apply cFET to an example from comparative genomics to confirm an overall increased evolutionary rate among primates compared to rodents, and identify several genes that show particularly elevated evolutionary rates in primates. Some of these genes exhibit signatures of continued positive selection along the human lineage since our divergence from chimpanzees 5-7 million years ago, as well as ongoing selection in modern humans.
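As a point of reference for the rounding problem described in this abstract, the following is a minimal sketch of the classical two-sided FET on integer counts, written in plain Python. It is not the cFET method, whose construction is not given in the abstract. The function names and the two example tables are hypothetical and chosen only for illustration.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the integer table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def weight(x):
        # Number of ways to obtain x in the top-left cell with all margins fixed
        return comb(row1, x) * comb(row2, col1 - x)

    w_obs = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more probable than the observed one
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= w_obs)
    return extreme / denom

def round_table(table):
    """Round each continuous cell entry to the nearest integer."""
    return [[round(v) for v in row] for row in table]

# Hypothetical continuous tables (illustrative values, not data from the paper)
table_a = [[3.4, 0.6], [1.4, 4.6]]
table_b = [[2.6, 1.4], [0.6, 5.4]]

rounded_a = round_table(table_a)
rounded_b = round_table(table_b)
p_a = fisher_exact_two_sided(*rounded_a[0], *rounded_a[1])
p_b = fisher_exact_two_sided(*rounded_b[0], *rounded_b[1])
print(rounded_a == rounded_b, p_a, p_b)
```

In this sketch the two continuous tables differ, yet both round to the same integer table and therefore receive the same p-value, which is one way rounding can discard information before the test is applied.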
Selection signatures in worldwide Sheep populations
Maria-Ines Fariello, Bertrand Servin, Gwenola Tosser-Klopp, Rachelle Rupp, Carole Moreno, International Sheep Genomics Consortium, Magali San Cristobal, Simon Boitard
The diversity of populations in domestic species offers great opportunities to study genome response to selection. The recently published Sheep HapMap dataset is a great example of characterization of the world wide genetic diversity in sheep. In this study, we re-analyzed the Sheep HapMap dataset to identify selection signatures in worldwide sheep populations. Compared to previous analyses, we made use of statistical methods that (i) take account of the hierarchical structure of sheep populations, (ii) make use of linkage disequilibrium information and (iii) focus specifically on either recent or older selection signatures. We show that this allows pinpointing several new selection signatures in the sheep genome and distinguishing those related to modern breeding objectives and to earlier post-domestication constraints. The newly identified regions, together with the ones previously identified, reveal the extensive genome response to selection on morphology, color and adaptation to new environments.
Sharing of Very Short IBD Segments between Humans, Neandertals, and Denisovans
Gundula Povysil, Sepp Hochreiter
We analyze the sharing of very short identity by descent (IBD) segments between humans, Neandertals, and Denisovans to gain new insights into their demographic history. Short IBD segments convey information about events far back in time because the shorter IBD segments are, the older they are assumed to be. The identification of short IBD segments becomes possible through next generation sequencing (NGS), which offers high variant density and reports variants of all frequencies. However, HapFABIA was only recently proposed as the first method for detecting very short IBD segments in NGS data. HapFABIA utilizes rare variants to identify IBD segments with a low false discovery rate. We applied HapFABIA to the 1000 Genomes Project whole genome sequencing data to identify IBD segments which are shared within and between populations. Some IBD segments are shared with the reconstructed ancestral genome of humans and other primates. These segments are tagged by rare variants; consequently, some rare variants have to be very old. Other IBD segments are also old since they are shared with Neandertals or Denisovans, which explains their shorter lengths compared to segments that are not shared with these ancient genomes. The Denisova genome most prominently matched IBD segments that are shared by Asians. Many of these segments were found exclusively in Asians and they are longer than segments shared between other continental populations and the Denisova genome. Therefore, we could confirm an introgression from Denisovans into ancestors of Asians after their migration out of Africa. While Neandertal-matching IBD segments are most often shared by Asians, Europeans also share a considerably higher percentage of IBD segments with Neandertals than other populations do. Again, many of these Neandertal-matching IBD segments are found exclusively in Asians, whereas Neandertal-matching IBD segments that are shared by Europeans are often found in other populations, too.
Neandertal-matching IBD segments that are shared by Asians or Europeans are longer than those observed in Africans. This hints at a gene flow from Neandertals into ancestors of Asians and Europeans after they left Africa. Interestingly, many Neandertal- or Denisova-matching IBD segments are predominantly observed in Africans – some of them even exclusively. IBD segments shared between Africans and Neandertals or Denisovans are strikingly short, therefore we assume that they are very old. This may indicate that these segments stem from ancestors of humans, Neandertals, and Denisovans and have survived in Africans.
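The link between segment length and age used in this abstract can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a deliberately simple model that is not taken from the paper: crossovers fall as a Poisson process at a rate of one per Morgan per meiosis, and two haplotypes whose common ancestor lived g generations ago are separated by 2g meioses. Under that assumption, the expected distance from a site to the nearest crossover on one side is 1/(2g) Morgans. The function name is hypothetical.

```python
def expected_one_sided_length_cm(generations):
    """Expected distance in centimorgans from a site to the nearest crossover
    on one side, for two haplotypes separated by 2 * generations meioses.

    Assumes a Poisson crossover process at 1 per Morgan per meiosis
    (an illustrative simplification, not the model used by HapFABIA).
    """
    if generations <= 0:
        raise ValueError("generations must be positive")
    meioses = 2 * generations
    return 100.0 / meioses  # 1 Morgan = 100 cM

for g in (10, 100, 1000, 10000):
    print(g, expected_one_sided_length_cm(g))
```

Under these assumptions the expected length shrinks in inverse proportion to the age of the common ancestor, which is the intuition behind treating very short segments as very old.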
An experimentally informed evolutionary model improves phylogenetic fit to divergent lactamase homologs
Jesse D Bloom
Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are unknown, and so most phylogenetic models treat this selection crudely with a variety of free parameters designed to represent general features of mutation and selection. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than existing models. Here I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg et al, 2014) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much better than existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.
Multilocus Species Trees Show the Recent Adaptive Radiation of the Mimetic Heliconius Butterflies
Krzysztof M Kozak, Niklas Wahlberg, Andrew Neild, Kanchon K Dasmahapatra, James Mallet, Chris D Jiggins
Müllerian mimicry among Neotropical Heliconiini butterflies is an excellent example of natural selection, and is associated with the diversification of a large continental-scale radiation. Some of the processes driving the evolution of mimicry rings are likely to generate incongruent phylogenetic signals across the assemblage, and thus pose a challenge for systematics. We use a dataset of 22 mitochondrial and nuclear markers from 92% of species in the tribe to re-examine the phylogeny of Heliconiini with both supermatrix and multi-species coalescent approaches, characterise the patterns of conflicting signal and compare the performance of various methodological approaches to reflect the heterogeneity across the data. Despite the large extent of reticulate signal and strong conflict between markers, nearly identical topologies are consistently recovered by most of the analyses, although the supermatrix approach fails to reflect the underlying variation in the history of individual loci. The first comprehensive, time-calibrated phylogeny of this group is used to test the hypotheses of a diversification rate increase driven by the dramatic environmental changes in the Amazonia over the past 23 million years, or changes caused by diversity-dependent effects on the rate of diversification. We find that the tribe Heliconiini had doubled its rate of speciation around 11 Ma and that the presently most speciose genus Heliconius started diversifying rapidly at 10 Ma, likely in response to the recent drastic changes in topography of the region. Our study provides comprehensive evidence for a rapid adaptive radiation among an important insect radiation in the most biodiverse region of the planet.
New whole genome de novo assemblies of three divergent strains of rice (O. sativa) documents novel gene space of aus and indica
Michael C Schatz, Lyza G Maron, Joshua C Stein, Alejandro Hernandez Wences, James Gurtowski, Eric Biggers, Hayan Lee, Melissa Kramer, Eric Antonio, Elena Ghiban, Mark H Wright, Jer-ming Chia, Doreen Ware, Susan R McCouch, William Richard McCombie
The use of high throughput genome-sequencing technologies has uncovered a large extent of structural variation in eukaryotic genomes that makes important contributions to genomic diversity and phenotypic variation. Currently, when the genomes of different strains of a given organism are compared, whole genome resequencing data are aligned to an established reference sequence. However, when the reference differs in significant structural ways from the individuals under study, the analysis is often incomplete or inaccurate. Here, we use rice as a model to explore the extent of structural variation among strains adapted to different ecologies and geographies, and show that this variation can be significant, often matching or exceeding the variation present in closely related human populations or other mammals. We demonstrate how improvements in sequencing and assembly technology allow rapid and inexpensive de novo assembly of next generation sequence data into high-quality assemblies that can be directly compared to provide an unbiased assessment. Using this approach, we are able to accurately assess the “pan-genome” of three divergent rice varieties and document several megabases of each genome absent in the other two. Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard resequencing approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species.
Group A Rotavirus NSP4 is Under Negative Selective Pressure
Jackson Cordeiro Lima, Paulo Bandiera-Paiva
(Submitted on 2 Apr 2014)
Rotavirus (RV) is the major etiologic agent of severe infantile gastroenteritis; its genome has 11 segments of double stranded RNA, encoding 12 proteins. The non-structural protein 4 (NSP4) encoded by segment 10 is multifunctional. The aim of this study is to analyze the selective pressure acting on the NSP4 of RV, through the ratio of non-synonymous to synonymous substitution rates (dN/dS). Our results show that NSP4 is under negative evolutionary pressure (84.57% of the amino acid sequence) and no site was found under positive selection. This may support other evolutionary studies of different RV proteins or viral agents.
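The dN/dS ratio mentioned in this abstract is conventionally read as follows: values below 1 suggest negative (purifying) selection, values above 1 suggest positive selection, and values near 1 are consistent with neutrality. The sketch below applies that reading to per-site estimates. The function names and the input values are hypothetical, and a real analysis would rely on a statistical test per site rather than a bare threshold, so this should be taken only as an illustration of the arithmetic.

```python
def classify_site(dn, ds):
    """Label one site from its dN and dS estimates using the bare dN/dS ratio."""
    if ds <= 0:
        return "undefined"  # ratio cannot be computed without synonymous change
    omega = dn / ds
    if omega < 1:
        return "negative"
    if omega > 1:
        return "positive"
    return "neutral"

def fraction_negative(sites):
    """Fraction of sites labelled as under negative selection."""
    labels = [classify_site(dn, ds) for dn, ds in sites]
    return labels.count("negative") / len(labels)

# Hypothetical per-site (dN, dS) estimates, not data from the study
sites = [(0.1, 0.9), (0.0, 1.2), (0.5, 0.5), (0.2, 0.8)]
print([classify_site(dn, ds) for dn, ds in sites])
print(fraction_negative(sites))
```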
High burden of private mutations due to explosive human population growth and purifying selection
Feng Gao, Alon Keinan
(Submitted on 22 Mar 2014)
Recent studies have shown that human populations have experienced a complex demographic history, including a recent epoch of rapid population growth that led to an excess in the proportion of rare genetic variants in humans today. This excess can impact the burden of private mutations for each individual, defined here as the proportion of heterozygous variants in each newly sequenced individual that are novel compared to another large sample of sequenced individuals. We calculated the burden of private mutations predicted by different demographic models, and compared it with empirical estimates based on data from the NHLBI Exome Sequencing Project and data from the Neutral Regions (NR) dataset. We observed a significant excess in the proportion of private mutations in the empirical data compared with models of demographic history without a recent epoch of population growth. Incorporating recent growth into the model provides a much improved fit to empirical observations. This phenomenon becomes more marked for larger sample sizes. The proportion of private mutations is additionally increased by purifying selection, which differentially affects mutations of different functional annotations. These results have important implications for the design and analysis of sequencing-based association studies of complex human disease as they pertain to private and very rare variants.
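The definition of private-mutation burden given in this abstract translates directly into a set computation. The following is a minimal sketch under the assumption that each variant can be represented by a hashable key such as a (chromosome, position, alternate allele) tuple. The function name, the key format, and the example variants are hypothetical.

```python
def private_burden(het_variants, reference_variants):
    """Proportion of an individual's heterozygous variants that are absent
    from a reference sample of previously sequenced individuals."""
    het = set(het_variants)
    if not het:
        raise ValueError("individual has no heterozygous variants")
    novel = het - set(reference_variants)
    return len(novel) / len(het)

# Hypothetical variant keys: (chromosome, position, alternate allele)
reference = {("1", 1000, "A"), ("1", 2000, "T"), ("2", 500, "G")}
individual = [("1", 1000, "A"), ("1", 2000, "T"), ("2", 500, "G"), ("3", 42, "C")]

print(private_burden(individual, reference))
```

In this toy case one of the four heterozygous variants is unseen in the reference sample, giving a burden of 0.25. Because the reference set grows with the number of sequenced individuals, the value depends on the size of the comparison sample, which is why the abstract ties the result to sample size.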
Population genetics of identity by descent
Pier Francesco Palamara, Ph.D. thesis
Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species.
Analysis of stop-gain and frameshift variants in human innate immunity genes
Antonio Rausell, Pejman Mohammadi, Paul J McLaren, Ioannis Xenarios, Jacques Fellay, Amalio Telenti
Loss-of-function variants in innate immunity genes are associated with Mendelian disorders in the form of primary immunodeficiencies. Recent resequencing projects report that stop-gains and frameshifts are collectively prevalent in humans and could be responsible for some of the inter-individual variability in innate immune response. Current computational approaches evaluating loss-of-function in genes carrying these variants rely on gene-level characteristics such as evolutionary conservation and functional redundancy across the genome. However, innate immunity genes represent a particular case because they are more likely to be under positive selection and duplicated. To create a ranking of severity that would be applicable to the innate immunity genes, we first evaluated 17764 stop-gain and 13915 frameshift variants from the NHLBI Exome Sequencing Project and 1000 Genomes Project. Sequence-based features such as loss of functional domains, isoform-specific truncation and nonsense-mediated decay were found to correlate with variant allele frequency and validated with gene expression data. We integrated these features in a Bayesian classification scheme and benchmarked its use in predicting pathogenic variants against OMIM disease stop-gains and frameshifts. The classification scheme was applied in the assessment of 335 stop-gains and 236 frameshifts affecting 227 interferon-stimulated genes. The sequence-based score ranks variants in innate immunity genes according to their potential to cause disease, and complements existing gene-based pathogenicity scores.