Assessing phenotypic correlation through the multivariate phylogenetic latent liability model

Gabriela B. Cybis, Janet S. Sinsheimer, Trevor Bedford, Alison E. Mather, Philippe Lemey, Marc A. Suchard
(Submitted on 15 Jun 2014)

Understanding which phenotypic traits are consistently correlated throughout evolution is a highly pertinent problem in modern evolutionary biology. Here, we propose a multivariate phylogenetic latent liability model for assessing the correlation between multiple types of data, while simultaneously controlling for their unknown shared evolutionary history informed through molecular sequences. The latent formulation enables us to consider in a single model combinations of continuous traits, discrete binary traits, and discrete traits with multiple ordered and unordered states. Previous approaches have entertained a single data type generally along a fixed history, precluding estimation of correlation between traits and ignoring uncertainty in the history. We implement our model in a Bayesian phylogenetic framework, and discuss inference techniques for hypothesis testing. Finally, we showcase the method through applications to columbine flower morphology, antibiotic resistance in Salmonella, and epitope evolution in influenza.

Identifying the Genetic Basis of Functional Protein Evolution Using Reconstructed Ancestors

Victor Hanson-Smith, Christopher Baker, Alexander Johnson
(Submitted on 11 Jun 2014)

A central challenge in the study of protein evolution is the identification of historic amino acid sequence changes responsible for creating novel functions observed in present-day proteins. To address this problem, we developed a new method to identify and rank amino acid mutations in ancestral protein sequences according to their function-shifting potential. Our approach scans the changes between two reconstructed ancestral sequences in order to find (1) sites with sequence changes that significantly deviate from our model-based probabilistic expectations, (2) sites that demonstrate extreme changes in mutual information, and (3) sites with extreme gains or losses of information content. By taking the overlaps of these statistical signals, the method accurately identifies cryptic evolutionary patterns that are often not obvious when examining only the conservation of modern-day protein sequences. We validated this method with a training set of previously-discovered function-shifting mutations in three essential protein families in animals and fungi, whose evolutionary histories were the prior subject of systematic molecular biological investigation. Our method identified the known function-shifting mutations in the training set with a very low rate of false positive discovery. Further, our approach significantly outperformed other methods that use variability in evolutionary rates to detect functional loci. The accuracy of our approach indicates it could be a useful tool for generating specific testable hypotheses regarding the acquisition of new functions across a wide range of protein families.
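Signal (3) above, a gain or loss of information content at a site, can be illustrated with a minimal sketch. It assumes the common definition of information content as log2(20) minus the Shannon entropy of the site's amino acid distribution; the abstract does not state the exact formula the authors use, and the function names and example distributions below are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def information_content(probs):
    """Information content in bits: log2(20) minus Shannon entropy.

    `probs` maps amino acid letters to probabilities (e.g. posterior
    probabilities at one site of a reconstructed ancestor). This
    definition is an assumption, not necessarily the paper's.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(len(AMINO_ACIDS)) - entropy

def information_shift(older_site, newer_site):
    """Change in information content from the older to the newer ancestor."""
    return information_content(newer_site) - information_content(older_site)

# Hypothetical site: fully uncertain in the older ancestor,
# fixed for lysine in the newer one.
older = {aa: 1 / 20 for aa in AMINO_ACIDS}
newer = {"K": 1.0}
shift = information_shift(older, newer)
```

A large positive shift marks a site that became constrained between the two ancestors; a large negative shift marks one that was released from constraint.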

Accounting for biases in riboprofiling data indicates a major role for proline and not positive amino acids in stalling translation

Carlo G. Artieri, Hunter B. Fraser

The recent advent of ribosome profiling (sequencing of short ribosome-bound fragments of mRNA) has offered an unprecedented opportunity to interrogate the sequence features responsible for modulating translational rates. Nevertheless, numerous analyses of the first riboprofiling dataset have produced equivocal and often incompatible results. Here we analyze three independent yeast riboprofiling data sets, including two with much higher coverage than previously available, and find that all three show substantial technical sequence biases that confound interpretations of ribosomal occupancy. After accounting for these biases, we find no effect of previously implicated factors on ribosomal pausing. Rather, we find that incorporation of proline, whose unique side-chain stalls peptide synthesis in vitro, also slows the ribosome in vivo. We also reanalyze a recent method that reported positively charged amino acids as the major determinant of ribosomal stalling and demonstrate that its assumptions lead to false signals of stalling in low-coverage data. Our results suggest that any analysis of riboprofiling data should account for sequencing biases and sparse coverage. To this end, we establish a robust methodology that enables analysis of ribosome profiling data without prior assumptions regarding which positions spanned by the ribosome cause stalling.

The rugged adaptive landscape of an emerging plant RNA virus

Jasna Lalic, Santiago F. Elena

RNA viruses are the main source of emerging infectious diseases owing to the evolutionary potential bestowed by their fast replication, large population sizes and high mutation and recombination rates. However, an equally important parameter, which is usually neglected, is the topography of the fitness landscape, that is, how many fitness maxima exist and how well connected they are, which determines the number of accessible evolutionary pathways. To address this question, we have reconstructed the fitness landscape describing the adaptation of Tobacco etch potyvirus to its new host, Arabidopsis thaliana. Fitness was measured for most of the genotypes in the landscape, showing the existence of peaks and holes. We found prevailing epistatic effects between mutations, with cases of reciprocal sign epistasis being common at later stages. These results suggest that the landscape was rugged and holey, with several local fitness peaks and a very limited number of potential neutral paths. The viral genotype fixed at the end of the evolutionary process was not on the global fitness optimum but stuck on a suboptimal peak.

Genome-wide Identification of Zero Nucleotide Recursive Splicing in Drosophila

Michael O Duff, Sara Olson, Xintao Wei, Ahmad Osman, Alex Plocik, Mohan Bolisetty, Susan Celniker, Brenton Graveley

Recursive splicing is a process in which large introns are removed in multiple steps by resplicing at ratchet points – 5′ splice sites recreated after splicing. Recursive splicing was first identified in the Drosophila Ultrabithorax (Ubx) gene and only three additional Drosophila genes have since been experimentally shown to undergo recursive splicing. Here, we identify 196 zero nucleotide exon ratchet points in 130 introns of 115 Drosophila genes from total RNA sequencing data generated from developmental time points, dissected tissues, and cultured cells. Recursive splicing events were identified by splice junctions that map to annotated 5′ splice sites and unannotated intronic 3′ splice sites, the presence of the sequence AG/GT at the 3′ splice site, and a 5′ to 3′ gradient of decreasing RNA-Seq read density indicative of co-transcriptional splicing. The sequential nature of recursive splicing was confirmed by identification of lariat introns generated by splicing to and from the ratchet points. We also show that recursive splicing is a constitutive process, and that the sequence and function of ratchet points are evolutionarily conserved. Together these results indicate that recursive splicing is commonly used in Drosophila and provides insight into the mechanisms by which some introns are removed.
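The AG/GT sequence criterion mentioned above can be sketched as a simple motif scan. This is only an illustration of that one criterion: the function name and the example sequence are hypothetical, and a motif match alone does not establish a ratchet point, since the abstract also requires supporting splice junction reads and a 5′ to 3′ read density gradient.

```python
def candidate_ratchet_points(intron_seq):
    """Return 0-based positions of the junction in every AG/GT match.

    Each returned index points at the G of the GT dinucleotide, i.e.
    the first base after the 3' splice site AG, where a 5' splice
    site would be regenerated. Overlapping matches are reported.
    """
    seq = intron_seq.upper()
    hits = []
    start = seq.find("AGGT")
    while start != -1:
        hits.append(start + 2)
        start = seq.find("AGGT", start + 1)
    return hits

# Hypothetical intron fragment with two AG/GT motifs.
example = "ccttAGGTaagtttAGGTa"
positions = candidate_ratchet_points(example)
```

In practice each candidate would then be filtered against junction-spanning reads before being called a ratchet point.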

Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels.

Nicholas E Banovich, Xun Lan, Graham McVicker, Bryce Van de Geijn, Jacob F Degner, John D. Blischak, Jonathan K. Pritchard, Yoav Gilad

DNA methylation is an important epigenetic regulator of gene expression. Recent studies have revealed widespread associations between genetic variation and methylation levels. However, the mechanistic links between genetic variation and methylation remain unclear. To begin addressing this gap, we collected methylation data at ~300,000 loci in lymphoblastoid cell lines (LCLs) from 64 HapMap Yoruba individuals, and genome-wide bisulfite sequence data in ten of these individuals. We identified (at an FDR of 10%) 11,752 methylation QTLs (meQTLs), i.e., loci in which genetic variation is associated with changes in DNA methylation. We found that meQTLs are frequently associated with changes in methylation at multiple CpGs across regions of up to 3 kb. Interestingly, meQTLs are also frequently associated with variation in other properties of gene regulation, including histone modifications, DNase I accessibility, chromatin accessibility, and expression levels of nearby genes. These observations suggest that genetic variants may lead to coordinated molecular changes in all of these regulatory phenotypes. One plausible driver of coordinated changes in different regulatory mechanisms is variation in transcription factor (TF) binding. Indeed, we found that SNPs that change predicted TF binding affinities are significantly enriched for associations with DNA methylation at nearby CpGs. Taken together, our observations are consistent with a model whereby changes in TF binding may frequently drive coordinated changes in DNA methylation, histone modification, and gene expression levels.
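To make the "FDR of 10%" threshold concrete, here is a minimal sketch of the Benjamini-Hochberg procedure applied to a list of association p-values. The abstract does not say which FDR method the authors used, so the choice of Benjamini-Hochberg, the function name, and the example p-values are all assumptions for illustration.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a list of booleans marking tests significant at the given FDR.

    Standard step-up rule: find the largest rank k (1-based, p-values
    sorted ascending) with p_(k) <= fdr * k / m, then call the k
    smallest p-values significant.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= fdr * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

# Hypothetical p-values for five SNP-CpG association tests.
calls = benjamini_hochberg([0.001, 0.04, 0.03, 0.5, 0.9], fdr=0.10)
```

With these inputs the three smallest p-values pass, because 0.04 at rank 3 falls below the rank-3 threshold of 0.06.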

Validation of methods for Low-volume RNA-seq

Peter Acuña Combs, Michael B Eisen

Recently, a number of protocols extending RNA-sequencing to the single-cell regime have been published. However, we were concerned that the additional steps to deal with such minute quantities of input sample would introduce serious biases that would make analysis of the data using existing approaches invalid. In this study, we performed a critical evaluation of several of these low-volume RNA-seq protocols, and found that they performed slightly less well in metrics of interest to us than a more standard protocol, but with at least two orders of magnitude less sample required. We also explored a simple modification to one of these protocols that, for many samples, reduced the cost of library preparation to approximately $20/sample.

Reducing INDEL errors in whole-genome and exome sequencing

Han Fang, Giuseppe Narzisi, Jason A. O’Rawe, Yiyang Wu, Julie Rosenbaum, Michael Ronemus, Ivan Iossifov, Michael C. Schatz, Gholson J. Lyon

Background: INDELs, especially those disrupting protein-coding regions of the genome, have been associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts. We have recently developed a new INDEL-calling algorithm, Scalpel, with substantially improved accuracy. Results: We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate false-positive and false-negative INDEL errors. We developed a classification scheme utilizing validation data to define a class of low-quality INDELs with ~2.7-fold higher error rates than high-quality INDELs. The mean concordance of INDEL detection between WGS and WES data was ~52%, while WGS data uniquely identified ~10.8-fold more high-quality INDELs. Concordance of INDEL detection between standard and PCR-free sequencing data was ~71%, while PCR-free data uniquely yielded ~6.3-fold fewer low-quality INDELs. We demonstrate that these INDEL errors are significantly reduced with a PCR-free library protocol, implying that these errors are introduced with PCR amplification. We calculated that 60X WGS data from the HiSeq 2000 platform are needed to recover ~95% of INDELs, much higher than that for SNP detection. Accurate detection of heterozygous INDELs requires ~1.2-fold higher coverage than that for homozygous INDELs. Conclusions: Homopolymer A/T INDELs are a major source of low quality and/or uncertain INDEL calls, and these are highly enriched in the WES data. We recommend WGS for human genomes at 60X mean coverage with PCR-free protocols, which can substantially improve the quality of personal genomes.

Natural selection helps explain the small range of genetic variation within species

Russell B. Corbett-Detig, Daniel L. Hartl, Timothy B. Sackton

The range of genetic diversity observed within natural populations is much narrower than expected based on models of neutral molecular evolution. Although the increased efficacy of natural selection in larger populations has been invoked to explain this paradox, to date no tests of this hypothesis have been conducted. Here, we present an analysis of whole-genome polymorphism data and genetic maps from 39 species to estimate for each species the reduction in genetic variation attributable to the operation of natural selection on the genome. We find that species with larger population sizes do in fact show greater reductions in genetic variation. This finding provides the first experimental support for the hypothesis that natural selection contributes to the restricted range of within-species genetic diversity.

Recombination impacts damaging and disease mutations accumulation in human populations

Julie Hussin, Alan Hodgkinson, Youssef Idaghdour, Jean-Christophe Grenier, Jean-Philippe Goulet, Elias Gbeha, Elodie Hip-Ki, Philip Awadalla

Many decades of theory have demonstrated that in non-recombining systems, slightly deleterious mutations accumulate non-reversibly, potentially driving the extinction of many asexual species. Non-recombining chromosomes in sexual organisms are thought to have degenerated in a similar fashion; however, the extent to which these processes operate along recombining chromosomes with highly variable rates of crossing over is not clear. Using high coverage sequencing data from over 1400 individuals, we show that recombination rate modulates the genomic distribution of putatively deleterious variants across the entire human genome. We find that exons in regions of low recombination are significantly enriched for deleterious and disease variants, a signature that varies in strength across worldwide human populations with different demographic histories. As low recombining regions are enriched for highly conserved genes with essential cellular functions and show an excess of mutations with demonstrated effect on health, this phenomenon likely affects disease susceptibility in humans.