MERS-CoV recombination: implications about the reservoir and potential for adaptation

Gytis Dudas, Andrew Rambaut
doi: http://dx.doi.org/10.1101/020834

Recombination is a process that unlinks neighbouring loci, allowing for independent evolutionary trajectories within genomes of many organisms. If not properly accounted for, recombination can compromise many evolutionary analyses. In addition, when dealing with organisms that are not obligately sexually reproducing, recombination gives insight into the rate at which distinct genetic lineages come into contact. Since June 2012, Middle East respiratory syndrome coronavirus (MERS-CoV) has caused 1106 laboratory-confirmed infections, with 421 MERS-CoV associated deaths as of April 16, 2015. Although bats are considered the likely ultimate source of zoonotic betacoronaviruses, dromedary camels have been consistently implicated as the source of current human infections in the Middle East. In this paper we use phylogenetic methods and simulations to show that the MERS-CoV genome has likely undergone numerous recent recombination events. Recombination in MERS-CoV implies frequent co-infection with distinct lineages of MERS-CoV, probably in camels given the current understanding of MERS-CoV epidemiology.

Folding and unfolding phylogenetic trees and networks

Katharina T. Huber, Vincent Moulton, Mike Steel, Taoyang Wu
(Submitted on 14 Jun 2015)

Phylogenetic networks are rooted, labelled directed acyclic graphs which are commonly used to represent reticulate evolution. There is a close relationship between phylogenetic networks and multi-labelled trees (MUL-trees). Indeed, any phylogenetic network N can be ‘unfolded’ to obtain a MUL-tree U(N) and, conversely, a MUL-tree T can in certain circumstances be ‘folded’ to obtain a phylogenetic network F(T) that exhibits T. In this paper, we study properties of the operations U and F in more detail. In particular, we introduce the class of stable networks, phylogenetic networks N for which F(U(N)) is isomorphic to N, characterise such networks, and show that they are related to the well-known class of tree-sibling networks. We also explore how the concept of displaying a tree in a network N can be related to displaying the tree in the MUL-tree U(N). To do this, we develop a phylogenetic analogue of graph fibrations. This allows us to view U(N) as the analogue of the universal cover of a digraph, and to establish a close connection between displaying trees in U(N) and reconciling phylogenetic trees with networks.

Identification of Slco1a6 as a candidate gene that broadly affects gene expression in mouse pancreatic islets

Jianan Tian, Mark Keller, Angie Oler, Mary Rabagalia, Kathryn Schueler, Donald Stapleton, Aimee Teo Broman, Wen Zhao, Christina Kendziorski, Brian S. Yandell, Bruno Hagenbuch, Karl W Broman, Alan D. Attie
doi: http://dx.doi.org/10.1101/020974

We surveyed gene expression in six tissues in an F2 intercross between mouse strains C57BL/6J (abbreviated B6) and BTBR T+ tf/J (abbreviated BTBR) made genetically obese with the Leptin(ob) mutation. We identified a number of expression quantitative trait loci (eQTL) affecting the expression of numerous genes distal to the locus, called trans-eQTL hotspots. Some of these trans-eQTL hotspots showed effects in multiple tissues, whereas some were specific to a single tissue. An unusually large number of transcripts (7% of genes) mapped in trans to a hotspot on chromosome 6, specifically in pancreatic islets. By considering the first two principal components of the expression of genes mapping to this region, we were able to convert the multivariate phenotype into a simple Mendelian trait. Fine-mapping the locus by traditional methods reduced the QTL interval to a 298 kb region containing only three genes, including Slco1a6, one member of a large family of organic anion transporters. Direct genomic sequencing of all Slco1a6 exons identified a non-synonymous coding SNP that converts a highly conserved proline residue at amino acid position 564 to serine. Molecular modeling suggests that Pro564 faces an aqueous pore within this 12-transmembrane domain-spanning protein. When transiently overexpressed in HEK293 cells, BTBR OATP1A6-mediated cellular uptake of the bile acid taurocholic acid (TCA) was enhanced compared to B6 OATP1A6. Our results suggest that genetic variation in Slco1a6 leads to altered transport of TCA (and potentially other bile acids) by pancreatic islets, resulting in broad gene regulation.

bModelTest: Bayesian site model selection for nucleotide data

Remco Bouckaert
doi: http://dx.doi.org/10.1101/020792

bModelTest allows for a Bayesian approach to inferring a site model for phylogenetic analysis. It is based on transdimensional MCMC proposals that allow switching between substitution models, and between including and excluding gamma rate heterogeneity and a proportion of invariant sites. The model can be used with the set of reversible models on nucleotides, but we also introduce other sets of substitution models, and show how to use these sets of models. With the method, the site model can be inferred during the MCMC analysis and does not need to be pre-determined, as is now often the case in practice, by likelihood-based methods.

Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

David M Rocke, Luyao Ruan, Yilun Zhang, J. Jared Gossett, Blythe Durbin-Johnson, Sharon Aviran
doi: http://dx.doi.org/10.1101/020784

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10^-4, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
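
The arithmetic behind the "about 1 gene" expectation can be sketched as follows. This is only an illustration, not the authors' resampling procedure: it assumes a perfectly calibrated test, so that p-values under the null are uniform on [0, 1), and the function names are hypothetical.

```python
import random

def expected_false_positives(n_genes, alpha):
    """Expected number of null genes passing a p-value cutoff of alpha."""
    return n_genes * alpha

def simulate_false_positives(n_genes, alpha, seed=0):
    """Count null genes below the cutoff, assuming uniform null p-values."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_genes))

# 10,000 null genes at a cutoff of 10^-4 give an expectation of one gene.
expected = expected_false_positives(10_000, 1e-4)

# Averaging over replicates should land near that expectation
# when the test is calibrated.
replicates = [simulate_false_positives(10_000, 1e-4, seed=s) for s in range(200)]
mean_count = sum(replicates) / len(replicates)
```

A method whose average count across such null replicates sits well above the expectation would be showing the kind of excess the abstract describes.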

Independent molecular basis of convergent highland adaptation in maize

Shohei Takuno, Peter Ralph, Kelly Swarts, Rob J Elshire, Jeffrey C Glaubitz, Edward S. Buckler, Matthew B Hufford, Jeffrey Ross-Ibarra
doi: http://dx.doi.org/10.1101/013607

Convergent evolution is the independent evolution of similar traits in different species or lineages of the same species; this is often a result of adaptation to similar environments, a process referred to as convergent adaptation. We investigate here the molecular basis of convergent adaptation in maize to highland climates in Mesoamerica and South America using genome-wide SNP data. Taking advantage of archaeological data on the arrival of maize to the highlands, we infer demographic models for both populations, identifying evidence of a strong bottleneck and rapid expansion in South America. We use these models to then identify loci showing an excess of differentiation as a means of identifying putative targets of natural selection, and compare our results to expectations from recently developed theory on convergent adaptation. Consistent with predictions across a wide parameter space, we see limited evidence for convergent evolution at the nucleotide level in spite of strong similarities in overall phenotypes. Instead, we show that selection appears to have predominantly acted on standing genetic variation, and that introgression from wild teosinte populations appears to have played a role in highland adaptation in Mexican maize.

Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between alleles

Lindsay V Clark, Andrea Drauch Schreier
doi: http://dx.doi.org/10.1101/020610

A major limitation in the analysis of genetic marker data from polyploid organisms is non-Mendelian segregation, particularly when a single marker yields allelic signals from multiple, independently segregating loci (isoloci). However, with markers such as microsatellites that detect more than two alleles, it is sometimes possible to deduce which alleles belong to which isoloci. Here we describe a novel mathematical property of codominant marker data when it is recoded as binary (presence/absence) allelic variables: under random mating in an infinite population, two allelic variables will be negatively correlated if they belong to the same locus, but uncorrelated if they belong to different loci. We present an algorithm to take advantage of this mathematical property, sorting alleles into isoloci based on correlations, then refining the allele assignments after checking for consistency with individual genotypes. We demonstrate the utility of our method on simulated data, as well as a real microsatellite dataset from a natural population of octoploid white sturgeon (Acipenser transmontanus). Our methodology is implemented in the R package polysat version 1.4.
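
The correlation property described above can be checked with a small simulation. The sketch below is not the polysat algorithm: it assumes two independently segregating diploid isoloci, random mating, hypothetical allele names and frequencies, and a finite sample in place of an infinite population.

```python
import random

def simulate_presence(freqs_by_locus, n_individuals, seed=1):
    """Simulate presence/absence allelic variables under random mating.

    Each isolocus contributes two alleles drawn independently from its
    allele frequencies; only presence or absence is recorded per allele.
    """
    rng = random.Random(seed)
    alleles = [a for locus in freqs_by_locus for a in locus]
    rows = []
    for _ in range(n_individuals):
        present = set()
        for locus in freqs_by_locus:
            names = list(locus)
            weights = [locus[a] for a in names]
            present.update(rng.choices(names, weights=weights, k=2))
        rows.append([1 if a in present else 0 for a in alleles])
    return alleles, rows

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Hypothetical isoloci: alleles A1-A3 at one locus, B1-B2 at the other.
loci = [{"A1": 0.5, "A2": 0.3, "A3": 0.2}, {"B1": 0.6, "B2": 0.4}]
alleles, rows = simulate_presence(loci, 5000)
column = {a: [row[i] for row in rows] for i, a in enumerate(alleles)}

same_locus = pearson(column["A1"], column["A2"])   # expected to be negative
cross_locus = pearson(column["A1"], column["B1"])  # expected to be near zero
```

With these assumed frequencies the same-locus correlation comes out clearly negative, while the cross-locus correlation stays close to zero, which is the signal an allele-sorting step could exploit.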

Are Genetic Interactions Influencing Gene Expression Evidence for Biological Epistasis or Statistical Artifacts?

Alexandra Fish, John A. Capra, William S Bush
doi: http://dx.doi.org/10.1101/020479

Interactions between genetic variants, also called epistasis, are pervasive in model organisms; however, their importance in humans remains unclear because statistical interactions in observational studies can be explained by processes other than biological epistasis. Using statistical modeling, we identified 1,093 interactions between pairs of cis-regulatory variants impacting gene expression in lymphoblastoid cell lines. Factors known to confound these analyses (ceiling/floor effects, population stratification, haplotype effects, or single variants tagged through linkage disequilibrium) explained most of these interactions. However, we found 15 interactions robust to these explanations, and we further show that despite potential confounding, interacting variants were enriched in numerous regulatory regions suggesting potential biological importance. While genetic interactions may not be the true underlying mechanism of all our statistical models, our analyses discover new signals undetected in standard single-marker analyses. Ultimately, we identified new complex genetic architectures regulating 23 genes, suggesting that single-variant analyses may miss important modifiers.

RNA:DNA hybrids in the human genome have distinctive nucleotide characteristics, chromatin composition, and transcriptional relationships

Julie Nadel, Rodoniki Athanasiadou, Christophe Lemetre, Neil Ari Wijetunga, Pilib Ó Broin, Hanae Sato, Zhengdong Zhang, Jeffrey Jeddeloh, Cristina Montagna, Aaron Golden, Cathal Seoighe, John Greally
doi: http://dx.doi.org/10.1101/020545

RNA:DNA hybrids represent a non-canonical nucleic acid structure that has been associated with a range of human diseases and potential transcriptional regulatory functions. Mapping of RNA:DNA hybrids in human cells reveals them to have a number of characteristics that give insights into their functions. A directional sequencing approach shows the RNA component of the RNA:DNA hybrid to be purine-rich, indicating a thermodynamic contribution to their in vivo stability. The RNA:DNA hybrids are enriched at loci with decreased DNA methylation and increased DNase hypersensitivity, and within larger domains with characteristics of heterochromatin formation, indicating potential transcriptional regulatory properties. Mass spectrometry studies of chromatin at RNA:DNA hybrids show the presence of the ILF2 and ILF3 transcription factors, supporting a model of certain transcription factors binding preferentially to the RNA:DNA conformation. Overall, there is little to indicate a dependence for RNA:DNA hybrids forming co-transcriptionally, with results from the ribosomal DNA repeat unit instead supporting a model of RNA generating these structures in trans. The results of the study indicate heterogeneous functions of these genomic elements and new insights into their formation and stability in vivo.

The Nature, Extent, and Consequences of Cryptic Genetic Variation in the opa Repeats of Notch in Drosophila

Clinton Rice, Daniel Beekman, Liping Liu, Albert Erives
doi: http://dx.doi.org/10.1101/020529

Polyglutamine (pQ) tracts are abundant in many proteins co-interacting on DNA. The lengths of these pQ tracts can modulate their interaction strengths. However, pQ tracts > 40 residues are pathologically prone to amyloidogenic self-assembly. Here, we assess the extent and consequences of variation in the pQ-encoding opa repeats of Notch (N) in Drosophila melanogaster. We use Sanger sequencing to genotype opa sequences (5′-CAX repeats), which have resisted assembly using short sequence reads. While the majority of N sequences pertain to reference opa31 (Q13HQ17) and opa32 (Q13HQ18) allelic classes, several rare alleles encode tracts > 32 residues: opa33a (Q14HQ18), opa33b (Q15HQ17), opa34 (Q16HQ17), opa35a1/opa35a2 (Q13HQ21), opa36 (Q13HQ22), and opa37 (Q13HQ23). Only one rare allele encodes a tract < 31 residues: opa23 (Q13?Q10). This opa23 allele shortens the pQ tract while simultaneously eliminating the interrupting histidine. Homozygotes for the short and long opa alleles have defects in sensory bristle organ specification, abdominal patterning, and embryonic survival. Inbred stocks with wild-type opa31 alleles become more viable when outbred, while an inbred stock with the longer opa35 becomes less viable after outcrossing to different backgrounds. In contrast, an inbred stock with the short opa23 allele is semi-viable in both inbred and outbred genetic backgrounds. This opa23 Notch allele also produces notched wings when recombined out of the X chromosome. Importantly, w[apricot]-linked X balancers carry the N allele opa33b and suppress AS-C insufficiency caused by the sc8 inversion. Our results demonstrate significant cryptic variation and epistatic sensitivity for the N locus, and the need for long read genotyping of key repeat variables underlying gene regulatory networks.