Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between alleles

Lindsay V Clark, Andrea Drauch Schreier
doi: http://dx.doi.org/10.1101/020610

A major limitation in the analysis of genetic marker data from polyploid organisms is non-Mendelian segregation, particularly when a single marker yields allelic signals from multiple, independently segregating loci (isoloci). However, with markers such as microsatellites that detect more than two alleles, it is sometimes possible to deduce which alleles belong to which isoloci. Here we describe a novel mathematical property of codominant marker data when it is recoded as binary (presence/absence) allelic variables: under random mating in an infinite population, two allelic variables will be negatively correlated if they belong to the same locus, but uncorrelated if they belong to different loci. We present an algorithm to take advantage of this mathematical property, sorting alleles into isoloci based on correlations, then refining the allele assignments after checking for consistency with individual genotypes. We demonstrate the utility of our method on simulated data, as well as a real microsatellite dataset from a natural population of octoploid white sturgeon (Acipenser transmontanus). Our methodology is implemented in the R package polysat version 1.4.
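
The correlation property stated in the abstract can be checked with a small simulation. The sketch below is not the polysat algorithm; it assumes a hypothetical allotetraploid with two diploid isoloci, made-up allele names and frequencies, and random mating approximated by drawing allele copies independently. It recodes each genotype as presence/absence variables and compares the Pearson correlation for a same-locus pair of alleles with that for a different-locus pair.

```python
import random
from statistics import mean

def simulate_presence(n_individuals, isoloci, copies_per_locus=2, seed=0):
    """Draw genotypes under random mating and recode them as presence/absence.

    isoloci is a list of dicts mapping allele name -> frequency (hypothetical values).
    Returns one dict per individual mapping allele name -> 0 or 1.
    """
    rng = random.Random(seed)
    all_alleles = [a for locus in isoloci for a in locus]
    rows = []
    for _ in range(n_individuals):
        present = set()
        for locus in isoloci:
            names = list(locus)
            weights = [locus[a] for a in names]
            # Each isolocus contributes its allele copies independently.
            present.update(rng.choices(names, weights=weights, k=copies_per_locus))
        rows.append({a: int(a in present) for a in all_alleles})
    return rows

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def column(rows, allele):
    """Extract the presence/absence variable for one allele."""
    return [row[allele] for row in rows]

# Assumed allele frequencies for two isoloci amplified by one marker.
isoloci = [{"A1": 0.5, "A2": 0.3, "A3": 0.2}, {"B1": 0.6, "B2": 0.4}]
rows = simulate_presence(5000, isoloci)

r_same = pearson(column(rows, "A1"), column(rows, "A2"))  # same isolocus
r_diff = pearson(column(rows, "A1"), column(rows, "B1"))  # different isoloci
print(f"same locus: {r_same:.3f}, different loci: {r_diff:.3f}")
```

With a finite sample the different-locus correlation is only approximately zero, which is presumably why the abstract describes a refinement step that checks allele assignments against individual genotypes.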

Are Genetic Interactions Influencing Gene Expression Evidence for Biological Epistasis or Statistical Artifacts?

Alexandra Fish, John A. Capra, William S Bush
doi: http://dx.doi.org/10.1101/020479

Interactions between genetic variants, also called epistasis, are pervasive in model organisms; however, their importance in humans remains unclear because statistical interactions in observational studies can be explained by processes other than biological epistasis. Using statistical modeling, we identified 1,093 interactions between pairs of cis-regulatory variants impacting gene expression in lymphoblastoid cell lines. Factors known to confound these analyses (ceiling/floor effects, population stratification, haplotype effects, or single variants tagged through linkage disequilibrium) explained most of these interactions. However, we found 15 interactions robust to these explanations, and we further show that despite potential confounding, interacting variants were enriched in numerous regulatory regions suggesting potential biological importance. While genetic interactions may not be the true underlying mechanism of all our statistical models, our analyses discover new signals undetected in standard single-marker analyses. Ultimately, we identified new complex genetic architectures regulating 23 genes, suggesting that single-variant analyses may miss important modifiers.

RNA:DNA hybrids in the human genome have distinctive nucleotide characteristics, chromatin composition, and transcriptional relationships

Julie Nadel, Rodoniki Athanasiadou, Christophe Lemetre, Neil Ari Wijetunga, Pilib Ó Broin, Hanae Sato, Zhengdong Zhang, Jeffrey Jeddeloh, Cristina Montagna, Aaron Golden, Cathal Seoighe, John Greally
doi: http://dx.doi.org/10.1101/020545
RNA:DNA hybrids represent a non-canonical nucleic acid structure that has been associated with a range of human diseases and potential transcriptional regulatory functions. Mapping of RNA:DNA hybrids in human cells reveals them to have a number of characteristics that give insights into their functions. A directional sequencing approach shows the RNA component of the RNA:DNA hybrid to be purine-rich, indicating a thermodynamic contribution to their in vivo stability. The RNA:DNA hybrids are enriched at loci with decreased DNA methylation and increased DNase hypersensitivity, and within larger domains with characteristics of heterochromatin formation, indicating potential transcriptional regulatory properties. Mass spectrometry studies of chromatin at RNA:DNA hybrids show the presence of the ILF2 and ILF3 transcription factors, supporting a model of certain transcription factors binding preferentially to the RNA:DNA conformation. Overall, there is little to indicate a dependence for RNA:DNA hybrids forming co-transcriptionally, with results from the ribosomal DNA repeat unit instead supporting a model of RNA generating these structures in trans. The results of the study indicate heterogeneous functions of these genomic elements and new insights into their formation and stability in vivo.

The Nature, Extent, and Consequences of Cryptic Genetic Variation in the opa Repeats of Notch in Drosophila

Clinton Rice, Daniel Beekman, Liping Liu, Albert Erives
doi: http://dx.doi.org/10.1101/020529
Polyglutamine (pQ) tracts are abundant in many proteins co-interacting on DNA. The lengths of these pQ tracts can modulate their interaction strengths. However, pQ tracts > 40 residues are pathologically prone to amyloidogenic self-assembly. Here, we assess the extent and consequences of variation in the pQ-encoding opa repeats of Notch (N) in Drosophila melanogaster. We use Sanger sequencing to genotype opa sequences (5′-CAX repeats), which have resisted assembly using short sequence reads. While the majority of N sequences pertain to reference opa31 (Q13HQ17) and opa32 (Q13HQ18) allelic classes, several rare alleles encode tracts > 32 residues: opa33a (Q14HQ18), opa33b (Q15HQ17), opa34 (Q16HQ17), opa35a1/opa35a2 (Q13HQ21), opa36 (Q13HQ22), and opa37 (Q13HQ23). Only one rare allele encodes a tract < 31 residues: opa23 (Q13–Q10). This opa23 allele shortens the pQ tract while simultaneously eliminating the interrupting histidine. Homozygotes for the short and long opa alleles have defects in sensory bristle organ specification, abdominal patterning, and embryonic survival. Inbred stocks with wild-type opa31 alleles become more viable when outbred, while an inbred stock with the longer opa35 becomes less viable after outcrossing to different backgrounds. In contrast, an inbred stock with the short opa23 allele is semi-viable in both inbred and outbred genetic backgrounds. This opa23 Notch allele also produces notched wings when recombined out of the X chromosome. Importantly, w[apricot]-linked X balancers carry the N allele opa33b and suppress AS-C insufficiency caused by the sc8 inversion. Our results demonstrate significant cryptic variation and epistatic sensitivity for the N locus, and the need for long read genotyping of key repeat variables underlying gene regulatory networks.

Bayesian co-estimation of selfing rate and locus-specific mutation rates for a partially selfing population

Benjamin D Redelings, Seiji Kumagai, Liuyang Wang, Andrey Tatarenkov, Ann K. Sakai, Stephen G. Weller, Theresa M. Culley, John C. Avise, Marcy K. Uyenoyama
doi: http://dx.doi.org/10.1101/020537
We present a Bayesian method for characterizing the mating system of populations reproducing through a mixture of self-fertilization and random outcrossing. Our method uses patterns of genetic variation across the genome as a basis for inference about pure hermaphroditism, androdioecy, and gynodioecy. We extend the standard coalescence model to accommodate these mating systems, accounting explicitly for multilocus identity disequilibrium, inbreeding depression, and variation in fertility among mating types. We incorporate the Ewens Sampling Formula (ESF) under the infinite-alleles model of mutation to obtain a novel expression for the likelihood of mating system parameters. Our Markov chain Monte Carlo (MCMC) algorithm assigns locus-specific mutation rates, drawn from a common mutation rate distribution that is itself estimated from the data using a Dirichlet Process Prior model. Among the parameters jointly inferred are the population-wide rate of self-fertilization, locus-specific mutation rates, and the number of generations since the most recent outcrossing event for each sampled individual.
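
For readers unfamiliar with the Ewens Sampling Formula mentioned above, here is a minimal sketch of the standard ESF under the infinite-alleles model. It is not the paper's likelihood, which extends the formula to partially selfing populations; the function name and the `counts` encoding are illustrative choices. For a sample of n genes in which a_j allele classes appear exactly j times, the ESF gives probability n! / (θ(θ+1)...(θ+n-1)) × Π_j θ^{a_j} / (j^{a_j} a_j!), where θ is the scaled mutation rate.

```python
from math import factorial, prod

def ewens_probability(counts, theta):
    """Standard Ewens Sampling Formula for one allele-count configuration.

    counts[j - 1] is the number of allele classes seen exactly j times in the
    sample; theta is the scaled mutation rate. The sample size n is implied.
    """
    n = sum(j * a for j, a in enumerate(counts, start=1))
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = prod(theta + i for i in range(n))
    # Product over multiplicities j of theta^a_j / (j^a_j * a_j!).
    term = prod(
        theta ** a / (j ** a * factorial(a))
        for j, a in enumerate(counts, start=1)
    )
    return factorial(n) / rising * term

# All three possible configurations for a sample of n = 3 genes.
configs = [(3, 0, 0), (1, 1, 0), (0, 0, 1)]
for c in configs:
    print(c, ewens_probability(c, theta=2.0))
```

Because the configurations listed cover every way three genes can be partitioned into allele classes, their probabilities sum to one, which is a convenient sanity check on any implementation.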

Coalescent inference using serially sampled, high-throughput sequencing data from HIV infected patients

Kevin Dialdestoro, Jonas Andreas Sibbesen, Lasse Maretty, Jayna Raghwani, Astrid Gall, Paul Kellam, Oliver Pybus, Jotun Hein, Paul Jenkins
doi: http://dx.doi.org/10.1101/020552
Human immunodeficiency virus (HIV) is a rapidly evolving pathogen that causes chronic infections, so genetic diversity within a single infection can be very high. High-throughput “deep” sequencing can now measure this diversity in unprecedented detail, particularly since it can be performed at different timepoints during an infection, and this offers a potentially powerful way to infer the evolutionary dynamics of the intra-host viral population. However, population genomic inference from HIV sequence data is challenging because of high rates of mutation and recombination, rapid demographic changes, and ongoing selective pressures. In this paper we develop a new method for inference from HIV deep sequencing data, based on importance sampling of ancestral recombination graphs under a multi-locus coalescent model. The approach further extends recent progress in the approximation of so-called ‘conditional sampling distributions’, a quantity of key interest when approximating coalescent likelihoods. The chief novelties of our method are that it is able to infer rates of recombination and mutation, as well as the effective population size, while handling sampling over different timepoints and missing data without extra computational difficulty. We apply our method to a dataset of HIV-1, in which several hundred sequences were obtained from an infected individual at seven timepoints over two years. We find mutation rate and effective population size estimates to be comparable to those produced by the software BEAST. Additionally, our method is able to produce local recombination rate estimates. The software underlying our method, Coalescenator, is freely available.

Predictable patterns of CTL escape and reversion and the limited role of epistasis in HIV-1 evolution

Duncan S. Palmer, Emily Adland, John A. Frater, Philip J.R. Goulder, Kuan-Hsiang Gary Huang, Thumbi Ndung’u, Philippa C. Matthews, Rodney E. Phillips, Roger Shapiro, Gil McVean, Angela R. McLean
Subjects: Populations and Evolution (q-bio.PE)

The twin processes of viral evolutionary escape and reversion in response to host immune pressure, in particular the cytotoxic T-lymphocyte (CTL) response, help shape Human Immunodeficiency Virus-1 sequence evolution in infected host populations. The tempo of CTL escape and reversion is known to differ between CTL escape variants in a given host population. A wealth of epistatic effects – both intermediary sequence changes on the path to CTL escape and compensatory mutations which restore replicative capacity following viral escape – have been reported. Given the importance of epistatic effects in these processes, we ask: are rates of escape and reversion comparable across infected host populations? For three cohorts taken from three continents, we estimate escape and reversion rates at 23 escape sites in gag epitopes. Surprisingly, we find highly consistent escape rate estimates across the examined cohorts. Reversion rates are also consistent between a Canadian and South African infected host population. We investigate the importance of epistasis further by examining in vitro replicative capacities of viral sequences with minimal variation: point escape mutants induced in a lab strain. Remarkably, despite the complexities of epistatic effects and the diversity of both hosts and viruses, CTL escape mutants which escape rapidly tend to be those with the highest replicative capacity when applied as a single point mutation. Similarly, mutants inducing the greatest costs to viral replicative capacity tend to revert more quickly. These data suggest that escape rates in gag are consistent across host populations and, in general, epistatic effects do not dramatically affect escape rates.

Phylodynamics of H5N1 Highly Pathogenic Avian Influenza in Europe, 2005–2010: Potential for Molecular Surveillance of New Outbreaks

Mohammad A. Alkhamis, Brian R. Moore, Andrés M. Perez
doi: http://dx.doi.org/10.1101/020339

Previous Bayesian phylogeographic studies of H5N1 highly pathogenic avian influenza viruses (HPAIVs) explored the origin and spread of the epidemic from China into Russia, indicating that HPAIV circulated in Russia prior to its detection there in 2005. In this study, we extend this research to explore the evolution and spread of HPAIV within Europe during the 2005–2010 epidemic, using all available sequences of the HA and NA gene regions that were collected in Europe and Russia during the outbreak. We use discrete-trait phylodynamic models within a Bayesian statistical framework to explore the evolution of HPAIV. Our results indicate that the genetic diversity and effective population size of HPAIV peaked between mid-2005 and early 2006, followed by a drastic decline in 2007, which coincides with the end of the epidemic in Europe. Our results also suggest that domestic birds were the most likely source of the spread of the virus from Russia into Europe. Additionally, estimates of viral dispersal routes indicate that Russia, Romania, and Germany were key epicenters of these outbreaks. Our study quantifies the dynamics of a major European HPAIV pandemic and substantiates the ability of phylodynamic models to improve molecular surveillance of novel AIVs.

A semi-supervised approach uncovers thousands of intragenic enhancers differentially activated in human cells

Juan Gonzalez-Vallinas, Amadís Pagès, Babita Singh, Eduardo Eyras
doi: http://dx.doi.org/10.1101/020362

Background: Transcriptional enhancers are generally known to regulate gene transcription from afar. Their activation involves a series of changes in chromatin marks and recruitment of protein factors. These enhancers may also occur inside genes, but how many may be active in human cells and their effects on the regulation of the host gene remain unclear. Results: We describe a novel semi-supervised method based on the relative enrichment of chromatin signals between 2 conditions to predict active enhancers. We applied this method to the tumoral K562 and the normal GM12878 cell lines to predict enhancers that are differentially active in one cell type. These predictions show enhancer-like properties according to positional distribution, correlation with gene expression and production of enhancer RNAs. Using this model, we predict 10,365 and 9,777 intragenic active enhancers in K562 and GM12878, respectively, and relate the differential activation of these enhancers to expression and splicing differences of the host genes. Conclusions: We propose that the activation or silencing of intragenic transcriptional enhancers modulates the regulation of the host gene by means of a local change of the chromatin and the recruitment of enhancer-related factors that may interact with the RNA directly or through the interaction with RNA binding proteins. Predicted enhancers are available at http://regulatorygenomics.upf.edu/Projects/enhancers.html

The weighting is the hardest part: on the behavior of the likelihood ratio test and score test under weight misspecification in rare variant association studies

Camelia Claudia Minica, Giulio Genovese, Dorret I. Boomsma, Christina M. Hultman, René Pool, Jacqueline M. Vink, Conor V. Dolan, Benjamin M. Neale
doi: http://dx.doi.org/10.1101/020198

Rare variant association studies are gaining importance in human genetic research with the increasing availability of exome/genome sequence data. One important test of association between a target set of rare variants (RVs) and a given phenotype is the sequence kernel association test (SKAT). Assignment of weights reflecting the hypothesized contribution of the RVs to the trait variance is embedded within any set-based test. Since the true weights are generally unknown, it is important to establish the effect of weight misspecification in SKAT. We used simulated and real data to characterize the behavior of the likelihood ratio test (LRT) and score test under weight misspecification. Results revealed that the LRT is generally more robust to weight misspecification, and more powerful than the score test in such circumstances. For instance, when the rare variants within the target were simulated to have larger betas than the more common ones, incorrect assignment of equal weights reduced the power of the LRT by ~5% while the power of the score test dropped by ~30%. Furthermore, the LRT was more robust to the inclusion of weighted neutral variation in the test. To optimize weighting we proposed the use of a data-driven weighting scheme. With this approach and the LRT we detected significant enrichment of case mutations with MAF below 5% (P-value=7E-04) of a set of highly constrained genes in the Swedish schizophrenia case-control cohort of 4,940 individuals with observed exome-sequencing data. The score test is currently widely used in sequence kernel association studies for both its computational efficiency and power. Indeed, assuming correct specification, in some circumstances the score test is the most powerful test. However, our results showed that the LRT has the compelling qualities of being generally more robust and more powerful under weight misspecification. This is a paramount result, given that, arguably, misspecified models are likely to be the rule rather than the exception in the weighting-based approaches.