Coalescent inference using serially sampled, high-throughput sequencing data from HIV infected patients
Kevin Dialdestoro, Jonas Andreas Sibbesen, Lasse Maretty, Jayna Raghwani, Astrid Gall, Paul Kellam, Oliver Pybus, Jotun Hein, Paul Jenkins
doi: http://dx.doi.org/10.1101/020552
Human immunodeficiency virus (HIV) is a rapidly evolving pathogen that causes chronic infections, so genetic diversity within a single infection can be very high. High-throughput “deep” sequencing can now measure this diversity in unprecedented detail, particularly since it can be performed at different timepoints during an infection, and this offers a potentially powerful way to infer the evolutionary dynamics of the intra-host viral population. However, population genomic inference from HIV sequence data is challenging because of high rates of mutation and recombination, rapid demographic changes, and ongoing selective pressures. In this paper we develop a new method for inference from HIV deep sequencing data, based on importance sampling of ancestral recombination graphs under a multi-locus coalescent model. The approach further extends recent progress in the approximation of so-called ‘conditional sampling distributions’, quantities of key interest when approximating coalescent likelihoods. The chief novelties of our method are that it is able to infer rates of recombination and mutation, as well as the effective population size, while handling sampling over different timepoints and missing data without extra computational difficulty. We apply our method to a dataset of HIV-1, in which several hundred sequences were obtained from an infected individual at seven timepoints over two years. We find mutation rate and effective population size estimates to be comparable to those produced by the software BEAST. Additionally, our method is able to produce local recombination rate estimates. The software underlying our method, Coalescenator, is freely available.
Bayesian co-estimation of selfing rate and locus-specific mutation rates for a partially selfing population
Benjamin D Redelings, Seiji Kumagai, Liuyang Wang, Andrey Tatarenkov, Ann K. Sakai, Stephen G. Weller, Theresa M. Culley, John C. Avise, Marcy K. Uyenoyama
doi: http://dx.doi.org/10.1101/020537
We present a Bayesian method for characterizing the mating system of populations reproducing through a mixture of self-fertilization and random outcrossing. Our method uses patterns of genetic variation across the genome as a basis for inference about pure hermaphroditism, androdioecy, and gynodioecy. We extend the standard coalescence model to accommodate these mating systems, accounting explicitly for multilocus identity disequilibrium, inbreeding depression, and variation in fertility among mating types. We incorporate the Ewens Sampling Formula (ESF) under the infinite-alleles model of mutation to obtain a novel expression for the likelihood of mating system parameters. Our Markov chain Monte Carlo (MCMC) algorithm assigns locus-specific mutation rates, drawn from a common mutation rate distribution that is itself estimated from the data using a Dirichlet Process Prior model. Among the parameters jointly inferred are the population-wide rate of self-fertilization, locus-specific mutation rates, and the number of generations since the most recent outcrossing event for each sampled individual.
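As a minimal illustration of the Ewens Sampling Formula mentioned in this abstract, the sketch below evaluates the standard ESF probability of an allele-count configuration under the infinite-alleles model. It is not the authors' likelihood, which additionally accounts for the mating system; the function name `ewens_probability` and the `counts` encoding are assumptions made for illustration only.

```python
from math import factorial


def ewens_probability(counts, theta):
    """Standard Ewens Sampling Formula (hypothetical helper, not from the paper).

    counts[j-1] is the number of allelic types seen exactly j times in the
    sample; theta is the scaled mutation rate.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    # Sample size implied by the configuration.
    n = sum(j * a for j, a in enumerate(counts, start=1))
    # Rising factorial theta * (theta + 1) * ... * (theta + n - 1).
    rising = 1.0
    for i in range(n):
        rising *= theta + i
    prob = factorial(n) / rising
    for j, a in enumerate(counts, start=1):
        prob *= theta ** a / (j ** a * factorial(a))
    return prob
```

For a sample of size two, `ewens_probability([0, 1], theta)` is the probability that both genes carry the same allele, which equals 1 / (1 + theta).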
The Nature, Extent, and Consequences of Cryptic Genetic Variation in the opa Repeats of Notch in Drosophila
Clinton Rice, Daniel Beekman, Liping Liu, Albert Erives
doi: http://dx.doi.org/10.1101/020529
Polyglutamine (pQ) tracts are abundant in many proteins co-interacting on DNA. The lengths of these pQ tracts can modulate their interaction strengths. However, pQ tracts > 40 residues are pathologically prone to amyloidogenic self-assembly. Here, we assess the extent and consequences of variation in the pQ-encoding opa repeats of Notch (N) in Drosophila melanogaster. We use Sanger sequencing to genotype opa sequences (5′-CAX repeats), which have resisted assembly using short sequence reads. While the majority of N sequences pertain to reference opa31 (Q13HQ17) and opa32 (Q13HQ18) allelic classes, several rare alleles encode tracts > 32 residues: opa33a (Q14HQ18), opa33b (Q15HQ17), opa34 (Q16HQ17), opa35a1/opa35a2 (Q13HQ21), opa36 (Q13HQ22), and opa37 (Q13HQ23). Only one rare allele encodes a tract < 31 residues: opa23 (Q13?Q10). This opa23 allele shortens the pQ tract while simultaneously eliminating the interrupting histidine. Homozygotes for the short and long opa alleles have defects in sensory bristle organ specification, abdominal patterning, and embryonic survival. Inbred stocks with wild-type opa31 alleles become more viable when outbred, while an inbred stock with the longer opa35 becomes less viable after outcrossing to different backgrounds. In contrast, an inbred stock with the short opa23 allele is semi-viable in both inbred and outbred genetic backgrounds. This opa23 Notch allele also produces notched wings when recombined out of the X chromosome. Importantly, w[apricot]-linked X balancers carry the N allele opa33b and suppress AS-C insufficiency caused by the sc8 inversion. Our results demonstrate significant cryptic variation and epistatic sensitivity for the N locus, and the need for long read genotyping of key repeat variables underlying gene regulatory networks.
RNA:DNA hybrids in the human genome have distinctive nucleotide characteristics, chromatin composition, and transcriptional relationships
Julie Nadel, Rodoniki Athanasiadou, Christophe Lemetre, Neil Ari Wijetunga, Pilib Ó Broin, Hanae Sato, Zhengdong Zhang, Jeffrey Jeddeloh, Cristina Montagna, Aaron Golden, Cathal Seoighe, John Greally
doi: http://dx.doi.org/10.1101/020545
RNA:DNA hybrids represent a non-canonical nucleic acid structure that has been associated with a range of human diseases and potential transcriptional regulatory functions. Mapping of RNA:DNA hybrids in human cells reveals them to have a number of characteristics that give insights into their functions. A directional sequencing approach shows the RNA component of the RNA:DNA hybrid to be purine-rich, indicating a thermodynamic contribution to their in vivo stability. The RNA:DNA hybrids are enriched at loci with decreased DNA methylation and increased DNase hypersensitivity, and within larger domains with characteristics of heterochromatin formation, indicating potential transcriptional regulatory properties. Mass spectrometry studies of chromatin at RNA:DNA hybrids show the presence of the ILF2 and ILF3 transcription factors, supporting a model of certain transcription factors binding preferentially to the RNA:DNA conformation. Overall, there is little to indicate a dependence for RNA:DNA hybrids forming co-transcriptionally, with results from the ribosomal DNA repeat unit instead supporting a model of RNA generating these structures in trans. The results of the study indicate heterogeneous functions of these genomic elements and new insights into their formation and stability in vivo.
A semi-supervised approach uncovers thousands of intragenic enhancers differentially activated in human cells
Juan Gonzalez-Vallinas, Amadís Pagès, Babita Singh, Eduardo Eyras
doi: http://dx.doi.org/10.1101/020362
Background: Transcriptional enhancers are generally known to regulate gene transcription from afar. Their activation involves a series of changes in chromatin marks and recruitment of protein factors. These enhancers may also occur inside genes, but how many may be active in human cells and their effects on the regulation of the host gene remain unclear. Results: We describe a novel semi-supervised method based on the relative enrichment of chromatin signals between two conditions to predict active enhancers. We applied this method to the tumoral K562 and the normal GM12878 cell lines to predict enhancers that are differentially active in one cell type. These predictions show enhancer-like properties according to positional distribution, correlation with gene expression and production of enhancer RNAs. Using this model, we predict 10,365 and 9,777 intragenic active enhancers in K562 and GM12878, respectively, and relate the differential activation of these enhancers to expression and splicing differences of the host genes. Conclusions: We propose that the activation or silencing of intragenic transcriptional enhancers modulates the regulation of the host gene by means of a local change of the chromatin and the recruitment of enhancer-related factors that may interact with the RNA directly or through the interaction with RNA binding proteins. Predicted enhancers are available at http://regulatorygenomics.upf.edu/Projects/enhancers.html
Pyvolve: a flexible Python module for simulating sequences along phylogenies
Stephanie J Spielman, Claus O Wilke
doi: http://dx.doi.org/10.1101/020214
We introduce Pyvolve, a flexible Python module for simulating genetic data along a phylogeny according to continuous-time Markov models of sequence evolution. Pyvolve incorporates most standard models of nucleotide, amino-acid, and codon sequence evolution, and it allows users to fully customize all model parameters. Pyvolve additionally allows users to specify custom evolutionary models and incorporates several new features, including a novel rate matrix scaling algorithm and branch-length perturbations. Easily incorporated into Python bioinformatics pipelines, Pyvolve represents a convenient and flexible alternative to third-party simulation software. Pyvolve is an open-source project available, along with a detailed user manual, under a FreeBSD license from https://github.com/sjspielman/pyvolve. API documentation is available from http://sjspielman.org/pyvolve.
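To make the idea of simulating sequences along a phylogeny under a continuous-time Markov model concrete, here is a minimal standard-library sketch using the Jukes-Cantor (JC69) nucleotide model. It deliberately does not use Pyvolve's API; the tree encoding `(name, branch_length, children)` and all function names are assumptions chosen for illustration.

```python
import math
import random

NUCLEOTIDES = "ACGT"

# Hypothetical tree encoding: (name, branch_length, list_of_children).
# Branch lengths are in expected substitutions per site.
EXAMPLE_TREE = ("root", 0.0, [
    ("A", 0.1, []),
    ("inner", 0.05, [("B", 0.1, []), ("C", 0.2, [])]),
])


def jc69_evolve(seq, t, rng):
    """Evolve a sequence along a branch of length t under JC69."""
    # Probability that a site shows the same base after time t.
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    out = []
    for base in seq:
        if rng.random() < p_same:
            out.append(base)
        else:
            # The three other bases are equally likely under JC69.
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
    return "".join(out)


def _simulate(node, parent_seq, rng, leaves):
    """Recursively evolve sequences down the tree, recording the leaves."""
    name, length, children = node
    seq = jc69_evolve(parent_seq, length, rng)
    if not children:
        leaves[name] = seq
    for child in children:
        _simulate(child, seq, rng, leaves)
    return leaves


def simulate_alignment(tree, n_sites, seed=0):
    """Draw a uniform root sequence and return {leaf_name: sequence}."""
    rng = random.Random(seed)
    root_seq = "".join(rng.choice(NUCLEOTIDES) for _ in range(n_sites))
    return _simulate(tree, root_seq, rng, {})
```

Calling `simulate_alignment(EXAMPLE_TREE, 50, seed=1)` returns one sequence of length 50 per leaf. Richer models (codon models, site heterogeneity, custom rate matrices) are what a dedicated package such as Pyvolve is designed to provide.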
OPERA-LG: Efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees
Song Gao, Denis Bertrand, Niranjan Nagarajan
doi: http://dx.doi.org/10.1101/020230
The assembly of large, repeat-rich eukaryotic genomes continues to represent a significant challenge in genomics. While long-read technologies have made the high-quality assembly of small, microbial genomes increasingly feasible, data generation can be prohibitively expensive for larger genomes. Fundamental advances in assembly algorithms are thus essential to exploit the characteristics of short and long-read sequencing technologies to consistently and reliably provide high-quality assemblies in a cost-efficient manner. Here we present a scalable, exact algorithm (OPERA-LG) for the scaffold assembly of large, repeat-rich genomes that exhibits almost an order of magnitude improvement over the state-of-the-art programs in both correctness (>5X on average) and contiguity (>10X). This provides a systematic approach for combining data from different sequencing technologies, as well as a rigorous framework for scaffolding of repetitive sequences. OPERA-LG represents the first in a new class of algorithms that can efficiently assemble large genomes while providing formal guarantees about assembly quality, providing an avenue for systematic augmentation and improvement of thousands of existing draft eukaryotic genome assemblies.
Approximately independent linkage disequilibrium blocks in human populations
Tomaz Berisa, Joseph K. Pickrell
doi: http://dx.doi.org/10.1101/020255
We present a method to identify approximately independent blocks of linkage disequilibrium (LD) in the human genome. These blocks enable automated analysis of multiple genome-wide association studies.
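For readers unfamiliar with how LD between two sites is quantified, the sketch below computes the common r² statistic from phased haplotypes at two biallelic sites. This is only the pairwise measure, not the authors' block-identification method; the function name and the 0/1 allele encoding are assumptions for illustration.

```python
def ld_r_squared(site_a, site_b):
    """r^2 between two biallelic sites (hypothetical helper).

    site_a and site_b are equal-length lists of 0/1 alleles, one entry
    per phased haplotype.
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    # Frequency of haplotypes carrying allele 1 at both sites.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined for a monomorphic site")
    return d * d / denom
```

Two sites whose pairwise r² is close to zero are approximately uncorrelated, which is the intuition behind treating well-separated blocks as approximately independent.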
SWEEPFINDER2: Increased sensitivity, robustness, and flexibility
Michael DeGiorgio, Christian D. Huber, Melissa J. Hubisz, Ines Hellmann, Rasmus Nielsen
Subjects: Populations and Evolution (q-bio.PE)
SweepFinder is a popular program that implements a powerful likelihood-based method for detecting recent positive selection, or selective sweeps. Here, we present SweepFinder2, an extension of SweepFinder with increased sensitivity and robustness to the confounding effects of mutation rate variation and background selection, as well as increased flexibility that enables the user to examine genomic regions in greater detail and to specify a fixed distance between test sites. Moreover, SweepFinder2 enables the use of invariant sites for sweep detection, increasing both its power and precision relative to SweepFinder.
Detection and interpretation of shared genetic influences on 40 human traits
Joseph Pickrell, Tomaz Berisa, Laure Segurel, Joyce Y Tung, David Hinds
doi: http://dx.doi.org/10.1101/019885
We performed a genome-wide scan for genetic variants that influence multiple human phenotypes by comparing large genome-wide association studies (GWAS) of 40 traits or diseases, including anthropometric traits (e.g. nose size and male pattern baldness), immune traits (e.g. susceptibility to childhood ear infections and Crohn’s disease), metabolic phenotypes (e.g. type 2 diabetes and lipid levels), and psychiatric diseases (e.g. schizophrenia and Parkinson’s disease). First, we identified 307 loci (at a false discovery rate of 10%) that influence multiple traits (excluding “trivial” phenotype pairs like type 2 diabetes and fasting glucose). Several loci influence a large number of phenotypes; for example, variants near the blood group gene ABO influence eleven of these traits, including risk of childhood ear infections (rs635634: log-odds ratio = 0.06, P = 1.4 × 10⁻⁸) and allergies (log-odds ratio = 0.05, P = 2.5 × 10⁻⁸), among others. Similarly, a nonsynonymous variant in the zinc transporter SLC39A8 influences seven of these traits, including risk of schizophrenia (rs13107325: log-odds ratio = 0.15, P = 2 × 10⁻¹²) and Parkinson’s disease (log-odds ratio = -0.15, P = 1.6 × 10⁻⁷), among others. Second, we used these loci to identify traits that share multiple genetic causes in common. For example, genetic variants that delay age of menarche in women also, on average, delay age of voice drop in men, decrease body mass index (BMI), increase adult height, and decrease risk of male pattern baldness. Finally, we identified four pairs of traits that show evidence of a causal relationship. For example, we show evidence that increased BMI causally increases triglyceride levels, and that increased liability to hypothyroidism causally decreases adult height.
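The effect sizes quoted above are log-odds ratios, which can be easier to read once converted to odds ratios by exponentiation. The short sketch below does that conversion for the values reported in the abstract; the helper name and dictionary labels are assumptions added for illustration.

```python
import math


def odds_ratio(log_odds_ratio):
    """Convert a log-odds ratio to an odds ratio."""
    return math.exp(log_odds_ratio)


# Log-odds ratios as quoted in the abstract (labels are ours).
REPORTED = {
    "ABO / childhood ear infections": 0.06,
    "ABO / allergies": 0.05,
    "SLC39A8 / schizophrenia": 0.15,
    "SLC39A8 / Parkinson's disease": -0.15,
}

if __name__ == "__main__":
    for label, beta in REPORTED.items():
        print(f"{label}: odds ratio = {odds_ratio(beta):.3f}")
```

A log-odds ratio of 0.15 corresponds to an odds ratio of about 1.16, and -0.15 to about 0.86, so the same SLC39A8 variant is associated with modestly higher odds of one disease and modestly lower odds of the other.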