Bayesian co-estimation of selfing rate and locus-specific mutation rates for a partially selfing population
Benjamin D Redelings, Seiji Kumagai, Liuyang Wang, Andrey Tatarenkov, Ann K. Sakai, Stephen G. Weller, Theresa M. Culley, John C. Avise, Marcy K. Uyenoyama
doi: http://dx.doi.org/10.1101/020537
We present a Bayesian method for characterizing the mating system of populations reproducing through a mixture of self-fertilization and random outcrossing. Our method uses patterns of genetic variation across the genome as a basis for inference about pure hermaphroditism, androdioecy, and gynodioecy. We extend the standard coalescence model to accommodate these mating systems, accounting explicitly for multilocus identity disequilibrium, inbreeding depression, and variation in fertility among mating types. We incorporate the Ewens Sampling Formula (ESF) under the infinite-alleles model of mutation to obtain a novel expression for the likelihood of mating system parameters. Our Markov chain Monte Carlo (MCMC) algorithm assigns locus-specific mutation rates, drawn from a common mutation rate distribution that is itself estimated from the data using a Dirichlet Process Prior model. Among the parameters jointly inferred are the population-wide rate of self-fertilization, locus-specific mutation rates, and the number of generations since the most recent outcrossing event for each sampled individual.
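For readers unfamiliar with the Ewens Sampling Formula mentioned above, the following is a minimal sketch of the standard ESF under the infinite-alleles model. It is not the authors' likelihood, which extends the formula to partially selfing mating systems; the function name `esf_log_prob` and the input layout are assumptions made for illustration.

```python
from collections import Counter
from math import exp, lgamma, log


def esf_log_prob(allele_counts, theta):
    """Log-probability of an allelic configuration under the standard ESF.

    allele_counts: copies observed of each distinct allele, e.g. [2, 1]
                   (hypothetical input layout chosen for this sketch).
    theta: scaled mutation rate (must be > 0).
    """
    n = sum(allele_counts)
    # a[j] = number of distinct alleles observed exactly j times
    a = Counter(allele_counts)
    # log( n! / (theta * (theta + 1) * ... * (theta + n - 1)) )
    logp = lgamma(n + 1) - (lgamma(theta + n) - lgamma(theta))
    # product over j of theta^a_j / (j^a_j * a_j!)
    for j, a_j in a.items():
        logp += a_j * (log(theta) - log(j)) - lgamma(a_j + 1)
    return logp


# The three possible configurations for a sample of n = 3 genes
for config in ([3], [2, 1], [1, 1, 1]):
    print(config, exp(esf_log_prob(config, theta=2.0)))
```

The probabilities of all configurations for a given sample size sum to one, which gives a quick sanity check on any implementation.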
Coalescent inference using serially sampled, high-throughput sequencing data from HIV infected patients
Kevin Dialdestoro, Jonas Andreas Sibbesen, Lasse Maretty, Jayna Raghwani, Astrid Gall, Paul Kellam, Oliver Pybus, Jotun Hein, Paul Jenkins
doi: http://dx.doi.org/10.1101/020552
Human immunodeficiency virus (HIV) is a rapidly evolving pathogen that causes chronic infections, so genetic diversity within a single infection can be very high. High-throughput “deep” sequencing can now measure this diversity in unprecedented detail, particularly since it can be performed at different timepoints during an infection, and this offers a potentially powerful way to infer the evolutionary dynamics of the intra-host viral population. However, population genomic inference from HIV sequence data is challenging because of high rates of mutation and recombination, rapid demographic changes, and ongoing selective pressures. In this paper we develop a new method for inference from HIV deep sequencing data, based on importance sampling of ancestral recombination graphs under a multi-locus coalescent model. The approach further extends recent progress in the approximation of so-called ‘conditional sampling distributions’, a quantity of key interest when approximating coalescent likelihoods. The chief novelties of our method are that it is able to infer rates of recombination and mutation, as well as the effective population size, while handling sampling over different timepoints and missing data without extra computational difficulty. We apply our method to a dataset of HIV-1, in which several hundred sequences were obtained from an infected individual at seven timepoints over two years. We find mutation rate and effective population size estimates to be comparable to those produced by the software BEAST. Additionally, our method is able to produce local recombination rate estimates. The software underlying our method, Coalescenator, is freely available.
Predictable patterns of CTL escape and reversion and the limited role of epistasis in HIV-1 evolution
Duncan S. Palmer, Emily Adland, John A. Frater, Philip J.R. Goulder, Kuan-Hsiang Gary Huang, Thumbi Ndung’u, Philippa C. Matthews, Rodney E. Phillips, Roger Shapiro, Gil McVean, Angela R. McLean
Subjects: Populations and Evolution (q-bio.PE)
The twin processes of viral evolutionary escape and reversion in response to host immune pressure, in particular the cytotoxic T-lymphocyte (CTL) response, help shape Human Immunodeficiency Virus-1 sequence evolution in infected host populations. The tempo of CTL escape and reversion is known to differ between CTL escape variants in a given host population. A wealth of epistatic effects (both intermediary sequence changes on the path to CTL escape and compensatory mutations which restore replicative capacity following viral escape) have been reported. Given the importance of epistatic effects in these processes, we ask: are rates of escape and reversion comparable across infected host populations? For three cohorts taken from three continents, we estimate escape and reversion rates at 23 escape sites in gag epitopes. Surprisingly, we find highly consistent escape rate estimates across the examined cohorts. Reversion rates are also consistent between a Canadian and a South African infected host population. We investigate the importance of epistasis further by examining in vitro replicative capacities of viral sequences with minimal variation: point escape mutants induced in a lab strain. Remarkably, despite the complexities of epistatic effects and the diversity of both hosts and viruses, CTL escape mutants which escape rapidly tend to be those with the highest replicative capacity when applied as a single point mutation. Similarly, mutants inducing the greatest costs to viral replicative capacity tend to revert more quickly. These data suggest that escape rates in gag are consistent across host populations and, in general, epistatic effects do not dramatically affect escape rates.
Phylodynamics of H5N1 Highly Pathogenic Avian Influenza in Europe, 2005–2010: Potential for Molecular Surveillance of New Outbreaks
Mohammad A. Alkhamis, Brian R. Moore, Andrés M. Perez
doi: http://dx.doi.org/10.1101/020339
Previous Bayesian phylogeographic studies of H5N1 highly pathogenic avian influenza viruses (HPAIVs) explored the origin and spread of the epidemic from China into Russia, indicating that HPAIV circulated in Russia prior to its detection there in 2005. In this study, we extend this research to explore the evolution and spread of HPAIV within Europe during the 2005–2010 epidemic, using all available sequences of the HA and NA gene regions that were collected in Europe and Russia during the outbreak. We use discrete-trait phylodynamic models within a Bayesian statistical framework to explore the evolution of HPAIV. Our results indicate that the genetic diversity and effective population size of HPAIV peaked between mid-2005 and early 2006, followed by a drastic decline in 2007, which coincides with the end of the epidemic in Europe. Our results also suggest that domestic birds were the most likely source of the spread of the virus from Russia into Europe. Additionally, estimates of viral dispersal routes indicate that Russia, Romania, and Germany were key epicenters of these outbreaks. Our study quantifies the dynamics of a major European HPAIV pandemic and substantiates the ability of phylodynamic models to improve molecular surveillance of novel AIVs.
A semi-supervised approach uncovers thousands of intragenic enhancers differentially activated in human cells
Juan Gonzalez-Vallinas, Amadís Pagès, Babita Singh, Eduardo Eyras
doi: http://dx.doi.org/10.1101/020362
Background. Transcriptional enhancers are generally known to regulate gene transcription from afar. Their activation involves a series of changes in chromatin marks and recruitment of protein factors. These enhancers may also occur inside genes, but how many may be active in human cells and their effects on the regulation of the host gene remain unclear. Results. We describe a novel semi-supervised method based on the relative enrichment of chromatin signals between two conditions to predict active enhancers. We applied this method to the tumoral K562 and the normal GM12878 cell lines to predict enhancers that are differentially active in one cell type. These predictions show enhancer-like properties according to positional distribution, correlation with gene expression and production of enhancer RNAs. Using this model, we predict 10,365 and 9,777 intragenic active enhancers in K562 and GM12878, respectively, and relate the differential activation of these enhancers to expression and splicing differences of the host genes. Conclusions. We propose that the activation or silencing of intragenic transcriptional enhancers modulates the regulation of the host gene by means of a local change of the chromatin and the recruitment of enhancer-related factors that may interact with the RNA directly or through the interaction with RNA binding proteins. Predicted enhancers are available at http://regulatorygenomics.upf.edu/Projects/enhancers.html
The weighting is the hardest part: on the behavior of the likelihood ratio test and score test under weight misspecification in rare variant association studies
Camelia Claudia Minica, Giulio Genovese, Dorret I. Boomsma, Christina M. Hultman, René Pool, Jacqueline M. Vink, Conor V. Dolan, Benjamin M. Neale
doi: http://dx.doi.org/10.1101/020198
Rare variant association studies are gaining importance in human genetic research with the increasing availability of exome/genome sequence data. One important test of association between a target set of rare variants (RVs) and a given phenotype is the sequence kernel association test (SKAT). Assignment of weights reflecting the hypothesized contribution of the RVs to the trait variance is embedded within any set-based test. Since the true weights are generally unknown, it is important to establish the effect of weight misspecification in SKAT. We used simulated and real data to characterize the behavior of the likelihood ratio test (LRT) and score test under weight misspecification. Results revealed that the LRT is generally more robust to weight misspecification, and more powerful than the score test in such a circumstance. For instance, when the rare variants within the target were simulated to have larger betas than the more common ones, incorrect assignment of equal weights reduced the power of the LRT by ~5% while the power of the score test dropped by ~30%. Furthermore, the LRT was more robust to the inclusion of weighted neutral variation in the test. To optimize weighting we proposed the use of a data-driven weighting scheme. With this approach and the LRT we detected significant enrichment of case mutations with MAF below 5% (P-value=7E-04) of a set of highly constrained genes in the Swedish schizophrenia case-control cohort of 4,940 individuals with observed exome-sequencing data. The score test is currently widely used in sequence kernel association studies for both its computational efficiency and power. Indeed, assuming correct specification, in some circumstances the score test is the most powerful test. However, our results showed that the LRT has the compelling qualities of being generally more robust and more powerful under weight misspecification. This is a paramount result, given that, arguably, misspecified models are likely to be the rule rather than the exception in the weighting-based approaches.
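To make the role of the weights concrete, here is a minimal sketch of the variance-component score statistic as it is commonly written for SKAT, Q = sum_j w_j * S_j^2 with S_j = sum_i G_ij * (y_i - mu_i). This is an illustration only, not the authors' code: the function name `skat_q`, the toy data, and the intercept-only null model are assumptions, and presentations differ on whether the weights enter as w_j or w_j squared. Computing a p-value additionally requires the null distribution of Q, which is omitted here.

```python
def skat_q(genotypes, phenotypes, weights):
    """Score-type statistic Q = sum_j w_j * S_j**2 (illustrative sketch).

    genotypes: list of rows, one per individual, of minor-allele counts (0/1/2).
    phenotypes: list of quantitative trait values, one per individual.
    weights: one non-negative weight per variant (hypothesized contribution).
    The null model here is intercept-only, so mu_i is the phenotype mean.
    """
    n = len(phenotypes)
    mu = sum(phenotypes) / n
    residuals = [y - mu for y in phenotypes]
    q = 0.0
    for j, w in enumerate(weights):
        # S_j: covariance-like score between variant j and the residuals
        s_j = sum(genotypes[i][j] * residuals[i] for i in range(n))
        q += w * s_j ** 2
    return q


# Toy data (hypothetical): 4 individuals, 2 variants
G = [[1, 0], [0, 1], [0, 0], [0, 0]]
y = [1.0, 0.0, 0.0, 0.0]

print(skat_q(G, y, weights=[1.0, 1.0]))  # equal weights
print(skat_q(G, y, weights=[2.0, 0.0]))  # all weight on the first variant
```

The same data yield different values of Q under different weightings, which is why a poorly chosen weighting scheme can change the power of the test.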
Sequence capture of ultraconserved elements from bird museum specimens
John McCormack, Whitney L.E. Tsai, Brant C Faircloth
doi: http://dx.doi.org/10.1101/020271
New DNA sequencing technologies are allowing researchers to explore the genomes of the millions of natural history specimens collected prior to the molecular era. Yet, we know little about how well specific next-generation sequencing (NGS) techniques work with the degraded DNA typically extracted from museum specimens. Here, we use one type of NGS approach, sequence capture of ultraconserved elements (UCEs), to collect data from bird museum specimens as old as 120 years. We targeted approximately 5,000 UCE loci in 27 Western Scrub-Jays (Aphelocoma californica) representing three evolutionary lineages, and we collected an average of 3,749 UCE loci containing 4,460 single nucleotide polymorphisms (SNPs). Despite older specimens producing fewer and shorter loci in general, we collected thousands of markers from even the oldest specimens. More sequencing reads per individual helped to boost the number of UCE loci we recovered from older specimens, but more sequencing was not as successful at increasing the length of loci. We detected contamination in some samples and determined contamination was more prevalent in older samples that were subject to less sequencing. For the phylogeny generated from concatenated UCE loci, contamination led to incorrect placement of some individuals. In contrast, a species tree constructed from SNPs called within UCE loci correctly placed individuals into three monophyletic groups, perhaps because of the stricter analytical procedures we used for SNP calling. This study and other recent studies on the genomics of museum specimens have profound implications for natural history collections, where millions of older specimens should now be considered genomic resources.
The next 20 years of genome research
Michael Schatz
doi: http://dx.doi.org/10.1101/020289
The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than one billion genomes, bringing with it even deeper insights into human genetics as well as the genetics of millions of other species on the planet. Realizing this great potential, though, will only be achieved through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements to biotechnology. In this perspective, we aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world.
Pangenome-wide and molecular evolution analyses of the Pseudomonas aeruginosa species
Jeanneth Mosquera-Rendón, Ana M. Rada-Bravo, Sonia Cárdenas-Brito, Mauricio Corredor, Eliana Restrepo-Pineda, Alfonso Benítez-Páez
doi: http://dx.doi.org/10.1101/020305
Background. Drug treatments and vaccine designs against the opportunistic human pathogen Pseudomonas aeruginosa have multiple issues, all associated with the diverse genetic traits present in this pathogen, ranging from multidrug resistance genes to the molecular machinery for the biosynthesis of biofilms. Several candidate vaccines against P. aeruginosa have been developed, which target the outer membrane proteins; however, major issues arise when attempting to establish complete protection against this pathogen, presumably owing to its genotypic variation at the strain level. To shed light on this concern, we proposed this study to assess the P. aeruginosa pangenome and its molecular evolution across multiple strains. Results. The P. aeruginosa pangenome was estimated to contain almost 17,000 non-redundant genes, and approximately 15% of these constituted the core genome. Functional analyses of the accessory genome indicated a wide presence of genetic elements directly associated with pathogenicity. An in-depth molecular evolution analysis revealed the full landscape of selection forces acting on the P. aeruginosa pangenome, in which purifying selection drives evolution in the genome of this human pathogen. We also detected distinctive positive selection in a wide variety of outer membrane proteins, with the data supporting the concept of substantial genetic variation in proteins probably recognized as antigens. Using the evolutionary information of genes under extreme positive selection, we designed a new Multi-Locus Sequence Typing assay for informative, rapid, and cost-effective genotyping of P. aeruginosa clinical isolates. Conclusions. We report the unprecedented pangenome characterization of P. aeruginosa on a large scale, which included almost 200 bacterial genomes from one single species and a molecular evolutionary analysis at the pangenome scale. Evolutionary information presented here provides a clear explanation of the issues associated with the use of protein conjugates from pili, flagella, or secretion systems as antigens for vaccine design, which exhibit high genetic variation in terms of non-synonymous substitutions in P. aeruginosa strains.
Pyvolve: a flexible Python module for simulating sequences along phylogenies
Stephanie J Spielman, Claus O Wilke
doi: http://dx.doi.org/10.1101/020214
We introduce Pyvolve, a flexible Python module for simulating genetic data along a phylogeny according to continuous-time Markov models of sequence evolution. Pyvolve incorporates most standard models of nucleotide, amino-acid, and codon sequence evolution, and it allows users to fully customize all model parameters. Pyvolve additionally allows users to specify custom evolutionary models and incorporates several novel features, including a novel rate matrix scaling algorithm and branch-length perturbations. Easily incorporated into Python bioinformatics pipelines, Pyvolve represents a convenient and flexible alternative to third-party simulation software. Pyvolve is an open-source project available, along with a detailed user-manual, under a FreeBSD license from https://github.com/sjspielman/pyvolve. API documentation is available from http://sjspielman.org/pyvolve.
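As a rough illustration of what simulating sequences along a phylogeny under a continuous-time Markov model involves, the sketch below evolves nucleotides down a small tree under the Jukes-Cantor (JC69) model using only the Python standard library. It does not use or reproduce Pyvolve's API; the tree encoding, the function names, and the three-taxon example tree are all assumptions made for this sketch.

```python
import math
import random

NUCS = "ACGT"


def evolve_site(base, t, rng):
    """Evolve one nucleotide along a branch of length t under JC69.

    t is measured in expected substitutions per site. Under JC69 the
    probability of ending in the starting state is 1/4 + 3/4 * exp(-4t/3),
    and the three other states are equally likely otherwise.
    """
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    if rng.random() < p_same:
        return base
    return rng.choice([b for b in NUCS if b != base])


def simulate(node, parent_seq, rng, out):
    """Recursively evolve parent_seq down the tree, recording tip sequences.

    node is a (name, branch_length, children) tuple, a hypothetical
    encoding chosen for this sketch.
    """
    name, t, children = node
    seq = "".join(evolve_site(b, t, rng) for b in parent_seq)
    if not children:
        out[name] = seq
    for child in children:
        simulate(child, seq, rng, out)
    return out


# Hypothetical tree, in Newick terms: ((A:0.1,B:0.1):0.05,C:0.2);
tree = ("root", 0.0, [
    ("AB", 0.05, [("A", 0.1, []), ("B", 0.1, [])]),
    ("C", 0.2, []),
])

rng = random.Random(42)
# Root sequence drawn from the JC69 stationary distribution (uniform)
root_seq = "".join(rng.choice(NUCS) for _ in range(60))
alignment = simulate(tree, root_seq, rng, {})
for name, seq in sorted(alignment.items()):
    print(name, seq)
```

More general models replace the closed-form JC69 probabilities with a matrix exponential of a scaled rate matrix, which is where choices such as rate matrix scaling come into play.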