bModelTest: Bayesian site model selection for nucleotide data

Remco Bouckaert
doi: http://dx.doi.org/10.1101/020792

bModelTest allows for a Bayesian approach to inferring a site model for phylogenetic analysis. It is based on transdimensional MCMC proposals that allow switching between substitution models, between using and not using gamma rate heterogeneity, and between including and excluding a proportion of invariant sites. The model can be used with the set of reversible models on nucleotides, but we also introduce other sets of substitution models, and show how to use these sets of models. With the method, the site model can be inferred during the MCMC analysis and does not need to be pre-determined, as is now often the case in practice, by likelihood-based methods.

Excess False Positive Rates in Methods for Differential Gene Expression Analysis using RNA-Seq Data

David M Rocke, Luyao Ruan, Yilun Zhang, J. Jared Gossett, Blythe Durbin-Johnson, Sharon Aviran
doi: http://dx.doi.org/10.1101/020784

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10^-4, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
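
The arithmetic behind the "about 1 gene" expectation can be sketched in a few lines of Python. This is only an illustration, not the authors' resampling procedure: it assumes a perfectly calibrated test, so that null p-values are uniform on [0, 1], and the function names and the seed are made up for the example.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha


def count_null_rejections(n_tests, alpha, seed=0):
    """Draw uniform p-values (assumption: a calibrated test) and count p < alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


expected = expected_false_positives(10_000, 1e-4)
observed = count_null_rejections(10_000, 1e-4)
print(f"expected false positives: {expected:.1f}")
print(f"rejections in one simulated null data set: {observed}")
```

A method whose count of rejections on all-null data is far above the expected value is, in the sense of the abstract, producing excess false positives.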

Independent molecular basis of convergent highland adaptation in maize

Shohei Takuno, Peter Ralph, Kelly Swarts, Rob J Elshire, Jeffrey C Glaubitz, Edward S. Buckler, Matthew B Hufford, Jeffrey Ross-Ibarra
doi: http://dx.doi.org/10.1101/013607

Convergent evolution is the independent evolution of similar traits in different species or lineages of the same species; this often is a result of adaptation to similar environments, a process referred to as convergent adaptation. We investigate here the molecular basis of convergent adaptation in maize to highland climates in Mesoamerica and South America using genome-wide SNP data. Taking advantage of archaeological data on the arrival of maize to the highlands, we infer demographic models for both populations, identifying evidence of a strong bottleneck and rapid expansion in South America. We use these models to then identify loci showing an excess of differentiation as a means of identifying putative targets of natural selection, and compare our results to expectations from recently developed theory on convergent adaptation. Consistent with predictions across a wide parameter space, we see limited evidence for convergent evolution at the nucleotide level in spite of strong similarities in overall phenotypes. Instead, we show that selection appears to have predominantly acted on standing genetic variation, and that introgression from wild teosinte populations appears to have played a role in highland adaptation in Mexican maize.

Resolving microsatellite genotype ambiguity in populations of allopolyploid and diploidized autopolyploid organisms using negative correlations between alleles

Lindsay V Clark, Andrea Drauch Schreier
doi: http://dx.doi.org/10.1101/020610

A major limitation in the analysis of genetic marker data from polyploid organisms is non-Mendelian segregation, particularly when a single marker yields allelic signals from multiple, independently segregating loci (isoloci). However, with markers such as microsatellites that detect more than two alleles, it is sometimes possible to deduce which alleles belong to which isoloci. Here we describe a novel mathematical property of codominant marker data when it is recoded as binary (presence/absence) allelic variables: under random mating in an infinite population, two allelic variables will be negatively correlated if they belong to the same locus, but uncorrelated if they belong to different loci. We present an algorithm to take advantage of this mathematical property, sorting alleles into isoloci based on correlations, then refining the allele assignments after checking for consistency with individual genotypes. We demonstrate the utility of our method on simulated data, as well as a real microsatellite dataset from a natural population of octoploid white sturgeon (Acipenser transmontanus). Our methodology is implemented in the R package polysat version 1.4.
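
The correlation property described above can be checked with a small simulation. The sketch below is not the polysat algorithm. It assumes a hypothetical marker that amplifies two independently segregating diploid isoloci, with made-up allele names and frequencies, and it uses a large finite sample as a stand-in for the infinite population.

```python
import random
from math import sqrt


def sample_isolocus(rng, alleles, freqs):
    """Draw two gene copies at one diploid isolocus under random mating."""
    return set(rng.choices(alleles, weights=freqs, k=2))


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def simulate_presence(n_individuals=20_000, seed=1):
    """Return binary presence/absence vectors for each allele of the marker."""
    rng = random.Random(seed)
    locus1 = (["A", "B", "C"], [0.5, 0.3, 0.2])  # hypothetical isolocus 1
    locus2 = (["D", "E"], [0.6, 0.4])            # hypothetical isolocus 2
    presence = {a: [] for a in locus1[0] + locus2[0]}
    for _ in range(n_individuals):
        # The marker reports the union of alleles from both isoloci.
        observed = sample_isolocus(rng, *locus1) | sample_isolocus(rng, *locus2)
        for allele, column in presence.items():
            column.append(1 if allele in observed else 0)
    return presence


presence = simulate_presence()
r_same = pearson(presence["A"], presence["B"])  # alleles at the same isolocus
r_diff = pearson(presence["A"], presence["D"])  # alleles at different isoloci
print(f"same isolocus (A, B): r = {r_same:.3f}")
print(f"different isoloci (A, D): r = {r_diff:.3f}")
```

With these frequencies the within-locus correlation is clearly negative, while the between-locus correlation stays near zero. That contrast is the signal the abstract describes using to sort alleles into isoloci.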

Are Genetic Interactions Influencing Gene Expression Evidence for Biological Epistasis or Statistical Artifacts?

Alexandra Fish, John A. Capra, William S Bush
doi: http://dx.doi.org/10.1101/020479

Interactions between genetic variants, also called epistasis, are pervasive in model organisms; however, their importance in humans remains unclear because statistical interactions in observational studies can be explained by processes other than biological epistasis. Using statistical modeling, we identified 1,093 interactions between pairs of cis-regulatory variants impacting gene expression in lymphoblastoid cell lines. Factors known to confound these analyses (ceiling/floor effects, population stratification, haplotype effects, or single variants tagged through linkage disequilibrium) explained most of these interactions. However, we found 15 interactions robust to these explanations, and we further show that despite potential confounding, interacting variants were enriched in numerous regulatory regions suggesting potential biological importance. While genetic interactions may not be the true underlying mechanism of all our statistical models, our analyses discover new signals undetected in standard single-marker analyses. Ultimately, we identified new complex genetic architectures regulating 23 genes, suggesting that single-variant analyses may miss important modifiers.

Phylodynamics of H5N1 Highly Pathogenic Avian Influenza in Europe, 2005–2010: Potential for Molecular Surveillance of New Outbreaks

Mohammad A. Alkhamis, Brian R. Moore, Andrés M. Perez
doi: http://dx.doi.org/10.1101/020339

Previous Bayesian phylogeographic studies of H5N1 highly pathogenic avian influenza viruses (HPAIVs) explored the origin and spread of the epidemic from China into Russia, indicating that HPAIV circulated in Russia prior to its detection there in 2005. In this study, we extend this research to explore the evolution and spread of HPAIV within Europe during the 2005–2010 epidemic, using all available sequences of the HA and NA gene regions that were collected in Europe and Russia during the outbreak. We use discrete-trait phylodynamic models within a Bayesian statistical framework to explore the evolution of HPAIV. Our results indicate that the genetic diversity and effective population size of HPAIV peaked between mid-2005 and early 2006, followed by a drastic decline in 2007, which coincides with the end of the epidemic in Europe. Our results also suggest that domestic birds were the most likely source of the spread of the virus from Russia into Europe. Additionally, estimates of viral dispersal routes indicate that Russia, Romania, and Germany were key epicenters of these outbreaks. Our study quantifies the dynamics of a major European HPAIV pandemic and substantiates the ability of phylodynamic models to improve molecular surveillance of novel AIVs.

The weighting is the hardest part: on the behavior of the likelihood ratio test and score test under weight misspecification in rare variant association studies

Camelia Claudia Minica, Giulio Genovese, Dorret I. Boomsma, Christina M. Hultman, René Pool, Jacqueline M. Vink, Conor V. Dolan, Benjamin M. Neale
doi: http://dx.doi.org/10.1101/020198

Rare variant association studies are gaining importance in human genetic research with the increasing availability of exome/genome sequence data. One important test of association between a target set of rare variants (RVs) and a given phenotype is the sequence kernel association test (SKAT). Assignment of weights reflecting the hypothesized contribution of the RVs to the trait variance is embedded within any set-based test. Since the true weights are generally unknown, it is important to establish the effect of weight misspecification in SKAT. We used simulated and real data to characterize the behavior of the likelihood ratio test (LRT) and score test under weight misspecification. Results revealed that the LRT is generally more robust to weight misspecification, and more powerful than the score test in such a circumstance. For instance, when the rare variants within the target were simulated to have larger betas than the more common ones, incorrect assignment of equal weights reduced the power of the LRT by ~5% while the power of the score test dropped by ~30%. Furthermore, the LRT was more robust to the inclusion of weighted neutral variation in the test. To optimize weighting we proposed the use of a data-driven weighting scheme. With this approach and the LRT we detected significant enrichment of case mutations with MAF below 5% (P-value=7E-04) of a set of highly constrained genes in the Swedish schizophrenia case-control cohort of 4,940 individuals with observed exome-sequencing data. The score test is currently widely used in sequence kernel association studies for both its computational efficiency and power. Indeed, assuming correct specification, in some circumstances the score test is the most powerful test. However, our results showed that the LRT has the compelling qualities of being generally more robust and more powerful under weight misspecification. This is a paramount result, given that, arguably, misspecified models are likely to be the rule rather than the exception in the weighting-based approaches.

Sequence capture of ultraconserved elements from bird museum specimens

John McCormack, Whitney L.E. Tsai, Brant C Faircloth
doi: http://dx.doi.org/10.1101/020271

New DNA sequencing technologies are allowing researchers to explore the genomes of the millions of natural history specimens collected prior to the molecular era. Yet, we know little about how well specific next-generation sequencing (NGS) techniques work with the degraded DNA typically extracted from museum specimens. Here, we use one type of NGS approach, sequence capture of ultraconserved elements (UCEs), to collect data from bird museum specimens as old as 120 years. We targeted approximately 5,000 UCE loci in 27 Western Scrub-Jays (Aphelocoma californica) representing three evolutionary lineages, and we collected an average of 3,749 UCE loci containing 4,460 single nucleotide polymorphisms (SNPs). Despite older specimens producing fewer and shorter loci in general, we collected thousands of markers from even the oldest specimens. More sequencing reads per individual helped to boost the number of UCE loci we recovered from older specimens, but more sequencing was not as successful at increasing the length of loci. We detected contamination in some samples and determined contamination was more prevalent in older samples that were subject to less sequencing. For the phylogeny generated from concatenated UCE loci, contamination led to incorrect placement of some individuals. In contrast, a species tree constructed from SNPs called within UCE loci correctly placed individuals into three monophyletic groups, perhaps because of the stricter analytical procedures we used for SNP calling. This study and other recent studies on the genomics of museum specimens have profound implications for natural history collections, where millions of older specimens should now be considered genomic resources.

The next 20 years of genome research

Michael Schatz
doi: http://dx.doi.org/10.1101/020289

The last 20 years have been a remarkable era for biology and medicine. One of the most significant achievements has been the sequencing of the first human genomes, which has laid the foundation for profound insights into human genetics, the intricacies of regulation and development, and the forces of evolution. Incredibly, as we look into the future over the next 20 years, we see the very real potential for sequencing more than one billion genomes, bringing with it even deeper insights into human genetics as well as the genetics of millions of other species on the planet. Realizing this great potential, though, will only be achieved through the integration and development of highly scalable computational and quantitative approaches that can keep pace with the rapid improvements to biotechnology. In this perspective, we aim to chart out these future technologies, anticipate the major themes of research, and call out the challenges ahead. One of the largest shifts will be in the training used to prepare the class of 2035 for their highly interdisciplinary world.

Pangenome-wide and molecular evolution analyses of the Pseudomonas aeruginosa species

Jeanneth Mosquera-Rendón, Ana M. Rada-Bravo, Sonia Cárdenas-Brito, Mauricio Corredor, Eliana Restrepo-Pineda, Alfonso Benítez-Páez
doi: http://dx.doi.org/10.1101/020305

Background. Drug treatments and vaccine designs against the opportunistic human pathogen Pseudomonas aeruginosa have multiple issues, all associated with the diverse genetic traits present in this pathogen, ranging from multi-drug resistance genes to the molecular machinery for the biosynthesis of biofilms. Several candidate vaccines against P. aeruginosa have been developed, which target the outer membrane proteins; however, major issues arise when attempting to establish complete protection against this pathogen due to its presumed genotypic variation at the strain level. To shed light on this concern, we proposed this study to assess the P. aeruginosa pangenome and its molecular evolution across multiple strains. Results. The P. aeruginosa pangenome was estimated to contain almost 17,000 non-redundant genes, and approximately 15% of these constituted the core genome. Functional analyses of the accessory genome indicated a wide presence of genetic elements directly associated with pathogenicity. An in-depth molecular evolution analysis revealed the full landscape of selection forces acting on the P. aeruginosa pangenome, in which purifying selection drives evolution in the genome of this human pathogen. We also detected distinctive positive selection in a wide variety of outer membrane proteins, with the data supporting the concept of substantial genetic variation in proteins probably recognized as antigens. Approaching the evolutionary information of genes under extremely positive selection, we designed a new Multi-Locus Sequencing Typing assay for an informative, rapid, and cost-effective genotyping of P. aeruginosa clinical isolates. Conclusions. We report the unprecedented pangenome characterization of P. aeruginosa on a large scale, which included almost 200 bacterial genomes from one single species and a molecular evolutionary analysis at the pangenome scale. Evolutionary information presented here provides a clear explanation of the issues associated with the use of protein conjugates from pili, flagella, or secretion systems as antigens for vaccine design, which exhibit high genetic variation in terms of non-synonymous substitutions in P. aeruginosa strains.