Quantification of GC-biased gene conversion in the human genome

Sylvain Glemin, Peter F Arndt, Philipp W Messer, Dmitri Petrov, Nicolas Galtier, Laurent Duret
doi: http://dx.doi.org/10.1101/010173

Many lines of evidence indicate that GC-biased gene conversion (gBGC) has a major impact on the evolution of mammalian genomes. However, up to now, this process has not been properly quantified. In principle, the strength of gBGC can be measured from the analysis of derived allele frequency (DAF) spectra. However, this approach is sensitive to a number of confounding factors. In particular, we show by simulations that the inference is pervasively affected by polymorphism polarization errors, especially at hypermutable sites, and by spatial heterogeneity in gBGC strength. Here we propose a new method to quantify gBGC from DAF spectra, incorporating polarization errors and taking spatial heterogeneity into account. This method is very general in that it does not require any prior knowledge about the source of polarization errors, and it also provides information about mutation patterns. We apply this approach to human polymorphism data from the 1000 Genomes Project. We show that the strength of gBGC does not differ between hypermutable CpG sites and non-CpG sites, suggesting that in humans gBGC is not caused by the base-excision repair machinery. We further find that the impact of gBGC is concentrated primarily within recombination hotspots: genome-wide, the strength of gBGC is in the nearly neutral area, but 2% of the human genome is subject to strong gBGC, with population-scaled gBGC coefficients above 5. Given that the location of recombination hotspots evolves very rapidly, our analysis predicts that in the long term, a large fraction of the genome is affected by short episodes of strong gBGC.
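
To make the polarization problem concrete, here is a minimal sketch (not the authors' method) of how mis-inferred ancestral states distort a DAF spectrum. It assumes a single, uniform error rate, which is a simplification: the abstract notes that errors are concentrated at hypermutable sites. The function name `observed_daf` and the 1/i neutral spectrum used as input are illustrative choices.

```python
def observed_daf(true_spectrum, error_rate):
    """Mix a DAF spectrum with its mirror image.

    true_spectrum[i - 1] is the proportion of SNPs whose derived allele is
    carried by i of n sampled chromosomes (i = 1 .. n - 1). A mis-polarized
    SNP with true derived count i is recorded at count n - i instead.
    Assumption: one error rate shared by all SNPs.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must lie in [0, 1]")
    mirrored = true_spectrum[::-1]  # class i swaps places with class n - i
    return [(1.0 - error_rate) * t + error_rate * m
            for t, m in zip(true_spectrum, mirrored)]


# Illustrative input: the standard neutral expectation, proportional to 1/i.
n = 10
neutral = [1.0 / i for i in range(1, n)]
total = sum(neutral)
neutral = [x / total for x in neutral]

observed = observed_daf(neutral, 0.05)
```

Because low-frequency classes are much larger than high-frequency ones, even a small error rate moves a visible excess of SNPs into the high-frequency classes, which is the same direction of skew that a bias favoring the derived allele would produce. This is why the two effects need to be modeled jointly.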

STACEY: species delimitation and phylogeny estimation under the multispecies coalescent

Graham R Jones
doi: http://dx.doi.org/10.1101/010199

This article describes a new package called STACEY for BEAST2 which is capable of both species delimitation and species tree estimation using DNA sequences from multiple loci. The focus in this article is on species delimitation. STACEY is based on the multispecies coalescent model, and builds on earlier software (DISSECT), which uses a 'birth-death-collapse' prior to deal with delimitations without the need for reversible-jump Markov chain Monte Carlo moves. Like DISSECT, it requires no a priori assignment of individuals to species or populations, and no guide tree. This paper introduces two innovations. The first is a new model for the populations along the branches of the species tree, and the second is a new MCMC move for exploring the posterior when the multispecies coalescent model is assumed. The main benefit of STACEY over DISSECT is much better convergence. Current practice, using a pipeline approach to species delimitation under the multispecies coalescent, has been shown to have major problems on simulated data. The same simulated data set is used to demonstrate the accuracy and efficiency of STACEY.

RNA-Seq analysis and annotation of a draft blueberry genome assembly identifies candidate genes involved in fruit ripening, biosynthesis of bioactive compounds, and stage-specific alternative splicing

Vikas Gupta, April Dawn Estrada, Ivory Clabaugh Blakley, Rob Reid, Ketan Patel, Mason D. Meyer, Stig Uggerhoj Andersen, Allan F. Brown, Mary Ann Lila, Ann Loraine
doi: http://dx.doi.org/10.1101/010116

Background: Blueberries are a rich source of antioxidants and other beneficial compounds that can protect against disease. Identifying genes involved in synthesis of bioactive compounds could enable breeding berry varieties with enhanced health benefits. Results: Toward this end, we annotated a draft blueberry genome assembly using RNA-Seq data from five stages of berry fruit development and ripening. Genome-guided assembly of RNA-Seq read alignments combined with output from ab initio gene finders produced around 60,000 gene models, of which more than half were similar to proteins from other species, typically the grape Vitis vinifera. Comparison of gene models to the PlantCyc database of metabolic pathway enzymes identified candidate genes involved in synthesis of bioactive compounds, including bixin, an apocarotenoid with potential disease-fighting properties, and defense-related cyanogenic glycosides, which are toxic. Cyanogenic glycoside (CG) biosynthetic enzymes were highly expressed in green fruit, and a candidate CG detoxification enzyme was up-regulated during fruit ripening. Candidate genes for ethylene, anthocyanin, and 400 other biosynthetic pathways were also identified. RNA-Seq expression profiling showed that blueberry growth, maturation, and ripening involve dynamic gene expression changes, including coordinated up- and down-regulation of metabolic pathway enzymes, cell growth-related genes, and putative transcriptional regulators. Analysis of RNA-Seq alignments also identified developmentally regulated alternative splicing, promoter use, and 3′ end formation. Conclusions: We report genome sequence, gene models, functional annotations, and RNA-Seq expression data, which provide an important new resource enabling high-throughput studies in blueberry. RNA-Seq data are freely available for visualization in Integrated Genome Browser, and analysis code is available from the git repository at http://bitbucket.org/lorainelab/blueberrygenome.

Synonymous and Nonsynonymous Distances Help Untangle Convergent Evolution and Recombination

Peter B. Chi, Sujay Chattopadhyay, Philippe Lemey, Evgeni V. Sokurenko, Vladimir N. Minin
(Submitted on 6 Oct 2014)

When estimating a phylogeny from a multiple sequence alignment, researchers often assume the absence of recombination. However, if recombination is present, then tree estimation and all downstream analyses will be impacted, because different segments of the sequence alignment support different phylogenies. Similarly, convergent selective pressures at the molecular level can also lead to phylogenetic tree incongruence across the sequence alignment. Current methods for detection of phylogenetic incongruence are not equipped to distinguish between these two different mechanisms and assume that the incongruence is a result of recombination or other horizontal transfer of genetic information. We propose a new recombination detection method that can make this distinction, based on synonymous codon substitution distances. Although some power is lost by discarding the information contained in the nonsynonymous substitutions, our new method has lower false positive probabilities than the original Dss statistic when the phylogenetic incongruence signal is due to convergent evolution. We conclude with three empirical examples, where we analyze: 1) sequences from a transmission network of the human immunodeficiency virus, 2) tlpB gene sequences from a geographically diverse set of 38 Helicobacter pylori strains, and 3) Hepatitis C virus sequences sampled longitudinally from one patient.

Fitting the Balding-Nichols model to forensic databases

Rori Rohlfs, Vitor R.C. Aguiar, Kirk E. Lohmueller, Amanda M. Castro, Alessandro C.S. Ferreira, Vanessa C.O. Almeida, Iuri D. Louro, Rasmus Nielsen
doi: http://dx.doi.org/10.1101/009969

Large forensic databases provide an opportunity to compare observed empirical rates of genotype matching with those expected under forensic genetic models. A number of researchers have taken advantage of this opportunity to validate some forensic genetic approaches, particularly to ensure that estimated rates of genotype matching between unrelated individuals are indeed slight overestimates of those observed. However, these studies have also revealed systematic error trends in genotype probability estimates. In this analysis, we investigate these error trends and show how they result from inappropriate implementation of the Balding-Nichols model in the context of database-wide matching. Specifically, we show that in addition to accounting for increased allelic matching between individuals with recent shared ancestry, studies must account for relatively decreased allelic matching between individuals with more ancient shared ancestry.
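
For readers unfamiliar with the model, the following sketch shows the standard theta-corrected genotype match probabilities usually associated with the Balding-Nichols model. It illustrates only the conventional formulas, in which theta raises match probabilities to reflect recent shared ancestry; it does not implement the additional correction for more ancient shared ancestry that the abstract argues for. The function names and the example values are illustrative assumptions.

```python
def homozygote_match(p, theta):
    """Probability that a second individual is homozygous for an allele of
    population frequency p, given that one such homozygote was observed."""
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    return (2.0 * theta + (1.0 - theta) * p) * (3.0 * theta + (1.0 - theta) * p) / denom


def heterozygote_match(p_a, p_b, theta):
    """Same conditional probability for a heterozygous genotype whose
    alleles have population frequencies p_a and p_b."""
    denom = (1.0 + theta) * (1.0 + 2.0 * theta)
    return 2.0 * (theta + (1.0 - theta) * p_a) * (theta + (1.0 - theta) * p_b) / denom


# With theta = 0 the formulas reduce to Hardy-Weinberg proportions.
hw_hom = homozygote_match(0.1, 0.0)         # p squared
hw_het = heterozygote_match(0.1, 0.2, 0.0)  # 2 * p_a * p_b

# A small positive theta (an illustrative value) raises both probabilities
# when the alleles involved are rare.
corrected_hom = homozygote_match(0.1, 0.01)
corrected_het = heterozygote_match(0.1, 0.2, 0.01)
```

The upward shift for rare alleles is what makes these estimates conservative for comparisons between a pair of individuals. The abstract's point is that applying the same adjustment uniformly across all pairs in a database produces systematic errors.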

Leveraging ancestry to improve causal variant identification in exome sequencing for monogenic disorders

Robert P Brown, Hane Lee, Ascia Eskin, Gleb Kichaev, Kirk E Lohmueller, Bruno Reversade, Stanley F Nelson, Bogdan Pasaniuc
doi: http://dx.doi.org/10.1101/010017

Recent breakthroughs in exome sequencing technology have made possible the identification of many causal variants of monogenic disorders. Although extremely powerful when closely related individuals (e.g. child and parents) are simultaneously sequenced, exome sequencing of individual-only cases is often unsuccessful due to the large number of variants that need to be followed up for functional validation. Many approaches remove from consideration common variants above a given frequency threshold (e.g. 1%), and then prioritize the remaining variants according to their allele frequency, functional, structural and conservation properties. In this work, we present methods that leverage the genetic structure of different populations while accounting for the finite sample size of the reference panels to improve the variant filtering step. Using simulations and real exome data from individuals with monogenic disorders, we show that our methods significantly reduce the number of variants to be followed up (e.g. a 36% reduction from an average 418 variants per exome when ancestry is ignored to 267 when ancestry is taken into account for case-only sequenced individuals). Most importantly, our proposed approaches are well calibrated with respect to the probability of filtering out a true causal variant (i.e. false negative rate, FNR), whereas existing approaches are susceptible to high FNR when reference panel sizes are limited.
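
The abstract does not spell out how finite panel size enters the filter, so the sketch below swaps in a plain one-sided binomial test purely to illustrate why it matters. It is not the authors' method and ignores ancestry entirely. The function names, the 1% threshold default, and the 5% significance level are all assumptions made for this example.

```python
from math import exp, lgamma, log


def binom_pmf(j, n, f):
    """P(X = j) for X ~ Binomial(n, f), computed in log space; needs 0 < f < 1."""
    log_p = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
             + j * log(f) + (n - j) * log(1.0 - f))
    return exp(log_p)


def binom_upper_tail(k, n, f):
    """P(X >= k) for X ~ Binomial(n, f)."""
    return sum(binom_pmf(j, n, f) for j in range(k, n + 1))


def naive_discard(k, n, threshold=0.01):
    """Discard a variant when its observed panel frequency exceeds the threshold."""
    return k / n > threshold


def panel_aware_discard(k, n, threshold=0.01, alpha=0.05):
    """Discard only when seeing k or more copies among n chromosomes would be
    surprising if the true frequency were exactly at the threshold."""
    return binom_upper_tail(k, n, threshold) < alpha


# Two copies in a panel of 100 chromosomes: observed frequency 2%.
small_naive = naive_discard(2, 100)
small_aware = panel_aware_discard(2, 100)

# Thirty copies in a panel of 1000 chromosomes: observed frequency 3%.
large_aware = panel_aware_discard(30, 1000)
```

In the small panel the naive rule discards the variant, while the binomial rule keeps it, because two copies in 100 chromosomes is entirely compatible with a true frequency of 1% or lower. In the larger panel the excess over 1% is unambiguous and the variant is discarded. Treating a noisy frequency estimate as exact is one way a true causal variant can be filtered out when panels are small.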

Inference of evolutionary forces acting on human biological pathways

Josephine T Daub, Isabelle Dupanloup, Marc Robinson-Rechavi, Laurent Excoffier
doi: http://dx.doi.org/10.1101/009928

Because natural selection is likely to act on multiple genes underlying a given phenotypic trait, we study here the potential effect of ongoing and past selection on the genetic diversity of human biological pathways. We first show that genes included in gene sets are generally under stronger selective constraints than other genes and that their evolutionary response is correlated. We then introduce a new procedure to detect selection at the pathway level based on a decomposition of the classical McDonald-Kreitman test extended to multiple genes. This new test, called 2DNS, detects outlier gene sets and takes into account past demographic effects as well as evolutionary constraints specific to gene sets. Selective forces acting on gene sets can be easily identified by a mere visual inspection of the position of the gene sets relative to their 2D null distribution. We thus find several outlier gene sets that show signals of positive, balancing, or purifying selection, but also others showing an ancient relaxation of selective constraints. The principle of the 2DNS test can also be applied to other genomic contrasts. For instance, the comparison of patterns of polymorphisms private to African and non-African populations reveals that most pathways show a higher proportion of non-synonymous mutations in non-Africans than in Africans, potentially due to different demographic histories and selective pressures.
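
As background, the classical single-gene McDonald-Kreitman test compares the ratio of nonsynonymous to synonymous changes among polymorphisms with the same ratio among fixed differences. The sketch below computes two common summaries of that 2x2 table, the neutrality index and the derived quantity alpha. It does not implement the 2DNS decomposition described in the abstract, and the counts are hypothetical values chosen for illustration.

```python
def neutrality_index(pn, ps, dn, ds):
    """(Pn / Ps) / (Dn / Ds) for counts of nonsynonymous (n) and synonymous (s)
    polymorphisms (P) and fixed differences (D)."""
    if ps == 0 or dn == 0 or ds == 0:
        raise ValueError("ps, dn and ds must be non-zero")
    return (pn / ps) / (dn / ds)


def alpha(pn, ps, dn, ds):
    """1 - neutrality index, often read as the proportion of nonsynonymous
    fixed differences in excess of the neutral expectation."""
    return 1.0 - neutrality_index(pn, ps, dn, ds)


# Hypothetical counts for a single gene.
pn, ps, dn, ds = 10, 40, 20, 40

ni = neutrality_index(pn, ps, dn, ds)
a = alpha(pn, ps, dn, ds)
```

A neutrality index below 1 indicates an excess of nonsynonymous fixed differences relative to polymorphism, which is commonly interpreted as a signature of positive selection. An index above 1 indicates an excess of nonsynonymous polymorphism, consistent with weakly deleterious variants that segregate but rarely fix.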