Evaluating intra- and inter-individual variation in the human placental transcriptome
David A Hughes, Martin Kircher, Zhisong He, Song Guo, Genevieve L Fairbrother, Carlos S Moreno, Philipp Khaitovich, Mark Stoneking
doi: http://dx.doi.org/10.1101/012468

Background: Gene expression variation is a phenotypic trait of particular interest as it represents the initial link between genotype and other phenotypes. Analyzing how such variation apportions among and within groups allows for the evaluation of how genetic and environmental factors influence such traits. It also provides opportunities to identify genes and pathways that may have been influenced by non-neutral processes. Here we use a population genetics framework and next generation sequencing to evaluate how gene expression variation is apportioned among four human groups in a natural biological tissue, the placenta. Results: We estimate that on average, 33.2%, 58.9% and 7.8% of the placental transcriptome is explained by variation within individuals, among individuals and among human groups, respectively. Additionally, when technical and biological traits are included in models of gene expression they account for roughly 2% of total gene expression variation. Notably, the variation that is significantly different among groups is enriched in biological pathways associated with immune response, cell signaling and metabolism. Many biological traits demonstrated correlated changes in expression in numerous pathways of potential interest to clinicians and evolutionary biologists. Finally, we estimate that the majority of the human placental transcriptome (65% of expressed genes) exhibits expression profiles consistent with neutrality; the remainder are consistent with stabilizing selection (26%), directional selection (4.9%), or diversifying selection (4.8%). Conclusion: We apportion placental gene expression variation into individual, population and biological trait factors and identify how each influences the transcriptome. Additionally, we advance methods to associate expression profiles with different forms of selection.
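The apportionment of variation into within-individual, among-individual and among-group components can be illustrated with a nested sum-of-squares decomposition for a single gene. This is only a sketch of the general idea, not the authors' method: the function name, the data layout and the toy expression values are hypothetical, and the fractions are raw sum-of-squares shares rather than variance-component estimates.

```python
def apportion(data):
    """Split the total sum of squares for one gene into three nested parts.

    data: {group: {individual: [replicate expression values]}}
    (hypothetical layout, chosen only for this illustration)
    """
    all_vals = [x for inds in data.values() for reps in inds.values() for x in reps]
    grand_mean = sum(all_vals) / len(all_vals)

    ss_group = ss_indiv = ss_within = 0.0
    for individuals in data.values():
        group_vals = [x for reps in individuals.values() for x in reps]
        group_mean = sum(group_vals) / len(group_vals)
        # Among groups: group mean versus grand mean, weighted by group size.
        ss_group += len(group_vals) * (group_mean - grand_mean) ** 2
        for reps in individuals.values():
            indiv_mean = sum(reps) / len(reps)
            # Among individuals: individual mean versus its group mean.
            ss_indiv += len(reps) * (indiv_mean - group_mean) ** 2
            # Within individuals: replicates versus their individual mean.
            ss_within += sum((x - indiv_mean) ** 2 for x in reps)

    total = ss_group + ss_indiv + ss_within
    return {
        "within_individuals": ss_within / total,
        "among_individuals": ss_indiv / total,
        "among_groups": ss_group / total,
    }


# Made-up expression values for two groups, two individuals each.
toy = {
    "group1": {"ind1": [5.0, 5.2], "ind2": [6.1, 5.9]},
    "group2": {"ind3": [7.0, 7.4], "ind4": [6.6, 6.4]},
}
print(apportion(toy))
```

The three fractions always sum to one because the nested sums of squares add up to the total sum of squares around the grand mean.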

The rate and molecular spectrum of spontaneous mutations in the GC-rich multi-chromosome genome of Burkholderia cenocepacia
Marcus M Dillon, Way Sung, Michael Lynch, Vaughn S Cooper
doi: http://dx.doi.org/10.1101/011841

Spontaneous mutations are ultimately essential for evolutionary change and are also the root cause of nearly all disease. However, until recently, both biological and technical barriers have prevented detailed analyses of mutation profiles, constraining our understanding of the mutation process to a few model organisms and leaving major gaps in our understanding of the role of genome content and structure on mutation. Here, we present a genome-wide view of the molecular mutation spectrum in Burkholderia cenocepacia, a clinically relevant pathogen with high %GC content and multiple chromosomes. We find that B. cenocepacia has low genome-wide mutation rates with insertion-deletion mutations biased towards deletions, consistent with the idea that deletion pressure reduces prokaryotic genome sizes. Unlike previously assayed organisms, B. cenocepacia exhibits a GC-mutation bias, which suggests that at least some genomes with high GC content may be driven to this point by unusual base-substitution mutation pressure. Notably, we also observed variation in both the rates and spectra of mutations among chromosomes, and a significant elevation of G:C>T:A transversions in late-replicating regions. Thus, although some patterns of mutation appear to be highly conserved across cellular life, others vary between species and even between chromosomes of the same species, potentially influencing the evolution of nucleotide composition and genome architecture.

A new $F_{\text{ST}}$-based method to uncover local adaptation using environmental variables
Pierre de Villemereuil, Oscar E. Gaggiotti
Comments: 18 pages, 5 figures, Supplementary Information at the end of the document
Subjects: Populations and Evolution (q-bio.PE)

Genome-scan methods are used for screening genome-wide patterns of DNA polymorphism to detect signatures of positive selection. There are two main types of methods: (i) “outlier” detection methods based on $F_{\text{ST}}$ that detect loci with high differentiation compared to the rest of the genome and (ii) environmental association methods that test the association between allele frequencies and environmental variables. In this article, we present a new $F_{\text{ST}}$-based genome scan method, BayeScEnv, which incorporates environmental information in the form of “environmental differentiation”. It is based on the F model but as opposed to existing approaches it considers two locus-specific effects, one due to divergent selection and another due to other processes such as differences in mutation rates across loci or background selection. Simulation studies showed that our method has a much lower false positive rate than an existing $F_{\text{ST}}$-based method, BayeScan, under a wide range of demographic scenarios. Although it had lower power, it leads to a better compromise between power and false positive rate. We apply our method to human and salmon datasets and show that it can be used successfully to study local adaptation. The method was developed in C++ and is available at this http URL

Visualizing spatial population structure with estimated effective migration surfaces
Desislava Petkova, John Novembre, Matthew Stephens
doi: http://dx.doi.org/10.1101/011809

Genetic data often exhibit patterns that are broadly consistent with “isolation by distance” – a phenomenon where genetic similarity tends to decay with geographic distance. In a heterogeneous habitat, decay may occur more quickly in some regions than others: for example, barriers to gene flow can accelerate the genetic differentiation between groups located close in space. We use the concept of “effective migration” to model the relationship between genetics and geography: in this paradigm, effective migration is low in regions where genetic similarity decays quickly. We present a method to quantify and visualize variation in effective migration across the habitat, which can be used to identify potential barriers to gene flow, from geographically indexed large-scale genetic data. Our approach uses a population genetic model to relate underlying migration rates to expected pairwise genetic dissimilarities, and estimates migration rates by matching these expectations to the observed dissimilarities. We illustrate the potential and limitations of our method using simulations and data from elephant, human, and Arabidopsis thaliana populations. The resulting visualizations highlight important features of the spatial population structure that are difficult to discern using existing methods for summarizing genetic variation such as principal components analysis.

Reveel: large-scale population genotyping using low-coverage sequencing data
Lin Huang, Bo Wang, Ruitang Chen, Sivan Bercovici, Serafim Batzoglou
doi: http://dx.doi.org/10.1101/011882

Population low-coverage whole-genome sequencing is rapidly emerging as a prominent approach for discovering genomic variation and genotyping a cohort. This approach combines substantially lower cost than full-coverage sequencing with whole-genome discovery of low-allele-frequency variants, to an extent that is not possible with array genotyping or exome sequencing. However, a challenging computational problem arises when attempting to discover variants and genotype the entire cohort. Variant discovery and genotyping are relatively straightforward on a single individual that has been sequenced at high coverage, because the inference decomposes into the independent genotyping of each genomic position for which a sufficient number of confidently mapped reads are available. However, in cases where low-coverage population data are given, the joint inference requires leveraging the complex linkage disequilibrium patterns in the cohort to compensate for sparse and missing data in each individual. The potentially massive computation time for such inference, as well as the missing data that confound low-frequency allele discovery, need to be overcome for this approach to become practical. Here, we present Reveel, a novel method for single nucleotide variant calling and genotyping of large cohorts that have been sequenced at low coverage. Reveel introduces a novel technique for leveraging linkage disequilibrium that deviates from previous Markov-based models. We evaluate Reveel’s performance through extensive simulations as well as real data from the 1000 Genomes Project, and show that it achieves higher accuracy in low-frequency allele discovery and substantially lower computation cost than previous state-of-the-art methods.

Bet-hedging, seasons and the evolution of behavioral diversity in Drosophila
Jamey Kain, Sarah Zhang, Mason Klein, Aravinthan Samuel, Benjamin de Bivort
doi: http://dx.doi.org/10.1101/012021

Organisms use various strategies to cope with fluctuating environmental conditions. In diversified bet-hedging, a single genotype exhibits phenotypic heterogeneity with the expectation that some individuals will survive transient selective pressures. To date, empirical evidence for bet-hedging is scarce. Here, we observe that individual Drosophila melanogaster flies exhibit striking variation in light- and temperature-preference behaviors. With a modeling approach that combines real world weather and climate data to simulate temperature preference-dependent survival and reproduction, we find that a bet-hedging strategy may underlie the observed inter-individual behavioral diversity. Specifically, bet-hedging outcompetes strategies in which individual thermal preferences are heritable. Animals employing bet-hedging refrain from adapting to the coolness of spring with increased warm-seeking that inevitably becomes counterproductive in the hot summer. This strategy is particularly valuable when mean seasonal temperatures are typical, or when there is considerable fluctuation in temperature within the season. The model predicts, and we experimentally verify, that the behaviors of individual flies are not heritable. Finally, we model the effects of historical weather data, climate change, and geographic seasonal variation on the optimal strategies underlying behavioral variation between individuals, characterizing the regimes in which bet-hedging is advantageous.
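The logic of diversified bet-hedging can be illustrated with a toy calculation of long-run growth as the geometric mean of per-generation fitness. This is not the weather-driven model described in the abstract: the two phenotypes, the fitness values and the strictly alternating seasons are hypothetical choices made only to show why a genotype that produces a fixed mix of phenotypes can outgrow a specialist when the environment fluctuates.

```python
import math

# Hypothetical per-generation fitness of each phenotype in each season.
FITNESS = {
    "cool-adapted": {"cool": 2.0, "hot": 0.5},
    "hot-adapted": {"cool": 0.5, "hot": 2.0},
}


def hedger_fitness(season, p_cool_adapted=0.5):
    """Fitness of a genotype that always produces a fixed phenotype mix."""
    return (p_cool_adapted * FITNESS["cool-adapted"][season]
            + (1.0 - p_cool_adapted) * FITNESS["hot-adapted"][season])


def long_run_growth(per_generation_fitness):
    """Geometric mean of a sequence of per-generation fitness values."""
    logs = [math.log(w) for w in per_generation_fitness]
    return math.exp(sum(logs) / len(logs))


# Assumed environment: seasons alternate strictly for 20 generations.
seasons = ["cool", "hot"] * 10

specialist = long_run_growth([FITNESS["cool-adapted"][s] for s in seasons])
hedger = long_run_growth([hedger_fitness(s) for s in seasons])
print(specialist, hedger)
```

Under these assumed numbers the specialist gains in one season what it loses in the next, while the mixed genotype grows in both, so its geometric mean fitness is higher.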

msCentipede: Modeling heterogeneity across genomic sites improves accuracy in the inference of transcription factor binding
Anil Raj, Heejung Shim, Yoav Gilad, Jonathan K Pritchard, Matthew Stephens
doi: http://dx.doi.org/10.1101/012013

Motivation: Understanding global gene regulation depends critically on accurate annotation of regulatory elements that are functional in a given cell type. CENTIPEDE, a powerful, probabilistic framework for identifying transcription factor binding sites from tissue-specific DNase I cleavage patterns and genomic sequence content, leverages the hypersensitivity of factor-bound chromatin and the information in the DNase I spatial cleavage profile characteristic of each DNA binding protein to accurately infer functional factor binding sites. However, the model for the spatial profile in this framework underestimates the substantial variation in the DNase I cleavage profiles across factor-bound genomic locations and across replicate measurements of chromatin accessibility. Results: In this work, we adapt a multi-scale modeling framework for inhomogeneous Poisson processes to better model the underlying variation in DNase I cleavage patterns across genomic locations bound by a transcription factor. In addition to modeling variation, we also model spatial structure in the heterogeneity in DNase I cleavage patterns for each factor. Using DNase-seq measurements assayed in a lymphoblastoid cell line, we demonstrate the improved performance of this model for several transcription factors by comparing against the ChIP-seq peaks for those factors. Finally, we propose an extension to this framework that allows for a more flexible background model and evaluate the additional gain in accuracy achieved when the background model parameters are estimated using DNase-seq data from naked DNA. The proposed model can also be applied to paired-end ATAC-seq and DNase-seq data in a straightforward manner. Availability: msCentipede, a Python implementation of an algorithm to infer transcription factor binding using this model, is made available at https://github.com/rajanil/msCentipede

Demographic inference using genetic data from a single individual: separating population size variation from population structure
Olivier Mazet, Willy Rodríguez, Lounès Chikhi
doi: http://dx.doi.org/10.1101/011866

The rapid development of sequencing technologies represents new opportunities for population genetics research. It is expected that genomic data will increase our ability to reconstruct the history of populations. While this increase in genetic information will likely help biologists and anthropologists to reconstruct the demographic history of populations, it also represents new challenges. Recent work has shown that structured populations generate signals of population size change. As a consequence, it is often difficult to determine whether demographic events such as expansions or contractions (bottlenecks) inferred from genetic data are real or due to the fact that populations are structured in nature. Given that few inferential methods allow us to account for that structure, and that genomic data will necessarily increase the precision of parameter estimates, it is important to develop new approaches. In the present study we analyse two demographic models. The first is a model of instantaneous population size change whereas the second is the classical symmetric island model. We (i) re-derive the distribution of coalescence times under the two models for a sample of size two, (ii) use a maximum likelihood approach to estimate the parameters of these models, (iii) validate this estimation procedure under a wide array of parameter combinations, and (iv) implement and validate a model choice procedure by using a Kolmogorov-Smirnov test. Altogether we show that it is possible to estimate parameters under several models and perform efficient model choice using genetic data from a single diploid individual.
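As a rough illustration of the first model, the sketch below simulates pairwise coalescence times under an instantaneous population size change and compares them with two candidate distributions using the Kolmogorov-Smirnov distance. It is not the authors' implementation: the function names and parameter values are hypothetical, time is assumed to be measured in coalescent units of the present-day population size, the island model and the maximum likelihood step are omitted, and only the raw distance is computed rather than a full test with a p-value.

```python
import math
import random


def cdf_size_change(t, T, alpha):
    """CDF of the pairwise coalescence time when the population size changed
    by a factor alpha at time T in the past (coalescence rate 1 before T
    going backwards in time, rate 1/alpha beyond T)."""
    if t <= 0:
        return 0.0
    if t < T:
        return 1.0 - math.exp(-t)
    return 1.0 - math.exp(-T - (t - T) / alpha)


def sample_size_change(T, alpha, n, rng):
    """Draw n coalescence times by inverting the cumulative hazard."""
    times = []
    for _ in range(n):
        e = rng.expovariate(1.0)
        times.append(e if e < T else T + alpha * (e - T))
    return times


def ks_statistic(times, cdf):
    """One-sample Kolmogorov-Smirnov distance between data and a CDF."""
    xs = sorted(times)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


rng = random.Random(1)           # fixed seed so the run is reproducible
T_true, alpha_true = 0.5, 10.0   # hypothetical parameter values
data = sample_size_change(T_true, alpha_true, 2000, rng)

# Distance to the generating model versus a constant-size model (alpha = 1).
d_true = ks_statistic(data, lambda t: cdf_size_change(t, T_true, alpha_true))
d_const = ks_statistic(data, lambda t: cdf_size_change(t, T_true, 1.0))
print(d_true, d_const)
```

With these assumed parameters the simulated times sit much closer to the size-change distribution than to the constant-size one, which is the kind of contrast a Kolmogorov-Smirnov model choice procedure relies on.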

Recent Y chromosome divergence despite ancient origin of dioecy in poplars (Populus)
Armando Geraldes, Charles A Hefer, Arnaud Capron, Natalia Kolosova, Felix Martinez-Nuñez, Raju Y Soolanayakanahally, Brian Stanton, Robert D Guy, Shawn D Mansfield, Carl J Douglas, Quentin C B Cronk
doi: http://dx.doi.org/10.1101/011817

All species of the genus Populus (poplar, aspen) are dioecious, suggesting an ancient origin of this trait. Theory suggests that non-recombining sex-linked regions should quickly spread, eventually becoming heteromorphic chromosomes. In contrast, we show using whole genome scans that the sex-associated region in P. trichocarpa is small and much younger than the age of the genus. This indicates that sex-determination is highly labile in poplar, consistent with recent evidence of “turnover” of sex determination regions in animals. We performed whole genome resequencing of 52 Populus trichocarpa (black cottonwood) and 34 P. balsamifera (balsam poplar) individuals of known sex. Genome-wide association studies (GWAS) in these unstructured populations identified 650 SNPs significantly associated with sex. We estimate the size of the sex-linked region to be ∼100 Kbp. All significant SNPs were in strong linkage disequilibrium despite the fact that they were mapped to six different chromosomes (plus 3 unmapped scaffolds) in version 2.2 of the reference genome. We show that this is likely due to genome misassembly. The segregation pattern of sex associated SNPs revealed this to be an XY sex determining system. Estimated divergence times of X and Y haplotype sequences (6-7 MYA) are much more recent than the divergence of P. trichocarpa (poplar) and P. tremuloides (aspen). Consistent with this, in P. tremuloides we found no XY haplotype divergence within the P. trichocarpa sex-determining region. These two species therefore have a different genomic architecture of sex, suggestive of at least one turnover event in the recent past.

Full-genome evolutionary histories of selfing, splitting and selection in Caenorhabditis
Cristel G. Thomas, Wei Wang, Richard Jovelin, Rajarshi Ghosh, Tatiana Lomasko, Quang Trinh, Leonid Kruglyak, Lincoln D Stein, Asher D Cutter
doi: http://dx.doi.org/10.1101/011502

The nematode Caenorhabditis briggsae is a model for comparative developmental evolution with C. elegans. Worldwide collections of C. briggsae have implicated an intriguing history of divergence among genetic groups separated by latitude, or by restricted geography, that is being exploited to dissect the genetic basis to adaptive evolution and reproductive incompatibility. And yet, the genomic scope and timing of population divergence is unclear. We performed high-coverage whole-genome sequencing of 37 wild isolates of the nematode C. briggsae and applied a pairwise sequentially Markovian coalescent (PSMC) model to 703 combinations of genomic haplotypes to draw inferences about population history, the genomic scope of natural selection, and to compare with 40 wild isolates of C. elegans. We estimate that a diaspora of at least 6 distinct C. briggsae lineages separated from one another approximately 200 thousand generations ago, including the “Temperate” and “Tropical” phylogeographic groups that dominate most samples from around the world. Moreover, an ancient population split in its history 2 million generations ago, coupled with only rare gene flow among lineage groups, validates this system as a model for incipient speciation. Low versus high recombination regions of the genome give distinct signatures of population size change through time, indicative of widespread effects of selection on highly linked portions of the genome owing to extreme inbreeding by self-fertilization. Analysis of functional mutations indicates that genomic context, owing to selection that acts on long linkage blocks, is a more important driver of population variation than are the functional attributes of the individually encoded genes.