Impact of the X chromosome and sex on regulatory variation

Kimberly R Kukurba, Princy Parsana, Kevin S Smith, Zachary Zappala, David A Knowles, Marie-Julie Favé, Xin Li, Xiaowei Zhu, James B Potash, Myrna M Weissman, Jianxin Shi, Anshul Kundaje, Douglas F Levinson, Philip Awadalla, Sara Mostafavi, Alexis Battle, Stephen B Montgomery
doi: http://dx.doi.org/10.1101/024117

The X chromosome, with its unique mode of inheritance, contributes to differences between the sexes at a molecular level, including sex-specific gene expression and sex-specific impact of genetic variation. We have conducted an analysis of the impact of both sex and the X chromosome on patterns of gene expression identified through transcriptome sequencing of whole blood from 922 individuals. We found that genes on the X chromosome are more likely to have sex-specific expression than autosomal genes. Furthermore, we identified a depletion of regulatory variants on the X chromosome, especially among genes under high selective constraint. In contrast, we discovered an enrichment of sex-specific regulatory variants on the X chromosome. To resolve the molecular mechanisms underlying such effects, we generated sex-specific chromatin accessibility data and connected them to sex-specific expression and regulatory variation. As sex-specific regulatory variants can inform sex differences in genetic disease prevalence, we have integrated our data with genome-wide association study data for multiple immune traits to identify traits with significant sex biases. Together, our study provides genome-wide insight into how the X chromosome and sex shape human gene regulation and disease.

Genome variation and meiotic recombination in Plasmodium falciparum: insights from deep sequencing of genetic crosses

Alistair Miles, Zamin Iqbal, Paul Vauterin, Richard Pearson, Susana Campino, Michel Theron, Kelda Gould, Daniel Mead, Eleanor Drury, John O’Brien, Valentin Ruano Rubio, Bronwyn MacInnis, Jonathan Mwangi, Upeka Samarakoon, Lisa Ranford-Cartwright, Michael Ferdig, Karen Hayton, Xinzhuan Su, Thomas Wellems, Julian Rayner, Gil McVean, Dominic Kwiatkowski
doi: http://dx.doi.org/10.1101/024182

The malaria parasite Plasmodium falciparum has a great capacity for evolutionary adaptation to evade host immunity and develop drug resistance. Current understanding of parasite evolution is impeded by the fact that a large fraction of the genome is either highly repetitive or highly variable, and thus difficult to analyse using short read technologies. Here we describe a resource of deep sequencing data on parents and progeny from genetic crosses, which has enabled us to perform the first integrated analysis of SNP, INDEL and complex polymorphisms, using Mendelian error rates as an indicator of genotypic accuracy. These data reveal that INDELs are exceptionally abundant and the dominant mode of polymorphism within the core genome. We analyse patterns of meiotic recombination, including the relative contribution of crossover and non-crossover events, and we observe several instances of recombination that modify copy number variants associated with drug resistance. We describe a novel web application that allows these data to be explored in detail.
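
The idea of using Mendelian error rates as an indicator of genotypic accuracy can be illustrated with a minimal sketch. It assumes haploid genotype calls, so that a progeny call counts as a Mendelian error when it matches neither parent at that site. The function name, the list-based data layout and the toy genotypes are hypothetical and are not taken from the study's pipeline.

```python
from typing import Optional, Sequence

Call = Optional[str]  # allele call at one site; None marks a missing call


def mendelian_error_rate(
    parent_a: Sequence[Call],
    parent_b: Sequence[Call],
    progeny: Sequence[Sequence[Call]],
) -> float:
    """Fraction of non-missing progeny calls that match neither parent."""
    errors = 0
    total = 0
    for site, (a, b) in enumerate(zip(parent_a, parent_b)):
        if a is None or b is None:
            continue  # a site cannot be judged without both parental calls
        for child in progeny:
            call = child[site]
            if call is None:
                continue  # skip missing progeny calls
            total += 1
            if call != a and call != b:
                errors += 1
    return errors / total if total else 0.0


# Toy data (assumed): 4 sites, 2 parents, 3 progeny clones.
parent_a = ["A", "C", "G", "T"]
parent_b = ["A", "T", "G", None]
progeny = [
    ["A", "C", "G", "T"],
    ["A", "T", "A", "T"],  # third site: "A" matches neither parent ("G", "G")
    ["A", None, "G", "C"],
]
rate = mendelian_error_rate(parent_a, parent_b, progeny)
print(f"Mendelian error rate: {rate:.3f}")
```

A lower rate under one variant-calling configuration than another would suggest more accurate genotypes, which is the sense in which the error rate serves as an accuracy indicator.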

Isolation-By-Distance-and-Time in a stepping-stone model

Nicolas Duforet-Frebourg, Montgomery Slatkin
doi: http://dx.doi.org/10.1101/024133

With the great advances in ancient DNA extraction, population genetics data now comprise geographically separated individuals from both present and ancient times. However, population genetics theory about the joint effect of space and time has not been thoroughly studied. Based on the classical stepping-stone model, we develop the theory of Isolation by Distance and Time. We derive the correlation of allele frequencies between demes in the case where ancient samples are present in the data, and investigate the impact of edge effects with forward-in-time simulations. We also derive results about coalescent times in circular/toroidal models. As one of the most common ways to investigate population structure is to apply principal component analysis, we evaluate the impact of this theory on plots of principal components. Our results demonstrate that time between samples is a non-negligible factor that requires new attention in population genetics.

Integrative approaches for large-scale transcriptome-wide association studies

Alexander Gusev, Arthur Ko, Huwenbo Shi, Gaurav Bhatia, Wonil Chung, Brenda WJ Penninx, Rick Jansen, Eco JC de Geus, Dorret I Boomsma, Fred A Wright, Patrick F Sullivan, Elina Nikkola, Marcus Alvarez, Mete Civelek, Aldonis J Lusis, Terho Lehtimaki, Emma Raitoharju, Mika Kahonen, Ilkka Seppala, Olli Raitakari, Johanna Kuusisto, Markku Laakso, Alkes L Price, Paivi Pajukanta, Bogdan Pasaniuc
doi: http://dx.doi.org/10.1101/024083

Many genetic variants influence complex traits by modulating gene expression, thus altering the abundance levels of one or multiple proteins. In this work we introduce a powerful strategy that integrates gene expression measurements with large-scale genome-wide association data to identify genes whose cis-regulated expression is associated with complex traits. We use a relatively small reference panel of individuals for which both genetic variation and gene expression have been measured to impute gene expression into large cohorts of individuals and identify expression-trait associations. We extend our methods to allow for indirect imputation of the expression-trait association from summary association statistics of large-scale GWAS [1-3]. We applied our approaches to expression data from blood and adipose tissue measured in ~3,000 individuals overall. We then imputed gene expression into GWAS data from over 900,000 phenotype measurements [4-6] to identify 69 novel genes significantly associated with obesity-related traits (BMI, lipids, and height). Many of the novel genes were associated with relevant phenotypes in the Hybrid Mouse Diversity Panel. Overall our results showcase the power of integrating genotype, gene expression and phenotype to gain insights into the genetic basis of complex traits.
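
The step of imputing an expression-trait association from summary statistics can be sketched as a weighted combination of SNP z-scores. The sketch below assumes a commonly used linear form, z_gene = w·z / sqrt(wᵀ Σ w), where w holds expression prediction weights learned in the reference panel, z holds GWAS z-scores for the same SNPs, and Σ is their LD (correlation) matrix. The function name and the toy numbers are hypothetical, and the abstract itself does not spell out the estimator.

```python
import math
from typing import Sequence


def imputed_expression_z(
    weights: Sequence[float],
    gwas_z: Sequence[float],
    ld: Sequence[Sequence[float]],
) -> float:
    """Gene-level z-score from SNP z-scores, prediction weights and LD."""
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    # Variance of the weighted sum of correlated SNP z-scores: w' * LD * w.
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0:
        raise ValueError("w' LD w must be positive")
    return numerator / math.sqrt(variance)


# Toy example (assumed values): two SNPs in moderate LD.
weights = [0.5, 0.5]
gwas_z = [2.0, 2.0]
ld = [[1.0, 0.5], [0.5, 1.0]]
z_gene = imputed_expression_z(weights, gwas_z, ld)
print(f"gene-level z-score: {z_gene:.3f}")
```

Dividing by sqrt(wᵀ Σ w) is what keeps correlated SNPs from being counted as independent evidence; with an identity LD matrix the same inputs would give a larger gene-level statistic.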

Cross-species transmission and differential fate of an endogenous retrovirus in three mammal lineages

Xiaoyu Zhuo, Cedric Feschotte
doi: http://dx.doi.org/10.1101/024190

Endogenous retroviruses (ERVs) arise from retroviruses chromosomally integrated in the host germline. ERVs are common in vertebrate genomes and provide a valuable fossil record of past retroviral infections to investigate the biology and evolution of retroviruses over a deep time scale, including cross-species transmission events. Here we took advantage of a catalog of ERVs we recently produced for the bat Myotis lucifugus to seek evidence for infiltration of these retroviruses in other mammalian species (>100) currently represented in the genome sequence database. We provide multiple lines of evidence for the cross-ordinal transmission of a gammaretrovirus endogenized independently in the lineages of vespertilionid bats, felid cats and pangolins ~13-25 million years ago. Following its initial introduction, the ERV amplified extensively in parallel in both bat and cat lineages, generating hundreds of species-specific insertions throughout evolution. However, despite being derived from the same viral species, phylogenetic and selection analyses suggest that the ERV experienced different amplification dynamics in the two mammalian lineages. In the cat lineage, the ERV appears to have expanded primarily by retrotransposition of a single proviral progenitor that lost infectious capacity shortly after endogenization. In the bat lineage, the ERV followed a more complex path of germline invasion characterized by both retrotransposition and multiple infection events. The results also suggest that some of the bat ERVs have maintained infectious capacity for an extended period of time and may still be infectious today. This study provides one of the most rigorously documented cases of cross-ordinal transmission of a mammalian retrovirus. It also illustrates how the same retrovirus species has transitioned multiple times from an infectious pathogen to a genomic parasite (i.e. retrotransposon), yet experienced different invasion dynamics in different mammalian hosts.

Distribution of gene tree histories under the coalescent model with gene flow

Yuan Tian, Laura Kubatko
doi: http://dx.doi.org/10.1101/023937

We propose a coalescent model for three species that allows gene flow between both pairs of sister populations. The model is designed to analyze multilocus genomic sequence alignments, with one sequence sampled from each of the three species. The model is formulated using a Markov chain representation, which allows use of matrix exponentiation to compute analytical expressions for the probability density of gene tree genealogies. The gene tree history distribution as well as the gene tree topology distribution under this coalescent model with gene flow are then calculated via numerical integration. We analyze the model to compare the distributions of gene tree topologies and gene tree histories for species trees with differing effective population sizes and gene flow rates. Our results suggest conditions under which the species tree and associated parameters are not identifiable from the gene tree topology distribution when gene flow is present, but indicate that the gene tree history distribution may identify the species tree and associated parameters. Thus, the gene tree history distribution can be used to infer parameters such as the ancestral effective population sizes and the rates of gene flow in a maximum likelihood (ML) framework. We conduct computer simulations to evaluate the performance of our method in estimating these parameters, and we apply our method to an Afrotropical mosquito data set (Fontaine et al., 2015) to demonstrate the usefulness of our method for the analysis of empirical data.

Key words: coalescent, gene flow, migration, hybridization, gene tree, topology, history, maximum likelihood, speciation.
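
The role of matrix exponentiation in a Markov chain formulation can be shown on a much smaller toy case than the three-species model above. The sketch assumes two lineages in two populations with symmetric migration, a three-state chain (same population, different populations, coalesced), and computes transition probabilities as exp(Qt). The state space, rate parametrization and function names are illustrative assumptions, not the authors' model, and the matrix exponential is a simple scaling-and-squaring routine rather than a library call.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q: Matrix, t: float, terms: int = 20, squarings: int = 10) -> Matrix:
    """exp(q * t) by scaling and squaring with a truncated Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds a^k / k! after this update
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling by repeated squaring
    return result


def rate_matrix(coal: float, mig: float) -> Matrix:
    # States: 0 = both lineages in the same population,
    #         1 = lineages in different populations,
    #         2 = lineages have coalesced (absorbing).
    return [
        [-(coal + 2 * mig), 2 * mig, coal],
        [2 * mig, -2 * mig, 0.0],
        [0.0, 0.0, 0.0],
    ]


# Toy rates (assumed): coalescence rate 1.0, per-lineage migration rate 0.5.
p = mat_exp(rate_matrix(coal=1.0, mig=0.5), t=2.0)
print("P(coalesced by t=2 | same population at t=0):", round(p[0][2], 4))
print("P(coalesced by t=2 | different populations at t=0):", round(p[1][2], 4))
```

With migration switched off the chain reduces to a single exponential waiting time, so the entry for coalescence from state 0 equals 1 - exp(-coal * t), which gives a quick sanity check on the routine.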

More efficacious drugs lead to harder selective sweeps in the evolution of drug resistance in HIV-1

Alison F Feder, Soo-Yon Rhee, Robert W Shafer, Dmitri A Petrov, Pleuni S Pennings
doi: http://dx.doi.org/10.1101/024109

In the early days of HIV treatment, drug resistance occurred rapidly and predictably in all patients, but under modern treatments, resistance arises slowly, if at all. The probability of resistance should be controlled by the rate of generation of resistant mutations. If many adaptive mutations arise simultaneously, then adaptation proceeds by soft selective sweeps in which multiple adaptive mutations spread concomitantly, but if adaptive mutations occur rarely in the population, then a single adaptive mutation should spread alone in a hard selective sweep. Here we use 6,717 HIV-1 consensus sequences from patients treated with first-line therapies between 1989 and 2013 to confirm that the transition from fast to slow evolution of drug resistance was indeed accompanied by the expected transition from soft to hard selective sweeps. This suggests more generally that evolution proceeds via hard sweeps if resistance is unlikely and via soft sweeps if it is likely.

A method to estimate the contribution of regional genetic associations to complex traits from summary association statistics

Guillaume Pare, Shihong Mao, Wei Deng
doi: http://dx.doi.org/10.1101/024067

Despite considerable efforts, known genetic associations only explain a small fraction of predicted heritability. Regional associations combine information from multiple contiguous genetic variants and can improve variance explained at established association loci. However, regional associations are not easily amenable to estimation using summary association statistics because of sensitivity to linkage disequilibrium (LD). We now propose a novel method to estimate phenotypic variance explained by regional associations using summary statistics while accounting for LD. Our method is asymptotically equivalent to multiple regression models when no interaction or haplotype effects are present. It has multiple applications, such as ranking of genetic regions according to variance explained and derivation of regional gene scores (GS). We show that most genetic variance lies in a small proportion of the genome, and that GS derived from regional associations can improve trait prediction above optimal polygenic scores. Our results also suggest regional associations underlie known linkage peaks.
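
Why LD matters when estimating regional variance explained from summary statistics can be seen in a small worked example. The sketch uses the textbook multiple-regression identity for standardized variables, R² = rᵀ R⁻¹ r, where r holds the marginal SNP-trait correlations and R is the LD matrix. This is the large-sample quantity that the abstract says the method is equivalent to, not the authors' estimator itself; the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence


def solve_linear_system(matrix: Sequence[Sequence[float]],
                        rhs: Sequence[float]) -> List[float]:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("LD matrix is singular; prune or regularize it first")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        residual = aug[row][n] - sum(
            aug[row][k] * solution[k] for k in range(row + 1, n)
        )
        solution[row] = residual / aug[row][row]
    return solution


def regional_variance_explained(marginal_r: Sequence[float],
                                ld: Sequence[Sequence[float]]) -> float:
    """R^2 = r' R^{-1} r for standardized genotypes and phenotype."""
    joint_effects = solve_linear_system(ld, marginal_r)  # R^{-1} r
    return sum(r * b for r, b in zip(marginal_r, joint_effects))


# Toy region (assumed values): two SNPs with LD correlation 0.5.
marginal_r = [0.3, 0.3]
ld = [[1.0, 0.5], [0.5, 1.0]]
r2 = regional_variance_explained(marginal_r, ld)
naive = sum(r * r for r in marginal_r)  # ignores LD, double-counts shared signal
print(f"LD-aware R^2: {r2:.3f}, naive sum of r^2: {naive:.3f}")
```

In this toy region the naive sum of squared marginal correlations overstates the variance explained, because the two SNPs partly tag the same signal.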

Decreased transcription factor binding levels nearby primate pseudogenes suggests regulatory degeneration

Gavin M Douglas, Michael D Wilson, Alan M Moses
doi: http://dx.doi.org/10.1101/024026

Characteristics of pseudogene degeneration at the coding level are well-known, such as a shift towards neutral rates of nonsynonymous substitutions and gain of frameshift mutations. In contrast, degeneration of pseudogene transcriptional regulation is not well understood. Here, we test two predictions of regulatory degeneration along the pseudogenized lineage: (1) decreased transcription factor binding and (2) accelerated evolution in putative cis-regulatory regions. We find evidence for decreased TF binding levels nearby two primate pseudogenes compared to functional liver genes. We also find evidence for pseudogene-lineage-specific relaxation of sequence constraint on a fragment of the promoter of the primate pseudogene urate oxidase (Uox) and a nearby cis-regulatory module (CRM). However, the majority of TF-bound sequences nearby pseudogenes do not show evidence for lineage-specific accelerated rates of evolution. We conclude that decreases in TF binding level could be a marker for regulatory degeneration, while sequence degeneration in most CRMs may be obscured by background rates of TF binding site turnover.

Inference and analysis of population structure using genetic data and network theory

Gili Greenbaum, Alan R. Templeton, Shirli Bar-David
doi: http://dx.doi.org/10.1101/024042

Clustering individuals based on genetic data has become commonplace in many genetic and ecological studies. Most often, statistical inference of population structure is done by applying model-based approaches, such as Bayesian clustering, aided by visualization using distance-based approaches, such as PCA (Principal Component Analysis). While existing distance-based approaches suffer from a lack of statistical rigour, model-based approaches entail assumptions of prior conditions, such as that the subpopulations are at Hardy-Weinberg equilibrium. Here we present a distance-based approach for inference of population structure using genetic data based on the network theory concept of community, a dense subgraph within a network. A network is constructed using the pairwise genetic-distance matrix of all sampled individuals, and community detection algorithms are used to partition the network into communities, interpreted as a partition of the population into subpopulations. The statistical significance of the structure can be estimated by using permutation tests to evaluate the significance of the partition’s modularity, a network theory concept measuring the strength with which partitions divide the network. In order to further characterize population structure, a measure of the Strength of Association (SA) for an individual to its assigned community is calculated, and the Strength of Association Distribution (SAD) of the communities is analysed to provide additional population structure details. The approach presented here provides a novel, computationally efficient method for inference of population structure which assumes neither an underlying model nor prior conditions, making inference potentially more robust. The method is implemented in the software NetStruct, available at https://github.com/GiliG/NetStruct.
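
The modularity-based significance test can be sketched in a few lines. The example below computes the weighted Newman modularity of a given partition and a permutation p-value obtained by shuffling community labels among individuals. The label-shuffling scheme, the similarity-weighted toy network and all names are assumptions made for illustration; they are not taken from NetStruct, whose permutation procedure and community detection step are not described in the abstract.

```python
import random
from typing import List, Sequence


def modularity(adj: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Weighted Newman modularity of a partition of an undirected network."""
    n = len(adj)
    strength = [sum(row) for row in adj]  # weighted degree of each node
    two_m = sum(strength)                 # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


def permutation_p_value(adj: Sequence[Sequence[float]], labels: Sequence[int],
                        n_perm: int = 999, seed: int = 0) -> float:
    """Share of label shufflings with modularity at least as high as observed."""
    rng = random.Random(seed)
    observed = modularity(adj, labels)
    shuffled: List[int] = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if modularity(adj, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


# Toy network (assumed): 10 individuals in two groups of 5, with edge weight
# 1.0 within groups and 0.1 between groups, standing in for genetic similarity.
groups = [0] * 5 + [1] * 5
adj = [
    [0.0 if i == j else (1.0 if groups[i] == groups[j] else 0.1)
     for j in range(10)]
    for i in range(10)
]
q_obs = modularity(adj, groups)
p_value = permutation_p_value(adj, groups)
print(f"modularity = {q_obs:.3f}, permutation p-value = {p_value:.3f}")
```

In practice the partition would come from a community detection algorithm rather than being supplied by hand, and the null distribution should reflect how the network itself was built from the genetic data.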