qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots

Stephen D. Turner

Summary: Genome-wide association studies (GWAS) have identified thousands of human trait-associated single nucleotide polymorphisms. Here, I describe a freely available R package for visualizing GWAS results using Q-Q and manhattan plots. The qqman package enables the flexible creation of manhattan plots, both genome-wide and for single chromosomes, with optional highlighting of SNPs of interest. Availability: qqman is released under the GNU General Public License, and is freely available on the Comprehensive R Archive Network (http://cran.r-project.org/package=qqman). The source code is available on GitHub (https://github.com/stephenturner/qqman).
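
For readers unfamiliar with Q-Q plots, the following is a minimal Python sketch of the quantities such a plot displays: observed -log10 p-values against the values expected if the p-values were uniform under the null. It is not qqman's implementation (qqman is an R package). The function name `qq_points`, the plotting position (i - 0.5) / n, and the p-values are illustrative assumptions.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first.

    Assumption: the i-th smallest of n null p-values is placed at
    (i - 0.5) / n; other plotting positions are also in common use.
    """
    n = len(pvalues)
    # Sort ascending so the smallest p-value (largest -log10) comes first.
    observed = [-math.log10(p) for p in sorted(pvalues)]
    expected = [-math.log10((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, observed))


# Made-up p-values, for illustration only.
points = qq_points([0.5, 0.01, 0.2, 0.9])
for exp_val, obs_val in points:
    print(f"expected={exp_val:.3f} observed={obs_val:.3f}")
```

Points lying above the diagonal (observed larger than expected) indicate p-values smaller than the uniform null would predict.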

Cosi2: An efficient simulator of exact and approximate coalescent with selection

Ilya Shlyakhter, Pardis C. Sabeti, Stephen F. Schaffner

Motivation: Efficient simulation of population genetic samples under a given demographic model is a prerequisite for many analyses. Coalescent theory provides an efficient framework for such simulations, but simulating longer regions and higher recombination rates remains challenging. Simulators based on a Markovian approximation to the coalescent scale well, but do not support simulation of selection. Gene conversion is not supported by any published coalescent simulators that support selection. Results: We describe cosi2, an efficient simulator that supports both exact and approximate coalescent simulation with positive selection. cosi2 improves on the speed of existing exact simulators, and permits further speedup in approximate mode while retaining support for selection. cosi2 supports a wide range of demographic scenarios including recombination hot spots, gene conversion, population size changes, population structure and migration. cosi2 implements coalescent machinery efficiently by tracking only a small subset of the Ancestral Recombination Graph, sampling only relevant recombination events, and using augmented skip lists to represent tracked genetic segments. To preserve support for selection in approximate mode, the Markov approximation is implemented not by moving along the chromosome but by performing a standard backwards-in-time coalescent simulation while restricting coalescence to node pairs with overlapping or near-overlapping genetic material. We describe the algorithms used by cosi2 and present comparisons with existing selection simulators.
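
The restriction on coalescence in approximate mode can be illustrated with a toy check. This is only a sketch of the idea of "overlapping or near-overlapping genetic material", not cosi2's code: cosi2 uses augmented skip lists, whereas here each node's material is a plain list of (start, end) intervals, and `can_coalesce` and `max_gap` are hypothetical names.

```python
def can_coalesce(segs_a, segs_b, max_gap=0.0):
    """Return True if any segment of one node overlaps, or lies within
    max_gap of, any segment of the other node.

    Hypothetical representation: each node carries a list of (start, end)
    intervals on the chromosome. Segments that merely touch count as
    being at gap 0.
    """
    for a_start, a_end in segs_a:
        for b_start, b_end in segs_b:
            # Gap between the two intervals; negative when they overlap.
            gap = max(b_start - a_end, a_start - b_end)
            if gap <= max_gap:
                return True
    return False


node_x = [(0.0, 0.2)]
node_y = [(0.25, 0.5)]
print(can_coalesce(node_x, node_y))               # disjoint, no tolerance
print(can_coalesce(node_x, node_y, max_gap=0.1))  # within tolerance
```

With `max_gap=0.0` only overlapping or touching node pairs may coalesce. A larger `max_gap` admits more pairs, which in this toy picture moves the process back towards the unrestricted coalescent.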

Properties of selected mutations and genotypic landscapes under Fisher’s Geometric Model

François Blanquart, Guillaume Achaz, Thomas Bataillon, Olivier Tenaillon
(Submitted on 14 May 2014)

The fitness landscape – the mapping between genotypes and fitness – determines properties of the process of adaptation. Several small genetic fitness landscapes have recently been built by selecting a handful of beneficial mutations and measuring the fitness of all combinations of these mutations. Here we generate several testable predictions for the properties of these landscapes under Fisher’s geometric model of adaptation (FGMA). When far from the fitness optimum, we analytically compute the fitness effect of beneficial mutations and their epistatic interactions. We show that epistasis may be negative or positive on average depending on the distance of the ancestral genotype to the optimum and whether mutations were independently selected or co-selected in an adaptive walk. Using simulations, we show that genetic landscapes built from FGMA are very close to an additive landscape when the ancestral strain is far from the optimum. However, when close to the optimum, a large diversity of landscapes with substantial ruggedness and sign epistasis emerged. Strikingly, landscapes built from different realizations of stochastic adaptive walks in exactly the same conditions were highly variable, suggesting that several realizations of small genetic landscapes are needed to gain information about the underlying architecture of the global adaptive landscape.
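
To make "epistatic interactions" concrete, here is a minimal sketch of pairwise epistasis in a geometric model. It assumes a Gaussian fitness function (log fitness equal to minus half the squared distance to an optimum at the origin), which may be only a special case of the fitness function used in the paper. The phenotype vectors are made up for illustration.

```python
def log_fitness(z):
    """Log fitness under the assumed Gaussian form: -|z|^2 / 2."""
    return -0.5 * sum(x * x for x in z)


def add(u, v):
    """Add two phenotype vectors component-wise."""
    return [a + b for a, b in zip(u, v)]


def epistasis(z, a, b):
    """Pairwise epistasis on the log-fitness scale: the double mutant's
    log fitness minus what the two single-mutant effects would predict
    if they were additive."""
    return (log_fitness(add(add(z, a), b)) - log_fitness(add(z, a))
            - log_fitness(add(z, b)) + log_fitness(z))


ancestor = [2.0, 0.0]   # ancestral phenotype, distance 2 from the optimum
mut_a = [-0.5, 0.1]     # both mutations move the phenotype towards the optimum
mut_b = [-0.4, -0.2]
e = epistasis(ancestor, mut_a, mut_b)
print(round(e, 6))  # -0.18
```

Under this particular assumption the expression reduces algebraically to minus the dot product of the two mutation vectors, so two mutations that both point towards the optimum tend to interact negatively. Other fitness functions need not share this property.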

When genomes collide: multiple modes of germline misregulation in a dysgenic syndrome of Drosophila virilis

Mauricio A. Galdos, Alexandra A. Erwin, Michelle L. Wickersheim, Chris C. Harrison, Kendra D. Marr, Justin Blumenstiel

In sexually reproducing species, the union of gametes that are not closely related can result in genomic incompatibility. Hybrid dysgenic syndromes represent a form of genomic incompatibility that can arise when transposable element (TE) abundance differs between two parents. When TEs lacking in the female parent are transmitted paternally, a lack of corresponding silencing small RNAs (piRNAs) transmitted through the female germline can lead to TE mobilization in progeny. The epigenetic nature of this phenomenon is demonstrated by the fact that genetically identical females of the reciprocal cross are normal. Here we show that in the hybrid dysgenic syndrome of Drosophila virilis, an excess of paternally inherited TE families leads not only to increased expression of these TEs, but also coincides with derepression of TEs in equal abundance within parents. Moreover, TE derepression is stable as flies age and is associated with piRNA biogenesis defects for only some TEs. At the same time, TE activation is associated with a genome-wide shift in the distribution of endogenous gene expression and an increase in abundance of off-target genic piRNAs. To identify regions of the maternal genome that most protect against dysgenesis, we performed an F3 backcross analysis. We find that pericentric regions play a dominant role in maternal protection. This F3 backcross approach additionally allowed us to clarify the properties of genic paramutation in D. virilis. Overall, results support a model in which early germline events in dysgenesis establish a chronic, stable state of mis-expression that is maintained through adulthood. Such early events in the germline that are mediated by parent-of-origin effects may be important in determining patterns of gene expression in natural populations.

Quadri-allele frequency spectrum in a coalescent topology for mutations in non-constant population size

Arka Bhattacharya
(Submitted on 11 May 2014)

The sample frequency spectrum of a segregating site is the probability distribution of a sample of alleles from a genetic locus, conditional on observing the sample to have more than one clearly different phenotype. We present a model for analyzing the quadri-allele frequency spectrum, in which the ancestral population diverged into three populations at a certain divergence time and the resulting mutations on the branches of the coalescent tree gave rise to three different derived alleles, which could be observed in the present generation along with the ancestral allele. The model has been analyzed for non-constant population size, assuming a certain number of extant lineages at the divergence time and no migration between the populations.

Effective Genetic Risk Prediction Using Mixed Models

David Golan, Saharon Rosset
(Submitted on 12 May 2014)

To date, efforts to produce high-quality polygenic risk scores from genome-wide studies of common disease have focused on estimating and aggregating the effects of multiple SNPs. Here we propose a novel statistical approach for genetic risk prediction, based on random and mixed effects models. Our approach (termed GeRSI) circumvents the need to estimate the effect sizes of numerous SNPs by treating these effects as random, producing predictions which are consistently superior to the current state of the art, as we demonstrate in extensive simulations. When applying GeRSI to seven phenotypes from the WTCCC study, we confirm that the use of random effects is most beneficial for diseases that are known to be highly polygenic: hypertension (HT) and bipolar disorder (BD). For HT, there are no significant associations in the WTCCC data. The best existing model yields an AUC of 54%, while GeRSI improves it to 59%. For BD, using GeRSI improves the AUC from 55% to 62%. For individuals ranked in the top 10% of BD risk predictions, using GeRSI substantially increases the BD relative risk from 1.4 to 2.5.
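
As a reminder of what the AUC figures above measure, the sketch below computes AUC as the fraction of (case, control) pairs in which the case receives the higher risk score, with ties counted as half. It illustrates the metric only and is unrelated to GeRSI itself. The function name and the scores are made up.

```python
def auc(case_scores, control_scores):
    """Area under the ROC curve via pairwise comparison of risk scores."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a win
    return wins / (len(case_scores) * len(control_scores))


# Made-up risk scores: 5 of the 6 case/control pairs are ranked correctly.
print(auc([0.9, 0.6, 0.4], [0.5, 0.3]))
```

An AUC of 0.5 corresponds to a score that ranks cases and controls no better than chance, which is why values such as 54% and 59% are read relative to 50% rather than to zero.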

diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals

Paula Tataru, Jasmine A. Nirody, Yun S. Song

Summary: We present a tool, diCal-IBD, for detecting identity-by-descent (IBD) tracts between pairs of genomic sequences. Our method builds on a recent demographic inference method based on the coalescent with recombination, and is able to incorporate demographic information as a prior. A simulation study shows that diCal-IBD has significantly higher recall and precision than existing IBD detection methods, while retaining reasonable accuracy for IBD tracts as small as 0.1 cM. Availability: https://sourceforge.net/projects/dical-ibd/ Contact: yss@eecs.berkeley.edu

Nonspecific transcription factor binding reduces variability in transcription factor and target protein expression

Mohammad Soltani, Pavol Bokes, Zachary Fox, Abhyudai Singh
(Submitted on 11 May 2014)

Transcription factors (TFs) interact with a multitude of binding sites on DNA and partner proteins inside cells. We investigate how nonspecific binding/unbinding to such decoy binding sites affects the magnitude and time-scale of random fluctuations in TF copy numbers arising from stochastic gene expression. A stochastic model of TF gene expression, together with decoy site interactions, is formulated. Distributions for the total (bound and unbound) and free (unbound) TF levels are derived by analytically solving the chemical master equation under physiologically relevant assumptions. Our results show that increasing the number of decoy binding sites considerably reduces stochasticity in free TF copy numbers. The TF autocorrelation function reveals that decoy sites can either enhance or shorten the time-scale of TF fluctuations depending on model parameters. To understand how noise in TF abundances propagates downstream, a TF target gene is included in the model. Intriguingly, we find that noise in the expression of the target gene decreases with increasing decoy sites for linear TF-target protein dose-responses, even in regimes where decoy sites enhance TF autocorrelation times. Moreover, counterintuitive noise transmissions arise for nonlinear dose-responses. In summary, our study highlights the critical role of molecular sequestration by decoy binding sites in regulating the stochastic dynamics of TFs and target proteins at the single-cell level.

Author post: Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans

This guest post is by Rebekah Rogers (@evolscientist) on her paper with coauthors “Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans” arXived here.

Tandem duplications are widely recognized as a source of genetic novelty. Duplication of gene sequences can result in adaptive evolution through the development of novel functions or specialization in subsets of ancestral functions when ‘spare parts’ are relieved of evolutionary constraints. Additionally, tandem duplications have the potential to create entirely novel gene structures through chimeric gene formation and recruitment of formerly non-coding sequence. Here, we survey the limits of standing variation for tandem duplications in natural populations of D. yakuba and D. simulans, estimate the upper bound of mutation rates, and explore their role in rapid evolution.

Tandem duplicates on the X chromosome in D. simulans show an excess of high frequency variants consistent with adaptive evolution through tandem duplication. Furthermore, we identify an overrepresentation of genes involved in rapidly evolving phenotypes such as chorion development and oogenesis, drug and toxin metabolism, chitin cuticle formation, chemosensory processes, lipases and endopeptidases expressed in male reproduction, as well as immune response to pathogens in both D. yakuba and D. simulans. The enrichment of such rapidly evolving functional classes points to a role for tandem duplicates in Red Queen dynamics and responses to strong selective pressures.

In spite of the observed concordance across functional classes, we observe few duplicated genes that are shared across species, indicating that parallel recruitment of tandem duplications is rare. The span of duplicates in the population is quite limited, and we estimate that less than 15% of the genome is represented among the tandem duplications segregating in the entire population for the species. Moreover, many duplicates are present at low frequency and will have difficulty escaping the forces of drift during selective sweeps. This very limited standing variation combined with low mutation rates for tandem duplications results in severe limitations in the substrate of genetic novelty that is available for adaptation.

Thus, the limits of standing variation and the rate of new mutations are expected to play a vital role in defining evolutionary trajectories and the ability of organisms to adapt in the event of gross environmental change. Given the limited substrate of genetic novelty, we expect that if adaptation is dependent upon gene duplications, suboptimal outcomes in adaptive walks will be common, long wait times will occur for new phenotypic changes, and many multicellular eukaryotes will display limited ability to adapt to rapidly changing environments.

Diversity and evolution of centromere repeats in the maize genome

Paul Bilinski, Kevin Distor, Jose Gutierrez-Lopez, Gabriela Mendoza Mendoza, Jinghua Shi, R. Kelly Dawe, Jeffrey Ross-Ibarra

Centromere repeats are found in most eukaryotes and play a critical role in kinetochore formation. Though CentC repeats exhibit considerable diversity both within and among species, little is understood about the mechanisms that drive centromere repeat evolution. Here, we use maize as a model to investigate how a complex history involving polyploidy, fractionation, and recent domestication has impacted the diversity of the maize CentC repeat. We first validate the existence of long tandem arrays of repeats in maize and other taxa in the genus Zea. Although we find considerable sequence diversity among CentC copies genome-wide, genetic similarity among repeats is highest within these arrays, suggesting that tandem duplications are the primary mechanism for the generation of new copies. Genetic clustering analyses identify similar sequences among distant repeats, and simulations suggest that this pattern may be due to homoplasious mutation. Although the two ancestral subgenomes of maize have contributed nearly equal numbers of centromeres, our analysis shows that the vast majority of all CentC repeats derive from one of the parental genomes. Finally, by comparing maize with its wild progenitor teosinte, we find that the abundance of CentC has decreased through domestication while the pericentromeric repeat Cent4 has drastically increased.