qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots

Stephen D. Turner

Summary: Genome-wide association studies (GWAS) have identified thousands of human trait-associated single nucleotide polymorphisms. Here, I describe a freely available R package for visualizing GWAS results using Q-Q and manhattan plots. The qqman package enables the flexible creation of manhattan plots, both genome-wide and for single chromosomes, with optional highlighting of SNPs of interest. Availability: qqman is released under the GNU General Public License, and is freely available on the Comprehensive R Archive Network (http://cran.r-project.org/package=qqman). The source code is available on GitHub (https://github.com/stephenturner/qqman).
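
As a rough illustration of what a Q-Q plot of GWAS results compares, the sketch below computes the (expected, observed) coordinates on the -log10 scale. It is not qqman's code: the function name `qq_points` and the expected-quantile formula `(rank - 0.5) / n` are assumptions made for this example, and the p-values are made up.

```python
import math

def qq_points(pvalues):
    """Return (expected, observed) -log10(p) pairs for a Q-Q plot.

    Assumption: expected p-values under the null are taken as
    (rank - 0.5) / n, one common plotting-position convention.
    """
    # Keep only valid p-values and sort so the smallest comes first.
    observed = sorted(p for p in pvalues if 0.0 < p <= 1.0)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

# Made-up p-values, for illustration only.
example = qq_points([0.5, 0.01, 0.2, 0.9])
```

Points lying well above the diagonal (observed larger than expected) are the ones a Q-Q plot is meant to draw attention to.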

Cosi2: An efficient simulator of exact and approximate coalescent with selection

Ilya Shlyakhter, Pardis C. Sabeti, Stephen F. Schaffner

Motivation: Efficient simulation of population genetic samples under a given demographic model is a prerequisite for many analyses. Coalescent theory provides an efficient framework for such simulations, but simulating longer regions and higher recombination rates remains challenging. Simulators based on a Markovian approximation to the coalescent scale well, but do not support simulation of selection. Gene conversion is not supported by any published coalescent simulators that support selection. Results: We describe cosi2, an efficient simulator that supports both exact and approximate coalescent simulation with positive selection. cosi2 improves on the speed of existing exact simulators, and permits further speedup in approximate mode while retaining support for selection. cosi2 supports a wide range of demographic scenarios including recombination hot spots, gene conversion, population size changes, population structure and migration. cosi2 implements coalescent machinery efficiently by tracking only a small subset of the Ancestral Recombination Graph, sampling only relevant recombination events, and using augmented skip lists to represent tracked genetic segments. To preserve support for selection in approximate mode, the Markov approximation is implemented not by moving along the chromosome but by performing a standard backwards-in-time coalescent simulation while restricting coalescence to node pairs with overlapping or near-overlapping genetic material. We describe the algorithms used by cosi2 and present comparisons with existing selection simulators.
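
For readers unfamiliar with backwards-in-time coalescent simulation, here is a minimal sketch of the simplest case: a single constant-size population with no recombination, selection, or structure. It is not cosi2's algorithm and uses none of its data structures; the function name `coalescent_times` is hypothetical, and time is measured in standard coalescent units, where k lineages coalesce at rate k(k-1)/2.

```python
import random

def coalescent_times(n_samples, rng):
    """Cumulative times of the n-1 coalescence events for n sampled lineages.

    Assumption: standard neutral coalescent in a constant-size population,
    with time in coalescent units.
    """
    times = []
    t = 0.0
    k = n_samples
    while k > 1:
        # With k lineages, any of the k*(k-1)/2 pairs may coalesce.
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
        times.append(t)
        k -= 1  # one pair merged, so one lineage fewer
    return times

# Seeded generator so the run is reproducible.
times = coalescent_times(10, random.Random(1))
```

The last entry of `times` is the time to the most recent common ancestor of the sample. Adding recombination turns this tree into a graph, which is where the cost discussed in the abstract arises.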

Properties of selected mutations and genotypic landscapes under Fisher’s Geometric Model

François Blanquart, Guillaume Achaz, Thomas Bataillon, Olivier Tenaillon
(Submitted on 14 May 2014)

The fitness landscape (the mapping between genotypes and fitness) determines properties of the process of adaptation. Several small genetic fitness landscapes have recently been built by selecting a handful of beneficial mutations and measuring the fitness of all combinations of these mutations. Here we generate several testable predictions for the properties of these landscapes under Fisher’s geometric model of adaptation (FGMA). When far from the fitness optimum, we analytically compute the fitness effect of beneficial mutations and their epistatic interactions. We show that epistasis may be negative or positive on average depending on the distance of the ancestral genotype to the optimum and whether mutations were independently selected or co-selected in an adaptive walk. Using simulations, we show that genetic landscapes built from FGMA are very close to an additive landscape when the ancestral strain is far from the optimum. However, when close to the optimum, a large diversity of landscapes with substantial ruggedness and sign epistasis emerged. Strikingly, landscapes built from different realizations of stochastic adaptive walks under exactly the same conditions were highly variable, suggesting that several realizations of small genetic landscapes are needed to gain information about the underlying architecture of the global adaptive landscape.
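
To make "epistasis between two mutations" concrete, the sketch below computes it under one simple version of Fisher's geometric model. The assumptions are mine, not taken from the abstract: the optimum sits at the origin, log-fitness is minus half the squared distance to the optimum (a Gaussian fitness function), and mutations add as vectors in phenotype space. The paper may use a more general fitness function; the function names and numbers are hypothetical.

```python
def log_fitness(phenotype):
    """Log-fitness under an assumed Gaussian fitness function, optimum at origin."""
    return -0.5 * sum(x * x for x in phenotype)

def add(u, v):
    """Add two phenotype vectors component by component."""
    return tuple(a + b for a, b in zip(u, v))

def epistasis(ancestor, mut_a, mut_b):
    """Deviation of the double mutant's log-fitness from additivity."""
    w0 = log_fitness(ancestor)
    wa = log_fitness(add(ancestor, mut_a))
    wb = log_fitness(add(ancestor, mut_b))
    wab = log_fitness(add(add(ancestor, mut_a), mut_b))
    return wab - wa - wb + w0

# Made-up ancestor and two mutations that each move it toward the optimum.
ancestor = (2.0, 0.0)
mut_a = (-0.5, 0.1)
mut_b = (-0.4, -0.2)
e = epistasis(ancestor, mut_a, mut_b)
```

Under this particular quadratic form, the epistasis of a fixed pair of mutations equals minus the dot product of the two mutation vectors (here -0.18), so two mutations pointing in similar directions interact negatively. Which mutations get selected in the first place depends on the ancestor's position, which is one way the distance to the optimum can enter the average.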

Quadri-allele frequency spectrum in a coalescent topology for mutations in non-constant population size

Arka Bhattacharya
(Submitted on 11 May 2014)

The sample frequency spectrum of a segregating site is the probability distribution of a sample of alleles from a genetic locus, conditional on observing the sample to have more than one clearly different phenotype. We present a model for analyzing the quadri-allele frequency spectrum, where the ancestral population diverged into three populations at a certain divergence time and the resulting mutations on the branches of the coalescent tree gave rise to three different derived alleles, which could be observed in the present generation along with the ancestral allele. The model has been analyzed for non-constant population size, assuming a certain number of extant lineages at the divergence time and no migration between the populations.

Effective Genetic Risk Prediction Using Mixed Models

David Golan, Saharon Rosset
(Submitted on 12 May 2014)

To date, efforts to produce high-quality polygenic risk scores from genome-wide studies of common disease have focused on estimating and aggregating the effects of multiple SNPs. Here we propose a novel statistical approach for genetic risk prediction, based on random and mixed effects models. Our approach (termed GeRSI) circumvents the need to estimate the effect sizes of numerous SNPs by treating these effects as random, producing predictions which are consistently superior to the current state of the art, as we demonstrate in extensive simulations. When applying GeRSI to seven phenotypes from the WTCCC study, we confirm that the use of random effects is most beneficial for diseases that are known to be highly polygenic: hypertension (HT) and bipolar disorder (BD). For HT, there are no significant associations in the WTCCC data. The best existing model yields an AUC of 54%, while GeRSI improves it to 59%. For BD, using GeRSI improves the AUC from 55% to 62%. For individuals ranked in the top 10% of BD risk predictions, using GeRSI substantially increases the BD relative risk from 1.4 to 2.5.
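
The AUC figures quoted above can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The sketch below computes that quantity directly from pairwise comparisons. It is an illustration of the metric only, not of GeRSI; the function name `auc` and the scores are made up.

```python
def auc(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case scores higher.

    Ties count as half a win, the usual convention for the AUC.
    """
    wins = 0.0
    for c in case_scores:
        for d in control_scores:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Made-up risk scores for three cases and three controls.
value = auc([0.9, 0.6, 0.4], [0.7, 0.3, 0.2])
```

An AUC of 0.5 corresponds to random ranking, which is why values such as 54% or 59% indicate weak but nonzero predictive power.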

diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals

Paula Tataru, Jasmine A. Nirody, Yun S. Song

Summary: We present a tool, diCal-IBD, for detecting identity-by-descent (IBD) tracts between pairs of genomic sequences. Our method builds on a recent demographic inference method based on the coalescent with recombination, and is able to incorporate demographic information as a prior. A simulation study shows that diCal-IBD has significantly higher recall and precision than existing IBD detection methods, while retaining reasonable accuracy for IBD tracts as small as 0.1 cM. Availability: https://sourceforge.net/projects/dical-ibd/ Contact: yss@eecs.berkeley.edu

Nonspecific transcription factor binding reduces variability in transcription factor and target protein expression

Mohammad Soltani, Pavol Bokes, Zachary Fox, Abhyudai Singh
(Submitted on 11 May 2014)

Transcription factors (TFs) interact with a multitude of binding sites on DNA and partner proteins inside cells. We investigate how nonspecific binding/unbinding to such decoy binding sites affects the magnitude and time-scale of random fluctuations in TF copy numbers arising from stochastic gene expression. A stochastic model of TF gene expression, together with decoy site interactions, is formulated. Distributions for the total (bound and unbound) and free (unbound) TF levels are derived by analytically solving the chemical master equation under physiologically relevant assumptions. Our results show that increasing the number of decoy binding sites considerably reduces stochasticity in free TF copy numbers. The TF autocorrelation function reveals that decoy sites can either enhance or shorten the time-scale of TF fluctuations depending on model parameters. To understand how noise in TF abundances propagates downstream, a TF target gene is included in the model. Intriguingly, we find that noise in the expression of the target gene decreases with increasing decoy sites for linear TF-target protein dose-responses, even in regimes where decoy sites enhance TF autocorrelation times. Moreover, counterintuitive noise transmissions arise for nonlinear dose-responses. In summary, our study highlights the critical role of molecular sequestration by decoy binding sites in regulating the stochastic dynamics of TFs and target proteins at the single-cell level.

Diversity and evolution of centromere repeats in the maize genome

Paul Bilinski, Kevin Distor, Jose Gutierrez-Lopez, Gabriela Mendoza Mendoza, Jinghua Shi, R. Kelly Dawe, Jeffrey Ross-Ibarra

Centromere repeats are found in most eukaryotes and play a critical role in kinetochore formation. Though CentC repeats exhibit considerable diversity both within and among species, little is understood about the mechanisms that drive centromere repeat evolution. Here, we use maize as a model to investigate how a complex history involving polyploidy, fractionation, and recent domestication has impacted the diversity of the maize CentC repeat. We first validate the existence of long tandem arrays of repeats in maize and other taxa in the genus Zea. Although we find considerable sequence diversity among CentC copies genome-wide, genetic similarity among repeats is highest within these arrays, suggesting that tandem duplications are the primary mechanism for the generation of new copies. Genetic clustering analyses identify similar sequences among distant repeats, and simulations suggest that this pattern may be due to homoplasious mutation. Although the two ancestral subgenomes of maize have contributed nearly equal numbers of centromeres, our analysis shows that the vast majority of all CentC repeats derive from one of the parental genomes. Finally, by comparing maize with its wild progenitor teosinte, we find that the abundance of CentC has decreased through domestication while the pericentromeric repeat Cent4 has drastically increased.

Quantifying evolutionary dynamics of the basic genome of E. coli

Purushottam Dixit, Tin Yau Pang, F. William Studier, Sergei Maslov
(Submitted on 11 May 2014)

The ~4-Mbp basic genome shared by 32 independent isolates of E. coli representing considerable population diversity has been approximated by whole-genome multiple-alignment and computational filtering designed to remove mobile elements and highly variable regions. Single nucleotide polymorphisms (SNPs) in the 496 basic-genome pairs are identified, and clonally inherited stretches are distinguished from those acquired by horizontal transfer (HT) by sharp discontinuities in SNP density. The six least diverged genome-pairs each have only one or two HT stretches, each occupying 42-115 kbp of the basic genome and containing at least one gene cluster known to confer selective advantage. At higher divergences, the typical mosaic pattern of interspersed clonal and HT stretches across the entire basic genome is observed, including likely fragmented integrations across a restriction barrier. A simple model suggests that individual HT events are on the order of 10 kbp and are the chief contributor to genome divergence, bringing in almost 12 times more SNPs than point mutations. As a result of continuing horizontal transfer of such large segments, 400 out of the 496 strain-pairs beyond genomic divergence of share virtually no genomic material with their common ancestor. We conclude that the active and continuing horizontal transfer of moderately large genomic fragments is likely to be mediated primarily by a co-evolving population of phages that distribute random genome fragments throughout the population by generalized transduction, allowing efficient adaptation to environmental changes.
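
The figure of 496 genome pairs follows from the 32 isolates, since every unordered pair of distinct isolates is compared once. A small sketch of that arithmetic, with hypothetical isolate labels:

```python
from itertools import combinations

n_isolates = 32
# Hypothetical labels; the abstract does not name the isolates.
isolates = [f"isolate_{i}" for i in range(1, n_isolates + 1)]

# Every unordered pair of distinct isolates: 32 * 31 / 2 comparisons.
pairs = list(combinations(isolates, 2))
n_pairs = len(pairs)

# Share of pairs reported to retain virtually no ancestral material.
fraction_without_ancestral = 400 / n_pairs
```

By this count, the 400 pairs mentioned in the abstract make up roughly four fifths of all comparisons.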

Sequence co-evolution gives 3D contacts and structures of protein complexes

Thomas A. Hopf, Charlotta P.I. Schärfe, João P.G.L.M. Rodrigues, Anna G. Green, Chris Sander, Alexandre M.J.J. Bonvin, Debora S. Marks

High-throughput experiments in bacteria and eukaryotic cells have identified tens of thousands of possible interactions between proteins. This genome-wide view of the protein interaction universe is coarse-grained, whilst fine-grained detail of macromolecular interactions critically depends on lower throughput, labor-intensive experiments. Computational approaches using measures of residue co-evolution across proteins show promise, but have been limited to specific interactions. Here we present a new generalized method showing that patterns of evolutionary sequence changes across proteins reflect residues that are close in space, with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We demonstrate that the inferred evolutionary coupling scores distinguish between interacting and non-interacting proteins and allow accurate prediction of residue interactions. To illustrate the utility of the method, we predict unknown 3D interactions between subunits of ATP synthase and find results consistent with detailed experimental data. We expect that the method can be generalized to genome-wide interaction predictions at residue resolution.