Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays

Heejung Shim, Matthew Stephens
(Submitted on 27 Jul 2013)

Understanding how genetic variants influence cellular-level processes is an important step towards understanding how they influence important organismal-level traits, or “phenotypes”, including human disease susceptibility. To this end scientists are undertaking large-scale genetic association studies that aim to identify genetic variants associated with molecular and cellular phenotypes, such as gene expression, transcription factor binding, or chromatin accessibility. These studies use high-throughput sequencing assays (e.g. RNA-seq, ChIP-seq, DNase-seq) to obtain high-resolution data on how the traits vary along the genome in each sample. However, typical association analyses fail to exploit these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes, or windows of fixed length. Here we develop and apply statistical methods that better exploit the high-resolution data. The key idea is to treat the sequence data as measuring an underlying “function” that varies along the genome, and then, building on wavelet-based methods for functional data analysis, test for association between genetic variants and the underlying function. Applying these methods to identify genetic variants associated with chromatin accessibility (dsQTLs) we find that they identify substantially more associations than a simpler window-based analysis, and in total we identify 772 novel dsQTLs not identified by the original analysis.
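To make the multi-scale idea concrete, here is a minimal sketch of an unnormalized Haar wavelet decomposition applied to per-base read counts in a region. It is only an illustration of wavelet-based functional data analysis in general, not the authors' method: the function name `haar_transform` and the example counts are hypothetical, and the region length is assumed to be a power of two.

```python
def haar_transform(signal):
    """Unnormalized Haar decomposition of a list whose length is a power of two.

    Returns (total, details), where `details` lists the difference
    coefficients from the finest scale to the coarsest scale.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Differences capture local shape at the current scale.
        details.append([left - right for left, right in pairs])
        # Sums become the coarser-resolution signal for the next pass.
        approx = [left + right for left, right in pairs]
    return approx[0], details


# Hypothetical read counts at 8 consecutive positions.
counts = [4, 6, 10, 12, 3, 1, 0, 0]
total, details = haar_transform(counts)
```

A window-based summary keeps only `total`; the coefficients in `details` retain how the signal is distributed within the window, which is the kind of information an association test on the underlying function can use.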

Robust forward simulations of recurrent positive selection

Lawrence H. Uricchio, Ryan D. Hernandez
(Submitted on 24 Jul 2013)

It is well known that recurrent positive selection reduces the amount of genetic variation at linked sites. In recent decades, analytical results have been proposed to quantify the magnitude of this reduction with simple Wright-Fisher models and diffusion approximations. However, extending these results to include interference between selected sites, arbitrary selection schemes, and complicated demographic processes has proved to be challenging. Forward simulation can provide insights into these processes, but few studies have examined recurrent positive selection in a forward simulation context due to computational constraints. Here, we extend the flexible forward simulator SFS_CODE to greatly improve the efficiency of simulations of recurrent positive selection. Forward simulations are computationally intensive and often necessitate rescaling of relevant parameters (e.g., population size and sequence length) to achieve computational feasibility. However, it is not obvious that parameter rescaling will maintain expected patterns of diversity in all parameter regimes. We develop a simple method for parameter rescaling that provides the best possible computational performance for a given error tolerance, and a detailed theoretical analysis of the robustness of rescaling across the parameter space. These results show that ad hoc approaches to parameter rescaling under the recurrent hitchhiking model may not always provide sufficiently accurate dynamics, potentially skewing patterns of diversity in simulated DNA sequences.
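The rescaling discussed above can be illustrated with the common naive scheme, which shrinks the population size while holding the population-scaled parameters (N·s, N·μ, N·r) fixed. This is a sketch of that conventional approach, the kind the abstract cautions may not always be accurate; it is not the authors' method and does not use the SFS_CODE interface. The names `SimParams` and `rescale` and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    pop_size: int      # N, number of individuals
    sel_coeff: float   # s, selection coefficient
    mut_rate: float    # mu, per-site mutation rate per generation
    rec_rate: float    # r, per-site recombination rate per generation


def rescale(params, factor):
    """Shrink N by roughly `factor`, keeping N*s, N*mu and N*r unchanged."""
    new_n = round(params.pop_size / factor)
    if new_n < 1:
        raise ValueError("rescaling factor too large for this population size")
    # Use the realized ratio so the products are preserved after rounding N.
    ratio = params.pop_size / new_n
    return SimParams(
        pop_size=new_n,
        sel_coeff=params.sel_coeff * ratio,
        mut_rate=params.mut_rate * ratio,
        rec_rate=params.rec_rate * ratio,
    )


original = SimParams(pop_size=10_000, sel_coeff=0.001, mut_rate=1e-8, rec_rate=1e-8)
scaled = rescale(original, factor=10)
```

Note that the rescaled selection coefficient grows with the factor, so aggressive rescaling pushes the simulation towards strong selection in a small population, one way in which the rescaled dynamics can depart from the original ones.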


Genetics of single-cell protein abundance variation in large yeast populations

Frank W. Albert, Sebastian Treusch, Arthur H. Shockley, Joshua S. Bloom, Leonid Kruglyak
(Submitted on 25 Jul 2013)

Many DNA sequence variants influence phenotypes by altering gene expression. Our understanding of these variants is limited by sample sizes of current studies and by measurements of mRNA rather than protein abundance. We developed a powerful method for identifying genetic loci that influence protein expression in very large populations of the yeast Saccharomyces cerevisiae. The method measures single-cell protein abundance through the use of green-fluorescent-protein tags. We applied this method to 160 genes and detected many more loci per gene than previous studies. We also observed closer correspondence between loci that influence protein abundance and loci that influence mRNA abundance of a given gene. Most loci cluster at hotspot locations that influence multiple proteins – in some cases, more than half of those examined. The variants that underlie these hotspots have profound effects on the gene regulatory network and provide insights into genetic variation in cell physiology between yeast strains.

Speed of adaptation and genomic signatures in arms race and trench warfare models of host-parasite coevolution

Aurelien Tellier, Stefany Moreno-Game, Wolfgang Stephan
(Submitted on 25 Jul 2013)

Host and parasite population genomic data are increasingly used to discover novel major genes underlying coevolution, assuming that natural selection generates two distinguishable polymorphism patterns: selective sweeps and balancing selection. These genomic signatures would result from two coevolutionary dynamics, the trench warfare with fast cycles of allele frequencies and the arms race with slow recurrent fixation of alleles. However, based on genome scans for selection, few genes for coevolution have yet been found in hosts. To address this issue, we build a gene-for-gene model with genetic drift and mutation, and integrate coalescent simulations, to study observable genomic signatures at host and parasite loci. Contrary to conventional wisdom, we show that coevolutionary cycles are not faster under the trench warfare model than under the arms race, except for large population sizes and high values of coevolutionary costs. Based on the generated SNP frequencies, the expected balancing selection signature under the trench warfare dynamics appears to be observable only in parasite sequences, within a limited range of parameters, if effective population sizes are sufficiently large (>1000) and if selection has been acting for a long time (>4N generations). On the other hand, the typical signature of the arms race dynamics, i.e. selective sweeps, can be detected in parasite and to a lesser extent in host populations even if coevolution is recent. We suggest studying signatures of coevolution via population genomics of parasites rather than hosts, and caution against inferring coevolutionary dynamics based on the speed of coevolution.

Agalma: an automated phylogenomics workflow

Casey W. Dunn, Mark Howison, Felipe Zapata
(Submitted on 24 Jul 2013)

In the past decade, transcriptome data have become an important component of many phylogenetic studies. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies. We present Agalma, an automated tool that conducts phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, and enables rich HTML reports for all stages of the analysis. Agalma includes a test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at this https URL. Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended.

An Arrow-type result for inferring a species tree from gene trees

Mike Steel
(Submitted on 19 Jul 2013)

The reconstruction of a central tendency ‘species tree’ from a large number of conflicting gene trees is a central problem in systematic biology. Moreover, it becomes particularly problematic when taxon coverage is patchy, so that not all taxa are present in every gene tree. Here, we list four desirable properties that a method for estimating a species tree from gene trees should have. We show that while these can be achieved when taxon coverage is complete (by the Adams consensus method), they cannot all be satisfied in the more general setting of partial taxon coverage.

Guidelines for the design of evolve and resequencing studies

Robert Kofler, Christian Schlötterer
(Submitted on 18 Jul 2013)

Standing genetic variation provides a rich reservoir of potentially useful mutations facilitating the adaptation to novel environments. Experimental evolution studies have demonstrated that rapid and strong phenotypic responses to selection can also be obtained in the laboratory. When combined with next-generation sequencing technology, these experiments promise to identify the individual loci contributing to adaptation. Nevertheless, until now, very little is known about the design of such evolve and resequencing (E&R) studies. Here, we use forward simulations of entire genomes to evaluate different experimental designs that aim to maximize the power to detect selected variants. We show that low linkage disequilibrium in the starting population, population size, duration of the experiment and the number of replicates are the key factors in determining the power and accuracy of E&R studies. Furthermore, replication of E&R is more important for detecting the targets of selection than increasing the population size. Using an optimized design, beneficial loci with a selective advantage as low as s=0.005 can be identified at the nucleotide level. Even when a large number of loci are selected simultaneously, up to 56% can be reliably detected without incurring large numbers of false positives. Our computer simulations suggest that, with an adequate experimental design, E&R studies are a powerful tool to identify adaptive mutations from standing genetic variation and thereby provide an excellent means to analyze the trajectories of selected alleles in evolving populations.
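To give a feel for why the duration of the experiment matters for weakly selected alleles, the sketch below computes the expected (deterministic, drift-free) frequency trajectory of a beneficial allele under a simple haploid selection model. This is a textbook simplification, not the whole-genome forward simulations used in the study; the function name `allele_trajectory`, the starting frequency and the number of generations are hypothetical choices, and only s=0.005 comes from the abstract.

```python
def allele_trajectory(p0, s, generations):
    """Expected allele frequency per generation under haploid selection.

    Uses the recursion p' = p(1 + s) / (1 + p*s) and ignores drift,
    linkage and recombination.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("starting frequency must be strictly between 0 and 1")
    freqs = [p0]
    for _ in range(generations):
        p = freqs[-1]
        freqs.append(p * (1.0 + s) / (1.0 + p * s))
    return freqs


# Hypothetical experiment: start at 10% frequency and run for 60 generations.
trajectory = allele_trajectory(p0=0.1, s=0.005, generations=60)
final_change = trajectory[-1] - trajectory[0]
```

Under these assumptions the expected frequency change over 60 generations is only a few percentage points, small enough to be easily obscured by drift in a single population, which is consistent with the abstract's emphasis on duration and replication.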

Computational aspects of DNA mixture analysis

Therese Graversen, Steffen Lauritzen
(Submitted on 18 Jul 2013)

Statistical analysis of DNA mixtures is known to pose computational challenges due to the enormous state space of possible DNA profiles. We propose a Bayesian network representation for genotypes, allowing computations to be performed locally involving only a few alleles at each step. In addition, we describe a general method for computing the expectation of a product of discrete random variables using auxiliary variables and probability propagation in a Bayesian network, which in combination with the genotype network allows efficient computation of the likelihood function and various other quantities relevant to the inference. Lastly, we introduce a set of diagnostic tools for assessing the adequacy of the model for describing a particular dataset.

Integrating sequencing datasets to form highly confident SNP and indel genotype calls for a whole human genome

Justin M. Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, Marc Salit
(Submitted on 17 Jul 2013)

Clinical adoption of human genome sequencing requires methods with known accuracy of genotype calls at millions or billions of positions across a genome. Previous work showing discordance amongst sequencing methods and algorithms has made clear the need for a highly accurate set of genotypes across a whole genome that could be used as a benchmark. We present methods we used to make highly confident SNP, indel, and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. To minimize bias towards any sequencing method, we integrate 9 whole genome and 3 exome datasets from 5 different sequencing platforms (Illumina, Complete Genomics, SOLiD, 454, and Ion Torrent), 7 mappers, and 3 variant callers. The resulting genotype calls are highly sensitive and specific, and allow performance assessment of more difficult variants than typically investigated using microarrays as a benchmark. Regions for which no confident genotype call could be made are identified as uncertain, and classified by reason for uncertainty (e.g. low coverage or mapping/alignment bias). As a community resource, we have integrated our highly confident genotype calls into the GCAT website for interactive assessment of false positive and negative rates of different datasets and bioinformatics methods. Application of the concepts of our integration process may be interesting beyond whole genome sequencing, for other measurement problems with large datasets from multiple methods, where none of the methods is a Reference Method that can be relied upon as highly sensitive and specific.

A model-based approach for identifying signatures of balancing selection in genetic data

Michael DeGiorgio, Kirk E. Lohmueller, Rasmus Nielsen
(Submitted on 16 Jul 2013)

While much effort has focused on detecting positive and negative directional selection in the human genome, relatively little work has been devoted to balancing selection. This lack of attention is likely due to the paucity of sophisticated methods for identifying sites under balancing selection. Here we develop two composite likelihood ratio tests for detecting balancing selection. Using simulations, we show that these methods outperform competing methods under a variety of assumptions and demographic models. We apply the new methods to whole-genome human data, and find a number of previously-identified loci with strong evidence of balancing selection, including several HLA genes. Additionally, we find evidence for many novel candidates, the strongest of which is FANK1, an imprinted gene that suppresses apoptosis, is expressed during meiosis in males, and displays marginal signs of segregation distortion. We hypothesize that balancing selection acts on this locus to stabilize the segregation distortion and negative fitness effects of the distorter allele. Thus, our methods are able to reproduce many previously-hypothesized signals of balancing selection, as well as discover novel interesting candidates.