Robust forward simulations of recurrent positive selection

Lawrence H. Uricchio, Ryan D. Hernandez
(Submitted on 24 Jul 2013)

It is well known that recurrent positive selection reduces the amount of genetic variation at linked sites. In recent decades, analytical results have been proposed to quantify the magnitude of this reduction with simple Wright-Fisher models and diffusion approximations. However, extending these results to include interference between selected sites, arbitrary selection schemes, and complicated demographic processes has proved to be challenging. Forward simulation can provide insights into these processes, but few studies have examined recurrent positive selection in a forward simulation context due to computational constraints. Here, we extend the flexible forward simulator SFS_CODE to greatly improve the efficiency of simulations of recurrent positive selection. Forward simulations are computationally intensive and often necessitate rescaling of relevant parameters (e.g., population size and sequence length) to achieve computational feasibility. However, it is not obvious that parameter rescaling will maintain expected patterns of diversity in all parameter regimes. We develop a simple method for parameter rescaling that provides the best possible computational performance for a given error tolerance, and a detailed theoretical analysis of the robustness of rescaling across the parameter space. These results show that ad hoc approaches to parameter rescaling under the recurrent hitchhiking model may not always provide sufficiently accurate dynamics, potentially skewing patterns of diversity in simulated DNA sequences.
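The kind of parameter rescaling the abstract refers to can be illustrated with a minimal sketch. It assumes one common convention, which is not necessarily the method developed in the paper: shrink the population size by a factor and multiply the per-generation rates by the same factor, so that the population-scaled quantities 4Nμ, 4Nr and 2Ns stay constant. The names `Params`, `rescale` and `scaled` are hypothetical and are not part of SFS_CODE.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    N: float   # diploid population size
    mu: float  # mutation rate per site per generation
    r: float   # recombination rate per site per generation
    s: float   # selection coefficient of beneficial mutations


def rescale(p: Params, factor: float) -> Params:
    """Shrink N by `factor` and scale mu, r and s up by the same factor.

    Assumption: this is the simple "keep the scaled parameters fixed"
    convention, used here only for illustration.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return Params(N=p.N / factor, mu=p.mu * factor,
                  r=p.r * factor, s=p.s * factor)


def scaled(p: Params) -> tuple:
    """Population-scaled parameters (theta, rho, gamma) = (4N*mu, 4N*r, 2N*s)."""
    return (4 * p.N * p.mu, 4 * p.N * p.r, 2 * p.N * p.s)


def same_scaled_parameters(a: Params, b: Params) -> bool:
    """True if a and b agree on theta, rho and gamma up to rounding error."""
    return all(math.isclose(x, y) for x, y in zip(scaled(a), scaled(b)))


original = Params(N=10_000, mu=1e-8, r=1e-8, s=0.01)
small = rescale(original, 10)
print(small, same_scaled_parameters(original, small))
```

In this example the rescaled selection coefficient is ten times larger than the original (about 0.1), even though 2Ns is unchanged. Keeping the scaled parameters fixed therefore does not by itself guarantee that the simulated dynamics stay close to the original ones, which is the concern the abstract raises.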


Genetics of single-cell protein abundance variation in large yeast populations

Frank W. Albert, Sebastian Treusch, Arthur H. Shockley, Joshua S. Bloom, Leonid Kruglyak
(Submitted on 25 Jul 2013)

Many DNA sequence variants influence phenotypes by altering gene expression. Our understanding of these variants is limited by sample sizes of current studies and by measurements of mRNA rather than protein abundance. We developed a powerful method for identifying genetic loci that influence protein expression in very large populations of the yeast Saccharomyces cerevisiae. The method measures single-cell protein abundance through the use of green-fluorescent-protein tags. We applied this method to 160 genes and detected many more loci per gene than previous studies. We also observed closer correspondence between loci that influence protein abundance and loci that influence mRNA abundance of a given gene. Most loci cluster at hotspot locations that influence multiple proteins – in some cases, more than half of those examined. The variants that underlie these hotspots have profound effects on the gene regulatory network and provide insights into genetic variation in cell physiology between yeast strains.

Speed of adaptation and genomic signatures in arms race and trench warfare models of host-parasite coevolution

Aurelien Tellier, Stefany Moreno-Gámez, Wolfgang Stephan
(Submitted on 25 Jul 2013)

Host and parasite population genomic data are increasingly used to discover novel major genes underlying coevolution, assuming that natural selection generates two distinguishable polymorphism patterns: selective sweeps and balancing selection. These genomic signatures would result from two coevolutionary dynamics, the trench warfare with fast cycles of allele frequencies and the arms race with slow recurrent fixation of alleles. However, based on genome scans for selection, few genes for coevolution have yet been found in hosts. To address this issue, we build a gene-for-gene model with genetic drift and mutation, and integrate coalescent simulations to study observable genomic signatures at host and parasite loci. In contrast to the conventional wisdom, we show that coevolutionary cycles are not faster under the trench warfare model than under the arms race, except for large population sizes and high values of coevolutionary costs. Based on the generated SNP frequencies, the expected balancing selection signature under the trench warfare dynamics appears to be observable only in parasite sequences and only in a limited range of parameters: if effective population sizes are sufficiently large (>1000) and if selection has been acting for a long time (>4N generations). On the other hand, the typical signature of the arms race dynamics, i.e. selective sweeps, can be detected in parasite and to a lesser extent in host populations even if coevolution is recent. We suggest studying signatures of coevolution via population genomics of parasites rather than hosts, and caution against inferring coevolutionary dynamics based on the speed of coevolution.

An Arrow-type result for inferring a species tree from gene trees

Mike Steel
(Submitted on 19 Jul 2013)

The reconstruction of a central tendency 'species tree' from a large number of conflicting gene trees is a central problem in systematic biology. Moreover, it becomes particularly problematic when taxon coverage is patchy, so that not all taxa are present in every gene tree. Here, we list four desirable properties that a method for estimating a species tree from gene trees should have. We show that while these can be achieved when taxon coverage is complete (by the Adams consensus method), they cannot all be satisfied in the more general setting of partial taxon coverage.

Integrating sequencing datasets to form highly confident SNP and indel genotype calls for a whole human genome

Justin M. Zook, Brad Chapman, Jason Wang, David Mittelman, Oliver Hofmann, Winston Hide, Marc Salit
(Submitted on 17 Jul 2013)

Clinical adoption of human genome sequencing requires methods with known accuracy of genotype calls at millions or billions of positions across a genome. Previous work showing discordance amongst sequencing methods and algorithms has made clear the need for a highly accurate set of genotypes across a whole genome that could be used as a benchmark. We present methods we used to make highly confident SNP, indel, and homozygous reference genotype calls for NA12878, the pilot genome for the Genome in a Bottle Consortium. To minimize bias towards any sequencing method, we integrate 9 whole genome and 3 exome datasets from 5 different sequencing platforms (Illumina, Complete Genomics, SOLiD, 454, and Ion Torrent), 7 mappers, and 3 variant callers. The resulting genotype calls are highly sensitive and specific, and allow performance assessment of more difficult variants than typically investigated using microarrays as a benchmark. Regions for which no confident genotype call could be made are identified as uncertain, and classified into different reasons for uncertainty (e.g. low coverage, mapping/alignment bias, etc.). As a community resource, we have integrated our highly confident genotype calls into the GCAT website for interactive assessment of false positive and negative rates of different datasets and bioinformatics methods using our highly confident calls. Application of the concepts of our integration process may be interesting beyond whole genome sequencing, for other measurement problems with large datasets from multiple methods, where none of the methods is a Reference Method that can be relied upon as highly sensitive and specific.

Cloudbreak: Accurate and Scalable Genomic Structural Variation Detection in the Cloud with MapReduce

Christopher W. Whelan, Jeffrey Tyner, Alberto L’Abbate, Clelia Tiziana Storlazzi, Lucia Carbone, Kemal Sönmez
(Submitted on 9 Jul 2013)

The detection of genomic structural variations (SV) remains a difficult challenge in analyzing sequencing data, and the growing size and number of sequenced genomes have rendered SV detection a bona fide big data problem. MapReduce is a proven, scalable solution for distributed computing on huge data sets. We describe a conceptual framework for SV detection algorithms in MapReduce based on computing local genomic features, and use it to develop a deletion and insertion detection algorithm, Cloudbreak. On simulated and real data sets, Cloudbreak achieves accuracy improvements over popular SV detection algorithms, and genotypes variants from diploid samples. It provides dramatically shorter runtimes and the ability to scale to big data volumes on large compute clusters. Cloudbreak includes tools to set up and configure MapReduce (Hadoop) clusters on cloud services, enabling on-demand cluster computing. Our implementation and source code are available at this http URL

The Changing Geometry of a Fitness Landscape Along an Adaptive Walk

Devin Greene, Kristina Crona
(Submitted on 7 Jul 2013)

It has recently been noted that the relative prevalence of the various kinds of epistasis varies along an adaptive walk. This has been explained as a result of mean regression in NK model fitness landscapes. Here we show that this phenomenon occurs quite generally in fitness landscapes. We propose a simple and general explanation for this phenomenon, confirming the role of mean regression. We provide support for this explanation with simulations, and discuss the empirical relevance of our findings.

RNA secondary structure prediction from multi-aligned sequences

Michiaki Hamada
(Submitted on 8 Jul 2013)

It has been well accepted that the RNA secondary structures of most functional non-coding RNAs (ncRNAs) are closely related to their functions and are conserved during evolution. Hence, prediction of conserved secondary structures from evolutionarily related sequences is one important task in RNA bioinformatics; the methods are useful not only to further functional analyses of ncRNAs but also to improve the accuracy of secondary structure predictions and to find novel functional RNAs from the genome. In this review, I focus on common secondary structure prediction from a given aligned RNA sequence, in which one secondary structure whose length is equal to that of the input alignment is predicted. I systematically review and classify existing tools and algorithms for the problem, by utilizing the information employed in the tools and by adopting a unified viewpoint based on maximum expected gain (MEG) estimators. I believe that this classification will allow a deeper understanding of each tool and provide users with useful information for selecting tools for common secondary structure predictions.

Evaluating strategies of phylogenetic analyses by the coherence of their results

Blaise Li
(Submitted on 5 Jul 2013)

I propose an approach to identify, among several strategies of phylogenetic analysis, those producing the most accurate results. This approach is based on the hypothesis that the more a result is reproduced from independent data, the more it reflects the historical signal common to the analysed data. Under this hypothesis, the capacity of an analytical strategy to extract historical signal should correlate positively with the coherence of the obtained results. I apply this approach to a series of analyses on empirical data, basing the coherence measure on the Robinson-Foulds distances between the obtained trees. At first approximation, the analytical strategies most suitable for the data produce the most coherent results. However, risks of false positives and false negatives are identified, which are difficult to rule out.
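As a rough illustration of a coherence measure based on Robinson-Foulds distances, the sketch below computes the distance between rooted trees written as nested tuples and then averages it over all pairs of trees. It makes several assumptions: it uses the rooted (clade-based) variant of the distance, the unnormalised mean pairwise distance stands in for whatever coherence measure the paper actually defines, and the names `clades`, `rf_distance` and `mean_pairwise_rf` are hypothetical.

```python
from itertools import combinations


def clades(tree):
    """Return (all leaves, set of non-trivial clades) of a nested-tuple rooted tree.

    Leaves are any non-tuple values; internal nodes are tuples of children.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return all_leaves, found


def rf_distance(a, b):
    """Rooted Robinson-Foulds distance: number of clades found in only one tree."""
    leaves_a, clades_a = clades(a)
    leaves_b, clades_b = clades(b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaf set")
    return len(clades_a ^ clades_b)


def mean_pairwise_rf(trees):
    """Average RF distance over all pairs; lower values mean more coherent results."""
    pairs = list(combinations(trees, 2))
    if not pairs:
        raise ValueError("need at least two trees")
    return sum(rf_distance(a, b) for a, b in pairs) / len(pairs)


balanced = (("A", "B"), ("C", "D"))
swapped = (("A", "C"), ("B", "D"))
ladder = ((("A", "B"), "C"), "D")
print(rf_distance(balanced, swapped), rf_distance(balanced, ladder))
print(mean_pairwise_rf([balanced, swapped, ladder]))
```

Under this sketch, an analytical strategy whose trees from independent data sets have a lower mean pairwise distance would count as producing more coherent results.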

Evolution on genotype networks leads to phenotypic entrapment

Susanna Manrubia, José A. Cuesta
(Submitted on 3 Jul 2013)

Large sets of genotypes give rise to the same phenotype because phenotypic expression is highly redundant. Accordingly, a population can accept mutations without altering its phenotype, as long as they transform its genotype into another one on the same set. By linking every pair of genotypes that are mutually accessible through mutation, genotypes organize themselves into genotype networks (GN). These networks are known to be heterogeneous and assortative. As these features condition the probability that mutations keep the phenotype unchanged—hence becoming blind to natural selection—it follows that the topology of the GN will influence the evolutionary dynamics of the population. In this letter we analyze this effect by studying the dynamics of random walks (RW) on assortative networks with arbitrary topology. We find that the probability that a RW leaves the network is smaller the longer the time spent in it—i.e., the process is not Markovian. From the biological viewpoint, this “phenotypic entrapment” entails an acceleration in the fixation of neutral mutations, thus implying a non-uniform increase in the ticking rate of the molecular clock with the age of branches in phylogenetic trees. We also show that this effect is stronger the larger the fitness of the current phenotype relative to that of neighboring phenotypes.