SLiM: Simulating Evolution with Selection and Linkage

Philipp W. Messer
(Submitted on 14 Jan 2013)

SLiM is an efficient forward population genetic simulation designed for studying the effects of linkage and selection on a chromosome-wide scale. The program can incorporate complex scenarios of demography and population substructure, various models for selection and dominance of new mutations, arbitrary gene and chromosomal structure, and user-defined recombination maps.

Consider public archiving for your dissertation

This guest post is by Carl Boettiger (@cboettig). Carl is a postdoc with interests in theoretical and applied ecology, evolution, and phylogenetics. He’s a supporter of open access and open science, and recently posted his PhD thesis to figshare (see discussion with him on the merits of theses on figshare and University archives here).

As researchers we spend an immense amount of time generating products other than papers. While we go to great lengths to see that our papers are published in just the right place to be seen by our colleagues (fretting about the different impact factors, perceived audience, editorial boards, open access policies, and many other factors that determine just how a paper will see the light of day), other products of our labors largely languish on forgotten hard-drives from long ago.

Among the items that receive a considerable investment of blood, sweat and tears, not only in producing but in formatting just right, is the PhD dissertation. As much of this work will no doubt eventually make its way into various formal publications, if it hasn’t already, it is easy to view the process more as ritual than practical, whose only outcome will be another dusty black cover to grace the darkest shelves of the University library and the office of any adviser over fifty. Yet dissertations have more practical uses than bookends as well.

A dissertation is frequently the first time certain results will see the light of day, and may offer a more accessible introduction with a more complete review of background material than a published paper, thanks to the long-hand monograph style that seems to be out of vogue in the peer-reviewed literature. Dissertation acknowledgements often provide a wonderful snapshot into the toils of a PhD in recognizing contributions and support. And while the published results may appear only in journals requiring subscriptions, the author can almost always still release the original thesis as open access to gain the potential benefits of larger readership.[1]

While some dissertations have been important references to me during my own PhD and beyond, they aren’t always easy to find; for me, authors’ webpages have been a more common source than University or publisher catalogs. Meanwhile, many other researchers do not even mention their dissertations on their own websites. Today, there are better and easier alternatives for sharing your dissertation.

An increasing recognition of other products of research has led to a proliferation of possible outlets to share research materials. Repositories such as arXiv and Figshare are indexed by Google Scholar, provide reliable persistent storage, and permanent identifiers or DOIs that can make it easy to cite or link.

[1]: e.g. see:
1. Gargouri, Y. et al. Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research. PLoS ONE 5, e13636 (2010).
2. Eysenbach, G. Citation advantage of open access articles. PLoS Biology 4, e157 (2006).

Improving the Efficiency of Genomic Selection

Marco Scutari, Ian Mackay, David J. Balding
(Submitted on 10 Jan 2013)

We investigate two approaches to increase the efficiency of phenotypic prediction from genome-wide markers, which is a key step for genomic selection (GS) in plant and animal breeding. The first approach is feature selection based on Markov blankets, which provide a theoretically-sound framework for identifying non-informative markers. Fitting GS models using only the informative markers results in simpler models, which may allow cost savings from reduced genotyping. We show that this is accompanied by no loss, and possibly a small gain, in predictive power for four GS models: partial least squares (PLS), ridge regression, LASSO and elastic net. The second approach is the choice of kinship coefficients for genomic best linear unbiased prediction (GBLUP). We compare kinships based on different combinations of centring and scaling of marker genotypes, and a newly proposed kinship measure that adjusts for linkage disequilibrium (LD).
We illustrate the use of both approaches and examine their performances using three real-world data sets from plant and animal genetics. We find that elastic net with feature selection and GBLUP using LD-adjusted kinships performed similarly well, and were the best-performing methods in our study.

A genome-wide survey of genetic variation in gorillas using reduced representation sequencing

Aylwyn Scally, Bryndis Yngvadottir, Yali Xue, Qasim Ayub, Richard Durbin, Chris Tyler-Smith
(Submitted on 9 Jan 2013)

All non-human great apes are endangered in the wild, and it is therefore important to gain an understanding of their demography and genetic diversity. To date, however, genetic studies within these species have largely been confined to mitochondrial DNA and a small number of other loci. Here, we present a genome-wide survey of genetic variation in gorillas using a reduced representation sequencing approach, focusing on the two lowland subspecies. We identify 3,274,491 polymorphic sites in 14 individuals: 12 western lowland gorillas (Gorilla gorilla gorilla) and 2 eastern lowland gorillas (Gorilla beringei graueri). We find that the two species are genetically distinct, based on levels of heterozygosity and patterns of allele sharing. Focusing on the western lowland population, we observe evidence for population substructure, and a deficit of rare genetic variants suggesting a recent episode of population contraction. In western lowland gorillas, there is an elevation of variation towards telomeres and centromeres on the chromosomal scale. On a finer scale, we find substantial variation in genetic diversity, including a marked reduction close to the major histocompatibility locus, perhaps indicative of recent strong selection there. These findings suggest that despite their maintaining an overall level of genetic diversity equal to or greater than that of humans, population decline, perhaps associated with disease, has been a significant factor in recent and long-term pressures on wild gorilla populations.

Dynamics of adaptation: extreme value domains, distance to fitness optimum and fitness correlations

Sarada Seetharaman, Kavita Jain
(Submitted on 8 Jan 2013)

We study the properties of adaptive walk performed by a maladapted asexual population in which beneficial mutations fix sequentially until a local fitness peak is reached. Here we consider three factors that govern the adaptation dynamics: the extreme value domain of beneficial mutations, initial distance to the local fitness optimum and the correlations amongst the fitnesses. We show that there is a transition in the behaviour of the walk length and average fitness fixed during adaptation when the mean and variance of the fitness distribution respectively become infinite. When the mean is finite, walk length decreases logarithmically with initial fitness but is a constant otherwise. We also find that the walks are longer for faster decaying fitness distributions and correlated fitnesses. For fitness distributions with finite variance, the fitness fixed during initial steps does not depend on the fitness of the local optimum but increases with the local peak fitness otherwise. Interestingly, the fitness difference between successive steps shows a pattern of diminishing returns for bounded distributions and accelerating returns for fat-tailed distributions. These trends are found to be robust with respect to fitness correlations.

A comparative analysis of transcription factor expression during metazoan embryonic development

Alicia Schep, Boris Adryan
(Submitted on 8 Jan 2013)

During embryonic development, a complex organism is formed from a single starting cell. These processes of growth and differentiation are driven by large transcriptional changes, which are following the expression and activity of transcription factors (TFs). This study sought to compare TF expression during embryonic development in a diverse group of metazoan animals: representatives of vertebrates (Danio rerio, Xenopus tropicalis), a chordate (Ciona intestinalis) and invertebrate phyla such as insects (Drosophila melanogaster, Anopheles gambiae) and nematodes (Caenorhabditis elegans) were sampled. The different species showed overall very similar TF expression patterns, with TF expression increasing during the initial stages of development. C2H2 zinc finger TFs were over-represented and Homeobox TFs were under-represented in the early stages in all species. We further clustered TFs for each species based on their quantitative temporal expression profiles. This showed very similar TF expression trends in development in vertebrate and insect species. However, analysis of the expression of orthologous pairs between more closely related species showed that expression of most individual TFs is not conserved, following the general model of duplication and diversification. The degree of similarity between TF expression between Xenopus tropicalis and Danio rerio followed the hourglass model, with the greatest similarity occurring during the early tailbud stage in Xenopus tropicalis and the late segmentation stage in Danio rerio. However, for Drosophila melanogaster and Anopheles gambiae there were two periods of high TF transcriptome similarity, one during the Arthropod phylotypic stage at 8-10 hours into Drosophila development and the other later at 16-18 hours into Drosophila development.

Our paper: A statistical framework for joint eQTL analysis in multiple tissues

This guest post is by Timothée Flutre and William Wen on their paper “A statistical framework for joint eQTL analysis in multiple tissues” with Matthew Stephens and Jonathan Pritchard arXived here.

As large eQTL data sets are being produced for multiple tissues, it is important to leverage all the information in the data to detect eQTLs as well as to provide ways to interpret them. Motivated by this, we developed a statistical framework for eQTL discovery that allows for joint analysis of multiple tissues. Though the details are in the paper, in this blog post we take the opportunity to highlight what we think are the main statistical features.

Looking for eQTLs in multiple tissues immediately raises the question of tissue specificity. In this paper, we define an eQTL as “active” in a particular tissue if it has a non-zero genetic effect on the expression of the target gene in this tissue. Most published works implicitly use this definition to refer to tissue-specific eQTLs. One could take issue with this definition: for example, if an eQTL is very strong in one tissue and very weak in another then one might think of this as “tissue-specific”, or at least “tissue-inconsistent”, but in our paper we stick with the binary representation of activity as a useful first step. We represent the activity pattern of a potential eQTL by a binary vector called a configuration (see Han & Eskin, PLoS Genetics 2012, and Wen & Stephens, arXiv 1111.1210). As an example, the following configuration, (110), corresponds to the case where three tissues are analyzed and the eQTL is active only in the first two tissues.

In brief, we can highlight three important features of our model. First, by mapping eQTLs jointly rather than in each tissue separately, our model borrows information between the tissues in which an eQTL is active, and thereby greatly increases power. This is somewhat equivalent to relaxing the threshold of significance in the second tissue when one has already detected the eQTL in the first tissue. Second, by comparing evidence in the data for each configuration, our model provides an interpretation of how an eQTL acts in multiple tissues. In statistical terms, as more than two hypotheses are being tested (for three tissues there are 7 non-null configurations), one usually speaks of model comparison. Third, our model also estimates the proportion of each configuration in the data set. This is achieved by pooling all genes together, and thus borrowing information between them.
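To make the count of configurations concrete, here is a minimal sketch that enumerates every non-null configuration for a given number of tissues. The function name `configurations` is hypothetical and chosen only for illustration; it is not taken from the paper or its software.

```python
from itertools import product


def configurations(n_tissues):
    """Return every non-null activity configuration as a string such as '110'.

    A 1 at position i means the eQTL is active in tissue i. The all-zero
    vector (no activity anywhere) is the null and is left out.
    """
    return [
        "".join(str(bit) for bit in bits)
        for bits in product((1, 0), repeat=n_tissues)
        if any(bits)
    ]


configs = configurations(3)
print(len(configs))
print(configs)
```

For S tissues the list has 2^S - 1 entries, so the number of hypotheses to compare grows quickly as tissues are added.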

Besides simulations, we re-analyzed the largest data set available so far, 3 tissues from 75 individuals, from Dimas et al (Science 2009). Our joint analysis model has more power and detects substantially more eQTLs than a tissue-by-tissue analysis (63% more at FDR=0.05). Moreover, we show how a tissue-by-tissue analysis can greatly overestimate the fraction of tissue-specific eQTLs, because it does not account for incomplete power when testing in each tissue separately. Qualitatively, the discrepancy between both methods is very large on this data set. Indeed, according to the tissue-by-tissue analysis, only 19% of eQTLs are consistent across tissues, i.e. configuration (111), whereas our model estimates >80% of eQTLs to be consistent. After checking several of our assumptions, we are fairly confident in our estimate. Moreover, such a high proportion of consistent eQTLs is also obtained with the pairwise approach originally used by Nica et al (2011).
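The effect of incomplete power can be illustrated with a back-of-the-envelope calculation. This is a toy sketch and not the analysis from the paper: it assumes that every eQTL is truly active in all tissues, that each tissue is tested with the same power, and that the tests are independent across tissues. The function name `apparent_consistent_fraction` is hypothetical.

```python
def apparent_consistent_fraction(power, n_tissues=3):
    """Share of detected eQTLs that a tissue-by-tissue analysis calls in every tissue.

    Toy assumptions: every eQTL is truly active in all tissues, each tissue
    has the same detection power, and tests are independent across tissues.
    """
    called_in_all = power ** n_tissues            # detected in every tissue
    called_in_any = 1 - (1 - power) ** n_tissues  # detected in at least one tissue
    return called_in_all / called_in_any


for power in (0.3, 0.5, 0.8):
    print(power, round(apparent_consistent_fraction(power), 3))
```

Under these assumptions, with 50% power per tissue only one detected eQTL in seven (about 14%) would look consistent, even though all of them are. Real data violate the independence assumption, so this shows the direction of the bias rather than its size.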

The analysis of this specific data set therefore indicates that most eQTLs are consistent across tissues. Yet we find examples of strong tissue-specific eQTLs, such as between gene ENSG00000166839 (ANKDD1A) and SNP rs1628955:

[Figure: box-forest_strong-specific_rmvPCs_ENSG00000166839-rs1628955]

Haldane’s Sieve sifts through 2012

We started Haldane’s Sieve back in August 2012 to promote a preprint culture in evolutionary genetics (see here for more details). Since starting we’ve had ~150 posts, the vast majority of which have been preprint abstracts. We’ve had over 30,000 views from all over the world. During this time we’ve also seen more journals adopting favorable policies towards preprints, in particular Genetics and Genome Research, reflecting a growing recognition that preprint archives are a natural stage in the publication process. Overall it has been great to see the support for Haldane’s Sieve from so many people; we hope that it, and preprints more generally, will go from strength to strength in 2013.

Below are our top 10 most viewed pages of 2012. Each one of these has received hundreds of views. One noticeable trend is that many of them are the “Our paper” posts, which suggests that writing a blurb about your paper for Haldane’s Sieve is a great way to bring it more attention. Let us know if you want to write a post on your preprint article, or a quick post on a preprint you’ve enjoyed.

  1. Horizontal gene transfer may explain variation in θs. Maddamsetti et al. respond to a recent paper by Martincorena et al. The attention garnered by this post is undoubtedly due to its lively comment section. Martincorena et al. themselves responded with a pre-print here.
  2. Our paper: The genetic prehistory of southern Africa. Pickrell et al. write about their preprint. Their published paper is out at Nature Communications.
  3. Thoughts on: Finding the sources of missing heritability in a yeast cross. Joe Pickrell’s post about Bloom et al.
  4. Our paper: The geography of recent genetic ancestry across Europe. Peter Ralph and Graham Coop write about their arXived paper.
  5. Thoughts on: The date of interbreeding between Neandertals and modern humans. Graham Coop’s post on Sankararaman et al.’s paper. The authors’ post on their paper (Our paper: The date of interbreeding between Neandertals and modern humans) also made our top 10. The paper was published in PLoS Genetics.
  6. Our paper: Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. Casey Bergman’s post on his group’s paper by Richardson et al. The paper was published in PLoS Genetics.
  7. Our paper: A genetic variant near olfactory receptor genes influences cilantro preference. Nick Eriksson’s post about 23andMe’s preprint. The paper appeared in Flavour.
  8. Species Identification and Unbiased Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences. Ong et al.
  9. Our paper: Population genomics of sub-Saharan Drosophila melanogaster: African diversity and non-African admixture. John Pool’s post on Pool et al. The paper appeared in PLoS Genetics.
  10. Blood ties: ABO is a trans-species polymorphism in primates. Ségurel et al.’s paper which Laure Ségurel posted about here. The paper came out in PNAS.

Optimal Assembly for High Throughput Shotgun Sequencing

Guy Bresler, Ma’ayan Bresler, David Tse
(Submitted on 1 Jan 2013)

We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of complete reconstruction. We derive a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. We design a de Bruijn graph-based assembly algorithm which can achieve very close to the lower bound for repeat statistics of a wide range of sequenced genomes.

Most viewed on Haldane’s Sieve: December 2012

The most viewed preprints on Haldane’s Sieve in December 2012 were: