Multi-locus analysis of genomic time series data from experimental evolution


Jonathan Terhorst, Yun S. Song

Genomic time series data generated by evolve-and-resequence (E&R) experiments offer a powerful window into the mechanisms that drive evolution. However, standard population genetic inference procedures do not account for sampling serially over time, and new methods are needed to make full use of modern experimental evolution data. To address this problem, we develop a Gaussian process approximation to the multi-locus Wright-Fisher process with selection over a time course of tens of generations. The mean and covariance structure of the Gaussian process are obtained by computing the corresponding moments in discrete-time Wright-Fisher models conditioned on the presence of a linked selected site. This enables our method to account for the effects of linkage and selection, both along the genome and across sampled time points, in an approximate but principled manner. Using simulated data, we demonstrate the power of our method to correctly detect, locate and estimate the fitness of a selected allele from among several linked sites. We also study how this power changes for different values of selection strength, initial haplotypic diversity, population size, sampling frequency, experimental duration, number of replicates, and sequencing coverage depth. In addition to providing quantitative estimates of selection parameters from experimental evolution data, our model can be used by practitioners to design E&R experiments with requisite power. Finally, we explore how our likelihood-based approach can be used to infer other model parameters, including effective population size and recombination rate, and discuss extensions to more complex models.
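As a point of orientation for the kind of process being approximated, the following is a minimal sketch of a single-locus, haploid Wright-Fisher simulation with selection. It is not the authors' multi-locus Gaussian process method; the function name `simulate_wright_fisher`, the haploid fitness scheme (relative fitness `1 + s`), and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_wright_fisher(n_pop, s, p0, generations, rng):
    """Simulate an allele-frequency trajectory at one selected locus.

    Hypothetical haploid model: the favored allele has relative fitness 1 + s.
    """
    p = p0
    freqs = [p]
    for _ in range(generations):
        # Selection step: expected frequency after selection.
        p_sel = p * (1 + s) / (1 + s * p)
        # Drift step: binomial resampling of n_pop individuals.
        p = rng.binomial(n_pop, p_sel) / n_pop
        freqs.append(p)
    return np.array(freqs)

rng = np.random.default_rng(0)
# Illustrative values only: 1000 individuals, s = 0.05, 50 generations.
trajectory = simulate_wright_fisher(n_pop=1000, s=0.05, p0=0.1,
                                    generations=50, rng=rng)
```

Sampling such a trajectory at a handful of time points, across replicates, gives a rough picture of the time series data that an E&R experiment produces at a selected site.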

Long-term balancing selection in LAD1 maintains a missense trans-species polymorphism in humans, chimpanzees and bonobos

João C. Teixeira, Cesare de Filippo, Antje Weihmann, Juan R. Meneu, Fernando Racimo, Michael Dannemann, Birgit Nickel, Anne Fischer, Michel Halbwax, Claudine Andre, Rebeca Atencia, Matthias Meyer, Genís Parra, Svante Pääbo, Aida M. Andrés

Balancing selection maintains advantageous genetic and phenotypic diversity in populations. When selection acts for long evolutionary periods, selected polymorphisms may survive species splits and segregate in present-day populations of different species. Here, we investigated the role of long-term balancing selection in the evolution of protein-coding sequences in the Pan-Homo clade. We sequenced the exome of 20 humans, 20 chimpanzees and 20 bonobos and detected eight coding trans-species polymorphisms (trSNPs) that are shared among the three species and have segregated for approximately 14 million years of independent evolution. While the majority of these trSNPs were found in three genes of the MHC cluster, we also uncovered one coding trSNP (rs12088790) in the gene LAD1. All these trSNPs show clustering of sequences by allele rather than by species and also exhibit other signatures of long-term balancing selection, such as segregating at intermediate frequency and lying in a locus with high genetic diversity. Here we focus on the trSNP in LAD1, a gene that encodes Ladinin-1, a collagenous anchoring filament protein of basement membrane that is responsible for maintaining cohesion at the dermal-epidermal junction; the protein is also an autoantigen responsible for linear IgA disease. This trSNP results in a missense change (Leucine257Proline) and, besides altering the protein sequence, is associated with changes in gene expression of LAD1.

The effective founder effect in a spatially expanding population

Benjamin Marco Peter, Montgomery Slatkin

The gradual loss of diversity associated with range expansions is a well-known pattern observed in many species, and can be explained with a serial founder model. We show that under a branching process approximation, this loss in diversity is due to the difference in offspring variance between individuals at and away from the expansion front, which allows us to measure the strength of the founder effect in terms of an effective founder size. We demonstrate that the predictions from the branching process model fit very well with Wright-Fisher forward simulations and backwards simulations under a modified Kingman coalescent, and further show that estimates of the effective founder size are robust to possibly confounding factors such as migration between subpopulations. We apply our method to a data set of Arabidopsis thaliana, where we find that the founder effect is about three times stronger in the Americas than in Europe, which may be attributed to the more recent, faster expansion.

Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling


Stinus Lindgreen, Sinan Ugur Umu, Alicia Sook-Wei Lai, Hisham Eldai, Wenting Liu, Stephanie McGimpsey, Nicole Wheeler, Patrick J. Biggs, Nick R. Thomson, Lars Barquist, Anthony M. Poole, Paul P. Gardner
Comments: 16 pages, 4 figures
Subjects: Genomics (q-bio.GN)

Noncoding RNAs are increasingly recognized as integral to a wide range of biological processes, including translation, gene regulation, host-pathogen interactions and environmental sensing. While genomics is now a mature field, our capacity to identify noncoding RNA elements in bacterial and archaeal genomes is hampered by the difficulty of de novo identification. The emergence of new technologies for characterizing transcriptome outputs, notably RNA-seq, is improving noncoding RNA identification and expression quantification. However, a major challenge is to robustly distinguish functional outputs from transcriptional noise. To establish whether annotation of existing transcriptome data has effectively captured all functional outputs, we analysed over 400 publicly available RNA-seq datasets spanning 37 different Archaea and Bacteria. Using comparative tools, we identify close to a thousand highly expressed candidate noncoding RNAs. However, our analyses reveal that the capacity to identify noncoding RNA outputs is strongly dependent on phylogenetic sampling. Surprisingly, and in stark contrast to protein-coding genes, the phylogenetic window for effective use of comparative methods is perversely narrow: aggregating public datasets only produced one phylogenetic cluster where these tools could be used to robustly separate unannotated noncoding RNAs from a null hypothesis of transcriptional noise. Our results show that for the full potential of transcriptomics data to be realized, a change in experimental design is paramount: effective transcriptomics requires phylogeny-aware sampling.

SRST2: Rapid genomic surveillance for public health and hospital microbiology labs

Michael Inouye, Harriet Dashnow, Lesley Raven, Mark B Schultz, Bernard J Pope, Takehiro Tomita, Justin Zobel, Kathryn E Holt

Rapid molecular typing of bacterial pathogens is critical for public health epidemiology, surveillance and infection control, yet routine use of whole genome sequencing (WGS) for these purposes poses significant challenges. Here we present SRST2, a tool for fast and accurate detection of genes, alleles and multi-locus sequence types from WGS data, which outperforms assembly-based methods. Using >900 genomes from common pathogens, we demonstrate SRST2’s utility for rapid genome surveillance in public health laboratory and hospital infection control settings.

svaseq: removing batch effects and other unwanted noise from sequencing data

Jeffrey Leek

It is now well known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. We introduced surrogate variable analysis for estimating these artifacts by (1) identifying the part of the genomic data only affected by artifacts and (2) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors. Here I describe an update to the sva approach that can be applied to analyze count data or FPKMs from sequencing experiments. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. These updates are available through the surrogate variable analysis (sva) Bioconductor package.
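The core idea of estimating an artifact from singular vectors can be sketched in a few lines. This is a simplified illustration, not the algorithm implemented in the sva Bioconductor package: it regresses out the known covariates for every gene and takes the leading singular vector of the residuals as a single surrogate variable. The function name `estimate_surrogate_variable` and the simulated data are assumptions made for the example.

```python
import numpy as np

def estimate_surrogate_variable(data, design):
    """Leading singular vector (over samples) of the residuals of data on design.

    data:   genes x samples matrix of expression values.
    design: samples x covariates matrix of known variables.
    """
    # Least-squares fit of every gene on the known covariates.
    coef, *_ = np.linalg.lstsq(design, data.T, rcond=None)
    residuals = data.T - design @ coef  # samples x genes
    # Left singular vectors of this matrix are indexed by sample.
    u, _, _ = np.linalg.svd(residuals, full_matrices=False)
    return u[:, 0]

# Simulated example (hypothetical): a known group effect plus a hidden batch.
rng = np.random.default_rng(1)
n_genes, n_samples = 200, 20
group = np.repeat([0.0, 1.0], 10)   # known primary variable
batch = np.tile([0.0, 1.0], 10)     # hidden artifact, balanced across groups
data = (np.outer(rng.normal(size=n_genes), group)
        + np.outer(rng.normal(scale=2.0, size=n_genes), batch)
        + rng.normal(size=(n_genes, n_samples)))

design = np.column_stack([np.ones(n_samples), group])
sv = estimate_surrogate_variable(data, design)
corr = np.corrcoef(sv, batch)[0, 1]
```

In this toy setting the estimated vector `sv` should track the hidden batch labels closely (up to sign), so it could be added to the design matrix as an adjustment factor in a downstream analysis.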

Redefining Genomic Privacy: Trust and Empowerment

Arvind Narayanan, Kenneth Yocum, David Glazer, Nita Farahany, Maynard Olson, Lincoln D. Stein, James B. Williams, Jan A. Witkowski, Robert C. Kain, Yaniv Erlich

Fulfilling the promise of the genetic revolution requires the analysis of large datasets containing information from thousands to millions of participants. However, sharing human genomic data requires protecting subjects from potential harm. Current models rely on de-identification techniques that treat privacy versus data utility as a zero-sum game. Instead we propose using trust-enabling techniques to create a solution where researchers and participants both win. To do so we introduce three principles that facilitate trust in genetic research and outline one possible framework built upon those principles. Our hope is that such trust-centric frameworks provide a sustainable solution that reconciles genetic privacy with data sharing and facilitates genetic research.

Efficient Algorithms for de novo Assembly of Alternative Splicing Events from RNA-seq Data

Gustavo Sacomoto
(Submitted on 23 Jun 2014)

In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the fundamental idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events. Finally, we show that it identifies more correct events than general purpose transcriptome assemblers.
In order to deal with ever-increasing volumes of NGS data, we put extra effort into making KisSplice as scalable as possible. First, to improve its running time, we propose a new polynomial delay algorithm to enumerate bubbles. We show that it is several orders of magnitude faster than previous approaches. Then, to reduce its memory consumption, we propose a new compact way to build and represent a de Bruijn graph. We show that our approach uses 30% to 40% less memory than the state of the art, with an insignificant impact on the construction time.
Additionally, we apply the same techniques developed to list bubbles in two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson’s algorithm. This is the first improvement to this problem in almost 40 years. We then consider a different parameterization of the classical K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms with the same time complexities but using exponentially less memory than previous approaches.
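To make the notion of a bubble concrete, here is a minimal sketch of how a single-nucleotide variant creates a branch in a de Bruijn graph. It is not KisSplice's enumeration algorithm or its compact graph representation; the helper names `build_de_bruijn` and `branching_nodes`, the toy reads, and the choice of k = 4 are assumptions for illustration.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor: candidate openings of a bubble."""
    return sorted(node for node, successors in graph.items()
                  if len(successors) > 1)

# Two toy reads that differ at a single position (C versus G).
reads = ["ACGTACCTGA", "ACGTAGCTGA"]
graph = build_de_bruijn(reads, k=4)
openings = branching_nodes(graph)  # ['GTA']
```

The two paths leaving `GTA` rejoin at `CTG`, which is the pair of diverging and reconverging paths that the abstract calls a bubble. Real data would also require handling reverse complements and sequencing errors, which this sketch ignores.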

Autosomal admixture levels are informative about sex bias in admixed populations

Amy Goldberg, Paul Verdu, Noah A Rosenberg

Sex-biased admixture has been observed in a wide variety of admixed populations. Genetic variation in sex chromosomes and ratios of quantities computed from sex chromosomes and autosomes have often been examined in order to infer patterns of sex-biased admixture, typically using statistical approaches that do not mechanistically model the complexity of a sex-specific history of admixture. Here, expanding on a model of Verdu & Rosenberg (2011) that did not include sex specificity, we develop a model that mechanistically examines sex-specific admixture histories. Under the model, multiple source populations contribute to an admixed population, potentially with their male and female contributions varying over time. In an admixed population descended from two source groups, we derive the moments of the distribution of the autosomal admixture fraction from a specific source population as a function of sex-specific introgression parameters and time. Considering admixture processes that are constant in time, we demonstrate that surprisingly, although the mean autosomal admixture fraction from a specific source population does not reveal a sex bias in the admixture history, the variance of autosomal admixture is informative about sex bias. Specifically, the long-term variance decreases as the sex bias from a contributing source population increases. This result can be viewed as analogous to the reduction in effective population size for populations with an unequal number of breeding males and females. Our approach can contribute to methods for inference of the history of complex sex-biased admixture processes by enabling consideration of the effect of sex-biased admixture on autosomal DNA.
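The analogy drawn in the abstract refers to the classical result that, with N_m breeding males and N_f breeding females, the effective population size is 4 N_m N_f / (N_m + N_f). The short sketch below evaluates that standard formula only; it does not implement the admixture model of the paper, and the function name and example numbers are assumptions for illustration.

```python
def effective_population_size(n_males, n_females):
    """Classical effective size for unequal numbers of breeding males and females."""
    return 4 * n_males * n_females / (n_males + n_females)

# Same census size of 100 breeders, with and without a skewed sex ratio.
balanced = effective_population_size(50, 50)  # 100.0
skewed = effective_population_size(10, 90)    # 36.0
```

With the census size held at 100, skewing the sex ratio from 50:50 to 10:90 reduces the effective size from 100 to 36, which is the kind of reduction the abstract compares to the decrease in admixture variance under sex bias.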

Are phylogenetic patterns the same in anthropology and biology?

David Morrison

The use of phylogenetic methods in anthropological fields such as archaeology, linguistics and stemmatology (involving what are often called “culture data”) is based on an analogy between human cultural evolution and biological evolution. We need to understand this analogy thoroughly, including how well anthropology data fit the model of a phylogenetic tree, as used in biology. I provide a direct comparison of anthropology datasets with both phenotype and genotype datasets from biology. The anthropology datasets fit the tree model approximately as well as do the genotype data, which is detectably worse than the fit of the phenotype data. This is true for datasets with <500 parsimony-informative characters, as well as for larger datasets. This implies that cross-cultural (horizontal) processes have been important in the evolution of cultural artifacts, as well as branching historical (vertical) processes, and thus a phylogenetic network will be a more appropriate model than a phylogenetic tree.