The equilibrium allele frequency distribution for a population with reproductive skew

Ricky Der, Joshua B. Plotkin
(Submitted on 20 Jun 2013)

We study the population genetics of two neutral alleles under reversible mutation in the \Lambda-processes, a class of population models that features a skewed offspring distribution. We describe the shape of the equilibrium allele frequency distribution as a function of the model parameters. We show that the mutation rates can be uniquely identified from the equilibrium distribution, but that the form of the offspring distribution itself cannot be uniquely identified. We also introduce an infinite-sites version of the \Lambda-process, and we use it to study how reproductive skew influences standing genetic diversity in a population. We derive asymptotic formulae for the expected number of segregating sites as a function of sample size. We find that the Wright-Fisher model minimizes the equilibrium genetic diversity, for a given mutation rate and variance effective population size, compared to all other \Lambda-processes.
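
For orientation, the Wright-Fisher baseline mentioned in the abstract has a well-known closed-form equilibrium in the diffusion limit. The sketch below uses scaled mutation rates \theta_1 (toward the focal allele) and \theta_2 (away from it). These symbols and the scaling of the generator are assumptions chosen for illustration, not notation taken from the paper.

```latex
% Wright-Fisher diffusion with reversible mutation (assumed scaling):
%   L f(x) = \tfrac{1}{2} x(1-x) f''(x)
%          + \tfrac{1}{2}\bigl[\theta_1 (1-x) - \theta_2 x\bigr] f'(x)
% Its stationary density is a Beta(\theta_1, \theta_2) distribution:
\pi(x) = \frac{\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\,\Gamma(\theta_2)}
         \, x^{\theta_1 - 1} (1 - x)^{\theta_2 - 1},
\qquad 0 < x < 1 .
```

Under this convention the density is U-shaped when both rates are below 1 and unimodal when both exceed 1, which is the kind of shape dependence on model parameters that the abstract describes for the more general \Lambda-processes.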

Efficient Two-Stage Group Testing Algorithms for Genetic Screening

Michael Huber
(Submitted on 19 Jun 2013)

Efficient two-stage group testing algorithms that are particularly suited for rapid and less-expensive DNA library screening and other large scale biological group testing efforts are investigated in this paper. The main focus is on novel combinatorial constructions in order to minimize the number of individual tests at the second stage of a two-stage disjunctive testing procedure. Building on recent work by Levenshtein (2003) and Tonchev (2008), several new infinite classes of such combinatorial designs are presented.
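
To make the two-stage disjunctive procedure concrete, here is a minimal Python sketch. It is not one of the paper's constructions: the 3x3 row-and-column pooling design, the function name `two_stage_screen`, and the `is_positive` oracle are all hypothetical choices for illustration. A pool tests positive if it contains at least one positive item; any item that appears in a negative pool is cleared, and only the remaining candidates are tested individually in stage two.

```python
from typing import Callable, List, Set, Tuple


def two_stage_screen(
    n_items: int,
    pools: List[Set[int]],
    is_positive: Callable[[int], bool],
) -> Tuple[List[int], int, int]:
    """Return (positives found, stage-1 tests, stage-2 tests).

    `is_positive` is a hypothetical oracle standing in for a lab assay.
    """
    # Stage 1: disjunctive pool tests (positive iff any member is positive).
    pool_results = [any(is_positive(i) for i in pool) for pool in pools]

    # Every item in a negative pool is certainly negative.
    cleared: Set[int] = set()
    for pool, result in zip(pools, pool_results):
        if not result:
            cleared |= pool

    # Stage 2: test each remaining candidate individually.
    candidates = [i for i in range(n_items) if i not in cleared]
    positives = [i for i in candidates if is_positive(i)]
    return positives, len(pools), len(candidates)


# Illustrative design: 9 items on a 3x3 grid, pooled by row and by column.
ROW_COL_POOLS = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8},
                 {0, 3, 6}, {1, 4, 7}, {2, 5, 8}]

if __name__ == "__main__":
    truth = {4}
    print(two_stage_screen(9, ROW_COL_POOLS, lambda i: i in truth))
```

The number of stage-two tests depends on how well the pooling design separates positives from negatives, which is the quantity the combinatorial constructions in the paper aim to minimize.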

Reconstructing Native American Migrations from Whole-genome and Whole-exome Data

Simon Gravel, Fouad Zakharia, Jake K Byrnes, Marina Muzzio, Andres Moreno-Estrada, Juan L. Rodriguez-Flores, Eimear E. Kenny, Christopher R. Gignoux, Brian K. Maples, Wilfried Guiblet, Julie Dutil, Karla Sandoval, Gabriel Bedoya, The 1000 Genomes Project, Taras K Oleksyk, Andres Ruiz-Linares, Esteban G Burchard, Juan Carlos Martinez-Cruzado, Carlos D. Bustamante
(Submitted on 17 Jun 2013)

There is great scientific and popular interest in understanding the genetic history of populations in the Americas. We wish to understand when different regions of the continent were inhabited, where settlers came from, and how current inhabitants relate genetically to earlier populations. Recent studies unraveled parts of the genetic history of the continent using genotyping arrays and uniparental markers. The 1000 Genomes Project provides a unique opportunity for improving our understanding of population genetic history by providing over a hundred sequenced low-coverage genomes and exomes from Colombian (CLM), Mexican-American (MXL), and Puerto Rican (PUR) populations. Here, we explore the genomic contributions of African, European, and especially Native American ancestry to these populations. Estimated Native American ancestry is 48% in MXL, 25% in CLM, and 13% in PUR. Native American ancestry in PUR appears most closely related to Equatorial-Tucanoan-speaking populations, supporting a South American ancestry of the Taino people of the Caribbean. We present new methods to estimate the allele frequencies in the Native American fraction of the populations, and model their distribution using a three-population demographic model. The ancestral populations to the three groups likely split in close succession: the most likely scenario, based on a peopling of the Americas 16 thousand years ago (kya), supports that the MXL ancestors split 12.2 kya, with a subsequent split of the ancestors to CLM and PUR 11.7 kya. The model also features a Mexican population of 62,000, a Colombian population of 8,700, and a Puerto Rican population of 1,900. Modeling identity-by-descent (IBD) and ancestry tract length, we show that post-contact populations also differ markedly in their effective sizes and migration patterns, with Puerto Rico showing the smallest size and the earliest migration from Europe.

Differential meta-analysis of RNA-seq data from multiple studies

Andrea Rau (GABI), Guillemette Marot (INRIA Lille – Nord Europe, CERIM), Florence Jaffrézic (GABI)
(Submitted on 16 Jun 2013)

High-throughput sequencing is now regularly used for studies of the transcriptome (RNA-seq), particularly for comparisons among experimental conditions. For the time being, a limited number of biological replicates are typically considered in such experiments, leading to low detection power for differential expression. As their cost continues to decrease, it is likely that additional follow-up studies will be conducted to re-address the same biological question. We demonstrate how p-value combination techniques previously used for microarray meta-analyses can be used for the differential analysis of RNA-seq data from multiple related studies. These techniques are compared to a negative binomial generalized linear model (GLM) including a fixed study effect on simulated data and real data on human melanoma cell lines. The GLM with fixed study effect performed well for low inter-study variation and small numbers of studies, but was outperformed by the meta-analysis methods for moderate to large inter-study variability and larger numbers of studies. To conclude, the p-value combination techniques illustrated here are a valuable tool to perform differential meta-analyses of RNA-seq data by appropriately accounting for biological and technical variability within studies as well as additional study-specific effects. An R package metaRNASeq is available on the R Forge.
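
As a rough illustration of what p-value combination means for a single gene tested in several studies, the Python sketch below implements two classical rules, Fisher's method and the inverse normal method. It is a standard-library sketch and not the metaRNASeq code; the function names and the optional `weights` argument are assumptions for illustration, and whether these match the exact variants used in the paper is not confirmed by the abstract.

```python
import math
from statistics import NormalDist
from typing import Optional, Sequence


def fisher_combine(pvalues: Sequence[float]) -> float:
    """Fisher's method: -2 * sum(log p) ~ chi-square with 2k df under H0.

    Each p-value must lie strictly between 0 and 1.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # statistic / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def inverse_normal_combine(
    pvalues: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Inverse normal method: weighted sum of z-scores, renormalised."""
    nd = NormalDist()
    if weights is None:
        weights = [1.0] * len(pvalues)  # equal weights (an assumption)
    z = sum(w * nd.inv_cdf(1.0 - p) for w, p in zip(weights, pvalues))
    z /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - nd.cdf(z)


if __name__ == "__main__":
    per_study = [0.01, 0.02, 0.03]  # made-up p-values for one gene
    print(fisher_combine(per_study), inverse_normal_combine(per_study))
```

In practice the combined p-values would be computed gene by gene and then corrected for multiple testing.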

Our paper: Sashimi plots: Quantitative visualization of RNA sequencing read alignments

This is a guest post by Yarden Katz [@yardenkatz] on his paper (along with coauthors): Katz et al., Sashimi plots: Quantitative visualization of RNA sequencing read alignments, arXived here

A first draft of our paper Sashimi plots: Quantitative visualization of RNA sequencing read alignments is now available. Sashimi plots are a simple visualization of RNA sequencing data, intended to make it easier to detect differentially spliced exons across multiple RNA-Seq samples. In a Sashimi plot, RNA-Seq reads are summarized as read densities, and junction reads are collapsed into arcs whose width is proportional to the number of reads spanning the exons connected by the arc. See the paper for examples.
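
The summary step described above can be sketched in a few lines of Python. This is not the Sashimi plot program itself: the read representation (a list of half-open aligned blocks per read), the function names, and the `max_width` scaling are hypothetical choices for illustration only.

```python
from collections import Counter
from typing import Dict, List, Tuple

Block = Tuple[int, int]  # half-open [start, end) aligned block (assumed format)


def summarize_reads(
    reads: List[List[Block]], region_start: int, region_end: int
) -> Tuple[List[int], Counter]:
    """Return per-base read density and junction read counts for a region."""
    coverage = [0] * (region_end - region_start)
    junctions: Counter = Counter()
    for blocks in reads:
        # Read density: count every aligned base inside the region.
        for start, end in blocks:
            for pos in range(max(start, region_start), min(end, region_end)):
                coverage[pos - region_start] += 1
        # A gap between consecutive blocks of one read is a splice junction.
        for (_, donor), (acceptor, _) in zip(blocks, blocks[1:]):
            junctions[(donor, acceptor)] += 1
    return coverage, junctions


def arc_widths(junctions: Counter, max_width: float = 5.0) -> Dict[Block, float]:
    """Scale arc line widths linearly with junction read counts."""
    top = max(junctions.values(), default=0)
    if top == 0:
        return {}
    return {j: max_width * n / top for j, n in junctions.items()}


if __name__ == "__main__":
    toy_reads = [[(0, 3)], [(1, 3), (6, 8)], [(2, 3), (6, 9)]]
    cov, junc = summarize_reads(toy_reads, 0, 10)
    print(cov, dict(junc), arc_widths(junc))
```

The coverage list would be drawn as the filled density track and each junction as an arc between its two coordinates, with line width taken from `arc_widths`.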

We call it a Sashimi plot in part because of the impeccable resemblance of bumpy RNA-Seq read densities in exons to small pieces of sashimi, and also because we tried to keep the plots as close to the “raw” data as possible. While Sashimi plots can display estimates of isoform abundance levels from programs like MISO, the goal here was to summarize the read alignments as they are, without further processing or inference, so that conclusions from probabilistic models can be visually verified.

The original Sashimi plot program is a command line utility, written in Python with the matplotlib library, that makes customizable Sashimi plots. Recently, the IGV genome browser team implemented a version of Sashimi plots in their browser (see installation instructions). This allows Sashimi plots to be made dynamically for any genomic region of interest, at a resolution set by the zoom in/out features of the browser. The plot can be made for all or a subset of the tracks loaded, and the scales can be adjusted by the user as in the main IGV window. Both the static, Python-based version of Sashimi plots and the dynamic version within IGV are actively maintained, and code bases for both are available on GitHub.

Sashimi plots still have important limitations. First, the junction arcs can get messy for genes with many alternative isoforms. This can be partially addressed by looking at simplified event annotations (e.g. ones containing only two isoforms, or a handful of isoforms, as in these annotations) rather than making plots for the full set of isoforms of a gene. The second limitation is that sometimes subtle differences are not readily seen from junction arc widths. We’re considering alternative representations (such as circle area or diameter) for quantitatively representing junction read counts.

The paper is meant primarily as an advertisement for the software. We hope that other members of the RNA processing/sequencing community will find this useful and come up with their own variants of these plots.


Analysis and rejection sampling of Wright-Fisher diffusion bridges

Joshua G. Schraiber, Robert C. Griffiths, Steven N. Evans
(Submitted on 14 Jun 2013)

We investigate the properties of a Wright-Fisher diffusion process started from frequency x at time 0 and conditioned to be at frequency y at time T. Such a process is called a bridge. Bridges arise naturally in the analysis of selection acting on standing variation and in the inference of selection from allele frequency time series. We establish a number of results about the distribution of neutral Wright-Fisher bridges and develop a novel rejection sampling scheme for bridges under selection that we use to study their behavior.
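
To illustrate what a bridge is, the Python sketch below uses the most naive possible rejection sampler on a discrete neutral Wright-Fisher model: simulate unconditioned paths from the starting allele count and keep the first one that ends at the target count. This is explicitly not the rejection scheme developed in the paper, which works with the diffusion and handles selection. The function names, the population size, and the `max_tries` cap are assumptions for illustration.

```python
import random
from typing import List, Tuple


def wf_path(n: int, x0: int, generations: int, rng: random.Random) -> List[int]:
    """Simulate allele counts under a neutral Wright-Fisher model of size n."""
    path = [x0]
    for _ in range(generations):
        p = path[-1] / n
        # Binomial(n, p) resampling of the next generation.
        path.append(sum(rng.random() < p for _ in range(n)))
    return path


def naive_bridge(
    n: int, x0: int, y: int, generations: int,
    rng: random.Random, max_tries: int = 100_000,
) -> Tuple[List[int], int]:
    """Rejection-sample a path from count x0 conditioned to end at count y.

    Returns the accepted path and the number of proposals it took.
    """
    for tries in range(1, max_tries + 1):
        path = wf_path(n, x0, generations, rng)
        if path[-1] == y:  # accept only paths that hit the endpoint
            return path, tries
    raise RuntimeError("no path hit the target endpoint within max_tries")


if __name__ == "__main__":
    path, tries = naive_bridge(20, 10, 12, 5, random.Random(1))
    print(path, tries)
```

Exact endpoint matching only works here because allele counts are discrete. In the diffusion setting the endpoint event has probability zero, which is one reason a more careful sampling scheme is needed.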

Phylogenetic analysis accounting for age-dependent death and sampling with applications to epidemics

Amaury Lambert, Helen K. Alexander, Tanja Stadler
(Submitted on 14 Jun 2013)

The reconstruction of phylogenetic trees based on viral genetic sequence data sequentially sampled from an epidemic provides estimates of the past transmission dynamics, by fitting epidemiological models to these trees. To our knowledge, none of the epidemiological models currently used in phylogenetics can account for recovery rates and sampling rates dependent on the time elapsed since transmission.
Here we introduce an epidemiological model where infectives leave the epidemic, either by recovery or sampling, after some random time which may follow an arbitrary distribution.
We derive an expression for the likelihood of the phylogenetic tree of sampled infectives under our general epidemiological model. The analytic concept developed in this paper will facilitate inference of past epidemiological dynamics and provide an analytical framework for performing very efficient simulations of phylogenetic trees under our model. The main idea of our analytic study is that the non-Markovian epidemiological model, which gives rise to phylogenetic trees growing vertically as time goes by, can be represented by a Markovian “coalescent point process” growing horizontally by the sequential addition of pairs of coalescence and sampling times.
As examples, we discuss two special cases of our general model, namely an application to influenza and an application to HIV. Though phrased in epidemiological terms, our framework can also be used for instance to fit macroevolutionary models to phylogenies of extant and extinct species, accounting for general species lifetime distributions.