Characterizing the infection-induced transcriptome of Nasonia vitripennis reveals a preponderance of taxonomically-restricted immune genes

Timothy B. Sackton, John H. Werren, Andrew G. Clark
(Submitted on 23 Sep 2013)

The innate immune system in insects consists of a conserved core signaling network and rapidly diversifying effector and recognition components, often containing a high proportion of taxonomically-restricted genes. In the absence of functional annotation, genes encoding immune system proteins can thus be difficult to identify, as homology-based approaches generally cannot detect lineage-specific genes. Here, we use RNA-seq to compare the uninfected and infection-induced transcriptome in the parasitoid wasp Nasonia vitripennis to identify genes regulated by infection. We identify 183 genes significantly up-regulated by infection and 61 genes significantly down-regulated by infection. We also produce a new homology-based immune catalog in N. vitripennis, and show that most infection-induced genes are not assigned an immune function from homology alone, suggesting the potential for substantial novel immune components in less-well-studied systems. Finally, we show that a high proportion of these novel induced genes are taxonomically-restricted, highlighting the rapid evolution of immune gene content. The combination of functional annotation using RNA-seq and homology-based annotation provides a robust method to characterize the innate immune response across a wide variety of insects, and reveals significant novel features of the Nasonia immune response.

Author post: The identifiability of piecewise demographic models from the sample frequency spectrum

This guest post is by Anand Bhaskar and Yun Song on their paper: “The identifiability of piecewise demographic models from the sample frequency spectrum”. arXived here.

With the advent of high-throughput sequencing technologies, it has been of great interest to use genomic data to understand human demographic history. For example, we now estimate that modern humans migrated out of Africa around 60K-120K years ago [1,2], and that Neandertals may have admixed with modern humans in Europe as recently as 47,000 years ago [3]. Apart from satisfying curiosity about our anthropological history, the inference of demography is important for several scientific reasons. Most importantly, demographic processes influence genetic variation, and understanding the interplay between natural selection, genetic drift, and demography is a key question in population genetics. Also, controlling for demography is important for practical applications. For example, the demography inferred from neutrally evolving genomic regions can serve as a null model when searching for regions under selection. Demographic models could also be used to circumvent the problem of spurious associations in case-control studies induced by population substructure.

A summary of whole haplotypes that is commonly used in population genetic analyses is the sample frequency spectrum (SFS). For a sample of n haplotypes from a panmictic (i.e. without substructure) population, the SFS is an (n-1)-dimensional vector where the i-th entry is the proportion of SNPs with i copies of the mutant allele in the sample. One can talk about a mutant/derived allele because most analyses assume that mutations are rare enough that the observed SNPs are dimorphic. The first few entries of the SFS capture the proportion of rare SNPs in the sample and are especially useful for inferring recent population history. Several recent large sample sequencing studies [4-6] have found that humans have many more putatively neutral rare SNPs compared to predictions from a constant population size model. Using the SFS from their data, these studies all infer demographic models with recent exponential population expansion.
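To make the definition concrete, here is a minimal Python sketch that computes the SFS from a small matrix of haplotypes. The 0/1 encoding (1 for the derived allele), the function names, and the toy data are assumptions made for illustration and do not come from the paper. The second function gives the textbook neutral expectation under a constant population size, where the expected proportion of SNPs with i derived copies is proportional to 1/i; this is the kind of baseline that observed rare-variant counts are compared against.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Return the SFS as a list of n-1 proportions.

    haplotypes: list of n equal-length sequences of 0/1 values,
    where 1 marks the derived allele (an assumed encoding).
    Entry i-1 of the result is the proportion of segregating sites
    carrying exactly i derived copies.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    counts = Counter()
    for site in range(num_sites):
        derived = sum(h[site] for h in haplotypes)
        # Skip sites that are not polymorphic in the sample.
        if 0 < derived < n:
            counts[derived] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * (n - 1)
    return [counts[i] / total for i in range(1, n)]


def expected_sfs_constant_size(n):
    """Expected normalized SFS under a constant-size neutral model.

    Uses the standard result that the expected number of sites with
    i derived copies is proportional to 1/i.
    """
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    toy = [
        [1, 0, 0, 1],
        [0, 0, 1, 1],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(sample_frequency_spectrum(toy))
    print(expected_sfs_constant_size(4))
```

An excess in the first entries of the observed SFS relative to the constant-size expectation is the pattern that the studies cited above interpret as evidence of recent population growth.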

However, until fairly recently, it was not known whether the SFS of a sample uniquely determines the underlying demographic model. Could it be possible that two different demographic models produce the exact same expected SFS for all sample sizes? In 2008, Simon Myers, Charles Fefferman, and Nick Patterson came up with an elegant mathematical argument [7] to show that there are infinitely many population size histories that generate the same expected SFS for all sample sizes. They even provided an explicit example of a population size history which produced the same expected SFS as a constant population size model. However, their example history had increasingly rapid oscillations in the population size in the recent past, something that we might not expect to find in real biological populations. After all, even though we commonly use continuous-time models of evolution like coalescent theory and diffusion processes, biological populations evolve in discrete events of birth and death.

Our research group has been working on demographic inference from the SFS and from full sequence data for the last several years, and so it was natural for us to ask whether the class of population size histories that are commonly inferred using statistical algorithms might also suffer from this non-identifiability problem. Most statistical methods infer piecewise population size histories, where the pieces come from some biologically-motivated family of functions. In particular, piecewise constant and piecewise exponential models commonly appear in the literature. And if one can indeed uniquely identify piecewise demographic models from the SFS, what sample sizes are needed to do so?

In our paper, we address this question by proving that if the underlying population size function is piecewise with at most K pieces, then the expected SFS of a random sample of size n uniquely determines the demography as long as n is larger than some function of K that depends on the type of pieces of the population size function. For example, if the underlying demographic model is piecewise constant with at most K pieces (i.e. described by at most 2K – 1 parameters), then the expected SFS of a sample of size 2K uniquely determines the demographic model. In other words, no two piecewise constant population size functions with at most K pieces can generate the same expected SFS for a sample of size 2K or larger. For piecewise exponential demographic models with at most K pieces, a sample size of 4K – 1 is sufficient to uniquely determine the demographic model. When one doesn’t know which allele is ancestral and which is derived (for example, if outgroup information is lacking at the relevant SNPs), demographic analysis can still be carried out using the SFS by “folding” it. The folded SFS has floor(n/2) entries, where the i-th entry is the proportion of SNPs with i copies of the minor allele (which might be an ancestral or derived allele). Since the folded SFS has only roughly half the dimension of the full SFS, one might expect to require twice as many samples to uniquely determine the demographic model from the folded SFS compared to the full SFS. We formally prove in our paper that this intuition is indeed correct.
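The folding operation and the two sample-size bounds quoted above can be sketched in a few lines of Python. The function names are hypothetical and chosen only for this illustration. The helper for the bounds encodes just the two unfolded cases stated in this post (2K for piecewise constant, 4K – 1 for piecewise exponential); it does not attempt to give the exact folded-SFS bounds, which are in the paper.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS of length n-1 into floor(n/2) entries.

    Entry i-1 of the result is the proportion of SNPs whose minor
    allele appears i times, obtained by adding the unfolded entries
    for i and n-i derived copies (or taking the single middle entry
    when i == n-i).
    """
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        j = n - i
        if i == j:
            folded.append(sfs[i - 1])
        else:
            folded.append(sfs[i - 1] + sfs[j - 1])
    return folded


def sufficient_sample_size(num_pieces, kind):
    """Sample size stated in the post as sufficient for identifiability
    from the unfolded expected SFS.

    kind: "constant" or "exponential" (labels assumed for this sketch).
    """
    if num_pieces < 1:
        raise ValueError("num_pieces must be at least 1")
    if kind == "constant":
        return 2 * num_pieces
    if kind == "exponential":
        return 4 * num_pieces - 1
    raise ValueError("kind must be 'constant' or 'exponential'")


if __name__ == "__main__":
    print(fold_sfs([0.5, 0.25, 0.25]))
    print(sufficient_sample_size(3, "constant"))
    print(sufficient_sample_size(3, "exponential"))
```

For instance, under these bounds a model with three piecewise constant epochs would need a sample of 6 haplotypes, and three piecewise exponential epochs would need 11.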

It is important to stress that this identifiability result is statistical rather than algorithmic, in that one would need to have perfect information about the expected SFS of a random sample in order to uniquely determine the underlying piecewise demography. In practice, one can get good estimates of the expected SFS by considering a large number of SNPs in the inference procedure, and by considering SNPs that are farther apart along the chromosomes so that the coalescent trees for the sample at different SNPs will be roughly independent of each other. More work is certainly needed to understand how much genomic data (measured both in terms of the number of SNPs and the sample size) would be needed in practice to robustly infer realistic demographic models.

Works cited:

[1] Li, H. and Durbin, R. (2011) Inference of human population history from individual whole-genome sequences. Nature 475, 493–496.

[2] Scally, A. and Durbin, R. (2012). Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics, 13(10), 745-753.

[3] Sankararaman, S., Patterson, N., Li, H., Pääbo, S., and Reich, D. (2012) The date of interbreeding between Neandertals and modern humans. PLoS Genetics 8, e1002947.

[4] Nelson, Matthew R., et al. (2012) An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100–104.

[5] Tennessen, Jacob A., et al. (2012) Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69.

[6] Fu, Wenqing, et al. (2012) Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature 493, 216–220.

[7] Myers, S., Fefferman, C., and Patterson, N. (2008) Can one learn history from the allelic spectrum? Theoretical Population Biology 73, 342–348.

mTim: Rapid and accurate transcript reconstruction from RNA-Seq data

Georg Zeller, Nico Goernitz, Andre Kahles, Jonas Behr, Pramod Mudrakarta, Soeren Sonnenburg, Gunnar Raetsch
(Submitted on 20 Sep 2013)

Recent advances in high-throughput cDNA sequencing (RNA-Seq) technology have revolutionized transcriptome studies. A major motivation for RNA-Seq is to map the structure of expressed transcripts at nucleotide resolution. With accurate computational tools for transcript reconstruction, this technology may also become useful for genome (re-)annotation, which has mostly relied on de novo gene finding where gene structures are primarily inferred from the genome sequence. We developed a machine-learning method, called mTim (margin-based transcript inference method) for transcript reconstruction from RNA-Seq read alignments that is based on discriminatively trained hidden Markov support vector machines. In addition to features derived from read alignments, it utilizes characteristic genomic sequences, e.g. around splice sites, to improve transcript predictions. mTim inferred transcripts that were highly accurate and relatively robust to alignment errors in comparison to those from Cufflinks, a widely used transcript assembly method.

Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

Jared E. Decker, Stephanie D. McKay, Megan M. Rolf, JaeWoo Kim, Antonio Molina Alcalá, Tad S. Sonstegard, Olivier Hanotte, Anders Götherström, Christopher M. Seabury, Lisa Praharani, Masroor Ellahi Babar, Luciana Correia de Almeida Regitano, Mehmet Ali Yildiz, Michael P. Heaton, Wansheng Lui, Chu-Zhao Lei, James M. Reecy, Muhammad Saif-Ur-Rehman, Robert D. Schnabel, Jeremy F. Taylor
(Submitted on 19 Sep 2013)

The domestication and development of cattle has considerably impacted human societies, but the histories of cattle breeds have been poorly understood, especially for African, Asian, and American breeds. Using genotypes from 43,043 autosomal single nucleotide polymorphism markers scored in 1,543 animals, we evaluate the population structure of 134 domesticated bovid breeds. Regardless of the analytical method or sample subset, the three major groups of Asian indicine, Eurasian taurine, and African taurine were consistently observed. Patterns of geographic dispersal resulting from co-migration with humans and exportation are recognizable in phylogenetic networks. All analytical methods reveal patterns of hybridization which occurred after divergence. Using 19 breeds, we map the cline of indicine introgression into Africa. We infer that African taurine possess a large portion of wild African aurochs ancestry, causing their divergence from Eurasian taurine. We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia. We also identify the influence of species other than Bos taurus in the formation of Asian breeds. We detect the pronounced influence of Shorthorn cattle in the formation of European breeds. Iberian and Italian cattle possess introgression from African taurine. American Criollo cattle are shown to be of Iberian, and not African, descent. Indicine introgression into American cattle occurred in the Americas, and not Europe. We argue that cattle migration, movement and trading followed by admixture have been important forces in shaping modern bovine genomic variation.

Change point analysis of histone modifications reveals epigenetic blocks with distinct regulatory activity and biological functions

Mengjie Chen, Haifan Lin, Hongyu Zhao
(Submitted on 20 Sep 2013)

Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume that one combination of histone modifications invariantly translates to one transcriptional output regardless of local chromatin environment. In this study we hypothesize that the genome is organized into local domains that manifest similar enrichment patterns of histone modifications, which leads to orchestrated regulation of expression of genes with relevant biological functions. We propose a multivariate Bayesian Change Point (BCP) model to segment the Drosophila melanogaster genome into consecutive blocks on the basis of combinatorial patterns of histone marks. By modeling the sparse distribution of histone marks across the chromosome with a zero-inflated Gaussian mixture, our partitions capture local BLOCKs that manifest relatively homogeneous enrichment patterns of histone modifications. We further characterized BLOCKs by their transcription levels, distribution of genes, binding profiles of a broad panel of chromatin proteins, degree of co-expression and GO enrichment. Our results demonstrate that these blocks, although inferred merely from histone modifications, reveal strong relevance with transcription events and chromatin organization, which suggests their important roles in coordinated gene regulation.

A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics

Sean Whalen, Gaurav Pandey
(Submitted on 19 Sep 2013)

The combination of multiple classifiers using ensemble methods is increasingly important for making progress in a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across 4 genomics datasets and find the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.

The identifiability of piecewise demographic models from the sample frequency spectrum

Anand Bhaskar, Yun S. Song
(Submitted on 19 Sep 2013)

The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different demographic models can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this non-identifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant and piecewise-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the “folded” SFS, which is often used when there is ambiguity as to which allelic type is ancestral.