SInC: An accurate and fast error-model based simulator for SNPs, Indels and CNVs coupled with a read generator for short-read sequence data

Swetansu Pattnaik, Saurabh Gupta, Arjun A Rao, Binay Panda
(Submitted on 12 Jul 2013)

We report the SInC (SNV, Indel and CNV) simulator and read generator, an open-source tool capable of simulating biological variants while taking into account a platform-specific error model. SInC can simulate and generate single- and paired-end reads with a user-defined insert size more efficiently than other existing tools. Because read generation is multi-threaded, SInC has a low time footprint. SInC is currently optimised to work on limited infrastructure and can efficiently exploit the commonly used quad-core desktop architecture to simulate short sequence reads with deep coverage for large genomes. SInC can be downloaded from this https URL.
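
To make the idea of paired-end read generation with a user-defined insert size concrete, here is a minimal sketch. It is not SInC's code: the function names are hypothetical, and a single uniform substitution rate stands in for the platform-specific error model the abstract describes.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse-complement a sequence made of A, C, G and T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def add_errors(read, error_rate, rng):
    """Substitute each base with probability error_rate (assumed uniform model)."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_pairs(reference, n_pairs, read_len, insert_size, error_rate, seed=0):
    """Sample fragments of length insert_size and read both ends.

    Returns (start, forward_read, reverse_read) tuples. Assumes
    read_len <= insert_size <= len(reference).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        start = rng.randint(0, len(reference) - insert_size)
        fragment = reference[start:start + insert_size]
        forward = add_errors(fragment[:read_len], error_rate, rng)
        # The mate is read from the opposite strand at the far end of the fragment.
        reverse = add_errors(reverse_complement(fragment[-read_len:]), error_rate, rng)
        pairs.append((start, forward, reverse))
    return pairs
```

A real simulator would also introduce the SNVs, indels and CNVs into the reference before sampling reads, and would draw errors from a position-dependent quality profile rather than a single rate.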

kruX: Matrix-based non-parametric eQTL discovery

Jianlong Qi, Hassan Foroughi Asl, Johan Bjorkegren, Tom Michoel
(Submitted on 12 Jul 2013)

The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data due to its robustness against variations in the underlying genetic model and expression trait distribution, but testing billions of marker-trait combinations one-by-one can become computationally prohibitive. We developed kruX, an algorithm implemented in Matlab, Python and R that uses matrix multiplications to simultaneously calculate the Kruskal-Wallis test statistic for several million marker-trait combinations at once. kruX is more than 3,000 times faster than computing associations one-by-one on a typical human dataset.
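
The matrix trick can be sketched as follows. For N samples without ties, the Kruskal-Wallis statistic is H = 12 / (N (N + 1)) * sum over genotype groups g of R_g^2 / n_g, minus 3 (N + 1), where R_g is the rank sum and n_g the size of group g. Multiplying a 0/1 genotype indicator matrix by the transposed rank matrix yields R_g for every marker-trait pair at once. The code below is an illustration under those assumptions (no ties, no missing genotypes), not the kruX implementation: the function names are hypothetical, and a plain-Python product stands in for an optimised matrix library.

```python
def rank(values):
    """Ranks 1..N for distinct values (this sketch does not handle ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def matmul_abt(a, b):
    """Return a @ b.T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def kruskal_wallis_all_pairs(genotypes, traits, n_groups=3):
    """H statistic for every marker-trait pair.

    genotypes: markers x samples, integer codes 0..n_groups-1
    traits:    traits x samples, expression values
    Returns a markers x traits list of H values.
    """
    n = len(traits[0])
    ranks = [rank(t) for t in traits]
    acc = [[0.0] * len(traits) for _ in genotypes]
    for g in range(n_groups):
        # 0/1 indicator matrix of samples carrying genotype g at each marker.
        indicator = [[1 if v == g else 0 for v in marker] for marker in genotypes]
        sizes = [sum(row) for row in indicator]
        rank_sums = matmul_abt(indicator, ranks)  # markers x traits
        for m, row in enumerate(rank_sums):
            if sizes[m] == 0:
                continue  # empty genotype group contributes nothing
            for t, s in enumerate(row):
                acc[m][t] += s * s / sizes[m]
    scale = 12.0 / (n * (n + 1))
    return [[scale * v - 3 * (n + 1) for v in row] for row in acc]
```

For example, `kruskal_wallis_all_pairs([[0, 0, 1, 1, 2, 2]], [[1, 2, 3, 4, 5, 6]])` gives rank sums of 3, 7 and 11 for the three genotype groups and hence H = 32/7.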

Inference of chromosomal inversion dynamics from Pool-Seq data in natural and laboratory populations of Drosophila melanogaster

Martin Kapun, Hester van Schalkwyk, Bryant McAllister, Thomas Flatt, Christian Schlötterer
(Submitted on 9 Jul 2013)

Sequencing of pools of individuals (Pool-Seq) represents a reliable and cost-effective approach for estimating genome-wide SNP and transposable element insertion frequencies. However, Pool-Seq does not provide direct information on haplotypes so that for example obtaining inversion frequencies has not been possible until now. Here, we have developed a new set of diagnostic marker SNPs for 7 cosmopolitan inversions in Drosophila melanogaster that can be used to infer inversion frequencies from Pool-Seq data. We applied our novel marker set to Pool-Seq data from an experimental evolution study and from North American and Australian latitudinal clines. In the experimental evolution data, we find evidence that positive selection has driven the frequencies of In(3R)C and In(3R)Mo to increase over time. In the clinal data, we confirm the existence of frequency clines for In(2L)t, In(3L)P and In(3R)Payne in both North America and Australia and detect a previously unknown latitudinal cline for In(3R)Mo in North America. The inversion markers developed here provide a versatile and robust tool for characterizing inversion frequencies and their dynamics in Pool-Seq data from diverse D. melanogaster populations.
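
One simple way to turn diagnostic marker SNPs into an inversion frequency estimate is to average the frequency of the inversion-specific allele across markers. The sketch below assumes exactly that estimator; the function name and input layout are hypothetical and the preprint may weight or filter markers differently.

```python
def inversion_frequency(marker_counts):
    """Estimate an inversion frequency from Pool-Seq read counts.

    marker_counts: list of (inversion_allele_reads, total_reads), one entry
    per diagnostic SNP. Markers without coverage are skipped.
    """
    freqs = [alt / total for alt, total in marker_counts if total > 0]
    if not freqs:
        raise ValueError("no diagnostic marker has coverage")
    return sum(freqs) / len(freqs)
```

With counts of 10/100, 30/100 and 20/100 at three markers, the estimate is the mean of 0.1, 0.3 and 0.2.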

The waiting time for a second mutation: an alternative to the Moran model

Rinaldo B. Schinazi
(Submitted on 28 Jun 2013)

The appearance of cancer in a tissue is thought to be the result of two or more successive mutations. We propose a stochastic model that allows for an exact computation of the distribution of the waiting time for a second mutation. This models the time of appearance of the first cancerous cell in a tissue. Our model is an alternative to the Moran model with mutations.

Most viewed on Haldane’s Sieve: June 2013

This month was the most active yet on Haldane’s Sieve, with over 10,000 page views and several active comment threads. The most viewed preprints were:

Lateral Gene Transfer, Rearrangement and Reconciliation

Murray Patterson, Gergely J Szöllősi, Vincent Daubin, Eric Tannier
(Submitted on 27 Jun 2013)

Background.
Models of ancestral gene order reconstruction have progressively integrated different evolutionary patterns and processes such as unequal gene content, gene duplications, and implicitly sequence evolution via reconciled gene trees. In unicellular organisms, these models have so far ignored lateral gene transfer, even though it can have an important confounding effect on such models and is a rich source of information on the function of genes through the detection of transfers of entire clusters of genes.
Result.
We report an algorithm together with its implementation, DeCoLT, that reconstructs ancestral genome organization based on reconciled gene trees which summarize information on sequence evolution, gene origination, duplication, loss, and lateral transfer. DeCoLT finds in polynomial time the minimum number of rearrangements, computed as the number of gains and breakages of adjacencies between pairs of genes. We apply DeCoLT to 1099 gene families from 36 cyanobacteria genomes.
Conclusion.
DeCoLT is able to reconstruct adjacencies in 35 ancestral bacterial genomes with a thousand gene families in a few hours, and detects clusters of co-transferred genes. As there is no constraint on genome organization, adjacencies can be generalized to any relationship between genes to reconstruct ancestral interactions, functions or complexes with the same framework.
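
The cost that the abstract says is minimised, the number of gains and breakages of adjacencies, can be illustrated on a single branch between two gene orders. This is only the counting step under simplifying assumptions (one linear chromosome, unsigned genes, each gene present once); it is not DeCoLT's algorithm over reconciled gene trees, and the function names are hypothetical.

```python
def adjacencies(gene_order):
    """Set of unordered neighbouring gene pairs on a linear chromosome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def gains_and_breakages(ancestor, descendant):
    """Count adjacencies gained and broken between two gene orders."""
    anc, desc = adjacencies(ancestor), adjacencies(descendant)
    gained = desc - anc   # present only in the descendant
    broken = anc - desc   # present only in the ancestor
    return len(gained), len(broken)
```

Going from the order a, b, c, d to a, c, b, d keeps the adjacency between b and c, gains a-c and b-d, and breaks a-b and c-d, so the cost on that branch is two gains plus two breakages.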

Bound to succeed: Transcription factor binding site prediction and its contribution to understanding virulence and environmental adaptation in bacterial plant pathogens

Surya Saha, Magdalen Lindeberg
(Submitted on 26 Jun 2013)

Bacterial plant pathogens rely on a battalion of transcription factors to fine-tune their response to changing environmental conditions and marshal the genetic resources required for successful pathogenesis. Prediction of transcription factor binding sites represents an important tool for elucidating regulatory networks, and has been conducted in multiple genera of plant pathogenic bacteria for the purpose of better understanding mechanisms of survival and pathogenesis. The major categories of transcription factor binding sites that have been characterized are reviewed here with emphasis on in silico methods used for site identification and challenges therein, their applicability to different types of sequence datasets, and insights into mechanisms of virulence and survival that have been gained through binding site mapping. An improved strategy for establishing E value cutoffs when using existing models to screen uncharacterized genomes is also discussed.

A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data

John O’Brien, Xavier Didelot, Zamin Iqbal, Lucas Amenga-Etego, Bartu Ahiska, Daniel Falush
(Submitted on 26 Jun 2013)

Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of strong mixing among samples. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples.

The complex hybrid origins of the root knot nematodes revealed through comparative genomics

David H Lunt, Sujai Kumar, Georgios Koutsovoulos, Mark L Blaxter
(Submitted on 26 Jun 2013)

Meloidogyne root knot nematodes (RKN) can infect most of the world’s agricultural crop species and are among the most important of all plant pathogens. As yet, however, we have little understanding of their origins or the genomic basis of their extreme polyphagy. The most damaging pathogens reproduce by mitotic parthenogenesis and are suggested to originate by interspecific hybridizations between unknown parental taxa. We sequenced the genome of the diploid meiotic parthenogen Meloidogyne floridensis, and used a comparative genomic approach to test the hypothesis that it was involved in the hybrid origin of the tropical mitotic parthenogen M. incognita. Phylogenomic analysis of gene families from M. floridensis, M. incognita and an outgroup species M. hapla was used to trace the evolutionary history of these species’ genomes, demonstrating that M. floridensis was one of the parental species in the hybrid origins of M. incognita. Analysis of the M. floridensis genome revealed many gene loci present in divergent copies, as they are in M. incognita, indicating that it too had a hybrid origin. The triploid M. incognita is shown to be a complex double-hybrid between M. floridensis and a third, unidentified parent. The agriculturally important RKN have very complex origins involving the mixing of several parental genomes by hybridization, and their extreme polyphagy and agricultural success may be related to this hybridization, producing transgressive variation on which natural selection acts. Studying RKN variation via individual marker loci may fail due to the species’ convoluted origins, and multi-species population genomics is essential to understand the hybrid diversity and adaptive variation of this important species complex. This comparative genomic analysis provides a compelling example of the importance and complexity of hybridization in generating animal species diversity more generally.

The impact of population demography and selection on the genetic architecture of complex traits

Kirk E. Lohmueller
(Submitted on 21 Jun 2013)

Studies of thousands of individuals have found genetic evidence for dramatic population growth in recent human history. These studies have also documented high numbers of amino acid changing polymorphisms that are likely evolutionarily important and may be of medical relevance. Here I use population genetic models to demonstrate how the recent population growth has directly led to the accumulation of deleterious amino acid changing polymorphisms. I show that recent growth increases the proportion of nonsynonymous SNPs and that the average mutation is more deleterious in an expanding population than in a non-expanded population. However, population growth does not affect the genetic load of the population. Additionally, I investigate the consequences of recent population growth on the architecture of complex traits. If a mutation’s effect on disease status is correlated with its effect on fitness, then rare variants explain a greater portion of the additive genetic variance of the trait in a population that has recently expanded than in a population that did not recently expand. Further, recent growth can increase the expected number of causal variants for a disease. Such heterogeneity will likely reduce the power of commonly used rare variant association tests. Finally, recent population growth also reduces the causal allele frequency in cases at single mutations, which could decrease the power of single-marker association tests. These findings suggest careful consideration of recent population history will be essential for designing optimal association studies for low-frequency and rare variants.
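
To see what "portion of the additive genetic variance explained by rare variants" means numerically, the sketch below uses the standard textbook expression for a purely additive trait under Hardy-Weinberg proportions, where a variant with allele frequency p and effect size a contributes 2 p (1 - p) a^2. The formula, the 1% frequency cutoff and the function name are assumptions made for illustration; the abstract does not specify the model used in the preprint.

```python
def rare_variant_share(variants, max_freq=0.01):
    """Fraction of additive variance due to variants rarer than max_freq.

    variants: list of (allele_frequency, effect_size) pairs.
    Each variant contributes 2 * p * (1 - p) * a**2 (assumed additive model).
    """
    contributions = [(p, 2 * p * (1 - p) * a * a) for p, a in variants]
    total = sum(v for _, v in contributions)
    if total == 0:
        raise ValueError("variants contribute no variance")
    rare = sum(v for p, v in contributions if p < max_freq)
    return rare / total
```

A rare variant with a large effect can outweigh a common one with a small effect: with one variant at frequency 0.005 and effect 1.0 and another at frequency 0.5 and effect 0.1, the contributions are 0.00995 and 0.005, so the rare variant accounts for roughly two thirds of the variance.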