Inferring the Clonal Structure of Viral Populations from Time Series Sequencing

Donatien Fotso-Chedom, Pablo R. Murcia, Chris D. Greenman
(Submitted on 30 Jul 2014)

RNA virus populations will undergo processes of mutation and selection resulting in a mixed population of viral particles. High throughput sequencing of a viral population subsequently contains a mixed signal of the underlying clones. We would like to identify the underlying evolutionary structures. We utilize two sources of information to attempt this: within-segment linkage information and mutation prevalence. We demonstrate that clone haplotypes, their prevalence, and maximum parsimony reticulate evolutionary structures can be identified, although the solutions may not be unique, even for complete sets of information. This is applied to a chain of influenza infection, where we infer evolutionary structures, including reassortment, and demonstrate some of the difficulties of interpretation that arise from deep sequencing due to artifacts such as template switching during PCR amplification.

The Genetic Legacy of the Expansion of Turkic-Speaking Nomads Across Eurasia

Bayazit Yunusbayev, Mait Metspalu, Ene Metspalu, Albert Valeev, Sergei Litvinov, Ruslan Valiev, Vita Akhmetova, Elena Balanovska, Oleg Balanovsky, Shahlo Turdikulova, Dilbar Dalimova, Pagbajabyn Nymadawa, Ardeshir Bahmanimehr, Hovhannes Sahakyan, Kristiina Tambets, Sardana Fedorova, Nikolay Barashkov, Irina Khidiatova, Evelin Mihailov, Rita Khusainova, Larisa Damba, Miroslava Derenko, Boris Malyarchuk, Ludmila Osipova, Mikhail Voevoda, Levon Yepiskoposyan, Toomas Kivisild, Elza Khusnutdinova, Richard Villems
doi: http://dx.doi.org/10.1101/005850

The Turkic peoples represent a diverse collection of ethnic groups defined by the Turkic languages. These groups have dispersed across a vast area, including Siberia, Northwest China, Central Asia, East Europe, the Caucasus, Anatolia, the Middle East, and Afghanistan. The origin and early dispersal history of the Turkic peoples is disputed, with candidates for their ancient homeland ranging from the Transcaspian steppe to Manchuria in Northeast Asia. Previous genetic studies have not identified a clear-cut unifying genetic signal for the Turkic peoples, which lends support for language replacement rather than demic diffusion as the model for the Turkic language's expansion. We addressed the genetic origin of 373 individuals from 22 Turkic-speaking populations, representing their current geographic range, by analyzing genome-wide high-density genotype data. Most of the Turkic peoples studied, except those in Central Asia, genetically resembled their geographic neighbors, in agreement with the elite dominance model of language expansion. However, western Turkic peoples sampled across West Eurasia shared an excess of long chromosomal tracts that are identical by descent (IBD) with populations from present-day South Siberia and Mongolia (SSM), an area where historians center a series of early Turkic and non-Turkic steppe polities. The observed excess of long chromosomal tracts IBD (> 1 cM) between populations from SSM and Turkic peoples across West Eurasia was statistically significant. Finally, we used the ALDER method and inferred admixture dates (~9th–17th centuries) that overlap with the Turkic migrations of the 5th–16th centuries. Thus, our results indicate historical admixture among Turkic peoples, and the recent shared ancestry with modern populations in SSM supports one of the hypothesized homelands for their nomadic Turkic and related Mongolic ancestors.

QuASAR: Quantitative Allele Specific Analysis of Reads

Chris Harvey, Gregory A Moyerbrailean, Omar Davis, Xiaoquan Wen, Francesca Luca, Roger Pique-Regi

Expression quantitative trait loci (eQTL) studies have discovered thousands of genetic variants that regulate gene expression and have been crucial to enable a better understanding of the functional role of non-coding sequences. However, eQTL studies are generally quite expensive, requiring a large sample size and genome-wide genotyping. On the other hand, allele specific expression (ASE) is becoming a very popular approach to detect the effect of a genetic variant on gene expression, even with a single individual. This is typically achieved by counting the number of RNA-seq reads for each allele at heterozygous sites and rejecting the null hypothesis of 1:1 ratio. When genotype information is not readily available it could be inferred from the RNA-seq reads directly, but there are no methods available that can incorporate the uncertainty on the genotype call with the ASE inference step. Here, we present QuASAR, Quantitative Allele Specific Analysis of Reads, a novel statistical learning method for jointly detecting heterozygote genotypes and inferring ASE. The proposed ASE inference step takes into consideration the uncertainty in the genotype calls while including parameters that model base-call errors in sequencing and allelic over-dispersion. We validated our method with experimental data for which high quality genotypes are available. Results on an additional dataset with multiple replicates at different sequencing depths demonstrate that QuASAR is a powerful tool for ASE analysis when genotypes are not available.
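The baseline test mentioned in the abstract (counting reads for each allele at a heterozygous site and rejecting a 1:1 ratio) can be illustrated with an exact two-sided binomial test. The sketch below is only that naive baseline, assuming the genotype is known to be heterozygous and reads are independent; it is not QuASAR's model, which additionally accounts for genotype uncertainty, base-call errors, and over-dispersion. The function name `ase_binomial_pvalue` is hypothetical.

```python
from math import comb


def ase_binomial_pvalue(ref_reads: int, alt_reads: int) -> float:
    """Two-sided exact binomial p-value for the null of a 1:1 allelic ratio.

    Assumes a known heterozygous site and independent reads (illustrative only).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence against the null
    # Under the null each read comes from either allele with probability 0.5,
    # so P(k reference reads) = C(n, k) / 2**n.
    observed = comb(n, ref_reads)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n


if __name__ == "__main__":
    print(ase_binomial_pvalue(8, 2))   # moderate imbalance at low depth
    print(ase_binomial_pvalue(80, 20)) # same ratio, ten times the depth
```

The same allelic ratio gives a much smaller p-value at higher depth, which is one reason read depth matters when interpreting ASE calls.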

A Statistical Test for Clades in Phylogenies

Thurston H. Y. Dang, Elchanan Mossel
(Submitted on 29 Jul 2014)

We investigated testing the likelihood of a phylogenetic tree by comparison to its subtree pruning and regrafting (SPR) neighbors, with or without re-optimizing branch lengths. This is inspired by aspects of Bayesian significance tests, and the use of SPRs for heuristically finding maximum likelihood trees. Through a number of simulations with the Jukes-Cantor model on various topologies, it is observed that the SPR tests are informative, and reasonably fast compared to searching for the maximum likelihood tree. This suggests that the SPR tests would be a useful addition to the suite of existing statistical tests, for identifying potential inaccuracies of inferred topologies.
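For readers unfamiliar with the Jukes-Cantor model used in these simulations, the sketch below shows its standard transition probabilities and the resulting log-likelihood for a pair of aligned sequences separated by a single branch. It illustrates only the substitution model that underlies the tree likelihood, not the SPR test itself; the function names are hypothetical, and the branch length is assumed to be positive and measured in expected substitutions per site.

```python
from math import exp, log


def jc69_transition_prob(branch_length: float, same_base: bool) -> float:
    """Jukes-Cantor probability of observing the same (or one specific
    different) base after a branch of the given length."""
    decay = exp(-4.0 * branch_length / 3.0)
    if same_base:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def jc69_pair_log_likelihood(seq_a: str, seq_b: str, branch_length: float) -> float:
    """Log-likelihood of two aligned, gap-free sequences joined by one branch.

    Each site contributes the stationary frequency (0.25) of the base in
    seq_a times the transition probability to the base in seq_b.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        total += log(0.25 * jc69_transition_prob(branch_length, a == b))
    return total


if __name__ == "__main__":
    # One mismatch in eight sites: compare a short and a long branch.
    print(jc69_pair_log_likelihood("ACGTACGT", "ACGTACGA", 0.1))
    print(jc69_pair_log_likelihood("ACGTACGT", "ACGTACGA", 2.0))
```

A full tree likelihood combines such per-branch probabilities over all internal states, typically with Felsenstein's pruning algorithm.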

Are all genetic variants in DNase I sensitivity regions functional?

Gregory A Moyerbrailean, Chris T Harvey, Cynthia A Kalita, Xiaoquan Wen, Francesca Luca, Roger Pique-Regi

A detailed mechanistic understanding of the direct functional consequences of DNA variation on gene regulatory mechanism is critical for a complete understanding of complex trait genetics and evolution. Here, we present a novel approach that integrates sequence information and DNase I footprinting data to predict the impact of a sequence change on transcription factor binding. Applying this approach to 653 DNase-seq samples, we identified 3,831,862 regulatory variants predicted to affect active regulatory elements for a panel of 1,372 transcription factor motifs. Using QuASAR, we validated the non-coding variants predicted to be functional by examining allele-specific binding (ASB). Combining the predictive model and the ASB signal, we identified 3,217 binding variants within footprints that are significantly imbalanced (20% FDR). Even though most variants in DNase I hypersensitive regions may not be functional, we estimate that 56% of our annotated functional variants show actual evidence of ASB. To assess the effect these variants may have on complex phenotypes, we examined their association with complex traits using GWAS and observed that ASB-SNPs are enriched 1.22-fold for complex traits variants. Furthermore, we show that integrating footprint annotations into GWAS meta-study results improves identification of likely causal SNPs and provides a putative mechanism by which the phenotype is affected.

MIPSTR: a method for multiplex genotyping of germ-line and somatic STR variation across many individuals

Keisha Dawn Carlson, Peter H Sudmant, Maximilian Oliver Press, Evan E Eichler, Jay Shendure, Christine Queitsch

Short tandem repeats (STRs) are highly mutable genetic elements that often reside in functional genomic regions. The cumulative evidence of genetic studies on individual STRs suggests that STR variation profoundly affects phenotype and contributes to trait heritability. Despite recent advances in sequencing technology, STR variation has remained largely inaccessible across many individuals compared to single nucleotide variation or copy number variation. STR genotyping with short-read sequence data is confounded by (1) the difficulty of uniquely mapping short, low-complexity reads and (2) the high rate of STR amplification stutter. Here, we present MIPSTR, a robust, scalable, and affordable method that addresses these challenges. MIPSTR uses targeted capture of STR loci by single-molecule Molecular Inversion Probes (smMIPs) and a unique mapping strategy. Targeted capture and mapping strategy resolve the first challenge; the use of single molecule information resolves the second challenge. Unlike previous methods, MIPSTR is capable of distinguishing technical error due to amplification stutter from somatic STR mutations. In proof-of-principle experiments, we use MIPSTR to determine germ-line STR genotypes for 102 STR loci with high accuracy across diverse populations of the plant A. thaliana. We show that putatively functional STRs may be identified by deviation from predicted STR variation and by association with quantitative phenotypes. Employing DNA mixing experiments and a mutant deficient in DNA repair, we demonstrate that MIPSTR can detect low-frequency somatic STR variants. MIPSTR is applicable to any organism with a high-quality reference genome and is scalable to genotyping many thousands of STR loci in thousands of individuals.

An estimate of the average number of recessive lethal mutations carried by humans

Ziyue Gao, Darrel Waggoner, Matthew Stephens, Carole Ober, Molly Przeworski
(Submitted on 28 Jul 2014)

The effects of inbreeding on human health depend critically on the number and severity of recessive, deleterious mutations carried by individuals. In humans, existing estimates of these quantities are based on comparisons between consanguineous and non-consanguineous couples, an approach that confounds socioeconomic and genetic effects of inbreeding. To circumvent this limitation, we focused on a founder population with almost complete Mendelian disease ascertainment and a known pedigree. By considering all recessive lethal diseases reported in the pedigree and simulating allele transmissions, we estimated that each haploid set of human autosomes carries on average 0.29 (95% credible interval [0.10, 0.83]) autosomal, recessive alleles that lead to complete sterility or severe disorders at birth or before reproductive age when homozygous. Comparison to existing estimates of the deleterious effects of all recessive alleles suggests that a substantial fraction of the burden of autosomal, recessive variants is due to single mutations that lead to death between birth and reproductive age. In turn, the comparison to estimates from other eukaryotes points to a surprising constancy of the average number of recessive lethal mutations across organisms with markedly different genome sizes.

Bayesian mixture analysis for metagenomic community profiling.

Sofia Morfopoulou, Vincent Plagnol

Deep sequencing of clinical samples is now an established tool for the detection of infectious pathogens, with direct medical applications. The large amount of data generated provides an opportunity to detect species even at very low levels, provided that computational tools can effectively interpret potentially complex metagenomic mixtures. Data interpretation is complicated by the fact that short sequencing reads can match multiple organisms and by the lack of completeness of existing databases, in particular for viral pathogens. This interpretation problem can be formulated statistically as a mixture model, where the species of origin of each read is missing, but the complete knowledge of all species present in the mixture helps with the assignment of individual reads. Several analytical tools have been proposed to approximately solve this computational problem. Here, we show that the use of parallel Monte Carlo Markov chains (MCMC) for the exploration of the species space enables the identification of the set of species most likely to contribute to the mixture. The added accuracy comes at a cost of increased computation time. Our approach is useful for solving complex mixtures involving several related species. We designed our method specifically for the analysis of deep transcriptome sequencing datasets and with a particular focus on viral pathogen detection, but the principles are applicable more generally to all types of metagenomic mixtures. The code is available on GitHub (http://github.com/smorfopoulou/metaMix) and the process is currently being implemented in a user-friendly R package (metaMix, to be submitted to CRAN).
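The mixture-model formulation described above can be made concrete with a small example. The sketch below estimates species proportions for a fixed candidate set with a plain expectation-maximization (EM) loop; this is a swapped-in, simpler technique chosen for illustration, not the parallel MCMC exploration of the species space that the abstract describes. The function name `em_mixture_weights` and the input layout (one row per read, one column per species, entries assumed to be P(read | species)) are hypothetical.

```python
def em_mixture_weights(read_likelihoods, n_iter=200):
    """Estimate mixture proportions over a fixed set of species with EM.

    read_likelihoods[i][j] is assumed to be P(read i | species j).
    Returns one weight per species; the weights sum to 1.
    """
    n_species = len(read_likelihoods[0])
    weights = [1.0 / n_species] * n_species
    for _ in range(n_iter):
        totals = [0.0] * n_species
        for row in read_likelihoods:
            # E-step: posterior probability that this read came from each species.
            joint = [w * p for w, p in zip(weights, row)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # read matches no species under the current weights
            for j, value in enumerate(joint):
                totals[j] += value / norm
        assigned = sum(totals)
        if assigned == 0.0:
            break  # nothing could be assigned; keep the previous weights
        # M-step: new weights are the expected fractions of assigned reads.
        weights = [t / assigned for t in totals]
    return weights


if __name__ == "__main__":
    # Six reads unique to species A, two unique to B, two matching both equally.
    reads = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2 + [[1.0, 1.0]] * 2
    print(em_mixture_weights(reads))
```

In this toy input the ambiguous reads are shared out in proportion to the evidence from the uniquely mapping reads, which is how knowledge of the whole mixture informs individual read assignment.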

Long non-coding RNA discovery in Anopheles gambiae using deep RNA sequencing

Adam M Jenkins, Robert M Waterhouse, Alan S Kopin, Marc A.T. Muskavitch

Long non-coding RNAs (lncRNAs) are mRNA-like transcripts longer than 200 bp that have no protein-coding potential. lncRNAs have recently been implicated in epigenetic regulation, transcriptional and post-transcriptional gene regulation, and regulation of genomic stability in mammals, Caenorhabditis elegans, and Drosophila melanogaster. Using deep RNA sequencing of multiple Anopheles gambiae life stages, we have identified over 600 novel lncRNAs and more than 200 previously unannotated putative protein-coding genes. The lncRNAs exhibit differential expression profiles across life stages and adult genders. Those lncRNAs that are antisense to known protein-coding genes or are contained within intronic regions of protein-coding genes may mediate transcriptional repression or stabilization of associated mRNAs. lncRNAs exhibit faster rates of sequence evolution across anophelines compared to previously known and newly identified protein-coding genes. This initial description of lncRNAs in An. gambiae offers the first genome-wide insights into long non-coding RNAs in this vector mosquito and defines a novel set of potential targets for the development of vector-based interventions that may curb the human malaria burden in disease-endemic countries.

Comparative Performance of Two Whole Genome Capture Methodologies on Ancient DNA Illumina Libraries

Maria Avila-Arcos, Marcela Sandoval-Velasco, Hannes Schroeder, Meredith L Carpenter, Anna-Sapfo Malaspinas, Nathan Wales, Fernando Peñaloza, Carlos D Bustamante, M. Thomas P Gilbert

1. The application of whole genome capture (WGC) methods to ancient DNA (aDNA) promises to increase the efficiency of ancient genome sequencing. 2. We compared the performance of two recently developed WGC methods in enriching human aDNA within Illumina libraries built using both double-stranded (DSL) and single-stranded (SSL) build protocols. Although both methods effectively enriched aDNA, one consistently produced marginally better results, giving us the opportunity to further explore the parameters influencing WGC experiments. 3. Our results suggest that bait length has an important influence on library enrichment. Moreover, we show that WGC biases against the shorter molecules that are enriched in SSL preparation protocols. Therefore application of WGC to such samples is not recommended without future optimization. Lastly, we document the effect of WGC on other features including clonality, GC composition and repetitive DNA content of captured libraries. 4. Our findings provide insights for researchers planning to perform WGC on aDNA, and suggest future tests and optimization to improve WGC efficiency.