MIPSTR: a method for multiplex genotyping of germ-line and somatic STR variation across many individuals
Keisha Dawn Carlson, Peter H Sudmant, Maximilian Oliver Press, Evan E Eichler, Jay Shendure, Christine Queitsch
Abstract Short tandem repeats (STRs) are highly mutable genetic elements that often reside in functional genomic regions. The cumulative evidence of genetic studies on individual STRs suggests that STR variation profoundly affects phenotype and contributes to trait heritability. Despite recent advances in sequencing technology, STR variation has remained largely inaccessible across many individuals compared to single nucleotide variation or copy number variation. STR genotyping with short-read sequence data is confounded by (1) the difficulty of uniquely mapping short, low-complexity reads and (2) the high rate of STR amplification stutter. Here, we present MIPSTR, a robust, scalable, and affordable method that addresses these challenges. MIPSTR uses targeted capture of STR loci by single-molecule Molecular Inversion Probes (smMIPs) and a unique mapping strategy. Targeted capture and mapping strategy resolve the first challenge; the use of single molecule information resolves the second challenge. Unlike previous methods, MIPSTR is capable of distinguishing technical error due to amplification stutter from somatic STR mutations. In proof-of-principle experiments, we use MIPSTR to determine germ-line STR genotypes for 102 STR loci with high accuracy across diverse populations of the plant A. thaliana. We show that putatively functional STRs may be identified by deviation from predicted STR variation and by association with quantitative phenotypes. Employing DNA mixing experiments and a mutant deficient in DNA repair, we demonstrate that MIPSTR can detect low-frequency somatic STR variants. MIPSTR is applicable to any organism with a high-quality reference genome and is scalable to genotyping many thousands of STR loci in thousands of individuals.
An estimate of the average number of recessive lethal mutations carried by humans
Ziyue Gao, Darrel Waggoner, Matthew Stephens, Carole Ober, Molly Przeworski
(Submitted on 28 Jul 2014)
The effects of inbreeding on human health depend critically on the number and severity of recessive, deleterious mutations carried by individuals. In humans, existing estimates of these quantities are based on comparisons between consanguineous and non-consanguineous couples, an approach that confounds socioeconomic and genetic effects of inbreeding. To circumvent this limitation, we focused on a founder population with almost complete Mendelian disease ascertainment and a known pedigree. By considering all recessive lethal diseases reported in the pedigree and simulating allele transmissions, we estimated that each haploid set of human autosomes carries on average 0.29 (95% credible interval [0.10, 0.83]) autosomal, recessive alleles that lead to complete sterility or severe disorders at birth or before reproductive age when homozygous. Comparison to existing estimates of the deleterious effects of all recessive alleles suggests that a substantial fraction of the burden of autosomal, recessive variants is due to single mutations that lead to death between birth and reproductive age. In turn, the comparison to estimates from other eukaryotes points to a surprising constancy of the average number of recessive lethal mutations across organisms with markedly different genome sizes.
Comparative Performance of Two Whole Genome Capture Methodologies on Ancient DNA Illumina Libraries
Maria Avila-Arcos, Marcela Sandoval-Velasco, Hannes Schroeder, Meredith L Carpenter, Anna-Sapfo Malaspinas, Nathan Wales, Fernando Peñaloza, Carlos D Bustamante, M. Thomas P Gilbert
1. The application of whole genome capture (WGC) methods to ancient DNA (aDNA) promises to increase the efficiency of ancient genome sequencing. 2. We compared the performance of two recently developed WGC methods in enriching human aDNA within Illumina libraries built using both double-stranded (DSL) and single-stranded (SSL) build protocols. Although both methods effectively enriched aDNA, one consistently produced marginally better results, giving us the opportunity to further explore the parameters influencing WGC experiments. 3. Our results suggest that bait length has an important influence on library enrichment. Moreover, we show that WGC biases against the shorter molecules that are enriched in SSL preparation protocols. Therefore application of WGC to such samples is not recommended without future optimization. Lastly, we document the effect of WGC on other features including clonality, GC composition and repetitive DNA content of captured libraries. 4. Our findings provide insights for researchers planning to perform WGC on aDNA, and suggest future tests and optimization to improve WGC efficiency.
Butter: High-precision genomic alignment of small RNA-seq data
Michael J Axtell
Eukaryotes produce large numbers of small non-coding RNAs that act as specificity determinants for various gene-regulatory complexes. These include microRNAs (miRNAs), endogenous short interfering RNAs (siRNAs), and Piwi-associated RNAs (piRNAs). These RNAs can be discovered, annotated, and quantified using small RNA-seq, a variant RNA-seq method based on highly parallel sequencing. Alignment to a reference genome is a critical step in analysis of small RNA-seq data. Because of their small size (20-30 nts depending on the organism and sub-type) and tendency to originate from multi-gene families or repetitive regions, reads that align equally well to more than one genomic location are very common. Typical methods to deal with multi-mapped small RNA-seq reads sacrifice either precision or sensitivity. The tool ‘butter’ balances precision and sensitivity by placing multi-mapped reads using an iterative approach, where the decision between possible locations is dictated by the local densities of more confidently aligned reads. Butter displays superior performance relative to other small RNA-seq aligners. Treatment of multi-mapped small RNA-seq reads has substantial impacts on downstream analyses, including quantification of MIRNA paralogs, and discovery of endogenous siRNA loci. Butter is freely available under a GNU general public license.
Clonal interference and Muller’s ratchet in spatial habitats
Jakub Otwinowski, Joachim Krug
(Submitted on 18 Feb 2013 (v1), last revised 23 Jul 2014 (this version, v3))
Competition between independently arising beneficial mutations is enhanced in spatial populations due to the linear rather than exponential growth of clones. Recent theoretical studies have pointed out that the resulting fitness dynamics is analogous to a surface growth process, where new layers nucleate and spread stochastically, leading to the build up of scale-invariant roughness. This scenario differs qualitatively from the standard view of adaptation in that the speed of adaptation becomes independent of population size while the fitness variance does not. Here we exploit recent progress in the understanding of surface growth processes to obtain precise predictions for the universal, non-Gaussian shape of the fitness distribution for one-dimensional habitats, which are verified by simulations. When the mutations are deleterious rather than beneficial the problem becomes a spatial version of Muller’s ratchet. In contrast to the case of well-mixed populations, the rate of fitness decline remains finite even in the limit of an infinite habitat, provided the ratio Ud/s2 between the deleterious mutation rate and the square of the (negative) selection coefficient is sufficiently large. Using again an analogy to surface growth models we show that the transition between the stationary and the moving state of the ratchet is governed by directed percolation.
Concerning RNA-Guided Gene Drives for the Alteration of Wild Populations
Kevin M Esvelt, Andrea L Smidler, Flaminia Catteruccia, George M Church
Gene drives may be capable of addressing ecological problems by altering entire populations of wild organisms, but their use has remained largely theoretical due to technical constraints. Here we consider the potential for RNA-guided gene drives based on the CRISPR nuclease Cas9 to serve as a general method for spreading altered traits through wild populations over many generations. We detail likely capabilities, discuss limitations, and provide novel precautionary strategies to control the spread of gene drives and reverse genomic changes. The ability to edit populations of sexual species would offer substantial benefits to humanity and the environment. For example, RNA-guided gene drives could potentially prevent the spread of disease, support agriculture by reversing pesticide and herbicide resistance in insects and weeds, and control damaging invasive species. However, the possibility of unwanted ecological effects and near-certainty of spread across political borders demand careful assessment of each potential application. We call for thoughtful, inclusive, and well-informed public discussions to explore the responsible use of this currently theoretical technology.
Assessing allele specific expression across multiple tissues from RNA-seq read data
Matti Pirinen, Tuuli Lappalainen, Noah A Zaitlen, GTEx Consortium, Emmanouil T Dermitzakis, Peter Donnelly, Mark I McCarthy, Manuel A Rivas
Motivation: RNA sequencing enables allele specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression project (GTEx) is collecting RNA-seq data on multiple tissues of a same set of individuals and novel methods are required for the analysis of these data. Results: We present a statistical method to compare different patterns of ASE across tissues and to classify genetic variants according to their impact on the tissue-wide expression profile. We focus on strong ASE effects that we are expecting to see for protein-truncating variants, but our method can also be adjusted for other types of ASE effects. We illustrate the method with a real data example on a tissue-wide expression profile of a variant causal for lipoid proteinosis, and with a simulation study to assess our method more generally. Availability: MAMBA software: http://birch.well.ox.ac.uk/~rivas/mamba/ R source code and data examples: http://www.iki.fi/mpirinen/ Contact: email@example.com firstname.lastname@example.org
Fixation properties of subdivided populations with balancing selection
Pierangelo Lombardo, Andrea Gambassi, Luca Dall’Asta
Comments: 17 pages, 10 figures
Subjects: Populations and Evolution (q-bio.PE); Statistical Mechanics (cond-mat.stat-mech); Biological Physics (physics.bio-ph)
In subdivided populations, migration acts together with selection and genetic drift and determines their evolution. Building up on a recently proposed method, which hinges on the emergence of a time scale separation between local and global dynamics, we study the fixation properties of subdivided populations in the presence of balancing selection. The approximation implied by the method is accurate when the effective selection strength is small and the number of subpopulations is large. In particular, it predicts a phase transition between species coexistence and biodiversity loss in the infinite-size limit and, in finite populations, a nonmonotonic dependence of the mean fixation time on the migration rate. In order to investigate the fixation properties of the subdivided population for stronger selection, we introduce an effective coarser description of the dynamics in terms of a voter model with intermediate states, which highlights the basic mechanisms driving the evolutionary process.
RNA-seq gene profiling – a systematic empirical comparison
Nuno A Fonseca, John A Marioni, Alvis Brazma
Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the “true” expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g, read length and sequencing depth). In the absence of knowledge of the ‘ground truth’ in real RNAseq data sets, we used simulated data to assess the differences between the true expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNAseq data, it still allows to assess the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
Reagent contamination can critically impact sequence-based microbiome analyses
Susannah Salter, Michael J Cox, Elena M Turek, Szymon T Calus, William O Cookson, Miriam F Moffatt, Paul Turner, Julian Parkhill, Nick Loman, Alan W Walker
The study of microbial communities has been revolutionised in recent years by the widespread adoption of culture independent analytical techniques such as 16S rRNA gene sequencing and metagenomics. One potential confounder of these sequence-based approaches is the presence of contamination in DNA extraction kits and other laboratory reagents. In this study we demonstrate that contaminating DNA is ubiquitous in commonly used DNA extraction kits, varies greatly in composition between different kits and kit batches, and that this contamination critically impacts results obtained from samples containing a low microbial biomass. Contamination impacts both PCR-based 16S rRNA gene surveys and shotgun metagenomics. These results suggest that caution should be advised when applying sequence-based techniques to the study of microbiota present in low biomass environments. We provide an extensive list of potential contaminating genera, and guidelines on how to mitigate the effects of contamination. Concurrent sequencing of negative control samples is strongly advised.