Assembly of polymorphic Alu repeat sequences from whole genome sequence data in diverse humans

Julia H Wildschutte , Alayna A Baron , Nicolette M Diroff , Jeffrey M Kidd
doi: http://dx.doi.org/10.1101/014977

Alu insertions have contributed to >11% of the human genome. About 30-35 Alu subfamilies remain actively mobile, and are recognized as major drivers of genetic variation and disease. Sophisticated computational methods permit identification of non-reference insertions based on specific signatures from whole genome sequencing data, but reporting of entire insertion sequences is limited. We build on existing methods and develop an approach that combines Alu detection and de novo assembly of WGS data to reconstruct the full sequence of insertion events. Using this approach, we generate a highly accurate call set of 1,614 completely assembled Alu variants from 53 samples from the Human Genome Diversity Project panel. Experimental validation of 30 sites shows that this method accurately reconstructs the insertion sequence. We utilize the reconstructed alternative insertion haplotypes to genotype 1,010 fully assembled insertions, obtaining >99% accuracy. We find evidence of insertion by non-classical mechanisms and observe 5′ truncation in 16% of AluYa5 and AluYb8 insertions. The sites of truncation coincide with stem-loop structures and SRP9/14 binding sites in the Alu RNA, implicating L1 ORF2p pausing in the generation of 5′ truncations.

Improving access to endogenous DNA in ancient bones and teeth

Peter de Barros Damgaard , Ashot Margaryan , Hannes Schroeder , Ludovic Orlando , Eske Willerslev , Morten E Allentoft
doi: http://dx.doi.org/10.1101/014985

Poor DNA preservation is the most limiting factor in ancient genomic research. In the vast majority of ancient bones and teeth, endogenous DNA molecules represent only a minor fraction of the whole DNA extract, rendering traditional shotgun sequencing approaches cost-ineffective for whole-genome characterization. Based on ancient human bone samples from temperate and tropical environments, we show that an initial EDTA-based enzymatic ‘pre-digestion’ of powdered bone increases the proportion of endogenous DNA several-fold. By performing the pre-digestion step between 30 min and 6 hours on five bones, we identify the optimal pre-digestion time and document an average increase of 2.7 times in the endogenous DNA fraction after 1 hour of pre-digestion. With longer pre-digestion times, the increase is asymptotic while molecular complexity decreases. We repeated the experiment with n=21 and t=15-30′, and document a significant increase in endogenous DNA content (one-sided paired t-test: p=0.009). We advocate the implementation of a short pre-digestion step as a standard procedure in ancient DNA extractions from bone material. Finally, we demonstrate on 14 ancient teeth that crushed cementum of the roots contains up to 14 times more endogenous DNA than the dentine. Our presented methodological guidelines considerably advance the ability to characterize ancient genomes.
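
The paired comparison mentioned in the abstract can be illustrated with a small sketch. It computes a paired t statistic and a mean fold increase for endogenous DNA fractions measured with and without pre-digestion. The function names are hypothetical, and taking the mean of per-sample ratios as the "average increase" is an assumption, not necessarily how the authors computed their 2.7-fold figure. The p-value is left out because it needs a t-distribution CDF, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(untreated, predigested):
    """Paired t statistic for predigested minus untreated (same samples, same order)."""
    if len(untreated) != len(predigested) or len(untreated) < 2:
        raise ValueError("need two equal-length series with at least two samples")
    diffs = [after - before for before, after in zip(untreated, predigested)]
    spread = stdev(diffs)  # sample standard deviation of the paired differences
    if spread == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (spread / sqrt(len(diffs)))


def mean_fold_increase(untreated, predigested):
    """Average of per-sample ratios (assumed definition of 'average increase')."""
    return mean(after / before for before, after in zip(untreated, predigested))
```

A positive statistic supports the one-sided alternative that pre-digestion raises the endogenous fraction; it would be compared against a t distribution with n-1 degrees of freedom.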

Linkage Disequilibrium and Inversion-Typing of the Drosophila melanogaster Genome Reference Panel

David Houle , Eladio J. Marquez
doi: http://dx.doi.org/10.1101/014936

We calculated the linkage disequilibrium between all pairs of variants in the Drosophila Genome Reference Panel, and make available the list of all highly correlated SNPs for use in association studies. Seventy-three percent of variant SNPs are correlated at r²>0.5 with at least one other SNP, and the mean number of correlated SNPs per variant over the whole genome is 64.9. Disequilibrium between distant SNPs is also common when minor allele frequency (MAF) is low: 24% of SNPs with MAF<0.1 are highly correlated with SNPs more than 100kb distant. While SNPs within regions with polymorphic inversions are highly correlated with somewhat larger numbers of SNPs, and these correlated SNPs are on average farther away, the probability that a SNP in such regions is highly correlated with at least one other SNP is very similar to SNPs outside inversions. Previous karyotyping of the DGRP lines has been inconsistent, and we used LD and genotype to investigate these discrepancies. When previous studies agreed on inversion karyotype, our analysis was almost perfectly concordant with those assignments. In discordant cases, and for inversion heterozygotes, our results suggest errors in two previous analyses, or discordance between genotype and karyotype. Heterozygosities of chromosome arms are in many cases surprisingly highly correlated, suggesting strong epistatic selection during the inbreeding and maintenance of the DGRP lines.
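
As a rough illustration of the pairwise statistic described above, the sketch below computes r² between SNPs and lists the pairs above a threshold. It assumes each inbred line is coded 0 or 1 at each SNP (reference or alternate allele, ignoring residual heterozygosity and missing calls), and the function names are hypothetical. It is not the authors' pipeline, and a naive all-pairs loop like this would be far too slow for a whole genome.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("allele vectors must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic site: correlation is undefined, reported as 0 here
    return cov * cov / (var_a * var_b)


def correlated_pairs(genotypes, threshold=0.5):
    """Return (snp1, snp2, r2) for every SNP pair with r2 above the threshold.

    genotypes maps a SNP name to its list of 0/1 alleles, one entry per line.
    """
    names = sorted(genotypes)
    pairs = []
    for index, first in enumerate(names):
        for second in names[index + 1:]:
            r2 = r_squared(genotypes[first], genotypes[second])
            if r2 > threshold:
                pairs.append((first, second, r2))
    return pairs
```

The default threshold of 0.5 mirrors the cutoff quoted in the abstract.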

Evolution of selenophosphate synthetases: emergence and relocation of function through independent duplications and recurrent subfunctionalization

Marco Mariotti , Didac Santesmasses , Salvador Capella-Gutierrez , Andrea Mateo , Carme Arnan , Rory Johnson , Salvatore D’Aniello , Sun Hee Yim , Vadim N Gladyshev , Florenci Serras , Montserrat Corominas , Toni Gabaldon , Roderic Guigo
doi: http://dx.doi.org/10.1101/014928

SPS catalyzes the synthesis of selenophosphate, the selenium donor for the synthesis of the amino acid selenocysteine (Sec), incorporated in selenoproteins in response to the UGA codon. SPS is unique among proteins of the selenoprotein biosynthesis machinery in that it is, in many species, a selenoprotein itself, although, as in all selenoproteins, Sec is often replaced by cysteine (Cys). In metazoan genomes we found, however, SPS genes with lineage-specific substitutions other than Sec or Cys. Our results show that these non-Sec, non-Cys SPS genes originated through a number of independent gene duplications of diverse molecular origin from an ancestral selenoprotein SPS gene. Although of independent origin, complementation assays in fly mutants show that these genes share a common function, which most likely emerged in the ancestral metazoan gene. This function appears to be unrelated to selenophosphate synthesis, since all genomes encoding selenoproteins contain Sec or Cys SPS genes (SPS2), but those containing only non-Sec, non-Cys SPS genes (SPS1) do not encode selenoproteins. Thus, in SPS genes, through parallel duplications and subsequent convergent subfunctionalization, two functions initially carried by a single gene are recurrently segregated at two different loci. RNA structures enhancing the readthrough of the Sec-UGA codon in SPS genes, which may be traced back to prokaryotes, played a key role in this process. The SPS evolutionary history in metazoans constitutes a remarkable example of the emergence and evolution of gene function. We have been able to trace this history with unusual detail thanks to the singular feature of SPS genes, wherein the amino acid at a single site determines protein function, and, ultimately, the evolutionary fate of an entire class of genes.

Global determinants of mRNA degradation rates in Saccharomyces cerevisiae

Benjamin Neymotin, Victoria Ettorre, David Gresham
doi: http://dx.doi.org/10.1101/014845

Degradation of mRNA contributes to variation in transcript abundance. Studies of individual mRNAs show that cis and trans factors control mRNA degradation rates. However, transcriptome-wide studies have failed to identify global relationships between transcript properties and mRNA degradation. We investigated the contribution of cis and trans factors to transcriptome-wide degradation rate variation in the budding yeast, Saccharomyces cerevisiae, using multiple regression analysis. We find that multiple transcript properties are associated with mRNA degradation rates and that a model incorporating these factors explains ~50% of the genome-wide variance. Predictors of mRNA degradation rates include transcript length, abundance, ribosome density, codon adaptation index (CAI) and GC content of the third position in codons. To validate these factors we studied individual transcripts expressed from identical promoters. We find that decreasing ribosome density by mutating the translational start site of the GAP1 transcript increases its degradation rate. Using variants of GFP that differ at synonymous sites, we show that increased GC content of the third position of codons results in decreased mRNA degradation rate. Thus, in steady-state conditions, a large fraction of genome-wide variation in mRNA degradation rates is determined by inherent properties of transcripts related to protein translation rather than specific regulatory mechanisms.

A Pleiotropy-Informed Bayesian False Discovery Rate adapted to a Shared Control Design Finds New Disease Associations From GWAS Summary Statistics

James Liley, Chris Wallace
doi: http://dx.doi.org/10.1101/014886

Genome-wide association studies (GWAS) have been successful in identifying single nucleotide polymorphisms (SNPs) associated with many traits and diseases. However, at existing sample sizes, these variants explain only part of the estimated heritability. Leverage of GWAS results from related phenotypes may improve detection without the need for larger datasets. The Bayesian conditional false discovery rate (cFDR) constitutes an upper bound on the expected false discovery rate (FDR) across a set of SNPs whose p values for two diseases are both less than two disease-specific thresholds. Calculation of the cFDR requires only summary statistics and has several advantages over traditional GWAS analysis. However, existing methods require distinct control samples between studies. Here, we extend the technique to allow for some or all controls to be shared, increasing applicability. Several different SNP sets can be defined with the same cFDR value, and we show that the expected FDR across the union of these sets may exceed expected FDR in any single set. We describe a procedure to establish an upper bound for the expected FDR among the union of such sets of SNPs. We apply our technique to pairwise analysis of p values from ten autoimmune diseases with variable sharing of controls, enabling discovery of 59 SNP-disease associations which do not reach GWAS significance after genomic control in individual datasets. Most of the SNPs we highlight have previously been confirmed using replication studies or larger GWAS, a useful validation of our technique; we report eight SNP-disease associations across five diseases not previously declared. Our technique extends and strengthens the previous algorithm, and establishes robust limits on the expected FDR. This approach can improve SNP detection in GWAS, and give insight into shared aetiology between phenotypically related conditions.
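
To make the cFDR idea concrete, here is a minimal sketch of an empirical estimator of the kind used in earlier cFDR work, which (as an assumption on my part) takes the form p_i × #{SNPs with p_j' ≤ p_j} / #{SNPs with p_i' ≤ p_i and p_j' ≤ p_j}. The function name is hypothetical. The sketch covers only this basic estimate; it does not implement the shared-control adjustment or the bound on the FDR across unions of SNP sets that the abstract describes.

```python
def cfdr_estimate(p_principal, p_conditional, index):
    """Empirical conditional FDR estimate for one SNP.

    p_principal:   p values for the disease of interest, one per SNP
    p_conditional: p values for the related disease, same SNP order
    index:         position of the SNP being evaluated
    """
    if len(p_principal) != len(p_conditional):
        raise ValueError("both p value lists must cover the same SNPs")
    p_i = p_principal[index]
    p_j = p_conditional[index]
    # SNPs at least as significant for the conditional disease
    n_conditional = sum(1 for q in p_conditional if q <= p_j)
    # SNPs at least as significant for both diseases (always includes this SNP)
    n_both = sum(
        1 for p, q in zip(p_principal, p_conditional) if p <= p_i and q <= p_j
    )
    return min(1.0, p_i * n_conditional / n_both)
```

When the two diseases share signal, small conditional p values enrich for small principal p values, so the ratio falls below the raw p value and the SNP becomes easier to call.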

Natural Selection Shapes the Mosaic Ancestry of the Drosophila Genetic Reference Panel and the D. melanogaster Reference Genome

John E Pool
doi: http://dx.doi.org/10.1101/014837

North American populations of Drosophila melanogaster are thought to derive from both European and African source populations, but despite their importance for genetic research, patterns of admixture along their genomes are essentially undocumented. Here, I infer geographic ancestry along genomes of the Drosophila Genetic Reference Panel (DGRP) and the D. melanogaster reference genome. Overall, the proportion of African ancestry was estimated to be 20% for the DGRP and 9% for the reference genome. Based on the size of admixture tracts and the approximate timing of admixture, I estimate that the DGRP population underwent roughly 13.9 generations per year. Notably, ancestry levels varied strikingly among genomic regions, with significantly less African introgression on the X chromosome, in regions of high recombination, and at genes involved in specific processes such as circadian rhythm. An important role for natural selection during the admixture process was further supported by a genome-wide signal of ancestry disequilibrium, in that many between-chromosome pairs of loci showed a deficiency of Africa-Europe allele combinations. These results support the hypothesis that admixture between partially genetically isolated Drosophila populations led to natural selection against incompatible genetic variants, and that this process is ongoing. The ancestry blocks inferred here may be relevant for the performance of reference alignment in this species, and may bolster the design and interpretation of many population genetic and association mapping studies.

Author post: Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance

This guest post is by Pleuni Pennings about her paper (with co-authors) Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance, bioRxived here. This is a cross-post from Pleuni’s blog.

Almost three years ago, in early 2012, I attended a talk by Martin Nowak. He talked about cancer and one of the things he said was that treatment with multiple drugs at the same time is a good idea because it helps prevent the evolution of drug resistance. Specifically, he explained, when treatment is with multiple drugs, the pathogen (tumor cells in the case of cancer) needs to acquire multiple resistance mutations at the same time in order to escape drug pressure.

As I listened to Martin Nowak’s talk, I was thinking of HIV, not cancer. At that time, I had already spent about two years working on drug resistance in HIV. Treatment of HIV is always with multiple drugs, for the same reason that Martin Nowak highlighted in his talk: it helps prevent the evolution of drug resistance.

However, as I read the HIV drug resistance literature and analyzed sequence data from HIV patients, I found evidence that drug resistance mutations in HIV tend to accumulate one at a time. This is contrary to the generally accepted idea that the pathogen must acquire resistance mutations simultaneously.

There seemed to be a clear mismatch between data and theory. Data show mutations are acquired one at a time, and theory says mutations must be acquired simultaneously. One of the two must be wrong, and it can’t be the data![1]

Interesting!

After Martin Nowak’s talk, I went up to him and told him how I thought data didn’t fit the theory. Martin’s response: “Oh, that is interesting!” (Imagine this being said with an Austrian accent). “Let’s meet and talk about it.”

So, we met. Logically, Alison Hill and Daniel Rosenbloom, then grad students in Martin’s group, were there too. I had already met with Alison and Daniel many times, since they were also working on drug resistance in HIV. John Wakeley (my advisor at Harvard) came to the meeting too.

Between the five of us, we brainstormed and fairly quickly realized that one solution to the conundrum was to assume that a patient’s body consists of different compartments and that each drug may not penetrate into each compartment. Maybe we found this solution quickly because Alison and Daniel had already been thinking about drug penetration in the context of another project. A body compartment that has only one drug instead of two or three would allow a pathogen that has acquired one drug resistance mutation to replicate. If a pathogen with just one mutation has a place to replicate, this makes it possible for the pathogen to acquire resistance mutations one at a time.

We decided to start a collaboration to analyze a formal model to see whether our intuition was correct. Over the following three years, there were some personnel changes and several moves, graduations and new jobs. Stefany Moreno joined the project as a student from the European MEME Master’s program when she spent a semester in Martin’s group. When I moved to Stanford, Dmitri Petrov became involved in the project. Next, Alison and Daniel each got their PhD and started postdocs (Alison at Harvard, Daniel at Columbia), Stefany got her MSc and started a PhD in Groningen, and I had a baby and became an assistant professor at SFSU. No one would have been surprised if the project had never been finished! But we stuck with it and after many hours of work, especially by the first authors Alison and Stefany, and countless Google Hangout meetings, we can now confidently say that our initial intuition from that meeting in 2012 was correct. Compartments with imperfect drug penetration indeed allow pathogens to acquire drug resistance one mutation at a time. And, importantly, the evolution of multi-drug resistance can happen fast if mutations can be acquired one at a time, much faster than when simultaneous mutations are needed.
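
A back-of-the-envelope calculation shows why one-at-a-time acquisition is so much faster. This is a toy illustration only, not the model in the manuscript: it assumes a per-replication probability mu of each resistance mutation, treats the two mutations as independent, and ignores population size, selection and migration between compartments. The function names are hypothetical.

```python
def expected_replications_simultaneous(mu):
    """Expected replications until one event produces both mutations at once."""
    return 1.0 / (mu * mu)


def expected_replications_stepwise(mu):
    """Expected replications when the first mutant can replicate and mutate again.

    Assumes a single-drug compartment lets the single mutant persist, so the
    two waiting times (about 1/mu each) simply add.
    """
    return 2.0 / mu


def speedup(mu):
    """How many times fewer replications the stepwise route needs in this toy."""
    return expected_replications_simultaneous(mu) / expected_replications_stepwise(mu)
```

With mu = 1e-5, for example, the simultaneous route needs on the order of 1e10 replications in this toy picture while the stepwise route needs about 2e5.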

Our manuscript can be found on the BioRxiv (link). It is entitled “Imperfect drug penetration leads to spatial monotherapy and rapid evolution of multi-drug resistance.” We hope you find it useful!

[1]Of course, it could be my interpretation of the data!

Molecular evolutionary consequences of island colonisation

Jennifer James, Robert Lanfear, Adam Eyre-Walker
doi: http://dx.doi.org/10.1101/014811

Island endemics are likely to experience population bottlenecks; they also have restricted ranges. Therefore we expect island species to have small effective population sizes (Ne) and reduced genetic diversity compared to their mainland counterparts. As a consequence, island species may have inefficient selection and reduced adaptive potential. We used both polymorphisms and substitutions to address these predictions, improving on the approach of recent studies that only used substitution data. This allowed us to directly test the assumption that island species have small values of Ne. We found that island species had significantly less genetic diversity than mainland species; however, this pattern could be attributed to a subset of island species that had undergone a recent population bottleneck. When these species were excluded from the analysis, island and mainland species had similar levels of genetic diversity, despite island species occupying considerably smaller areas than their mainland counterparts. We also found no overall difference between island and mainland species in terms of effectiveness of selection or mutation rate. Our evidence suggests that island colonisation has no lasting impact on molecular evolution. This surprising result highlights gaps in our knowledge of the relationship between census and effective population size.

Transition densities and sample frequency spectra of diffusion processes with selection and variable population size

Daniel Zivkovic, Matthias Steinrücken, Yun S. Song, Wolfgang Stephan
doi: http://dx.doi.org/10.1101/014639

Advances in empirical population genetics have made apparent the need for models that simultaneously account for selection and demography. To address this need, we here study the Wright-Fisher diffusion under selection and variable effective population size. In the case of genic selection and piecewise-constant effective population sizes, we obtain the transition density function by extending a recently developed method for computing an accurate spectral representation for a constant population size. Utilizing this extension, we show how to compute the sample frequency spectrum (SFS) in the presence of genic selection and an arbitrary number of instantaneous changes in the effective population size. We also develop an alternate, efficient algorithm for computing the SFS using a method of moments. We apply these methods to answer the following questions: If neutrality is incorrectly assumed when there is selection, what effects does it have on demographic parameter estimation? Can the impact of negative selection be observed in populations that undergo strong exponential growth?
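
The setting studied here can be pictured with a discrete Wright-Fisher simulation, the process that the diffusion approximates. The sketch below follows one allele under genic selection through a sequence of constant-size epochs. It is a swapped-in simulation, not the spectral or method-of-moments algorithms of the paper, and it assumes haploid-style selection with relative fitness 1+s for the focal allele. The function names and the epoch format are hypothetical.

```python
import random


def selection_step(x, s):
    """Expected frequency after selection when the focal allele has fitness 1+s."""
    return x * (1 + s) / (1 + s * x)


def simulate_trajectory(x0, s, epochs, rng):
    """Simulate allele frequency through piecewise-constant population sizes.

    epochs: list of (generations, diploid_size) pairs, applied in order
    rng:    a random.Random instance, passed in so runs are reproducible
    Returns the frequency at every generation, starting with x0.
    """
    x = x0
    trajectory = [x]
    for generations, diploid_size in epochs:
        copies = 2 * diploid_size  # gene copies sampled each generation
        for _ in range(generations):
            p = selection_step(x, s)
            # binomial sampling of the next generation (drift)
            count = sum(1 for _ in range(copies) if rng.random() < p)
            x = count / copies
            trajectory.append(x)
    return trajectory
```

For instance, simulate_trajectory(0.1, 0.02, [(50, 100), (50, 1000)], random.Random(1)) follows an allele through a tenfold instantaneous expansion. Sampling many such trajectories and tabulating allele counts in a small sample would give a crude Monte Carlo stand-in for the sample frequency spectrum.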