Approximate statistical alignment by iterative sampling of substitution matrices

Joseph L. Herman, Adrienn Szabó, István Miklós, Jotun Hein
(Submitted on 19 Jan 2015)

We outline a procedure for jointly sampling substitution matrices and multiple sequence alignments, according to an approximate posterior distribution, using an MCMC-based algorithm. This procedure provides an efficient and simple method by which to generate alternative alignments according to their expected accuracy, and allows appropriate parameters for substitution matrices to be selected in an automated fashion. In the cases considered here, the sampled alignments with the highest likelihood have an accuracy consistently higher than alignments generated using the standard BLOSUM62 matrix.

Musings on the theory that variation in cancer risk among tissues can be explained by the number of divisions of normal stem cells

Cristian Tomasetti, Bert Vogelstein
(Submitted on 21 Jan 2015)

This manuscript has been written to address questions related to our recent publication (Science 347:78-81, 2015). We appreciate the many reactions to this paper that have been communicated to us, either privately or publicly. The following addresses several of the most important statistical and technical issues related to our analysis and conclusions. Our responses to non-technical questions are available at this http URL

Mutation detection in candidate genes for paratuberculosis resistance in sheep

Bianca Moioli, Luigi De Grossi, Roberto Steri, Silvia D’Andrea, Fabio Pilla
doi: http://dx.doi.org/10.1101/014035

Marker-assisted selection exploits anonymous genetic markers that have been associated with measurable differences in complex traits. Because it relies on linkage disequilibrium between the polymorphic markers and the polymorphisms that code for the trait, its success is limited to the population in which the association was assessed. Identifying the gene that affects the target trait and detecting its functional mutations will allow selection in independent populations, while encouraging studies on gene expression. A genome-wide scan performed with the Illumina Ovine SNP50K Beadchip on 100 sheep, 50 of which tested positive in a paratuberculosis serological assessment, identified two candidate immune-response genes, PCP4 and CD109, located near markers whose allele frequencies differ between positive and negative sheep. The coding regions of the two genes were directly sequenced, and three missense mutations were detected: two in the PCP4 gene and one in the second exon of the CD109 gene. The PCP4 mutations had very low frequencies (0.12 and 0.07), making it hazardous to hypothesize a direct effect on immune response. In contrast, the mutation detected in the CD109 gene showed strong linkage disequilibrium with the anonymous marker. Direct sequencing of DNA from sheep of different populations showed that this disequilibrium was maintained. Allele frequencies at the marker hypothesized to be associated with immune response, calculated for other sheep breeds, showed that the marker allele potentially associated with disease resistance is more frequent in local breeds and in breeds that have not been subjected to selection programs.

The genetics of resistance to Morinda fruit toxin during the postembryonic stages in Drosophila sechellia

Yan Huang, Deniz Erezyilmaz
doi: http://dx.doi.org/10.1101/014027

Many phytophagous insect species are ecologic specialists that have adapted to utilize a single host plant. Drosophila sechellia is a specialist that utilizes the ripe fruit of Morinda citrifolia, which is toxic to its sibling species, D. simulans. Here we apply multiplexed shotgun genotyping and QTL analysis to examine the genetic basis of resistance to M. citrifolia fruit toxin in interspecific hybrids. We find that at least four dominant and four recessive loci interact additively to confer resistance to the M. citrifolia fruit toxin. These QTL include a dominant locus of large effect on the third chromosome (QTL-IIIsima) that was not detected in previous analyses. The small-effect loci that we identify overlap with regions that were identified in selection experiments with D. simulans on octanoic acid and in QTL analyses of adult resistance to octanoic acid. Our high-resolution analysis sheds new light upon the complexity of M. citrifolia resistance, and suggests that partial resistance to lower levels of M. citrifolia toxin could be passed through introgression from D. sechellia to D. simulans in nature. The identification of a locus of major effect, QTL-IIIsima, is an important step towards identifying the molecular basis of host plant specialization by D. sechellia.

The SMC’ is a highly accurate approximation to the ancestral recombination graph

Peter R. Wilton, Shai Carmi, Asger Hobolth
(Submitted on 12 Jan 2015)

Two sequentially Markov coalescent models (SMC and SMC’) are available as tractable approximations to the ancestral recombination graph (ARG). We present a model of coalescence at two fixed points along a pair of sequences evolving under the SMC’. Using our model, we derive a number of new quantities related to the pairwise SMC’, thereby analytically quantifying for the first time the similarity between the SMC’ and ARG. We use our model to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC’ is the same as it is marginally under the ARG, demonstrating that the SMC’ is the canonical first-order sequentially Markov approximation to the pairwise ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC’ they are approximately asymptotically unbiased.

The Time-Scale of Recombination Rate Evolution in Great Apes

Laurie S Stevison, August E Woerner, Jeffrey M Kidd, Joanna L Kelley, Krishna R Veeramah, Kimberly F McManus, Carlos D Bustamante, Michael F Hammer, Jeffrey D Wall
doi: http://dx.doi.org/10.1101/013755

We present three linkage-disequilibrium (LD)-based recombination maps generated using whole-genome sequencing data of 10 Nigerian chimpanzees, 13 bonobos, and 15 western gorillas, collected as part of the Great Ape Genome Project (Prado-Martinez et al. 2013). Using species-specific PRDM9 sequences to predict potential binding sites, we identified an important role for PRDM9 in predicting recombination rate variation broadly across great apes. Our results are contrary to previous research reporting that PRDM9 is not associated with recombination in western chimpanzees (Auton et al. 2012). Additionally, we show that fewer hotspots are shared among chimpanzee subspecies than within human populations, further narrowing the time-scale of complete hotspot turnover. We quantified the variation in the biased distribution of recombination rates towards recombination hotspots across great apes. We found that correlations between broad-scale recombination rates decline more rapidly than nucleotide divergence between species. We also compared the skew of recombination rates at centromeres and telomeres between species and show a skew from chromosome means extending as far as 10–15 Mb from chromosome ends. Further, we examined broad-scale recombination rate changes near a translocation in gorillas and found minimal differences compared to other great ape species, perhaps because the coordinates relative to the chromosome ends were unaffected. Finally, based on multiple linear regression analysis, we found that various correlates of recombination rate persist throughout primates, including repeats, diversity, divergence and local effective population size (Ne). Our study is the first to analyze within- and between-species genome-wide recombination rate variation in several close relatives.

The P-element strikes again: the recent invasion of natural Drosophila simulans populations

Robert Kofler, Tom Hill, Viola Nolte, Andrea Betancourt, Christian Schlötterer
doi: http://dx.doi.org/10.1101/013722

The P-element is one of the best understood eukaryotic transposable elements. It invaded Drosophila melanogaster populations within a few decades, but was thought to be absent from close relatives, including D. simulans. Five decades after the spread in D. melanogaster, we provide evidence that the P-element has also invaded D. simulans. P-elements in D. simulans appear to have been acquired recently from D. melanogaster, probably via a single horizontal transfer event. Expression data indicate that the P-element is processed in the germline of D. simulans, and genomic data show an enrichment of P-element insertions in putative origins of replication, similar to that seen in D. melanogaster. This ongoing spread of the P-element in natural populations provides a unique opportunity to understand the dynamics of transposable element spreads and the associated piRNA defense mechanisms.

Distributions of topological tree metrics between a species tree and a gene tree

Jing Xi, Jin Xie, Ruriko Yoshida
(Submitted on 10 Jan 2015)

To conduct a statistical analysis on a given set of phylogenetic gene trees, we often use a distance measure between two trees. In a statistical distance-based method for analyzing discordance between gene trees, it is key to choose a “biologically meaningful” and “statistically well-distributed” distance between trees. Thus, in this paper, we study the distributions of three tree distance metrics between two trees: the edge difference, the path difference, and the precise K interval cospeciation distance. First, we focus on the distributions of the three tree distances between two random unrooted trees with n leaves (n≥4); then we focus on the distributions of the three tree distances between a fixed rooted species tree with n leaves and a random gene tree with n leaves generated under the coalescent process given the species tree. We present some theoretical results as well as a simulation study on these distributions.
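
To make one of the three metrics concrete, here is a minimal Python sketch of a path difference between two unrooted trees on the same leaf set. It assumes the common definition (the Euclidean norm of the difference between the two vectors of pairwise leaf-to-leaf edge counts); the abstract does not spell out the authors' exact definition, and the function names and the adjacency-list representation are illustrative choices, not taken from the paper.

```python
from collections import deque
from math import sqrt


def build_adjacency(edges):
    """Build an undirected adjacency list from (node, node) edge pairs."""
    adjacency = {}
    for x, y in edges:
        adjacency.setdefault(x, []).append(y)
        adjacency.setdefault(y, []).append(x)
    return adjacency


def leaf_path_lengths(adjacency, leaves):
    """Number of edges on the path between every pair of leaves."""
    lengths = {}
    for source in leaves:
        # Breadth-first search gives edge counts from this leaf to all nodes.
        depth = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in depth:
                    depth[neighbor] = depth[node] + 1
                    queue.append(neighbor)
        for target in leaves:
            if source < target:  # record each unordered pair once
                lengths[(source, target)] = depth[target]
    return lengths


def path_difference(tree_a, tree_b, leaves):
    """Euclidean norm of the difference of pairwise leaf path lengths
    (assumed definition; both trees must share the same leaf labels)."""
    leaves = sorted(leaves)
    a = leaf_path_lengths(tree_a, leaves)
    b = leaf_path_lengths(tree_b, leaves)
    return sqrt(sum((a[pair] - b[pair]) ** 2 for pair in a))


# Two quartet topologies, ((A,B),(C,D)) and ((A,C),(B,D)); u and v are
# hypothetical labels for the internal nodes.
quartet_ab_cd = build_adjacency(
    [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
)
quartet_ac_bd = build_adjacency(
    [("A", "u"), ("C", "u"), ("u", "v"), ("B", "v"), ("D", "v")]
)
distance = path_difference(quartet_ab_cd, quartet_ac_bd, "ABCD")
```

Sampling many random tree pairs and collecting such distances would give an empirical version of the distributions the abstract describes.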

Reprogramming LCLs to iPSCs Results in Recovery of Donor-Specific Gene Expression Signature

Samantha M Thomas, Courtney Kagan, Bryan J Pavlovic, Jonathan Burnett, Kristen Patterson, Jonathan K Pritchard, Yoav Gilad
doi: http://dx.doi.org/10.1101/013631

Renewable in vitro cell cultures, such as lymphoblastoid cell lines (LCLs), have facilitated studies that contributed to our understanding of genetic influence on human traits. However, the degree to which cell lines faithfully maintain differences in donor-specific phenotypes is still debated. We have previously reported that standard cell line maintenance practice results in a loss of donor-specific gene expression signatures in LCLs. An alternative to the LCL model is the induced pluripotent stem cell (iPSC) system, which carries the potential to model tissue-specific physiology through the use of differentiation protocols. Still, existing LCL banks represent an important source of starting material for iPSC generation, and it is possible that the disruptions in gene regulation associated with long-term LCL maintenance could persist through the reprogramming process. To address this concern, we studied the effect of reprogramming mature LCLs to iPSCs on the ensuing gene expression patterns within and between six unrelated donor individuals. We show that the reprogramming process results in a recovery of donor-specific gene regulatory signatures. Since environmental contributions are unlikely to be a source of individual variation in our system of highly passaged cultured cell lines, our observations suggest that the effect of genotype on gene regulation is more pronounced in the iPSCs than in the LCL precursors. Our findings indicate that iPSCs can be a powerful model system for studies of phenotypic variation across individuals in general, and the genetic association with variation in gene regulation in particular. We further conclude that LCLs are an appropriate starting material for iPSC generation.

Software for the analysis and visualization of deep mutational scanning data

Jesse D Bloom
doi: http://dx.doi.org/10.1101/013623

Background: Deep mutational scanning is a technique to estimate the impacts of mutations on a gene by using deep sequencing to count mutations in a library of variants before and after imposing a functional selection. The impacts of mutations must be inferred from changes in their counts after selection. Results: I describe a software package, dms_tools, to infer the impacts of mutations from deep mutational scanning data using a likelihood-based treatment of the mutation counts. I show that dms_tools yields more accurate inferences on simulated data than the widely used but statistically biased approach of calculating ratios of counts pre- and post-selection. Using dms_tools, one can infer the preference of each site for each amino acid given a single selection pressure, or assess the extent to which these preferences change under different selection pressures. The preferences and their changes can be intuitively visualized with sequence-logo-style plots created using an extension to weblogo. Conclusions: dms_tools implements a statistically principled approach for the analysis and subsequent visualization of deep mutational scanning data.
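
For context, the count-ratio baseline mentioned in the abstract can be sketched in a few lines of Python. This is not the dms_tools API or its likelihood-based method; it is a hypothetical illustration of the simple ratio approach, and the function name, the pseudocount, and the normalization to per-site preferences are assumptions made for the example.

```python
def ratio_preferences(pre, post, wildtype, pseudocount=1.0):
    """Naive per-site preferences from pre- and post-selection counts.

    pre, post: dicts mapping each character (e.g. amino acid) to its count
    at one site before and after selection.
    wildtype: the wild-type character at this site, used as the reference.
    """
    # A pseudocount avoids division by zero for unobserved mutations
    # (an assumption of this sketch, not something the abstract specifies).
    wt_ratio = (post.get(wildtype, 0) + pseudocount) / (pre[wildtype] + pseudocount)
    enrichment = {
        aa: ((post.get(aa, 0) + pseudocount) / (pre[aa] + pseudocount)) / wt_ratio
        for aa in pre
    }
    # Normalize so that the preferences at this site sum to one.
    total = sum(enrichment.values())
    return {aa: value / total for aa, value in enrichment.items()}


# Made-up counts for a single site with wild-type "A".
example_pre = {"A": 1000, "G": 50, "P": 50}
example_post = {"A": 1200, "G": 60, "P": 5}
example_preferences = ratio_preferences(example_pre, example_post, wildtype="A")
```

Because each ratio is computed from finite counts, rare mutations with small pre-selection counts produce noisy ratios, which is one intuition for why the abstract describes this approach as statistically biased.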