Partitioning, duality, and linkage disequilibria in the Moran model with recombination

Mareike Esser, Sebastian Probst, Ellen Baake
Comments: 29 pages, 6 figures
Subjects: Probability (math.PR); Populations and Evolution (q-bio.PE)

The Moran model with recombination is considered, which describes the evolution of the genetic composition of a population under recombination and resampling. There are $n$ sites (or loci), a finite number of letters (or alleles) at every site, and we do not make any scaling assumptions. In particular, we do not assume a diffusion limit. We consider the following marginal ancestral recombination process. Let $S = \{1,\ldots,n\}$ and let $\mathcal A=\{A_1, \ldots, A_m\}$ be a partition of $S$. We concentrate on the joint probability of the letters at the sites in $A_1$ in individual $1$, $\ldots$, at the sites in $A_m$ in individual $m$, where the individuals are sampled from the current population without replacement. Following the ancestry of these sites backwards in time yields a process on the set of partitions of $S$, which, in the diffusion limit, turns into a marginalised version of the $n$-locus ancestral recombination graph. Using an inclusion-exclusion principle, we show that the type distribution corresponding to a given partition may be represented systematically in terms of so-called recombinators and sampling functions. The same is true of correlation functions (known as linkage disequilibria in genetics) of all orders.
We prove that the partitioning process (backward in time) is dual to the Moran population process (forward in time), where the sampling function plays the role of the duality function. This sheds new light on the work of Bobrowski, Wojdyla, and Kimmel (2010). The result also leads to a closed system of ordinary differential equations for the expectations of the sampling functions, which can be translated into expected type distributions and expected linkage disequilibria.
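For orientation, duality between two Markov processes with respect to a function is usually stated as the identity below. The symbols are notation assumed here for illustration and are not taken from the paper: $Z_t$ stands for the forward Moran population process, $\Sigma_t$ for the backward partitioning process on partitions of $S$, and $H$ for the sampling function acting as duality function.

$$\mathbb{E}\big[H(Z_t, \mathcal{A}) \mid Z_0 = z\big] = \mathbb{E}\big[H(z, \Sigma_t) \mid \Sigma_0 = \mathcal{A}\big] \quad \text{for all } t \ge 0 .$$

Read this way, an expectation over the forward population process can be computed from the backward process on the finite set of partitions, which is consistent with the closed system of ordinary differential equations mentioned in the abstract.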

Systematic discovery and classification of human cell line essential genes

Traver Hart, Megha Chandrashekhar, Michael Aregger, Zachary Steinhart, Kevin R Brown, Stephane Angers, Jason Moffat
doi: http://dx.doi.org/10.1101/015412

The study of gene essentiality in human cells is crucial for elucidating gene function and holds great potential for finding therapeutic targets for diseases such as cancer. Technological advances in genome editing using clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9 systems have set the stage for identifying human cell line core and context-dependent essential genes. However, first-generation negative selection screens using CRISPR technology demonstrate extreme variability across different cell lines. To advance the development of the catalogue of human core and context-dependent essential genes, we have developed an optimized, ultracomplex, genome-scale gRNA library of 176,500 guide RNAs targeting 17,661 genes and have applied it to negative and positive selection screens in a human cell line. Using an improved Bayesian analytical approach, we find that CRISPR-based screens yield two to three times as many essential genes as were previously observed using systematic RNA interference, including many genes at moderate expression levels that are largely refractory to RNAi methods. We further characterized four essential genes of unknown significance and found that they all likely exist in protein complexes with other essential genes. For example, RBM48 and ARMC7 are both essential nuclear proteins that strongly interact and are commonly amplified across major cancers. Our findings suggest that the CRISPR-Cas9 system fundamentally alters the landscape for systematic reverse genetics in human cells for elucidating gene function, identifying disease genes, and uncovering therapeutic targets.

Maximum Likelihood Estimation and Phylogenetic Tree based Backward Elimination for reconstructing Viral Haplotypes in a Population

Raunaq Malhotra, Steven Wu, Allen Rodrigo, Mary Poss, Raj Acharya
(Submitted on 14 Feb 2015)

A viral population can contain a large and diverse collection of viral haplotypes which play important roles in maintaining the viral population. We present an algorithm for reconstructing viral haplotypes in a population from paired-end Next Generation Sequencing (NGS) data. We propose a novel polynomial-time, dynamic-programming-based approximation algorithm for generating top paths through each node in a De Bruijn graph constructed from the paired-end NGS data. We also propose two novel formulations for obtaining an optimal set of viral haplotypes for the population using the paths generated by the approximation algorithm. The first formulation obtains a maximum likelihood estimate of the viral population given the observed paired-end reads. The second formulation obtains a minimal set of viral haplotypes retaining the phylogenetic information in the population. We evaluate our algorithm on simulated datasets that vary in the mutation rate and genome length of the viral haplotypes. The results of our method are compared to other methods for viral haplotype estimation. While all the methods overestimate the number of viral haplotypes in a population, the two proposed optimality formulations correctly estimate the exact sequence of all the haplotypes in most datasets, and recover the overall diversity of the population in all datasets. The haplotypes recovered from popular methods are biased toward the reference sequence used for mapping of reads, while the proposed formulations are reference-free and retain the overall diversity in the population.
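As a rough illustration of the data structure the abstract starts from, the sketch below builds a plain De Bruijn graph from a set of reads. It is not the authors' algorithm: it ignores paired-end information and does no path scoring, and the function name `build_de_bruijn` and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a De Bruijn graph from reads (illustrative sketch only).

    Nodes are (k-1)-mers. Every k-mer occurrence in a read adds one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix, so repeated k-mers
    show up as repeated successors.
    """
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (hypothetical data).
reads = ["ACGTC", "CGTCA"]
graph = build_de_bruijn(reads, k=3)
```

Paths through such a graph spell out candidate sequences; a haplotype reconstruction method then has to decide which paths to keep.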

Selection constrains phenotypic evolution in a functionally important plant trait

Christopher D Muir
doi: http://dx.doi.org/10.1101/015172

A long-standing idea is that the macroevolutionary adaptive landscape (a 'map' of phenotype to fitness) constrains evolution because certain phenotypes are fit, while others are universally unfit. Such constraints should be evident in traits that, across many species, cluster around particular modal values, with few intermediates between modes. Here, I compile a new global database of 599 species from 94 plant families showing that stomatal ratio, an important functional trait affecting photosynthesis, is multimodal, hinting at distinct peaks in the adaptive landscape. The dataset confirms that most plants have all their stomata on the lower leaf surface (hypostomy), but shows for the first time that species with roughly half their stomata on each leaf surface (amphistomy) form a distinct mode in the trait distribution. Based on a new evolutionary process model, this multimodal pattern is unlikely without constraint. Further, multimodality has evolved repeatedly across disparate families, evincing long-term constraint on the adaptive landscape. A simple cost-benefit model of stomatal ratio demonstrates that selection alone is sufficient to generate an adaptive landscape with multiple peaks. Finally, phylogenetic comparative methods indicate that life history evolution drives shifts between peaks. This implies that the adaptive benefit conferred by amphistomy (increased photosynthesis) is most important in plants with fast life histories, challenging existing ideas that amphistomy is an adaptation to thick leaves and open habitats. I conclude that peaks in the adaptive landscape have been constrained by selection over much of land plant evolution, leading to predictable, repeatable patterns of evolution.

ViennaNGS: A toolbox for building efficient next-generation sequencing analysis pipelines

Michael T. Wolfinger, Jörg Fallmann, Florian Eggenhofer, Fabian Amman
doi: http://dx.doi.org/10.1101/013011

Recent achievements in next-generation sequencing (NGS) technologies have led to a high demand for reusable software components to easily compile customized analysis workflows for big genomics data. We present ViennaNGS, an integrated collection of Perl modules focused on building efficient pipelines for NGS data processing. It comes with functionality for extracting and converting features from common NGS file formats, computation and evaluation of read mapping statistics, as well as normalization of RNA abundance. Moreover, ViennaNGS provides software components for identification and characterization of splice junctions from RNA-seq data, parsing and condensing sequence motif data, automated construction of Assembly and Track Hubs for the UCSC genome browser, as well as wrapper routines for a set of commonly used NGS command line tools.

Dissection of a complex disease susceptibility region using a Bayesian stochastic search approach to fine mapping

Chris Wallace, Antony J Cutler, Nikolas Pontikos, Marcin L Pekalski, Oliver S Burren, Jason D Cooper, Arcadio Rubio Garcia, Ricardo C Ferreira, Hui Guo, Neil M Walker, Deborah J Smyth, Stephen S Rich, Suna Onengut-Gumuscu, Stephen S Sawcer, Maria Ban, Sylvia Richardson, John Todd, Linda Wicker
doi: http://dx.doi.org/10.1101/015164

Identification of candidate causal variants in regions associated with risk of common diseases is complicated by linkage disequilibrium (LD) and multiple association signals. Nonetheless, accurate maps of these variants are needed, both to fully exploit detailed cell-specific chromatin annotation data and highlight disease causal mechanisms and cells, and for design of the functional studies that will ultimately be required to confirm causal mechanisms. We adapted a Bayesian evolutionary stochastic search algorithm to the fine mapping problem, and demonstrated its improved performance over conventional stepwise and regularised regression through simulation studies. We then applied it to fine-map the established multiple sclerosis (MS) and type 1 diabetes (T1D) associations in the IL-2RA (CD25) gene region. For T1D, both stepwise and stochastic search approaches identified four T1D association signals, with the major effect tagged by the single nucleotide polymorphism rs12722496. In contrast, for MS, the stochastic search found two distinct competing models: a single candidate causal variant, tagged by rs2104286 and reported previously using conditional analysis; and a more complex model with two association signals, one of which was tagged by the major T1D-associated rs12722496 and the other by rs56382813. There is low to moderate LD between rs2104286 and both rs12722496 and rs56382813 (r² ≈ 0.3), and our two-SNP model could not be recovered through a forward stepwise search after conditioning on rs2104286. Both signals in the two-variant model for MS affect CD25 expression on distinct subpopulations of CD4+ T cells, which are key cells in the autoimmune process. The results support a shared causal variant for T1D and MS. Our study illustrates the benefit of using a purposely designed model search strategy for fine mapping and the advantage of combining disease and protein expression data.
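The LD measure quoted in the abstract is the squared correlation between alleles at two SNPs. A minimal sketch of the standard population-genetics definition follows, assuming phased haplotypes at two biallelic SNPs with alleles coded 0/1. The function name `ld_r_squared` and the toy samples are hypothetical, and this is not the authors' analysis code.

```python
def ld_r_squared(haplotypes):
    """Return r^2 between two biallelic SNPs (illustrative sketch only).

    haplotypes: list of (a, b) pairs, one per phased haplotype, where a and b
    are the 0/1 alleles carried at the first and second SNP.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n      # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n      # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                         # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Toy samples (hypothetical data): perfectly correlated and independent SNPs.
r2_perfect = ld_r_squared([(0, 0), (1, 1), (0, 0), (1, 1)])
r2_independent = ld_r_squared([(0, 0), (0, 1), (1, 0), (1, 1)])
```

Values near 1 mean one SNP is an almost perfect proxy for the other, while a value around 0.3, as reported above, leaves room for the two SNPs to carry partly distinct association signals.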

Correcting Illumina sequencing errors for human data

Heng Li
(Submitted on 12 Feb 2015)

Summary: We present a new tool to correct sequencing errors in Illumina data produced from high-coverage whole-genome shotgun resequencing. It uses a non-greedy algorithm and shows comparable performance and higher accuracy in an evaluation on real human data. This evaluation has the most complete collection of high-performance error correctors so far.
Availability and implementation: this https URL

Inferring processes underlying B-cell repertoire diversity

Yuval Elhanati, Zachary Sethna, Quentin Marcou, Curtis G. Callan Jr., Thierry Mora, Aleksandra M. Walczak
(Submitted on 10 Feb 2015)

We quantify the VDJ recombination and somatic hypermutation processes in human B-cells using probabilistic inference methods on high-throughput DNA sequence repertoires of human B-cell receptor heavy chains. Our analysis captures the statistical properties of the naive repertoire, first after its initial generation via VDJ recombination and then after selection for functionality. We also infer statistical properties of the somatic hypermutation machinery (exclusive of subsequent effects of selection). Our main results are the following: the B-cell repertoire is substantially more diverse than T-cell repertoires, due to longer junctional insertions; sequences that pass initial selection are distinguished by having a higher probability of being generated in a VDJ recombination event; somatic hypermutations have a non-uniform distribution along the V gene that is well explained by an independent site model for the sequence context around the hypermutation site.

Discovery of large genomic inversions using pooled clone sequencing

Marzieh Eslami Rasekh, Giorgia Chiatante, Mattia Miroballo, Joyce Tang, Mario Ventura, Chris T Amemiya, Evan E. Eichler, Francesca Antonacci, Can Alkan
doi: http://dx.doi.org/10.1101/015156

There are many different forms of genomic structural variation that can be broadly classified as copy number variation (CNV) and balanced rearrangements. Although many algorithms are now available in the literature that aim to characterize CNVs, discovery of balanced rearrangements (inversions and translocations) remains an open problem. This is mainly because the breakpoints of such events typically lie within segmental duplications and common repeats, which reduce the mappability of short reads. The 1000 Genomes Project spearheaded the development of several methods to identify inversions; however, they are limited to relatively short inversions, and there are currently no available algorithms to discover large inversions using high throughput sequencing (HTS) technologies. Here we propose to use a sequencing method (Kitzman et al., 2011), originally developed to improve haplotype resolution, to characterize large genomic inversions. This method, called pooled clone sequencing, merges the advantages of a clone-based sequencing approach with the speed and cost efficiency of HTS technologies. Using data generated with the pooled clone sequencing method, we developed a novel algorithm, dipSeq, to discover large inversions (>500 Kbp). We show the power of dipSeq first on simulated data, and then apply it to the genome of a HapMap individual (NA12878). We were able to accurately discover all previously known and experimentally validated large inversions in the same genome. We also identified a novel inversion and confirmed it using fluorescent in situ hybridization. Availability: Implementation of the dipSeq algorithm is available at https://github.com/BilkentCompGen/dipseq

Locating a Tree in a Phylogenetic Network in Quadratic Time

Philippe Gambette, Andreas D. M. Gunawan, Anthony Labarre, Stéphane Vialette, Louxin Zhang
(Submitted on 11 Feb 2015)

A fundamental problem in the study of phylogenetic networks is to determine whether or not a given phylogenetic network contains a given phylogenetic tree. We develop a quadratic-time algorithm for this problem for binary nearly-stable phylogenetic networks. We also show that the number of reticulations in a reticulation-visible or nearly-stable phylogenetic network is bounded from above by a function linear in the number of taxa.