Raunaq Malhotra, Steven Wu, Allen Rodrigo, Mary Poss, Raj Acharya

(Submitted on 14 Feb 2015)

A viral population can contain a large and diverse collection of viral haplotypes, which play important roles in maintaining the population. We present an algorithm for reconstructing the viral haplotypes in a population from paired-end Next Generation Sequencing (NGS) data. We propose a novel polynomial-time, dynamic-programming-based approximation algorithm for generating the top paths through each node in a De Bruijn graph constructed from the paired-end NGS data. We also propose two novel formulations for obtaining an optimal set of viral haplotypes for the population from the paths generated by the approximation algorithm. The first formulation obtains a maximum likelihood estimate of the viral population given the observed paired-end reads. The second formulation obtains a minimal set of viral haplotypes that retains the phylogenetic information in the population. We evaluate our algorithm on simulated datasets that vary in the mutation rate and genome length of the viral haplotypes, and we compare its results with those of other methods for viral haplotype estimation. While all methods overestimate the number of viral haplotypes in a population, the two proposed optimality formulations recover the exact sequence of all the haplotypes in most datasets and recover the overall diversity of the population in all datasets. The haplotypes recovered by popular methods are biased toward the reference sequence used for read mapping, whereas the proposed formulations are reference-free and retain the overall diversity of the population.
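
To make the idea of a path through a node of a De Bruijn graph concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It assumes single-end reads, an acyclic graph, and a path score equal to the sum of k-mer edge counts; it returns only the single best path rather than several top paths, and it ignores paired-end information. The names `build_de_bruijn`, `best_path_through`, and `spell` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from functools import lru_cache


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to its successor (k-1)-mers with k-mer counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


def best_path_through(graph, node):
    """Highest-count path through `node` (assumption: the graph is acyclic)."""
    # Reverse adjacency, so the same recursion can walk backwards.
    preds = defaultdict(dict)
    for u, nbrs in graph.items():
        for v, count in nbrs.items():
            preds[v][u] = count

    @lru_cache(maxsize=None)
    def extend(u, forward):
        # Best (score, path) that starts at u and walks in one direction.
        nbrs = graph.get(u, {}) if forward else preds.get(u, {})
        best = (0, (u,))
        for v, count in nbrs.items():
            score, path = extend(v, forward)
            if count + score > best[0]:
                best = (count + score, (u,) + path)
        return best

    fwd_score, fwd = extend(node, True)
    bwd_score, bwd = extend(node, False)
    # bwd runs from node back to a source: reverse it and drop the duplicate node.
    return fwd_score + bwd_score, bwd[::-1][:-1] + fwd


def spell(path):
    """Concatenate a path of overlapping (k-1)-mers into one sequence."""
    return path[0] + "".join(n[-1] for n in path[1:])


# Toy example: two haplotypes differing at one position, with 2:1 coverage.
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTAGCA"]
graph = build_de_bruijn(reads, k=4)
score, path = best_path_through(graph, "GTA")
print(score, spell(path))  # 7 ACGTAGCA
```

Asking for the best path through a node that only the minority haplotype visits (`"GTA"` here) recovers that haplotype even though the other one has higher coverage. This is one reason to enumerate paths per node rather than only the globally best path. The recursion is used for brevity; an iterative traversal in topological order would be needed for genome-length graphs.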