The hemagglutinin mutation E391K of pandemic 2009 influenza revisited

Jan P. Radomski, Piotr Płoński, Włodzimierz Zagórski-Ostoja
(Submitted on 8 Nov 2013)

Phylogenetic analyses based on small to moderately sized sets of sequential data lead to overestimating mutation rates in influenza hemagglutinin (HA) by at least an order of magnitude. Two major underlying reasons are incomplete lineage sorting and the possible absence of some key ancestors from the analyzed set of sequences. Additionally, during neighbor joining tree reconstruction each mutation is considered equally important, regardless of its nature. Here we have implemented a heuristic method that optimizes site-dependent factors, weighting 1st, 2nd, and 3rd codon position mutations differently, which makes it possible to extricate incorrectly attributed sub-clades. A least squares regression analysis of the distribution of mutation frequencies along all nucleotide positions, observed on a partially disentangled tree for a large set of 3243 unique HA sequences, was performed for all mutations as well as for non-equivalent amino acid mutations. Both cases show almost flat gradients, with a very slight downward slope towards the 3′-end positions. The mean mutation rates per sequence per year were 3.83 × 10^-4 for all mutations and 9.64 × 10^-5 for the non-equivalent ones.
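
The idea of weighting mutations by codon position can be illustrated with a small sketch. This is not the authors' heuristic or their optimized factors: the function name and the default weights below are hypothetical, chosen only to show how a distance between two aligned coding sequences changes once 1st, 2nd, and 3rd position mismatches stop counting equally.

```python
def weighted_codon_distance(seq_a, seq_b, weights=(1.0, 1.0, 0.5)):
    """Sum of mismatch weights between two aligned, in-frame coding sequences.

    weights[0], weights[1], weights[2] apply to 1st, 2nd and 3rd codon
    positions. The default values are illustrative assumptions, not
    values taken from the paper.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for index, (base_a, base_b) in enumerate(zip(seq_a, seq_b)):
        if base_a != base_b:
            # index % 3 is 0, 1, 2 for the 1st, 2nd, 3rd codon position
            total += weights[index % 3]
    return total


# A single 3rd-position change counts less than a single 1st-position change.
third_position = weighted_codon_distance("ATGAAA", "ATGAAG")
first_position = weighted_codon_distance("ATGAAA", "CTGAAA")
```

With equal weights `(1.0, 1.0, 1.0)` the function reduces to the plain mismatch count, which corresponds to treating every mutation as equally important.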

Comparative Assembly Hubs: Web Accessible Browsers for Comparative Genomics

Ngan Nguyen, Glenn Hickey, Brian J. Raney, Joel Armstrong, Hiram Clawson, Ann Zweig, Jim Kent, David Haussler, Benedict Paten
(Submitted on 5 Nov 2013)

We introduce a pipeline to easily generate collections of web accessible UCSC genome browsers interrelated by an alignment. Using the alignment, all annotations and the alignment itself can be efficiently viewed with reference to any genome in the collection, symmetrically. A new, intelligently scaled alignment display makes it simple to view all changes between the genomes at all levels of resolution, from substitutions to complex structural rearrangements, including duplications.

Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis

Eric Y. Durand, Nicholas Eriksson, Cory Y. McLean
(Submitted on 5 Nov 2013)

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, but IBD detection accuracy in non-simulated data is largely unknown. Using 25,432 genotyped European individuals, and exploiting known familial relationships in 2,952 father-mother-child trios contained therein, we identify a false positive rate over 67% for short (2-4 centiMorgan) segments. We introduce a novel, computationally-efficient, haplotype-based metric that enables accurate IBD detection on population-scale datasets.
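
One way known trios can expose false positives is sketched below, as an assumption about the general logic rather than a description of the authors' procedure. A child inherits its genome from its parents, so a segment the child shares IBD with an unrelated individual would be expected to overlap a segment that one of the parents shares with the same individual. The function names, the interval representation (start and end in centiMorgans on one chromosome), and the 0.8 threshold are all hypothetical.

```python
def covered_fraction(segment, parent_segments):
    """Fraction of `segment` covered by the union of `parent_segments`.

    Every segment is a (start, end) pair on the same chromosome.
    """
    start, end = segment
    if end <= start:
        raise ValueError("segment must have positive length")
    # Clip parental segments to the child segment and drop non-overlapping ones.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in parent_segments
        if min(e, end) > max(s, start)
    )
    covered = 0.0
    cursor = start  # right edge of the region already counted
    for s, e in clipped:
        if e > cursor:
            covered += e - max(s, cursor)
            cursor = e
    return covered / (end - start)


def is_supported_by_parents(segment, parent_segments, min_fraction=0.8):
    """Flag a child segment as plausible if a parent shares most of it.

    min_fraction is an arbitrary illustrative threshold.
    """
    return covered_fraction(segment, parent_segments) >= min_fraction


# A 4 cM child segment of which only 3 cM is backed by parental segments.
fraction = covered_fraction((10.0, 14.0), [(9.0, 12.0), (11.0, 13.0)])
```

Segments that fail such a check across many trios give an empirical estimate of how often a detector reports sharing that the pedigree cannot explain.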

The inference of gene trees with species trees

Gergely J. Szöllosi, Eric Tannier, Vincent Daubin, Bastien Boussau
(Submitted on 4 Nov 2013)

Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.

SMASH: A Benchmarking Toolkit for Variant Calling

Ameet Talwalkar, Jesse Liptrap, Julie Newcomb, Christopher Hartl, Jonathan Terhorst, Kristal Curtis, Ma’ayan Bresler, Yun S. Song, Michael I. Jordan, David Patterson
(Submitted on 31 Oct 2013)

Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers.
Results: We propose the SMASH toolkit, a benchmarking methodology for evaluating variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMASH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms.
Availability: We provide free and open access online to the SMASH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.
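
As a rough illustration of what an accuracy metric for variant calls can look like, the sketch below computes precision and recall of a call set against a truth set. It does not use or reproduce the SMASH toolkit: the function name and the representation of a variant as a `(chromosome, position, ref, alt)` tuple are assumptions made for this example, and real benchmarks also have to handle equivalent representations of the same indel, which exact tuple matching ignores.

```python
def precision_recall(called, truth):
    """Precision and recall of called variants against a truth set.

    Variants are compared by exact equality of hashable records, here
    (chromosome, position, ref, alt) tuples.
    """
    called = set(called)
    truth = set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth_set = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 500, "G", "GA"),
    ("chr2", 900, "T", "C"),
]
call_set = [
    ("chr1", 1000, "A", "G"),   # correct call
    ("chr2", 500, "G", "GA"),   # correct call
    ("chr3", 42, "C", "A"),     # not in the truth set
]
precision, recall = precision_recall(call_set, truth_set)
```

Here two of the three calls are in the truth set and two of the four true variants are recovered.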

Most viewed on Haldane’s Sieve: October 2013

The most viewed preprints on Haldane’s Sieve this month were: