Comparing Variant Call Files for Performance Benchmarking of Next-Generation Sequencing Variant Calling Pipelines

John G. Cleary, Ross Braithwaite, Kurt Gaastra, Brian S Hilbush, Stuart Inglis, Sean A Irvine, Alan Jackson, Richard Littin, Mehul Rathod, David Ware, Justin M. Zook, Len Trigg, Francisco M. M. De La Vega
To evaluate and compare the performance of variant calling methods and their confidence scores, a test call set must be compared against a “gold standard”. Unfortunately, these comparisons are not straightforward with current Variant Call Format (VCF) files, which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that handles complex call representation discrepancies through a dynamic programming method that minimizes false positives and false negatives globally across the entire call sets, enabling accurate performance evaluation of VCFs.
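The representation problem described above can be illustrated with a minimal Python sketch. It is not the authors' dynamic programming algorithm; it only shows why a line-by-line comparison of VCF records can be misleading. The helper names (`apply_variants`, `same_haplotype`), the toy reference and the variant tuples are all hypothetical. The idea is to replay each call set onto the reference and compare the resulting haplotypes rather than the records themselves.

```python
def apply_variants(ref, variants):
    """Replay (pos, ref_allele, alt_allele) edits onto a reference string.

    Positions are 1-based, as in VCF. Variants must not overlap.
    This is an illustrative helper, not part of any real library.
    """
    out = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref_allele, alt_allele in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants")
        if ref[start:start + len(ref_allele)] != ref_allele:
            raise ValueError("REF allele does not match the reference")
        out.append(ref[cursor:start])   # unchanged bases before the variant
        out.append(alt_allele)          # substituted allele
        cursor = start + len(ref_allele)
    out.append(ref[cursor:])            # unchanged tail
    return "".join(out)


def same_haplotype(ref, calls_a, calls_b):
    """True if both call sets produce the same sequence when replayed."""
    return apply_variants(ref, calls_a) == apply_variants(ref, calls_b)


reference = "GATTACAGT"  # toy reference (assumption)

# One MNP versus two adjacent SNVs: different records, same haplotype.
mnp = [(3, "TT", "CC")]
snvs = [(3, "T", "C"), (4, "T", "C")]

# The same 1-bp deletion in a homopolymer, anchored at different positions.
del_left = [(2, "AT", "A")]
del_right = [(3, "TT", "T")]

print(same_haplotype(reference, mnp, snvs))
print(same_haplotype(reference, del_left, del_right))
```

A record-by-record comparison would count the MNP as a false negative and the two SNVs as false positives, even though both call sets describe the same sequence.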

Genome-Wide Scan for Adaptive Divergence and Association with Population-Specific Covariates

Mathieu Gautier

In population genomics studies, accounting for the neutral covariance structure across population allele frequencies is critical to improve the robustness of genome-wide scan approaches. Elaborating on the BayEnv model, this study investigates several modeling extensions i) to improve the estimation accuracy of the population covariance matrix and all the related measures; ii) to identify significantly overly differentiated SNPs based on a calibration procedure of the XtX statistics; and iii) to consider alternative covariate models for analyses of association with population-specific covariables. In particular, the auxiliary variable model makes it possible to deal with multiple testing issues and, provided the relative marker positions are available, to capture some Linkage Disequilibrium information. A comprehensive simulation study is further carried out to investigate and compare the performance of the different models. For illustration purposes, genotyping data on 18 French cattle breeds are also analyzed, leading to the identification of thirteen strong signatures of selection. Among these, four (surrounding the KITLG, KIT, EDN3 and ALB genes) contained SNPs strongly associated with the piebald coloration pattern while a fifth (surrounding PLAG1) could be associated with morphological differences across the populations. Finally, analysis of Pool-Seq data from 12 populations of Littorina saxatilis living in two different ecotypes illustrates how the proposed framework might help address relevant ecological questions in non-model species. Overall, the proposed methods define a robust Bayesian framework to characterize adaptive genetic differentiation across populations. The BayPass program implementing the different models is available at
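As a rough illustration of the XtX idea, the Python sketch below computes a covariance-standardized differentiation score for one SNP in two populations. It assumes the usual BayEnv-style definition, in which the deviations of population allele frequencies from the ancestral frequency are scaled by pi(1 - pi) and by the inverse of the population covariance matrix. The function name and all input values are hypothetical, and point values are used where BayPass would work with posterior samples.

```python
def xtx_two_populations(freqs, pi, omega):
    """XtX-like score for one SNP in two populations (illustrative only).

    freqs : (f1, f2) population allele frequencies
    pi    : ancestral allele frequency, 0 < pi < 1
    omega : 2x2 population covariance matrix as nested tuples/lists

    Computes (f - pi)' * inverse(omega) * (f - pi) / (pi * (1 - pi)).
    """
    (a, b), (c, d) = omega
    det = a * d - b * c
    if det <= 0:
        raise ValueError("omega must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dev = [f - pi for f in freqs]
    quad = sum(dev[i] * inv[i][j] * dev[j]
               for i in range(2) for j in range(2))
    return quad / (pi * (1.0 - pi))


# Two closely related populations (strong positive covariance, assumed).
related = ((1.0, 0.9), (0.9, 1.0))

# Same-direction deviations are expected under shared drift ...
shared = xtx_two_populations((0.6, 0.6), 0.5, related)
# ... whereas opposite deviations are surprising and score much higher.
opposed = xtx_two_populations((0.6, 0.4), 0.5, related)
print(shared < opposed)
```

The comparison shows why accounting for the covariance structure matters: the same raw frequency differences are weighted very differently depending on how the populations are related.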

Length Distribution of Ancestral Tracks under a General Admixture Model and Its Applications in Population History Inference

Xumin Ni, Xiong Yang, Wei Guo, Kai Yuan, Ying Zhou, Zhiming Ma, Shuhua Xu

As a chromosome is sliced into pieces by recombination after entering an admixed population, ancestral tracks of chromosomes are shortened with the passing of generations. The length distribution of ancestral tracks reflects information about recombination and thus can be used to infer the histories of admixed populations. Previous studies have shown that inference based on ancestral tracks is powerful in recovering the histories of admixed populations. However, population histories are always complex, and previous studies only deduced the length distribution of ancestral tracks under very simple admixture models. The deduction of the length distribution of ancestral tracks under a more general model will greatly increase the power to infer population histories. Here we first deduced the length distribution of ancestral tracks under a general model in an admixed population, and proposed general principles for parameter estimation and model selection with the length distribution. Next, we focused on studying the length distribution of ancestral tracks and its applications under three typical admixture models, which were all special cases of our general model. Extensive simulations showed that the length distribution of ancestral tracks was well predicted by our theoretical models. We further developed a new method based on the length distribution of ancestral tracks, and good performance was observed when it was applied to inferring population histories under the three typical models. Notably, our method was insensitive to demographic history, sample size and the threshold used to discard short tracks. Finally, we applied our method to African Americans and Mexicans from the HapMap dataset, and to several South Asian populations from the Human Genome Diversity Project dataset. The results showed that the histories of African Americans and Mexicans matched the historical records well, and that the population admixture history of South Asians was very complex and could be traced back to around 100 generations ago.
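To make the link between track lengths and admixture time concrete, here is a minimal Python sketch for the simplest case only. It assumes a single-pulse ("hybrid isolation") model in which tracks from an ancestry with proportion m are exponentially distributed with rate (1 - m)T per Morgan after T generations. That simple-model result is an assumption of the sketch and not the general model derived in the paper; the function names and simulated values are hypothetical.

```python
import random


def hi_track_rate(m_self, generations):
    """Exponential rate (per Morgan) of track lengths for an ancestry with
    admixture proportion m_self, under the assumed single-pulse model."""
    return (1.0 - m_self) * generations


def estimate_generations(track_lengths, m_self, min_length=0.0):
    """Estimate time since admixture from track lengths (in Morgans).

    Tracks shorter than min_length are discarded. Because the exponential
    distribution is memoryless, the excess length above the threshold has
    the same rate, so the estimate is 1 / (mean excess * (1 - m_self)).
    """
    kept = [x for x in track_lengths if x > min_length]
    if not kept:
        raise ValueError("no tracks above the threshold")
    mean_excess = sum(x - min_length for x in kept) / len(kept)
    return 1.0 / (mean_excess * (1.0 - m_self))


# Simulate tracks for an ancestry with m = 0.8, admixed 10 generations ago.
random.seed(1)
true_m, true_t = 0.8, 10
rate = hi_track_rate(true_m, true_t)
tracks = [random.expovariate(rate) for _ in range(20000)]

# Discard tracks shorter than 0.05 Morgans before estimating.
print(round(estimate_generations(tracks, true_m, min_length=0.05), 2))
```

Subtracting the threshold before averaging is what keeps this toy estimator from being biased by the removal of short tracks.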

Circlator: automated circularization of genome assemblies using long sequencing reads

Martin Hunt, Nishadi De Silva, Thomas D Otto, Julian Parkhill, Jacqueline A Keane, Simon R Harris
The assembly of DNA sequence data into finished genomes is undergoing a renaissance thanks to emerging technologies producing reads of tens of kilobases. Assembling complete bacterial and small eukaryotic genomes is now possible, but the final step of circularizing sequences remains unsolved. Here we present Circlator, the first tool to automate assembly circularization and produce accurate linear representations of circular sequences. Using Pacific Biosciences and Oxford Nanopore data, Circlator correctly circularized 26 of 27 circularizable sequences, comprising 11 chromosomes and 12 plasmids from bacteria, the apicoplast and mitochondrion of Plasmodium falciparum and a human mitochondrion. Circlator is available at
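The circularization problem can be sketched with a toy Python example. The abstract does not describe Circlator's method, so this is not it; the sketch only shows the underlying issue that a linear contig assembled from a circular molecule often repeats its start at its end. The function name and sequences are hypothetical, and exact string matching is assumed, whereas real long-read contigs contain errors and would need alignment.

```python
def trim_circular_overlap(contig, min_overlap=4):
    """Remove a duplicated end from a linear contig of a circular sequence.

    Looks for the longest suffix (up to half the contig) that exactly
    matches the prefix and trims it. Returns (sequence, overlap_length);
    overlap_length is 0 if no overlap of at least min_overlap was found.
    """
    for k in range(len(contig) // 2, min_overlap - 1, -1):
        if contig[:k] == contig[-k:]:
            return contig[:-k], k
    return contig, 0


# A 10-bp "circular genome" and a contig that runs 5 bp past the origin.
circle = "ACGTTGCAAG"
contig = circle + circle[:5]

trimmed, overlap = trim_circular_overlap(contig)
print(trimmed, overlap)
```

A contig with no matching ends is returned unchanged, which in this toy setting corresponds to a sequence that cannot be circularized from the contig alone.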

Origins of de novo genes in human and chimpanzee

Jorge Ruiz-Orera, Jessica Hernandez-Rodriguez, Cristina Chiva, Eduard Sabidó, Ivanela Kondova, Ronald Bontrop, Tomàs Marqués-Bonet, M. Mar Albà
(Submitted on 28 Jul 2015)

The birth of new genes is an important motor of evolutionary innovation. Whereas many new genes arise by gene duplication, others originate at genomic regions that do not contain any gene or gene copy. Some of these newly expressed genes may acquire coding or non-coding functions and be preserved by natural selection. However, the prevalence and underlying mechanisms of de novo gene emergence are still unclear. In order to obtain a comprehensive view of this process we have performed in-depth sequencing of the transcriptomes of four mammalian species, human, chimpanzee, macaque and mouse, and subsequently compared the assembled transcripts and the corresponding syntenic genomic regions. This has resulted in the identification of over five thousand new transcriptional multiexonic events in human and/or chimpanzee that are not observed in the other species. By comparative genomics we show that the expression of these transcripts is associated with the gain of regulatory motifs upstream of the transcription start site (TSS) and of U1 snRNP sites downstream of the TSS. We also find that the coding potential of the new genes is higher than expected by chance, consistent with the presence of protein-coding genes in the dataset. Using available human tissue proteomics and ribosome profiling data we identify several de novo genes with translation evidence. These genes show significant purifying selection signatures, indicating that they are probably functional. Taken together, the data support a model in which frequently occurring new transcriptional events in the genome provide the raw material for the evolution of new proteins.

Dis-integrating the fly: A mutational perspective on phenotypic integration and covariation

Annat Haber, Ian Dworkin

The structure of environmentally induced phenotypic covariation can influence the effective strength and magnitude of natural selection. Yet our understanding of the factors that contribute to and influence the evolutionary lability of such covariation is poor. Most studies have examined either environmental variation, without accounting for covariation, or examined phenotypic and genetic covariation, without distinguishing the environmental component. In this study we examined the effect of mutational perturbations on different properties of environmental covariation, as well as mean shape. We use strains of Drosophila melanogaster bearing well-characterized mutations known to influence wing shape, as well as naturally-derived strains, all reared under carefully-controlled conditions and with the same genetic background. We find that mean shape changes more freely than the covariance structure, and that different properties of the covariance matrix change independently from each other. The perturbations affect matrix orientation more than they affect matrix size or eccentricity. Yet, mutational effects on matrix orientation do not cluster according to the developmental pathway that they target. These results suggest that it might be useful to consider a more general concept of ‘decanalization’, involving all aspects of variation and covariation.

Long-term natural selection affects patterns of neutral divergence on the X chromosome more than the autosomes.

Melissa Ann Wilson Sayres, Pooja Narang

Natural selection reduces neutral population genetic diversity near coding regions of the genome because recombination has not had time to unlink selected alleles from nearby neutral regions. For ten sub-species of great apes, including human, we show that long-term selection affects estimates of divergence on the X differently from the autosomes. Divergence increases with increasing distance from genes on both the X chromosome and autosomes, but increases faster on the X chromosome than on the autosomes, resulting in increasing ratios of X/A divergence in putatively neutral regions. Similarly, divergence is reduced more on the X chromosome in neutral regions near conserved regulatory elements than on the autosomes. Consequently, estimates of male mutation bias, which rely on comparing neutral divergence between the X and autosomes, are twice as high in neutral regions near genes versus far from genes. Our results suggest that filters for putatively neutral genomic regions differ between the X and autosomes.
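The dependence of male mutation bias on the X/A divergence ratio can be illustrated with a short Python sketch. It assumes the classical Miyata relationship, R = 2(2 + alpha) / (3(1 + alpha)), where R is the X/A divergence ratio and alpha is the male-to-female mutation rate ratio; the abstract does not state which estimator the authors used. The function name and the example ratios are hypothetical and are not values from the study.

```python
def male_mutation_bias(x_to_a):
    """Male-to-female mutation rate ratio (alpha) from an X/A divergence
    ratio, assuming R = 2 * (2 + alpha) / (3 * (1 + alpha)).

    Solving for alpha gives alpha = (4 - 3R) / (3R - 2), which is only
    defined and non-negative for 2/3 < R <= 4/3.
    """
    if not (2.0 / 3.0 < x_to_a <= 4.0 / 3.0):
        raise ValueError("X/A ratio must lie in (2/3, 4/3]")
    return (4.0 - 3.0 * x_to_a) / (3.0 * x_to_a - 2.0)


# Hypothetical X/A ratios: lower near genes, higher far from genes.
near_genes = male_mutation_bias(0.80)
far_from_genes = male_mutation_bias(0.90)
print(near_genes > far_from_genes)
```

Because alpha is a steep function of R when R is close to 2/3, a modest drop in the X/A ratio near genes translates into a much larger estimate of male mutation bias.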