Human Genome Variation and the concept of Genotype Networks

Giovanni Marco Dall’Olio (1), Jaume Bertranpetit (1), Andreas Wagner (2, 3, 4), Hafid Laayouni (1) ((1) Institut de Biologia Evolutiva, CSIC-Universitat Pompeu Fabra, Barcelona, Spain. (2) Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland. (3) The Swiss Institute of Bioinformatics, Lausanne, Switzerland. (4) The Santa Fe Institute, Santa Fe, USA.)
(Submitted on 3 Sep 2013)

In 1970, John Maynard-Smith introduced the concept of “Protein Space”, a representation of all the possible protein sequences, as a framework to describe how evolutionary processes take place. Since then, the concepts of protein space and of networks of sequences have been applied to a variety of systems, from protein modeling to RNA evolution, and to metabolic systems. Here, we adapted these concepts to the analysis of human DNA sequence data. We focused on the variation that can be represented from Single Nucleotide Variant (SNV) data, and we used the 1000 Genomes dataset to determine how human populations have explored this genotype space.
Our results include a genome-wide survey of how the genotype networks of human populations vary along the genome, and a framework to calculate the properties of these networks from sequencing data. Moreover, we found that, in coding regions, these networks tend to be both more “extended” in the space, and also more connected, than in non-coding regions. The application of the concept of genotype networks can provide a new opportunity to understand the evolutionary processes that shaped our genome. If we learn how human populations have explored the genotype space, we can achieve a better understanding of how selective pressures such as pathogens and diseases have shaped the evolution of a region of the genome, and how different regions have evolved. Combined with the availability of larger datasets of sequencing data, genotype networks represent a new approach to the study of human genetic diversity.
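To make the idea of a genotype network concrete, here is a minimal sketch. It assumes the usual definition in which each node is a distinct genotype observed in a region (encoded here as a string of 0/1 alleles at SNV positions) and two nodes are joined when they differ at exactly one position. The encoding, the function names, and the two summary statistics are illustrative assumptions, not the framework described in the paper.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def genotype_network(genotypes):
    """Build an adjacency dict over the distinct observed genotypes.

    Assumption: an edge joins two genotypes that differ at exactly one site.
    """
    nodes = sorted(set(genotypes))
    adjacency = {node: set() for node in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def average_degree(adjacency):
    """Mean number of neighbours per node, a simple measure of connectivity."""
    if not adjacency:
        return 0.0
    return sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)


def count_components(adjacency):
    """Count connected components with an iterative depth-first search."""
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


# Hypothetical genotypes at three SNV positions (toy data, not 1000 Genomes).
sample = ["000", "001", "011", "011", "110"]
network = genotype_network(sample)
print(len(network), average_degree(network), count_components(network))
```

The all-pairs comparison is quadratic in the number of distinct genotypes, which is acceptable for short windows but would need indexing for a genome-wide survey.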

Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome

Wentian Li, Jan Freudenberg, Pedro Miramontes
(Submitted on 28 Aug 2013)

The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a greater length increases the chance of reads being uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 to 1000 basepairs. We use the proportion of non-singleton k-mers to evaluate the mappability of reads for a corresponding read length. We observe that the proportion of non-singletons decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different k ranges. A faster decay at smaller values of k indicates more limited gains for read lengths > 200 basepairs. The frequency distributions of k-mers exhibit long tails in a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The location of the most frequent 1000-mers comprises 172 kilobase-ranged regions, including four large stretches on chromosomes 1 and X, containing genes with biomedical implications. Even a read length of 1000 basepairs would be insufficient to reliably sequence these specific regions.
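The mappability measure above can be illustrated with a small sketch. It assumes the proportion is taken over k-mer positions (every start position in the sequence counts once) and that only the forward strand is considered; the abstract does not state these details, so both choices and the function name are assumptions.

```python
from collections import Counter


def non_singleton_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs more than once.

    Assumptions: forward strand only, exact matches only, and positions
    (not distinct k-mers) as the unit being counted.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


# Toy sequence: the 4-mer ACGT occurs twice, but every 5-mer is unique.
toy = "ACGTACGTTT"
for k in (4, 5):
    print(k, non_singleton_fraction(toy, k))
```

On a real genome a plain `Counter` over all k-mers would not fit in memory for large k; the sketch only shows what is being measured.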

Predicting protein contact map using evolutionary and physical constraints by integer programming

Zhiyong Wang, Jinbo Xu
(Submitted on 8 Aug 2013)

Motivation. A protein contact map describes the pairwise spatial and functional relationship of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains very challenging to predict the contact map using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and physical feasibility of the whole contact map. A couple of recent methods predict the contact map based upon residue co-evolution, taking into consideration contact correlation and enforcing a sparsity restraint, but these methods require a very large number of sequence homologs for the protein under consideration and the resultant contact map may still be physically unfavorable.
Results. This paper presents a novel method, PhyCMAP, for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming (ILP). The evolutionary restraints include sequence profile, residue co-evolution and context-specific statistical potential. The physical restraints specify more concrete relationships among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and thus significantly improves prediction accuracy. Experimental results confirm that PhyCMAP outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration. PhyCMAP can predict contacts within minutes after the PSIBLAST search for sequence homologs is done, much faster than the two recent methods PSICOV and EvFold.

Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects

Binghang Liu, Yujian Shi, Jianying Yuan, Xuesong Hu, Hao Zhang, Nan Li, Zhenyu Li, Yanxiang Chen, Desheng Mu, Wei Fan
(Submitted on 9 Aug 2013)

Background: With the fast development of next generation sequencing technologies, increasing numbers of genomes are being de novo sequenced and assembled. However, most are in fragmented and incomplete draft status, and thus it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems to current assemblers utilizing short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics.
Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating genomic characteristics such as genome size, repeat structure and heterozygous rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and coverage bias, the estimation accuracy of our method is significantly improved over existing methods. We also studied how the various genomic and sequencing characteristics affect the estimation accuracy using simulated sequencing data, and discussed the limitations on applying our method to real sequencing data.
Conclusion: Based on this research, we show that k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve our understanding of a species' genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms. The programs developed in this research are written using C/C++ and freely accessible at this ftp URL
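As a rough illustration of assembly-independent estimation, the sketch below uses a widely known simplification: genome size is approximated as the total number of k-mers divided by the depth at the main peak of the k-mer frequency histogram. This is not the model developed by the authors (it ignores heterozygosity, repeats, and coverage bias); the histogram values, the `min_depth` cutoff for error k-mers, and the function name are hypothetical.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Assumption: k-mers below min_depth are mostly sequencing errors and
    are discarded before locating the peak.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth with the most distinct k-mers, taken as the single-copy peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences contributed by the retained depths.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Hypothetical histogram: an error spike at depth 1-2, a main peak at
# depth 20, and a small repeat shoulder at depth 40.
toy_histogram = {1: 5000, 2: 300, 18: 100, 19: 400, 20: 1000,
                 21: 400, 22: 100, 40: 50}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)
```

In a heterozygous genome the tallest peak may be the half-depth one, which is one reason a fuller model of the distribution is needed.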

Proceedings of the 13th Workshop on Algorithms in Bioinformatics (WABI2013)

Aaron Darling, Jens Stoye
(Submitted on 6 Aug 2013)

These are the proceedings of the 13th Workshop on Algorithms in Bioinformatics, WABI2013, which was held September 2-4, 2013 in Sophia Antipolis, France. All manuscripts were peer reviewed by the WABI2013 program committee and external reviewers.

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching


Ilya Y. Zhbannikov, Samuel S. Hunter, Matthew L. Settles, James A. Foster
(Submitted on 31 Jul 2013)

With the advent of Next-Generation (NG) sequencing, it has become possible to sequence an entire genome quickly and inexpensively. However, in some experiments one only needs to extract and assemble a portion of the sequence reads, for example when performing transcriptome studies, sequencing mitochondrial genomes, or characterizing exomes. With the raw DNA library of a complete genome it would appear to be a trivial problem to identify reads of interest. But it is not always easy to incorporate well-known tools such as BLAST, BLAT, Bowtie, and SOAP directly into a bioinformatics pipeline before the assembly stage, either due to incompatibility with the assembler’s file inputs, or because it is desirable to incorporate information that must be extracted separately. For example, in order to incorporate flowgrams from a Roche 454 sequencer into the Newbler assembler it is necessary to first extract them from the original SFF files. We present SlopMap, a bioinformatics software utility which allows rapid identification of reads similar to provided target sequences from either Roche 454 or Illumina DNA libraries. With a simple and intuitive command-line interface along with file output formats compatible with assembly programs, SlopMap can be directly embedded in a biological data processing pipeline without any additional programming work. In addition, SlopMap preserves the flowgram information needed for the Roche 454 assembler.
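The general idea of selecting reads by exact k-mer matching can be sketched as follows. This is not SlopMap's implementation or interface: the scoring rule (fraction of a read's distinct k-mers found in the targets), the parameters `k` and `min_fraction`, and the function names are all assumptions chosen for illustration.

```python
def kmer_set(sequence, k):
    """All distinct k-mers of a sequence (empty if the sequence is shorter than k)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def select_reads(reads, targets, k=5, min_fraction=0.5):
    """Return names of reads sharing enough exact k-mers with the targets.

    reads is a list of (name, sequence) pairs. Assumption: a read is kept
    when at least min_fraction of its distinct k-mers occur in some target.
    """
    target_index = set()
    for target in targets:
        target_index |= kmer_set(target, k)

    selected = []
    for name, sequence in reads:
        read_kmers = kmer_set(sequence, k)
        if not read_kmers:
            continue  # read shorter than k, cannot be scored
        shared = len(read_kmers & target_index) / len(read_kmers)
        if shared >= min_fraction:
            selected.append(name)
    return selected


# Toy data: r1 overlaps the target, r2 does not, r3 is too short to score.
toy_targets = ["ACGTACGTGA"]
toy_reads = [("r1", "CGTACGTG"), ("r2", "TTTTTTTT"), ("r3", "ACG")]
print(select_reads(toy_reads, toy_targets))
```

A practical tool would also index reverse complements and tolerate sequencing errors through the choice of k and threshold.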

Characterizing Compatibility and Agreement of Unrooted Trees via Cuts in Graphs

Sudheer Vakati, David Fernández-Baca
(Submitted on 30 Jul 2013)

Deciding whether there is a single tree (a supertree) that summarizes the evolutionary information in a collection of unrooted trees is a fundamental problem in phylogenetics. We consider two versions of this question: agreement and compatibility. In the first, the supertree is required to reflect precisely the relationships among the species exhibited by the input trees. In the second, the supertree can be more refined than the input trees.
Tree compatibility can be characterized in terms of the existence of a specific kind of triangulation in a structure known as the display graph. Alternatively, it can be characterized as a chordal graph sandwich problem in a structure known as the edge label intersection graph. Here, we show that the latter characterization yields a natural characterization of compatibility in terms of minimal cuts in the display graph, which is closely related to compatibility of splits. We then derive a characterization for agreement.
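For readers unfamiliar with compatibility of splits, mentioned above, the standard pairwise test is easy to state: two splits A|B and C|D of the same taxon set are compatible exactly when at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty. The sketch below checks only this pairwise condition; it does not implement the display-graph cut characterization from the paper, and the taxon labels are made up.

```python
def splits_compatible(split1, split2):
    """Pairwise split compatibility on a common taxon set.

    Each split is a pair of disjoint sets. The splits are compatible when
    at least one of the four cross intersections is empty.
    """
    (a, b), (c, d) = split1, split2
    return any(not (x & y) for x in (a, b) for y in (c, d))


# Hypothetical taxa 1..5.
s1 = ({1, 2}, {3, 4, 5})
s2 = ({1, 2, 3}, {4, 5})   # nests with s1, so both fit on one tree
s3 = ({1, 3}, {2, 4, 5})   # separates 1 from 2, conflicting with s1
print(splits_compatible(s1, s2), splits_compatible(s1, s3))
```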

Sibelia: A scalable and comprehensive synteny block generation tool for closely related microbial genomes

Ilya Minkin, Anand Patel, Mikhail Kolmogorov, Nikolay Vyahhi, Son Pham
(Submitted on 30 Jul 2013)

Comparing strains within the same microbial species has proven effective in the identification of genes and genomic regions responsible for virulence, as well as in the diagnosis and treatment of infectious diseases. In this paper, we present Sibelia, a tool for finding synteny blocks in multiple closely related microbial genomes using iterative de Bruijn graphs. Unlike most other tools, Sibelia can find synteny blocks that are repeated within genomes as well as blocks shared by multiple genomes. It represents synteny blocks in a hierarchical structure with multiple layers, each representing a different granularity level. Sibelia has been designed to work efficiently with a large number of microbial genomes; it finds synteny blocks in 31 S. aureus genomes within 31 minutes and in 59 E. coli genomes within 107 minutes on a standard desktop. Sibelia software is distributed under the GNU GPL v2 license and is available at: this https URL. Sibelia’s web-server is available at: this http URL
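As background on the data structure named above, here is a minimal sketch of a de Bruijn graph built from several sequences: nodes are (k-1)-mers and each k-mer contributes an edge from its prefix to its suffix. Nodes with more than one successor mark places where the sequences diverge. This shows only the basic graph construction, not Sibelia's iterative procedure or its synteny block output; the sequences and function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in any sequence."""
    graph = defaultdict(set)
    for sequence in sequences:
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Nodes with more than one successor, where the input sequences diverge."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two toy "genomes" that share a prefix and differ in the last base.
toy_genomes = ["ACGTC", "ACGTT"]
toy_graph = de_bruijn_graph(toy_genomes, k=3)
print(toy_graph)
print(branching_nodes(toy_graph))
```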

Exploring Genome Characteristics and Sequence Quality Without a Reference

Jared T. Simpson
(Submitted on 30 Jul 2013)

The de novo assembly of large, complex genomes is a significant challenge with currently available DNA sequencing technology. While many de novo assembly software packages are available, comparatively little attention has been paid to assisting the user with the assembly. This paper addresses the practical aspects of de novo assembly by introducing new ways to perform quality assessment on a collection of DNA sequence reads. The software implementation calculates per-base error rates, paired-end fragment size histograms and coverage metrics in the absence of a reference genome. Additionally, the software will estimate characteristics of the sequenced genome, such as repeat content and heterozygosity, that are key determinants of assembly difficulty. The software described is freely available and open source under the GNU Public License.

Agalma: an automated phylogenomics workflow

Casey W. Dunn, Mark Howison, Felipe Zapata
(Submitted on 24 Jul 2013)

In the past decade, transcriptome data have become an important component of many phylogenetic studies. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies. We present Agalma, an automated tool that conducts phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, and enables rich HTML reports for all stages of the analysis. Agalma includes a test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at this https URL. Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended.