Human Genome Variation and the concept of Genotype Networks
Giovanni Marco Dall’Olio (1), Jaume Bertranpetit (1), Andreas Wagner (2, 3, 4), Hafid Laayouni (1) ((1) Institut de Biologia Evolutiva, CSIC-Universitat Pompeu Fabra, Barcelona, Spain. (2) Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland. (3) The Swiss Institute of Bioinformatics, Lausanne, Switzerland. (4) The Santa Fe Institute, Santa Fe, USA.)
(Submitted on 3 Sep 2013)
In 1970, John Maynard Smith introduced the concept of “Protein Space”, a representation of all the possible protein sequences, as a framework to describe how evolutionary processes take place. Since then, the concepts of protein space and of networks of sequences have been applied to a variety of systems, from protein modeling to RNA evolution, and to metabolic systems. Here, we adapted these concepts to the analysis of human DNA sequence data. We focused on the variation that can be represented from Single Nucleotide Variants (SNV) data, and we used the 1000 Genomes dataset to determine how human populations have explored this genotype space.
Our results include a genome-wide survey of how the genotype networks of human populations vary along the genome, and a framework to calculate the properties of these networks from sequencing data. Moreover, we found that, in coding regions, these networks tend to be both more “extended” in the space, and also more connected, than in non-coding regions. The application of the concept of genotype networks can provide a new opportunity to understand the evolutionary processes that shaped our genome. If we learn how human populations have explored the genotype space, we can achieve a better understanding of how selective pressures such as pathogens and diseases have shaped the evolution of a region of the genome, and how different regions have evolved. Combined with the availability of larger datasets of sequencing data, genotype networks represent a new approach to the study of human genetic diversity.
Hello, I am one of the authors of this article.
I also wrote a blog post about how this work has been planned and organized:
Any feedback on this article would be greatly appreciated. We did our best to make all the code and results accessible, and to make the work easier to reproduce.
Glad to see you consider arXiving papers part of “best practices” in bioinformatics!
Regarding the paper, I’ve only been able to skim it, but a major comment is that it’s difficult for me to immediately see what the problem or question is that you’re trying to address. You seem to have a data structure to describe what a population geneticist might call “haplotype diversity”. Why is this different or more interesting than commonly used data structures like coalescent trees or ARGs?
Thank you very much for your comment!
Yes, our proposal does aim to describe haplotype diversity, but it pays closer attention to the shape of the network of haplotypes. Other literature has proposed that certain properties of this network (the diameter, the degree) can be related to the “innovability” or “evolvability” of a system.
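To make the two properties mentioned above concrete, here is a minimal sketch, not the code from the paper. It assumes each haplotype in a region is encoded as a string of 0s and 1s (reference or alternative allele at each SNV), and that two observed haplotypes are connected when they differ at exactly one position. The function names and the sample data are hypothetical and chosen only for illustration.

```python
from collections import deque
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_network(haplotypes):
    """Map each distinct haplotype to its observed one-mutation neighbours.

    Assumption: an edge joins two haplotypes that differ at exactly one SNV.
    """
    nodes = sorted(set(haplotypes))
    graph = {node: set() for node in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) == 1:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def eccentricity(graph, start):
    """Longest shortest-path distance from start, within its component (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())


def diameter(graph):
    """Largest eccentricity over all nodes.

    For a disconnected network this is the largest diameter of any component.
    """
    return max(eccentricity(graph, node) for node in graph)


# Hypothetical haplotypes over five SNVs (made-up data, duplicates allowed).
sample = ["00000", "00000", "00001", "00011", "00111", "10000", "11111"]
network = build_network(sample)

print("distinct haplotypes:", len(network))
print("average degree:", round(average_degree(network), 2))
print("diameter:", diameter(network))
```

In this toy sample there are six distinct haplotypes. Five of them form a chain of single-SNV steps, which gives a diameter of 4, while "11111" has no observed one-step neighbour and stays isolated. A real analysis would repeat this kind of calculation region by region along the genome.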
These concepts of “innovability” and the like have never been extensively applied to human genetic diversity, because fully reconstructing a haplotype network would require too many samples. However, now that datasets like 1000 Genomes have been published, we thought there was an opportunity to see how these properties are distributed across the human genome.
Compared to ARGs, I think this method is a different concept. ARGs are a method to reconstruct the history of a given region of the genome. With Genotype Networks, we just want to determine some properties of how a given region of the genome has evolved in a population.