Reconstructing Roma history from genome-wide data

Priya Moorjani, Nick Patterson, Po-Ru Loh, Mark Lipson, Péter Kisfali, Bela I Melegh, Michael Bonin, Ľudevít Kádaši, Olaf Rieß, Bonnie Berger, David Reich, Béla Melegh
(Submitted on 7 Dec 2012)

The Roma people, living throughout Europe, are a diverse population linked by the Romani language and culture. Previous linguistic and genetic studies have suggested that the Roma migrated into Europe from South Asia about 1000-1500 years ago. Genetic inferences about Roma history have mostly focused on the Y chromosome and mitochondrial DNA. To explore what additional information can be learned from genome-wide data, we analyzed data from six Roma groups that we genotyped at hundreds of thousands of single nucleotide polymorphisms (SNPs). We estimate that the Roma harbor about 80% West Eurasian ancestry (deriving from a combination of European and South Asian sources) and that the date of admixture of South Asian and European ancestry was about 850 years ago. We provide evidence for Eastern Europe being a major source of European ancestry, and North-west India being a major source of the South Asian ancestry in the Roma. By computing allele sharing as a measure of linkage disequilibrium, we estimate that the migration of Roma out of the Indian subcontinent was accompanied by a severe founder event, which we hypothesize was followed by a major demographic expansion once the population arrived in Europe.

Oh sister, where art thou? Indirect fitness benefit could maintain a host defense trait

Tobias Pamminger, Susanne Foitzik, Dirk Metzler, Pleuni S. Pennings
(Submitted on 4 Dec 2012)

Population structure can affect the evolution of parasite virulence and host defense, a hypothesis that has been confirmed by studies focusing on large spatial scales. In contrast, we examine the small scale population structure of a host species and investigate whether it could explain the evolution of a defense trait against slavemaking ants. Slavemaking ants steal worker brood from host colonies, which will later serve as slaves to rear parasite offspring. The host species Temnothorax longispinosus has evolved an effective post-enslavement defense mechanism; instead of taking care of the slavemaker young, these slaves kill a high proportion of the parasite offspring. Because slaves never reproduce, they were thought to be trapped in an evolutionary dead end without the possibility of evolving such defense traits. Using detailed microsatellite data on a small spatial scale we can demonstrate that slaves can gain indirect fitness benefits by reducing parasite pressure on nearby host colonies, because these are often closely related to the slaves. Our genetic analyses indicate that polydomy, i.e., the occupation of several nest sites by a single colony, is sufficient to explain the elevated relatedness values between slaves and the surrounding host colonies, which may benefit from the slaves’ rebellion behavior.

GWAPP: A Web Application for Genome-wide Association Mapping in A. thaliana

Ümit Seren (1), Bjarni J. Vilhjálmsson (1 and 2), Matthew W. Horton (1 and 3), Dazhe Meng (4), Petar Forai (1), Yu S. Huang (4), Quan Long (1), Vincent Segura (5), Magnus Nordborg (1 and 2) ((1) Gregor Mendel Institute, Austrian Academy of Sciences, (2) Molecular and Computational Biology, University of Southern California, (3) Department of Ecology and Evolution, University of Chicago, (4) Center for Neurobehavioral Genetics, Semel Institute, University of California Los Angeles, (5) INRA, France)
(Submitted on 4 Dec 2012)

Arabidopsis thaliana is an important model organism for understanding the genetics and molecular biology of plants. Its highly selfing nature, together with other important features, such as small size, short generation time, small genome size, and wide geographic distribution, makes it an ideal model organism for understanding natural variation. Genome-wide association studies (GWAS) have proven a useful technique for identifying genetic loci responsible for natural variation in A. thaliana. Previously genotyped accessions (natural inbred lines) can be grown in replicate under different conditions, and phenotyped for different traits. These important features greatly simplify association mapping of traits and allow for systematic dissection of the genetics of natural variation by the entire Arabidopsis community. To facilitate this, we present GWAPP, an interactive web-based application for conducting GWAS in A. thaliana. Traits measured for a subset of 1386 publicly available ecotypes can be uploaded and mapped, using an efficient Python implementation of a linear mixed model as well as other methods, in just a couple of minutes. GWAPP features an extensive, interactive, and user-friendly interface that includes interactive Manhattan plots and interactive local and genome-wide LD plots. It facilitates exploratory data analysis by implementing features such as the inclusion of candidate SNPs in the model as cofactors.

Deep-sequencing of the Peach Latent Mosaic Viroid Reveals New Aspects of Population Heterogeneity

Jean-Pierre Sehi Glouzon, François Bolduc, Rafael Najmanovich, Shengrui Wang, Jean-Pierre Perreault
(Submitted on 3 Dec 2012)

Viroids are small circular single-stranded infectious RNAs that are characterized by a relatively high mutation level. Knowledge of their sequence heterogeneity remains largely elusive, and, as yet, no strategy attempting to address this question from a population dynamics point of view is in place. In order to address these important questions, a GF305 indicator peach tree was infected with a single variant of the Avsunviroidae family member Peach latent mosaic viroid (PLMVd). Six months post-inoculation, full-length circular conformers of PLMVd were isolated and deep-sequenced, and the resulting sequences were analyzed using an original bioinformatics scheme specifically designed and developed to evaluate the richness of a given sequence population. Two distinct libraries were analyzed, and yielded 1125 and 1061 different PLMVd variants respectively, making this study the most productive to date (by more than an order of magnitude) in terms of the reporting of novel viroid sequences. Sequence variants exhibiting up to ~20% of mutations relative to the inoculated viroid were retrieved, clearly illustrating the high divergence dynamic within a single population. Using a novel hierarchical clustering algorithm, the different variants obtained were grouped into either 7 or 8 clusters depending on the library being analyzed. Most of the sequences contained, on average, between 4.6 and 6.3 mutations relative to the variant used initially to inoculate the plant. Interestingly, it was possible to reconstitute the sequence evolution between these clusters. On top of providing a reliable pipeline for the treatment of viroid deep-sequencing, this study sheds new light on the importance of the sequence variation that may take place in a viroid population and which may result in the formation of a quasi-species.

Illumina Sequencing Artifacts Revealed by Connectivity Analysis of Metagenomic Datasets

Adina Chuang Howe, Jason Pell, Rosangela Canino-Koning, Rachel Mackelprang, Susannah Tringe, Janet Jansson, James M. Tiedje, C. Titus Brown
(Submitted on 1 Dec 2012)

Sequencing errors and biases in metagenomic datasets affect coverage-based assemblies and are often ignored during analysis. Here, we analyze read connectivity in metagenomes and identify the presence of problematic and likely non-biological connectivity within metagenome assembly graphs. Specifically, we identify highly connected sequences which join a large proportion of reads within each real metagenome. These sequences show position-specific bias in shotgun reads, suggestive of sequencing artifacts, and are only minimally incorporated into contigs by assembly. The removal of these sequences prior to assembly results in similar assembly content for most metagenomes and enables the use of graph partitioning to decrease assembly memory and time requirements.

ZRT1 harbors an excess of nonsynonymous polymorphism and shows evidence of balancing selection in Saccharomyces cerevisiae

Elizabeth K. Engle, Justin C. Fay
(Submitted on 1 Dec 2012)

Estimates of the fraction of nucleotide substitutions driven by positive selection vary widely across different species. Accounting for different estimates of positive selection has been difficult, in part because selection on polymorphism within a species is known to obscure a signal of positive selection between species. While methods have been developed to control for the confounding effects of negative selection against deleterious polymorphism, the impact of balancing selection on estimates of positive selection has not been assessed. In Saccharomyces cerevisiae, there is no signal of positive selection within protein coding sequences as the ratio of nonsynonymous to synonymous polymorphism is higher than that of divergence. To investigate the impact of balancing selection on estimates of positive selection we examined five genes with high rates of nonsynonymous polymorphism in S. cerevisiae relative to divergence from S. paradoxus. One of the genes, a high affinity zinc transporter ZRT1, shows an elevated rate of synonymous polymorphism indicative of balancing selection. The high rate of synonymous polymorphism coincides with nonsynonymous divergence between three haplotype groups, which we find to be functionally indistinguishable. We conclude that balancing selection is not likely to be a common cause of genes harboring a large excess of nonsynonymous polymorphism in yeast.
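The comparison between polymorphism and divergence ratios described in this abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes the standard McDonald-Kreitman style summary, in which the neutrality index is (Pn/Ps)/(Dn/Ds) and the estimated fraction of adaptive substitutions is alpha = 1 - neutrality index. The function name `mk_summary` and the counts are hypothetical and chosen only for illustration.

```python
def mk_summary(pn, ps, dn, ds):
    """Return (neutrality_index, alpha) from McDonald-Kreitman style counts.

    pn, ps: nonsynonymous / synonymous polymorphisms within a species
    dn, ds: nonsynonymous / synonymous fixed differences between species
    """
    if min(ps, dn, ds) <= 0:
        raise ValueError("ps, dn and ds must be positive")
    # Ratio of the polymorphism ratio to the divergence ratio.
    neutrality_index = (pn / ps) / (dn / ds)
    # A neutrality index above 1 gives a negative alpha, i.e. no signal
    # of positive selection from this summary.
    alpha = 1.0 - neutrality_index
    return neutrality_index, alpha


# Hypothetical counts (not from the paper): an excess of nonsynonymous
# polymorphism relative to divergence.
ni, alpha = mk_summary(pn=30, ps=20, dn=15, ds=30)
print(ni, alpha)
```

With these made-up counts the polymorphism ratio exceeds the divergence ratio, so alpha is negative, which is the situation the abstract describes for S. cerevisiae.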

Most viewed on Haldane’s Sieve: November 2012

The most viewed preprints on Haldane’s Sieve in November 2012 were:

The evolution of complex gene regulation by low specificity binding sites

Alexander J. Stewart, Joshua B. Plotkin
(Submitted on 30 Nov 2012)

Transcription factor binding sites vary in their specificity, both within and between species. Binding specificity has a strong impact on the evolution of gene expression, because it determines how easily regulatory interactions are gained and lost. Nevertheless, we have a relatively poor understanding of what evolutionary forces determine the specificity of binding sites. Here we address this question by studying regulatory modules composed of multiple binding sites. Using a population-genetic model, we show that more complex regulatory modules, composed of a greater number of binding sites, must employ binding sites that are individually less specific, compared to less complex regulatory modules. This effect is extremely general, and it holds regardless of the regulatory logic of a module. We attribute this phenomenon to the inability of stabilising selection to maintain highly specific sites in large regulatory modules. Our analysis helps to explain broad empirical trends in the yeast regulatory network: those genes with a greater number of transcriptional regulators feature less specific binding sites, and there is less variance in their specificity, compared to genes with fewer regulators. Likewise, our results also help to explain the well-known trend towards lower specificity in the transcription factor binding sites of higher eukaryotes, which perform complex regulatory tasks, compared to prokaryotes.

Our paper: Bacterial diversity associated with Drosophila in the laboratory and in the natural environment

For our next guest post, Fabian Staubach and Dmitri Petrov write about their paper (along with coauthors), Bacterial diversity associated with Drosophila in the laboratory and in the natural environment, arXived here.

Host-associated bacterial communities are ubiquitous, have a variety of effects on the host phenotype and play a role in host adaptation to new environments. Some clear examples of such adaptations are known, but generally these are ancient associations between host and symbiont, such as the association between aphids and the obligate symbiotic bacterium Buchnera, which provides the aphid with essential amino acids, or the association between beewolves and Streptomyces, which protects beewolf larvae from fungal infections. We are investigating the potential of bacterial communities to underlie short-term adaptation, using adaptation of D. melanogaster and D. simulans to different fruit as a study system.

As a first step, we profiled the diversity and composition of bacterial communities associated with Drosophila across multiple species, habitats, and substrates. We amplified and sequenced a region of the bacterial ribosomal DNA from whole body fly samples using 454 technology. We focused on comparing the bacterial communities of the sibling species D. melanogaster and D. simulans in the lab and in an ecologically and evolutionarily relevant setting: their natural environment. In most cases we were able to study flies from these two species collected by aspiration from the same fruit. We also included nine different species spanning the Drosophila phylogeny to test whether phylogenetic distance and distance between bacterial communities are correlated.

We show that natural bacterial communities associated with Drosophila contain more different bacterial taxa than previously thought. Comparison to a mammalian fecal data set reveals that although mammal-associated bacterial communities are more diverse on average, the diversity of some mammalian fecal samples lies within the range of, or is even lower than, that of the Drosophila samples we analyzed. This finding is interesting because it has been a matter of debate whether organisms with an adaptive immune system can in general accommodate higher bacterial diversity. By comparing the bacterial communities of D. melanogaster and D. simulans collected directly from different natural food substrates, we demonstrate that bacterial communities differ primarily between substrates and very weakly among fly species.
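To make the idea of comparing diversity between samples concrete, here is a minimal sketch. It is an illustration rather than the analysis used in the paper: it assumes that each sample is summarized as read counts per bacterial taxon, and it uses taxon richness and the Shannon index as example diversity measures. The function names and the counts are hypothetical.

```python
import math


def richness(counts):
    """Number of taxa with at least one read in the sample."""
    return sum(1 for c in counts.values() if c > 0)


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("sample has no reads")
    return -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )


# Hypothetical read counts per genus for two samples (not real data).
fly_sample = {"Acetobacter": 60, "Gluconobacter": 30, "Enterococcus": 10}
other_sample = {"Acetobacter": 98, "Gluconobacter": 2}

for name, sample in [("fly", fly_sample), ("other", other_sample)]:
    print(name, richness(sample), round(shannon_index(sample), 3))
```

A sample dominated by a single taxon gets a low Shannon index even when several taxa are present, which is why richness and evenness-aware measures can rank samples differently.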

We find acetic acid bacteria of the genera Acetobacter and Gluconobacter to be associated with all wild-caught flies, constituting two thirds of all sequences. Acetic acid bacteria oxidize sugars and ethanol to acetic acid and are known to be directly involved in the development of a specific process of decay called ‘sour rot’ on grapes that causes wine spoilage. There is previous evidence that Drosophila is vital for the dispersal of acetic acid bacteria among rotting fruit: grapes covered with nets in the field do acquire yeasts but no acetic acid bacteria, and acetic acid bacteria thrive on grapes only when flies are present. At the same time, Acetobacter has been shown to promote Drosophila larval growth and shorten development time under certain nutritional conditions. Therefore, we argue that the relationship between Acetobacteraceae and Drosophila is likely mutualistic.

Individual natural fly samples are dominated by bacteria known to be pathogenic in Drosophila, such as Enterococcus and Providencia. These bacteria are known to reach very high cell counts during systemic infections of Drosophila and we believe that the inclusion of systemically infected flies in these samples is the most likely explanation for the observed pattern. The observation that it is in principle possible to identify potential candidate pathogens in natural populations using standard, high throughput microbial community screening techniques opens up opportunities for large scale epidemiological studies in nature and can help to identify candidate pathogenic bacterial species for further investigation in the laboratory.

In the laboratory, fly-associated bacterial communities are similar irrespective of phylogenetic distance between fly species, suggesting that host genetic factors either play a minor role in shaping the bacterial communities associated with Drosophila or, as suggested by the difference of bacterial communities between D. melanogaster and D. simulans in the wild, require natural conditions to manifest themselves. High variability of Drosophila bacterial communities within and between laboratories is a potential source of experimental noise when studying phenotypic variation. The impact of microbes on Drosophila phenotypes ranges from influencing growth to cold tolerance, and it is hard to imagine traits that are not subject in principle to alteration by microbes.

We hope that our data will serve as a solid foundation for future studies, especially for the growing community of scientists interested in the microbial communities associated with Drosophila.

Fabian Staubach and Dmitri Petrov

Identifying a species tree subject to random lateral gene transfer

Mike Steel, Simone Linz, Daniel H. Huson, Michael J. Sanderson
(Submitted on 30 Nov 2012)

A major problem for inferring species trees from gene trees is that evolutionary processes can sometimes favour gene tree topologies that conflict with an underlying species tree. In the case of incomplete lineage sorting, this phenomenon has recently been well-studied, and some elegant solutions for species tree reconstruction have been proposed. One particularly simple and statistically consistent estimator of the species tree under incomplete lineage sorting is to combine three-taxon analyses, which are phylogenetically robust to incomplete lineage sorting. In this paper, we consider whether such an approach will also work under lateral gene transfer (LGT). By providing an exact analysis of some cases of this model, we show that there is a zone of inconsistency for triplet-based species tree reconstruction under LGT. However, a triplet-based approach will consistently reconstruct a species tree under models of LGT, provided that the expected number of LGT transfers is not too high. Our analysis involves a novel connection between the LGT problem and random walks on cyclic graphs. We have implemented a procedure for reconstructing trees subject to LGT or lineage sorting in settings where taxon coverage may be patchy and illustrate its use on two sample data sets.
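The idea of combining three-taxon analyses can be sketched in a few lines. The code below is not the authors' implementation. It assumes rooted gene trees written as nested tuples of taxon names, and it shows only the voting step: for every set of three taxa, it records which pair is most often grouped together across the gene trees. Gene trees that lack one of the three taxa are skipped, which is one simple way to handle patchy taxon coverage. Assembling the winning triplets into a full species tree is left out, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def triplet_pair(tree, triple):
    """Return the two taxa of `triple` grouped together in `tree`,
    or None if the triple is unresolved or not fully present."""
    triple = set(triple)
    if not triple <= leaves(tree):
        return None  # patchy coverage: this gene tree casts no vote
    node = tree
    while not isinstance(node, str):
        hits = [triple & leaves(child) for child in node]
        full = [child for child, h in zip(node, hits) if len(h) == 3]
        if full:
            node = full[0]  # all three taxa sit below one child: descend
            continue
        pairs = [h for h in hits if len(h) == 2]
        return frozenset(pairs[0]) if pairs else None
    return None


def majority_triplets(gene_trees):
    """Map each three-taxon set to the pair most often grouped together.
    Ties are broken by whichever pair was seen first."""
    taxa = sorted(set().union(*(leaves(t) for t in gene_trees)))
    result = {}
    for triple in combinations(taxa, 3):
        votes = Counter(
            pair for pair in (triplet_pair(t, triple) for t in gene_trees) if pair
        )
        if votes:
            result[triple] = tuple(sorted(votes.most_common(1)[0][0]))
    return result


# Hypothetical gene trees: two agree, one conflicts (e.g. after a transfer).
example_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
print(majority_triplets(example_trees))
```

In this toy input the conflicting third tree is outvoted for every triple, so the majority triplets all agree with the tree ((A,B),(C,D)). The abstract's point is that with enough transfer events this kind of vote can become inconsistent.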