The genomic landscape of meiotic crossovers and gene conversions in Arabidopsis thaliana

Erik Wijnker, Geo Velikkakam James, Jia Ding, Frank Becker, Jonas R. Klasen, Vimal Rawat, Beth A. Rowan, Daniel F. de Jong, C. Bastiaan de Snoo, Luis Zapata, Bruno Huettel, Hans de Jong, Stephan Ossowski, Detlef Weigel, Maarten Koornneef, Joost J.B. Keurentjes, Korbinian Schneeberger
(Submitted on 13 Nov 2013)

Knowledge of the exact distribution of meiotic crossovers (COs) and gene conversions (GCs) is essential for understanding many aspects of population genetics and evolution, from haplotype structure and long-distance genetic linkage to the generation of new allelic variants of genes. To this end, we resequenced the four products of 13 meiotic tetrads along with 10 doubled haploids derived from Arabidopsis thaliana hybrids. GC detection through short reads has previously been confounded by genomic rearrangements. Rigid filtering for misaligned reads allowed GC identification at high accuracy and revealed an ~80-kb transposition, which undergoes copy-number changes mediated by meiotic recombination. Non-crossover associated GCs were extremely rare most likely due to their short average length of ~25-50 bp, which is significantly shorter than the length of CO associated GCs. Overall, recombination preferentially targeted non-methylated nucleosome-free regions at gene promoters, which showed significant enrichment of two sequence motifs.

Functional Annotation Signatures of Disease Susceptibility Loci Improve SNP Association Analysis

Edwin S Iversen, Gary Lipton, Merlise A. Clyde, Alvaro N. A. Monteiro
doi: 10.1101/000158

We describe the development and application of a Bayesian statistical model for the prior probability of phenotype-genotype association that incorporates data from past association studies and publicly available functional annotation data regarding the susceptibility variants under study. The model takes the form of a binary regression of association status on a set of annotation variables whose coefficients were estimated through an analysis of associated SNPs housed in the GWAS Catalog (GC). The set of functional predictors we examined includes measures that have been demonstrated to correlate with the association status of SNPs in the GC and some whose utility in this regard is speculative: summaries of the UCSC Human Genome Browser ENCODE super-track data, dbSNP function class, sequence conservation summaries, proximity to genomic variants included in the Database of Genomic Variants (DGV) and known regulatory elements included in the Open Regulatory Annotation database (ORegAnno), PolyPhen-2 probabilities and RegulomeDB categories. Because we expected that only a fraction of the annotation variables would contribute to predicting association, we employed a penalized likelihood method to reduce the impact of non-informative predictors and evaluated the model’s ability to predict GC SNPs not used to construct the model. We show that the functional data alone are predictive of a SNP’s presence in the GC. Further, using data from a genome-wide study of ovarian cancer, we demonstrate that their use as prior data when testing for association is practical at the genome-wide scale and improves power to detect associations.
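As a rough illustration of how an annotation-based prior could feed into an association test, here is a minimal Python sketch. It assumes a logistic link for the binary regression and combines the prior with a Bayes factor by ordinary Bayes' rule; the abstract does not specify the link function, the penalty, or the test used, so the function names, coefficient values, and annotation names below are all hypothetical.

```python
import math

def prior_probability(intercept, coefficients, annotations):
    """Prior probability of association from a logistic model (assumed link)."""
    z = intercept + sum(
        coefficients[name] * annotations.get(name, 0.0) for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-z))

def posterior_probability(prior, bayes_factor):
    """Combine a prior probability with a Bayes factor: posterior odds = BF * prior odds."""
    odds = bayes_factor * prior / (1.0 - prior)
    return odds / (1.0 + odds)

# Hypothetical coefficients and annotation values for a single SNP.
coefficients = {"conserved": 1.5, "in_regulatory_element": 0.8}
snp = {"conserved": 1.0, "in_regulatory_element": 0.0}
prior = prior_probability(-6.0, coefficients, snp)
print(prior, posterior_probability(prior, bayes_factor=50.0))
```

Annotations with a coefficient shrunk to zero by the penalty simply drop out of the sum, which is the practical effect of penalizing non-informative predictors.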

Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis

Eric Y. Durand, Nicholas Eriksson, Cory Y. McLean
(Submitted on 5 Nov 2013)

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, but IBD detection accuracy in non-simulated data is largely unknown. Using 25,432 genotyped European individuals, and exploiting known familial relationships in 2,952 father-mother-child trios contained therein, we identify a false positive rate over 67% for short (2-4 centiMorgan) segments. We introduce a novel, computationally-efficient, haplotype-based metric that enables accurate IBD detection on population-scale datasets.
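One way trio data can expose false positives, sketched here as an assumption because the abstract does not spell out the procedure: a segment a child truly shares IBD with an unrelated individual should also appear in at least one parent. The Python sketch below counts child segments with no overlapping parental segment. The segment representation (chromosome, start, end) and the function names are hypothetical.

```python
def overlaps(seg, other):
    """True if two (chrom, start, end) segments share any interval on the same chromosome."""
    return seg[0] == other[0] and seg[1] < other[2] and other[1] < seg[2]

def unsupported_fraction(child_segments, parent_segments):
    """Fraction of child segments not overlapped by any segment found in either parent.

    Both lists hold segments shared with the same third individual.
    """
    if not child_segments:
        return 0.0
    unsupported = [
        seg for seg in child_segments
        if not any(overlaps(seg, par) for par in parent_segments)
    ]
    return len(unsupported) / len(child_segments)

# Hypothetical segments (chromosome, start cM, end cM) shared with one other individual.
child = [("1", 10.0, 13.0), ("2", 40.0, 42.5), ("5", 7.0, 9.5)]
parents = [("1", 9.5, 13.2)]  # union of the segments detected in mother and father
print(unsupported_fraction(child, parents))
```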

SMASH: A Benchmarking Toolkit for Variant Calling

Ameet Talwalkar, Jesse Liptrap, Julie Newcomb, Christopher Hartl, Jonathan Terhorst, Kristal Curtis, Ma’ayan Bresler, Yun S. Song, Michael I. Jordan, David Patterson
(Submitted on 31 Oct 2013)

Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers.
Results: We propose a benchmarking methodology for evaluating variant calling algorithms called the SMASH toolkit. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMASH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms.
Availability: We provide free and open access online to the SMASH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.
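The abstract does not list the accuracy metrics SMASH proposes. As a generic illustration of the kind of comparison a variant-calling benchmark involves, the following Python sketch computes precision and recall of a call set against a truth set by exact matching. It is not SMASH's implementation; the variant tuples and function name are hypothetical, and a real benchmark would also need to normalize equivalent variant representations before matching.

```python
def precision_recall(called, truth):
    """Precision and recall of called variants against a truth set, by exact match."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical variants as (chromosome, position, reference allele, alternate allele).
truth = [("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("2", 500, "G", "GA")]
called = [("1", 1000, "A", "G"), ("1", 2000, "C", "T"),
          ("1", 3000, "T", "C"), ("2", 700, "G", "A")]
print(precision_recall(called, truth))
```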

Most viewed on Haldane’s Sieve: October 2013

The most viewed preprints on Haldane’s Sieve this month were:

Fighting network space: it is time for an SQL-type language to filter phylogenetic networks

Steven Kelk, Simone Linz, David A. Morrison
(Submitted on 25 Oct 2013)

The search space of rooted phylogenetic trees is vast and a major research focus of recent decades has been the development of algorithms to effectively navigate this space. However this space is tiny when compared with the space of rooted phylogenetic networks, and navigating this enlarged space remains a poorly understood problem. This, and the difficulty of biologically interpreting such networks, obstructs adoption of networks as tools for modelling reticulation. Here, we argue that the superimposition of biologically motivated constraints, via an SQL-style language, can both stimulate use of network software by biologists and potentially significantly prune the search space.

Natural selection on human Y chromosomes

Chuan-Chao Wang, Li Jin, Hui Li
(Submitted on 22 Oct 2013)

The paternally inherited Y chromosome has been widely used in population genetic studies to understand relationships among human populations. Our interpretation of Y chromosomal evidence about population history and genetics has rested on the assumption that all the Y chromosomal markers in the male-specific region (MSY) are selectively neutral. However, the very low diversity of the Y chromosome has prompted a long debate about whether natural selection has affected this chromosome. In recent years, progress in Y chromosome sequencing has helped to address this dispute. Purifying selection has been detected in the X-degenerate genes of human Y chromosomes, and positive selection might also have influenced the evolution of testis-related genes in the ampliconic regions. These new findings remind us to take the effect of natural selection into account when we use the Y chromosome in population genetic studies.

Discriminative Measures for Comparison of Phylogenetic Trees

Omur Arslan, Dan P. Guralnik, Daniel E. Koditschek
(Submitted on 19 Oct 2013)

Efficient and informative comparison of trees is a common essential interest of both computational biology and pattern classification. In this paper, we introduce a novel dissimilarity measure on non-degenerate hierarchies (rooted binary trees), called the NNI navigation distance, that counts the steps along the trajectory of a discrete dynamical system defined over the Nearest Neighbor Interchange (NNI) graph of binary hierarchies. The NNI navigation distance has a unique unifying nature of combining both edge comparison methods and edit operations for comparison of trees and is an efficient approximation to the (NP-hard) NNI distance. It is given by a closed form expression which simply generalizes to nondegenerate hierarchies as well. A relaxation of the closed form of the NNI navigation distance results in a simpler dissimilarity measure on all trees, named the crossing dissimilarity, which counts pairwise cluster incompatibilities of trees. Both of our dissimilarity measures on nondegenerate hierarchies are positive definite (they vanish only between identical trees) and symmetric, but they are not true metrics because they do not satisfy the triangle inequality. Although they are not true metrics, they are both linearly bounded below by the widely used Robinson-Foulds metric and above by a new tree metric, called the cluster-cardinality distance, which is the pullback metric of a matrix norm along an embedding of hierarchies into the space of matrices. All of these proposed tree measures can be efficiently computed in time O(n^2) in the number of leaves, n.
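To make the cluster-based comparisons concrete, here is a small Python sketch that represents each rooted tree by its set of clusters (the leaf sets below each internal node). It computes the Robinson-Foulds distance as the size of the symmetric difference of the two cluster sets, and a count of incompatible cluster pairs in the spirit of the crossing dissimilarity. The paper's exact definition and any normalization may differ; the function names and example trees are hypothetical.

```python
def robinson_foulds(clusters1, clusters2):
    """Number of clusters found in exactly one of the two trees."""
    return len(set(clusters1) ^ set(clusters2))

def compatible(x, y):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return x.isdisjoint(y) or x <= y or y <= x

def crossing_count(clusters1, clusters2):
    """Number of cluster pairs, one from each tree, that are incompatible."""
    return sum(1 for x in clusters1 for y in clusters2 if not compatible(x, y))

# Non-trivial clusters of ((a,b),(c,d)) and ((a,c),(b,d)).
tree1 = [frozenset("ab"), frozenset("cd")]
tree2 = [frozenset("ac"), frozenset("bd")]
print(robinson_foulds(tree1, tree2), crossing_count(tree1, tree2))
print(robinson_foulds(tree1, tree1), crossing_count(tree1, tree1))
```

The double loop over clusters is quadratic in the number of clusters, before accounting for the cost of each set comparison.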

Present Y chromosomes support the Persian ancestry of Sayyid Ajjal Shams al-Din Omar and Eminent Navigator Zheng He

Chuan-Chao Wang, Ling-Xiang Wang, Manfei Zhang, Dali Yao, Li Jin, Hui Li
(Submitted on 21 Oct 2013)

Sayyid Ajjal is the ancestor of many Muslims in areas all across China. One of his descendants is the famous Ming Dynasty navigator Zheng He, who led the largest armada in the 15th-century world. The origin of Sayyid Ajjal’s family remains unclear, although many studies have been done on this topic of Muslim history. In this paper, we studied the Y chromosomes of his present descendants, and found they all have haplogroup L1a-M76, proving a southern Persian origin.

Thoughts on: Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

I (@joe_pickrell) was recently asked to review a preprint by Decker et al., Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle for a journal. Below are the comments I sent the journal.

In this paper, the authors apply a suite of population genetics analyses to a set of cattle breeds. The basic data consists of around 1,500 individuals from 143 breeds typed at around 40,000 SNPs. The authors use this data to build population trees/graphs using TreeMix and visualize population structure with PCA/ADMIXTURE. They then interpret the results of these programs in light of their knowledge of the history of cattle domestication. I had no knowledge of cattle history prior to reading this manuscript, so I enjoyed reading it. I first have a few comments on the manuscript as a whole, then on individual points.

Overall comments:

1. A lot of interpretation depends on the robustness of the inferred population graph from TreeMix. It would be extremely helpful to see that the estimated graph is consistent across different random starting points. The authors could run TreeMix, say, five different times, and compare the results across runs. I expect that many of the inferred migration edges will be consistent, but a subset will not. It’s probably most interesting to focus interpretation on the edges that are consistent.

2. Throughout the manuscript, inference from genetics is mixed in with evidence from other sources. At times it becomes unclear which points are made strictly from genetics and which are not. For example, the authors write, “Anatolian breeds are admixed between European, African, and Asian cattle, and do not represent the populations originally domesticated in the region”. It seems possible that the first part of that statement (about admixture) could be their conclusion from the genetic data, but it’s difficult to make the second statement (about the original populations in the region) from genetics, so presumably this is based on other sources. In general, I would suggest splitting the results internal to this paper apart from the other statements and making a clear firewall between their results and the historical interpretation of the results. Right now the authors have a “Results and Discussion” section; it might be easiest to do this by splitting the “Results” from the “Discussion”, but this is up to the authors.

3. Related to the above point, could the authors add subsection headings to the results/discussion section? Right now the topic of the paper jumps around considerably from paragraph to paragraph, and at points I had difficulty following. One possibility would be to organize subheadings by the claims made in the abstract, e.g. “Cline of indicine introgression into Africa”, “wild African auroch ancestry”, etc.
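The robustness check suggested in point 1 could be tallied along these lines. This is a minimal Python sketch that assumes the migration edges from each run have already been extracted and expressed as (source, target) pairs of comparable labels. In practice TreeMix edges often attach to internal branches, so matching them across runs takes more care than shown here. The population labels and function names are hypothetical.

```python
from collections import Counter

def edge_support(runs):
    """Count in how many runs each migration edge (source, target) appears."""
    counts = Counter()
    for edges in runs:
        counts.update(set(edges))  # set(): count each edge at most once per run
    return counts

def consistent_edges(runs, min_runs):
    """Edges that appear in at least min_runs of the runs, sorted for stable output."""
    support = edge_support(runs)
    return sorted(edge for edge, n in support.items() if n >= min_runs)

# Hypothetical migration edges from three runs with different random seeds.
runs = [
    [("popA", "popB"), ("popC", "popD")],
    [("popA", "popB"), ("popE", "popD")],
    [("popA", "popB"), ("popC", "popD")],
]
print(consistent_edges(runs, min_runs=3))  # edges found in every run
print(consistent_edges(runs, min_runs=2))  # edges found in at least two runs
```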

Specific comments:

There are quite a few results claimed in this paper, so I’m going to split my comments apart by the results reported in the abstract. As mentioned above, it would be nice if the authors clearly stated exactly which pieces of evidence they view as supporting each of these, perhaps in subheadings in the Results section. Each relevant sentence from the abstract is quoted below, followed by my thoughts:

Using 19 breeds, we map the cline of indicine introgression into Africa.

This claim is based on interpretation of the ADMIXTURE plot in Figure 5. I wonder if a map might make this point more clearly than Figure 5, however; the three-letter population labels in Figure 5 are not very easy to read, especially since most readers will have no knowledge of the geographic locations of these breeds.

We infer that African taurine possess a large portion of wild African auroch ancestry, causing their divergence from Eurasian taurine.

This claim appears to be largely based on the interpretation of the treemix plot in Figure 4. This figure shows an admixture edge from the ancestors of the European breeds into the African breeds. As noted above, it seems important that this migration edge be robust across different treemix runs. Also, labeling this ancestry as “wild African auroch ancestry” seems like an interpretation of the data rather than something that has been explicitly tested, since the authors don’t have wild African aurochs in their data.

Additionally, the authors claim that this result shows “there was not a third domestication process, rather there was a single origin of domesticated taurine…”. I may be missing something, but it seems that genetic data cannot distinguish whether a population was “domesticated” or “wild”. That is, it seems plausible that the source population tentatively identified in Figure 4 may have been independently domesticated. There may be other sources of evidence that refute this interpretation, but this is another example of where it would be useful to have a firewall between the genetic results and the interpretation in light of other evidence. The speculation about the role of disease resistance in introgression is similarly not based on evidence from this paper and should probably be set apart.

We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia.

The cline of taurine/indicine hybridization is based on interpretation of ADMIXTURE plots and some follow-up f4 statistics. I found this difficult to follow, especially since a significant f4 statistic can have multiple interpretations. Perhaps the authors could draw out the proposed phylogeny for these breeds and explain the reasons they chose particular f4 statistics to highlight.
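For readers unfamiliar with the statistic, f4(A, B; C, D) is the average over SNPs of (a - b)(c - d), where a, b, c, d are the allele frequencies in the four populations. The Python sketch below is the naive estimator on hypothetical frequencies, with no standard errors. A real analysis would typically use a block jackknife to assess significance.

```python
def f4(a, b, c, d):
    """Naive f4(A, B; C, D): mean over SNPs of (a - b) * (c - d)."""
    n = len(a)
    return sum((a[i] - b[i]) * (c[i] - d[i]) for i in range(n)) / n

# Hypothetical allele frequencies at two SNPs in four populations.
pop_a = [0.2, 0.6]
pop_b = [0.4, 0.2]
pop_c = [0.1, 0.9]
pop_d = [0.3, 0.5]
print(f4(pop_a, pop_b, pop_c, pop_d))

# If C and D have identical frequencies, the statistic is zero.
print(f4(pop_a, pop_b, pop_c, pop_c))
```

A non-zero value says only that the tree ((A,B),(C,D)) without admixture does not fit. It does not say which of the four populations is responsible, which is why the choice of statistics needs explaining.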

We also identify the influence of species other than Bos taurus in the formation of Asian breeds.

The conclusion that species other than Bos taurus have introgressed into Asian breeds seems to be based on interpretation of branch lengths in the trees in Figures 2-3 and some f3 statistics. The interpretation of branch lengths is extremely weak evidence for introgression, probably not even worth mentioning. The f3 statistics are potentially quite informative though. For the breeds in question (Brebes and Madura), which pairs of populations give the most negative f3 statistics? This is difficult information to extract from Supplementary Table 2, where the populations appear to be sorted alphabetically. A table showing the (for example) five most negative f3 statistics could be quite useful here. In general, if the SNP ascertainment scheme is not extremely complicated (can the authors describe the ascertainment scheme for this array?), a negative f3 statistic is very strong evidence that a target population is admixed, while a significant f4 statistic only means that at least one of the four populations in the statistic is admixed. This might be a useful property for the authors.
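The kind of table suggested here could be produced roughly as follows. This Python sketch uses the naive estimator f3(C; A, B) = mean over SNPs of (c - a)(c - b) on population allele frequencies. It omits the finite-sample correction and the block-jackknife standard errors a real analysis would need, and the population names and frequencies are hypothetical.

```python
from itertools import combinations

def f3(c, a, b):
    """Naive f3(C; A, B): mean over SNPs of (c - a) * (c - b)."""
    n = len(c)
    return sum((c[i] - a[i]) * (c[i] - b[i]) for i in range(n)) / n

def most_negative_f3(target, freqs, k=5):
    """The k reference pairs giving the lowest f3 for the target, as (f3, A, B) tuples."""
    others = sorted(pop for pop in freqs if pop != target)
    results = [
        (f3(freqs[target], freqs[a], freqs[b]), a, b)
        for a, b in combinations(others, 2)
    ]
    results.sort()
    return results[:k]

# Hypothetical allele frequencies; "mixed" is an equal mixture of src1 and src2.
freqs = {
    "src1": [0.1, 0.9, 0.2, 0.8],
    "src2": [0.9, 0.1, 0.6, 0.4],
    "outgroup": [0.5, 0.5, 0.1, 0.3],
    "mixed": [0.5, 0.5, 0.4, 0.6],
}
for value, a, b in most_negative_f3("mixed", freqs, k=3):
    print(f"f3(mixed; {a}, {b}) = {value:.3f}")
```

Sorting by the statistic rather than alphabetically puts the most informative reference pairs at the top.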

We detect the pronounced influence of Shorthorn cattle in the formation of European breeds.

This conclusion appears to be based on interpretation of ADMIXTURE plots in Figures S6-S9. Interpreting these types of plots is notoriously difficult. I wonder if the f3 statistics might be useful here: do the authors get negative f3 statistics in the populations they write “share ancestry with Shorthorn cattle” when using the Durham shorthorns as one reference?

Iberian and Italian cattle possess introgression from African taurine.

This conclusion is based on ADMIXTURE plots and treemix; it would be interesting to see the results from f3 statistics as well.

American Criollo cattle are shown to be of Iberian, and not African, descent.

I found this difficult to follow: the authors write that these breeds “derive 7.5% of their ancestry from African taurine introgression”, so presumably they are in fact partially of African descent?

Indicine introgression into American cattle occurred in the Americas, and not Europe

This conclusion seems difficult to make from genetic data. The authors identify “indicine” ancestry in American cattle, so I don’t see how they can determine whether this happened before or after a migration without temporal information. It would be helpful if the authors walk the reader through each logical step they’re making so that the reader can decide whether they believe the evidence for each step.