Ruriko Yoshida, Kenji Fukumizu
(Submitted on 26 Jun 2015)
Theory and empirical evidence clearly indicate that phylogenies (trees) of different genes (loci) should not display precisely matched topologies. The main reason for such phylogenetic incongruence is the reticulated evolutionary history of most species, due to meiotic sexual recombination in eukaryotes or horizontal transfer of genetic material in prokaryotes. Nevertheless, most genes should display topologically related phylogenies, and should group into one or more (for genetic hybrids) clusters in the “tree space.” In this paper we propose to apply the normalized-cut (Ncut) clustering algorithm to the set of gene trees, using the geodesic distance between trees in the Billera-Holmes-Vogtmann (BHV) tree space. We first show on simulated data sets that the Ncut algorithm accurately clusters the set of gene trees given a species tree under the coalescent process, and that it works better on gene trees reconstructed via the neighbor-joining method than on those reconstructed via the maximum likelihood estimator under evolutionary models. Moreover, we apply the methods to a genome-wide data set (1290 genes encoding 690,838 amino acid residues) on coelacanths, lungfishes, and tetrapods. The result suggests that there are two clusters in the data set. Finally, we reconstruct the consensus trees from these two clusters: the consensus tree constructed from one cluster has a topology in which coelacanths are most closely related to the tetrapods, and the consensus tree from the other includes an irresolvable trichotomy over the coelacanth, lungfish, and tetrapod lineages, suggesting divergence within a very short time interval.
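To make the clustering step concrete, the following is a minimal sketch of a two-way normalized cut applied to a precomputed matrix of pairwise tree distances. It is an illustration under stated assumptions, not the authors' implementation: the function name `ncut_bipartition`, the Gaussian kernel used to turn distances into affinities, the bandwidth `sigma`, and the sign threshold on the second eigenvector are all choices made here, and the BHV geodesic distances themselves are assumed to have been computed elsewhere.

```python
import numpy as np


def ncut_bipartition(dist, sigma=None):
    """Split items into two clusters by a spectral relaxation of the normalized cut.

    dist  : (n, n) symmetric matrix of pairwise distances (assumed precomputed,
            e.g. geodesic distances between gene trees).
    sigma : Gaussian kernel bandwidth (an assumption of this sketch); defaults
            to the median off-diagonal distance.
    """
    dist = np.asarray(dist, dtype=float)
    if sigma is None:
        sigma = np.median(dist[np.triu_indices_from(dist, k=1)])

    # Turn distances into affinities; no self-loops.
    w = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    l_sym = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second-smallest eigenvalue gives the relaxed cut indicator.
    _, vecs = np.linalg.eigh(l_sym)
    y = d_inv_sqrt * vecs[:, 1]

    # Threshold at zero (one simple choice among several possible split points).
    return (y > 0).astype(int)
```

For a quick check, a distance matrix built from two well-separated groups of points should be split along the group boundary; which group receives label 0 or 1 is arbitrary, because the sign of an eigenvector is not determined.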