Huw A. Ogilvie, Joseph Heled, Dong Xie, Alexei J. Drummond
(Submitted on 22 Jun 2015)
Under the multispecies coalescent model of molecular evolution, gene trees evolve within a species tree and follow predicted distributions of topologies and coalescent times. In comparison, supermatrix concatenation methods assume that gene trees share a common history and equate gene coalescence with species divergence. The multispecies coalescent is supported by previous studies, which found that its predicted distributions fit empirical data and that concatenation is not a consistent estimator of the species tree. *BEAST, a fully Bayesian implementation of the multispecies coalescent, is popular but computationally intensive, so the advent of large phylogenomic data sets is both a computational challenge and an opportunity for better systematics. Using simulation studies, we characterise the scaling behaviour of *BEAST and enable quantitative prediction of the impact that increasing the number of loci has on both computational performance and statistical accuracy. Follow-up simulations over a wide range of parameters show that the statistical performance of *BEAST relative to concatenation improves both as branch length is reduced and as the number of loci is increased. Finally, using simulations based on estimated parameters from two phylogenomic data sets, we compare the performance of a range of species tree and concatenation methods to show that using *BEAST with a small subset of loci can be preferable to using concatenation with thousands of loci. Our results provide insight into the practicalities of Bayesian species tree estimation, the number of genes required to obtain a given level of accuracy, and the situations in which supermatrix or summary methods will be outperformed by the fully Bayesian multispecies coalescent.
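
The distinction between gene coalescence and species divergence can be illustrated with a minimal sketch. It is not taken from the paper or from *BEAST; it assumes the simplest possible case of one gene copy sampled from each of two sister species, a constant ancestral population size, and time measured in coalescent units, so that the waiting time for two lineages to coalesce in the ancestral population is exponential with mean 1. The function names and the parameter `tau` (the species divergence time) are hypothetical and chosen only for illustration.

```python
import random


def gene_coalescence_time(tau, rng):
    """Sample the coalescence time of two gene copies, one from each sister species.

    Assumption: time is in coalescent units, so the two lineages, which first
    meet in the ancestral population at the species divergence time tau,
    coalesce after an additional exponential waiting time with rate 1.
    """
    return tau + rng.expovariate(1.0)


def mean_gene_coalescence_time(tau, n_loci, seed=0):
    """Average the sampled coalescence time over n_loci independent loci."""
    rng = random.Random(seed)
    total = sum(gene_coalescence_time(tau, rng) for _ in range(n_loci))
    return total / n_loci


if __name__ == "__main__":
    tau = 0.5  # hypothetical species divergence time, in coalescent units
    for n_loci in (10, 100, 10000):
        print(n_loci, mean_gene_coalescence_time(tau, n_loci))
```

In this sketch every sampled gene coalescence is older than the species divergence, and averaging over more loci does not remove the gap: the mean converges to `tau + 1`, not to `tau`. The excess is fixed in coalescent units, so it is proportionally larger when `tau` is small, which matches the abstract's observation that the advantage of the multispecies coalescent over concatenation grows as branch lengths shrink.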