Phylogenetic mixtures and linear invariants for equal input models

Marta Casanellas, Mike Steel

The reconstruction of phylogenetic trees from molecular sequence data relies on modelling site substitutions by a Markov process, or a mixture of such processes. In general, allowing mixed processes can result in different tree topologies becoming indistinguishable from the data, even for infinitely long sequences. However, when the underlying Markov process supports linear phylogenetic invariants, and these are sufficiently informative, the identifiability of the tree topology can be restored. In this paper, we investigate a class of processes that support linear invariants once the stationary distribution is fixed: the ‘equal input model’. This model generalizes the ‘Felsenstein 1981’ model (and thereby the Jukes–Cantor model) from four states to an arbitrary number of states (finite or infinite), and it can also be described by a ‘random cluster’ process. We describe the structure and dimension of the vector space of phylogenetic mixtures (and the complementary space of linear invariants) for any fixed phylogenetic tree, and for all trees (the so-called ‘model invariants’), on any number n of leaves. We also provide a precise description of the space of mixtures and linear invariants for the special case of n=4 leaves. By combining techniques from discrete random processes and (multi-)linear algebra, our results build on a classic result that was first established by James Lake in 1987.
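
As an informal illustration of the equal input model, the following minimal sketch computes the transition matrix along a single edge in the standard parametrization, where the rate of substitution to state j is proportional to its stationary frequency pi[j]. The function name `equal_input_transition` and the argument `rate_time` (the product of the overall rate and the edge length) are assumptions made for this illustration and are not notation from the paper. Under these assumptions, the state is either kept with probability exp(-rate_time) or redrawn from pi, which gives a simple closed form.

```python
import math


def equal_input_transition(pi, rate_time):
    """Transition matrix of an equal input model along one edge (illustrative).

    pi        -- stationary distribution over the states (sums to 1)
    rate_time -- assumed parametrization: overall rate times edge length

    With probability exp(-rate_time) the state is kept; otherwise it is
    redrawn from pi, so P[i][j] = (1 - s) * pi[j] + s * [i == j], s = exp(-rate_time).
    """
    if any(p < 0 for p in pi) or not math.isclose(sum(pi), 1.0):
        raise ValueError("pi must be a probability distribution")
    if rate_time < 0:
        raise ValueError("rate_time must be non-negative")

    stay = math.exp(-rate_time)
    k = len(pi)
    return [
        [(1.0 - stay) * pi[j] + (stay if i == j else 0.0) for j in range(k)]
        for i in range(k)
    ]


if __name__ == "__main__":
    # Four states with unequal frequencies: the Felsenstein 1981 setting.
    for row in equal_input_transition([0.1, 0.2, 0.3, 0.4], 0.5):
        print(["%.4f" % x for x in row])
```

With four states and a uniform pi, this reduces to the Jukes–Cantor transition matrix; with four states and a general pi, it is the Felsenstein 1981 form.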