The evolution of moment generating functions for the Wright Fisher model of population genetics

Tat Dat Tran, Julian Hofrichter, Juergen Jost
(Submitted on 21 Jan 2014)

We derive and apply a partial differential equation for the moment generating function of the Wright-Fisher model of population genetics.

A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data

David Coil, Guillaume Jospin, Aaron E. Darling
(Submitted on 21 Jan 2014)

Motivation: Open-source bacterial genome assembly remains inaccessible to many biologists due to its complexity. Few software solutions exist that are capable of automating all steps in the process of de novo genome assembly from Illumina data.
Results: A5-miseq can produce high-quality microbial genome assemblies on a laptop computer without any parameter tuning. A5-miseq does this by automating the process of adapter trimming, quality filtering, error correction, contig and scaffold generation, and detection of misassemblies. Unlike the original A5 pipeline, A5-miseq can use long reads from the Illumina MiSeq, uses read pairing information during contig generation, and includes several improvements to read trimming. Together these changes result in substantially improved assemblies that recover a more complete set of reference genes than previous methods.
Availability: A5-miseq is licensed under the GPL open source license. Source code and precompiled binaries for Mac OS X 10.6+ and Linux 2.6.15+ are available from this http URL

Coalescence 2.0: a multiple branching of recent theoretical developments and their applications

Aurelien Tellier, Christophe Lemaire
(Submitted on 21 Jan 2014)

Population genetics theory has laid the foundations for genomics analyses including the recent burst in genome scans for selection and statistical inference of past demographic events in many prokaryote, animal and plant species. Identifying SNPs under natural selection and underpinning species adaptation relies on disentangling the respective contribution of random processes (mutation, drift, migration) from that of selection on nucleotide variability. Most theory and statistical tests have been developed using the Kingman coalescent theory based on the Wright-Fisher population model. However, these theoretical models rely on biological and life-history assumptions which may be violated in many prokaryote, fungal, animal or plant species. Recent theoretical developments of the so-called multiple merger coalescent models are reviewed here (Λ-coalescent, beta-coalescent, Bolthausen-Sznitman, Ξ-coalescent). We explain how these new models take into account various pervasive ecological and biological characteristics, life history traits or life cycles which were not accounted for in previous theories, such as 1) the skew in offspring production typical of marine species, 2) fast adapting microparasites (virus, bacteria and fungi) exhibiting large variation in population sizes during epidemics, 3) the peculiar life cycles of fungi and bacteria alternating sexual and asexual cycles, and 4) the high rates of extinction-recolonization in spatially structured populations. We finally discuss the relevance of multiple merger models for the detection of SNPs under selection in these species and for population genomics of very large sample sizes, and suggest that the conclusions of previous population genetics studies may potentially need to be re-examined.

The life cycle of Drosophila orphan genes

Nicola Palmieri, Carolin Kosiol, Christian Schlötterer
(Submitted on 20 Jan 2014)

Orphans are genes restricted to a single phylogenetic lineage and emerge at high rates. While this predicts an accumulation of genes, the gene number has remained remarkably constant through evolution. This paradox has not yet been resolved. Because orphan genes have been mainly analyzed over long evolutionary time scales, orphan loss has remained unexplored. Here we study the patterns of orphan turnover among close relatives in the Drosophila obscura group. We show that orphans are not only emerging at a high rate, but that they are also rapidly lost. Interestingly, recently emerged orphans are more likely to be lost than older ones. Furthermore, highly expressed orphans with a strong male-bias are more likely to be retained. Since both lost and retained orphans show similar evolutionary signatures of functional conservation, we propose that orphan loss is not driven by high rates of sequence evolution, but reflects lineage specific functional requirements.

Demography and the age of rare variants

Iain Mathieson, Gil McVean
(Submitted on 16 Jan 2014)

Recently, large whole-genome sequencing projects have provided access to much of the rare variation in human populations. This variation is highly informative about population structure and recent demography. In this paper, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how this information can detect and quantify historical relationships between populations. We investigate the distribution of the age of f2 variants in a worldwide sample sequenced by the 1,000 Genomes Project, revealing enormous variation across populations. The median age of f2 variants shared within continents is 50 to 160 generations for Europe and Asia, and 170 to 320 generations for Africa. Variants shared between continents are much older with median ages ranging from 320 to 670 generations between Europe and Asia, and 1,000 to 2,400 generations between African and Non-African populations. The distribution of the ages of variants shared across populations is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the signature of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.

Population genomics of Saccharomyces cerevisiae human isolates: passengers, colonizers, invaders.
Carlotta De Filippo, Monica Di Paola, Irene Stefanini, Lisa Rizzetto, Luisa Berná, Matteo Ramazzotti, Leonardo Dapporto, Damariz Rivero, Ivo G Gut, Marta Gut, Mónica Bayés, Jean-Luc Legras, Roberto Viola, Cristina Massi-Benedetti, Antonella De Luca, Luigina Romani, Paolo Lionetti, Duccio Cavalieri

The quest for the ecological niches of Saccharomyces cerevisiae ranged from wineries to oaks and more recently to the gut of Crabro wasps. Here we propose the role of the human gut in shaping S. cerevisiae evolution, presenting the genetic structure of a previously unknown population of yeasts, associated with Crohn's disease, providing evidence for clonal expansion within the human gut. To understand the role of immune function in the human-yeast interaction we classified strains according to their immunomodulatory properties, discovering a set of genetically homogeneous isolates, capable of inducing anti-inflammatory signals via regulatory T cell proliferation, and on the contrary, a positive association between strain mosaicism and ability to elicit inflammatory, IL-17 driven, immune responses. The approach integrating genomics with immune phenotyping showed selection on genes involved in sporulation and cell wall remodeling as central for the evolution of S. cerevisiae Crohn's strains from passengers to commensals to potential pathogens.

Author post: Genome Sequencing Highlights the Dynamic Early History of Dogs

This guest post is by Ilan Gronau on Freedman et al. Genome Sequencing Highlights the Dynamic Early History of Dogs. This preprint was one of the most popular on Haldane’s Sieve last year, and was published yesterday in PLoS Genetics.

Earlier this week, our paper entitled “Genome Sequencing Highlights the Dynamic Early History of Dogs” was published in PLoS Genetics. This paper explores central questions having to do with the origin of domestic canines by sequencing and analyzing the complete genomes of six individuals from carefully selected dog and wolf lineages. We posted an earlier version of this manuscript on ArXiv, and received quite a lot of comments and questions regarding the methodology we employed for demography inference (many through Haldane’s Sieve). This feedback exposed an increasing interest in demography inference methods that utilize small numbers of complete individual genomes, but it also highlighted the need to examine the strengths and weaknesses of the different methods we employed, and how best to combine them to obtain a unified and robust picture of demographic history. We discuss some of these issues in the revised version published this week, but I thought some of the points were worth spelling out a bit more explicitly, which is the purpose of this post. I’d like to thank Adam Siepel, John Novembre, and Adam Freedman for sharing their insights through the process of writing this post.

The Methods

Our study takes advantage of three recently developed demography inference methods:
the Pairwise Sequential Markovian Coalescent (PSMC; Li and Durbin, 2011), the D statistic, or as it is more commonly referred to, the ABBA/BABA test (Durand et al., 2011), and the Generalized Phylogenetic Coalescent Sampler (G-PhoCS; Gronau, et al., 2011). All three methods base their inferences on the genealogical relationships among a relatively small number of individuals, taking advantage of the wealth of genealogical information encoded in individual genomes due to genetic recombination. PSMC, for instance, makes use of the information on changes in coalescent times between the two chromosome copies within a single individual to infer ancestral population sizes. The ABBA/BABA test makes use of asymmetries in genealogies spanning four chromosomes to detect post-divergence gene flow. G-PhoCS jointly considers all individuals using a multi-population coalescent-based demographic model, which includes population divergence times, changes in ancestral population size, and gene flow. A major advantage of G-PhoCS is that it produces a single detailed image of demographic history, inferred using a unified probabilistic model for all individuals. However, in the interest of computational tractability, the method relies on several simplifying modeling assumptions not required by the other methods. In addition, by being constrained to subsets of individuals, the PSMC and ABBA/BABA approaches are free to specialize in capturing particular aspects of demographic history (ancestral Ne and admixture, resp.), which G-PhoCS treats more coarsely. Thus, we found all three methods to be complementary and their combination to be particularly informative about the demographic history of these wild and domestic canids.

Inference of Ancestral Population Sizes

Both PSMC and G-PhoCS provide information about ancestral population sizes (Ne). PSMC is specifically designed for this task, and provides a high-resolution trace of changes in ancestral Ne by separately analyzing each diploid genome. Using PSMC, we could detect sharp declines in Ne for wolves and dogs without making any assumptions regarding the canid phylogeny. However, we found that the traces of ancestral Ne inferred by PSMC should be interpreted with care; simulations we conducted showed that the gradual reduction in Ne inferred by PSMC was also consistent with a more recent severe population bottleneck. Eventually, the coarser model inferred by G-PhoCS, in which shortly after the divergence of dogs and wolves the two ancestral populations suffered severe bottlenecks, ended up fitting the data better than the model implied by the PSMC traces. Thus in our case we found the phylogenetic context to be quite useful for dating the major changes in canid population size. Ideally, it should be possible to directly infer PSMC-style traces along the branches of the population phylogeny (alongside inference of divergence times), but this would involve a fairly major methodological undertaking.

Detection of Admixture and Post Divergence Gene Flow

One of the central findings in our study was that gene flow, particularly between dogs and wolves, played a prominent role in the history of canids. Indeed, we found that several previous claims about dog origins in the Middle East and East Asia were likely influenced by ancient gene flow from wolves to dogs in these regions. The ABBA/BABA test was an obvious choice for a method to detect admixture. It is a fairly simple method, sensitive to even low amounts of gene flow, and robust to assumptions about the demographic history of the populations being tested. By applying this test separately to all sample quartets that include the jackal outgroup, we were able to obtain a good set of candidate ancestral admixture events between dogs and wolves. However, interpreting some of these signals and combining them into a single unified hypothesis was not straightforward, especially since we found signatures for multiple ancestral admixture events among overlapping pairs of populations (e.g., Basenji-Israeli wolf and Boxer-Israeli wolf). G-PhoCS is better suited to deal with this more complex scenario of gene flow, because it jointly analyzes all samples and can consider multiple migration bands in a single analysis. We exploited this feature to find strong evidence of gene flow between wolves and jackals and to show that the signal found for Boxer-Israeli wolf admixture in the ABBA/BABA test was a result of ancestral gene flow from Basenji to Israeli wolf. Still, coming up with a scenario of ancestral gene flow that best fit the data required developing a fairly complex framework for model comparison that involved a combination of multiple separate G-PhoCS runs and comparison with simulated data (see below).
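As a rough illustration of the ABBA/BABA idea described above, the sketch below counts the two discordant site patterns across a quartet (P1, P2, P3, outgroup) and forms D = (nABBA - nBABA) / (nABBA + nBABA). It is a minimal toy version that assumes one aligned haploid sequence per population; the function name and the example sequences are hypothetical, and it omits the significance testing (for example, a block jackknife) that a real analysis would need.

```python
import math


def d_statistic(p1, p2, p3, outgroup):
    """Toy D statistic from four aligned sequences of equal length.

    The outgroup allele is treated as ancestral ("A") and the other
    allele as derived ("B"). Only biallelic sites are counted.
    """
    abba = 0
    baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # skip invariant and multi-allelic sites
        if a == o and b == c:
            abba += 1  # P1 ancestral, P2 and P3 share the derived allele
        elif b == o and a == c:
            baba += 1  # P2 ancestral, P1 and P3 share the derived allele
    if abba + baba == 0:
        return math.nan  # no informative sites
    return (abba - baba) / (abba + baba)


# Hypothetical five-site alignment: two ABBA sites, one BABA site.
example_d = d_statistic("AGAAG", "GAGAG", "GGGAA", "AAAAA")
```

A positive value indicates an excess of derived alleles shared between P2 and P3, and a negative value an excess shared between P1 and P3; under a tree-like history with no gene flow the two patterns are expected to be equally frequent.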

Model Comparison

Addressing subtle questions having to do with the origin of dogs and post-divergence gene flow with wolves required the ability to compare alternative hypotheses for dog domestication in terms of their fit to the data. We did this by considering a collection of plausible topologies for the population phylogeny augmented with different sets of migration bands, and using G-PhoCS to infer demographic parameters for each case. This provided us with a complete demographic model we could associate with each alternative hypothesis and then use to simulate data representing that hypothesis. The hypotheses were then assessed by comparing the simulated data with the real data. While this approach does not constitute a formal model-testing method, it did allow us to explore the space of plausible models in a systematic way and show that the data supports a model with single origin for dogs and that the origin was ancient and similarly distant from all sampled wolf populations.

Future Development

This study allowed us to closely examine recently developed methods for demography inference and ways of combining them to obtain a unified and robust inference of demographic history. While the different methods used in our study were all shown to be quite powerful, particularly when combined, there is obvious room for improvement. In my view, the most promising developments in this field will come from methods (such as G-PhoCS) that capture all major aspects of the demographic history—divergence times, ancestral population sizes and post-divergence gene flow—in a single analysis. The great advantage of such methods is that they provide a framework for rigorous hypothesis testing and model comparison. In principle, the fully Bayesian nature of G-PhoCS enables this quite naturally through the use of Bayes factors for comparison of different sets of model assumptions. Bayes factors are essentially the relative probabilities of different models given the data. Throughout this study, we experimented with various simple ways of estimating Bayes factors based on the data likelihoods of the MCMC samples generated by G-PhoCS, but we were not able to robustly capture the differences in likelihoods of genealogies sampled for the different hypotheses. Solving this important problem will require additional work, but it is definitely within reach. Another important set of extensions involves using richer models that rely on weaker sets of assumptions. This includes modeling recombination, gradual changes in ancestral population sizes, and more realistic models for gene flow. Progress is being made in these directions as well (see, for example, our recent work on ancestral recombination graph inference), and there is much room for optimism that the next generation of demography inference methods, coupled with emerging genomic data sets, will allow researchers an unprecedented capability to investigate the demographic history of additional species.
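For readers less familiar with the term, the standard textbook definition of a Bayes factor can be written as follows. The symbols (D for the data, M_1 and M_2 for two competing demographic models, θ for model parameters) are notation chosen here for illustration and are not taken from the paper.

```latex
\[
K_{12} \;=\; \frac{P(D \mid M_1)}{P(D \mid M_2)},
\qquad
P(D \mid M_i) \;=\; \int P(D \mid \theta, M_i)\, P(\theta \mid M_i)\, d\theta ,
\]
\[
\frac{P(M_1 \mid D)}{P(M_2 \mid D)} \;=\; K_{12}\,\frac{P(M_1)}{P(M_2)} .
\]
```

When the two models are given equal prior probability, the Bayes factor equals the ratio of their posterior probabilities, which is the sense of "relative probabilities of different models given the data" used above. The integral over θ is the marginal likelihood, which is generally the hard quantity to estimate from MCMC samples.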

A C++ template library for efficient forward-time population genetic simulation of large populations

Kevin R. Thornton
(Submitted on 15 Jan 2014)

fwdpp is a C++ library of routines intended to facilitate the development of forward-time simulations under arbitrary mutation and fitness models. The library design provides a combination of speed, low memory overhead, and modeling flexibility not currently available from other forward simulation tools. The library is particularly useful when the simulation of large populations is required, as programs implemented using the library are much more efficient than other available forward simulation programs.

The existence and abundance of ghost ancestors in biparental populations

Simon Gravel, Mike Steel
(Submitted on 15 Jan 2014)

In a randomly-mating biparental population of size N there are, with high probability, individuals who are genealogical ancestors of every extant individual within approximately log2(N) generations into the past. We use this result of Chang to prove a curious corollary under standard models of recombination: there exist, with high probability, individuals within a constant multiple of log2(N) generations into the past who are simultaneously (i) genealogical ancestors of each of the individuals at the present, and (ii) genetic ancestors to none of the individuals at the present. Such ancestral individuals, ancestors of everyone today that left no genetic trace, represent 'ghost' ancestors in a strong sense. In this short note, we use simple analytical arguments and simulations to estimate how many such individuals exist in Wright-Fisher populations.
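The log2(N) result quoted in the abstract can be explored with a small simulation. The sketch below is not the authors' code: it assumes a constant-size population in which every individual draws two parents uniformly and independently from the previous generation (so the two parents may coincide), and it tracks only genealogical ancestry, not the genetic ancestry needed to identify ghost ancestors. The function name is hypothetical.

```python
import random


def generations_to_common_ancestor(n, rng):
    """Generations back until some individual is a genealogical
    ancestor of all n present-day individuals.

    Each individual carries a bitmask of the present-day individuals
    descended from it; bit i stands for present-day individual i.
    """
    everyone = (1 << n) - 1
    descendants = [1 << i for i in range(n)]  # present generation
    generation = 0
    while True:
        generation += 1
        parents = [0] * n
        for child_mask in descendants:
            for _ in range(2):  # two parents, drawn with replacement
                parents[rng.randrange(n)] |= child_mask
        descendants = parents
        if any(mask == everyone for mask in descendants):
            return generation


rng = random.Random(2014)
waiting_time = generations_to_common_ancestor(64, rng)
```

Repeating the call for several values of n and comparing the returned waiting times with log2(n) gives a quick feel for the scaling described in the abstract.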

Author post: Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle

This guest post is by Jared Decker on his preprint (with colleagues) “Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle”, arXived here. The post is a response to the review posted by Joe Pickrell here.

I have posted an updated version of my preprint on arXiv. Because Joe Pickrell posted his review of my preprint “Worldwide Patterns of Ancestry, Divergence, and Admixture in Domesticated Cattle” on Haldane’s Sieve, I thought readers might enjoy seeing my response. I have really enjoyed having the process open to the public.

Reviewer’s comments are in blue.
My comments are in black.
Quotes from the manuscript are in Arial font.

Reviewer #1 [Joe Pickrell]

Overall comments:

1. A lot of interpretation depends on the robustness of the inferred population graph from TreeMix. It would be extremely helpful to see that the estimated graph is consistent across different random starting points. The authors could run TreeMix, say, five different times, and compare the results across runs. I expect that many of the inferred migration edges will be consistent, but a subset will not. It’s probably most interesting to focus interpretation on the edges that are consistent.

We followed Reviewer #1’s recommendation and have included 6 phylogenetic networks (the original network and 5 replicates) as supplementary Figure S4. The admixed histories of several of the sample populations are quite complex, and as seen in Figure S4, the same relationships can be represented multiple ways. For example, if population A is admixed between populations B, C, and D, it can be placed sister to population B with migration edges from C and D, or it can be placed sister to C with migration edges from B and D. We have tried to note in the manuscript when migration edges are not consistent. But one of the main points of the paper, introgression from an ancestral population into the African taurine clade, is consistent across all replicates.

To the third paragraph of the Admixture in Europe subsection we added, “The placement of Italian breeds is not consistent across independent TreeMix runs (Figure S4), likely due to their complicated history of admixture.”

In the second-to-last paragraph of the manuscript we state, “In TreeMix replicates, Texas Longhorn and Romosinuano are either sister to admixed Anatolian breeds or they receive a migration edge that originates near Brahman (Figure S4).”

2. Throughout the manuscript, inference from genetics is mixed in with evidence from other sources. At points it sometimes becomes unclear which points are made strictly from genetics and which are not.

We have edited the manuscript by adding citations to clarify which inference is from genetics and which is from previous studies or breed histories.

For example, the authors write, “Anatolian breeds are admixed between European, African, and Asian cattle, and do not represent the populations originally domesticated in the region”. It seems possible that the first part of that statement (about admixture) could be their conclusion from the genetic data, but it’s difficult to make the second statement (about the original populations in the region) from genetics, so presumably this is based on other sources.

We edited this sentence to say, “Anatolian breeds (AB, EAR, TG, ASY, and SAR) are admixed between blue European-like, grey African-like, and green indicine-like cattle (Figures 5 and 6), and we infer they do not represent the taurine populations originally domesticated in this region due to a history of admixture.”

In general, I would suggest splitting the results internal to this paper apart from the other statements and making a clear firewall between their results and the historical interpretation of the results (right now the authors have a “Results and Discussion” section, but it might be easiest to do this by splitting the “Results” from the “Discussion”. But this is up to the authors.).

The corresponding authors of this manuscript (Decker and Taylor) prefer to have the results and discussion sections combined, so we appreciate Reviewer #1 leaving that decision up to us. But we recognize that he brings up a valid point and have strived to make the distinction between results and discussion clearer throughout the manuscript.

3. Related to the above point, could the authors add subsection headings to the results/discussion section? Right now the topic of the paper jumps around considerably from paragraph to paragraph, and at points I had difficulty following. One possibility would be to organize subheading by the claims made in the abstract, e.g. “Cline of indicine introgression into Africa”, “wild African auroch ancestry”, etc.

Subsection headings have been added.

Specific comments:

There are quite a few results claimed in this paper, so I’m going to split my comments apart by the results reported in the abstract. As mentioned above, it would be nice if the authors clearly stated exactly which pieces of evidence they view as supporting each of these, perhaps in subheadings in the Results section. In italics is the relevant sentence in the abstract, followed by my thoughts:

Using 19 breeds, we map the cline of indicine introgression into Africa.

This claim is based on interpretation of the ADMIXTURE plot in Figure 5. I wonder if a map might make this point more clearly than Figure 5, however; the three-letter population labels in Figure 5 are not very easy to read, especially since most readers will have no knowledge of the geographic locations of these breeds.

Map added as Figure 5, with previous ADMIXTURE figure as Figure 6 so that readers can still see individual breed ancestries.

“We infer that African taurine possess a large portion of wild African auroch ancestry, causing their divergence from Eurasian taurine.”

This claim appears to be largely based on the interpretation of the treemix plot in Figure 4. This figure shows an admixture edge from the ancestors of the European breeds into the African breeds. As noted above, it seems important that this migration edge be robust across different treemix runs. Also, labeling this ancestry as “wild African auroch ancestry” seems like an interpretation of the data rather than something that has been explicitly tested, since the authors don’t have wild African aurochs in their data.

This migration edge is robust across 6 different TreeMix runs. The edge is from a node that is ancestral to European, Asian, and African taurine, and this node is approximately halfway between the common ancestor of domesticated indicine and the common ancestor of domesticated taurine. African auroch are extinct. Most, if not all, bovine ancient DNA samples come from much colder climates than northern Africa. So we are unable to sample African aurochs.

But, we feel it is a strength of the TreeMix analysis to identify introgression from ancestral populations that have not been sampled. We feel the interpretation that the introgression is from African auroch is the most parsimonious explanation of our PCA, ADMIXTURE, and TreeMix results.

Additionally, the authors claim that this result shows “there was not a third domestication process, rather there was a single origin of domesticated taurine”. I may be missing something, but it seems that genetic data cannot distinguish whether a population was “domesticated” or “wild”. That is, it seems plausible that the source population tentatively identified in Figure 4 may have been independently domesticated. There may be other sources of evidence that refute this interpretation, but this is another example of where it would be useful to have a firewall between the genetic results and the interpretation in light of other evidence. The speculation about the role of disease resistance in introgression is similarly not based on evidence from this paper and should probably be set apart.

The claim that there was a single origin of domesticated taurine is based upon the topology of the phylogenetic network, as European, Asian, and African taurine all share a common ancestor, and the Asian clade is sister to the rest of the ingroup. This rules out the possibility of a separate domestication in Africa as a separate domestication would cause African domesticates to be sister to the rest of taurus. Larson and Burger (2013) do not consider admixture a separate domestication, and we choose to follow their definition. Two domestications with the resulting population in Africa a mixture of the two is not very parsimonious. The most parsimonious explanation is admixture from a wild relative.

We agree that we have not tested the influence of trypanosomiasis resistance on driving admixture, but we feel it is an interesting hypothesis that explains the force that drove admixture. We have rephrased the sentence as:

“We hypothesize that the introgression in Africa may have been driven by trypanosomiasis resistance in African auroch which may be the source of resistance in African taurine populations [48].”

“We detect exportation patterns in Asia and identify a cline of Eurasian taurine/indicine hybridization in Asia.”

The cline of taurine/indicine hybridization is based on interpretation of ADMIXTURE plots and some follow-up f4 statistics. I found this difficult to follow, especially since a significant f4 statistic can have multiple interpretations. Perhaps the authors could draw out the proposed phylogeny for these breeds and explain the reasons they chose particular f4 statistics to highlight.

We have added a map figure so that the ADMIXTURE estimates will be easier to interpret in a geographic frame. We also added, “From previous research [3] and Figures 2 and 3, these relationships should be tree-like if there were no admixture. For 53 of the possible 280 tests, the Z-score was more extreme than ±2.575829. The most extreme test statistics were f4(Wagyu, Mongolian; Simmental, Shorthorn) = -0.003 (Z-score = -5.21, other rearrangements of these groups had Z-scores of 7.32 and 16.55) and f4(Hanwoo, Wagyu; Piedmontese, Shorthorn) = 0.002 (Z-score = 4.90, other rearrangements of these groups had Z-scores of 21.79 and 27.77).”

While the f4 statistics do have multiple interpretations, we do feel confident that the ADMIXTURE analysis highlights which interpretation is the most likely.
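The Z-score cutoff of ±2.575829 quoted above matches the critical value of a two-sided test at the 1% level under a standard normal distribution. The manuscript excerpt does not state the significance level, so that reading is an assumption; the short sketch below, with hypothetical function names, only shows how such a cutoff and the corresponding p-values can be computed with the Python standard library.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_critical_z(alpha):
    """Z value whose two-sided tail probability equals alpha."""
    return STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def two_sided_p_value(z):
    """Two-sided p-value for an observed Z-score."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


cutoff = two_sided_critical_z(0.01)  # assumed 1% two-sided level
p_wagyu = two_sided_p_value(-5.21)   # Z-score quoted in the response
```

Rounded to six decimal places, `cutoff` agrees with the threshold in the quoted passage, and the Z-score of -5.21 falls well beyond it.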

“We also identify the influence of species other than Bos taurus in the formation of Asian breeds.”

The conclusion that species other than Bos taurus have introgressed into Asian breeds seems to be based on interpretation of branch lengths in the trees in Figures 2-3 and some f3 statistics. The interpretation of branch lengths is extremely weak evidence for introgression, probably not even worth mentioning. The f3 statistics are potentially quite informative though. For the breeds in question (Brebes and Madura), which pairs of populations give the most negative f3 statistics? This is difficult information to extract from Supplementary Table 2, where the populations appear to be sorted alphabetically. A table showing the (for example) five most negative f3 statistics could be quite useful here.

Supplementary Table 2 has been updated to report the 5 most negative statistics. The Z-scores for Brebes are smaller than -18 and the Z-scores for Madura are smaller than -13. We also note that these results are supported by the ADMIXTURE analysis.

In general, if the SNP ascertainment scheme is not extremely complicated (can the authors describe the ascertainment scheme for this array?), a negative f3 statistic is very strong evidence that a target population is admixed, while a significant f4 statistic only means that at least one of the four populations in the statistic is admixed. This might be a useful property for the authors.

The SNPs were ascertained in multiple ways: they were either SNPs in the reference Hereford animal, or they were discovered from Sanger resequencing of 9 breeds or from reduced representation sequencing of Angus, Holstein, or a pool of breeds. Most of the SNPs were ascertained in Hereford, Angus, or Holstein.
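To make the f3 and f4 statistics discussed in this exchange concrete, here is a minimal sketch of their basic definitions as averages over SNPs of products of allele-frequency differences: f3(C; A, B) averages (c - a)(c - b), and f4(A, B; C, D) averages (a - b)(c - d). The function names and the toy frequencies are hypothetical, and the sketch deliberately omits the finite-sample bias corrections and the block jackknife used to obtain the Z-scores reported in the manuscript.

```python
def f3(target, ref_a, ref_b):
    """Naive f3(target; ref_a, ref_b) from per-SNP allele frequencies."""
    products = [(c - a) * (c - b) for c, a, b in zip(target, ref_a, ref_b)]
    return sum(products) / len(products)


def f4(pop_a, pop_b, pop_c, pop_d):
    """Naive f4(pop_a, pop_b; pop_c, pop_d) from per-SNP allele frequencies."""
    products = [
        (a - b) * (c - d)
        for a, b, c, d in zip(pop_a, pop_b, pop_c, pop_d)
    ]
    return sum(products) / len(products)


# Toy example: the target's frequencies lie between those of the two
# references, as expected for a mixture, so f3 is negative.
source_1 = [0.1, 0.9]
source_2 = [0.9, 0.1]
mixture = [0.5, 0.5]
f3_mixture = f3(mixture, source_1, source_2)
```

In the toy example the target sits exactly halfway between the two references at every SNP, which is why the statistic comes out negative; identical populations in the first pair of an f4 statistic give exactly zero.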

“We detect the pronounced influence of Shorthorn cattle in the formation of European breeds.”

This conclusion appears to be based on interpretation of ADMIXTURE plots in Figures S6-S9. Interpreting these types of plots is notoriously difficult. I wonder if the f3 statistics might be useful here: do the authors get negative f3 statistics in the populations they write “share ancestry with Shorthorn cattle” when using the Durham shorthorns as one reference?

Durham Shorthorn is the ancestral breed of Beef Shorthorn, Milking Shorthorn, and Lincoln Red (reference 30 from the manuscript), and as these are direct relationships (tree-like) we wouldn’t expect significant f-statistics. We added Table S3 to report the negative f3 statistics for Maine Anjou, Santa Gertrudis, and Beefmaster. We suspect Belgian Blue have undergone too much change in allele frequencies due to intense selection and small effective population sizes since admixture to produce significant f3 statistics. We have edited the sentence to say:

“As shown in Figures S6 through S9, Table S3, and from their breed histories [31], many breeds share ancestry with Shorthorn cattle, including Milking Shorthorn, Beef Shorthorn, Lincoln Red, Maine-Anjou, Belgian Blue, Santa Gertrudis, and Beefmaster.”

Charolais and Holstein did not produce significant f3 statistics. Although they did produce significant f4 statistics, we choose to not report these.

“Iberian and Italian cattle possess introgression from African taurine.”

This conclusion is based on ADMIXTURE plots and treemix; it would be interesting to see the results from f3 statistics as well.

We added this as the last paragraph of the Admixture in Europe subsection.

“We also used f-statistics to explore the evidence for African taurine introgression into Spain and Italy. We did not see any significant f3 statistics, but this test may be underpowered because of the low level of introgression. With Italian and Spanish breeds as a sister group and African breeds, including Oulmès Zaer, as the other sister group, we see 321 significant tests out of 1911 possible tests. Of these 321 significant tests, 218 contained Oulmès Zaer. We also calculated f4 statistics with the Spanish breeds as sister and the African taurine breeds as sister (excluding Oulmès Zaer). With this setup, out of the possible 675 tests we only see 1 significant test, f4(Berrenda en Negro, Pirenaica; Lagune, N’Dama (ND2)) = 0.0007, Z-score = 3.064. With Italian cattle as sister and African taurine as sister (excluding Oulmès Zaer), we see 17 significant tests out of 90 possible. Patterson et al. [27] define the f4-ratio as f4(A, O; X, C)/f4(A, O; B, C), where A and B are a sister group, C is sister to (A,B), X is a mixture of B and C, and O is the outgroup. This ratio estimates the ancestry from B, denoted as α. We calculated this ratio using Shorthorn as A, Montbeliard as B, Lagune as C, Morucha as X, and Hariana as O. We chose Shorthorn, Montbeliard, Lagune, and Hariana as they appeared the least admixed in the ADMIXTURE analyses. We chose Morucha because it is solid red with African ancestry in Figure S10. This statistic estimates that Morucha is 91.23% European (α = 0.0180993/0.0198386) and 8.77% African, which is similar to the proportion estimated by TreeMix. The multiple f4 statistics with Italian breeds as sister and African breeds as the opposing sister support African admixture into Italy. The f4-ratio test with Morucha also supports our conclusion of African admixture in Spain.”

We understand that the f4 statistics are not as easily interpreted, but the f4-ratio seems to have a straight-forward interpretation.
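As a quick check of the arithmetic behind the quoted f4-ratio, the following minimal sketch recomputes α from the two f4 values given above. The function and variable names are hypothetical, and the assignment of the numerator to f4(A, O; X, C) and the denominator to f4(A, O; B, C) is an assumption that follows the order of the stated definition.

```python
# Two f4 values quoted in the manuscript text (assumed to be, in order,
# f4(A, O; X, C) and f4(A, O; B, C) with A=Shorthorn, B=Montbeliard,
# C=Lagune, X=Morucha, O=Hariana).
f4_numerator = 0.0180993
f4_denominator = 0.0198386


def f4_ratio(numerator, denominator):
    """Return alpha, the estimated ancestry fraction from population B."""
    return numerator / denominator


alpha = f4_ratio(f4_numerator, f4_denominator)
european_pct = round(100 * alpha, 2)        # ancestry from B (European)
african_pct = round(100 * (1 - alpha), 2)   # remainder, ancestry from C (African)

print(f"alpha = {alpha:.4f}")
print(f"European: {european_pct}%  African: {african_pct}%")
```

The two percentages agree with the 91.23% and 8.77% reported in the text.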

“American Criollo cattle are shown to be of Iberian, and not African, descent.”

I found this difficult to follow; the authors write that these breeds “derive 7.5% of their ancestry from African taurine introgression”, so presumably they are in fact partially of African descent?

We reworded as:

“American Criollo cattle are shown to be imported from Iberia, and not directly from Africa, and African ancestry is inherited via Iberian ancestors.”

“Indicine introgression into American cattle occurred in the Americas, and not Europe”

This conclusion seems difficult to make from genetic data. The authors identify “indicine” ancestry in American cattle, so I don’t see how they can determine whether this happened before or after a migration without temporal information. It would be helpful if the authors walk the reader through each logical step they’re making so that the reader can decide whether they believe the evidence for each step.

We added this sentence:

“To reiterate, Iberian cattle do not have indicine ancestry, American Criollo breeds originated from exportations from Iberia, Brahman cattle were developed in the United States in the 1880’s [31], and American Criollo breeds carry indicine ancestry, and the introgression likely occurred from Brahman cattle.”

Other responses [NB: these are responses to comments from another reviewer, but his/her comments are not printed]:

We have attempted to make the manuscript easier to read for a wider audience, but welcome additional feedback.
NOTE TO BLOG READERS: Please send me your feedback as well! @pop_gen_JED

We have rearranged the nodes of Figure 4 and we believe it is now easier to read.

The positions of the migration edges denote where in time or ancestry the migration occurred. The more basal a migration edge is placed, the earlier in time the migration occurred or the more divergent the source population.

As mentioned above, the placement of the migration edges is meaningful, so we prefer to keep the information displayed in this manner. We have added a brief explanation of TreeMix to the manuscript under the TreeMix analysis paragraph of the Methods section.

The geographic origin of all the populations is given in Table S1. We have edited these two sentences to say,

We find that the Indonesian Brebes (BRE) and Madura (MAD) breeds have significant Bos javanicus (BALI) ancestry demonstrated by the short branch lengths in Figures 2-4, shared ancestry with Bali in ADMIXTURE analyses (light green in Figures S7-S9), and significant f3 statistics (Table S2). The Indonesian Pesisir and Aceh and the Chinese Hainan and Luxi breeds also have Bali ancestry (migration edge c in Figure 4, and light green in Figures S8 and S9).

We agree that the reference to Murray adds confusion and have deleted these references from the manuscript.

We add “previously suggested” to this statement to identify that these two waves have previously been inferred in the literature from archeology and genetics. We also feel that the use of “possibly” suggests that this is an interpretation and not a concrete result. With regard to the evidence to support our interpretation, we see two analyses, ADMIXTURE and TreeMix, suggesting two clades of indicine introgression.

Durham Shorthorns are the ancestors of Beef Shorthorns, Milking Shorthorns, Lincoln Reds, Belgian Blues, and Maine Anjous. We add a parenthetical with a citation to reference 31 to clarify this.

Table 1 was moved to the supplement.

One of the main assumptions and conclusions of McTavish et al. is that there are no pure taurines in Africa; all cattle in Africa have indicine ancestry. Our results suggest that this is not true and pure taurines do exist in Africa. We have added, “Thus, we conclude that contrary to the assumptions and conclusions of [57] cattle with pure taurine ancestry do exist in Africa.”

Added “The f3 and f4 statistics look for correlations in allele frequencies that are not compatible with a bifurcating tree; these statistics provide support for admixture in the history of the tested populations [26,27].” as the first sentence of the f3 and f4 statistics subsection in the Methods section.
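To make the idea concrete, here is a minimal sketch of the basic f3 and f4 definitions as averages over SNPs of products of allele-frequency differences. The allele frequencies are made up for illustration, the function names are hypothetical, and the sketch omits the finite-sample bias correction and the Z-score estimation that a real analysis relies on, so it is not the procedure used in the manuscript.

```python
def f3(target, ref1, ref2):
    """Basic f3(target; ref1, ref2): mean of (c - a) * (c - b) over SNPs.

    A clearly negative value suggests the target is a mixture of
    populations related to ref1 and ref2.
    """
    products = [(c - a) * (c - b) for c, a, b in zip(target, ref1, ref2)]
    return sum(products) / len(products)


def f4(pop_a, pop_b, pop_c, pop_d):
    """Basic f4(A, B; C, D): mean of (a - b) * (c - d) over SNPs.

    For a tree ((A, B), (C, D)) without admixture the expectation is zero.
    """
    products = [(a - b) * (c - d)
                for a, b, c, d in zip(pop_a, pop_b, pop_c, pop_d)]
    return sum(products) / len(products)


# Hypothetical allele frequencies at five SNPs (illustration only).
freq_a = [0.10, 0.80, 0.30, 0.60, 0.95]
freq_b = [0.70, 0.20, 0.50, 0.10, 0.40]
# A population formed as an equal mixture of A and B, with no drift afterwards.
freq_mix = [0.5 * a + 0.5 * b for a, b in zip(freq_a, freq_b)]

f3_mix = f3(freq_mix, freq_a, freq_b)
print("f3(mix; A, B) =", f3_mix)
```

In this toy case each per-SNP product equals -0.25 times the squared frequency difference between A and B, so f3 is negative by construction. Drift in the admixed population after the mixture event pushes f3 upward, which is one way a truly admixed breed can fail to give a significant negative f3.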

If cattle had been separately domesticated in Africa, they would be the most divergent taurine clade. However, TreeMix finds, independent of user intervention, that the best model for the relationship between indicine, Asian taurine, African taurine, and European taurine has indicine as the outgroup, European and African taurine as sister groups, and Asian taurine as the most divergent taurine group, i.e., (indicine,(Asian taurine,(African taurine, European taurine))). This model also includes admixture from an unsampled ancestral population that is approximately equally divergent from taurine and indicine. Our sampling is quite extensive and covers populations across Europe and Africa, but we are unable to sample African aurochs as they are extinct. Rather than separate domestication and admixture being indistinguishable, the gene frequencies suggest that there was introgression into African domesticated taurine from an ancestral population. We strongly feel the most parsimonious explanation is introgression from African aurochs.

From Stock and Gifford-Gonzalez 2013, “The central fact around which disparate speculations about the origins of African cattle turn is one upon which all can also agree: northern Africa was home to wild aurochsen, Bos primigenius, from the Middle Pleistocene onwards (Linseele 2004).” We have added citations to Stock and Gifford-Gonzalez 2013 and Linseele 2004 to our manuscript.

Other changes:

Changed “elucidate” to “reveal” in Author Summary.

Second paragraph of TreeMix subsection of Methods section: Changed “migration rate” to “migration proportion”.

Results section, Worldwide patterns subsection, 2nd paragraph, 6th sentence: Changed “(Figure 4)” to “(Figures 4 and 5, discussed in detail in the following subsections)”.

Results section, Divergence within the taurine lineage subsection, 1st paragraph: Added “We also see some runs of TreeMix placing a migration edge from Chianina cattle to Asian taurines (Figure S4).