Robust estimation of microbial diversity in theory and in practice

Bart Haegeman, Jérôme Hamelin, John Moriarty, Peter Neal, Jonathan Dushoff, Joshua S. Weitz

(Submitted on 15 Feb 2013)

Quantifying diversity is of central importance for the study of structure, function and evolution of microbial communities. The estimation of microbial diversity has received renewed attention with the advent of large-scale metagenomic studies. Here, we consider what the diversity observed in a sample tells us about the diversity of the community being sampled. First, we argue that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions. The reason for this is that sample data do not contain information about the number of rare species in the tail of species abundance distributions. We illustrate the difficulty in comparing species richness estimates by applying Chao’s estimator of species richness to a set of in silico communities: they are ranked incorrectly in the presence of large numbers of rare species. Next, we extend our analysis to a general family of diversity metrics (“Hill diversities”), and construct lower and upper estimates of diversity values consistent with the sample data. The theory generalizes Chao’s estimator, which we retrieve as the lower estimate of species richness. We show that Shannon and Simpson diversity can be robustly estimated for the in silico communities. We analyze nine metagenomic data sets from a wide range of environments, and show that our findings are relevant for empirically-sampled communities. Hence, we recommend the use of Shannon and Simpson diversity rather than species richness in efforts to quantify and compare microbial diversity.

### Like this:

Like Loading...

*Related*

See also tree of life post.

I really enjoyed this paper– it was very clearly written, relevant, and points out some important facts regarding estimation of unseen species. And thank you for posting it on arXiv.

A little question, if you don’t mind. It appears to me that your conclusions do depend significantly on an assumption: changes in the rare species tail do not change the total abundance of rare species. This assumption then forces constraints on the asymptotics of the rare species tail. For instance, it’s not possible to have an arbitrarily large collection of rare species with the same (low) abundance.

I had to keep on reminding myself when I was reading that the p_i’s are not actually out there in the microbial community– the counts are! So, if we add a big fat tail to the abundance distribution, that will decrease the p_i’s for the abundant organisms even though of course their counts are not changing. This, then, will change the shape of the rarefaction curve, etc, etc.

So, my question is, is there any information out there on the tail of the abundance distribution for environments of interest? Do the rare species tails follow the asymptotics you implicitly describe?

Also, would it be possible for you all to include the supplementary material as part of the arXiv PDF? It seems that many are reading it on that server, and it would be nice to have the complete document there.

Thanks!

Erick,

Thanks for your post. Your question about the tail of the abundance distribution goes to the core of the diversity estimation problem. The problem is as follows: the only information we have about the tail is the information contained in the sample data, but the sample data does not tell us much about the rarest species in the community. This is illustrated in Figure 1 of the paper: different communities, three of which are shown in the figure, are consistent with the sample data. All these communities have the same set of abundant species, meaning that the sample data contain accurate information about the abundant species. But the communities have very different rare-species tails; the sample data is not informative enough to tell which tail is more plausible then others.

You wonder whether the tail we are assuming in the reconstructed communities (for example, the tail of the three communities in Figure 1) is realistic. First, as said already, the tails are consistent with the (sample) data we have about the community. The problem is that we cannot really get more information than what the sample data tell us. One could try to look for “typical” features of rare-species tails of microbial communities. If you look at the tails of the largest microbial community samples available, some of which are shown in Figure S3, then you notice that the tails of the sample (!) abundance distribution often resemble a power law (that is, the rank-abundance curve for large rank tends to a straight line on a log-log plot). It could be reasonable to assume that the abundances of unsampled species also lie on this power-law distribution. And in fact, that is exactly the assumption we made in reconstructing the communities in Figure 1. But as you see in the figure, even when assuming a power-law distribution for the tail, there are still very different communities (that is, with very different species richness) consistent with the sample data.

In short, even when assuming more information about the rare-species tail than we strictly have, species richness estimation remains highly inaccurate. An important point to add here is that this statement does not depend on the power-law assumption. For example, if we would have assumed that the rare-species tail is not a straight line but concave-down on a log-log rank-abundance curve (as some of the data sets in Figure S3 suggest), then again we could have reconstructed communities with very different species richness, and all consistent with the sample data. In fact, it is quite easy to construct such communities. If you have one community that is consistent with the sample data, and you change the rare species tail such that the total abundance of rare species does not change, then the resulting community will again be consistent with the sample data. Using this property it is easy to convince oneself that there is really a huge variety of reconstructed communities, all consistent with the sample data. So, contrary to what you write, the statement that changes in the rare species tail do not change the total abundance of rare species is not an assumption on which our conclusions rely. Rather, it is a part of a mathematical property (proven in Text S1) that we use to construct a range of communities consistent with the sample data.

I hope this is helpful for you. Also, I have submitted a new version on the arXiv, containing both main text and supplementary information. It should be available in the next few days.

Thank you for your response. Let me again repeat that I really liked

your paper and that I agree with its main conclusions. However, it

appears that we still don’t quite see eye to eye in this regard.

My central points are twofold:

1. We don’t know what the tail of the abundance distribution looks like

(we agree on this).

2. Tails of the abundance distributions exist that could violate an

assumption of yours (it appears that we don’t agree on this).

> First, as said already, the tails are consistent with the (sample)

> data we have about the community. The problem is that we cannot really

> get more information than what the sample data tell us.

Absolutely.

> One could try to look for “typical” features of rare-species tails of

> microbial communities. If you look at the tails of the largest

> microbial community samples available, some of which are shown in

> Figure S3, then you notice that the tails of the sample (!) abundance

> distribution often resemble a power law (that is, the rank-abundance

> curve for large rank tends to a straight line on a log-log plot). It

> could be reasonable to assume that the abundances of unsampled species

> also lie on this power-law distribution. And in fact, that is exactly

> the assumption we made in reconstructing the communities in Figure 1.

> But as you see in the figure, even when assuming a power-law

> distribution for the tail, there are still very different communities

> (that is, with very different species richness) consistent with the

> sample data.

Er, hm, I’m not sure how this all fits into your argument that the tail

doesn’t matter. First you choose a family of distributions from sample

data, then you say, “even when assuming a power-law distribution for the

tail, there are still very different communities (that is, with very

different species richness) consistent with the sample data.”

The power law is just one of many different families of distributions,

and although I can see that by choosing different parameters for the

power law you can get different “true” distributions that look the same

upon sampling, but it seems a bit premature to conclude that the

collection of rare species cannot change the sample distribution.

> In short, even when assuming more information about the rare-species

> tail than we strictly have, species richness estimation remains highly

> inaccurate. An important point to add here is that this statement does

> not depend on the power-law assumption. For example, if we would have

> assumed that the rare-species tail is not a straight line but

> concave-down on a log-log rank-abundance curve (as some of the data

> sets in Figure S3 suggest), then again we could have reconstructed

> communities with very different species richness, and all consistent

> with the sample data.

It seems to me that we can choose anything that goes down to zero more

quickly than a power law. However, what if it does not?

> In fact, it is quite easy to construct such communities. If you have

> one community that is consistent with the sample data, and you change

> the rare species tail such that the total abundance of rare species

> does not change, then the resulting community will again be consistent

> with the sample data. Using this property it is easy to convince

> oneself that there is really a huge variety of reconstructed

> communities, all consistent with the sample data.

Now I think we are really getting to the core of the misunderstanding.

In my first post, I was pointing out the centrality of your assumption

that: *changes in the rare species tail do not change the total

abundance of rare species.* It seems to me that you are just reinforcing

the centrality of that assumption.

I absolutely agree that there are is a huge variety of distributions

that are consistent with the sample data.

> So, contrary to what you write, the statement that changes in the rare

> species tail do not change the total abundance of rare species is not

> an assumption on which our conclusions rely. Rather, it is a part of a

> mathematical property (proven in Text S1) that we use to construct a

> range of communities consistent with the sample data.

In my reading of Text S1 it appears to me that the p_i’s for the

non-rare species are considered fixed. This again appears to me to be

using the assumption that the total abundance of rare species does not

change.

Let’s say the total count of organisms from the non-rare species is x,

and the total count of organisms from the rare species is y. If there is

a given non-rare species with count z, then the corresponding p_i is

z/(x+y). If we double the count of all of the rare species, the

corresponding p_i will become z/(x+2y). These are not the same.

> I hope this is helpful for you. Also, I have submitted a new version

> on the arXiv, containing both main text and supplementary information.

> It should be available in the next few days.

Excellent, thank you.

I’m enjoying the fact that this discussion is happening on an open blog

supported by the community and that folks are reading your paper on the

arXiv. I hope this is a glimmer of the future.

Thanks for your comments Erick, and for your support of Haldane’s sieve.

Erick,

Here is another attempt at clarification.

First, I think we are all in agreement that changing the abundance of each of the rare species while having more/less rare species (so that the total abundance of the rare species does not change) would not affect the rarefaction curve. From this it is clear that many community species abundance distributions (with different total number of species) are compatible with sample data.

Second, changing the abundance of each of the rare species in the way you suggest could in fact change the rarefaction curve. The important thing to note that our version of rarity has to do with the sample. As we say in the SI,

“We define a species to be rare if its relative abundance is much smaller than 1/M where M is the number of individuals in the sample. This means that a rare species is unlikely to be present in the sample (of size M). For concreteness we say that species i is rare if p_i is less than or equal to 1/(50*M). Note that our definition of rarity depends on the sample size M. The choice of a threshold for rarity is arbitrary, though our results are robust to changes in the constant (which in this case has been set to 50) so long as it is much greater than 1.”

We are relying on our sample to inform us about the community. Hence, the key point is that our sample size implicitly bounds the total relative density of rare species that, by definition, are unlikely to be observed in our sample.

Best,

Joshua Weitz (with input from Jonathan Dushoff and Bart Haegeman)

Josh, I’m glad we are making progress here.

First paragraph: agreed!

For the rest, I think the following is the source of my confusion. Your

criterion for rarity bounds an individual p_i in terms of M, which I

would call “the relative density of a species.” However, this sentence

discusses the “total relative density”:

> We are relying on our sample to inform us about the community. Hence,

> the key point is that our sample size implicitly bounds the total

> relative density of rare species that, by definition, are unlikely to

> be observed in our sample.

In contrast, the “total relative density” seems to me as being the sum

of p_i for all rare i. This is not bounded by your formula involving the

sample size if the number of rare species is allowed to vary.

As an aside, when the sample size is fixed one still has lots of “wiggle

room” in the bound; thus one could also, say, double the frequency of

each rare species without changing classifications into rare and not

rare if your rare species are indeed quite rare. However, this could

have a substantial effect on the p_i’s if the rare species make up a

significant fraction of the total individual count.

I agree with you that your version of rarity has to do with the sample

size, but I don’t see how that resolves things here. Thanks for your

help.

Erick:

Thanks for the feedback. It’s great to feel people are interested, and these sorts of questions will help us explain better for the future.

I think the missing piece we haven’t explained well is _conceptually_ how we bound the _total_ density of rare species.

The details are in the math, but the basic idea is that the total density of rare species is constrained by the number of singletons we see in our sample. Because of definition of rare means that we are very unlikely to see any rare species twice, we don’t expect the _total_ density of to exceed (roughly) the proportion of singletons in our sample.

The other component (which I think you get) is that the metrics which are estimatable are relatively insensitive to the distribution of rare species, as long as these don’t take up too much of the total density of the population. In contrast, richness and other non-estimatable metrics can be very sensitive.

@jd_mathbio

TYPO:

Because of definition of rare means that we are very unlikely to see any rare species twice, we don’t expect the _total_ density of to exceed (roughly) the proportion of singletons in our sample.

SHOULD READ:

Because [[of->our]] definition of rare means that we are very unlikely to see any rare species twice, we don’t expect the _total_ density of [[rare species]] to exceed (roughly) the proportion of singletons in our sample.

My apologies.

@jd_mathbio

Jonathan—

Thanks for your reply. I think we are getting close, but we’re not quite

there. I think I just need to connect two concepts.

The only place I can find “singletons” is in (S15), where you are

describing the total relative abundance of unobserved species in the

sample.

(S4) implies that the total relative abundance of the set of rare

species in the community is (as one would expect) the sum of the

relative abundances of all of the rare species, where rare species are

defined in (S3) in terms of the sample size rather than the singletons.

I can see that one might roughly equate the number of rare species with

the number of unseen species assuming your sample is relatively small or

some such. However, I’m still not seeing how something like that would

change the definitions in the previous paragraph, which as I described

above seem to contradict the assumption that changes in the rare species

tail do not change the total abundance of rare species.

We discuss the singleton point right at (S4) (“it is unlikely that a rare species will be observed twice”); sorry if I misled by using the word “singleton”.

We don’t assume that the number of rare species is unchanged by distributional assumptions. Effectively, we _estimate_ the total abundance of rare species, and then show that for some metrics that is sufficient to allow us to estimate diversity.

When you make a distributional assumption for rare species you are free to _assume_ that changes in distribution do or don’t change their total abundance. That’s not what we do: we sidestep this question, by estimating the total independently of distributional assumptions.

Hope that helps, if not, please try to spell out your contradiction more clearly.

@jd_mathbio

Pingback: Most viewed on Haldane’s Sieve: February 2013 | Haldane's Sieve

Thank you for your patience. I’ve looked at this several times now.

> We don’t assume that the number of rare species is unchanged by

> distributional assumptions.

Er, do you mean to say that the _total abundance_ of rare species is

unchanged by distributional assumptions? From the rest of what you write

it seems that this would be the case.

> Effectively, we _estimate_ the total abundance of rare species, and

> then show that for some metrics that is sufficient to allow us to

> estimate diversity.

OK, for the time being I’m thinking about the rarefaction curve. Do you

agree that if we double the total abundance of each of the rare species

and hold the abundance of the rest of the species constant that this

will change the p_i and thus the rarefaction curve, even in the case

where each of the rare species remains rare even after doubling?

> When you make a distributional assumption for rare species you are

> free to _assume_ that changes in distribution do or don’t change their

> total abundance. That’s not what we do: we sidestep this question, by

> estimating the total independently of distributional assumptions.

Yes, however equation (S4) concerns the _actual_ abundance of rare

species. If we change that actual abundance, that will change the p_i

and hence the rarefaction curve.

Thanks!

> Do you agree that if we double the total abundance of each of the rare species and hold the abundance of the rest of the species constant that this will change the p_i and thus the rarefaction curve, even in the case where each of the rare species remains rare even after doubling?

Roughly. I guess you would have to renormalize, because you now have a bigger sample.

> Equation (S4) concerns the _actual_ abundance of rare

species. If we change that actual abundance, that will change the p_i

and hence the rarefaction curve.

Yes. How does that affect our argument? Our sample effectively allows us to robustly estimate the total abundance of rare species. We worry about the different possibilities for the _number_ of species in that group, because we can’t estimate the total number of species in the group. We don’t worry (much) about the different theoretical possibilities for the total abundance of that group, because we can estimate this quantity.

A loose analogy, which I hope will be helpful; apologies if it is misrepresenting what you are trying to say:

If 10% of our sample is singletons, then the total abundance of rare species in the population can’t exceed ~10%. The fact that you can imagine a particular distribution underlying this 10%, and then imagine changing that distribution in such a way that the 10% would become 20%, is not a challenge to our measurement.

>> Do you agree that if we double the total abundance of each of the

>> rare species and hold the abundance of the rest of the species

>> constant that this will change the p_i and thus the rarefaction

>> curve, even in the case where each of the rare species remains rare

>> even after doubling?

> Roughly. I guess you would have to renormalize, because you now have a

> bigger sample.

I’m not sure what you mean by “roughly agree.” Unless you tell me

otherwise, I’ll assume that you mean “yes.”

>> Equation (S4) concerns the _actual_ abundance of rare species. If we

>> change that actual abundance, that will change the p_i and hence the

>> rarefaction curve.

> Yes. How does that affect our argument? Our sample effectively allows

> us to robustly estimate the total abundance of rare species. We worry

> about the different possibilities for the _number_ of species in that

> group, because we can’t estimate the total number of species in the

> group. We don’t worry (much) about the different theoretical

> possibilities for the total abundance of that group, because we can

> estimate this quantity.

I’m confused by “We worry about the different possibilities for the

_number_ of species in that group, because we can’t estimate the total

number of species in the group.”

I’m not making an extraordinary statement, nor am I refuting the basic

premise of your paper. I’m just saying that if the total abundance of

rare species changes, then the p_i’s will change. Figure 4, e.g., is set

up in a way such that the total abundance of rare species does not

change much. So, we don’t see changes in the rarefaction curve.

I can’t find a statement in your paper that clarifies the difference

between _changes in the rare species tail_ and _changes in the rare

species tail that do not change the total abundance of rare species_.

For example, you say “A computation in Supplementary Text S1 shows that

the rarefaction curve up to sample size M is insensitive to the

abundance distribution of species with relative abundance well below

1/M.”

> A loose analogy, which I hope will be helpful; apologies if it is

> misrepresenting what you are trying to say: If 10% of our sample is

> singletons, then the total abundance of rare species in the population

> can’t exceed ~10%.

Is there a typo here? Let’s assume, as I think you are, that all of

these singletons are considered rare in terms of your sampling

definition. Then I would think that the total abundance of rare species

is at _least_ 10% of the population.

> The fact that you can imagine a particular distribution underlying

> this 10%, and then imagine changing that distribution in such a way

> that the 10% would become 20%, is not a challenge to our measurement.

What measurement do you refer to? If you mean “argument”, then I’m not

quite convinced yet.