This is a toy example that Robert used in Royal Statistical Society courses and others introducing Bayesian statistics concepts. It uses a very small imaginary data set, with one discrete variable and one discrete unknown parameter, to illustrate the concepts of likelihood and posterior distribution.
Of course, you need the book to really follow these ideas seriously, so we will just provide the accompanying code here. The idea is that there are known to be 10 snow leopards in a nature reserve, you need to estimate how many might be affected by parasitic worms, and you have three faecal samples, of which one has tested positive. Of course, we assume random sampling from the 10 leopards with equal probability (and a leopard might have contributed more than one poop). We also assume that the test is perfectly accurate with zero false positives or negatives.

The parameter you want to estimate is the number affected out of the 10 in the reserve.
Prior to 2024, the same example was presented as faulty laptops rather than snow leopard poop.
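As a small illustration of this data-generating process (a sketch we've added here, not part of the course materials), you can simulate the sampling for a made-up "true" number of affected leopards:

# simulate the setup: 10 leopards, N_true of them affected,
# 3 faecal samples drawn with replacement, and a perfectly accurate test
set.seed(1)                               # for reproducibility
N_true <- 4                               # made-up "true" number affected (unknown in real life)
affected <- c(rep(1, N_true), rep(0, 10 - N_true))
samples <- sample(affected, size = 3, replace = TRUE)
sum(samples)                              # number of positive samples out of 3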
The likelihood of a value for the unknown parameter given the data \( L(\theta | x )\) is equal to the probability of the data given the parameter \( P(x | \theta) \). With the nice random sampling setup, we can use the binomial probability mass function. Suppose you found 2 out of 3 samples were positive; this is the likelihood of 6 out of 10 truly being positive in the reserve:
> dbinom(2, size=3, prob=6/10)
[1] 0.432
Likewise, if you found 1 out of 3 positive, this would be the likelihood of 8/10 being truly positive:
> dbinom(1, size=3, prob=8/10)
[1] 0.096
We can then make a matrix of all the combinations of data (from 0 to 3) and parameter (from 0 to 10):
allprobs <- matrix(NA, nrow = 11, ncol = 4)
for (N in 0:10) {
  for (n in 0:3) {
    allprobs[N + 1, n + 1] <- dbinom(n, size = 3, prob = N / 10)
  }
}
# (we are looping cell-by-cell to make it really clear)
rownames(allprobs) <- as.character(0:10)
colnames(allprobs) <- as.character(0:3)
which leads to:
> print(allprobs)
0 1 2 3
0 1.000 0.000 0.000 0.000
1 0.729 0.243 0.027 0.001
2 0.512 0.384 0.096 0.008
3 0.343 0.441 0.189 0.027
4 0.216 0.432 0.288 0.064
5 0.125 0.375 0.375 0.125
6 0.064 0.288 0.432 0.216
7 0.027 0.189 0.441 0.343
8 0.008 0.096 0.384 0.512
9 0.001 0.027 0.243 0.729
10 0.000 0.000 0.000 1.000
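For reference, the same matrix can be built without loops (a one-liner we've added; the cell-by-cell loop above is clearer for teaching):

allprobs2 <- outer(0:10, 0:3, function(N, n) dbinom(n, size = 3, prob = N / 10))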
Note that some combinations have probability zero and are therefore impossible. Note also (this is very important) that the probabilities / likelihoods add up to one along the rows but not down the columns. Yet, we want to use it by column. When we look along a row, we are fixing the parameter and we get \( P(x | \theta) \). When we look down a column, we are fixing the data (which is what happens in real life) and we get \( L(\theta | x) \).
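You can check this directly from the matrix (a quick verification we've added, not in the original text):

rowSums(allprobs)   # each row sums to 1: probabilities of the data for a fixed parameter
colSums(allprobs)   # the columns do not sum to 1: likelihoods, not probabilities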
Looking down the column for 1/3 positive data, which parameter value would give the highest probability of data given parameter? This column is the likelihood, and the value of N with the largest likelihood is the maximum likelihood estimate (MLE: a common technique in non-Bayesian stats). We don’t call it “probability” because it doesn’t add up to 1 when you look down the columns.
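If you want R to pick out the MLE for you, one line does it (a small addition using the allprobs matrix above):

which.max(allprobs[, "1"]) - 1   # row with the largest likelihood; subtract 1 to convert the index to N (gives 3)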
Here’s a quick base R graph of the likelihood values for the data 1/3:
plot(0:10, allprobs[, 2], type = 'b',
     xlab = "Parameter (number affected out of 10)", ylab = "Likelihood")
3/10 is the MLE given data of 1/3. To use the likelihood as a probability distribution, we need to “normalise” it so that it adds up to one.
likelihood <- allprobs[, '1']
posterior <- likelihood / sum(likelihood)
In doing so, we enter the Bayesian side of stats* and we can get \( P(\theta | x) \). Then, we can get not only an MLE but also the probability of the number of affected snow leopards being a certain value (just remember that the posterior probability of N/10 lies in the (N+1)th element of the vector):
> posterior[3+1]
        3
0.1781818
which can be interpreted directly as a probability. There is a 17.8% probability that there are actually 3/10 snow leopards affected.
We can also add elements to get the probability of being in some region, like 5 or more affected:
> sum(posterior[6:11])
[1] 0.3939394
which is a 39.4% probability. In the book (Chapter 2) we discuss the communication benefits for decision makers (who may be numerate but not statistically literate).
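The same trick gives all of the “k or more affected” probabilities in one line (a small addition, not in the original post):

rev(cumsum(rev(posterior)))   # P(k or more affected) for k = 0, 1, ..., 10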
You can also summarise the posterior by its mean, median or other quantiles:
sum(posterior * (0:10))                   # mean
which(posterior == max(posterior)) - 1    # mode (subtract 1 to convert the vector index to N)
cumsum(posterior)
# the median is the value of N where the cumsum first exceeds 0.5
# you can also extract other quantiles this way
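For example, the median (and other quantiles) can be pulled out explicitly like this (a minimal sketch using the cumsum above):

post_cdf <- cumsum(posterior)
min(which(post_cdf >= 0.5)) - 1   # smallest N whose cumulative probability reaches 0.5 (the median, 4 here)
min(which(post_cdf >= 0.9)) - 1   # similarly, an upper 90% point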
although with only 11 possible values of the parameter, there’s not much detail to extract.
Hopefully this helps to crystallise ideas about the model for the data (a data-generating process, perhaps), the likelihood (a principled, justifiable loss function) and the posterior (a useful output). We have not included any priors here, which is equivalent to having a flat, constant prior (a sketch of adding a non-flat prior follows the list below). See Chapter 2 of the book for discussion of thrilling topics that extend this, like:
- why some people feel that flat priors should always be used (as a way of avoiding responsibility for their actions)
- why others contend that flat priors are never advisable, or even permissible
- what a prior can represent
- how to get a prior (also see Chapter 6 on elicitation!)
- what some common anti-Bayesian arguments are
- what happens with continuous-valued parameters
- what happens with multiple parameters
- what software and algorithms can practically do this work with real-world problems
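As a tiny illustration of how a non-flat prior would enter this calculation (a sketch with a made-up prior, not an example from the book):

# a made-up prior saying that smaller outbreaks are more plausible than larger ones
prior <- (10:0) / sum(10:0)
# Bayes' rule on a grid: the posterior is proportional to prior times likelihood
posterior_with_prior <- prior * likelihood / sum(prior * likelihood)
round(posterior_with_prior, 3)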
* There is no Bayesian side of stats really. As a matter of fact, it’s all Bayesian. 😉

