
The output from any Bayesian sampling algorithm is a sample from the posterior distribution of the unknowns (parameters, random effects, etc). The statistics we usually report are summaries of this sample. For example, if we want the posterior median, we just report the median of this sample. The simplest form of 95% credible interval extends from the 0.025 quantile to the 0.975 quantile.
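In code, these summaries are just sample quantiles of the draws. A minimal Python sketch, using simulated normal draws as a stand-in for a real posterior sample:

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-in for a posterior sample: 10,000 draws from some distribution
draws = rng.normal(loc=0.1, scale=0.05, size=10_000)

post_median = np.median(draws)  # posterior median
# simplest 95% credible interval: the 0.025 and 0.975 sample quantiles
ci_low, ci_high = np.quantile(draws, [0.025, 0.975])

print(post_median, ci_low, ci_high)
```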
Often, critical decisions are made based on this or other quantiles of the posterior sample. If the sample is too small, these quantiles will be uncertain, simply because your posterior sample is just a sample. The further you go into the “tails” of the distribution, the worse this gets, because there are few draws in the sample that far from the median.
The uncertainty arising from the sample is called Monte Carlo error. There are some formulas to compute a standard error for it, which ought to take into account the fact that some algorithms are prone to producing autocorrelated samples. We will leave you to investigate these if you want to know more about the mathematics; any book on Bayesian statistics that does not avoid theory should cover it.
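For the posterior mean, for instance, the Monte Carlo standard error of independent draws is just the sample standard deviation divided by the square root of the number of draws, and autocorrelation is handled by replacing that number with an effective sample size. A rough sketch of the idea (the effective-sample-size estimator here is a simplified illustration, not the refined versions used by Stan or coda):

```python
import numpy as np

def mcse_mean(x):
    """Monte Carlo standard error of the posterior mean, with a crude
    effective-sample-size correction for autocorrelated draws."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = xc @ xc / n
    # sum positive-lag autocorrelations until they turn non-positive
    # (a crude version of the "initial positive sequence" rule)
    rho_sum = 0.0
    for lag in range(1, n // 2):
        rho = (xc[:-lag] @ xc[lag:] / n) / var
        if rho <= 0:
            break
        rho_sum += rho
    ess = n / (1 + 2 * rho_sum)
    return np.std(x, ddof=1) / np.sqrt(ess)

rng = np.random.default_rng(0)
iid = rng.normal(size=10_000)
print(mcse_mean(iid))  # close to 1 / sqrt(10000) = 0.01 for independent draws
```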
We want to emphasise the fact that, although formulas exist for simple statistics such as the posterior mean, there is no formula for the quantiles. Instead, we can use the bootstrap, which is a general way of doing inference on a statistic. For statisticians, inference means understanding the uncertainty in our estimates. Machine learning people use the word for something else entirely.
There are bootstrap packages and commands / functions in all good statistical software, but to help understanding, we wrote this function in R that goes step-by-step through the process (see below for Python and Stata!). We start with the posterior sample of maybe 10,000 values for an unknown, and resample with replacement. Each resample gives us the next best thing to a new sample. And if we calculate the statistic (quantile) for each of these resamples, it is the next best thing to the true sampling distribution of the quantile (for which there is no exact formula).
There are several good books introducing the bootstrap, and although many people cite the ur-texts by Bradley Efron, we recommend pedagogical writing by Tim Hesterberg, or the more mathematical and wide-ranging book by Davison and Hinkley, Bootstrap Methods and Their Application. Bootstrap also forms part of a wider movement to introduce statistical inference by simulation, and a fun generic undergraduate textbook in this tradition would be Statistics: Unlocking the Power of Data, by the Lock family.
bootstrap_MC_error <- function(x,
                               iter = 10000,
                               quant,
                               show_hist = TRUE) {
  # x: numeric vector of posterior samples for one unknown
  # iter: number of bootstrap resamples
  # quant: quantile to estimate (between 0 and 1)
  # show_hist: whether to display histogram of bootstrap distribution
  n <- length(x)
  bq <- rep(NA, iter)
  obs_q <- quantile(x, probs = quant)
  cat(paste0("Running ", iter, " bootstrap re-samples: \n"))
  pb <- txtProgressBar(min = 1, max = iter, style = 3)
  for (i in 1:iter) {
    setTxtProgressBar(pb, i)
    bx <- sample(x, n, replace = TRUE)
    bq[i] <- quantile(bx, probs = quant)
  }
  close(pb)
  if (show_hist) {
    hist(bq,
         breaks = 50,
         main = "Bootstrap distribution",
         xlab = paste0(quant, " quantile"))
  }
  sd_quantile <- sd(bq)
  cat(paste0("SD of ", quant, " quantile = ", sd_quantile, "\n"))
  return(list(
    sd_quantile = sd_quantile,
    bootstrap_quantiles = bq,
    observed_quantile = obs_q
  ))
}
The loop “for (i in 1:iter)” is where the bootstrap resampling takes place.
Here’s an example of using it on the ACEI vs ARB example dataset with a common effect, contrast-based, asymptotic meta-analysis:
library(cmdstanr)
library(bayesplot)
########### ACEI vs ARB dataset ############
ava <- read.csv("ARBvACEI.csv")
ava$ACEI_odds <- ava$ACEI_AE / (ava$ACEI_n - ava$ACEI_AE)
ava$ARB_odds <- ava$ARB_AE / (ava$ARB_n - ava$ARB_AE)
ava$logor <- log(ava$ACEI_odds / ava$ARB_odds)
ava$se <- sqrt((1 / ava$ACEI_AE) +
               (1 / (ava$ACEI_n - ava$ACEI_AE)) +
               (1 / ava$ARB_AE) +
               (1 / (ava$ARB_n - ava$ARB_AE)))
# drop one study that did not report adverse events
ava <- ava[!is.na(ava$logor),]
######## CE, contrast, binary, asymptotic ########
stancode <- "
data {
  int m;
  array[m] real logor;
  array[m] real<lower=0> se;
}
parameters {
  real theta;
}
model {
  theta ~ normal(0, 3);
  for (j in 1:m) {
    logor[j] ~ normal(theta, se[j]);
  }
}
"
stanmod <- cmdstan_model(write_stan_file(stancode),
                         compile = TRUE)
stanfit <- stanmod$sample(data = list(m = NROW(ava),
                                      logor = ava$logor,
                                      se = ava$se),
                          iter_warmup = 1000,
                          iter_sampling = 5000,
                          chains = 2,
                          parallel_chains = 2)
stanfit$summary()
standraws <- stanfit$draws(variables = "theta",
                           format = "draws_matrix")
bootmc025 <- bootstrap_MC_error(x = as.numeric(standraws),
                                quant = 0.025)
This uses cmdstanr, but BUGS, JAGS, Stata, brms, or PyMC will equally provide you with the posterior sample itself. The pseudo-random number generators in the resampling and in Stan are not seeded here, so you will get slightly different results if you run it yourself. Our run gave an observed (from the original posterior sample) 0.025 quantile of 0.08305, with a Monte Carlo standard error of 0.00189.
How to interpret this? Add and subtract 1.96 standard errors and you will get a 95% bootstrap confidence interval around the quantile. (Yes, the confidence / credible language is confusing. This is an unfortunate legacy of decades of frequentist dominance.) Although we have 10,000 draws in our posterior sample, when we get out to the 0.025 quantile, its 95% confidence interval extends from 0.07934 to 0.08675.
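The arithmetic, with Python as a calculator (any tiny discrepancy with the interval quoted above comes from rounding the quantile and standard error before multiplying):

```python
q_hat = 0.08305   # observed 0.025 quantile from the posterior sample
mc_se = 0.00189   # bootstrap Monte Carlo standard error

# 95% bootstrap confidence interval for the quantile
lower = q_hat - 1.96 * mc_se
upper = q_hat + 1.96 * mc_se
print(round(lower, 5), round(upper, 5))  # roughly 0.07935 and 0.08675
```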
The histogram gives a more nuanced look at the bootstrap estimate of the sampling distribution. Is it skew or extremely lumpy? That might argue against the use of a simple standard error to describe the uncertainty.

Is that wide enough to undermine the results? That depends on your context and how the results are going to be used for decision making. If it is too wide, you should increase the number of posterior draws and re-run the Bayesian meta-analysis.
Here are some ways you could improve on this basic example code:
- check autocorrelation and issue a warning if it is above some threshold
- vectorise / parallelise the bootstrap loop for a little more speed (we don’t think it is prohibitive, but see below for base Python which is very slow)
- add a pseudo-random number generator seed as an argument to the function
- add other popular credible interval calculations such as the highest density interval (see HDInterval::hdi() in R or arviz.hdi() in Python)
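For the first of those suggestions, here is a minimal sketch of a lag-1 autocorrelation check; the 0.1 threshold is an arbitrary illustration, not a recognised standard:

```python
import numpy as np

def check_autocorrelation(x, threshold=0.1):
    """Warn if the lag-1 autocorrelation of the posterior sample exceeds a
    threshold; autocorrelated draws make the bootstrap Monte Carlo error
    optimistic, because resampling treats the draws as independent."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    rho1 = (xc[:-1] @ xc[1:]) / (xc @ xc)
    if abs(rho1) > threshold:
        print(f"Warning: lag-1 autocorrelation is {rho1:.3f}; "
              "consider thinning or a block bootstrap.")
    return rho1

rng = np.random.default_rng(0)
print(check_autocorrelation(rng.normal(size=10_000)))  # near zero for iid draws
```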
Python version
Here is a simple Python translation that, like the R one, avoids extra packages, and uses loops for human-readability. Its purpose is to help you understand the principle.
import random
import statistics
import sys

import numpy as np

def bootstrap_MC_error(x, iter=10000, quant=0.5, show_hist=True):
    n = len(x)
    bq = [None] * iter
    obs_q = np.quantile(x, quant)
    print(f"Running {iter} bootstrap re-samples:")
    for i in range(iter):
        if (i + 1) % max(1, iter // 50) == 0:
            sys.stdout.write(f"\r{(i + 1):>{len(str(iter))}}/{iter} done")
            sys.stdout.flush()
        bx = [random.choice(x) for _ in range(n)]
        bq[i] = np.quantile(bx, quant)
    print("\n")
    sd_quantile = statistics.pstdev(bq)
    if show_hist:
        import matplotlib.pyplot as plt
        plt.hist(bq, bins=50, edgecolor="black")
        plt.title("Bootstrap distribution")
        plt.xlabel(f"{quant} quantile")
        plt.ylabel("Frequency")
        plt.show()
    print(f"SD({quant} quantile) = {sd_quantile}")
    return {
        "sd_quantile": sd_quantile,
        "bootstrap_quantiles": bq,
        "observed_quantile": obs_q
    }
This is fairly easy to understand, but much slower than the R version. Here’s an alternative that avoids the inner loops and also avoids fully sorting the draws in each resample, by picking out the relevant order statistic with np.partition; make chunk_size smaller if you run into memory problems. It runs about 20 times faster.
import numpy as np
import matplotlib.pyplot as plt

def bootstrap_MC_error(x, iter=10000, quant=0.5, chunk_size=1000, show_hist=True):
    x = np.asarray(x)
    n = len(x)
    obs_q = np.quantile(x, quant)
    # index of the order statistic closest to the requested quantile
    k = int(np.floor(quant * n))
    bq_list = []
    # loop over "chunks" to limit memory use
    for start in range(0, iter, chunk_size):
        end = min(start + chunk_size, iter)
        this_chunk = end - start
        # generate bootstrap sample indices for this chunk
        idx = np.random.randint(0, n, size=(this_chunk, n))
        samples = x[idx]
        # find the kth order statistic for each bootstrap sample
        bq_chunk = np.partition(samples, k, axis=1)[:, k]
        bq_list.append(bq_chunk)
    bq_all = np.concatenate(bq_list)
    sd_quantile = bq_all.std(ddof=0)
    if show_hist:
        plt.hist(bq_all, bins=50, edgecolor="black")
        plt.title("Bootstrap distribution")
        plt.xlabel(f"{quant} quantile")
        plt.ylabel("Frequency")
        plt.show()
    print(f"SD({quant} quantile) = {sd_quantile}")
    return {
        "sd_quantile": sd_quantile,
        "bootstrap_quantiles": bq_all,
        "observed_quantile": obs_q
    }
Here is the same example of use, skipping the fitting of the ACEI vs ARB meta-analysis:
bootmc025 = bootstrap_MC_error(standraws, quant=0.025)
Stata version
Stata has some good approximate confidence intervals for centiles built into the centile command, and you could reasonably rely on that instead of bootstrapping.
centile standraws, centile(2.5) cci
This will report a 95% confidence interval. You can convert it into a Monte Carlo standard error if you prefer, by subtracting its bottom from its top and dividing by 3.92. Bootstrapping this in Stata requires the bootstrap prefix to the command:
bootstrap q025 = r(c_1), ///
    reps(10000) ///
    saving("MC_error.dta", replace): ///
    centile standraws, centile(2.5)
This runs more slowly than the base looping R or chunked Python versions, possibly because we can’t stop it calculating confidence intervals at each iteration, but it is tolerable, and faster than the basic Python code. The output shows both the 95% confidence interval and the bootstrap standard error.
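The centile-to-standard-error conversion mentioned above is just the interval width divided by 2 × 1.96 = 3.92. A quick check with Python as a calculator (the interval limits here are hypothetical):

```python
# hypothetical 95% confidence interval limits reported by Stata's centile command
lower, upper = 0.0790, 0.0866

mc_se = (upper - lower) / 3.92  # 3.92 = 2 * 1.96
print(round(mc_se, 5))
```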

