
In this post, we will share some code expanding on section 1.4 and chapters 3 and 4 of our book. Specifically, this relates to the computer icon in the margins of pages 68, 92 and 93.
Our goals are:
- generate simulated data as though from an RCT with two parallel arms and no complications (no regression to the mean, no dropout, etc.), so you can focus on the logic, then summarise the data to get study-level statistics
- generate several such trials, either with or without heterogeneity
- run frequentist/likelihood meta-analyses on them
Repeated simulation of data lets us see how statistical methods really behave: we can keep everything in line with the assumptions, then break the assumptions one at a time and watch what happens. Most people learn a lot this way, partly because it is hands-on, active learning. Statisticians have been doing this for a long time; even before computers, we had slide rules and tables of random numbers! If you want to learn more about simulation in general, we recommend the nice tutorial paper by Morris, White and Crowther.
We will use R for this as a lingua franca. Translating to Python, Stata, etc. is fairly trivial once you understand the logic. You can also just follow along and see what happens without worrying about the coding; that alone will teach you something about inference and meta-analysis.
As always, we keep the code as basic as possible, using only base functions that everyone will recognise. The print() calls ensure that outputs are indeed printed, however you are running R.
1: generate one trial’s data
library(meta)

# ----------------- binary outcome data, one trial ---------------
set.seed(9876)

# control arm:
risk_ctl <- 0.3 # risk of an event in the control arm
odds_ctl <- risk_ctl/(1-risk_ctl)
logodds_ctl <- log(odds_ctl)

# odds ratio:
true_or <- 0.9
true_lor <- log(true_or)

# intervention arm:
odds_int <- odds_ctl * true_or
risk_int <- odds_int / (1 + odds_int)

# print out:
print(paste0("Population control risk: ", risk_ctl))
print(paste0("Population control odds: ", odds_ctl))
print(paste0("Population control log odds: ", logodds_ctl))
print(paste0("Population intervention risk: ", risk_int))
print(paste0("Population log odds ratio: ", true_lor))
print(paste0("Population odds ratio: ", exp(true_lor)))

# simulate data for one parallel-arms trial:
n_int <- n_ctl <- 80 # n in each arm (you can make them different)
sim_one_trial_binary <- data.frame(treat=factor(c(rep(0,n_ctl), rep(1,n_int))),
                                   event=c(rbinom(n_ctl,1,risk_ctl),
                                           rbinom(n_int,1,risk_int)))
counts <- with(sim_one_trial_binary,
               table(treat,event))
study_or <- (counts[1,1]*counts[2,2]) / (counts[1,2]*counts[2,1])
print("Sample odds ratio:")
print(study_or, digits=4)
study_logor <- log(study_or)
study_se <- sqrt((1/counts[1,1]) +
                 (1/counts[2,2]) +
                 (1/counts[1,2]) +
                 (1/counts[2,1]))
study_ci <- c(exp(study_logor - 1.96*study_se),
              exp(study_logor + 1.96*study_se))
print("Sample 95% CI:")
print(study_ci, digits=3)
Here, we are hand-coding all the formulas, as seen in Chapter 1. We use Woolf's formula for the asymptotic standard error of the log odds ratio: the square root of the sum of the reciprocals of the four cell counts.
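To check the arithmetic, here is the same calculation on a made-up 2x2 table (the counts are hypothetical, not from the simulation above):

```r
# hypothetical counts (not from the simulation above):
no_ctl <- 60; ev_ctl <- 20   # control arm: no event, event
no_int <- 62; ev_int <- 18   # intervention arm: no event, event
or <- (no_ctl * ev_int) / (ev_ctl * no_int)        # sample odds ratio
se <- sqrt(1/no_ctl + 1/ev_ctl + 1/no_int + 1/ev_int)  # Woolf SE of log OR
ci <- exp(log(or) + c(-1.96, 1.96) * se)           # 95% CI back on the OR scale
print(or, digits=4)
print(ci, digits=3)
```

Note that the confidence interval is symmetric on the log odds ratio scale, not on the odds ratio scale.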
Our next step is to wrap this in a function so that it can be called repeatedly. We drop all the printing:
# ---------------- wrap this in a function -----------------
onetrial <- function(n_int, n_ctl, risk_ctl, risk_int) {
  sim_one_trial_binary <- data.frame(treat=factor(c(rep(0,n_ctl), rep(1,n_int))),
                                     event=c(rbinom(n_ctl,1,risk_ctl),
                                             rbinom(n_int,1,risk_int)))
  counts <- with(sim_one_trial_binary,
                 table(treat,event))
  study_or <- (counts[1,1]*counts[2,2]) / (counts[1,2]*counts[2,1])
  study_logor <- log(study_or)
  study_se <- sqrt((1/counts[1,1]) +
                   (1/counts[2,2]) +
                   (1/counts[1,2]) +
                   (1/counts[2,1]))
  return(list(d_ctl=counts[1,2],
              d_int=counts[2,2],
              lor=study_logor,
              se=study_se))
}
# Note! This returns log OR and SE(log OR)
onetrial() does all the steps we saw above but gives you back just the study-level statistics for the contrast (counts of events, the log odds ratio and its SE). It's a bit like reading a terse paper written by subject experts: you don't have much control over the inputs, only the number of participants per arm and the risk in the two arms.
2 & 3: simulate a whole evidence base
Now we can call the onetrial() function repeatedly. We will begin with a common effect assumption, so every trial is drawn from a population with the same risks in the control and intervention arms.
# ------------- Now call the function repeatedly under common effect --------------
set.seed(5678)
m <- 25 # number of studies to be simulated
sim_binary_ce <- data.frame(n_ctl=rep(n_ctl,m),
                            n_int=rep(n_int,m),
                            d_ctl=rep(NA_integer_,m),
                            d_int=rep(NA_integer_,m),
                            lor=rep(NA_real_,m),  # log ORs and SEs are doubles
                            se=rep(NA_real_,m))
for(j in 1:m) {
  # named arguments to match the function definition:
  simtrial <- onetrial(n_int=n_int, n_ctl=n_ctl,
                       risk_ctl=risk_ctl, risk_int=risk_int)
  sim_binary_ce$d_ctl[j] <- simtrial$d_ctl
  sim_binary_ce$d_int[j] <- simtrial$d_int
  sim_binary_ce$lor[j] <- simtrial$lor
  sim_binary_ce$se[j] <- simtrial$se
}
ce_freq <- metabin(d_int, n_int, d_ctl, n_ctl,
                   data=sim_binary_ce,
                   sm="OR",
                   common=TRUE, random=FALSE,
                   method="Inverse")
print(summary(ce_freq))
forest(ce_freq)
sim_binary_ce is the data frame that gets populated with the study-level results, and we then run a CE meta-analysis on it using meta::metabin() with inverse-variance weighting.
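For intuition, the common-effect "Inverse" pooling that metabin() performs can be hand-coded in a few lines. Here it is on a small made-up set of study statistics (the numbers are hypothetical):

```r
# hypothetical study-level log ORs and their SEs:
lor <- c(-0.20, 0.05, -0.35, -0.10)
se  <- c(0.30, 0.25, 0.40, 0.20)
w <- 1 / se^2                        # inverse-variance weights
pooled_lor <- sum(w * lor) / sum(w)  # common-effect estimate (weighted mean)
pooled_se  <- sqrt(1 / sum(w))       # SE of the pooled log OR
print(exp(pooled_lor), digits=3)                               # pooled OR
print(exp(pooled_lor + c(-1.96, 1.96) * pooled_se), digits=3)  # 95% CI
```

The pooled estimate always lies between the smallest and largest study estimates, and its SE is smaller than any single study's SE, which is the whole point of pooling.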
If we want heterogeneity, we can draw a study-specific odds ratio for each trial and change the risk in the intervention arm accordingly.
# ------------------- random effects variation ----------------
set.seed(5432)
m <- 25 # number of studies to be simulated
tau <- 0.1 # this is on the log OR scale
# these are study-specific population parameters (m-vectors)
re_or <- exp(rnorm(m,true_lor,tau))
re_odds_int <- odds_ctl * re_or
re_risk_int <- re_odds_int / (1 + re_odds_int)
sim_binary_re <- data.frame(n_ctl=rep(n_ctl,m),
                            n_int=rep(n_int,m),
                            d_ctl=rep(NA_integer_,m),
                            d_int=rep(NA_integer_,m),
                            lor=rep(NA_real_,m),
                            se=rep(NA_real_,m))
# we assume risk_ctl is common to all trials:
for(j in 1:m) {
  simtrial <- onetrial(n_int=n_int, n_ctl=n_ctl,
                       risk_ctl=risk_ctl, risk_int=re_risk_int[j])
  sim_binary_re$d_ctl[j] <- simtrial$d_ctl
  sim_binary_re$d_int[j] <- simtrial$d_int
  sim_binary_re$lor[j] <- simtrial$lor
  sim_binary_re$se[j] <- simtrial$se
}
# this is the correct RE model; you could try CE for comparison...
re_freq <- metabin(d_int, n_int, d_ctl, n_ctl,
                   data=sim_binary_re,
                   sm="OR",
                   common=FALSE, random=TRUE,
                   method="Inverse")
print(summary(re_freq))
forest(re_freq)
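To see what the random-effects machinery is doing, here is a hand-coded DerSimonian-Laird estimate of the between-study variance and the resulting RE pooled estimate, again on made-up study statistics. DerSimonian-Laird is the classical estimator; note that recent versions of meta default to REML, so metabin()'s numbers may differ slightly.

```r
# hypothetical study-level log ORs and SEs:
lor <- c(-0.40, 0.10, -0.25, 0.30, -0.55)
se  <- c(0.25, 0.30, 0.20, 0.35, 0.30)
w <- 1 / se^2
ce <- sum(w * lor) / sum(w)                  # common-effect estimate
Q  <- sum(w * (lor - ce)^2)                  # Cochran's Q statistic
k  <- length(lor)
# DerSimonian-Laird estimate of tau-squared (truncated at zero):
tau2 <- max(0, (Q - (k - 1)) / (sum(w) - sum(w^2) / sum(w)))
w_re <- 1 / (se^2 + tau2)                    # RE weights add tau2 to each variance
re <- sum(w_re * lor) / sum(w_re)            # random-effects pooled estimate
print(c(tau2 = tau2, re_logor = re), digits=3)
```

Because tau2 is added to every study's variance, the RE weights are more even than the CE weights, so small studies get relatively more say.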
We assume here that the heterogeneity is normal on the log OR scale (see Gian Luca’s recent post on this); you can change that in the line:

re_or <- exp(rnorm(m,true_lor,tau))

More ideas
You can elaborate this in a number of ways, for example:
- skewed heterogeneity
- bimodal heterogeneity (subgroup) or outlier study
- publication bias: remove some studies with “boring” results
- introducing a typo into one trial’s stats
- parallelising the study simulation loop with mclapply(), for example
- continuous or count outcomes
- reporting a mixture of stats and requiring conversion
- investigate performance with edge cases when d=0 or d=n
- investigate small m
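As one example from the list above, skewed heterogeneity could be simulated by swapping the normal draw for a recentred gamma on the log OR scale (an illustrative choice of ours, not from the book; the shape parameter controls the skew):

```r
# skewed heterogeneity on the log OR scale (illustrative, not from the post):
set.seed(2468)
m <- 25
true_lor <- log(0.9)
tau <- 0.1       # target SD of the random effects
shape <- 4       # smaller shape = more right skew
# gamma(shape, rate) has mean shape/rate and sd sqrt(shape)/rate;
# choosing rate = sqrt(shape)/tau gives sd = tau, then we recentre on true_lor:
re_lor <- true_lor + rgamma(m, shape = shape, rate = sqrt(shape) / tau) - tau * sqrt(shape)
re_or <- exp(re_lor)
print(summary(re_or))
```

These re_or values can be plugged straight into the random-effects loop above in place of exp(rnorm(m,true_lor,tau)).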
There are R packages designed for just this kind of simulation study, such as simstudy or rsimsum. There is also commercial software for clinical trial simulation, like FACTS (by Berry Consultants) or KerusCloud (by MMS); if simulation interests you professionally, these are worth looking into, as they have many options built in. We mention these as examples, not as an endorsement.
Paired data and correlated data in general (e.g. multiple time points) will be the subject of the next simulation post.

