
In this post, we will share some code expanding on section 1.4 and chapters 3 and 4 of our book. Specifically, this relates to the computer icon in the margins of pages 68, 92 and 93.

Our goals are:

  • generate simulated data as though from a simple RCT with two parallel arms and no complications (no regression to the mean, no dropout and so on), so you can focus on the logic, then summarise the data to get study-level statistics
  • generate several such trials, with or without heterogeneity
  • run frequentist/likelihood meta-analyses on them

Repeated simulation of data allows us to see how statistical methods really behave. We can keep everything in line with the assumptions, and then break the assumptions one at a time to see what happens. Most people learn a lot this way, partly because it is hands-on, active learning. Statisticians have been doing this for a long time; even before we had computers, there were slide rules and tables of random numbers! If you want to learn more about simulation in general, we recommend this nice paper by Morris, White and Crowther.

We will use R for this as a lingua franca. Translation to Python, Stata, etc. is fairly straightforward once you understand the logic. You can also just follow along and see what happens without worrying about the coding; that will still teach you something about inference and meta-analysis.

As always, we keep the code as basic as possible, using only base functions that everyone will recognise. The print() calls ensure that outputs actually appear, however you are running R.

1: generate one trial’s data

library(meta)


# ----------------- binary outcome data, one trial ---------------
set.seed(9876)

# control arm:
risk_ctl <- 0.3  # risk of an event in the control arm
odds_ctl <- risk_ctl/(1-risk_ctl)
logodds_ctl <- log(odds_ctl)

# odds ratio:
true_or <- 0.9  
true_lor <- log(true_or)

# intervention arm:
odds_int <- odds_ctl * true_or
risk_int <- odds_int / (1+ odds_int)

# print out:
print(paste0("Population control risk: ", risk_ctl))
print(paste0("Population control odds: ", odds_ctl))
print(paste0("Population control log odds: ", logodds_ctl))
print(paste0("Population intervention risk: ", risk_int))
print(paste0("Population log odds ratio: ", true_lor))
print(paste0("Population odds ratio: ", exp(true_lor)))

# simulate data for one parallel-arms trial:
n_int <- n_ctl <- 80  # n in each arm (you can make them different)
sim_one_trial_binary <- data.frame(treat=factor(c(rep(0,n_ctl),rep(1,n_int))),
                                   event=c(rbinom(n_ctl,1,risk_ctl),
                                           rbinom(n_int,1,risk_int)))
counts <- with(sim_one_trial_binary,
               table(treat,event))
study_or <- (counts[1,1]*counts[2,2]) / (counts[1,2]*counts[2,1])
print("Sample odds ratio:")
print(study_or, digits=4)
study_logor <- log(study_or)
study_se <- sqrt((1/counts[1,1]) + 
                 (1/counts[2,2]) + 
                 (1/counts[1,2]) + 
                 (1/counts[2,1]))
study_ci <- c(exp(study_logor - 1.96*study_se),
              exp(study_logor + 1.96*study_se))
print("Sample 95% CI:")
print(study_ci, digits=3)

Here, we are hand-coding all the formulas, as seen in Chapter 1. The asymptotic standard error of the log odds ratio is the Woolf formula: the square root of the sum of the reciprocals of the four cell counts.
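As a quick cross-check (our suggestion, not from the book): for a single 2x2 table, logistic regression gives the same maximum-likelihood log odds ratio, and its Wald standard error matches the hand-coded formula.

# optional cross-check via logistic regression
fit <- glm(event ~ treat, family=binomial, data=sim_one_trial_binary)
print(summary(fit)$coefficients["treat1", c("Estimate","Std. Error")], digits=4)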

Our next step is to wrap this in a function, so that it can be called repeatedly. We drop all the printing:


# ---------------- wrap this in a function -----------------

onetrial <- function(n_int, n_ctl, risk_ctl, risk_int) {
	sim_one_trial_binary <- data.frame(treat=factor(c(rep(0,n_ctl),rep(1,n_int))),
	                                   event=c(rbinom(n_ctl,1,risk_ctl),
	                                           rbinom(n_int,1,risk_int)))
	counts <- with(sim_one_trial_binary,
	               table(treat,event))
	# NB: if an arm has zero events (or only events), the table loses a
	# column and the indexing below fails; see "More ideas" for this edge case
	study_or <- (counts[1,1]*counts[2,2]) / (counts[1,2]*counts[2,1])
	study_logor <- log(study_or)
	study_se <- sqrt((1/counts[1,1]) + 
	                 (1/counts[2,2]) + 
	                 (1/counts[1,2]) + 
	                 (1/counts[2,1]))
	return(list(d_ctl=counts[1,2],
	            d_int=counts[2,2],
	            lor=study_logor,
	            se=study_se))
}
# Note! This returns the event counts, the log OR and SE(log OR)

onetrial() will do all the steps we saw above, but give you back just the study stats for the contrast (counts of events, log odds ratio and its SE). It’s a bit like reading a terse paper written by subject experts.

You don’t have much control over the inputs: just the number of participants per arm and the risk in each arm.
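For example, an illustrative call (named arguments make the order of the two ns explicit):

set.seed(1234)  # any seed; just for reproducibility of this example
print(onetrial(n_int=n_int, n_ctl=n_ctl, risk_ctl=risk_ctl, risk_int=risk_int))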

2 & 3: simulate a whole evidence base

Now we can call the onetrial() function repeatedly. We will begin with a common-effect assumption, so every trial is drawn from a population with the same risks in the control and intervention arms.


# ------------- Now call the function repeatedly under common effect --------------

set.seed(5678)
m <- 25  # number of studies to be simulated

sim_binary_ce <- data.frame(n_ctl=rep(n_ctl,m),
                            n_int=rep(n_int,m),
                            d_ctl=rep(NA_integer_,m),
                            d_int=rep(NA_integer_,m),
                            lor=rep(NA_real_,m),  # lor and se are doubles,
                            se=rep(NA_real_,m))   # not integers
for(j in 1:m) {
	# named arguments guard against swapping the two ns
	simtrial <- onetrial(n_int=n_int, n_ctl=n_ctl, risk_ctl=risk_ctl, risk_int=risk_int)
	sim_binary_ce$d_ctl[j] <- simtrial$d_ctl
	sim_binary_ce$d_int[j] <- simtrial$d_int
	sim_binary_ce$lor[j] <- simtrial$lor
	sim_binary_ce$se[j] <- simtrial$se
}


ce_freq <- metabin(d_int, n_int, d_ctl, n_ctl, 
	               data=sim_binary_ce, 
	               sm="OR",
	               common=TRUE, random=FALSE,
	               method="Inverse")
print(summary(ce_freq))
forest(ce_freq)

sim_binary_ce is the data frame that gets populated with the results. We run a CE meta-analysis on it using meta::metabin().
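As a quick sanity check (assuming a recent version of meta, which stores the pooled common-effect estimate in TE.common), you can compare the pooled odds ratio with the population value:

# compare the pooled estimate with the known population value
print(paste0("Pooled OR: ", round(exp(ce_freq$TE.common), 3),
             "; population OR: ", true_or))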

If we want heterogeneity, we can draw a study-specific odds ratio for each trial and change the intervention-arm risk accordingly.


# ------------------- random effects variation ----------------

set.seed(5432)
m <- 25  # number of studies to be simulated
tau <- 0.1  # this is on the log OR scale

# these are study-specific population parameters (m-vectors)
re_or <- exp(rnorm(m,true_lor,tau))
re_odds_int <- odds_ctl * re_or
re_risk_int <- re_odds_int / (1+ re_odds_int)

sim_binary_re <- data.frame(n_ctl=rep(n_ctl,m),
                            n_int=rep(n_int,m),
                            d_ctl=rep(NA_integer_,m),
                            d_int=rep(NA_integer_,m),
                            lor=rep(NA_real_,m),
                            se=rep(NA_real_,m))

# we assume risk_ctl is common to all trials:
for(j in 1:m) {
	simtrial <- onetrial(n_int=n_int, n_ctl=n_ctl, risk_ctl=risk_ctl, risk_int=re_risk_int[j])
	sim_binary_re$d_ctl[j] <- simtrial$d_ctl
	sim_binary_re$d_int[j] <- simtrial$d_int
	sim_binary_re$lor[j] <- simtrial$lor
	sim_binary_re$se[j] <- simtrial$se
}

# this is the correct RE model; you could try CE for comparison...
re_freq <- metabin(d_int, n_int, d_ctl, n_ctl, 
	               data=sim_binary_re, 
	               sm="OR",
	               common=FALSE, random=TRUE,
	               method="Inverse")
print(summary(re_freq))
forest(re_freq)

We assume here that the heterogeneity is normal (see Gian Luca’s recent post on this); you can change that in the line:

re_or <- exp(rnorm(m,true_lor,tau))
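For example, one heavier-tailed alternative (purely our illustration; a scaled t-distribution on the log OR scale):

# heavier-tailed heterogeneity: scaled t instead of normal
re_or <- exp(true_lor + tau * rt(m, df=4))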

More ideas

You can elaborate this in a number of ways, for example:

  • skewed heterogeneity
  • bimodal heterogeneity (subgroups) or an outlier study
  • publication bias: remove some studies with “boring” results (see the sketch after this list)
  • introduce a typo into one trial’s stats
  • parallelise the study simulation loop, for example with mclapply()
  • continuous or count outcomes
  • a mixture of reported stats, requiring conversion
  • investigate performance in edge cases where d=0 or d=n
  • investigate small m
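As an example of the publication bias idea, here is a minimal sketch (our own illustration: “significant” studies are always published, the rest only sometimes):

# crude publication bias: keep significant studies, and only some of the rest
z <- abs(sim_binary_ce$lor / sim_binary_ce$se)
keep <- (z > 1.96) | (runif(m) < 0.3)  # 0.3 is an arbitrary publication probability
sim_binary_pb <- sim_binary_ce[keep, ]
print(paste0(sum(keep), " of ", m, " studies published"))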

There are R packages designed for exactly this kind of simulation work, such as simstudy (for generating data) or rsimsum (for summarising simulation results). There is also commercial software for clinical trials, such as FACTS (by Berry Consultants) or KerusCloud (by MMS). If simulation interests you professionally, you should definitely look into these, as they have many options built in. We mention them as examples; this does not constitute an endorsement.
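For instance, here is a sketch of how rsimsum could summarise the common-effect simulation above, if you have the package installed (lor plays the role of the estimate, se its standard error):

library(rsimsum)
s <- simsum(data=sim_binary_ce, estvarname="lor", true=true_lor, se="se")
print(summary(s))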

Paired data and correlated data in general (e.g. multiple time points) will be the subject of the next simulation post.



One response to “Simulate randomised controlled trials to learn about meta-analysis”

  1. Frequentist summaries of repeated simulation from the same data-generating process include bias, coverage and relative efficiency. These are discussed in the Morris, White & Crowther paper (and elsewhere). I have heard it said (by Reviewer Two) that simulations of Bayesian procedures cannot be summarised by these metrics. They contended that Bayes is about mere personal opinion (ironically), so it can only be appraised using personal opinion. Don’t fall for this nonsense. Any procedure has long-run properties, and they are an insight into how it works and when it works well. In this post, we run non-Bayesian MAs, but you can also swap in code from some of our CE/FE/RE templates and do it the Bayesian way, to see how it gets on. A nice aspect of this is that you can test how (un)responsive the results are to potentially contentious priors.
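To make those metrics concrete, here is one way (our illustration) to estimate bias and coverage from the common-effect simulation above, treating each simulated trial as a replicate of the same data-generating process:

# bias and 95% CI coverage of the per-study log OR estimates
bias <- mean(sim_binary_ce$lor) - true_lor
covered <- (sim_binary_ce$lor - 1.96*sim_binary_ce$se <= true_lor) &
           (sim_binary_ce$lor + 1.96*sim_binary_ce$se >= true_lor)
print(paste0("Bias: ", round(bias, 3), "; coverage: ", mean(covered)))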

