
This post relates to the computer icons in the margins of pages 7 and 8 of our book. In Chapter 1, we want to refresh the key ideas of statistical inference, to prime you for thinking about models in terms of likelihoods; from there, we can get Bayesian without any confusing jumps of logic.

We even thought about making this “Chapter Zero” at one point, but we felt in the end that every reader should look at it, even if only a skim read. (CRC Press were brilliantly supportive throughout, but still, we felt that they might not welcome a Chapter Zero.)

Simulating random data, and playing with different inputs to see what happens, are two activities whose benefits are well demonstrated in research on statistics education. If you teach stats and want to find out more, some of the key names to look up are George Cobb, Joan Garfield, and Robin Lock.

This post provides R code; matching posts for Python and Stata will follow. As usual, we use base R and loops for clarity.

Part 1: run one simulated random sample from a normal population

# before simulating random data, we should set 
# the "seed" for the pseudo-random number 
# generator functions

set.seed(2468)

# normally distributed sample using the rnorm()
# function with statistics for pulse rate:

mydata <- rnorm(n=100, mean=79.1, sd=14.5)

# histogram of the data:

hist(mydata, breaks=20)

# observed statistics differ somewhat from 
# the true values:

print(paste0("Mean = ", mean(mydata)))
print(paste0("SD = ", sd(mydata)))

You should get a nice, normal-ish distribution like this:

[Figure: histogram of 'mydata', with values ranging from roughly 40 to 120]

You might not get exactly the same histogram as us. This is because the pseudo-random number generation can vary between versions of R and operating systems, and even with your hardware.

Part 2: run it repeatedly

If we run this in a loop, we can get lots of samples, capture the observed mean and SD from each random sample, and see what their sampling distributions look like.

# now, let's put this into a loop, so we 
# can accumulate many samples and see
# how their observed statistics behave:
set.seed(9876)

# you can change the inputs here:
input_n <- 100
input_mean <- 79.1
input_sd <- 14.5

# how many iterations do we run?
iter <- 100000
sample_mean <- sample_sd <- rep(NA, iter)

# run the loop:
for(i in 1:iter) {
	mydata <- rnorm(n=input_n, 
                        mean=input_mean, 
                        sd=input_sd)
	sample_mean[i] <- mean(mydata)
	sample_sd[i] <- sd(mydata)
}

par(mfcol=c(2,1))
hist(sample_mean, breaks=30)
hist(sample_sd, breaks=30)
par(mfcol=c(1,1))

Self-test question: can you use R to get the standard deviation of the vector ‘sample_mean’? What name is this standard deviation of a sampling distribution known by? (See p.7 if you’re not sure.)

Part 3: wrap it in a function

Now we can put the same code inside a function. The arguments are input_n, input_mean, input_sd and iter, and their defaults are the values used above. The function also prints a progress bar, which is useful if you pump up the iterations. (Unfortunately, in Jupyter servers, or when running this as a script file with source() in command-line R, txtProgressBar may not display anything until it has finished.)

# now, we put the whole loop into a function, 
# so you can quickly rerun it with 
# different inputs:

mysample <- function(input_n=100, 
                     input_mean=79.1, 
                     input_sd=14.5, 
                     iter=100000) {
	pb <- txtProgressBar(min=1, 
                             max=iter, 
                             style=3)
	set.seed(9876)
	sample_mean <- sample_sd <- rep(NA, iter)

	for(i in 1:iter) {
		setTxtProgressBar(pb, i)
		mydata <- rnorm(n=input_n, 
                                mean=input_mean, 
                                sd=input_sd)
		sample_mean[i] <- mean(mydata)
		sample_sd[i] <- sd(mydata)
	}
	
	par(mfcol=c(2,1))
	hist(sample_mean, breaks=30)
	hist(sample_sd, breaks=30)
	par(mfcol=c(1,1))
	
	close(pb)

	return(list(sample_mean=sample_mean, 
	            sample_sd=sample_sd))

}

The vectors ‘sample_mean’ and ‘sample_sd’ of observed statistics are returned in a list. This lets you play with the settings and see what happens. The only really interesting input is input_n: when you change it, you will find that the SDs of the sampling distributions change too.

> samps <- mysample(input_n=100)
  |=================================================| 100%

> sd(samps[[1]])
[1] 1.453322

> samps <- mysample(input_n=10)
  |=================================================| 100%

> sd(samps[[1]])
[1] 4.582957
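These simulated values line up with the familiar theoretical formula for the standard error of the mean, sigma divided by the square root of n. A quick check, using the same population SD as above:

```r
# theoretical standard error of the mean: sigma / sqrt(n)
input_sd <- 14.5
input_sd / sqrt(100)  # 1.45, close to the simulated 1.453322
input_sd / sqrt(10)   # about 4.585, close to the simulated 4.582957
```

The small discrepancies are just simulation noise; they shrink as you increase iter.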

Next steps

The distribution of observation-level data will not always be normal. In other #repo posts for Chapter 1 (see the list at the end of “Getting Started“), you will encounter random number generator functions for other common distributions, and you can play with substituting these for the normal.

For example, you could investigate how much skewness in the observations really leads to asymmetry in the sampling distribution of the mean. Then, how does this vary with the sample size, ‘input_n’? This will build your intuition for matching data to the right sampling distribution when modelling through likelihood.
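As a sketch of how you might start (we have substituted a right-skewed gamma distribution for the normal; the shape and rate values here are our own illustrative choices, not anything from the book):

```r
# sketch: sampling distribution of the mean from a
# right-skewed gamma population (shape=2, rate=0.5,
# so the population mean is shape/rate = 4)
set.seed(1357)
iter <- 10000
input_n <- 10   # try 5, 30 and 100 too
skew_mean <- rep(NA, iter)
for(i in 1:iter) {
	skew_mean[i] <- mean(rgamma(n=input_n, shape=2, rate=0.5))
}
hist(skew_mean, breaks=30)
# with small input_n, some right skew survives in the
# sampling distribution of the mean; it fades as input_n grows
```

Rerun this with different input_n values and watch the histogram become more symmetric.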

This works with means, but the same logic can be applied to almost any other summary statistic. Importantly, for some statistics, notably the median, there is only an approximate formula for the sampling distribution.
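By way of illustration, here is the same kind of simulation with the median substituted for the mean, again using the pulse-rate inputs from above:

```r
# sketch: simulate the sampling distribution of the median,
# for which there is no exact general formula
set.seed(2468)
iter <- 10000
input_n <- 100
sample_median <- rep(NA, iter)
for(i in 1:iter) {
	sample_median[i] <- median(rnorm(n=input_n,
	                                 mean=79.1,
	                                 sd=14.5))
}
hist(sample_median, breaks=30)
sd(sample_median)  # compare with the large-sample approximation
                   # for normal data: sd * sqrt(pi/2) / sqrt(n)
```

For normal data, the large-sample standard error of the median is about 25% bigger than that of the mean, and the simulation lets you check that claim directly.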

If the distribution of ‘sample_mean’ is not reassuringly normal, you might want to fit some candidate probability density functions to it. The usual approach to this is maximum likelihood estimation, which we will illustrate in another post.

Bottom line

Sampling distributions give us likelihood contributions for studies in meta-analysis. Remember, when you are not sure what the appropriate sampling distribution is, simulation can really help to inform your choice. When you are in doubt, consult a qualified statistician.

If you enjoy hands-on investigation of what is really going on in basic statistical inference, we recommend the anthropologist Richard McElreath’s book and lectures on Statistical Rethinking (see “Recommended Reading“).

