From Wikipedia:
Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement, of the observed dataset (and of equal size to the observed dataset).
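To make the idea concrete, here is a minimal sketch of a single bootstrap resample drawn from a small made-up vector (the values below are purely illustrative); because we draw with replacement, some values typically repeat and others are left out, and the resample has the same size as the original:
# A small made-up sample, purely for illustration
x <- c(2.1, -0.3, 1.7, 0.4, -1.2);
# One bootstrap resample: same size as x, drawn with replacement,
# so some values may appear more than once and others not at all
sample(x, replace = TRUE);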
Let's run a simulation where we estimate the mean, median, standard deviation, and variance of a population from which we have a sample. For each estimate we will report the bootstrap standard error, a 95% confidence interval, and the bias. Here the bias is the mean of the bootstrap estimates minus the estimate from the original sample, so a positive bias indicates that the estimator tends to overestimate the statistic of interest, and a negative bias indicates an underestimation. We will do this using bootstrapping.
First we simulate taking a sample of size 100 from a population with mean 0 and standard deviation 1:
# Draw 100 observations from a standard normal population
data <- rnorm(100, mean = 0, sd = 1);
data[1:10];
## [1] -0.215647074 -1.453146297 0.238233482 0.007944063 0.324200079
## [6] -1.710888226 0.015950867 0.725813004 0.843558267 1.849744895
Now we resample with replacement from the above sample, 10000 times in total, and calculate our statistics of interest on each resample. We store the calculated statistics in a matrix:
# Each row holds the four statistics calculated on one bootstrap resample
results <- matrix(nrow = 10000, ncol = 4);
for (i in 1:10000) {
    # Resample with replacement, keeping the original sample size
    resample <- sample(data, replace = TRUE);
    results[i, 1] <- mean(resample);
    results[i, 2] <- median(resample);
    results[i, 3] <- sd(resample);
    results[i, 4] <- var(resample);
}
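As an aside, the same procedure can be expressed with the boot package (shipped with R); a sketch for the mean, where the statistic function receives the data and a vector of resampled indices, might look like this:
library(boot);
# The statistic must accept the data and the resampled indices
boot_mean <- function(d, idx) mean(d[idx]);
b <- boot(data, statistic = boot_mean, R = 10000);
sd(b$t[, 1]);              # bootstrap standard error of the mean
boot.ci(b, type = 'perc'); # percentile 95% confidence interval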
Now let's look at the mean:
mean_obs <- mean(data);                               # estimate from the original sample
mean_se <- sd(results[, 1]);                          # bootstrap standard error
mean_ci <- quantile(results[, 1], c(0.025, 0.975));   # percentile 95% confidence interval
mean_bi <- mean(results[, 1]) - mean_obs;             # bootstrap estimate of bias
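As an aside (not part of the original analysis), if we are willing to assume the bootstrap distribution is approximately normal, the standard error alone gives a closely related interval:
# Normal-approximation 95% CI: estimate +/- 1.96 * bootstrap SE
mean_obs + c(-1, 1) * qnorm(0.975) * mean_se;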
The mean of 0.0006872 has a standard error of 0.104, with a 95% confidence interval of (-0.2037, 0.2037) and a bias of 0.0003409. We plot the distribution of the bootstrap means:
hist(
    results[, 1],
    breaks = 25,
    main = 'Mean',
    xlab = 'Bootstrap mean'
);
Now let's look at the median:
median_obs <- median(data);
median_se <- sd(results[, 2]);
median_ci <- quantile(results[, 2], c(0.025, 0.975));
median_bi <- mean(results[, 2]) - median_obs;
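One caveat worth noting: the bootstrap distribution of the median is quite discrete, since for our even sample size each resampled median is the average of two of the original order statistics. A quick check of how many distinct values actually occur:
# The bootstrap medians take relatively few distinct values
length(unique(results[, 2]));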
The median of 0.02505 has a standard error of 0.09301, with a 95% confidence interval of (-0.2487, 0.1873) and a bias of -0.002616. We plot the distribution of the bootstrap medians:
hist(
    results[, 2],
    breaks = 25,
    main = 'Median',
    xlab = 'Bootstrap median'
);
Now let's look at the standard deviation:
sd_obs <- sd(data);
sd_se <- sd(results[, 3]);
sd_ci <- quantile(results[, 3], c(0.025, 0.975));
sd_bi <- mean(results[, 3]) - sd_obs;
The standard deviation of 1.041 has a standard error of 0.0702, with a 95% confidence interval of (0.8966, 1.1716) and a bias of -0.007269. We plot the distribution of the bootstrap standard deviations:
hist(
    results[, 3],
    breaks = 25,
    main = 'Standard Deviation',
    xlab = 'Bootstrap standard deviation'
);
Now let's look at the variance:
var_obs <- var(data);
var_se <- sd(results[, 4]);
var_ci <- quantile(results[, 4], c(0.025, 0.975));
var_bi <- mean(results[, 4]) - var_obs;
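Note that each bootstrap standard deviation replicate is exactly the square root of the corresponding variance replicate, so columns 3 and 4 of results are deterministically linked, and the percentile interval for the standard deviation is (up to the interpolation done by quantile()) the square root of the interval for the variance:
all.equal(results[, 3], sqrt(results[, 4])); # TRUE: sd is the square root of var
sqrt(var_ci);                                # nearly reproduces sd_ci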
The variance of 1.085 has a standard error of 0.1453, with a 95% confidence interval of (0.8039, 1.3726) and a bias of -0.01016. We plot the distribution of the bootstrap variances:
hist(
    results[, 4],
    breaks = 25,
    main = 'Variance',
    xlab = 'Bootstrap variance'
);
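Finally, since the bootstrap gives us a bias estimate, it can also be used to bias-correct an estimate. As a sketch, subtracting the estimated bias from the observed variance (equivalently, 2 * var_obs - mean(results[, 4])) gives a simple bias-corrected value:
# Simple bootstrap bias correction: observed estimate minus estimated bias
var_corrected <- var_obs - var_bi;
var_corrected;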
We have seen that, using the bootstrap technique, we can get a pretty good approximation of statistics of interest, with associated standard errors and confidence intervals, and also an indication of the bias of the estimate.