5.1 Topics

  • Standard error vs. Standard deviation
  • Boostrapping
  • Confidence Intervals

5.2 Standard error vs. standard deviation

import numpy as np
import seaborn as sns

np.random.seed(42)
sns.set_theme(style="whitegrid")

Let’s define a “population” of size 1,000,000. Each member of the population will be a number from the standard normal distribution. We will be using the numpy function randn

population = np.random.randn(1000000)
population
array([ 0.49671415, -0.1382643 ,  0.64768854, ..., -0.11297975,
        1.46914237,  0.47643025])

What’s the population’s mean and standard deviation?

mean_value = population.mean()
std = population.std()
print(f"The mean value is {mean_value} and the standard deviation is {std}")
The mean value is -0.0015997564542563718 and the standard deviation is 1.000188036804731

What does the distribution of the population look like?

sns.displot(population)

If this were the real world, sampling 1,000,000 people would be difficult. Let’s say that our budget allowed us to sample only 100 people at a time. How can we simulate that?

sample_size = 100
sample_100 = np.random.choice(population, sample_size)

What’s the sample’s mean and standard deviation?

mean_100 = sample_100.mean()
std_100 = sample_100.std()
print(f"The mean value is {mean_100}\nThe standard deviation is {std_100}")
The mean value is -0.1753180501180135
The standard deviation is 1.1706030198688049

Notice that our sample’s mean value is pretty different from that of the population. What if we take a bigger sample size?

sample_size = 1000
sample_1000 = np.random.choice(population, sample_size)
mean_1000 = sample_1000.mean()
std_1000 = sample_1000.std()
print(f"The mean value is {mean_1000}\nThe standard deviation is {std_1000}")
The mean value is -0.008352574043024205
The standard deviation is 1.003117759041072

Or smaller…

sample_size = 5
sample_5 = np.random.choice(population, sample_size)
mean_5 = sample_5.mean()
std_5 = sample_5.std()
print(f"The mean value is {mean_5}\nThe standard deviation is {std_5}")
The mean value is 0.03172109602313291
The standard deviation is 0.758566765992809

The standard error of the mean (SEM) measures the precision of the estimate of the sample mean. We can use the formula:

\text{SEM} = \frac{\sigma}{\sqrt{n}}

  • \sigma = the standard deviation of the sample
  • n = the sample size

Now we can compare our standard errors using the different sample sizes.

se_5 = std_5/np.sqrt(5)
se_100 = std_100/np.sqrt(100)
se_1000 = std_1000/np.sqrt(1000)

print(f"The SEM when the sample size was 5 was {se_5}")
print(f"The SEM when the sample size was 100 was {se_100}")
print(f"The SEM when the sample size was 1000 was {se_1000}")
The SEM when the sample size was 5 was 0.3392413708464193
The SEM when the sample size was 100 was 0.11706030198688049
The SEM when the sample size was 1000 was 0.03172136879933749

6 Bootstrapping

Bootstrapping is a statistical method that involves repeatedly resampling a dataset with replacement to estimate the distribution of a statistic. This technique allows for the assessment of the accuracy and variability of sample estimates, such as the mean or standard deviation, without making strict assumptions about the underlying population distribution. By generating many resampled datasets (called bootstrap samples) and calculating the statistic of interest for each sample, bootstrapping provides an empirical approximation of the sampling distribution. This method is particularly useful when dealing with small samples or when traditional parametric assumptions cannot be applied.

Let’s create a numpy array to represent heights of females at the Faculta

np.random.seed(42)
mean = 162.2
std = 5.5
sample_size = 1000
heights = np.random.normal(loc=mean, 
                           scale=std, 
                           size=sample_size)

# sample mean
mean_value = heights.mean()

# population standard deviation
std = heights.std()

print(f"The population's mean value is {mean_value}\nThe standard deviation is {std}")
The population's mean value is 162.30632630702277
The standard deviation is 5.382994142610448

Now we can construct a simulated sampling distribution

boot_straps = 500
sample_size
sample_means = np.zeros(boot_straps)
for ii in range(boot_straps):
    sample = np.random.choice(heights, 
                              size=sample_size,
                              replace=True)
    sample_means[ii]= sample.mean()
# now we can find the standard deviation of the means
se_mean_height_bootstrap = sample_means.std()
print(f"The standard error of the mean calculated using bootstrapping is: {se_mean_height_bootstrap:.3f}")
The standard error of the mean calculated using bootstrapping is: 0.175

Reminder, for the mean, it is also possible to estimate the Standard Error (SE) of the mean by dividing the standard deviation (STD) of the values by the square root of the sample size. The formula for the Standard Error of the Mean (SEM) is:

\text{SEM} = \frac{\sigma}{\sqrt{n}}

where \sigma is the standard deviation of the sample, and n is the sample size.

For other statistics, such as the median, there is no such analytical formula. The utility of the bootstrap method lies in its ability to estimate the SE of these other statistics. By repeatedly resampling the data with replacement and calculating the statistic of interest for each resample, bootstrapping provides an empirical distribution of the statistic. From this distribution, we can estimate the SE and other measures of variability.

se_mean_height_analytical = heights.std()/np.sqrt(sample_size)
print(f"The standard error of the mean calculated using the analytical formula is: {se_mean_height_analytical:.3f}")
The standard error of the mean calculated using the analytical formula is: 0.170

6.1 Confidence Intervals

We can also compute the confidence interval for our estimate of the mean. Let’s say we want the 90 percent confidence interval.

conf_level = 0.9
low_bound = (1-conf_level)/2
up_bound = (1+conf_level)/2
CI = (np.quantile(sample_means, low_bound),
      np.quantile(sample_means, up_bound)) 
print(f"The lower bound of the confidence interval is: {CI[0]}")
print(f"The upper bound of the confidence interval is: {CI[1]}")
The lower bound of the confidence interval is: 162.0432374294097
The upper bound of the confidence interval is: 162.6211799101076

6.2 Can you bootsrap confidence interval for the median?

Answer
# make a bootstrapped population of the median
boot_straps = 500
sample_size
sample_medians = np.zeros(boot_straps)
for ii in range(boot_straps):
    sample = np.random.choice(heights, 
                              size=sample_size,
                              replace=True)
    sample_medians[ii]= np.median(sample)

# compute CI for the bootstrapped population of the median
conf_level = 0.9
low_bound = (1-conf_level)/2
up_bound = (1+conf_level)/2
CI = (np.quantile(sample_medians, low_bound),
      np.quantile(sample_medians, up_bound)) 

# # uncomment below:
# print(f"The lower bound of the confidence interval is: {CI[0]}")
# print(f"The upper bound of the confidence interval is: {CI[1]}")