Confidence Intervals

January 29, 2025

What is wrong with this?

On December 19, 2014, the front page of Spanish national newspaper El País read “Catalan public opinion swings toward ‘no’ for independence, says survey”.

Alberto Cairo. The truthful art: Data, charts, and maps for communication. New Riders, 2016.

Alberto Cairo. “Uncertainty and Graphicacy: How Should Statisticians, Journalists and Designers Reveal Uncertainty in Graphics for Public Consumption?”, Power from Statistics: Data, Information and Knowledge, 2017.

Characterizing Uncertainty

  • We know from the previous section that even unbiased procedures do not get the “right” answer every time

  • We also know that our estimates might vary from sample to sample due to random chance

  • Therefore: we want to report on our estimate and our level of uncertainty

Characterizing Uncertainty

  • In the previous section, we knew the population parameter

  • In real life, we do not!

  • We want to generate an estimate and characterize our uncertainty with a range of possible estimates

Solution: Create a Confidence Interval

  • A plausible range of values for the population parameter is a confidence interval.
  • 95 percent confidence interval is standard

    • We are 95% confident that the parameter value falls within the range given by the confidence interval

“Catching” our Parameter

Challenge

  • We cannot run the process over and over like when we simulated the true sampling distribution

  • That is, we cannot generate the true sampling distribution

Solutions

  • Take advantage of the Central Limit Theorem to estimate the sampling distribution

  • Use simulation: bootstrapping

Bootstrapping

  • Pulling oneself up by one’s bootstraps …

  • Use the data we have to estimate or approximate the sampling distribution

    • We call this the bootstrap distribution

Bootstrapping Process

  1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same sample size as the original sample

    • In essence, random sampling from our sample, which is itself a representative sample from the population

    • With replacement: Pull one observation out, put it back in before selecting the next observation

    • Why with replacement?

Bootstrap process

  1. Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same size as the original sample
  2. Calculate the bootstrap statistic - a statistic such as mean, median, proportion, slope, etc. computed on the bootstrap sample
  3. Repeat steps (1) and (2) many times to create a bootstrap distribution - a distribution of bootstrap statistics
  4. Calculate the bounds of the XX% confidence interval as the middle XX% of the bootstrap distribution (usually a 95 percent confidence interval); see the sketch below
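
A minimal base-R sketch of these four steps (the object names and data are illustrative, not from the slides):

# illustrative original sample: a vector of 0/1 responses
x <- c(rep(1, 35), rep(0, 65))

# steps 1-3: resample with replacement and store each bootstrap mean
boot_means <- replicate(15000, mean(sample(x, size = length(x), replace = TRUE)))

# step 4: the middle 95% of the bootstrap distribution
quantile(boot_means, c(0.025, 0.975))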

Illustration

  • Let’s return to the example where we have set the proportion of the population that supports Putin to 0.40 (the true value of our population parameter)

  • We want a point estimate of Putin support AND a confidence interval

Create Data and Install Needed Packages

We will use a package called tidymodels

library(tidyverse)
library(tidymodels)

putin_supporter <- c(rep(1, 400), rep(0, 600))
pop <- tibble(putin_supporter)
mean(pop$putin_supporter)
[1] 0.4

Take our Sample

  • Let’s take a random sample of 100, and calculate our estimate
set.seed(12112021)
sample <- slice_sample(pop, n= 100)
estimate <- mean(sample$putin_supporter)
estimate
[1] 0.35

Next step

  • Now let’s construct a confidence interval around the estimate

  • A range of plausible estimates

Bootstrap Sample 1

  • Take a bootstrap sample and calculate the mean

  • This means we take a sample (with replacement) from our sample, of the same sample size

  • Notice I am NOT sampling from the Population: I AM sampling from the sample

sample_boot1 <- slice_sample(sample, n= 100, replace=TRUE) 
mean(sample_boot1$putin_supporter)
[1] 0.37

Bootstrap Sample 2

Take another bootstrap sample

sample_boot2 <- slice_sample(sample, n= 100, replace=TRUE)
mean(sample_boot2$putin_supporter)
[1] 0.4

And another…

sample_boot3 <- slice_sample(sample, n= 100, replace=TRUE)
mean(sample_boot3$putin_supporter)
[1] 0.34

Let’s repeat 15,000 times using tidymodels

library(tidymodels)
## create an object called boot_df
boot_df <- sample %>% 
  # specify the variable of interest
  specify(response = putin_supporter) %>%  
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>% 
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")

The bootstrap sample

  • How many observations are there in boot_df? What does each observation represent?
glimpse(boot_df)
Rows: 15,000
Columns: 2
$ replicate <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ stat      <dbl> 0.31, 0.37, 0.37, 0.24, 0.41, 0.30, 0.39, 0.35, 0.39, 0.43, …
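
So boot_df has 15,000 rows: each row is one bootstrap replicate, and stat is the mean calculated on that replicate’s bootstrap sample.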

Visualize the bootstrap distribution
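
The slides include a plot of the bootstrap distribution here; a minimal ggplot2 sketch of one way to draw it (the binwidth is an illustrative choice):

ggplot(boot_df, aes(x = stat)) +
  geom_histogram(binwidth = 0.01) +
  labs(x = "Bootstrap mean", y = "Count")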

Calculate the confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

boot_df %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1  0.26  0.44

Visualize the confidence interval
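
The slides include a plot here as well; a sketch of one way to mark the interval bounds on the bootstrap distribution (0.26 and 0.44 come from the quantiles above):

ggplot(boot_df, aes(x = stat)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(xintercept = c(0.26, 0.44), linetype = "dashed") +
  labs(x = "Bootstrap mean", y = "Count")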

Interpret the confidence interval

The 95% confidence interval was calculated as (0.26, 0.44). Which of the following is the correct interpretation of this interval?

(a) 95% of the time support for Putin in this sample is between 0.26 and 0.44.

(b) 95% of all Russians have support for Putin between 0.26 and 0.44.

(c) We are 95% confident that the proportion of Russians who support Putin is between 0.26 and 0.44.

(d) We are 95% confident that the proportion of Russians who support Putin in this sample is between 0.26 and 0.44.

Interpretation

Standard language to interpret: We are 95% confident that the proportion of Russians who support Putin is between 0.26 and 0.44.

More precise: If we were to repeat our process over and over, the population parameter would sit within the confidence interval 95% of the time.

Increasing Sample Size


Let’s take one sample of 200 so we can see what happens to the confidence interval when we increase the sample size


# First, take a sample of 200 from our hypothetical data
sample200 <- slice_sample(pop, n= 200)
sample200 <- as_tibble(sample200)
mean(sample200$putin_supporter)
[1] 0.39

Bootstrap with tidymodels

## create an object called boot_df200
boot_df200 <- sample200 %>% 
  # specify the variable of interest
  specify(response = putin_supporter) %>%  
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>% 
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")

Calculate the confidence interval

boot_df200 %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1 0.325  0.46

Compare to interval with n=100

boot_df %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975),
            CIwidth = upper - lower)
# A tibble: 1 × 3
  lower upper CIwidth
  <dbl> <dbl>   <dbl>
1  0.26  0.44    0.18
boot_df200 %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975),
            CIwidth = upper - lower)
# A tibble: 1 × 3
  lower upper CIwidth
  <dbl> <dbl>   <dbl>
1 0.325  0.46   0.135
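
The interval from the larger sample is narrower (width 0.135 versus 0.18): more data means less sampling uncertainty.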

Quick Break

Central Limit Theorem


Who remembers?

Central Limit Theorem


As sample size gets bigger . . .

  • The spread of the sampling distribution gets narrower

  • The shape of the sampling distribution becomes closer to a normal distribution

Implication…


  • We can use our data to estimate/approximate the sampling distribution

  • Because we know things about the properties of the normal distribution

“Normal” Distribution

Centered on the mean

95% of the distribution sits within plus or minus 1.96 standard deviations of the mean
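
The 1.96 multiplier comes straight from the normal distribution; a quick check in R:

qnorm(0.975)
[1] 1.959964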

  • With sufficient sample size, we can assume the sampling distribution is distributed as a normal distribution (central limit theorem)

  • We can use the mean in our data (our estimate) to estimate the center of the distribution

  • We can use the standard deviation in our data to estimate the standard error

95% Confidence Interval


  • Lower bound: \(Mean - 1.96*SE\)

  • Upper bound: \(Mean + 1.96*SE\)

Calculate by hand

sample %>% 
  summarize(
    mean = mean(putin_supporter),
    sd = sd(putin_supporter),
    se = sd/sqrt(n()-1),
    lower = mean - 1.96*se,
    upper = mean + 1.96*se
  )
# A tibble: 1 × 5
   mean    sd     se lower upper
  <dbl> <dbl>  <dbl> <dbl> <dbl>
1  0.35 0.479 0.0482 0.256 0.444

Use estimatr package

library(estimatr)
sample %>% 
  lm_robust(putin_supporter ~ 1, data = .) 
            Estimate Std. Error  t value     Pr(>|t|)  CI Lower  CI Upper DF
(Intercept)     0.35 0.04793725 7.301212 7.286987e-11 0.2548821 0.4451179 99
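
broom’s tidy() turns the same lm_robust results into a tibble; the coverage simulation below relies on its conf.low and conf.high columns. A quick sketch (the select() call is just to trim the printout):

sample %>% 
  lm_robust(putin_supporter ~ 1, data = .) %>% 
  tidy() %>% 
  select(estimate, conf.low, conf.high)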

Compare to bootstrap estimate
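
The bounds printed below were presumably computed as quantiles of the bootstrap statistics; a sketch of the likely code (the object names are taken from the printed output):

lower_bound <- quantile(boot_df$stat, 0.025)
upper_bound <- quantile(boot_df$stat, 0.975)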

lower_bound
2.5% 
0.26 
upper_bound
97.5% 
 0.44 

Meaning of the 95 percent CI

sims <- 10000
inside <- rep(NA, sims)

for(i in 1:sims){
  sim_sample <- slice_sample(pop, n= 100)
  sim_estimates <- sim_sample %>% 
    lm_robust(putin_supporter ~ 1, data = .) %>% 
    tidy()
  inside[i] <- ifelse(.4 >= sim_estimates$conf.low & .4 <= sim_estimates$conf.high, 1, 0)
}

Meaning of the 95 percent CI

table(inside)
inside
   0    1 
 415 9585 

Meaning of the 95 percent CI

mean(inside)
[1] 0.9585

  • Typical language (which is good!): We are 95 percent confident that the true parameter is inside the upper and lower bounds of the confidence interval

  • More precise: if we ran this study over and over, the true parameter would fall within the bounds of the confidence interval 95 percent of the time

An Example for you

What Proportion of Russians believe their country interfered in the 2016 presidential elections in the US?

  • Pew Research survey

  • 506 subjects

  • data available in the openintro package

Load up the data

library(openintro)
# glimpse(russian_influence_on_us_election_2016)
russiaData <- russian_influence_on_us_election_2016 %>% 
  mutate(try_influence = ifelse(influence_2016 == "Did try", 1, 0))
# Mean  
summarize(russiaData, mean = mean(try_influence))
# A tibble: 1 × 1
   mean
  <dbl>
1 0.150

Do Part 1 of Week 6 Coursework

Bootstrap with tidymodels

set.seed(12112021)
boot_df <- russiaData %>%
  # specify the variable of interest
  specify(response = try_influence) %>% 
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>% 
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")

Calculate the confidence interval

A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.

boot_df %>%
  summarize(lower = quantile(stat, 0.025),
            upper = quantile(stat, 0.975))
# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1 0.119 0.182

Visualize the confidence interval
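
As before, a sketch of one way to draw the bootstrap distribution with the interval bounds marked (0.119 and 0.182 come from the quantiles above; binwidth is an illustrative choice):

ggplot(boot_df, aes(x = stat)) +
  geom_histogram(binwidth = 0.005) +
  geom_vline(xintercept = c(0.119, 0.182), linetype = "dashed") +
  labs(x = "Bootstrap mean", y = "Count")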

Write a sentence interpreting the confidence interval