library(tidyverse)
library(tidymodels)
putin_supporter <- c(rep(1, 400), rep(0, 600))
pop <- tibble(putin_supporter)
mean(pop$putin_supporter)
[1] 0.4
January 29, 2025
On December 19, 2014, the front page of Spanish national newspaper El País read “Catalan public opinion swings toward ‘no’ for independence, says survey”.
Alberto Cairo. The truthful art: Data, charts, and maps for communication. New Riders, 2016.
Alberto Cairo. “Uncertainty and Graphicacy: How Should Statisticians Journalists and Designers Reveal Uncertainty in Graphics for Public Consumption?”, Power from Statistics: Data Information and Knowledge, 2017.
We know from the previous section that even unbiased procedures do not get the “right” answer every time
We also know that our estimates might vary from sample to sample due to random chance
Therefore: we want to report on our estimate and our level of uncertainty
In the previous section, we knew the population parameter
In real life, we do not!
We want to generate an estimate and characterize our uncertainty with a range of possible estimates
A 95 percent confidence interval is standard
We cannot run the process over and over like when we simulated the true sampling distribution
That is, we cannot generate the true sampling distribution
Take advantage of the Central Limit Theorem to estimate the sampling distribution
Use simulation, bootstrapping
Pulling oneself up by one's bootstraps …
Use the data we have to estimate or approximate the sampling distribution
Take a bootstrap sample - a random sample taken with replacement from the original sample, of the same sample size as the original sample
In essence, random sampling from our sample, which is itself a representative sample from the population
With replacement: Pull one observation out, put it back in before selecting the next observation
Why with replacement?
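One way to see why replacement matters: drawing n observations without replacement from a sample of size n just returns the same observations in a different order, so every "resample" has the same mean. Below is a minimal base R sketch of that point. The vector `toy_sample` is a made-up illustration, not the lecture data.

```r
set.seed(123)  # arbitrary seed so the draws are reproducible

# Hypothetical toy sample of 10 respondents (1 = supporter, 0 = not)
toy_sample <- c(1, 0, 0, 1, 0, 1, 0, 0, 0, 1)

# Without replacement: a reshuffle of the same 10 values, so the mean never changes
no_replace <- sample(toy_sample, size = length(toy_sample), replace = FALSE)
mean(no_replace) == mean(toy_sample)

# With replacement: some respondents can appear more than once and others not at all,
# so the mean can differ from one resample to the next
with_replace <- sample(toy_sample, size = length(toy_sample), replace = TRUE)
mean(with_replace)
```

Only the version with replacement produces variation across resamples, and that variation is what we use to approximate the sampling distribution.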
Let’s return to the example where we have set the proportion of the population that supports Putin to 0.40 (the true value of our population parameter)
We want a point estimate of Putin support AND a confidence interval
We will use a package called tidymodels
Now let's construct a confidence interval around the estimate
A range of plausible estimates
Take a bootstrap sample and calculate the mean
This means we take a sample (with replacement) from our sample, of the same sample size
Notice I am NOT sampling from the Population: I AM sampling from the sample
Take another bootstrap sample
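The steps above can be sketched by hand in base R. This assumes one survey sample of 100 respondents drawn from the population. The size 100 and the name `my_sample` are illustrative assumptions, not values given in the slides.

```r
set.seed(2025)  # arbitrary seed for reproducibility

# Population from the running example: 40% supporters
putin_supporter <- c(rep(1, 400), rep(0, 600))

# Assumed: one survey sample of 100 respondents (size chosen for illustration)
my_sample <- sample(putin_supporter, size = 100)

# One bootstrap sample: resample from my_sample (not the population),
# with replacement, keeping the same sample size
boot_1 <- sample(my_sample, size = length(my_sample), replace = TRUE)
mean(boot_1)

# Another bootstrap sample usually gives a slightly different mean
boot_2 <- sample(my_sample, size = length(my_sample), replace = TRUE)
mean(boot_2)
```

Repeating this thousands of times and collecting the means gives the bootstrap distribution.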
tidymodels
library(tidymodels)

# `sample` is assumed to be the data frame holding our one sample drawn from pop
# create an object called boot_df
boot_df <- sample %>%
  # specify the variable of interest
  specify(response = putin_supporter) %>%
  # generate 15000 bootstrap samples
  generate(reps = 15000, type = "bootstrap") %>%
  # calculate the mean of each bootstrap sample
  calculate(stat = "mean")

# print the result
boot_df
What does each observation represent?
A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.
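A minimal base R sketch of "the middle 95%": cut off 2.5% in each tail of the bootstrap means. Here `boot_means` stands in for the bootstrap statistics. The sample size of 100 and the 5000 repetitions are assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed for reproducibility

putin_supporter <- c(rep(1, 400), rep(0, 600))
my_sample <- sample(putin_supporter, size = 100)  # assumed sample size

# Each element is the mean of one bootstrap sample
boot_means <- replicate(
  5000,
  mean(sample(my_sample, size = length(my_sample), replace = TRUE))
)

# Middle 95%: the 2.5th and 97.5th percentiles of the bootstrap distribution
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

With the pipeline approach, the infer function `get_confidence_interval()` with `type = "percentile"` should give the same kind of interval, though the base R version above makes the percentile logic explicit.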
The 95% confidence interval was calculated as (0.26, 0.44). Which of the following is the correct interpretation of this interval?
(a) 95% of the time support for Putin in this sample is between 0.26 and 0.44.
(b) 95% of all Russians have support for Putin between 0.26 and 0.44.
(c) We are 95% confident that the proportion of Russians who support Putin is between 0.26 and 0.44.
(d) We are 95% confident that the proportion of Russians who support Putin in this sample is between 0.26 and 0.44.
Standard language to interpret: We are 95% confident that the proportion of Russians who support Putin is between 0.26 and 0.44.
More precise: If we were to repeat our process over and over, about 95% of the confidence intervals we construct would contain the population parameter.
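The "repeat our process over and over" reading can be checked by simulation, because in this example we built the population ourselves. The sketch below is one possible check, not the lecture's own code. The helper `one_study()` and its sample size (100), bootstrap repetitions (500), and number of repeated studies (300) are all assumptions.

```r
set.seed(99)  # arbitrary seed for reproducibility

putin_supporter <- c(rep(1, 400), rep(0, 600))
true_p <- mean(putin_supporter)  # 0.4, known only because we built the population

# Hypothetical helper: run one study and report whether its
# 95% percentile bootstrap interval contains the true proportion
one_study <- function(n = 100, reps = 500) {
  s <- sample(putin_supporter, size = n)
  boot_means <- replicate(reps, mean(sample(s, size = n, replace = TRUE)))
  ci <- quantile(boot_means, probs = c(0.025, 0.975))
  ci[[1]] <= true_p && true_p <= ci[[2]]
}

# Repeat the whole process 300 times and record how often the interval covers 0.4
covered <- replicate(300, one_study())
mean(covered)
```

The share of intervals that cover the true value should come out in the neighborhood of 0.95, with some simulation noise.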
Let’s take 1 sample of 200 so we can see what happens to the confidence interval if we increase the sample size
Who remembers?
As sample size gets bigger . . .
The spread of the sampling distribution gets narrower
The shape of the sampling distribution becomes closer to a normal distribution
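A quick simulation sketch of the narrowing, using the population from the running example. The sample size of 200 comes from the slide above. The comparison size of 50, the 2000 repetitions, and the helper `sampling_dist()` are assumptions for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility

putin_supporter <- c(rep(1, 400), rep(0, 600))

# Hypothetical helper: simulate the sampling distribution of the mean
# by drawing many samples of size n from the population
sampling_dist <- function(n, reps = 2000) {
  replicate(reps, mean(sample(putin_supporter, size = n)))
}

means_50  <- sampling_dist(50)
means_200 <- sampling_dist(200)

# Spread of each simulated sampling distribution
sd(means_50)
sd(means_200)
```

The standard deviation for samples of 200 should be clearly smaller than for samples of 50. Calling `hist()` on each vector is one way to compare the shapes.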
We can use our data to estimate/approximate the sampling distribution
Because we know things about the properties of the normal distribution
Centered on the mean
95% of the distribution sits within plus or minus 1.96 standard deviations of the mean
With sufficient sample size, we can assume the sampling distribution is distributed as a normal distribution (central limit theorem)
We can use the mean in our data (our estimate) to estimate the center of the distribution
We can use the standard deviation in our data, divided by the square root of the sample size, to estimate the standard error
Lower bound: \(Mean - 1.96*SE\)
Upper bound: \(Mean + 1.96*SE\)
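The two bounds can be computed directly in base R. This is a minimal sketch, assuming a sample of 100 respondents (the size and the name `my_sample` are illustrative) and estimating the standard error as the sample standard deviation divided by the square root of the sample size.

```r
set.seed(11)  # arbitrary seed for reproducibility

putin_supporter <- c(rep(1, 400), rep(0, 600))
my_sample <- sample(putin_supporter, size = 100)  # assumed sample size

est <- mean(my_sample)                          # point estimate (center)
se  <- sd(my_sample) / sqrt(length(my_sample))  # estimated standard error

lower <- est - 1.96 * se  # Mean - 1.96 * SE
upper <- est + 1.96 * se  # Mean + 1.96 * SE
c(lower = lower, upper = upper)
```

No resampling is needed here: the normal approximation does the work that the bootstrap distribution did earlier.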
estimatr package
Typical language (which is good!): We are 95 percent confident that the true parameter is inside the upper and lower bounds of the confidence interval
More precise: if we ran this study over and over, about 95 percent of the confidence intervals we construct would contain the true parameter
What proportion of Russians believe their country interfered in the 2016 presidential elections in the US?
Pew Research survey
506 subjects
Data available in the openintro package
A 95% confidence interval is bounded by the middle 95% of the bootstrap distribution.