Inference, Sampling, and Study Design

May 19, 2025

Activity

What proportion of all milk chocolate M&Ms are blue?

  • M&Ms has a precise distribution of colors that it produces in its factories

  • M&Ms are sorted into bags in factories in a fairly random process

Activity

  • Get in groups of 2. Each group will have 4-5 bags of M&Ms.

  • Each bag represents a Sample from the full Population of M&Ms

  • Keep the contents of each bag separate, and do not eat (yet!)

Activity

  • Open up your first bag of M&Ms: calculate the proportion of the M&Ms that are blue. Write this down.

  • Do the same as above for the rest of your bags (you should have 4-5 estimates). Keep these estimates on hand.

  • Calculate the mean of your estimates.

  • Come to the board to add the data you collected to the full class histogram

  • What is your best guess about the percentage of all M&Ms that are blue?

When you are done, discuss the following with your partner:

  • What is the histogram/distribution on the board showing?

  • Why do some bags of M&Ms have proportions of blues that are higher or lower than the number you gave above?

  • Based on the histogram on the board, what is your answer to the question of what percentage of all milk chocolate M&Ms are blue? Why do you give that answer?

What have we just done?

  • We wanted to say something about the Population of M&Ms
  • The Parameter we care about is the proportion of M&Ms that are blue
  • It would be impossible to conduct a Census and calculate the parameter directly
  • We took a Sample from the population and calculated a Sample Statistic
  • Statistical inference: the act of making a guess about a population using information from a sample

What have we just done?

  • We completed this task many times

  • This produced a Sampling Distribution of our Estimates

  • There is a distribution of estimates because of Sampling Variability

    • due to random chance, one estimate from one sample can differ from the population parameter
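
The activity can also be mimicked in R. The sketch below is only an illustration: the share of blue M&Ms (`p_blue`), the bag size (`bag_size`), and the number of bags (`n_bags`) are made-up values, not figures from M&Ms or from the class.

```r
# Hypothetical values for illustration only (not real M&Ms figures)
p_blue   <- 0.20   # assumed population proportion of blue M&Ms
bag_size <- 55     # assumed number of M&Ms per bag
n_bags   <- 60     # assumed number of bags opened by the whole class

set.seed(6501)

# Each bag is one sample: count the blues, then divide by the bag size
blue_counts <- rbinom(n_bags, size = bag_size, prob = p_blue)
bag_props   <- blue_counts / bag_size

# The class histogram approximates the sampling distribution of the estimates
hist(bag_props, main = "Proportion blue, by bag", xlab = "Sample proportion")

# Individual bags differ from p_blue, but their mean should land close to it
range(bag_props)
mean(bag_props)
```

Treating each M&M as an independent draw with `rbinom()` is a simplification of the "fairly random" bagging process described earlier.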

In real research…

  • We do not repeat sampling processes many times

    • We have one sample to work with
  • These concepts about sampling and sampling variability are foundational ideas for statistical inference that we are going to keep building on

  • Big idea: when we observe a sample of data, we are looking at one draw from a distribution of possible draws (or worlds)

Zooming Out

Inference

  • Prior classes focused on data visualization and summary

  • Now, we turn to inference: the kinds of conclusions we can and want to draw from our data

Target Population

In data analysis, we are usually interested in saying something about a Target Population.

  • What proportion of adult Russians support the war in Ukraine?

    • Target population: adult Russians (age 18+)
  • How many US college students check social media during their classes?

    • Target population: US college students

Sample

In many instances, we have a Sample

  • We cannot talk to every Russian

  • We cannot talk to all college students

Parameters vs Statistics

  • The Parameter: this is the value of a calculation for the entire target population

  • The Statistic: this is what we calculate on our sample

    • We calculate a statistic in order to say something about the parameter

Inference

  • Inference is the act of “making a guess” about some unknown.

  • Statistical inference: making a good guess about a population from a sample

  • Causal inference: did X cause Y? [topic for later classes]

Let’s Move to an IA Example

Hypothetical

  • Question: What proportion of adult (18+) Russian citizens support the war in Ukraine?

  • What is the Population of interest?

  • What is the Parameter of interest?

Russia Hypothetical

  • Our population has 4000 supporters, and 6000 non-supporters

  • What is the population parameter?
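
Because this population is invented, the parameter can be worked out by hand. Writing $p$ for the population proportion of supporters (notation introduced here, not in the slides):

```latex
p = \frac{4000}{4000 + 6000} = \frac{4000}{10000} = 0.4
```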

Russia Example

This is the Population: in real research, we do not have this information!

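library(tidyverse)

# Population: 4,000 supporters (1) and 6,000 non-supporters (0)
pop <- c(rep(1, 4000), rep(0, 6000))
# as_tibble() stores the vector in a column named "value"
pop <- as_tibble(pop)
table(pop)
value
   0    1 
6000 4000 
# Population parameter: the proportion of supporters
mean(pop$value)
[1] 0.4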

Let’s take a random sample of 20

# Draw a random sample of 20 rows from the population
s1 <- slice_sample(pop, n = 20)
glimpse(s1)
Rows: 20
Columns: 1
$ value <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0
mean(s1$value)
[1] 0.25

Our random sample of 20

  • What does it mean to say that this is a random sample?

  • All units in the population have an equal Probability of being selected
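
Equal selection probability can be checked empirically. The sketch below uses a tiny made-up population of 10 units (`small_pop`, not from the slides) so that each unit can be tracked: with samples of size 2, every unit should turn up in roughly 2/10 of the samples.

```r
library(dplyr)

set.seed(6501)

# A tiny hypothetical population of 10 units, so every unit can be tracked
small_pop <- tibble(id = 1:10)

# Draw 2,000 random samples of size 2 and record which units were selected
reps <- 2000
selected <- replicate(reps, slice_sample(small_pop, n = 2)$id)

# Share of samples that include each unit; each should be close to 2/10 = 0.2
inclusion_rate <- table(factor(selected, levels = 1:10)) / reps
round(inclusion_rate, 3)
```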

Our random sample of 20

  • When I calculate the mean in the sample, what am I calculating?

  • Our Estimate of the population parameter

Let’s take another random sample of 20

s2 <- slice_sample(pop, n = 20)
mean(s2$value)
[1] 0.45

Let’s take another…

s3 <- slice_sample(pop, n = 20)
mean(s3$value)
[1] 0.5

Let’s do this 1,000 more times

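# Gather 1,000 estimates
sims <- 1000
# Store the sample means in this object called "ests"
ests <- rep(NA_real_, sims)
# For loops allow us to run the same procedure many times
for (i in seq_len(sims)) {
  # draw a fresh random sample of 20 (named "samp" to avoid masking base::sample)
  samp <- slice_sample(pop, n = 20)
  # store mean in our vector of estimates
  ests[i] <- mean(samp$value)
}
# Convert to a tibble; the estimates end up in a column named "value"
ests <- as_tibble(ests)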
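
The same simulation can be written without an explicit loop. This sketch wraps one run of the procedure in a helper, `draw_estimate()` (a name introduced here, not from the slides), and repeats it with base R's `replicate()`. The population is rebuilt so the block runs on its own, and the result is stored as `ests_alt` to keep it separate from `ests`.

```r
library(dplyr)

# Rebuild the hypothetical population: 4,000 supporters (1), 6,000 non-supporters (0)
pop <- tibble(value = c(rep(1, 4000), rep(0, 6000)))

# One run of the procedure: draw a random sample of size n and return its mean
draw_estimate <- function(data, n) {
  mean(slice_sample(data, n = n)$value)
}

# Repeat the procedure 1,000 times and collect the estimates in a tibble
ests_alt <- tibble(value = replicate(1000, draw_estimate(pop, n = 20)))

summary(ests_alt$value)
```

Because the draws are random, the numbers will not match the output shown on the slides exactly.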

What is stored in this object?

ests
# A tibble: 1,000 × 1
   value
   <dbl>
 1  0.6 
 2  0.5 
 3  0.6 
 4  0.35
 5  0.3 
 6  0.45
 7  0.55
 8  0.55
 9  0.4 
10  0.35
# ℹ 990 more rows

Where should the distribution of this set of estimates be centered?


Or, what should the mean of ests be?

# Name the result "mean_ests" so it does not mask the mean() function
mean_ests <- mean(ests$value)
mean_ests
[1] 0.39315

Distribution of estimates

Bias

  • The mean of ests (about 0.393) is very close to the population mean of 0.4 (which we know because we set it)

  • This means that our procedure (random sampling + taking the mean) for estimating the population mean is unbiased

    • on average, we get the “right” answer
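
The simulation result has a short analytical counterpart. In the notation below, which is introduced here rather than taken from the slides, $X_i$ is the 0/1 response of the $i$-th sampled person, $n$ is the sample size, $\hat{p}$ is the sample mean, and $p = 0.4$ is the population proportion. Under random sampling each $X_i$ equals 1 with probability $p$, so:

```latex
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i
\quad\Longrightarrow\quad
E[\hat{p}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\cdot n p = p
```

The bias, $E[\hat{p}] - p$, is therefore zero, which is what the simulated mean of 0.39315 approximates.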

Bias

  • Will unbiased procedures get the “right” answer every time?

Sampling Variability

  • An unbiased procedure does not guarantee that we will get the “right” answer every time

  • This is due to random chance, or sampling variability

Sampling Distribution (n = 20)

Standard Error

  • Standard deviation of the sampling distribution

  • Characterizes the spread of the sampling distribution

sd(ests$value)
[1] 0.1058148

Sample size and Sampling Variability

  • Repeat the simulations conducted above:

    • Once with n = 5
    • Once with n = 20
    • Once with n = 200
  • For each: Are these biased or unbiased procedures?

  • Visualize the sampling distributions using a histogram: how are they different?

  • Calculate the standard errors: how are they different?

  • Discuss with your partner: Why are the sampling distributions and standard errors different?

Let’s try n = 5

# Run 1,000 simulations
sims <- 1000
# Store the sample means in this object called "ests5"
ests5 <- rep(NA_real_, sims)
# For loops allow us to run the same procedure many times
for (i in seq_len(sims)) {
  samp <- slice_sample(pop, n = 5)
  ests5[i] <- mean(samp$value)
}
ests5 <- as_tibble(ests5)

Let’s try n = 200

# Run 1,000 simulations
sims <- 1000
# Store the sample means in this object called "ests200"
ests200 <- rep(NA_real_, sims)
# For loops allow us to run the same procedure many times
for (i in seq_len(sims)) {
  samp <- slice_sample(pop, n = 200)
  ests200[i] <- mean(samp$value)
}
ests200 <- as_tibble(ests200)

Sampling Distribution, Sample Size = 5

Sampling Distribution, Sample Size = 20

Sampling Distribution, Sample Size = 200
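
The three sampling distributions can be drawn on a shared axis to compare their spread. This is a sketch that assumes dplyr and ggplot2 are installed. `simulate_estimates()` and the column names are hypothetical, and the simulation is rerun inside the block so that it does not depend on `ests5`, `ests`, or `ests200`.

```r
library(dplyr)
library(ggplot2)

# Rebuild the hypothetical population: 4,000 supporters (1), 6,000 non-supporters (0)
pop <- tibble(value = c(rep(1, 4000), rep(0, 6000)))

# Hypothetical helper: draw `sims` random samples of size n, keep each sample mean
simulate_estimates <- function(data, n, sims = 1000) {
  estimates <- replicate(sims, mean(slice_sample(data, n = n)$value))
  tibble(sample_size = n, estimate = estimates)
}

# Stack the estimates for the three sample sizes into one tibble
all_ests <- bind_rows(
  simulate_estimates(pop, n = 5),
  simulate_estimates(pop, n = 20),
  simulate_estimates(pop, n = 200)
)

# One histogram per sample size, on a shared x-axis so the spreads are comparable;
# the dashed line marks the population parameter (0.4)
sampling_plot <- ggplot(all_ests, aes(x = estimate)) +
  geom_histogram(binwidth = 0.05, center = 0) +
  geom_vline(xintercept = 0.4, linetype = "dashed") +
  facet_wrap(~ sample_size, ncol = 1, labeller = label_both) +
  labs(x = "Sample proportion", y = "Number of samples")

sampling_plot
```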

Compare Standard Errors

# n = 5
sd(ests5$value)
[1] 0.2276573
# n = 20
sd(ests$value)
[1] 0.1058148
# n = 200
sd(ests200$value)
[1] 0.03511064
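
The simulated standard errors can be checked against the textbook formula for the standard error of a sample proportion, sqrt(p(1 - p)/n). The formula is not stated in the slides and is brought in here only as a cross-check. It ignores the finite-population correction, which is small when sampling at most 200 units out of 10,000. `theoretical_se()` is a hypothetical helper name.

```r
# Population proportion from the hypothetical (4,000 supporters out of 10,000)
p <- 0.4

# Standard error of a sample proportion for a sample of size n
theoretical_se <- function(p, n) {
  sqrt(p * (1 - p) / n)
}

# Compare with the simulated values reported above
n_values <- c(5, 20, 200)
data.frame(
  n              = n_values,
  theoretical_se = round(theoretical_se(p, n_values), 4),
  simulated_se   = c(0.2276573, 0.1058148, 0.03511064)
)
```

The formula gives roughly 0.219, 0.110, and 0.035, close to the simulated values. Note that a tenfold increase in sample size (20 to 200) shrinks the standard error by a factor of about sqrt(10), roughly 3.2, not by a factor of 10.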

Why does changing sample size matter?

Central Limit Theorem

As sample size gets bigger . . .

  • The spread of the sampling distribution gets narrower (the standard error shrinks)

  • The shape of the sampling distribution gets closer to a normal distribution

    • Important for mathematical calculations in statistics (not big focus for us)
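
One way to see the normal approximation at work is to check a known property of the normal distribution: about 95% of draws fall within 1.96 standard deviations of the mean. The sketch below reruns the n = 200 simulation and counts how many estimates land within 1.96 standard errors of the true value. The standard error formula sqrt(p(1 - p)/n) and the object name `ests200_check` are additions that do not appear in the slides.

```r
library(dplyr)

# Rebuild the hypothetical population: 4,000 supporters (1), 6,000 non-supporters (0)
pop <- tibble(value = c(rep(1, 4000), rep(0, 6000)))
p <- mean(pop$value)   # 0.4, known only because we built the population
n <- 200

# 1,000 sample means from random samples of size 200
ests200_check <- replicate(1000, mean(slice_sample(pop, n = n)$value))

# Theoretical standard error of a sample proportion
se <- sqrt(p * (1 - p) / n)

# Share of estimates within 1.96 standard errors of the true proportion;
# a roughly normal sampling distribution puts this near 0.95
share_within <- mean(abs(ests200_check - p) <= 1.96 * se)
share_within
```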

Bias vs Precision


  • A procedure is unbiased if it generates the “right” answer, on average
  • Precision refers to variability: procedures with less sampling variability will be more precise

    • all else equal, a greater sample size will increase precision

Why did we do these simulations?


  • They provide a foundation for statistical inference and for characterizing uncertainty in our estimates, our next topic
  • The best research designs try to minimize bias and maximize precision, or at least strike a good balance between the two

Research Design Implications

Questions?

Research Design Scenarios

  • Discuss with your partner

Scenario 1

A development organization is trying to build evidence demonstrating the impact of its programming on political participation. The program has hundreds of participants. Five program participants volunteered to provide information on their political participation, and each reports participating much more after being in the program. The development organization concludes that the program had the intended impact of increasing political participation.

  • Using language from earlier in class, critically assess this conclusion.

Scenario 1A

A development organization is trying to build evidence demonstrating the impact of its programming on political participation. The program has hundreds of participants. Five program participants were randomly selected to provide information on their political participation, and each reports participating much more after being in the program. The development organization concludes that the program had the intended impact of increasing political participation.

  • Using language from earlier in class, critically assess this research design and conclusion.

Scenario 2

A polling firm is trying to estimate the proportion of likely voters who will vote for President Biden. They draw a random sample of 1,000 adult Americans and ask them whether they intend to vote for Biden. From this, they calculate the proportion who intend to vote for Biden.

  • What is the target population?
  • What is the statistic (or estimate)?
  • Is this estimate a good guess of the population parameter?

Scenario 3

In 2016, polling firms using random sampling tried to estimate the proportion of adult Americans who would vote for Trump. In surveys in America, the response rate is very low (~10 percent) and people without a college education are less likely to respond to surveys than are other Americans.

  • What is the target population?
  • One firm claims that, despite the low response rate, their procedure is unbiased because of random sampling. Based on the information above, do you agree?

Scenario 4

Researchers are interested in studying the political attitudes of ex-combatants in a civil war. They have a limited budget and are debating the best research design. In option A, they use random sampling, which is very expensive given the target population (they have to enumerate ex-combatants and then track them down, which is resource intensive). This would allow them to survey 20 ex-combatants given the budget. In option B, they use snowball sampling, where ex-combatants introduce the researchers to other ex-combatants that they know. This is much cheaper, and would result in a sample size of 250.

  • What are the costs and benefits of option A?
  • What are the costs and benefits of option B?
  • Imagine the researchers are able to make reasonably good guesses about how much bias would be generated by snowball sampling: when would option B be the better choice? When would option A be better?

Bias-Variance Tradeoff

  • Sometimes, a little bit of bias is OK, if it buys us a lot more precision [small bias + high precision]

  • An unbiased study with very low precision is not informative (as we will discover more in future classes)

  • But this depends on how much bias there is: high bias + high precision is not a good outcome.
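
The tradeoff can be made concrete with root mean squared error (RMSE), which combines both problems: RMSE² = bias² + variance. RMSE is not mentioned in the slides and is used here as one reasonable yardstick. The sketch compares an unbiased design with n = 20 against biased designs with n = 250, loosely echoing Scenario 4. The bias mechanism (supporters being `weight` times as likely to be sampled as non-supporters) and the helper `rmse_of_design()` are invented for illustration.

```r
library(dplyr)

set.seed(6501)

# Rebuild the hypothetical population: 4,000 supporters (1), 6,000 non-supporters (0)
pop <- tibble(value = c(rep(1, 4000), rep(0, 6000)))
truth <- mean(pop$value)  # 0.4

# Hypothetical helper: simulate a design and return the RMSE of its estimates.
# weight = 1 is plain random sampling; weight > 1 over-samples supporters.
rmse_of_design <- function(data, n, weight = 1, sims = 500) {
  weighted <- mutate(data, w = if_else(value == 1, weight, 1))
  estimates <- replicate(
    sims,
    mean(slice_sample(weighted, n = n, weight_by = w)$value)
  )
  sqrt(mean((estimates - truth)^2))
}

rmse_unbiased_small <- rmse_of_design(pop, n = 20)                 # no bias, low precision
rmse_biased_mild    <- rmse_of_design(pop, n = 250, weight = 1.2)  # small bias, high precision
rmse_biased_severe  <- rmse_of_design(pop, n = 250, weight = 3)    # large bias, high precision

c(unbiased_n20     = rmse_unbiased_small,
  mild_bias_n250   = rmse_biased_mild,
  severe_bias_n250 = rmse_biased_severe)
```

With these invented weights, the mildly biased large sample should come out ahead of the unbiased small sample, and the severely biased large sample should come out worst. Where the crossover falls depends entirely on how much bias the cheaper design introduces.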

IAFF 6501 Website
