IAFF 6501 – Where Do Data Come From?

Where do data come from?

In a simple sense

Someone gives them to you
You download them from the internet
You or someone else conducts a study
We will focus on this later in class: getting data and using them in R

In a deeper sense

There is a process that generates the data we work with
When analyzing and drawing conclusions from data, we need to keep this process in mind

Key aspects of the process

Sampling: how were the units that we are examining selected?
Research Design: does our design allow us to draw causal conclusions?
Measurement: do the measures we are examining really capture the concepts/constructs/outcomes that we think they do?

We will come back to these issues throughout the course, but they are important to raise at the outset and to keep in mind

We need to learn about the data we are using
We need to be critical about the data we are using (or that others are using)

How were units selected into our data?

Two studies examine a civic education program and use a survey to understand satisfaction with, and other attitudes about, the program.

Study 1: Some participants volunteer to answer survey questions after the program is completed.
Study 2: Participants are randomly selected to answer survey questions after the program is completed.
Which do we prefer? Why?
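
One way to explore this question is a small simulation. The sketch below is a hypothetical illustration in R, not data from any real program: the satisfaction scores, the sample sizes, and the assumption that more satisfied participants are more likely to volunteer are all made up.

```r
# Hypothetical simulation: every number here is invented for illustration.
set.seed(6501)

n <- 10000
# Satisfaction (1-10 scale) for every program participant (the population)
satisfaction <- pmin(pmax(round(rnorm(n, mean = 6, sd = 2)), 1), 10)

# Study 1: volunteers. Assumption: the more satisfied a participant is,
# the more likely they are to volunteer to answer the survey.
p_volunteer <- plogis(-3 + 0.4 * satisfaction)
volunteers <- satisfaction[rbinom(n, size = 1, prob = p_volunteer) == 1]

# Study 2: a simple random sample of 500 participants
random_sample <- sample(satisfaction, size = 500)

mean(satisfaction)   # population mean
mean(volunteers)     # mean among volunteers
mean(random_sample)  # mean in the random sample
```

Under this assumed volunteering pattern, the volunteer mean sits above the population mean, while the random sample lands close to it. If volunteering were unrelated to satisfaction, the two studies would look much more alike.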

How were units selected into our data?

Some violent events datasets rely on newspaper reports (and web scraping) to identify specific instances and locations of violence in specific countries.
What are the costs and benefits of this approach?

How were units selected into our data?

We usually have a population that we are interested in learning about
We need to think about whether the sample we have (the specific rows in our dataset) is useful for teaching us about that population
More on this in future classes!

Causal Conclusions

In a post-conflict reconciliation program, participants were surveyed about their attitudes toward out-group members right before the program. Six months later, they were surveyed again. Participants were more favorable toward out-group members six months later.
Is this evidence that the program caused an improvement in out-group attitudes? Why or why not?

Causal Conclusions

Researchers have long noticed that, on average, wealthier countries are more democratic than poorer countries.
Is this evidence that wealth causes democracy? Why or why not?

Causal Conclusions

Two studies want to know whether an education program improves employment prospects.

Study 1: Some participants are randomly assigned to the program while others are not (in the control group). The employment rates of participants and non-participants are compared at the end of the study to determine program impact.
Study 2: Participants apply to be part of the program. The employment rate of participants is compared to the employment rate of a set of randomly selected non-participants at the end of the study to determine program impact.
Which do we prefer? Why?
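
The two designs can be compared in a simulated world where the true program effect is known. The sketch below is hypothetical: `motivation`, the assumed effect of +0.10 on the probability of employment, and the application rule are invented for illustration and are not taken from any real study.

```r
# Hypothetical simulation: every number here is invented for illustration.
set.seed(6501)

n <- 20000
motivation <- rnorm(n)  # unobserved trait that also affects employment

# Employment depends on motivation and on the program.
# Assumed true program effect: +0.10 on the probability of employment.
employ <- function(treated) {
  p <- pmin(pmax(0.5 + 0.10 * treated + 0.10 * motivation, 0), 1)
  rbinom(length(p), size = 1, prob = p)
}

# Study 1: a coin flip decides who gets the program
treated1 <- rbinom(n, size = 1, prob = 0.5)
employed1 <- employ(treated1)
est1 <- mean(employed1[treated1 == 1]) - mean(employed1[treated1 == 0])

# Study 2: people apply. Assumption: more motivated people are more likely
# to apply. Participants are compared to 2,000 randomly chosen non-participants.
treated2 <- rbinom(n, size = 1, prob = plogis(1.5 * motivation))
employed2 <- employ(treated2)
comparison <- sample(which(treated2 == 0), size = 2000)
est2 <- mean(employed2[treated2 == 1]) - mean(employed2[comparison])

est1  # estimate under random assignment
est2  # estimate when participants select themselves into the program
```

In this setup the first estimate lands near the assumed effect, while the second is larger because it mixes the program effect with the motivation gap between applicants and non-applicants. Randomly selecting the comparison group does not fix that, because participation itself was not random.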

Measurement

In a voter turnout study, participants are randomly assigned to receive significant encouragement from a civic organization to turn out to vote (or to the control group).
To measure program impact, those in the study are asked after the election whether they voted or not.
What do we think of this measurement strategy?
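
One concern with self-reported turnout can be sketched in a simulation. The numbers below are hypothetical, including the key assumption that non-voters who were encouraged are more likely to say they voted than non-voters in the control group.

```r
# Hypothetical simulation: every number here is invented for illustration.
set.seed(6501)

n <- 20000
encouraged <- rbinom(n, size = 1, prob = 0.5)

# Assumed true turnout: 50% in control, 55% with encouragement
voted <- rbinom(n, size = 1, prob = 0.50 + 0.05 * encouraged)

# Assumption: some non-voters report voting, and encouraged non-voters
# do so more often (20% versus 10%). Voters always report truthfully.
p_misreport <- ifelse(encouraged == 1, 0.20, 0.10)
reported <- ifelse(voted == 1, 1, rbinom(n, size = 1, prob = p_misreport))

true_effect <- mean(voted[encouraged == 1]) - mean(voted[encouraged == 0])
reported_effect <- mean(reported[encouraged == 1]) - mean(reported[encouraged == 0])

true_effect      # effect on actual turnout
reported_effect  # effect on self-reported turnout
```

Even with random assignment, the self-reported measure overstates the effect in this setup, because the measurement error is related to the treatment. A measure that does not depend on what respondents say, such as official turnout records where they exist, would avoid this particular problem.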

Measurement

A democracy organization wants to generate a measure of how democratic every country in the world is. To do so, they send survey questions to professors at universities in the United States. They use answers to the questions to generate their measures.
What do we think of this measurement strategy?

Big picture

Always investigate where your data come from
Ask questions about this and be critical when consuming data
We will come back to some of these themes in more detail

Break

“Tidy” Data

Each column represents a single variable
Each row represents a single observation
Each cell represents a single value
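
To make the three rules concrete, here is a minimal sketch in base R that starts from a layout that is not tidy (one column per year) and reshapes it so that each row is one country-year. The column names `polyarchy_2000`, `polyarchy_2001`, and `polyarchy` are made up for this illustration; the scores are the 2000 and 2001 `v2x_polyarchy` values for Mexico, Suriname, and Sweden, typed in by hand.

```r
# Not tidy: the variable "year" is hidden in the column names
wide <- data.frame(
  country_name   = c("Mexico", "Suriname", "Sweden"),
  polyarchy_2000 = c(0.671, 0.783, 0.914),
  polyarchy_2001 = c(0.682, 0.781, 0.914)
)

# Tidy: one column per variable (country, year, score), one row per country-year
long <- reshape(
  wide,
  direction = "long",
  varying   = c("polyarchy_2000", "polyarchy_2001"),
  v.names   = "polyarchy",
  timevar   = "year",
  times     = c(2000, 2001),
  idvar     = "country_name"
)

# Sort by country and year, and drop the row names that reshape() generates
long <- long[order(long$country_name, long$year), ]
rownames(long) <- NULL
long
```

Packages such as tidyr offer a more readable way to do the same reshaping; base R is used here only so the sketch runs without installing anything.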

“Tidy” Data

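# Load vdemlite, which provides access to V-Dem democracy data
library(vdemlite)
# Fetch the electoral democracy index (v2x_polyarchy) for the year 2000
myData <- fetchdem(indicators = "v2x_polyarchy", start_year = 2000, end_year = 2000)
# Show the first six rows
head(myData)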

  country_name country_text_id year v2x_polyarchy
1       Mexico             MEX 2000         0.671
2     Suriname             SUR 2000         0.783
3       Sweden             SWE 2000         0.914
4  Switzerland             CHE 2000         0.888
5        Ghana             GHA 2000         0.667
6 South Africa             ZAF 2000         0.745

“Tidy” Data

# Same indicator, now for two years (2000 and 2001)
myData2 <- fetchdem(indicators = "v2x_polyarchy", start_year = 2000, end_year = 2001)
# Each row is now one country in one year
head(myData2)

  country_name country_text_id year v2x_polyarchy
1       Mexico             MEX 2000         0.671
2       Mexico             MEX 2001         0.682
3     Suriname             SUR 2000         0.783
4     Suriname             SUR 2001         0.781
5       Sweden             SWE 2000         0.914
6       Sweden             SWE 2001         0.914
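
With two years of data, the observation in each row is a country-year rather than a country. A quick way to confirm this is to check that no country and year pair appears twice. The sketch below types in the six rows printed above by hand (the object name `country_years` is made up) so that it runs without vdemlite; the same checks could be applied to `myData2`.

```r
# The six rows shown above, entered by hand for illustration
country_years <- data.frame(
  country_name    = c("Mexico", "Mexico", "Suriname", "Suriname", "Sweden", "Sweden"),
  country_text_id = c("MEX", "MEX", "SUR", "SUR", "SWE", "SWE"),
  year            = c(2000, 2001, 2000, 2001, 2000, 2001),
  v2x_polyarchy   = c(0.671, 0.682, 0.783, 0.781, 0.914, 0.914)
)

# TRUE if any (country, year) pair appears more than once
has_duplicates <- any(duplicated(country_years[, c("country_text_id", "year")]))
has_duplicates

# Number of rows per country: one per year
rows_per_country <- table(country_years$country_text_id)
rows_per_country
```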

Accessing and Working with Data

Check out the chapter on Data Wrangling in ModernDive

Posit Cloud

Let’s move to Posit Cloud