8 Verbs Part 4: Combining data
You’ll often find yourself attempting to put together two or more data sets. To add combine columns – getting more variables – use one of the join
functions. To add rows – stacking datasets – use bind_rows()
. This tutorial only addresses combining columns.
Reporters use joins to:
Reconstruct a typical relational database such as inspections, court data or campaign finance.
Add information from one table, such demogrphics, to another, such as a list of counties.
For clean-up operations, such as fixing company names in one table and applying the fixes to another.
Match one set of records against a completely different one to find potential stories. Some of the most famous data journalism investigations used this kind of join to find, for example, school bus drivers who have DUI’s or daycare centers run by people with serious criminal histories.
8.1 Key takeaways
- Combining two tables requires exact matches on one or more variables. Close matches don’t count.
- Whenever you can get codes to go with your data, get them – you never know when you’ll run across another dataset with the same code.
- You can use information from one table to learn more about another, especially when you have geographic information by county, Census tract or zip code.
- Many public records databases come with “lookup tables”. Be sure to request them so you can match a code, such as “G” to its translation, such as “Great!”
- When you match two datasets that weren’t intended to be combined, there will always be errors. Your job is to minimize the kind of error you fear most for a given story – false positives or false negatives.
Adding rows instead of columns
Joining only adds columns (or variables) to your data. If you need to stack tables on top of each other, use the bind_rows(data frame 1, data frame 2)
function.
8.2 Concepts of joining
8.2.1 Relational databases
The world is made relational databases, which became popular in business applications as early as the 1970s. These are the data systems that your bank, your hospital or your school use to manage their very complicated businesses. The underlying concept is that, beneath the surface, each item is stored only once in a series of interconnected tables.
For example, the Federal Election Commission holds information about donors in one data table, and information about candidates and other political action committees in another. They link together using the common identifier of a committee ID.
(They don’t need to have the same name, and they don’t need to be in the first column.)
The key is that different nouns describe each table: the donations are one table, which list each individual transaction but don’t show the name or other information about the candidate. The candidates are in a different table, and are linked to the donations they received through a common code.
Even in this example, although Martha McSally and Kyrsten Sinema are listed twice, they are for two separate political entities.
The reason to do this is that you never have to worry that any changes to the candidate information – the treasurer, the address or the office sought – carries over to the donation. It’s only listed once in the candidate table. Most large databases are constructed this way. For example:
- Your school records are held using your student ID, which means that your address and home email only needs to be changed once, not in every class or in every account you have with the school.
- Inspection records, such as those for restaurants, hospitals, housing code violations and workplace safety, typically have at least three tables: The establishment (like a restaurant or a workplace), an inspection (an event on a date), and a violation (something that they found). They’re linked together using establishment ID’s.
- A court database usually has many types of records: A master case record links to information on charges, defendants, lawyers, sentences and court hearings.
Each table, then, is described using a different noun – candidates or contributions; defendants or cases; students or courses. This conforms to the tidy data principle that different types of information are stored in different tables.
8.3 Matchmaking with joins
To link these tables together, you’ll use the verb join
. There are several ways to think of joins, but the ones you’ll most frequently use are:
inner_join
: Once the two tables are fitted together using one or more columns in common, only the rows that are in BOTH of contributing tables are kept. Anything without a row is dropped. In the traditional relational database, these are the default. But in the journalism world, we use it for another purpose: matching datasets to find a needle in a haystack. You’ll see more about that later. You will often get back fewer rows than you started with in this kind of join.left_join
: Once they’re all fitted together, keep everything from the first table listed and drop anything that doesn’t match from the second one. You typically use when you need to join tables that come from different agencies or systems, and you’re not sure you’ll have a good match. You should normally get back the same number of rows you started with when you use this join.
8.3.1 Matchmaking strategies
Here are several ways that reporters use joins in their stories:
“Enterprise” joins
Journalists have taken to calling a specific kind of join “enterprise”, referring to the enterprising reporters who do this. Here, you’ll look for needles in a haystack. Some of the most famous data journalism investigations relied on joining two databases that started from completely different sources, such as:
- Bus drivers who had DUI citations
- Donors to a governor who got contracts from the state
- Day care workers with criminal records
When you match these kinds of datasets, you will always have some error. You always have to report out any suspected matches, so they are time consuming stories.
In the mid-2000s, when some politicians insisted that dead people were voting and proposed measures to restrict registration, almost every regional news organization sent reporters on futile hunts for the dead voters. They got lists of people on the voter rolls, then lists of people who had died through the Social Security Death Index or local death certificates. I never met anyone who found a single actual dead voter, but months of reporter-hours were spent tracking down each lead.
It’s very common for two people to have the same name in a city. In fact, it’s common to have two people at the same home with the same name – they’ve just left off “Jr.” and “Sr.” in the database. In this case, you’ll find matches that you shouldn’t. These are false positives, or Type I errors in statistics.
Also, we rarely get dates of birth or Social Security Numbers in public records, so we have to join by name and sometimes location. If someone has moved, sometimes uses a nickname, or the government has recorded the spelling incorrectly, the join will fail – you’ll miss some of the possible matches. These are false negatives, or Type II errors in statistics.4
In different contexts, you’ll want to minimize different kinds of errors. For example, if you are looking for something extremely rare, and you want to examine every possible case – like a child sex offender working in a day care center – you might choose to make a “loose” match and get lots of false positives, which you can check. If you want to limit your reporting only to the most promising leads, you’ll be willing to live with missing some cases in order to be more sure of the joins you find.
You’ll see stories of this kind write around the lack of precision – they’ll often say, “we verified x cases of….” rather than pretend that they know of them all.
Find cases with interesting characteristics
This might be considered a join-then-filter operation. In this case, you might decide to assign each county some characteristics, such as Trump-to-Biden voting, high income or something else. Then you can pick out counties from another dataset that meet your criteria.
This is common when you have data by zip code or some other geography, and you want to find clusters of interesting potential stories, such as PPP loans in minority neighborhoods.
Summarise data against another dataset
This would be considered grouping-then-joining. You count the number of loans in each zip code so that you can calculate a rate – the amount per household, or something like it.
8.4 Arizona immunization data
I looked up the information from most of the public and charter schools in Arizona against Department of Education statistics to find their federal ID numbers, then downloaded some characteristics of schools from the National Center for Education Statistics characteristics. I effectively created a relational database (with a few errors) by assigning unique codes to each school in the immunization data.
You may do that a lot yourself. For example, you might have to look up Zip Codes to match against Census data, or you might have to look up DUNS numbers to match company names across databases. There are ways to do this using a computer and “fuzzy matching”, but it always involves at least a little work by hand.
Two tables are saved in the R dataset called immune_to_nces.Rda, which you can add to your environment using the load()
command.
load (url("https://github.com/cronkitedata/rstudyguide/blob/master/data/immune_to_nces.Rda?raw=true"))
- You can load R datasets from the web through the
url
function, as above.
There are 2,414 schools in the NCES database, but only 841 schools in the immunizations because we’ve only kept schools that had students in Grade 6. There were seven schools that I couldn’t find in the NCES data, and their IDs are blank.
Here are their variables:
glimpse(grade6_to_nces)
## Rows: 841
## Columns: 17
## $ rowid <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ nces_id <chr> "040906000652", "040906000915", "040079703230", "0…
## $ school_name <chr> "ABRAHAM LINCOLN TRADITIONAL SCHOOL", "ACACIA ELEM…
## $ address <chr> "10444 N 39TH AVE", "3021 W EVANS DR", "7102 W VAL…
## $ city <chr> "PHOENIX", "PHOENIX", "TUCSON", "TUCSON", "PHOENIX…
## $ county <chr> "MARICOPA", "MARICOPA", "PIMA", "PIMA", "MARICOPA"…
## $ zip_code <chr> "85051", "85053", "85757", "85705", "85033", "8501…
## $ school_nurse <chr> "NO", "YES", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ school_type <chr> "PUBLIC", "PUBLIC", "CHARTER", "CHARTER", "CHARTER…
## $ enrolled <dbl> 59, 134, 53, 42, 113, 56, 75, 44, 45, 106, 53, 87,…
## $ num_immune_mmr <dbl> 58, 134, 52, 40, 113, 55, 73, 39, 44, 105, 46, 84,…
## $ num_exempt_mmr <dbl> 1, 0, 1, 2, 0, 1, 2, 0, 1, 1, 7, 3, 0, 0, 0, 2, 2,…
## $ num_compliance_mmr <dbl> 59, 134, 53, 42, 113, 56, 75, 44, 45, 106, 53, 87,…
## $ num_pbe <dbl> 5, 2, 1, 0, 3, 6, 4, 0, 23, 4, 9, 3, 1, 1, 0, 4, 0…
## $ num_medical_exempt <dbl> 2, 0, 0, 7, 0, 6, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,…
## $ num_pbe_exempt_all <dbl> 5, 0, 1, 0, 0, 0, 1, 0, 2, 1, 10, 3, 0, 0, 0, 2, 2…
## $ match_type <chr> "Yes", "Manual", "Yes", "Yes", "Yes", "Yes", "Yes"…
glimpse (nces_master)
## Rows: 2,414
## Columns: 15
## $ rowid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ nces_school_id <chr> "040553000451", "040463000999", "040010601892", "0…
## $ nces_district_id <chr> "0405530", "0404630", "0400106", "0400417", "04067…
## $ nces_district_name <chr> "Nogales Unified District", "Marana Unified Distri…
## $ nces_school_name <chr> "A J MITCHELL ELEMENTARY SCHOOL", "A. C. E.", "AAE…
## $ nces_school_type <chr> "1-Regular school", "1-Regular school", "1-Regular…
## $ nces_urban <chr> "33-Town: Remote", "41-Rural: Fringe", "11-City: L…
## $ nces_student_ct <dbl> 406, 10, 313, 444, 517, 608, 868, 643, NA, 61, 59,…
## $ nces_fte_teacher <dbl> 20.50, 6.80, NA, NA, 25.60, 32.75, 45.87, 0.00, NA…
## $ nces_ratio <dbl> 19.80, 1.47, NA, NA, 20.20, 18.56, 18.92, NA, NA, …
## $ nces_lowest <chr> "Kindergarten", "7th Grade", "9th Grade", "9th Gra…
## $ nces_highest <chr> "5th Grade", "12th Grade", "12th Grade", "12th Gra…
## $ nces_school_level <chr> "Elementary", "High", "High", "High", "Elementary"…
## $ nces_county_name <chr> "Santa Cruz County", "Pima County", "Maricopa Coun…
## $ nces_fips <chr> "04023", "04019", "04013", "04013", "04025", "0401…
8.4.1 Setting up the data
In this case, we want to get information that the federal government had on the schools attached to the immunization data. In particular, we’d like to be able to generate statistics by district, by urbanization and type of school, and we’d like to keep the code for the county so we can link it up to other datasets.
To make it simple, I’ll just create a small set of data for each table:
<-
immune %>%
grade6_to_nces select (rowid, nces_id, school_name, city, county, zip_code, school_nurse, school_type,
enrolled, num_immune_mmr)
<-
school_list %>%
nces_master select (nces_school_id, nces_district_id, nces_district_name, nces_school_type,
nces_urban, nces_ratio, nces_school_level, nces_fips)
8.4.2 Apply the join
Here are two ways to join:
%>%
immune inner_join (school_list, by=c("nces_id" = "nces_school_id")) %>%
glimpse ()
## Rows: 834
## Columns: 17
## $ rowid <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ nces_id <chr> "040906000652", "040906000915", "040079703230", "0…
## $ school_name <chr> "ABRAHAM LINCOLN TRADITIONAL SCHOOL", "ACACIA ELEM…
## $ city <chr> "PHOENIX", "PHOENIX", "TUCSON", "TUCSON", "PHOENIX…
## $ county <chr> "MARICOPA", "MARICOPA", "PIMA", "PIMA", "MARICOPA"…
## $ zip_code <chr> "85051", "85053", "85757", "85705", "85033", "8501…
## $ school_nurse <chr> "NO", "YES", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ school_type <chr> "PUBLIC", "PUBLIC", "CHARTER", "CHARTER", "CHARTER…
## $ enrolled <dbl> 59, 134, 53, 42, 113, 56, 75, 44, 45, 106, 53, 87,…
## $ num_immune_mmr <dbl> 58, 134, 52, 40, 113, 55, 73, 39, 44, 105, 46, 84,…
## $ nces_district_id <chr> "0409060", "0409060", "0400797", "0400368", "04009…
## $ nces_district_name <chr> "Washington Elementary School District", "Washingt…
## $ nces_school_type <chr> "1-Regular school", "1-Regular school", "1-Regular…
## $ nces_urban <chr> "11-City: Large", "11-City: Large", "21-Suburb: La…
## $ nces_ratio <dbl> 18.56, 18.92, NA, NA, NA, NA, NA, NA, NA, 19.13, N…
## $ nces_school_level <chr> "Elementary", "Elementary", "Elementary", "Other",…
## $ nces_fips <chr> "04013", "04013", "04019", "04019", "04013", "0401…
You can see that the information from the federal Education Department was added to the immunization data, but we lost seven records – the seven that I couldn’t find in the federal department.
To preserve these records, you’ll usually protect one of the tables – the one you care about most – and keep everything, even if it doesn’t match. To do that, use a left
or right
join, depending on whether you mention the table first or second. In this case:
<-
immune_joined %>% # the table I want to protect
immune left_join ( school_list, # the table I want to apply to my original data frame
by=c("nces_id"="nces_school_id") ) # the variable that is the same in the two tables.
glimpse(immune_joined)
## Rows: 841
## Columns: 17
## $ rowid <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ nces_id <chr> "040906000652", "040906000915", "040079703230", "0…
## $ school_name <chr> "ABRAHAM LINCOLN TRADITIONAL SCHOOL", "ACACIA ELEM…
## $ city <chr> "PHOENIX", "PHOENIX", "TUCSON", "TUCSON", "PHOENIX…
## $ county <chr> "MARICOPA", "MARICOPA", "PIMA", "PIMA", "MARICOPA"…
## $ zip_code <chr> "85051", "85053", "85757", "85705", "85033", "8501…
## $ school_nurse <chr> "NO", "YES", "NO", "NO", "NO", "NO", "NO", "NO", "…
## $ school_type <chr> "PUBLIC", "PUBLIC", "CHARTER", "CHARTER", "CHARTER…
## $ enrolled <dbl> 59, 134, 53, 42, 113, 56, 75, 44, 45, 106, 53, 87,…
## $ num_immune_mmr <dbl> 58, 134, 52, 40, 113, 55, 73, 39, 44, 105, 46, 84,…
## $ nces_district_id <chr> "0409060", "0409060", "0400797", "0400368", "04009…
## $ nces_district_name <chr> "Washington Elementary School District", "Washingt…
## $ nces_school_type <chr> "1-Regular school", "1-Regular school", "1-Regular…
## $ nces_urban <chr> "11-City: Large", "11-City: Large", "21-Suburb: La…
## $ nces_ratio <dbl> 18.56, 18.92, NA, NA, NA, NA, NA, NA, NA, 19.13, N…
## $ nces_school_level <chr> "Elementary", "Elementary", "Elementary", "Other",…
## $ nces_fips <chr> "04013", "04013", "04019", "04019", "04013", "0401…
Here are the rows that were kept without a match:
rowid | nces_id | school_name | city | county | zip_code | school_nurse | school_type | enrolled | num_immune_mmr | nces_district_id | nces_district_name | nces_school_type | nces_urban | nces_ratio | nces_school_level | nces_fips |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | NA | ARCADIA NEIGHBORHOOD LEARNING CENTER | SCOTTSDALE | MARICOPA | 85251 | YES | PUBLIC | 52 | 49 | NA | NA | NA | NA | NA | NA | NA |
209 | NA | DENNEHOTSO BOARDING SCHOOL | DENNEHOTSO | APACHE | 86535 | NO | PUBLIC | 21 | 21 | NA | NA | NA | NA | NA | NA | NA |
414 | NA | J.C.U.S.D. - JOSEPH CITY PRESCHOOL | JOSEPH CITY | NAVAJO | 86032 | NO | PUBLIC | 33 | 28 | NA | NA | NA | NA | NA | NA | NA |
452 | NA | L.H.U.S.D. #1 - DEVELOPMENTAL PRESCHOOL | LAKE HAVASU CITY | MOHAVE | 86403 | YES | PUBLIC | 69 | 67 | NA | NA | NA | NA | NA | NA | NA |
522 | 590014800009 | MANY FARMS COMMUNITY SCHOOL, INC | MANY FARMS | APACHE | 86538 | NO | PUBLIC | 38 | 38 | NA | NA | NA | NA | NA | NA | NA |
785 | NA | ST THERESA LITTLE FLOWER PRESCHOOL | PHOENIX | MARICOPA | 85018 | YES | PUBLIC | 42 | 42 | NA | NA | NA | NA | NA | NA | NA |
821 | NA | T.U.S.D.#1 - C. E. ROSE PRESCHOOL PROGRAM | TUCSON | PIMA | 85714 | YES | PUBLIC | 86 | 86 | NA | NA | NA | NA | NA | NA | NA |
In real life, you’d have to decide how much you care about these missing schools – does it ruin your story, or can you just mention that you were unable to get information for a handful of schools, amounting to about 350 students?
8.4.3 Use the joined table
Now I might want to look at which school districts have low immunization rates:
%>%
immune_joined # by school
mutate (school_pct = num_immune_mmr / enrolled * 100 ) %>%
# by district
group_by (nces_district_name, county) %>%
summarise ( num_schools = n() ,
total_enrolled = sum(enrolled),
total_immune = sum (num_immune_mmr),
median_immune = median (school_pct)
%>%
) # district total pct (immunized / total students)
mutate ( pct_immune = total_immune/ total_enrolled * 100) %>%
select (nces_district_name, county, num_schools, pct_immune, total_enrolled, median_immune) %>%
filter (median_immune <= 93) %>%
head (10) %>%
kable (digits=1)
## `summarise()` has grouped output by 'nces_district_name'. You can override using the `.groups` argument.
nces_district_name | county | num_schools | pct_immune | total_enrolled | median_immune |
---|---|---|---|---|---|
Acclaim Charter School | MARICOPA | 1 | 88.6 | 44 | 88.6 |
American Leadership Academy Inc. | MARICOPA | 5 | 91.9 | 467 | 92.9 |
Arete Preparatory Academy | MARICOPA | 1 | 90.9 | 99 | 90.9 |
Arizona Connections Academy Charter School Inc. | MARICOPA | 1 | 85.6 | 201 | 85.6 |
Arizona Montessori Charter School at Anthem | MARICOPA | 1 | 86.5 | 37 | 86.5 |
Ball Charter Schools (Dobson) | MARICOPA | 1 | 90.0 | 40 | 90.0 |
BASIS School Inc. 12 | YAVAPAI | 1 | 85.7 | 91 | 85.7 |
BASIS School Inc. 6 | COCONINO | 1 | 92.6 | 95 | 92.6 |
BASIS School Inc. 9 | MARICOPA | 1 | 91.7 | 133 | 91.7 |
Benchmark School Inc. | MARICOPA | 1 | 85.5 | 55 | 85.5 |
8.5 Joining risks
8.5.1 joining tl;dr
There are lots of risks in joining tables that you created yourself, or that were created outside a big relational database system. Keep an eye on the number of rows returned every time that you join – you should know what to expect.
8.5.2 Double counting with joins
We won’t go into this in depth, but just be aware it’s easy to double-count rows when you join. Here’s a made-up example, in which a zip code is on the border and is in two counties:
Say you want to use some data on zip codes :
zip code | county | info |
---|---|---|
85232 | Maricopa | some data |
85232 | Pinal | some more data |
and match it to a list of restaurants in a zip code:
zip code | restaurant name |
---|---|
85232 | My favorite restaurant |
85232 | My second-favorite restaurant |
When you match these, you’ll get 4 rows:
zip code | county | info | restaurant name |
---|---|---|---|
85232 | Maricopa | some data | My favorite restaurant |
85232 | Pinal | some more data | My favorite restaurant |
85232 | Maricopa | some data | My second-favorite restaurant |
85232 | Pinal | some more data | My second-favority restaurant |
Now, every time you try to count restaurants, these two will be double-counted.
In computing, this is called a “many-to-many” relationship – there are many rows of zip codes and many rows of restaurants. In journalism, we call it spaghetti. It’s usually an unintended mess.
8.5.3 Losing rows with joins
The opposite can occur if you aren’t careful and there are items you want to keep that are missing in your reference table. That’s what happened in the immunization data above for the seven schools that I couldn’t find.
8.6 Resources
The “Relational data” chapter in the R for Data Science textbook has details on exactly how a complex data set might fit together.
An example using a superheroes dataset, from Stat 545 at the University of British Columbia
8.6.1 Practice
Immunization data
Create a new table from the immunizations and DOE data used in this example, then see if you can find any patterns in immunization rates by school district rather than by county. (Note that charter school companies are each their own district.) Do the same by looking at urban vs. rural schools.
Campaign finance data
There are two tables saved in the R data file, “azcampfin.Rda”. One holds information on contributions available from the offiical FEC database as of Feb. 23, 2020 and the other holds information on the candidates and committees.
The following codes are used in this dataset, which you may want to save into data frame. Here is some code you can use to create a lookup table for the transaction types. These codes can be joined with the column called transaction_tp
in the contributions (or arizona20
) data frame.
<- tribble (
transaction_types ~tcode, ~contrib_type,
"10", "To a Super PAC",
"11", "Native American tribal",
"15", "Individual contrib",
"15C", "From a candidate",
"15E", "Earmarked (eg, ActBlue)",
"20Y", "Non-federal refund",
"22Y", "Refund to indiv.",
"24I", "Earmarked check passed on",
"24T", "Earmarked contrib passed on",
"30", "To a convention account",
"31", "To a headquarters account",
"32", "To a recount effort",
"41Y", "Refund from headquarters account"
)
(These are pretty complicated definitions in the federal campaign finance world. For now, don’t worry much about what they mean. Refunds are shown in the data as negative numbers, which is what you want.)
Try to analyze some of this by putting together the datasets and finding interesting items or patterns.
I remember them by thinking of the boy who cried wolf. When the village came running and there was no wolf, it was a Type I error, or false positive ; when the village ignored the boy and there was a wolf, it was a Type II error, or false negative.↩︎