18 Verbs in depth: Aggregating with groups
This chapter continues with the Paycheck Protection Program, or PPP, loans in Arizona. Full documentation of the dataset is in the Appendix. If you haven’t already, look through that documentation before you start.
To follow along, open your R project and create a new Quarto document with the front matter and libraries.
Then load the saved PPP data with this code chunk:

ppp_orig <- readRDS(
  url("https://cronkitedata.s3.amazonaws.com/rdata/ppp_az_loans.RDS")
)
18.1 summarize for statistics
summarize computes summary statistics such as the number of rows in a data frame or the sum of dollar values. It removes the original columns completely, and only produces the summary statistics you compute within that statement. Using summarize alone produces a data frame with one row. It’s the equivalent of putting nothing in your pivot table in Excel other than the “Values” area.
Another way to think of summarize is that it collapses your list of items (loans, in our example) into a statistical report.
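For example, a minimal sketch of a bare summarize (the column name loan_count here is just an illustrative choice) collapses all 169,259 loans into a single row:

```r
# One row, one column: the number of loans in the data frame
ppp_orig |>
  summarize(loan_count = n())
```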
The dreaded NA
You saw in the mutate section that missing values are always a problem. Because they’re unknown, they can’t match anything else, they can’t be considered 0, and they can warp any answers you get. But there’s usually nothing you can do about missing data, so you have to tell the program exactly what to do about them.
There are two choices:
- Let them infect everything they touch, turning everything into NA. In this scenario, a total of the dollar values in a column would be NA if any of the values in that column is missing:
ppp_orig |>
  summarize(total = sum(forgiveness_amount))
- Ignore them in a computation completely, effectively removing that value from your calculation.
There’s no right answer, and it depends on what you’re doing. In some cases, you know that they stand for the value 0, and in others you don’t. We will usually ignore them by adding an argument to every summary function that could be infected by them: na.rm = TRUE, which means, “remove NA’s before you do anything.”
ppp_orig |>
  summarize(total = sum(forgiveness_amount, na.rm = TRUE))
Summary functions
Some of the common functions you’ll use to summarize are:
- mean(column_name, na.rm=T) – for an average: Numbers only
- sum(column_name, na.rm=T): Numbers only
- n() – for “how many”, or “count”: Anything – this counts rows, not values
- n_distinct(column_name): The number of unique entries in the column. Use it to see how many categories there are in a column.
- median(column_name, na.rm=T): Numbers only
- min(column_name, na.rm=T): Dates and numbers
- max(column_name, na.rm=T): Dates and numbers
When used on the whole data frame, it’s customary to just glimpse the output, since there’s only one row:
ppp_orig |>
  summarize(n(),
            mean(amount, na.rm = T),
            mean(forgiveness_amount, na.rm = T),
            min(date_approved, na.rm = T),
            max(date_approved, na.rm = T),
            n_distinct(business_type)
            ) |>
  glimpse()
Rows: 1
Columns: 6
$ `n()` <int> 169259
$ `mean(amount, na.rm = T)` <dbl> 73206.66
$ `mean(forgiveness_amount, na.rm = T)` <dbl> 77050.41
$ `min(date_approved, na.rm = T)` <date> 2020-04-03
$ `max(date_approved, na.rm = T)` <date> 2021-06-29
$ `n_distinct(business_type)` <int> 24
This produced a data frame with 1 row and 6 columns. The column names are the same as the formulas that created them, which is difficult to work with. Create new column names using the name (in back-ticks if it’s got spaces or special characters) and assign them the values of the summaries using the = sign:
ppp_orig |>
  summarize(number_of_rows = n(),
            mean_amount = mean(amount, na.rm = T),
            median_amount = median(amount, na.rm = T),
            mean_forgiven = mean(forgiveness_amount, na.rm = T),
            first_loan = min(date_approved, na.rm = T),
            last_loan = max(date_approved, na.rm = T),
            business_type_count = n_distinct(business_type)
            ) |>
  glimpse()
Rows: 1
Columns: 7
$ number_of_rows <int> 169259
$ mean_amount <dbl> 73206.66
$ median_amount <dbl> 20800
$ mean_forgiven <dbl> 77050.41
$ first_loan <date> 2020-04-03
$ last_loan <date> 2021-06-29
$ business_type_count <int> 24
Note that the mean forgiven removes the loans with missing values for the forgiven amount from the calculation, which is wrong! We need to turn them into zeroes first.
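A minimal sketch of that fix (not part of the original chunk above): use tidyr’s replace_na() to convert the missing forgiveness amounts to zero before taking the mean.

```r
# Treat a missing forgiveness amount as $0 forgiven, then average
ppp_orig |>
  mutate(forgiveness_amount = replace_na(forgiveness_amount, 0)) |>
  summarize(mean_forgiven = mean(forgiveness_amount))
```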
18.2 Grouping for lists
Now that you know how to summarize the whole data frame, you’ll want to start getting totals by category. This is the same thing as a pivot table in spreadsheets: the column names that create the “groups” are the equivalent of the Rows area of a spreadsheet pivot table.
Grouping by one column
In the PPP data, the “draw” refers to which of the two programs was involved - the original one created in April 2020, or the one with stricter criteria passed by Congress that December.
Here’s how we’d get some key statistics by draw:
ppp_orig |>
  group_by(draw) |>
  summarize(first_loan = min(date_approved),
            total_amount = sum(amount),
            total_forgiven = sum(forgiveness_amount, na.rm = T),
            `# of loans` = n()
            )
Here are a couple of things to note about grouped output:
- The only columns saved are the ones that are shown in either the group_by or summarize statements. All of the other original columns have been eliminated. You no longer have them to work with.
- The names of the columns for the summary statistics are the ones defined before the “=” sign in the summarize statement.
- TRAP! Don’t ever name your summary columns the same thing as a group_by column. It will override those names, and your output will be unintelligible.
Note that the names of the columns don’t always follow our standard. In this case, # of loans has a special character and spaces. In order to create or use it, you must enclose it in back-ticks (`) or you’ll get an error.
Grouping by more than one column
If you wanted to know the numbers outstanding and forgiven by draw, you could add another column to the group by:
ppp_orig |>
  group_by(loan_status, draw) |>
  summarize(first_loan = min(date_approved),
            total_amount = sum(amount),
            total_forgiven = sum(forgiveness_amount, na.rm = T),
            loan_ct = n()
            )
A shortcut: count()
If all you want to do is count or add by group, you can use the count() function as a shortcut. It does the exact same thing as a combination of group_by(), summarize( n() ) and arrange() to get the number of items in each category, sorted from most frequent to least:
ppp_orig |>
  count(loan_status, draw,
        sort = TRUE,
        name = "loan_ct")
A new version of the tidyverse, in 2023, has changed some of the options for the summarize() verb that make grouping unnecessary much of the time, but it’s confusing. I’m skipping it for now.
18.3 Using and converting groups
Converting from long to wide data
You’ve looked at different ways to think about data, but when we talk about “granular” data, “database” thinking or “tidy” data, we generally mean we’re working with what some people call “long” rather than “wide” data. That is, there are many rows but few columns.
But your instinct is to want to look at a rectangle of wide data, with the values of one column down the side and another across the top. Helpfully, a function called pivot_wider() does just that – pivots your data from long to wide.1
Start with a simple query with two grouping columns (note that I’ve called the number of loans loan_ct, so it’s easier to work with later on).
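The chapter’s rendered chunk isn’t reproduced here; a minimal sketch of that query, following the earlier examples, might look like this:

```r
# Count the loans for every combination of loan status and draw
ppp_orig |>
  group_by(loan_status, draw) |>
  summarize(loan_ct = n())
```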
This is really hard to read. Turn it on its head with pivot_wider():
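Again, the original chunk isn’t shown; a sketch of that pivot – assuming you want the draw spread across the top and the loan status down the side – might be:

```r
ppp_orig |>
  group_by(loan_status, draw) |>
  summarize(loan_ct = n()) |>
  pivot_wider(id_cols = loan_status,
              names_from = draw,
              values_from = loan_ct)
```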
Normally, you’ll only want to have one summary statistic shown in a rectangle, with one column spread across the top and another column shown in rows. There are a lot of advanced options in pivot functions that let you show more than one statistic at a time, and tell R how to name them. There will be a chapter later on that addresses a lot of the problems you have in reading tables, so we’ll put that off for now.
Here’s an explanation of what the command looks like.
```{r}
pivot_wider ( id_cols = column that you want to see as is down the side,
              names_from = column with the words you want to see across the top,
              values_from = column with the numbers you want in the middle
            )
```
Your instinct will be to turn your data into one of these wider tables right from the start, but try to overcome it. The tidyverse expects to do most of its work in “long” format, saving the “wide” format just for printing.
Totals and subtotals
You noticed that when you created the summaries, there was no option to create a “percent of total” such as the percent of loans in each draw, or the percent of money that had been forgiven.
You can use summary functions outside a summarize statement!
This means that you can compute the percent of total, the same way you used an option in pivot tables. This took me a long time to understand, so try to slow down, and just try it a few times! When you look carefully at your output, you’ll start to understand it better.
The trick is to summarize, then use mutate to add a column with the percentages computed from the totals:
ppp_orig |>
  group_by(draw) |>
  summarize(loan_count = n()) |>
  mutate(all_loans = sum(loan_count),
         pct_of_total = loan_count / all_loans * 100
         )
What happens if you have more than one group?
This is where the idea of grouped data gets a little confusing. It depends on exactly how you did your summarize statement. But if you use the default mechanism, the sum you compute with mutate is a subtotal. The default behavior is that the “groups” are kept for all but the last column listed in the group_by statement, meaning any summaries you do off of the data will refer to the subtotal.
ppp_orig |>
  group_by(draw, loan_status) |>
  summarize(loan_count = n()) |>
  mutate(loans_in_draw = sum(loan_count),
         pct_of_draw = loan_count / loans_in_draw * 100)
Here’s a pretty typical way to do this: Create a subtotal, use it for your percentages, then pivot the percentages:
ppp_orig |>
  group_by(draw, loan_status) |>
  summarize(loan_count = n()) |>
  mutate(loans_in_draw = sum(loan_count),
         pct_of_draw = loan_count / loans_in_draw * 100) |>
  pivot_wider(
    id_cols = c(draw, loans_in_draw),
    names_from = loan_status,
    values_from = pct_of_draw,
    values_fill = 0)
`summarise()` has grouped output by 'draw'. You can override using the
`.groups` argument.
Now you can easily compare the outcome by draw, by reading across to reach 100% and reading down to compare them.
We’ll have a whole chapter / week on making good tables that are readable and understandable. For now, just remember that it’s always possible to turn a data frame on its head, and that you can compute much of what you need BEFORE you do that.
18.4 Practice
Putting together the grouping and summarizing, along with the commands you learned last chapter to filter, arrange and display the head() and tail() of a dataset, should equip you to write the code for these questions (one possible starting point for the first question is sketched after this list):
- Which lenders provided the most loans?
- Which lenders loaned the most money?
- Which borrowers got the least amount of money?
- Show the number of loans in each draw that went to the 24 (including NA) types of businesses. To see them all on one screen, add “, rows.print=25” to the heading of the code chunk, like this: {r, rows.print=25}
- Try to compute the percent of loans that went to projects in each county in Arizona. This will require first filtering, then grouping.
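One possible starting point for the first question – a sketch only, and the column name lender is an assumption about the dataset, so check the documentation in the Appendix:

```r
# Count loans by lender, then show the lenders with the most loans
ppp_orig |>
  group_by(lender) |>
  summarize(loan_ct = n()) |>
  arrange(desc(loan_ct)) |>
  head(10)
```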
18.5 Postscript: Understanding grouped data
You may have noticed an odd warning after you run the code with multiple grouping columns, for example:
`summarise()` has grouped output by 'draw'. You can override using the `.groups` argument.
What does that mean?
When you grouped by loan status and draw, R effectively split up your data frame into five independent and completely divorced piles - one for each combination of draw and loan status that it found. It processed them one by one to create the output data frame that was printed out.
After it’s done summarizing your data, R doesn’t know what you want to do with the piles – keep them, or put everything back together again.
By default, after you group by more than one column, it maintains the separate piles for all but the last column listed in group_by – in this case, the piles defined by loan_status. Here, everything you do after this will work on three piles separately. The message tells you what it did with the piles, and how to change that behavior.
The documentation of grouped data provides details of how each of the tidyverse’s verbs handles grouped data.
Here’s what a “glimpse()” looks like for a data frame that has retained some groups:
ppp_orig |>
  select(loan_status, date_approved:amount) |>
  group_by(loan_status) |>
  glimpse()
Rows: 169,259
Columns: 11
Groups: loan_status [3]
$ loan_status <chr> "Paid in Full", "Paid in Full", "Paid in Full", "Paid…
$ date_approved <date> 2020-04-10, 2020-04-11, 2020-04-11, 2020-04-29, 2020…
$ draw <chr> "First", "First", "First", "First", "First", "First",…
$ borrower_name <chr> "SFE HOLDINGS LLC", "NAVAJO TRIBAL UTILITY AUTHORITY"…
$ borrower_address <chr> "9366 East Raintree Drive", "Po Box 170", "2999 N44th…
$ borrower_city <chr> "Scottsdale", "Fort Defiance", "Phoenix", "Tucson", "…
$ borrower_state <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ",…
$ borrower_zip <chr> "85260", "86504", "85018", "85711", "85250", "85012",…
$ franchise_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ loan_status_date <date> 2021-08-17, 2022-02-05, 2021-09-25, 2021-08-21, 2021…
$ amount <dbl> 10000000, 10000000, 10000000, 10000000, 10000000, 100…
Notice the “Groups” row at the top – that tells you it’s got three piles, defined by the loan_status column.
Getting rid of the message
You can do two things to get rid of the message. I suggest the first of these, since it makes you explicitly decide what to do each time, depending on your goal:
- Add a .groups=... argument that looks like this at the end of the summarize statement. This example tells R to do what it does by default, with no warning:
ppp_orig |>
  group_by(loan_status, draw) |>
  summarize(`# of loans` = n(),
            .groups = "drop_last"
            )
The other possibilities are .groups="drop" and .groups="keep". (Note the period before the word “groups”. I have no idea why, but sometimes options are indicated this way.)
- Add a line to your setup chunk, changing the default behavior through the systemwide options:
options(dplyr.summarise.inform = FALSE)
What does “tidy” data have to do with groupings?
Grouped data effectively breaks out values of categories and treats them independently, which is the equivalent of temporarily treating them as their own data frame.
It’s somewhat difficult in the tidyverse to summarize across columns – it really wants to summarize rows. In a spreadsheet, it’s just as easy to write =sum(B1:J1) as it is =sum(B1:B12). That’s not true in R. This is probably the first time in your R journey that your instinct would be to wreck a perfectly good dataset.
We’ll come back to all of that, but just remember that it’s possible to do all kinds of computations within a group that you’d normally think you want to do across columns. One example is, say, percent change over time. Instead of trying to compute them one by one, you can use groups and the lag() function to do math that depends on a previous row.
| state | county | month | cases |
|---|---|---|---|
| AL | Auburn | 2020-04-01 | 24 |
| AL | Auburn | 2020-05-01 | 35 |
| AL | Auburn | 2020-06-01 | 200 |
covid_data |>
  group_by(state, county) |>
  arrange(month) |>
  mutate(change = cases - lag(cases),
         pct_change = change / lag(cases) * 100)
This method will start over for each county, so it will be NA
for the first month within each county.
This is just one example of how grouped data is quite powerful when used correctly. There are many others, such as extracting the most recent event in a court history by case. Try to think about how one group would be computed, and then don’t worry how the rest will work – R will do that thinking for you.
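A sketch of that court-history idea – the data frame court_events and its columns case_id and event_date are hypothetical, used only to illustrate the pattern:

```r
# Keep just the most recent event for each case
court_events |>
  group_by(case_id) |>
  slice_max(event_date, n = 1) |>
  ungroup()
```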
Yes, there is something called pivot_longer(), which lets you turn rectangular data into the tidy form.↩︎