17  Summarizing detail into totals

In the wake of a police shooting in 2016, reporter Mitch Smith obtained a list of traffic stops from the St. Anthony Police Department in Minnesota. He was writing a story on Philandro Castile’s death and was running out of time. He wanted to answer a simple question: Were minority motorists more likely to be stopped in St. Anthony than whites?

Rob Gebeloff made a quick pivot table to answer the question. That night, Smith wrote:

In each of the three small suburbs patrolled by the St. Anthony police, less than 10 percent of the population is black. But data released by the city on Tuesday showed that a far higher percentage of the people ticketed or arrested by St. Anthony officers were African-American.

Last year, around 19 percent of those cited by St. Anthony police were black, as were roughly 41 percent of people arrested by the department, a review of the city’s data showed. Those percentages do not include the large number of defendants whose race was unknown.

17.1 Summarizing with groups

Summarizing a list of items in R is done using the verbs group_by and summarize. Think of grouping as answering the questions, “How many?” and “How much?”. They are particularly powerful when your question also has the words “the most” or the “the least” or “of each”. Some examples:

  • Which Zip Code had the most crimes?
  • What month had the least total rainfall?
  • How much did each candidate raise last quarter?
  • In playing cards, how many of each suit do I have in my hand?
  • On average, are Cronkite students taller or shorter than in other schools?

Confusing grouping with sorting or arranging

Many reporters confuse this summarization with “sorting”. One reason is that this is how we express the concept in plain language: “I want to sort Skittles by color”.

But in data analysis, sorting and and grouping are very different things. Sorting, which is done in R via the arrange() verb, involves shuffling a data frame’s rows into some order based on the values in a column. Grouping is a way to aggregate and compute summary statistics such as a count (the number of items), sum (how much they add up to), or average for category. It means “make piles and compute statistics for each one.”

When to use filter vs. summary

Something that trips up beginners is a desire to see details and totals at the same time, which is more difficult than it sounds.

A filter is used to display your selected items as a list. You’ll get to see all of the detail and every column. As a convenience, R shows you how many items are in that filtered list (usually). That’s great when you want to just look at them, or get more information about them. For instance, if you had a list of crimes by ZIP code, you might just want to see the list in your neighborhood – where, exactly, were they? When did they happen? Was it at night or the morning? What crimes happened on which blocks?

Aggregates are used when you just want to see summaries – does my ZIP code have more crime than others? Are robberies more common than car theft in my Zip code, and how does that compare to others?

In practice, you’ll go back and forth between summary and detail. They’re both important, just different.

17.2 Motivational exercise

This exercise helps you internalize the differences between filtering and aggregating (or grouping). It uses a list of fatal police shootings logged by the Washington Post, downloaded in late 2022. The column names are pretty self-explanatory, and we’ll only be looking at the ethnicity column. (Scroll to the right if you can’t see it.)

  1. Create a Quarto document in your project
  2. Add the usual setup chunk to load the tidyverse, using the library command. Copy and paste the chunk from a previous lesson if you need to.
  3. Copy and paste this into a code chunk to read the data.
waposhootings <- 
  readRDS ( 
        url ( 
           "https://cronkitedata.s3.amazonaws.com/rdata/waposhootings.RDS"
           )
        )

Examine the data:

waposhootings |> 
  sample_n (50)

Filtering method

Using a pen and paper, write out something that looks like the table below. The total number comes from the number of rows in the data frame.

Ethnicity # of victims % of total
Black, non-Hispanic
White, non-Hispanic
Hispanic
TOTAL 5,945 100%

Copy and paste this code to list all fatal shootings of Black, non-Hispanic victims. Look at the number of rows returned and enter it into your handmade table.

waposhootings |> 
  filter ( ethnicity == "Black, non-Hispanic")

Run this code to get the number of White, non-Hispanic victims and write the answer down on your handmade table

waposhootings |> 
  filter ( ethnicity == "White, non-Hispanic")

Run this filter to get the number of Hispanic victims, and write the answer down.

waposhootings |> 
  filter ( ethnicity == "Hispanic")

Compute the percentages

  1. Compute the percent of total each group represents. Do this on a calculator (or just type the formula into Google). Remember, the percent of total formula is: group / total * 100

  2. Write the answers into your handwritten chart.

  3. Now, rewrite your chart so that the highest percentage is on the first row, the second highest on the second row, and the third highest on the third row.

  4. Repeat your calculations, since you always have to do every hand calculation twice. They won’t add up to 100% because we left out people who had an unknown ethnicity or other ethnicity.

17.3 Aggregating for an easier way

That’s a lot of work. Imagine if you had to do this 50 times for 50 states, or even more for every Zip code? There has to be an easier way, and there is: Grouping and summarizing.

We’ll build the code one piece at a time so you can see what’s happening:

Copy this code, one chunk at a time, into your Quarto document and run them to see what the answers are.

  1. Get the totals
waposhootings |> 
  summarize ( num_shootings = n() ) 
  1. Add a “group” to summarize by ethnicity
waposhootings |> 
  group_by (ethnicity) |> 
  summarize (num_shootings = n() )
  1. Arrange to get the highest number at the top
waposhootings |> 
  group_by ( ethnicity) |> 
  summarize (num_shootings = n() ) |> 
  arrange ( desc (num_shootings ))
  1. Compute the percent

    Notice that the “mutate” line is exactly the same formula you typed into your calculator. The only difference is that we computed the total number of shootings instead of typing it.

    This is one of the most common patterns you’ll use in data reporting, and we can get much more sophisticated with it. Don’t expect to understand it fully right away, but you should recognize the idea: If you find yourself filtering in order to count something, consider using grouping instead.

waposhootings |> 
  group_by ( ethnicity) |> 
  summarize ( num_shootings = n() ) |> 
  mutate ( total = sum ( num_shootings ) , 
           percent = num_shootings /total * 100 ) |> 
  arrange ( desc ( num_shootings))

17.4 Differences

When you filtered, you could see the details of every row. That’s great if you want to explore it or find examples. But it’s not great for that sentence you need when you want to say which group is the largest or smallest, or compare the numbers across groups.

When you grouped, you got statistics by ethnicity, but you couldn’t see the details.

Typically, you’ll go back and forth between aggregating and filtering, using each method to get to different goals. Don’t fall in the trap of trying to force all of your questions into a filter, or all of your questions into aggregation. Instead, think about which method will give you the answers you want most efficiently.

17.5 Your questions:

On your Quarto document, try to write out, in plain English, what you just did. See if you can articulate what the difference between filtering and aggregating is, in your own words.