6 Verbs Part 2: group_by and summarise

6.1 Key takeaways

group_by and summarise are the R equivalent to Excel’s pivot tables
Together, they’re usually used to answer questions like “how many?” and “how much?”.
You will rarely use group_by without summarise.
group_by is picky - it will create a separate answer for every variant of a category: “Lettuce” could be one, and “lettuce” could be another.
Compute percentages using subtotals from a group (rolling up groups)
Rearrange your data for easier reading with pivot_wider

This continues with the Arizona immunization data from chapter 4 and assumes you’ve gone through that chapter. It uses some of the other verbs introduced there. If you’ve started a new project, you can re-download the data in R form, and the load it in an R Markdown code chunk with this line:

load("az-immunizations-grade6.Rda")

(Be sure to preface it with a subfolder name if you’ve saved it in one.)

When you run your first code chunk you should have a data frame called grade6_counts in your environment tab:

6.2 group_by / summarise² basics

Think of grouping as making piles of things while summarise counts, adds or creates other summary statistics.

Where filtering and selecting let you hone in on individual items that meet specific criteria, grouping and summarising help you answer the questions of “how many?” and “how much?” They’re the equivalent to pivot tables in Excel.

Reporters use grouping to:

Focus further reporting on categories with particular characteristics, like neighborhoods with the most home loan rejections, or families who contributed the most to political candidates.
Produce summary statistics for use in stories, such as the number of complaints by type of business or the number of c-sections by hospital or doctor.
Test data they receive from government and others for errors, such as missing codes, impossible combinations, sudden and suspicious patterns or incorrect totals when checked against an independent source.

This query counts the number of schools and adds up the number of students by type of school while giving them descriptive names:

grade6_counts %>%
  group_by (school_type) %>%
  summarise ( schools = n() ,
              student = sum (enrolled, na.rm=TRUE)
      )

Notice the na.rm=TRUE in the sum() function. You should almost always add this to any summary math function, like sum(), median(), mean(), etc. If you don’t, a single missing value (NA) will result in an answer of NA. That’s good practice and it makes sense: you should want to know if your data has missing or invalid information. But normally you have no choice but to ignore it.

6.2.1 Other summary functions

Here are some common functions you’ll use in the summarise statement

Function	description
n()	the number of rows that are in a grouping
n_distinct ()	the number of unique values of the variable you name
sum()	the total of a numeric variable
median () , mean()	The median or mean of a numeric variable
min() , max()	The highest and lowest of a numeric variable
first() , last()	The first and last of a variable - you can indicate the order. Useful to find the first or last address found in a date rather than grouping by them, to avoid minor misspellings.

6.2.2 Grouping by more than one variable

This creates every combination of school_type and school_nurse, and the number of schools that are in each of those piles.

grade6_counts %>%
  group_by ( school_type, school_nurse) %>% 
  summarise ( num_schools = n() )

## `summarise()` has grouped output by 'school_type'. You can override using the `.groups` argument.

school_type	school_nurse	num_schools
CHARTER	NO	169
CHARTER	YES	24
PRIVATE	NO	32
PRIVATE	YES	35
PUBLIC	NO	240
PUBLIC	YES	408

6.2.3 Percentages by rolling up totals

To compute a percentage of total, use the summary statistic within a mutate statement (covered in the next chapter) to get a total for your denominator:

grade6_counts %>% 
  group_by( school_type ) %>% 
  summarise (schools = n() ) %>% 
  mutate  ( total = sum( schools),
            pct_total = schools/ total * 100 ) %>%
  arrange ( desc (schools))

school_type	schools	total	pct_total
PUBLIC	648	908	71.365639
CHARTER	193	908	21.255507
PRIVATE	67	908	7.378855

A note on count() and tally()

You may see other tutorials in which the group_by / summarise is shortened to use two other verbs, tally() and count(). They’re perfectly good for a very quick review of the data, but it’s a little hard to follow how they work. For now, sticking with the group_by / summarise sequence will help you cement it in your mind while at the same time making it closer to what you’ll find in other programming languages.

When you get further along, you’ll notice that count() is really useful to isolate lines that could be duplicates.

6.3 Restructuring for a report with `pivot_wider()`

It’s hard to read multi-level groupings, so the tidyverse offers a way to spread one variable across the top with the other down the side. This is just like dragging one column into the “Columns” area in a pivot table, and another into the “Rows”. Here is how you’d make a crosstab like those in Excel using this method:

Here’s a full example of creating “column” percentages and then showing them as a crosstab. See if you can piece together what’s going on. If not, try moving pieces of the query outside the code chunk (and remove your %>% operator) and run it piece by piece.

grade6_counts %>%
  group_by (school_type, school_nurse) %>%                           # See 1
  summarise ( schools = n() ) %>%
  mutate ( types = sum(schools),                                     # See 2
           pct_types = schools/types*100) %>%                        # See 3 
  select (school_type, school_nurse, pct_types) %>%
  pivot_wider ( names_from = school_type, values_from = pct_types)   #See 4

## `summarise()` has grouped output by 'school_type'. You can override using the `.groups` argument.

school_nurse	CHARTER	PRIVATE	PUBLIC
NO	87.56477	47.76119	37.03704
YES	12.43523	52.23881	62.96296

Now you have a crosstab showing that public schools are five times as likely to have a school nurse as charter schools (63/12), and private schools are in between. Remember the rule:

The independent variable, or the thing that comes first, is in your column and the thing that comes second is in your rows. Then calculate the column percentage

Walking through this code:

Do a usual group_by/summarise by the type of school and whether there are school nurses. The order matters, in order to get the right percentages.
Calcuate the sum at the next level, which is by school type – this is a hidden feature of group_by, in that it rolls up sequentially through the variables you have grouped on – to get the total by school_nurse, then calcuate the percentage off of that.
Keep only the variables you want to see in the end.
pivot it!

6.3.1 A note on restructuring your data frames

pivot_wider is part of a whole set of functions designed to work with tidy data, and fix untidy data. You’ll sometimes see it called spread, which is from an older version of the tidyverse, and is harder to understand. We’ll come back to some of this later on.

6.4 Grouping / summarising resources & exercises

If you got through this, congratulations! You’ve mastered one of the things that people new to data analysis struggle find difficult to wrap their heads around. If you’re still unsure, don’t worry – many people are. Maybe some of these resources will help:

Mary Jo Webster’s “Intro to analysis with Tidyverse” is a walkthrough aimed at group by / summarise.
Go back to Pivot Tables in Excel - it’s the same thing.
Susan Baert’s dplyr tutorial: summarizing data
Matt Waite’s “Aggregates” chapter

6.4.1 Exercises

Using one of the datasets you worked with in pivot tables, try to replicate at least some of your work in R.
In the appendix of this book, there are some exercises to continue with the Murder Accountability Project data.

The tidyverse uses the British spelling of a lot of words. You can use the American spelling, but you’ll usually see it this way.↩︎