9 Grouping with pivot tables
This continues with the Washington Post’s police shooting database, but I updated it to include the descriptions of ethnicities.
In the wake of a police shooting in 2016, reporter Mitch Smith obtained a list of traffic stops from the St. Anthony Police Department in Minnesota. He was writing a story on Philandro Castile’s death and was running out of time. He wanted to answer a simple question: Were minority motorists more likely to be stopped in St. Anthony than whites?
Rob Gebeloff made a quick pivot table to answer the question. That night, Smith wrote:
In each of the three small suburbs patrolled by the St. Anthony police, less than 10 percent of the population is black. But data released by the city on Tuesday showed that a far higher percentage of the people ticketed or arrested by St. Anthony officers were African-American.
Last year, around 19 percent of those cited by St. Anthony police were black, as were roughly 41 percent of people arrested by the department, a review of the city’s data showed. Those percentages do not include the large number of defendants whose race was unknown.
9.1 Summarizing with groups
Summarizing a list of items in a spreadsheet is done using pivot tables. In other languages, it’s considered “aggregating” or “grouping and summarizing”. Think of pivot tables and grouping as answering the questions, “How many?” and “How much?”. They are particularly powerful when your question also has the words “the most” or the “the least” or “of each”. Some examples:
- Which Zip Code had the most crimes?
- What month had the least total rainfall?
- How much did each candidate raise last quarter?
- In playing cards, how many of each suit do I have in my hand?
- On average, are Cronkite students taller or shorter than in other schools?
Confusing grouping with sorting or arranging
Many reporters confuse this summarization with “sorting”. One reason is that this is how we express the concept in plain language: “I want to sort Skittles by color”.
But in data analysis, sorting and and grouping are very different things. Sorting involves shuffling a table’s rows into some order based on the values in a column. In other languages, this is called arranging or ordering, much clearer concepts. Grouping, which is what pivot tables do, is a path to aggregating and computing summary statistics such as a count (the number of items), sum (how much they add up to), or average for category. It means “make piles and compute statistics for each one.”
When to use filter vs. pivot tables
Something that trips up beginners is a desire to see details and totals at the same time, which is more difficult than it sounds.
A filter is used to display your selected items as a list. You’ll get to see all of the detail and every column. As a convenience, Excel shows you how many items are in that filtered list. That’s great when you want to just look at them, or get more information about them. For instance, if you had a list of crimes by ZIP code, you might just want to see the list in your neighborhood – where, exactly, were they? When did they happen? Was it at night or the morning? What crimes happened on which blocks?
A pivot table is used when you just want to see summaries – does my ZIP code have more crime than others? Are robberies more common than car theft in my Zip code, and how does that compare to others?
In practice, you’ll go back and forth between summary and detail. They’re both important, just different.
Motivational exercise
A lot of reporters try to do everything with filters. Once they learn pivot tables, they try to do everything with pivot tables. They’re two different animals – filtering is used for detail, and grouping (or pivot tables) are for summary. Here’s one way of understanding what pivot tables do before you try to make one.
If you wanted to know how many people of each race/ethnicity were shot and killed, here’s one way you can do it:
- Using your data table, use the filter the column in the new spreadsheet called “Ethnicity” to show only “Black non-Hispanic” victims.
You should see something like this at the bottom of your screen, showing 1444 of 5945 records shown.
Write that down on a piece of paper, or start a new sheet in your workbook called “ethnicity”, and create a small data area with the columns called “Ethnicity”, “# of shootings”, “% of total”.
Repeat the process with each ethnicity, entering or writing down the results of each of them.
Now, for each ethnicity, compute the percent of total – or, the number of rows you found divided by 5,945 expressed as a percent.
In the end, you would have a page that looks something like this (or your own chicken-scratching with the same numbers):
And if you’re asking yourself, “why is this so much work? Isn’t there an easier way?” , the answer is to try a pivot table. It does all that work for you.
9.2 Tutorial
A Skittles example
Make sure that the column you choose to put in the “Values” area to count is always filled out – otherwise it will skip them.
Setting up the pivot table
Start with your cursor somewhere in your data , and choose Insert, then Pivot table
If all goes well, it will look like your data disappeared. It didn’t – you’re just on a new page. Here’s what it looks like:
Counting , or “how many”?
The section on the right gives you an outline of what to do. The section on the left will get filled in as you make your pivot table. If you want to see the number of fatalities by ethnicity, drag that column into the “Rows” area, then drag something that’s always filled out into the Values area (state
is a safe one in this data).
Be sure to ALWAYS use something that has something filled in all the time for this. If you look at the filter, it must not have an option at the bottom that says (Blanks)
. If you ignore this, your answer will be wrong, since the calculation will only count filled-in cells.
Percents of total
It’s hard to compare raw numbers unless they’re really small. Instead, we’d like to know what percent of fatalities by ethnicity. Right-click on any number in the pivot table, and choose “Show data as…” , then choose “Percent of Column total”.
To remove it, right-click on it and choose “Remove”Count of state2”“.
More variables
Suppose you’d like to see the number of fatalities by year, with the years across the top and the ethnicity
down the sides. Drag the year
variable into the column area . Sorting can get hinky on pivot tables, but in this case it will work to put the largest number on top. This won’t work with percentages – it still sorts by the underlying number.
Even more detail
Say you wanted to see each city’s total shootings by year. Which one had the most last year, and which one had the most overall?
This is actually really hard in a pivot table, because there are cities with the same names in different states. It means you’d need to have a pivot table with TWO columns down the side, and one across the top. Here’s my attempt at getting there:
This is after some fiddling with the formats, and I still can’t sort properly – the city “Phoenix”, including those in Maryland and Arizona, had the second-highest number of shootings. We can’t sort by the combination of city and state. (Don’t try to get something that looks like this on your own – I can’t even reproduce it exactly because I’d fiddled around trying to get something that I wanted to see in this example. )
Your choices in Excel are limited: Copy and paste the values of the pivot table into a new sheet and sort there, or create a new variable by concatenating the name of the city and state into one column.
Go back to your raw data
But say you wanted to look into the 15 Phoenix, AZ shootings from 2020. Your instinct might be to filter your pivot table and try adding more items to the columns and rows:
But this is what your filters were made for! Return to your original data, and set the filters so you can see ALL of the details about the small list of items.
The big picture
Pivot tables and aggregation with grouping are good for narrowing down the items you want to examine more closely, and for compiling summary statistics about your data.
But filters are much better for examining the full range of information about a small number of examples.
In practice, you’ll go back and forth between summary and detail. You need them both.
9.3 TL;DR
Here’s a video with the same material:
9.4 FAQ
Everything disappeared!
If you select something outside of that pivot table on the left, the menu on the right disappears. Select something in the pivot table area and it will likely come back.
I have too many columns
If you want two sets of statistics – say, number of fatalities and percent of fatalities – across the top, it can get very wide and confusing very quickly. One alternative is to change it into more of a vertical rectangle by dragging the “Values” element from the columns to the rows on the right. (This only shows up when you have two calculations being made.)
I want to sort by percents, not numbers
You can’t.
Things aren’t adding up
You have to be super careful about which column you use to Count things – it has to always be filled out (there can’t be any blanks). Go through the filters and find one that doesnt have (Blanks) at the bottom to be sure.
Its a crazy number!
You might have dragged a numeric column into the “Values” area. Check to see if it says “Count” or “Sum”. Change it to “Count” if it has something else on it, unless you wanted to add up that column.
This is so frustrating - I can’t get what I want
Right? It’s time to go to a programming language!