7 Verbs Part 3: mutate
7.1 Key takeaways
mutate
just means “compute”, or convert one or more variables into something new.- mutate is the equivalent of creating a new column in Excel using a formula
- Some of the most important jobs of a
mutate
statement are:- Convert unstandardized data into standardized, such as yes-or-no variables or all lower case
- Use arithmetic on numeric variables
- Use conditional statements to put different values on variables, instead of filtering one at a time.
- Change the data type from, say, text to date and time.
- Use functions to combine or pull apart and edit your data.
7.1.1 Read in some data
Start with the Arizona immunization data that some other chapters have worked with, loading it and the tidyverse to follow along.
Here are the packages and data used for this chapter. You might need to install some of them.
library(tidyverse)
library(knitr) # used for slightly better looking tables
library(DT) # even more options with tables
library(lubridate) # used for working with dates
library(janitor) # used to clean up dirty data and change column names.
load("data/az-immunizations-grade6.Rda") # you may need to change the path to wherever you saved the data
7.2 Simple arithmetic
Use the usual arithmetic operators to calculate numbers and create a new column:
+ Addition
- Subtraction
* Multiplication
/ Division
This example creates a new data frame called immune_pct
from a few columns of the original:
<-
immune_pct %>%
grade6_counts select ( school_name, city, school_type, enrolled, num_immune_mmr) %>%
mutate (pct_immune = ( num_immune_mmr/enrolled) * 100)
Here’s what it looks like:
school_name | city | school_type | enrolled | num_immune_mmr | pct_immune |
---|---|---|---|---|---|
ABRAHAM LINCOLN TRADITIONAL SCHOOL | PHOENIX | PUBLIC | 59 | 58 | 98.3 |
ACACIA ELEMENTARY SCHOOL | PHOENIX | PUBLIC | 134 | 134 | 100.0 |
ACADEMY DEL SOL - HOPE | TUCSON | CHARTER | 53 | 52 | 98.1 |
ACADEMY OF MATH AND SCIENCE | TUCSON | CHARTER | 42 | 40 | 95.2 |
ACADEMY OF MATH AND SCIENCE CAMELBACK | PHOENIX | CHARTER | 113 | 113 | 100.0 |
7.3 Creating categories with if_else()
Sometimes, you want to create categories out of more detailed information. In this case, a threshold of 90 percent immunization is considered the limit under which there is a threat of a measles outbreak, also known as “herd immunity”. To categorize each schools as above or under that level, we need to create a new variable that is based on the value of the percentage we just calcluated. That’s done using an if_else
statement, which is in the form:
if_else ( condition,
new value if it's true,
new value if it's not true)
For example:
<-
immune_pct %>%
immune_pct mutate ( threshold_met = if_else ( pct_immune >= 90,
"Meets threshold",
"Below threshold")
)
Now we can do some grouping by the new categories:
%>%
immune_pct group_by (school_type, threshold_met) %>%
summarise ( school_ct = n(),
enrolled = sum(enrolled))
## `summarise()` has grouped output by 'school_type'. You can override using the `.groups` argument.
You can practice rolling up by looking at the percent by type.
%>%
immune_pct group_by (school_type, threshold_met) %>%
summarise ( schools = n(),
students = sum(enrolled)) %>%
mutate ( pct_school_type = schools / sum(schools, na.rm=TRUE) * 100,
pct_enrolled = students / sum(students, na.rm=TRUE) * 100
%>%
) filter ( threshold_met == "Below threshold")
## `summarise()` has grouped output by 'school_type'. You can override using the `.groups` argument.
school_type | threshold_met | schools | students | pct_school_type | pct_enrolled |
---|---|---|---|---|---|
CHARTER | Below threshold | 40 | 2344 | 20.725389 | 17.757576 |
PRIVATE | Below threshold | 10 | 324 | 14.925373 | 12.053571 |
PUBLIC | Below threshold | 18 | 1223 | 2.777778 | 1.794544 |
You can read this table to say, “More than one-sixth of 6th-grade charter school students could be at risk of contracting measles because their schools have immunization rates below levels experts say are needed to prevent outbreaks.” (The wording is tough here to avoid saying “schools” over and over! )
This strategy can also be used to set unreal values to the special NA
as in the MAP murder data that uses the value 999 instead of “unknown” for an age. This is common in survey data.
7.4 Useful functions
Here are some functions you may find yourself using pretty regularly. I’m not going to go into the details, but think about looking some up if you think you need them. In each case, these are just the verbs
. You’d put your noun – an existing variable from your data frame – within the parentheses.
tolower( )
,toupper ( )
,toproper ( )
: Convert to all lower, all upper, or proper case.paste ( )
,paste0 ( )
: Combine (eg, concatenate) two character variables into one, with a delimiter like a space, or smushed together (paste0)na_if( )
,replace_na ()
: The first creates NA values if something is true; the other replaces NA values with something else, such as a zero.recode ()
: Lets you set up a system to replace codes with words.- Conversion functions: as.character(), as.integer(), as.numeric() and as.double()
7.5 Bonus: Fun with date and time
Sometimes your data came to you as text, but you need to treat it as numbers. In journalism, we’re often trying to work with dates that came to us in some non-standard form. This is called changing the data type
. 3
This example comes from the opioid emergency call data shown in earlier chapters.
<- "https://cronkitedata.github.io/cronkite-docs/assets/data/csv/opioidemscalls.csv"
my_link #use it in a read_csv command
<- read_csv(my_link)
opioid_calls_orig <- opioid_calls_orig %>%
opioid_calls clean_names() %>%
select (objectid, incident_date_char = incident_date, narcan= narcan_naloxone_given, is_asu_student, is_homeless=is_homeles)
glimpse( opioid_calls)
## Rows: 650
## Columns: 5
## $ objectid <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ incident_date_char <chr> "1/3/2017 11:42", "1/9/2017 1:08", "1/9/2017 1:55"…
## $ narcan <chr> "No", "Yes", "No", "No", "No", "Yes", "No", "No", …
## $ is_asu_student <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ is_homeless <chr> "No", "No", "Yes", "Yes", "No", "No", "No", "No", …
The date and time in this dataset is a character field, but we want to convert it to a date field and a separate time field. At first, we might just want to get the date value out of the character string. It’s in the form month/day/year hour:minute, so we can use the lubridate::mdy_hm function
[When you see a function preceded by two colons as this one does, the author is showing you which package it comes from. In this case, the lubridate
package, which you must install before you can use it, has a function mdy_hm
.]. This function tries to guess how the character string is formatted:
%>%
opioid_calls mutate ( incident_date = mdy_hm(incident_date_char)) %>%
select (incident_date, incident_date_char:is_homeless) %>%
glimpse()
## Rows: 650
## Columns: 5
## $ incident_date <dttm> 2017-01-03 11:42:00, 2017-01-09 01:08:00, 2017-01…
## $ incident_date_char <chr> "1/3/2017 11:42", "1/9/2017 1:08", "1/9/2017 1:55"…
## $ narcan <chr> "No", "Yes", "No", "No", "No", "Yes", "No", "No", …
## $ is_asu_student <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ is_homeless <chr> "No", "No", "Yes", "Yes", "No", "No", "No", "No", …
Notice that the new variable is of the type
Let’s say I want to distinguish weekends from weekdays in this dataset. First, we have to define weekend. Here’s one definition, but you might want to use another: A weekend begins at 3pm on Friday and ends at 6pm on Sunday. (The hour
variable is in military time.). Instead of using if_else
, this example uses case_when
, which lets you choose among a variety of options, not just two.
%>%
opioid_calls mutate (incident_date = mdy_hm(incident_date_char),
day_of_week = wday(incident_date) ,
weekday = weekdays(incident_date),
hour = hour(incident_date),
weekend =
case_when ( day_of_week == 7 ~ "Yes",
== 6 & hour >= 15 ~ "Yes",
day_of_week == 1 & hour <= 18 ~ "Yes",
day_of_week TRUE ~ "No"
)%>%
) #use count() instead of group_by / summarise. Same thing in this context.
count (day_of_week, weekend, weekday) %>%
pivot_wider (values_from = n, names_from = weekend, values_fill= list(n=0))
day_of_week | weekday | No | Yes |
---|---|---|---|
1 | Sunday | 19 | 80 |
2 | Monday | 78 | 0 |
3 | Tuesday | 87 | 0 |
4 | Wednesday | 87 | 0 |
5 | Thursday | 102 | 0 |
6 | Friday | 53 | 41 |
7 | Saturday | 0 | 103 |
7.6 More resources
As always, other professors have done similar exercises:
- Matt Waite’s chapter on “mutate”, which includes a recipe for combining character variables into phrases
TK: Exercises
This example uses the library
lubridate
, which may not be installed in your computer. If you’re following along, you may need to install it in the console.↩︎