7 Verbs Part 3: mutate

7.1 Key takeaways

  • mutate just means “compute”, or convert one or more variables into something new.
  • mutate is the equivalent of creating a new column in Excel using a formula
  • Some of the most important jobs of a mutate statement are:
    • Convert unstandardized data into standardized, such as yes-or-no variables or all lower case
    • Use arithmetic on numeric variables
    • Use conditional statements to put different values on variables, instead of filtering one at a time.
    • Change the data type from, say, text to date and time.
    • Use functions to combine or pull apart and edit your data.

7.1.1 Read in some data

Start with the Arizona immunization data that some other chapters have worked with, loading it and the tidyverse to follow along.

Here are the packages and data used for this chapter. You might need to install some of them.

library(tidyverse)
library(knitr)       # used for slightly better looking tables
library(DT)          # even more options with tables
library(lubridate)   # used for working with dates
library(janitor)     # used to clean up dirty data and change column names.

load("data/az-immunizations-grade6.Rda")  # you may need to change the path to wherever you saved the data

7.2 Simple arithmetic

Use the usual arithmetic operators to calculate numbers and create a new column:

  + Addition
  - Subtraction
  * Multiplication
  / Division
  

This example creates a new data frame called immune_pct from a few columns of the original:

immune_pct <- 
  grade6_counts %>%
  select ( school_name, city, school_type, enrolled, num_immune_mmr) %>%
  mutate (pct_immune = ( num_immune_mmr/enrolled) * 100) 

Here’s what it looks like:

school_name city school_type enrolled num_immune_mmr pct_immune
ABRAHAM LINCOLN TRADITIONAL SCHOOL PHOENIX PUBLIC 59 58 98.3
ACACIA ELEMENTARY SCHOOL PHOENIX PUBLIC 134 134 100.0
ACADEMY DEL SOL - HOPE TUCSON CHARTER 53 52 98.1
ACADEMY OF MATH AND SCIENCE TUCSON CHARTER 42 40 95.2
ACADEMY OF MATH AND SCIENCE CAMELBACK PHOENIX CHARTER 113 113 100.0

7.3 Creating categories with if_else()

Sometimes, you want to create categories out of more detailed information. In this case, a threshold of 90 percent immunization is considered the limit under which there is a threat of a measles outbreak, also known as “herd immunity”. To categorize each schools as above or under that level, we need to create a new variable that is based on the value of the percentage we just calcluated. That’s done using an if_else statement, which is in the form:

  if_else (  condition, 
             new value if it's true, 
             new value if it's not true)

For example:

immune_pct <- 
    immune_pct %>%
    mutate (  threshold_met =  if_else   ( pct_immune >= 90, 
                                        "Meets threshold", 
                                         "Below threshold")
            )

Now we can do some grouping by the new categories:

immune_pct %>%
    group_by (school_type, threshold_met) %>%
    summarise ( school_ct = n(), 
                enrolled = sum(enrolled)) 
## `summarise()` has grouped output by 'school_type'. You can override using the `.groups` argument.

You can practice rolling up by looking at the percent by type.

immune_pct %>%
    group_by (school_type, threshold_met) %>%
    summarise ( schools = n(), 
                students = sum(enrolled)) %>%
    mutate (   pct_school_type = schools / sum(schools, na.rm=TRUE) * 100,
               pct_enrolled = students / sum(students, na.rm=TRUE) * 100 
             ) %>%
    filter ( threshold_met == "Below threshold")  
## `summarise()` has grouped output by 'school_type'. You can override using the `.groups` argument.
school_type threshold_met schools students pct_school_type pct_enrolled
CHARTER Below threshold 40 2344 20.725389 17.757576
PRIVATE Below threshold 10 324 14.925373 12.053571
PUBLIC Below threshold 18 1223 2.777778 1.794544

You can read this table to say, “More than one-sixth of 6th-grade charter school students could be at risk of contracting measles because their schools have immunization rates below levels experts say are needed to prevent outbreaks.” (The wording is tough here to avoid saying “schools” over and over! )

This strategy can also be used to set unreal values to the special NA as in the MAP murder data that uses the value 999 instead of “unknown” for an age. This is common in survey data.

7.4 Useful functions

Here are some functions you may find yourself using pretty regularly. I’m not going to go into the details, but think about looking some up if you think you need them. In each case, these are just the verbs. You’d put your noun – an existing variable from your data frame – within the parentheses.

  • tolower( ) , toupper ( ), toproper ( ): Convert to all lower, all upper, or proper case.
  • paste ( ), paste0 ( ) : Combine (eg, concatenate) two character variables into one, with a delimiter like a space, or smushed together (paste0)
  • na_if( ), replace_na () : The first creates NA values if something is true; the other replaces NA values with something else, such as a zero.
  • recode () : Lets you set up a system to replace codes with words.
  • Conversion functions: as.character(), as.integer(), as.numeric() and as.double()

7.5 Bonus: Fun with date and time

Sometimes your data came to you as text, but you need to treat it as numbers. In journalism, we’re often trying to work with dates that came to us in some non-standard form. This is called changing the data type. 3

This example comes from the opioid emergency call data shown in earlier chapters.

my_link <- "https://cronkitedata.github.io/cronkite-docs/assets/data/csv/opioidemscalls.csv"
#use it in a read_csv command
opioid_calls_orig <- read_csv(my_link)
opioid_calls <- opioid_calls_orig %>% 
                clean_names() %>%
                select (objectid, incident_date_char = incident_date, narcan= narcan_naloxone_given, is_asu_student, is_homeless=is_homeles)

glimpse( opioid_calls)
## Rows: 650
## Columns: 5
## $ objectid           <dbl> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ incident_date_char <chr> "1/3/2017 11:42", "1/9/2017 1:08", "1/9/2017 1:55"…
## $ narcan             <chr> "No", "Yes", "No", "No", "No", "Yes", "No", "No", …
## $ is_asu_student     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ is_homeless        <chr> "No", "No", "Yes", "Yes", "No", "No", "No", "No", …

The date and time in this dataset is a character field, but we want to convert it to a date field and a separate time field. At first, we might just want to get the date value out of the character string. It’s in the form month/day/year hour:minute, so we can use the lubridate::mdy_hm function [When you see a function preceded by two colons as this one does, the author is showing you which package it comes from. In this case, the lubridate package, which you must install before you can use it, has a function mdy_hm.]. This function tries to guess how the character string is formatted:

opioid_calls %>%
  mutate ( incident_date = mdy_hm(incident_date_char)) %>%
  select (incident_date, incident_date_char:is_homeless) %>%
  glimpse()
## Rows: 650
## Columns: 5
## $ incident_date      <dttm> 2017-01-03 11:42:00, 2017-01-09 01:08:00, 2017-01…
## $ incident_date_char <chr> "1/3/2017 11:42", "1/9/2017 1:08", "1/9/2017 1:55"…
## $ narcan             <chr> "No", "Yes", "No", "No", "No", "Yes", "No", "No", …
## $ is_asu_student     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
## $ is_homeless        <chr> "No", "No", "Yes", "Yes", "No", "No", "No", "No", …

Notice that the new variable is of the type , which refers to a date-time combined variable.

Let’s say I want to distinguish weekends from weekdays in this dataset. First, we have to define weekend. Here’s one definition, but you might want to use another: A weekend begins at 3pm on Friday and ends at 6pm on Sunday. (The hour variable is in military time.). Instead of using if_else, this example uses case_when, which lets you choose among a variety of options, not just two.

opioid_calls %>%
  mutate (incident_date = mdy_hm(incident_date_char), 
          day_of_week = wday(incident_date) ,
          weekday = weekdays(incident_date),
          hour = hour(incident_date), 
          weekend = 
              case_when ( day_of_week == 7                 ~ "Yes", 
                          day_of_week == 6 & hour >= 15     ~ "Yes", 
                          day_of_week == 1 & hour <= 18   ~ "Yes", 
                          TRUE                             ~ "No"
                          )
          ) %>%
  #use count() instead of group_by / summarise. Same thing in this context.
  count (day_of_week, weekend, weekday) %>%
  pivot_wider (values_from = n, names_from = weekend, values_fill= list(n=0))
day_of_week weekday No Yes
1 Sunday 19 80
2 Monday 78 0
3 Tuesday 87 0
4 Wednesday 87 0
5 Thursday 102 0
6 Friday 53 41
7 Saturday 0 103

7.6 More resources

As always, other professors have done similar exercises:

TK: Exercises


  1. This example uses the library lubridate, which may not be installed in your computer. If you’re following along, you may need to install it in the console.↩︎