16  Verbs in depth: New from old data with Mutate

This continues the work using Arizona Paycheck Protection Program loans. Full documentation of the dataset is in the Appendix.

To follow along, open your PPP project and create a new Quarto document with the usual front matter and setup chunk.

Load the data into your environment using this code chunk:

ppp_orig <- 
  readRDS( 
   url ( 
     "https://cronkitedata.s3.amazonaws.com/rdata/ppp_az_loans.RDS"
       )
   )

16.1 mutate to create new columns

Use the verb mutate whenever you want to create or change existing columns in your data.

Examples of this include:

  • Computing difference between numbers in two columns
  • Replacing NA values with “Unknown” or zero.
  • Collapsing or creating categories for more meaningful analysis

You will often use a combination of filtering and mutating to create a new data frame using the <- assignment to use in future code chunks. That’s because they can get complex, and you don’t want to repeat code that you might have to change over and over.

Warning

This chapter assumes you have gone through the filtering chapter. Many of the expressions and functions are the same, so if they don’t look familiar to you, consider reviewing that chapter (again) first.

mutate uses the syntax:

    mutate ( new_column_name = function (arguments))

That’s pretty abstract, so we’ll do this using examples.

16.2 Math on columns

Here’s an example of computing the portion of the original amount that was forgiven.

Use the single “=” sign to provide a name for the new column and create more than one new column using a comma between them:

ppp_orig |>
1    select ( borrower_name, borrower_city, amount, forgiveness_amount) |>
    mutate ( forgiven_pct = 
2               ( forgiveness_amount  /  amount )  * 100
            )   |>                   
3  sample_n (50)
1
Pick out a few columns to work with
2
Compute a new column using math
3
Print out a random sample of 50 rows
Note

A value over 100 means that the government also waived fees and interest; a value under 100 means that less than all of the loan was forgiven.

(I did some fancy formatting so you could read the results more clearly. We’ll come back to that in a later lesson.)

Converting NA to 0

If you look through the pages, you’ll realize that there are a lot of rows with NA instead of a value. That’s because whenever you do anything with a missing value, the result is missing. Missing values infect everything they touch. You often have to deal with them before you can do anything else.

We would like to convert the forgiven amount from a missing value to zero, under the idea that if they have not filled it out, nothing has (yet) been forgiven. Of course, we’d have to check that with the SBA before publication.

There is a specific function used for that: replace_na(), with two arguments: The column you want to check for missing values, and the value you want to use instead. This isn’t limited to numbers – you can do the same thing with words, often replacing NA with “Unknown”.

ppp_forgiven_fixed <- 
  ppp_orig |>
  mutate (amount_forgiven = replace_na(forgiveness_amount, 0))

(Note that nothing came out in this code chunk because the result was saved into a new data frame)

Detail and total with summary statistics

You can use mutate to put summary statistics next to your values so you can see whether or not they are similar to the average or median.1

ppp_forgiven_fixed |> 
  mutate ( avg_forgiven = mean(amount_forgiven), 
           median_forgiven = median(amount_forgiven)) |> 
  select (borrower_name, amount_forgiven, avg_forgiven, median_forgiven ) |> 
  head()

16.3 Creating categories for easier analysis

Very often, you’ll want to categorize entries in a database in order to make it simpler to count and sum the values in a meaningful way. For example, the business_type column has 24 different values, including NA . Here’s a list with the number of loans in each category:

Yes-No categories using if_else()

One way to work with these is to create new columns with yes-no indicators for certain types of businesses like non-profits or individuals vs. companies.

The function to do this is if_else() , which tests a condition exactly the same way filter did, but then assigns a value based on whether it’s met or not. You list the condition first, then a comma , then what should happen if the condition is true, and then what should happen if it’s false.

The condition is done the same way you did it in a filter, but instead of picking out the rows that match, mutate acts on them separately from the rows that don’t match.

There’s no “maybe”, except for NA’s in the original value. Here’s the general form of what it looks like:


new_column_name = if_else ( test the old column for something as in a a filter,
                         give it a value if it's true,
                         give it another value if it's not true)

So here is a way to do this with the business_type using the same %in% operator you used in the filter lesson, saving it to new data frame in your Environment, then displaying the first 10 types of businesses using count()

Example 1: Two categories using exact conditions

ppp_category_indiv <- 
  ppp_forgiven_fixed |>
  mutate ( is_individual = 
              if_else ( business_type %in% 
                          c("Independent Contractors", 
                            "Sole Proprietorship", 
                            "Self-Employed Individuals", 
                            "Single Member LLC"), 
                 "Individual", 
                 "Organization")
  )  

Now print off a sampling of a some of the rows to take a look:

ppp_category_indiv |> 
  select ( is_individual, business_type) |> 
  sample_n (50) 

Example 2: Two categories using wildcards

ppp_category_nonprofit <-
  ppp_forgiven_fixed |> 
  mutate ( is_nonprofit = 
             if_else ( str_like(business_type, "%Non_Profit%") , 
                       "Is nonprofit", 
                       "Not nonprofit"))  

You can check it using a new verb, count(), to check the number of items by category:

ppp_category_nonprofit |> 
  filter ( is_nonprofit == "Is nonprofit") |> 
  count ( is_nonprofit, business_type)

(The profit categorization is unclear for some of the original types, such as professional associations , tribal concerns and cooperatives.)

More than two categories using case_when()

Sometimes you will want more than one outcome, such as setting a value for “High”, “Medium” and “Low”. Instead of if_then, use the function case_when, which lets you string along conditions and their outcomes. The tilde (~) is used to show what happens if it’s true. At the end, you can add an argument .default="Other" to say “everything else”.

original data |> 
  case_when ( first condition ~ what if it's true,
              second condition ~ what if  it's true, 
              third condition  ~ what if it's true, 
              .default =  what to do with everything that's left
              )

16.4 Putting it all together

Here is how you could set a column to with five types of borrowers instead of three. Once you put it together, you have a much more manageable list of five instead of 25 categories:

ppp_business_categories <- 
  ppp_forgiven_fixed |>
  mutate (  new_business_type = 
              
          case_when (  str_like(business_type, "%non_profit%") 
1                         ~ "Non-profit",
                       
                      business_type %in% 
                         c("Independent Contractors", 
                            "Sole Proprietorship", 
                            "Self-Employed Individuals", 
                            "Single Member LLC")  
2                        ~ "Individual",
                      
                       business_type == "Tribal Concerns" 
3                        ~ "Tribal concerns",
                      
                       str_detect (business_type, 
                                   "LLC|Company|Corporation|Partnership") 
4                         ~ "Companies",
                      
5                       .default = "Other"
            )
)
1
The same code you used for non-profits, but with a ~ instead of a comma.
2
The same code you used for individuals
3
A new type, for Tribal concerns,
4
and another new type for traditional companies, and
5
finally, what you want it to say when none of the conditions are true.
Tip

This code chunk uses a function you haven’t seen yet, called str_detect(), which is a powerful way to streamline the conditions. Instead of using str_like(), with its limited wildcards, this is based on regular expressions, which we’ll look at later. But the way it’s used here is to separate words that might be in the description with “|”, for “OR”.

Take a look at how the categories are distributed now:

ppp_business_categories |>
  count ( new_business_type)

Save it for use in another program

Saving this for future use means you don’t have to worry anymore about some of the missing values, and you can filter and group by the simpler new business type instead of the original. This saves a data file called ppp_edited.RDS in your project folder.

saveRDS(ppp_business_categories, file="ppp_edited.RDS")

  1. If you skipped converting the NA values above, the answers will always come out as NA. ↩︎