18  Verbs in depth: New from old data with Mutate

In this chapter

  • Creating new columns with mutate()
  • Combine summary and detail into one data frame
  • Replacing NA values
  • Using conditional commands if_else() and case_when() to create categories

This continues the work using Arizona Paycheck Protection Program loans. Full documentation of the dataset is in the Appendix.

To follow along, open your PPP project and create a new markdown the usual way, including the usual setup chunk.

Add this command options(scipen=999) to the bottom of your setup chunk to avoid looking at scientific notation for large and small numbers.

Load the data into your environment using this code chunk:

ppp_orig <- 
  readRDS( 
   url ( 
     "https://cronkitedata.s3.amazonaws.com/rdata/ppp_az_loans.RDS"
       )
   )

18.1 mutate to create new columns

Use the verb mutate whenever you want to create or change existing columns in your data.

Examples of this include:

  • Computing difference or percent difference
  • Replacing NA values with “Unknown” or zero.
  • Collapsing or creating categories for more meaningful analysis

You will often use a combination of filtering and mutating to create a new data frame using the <- assignment to use in future code chunks. That’s because they can get complex, and you don’t want to repeat code that you might have to change over and over.

Warning

This chapter assumes you have gone through the filtering chapter. Many of the expressions and functions are the same, so if they don’t look familiar to you, consider reviewing that chapter (again) first.

mutate uses the syntax:

    mutate ( new_column_name = function (arguments))

That’s pretty abstract, so we’ll do this using examples.

18.1.1 Math on columns: compute difference and percent difference

Here’s an example of computing the difference and percent difference between the amount received and the amount forgiven. Before you run this, try to think of why a reporter might be interested in this list.

Use the single “=” sign to provide a name for the new column and create more than one new column using a comma between them:

ppp_orig |>
  mutate ( amt_diff =  forgiveness_amount-amount, 
           amt_pct_diff = amt_diff / amount  * 100 ) |>                   # compute new columns
  select ( borrower_name, borrower_city, amount, forgiveness_amount) |>   # pick out a few columns
  head(5).                                                                # show the first five rows

If you look hard enough, you’ll realize that there are a lot of rows with NA instead of a value. That’s because whenever you do anything with a missing value, the result is missing. Missing values infect everything they touch. You often have to deal with them before you can do anything else.

18.1.2 Converting NA to 0

We would like to convert the forgiven amount from a missing value to zero, under the idea that if they have not filled it out, nothing has (yet) been forgiven. Of course, we’d have to check that with the SBA before publication.

There is a specific function used for that: replace_na(), with two arguments: The column you want to check for missing values, and the value you want to use instead. This isn’t limited to numbers – you can do the same thing with words, often replacing NA with “Unknown”.

ppp_forgiven_fixed <- 
  ppp_orig |>
  mutate (amount_forgiven = replace_na(forgiveness_amount, 0))

(Note that nothing came out in this code chunk because the result was saved into a new data frame variable.)

18.1.3 Detail and total with summary statistics

You can use mutate to put summary statistics next to your values so you can see whether or not they are similar to the average or median. (If you skipped converting the NA values above, the answers will always come out as NA, because they infect everything they touch. )

ˆ

ppp_forgiven_fixed |> 
  mutate ( avg_forgiven = mean(amount_forgiven), 
           median_forgiven = median(amount_forgiven)) |> 
  select (borrower_name, amount_forgiven, avg_forgiven, median_forgiven ) |> 
  head()

18.2 Categorizing text using conditional statements

Very often, you’ll want to categorize entries in a database in order to make it simpler to count and sum the values in a meaningful way. For example, the business_type column has 24 different values, including NA . Here’s a list with the number of loans in each category:

#| label: print-forgiven
#| eval: true
#| echo: false 

ppp_forgiven_fixed  |>
   count ( business_type, sort=TRUE, name="# of rows") 

One way to work with these is to create new columns with yes-no indicators for certain types of businesses like non-profits or individuals vs. companies.

18.2.1 Two categories using if_else

The function to do this is if_else() , which tests a condition exactly the same way filter did, but then assigns a value based on whether it’s met or not. You list the condition first, then a comma , then what should happen if the condition is true, and then what should happen if it’s false.

The condition is done the same way you did it in a filter, but instead of picking out the rows that match, mutate acts on them separately from the rows that don’t match.

There’s no “maybe”, except for NA’s in the original value. Here’s the general form of what it looks like:


new_column_name = if_else ( test the old column for something as in a a filter,
                         give it a value if it's true,
                         give it another value if it's not true)

So here is a way to do this with the business_type using the same %in% operator you used in the filter lesson, saving it to new data frame in your Environment, then displaying the first 10 types of businesses using count()

Note

This code chunk uses a verb called count() to count up the number of rows in each category. We’ll get into this in much more detail in the next chapter, but it’s helpful to use it now to show you the results of creating new categories from old data.

#| label: m-category-indiv 
#| eval: true 
#| echo: true

ppp_category_indiv <- 
  ppp_forgiven_fixed |>
  mutate ( is_individual = 
              if_else ( business_type %in% 
                          c("Independent Contractors", 
                            "Sole Proprietorship", 
                            "Self-Employed Individuals", 
                            "Single Member LLC"), 
                        "Individual", 
                        "Organization")
  )  

ppp_category_indiv |> 
  count (is_individual, business_type) |> 
  head(10)

18.2.2 Two categories using fuzzy matching

You can also use the same str_like() function you used in filtering. Here, it sets whether or not the borrower was a non-profit. Remember, a % means “anything or nothing, and a _ means”any single character but it must be present”. In our case, the word is shown as “Non Profit” and “Non-Profit”. Once. you’ve set it to your new words, you can just filter on the new value:

ppp_category_nonprofit <-
  ppp_category_indiv |>
  mutate ( is_nonprofit = 
             if_else ( str_like(business_type, "%Non_Profit%") , 
                       "Is nonprofit", 
                       "Not nonprofit"))  



ppp_category_nonprofit |>
  count ( business_type, is_nonprofit) |>
  filter ( is_nonprofit == "Is nonprofit")

(The profit categorization is unclear for some of these types, such as professional associations , tribal concerns and cooperatives.)

18.2.3 More than two categories using case_when()

Sometimes you will want more than one outcome, such as setting a value for “High”, “Medium” and “Low”. Instead of if_then, use the function case_when, which lets you string along conditions and their outcomes. The tilde (~) is used to show what happens if it’s true, and a TRUE at the end lets you say “everthing else”1

original data |> 
  case_when ( first condition ~ what if it's true,
              second condition ~ what if  it's true, 
              third condition  ~ what if it's true, 
              TRUE ~ what to do with everything that's left
              )

18.3 Putting it all together

Here is how you could set a column to with five types of borrowers instead of three. Once you put it together, you have a much more manageable list of five instead of 25 categories:

ppp_business_categories <- 
  ppp_category_nonprofit |>
  mutate (  new_business_type = 
                case_when (  str_like(business_type, "%non_profit%") ~ "Non-profit",    #1st category

                             business_type %in% 
                               c("Independent Contractors", 
                                "Sole Proprietorship", 
                                "Self-Employed Individuals", 
                                  "Single Member LLC")              ~ "Individual",    # 2nd category
                             
                             business_type == "Tribal Concerns"     ~ "Tribal concerns",    #3rd category
                             
                             str_detect (business_type, "LLC|Company|Corporation|Partnership") ~ "Companies",  #4th category
                             
                             TRUE ~ "Other")         #catchall 5th category
            )


# Now take a look at what you got

ppp_business_categories |>
  count ( new_business_type)
Tip

This code chunk uses a function you haven’t seen yet, called str_detect(), which is a powerful way to streamline the conditions. Instead of using str_like(), with its limited wildcards, this is based on regular expressions, which we’ll look at later. But the way it’s used here is to separate words that might be in the description with “|”, for “OR”.

Save it for use in another program

Saving this for future use means you don’t have to worry anymore about some of the missing values, and you can filter and group by the simpler new business type instead of the original. This saves a data file called ppp_edited.RDS in your project folder.

saveRDS(ppp_business_categories, file="ppp_edited.RDS")

  1. This will change in future versions, which have a much more intuitive “Other” alternative.↩︎