21  Recipes

This final chapter on data wrangling in R simply puts together a lot of the code snippets that are easy to forget, and adds some more advanced code examples that you can adapt to your work. There is minimal explanation. Instead, look at at the relevant chapter for more explanation. Over time, I’ll add section links to the examples.

You can copy and paste these into RStudio’s snippets and use them directly in your program – then you just have to change the references to the variables in your own work.

These examples use the data from the other chapters, from the PPP program.

R annoyances and errors

There are several annoyances that aren’t consistent from one function in R to the next. When you’re having trouble, look for

  • Quoting issues - what kind of quotes, whether they’re needed, and if they’re matched open and closed.

  • Unmatched or missing parentheses

    Putting words on the wrong side of equal signs. To create a new column name, put it on the left. To identify a column to be used as an argument, put it on the right. To filter a column, use two equals signs, with the column name on the left.

  • Are you working with a list of items? If so, you need to wrap them in the c() function, for “combine”

  • Case-sensitivity in column names; back-ticks for more than one word in a column name rather than quotes.

  • Missing or hanging pipes (|>)

  • You often have to run your code chunk twice after fixing an error. (There is still a mistake in R’s innards that has to get flushed out.)

These are the problems that will fix about 80 percent of the code that won’t run once you get more used to it.

21.1 Re-using answers from a previous step

Assign your code chunk to a new variable in the environment to use it again later. This is usefule when:

  • You have a complex filter that you want to apply to future steps.
  • You are joining data frames with columns having the same names: select just the columns you need and re-name any of the ones that are the same, except for the join one.
  • You have fixed some column types or adjusted their values.

The assignment operator is <-, which means “pour the answer into this variable name”

These can all go into one code chunk if you’d like:

select_ppp <- 
  ppp_orig |>
  filter ( project_state == "AZ" & !is.na (project_county) ) |>
  mutate ( forgiveness_amount = if_else (is.na(forgiveness_amount),
                                         0, 
                                         forgiveness_amount)) |>
  select ( borrower_name, borrower_city, 
           naics_code, 
           amount, forgiveness_amount,
           date_approved, forgiveness_date)



sectors <- 
  naics_codes |>
  select ( naics_code, sector_code, sector_desc)


select_ppp |>
  left_join ( sectors, by="naics_code")  |>
  select ( borrower_name, forgiveness_amount, sector_desc) |>
  head(5)

21.2 Importing / reading data

  • If you are referring to a file on your computer or on the web, then it must be in quotes.
  • Use the proper library and function to read the data.
  • Check the output for proper treatment of text, dates and numbers, especially with Excel and text imports

Reading R files

Two types of data files: RDS files contain one data frame, which you have to assign to a new object. Rda files have multiple objects that are already named. Use readRDS ( url ( ....) ) to read data stored on the internet instead of in your project.

mydata <- readRDS ( "mydata_file.RDS")
load ( "mydata_file.Rda")

Reading text (csv) files

Use read_csv(), read_tsv(), read_delimited() and various fixed file formats. Match the type of your file with the proper function. These require that the tidyverse or readr libraries are already loaded.

Reading Excel files

library (readxl)          # put this with the rest of your libraries.
mydata <- read_excel("mydata_file.xlsx", 
                      sheet = "sheet_name")  

There is no url() option for reading Excel files. They must be stored in your project. One approach is to download the file in your code:

download.file ( "https://address-of-file.xlsx", "myfilename.xlsx")
mydata <- read_excel ("myfilename.xlsx")

Import options

There are options in most of the importing functions that tell you how to treat each column and let you rename at the same time. One approach is :

read_csv ( "mydata_file.csv", 
           coltypes = col(.default = "c")
           )
           

To keep everything as text, which you can convert later.

21.3 Filtering

  1. One condition that’s a number :
  filter ( amount > 1000 )
  1. A number between two values
  filter ( between ( amount, 0, 1000))
  1. An exact phrase or word
  filter ( project_county == "MARICOPA")
  1. One of several possible entries in one column (exactly)
    filter ( project_county %in% c("MARICOPA", "PIMA", "PINAL" ))
  1. Everything except missing values
    filter ( ! is.na (project_county) )
  1. Between two dates . Be sure it’s really a date in the data by glimpsing your data frame. If not, turn it into a date first.
    filter ( approval_date >= "2021-01-24" & 
           approval_date <= "2021-01-31")
  1. Phrases, words or letters at the beginning of a column
    filter ( str_like (borrower_type , "Non-Profit%"))
  1. Phrases, letters or words at the end of a column
    filter ( str_like  (borrower_type, "Corporation%"))

All of these examples can be used in a mutate statement to create flags or new values if the conditions are met.

21.4 Aggregate (count, sum rows by category)

  1. Counting (How many?)
    group_by (project_county) |>
    summarize (loans = n() )  |>
    arrange ( desc ( loans ))

Make sure you don’t name the new column containing the count the same thing as a group_by() column.

  1. Shortcut for counting:
     count( project_county, sort=TRUE, name="loans")
  1. Counting unique entries

Sometimes you want to know how many items of a type, not how many rows, are included in a category.

        group_by ( project_county) |>
        summarize ( number_of_loans = n(), 
                    number_of_industries = n_distinct ( naics_code )
                    )
  1. Summing (how much?)
    group_by (project_county) |>
    summarize ( total_amount = sum (amount, na.rm=T))
  1. Shortcut for summing:
    count ( project_county, weight=amount, name="total_amount")

21.5 Recoding categories

You’ll often want fewer categories, or numbers in categories, that you want to use instead of the original values. This is done in a mutate statement. Don’t forget to save the output to a new data frame (<-), or you won’t have access to it later on.

  1. Create yes-no categories . This is really “Yes”, “No” or NA, where there is an NA to begin with.
    mutate ( corp_yn = if_else  
                      (str_detect (borrower_type, 
                                   "Corporation"), 
                       "Yes", 
                       "No")
       )
  1. Recode into more than two categories using case_when()

This example introduces str_detect(), which uses regular expressions. We’ll go over that later in the book.

  mutate (  new_business_type = 
                case_when (  str_like(business_type, "%non_profit%") ~ "Non-profit",    #1st category

                             business_type %in% 
                               c("Independent Contractors", 
                                "Sole Proprietorship", 
                                "Self-Employed Individuals", 
                                  "Single Member LLC")              ~ "Individual",    # 2nd category
                             
                             business_type == "Tribal Concerns"     ~ "Tribal concerns",    #3rd category
                             
                             str_detect (business_type, "LLC|Company|Corporation|Partnership") ~ "Companies",  #4th category
                             
                             TRUE ~ "Other")         #catchall 5th category
            )
  1. Recode numbers into categories

It’s often useful to give them numeric codes in front so they sort properly:

        mutate ( new_type = 
                 case_when  (
                    amount <= 1000 ~ "00-Very low", 
                    amount <= 10000  ~ "01-Low", 
                    amount <= 100000 ~ "03-Medium", 
                    amount > 100000 ~ "04-High")
              )

This works because the first one that it finds will be used, so a value of exactly 1,000 would be “Very low”, but a value of 1,001 would be “Low”.

21.6 Working with grouped data for subtotals, changes, percents

  1. Percent of total by group
  group_by ( project_county, sector_desc ) |>
  summarize ( loans = n() , .groups="drop_last") |>
  mutate ( pct_in_county = loans / sum(loans))
  1. Display results as in spreadsheet form

To see the items across the top, use pivot_wider.

  group_by ( project_county, new_type) |>
  summarize ( loan_count = n() ) |>
  pivot_wider ( names_from = new_type, 
                id_cols = c( project_county), 
                values_from = loan_count)

You can add an argument after values_from if you know that any missing values are zero, by using values_fill=0

You usually only choose one column to show down the side, one column to spread across the top, and one column to display the value.

  1. Complex example: Get the change by year within each county.

    annual_ppp <- mutate ( approve_year = year (approval_date ) ) |> group_by ( project_county, approve_year) |> summarize ( loans = n() , amount = sum( amount, na.rm=T), .groups=“drop_last”)

Next, compare them within groups. New functions introduced: complete(), which fills in missing information in a sequence. For example, if there were missing years by county, it would create a row to fill it in. lag() refers to the previous item in a group. First, make a new, summarized data frame with just the columns you nee

    complete ( project_county, approve_year, 
              fill= list (amount=0, loans =0) )|>
    mutate ( change_loans = loans - lag(loans), 
             pct_change_loans = change_loans / lag(loans) * 100) 
  1. Pick out the last item in a group, with all of its columns. New verb introduced : slice_tail() . This is particularly useful for chronological events, such as the last thing that happened in a court case, or the most recent complaint against a police officer. This example isn’t a great one, but it gets you the name and other details of the most recent loan for each lender

    arrange ( lender, date_approved) |>
    group_by ( lender) |>
    slice_tail(n=1)

NOTE: Don’t use slice_max(), because it will give you back multiple rows in the event of ties.

21.7 Joining tables

New options in the dplyr library that contains the join verb let us use the same operators in joining that you use in filtering: ==, <=, etc. Usually you will use the “==” operator. Requires dplyr 1.1, which was released in February 2023.

ppp_orig |> 
  select ( borrower_name, amount, naics_code) |> 
  inner_join ( naics_codes, 
               join_by (naics_code == naics_code) 
  )

21.8 Quarto document tips

YAML (front matter)

The YAML is very picky about exact casing, spacing and other details. If you try to render and you get a YAML error, it is likely at the top.

Typical YAML front matter that goes within the three dashes beginning on the very first line:


title: "Name of document"
author: "Your name"
output: 
  html: 
    theme: cosmo
    code-tools: true
    embed-resources: true
    toc: true
execute: 
   warning: true
   error: true

Code chunk options

These must be the very first rows in your code chunk, and must start in the first position, with spacing as shown. Use #| label: setup in the first code chunk to run it automatically every time you start up R.

Nothing runs; everything is gray

You probably erased the last three back-ticks at the end of the code chunk. You may need to restart R to get it re-set.