<-
select_ppp |>
ppp_orig filter ( project_state == "AZ" & !is.na (project_county) ) |>
mutate ( forgiveness_amount = if_else (is.na(forgiveness_amount),
0,
|>
forgiveness_amount)) select ( borrower_name, borrower_city,
naics_code,
amount, forgiveness_amount,
date_approved, forgiveness_date)
<-
sectors |>
naics_codes select ( naics_code, sector_code, sector_desc)
|>
select_ppp left_join ( sectors, by="naics_code") |>
select ( borrower_name, forgiveness_amount, sector_desc) |>
head(5)
21 Recipes
This final chapter on data wrangling in R simply puts together a lot of the code snippets that are easy to forget, and adds some more advanced code examples that you can adapt to your work. There is minimal explanation. Instead, look at at the relevant chapter for more explanation. Over time, I’ll add section links to the examples.
You can copy and paste these into RStudio’s snippets and use them directly in your program – then you just have to change the references to the variables in your own work.
These examples use the data from the other chapters, from the PPP program.
R annoyances and errors
There are several annoyances that aren’t consistent from one function in R to the next. When you’re having trouble, look for
Quoting issues - what kind of quotes, whether they’re needed, and if they’re matched open and closed.
Unmatched or missing parentheses
Putting words on the wrong side of equal signs. To create a new column name, put it on the left. To identify a column to be used as an argument, put it on the right. To filter a column, use two equals signs, with the column name on the left.
Are you working with a list of items? If so, you need to wrap them in the
c()
function, for “combine”Case-sensitivity in column names; back-ticks for more than one word in a column name rather than quotes.
Missing or hanging pipes (
|>
)You often have to run your code chunk twice after fixing an error. (There is still a mistake in R’s innards that has to get flushed out.)
These are the problems that will fix about 80 percent of the code that won’t run once you get more used to it.
21.1 Re-using answers from a previous step
Assign your code chunk to a new variable in the environment to use it again later. This is usefule when:
- You have a complex filter that you want to apply to future steps.
- You are joining data frames with columns having the same names: select just the columns you need and re-name any of the ones that are the same, except for the join one.
- You have fixed some column types or adjusted their values.
The assignment operator is <-
, which means “pour the answer into this variable name”
These can all go into one code chunk if you’d like:
21.2 Importing / reading data
- If you are referring to a file on your computer or on the web, then it must be in quotes.
- Use the proper library and function to read the data.
- Check the output for proper treatment of text, dates and numbers, especially with Excel and text imports
Reading R files
Two types of data files: RDS files contain one data frame, which you have to assign to a new object. Rda files have multiple objects that are already named. Use readRDS ( url ( ....) )
to read data stored on the internet instead of in your project.
mydata <- readRDS ( "mydata_file.RDS")
load ( "mydata_file.Rda")
Reading text (csv) files
Use read_csv()
, read_tsv()
, read_delimited()
and various fixed file formats. Match the type of your file with the proper function. These require that the tidyverse or readr
libraries are already loaded.
Reading Excel files
library (readxl) # put this with the rest of your libraries.
mydata <- read_excel("mydata_file.xlsx",
sheet = "sheet_name")
There is no url()
option for reading Excel files. They must be stored in your project. One approach is to download the file in your code:
download.file ( "https://address-of-file.xlsx", "myfilename.xlsx")
mydata <- read_excel ("myfilename.xlsx")
Import options
There are options in most of the importing functions that tell you how to treat each column and let you rename at the same time. One approach is :
read_csv ( "mydata_file.csv",
coltypes = col(.default = "c")
)
To keep everything as text, which you can convert later.
21.3 Filtering
- One condition that’s a number :
filter ( amount > 1000 )
- A number between two values
filter ( between ( amount, 0, 1000))
- An exact phrase or word
filter ( project_county == "MARICOPA")
- One of several possible entries in one column (exactly)
filter ( project_county %in% c("MARICOPA", "PIMA", "PINAL" ))
- Everything except missing values
filter ( ! is.na (project_county) )
- Between two dates . Be sure it’s really a date in the data by glimpsing your data frame. If not, turn it into a date first.
filter ( approval_date >= "2021-01-24" &
approval_date <= "2021-01-31")
- Phrases, words or letters at the beginning of a column
filter ( str_like (borrower_type , "Non-Profit%"))
- Phrases, letters or words at the end of a column
filter ( str_like (borrower_type, "Corporation%"))
All of these examples can be used in a mutate
statement to create flags or new values if the conditions are met.
21.4 Aggregate (count, sum rows by category)
- Counting (How many?)
group_by (project_county) |>
summarize (loans = n() ) |>
arrange ( desc ( loans ))
Make sure you don’t name the new column containing the count the same thing as a group_by() column.
- Shortcut for counting:
count( project_county, sort=TRUE, name="loans")
- Counting unique entries
Sometimes you want to know how many items of a type, not how many rows, are included in a category.
group_by ( project_county) |>
summarize ( number_of_loans = n(),
number_of_industries = n_distinct ( naics_code )
)
- Summing (how much?)
group_by (project_county) |>
summarize ( total_amount = sum (amount, na.rm=T))
- Shortcut for summing:
count ( project_county, weight=amount, name="total_amount")
21.5 Recoding categories
You’ll often want fewer categories, or numbers in categories, that you want to use instead of the original values. This is done in a mutate statement. Don’t forget to save the output to a new data frame (<-), or you won’t have access to it later on.
- Create yes-no categories . This is really “Yes”, “No” or NA, where there is an NA to begin with.
mutate ( corp_yn = if_else
str_detect (borrower_type,
("Corporation"),
"Yes",
"No")
)
- Recode into more than two categories using
case_when()
This example introduces str_detect()
, which uses regular expressions. We’ll go over that later in the book.
mutate ( new_business_type =
case_when ( str_like(business_type, "%non_profit%") ~ "Non-profit", #1st category
business_type %in%
c("Independent Contractors",
"Sole Proprietorship",
"Self-Employed Individuals",
"Single Member LLC") ~ "Individual", # 2nd category
business_type == "Tribal Concerns" ~ "Tribal concerns", #3rd category
str_detect (business_type, "LLC|Company|Corporation|Partnership") ~ "Companies", #4th category
TRUE ~ "Other") #catchall 5th category
)
- Recode numbers into categories
It’s often useful to give them numeric codes in front so they sort properly:
mutate ( new_type =
case_when (
<= 1000 ~ "00-Very low",
amount <= 10000 ~ "01-Low",
amount <= 100000 ~ "03-Medium",
amount > 100000 ~ "04-High")
amount )
This works because the first one that it finds will be used, so a value of exactly 1,000 would be “Very low”, but a value of 1,001 would be “Low”.
21.6 Working with grouped data for subtotals, changes, percents
- Percent of total by group
group_by ( project_county, sector_desc ) |>
summarize ( loans = n() , .groups="drop_last") |>
mutate ( pct_in_county = loans / sum(loans))
- Display results as in spreadsheet form
To see the items across the top, use pivot_wider
.
group_by ( project_county, new_type) |>
summarize ( loan_count = n() ) |>
pivot_wider ( names_from = new_type,
id_cols = c( project_county),
values_from = loan_count)
You can add an argument after values_from
if you know that any missing values are zero, by using values_fill=0
You usually only choose one column to show down the side, one column to spread across the top, and one column to display the value.
Complex example: Get the change by year within each county.
annual_ppp <- mutate ( approve_year = year (approval_date ) ) |> group_by ( project_county, approve_year) |> summarize ( loans = n() , amount = sum( amount, na.rm=T), .groups=“drop_last”)
Next, compare them within groups. New functions introduced: complete()
, which fills in missing information in a sequence. For example, if there were missing years by county, it would create a row to fill it in. lag()
refers to the previous item in a group. First, make a new, summarized data frame with just the columns you nee
complete ( project_county, approve_year,
fill= list (amount=0, loans =0) )|>
mutate ( change_loans = loans - lag(loans),
pct_change_loans = change_loans / lag(loans) * 100)
Pick out the last item in a group, with all of its columns. New verb introduced :
slice_tail()
. This is particularly useful for chronological events, such as the last thing that happened in a court case, or the most recent complaint against a police officer. This example isn’t a great one, but it gets you the name and other details of the most recent loan for each lenderarrange ( lender, date_approved) |> group_by ( lender) |> slice_tail(n=1)
NOTE: Don’t use slice_max()
, because it will give you back multiple rows in the event of ties.
21.7 Joining tables
New options in the dplyr
library that contains the join
verb let us use the same operators in joining that you use in filtering: ==
, <=
, etc. Usually you will use the “==” operator. Requires dplyr 1.1, which was released in February 2023.
ppp_orig |>
select ( borrower_name, amount, naics_code) |>
inner_join ( naics_codes,
join_by (naics_code == naics_code)
)
21.8 Quarto document tips
YAML (front matter)
The YAML is very picky about exact casing, spacing and other details. If you try to render and you get a YAML error
, it is likely at the top.
Typical YAML front matter that goes within the three dashes beginning on the very first line:
title: "Name of document"
author: "Your name"
output:
html:
theme: cosmo
code-tools: true
embed-resources: true
toc: true
execute:
warning: true
error: true
Code chunk options
These must be the very first rows in your code chunk, and must start in the first position, with spacing as shown. Use #| label: setup
in the first code chunk to run it automatically every time you start up R.
Nothing runs; everything is gray
You probably erased the last three back-ticks at the end of the code chunk. You may need to restart R to get it re-set.