15  Verbs in depth: Select, arrange, filter

This the next several chapters are based on Paycheck Protection Program, or PPP, loans in Arizona. Full documentation of the dataset is in the Appendix. Make sure to review it carefully before going further.

Update your packages

The str_like() function in this chapter was added to the tidyverse in December 2022. If you installed packages before that, go to the “Packages” tab in the bottom right panel, find the stringr package, and press the “Update” button before continuing. Alternatively, in the console, run the command update.packages(ask=FALSE)

As a reminder, here are the columns in our original data:

Rows: 169,259
Columns: 23
$ loan_id            <dbl> 2008847101, 3057347105, 3470877104, 3530847305, 377…
$ date_approved      <date> 2020-04-10, 2020-04-11, 2020-04-11, 2020-04-29, 20…
$ draw               <chr> "First", "First", "First", "First", "First", "First…
$ borrower_name      <chr> "SFE HOLDINGS LLC", "NAVAJO TRIBAL UTILITY AUTHORIT…
$ borrower_address   <chr> "9366 East Raintree Drive", "Po Box 170", "2999 N44…
$ borrower_city      <chr> "Scottsdale", "Fort Defiance", "Phoenix", "Tucson",…
$ borrower_state     <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ…
$ borrower_zip       <chr> "85260", "86504", "85018", "85711", "85250", "85012…
$ franchise_name     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ loan_status        <chr> "Paid in Full", "Paid in Full", "Paid in Full", "Pa…
$ loan_status_date   <date> 2021-08-17, 2022-02-05, 2021-09-25, 2021-08-21, 20…
$ amount             <dbl> 10000000, 10000000, 10000000, 10000000, 10000000, 1…
$ forgiveness_amount <dbl> 10108219, 9882528, 10133389, 10131667, 10126027, 10…
$ forgiveness_date   <date> 2021-06-11, 2022-01-24, 2021-08-17, 2021-08-26, 20…
$ lender             <chr> "The Huntington National Bank", "First Interstate B…
$ rural_urban        <chr> "U", "R", "U", "U", "U", "U", "U", "U", "U", "U", "…
$ low_income_area    <chr> "N", "Y", "N", "Y", "Y", "Y", "N", "N", "Y", "N", "…
$ project_county     <chr> "MARICOPA", "APACHE", "MARICOPA", "PIMA", "MARICOPA…
$ project_state      <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ…
$ project_cong_dist  <chr> "AZ-06", "AZ-01", "AZ-09", "AZ-02", "AZ-06", "AZ-09…
$ employees          <dbl> 500, 500, 500, 500, 500, 500, 500, 378, 500, 500, 5…
$ business_type      <chr> "Limited  Liability Company(LLC)", "Non-Profit Orga…
$ naics_code         <chr> "722310", "221122", "621610", "621111", "517311", "…

15.1 select columns

The select verb allows you to pick out columns by name or by their order in the data frame. Selecting columns is case-sensitive: a column called amount is completely different than a column named Amount. That’s one reason to use our style of always converting column names to lower case.

select is the easiest verb to understand it so it’s shown here first. But in practice, I usually include it as my last verb in a code chunk because some columns that I don’t care about in the end are needed for filtering or creating new columns.

There a lot of ways to pick out column names that are based on their name, their position, their type or other characteristics. Here are a few:

  • select (date_approved, borrower_name) picks out those columns and rearranges them from left to right in the order you mention them.
  • select (1:10) picks out the first 10 columns. Think of the colon as the word “through”
  • select ( 2, 4:6, business_type) combines position and name so you can use whichever is easier.

You can rename columns at the same time you select them by typing the new name, one equals sign and the old name, for example:

select ( given_date = date_approved, 
        borrower = borrower_name)
        

This example selects just a handful of columns to look at in depth. The original data hasn’t changed – we’ve just picked out the columns we want to use, and poured them into a new data frame called ppp_small.

1ppp_small <-
   ppp_orig |> 
  select (borrower_name, borrower_address, borrower_city,
          project_county, date_approved, amount, 
          forgiveness_amount)


2glimpse(ppp_small)
1
Creates a NEW data frame instead of printing the result onto the screen. After it is run, a separate command…
2
glimpse() lets you look at the saved data frame.
Rows: 169,259
Columns: 7
$ borrower_name      <chr> "SFE HOLDINGS LLC", "NAVAJO TRIBAL UTILITY AUTHORIT…
$ borrower_address   <chr> "9366 East Raintree Drive", "Po Box 170", "2999 N44…
$ borrower_city      <chr> "Scottsdale", "Fort Defiance", "Phoenix", "Tucson",…
$ project_county     <chr> "MARICOPA", "APACHE", "MARICOPA", "PIMA", "MARICOPA…
$ date_approved      <date> 2020-04-10, 2020-04-11, 2020-04-11, 2020-04-29, 20…
$ amount             <dbl> 10000000, 10000000, 10000000, 10000000, 10000000, 1…
$ forgiveness_amount <dbl> 10108219, 9882528, 10133389, 10131667, 10126027, 10…

15.2 arrange rows

The arrange verb sorts your data in different ways depending on the type of column. They can be alphabetical (character columns), in numeric order ( number or double columns), or chronologically (date or date/time columns) . Reverse the order by using desc() :

ppp_small |>
  arrange ( date_approved, desc(amount) )

15.3 filter rows

filter uses Boolean logic to allow quite sophisticated and powerful conditions to pick out just the rows you want to see or to further examine.

When you use a filter, you are telling R: “Give me back only the rows where the conditional formula I give you results in true

The conditions are extremely picky about characteristics like upper and lower case, spacing and punctuation. For example, “PIZZA HUT OF ARIZONA, INC” is completely different than “PIZZA HUT” or “Pizza Hut of Arizona Inc” . A simple extra character will prevent a match.1

Here are the operators , or symbols, you’ll use to make the conditions you want to test. Your column name goes on the left, and the value you want to test is on the right:

Operator Data types Meaning
== All Equals, exactly. Note the double-equals sign, which distinguishes it from an assignment operator, such as setting new column name.
> , >= date, numeric Greater than / greater than or equal to.
<, <= date, numeric Less than / less than or equal to.
!= all Not equal to
%in% all Matches any of the items in a list that you type out, such as fruit %in% c("apple", "orange").

In each of these cases, the column name goes on the left of the operator, and the condition goes on the right.

You can combine them using the Boolean operators and and or, which are used in other contexts like advanced Google searches :

  • & means “and”. which means that both conditions must be true. It narrows your search. Think of it as a fish net, with smaller openings so only littler fish can get through. You will sometimes see a comma instead of & – they both work.

  • | (the verticle bar, near the Return key) means “or” , which means that either condition can be true. It widens your search so both big and little fish can get through.

  • %in% is the equivalent of an “or” condition and is used when you want to test a lot of items within one column but reduce the amount of repetitive typing.

This is sometimes confusing because it’s the exact opposite of the way we describe it in English. If we want “apples and oranges”, we have to search for an apple OR an orange in each row: fruit == 'apple' | fruit == 'orange'

You often need to use parentheses to tell R what order you want to evaluate the conditions when you combine “AND” and “OR” conditions.

Unmatched parenthese cause weird errors

Matching parentheses can be tricky. To make it a little easier, you can set “rainbow parentheses” in the RStudio options, which will show you the matching opening or closing brackets and parentheses in different colors when your cursor is placed next to one of them. It’s under the Tools -> Global Options -> Code -> at the bottom of the Display tab.

A few examples

In the PPP data, all borrower names are in upper case, and all cities and addresses are in proper case. In addition, all of the punctuation has been removed from those two columns, such as periods, commas, quotes and apostrophes.

When you use a condition :

  • Numbers are without quotes, commas, dollar signs or anything other than digits and decimal points. (1000000.24)
  • Text is always in quotes.
  • Dates are in the form “2022-01-22”.

A list of borrowers in Flagstaff and Sedona

Start with the borrowers just in Flagstaff.

ppp_small |> 
  filter ( borrower_city == "Flagstaff") 

Now look for either “Flagstaff” OR “Sedona”:

ppp_small |> 
  filter ( borrower_city == "Flagstaff" | borrower_city == "Sedona") 

This could get old if you’re looking for, say 10 or 20 cities. Instead, you can put them in a list and look for them all at once, like in this query where we look for Flagstaff, Sedona and Bisbee

ppp_small |> 
  filter ( borrower_city %in% 
             c("Flagstaff", "Sedona", "Bisbee")
           )

Narrowed to under $100,000

To narrow it just to loans under $100,000, you can use an “&” condition, or add another filter. In practice, I will often try to separate my “AND” conditions into a new filter command — it’s sometimes easier to read than a lot of (|&,) symbols.

ppp_small |>
  filter ( borrower_city %in% c("Flagstaff", "Sedona")) |>
  filter ( amount < 100000) |>
  arrange (date_approved)

This is how you’d do it without the extra filter. Not the parentheses aroud the OR conditions. If you don’t have them, you ’ll get a weird (and wrong) answer – R won’t know which to do first.

ppp_small |>
  filter ( 
           (borrower_city == "Flagstaff" | borrower_city == "Sedona") &
            amount < 10000
          ) |>
  arrange (date_approved)

(You might note how I did the indentations above – it helps in reading the code to make clear what comes before what.)

Projects in Maricopa County that are NOT in Phoenix or Scottsdale

Here’s how to negate a filter - everything NOT in two cities. The ! at the beginning of the condition means that it must evaluate to FALSE instead of TRUE to be kept in the data.

ppp_small |>
  filter ( project_county == "MARICOPA" &
             (! borrower_city %in% c("Scottsdale", "Phoenix")) 
  ) |>
  arrange (amount)
Warning

Be careful when using missing data or combining conditions when you negate. The answer is often very different than you intended. It often behaves opposite of what you intended.

15.4 Using str_like() for inexact matches

The operators we’ve examined so far are pretty limited. You have to know exactly what’s in a character column, which can get pretty annoying when you are trying to nail down a story based on data entered by humans into computers.

But you can also use the results of formulas, or expressions as a condition, as long as they result in a true or false answer. One of the most powerful is the function str_like(), which was added to the tidyverse in late 2022. It allows you to use wildcard characters to find words and phrases that begin with or contain some characters, and it’s not case-sensitive!

Tip

There is a whole set of these “string” functions available in the tidyverse through its stringr package. We’ll get to some of the more common ones later in the book, especially those related to a powerful pattern-matching method called “regular expressions” or “regex”.

The difference in using functions is that you’ll structure your command a little differently. Instead of

borrower_city == "Scottsdale", you’ll use a formula that looks like

str_like (borrower_city, "scottsdale"), where the function name comes first, then, inside parentheses, are the column you want to check and the value you want to look for.

str_like() uses wildcards to stand in for characters you’re not sure about. They are:

  • _ for one missing character. Use it when you aren’t sure of spelling. The condition summari_e would be true if you weren’t sure if the British “summarise” or the American “summarize” were used in a column.

  • % for any number of unknown characters, or a missing character.
    For example, N% SCOTTSDALE R% would match “North Scottsdale Road”, “N. Scottsdale Rd” or “North Scottsdale Rd.” but not “Scottsdale” or “N Scottsdale Ave”.

Smart use of wildcards can often save you a lot of typing!

Examples

You might want to use str_like whenever you filter for words simply because it doesn’t care about upper and lower case.

This example would find borrowers who had addresses that were “CENTRAL AVE”, “Central Ave” or “central ave”. It excludes anyone with a street address, a suite number, or anyone on North Central Ave. because there is no wildcard:

ppp_small |> 
  filter ( str_like ( borrower_address, "central ave"))

But this one finds all of the borrowers on Central Avenue:

ppp_small |> 
  filter ( str_like ( borrower_address, "% central ave%") )

If you examine some of the answers, you’ll see why having wildcards is so important — there’s really no way to guess how, exactly, a borrower might have entered their address. There’s no standardization on this or on many other columns in the data.

15.5 A note on NA, or missing data

Missing values never match anything!

A filter will never return any rows that are missing, or the special NA value, in the column you test – it is a void that is never equal or not equal to any other value.

In most datasets, there is at least a little missing data. In this case, the forgiveness amount never shows the value zero. Instead, it shows NA, or the R symbol for missing data.

This isn’t the same as the letters “NA”. It’s a special type of value that means “nothing”. It’s not zero, it’s not higher or lower than any other value, it doesn’t ever equal anything and it doesn’t ever NOT equal anything! It stands for a value that is truly unknown. In fact, an NA doesn’t even equal another NA, because we don’t know anything about either one.

This means you can’t filter for NA values directly and often have to take them into account.

To filter for NA values, you’ll use a special function — is.na() — designed to pick out these bad actors:

ppp_small |> 
  filter ( is.na ( forgiveness_amount ))

We’ll come back to this later, but for now just remember that missing data is a scourge on your analysis, and it will almost always be a problem you have to solve.


  1. That’s why I’ve removed all of the punctuation from your the borrower names and addresses in your data file, and turned all of the names into upper case.↩︎