17  Verbs in depth: Select, arrange, filter

In this chapter

  • select to pick out columns efficiently and rename them
  • arrange to order the rows in your data frame
  • filter using Boolean logic
  • str_like() function for variations on a word or phrase
  • Introducing the problem of missing values, or NA

This the next several chapters are based on Paycheck Protection Program, or PPP, loans in Arizona. Full documentation of the dataset is in the Appendix. Make sure to review it carefully before going further.

Update your packages

The str_like() function in this chapter was added to the tidyverse in December 2022. If you installed packages before that, go to the “Packages” tab in the bottom right panel, find the stringr package, and press the “Update” button before continuing. Alternatively, in the console, run the command update.packages(ask=FALSE)

To follow along, you should

  1. Open or re-open your project
  2. Create a new Quarto document with the standard front matter (YAML) and libraries
  3. Load the PPP data into a new object called ppp_orig. It’s saved on the internet, which means using a nested function to tell R to go out in the world to find it rather than just looking in your project folder. That function is url():
ppp_orig <- readRDS ( 
               url ( 
                 "https://cronkitedata.s3.amazonaws.com/rdata/ppp_az_loans.RDS"
                 )
               )

17.1 select columns

The select verb allows you to pick out columns by name or by their order in the data frame. Selecting columns is case-sensitive: a column called amount is completely different than a column named Amount. That’s one reason to use our style of always converting column names to lower case.

select is the easiest verb to understand it so it’s shown here first. But in practice, I usually include it as my last verb in a code chunk because some columns that I don’t care about in the end are needed for filtering or creating new columns.

There a lot of ways to pick out column names that are based on their name, their position, their type or other characteristics. Here are a few:

  • select (date_approved, borrower_name) picks out those columns and rearranges them from left to right in the order you mention them.
  • select (1:10) picks out the first 10 columns. Think of the colon as the word “through”
  • select ( 2, 4:6, business_type) combines position and name so you can use whichever is easier.

You can rename columns at the same time you select them by typing the new name, one equals sign and the old name, for example:

select ( given_date = date_approved, 
        borrower = borrower_name)

Try it now and create a new data frame you’ll use later in the filtering section with just a handful of columns that are easier to manage.

ppp_small <- 
   ppp_orig |> 
  select (borrower_name, borrower_address, borrower_city,
          project_county, date_approved, amount, 
          forgiveness_amount)


glimpse(ppp_small)

17.2 arrange rows

The arrange verb sorts your data in different ways depending on the type of column. They can be alphabetical (character columns), in numeric order ( number or double columns), or chronologically (date or date/time columns) . Reverse the order by using desc() :

ppp_small |>
  arrange ( date_approved, desc(amount) )

17.3 filter rows

filter uses Boolean logic to allow quite sophisticated and powerful conditions to pick out just the rows you want to see or to further examine.

When you use a filter, you are telling R: “Give me back only the rows where the conditional formula I give you results in TRUE.”

Try it yourself: In your Console, type your name in quotes, then two = signs, then “Barack Obama” in quotes. Note that the answer is FALSE – that condition fails.

The conditions are case-sensitive and extremely picky about characteristics like upper and lower case, spacing and punctuation. For example, “PIZZA HUT OF ARIZONA, INC” is completely different than “PIZZA HUT” or “Pizza Hut of Arizona Inc” . A simple extra character will prevent a match.[^That’s why I’ve removed all of the punctuation from your the borrower names and addresses in your data file, and turned all of the names into upper case.]

Here are the operators , or symbols, you’ll use to make the conditions you want to test. Your column name goes on the left, and the value you want to test is on the right:

Operator Data types Meaning
== All Equals, exactly. Note the double-equals sign, which distinguishes it from an assignment operator, such as setting new column name.
> , >= date, numeric Greater than / greater than or equal to.
<, <= date, numeric Less than / less than or equal to.
!= all Not equal to
%in% all Is equal to any of the items in a list that you type out, such as fruit %in% c("apple", "orange").

In each of these cases, the column name goes on the left of the operator, and the condition goes on the right.

You can combine them using the Boolean operators and and or, which are used in other contexts like advanced Google searches :

  • & means “and”. which means that both conditions must be true. It narrows your search. (You can also separate the conditions with commas instead of ampersands) Think of it as a fish net, with smaller openings so only littler fish can geth through.

  • | (the verticle bar, near the Return key) means “or” , which means that either condition can be true. It widens your search. It is a fish net with bigger openings, so both big and little fish can get through.

  • %in% is the equivalent of an “or” condition and is used when you want to test a lot of items within one column but reduce the amount of repetitive typing.

This is sometimes confusing because it’s the exact opposite of the way we describe it in English. If we want “apples and oranges”, we have to search for an apple OR an orange in each row: fruit == 'apple' | fruit == 'orange'

You often need to use parentheses to tell R what order you want to evaluate the conditions when you combine “AND” and “OR” conditions.

Tip

Matching parentheses can be tricky. To make it a little easier, you can set “rainbow parentheses” in the RStudio options, which will show you the matching opening or closing brackets and parentheses in different colors when your cursor is placed next to one of them. It’s under the Tools -> Global Options -> Code -> at the bottom of the Display tab.

A few examples

In the PPP data, all borrower names are in upper case, and all cities and addresses are in proper case. In addition, all of the punctuation has been removed from those two columns, such as periods, commas, quotes and apostrophes.

When you use a condition :

  • Numbers are without quotes, commas, dollar signs or anything other than digits and decimal points. (1000000.24)
  • Text is always in quotes.
  • Dates are in the form “2022-01-22”.

A list of borrowers in Flagstaff and Sedona

Start with the borrowers just in Flagstaff.

ppp_small |> 
  filter ( borrower_city == "Flagstaff") 

Now look for either “Flagstaff” OR “Sedona”:

ppp_small |> 
  filter ( borrower_city == "Flagstaff" | borrower_city == "Sedona") 

This could get old if you’re looking for, say 10 or 20 cities. Instead, you can put them in a list and look for them all at once, like in this query where we look for Flagstaff, Sedona and Bisbee

ppp_small |> 
  filter ( borrower_city %in% 
             c("Flagstaff", "Sedona", "Bisbee")
           )

Narrowed to under $100,000

To narrow it just to loans under $100,000, you can use an “&” condition, or add another filter. In practice, I will often try to separate my “AND” conditions into a new filter command — it’s sometimes easier to read than a lot of (|&,) symbols.

ppp_small |>
  filter ( borrower_city %in% c("Flagstaff", "Sedona")) |>
  filter ( amount < 100000) |>
  arrange (date_approved)

Another way to do the same thing

The query above is the equivalent of the last example to produce the same answer. This time it uses a combination of the | “or” operator with the & “and” operator.

ppp_small |>
  filter ( 
           (borrower_city == "Flagstaff" | borrower_city == "Sedona") &
            amount < 10000
  ) |>
  arrange (date_approved)

(You might note how I did the indentations above – it helps in reading the code to make clear what comes before what.)

Projects in Maricopa County that are NOT in Phoenix or Scottsdale

Here’s how to negate a filter - everything NOT in two cities. The ! at the beginning of the condition means that it must evaluate to FALSE instead of TRUE to be kept in the data.

ppp_small |>
  filter ( project_county == "MARICOPA" &
             (! borrower_city %in% c("Scottsdale", "Phoenix")) 
  ) |>
  arrange (amount)
Warning

Be careful when using missing data or combining conditions when you negate. The answer is often very different than you intended. It often behaves opposite of what you intended.

17.4 Using str_like() for inexact matches

The operators we’ve examined so far are pretty limited. You have to know exactly what’s in a character column, which can get pretty annoying when you are trying to nail down a story based on data entered by humans into computers.

But you can also use the results of formulas, or expressions as a condition, as long as they result in a TRUE or FALSE answer. One of the most powerful is the function str_like(), which was added to the tidyverse in late 2022. It allows you to use wildcard characters to find words and phrases that begin with or contain some characters, and it’s not case-sensitive!

Tip

There is a whole set of these “string” functions available in the tidyverse through its stringr package. We’ll get to some of the more common ones later in the book, especially those related to a powerful pattern-matching method called “regular expressions” or “regex”.

The difference in using functions is that you’ll structure your command a little differently. Instead of

borrower_city == "Scottsdale", you’ll use a formula that looks like

str_like(borrower_city, "scottsdale"), where the function name comes first, then, inside parentheses, are the column you want to check and the value you want to look for.

str_like() uses wildcards to stand in for characters you’re not sure about. They are:

  • _ for one missing character. Use it when you aren’t sure of spelling.
    The condition summari_e would be TRUE if the word were “summarise” or “summarize”.

  • % for any number of unknown characters, or a missing character.
    For example, N% SCOTTSDALE R% would match “North Scottsdale Road”, “N. Scottsdale Rd” or “North Scottsdale Rd.” but not “Scottsdale” or “N Scottsdale Ave”.

Smart use of wildcards can often save you a lot of typing!

Examples

You might want to use str_like whenever you filter for words just because it doesn’t care about upper and lower case.

This example would find borrowers who had addresses that were “CENTRAL AVE”, “Central Ave” or “central ave”. It excludes anyone with a street address, a suite number, or anyone on North Central Ave. because there is no wildcard:

ppp_small |> 
  filter ( str_like ( borrower_address, "central ave"))

But this one finds all of the borrowers on Central Avenue:

ppp_small |> 
  filter ( str_like ( borrower_address, "% central ave%") )

If you examine some of the answers, you’ll see why having wildcards is so important — there’s really no way to guess how, exactly, a borrower might have entered their address. There’s no standardization on this or on many other columns in the data.

17.5 A note on NA, or missing data

Warning

A filter will never return any rows that are missing, or NA in the column you test. This is true both when you are trying to include them, and when you’re trying to include everything except something else.

In most data sets, there is at least a little missing data. In this case, the forgiveness amount never shows the value zero. Instead, it shows NA, or the R symbol for missing data.

This isn’t the same as the letters “NA”. It’s a special type of value that means “nothing”. It’s not zero, it’s not higher or lower than any other value, it doesn’t ever equal anything and it doesn’t ever NOT equal anything! It stands for a value that is truly unknown. In fact, an NA never even matches another NA, because we don’t know anything about either one.

This means you can’t filter for NA values directly and often have to take them into account.

To filter for NA values, you’ll use a special function — is.na() — designed to pick out these bad actors:

ppp_small |> 
  filter ( is.na ( forgiveness_amount ))

We’ll come back to this later, but for now just remember that missing data is a scourge on your analysis, and it will almost always be a problem you have to solve.

17.6 Your turn

Create a new Quarto document in your project. Add a sub-heading that provides documentation and sourcing of the data. You can load the data at the same time using this code chunk:

```{r}

ppp_orig <- readRDS( url ( "https://cronkitedata.s3.amazonaws.com/rdata/ppp_az_loans.RDS"))
```

Next, add a code chunk with the glimpse(ppp_orig) command to get a list of the columns in order.

Filtering your data

Create code chunks that will filter for the following values. You can select just the columns you want to see in whatever sort order makes sense for you to troubleshoot. You can rewrite the list below into plain English sentences to introduce the code chunks.

Find the following sets of loans, and try to think about why these might make for interesting or possible story ideas.

  1. At least $1 million

  2. Under $1,000

  3. Under $1,000 for projects in Maricopa County

  4. Loans that were paid in full but not forgiven. In other words, the borrower decided to pay it back rather than ask for taxpayers to cover it. (These loans are shown as missing, or NA in the forgiveness_amount column)

  5. Loans in Phoenix, Scottsdale and Tempe

  6. Loans in Arizona that were for projects out of state.

  7. McDonald’s franchises. (I removed the apostrophe.)

  8. Any loan to a Native American tribe or nation or reservation. Hint: It’s hard to do this because “NATION” could be part of the word “NATIONAL” and “TRIBE” could be part of the word “TRIBECA” and “RESERVATION” could be part of the word “PRESERVATION”. See how far you can get using your wildcards.

Your questions

Create a new section in your document, and list three additional filters that might yield something interesting. Before each attempt you make, write a sentence that describes, in plain English, what you are trying to find. Then after running your attempt, write a sentence summing up what you see in the result. Was it what you expected? Anything interesting?