This markdown gives you instructions for importing and understanding the PPP data for your mini-memo. There are other background documents and example stories in the Canvas module that you can also review.
To load the data into your Global Environment add this line to a code chunk:
data_url <- "https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp_zipcodes.Rda?raw=true"
load (url ( data_url) )
This will create two data frames in your working environment: az_ppp_zip, which is the list of loans, and az_by_zipcode, which has information on the zip codes.
The original data was published by the SBA here https://sba.app.box.com/s/5myd1nxutoq8wxecx2562baruz774si6 on March 1, 2021. I downloaded and unzipped the PPP portion, and read it into R using a different program.
The SBA has never released a record layout or data dictionary for this data that I can find. I worked on a project over the Christmas break using an earlier version of the data, so I think I understand much of what it contains.
I selected about half of the variables that are included in the original data, and simplified a few as shown below. I also converted all character fields to upper case to ease filtering, and converted the Zip Codes to 5-digits .
This dataset includes all loans that were made to companies in Arizona. A handful (6) of these are going toward projects in other states.
Here are the columns I kept, and what I understand them to mean. Any changes other than those above are indicated in the record layout.
The loan application form . There’s one important thing I don’t know the answer to: There are no rows with a loan statust that suggests that it was forgiven, though we know from reports that more than $100 billion have already been forgiven. There is a “loan status” field in the data, but that doesn’t have anything related to forgiveness, and the “PAID IN FULL” indicator doesn’t add up to nearly enough money. I’m trying to find out what the answer is there, but I wanted to warn you not to use that column until we get some clarity on it.
| column name | type | description |
|---|---|---|
| id | n | A unique identifier supplied by the SBA |
| date_approved | d | Date the loan was initially approved |
| status_date | d | When the status was reported (NA if unreported) |
| borrower | c | Name of the borrower, all upper case |
| address | c | Address of the borrower |
| city | c | |
| state | c | |
| zip | c | 5-digit Zip code of the borrower, as reported on the loan |
| census_zipcode | c | 5-digit “zip code tabulatioon area” used by the Census bureau* |
| loan_status | c | Blank if redacted because of EXEMPTION 4 - VERY unclear what this means! |
| term | n | Months to pay back |
| initial_amt | n | Initial amount approved by SBA, as reported by the bank. |
| payroll_amt | n | Amount used for payroll |
| nonpayroll_amt | n | Total amount used for other costs such as utilities, rent, or interest on loans |
| franchise | c | Name of a franchisee |
| business_type | c | How the business is organized, such as “SOLE PROPRIETOR” or “CORPORATION” |
| lender | c | Name of originating lender (not servicing lender) |
| lender_state | c | |
| rural_urban | c | |
| business_age | c | How long the business had been open before the loan - |
| project_city | c | City where the money will be used |
| project_county | c | County where the money will be used |
| project_state | c | State where the county will be used |
| jobs_reported | n | Number of jobs “saved” by the loan |
| naics_code | c | Standard industry code used in the federal government |
| naics_sector | c | Sector of that standard industry code, where it was a valid one |
| naics_descript | c | 2017 description of the industry, blank if a previous year’s version used |
| race | c | Race of business owner, “UNANSWERED” is 85% of loans. Hispanic is not a race. |
| ethnicity | c | “HISPANIC OR LATINO”, “NOT HISPANIC OR LATINO”, “UNKNOWN/NOTSTATED” (85% of loans) |
| gender | c | Same with gender - Usually not filled out |
| veteran | c | Same with veteran - Usually not filled out |
| non_profit | c | “Y” if the borrower is a non-profit organization |
census_zipcode is the one that you will use to match against the demographic information.To get demographic information, we had to use “zip code tabulation areas”, which is what the Census uses. The reason is that Zip Codes are actually postal codes, and refelct post office routes rather than areas. They can also contain post office boxes and special zip codes for large institutions like ASU. I created a new column in this dataset to reflect that Census zip code, which can be used to match against the rest of our demographic information.
Creating it was a two-step process. First, I matched all “ZCTA” data from the Census itself against the original list and kept whatever matched. If they didn’t, I applied the census ZCTA found in the “UDS Mapper” crosswalk, which is a health care coordination consortium created to help administer Obamacare.
One element that isn’t clear is whether the loan was forgiven or not (yet). The borrower has up to two years to request forgiveness, and they can’t even ask until they’ve used all the money.
The zip code data you are getting is a compilation of sources that I created for some work I did for Reveal over the holiday break. (It was preliminary work to get started reporting, and won’t be used in the final story – they were in the process of geocoding every loan so that they had Census Tract information rather than zip codes, which are much better for demographic purposes.)
There is no good measure of the number of businesses that exist in any small geographic area like zip code or Census Tract. The reason is that most government surveys only include private, for-profit entities with at least 5 employees. About 80 percent of all businesses have NO employees, and a very large proportion of the rest have fewer than 5. The problem is, this loan program was targeted specifically at these small entities!
This means it’s very difficult to estimate the percent of all businesses in a zip code that got loans! The best proxy I could find is from the U.S. Postal Service, which produces a file every quarter that has the number of businesses and residential addresses that can get mail. It excludes PO boxes, and would count as “residential” a business run out of a home, so it’s not exact. But it’s the best we can find.
I obtained the Census data using Social Explorer , which you can access through the ASU library. It makes it relatively easy to obtain cuts of Census products tailored to your needs. I used the 2014-2018 American Community Survey to obtain basic demographic information by Zip Code Tabulation area.
Zip codes don’t have basic geographic elements connected to them! I used the Census TIGER mapping files to attach the county, latitude, longitude, etc. to the zctas using their center points.
| column name | type | Description |
|---|---|---|
| zcta | c | The census zip code (matches to census_zip in the ppp data) |
| zipcode_city | c | The name of the post office, from UDS mapper |
| st_county_fip | c | A standard code indicating the county that the zip code is in |
| county_name | c | The name of that county |
| metro_name | c | Metro area name if there is one for that zip code |
| usps_businesses | n | # of business addresses estimated by the US Postal Service |
| tot_pop | n | Total population (Some are zero) |
| households | n | # of households |
| pct_white_nh | n | Proportion of White, non-Hispanic population (percent in decimals) |
| pct_black_nh | n | Proporation of Black, non-Hispanic |
| pct_aian_nh | n | Proportion of American Indian or Alaska Native non-Hispanic |
| pct_asian_nh | n | Proportion of Asian non-Hispanic |
| pct_hispanic | n | Proportion of Hispanic or Latino population |
| zcta_ethnic | n | Ethnicity categorization using Urban Institute’s methodology |
| white_nh_group | n | Categorization of the prop.of White non-Hispanic population, using Urban Institute |
| median_inc_2018 | m | Median income in 2018 dollars |
| income_lt25 | n | Number of households with less than $25,000 annual income |
| income_150kup | n | Number of households with $150,000 annual income or more |
| aggregate_income | n | Total household income (can be divided by households for average income) |
| pct_college_grad | n | Proportion of adults with college degree |
| pct_kids_droppedout | n | Proportion of youth 18 and under who dropped out of school |
| unempl_rate | n | Average unemployment rate 2014-2018 |
| latitude | n | The latitude of the center point of the zip code |
| longitude | n | The longitude of the center point of the zip code |
You can use the demographic information in two ways:
To attach information about the area to each individual loan, allowing you to look up businesses that were located in zip codes with specific characteristics.
To use with summarised information from the loans to find out, for example, which lenders provided loans in minority areas.
Here are two examples:
Here, we’ll pull out information on the ethnicity of the zip code, and attach it to the individual loan information. Then we can look up businesses that were in certain areas.
First, create a data frame with the information that you want to keep from the demographic file:
az_keep_zip <-
az_by_zipcode %>%
select ( zcta, zipcode_city, county_name, usps_businesses: households, zcta_ethnic, white_nh_group, pct_hispanic, pct_white_nh, pct_aian_nh, pct_black_nh,
median_inc_2018)Now join it to the loan data and pick out loans that match a few criteria and select some columns you might want to examine:
az_ppp_zip %>%
left_join ( az_keep_zip, by=c("census_zip"="zcta")) %>%
filter ( county_name == "Yuma County",
str_detect (zcta_ethnic, "Hispanic"),
white_nh_group == "0 to 10%") %>%
select ( borrower, city, census_zip, initial_amt, ethnicity,
median_inc_2018, zcta_ethnic, pct_hispanic, address
) %>%
arrange ( census_zip, borrower)The opposite strategy can be used to summarize how loans were made to minority and non-minority zip codes. You could use this to find places where very few businesses got loans , or where most of the businesses got loans, in minority neighborhoods. This is a lot more complicated and you have to be careful of data anomolies, but it is also very powerful: It lets you map out the kinds of neighborhoods that got, and didn’t get, loans.
You can look at how much you can do with this approach by looking at one of the metro areas I profiled for the Reveal project.
You could also look at whether the Biden administration is, so far, doing more to serve minority or underserved areas.
Here is a simple example, which looks at the distribution of loans by county, depending on the ethnicity of the zip code.
First, we have to get the number of loans and some other data about each zip code from the PPP loan data. We should also be sure we know how many businesses got loans, not just the number of loans, because we’re going to want to get a rate of participation later on.
ppp_ziptotals <-
az_ppp_zip %>%
group_by ( census_zip ) %>%
summarise ( loans = n(),
borrowers = n_distinct ( borrower, zip) ,
total_amount = sum ( payroll_amt + nonpayroll_amt)
)Now we’re going to match them against the zip code information we kept earlier, making sure to keep all of the rows (not just thost that had loans in them). This will drop out any zipcodes that aren’t really in Arizona or are blank in our PPP data. We also have to make the number of loans zero in zip codes that didn’t exist in our PPP loan data.
(This is a “left_join”, which means “keep everything from the first data frame mentioned, and only the things that match from the second one listed.”)
ppp_zip_sums <-
az_keep_zip %>%
left_join ( ppp_ziptotals, by=c("zcta"="census_zip")) %>%
# we are left with some NA's because not every zip code had any loans
# There's a shortcut way to do this, but I don't want to make it more confusing than necessary right now.
mutate ( loans = replace_na (loans, 0),
borrowers = replace_na (borrowers, 0),
total_amount = replace_na (total_amount, 0)) In practice, we’d have to decide what to do in cases in which there were more loans than businesses! In some cases, the Postal Service had no business addresses but there were loans . In others, there were many more loans than there were businesses. You have some choices:
Set the % of businesses with loans to 100% if the number of loans was more than the number of unique businesses.
Add together the number of businesses and the number of households, to get a stable basis for a comparison. This basically says, “out of all of the entities we know of in this zip code, what portion got loans?” If it’s a business district, that would be high. But it would never be more than 100%!
Move up a level of aggregation – look at totals by county and ethnicity, or group the zipcodes together some other way, to summarize the data and paper over the data problems.
Don’t worry about it, and pick out some interesting zip codes, such as some with relatively low income or some predominant ethnicity, and then compare their total loan rates with everyone else.
Use the anomalies to your advantage – look at those places on a map or a satellite image, and see if they might be good places to center a story!
Here’s an example of aggregating to the county level:
az_ppp_county <-
ppp_zip_sums %>%
group_by (county_name, zcta_ethnic ) %>%
summarise (across ( c( usps_businesses, tot_pop, households, loans, borrowers, total_amount) , sum, na.rm=T))## `summarise()` has grouped output by 'county_name'. You can override using the `.groups` argument.
Now, for each group, calculate the percent of the businesses that were borrowers. Cap it at 100%. I’ll do it in two steps so it’s a little more obvious:
az_ppp_county %>%
mutate ( pct_borrowed = borrowers / usps_businesses * 100 ) %>%
# fix it.
mutate ( pct_borrowed =
if_else ( pct_borrowed <= 100, pct_borrowed, 100 )) %>%
# round it off.
mutate (pct_borrowed = round (pct_borrowed, 2))You can see we get really weird answers for the small counties with few businesses registered with the Postal Service. In this case, we might want to add together the number of households and businesses, and then try it. It’s not awesome, but it at least gives us a reasonable rate of some kind:
az_ppp_county %>%
mutate ( pct_borrowed = borrowers / (usps_businesses + households) * 100 ) %>%
# fix it.
mutate ( pct_borrowed =
if_else ( pct_borrowed <= 100, pct_borrowed, 100 )) %>%
# round it off.
mutate (pct_borrowed = round (pct_borrowed, 2))Let’s do the same thing, but compute some averages by county so we know if it’s different from the total.
az_ppp_county %>%
mutate ( pct_borrowed = borrowers / (usps_businesses + households) * 100 ) %>%
# fix it.
mutate ( pct_borrowed =
if_else ( pct_borrowed <= 100, pct_borrowed, 100 )) %>%
# round it off.
mutate (pct_borrowed = round (pct_borrowed, 2),
county_pct_borrowed = round (sum( borrowers) / sum( usps_businesses + households) * 100, 2)) Finally, let’s just pick out some of the numbers and flip them around so that you can compare “spreadsheet” style:
az_ppp_county %>%
#copied from above
mutate ( pct_borrowed = borrowers / (usps_businesses + households) * 100 ) %>%
mutate ( pct_borrowed =
if_else ( pct_borrowed <= 100, pct_borrowed, 100 )) %>%
mutate (pct_borrowed = round (pct_borrowed, 2),
county_pct_borrowed = round (sum( borrowers) / sum( usps_businesses + households) * 100, 2)) %>%
#this part is new:
select ( zcta_ethnic, county_name, county_pct_borrowed, pct_borrowed ) %>%
filter (str_detect (zcta_ethnic, "01|04|06") ) %>%
pivot_wider ( names_from = zcta_ethnic, values_from=pct_borrowed)Because we’ve compared the number of borrowers to a combination of businesses and households, it’s a little hard to describe this answer. But here’s a stab at it:
In Maricopa County, the predominantly White areas had about 68 percent higher rates of borrowing than predominantly Latino areas, based on the number of businesses and households in each zip code.