C Murder Accountability Project exercises

The Murder Accountability Project was profiled in The New Yorker in November 2017. Andrew Ba Tran brilliantly decided to use its data as a vehicle for learning R in his R for Journalists online class. This set of tutorials adapts his exercises for people using this textbook. Consider taking his full course, which is free, if you want more detail.

Download the data into a new or existing R project, then load it using the load(file="murder_data.Rda") command in a code chunk.
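For example, a minimal setup chunk might look like this, assuming the file is saved in your project folder and that you load the tidyverse packages as elsewhere in this book:

library(tidyverse)

# load() restores the saved object -- a data frame named murder_data -- into your environment
load(file = "murder_data.Rda")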

C.1 The data

This R dataset was created by subsetting the national dataset to just the Mountain states plus Oregon and Washington, and by converting the codes used in the original SPSS data into separate code and label variables. The data runs from 1976 through 2018. Here are the fields included:

## Rows: 56,075
## Columns: 37
## $ rowid             <int> 27475, 27476, 27477, 27478, 27479, 27480, 27481, 27…
## $ mapid             <chr> "197612001AZ00100", "197708001AZ00100", "197801001A…
## $ year              <dbl> 1976, 1977, 1978, 1982, 1985, 1988, 1989, 1990, 199…
## $ month             <dbl> 12, 8, 1, 7, 6, 11, 10, 9, 12, 2, 12, 3, 9, 4, 2, 8…
## $ fip               <chr> "04001", "04001", "04001", "04001", "04001", "04001…
## $ state_fip         <chr> "04", "04", "04", "04", "04", "04", "04", "04", "04…
## $ cnty_name         <chr> "Apache, AZ", "Apache, AZ", "Apache, AZ", "Apache, …
## $ msa_code          <chr> "99904", "99904", "99904", "99904", "99904", "99904…
## $ msa_name          <chr> "Rural Arizona", "Rural Arizona", "Rural Arizona", …
## $ state_abbr        <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ…
## $ ori               <chr> "AZ00100", "AZ00100", "AZ00100", "AZ00100", "AZ0010…
## $ agency            <chr> "Apache County", "Apache County", "Apache County", …
## $ agency_type       <chr> "Sheriff", "Sheriff", "Sheriff", "Sheriff", "Sherif…
## $ solved            <chr> "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes…
## $ homicide_type     <chr> "Murder and non-negligent manslaughter", "Murder an…
## $ incident_num      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ vic_age           <dbl> 42, 28, 0, 23, 18, 23, 43, 33, 33, 0, 38, 0, 37, 25…
## $ vic_sex           <chr> "Male", "Male", "Female", "Female", "Male", "Male",…
## $ vic_race_code     <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "…
## $ vic_race          <chr> "White", "White", "White", "White", "White", "White…
## $ vic_ethnic_code   <chr> "U", "U", "U", "N", "U", "U", "N", "N", "N", "N", "…
## $ vic_ethnic        <chr> "Unknown or not reported", "Unknown or not reported…
## $ off_age           <dbl> 22, 44, 50, 999, 999, 41, 39, 48, 999, 29, 18, 14, …
## $ off_sex           <chr> "Male", "Male", "Male", "Unknown", "Unknown", "Male…
## $ off_race_code     <chr> "I", "W", "W", "U", "U", "W", "W", "W", "U", "W", "…
## $ off_race          <chr> "American Indian or Alaskan Native", "White", "Whit…
## $ off_ethnic_code   <chr> "U", "U", "U", "U", "U", "U", "N", "N", "U", "N", "…
## $ off_ethnic        <chr> "Unknown or not reported", "Unknown or not reported…
## $ situation_code    <chr> "C", "A", "C", "B", "B", "A", "A", "A", "B", "A", "…
## $ situation         <chr> "Single victim/multiple offenders", "Single victim/…
## $ weapon_code       <chr> "20", "13", "90", "11", "12", "12", "12", "12", "12…
## $ weapon            <chr> "Knife or cutting instrument", "Rifle", "Other or t…
## $ relationship_code <chr> "AQ", "AQ", "DA", "UN", "UN", "AQ", "AQ", "AQ", "ST…
## $ relationship      <chr> "Acquaintance", "Acquaintance", "Daughter", "Relati…
## $ circumstance_code <chr> "42", "44", "59", "99", "99", "45", "5", "60", "3",…
## $ circumstance      <chr> "Brawl due to influence of alcohol", "Argument over…
## $ use_date          <date> 1980-03-01, 1980-03-01, 1980-03-01, 1982-10-06, 19…

Any variable that begins with off refers to the offender; any variable that begins with vic refers to the victim. Most variables are pretty self-explanatory, but here are a few details:

  • fip and state_fip are standard codes used across many databases to identify geographic areas – in this case, counties and states.
  • msa_code and msa_name refer to Metropolitan Statistical Areas, which combine nearby counties, regardless of state, into areas that are often considered one large metropolitan area, such as Washington, DC, and its Virginia and Maryland suburbs.
  • ori, agency and agency_type refer to the law enforcement agency – police or sheriff’s office – that investigated the murder. An ORI is a standard code for each agency from the FBI.
  • relationship_code and relationship can be confusing. They refer to the relationship of the victim to the offender, not the other way around. For example, “Wife” means that the victim was the killer’s wife, not that the killer was the victim’s wife. This is unclear in the FBI documentation.

More details are available in this detailed record layout and data dictionary.

C.2 Exercises by chapter

C.2.1 Select and filter exercises

Here are some suggested exercises to practice what you learned in Chapter 5, Select and Filter:

Older wives as victims in Arizona

  1. Create a new data frame called arizona_murders based on just the murders that were reported in that state.

Try doing these one step at a time, adding to the query as you go:

  1. Pick out just the following variables to work with:

    • year
    • name of the county and the police department,
    • whether it was solved,
    • demographics (e.g., age, race, sex, ethnicity) of the victim and the offender,
    • information on the weapon and the relationship
  2. List all of the murders in which the killer was the husband of the victim. This can be done using either the relationship or the relationship_code.

  3. Add a condition that the wives were at least 60 years old.

  4. Sort the answer from oldest to youngest victim.

Advanced exercises using other conditions

If you’re feeling adventurous, try figuring out how you might find:

  • Any domestic violence-related incidents. Hint: This calls for an %in% condition once you look at your options using either distinct or a group_by query. (A sketch follows at the end of this section.)

  • Try using str_detect when you want to use wildcards instead of exact matches. It takes a regular expression as its pattern argument. So to find any gun in this dataset, you’d use

    str_detect(weapon_code, "^1")

(For more details on regular expressions, try the Regex101 tutorial on our class website. Using regular expressions is often a way to make queries shorter and less fussy, but they are not as clear to a reader – they often take some puzzling through.)
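As a sketch of the %in% approach mentioned above (the specific codes below are only examples; pick your own after looking at the distinct values):

# see which relationship values exist before picking your codes
murder_data %>%
  distinct ( relationship_code, relationship )

# then keep only the rows whose code is in your chosen set
murder_data %>%
  filter ( relationship_code %in% c("WI", "HU", "GF", "BF") )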

C.2.2 Group by and summarise

The group_by exercises are just like a pivot table. In fact, to turn a long summary on its side, with one of the grouping variables spread across the top, you use the pivot_wider command.
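For example, a minimal sketch of that pattern using variables from this dataset (not one of the exercises below) counts cases by state and whether they were solved, then spreads the solved values across the top:

murder_data %>%
  group_by ( state_abbr, solved ) %>%
  summarise ( cases = n() ) %>%
  # spread the solved column across the top, like a pivot table
  pivot_wider ( names_from = solved, values_from = cases )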

“The most”

  • Which county in this small dataset has the most murders? Which one has the most police killings? (Look in the circumstance column for this.)

  • Create a table showing the number of murders by year and state (with states across the top and years down the side). This is a group_by / summarise / arrange / pivot_wider exercise.

  • Try calculating the percent of murders by relationship. For this to work, you can only keep one group_by column (relationship).

Putting it all together

What percent of each state’s domestic violence victims are of Hispanic origin?

C.2.3 Mutate

In this case, there aren’t many numbers for us to work with, but there are still things to be done to create more meaningful groupings. These are usually built with an if_else statement or something like it (see the sketch after this list).

  1. Create a table that compares the number of gun murders with other weapons for each year.
  2. Try to compute the percent of murders by weapon annually.
  3. Are gun murders more likely to be solved than others?
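One rough sketch for the first exercise, relying on the weapon_code hint from the filtering section (codes beginning with “1” are guns); the column name gun_or_not is just an illustration:

murder_data %>%
  # flag each case as a gun murder or not
  mutate ( gun_or_not = if_else ( str_detect (weapon_code, "^1"), "Gun", "Other weapon") ) %>%
  group_by ( year, gun_or_not ) %>%
  summarise ( cases = n() ) %>%
  # put the two categories side by side for each year
  pivot_wider ( names_from = gun_or_not, values_from = cases )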

C.2.4 Join

This gives you an opportunity to see how to add data to a table from the Web. This Excel file includes lots of information from the Census as of 2017, which you can link to summaries of the murder data using the county FIPS codes.
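A rough sketch of that join, assuming you have saved the spreadsheet as census_data.xlsx and that its county FIPS column is named fips (both names are placeholders; check the actual file after importing):

library(readxl)

# import the Census spreadsheet
census_data <- read_excel ("census_data.xlsx")

# summarise the murder data to one row per county, keeping the FIPS code
county_totals <- murder_data %>%
  group_by ( fip, cnty_name ) %>%
  summarise ( murders = n() )

# attach the Census columns to each county by its FIPS code
county_totals %>%
  left_join ( census_data, by = c("fip" = "fips"))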

C.3 Wrap-up exercise

Try getting as far as you can to answer this question: Characterize the county with the highest murder rate, using the Census data linked above. This will require doing everything in steps (a rough sketch follows the list):

  1. Grouping to the county level, keeping the county FIPS code
  2. Importing the Excel file, making sure you keep the FIPS code as the same data type as the one in the murder dataset.
  3. Joining the grouped file to the Census data.
  4. Computing the murder rate
  5. Deciding which variables in the Census data to use or manipulate to find a characterization that you’re interested in.
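One rough outline of those steps, reusing the hypothetical census_data and county_totals objects from the join sketch above (again, fips and population are placeholder column names; check what the spreadsheet actually contains):

# if the Census FIPS column came in as numbers, convert it to character so it matches the murder data
census_data <- census_data %>%
  mutate ( fips = as.character (fips) )

county_totals %>%
  left_join ( census_data, by = c("fip" = "fips")) %>%
  # murders per 100,000 residents over the whole period
  mutate ( murder_rate = murders / population * 100000 ) %>%
  arrange ( desc (murder_rate))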

C.4 Answers to exercises

C.4.1 Select and filter

Arizona wives

  1. Create a new dataset with just Arizona:
arizona_murders <- 
  murder_data %>%
  filter ( state_abbr == "AZ")

You should have 15,443 rows in this dataset.

  1. The final set of queries might look like this. (You might have noticed that you had one victim age 999 when you sorted. That means “unknown” in this dataset, so you’ll want to filter that out as well.)
arizona_murders  %>%
  select ( year, cnty_name, agency, solved, 
           starts_with("vic"), starts_with("off"), 
           contains ("relationship")) %>%
  filter (relationship == "Wife"  & 
          vic_age >= 60 & 
          vic_age < 999 )  %>%
  arrange ( desc(vic_age)) 

You could also use

  relationship == "Wife"  & 
      between (vic_age, 60, 998)  
      

You might also notice that there is a Male “wife” as a victim, reflecting how poorly many police agencies fill out these forms.
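(A quick filter along these lines will pull up that record if you want to take a look:)

arizona_murders %>%
  filter ( relationship == "Wife" & vic_sex == "Male" ) %>%
  select ( year, agency, vic_age, vic_sex, relationship )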

C.4.2 Group by

“The most”

  1. County with the most killings
murder_data %>%
  group_by (state_abbr, cnty_name) %>%
  summarise ( cases = n() ) %>%
  arrange (desc (cases))
## `summarise()` has grouped output by 'state_abbr'. You can override using the `.groups` argument.

Why is this answer not surprising? (Hint: Clark County, Nevada, has about half the population of Maricopa.) We’ll get to ways to normalize this in later chapters.

To just get the police shootings, filter the above query for circumstance_code == "81" before the group_by.
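For example, combining that filter with the query above:

murder_data %>%
  filter ( circumstance_code == "81" ) %>%
  group_by ( state_abbr, cnty_name ) %>%
  summarise ( cases = n() ) %>%
  arrange ( desc (cases))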

  1. Murders by state and year

“Mutate” here takes the result of the summary, then creates a new column with the annual totals (the sum of cases within each year). It’s a little confusing how it works, but don’t worry about it too much. Just know it’s possible.

murder_data %>% 
  group_by (year, state_abbr) %>%
  summarise (cases = n() ) %>%
  #bonus: Calculate the total number of cases by year:
  mutate ( total_cases = sum(cases) )  %>%
  pivot_wider (names_from = state_abbr, values_from = cases) %>%
  arrange ( desc (year)) 
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Does this mean that these states have suddenly become more dangerous? What about population growth?

Alternative method

The janitor package (which you may need to install) has a way to create cross-tabulations like this more simply. The function is tabyl (the spelling distinguishes it from other table operations in R, which you probably want to avoid). Here’s an example:

murder_data %>%
  tabyl ( year , state_abbr) %>%
  arrange ( desc (year) ) %>%
  head () 
year AZ CO ID MT NM NV OR UT WA WY
2018 425 248 37 40 168 239 115 69 255 17
2017 484 252 48 43 166 311 131 82 242 14
2016 394 208 47 37 111 228 111 76 213 23
2015 344 189 32 38 105 203 85 61 224 18
2014 313 165 30 30 128 185 84 72 206 14
2013 383 186 30 16 119 168 85 51 194 17
  1. Percent of murders by relationship:

Once those totals are calculated, you can use them to compute a new variable, such as percent.

murder_data %>%
  group_by (relationship) %>%
  summarise (num_of_cases = n() ) %>%
  mutate (total_cases = sum(num_of_cases) ,
          # This rounds to 1 digit.
          pct_cases = round (num_of_cases / total_cases * 100 , 1)
          ) %>%
  # you could un-select the total cases since they'll always be the same, but for illustration purposes I'm keeping it.
  arrange ( desc (num_of_cases)) 

Alternative method with janitor::tabyl

murder_data %>%
  tabyl (relationship) %>%
  adorn_pct_formatting (digits=1)   %>%
  arrange ( desc(n)) %>%
  # this last part turns it into a normal data frame
  as_tibble()

Putting it all together

First, isolate the domestic violence cases. Let’s see what our choices are:

murder_data %>%
  group_by (relationship_code, relationship) %>%
  summarise (n())
## `summarise()` has grouped output by 'relationship_code'. You can override using the `.groups` argument.
relationship_code relationship n()
AQ Acquaintance 11935
BF Boyfriend 669
BR Brother 534
CH Common-law husband 84
CW Common-law wife 178
DA Daughter 1005
EE Employee 53
ER Employer 74
FA Father 480
FR Friend 2368
GF Girlfriend 1893
HO Homosexual relationship 135
HU Husband 904
IL In-law 365
MO Mother 559
NE Neighbor 727
OF Other family 873
OK Other - known to victim 2349
SD Stepdaughter 102
SF Stepfather 159
SI Sister 168
SM Stepmother 23
SO Son 1260
SS Stepson 163
ST Stranger 9988
UN Relationship not determined 15774
WI Wife 2890
XH Ex-husband 76
XW Ex-wife 287

Here’s one way to get the answer (eliminating cases in which we don’t know the victim’s ethnicity):

murder_data %>%
  filter ( relationship_code %in% c("BF", "BR", "CH", "CW", "DA", "FA", 
                                    "GF", "HO", "HU", "IL", "MO", "OF", "SD", "SF", 
                                    "SI", "SM", "SO", "WI", "XH", "XW")   & 
          vic_ethnic_code %in% c("H", "N")
        ) %>%
   group_by ( state_abbr, vic_ethnic ) %>%
   summarise (cases = n() ) %>%
   #calculate total by state and percent
   mutate ( total_cases = sum(cases), 
            pct_cases = cases / total_cases * 100 ) %>%
   #get rid of case counts
   select ( -cases ) %>%
   #sort by state
   arrange ( state_abbr ) %>%
   # put ethnicity in columns
   pivot_wider ( values_from = pct_cases, names_from = vic_ethnic)  
## `summarise()` has grouped output by 'state_abbr'. You can override using the `.groups` argument.

  1. If you wanted to go further, you might look at the regular expressions available for more sophisticated filtering using the str_detect function. In this case, str_detect (weapon_code, "^1") searches for anything in the field that begins with a “1”.