C Murder Accountability Project exercises

The Murder Accountability Project was profiled in The New Yorker in November 2017. Andrew Ba Tran brilliantly decided to use its data as a vehicle for learning R in his R for Journalists online class. This set of tutorials adapts his exercises for people using this textbook. Consider taking his full course, which is free, if you want more detail.

Download the data into a new or existing R project, then load it using the load(file="murder_data.Rda") command in a code chunk.
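For example, a minimal setup chunk might look like this, assuming the file is saved in your project folder and that you load the tidyverse packages as elsewhere in this book:

library(tidyverse)

# load() restores the saved object -- a data frame named murder_data -- into your environment
load(file = "murder_data.Rda")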

C.1 The data

This R dataset was created by subsetting the national dataset to just the Mountain states plus Oregon and Washington, and by converting the codes used in the original SPSS data into separate code and label variables. The data runs from 1976 through 2018. Here are the fields included:

## Rows: 56,075
## Columns: 37
## $ rowid             <int> 27475, 27476, 27477, 27478, 27479, 27480, 27481, 27…
## $ mapid             <chr> "197612001AZ00100", "197708001AZ00100", "197801001A…
## $ year              <dbl> 1976, 1977, 1978, 1982, 1985, 1988, 1989, 1990, 199…
## $ month             <dbl> 12, 8, 1, 7, 6, 11, 10, 9, 12, 2, 12, 3, 9, 4, 2, 8…
## $ fip               <chr> "04001", "04001", "04001", "04001", "04001", "04001…
## $ state_fip         <chr> "04", "04", "04", "04", "04", "04", "04", "04", "04…
## $ cnty_name         <chr> "Apache, AZ", "Apache, AZ", "Apache, AZ", "Apache, …
## $ msa_code          <chr> "99904", "99904", "99904", "99904", "99904", "99904…
## $ msa_name          <chr> "Rural Arizona", "Rural Arizona", "Rural Arizona", …
## $ state_abbr        <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ…
## $ ori               <chr> "AZ00100", "AZ00100", "AZ00100", "AZ00100", "AZ0010…
## $ agency            <chr> "Apache County", "Apache County", "Apache County", …
## $ agency_type       <chr> "Sheriff", "Sheriff", "Sheriff", "Sheriff", "Sherif…
## $ solved            <chr> "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes…
## $ homicide_type     <chr> "Murder and non-negligent manslaughter", "Murder an…
## $ incident_num      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ vic_age           <dbl> 42, 28, 0, 23, 18, 23, 43, 33, 33, 0, 38, 0, 37, 25…
## $ vic_sex           <chr> "Male", "Male", "Female", "Female", "Male", "Male",…
## $ vic_race_code     <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "…
## $ vic_race          <chr> "White", "White", "White", "White", "White", "White…
## $ vic_ethnic_code   <chr> "U", "U", "U", "N", "U", "U", "N", "N", "N", "N", "…
## $ vic_ethnic        <chr> "Unknown or not reported", "Unknown or not reported…
## $ off_age           <dbl> 22, 44, 50, 999, 999, 41, 39, 48, 999, 29, 18, 14, …
## $ off_sex           <chr> "Male", "Male", "Male", "Unknown", "Unknown", "Male…
## $ off_race_code     <chr> "I", "W", "W", "U", "U", "W", "W", "W", "U", "W", "…
## $ off_race          <chr> "American Indian or Alaskan Native", "White", "Whit…
## $ off_ethnic_code   <chr> "U", "U", "U", "U", "U", "U", "N", "N", "U", "N", "…
## $ off_ethnic        <chr> "Unknown or not reported", "Unknown or not reported…
## $ situation_code    <chr> "C", "A", "C", "B", "B", "A", "A", "A", "B", "A", "…
## $ situation         <chr> "Single victim/multiple offenders", "Single victim/…
## $ weapon_code       <chr> "20", "13", "90", "11", "12", "12", "12", "12", "12…
## $ weapon            <chr> "Knife or cutting instrument", "Rifle", "Other or t…
## $ relationship_code <chr> "AQ", "AQ", "DA", "UN", "UN", "AQ", "AQ", "AQ", "ST…
## $ relationship      <chr> "Acquaintance", "Acquaintance", "Daughter", "Relati…
## $ circumstance_code <chr> "42", "44", "59", "99", "99", "45", "5", "60", "3",…
## $ circumstance      <chr> "Brawl due to influence of alcohol", "Argument over…
## $ use_date          <date> 1980-03-01, 1980-03-01, 1980-03-01, 1982-10-06, 19…

Any variable that begins with off refers to the offender; any variable that begins with vic refers to the victim. Most variables are pretty self-explanatory, but here are a few details:

  • fip and state_fip are standard codes used across many databases to identify geographic areas – in this case, counties and states.
  • msa_code and msa_name refer to Metropolitan Statistical Areas, which combine nearby counties, regardless of state, into areas that are often considered one large metropolitan area, such as Washington, DC, and its Virginia and Maryland suburbs.
  • ori, agency and agency_type refer to the law enforcement agency – police or sheriff’s office – that investigated the murder. An ORI is a standard code for each agency from the FBI.
  • relationship_code and relationship can be confusing. They refer to the relationship of the victim to the offender, not the other way around. For example, “Wife” means that the victim was the killer’s wife, not that the killer was the victim’s wife. This is unclear in the FBI documentation.

More details are available in this detailed record layout and data dictionary.

C.2 Exercises by chapter

C.2.1 Select and filter exercises

Here are some suggested exercises to practice what you learned in Chapter 5, Select and Filter:

Older wives as victims in Arizona

  1. Create a new data frame called arizona_murders based on just the murders that were reported in that state.

Try doing these one step at a time, adding to the query as you go:

  1. Pick out just the following variables to work with:

    • year
    • name of the county and the police department,
    • whether it was solved,
    • demographics (e.g., age, race, sex, ethnicity) of the victim and the offender,
    • information on the weapon and the relationship
  2. List all of the murders in which the killer was the husband of the victim. This can be done using either the relationship or the relationship_code.

  3. Add a condition that the wives were at least 60 years old.

  4. Sort the answer from oldest to youngest victim.

Advanced exercises using other conditions

If you’re feeling adventurous, try figuring out how you might find:

  • Any domestic violence-related incidents. Hint: This calls for an %in% condition once you look at your options using either distinct or a group_by query. (A sketch follows at the end of this section.)

  • Try using str_detect when you want to use wildcards instead of exact matches. It takes a regular expression as its pattern argument. So to find any gun in this dataset, you’d use

    str_detect(weapon_code, "^1")

(For more details on regular expressions, try the Regex101 tutorial on our class website. Using regular expressions is often a way to make queries shorter and less fussy, but they are not as clear to a reader – they often take some puzzling through.)
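As a sketch of the %in% approach mentioned above (the specific codes below are only examples; pick your own after looking at the distinct values):

# see which relationship values exist before picking your codes
murder_data %>%
  distinct ( relationship_code, relationship )

# then keep only the rows whose code is in your chosen set
murder_data %>%
  filter ( relationship_code %in% c("WI", "HU", "GF", "BF") )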

C.2.2 Group by and summarise

The group_by exercises are just like a pivot table. In fact, to turn a long summary on its side, with one of the grouping variables spread across the top, you use the pivot_wider command.
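For example, a minimal sketch of that pattern using variables from this dataset (not one of the exercises below) counts cases by state and whether they were solved, then spreads the solved values across the top:

murder_data %>%
  group_by ( state_abbr, solved ) %>%
  summarise ( cases = n() ) %>%
  # spread the solved column across the top, like a pivot table
  pivot_wider ( names_from = solved, values_from = cases )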

“The most”

  • Which county in this small dataset has the most murders? Which one has the most police killings? (Look in the circumstance column for this.)

  • Create a table showing the number of murders by year and state (with states across the top and years down the side). This is a group_by / summarise / arrange / pivot_wider exercise.

  • Try calculating the percent of murders by relationship. For this to work, you can only keep one group_by column (relationship).

Putting it all together

What percent of each state’s domestic violence victims are of Hispanic origin?

C.2.3 Mutate

In this case, there aren’t many numbers for us to work with, but there are still things to be done to create more meaningful groupings. These are usually built with an if_else statement or something like it (see the sketch after this list).

  1. Create a table that compares the number of gun murders with other weapons for each year.
  2. Try to compute the percent of murders by weapon annually.
  3. Are gun murders more likely to be solved than others?
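One rough sketch for the first exercise, relying on the weapon_code hint from the filtering section (codes beginning with “1” are guns); the column name gun_or_not is just an illustration:

murder_data %>%
  # flag each case as a gun murder or not
  mutate ( gun_or_not = if_else ( str_detect (weapon_code, "^1"), "Gun", "Other weapon") ) %>%
  group_by ( year, gun_or_not ) %>%
  summarise ( cases = n() ) %>%
  # put the two categories side by side for each year
  pivot_wider ( names_from = gun_or_not, values_from = cases )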

C.2.4 Join

This gives you an opportunity to see how to add data to a table from the Web. This Excel file includes lots of information from the Census as of 2017, which you can link to summaries of the murder data using the county FIPS codes.
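A rough sketch of that join, assuming you have saved the spreadsheet as census_data.xlsx and that its county FIPS column is named fips (both names are placeholders; check the actual file after importing):

library(readxl)

# import the Census spreadsheet
census_data <- read_excel ("census_data.xlsx")

# summarise the murder data to one row per county, keeping the FIPS code
county_totals <- murder_data %>%
  group_by ( fip, cnty_name ) %>%
  summarise ( murders = n() )

# attach the Census columns to each county by its FIPS code
county_totals %>%
  left_join ( census_data, by = c("fip" = "fips"))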

C.3 Wrap-up exercise

Try getting as far as you can to answer this question: Characterize the county with the highest murder rate, using the Census data linked above. This will require doing everything in steps (a rough sketch follows the list):

  1. Grouping to the county level, keeping the county FIPS code
  2. Importing the Excel file, making sure you keep the FIPS code as the same data type as the one in the murder dataset.
  3. Joining the grouped file to the Census data.
  4. Computing the murder rate
  5. Deciding which variables in the Census data to use or manipulate to find a characterization that you’re interested in.
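One rough outline of those steps, reusing the hypothetical census_data and county_totals objects from the join sketch above (again, fips and population are placeholder column names; check what the spreadsheet actually contains):

# if the Census FIPS column came in as numbers, convert it to character so it matches the murder data
census_data <- census_data %>%
  mutate ( fips = as.character (fips) )

county_totals %>%
  left_join ( census_data, by = c("fip" = "fips")) %>%
  # murders per 100,000 residents over the whole period
  mutate ( murder_rate = murders / population * 100000 ) %>%
  arrange ( desc (murder_rate))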

C.4 Answers to exercises

C.4.1 Select and filter

Arizona wives

  1. Create a new dataset with just Arizona:
arizona_murders <- 
  murder_data %>%
  filter ( state_abbr == "AZ")

You should have 15,443 rows in this dataset.

  1. The final set of queries might look like this. (You might have noticed that you had one victim age 999 when you sorted. That means “unknown” in this dataset, so you’ll want to filter that out as well.)
arizona_murders  %>%
  select ( year, cnty_name, agency, solved, 
           starts_with("vic"), starts_with("off"), 
           contains ("relationship")) %>%
  filter (relationship == "Wife"  & 
          vic_age >= 60 & 
          vic_age < 999 )  %>%
  arrange ( desc(vic_age)) 

You could also use

  relationship == "Wife"  & 
      between (vic_age, 60, 998)  
      

You might also notice that there is a Male “wife” as a victim, reflecting how poorly many police agencies fill out these forms.
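(A quick filter along these lines will pull up that record if you want to take a look:)

arizona_murders %>%
  filter ( relationship == "Wife" & vic_sex == "Male" ) %>%
  select ( year, agency, vic_age, vic_sex, relationship )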

C.4.2 Group by

“The most”

  1. County with the most killings
murder_data %>%
  group_by (state_abbr, cnty_name) %>%
  summarise ( cases = n() ) %>%
  arrange (desc (cases))
## `summarise()` has grouped output by 'state_abbr'. You can override using the `.groups` argument.

Why is this answer not surprising? (Hint: Clark County, Nevada, has about half the population of Maricopa.) We’ll get to ways to normalize this in later chapters.

To just get the police shootings, filter the above query for circumstance_code == "81" before the group_by.
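For example, combining that filter with the query above:

murder_data %>%
  filter ( circumstance_code == "81" ) %>%
  group_by ( state_abbr, cnty_name ) %>%
  summarise ( cases = n() ) %>%
  arrange ( desc (cases))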

  1. Murders by state and year

“Mutate” here takes the result of the summary, then creates a new column with the annual totals (the sum of cases within each year). It’s a little confusing how it works, but don’t worry about it too much. Just know it’s possible.

murder_data %>% 
  group_by (year, state_abbr) %>%
  summarise (cases = n() ) %>%
  #bonus: Calculate the total number of cases by year:
  mutate ( total_cases = sum(cases) )  %>%
  pivot_wider (names_from = state_abbr, values_from = cases) %>%
  arrange ( desc (year)) 
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

Does this mean that these states have suddenly become more dangerous? What about population growth?

Alternative method

The janitor package (which you may need to install) has a way to create cross-tabulations like this more simply. The function is tabyl (the spelling distinguishes it from other table operations in R, which you probably want to avoid). Here’s an example:

murder_data %>%
  tabyl ( year , state_abbr) %>%
  arrange ( desc (year) ) %>%
  head () 
year AZ CO ID MT NM NV OR UT WA WY
2018 425 248 37 40 168 239 115 69 255 17
2017 484 252 48 43 166 311 131 82 242 14
2016 394 208 47 37 111 228 111 76 213 23
2015 344 189 32 38 105 203 85 61 224 18
2014 313 165 30 30 128 185 84 72 206 14
2013 383 186 30 16 119 168 85 51 194 17
  1. Percent of murders by relationship:

Once those totals are calculated, you can use them to compute a new variable, such as percent.

murder_data %>%
  group_by (relationship) %>%
  summarise (num_of_cases = n() ) %>%
  mutate (total_cases = sum(num_of_cases) ,
          # This rounds to 1 digit.
          pct_cases = round (num_of_cases / total_cases * 100 , 1)
          ) %>%
  # you could un-select the total cases since they'll always be the same, but for illustration purposes I'm keeping it.
  arrange ( desc (num_of_cases)) 

Alternative method with janitor::tabyl

murder_data %>%
  tabyl (relationship) %>%
  adorn_pct_formatting (digits=1)   %>%
  arrange ( desc(n)) %>%
  # this last part turns it into a normal data frame
  as_tibble()

Putting it all together

First, isolate the domestic violence cases. Let’s see what our choices are:

murder_data %>%
  group_by (relationship_code, relationship) %>%
  summarise (n())
## `summarise()` has grouped output by 'relationship_code'. You can override using the `.groups` argument.
relationship_code relationship n()
AQ Acquaintance 11935
BF Boyfriend 669
BR Brother 534
CH Common-law husband 84
CW Common-law wife 178
DA Daughter 1005
EE Employee 53
ER Employer 74
FA Father 480
FR Friend 2368
GF Girlfriend 1893
HO Homosexual relationship 135
HU Husband 904
IL In-law 365
MO Mother 559
NE Neighbor 727
OF Other family 873
OK Other - known to victim 2349
SD Stepdaughter 102
SF Stepfather 159
SI Sister 168
SM Stepmother 23
SO Son 1260
SS Stepson 163
ST Stranger 9988
UN Relationship not determined 15774
WI Wife 2890
XH Ex-husband 76
XW Ex-wife 287

Here’s one way to get the answer (eliminating cases in which we don’t know the victim’s ethnicity):

murder_data %>%
  filter ( relationship_code %in% c("BF", "BR", "CH", "CW", "DA", "FA", 
                                    "GF", "HO", "HU", "IL", "MO", "OF", "SD", "SF", 
                                    "SI", "SM", "SO", "WI", "XH", "XW")   & 
          vic_ethnic_code %in% c("H", "N")
        ) %>%
   group_by ( state_abbr, vic_ethnic ) %>%
   summarise (cases = n() ) %>%
   #calculate total by state and percent
   mutate ( total_cases = sum(cases), 
            pct_cases = cases / total_cases * 100 ) %>%
   #get rid of case counts
   select ( -cases ) %>%
   #sort by state
   arrange ( state_abbr ) %>%
   # put ethnicity in columns
   pivot_wider ( values_from = pct_cases, names_from = vic_ethnic)  
## `summarise()` has grouped output by 'state_abbr'. You can override using the `.groups` argument.

  1. If you wanted to go further, you might look at the regular expressions available for more sophisticated filtering using the str_detect function. In this case, str_detect (weapon_code, "^1") searches for anything in the field that begins with a “1”.