C Murder Accountability Project exercises
The Murder Accountability Project was profiled in the New Yorker in November, 2017. Andrew B Tran brilliantly decided to use its data as a vehicle for learning R in his R for Journalists’ online class. This set of tutorials adapts his exercises for people using this textbook. You should consider taking his entire free full course if you want to get more detail.
Download the data into a new or existing R project, then load it using the load(file="murder_data.Rda")
command in a code chunk.
C.1 The data
This R dataset was created by subsetting only mountain states and Oregon and Washington from the national dataset, and converting the codes used in the original SPSS data into codes and their labels in separate variables. The data runs from 1976 through 2018. Here are the fields included:
## Rows: 56,075
## Columns: 37
## $ rowid <int> 27475, 27476, 27477, 27478, 27479, 27480, 27481, 27…
## $ mapid <chr> "197612001AZ00100", "197708001AZ00100", "197801001A…
## $ year <dbl> 1976, 1977, 1978, 1982, 1985, 1988, 1989, 1990, 199…
## $ month <dbl> 12, 8, 1, 7, 6, 11, 10, 9, 12, 2, 12, 3, 9, 4, 2, 8…
## $ fip <chr> "04001", "04001", "04001", "04001", "04001", "04001…
## $ state_fip <chr> "04", "04", "04", "04", "04", "04", "04", "04", "04…
## $ cnty_name <chr> "Apache, AZ", "Apache, AZ", "Apache, AZ", "Apache, …
## $ msa_code <chr> "99904", "99904", "99904", "99904", "99904", "99904…
## $ msa_name <chr> "Rural Arizona", "Rural Arizona", "Rural Arizona", …
## $ state_abbr <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ…
## $ ori <chr> "AZ00100", "AZ00100", "AZ00100", "AZ00100", "AZ0010…
## $ agency <chr> "Apache County", "Apache County", "Apache County", …
## $ agency_type <chr> "Sheriff", "Sheriff", "Sheriff", "Sheriff", "Sherif…
## $ solved <chr> "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes…
## $ homicide_type <chr> "Murder and non-negligent manslaughter", "Murder an…
## $ incident_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ vic_age <dbl> 42, 28, 0, 23, 18, 23, 43, 33, 33, 0, 38, 0, 37, 25…
## $ vic_sex <chr> "Male", "Male", "Female", "Female", "Male", "Male",…
## $ vic_race_code <chr> "W", "W", "W", "W", "W", "W", "W", "W", "W", "W", "…
## $ vic_race <chr> "White", "White", "White", "White", "White", "White…
## $ vic_ethnic_code <chr> "U", "U", "U", "N", "U", "U", "N", "N", "N", "N", "…
## $ vic_ethnic <chr> "Unknown or not reported", "Unknown or not reported…
## $ off_age <dbl> 22, 44, 50, 999, 999, 41, 39, 48, 999, 29, 18, 14, …
## $ off_sex <chr> "Male", "Male", "Male", "Unknown", "Unknown", "Male…
## $ off_race_code <chr> "I", "W", "W", "U", "U", "W", "W", "W", "U", "W", "…
## $ off_race <chr> "American Indian or Alaskan Native", "White", "Whit…
## $ off_ethnic_code <chr> "U", "U", "U", "U", "U", "U", "N", "N", "U", "N", "…
## $ off_ethnic <chr> "Unknown or not reported", "Unknown or not reported…
## $ situation_code <chr> "C", "A", "C", "B", "B", "A", "A", "A", "B", "A", "…
## $ situation <chr> "Single victim/multiple offenders", "Single victim/…
## $ weapon_code <chr> "20", "13", "90", "11", "12", "12", "12", "12", "12…
## $ weapon <chr> "Knife or cutting instrument", "Rifle", "Other or t…
## $ relationship_code <chr> "AQ", "AQ", "DA", "UN", "UN", "AQ", "AQ", "AQ", "ST…
## $ relationship <chr> "Acquaintance", "Acquaintance", "Daughter", "Relati…
## $ circumstance_code <chr> "42", "44", "59", "99", "99", "45", "5", "60", "3",…
## $ circumstance <chr> "Brawl due to influence of alcohol", "Argument over…
## $ use_date <date> 1980-03-01, 1980-03-01, 1980-03-01, 1982-10-06, 19…
Any variable that begins with off
refers to the offender; any variable that begins with vic
refers to the victim. Most variables are pretty self-explanatory, but here are a few details:
fip
andstate_fip
are standard codes used across many databases to identify geographic areas – in this case, counties and states.msa_code
andmsa_name
refer to Metropolitan Statistical Areas, which combine nearby counties regardless of state into areas that are often considered on large metropolitan area, such as Washington DC and the Virginia and Maryland suburbs.ori
,agency
andagency_type
refer to the law enforcement agency – police or sheriff’s office – that investigated the murder. An ORI is a standard code for each agency from the FBI.relationship_code
andrelationship
can be confusing. They refer to the relationship of the victim to the offender, not the other way around. For example, “Wife” means that the victim was the killer’s wife, not that the killer was the victim’s wife. This is unclear in the FBI documentation.
More details are available in this detailed record layout and data dictionary.
C.2 Exercises by chapter
C.2.1 Select and filter exercises
Here are some suggested exercises to practice what you learned in Chapter 5, Select and Filter:
Older wives as victims in Arizona
- Create a new data frame called
arizona_murders
based on just the murders that were reported in that state.
Try doing these one step at a time by adding to a query :
Pick out just following variables to work with:
- year
- name of the county and the police department,
- whether it was solved,
- demographics (eg, age, race, sex, ethnicity) of the victim and the offender,
- information on the weapon and the relationship
List all of the murders in which the killer was the husband of the victim. This can be done using either the relationship or the relationship_code.
Add a condition that the wives were at least 60 years old.
Sort the answer by oldest to youngest
Advanced exercises using other conditions
If you’re feeling adventurous, try figuring out how you might find:
Any domestic-violence related incidents. Hint: This would be an
%in%
condition once you look at your options using either “distinct” or a group-by query.Try using
str_detect
when you want to use wild cards instead of exact matches. These take regular expressions as arguments. So to find any gun in this dataset, you’d usestr_detect(weapon_code, "^1")
(For more details on regular expressions, try the Regex101 tutorial on our class website. Using regular expressions is often a way to make queries shorter and less fussy, but they are not as clear to a reader – they often take some puzzling through.)
C.2.2 Group by and summarise
The group by exercises are just like a pivot table. In fact, to turn it on its head, you use the command “pivot_wider”
“The most”
Which county in this small dataset has the most murders? Which one has the most police killings? (Look in the circumstance column for this.)
Create a table showing the number of murders by year and state (with states across the top, and years down the side). This is a group_by / summarise / arrange / pivot_wider exercise
Try calculating the percent of murders by relationship. For this to work, you can only keep one group_by column (relationship)
Putting it all together
What percent of each state’s domestic violence victims are of Hispanic origin?
C.2.3 Mutate
In this case, there aren’t very many numbers for us to work with, but there are still things to be done to make more meaningful groupings. These are usually done using an if_else
statement or something like it.
- Create a table that compares the number of gun murders with other weapons for each year.
- Try to compute the percent of murders by weapon annually.
- Are gun murders more likely to be solved than others?
C.2.4 Join
This gives you an opportunity to see how to add data to a table from the Web. This Excel file includes lots of information from the Census as of 2017, which you can link to summary data from the murder data using the county fips codes.
C.3 Wrap-up exercise
Try getting as far as you can to answer this question: Characterize the county with the highest murder rate, using the Census data linked above. This will require doing everything in steps:
- Grouping to the county level, keeping the county FIPS code
- Importing the Excel file, making sure you keep the FIP code as the same type of data as the one in the murder dataset.
- Joining the grouped file to the Census data.
- Computing the murder rate
- Deciding which variables in the Census data to use or manipulate to find a characterization that you’re interested in.
C.4 Answers to exercises
C.4.1 Select and filter
Arizona wives
- Create a new dataset with just arizona:
<-
arizona_murders %>%
murder_data filter ( state_abbr == "AZ")
You should have 15,443 rows in this dataset.
- The final set of queries might look like this. (You might have noticed you that you had one victim age 999 when you sorted. That means “unknown” in this dataset, so you’ll want to filter that out as well.)
%>%
arizona_murders select ( year, cnty_name, agency, solved,
starts_with("vic"), starts_with("off"),
contains ("relationship")) %>%
filter (relationship == "Wife" &
>= 60 &
vic_age < 999 ) %>%
vic_age arrange ( desc(vic_age))
You could also use
relationship == "Wife" &
between (vic_age, 60, 998)
You might also notice that there is a Male “wife” as a victim, reflecting how poorly many police agencies fill out these forms.
C.4.2 Group by
“The most”
- County with the most killings
%>%
murder_data group_by (state_abbr, cnty_name) %>%
summarise ( cases = n() ) %>%
arrange (desc (cases))
## `summarise()` has grouped output by 'state_abbr'. You can override using the `.groups` argument.
Why is this answer not surprising? (Hint: Clark County, Nevada, has about half the population of Maricopa.) We’ll get to ways to normalize this in later chapters.
To just get the police shootings, filter the above query for circumstance_code == "81"
before the group_by
- Murders by state and year
“Mutate” here takes the result of the summary, then creates a new category with the annual totals. It’s a little confusing how it works, but don’t worry about it too much. Just know it’s possible.
%>%
murder_data group_by (year, state_abbr) %>%
summarise (cases = n() ) %>%
#bonus: Calculate the total number of cases by year:
mutate ( total_cases = sum(cases) ) %>%
pivot_wider (names_from = state_abbr, values_from = cases) %>%
arrange ( desc (year))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
Does this mean that these states have suddenly become more dangerous? What about population growth?
Alternative method
The janitor
package (which you may need to install) has a way to create cross-tabulations like this more simply. The function is tabyl
(to distinguish from other table operations in R, which you probably want to avoid). Here’s an example:
%>%
murder_data tabyl ( year , state_abbr) %>%
arrange ( desc (year) ) %>%
head ()
year | AZ | CO | ID | MT | NM | NV | OR | UT | WA | WY |
---|---|---|---|---|---|---|---|---|---|---|
2018 | 425 | 248 | 37 | 40 | 168 | 239 | 115 | 69 | 255 | 17 |
2017 | 484 | 252 | 48 | 43 | 166 | 311 | 131 | 82 | 242 | 14 |
2016 | 394 | 208 | 47 | 37 | 111 | 228 | 111 | 76 | 213 | 23 |
2015 | 344 | 189 | 32 | 38 | 105 | 203 | 85 | 61 | 224 | 18 |
2014 | 313 | 165 | 30 | 30 | 128 | 185 | 84 | 72 | 206 | 14 |
2013 | 383 | 186 | 30 | 16 | 119 | 168 | 85 | 51 | 194 | 17 |
- Percent of murders by relationship:
Once those totals are calculated, you can use them to compute a new variable, such as percent.
%>%
murder_data group_by (relationship) %>%
summarise (num_of_cases = n() ) %>%
mutate (total_cases = sum(num_of_cases) ,
# This rounds to 1 digit.
pct_cases = round (num_of_cases / total_cases * 100 , 1)
%>%
) # you could un-select the total cases since they'll always be the same , but for illustratio purposes I'm keeping it.
arrange ( desc (num_of_cases))
Alternative method with janitor::tabyl
%>%
murder_data tabyl (relationship) %>%
adorn_pct_formatting (digits=1) %>%
#this last part turns it into a normal data frame
arrange ( desc(n)) %>%
as_tibble()
Putting it all together
First, isolate the domestic violence cases. Let’s see what our choices are:
%>%
murder_data group_by (relationship_code, relationship) %>%
summarise (n())
## `summarise()` has grouped output by 'relationship_code'. You can override using the `.groups` argument.
relationship_code | relationship | n() |
---|---|---|
AQ | Acquaintance | 11935 |
BF | Boyfriend | 669 |
BR | Brother | 534 |
CH | Common-law husband | 84 |
CW | Common-law wife | 178 |
DA | Daughter | 1005 |
EE | Employee | 53 |
ER | Employer | 74 |
FA | Father | 480 |
FR | Friend | 2368 |
GF | Girlfriend | 1893 |
HO | Homosexual relationship | 135 |
HU | Husband | 904 |
IL | In-law | 365 |
MO | Mother | 559 |
NE | Neighbor | 727 |
OF | Other family | 873 |
OK | Other - known to victim | 2349 |
SD | Stepdaughter | 102 |
SF | Stepfather | 159 |
SI | Sister | 168 |
SM | Stepmother | 23 |
SO | Son | 1260 |
SS | Stepson | 163 |
ST | Stranger | 9988 |
UN | Relationship not determined | 15774 |
WI | Wife | 2890 |
XH | Ex-husband | 76 |
XW | Ex-wife | 287 |
Here’s one way to get the answer (eliminating cases in which we don’t know the victim’s ethnicity)
%>%
murder_data filter ( relationship_code %in% c("BF", "BR", "CH", "CW", "DA", "FA",
"GF", "HO", "HU", "IL", "MO", "OF", "SD", "SF",
"SI", "SM", "SO", "WI", "XH", "XW") &
%in% c("H", "N")
vic_ethnic_code %>%
) group_by ( state_abbr, vic_ethnic ) %>%
summarise (cases = n() ) %>%
#calculate total by state and percent
mutate ( total_cases = sum(cases),
pct_cases = cases / total_cases * 100 ) %>%
#get rid of case counts
select ( -cases ) %>%
#sort by state
arrange ( state_abbr ) %>%
# put ethnicity in columns
pivot_wider ( values_from = pct_cases, names_from = vic_ethnic)
## `summarise()` has grouped output by 'state_abbr'. You can override using the `.groups` argument.
If you wanted to go further with filtering, you might look at the
regular expressions
available for more sophisticated filtering using thestr_detect
function. In this case,str_detect (weapon_code, "^1")
searches for anything in the field that begins with a “1”.↩︎