Project proposal

Dataset

```r
library(tidyverse)

agencies <- read.csv("data/agencies.csv")
```

For this project, we explore data obtained from the FBI Crime Data API. The dataset used is the Agency dataset, which contains 19,166 rows and 10 columns.

```r
dim(agencies)
#> [1] 19166 10
```
The dataset contains variables describing law enforcement agencies, including identifiers, location, and reporting status. The ori variable provides a unique agency ID, while county, state_abbr, and state describe the agency’s jurisdiction. Geographic coordinates are captured in the numeric variables latitude and longitude. Agency characteristics are recorded in agency_name and agency_type. The logical variable is_nibrs indicates participation in the FBI’s National Incident-Based Reporting System (NIBRS), and nibrs_start_date specifies when reporting began.
```r
str(agencies)
#> 'data.frame': 19166 obs. of 10 variables:
#>  $ ori             : chr  "AL0430200" "AL0430100" "AL0430000" "AL0070100" ...
#>  $ county          : chr  "LEE" "LEE" "LEE" "BIBB" ...
#>  $ latitude        : num  32.6 32.6 32.6 33 32.9 ...
#>  $ longitude       : num  -85.4 -85.5 -85.4 -87.1 -87.1 ...
#>  $ state_abbr      : chr  "AL" "AL" "AL" "AL" ...
#>  $ state           : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
#>  $ agency_name     : chr  "Opelika Police Department" "Auburn Police Department" "Lee County Sheriff's Office" "Centreville Police Department" ...
#>  $ agency_type     : chr  "City" "City" "County" "City" ...
#>  $ is_nibrs        : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
#>  $ nibrs_start_date: chr  "2021-01-01" "2020-12-01" "2012-04-01" "2020-01-01" ...
```
We selected this dataset because it enables direct, visualization-driven analysis of law enforcement agencies across the United States. It contains key categorical fields such as agency type, state, and NIBRS participation, as well as spatial and time-based variables including geographic coordinates and NIBRS adoption dates.
The data is organized but imperfect. Some agencies are missing location information, agency type categories are unevenly sized, and NIBRS start dates appear only for agencies that report through NIBRS. Addressing these issues requires filtering, recoding, and intentional design choices, making the dataset well suited for practicing applied data cleaning and visualization.
Questions
The questions that we want to answer are:
Q1: Are certain types of law enforcement agencies systematically less likely to participate in NIBRS reporting, even within the same state?
Relevant variables: state (or state_abbr), is_nibrs, and agency_type.
Q2: Are non-NIBRS agencies geographically clustered within specific counties or regions of the United States?
Relevant variables: latitude, longitude, county, state_abbr, and is_nibrs.
Analysis plan
For Question 1: How complete is NIBRS reporting across states, and which agency types account for most non-participation?
To answer Question 1, we will conduct a within-state comparative analysis. The unit of analysis is the individual agency (ori). We will first verify that each ori is unique and resolve any duplicates by retaining the most complete record, prioritizing valid state_abbr, agency_type, and is_nibrs.
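As a rough sketch of this deduplication step (the object name agencies_dedup is illustrative, and the completeness score is one possible rule rather than a fixed decision):

```r
# Sketch: resolve duplicate ori values by keeping the most complete record.
# The completeness score counts non-missing prioritized fields; this rule is
# shown for illustration only.
agencies_dedup <- agencies %>%
  mutate(
    completeness = (!is.na(state_abbr) & state_abbr != "") +
      (!is.na(agency_type) & agency_type != "") +
      (!is.na(is_nibrs))
  ) %>%
  arrange(ori, desc(completeness)) %>%
  distinct(ori, .keep_all = TRUE) %>%
  select(-completeness)

# Confirm that each ori now appears exactly once
stopifnot(n_distinct(agencies_dedup$ori) == nrow(agencies_dedup))
```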
Key categorical variables will be cleaned and standardized prior to analysis. The is_nibrs variable will be confirmed as a boolean (TRUE/FALSE) indicator of participation. The agency_type variable will be standardized to remove inconsistent capitalization or trailing spaces. Missing agency_type values will be examined: if minimal, the rows may be excluded; otherwise, they will be grouped into an “Unknown” category and reported transparently. Agencies missing state_abbr will be excluded, since within-state comparison is central to the research design. Missing latitude and longitude values will not affect this analysis, as spatial coordinates are not required for Q1.
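A minimal sketch of these cleaning rules, building on agencies_dedup from the previous sketch (agencies_q1 is an illustrative name):

```r
# Sketch of the cleaning rules described above
agencies_q1 <- agencies_dedup %>%
  mutate(
    is_nibrs    = as.logical(is_nibrs),
    # standardize capitalization and strip stray whitespace
    agency_type = str_to_title(str_squish(agency_type)),
    # group missing agency types into an explicit "Unknown" category
    agency_type = if_else(is.na(agency_type) | agency_type == "",
                          "Unknown", agency_type)
  ) %>%
  # within-state comparison requires a valid state identifier
  filter(!is.na(state_abbr), state_abbr != "")
```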
Construction of Participation Metrics
We will construct a state-by-agency-type summary table. For each combination of state_abbr and agency_type, we will calculate:
n_agencies: total number of agencies
n_nibrs: number of agencies participating in NIBRS
pct_nibrs = n_nibrs / n_agencies: participation rate
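Building on the cleaned data from the sketches above, the summary table could be constructed roughly as follows (state_type_summary is an illustrative name):

```r
# Sketch: participation summary for each state x agency type combination
state_type_summary <- agencies_q1 %>%
  group_by(state_abbr, agency_type) %>%
  summarise(
    n_agencies = n(),
    n_nibrs    = sum(is_nibrs, na.rm = TRUE),
    pct_nibrs  = n_nibrs / n_agencies,
    .groups    = "drop"
  )
```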
We will also compute each state’s overall participation rate across all agency types and use it to derive a participation gap measure:

participation gap = agency-type-specific participation rate − state overall participation rate
This metric allows us to determine whether a given agency type is performing above or below the baseline participation level of its state. To avoid unstable percentages caused by very small group sizes, we will flag or filter agency-type groups within states containing fewer than a specified threshold of agencies (e.g., fewer than five agencies). The number of excluded or flagged cases will be reported.
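A sketch of the participation gap and the small-group flag, continuing from the objects above; the five-agency threshold is the example value mentioned and may be revised:

```r
# Sketch: state baselines, participation gaps, and a small-group flag
state_overall <- agencies_q1 %>%
  group_by(state_abbr) %>%
  summarise(state_pct_nibrs = mean(is_nibrs, na.rm = TRUE), .groups = "drop")

participation_gaps <- state_type_summary %>%
  left_join(state_overall, by = "state_abbr") %>%
  mutate(
    participation_gap = pct_nibrs - state_pct_nibrs,
    # example threshold from the text; small groups are flagged, not silently dropped
    small_group = n_agencies < 5
  )
```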
Analytical Strategy
The analysis proceeds in two parts:
Within-State Comparison: Examine participation rates across agency types within each state to isolate agency-type differences from broader state-level policies.
Systematic Patterns Across States: Assess whether certain agency types consistently show negative participation gaps across many states. By summarizing the distribution of participation gaps by agency type across all states, we can evaluate whether patterns are systematic. If an agency type frequently falls below the state average in a large proportion of states, this would support the claim of systematically lower participation.
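One way to summarize the gap distribution, continuing the sketches above, is to count how often each agency type falls below its state baseline:

```r
# Sketch: how consistently each agency type falls below its state baseline
gap_patterns <- participation_gaps %>%
  filter(!small_group) %>%
  group_by(agency_type) %>%
  summarise(
    n_states         = n_distinct(state_abbr),
    median_gap       = median(participation_gap),
    share_below_base = mean(participation_gap < 0),
    .groups          = "drop"
  ) %>%
  arrange(desc(share_below_base))
```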
Visualization Plan
We will produce at least two different types of plots, including:
Faceted Bar Chart: Displays participation rates by agency type within selected states. States may be limited to those with the largest number of agencies for readability, and agency types will be ordered consistently across facets. This allows direct within-state comparison.
Heatmap: Shows participation rates with states on one axis, agency types on the other, and color intensity representing participation rates. This allows detection of structural patterns across many states.
State-Level Choropleth Map: Displays overall participation rates using color shading to highlight geographic variation.
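The first two plots could be drafted along these lines (a sketch, not final styling; the nine-state cutoff for the facets is an arbitrary illustration):

```r
# Sketch: faceted bar chart of participation rates within high-volume states
top_states <- agencies_q1 %>%
  count(state_abbr, sort = TRUE) %>%
  slice_head(n = 9) %>%   # arbitrary cutoff for readability
  pull(state_abbr)

state_type_summary %>%
  filter(state_abbr %in% top_states) %>%
  ggplot(aes(x = fct_reorder(agency_type, pct_nibrs), y = pct_nibrs)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ state_abbr) +
  labs(x = "Agency type", y = "NIBRS participation rate")

# Sketch: heatmap of participation rates across states and agency types
ggplot(state_type_summary,
       aes(x = agency_type, y = state_abbr, fill = pct_nibrs)) +
  geom_tile() +
  labs(x = "Agency type", y = "State", fill = "Participation rate")
```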
For Question 2: When did agencies adopt NIBRS reporting, and which agency types adopted later than others?
To answer Question 2, we will conduct a year-level temporal adoption analysis, using the individual agency (ori) as the unit of analysis. We will first verify that each ori is unique and resolve any duplicates by retaining the most complete record, prioritizing valid nibrs_start_date, agency_type, and state_abbr.
The nibrs_start_date variable will be converted to a proper Date format and used to create a new variable, nibrs_start_year. The is_nibrs variable will be confirmed as a boolean indicator of participation. Agencies marked as participating but missing a start date will be flagged and reported transparently; if minimal, they will be excluded from timing-specific visualizations but included in overall participation counts. Agencies missing state_abbr will be excluded from state-level comparisons but retained in national summaries. Latitude and longitude are not required for this question.
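A minimal sketch of this preparation step, building on agencies_dedup from the Question 1 sketches (agencies_q2 and missing_start are illustrative names):

```r
# Sketch: parse start dates, derive the adoption year, and flag participating
# agencies that lack a recorded start date
agencies_q2 <- agencies_dedup %>%
  mutate(
    nibrs_start_date = lubridate::ymd(nibrs_start_date),
    nibrs_start_year = lubridate::year(nibrs_start_date),
    missing_start    = is_nibrs & is.na(nibrs_start_date)
  )

# How many participating agencies have no recorded start date?
sum(agencies_q2$missing_start, na.rm = TRUE)
```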
Construction of Adoption Metrics
We will construct year-level summaries of adoption. For each year, we will calculate:
n_new_adopters_year: number of agencies beginning NIBRS reporting
cumulative_adopters_year: cumulative number of adopters up to that year
cumulative_adoption_rate: cumulative adopters divided by the total number of agencies
The denominator (total agencies) will remain fixed across years to ensure comparability.
At the agency-type level, we will compute the same metrics for each agency_type × year combination. This allows direct comparison of adoption speed across agency types.
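A sketch of these year-level metrics by agency type, continuing from agencies_q2; whether the fixed denominator is the overall total or the per-type total is a design choice, illustrated here with per-type totals:

```r
# Sketch: new and cumulative adopters per agency type and year.
# Denominators stay fixed across years; per-type totals are used here, but the
# overall agency total could be substituted.
type_totals <- agencies_q2 %>% count(agency_type, name = "n_total_type")

adoption_by_type_year <- agencies_q2 %>%
  filter(!is.na(nibrs_start_year)) %>%
  count(agency_type, nibrs_start_year, name = "n_new_adopters_year") %>%
  arrange(agency_type, nibrs_start_year) %>%
  group_by(agency_type) %>%
  mutate(cumulative_adopters_year = cumsum(n_new_adopters_year)) %>%
  ungroup() %>%
  left_join(type_totals, by = "agency_type") %>%
  mutate(cumulative_adoption_rate = cumulative_adopters_year / n_total_type)
```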
Adoption Timing Measures
To evaluate differences in timing, we will calculate for each agency type:
Median adoption year
Mean adoption year
Interquartile range
Time to 50% cumulative adoption
We will also define an adoption delay measure:

adoption delay = median adoption year of agency type − national median adoption year

Positive values indicate later-than-average adoption. This metric allows systematic comparison across agency types.
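A sketch of the timing measures and the adoption delay metric, continuing from agencies_q2 (the time-to-50%-adoption measure follows the same grouped pattern and is omitted here for brevity):

```r
# Sketch: adoption timing measures by agency type and the adoption delay metric
national_median_year <- median(agencies_q2$nibrs_start_year, na.rm = TRUE)

timing_by_type <- agencies_q2 %>%
  filter(!is.na(nibrs_start_year)) %>%
  group_by(agency_type) %>%
  summarise(
    median_year = median(nibrs_start_year),
    mean_year   = mean(nibrs_start_year),
    iqr_years   = IQR(nibrs_start_year),
    .groups     = "drop"
  ) %>%
  mutate(adoption_delay = median_year - national_median_year)
```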
Agencies that never adopted NIBRS will be included in cumulative rate denominators but will not appear in year-of-adoption distribution plots.
Analytical Strategy
The analysis proceeds in three stages:
National Trend: Examine overall adoption over time to identify waves or structural shifts.
Agency-Type Comparison: Compare cumulative adoption curves and timing metrics across agency types to determine whether certain types adopted later.
State-Level Timing Variation: Compute median adoption year by state to assess geographic differences in adoption timing (distinct from Q1’s participation coverage focus).
Small agency types (e.g., fewer than five total agencies) will be flagged or grouped to avoid unstable estimates. All exclusions will be reported.
Visualization Plan
We will produce at least two different plot types, including color mapping or faceting:
Histogram of Adoption Year (national distribution of adoption timing).
Cumulative Line Plot by Agency Type (colored by agency_type) to compare adoption speed.
Boxplot of Adoption Year by Agency Type to compare medians and spread.
Faceted Line Chart by State to visualize state-level adoption trajectories.
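Two of these plots could be drafted roughly as follows, reusing agencies_q2 and adoption_by_type_year from the earlier sketches:

```r
# Sketch: national distribution of adoption years
ggplot(filter(agencies_q2, !is.na(nibrs_start_year)),
       aes(x = nibrs_start_year)) +
  geom_histogram(binwidth = 1) +
  labs(x = "NIBRS adoption year", y = "Number of agencies")

# Sketch: cumulative adoption curves colored by agency type
ggplot(adoption_by_type_year,
       aes(x = nibrs_start_year, y = cumulative_adoption_rate,
           color = agency_type)) +
  geom_line() +
  labs(x = "Year", y = "Cumulative adoption rate", color = "Agency type")
```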