Project proposal

Author

Trusting-leopard

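# Load the tidyverse (readr for read_csv(), dplyr for filter()).
library(tidyverse)

# Read the Long Beach shelter records and keep only cats, dogs, and birds.
animal <- read_csv("data/longbeach.csv") |>
  filter(animal_type %in% c("cat", "dog", "bird"))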

head(animal)

# A tibble: 6 × 22
  animal_id animal_name animal_type primary_color secondary_color sex    
  <chr>     <chr>       <chr>       <chr>         <chr>           <chr>  
1 A693708   *charlien   dog         white         <NA>            Female 
2 A638068   <NA>        bird        green         red             Unknown
3 A639310   <NA>        bird        white         gray            Unknown
4 A618968   *morgan     cat         black         white           Female 
5 A646202   <NA>        bird        black         <NA>            Unknown
6 A597464   <NA>        cat         black         <NA>            Unknown
# ℹ 16 more variables: dob <date>, intake_date <date>, intake_condition <chr>,
#   intake_type <chr>, intake_subtype <chr>, reason_for_intake <chr>,
#   outcome_date <date>, crossing <chr>, jurisdiction <chr>,
#   outcome_type <chr>, outcome_subtype <chr>, latitude <dbl>, longitude <dbl>,
#   outcome_is_dead <lgl>, was_outcome_alive <lgl>, geopoint <chr>

Dataset

We are using the Long Beach Animal Shelter data for our project. The dataset comes from the City of Long Beach Animal Care Services and contains records for nearly 30,000 animals that have stayed at the Long Beach Animal Shelter; each row is the intake and outcome record for one animal. We will focus on cats, dogs, and birds because they are common pets and are the animal types with the most records in the data, with a combined total of 25,988 records.

The variables most relevant to our project describe an animal’s time at the shelter and its condition on arrival and departure: intake_date, intake_condition, intake_type, outcome_date, and outcome_type. The data also includes the location where the animal was rescued and animal characteristics such as name, species, sex, primary and secondary colors, and date of birth. For each analysis, we will remove rows that have a missing (NA) value in any of the variables that analysis uses, so that our visualizations are built only from complete records.
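The missing-value rule above could be applied with a small helper like the one below. This is a minimal sketch, not our final code: the helper name drop_incomplete and the vector q1_cols are our own illustrative names, and the column list is simply the set of variables the proposal names for Question 1.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: keep only rows that have no NA in the columns
# a given analysis uses. Other columns may still contain NA.
drop_incomplete <- function(data, cols) {
  data |>
    drop_na(all_of(cols))
}

# Variables the proposal lists for Question 1 (illustrative name).
q1_cols <- c("intake_date", "intake_condition", "intake_type", "outcome_date")

# Example call, assuming `animal` has been loaded as shown earlier:
# animal_q1 <- drop_incomplete(animal, q1_cols)
```

Dropping rows per analysis, rather than across all 22 columns at once, avoids discarding animals only because an unrelated field such as animal_name is missing.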

We chose this dataset because we believe we can tell a more compelling story through data visualization when the data covers a topic we are interested in. We all enjoy animals and are curious to learn more about how long animals spend in the shelter, what their outcomes are, and how their condition on arrival affects their length of stay.

Questions

How does an animal’s intake condition relate to its length of stay in the shelter? Does this relationship vary by intake type, for example stray versus wildlife?
How does the distribution of shelter outcomes vary by animal species? Have shelter outcomes changed over time?

Analysis plan

Question 1:

To answer this question, we will use intake_condition and intake_type as the primary categorical variables, along with intake_date and outcome_date as time variables. We will create a new numerical variable, length of stay, by calculating the difference in days between the outcome date and the intake date for each animal (dogs, cats, and birds).
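The length-of-stay calculation described above might look like the following. It is a sketch under the assumption that intake_date and outcome_date are parsed as Date columns (as the head() output suggests); the helper name add_length_of_stay and the column name length_of_stay are our own illustrative choices.

```r
library(dplyr)

# Hypothetical helper: add a numeric length_of_stay column, in days.
# Subtracting two Date values yields a difference measured in days.
add_length_of_stay <- function(data) {
  data |>
    mutate(length_of_stay = as.numeric(outcome_date - intake_date))
}

# Example call, assuming `animal` has been loaded as shown earlier:
# animal <- add_length_of_stay(animal)
```

Rows with a missing intake or outcome date produce NA here, so they would be removed by the missing-value step before plotting.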

Before visualization, we will clean and preprocess the intake condition variable. Specifically, we will filter out intake conditions with very few observations to reduce noise and avoid misleading comparisons. We will also create two versions of the analysis: one in which similar intake conditions are grouped together (for example, injured mild, injured moderate, and injured severe), and one in which intake conditions remain separate to preserve detailed distinctions. This dual approach will let us look for broad patterns in the grouped categories while still examining differences among the separate ones. Finally, we will restrict the intake type variable to the most common intake types: owner surrender, stray, and wildlife.
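A possible shape for this preprocessing step is sketched below. Several details are assumptions rather than decisions we have made: the cutoff min_n for "very few observations" is a placeholder, the intake type labels are assumed to be stored in lowercase exactly as written, and the grouping rule assumes the injured categories all begin with the word "injured". The helper name prep_intake and the column intake_condition_grouped are illustrative.

```r
library(dplyr)
library(stringr)

# Hypothetical helper for the Question 1 preprocessing.
# min_n is an assumed cutoff; the proposal does not fix a value.
prep_intake <- function(data, min_n = 50,
                        keep_types = c("owner surrender", "stray", "wildlife")) {
  data |>
    # Keep only the most common intake types.
    filter(intake_type %in% keep_types) |>
    # Drop intake conditions with fewer than min_n remaining records.
    group_by(intake_condition) |>
    filter(n() >= min_n) |>
    ungroup() |>
    # Grouped version: collapse every "injured ..." level into "injured".
    # The original intake_condition column is kept as the separated version.
    mutate(
      intake_condition_grouped = if_else(
        str_detect(intake_condition, "^injured"),
        "injured",
        intake_condition
      )
    )
}

# Example call, assuming `animal` has been loaded as shown earlier:
# animal_q1 <- prep_intake(animal)
```

Keeping both columns in one data frame means the grouped and separated analyses can share the same filtered rows, so any difference between them comes from the grouping alone.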

We will first summarize length of stay across intake conditions and intake types using descriptive statistics such as the median and interquartile range. To visualize the relationship, we are considering boxplots or violin plots of length of stay by intake condition, with facets or color used to distinguish intake types. Histograms can also be used to examine the overall distribution of length of stay. Our analysis of this question will focus on identifying patterns and differences in the distribution of length of stay across intake conditions and intake types, and on assessing whether grouping similar conditions changes the interpretation of the results. No external datasets will be merged in for this question.
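The summary and one of the candidate plots could be drafted as follows. This sketch assumes the data already has a numeric length_of_stay column computed as described in the plan; the helper names summarise_stay and plot_stay are illustrative, and the boxplot is only one of the plot types we are considering.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: median and IQR of length of stay for each
# combination of intake condition and intake type.
summarise_stay <- function(data) {
  data |>
    group_by(intake_condition, intake_type) |>
    summarise(
      n = n(),
      median_stay = median(length_of_stay, na.rm = TRUE),
      iqr_stay = IQR(length_of_stay, na.rm = TRUE),
      .groups = "drop"
    )
}

# Hypothetical helper: horizontal boxplots of length of stay by intake
# condition, with one facet per intake type.
plot_stay <- function(data) {
  ggplot(data, aes(x = length_of_stay, y = intake_condition)) +
    geom_boxplot() +
    facet_wrap(vars(intake_type)) +
    labs(x = "Length of stay (days)", y = "Intake condition")
}

# Example calls, assuming `animal_q1` is the cleaned Question 1 data:
# summarise_stay(animal_q1)
# plot_stay(animal_q1)
```

Swapping geom_boxplot() for geom_violin() would give the violin-plot variant. Length of stay is likely to be right-skewed, so a log-scaled axis may be worth trying once we see the data.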

Question 2:

For this question, we will examine how outcome patterns change over time for selected animal types, specifically cats, dogs, and birds. Prior to analysis, we will remove observations with missing values in the variables relevant to this question to ensure consistency in comparisons. The analysis will focus on animal_type and a narrowed set of outcome_type categories, with temporal information derived from outcome_date. We will focus on the most common outcome types, such as adoption, transfer, rescue, and euthanasia, which account for the majority of recorded outcomes. Less frequent outcome categories will be excluded to improve interpretability and reduce visual clutter.

To capture trends over time, we will extract the year from the outcome date and calculate the proportion of each selected outcome type within each animal type for every year. We will visualize these trends using line graphs, with year on the x-axis and the proportion of outcomes on the y-axis. Each line will represent a specific outcome type, and the plots will be faceted by animal type to allow clearer comparison across the three species of interest.
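The yearly proportions and the faceted line graph could be sketched as below. Two points are assumptions: the outcome labels are taken to be stored in lowercase as written here, and the proportion is computed among the selected outcome types only (to use all outcomes as the denominator, the filter would move after the proportion is calculated). The names keep_outcomes, outcome_props, and plot_outcomes are illustrative.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Assumed labels for the most common outcome types.
keep_outcomes <- c("adoption", "transfer", "rescue", "euthanasia")

# Hypothetical helper: proportion of each selected outcome type within
# each animal type and year. The denominator is the number of selected
# outcomes for that animal type and year.
outcome_props <- function(data, keep = keep_outcomes) {
  data |>
    filter(outcome_type %in% keep) |>
    mutate(year = lubridate::year(outcome_date)) |>
    count(animal_type, year, outcome_type) |>
    group_by(animal_type, year) |>
    mutate(prop = n / sum(n)) |>
    ungroup()
}

# Hypothetical helper: one line per outcome type, one facet per animal type.
plot_outcomes <- function(props) {
  ggplot(props, aes(x = year, y = prop, color = outcome_type)) +
    geom_line() +
    facet_wrap(vars(animal_type)) +
    labs(x = "Year", y = "Proportion of outcomes", color = "Outcome type")
}

# Example calls, assuming `animal` has been loaded as shown earlier:
# props <- outcome_props(animal)
# plot_outcomes(props)
```

Years with very few records for a species, which may be the case for birds, can produce noisy proportions, so it may help to inspect the n column before interpreting a line.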

This approach allows us to more clearly observe how outcome distributions evolve over time within each animal category, while maintaining a descriptive and non-causal interpretation. No external datasets will be incorporated in this analysis.