Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

  • Identify the source of the data.

    The source of the data is the Stanford Open Policing Project, which records and tabulates police stops in cities and counties across the United States.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data are collected, by request, from police departments in every state. The project started collecting in 2015.

  • Write a brief description of the observations.

    Each observation is a single police stop. The recorded variables include the date of the stop, the biographical information of the person stopped, whether a search was conducted, whether an arrest was made, and the reason for the stop. Not all data sets contain this information, as different police departments provided varying levels of detail. We have chosen to use the data for just one state, California.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    In California, who is most likely to be stopped by police, and does this change at night? What is the most prevalent reason for a stop, by race and by jurisdiction? What is the per capita rate at which individuals are pulled over? Do certain biographical features lead to more frisks, searches, or arrests?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    We intend to study whether race, gender, and age are correlated with stops and with subsequent searches, frisks, or arrests.

    Our hypothesis is that certain races and ages will experience more stops, and that they will also experience more subsequent actions.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Stop date and time are quantitative, as is subject age. Subject race and sex, reason for stop, search conducted, frisk performed, and arrest made are categorical.

Glimpse of data

# example of the san diego dataset

sandiego <- read.csv("data/ca_san_diego_2020_04_01.csv")

skim(sandiego)
Data summary
Name sandiego
Number of rows 383027
Number of columns 21
_______________________
Column type frequency:
character 13
logical 7
numeric 1
________________________
Group variables None

Variable type: character

skim_variable                 n_missing  complete_rate  min  max  empty  n_unique  whitespace
raw_row_number                        0           1.00    1  258      0    383027           0
date                                183           1.00   10   10      0      1186           0
time                                735           1.00    8    8      0      1440           0
service_area                          0           1.00    3    8      0        25           0
subject_race                       1234           1.00    5   22      0         5           0
subject_sex                         661           1.00    4    6      0         2           0
type                                  0           1.00    9    9      0         1           0
outcome                           39172           0.90    6    8      0         3           0
search_basis                     366739           0.04    2   14      0         5           0
reason_for_search                368749           0.04    5  527      0       632           0
reason_for_stop                     219           1.00    3  552      0        97           0
raw_action_taken                  31971           0.92    2  352      0       166           0
raw_subject_race_description       1234           1.00    5   16      0        18           0

Variable type: logical

skim_variable     n_missing  complete_rate  mean  count
arrest_made           34743           0.91  0.01  FAL: 343469, TRU: 4815
citation_issued       31971           0.92  0.63  TRU: 221137, FAL: 129919
warning_issued        31971           0.92  0.34  FAL: 230386, TRU: 120670
contraband_found     366739           0.04  0.10  FAL: 14735, TRU: 1553
search_conducted          0           1.00  0.04  FAL: 366739, TRU: 16288
search_person          2190           0.99  0.02  FAL: 373413, TRU: 7424
search_vehicle         2190           0.99  0.03  FAL: 370738, TRU: 10099

Variable type: numeric

skim_variable  n_missing  complete_rate  mean     sd  p0  p25  p50  p75  p100  hist
subject_age        11963           0.97  37.1  14.18  10   25   34   47   100  ▇▇▅▁▁
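
The research questions for this data set compare stops at night and post-stop outcomes across groups. A minimal sketch of that comparison follows. It assumes the column names shown in the skim output (subject_race, time, search_conducted, arrest_made) and runs on a small made-up data frame, stops_toy, rather than the real file. The night window (20:00 to 05:59) is an assumption chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the San Diego file.
stops_toy <- data.frame(
  subject_race = c("white", "white", "black", "black", "hispanic", "hispanic"),
  time = c("08:15:00", "22:40:00", "23:05:00", "13:30:00", "02:10:00", "17:45:00"),
  search_conducted = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  arrest_made = c(FALSE, FALSE, TRUE, FALSE, FALSE, NA)
)

rates <- stops_toy %>%
  mutate(
    # time is stored as "HH:MM:SS", so the first two characters are the hour
    hour = as.integer(substr(time, 1, 2)),
    # assumed definition of "night": 20:00 through 05:59
    night = hour >= 20 | hour < 6
  ) %>%
  group_by(subject_race) %>%
  summarise(
    n_stops = n(),
    share_night = mean(night, na.rm = TRUE),
    search_rate = mean(search_conducted, na.rm = TRUE),
    arrest_rate = mean(arrest_made, na.rm = TRUE),
    .groups = "drop"
  )

rates
```

These are rates among people who were stopped. A per capita stop rate would additionally need population counts for each group, which are not part of this file.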

Data 2

Introduction and data

  • Identify the source of the data.

    The data come from the website data.austintexas.gov and are collected by the Austin Animal Center.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data set was originally created on February 5, 2016. On intake, each animal's unique Animal ID, name (if it has one), and physical characteristics were recorded. Each animal's outcome type was then recorded, along with the date of the outcome and the animal's age upon outcome.

  • Write a brief description of the observations.

    Each observation is one outcome per animal per encounter. It includes the Animal ID, the name (if the animal has one), the date of the outcome, the date of birth, the outcome type, the type of animal, the sex upon outcome, the age upon outcome, the breed, and the color.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1.) Does the type of animal have an impact on the likelihood of adoption being the outcome?

    2.) Does the breed of animal have an impact on the likelihood of euthanasia being the outcome?

    3.) What is the relationship between an animal's age and its outcome?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    We would like to research how different animal characteristics impact their outcomes. Our hypotheses are:

    1.) The type of animal does have an impact on the likelihood of adoption being the outcome, with Cats and Dogs being the two animal types with the highest probability of adoption.

    2.) The breed of animal does have an impact on the likelihood of euthanasia being the outcome, with breeds that are considered more aggressive and violent having higher rates of euthanasia than other breeds.

    3.) As age increases, the likelihood of adoption will decrease and the likelihood of euthanasia will increase.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Categorical variables: Outcome Type, Outcome Subtype, Animal Type, Sex upon Outcome, Breed, and Color

    Quantitative variable: Age upon Outcome

Glimpse of data

# load the Austin Animal Center outcomes data
animal_center_outcomes <- read_csv("data/Austin_Animal_Center_Outcomes .csv", skip = 1)
Rows: 149051 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Animal ID, Name, DateTime, MonthYear, Date of Birth, Outcome Type,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
animal_center_outcomes
# A tibble: 149,051 × 12
   `Animal ID` Name       DateTime      MonthYear `Date of Birth` `Outcome Type`
   <chr>       <chr>      <chr>         <chr>     <chr>           <chr>         
 1 A794011     Chunk      05/08/2019 0… May 2019  05/02/2017      Rto-Adopt     
 2 A776359     Gizmo      07/18/2018 0… Jul 2018  07/12/2017      Adoption      
 3 A821648     <NA>       08/16/2020 1… Aug 2020  08/16/2019      Euthanasia    
 4 A720371     Moose      02/13/2016 0… Feb 2016  10/08/2015      Adoption      
 5 A674754     <NA>       03/18/2014 1… Mar 2014  03/12/2014      Transfer      
 6 A659412     Princess   10/05/2020 0… Oct 2020  03/24/2013      Adoption      
 7 A814515     Quentin    05/06/2020 0… May 2020  03/01/2018      Adoption      
 8 A868405     *Leo       03/04/2023 0… Mar 2023  11/02/2020      Adoption      
 9 A689724     *Donatello 10/18/2014 0… Oct 2014  08/01/2014      Adoption      
10 A680969     *Zeus      08/05/2014 0… Aug 2014  06/03/2014      Adoption      
# ℹ 149,041 more rows
# ℹ 6 more variables: `Outcome Subtype` <chr>, `Animal Type` <chr>,
#   `Sex upon Outcome` <chr>, `Age upon Outcome` <chr>, Breed <chr>,
#   Color <chr>
glimpse(animal_center_outcomes)
Rows: 149,051
Columns: 12
$ `Animal ID`        <chr> "A794011", "A776359", "A821648", "A720371", "A67475…
$ Name               <chr> "Chunk", "Gizmo", NA, "Moose", NA, "Princess", "Que…
$ DateTime           <chr> "05/08/2019 06:20:00 PM", "07/18/2018 04:02:00 PM",…
$ MonthYear          <chr> "May 2019", "Jul 2018", "Aug 2020", "Feb 2016", "Ma…
$ `Date of Birth`    <chr> "05/02/2017", "07/12/2017", "08/16/2019", "10/08/20…
$ `Outcome Type`     <chr> "Rto-Adopt", "Adoption", "Euthanasia", "Adoption", …
$ `Outcome Subtype`  <chr> NA, NA, NA, NA, "Partner", NA, "Foster", NA, NA, NA…
$ `Animal Type`      <chr> "Cat", "Dog", "Other", "Dog", "Cat", "Dog", "Dog", …
$ `Sex upon Outcome` <chr> "Neutered Male", "Neutered Male", "Unknown", "Neute…
$ `Age upon Outcome` <chr> "2 years", "1 year", "1 year", "4 months", "6 days"…
$ Breed              <chr> "Domestic Shorthair Mix", "Chihuahua Shorthair Mix"…
$ Color              <chr> "Brown Tabby/White", "White/Brown", "Gray", "Buff",…
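
Age upon Outcome is treated as quantitative in the research questions, but the glimpse shows it is read in as character strings such as "2 years" or "4 months". One possible way to turn it into a number is sketched below. The helper age_to_days is hypothetical (it is not part of the original analysis), and the conversion factors for months and years are approximations.

```r
# Hypothetical helper: convert strings such as "2 years" or "4 months"
# into an approximate age in days. Unrecognised units and NA become NA.
age_to_days <- function(age) {
  # the number is everything before the first space
  n <- as.numeric(sub(" .*$", "", age))
  # the unit is everything after the first space, with a trailing "s" removed
  unit <- sub("s$", "", sub("^[^ ]+ ", "", age))
  # approximate days per unit (average month and year lengths)
  days_per_unit <- c(day = 1, week = 7, month = 30.44, year = 365.25)
  unname(n * days_per_unit[unit])
}

ages <- c("2 years", "1 year", "4 months", "6 days", NA)
age_to_days(ages)
```

On the real data this could be applied inside mutate(), for example mutate(age_days = age_to_days(`Age upon Outcome`)), after which age can be plotted or modelled as a numeric variable.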

Data 3

Introduction and data

  • Identify the source of the data.

    Our World In Data, CO₂ and Greenhouse Gas Emissions by Hannah Ritchie, Max Roser and Pablo Rosado.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data set was first made available on 8/7/2020 and was collected by taking data from a number of other data sets including but not limited to the Statistical review of world energy (from BP), International energy data (from EIA), and Global carbon budget - Fossil CO2 emissions (from Global Carbon Project).

  • Write a brief description of the observations.

    Each observation is one country in one year. It records CO2 emissions by source (cement, coal, gas, oil, flaring, and other), CH4 and N2O emissions, GDP, and population.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1) Choose a country from each continent. For each country, how has the total amount of CO2 emissions changed over time?

    2) How does the change in GDP affect the total CO2 emissions of a country?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    We want to research how CO2 emissions change over time and in accordance with changes in a country’s GDP. Specifically, we want to examine how CO2 emissions differ between countries in different parts of the world.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Quantitative variables (emissions, GDP, year)

    Categorical variables (country)

Glimpse of data

pollution <- read_csv("data/global_emissions.csv")
Rows: 2484 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country.Name, Country.Code
dbl (18): Year, Country.GDP, Country.Population, Emissions.Production.CH4, E...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(pollution)
Rows: 2,484
Columns: 20
$ Year                                 <dbl> 1992, 1993, 1994, 1995, 1996, 199…
$ Country.Name                         <chr> "Afghanistan", "Afghanistan", "Af…
$ Country.Code                         <chr> "AFG", "AFG", "AFG", "AFG", "AFG"…
$ Country.GDP                          <dbl> 12677538816, 9834580992, 79198571…
$ Country.Population                   <dbl> 14485543, 15816601, 17075728, 181…
$ Emissions.Production.CH4             <dbl> 7.13, 7.21, 7.47, 7.83, 8.67, 9.4…
$ Emissions.Production.N2O             <dbl> 2.89, 2.93, 2.76, 2.88, 3.12, 3.4…
$ Emissions.Production.CO2.Cement      <dbl> 0.046, 0.047, 0.047, 0.047, 0.047…
$ Emissions.Production.CO2.Coal        <dbl> 0.022, 0.018, 0.015, 0.015, 0.007…
$ Emissions.Production.CO2.Gas         <dbl> 0.363, 0.352, 0.338, 0.322, 0.308…
$ Emissions.Production.CO2.Oil         <dbl> 0.927, 0.894, 0.860, 0.824, 0.780…
$ Emissions.Production.CO2.Flaring     <dbl> 0.022, 0.022, 0.022, 0.022, 0.022…
$ Emissions.Production.CO2.Other       <dbl> 0.000000e+00, 0.000000e+00, 2.220…
$ Emissions.Production.CO2.Total       <dbl> 1.379, 1.333, 1.282, 1.230, 1.165…
$ `Emissions.Global Share.CO2.Cement`  <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Coal`    <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
$ `Emissions.Global Share.CO2.Gas`     <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Oil`     <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Flaring` <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Total`   <dbl> 0.01, 0.01, 0.01, 0.01, 0.00, 0.0…
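
The second research question for this data set relates changes in GDP to changes in total CO2 emissions. A minimal sketch of one way to set that up is given here: it computes year-over-year growth in both variables within each country. It assumes the column names shown in the glimpse and runs on a small made-up data frame, pollution_toy, whose values are hypothetical and not taken from the real file.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the emissions file.
pollution_toy <- data.frame(
  Country.Name = rep(c("A", "B"), each = 3),
  Year = rep(1992:1994, times = 2),
  Country.GDP = c(100, 110, 121, 200, 190, 209),
  Emissions.Production.CO2.Total = c(1.0, 1.1, 1.32, 2.0, 2.0, 1.8)
)

changes <- pollution_toy %>%
  group_by(Country.Name) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(
    # proportional change relative to the previous year within the same country
    gdp_growth = Country.GDP / lag(Country.GDP) - 1,
    co2_growth = Emissions.Production.CO2.Total /
      lag(Emissions.Production.CO2.Total) - 1
  ) %>%
  ungroup()

changes
```

The first year of each country has no previous year, so its growth values are NA. Plotting co2_growth against gdp_growth, or Emissions.Production.CO2.Total against Year for the chosen countries, would then address the two questions directly.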