Project title

Proposal

library(tidyverse)

library(skimr)

Data 1

Identify the source of the data.

The source of the data is the Stanford Open Policing Project, which compiles and tabulates records of police stops made in cities and counties across the United States.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The project collects the data, by request, from police departments in every state. Collection began in 2015.

Write a brief description of the observations.

Each observation is a single police stop. The recorded variables include the date of the stop, demographic information about the person stopped, whether a search was conducted, whether an arrest was made, and the reason for the stop. Not all data sets contain every variable, because police departments provided varying levels of detail. We have chosen to use the data for just one state, California.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

In California, who is most likely to be stopped by police, and does this change at night? What is the most common reason for a stop for each race, by jurisdiction? What is the per capita rate of stops? Do certain demographic characteristics lead to more frisks, searches, or arrests?

A description of the research topic along with a concise statement of your hypotheses on this topic.

We intend to study whether race, gender, and age are correlated with stops and with subsequent searches, frisks, or arrests.

Our hypothesis is that people of certain races and ages will experience more stops, and that they will also experience more subsequent actions.

Identify the types of variables in your research question. Categorical? Quantitative?

Stop date and time are quantitative, as is subject age. Subject race, subject sex, reason for stop, search conducted, frisk performed, and arrest made are categorical.
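
Because the question about night-time stops depends on the time of each stop, the sketch below shows one way a night indicator could be derived. It assumes `time` is stored as an "HH:MM:SS" string (the skim output only shows that it has 8 characters), and the 8 pm to 6 am definition of night is our own placeholder, not something defined by the data. The toy rows are made up.

```r
library(dplyr)
library(lubridate)

# Toy rows standing in for the `date` and `time` columns (values are made up;
# the "HH:MM:SS" format is an assumption)
stops <- tibble(
  date = c("2014-01-01", "2014-01-01", "2014-01-02"),
  time = c("01:25:00", "13:40:00", "22:05:00")
)

stops <- stops %>%
  mutate(
    # hms() parses the string into a period; hour() extracts the hour of day
    stop_hour = hour(hms(time)),
    # Placeholder definition of night: 8 pm up to (not including) 6 am
    at_night = stop_hour >= 20 | stop_hour < 6
  )

stops
```

A fixed clock window ignores seasonal changes in sunset, so the cutoffs would likely need revisiting in the actual analysis.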

Glimpse of data

# example of the san diego dataset

sandiego <- read.csv("data/ca_san_diego_2020_04_01.csv")

skim(sandiego)

Data summary
Name	sandiego
Number of rows	383027
Number of columns	21
_______________________
Column type frequency:
character	13
logical	7
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
raw_row_number	0	1.00	1	258	383027
date	183	1.00	10	10	1186
time	735	1.00	8	8	1440
service_area	0	1.00	3	8	25
subject_race	1234	1.00	5	22	5
subject_sex	661	1.00	4	6	2
type	0	1.00	9	9	1
outcome	39172	0.90	6	8	3
search_basis	366739	0.04	2	14	5
reason_for_search	368749	0.04	5	527	632
reason_for_stop	219	1.00	3	552	97
raw_action_taken	31971	0.92	2	352	166
raw_subject_race_description	1234	1.00	5	16	18

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
arrest_made	34743	0.91	0.01	FAL: 343469, TRU: 4815
citation_issued	31971	0.92	0.63	TRU: 221137, FAL: 129919
warning_issued	31971	0.92	0.34	FAL: 230386, TRU: 120670
contraband_found	366739	0.04	0.10	FAL: 14735, TRU: 1553
search_conducted	0	1.00	0.04	FAL: 366739, TRU: 16288
search_person	2190	0.99	0.02	FAL: 373413, TRU: 7424
search_vehicle	2190	0.99	0.03	FAL: 370738, TRU: 10099

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
subject_age	11963	0.97	37.1	14.18	10	25	34	47	100	▇▇▅▁▁
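
As a first pass at the question about searches and arrests, the sketch below computes search and arrest rates by race. It uses a small made-up data frame with the same column names as `sandiego` (`subject_race`, `search_conducted`, `arrest_made`). Dropping stops with missing race and ignoring missing arrest values are assumptions about how we might handle missing data, not settled decisions.

```r
library(dplyr)

# Made-up rows that reuse column names from the San Diego data
stops <- tibble(
  subject_race = c("white", "white", "black", "black", "hispanic", NA),
  search_conducted = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE),
  arrest_made = c(FALSE, FALSE, TRUE, NA, FALSE, FALSE)
)

rates_by_race <- stops %>%
  filter(!is.na(subject_race)) %>%      # assumption: drop stops with unknown race
  group_by(subject_race) %>%
  summarise(
    n_stops = n(),
    search_rate = mean(search_conducted),
    arrest_rate = mean(arrest_made, na.rm = TRUE),  # assumption: ignore missing outcomes
    .groups = "drop"
  )

rates_by_race
```

These rates are conditional on being stopped. A per capita stop rate would additionally need population counts by race, which are not part of this data set.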

Data 2

Introduction and data

Identify the source of the data.

The data comes from the website data.austintexas.gov and is collected by the Austin Animal Center.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data set was originally created on February 5, 2016. At intake, each animal's unique Animal ID, name (if it has one), and physical characteristics are recorded. The animal's outcome type is recorded later, along with the date of the outcome and the animal's age at outcome.

Write a brief description of the observations.

Each observation is one outcome for one animal in one encounter. It includes the animal's Animal ID, name (if it has one), outcome date, date of birth, outcome type, animal type, sex upon outcome, age upon outcome, breed, and color.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

1.) Does the type of animal have an impact on the likelihood of adoption being the outcome?

2.) Does the breed of animal have an impact on the likelihood of euthanasia being the outcome?

3.) What is the relationship between animal age and their outcome?

A description of the research topic along with a concise statement of your hypotheses on this topic.

We would like to research how different animal characteristics impact their outcomes. Our hypotheses are:

1.) The type of animal does have an impact on the likelihood of adoption being the outcome, with Cats and Dogs being the two animal types with the highest probability of adoption.

2.) The breed of animal does have an impact on the likelihood of euthanasia being the outcome, with breeds that are considered more aggressive and violent having higher rates of euthanasia than other breeds.

3.) As age increases, the likelihood of adoption will decrease and the likelihood of euthanasia will increase.

Identify the types of variables in your research question. Categorical? Quantitative?

Categorical variables: Outcome Type, Outcome Subtype, Animal Type, Sex upon Outcome, Breed, and Color

Quantitative variable: Age upon Outcome
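
Age upon Outcome is read in as text (for example "2 years" or "4 months"), so it has to be converted before it can be treated as quantitative. The sketch below shows one possible conversion to approximate days. The toy values mirror the formats visible in the glimpse, plus a "weeks" value that we assume may occur. The conversion factors (30 days per month, 365 per year) are rough approximations.

```r
library(dplyr)
library(stringr)

# Toy values in the "<number> <unit>" format; "3 weeks" is an assumed format
outcomes <- tibble(
  `Age upon Outcome` = c("2 years", "1 year", "4 months", "6 days", "3 weeks")
)

# Approximate number of days in each unit (a simplifying assumption)
days_per_unit <- c(day = 1, week = 7, month = 30, year = 365)

outcomes <- outcomes %>%
  mutate(
    age_number = as.numeric(str_extract(`Age upon Outcome`, "\\d+")),
    # Pull out the unit word and drop a trailing "s" so "years" becomes "year"
    age_unit = str_remove(str_extract(`Age upon Outcome`, "[a-z]+"), "s$"),
    age_days = age_number * unname(days_per_unit[age_unit])
  )

outcomes
```

An alternative would be to compute age from `Date of Birth` and `DateTime` directly, which avoids the rounding built into the text field.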

Glimpse of data

# load the Austin Animal Center outcomes data
animal_center_outcomes <- read_csv("data/Austin_Animal_Center_Outcomes .csv", skip = 1)

Rows: 149051 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Animal ID, Name, DateTime, MonthYear, Date of Birth, Outcome Type,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

animal_center_outcomes

# A tibble: 149,051 × 12
   `Animal ID` Name       DateTime      MonthYear `Date of Birth` `Outcome Type`
   <chr>       <chr>      <chr>         <chr>     <chr>           <chr>         
 1 A794011     Chunk      05/08/2019 0… May 2019  05/02/2017      Rto-Adopt     
 2 A776359     Gizmo      07/18/2018 0… Jul 2018  07/12/2017      Adoption      
 3 A821648     <NA>       08/16/2020 1… Aug 2020  08/16/2019      Euthanasia    
 4 A720371     Moose      02/13/2016 0… Feb 2016  10/08/2015      Adoption      
 5 A674754     <NA>       03/18/2014 1… Mar 2014  03/12/2014      Transfer      
 6 A659412     Princess   10/05/2020 0… Oct 2020  03/24/2013      Adoption      
 7 A814515     Quentin    05/06/2020 0… May 2020  03/01/2018      Adoption      
 8 A868405     *Leo       03/04/2023 0… Mar 2023  11/02/2020      Adoption      
 9 A689724     *Donatello 10/18/2014 0… Oct 2014  08/01/2014      Adoption      
10 A680969     *Zeus      08/05/2014 0… Aug 2014  06/03/2014      Adoption      
# ℹ 149,041 more rows
# ℹ 6 more variables: `Outcome Subtype` <chr>, `Animal Type` <chr>,
#   `Sex upon Outcome` <chr>, `Age upon Outcome` <chr>, Breed <chr>,
#   Color <chr>

glimpse(animal_center_outcomes)

Rows: 149,051
Columns: 12
$ `Animal ID`        <chr> "A794011", "A776359", "A821648", "A720371", "A67475…
$ Name               <chr> "Chunk", "Gizmo", NA, "Moose", NA, "Princess", "Que…
$ DateTime           <chr> "05/08/2019 06:20:00 PM", "07/18/2018 04:02:00 PM",…
$ MonthYear          <chr> "May 2019", "Jul 2018", "Aug 2020", "Feb 2016", "Ma…
$ `Date of Birth`    <chr> "05/02/2017", "07/12/2017", "08/16/2019", "10/08/20…
$ `Outcome Type`     <chr> "Rto-Adopt", "Adoption", "Euthanasia", "Adoption", …
$ `Outcome Subtype`  <chr> NA, NA, NA, NA, "Partner", NA, "Foster", NA, NA, NA…
$ `Animal Type`      <chr> "Cat", "Dog", "Other", "Dog", "Cat", "Dog", "Dog", …
$ `Sex upon Outcome` <chr> "Neutered Male", "Neutered Male", "Unknown", "Neute…
$ `Age upon Outcome` <chr> "2 years", "1 year", "1 year", "4 months", "6 days"…
$ Breed              <chr> "Domestic Shorthair Mix", "Chihuahua Shorthair Mix"…
$ Color              <chr> "Brown Tabby/White", "White/Brown", "Gray", "Buff",…
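
To illustrate how the first research question might be approached, the sketch below computes the proportion of outcomes that are adoptions for each animal type. The toy rows copy the `Animal Type` and `Outcome Type` values of the first six rows shown above. Counting only "Adoption" (and not "Rto-Adopt") as an adoption is an assumption we would need to revisit.

```r
library(dplyr)

# First six rows of the two relevant columns, copied from the glimpse above
outcomes <- tibble(
  `Animal Type` = c("Cat", "Dog", "Other", "Dog", "Cat", "Dog"),
  `Outcome Type` = c("Rto-Adopt", "Adoption", "Euthanasia",
                     "Adoption", "Transfer", "Adoption")
)

adoption_by_type <- outcomes %>%
  group_by(`Animal Type`) %>%
  summarise(
    n_outcomes = n(),
    # Assumption: only the exact label "Adoption" counts as an adoption
    prop_adopted = mean(`Outcome Type` == "Adoption"),
    .groups = "drop"
  ) %>%
  arrange(desc(prop_adopted))

adoption_by_type
```

Six rows are far too few to say anything about the hypotheses. The same pipeline applied to `animal_center_outcomes` would give the actual proportions.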

Data 3

Introduction and data

Identify the source of the data.

The data comes from Our World in Data: "CO₂ and Greenhouse Gas Emissions" by Hannah Ritchie, Max Roser, and Pablo Rosado.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data set was first made available on 8/7/2020. It was compiled from a number of other data sets, including but not limited to the Statistical Review of World Energy (from BP), International Energy Data (from EIA), and Global Carbon Budget - Fossil CO2 Emissions (from the Global Carbon Project).

Write a brief description of the observations.

Each observation is one country in one year. It records the CO2, CH4, and N2O emissions produced from various sources, along with the country's GDP and population.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

1) Choose a country from each continent. For each country, how has the total amount of CO2 emissions changed over time?

2) How does the change in GDP affect the total CO2 emissions of a country?

A description of the research topic along with a concise statement of your hypotheses on this topic.

We want to research how CO2 emissions change over time and in relation to changes in a country’s GDP. Specifically, we want to examine how CO2 emissions differ among countries in different parts of the world.

Identify the types of variables in your research question. Categorical? Quantitative?

Quantitative variables (emissions, GDP)

Categorical variables (country)
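
For the first research question, one possible starting point is to restrict the data to a handful of chosen countries and compare each country's total CO2 emissions in its first and last recorded years. The sketch below uses the column names from the data set. The two Afghanistan values come from the glimpse output, while the other rows and the `selected` vector are made up for illustration.

```r
library(dplyr)

# Afghanistan values are taken from the glimpse; Brazil and Chad rows are made up
emissions <- tibble(
  Country.Name = c("Afghanistan", "Afghanistan", "Brazil", "Brazil", "Chad", "Chad"),
  Year = c(1992, 1993, 1992, 1993, 1992, 1993),
  Emissions.Production.CO2.Total = c(1.379, 1.333, 220, 230, 0.2, 0.3)
)

# Hypothetical choice of countries; the real list would cover each continent
selected <- c("Afghanistan", "Brazil")

co2_summary <- emissions %>%
  filter(Country.Name %in% selected) %>%
  group_by(Country.Name) %>%
  summarise(
    first_year = min(Year),
    last_year = max(Year),
    co2_first = Emissions.Production.CO2.Total[which.min(Year)],
    co2_last = Emissions.Production.CO2.Total[which.max(Year)],
    .groups = "drop"
  ) %>%
  mutate(co2_change = co2_last - co2_first)

co2_summary
```

The data set has no continent column, so the mapping from countries to continents would have to be supplied by hand or joined from another source.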

Glimpse of data

pollution <- read_csv("data/global_emissions.csv")

Rows: 2484 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country.Name, Country.Code
dbl (18): Year, Country.GDP, Country.Population, Emissions.Production.CH4, E...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(pollution)

Rows: 2,484
Columns: 20
$ Year                                 <dbl> 1992, 1993, 1994, 1995, 1996, 199…
$ Country.Name                         <chr> "Afghanistan", "Afghanistan", "Af…
$ Country.Code                         <chr> "AFG", "AFG", "AFG", "AFG", "AFG"…
$ Country.GDP                          <dbl> 12677538816, 9834580992, 79198571…
$ Country.Population                   <dbl> 14485543, 15816601, 17075728, 181…
$ Emissions.Production.CH4             <dbl> 7.13, 7.21, 7.47, 7.83, 8.67, 9.4…
$ Emissions.Production.N2O             <dbl> 2.89, 2.93, 2.76, 2.88, 3.12, 3.4…
$ Emissions.Production.CO2.Cement      <dbl> 0.046, 0.047, 0.047, 0.047, 0.047…
$ Emissions.Production.CO2.Coal        <dbl> 0.022, 0.018, 0.015, 0.015, 0.007…
$ Emissions.Production.CO2.Gas         <dbl> 0.363, 0.352, 0.338, 0.322, 0.308…
$ Emissions.Production.CO2.Oil         <dbl> 0.927, 0.894, 0.860, 0.824, 0.780…
$ Emissions.Production.CO2.Flaring     <dbl> 0.022, 0.022, 0.022, 0.022, 0.022…
$ Emissions.Production.CO2.Other       <dbl> 0.000000e+00, 0.000000e+00, 2.220…
$ Emissions.Production.CO2.Total       <dbl> 1.379, 1.333, 1.282, 1.230, 1.165…
$ `Emissions.Global Share.CO2.Cement`  <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Coal`    <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
$ `Emissions.Global Share.CO2.Gas`     <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Oil`     <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Flaring` <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
$ `Emissions.Global Share.CO2.Total`   <dbl> 0.01, 0.01, 0.01, 0.01, 0.00, 0.0…
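
For the second research question, a change in GDP can be made concrete as a year-over-year percentage change within each country, computed alongside the same change in total CO2 emissions. The sketch below is one way to do this with `lag()`. The column names match `pollution`, but the two countries and all values are made up.

```r
library(dplyr)

# Made-up values for two hypothetical countries, using column names from `pollution`
pollution_toy <- tibble(
  Country.Name = rep(c("A", "B"), each = 3),
  Year = rep(1992:1994, times = 2),
  Country.GDP = c(100, 110, 121, 200, 190, 209),
  Emissions.Production.CO2.Total = c(1.0, 1.1, 1.21, 2.0, 2.0, 1.8)
)

changes <- pollution_toy %>%
  group_by(Country.Name) %>%
  arrange(Year, .by_group = TRUE) %>%   # make sure lag() looks at the previous year
  mutate(
    gdp_pct_change = 100 * (Country.GDP / lag(Country.GDP) - 1),
    co2_pct_change = 100 * (Emissions.Production.CO2.Total /
                              lag(Emissions.Production.CO2.Total) - 1)
  ) %>%
  ungroup()

changes
```

The first year for each country has no previous year, so its changes are `NA`. Plotting `co2_pct_change` against `gdp_pct_change` would show an association, not that GDP changes cause emission changes.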