project-elegant-buneary

Proposal

library(tidyverse)
library(skimr)

Data 1 - Airlines

Problem or question

  • Identify the problem you will solve or the question you will answer
    • What are the key insights and patterns in flight data in the United States, and how do various factors, such as delays, impact the airline industry and passengers’ travel experiences?
  • Explain why you think this topic is important.
    • From a traveler’s perspective, exploring the dataset can reveal which factors (weather, carrier, security, etc.) affect flight delays, which can inform travelers’ decisions about air travel.

    • Understanding flight delays can lead to improvements in the aviation industry, benefiting both airlines and passengers.

    • It can help identify the airlines with better on-time performance, which might influence travelers’ choices.

  • Identify the types of data/variables you will use.
    • Categorical Variables:
      • Airport Code (Categorical): Identifies airports with IATA codes.

      • Airport Name (Categorical): Full names of airports.

      • Carriers Names (Categorical): Full names of carriers.

    • Quantitative Variables:
      • Month (Quantitative): Represents the month as a numeric value.

      • Year (Quantitative): Represents the year as a 4-digit number.

      • Various statistics related to delays and flight counts, including counts and minutes delayed, flights delayed, flights canceled, flights diverted, flights on time, and total flights. These are all quantitative variables.

  • State the major deliverable(s) you will create to solve this problem/answer this question.
    • For the major deliverable, we will create a Shiny web app that presents our insights from the visualizations, statistics, and key factors affecting delays:
      • Visualizations such as bar charts, line graphs, heatmaps and more to visually represent patterns and trends in flight delays.
      • Summarized tables to gain an overview of flight delays and other relevant metrics.
      • Key factors contributing to flight delays, which could include specific airlines, airports, and types of delays.

Introduction and data

  • Identify the source of the data.

    • The source of the data is the CORGIS Dataset Project, and it was curated by Austin Cory Bart. The dataset was created as part of the CORGIS (Collection of Really Great, Interesting, and Situated Datasets) project.

    • Link: https://think.cs.vt.edu/corgis/csv/airlines/

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The dataset was originally collected on March 27, 2015, by Austin Cory Bart. It was gathered from an official source, the U.S. Department of Transportation’s Bureau of Transportation Statistics (BTS). The BTS is responsible for collecting, analyzing, and publishing transportation data, including aviation data, in the United States.

    • It is also the preeminent source of statistics on commercial aviation, multimodal freight activity, and transportation economics, and provides context to decision makers and the public for understanding statistics on transportation. (Source: https://www.bts.gov/about-BTS)

  • Write a brief description of the observations.

    • Airport Code and Airport Name: Uniquely identify airports and provide their full names.

    • Time Label, Time Month, and Time Year: Time periods in “year/month” format, numeric months, and 4-digit years.

    • Statistics - # of Delays (Carrier, Late Aircraft, National Aviation System, Security, Weather): Number of flight delays attributed to specific factors, such as airline-controlled issues, late aircraft arrivals, national aviation system challenges, security concerns, and adverse weather conditions. These statistics offer a deep understanding of the causes of flight delays.

    • Statistics - Carriers Names and Carriers Total: Carriers that reported data during the specified time and location, including their full names and the total number of carriers.

    • Statistics - Flights (Cancelled, Delayed, Diverted, On Time, Total): These observations offer valuable data about the overall performance of flights, indicating the number of canceled, delayed, diverted, on-time, and total flights during a given month.

    • Statistics - Minutes Delayed (Carrier, Late Aircraft, National Aviation System, Security, Weather): These observations capture the duration of flight delays attributed to specific factors, including airline-controlled issues, late aircraft arrivals, national aviation system conditions, security-related incidents, and severe weather events.

  • Ethical concerns:

    • Analysis of the data should be conducted in a way that avoids perpetuating bias or discrimination based on factors like airline, race, gender, or other demographic attributes.

Glimpse of data

# Load the airlines data
airlines <- read_csv("data/airlines.csv")
Rows: 4408 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Airport.Code, Airport.Name, Time.Label, Time.Month Name, Statistic...
dbl (19): Time.Month, Time.Year, Statistics.# of Delays.Carrier, Statistics....

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(airlines)
Data summary
Name airlines
Number of rows 4408
Number of columns 24
_______________________
Column type frequency:
character 5
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Airport.Code 0 1 3 3 0 29 0
Airport.Name 0 1 23 67 0 29 0
Time.Label 0 1 7 7 0 152 0
Time.Month Name 0 1 3 9 0 12 0
Statistics.Carriers.Names 0 1 68 412 0 841 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time.Month 0 1 6.58 3.46 1 4.00 7.0 10.00 12 ▇▅▅▅▇
Time.Year 0 1 2009.24 3.67 2003 2006.00 2009.0 2012.00 2016 ▇▇▅▇▆
Statistics.# of Delays.Carrier 0 1 574.63 329.62 112 358.00 476.0 692.00 3087 ▇▂▁▁▁
Statistics.# of Delays.Late Aircraft 0 1 789.08 561.80 86 425.00 618.5 959.00 4483 ▇▂▁▁▁
Statistics.# of Delays.National Aviation System 0 1 954.58 921.91 61 399.00 667.5 1166.00 9066 ▇▁▁▁▁
Statistics.# of Delays.Security 0 1 5.58 6.01 -1 2.00 4.0 7.00 94 ▇▁▁▁▁
Statistics.# of Delays.Weather 0 1 78.22 75.18 1 33.00 58.0 98.00 812 ▇▁▁▁▁
Statistics.Carriers.Total 0 1 12.25 2.29 3 11.00 12.0 14.00 18 ▁▂▇▇▁
Statistics.Flights.Cancelled 0 1 213.56 288.87 3 58.00 123.0 250.00 3680 ▇▁▁▁▁
Statistics.Flights.Delayed 0 1 2402.00 1710.95 283 1298.75 1899.0 2950.00 13699 ▇▂▁▁▁
Statistics.Flights.Diverted 0 1 27.88 36.36 0 8.00 15.0 32.00 442 ▇▁▁▁▁
Statistics.Flights.On Time 0 1 9254.42 5337.21 2003 5708.75 7477.0 10991.50 31468 ▇▅▂▁▁
Statistics.Flights.Total 0 1 11897.86 6861.69 2533 7400.00 9739.5 13842.50 38241 ▇▆▂▁▁
Statistics.Minutes Delayed.Carrier 0 1 35021.37 24327.72 6016 19530.75 27782.0 41606.00 220796 ▇▂▁▁▁
Statistics.Minutes Delayed.Late Aircraft 0 1 49410.27 38750.02 5121 25084.25 37483.0 59951.25 345456 ▇▁▁▁▁
Statistics.Minutes Delayed.National Aviation System 0 1 45077.11 57636.75 2183 14389.00 25762.0 50362.00 602479 ▇▁▁▁▁
Statistics.Minutes Delayed.Security 0 1 211.77 257.17 0 65.00 141.0 274.00 4949 ▇▁▁▁▁
Statistics.Minutes Delayed.Total 0 1 135997.54 113972.28 14752 65444.75 100711.0 164294.75 989367 ▇▁▁▁▁
Statistics.Minutes Delayed.Weather 0 1 6276.98 6477.42 46 2310.75 4298.5 7846.00 76770 ▇▁▁▁▁
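
The summary above shows a minimum of -1 for `Statistics.# of Delays.Security`, which cannot be a real count. The following is a minimal sketch of one way to handle this before analysis, and of a delayed-flight share that is comparable across airports of different sizes. The rows in `airlines_toy` are invented for illustration; only the column names come from the dataset. Treating negative counts as missing is our assumption, not something documented by the data source.

```r
library(dplyr)

# Toy rows standing in for data/airlines.csv.
# The values are invented; only the column names match the real data.
airlines_toy <- tibble(
  Airport.Code = c("ATL", "BOS", "ORD"),
  `Statistics.# of Delays.Security` = c(4, -1, 7),
  `Statistics.Flights.Delayed` = c(200, 150, 300),
  `Statistics.Flights.Total` = c(1000, 600, 1500)
)

airlines_checked <- airlines_toy %>%
  mutate(
    # Assumption: a negative count is a data-entry artifact, so mark it missing
    security_delays = if_else(
      `Statistics.# of Delays.Security` < 0,
      NA_real_,
      `Statistics.# of Delays.Security`
    ),
    # Share of flights delayed, so large and small airports can be compared
    delayed_share = `Statistics.Flights.Delayed` / `Statistics.Flights.Total`
  )

airlines_checked
```

The same `mutate()` step could be applied to the full `airlines` data frame once it is loaded.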

Data 2 - Fatal Police shootings

Problem or question

  • Identify the problem you will solve or the question you will answer
    • What are the key insights and patterns in fatal shootings by on-duty police officers in the United States from 2015 to the present?
  • Explain why you think this topic is important.
    • Exploring the dataset can show how many fatal shootings by on-duty police officers occur in each state and which states have higher rates. This could help governments take better steps in law enforcement and public safety so that civilians feel safe in their own state.

    • Understanding the circumstances of these shootings, such as whether the person shot showed signs of mental illness or whether other factors were involved.

    • It would also show how many incidents were recorded on body cameras, which could help governments enforce better rules and regulations.

    • Understanding the race and age of the people shot would offer deeper insight into how the demographics of those killed have changed over time.

  • Identify the types of data/variables you will use.
    • Categorical Variables:
      • Signs of mental illness (Categorical): Whether the individual showed signs of mental illness during the incident (True/False).

      • Body camera (Categorical): Whether a body camera was used during the incident (True/False).

      • State (Categorical): Different states in the United States.

      • Various other categorical variables such as gender, race, armed, threat level, and flee would be useful in the project.

    • Quantitative Variables:
      • Date (type date): Represents the date of the incident in the format “yyyy-mm-dd”.

      • Age (type numeric): Represents the age of the individual who was shot.

  • State the major deliverable(s) you will create to solve this problem/answer this question.
    • For the major deliverable, we will create a Shiny web app that presents our insights from the visualizations, statistics, and correlations between various variables:
      • A data analysis report to provide a comprehensive analysis of the dataset, including summary statistics, trends, and patterns.
      • Visualizations such as bar charts, line graphs, heatmaps and more to visually represent patterns and trends in the number of police shootings in states.
      • Statistical models to analyze data, such as regression models to assess the impact of variables like age, mental illness, or body camera usage on the number of shootings.

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The original source of the data is The Washington Post, which has been tracking fatal police shootings in the United States since 2015. The data includes information about civilians shot and killed by on-duty police officers. The Washington Post has compiled this dataset using various sources, including news and police reports, social media, and databases like “Killed by Police” and “Fatal Encounters.” (Source: https://github.com/washingtonpost/data-police-shootings)
  • Write a brief description of the observations.

    • ID: A unique identifier for each incident or case.

    • Name: The name of the individual involved in the incident.

    • Date: The date when the incident occurred.

    • Manner of Death: Describes how the individual died, such as being shot, shot and Tasered, or other specific circumstances.

    • Armed: Indicates whether the individual was armed during the incident and, if so, with what type of weapon (e.g., gun, toy weapon, unarmed).

    • Age: The age of the individual at the time of the incident.

    • Gender: The gender of the individual (e.g., Male, Female).

    • Race: The racial background of the individual (e.g., White, Hispanic, Black).

    • City: The city where the incident took place.

    • State: The U.S. state where the incident occurred.

    • Signs of Mental Illness: Indicates whether there were signs of mental illness or distress in the individual during the incident (e.g., True, False).

    • Threat Level: Describes the perceived threat level of the individual during the incident (e.g., attack, other).

    • Flee: Specifies whether the individual was fleeing the scene during the incident and, if so, the manner of fleeing (e.g., Not fleeing).

    • Body Camera: Indicates whether body cameras were used during the incident (e.g., True, False).

    • Longitude and Latitude: The geographical coordinates of the incident location.

    • Is Geocoding Exact: Specifies whether the geocoding (mapping) of the location is exact (e.g., True).

  • Ethical concerns:

    • The data includes information about individuals who lost their lives, which is sensitive and may impact the privacy and dignity of the deceased and their families. Ethical considerations should be taken into account when using and sharing this data.

Glimpse of data

# Load the fatal police shootings data
shootings <- read_csv("data/fatal-police-shootings-data.csv")
Rows: 8002 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): name, manner_of_death, armed, gender, race, city, state, threat_le...
dbl  (4): id, age, longitude, latitude
lgl  (3): signs_of_mental_illness, body_camera, is_geocoding_exact
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(shootings)
Data summary
Name shootings
Number of rows 8002
Number of columns 17
_______________________
Column type frequency:
character 9
Date 1
logical 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 454 0.94 6 33 0 7511 0
manner_of_death 0 1.00 4 16 0 2 0
armed 211 0.97 2 32 0 106 0
gender 31 1.00 1 1 0 2 0
race 1517 0.81 1 1 0 6 0
city 0 1.00 3 30 0 3215 0
state 0 1.00 2 2 0 51 0
threat_level 0 1.00 5 12 0 3 0
flee 966 0.88 3 11 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-01-02 2022-12-01 2019-01-25 2702

Variable type: logical

skim_variable n_missing complete_rate mean count
signs_of_mental_illness 0 1 0.21 FAL: 6331, TRU: 1671
body_camera 0 1 0.14 FAL: 6865, TRU: 1137
is_geocoding_exact 0 1 1.00 TRU: 7984, FAL: 18

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 4415.43 2497.15 3.00 2240.25 4445.50 6579.75 8696.00 ▇▇▇▇▇
age 503 0.94 37.21 12.98 2.00 27.00 35.00 45.00 92.00 ▁▇▅▁▁
longitude 840 0.90 -97.04 16.52 -160.01 -112.03 -94.32 -83.15 -67.87 ▁▁▇▇▇
latitude 840 0.90 36.68 5.38 19.50 33.48 36.10 40.03 71.30 ▁▇▃▁▁
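
The summary above shows that several columns have missing values (for example, `race` is missing in 1517 rows), so any per-state comparison should report missingness alongside counts. The following is a minimal sketch of such a per-state summary. The rows in `shootings_toy` are invented for illustration; only the column names come from the dataset.

```r
library(dplyr)

# Toy rows standing in for data/fatal-police-shootings-data.csv.
# The values are invented; only the column names match the real data.
shootings_toy <- tibble(
  state = c("CA", "CA", "TX", "TX", "TX"),
  body_camera = c(TRUE, FALSE, FALSE, FALSE, TRUE),
  race = c("W", NA, "B", "H", NA)
)

by_state <- shootings_toy %>%
  group_by(state) %>%
  summarise(
    incidents = n(),                       # number of recorded incidents
    body_camera_share = mean(body_camera), # proportion with a body camera
    race_missing = sum(is.na(race)),       # rows where race is not recorded
    .groups = "drop"
  )

by_state
```

Raw incident counts are not rates. Comparing states fairly would also require state population figures, which are not part of this dataset.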

Data 3 - Leading Causes of Death: United States

Problem or question

  • Identify the problem you will solve or the question you will answer
    • What are the key insights and patterns in causes of death in the United States, and how are age and cause of death correlated?
  • Explain why you think this topic is important.
    • Exploring the dataset can give insight into how many people die from each cause, which would help the government take better action to reduce the number of those cases in the future.

    • Understanding the relationship between age and cause of death would give better insight into how certain age groups should be treated when it comes to certain diseases or illnesses.

    • It can help modify drugs and medications with respect to the age group and the type of illness for better effectiveness.

  • Identify the types of data/variables you will use.
    • Categorical Variables:
      • 113 Cause Name (Categorical): Specific type of disease the person had.

      • Cause Name (Categorical): A generic name for the cause of death.

      • State (Categorical): Names of States in the United States.

    • Quantitative Variables:
      • Year (Type Number): Represents the year as a 4-digit number.

      • Deaths (Type Number): Represents the number of deaths as a numeric value.

      • Age-adjusted death rate (Type Number): Represents the number of deaths per 100,000 population, adjusted for differences in age distribution.

  • State the major deliverable(s) you will create to solve this problem/answer this question.
    • For the major deliverable, we will create a Shiny web app that presents our insights from the visualizations and statistics on causes of death over time:
      • Prepare a comprehensive report summarizing the analysis of the dataset, including statistics, trends, and patterns in causes of death over time.
      • Create various data visualizations, including bar charts, line graphs, pie charts, and heatmaps to illustrate key findings.
      • Perform correlation analysis between age and cause of death to identify any significant relationships or associations. Visualize these correlations using scatter plots or correlation matrices.
      • Key factors contributing to causes of deaths.

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The source of the data is the CDC. It is specifically provided by the National Center for Health Statistics (NCHS), the branch of the CDC responsible for collecting and disseminating vital health statistics, and it is made available as part of the National Vital Statistics System.
  • Write a brief description of the observations.

    • Year: The year in which the data is recorded.

    • 113 Cause Name: A detailed cause name, including specific codes (e.g., V01-X59, Y85-Y86, G30), which are part of the International Classification of Diseases.

    • Cause Name: A more general name for the cause of death.

    • State: The U.S. state where the data is recorded.

    • Deaths: The number of recorded deaths due to the specified cause in the given state and year.

    • Age-adjusted Death Rate: The death rate per 100,000 population, adjusted for age. It represents the rate of deaths due to the specified cause in the population, considering age distribution.

  • Ethical concerns:

    • Ethical concerns arise if the data used for analysis is inaccurate or not validated. Misinterpretation of such data can lead to incorrect conclusions with potential ethical implications.

Glimpse of data

# Load the leading causes of death data
deaths <- read_csv("data/NCHS_-_Leading_Causes_of_Death__United_States_20231012.csv")
Rows: 10868 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): 113 Cause Name, Cause Name, State
dbl (3): Year, Deaths, Age-adjusted Death Rate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(deaths)
Data summary
Name deaths
Number of rows 10868
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
113 Cause Name 0 1 10 69 0 11 0
Cause Name 0 1 4 23 0 11 0
State 0 1 4 20 0 52 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 2008.00 5.48 1999.0 2003.0 2008.0 2013.00 2017.0 ▇▇▆▇▇
Deaths 0 1 15459.91 112876.02 21.0 612.0 1718.5 5756.50 2813503.0 ▇▁▁▁▁
Age-adjusted Death Rate 0 1 127.56 223.64 2.6 19.2 35.9 151.72 1087.3 ▇▁▁▁▁
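
The summary above reports 52 unique values of `State` and 11 unique values of `Cause Name`, which suggests that the data mixes state-level rows with aggregate rows. The following is a minimal sketch of removing such aggregates before summing deaths, so that totals are not double counted. It assumes that the aggregates are labeled "United States" and "All causes"; those labels, and every value in `deaths_toy`, are assumptions for illustration and should be checked against the real file (for example with `distinct(deaths, State)`).

```r
library(dplyr)

# Toy rows standing in for the NCHS file.
# The values and the aggregate labels are assumptions; only the column names
# match the real data.
deaths_toy <- tibble(
  Year = c(2017, 2017, 2017, 2017),
  `Cause Name` = c("All causes", "Heart disease", "Heart disease", "Cancer"),
  State = c("United States", "United States", "Alabama", "Alabama"),
  Deaths = c(1000, 300, 20, 15)
)

# Keep only state-level rows for specific causes
state_causes <- deaths_toy %>%
  filter(State != "United States", `Cause Name` != "All causes")

state_causes
```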