project-elegant-buneary

Exploration

Objective(s)

State the question(s) you are answering or the problem(s) you are solving clearly.

Data collection and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.

library(tidyverse)
library(skimr)
library(lubridate)
library(scales)
# load data and skim
airlines <- read_csv("data/airlines.csv")
Rows: 4408 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (5): Airport.Code, Airport.Name, Time.Label, Time.Month Name, Statistic...
dbl (19): Time.Month, Time.Year, Statistics.# of Delays.Carrier, Statistics....

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(airlines)
Data summary
Name airlines
Number of rows 4408
Number of columns 24
_______________________
Column type frequency:
character 5
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Airport.Code 0 1 3 3 0 29 0
Airport.Name 0 1 23 67 0 29 0
Time.Label 0 1 7 7 0 152 0
Time.Month Name 0 1 3 9 0 12 0
Statistics.Carriers.Names 0 1 68 412 0 841 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Time.Month 0 1 6.58 3.46 1 4.00 7.0 10.00 12 ▇▅▅▅▇
Time.Year 0 1 2009.24 3.67 2003 2006.00 2009.0 2012.00 2016 ▇▇▅▇▆
Statistics.# of Delays.Carrier 0 1 574.63 329.62 112 358.00 476.0 692.00 3087 ▇▂▁▁▁
Statistics.# of Delays.Late Aircraft 0 1 789.08 561.80 86 425.00 618.5 959.00 4483 ▇▂▁▁▁
Statistics.# of Delays.National Aviation System 0 1 954.58 921.91 61 399.00 667.5 1166.00 9066 ▇▁▁▁▁
Statistics.# of Delays.Security 0 1 5.58 6.01 -1 2.00 4.0 7.00 94 ▇▁▁▁▁
Statistics.# of Delays.Weather 0 1 78.22 75.18 1 33.00 58.0 98.00 812 ▇▁▁▁▁
Statistics.Carriers.Total 0 1 12.25 2.29 3 11.00 12.0 14.00 18 ▁▂▇▇▁
Statistics.Flights.Cancelled 0 1 213.56 288.87 3 58.00 123.0 250.00 3680 ▇▁▁▁▁
Statistics.Flights.Delayed 0 1 2402.00 1710.95 283 1298.75 1899.0 2950.00 13699 ▇▂▁▁▁
Statistics.Flights.Diverted 0 1 27.88 36.36 0 8.00 15.0 32.00 442 ▇▁▁▁▁
Statistics.Flights.On Time 0 1 9254.42 5337.21 2003 5708.75 7477.0 10991.50 31468 ▇▅▂▁▁
Statistics.Flights.Total 0 1 11897.86 6861.69 2533 7400.00 9739.5 13842.50 38241 ▇▆▂▁▁
Statistics.Minutes Delayed.Carrier 0 1 35021.37 24327.72 6016 19530.75 27782.0 41606.00 220796 ▇▂▁▁▁
Statistics.Minutes Delayed.Late Aircraft 0 1 49410.27 38750.02 5121 25084.25 37483.0 59951.25 345456 ▇▁▁▁▁
Statistics.Minutes Delayed.National Aviation System 0 1 45077.11 57636.75 2183 14389.00 25762.0 50362.00 602479 ▇▁▁▁▁
Statistics.Minutes Delayed.Security 0 1 211.77 257.17 0 65.00 141.0 274.00 4949 ▇▁▁▁▁
Statistics.Minutes Delayed.Total 0 1 135997.54 113972.28 14752 65444.75 100711.0 164294.75 989367 ▇▁▁▁▁
Statistics.Minutes Delayed.Weather 0 1 6276.98 6477.42 46 2310.75 4298.5 7846.00 76770 ▇▁▁▁▁
# fix column names by using saplly()
snake_case <- function(x) {
  x <- gsub(" ", "_", x)
  x <- gsub("#", "_", x)
  x <- gsub("\\.", "_", x)
  x <- gsub("___", "_", x)
  x <- gsub("__", "_", x)
  x <- tolower(x)
  return(x)
}

colnames(airlines) <- sapply(colnames(airlines), snake_case)

# remove "statistics_" from column names
colnames(airlines) <- gsub("statistics_of_", "", colnames(airlines))
colnames(airlines) <- gsub("statistics_", "", colnames(airlines))
airlines_clean <- airlines |>
  separate_rows(carriers_names, sep = ",") |>
  rename(airline = carriers_names) |>
  separate(
    col = airport_name,
    into = c("state", "country", "airport_name"),
    sep = "[,:]"
  ) |>
  rename(
    month_num = time_month,
    month_name = time_month_name,
    year = time_year
  ) 

delay_reason <- airlines |>
  select(starts_with("delays_")) 

airlien_clean <- airlines_clean |>
  mutate(
    month_num = as.numeric(month_num),
    year = as.numeric(year)
  )
# save cleaned data
write_csv(airlines_clean, "data/airlines_cleaned.csv")

Data description

The dataset used for the analysis of flight data in the United States is a comprehensive collection of statistics related to flights, delays, carriers, and airports. This analysis-ready dataset was originally collected from the U.S. Department of Transportation’s Bureau of Transportation Statistics (BTS) and is part of the CORGIS (Collection of Really Great, Interesting, and Situated Datasets) project. The dataset provides valuable insights into the aviation industry and its impact on passengers’ travel experiences.

The dataset used for the analysis of flight data in the United States has undergone a series of cleaning and preprocessing steps to ensure it is analysis-ready. We did cleaning by standardizing column names to snake_case and reshaping the data for easier analysis, and we also reshaped the data by using the pivot_longer() function to make it easier to analyze. For the airport name, we are separating from one columns to three, for example, for “Atlanta, GA: Hartsfield-Jackson Atlanta International” the result will be “Atlanta”, “GA”, and “Hartsfield-Jackson Atlanta International”. There were no NA values in the dataset, so we did not need to remove any rows or columns. The dataset is now ready for analysis.

Data limitations

Identify any potential problems with your dataset.

  • The dataset could be inaccurate. The accuracy of the data is paramount to the success of the project. Any inaccuracies in the dataset, such as incorrect flight statistics or airport information, could lead to incorrect conclusions.
  • There will also be temporal limitations. The dataset only covers data from June 2003 to Jan 2016, and this temporal limitation may affect the generalizability of findings, especially when analyzing contemporary trends.

Exploratory data analysis

Perform an (initial) exploratory data analysis.

library(ggplot2)
airlines_clean <- read_csv("data/airlines_cleaned.csv")
Rows: 54013 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): airport_code, state, country, airport_name, time_label, month_name...
dbl (19): month_num, year, delays_carrier, delays_late_aircraft, delays_nati...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(airlines_clean)
Rows: 54,013
Columns: 26
$ airport_code                             <chr> "ATL", "ATL", "ATL", "ATL", "…
$ state                                    <chr> "Atlanta", "Atlanta", "Atlant…
$ country                                  <chr> "GA", "GA", "GA", "GA", "GA",…
$ airport_name                             <chr> "Hartsfield-Jackson Atlanta I…
$ time_label                               <chr> "2003/06", "2003/06", "2003/0…
$ month_num                                <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ month_name                               <chr> "June", "June", "June", "June…
$ year                                     <dbl> 2003, 2003, 2003, 2003, 2003,…
$ delays_carrier                           <dbl> 1009, 1009, 1009, 1009, 1009,…
$ delays_late_aircraft                     <dbl> 1275, 1275, 1275, 1275, 1275,…
$ delays_national_aviation_system          <dbl> 3217, 3217, 3217, 3217, 3217,…
$ delays_security                          <dbl> 17, 17, 17, 17, 17, 17, 17, 1…
$ delays_weather                           <dbl> 328, 328, 328, 328, 328, 328,…
$ airline                                  <chr> "American Airlines Inc.", "Je…
$ carriers_total                           <dbl> 11, 11, 11, 11, 11, 11, 11, 1…
$ flights_cancelled                        <dbl> 216, 216, 216, 216, 216, 216,…
$ flights_delayed                          <dbl> 5843, 5843, 5843, 5843, 5843,…
$ flights_diverted                         <dbl> 27, 27, 27, 27, 27, 27, 27, 2…
$ flights_on_time                          <dbl> 23974, 23974, 23974, 23974, 2…
$ flights_total                            <dbl> 30060, 30060, 30060, 30060, 3…
$ minutes_delayed_carrier                  <dbl> 61606, 61606, 61606, 61606, 6…
$ minutes_delayed_late_aircraft            <dbl> 68335, 68335, 68335, 68335, 6…
$ minutes_delayed_national_aviation_system <dbl> 118831, 118831, 118831, 11883…
$ minutes_delayed_security                 <dbl> 518, 518, 518, 518, 518, 518,…
$ minutes_delayed_total                    <dbl> 268764, 268764, 268764, 26876…
$ minutes_delayed_weather                  <dbl> 19474, 19474, 19474, 19474, 1…
carrier_summary <- airlines_clean |>
  group_by(airline) |>
  summarize(total_hours_delayed = sum(minutes_delayed_total) / 60, .groups = "drop") |>
  ungroup()

# Create a bar chart with carriers sorted from smallest to largest total hours delayed
ggplot(carrier_summary, aes(x = reorder(airline, total_hours_delayed), y = total_hours_delayed)) +
  geom_bar(stat = "identity", color = "#6F8FAF", fill = "#A7C7E7") +
  labs(
    title = "Total Hours Delayed by Carrier (Smallest to Largest)",
    x = "Carrier",
    y = "Total Hours Delayed"
  ) +
  theme_minimal() +
  coord_flip() +
  scale_y_continuous(labels = scales::comma)

airlines_clean |>
  filter(airline == "American Airlines Inc.") |>
  group_by(year) |>
  summarise(percentage_delayed = (sum(flights_delayed) / sum(flights_on_time))) |>
  drop_na() |>
  ggplot(mapping = aes(x = year, y = percentage_delayed)) +
  geom_line(aes(color = "pink")) +
  geom_point(aes(color = "turquoise"), size = 2.5) +
  labs(
    x = "Year",
    y = "Percentage of Flights Delayed",
    title = "Percentage of Flights Delayed for American Airlines Over the Years"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_continuous(labels = scales::percent)

filtered_df <- airlines_clean |> 
  filter(year >= 2010 & year <= 2015) |>
  group_by(airline) |>
  summarise(
    average_delay = mean(delays_late_aircraft) / mean(flights_total)
  ) |>
  drop_na() |>
  print()
# A tibble: 22 × 2
   airline                      average_delay
   <chr>                                <dbl>
 1 AirTran Airways Corporation         0.0660
 2 Alaska Airlines Inc.                0.0685
 3 American Airlines Inc.              0.0670
 4 American Eagle Airlines Inc.        0.0658
 5 Atlantic Southeast Airlines         0.0620
 6 Comair Inc.                         0.0611
 7 Continental Air Lines Inc.          0.0638
 8 Delta Air Lines Inc.                0.0673
 9 Endeavor Air Inc.                   0.0675
10 Envoy Air                           0.0693
# ℹ 12 more rows
airline_plot_3 <- ggplot(
  data = filtered_df,
  aes(x = reorder(airline, average_delay), y = average_delay)
) +
  geom_bar(stat = "identity", fill = "#A7C7E7", color = "#6F8FAF") +
  labs(
    x = "Airline",
    y = "Average Percentage of Flights Delayed",
    title = "Average Percentage of Flights Delayed per Airline"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1))
airline_plot_3

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

  • Is the cleaned dataset analysis-ready?
  • Are the visualizations clear and easy to understand?
  • Are there any other visualizations that would be helpful to include? (e.g., a map of the United States showing the number of flights delayed by state, or a drop-down menu to select a specific carrier and see the percentage of flights delayed over time) We will use shiny app for the project by the way.