Factors that appear to influence card issuance and final result in EPL matches

Appendix to report

Data cleaning

# A tibble: 8,020 × 12
   Div   Date     HomeTeam  AwayTeam  FTHG  FTAG FTR   Referee    HY    AY    HR
   <chr> <chr>    <chr>     <chr>    <int> <int> <chr> <chr>   <int> <int> <int>
 1 E0    19/08/00 Charlton  Man City     4     0 H     Rob Ha…     1     2     0
 2 E0    19/08/00 Chelsea   West Ham     4     2 H     Graham…     1     2     0
 3 E0    19/08/00 Coventry  Middles…     1     3 A     Barry …     5     3     1
 4 E0    19/08/00 Derby     Southam…     2     2 D     Andy D…     1     1     0
 5 E0    19/08/00 Leeds     Everton      2     0 H     Dermot…     1     3     0
 6 E0    19/08/00 Leicester Aston V…     0     0 D     Mike R…     2     3     0
 7 E0    19/08/00 Liverpool Bradford     1     0 H     Paul D…     1     1     0
 8 E0    19/08/00 Sunderla… Arsenal      1     0 H     Steve …     3     1     0
 9 E0    19/08/00 Tottenham Ipswich      3     1 H     Alan W…     0     0     0
10 E0    20/08/00 Man Unit… Newcast…     2     0 H     Steve …     0     1     0
# ℹ 8,010 more rows
# ℹ 1 more variable: AR <int>

We cleaned the raw match files to build a single dataset of selected variables for English Premier League (EPL) matches from the 2000-01 to the 2021-22 seasons. The result, printed above, has 8,020 rows and 12 columns and is the basis for the rest of the analysis. The steps were as follows.

  1. First, we imported 22 CSV files that contain data for the EPL from the 2000-01 to the 2021-22 seasons.

  2. Second, we kept only the 12 columns relevant to our analysis: Div, Date, HomeTeam, AwayTeam, FTHG (Full Time Home Goals), FTAG (Full Time Away Goals), FTR (Full Time Result), Referee, HY (Home Yellow Cards), AY (Away Yellow Cards), HR (Home Red Cards), and AR (Away Red Cards).

  3. Third, we stacked the selected columns from each season row-wise into one dataset using the rbind() function.

  4. Fourth, we saved the resulting combined dataset to a new CSV file named “epl.csv”.

  5. Finally, we printed the combined dataset as a tibble, using the tibble() function, for reference; this is the output shown at the start of this section.
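
The five steps above can be sketched in R roughly as follows. This is a minimal illustration, not the exact script used for the report: the helper names read_season() and combine_seasons(), the temporary directory, the file names season1.csv and season2.csv, and the toy rows are all assumptions made so the example runs on its own. The sketch uses tibble::as_tibble() for the final display if the tibble package is installed, and falls back to head() otherwise.

```r
# The 12 columns kept in step 2.
keep_cols <- c("Div", "Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG",
               "FTR", "Referee", "HY", "AY", "HR", "AR")

# Hypothetical helper: read one season file and keep only the columns above.
read_season <- function(path) {
  season <- read.csv(path, stringsAsFactors = FALSE)
  missing <- setdiff(keep_cols, names(season))
  if (length(missing) > 0) {
    stop("Missing columns in ", path, ": ", paste(missing, collapse = ", "))
  }
  season[, keep_cols]
}

# Hypothetical helper: steps 1 to 4 (import, select, rbind, save).
combine_seasons <- function(paths, out_file) {
  seasons <- lapply(paths, read_season)   # steps 1 and 2
  epl <- do.call(rbind, seasons)          # step 3: stack row-wise
  write.csv(epl, out_file, row.names = FALSE)  # step 4
  epl
}

# Demo on two tiny made-up season files (NOT real match data).
demo_dir <- tempfile("epl_demo_")
dir.create(demo_dir)
toy <- data.frame(
  Div = "E0", Date = c("01/01/01", "02/01/01"),
  HomeTeam = c("TeamA", "TeamB"), AwayTeam = c("TeamC", "TeamD"),
  FTHG = c(1L, 0L), FTAG = c(0L, 0L), FTR = c("H", "D"),
  Referee = c("Ref One", "Ref Two"),
  HY = c(1L, 2L), AY = c(0L, 1L), HR = c(0L, 0L), AR = c(0L, 1L),
  Extra = c("x", "y"),                    # an unused column, dropped in step 2
  stringsAsFactors = FALSE
)
write.csv(toy, file.path(demo_dir, "season1.csv"), row.names = FALSE)
write.csv(toy, file.path(demo_dir, "season2.csv"), row.names = FALSE)

paths <- list.files(demo_dir, pattern = "^season.*\\.csv$", full.names = TRUE)
epl <- combine_seasons(paths, file.path(demo_dir, "epl.csv"))

# Step 5: display the combined data.
if (requireNamespace("tibble", quietly = TRUE)) {
  print(tibble::as_tibble(epl))
} else {
  print(head(epl))
}
```

With the real data, paths would instead point at the 22 season CSV files. Checking for missing columns before selecting makes a file with a different layout fail loudly rather than silently producing a misaligned rbind().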