Data cleaning
We loaded the required packages, read in the pollster ratings data, and selected the columns relevant to our analysis:
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2 ✔ purrr 1.0.0
✔ tibble 3.2.1 ✔ dplyr 1.1.2
✔ tidyr 1.2.1 ✔ stringr 1.5.0
✔ readr 2.1.3 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom 1.0.2 ✔ rsample 1.1.1
✔ dials 1.1.0 ✔ tune 1.1.1
✔ infer 1.0.4 ✔ workflows 1.1.2
✔ modeldata 1.0.1 ✔ workflowsets 1.0.0
✔ parsnip 1.0.3 ✔ yardstick 1.1.0
✔ recipes 1.0.6
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
library(scales)
library(gapminder)
pollster_ratings <- read_csv("data/pollster-ratings.csv")
Rows: 517 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Pollster, AAPOR/Roper, Banned by 538, 538 Grade
dbl (17): Rank, Pollster Rating ID, Polls Analyzed, Predictive Plus-Minus, M...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
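The message above appears because `read_csv()` guessed the column types. As a minimal sketch of declaring them explicitly instead, the example below reads a small made-up inline CSV rather than the real file. The two rows, the pollster names, and the choice of four columns are illustrative assumptions, not the actual data.

```r
library(readr)

# Hypothetical two-row stand-in for data/pollster-ratings.csv;
# I() tells readr to treat the string as literal data, not a file path.
toy_csv <- I("Rank,Pollster,Polls Analyzed,538 Grade\n1,Pollster A,95,A+\n2,Pollster B,53,A+\n")

toy <- read_csv(
  toy_csv,
  col_types = cols(
    Rank = col_double(),
    Pollster = col_character(),
    `Polls Analyzed` = col_double(),
    `538 Grade` = col_character()
  )
)
```

Declaring the types documents what each column is expected to hold. Values that cannot be parsed as the declared type are recorded as parsing problems rather than silently changing the guessed column type.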
pollster_ratings <- pollster_ratings |>
  select(
    Rank,
    Pollster,
    'Pollster Rating ID',
    'Polls Analyzed',
    'Predictive Plus-Minus',
    '538 Grade',
    'Races Called Correctly',
    'Misses Outside MOE',
    'Bias',
    'House Effect',
    'Average Distance from Polling Average (ADPA)',
    'Herding Penalty'
  )
Then we renamed the variables to snake_case and converted each to an appropriate type (numeric, or a factor for the grade) to make the data easier to work with:
# Rebuild the table with snake_case names and explicit types.
# 'Average Distance from Polling Average (ADPA)' is not carried over.
pollster_ratings <- tibble(
  rank = pollster_ratings$Rank,
  pollster = pollster_ratings$Pollster,
  pollster_id = pollster_ratings$'Pollster Rating ID',
  polls_analyzed = as.numeric(pollster_ratings$'Polls Analyzed'),
  predictive_pm = as.numeric(pollster_ratings$'Predictive Plus-Minus'),
  # the grade is categorical, so store it as a factor
  grade = as.factor(pollster_ratings$'538 Grade'),
  called_correct = as.numeric(pollster_ratings$'Races Called Correctly'),
  misses_outside_moe = as.numeric(pollster_ratings$'Misses Outside MOE'),
  bias = as.numeric(pollster_ratings$'Bias'),
  house_effect = as.numeric(pollster_ratings$'House Effect'),
  herding_penalty = as.numeric(pollster_ratings$'Herding Penalty')
)
pollster_ratings
# A tibble: 517 × 11
    rank pollster  pollster_id polls_analyzed predictive_pm grade called_correct
   <dbl> <chr>           <dbl>          <dbl>         <dbl> <fct>          <dbl>
 1     1 Siena Co…         448             95        -1.19  A+             0.747
 2     2 Selzer &…         304             53        -1.18  A+             0.811
 3     3 Research…         280             44        -0.965 A+             0.886
 4     4 SurveyUSA         325            856        -0.917 A+             0.891
 5     5 Marquett…         195             15        -0.908 A/B            0.8
 6     6 Siena Co…         305             62        -0.838 A              0.855
 7     7 AtlasInt…         546             24        -0.762 A              0.729
 8     8 ABC News…           3             82        -0.724 A              0.683
 9     9 Cygnal             67             39        -0.723 A              0.936
10    10 Beacon R…         103             56        -0.721 A              0.741
# ℹ 507 more rows
# ℹ 4 more variables: misses_outside_moe <dbl>, bias <dbl>, house_effect <dbl>,
#   herding_penalty <dbl>
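The renaming and type conversion above can also be written as a single dplyr pipeline, since `select()` accepts `new_name = old_name` pairs. The following is only a sketch of that alternative on a made-up two-row table. `toy_raw`, `toy_clean`, and the pollster names are hypothetical, and only four of the original columns are shown.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in using a few of the original column names
toy_raw <- tibble(
  Rank = c(1, 2),
  Pollster = c("Pollster A", "Pollster B"),
  `Polls Analyzed` = c(95, 53),
  `538 Grade` = c("A+", "A/B")
)

toy_clean <- toy_raw |>
  # select() renames while selecting: new_name = old_name
  select(
    rank = Rank,
    pollster = Pollster,
    polls_analyzed = `Polls Analyzed`,
    grade = `538 Grade`
  ) |>
  # convert the grade to a factor after renaming
  mutate(grade = as.factor(grade))
```

This form keeps column selection, renaming, and type conversion in one place, so a column cannot be selected in one step and then left out of the next by accident.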
write.csv(pollster_ratings, "data/pollster-ratings-clean.csv")
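One thing to be aware of when reading the cleaned file back in: `write.csv()` writes row names by default, so the saved file gains an extra index column. The sketch below illustrates the difference on a made-up two-row data frame written to temporary files. `toy` and the file names are hypothetical and stand in for `pollster_ratings` and the real output path.

```r
# Hypothetical two-row data frame standing in for pollster_ratings
toy <- data.frame(rank = c(1, 2), pollster = c("Pollster A", "Pollster B"))

with_rownames <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

write.csv(toy, with_rownames)                        # default: row names are written
write.csv(toy, without_rownames, row.names = FALSE)  # row names are suppressed

names(read.csv(with_rownames))     # "X" "rank" "pollster"
names(read.csv(without_rownames))  # "rank" "pollster"
```

If the index column is not needed downstream, passing `row.names = FALSE` (or using `readr::write_csv()`, which never writes row names) avoids it.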
Other appendices (as necessary)