Polling Accuracy

FiveThirtyEight pollster analysis

Author

Elegant Starmie
Lourenço Belleau-Bhowmik, Tyler Fouch, Max Ma, Ian So, Andrew Song

Published

May 5, 2023

Topic and motivation

  • Polls are a critical part of election cycles, but their accuracy can vary widely.
  • Understanding the factors that contribute to accurate polling can improve our ability to make informed decisions.
  • We analyzed the Pollster Ratings dataset from FiveThirtyEight.
  • Our research question: Is there a relationship between the number of a pollster's polls that FiveThirtyEight analyzed and the accuracy of that pollster's polls?
  • Ultimately, our findings could help inform pollsters of best practices that can improve the reliability of polling data.

Introduction to Data:

# A tibble: 6 × 21
   Rank Pollster             `Pollster Rating ID` `Polls Analyzed` `AAPOR/Roper`
  <dbl> <chr>                               <dbl>            <dbl> <chr>        
1     1 Siena College/The N…                  448               95 yes          
2     2 Selzer & Co.                          304               53 yes          
3     3 Research & Polling …                  280               44 no           
4     4 SurveyUSA                             325              856 no           
5     5 Marquette Universit…                  195               15 yes          
6     6 Siena College                         305               62 yes          
# ℹ 16 more variables: `Banned by 538` <chr>, `Predictive Plus-Minus` <dbl>,
#   `538 Grade` <chr>, `Mean-Reverted Bias` <dbl>,
#   `Races Called Correctly` <dbl>, `Misses Outside MOE` <dbl>,
#   `Simple Average Error` <dbl>, `Simple Expected Error` <dbl>,
#   `Simple Plus-Minus` <dbl>, `Advanced Plus-Minus` <dbl>,
#   `Mean-Reverted Advanced Plus-Minus` <dbl>,
#   `# of Polls for Bias Analysis` <dbl>, Bias <dbl>, `House Effect` <dbl>, …
# A tibble: 6 × 21
   Rank Pollster             `Pollster Rating ID` `Polls Analyzed` `AAPOR/Roper`
  <dbl> <chr>                               <dbl>            <dbl> <chr>        
1   512 Indiana University-…                  150                3 no           
2   513 Long Island Univers…                  633                4 no           
3   514 Nielson Brothers Po…                  231               11 no           
4   515 Ascend Action                         726                6 no           
5   516 Jayhawk Consulting …                  157                2 no           
6   517 Fort Hays State Uni…                  101                6 no           
# ℹ 16 more variables: `Banned by 538` <chr>, `Predictive Plus-Minus` <dbl>,
#   `538 Grade` <chr>, `Mean-Reverted Bias` <dbl>,
#   `Races Called Correctly` <dbl>, `Misses Outside MOE` <dbl>,
#   `Simple Average Error` <dbl>, `Simple Expected Error` <dbl>,
#   `Simple Plus-Minus` <dbl>, `Advanced Plus-Minus` <dbl>,
#   `Mean-Reverted Advanced Plus-Minus` <dbl>,
#   `# of Polls for Bias Analysis` <dbl>, Bias <dbl>, `House Effect` <dbl>, …
 Polls Analyzed   538 Grade              Bias        
 Min.   :  1.0   Length:517         Min.   :-23.977  
 1st Qu.:  2.0   Class :character   1st Qu.: -2.030  
 Median :  4.0   Mode  :character   Median :  1.032  
 Mean   : 22.2                      Mean   :  1.051  
 3rd Qu.: 13.0                      3rd Qu.:  4.361  
 Max.   :856.0                      Max.   : 37.613  
                                    NA's   :44       
The pollster data provided by 538 contains 21 columns (variables) describing 517 individual pollsters (rows). From this data, our group chose to focus on the number of each pollster's polls that 538 analyzed (Polls Analyzed) and its relationship with the pollster's estimated bias (Bias) and the grade 538 assigned to the pollster (538 Grade). Bias is missing for 44 of the 517 pollsters.

Highlights from EDA

[Scatterplot omitted; 44 rows with missing values were excluded from the plot.]

Inference/modeling/other analysis

[Figures omitted: a scatterplot with a fitted trend line (formula y ~ x) and a histogram with 30 bins; 7 rows with missing or non-finite values were excluded.]

Inference/modeling/other analysis

\[ \widehat{numeric~grade} = 5.71 + 0.013 \times polls~analyzed - 0.11 \times abs~bias - 0.049 \times herding~penalty \]

Simple model, with polls_analyzed as the only predictor (7 rows with missing values removed):
# A tibble: 2 × 2
  term           estimate
  <chr>             <dbl>
1 intercept        5.08  
2 polls_analyzed   0.0183
Full model, with polls_analyzed, abs_bias, and Herding Penalty as predictors (51 rows with missing values removed):
# A tibble: 4 × 2
  term              estimate
  <chr>                <dbl>
1 intercept           5.71  
2 polls_analyzed      0.0129
3 abs_bias           -0.107 
4 `Herding Penalty`  -0.0487
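
To make the fitted equation concrete, here is a minimal Python sketch that plugs values into the full-model coefficients from the table above. The function name `predict_numeric_grade` and the example inputs are hypothetical. It assumes that abs_bias is the absolute value of Bias, as its name suggests. The mapping from 538 letter grades to the numeric grade scale is not shown in this report, so the result is only meaningful on that scale.

```python
# Coefficients copied from the full-model table above.
COEFFICIENTS = {
    "intercept": 5.71,
    "polls_analyzed": 0.0129,
    "abs_bias": -0.107,
    "herding_penalty": -0.0487,
}


def predict_numeric_grade(polls_analyzed, bias, herding_penalty):
    """Return the fitted numeric grade for one pollster.

    Assumption: the abs_bias term is the absolute value of Bias,
    so the sign of `bias` does not affect the prediction.
    """
    return (
        COEFFICIENTS["intercept"]
        + COEFFICIENTS["polls_analyzed"] * polls_analyzed
        + COEFFICIENTS["abs_bias"] * abs(bias)
        + COEFFICIENTS["herding_penalty"] * herding_penalty
    )


# Hypothetical pollster: 50 polls analyzed, bias of -2.0, no herding penalty.
example = predict_numeric_grade(polls_analyzed=50, bias=-2.0, herding_penalty=0.0)
print(round(example, 3))
```

Read this way, the polls_analyzed coefficient means that 100 additional analyzed polls raise the fitted numeric grade by about 1.29, holding bias and herding penalty fixed.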

Conclusions + future work

  • We are 95% confident that the association between polls_analyzed and 538 Grade is positive, and polls_analyzed appears to be the most influential of the predictors we considered.
  • Our analysis indicates that the number of a pollster's polls analyzed by FiveThirtyEight is in fact related to the accuracy of that pollster's polls. However, this relationship is weaker than we had originally expected.
  • Nevertheless, our conclusion points to an interesting trend that pollsters and FiveThirtyEight can take into account in future polls and political analysis.
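
This report does not show how the 95% confidence statement above was computed. One common approach is a percentile bootstrap interval for the regression slope, sketched below in Python. This is an illustration under stated assumptions, not necessarily the authors' method: the function names are hypothetical, and the data are synthetic values generated only for the example, not the FiveThirtyEight ratings.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, reps=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for the slope (rows resampled with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # slope is undefined when all resampled x values are equal
        by = [ys[i] for i in idx]
        slopes.append(ols_slope(bx, by))
    slopes.sort()
    alpha = (1 - level) / 2
    lower = slopes[int(alpha * reps)]
    upper = slopes[int((1 - alpha) * reps) - 1]
    return lower, upper


# Synthetic data (NOT the real ratings): grade rises slightly with poll count.
data_rng = random.Random(0)
polls = [data_rng.randint(1, 100) for _ in range(60)]
grade = [5.0 + 0.02 * p + data_rng.gauss(0, 0.5) for p in polls]

low, high = bootstrap_slope_ci(polls, grade)
print(f"95% bootstrap CI for slope: ({low:.4f}, {high:.4f})")
```

If the whole interval lies above zero, as it does for this synthetic example, the data support a positive slope at the chosen confidence level. An interval that is close to zero would match the report's observation that the relationship is weak.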