Elegant Starmie

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.

FiveThirtyEight

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

pollster-ratings.csv has been continually updated by FiveThirtyEight over the past year (2022 to 2023). FiveThirtyEight rates each pollster on the accuracy of the polls it has released, which it measures by comparing those polls to real-life election results.

Write a brief description of the observations.

Each observation is a pollster. The variables recorded for each pollster include the number of its polls that were analyzed, the grade it received from FiveThirtyEight, the magnitude and direction of its bias, and the proportion of races it called correctly.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is there a relationship between the number of a pollster's polls that were analyzed and the accuracy of those polls?
What effect does the proportion of races called correctly have on the grade given to each pollster by FiveThirtyEight?

A description of the research topic along with a concise statement of your hypotheses on this topic.

The topic for the first research question is whether pollsters may have biased their own ratings by being selective about the data they released. Hypothesis: Some pollsters may have simply not released polls that they deemed inaccurate and potentially harmful to their reputation, which may have artificially boosted their accuracy numbers.
The topic for the second research question is how calling races relates to a pollster’s accuracy. Hypothesis: There isn’t a strong relationship between the proportion of races called correctly and a pollster’s accuracy, because some pollsters may not bother calling/predicting races that have an obvious outcome.

Identify the types of variables in your research question. Categorical? Quantitative?

Number of polls analyzed: “Polls Analyzed” (quantitative); accuracy of said polls: “538 Grade” (categorical, with ordered levels), “Bias” (quantitative), “Average Distance from Polling Average” (quantitative), etc.
Proportion of races called correctly: “Races Called Correctly” (quantitative); grade given by FiveThirtyEight: “538 Grade” (categorical, with ordered levels)
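Because the second question pairs a categorical variable with a quantitative one, a natural first step is a per-grade summary. The sketch below is one possible way to do that, not a settled analysis plan. The helper name summarise_by_grade() is hypothetical, and the two-row sample holds only the first two pollsters whose values are fully visible in the glimpse of the data, so the result is purely illustrative.

```r
library(dplyr)

# First two rows visible in glimpse(pollster_ratings); not the full data set.
pollster_sample <- tibble(
  `Polls Analyzed` = c(95, 53),
  `538 Grade` = c("A+", "A+"),
  `Races Called Correctly` = c(0.7473684, 0.8113208)
)

# Hypothetical helper: one row per grade with simple summaries.
summarise_by_grade <- function(ratings) {
  ratings %>%
    group_by(`538 Grade`) %>%
    summarise(
      n_pollsters = n(),
      mean_races_called = mean(`Races Called Correctly`),
      median_polls = median(`Polls Analyzed`),
      .groups = "drop"
    )
}

grade_summary <- summarise_by_grade(pollster_sample)
print(grade_summary)
```

With the full data, the same helper could be called on pollster_ratings once it has been read in.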

Glimpse of data

pollster_ratings <- read_csv("data/pollster-ratings.csv")

Rows: 517 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): Pollster, AAPOR/Roper, Banned by 538, 538 Grade
dbl (17): Rank, Pollster Rating ID, Polls Analyzed, Predictive Plus-Minus, M...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(pollster_ratings)

Rows: 517
Columns: 21
$ Rank                                           <dbl> 1, 2, 3, 4, 5, 6, 7, 8,…
$ Pollster                                       <chr> "Siena College/The New …
$ `Pollster Rating ID`                           <dbl> 448, 304, 280, 325, 195…
$ `Polls Analyzed`                               <dbl> 95, 53, 44, 856, 15, 62…
$ `AAPOR/Roper`                                  <chr> "yes", "yes", "no", "no…
$ `Banned by 538`                                <chr> "no", "no", "no", "no",…
$ `Predictive Plus-Minus`                        <dbl> -1.1928598, -1.1759107,…
$ `538 Grade`                                    <chr> "A+", "A+", "A+", "A+",…
$ `Mean-Reverted Bias`                           <dbl> 1.006012283, 0.16111995…
$ `Races Called Correctly`                       <dbl> 0.7473684, 0.8113208, 0…
$ `Misses Outside MOE`                           <dbl> 0.17894737, 0.24528302,…
$ `Simple Average Error`                         <dbl> 4.043250, 4.916848, 4.1…
$ `Simple Expected Error`                        <dbl> 5.364792, 5.959235, 5.5…
$ `Simple Plus-Minus`                            <dbl> -1.31842859, -1.0392737…
$ `Advanced Plus-Minus`                          <dbl> -1.6255234, -1.6226705,…
$ `Mean-Reverted Advanced Plus-Minus`            <dbl> -1.3183722, -1.0008436,…
$ `# of Polls for Bias Analysis`                 <dbl> 94, 35, 43, 697, 11, 57…
$ Bias                                           <dbl> 1.243259572, 0.29764469…
$ `House Effect`                                 <dbl> 0.70050634, -0.34349505…
$ `Average Distance from Polling Average (ADPA)` <dbl> 3.575558, 5.288403, 4.6…
$ `Herding Penalty`                              <dbl> 0.26118782, 0.00000000,…

Data 2

Introduction and data

Identify the source of the data.

FiveThirtyEight, The Lasting Legacy Of Redlining

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

For metro-grades.csv, spatial data for micro- and metropolitan areas were collected from the Home Owners’ Loan Corporation (HOLC) maps drawn in 1935-1940 (downloaded from the Mapping Inequality Project). Population and race/ethnicity data were collected from the 2020 US decennial census.

Write a brief description of the observations.

In metro-grades.csv, each observation is one HOLC grade within one metro area (metro_area, given as city, state). Each observation records the population by ethnicity, the population percentage by ethnicity, a location quotient by ethnicity, and the population and population percentage by ethnicity of the surrounding area.
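The description above does not define the location quotient. The values in the glimpse of the data are consistent with a common definition: a group's population share inside the HOLC zone divided by its share in the surrounding area. That definition is an assumption here, not something the source states. The sketch below checks it against the four Akron, OH rows visible in the glimpse, allowing a small tolerance because the published percentages are rounded.

```r
library(dplyr)

# The four Akron, OH rows as shown in glimpse(metro_grades).
akron <- tibble(
  holc_grade = c("A", "B", "C", "D"),
  pct_black = c(23.33, 24.33, 20.27, 45.70),
  surr_area_pct_black = c(16.55, 16.55, 16.55, 16.55),
  lq_black = c(1.41, 1.47, 1.23, 2.76)
)

# Assumed definition: share in the zone divided by share in the surrounding area.
lq_check <- akron %>%
  mutate(
    lq_recomputed = pct_black / surr_area_pct_black,
    abs_diff = abs(lq_recomputed - lq_black)
  )

print(lq_check)
```

A value above 1 would then mean the group is over-represented in that zone relative to the surrounding area.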

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is there presently a correlation between formerly redlined districts and racial segregation in those districts in 2020? How have ethnic populations in those districts changed over time?

A description of the research topic along with a concise statement of your hypotheses on this topic.

This research topic is based on an article published by FiveThirtyEight on the lasting effects of the redlining maps drawn in the late 1930s. Hypothesis: There is currently a strong correlation between formerly redlined districts and racial segregation in those districts.

Identify the types of variables in your research question. Categorical? Quantitative?

The metro area and HOLC grade are categorical variables. The population and percentage data are split by ethnicity, which is categorical, but the values themselves are quantitative.
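Since HOLC grade is categorical and the location quotients are quantitative, one way to start on the hypothesis is to compare a location quotient across grades. The sketch below is a minimal illustration, assuming the grades run from A to D in that order. It uses only the first eight holc_grade and lq_black values visible in the glimpse of the data (two metro areas), so the averages say nothing about the full data set.

```r
library(dplyr)

# First eight rows of holc_grade and lq_black from glimpse(metro_grades).
grades_sample <- tibble(
  holc_grade = c("A", "B", "C", "D", "A", "B", "C", "D"),
  lq_black = c(1.41, 1.47, 1.23, 2.76, 0.66, 1.33, 1.40, 3.35)
)

lq_by_grade <- grades_sample %>%
  # Assumption: A-D are the only grades and are ordered from A to D.
  mutate(holc_grade = factor(holc_grade,
                             levels = c("A", "B", "C", "D"),
                             ordered = TRUE)) %>%
  group_by(holc_grade) %>%
  summarise(
    n_zones = n(),
    mean_lq_black = mean(lq_black),
    .groups = "drop"
  )

print(lq_by_grade)
```

On the full metro_grades data, the same pipeline could be repeated for the other lq_ columns.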

Glimpse of data

# read the redlining data with base R's read.csv(), which stores whole-number columns as integers
metro_grades <- read.csv("data/metro-grades.csv")
glimpse(metro_grades)

Rows: 551
Columns: 28
$ metro_area          <chr> "Akron, OH", "Akron, OH", "Akron, OH", "Akron, OH"…
$ holc_grade          <chr> "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", …
$ white_pop           <int> 24702, 41531, 73105, 6179, 16989, 26644, 56878, 16…
$ black_pop           <int> 8624, 16499, 22847, 6921, 1818, 7094, 16795, 19581…
$ hisp_pop            <int> 956, 2208, 3149, 567, 1317, 4334, 10357, 6688, 367…
$ asian_pop           <int> 688, 3367, 6291, 455, 1998, 2509, 6355, 2191, 21, …
$ other_pop           <int> 1993, 4211, 7302, 1022, 1182, 4650, 11153, 4364, 8…
$ total_pop           <int> 36963, 67816, 112694, 15144, 23303, 45230, 101538,…
$ pct_white           <dbl> 66.83, 61.24, 64.87, 40.80, 72.91, 58.91, 56.02, 3…
$ pct_black           <dbl> 23.33, 24.33, 20.27, 45.70, 7.80, 15.68, 16.54, 39…
$ pct_hisp            <dbl> 2.59, 3.26, 2.79, 3.75, 5.65, 9.58, 10.20, 13.48, …
$ pct_asian           <dbl> 1.86, 4.96, 5.58, 3.00, 8.57, 5.55, 6.26, 4.42, 1.…
$ pct_other           <dbl> 5.39, 6.21, 6.48, 6.75, 5.07, 10.28, 10.98, 8.79, …
$ lq_white            <dbl> 0.94, 0.86, 0.91, 0.57, 1.09, 0.88, 0.84, 0.51, 1.…
$ lq_black            <dbl> 1.41, 1.47, 1.23, 2.76, 0.66, 1.33, 1.40, 3.35, 0.…
$ lq_hisp             <dbl> 1.00, 1.26, 1.08, 1.45, 0.77, 1.30, 1.39, 1.83, 0.…
$ lq_asian            <dbl> 0.46, 1.23, 1.38, 0.74, 1.21, 0.78, 0.88, 0.62, 0.…
$ lq_other            <dbl> 0.97, 1.11, 1.16, 1.21, 0.72, 1.47, 1.57, 1.26, 1.…
$ surr_area_white_pop <int> 304399, 304399, 304399, 304399, 387016, 387016, 38…
$ surr_area_black_pop <int> 70692, 70692, 70692, 70692, 68371, 68371, 68371, 6…
$ surr_area_hisp_pop  <int> 11037, 11037, 11037, 11037, 42699, 42699, 42699, 4…
$ surr_area_asian_pop <int> 17295, 17295, 17295, 17295, 41112, 41112, 41112, 4…
$ surr_area_other_pop <int> 23839, 23839, 23839, 23839, 40596, 40596, 40596, 4…
$ surr_area_pct_white <dbl> 71.24, 71.24, 71.24, 71.24, 66.75, 66.75, 66.75, 6…
$ surr_area_pct_black <dbl> 16.55, 16.55, 16.55, 16.55, 11.79, 11.79, 11.79, 1…
$ surr_area_pct_hisp  <dbl> 2.58, 2.58, 2.58, 2.58, 7.36, 7.36, 7.36, 7.36, 26…
$ surr_area_pct_asian <dbl> 4.05, 4.05, 4.05, 4.05, 7.09, 7.09, 7.09, 7.09, 3.…
$ surr_area_pct_other <dbl> 5.58, 5.58, 5.58, 5.58, 7.00, 7.00, 7.00, 7.00, 4.…

Data 3

Introduction and data

Identify the source of the data.

FiveThirtyEight

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The predictions rely on a significantly modified version of ESPN’s Soccer Power Index (SPI), a rating system initially created by FiveThirtyEight. FiveThirtyEight has refined and adjusted the SPI using data from over 550,000 club soccer matches dating back to 1888, gathered from both ESPN’s database and the Engsoccerdata GitHub repository. In addition, FiveThirtyEight incorporated Opta’s play-by-play data, which has been available since 2010, to add further variables to the data set.

Write a brief description of the observations.

Each observation is a single match between two teams. For each match, the data give each team's SPI rating and win probability, the probability of a tie, and the actual final score.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

What is the relationship between a soccer team’s historical performance and its probability of winning in the upcoming season, as predicted by the revised Soccer Power Index (SPI) rating system?

A description of the research topic along with a concise statement of your hypotheses on this topic.

Research Topic: The Predictive Power of the Revised Soccer Power Index (SPI) Rating System on Club Soccer Match Outcomes.

Hypothesis: The revised Soccer Power Index (SPI) rating system, which incorporates historical club soccer match data, is likely to be a more accurate predictor of club soccer match outcomes compared to previous versions of the SPI rating system that did not include historical data.

Identify the types of variables in your research question. Categorical? Quantitative?

The variables in this research question are all quantitative: scores, win probabilities, and SPI ratings, which are themselves determined by other quantitative variables.
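One way to assess predictive power from these quantitative variables is to derive a match outcome and compare it with the most likely outcome under the forecast. The derived outcome is categorical even though its inputs are quantitative. The sketch below is an assumption about how such a check could look, not FiveThirtyEight's own evaluation method. It uses the first seven matches whose values are fully visible in the glimpse of the data, and the column names actual, predicted and correct are introduced here for illustration.

```r
library(dplyr)

# First seven matches from glimpse(club_soccer).
matches <- tibble(
  prob1 = c(0.4389, 0.3572, 0.4799, 0.4289, 0.4124, 0.3821, 0.3082),
  prob2 = c(0.2767, 0.3608, 0.2487, 0.2699, 0.3157, 0.3200, 0.3888),
  probtie = c(0.2844, 0.2819, 0.2714, 0.3013, 0.2719, 0.2979, 0.3030),
  score1 = c(2, 2, 1, 0, 1, 1, 1),
  score2 = c(0, 0, 1, 0, 2, 1, 5)
)

matches_scored <- matches %>%
  mutate(
    # Actual result derived from the final score.
    actual = case_when(
      score1 > score2 ~ "team1",
      score1 < score2 ~ "team2",
      TRUE ~ "tie"
    ),
    # Outcome with the highest forecast probability.
    predicted = case_when(
      prob1 >= prob2 & prob1 >= probtie ~ "team1",
      prob2 >= probtie ~ "team2",
      TRUE ~ "tie"
    ),
    correct = actual == predicted
  )

# Share of matches where the most likely outcome occurred.
accuracy <- mean(matches_scored$correct)
print(accuracy)
```

Seven matches are far too few to judge the rating system. On the full data, rows without a final score would need to be dropped first, and a probability-based measure such as the Brier score may be more informative than simple accuracy.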

Glimpse of data

club_soccer <- read_csv("data/spi_matches.csv")

Rows: 67726 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (3): league, team1, team2
dbl  (19): season, league_id, spi1, spi2, prob1, prob2, probtie, proj_score1...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(club_soccer)

Rows: 67,726
Columns: 23
$ season      <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016…
$ date        <date> 2016-07-09, 2016-07-10, 2016-07-10, 2016-07-16, 2016-07-1…
$ league_id   <dbl> 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921, 7921…
$ league      <chr> "FA Women's Super League", "FA Women's Super League", "FA …
$ team1       <chr> "Liverpool Women", "Arsenal Women", "Chelsea FC Women", "L…
$ team2       <chr> "Reading", "Notts County Ladies", "Birmingham City", "Nott…
$ spi1        <dbl> 51.56, 46.61, 59.85, 53.00, 59.43, 50.75, 48.13, 50.62, 48…
$ spi2        <dbl> 50.42, 54.03, 54.64, 52.35, 60.99, 55.03, 60.15, 52.63, 48…
$ prob1       <dbl> 0.4389, 0.3572, 0.4799, 0.4289, 0.4124, 0.3821, 0.3082, 0.…
$ prob2       <dbl> 0.2767, 0.3608, 0.2487, 0.2699, 0.3157, 0.3200, 0.3888, 0.…
$ probtie     <dbl> 0.2844, 0.2819, 0.2714, 0.3013, 0.2719, 0.2979, 0.3030, 0.…
$ proj_score1 <dbl> 1.39, 1.27, 1.53, 1.27, 1.45, 1.22, 1.04, 1.31, 1.64, 1.20…
$ proj_score2 <dbl> 1.05, 1.28, 1.03, 0.94, 1.24, 1.09, 1.20, 1.09, 1.35, 1.45…
$ importance1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 32.4, 53.7, 38.1, …
$ importance2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 67.7, 22.9, 22.2, …
$ score1      <dbl> 2, 2, 1, 0, 1, 1, 1, 1, 1, 1, 0, 2, 2, 0, 1, 1, 0, 1, 3, 2…
$ score2      <dbl> 0, 0, 1, 0, 2, 1, 5, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1…
$ xg1         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.97, 2.45, 0.85, …
$ xg2         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.63, 0.77, 2.77, …
$ nsxg1       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.43, 1.75, 0.17, …
$ nsxg2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.45, 0.42, 1.25, …
$ adj_score1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.00, 2.10, 2.10, …
$ adj_score2  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1.05, 2.10, 1.05, …