J-Cash (team 16)

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • From the CORGIS Dataset Project, by Sam Donald

  • Original data curator: @jthomasmock shares datasets on Github to contribute to the TidyTuesday project

    Created: Jul 6, 2020

  • The data contain observations on two species of coffee bean (Arabica and Robusta) from various countries. Each coffee is judged on attributes such as acidity, sweetness, and fragrance, and each attribute receives a score out of 10.

Research question

  • How do Arabica and Robusta coffee beans differ in aroma, flavor, aftertaste, acidity, and sweetness, and are these scores associated with the country or region where the beans were grown?
  • This question is important because it can help coffee drinkers and companies identify what beans are best for their specific coffee needs.
  • Overall, this research topic explores how coffee beans differ across several sensory traits. We hypothesize that Arabica and Robusta beans will differ on the traits above, while beans from the same country/region will have similar scores.
  • The score variables (aroma, flavor, aftertaste, acidity, and sweetness) are quantitative; species and country/region are categorical.
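
One way to approach the species comparison is a grouped summary of the score columns. The sketch below is illustrative only: `coffee_toy` holds made-up values (only the column names mirror coffee.csv), and `mean_scores_by()` is a hypothetical helper, not part of any package.

```r
library(dplyr)

# Toy rows with MADE-UP values; only the column names mirror coffee.csv.
coffee_toy <- data.frame(
  Data.Type.Species  = c("Arabica", "Arabica", "Robusta", "Robusta"),
  Location.Country   = c("Ethiopia", "Brazil", "India", "Uganda"),
  Data.Scores.Aroma  = c(8.0, 7.5, 7.6, 7.4),
  Data.Scores.Flavor = c(8.2, 7.4, 7.7, 7.5)
)

# Hypothetical helper: mean of every Data.Scores.* column within each
# level of `group_col`, plus the number of rows in each group.
mean_scores_by <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(
      across(starts_with("Data.Scores."), mean),
      n = n(),
      .groups = "drop"
    )
}

mean_scores_by(coffee_toy, "Data.Type.Species")
```

Passing "Location.Country" instead of "Data.Type.Species" gives the same summary by country, which is one way to look at the second half of the research question.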

Glimpse of data

# Load the coffee data and summarize every column
coffee <- read.csv('data/coffee.csv')
skimr::skim(coffee)
Data summary
Name coffee
Number of rows 989
Number of columns 23
_______________________
Column type frequency:
character 7
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Location.Country 0 1 4 28 0 32 0
Location.Region 0 1 3 76 0 278 0
Data.Owner 0 1 3 50 0 263 0
Data.Type.Species 0 1 7 7 0 2 0
Data.Type.Variety 0 1 3 21 0 28 0
Data.Type.Processing.method 0 1 3 25 0 6 0
Data.Color 0 1 4 12 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Location.Altitude.Min 0 1 1640.08 9192.52 0 905.00 1300.00 1550.00 190164.00 ▇▁▁▁▁
Location.Altitude.Max 0 1 1675.93 9191.96 0 950.00 1310.00 1600.00 190164.00 ▇▁▁▁▁
Location.Altitude.Average 0 1 1658.00 9192.06 0 950.00 1300.00 1600.00 190164.00 ▇▁▁▁▁
Year 0 1 2013.55 1.66 2010 2012.00 2013.00 2015.00 2018.00 ▁▇▃▃▁
Data.Production.Number.of.bags 0 1 151.76 125.67 1 15.00 170.00 275.00 600.00 ▇▁▇▁▁
Data.Production.Bag.weight 0 1 210.49 1666.71 0 1.00 60.00 69.00 19200.00 ▇▁▁▁▁
Data.Scores.Aroma 0 1 7.57 0.40 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Flavor 0 1 7.52 0.42 0 7.33 7.50 7.75 8.83 ▁▁▁▁▇
Data.Scores.Aftertaste 0 1 7.39 0.43 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
Data.Scores.Acidity 0 1 7.54 0.40 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
Data.Scores.Body 0 1 7.51 0.39 0 7.33 7.50 7.67 8.50 ▁▁▁▁▇
Data.Scores.Balance 0 1 7.50 0.43 0 7.33 7.50 7.75 8.58 ▁▁▁▁▇
Data.Scores.Uniformity 0 1 9.82 0.59 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Sweetness 0 1 9.83 0.69 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
Data.Scores.Moisture 0 1 0.09 0.04 0 0.10 0.11 0.12 0.28 ▃▇▆▁▁
Data.Scores.Total 0 1 81.97 3.86 0 81.08 82.50 83.58 90.58 ▁▁▁▁▇

Data 2

Introduction and data

  • Source: NBA 2022-2023 Season Predictions

  • The data were found on the FiveThirtyEight website. They were collected by @jayb (Jay Boice), who published them on GitHub. Boice compiled them from FiveThirtyEight’s “The Complete History of the NBA” and “NBA Predictions” projects. The data draw on every NBA game since the 1946 season. Each franchise is given an Elo rating, and the two teams’ ratings are used to predict the outcome of each upcoming game. A rating depends only on the dates of the team’s games, the winning team, the score margin, and where each game was played.

  • The key variables are as follows (note: each observation is a different game): game date (date), team1 (which is also the home team), team2 (away team), team1’s pre-game Elo rating (elo1_pre), team2’s pre-game Elo rating (elo2_pre), team1’s probability of winning (elo_prob1), team2’s probability of winning (elo_prob2), team1’s post-game Elo rating (elo1_post), and team2’s post-game Elo rating (elo2_post).
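
To make the link between two ratings and a win probability concrete, here is a minimal sketch of the textbook logistic Elo formula. It is not necessarily FiveThirtyEight's exact calculation: the `home_advantage` argument is a hypothetical parameter added for illustration, and the source does not state what adjustment, if any, is applied.

```r
# Textbook Elo expectation for team1 against team2.
# `home_advantage` (in Elo points) is a HYPOTHETICAL parameter for
# illustration; the default of 0 applies no venue adjustment.
elo_win_prob <- function(elo1, elo2, home_advantage = 0) {
  1 / (1 + 10^(-(elo1 + home_advantage - elo2) / 400))
}

elo_win_prob(1500, 1500)   # equal ratings give an even game
elo_win_prob(1600, 1500)   # the higher-rated team is favoured
```

Under this formula the two teams' probabilities always sum to 1, which matches how elo_prob1 and elo_prob2 mirror each other in the summary output.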

Research question

  • How effective is the Elo rating for predicting the outcome of regular season NBA games for home versus away teams?
  • Sports prediction models are in high demand, especially given the growth of sports gambling. More people than ever can use phone apps to access picks and lines sourced from casinos and gambling organizations headquartered in Las Vegas. Elo is a long-standing type of rating, and because it uses only a small set of inputs about past games, we hypothesize that it will not be an accurate predictor of future NBA games. It does not account for off-season player trades, which can have a large effect at the beginning of the season. Regular-season trades also change team trajectories, and the ratings do not immediately reflect them.
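  • date, team1, and team2 are stored as character variables: team1 and team2 are categorical, and date is a calendar date. The remaining populated variables are quantitative; playoff and the carm.elo columns are entirely missing in this file.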
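
A possible way to assess the research question is to compare the Elo favourite with the actual winner, split by whether the home team (team1) was favoured. The sketch below assumes that a missing score means the game has not been played yet; `games_toy` contains made-up values, and `evaluate_elo()` is a hypothetical helper.

```r
library(dplyr)

# Toy games with MADE-UP values; column names mirror nba_elo_latest.csv.
games_toy <- data.frame(
  elo_prob1 = c(0.70, 0.40, 0.55, 0.80, 0.30),
  score1    = c(110, 98, 101, NA, 120),
  score2    = c(102, 105, 104, NA, 115)
)

# Hypothetical helper: accuracy and Brier score of the Elo forecast,
# split by whether the home team (team1) was the favourite.
evaluate_elo <- function(games) {
  games %>%
    filter(!is.na(score1), !is.na(score2)) %>%  # assumed: NA score = not yet played
    mutate(
      home_win      = score1 > score2,
      home_favoured = elo_prob1 > 0.5,
      correct       = home_win == home_favoured
    ) %>%
    group_by(home_favoured) %>%
    summarise(
      games    = n(),
      accuracy = mean(correct),                    # share of games called correctly
      brier    = mean((elo_prob1 - home_win)^2),   # lower is better
      .groups  = "drop"
    )
}

evaluate_elo(games_toy)
```

Accuracy only checks which side of 0.5 the forecast fell on, while the Brier score also penalizes overconfident probabilities, so reporting both gives a fuller picture of how well the ratings predict.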

Glimpse of data

# Load the NBA Elo predictions and summarize every column
nba_2023_pred <- read.csv('data/nba_elo_latest.csv')
skimr::skim(nba_2023_pred)
Data summary
Name nba_2023_pred
Number of rows 1230
Number of columns 27
_______________________
Column type frequency:
character 3
logical 7
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date 0 1 10 10 0 164 0
team1 0 1 3 3 0 30 0
team2 0 1 3 3 0 30 0

Variable type: logical

skim_variable n_missing complete_rate mean count
playoff 1230 0 NaN :
carm.elo1_pre 1230 0 NaN :
carm.elo2_pre 1230 0 NaN :
carm.elo_prob1 1230 0 NaN :
carm.elo_prob2 1230 0 NaN :
carm.elo1_post 1230 0 NaN :
carm.elo2_post 1230 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
season 0 1.00 2023.00 0.00 2023.00 2023.00 2023.00 2023.00 2023.00 ▁▁▇▁▁
neutral 0 1.00 0.00 0.04 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
elo1_pre 0 1.00 1505.23 86.81 1268.44 1454.31 1516.71 1567.58 1694.61 ▂▃▆▇▂
elo2_pre 0 1.00 1505.23 87.63 1263.86 1454.32 1517.56 1566.79 1719.45 ▂▃▇▆▁
elo_prob1 0 1.00 0.63 0.15 0.17 0.53 0.64 0.74 0.94 ▁▃▇▇▃
elo_prob2 0 1.00 0.37 0.15 0.06 0.26 0.36 0.47 0.83 ▃▇▇▃▁
elo1_post 195 0.84 1504.60 85.98 1263.86 1455.34 1514.33 1569.11 1694.61 ▁▃▇▇▂
elo2_post 195 0.84 1505.97 84.92 1272.43 1455.58 1516.00 1568.94 1719.45 ▂▃▇▆▁
raptor1_pre 0 1.00 1504.59 99.88 1140.48 1445.86 1520.95 1578.42 1699.08 ▁▂▃▇▃
raptor2_pre 0 1.00 1499.25 103.71 1104.98 1433.87 1517.40 1575.80 1703.53 ▁▂▃▇▃
raptor_prob1 0 1.00 0.61 0.18 0.09 0.49 0.62 0.74 0.96 ▁▃▇▇▅
raptor_prob2 0 1.00 0.39 0.18 0.04 0.26 0.38 0.51 0.91 ▅▇▇▃▁
score1 195 0.84 115.97 11.93 82.00 108.00 116.00 124.00 175.00 ▂▇▅▁▁
score2 195 0.84 112.98 11.45 81.00 105.00 113.00 121.00 176.00 ▂▇▃▁▁
quality 0 1.00 50.44 24.83 0.00 30.00 52.00 71.75 97.00 ▅▇▇▇▆
importance 0 1.00 29.13 24.51 0.00 9.00 23.00 43.00 100.00 ▇▅▂▂▁
total_rating 0 1.00 40.04 19.89 0.00 23.00 43.00 55.00 90.00 ▅▆▇▅▁

Data 3

Introduction and data

  • The data come from the “Cancer” collection in the CORGIS dataset repository, which was assembled by Virginia Tech’s DataLab. The sources used to compile this collection include the Surveillance, Epidemiology and End Results (SEER) program of the National Cancer Institute, the Centers for Disease Control and Prevention (CDC), and the World Health Organization (WHO).

    https://think.cs.vt.edu/corgis/csv/cancer/

  • Some of the data sets date back to the 1970s, and the data were gathered across a number of years. For instance, the CDC’s data span the years 1999 to 2016, whereas the SEER program’s data cover the years 1975 to 2015. Several techniques, such as surveys, clinical trials, and population-based cancer registries, were used to gather the data.

  • The collection covers cancer-related topics such as incidence rates, mortality rates, and risk factors like smoking and obesity, and includes breast, lung, prostate, colorectal, and other cancers. The file loaded below has one row per state (51 rows) and reports total counts and rates, plus rates broken down by age group, sex, race/ethnicity, and cancer type (breast, colorectal, and lung).

Research question

  • Questions:
    • Is there a difference in cancer incidence rates and survival rates between different races/ethnicities in the United States?

    • This question is important because it addresses the stark differences in cancer outcomes faced by different racial and ethnic groups in the US. Research has repeatedly shown that some racial and ethnic groups experience disproportionately high rates of cancer incidence and mortality.

  • The research topic is the link between race and ethnicity, cancer incidence rates, and survival rates in the US. The hypothesis is that cancer incidence and survival rates vary across races and ethnicities, with some groups having higher incidence rates and lower survival rates than others.
  • The research question involves both quantitative and categorical variables. Race/ethnicity and cancer type are categorical. Cancer incidence, death, and survival rates are quantitative.
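
Because each race/ethnicity has its own column, comparing groups is easier after reshaping to one row per state and group. The sketch below uses made-up values in `cancer_toy` and a hypothetical helper, `race_rate_summary()`. Treating a rate of 0 as missing is an assumption (zeros may mark suppressed or unavailable values rather than true zero rates), so it is left as an option.

```r
library(dplyr)
library(tidyr)

# Toy rows with MADE-UP values; column names mirror cancer.csv.
cancer_toy <- data.frame(
  State            = c("A", "B", "C"),
  Rates.Race.White = c(170, 180, 160),
  Rates.Race.Black = c(200, 0, 190),
  Rates.Race.Asian = c(95, 100, 0)
)

# Hypothetical helper: reshape the overall race/ethnicity rate columns to
# long format and summarize them across states.
race_rate_summary <- function(data, zero_as_missing = TRUE) {
  long <- data %>%
    # Keep only the overall race columns, not the race-by-sex breakdowns.
    select(State, matches("^Rates\\.Race\\.(White|Black|Asian|Indigenous|Hispanic)$")) %>%
    pivot_longer(
      cols         = -State,
      names_to     = "race",
      names_prefix = "Rates\\.Race\\.",
      values_to    = "rate"
    )
  if (zero_as_missing) {
    long <- long %>% mutate(rate = na_if(rate, 0))  # ASSUMPTION: 0 means no data
  }
  long %>%
    group_by(race) %>%
    summarise(
      states    = sum(!is.na(rate)),        # states with a usable rate
      mean_rate = mean(rate, na.rm = TRUE), # unweighted mean across states
      .groups   = "drop"
    )
}

race_rate_summary(cancer_toy)
```

The mean here weights every state equally, so it is not a population-weighted national rate.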

Glimpse of data

# Load the cancer data and summarize every column
cancer_data <- read.csv('data/cancer.csv')
skimr::skim(cancer_data)
Data summary
Name cancer_data
Number of rows 51
Number of columns 75
_______________________
Column type frequency:
character 1
numeric 74
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
State 0 1 4 20 0 51 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Total.Rate 0 1 190.66 28.59 98.5 176.50 196.1 210.75 254.6 ▁▂▇▇▂
Total.Number 0 1 78723.73 80861.29 6361.0 20631.00 54930.0 93328.00 393980.0 ▇▂▁▁▁
Total.Population 0 1 42401510.51 47842444.32 3931624.0 11869909.50 30348057.0 46503256.00 261135696.0 ▇▂▁▁▁
Rates.Age…18 0 1 2.12 0.50 0.0 2.05 2.2 2.40 2.7 ▁▁▁▅▇
Rates.Age.18.45 0 1 14.76 2.20 10.0 13.35 14.6 16.25 20.3 ▃▇▇▅▂
Rates.Age.45.64 0 1 197.58 31.26 132.3 175.00 189.3 217.85 263.9 ▁▇▆▂▃
Rates.Age…64 0 1 980.95 75.19 735.8 943.55 999.6 1031.45 1110.2 ▁▁▃▇▃
Rates.Age.and.Sex…18.Female 0 1 1.73 0.85 0.0 1.75 2.0 2.20 2.9 ▂▁▁▇▂
Rates.Age.and.Sex…18.Male 0 1 2.03 0.93 0.0 2.05 2.3 2.50 3.2 ▂▁▁▇▃
Rates.Age.and.Sex.18…45.Female 0 1 16.01 2.53 10.3 14.15 16.0 17.70 22.8 ▁▇▆▅▁
Rates.Age.and.Sex.18…45.Male 0 1 13.55 2.02 9.8 12.40 13.4 14.35 19.0 ▃▇▇▂▁
Rates.Age.and.Sex.45…64.Female 0 1 177.82 23.00 126.1 162.90 172.2 193.30 233.2 ▁▇▆▅▂
Rates.Age.and.Sex.45…64.Male 0 1 218.26 41.05 138.5 187.85 208.3 243.30 310.0 ▁▇▅▂▃
Rates.Age.and.Sex…64.Female 0 1 826.75 65.39 611.6 803.50 845.5 871.10 916.2 ▁▁▃▇▇
Rates.Age.and.Sex…64.Male 0 1 1181.82 105.16 884.9 1112.85 1195.0 1257.85 1372.7 ▁▃▇▇▆
Rates.Race.White 0 1 171.00 15.01 127.8 162.85 171.3 180.15 204.9 ▁▂▇▅▁
Rates.Race.White.non.Hispanic 0 1 173.19 15.04 129.1 165.65 173.9 183.40 205.9 ▁▂▇▆▂
Rates.Race.Black 0 1 182.84 48.68 0.0 162.60 194.4 220.75 235.1 ▁▁▂▃▇
Rates.Race.Asian 0 1 98.99 21.58 0.0 90.65 97.6 112.65 149.0 ▁▁▃▇▂
Rates.Race.Indigenous 0 1 111.11 68.30 0.0 57.95 102.2 162.05 248.0 ▅▇▆▃▃
Rates.Race.and.Sex.Female.White 0 1 145.56 11.44 109.9 140.45 146.2 152.45 170.5 ▁▂▇▇▁
Rates.Race.and.Sex.Female.White.non.Hispanic 0 1 147.65 11.46 110.5 143.55 149.9 155.20 171.2 ▁▂▅▇▁
Rates.Race.and.Sex.Female.Black 0 1 142.65 58.11 0.0 135.65 161.7 181.70 195.6 ▂▁▁▃▇
Rates.Race.and.Sex.Female.Black.non.Hispanic 0 1 146.60 58.25 0.0 146.35 165.6 182.80 198.5 ▂▁▁▅▇
Rates.Race.and.Sex.Female.Asian 0 1 85.01 23.22 0.0 79.50 85.6 94.80 130.3 ▁▁▂▇▂
Rates.Race.and.Sex.Female.Indigenous 0 1 88.71 61.70 0.0 43.70 78.7 127.65 219.4 ▇▇▆▃▃
Rates.Race.and.Sex.Male.White 0 1 206.38 21.54 145.1 195.45 205.4 218.15 253.6 ▁▂▇▅▂
Rates.Race.and.Sex.Male.White.non.Hispanic 0 1 208.71 21.72 146.2 197.00 207.8 220.20 255.1 ▁▂▇▅▃
Rates.Race.and.Sex.Male.Black 0 1 222.93 83.54 0.0 200.05 246.2 278.70 319.5 ▁▁▂▅▇
Rates.Race.and.Sex.Male.Black.non.Hispanic 0 1 228.24 83.12 0.0 216.55 252.7 282.10 321.2 ▁▁▂▅▇
Rates.Race.and.Sex.Male.Asian 0 1 107.22 41.70 0.0 96.45 112.5 129.15 172.2 ▂▁▃▇▃
Rates.Race.and.Sex.Male.Indigenous 0 1 129.56 90.58 0.0 66.60 114.2 198.35 303.3 ▆▇▅▅▅
Rates.Race.Hispanic 0 1 98.16 24.07 39.5 82.40 98.0 112.15 168.5 ▂▆▇▃▁
Rates.Race.and.Sex.Female.Hispanic 0 1 82.20 24.50 0.0 72.55 84.8 95.50 140.8 ▁▁▆▇▁
Rates.Race.and.Sex.Male.Hispanic 0 1 115.47 36.15 0.0 91.50 119.4 136.95 202.3 ▁▂▇▆▂
Types.Breast.Total 0 1 26.01 3.10 17.4 24.45 26.6 27.65 31.8 ▁▁▃▇▂
Types.Breast.Age.18…44 0 1 4.02 0.94 0.0 3.55 4.2 4.70 5.6 ▁▁▃▇▆
Types.Breast.Age.45…64 0 1 34.89 4.87 27.8 31.20 34.6 37.65 56.6 ▇▇▁▁▁
Types.Breast.Age…64 0 1 102.25 9.35 62.3 98.65 102.2 107.85 129.6 ▁▁▇▆▁
Types.Breast.Race.White 0 1 21.20 1.22 17.5 20.40 21.1 22.10 24.0 ▁▂▇▅▂
Types.Breast.Race.White.non.Hispanic. 0 1 21.62 1.45 18.2 20.75 21.4 22.25 25.5 ▁▇▇▃▂
Types.Breast.Race.Black 0 1 22.57 12.48 0.0 21.15 27.4 31.10 34.7 ▃▁▁▅▇
Types.Breast.Race.Black.non.Hispanic 0 1 23.19 12.68 0.0 23.60 28.4 31.85 35.0 ▃▁▁▃▇
Types.Breast.Race.Asian 0 1 6.02 5.48 0.0 0.00 8.5 10.70 15.8 ▇▁▃▆▁
Types.Breast.Race.Indigenous 0 1 4.98 8.06 0.0 0.00 0.0 10.60 27.4 ▇▁▁▁▁
Types.Breast.Race.Hispanic 0 1 8.56 5.77 0.0 0.00 10.5 12.25 18.2 ▇▁▇▇▅
Types.Colorectal.Total 0 1 17.50 2.71 9.0 16.05 17.7 19.40 24.4 ▁▂▇▆▁
Types.Colorectal.Age.and.Sex.Female.18…44 0 1 1.14 0.61 0.0 1.00 1.3 1.50 2.0 ▃▁▃▇▂
Types.Colorectal.Age.and.Sex.Male.18…44 0 1 1.50 0.69 0.0 1.40 1.7 1.85 3.0 ▂▁▇▆▁
Types.Colorectal.Age.and.Sex.Female.45…64 0 1 14.71 2.64 10.1 13.15 14.3 15.30 21.9 ▂▇▂▂▁
Types.Colorectal.Age.and.Sex.Male.45…64 0 1 21.35 4.23 14.5 18.20 21.0 23.50 34.1 ▇▇▆▃▁
Types.Colorectal.Age.and.Sex.Female…64 0 1 81.80 9.52 59.7 74.55 82.4 88.70 99.8 ▂▇▇▇▅
Types.Colorectal.Age.and.Sex.Male…64 0 1 101.45 10.89 72.4 93.35 102.1 110.70 124.6 ▁▅▇▆▂
Types.Colorectal.Race.White 0 1 15.23 1.69 10.0 14.30 15.2 16.60 19.1 ▁▂▇▅▂
Types.Colorectal.Race.White.non.Hispanic 0 1 15.35 1.77 9.8 14.50 15.2 16.70 19.2 ▁▁▇▅▂
Types.Colorectal.Race.Black 0 1 17.01 9.16 0.0 15.05 21.2 23.20 26.7 ▃▁▁▅▇
Types.Colorectal.Race.Black.non.Hispanic 0 1 17.45 9.27 0.0 16.35 21.5 23.50 27.2 ▃▁▁▅▇
Types.Colorectal.Race.Asian 0 1 6.95 5.37 0.0 0.00 9.1 10.65 18.0 ▇▂▇▃▁
Types.Colorectal.Race.Indigenous 0 1 6.93 9.95 0.0 0.00 0.0 13.95 34.7 ▇▁▂▁▁
Types.Colorectal.Race.Hispanic 0 1 8.23 5.07 0.0 5.90 9.0 11.60 18.0 ▆▅▇▇▂
Types.Lung.Total 0 1 53.21 12.60 15.6 46.55 52.4 61.80 81.3 ▁▃▇▇▂
Types.Lung.Age.and.Sex.Female.18…44 0 1 1.26 0.79 0.0 0.85 1.4 1.75 2.9 ▅▅▇▅▁
Types.Lung.Age.and.Sex.Male.18…44 0 1 1.37 0.80 0.0 0.95 1.4 1.80 3.1 ▃▅▇▂▂
Types.Lung.Age.and.Sex.Female.45…64 0 1 45.56 11.19 18.7 38.75 45.7 53.40 77.1 ▂▇▇▅▁
Types.Lung.Age.and.Sex.Male.45…64 0 1 64.83 21.29 26.0 48.95 59.7 77.60 112.6 ▂▇▅▃▂
Types.Lung.Age.and.Sex.Female…64 0 1 224.67 33.31 92.7 209.55 225.8 248.15 298.3 ▁▁▅▇▂
Types.Lung.Age.and.Sex.Male…64 0 1 354.96 69.94 153.2 313.30 342.6 391.00 519.1 ▁▂▇▃▂
Types.Lung.Race.White 0 1 48.16 9.45 20.4 42.75 48.1 54.50 71.5 ▁▃▇▅▂
Types.Lung.Race.White.non.Hispanic 0 1 49.34 9.27 20.6 43.70 48.8 54.95 71.9 ▁▂▇▆▂
Types.Lung.Race.Black 0 1 44.06 20.52 0.0 38.15 47.0 60.00 73.2 ▃▁▆▆▇
Types.Lung.Race.Black.non.Hispanic 0 1 44.50 21.43 0.0 39.95 48.6 60.35 73.6 ▃▁▅▇▇
Types.Lung.Race.Asian 0 1 19.32 10.43 0.0 18.35 22.2 25.50 33.9 ▃▁▃▇▂
Types.Lung.Race.Indigenous 0 1 27.31 26.44 0.0 4.45 20.9 44.25 87.5 ▇▅▂▂▂
Types.Lung.Race.Hispanic 0 1 16.18 8.04 0.0 13.70 18.1 20.45 35.2 ▂▂▇▂▁