Project Strength

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • source: https://www.ncaa.org/sports/2016/12/14/shared-ncaa-research-data.aspx
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • In 2008, the late NCAA President Myles Brand charged the NCAA staff with developing a program to gather the data and provide it to interested scholars.
  • Write a brief description of the observations.

    • The CSV contains data on all of the NCAA teams in America, covering each team’s rates of academic progression and success over the past several years.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • How has the academic progression of students advanced over the past decade and which NCAA D1 schools stand out in this measurement?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • To measure this, we can identify trends in the Academic Progress Rate (APR) per team and per school. The APR system includes rewards for superior academic performance and penalties for teams that do not achieve certain academic benchmarks. Data are collected annually, and results are announced in the spring. This measure will likely correlate strongly with graduation rates, which we can check with overlaid graphs. By graphing APR over the years for each school, we can answer our research question.

    • I believe that Cornell University will have the highest rates of student-athlete academic success because it is the best NCAA D1 school.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    “DATA_TAB_GENERALINFO”, Categorical

    “SCL_UNITID”, Categorical

    “SCL_NAME”, Categorical

    “SPORT_CODE”, Categorical

    “SPORT_NAME”, Categorical

    “ACADEMIC_YEAR”, Categorical

    “SCL_DIV_19”, Categorical

    “SCL_SUB_19”, Categorical

    “D1_FB_CONF_19”, Categorical

    “CONFNAME_19”, Categorical

    “SCL_HBCU”, Categorical

    “SCL_PRIVATE”, Categorical

    “DATA_TAB_MULTIYRRATE”, Categorical (logical column; every value is missing)

    “MULTIYR_APR_RATE_1000_RAW”, Quantitative

    “MULTIYR_APR_RATE_1000_CI”, Quantitative

    “MULTIYR_APR_RATE_1000_OFFICIAL”, Quantitative

    “RAW_OR_CI”, Categorical

    “MULTIYR_SQUAD_SIZE”, Quantitative

    “MULTIYR_ELIG_RATE”, Quantitative

    “MULTIYR_RET_RATE”, Quantitative

    “DATA_TAB_ANNUALRATE”, Categorical (logical column; every value is missing)

    “APR_RATE_2019_1000”, Quantitative

    “ELIG_RATE_2019”, Quantitative

    “RET_RATE_2019”, Quantitative

    “NUM_OF_ATHLETES_2019”, Quantitative

    All award data (the PUB_AWARD_ columns) is stored as numeric but takes only the values 0 and 1, so it is effectively Categorical

Glimpse of data

ncaa_data <- read_csv("data/2020RES_APR2019PubDataShare.csv")
Rows: 6017 Columns: 101
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): SCL_NAME, SPORT_NAME, D1_FB_CONF_19, CONFNAME_19, SCL_HBCU, SCL_PR...
dbl (90): SCL_UNITID, SPORT_CODE, ACADEMIC_YEAR, SCL_DIV_19, SCL_SUB_19, MUL...
lgl  (4): DATA_TAB_GENERALINFO, DATA_TAB_MULTIYRRATE, DATA_TAB_ANNUALRATE, D...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(ncaa_data)
Data summary
Name ncaa_data
Number of rows 6017
Number of columns 101
_______________________
Column type frequency:
character 7
logical 4
numeric 90
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
SCL_NAME 0 1.00 11 58 0 385 0
SPORT_NAME 0 1.00 8 28 0 37 0
D1_FB_CONF_19 1578 0.74 11 35 0 24 0
CONFNAME_19 0 1.00 14 49 0 45 0
SCL_HBCU 0 1.00 1 1 0 2 0
SCL_PRIVATE 0 1.00 1 1 0 2 0
RAW_OR_CI 0 1.00 2 3 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
DATA_TAB_GENERALINFO 6017 0 NaN :
DATA_TAB_MULTIYRRATE 6017 0 NaN :
DATA_TAB_ANNUALRATE 6017 0 NaN :
DATATAB_PUBLICAWARD 6017 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SCL_UNITID 0 1.00 179749.28 45347.46 100654.00 145637.00 185572.00 215770.00 486840 ▇▇▁▁▁
SPORT_CODE 0 1.00 18.48 11.59 1.00 6.00 18.00 30.00 37 ▇▅▅▅▇
ACADEMIC_YEAR 0 1.00 2019.00 0.00 2019.00 2019.00 2019.00 2019.00 2019 ▁▁▇▁▁
SCL_DIV_19 0 1.00 1.01 0.13 1.00 1.00 1.00 1.00 3 ▇▁▁▁▁
SCL_SUB_19 0 1.00 1.85 0.81 0.00 1.00 2.00 3.00 3 ▁▇▁▇▅
MULTIYR_APR_RATE_1000_RAW 15 1.00 983.44 16.84 810.00 975.00 988.00 996.00 1000 ▁▁▁▁▇
MULTIYR_APR_RATE_1000_CI 15 1.00 991.95 10.65 858.00 989.00 996.00 999.00 1000 ▁▁▁▁▇
MULTIYR_APR_RATE_1000_OFFICIAL 15 1.00 984.48 16.17 810.00 977.00 989.00 997.00 1000 ▁▁▁▁▇
MULTIYR_SQUAD_SIZE 15 1.00 78.29 66.83 4.00 38.00 57.00 97.00 445 ▇▂▁▁▁
MULTIYR_ELIG_RATE 15 1.00 0.99 0.02 0.74 0.98 0.99 1.00 1 ▁▁▁▁▇
MULTIYR_RET_RATE 15 1.00 0.98 0.02 0.84 0.97 0.98 0.99 1 ▁▁▁▂▇
APR_RATE_2019_1000 84 0.99 982.98 25.35 714.00 972.00 1000.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2019 84 0.99 0.99 0.03 0.75 0.98 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2019 84 0.99 0.98 0.03 0.57 0.96 1.00 1.00 1 ▁▁▁▁▇
NUM_OF_ATHLETES_2019 84 0.99 20.21 17.07 4.00 10.00 15.00 25.00 116 ▇▂▁▁▁
APR_RATE_2018_1000 129 0.98 982.74 25.34 684.00 972.00 1000.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2018 129 0.98 0.99 0.03 0.71 0.98 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2018 129 0.98 0.98 0.03 0.56 0.96 1.00 1.00 1 ▁▁▁▁▇
NUM_OF_ATHLETES_2018 129 0.98 20.07 16.98 4.00 10.00 15.00 24.00 163 ▇▁▁▁▁
APR_RATE_2017_1000 159 0.97 983.06 25.30 773.00 973.00 1000.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2017 159 0.97 0.99 0.03 0.73 0.98 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2017 159 0.97 0.98 0.03 0.73 0.96 1.00 1.00 1 ▁▁▁▁▇
NUM_OF_ATHLETES_2017 159 0.97 19.84 16.68 4.00 10.00 15.00 24.00 113 ▇▂▁▁▁
APR_RATE_2016_1000 194 0.97 982.67 25.60 667.00 972.00 1000.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2016 194 0.97 0.98 0.03 0.60 0.98 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2016 194 0.97 0.98 0.03 0.44 0.96 1.00 1.00 1 ▁▁▁▁▇
NUM_OF_ATHLETES_2016 194 0.97 19.77 16.70 4.00 10.00 15.00 24.00 115 ▇▂▁▁▁
APR_RATE_2015_1000 238 0.96 981.36 27.76 682.00 971.00 993.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2015 238 0.96 0.98 0.03 0.56 0.98 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2015 238 0.96 0.98 0.03 0.67 0.96 1.00 1.00 1 ▁▁▁▁▇
NUM_OF_ATHLETES_2015 238 0.96 19.66 16.57 4.00 10.00 15.00 23.50 113 ▇▂▁▁▁
APR_RATE_2014_1000 253 0.96 980.63 28.76 667.00 969.00 992.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2014 253 0.96 0.98 0.04 0.50 0.97 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2014 253 0.96 0.98 0.04 0.64 0.96 1.00 1.00 1 ▁▁▁▁▇
NUM_OF_ATHLETES_2014 253 0.96 19.49 16.49 4.00 10.00 14.00 23.00 139 ▇▁▁▁▁
APR_RATE_2013_1000 348 0.94 978.17 31.86 530.00 967.00 989.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2013 348 0.94 0.98 0.04 0.24 0.97 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2013 348 0.94 0.97 0.04 0.71 0.96 1.00 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2013 348 0.94 19.35 16.34 4.00 10.00 14.00 23.00 112 ▇▂▁▁▁
APR_RATE_2012_1000 369 0.94 975.61 35.97 472.00 962.00 986.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2012 369 0.94 0.98 0.05 0.00 0.96 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2012 369 0.94 0.97 0.04 0.67 0.95 0.98 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2012 369 0.94 19.15 16.30 4.00 10.00 14.00 23.00 113 ▇▂▁▁▁
APR_RATE_2011_1000 404 0.93 973.03 39.45 442.00 958.00 984.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2011 404 0.93 0.97 0.06 0.00 0.96 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2011 404 0.93 0.97 0.04 0.68 0.95 0.98 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2011 404 0.93 19.02 16.13 4.00 10.00 14.00 23.00 114 ▇▁▁▁▁
APR_RATE_2010_1000 417 0.93 972.14 42.12 380.00 959.00 984.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2010 417 0.93 0.97 0.06 0.00 0.96 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2010 417 0.93 0.97 0.04 0.67 0.95 0.98 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2010 417 0.93 18.90 16.09 4.00 10.00 14.00 23.00 120 ▇▁▁▁▁
APR_RATE_2009_1000 470 0.92 972.33 36.58 667.00 958.00 983.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2009 470 0.92 0.97 0.05 0.56 0.96 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2009 470 0.92 0.97 0.04 0.56 0.95 0.98 1.00 1 ▁▁▁▁▇
NUM_OF_ATHLETES_2009 470 0.92 18.79 15.96 4.00 10.00 14.00 22.00 120 ▇▁▁▁▁
APR_RATE_2008_1000 576 0.90 970.43 36.92 643.00 955.00 980.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2008 576 0.90 0.97 0.05 0.46 0.95 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2008 576 0.90 0.97 0.04 0.68 0.94 0.98 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2008 576 0.90 18.82 16.04 4.00 10.00 14.00 22.00 125 ▇▁▁▁▁
APR_RATE_2007_1000 644 0.89 963.51 40.90 615.00 944.00 974.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2007 644 0.89 0.97 0.05 0.33 0.95 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2007 644 0.89 0.96 0.05 0.57 0.93 0.97 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2007 644 0.89 18.65 15.98 4.00 10.00 14.00 22.00 128 ▇▁▁▁▁
APR_RATE_2006_1000 711 0.88 961.10 42.40 643.00 940.00 971.00 1000.00 1000 ▁▁▁▂▇
ELIG_RATE_2006 711 0.88 0.96 0.05 0.50 0.94 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2006 711 0.88 0.95 0.05 0.63 0.93 0.96 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2006 711 0.88 18.66 16.38 4.00 10.00 14.00 21.00 209 ▇▁▁▁▁
APR_RATE_2005_1000 867 0.86 960.51 42.80 600.00 940.00 971.00 1000.00 1000 ▁▁▁▁▇
ELIG_RATE_2005 867 0.86 0.96 0.05 0.60 0.94 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2005 867 0.86 0.95 0.05 0.57 0.93 0.96 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2005 867 0.86 18.55 16.38 4.00 10.00 14.00 21.00 179 ▇▁▁▁▁
APR_RATE_2004_1000 900 0.85 960.59 43.28 611.00 939.00 971.00 1000.00 1000 ▁▁▁▂▇
ELIG_RATE_2004 900 0.85 0.97 0.05 0.56 0.95 1.00 1.00 1 ▁▁▁▁▇
RET_RATE_2004 900 0.85 0.95 0.05 0.61 0.93 0.96 1.00 1 ▁▁▁▂▇
NUM_OF_ATHLETES_2004 900 0.85 18.27 16.12 4.00 9.00 14.00 21.00 123 ▇▁▁▁▁
PUB_AWARD_20 41 0.99 0.23 0.42 0.00 0.00 0.00 0.00 1 ▇▁▁▁▂
PUB_AWARD_19 58 0.99 0.22 0.42 0.00 0.00 0.00 0.00 1 ▇▁▁▁▂
PUB_AWARD_18 125 0.98 0.22 0.41 0.00 0.00 0.00 0.00 1 ▇▁▁▁▂
PUB_AWARD_17 152 0.97 0.20 0.40 0.00 0.00 0.00 0.00 1 ▇▁▁▁▂
PUB_AWARD_16 4959 0.18 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_15 4957 0.18 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_14 5033 0.16 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_13 5111 0.15 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_12 5134 0.15 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_11 5183 0.14 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_10 5244 0.13 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_09 5315 0.12 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_08 5364 0.11 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_07 5250 0.13 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
PUB_AWARD_06 5065 0.16 1.00 0.00 1.00 1.00 1.00 1.00 1 ▁▁▇▁▁
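
The skim output shows that APR is stored in wide form, with one column per year from APR_RATE_2004_1000 to APR_RATE_2019_1000. Before APR can be graphed over time for each school, those columns need to be stacked into one row per team and year. The following is a minimal sketch of that reshaping step in base R, not the final analysis: `ncaa_toy` and its values are made up for illustration, and averaging teams within a school with equal weight is an assumption (weighting by NUM_OF_ATHLETES may be more appropriate).

```r
# Toy stand-in for ncaa_data: values are made up, column names follow the skim output.
ncaa_toy <- data.frame(
  SCL_NAME = c("School A", "School A", "School B"),
  SPORT_NAME = c("Baseball", "Softball", "Baseball"),
  APR_RATE_2017_1000 = c(950, 980, 1000),
  APR_RATE_2018_1000 = c(960, NA, 990),
  APR_RATE_2019_1000 = c(970, 990, 1000)
)

# Find the annual APR columns and pull the year out of each column name.
apr_cols <- grep("^APR_RATE_[0-9]{4}_1000$", names(ncaa_toy), value = TRUE)
apr_years <- as.integer(sub("^APR_RATE_([0-9]{4})_1000$", "\\1", apr_cols))

# Stack the wide columns into a long table: one row per team and year.
apr_long <- data.frame(
  SCL_NAME = rep(ncaa_toy$SCL_NAME, times = length(apr_cols)),
  SPORT_NAME = rep(ncaa_toy$SPORT_NAME, times = length(apr_cols)),
  year = rep(apr_years, each = nrow(ncaa_toy)),
  apr = unlist(ncaa_toy[apr_cols], use.names = FALSE)
)

# Mean APR per school and year; the formula interface drops missing team-years.
apr_trend <- aggregate(apr ~ SCL_NAME + year, data = apr_long, FUN = mean)
apr_trend
```

The same step could be written with `tidyr::pivot_longer()`. The resulting `apr_trend` table has one row per school and year, which is the shape a line plot with year on the x-axis expects.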

Data 2

Introduction and data

  • Identify the source of the data.

    • The data was sourced from https://www.kaggle.com/datasets/open-powerlifting/powerlifting-database, which was in turn sourced from https://www.openpowerlifting.org/.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data is a snapshot of openpowerlifting.org as of April 2019. Contestants would send in their results from around the world based on full competitions.
  • Write a brief description of the observations.

    • The observations are individual entrants and their various stats, which include their sex, age, weight, class, squat, bench, deadlift and the meet at which they achieved these records. The observations are validated by lifting federation results, which are official.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Does bodyweight correlate with increased capacity for bench, deadlift, and squat exercises?
    • Other research questions might include: Do some powerlifting federations perform better overall than others?
    • Do overall weightlifting records go up over time since the first recorded one?
    • Does sex have an impact on the amount of weight one can lift overall?
    • Does location have any bearing on how much weight one can lift?
    • Does using wraps or no wraps (wraps are wrist straps one uses to support weak wrists when lifting) affect the amount of weight lifted overall?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The research topic encompasses correlations between various attributes of an individual powerlifter and their records for bench, deadlift, and squat. Using the data collected, I’ll try to find patterns and relationships between lifters’ attributes and their records.

    • I hypothesize that lifters with a higher body weight tend to lift heavier weights. There are other hypotheses I wish to explore. For example, do some powerlifting federations perform better overall than others? This question is meant to determine whether region plays a part in overall performance.

    • Finally, I want to explore whether the overall weight lifted has trended upwards since the 1960s, because the data goes back to that time period.

    • Overall, with regard to my research question, I hypothesize that a higher bodyweight is correlated with a higher bench, squat, and deadlift.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Categorical Variables
      • Name (the lifter)

      • Federation (lifting federation)

      • Place

      • Sex

      • Equipment (refers to if the lifter used wraps or no wraps during the lift)

      • AgeClass

      • Division

      • WeightClassKg

      • MeetCountry

      • MeetState

      • MeetName

      • Date

      • Tested

    • Quantitative Variables
      • Age

      • BodyweightKg

      • Squat1Kg

      • Squat2Kg

      • Squat3Kg

      • Squat4Kg

      • Best3SquatKg

      • Bench1Kg

      • Bench2Kg

      • Bench3Kg

      • Bench4Kg

      • Best3BenchKg

      • Deadlift1Kg

      • Deadlift2Kg

      • Deadlift3Kg

      • Deadlift4Kg

      • Best3DeadliftKg

      • TotalKg

      • Wilks

      • McCulloch

      • Glossbrenner

      • IPFPoints

Glimpse of data

openpowerlifting <- read.csv("data/openpowerlifting.csv")

skimr::skim(openpowerlifting)
Data summary
Name openpowerlifting
Number of rows 1423354
Number of columns 37
_______________________
Column type frequency:
character 15
numeric 22
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1 1 45 0 412574 0
Sex 0 1 1 1 0 2 0
Event 0 1 1 3 0 7 0
Equipment 0 1 3 10 0 5 0
AgeClass 0 1 0 6 636554 17 0
Division 0 1 0 48 8178 4843 0
WeightClassKg 0 1 0 6 13312 225 0
Place 0 1 1 3 0 124 0
Tested 0 1 0 3 329462 2 0
Country 0 1 0 24 1034470 177 0
Federation 0 1 2 14 0 222 0
Date 0 1 10 10 0 5367 0
MeetCountry 0 1 2 22 0 96 0
MeetState 0 1 0 3 481809 112 0
MeetName 0 1 1 155 0 11599 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 665827 0.53 31.50 13.37 0.00 21.00 28.00 40.00 97.00 ▂▇▃▁▁
BodyweightKg 16732 0.99 84.23 23.22 15.10 66.70 81.80 99.15 258.00 ▂▇▁▁▁
Squat1Kg 1085774 0.24 114.10 147.14 -555.00 90.00 147.50 200.00 555.00 ▁▂▃▇▁
Squat2Kg 1090005 0.23 92.16 173.70 -580.00 68.00 145.00 205.00 566.99 ▁▂▂▇▁
Squat3Kg 1099512 0.23 30.06 200.41 -600.50 -167.50 110.00 192.50 560.00 ▁▅▂▇▁
Squat4Kg 1419658 0.00 71.36 194.52 -550.00 -107.84 135.00 205.00 505.50 ▁▂▁▇▁
Best3SquatKg 391904 0.72 174.00 69.24 -477.50 122.47 167.83 217.50 575.00 ▁▁▆▇▁
Bench1Kg 923575 0.35 83.89 105.20 -480.00 57.50 105.00 145.00 467.50 ▁▁▅▇▁
Bench2Kg 929868 0.35 55.07 130.30 -507.50 -52.50 95.00 145.00 487.50 ▁▂▅▇▁
Bench3Kg 944869 0.34 -18.52 144.23 -575.00 -140.00 -60.00 117.50 478.54 ▁▃▇▇▁
Bench4Kg 1413849 0.01 24.85 165.63 -500.00 -127.50 77.50 157.50 487.61 ▁▅▅▇▁
Best3BenchKg 147173 0.90 116.54 54.84 -522.50 74.84 111.13 150.00 488.50 ▁▁▃▇▁
Deadlift1Kg 1059810 0.26 162.70 108.68 -461.00 125.00 180.00 226.80 450.00 ▁▁▁▇▁
Deadlift2Kg 1067331 0.25 130.23 162.68 -470.00 115.00 177.50 230.00 460.40 ▁▂▁▇▁
Deadlift3Kg 1083407 0.24 13.00 215.05 -587.50 -210.00 117.50 205.00 457.50 ▁▆▂▇▂
Deadlift4Kg 1414108 0.01 78.91 192.61 -461.00 -110.00 145.15 210.00 418.00 ▁▃▁▇▂
Best3DeadliftKg 341546 0.76 187.26 62.33 -410.00 138.35 185.00 230.00 585.00 ▁▁▇▇▁
TotalKg 110170 0.92 395.61 201.14 2.50 232.50 378.75 540.00 1367.50 ▆▇▃▁▁
Wilks 118947 0.92 288.22 123.18 1.47 197.90 305.20 374.56 779.38 ▅▆▇▁▁
McCulloch 119100 0.92 296.07 124.97 1.47 204.82 312.03 383.76 804.40 ▅▆▇▁▁
Glossbrenner 118947 0.92 271.85 117.56 1.41 182.81 285.94 355.28 742.96 ▅▇▇▁▁
IPFPoints 150068 0.89 485.43 113.35 2.16 402.86 478.05 559.70 1245.93 ▁▇▆▁▁
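
The skim output shows negative minimums in the lift columns (for example, Best3SquatKg has a p0 of -477.50). Those values appear to encode failed attempts rather than real weights, which is an assumption worth confirming against the OpenPowerlifting data documentation. Left in place, they would distort any correlation with bodyweight. The sketch below shows one way to blank them out and then compute the bodyweight correlations from the research question. `lifts_toy` and its values are made up for illustration.

```r
# Toy stand-in for openpowerlifting: values are made up, column names follow the skim output.
lifts_toy <- data.frame(
  BodyweightKg    = c(60, 75, 90, 105, 120, 82.5),
  Best3SquatKg    = c(120, 160, 200, 230, 260, -180),
  Best3BenchKg    = c(70, 100, 125, 150, 170, NA),
  Best3DeadliftKg = c(150, 190, 230, 250, 280, 210)
)

lift_cols <- c("Best3SquatKg", "Best3BenchKg", "Best3DeadliftKg")

# Assumption: non-positive values are not successful lifts, so treat them as missing.
lifts_clean <- lifts_toy
lifts_clean[lift_cols] <- lapply(
  lifts_clean[lift_cols],
  function(x) ifelse(x > 0, x, NA)
)

# Pearson correlation of bodyweight with each lift, using only complete pairs.
bw_cor <- sapply(lifts_clean[lift_cols], function(x) {
  cor(lifts_clean$BodyweightKg, x, use = "complete.obs")
})
bw_cor
```

On the full data it may also be worth computing these correlations separately by Sex and Equipment, since both are likely to shift the amount lifted at a given bodyweight.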

Data 3

Introduction and data

  • Identify the source of the data.

    • Source: https://www.kaggle.com/datasets/jayrav13/olympic-track-field-results?resource=download

    • Author: Jay Ravaliya

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was originally collected from https://olympic.org/athletics using a basic Python web scraper around six years ago.
  • Write a brief description of the observations.

    • Each row represents an athlete, with characteristics detailed across eight columns: gender (athlete gender), event (type of T&F race), location (Olympics host nation), year (Olympics year), medal (Gold, Silver, or Bronze medal), name (athlete name), nationality (athlete nationality), and results (athlete’s race time).
  • Address ethical concerns about the data, if any.

    • No ethical concerns exist with this data.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Is athlete (male and female) performance affected when competing at events hosted in their home country versus abroad?

      • This question is important because it looks into factors other than raw athletic ability and performance that may influence an Olympic athlete’s results.

        Other potential research question ideas:

        • How has the average performance of athletes in a particular event changed over time?

        • What is the correlation between an athlete’s nationality and their chances of winning a gold medal?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • This research topic looks at all Olympic Track and Field results from 1896-2016 to discover the effects of event location on the results of male and female athletes. Countless factors such as altitude, climate, and cultural differences can potentially affect athlete performance, and these factors are also subject to change across locations. Locations with conditions that an athlete is more familiar with or places that they feel a cultural/historical connection with may positively impact athletic performance. We thus hypothesize that athletes participating in events within their home nation perform better compared to those competing abroad.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Variables to be utilized by main research question:

      • gender (categorical)

      • location (categorical)

      • nationality (categorical)

      • results (quantitative)

      • medal (categorical)

      • name (categorical)

    • Variables to be utilized by second research question:

      • results (quantitative)

      • medal (categorical)

      • name (categorical)

      • event (categorical)

      • year (quantitative)

    • Variables to be utilized by third research question:

      • name (categorical)

      • nationality (categorical)

      • medal (categorical)
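
The variable lists above treat results as quantitative, but the column is read in as text (chr) when the file is loaded, so it has to be converted before any averaging. The helper below is a hypothetical sketch of that conversion for time-based events. It assumes results are written as seconds ("9.81"), minutes and seconds ("1:43.55"), or hours, minutes, and seconds ("2:08:44"); the actual formats in the file should be checked first, and field events measured in distance or points would need separate handling.

```r
# Hypothetical helper: convert "ss.ss", "m:ss.ss", or "h:mm:ss" strings to seconds.
# Anything that does not fit those patterns becomes NA.
result_to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE), function(parts) {
    nums <- suppressWarnings(as.numeric(parts))
    if (length(nums) == 0 || anyNA(nums)) {
      return(NA_real_)
    }
    # The last piece is seconds, the one before it minutes, then hours.
    sum(nums * 60^rev(seq_along(nums) - 1))
  }, numeric(1))
}

# Made-up example strings, not rows from the data set.
example_results <- c("9.81", "1:43.55", "2:08:44", "None")
result_to_seconds(example_results)
```

Unparseable entries come back as NA rather than raising an error, so they can be counted and inspected before being dropped.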

Glimpse of data

olympics <- read_csv("data/results.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 2394 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Gender, Event, Location, Medal, Name, Nationality, Result
dbl (1): Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(olympics)
Data summary
Name olympics
Number of rows 2394
Number of columns 8
_______________________
Column type frequency:
character 7
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Gender 0 1 1 1 0 2 0
Event 0 1 8 24 0 47 0
Location 0 1 3 21 0 23 0
Medal 0 1 1 1 0 3 0
Name 0 1 4 30 0 1682 0
Nationality 0 1 3 3 0 97 0
Result 0 1 3 11 0 1951 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Year 0 1 1970.38 34.71 1896 1948 1976 2000 2016 ▃▃▅▅▇
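
To test the home-country hypothesis, each row needs a flag saying whether the athlete competed at home. Nationality is a three-letter code, while Location is free text with 23 distinct values, so the two cannot be compared directly. The sketch below assumes Location holds host city names and uses a hand-written lookup table to map them to country codes. `olympics_toy`, `host_lookup`, and all of their values are made up for illustration.

```r
# Toy stand-in for olympics: rows are made up, column names follow the read_csv output.
olympics_toy <- data.frame(
  Location    = c("London", "London", "Rio", "Rio", "Beijing"),
  Nationality = c("GBR", "USA", "BRA", "JAM", "CHN"),
  Medal       = c("G", "S", "B", "G", "S")
)

# Hypothetical lookup from host location to 3-letter country code; a real
# analysis would need one entry for each of the 23 locations in the data.
host_lookup <- c(London = "GBR", Rio = "BRA", Beijing = "CHN")

# Flag rows where the athlete's nationality matches the host country.
olympics_toy$host_code <- unname(host_lookup[olympics_toy$Location])
olympics_toy$at_home <- olympics_toy$Nationality == olympics_toy$host_code

# Share of each medal type won by home athletes.
home_share <- tapply(olympics_toy$at_home, olympics_toy$Medal, mean)
home_share
```

One design caveat: Medal has no missing values in the skim output, so the file appears to contain medalists only. A home advantage would then have to be measured through medal shares or winning marks, not through the results of all competitors.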