Project Proposals: MLB, Weather, or NY State Tests

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • The data come from Baseball Reference, a source for online information on MLB and baseball history and statistics that is updated daily. Baseball Reference is the baseball arm of Sports Reference, whose stated purpose is to “be the trusted source of information and tools that inspire and empower users to enjoy, understand, and share the sports they love” (URL: https://www.sports-reference.com/).
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data were collected over the 2006-2022 MLB seasons by Sportradar, the official statistics collector of the MLB. Sportradar gets its data directly from the MLB and from data journalists with whom it has contracts (URL: https://www.youtube.com/watch?v=mtyrM53R830).
  • Write a brief description of the observations.

    • The observations cover the standard statistics that the MLB, professional coaches and statisticians, and committed fans use to measure batting, fielding, and pitching performance, for every team over the 2006-2022 seasons. Team name changes have been accounted for by using each team’s present name (e.g. the Cleveland Indians have been updated to the Cleveland Guardians to reflect their recent name change). The dataset includes (but is not limited to) batting data such as batting average, on-base percentage, singles, runs batted in (RBIs), doubles, triples, and home runs; fielding data such as defensive runs saved (Rdrs) and fielding percentage; and pitching data such as games started, earned run average, win-loss percentage, hits given up per nine innings, home runs given up per nine innings, strikeouts per nine innings, and strikeouts per walk. We say “includes (but is not limited to)” because we plan to use only some of the variables in the dataset, namely those listed here; we want the unlisted variables on hand in case one or more of them becomes necessary.
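The team-name update described above could be sketched as follows. This is a minimal illustration on a toy data frame: `team_seasons`, its columns `team` and `season`, and the lookup vector `current_names` are hypothetical names, not taken from the actual file.

```r
# Toy stand-in for the team-season data; column names are assumptions.
team_seasons <- data.frame(
  team   = c("Cleveland Indians", "Cleveland Guardians", "New York Yankees"),
  season = c(2021, 2022, 2022),
  stringsAsFactors = FALSE
)

# Lookup of former name -> present name (only the example given in the text).
current_names <- c("Cleveland Indians" = "Cleveland Guardians")

# Replace any former name with its present name; other teams are left unchanged.
is_old <- team_seasons$team %in% names(current_names)
team_seasons$team[is_old] <- unname(current_names[team_seasons$team[is_old]])

unique(team_seasons$team)
```

Keeping the renames in one lookup vector means further name changes, if any turn up in the data, can be added in a single place.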

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • In the MLB, what is more important: Offense (batting), or Defense (fielding/pitching)?

    • What’s more important for winning games and overall success (e.g. making the playoffs, winning a championship), hitting or pitching?

    • What’s a more significant statistic for winning games, batting average or on base percentage?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • We want to see if offense or defense is more important in helping MLB teams win games and have sustained overall success throughout the season and postseason. We hypothesize that defense is likely more important; a firebrand offensive team like the Yankees will have great success and make the playoffs, but will ultimately not see sustained success and lose to teams who focus on having a deep pitching and fielding core.
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Both types of variables are involved in this research question: boolean/categorical variables, such as whether a team made the playoffs, are necessary for determining long-term success, while core statistics like batting average and earned run average are numeric/quantitative variables.
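One possible first look at the offense-versus-defense question is to compare how strongly an offensive and a defensive statistic each track win-loss percentage, and how a quantitative statistic differs across the categorical playoff variable. The sketch below uses made-up numbers, and the column names (`win_pct`, `batting_avg`, `era`, `made_playoffs`) are assumptions; a correlation comparison is only an illustration, not necessarily the method we will settle on.

```r
# Made-up team-season values for illustration only; not real MLB data.
teams <- data.frame(
  win_pct       = c(0.62, 0.55, 0.48, 0.41, 0.58, 0.45),
  batting_avg   = c(0.265, 0.258, 0.249, 0.240, 0.251, 0.255),
  era           = c(3.40, 3.90, 4.30, 4.90, 3.60, 4.50),
  made_playoffs = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Quantitative variables: linear association of each statistic with winning.
cor_offense <- cor(teams$win_pct, teams$batting_avg)
cor_defense <- cor(teams$win_pct, teams$era)  # lower ERA is better, so a negative sign is expected

# Categorical variable: mean ERA for playoff and non-playoff teams.
era_by_playoffs <- tapply(teams$era, teams$made_playoffs, mean)

c(offense = cor_offense, defense = cor_defense)
era_by_playoffs
```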

Glimpse of data

# Read the CSV into a data frame first; skim() summarizes a data frame, not a file name.
# Group members: read from "data/MLB-reference-data.csv" instead.
mlb <- read.csv("MLB-reference-data.csv")

skimr::skim(mlb)

Data 2

Introduction and data

  • Identify the source of the data.

    • https://think.cs.vt.edu/corgis/csv/weather/

    • The data are provided by the National Weather Service, whose 122 Weather Forecast Offices (WFOs) throughout the country issue daily weather reports for cities across the country.

    • This data set takes the information from these WFO reports for cities across the country and summarizes it at the weekly level for all of 2016.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data were collected by the Weather Forecast Offices (WFOs) for cities across the country and then summarized at the weekly level for all of 2016.
  • Write a brief description of the observations.

    • Each observation includes the date, location (station code and station location), temperature (maximum, average, and minimum), and wind speed and direction.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • How has climate change shifted or changed rainfall amounts across the United States?

    • How has climate change shifted or changed rainfall amounts globally?

    • How has climate change shifted or changed rainfall amounts across California?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • Assumption: Each location is independent; its weather is not affected by factors from other locations.
  • Identify the types of variables in your research question. Categorical? Quantitative?
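    • Both kinds of variables are involved: location (such as the station) is categorical, while rainfall amount, temperature, and wind speed are quantitative.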
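To illustrate how a categorical variable and a quantitative one would combine here, the sketch below totals weekly rainfall by state. The values are made up, and the column names (`state`, `week_of`, `precipitation`) are assumptions rather than the names in the actual file.

```r
# Made-up weekly records for illustration only; column names are assumptions.
weather <- data.frame(
  state         = c("California", "California", "New York", "New York"),
  week_of       = as.Date(c("2016-01-03", "2016-01-10", "2016-01-03", "2016-01-10")),
  precipitation = c(0.5, 1.2, 0.8, 0.0)
)

# Total rainfall per state: a categorical grouping of a quantitative measure.
rain_by_state <- aggregate(precipitation ~ state, data = weather, FUN = sum)

rain_by_state
```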

Glimpse of data

# Read the CSV into a data frame first; skim() summarizes a data frame, not a file name.
# Group members: read from "data/weather.csv" instead.
weather <- read.csv("weather.csv")

skimr::skim(weather)

Data 3

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • These data were collected through the New York State school system for the 2021-2022 school year. They were self-reported by schools for students taking the standardized Math and ELA tests.
  • Write a brief description of the observations.

    • The observations cover both the standardized Math and ELA tests and give, by school, the number of students (grades 3-8) who took the tests and the number who refused. They are split into subgroups including All Students, English Language Learner, Students with Disabilities, and Economically Disadvantaged, which makes the differences in refusal rates by subgroup and by school clearer.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Depending on economic status, English Language Learner status, and disability status, are students in New York State more or less likely to opt out of the Grades 3-8 English Language Arts and Math standardized tests? The intent of our work is to see whether there is any connection between test refusal rates and these variables, and whether the rates differ between the Math and ELA standardized tests as outlined in the data set. By looking at these variables, we can form an analysis of the main reasons students choose to opt out of participating in the standardized tests.
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • This research topic involves standardized assessments given to children in grades 3-8 in New York State. We want to look at whether there is a relationship between economic status and the rate of standardized test refusals, in comparison to the other subgroups and to students as a whole.

    • Hypothesis: We believe students who are economically disadvantaged will opt out of standardized testing more often due to the nature and expectations of standardized testing. 

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • The outcome variables in this research question are quantitative: we are looking at the percentages of students who opted out of the standardized tests versus those who took them. The subgroup and the test subject (Math or ELA) are categorical.
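A minimal sketch of how the refusal percentages could be computed from the counts of students who took and who refused a test. The numbers are made up, the column names (`subgroup`, `subject`, `num_tested`, `num_refused`) are assumptions, and we also assume the denominator is the number tested plus the number refused.

```r
# Made-up counts for a single school; column names are assumptions.
refusals <- data.frame(
  subgroup    = c("All Students", "Economically Disadvantaged",
                  "English Language Learner", "Students with Disabilities"),
  subject     = "Math",
  num_tested  = c(180, 70, 18, 25),
  num_refused = c(20, 10, 2, 5)
)

# Refusal rate as a percentage, assuming the denominator is tested + refused.
refusals$refusal_pct <- 100 * refusals$num_refused /
  (refusals$num_tested + refusals$num_refused)

refusals[, c("subgroup", "refusal_pct")]
```

Comparing `refusal_pct` across the rows then gives the subgroup-by-subgroup contrast that the hypothesis is about.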

Glimpse of data

# Read the CSV into a data frame first; skim() summarizes a data frame, not a file name.
# Group members: read from "data/3-8-ELA-MATH-REFUSALS.csv" instead.
refusals <- read.csv("3-8-ELA-MATH-REFUSALS.csv")

skimr::skim(refusals)