Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    The source of the data is UNHCR.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    Per the UNHCR website, the data is sourced primarily from governments and also from UNHCR operations. The UN is generally a reliable source of information and generally gathers its data in an ethical manner.

  • Write a brief description of the observations.

    Each observation gives the percentage of refugees in a given country in a given year. The data is broken out by variables such as region, gender, and refugee/asylum-seeker status.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    How does the percentage of refugees in different countries change per year?

    How does the percentage of refugees from different countries change per year?

    These questions are important because they allow us to see which countries are struggling the most with issues concerning migration, so that we can focus on addressing the problems these countries are facing. We can also see which countries are taking in the most refugees, so the data can help us prepare and allocate resources for future numbers of refugees.

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    Our research focuses on refugees and the countries they come from. It examines the total population of countries and the percentage of that population that refugees make up. We hypothesize that countries in North America and Europe take in a substantial number of refugees and that this number will continue to grow.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Country is a categorical variable. Percent and year are quantitative variables.

Glimpse of data

# Read the raw CSV and summarize every column
migration <- read.csv("data/intl-mig-ref.csv")
skim(migration)
Data summary
Name migration
Number of rows 7221
Number of columns 7
_______________________
Column type frequency:
character 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
T05 0 1 1 19 0 264 0
International.migrants.and.refugees 0 1 0 29 1 264 0
X 0 1 4 4 0 10 0
X.1 0 1 6 61 0 9 0
X.2 0 1 1 11 0 4015 0
X.3 0 1 0 1035 3319 98 0
X.4 0 1 6 124 0 3 0
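
The summary above shows all seven columns read as character, with names such as X and X.1, which suggests that the first line of the CSV is a title row rather than the real header. The following is a minimal sketch of one way to handle that, assuming the header sits on the second line. The toy CSV text, its column names, and its values are illustrative stand-ins, not the contents of data/intl-mig-ref.csv.

```r
# Toy CSV text mimicking the ASSUMED layout: a title line, then the header.
# Column names and values are made up for illustration.
csv_text <- 'T05,International migrants and refugees,,,
Code,Area,Year,Series,Value
1,World,2005,Series A,"1,234"
1,World,2010,Series A,"2,500"
'

# skip = 1 drops the title line so the second line is used as the header
migration_toy <- read.csv(text = csv_text, skip = 1)

# Values written with thousands separators are read as character;
# strip the commas before converting to numeric
migration_toy$Value <- as.numeric(gsub(",", "", migration_toy$Value))

str(migration_toy)
```

If the real file follows the same layout, the same `skip = 1` argument could be added to the `read.csv()` call on the actual path, and the column summary would need to be re-run to confirm the types.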

Data 2

Introduction and data

  • Identify the source of the data.

Data source is FiveThirtyEight.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

Data is collected for every game of every NHL season since the 1917-18 season. The data is sourced from Hockey-Reference.com, which contains information about every player, current standings, game statistics, etc. The forecasts for the upcoming season are based on the Elo rating system, created by Arpad Elo.

  • Write a brief description of the observations.

Each observation is one game that was played or will be played during an NHL season, and the data set contains 24 columns. The columns describe the games, some in more detail than others. For example, there is season (the year of the season), date, playoff (whether the game was a playoff game or not), status (whether the game was played, has not been played yet, or is currently being played), and others. Furthermore, there are columns describing win probability, expected points, score, rating, game importance, game rating, etc. for both the home and away teams.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How do win rates differ between home and away games? Have some teams historically been more affected by home versus away games? Do win rates between home and away games differ more during the postseason than the regular season?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The topic deals with the idea of a “home-court advantage”. This is a well-known idea across many sports: the home team is thought to gain an advantage through factors like spectator support and not having to travel for games.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Wins and win percentage are quantitative variables. Game location, team and regular/postseason are categorical variables.

Glimpse of data

# Read the NHL Elo game data and summarize every column
nhl <- read.csv("data/nhl_elo (1).csv")
skim(nhl)
Data summary
Name nhl
Number of rows 66030
Number of columns 24
_______________________
Column type frequency:
character 7
numeric 17
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
date 0 1 10 10 0 15617 0
status 0 1 3 4 0 2 0
ot 0 1 0 3 55247 8 0
home_team 0 1 12 29 0 40 0
away_team 0 1 12 29 0 40 0
home_team_abbr 0 1 3 3 0 41 0
away_team_abbr 0 1 3 3 0 41 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
season 0 1.00 1991.54 23.86 1918.00 1979.00 1996.00 2010.00 2023.00 ▁▂▃▇▇
playoff 0 1.00 0.07 0.26 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
neutral 0 1.00 0.00 0.07 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
home_team_pregame_rating 0 1.00 1500.69 54.64 1260.32 1467.10 1503.22 1538.05 1711.90 ▁▂▇▅▁
away_team_pregame_rating 0 1.00 1500.53 54.39 1260.97 1467.15 1503.21 1537.42 1714.19 ▁▂▇▃▁
home_team_winprob 0 1.00 0.57 0.10 0.15 0.50 0.57 0.64 0.91 ▁▂▇▆▁
away_team_winprob 0 1.00 0.43 0.10 0.09 0.36 0.43 0.50 0.85 ▁▆▇▂▁
overtime_prob 0 1.00 0.23 0.01 0.18 0.23 0.23 0.24 0.24 ▁▁▁▅▇
home_team_expected_points 0 1.00 1.24 0.18 0.46 1.13 1.25 1.36 1.84 ▁▂▇▆▁
away_team_expected_points 0 1.00 0.99 0.19 0.34 0.87 0.99 1.12 1.73 ▁▅▇▂▁
home_team_score 251 1.00 3.27 1.93 0.00 2.00 3.00 4.00 15.00 ▇▅▁▁▁
away_team_score 251 1.00 2.82 1.75 0.00 2.00 3.00 4.00 16.00 ▇▃▁▁▁
home_team_postgame_rating 251 1.00 1500.71 54.79 1262.08 1467.00 1503.20 1538.17 1713.17 ▁▂▇▃▁
away_team_postgame_rating 251 1.00 1500.47 54.59 1260.32 1466.97 1503.16 1537.52 1715.98 ▁▂▇▃▁
game_quality_rating 0 1.00 44.85 32.72 0.00 14.00 42.00 74.00 100.00 ▇▅▅▅▅
game_importance_rating 64718 0.02 30.11 31.15 0.00 4.00 19.00 50.00 100.00 ▇▃▂▁▂
game_overall_rating 64718 0.02 38.72 25.44 0.00 16.00 39.50 58.00 99.00 ▇▆▇▅▂
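
The research question about home versus away win rates, split by regular season and postseason, could be approached along the following lines. This is only a sketch: it uses the column names playoff, home_team_score, and away_team_score from the summary above, but the toy rows are made up, and counting a tie as "not a home win" is an assumption that the real analysis may want to revisit.

```r
library(dplyr)

# Toy rows using column names from the summary above; the scores are made up
nhl_toy <- data.frame(
  playoff         = c(0, 0, 0, 1, 1),
  home_team_score = c(3, 2, 4, 1, 5),
  away_team_score = c(2, 3, 1, 2, 2)
)

home_win_rates <- nhl_toy |>
  # drop games without a final score (e.g. not yet played)
  filter(!is.na(home_team_score), !is.na(away_team_score)) |>
  # ASSUMPTION: a tie counts as "not a home win"
  mutate(home_win = home_team_score > away_team_score) |>
  group_by(playoff) |>
  summarize(
    games         = n(),
    home_win_rate = mean(home_win),
    .groups       = "drop"
  )

home_win_rates
```

Adding home_team to the `group_by()` call would give the same rate per team, which speaks to the question of whether some teams are more affected than others.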

Data 3

Introduction and data

  • Identify the source of the data. FiveThirtyEight

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data). Locations where the data was collected were determined by HOLC maps drawn in 1935-40, downloaded from the Mapping Inequality Project. Current population and race/ethnicity data from these areas was collected from the 2020 census.

  • Write a brief description of the observations. Each observation is an area-grade pair (each area has up to 4 observations, one for each of the 4 grades). Each area-grade pair contains data about the current population and the percent of residents who are a certain race/ethnicity: white, Black, Hispanic, Asian, or Other.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    How does the current racial makeup differ between areas previously classified/redlined as A vs. D?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    This question is important because it shows the longevity of the impact that racist practices (segregation via redlining from 1935-40) have on neighborhoods and communities.

    Redlining was done in the late 1930s in an effort to mark areas as good or bad for mortgage lending, but it ended up segregating metropolitan areas because it was done in bad faith, classifying disproportionately Black communities as bad (D grade). I believe that this has a lasting impact to this day: places marked as D will have a greater proportion of minority residents than white residents, and vice versa for areas marked as A.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Grade (categorical - grade the area received when it was redlined). Percent white, percent Black, percent Hispanic, percent Asian, and percent Other (quantitative - percent of residents who are a certain race/ethnicity in the area with a certain grade).

Glimpse of data

# Read the redlining data (one row per metro area and HOLC grade) and summarize every column
redlining <- read.csv("data/metro-grades.csv")
skim(redlining)
Data summary
Name redlining
Number of rows 551
Number of columns 28
_______________________
Column type frequency:
character 2
numeric 26
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
metro_area 0 1 8 46 0 138 0
holc_grade 0 1 1 1 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
white_pop 0 1 28838.97 86150.59 158.00 3508.50 8448.00 21756.50 1164087.00 ▇▁▁▁▁
black_pop 0 1 15735.14 62120.43 1.00 530.00 2380.00 8218.00 894704.00 ▇▁▁▁▁
hisp_pop 0 1 19139.83 106594.49 6.00 407.50 1759.00 6920.50 1492338.00 ▇▁▁▁▁
asian_pop 0 1 6216.87 41375.78 1.00 82.50 305.00 1446.50 767862.00 ▇▁▁▁▁
other_pop 0 1 3809.54 14369.45 10.00 354.00 992.00 2350.00 239048.00 ▇▁▁▁▁
total_pop 0 1 73740.32 297082.01 228.00 7205.00 17460.00 43386.00 4558038.00 ▇▁▁▁▁
pct_white 0 1 55.45 22.48 3.77 39.27 59.01 74.71 94.12 ▃▅▆▇▆
pct_black 0 1 19.98 18.67 0.31 5.86 13.39 28.44 85.40 ▇▃▂▁▁
pct_hisp 0 1 15.54 16.86 1.10 4.62 8.80 19.89 93.90 ▇▂▁▁▁
pct_asian 0 1 3.23 3.67 0.09 1.00 2.01 3.91 31.39 ▇▁▁▁▁
pct_other 0 1 5.79 2.07 0.88 4.52 5.55 6.95 17.73 ▂▇▂▁▁
lq_white 0 1 1.05 0.48 0.16 0.78 0.98 1.22 4.11 ▇▇▁▁▁
lq_black 0 1 1.04 0.65 0.02 0.53 1.02 1.40 6.64 ▇▃▁▁▁
lq_hisp 0 1 1.02 0.47 0.18 0.69 1.00 1.26 3.64 ▆▇▁▁▁
lq_asian 0 1 0.82 0.43 0.11 0.54 0.74 1.03 2.68 ▆▇▂▁▁
lq_other 0 1 1.05 0.20 0.39 0.94 1.05 1.16 2.20 ▁▇▅▁▁
surr_area_white_pop 0 1 319868.85 710162.12 8715.00 47157.00 92469.00 248955.00 5169444.00 ▇▁▁▁▁
surr_area_black_pop 0 1 119940.46 297059.96 1080.00 11130.00 30777.00 83252.00 2626933.00 ▇▁▁▁▁
surr_area_hisp_pop 0 1 163548.68 620217.68 588.00 5652.00 23603.00 68357.00 5415084.00 ▇▁▁▁▁
surr_area_asian_pop 0 1 63296.08 253564.47 113.00 1539.00 5186.00 19426.00 2077111.00 ▇▁▁▁▁
surr_area_other_pop 0 1 34257.60 81771.26 1248.00 4450.00 9243.00 24755.00 675943.00 ▇▁▁▁▁
surr_area_pct_white 0 1 54.77 17.34 11.98 42.08 57.22 66.97 89.11 ▂▅▆▇▅
surr_area_pct_black 0 1 20.05 15.19 2.07 8.25 16.55 26.49 76.30 ▇▅▂▁▁
surr_area_pct_hisp 0 1 15.58 15.19 1.53 5.39 9.57 21.00 81.20 ▇▂▁▁▁
surr_area_pct_asian 0 1 4.08 4.23 0.37 1.35 3.10 4.91 29.21 ▇▂▁▁▁
surr_area_pct_other 0 1 5.51 1.77 1.80 4.34 5.31 6.43 15.52 ▃▇▂▁▁
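
One possible first pass at the A versus D comparison is to average the percentage columns within each grade. The sketch below uses the column names metro_area, holc_grade, pct_white, and pct_black from the summary above, but the toy values are made up. It takes an unweighted mean across metro areas, which is an assumption; weighting by total_pop would be a reasonable alternative.

```r
library(dplyr)

# Toy rows using column names from the summary above; the values are made up
redlining_toy <- data.frame(
  metro_area = c("Metro 1", "Metro 1", "Metro 2", "Metro 2"),
  holc_grade = c("A", "D", "A", "D"),
  pct_white  = c(80, 40, 70, 30),
  pct_black  = c(5, 35, 10, 40)
)

grade_summary <- redlining_toy |>
  # keep only the two grades named in the research question
  filter(holc_grade %in% c("A", "D")) |>
  group_by(holc_grade) |>
  # ASSUMPTION: unweighted mean across metro areas
  summarize(across(starts_with("pct_"), mean), .groups = "drop")

grade_summary
```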