Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

  • Write a brief description of the observations.

The data come from the RMS Titanic, a British passenger liner that sank in 1912. The dataset covers information about passengers on the ship’s final voyage and includes 891 of the 2224 people on board. The information appears to have been compiled from the Titanic’s records with the help of various researchers during the 1900s. Each observation is a passenger, described by a passenger ID along with survival status, class, name, sex, age, and other information.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
  • Identify the types of variables in your research question. Categorical? Quantitative?

Our research question is: how do survival rates differ across passenger classes on the Titanic, and how does this vary by passenger gender? We think this question is important for understanding how passengers were prioritized and saved in the short time the ship had before sinking. The research topic explores survival rates by gender and class. We hypothesize that first-class passengers had better survival rates than passengers in other classes because the crew likely prioritized getting them to safety. We also expect that women had a higher survival rate than men because of the consensus that women and children should be protected, a sentiment that was especially popular at the time of the sinking. Survival rate, computed as the proportion of passengers whose Survived value is 1, is a quantitative variable, while gender and class are categorical.
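One way to start checking these hypotheses is to compute the survival rate within each class-by-sex group. The following is a minimal sketch rather than the final analysis: `survival_by_group` is a hypothetical helper name, the column names `Survived`, `Sex`, and `Pclass` are taken from the data glimpse, and the toy rows are invented only to show the shape of the result.

```r
library(dplyr)

# Survival rate for every class-by-sex group.
# Assumes Survived is coded 0/1, so its mean is the proportion who survived.
survival_by_group <- function(data) {
  data %>%
    group_by(Pclass, Sex) %>%
    summarise(
      n = n(),
      survival_rate = mean(Survived),
      .groups = "drop"
    )
}

# Invented rows (not real Titanic records), used only to illustrate the output.
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 0, 0),
  Sex      = c("female", "male", "female", "female", "male", "male"),
  Pclass   = c(1, 1, 3, 3, 3, 3)
)
toy_rates <- survival_by_group(toy)
print(toy_rates)

# With the real data, the call would be: survival_by_group(titanic)
```

Grouping by both variables at once keeps the class and gender effects visible side by side instead of averaging one of them away.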

Glimpse of data

# add code here
titanic <- read_csv("data/titanic.csv")
Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(titanic)
Data summary
Name titanic
Number of rows 891
Number of columns 12
_______________________
Column type frequency:
character 5
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Name 0 1.00 12 82 0 891 0
Sex 0 1.00 4 6 0 2 0
Ticket 0 1.00 3 18 0 681 0
Cabin 687 0.23 1 15 0 147 0
Embarked 2 1.00 1 1 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
PassengerId 0 1.0 446.00 257.35 1.00 223.50 446.00 668.5 891.00 ▇▇▇▇▇
Survived 0 1.0 0.38 0.49 0.00 0.00 0.00 1.0 1.00 ▇▁▁▁▅
Pclass 0 1.0 2.31 0.84 1.00 2.00 3.00 3.0 3.00 ▃▁▃▁▇
Age 177 0.8 29.70 14.53 0.42 20.12 28.00 38.0 80.00 ▂▇▅▂▁
SibSp 0 1.0 0.52 1.10 0.00 0.00 0.00 1.0 8.00 ▇▁▁▁▁
Parch 0 1.0 0.38 0.81 0.00 0.00 0.00 0.0 6.00 ▇▁▁▁▁
Fare 0 1.0 32.20 49.69 0.00 7.91 14.45 31.0 512.33 ▇▁▁▁▁

Data 2

Introduction and data

  • Identify the source of the data.

    • The data comes from the CORGIS dataset project. The dataset contains a variety of information about billionaires and has 2614 rows, ranking the billionaires for the years 1996, 2001, and 2014.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was originally collected from the Forbes World’s Billionaires lists from 1996 to 2014, with additional data from the Peterson Institute for International Economics.
  • Write a brief description of the observations.

    • The data provide rankings of the billionaires for the years 1996, 2001, and 2014. The observations include the sector of the billionaire’s company, age, gender, and whether the billionaire’s wealth was inherited or not.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • We are interested in finding out which factors are most strongly associated with billionaire ranking (for example, whether company sector or inherited wealth is more strongly associated with a higher ranking).
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • We chose this research question because we think it would be interesting to see what most contributes to being a billionaire and what factors most correlate with people getting wealthier. We think that company sector will have the largest impact. Specifically, we predict that over the years the highest ranking billionaires will come from companies that are more technological as those companies have been growing at fast rates.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • There are primarily categorical variables in our research question including the company sector, industry being profited from, wealth inheritance, and gender.
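A first pass at comparing factors could summarize rank within each level of one categorical variable at a time. The sketch below is a minimal example under stated assumptions, not a full analysis: `rank_by_factor` and its `factor_col` argument are hypothetical names, the column names `rank`, `year`, and `wealth.how.inherited` are taken from the skim output, and the toy rows are invented. A lower median rank means a higher position on the list.

```r
library(dplyr)

# Median rank and group size for each level of one categorical factor, per year.
# factor_col is the name of the grouping column, passed as a string.
rank_by_factor <- function(data, factor_col) {
  data %>%
    group_by(year, across(all_of(factor_col))) %>%
    summarise(
      n = n(),
      median_rank = median(rank),
      .groups = "drop"
    ) %>%
    arrange(year, median_rank)
}

# Invented rows (not real billionaire records), used only to illustrate the output.
toy <- data.frame(
  year = c(2014, 2014, 2014, 2014),
  rank = c(1, 2, 3, 4),
  wealth.how.inherited = c("not inherited", "not inherited", "father", "father")
)
toy_ranks <- rank_by_factor(toy, "wealth.how.inherited")
print(toy_ranks)

# With the real data, for example:
# rank_by_factor(billionaires, "wealth.how.inherited")
# rank_by_factor(billionaires, "wealth.how.industry")
```

Running the same helper on several factors gives comparable tables, which is one simple way to judge which factor separates the rankings most clearly.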

Glimpse of data

# add code here
billionaires <- read.csv("data/billionaires.csv")
skimr::skim(billionaires)
Data summary
Name billionaires
Number of rows 2614
Number of columns 22
_______________________
Column type frequency:
character 13
logical 3
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 5 45 0 2077 0
company.name 0 1 0 59 38 1578 0
company.relationship 0 1 0 46 46 75 0
company.sector 0 1 0 52 23 521 0
company.type 0 1 0 22 36 19 0
demographics.gender 0 1 0 14 34 4 0
location.citizenship 0 1 4 20 0 73 0
location.country.code 0 1 3 6 0 74 0
location.region 0 1 1 24 0 8 0
wealth.type 0 1 0 24 22 6 0
wealth.how.category 0 1 0 18 1 10 0
wealth.how.industry 0 1 0 31 1 20 0
wealth.how.inherited 0 1 6 24 0 6 0

Variable type: logical

skim_variable n_missing complete_rate mean count
wealth.how.from.emerging 0 1 1 TRU: 2614
wealth.how.was.founder 0 1 1 TRU: 2614
wealth.how.was.political 0 1 1 TRU: 2614

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rank 0 1 5.996700e+02 4.678900e+02 1 215.0 430 9.880e+02 1.565e+03 ▇▅▃▂▃
year 0 1 2.008410e+03 7.480000e+00 1996 2001.0 2014 2.014e+03 2.014e+03 ▂▂▁▁▇
company.founded 0 1 1.924710e+03 2.437800e+02 0 1936.0 1963 1.985e+03 2.012e+03 ▁▁▁▁▇
demographics.age 0 1 5.334000e+01 2.533000e+01 -42 47.0 59 7.000e+01 9.800e+01 ▁▂▁▇▃
location.gdp 0 1 1.769103e+12 3.547083e+12 0 0.0 0 7.250e+11 1.060e+13 ▇▁▁▁▁
wealth.worth.in.billions 0 1 3.530000e+00 5.090000e+00 1 1.4 2 3.500e+00 7.600e+01 ▇▁▁▁▁

Data 3

Introduction and data

  • Identify the source of the data.

    • Netflix
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • All data is derived from public statistics about Netflix’s content.
  • Write a brief description of the observations.

    • The observations list every show and movie that exists on the platform and give details about each title, such as its rating (for example, PG-13), its country of production, and its duration on Netflix.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    A research question for this data could be: what distinguishes movies from shows on Netflix?

    This question explores how movies and shows differ on the platform. For instance, one avenue of exploration is how the ratings of movies and shows differ by country of origin. With this data, one could also examine how ratings differ overall between movies and shows, and whether rating is related to how long a title stays on Netflix. The variables in this research question are mostly categorical (type, rating, and country); duration is stored as text in the data and would need to be converted to a number before it can be treated as quantitative.
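The comparison of ratings between movies and shows can be sketched as a table of rating shares within each type. This is a minimal sketch, assuming the `type` and `rating` columns shown in the skim output; `rating_share_by_type` is a hypothetical helper name and the toy rows are invented for illustration.

```r
library(dplyr)

# Share of each rating within each content type (Movie vs. TV Show).
# Titles with a missing rating are dropped before counting.
rating_share_by_type <- function(data) {
  data %>%
    filter(!is.na(rating)) %>%
    count(type, rating, name = "n_titles") %>%
    group_by(type) %>%
    mutate(share = n_titles / sum(n_titles)) %>%
    ungroup()
}

# Invented rows (not real Netflix titles), used only to illustrate the output.
toy <- data.frame(
  type   = c("Movie", "Movie", "Movie", "TV Show", "TV Show"),
  rating = c("PG-13", "PG-13", "R", "TV-MA", NA)
)
toy_shares <- rating_share_by_type(toy)
print(toy_shares)

# With the real data, the call would be: rating_share_by_type(netflix)
```

Shares are used instead of raw counts because the two types have different numbers of titles, so counts alone would not be comparable.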

Glimpse of data

# add code here
netflix <- read_csv("data/netflix_titles.csv")
Rows: 8807 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): show_id, type, title, director, cast, country, date_added, rating,...
dbl  (1): release_year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(netflix)
Data summary
Name netflix
Number of rows 8807
Number of columns 12
_______________________
Column type frequency:
character 11
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
show_id 0 1.00 2 5 0 8807 0
type 0 1.00 5 7 0 2 0
title 0 1.00 1 104 0 8807 0
director 2634 0.70 2 208 0 4528 0
cast 825 0.91 3 771 0 7692 0
country 831 0.91 4 123 0 748 0
date_added 10 1.00 11 18 0 1714 0
rating 4 1.00 1 8 0 17 0
duration 3 1.00 5 10 0 220 0
listed_in 0 1.00 6 79 0 514 0
description 0 1.00 61 248 0 8775 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
release_year 0 1 2014.18 8.82 1925 2013 2017 2019 2021 ▁▁▁▁▇

Data 4

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

  • Write a brief description of the observations.

    The data is collected from Division I college basketball seasons from 2013-2021. The 2020 season is kept separate because there was no postseason due to the COVID-19 pandemic. The data was scraped in 2021 from http://barttorvik.com/trank.php# and then cleaned to add postseason, seed, and year columns. Each observation in the data set represents a D1 college basketball team (TEAM) during a given season (YEAR). The columns for each observation contain data about the number of games played (G) and won (W) in the season, the athletic conference in which the school participates (CONF), and the quality of the season, including measures of adjusted offensive efficiency, effective field goal percentage shot, offensive rebound rate, free throw rate allowed, and three-point shooting. The POSTSEASON column indicates the round where the given team’s season ended, and the SEED column indicates the team’s seed in the NCAA March Madness Tournament.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Which team statistics are the best predictor of success in the NCAA tournament?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The NCAA Men’s basketball tournament challenge invites fans to predict the winners of a 64-team bracket. The results may seem random at times, but we want to explore whether certain team statistics can help predict team success. We hypothesize that teams with strong defensive statistics perform better in the tournament.
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • The variables are quantitative.
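One simple way to begin exploring the defense hypothesis is to average the efficiency measures by the round in which a team’s season ended. The sketch below is a minimal example, assuming the `POSTSEASON`, `ADJOE`, and `ADJDE` columns from the skim output; `efficiency_by_round` is a hypothetical helper name and the toy rows are invented. It also assumes that ADJDE measures points allowed, so that lower values indicate better defense.

```r
library(dplyr)

# Average offensive and defensive efficiency for each postseason round.
# Teams with a missing POSTSEASON value (no tournament appearance) are excluded.
efficiency_by_round <- function(data) {
  data %>%
    filter(!is.na(POSTSEASON)) %>%
    group_by(POSTSEASON) %>%
    summarise(
      teams = n(),
      mean_adjoe = mean(ADJOE),
      mean_adjde = mean(ADJDE),
      .groups = "drop"
    )
}

# Invented rows (not real team seasons), used only to illustrate the output.
toy <- data.frame(
  POSTSEASON = c("R64", "R64", "Champions", NA),
  ADJOE      = c(100, 110, 120, 95),
  ADJDE      = c(100, 98, 90, 105)
)
toy_eff <- efficiency_by_round(toy)
print(toy_eff)

# With the real data, the call would be: efficiency_by_round(cbb)
```

If the hypothesis holds, teams that reach later rounds would be expected to show lower average ADJDE than teams eliminated early.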

Glimpse of data

# add code here
cbb <- read_csv("data/cbb.csv")
Rows: 2455 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): TEAM, CONF, POSTSEASON
dbl (21): G, W, ADJOE, ADJDE, BARTHAG, EFG_O, EFG_D, TOR, TORD, ORB, DRB, FT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(cbb)
Data summary
Name cbb
Number of rows 2455
Number of columns 24
_______________________
Column type frequency:
character 3
numeric 21
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
TEAM 0 1.00 3 22 0 355 0
CONF 0 1.00 2 4 0 35 0
POSTSEASON 1979 0.19 2 9 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
G 0 1.00 31.49 2.66 15.0 30.00 31.00 33.00 40.00 ▁▁▅▇▁
W 0 1.00 16.28 6.61 0.0 11.00 16.00 21.00 38.00 ▂▇▇▃▁
ADJOE 0 1.00 103.30 7.38 76.6 98.30 103.00 108.00 129.10 ▁▃▇▃▁
ADJDE 0 1.00 103.30 6.61 84.0 98.50 103.50 107.90 124.00 ▁▅▇▃▁
BARTHAG 0 1.00 0.49 0.26 0.0 0.28 0.48 0.71 0.98 ▅▇▇▆▆
EFG_O 0 1.00 49.81 3.14 39.2 47.75 49.70 51.90 59.80 ▁▃▇▅▁
EFG_D 0 1.00 50.00 2.94 39.6 48.00 50.00 52.00 59.50 ▁▃▇▅▁
TOR 0 1.00 18.76 2.09 11.9 17.30 18.70 20.10 27.10 ▁▅▇▂▁
TORD 0 1.00 18.69 2.20 10.2 17.20 18.60 20.10 28.50 ▁▅▇▂▁
ORB 0 1.00 29.88 4.13 15.0 27.10 29.90 32.60 43.60 ▁▃▇▃▁
DRB 0 1.00 30.08 3.15 18.4 27.90 30.00 32.20 40.40 ▁▃▇▅▁
FTR 0 1.00 35.99 5.25 21.6 32.40 35.80 39.50 58.60 ▂▇▆▁▁
FTRD 0 1.00 36.27 6.25 21.8 31.90 35.80 40.20 60.70 ▂▇▆▁▁
2P_O 0 1.00 48.80 3.38 37.7 46.50 48.70 51.00 62.60 ▁▅▇▂▁
2P_D 0 1.00 48.98 3.34 37.7 46.70 49.00 51.30 61.20 ▁▅▇▃▁
3P_O 0 1.00 34.41 2.79 24.9 32.50 34.40 36.30 44.10 ▁▃▇▃▁
3P_D 0 1.00 34.60 2.42 27.1 33.00 34.60 36.20 43.10 ▁▅▇▃▁
ADJ_T 0 1.00 67.81 3.28 57.2 65.70 67.80 70.00 83.40 ▁▇▇▁▁
WAB 0 1.00 -7.80 6.97 -25.2 -13.00 -8.30 -3.15 13.10 ▂▇▇▃▁
SEED 1979 0.19 8.80 4.68 1.0 5.00 9.00 13.00 16.00 ▇▆▆▇▇
YEAR 0 1.00 2016.01 2.00 2013.0 2014.00 2016.00 2018.00 2019.00 ▇▃▃▃▇