Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.

The data we found is from the RMS Titanic, a British passenger liner that sank in 1912. It covers information recorded about passengers during the ship's final voyage. The dataset contains 891 of the roughly 2,224 people (passengers and crew) on board. The information appears to have been compiled from the Titanic's records by various researchers during the 1900s. Each observation is one passenger and includes the passenger ID along with their survival status, class, name, sex, age, and other information.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?

Our research question is: how do survival rates differ across passenger classes on the Titanic, and how does that pattern vary by passenger gender? We think this question is important for understanding how passengers were prioritized and saved in the short time the ship had before sinking. The research topic explores survival rates by gender and class. We hypothesize that those in first class had better survival rates than other classes because the crew likely prioritized getting them to safety. We also think that women had a higher survival rate than men because of the widely held view, especially at the time of the sinking, that women and children should be protected. Survival rate is a quantitative variable computed from the 0/1 Survived column, while gender and class are categorical.
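The comparison described above amounts to grouping passengers by class and sex and taking the mean of the 0/1 survival indicator. A minimal sketch follows, assuming the column names Survived, Pclass, and Sex. The helper name survival_by_group and the toy rows are hypothetical, made up only to show the shape of the computation, and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical helper: survival rate for each class/sex combination.
# Assumes Survived is coded 0 = did not survive, 1 = survived.
survival_by_group <- function(df) {
  df %>%
    group_by(Pclass, Sex) %>%
    summarize(
      n = n(),                        # passengers in the group
      survival_rate = mean(Survived), # share of the group that survived
      .groups = "drop"
    )
}

# Toy rows for illustration only (not real passengers).
toy <- data.frame(
  Survived = c(1, 0, 1, 1, 0, 0),
  Pclass   = c(1, 1, 1, 3, 3, 3),
  Sex      = c("female", "male", "female", "female", "male", "male")
)

rates <- survival_by_group(toy)
print(rates)
```

On the real data the same helper could be called as survival_by_group(titanic) once the file has been read in.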

Glimpse of data

# load the Titanic passenger data
titanic <- read_csv("data/titanic.csv")

Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(titanic)

Data summary
Name	titanic
Number of rows	891
Number of columns	12
_______________________
Column type frequency:
character	5
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Name	0	1.00	12	82	891
Sex	0	1.00	4	6	2
Ticket	0	1.00	3	18	681
Cabin	687	0.23	1	15	147
Embarked	2	1.00	1	1	3

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
PassengerId	0	1.0	446.00	257.35	1.00	223.50	446.00	668.5	891.00	▇▇▇▇▇
Survived	0	1.0	0.38	0.49	0.00	0.00	0.00	1.0	1.00	▇▁▁▁▅
Pclass	0	1.0	2.31	0.84	1.00	2.00	3.00	3.0	3.00	▃▁▃▁▇
Age	177	0.8	29.70	14.53	0.42	20.12	28.00	38.0	80.00	▂▇▅▂▁
SibSp	0	1.0	0.52	1.10	0.00	0.00	0.00	1.0	8.00	▇▁▁▁▁
Parch	0	1.0	0.38	0.81	0.00	0.00	0.00	0.0	6.00	▇▁▁▁▁
Fare	0	1.0	32.20	49.69	0.00	7.91	14.45	31.0	512.33	▇▁▁▁▁

Data 2

Introduction and data

Identify the source of the data.
- The data comes from the CORGIS dataset project. The dataset contains a variety of information about billionaires and has 2614 rows, ranking the billionaires for the years 1996, 2001, and 2014.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data was originally collected from the Forbes World’s Billionaires lists from 1996-2014, with additional data collection from the Peterson Institute for International Economics.
Write a brief description of the observations.
- The data provide rankings of the billionaires for the years 1996, 2001, and 2014. The observations include the sector of the billionaire’s company, age, gender, and whether the billionaire’s wealth was inherited or not.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- We are interested in finding out which factors are most strongly associated with billionaire ranking (for example, whether company sector or inherited wealth is more strongly associated with a higher ranking).
A description of the research topic along with a concise statement of your hypotheses on this topic.
- We chose this research question because we think it would be interesting to see what contributes most to being a billionaire and which factors correlate most with people getting wealthier. We think that company sector will have the largest impact. Specifically, we predict that over the years the highest-ranking billionaires will increasingly come from technology companies, as those companies have been growing quickly.
Identify the types of variables in your research question. Categorical? Quantitative?
- Most of the variables in our research question are categorical, including the company sector, the industry the wealth comes from, wealth inheritance, and gender. Billionaire rank is quantitative.
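One way to compare how strongly each categorical factor is associated with ranking is to summarize rank within the levels of each factor. The sketch below assumes the column names rank and wealth.how.inherited from the data summary and treats a lower rank number as a higher ranking. The helper name rank_by_factor and the toy rows are hypothetical and only illustrate the computation.

```r
library(dplyr)

# Hypothetical helper: median rank within each level of one categorical column.
# Lower rank numbers are assumed to mean wealthier billionaires.
rank_by_factor <- function(df, factor_col) {
  df %>%
    group_by(across(all_of(factor_col))) %>%
    summarize(n = n(), median_rank = median(rank), .groups = "drop") %>%
    arrange(median_rank)
}

# Toy rows for illustration only (values are made up).
toy <- data.frame(
  rank = c(1, 5, 9, 2, 4, 300),
  wealth.how.inherited = c("not inherited", "not inherited", "not inherited",
                           "father", "father", "father")
)

by_inherited <- rank_by_factor(toy, "wealth.how.inherited")
print(by_inherited)
```

Calling the helper once per factor (for example "company.sector" and "wealth.how.inherited") would give comparable tables for each candidate factor.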

Glimpse of data

# load the billionaires data and summarize it
billionaires <- read.csv("data/billionaires.csv")
skimr::skim(billionaires)

Data summary
Name	billionaires
Number of rows	2614
Number of columns	22
_______________________
Column type frequency:
character	13
logical	3
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
name	1	5	45	0	2077
company.name	1	0	59	38	1578
company.relationship	1	0	46	46	75
company.sector	1	0	52	23	521
company.type	1	0	22	36	19
demographics.gender	1	0	14	34	4
location.citizenship	1	4	20	0	73
location.country.code	1	3	6	0	74
location.region	1	1	24	0	8
wealth.type	1	0	24	22	6
wealth.how.category	1	0	18	1	10
wealth.how.industry	1	0	31	1	20
wealth.how.inherited	1	6	24	0	6

Variable type: logical

skim_variable	complete_rate	mean	count
wealth.how.from.emerging	1	1	TRU: 2614
wealth.how.was.founder	1	1	TRU: 2614
wealth.how.was.political	1	1	TRU: 2614

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	1	5.996700e+02	4.678900e+02	1	215.0	430	9.880e+02	1.565e+03	▇▅▃▂▃
year	1	2.008410e+03	7.480000e+00	1996	2001.0	2014	2.014e+03	2.014e+03	▂▂▁▁▇
company.founded	1	1.924710e+03	2.437800e+02	0	1936.0	1963	1.985e+03	2.012e+03	▁▁▁▁▇
demographics.age	1	5.334000e+01	2.533000e+01	-42	47.0	59	7.000e+01	9.800e+01	▁▂▁▇▃
location.gdp	1	1.769103e+12	3.547083e+12	0	0.0	0	7.250e+11	1.060e+13	▇▁▁▁▁
wealth.worth.in.billions	1	3.530000e+00	5.090000e+00	1	1.4	2	3.500e+00	7.600e+01	▇▁▁▁▁
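The summary above shows a minimum demographics.age of -42, a minimum company.founded of 0, and several character columns with empty strings. These may be placeholders for missing values, although the data documentation would need to confirm that. A minimal sketch of recoding them to NA under that assumption is shown below. The helper name clean_placeholders and the toy rows are hypothetical.

```r
library(dplyr)

# Hypothetical helper: recode suspected placeholder values as missing.
# Assumption: ages <= 0, founding year 0, and "" all mean "unknown".
clean_placeholders <- function(df) {
  df %>%
    mutate(
      demographics.age = ifelse(demographics.age <= 0, NA, demographics.age),
      company.founded  = ifelse(company.founded == 0, NA, company.founded),
      across(where(is.character), ~ na_if(.x, ""))
    )
}

# Toy rows for illustration only (values are made up).
toy <- data.frame(
  demographics.age = c(-42, 0, 59),
  company.founded  = c(0, 1963, 1985),
  company.sector   = c("", "retail", "software")
)

cleaned <- clean_placeholders(toy)
print(cleaned)
```

After recoding, skimr::skim() would report these values under n_missing instead of folding them into the means and quantiles.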

Data 3

Introduction and data

Identify the source of the data.
- Netflix
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- All data is derived from public statistics about Netflix’s content.
Write a brief description of the observations.
- The observations list every show and movie on the platform and give details about each title, such as its rating (for example, PG-13), the country of production, and its duration on Netflix.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?

A research question to ask of this data could be: what distinctly separates movies and shows on Netflix?

This question explores how movies and shows behave differently on the platform. For instance, one avenue of exploration is how the ratings of movies and shows differ based on their country of origin. With this data, one could also examine how ratings differ overall between movies and shows, and how ratings relate to duration on Netflix. The variables the research question examines are mainly categorical (type, rating, and country), while duration would be treated as quantitative.
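One way to compare ratings between movies and shows is to compute, within each type, the share of titles carrying each rating. The sketch below assumes the column names type and rating from the data summary. The helper name rating_share_by_type and the toy rows are hypothetical and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical helper: share of titles with each rating, within each type.
rating_share_by_type <- function(df) {
  df %>%
    filter(!is.na(rating)) %>%                  # drop titles with no rating
    count(type, rating, name = "n_titles") %>%
    group_by(type) %>%
    mutate(share = n_titles / sum(n_titles)) %>%
    ungroup()
}

# Toy rows for illustration only (values are made up).
toy <- data.frame(
  type   = c("Movie", "Movie", "Movie", "TV Show", "TV Show"),
  rating = c("PG-13", "PG-13", "R", "TV-MA", NA)
)

shares <- rating_share_by_type(toy)
print(shares)
```

Adding country to the count() call would extend the same table to the country-of-origin comparison.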

Glimpse of data

# load the Netflix titles data
netflix <- read_csv("data/netflix_titles.csv")

Rows: 8807 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): show_id, type, title, director, cast, country, date_added, rating,...
dbl  (1): release_year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(netflix)

Data summary
Name	netflix
Number of rows	8807
Number of columns	12
_______________________
Column type frequency:
character	11
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
show_id	0	1.00	2	5	8807
type	0	1.00	5	7	2
title	0	1.00	1	104	8807
director	2634	0.70	2	208	4528
cast	825	0.91	3	771	7692
country	831	0.91	4	123	748
date_added	10	1.00	11	18	1714
rating	4	1.00	1	8	17
duration	3	1.00	5	10	220
listed_in	0	1.00	6	79	514
description	0	1.00	61	248	8775

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
release_year	0	1	2014.18	8.82	1925	2013	2017	2019	2021	▁▁▁▁▇

Data 4

Introduction and data

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.

The data is collected from Division I college basketball seasons from 2013-2021. The 2020 season is kept separate because there was no postseason due to the COVID-19 pandemic. The data was scraped in 2021 from http://barttorvik.com/trank.php# and then cleaned to add postseason, seed, and year columns. Each observation in the data set represents a D1 college basketball team (TEAM) during a given season (YEAR). The columns for each observation contain data about the number of games played (G) and won (W) in the season, the athletic conference in which the school participates (CONF), and the quality of the season, including measures of adjusted offensive efficiency, effective field goal percentage, offensive rebound rate, free throw rate allowed, and three-point shooting. The POSTSEASON column indicates the round where the given team’s season ended, and the SEED column indicates the team’s seed in the NCAA March Madness Tournament.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which team statistics are the best predictors of success in the NCAA tournament?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The NCAA Men’s basketball tournament challenge invites fans to predict the winners of a 64-team bracket. The results may seem random at times, but we want to explore whether certain team statistics can help predict team success. We hypothesize that teams with strong defensive statistics perform better in the tournament.
Identify the types of variables in your research question. Categorical? Quantitative?
- The team statistics are quantitative. The tournament outcome (POSTSEASON) is categorical.
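To compare candidate predictors, one option is to convert the round in which a team's season ended into an ordered number and compute a rank correlation with each statistic. The sketch below assumes the POSTSEASON labels are "R68", "R64", "R32", "S16", "E8", "F4", "2ND", and "Champions" (an assumption, to be checked against the data) and that lower ADJDE means better defense. The helper name tournament_correlations and the toy rows are hypothetical.

```r
library(dplyr)

# Assumed POSTSEASON labels, ordered from earliest exit to champion.
round_levels <- c("R68", "R64", "R32", "S16", "E8", "F4", "2ND", "Champions")

# Hypothetical helper: Spearman correlation between each statistic and
# how far a team advanced, using tournament teams only.
tournament_correlations <- function(df, stat_cols) {
  tourney <- df %>%
    filter(!is.na(POSTSEASON)) %>%
    mutate(round_reached = match(POSTSEASON, round_levels))
  sapply(stat_cols, function(col) {
    cor(tourney[[col]], tourney$round_reached,
        method = "spearman", use = "complete.obs")
  })
}

# Toy rows for illustration only (values are made up).
toy <- data.frame(
  POSTSEASON = c("R64", "R32", "S16", "Champions", NA),
  ADJDE      = c(100, 97, 94, 90, 105),
  ADJOE      = c(105, 104, 112, 120, 95)
)

cors <- tournament_correlations(toy, c("ADJDE", "ADJOE"))
print(cors)
```

Under these assumptions, a strongly negative correlation for a defensive statistic such as ADJDE would be consistent with the hypothesis that better defense goes with deeper tournament runs.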

Glimpse of data

# load the college basketball data
cbb <- read_csv("data/cbb.csv")

Rows: 2455 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): TEAM, CONF, POSTSEASON
dbl (21): G, W, ADJOE, ADJDE, BARTHAG, EFG_O, EFG_D, TOR, TORD, ORB, DRB, FT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(cbb)

Data summary
Name	cbb
Number of rows	2455
Number of columns	24
_______________________
Column type frequency:
character	3
numeric	21
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
TEAM	0	1.00	3	22	355
CONF	0	1.00	2	4	35
POSTSEASON	1979	0.19	2	9	8

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
G	0	1.00	31.49	2.66	15.0	30.00	31.00	33.00	40.00	▁▁▅▇▁
W	0	1.00	16.28	6.61	0.0	11.00	16.00	21.00	38.00	▂▇▇▃▁
ADJOE	0	1.00	103.30	7.38	76.6	98.30	103.00	108.00	129.10	▁▃▇▃▁
ADJDE	0	1.00	103.30	6.61	84.0	98.50	103.50	107.90	124.00	▁▅▇▃▁
BARTHAG	0	1.00	0.49	0.26	0.0	0.28	0.48	0.71	0.98	▅▇▇▆▆
EFG_O	0	1.00	49.81	3.14	39.2	47.75	49.70	51.90	59.80	▁▃▇▅▁
EFG_D	0	1.00	50.00	2.94	39.6	48.00	50.00	52.00	59.50	▁▃▇▅▁
TOR	0	1.00	18.76	2.09	11.9	17.30	18.70	20.10	27.10	▁▅▇▂▁
TORD	0	1.00	18.69	2.20	10.2	17.20	18.60	20.10	28.50	▁▅▇▂▁
ORB	0	1.00	29.88	4.13	15.0	27.10	29.90	32.60	43.60	▁▃▇▃▁
DRB	0	1.00	30.08	3.15	18.4	27.90	30.00	32.20	40.40	▁▃▇▅▁
FTR	0	1.00	35.99	5.25	21.6	32.40	35.80	39.50	58.60	▂▇▆▁▁
FTRD	0	1.00	36.27	6.25	21.8	31.90	35.80	40.20	60.70	▂▇▆▁▁
2P_O	0	1.00	48.80	3.38	37.7	46.50	48.70	51.00	62.60	▁▅▇▂▁
2P_D	0	1.00	48.98	3.34	37.7	46.70	49.00	51.30	61.20	▁▅▇▃▁
3P_O	0	1.00	34.41	2.79	24.9	32.50	34.40	36.30	44.10	▁▃▇▃▁
3P_D	0	1.00	34.60	2.42	27.1	33.00	34.60	36.20	43.10	▁▅▇▃▁
ADJ_T	0	1.00	67.81	3.28	57.2	65.70	67.80	70.00	83.40	▁▇▇▁▁
WAB	0	1.00	-7.80	6.97	-25.2	-13.00	-8.30	-3.15	13.10	▂▇▇▃▁
SEED	1979	0.19	8.80	4.68	1.0	5.00	9.00	13.00	16.00	▇▆▆▇▇
YEAR	0	1.00	2016.01	2.00	2013.0	2014.00	2016.00	2018.00	2019.00	▇▃▃▃▇