Characteristics of Billionaires

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.

NYC Open Data
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The NYPD collected the data from all police-reported motor vehicle collisions in NYC between April 28, 2014 and March 10, 2023.
Write a brief description of the observations.

Each row is one motor vehicle collision, with fields such as crash date, crash time, borough, street name, contributing factors, vehicle types, and the number of persons injured and killed.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is there a relationship between the season (time of year) and the number of crashes in a given borough (or in all five)? What percentage of those crashes caused a fatality?
A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic explores the danger of driving in NYC throughout the year by investigating the number of vehicle crashes and resulting fatalities and how these relate to the season.

My hypothesis is that the number of crashes and resulting fatalities is higher in the winter and spring months (due to road conditions and how busy the city is during these periods).
Statement on why this question is important.

This question is important because it explores the safety of driving at different times of year, which is useful information for all drivers in NYC, and for the parents of young drivers.
Identify the types of variables in your research question. Categorical? Quantitative?
- Date - quantitative
- Borough - categorical
- Number of persons killed - quantitative
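
The glimpse of the data shows that `CRASH DATE` is read as a character column, so it would need to be parsed before crashes can be grouped by season. The sketch below shows one possible way to do that on a few made-up rows. It assumes the dates are written as month/day/year, and `season_of()` is a hypothetical helper that is not part of any package.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: map a month number (1-12) to a meteorological season.
season_of <- function(month) {
  case_when(
    month %in% c(12, 1, 2) ~ "Winter",
    month %in% 3:5         ~ "Spring",
    month %in% 6:8         ~ "Summer",
    TRUE                   ~ "Fall"
  )
}

# Toy rows that reuse the column names of the crash file (values are made up).
toy_crashes <- tibble(
  `CRASH DATE` = c("1/15/21", "4/2/21", "7/30/21", "12/5/21"),
  BOROUGH = c("BROOKLYN", "QUEENS", "BROOKLYN", NA),
  `NUMBER OF PERSONS KILLED` = c(0, 1, 0, 0)
)

season_summary <- toy_crashes %>%
  mutate(
    crash_date = mdy(`CRASH DATE`),          # assumes month/day/year text
    season = season_of(month(crash_date))
  ) %>%
  group_by(season) %>%
  summarize(
    crashes = n(),
    # share of crashes in which at least one person was killed
    pct_fatal = 100 * mean(`NUMBER OF PERSONS KILLED` > 0, na.rm = TRUE),
    .groups = "drop"
  )

season_summary
```

On the full data the same pipeline could start from `motorCrash`, and `BOROUGH` could be added to `group_by()` to compare boroughs.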

Glimpse of data

motorCrash <- read_csv("data/Motor_Vehicle_Collisions_-_Crashes.csv")

Rows: 1048575 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): CRASH DATE, BOROUGH, ON STREET NAME, CONTRIBUTING FACTOR VEHICLE 1...
dbl  (9): ZIP CODE, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUM...
time (1): CRASH TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(motorCrash)

Data summary
Name	motorCrash
Number of rows	1048575
Number of columns	19
_______________________
Column type frequency:
character	9
difftime	1
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
CRASH DATE	0	1.00	6	8	2347
BOROUGH	376672	0.64	5	13	5
ON STREET NAME	257012	0.75	2	32	9891
CONTRIBUTING FACTOR VEHICLE 1	3751	1.00	2	53	57
CONTRIBUTING FACTOR VEHICLE 2	178461	0.83	5	53	56
CONTRIBUTING FACTOR VEHICLE 3	970988	0.07	5	53	42
VEHICLE TYPE CODE 1	8659	0.99	1	38	1334
VEHICLE TYPE CODE 2	248934	0.76	1	38	1461
VEHICLE TYPE CODE 3	975400	0.07	2	35	201

Variable type: difftime

skim_variable	n_missing	complete_rate	min	max	median	n_unique
CRASH TIME	0	1	0 secs	86340 secs	14:11:00	1440

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ZIP CODE	376856	0.64	10869.60	541.48	10000	10453	11208	11249	11697	▅▃▁▇▅
NUMBER OF PERSONS INJURED	17	1.00	0.32	0.70	0	0	0	0	40	▇▁▁▁▁
NUMBER OF PERSONS KILLED	30	1.00	0.00	0.04	0	0	0	0	8	▇▁▁▁▁
NUMBER OF PEDESTRIANS INJURED	0	1.00	0.05	0.24	0	0	0	0	27	▇▁▁▁▁
NUMBER OF PEDESTRIANS KILLED	0	1.00	0.00	0.03	0	0	0	0	6	▇▁▁▁▁
NUMBER OF CYCLIST INJURED	0	1.00	0.03	0.17	0	0	0	0	3	▇▁▁▁▁
NUMBER OF CYCLIST KILLED	0	1.00	0.00	0.01	0	0	0	0	2	▇▁▁▁▁
NUMBER OF MOTORIST INJURED	0	1.00	0.23	0.67	0	0	0	0	40	▇▁▁▁▁
NUMBER OF MOTORIST KILLED	0	1.00	0.00	0.03	0	0	0	0	4	▇▁▁▁▁

Data 2

Introduction and data

Identify the source of the data.

UCI Machine Learning Repository
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

According to the original curator of the data, published in the UCI Machine Learning Repository, the source of the data is confidential because it involves real participants. The data was last updated three years ago.
Write a brief description of the observations.

This file concerns credit card applications. Within the data, all attribute names and values have been changed to a simple numeric ID to protect confidentiality of the data. The data consists of a range of categorical and numerical variables.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Are age, current income, credit score, and marital status associated with credit card approval? Can these variables be used to predict a person’s chances of being approved for a credit card?
A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic relates to credit card application variables and whether they can be used to predict whether a candidate will be approved for a credit card. My hypothesis is that approval will be associated with at least some of the four variables. For example, an applicant who is older, has a steady income and a high credit score, and is married may be more likely to be approved.
Statement on why this question is important.

This question is important because it investigates what factors affect credit card applications getting approved and what people could do to improve their chances of getting approved, which is a real-life issue for many people.
Identify the types of variables in your research question. Categorical? Quantitative?
- Gender -> categorical
- Age -> numerical
- Debt -> numerical
- Marital status -> categorical
- Bank Customer -> categorical
- Education Level -> categorical
- Ethnicity -> categorical
- Years Employed -> numerical
- Prior Default -> categorical
- Employment Status -> categorical
- Credit Score -> numerical
- Drivers License -> categorical
- Citizenship -> categorical
- Zip Code -> categorical
- Current Income -> numerical
- Approval status -> categorical
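
One possible way to approach the prediction part of the research question is a logistic regression of approval on the four variables. The sketch below uses simulated data only: the variable names, the simulated values, and the rule that generates approvals are all made up for illustration and say nothing about the real dataset.

```r
set.seed(1)
n <- 200

# Simulated applicants (made-up values, not the real credit data).
toy_credit <- data.frame(
  age = runif(n, 18, 75),
  current_income = rexp(n, rate = 1 / 1000),
  credit_score = rpois(n, 2),
  married = rbinom(n, 1, 0.7)
)

# Made-up rule for generating approvals, so the example has some signal.
log_odds <- -1 + 0.6 * toy_credit$credit_score + 0.3 * toy_credit$married
toy_credit$approved <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: models the probability of approval from the predictors.
approval_fit <- glm(
  approved ~ age + current_income + credit_score + married,
  data = toy_credit,
  family = binomial
)

# Predicted approval probability for each simulated applicant.
approval_prob <- predict(approval_fit, type = "response")
summary(approval_fit)
```

With the real data, the same model formula would use the cleaned column names of `credit_approval` in place of the simulated ones.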

Glimpse of data

credit_approval <- read_csv("data/crx.csv")

New names:
• `0` -> `0...3`
• `g` -> `g...5`
• `t` -> `t...9`
• `t` -> `t...10`
• `g` -> `g...13`
• `0` -> `0...15`
Rows: 689 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): b, 30.83, u, g...5, w, v, g...13, 202, +
dbl (4): 0...3, 1.25, 1, 0...15
lgl (3): t...9, t...10, f

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(credit_approval)

Data summary
Name	credit_approval
Number of rows	689
Number of columns	16
_______________________
Column type frequency:
character	9
logical	3
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
b	1	1	1	3
30.83	1	1	5	349
u	1	1	1	4
g...5	1	1	2	4
w	1	1	2	15
v	1	1	2	10
g...13	1	1	1	3
202	1	1	4	170
+	1	1	1	2

Variable type: logical

skim_variable	complete_rate	mean	count
t...9	1	0.52	TRU: 360, FAL: 329
t...10	1	0.43	FAL: 395, TRU: 294
f	1	0.46	FAL: 373, TRU: 316

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
0...3	1	4.77	4.98	1.00	2.75	7.25	28.0	▇▂▁▁▁
1.25	1	2.22	3.35	0.16	1.00	2.62	28.5	▇▁▁▁▁
1	1	2.40	4.87	0.00	0.00	3.00	67.0	▇▁▁▁▁
0...15	1	1018.86	5213.74	0.00	5.00	396.00	100000.0	▇▁▁▁▁
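
The column names in the output above (`b`, `30.83`, `u`, ...) look like data values, which suggests that the file has no header row and that `read_csv()` used the first application as the header. The sketch below shows one way to avoid this by supplying column names. It reads two illustrative rows from literal text instead of the real file; the names come from the variable list in this proposal and are assumed to be in file order, and treating `?` as the missing-value marker is an assumption based on how the UCI file is usually described.

```r
library(readr)

# Names taken from the variable list in this proposal (order is assumed).
crx_names <- c(
  "gender", "age", "debt", "marital_status", "bank_customer",
  "education_level", "ethnicity", "years_employed", "prior_default",
  "employment_status", "credit_score", "drivers_license", "citizenship",
  "zip_code", "current_income", "approval_status"
)

# Two illustrative rows in the file's layout. The second row is made up
# and uses "?" for a missing age.
crx_text <- I(paste0(
  "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n",
  "a,?,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
))

credit_demo <- read_csv(
  crx_text,
  col_names = crx_names,   # keeps the first row as data, not as a header
  na = "?",                # assumed missing-value marker
  show_col_types = FALSE
)

credit_demo
```

For the real file, `crx_text` would be replaced by the path `"data/crx.csv"`.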

Data 3

Introduction and data

Identify the source of the data.

2022 CORGIS Datasets Project. Project by Austin Cory Bart, Dennis Kafura, Clifford A. Shaffer, Javier Tibau, Luke Gusukuma, Eli Tilevich.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was compiled for a 2016 working paper by Caroline Freund and Sarah Oliver at the Peterson Institute for International Economics (PIIE). Building on the Forbes World’s Billionaires lists from 1996-2014, the researchers added a couple dozen variables about each billionaire, including whether they were self-made or inherited their wealth. (The authors estimate that roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family.)
Write a brief description of the observations.

Each observation describes an individual billionaire in a given list year: their ranking, company, relationship to the company, age, gender, country, type of wealth, inheritance, and more. The data serves to describe the factors related to why or how someone is a billionaire.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What is the industry that produces the most billionaires? How has that changed over the years?
- How does gender play a role in the number of billionaires in the North American region?
- Are people with inherited wealth more likely to become billionaires?
- Are there any characteristics that make you more likely to become a billionaire/shared characteristics of billionaires? If so, how have these changed over time?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- What is the industry that produces the most billionaires? How has that changed over the years?
  
  This topic is about the changing landscape of the most prosperous industries over the years, with the number of billionaires as an indicator of prosperity. My hypothesis is that billionaires used to come mostly from manufacturing and industrial sectors, but that finance and tech now produce the most.
- How does gender play a role in the number of billionaires in the North American region?
  
  This topic investigates the relationship between gender and extreme wealth in North America. My hypothesis is that there are significantly fewer female than male billionaires in North America because of gender inequality in most industries.
- Are people with inherited wealth more likely to become billionaires?
  
  This topic investigates generational wealth and how it affects economic mobility. My hypothesis is that people with inherited wealth are only slightly more likely to become billionaires, since accumulating enormous wealth often depends on the opportunities of each era and on chance.
- Are there any characteristics that make you more likely to become a billionaire/shared characteristics of billionaires? If so, how have these changed over time?
  
  The topic is about the shared characteristics of billionaires and whether these have evolved. My hypothesis is that many billionaires share the characteristics of coming from wealth, being male, being CEOs, and being in the finance/tech industries, but that the characteristics of billionaires have become more diverse over time.
Statement on why these questions are important.

These questions are important because they explore the characteristics that lead to extreme wealth, which could possibly serve as indicators of who is likely to be successful or maybe reveal patterns of inequity in our society.
Identify the types of variables in your research question. Categorical? Quantitative?
- name is categorical
- rank is quantitative
- year is quantitative
- company.founded is quantitative
- company.name is categorical
- company.relationship is categorical
- company.sector is categorical
- company.type is categorical
- demographics.age is quantitative
- demographics.gender is categorical
- location.citizenship is categorical
- location.country code is categorical
- location.gdp is quantitative
- location.region is categorical
- wealth.type is categorical
- wealth.worth in billions is quantitative
- wealth.how.category is categorical
- wealth.how.industry is categorical
- wealth.how.from emerging is categorical (boolean)
- wealth.how.inherited is categorical
- wealth.how.was founder is categorical (boolean)
- wealth.how.was political is categorical (boolean)
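
The first research question could be explored by counting billionaires per industry within each year and keeping the largest group. The sketch below does this on made-up rows that reuse the `year` and `wealth.how.industry` column names; the industry labels are hypothetical and may not match the values in the real data.

```r
library(dplyr)

# Toy rows (made-up values) with two of the columns listed above.
toy_industry <- tibble(
  year = c(1996, 1996, 1996, 2014, 2014, 2014),
  wealth.how.industry = c(
    "Manufacturing", "Manufacturing", "Technology",
    "Technology", "Technology", "Manufacturing"
  )
)

top_industry <- toy_industry %>%
  count(year, wealth.how.industry, name = "n_billionaires") %>%
  group_by(year) %>%
  slice_max(n_billionaires, n = 1, with_ties = FALSE) %>%  # largest industry per year
  ungroup()

top_industry
```

Replacing `wealth.how.industry` with `demographics.gender` or `wealth.how.inherited` in `count()` would give a similar starting point for the other questions.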

Glimpse of data

bills <- read_csv("data/billionaires.csv")

Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl  (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl  (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(bills)

Data summary
Name	bills
Number of rows	2614
Number of columns	22
_______________________
Column type frequency:
character	13
logical	3
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	5	45	2077
company.name	38	0.99	3	59	1576
company.relationship	46	0.98	3	46	73
company.sector	23	0.99	3	52	505
company.type	36	0.99	3	22	15
demographics.gender	34	0.99	4	14	3
location.citizenship	0	1.00	4	20	73
location.country code	0	1.00	3	6	74
location.region	0	1.00	1	24	8
wealth.type	22	0.99	9	24	5
wealth.how.category	1	1.00	1	18	9
wealth.how.industry	1	1.00	1	31	19
wealth.how.inherited	0	1.00	6	24	6

Variable type: logical

skim_variable	complete_rate	mean	count
wealth.how.from emerging	1	1	TRU: 2614
wealth.how.was founder	1	1	TRU: 2614
wealth.how.was political	1	1	TRU: 2614

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	1	5.996700e+02	4.678900e+02	1	215.0	430	9.880e+02	1.565e+03	▇▅▃▂▃
year	1	2.008410e+03	7.480000e+00	1996	2001.0	2014	2.014e+03	2.014e+03	▂▂▁▁▇
company.founded	1	1.924710e+03	2.437800e+02	0	1936.0	1963	1.985e+03	2.012e+03	▁▁▁▁▇
demographics.age	1	5.334000e+01	2.533000e+01	-42	47.0	59	7.000e+01	9.800e+01	▁▂▁▇▃
location.gdp	1	1.769103e+12	3.547083e+12	0	0.0	0	7.250e+11	1.060e+13	▇▁▁▁▁
wealth.worth in billions	1	3.530000e+00	5.090000e+00	1	1.4	2	3.500e+00	7.600e+01	▇▁▁▁▁
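
The numeric summary above contains some implausible values: a minimum `demographics.age` of -42, a minimum `company.founded` of 0, and a `location.gdp` of 0 for at least half of the rows. The sketch below shows one possible cleaning step on made-up rows, assuming these values are placeholders for missing data. That assumption would need to be checked against the data documentation.

```r
library(dplyr)

# Toy rows (made-up values) with the affected columns.
toy_bills <- tibble(
  name = c("Person A", "Person B", "Person C"),
  demographics.age = c(59, -42, 0),
  company.founded = c(1963, 0, 1985),
  location.gdp = c(1.06e13, 0, 0)
)

bills_clean <- toy_bills %>%
  mutate(
    # assumption: non-positive ages stand in for unknown ages
    demographics.age = if_else(demographics.age <= 0, NA_real_, demographics.age),
    # assumption: a founding year or GDP of 0 means the value was not recorded
    company.founded = na_if(company.founded, 0),
    location.gdp = na_if(location.gdp, 0)
  )

bills_clean
```

After a step like this, `skim(bills_clean)` would report these columns with `n_missing` counts instead of distorted means and minimums.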