library(tidyverse)
library(skimr)
Characteristics of Billionaires
Proposal
Data 1
Introduction and data
Identify the source of the data.
NYC Open Data
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The NYPD collected the data from April 28, 2014 to March 10, 2023; it covers all police-reported motor vehicle collisions in NYC.
Write a brief description of the observations.
Each row is one motor vehicle collision, with fields such as crash date, crash time, number of persons injured, and on/off street name.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Is there an association between the season (time of year) and the number of crashes in a given borough (or in all five)? What percentage of those crashes caused a fatality?
A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic explores how dangerous driving in NYC is throughout the year by investigating the number of vehicle crashes and resulting fatalities and how these relate to the season.
My hypothesis is that crashes and resulting fatalities are more frequent in the winter and spring months (due to road conditions and how busy the city is during these periods).
Statement on why this question is important.
This question is important because it explores how safe driving is at different times of year, which all NYC drivers (and parents of drivers) should be aware of.
Identify the types of variables in your research question. Categorical? Quantitative?
Date - quantitative (season, derived from the date, is categorical)
Borough - categorical
Number of persons killed - quantitative
Glimpse of data
motorCrash <- read_csv("data/Motor_Vehicle_Collisions_-_Crashes.csv")
Rows: 1048575 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): CRASH DATE, BOROUGH, ON STREET NAME, CONTRIBUTING FACTOR VEHICLE 1...
dbl (9): ZIP CODE, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUM...
time (1): CRASH TIME
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(motorCrash)
Name | motorCrash |
Number of rows | 1048575 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
character | 9 |
difftime | 1 |
numeric | 9 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
CRASH DATE | 0 | 1.00 | 6 | 8 | 0 | 2347 | 0 |
BOROUGH | 376672 | 0.64 | 5 | 13 | 0 | 5 | 0 |
ON STREET NAME | 257012 | 0.75 | 2 | 32 | 0 | 9891 | 0 |
CONTRIBUTING FACTOR VEHICLE 1 | 3751 | 1.00 | 2 | 53 | 0 | 57 | 0 |
CONTRIBUTING FACTOR VEHICLE 2 | 178461 | 0.83 | 5 | 53 | 0 | 56 | 0 |
CONTRIBUTING FACTOR VEHICLE 3 | 970988 | 0.07 | 5 | 53 | 0 | 42 | 0 |
VEHICLE TYPE CODE 1 | 8659 | 0.99 | 1 | 38 | 0 | 1334 | 0 |
VEHICLE TYPE CODE 2 | 248934 | 0.76 | 1 | 38 | 0 | 1461 | 0 |
VEHICLE TYPE CODE 3 | 975400 | 0.07 | 2 | 35 | 0 | 201 | 0 |
Variable type: difftime
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
CRASH TIME | 0 | 1 | 0 secs | 86340 secs | 14:11:00 | 1440 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ZIP CODE | 376856 | 0.64 | 10869.60 | 541.48 | 10000 | 10453 | 11208 | 11249 | 11697 | ▅▃▁▇▅ |
NUMBER OF PERSONS INJURED | 17 | 1.00 | 0.32 | 0.70 | 0 | 0 | 0 | 0 | 40 | ▇▁▁▁▁ |
NUMBER OF PERSONS KILLED | 30 | 1.00 | 0.00 | 0.04 | 0 | 0 | 0 | 0 | 8 | ▇▁▁▁▁ |
NUMBER OF PEDESTRIANS INJURED | 0 | 1.00 | 0.05 | 0.24 | 0 | 0 | 0 | 0 | 27 | ▇▁▁▁▁ |
NUMBER OF PEDESTRIANS KILLED | 0 | 1.00 | 0.00 | 0.03 | 0 | 0 | 0 | 0 | 6 | ▇▁▁▁▁ |
NUMBER OF CYCLIST INJURED | 0 | 1.00 | 0.03 | 0.17 | 0 | 0 | 0 | 0 | 3 | ▇▁▁▁▁ |
NUMBER OF CYCLIST KILLED | 0 | 1.00 | 0.00 | 0.01 | 0 | 0 | 0 | 0 | 2 | ▇▁▁▁▁ |
NUMBER OF MOTORIST INJURED | 0 | 1.00 | 0.23 | 0.67 | 0 | 0 | 0 | 0 | 40 | ▇▁▁▁▁ |
NUMBER OF MOTORIST KILLED | 0 | 1.00 | 0.00 | 0.03 | 0 | 0 | 0 | 0 | 4 | ▇▁▁▁▁ |
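The summary above shows that CRASH DATE was read as a character column and that BOROUGH is missing for about 36% of rows, so both need attention before crashes can be counted by season. The following is a minimal sketch of one way to do that, not a final analysis. It assumes the dates are stored as month/day/two-digit year (consistent with the 6 to 8 character lengths reported above) and uses a simple month-based definition of season; `crash_demo` and its rows are made up for illustration.

```r
library(dplyr)

# Made-up rows; column names match the read_csv() output above
crash_demo <- tibble(
  `CRASH DATE` = c("1/15/21", "4/2/21", "7/30/21", "12/25/21"),
  BOROUGH = c("BROOKLYN", "QUEENS", "BROOKLYN", NA),
  `NUMBER OF PERSONS KILLED` = c(0, 1, 0, 0)
)

season_summary <- crash_demo %>%
  mutate(
    # Assumption: dates are month/day/two-digit year
    crash_date = as.Date(`CRASH DATE`, format = "%m/%d/%y"),
    crash_month = as.integer(format(crash_date, "%m")),
    # Assumption: meteorological seasons (Dec-Feb is winter, and so on)
    season = case_when(
      is.na(crash_month) ~ NA_character_,
      crash_month %in% c(12, 1, 2) ~ "winter",
      crash_month %in% 3:5 ~ "spring",
      crash_month %in% 6:8 ~ "summer",
      TRUE ~ "fall"
    )
  ) %>%
  # Drop rows with no borough recorded
  filter(!is.na(BOROUGH)) %>%
  group_by(BOROUGH, season) %>%
  summarise(
    crashes = n(),
    # Share of crashes with at least one person killed
    pct_fatal = 100 * mean(`NUMBER OF PERSONS KILLED` > 0, na.rm = TRUE),
    .groups = "drop"
  )

season_summary
```

Dropping rows with a missing borough is only one option; whether those rows should instead be kept for the citywide counts is a decision for the analysis.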
Data 2
Introduction and data
Identify the source of the data.
UCI Machine Learning Repository
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
According to the original curator, who published the data in the UCI Machine Learning Repository, the collection source is confidential because the data involves real participants. The data was last updated three years ago.
Write a brief description of the observations.
This file concerns credit card applications. Within the data, all attribute names and values have been changed to a simple numeric ID to protect confidentiality of the data. The data consists of a range of categorical and numerical variables.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Are age, current income, credit score, and marital status associated with credit card approval? Can these variables be used to predict a person’s chances of being approved for a credit card?
A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic concerns credit card application variables and whether they can be used to predict if a candidate will be approved for a credit card. My hypothesis is that at least some of the four variables are associated with approval. For example, an applicant who is older, has a steady income, has a high credit score, and is married may be more likely to be approved.
Statement on why this question is important.
This question is important because it investigates what factors affect credit card applications getting approved and what people could do to improve their chances of getting approved, which is a real-life issue for many people.
Identify the types of variables in your research question. Categorical? Quantitative?
Gender -> categorical
Age -> numerical
Debt -> numerical
Marital status -> categorical
Bank Customer -> categorical
Education Level -> categorical
Ethnicity -> categorical
Years Employed -> numerical
Prior Default -> categorical
Employment Status -> categorical
Credit Score -> numerical
Drivers License -> categorical
Citizenship -> categorical
Zip Code -> categorical
Current Income -> numerical
Approval status -> categorical
Glimpse of data
credit_approval <- read_csv("data/crx.csv")
New names:
• `0` -> `0...3`
• `g` -> `g...5`
• `t` -> `t...9`
• `t` -> `t...10`
• `g` -> `g...13`
• `0` -> `0...15`
Rows: 689 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): b, 30.83, u, g...5, w, v, g...13, 202, +
dbl (4): 0...3, 1.25, 1, 0...15
lgl (3): t...9, t...10, f
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(credit_approval)
Name | credit_approval |
Number of rows | 689 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 9 |
logical | 3 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
b | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
30.83 | 0 | 1 | 1 | 5 | 0 | 349 | 0 |
u | 0 | 1 | 1 | 1 | 0 | 4 | 0 |
g...5 | 0 | 1 | 1 | 2 | 0 | 4 | 0 |
w | 0 | 1 | 1 | 2 | 0 | 15 | 0 |
v | 0 | 1 | 1 | 2 | 0 | 10 | 0 |
g...13 | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
202 | 0 | 1 | 1 | 4 | 0 | 170 | 0 |
+ | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
t...9 | 0 | 1 | 0.52 | TRU: 360, FAL: 329 |
t...10 | 0 | 1 | 0.43 | FAL: 395, TRU: 294 |
f | 0 | 1 | 0.46 | FAL: 373, TRU: 316 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
0...3 | 0 | 1 | 4.77 | 4.98 | 0 | 1.00 | 2.75 | 7.25 | 28.0 | ▇▂▁▁▁ |
1.25 | 0 | 1 | 2.22 | 3.35 | 0 | 0.16 | 1.00 | 2.62 | 28.5 | ▇▁▁▁▁ |
1 | 0 | 1 | 2.40 | 4.87 | 0 | 0.00 | 0.00 | 3.00 | 67.0 | ▇▁▁▁▁ |
0...15 | 0 | 1 | 1018.86 | 5213.74 | 0 | 0.00 | 5.00 | 396.00 | 100000.0 | ▇▁▁▁▁ |
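The column names in this summary (b, 30.83, u, and so on) look like data values, which suggests the file has no header row and `read_csv()` used the first observation as the header. A minimal sketch of one way to handle this is to supply names through `col_names`. The names below are hypothetical, taken from the variable list above in the same order, and `na = "?"` rests on the assumption that missing values are coded as `?` (the age column being read as character hints at some non-numeric code). The first demo row is the one that became the header above; the second row is made up.

```r
library(readr)

# Hypothetical column names, following the variable list in this proposal
credit_cols <- c(
  "gender", "age", "debt", "marital_status", "bank_customer",
  "education_level", "ethnicity", "years_employed", "prior_default",
  "employment_status", "credit_score", "drivers_license", "citizenship",
  "zip_code", "current_income", "approval_status"
)

# Row 1: the values that ended up as column names above
# Row 2: made up, with "?" standing in for a missing age
crx_text <- paste(
  "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+",
  "a,?,4.46,u,g,q,h,3.04,t,f,6,f,g,43,560,-",
  sep = "\n"
)

credit_demo <- read_csv(
  I(crx_text),              # I() marks the string as literal data, not a path
  col_names = credit_cols,  # no header row, so names are supplied here
  na = "?",                 # assumption: "?" marks a missing value
  show_col_types = FALSE
)

credit_demo
```

For the real file, the same arguments would be passed with "data/crx.csv" in place of `I(crx_text)`; the row count should then increase by one because the first observation is no longer lost.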
Data 3
Introduction and data
Identify the source of the data.
2022 CORGIS Datasets Project. Project by Austin Cory Bart, Dennis Kafura, Clifford A. Shaffer, Javier Tibau, Luke Gusukuma, Eli Tilevich.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data comes from a 2016 working paper by Caroline Freund and Sarah Oliver for the Peterson Institute for International Economics (PIIE). The researchers compiled a multi-decade database of the super-rich. Building on the Forbes World’s Billionaires lists from 1996-2014, they added a couple dozen variables about each billionaire, including whether the billionaire was self-made or inherited their wealth. (The authors estimate that roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family.)
Write a brief description of the observations.
Each observation describes an individual billionaire: ranking, company, relationship to the company, age, gender, country, type of wealth, inheritance, and more. Together, the variables describe why or how someone became a billionaire.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What is the industry that produces the most billionaires? How has that changed over the years?
How does gender play a role in the number of billionaires in the North American region?
Are people with inherited wealth more likely to become billionaires?
Are there any characteristics that make you more likely to become a billionaire/shared characteristics of billionaires? If so, how have these changed over time?
A description of the research topic along with a concise statement of your hypotheses on this topic.
What is the industry that produces the most billionaires? How has that changed over the years?
This topic is about the changing landscape of the most prosperous industries over the years, using the number of billionaires as an indicator of prosperity. My hypothesis is that billionaires used to come mostly from manufacturing and industry, and that this has shifted toward finance and tech.
How does gender play a role in the number of billionaires in the North American region?
This topic investigates the relationship between gender and extreme wealth in North America. My hypothesis is that there are significantly fewer female billionaires in North America due to gender inequality in most industries.
Are people with inherited wealth more likely to become billionaires?
This topic investigates generational wealth and how it affects mobility between economic classes. My hypothesis is that people with inherited wealth are only slightly more likely to become billionaires, since accumulating enormous wealth often depends on chance and on the opportunities of each era.
Are there any characteristics that make you more likely to become a billionaire/shared characteristics of billionaires? If so, how have these changed over time?
The topic is about the shared characteristics of billionaires and whether these have evolved. My hypothesis is that many billionaires share the characteristics of coming from wealth, being male, being CEOs, and being in the finance/tech industries, but that the characteristics of billionaires have become more diverse over time.
Statement on why these questions are important.
These questions are important because they explore the characteristics that lead to extreme wealth, which could possibly serve as indicators of who is likely to be successful or maybe reveal patterns of inequity in our society.
Identify the types of variables in your research question. Categorical? Quantitative?
name is categorical
rank is quantitative
year is quantitative
company.founded is quantitative
company.name is categorical
company.relationship is categorical
company.sector is categorical
company.type is categorical
demographics.age is quantitative
demographics.gender is categorical
location.citizenship is categorical
location.country code is categorical
location.gdp is quantitative
location.region is categorical
wealth.type is categorical
wealth.worth in billions is quantitative
wealth.how.category is categorical
wealth.how.industry is categorical
wealth.how.from emerging is categorical (boolean)
wealth.how.inherited is categorical
wealth.how.was founder is categorical (boolean)
wealth.how.was political is categorical (boolean)
Glimpse of data
bills <- read_csv("data/billionaires.csv")
Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(bills)
Name | bills |
Number of rows | 2614 |
Number of columns | 22 |
_______________________ | |
Column type frequency: | |
character | 13 |
logical | 3 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 5 | 45 | 0 | 2077 | 0 |
company.name | 38 | 0.99 | 3 | 59 | 0 | 1576 | 0 |
company.relationship | 46 | 0.98 | 3 | 46 | 0 | 73 | 0 |
company.sector | 23 | 0.99 | 3 | 52 | 0 | 505 | 0 |
company.type | 36 | 0.99 | 3 | 22 | 0 | 15 | 0 |
demographics.gender | 34 | 0.99 | 4 | 14 | 0 | 3 | 0 |
location.citizenship | 0 | 1.00 | 4 | 20 | 0 | 73 | 0 |
location.country code | 0 | 1.00 | 3 | 6 | 0 | 74 | 0 |
location.region | 0 | 1.00 | 1 | 24 | 0 | 8 | 0 |
wealth.type | 22 | 0.99 | 9 | 24 | 0 | 5 | 0 |
wealth.how.category | 1 | 1.00 | 1 | 18 | 0 | 9 | 0 |
wealth.how.industry | 1 | 1.00 | 1 | 31 | 0 | 19 | 0 |
wealth.how.inherited | 0 | 1.00 | 6 | 24 | 0 | 6 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
wealth.how.from emerging | 0 | 1 | 1 | TRU: 2614 |
wealth.how.was founder | 0 | 1 | 1 | TRU: 2614 |
wealth.how.was political | 0 | 1 | 1 | TRU: 2614 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rank | 0 | 1 | 5.996700e+02 | 4.678900e+02 | 1 | 215.0 | 430 | 9.880e+02 | 1.565e+03 | ▇▅▃▂▃ |
year | 0 | 1 | 2.008410e+03 | 7.480000e+00 | 1996 | 2001.0 | 2014 | 2.014e+03 | 2.014e+03 | ▂▂▁▁▇ |
company.founded | 0 | 1 | 1.924710e+03 | 2.437800e+02 | 0 | 1936.0 | 1963 | 1.985e+03 | 2.012e+03 | ▁▁▁▁▇ |
demographics.age | 0 | 1 | 5.334000e+01 | 2.533000e+01 | -42 | 47.0 | 59 | 7.000e+01 | 9.800e+01 | ▁▂▁▇▃ |
location.gdp | 0 | 1 | 1.769103e+12 | 3.547083e+12 | 0 | 0.0 | 0 | 7.250e+11 | 1.060e+13 | ▇▁▁▁▁ |
wealth.worth in billions | 0 | 1 | 3.530000e+00 | 5.090000e+00 | 1 | 1.4 | 2 | 3.500e+00 | 7.600e+01 | ▇▁▁▁▁ |
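Several numeric columns in this summary report no missing values but contain values that cannot be real: demographics.age has a minimum of -42, company.founded has a minimum of 0, and location.gdp is 0 for at least half of the rows. These may be placeholder codes for missing data. Assuming that interpretation is correct, one possible sketch is to recode them as `NA` before analysis. The rows in `bills_demo` are made up, and the cutoffs are assumptions, not documented rules of the dataset.

```r
library(dplyr)

# Made-up rows that mimic the suspicious values in the summary above
bills_demo <- tibble(
  name = c("Person A", "Person B", "Person C"),
  demographics.age = c(59, 0, -42),
  company.founded = c(1963, 0, 1985),
  location.gdp = c(1.06e13, 0, 7.25e11)
)

bills_clean <- bills_demo %>%
  mutate(
    # Assumption: ages of zero or below are placeholders, not real ages
    demographics.age = if_else(demographics.age <= 0, NA_real_, demographics.age),
    # Assumption: a founding year or GDP of exactly 0 means "not recorded"
    company.founded = na_if(company.founded, 0),
    location.gdp = na_if(location.gdp, 0)
  )

# Count the values recoded to NA in each column
bills_clean %>%
  summarise(across(where(is.numeric), ~ sum(is.na(.x))))
```

The three logical columns are also TRUE for every row in the summary, so they carry no information as read and may need to be checked against the raw file.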