Characteristics of Billionaires

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    NYC Open Data

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The dataset was collected by the NYPD from April 28, 2014 to March 10, 2023 and covers all police-reported motor vehicle collisions in NYC.

  • Write a brief description of the observations.

    Each row is one motor vehicle collision, with fields such as crash date, crash time, number of persons injured, on/off street name, etc.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    Is there an association between the season (time of year) and the number of crashes in a given borough (or across all five)? And what percentage of those crashes resulted in a fatality?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The research topic explores how dangerous driving in NYC is throughout the year by investigating the number of vehicle crashes and resulting fatalities and how they relate to the season.

    My hypothesis is that there is a higher number of crashes and resulting fatalities in the winter and spring months (due to road conditions and how busy the city is during these periods).

  • Statement on why this question is important.

    This question is important because it explores the safety of driving at different times of year, which is important for all drivers in NYC, and for parents of drivers, to be aware of.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Date - quantitative

    • Borough - categorical

    • Number of persons killed - quantitative

Glimpse of data

motorCrash <- read_csv("data/Motor_Vehicle_Collisions_-_Crashes.csv")
Rows: 1048575 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (9): CRASH DATE, BOROUGH, ON STREET NAME, CONTRIBUTING FACTOR VEHICLE 1...
dbl  (9): ZIP CODE, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUM...
time (1): CRASH TIME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(motorCrash)
Data summary
Name motorCrash
Number of rows 1048575
Number of columns 19
_______________________
Column type frequency:
character 9
difftime 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
CRASH DATE 0 1.00 6 8 0 2347 0
BOROUGH 376672 0.64 5 13 0 5 0
ON STREET NAME 257012 0.75 2 32 0 9891 0
CONTRIBUTING FACTOR VEHICLE 1 3751 1.00 2 53 0 57 0
CONTRIBUTING FACTOR VEHICLE 2 178461 0.83 5 53 0 56 0
CONTRIBUTING FACTOR VEHICLE 3 970988 0.07 5 53 0 42 0
VEHICLE TYPE CODE 1 8659 0.99 1 38 0 1334 0
VEHICLE TYPE CODE 2 248934 0.76 1 38 0 1461 0
VEHICLE TYPE CODE 3 975400 0.07 2 35 0 201 0

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
CRASH TIME 0 1 0 secs 86340 secs 14:11:00 1440

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ZIP CODE 376856 0.64 10869.60 541.48 10000 10453 11208 11249 11697 ▅▃▁▇▅
NUMBER OF PERSONS INJURED 17 1.00 0.32 0.70 0 0 0 0 40 ▇▁▁▁▁
NUMBER OF PERSONS KILLED 30 1.00 0.00 0.04 0 0 0 0 8 ▇▁▁▁▁
NUMBER OF PEDESTRIANS INJURED 0 1.00 0.05 0.24 0 0 0 0 27 ▇▁▁▁▁
NUMBER OF PEDESTRIANS KILLED 0 1.00 0.00 0.03 0 0 0 0 6 ▇▁▁▁▁
NUMBER OF CYCLIST INJURED 0 1.00 0.03 0.17 0 0 0 0 3 ▇▁▁▁▁
NUMBER OF CYCLIST KILLED 0 1.00 0.00 0.01 0 0 0 0 2 ▇▁▁▁▁
NUMBER OF MOTORIST INJURED 0 1.00 0.23 0.67 0 0 0 0 40 ▇▁▁▁▁
NUMBER OF MOTORIST KILLED 0 1.00 0.00 0.03 0 0 0 0 4 ▇▁▁▁▁
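
One possible way to approach the research question for Data 1 is sketched below. It is only a sketch under stated assumptions: the rows in `crash_sample` are made up to mimic three columns of `motorCrash`, the dates are assumed to follow a month/day/two-digit-year pattern (consistent with the 6 to 8 character lengths in the summary above), and the month-to-season mapping is one of several reasonable choices. Rows with a missing `BOROUGH` (about 36% of the data according to the summary) are dropped here.

```r
library(dplyr)

# Hypothetical rows mimicking three columns of motorCrash; all values are made up.
crash_sample <- tibble(
  `CRASH DATE` = c("1/15/21", "1/20/21", "4/3/21", "7/9/21", "7/30/21", "10/12/21"),
  BOROUGH = c("BROOKLYN", "BROOKLYN", "QUEENS", "BROOKLYN", "QUEENS", "QUEENS"),
  `NUMBER OF PERSONS KILLED` = c(0, 1, 0, 0, 0, 1)
)

season_summary <- crash_sample %>%
  mutate(
    # Assumed date format: month/day/two-digit year
    crash_date = as.Date(`CRASH DATE`, format = "%m/%d/%y"),
    month = as.integer(format(crash_date, "%m")),
    # Assumed season definition (Dec-Feb = winter, and so on);
    # unparsed dates fall through to NA
    season = case_when(
      month %in% c(12, 1, 2) ~ "winter",
      month %in% 3:5 ~ "spring",
      month %in% 6:8 ~ "summer",
      month %in% 9:11 ~ "fall"
    )
  ) %>%
  filter(!is.na(BOROUGH)) %>%
  group_by(BOROUGH, season) %>%
  summarize(
    n_crashes = n(),
    # Share of crashes with at least one person killed, in percent
    pct_fatal = 100 * mean(`NUMBER OF PERSONS KILLED` > 0, na.rm = TRUE),
    .groups = "drop"
  )

season_summary
```

Replacing `crash_sample` with `motorCrash` would apply the same steps to the full data, provided the date format assumption holds.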

Data 2

Introduction and data

  • Identify the source of the data.

    UCI Machine Learning Repository

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    According to the original curator of the data published in the UCI Machine Learning Repository, the source of the data is confidential because the data involves real participants. The data was last updated three years ago.

  • Write a brief description of the observations.

    This file concerns credit card applications; each row is one application. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. The data contains a mix of categorical and numerical variables.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    Are age, current income, credit score, and marital status associated with credit card approval? Can these variables be used to predict a person’s chances of being approved for a credit card?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The research topic concerns credit card application variables and whether they can be used to predict whether an applicant will be approved for a credit card. My hypothesis is that at least some of the 4 variables will be correlated with approval. For example, an individual who is older, has a steady income, has a high credit score, and is married may be more likely to be approved.

  • Statement on why this question is important.

    This question is important because it investigates what factors affect credit card applications getting approved and what people could do to improve their chances of getting approved, which is a real-life issue for many people.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Gender –> categorical

    • Age –> numerical

    • Debt –> numerical

    • Marital status –> categorical

    • Bank Customer –> categorical

    • Education Level –> categorical

    • Ethnicity –> categorical

    • Years Employed –> numerical

    • Prior Default –> categorical

    • Employment Status –> categorical

    • Credit Score –> numerical

    • Drivers License –> categorical

    • Citizenship –> categorical

    • Zip Code –> categorical

    • Current Income –> numerical

    • Approval status –> categorical

Glimpse of data

credit_approval <- read_csv("data/crx.csv")
New names:
Rows: 689 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): b, 30.83, u, g...5, w, v, g...13, 202, +
dbl (4): 0...3, 1.25, 1, 0...15
lgl (3): t...9, t...10, f

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `0` -> `0...3`
• `g` -> `g...5`
• `t` -> `t...9`
• `t` -> `t...10`
• `g` -> `g...13`
• `0` -> `0...15`
skim(credit_approval)
Data summary
Name credit_approval
Number of rows 689
Number of columns 16
_______________________
Column type frequency:
character 9
logical 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
b 0 1 1 1 0 3 0
30.83 0 1 1 5 0 349 0
u 0 1 1 1 0 4 0
g...5 0 1 1 2 0 4 0
w 0 1 1 2 0 15 0
v 0 1 1 2 0 10 0
g...13 0 1 1 1 0 3 0
202 0 1 1 4 0 170 0
+ 0 1 1 1 0 2 0

Variable type: logical

skim_variable n_missing complete_rate mean count
t...9 0 1 0.52 TRU: 360, FAL: 329
t...10 0 1 0.43 FAL: 395, TRU: 294
f 0 1 0.46 FAL: 373, TRU: 316

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
0...3 0 1 4.77 4.98 0 1.00 2.75 7.25 28.0 ▇▂▁▁▁
1.25 0 1 2.22 3.35 0 0.16 1.00 2.62 28.5 ▇▁▁▁▁
1 0 1 2.40 4.87 0 0.00 0.00 3.00 67.0 ▇▁▁▁▁
0...15 0 1 1018.86 5213.74 0 0.00 5.00 396.00 100000.0 ▇▁▁▁▁
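
The column names in the summary above (`b`, `30.83`, `u`, ...) suggest that the file has no header row, so the first application was read as the header and only 689 rows remain. A minimal sketch of one way to avoid this is shown below. The names in `crx_names` follow the variable list in the research question section, but their snake_case spelling is an assumption, as is the use of `?` as the missing-value marker. The two lines in `crx_sample` stand in for `data/crx.csv`: the first reproduces the values that became the header above, the second is hypothetical.

```r
library(readr)

# Names follow the variable list above; the snake_case spelling is an assumption.
crx_names <- c(
  "gender", "age", "debt", "marital_status", "bank_customer",
  "education_level", "ethnicity", "years_employed", "prior_default",
  "employment_status", "credit_score", "drivers_license", "citizenship",
  "zip_code", "current_income", "approval_status"
)

# Stand-in for data/crx.csv: row 1 is the row that became the header above,
# row 2 is hypothetical and uses "?" for a missing age.
crx_sample <- I(paste(
  "b,30.83,0,u,g,w,v,1.25,t,t,1,f,g,202,0,+",
  "a,?,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+",
  sep = "\n"
))

credit_approval <- read_csv(
  crx_sample,
  col_names = crx_names,    # supply names so the first row stays in the data
  na = c("", "NA", "?"),    # assumed missing-value marker
  show_col_types = FALSE
)

credit_approval
```

With the real file, `crx_sample` would be replaced by `"data/crx.csv"`. Treating `?` as missing should also let columns such as age be read as numeric rather than character.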

Data 3

Introduction and data

  • Identify the source of the data.

    2022 CORGIS Datasets Project. Project by Austin Cory Bart, Dennis Kafura, Clifford A. Shaffer, Javier Tibau, Luke Gusukuma, Eli Tilevich.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data was compiled for a 2016 working paper by Caroline Freund and Sarah Oliver for PIIE (Peterson Institute for International Economics). Building on the Forbes World’s Billionaires lists from 1996-2014, the researchers assembled a multi-decade database of the super-rich and added a couple dozen more variables about each billionaire, including whether they were self-made or inherited their wealth. (Roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family, the authors estimate.)

  • Write a brief description of the observations.

    Each observation describes an individual billionaire: their ranking, company, relationship to the company, age, gender, country, type of wealth, inheritance, and more. The data serves to describe the factors related to why or how someone is a billionaire.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • What is the industry that produces the most billionaires? How has that changed over the years?

    • How does gender play a role in the number of billionaires in the North American region?

    • Are people with inherited wealth more likely to become billionaires?

    • Are there any characteristics that make you more likely to become a billionaire/shared characteristics of billionaires? If so, how have these changed over time?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • What is the industry that produces the most billionaires? How has that changed over the years?

      This topic is about the changing landscape of the most prosperous industries over the years, using the number of billionaires as an indicator of prosperity. My hypothesis is that billionaires used to come mostly from manufacturing and industry, but that the balance has shifted toward finance and tech.

    • How does gender play a role in the number of billionaires in the North American region?

      This topic investigates the relationship between gender and extreme wealth in North America. My hypothesis is that there are significantly fewer female billionaires in North America due to gender inequality in most industries.

    • Are people with inherited wealth more likely to become billionaires?

      This topic investigates generational wealth and how it affects one’s economic mobility. My hypothesis is that people with inherited wealth are only slightly more likely to become billionaires, since accumulating extreme wealth often depends on the opportunities of each era and on chance.

    • Are there any characteristics that make you more likely to become a billionaire/shared characteristics of billionaires? If so, how have these changed over time?

      The topic is about the shared characteristics of billionaires and whether these have evolved. My hypothesis is that many billionaires share the characteristics of coming from wealth, being male, being CEOs, and being in the finance/tech industries, but that the characteristics of billionaires have become more diverse over time.

  • Statement on why these questions are important.

    These questions are important because they explore the characteristics that lead to extreme wealth, which could possibly serve as indicators of who is likely to be successful or maybe reveal patterns of inequity in our society.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • name is categorical

    • rank is quantitative

    • year is quantitative

    • company.founded is quantitative

    • company.name is categorical

    • company.relationship is categorical

    • company.sector is categorical

    • company.type is categorical

    • demographics.age is quantitative

    • demographics.gender is categorical

    • location.citizenship is categorical

    • location.country code is categorical

    • location.gdp is quantitative

    • location.region is categorical

    • wealth.type is categorical

    • wealth.worth in billions is quantitative

    • wealth.how.category is categorical

    • wealth.how.industry is categorical

    • wealth.how.from emerging is categorical (boolean)

    • wealth.how.inherited is categorical

    • wealth.how.was founder is categorical (boolean)

    • wealth.how.was political is categorical (boolean)

Glimpse of data

# Load the billionaires data
bills <- read_csv("data/billionaires.csv")
Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl  (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl  (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(bills)
Data summary
Name bills
Number of rows 2614
Number of columns 22
_______________________
Column type frequency:
character 13
logical 3
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 5 45 0 2077 0
company.name 38 0.99 3 59 0 1576 0
company.relationship 46 0.98 3 46 0 73 0
company.sector 23 0.99 3 52 0 505 0
company.type 36 0.99 3 22 0 15 0
demographics.gender 34 0.99 4 14 0 3 0
location.citizenship 0 1.00 4 20 0 73 0
location.country code 0 1.00 3 6 0 74 0
location.region 0 1.00 1 24 0 8 0
wealth.type 22 0.99 9 24 0 5 0
wealth.how.category 1 1.00 1 18 0 9 0
wealth.how.industry 1 1.00 1 31 0 19 0
wealth.how.inherited 0 1.00 6 24 0 6 0

Variable type: logical

skim_variable n_missing complete_rate mean count
wealth.how.from emerging 0 1 1 TRU: 2614
wealth.how.was founder 0 1 1 TRU: 2614
wealth.how.was political 0 1 1 TRU: 2614

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rank 0 1 5.996700e+02 4.678900e+02 1 215.0 430 9.880e+02 1.565e+03 ▇▅▃▂▃
year 0 1 2.008410e+03 7.480000e+00 1996 2001.0 2014 2.014e+03 2.014e+03 ▂▂▁▁▇
company.founded 0 1 1.924710e+03 2.437800e+02 0 1936.0 1963 1.985e+03 2.012e+03 ▁▁▁▁▇
demographics.age 0 1 5.334000e+01 2.533000e+01 -42 47.0 59 7.000e+01 9.800e+01 ▁▂▁▇▃
location.gdp 0 1 1.769103e+12 3.547083e+12 0 0.0 0 7.250e+11 1.060e+13 ▇▁▁▁▁
wealth.worth in billions 0 1 3.530000e+00 5.090000e+00 1 1.4 2 3.500e+00 7.600e+01 ▇▁▁▁▁
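
The numeric summary above shows a minimum of -42 for `demographics.age` and 0 for `company.founded`, which look like placeholders rather than real values. The sketch below shows one way these could be handled before counting billionaires per industry and year for the first research question. It rests on assumptions: non-positive values are treated as missing, and the rows in `bills_sample`, including the industry labels, are made up to mimic four columns of `bills`.

```r
library(dplyr)

# Hypothetical rows mimicking four columns of bills; all values are made up.
bills_sample <- tibble(
  year = c(1996, 1996, 2014, 2014, 2014),
  demographics.age = c(59, 0, -42, 47, 70),
  company.founded = c(1936, 0, 1985, 2012, 1963),
  wealth.how.industry = c("Consumer", "Energy", "Technology", "Technology", "Consumer")
)

# Assumption: ages and founding years that are zero or negative are placeholders.
bills_clean <- bills_sample %>%
  mutate(across(
    c(demographics.age, company.founded),
    ~ if_else(.x <= 0, NA_real_, .x)
  ))

# Number of billionaires per industry in each list year
industry_by_year <- bills_clean %>%
  count(year, wealth.how.industry, name = "n_billionaires")

industry_by_year
```

Recoding to `NA` rather than dropping rows keeps every billionaire in the industry counts while excluding the placeholder values from any later summary of age or founding year.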