Project Wondrous Pikachu

Exploratory data analysis

Research question(s)

Research question(s). State your research question(s) clearly.

Our research question is: which factors correlate strongly with billionaire ranking? We chose this question because we wanted to look into the external factors behind how billionaires generated their wealth. Asking which factors correlate most with ranking lets us identify some of the measurable qualities associated with billionaire status.

Data collection and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.

# Read the raw CSV, treating blank fields and fields equal to "0" as NA
billionaires <- read.csv("data/billionaires.csv", na.strings = c("", "0"))
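
To make the effect of this step easier to check, here is a minimal sketch of what `na.strings = c("", "0")` does. It reads a small made-up CSV from a string instead of the real file; the rows and values are hypothetical, and only the column names are borrowed from our data. It also assumes that a 0 in the raw file always means "missing" and never a real zero.

```r
# Hypothetical three-row CSV standing in for data/billionaires.csv
raw_csv <- "name,age,location.gdp,wealth.how.industry
Person A,54,0,Media
Person B,0,1500000000000,
Person C,61,2000000000000,Energy"

# Same na.strings argument as in our cleaning step
toy <- read.csv(text = raw_csv, na.strings = c("", "0"))

# Count the NA values created in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Because `na.strings` applies to every column, running `colSums(is.na(billionaires))` on the real data frame would be one way to see which variables are most affected by the replacement.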

Data description

The data set ranks billionaires for the years 1996, 2001, and 2014. Each observation represents one billionaire in one recorded year, and the attributes include the billionaire’s age, gender, company sector, and whether the wealth was inherited, among others. The data set comes from the CORGIS Dataset Project, but the underlying data originated with Forbes, which funded and compiled it through its World’s Billionaires lists.

As for what might have influenced which data were observed and recorded, some individuals may have withheld personal information, which could account for the NA values. However, the variables are fairly “matter of fact”, so the information would be difficult to fabricate. Information such as wealth inheritance can be researched through legal documents, and the billionaires were not surveyed, so they had no opportunity to fabricate answers. The data was well organized in its original state, so the only processing we did was replacing blank and 0 values with NA.

The people in this data set were potentially aware that the information would be made public through Forbes. The data also reflects on them in terms of their prowess in the business world.

Data limitations

One potential limitation is that the Billionaires data set has few numeric variables, which may make it difficult to carry out sufficient quantitative analysis. Another is that some individuals appear more than once because they were on the list in different years.
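
The repeated names can be quantified before deciding how to handle them. The sketch below uses a made-up data frame, and it assumes the real data has `name` and `year` columns, which we have not shown above. Keeping only each person's most recent appearance is just one possible choice, not a step we have applied.

```r
# Hypothetical rows: Person A appears in all three years, Person B in two
toy <- data.frame(
  name = c("Person A", "Person B", "Person A", "Person C", "Person A", "Person B"),
  year = c(1996, 1996, 2001, 2001, 2014, 2014),
  rank = c(1, 5, 2, 9, 3, 4)
)

# How many times does each name appear, and which names repeat?
appearances <- table(toy$name)
repeated <- appearances[appearances > 1]
print(repeated)

# One option: keep only the latest year for each person
ordered <- toy[order(toy$name, -toy$year), ]
latest <- ordered[!duplicated(ordered$name), ]
print(latest)
```

Whether to deduplicate at all depends on the analysis: a comparison across years needs the repeated rows, while a single cross-sectional model may count the same person several times if they are kept.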

Exploratory data analysis

library(tidyverse)
library(skimr)
library(tidymodels)
library(scales)
library(palmerpenguins)
library(gapminder)

Perform an (initial) exploratory data analysis.

billionaires |>
  group_by(wealth.how.industry) |>
  summarize(
    mean_rank = mean(rank),
    median_rank = median(rank),
    std_rank = sd(rank),
    max = max(rank),
    min = min(rank)
  )
# A tibble: 19 × 6
   wealth.how.industry             mean_rank median_rank std_rank   max   min
   <chr>                               <dbl>       <dbl>    <dbl> <int> <int>
 1 Constrution                          671.        490      422.  1565    23
 2 Consumer                             572.        408      468.  1565     2
 3 Diversified financial                616.        520      461.  1565     6
 4 Energy                               619.        434.     472.  1565    40
 5 Hedge funds                          631.        452      462.  1565     5
 6 Media                                478.        336      437.  1565     2
 7 Mining and metals                    719.        652.     492.  1565    46
 8 Money Management                     523.        387      422.  1565     6
 9 Non-consumer industrial              707.        580      508.  1565    31
10 Other                                685.        506      441.  1565    69
11 Private equity/leveraged buyout      499.        520      306.  1154    59
12 Real Estate                          665.        520      487.  1565     4
13 Retail, Restaurant                   560.        388      464.  1565     3
14 Technology-Computer                  576.        402      478.  1565     1
15 Technology-Medical                   716.        520      501.  1565     3
16 Venture Capital                      724.        550.     315.  1154   452
17 banking                              296         296       NA    296   296
18 services                             324         324       NA    324   324
19 <NA>                                 728.        361      584.  1565   128
rank_industry_fit <- linear_reg() |>
  fit(rank ~ wealth.how.industry, data = billionaires)
tidy(rank_industry_fit)
# A tibble: 18 × 5
   term                                     estimate std.error statistic p.value
   <chr>                                       <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                                 296.       463.    0.639    0.523
 2 wealth.how.industryConstrution              375.       466.    0.804    0.421
 3 wealth.how.industryConsumer                 276.       464.    0.594    0.552
 4 wealth.how.industryDiversified financial    320.       465.    0.689    0.491
 5 wealth.how.industryEnergy                   323.       465.    0.694    0.488
 6 wealth.how.industryHedge funds              335.       467.    0.719    0.472
 7 wealth.how.industryMedia                    182.       464.    0.391    0.696
 8 wealth.how.industryMining and metals        423.       466.    0.908    0.364
 9 wealth.how.industryMoney Management         227.       464.    0.490    0.625
10 wealth.how.industryNon-consumer industr…    411.       465.    0.884    0.377
11 wealth.how.industryOther                    389.       466.    0.835    0.404
12 wealth.how.industryPrivate equity/lever…    203.       472.    0.429    0.668
13 wealth.how.industryReal Estate              369.       464.    0.795    0.427
14 wealth.how.industryRetail, Restaurant       264.       464.    0.569    0.569
15 wealth.how.industryservices                  28.0      655.    0.0427   0.966
16 wealth.how.industryTechnology-Computer      280.       464.    0.603    0.546
17 wealth.how.industryTechnology-Medical       420.       465.    0.903    0.367
18 wealth.how.industryVenture Capital          428.       491.    0.870    0.384
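
One thing worth noting about the coefficient table above: the intercept (296) matches the mean rank of the single `banking` billionaire in the summary table, which suggests that `banking` is the reference level and that every other coefficient is a difference from that one observation. That would also explain the uniformly large standard errors. The sketch below uses made-up data to show how picking a better-populated baseline with `relevel()` changes the estimates. The `toy` data frame and its values are hypothetical, and base `lm()` stands in for `linear_reg()`, whose default engine is `lm`.

```r
# Hypothetical data: one banking observation, three each for Media and Energy
toy <- data.frame(
  rank = c(296, 100, 300, 500, 200, 400, 600),
  industry = factor(
    c("banking", "Media", "Media", "Media", "Energy", "Energy", "Energy"),
    levels = c("banking", "Energy", "Media")  # banking is the reference level
  )
)

# With banking as the baseline, the intercept is the single banking rank
fit_default <- lm(rank ~ industry, data = toy)
se_default <- summary(fit_default)$coefficients["(Intercept)", "Std. Error"]

# Switch the baseline to an industry with more observations
toy$industry <- relevel(toy$industry, ref = "Media")
fit_releveled <- lm(rank ~ industry, data = toy)
se_releveled <- summary(fit_releveled)$coefficients["(Intercept)", "Std. Error"]

print(coef(fit_default))
print(coef(fit_releveled))
print(c(default = se_default, releveled = se_releveled))
```

The fitted group means are identical in both models. Only the comparison being made changes, so the choice of baseline is about which differences are easiest to interpret.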
gdp_wealth_fit <-
  linear_reg() |>
  fit(location.gdp ~ wealth.worth.in.billions, data = billionaires) 
tidy(gdp_wealth_fit)
# A tibble: 2 × 5
  term                     estimate    std.error statistic   p.value
  <chr>                       <dbl>        <dbl>     <dbl>     <dbl>
1 (Intercept)               1.89e12 84375594505.     22.4  1.82e-101
2 wealth.worth.in.billions -3.33e10 13623020509.     -2.44 1.47e-  2
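
Since our research question is framed in terms of correlation, a rank correlation could complement the linear fits above. The sketch below computes Spearman correlations between `rank` and two numeric variables on made-up data. The `toy` values are hypothetical, and the choice of `age` and `location.gdp` as candidate factors is only for illustration.

```r
# Hypothetical data with one missing age, as in our cleaned data set
toy <- data.frame(
  rank = c(1, 2, 3, 4, 5, 6),
  age = c(58, 71, 44, 63, 50, NA),
  location.gdp = c(9e12, 7e12, 6e12, 3e12, 2e12, 1e12)
)

# Spearman correlation of rank with each candidate factor,
# dropping rows where either value is NA
candidate_factors <- c("age", "location.gdp")
rho <- sapply(candidate_factors, function(v) {
  cor(toy$rank, toy[[v]], method = "spearman", use = "complete.obs")
})
print(rho)
```

Spearman correlation depends only on the ordering of the values, which suits an ordinal variable like `rank` and is less sensitive to the very large net worth and GDP values than a linear fit.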
ggplot(
  data = billionaires,
  mapping = aes(
    x = wealth.worth.in.billions,
    y = wealth.how.industry,
    color = wealth.how.industry
  )
) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Net worth of billionaires by industry",
    x = "Net worth (billions)",
    y = "Industry"
  ) +
  theme_minimal() +
  guides(color = "none")

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

Are there any other limitations in the data that you think should be addressed?

Are there any further data cleaning processes you would recommend?

Is there additional analysis that you think could be helpful to include?

Do we include enough description for our data cleaning process to be understood and replicated?

Are there any other suggestions or feedback that you have regarding our data collection and exploratory data analysis?