Project title

Appendix to report

Data cleaning

Other appendices (as necessary)

Bill Gates Removal

#| label: age-wealth-scatterplot-outlier
# Scatter plot: age vs. wealth, outlier removed
library(tidyverse)
library(tidymodels)
library(skimr)

billionaire <- read.csv("data/billionaires.csv")

# Keep the 2001 list and derive each founder's age when the company was founded.
billionaire_2001 <- billionaire |>
  filter(year == 2001) |>
  mutate(
    # A 0 in either field marks a missing value.
    age.when.founded.company = ifelse(
      demographics.age == 0 | company.founded == 0,
      NA, demographics.age - (2001 - company.founded)
    ),
    # Only founders have a meaningful founding age.
    age.when.founded.company = ifelse(
      str_detect(company.relationship, "founder"),
      age.when.founded.company, NA
    ),
    # Discard implausible, non-positive founding ages.
    age.when.founded.company = ifelse(
      age.when.founded.company > 0,
      age.when.founded.company, NA
    )
  ) |>
  select(name, demographics.age, location.citizenship, location.gdp,
         location.region, wealth.worth.in.billions, wealth.how.industry,
         wealth.how.inherited, age.when.founded.company)

billinregions <- billionaire_2001 |>
  group_by(location.citizenship)|>
  count()|>
  mutate(location.num.billionaires = n)
  
billionaire_2001<-left_join(billionaire_2001, billinregions)
Joining with `by = join_by(location.citizenship)`
billionaire_2001$location.region[billionaire_2001$location.region == "0"] <- NA
billionaire_2001$wealth.how.industry[billionaire_2001$wealth.how.industry == "0"] <- NA
billionaire_2001$demographics.age[billionaire_2001$demographics.age == 0] <- NA


billionaire_2001_age_wealth_outliner <- billionaire_2001 |>
  filter(name != "Bill Gates")

ggplot(billionaire_2001_age_wealth_outliner,
       aes(x = age.when.founded.company, y = wealth.worth.in.billions)) +
  geom_point() +
  geom_smooth(method="lm") +
  labs(
    title= "Billionaires' Age When They Founded Their Company vs. Their Wealth",
    x="Age When They Founded Their Company",
    y="Wealth Worth in Billions"
  ) +
  scale_y_continuous(labels = label_dollar())
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 346 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 346 rows containing missing values (`geom_point()`).

#| label: age-wealth-correlation-and-linear-reg-outlier
#| include: false

# Correlation - age vs wealth - outlier
age_wealth_corr_outlier <- billionaire_2001_age_wealth_outliner |>
  drop_na(age.when.founded.company, wealth.worth.in.billions) |>
  summarize(age_cor = cor(age.when.founded.company, wealth.worth.in.billions))

age_wealth_corr_outlier
      age_cor
1 -0.09624012
# Linear regression - age vs wealth - outlier
age_wealth_fit_outlier <- linear_reg() |>
  fit(wealth.worth.in.billions ~ age.when.founded.company, 
      data = billionaire_2001_age_wealth_outliner)

tidy(age_wealth_fit_outlier)
# A tibble: 2 × 5
  term                     estimate std.error statistic   p.value
  <chr>                       <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)                4.64      1.15        4.05 0.0000754
2 age.when.founded.company  -0.0444    0.0334     -1.33 0.185    

The plot shows one billionaire who founded his company at a young age (under 20) and holds far more wealth than anyone else, so we wondered whether this single point drives the result. The code above removes this outlier, Bill Gates (the wealthiest billionaire), and recomputes the correlation and the linear regression. The correlation is still negative and weak (about -0.10), and the slope is not statistically significant (p = 0.185), so the result barely changes. Because removing the outlier makes so little difference, we kept Bill Gates in the main analysis.
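The with-and-without comparison described above can be sketched in a few lines of base R. This is only an illustrative sketch: the data frame `toy`, its values, and the helper `summarise_fit()` are invented for the example and are not part of the billionaires dataset or the report's code.

```r
# Toy data, invented for illustration only (not the billionaires dataset).
toy <- data.frame(
  name         = c("A", "B", "C", "D", "E", "F"),
  founding_age = c(19, 25, 30, 35, 40, 45),
  wealth       = c(50, 3, 2.5, 2, 1.8, 1.5)
)

# Hypothetical helper: correlation and least-squares slope for one data frame.
summarise_fit <- function(df) {
  fit <- lm(wealth ~ founding_age, data = df)
  c(cor   = cor(df$founding_age, df$wealth),
    slope = unname(coef(fit)["founding_age"]))
}

# Fit once on all rows and once with the suspected outlier ("A") removed.
with_outlier    <- summarise_fit(toy)
without_outlier <- summarise_fit(subset(toy, name != "A"))

# Side-by-side comparison of the two fits.
rbind(with_outlier, without_outlier)
```

If the sign and rough size of the correlation and slope agree between the two rows, the single point is not driving the conclusion.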

billionaire_2001|>
  group_by(location.citizenship) |>
  arrange(desc(location.num.billionaires))
# A tibble: 538 × 11
# Groups:   location.citizenship [46]
   name       demographics.age location.citizenship location.gdp location.region
   <chr>                 <int> <chr>                       <dbl> <chr>          
 1 Bill Gates               45 United States             1.06e13 North America  
 2 Warren Bu…               70 United States             1.06e13 North America  
 3 Paul Allen               48 United States             1.06e13 North America  
 4 Larry Ell…               56 United States             1.06e13 North America  
 5 Jim Walton               53 United States             1.06e13 North America  
 6 John Walt…               55 United States             1.06e13 North America  
 7 S Robson …               57 United States             1.06e13 North America  
 8 Alice Wal…               52 United States             1.06e13 North America  
 9 Helen Wal…               81 United States             1.06e13 North America  
10 Steven Ba…               44 United States             1.06e13 North America  
# ℹ 528 more rows
# ℹ 6 more variables: wealth.worth.in.billions <dbl>,
#   wealth.how.industry <chr>, wealth.how.inherited <chr>,
#   age.when.founded.company <dbl>, n <int>, location.num.billionaires <int>

However, for the analysis of countries’ GDP vs. the number of billionaires in each country, removing the outlier does improve the result. The plot shows one country with both a very high GDP and a very large number of billionaires. The code above, which sorts the data by number of billionaires, is how we identified it: the United States tops the table.
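As a rough sketch of the same idea at the country level, the snippet below picks out the country with the most billionaires and compares the GDP-count correlation with and without it. The data frame `countries`, its column names, and all of its values are made up for illustration; they are assumptions, not figures from the report.

```r
# Toy country-level data, invented for illustration only.
countries <- data.frame(
  citizenship      = c("P", "Q", "R", "S", "T"),
  gdp              = c(10, 2, 1.5, 1.2, 0.8),   # arbitrary units
  num_billionaires = c(270, 20, 30, 10, 15)
)

# Treat the country with the most billionaires as the candidate outlier.
top_country <- countries$citizenship[which.max(countries$num_billionaires)]

# Correlation of GDP and billionaire count, with and without that country.
cor_all  <- cor(countries$gdp, countries$num_billionaires)
rest     <- subset(countries, citizenship != top_country)
cor_rest <- cor(rest$gdp, rest$num_billionaires)

c(all_countries = cor_all, without_top = cor_rest)
```

A large gap between the two correlations suggests that one country dominates the relationship, which is the situation described above.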