Project title

Appendix to report

Data cleaning

At minimum, you should have an appendix for your data cleaning. Submit an updated version of your data cleaning description from phase II that describes all data cleaning steps performed on your raw data to turn it into the analysis-ready dataset submitted with your final project. When rendered, it should output the dataset you submit as part of your project (e.g. written as a .csv file).

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(skimr)
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.

library(stats)

billionaires <- read.csv("data/billionaires.csv")


# Keep the variables of interest, restrict to the 2014 list for North America
# and Europe, then recode inheritance and industry.
billionaires_cleaned <- billionaires |>
  select(name, year, rank, company.sector, location.country.code, location.region,
         wealth.worth.in.billions, wealth.how.industry, wealth.how.inherited) |>
  filter(year == 2014, location.region %in% c("North America", "Europe")) |>
  mutate(
    # Collapse inheritance into three groups; every value other than
    # "not inherited" and "spouse/widow" counts as generational inheritance.
    wealth.how.inherited = case_when(
      wealth.how.inherited == "not inherited" ~ "Not inherited",
      wealth.how.inherited == "spouse/widow" ~ "Inherited spouse/widow",
      wealth.how.inherited != "spouse/widow" & wealth.how.inherited != "not inherited"
      ~ "Inherited generation"
    ),
    location.region = as.factor(location.region),
    # Relabel the uninformative industry code "0" as "Other".
    wealth.how.industry = if_else(wealth.how.industry == "0",
                                  "Other", wealth.how.industry),
    wealth.how.industry = as.factor(wealth.how.industry)
  )

write.csv(billionaires_cleaned, "data/billionaires_cleaned.csv")

Cleaning the dataset:

First, we decided which variables were of interest to us, and narrowed down the dataframe to only those variables with the select() function. These variables included name, year, rank, company.sector, location.country.code, location.region, wealth.worth.in.billions, wealth.how.industry, and wealth.how.inherited.
We kept only the billionaires listed in 2014. The dataframe lists billionaires for the years 1996, 2001, and 2014; we chose to focus on the most recent list, so we kept only the 2014 rows using the filter() function. We also kept only billionaires in North America and Europe, because our research compares billionaires in those two regions.
We then recoded wealth.how.inherited, which describes how a billionaire’s wealth was inherited (if at all), into three categories: “Not inherited”, “Inherited generation”, and “Inherited spouse/widow”. We do not treat 3rd generation inheritance as meaningfully different from 5th generation inheritance, so every form of generational inheritance is grouped into “Inherited generation”.
We also cleaned up the wealth.how.industry variable. While creating data visualizations, we noticed an industry category of “0”, which does not describe any industry a billionaire might work in. Instead of removing the billionaires with this value, we recoded it to “Other”, an existing category.
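The filtering and recoding steps described above can be illustrated on a small made-up data frame. This is only a sketch: the rows in `toy` are hypothetical and are not taken from the real dataset, and the names `toy` and `toy_cleaned` are introduced here for illustration. It is meant to show how each rule treats a value, not to reproduce the submitted dataset.

```r
library(dplyr)

# Hypothetical rows (assumed values, not from data/billionaires.csv).
toy <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  year = c(2014, 2014, 2001, 2014, 2014),
  location.region = c("North America", "Europe", "Europe", "East Asia", "Europe"),
  wealth.how.industry = c("Technology", "0", "Retail", "Energy", "Finance"),
  wealth.how.inherited = c("not inherited", "spouse/widow", "father",
                           "not inherited", "3rd generation"),
  stringsAsFactors = FALSE
)

toy_cleaned <- toy |>
  # Row C is dropped (year 2001); row D is dropped (region outside the two of interest).
  filter(year == 2014, location.region %in% c("North America", "Europe")) |>
  mutate(
    # Same three-way recoding as in the cleaning code.
    wealth.how.inherited = case_when(
      wealth.how.inherited == "not inherited" ~ "Not inherited",
      wealth.how.inherited == "spouse/widow" ~ "Inherited spouse/widow",
      wealth.how.inherited != "spouse/widow" & wealth.how.inherited != "not inherited"
      ~ "Inherited generation"
    ),
    # The placeholder industry "0" becomes "Other"; other values are unchanged.
    wealth.how.industry = if_else(wealth.how.industry == "0",
                                  "Other", wealth.how.industry)
  )

# One row per inheritance category remains in this toy example.
count(toy_cleaned, wealth.how.inherited)
```

In this toy example, rows A, B, and E survive the filter, and the value “3rd generation” ends up in “Inherited generation”. Running `count()` on the real `billionaires_cleaned` in the same way would be a quick check that only the three intended categories remain.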

Other appendices (as necessary)