Characteristics of Billionaires

Appendix to report

Data Cleaning

We collected our data by downloading the .csv file from the CORGIS Dataset Project website and loading it with read_csv(). From there, we cleaned and curated the data to fit our research question.

First, we filtered the data so that only observations from 2014 remained. The original data set contains observations from 1996, 2001, and 2014, so some people, such as Bill Gates, appear multiple times; keeping only 2014 prevents them from being counted more than once in our analysis. Because the three years are not consecutive, the data set also cannot show continuous change over time, so we chose the most recent year available.

Next, we selected only the columns directly related to our research question: demographics.age, demographics.gender, location.region, and wealth.how.inherited, along with name as an identifier variable. This keeps the data set from being cluttered with unused variables. We then renamed the columns to meet style guidelines and to make them easier to use and understand: demographics.age became age, demographics.gender became gender, location.region became region, and wealth.how.inherited became inheritance.

At this point, there were no NA values in the remaining columns, but some people had a recorded age of 0, which is clearly a placeholder for a missing value. We removed these rows because they are not meaningful and would pull the age statistics downward.

We also added a variable, age_range, based on the age ranges typically used in other public research analyses we have seen. There are so many distinct ages in the data that grouping by individual age values would not be practical.

Finally, we collapsed the inheritance variable into two categories, “inherited” and “not inherited”, where “inherited” covers every value other than “not inherited”, such as “3rd generation”, “father”, and “5th generation or longer”. The resulting data frame, bills_clean, has no NA values, has clean names and values, and is ready for analysis.
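
The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration rather than the exact code behind the report: the toy rows in bills_raw are hypothetical stand-ins for the CORGIS file, and the age_range breaks and labels (including "65+") are assumptions chosen so that each label matches the ages it contains.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows standing in for the raw CORGIS data (not real values).
bills_raw <- tribble(
  ~name,      ~year, ~demographics.age, ~demographics.gender, ~location.region, ~wealth.how.inherited,
  "Person A", 1996,  40,                "male",               "North America",  "not inherited",
  "Person A", 2014,  58,                "male",               "North America",  "not inherited",
  "Person B", 2014,  0,                 "female",             "Europe",         "father",
  "Person C", 2014,  71,                "female",             "East Asia",      "3rd generation"
)

bills_clean <- bills_raw %>%
  # Keep only the 2014 observations so nobody is counted twice.
  filter(year == 2014) %>%
  # Keep the relevant columns and rename them in one step.
  select(
    name,
    age = demographics.age,
    gender = demographics.gender,
    region = location.region,
    inheritance = wealth.how.inherited
  ) %>%
  # An age of 0 is treated as a placeholder for a missing value.
  filter(age > 0) %>%
  mutate(
    # Collapse every value other than "not inherited" into "inherited".
    inheritance = if_else(inheritance == "not inherited",
                          "not inherited", "inherited"),
    # Assumed bins: (17,24], (24,34], (34,44], (44,54], (54,64], (64,Inf].
    age_range = cut(
      age,
      breaks = c(17, 24, 34, 44, 54, 64, Inf),
      labels = c("18-24", "25-34", "35-44", "45-54", "55-64", "65+")
    )
  )
```

Passing an explicit vector of breaks matters here: if cut() is given a single number such as breaks = 6, it builds equal-width bins over the observed age range, and the labels no longer correspond to the ages they name.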

# A tibble: 1,590 × 6
   name               age gender region        inheritance   age_range
   <chr>            <dbl> <chr>  <chr>         <chr>         <fct>    
 1 Bill Gates          58 male   North America not inherited 35-44    
 2 Carlos Slim Helu    74 male   Latin America not inherited 55-64    
 3 Amancio Ortega      77 male   Europe        not inherited 55-64    
 4 Warren Buffett      83 male   North America not inherited 55-64    
 5 Larry Ellison       69 male   North America not inherited 45-54    
 6 Charles Koch        78 male   North America inherited     55-64    
 7 David Koch          73 male   North America inherited     45-54    
 8 Sheldon Adelson     80 male   North America not inherited 55-64    
 9 Christy Walton      59 female North America inherited     35-44    
10 Jim Walton          66 male   North America inherited     45-54    
# ℹ 1,580 more rows

# A tibble: 2,614 × 22
   name             rank  year company.founded company.name company.relationship
   <chr>           <dbl> <dbl>           <dbl> <chr>        <chr>               
 1 Bill Gates          1  1996            1975 Microsoft    founder             
 2 Bill Gates          1  2001            1975 Microsoft    founder             
 3 Bill Gates          1  2014            1975 Microsoft    founder             
 4 Warren Buffett      2  1996            1962 Berkshire H… founder             
 5 Warren Buffett      2  2001            1962 Berkshire H… founder             
 6 Carlos Slim He…     2  2014            1990 Telmex       founder             
 7 Oeri Hoffman a…     3  1996            1896 F. Hoffmann… <NA>                
 8 Paul Allen          3  2001            1975 Microsoft    founder             
 9 Amancio Ortega      3  2014            1975 Zara         founder             
10 Lee Shau Kee        4  1996            1976 Henderson L… founder/chairman    
# ℹ 2,604 more rows
# ℹ 16 more variables: company.sector <chr>, company.type <chr>,
#   demographics.age <dbl>, demographics.gender <chr>,
#   location.citizenship <chr>, `location.country code` <chr>,
#   location.gdp <dbl>, location.region <chr>, wealth.type <chr>,
#   `wealth.worth in billions` <dbl>, wealth.how.category <chr>,
#   `wealth.how.from emerging` <lgl>, wealth.how.industry <chr>, …

Additional Appendices

Additional Summary Statistics

# A tibble: 14 × 3
# Groups:   region [7]
   region                   gender num_bills
   <chr>                    <chr>      <int>
 1 East Asia                female        22
 2 East Asia                male         316
 3 Europe                   female        51
 4 Europe                   male         403
 5 Latin America            female        20
 6 Latin America            male          86
 7 Middle East/North Africa female         6
 8 Middle East/North Africa male          64
 9 North America            female        62
10 North America            male         484
11 South Asia               female         3
12 South Asia               male          57
13 Sub-Saharan Africa       female         2
14 Sub-Saharan Africa       male          14
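
A count table like the one above can be produced with group_by() and summarise(). The sketch below uses a small hypothetical data frame, bills_toy, in place of bills_clean, and the choice of .groups = "drop" is an assumption about how one might want the result returned.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; the real analysis would use bills_clean.
bills_toy <- tribble(
  ~region,     ~gender,
  "Europe",    "female",
  "Europe",    "male",
  "Europe",    "male",
  "East Asia", "male"
)

region_gender_counts <- bills_toy %>%
  group_by(region, gender) %>%
  # n() counts the rows in each region/gender group.
  summarise(num_bills = n(), .groups = "drop")
```

Setting .groups = "drop" returns an ungrouped tibble, which also avoids the informational message that summarise() prints when it leaves some grouping in place.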

# A tibble: 39 × 3
# Groups:   region [7]
   region    age_range num_bills
   <chr>     <fct>         <int>
 1 East Asia 18-24             3
 2 East Asia 25-34            61
 3 East Asia 35-44           129
 4 East Asia 45-54            96
 5 East Asia 55-64            30
 6 East Asia 64-100           19
 7 Europe    18-24             5
 8 Europe    25-34            73
 9 Europe    35-44           151
10 Europe    45-54           133
# ℹ 29 more rows

# A tibble: 67 × 4
# Groups:   region, gender [14]
   region    gender age_range num_bills
   <chr>     <chr>  <fct>         <int>
 1 East Asia female 18-24             2
 2 East Asia female 25-34             7
 3 East Asia female 35-44             9
 4 East Asia female 45-54             4
 5 East Asia male   18-24             1
 6 East Asia male   25-34            54
 7 East Asia male   35-44           120
 8 East Asia male   45-54            92
 9 East Asia male   55-64            30
10 East Asia male   64-100           19
# ℹ 57 more rows

# A tibble: 7 × 7
  region                   median_age mean_age sd_age min_age max_age num_bills
  <chr>                         <dbl>    <dbl>  <dbl>   <dbl>   <dbl>     <int>
1 East Asia                      59       60.0   12.9      24      93       338
2 Europe                         61       61.7   12.9      29      96       454
3 Latin America                  67       67.0   12.0      31      93       106
4 Middle East/North Africa       62.5     63.1   13.6      33      94        70
5 North America                  67       66.3   13.2      29      98       546
6 South Asia                     61       62.4   10.5      41      90        60
7 Sub-Saharan Africa             60.5     60.9   11.1      40      82        16
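
The per-region age summary above has the shape of a single summarise() call with one expression per column. The following is a sketch under that assumption; bills_toy and its ages are hypothetical, and the column names are taken from the table.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; the real analysis would use bills_clean.
bills_toy <- tribble(
  ~region,     ~age,
  "Europe",    50,
  "Europe",    60,
  "Europe",    70,
  "East Asia", 40,
  "East Asia", 60
)

age_by_region <- bills_toy %>%
  group_by(region) %>%
  summarise(
    median_age = median(age),
    mean_age   = mean(age),
    sd_age     = sd(age),   # sample standard deviation (n - 1 denominator)
    min_age    = min(age),
    max_age    = max(age),
    num_bills  = n()
  )
```

Adding gender to group_by() would give the region-by-gender version of the same summary.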

# A tibble: 14 × 8
# Groups:   region [7]
   region            gender median_age mean_age sd_age min_age max_age num_bills
   <chr>             <chr>       <dbl>    <dbl>  <dbl>   <dbl>   <dbl>     <int>
 1 East Asia         female       51.5     52.6   12.6      24      73        22
 2 East Asia         male         60       60.5   12.8      36      93       316
 3 Europe            female       63       62.5   14.7      33      95        51
 4 Europe            male         61       61.6   12.6      29      96       403
 5 Latin America     female       69.5     68.9   12.8      40      89        20
 6 Latin America     male         66.5     66.5   11.9      31      93        86
 7 Middle East/Nort… female       65.5     65.2   13.2      47      85         6
 8 Middle East/Nort… male         62       62.9   13.7      33      94        64
 9 North America     female       62.5     64.4   13.1      41      94        62
10 North America     male         67.5     66.5   13.3      29      98       484
11 South Asia        female       63       62     15.5      46      77         3
12 South Asia        male         60       62.4   10.4      41      90        57
13 Sub-Saharan Afri… female       51.5     51.5   16.3      40      63         2
14 Sub-Saharan Afri… male         60.5     62.2   10.3      49      82        14

These are additional summary statistics that we found interesting but did not include in our final report.