Characteristics of Billionaires
Appendix to report
Data Cleaning
We collected our data by downloading the .csv file from the CORGIS Dataset Project online. We then uploaded the raw .csv data and loaded it using read_csv(). From there, we cleaned the data to fit our research question. First, we filtered the data so that only observations from 2014 remained. The original data set has observations from 1996, 2001, and 2014, so some people, such as Bill Gates, appear more than once; keeping only 2014 prevents them from being counted multiple times in our analysis. Because the three years are not consecutive, the data set also does not show continuous change over time, so we chose the most recent year available.

Next, we selected only the columns directly related to our research question: demographics.age, demographics.gender, location.region, and wealth.how.inherited. We also kept name as an identifier variable. This keeps the data set from being cluttered with unused variables. We then renamed the columns to meet style guidelines and to make them easier to use and understand: demographics.age to age, demographics.gender to gender, location.region to region, and wealth.how.inherited to inheritance.

At this point, there were no NA values in our data, but the age for some people was recorded as 0, which is clearly a placeholder for a missing value. We filtered out these rows because they are not meaningful and would bias the age analysis downward. We also added a variable, age_range, based on the age ranges typically used in other public research analyses we have seen. Our data contain so many distinct ages that grouping by individual age values would not be practical. Finally, we grouped the inheritance variable into two categories, “inherited” and “not inherited”, where “inherited” covers every value other than “not inherited”, such as “3rd generation”, “father”, and “5th generation or longer”. Our clean data frame, bills_clean, now has no NA values, has clean names and values, and is ready for analysis.
# A tibble: 1,590 × 6
   name               age gender region        inheritance   age_range
   <chr>            <dbl> <chr>  <chr>         <chr>         <fct>
 1 Bill Gates          58 male   North America not inherited 35-44
 2 Carlos Slim Helu    74 male   Latin America not inherited 55-64
 3 Amancio Ortega      77 male   Europe        not inherited 55-64
 4 Warren Buffett      83 male   North America not inherited 55-64
 5 Larry Ellison       69 male   North America not inherited 45-54
 6 Charles Koch        78 male   North America inherited     55-64
 7 David Koch          73 male   North America inherited     45-54
 8 Sheldon Adelson     80 male   North America not inherited 55-64
 9 Christy Walton      59 female North America inherited     35-44
10 Jim Walton          66 male   North America inherited     45-54
# ℹ 1,580 more rows
# A tibble: 2,614 × 22
   name             rank  year company.founded company.name company.relationship
   <chr>           <dbl> <dbl>           <dbl> <chr>        <chr>
 1 Bill Gates          1  1996            1975 Microsoft    founder
 2 Bill Gates          1  2001            1975 Microsoft    founder
 3 Bill Gates          1  2014            1975 Microsoft    founder
 4 Warren Buffett      2  1996            1962 Berkshire H… founder
 5 Warren Buffett      2  2001            1962 Berkshire H… founder
 6 Carlos Slim He…     2  2014            1990 Telmex       founder
 7 Oeri Hoffman a…     3  1996            1896 F. Hoffmann… <NA>
 8 Paul Allen          3  2001            1975 Microsoft    founder
 9 Amancio Ortega      3  2014            1975 Zara         founder
10 Lee Shau Kee        4  1996            1976 Henderson L… founder/chairman
# ℹ 2,604 more rows
# ℹ 16 more variables: company.sector <chr>, company.type <chr>,
#   demographics.age <dbl>, demographics.gender <chr>,
#   location.citizenship <chr>, `location.country code` <chr>,
#   location.gdp <dbl>, location.region <chr>, wealth.type <chr>,
#   `wealth.worth in billions` <dbl>, wealth.how.category <chr>,
#   `wealth.how.from emerging` <lgl>, wealth.how.industry <chr>, …
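The cleaning steps described in this section can be sketched roughly as follows. This is a minimal illustration on a small, made-up data frame (the rows and the names "Person A" to "Person D" are hypothetical, not records from the CORGIS file), and it is not necessarily the exact code we ran. The age_range step is left out here to keep the sketch short.

```r
library(dplyr)

# Hypothetical stand-in for the raw data; only the column names match the real file
bills_raw <- data.frame(
  name = c("Person A", "Person A", "Person B", "Person C", "Person D"),
  year = c(2001, 2014, 2014, 2014, 2014),
  demographics.age = c(45, 58, 0, 71, 39),
  demographics.gender = c("male", "male", "female", "female", "male"),
  location.region = c("Europe", "Europe", "East Asia", "North America", "South Asia"),
  wealth.how.inherited = c("not inherited", "not inherited", "father",
                           "3rd generation", "not inherited"),
  rank = c(10, 12, 40, 7, 300)
)

bills_clean <- bills_raw %>%
  # keep one year so that people listed in several years are counted once
  filter(year == 2014) %>%
  # keep only the relevant columns and rename them in the same step
  select(
    name,
    age = demographics.age,
    gender = demographics.gender,
    region = location.region,
    inheritance = wealth.how.inherited
  ) %>%
  # an age of 0 stands in for a missing value, so drop those rows
  filter(age > 0) %>%
  # collapse every value other than "not inherited" into "inherited"
  mutate(
    inheritance = if_else(inheritance == "not inherited",
                          "not inherited", "inherited")
  )
```

In this toy example, Person A's 2001 row and Person B's row with age 0 are dropped, leaving three rows with five columns.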
Additional Appendices
Additional Summary Statistics
`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.
# A tibble: 14 × 3
# Groups:   region [7]
   region                   gender num_bills
   <chr>                    <chr>      <int>
 1 East Asia                female        22
 2 East Asia                male         316
 3 Europe                   female        51
 4 Europe                   male         403
 5 Latin America            female        20
 6 Latin America            male          86
 7 Middle East/North Africa female         6
 8 Middle East/North Africa male          64
 9 North America            female        62
10 North America            male         484
11 South Asia               female         3
12 South Asia               male          57
13 Sub-Saharan Africa       female         2
14 Sub-Saharan Africa       male          14
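A count table like the one above can be produced with group_by() and summarise(). The sketch below uses a small made-up data frame (bills_toy is a hypothetical name) and sets the .groups argument explicitly, which is one way to avoid the "has grouped output" message and return an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data with the same column names as bills_clean
bills_toy <- data.frame(
  region = c("Europe", "Europe", "Europe", "East Asia", "East Asia"),
  gender = c("male", "female", "male", "male", "male")
)

region_gender_counts <- bills_toy %>%
  group_by(region, gender) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarise(num_bills = n(), .groups = "drop")
```

Calling count(bills_toy, region, gender, name = "num_bills") should give the same counts in a single step.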
# A tibble: 39 × 3
# Groups:   region [7]
   region    age_range num_bills
   <chr>     <fct>         <int>
 1 East Asia 18-24             3
 2 East Asia 25-34            61
 3 East Asia 35-44           129
 4 East Asia 45-54            96
 5 East Asia 55-64            30
 6 East Asia 64-100           19
 7 Europe    18-24             5
 8 Europe    25-34            73
 9 Europe    35-44           151
10 Europe    45-54           133
# ℹ 29 more rows
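The age_range variable can be built with cut(). In the printed output, the labels do not appear to line up with the ages (for example, age 58 is shown under 35-44). That pattern would be consistent with equal-width bins, which cut() produces when it is given a number of bins rather than break points, although the original code is not shown here. The sketch below contrasts the two approaches on a few made-up ages; the break points are an assumption chosen to match the labels, and the labels are copied from the output above.

```r
library(dplyr)

# Hypothetical ages spanning the range seen in the data (24 to 98)
ages <- data.frame(age = c(24, 58, 66, 74, 83, 98))

# Assumed break points: intervals are right-closed, e.g. (54, 64] is "55-64"
age_breaks <- c(17, 24, 34, 44, 54, 64, 100)
age_labels <- c("18-24", "25-34", "35-44", "45-54", "55-64", "64-100")

binned <- ages %>%
  mutate(
    # explicit break points: each label matches the ages it covers
    age_range = cut(age, breaks = age_breaks, labels = age_labels),
    # a bin count instead: six equal-width bins over the observed range,
    # so the same labels no longer describe the ages inside each bin
    equal_width = cut(age, breaks = 6, labels = age_labels)
  )
```

With explicit break points, age 58 falls in 55-64; with six equal-width bins over 24 to 98, the same age is labeled 35-44.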
`summarise()` has grouped output by 'region', 'gender'. You can override using
the `.groups` argument.
# A tibble: 67 × 4
# Groups:   region, gender [14]
   region    gender age_range num_bills
   <chr>     <chr>  <fct>         <int>
 1 East Asia female 18-24             2
 2 East Asia female 25-34             7
 3 East Asia female 35-44             9
 4 East Asia female 45-54             4
 5 East Asia male   18-24             1
 6 East Asia male   25-34            54
 7 East Asia male   35-44           120
 8 East Asia male   45-54            92
 9 East Asia male   55-64            30
10 East Asia male   64-100           19
# ℹ 57 more rows
# A tibble: 7 × 7
  region                   median_age mean_age sd_age min_age max_age num_bills
  <chr>                         <dbl>    <dbl>  <dbl>   <dbl>   <dbl>     <int>
1 East Asia                      59       60.0   12.9      24      93       338
2 Europe                         61       61.7   12.9      29      96       454
3 Latin America                  67       67.0   12.0      31      93       106
4 Middle East/North Africa       62.5     63.1   13.6      33      94        70
5 North America                  67       66.3   13.2      29      98       546
6 South Asia                     61       62.4   10.5      41      90        60
7 Sub-Saharan Africa             60.5     60.9   11.1      40      82        16
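Per-region age summaries like those above can be computed with a single summarise() call. The sketch below is a minimal example on made-up ages (bills_toy is a hypothetical name); adding gender to group_by() would give the region-by-gender version.

```r
library(dplyr)

# Hypothetical toy data with the same column names as bills_clean
bills_toy <- data.frame(
  region = c("Europe", "Europe", "Europe", "East Asia", "East Asia"),
  age = c(50, 60, 70, 40, 80)
)

age_by_region <- bills_toy %>%
  group_by(region) %>%
  summarise(
    median_age = median(age),
    mean_age = mean(age),
    sd_age = sd(age),      # sample standard deviation (n - 1 denominator)
    min_age = min(age),
    max_age = max(age),
    num_bills = n()        # number of people in the region
  )
```

With a single grouping variable, summarise() returns an ungrouped tibble, so no .groups message is printed.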
# A tibble: 14 × 8
# Groups:   region [7]
   region            gender median_age mean_age sd_age min_age max_age num_bills
   <chr>             <chr>       <dbl>    <dbl>  <dbl>   <dbl>   <dbl>     <int>
 1 East Asia         female       51.5     52.6   12.6      24      73        22
 2 East Asia         male         60       60.5   12.8      36      93       316
 3 Europe            female       63       62.5   14.7      33      95        51
 4 Europe            male         61       61.6   12.6      29      96       403
 5 Latin America     female       69.5     68.9   12.8      40      89        20
 6 Latin America     male         66.5     66.5   11.9      31      93        86
 7 Middle East/Nort… female       65.5     65.2   13.2      47      85         6
 8 Middle East/Nort… male         62       62.9   13.7      33      94        64
 9 North America     female       62.5     64.4   13.1      41      94        62
10 North America     male         67.5     66.5   13.3      29      98       484
11 South Asia        female       63       62     15.5      46      77         3
12 South Asia        male         60       62.4   10.4      41      90        57
13 Sub-Saharan Afri… female       51.5     51.5   16.3      40      63         2
14 Sub-Saharan Afri… male         60.5     62.2   10.3      49      82        14
These are additional summary statistics that we found interesting but did not include in our final report.