library(tidyverse)
library(skimr)
library(readr)
Accident severity in New York State
Proposal
Data 1 - Hurricanes
Introduction and data
This data set comes from the National Oceanic and Atmospheric Administration (NOAA)’s National Hurricane Center as part of the International Best Track Archive for Climate Stewardship (IBTrACS) project, which is “the most complete global collection of tropical cyclones available”.
This particular data set was formed over the last 3 years (2020 - 2022) as IBTrACS continuously merged data on tropical cyclones (hurricanes) from multiple weather agencies around the world, including all the World Meteorological Organization (WMO) Regional Specialized Meteorological Centres.
The observations in this data set are hurricane tracking data, with information for each hurricane reported at 3-hour intervals over the duration of the hurricane’s life cycle.
Research question
- Question: What has been the relationship between the geographic region and strength of hurricanes between 2020 and 2022?
- Importance: In an era of global warming, natural disasters, especially hurricanes, have become more powerful. It is therefore important to see which geographic areas have recently experienced the highest-intensity storms in order to understand how best to mitigate their effects.
- Description and Hypothesis: The research topic aims to investigate the relationship between the geographic region and strength of hurricanes that occurred between 2020 and 2022. The focus of the study will be to examine whether there is a correlation between the location of hurricane formation and the intensity of the hurricane that forms in that region during the specified time period. To conduct this research, data on the strength and geographic location of hurricanes that occurred between 2020 and 2022 will be collected and analyzed. The study will utilize statistical methods to explore any potential patterns or trends in the data, and to determine whether there is a significant relationship between hurricane strength and location. Our hypothesis is that the North Atlantic region experiences the strongest hurricanes, based on data from past deadly hurricanes.
- Variables:
BASIN: 7 general regions in which the hurricane can be located (categorical)
SUBBASIN: A group of sub-regions with each one falling into one of the 7 (categorical)
USA_WIND: Maximum sustained wind speed in knots: 0 - 300 kts (quantitative)
USA_PRES: Minimum sea level pressure, 850 - 1050 mb. (quantitative)
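As a rough sketch of how these variables could be compared across regions, the example below summarises wind speed and pressure by basin. The rows in `toy_tracks` are invented for illustration and are not taken from the IBTrACS file. The names `wind_kts`, `pres_mb` and `basin_summary` are hypothetical. The conversion step assumes the wind and pressure columns arrive as text, as the column specification in the data glimpse suggests.

```r
library(dplyr)

# Invented rows standing in for the IBTrACS columns listed above.
toy_tracks <- tibble(
  BASIN    = c("EP", "EP", "WP", "WP", "SI"),
  USA_WIND = c("35", "90", "120", " ", "60"),    # text, with a blank for "not reported"
  USA_PRES = c("1002", "965", "930", " ", "985")
)

basin_summary <- toy_tracks |>
  mutate(
    # Blank strings become NA when converted to numbers.
    wind_kts = suppressWarnings(as.numeric(USA_WIND)),
    pres_mb  = suppressWarnings(as.numeric(USA_PRES))
  ) |>
  group_by(BASIN) |>
  summarise(
    n_obs       = sum(!is.na(wind_kts)),          # observations with a wind report
    max_wind    = max(wind_kts, na.rm = TRUE),    # strongest sustained wind (kts)
    median_wind = median(wind_kts, na.rm = TRUE),
    min_pres    = min(pres_mb, na.rm = TRUE),     # lowest central pressure (mb)
    .groups     = "drop"
  )

print(basin_summary)
```

Summarising per basin, rather than per 3-hour observation, is only one option. Taking the peak wind per storm first (grouping by `SID`) would keep long-lived storms from dominating a basin's median.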
Glimpse of data
hurricanes <- read_csv("data/ibtracs.last3years.csv")
Rows: 18106 Columns: 163
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (141): SID, SEASON, BASIN, SUBBASIN, NAME, NATURE, LAT, LON, WMO_WIND, ...
dbl (17): NUMBER, USA_SSHS, TOKYO_GRADE, TOKYO_R50_DIR, TOKYO_R30_DIR, TOK...
lgl (4): DS824_STAGE, TD9636_STAGE, NEUMANN_CLASS, MLC_CLASS
dttm (1): ISO_TIME
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(hurricanes)
Name | hurricanes |
Number of rows | 18106 |
Number of columns | 163 |
_______________________ | |
Column type frequency: | |
character | 141 |
logical | 4 |
numeric | 17 |
POSIXct | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
SID | 1 | 1.00 | 13 | 13 | 0 | 344 | 0 |
SEASON | 0 | 1.00 | 4 | 4 | 0 | 5 | 0 |
BASIN | 3902 | 0.78 | 2 | 2 | 0 | 5 | 0 |
SUBBASIN | 2954 | 0.84 | 2 | 2 | 0 | 8 | 0 |
NAME | 1 | 1.00 | 3 | 16 | 0 | 270 | 0 |
NATURE | 1 | 1.00 | 2 | 2 | 0 | 6 | 0 |
LAT | 0 | 1.00 | 7 | 13 | 0 | 10575 | 0 |
LON | 0 | 1.00 | 7 | 12 | 0 | 12828 | 0 |
WMO_WIND | 12462 | 0.31 | 2 | 3 | 0 | 46 | 0 |
WMO_PRES | 11783 | 0.35 | 2 | 4 | 0 | 106 | 0 |
WMO_AGENCY | 11686 | 0.35 | 3 | 10 | 0 | 8 | 0 |
TRACK_TYPE | 1 | 1.00 | 4 | 16 | 0 | 4 | 0 |
DIST2LAND | 0 | 1.00 | 1 | 4 | 0 | 2495 | 0 |
LANDFALL | 344 | 0.98 | 1 | 4 | 0 | 2446 | 0 |
IFLAG | 1 | 1.00 | 14 | 14 | 0 | 93 | 0 |
USA_AGENCY | 10021 | 0.45 | 4 | 14 | 0 | 9 | 0 |
USA_ATCF_ID | 2243 | 0.88 | 8 | 8 | 0 | 311 | 0 |
USA_LAT | 2313 | 0.87 | 7 | 13 | 0 | 7759 | 0 |
USA_LON | 2313 | 0.87 | 7 | 12 | 0 | 9903 | 0 |
USA_RECORD | 18043 | 0.00 | 1 | 1 | 0 | 2 | 0 |
USA_STATUS | 4559 | 0.75 | 2 | 2 | 0 | 14 | 0 |
USA_WIND | 2410 | 0.87 | 2 | 3 | 0 | 127 | 0 |
USA_PRES | 2556 | 0.86 | 2 | 4 | 0 | 127 | 0 |
USA_R34_NE | 9545 | 0.47 | 1 | 5 | 0 | 185 | 0 |
USA_R34_SE | 9818 | 0.46 | 1 | 5 | 0 | 160 | 0 |
USA_R34_SW | 10797 | 0.40 | 1 | 5 | 0 | 147 | 0 |
USA_R34_NW | 10336 | 0.43 | 1 | 5 | 0 | 165 | 0 |
USA_R50_NE | 14478 | 0.20 | 1 | 5 | 0 | 157 | 0 |
USA_R50_SE | 14809 | 0.18 | 1 | 5 | 0 | 149 | 0 |
USA_R50_SW | 15247 | 0.16 | 1 | 5 | 0 | 131 | 0 |
USA_R50_NW | 14884 | 0.18 | 1 | 5 | 0 | 145 | 0 |
USA_R64_NE | 15935 | 0.12 | 1 | 5 | 0 | 87 | 0 |
USA_R64_SE | 16072 | 0.11 | 1 | 5 | 0 | 88 | 0 |
USA_R64_SW | 16364 | 0.10 | 1 | 5 | 0 | 71 | 0 |
USA_R64_NW | 16145 | 0.11 | 1 | 5 | 0 | 76 | 0 |
USA_POCI | 2819 | 0.84 | 2 | 4 | 0 | 33 | 0 |
USA_ROCI | 2821 | 0.84 | 2 | 5 | 0 | 187 | 0 |
USA_RMW | 2848 | 0.84 | 1 | 5 | 0 | 104 | 0 |
USA_EYE | 17754 | 0.02 | 2 | 5 | 0 | 13 | 0 |
TOKYO_LAT | 15059 | 0.17 | 7 | 13 | 0 | 1815 | 0 |
TOKYO_LON | 15059 | 0.17 | 7 | 12 | 0 | 1930 | 0 |
TOKYO_WIND | 16467 | 0.09 | 2 | 3 | 0 | 47 | 0 |
TOKYO_PRES | 15059 | 0.17 | 2 | 4 | 0 | 86 | 0 |
TOKYO_R50_LONG | 17440 | 0.04 | 2 | 5 | 0 | 42 | 0 |
TOKYO_R50_SHORT | 17440 | 0.04 | 2 | 5 | 0 | 38 | 0 |
TOKYO_R30_LONG | 16467 | 0.09 | 2 | 5 | 0 | 70 | 0 |
TOKYO_R30_SHORT | 16467 | 0.09 | 2 | 5 | 0 | 61 | 0 |
CMA_LAT | 15046 | 0.17 | 7 | 13 | 0 | 1802 | 0 |
CMA_LON | 15046 | 0.17 | 7 | 12 | 0 | 1920 | 0 |
CMA_WIND | 15046 | 0.17 | 2 | 3 | 0 | 80 | 0 |
CMA_PRES | 15046 | 0.17 | 2 | 4 | 0 | 79 | 0 |
HKO_LAT | 15967 | 0.12 | 7 | 13 | 0 | 960 | 0 |
HKO_LON | 15967 | 0.12 | 7 | 12 | 0 | 1104 | 0 |
HKO_CAT | 15920 | 0.12 | 1 | 6 | 0 | 7 | 0 |
HKO_WIND | 15967 | 0.12 | 2 | 3 | 0 | 51 | 0 |
HKO_PRES | 15967 | 0.12 | 2 | 4 | 0 | 72 | 0 |
NEWDELHI_LAT | 17665 | 0.02 | 7 | 13 | 0 | 223 | 0 |
NEWDELHI_LON | 17665 | 0.02 | 7 | 12 | 0 | 250 | 0 |
NEWDELHI_GRADE | 17664 | 0.02 | 1 | 4 | 0 | 7 | 0 |
NEWDELHI_WIND | 17665 | 0.02 | 2 | 3 | 0 | 27 | 0 |
NEWDELHI_PRES | 17665 | 0.02 | 2 | 4 | 0 | 47 | 0 |
NEWDELHI_DP | 17665 | 0.02 | 1 | 2 | 0 | 37 | 0 |
NEWDELHI_POCI | 18105 | 0.00 | 2 | 2 | 0 | 1 | 0 |
REUNION_LAT | 16725 | 0.08 | 8 | 13 | 0 | 867 | 0 |
REUNION_LON | 16725 | 0.08 | 7 | 12 | 0 | 1085 | 0 |
REUNION_WIND | 16784 | 0.07 | 2 | 3 | 0 | 74 | 0 |
REUNION_PRES | 16820 | 0.07 | 2 | 4 | 0 | 69 | 0 |
REUNION_RMW | 17701 | 0.02 | 1 | 5 | 0 | 40 | 0 |
REUNION_R34_NE | 17480 | 0.03 | 1 | 5 | 0 | 78 | 0 |
REUNION_R34_SE | 17480 | 0.03 | 1 | 5 | 0 | 102 | 0 |
REUNION_R34_SW | 17480 | 0.03 | 1 | 5 | 0 | 89 | 0 |
REUNION_R34_NW | 17480 | 0.03 | 1 | 5 | 0 | 76 | 0 |
REUNION_R50_NE | 17812 | 0.02 | 1 | 5 | 0 | 42 | 0 |
REUNION_R50_SE | 17812 | 0.02 | 1 | 5 | 0 | 29 | 0 |
REUNION_R50_SW | 17812 | 0.02 | 1 | 5 | 0 | 33 | 0 |
REUNION_R50_NW | 17812 | 0.02 | 1 | 5 | 0 | 30 | 0 |
REUNION_R64_NE | 18105 | 0.00 | 5 | 5 | 0 | 1 | 0 |
REUNION_R64_SE | 18105 | 0.00 | 5 | 5 | 0 | 1 | 0 |
REUNION_R64_SW | 18105 | 0.00 | 5 | 5 | 0 | 1 | 0 |
REUNION_R64_NW | 18105 | 0.00 | 5 | 5 | 0 | 1 | 0 |
BOM_LAT | 15794 | 0.13 | 8 | 13 | 0 | 1385 | 0 |
BOM_LON | 15794 | 0.13 | 7 | 12 | 0 | 1754 | 0 |
BOM_WIND | 15838 | 0.13 | 2 | 3 | 0 | 51 | 0 |
BOM_PRES | 15852 | 0.12 | 2 | 4 | 0 | 79 | 0 |
BOM_RMW | 17430 | 0.04 | 1 | 5 | 0 | 34 | 0 |
BOM_R34_NE | 17495 | 0.03 | 2 | 5 | 0 | 66 | 0 |
BOM_R34_SE | 17440 | 0.04 | 2 | 5 | 0 | 71 | 0 |
BOM_R34_SW | 17370 | 0.04 | 2 | 5 | 0 | 69 | 0 |
BOM_R34_NW | 17461 | 0.04 | 2 | 5 | 0 | 62 | 0 |
BOM_R50_NE | 17875 | 0.01 | 2 | 5 | 0 | 41 | 0 |
BOM_R50_SE | 17874 | 0.01 | 2 | 5 | 0 | 42 | 0 |
BOM_R50_SW | 17870 | 0.01 | 2 | 5 | 0 | 39 | 0 |
BOM_R50_NW | 17889 | 0.01 | 2 | 5 | 0 | 37 | 0 |
BOM_R64_NE | 18032 | 0.00 | 2 | 5 | 0 | 17 | 0 |
BOM_R64_SE | 18031 | 0.00 | 2 | 5 | 0 | 14 | 0 |
BOM_R64_SW | 18041 | 0.00 | 2 | 5 | 0 | 14 | 0 |
BOM_R64_NW | 18043 | 0.00 | 2 | 5 | 0 | 12 | 0 |
BOM_ROCI | 16147 | 0.11 | 2 | 5 | 0 | 124 | 0 |
BOM_POCI | 16114 | 0.11 | 2 | 4 | 0 | 16 | 0 |
BOM_EYE | 18075 | 0.00 | 1 | 5 | 0 | 11 | 0 |
NADI_LAT | 17795 | 0.02 | 8 | 13 | 0 | 242 | 0 |
NADI_LON | 17795 | 0.02 | 7 | 12 | 0 | 276 | 0 |
NADI_WIND | 17795 | 0.02 | 2 | 3 | 0 | 40 | 0 |
NADI_PRES | 17795 | 0.02 | 2 | 3 | 0 | 75 | 0 |
WELLINGTON_LAT | 17957 | 0.01 | 8 | 13 | 0 | 134 | 0 |
WELLINGTON_LON | 17957 | 0.01 | 7 | 12 | 0 | 145 | 0 |
WELLINGTON_WIND | 17982 | 0.01 | 2 | 3 | 0 | 22 | 0 |
WELLINGTON_PRES | 18072 | 0.00 | 2 | 3 | 0 | 17 | 0 |
DS824_LAT | 18105 | 0.00 | 13 | 13 | 0 | 1 | 0 |
DS824_LON | 18105 | 0.00 | 12 | 12 | 0 | 1 | 0 |
DS824_WIND | 18105 | 0.00 | 3 | 3 | 0 | 1 | 0 |
DS824_PRES | 18105 | 0.00 | 2 | 2 | 0 | 1 | 0 |
TD9636_LAT | 18105 | 0.00 | 13 | 13 | 0 | 1 | 0 |
TD9636_LON | 18105 | 0.00 | 12 | 12 | 0 | 1 | 0 |
TD9636_WIND | 18105 | 0.00 | 3 | 3 | 0 | 1 | 0 |
TD9636_PRES | 18105 | 0.00 | 2 | 2 | 0 | 1 | 0 |
TD9635_LAT | 18105 | 0.00 | 13 | 13 | 0 | 1 | 0 |
TD9635_LON | 18105 | 0.00 | 12 | 12 | 0 | 1 | 0 |
TD9635_WIND | 18105 | 0.00 | 3 | 3 | 0 | 1 | 0 |
TD9635_PRES | 18105 | 0.00 | 2 | 2 | 0 | 1 | 0 |
TD9635_ROCI | 18105 | 0.00 | 5 | 5 | 0 | 1 | 0 |
NEUMANN_LAT | 18105 | 0.00 | 13 | 13 | 0 | 1 | 0 |
NEUMANN_LON | 18105 | 0.00 | 12 | 12 | 0 | 1 | 0 |
NEUMANN_WIND | 18105 | 0.00 | 3 | 3 | 0 | 1 | 0 |
NEUMANN_PRES | 18105 | 0.00 | 2 | 2 | 0 | 1 | 0 |
MLC_LAT | 18105 | 0.00 | 13 | 13 | 0 | 1 | 0 |
MLC_LON | 18105 | 0.00 | 12 | 12 | 0 | 1 | 0 |
MLC_WIND | 18105 | 0.00 | 3 | 3 | 0 | 1 | 0 |
MLC_PRES | 18105 | 0.00 | 2 | 2 | 0 | 1 | 0 |
USA_GUST | 15462 | 0.15 | 2 | 3 | 0 | 31 | 0 |
BOM_GUST | 16932 | 0.06 | 2 | 3 | 0 | 28 | 0 |
BOM_GUST_PER | 16932 | 0.06 | 2 | 6 | 0 | 28 | 0 |
REUNION_GUST | 18105 | 0.00 | 3 | 3 | 0 | 1 | 0 |
REUNION_GUST_PER | 17406 | 0.04 | 1 | 6 | 0 | 2 | 0 |
USA_SEAHGT | 15266 | 0.16 | 2 | 2 | 0 | 2 | 0 |
USA_SEARAD_NE | 15427 | 0.15 | 2 | 5 | 0 | 118 | 0 |
USA_SEARAD_SE | 15746 | 0.13 | 2 | 5 | 0 | 121 | 0 |
USA_SEARAD_SW | 16075 | 0.11 | 2 | 5 | 0 | 110 | 0 |
USA_SEARAD_NW | 15805 | 0.13 | 2 | 5 | 0 | 106 | 0 |
STORM_SPEED | 0 | 1.00 | 1 | 3 | 0 | 60 | 0 |
STORM_DIR | 0 | 1.00 | 1 | 7 | 0 | 362 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
DS824_STAGE | 18106 | 0 | NaN | : |
TD9636_STAGE | 18106 | 0 | NaN | : |
NEUMANN_CLASS | 18106 | 0 | NaN | : |
MLC_CLASS | 18106 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
NUMBER | 1 | 1.00 | 52.10 | 33.75 | 1.0 | 21.0 | 52.0 | 80.00 | 121.0 | ▇▅▆▅▃ |
USA_SSHS | 0 | 1.00 | -1.16 | 2.28 | -5.0 | -3.0 | -1.0 | 0.00 | 5.0 | ▇▃▇▁▁ |
TOKYO_GRADE | 15045 | 0.17 | 3.71 | 1.47 | 1.0 | 2.0 | 3.0 | 5.00 | 6.0 | ▇▇▃▅▅ |
TOKYO_R50_DIR | 15046 | 0.17 | 1.76 | 3.49 | 0.0 | 0.0 | 0.0 | 0.00 | 9.0 | ▇▁▁▁▂ |
TOKYO_R30_DIR | 15046 | 0.17 | 2.61 | 3.29 | 0.0 | 0.0 | 1.0 | 4.00 | 9.0 | ▇▃▁▁▂ |
TOKYO_LAND | 16470 | 0.09 | 0.00 | 0.07 | 0.0 | 0.0 | 0.0 | 0.00 | 1.0 | ▇▁▁▁▁ |
CMA_CAT | 15024 | 0.17 | 3.25 | 2.71 | 0.0 | 1.0 | 2.0 | 4.00 | 9.0 | ▇▇▃▁▃ |
NEWDELHI_CI | 17751 | 0.02 | 1.95 | 1.66 | 0.0 | 0.0 | 1.5 | 2.75 | 6.5 | ▅▇▂▂▁ |
REUNION_TYPE | 16722 | 0.08 | 4.03 | 1.90 | 1.0 | 3.0 | 3.0 | 6.00 | 8.0 | ▆▇▃▆▂ |
REUNION_TNUM | 17406 | 0.04 | 5.50 | 3.50 | 1.0 | 2.5 | 3.5 | 9.90 | 9.9 | ▇▅▁▁▇ |
REUNION_CI | 17663 | 0.02 | 3.15 | 1.34 | 1.0 | 2.5 | 3.0 | 3.50 | 7.0 | ▅▇▅▂▁ |
BOM_TYPE | 16205 | 0.10 | 27.81 | 14.72 | 10.0 | 20.0 | 20.0 | 30.00 | 81.0 | ▇▂▂▁▁ |
BOM_TNUM | 17362 | 0.04 | 2.61 | 1.17 | 1.0 | 2.0 | 2.5 | 3.00 | 7.0 | ▇▆▃▁▁ |
BOM_CI | 17358 | 0.04 | 2.70 | 1.20 | 0.5 | 2.0 | 2.5 | 3.50 | 7.0 | ▃▇▂▂▁ |
BOM_POS_METHOD | 17606 | 0.03 | 3.68 | 1.37 | 1.0 | 3.0 | 3.0 | 3.00 | 7.0 | ▁▇▁▁▂ |
BOM_PRES_METHOD | 17368 | 0.04 | 5.83 | 1.30 | 2.0 | 5.0 | 5.0 | 7.00 | 9.0 | ▁▁▇▂▁ |
NADI_CAT | 17785 | 0.02 | 2.20 | 1.54 | -1.0 | 1.0 | 2.0 | 3.00 | 5.0 | ▃▆▇▅▆ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
ISO_TIME | 1 | 1 | 2020-01-04 | 2023-03-14 | 2021-07-07 06:00:00 | 7872 |
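The summary above shows BASIN with 3902 missing values and only 5 unique codes, and it shows USA_WIND and USA_PRES stored as character columns. One plausible explanation, which we have not verified against the file, is that the IBTrACS basin code `NA` (North Atlantic) is being read as a missing value by `read_csv()`, whose default `na` argument includes the string "NA". A second is that the CSV carries a units row beneath the header. The sketch below shows one way to handle both on a small invented CSV. The sample text and the object names `raw_csv` and `tracks` are hypothetical.

```r
library(readr)
library(dplyr)

# Invented CSV text: a header row, a units row, then three storm rows.
raw_csv <- I("SID,BASIN,USA_WIND\n , ,kts\nA1,NA,65\nA2,EP,40\nA3,NA, \n")

tracks <- read_csv(
  raw_csv,
  na = c("", " "),                             # do NOT treat the string "NA" as missing
  col_types = cols(.default = col_character()),
  show_col_types = FALSE
) |>
  slice(-1) |>                                 # drop the units row
  mutate(USA_WIND = as.numeric(USA_WIND))      # blanks were already read as NA

print(tracks)
```

If the real file behaves the same way, the North Atlantic observations would reappear under the code `NA` instead of being counted as missing. That matters here because the hypothesis is about the North Atlantic.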
Data 2 - US Vehicle Accidents 2016-2021
Introduction and data
This data comes from the paper Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, 2019. This dataset was posted to Kaggle.
According to the data publisher, this data was collected using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks.
Each observation is a traffic accident. Each accident has recorded details such as the date/time, location, severity, weather conditions, cause of incident, and other observed details.
NOTE: The original dataset has over 2.5 million observations and is too large to upload as-is to Posit's hosted RStudio. On a local copy of RStudio, I trimmed the dataset down to the columns of interest and filtered for accidents in New York State. This subset is a starting point and can be changed depending on our final questions.
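The trimming step described in the note could look roughly like the sketch below. The rows in `accidents_all` are invented stand-ins for the full countrywide file, and the extra `Description` column is hypothetical, included only to show a column being dropped. The real step would read the full Kaggle CSV locally and keep all of the columns of interest.

```r
library(dplyr)

# Invented rows standing in for the full countrywide dataset.
accidents_all <- tibble(
  State       = c("NY", "CA", "NY", "TX"),
  Severity    = c(2, 3, 4, 2),
  Junction    = c(FALSE, TRUE, TRUE, FALSE),
  Description = c("a", "b", "c", "d")          # hypothetical column we do not need
)

accidents_ny_toy <- accidents_all |>
  filter(State == "NY") |>                     # keep New York State accidents only
  select(State, Severity, Junction)            # keep only the columns of interest

print(accidents_ny_toy)

# The trimmed table could then be saved for upload, for example with
# readr::write_csv(accidents_ny_toy, "data/Accidents_NY.csv")
```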
Research question
- Questions:
- What factors (weather, cause of accident, nearby road infrastructure, time) contribute to the most severe accidents and can be used as predictors for potential future accidents?
- Importance:
- By utilizing the information in the dataset, future predictions can be made about when and where accidents, including the most severe accidents, will occur. This can be helpful for emergency services in determining road closures, how to position EMS teams, and how to better build infrastructure in the future to prevent accidents.
- Description/Hypothesis:
- The research topic aims to investigate how outside factors such as road structure, weather, time, and cause relate to when accidents are most common. The focus of the study will be to determine key variables that predict when future accidents may occur, using a large dataset of previous accidents in New York State. Our hypothesis is that notable accidents (as determined by severity) are most likely to occur in poor weather and near certain road infrastructure hotspots, like roundabouts or stop signs.
- Variables:
City: location data (categorical)
County: location data (categorical)
State: location data (categorical)
Zipcode: location data (categorical)
Airport_Code: location data (categorical)
Wind_Direction: weather data (categorical)
Weather_Condition: weather data (categorical)
Severity: severity of accident (categorical)
Distance.mi: length of road affected by accident (quantitative)
Temperature: weather data (quantitative)
Wind_Chill: weather data (quantitative)
Humidity: weather data (quantitative)
Pressure.in: weather data (quantitative)
Visibility.mi: weather data (quantitative)
Wind_Speed: weather data (quantitative)
Crossing: presence of a crossing in nearby area (categorical)
Junction: presence of a junction in nearby area (categorical)
Stop: presence of a stop in nearby area (categorical)
Traffic_Signal: presence of a traffic signal in nearby area (categorical)
Start_Time: when the accident occurred (quantitative)
End_Time: when the local roadway was clear of the accident (quantitative)
Weather_Timestamp: when the weather data was recorded (quantitative)
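As a first look at the hypothesis, the share of severe accidents could be compared across an infrastructure flag such as `Junction`. The sketch below uses invented rows, and the cut-off `Severity >= 3` for "severe" is our own assumption, not a definition from the data publisher. The names `accidents_toy`, `severe` and `severity_by_junction` are hypothetical.

```r
library(dplyr)

# Invented rows using two of the variables listed above.
accidents_toy <- tibble(
  Severity = c(2, 2, 3, 4, 2, 4),
  Junction = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

severity_by_junction <- accidents_toy |>
  mutate(severe = Severity >= 3) |>            # assumed threshold for "severe"
  group_by(Junction) |>
  summarise(
    n_accidents  = n(),
    share_severe = mean(severe),               # proportion of severe accidents
    .groups      = "drop"
  )

print(severity_by_junction)
```

The same pattern applies to `Crossing`, `Stop`, `Traffic_Signal`, or a grouped version of `Weather_Condition`. A comparison of shares like this is descriptive only and would need a model to support any claim about prediction.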
Glimpse of data
accidents_NY <- read_csv("data/Accidents_NY.csv")
Rows: 108049 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): City, County, State, Zipcode, Airport_Code, Wind_Direction, Weathe...
dbl (9): Severity, Distance.mi., Temperature.F., Wind_Chill.F., Humidity......
lgl (4): Crossing, Junction, Stop, Traffic_Signal
dttm (3): Start_Time, End_Time, Weather_Timestamp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(accidents_NY)
Name | accidents_NY |
Number of rows | 108049 |
Number of columns | 24 |
_______________________ | |
Column type frequency: | |
character | 8 |
logical | 4 |
numeric | 9 |
POSIXct | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
City | 25 | 1.00 | 3 | 22 | 0 | 1053 | 0 |
County | 0 | 1.00 | 4 | 14 | 0 | 63 | 0 |
State | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
Zipcode | 2 | 1.00 | 5 | 10 | 0 | 13207 | 0 |
Airport_Code | 195 | 1.00 | 4 | 4 | 0 | 55 | 0 |
Wind_Direction | 1794 | 0.98 | 1 | 8 | 0 | 24 | 0 |
Weather_Condition | 925 | 0.99 | 3 | 28 | 0 | 72 | 0 |
Sunrise_Sunset | 119 | 1.00 | 3 | 5 | 0 | 2 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Crossing | 0 | 1 | 0.07 | FAL: 100844, TRU: 7205 |
Junction | 0 | 1 | 0.16 | FAL: 91022, TRU: 17027 |
Stop | 0 | 1 | 0.02 | FAL: 105930, TRU: 2119 |
Traffic_Signal | 0 | 1 | 0.11 | FAL: 95733, TRU: 12316 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Severity | 0 | 1.00 | 2.22 | 0.57 | 1.00 | 2.0 | 2.00 | 2.00 | 4.00 | ▁▇▁▁▁ |
Distance.mi. | 0 | 1.00 | 0.83 | 1.56 | 0.00 | 0.1 | 0.35 | 0.93 | 53.31 | ▇▁▁▁▁ |
Temperature.F. | 807 | 0.99 | 54.01 | 18.43 | -77.80 | 39.0 | 54.00 | 70.00 | 144.00 | ▁▁▇▇▁ |
Wind_Chill.F. | 17035 | 0.84 | 49.28 | 21.51 | -30.40 | 32.0 | 51.00 | 68.00 | 144.00 | ▁▆▇▂▁ |
Humidity… | 863 | 0.99 | 65.56 | 20.18 | 8.00 | 50.0 | 66.00 | 83.00 | 100.00 | ▁▅▇▇▇ |
Pressure.in. | 1072 | 0.99 | 29.75 | 0.40 | 19.75 | 29.5 | 29.80 | 30.03 | 30.87 | ▁▁▁▁▇ |
Visibility.mi. | 1093 | 0.99 | 9.04 | 2.72 | 0.00 | 10.0 | 10.00 | 10.00 | 30.00 | ▁▇▁▁▁ |
Wind_Speed.mph. | 4392 | 0.96 | 8.89 | 5.62 | 0.00 | 5.0 | 8.00 | 12.00 | 141.50 | ▇▁▁▁▁ |
Precipitation.in. | 19660 | 0.82 | 0.02 | 0.34 | 0.00 | 0.0 | 0.00 | 0.00 | 10.05 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Start_Time | 0 | 1.00 | 2016-03-23 02:35:03 | 2021-12-31 23:22:00 | 2021-01-20 15:24:13 | 78321 |
End_Time | 0 | 1.00 | 2016-03-23 08:35:03 | 2021-12-31 23:49:05 | 2021-01-20 20:55:08 | 93167 |
Weather_Timestamp | 615 | 0.99 | 2016-03-23 02:51:00 | 2021-12-31 23:32:00 | 2021-01-21 00:53:00 | 50268 |
Data 3 - Billionaires
Introduction and data
This data comes from the Peterson Institute for International Economics, and it is part of a working paper by Caroline Freund and Sarah Oliver. We found this data through CORGIS (The Collection of Really Great, Interesting, Situated Datasets).
This data was originally collected by Freund and Oliver through the information published on Forbes’ list of billionaires that started in 1996. The data they gathered spans 20 years, from 1996-2015.
Each observation in this dataset contains information about a billionaire (somebody with a net worth of at least $1 billion in a given year), such as how they made their money, how much they are worth, whether they inherited money, gender, nationality, etc. The observations provide information on how these people made their fortunes and help to identify similarities and differences between the billionaires. There are potential ethical concerns with this data because Forbes included 8 monarchs and 4 dictators who made their money through their positions of power. Such people are usually not allowed on the list, but Forbes made an exception in 1997-1998.
Research question
Question:
What are the most significant predictors of net worth for billionaires that inherit wealth and those that do not?
Importance:
Answering the research question of what are the most significant predictors of net worth for billionaires who inherit wealth and those who do not can provide insights into how billionaires accumulate wealth and how different paths to wealth affect the final outcome. This research can have policy implications for governments concerned with wealth inequality and can inform investment decisions for those who invest in companies or industries catering to high-net-worth individuals. Understanding the most significant predictors of net worth can help policymakers design policies aimed at reducing wealth inequality or supporting wealth creation for those who do not inherit wealth. It can also enable investors to make informed decisions about which companies or industries to invest in and what factors to consider when evaluating the potential for wealth creation. Overall, answering this research question can have important implications for wealth accumulation, policy, and investment decisions.
Description and Hypothesis:
The research topic is focused on identifying the most significant predictors of net worth for billionaires who inherit wealth and those who do not. The study seeks to understand how billionaires accumulate wealth and to determine whether the factors that contribute to wealth accumulation differ between those who inherit wealth and those who do not. The research may involve data analysis of a sample of billionaires, with variables such as family connections, access to capital, educational attainment, entrepreneurial experience, and risk-taking propensity being measured and analyzed for their correlation with net worth.
Variables:
wealth.worth in billions: How much the person is worth (quantitative)
wealth.how.inherited: How did the person inherit their money, could be self made (categorical)
wealth.how.was founder: Did that individual found the company (categorical)
wealth.how.was political: Was the person involved in politics (categorical)
year: When did they become a billionaire (quantitative)
demographics.age: How old were they when they became a billionaire (quantitative)
demographics.gender: What is their gender (categorical)
company.relationship: What is the person's relationship with the company (categorical)
company.sector: What sector is the company a part of (categorical)
company.type: What type of company is it, i.e. public, private, etc. (categorical)
location.region: Where is the billionaire based (categorical)
wealth.how.category: What kind of sector is the business in for trading purposes (categorical)
wealth.how.industry: What is the industry (categorical)
Glimpse of data
billions <- read_csv("data/billionaires.csv")
Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(billions)
Name | billions |
Number of rows | 2614 |
Number of columns | 22 |
_______________________ | |
Column type frequency: | |
character | 13 |
logical | 3 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 5 | 45 | 0 | 2077 | 0 |
company.name | 38 | 0.99 | 3 | 59 | 0 | 1576 | 0 |
company.relationship | 46 | 0.98 | 3 | 46 | 0 | 73 | 0 |
company.sector | 23 | 0.99 | 3 | 52 | 0 | 505 | 0 |
company.type | 36 | 0.99 | 3 | 22 | 0 | 15 | 0 |
demographics.gender | 34 | 0.99 | 4 | 14 | 0 | 3 | 0 |
location.citizenship | 0 | 1.00 | 4 | 20 | 0 | 73 | 0 |
location.country code | 0 | 1.00 | 3 | 6 | 0 | 74 | 0 |
location.region | 0 | 1.00 | 1 | 24 | 0 | 8 | 0 |
wealth.type | 22 | 0.99 | 9 | 24 | 0 | 5 | 0 |
wealth.how.category | 1 | 1.00 | 1 | 18 | 0 | 9 | 0 |
wealth.how.industry | 1 | 1.00 | 1 | 31 | 0 | 19 | 0 |
wealth.how.inherited | 0 | 1.00 | 6 | 24 | 0 | 6 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
wealth.how.from emerging | 0 | 1 | 1 | TRU: 2614 |
wealth.how.was founder | 0 | 1 | 1 | TRU: 2614 |
wealth.how.was political | 0 | 1 | 1 | TRU: 2614 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rank | 0 | 1 | 5.996700e+02 | 4.678900e+02 | 1 | 215.0 | 430 | 9.880e+02 | 1.565e+03 | ▇▅▃▂▃ |
year | 0 | 1 | 2.008410e+03 | 7.480000e+00 | 1996 | 2001.0 | 2014 | 2.014e+03 | 2.014e+03 | ▂▂▁▁▇ |
company.founded | 0 | 1 | 1.924710e+03 | 2.437800e+02 | 0 | 1936.0 | 1963 | 1.985e+03 | 2.012e+03 | ▁▁▁▁▇ |
demographics.age | 0 | 1 | 5.334000e+01 | 2.533000e+01 | -42 | 47.0 | 59 | 7.000e+01 | 9.800e+01 | ▁▂▁▇▃ |
location.gdp | 0 | 1 | 1.769103e+12 | 3.547083e+12 | 0 | 0.0 | 0 | 7.250e+11 | 1.060e+13 | ▇▁▁▁▁ |
wealth.worth in billions | 0 | 1 | 3.530000e+00 | 5.090000e+00 | 1 | 1.4 | 2 | 3.500e+00 | 7.600e+01 | ▇▁▁▁▁ |
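The numeric summary above includes values that look like placeholders, not measurements: a minimum `demographics.age` of -42 and a minimum `company.founded` of 0. The sketch below shows one possible way to recode such values as missing before comparing net worth between inherited and non-inherited fortunes. Treating non-positive ages and a founding year of 0 as missing is our assumption, as is the label "not inherited" for self-made wealth. The rows and the names `billions_toy`, `age`, `founded`, `inherited` and `worth_by_inheritance` are invented for illustration.

```r
library(dplyr)

# Invented rows using column names from the summary above.
billions_toy <- tibble(
  `wealth.worth in billions` = c(1.2, 3.5, 10, 2.0, 76),
  wealth.how.inherited       = c("not inherited", "father", "not inherited",
                                 "3rd generation", "not inherited"),
  demographics.age           = c(45, 0, -42, 70, 58),
  company.founded            = c(1985, 0, 1963, 1936, 1975)
)

worth_by_inheritance <- billions_toy |>
  mutate(
    # Assumption: non-positive ages and a founding year of 0 mean "unknown".
    age       = if_else(demographics.age > 0, demographics.age, NA_real_),
    founded   = na_if(company.founded, 0),
    inherited = wealth.how.inherited != "not inherited"
  ) |>
  group_by(inherited) |>
  summarise(
    n            = n(),
    median_worth = median(`wealth.worth in billions`),
    median_age   = median(age, na.rm = TRUE),
    .groups      = "drop"
  )

print(worth_by_inheritance)
```

The three logical columns also show a mean of 1 with every value TRUE, which may be worth checking against the raw file before they are used as predictors.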