Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.
- Peterson Institute for International Economics
- CORGIS website: https://think.cs.vt.edu/corgis/csv/billionaires/
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data were collected in 2016 by Caroline Freund (Senior Research Staff) and Sarah Oliver (Research Analyst). They built the dataset by compiling the Forbes billionaires lists from 1996–2015 and adding detailed information on the individuals listed.
Write a brief description of the observations.
- Each observation represents a billionaire in a given year and includes information such as their rank, the name of their company, where they are from, and whether they were self-made. Some billionaires appear in multiple observations because they were on the Forbes list in more than one of the randomly chosen years (1996, 2001, 2014). In addition, inconsistencies in how Forbes reported the data cause some names to appear more than once.
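
The repeated names described above could be checked by counting rows and distinct years per name. This is only a minimal sketch: the toy rows are made up for illustration, and only the column names `name` and `year` come from the dataset.

```r
library(dplyr)

# Hypothetical rows standing in for the real data; only the column
# names (`name`, `year`) are taken from the billionaires dataset.
toy <- data.frame(
  name = c("Person A", "Person A", "Person A", "Person B", "Person C", "Person C"),
  year = c(1996, 2001, 2014, 2014, 2014, 2014),
  stringsAsFactors = FALSE
)

# For each name, count the rows and the distinct list years.
# More rows than years suggests a duplicate within a single year.
appearances <- toy %>%
  group_by(name) %>%
  summarise(
    n_rows  = n(),
    n_years = n_distinct(year),
    .groups = "drop"
  ) %>%
  mutate(repeated_within_year = n_rows > n_years)

appearances
```

Names with `n_years` above 1 would be people listed in several years, while `repeated_within_year` would flag the reporting inconsistencies, assuming the same spelling of a name always refers to the same person.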

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What factors contribute to the accumulation of wealth among billionaires, and what impact do these individuals have on the economy and society as a whole?
- Does having a finance background, as opposed to a non-finance background, affect whether an individual becomes a billionaire?
- What are the key demographic characteristics that affect billionaire net worth in the US, and how do these factors interact with each other?
- Based on demographic characteristics such as age, gender, and education level, what factors affect the net worth of billionaires?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The study of billionaires explores various aspects of extreme wealth, such as how billionaires accumulate their wealth, the industries and sectors in which their wealth is concentrated, and the factors that contribute to billionaire status (e.g., whether wealth was inherited). Research on billionaires involves analyzing quantitative data, such as net worth, as well as qualitative data, such as personal backgrounds. The topic is interdisciplinary, drawing on fields such as economics, finance, and political science. Hypothesis: billionaire status is driven by a combination of self-made wealth and individual background, such as a background in finance, and these factors are in turn shaped by personal circumstances such as country of origin.
Identify the types of variables in your research question. Categorical? Quantitative?
- This data set has a good mix of categorical and quantitative variables. Examples of categorical variables are the billionaire's name and company name (for “Bill Gates”, the company is “Microsoft”). An example of a quantitative variable is “rank”, which ranges from 1 to 1565.
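
A first pass at the finance versus non-finance question could group a categorical variable and summarize a quantitative one. This is a rough sketch under stated assumptions: the rows and the label "Finance" are hypothetical, and the real values of `wealth.how.industry` would need to be inspected before deciding which ones count as finance. Because every row is already a billionaire, this compares net worth between the two groups rather than the chance of becoming a billionaire.

```r
library(dplyr)

# Hypothetical rows; the column names match the dataset, the values do not.
toy <- data.frame(
  wealth.how.industry      = c("Finance", "Finance", "Retail", "Technology", "Retail"),
  wealth.worth.in.billions = c(2.0, 6.0, 1.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# Assumption: which industry labels count as "finance" is our own choice,
# to be made after looking at the actual categories in the data.
finance_labels <- c("Finance")

by_background <- toy %>%
  mutate(background = if_else(wealth.how.industry %in% finance_labels,
                              "finance", "non-finance")) %>%
  group_by(background) %>%
  summarise(
    n            = n(),
    median_worth = median(wealth.worth.in.billions),
    .groups = "drop"
  )

by_background
```

The median is used here because the net worth distribution in the summary is strongly right-skewed.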

Glimpse of data

# Read the billionaires CSV and summarize every column with skimr
billionaires <- read.csv(file = "data/billionaires.csv")

skimr::skim(billionaires)

Data summary
Name	billionaires
Number of rows	2614
Number of columns	22
_______________________
Column type frequency:
character	16
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
name	1	5	45	0	2077
company.name	1	0	59	38	1578
company.relationship	1	0	46	46	75
company.sector	1	0	52	23	521
company.type	1	0	22	36	19
demographics.gender	1	0	14	34	4
location.citizenship	1	4	20	0	73
location.country.code	1	3	6	0	74
location.region	1	1	24	0	8
wealth.type	1	0	24	22	6
wealth.how.category	1	0	18	1	10
wealth.how.from.emerging	1	4	4	0	1
wealth.how.industry	1	0	31	1	20
wealth.how.inherited	1	6	24	0	6
wealth.how.was.founder	1	4	4	0	1
wealth.how.was.political	1	4	4	0	1

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	1	5.996700e+02	4.678900e+02	1	215.0	430	9.880e+02	1.565e+03	▇▅▃▂▃
year	1	2.008410e+03	7.480000e+00	1996	2001.0	2014	2.014e+03	2.014e+03	▂▂▁▁▇
company.founded	1	1.924710e+03	2.437800e+02	0	1936.0	1963	1.985e+03	2.012e+03	▁▁▁▁▇
demographics.age	1	5.334000e+01	2.533000e+01	-42	47.0	59	7.000e+01	9.800e+01	▁▂▁▇▃
location.gdp	1	1.769103e+12	3.547083e+12	0	0.0	0	7.250e+11	1.060e+13	▇▁▁▁▁
wealth.worth.in.billions	1	3.530000e+00	5.090000e+00	1	1.4	2	3.500e+00	7.600e+01	▇▁▁▁▁
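
The summary above reports every column as complete, yet `demographics.age` has a minimum of -42, `company.founded` has a minimum of 0, and several character columns contain empty strings. These look like placeholders for missing values. The sketch below shows one way they might be recoded before analysis; the toy rows are made up, and the cutoffs are our assumptions, not documented rules of the dataset.

```r
library(dplyr)

# Hypothetical rows that mimic the suspicious values in the skim output.
toy <- data.frame(
  demographics.age    = c(58, -42, 0, 71),
  company.founded     = c(1975, 0, 1999, 1962),
  demographics.gender = c("male", "", "female", "male"),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  mutate(
    # Assumption: ages of zero or below stand for "unknown".
    demographics.age = if_else(demographics.age <= 0, NA_real_, demographics.age),
    # Assumption: a founding year of 0 stands for "unknown".
    company.founded = if_else(company.founded == 0, NA_real_, company.founded),
    # Turn empty strings into explicit missing values.
    demographics.gender = na_if(demographics.gender, "")
  )

# Count the missing values per column after recoding.
colSums(is.na(cleaned))
```

After this step, `skim()` would report a `complete_rate` below 1 for the affected columns, which gives a more honest picture of how much information is actually available.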

Data 2

Introduction and data

Identify the source of the data.
- U.S. Department of Agriculture: Economic Research Service
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The dataset was compiled by Alana Rhone of the Economic Research Service (ERS) in 2020.
- Population data was collected from the 2010 Census of the Population. Data on income, vehicle availability, and SNAP participation are from the 2014-18 American Community Survey. Two 2019 lists of supermarkets, supercenters, and large grocery stores were combined to produce a comprehensive list of stores that represent sources of affordable and nutritious food.
Write a brief description of the observations.
- Each observation represents a county in the US and includes information about access to supermarkets, supercenters, grocery stores, or other sources of healthy and affordable food.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How does food access vary across regions and resource levels in the US?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The research topic is the relationship between food access and regional/neighborhood resources.
- Our initial hypothesis is that counties with larger populations have better food access than counties with smaller populations, and that low-income counties/communities are more likely to have limited food access.
Identify the types of variables in your research question. Categorical? Quantitative?
- This dataset has both categorical and quantitative variables. An example of a categorical variable is the county name, and an example of a quantitative variable is the low-income population.
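
Because the low-access columns are raw counts, larger counties will have larger numbers almost by construction. One possible way to examine the population hypothesis is to convert a count into a share of the county population and then look at its rank correlation with population. This is a sketch only: the toy counties are made up, `share_low_access_1mi` is a name we introduce, and the choice of the 1-mile column and of Spearman correlation are our assumptions.

```r
library(dplyr)

# Hypothetical counties; the column names match the dataset, the values do not.
toy <- data.frame(
  County = c("A County", "B County", "C County", "D County"),
  Population = c(5000, 20000, 100000, 900000),
  Low.Access.Numbers.People.1.Mile = c(4000, 12000, 40000, 180000),
  stringsAsFactors = FALSE
)

# Share of residents counted as low access at the 1-mile threshold.
rates <- toy %>%
  mutate(share_low_access_1mi = Low.Access.Numbers.People.1.Mile / Population)

# Spearman rank correlation, chosen because Population is heavily skewed.
rho <- cor(rates$Population, rates$share_low_access_1mi, method = "spearman")
rho
```

A negative value would be consistent with the hypothesis that more populous counties have better food access, although it would not by itself show why.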

Glimpse of data

food <- read.csv(file = "data/food_access.csv")

skimr::skim(food)

Data summary
Name	food
Number of rows	3142
Number of columns	25
_______________________
Column type frequency:
character	2
numeric	23
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
County	0	1	10	33	0	1877	0
State	0	1	4	20	0	51	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Population	1	98264.02	312946.53	82	11114.50	25872.0	66780.00	9818605	▇▁▁▁▁
Housing.Data.Residing.in.Group.Quarters	1	2541.21	6512.50	0	177.00	602.0	2247.00	171670	▇▁▁▁▁
Housing.Data.Total.Housing.Units	1	37147.13	111990.96	39	4368.75	10017.0	25829.00	3241204	▇▁▁▁▁
Vehicle.Access.1.Mile	1	662.16	1095.32	0	118.00	332.0	739.75	13735	▇▁▁▁▁
Vehicle.Access.1.2.Mile	1	1503.13	3903.09	0	180.25	481.0	1197.75	83246	▇▁▁▁▁
Vehicle.Access.10.Miles	1	31.01	80.16	0	1.00	11.0	34.75	1826	▇▁▁▁▁
Vehicle.Access.20.Miles	1	5.16	47.42	0	0.00	0.0	0.00	1473	▇▁▁▁▁
Low.Access.Numbers.Children.1.Mile	1	9527.62	16747.45	0	1649.25	4108.0	9723.25	250060	▇▁▁▁▁
Low.Access.Numbers.Children.1.2.Mile	1	16668.66	41717.86	0	2176.50	5301.5	13327.25	911988	▇▁▁▁▁
Low.Access.Numbers.Children.10.Miles	1	372.74	596.69	0	34.00	210.0	524.75	11490	▇▁▁▁▁
Low.Access.Numbers.Children.20.Miles	1	40.76	235.28	0	0.00	0.0	0.00	5918	▇▁▁▁▁
Low.Access.Numbers.Low.Income.People.1.Mile	1	11199.22	17273.37	0	2501.00	6300.5	13138.25	260673	▇▁▁▁▁
Low.Access.Numbers.Low.Income.People.1.2.Mile	1	20660.44	48784.32	0	3472.25	8403.5	19185.50	1139072	▇▁▁▁▁
Low.Access.Numbers.Low.Income.People.10.Miles	1	617.69	1142.24	0	51.25	319.0	804.00	24663	▇▁▁▁▁
Low.Access.Numbers.Low.Income.People.20.Miles	1	76.11	476.40	0	0.00	0.0	0.00	12405	▇▁▁▁▁
Low.Access.Numbers.People.1.Mile	1	39091.71	64757.27	0	7306.50	17921.5	42034.75	903299	▇▁▁▁▁
Low.Access.Numbers.People.1.2.Mile	1	68483.47	164153.98	82	9527.50	22535.5	57185.00	3696268	▇▁▁▁▁
Low.Access.Numbers.People.10.Miles	1	1637.40	2386.60	0	174.00	955.0	2288.00	37500	▇▁▁▁▁
Low.Access.Numbers.People.20.Miles	1	172.54	823.48	0	0.00	0.0	0.00	17768	▇▁▁▁▁
Low.Access.Numbers.Seniors.1.Mile	1	5339.46	8298.88	0	1194.25	2693.5	5919.75	123489	▇▁▁▁▁
Low.Access.Numbers.Seniors.1.2.Mile	1	9148.15	20213.49	12	1556.25	3423.5	8226.75	431862	▇▁▁▁▁
Low.Access.Numbers.Seniors.10.Miles	1	274.73	382.57	0	28.00	165.5	388.75	5801	▇▁▁▁▁
Low.Access.Numbers.Seniors.20.Miles	1	30.33	137.68	0	0.00	0.0	0.00	4165	▇▁▁▁▁

Data 3

Introduction and data

Identify the source of the data: Centers for Disease Control and Prevention (CDC), https://chronicdata.cdc.gov/500-Cities-Places/500-Cities-Sleeping-less-than-7-hours-among-adults/eqbn-8mpz
State when and how it was originally collected (by the original data curator, not necessarily how you found the data): The data were produced in 2016 by the Centers for Disease Control and Prevention (CDC), Division of Population Health, Epidemiology and Surveillance Branch. The CDC uses a peer-reviewed multilevel regression and poststratification (MRP) approach that links geocoded health surveys with high-spatial-resolution population demographic and socioeconomic data. The approach predicts individual disease risk and health behaviors in a multilevel modeling framework and estimates the geographic distributions of population disease burden and health behaviors.
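Write a brief description of the observations: Each observation is an estimate for one geographic unit (the GeographicLevel column takes three values) of the percentage of adults aged 18 and older who sleep less than 7 hours, together with its confidence limits, the population count, and the location.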

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is there a significant relationship between the amount of sleep that adults (aged 18 and older) get and their mental stability and overall health across 500 cities?

Do demographic or environmental factors influence the relationship between sleep duration and mental stability/overall health among adults (aged 18 and older) in 500 cities, and if so, which ones?

A description of the research topic along with a concise statement of your hypotheses on this topic: The CDC dataset on sleep among adults aged 18 and older in 500 cities includes information such as the average amount of sleep that those 18 and older get in each city, the distribution of sleep durations, and factors that may affect sleep, such as screen time and physical activity levels. The dataset also includes data on whether or not the individual engages in unhealthy behaviors. Hypothesis: There is a negative relationship between the amount of sleep that adults aged 18 and older get and their likelihood of engaging in unhealthy behaviors, and this relationship is affected by demographic and environmental factors such as socioeconomic status, physical activity, and screen time.
Identify the types of variables in your research question. Categorical? Quantitative? There are both categorical and quantitative variables in this dataset. An example of a categorical variable would be Data_Value_Type and an example of a quantitative variable would be Data_Value.

Glimpse of data

# Read the 500 Cities CSV and summarize every column with skimr
cities <- read.csv("data/500_Cities.csv")

skimr::skim(cities)

Data summary
Name	cities
Number of rows	29006
Number of columns	24
_______________________
Column type frequency:
character	18
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
StateAbbr	1	2	2	0	52
StateDesc	1	4	13	0	52
CityName	1	0	26	2	475
GeographicLevel	1	2	12	0	3
DataSource	1	5	5	0	1
Category	1	19	19	0	1
UniqueID	1	2	19	0	28505
Measure	1	55	55	0	1
Data_Value_Unit	1	1	1	0	1
DataValueTypeID	1	6	9	0	2
Data_Value_Type	1	16	23	0	2
Data_Value_Footnote_Symbol	1	0	1	28212	2
Data_Value_Footnote	1	0	48	28212	2
PopulationCount	1	1	11	0	8050
GeoLocation	1	0	31	2	28505
CategoryID	1	6	6	0	1
MeasureId	1	5	5	0	1
Short_Question_Text	1	14	14	0	1

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Year	0	1.00	2.016000e+03	0.000000e+00	2.0160e+03	2016.0	2.0160e+03	2.016000e+03	2.0160e+03	▁▁▇▁▁
Data_Value	794	0.97	3.667000e+01	5.910000e+00	1.6100e+01	32.4	3.6100e+01	4.050000e+01	5.8700e+01	▁▅▇▃▁
Low_Confidence_Limit	794	0.97	3.530000e+01	5.920000e+00	1.3500e+01	31.0	3.4700e+01	3.920000e+01	5.7400e+01	▁▃▇▃▁
High_Confidence_Limit	794	0.97	3.799000e+01	5.870000e+00	1.9800e+01	33.7	3.7400e+01	4.170000e+01	5.9900e+01	▁▇▇▃▁
CityFIPS	2	1.00	2.607162e+06	1.686484e+06	1.5003e+04	681344.0	2.6220e+06	4.055000e+06	5.6139e+06	▇▅▃▆▆
TractFIPS	1002	0.97	2.593797e+10	1.675674e+10	1.0730e+09	8001009408.8	2.6081e+10	4.010911e+10	5.6021e+10	▇▅▃▆▅
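
The summary above lists `PopulationCount` as a character column and shows three geographic levels and two value types mixed in one table. Before comparing cities, the data would probably need to be restricted to one level and one value type, with the population parsed as a number. This is a minimal sketch: the toy rows are made up, and the labels "US", "City", "Census Tract", "Crude prevalence", and "Age-adjusted prevalence" are assumptions that match the string lengths in the summary and should be confirmed against the real file.

```r
library(dplyr)

# Hypothetical rows; the column names match the dataset, the values are illustrative.
toy <- data.frame(
  GeographicLevel = c("US", "City", "City", "Census Tract"),
  Data_Value_Type = c("Crude prevalence", "Crude prevalence",
                      "Age-adjusted prevalence", "Crude prevalence"),
  PopulationCount = c("308,745,538", "212,237", "212,237", "3,042"),
  Data_Value      = c(35.1, 38.2, 38.5, NA),
  stringsAsFactors = FALSE
)

city_level <- toy %>%
  # Assumption: PopulationCount is read as text because of thousands separators.
  mutate(PopulationCount = as.numeric(gsub(",", "", PopulationCount))) %>%
  # Keep one row per city: city-level, age-adjusted, non-missing estimates.
  filter(
    GeographicLevel == "City",
    Data_Value_Type == "Age-adjusted prevalence",
    !is.na(Data_Value)
  )

city_level
```

Age-adjusted values are used in this sketch because they are usually the better choice for comparing cities with different age structures; the crude values could be kept instead if the goal is to describe each city as it is.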