Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.
- The source of the data is UNICEF.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This data set was published in July 2022. UNICEF originally collected it through surveys.
Write a brief description of the observations.
- Each observation represents one country’s percentage, in a given year, of infants born to pregnant women living with HIV who received test results within 2 months of birth.

Research question

A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How does a country’s percentage per year of infants born to pregnant women living with HIV who received test results within 2 months of birth change over time?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The research topic focuses on how early HIV diagnosis for infants born to HIV-positive mothers changes over time, by country.
- We hypothesize that, over time, the percentage of people giving birth while living with HIV decreases in developed countries but remains constant in developing countries.
Identify the types of variables in your research question. Categorical? Quantitative?
- year - quantitative
- country - categorical
- percentage - quantitative
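
One way to quantify "change over time" for the research question above is a separate linear trend per country. The sketch below uses invented toy data with hypothetical column names (`country`, `year`, `percentage`). It is not the UNICEF data, and a per-country linear fit is only one possible approach, not necessarily the method this project will use.

```r
# Invented toy data: two hypothetical countries observed over four years.
toy_trend <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year = rep(2010:2013, times = 2),
  percentage = c(20, 30, 40, 50, 60, 60, 60, 60)
)

# Fit percentage ~ year separately for each country and keep the slope,
# i.e. the average change in percentage points per year.
slopes <- sapply(
  split(toy_trend, toy_trend$country),
  function(d) coef(lm(percentage ~ year, data = d))[["year"]]
)

slopes
```

A positive slope would indicate a rising percentage for that country, and a slope near zero would indicate little change.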

Glimpse of data

hiv_infant <- read.csv("data/HIV_Early_Infant_Diagnosis_2022.csv")
skimr::skim(hiv_infant)

Data summary
Name	hiv_infant
Number of rows	2121
Number of columns	10
_______________________
Column type frequency:
character	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
Early.Infant.Diagnosis.for.HIV..2010.2021	1	3	10	0	119
X	1	4	7	0	3
X.1	1	4	37	0	119
X.2	1	0	31	244	10
X.3	1	9	130	0	3
X.4	1	11	48	0	3
X.5	1	4	4	0	13
X.6	1	1	7	0	929
X.7	1	0	5	1060	561
X.8	1	0	5	1060	590
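
The summary above shows ten character columns named `Early.Infant.Diagnosis.for.HIV..2010.2021`, `X`, `X.1`, and so on. This is the pattern `read.csv()` produces when the first line of a file is a title rather than the header row. The sketch below shows one way to handle that, assuming the real header sits on the second line; this has not been checked against the actual file. The toy file, its column names, and its values are invented for illustration.

```r
# Toy file that mimics the suspected layout: a title line, then the real header.
# The column names and values here are invented for illustration only.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Early Infant Diagnosis for HIV (2010-2021),,",
    "country,year,value",
    "A,2010,45.2",
    "A,2011,50.1",
    "B,2010,"
  ),
  toy_path
)

# skip = 1 drops the title line so the second line is used as the header;
# na.strings = "" reads empty cells as NA instead of empty strings.
toy_hiv <- read.csv(toy_path, skip = 1, na.strings = "")

str(toy_hiv)
```

If the real file follows this layout, the same `skip` argument on the original `read.csv()` call should give meaningful column names and let numeric columns be parsed as numbers. Columns that contain non-numeric markers would still need separate cleaning.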

Data 2

Introduction and data

Identify the source of the data.
- The source of the dataset is the CDC.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This data was collected in 2017 and last updated in 2019. It was collected through local and national surveys. Because of the size of the data set, we limited the scope to data collected in New York City in 2017.
Write a brief description of the observations.
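- Each observation is a survey estimate for one youth risk behavior question within one group of respondents (defined by characteristics such as sex, race, and grade), not an individual respondent.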

Research question

A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What is the most frequent type of youth risk behavior in New York City in 2017?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The research topic focuses on the types of youth risk behavior and how frequently each occurs. We want to find out which type is the most frequent.
- We hypothesize that alcohol and other drug use is the most frequent type of youth risk behavior.
Identify the types of variables in your research question. Categorical? Quantitative?
- LocationDesc - categorical
- Topic - categorical
- Greater_Risk_Data_Value - quantitative
- Lesser_Risk_Data_Value - quantitative
- Sample_Size - quantitative

Glimpse of data

youth_risk <- read_csv("data/NYC_youth_risk_behavior.csv")

Rows: 109560 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (27): LocationAbbr, LocationDesc, DataSource, Topic, Subtopic, ShortQues...
dbl (11): YEAR, Greater_Risk_Data_Value, Greater_Risk_Low_Confidence_Limit, ...
lgl  (4): Greater_Risk_Data_Value_Footnote_Symbol, Greater_Risk_Data_Value_F...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(youth_risk)

Data summary
Name	youth_risk
Number of rows	109560
Number of columns	42
_______________________
Column type frequency:
character	27
logical	4
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
LocationAbbr	1	3	3	1
LocationDesc	1	17	17	1
DataSource	1	5	5	1
Topic	1	11	39	8
Subtopic	1	4	51	18
ShortQuestionText	1	5	50	83
Greater_Risk_Question	1	3	284	77
Description	1	4	256	51
Data_Value_Symbol	1	1	1	1
Data_Value_Type	1	10	10	1
Lesser_Risk_Question	1	4	277	73
Sex	1	4	6	3
Race	1	5	41	8
Grade	1	3	5	5
SexualIdentity	1	5	25	6
SexOfSexualContacts	1	5	27	6
GeoLocation	1	23	23	1
TopicId	1	3	3	8
SubTopicID	1	3	3	18
QuestionCode	1	3	8	83
StratID1	1	2	2	3
StratID2	1	2	3	8
StratID3	1	2	2	5
StratID4	1	3	3	1
StratID5	1	2	3	6
StratificationType	1	5	5	1
StratID6	1	2	3	6

Variable type: logical

skim_variable	n_missing	mean	count
Greater_Risk_Data_Value_Footnote_Symbol	109560	NaN	:
Greater_Risk_Data_Value_Footnote	109560	NaN	:
Lesser_Risk_Data_Value_Footnote_Symbol	109560	NaN	:
Lesser_Risk_Data_Value_Footnote	109560	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
YEAR	0	1.00	2017.00	0.00	2017	2017.00	2017.00	2017.00	2017	▁▁▇▁▁
Greater_Risk_Data_Value	68479	0.37	25.71	28.12	0	4.19	14.30	36.72	100	▇▂▁▁▁
Greater_Risk_Low_Confidence_Limit	68479	0.37	19.71	25.35	0	1.59	8.33	26.08	100	▇▁▁▁▁
Greater_Risk_High_Confidence_Limit	68479	0.37	33.79	29.38	0	10.26	24.54	50.91	100	▇▅▂▂▂
Lesser_Risk_Data_Value	68479	0.37	74.29	28.12	0	63.28	85.70	95.81	100	▁▁▁▂▇
Lesser_Risk_Low_Confidence_Limit	68479	0.37	66.21	29.38	0	49.09	75.46	89.74	100	▂▂▂▅▇
Lesser_Risk_High_Confidence_Limit	68479	0.37	80.29	25.35	0	73.92	91.67	98.41	100	▁▁▁▁▇
Sample_Size	0	1.00	133.95	450.28	0	3.00	15.00	83.00	9950	▇▁▁▁▁
LocationId	0	1.00	121.00	0.00	121	121.00	121.00	121.00	121	▁▁▇▁▁
States	0	1.00	46.00	0.00	46	46.00	46.00	46.00	46	▁▁▇▁▁
Counties	0	1.00	2095.00	0.00	2095	2095.00	2095.00	2095.00	2095	▁▁▇▁▁
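
To compare how frequent each type of risk behavior is, one option is to average `Greater_Risk_Data_Value` within each `Topic`. The sketch below does this on invented toy rows that reuse three column names from `youth_risk`; the values are not real. Weighting by `Sample_Size` is an assumption, and because the rows may overlap across stratification levels, a real analysis would probably first restrict the data to a single stratification.

```r
library(dplyr)

# Invented toy rows that reuse column names from youth_risk; values are not real.
toy_risk <- data.frame(
  Topic = c("Alcohol and Other Drug Use", "Alcohol and Other Drug Use",
            "Tobacco Use", "Tobacco Use", "Tobacco Use"),
  Greater_Risk_Data_Value = c(30, 20, 10, NA, 5),
  Sample_Size = c(100, 300, 200, 2, 200)
)

# Drop rows with no estimate, then compute a sample-size-weighted mean
# percentage per topic and sort from most to least frequent.
topic_summary <- toy_risk %>%
  filter(!is.na(Greater_Risk_Data_Value)) %>%
  group_by(Topic) %>%
  summarise(
    weighted_pct = weighted.mean(Greater_Risk_Data_Value, w = Sample_Size),
    n_rows = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(weighted_pct))

topic_summary
```

The first row of `topic_summary` is the topic with the highest weighted percentage in the toy data.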

Data 3

Introduction and data

Identify the source of the data.
- The data comes from the CORGIS Dataset Project, which builds on the Forbes World’s Billionaires lists from 1996-2014.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- It was originally compiled from the Forbes World’s Billionaires lists from 1996-2014, supplemented by additional research by scholars at the Peterson Institute for International Economics using information available online.
Write a brief description of the observations.
- Each observation is a billionaire in a given list year. The billionaires come from around the world, including Europe, the United States, and other advanced countries. Some billionaires appear more than once because they were listed in multiple years.

Research question

A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How are country of origin, region, industry, wealth accumulation, how the wealth was inherited, the GDP of the region where the business operates, age, and wealth type related to one another for billionaires listed in 2001?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The research topic focuses on how various characteristics of billionaires relate to wealth accumulation and to one another.
- We hypothesize that in 2001, billionaires in the technology-computer, real estate, or energy industries who operate in regions with higher GDP have the highest wealth accumulation.
Identify the types of variables in your research question. Categorical? Quantitative?
- industry - categorical
- wealth type - categorical
- country - categorical
- region of business - categorical
- wealth accumulation - quantitative
- wealth.how.inherited - categorical

Glimpse of data

billionaire <- read.csv("data/billionaires.csv")
skimr::skim(billionaire)

Data summary
Name	billionaire
Number of rows	2614
Number of columns	22
_______________________
Column type frequency:
character	13
logical	3
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
name	1	5	45	0	2077
company.name	1	0	59	38	1578
company.relationship	1	0	46	46	75
company.sector	1	0	52	23	521
company.type	1	0	22	36	19
demographics.gender	1	0	14	34	4
location.citizenship	1	4	20	0	73
location.country.code	1	3	6	0	74
location.region	1	1	24	0	8
wealth.type	1	0	24	22	6
wealth.how.category	1	0	18	1	10
wealth.how.industry	1	0	31	1	20
wealth.how.inherited	1	6	24	0	6

Variable type: logical

skim_variable	complete_rate	mean	count
wealth.how.from.emerging	1	1	TRU: 2614
wealth.how.was.founder	1	1	TRU: 2614
wealth.how.was.political	1	1	TRU: 2614

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	1	5.996700e+02	4.678900e+02	1	215.0	430	9.880e+02	1.565e+03	▇▅▃▂▃
year	1	2.008410e+03	7.480000e+00	1996	2001.0	2014	2.014e+03	2.014e+03	▂▂▁▁▇
company.founded	1	1.924710e+03	2.437800e+02	0	1936.0	1963	1.985e+03	2.012e+03	▁▁▁▁▇
demographics.age	1	5.334000e+01	2.533000e+01	-42	47.0	59	7.000e+01	9.800e+01	▁▂▁▇▃
location.gdp	1	1.769103e+12	3.547083e+12	0	0.0	0	7.250e+11	1.060e+13	▇▁▁▁▁
wealth.worth.in.billions	1	3.530000e+00	5.090000e+00	1	1.4	2	3.500e+00	7.600e+01	▇▁▁▁▁
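
The numeric summary above shows a minimum of -42 for `demographics.age`, a minimum of 0 for `company.founded`, and a median of 0 for `location.gdp`, which suggests that unknown values may be stored as zeros or other implausible numbers. The sketch below restricts toy data to 2001, the year in the research question, and recodes such values as `NA`. Treating non-positive ages and zero GDP as missing is an assumption about how the data set encodes unknowns, and the toy rows are invented.

```r
library(dplyr)

# Invented toy rows that reuse column names from billionaire; values are not real.
toy_billionaire <- data.frame(
  name = c("A", "B", "C", "D"),
  year = c(2001, 2001, 2014, 2001),
  demographics.age = c(60, -42, 55, 0),
  location.gdp = c(1.06e13, 0, 7.25e11, 0),
  wealth.worth.in.billions = c(5.0, 1.2, 3.0, 2.0)
)

# Keep only 2001 and recode implausible values as missing (assumption:
# non-positive ages and zero GDP stand for "unknown").
billionaire_2001 <- toy_billionaire %>%
  filter(year == 2001) %>%
  mutate(
    demographics.age = if_else(demographics.age <= 0, NA_real_, demographics.age),
    location.gdp = if_else(location.gdp == 0, NA_real_, location.gdp)
  )

billionaire_2001
```

Recoding before analysis keeps placeholder zeros from pulling down averages of age or GDP.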