Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • The source of the data is UNICEF.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • This data set is from July 2022. UNICEF originally collected it through surveys.
  • Write a brief description of the observations.

    • Each observation is one country’s percentage, for a given year, of infants born to pregnant women living with HIV who received test results within 2 months of birth.

Research question

  • A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • How does a country’s percentage per year of infants born to pregnant women living with HIV who received test results within 2 months of birth change over time?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The research topic focuses on how the percentage of infants born to HIV-positive mothers who receive early test results changes over time by country.

    • As the year increases, the percentage of people giving birth while living with HIV decreases in developed countries but remains constant in developing countries.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • year - quantitative

    • country - categorical

    • percentage - quantitative
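
One way to approach the research question above is to fit a separate linear trend of percentage on year for each country and compare the slopes. The following is only a minimal sketch in base R on made-up toy data: the column names `country`, `year`, and `percentage` mirror the variable list above but are assumptions about how the cleaned data will be named, and the values are hypothetical.

```r
# Toy data with hypothetical values; the real columns would come from
# hiv_infant once it is cleaned (column names here are assumptions).
toy <- data.frame(
  country    = rep(c("A", "B"), each = 4),
  year       = rep(2010:2013, times = 2),
  percentage = c(20, 30, 40, 50, 60, 60, 60, 60)
)

# Fit one linear model per country and keep the slope on year,
# i.e. the estimated change in percentage points per year.
slopes <- sapply(split(toy, toy$country), function(d) {
  coef(lm(percentage ~ year, data = d))[["year"]]
})

print(slopes)
```

A positive slope would indicate that the percentage is rising over time in that country, and a slope near zero would indicate little change.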

Glimpse of data

hiv_infant <- read.csv("data/HIV_Early_Infant_Diagnosis_2022.csv")
skimr::skim(hiv_infant)
Data summary
Name hiv_infant
Number of rows 2121
Number of columns 10
_______________________
Column type frequency:
character 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Early.Infant.Diagnosis.for.HIV..2010.2021 0 1 3 10 0 119 0
X 0 1 4 7 0 3 0
X.1 0 1 4 37 0 119 0
X.2 0 1 0 31 244 10 0
X.3 0 1 9 130 0 3 0
X.4 0 1 11 48 0 3 0
X.5 0 1 4 4 0 13 0
X.6 0 1 1 7 0 929 0
X.7 0 1 0 5 1060 561 0
X.8 0 1 0 5 1060 590 0
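
The summary above lists all ten columns as character, with names such as `X` and `X.1`. This pattern usually means the file has a title line above the real header row, so the title was read as the header. The sketch below shows one possible fix using `skip` on a small stand-in file. The number of lines to skip, the column names (`ISO3`, `Country`, `Year`, `Value`), and all values are assumptions for illustration and would need to be checked against the actual CSV.

```r
# Write a small stand-in file: one title line, then the real header row.
# Column names and values are made up for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Early Infant Diagnosis for HIV (2010-2021),,,",
  "ISO3,Country,Year,Value",
  "AAA,Country A,2010,35.2",
  "AAA,Country A,2011,41",
  "BBB,Country B,2010,"
), tmp)

# Skip the title line so the second line is used as the header,
# and treat empty strings as missing values.
fixed <- read.csv(tmp, skip = 1, na.strings = c("", "NA"))

str(fixed)
```

With the header in the right place, numeric columns are parsed as numbers rather than text, which makes the year and percentage usable as quantitative variables.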

Data 2

Introduction and data

  • Identify the source of the data.

    • The source of the dataset is the CDC.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • This data was collected in 2017 and last updated in 2019. It was gathered through local and national surveys. Because of the size of the data set, we limited the scope to data collected in New York City in 2017.
  • Write a brief description of the observations.

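    • Each observation is the survey result for one youth risk behavior question within one demographic group (for example, by sex, race, or grade), giving the percentage of respondents at greater and lesser risk along with the sample size.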

Research question

  • A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • What is the most frequent type of youth risk behavior in New York City in 2017?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The research topic focuses on the types of youth risk behavior and how frequently each occurs; we want to find out which type is the most frequent.

    • Alcohol and other drug use may be the most frequent youth risk behavior type.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • LocationDesc - categorical

    • Topic - categorical

    • Greater_Risk_Data_Value - quantitative

    • Lesser_Risk_Data_Value - quantitative

    • Sample_Size - quantitative
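
As a rough illustration of how the variables above could answer the research question, the sketch below drops rows with no reported value and then computes a sample-size-weighted mean of `Greater_Risk_Data_Value` for each `Topic`. It is a base R sketch on toy data: the topic labels and all numbers are hypothetical, and weighting by `Sample_Size` is one possible design choice rather than a settled method.

```r
# Toy data; topic labels and values are hypothetical placeholders.
toy <- data.frame(
  Topic                   = c("Topic A", "Topic A", "Topic B", "Topic B", "Topic C"),
  Greater_Risk_Data_Value = c(30, NA, 10, 20, 50),
  Sample_Size             = c(100, 5, 200, 100, 50)
)

# Keep only rows where a value was actually reported.
complete <- toy[!is.na(toy$Greater_Risk_Data_Value), ]

# Sample-size-weighted mean of the greater-risk percentage per topic.
by_topic <- sapply(split(complete, complete$Topic), function(d) {
  weighted.mean(d$Greater_Risk_Data_Value, d$Sample_Size)
})

# Sort so the topic with the highest weighted percentage comes first.
by_topic <- sort(by_topic, decreasing = TRUE)
print(by_topic)
```

The first entry of `by_topic` is then the candidate for the most frequent risk behavior type under this weighting.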

Glimpse of data

youth_risk <- read_csv("data/NYC_youth_risk_behavior.csv")
Rows: 109560 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (27): LocationAbbr, LocationDesc, DataSource, Topic, Subtopic, ShortQues...
dbl (11): YEAR, Greater_Risk_Data_Value, Greater_Risk_Low_Confidence_Limit, ...
lgl  (4): Greater_Risk_Data_Value_Footnote_Symbol, Greater_Risk_Data_Value_F...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(youth_risk)
Data summary
Name youth_risk
Number of rows 109560
Number of columns 42
_______________________
Column type frequency:
character 27
logical 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
LocationAbbr 0 1 3 3 0 1 0
LocationDesc 0 1 17 17 0 1 0
DataSource 0 1 5 5 0 1 0
Topic 0 1 11 39 0 8 0
Subtopic 0 1 4 51 0 18 0
ShortQuestionText 0 1 5 50 0 83 0
Greater_Risk_Question 0 1 3 284 0 77 0
Description 0 1 4 256 0 51 0
Data_Value_Symbol 0 1 1 1 0 1 0
Data_Value_Type 0 1 10 10 0 1 0
Lesser_Risk_Question 0 1 4 277 0 73 0
Sex 0 1 4 6 0 3 0
Race 0 1 5 41 0 8 0
Grade 0 1 3 5 0 5 0
SexualIdentity 0 1 5 25 0 6 0
SexOfSexualContacts 0 1 5 27 0 6 0
GeoLocation 0 1 23 23 0 1 0
TopicId 0 1 3 3 0 8 0
SubTopicID 0 1 3 3 0 18 0
QuestionCode 0 1 3 8 0 83 0
StratID1 0 1 2 2 0 3 0
StratID2 0 1 2 3 0 8 0
StratID3 0 1 2 2 0 5 0
StratID4 0 1 3 3 0 1 0
StratID5 0 1 2 3 0 6 0
StratificationType 0 1 5 5 0 1 0
StratID6 0 1 2 3 0 6 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Greater_Risk_Data_Value_Footnote_Symbol 109560 0 NaN :
Greater_Risk_Data_Value_Footnote 109560 0 NaN :
Lesser_Risk_Data_Value_Footnote_Symbol 109560 0 NaN :
Lesser_Risk_Data_Value_Footnote 109560 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
YEAR 0 1.00 2017.00 0.00 2017 2017.00 2017.00 2017.00 2017 ▁▁▇▁▁
Greater_Risk_Data_Value 68479 0.37 25.71 28.12 0 4.19 14.30 36.72 100 ▇▂▁▁▁
Greater_Risk_Low_Confidence_Limit 68479 0.37 19.71 25.35 0 1.59 8.33 26.08 100 ▇▁▁▁▁
Greater_Risk_High_Confidence_Limit 68479 0.37 33.79 29.38 0 10.26 24.54 50.91 100 ▇▅▂▂▂
Lesser_Risk_Data_Value 68479 0.37 74.29 28.12 0 63.28 85.70 95.81 100 ▁▁▁▂▇
Lesser_Risk_Low_Confidence_Limit 68479 0.37 66.21 29.38 0 49.09 75.46 89.74 100 ▂▂▂▅▇
Lesser_Risk_High_Confidence_Limit 68479 0.37 80.29 25.35 0 73.92 91.67 98.41 100 ▁▁▁▁▇
Sample_Size 0 1.00 133.95 450.28 0 3.00 15.00 83.00 9950 ▇▁▁▁▁
LocationId 0 1.00 121.00 0.00 121 121.00 121.00 121.00 121 ▁▁▇▁▁
States 0 1.00 46.00 0.00 46 46.00 46.00 46.00 46 ▁▁▇▁▁
Counties 0 1.00 2095.00 0.00 2095 2095.00 2095.00 2095.00 2095 ▁▁▇▁▁

Data 3

Introduction and data

  • Identify the source of the data.

    • The source of the data is the CORGIS Dataset Project, which builds on the Forbes World’s Billionaires lists from 1996-2014.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • It was originally compiled from the Forbes World’s Billionaires lists from 1996-2014, with additional research by scholars at the Peterson Institute for International Economics. The information was gathered through research and online sources.
  • Write a brief description of the observations.

    • The observations are billionaires from around the world, including those in Europe, the United States, and other advanced countries. Some billionaires appear more than once because the data cover multiple years.

Research question

  • A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • How are country of origin, region, industry, wealth accumulation, how the wealth was inherited, GDP of the region of business operation, age, and wealth type related to one another for the billionaires recorded in 2001?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The research topic focuses on billionaires and how various factors relate to one another and to wealth accumulation.

    • In 2001, billionaires operating in higher-GDP regions in the technology-computer, real estate, or energy industries will have the highest wealth accumulation.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • industry: categorical

    • wealth type: categorical

    • country: categorical

    • region of business: categorical

    • wealth accumulation: quantitative

    • wealth.how.inherited: categorical
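
To make the 2001 comparison concrete, the sketch below filters to a single year and summarizes net worth by industry. It uses base R on toy data: the column names `year`, `wealth.how.industry`, and `wealth.worth.in.billions` are taken from the data summary, while the industry labels and values are hypothetical, and the median is just one reasonable summary for a skewed variable.

```r
# Toy data; industry labels and values are hypothetical placeholders.
toy <- data.frame(
  year                     = c(2001, 2001, 2001, 2014),
  wealth.how.industry      = c("Industry A", "Industry A", "Industry B", "Industry A"),
  wealth.worth.in.billions = c(2, 4, 5, 50)
)

# Restrict to the year named in the research question.
y2001 <- toy[toy$year == 2001, ]

# Median net worth per industry within 2001.
median_worth <- aggregate(
  wealth.worth.in.billions ~ wealth.how.industry,
  data = y2001,
  FUN  = median
)

# Order industries from highest to lowest median worth.
median_worth <- median_worth[order(-median_worth$wealth.worth.in.billions), ]
print(median_worth)
```

The same pattern could be repeated with `location.region` or `wealth.how.inherited` as the grouping variable.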

Glimpse of data

billionaire <- read.csv("data/billionaires.csv")
skimr::skim(billionaire)
Data summary
Name billionaire
Number of rows 2614
Number of columns 22
_______________________
Column type frequency:
character 13
logical 3
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 5 45 0 2077 0
company.name 0 1 0 59 38 1578 0
company.relationship 0 1 0 46 46 75 0
company.sector 0 1 0 52 23 521 0
company.type 0 1 0 22 36 19 0
demographics.gender 0 1 0 14 34 4 0
location.citizenship 0 1 4 20 0 73 0
location.country.code 0 1 3 6 0 74 0
location.region 0 1 1 24 0 8 0
wealth.type 0 1 0 24 22 6 0
wealth.how.category 0 1 0 18 1 10 0
wealth.how.industry 0 1 0 31 1 20 0
wealth.how.inherited 0 1 6 24 0 6 0

Variable type: logical

skim_variable n_missing complete_rate mean count
wealth.how.from.emerging 0 1 1 TRU: 2614
wealth.how.was.founder 0 1 1 TRU: 2614
wealth.how.was.political 0 1 1 TRU: 2614

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rank 0 1 5.996700e+02 4.678900e+02 1 215.0 430 9.880e+02 1.565e+03 ▇▅▃▂▃
year 0 1 2.008410e+03 7.480000e+00 1996 2001.0 2014 2.014e+03 2.014e+03 ▂▂▁▁▇
company.founded 0 1 1.924710e+03 2.437800e+02 0 1936.0 1963 1.985e+03 2.012e+03 ▁▁▁▁▇
demographics.age 0 1 5.334000e+01 2.533000e+01 -42 47.0 59 7.000e+01 9.800e+01 ▁▂▁▇▃
location.gdp 0 1 1.769103e+12 3.547083e+12 0 0.0 0 7.250e+11 1.060e+13 ▇▁▁▁▁
wealth.worth.in.billions 0 1 3.530000e+00 5.090000e+00 1 1.4 2 3.500e+00 7.600e+01 ▇▁▁▁▁
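
The numeric summary above shows a minimum `demographics.age` of -42, a minimum `company.founded` of 0, and a median `location.gdp` of 0. These values look like placeholders for missing data rather than real measurements, although that is an assumption that should be checked against the data set's documentation. Under that assumption, a minimal base R sketch of recoding them to `NA` on toy data might look like this:

```r
# Toy data echoing values seen in the summary; rows are illustrative only.
toy <- data.frame(
  demographics.age = c(-42, 0, 59, 70),
  company.founded  = c(0, 1936, 1963, 2012),
  location.gdp     = c(0, 0, 7.25e11, 1.06e13)
)

# Assumption: non-positive ages and zero years or GDP mean "not recorded".
cleaned <- toy
cleaned$demographics.age[cleaned$demographics.age <= 0] <- NA
cleaned$company.founded[cleaned$company.founded == 0]   <- NA
cleaned$location.gdp[cleaned$location.gdp == 0]         <- NA

# Count how many values were recoded in each column.
print(colSums(is.na(cleaned)))
```

Recoding before analysis keeps these placeholder values from pulling down means and distorting any relationship with wealth.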