library(tidyverse)
library(skimr)
Project title
Proposal
Data 1
Introduction and data
Identify the source of the data.
- The source of the data is UNICEF.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This data set is from July 2022. It was originally collected by UNICEF through surveys.
Write a brief description of the observations.
- Each observation represents one country’s percentage, for a given year, of infants born to pregnant women living with HIV who received test results within 2 months of birth.
Research question
- A well-formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How does a country’s percentage per year of infants born to pregnant women living with HIV who received test results within 2 months of birth change over time?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic focuses on how the percentage of infants born to HIV-positive mothers who received test results within 2 months of birth changes over time, by country.
We hypothesize that, over time, the percentage of people giving birth while living with HIV decreases in developed countries but remains constant in developing countries.
- Identify the types of variables in your research question. Categorical? Quantitative?
year - quantitative
country - categorical
percentage - quantitative
Glimpse of data
hiv_infant <- read.csv("data/HIV_Early_Infant_Diagnosis_2022.csv")
skimr::skim(hiv_infant)
Name | hiv_infant |
Number of rows | 2121 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Early.Infant.Diagnosis.for.HIV..2010.2021 | 0 | 1 | 3 | 10 | 0 | 119 | 0 |
X | 0 | 1 | 4 | 7 | 0 | 3 | 0 |
X.1 | 0 | 1 | 4 | 37 | 0 | 119 | 0 |
X.2 | 0 | 1 | 0 | 31 | 244 | 10 | 0 |
X.3 | 0 | 1 | 9 | 130 | 0 | 3 | 0 |
X.4 | 0 | 1 | 11 | 48 | 0 | 3 | 0 |
X.5 | 0 | 1 | 4 | 4 | 0 | 13 | 0 |
X.6 | 0 | 1 | 1 | 7 | 0 | 929 | 0 |
X.7 | 0 | 1 | 0 | 5 | 1060 | 561 | 0 |
X.8 | 0 | 1 | 0 | 5 | 1060 | 590 | 0 |
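In the summary above, all ten columns are read as character and most carry placeholder names (X, X.1, ...). This often happens when a CSV file has title rows above its real header. The sketch below shows how the `skip` argument of `read.csv()` handles that case on a small inline example. The toy column names, the values, and the single title row are assumptions for illustration only; the real file needs to be inspected to see how many lines, if any, should be skipped.

```r
# Toy stand-in for a CSV that starts with a title line above the real header.
# Column names, values, and the single title row are hypothetical.
csv_text <- "Early Infant Diagnosis for HIV 2010-2021,,
country,year,percentage
A,2010,35
A,2011,41
B,2010,12"

# Without skip, the title line is used as the header and every column is text.
raw <- read.csv(text = csv_text)

# With skip = 1, the real header is used and numeric columns parse as numbers.
clean <- read.csv(text = csv_text, skip = 1)

str(clean)
```

If the real file behaves the same way, `year` and `percentage` would then be available as quantitative variables, matching the variable types listed in the research question.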
Data 2
Introduction and data
Identify the source of the data.
- The source of the dataset is the CDC.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This data was collected in 2017 and last updated in 2019. It was collected through local and national surveys. Due to the size of the data set, we limited the scope to data collected in New York City in 2017.
Write a brief description of the observations.
- Each observation is an individual surveyed on youth risk behaviors.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What is the most frequent type of youth risk behavior in New York City in 2017?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic focuses on the types of youth risk behavior and how frequently each occurs; we want to find out which type is most frequent.
We hypothesize that alcohol and other drug use is the most frequent type of youth risk behavior.
- Identify the types of variables in your research question. Categorical? Quantitative?
LocationDesc - categorical
Topic - categorical
Greater_Risk_Data_Value - quantitative
Lesser_Risk_Data_Value - quantitative
Sample_Size - quantitative
Glimpse of data
youth_risk <- read_csv("data/NYC_youth_risk_behavior.csv")
Rows: 109560 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (27): LocationAbbr, LocationDesc, DataSource, Topic, Subtopic, ShortQues...
dbl (11): YEAR, Greater_Risk_Data_Value, Greater_Risk_Low_Confidence_Limit, ...
lgl (4): Greater_Risk_Data_Value_Footnote_Symbol, Greater_Risk_Data_Value_F...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(youth_risk)
Name | youth_risk |
Number of rows | 109560 |
Number of columns | 42 |
_______________________ | |
Column type frequency: | |
character | 27 |
logical | 4 |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
LocationAbbr | 0 | 1 | 3 | 3 | 0 | 1 | 0 |
LocationDesc | 0 | 1 | 17 | 17 | 0 | 1 | 0 |
DataSource | 0 | 1 | 5 | 5 | 0 | 1 | 0 |
Topic | 0 | 1 | 11 | 39 | 0 | 8 | 0 |
Subtopic | 0 | 1 | 4 | 51 | 0 | 18 | 0 |
ShortQuestionText | 0 | 1 | 5 | 50 | 0 | 83 | 0 |
Greater_Risk_Question | 0 | 1 | 3 | 284 | 0 | 77 | 0 |
Description | 0 | 1 | 4 | 256 | 0 | 51 | 0 |
Data_Value_Symbol | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
Data_Value_Type | 0 | 1 | 10 | 10 | 0 | 1 | 0 |
Lesser_Risk_Question | 0 | 1 | 4 | 277 | 0 | 73 | 0 |
Sex | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
Race | 0 | 1 | 5 | 41 | 0 | 8 | 0 |
Grade | 0 | 1 | 3 | 5 | 0 | 5 | 0 |
SexualIdentity | 0 | 1 | 5 | 25 | 0 | 6 | 0 |
SexOfSexualContacts | 0 | 1 | 5 | 27 | 0 | 6 | 0 |
GeoLocation | 0 | 1 | 23 | 23 | 0 | 1 | 0 |
TopicId | 0 | 1 | 3 | 3 | 0 | 8 | 0 |
SubTopicID | 0 | 1 | 3 | 3 | 0 | 18 | 0 |
QuestionCode | 0 | 1 | 3 | 8 | 0 | 83 | 0 |
StratID1 | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
StratID2 | 0 | 1 | 2 | 3 | 0 | 8 | 0 |
StratID3 | 0 | 1 | 2 | 2 | 0 | 5 | 0 |
StratID4 | 0 | 1 | 3 | 3 | 0 | 1 | 0 |
StratID5 | 0 | 1 | 2 | 3 | 0 | 6 | 0 |
StratificationType | 0 | 1 | 5 | 5 | 0 | 1 | 0 |
StratID6 | 0 | 1 | 2 | 3 | 0 | 6 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
Greater_Risk_Data_Value_Footnote_Symbol | 109560 | 0 | NaN | : |
Greater_Risk_Data_Value_Footnote | 109560 | 0 | NaN | : |
Lesser_Risk_Data_Value_Footnote_Symbol | 109560 | 0 | NaN | : |
Lesser_Risk_Data_Value_Footnote | 109560 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
YEAR | 0 | 1.00 | 2017.00 | 0.00 | 2017 | 2017.00 | 2017.00 | 2017.00 | 2017 | ▁▁▇▁▁ |
Greater_Risk_Data_Value | 68479 | 0.37 | 25.71 | 28.12 | 0 | 4.19 | 14.30 | 36.72 | 100 | ▇▂▁▁▁ |
Greater_Risk_Low_Confidence_Limit | 68479 | 0.37 | 19.71 | 25.35 | 0 | 1.59 | 8.33 | 26.08 | 100 | ▇▁▁▁▁ |
Greater_Risk_High_Confidence_Limit | 68479 | 0.37 | 33.79 | 29.38 | 0 | 10.26 | 24.54 | 50.91 | 100 | ▇▅▂▂▂ |
Lesser_Risk_Data_Value | 68479 | 0.37 | 74.29 | 28.12 | 0 | 63.28 | 85.70 | 95.81 | 100 | ▁▁▁▂▇ |
Lesser_Risk_Low_Confidence_Limit | 68479 | 0.37 | 66.21 | 29.38 | 0 | 49.09 | 75.46 | 89.74 | 100 | ▂▂▂▅▇ |
Lesser_Risk_High_Confidence_Limit | 68479 | 0.37 | 80.29 | 25.35 | 0 | 73.92 | 91.67 | 98.41 | 100 | ▁▁▁▁▇ |
Sample_Size | 0 | 1.00 | 133.95 | 450.28 | 0 | 3.00 | 15.00 | 83.00 | 9950 | ▇▁▁▁▁ |
LocationId | 0 | 1.00 | 121.00 | 0.00 | 121 | 121.00 | 121.00 | 121.00 | 121 | ▁▁▇▁▁ |
States | 0 | 1.00 | 46.00 | 0.00 | 46 | 46.00 | 46.00 | 46.00 | 46 | ▁▁▇▁▁ |
Counties | 0 | 1.00 | 2095.00 | 0.00 | 2095 | 2095.00 | 2095.00 | 2095.00 | 2095 | ▁▁▇▁▁ |
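One possible way to approach the research question is to average `Greater_Risk_Data_Value` within each `Topic` and rank the topics. The sketch below does this in base R on a toy data frame. It assumes that "most frequent" means the highest mean percentage, which is only one possible reading of the question, and the topic labels and values are made up. Because the summary above shows `Greater_Risk_Data_Value` is missing in about 63% of rows, the sketch also shows that missing rows are dropped before averaging.

```r
# Toy rows using column names from the data set; labels and values are made up.
youth_toy <- data.frame(
  Topic = c("Alcohol and Other Drug Use", "Alcohol and Other Drug Use",
            "Tobacco Use", "Tobacco Use", "Physical Activity"),
  Greater_Risk_Data_Value = c(30, 20, 10, NA, NA)
)

# The formula interface of aggregate() omits rows with missing values
# by default, so topics with no observed values drop out entirely.
topic_means <- aggregate(Greater_Risk_Data_Value ~ Topic,
                         data = youth_toy, FUN = mean)

# Sort topics from highest to lowest mean value.
topic_means <- topic_means[order(-topic_means$Greater_Risk_Data_Value), ]
print(topic_means)
```

A topic whose values are all missing disappears from the result, so it may be worth checking how many non-missing rows each topic has before comparing means.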
Data 3
Introduction and data
Identify the source of the data.
- The source of the data is the CORGIS Dataset Project, which builds on the Forbes World’s Billionaires lists from 1996-2014.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- It was originally compiled from the Forbes World’s Billionaires lists from 1996-2014, with additional research by scholars at the Peterson Institute for International Economics. The data was gathered through research and information available online.
Write a brief description of the observations.
- The observations are different billionaires from around the world, including those in Europe, the United States, and other advanced countries. Some of the observations are repeats of billionaires at different points in time, displaying data from multiple years.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How are country of origin, region, industry, wealth accumulation, the way the wealth was inherited, the GDP of the region of business operation, age, and wealth type related to one another for billionaires in the 2001 data?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic focuses on billionaires and how various factors relate to wealth accumulation and to one another.
We hypothesize that, in 2001, billionaires operating in regions with higher GDP and in the technology-computer, real estate, or energy industries have the highest wealth accumulation.
- Identify the types of variables in your research question. Categorical? Quantitative?
industry: categorical
wealth type: categorical
country: categorical
region of business: categorical
wealth accumulation: quantitative
wealth.how.inherited: categorical
Glimpse of data
billionaire <- read.csv("data/billionaires.csv")
skimr::skim(billionaire)
Name | billionaire |
Number of rows | 2614 |
Number of columns | 22 |
_______________________ | |
Column type frequency: | |
character | 13 |
logical | 3 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1 | 5 | 45 | 0 | 2077 | 0 |
company.name | 0 | 1 | 0 | 59 | 38 | 1578 | 0 |
company.relationship | 0 | 1 | 0 | 46 | 46 | 75 | 0 |
company.sector | 0 | 1 | 0 | 52 | 23 | 521 | 0 |
company.type | 0 | 1 | 0 | 22 | 36 | 19 | 0 |
demographics.gender | 0 | 1 | 0 | 14 | 34 | 4 | 0 |
location.citizenship | 0 | 1 | 4 | 20 | 0 | 73 | 0 |
location.country.code | 0 | 1 | 3 | 6 | 0 | 74 | 0 |
location.region | 0 | 1 | 1 | 24 | 0 | 8 | 0 |
wealth.type | 0 | 1 | 0 | 24 | 22 | 6 | 0 |
wealth.how.category | 0 | 1 | 0 | 18 | 1 | 10 | 0 |
wealth.how.industry | 0 | 1 | 0 | 31 | 1 | 20 | 0 |
wealth.how.inherited | 0 | 1 | 6 | 24 | 0 | 6 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
wealth.how.from.emerging | 0 | 1 | 1 | TRU: 2614 |
wealth.how.was.founder | 0 | 1 | 1 | TRU: 2614 |
wealth.how.was.political | 0 | 1 | 1 | TRU: 2614 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rank | 0 | 1 | 5.996700e+02 | 4.678900e+02 | 1 | 215.0 | 430 | 9.880e+02 | 1.565e+03 | ▇▅▃▂▃ |
year | 0 | 1 | 2.008410e+03 | 7.480000e+00 | 1996 | 2001.0 | 2014 | 2.014e+03 | 2.014e+03 | ▂▂▁▁▇ |
company.founded | 0 | 1 | 1.924710e+03 | 2.437800e+02 | 0 | 1936.0 | 1963 | 1.985e+03 | 2.012e+03 | ▁▁▁▁▇ |
demographics.age | 0 | 1 | 5.334000e+01 | 2.533000e+01 | -42 | 47.0 | 59 | 7.000e+01 | 9.800e+01 | ▁▂▁▇▃ |
location.gdp | 0 | 1 | 1.769103e+12 | 3.547083e+12 | 0 | 0.0 | 0 | 7.250e+11 | 1.060e+13 | ▇▁▁▁▁ |
wealth.worth.in.billions | 0 | 1 | 3.530000e+00 | 5.090000e+00 | 1 | 1.4 | 2 | 3.500e+00 | 7.600e+01 | ▇▁▁▁▁ |
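The numeric summary above contains values that look like placeholders rather than real measurements: `demographics.age` has a minimum of -42, `company.founded` has a minimum of 0, and `location.gdp` is 0 at the 25th and 50th percentiles. The sketch below shows one way such values might be recoded before analysis, after restricting to the year named in the research question. Treating these values as missing is an assumption that the source does not confirm, and the toy values are made up.

```r
# Toy rows using column names from the skim output; the values are made up.
bill_toy <- data.frame(
  year = c(2001, 2001, 2001, 2014),
  demographics.age = c(59, -42, 0, 70),
  location.gdp = c(1.06e13, 0, 7.25e11, 0),
  wealth.worth.in.billions = c(3.5, 1.4, 2.0, 76)
)

# Keep only the year named in the research question.
bill_2001 <- bill_toy[bill_toy$year == 2001, ]

# Assumption: non-positive ages and zero GDP are placeholders, not real values.
bill_2001$demographics.age[bill_2001$demographics.age <= 0] <- NA
bill_2001$location.gdp[bill_2001$location.gdp == 0] <- NA

# Count how many values were recoded to NA in each column.
colSums(is.na(bill_2001))
```

Recoding before summarizing keeps placeholder zeros from pulling down means of age or GDP, at the cost of a smaller usable sample.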