Project Brilliant Togepi

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • Source: CDC
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The CDC compiles the death counts from the cause-of-death section of death certificates, as reported by physicians, medical examiners, or coroners.

    • Data on vaccinations are obtained by the CDC through all vaccination partners including jurisdictional partner clinics, retail pharmacies, long-term care facilities, dialysis centers, Federal Emergency Management Agency and Health Resources and Services Administration partner sites, and federal entity facilities. 

  • Write a brief description of the observations.

    • Based on the data set, the focus appears to be the period when COVID most affected the US.

    • The data itself is untidy and not organized in any particular pattern, so in its current state it is difficult to spot trends in the COVID-19 data.

    • Most of the CSV files include the state, so combining these data sets for further analysis should not be too difficult.
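
As a rough illustration of how the files might be combined by state, the sketch below joins two toy tables on state and date. The column names (state, submission_date, Location, Date, new_case, Administered) come from the skim output in this proposal; the toy values and the month/day/year date format are assumptions, not facts checked against the real files.

```r
library(dplyr)

# Toy stand-ins for us_covid.csv and us_vaccine.csv; all values are invented.
covid_toy <- data.frame(
  state = c("NY", "NY", "CA"),
  submission_date = c("01/22/2021", "01/23/2021", "01/22/2021"),
  new_case = c(100, 120, 300),
  stringsAsFactors = FALSE
)
vaccine_toy <- data.frame(
  Location = c("NY", "CA", "CA"),
  Date = c("01/22/2021", "01/22/2021", "01/23/2021"),
  Administered = c(5000, 9000, 9500),
  stringsAsFactors = FALSE
)

# Parse the date strings (assumed month/day/year) and align the key names.
covid_keyed <- covid_toy %>%
  mutate(date = as.Date(submission_date, format = "%m/%d/%Y")) %>%
  select(state, date, new_case)
vaccine_keyed <- vaccine_toy %>%
  mutate(date = as.Date(Date, format = "%m/%d/%Y")) %>%
  select(state = Location, date, Administered)

# Keep only the state-date pairs that appear in both tables.
combined <- inner_join(covid_keyed, vaccine_keyed, by = c("state", "date"))
print(combined)
```

An inner join drops state-date pairs missing from either table; a left join would keep every case record instead, which may be preferable if the vaccine data starts later than the case data.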

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Do required mask mandates or vaccines have a larger impact on confirmed COVID cases?

    • As the percentage of counties (per state) requiring masks in public increases, how does the number of confirmed COVID cases change?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • By combining variables from these datasets, we plan to build a much tidier dataset and analyze COVID transmission. Because COVID was politicized and misleading data spread across the internet during the pandemic, it will be interesting to draw our own interpretations from this CDC data rather than simply trusting the opinions of people with a large reach online. We hypothesize that as vaccine and mask mandates increased in states, the number of confirmed COVID cases decreased. Additionally, we are interested in the correlation between COVID cases and the number of page views for the CDC website.
  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Our research question has both categorical and quantitative variables. For example, the categorical variables include the different brands of vaccines, the counties in a state, and whether or not masks are required in public. The quantitative variables include confirmed COVID cases and the number of vaccines distributed.
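
The percentage of counties per state requiring masks is not a column in the mask file, so it would have to be derived. A minimal sketch of one way to do that is below. The column names come from the skim output; the "Yes"/"No" coding of Face_Masks_Required_in_Public and the decision to drop rows where it is missing are assumptions, and the toy rows are invented.

```r
library(dplyr)

# Toy stand-in for us_mask.csv; values are invented for illustration.
masks_toy <- data.frame(
  State_Tribe_Territory = c("NY", "NY", "NY", "CA", "CA"),
  County_Name = c("A County", "B County", "C County", "D County", "E County"),
  date = rep("4/10/2020", 5),
  Face_Masks_Required_in_Public = c("Yes", "No", NA, "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Share of reporting counties with a mask requirement, per state and date.
# Rows with a missing requirement are dropped (an assumption; they could
# also be treated as "No").
mask_share <- masks_toy %>%
  filter(!is.na(Face_Masks_Required_in_Public)) %>%
  group_by(State_Tribe_Territory, date) %>%
  summarise(
    counties_reporting = n(),
    pct_masks_required = 100 * mean(Face_Masks_Required_in_Public == "Yes"),
    .groups = "drop"
  )
print(mask_share)
```

Because roughly 38% of the rows have no recorded requirement, how missing values are handled could noticeably change the percentages, so the choice is worth stating in the final analysis.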

Glimpse of data

# will become one dataset

masks <- read.csv("data/us_mask.csv")
covid <- read.csv("data/us_covid.csv")
vaccine <- read.csv("data/us_vaccine.csv")
pageviews <- read_csv("data/cdc_monthly_page_views.csv")
Rows: 228 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Month
dbl (3): Sort, Year, Page Views
num (1): Page Visits

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(masks)
Data summary
Name masks
Number of rows 1593869
Number of columns 10
_______________________
Column type frequency:
character 7
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
State_Tribe_Territory 0 1.00 2 2 0 56 0
County_Name 0 1.00 4 33 0 1968 0
date 0 1.00 8 10 0 493 0
Face_Masks_Required_in_Public 606314 0.62 2 3 0 2 0
Source_of_Action 606314 0.62 8 13 0 2 0
URL 651574 0.59 43 540 0 548 0
Citation 616596 0.61 14 161 0 629 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
FIPS_State 0 1 31.44 16.41 1 19 30 46 78 ▅▇▆▅▁
FIPS_County 0 1 102.70 106.55 1 35 79 133 840 ▇▁▁▁▁
order_code 0 1 1.54 0.50 1 1 2 2 2 ▇▁▁▁▇
skimr::skim(covid)
Data summary
Name covid
Number of rows 60060
Number of columns 15
_______________________
Column type frequency:
character 5
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
submission_date 0 1 10 10 0 1001 0
state 0 1 2 3 0 60 0
created_at 0 1 22 22 0 2224 0
consent_cases 0 1 0 9 4009 4 0
consent_deaths 0 1 0 9 5005 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tot_cases 0 1.00 656964.11 1173489.80 0 18303.25 222841.5 815855.25 11309237 ▇▁▁▁▁
conf_cases 26026 0.57 652799.39 1077693.49 0 65122.75 299246.0 842673.25 10458792 ▇▁▁▁▁
prob_cases 26098 0.57 107357.46 157946.54 0 169.25 32175.0 150251.25 850445 ▇▁▁▁▁
new_case 0 1.00 1601.41 5074.26 -10199 3.00 344.0 1435.00 319809 ▇▁▁▁▁
pnew_case 3526 0.94 267.29 1439.17 -171804 0.00 1.0 175.00 171617 ▁▁▇▁▁
tot_death 0 1.00 9351.24 14591.37 0 361.00 3241.0 12353.25 95604 ▇▁▁▁▁
conf_death 26787 0.55 9015.95 10431.92 0 1377.00 5193.0 13720.00 71408 ▇▂▁▁▁
prob_death 26787 0.55 1093.25 1549.19 0 0.00 309.0 1691.00 7889 ▇▂▁▁▁
new_death 0 1.00 17.37 43.50 -352 0.00 3.0 16.00 1178 ▁▇▁▁▁
pnew_death 3494 0.94 1.83 24.53 -2594 0.00 0.0 1.00 2919 ▁▁▇▁▁
skimr::skim(vaccine)
Data summary
Name vaccine
Number of rows 37912
Number of columns 109
_______________________
Column type frequency:
character 2
numeric 107
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 10 10 0 589 0
Location 0 1 2 3 0 66 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MMWR_week 0 1.00 23.56 15.48 1 10.00 21.00 37.00 53.0 ▇▇▅▃▅
Distributed 0 1.00 15203017.79 66346868.42 0 960725.00 3784610.00 9932098.75 967430045.0 ▇▁▁▁▁
Distributed_Janssen 0 1.00 684828.11 2964960.72 0 22500.00 171400.00 455800.00 32496900.0 ▇▁▁▁▁
Distributed_Moderna 0 1.00 5736583.13 25045480.19 0 228700.00 1468850.00 3926785.00 351003220.0 ▇▁▁▁▁
Distributed_Pfizer 0 1.00 8577346.08 38413907.57 0 226785.00 1989892.50 5450246.25 583678325.0 ▇▁▁▁▁
Distributed_Novavax 35800 0.06 28174.67 113286.64 0 2800.00 7800.00 21200.00 1196500.0 ▇▁▁▁▁
Distributed_Unk_Manuf 0 1.00 2113.54 83905.10 0 0.00 0.00 0.00 8282150.0 ▇▁▁▁▁
Dist_Per_100K 0 1.00 131261.63 79999.75 0 76030.00 135741.00 192751.50 405539.0 ▆▇▇▂▁
Distributed_Per_100k_5Plus 448 0.99 92229.58 111904.57 0 0.00 0.00 206543.50 425336.0 ▇▁▅▁▁
Distributed_Per_100k_12Plus 0 1.00 140163.46 107719.30 0 0.00 156914.50 227754.25 459301.0 ▇▇▇▂▁
Distributed_Per_100k_18Plus 0 1.00 166453.12 106763.72 0 88099.00 173357.50 251018.50 496196.0 ▆▇▇▂▁
Distributed_Per_100k_65Plus 0 1.00 829696.82 653184.74 0 405963.50 804765.00 1166627.50 6036990.0 ▇▂▁▁▁
Administered 0 1.00 11865368.11 52516352.87 0 729973.75 2929770.50 7797573.50 672537312.0 ▇▁▁▁▁
Administered_5Plus 448 0.99 7842623.34 46782774.14 0 0.00 0.00 4636601.00 668381561.0 ▇▁▁▁▁
Administered_12Plus 0 1.00 10803682.68 50827438.41 0 0.00 2151088.00 7230964.25 644700162.0 ▇▁▁▁▁
Administered_18Plus 0 1.00 10879838.42 48194401.00 0 561477.50 2658485.50 7187786.75 604314715.0 ▇▁▁▁▁
Administered_65Plus 0 1.00 3141122.34 13751149.80 0 50634.00 802419.00 2242111.50 182650661.0 ▇▁▁▁▁
Administered_Janssen 0 1.00 398282.62 1785837.10 0 12261.00 90187.50 260288.75 18985124.0 ▇▁▁▁▁
Administered_Moderna 0 1.00 4609474.92 20110316.58 0 316509.75 1202395.50 3039758.25 251560771.0 ▇▁▁▁▁
Administered_Pfizer 0 1.00 6845996.77 30606111.34 0 385234.50 1613378.00 4566258.50 401070068.0 ▇▁▁▁▁
Administered_Novavax 35807 0.06 1558.44 6881.34 0 79.00 343.00 1102.00 81544.0 ▇▁▁▁▁
Administered_Unk_Manuf 3 1.00 11528.18 53396.17 0 60.00 688.00 3577.00 839805.0 ▇▁▁▁▁
Admin_Per_100K 0 1.00 104728.23 64625.70 0 55922.50 111335.00 151618.25 298212.0 ▆▇▇▃▁
Admin_Per_100k_5Plus 448 0.99 73078.28 88849.00 0 0.00 0.00 161421.75 314272.0 ▇▁▃▂▁
Admin_Per_100k_12Plus 0 1.00 110140.86 83249.26 0 0.00 129922.00 174328.00 326935.0 ▇▅▇▃▁
Admin_Per_100k_18Plus 0 1.00 122769.77 76177.65 0 64419.25 134569.00 179839.25 329135.0 ▆▆▇▃▁
Admin_Per_100k_65Plus 0 1.00 162655.70 100831.76 0 99692.50 174850.50 243213.50 451103.0 ▆▇▇▂▁
Recip_Administered 0 1.00 11753657.29 52523589.66 0 556689.75 2836631.50 7609341.50 672537312.0 ▇▁▁▁▁
Administered_Dose1_Recip 0 1.00 5809640.28 25229047.65 0 323842.00 1478779.50 3872836.25 269650596.0 ▇▁▁▁▁
Administered_Dose1_Pop_Pct 0 1.00 51.24 28.79 0 33.50 58.70 72.00 100.0 ▅▂▆▇▃
Administered_Dose1_Recip_5Plus 448 0.99 3589049.33 21290413.36 0 0.00 0.00 2138729.25 267572623.0 ▇▁▁▁▁
Administered_Dose1_Recip_5PlusPop_Pct 448 0.99 33.03 39.57 0 0.00 0.00 74.40 99.9 ▇▁▁▃▃
Administered_Dose1_Recip_12Plus 0 1.00 5225542.09 24176524.62 0 0.00 1068491.00 3534497.00 256128740.0 ▇▁▁▁▁
Administered_Dose1_Recip_12PlusPop_Pct 0 1.00 52.30 37.27 0 0.00 68.30 82.50 99.9 ▇▁▂▇▇
Administered_Dose1_Recip_18Plus 0 1.00 5343397.08 23033494.46 0 308029.00 1414983.00 3606240.25 237904856.0 ▇▁▁▁▁
Administered_Dose1_Recip_18PlusPop_Pct 0 1.00 59.12 32.51 0 39.70 70.70 84.30 99.9 ▅▂▃▇▇
Administered_Dose1_Recip_65Plus 0 1.00 1412111.95 6002240.99 0 19319.00 378694.00 996032.75 58778173.0 ▇▁▁▁▁
Administered_Dose1_Recip_65PlusPop_Pct 0 1.00 69.38 37.42 0 58.30 89.80 95.00 109.0 ▃▁▁▃▇
Series_Complete_Yes 0 1.00 4811571.56 21300590.45 0 134682.75 1123323.00 3235252.75 230142115.0 ▇▁▁▁▁
Series_Complete_Pop_Pct 0 1.00 42.83 25.92 0 23.00 50.60 61.90 90.6 ▆▂▆▇▂
Series_Complete_5Plus 448 0.99 3054455.99 18126984.91 0 0.00 0.00 1819987.25 229040889.0 ▇▁▁▁▁
Series_Complete_5PlusPop_Pct 448 0.99 28.54 34.32 0 0.00 0.00 63.80 95.0 ▇▁▁▃▂
Series_Complete_12Plus 0 1.00 4457378.28 20640205.74 0 0.00 901074.00 3061199.75 219636602.0 ▇▁▁▁▁
Series_Complete_12PlusPop_Pct 0 1.00 45.57 32.85 0 0.00 59.00 71.20 100.0 ▇▁▃▇▂
Series_Complete_18Plus 0 1.00 4442850.68 19528361.93 0 127746.00 1068303.00 3054443.00 204032234.0 ▇▁▁▁▁
Series_Complete_18PlusPop_Pct 0 1.00 50.31 30.22 0 25.10 61.60 72.90 99.9 ▅▂▃▇▂
Series_Complete_65Plus 0 1.00 1211994.82 5181904.93 0 16652.75 318236.00 894077.75 51663242.0 ▇▁▁▁▁
Series_Complete_65PlusPop_Pct 0 1.00 62.88 35.25 0 37.90 81.00 88.20 99.9 ▃▁▁▃▇
Series_Complete_Janssen 0 1.00 374893.97 1671841.22 0 10865.75 84187.00 243465.75 17173853.0 ▇▁▁▁▁
Series_Complete_Moderna 0 1.00 1764095.79 7698539.35 0 51193.00 443592.00 1244783.00 79821310.0 ▇▁▁▁▁
Series_Complete_Pfizer 0 1.00 2669103.51 11933130.57 0 70964.25 608221.00 1801014.50 132517282.0 ▇▁▁▁▁
Series_Complete_Novavax 35808 0.06 471.48 2072.67 0 20.00 108.50 336.00 24759.0 ▇▁▁▁▁
Series_Complete_Unk_Manuf 4 1.00 3067.09 14762.28 0 3.00 204.00 1001.25 236584.0 ▇▁▁▁▁
Series_Complete_Janssen_5Plus 21016 0.45 525470.91 2063970.01 0 53934.00 164280.00 323872.25 17170667.0 ▇▁▁▁▁
Series_Complete_Moderna_5Plus 21016 0.45 2388243.61 9317660.24 0 238992.50 878268.00 1538076.50 79243124.0 ▇▁▁▁▁
Series_Complete_Pfizer_5Plus 21016 0.45 3854179.61 15125490.36 0 374123.75 1238119.00 2510312.25 132367780.0 ▇▁▁▁▁
Series_Complete_Unk_Manuf_5Plus 21020 0.45 4784.22 19653.25 0 115.00 590.00 2149.00 234718.0 ▇▁▁▁▁
Series_Complete_Janssen_12Plus 0 1.00 355781.05 1655509.10 0 0.00 63889.50 235776.00 17168556.0 ▇▁▁▁▁
Series_Complete_Moderna_12Plus 0 1.00 1651079.33 7596147.09 0 0.00 354799.50 1177058.75 79176937.0 ▇▁▁▁▁
Series_Complete_Pfizer_12Plus 0 1.00 2447594.96 11381545.82 0 0.00 485916.50 1684953.25 123043366.0 ▇▁▁▁▁
Series_Complete_Unk_Manuf_12Plus 4 1.00 2897.84 14472.82 0 0.00 124.00 880.00 223757.0 ▇▁▁▁▁
Series_Complete_Janssen_18Plus 0 1.00 373721.75 1666833.70 0 10849.00 84067.50 242791.50 17140407.0 ▇▁▁▁▁
Series_Complete_Moderna_18Plus 0 1.00 1759153.35 7676566.60 0 51193.00 443397.00 1243635.00 79066584.0 ▇▁▁▁▁
Series_Complete_Pfizer_18Plus 0 1.00 2307006.73 10176032.38 0 66427.75 549402.50 1629128.00 107596454.0 ▇▁▁▁▁
Series_Complete_Unk_Manuf_18Plus 4 1.00 2936.55 14039.12 0 3.00 191.00 924.00 206017.0 ▇▁▁▁▁
Series_Complete_Janssen_65Plus 0 1.00 56190.36 244398.56 0 632.00 13760.50 37361.75 2371698.0 ▇▁▁▁▁
Series_Complete_Moderna_65Plus 0 1.00 579595.47 2476550.17 0 6803.00 149418.00 427474.25 26213750.0 ▇▁▁▁▁
Series_Complete_Pfizer_65Plus 0 1.00 576257.90 2469478.29 0 9172.75 150715.00 436997.75 27935710.0 ▇▁▁▁▁
Series_Complete_Unk_Manuf_65Plus 9 1.00 1526.87 24691.99 0 1.00 71.00 422.00 2349816.0 ▇▁▁▁▁
Additional_Doses 16348 0.57 2152481.68 9931737.00 0 18553.50 467227.50 1478154.50 117621762.0 ▇▁▁▁▁
Additional_Doses_Vax_Pct 325 0.99 17.42 21.10 0 0.00 0.00 39.20 67.5 ▇▁▂▂▁
Additional_Doses_5Plus 35544 0.06 3548888.77 13956156.16 10232 344607.00 1025834.50 2404023.50 117547717.0 ▇▁▁▁▁
Additional_Doses_5Plus_Vax_Pct 35544 0.06 49.23 8.35 24 43.90 49.15 55.70 67.6 ▁▂▇▇▂
Additional_Doses_12Plus 26456 0.30 3163981.13 12497912.61 1411 293103.25 911145.00 2153299.25 115318712.0 ▇▁▁▁▁
Additional_Doses_12Plus_Vax_Pct 26456 0.30 46.04 10.87 0 40.40 46.50 53.50 71.0 ▁▁▅▇▂
Additional_Doses_18Plus 325 0.99 1192369.29 7318199.28 0 0.00 0.00 605249.00 110159095.0 ▇▁▁▁▁
Additional_Doses_18Plus_Vax_Pct 325 0.99 18.77 22.76 0 0.00 0.00 42.10 72.5 ▇▁▂▂▁
Additional_Doses_50Plus 325 0.99 778479.96 4695513.82 0 0.00 0.00 426214.00 68922968.0 ▇▁▁▁▁
Additional_Doses_50Plus_Vax_Pct 0 1.00 23.60 28.05 0 0.00 0.00 54.20 81.3 ▇▁▁▃▂
Additional_Doses_65Plus 325 0.99 445618.68 2643441.66 0 0.00 0.00 254436.00 38027526.0 ▇▁▁▁▁
Additional_Doses_65Plus_Vax_Pct 325 0.99 27.66 32.03 0 0.00 0.00 63.40 88.5 ▇▁▁▃▂
Additional_Doses_Moderna 325 0.99 526482.41 3259401.25 0 0.00 0.00 266539.50 48675059.0 ▇▁▁▁▁
Additional_Doses_Pfizer 325 0.99 688401.31 4227459.88 0 0.00 0.00 350871.50 67306254.0 ▇▁▁▁▁
Additional_Doses_Janssen 327 0.99 17887.45 111904.69 0 0.00 0.00 8202.00 1561830.0 ▇▁▁▁▁
Additional_Doses_Unk_Manuf 331 0.99 395.49 2685.55 0 0.00 0.00 70.00 72042.0 ▇▁▁▁▁
Second_Booster 37818 0.00 20940412.98 12993892.37 6065193 11604388.25 15724217.00 26300459.00 47866632.0 ▇▆▂▁▃
Second_Booster_50Plus 31896 0.16 572656.05 2549983.61 0 46138.00 139643.00 383135.50 36171359.0 ▇▁▁▁▁
Second_Booster_50Plus_Vax_Pct 31896 0.16 25.65 14.66 0 14.80 22.30 34.42 67.2 ▃▇▃▂▁
Second_Booster_65Plus 31896 0.16 385270.86 1674480.08 0 32933.75 99922.00 268246.75 22753682.0 ▇▁▁▁▁
Second_Booster_65Plus_Vax_Pct 31896 0.16 31.36 16.35 0 19.10 28.40 41.70 75.1 ▃▇▅▃▂
Second_Booster_Janssen 31905 0.16 535.09 2172.94 0 45.00 118.00 313.00 23472.0 ▇▁▁▁▁
Second_Booster_Moderna 31896 0.16 296702.65 1345796.53 1 22581.25 67113.00 186591.75 19829102.0 ▇▁▁▁▁
Second_Booster_Pfizer 31896 0.16 361232.53 1734306.44 4 26174.50 82005.50 220708.50 27976130.0 ▇▁▁▁▁
Second_Booster_Unk_Manuf 31907 0.16 486.88 2158.09 0 4.00 28.00 180.00 32190.0 ▇▁▁▁▁
Administered_Bivalent 36248 0.04 1158600.53 5048831.90 0 80719.25 272271.00 808310.25 54419971.0 ▇▁▁▁▁
Admin_Bivalent_PFR 36312 0.04 767149.06 3282941.69 0 60857.50 182003.00 537557.50 34747465.0 ▇▁▁▁▁
Admin_Bivalent_MOD 36312 0.04 435866.28 1861447.24 0 29521.25 105033.00 313338.75 19672506.0 ▇▁▁▁▁
Dist_Bivalent_PFR 36312 0.04 1901436.75 7808853.62 300 212070.00 499870.00 1421172.50 81485210.0 ▇▁▁▁▁
Dist_Bivalent_MOD 36312 0.04 893595.25 3700494.38 200 88900.00 227700.00 627075.00 38435900.0 ▇▁▁▁▁
Bivalent_Booster_5Plus 36568 0.04 1365374.26 5532552.15 0 141230.25 343484.50 1022303.25 53910391.0 ▇▁▁▁▁
Bivalent_Booster_5Plus_Pop_Pct 36568 0.04 12.65 7.58 0 7.50 11.90 17.50 34.3 ▅▇▆▃▁
Bivalent_Booster_12Plus 36504 0.04 1300581.72 5316625.55 0 130808.50 331617.00 945788.50 52668775.0 ▇▁▁▁▁
Bivalent_Booster_12Plus_Pop_Pct 36504 0.04 13.27 8.08 0 7.50 12.45 18.52 35.9 ▅▇▆▃▁
Bivalent_Booster_18Plus 36504 0.04 1258938.12 5141296.47 0 126968.50 322174.00 911081.75 50821425.0 ▇▁▁▁▁
Bivalent_Booster_18Plus_Pop_Pct 36504 0.04 14.07 8.43 0 8.10 13.30 19.60 37.1 ▅▇▆▃▁
Bivalent_Booster_65Plus 36504 0.04 583440.70 2356210.02 0 54719.50 165515.50 438194.75 22796124.0 ▇▁▁▁▁
Bivalent_Booster_65Plus_Pop_Pct 36504 0.04 30.49 16.76 0 18.88 31.60 42.92 68.1 ▅▆▇▆▂
skimr::skim(pageviews)
Data summary
Name pageviews
Number of rows 228
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Month 0 1 3 9 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sort 0 1.0 6.5 3.46 1 3.75 6.5 9.25 12 ▇▅▅▅▇
Year 0 1.0 2012.5 5.51 2003 2008.00 2012.5 2017.00 2022 ▇▇▇▇▇
Page Views 1 1.0 83473966.2 95509590.57 1415875 39897754.50 63277568.0 87652029.00 1075594920 ▇▁▁▁▁
Page Visits 22 0.9 35932832.5 55929589.20 1484766 9634804.50 18757177.0 30696210.00 464305872 ▇▁▁▁▁

Data 2

Introduction and data

The data was published by the Mexican government and covers more than 4,900 divorce cases.

The government compiled the data from divorce case records in Xalapa, Mexico.

Each observation records the date of the divorce, the type of divorce, the ages of the two people getting divorced, and other factors. There are 41 columns and more than 4,900 rows (independent observations). Note that although some columns in the table are in Spanish, they can easily be translated to English.

Research question

Questions:

1.) What is the leading cause of divorce in the city of Xalapa, Mexico?

2.) What factor has the biggest impact on divorce rates in Xalapa, Mexico?

3.) Is there a correlation between large age gaps and divorce?

We are considering analyzing divorce trends within the data. The data gives us many details about each person getting a divorce, which we can use to connect similar cases. We believe that certain circumstances may be correlated with divorce rates. One circumstance we are considering is age gaps: we hypothesize a positive correlation between age gaps within married couples and divorce rates among those couples.

For this specific research question, we would use a variable representing the age gap, which would be quantitative, and the divorce rate, which is also quantitative.
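
The age gap is not a column in the data, so it would need to be computed from the two age columns. The sketch below shows one possible way, using the Age_partner_man and Age_partner_woman names from the skim output. The toy values are invented, and taking the absolute difference (ignoring which partner is older) is an assumption about how the gap should be defined.

```r
library(dplyr)

# Toy stand-in for the divorces file; values are invented for illustration.
divorces_toy <- data.frame(
  Age_partner_man = c(40, 30, NA, 55),
  Age_partner_woman = c(35, 32, 28, 40)
)

# Drop cases where either age is missing, then compute the absolute gap.
age_gaps <- divorces_toy %>%
  filter(!is.na(Age_partner_man), !is.na(Age_partner_woman)) %>%
  mutate(age_gap = abs(Age_partner_man - Age_partner_woman))

print(age_gaps)
summary(age_gaps$age_gap)
```

The sketch stops at the age-gap variable; how the divorce rate itself would be measured is left open here.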

Glimpse of data

divorces <- read.csv("data/divorces_2000-2015_translated.csv")

skimr::skim(divorces)
Data summary
Name divorces
Number of rows 4923
Number of columns 41
_______________________
Column type frequency:
character 34
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Divorce_date 0 1 6 8 0 2596 0
Type_of_divorce 0 1 9 10 0 2 0
Nationality_partner_man 0 1 0 14 1 20 0
DOB_partner_man 0 1 0 8 381 3849 0
Place_of_birth_partner_man 0 1 0 28 126 670 0
Birth_municipality_of_partner_man 0 1 0 32 129 425 0
Birth_federal_partner_man 0 1 0 24 128 77 0
Birth_country_partner_man 0 1 0 25 127 22 0
Residence_municipality_partner_man 0 1 0 26 324 176 0
Residence_federal_partner_man 0 1 0 25 323 43 0
Residence_country_partner_man 0 1 0 25 324 6 0
Occupation_partner_man 0 1 0 32 529 222 0
Place_of_residence_partner_man 0 1 0 26 321 230 0
Nationality_partner_woman 0 1 0 14 3 17 0
DOB_partner_woman 0 1 0 8 452 3766 0
DOB_registration_date_partner_woman 0 1 0 10 2679 2004 0
Place_of_birth_partner_woman 0 1 0 31 140 654 0
Birth_municipality_of_partner_woman 0 1 0 26 139 405 0
Birth_federal_partner_woman 0 1 0 20 140 73 0
Birth_country_partner_woman 0 1 0 25 139 21 0
Place_of_residence_partner_woman 0 1 0 25 307 175 0
Residence_municipality_partner_woman 0 1 0 22 307 132 0
Residence_federal_partner_woman 0 1 0 20 305 34 0
Residence_country_partner_woman 0 1 0 14 305 5 0
Occupation_partner_woman 0 1 0 32 578 157 0
Date_of_marriage 0 1 6 8 0 3651 0
Marriage_certificate_place 0 1 4 25 0 216 0
Marriage_certificate_municipality 0 1 4 25 0 194 0
Marriage_certificate_federal 0 1 4 20 0 33 0
Level_of_education_partner_man 0 1 0 15 304 7 0
Employment_status_partner_man 0 1 0 48 356 11 0
Level_of_education_partner_woman 0 1 0 15 380 7 0
Employment_status_partner_woman 0 1 0 44 417 11 0
Custody 0 1 0 5 2851 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age_partner_man 107 0.98 39.44 10.40 19.0 32 38 46 91 ▆▇▃▁▁
Monthly_income_partner_man_peso 1419 0.71 10920.11 70569.76 2.4 3000 5600 10000 3150242 ▇▁▁▁▁
Age_partner_woman 151 0.97 36.96 9.93 17.0 29 35 43 84 ▅▇▃▁▁
Monthly_income_partner_woman_peso 2119 0.57 7374.25 16337.05 3.5 3000 5000 8000 708652 ▇▁▁▁▁
Marriage_duration 235 0.95 11.72 9.30 1.0 4 9 17 61 ▇▃▁▁▁
Marriage_duration_months 3368 0.32 6.29 3.87 0.0 4 6 9 93 ▇▁▁▁▁
Num_Children 1912 0.61 1.82 0.93 1.0 1 2 2 10 ▇▂▁▁▁

Data 3

Introduction and data

  • Identify the source of the data.

    • The source of the data is Inside Airbnb.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was collected in 2019 by dgomonov. It was originally collected to show the listing activities and the metrics of NYC in 2019. They collected it from Inside Airbnb.
  • Write a brief description of the observations.

    • The data set describes hosts, availability in different NYC neighborhoods, room types, and pricing.
    • The data itself is untidy, as it does not follow a consistent pattern in how values were entered.
    • The data should let us see which hosts are busiest and which neighborhoods are preferred.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How do a listing’s cancellation policy and minimum nights correlate with the number of reviews that it receives?

What is the largest factor in a listing’s price and service fee?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

The first question examines how the strictness of a listing’s policies relates to the number of reviews it receives. Many factors might lead someone to review a listing, and some of them likely stem from how strict its policies are.

We hypothesize a negative correlation between the strictness of a listing’s policies and the number of reviews it receives, possibly because fewer people are willing to rent listings with stricter policies.

  • Identify the types of variables in your research question. Categorical? Quantitative?

The cancellation policy is categorical, but the minimum nights and number of reviews are quantitative.
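
A first pass at this question could summarise reviews and minimum nights within each cancellation policy. The sketch below is one hedged way to set that up. The column names (with spaces) come from the skim output; the policy labels, the toy values, and the rule of discarding minimum nights below 1 are assumptions. The rule is motivated by the negative minimum shown in the summary.

```r
library(dplyr)

# Toy stand-in for airbnb_open_data.csv; values and labels are invented.
airbnb_toy <- data.frame(
  cancellation_policy = c("strict", "strict", "flexible", "flexible", "moderate", NA),
  `minimum nights` = c(2, 30, 1, -5, 3, 2),
  `number of reviews` = c(10, 1, 40, 12, 25, 8),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Remove rows with no policy or an implausible minimum stay, then
# summarise each policy group.
reviews_by_policy <- airbnb_toy %>%
  filter(!is.na(cancellation_policy), `minimum nights` >= 1) %>%
  group_by(cancellation_policy) %>%
  summarise(
    listings = n(),
    median_min_nights = median(`minimum nights`),
    mean_reviews = mean(`number of reviews`),
    .groups = "drop"
  )
print(reviews_by_policy)
```

Backticks are needed around column names that contain spaces; renaming the columns to snake_case when the file is first read would avoid that.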

Glimpse of data

airbnb <- read_csv("data/airbnb_open_data.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 102599 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): NAME, host_identity_verified, host name, neighbourhood group, neig...
dbl (11): id, host id, lat, long, Construction year, minimum nights, number ...
lgl  (2): instant_bookable, license

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(airbnb)
Data summary
Name airbnb
Number of rows 102599
Number of columns 26
_______________________
Column type frequency:
character 13
logical 2
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
NAME 249 1.00 1 248 0 61281 0
host_identity_verified 289 1.00 8 11 0 2 0
host name 406 1.00 1 35 0 13190 0
neighbourhood group 29 1.00 5 13 0 7 0
neighbourhood 16 1.00 4 26 0 224 0
country 532 0.99 13 13 0 1 0
country code 131 1.00 2 2 0 1 0
cancellation_policy 76 1.00 6 8 0 3 0
room type 0 1.00 10 15 0 4 0
price 247 1.00 3 6 0 1151 0
service fee 273 1.00 3 4 0 231 0
last review 15893 0.85 8 10 0 2477 0
house_rules 52131 0.49 6 1001 0 1964 0

Variable type: logical

skim_variable n_missing complete_rate mean count
instant_bookable 105 1 0.5 FAL: 51474, TRU: 51020
license 102599 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.914623e+07 1.625751e+07 1001254.00 1.508581e+07 2.913660e+07 4.32012e+07 5.736742e+07 ▇▇▇▇▇
host id 0 1.00 4.925411e+10 2.853900e+10 123600518.00 2.458333e+10 4.911774e+10 7.39965e+10 9.876313e+10 ▇▇▇▇▇
lat 8 1.00 4.073000e+01 6.000000e-02 40.50 4.069000e+01 4.072000e+01 4.07600e+01 4.092000e+01 ▁▂▇▅▁
long 8 1.00 -7.395000e+01 5.000000e-02 -74.25 -7.398000e+01 -7.395000e+01 -7.39300e+01 -7.371000e+01 ▁▁▇▂▁
Construction year 214 1.00 2.012490e+03 5.770000e+00 2003.00 2.007000e+03 2.012000e+03 2.01700e+03 2.022000e+03 ▇▇▇▇▇
minimum nights 409 1.00 8.140000e+00 3.055000e+01 -1223.00 2.000000e+00 3.000000e+00 5.00000e+00 5.645000e+03 ▇▁▁▁▁
number of reviews 183 1.00 2.748000e+01 4.951000e+01 0.00 1.000000e+00 7.000000e+00 3.00000e+01 1.024000e+03 ▇▁▁▁▁
reviews per month 15879 0.85 1.370000e+00 1.750000e+00 0.01 2.200000e-01 7.400000e-01 2.00000e+00 9.000000e+01 ▇▁▁▁▁
review rate number 326 1.00 3.280000e+00 1.280000e+00 1.00 2.000000e+00 3.000000e+00 4.00000e+00 5.000000e+00 ▃▇▇▇▇
calculated host listings count 319 1.00 7.940000e+00 3.222000e+01 1.00 1.000000e+00 1.000000e+00 2.00000e+00 3.320000e+02 ▇▁▁▁▁
availability 365 448 1.00 1.411300e+02 1.354400e+02 -10.00 3.000000e+00 9.600000e+01 2.69000e+02 3.677000e+03 ▇▁▁▁▁