Fabulous Hitmontop

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data
- The Metropolitan Museum of Art Collection (https://metmuseum.github.io/).
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The original data was collected and initially published in September 2016 by the Metropolitan Museum of Art. However, the museum continues to update the data set regularly. The Metropolitan Museum of Art provides select datasets of information on over 470,000 artworks in its collection. This dataset can be used for both commercial and noncommercial purposes. Additionally, the museum has waived all copyright and related or neighboring rights to this dataset using Creative Commons Zero.
Write a brief description of the observations.
- Each observation represents an artwork in the collection of the Metropolitan Museum of Art, with information such as title, author, region, culture, time period, etc. of the artwork as variables.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What characteristics do artworks that are considered popular and important in the Metropolitan Museum of Art Collection have in common?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Description: We would like to explore the characteristics that make an artwork popular and important, such as whether the artwork is on the Timeline of Art History website, the culture of the artwork, and the years in which the artwork was started and completed.
- Hypothesis 1: Artworks that are on the Timeline of Art History website are more likely to be considered popular and important.
- Hypothesis 2: Artworks that are in the departments associated with European cultures are more likely to be considered popular and important.
- Hypothesis 3: Artworks that were completed earlier are more likely to be considered popular and important than artworks that were completed later.
- Hypothesis 4: Artworks that were acquired earlier are more likely to be considered popular and important than artworks that were acquired later.
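One way these hypotheses could be explored is by comparing the share of highlighted artworks across groups. The sketch below is illustrative only: `toy` is made-up data, `highlight_share` is a hypothetical helper, and the column names assume the CSV headers as read by read.csv (Is.Highlight, Is.Timeline.Work) with flags stored as "True"/"False".

```r
# Hypothetical helper: share of highlighted artworks within each level
# of a grouping column. Assumes flags are "True"/"False" text or logicals.
highlight_share <- function(data, group_col) {
  is_highlight <- toupper(as.character(data[["Is.Highlight"]])) == "TRUE"
  vapply(split(is_highlight, data[[group_col]]), mean, numeric(1))
}

# Made-up rows, not taken from the Met data.
toy <- data.frame(
  Is.Highlight     = c("True", "False", "False", "True", "False", "False"),
  Is.Timeline.Work = c("True", "True", "False", "False", "False", "False"),
  stringsAsFactors = FALSE
)

shares <- highlight_share(toy, "Is.Timeline.Work")
print(shares)
```

The same helper could be called with "Department" as the grouping column for Hypothesis 2. A difference in shares on its own is descriptive and does not establish that the grouping variable causes highlight status.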
Identify the types of variables in your research question. Categorical? Quantitative?
- Identifier variable:
  
  objectID: Identifying number for each artwork (unique, can be used as key field)
- Categorical variables:
  
  isHighlight: When “true”, indicates a popular and important artwork in the collection
  
  isTimelineWork: Whether the object is on the Timeline of Art History website
  
  classification: General term describing the artwork type.
  
  department: Indicates The Met’s curatorial department responsible for the artwork
  
  culture: Information about the culture or people that created the object
- Quantitative variables:
  
  accessionYear: Year the artwork was acquired
  
  objectBeginDate: Machine readable date indicating the year work on the artwork began
  
  objectEndDate: Machine readable date indicating the year the artwork was completed
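The skim output further down shows Is.Highlight, Is.Timeline.Work and AccessionYear stored as character columns, so they would likely need converting before they can be used as categorical flags or as a quantitative year. A minimal sketch of that conversion on made-up rows follows; `prepare_met` is a hypothetical helper, and the rule that the year is the first four characters of AccessionYear is an assumption, not something the source states.

```r
# Hypothetical helper: coerce the Met columns used in the research question.
prepare_met <- function(met) {
  # Flags may arrive as text ("True"/"False") or as logicals; handle both.
  to_flag <- function(x) toupper(as.character(x)) == "TRUE"
  met$Is.Highlight <- to_flag(met$Is.Highlight)
  met$Is.Timeline.Work <- to_flag(met$Is.Timeline.Work)
  # ASSUMPTION: the year is the first four characters; anything that does
  # not parse as an integer (including empty strings) becomes NA.
  met$AccessionYear <- suppressWarnings(
    as.integer(substr(met$AccessionYear, 1, 4))
  )
  met
}

# Made-up rows, not taken from the Met data.
toy <- data.frame(
  Is.Highlight     = c("True", "False", "False"),
  Is.Timeline.Work = c("False", "True", "False"),
  AccessionYear    = c("1979", "2005-02-15", ""),
  stringsAsFactors = FALSE
)

clean <- prepare_met(toy)
str(clean)
```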

Glimpse of data

# Read the Met collection export and summarize every column
met <- read.csv("data/MetObjects.csv")
skimr::skim(met)

Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
Caused by warning in `inline_hist()`:
! Variable contains Inf or -Inf value(s) that were converted to NA.

Data summary
Name	met
Number of rows	477804
Number of columns	54
_______________________
Column type frequency:
character	49
logical	1
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique	whitespace
Object.Number	1	3	50	0	474872	0
Is.Highlight	1	4	5	0	2	0
Is.Timeline.Work	1	4	5	0	2	0
Is.Public.Domain	1	4	5	0	2	0
Gallery.Number	1	0	15	426028	419	0
Department	1	9	41	0	19	0
AccessionYear	1	0	10	3556	160	0
Object.Name	1	0	80	1691	28450	0
Title	1	0	837	29185	239092	0
Culture	1	0	98	270425	7181	0
Period	1	0	120	386848	1874	0
Dynasty	1	0	45	454571	404	0
Reign	1	0	55	466578	390	0
Portfolio	1	0	835	454274	3564	13
Artist.Role	1	0	1043	204368	6838	0
Artist.Prefix	1	0	1241	202269	8047	143569
Artist.Display.Name	1	0	2631	202269	64532	0
Artist.Display.Bio	1	0	5204	204368	52440	34106
Artist.Suffix	1	0	1301	202317	2730	176830
Artist.Alpha.Sort	1	0	2698	202269	64490	171
Artist.Nationality	1	0	1107	202269	5038	67789
Artist.Begin.Date	1	0	1495	202269	29717	31418
Artist.End.Date	1	0	1495	202269	29686	31448
Artist.Gender	1	0	182	374743	285	0
Artist.ULAN.URL	1	0	4572	255783	37236	0
Artist.Wikidata.URL	1	0	3919	260072	39363	0
Object.Date	1	0	239	13867	32697	0
Medium	1	0	8172	7120	64686	7
Dimensions	1	0	2232	75293	259821	21
Credit.Line	1	0	1002	451	35331	1
Geography.Type	1	0	72	418035	116	0
City	1	0	62	445397	2567	0
State	1	0	33	475254	105	0
County	1	0	48	469354	1094	0
Country	1	0	75	402053	947	0
Region	1	0	64	446444	718	0
Subregion	1	0	50	455680	366	0
Locale	1	0	83	462095	868	0
Locus	1	0	81	470311	1381	0
Excavation	1	0	75	461246	405	0
River	1	0	42	475709	230	0
Classification	1	0	74	78206	1213	0
Rights.and.Reproduction	1	0	251	453606	1416	0
Link.Resource	1	48	53	0	477804	0
Object.Wikidata.URL	1	0	40	455539	22214	0
Repository	1	40	40	0	1	0
Tags	1	0	153	277404	44831	0
Tags.AAT.URL	1	0	755	277404	44371	469
Tags.Wikidata.URL	1	0	660	277404	44565	502

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
Metadata.Date	477804	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Object.ID	0	1.00	387582.17	237374.74	1	210119.8	371186.5	563883.2	860873	▆▇▇▃▅
Constituent.ID	202269	0.58	Inf	NaN	107	9278.0	16354.0	1641093810.0	Inf	▇▁▁▁▁
Object.Begin.Date	0	1.00	1294.06	1776.24	-400000	1529.0	1800.0	1890.0	5000	▁▁▁▁▇
Object.End.Date	0	1.00	1395.14	1154.08	-240000	1582.0	1836.0	1905.0	15335	▁▁▁▁▇
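The warning printed by skim() reports Inf values, and the numeric summary shows them in Constituent.ID. Object.End.Date also has a maximum of 15335, which lies in the future. A minimal sketch of flagging such values before analysis is shown below on made-up data; `flag_suspect` and the cutoff year 2025 are hypothetical choices. Very negative dates are left alone here because they may be legitimate for prehistoric objects.

```r
# Hypothetical helper: TRUE where a value is missing, infinite,
# or above an upper bound chosen by the analyst.
flag_suspect <- function(x, upper = Inf) {
  !is.finite(x) | x > upper
}

# Made-up rows, not taken from the Met data.
toy <- data.frame(
  Constituent.ID  = c(107, 16354, Inf, NA),
  Object.End.Date = c(1836, 15335, 1905, -240000)
)

bad_id  <- flag_suspect(toy$Constituent.ID)
# ASSUMPTION: an end date after 2025 is treated as a data-entry error.
bad_end <- flag_suspect(toy$Object.End.Date, upper = 2025)

print(data.frame(toy, bad_id, bad_end))
```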

Data 2

Introduction and data

Identify the source of the data.
- Million Song Dataset (https://think.cs.vt.edu/corgis/csv/music/)
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The Million Song Dataset used a company called the Echo Nest to derive data points about one million popular contemporary songs. It is a collaboration between the Echo Nest and LabROSA, a laboratory working towards intelligent machine listening. The file used here contains 10,000 of those songs.
Write a brief description of the observations.
- The data contains standard information about each song, such as artist name, title, and year released. It also contains more advanced information, for example the length of the song, how many musical bars it spans, and how long its fade-in lasts.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Which aspects of a song are most strongly associated with its popularity?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Description: We want to explore how characteristics of a song, such as key, tempo, time signature, duration, and genre, affect its popularity.
- Hypothesis: The tempo and genre of a song have the most influence on its popularity.
- Hypothesis: The key and time signature of a song have very little influence on its popularity, because keys are spread roughly evenly across songs and time signatures are heavily concentrated on a single value (most songs share the same time signature).
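The premise of the second hypothesis (keys spread evenly, time signatures concentrated on one value) can itself be checked before looking at popularity. The sketch below uses simulated values, not the music data; it assumes song.key is coded as integers 0 to 11 and drops anything outside that range. It runs a chi-squared goodness-of-fit test against equal proportions and computes the share of the most common time signature.

```r
set.seed(1)

# Simulated stand-ins for song.key and song.time_signature.
# ASSUMPTION: valid keys are the integers 0-11.
toy_keys <- sample(0:11, size = 1200, replace = TRUE)
toy_sig  <- c(rep(4, 8), 3, 5)

# Keep only valid key codes and count every level, including empty ones.
valid_keys <- toy_keys[toy_keys %in% 0:11]
key_counts <- table(factor(valid_keys, levels = 0:11))

# With no `p` argument, chisq.test() tests against equal probabilities.
uniform_test <- chisq.test(key_counts)
print(uniform_test)

# Share of songs that use the single most common time signature.
top_sig_share <- max(prop.table(table(toy_sig)))
print(top_sig_share)
```

A large p-value would be consistent with evenly spread keys, and a top share close to 1 would support the claim that one time signature dominates.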
Identify the types of variables in your research question. Categorical? Quantitative?
- Identifier variable: song.id
- Categorical variables: artist.terms (music genre)
- Quantitative variables: song.key, song.loudness, song.tempo, song.time_signature, song.duration, song.hotttnesss

Glimpse of data

# Read the song data and summarize every column
music <- read.csv("data/music.csv")
skimr::skim(music)

Data summary
Name	music
Number of rows	10000
Number of columns	35
_______________________
Column type frequency:
character	4
numeric	31
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
artist.id	1	18	18	0	3888
artist.name	1	1	255	0	4412
artist.terms	1	0	40	5	459
song.id	1	18	52	0	10000

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
artist.familiarity	1	0.57	0.16	0.00	0.47	0.56	0.67	1.00	▁▂▇▅▂
artist.hotttnesss	1	0.39	0.14	0.00	0.33	0.38	0.45	1.08	▁▇▃▁▁
artist.latitude	1	13.90	20.36	-41.28	0.00	0.00	34.42	69.65	▁▇▁▃▁
artist.location	1	0.08	7.80	0.00	0.00	0.00	0.00	780.00	▇▁▁▁▁
artist.longitude	1	-23.92	43.72	-162.44	-73.95	0.00	0.00	174.77	▁▂▇▁▁
artist.similar	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
artist.terms_freq	1	224.89	22392.16	0.00	0.95	1.00	1.00	2239217.00	▇▁▁▁▁
release.id	1	371024.06	236777.83	0.00	172858.00	333103.00	573532.50	823599.00	▇▇▅▆▅
release.name	1	23.10	1322.90	0.00	0.00	0.00	0.00	85555.00	▇▁▁▁▁
song.artist_mbtags	1	0.00	0.00	0.00	0.00	0.00	0.00	0.33	▇▁▁▁▁
song.artist_mbtags_count	1	0.52	0.88	0.00	0.00	0.00	1.00	9.00	▇▁▁▁▁
song.bars_confidence	1	0.24	0.29	0.00	0.04	0.12	0.35	8.86	▇▁▁▁▁
song.bars_start	1	1.07	1.72	0.00	0.44	0.79	1.22	59.74	▇▁▁▁▁
song.beats_confidence	1	0.61	0.32	0.00	0.41	0.69	0.88	1.00	▃▂▃▆▇
song.beats_start	1	0.43	0.81	-60.00	0.19	0.33	0.50	12.25	▁▁▁▁▇
song.duration	1	240.62	246.08	1.04	176.03	223.06	276.38	22050.00	▇▁▁▁▁
song.end_of_fade_in	1	0.76	1.86	0.00	0.00	0.20	0.42	43.12	▇▁▁▁▁
song.hotttnesss	1	-0.24	0.69	-1.00	-1.00	0.00	0.41	1.00	▇▁▃▆▂
song.key	1	5.37	9.67	0.00	2.00	5.00	8.00	904.80	▇▁▁▁▁
song.key_confidence	1	0.45	0.33	0.00	0.22	0.47	0.66	19.08	▇▁▁▁▁
song.loudness	1	-10.48	5.40	-51.64	-13.16	-9.38	-6.53	0.57	▁▁▁▆▇
song.mode	1	0.69	0.46	0.00	0.00	1.00	1.00	1.00	▃▁▁▁▇
song.mode_confidence	1	0.48	0.19	0.00	0.36	0.49	0.61	1.00	▂▅▇▅▁
song.start_of_fade_out	1	229.88	112.02	-21.39	168.86	213.86	266.27	1813.43	▇▁▁▁▁
song.tatums_confidence	1	0.51	0.33	0.00	0.24	0.50	0.77	9.23	▇▁▁▁▁
song.tatums_start	1	0.30	0.51	0.00	0.11	0.19	0.29	12.25	▇▁▁▁▁
song.tempo	1	122.90	35.20	0.00	96.96	120.16	144.01	262.83	▁▆▇▂▁
song.time_signature	1	3.56	1.27	0.00	3.00	4.00	4.00	7.00	▂▁▇▁▁
song.time_signature_confidence	1	0.60	8.99	0.00	0.10	0.55	0.86	898.89	▇▁▁▁▁
song.title	1	10.01	945.49	0.00	0.00	0.00	0.00	94496.00	▇▁▁▁▁
song.year	1	934.70	996.65	0.00	0.00	0.00	2000.00	2010.00	▇▁▁▁▇
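The summary above shows song.hotttnesss with a minimum and first quartile of -1, and song.year with a median of 0. This suggests that these values stand in for missing data, which is an assumption here and not something the source states. A minimal sketch of recoding them to NA on made-up rows, where `recode_missing` is a hypothetical helper:

```r
# Hypothetical helper: replace a placeholder value with NA.
recode_missing <- function(x, sentinel) {
  x[!is.na(x) & x == sentinel] <- NA
  x
}

# Made-up rows, not taken from the music data.
toy <- data.frame(
  song.hotttnesss = c(-1, 0.41, 0, 1),
  song.year       = c(0, 2000, 0, 2010)
)

# ASSUMPTION: -1 (hotttnesss) and 0 (year) mean "unknown".
toy$song.hotttnesss <- recode_missing(toy$song.hotttnesss, -1)
toy$song.year       <- recode_missing(toy$song.year, 0)

print(colSums(is.na(toy)))
```

Whether a song.hotttnesss value of 0 is also a placeholder is unclear from the summary, so the sketch leaves it unchanged.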

Data 3

Introduction and data

Identify the source of the data.
- The Centers for Disease Control and Prevention
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- It was collected by the National Center for Immunization and Respiratory Diseases from December 14, 2020 to October 11, 2022. The data set lists IISInfo as its contact; in this context IIS refers to the CDC's Immunization Information Systems, which record administered vaccine doses, not to the Microsoft web server with the same abbreviation.
Write a brief description of the observations.
- Each observation (row) records one age group on one day between December 14, 2020 and October 11, 2022. Each day is split into 8 age groups, ranging from <2 years old to 65+ years old.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What is the relationship between the percent of individuals who received either their first or final vaccination and the 7-day average group cases of COVID-19 by age group, and how have these variables changed over the course of the pandemic (from December 2020 to October 2022)?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Research topic: COVID-19 Vaccination Rate and Case Trends by Age Group
- Hypothesis 1: Over the course of the pandemic, the rate of new vaccinations for each dose first increases and then decreases, because vaccines only became available partway through the pandemic and fewer people remain unvaccinated as time goes on.
- Hypothesis 2: The 7-day Average Group Cases Per 100K Individuals should follow a similar trend to the vaccination rate, because there was no vaccine in the beginning and the infection rate slows down as people develop immunity against the virus.
- Hypothesis 3: Out of the various age groups, individuals aged 18 to 64 have a higher 7-day Average Group Cases Per 100K because they are more active in society and are therefore more likely to catch the disease.
Identify the types of variables in your research question. Categorical? Quantitative?
- Date on which the Vaccines were Administered (Date)
- Age Group of the Vaccinated Patients (Categorical)
- 7-day Average Group Cases Per 100K Individuals (Quantitative)
- Percent of Individuals who Received their First Dose (Quantitative)
- Percent of Individuals who Completed Full Vaccination (Quantitative)
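Because the date arrives as text, it would need parsing before any trend over time can be examined, and the relationship in the research question can then be summarized separately for each age group. The sketch below uses made-up rows. The column names follow the CSV headers as read by read.csv, while the "mm/dd/yyyy hh:mm:ss AM" date layout and the age group labels are assumptions.

```r
# Made-up rows, not taken from the CDC data.
toy <- data.frame(
  Date.Administered = rep(c("12/14/2020 12:00:00 AM",
                            "12/15/2020 12:00:00 AM",
                            "12/16/2020 12:00:00 AM"), times = 2),
  AgeGroupVacc = rep(c("18 - 24 Years", "65+ Years"), each = 3),
  X7.day_avg_group_cases_per_100k = c(30, 20, 10, 5, 10, 15),
  Administered_Dose1_pct_agegroup = c(0.1, 0.2, 0.3, 0.2, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# ASSUMPTION: month/day/year layout; as.Date() ignores the trailing time.
toy$date <- as.Date(toy$Date.Administered, format = "%m/%d/%Y")

# Correlation between first-dose coverage and case rate within each age group.
cor_by_age <- vapply(
  split(toy, toy$AgeGroupVacc),
  function(g) cor(g$Administered_Dose1_pct_agegroup,
                  g$X7.day_avg_group_cases_per_100k),
  numeric(1)
)
print(cor_by_age)
```

A correlation computed this way ignores the time ordering of the observations, so it is only a first descriptive look; plotting both series against `date` would show how they change over the pandemic.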

Glimpse of data

# Read the CDC vaccination and case trends data and summarize every column
covid <- read.csv('data/Archive__COVID-19_Vaccination_and_Case_Trends_by_Age_Group__United_States.csv')
skimr::skim(covid)

Data summary
Name	covid
Number of rows	5331
Number of columns	5
_______________________
Column type frequency:
character	2
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Date.Administered	0	1	22	22	0	667	0
AgeGroupVacc	0	1	8	13	0	8	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
X7.day_avg_group_cases_per_100k	1	30.63	36.31	1.62	12.04	21.83	33.35	300.72	▇▁▁▁▁
Administered_Dose1_pct_agegroup	1	0.43	0.37	0.00	0.00	0.41	0.79	0.95	▇▂▁▃▆
Series_Complete_Pop_pct_agegroup	1	0.36	0.34	0.00	0.00	0.31	0.66	0.93	▇▂▂▅▃