Fabulous Hitmontop

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data

    • The Metropolitan Museum of Art Collection (https://metmuseum.github.io/).
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The Metropolitan Museum of Art collected and first published the data in September 2016, and it continues to update the dataset regularly. The museum provides select datasets of information on more than 470,000 artworks in its collection. The dataset can be used for both commercial and noncommercial purposes, because the museum has waived all copyright and related or neighboring rights to it using Creative Commons Zero.
  • Write a brief description of the observations.

    • Each observation represents one artwork in the collection of the Metropolitan Museum of Art. The variables record attributes of the artwork such as its title, artist, region, culture, and time period.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • What characteristics distinguish artworks that are considered popular and important in the Metropolitan Museum of Art collection?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • Description: We would like to explore the characteristics that make an artwork popular and important, such as whether the artwork is on the Timeline of Art History website, the culture of the artwork, and the years in which the artwork was started and completed.

    • Hypothesis 1: Artworks that are on the Timeline of Art History website are more likely to be considered popular and important.

    • Hypothesis 2: Artworks that are in the departments associated with European cultures are more likely to be considered popular and important.

    • Hypothesis 3: Artworks that were completed earlier are more likely to be considered popular and important than artworks that were completed later.

    • Hypothesis 4: Artworks that were acquired earlier are more likely to be considered popular and important than artworks that were acquired later.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Identifier variable:

      objectID: Identifying number for each artwork (unique, can be used as key field)

    • Categorical variables: 

      isHighlight: When “true”, indicates a popular and important artwork in the collection

      isTimelineWork: Whether the object is on the Timeline of Art History website

      classification: General term describing the artwork type.

      department: Indicates The Met’s curatorial department responsible for the artwork

      culture: Information about the culture or people by which an object was created

    • Quantitative variables:

      accessionYear: Year the artwork was acquired

      objectBeginDate: Machine-readable date indicating the year in which work on the artwork began

      objectEndDate: Machine readable date indicating the year the artwork was completed
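
One way to start examining hypotheses like these is to compare the share of highlighted artworks across groups. The sketch below is only an illustration: `toy_met` and its values are invented, `highlight_share()` is a hypothetical helper, and the column names are assumed to match the CSV columns `Is.Highlight` and `Is.Timeline.Work` after they have been converted to logical values.

```r
# Toy data standing in for the Met table; values are invented for illustration.
toy_met <- data.frame(
  Is.Highlight     = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE),
  Is.Timeline.Work = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Hypothetical helper: share of highlighted artworks within each level of a grouping column.
highlight_share <- function(data, group_col) {
  tapply(data$Is.Highlight, data[[group_col]], mean)
}

shares <- highlight_share(toy_met, "Is.Timeline.Work")
print(shares)
```

The same helper could be pointed at a different grouping column, such as the department, to look at Hypothesis 2 in the same way.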

Glimpse of data

# Load the Met collection data and summarize every column with skimr
met <- read.csv("data/MetObjects.csv")
skimr::skim(met)
Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
Caused by warning in `inline_hist()`:
! Variable contains Inf or -Inf value(s) that were converted to NA.
Data summary
Name met
Number of rows 477804
Number of columns 54
_______________________
Column type frequency:
character 49
logical 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Object.Number 0 1 3 50 0 474872 0
Is.Highlight 0 1 4 5 0 2 0
Is.Timeline.Work 0 1 4 5 0 2 0
Is.Public.Domain 0 1 4 5 0 2 0
Gallery.Number 0 1 0 15 426028 419 0
Department 0 1 9 41 0 19 0
AccessionYear 0 1 0 10 3556 160 0
Object.Name 0 1 0 80 1691 28450 0
Title 0 1 0 837 29185 239092 0
Culture 0 1 0 98 270425 7181 0
Period 0 1 0 120 386848 1874 0
Dynasty 0 1 0 45 454571 404 0
Reign 0 1 0 55 466578 390 0
Portfolio 0 1 0 835 454274 3564 13
Artist.Role 0 1 0 1043 204368 6838 0
Artist.Prefix 0 1 0 1241 202269 8047 143569
Artist.Display.Name 0 1 0 2631 202269 64532 0
Artist.Display.Bio 0 1 0 5204 204368 52440 34106
Artist.Suffix 0 1 0 1301 202317 2730 176830
Artist.Alpha.Sort 0 1 0 2698 202269 64490 171
Artist.Nationality 0 1 0 1107 202269 5038 67789
Artist.Begin.Date 0 1 0 1495 202269 29717 31418
Artist.End.Date 0 1 0 1495 202269 29686 31448
Artist.Gender 0 1 0 182 374743 285 0
Artist.ULAN.URL 0 1 0 4572 255783 37236 0
Artist.Wikidata.URL 0 1 0 3919 260072 39363 0
Object.Date 0 1 0 239 13867 32697 0
Medium 0 1 0 8172 7120 64686 7
Dimensions 0 1 0 2232 75293 259821 21
Credit.Line 0 1 0 1002 451 35331 1
Geography.Type 0 1 0 72 418035 116 0
City 0 1 0 62 445397 2567 0
State 0 1 0 33 475254 105 0
County 0 1 0 48 469354 1094 0
Country 0 1 0 75 402053 947 0
Region 0 1 0 64 446444 718 0
Subregion 0 1 0 50 455680 366 0
Locale 0 1 0 83 462095 868 0
Locus 0 1 0 81 470311 1381 0
Excavation 0 1 0 75 461246 405 0
River 0 1 0 42 475709 230 0
Classification 0 1 0 74 78206 1213 0
Rights.and.Reproduction 0 1 0 251 453606 1416 0
Link.Resource 0 1 48 53 0 477804 0
Object.Wikidata.URL 0 1 0 40 455539 22214 0
Repository 0 1 40 40 0 1 0
Tags 0 1 0 153 277404 44831 0
Tags.AAT.URL 0 1 0 755 277404 44371 469
Tags.Wikidata.URL 0 1 0 660 277404 44565 502

Variable type: logical

skim_variable n_missing complete_rate mean count
Metadata.Date 477804 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Object.ID 0 1.00 387582.17 237374.74 1 210119.8 371186.5 563883.2 860873 ▆▇▇▃▅
Constituent.ID 202269 0.58 Inf NaN 107 9278.0 16354.0 1641093810.0 Inf ▇▁▁▁▁
Object.Begin.Date 0 1.00 1294.06 1776.24 -400000 1529.0 1800.0 1890.0 5000 ▁▁▁▁▇
Object.End.Date 0 1.00 1395.14 1154.08 -240000 1582.0 1836.0 1905.0 15335 ▁▁▁▁▇
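
The summary suggests a few cleaning steps before analysis: the highlight and timeline flags were read as character, `AccessionYear` is character with empty strings and values up to 10 characters long, and `Constituent.ID` contains infinite values. The sketch below shows one possible approach on invented toy rows. The helpers `as_flag()` and `as_year()` are hypothetical, and it is an assumption that the flags are stored as strings such as "True" and "False" and that longer `AccessionYear` values begin with a four-digit year.

```r
# Hypothetical helper: convert "True"/"False" strings (assumed encoding) to logical.
as_flag <- function(x) toupper(trimws(x)) == "TRUE"

# Hypothetical helper: keep a leading four-digit year; anything else becomes NA.
as_year <- function(x) {
  year <- substr(trimws(x), 1, 4)
  year[!grepl("^[0-9]{4}$", year)] <- NA
  as.integer(year)
}

# Toy rows that mimic the issues visible in the summary; values are invented.
toy <- data.frame(
  Is.Highlight   = c("True", "False", "False"),
  AccessionYear  = c("1979", "", "2005-02-15"),
  Constituent.ID = c(107, Inf, NA)
)

toy$Is.Highlight  <- as_flag(toy$Is.Highlight)
toy$AccessionYear <- as_year(toy$AccessionYear)
# Treat infinite IDs as missing so numeric summaries stay finite.
toy$Constituent.ID[is.infinite(toy$Constituent.ID)] <- NA

str(toy)
```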

Data 2

Introduction and data

  • Identify the source of the data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data come from the Million Song Dataset, a collaboration between the Echo Nest and LabROSA, a laboratory working towards intelligent machine listening. The Echo Nest derived the data points for one million popular contemporary songs.
  • Write a brief description of the observations.

    • The data contain standard information about each song, such as artist name, title, and release year. They also contain more detailed audio information, such as the duration of the song, the number of musical bars in it, and the length of its fade-in.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • What aspects of a song most influence, or correlate with, its popularity?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • Description: We want to explore how characteristics of a song, such as key, tempo, time signature, duration, and genre, affect its popularity.

    • Hypothesis 1: The tempo and genre of a song have the most influence on its popularity.

    • Hypothesis 2: The key and time signature of a song have very little influence on its popularity, because keys are spread roughly evenly across songs and the time signature distribution is concentrated on a single value (most songs have the same time signature).

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Identifier variable: song.id

    • Categorical variable: artist.terms (music genre)

    • Quantitative variables: song.key, song.loudness, song.tempo, song.time_signature, song.duration, song.hotttnesss
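
A first look at which quantitative features track popularity could use a rank correlation between each feature and `song.hotttnesss`. The sketch below is an illustration under assumptions, not the planned analysis: `toy_music` and its values are invented, and Spearman correlation is just one reasonable choice for skewed variables.

```r
# Toy songs; values are invented for illustration, not taken from music.csv.
toy_music <- data.frame(
  song.tempo      = c(90, 100, 110, 120, 130, 140),
  song.duration   = c(300, 180, 240, 200, 260, 220),
  song.hotttnesss = c(0.20, 0.30, 0.35, 0.50, 0.60, 0.80)
)

features <- c("song.tempo", "song.duration")

# Spearman rank correlation of each feature with popularity.
rank_cor <- sapply(features, function(f) {
  cor(toy_music[[f]], toy_music$song.hotttnesss, method = "spearman")
})
print(round(rank_cor, 2))
```

A categorical variable such as `artist.terms` would need a different comparison, for example popularity summarized within each genre.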

Glimpse of data

# Load the song data and summarize every column with skimr
music <- read.csv("data/music.csv")
skimr::skim(music)
Data summary
Name music
Number of rows 10000
Number of columns 35
_______________________
Column type frequency:
character 4
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
artist.id 0 1 18 18 0 3888 0
artist.name 0 1 1 255 0 4412 0
artist.terms 0 1 0 40 5 459 0
song.id 0 1 18 52 0 10000 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
artist.familiarity 0 1 0.57 0.16 0.00 0.47 0.56 0.67 1.00 ▁▂▇▅▂
artist.hotttnesss 0 1 0.39 0.14 0.00 0.33 0.38 0.45 1.08 ▁▇▃▁▁
artist.latitude 0 1 13.90 20.36 -41.28 0.00 0.00 34.42 69.65 ▁▇▁▃▁
artist.location 0 1 0.08 7.80 0.00 0.00 0.00 0.00 780.00 ▇▁▁▁▁
artist.longitude 0 1 -23.92 43.72 -162.44 -73.95 0.00 0.00 174.77 ▁▂▇▁▁
artist.similar 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
artist.terms_freq 0 1 224.89 22392.16 0.00 0.95 1.00 1.00 2239217.00 ▇▁▁▁▁
release.id 0 1 371024.06 236777.83 0.00 172858.00 333103.00 573532.50 823599.00 ▇▇▅▆▅
release.name 0 1 23.10 1322.90 0.00 0.00 0.00 0.00 85555.00 ▇▁▁▁▁
song.artist_mbtags 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.33 ▇▁▁▁▁
song.artist_mbtags_count 0 1 0.52 0.88 0.00 0.00 0.00 1.00 9.00 ▇▁▁▁▁
song.bars_confidence 0 1 0.24 0.29 0.00 0.04 0.12 0.35 8.86 ▇▁▁▁▁
song.bars_start 0 1 1.07 1.72 0.00 0.44 0.79 1.22 59.74 ▇▁▁▁▁
song.beats_confidence 0 1 0.61 0.32 0.00 0.41 0.69 0.88 1.00 ▃▂▃▆▇
song.beats_start 0 1 0.43 0.81 -60.00 0.19 0.33 0.50 12.25 ▁▁▁▁▇
song.duration 0 1 240.62 246.08 1.04 176.03 223.06 276.38 22050.00 ▇▁▁▁▁
song.end_of_fade_in 0 1 0.76 1.86 0.00 0.00 0.20 0.42 43.12 ▇▁▁▁▁
song.hotttnesss 0 1 -0.24 0.69 -1.00 -1.00 0.00 0.41 1.00 ▇▁▃▆▂
song.key 0 1 5.37 9.67 0.00 2.00 5.00 8.00 904.80 ▇▁▁▁▁
song.key_confidence 0 1 0.45 0.33 0.00 0.22 0.47 0.66 19.08 ▇▁▁▁▁
song.loudness 0 1 -10.48 5.40 -51.64 -13.16 -9.38 -6.53 0.57 ▁▁▁▆▇
song.mode 0 1 0.69 0.46 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
song.mode_confidence 0 1 0.48 0.19 0.00 0.36 0.49 0.61 1.00 ▂▅▇▅▁
song.start_of_fade_out 0 1 229.88 112.02 -21.39 168.86 213.86 266.27 1813.43 ▇▁▁▁▁
song.tatums_confidence 0 1 0.51 0.33 0.00 0.24 0.50 0.77 9.23 ▇▁▁▁▁
song.tatums_start 0 1 0.30 0.51 0.00 0.11 0.19 0.29 12.25 ▇▁▁▁▁
song.tempo 0 1 122.90 35.20 0.00 96.96 120.16 144.01 262.83 ▁▆▇▂▁
song.time_signature 0 1 3.56 1.27 0.00 3.00 4.00 4.00 7.00 ▂▁▇▁▁
song.time_signature_confidence 0 1 0.60 8.99 0.00 0.10 0.55 0.86 898.89 ▇▁▁▁▁
song.title 0 1 10.01 945.49 0.00 0.00 0.00 0.00 94496.00 ▇▁▁▁▁
song.year 0 1 934.70 996.65 0.00 0.00 0.00 2000.00 2010.00 ▇▁▁▁▇
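
In the summary, `song.hotttnesss` has a minimum and first quartile of -1 and `song.year` has a median of 0, which suggests that these values may be placeholders for missing data rather than real measurements. That reading is an assumption and should be checked against the data documentation. If it holds, a sketch like the following (with the hypothetical helper `na_if_sentinel()` and invented toy values) recodes the placeholders to `NA` before any averages or correlations are computed.

```r
# Hypothetical helper: replace a placeholder value with NA.
na_if_sentinel <- function(x, sentinel) {
  x[!is.na(x) & x == sentinel] <- NA
  x
}

# Toy values, invented for illustration.
toy_songs <- data.frame(
  song.hotttnesss = c(-1, 0.41, -1, 0.75),
  song.year       = c(0, 2000, 2010, 0)
)

# Assumption: -1 and 0 mark "unknown" rather than real measurements.
toy_songs$song.hotttnesss <- na_if_sentinel(toy_songs$song.hotttnesss, -1)
toy_songs$song.year       <- na_if_sentinel(toy_songs$song.year, 0)

# Count the recoded values, then average popularity over the known values only.
print(colSums(is.na(toy_songs)))
print(mean(toy_songs$song.hotttnesss, na.rm = TRUE))
```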

Data 3

Introduction and data

  • Identify the source of the data.

    • The Centers for Disease Control and Prevention
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data were collected by the National Center for Immunization and Respiratory Diseases from December 14, 2020 to October 11, 2022. They use data gathered from IISInfo, which refers to the CDC's Immunization Information Systems (IIS), the confidential, population-based databases that record vaccine doses administered by participating providers.
  • Write a brief description of the observations.

    • Each observation (row) of the dataset describes one age group on one day between December 14, 2020 and October 11, 2022. Within each day, the data are divided into 8 age groups, ranging from <2 years old to 65+ years old.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • What is the relationship between the percent of individuals who received either their first or final vaccination dose and the 7-day average group cases of COVID-19 by age group, and how have these variables changed over the course of the pandemic (from the end of 2020 to the end of 2022)?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • Research topic: COVID-19 Vaccination Rate and Case Trends by Age Group

    • Hypothesis 1: As time progresses from the beginning of the pandemic, the vaccination rate across different doses increases and then decreases, because vaccines were only developed in the middle of the pandemic and the rate falls again as more people become vaccinated.

    • Hypothesis 2: The 7-day Average Group Cases Per 100K Individuals should follow a similar trend to the vaccination rate, because there was no vaccine in the beginning and the infection rate then slows as people develop immunity against the virus.

    • Hypothesis 3: Out of the various age groups, individuals aged 18 to 64 have a higher 7-day Average Group Cases Per 100K because they are more active in society and are therefore more likely to catch the disease.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Date on which the vaccines were administered (Date)

    • Age Group of the Vaccinated Patients (Categorical)

    • 7-day Average Group Cases Per 100K Individuals (Quantitative)

    • Percent of Individuals who Received their First Dose (Quantitative)

    • Percent of Individuals who Completed Full Vaccination (Quantitative)
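
The relationship in the research question could be examined separately for each age group. The sketch below is a minimal illustration on invented data: the age-group labels and all values in `toy_covid` are made up, the column names are assumed to match those that `read.csv()` produces, and a Pearson correlation is only one possible summary of the relationship.

```r
# Toy data; age-group labels and values are invented for illustration.
toy_covid <- data.frame(
  AgeGroupVacc = rep(c("18 - 24 Years", "65+ Years"), each = 4),
  Series_Complete_Pop_pct_agegroup = c(0.10, 0.30, 0.50, 0.70, 0.20, 0.50, 0.80, 0.90),
  X7.day_avg_group_cases_per_100k  = c(40, 35, 20, 10, 30, 25, 12, 15)
)

# Split the rows by age group, then correlate vaccination percent with case rate in each.
by_age <- split(toy_covid, toy_covid$AgeGroupVacc)
cor_by_age <- sapply(by_age, function(g) {
  cor(g$Series_Complete_Pop_pct_agegroup, g$X7.day_avg_group_cases_per_100k)
})
print(round(cor_by_age, 2))
```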

Glimpse of data

# Load the CDC vaccination and case trends data and summarize every column with skimr
covid <- read.csv('data/Archive__COVID-19_Vaccination_and_Case_Trends_by_Age_Group__United_States.csv')
skimr::skim(covid)
Data summary
Name covid
Number of rows 5331
Number of columns 5
_______________________
Column type frequency:
character 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date.Administered 0 1 22 22 0 667 0
AgeGroupVacc 0 1 8 13 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
X7.day_avg_group_cases_per_100k 0 1 30.63 36.31 1.62 12.04 21.83 33.35 300.72 ▇▁▁▁▁
Administered_Dose1_pct_agegroup 0 1 0.43 0.37 0.00 0.00 0.41 0.79 0.95 ▇▂▁▃▆
Series_Complete_Pop_pct_agegroup 0 1 0.36 0.34 0.00 0.00 0.31 0.66 0.93 ▇▂▂▅▃
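
`Date.Administered` was read as a 22-character string, so it needs to be parsed before any trend over time can be plotted. The sketch below assumes the strings look like "10/11/2022 12:00:00 AM" (month/day/year followed by a time), which is consistent with the 22-character length but should be confirmed against the file.

```r
# Assumed format: month/day/year followed by a time of day.
toy_dates <- c("12/14/2020 12:00:00 AM", "10/11/2022 12:00:00 AM")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time-of-day text is ignored.
parsed <- as.Date(toy_dates, format = "%m/%d/%Y")

print(parsed)
print(as.numeric(diff(parsed)))  # number of days between the two dates
```

With a real `Date` column, the rows can be sorted chronologically and plotted on a continuous time axis.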