library(tidyverse)
library(skimr)
Project Charmander
Proposal
Data 1
Introduction and data
Identify the source of the data.
- These are two data sets, both concerning music journalism. The first is from Pitchfork, a well-known music publication that has been reviewing albums since the 1990s. The other is from Anthony Fantano, a music YouTuber and blogger who has been reviewing music since the early 2010s.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The Anthony Fantano data set was scraped from his blog website by Kaggle user Apatosaur. Details can be found here: https://www.kaggle.com/datasets/apat0saur/theneedledrop-reviews?resource=download
The Pitchfork data set was collected by a Reddit user. Reddit is down at the time of writing, so I can’t access the details, but they are available at https://www.reddit.com/r/datasets/comments/apdpzz/20783_pitchfork_reviews_jan_5_1999_jan_11_2019/.
Write a brief description of the observations.
- Each observation is one album review, with information such as the publication date, the score the album received, the artist name, and the album genre. The data sets differ somewhat: for example, the Pitchfork data set has an author column, but the Fantano one doesn’t, as he wrote every review.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What is each source’s average score? Do the two averages differ significantly?
How does average score differ across genre? Across time? What about when viewing these traits in tandem?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- For the first question only: the research topic is individual versus publication, that is, how these sources review music differently given that Pitchfork is a company and Fantano is an individual. My hypothesis is that Pitchfork gives higher scores on average: it takes its platform very seriously and would rather use it to amplify voices, especially those of new artists, than to speak out against albums. Fantano probably takes his work and influence less seriously, especially since he started very small, so he is freer to review whatever he wants and give it an honest score. I don’t think Pitchfork would want to be a negative voice in the music community.
- For the second question (how does average score differ across genre, across time, and across both together?): this question investigates general trends in music, such as which genres were popular in the past and which are popular now. We could produce a graph with one trend line per genre, plotting score vs. time (year). The question is open-ended, so we do not have a firm hypothesis until we do the analysis. A rough guess is that scores for electronic music have risen in recent years, while folk/country music has become less favored since it mostly caters to older generations.
- Identify the types of variables in your research question. Categorical? Quantitative?
- The first question uses album score, a quantitative variable, and review source, a categorical one. Genre is another categorical variable we could use, and review date is another quantitative one.
- The second question uses score and release_year, which are both quantitative, and genre, which is categorical and could be used to facet the resulting graph.
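The genre-over-time question could be explored along the following lines. This is only a minimal sketch: the helper names `score_by_genre_year()` and `plot_genre_trends()` are hypothetical, and it assumes the input has columns named `genre`, `release_year`, and `score`, as the Pitchfork data does.

```r
library(dplyr)
library(ggplot2)

# Mean score for every genre/year combination.
# Assumes columns `genre`, `release_year`, and `score` (hypothetical helper).
score_by_genre_year <- function(reviews) {
  reviews %>%
    filter(!is.na(genre), !is.na(release_year), !is.na(score)) %>%
    group_by(genre, release_year) %>%
    summarise(mean_score = mean(score), n_reviews = n(), .groups = "drop")
}

# One trend line per genre; facet_wrap() gives each genre its own panel.
plot_genre_trends <- function(trends) {
  ggplot(trends, aes(x = release_year, y = mean_score)) +
    geom_line() +
    facet_wrap(~genre) +
    labs(x = "Release year", y = "Mean score")
}
```

Once the data is loaded, `plot_genre_trends(score_by_genre_year(pitchfork))` would draw the faceted graph. The `n_reviews` column is kept so that years with very few reviews can be filtered out before plotting.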
Glimpse of data
pitchfork <- read_csv("data/pitchfork.csv")
Rows: 20873 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): artist, album, genre, date, author, role, review, link, label
dbl (3): score, bnm, release_year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(pitchfork)
Rows: 20,873
Columns: 12
$ artist <chr> "David Byrne", "DJ Healer", "Jorge Velez", "Chandra", "Th…
$ album <chr> "“…The Best Live Show of All Time” — NME EP", "Lost Loves…
$ genre <chr> "Rock", "Electronic", "Electronic", "Rock", "Electronic",…
$ score <dbl> 5.5, 6.2, 7.9, 7.8, 3.1, 7.8, 6.8, 7.3, 7.4, 7.7, 9.4, 7.…
$ date <chr> "January 11 2019", "January 11 2019", "January 10 2019", …
$ author <chr> "Andy Beta", "Chal Ravens", "Philip Sherburne", "Andy Bet…
$ role <chr> "Contributor", "Contributor", "Contributing Editor", "Con…
$ review <chr> "Viva Brother, Terris, Mansun, the Twang, Joe Lean & the …
$ bnm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ link <chr> "https://pitchfork.com/reviews/albums/david-byrne-the-bes…
$ label <chr> "Nonesuch", "Planet Uterus", "Self-released", "Telephone …
$ release_year <dbl> 2018, 2019, 2019, 2018, 2018, 2018, 2018, 2018, 2018, 201…
fantano <- read_csv("data/fantano_scores.csv")
Rows: 1272 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): album, artist, genre, name, score, tags, url
dttm (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(fantano)
Rows: 1,272
Columns: 8
$ album <chr> "Planet Her", "Butterfly 3000", "Black Metal 2", "The Life of P…
$ artist <chr> "Doja Cat", "King Gizzard & The Lizard Wizard", "Dean Blunt…
$ date <dttm> 2021-06-30 19:33:00, 2021-06-18 03:33:00, 2021-06-18 03:35:00,…
$ genre <chr> "hip hop", "rock", "other", "hip hop", "hip hop", "pop", "hip h…
$ name <chr> "Doja Cat - Planet Her", "King Gizzard & The Lizard Wizard …
$ score <chr> "5", "5", "7", "5", "6", "6", "8", "8", "6", "7", "2", "5", "6"…
$ tags <chr> "doja cat | planet her | 2021 | album | kemosabe | rap | hip ho…
$ url <chr> "https://www.theneedledrop.com/articles/2021/6/doja-cat-planet-…
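The glimpse shows that Fantano's `score` column is read as character while Pitchfork's is numeric, so the two need to be put on a common type before the averages can be compared. A minimal sketch of one way to do that follows; the helper names `combine_scores()` and `summarise_sources()` are hypothetical, and it assumes both scores are on a comparable 0 to 10 scale, which the values shown suggest but do not confirm.

```r
library(dplyr)

# Stack both sources into one long table with a numeric score.
# Fantano's `score` is character, so it is converted here; entries that
# are not plain numbers become NA and are dropped (an assumption about
# how such entries should be handled).
combine_scores <- function(pitchfork, fantano) {
  pitchfork_scores <- pitchfork %>%
    transmute(source = "Pitchfork", score = as.numeric(score))
  fantano_scores <- fantano %>%
    transmute(source = "Fantano", score = suppressWarnings(as.numeric(score)))
  bind_rows(pitchfork_scores, fantano_scores) %>%
    filter(!is.na(score))
}

# Mean, standard deviation, and number of reviews per source.
summarise_sources <- function(scores) {
  scores %>%
    group_by(source) %>%
    summarise(mean_score = mean(score), sd_score = sd(score),
              n = n(), .groups = "drop")
}
```

With `scores <- combine_scores(pitchfork, fantano)`, one option for testing whether the averages differ is a Welch two-sample t-test, `t.test(score ~ source, data = scores)`. Other tests may suit the data better once the score distributions have been inspected.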
Data 2
Introduction and data
Data source: https://think.cs.vt.edu/corgis/csv/coffee/
“These data were collected from the Coffee Quality Institute’s review pages in January 2018.”
“There is data for both Arabica and Robusta beans, across many countries and professionally rated on a 0-100 scale. All sorts of scoring/ratings for things like acidity, sweetness, fragrance, balance, etc.”
Research question
- How does the score of coffee vary with its region of production? Many regions are known for good coffee because they have a better climate and better resources for growing coffee beans.
- This research question investigates which regions’ coffee receives the highest scores. The overall score will be calculated as the average of the “Data.Scores.” columns, including the scores for Aroma, Flavor, Aftertaste, etc. One hypothesis is that coffee from tropical areas, such as regions in Brazil or Vietnam, scores higher. Regions with higher coffee bean production would probably also be rated better.
- Ultimately, two variables are involved: region of production and score, though a few other columns are used to calculate the score. Region is a categorical variable, while score is quantitative and continuous. Graphs such as a bar chart or box plot might therefore be suitable.
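The overall score described above could be computed roughly as follows. This is a sketch under stated assumptions: the helper name `score_by_region()` is hypothetical, and `Data.Scores.Total` (which already appears to be an aggregate) and `Data.Scores.Moisture` (which appears to be on a different scale) are left out of the average.

```r
library(dplyr)

# Component score columns to average.
# Assumption: Data.Scores.Total and Data.Scores.Moisture are excluded.
component_cols <- c(
  "Data.Scores.Aroma", "Data.Scores.Flavor", "Data.Scores.Aftertaste",
  "Data.Scores.Acidity", "Data.Scores.Body", "Data.Scores.Balance",
  "Data.Scores.Uniformity", "Data.Scores.Sweetness"
)

# Average the components per coffee, then average per region.
# `min_n` drops regions with fewer than that many coffees (hypothetical helper).
score_by_region <- function(coffee, min_n = 1) {
  coffee %>%
    mutate(overall_score = rowMeans(across(all_of(component_cols)),
                                    na.rm = TRUE)) %>%
    group_by(Location.Country, Location.Region) %>%
    summarise(mean_overall = mean(overall_score), n = n(),
              .groups = "drop") %>%
    filter(n >= min_n) %>%
    arrange(desc(mean_overall))
}
```

Calling `score_by_region(coffee, min_n = 5)` would list regions from highest to lowest mean score while skipping regions with fewer than five rated coffees. The threshold of five is arbitrary and only illustrates the parameter.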
Glimpse of data
coffee <- read_csv("data/coffee.csv")
Rows: 989 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(coffee)
Rows: 989
Columns: 23
$ Location.Country <chr> "United States", "Brazil", "Brazil", …
$ Location.Region <chr> "kona", "sul de minas - carmo de mina…
$ Location.Altitude.Min <dbl> 0, 12, 12, 0, 0, 0, 1300, 0, 0, 640, …
$ Location.Altitude.Max <dbl> 0, 12, 12, 0, 0, 0, 1400, 0, 0, 1400,…
$ Location.Altitude.Average <dbl> 0, 12, 12, 0, 0, 0, 1350, 0, 0, 1020,…
$ Year <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2…
$ Data.Owner <chr> "kona pacific farmers cooperative", "…
$ Data.Type.Species <chr> "Arabica", "Arabica", "Arabica", "Ara…
$ Data.Type.Variety <chr> "nan", "Yellow Bourbon", "Yellow Bour…
$ `Data.Type.Processing method` <chr> "nan", "nan", "nan", "nan", "nan", "n…
$ `Data.Production.Number of bags` <dbl> 25, 300, 300, 360, 300, 12, 10, 360, …
$ `Data.Production.Bag weight` <dbl> 45.35920, 60.00000, 60.00000, 6.00000…
$ Data.Scores.Aroma <dbl> 8.25, 8.17, 8.42, 7.67, 7.58, 7.50, 7…
$ Data.Scores.Flavor <dbl> 8.42, 7.92, 7.92, 8.00, 7.83, 7.92, 7…
$ Data.Scores.Aftertaste <dbl> 8.08, 7.92, 8.00, 7.83, 7.58, 7.42, 7…
$ Data.Scores.Acidity <dbl> 7.75, 7.75, 7.75, 8.00, 8.00, 7.67, 7…
$ Data.Scores.Body <dbl> 7.67, 8.33, 7.92, 7.92, 7.83, 7.83, 7…
$ Data.Scores.Balance <dbl> 7.83, 8.00, 8.00, 7.83, 7.50, 7.58, 7…
$ Data.Scores.Uniformity <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10…
$ Data.Scores.Sweetness <dbl> 10.00, 10.00, 10.00, 10.00, 10.00, 10…
$ Data.Scores.Moisture <dbl> 0.00, 0.08, 0.01, 0.00, 0.10, 0.01, 0…
$ Data.Scores.Total <dbl> 86.25, 86.17, 86.17, 85.08, 83.83, 83…
$ Data.Color <chr> "Unknown", "Unknown", "Unknown", "Unk…
Data 3
Introduction and data
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- Identify the types of variables in your research question. Categorical? Quantitative?
Glimpse of data
# add code here