Project proposal

Author

Gold Kangaroo

library(tidyverse)
library(tidytuesdayR)

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

tuesdata <- tt_load(2024, week = 7)
--- Compiling #TidyTuesday Information for 2024-02-13 ----
--- There are 3 files available ---
--- Starting Download ---

    Downloading file 1 of 3: `historical_spending.csv`
    Downloading file 2 of 3: `gifts_age.csv`
    Downloading file 3 of 3: `gifts_gender.csv`
--- Download complete ---
historical_spending <- tuesdata$historical_spending
gifts_age <- tuesdata$gifts_age
gifts_gender <- tuesdata$gifts_gender

write.csv(historical_spending, file = "data/historical_spending.csv", row.names = FALSE)

write.csv(gifts_age, file = "data/gifts_age.csv", row.names = FALSE)

write.csv(gifts_gender, file = "data/gifts_gender.csv", row.names = FALSE)

The dataset we’re working with is the Valentine’s Day Consumer Data from TidyTuesday. The data was originally sourced from Suraj Das’s Kaggle dataset, which draws on surveys by the National Retail Federation, who asked U.S. adult consumers about their Valentine’s Day spending behavior each year for more than a decade.

The dataset contains three dataframes: historical_spending, gifts_age, and gifts_gender.

gifts_age contains 6 rows and 9 variables: Age, SpendingCelebrating (percent of people spending money on celebrating Valentine’s Day), Candy, Flowers, Jewelry, GreetingCards, EveningOut, Clothing, and GiftCards. Each of the last seven variables measures the average percent spending on the item it is named for.

glimpse(gifts_gender)
Rows: 2
Columns: 9
$ Gender              <chr> "Men", "Women"
$ SpendingCelebrating <dbl> 27, 27
$ Candy               <dbl> 52, 59
$ Flowers             <dbl> 56, 19
$ Jewelry             <dbl> 30, 14
$ GreetingCards       <dbl> 37, 43
$ EveningOut          <dbl> 33, 29
$ Clothing            <dbl> 20, 24
$ GiftCards           <dbl> 18, 24

gifts_gender contains 2 rows and 9 variables: Gender, SpendingCelebrating, Candy, Flowers, Jewelry, GreetingCards, EveningOut, Clothing, and GiftCards. The last eight variables measure the same things as their equivalents in gifts_age. For example, SpendingCelebrating is the percent of people spending money on celebrating Valentine’s Day, and Flowers is the average percent spending on flowers.

glimpse(historical_spending)
Rows: 13
Columns: 10
$ Year               <dbl> 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 201…
$ PercentCelebrating <dbl> 60, 58, 59, 60, 54, 55, 55, 54, 55, 51, 55, 52, 53
$ PerPerson          <dbl> 103.00, 116.21, 126.03, 130.97, 133.91, 142.31, 146…
$ Candy              <dbl> 8.60, 10.75, 10.85, 11.64, 10.80, 12.70, 13.11, 12.…
$ Flowers            <dbl> 12.33, 12.62, 13.49, 13.48, 15.00, 15.72, 14.78, 14…
$ Jewelry            <dbl> 21.52, 26.18, 29.60, 30.94, 30.58, 36.30, 33.11, 32…
$ GreetingCards      <dbl> 5.91, 8.09, 6.93, 8.32, 7.97, 7.87, 8.52, 7.36, 6.5…
$ EveningOut         <dbl> 23.76, 24.86, 25.66, 27.93, 27.48, 27.27, 33.46, 28…
$ Clothing           <dbl> 10.93, 12.00, 10.42, 11.46, 13.37, 14.72, 15.05, 13…
$ GiftCards          <dbl> 8.42, 11.21, 8.43, 10.23, 9.00, 11.05, 12.52, 10.23…

historical_spending contains 13 rows and 10 variables: Year, PercentCelebrating, PerPerson, Candy, Flowers, Jewelry, GreetingCards, EveningOut, Clothing, and GiftCards. PerPerson measures the average amount each person spends for Valentine’s Day, and PercentCelebrating is the percent of people celebrating the holiday. Candy, Flowers, Jewelry, GreetingCards, EveningOut, Clothing, and GiftCards each measure the average amount a person spends on the item the variable is named for (e.g., the Clothing variable measures the average amount spent on clothing for Valentine’s Day).

There are no missing or NA values in any of the three dataframes.
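One way to verify a claim like this is to count missing values column by column. The sketch below is illustrative only: it rebuilds gifts_gender by hand from the glimpse() output above so that it runs without a download, and the names na_counts and any_missing are our own. The same two checks could be applied to gifts_age and historical_spending once they are loaded.

```r
# gifts_gender rebuilt by hand from the glimpse() output above (an assumption
# made so the sketch runs offline; in the project it comes from tt_load()).
gifts_gender <- data.frame(
  Gender = c("Men", "Women"),
  SpendingCelebrating = c(27, 27),
  Candy = c(52, 59),
  Flowers = c(56, 19),
  Jewelry = c(30, 14),
  GreetingCards = c(37, 43),
  EveningOut = c(33, 29),
  Clothing = c(20, 24),
  GiftCards = c(18, 24)
)

# Number of NA values in each column; every count should be zero.
na_counts <- colSums(is.na(gifts_gender))
print(na_counts)

# A single TRUE/FALSE answer for the whole data frame.
any_missing <- any(is.na(gifts_gender))
print(any_missing)
```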

We chose to analyze this dataset because of the variety of variables that would allow us to effectively answer different questions regarding consumer spending on Valentine’s Day. Furthermore, we believe that this dataset is extremely relevant to current events as Valentine’s Day just happened, allowing us to not only explore fascinating questions but also visualize insights that connect back to recent events.

Questions

  1. How have gifting preferences changed over time?
  2. How does spending vary by age and gender over time?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

How Have Gifting Preferences Changed Over Time?

The first question looks at how gifting preferences have changed over time. People give a variety of gifts on Valentine’s Day, so we want to examine whether there are any general trends in what people like to give. This dataset covers 13 years of surveys (2010 to 2022) and records the average amount that a survey respondent spends on categories such as candy, flowers, and jewelry. This may be especially interesting because some changes could line up with the pandemic and with what was available at the time.

To do this, we’ll focus on Year, Candy, Flowers, Jewelry, GreetingCards, EveningOut, Clothing, and GiftCards from the historical_spending data to compare spending across these categories. We likely don’t need to introduce any external data: the dataset already contains what we need, and we are not sure we could easily find another source of per-person averages that would improve our exploration of individual spending patterns. We will explore the pattern with a data visualization of these categories.

To prepare the data for visualization, we’ll first select only Year and the spending category variables listed above. We’ll then use pivot_longer to convert historical_spending to a longer form in which each row represents one spending category (candy, flowers, jewelry, etc.) in one year, producing a data frame with three columns: year, spending category, and amount.
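The reshaping step could look roughly like the sketch below. It is an illustration, not our final code: historical_sample holds only the first four years (2010 to 2013) as they appear in the glimpse() output above, and the names historical_sample, spending_long, category, and amount are placeholders we chose for the example.

```r
library(dplyr)
library(tidyr)

# First four rows of historical_spending, copied from the glimpse() output
# above (a hand-built sample so the sketch runs without a download).
historical_sample <- data.frame(
  Year = c(2010, 2011, 2012, 2013),
  PercentCelebrating = c(60, 58, 59, 60),
  PerPerson = c(103.00, 116.21, 126.03, 130.97),
  Candy = c(8.60, 10.75, 10.85, 11.64),
  Flowers = c(12.33, 12.62, 13.49, 13.48),
  Jewelry = c(21.52, 26.18, 29.60, 30.94),
  GreetingCards = c(5.91, 8.09, 6.93, 8.32),
  EveningOut = c(23.76, 24.86, 25.66, 27.93),
  Clothing = c(10.93, 12.00, 10.42, 11.46),
  GiftCards = c(8.42, 11.21, 8.43, 10.23)
)

spending_long <- historical_sample %>%
  # Keep the year plus the seven spending categories only.
  select(Year, Candy:GiftCards) %>%
  # One row per year and category: columns Year, category, amount.
  pivot_longer(
    cols = -Year,
    names_to = "category",
    values_to = "amount"
  )

print(head(spending_long))
```

With four years and seven categories, the sample yields 28 rows. The full data should yield 13 x 7 = 91.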

To create the visualization, we’ll draw a line chart with geom_line, placing year (2010 to 2022) on the x-axis and amount on the y-axis and mapping spending category to color. The result is seven visually distinct lines, one per category.
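A minimal sketch of such a chart follows. To keep it short, it assumes the data is already in long form and uses only three of the seven categories for 2010 to 2013, with values copied from the glimpse() output above. The names spending_long and spending_plot, along with the axis and legend labels, are placeholders.

```r
library(ggplot2)

# Long-form sample: three categories over four years, values taken from the
# glimpse() output above (a subset chosen only to keep the example small).
spending_long <- data.frame(
  Year = rep(2010:2013, times = 3),
  category = rep(c("Candy", "Flowers", "Jewelry"), each = 4),
  amount = c(
    8.60, 10.75, 10.85, 11.64,   # Candy
    12.33, 12.62, 13.49, 13.48,  # Flowers
    21.52, 26.18, 29.60, 30.94   # Jewelry
  )
)

# One line per category, colored by category.
spending_plot <- ggplot(
  spending_long,
  aes(x = Year, y = amount, color = category)
) +
  geom_line() +
  labs(
    x = "Year",
    y = "Average amount spent",
    color = "Category"
  )

print(spending_plot)
```

Because color is mapped to category inside aes(), ggplot2 groups the rows by category automatically, so no separate group aesthetic is needed here.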

How Does Spending Vary by Age and Gender Over Time?

Our second question asks how spending varies by age and gender, and our team hopes to analyze spending patterns along both dimensions. Our dataset includes three variables that allow us to do this analysis: SpendingCelebrating, Age, and Gender. Age is a character variable in the gifts_age dataset, alongside SpendingCelebrating, which is a double and represents the percent of people spending money on celebrating Valentine’s Day. Similarly, Gender is a character variable in the gifts_gender dataset, which also has a SpendingCelebrating variable measuring the same percentage for each gender category.

Our team plans to use these variables to better understand how Valentine’s Day spending varies by age and gender. This could include a chart that shows the distribution of spending by gender and age. A scatterplot would show individual points, a boxplot would focus on the distribution without the noise of individual points, and a histogram would balance the overall spread against a more detailed view of where the points fall. Which visualization we use depends on the actual spread of the data we find during our analysis and on what best represents how spending varies by age and gender. Given that the dataset includes these variables, we do not plan to create any new variables for this question, nor do we plan to introduce data from other sources. However, we plan to pivot gifts_age and gifts_gender using pivot_longer so that each row represents one spending category for one age group or gender. We are confident we have enough data to understand and visualize the spending patterns across these categories.
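The pivot for this question might look like the sketch below, shown for gifts_gender only. The data frame is rebuilt by hand from the glimpse() output above so the example runs offline, and the names gender_long, category, and percent are placeholders. The same call should work for gifts_age with Age in place of Gender, assuming its spending columns are named identically.

```r
library(tidyr)

# gifts_gender rebuilt from the glimpse() output above (hand-built so the
# sketch runs offline; in the project it comes from tt_load()).
gifts_gender <- data.frame(
  Gender = c("Men", "Women"),
  SpendingCelebrating = c(27, 27),
  Candy = c(52, 59),
  Flowers = c(56, 19),
  Jewelry = c(30, 14),
  GreetingCards = c(37, 43),
  EveningOut = c(33, 29),
  Clothing = c(20, 24),
  GiftCards = c(18, 24)
)

# One row per gender and spending category. Gender and SpendingCelebrating
# are left as identifying columns and repeated on each row.
gender_long <- pivot_longer(
  gifts_gender,
  cols = Candy:GiftCards,
  names_to = "category",
  values_to = "percent"
)

print(gender_long)
```

The result has 14 rows (2 genders x 7 categories) and four columns: Gender, SpendingCelebrating, category, and percent.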