Project Elegant Raichu

Preregistration of analyses

Analysis #1

Explanation

Our research question is: Is the true average price of books in one genre different from that of another genre? Because we are interested in estimating a difference, we plan to use bootstrapping to construct a confidence interval for it. We will use the Amazon data on the Top 100 best-selling books from 2009 to 2021.
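To make the plan concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a difference in mean price. The function name `bootstrap_mean_diff_ci`, the number of resamples, and the prices are all assumptions made up for illustration; they are not our data or our final code.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) (illustrative helper)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group independently, with replacement,
        # keeping the original group sizes.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    diffs.sort()
    alpha = 1.0 - level
    # Approximate percentiles by indexing into the sorted bootstrap differences.
    lower = diffs[int((alpha / 2) * (n_boot - 1))]
    upper = diffs[int((1 - alpha / 2) * (n_boot - 1))]
    observed = statistics.fmean(a) - statistics.fmean(b)
    return observed, lower, upper


# Hypothetical prices in USD, invented for this sketch (not the Amazon data).
nonfiction_prices = [12.0, 15.5, 9.0, 22.0, 18.0, 11.0, 14.0, 20.0]
fiction_prices = [8.0, 10.0, 13.0, 7.5, 9.0, 12.0, 11.0, 6.0]

observed, lower, upper = bootstrap_mean_diff_ci(nonfiction_prices, fiction_prices)
print(f"observed difference in means: {observed:.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

If the resulting interval excludes 0, that would be evidence of a difference in average price between the two genres; if it includes 0, the data would be consistent with no difference.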

However, it is important to note that if we wanted to generalize to the whole population of non-fiction vs. fiction pricing, this data set would not be the best choice. The Amazon data set only includes the Top 100 for each year, and it only includes books sold on Amazon. To improve the data set, we would need many more than 100 books per year. One reason is that the genres could be unbalanced: for example, the Top 100 might consist of 80 fiction books and only 20 non-fiction books. Additionally, we would want a data set representative of books sold in different types of book shops (local, retail, small, large, around the world…). Another possible confounding variable is the book preferences of the type of people who use Amazon.

For context, we obtained this data from Kaggle and cleaned it in the previous project checkpoints.

Amazon seems like a decent source, but we have to be careful about how far we generalize. We can really only generalize to the price difference between non-fiction and fiction books on Amazon (perhaps even only US Amazon). One final point to consider is how we are measuring price and genre. If we also had genres such as romance, comedy, and mystery, we would have more detailed and more useful data, but genre assignment would be more subjective and price differences could be harder to determine. Non-fiction vs. fiction is a good choice because the two are easy to separate, but we do lose a lot of granular detail in price trends.

At this point in our analysis, based on our own experience buying books, we expect that there is a difference in price between non-fiction and fiction books.

Analysis #2

Research Question

Research Question: Has the average price of best-selling Amazon books changed since 2020?

  • Goal: We want to evaluate whether the price of best-selling books on Amazon from 2020 onward (2020 being the year the pandemic began) differs from the price before 2020.

  • Type of Model: We will use a linear regression model to compare prices across the two groups (1. pre-2020 and 2. 2020 or later), representing the prices of books before 2020 and the prices of books from 2020 onward.

  • Explanatory Variable: whether the year is 2020 or later, indicated by the binary variable "pre_2020". Despite its name, this variable takes the value 0 for years before 2020 and 1 for 2020 and later.

  • Response Variable: Price (continuous numerical)

  • Execution: We will filter out rows with NA values for price and create a binary column indicating whether the year is before 2020 (0) or 2020 and later (1). Using this modified data set, we will fit a linear regression model of price on the new variable. The intercept will represent the average price before 2020, and the slope will represent the difference in average price between books from 2020 onward and books from before 2020.

  • Population: Because our data is not a random sample, we can generalize our results only to the prices of the top 100 best-selling books on Amazon for each year. We have only considered the top 100 best-selling books on Amazon between 2009 and 2021.

  • Confounding Variables: Some potential confounding variables that may make it hard to isolate the relationship between book price and whether the book is from before 2020 or from 2020 onward are the popularity of the author, the book format (paperback or hardcover), and the credibility of the publisher.

  • Solutions: One potential solution is a multiple regression analysis that accounts for several factors at once, such as format (hardcover/paperback, converted into a binary variable) and the number of times an author appears in the top-100 list (as a measure of the author's popularity).
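The execution step described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `fit_indicator_regression`, the indicator name `is_2020_or_later` (standing in for our "pre_2020" column), and the (year, price) rows are hypothetical and invented for the example. It fits a simple least-squares line using the closed-form formulas, so that the intercept and slope can be checked against the group means.

```python
import statistics


def fit_indicator_regression(rows):
    """Least-squares fit of price on a 0/1 indicator for year >= 2020.

    rows is a list of (year, price) pairs. Returns (intercept, slope).
    """
    # 0 for years before 2020, 1 for 2020 and later.
    is_2020_or_later = [1.0 if year >= 2020 else 0.0 for year, _ in rows]
    prices = [float(price) for _, price in rows]

    x_bar = statistics.fmean(is_2020_or_later)
    y_bar = statistics.fmean(prices)
    # Closed-form simple linear regression: slope = Sxy / Sxx.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(is_2020_or_later, prices))
    sxx = sum((x - x_bar) ** 2 for x in is_2020_or_later)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical (year, price) rows invented for this sketch; None marks an NA price.
books = [(2015, 10.0), (2017, 12.0), (2018, None), (2019, 14.0), (2020, 15.0), (2021, 17.0)]

# Filter out rows with a missing price before fitting.
clean_books = [(year, price) for year, price in books if price is not None]

intercept, slope = fit_indicator_regression(clean_books)
pre_mean = statistics.fmean(p for y, p in clean_books if y < 2020)
post_mean = statistics.fmean(p for y, p in clean_books if y >= 2020)

print(f"intercept: {intercept:.2f}, mean price before 2020: {pre_mean:.2f}")
print(f"slope: {slope:.2f}, difference in means: {post_mean - pre_mean:.2f}")
```

With a single 0/1 explanatory variable, the fitted intercept equals the mean price of the pre-2020 group and the slope equals the difference between the two group means, which is why this model suits our research question. The multiple regression mentioned under Solutions would add further columns (for example, a hardcover indicator) and would need a fitting routine that handles more than one explanatory variable.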