Project Elegant Raichu
Report
1. Introduction:
The context of our work is the purchase of the top 100 best-selling books on Amazon from the years 2009 to 2021. We want to analyze how factors like genre, cover type, author popularity, and year affect the prices of books over the course of those years.
Our research question is: What factors affect price of a book from the top 100 best-selling books on Amazon from 2009-2021? We examine how characteristics of these books, such as genre (fiction or nonfiction), cover type, author popularity, and whether they were sold before or after 2020, relate to their official prices.
2. Data description:
- What are the observations (rows) and the attributes (columns)?
The original data set has 1291 rows and 10 columns, with each row representing a book (name) and each column representing a variable (such as genre or price). Our research question involves variables that are both quantitative (such as price) and categorical (such as genre).
The full list of columns is: Unnamed (the index of each book in the order it was scraped, 1-1,290; 13 years of about 100 books each), Price, Ranks, Title, No. of Reviews, Ratings, Author, Cover Type, Year, and Genre. We further cleaned and organized the data from the original 10 columns down to 6, so that the variables we are interested in are: quantitative (Year, Rating, Price, Author Popularity) and categorical (Genre, Cover Type). When choosing our predictors, we took care to avoid over-fitting the model, which would hinder our interpretation of the relationship between each predictor and the price.
- Why was this data set created?
- The data for TOP 100 BEST SELLING BOOKS ON AMAZON 2009-2021 was scraped from the Amazon website and published on Kaggle to record the top 100 best-selling books for each year from 2009 to 2021.
- Who funded the creation of the data set?
- Abdulhamid Adavize and Chisom Promise were the collaborators on this data set.
- What processes might have influenced what data was observed and recorded and what was not?
As learned from the website, “Amazon does not have a definite categorization of the books, and due to this, information from Goodreads, the world’s largest site for readers and book recommendations, was used to group the books into two distinct categories which are Fiction and Non-Fiction.” The fact that the only genres are Non-fiction and Fiction may fail to account for a reason some books are ranked higher than others.
Amazon’s algorithm might promote a higher-rated book to its audience, leading more customers to buy it and potentially pushing its rank higher (biasing the ranking of the book). In other words, the ranks may become a self-fulfilling prophecy.
The data contain no digital copies. The 8 cover type options are: Board book, Cards, Hardcover, Mass Market Paperback, Pamphlet, Paperback, Printed Access Code, and Spiral-bound. However, our analysis only included Hardcover and Paperback, so we can’t gauge whether a certain period shifted people toward buying fewer physical books in general compared to digital ones. This could be especially important for Amazon because many people use the Amazon Kindle.
Inflation over the years 2009-2021 is not accounted for. This might have an especially large impact toward the beginning of the data set given the financial crisis of ’08-’09.
- What pre-processing was done, and how did the data come to be in the form that you are using?
- The data set has been cleaned and missing values that could be retrieved from the website have been filled, with the exception of the names of books that are no longer available in the store. The coverage dates for the data set are from 12/31/2018 to 12/31/2021, and the data was last updated five months ago.
- As previously mentioned, we created the variable for Author Popularity. We also reduced the Cover Type options to only Hardcover and Paperback; we excluded Mass Market Paperback because there were so few books in that category.
- We also mutated a new “fiction” variable from genre.
- If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
- The authors of the books probably knew that the information would be public, but perhaps not that it would be scraped from Amazon. Likewise, Goodreads users knew others would see their reviews, but they would not have known that the collaborators would scrape their data or use their opinions on whether a book was fiction or non-fiction.
3.0 Data Analysis: EDA Results
Overview
- Visualizing Price Change Throughout Years:
- Average Price Based on Rank Split into Top 50 (“High”) & Bottom 50 (“Low”)
# A tibble: 2 × 3
  rankclass avg_price sd_price
  <chr>         <dbl>    <dbl>
1 high           14.2     8.83
2 low            14.1    10.4
Examining Variables as Potential Predictors
- Visualizing Count in the Top 100 By Cover Type:
- Average price based on whether book was released before 2020, the year of the pandemic:
# A tibble: 2 × 3
  pre_2020 avg_price sd_price
  <chr>        <dbl>    <dbl>
1 No            11.7     4.83
2 Yes           14.6    10.2
- Summary statistics for books that appear across many different years:
# A tibble: 210 × 8
   title           count year_start year_end mean_ranking mean_price mean_rating
   <chr>           <int>      <dbl>    <dbl>        <dbl>      <dbl>       <dbl>
 1 Publication Ma…    10       2010     2019         29.4      24.3         4.5
 2 StrengthsFinde…    10       2010     2019         19.9      18.3         4.1
 3 The 7 Habits o…     9       2010     2018         45.7      18.9         4.66
 4 The Four Agree…     9       2013     2021         25.8       7.74        4.7
 5 How to Win Fri…     8       2013     2021         43.1      10.5         4.7
 6 Love You Forev…     8       2014     2021         64.1       4.98        4.9
 7 The Great Gats…     8       2012     2019         53.1       6.79        4.5
 8 The Official S…     8       2010     2018         32.6      34.0         4.43
 9 What to Expect…     8       2012     2021         63.8      15.4         4.8
10 Jesus Calling,…     7       2011     2017         28.7       7.53        4.9
# ℹ 200 more rows
# ℹ 1 more variable: mean_reviews <dbl>
- Visualizing top book counts based on whether book was fiction or non-fiction:
# A tibble: 2 × 8
  genre       count year_start year_end mean_ranking mean_price mean_rating
  <chr>       <int>      <dbl>    <dbl>        <dbl>      <dbl>       <dbl>
1 Fiction       475       2009     2021         49.8       13.4        4.65
2 Non Fiction   692       2009     2021         50.5       14.6        4.63
# ℹ 1 more variable: mean_reviews <dbl>
- Average price, standard deviation of price, and total count by author popularity:
# A tibble: 16 × 4
   popularity avg_price sd_price popularity_score_count
        <int>     <dbl>    <dbl>                  <int>
 1          1      13.7     6.11                    249
 2          2      13.5     6.82                    174
 3          3      16.4    18.3                     165
 4          4      12.6     5.29                     84
 5          5      12.2     4.38                     80
 6          6      13.6     6.99                     48
 7          7      16.7    10.2                      56
 8          8      14.5     6.09                     40
 9          9      10.4     4.30                     54
10         10      13.3     9.78                     40
11         11      15.9    12.2                      44
12         12      15.2     3.17                     12
13         13      16.7     7.04                     52
14         15      14.8     3.84                     15
15         17      15.4    12.2                      34
16         20      12.6     3.14                     20
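The grouped summaries in this section (average price and standard deviation of price per group) can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for the report; the field names `rank`, `price`, and `rankclass` and the toy rows are assumptions made for the example, and the rank split follows the Top 50 / Bottom 50 rule described above.

```python
from collections import defaultdict
from statistics import mean, stdev


def rank_class(rank):
    """Label ranks 1-50 as 'high' and ranks 51-100 as 'low'."""
    return "high" if rank <= 50 else "low"


def summarize_price_by_group(rows, key):
    """Return {group: (avg_price, sd_price)} for the given grouping key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["price"])
    # stdev() is the sample standard deviation (n - 1 denominator).
    return {group: (mean(prices), stdev(prices)) for group, prices in groups.items()}


# Toy rows for illustration only; these are not values from the data set.
books = [
    {"rank": 3, "price": 10.0},
    {"rank": 40, "price": 14.0},
    {"rank": 60, "price": 8.0},
    {"rank": 95, "price": 20.0},
]
for book in books:
    book["rankclass"] = rank_class(book["rank"])

summary = summarize_price_by_group(books, "rankclass")
```

The same helper could be reused with any other grouping key, such as a pre-2020 flag or genre, provided every group has at least two rows so that a standard deviation is defined.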
3.1 Data Analysis: Book Prices Since 2020
# A tibble: 5 × 2
term estimate
<chr> <dbl>
1 intercept 13.1
2 pre_2020 2.83
3 cover_typePaperback -2.26
4 popularity 0.0651
5 fictionyes -1.64
# A tibble: 5,000 × 3
# Groups: replicate [1,000]
replicate term estimate
<int> <chr> <dbl>
1 1 intercept 14.2
2 1 pre_2020 0.694
3 1 cover_typePaperback -0.726
4 1 popularity -0.0896
5 1 fictionyes 0.330
6 2 intercept 14.6
7 2 pre_2020 -0.567
8 2 cover_typePaperback 0.473
9 2 popularity -0.0133
10 2 fictionyes -0.355
# ℹ 4,990 more rows
# A tibble: 5 × 2
term p_value
<chr> <dbl>
1 cover_typePaperback 0
2 fictionyes 0.006
3 intercept 0.172
4 popularity 0.32
5 pre_2020 0
\[ \widehat{book~price} = 13.08 + 2.83 \times pre~2020 - 2.26 \times cover~type_{paperback} \]
\[ + 0.07 \times popularity - 1.64 \times fiction \]
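To make the fitted equation concrete, the sketch below plugs values into it. The coefficients are copied from the equation above; the function name, argument names, and the example book (a pre-2020 fiction paperback whose author appears three times) are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the fitted equation above (rounded to two decimals).
COEFFICIENTS = {
    "intercept": 13.08,
    "pre_2020": 2.83,
    "paperback": -2.26,
    "popularity": 0.07,
    "fiction": -1.64,
}


def predict_price(pre_2020, paperback, popularity, fiction):
    """Predicted price in USD; pre_2020, paperback and fiction are 0/1 indicators."""
    c = COEFFICIENTS
    return (
        c["intercept"]
        + c["pre_2020"] * pre_2020
        + c["paperback"] * paperback
        + c["popularity"] * popularity
        + c["fiction"] * fiction
    )


# Hypothetical book: released before 2020, paperback, fiction, author appears 3 times.
example = predict_price(pre_2020=1, paperback=1, popularity=3, fiction=1)
print(round(example, 2))  # 13.08 + 2.83 - 2.26 + 0.21 - 1.64 = 12.22
```

Setting all three indicators to 0 gives the baseline case of a hardcover, non-fiction book released in 2020 or later, for which only the intercept and the popularity term contribute.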
Why Do We Consider Relation to 2020, Cover Type, Author Popularity, and Genre?
Pre-2020: The Covid-19 pandemic had a crucial impact on consumer behavior, affecting the economy and the supply and demand of several goods. As more people have been staying inside their homes since 2020, we could consider factors, such as higher demand for books in times of isolation, as well as a lower dependence on book stores due to the Internet.
Cover Type: Since hardcover books often require a wider variety of materials for durability and binding, we can generally expect (not infer) that they are likely to be more expensive. This is why cover type may have a significant impact on prices.
Author Popularity: Author popularity was measured by counting the number of times each author appears in the data frame. An author whose name appears on the list more often may have greater appeal to consumers, which could possibly lead to higher prices for their books.
Genre: According to a survey study conducted by the Pew Research Center in 2021, 72% of the 53% of respondents who stated that they read for pleasure said that they read fiction, while 25% said that they preferred non-fiction books. Although this data was only from 2021, we chose to include genre because survey studies of this kind are usually repeated on a periodic basis and we considered the source credible. The survey speaks to whether fiction or non-fiction books had higher demand and, therefore, possibly higher prices on average (Rainie).
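The derived predictors described above can be sketched as follows. This is an illustration under assumptions rather than the report's own code: the field names `author`, `year`, and `genre`, the label values, and the toy rows are hypothetical. Popularity is the number of rows in which an author appears, and 2020 itself is treated as not pre-2020.

```python
from collections import Counter


def add_predictors(rows):
    """Return copies of rows with popularity, pre_2020 and fiction fields added."""
    # Popularity = number of rows in which the author appears.
    appearances = Counter(row["author"] for row in rows)
    enriched = []
    for row in rows:
        new_row = dict(row)
        new_row["popularity"] = appearances[row["author"]]
        # 2020 itself counts as "No" (not released before 2020).
        new_row["pre_2020"] = "Yes" if row["year"] < 2020 else "No"
        new_row["fiction"] = "yes" if row["genre"] == "Fiction" else "no"
        enriched.append(new_row)
    return enriched


# Toy rows for illustration only; authors and values are made up.
rows = [
    {"author": "Author A", "year": 2015, "genre": "Fiction"},
    {"author": "Author A", "year": 2020, "genre": "Fiction"},
    {"author": "Author B", "year": 2019, "genre": "Non Fiction"},
]
enriched = add_predictors(rows)
```

Counting appearances this way treats every list entry equally, which is the limited view of popularity discussed in the Limitations section.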
3.2 Data Analysis: Genre vs. Price
Next, we use bootstrapping to estimate the difference in mean price between fiction and nonfiction books.
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 -2.66 -0.0238
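A percentile bootstrap interval for a difference in means can be sketched as below. This is a simplified stand-in, not the procedure that produced the interval above: it resamples each group separately, uses a simple nearest-index percentile, and the function name, defaults, and toy prices are assumptions for illustration.

```python
import random
from statistics import mean


def bootstrap_diff_ci(group_a, group_b, level=0.99, reps=1000, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    alpha = 1 - level
    lower = diffs[round((alpha / 2) * (reps - 1))]
    upper = diffs[round((1 - alpha / 2) * (reps - 1))]
    return lower, upper


# Toy prices for illustration only; these are not values from the data set.
fiction_prices = [10.0, 11.0, 12.0, 13.0]
nonfiction_prices = [20.0, 21.0, 22.0, 23.0]
lower_ci, upper_ci = bootstrap_diff_ci(fiction_prices, nonfiction_prices)
```

If the resulting interval excludes 0, as it does for the interval reported above, the data are consistent with a real difference in mean price between the two groups at the chosen confidence level.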
Why Genre? Our barplot visualization for genre demonstrates why we have chosen genre as a variable that may have a significant impact on the prices of our sample of books. Non-fiction books generally seem to have a higher price in the visualization (except between 2017 and about 2019, when their prices seem to be generally lower than those of fiction, on average). This may possibly indicate that price varies by genre, and therefore, we may assume (not state) that these two variables are dependent.
4. Evaluation of significance
Prices Since 2020: The p values calculated show us the significance of the predictor variable, pre-2020 (whether books were produced before 2020 or not), accounting for other predictor variables, such as genre (fiction or not), cover type, and author popularity (how many times the author appeared in the data frame).
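- We found the p-value of cover type to be 0.
- We found the p-value of fiction to be 0.006.
- We found the p-value of author popularity to be 0.320.
- We found the p-value of pre-2020 to be 0.
A reported p-value of 0 means that none of the 1,000 simulated replicates produced an estimate as extreme as the observed one, so it is better read as p < 0.001 than as exactly zero.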
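The idea behind a simulation-based p-value can be sketched with a permutation test. The example below is a deliberate simplification: it tests a single difference in means between two groups rather than a coefficient in a multiple regression, and the function name, defaults, and toy values are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, reps=1000, seed=0):
    """Two-sided p-value for a difference in means under random relabeling."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        # Shuffling the pooled values breaks any link between group and price.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / reps


# Toy values for illustration only; these are not values from the data set.
p_no_difference = permutation_p_value([1, 2, 3, 4], [1, 2, 3, 4])
p_separated = permutation_p_value(list(range(20)), list(range(100, 120)))
```

With two fully separated groups, none of the shuffles matches the observed difference, so the function returns 0; this only means the p-value is smaller than 1 divided by the number of replicates.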
Genre and Price: This visualization from the simulation shows that we are 99% confident that the difference in mean prices between fiction and non-fiction books is between the two red lines (between -2.66 USD and -0.024 USD).
5. Interpretation and conclusions
Prices Since 2020: Our table below summarizes our interpretation of data analysis for prices since 2020, using hypothesis testing for our multiple regression. Our significance level: ⍺ = 0.05
Predictor | p-value | Interpretation |
---|---|---|
pre-2020 | 0 | We reject the null hypothesis, which states that there is no relationship between whether or not the book was released before 2020 and the price of the top 100 best-selling books between 2009 and 2021. The p-value is statistically significant since 0 < 0.05 ➔ there is significant evidence that allows us to reject the null hypothesis. |
cover_type | 0 | We reject the null hypothesis since 0 < 0.05. There is significant evidence of a relationship between cover type and the price of the top 100 best-selling books between 2009 and 2021 ➔ the p-value is statistically significant. |
fiction | 0.006 | We reject the null hypothesis since 0.006 < 0.05. There is significant evidence of a relationship between fiction (whether or not the book is fiction) and the price of the top 100 best-selling books between 2009 and 2021. |
popularity | 0.320 | We fail to reject the null hypothesis since 0.320 is greater than the significance level of 0.05. There is not enough evidence to reject the null hypothesis, which states that there is no relationship between author popularity (the number of times the author appears in the data frame) and the price of the top 100 best-selling books between 2009 and 2021 ➔ the p-value is not statistically significant. |
Genre and Price: We are 99% confident that the true difference in mean price between fiction books and nonfiction books is between -2.66 and -0.024 dollars. Our main finding is that the genre of a book on Amazon may have an impact on price. Because 0 is not included in the 99% confidence interval for the difference in means, the mean price of fiction books may differ from the mean price of nonfiction books. This is further supported by the p-value of genre being less than 0.05, accounting for other predictors, in our multiple regression model. Together, these results suggest that we may reject the null hypothesis that there is no difference in price depending on whether a book is fiction or non-fiction.
6. Limitations
Limitations | Resolution |
---|---|
Confounding Variables: Initially, we used a regression model with only pre_2020 as a predictor, but we wanted to account for potential confounding variables that may make it hard to distinguish the relationship between book prices and whether the books were released before 2020 or from 2020 onward. | We modified our initial model so that it uses multiple regression to consider potential confounding variables, including the popularity of the author and the book format (whether it was paperback or hardcover). |
Variable Calculation: For the author popularity variable, we assume that an author’s level of popularity is reflected in the number of times they appear in our data set. Based on the p-value, we fail to reject the null hypothesis that there is no relationship between author popularity and book price. However, only accounting for the number of appearances in this data set gives us a limited view of which authors are considered “popular” among readers. For instance, some authors may appear less often in the data set while their books have higher ratings than those of authors who appear more often. | Since we were working with a limited view of author popularity, we narrowed the scope of our conclusion (our failure to reject the null hypothesis for the relationship between price and author popularity) to only the Top 100 Amazon books from 2009-2021. |
Causation vs. Inference: While our regression model may give us some insight into a relationship or association between the predictors and the price, we cannot assume causation. The data come from an observational study. | We ensured that we only made inferences based on our results from the data analysis methods. Recognizing that the data was collected from an observational study, our statements imply an association rather than a causal relationship. |
Generalization of Research Question: We also highlight that initially our research question was “How has the price of books changed throughout time based on the rating, genre, and ranking of the book?” We changed the question because we realized 1) this question has the embedded assumption that rating, genre, and ranking had an effect on price and 2) we did not have accurate inflation data, which could be a confounding factor. | In our new research question, “What factors affect price of a book from the top 100 best-selling books on Amazon from 2009-2021?”, we did not assume initially which factors affected price; in fact, that is what we set out to look for (and all these factors experience inflation in the same way). |
Some other limitations that we acknowledge:
Moving forward, we might need a more flexible regression or machine learning model to do a proper analysis. This would better help us handle complexities in the data.
As mentioned, the data set does not account for digital copies of the book being sold (Kindle, E-books, etc.), so if at a certain time, more were being sold online, it isn’t accounted for. Also, perhaps digital books are cheaper to produce, so they might be less expensive.
Our conclusions cannot be readily extrapolated globally because not many countries rely on Amazon to buy books. They are also limited within the U.S. because the data only account for books and readers on Amazon. Therefore, the findings are not applicable to the entire realm of book preference and pricing. To improve the data set, we would also need many more than 100 books. One reason is that a top 100 could consist of 80 fiction and only 20 non-fiction books (although we know that is not the case for this particular data set). If there were an imbalance in the counts of fiction vs. nonfiction, it would affect an ML analysis, so we would need to account for that imbalance as well. Additionally, we would want a data set representative of books sold in different types of book shops (local, retail, small, large, around the world…). Perhaps another confounding variable is the book preference of the type of people who use Amazon. We can really only generalize to the price difference of non-fiction vs. fiction books on Amazon (perhaps even only US Amazon). One final point to consider is how we are measuring price and genre. If we had romance, comedy, and mystery as well, we would have more detailed and more useful data, but price differences could be more subjective and harder to determine. Nonfiction vs. fiction is a relatively easy split to make, but we do lose a lot of granular detail in price trends.
There are multiple duplicate rows. For example, the book Where the Crawdads Sing is in our data set five times: it made the top 100 books across four years, and it is counted twice in 2019 because its paperback and hardcover editions are listed as two different books.
One book with 88 reviews has the same ranking as one with 18,656 reviews, which we should address in the future. Our original research question fails to capture this.
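The duplicate-row issue noted above could be checked with a short sketch like the following. It is an illustration only: the field names `title`, `year`, and `cover_type` are assumptions, and the toy rows mirror the 2019 example from the text plus one made-up title.

```python
from collections import Counter


def find_repeated_titles(rows):
    """Return {(title, year): count} for title-year pairs that occur more than once."""
    counts = Counter((row["title"], row["year"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Toy rows for illustration only.
rows = [
    {"title": "Where the Crawdads Sing", "year": 2019, "cover_type": "Hardcover"},
    {"title": "Where the Crawdads Sing", "year": 2019, "cover_type": "Paperback"},
    {"title": "Another Title", "year": 2019, "cover_type": "Paperback"},
]
repeated = find_repeated_titles(rows)
```

Whether such rows should be merged or kept depends on the question being asked, since the two editions can have different prices and cover types.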
7. Acknowledgments
Adavize, Abdulhamid & Promise, Chisom. (2021, Dec 30). TOP 100 BEST SELLING BOOKS ON AMAZON 2009-2021. Version 2. Retrieved April 26, 2023, from https://www.kaggle.com/datasets/abdulhamidadavize/top-100-best-selling-books-on-amazon-20092021.
Rainie, L. (2020, May 30). The rise of e-reading. Retrieved April 26, 2023, from https://www.pewresearch.org/internet/2012/04/04/the-rise-of-e-reading-5/