Project Elegant-Raichu

Appendix to report

Data cleaning

Raw Data Frame: Loaded original data frame

# A tibble: 1,291 × 10
      id price ranks title   no_of_reviews ratings author cover_type  year genre
   <dbl> <dbl> <dbl> <chr>           <dbl>   <dbl> <chr>  <chr>      <dbl> <chr>
 1     0 12.5      1 The Lo…         16118     4.4 Dan B… Hardcover   2009 Fict…
 2     1 13.4      2 The Sh…         23392     4.7 Willi… Paperback   2009 Fict…
 3     2  9.93     3 Libert…          5036     4.8 Mark … Hardcover   2009 Non …
 4     3 14.3      4 Breaki…         16912     4.7 Steph… Hardcover   2009 Fict…
 5     4  9.99     5 Going …          1572     4.6 Sarah… Hardcover   2009 Non …
 6     5 18.3      6 Streng…          7082     4.1 Gallup Hardcover   2009 Non …
 7     6 12.7      7 The He…         18068     4.8 Kathr… Hardcover   2009 Fict…
 8     7 17.6      8 New Mo…         12329     4.7 Steph… Paperback   2009 Fict…
 9     8 58.9      9 The Tw…          6100     4.7 Steph… Hardcover   2009 Fict…
10     9 16.0     10 Outlie…         22209     4.7 Malco… Hardcover   2009 Non …
# ℹ 1,281 more rows

Narrowing Down Variables: Selecting Columns Price through Genre

# A tibble: 1,291 × 9
   price ranks title         no_of_reviews ratings author cover_type  year genre
   <dbl> <dbl> <chr>                 <dbl>   <dbl> <chr>  <chr>      <dbl> <chr>
 1 12.5      1 The Lost Sym…         16118     4.4 Dan B… Hardcover   2009 Fict…
 2 13.4      2 The Shack: W…         23392     4.7 Willi… Paperback   2009 Fict…
 3  9.93     3 Liberty and …          5036     4.8 Mark … Hardcover   2009 Non …
 4 14.3      4 Breaking Daw…         16912     4.7 Steph… Hardcover   2009 Fict…
 5  9.99     5 Going Rogue:…          1572     4.6 Sarah… Hardcover   2009 Non …
 6 18.3      6 StrengthsFin…          7082     4.1 Gallup Hardcover   2009 Non …
 7 12.7      7 The Help              18068     4.8 Kathr… Hardcover   2009 Fict…
 8 17.6      8 New Moon (Th…         12329     4.7 Steph… Paperback   2009 Fict…
 9 58.9      9 The Twilight…          6100     4.7 Steph… Hardcover   2009 Fict…
10 16.0     10 Outliers: Th…         22209     4.7 Malco… Hardcover   2009 Non …
# ℹ 1,281 more rows
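
The column selection above can be sketched as follows. This is a minimal illustration, not the report's actual code: the object names `books_raw` and `books` are hypothetical, the toy rows copy the first two rows printed above, and the `genre` label is assumed because the printout truncates it.

```r
library(dplyr)

# Toy stand-in for the raw data frame (hypothetical name `books_raw`).
# The genre value "Fiction" is an assumption; the printout shows only "Fict…".
books_raw <- tibble(
  id = c(0, 1),
  price = c(12.5, 13.4),
  ranks = c(1, 2),
  title = c("The Lost Symbol", "The Shack: Where Tragedy Confronts Eternity"),
  no_of_reviews = c(16118, 23392),
  ratings = c(4.4, 4.7),
  author = c("Dan Brown", "William P. Young"),
  cover_type = c("Hardcover", "Paperback"),
  year = c(2009, 2009),
  genre = c("Fiction", "Fiction")
)

# Keep the contiguous run of columns from price through genre, which drops id.
books <- books_raw %>%
  select(price:genre)
```

Selecting a range with `price:genre` depends on column order, so it would need revisiting if the columns were ever rearranged.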

Additional Data Cleaning: Flagging each book's release year with an indicator, pre_2020, for whether it falls before 2020

# A tibble: 1,291 × 3
   title                                         pre_2020  year
   <chr>                                            <dbl> <dbl>
 1 The Lost Symbol                                      1  2009
 2 The Shack: Where Tragedy Confronts Eternity          1  2009
 3 Liberty and Tyranny: A Conservative Manifesto        1  2009
 4 Breaking Dawn (The Twilight Saga, Book 4)            1  2009
 5 Going Rogue: An American Life                        1  2009
 6 StrengthsFinder 2.0                                  1  2009
 7 The Help                                             1  2009
 8 New Moon (The Twilight Saga)                         1  2009
 9 The Twilight Saga Collection                         1  2009
10 Outliers: The Story of Success                       1  2009
# ℹ 1,281 more rows
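
One way to build such an indicator is sketched below. The coding of 0 for years from 2020 onward is an assumption, since only rows with `pre_2020` equal to 1 are printed above, and the object names are hypothetical.

```r
library(dplyr)

# Toy rows: titles and years as they appear elsewhere in this appendix.
books <- tibble(
  title = c("The Lost Symbol", "Where the Crawdads Sing"),
  year = c(2009, 2021)
)

# 1 when the year is before 2020, otherwise 0 (the 0 coding is assumed).
books_flagged <- books %>%
  mutate(pre_2020 = if_else(year < 2020, 1, 0)) %>%
  select(title, pre_2020, year)
```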

Mutate Variables: Creating the variables needed for analysis, including author popularity (popularity) and a fiction indicator (fiction)

# A tibble: 1,167 × 4
# Groups:   author [472]
   author           title                                     fiction popularity
   <chr>            <chr>                                     <fct>        <int>
 1 Dan Brown        The Lost Symbol                           yes              3
 2 William P. Young The Shack: Where Tragedy Confronts Etern… yes              3
 3 Mark R. Levin    Liberty and Tyranny: A Conservative Mani… no               4
 4 Stephenie Meyer  Breaking Dawn (The Twilight Saga, Book 4) yes             10
 5 Sarah Palin      Going Rogue: An American Life             no               1
 6 Gallup           StrengthsFinder 2.0                       no              13
 7 Kathryn Stockett The Help                                  yes              5
 8 Stephenie Meyer  New Moon (The Twilight Saga)              yes             10
 9 Stephenie Meyer  The Twilight Saga Collection              yes             10
10 Malcolm Gladwell Outliers: The Story of Success            no              12
# ℹ 1,157 more rows
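
The sketch below shows one plausible construction of these two variables. It rests on assumptions not confirmed by the output: that `popularity` is the number of rows each author has in the data, and that `fiction` is "yes" when `genre` equals "Fiction". The genre labels are illustrative because the printouts truncate them.

```r
library(dplyr)

# Toy rows; genre labels are assumed, not taken from the data.
books <- tibble(
  author = c("Stephenie Meyer", "Dan Brown", "Stephenie Meyer", "Sarah Palin"),
  title = c(
    "Breaking Dawn (The Twilight Saga, Book 4)",
    "The Lost Symbol",
    "New Moon (The Twilight Saga)",
    "Going Rogue: An American Life"
  ),
  genre = c("Fiction", "Fiction", "Fiction", "Non Fiction")
)

books_mutated <- books %>%
  # Assumed rule: a book is fiction when its genre label is exactly "Fiction".
  mutate(fiction = factor(if_else(genre == "Fiction", "yes", "no"))) %>%
  # Assumed definition: popularity is the author's row count in the data.
  group_by(author) %>%
  mutate(popularity = n()) %>%
  select(author, title, fiction, popularity)
```

Because the result stays grouped by `author`, a later `select()` that omits `author` would add it back automatically.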

Other appendices (as necessary)

  • Visualization for difference in price by rating and cover type:

  • Taking the 5 book titles with the highest number of reviews and using them for ranking:
# A tibble: 5 × 2
  title                         reviews
  <chr>                           <dbl>
1 Where the Crawdads Sing        344811
2 The Midnight Library: A Novel  199570
3 It Ends with Us: A Novel (1)   169014
4 Verity                         163818
5 The Silent Patient             135163
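
A top-5 table like the one above could be produced along these lines. This is a sketch with hypothetical titles and review counts. Collapsing repeated titles with `max()` is an assumption about how duplicates were handled.

```r
library(dplyr)

# Hypothetical toy data; "Book A" appears twice to mimic repeated titles.
books <- tibble(
  title = c("Book A", "Book B", "Book C", "Book A", "Book D", "Book E", "Book F"),
  no_of_reviews = c(900, 100, 700, 900, 500, 800, 600)
)

top_reviewed <- books %>%
  group_by(title) %>%
  # One row per title, keeping its largest review count (assumed rule).
  summarise(reviews = max(no_of_reviews), .groups = "drop") %>%
  # Keep the five titles with the most reviews, in descending order.
  slice_max(reviews, n = 5, with_ties = FALSE)
```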
  • Examining the most consistently highly ranked books across multiple years:
# A tibble: 436 × 8
   title           count mean_ranking year_start year_end mean_price mean_rating
   <chr>           <int>        <dbl>      <dbl>    <dbl>      <dbl>       <dbl>
 1 What to Expect…     8         63.8       2012     2021      15.4          4.8
 2 Love You Forev…     7         71.1       2013     2021       4.98         4.9
 3 The Care and K…     7         75.4       2013     2019       9.48         4.8
 4 The Gifts of I…     7         65.4       2013     2019       6.92         4.7
 5 Dragons Love T…     6         84.8       2015     2021       8.95         4.8
 6 The Alchemist,…     6         64.8       2015     2020      13.3          4.7
 7 The Great Gats…     6         66         2011     2019       6.79         4.5
 8 Fahrenheit 451      5         67.2       2014     2019       8.29         4.6
 9 The Outsiders       5         86.4       2014     2021       9.49         4.8
10 Harry Potter P…     4         87.8       2011     2021      38.9          4.9
# ℹ 426 more rows
# ℹ 1 more variable: mean_reviews <dbl>
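
The per-title summary above is consistent with a grouped `summarise()`. The sketch below reproduces the column names from the printout on hypothetical toy data. Sorting by `count` is inferred from the order of the printed rows, not confirmed.

```r
library(dplyr)

# Hypothetical toy data with the input columns the summary needs.
books <- tibble(
  title = c("Book A", "Book B", "Book A", "Book A", "Book B"),
  ranks = c(60, 10, 70, 80, 30),
  year = c(2012, 2015, 2013, 2014, 2016),
  price = c(15, 9, 16, 14, 11),
  ratings = c(4.8, 4.5, 4.8, 4.8, 4.7),
  no_of_reviews = c(100, 200, 300, 500, 400)
)

consistency <- books %>%
  group_by(title) %>%
  summarise(
    count = n(),                  # number of rows for the title
    mean_ranking = mean(ranks),
    year_start = min(year),       # first year the title appears
    year_end = max(year),         # last year the title appears
    mean_price = mean(price),
    mean_rating = mean(ratings),
    mean_reviews = mean(no_of_reviews),
    .groups = "drop"
  ) %>%
  # Titles with the most rows come first.
  arrange(desc(count))
```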
  • Addressing the issue of duplicate books: Where the Crawdads Sing appears in the data set multiple times because of its multiple editions
# A tibble: 5 × 3
# Groups:   year [4]
  title                    year price
  <chr>                   <dbl> <dbl>
1 Where the Crawdads Sing  2018 12.4 
2 Where the Crawdads Sing  2019 12.4 
3 Where the Crawdads Sing  2019 14.0 
4 Where the Crawdads Sing  2020 12.4 
5 Where the Crawdads Sing  2021  9.98
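
A check for repeated entries like the one above might look like the following sketch. The toy rows reuse the years and rounded prices printed above plus one unrelated book. The names `crawdads` and `repeated_years` are hypothetical.

```r
library(dplyr)

# Toy data: the five rows shown above (prices as printed) and one other title.
books <- tibble(
  title = c(rep("Where the Crawdads Sing", 5), "The Lost Symbol"),
  year = c(2018, 2019, 2019, 2020, 2021, 2009),
  price = c(12.4, 12.4, 14.0, 12.4, 9.98, 12.5)
)

# List every row for the title of interest.
crawdads <- books %>%
  filter(title == "Where the Crawdads Sing") %>%
  select(title, year, price) %>%
  arrange(year, price)

# Years in which the title appears more than once, which may indicate
# several editions listed in the same year.
repeated_years <- crawdads %>%
  count(title, year, name = "n_rows") %>%
  filter(n_rows > 1)
```

Here 2019 is the only year with two rows, which matches the four year groups shown for five rows in the printout.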
  • Relationship Between Book Price and Other Variables: