Data cleaning
Raw Data Frame: The original data frame as loaded, before any cleaning
# A tibble: 1,291 × 10
id price ranks title no_of_reviews ratings author cover_type year genre
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 0 12.5 1 The Lo… 16118 4.4 Dan B… Hardcover 2009 Fict…
2 1 13.4 2 The Sh… 23392 4.7 Willi… Paperback 2009 Fict…
3 2 9.93 3 Libert… 5036 4.8 Mark … Hardcover 2009 Non …
4 3 14.3 4 Breaki… 16912 4.7 Steph… Hardcover 2009 Fict…
5 4 9.99 5 Going … 1572 4.6 Sarah… Hardcover 2009 Non …
6 5 18.3 6 Streng… 7082 4.1 Gallup Hardcover 2009 Non …
7 6 12.7 7 The He… 18068 4.8 Kathr… Hardcover 2009 Fict…
8 7 17.6 8 New Mo… 12329 4.7 Steph… Paperback 2009 Fict…
9 8 58.9 9 The Tw… 6100 4.7 Steph… Hardcover 2009 Fict…
10 9 16.0 10 Outlie… 22209 4.7 Malco… Hardcover 2009 Non …
# ℹ 1,281 more rows
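The id column in the preview above looks like an exported row index. A minimal sketch of how such a load could work, assuming the source is a CSV whose first header cell is empty: readr names that column `...1`, and it can then be renamed. The inline CSV text and the object name books_raw are hypothetical stand-ins, since the real file path is not shown.

```r
library(readr)
library(dplyr)

# Hypothetical CSV text standing in for the real file (path not shown in the report).
# The header starts with an empty name, as an exported row index would.
csv_text <- ",price,ranks,title
0,12.5,1,The Lost Symbol
1,13.4,2,The Shack: Where Tragedy Confronts Eternity
2,9.93,3,Liberty and Tyranny: A Conservative Manifesto
"

# I() tells readr to treat the string as literal data rather than a file path.
# readr repairs the empty header name to `...1`, which is renamed to id here.
books_raw <- read_csv(I(csv_text), show_col_types = FALSE) %>%
  rename(id = `...1`)

print(books_raw)
```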
Narrowing Down Variables: Selecting the columns price through genre, which drops the id column
# A tibble: 1,291 × 9
price ranks title no_of_reviews ratings author cover_type year genre
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <chr>
1 12.5 1 The Lost Sym… 16118 4.4 Dan B… Hardcover 2009 Fict…
2 13.4 2 The Shack: W… 23392 4.7 Willi… Paperback 2009 Fict…
3 9.93 3 Liberty and … 5036 4.8 Mark … Hardcover 2009 Non …
4 14.3 4 Breaking Daw… 16912 4.7 Steph… Hardcover 2009 Fict…
5 9.99 5 Going Rogue:… 1572 4.6 Sarah… Hardcover 2009 Non …
6 18.3 6 StrengthsFin… 7082 4.1 Gallup Hardcover 2009 Non …
7 12.7 7 The Help 18068 4.8 Kathr… Hardcover 2009 Fict…
8 17.6 8 New Moon (Th… 12329 4.7 Steph… Paperback 2009 Fict…
9 58.9 9 The Twilight… 6100 4.7 Steph… Hardcover 2009 Fict…
10 16.0 10 Outliers: Th… 22209 4.7 Malco… Hardcover 2009 Non …
# ℹ 1,281 more rows
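The column selection above can be sketched with a range inside dplyr's select(). The three-row data frame is a stand-in built from the printed rows. The genre spellings "Fiction" and "Non Fiction" are assumptions, because the printed values are truncated, and the object names are hypothetical.

```r
library(dplyr)
library(tibble)

# Stand-in for the raw data, using the first three printed rows.
# The genre spellings are assumed; the printed values are truncated.
books_raw <- tibble(
  id = c(0, 1, 2),
  price = c(12.5, 13.4, 9.93),
  ranks = c(1, 2, 3),
  title = c(
    "The Lost Symbol",
    "The Shack: Where Tragedy Confronts Eternity",
    "Liberty and Tyranny: A Conservative Manifesto"
  ),
  no_of_reviews = c(16118, 23392, 5036),
  ratings = c(4.4, 4.7, 4.8),
  author = c("Dan Brown", "William P. Young", "Mark R. Levin"),
  cover_type = c("Hardcover", "Paperback", "Hardcover"),
  year = c(2009, 2009, 2009),
  genre = c("Fiction", "Fiction", "Non Fiction")
)

# price:genre keeps every column from price to genre in their current order,
# so id (which sits before price) is dropped.
books <- books_raw %>%
  select(price:genre)

print(names(books))
```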
Additional Data Cleaning: Adding an indicator, pre_2020, that equals 1 when a book's year is before 2020
# A tibble: 1,291 × 3
title pre_2020 year
<chr> <dbl> <dbl>
1 The Lost Symbol 1 2009
2 The Shack: Where Tragedy Confronts Eternity 1 2009
3 Liberty and Tyranny: A Conservative Manifesto 1 2009
4 Breaking Dawn (The Twilight Saga, Book 4) 1 2009
5 Going Rogue: An American Life 1 2009
6 StrengthsFinder 2.0 1 2009
7 The Help 1 2009
8 New Moon (The Twilight Saga) 1 2009
9 The Twilight Saga Collection 1 2009
10 Outliers: The Story of Success 1 2009
# ℹ 1,281 more rows
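One way to build such an indicator is with mutate() and if_else(). This is a sketch under the assumption that years from 2020 onward are coded 0, which the printed rows (all from 2009) do not confirm. The object names are hypothetical.

```r
library(dplyr)
library(tibble)

# Small stand-in using titles and years that appear in the report's output.
books <- tibble(
  title = c("The Lost Symbol", "Where the Crawdads Sing", "Where the Crawdads Sing"),
  year = c(2009, 2020, 2021)
)

# pre_2020 is 1 for years before 2020 and (by assumption) 0 otherwise.
books_flagged <- books %>%
  mutate(pre_2020 = if_else(year < 2020, 1, 0)) %>%
  select(title, pre_2020, year)

print(books_flagged)
```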
Mutate Variables: Creating the variables needed for analysis, including author popularity and a fiction indicator
# A tibble: 1,167 × 4
# Groups: author [472]
author title fiction popularity
<chr> <chr> <fct> <int>
1 Dan Brown The Lost Symbol yes 3
2 William P. Young The Shack: Where Tragedy Confronts Etern… yes 3
3 Mark R. Levin Liberty and Tyranny: A Conservative Mani… no 4
4 Stephenie Meyer Breaking Dawn (The Twilight Saga, Book 4) yes 10
5 Sarah Palin Going Rogue: An American Life no 1
6 Gallup StrengthsFinder 2.0 no 13
7 Kathryn Stockett The Help yes 5
8 Stephenie Meyer New Moon (The Twilight Saga) yes 10
9 Stephenie Meyer The Twilight Saga Collection yes 10
10 Malcolm Gladwell Outliers: The Story of Success no 12
# ℹ 1,157 more rows
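The two derived variables can be sketched as follows, assuming popularity is the number of rows an author has in the data and fiction is derived from genre. Neither definition is stated in the report. The genre spellings are assumed, and because this stand-in has only a few rows, its counts are smaller than those printed above. Whatever step reduced the data to 1,167 rows is not reproduced here.

```r
library(dplyr)
library(tibble)

# Stand-in rows taken from the printed output; genre spellings are assumed.
books <- tibble(
  author = c("Dan Brown", "Stephenie Meyer", "Stephenie Meyer",
             "Stephenie Meyer", "Gallup"),
  title = c("The Lost Symbol",
            "Breaking Dawn (The Twilight Saga, Book 4)",
            "New Moon (The Twilight Saga)",
            "The Twilight Saga Collection",
            "StrengthsFinder 2.0"),
  genre = c("Fiction", "Fiction", "Fiction", "Fiction", "Non Fiction")
)

books_features <- books %>%
  # fiction: "yes" when genre is Fiction, "no" otherwise, stored as a factor
  mutate(fiction = factor(if_else(genre == "Fiction", "yes", "no"),
                          levels = c("no", "yes"))) %>%
  # popularity: number of rows for each author (assumed definition)
  group_by(author) %>%
  mutate(popularity = n()) %>%
  # author is listed explicitly; if it were omitted, dplyr would add the
  # grouping column back and print a message saying so
  select(author, title, fiction, popularity)

print(books_features)
```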
Other appendices (as necessary)
- Visualization for difference in price by rating and cover type:
- Taking the 5 book titles with the highest number of reviews and using them for ranking:
# A tibble: 5 × 2
title reviews
<chr> <dbl>
1 Where the Crawdads Sing 344811
2 The Midnight Library: A Novel 199570
3 It Ends with Us: A Novel (1) 169014
4 Verity 163818
5 The Silent Patient 135163
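A sketch of one way to get such a top-5 table: collapse repeated titles to a single review count, then keep the five largest with slice_max(). Taking the maximum per title is an assumption, as is the repeated Where the Crawdads Sing row, which is included only to show the collapsing step. The review counts come from the printed output.

```r
library(dplyr)
library(tibble)

# Stand-in data: review counts from the printed output, plus one repeated
# row (hypothetical) to illustrate why titles are collapsed first.
books <- tibble(
  title = c("Where the Crawdads Sing", "Where the Crawdads Sing",
            "The Midnight Library: A Novel", "It Ends with Us: A Novel (1)",
            "Verity", "The Silent Patient",
            "The Shack: Where Tragedy Confronts Eternity", "The Lost Symbol"),
  no_of_reviews = c(344811, 344811, 199570, 169014, 163818, 135163, 23392, 16118)
)

top5 <- books %>%
  group_by(title) %>%
  # one row per title; max() is an assumed way to resolve repeated titles
  summarise(reviews = max(no_of_reviews)) %>%
  # keep the five titles with the most reviews, largest first
  slice_max(reviews, n = 5)

print(top5)
```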
- Examining the most consistently highly ranked books across multiple years:
# A tibble: 436 × 8
title count mean_ranking year_start year_end mean_price mean_rating
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 What to Expect… 8 63.8 2012 2021 15.4 4.8
2 Love You Forev… 7 71.1 2013 2021 4.98 4.9
3 The Care and K… 7 75.4 2013 2019 9.48 4.8
4 The Gifts of I… 7 65.4 2013 2019 6.92 4.7
5 Dragons Love T… 6 84.8 2015 2021 8.95 4.8
6 The Alchemist,… 6 64.8 2015 2020 13.3 4.7
7 The Great Gats… 6 66 2011 2019 6.79 4.5
8 Fahrenheit 451 5 67.2 2014 2019 8.29 4.6
9 The Outsiders 5 86.4 2014 2021 9.49 4.8
10 Harry Potter P… 4 87.8 2011 2021 38.9 4.9
# ℹ 426 more rows
# ℹ 1 more variable: mean_reviews <dbl>
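A per-title summary with these columns can be sketched with group_by() and summarise(). The toy data below is entirely hypothetical (titles Book A to Book C). Counting rows per title and keeping only titles that appear more than once are both assumptions about how the table was built.

```r
library(dplyr)
library(tibble)

# Entirely hypothetical toy data, used only to show the shape of the summary.
books <- tibble(
  title = c("Book A", "Book A", "Book A", "Book B", "Book B", "Book C"),
  year = c(2015, 2016, 2018, 2019, 2020, 2021),
  ranks = c(10, 20, 30, 5, 15, 50),
  price = c(10, 12, 14, 8, 8, 20),
  ratings = c(4.8, 4.8, 4.8, 4.5, 4.7, 4.0),
  no_of_reviews = c(100, 200, 300, 50, 150, 10)
)

consistent <- books %>%
  group_by(title) %>%
  summarise(
    count = n(),                 # number of appearances (assumed definition)
    mean_ranking = mean(ranks),
    year_start = min(year),
    year_end = max(year),
    mean_price = mean(price),
    mean_rating = mean(ratings),
    mean_reviews = mean(no_of_reviews)
  ) %>%
  filter(count > 1) %>%          # assumed: keep titles that appear more than once
  arrange(desc(count))

print(consistent)
```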
- Addressing the issue of duplicate books: Where the Crawdads Sing appears in the data set multiple times because it has multiple editions:
# A tibble: 5 × 3
# Groups: year [4]
title year price
<chr> <dbl> <dbl>
1 Where the Crawdads Sing 2018 12.4
2 Where the Crawdads Sing 2019 12.4
3 Where the Crawdads Sing 2019 14.0
4 Where the Crawdads Sing 2020 12.4
5 Where the Crawdads Sing 2021 9.98
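The duplicate check can be sketched as below, using the five printed rows plus one other title for contrast. Counting rows per title and year finds the repeated entries. Keeping the cheapest edition per title and year is only one possible resolution and is an assumption here, not the report's stated method.

```r
library(dplyr)
library(tibble)

# The five printed rows for this title, plus one other book for contrast.
books <- tibble(
  title = c(rep("Where the Crawdads Sing", 5), "The Lost Symbol"),
  year = c(2018, 2019, 2019, 2020, 2021, 2009),
  price = c(12.4, 12.4, 14.0, 12.4, 9.98, 12.5)
)

# All rows for the title in question, ordered by year.
crawdads <- books %>%
  filter(title == "Where the Crawdads Sing") %>%
  select(title, year, price) %>%
  arrange(year)

# Title-year combinations that occur more than once.
dupes <- books %>%
  count(title, year, name = "n_editions") %>%
  filter(n_editions > 1)

# Assumed resolution: keep the cheapest edition within each title and year.
one_per_year <- books %>%
  group_by(title, year) %>%
  slice_min(price, n = 1, with_ties = FALSE) %>%
  ungroup()

print(dupes)
print(one_per_year)
```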
- Relationship Between Book Price and Other Variables: