Project Fabulous Buneary

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    • The data comes from Kaggle. Link: https://www.kaggle.com/datasets/ulrikthygepedersen/shark-tank-companies?resource=download
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was published on March 2, 2023, but may have been collected earlier. The author does not say how the data was collected. They may have looked up the individual companies, watched the show itself, or aggregated data from other sources. The dataset contains only factual information that can be verified through online research.
  • Write a brief description of the observations.

    • Each observation is a company that appeared on the TV show Shark Tank in seasons 1 through 6, inclusive. The dataset includes both companies that received a deal and companies that did not. The columns describe the company's appearance on the show: the company name, whether it reached a deal, the amount asked for, the percent stake offered in exchange, the resulting valuation, the season and episode of the appearance, and so on.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • How did the type of product relate to the shark(s) who invested in the product; were certain types of products more likely to attract some sharks over others?

    • How is valuation related to success rate; is the valuation of a company an indicator of the likelihood of reaching a deal with a shark?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The topic of interest concerns the success of beginner entrepreneurs in marketing their businesses to potential investors. By studying this data from Shark Tank, perhaps some insight can be gained into how to attract investors to products. Our hypotheses for the above two research questions are:

      • Sharks will be more likely to invest in products that they have prior experience with, or for which they have connections they can leverage to make the product more likely to succeed.

        • A Shark’s background can easily be determined via outside research.
      • Companies with higher valuations will tend to receive deals more often than companies with lower valuations.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • The variables in the first research question are strictly categorical. The dataset stores the relevant variables under the names category (chr), description (chr), and deal (lgl).

    • The variables in the second research question are both categorical and quantitative. The dataset stores the relevant variables under the names valuation (dbl) and deal (lgl).
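
As a rough sketch of how the two questions above might first be explored, the code below computes the share of pitches that ended in a deal per product category and per valuation quartile. The helper names (deal_rate_by_category, deal_rate_by_valuation) and the toy_shark tibble are hypothetical and exist only for illustration; the toy values are made up and are not taken from the dataset. Only the column names deal, category, and valuation come from the glimpse of the data.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the real `shark` data; all values are made up for illustration.
toy_shark <- tibble(
  deal      = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  category  = c("Specialty Food", "Novelties", "Specialty Food", "Novelties",
                "Toys and Games", "Toys and Games", "Novelties", "Specialty Food"),
  valuation = c(500000, 6000000, 250000, 12000000,
                1000000, 750000, 3000000, 2000000)
)

# Share of pitches that ended in a deal, per product category.
deal_rate_by_category <- function(data) {
  data %>%
    group_by(category) %>%
    summarize(n_pitches = n(), deal_rate = mean(deal), .groups = "drop") %>%
    arrange(desc(deal_rate))
}

# Share of pitches that ended in a deal, per valuation bin (quartiles by default).
deal_rate_by_valuation <- function(data, n_bins = 4) {
  data %>%
    mutate(valuation_bin = ntile(valuation, n_bins)) %>%
    group_by(valuation_bin) %>%
    summarize(
      n_pitches        = n(),
      median_valuation = median(valuation),
      deal_rate        = mean(deal),
      .groups = "drop"
    )
}

by_category  <- deal_rate_by_category(toy_shark)
by_valuation <- deal_rate_by_valuation(toy_shark)
```

With the real data, the same helpers could be called on shark directly. Binning valuation is only one option; a logistic regression of deal on valuation would be another way to look at the second question.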

Glimpse of data

shark <- read_csv("data/shark/shark_tank_companies.csv")
Rows: 495 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): description, category, entrepreneurs, location, website, shark1, s...
dbl  (5): episode, askedfor, exchangeforstake, valuation, season
lgl  (2): deal, multiple_entreprenuers

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(shark)
Rows: 495
Columns: 19
$ deal                   <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, F…
$ description            <chr> "Bluetooth device implant for your ear.", "Reta…
$ episode                <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4,…
$ category               <chr> "Novelties", "Specialty Food", "Baby and Child …
$ entrepreneurs          <chr> "Darrin Johnson", "Tod Wilson", "Tiffany Krumin…
$ location               <chr> "St. Paul, MN", "Somerset, NJ", "Atlanta, GA", …
$ website                <chr> NA, "http://whybake.com/", "http://www.avatheel…
$ askedfor               <dbl> 1000000, 460000, 50000, 250000, 1200000, 500000…
$ exchangeforstake       <dbl> 15, 10, 15, 25, 10, 15, 20, 20, 10, 10, 35, 10,…
$ valuation              <dbl> 6666667, 4600000, 333333, 1000000, 12000000, 33…
$ season                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ shark1                 <chr> "Barbara Corcoran", "Barbara Corcoran", "Barbar…
$ shark2                 <chr> "Robert Herjavec", "Robert Herjavec", "Robert H…
$ shark3                 <chr> "Kevin O'Leary", "Kevin O'Leary", "Kevin O'Lear…
$ shark4                 <chr> "Daymond John", "Daymond John", "Daymond John",…
$ shark5                 <chr> "Kevin Harrington", "Kevin Harrington", "Kevin …
$ title                  <chr> "Ionic Ear", "Mr. Tod's Pie Factory", "Ava the …
$ episode_season         <chr> "1-1", "1-1", "1-1", "1-1", "1-1", "1-2", "1-2"…
$ multiple_entreprenuers <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
head(shark)
# A tibble: 6 × 19
  deal  description     episode category entrepreneurs location website askedfor
  <lgl> <chr>             <dbl> <chr>    <chr>         <chr>    <chr>      <dbl>
1 FALSE Bluetooth devi…       1 Novelti… Darrin Johns… St. Pau… <NA>     1000000
2 TRUE  Retail and who…       1 Special… Tod Wilson    Somerse… http:/…   460000
3 TRUE  Ava the Elepha…       1 Baby an… Tiffany Krum… Atlanta… http:/…    50000
4 FALSE Organizing, pa…       1 Consume… Nick Friedma… Tampa, … http:/…   250000
5 FALSE Interactive me…       1 Consume… Kevin Flanne… Cary, NC http:/…  1200000
6 TRUE  One of the fir…       2 Special… Susan Knapp   Napa Va… http:/…   500000
# ℹ 11 more variables: exchangeforstake <dbl>, valuation <dbl>, season <dbl>,
#   shark1 <chr>, shark2 <chr>, shark3 <chr>, shark4 <chr>, shark5 <chr>,
#   title <chr>, episode_season <chr>, multiple_entreprenuers <lgl>

Data 2

Introduction and data

  • Identify the source of the data.

    • The data comes from basketball-reference.com. Link: https://www.basketball-reference.com/friv/buzzer-beaters.html
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • This data was published in February 2020 and has been updated with every NBA/BAA game since then. The original data curator defines a “buzzer beater shot” as “successful shots taken with the shooter’s team tied or trailing which left no time on the clock after going through the net.” They used a team stats finder (https://stathead.com/basketball/team-game-finder.cgi) to collect this data.
  • Write a brief description of the observations.

    • Each observation represents a game that a team won with a buzzer-beater shot. The columns record whether the shot was assisted, which player took it, the distance from the rim at which it was taken, and the shooter's box-score statistics for that game.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • How have the frequency and success rate of buzzer-beater shots changed over time in professional basketball, and what factors may have contributed to these trends?

    • What is the overall success rate of buzzer-beater shots in NBA history, and how does this rate vary across different eras, teams, and players?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • The study focuses on buzzer-beater shots in the NBA, which are shots taken in the final seconds of a game with the goal of scoring before the game clock runs out. These shots can have a big impact on the outcome of a game and are thrilling for both players and fans.

    • Hypotheses:

      • Players who attempt buzzer-beater shots from a closer distance to the basket will be more successful than those who attempt shots from a greater distance.

      • Players who have attempted more buzzer-beater shots in their careers are more likely to succeed than those who have attempted fewer shots.

      • Teams with a higher overall field goal percentage during the game will be more successful with buzzer-beater shots.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Team and player names are categorical variables that can be used to group and compare data. Time can also be treated as a categorical variable when games are grouped by decade or NBA season.

    • The success rate of buzzer-beater shots is a quantitative variable: the number of made buzzer-beater shots as a percentage of those attempted. Because the dataset does not include a success rate variable, it would have to be calculated from the other data available.
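
The grouping by decade described above could be prepared roughly as follows. This is a minimal sketch under several assumptions: the helper name clean_buzzer and the toy_buzzer tibble are hypothetical, the toy rows are made up, "At Rim" is treated as a distance of 0, the numeric distances are assumed to be in feet, and a trailing "(2)" after a player name is assumed to be a counter rather than part of the name. Because the source defines buzzer beaters as successful shots, the counts below are of made shots per decade, not a success rate.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy stand-in for `buzzer_beaters`; rows are made up, formats follow the glimpse.
toy_buzzer <- tibble(
  Player   = c("Player A", "Player B  (2)", "Player C", "Player D"),
  Game     = c("Mar 7 2023", "Feb 26 2023", "Dec 31 2012", "Jan 4 1998"),
  Distance = c("At Rim", "12", "25", "At Rim")
)

# Parse dates, derive the decade, and turn Distance into a number.
clean_buzzer <- function(data) {
  data %>%
    mutate(
      # Assumption: a trailing "(n)" is a counter, not part of the name.
      player_name = sub("\\s*\\(\\d+\\)$", "", Player),
      game_date   = mdy(Game),
      decade      = floor(year(game_date) / 10) * 10,
      # Assumption: "At Rim" means a distance of 0; other values are numeric.
      distance_ft = if_else(
        Distance == "At Rim",
        0,
        suppressWarnings(as.numeric(Distance))
      )
    )
}

cleaned    <- clean_buzzer(toy_buzzer)

# Number of made buzzer beaters recorded in each decade.
per_decade <- cleaned %>% count(decade, name = "n_buzzer_beaters")
```

The same steps could be applied to buzzer_beaters once it is read in. The month abbreviations in Game are parsed by mdy() using the session locale, so an English locale is assumed.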

Glimpse of data

buzzer_beaters <- read.csv("data/buzzerBeater/game-winning-buzzer-beaters.csv")
glimpse(buzzer_beaters)
Rows: 822
Columns: 24
$ Player   <chr> "Daniel Gafford", "Trae Young  (2)", "Wendell Carter Jr.", "S…
$ Game     <chr> "Mar 7 2023", "Feb 26 2023", "Feb 23 2023", "Jan 4 2023", "Ja…
$ Team     <chr> "WAS", "ATL", "ORL", "DET", "GSW", "MIA", "CHI", "OKC", "BRK"…
$ Opp      <chr> "DET", "BRK", "DET", "GSW", "ATL", "UTA", "ATL", "POR", "TOR"…
$ Margin   <chr> "tied", "tied", "tied", "tied", "tied", "tied", "tied", "tied…
$ Type     <chr> "2-pt FG", "2-pt FG", "2-pt FG", "3-pt FG", "2-pt FG", "3-pt …
$ Assisted <chr> "unassisted", "unassisted", "unassisted", "K. Hayes", "unassi…
$ Distance <chr> "At Rim", "12", "At Rim", "28", "At Rim", "25", "At Rim", "14…
$ MP       <chr> "24:38:00", "33:48:00", "31:46:00", "28:46:00", "32:25:00", "…
$ PTS      <int> 8, 34, 14, 17, 14, 29, 9, 35, 32, 17, 12, 17, 12, 37, 30, 31,…
$ FG       <int> 4, 12, 5, 6, 4, 10, 4, 10, 13, 7, 4, 8, 3, 14, 10, 9, 6, 11, …
$ FGA      <int> 5, 26, 13, 17, 8, 20, 6, 24, 22, 17, 8, 15, 5, 24, 17, 18, 14…
$ FG.      <dbl> 0.800, 0.462, 0.385, 0.353, 0.500, 0.500, 0.667, 0.417, 0.591…
$ X3P      <int> 0, 1, 0, 4, 0, 3, 1, 1, 3, 2, 4, 1, 1, 2, 1, 3, 2, 9, 4, 2, 4…
$ X3PA     <int> 0, 5, 3, 10, 0, 11, 3, 1, 9, 11, 7, 6, 3, 7, 3, 7, 6, 12, 5, …
$ X3P.     <dbl> NA, 0.200, 0.000, 0.400, NA, 0.273, 0.333, 1.000, 0.333, 0.18…
$ FT       <int> 0, 9, 4, 1, 6, 6, 0, 14, 3, 1, 0, 0, 5, 7, 9, 10, 8, 7, 2, 3,…
$ FTA      <int> 0, 9, 6, 1, 8, 7, 0, 14, 3, 1, 0, 0, 7, 7, 11, 12, 8, 8, 2, 3…
$ FT.      <dbl> NA, 1.000, 0.667, 1.000, 0.750, 0.857, NA, 1.000, 1.000, 1.00…
$ TRB      <int> 7, 3, 14, 3, 20, 9, 3, 2, 3, 2, 4, 5, 9, 5, 2, 4, 4, 2, 8, 4,…
$ AST      <int> 1, 8, 2, 1, 4, 6, 2, 6, 5, 2, 1, 1, 8, 3, 5, 8, 3, 0, 3, 6, 1…
$ STL      <int> 1, 2, 1, 2, 0, 2, 2, 1, 0, 0, 1, 1, 2, 1, 0, 1, 3, 1, 2, 0, 1…
$ BLK      <int> 1, 0, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 1…
$ GmSc     <dbl> 7.8, 24.7, 14.3, 11.1, 21.1, 20.9, 9.3, 26.6, 24.0, 8.8, 9.1,…
head(buzzer_beaters)
              Player        Game Team Opp Margin    Type   Assisted Distance
1     Daniel Gafford  Mar 7 2023  WAS DET   tied 2-pt FG unassisted   At Rim
2    Trae Young  (2) Feb 26 2023  ATL BRK   tied 2-pt FG unassisted       12
3 Wendell Carter Jr. Feb 23 2023  ORL DET   tied 2-pt FG unassisted   At Rim
4         Saddiq Bey  Jan 4 2023  DET GSW   tied 3-pt FG   K. Hayes       28
5       Kevon Looney  Jan 2 2023  GSW ATL   tied 2-pt FG unassisted   At Rim
6        Tyler Herro Dec 31 2022  MIA UTA   tied 3-pt FG unassisted       25
        MP PTS FG FGA   FG. X3P X3PA  X3P. FT FTA   FT. TRB AST STL BLK GmSc
1 24:38:00   8  4   5 0.800   0    0    NA  0   0    NA   7   1   1   1  7.8
2 33:48:00  34 12  26 0.462   1    5 0.200  9   9 1.000   3   8   2   0 24.7
3 31:46:00  14  5  13 0.385   0    3 0.000  4   6 0.667  14   2   1   2 14.3
4 28:46:00  17  6  17 0.353   4   10 0.400  1   1 1.000   3   1   2   0 11.1
5 32:25:00  14  4   8 0.500   0    0    NA  6   8 0.750  20   4   0   1 21.1
6 34:01:00  29 10  20 0.500   3   11 0.273  6   7 0.857   9   6   2   0 20.9

Data 3

Introduction and data

  • Identify the source of the data.

    • The data comes from Kaggle. Link: https://www.kaggle.com/datasets/thedevastator/movie-gross-and-ratings-from-1989-to-2014

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • This data was originally collected by Yashwanth Sharaff and published at https://data.world/yasharaff/movies-gross. It was collected and uploaded on September 6, 2021.

  • Write a brief description of the observations.

    • Each observation describes one movie: its MPAA rating, budget, gross, genre, runtime, and release date, which together give a picture of how the movie performed on release. Because the dataset contains the top 20 movies of each year from 1989 to 2014, the observations also show how these features have changed over time.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • How have movies' MPAA ratings and runtimes changed over time relative to their budgets, and what factors may have contributed to these trends?

      • This is important because it can provide insight into how movie runtimes have changed over time and how budgets vary with MPAA rating.
    • How is release date related to gross revenue; does the release date of a movie make it more or less likely to make money? Has this changed over time?

      • This question can provide insight into the best time of year to release a movie of a certain genre, and into how to maximize a movie's chances of success.
  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • This dataset shows how factors such as release date, budget, runtime, and MPAA rating relate to a movie's gross revenue and overall popularity. This research is important because it can show movie studios which modern-day films are succeeding, and perhaps help explain what contributes to that success.

    • Hypotheses for the two questions:

      • Runtimes of movies have gotten shorter in recent years, yet movies have gotten more expensive to produce.

      • More movies with mature MPAA ratings have been made in recent years, and with larger budgets, than in prior decades.

      • Certain genres of movies do better at different times of year, and have larger gross revenues when released then.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Title, MPAA rating, release date, and genre are all categorical variables.

    • Budget, gross, runtime, and rating count are all quantitative variables.
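
To make the release-date question more concrete, the sketch below derives a release year, a release month, and a gross-to-budget ratio, then summarizes median gross by month. The helper names (add_release_features, median_gross_by_month) and the toy_movies tibble are hypothetical, and the toy values are made up; only the column names Release Date, Budget, Gross, and Runtime are taken from the glimpse of the data.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Toy stand-in for `movies`; all values are made up for illustration.
toy_movies <- tibble(
  Title          = c("Movie A", "Movie B", "Movie C", "Movie D"),
  `Release Date` = as.Date(c("1989-07-28", "1989-12-13",
                             "2014-07-04", "2014-12-19")),
  Budget         = c(10e6, 20e6, 100e6, 50e6),
  Gross          = c(50e6, 40e6, 300e6, 200e6),
  Runtime        = c(100, 99, 130, 120)
)

# Derive year, month (1-12), and gross relative to budget for each movie.
add_release_features <- function(data) {
  data %>%
    mutate(
      release_year    = year(`Release Date`),
      release_month   = month(`Release Date`),
      gross_to_budget = Gross / Budget
    )
}

# Median gross revenue for movies released in each calendar month.
median_gross_by_month <- function(data) {
  data %>%
    group_by(release_month) %>%
    summarize(n_movies = n(), median_gross = median(Gross), .groups = "drop")
}

featured <- add_release_features(toy_movies)
by_month <- median_gross_by_month(featured)
```

Treating release month as a grouping variable is one way to use the date categorically; release_year could be grouped the same way to compare runtimes or budgets across years.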

Glimpse of data

movies <- read_csv("data/movies/Movies_gross_rating.csv")
Rows: 510 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Title, MPAA Rating, Genre
dbl  (7): index, MovieID, Budget, Gross, Runtime, Rating, Rating Count
date (1): Release Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(movies)
Rows: 510
Columns: 11
$ index          <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ MovieID        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Title          <chr> "Look Who's Talking", "Driving Miss Daisy", "Turner & H…
$ `MPAA Rating`  <chr> "PG-13", "PG", "PG", "R", "PG", "PG", "R", "PG", "PG-13…
$ Budget         <dbl> 7500000, 7500000, 13000000, 14000000, 15000000, 1500000…
$ Gross          <dbl> 296000000, 145793296, 71079915, 161001698, 84431625, 79…
$ `Release Date` <date> 1989-10-12, 1989-12-13, 1989-07-28, 1989-12-20, 1989-0…
$ Genre          <chr> "Romance", "Comedy", "Crime", "War", "Drama", "Family",…
$ Runtime        <dbl> 93, 99, 100, 145, 107, 100, 96, 129, 124, 114, 116, 97,…
$ Rating         <dbl> 5.9, 7.4, 7.2, 7.2, 7.5, 7.0, 7.6, 8.1, 7.0, 7.2, 6.8, …
$ `Rating Count` <dbl> 73638, 91075, 91415, 91415, 101702, 77659, 180871, 3820…
head(movies)
# A tibble: 6 × 11
  index MovieID Title   `MPAA Rating` Budget  Gross `Release Date` Genre Runtime
  <dbl>   <dbl> <chr>   <chr>          <dbl>  <dbl> <date>         <chr>   <dbl>
1     0       1 Look W… PG-13          7.5e6 2.96e8 1989-10-12     Roma…      93
2     1       2 Drivin… PG             7.5e6 1.46e8 1989-12-13     Come…      99
3     2       3 Turner… PG             1.3e7 7.11e7 1989-07-28     Crime     100
4     3       4 Born o… R              1.4e7 1.61e8 1989-12-20     War       145
5     4       5 Field … PG             1.5e7 8.44e7 1989-04-21     Drama     107
6     5       6 Uncle … PG             1.5e7 7.93e7 1989-08-16     Fami…     100
# ℹ 2 more variables: Rating <dbl>, `Rating Count` <dbl>