library(tidyverse)
library(skimr)
Project Fabulous Buneary
Proposal
Data 1
Introduction and data
Identify the source of the data.
- The data comes from Kaggle. Link: https://www.kaggle.com/datasets/ulrikthygepedersen/shark-tank-companies?resource=download
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data was published on March 2, 2023, but may have been collected before that. It is not clear how the author collected the data. They may have either looked up the individual companies in the dataset, watched the show itself, or found the data from other sources and aggregated it here. The data only contains factual information that can be verified by online research.
Write a brief description of the observations.
- Each observation is a company that appeared on the TV show Shark Tank between seasons 1 and 6, inclusive. The dataset includes companies that received offers and companies that did not. The columns record details of each company's appearance on the show, including the company name, whether it reached a deal, its valuation, the percent stake offered, and the episode and season in which it appeared.
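As a quick sanity check on how the money columns relate, the sketch below recomputes the valuation from the amount asked and the stake offered. It uses only the first five rows shown in the glimpse output further down. The names `shark_head`, `implied_valuation`, and `matches` are ours, and the formula (amount asked divided by the stake as a fraction) is an assumption that happens to agree with these rows, not a documented definition from the data curator.

```r
library(dplyr)

# First five rows of askedfor, exchangeforstake, and valuation, copied from the glimpse output.
shark_head <- tibble(
  askedfor         = c(1000000, 460000, 50000, 250000, 1200000),
  exchangeforstake = c(15, 10, 15, 25, 10),
  valuation        = c(6666667, 4600000, 333333, 1000000, 12000000)
)

# Assumed relationship: valuation = amount asked / (stake as a fraction), rounded to whole dollars.
shark_check <- shark_head |>
  mutate(
    implied_valuation = round(askedfor / (exchangeforstake / 100)),
    matches           = implied_valuation == valuation
  )

shark_check
```

If the same relationship holds across the full dataset, valuation carries no information beyond askedfor and exchangeforstake, which is worth keeping in mind when choosing predictors.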
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How did the type of product relate to the shark(s) who invested in the product; were certain types of products more likely to attract some sharks over others?
How is valuation correlated to success rate; is the value of a company an indicator of the likelihood of reaching a deal with a shark?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The topic of interest concerns the success of beginner entrepreneurs in marketing their businesses to potential investors. By studying this data from Shark Tank, perhaps some insight can be gained into how to attract investors to products. Our hypotheses for the above two research questions are:
Sharks will be more likely to invest in products with which they have prior experience, or for which they have connections they can leverage to make the product more likely to succeed.
- A Shark’s background can easily be determined via outside research.
Companies with higher valuations will tend to receive deals more often than companies with lower valuations.
- Identify the types of variables in your research question. Categorical? Quantitative?
The first research question involves only categorical variables. The dataset stores the relevant variables under the names category (chr), description (chr), and deal (lgl).
The second research question involves one quantitative and one categorical variable. The dataset stores the relevant variables under the names valuation (dbl) and deal (lgl).
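A minimal sketch of how these variable pairings might be summarized is given below. The data frame `shark_toy` is hypothetical: it reuses the column names category, valuation, and deal, but its values are invented for illustration and are not taken from the dataset. Splitting valuation at the median is also just one possible grouping, chosen here for simplicity.

```r
library(dplyr)

# Hypothetical toy rows; values are illustrative only.
shark_toy <- tibble(
  category  = c("Novelties", "Specialty Food", "Novelties",
                "Specialty Food", "Novelties", "Specialty Food"),
  valuation = c(250000, 500000, 1000000, 2000000, 4000000, 8000000),
  deal      = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Categorical vs. categorical: share of pitches that ended in a deal, per category.
deal_rate_by_category <- shark_toy |>
  group_by(category) |>
  summarize(n = n(), deal_rate = mean(deal))

# Quantitative vs. categorical: deal rate above and at or below the median valuation.
deal_rate_by_valuation <- shark_toy |>
  mutate(valuation_group = if_else(valuation > median(valuation),
                                   "above median", "at or below median")) |>
  group_by(valuation_group) |>
  summarize(n = n(), deal_rate = mean(deal))

deal_rate_by_category
deal_rate_by_valuation
```

Because deal is logical, `mean(deal)` gives the proportion of TRUE values directly.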
Glimpse of data
shark <- read_csv("data/shark/shark_tank_companies.csv")
Rows: 495 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): description, category, entrepreneurs, location, website, shark1, s...
dbl (5): episode, askedfor, exchangeforstake, valuation, season
lgl (2): deal, multiple_entreprenuers
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(shark)
Rows: 495
Columns: 19
$ deal <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, F…
$ description <chr> "Bluetooth device implant for your ear.", "Reta…
$ episode <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4,…
$ category <chr> "Novelties", "Specialty Food", "Baby and Child …
$ entrepreneurs <chr> "Darrin Johnson", "Tod Wilson", "Tiffany Krumin…
$ location <chr> "St. Paul, MN", "Somerset, NJ", "Atlanta, GA", …
$ website <chr> NA, "http://whybake.com/", "http://www.avatheel…
$ askedfor <dbl> 1000000, 460000, 50000, 250000, 1200000, 500000…
$ exchangeforstake <dbl> 15, 10, 15, 25, 10, 15, 20, 20, 10, 10, 35, 10,…
$ valuation <dbl> 6666667, 4600000, 333333, 1000000, 12000000, 33…
$ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ shark1 <chr> "Barbara Corcoran", "Barbara Corcoran", "Barbar…
$ shark2 <chr> "Robert Herjavec", "Robert Herjavec", "Robert H…
$ shark3 <chr> "Kevin O'Leary", "Kevin O'Leary", "Kevin O'Lear…
$ shark4 <chr> "Daymond John", "Daymond John", "Daymond John",…
$ shark5 <chr> "Kevin Harrington", "Kevin Harrington", "Kevin …
$ title <chr> "Ionic Ear", "Mr. Tod's Pie Factory", "Ava the …
$ episode_season <chr> "1-1", "1-1", "1-1", "1-1", "1-1", "1-2", "1-2"…
$ multiple_entreprenuers <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
head(shark)
# A tibble: 6 × 19
deal description episode category entrepreneurs location website askedfor
<lgl> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 FALSE Bluetooth devi… 1 Novelti… Darrin Johns… St. Pau… <NA> 1000000
2 TRUE Retail and who… 1 Special… Tod Wilson Somerse… http:/… 460000
3 TRUE Ava the Elepha… 1 Baby an… Tiffany Krum… Atlanta… http:/… 50000
4 FALSE Organizing, pa… 1 Consume… Nick Friedma… Tampa, … http:/… 250000
5 FALSE Interactive me… 1 Consume… Kevin Flanne… Cary, NC http:/… 1200000
6 TRUE One of the fir… 2 Special… Susan Knapp Napa Va… http:/… 500000
# ℹ 11 more variables: exchangeforstake <dbl>, valuation <dbl>, season <dbl>,
# shark1 <chr>, shark2 <chr>, shark3 <chr>, shark4 <chr>, shark5 <chr>,
# title <chr>, episode_season <chr>, multiple_entreprenuers <lgl>
Data 2
Introduction and data
Identify the source of the data.
- The data comes from basketball-reference.com. Link: https://www.basketball-reference.com/friv/buzzer-beaters.html
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This data was published in February 2020 and has been updated for every NBA/BAA game since then. The original data curator defines a “buzzer beater shot” as “successful shots taken with the shooter’s team tied or trailing which left no time on the clock after going through the net.” They used a team stats finder (https://stathead.com/basketball/team-game-finder.cgi) to collect this data.
Write a brief description of the observations.
- Each observation represents a game that a team won with a buzzer-beater shot. The columns record whether the shot was assisted, which player took it, how far from the rim it was taken, and the shooting player's box-score statistics for that game.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How has the frequency and success rate of buzzer-beater shots changed over time in professional basketball, and what factors may have contributed to these trends?
What is the overall success rate of buzzer-beater shots in NBA history, and how does this rate vary across different eras, teams, and players?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The study focuses on buzzer-beater shots in the NBA, which are shots taken in the final seconds of a game with the goal of scoring before the game clock runs out. These shots can have a big impact on the outcome of a game and are thrilling for both players and fans.
Hypotheses:
Players who attempt buzzer-beater shots from a closer distance to the basket will be more successful than those who attempt shots from a greater distance.
Players who have attempted more buzzer-beater shots in their careers are more likely to succeed than those who have attempted fewer shots.
Teams with a higher overall field goal percentage during the game will be more successful with buzzer-beater shots.
- Identify the types of variables in your research question. Categorical? Quantitative?
Team and player names are categorical variables that can be used to group and compare data. Time can also be treated as a categorical variable when games are grouped by decade or NBA season.
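One way such time groupings might be derived from the Game column is sketched below, using three date strings that appear in the data preview. The names `games`, `game_date`, `decade`, and `season_end_year` are ours. Treating any game played in October or later as part of the season that ends the following year is a simplifying assumption, not a rule taken from the data source.

```r
library(dplyr)
library(lubridate)

# Three Game values copied from the data preview.
games <- tibble(Game = c("Mar 7 2023", "Feb 26 2023", "Dec 31 2022"))

games_by_era <- games |>
  mutate(
    game_date = mdy(Game),                       # parse "Mar 7 2023" style strings
    decade    = floor(year(game_date) / 10) * 10, # e.g. 2023 -> 2020
    # Assumption: games from October onward belong to the season ending next year.
    season_end_year = if_else(month(game_date) >= 10,
                              year(game_date) + 1,
                              year(game_date))
  )

games_by_era
```

Either derived column can then be passed to `group_by()` or converted with `factor()` to compare eras.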
The success rate of buzzer-beater shots is a quantitative variable that can be calculated as the percentage of made vs. attempted buzzer-beater shots. Because the success rate variable is not included in the dataset, this would have to be calculated using the data in the dataset.
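The made-versus-attempted arithmetic can be illustrated with columns that do appear in the data preview. The sketch below recomputes field goal percentage from FG and FGA for the first three rows and reproduces the values listed under FG. (0.800, 0.462, 0.385). A buzzer-beater success rate would follow the same pattern, but it needs a count of attempted buzzer-beaters, which we assume would have to come from another source since the preview shows no such column. The names `buzzer_head` and `fg_pct` are ours.

```r
library(dplyr)

# Player, FG (made), and FGA (attempted) for the first three rows of the preview.
buzzer_head <- tibble(
  Player = c("Daniel Gafford", "Trae Young (2)", "Wendell Carter Jr."),
  FG     = c(4, 12, 5),
  FGA    = c(5, 26, 13)
)

# Success rate = made / attempted, rounded to three decimals like the FG. column.
buzzer_rates <- buzzer_head |>
  mutate(fg_pct = round(FG / FGA, 3))

buzzer_rates
```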
Glimpse of data
buzzer_beaters <- read.csv("data/buzzerBeater/game-winning-buzzer-beaters.csv")
glimpse(buzzer_beaters)
Rows: 822
Columns: 24
$ Player <chr> "Daniel Gafford", "Trae Young (2)", "Wendell Carter Jr.", "S…
$ Game <chr> "Mar 7 2023", "Feb 26 2023", "Feb 23 2023", "Jan 4 2023", "Ja…
$ Team <chr> "WAS", "ATL", "ORL", "DET", "GSW", "MIA", "CHI", "OKC", "BRK"…
$ Opp <chr> "DET", "BRK", "DET", "GSW", "ATL", "UTA", "ATL", "POR", "TOR"…
$ Margin <chr> "tied", "tied", "tied", "tied", "tied", "tied", "tied", "tied…
$ Type <chr> "2-pt FG", "2-pt FG", "2-pt FG", "3-pt FG", "2-pt FG", "3-pt …
$ Assisted <chr> "unassisted", "unassisted", "unassisted", "K. Hayes", "unassi…
$ Distance <chr> "At Rim", "12", "At Rim", "28", "At Rim", "25", "At Rim", "14…
$ MP <chr> "24:38:00", "33:48:00", "31:46:00", "28:46:00", "32:25:00", "…
$ PTS <int> 8, 34, 14, 17, 14, 29, 9, 35, 32, 17, 12, 17, 12, 37, 30, 31,…
$ FG <int> 4, 12, 5, 6, 4, 10, 4, 10, 13, 7, 4, 8, 3, 14, 10, 9, 6, 11, …
$ FGA <int> 5, 26, 13, 17, 8, 20, 6, 24, 22, 17, 8, 15, 5, 24, 17, 18, 14…
$ FG. <dbl> 0.800, 0.462, 0.385, 0.353, 0.500, 0.500, 0.667, 0.417, 0.591…
$ X3P <int> 0, 1, 0, 4, 0, 3, 1, 1, 3, 2, 4, 1, 1, 2, 1, 3, 2, 9, 4, 2, 4…
$ X3PA <int> 0, 5, 3, 10, 0, 11, 3, 1, 9, 11, 7, 6, 3, 7, 3, 7, 6, 12, 5, …
$ X3P. <dbl> NA, 0.200, 0.000, 0.400, NA, 0.273, 0.333, 1.000, 0.333, 0.18…
$ FT <int> 0, 9, 4, 1, 6, 6, 0, 14, 3, 1, 0, 0, 5, 7, 9, 10, 8, 7, 2, 3,…
$ FTA <int> 0, 9, 6, 1, 8, 7, 0, 14, 3, 1, 0, 0, 7, 7, 11, 12, 8, 8, 2, 3…
$ FT. <dbl> NA, 1.000, 0.667, 1.000, 0.750, 0.857, NA, 1.000, 1.000, 1.00…
$ TRB <int> 7, 3, 14, 3, 20, 9, 3, 2, 3, 2, 4, 5, 9, 5, 2, 4, 4, 2, 8, 4,…
$ AST <int> 1, 8, 2, 1, 4, 6, 2, 6, 5, 2, 1, 1, 8, 3, 5, 8, 3, 0, 3, 6, 1…
$ STL <int> 1, 2, 1, 2, 0, 2, 2, 1, 0, 0, 1, 1, 2, 1, 0, 1, 3, 1, 2, 0, 1…
$ BLK <int> 1, 0, 2, 0, 1, 0, 0, 2, 0, 1, 2, 1, 0, 0, 1, 2, 0, 0, 0, 0, 1…
$ GmSc <dbl> 7.8, 24.7, 14.3, 11.1, 21.1, 20.9, 9.3, 26.6, 24.0, 8.8, 9.1,…
head(buzzer_beaters)
Player Game Team Opp Margin Type Assisted Distance
1 Daniel Gafford Mar 7 2023 WAS DET tied 2-pt FG unassisted At Rim
2 Trae Young (2) Feb 26 2023 ATL BRK tied 2-pt FG unassisted 12
3 Wendell Carter Jr. Feb 23 2023 ORL DET tied 2-pt FG unassisted At Rim
4 Saddiq Bey Jan 4 2023 DET GSW tied 3-pt FG K. Hayes 28
5 Kevon Looney Jan 2 2023 GSW ATL tied 2-pt FG unassisted At Rim
6 Tyler Herro Dec 31 2022 MIA UTA tied 3-pt FG unassisted 25
MP PTS FG FGA FG. X3P X3PA X3P. FT FTA FT. TRB AST STL BLK GmSc
1 24:38:00 8 4 5 0.800 0 0 NA 0 0 NA 7 1 1 1 7.8
2 33:48:00 34 12 26 0.462 1 5 0.200 9 9 1.000 3 8 2 0 24.7
3 31:46:00 14 5 13 0.385 0 3 0.000 4 6 0.667 14 2 1 2 14.3
4 28:46:00 17 6 17 0.353 4 10 0.400 1 1 1.000 3 1 2 0 11.1
5 32:25:00 14 4 8 0.500 0 0 NA 6 8 0.750 20 4 0 1 21.1
6 34:01:00 29 10 20 0.500 3 11 0.273 6 7 0.857 9 6 2 0 20.9
Data 3
Introduction and data
- Identify the source of the data.
This data is from Kaggle. The link to it is: https://www.kaggle.com/datasets/thedevastator/movie-gross-and-ratings-from-1989-to-2014
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
This data was originally collected by Yashwanth Sharaff through this website: https://data.world/yasharaff/movies-gross. It was collected and uploaded September 6th, 2021.
- Write a brief description of the observations.
Each observation describes one movie: its MPAA rating, budget, gross, genre, runtime, and release date, which together give a picture of how the movie performed when released. Because the dataset contains the top 20 movies of each year from 1989 to 2014, the observations also show how these features have changed over time.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How have movies' MPAA ratings and runtimes changed over time relative to their budgets, and what factors may have contributed to these trends?
- This is important because it can provide insight into how movie runtimes have changed over time, and what budget movies get based on their MPAA ratings.
How is release date correlated to gross revenue; does the release date of a movie make it more or less likely to make money? Has this changed over time?
- This question can provide insight on the best time to release a movie of a certain genre, and how to maximize the chances of getting a successful movie.
- A description of the research topic along with a concise statement of your hypotheses on this topic.
This dataset shows how factors such as release date, budget, runtime, and MPAA rating relate to a movie's gross revenue and overall popularity. This research is important because it can show movie studios which modern-day films are succeeding, and perhaps explain what contributes to that success.
Hypotheses for the two questions:
Runtimes of movies have gotten shorter in recent years, yet movies have gotten more expensive to produce.
More movies with mature MPAA ratings have been made in recent years, and with larger budgets, than in prior decades.
Certain genres of movies do better at different times of year and earn larger gross revenues when released then.
- Identify the types of variables in your research question. Categorical? Quantitative?
Title, MPAA Rating, and Genre are categorical variables. Release Date is stored as a date and can be treated as categorical when grouped by month or time of year.
Budget, gross, runtime, and rating count are all quantitative variables.
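To show how a date variable can feed a grouped comparison of a quantitative one, the sketch below pulls the release month out of Release Date and summarizes Gross by month. It uses the first five Release Date and Gross values from the data preview. The names `movies_sample`, `release_month`, and `gross_by_month` are ours, and the median is just one reasonable summary for skewed revenue figures.

```r
library(dplyr)
library(lubridate)

# First five Release Date and Gross values copied from the data preview.
movies_sample <- tibble(
  `Release Date` = as.Date(c("1989-10-12", "1989-12-13", "1989-07-28",
                             "1989-12-20", "1989-04-21")),
  Gross = c(296000000, 145793296, 71079915, 161001698, 84431625)
)

# Treat release month as a grouping variable and summarize gross revenue within it.
gross_by_month <- movies_sample |>
  mutate(release_month = month(`Release Date`)) |>
  group_by(release_month) |>
  summarize(n_movies = n(), median_gross = median(Gross)) |>
  arrange(release_month)

gross_by_month
```

With the full dataset, adding Genre to the `group_by()` call would allow the seasonal comparison to be made within each genre.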
Glimpse of data
movies <- read_csv("data/movies/Movies_gross_rating.csv")
Rows: 510 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Title, MPAA Rating, Genre
dbl (7): index, MovieID, Budget, Gross, Runtime, Rating, Rating Count
date (1): Release Date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(movies)
Rows: 510
Columns: 11
$ index <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ MovieID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ Title <chr> "Look Who's Talking", "Driving Miss Daisy", "Turner & H…
$ `MPAA Rating` <chr> "PG-13", "PG", "PG", "R", "PG", "PG", "R", "PG", "PG-13…
$ Budget <dbl> 7500000, 7500000, 13000000, 14000000, 15000000, 1500000…
$ Gross <dbl> 296000000, 145793296, 71079915, 161001698, 84431625, 79…
$ `Release Date` <date> 1989-10-12, 1989-12-13, 1989-07-28, 1989-12-20, 1989-0…
$ Genre <chr> "Romance", "Comedy", "Crime", "War", "Drama", "Family",…
$ Runtime <dbl> 93, 99, 100, 145, 107, 100, 96, 129, 124, 114, 116, 97,…
$ Rating <dbl> 5.9, 7.4, 7.2, 7.2, 7.5, 7.0, 7.6, 8.1, 7.0, 7.2, 6.8, …
$ `Rating Count` <dbl> 73638, 91075, 91415, 91415, 101702, 77659, 180871, 3820…
head(movies)
# A tibble: 6 × 11
index MovieID Title `MPAA Rating` Budget Gross `Release Date` Genre Runtime
<dbl> <dbl> <chr> <chr> <dbl> <dbl> <date> <chr> <dbl>
1 0 1 Look W… PG-13 7.5e6 2.96e8 1989-10-12 Roma… 93
2 1 2 Drivin… PG 7.5e6 1.46e8 1989-12-13 Come… 99
3 2 3 Turner… PG 1.3e7 7.11e7 1989-07-28 Crime 100
4 3 4 Born o… R 1.4e7 1.61e8 1989-12-20 War 145
5 4 5 Field … PG 1.5e7 8.44e7 1989-04-21 Drama 107
6 5 6 Uncle … PG 1.5e7 7.93e7 1989-08-16 Fami… 100
# ℹ 2 more variables: Rating <dbl>, `Rating Count` <dbl>