Deal or No Deal - Investigating Shark Tank Deals Throughout the Show

Proposal

library(tidyverse)
library(skimr)

Data 1 - Shark Tank

Introduction and data

Identify the source of the data.
- The first data source, shark_tank.csv, is located in the data folder. This dataset is a collection of observations from business pitches from the first 14 seasons of the American TV show “Shark Tank.” The dataset contains 1038 observations, each representing a unique business pitch, and 52 columns that describe the pitch.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The dataset, which is available for public use via kaggle.com, was curated by a collaborator named Satya Thirumani. The page containing the dataset can be found at this link. The collaborator continually updates the dataset as Shark Tank airs new episodes and seasons. The dataset was last updated on March 9, 2023. For this project, we will not incorporate any updates made after that date.
Write a brief description of the observations.
- Each observation contains 52 columns regarding the business pitch. These columns cover a wide range of information. First, there are details about the pitch as it relates to the Shark Tank show (i.e., season, episode, when it aired, etc.). There are columns regarding the company being pitched, such as its name and industry, as well as the entrepreneurs who run the company. Finally, there is a great deal of information about the economics of each pitch. This includes the amount of money requested, the company’s original valuation, and ultimately which sharks (investors), if any, invested in the company and at what new valuation. Because not all companies receive investment on the show, many column/row pairings have blank (i.e., NA) values.

Research question

Research question(s)
1. What makes a Shark Tank pitch more likely to reach a deal on the show?
  1. This question is important as it allows us to visualize trends in the types of Shark Tank pitches that ultimately reach a deal. The data may reveal that certain industries, investors, or investment deals prove to have higher success rates, which could serve as a guide to future entrepreneurs looking to pitch their business on the show.
2. When a company makes a deal on Shark Tank, how often does it receive the deal it requested? Do companies usually have to lower their valuations to reach a deal?
  1. This question is important as it may show that certain investment offers proposed by entrepreneurs are more or less beneficial toward reaching a deal based on the metrics. It may reveal a ‘sweet spot’, where there is a trend in the ranges of valuations that ultimately result in a deal with an investor.
3. Which sharks (investors) on the show are most likely to invest in a company given certain metrics about the company? Do certain sharks typically invest in companies in certain industries? Or in companies where the entrepreneurs come from a certain background?
  1. This question is important as it could unveil biases or tendencies in the deals that certain investors tend to make with entrepreneurs. Certain investors may be more inclined to reach a deal with a business if it meets the specific qualifications (if any) that they prefer in a pitch.
Description of research topic
- Analysis of Shark Tank business pitches: how a company does or doesn’t reach a deal on the hit show “Shark Tank”
Hypotheses on the topic
- Companies that reach deals on Shark Tank typically must lower their valuation from their original ask.
- Sharks typically invest more often in companies that are in their respective areas of expertise.
- Sharks are more likely to invest in companies whose entrepreneurs come from a similar background as themselves.
Types of variables in the research question
- In order to answer the questions listed above, there are a number of categorical and quantitative variables that are needed. In terms of quantitative variables, there must be data regarding capital/equity that was initially requested at the onset of the pitch, as well as capital/equity that is agreed upon if a deal is ultimately reached. In terms of categorical data, there will have to be data points regarding the business such as its industry and name. There will also have to be data about the background of the pitchers/entrepreneurs. Moreover, there should be data about the background of the sharks (investors). There also must be a crucial variable - whether or not the pitch results in a deal.
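
As a rough illustration of how the first two research questions could be approached, the sketch below computes a deal rate per industry and the share of deals closed below the requested valuation. It is only a sketch: the rows in `pitches` are made-up toy values, and only the column names (`Industry`, `Got Deal`, `Valuation Requested`, `Deal Valuation`) are taken from shark_tank.csv.

```r
library(dplyr)

# Toy rows with made-up values; only the column names mirror shark_tank.csv.
pitches <- tibble(
  Industry              = c("A", "A", "B", "B"),
  `Got Deal`            = c(1, 0, 1, 1),
  `Valuation Requested` = c(1000000, 500000, 2000000, 800000),
  `Deal Valuation`      = c(600000, NA, 2000000, 400000)
)

# Question 1: share of pitches that end in a deal, per industry.
deal_rate <- pitches |>
  group_by(Industry) |>
  summarize(n_pitches = n(), deal_rate = mean(`Got Deal`), .groups = "drop")

# Question 2: among pitches that got a deal, how often is the final
# valuation strictly below the valuation that was requested?
lowered <- pitches |>
  filter(`Got Deal` == 1) |>
  summarize(share_lowered = mean(`Deal Valuation` < `Valuation Requested`))

print(deal_rate)
print(lowered)
```

On the real data, the same pipeline would run on `shark_tank` in place of the toy `pitches` tibble.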

Glimpse of data

shark_tank <- read_csv("data/shark_tank.csv")

Rows: 1038 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Season Start, Season End, Original Air Date, Startup Name, Industr...
dbl (38): Season Number, Episode Number, Pitch Number, Multiple Entrepreneur...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(shark_tank)

Data summary
Name	shark_tank
Number of rows	1038
Number of columns	52
_______________________
Column type frequency:
character	14
numeric	38
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Season Start	0	1.00	9	9	14
Season End	7	0.99	9	9	13
Original Air Date	408	0.61	9	9	154
Startup Name	0	1.00	3	32	1036
Industry	0	1.00	6	23	15
Business Description	0	1.00	5	92	1036
Pitchers Gender	5	1.00	4	10	3
Pitchers City	540	0.48	3	18	250
Pitchers State	299	0.71	2	6	46
Pitchers Average Age	992	0.04	3	6	4
Entrepreneur Names	557	0.46	8	60	479
Company Website	570	0.45	9	65	466
Guest Name	837	0.19	9	17	24
Notes	899	0.13	9	191	129

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Season Number	0	1.00	6.76	3.11	1.00	4.00	7.00	9.00	1.400e+01	▃▇▅▇▁
Episode Number	0	1.00	12.12	7.74	1.00	5.00	11.00	18.00	2.900e+01	▇▆▅▅▂
Pitch Number	0	1.00	519.50	299.79	1.00	260.25	519.50	778.75	1.038e+03	▇▇▇▇▇
Multiple Entrepreneurs	487	0.53	0.35	0.48	0.00	0.00	0.00	1.00	1.000e+00	▇▁▁▁▅
US Viewership	416	0.60	6.10	1.35	2.31	5.15	6.38	7.11	8.640e+00	▁▃▅▇▃
Original Ask Amount	0	1.00	281798.65	379843.24	10000.00	100000.00	200000.00	300000.00	5.000e+06	▇▁▁▁▁
Original Offered Equity	0	1.00	14.64	8.91	1.50	10.00	10.00	20.00	1.000e+02	▇▁▁▁▁
Valuation Requested	0	1.00	3163290.63	4804725.88	40000.00	600000.00	1485294.00	3333333.00	4.500e+07	▇▁▁▁▁
Got Deal	0	1.00	0.58	0.49	0.00	0.00	1.00	1.00	1.000e+00	▆▁▁▁▇
Total Deal Amount	436	0.58	290921.37	378899.37	0.00	100000.00	200000.00	300000.00	5.000e+06	▇▁▁▁▁
Total Deal Equity	436	0.58	25.51	16.18	0.00	15.00	25.00	33.00	1.000e+02	▇▇▂▁▁
Deal Valuation	436	0.58	2042821.14	3718413.81	0.00	336206.75	800000.00	2000000.00	3.600e+07	▇▁▁▁▁
Number of sharks in deal	436	0.58	1.32	0.63	1.00	1.00	1.00	2.00	5.000e+00	▇▂▁▁▁
Investment Amount Per Shark	436	0.58	245115.72	350301.99	0.00	75000.00	150000.00	300000.00	5.000e+06	▇▁▁▁▁
Equity Per Shark	436	0.58	21.55	15.17	0.00	10.00	20.00	25.00	1.000e+02	▇▅▁▁▁
Royalty Deal	987	0.05	1.00	0.00	1.00	1.00	1.00	1.00	1.000e+00	▁▁▇▁▁
Loan	1001	0.04	1.00	0.00	1.00	1.00	1.00	1.00	1.000e+00	▁▁▇▁▁
Barbara Corcoran Investment Amount	940	0.09	143520.41	137398.90	12500.00	50000.00	100000.00	200000.00	1.000e+06	▇▂▁▁▁
Barbara Corcoran Investment Equity	940	0.09	23.98	13.09	5.00	15.00	20.00	32.25	5.500e+01	▇▇▂▂▂
Mark Cuban Investment Amount	857	0.17	245649.17	278613.24	12500.00	75000.00	150000.00	300000.00	2.000e+06	▇▁▁▁▁
Mark Cuban Investment Equity	857	0.17	18.80	15.40	2.50	10.00	15.00	25.00	1.000e+02	▇▃▁▁▁
Lori Greiner Investment Amount	882	0.15	205993.59	198022.87	17500.00	75000.00	150000.00	250000.00	1.000e+06	▇▂▁▁▁
Lori Greiner Investment Equity	882	0.15	16.61	12.03	0.00	10.00	12.50	20.00	6.500e+01	▇▅▁▁▁
Robert Herjavec Investment Amount	938	0.10	290973.33	581148.81	5000.00	86458.33	150000.00	300000.00	5.000e+06	▇▁▁▁▁
Robert Herjavec Investment Equity	938	0.10	18.66	13.36	0.00	10.00	15.00	25.00	1.000e+02	▇▃▁▁▁
Daymond John Investment Amount	943	0.09	186805.26	319390.55	5000.00	50000.00	100000.00	240000.00	3.000e+06	▇▁▁▁▁
Daymond John Investment Equity	943	0.09	26.06	16.18	0.00	15.82	25.00	33.30	1.000e+02	▇▇▁▁▁
Kevin O Leary Investment Amount	942	0.09	236276.04	315926.33	20000.00	80000.00	150000.00	250000.00	2.500e+06	▇▁▁▁▁
Kevin O Leary Investment Equity	942	0.09	15.83	11.65	0.00	8.56	10.83	25.00	5.000e+01	▇▃▂▁▁
Guest Investment Amount	969	0.07	216606.28	239754.19	0.00	75000.00	125000.00	250000.00	1.250e+06	▇▂▁▁▁
Guest Investment Equity	969	0.07	16.71	15.52	0.00	10.00	11.25	20.00	1.000e+02	▇▂▁▁▁
Barbara Corcoran Present	143	0.86	0.56	0.50	0.00	0.00	1.00	1.00	1.000e+00	▆▁▁▁▇
Mark Cuban Present	142	0.86	0.90	0.30	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇
Lori Greiner Present	142	0.86	0.75	0.43	0.00	0.75	1.00	1.00	1.000e+00	▂▁▁▁▇
Robert Herjavec Present	142	0.86	0.88	0.33	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇
Daymond John Present	143	0.86	0.66	0.47	0.00	0.00	1.00	1.00	1.000e+00	▅▁▁▁▇
Kevin O Leary Present	143	0.86	0.96	0.21	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇
Kevin Harrington Present	143	0.86	0.95	0.23	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇
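
The per-shark columns in the summary above are stored in wide form, one `<Shark> Investment Amount` column per investor. One possible way to compare sharks is to reshape those columns to long form first. The sketch below does this on made-up toy rows; treating NA as "this shark did not invest" is our assumption about how the data is coded, not something the source states.

```r
library(dplyr)
library(tidyr)

# Toy rows with made-up values; the column names follow the
# "<Shark> Investment Amount" pattern seen in the skim output.
pitches <- tibble(
  `Pitch Number`                   = 1:4,
  `Mark Cuban Investment Amount`   = c(100000, NA, 50000, NA),
  `Lori Greiner Investment Amount` = c(NA, NA, 50000, 200000)
)

shark_deals <- pitches |>
  pivot_longer(
    cols           = ends_with("Investment Amount"),
    names_to       = "shark",
    values_to      = "amount",
    values_drop_na = TRUE   # assumption: NA means the shark did not invest
  ) |>
  mutate(shark = sub(" Investment Amount$", "", shark)) |>
  group_by(shark) |>
  summarize(n_deals = n(), median_amount = median(amount), .groups = "drop")

print(shark_deals)
```

Note that on the full dataset `ends_with("Investment Amount")` would also pick up `Guest Investment Amount`, which may need to be handled separately.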

Data 2 - March Madness

Introduction and data

Identify the source of the data.
- The data source, march_madness.csv, is located in the data folder. This dataset is a collection of observations from March Madness (the NCAA college basketball playoff tournament). The data comes from Kaggle, and can be found at this link.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- This data was taken by the Washington Post from the NCAA tournament data that has been collected and recorded by the NCAA throughout the duration of past March Madness tournaments.
Write a brief description of the observations.
- Each observation represents one game. There is an observation for every game in each round of all tournaments from 1985-2021.

Research question

Research question(s)
- What is the relationship between a team’s seeding and how far it advances in the tournament?
- What “underdog” seeds are most likely to win an upset?
Description of research topic
- March Madness is known to have a degree of randomness in which teams win each game, as nobody has ever predicted every game winner accurately. We are wondering how this randomness affects the likelihood of a team making it far in the tournament.
Hypotheses on the topic
- Seeds at either extreme (the best and the worst) are more indicative of performance, while teams with middle-range seeds are less predictable.
- Smaller underdogs (i.e., those seeded closer to their opponent) are the most likely to upset a higher-ranked team.
Types of variables in the research question
- The categorical variables are the seed and the year of the tournament. The quantitative variable is the average round a seed reaches.
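
One way the upset question could be made concrete is to label a game an upset when the winner’s seed number is larger than the loser’s, and then compute the upset rate for each underdog seed. The sketch below assumes that definition (it is ours, not the data curator’s) and runs on made-up toy games that reuse the `WSEED` and `LSEED` column names from march_madness.csv.

```r
library(dplyr)

# Toy games with made-up results; column names mirror march_madness.csv.
games <- tibble(
  YEAR  = c(2001, 2001, 2001, 2002, 2002),
  ROUND = c(1, 1, 1, 1, 1),
  WSEED = c(1, 12, 5, 4, 12),
  LSEED = c(16, 5, 12, 13, 5)
)

upsets <- games |>
  filter(WSEED != LSEED) |>               # equal seeds have no underdog
  mutate(
    underdog_seed = pmax(WSEED, LSEED),   # larger number = worse seed
    upset         = WSEED > LSEED         # TRUE when the worse seed won
  ) |>
  group_by(underdog_seed) |>
  summarize(games = n(), upset_rate = mean(upset), .groups = "drop")

print(upsets)
```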

Glimpse of data

march_madness <- read_csv("data/march_madness.csv")

Rows: 2317 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): WTEAM, LTEAM
dbl (6): YEAR, ROUND, WSEED, WSCORE, LSEED, LSCORE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(march_madness)

Data summary
Name	march_madness
Number of rows	2317
Number of columns	8
_______________________
Column type frequency:
character	2
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
WTEAM	0	1	3	25	0	207	0
LTEAM	0	1	3	25	0	302	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
YEAR	1	2002.76	10.47	1985	1994	2003	2012	2021	▇▇▇▇▇
ROUND	1	1.86	1.21	0	1	1	2	6	▇▃▂▁▁
WSEED	1	4.98	3.84	1	2	4	7	16	▇▃▂▂▁
WSCORE	1	76.87	11.84	43	69	76	84	149	▂▇▂▁▁
LSEED	1	8.72	4.60	1	5	9	13	16	▇▆▆▆▇
LSCORE	1	65.19	11.05	29	58	65	72	115	▁▇▇▂▁
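
Because each row is a game rather than a team, the "average round a seed reaches" has to be derived. A minimal sketch, assuming a team’s furthest round is the highest `ROUND` value in which it appears as either winner or loser, is shown below on made-up toy games. Under this assumption a champion and a runner-up both end in the final round, so they are not distinguished.

```r
library(dplyr)

# Toy games with made-up teams and results; column names mirror march_madness.csv.
games <- tibble(
  YEAR  = c(2001, 2001, 2001),
  ROUND = c(1, 1, 2),
  WTEAM = c("Team A", "Team C", "Team A"),
  WSEED = c(1, 2, 1),
  LTEAM = c("Team B", "Team D", "Team C"),
  LSEED = c(16, 15, 2)
)

# One row per team per game, whether the team won or lost it.
appearances <- bind_rows(
  games |> transmute(YEAR, ROUND, team = WTEAM, seed = WSEED),
  games |> transmute(YEAR, ROUND, team = LTEAM, seed = LSEED)
)

avg_round <- appearances |>
  group_by(YEAR, team, seed) |>
  summarize(last_round = max(ROUND), .groups = "drop") |>  # furthest round played
  group_by(seed) |>
  summarize(avg_last_round = mean(last_round), .groups = "drop")

print(avg_round)
```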

Data 3 - Goodreads and Google Books

Introduction and data

Identify the source of the data.
- This data comes in two parts. The first is webscraped from Goodreads’ website, specifically their most popular list, “Books That Everyone Should Read At Least Once.” The link is: “https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once”. The second half of the data is from the Google Books API, which we queried by searching for each title from the Goodreads list. We originally wanted to use Goodreads’ API, but it has been discontinued, so we switched to the Google Books API. This resulted in some missing data points: while we used 800 books from the Goodreads list, only around 380 had an equivalent match on Google Books. With more time, we can expand this data set back toward the recommended 800 data points. ISBN, which would have provided a more accurate way to search, was not available on the Goodreads website.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The Goodreads list was voted on by over 100,000 users and features 20,000 books in total. Any user can vote, and books are ranked by the number of votes they receive. The Google Books data was collected by Google as part of its effort to digitize the world’s books en masse. Some books are not on the site due to copyright, so the selection varies. Further, Google Books often has multiple editions of a title, and which edition ranks as most relevant changes with the query, making selection difficult. Verified users and Google developers can add data to the site.
Write a brief description of the observations.
- This data set has 27 columns that cover everything from maturity rating to retail price. The relevant columns are title, author, publisher, publish date, description, page count, print type, categories (genre), average rating, number of ratings, maturity rating, language, links to preview or buy, subtitle, country, and price. The table can use a little more tidying and filtering.
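
As an example of the kind of tidying this table may need, the sketch below pulls a publication year out of `publishedDate` and treats a page count of 0 as missing. Both steps rest on assumptions we have not verified on the full data: that every non-missing `publishedDate` string starts with a four-digit year, and that `pageCount == 0` means "unknown" rather than a real value. The rows are made-up toy values.

```r
library(dplyr)

# Toy rows with made-up values; column names mirror books.csv.
books <- tibble(
  title         = c("Book A", "Book B", "Book C"),
  publishedDate = c("1999-11-03", "2018", NA),
  pageCount     = c(350, 0, 128)
)

books_clean <- books |>
  mutate(
    # assumption: the first four characters are always the year
    pub_year  = as.integer(substr(publishedDate, 1, 4)),
    # assumption: a page count of 0 means the value is unknown
    pageCount = na_if(pageCount, 0)
  )

print(books_clean)
```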

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Research Questions:
  1. How does the average rating of a book vary across different genres, publication dates, and book lengths? Is there a singular model of book that emerges as most popular?
  2. Do different publishing houses set different standards for qualities like page count and retail price of books? Does one of these models stand out as the most successful?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Our research topic with this data set involves an analysis of popular books based on their genre, publication date, book length, publishing house, page count, and retail price. The study aims to investigate the relationship between these variables and the average rating of books. Additionally, we will seek to explore whether certain models of books emerge as more popular than others and/or whether publishing houses set different standards for qualities like page count and retail price.
  
  Hypotheses:
  1. The average rating of books varies significantly across different genres, publication dates, and book lengths. However, as this list is voted on by people, we expect to see certain genres mirror the trends of recent years - Romance may have the most presence on the list. Longer books may be perceived as having more depth and value, resulting in higher average ratings but significantly fewer ratings. We don’t believe that there will be a singular model of book that emerges as most popular from this dataset, as readers’ preferences and tastes can vary widely.
  2. Different publishing houses may set different standards for qualities like page count and retail price of books. Some publishing houses may prioritize longer books with higher page counts, while others may prioritize shorter books that are more accessible to readers. However, we predict that they will be more consistent on genre and retail price than on page length. We expect to see differences between the trends of smaller publishing firms and those of the dominant firms.
Identify the types of variables in your research question. Categorical? Quantitative?
- Categorical variables: genre, language, and maturity rating.
- Quantitative, discrete variables: publication date, page count, average rating, number of ratings.
- Quantitative continuous: price.

Glimpse of data

goodreadsdata <- read_csv("data/books.csv")

New names:
• `amount...24` -> `amount...23`
• `currencyCode...25` -> `currencyCode...24`
• `amount...26` -> `amount...25`
• `currencyCode...27` -> `currencyCode...26`
Rows: 386 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): title, authors, publisher, publishedDate, description, printType, ...
dbl  (5): pageCount, averageRating, ratingsCount, amount...23, amount...25
lgl  (3): allowAnonLogging, comicsContent, isEbook

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(goodreadsdata)

Rows: 386
Columns: 27
$ title               <chr> "To Kill a Mockingbird 40th", "Harry Potter and th…
$ authors             <chr> "Harper Lee", "J.K. Rowling", "Jane Austen", "Anne…
$ publisher           <chr> "HarperCollins Christian Publishing", "Pottermore …
$ publishedDate       <chr> "1999-11-03", "2015-12-08", "2018", "2016", "2021-…
$ description         <chr> "The explosion of racial hate and violence in a sm…
$ pageCount           <dbl> 350, 311, 519, 0, 128, 117, 246, 165, 192, 0, 578,…
$ printType           <chr> "BOOK", "BOOK", "BOOK", "BOOK", "BOOK", "BOOK", "B…
$ categories          <chr> "FICTION", "Juvenile Fiction", "Courtship", "Amste…
$ averageRating       <dbl> 4.5, 4.5, NA, 4.0, 4.0, 4.0, NA, 3.5, 3.5, 4.5, 4.…
$ ratingsCount        <dbl> 2163, 2057, NA, 166, 7, 1906, NA, 3123, 485, 124, …
$ maturityRating      <chr> "NOT_MATURE", "NOT_MATURE", "NOT_MATURE", "NOT_MAT…
$ allowAnonLogging    <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRU…
$ contentVersion      <chr> "0.2.4.0.preview.0", "2.36.33.0.preview.3", "previ…
$ language            <chr> "en", "en", "en", "en", "en", "en", "en", "en", "e…
$ previewLink         <chr> "http://books.google.com/books?id=ayJpGQeyxgkC&pri…
$ infoLink            <chr> "http://books.google.com/books?id=ayJpGQeyxgkC&dq=…
$ canonicalVolumeLink <chr> "https://books.google.com/books/about/To_Kill_a_Mo…
$ subtitle            <chr> NA, NA, NA, NA, NA, NA, NA, "The Authorized Editio…
$ comicsContent       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ country             <chr> "US", "US", "US", "US", "US", "US", "US", "US", "U…
$ saleability         <chr> "NOT_FOR_SALE", "FOR_SALE", "NOT_FOR_SALE", "NOT_F…
$ isEbook             <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TR…
$ amount...23         <dbl> NA, 9.99, NA, NA, NA, NA, NA, 1.99, NA, NA, NA, NA…
$ currencyCode...24   <chr> NA, "USD", NA, NA, NA, NA, NA, "USD", NA, NA, NA, …
$ amount...25         <dbl> NA, 9.99, NA, NA, NA, NA, NA, 1.99, NA, NA, NA, NA…
$ currencyCode...26   <chr> NA, "USD", NA, NA, NA, NA, NA, "USD", NA, NA, NA, …
$ buyLink             <chr> NA, "https://play.google.com/store/books/details?i…

skim(goodreadsdata)

Data summary
Name	goodreadsdata
Number of rows	386
Number of columns	27
_______________________
Column type frequency:
character	19
logical	3
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
title	0	1.00	4	87	382
authors	0	1.00	4	65	295
publisher	51	0.87	2	44	151
publishedDate	6	0.98	4	20	325
description	27	0.93	14	4857	357
printType	3	0.99	4	4	1
categories	0	1.00	4	40	77
maturityRating	0	1.00	10	10	1
contentVersion	0	1.00	10	21	183
language	0	1.00	2	2	7
previewLink	0	1.00	77	210	385
infoLink	0	1.00	72	194	383
canonicalVolumeLink	0	1.00	59	102	383
subtitle	278	0.28	6	165	85
country	0	1.00	2	2	1
saleability	0	1.00	4	12	3
currencyCode...24	278	0.28	3	3	1
currencyCode...26	278	0.28	3	3	1
buyLink	263	0.32	104	104	121

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
allowAnonLogging	0	1.00	0.32	FAL: 262, TRU: 124
comicsContent	384	0.01	1.00	TRU: 2
isEbook	0	1.00	0.32	FAL: 263, TRU: 123

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
pageCount	12	0.97	340.86	279.84	0	164.00	297.00	448.00	1600	▇▅▁▁▁
averageRating	90	0.77	4.04	0.53	2	4.00	4.00	4.50	5	▁▁▃▇▅
ratingsCount	90	0.77	690.37	1217.53	1	11.75	118.50	482.00	4916	▇▁▁▁▁
amount...23	278	0.28	9.72	5.82	0	5.99	9.99	12.99	35	▅▇▂▁▁
amount...25	278	0.28	9.33	5.57	0	5.99	9.99	12.99	35	▅▇▁▁▁