Deal or No Deal - Investigating Shark Tank Deals Throughout the Show

Exploratory data analysis

Research question(s)

What makes a Shark Tank pitch more likely to reach a deal on the show? How do the valuation of a business and its industry affect the likelihood of reaching a deal?

This question gives insight into why a pitch on Shark Tank may or may not ultimately get a deal. The answer could reveal that certain industries, investors, or investment deals have higher success rates, which in turn could serve as a guide to future entrepreneurs looking to pitch their business on the show.

Although this is a very broad question, we plan to focus on two categories of variables that may influence whether a pitch reaches a deal: the valuation attached to the deal, including whether the company lowered its valuation, and the industry the company belongs to. We may also break the industry analysis down by shark to see whether their preferences differ based on their own experience and familiarity with certain industries, such as retail or technology.

We know that many companies apply to the show, but only a small subset are chosen to appear, and only a subset of those that appear are given deals. Startups that already have a certain level of success tend to be more likely to appear. Because Shark Tank is ultimately a TV show, the producers likely pick a combination of companies with promise and companies with entertaining premises, so we are curious whether a similar range of prior success applies to closing a deal once a company has secured an appearance. It would also be interesting to look at how Shark Tank appearances in general have affected these companies. However, given the size of the data and the many independent startups involved, that information would have to be gathered company by company, so we are not pursuing this option.

Data collection and cleaning

This dataset largely came in an analysis-ready format. After downloading the data from Kaggle and converting the column names to snake case, our main cleaning job was deciding which variables could be removed: both those that were not helpful for our research question and those that contained too many missing values to be useful. We removed pitchers_average_age (many missing values), company_website (missing values and unhelpful for analysis), business_description (redundant and less analyzable than industry), and notes (unhelpful for analysis). We also converted the character values of season_start, season_end, and original_air_date to dates using lubridate to make analysis easier. The last thing we reckoned with was that the original data set gives NA for columns like total_deal_amount when a pitch did not get a deal, so we considered whether it would be better for analysis to change those NA values to 0. We decided against it for two reasons: in some cases it is not clear whether an NA reflects no deal or a genuinely missing value, and the primary tool of analysis for our research question will be the got_deal column, which has no missing values. To analyze the amount of money given more specifically, we believe the analysis can be restricted to the pitches that did get a deal.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(skimr)

# Columns dropped during cleaning (see the paragraph above for the reasons)
unwanted <- c("pitchers_average_age", "company_website", "business_description", "notes")
shark_tank <- read_csv("data/shark_tank.csv") |>
  janitor::clean_names() |>
  # all_of() is the supported way to select with an external character vector
  select(!all_of(unwanted)) |>
  # Parse the day-month-year strings into Date columns
  mutate(season_start = lubridate::dmy(season_start),
         season_end = lubridate::dmy(season_end),
         original_air_date = lubridate::dmy(original_air_date))

Rows: 1038 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Season Start, Season End, Original Air Date, Startup Name, Industr...
dbl (38): Season Number, Episode Number, Pitch Number, Multiple Entrepreneur...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(shark_tank)

Data summary
Name	shark_tank
Number of rows	1038
Number of columns	48
_______________________
Column type frequency:
character	7
Date	3
numeric	38
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
startup_name	0	1.00	3	32	1036
industry	0	1.00	6	23	15
pitchers_gender	5	1.00	4	10	3
pitchers_city	540	0.48	3	18	250
pitchers_state	299	0.71	2	6	46
entrepreneur_names	557	0.46	8	60	479
guest_name	837	0.19	9	17	24

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
season_start	0	1.00	2009-08-09	2022-09-23	2015-09-25	14
season_end	7	0.99	2010-02-05	2022-05-20	2016-05-20	13
original_air_date	408	0.61	2009-08-09	2022-09-30	2014-01-13	154

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
season_number	0	1.00	6.76	3.11	1.00	4.00	7.00	9.00	1.400e+01	▃▇▅▇▁
episode_number	0	1.00	12.12	7.74	1.00	5.00	11.00	18.00	2.900e+01	▇▆▅▅▂
pitch_number	0	1.00	519.50	299.79	1.00	260.25	519.50	778.75	1.038e+03	▇▇▇▇▇
multiple_entrepreneurs	487	0.53	0.35	0.48	0.00	0.00	0.00	1.00	1.000e+00	▇▁▁▁▅
us_viewership	416	0.60	6.10	1.35	2.31	5.15	6.38	7.11	8.640e+00	▁▃▅▇▃
original_ask_amount	0	1.00	281798.65	379843.24	10000.00	100000.00	200000.00	300000.00	5.000e+06	▇▁▁▁▁
original_offered_equity	0	1.00	14.64	8.91	1.50	10.00	10.00	20.00	1.000e+02	▇▁▁▁▁
valuation_requested	0	1.00	3163290.63	4804725.88	40000.00	600000.00	1485294.00	3333333.00	4.500e+07	▇▁▁▁▁
got_deal	0	1.00	0.58	0.49	0.00	0.00	1.00	1.00	1.000e+00	▆▁▁▁▇
total_deal_amount	436	0.58	290921.37	378899.37	0.00	100000.00	200000.00	300000.00	5.000e+06	▇▁▁▁▁
total_deal_equity	436	0.58	25.51	16.18	0.00	15.00	25.00	33.00	1.000e+02	▇▇▂▁▁
deal_valuation	436	0.58	2042821.14	3718413.81	0.00	336206.75	800000.00	2000000.00	3.600e+07	▇▁▁▁▁
number_of_sharks_in_deal	436	0.58	1.32	0.63	1.00	1.00	1.00	2.00	5.000e+00	▇▂▁▁▁
investment_amount_per_shark	436	0.58	245115.72	350301.99	0.00	75000.00	150000.00	300000.00	5.000e+06	▇▁▁▁▁
equity_per_shark	436	0.58	21.55	15.17	0.00	10.00	20.00	25.00	1.000e+02	▇▅▁▁▁
royalty_deal	987	0.05	1.00	0.00	1.00	1.00	1.00	1.00	1.000e+00	▁▁▇▁▁
loan	1001	0.04	1.00	0.00	1.00	1.00	1.00	1.00	1.000e+00	▁▁▇▁▁
barbara_corcoran_investment_amount	940	0.09	143520.41	137398.90	12500.00	50000.00	100000.00	200000.00	1.000e+06	▇▂▁▁▁
barbara_corcoran_investment_equity	940	0.09	23.98	13.09	5.00	15.00	20.00	32.25	5.500e+01	▇▇▂▂▂
mark_cuban_investment_amount	857	0.17	245649.17	278613.24	12500.00	75000.00	150000.00	300000.00	2.000e+06	▇▁▁▁▁
mark_cuban_investment_equity	857	0.17	18.80	15.40	2.50	10.00	15.00	25.00	1.000e+02	▇▃▁▁▁
lori_greiner_investment_amount	882	0.15	205993.59	198022.87	17500.00	75000.00	150000.00	250000.00	1.000e+06	▇▂▁▁▁
lori_greiner_investment_equity	882	0.15	16.61	12.03	0.00	10.00	12.50	20.00	6.500e+01	▇▅▁▁▁
robert_herjavec_investment_amount	938	0.10	290973.33	581148.81	5000.00	86458.33	150000.00	300000.00	5.000e+06	▇▁▁▁▁
robert_herjavec_investment_equity	938	0.10	18.66	13.36	0.00	10.00	15.00	25.00	1.000e+02	▇▃▁▁▁
daymond_john_investment_amount	943	0.09	186805.26	319390.55	5000.00	50000.00	100000.00	240000.00	3.000e+06	▇▁▁▁▁
daymond_john_investment_equity	943	0.09	26.06	16.18	0.00	15.82	25.00	33.30	1.000e+02	▇▇▁▁▁
kevin_o_leary_investment_amount	942	0.09	236276.04	315926.33	20000.00	80000.00	150000.00	250000.00	2.500e+06	▇▁▁▁▁
kevin_o_leary_investment_equity	942	0.09	15.83	11.65	0.00	8.56	10.83	25.00	5.000e+01	▇▃▂▁▁
guest_investment_amount	969	0.07	216606.28	239754.19	0.00	75000.00	125000.00	250000.00	1.250e+06	▇▂▁▁▁
guest_investment_equity	969	0.07	16.71	15.52	0.00	10.00	11.25	20.00	1.000e+02	▇▂▁▁▁
barbara_corcoran_present	143	0.86	0.56	0.50	0.00	0.00	1.00	1.00	1.000e+00	▆▁▁▁▇
mark_cuban_present	142	0.86	0.90	0.30	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇
lori_greiner_present	142	0.86	0.75	0.43	0.00	0.75	1.00	1.00	1.000e+00	▂▁▁▁▇
robert_herjavec_present	142	0.86	0.88	0.33	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇
daymond_john_present	143	0.86	0.66	0.47	0.00	0.00	1.00	1.00	1.000e+00	▅▁▁▁▇
kevin_o_leary_present	143	0.86	0.96	0.21	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇
kevin_harrington_present	143	0.86	0.95	0.23	0.00	1.00	1.00	1.00	1.000e+00	▁▁▁▁▇

glimpse(shark_tank)

Rows: 1,038
Columns: 48
$ season_number                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ season_start                       <date> 2009-08-09, 2009-08-09, 2009-08-09…
$ season_end                         <date> 2010-02-05, 2010-02-05, 2010-02-05…
$ episode_number                     <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
$ pitch_number                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, …
$ original_air_date                  <date> 2009-08-09, 2009-08-09, 2009-08-09…
$ startup_name                       <chr> "AvaTheElephant", "Mr.Tod'sPieFacto…
$ industry                           <chr> "Health/Wellness", "Food and Bevera…
$ pitchers_gender                    <chr> "Female", "Male", "Male", "Male", "…
$ pitchers_city                      <chr> "Atlanta", "Somerset", "Cary", "Tam…
$ pitchers_state                     <chr> "GA", "NJ", "NC", "FL", "MN", "CA",…
$ entrepreneur_names                 <chr> "Tiffany Krumins", "Tod Wilson", "K…
$ multiple_entrepreneurs             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ us_viewership                      <dbl> 4.15, 4.15, 4.15, 4.15, 4.15, 5.59,…
$ original_ask_amount                <dbl> 50000, 460000, 1200000, 250000, 100…
$ original_offered_equity            <dbl> 15, 10, 10, 25, 15, 15, 10, 10, 20,…
$ valuation_requested                <dbl> 333333, 4600000, 12000000, 1000000,…
$ got_deal                           <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,…
$ total_deal_amount                  <dbl> 50000, 460000, NA, NA, NA, 500000, …
$ total_deal_equity                  <dbl> 55, 50, NA, NA, NA, 50, 100, NA, NA…
$ deal_valuation                     <dbl> 90909, 920000, NA, NA, NA, 1000000,…
$ number_of_sharks_in_deal           <dbl> 1, 2, NA, NA, NA, 2, 5, NA, NA, NA,…
$ investment_amount_per_shark        <dbl> 50000, 230000, NA, NA, NA, 250000, …
$ equity_per_shark                   <dbl> 55.0, 25.0, NA, NA, NA, 25.0, 20.0,…
$ royalty_deal                       <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, …
$ loan                               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ barbara_corcoran_investment_amount <dbl> 50000, 230000, NA, NA, NA, NA, 5000…
$ barbara_corcoran_investment_equity <dbl> 55, 25, NA, NA, NA, NA, 20, NA, NA,…
$ mark_cuban_investment_amount       <dbl> NA, NA, NA, NA, NA, NA, 50000, NA, …
$ mark_cuban_investment_equity       <dbl> NA, NA, NA, NA, NA, NA, 20, NA, NA,…
$ lori_greiner_investment_amount     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lori_greiner_investment_equity     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ robert_herjavec_investment_amount  <dbl> NA, NA, NA, NA, NA, 250000, 50000, …
$ robert_herjavec_investment_equity  <dbl> NA, NA, NA, NA, NA, 25.0, 20.0, NA,…
$ daymond_john_investment_amount     <dbl> NA, 230000, NA, NA, NA, NA, 50000, …
$ daymond_john_investment_equity     <dbl> NA, 25, NA, NA, NA, NA, 20, NA, NA,…
$ kevin_o_leary_investment_amount    <dbl> NA, NA, NA, NA, NA, 250000, 50000, …
$ kevin_o_leary_investment_equity    <dbl> NA, NA, NA, NA, NA, 25.0, 20.0, NA,…
$ guest_investment_amount            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ guest_investment_equity            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ barbara_corcoran_present           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ mark_cuban_present                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ lori_greiner_present               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ robert_herjavec_present            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ daymond_john_present               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ kevin_o_leary_present              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ kevin_harrington_present           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ guest_name                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
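
The cleaning discussion left open whether an NA in total_deal_amount always means "no deal" or sometimes a genuinely missing value. One way to check would be to cross-tabulate got_deal against the missingness of the amount. The sketch below uses a small made-up data frame (toy, with illustrative values that are not from the real data) rather than shark_tank itself, so it only shows the shape of the check.

```r
library(dplyr)

# Toy stand-in for shark_tank; values are illustrative only
toy <- data.frame(
  got_deal          = c(1, 1, 0, 0, 1),
  total_deal_amount = c(50000, 460000, NA, NA, NA)
)

# Cross-tabulate deal status against missingness of the deal amount
na_check <- toy |>
  mutate(amount_missing = is.na(total_deal_amount)) |>
  count(got_deal, amount_missing)

na_check

# Rows with a deal but no recorded amount are genuinely missing values
truly_missing <- toy |>
  filter(got_deal == 1, is.na(total_deal_amount))

nrow(truly_missing)
```

If the same check on the real data returned zero rows for truly_missing, every NA in that column would correspond to a pitch without a deal.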

Data description

Motivation

Why was this dataset created?
- The dataset was created to identify any trends among investors (sharks), entrepreneurs, and the deals that are made between them in seasons 1 through 14 of Shark Tank US. It allows users to analyze information on all brands that pitched, whether they received any investment or debt, and which sharks invested in the idea.
Who created the dataset?
- The dataset was created by Satya Thirumani with the help of contributors Arsalan ur Rehman and Jatila Molagoda, along with 2 other unnamed collaborators.
Who funded the creation of the dataset?
- There is no mention of any associated grants that funded the creation of the dataset.

Composition

What are the observations (rows) and the attributes (columns)?
- Each observation represents one pitch from an episode of the show. Each attribute provides information on the pitch, such as details about the entrepreneur and their brand, the original ask amount (in USD) and offered equity (in percent), actions taken by each shark, and whether or not a deal was ultimately reached.
How many instances are there in total (of each type, if appropriate)?
- There are 1,038 observations and 52 attributes in total within the original dataset. After our cleaning, 48 attributes remain.
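
The ask amount and offered equity together imply a valuation, which is what our research question about lowered valuations depends on. As a minimal sketch, assuming valuation is simply the amount divided by the equity share, the first two pitches shown in the glimpse() output can be reproduced as below. The helper implied_valuation() and the lowered flag are our own illustrative names, not columns in the dataset.

```r
# Implied valuation = amount / (equity percentage / 100)
# (assumed formula; hypothetical helper, not part of the dataset)
implied_valuation <- function(amount, equity_pct) {
  amount / (equity_pct / 100)
}

# First two pitches as shown in the glimpse() output
ask_amount     <- c(50000, 460000)
offered_equity <- c(15, 10)
deal_amount    <- c(50000, 460000)
deal_equity    <- c(55, 50)

requested <- implied_valuation(ask_amount, offered_equity)
agreed    <- implied_valuation(deal_amount, deal_equity)

round(requested)
round(agreed)

# TRUE where the final deal valued the company below the original ask
lowered <- agreed < requested
lowered
```

For these two rows the results agree with valuation_requested and deal_valuation in the data up to rounding, which suggests (but does not prove) that the dataset's valuation columns were computed the same way.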

Collection Process

What processes might have influenced what data was observed and recorded and what was not?
- Missing data could be the result of a deal not being reached between the entrepreneur and sharks, leaving cells as NA where these final details would otherwise be included. Additionally, missing values could also be due to certain sharks not investing in pitches where others do, leaving those cells empty as well. Other data not recorded were likely the result of companies not having websites, pitchers not disclosing their age, or there being no royalty deal.
If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
- The people involved in the dataset are the Sharks and all of the entrepreneurs across the seasons who delivered a pitch during the show. The data collected was based on information that was disclosed during the pitch, as well as background information on the entrepreneur and their brand. These individuals were likely unaware that the data was being collected for this dataset. However, all of the information is publicly accessible online.

Pre-processing/cleaning/labeling

What pre-processing was done, and how did the data come to be in the form that you are using?
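- No observations or attributes were discarded before we downloaded the data, meaning that all information visible in shark_tank.csv is identical to the original dataset.
- Our own cleaning, described under "Data collection and cleaning," converted the column names to snake case, removed four columns (pitchers_average_age, company_website, business_description, and notes), and converted season_start, season_end, and original_air_date to dates.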

Uses

Has the dataset been used for any tasks already?
- Since its publication, the dataset has been downloaded 963 times by various users. It is unknown which tasks these individuals used the dataset for.

Distribution

How will the dataset be distributed (e.g. tarball on website, API, GitHub)?
- The dataset is distributed on Satya Thirumani’s page on Kaggle: https://www.kaggle.com/datasets/thirumani/shark-tank-us-dataset?datasetId=2885132

Maintenance

Who will be supporting/hosting/maintaining the dataset?
- Satya Thirumani is supporting/maintaining the dataset.
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
- Updates will be posted on the dataset webpage listed above. It is expected to be updated quarterly, with the most recent update occurring on March 27th, 2023.

Data limitations

One limitation of this data set is that its contents cannot really be generalized to anything beyond the TV show “Shark Tank.” All of the information provided seems applicable only to an analysis of the show itself. Another possible limitation is the quantity of NA values. Some columns in the data set have a high frequency of these values, which could affect some analyses.

Exploratory data analysis

library(viridis)

Loading required package: viridisLite

shark_tank_plots <- shark_tank |>
  mutate(got_deal = if_else(got_deal == 0, "No Deal", "Deal")) |>
  group_by(industry, got_deal) |>
  summarize(count = n()) |>
  group_by(industry) |>
  mutate(total_count = sum(count)) |>
  ungroup() |>
  mutate(industry = fct_reorder(industry, total_count))

`summarise()` has grouped output by 'industry'. You can override using the
`.groups` argument.

ggplot(shark_tank_plots, aes(y = industry, x = count, fill = got_deal)) +
  geom_col() +
  labs(x = "Count",
       y = "Industry",
       title = "Shark Tank Pitches: Deals vs. No Deals by Industry",
       subtitle = "Data from Shark Tank Episodes, 2009-2022",
       fill = "Deal Status"
       ) +
  scale_fill_brewer(palette = "Dark2")

ggplot(shark_tank_plots, aes(y = industry, x = count, fill = got_deal)) +
  geom_col(position = "fill") +
  labs(x = "Proportion",
       y = "Industry",
       title = "Shark Tank Pitches: Deals vs. No Deals by Industry",
       subtitle = "Data from Shark Tank Episodes, 2009-2022",
       fill = "Deal Status"
       ) +
  scale_fill_brewer(palette = "Dark2")

Above, I created two plots displaying Shark Tank pitches, their industries, and their deal status. The first plot orders the industries from most to least common, and each bar stacks the pitches that did and did not get deals. It shows wide variation in how often each industry is pitched: food and beverage had over 200 products pitched, while clean/greenTech has fewer than 25. The number of no deals generally increases with the total number of pitches, but the growth is not fully consistent. For example, automotive has a visibly smaller proportion of no deals than its neighbors. The ratio of deals to no deals becomes clearer in the second plot, a proportional stacked bar plot. I kept the order of industries the same for comparison, and it shows no strong relationship between how common an industry is and its rate of deals. The industries with the highest deal rates are automotive (by a dramatic margin), lifestyle/home, and children/education. The industries with the lowest deal rates are electronics, other, and travel.
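
The proportions read off the second plot could also be computed directly as a table. The sketch below is one way to do that with dplyr; it runs on a small made-up data frame (toy, illustrative values only), so the numbers it produces say nothing about the real industries.

```r
library(dplyr)

# Toy stand-in for shark_tank; values are illustrative only
toy <- data.frame(
  industry = c("Automotive", "Automotive", "Travel", "Travel", "Travel"),
  got_deal = c(1, 1, 0, 1, 0)
)

# Because got_deal is coded 0/1, its mean within a group is the deal rate
deal_rates <- toy |>
  group_by(industry) |>
  summarize(pitches = n(),
            deal_rate = mean(got_deal),
            .groups = "drop") |>
  arrange(desc(deal_rate))

deal_rates
```

Keeping the pitches column next to deal_rate makes it easier to notice when a high or low rate rests on only a handful of pitches.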

Industry is categorical rather than numerical, so it cannot enter a linear regression directly as a single numeric predictor. It would first need to be encoded as indicator (dummy) variables, which could be done in a later data exploration. However, I do not think this would yield much new information.
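
As a rough illustration of what that encoding looks like, base R's model.matrix() expands a factor into 0/1 indicator columns, with the first level acting as the reference category. The data frame below is made up for the example and uses only three industry labels.

```r
# Toy data; values are illustrative only
toy <- data.frame(
  industry = factor(c("Automotive", "Electronics", "Travel", "Automotive")),
  got_deal = c(1, 0, 0, 1)
)

# Expand the factor into indicator columns; "Automotive" (the first
# level alphabetically) becomes the reference and gets no column
design <- model.matrix(got_deal ~ industry, data = toy)

colnames(design)
design
```

Each remaining column is 1 for pitches in that industry and 0 otherwise, so a regression coefficient on it would be read as a difference from the reference industry.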

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

val_deal_fit <- linear_reg() |>
  fit(got_deal ~ valuation_requested, data = shark_tank)

tidy(val_deal_fit)

# A tibble: 2 × 5
  term                estimate     std.error statistic   p.value
  <chr>                  <dbl>         <dbl>     <dbl>     <dbl>
1 (Intercept)          5.88e-1 0.0184           32.0   6.36e-157
2 valuation_requested -3.14e-9 0.00000000319    -0.984 3.25e-  1

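The chunk above fits a linear model of got_deal on valuation_requested, exploring the relationship between the requested valuation and whether a deal was reached. The estimated slope is very small (about -3.14e-9 per dollar of requested valuation) and not statistically significant (p ≈ 0.325), so this model shows no clear linear relationship. A next step in exploration would be graphing it, perhaps also including the initial ask amount.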
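
Because got_deal is a 0/1 outcome, a logistic regression might be a natural alternative to the linear model above. This is not the model we fit; it is a minimal sketch using base R's glm() on simulated data, where the sample size, the coefficients used to simulate, and the rescaling to millions of dollars are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in for shark_tank; not real data
n <- 200
toy <- data.frame(valuation_requested = runif(n, 4e4, 1e7))

# Assumed data-generating process: deal probability falls slightly
# as the requested valuation rises
toy$got_deal <- rbinom(n, 1, plogis(0.5 - 1e-7 * toy$valuation_requested))

# Logistic regression; valuation is rescaled to millions of dollars
# so the coefficient is on a readable scale
logit_fit <- glm(got_deal ~ I(valuation_requested / 1e6),
                 family = binomial, data = toy)

coefs <- summary(logit_fit)$coefficients
coefs

# Predicted probability of a deal at a $1M requested valuation
p_1m <- predict(logit_fit,
                newdata = data.frame(valuation_requested = 1e6),
                type = "response")
p_1m
```

Unlike the linear fit, the predicted values from this kind of model always fall between 0 and 1, which makes them easier to interpret as deal probabilities.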

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

Do we need to get more specific with the research question? Is it all right if our ideas and plan change a little as the project goes on?