Deal or No Deal - Investigating Shark Tank Deals Throughout the Show

Exploratory data analysis

Research question(s)

  1. What makes a Shark Tank pitch more likely to reach a deal on the show? How do a business's valuation and its industry affect the likelihood of reaching a deal?

    This question gives insight into why a pitch on Shark Tank does or does not end in a deal. The answer may reveal that certain industries, investors, or deal structures have higher success rates, which in turn could guide future entrepreneurs looking to pitch their business on the show.

    Although the question is very broad, we plan to focus on two categories of variables that may influence whether a pitch reaches a deal: the company's valuation, including whether the company lowered its valuation to reach a deal, and the industry the company belongs to. We may also break the industry analysis down by shark to see whether their preferences differ with their own experience in, and familiarity with, certain industries such as retail or technology.

    Many companies apply to the show, but only a small subset are chosen to appear, and only a subset of those receive deals. Startups that already have some level of success tend to be more likely to appear. Because Shark Tank is ultimately a TV show, however, the producers likely pick a mix of companies with promise and companies with entertaining premises, so we are curious whether a similar range applies to successfully closing a deal once a company has secured an appearance. It would also be interesting to look at how appearing on Shark Tank has affected these companies afterward, but given the size of the data and the many independent startups involved, finding that information would mean researching each company individually, so we are not pursuing this option.
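
    To make the valuation comparison concrete: an ask of a given amount for a given equity percentage implies a valuation of amount / (equity / 100). The sketch below applies that arithmetic to the first two pitches as they appear in the glimpse() output later in this report. The tibble `pitches` and the derived column names (`implied_ask_valuation`, `implied_deal_valuation`, `lowered_valuation`) are illustrative names of our own, not columns in the dataset, and the flag can only be computed for pitches that received a deal.

```r
library(dplyr)

# First two pitches, values copied from the dataset preview (equity in percent).
pitches <- tibble(
  pitch_number            = c(1, 2),
  original_ask_amount     = c(50000, 460000),
  original_offered_equity = c(15, 10),
  total_deal_amount       = c(50000, 460000),
  total_deal_equity       = c(55, 50)
)

valuations <- pitches |>
  mutate(
    # Valuation implied by the opening ask
    implied_ask_valuation  = original_ask_amount / (original_offered_equity / 100),
    # Valuation implied by the final deal
    implied_deal_valuation = total_deal_amount / (total_deal_equity / 100),
    # TRUE when the deal values the company below the opening ask
    lowered_valuation      = implied_deal_valuation < implied_ask_valuation
  )

print(valuations)
```

    The implied values can be compared against the valuation_requested and deal_valuation columns already present in the data.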

Data collection and cleaning

This dataset largely came in an analysis-ready format. After downloading the data from Kaggle and converting the column names to snake case, our main cleaning job was deciding which variables could be removed: those that were not helpful for our research question and those with too many missing values to be useful. We removed pitchers_average_age (many missing values), company_website (missing values and unhelpful for analysis), business_description (redundant with, and harder to analyze than, industry), and notes (unhelpful for analysis). We also converted the character values of season_start, season_end, and original_air_date to dates using lubridate to make analysis easier. The last issue we reckoned with was that the original dataset gives NA for columns like total_deal_amount when a pitch did not get a deal, so we considered changing those NA values to 0. We decided against it for two reasons: in some cases it is not clear whether an NA reflects no deal or a genuinely missing value, and our primary variable of analysis, got_deal, has no missing values. To analyze the amounts invested more specifically, we can restrict the analysis to the pitches that did get a deal.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.0
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(skimr)

# Columns we decided to drop during cleaning
unwanted <- c('pitchers_average_age', 'company_website', 'business_description', 'notes')
shark_tank <- read_csv("data/shark_tank.csv") |>
  janitor::clean_names() |>
  # all_of() is the supported way to select with an external character vector
  select(!all_of(unwanted)) |>
  # Parse the date strings into Date objects
  mutate(season_start = lubridate::dmy(season_start),
         season_end = lubridate::dmy(season_end),
         original_air_date = lubridate::dmy(original_air_date))
Rows: 1038 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Season Start, Season End, Original Air Date, Startup Name, Industr...
dbl (38): Season Number, Episode Number, Pitch Number, Multiple Entrepreneur...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(shark_tank)
Data summary
Name shark_tank
Number of rows 1038
Number of columns 48
_______________________
Column type frequency:
character 7
Date 3
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
startup_name 0 1.00 3 32 0 1036 0
industry 0 1.00 6 23 0 15 0
pitchers_gender 5 1.00 4 10 0 3 0
pitchers_city 540 0.48 3 18 0 250 0
pitchers_state 299 0.71 2 6 0 46 0
entrepreneur_names 557 0.46 8 60 0 479 0
guest_name 837 0.19 9 17 0 24 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
season_start 0 1.00 2009-08-09 2022-09-23 2015-09-25 14
season_end 7 0.99 2010-02-05 2022-05-20 2016-05-20 13
original_air_date 408 0.61 2009-08-09 2022-09-30 2014-01-13 154

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
season_number 0 1.00 6.76 3.11 1.00 4.00 7.00 9.00 1.400e+01 ▃▇▅▇▁
episode_number 0 1.00 12.12 7.74 1.00 5.00 11.00 18.00 2.900e+01 ▇▆▅▅▂
pitch_number 0 1.00 519.50 299.79 1.00 260.25 519.50 778.75 1.038e+03 ▇▇▇▇▇
multiple_entrepreneurs 487 0.53 0.35 0.48 0.00 0.00 0.00 1.00 1.000e+00 ▇▁▁▁▅
us_viewership 416 0.60 6.10 1.35 2.31 5.15 6.38 7.11 8.640e+00 ▁▃▅▇▃
original_ask_amount 0 1.00 281798.65 379843.24 10000.00 100000.00 200000.00 300000.00 5.000e+06 ▇▁▁▁▁
original_offered_equity 0 1.00 14.64 8.91 1.50 10.00 10.00 20.00 1.000e+02 ▇▁▁▁▁
valuation_requested 0 1.00 3163290.63 4804725.88 40000.00 600000.00 1485294.00 3333333.00 4.500e+07 ▇▁▁▁▁
got_deal 0 1.00 0.58 0.49 0.00 0.00 1.00 1.00 1.000e+00 ▆▁▁▁▇
total_deal_amount 436 0.58 290921.37 378899.37 0.00 100000.00 200000.00 300000.00 5.000e+06 ▇▁▁▁▁
total_deal_equity 436 0.58 25.51 16.18 0.00 15.00 25.00 33.00 1.000e+02 ▇▇▂▁▁
deal_valuation 436 0.58 2042821.14 3718413.81 0.00 336206.75 800000.00 2000000.00 3.600e+07 ▇▁▁▁▁
number_of_sharks_in_deal 436 0.58 1.32 0.63 1.00 1.00 1.00 2.00 5.000e+00 ▇▂▁▁▁
investment_amount_per_shark 436 0.58 245115.72 350301.99 0.00 75000.00 150000.00 300000.00 5.000e+06 ▇▁▁▁▁
equity_per_shark 436 0.58 21.55 15.17 0.00 10.00 20.00 25.00 1.000e+02 ▇▅▁▁▁
royalty_deal 987 0.05 1.00 0.00 1.00 1.00 1.00 1.00 1.000e+00 ▁▁▇▁▁
loan 1001 0.04 1.00 0.00 1.00 1.00 1.00 1.00 1.000e+00 ▁▁▇▁▁
barbara_corcoran_investment_amount 940 0.09 143520.41 137398.90 12500.00 50000.00 100000.00 200000.00 1.000e+06 ▇▂▁▁▁
barbara_corcoran_investment_equity 940 0.09 23.98 13.09 5.00 15.00 20.00 32.25 5.500e+01 ▇▇▂▂▂
mark_cuban_investment_amount 857 0.17 245649.17 278613.24 12500.00 75000.00 150000.00 300000.00 2.000e+06 ▇▁▁▁▁
mark_cuban_investment_equity 857 0.17 18.80 15.40 2.50 10.00 15.00 25.00 1.000e+02 ▇▃▁▁▁
lori_greiner_investment_amount 882 0.15 205993.59 198022.87 17500.00 75000.00 150000.00 250000.00 1.000e+06 ▇▂▁▁▁
lori_greiner_investment_equity 882 0.15 16.61 12.03 0.00 10.00 12.50 20.00 6.500e+01 ▇▅▁▁▁
robert_herjavec_investment_amount 938 0.10 290973.33 581148.81 5000.00 86458.33 150000.00 300000.00 5.000e+06 ▇▁▁▁▁
robert_herjavec_investment_equity 938 0.10 18.66 13.36 0.00 10.00 15.00 25.00 1.000e+02 ▇▃▁▁▁
daymond_john_investment_amount 943 0.09 186805.26 319390.55 5000.00 50000.00 100000.00 240000.00 3.000e+06 ▇▁▁▁▁
daymond_john_investment_equity 943 0.09 26.06 16.18 0.00 15.82 25.00 33.30 1.000e+02 ▇▇▁▁▁
kevin_o_leary_investment_amount 942 0.09 236276.04 315926.33 20000.00 80000.00 150000.00 250000.00 2.500e+06 ▇▁▁▁▁
kevin_o_leary_investment_equity 942 0.09 15.83 11.65 0.00 8.56 10.83 25.00 5.000e+01 ▇▃▂▁▁
guest_investment_amount 969 0.07 216606.28 239754.19 0.00 75000.00 125000.00 250000.00 1.250e+06 ▇▂▁▁▁
guest_investment_equity 969 0.07 16.71 15.52 0.00 10.00 11.25 20.00 1.000e+02 ▇▂▁▁▁
barbara_corcoran_present 143 0.86 0.56 0.50 0.00 0.00 1.00 1.00 1.000e+00 ▆▁▁▁▇
mark_cuban_present 142 0.86 0.90 0.30 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇
lori_greiner_present 142 0.86 0.75 0.43 0.00 0.75 1.00 1.00 1.000e+00 ▂▁▁▁▇
robert_herjavec_present 142 0.86 0.88 0.33 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇
daymond_john_present 143 0.86 0.66 0.47 0.00 0.00 1.00 1.00 1.000e+00 ▅▁▁▁▇
kevin_o_leary_present 143 0.86 0.96 0.21 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇
kevin_harrington_present 143 0.86 0.95 0.23 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇
glimpse(shark_tank)
Rows: 1,038
Columns: 48
$ season_number                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ season_start                       <date> 2009-08-09, 2009-08-09, 2009-08-09…
$ season_end                         <date> 2010-02-05, 2010-02-05, 2010-02-05…
$ episode_number                     <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,…
$ pitch_number                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, …
$ original_air_date                  <date> 2009-08-09, 2009-08-09, 2009-08-09…
$ startup_name                       <chr> "AvaTheElephant", "Mr.Tod'sPieFacto…
$ industry                           <chr> "Health/Wellness", "Food and Bevera…
$ pitchers_gender                    <chr> "Female", "Male", "Male", "Male", "…
$ pitchers_city                      <chr> "Atlanta", "Somerset", "Cary", "Tam…
$ pitchers_state                     <chr> "GA", "NJ", "NC", "FL", "MN", "CA",…
$ entrepreneur_names                 <chr> "Tiffany Krumins", "Tod Wilson", "K…
$ multiple_entrepreneurs             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ us_viewership                      <dbl> 4.15, 4.15, 4.15, 4.15, 4.15, 5.59,…
$ original_ask_amount                <dbl> 50000, 460000, 1200000, 250000, 100…
$ original_offered_equity            <dbl> 15, 10, 10, 25, 15, 15, 10, 10, 20,…
$ valuation_requested                <dbl> 333333, 4600000, 12000000, 1000000,…
$ got_deal                           <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1,…
$ total_deal_amount                  <dbl> 50000, 460000, NA, NA, NA, 500000, …
$ total_deal_equity                  <dbl> 55, 50, NA, NA, NA, 50, 100, NA, NA…
$ deal_valuation                     <dbl> 90909, 920000, NA, NA, NA, 1000000,…
$ number_of_sharks_in_deal           <dbl> 1, 2, NA, NA, NA, 2, 5, NA, NA, NA,…
$ investment_amount_per_shark        <dbl> 50000, 230000, NA, NA, NA, 250000, …
$ equity_per_shark                   <dbl> 55.0, 25.0, NA, NA, NA, 25.0, 20.0,…
$ royalty_deal                       <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, …
$ loan                               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ barbara_corcoran_investment_amount <dbl> 50000, 230000, NA, NA, NA, NA, 5000…
$ barbara_corcoran_investment_equity <dbl> 55, 25, NA, NA, NA, NA, 20, NA, NA,…
$ mark_cuban_investment_amount       <dbl> NA, NA, NA, NA, NA, NA, 50000, NA, …
$ mark_cuban_investment_equity       <dbl> NA, NA, NA, NA, NA, NA, 20, NA, NA,…
$ lori_greiner_investment_amount     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lori_greiner_investment_equity     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ robert_herjavec_investment_amount  <dbl> NA, NA, NA, NA, NA, 250000, 50000, …
$ robert_herjavec_investment_equity  <dbl> NA, NA, NA, NA, NA, 25.0, 20.0, NA,…
$ daymond_john_investment_amount     <dbl> NA, 230000, NA, NA, NA, NA, 50000, …
$ daymond_john_investment_equity     <dbl> NA, 25, NA, NA, NA, NA, 20, NA, NA,…
$ kevin_o_leary_investment_amount    <dbl> NA, NA, NA, NA, NA, 250000, 50000, …
$ kevin_o_leary_investment_equity    <dbl> NA, NA, NA, NA, NA, 25.0, 20.0, NA,…
$ guest_investment_amount            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ guest_investment_equity            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ barbara_corcoran_present           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ mark_cuban_present                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ lori_greiner_present               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ robert_herjavec_present            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ daymond_john_present               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ kevin_o_leary_present              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ kevin_harrington_present           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ guest_name                         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA,…
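
One way to probe whether an NA in total_deal_amount means "no deal" rather than a genuinely missing value is to cross-tabulate missingness against got_deal. The sketch below does this on a made-up five-row tibble (`toy` and its values are illustrative, not taken from the data); on the real data the same calls would be run on shark_tank. It also shows the approach of summarizing deal amounts only among pitches that got a deal.

```r
library(dplyr)

# Made-up rows: the last one got a deal but has no recorded amount.
toy <- tibble(
  got_deal          = c(1, 1, 0, 0, 1),
  total_deal_amount = c(50000, 460000, NA, NA, NA)
)

# How often is the amount missing, split by deal status?
na_check <- toy |>
  count(got_deal, amount_missing = is.na(total_deal_amount))

# Summarize amounts only among pitches that got a deal.
deal_summary <- toy |>
  filter(got_deal == 1) |>
  summarize(n_deals = n(),
            median_amount = median(total_deal_amount, na.rm = TRUE))

print(na_check)
print(deal_summary)
```

Any row with got_deal equal to 1 and a missing amount would point to a true missing value rather than a "no deal" placeholder.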

Data description

Motivation

  • Why was this dataset created?

    • The dataset was created to identify any trends among investors (sharks), entrepreneurs, and the deals that are made between them in seasons 1 through 14 on Shark Tank US. It allows users to analyze information of all brands that pitched, whether they received any investment or debt, and which Sharks invested in the idea.
  • Who created the dataset?

    • The dataset was created by Satya Thirumani with the help of contributors Arsalan ur Rehman and Jatila Molagoda, along with 2 other unnamed collaborators.
  • Who funded the creation of the dataset?

    • There is no mention of any associated grants that funded the creation of the dataset.

Composition

  • What are the observations (rows) and the attributes (columns)?

    • Each observation represents one pitch from an episode of the show. The attributes describe the pitch: details about the entrepreneur and their brand, the original ask amount (in USD) and offered equity (in percent), actions taken by each shark, and whether or not a deal was ultimately reached.
  • How many instances are there in total (of each type, if appropriate)?

    • There are 1,038 observations and 52 attributes in total within the dataset.

Collection Process

  • What processes might have influenced what data was observed and recorded and what was not?

    • Missing data could be the result of a deal not being reached between the entrepreneur and sharks, leaving cells as NA where these final details would otherwise be included. Additionally, missing values could also be due to certain sharks not investing in pitches where others do, leaving those cells empty as well. Other data not recorded were likely the result of companies not having websites, pitchers not disclosing their age, or there being no royalty deal.
  • If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

    • The people involved in the dataset are the Sharks and all of the entrepreneurs who delivered a pitch on the show across the seasons. The data were collected from information disclosed during the pitch, as well as background information on the entrepreneurs and their brands. These individuals were likely unaware that this dataset was being compiled; however, all of the information is publicly accessible online.

Pre-processing/cleaning/labeling

  • What pre-processing was done, and how did the data come to be in the form that you are using?
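    • No observations or attributes were discarded in pre-processing the raw file, so shark_tank.csv is identical to the original Kaggle dataset. The cleaning we applied after loading it (converting column names to snake case, dropping four columns, and parsing three date columns) is described under "Data collection and cleaning" above.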

Uses

  • Has the dataset been used for any tasks already?

    • Since its publication, the dataset has been downloaded 963 times by various users. It is unknown which tasks these individuals used the dataset for.

Distribution

  • How will the dataset be distributed (e.g. tarball on website, API, GitHub)?

    • The dataset is distributed on Satya Thirumani’s page on Kaggle: https://www.kaggle.com/datasets/thirumani/shark-tank-us-dataset?datasetId=2885132

Maintenance

  • Who will be supporting/hosting/maintaining the dataset?

    • Satya Thirumani is supporting/maintaining the dataset.
  • Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

    • Updates will be posted on the dataset webpage listed above. It is expected to be updated quarterly, with the most recent update occurring on March 27th, 2023.

Data limitations

One limitation of this dataset is that its contents cannot really be generalized beyond the TV show "Shark Tank." All of the information provided seems applicable only to an analysis of the show itself. Another possible limitation is the quantity of NA values. Some columns have a high share of missing values, which could limit the analyses that rely on them.

Exploratory data analysis

library(viridis)
Loading required package: viridisLite
shark_tank_plots <- shark_tank |>
  mutate(got_deal = if_else(got_deal == 0, "No Deal", "Deal")) |>
  # Count pitches per industry and deal status
  group_by(industry, got_deal) |>
  summarize(count = n(), .groups = "drop") |>
  # Order industries by their total number of pitches
  group_by(industry) |>
  mutate(total_count = sum(count)) |>
  ungroup() |>
  mutate(industry = fct_reorder(industry, total_count))
ggplot(shark_tank_plots, aes(y = industry, x = count, fill = got_deal)) +
  geom_col() +
  labs(x = "Count",
       y = "Industry",
       title = "Shark Tank Pitches: Deals vs. No Deals by Industry",
       subtitle = "Data from Shark Tank Episodes, 2009-2022",
       fill = "Deal Status"
       ) +
  scale_fill_brewer(palette = "Dark2")

ggplot(shark_tank_plots, aes(y = industry, x = count, fill = got_deal)) +
  # position = "fill" rescales each bar to 1, so the x-axis shows shares
  geom_col(position = "fill") +
  scale_x_continuous(labels = scales::label_percent()) +
  labs(x = "Percentage of pitches",
       y = "Industry",
       title = "Shark Tank Pitches: Deals vs. No Deals by Industry",
       subtitle = "Data from Shark Tank Episodes, 2009-2022",
       fill = "Deal Status"
       ) +
  scale_fill_brewer(palette = "Dark2")

Above, I created two plots displaying Shark Tank pitches, their industries, and their deal status. The first plot orders the industries from most to least popular and stacks, for each industry, the pitches that did and did not get deals. It shows wide variation in how often each industry is pitched: food and beverage had over 200 products pitched, while clean/greenTech had fewer than 25. The number of no-deals generally grows in proportion to the total number of pitches, though not perfectly. For example, automotive has a visibly smaller proportion of no-deals than its neighbors.

The proportion of deals to no-deals becomes clearer in the second plot, a proportional stacked bar plot. I kept the order of industries the same for comparison, and it shows no strong relationship between how popular an industry is and its rate of deals. The categories that most often receive deals are automotive (by a dramatic margin), lifestyle/home, and children/education. The least successful categories are electronics, other, and travel.
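
The deal rates that the proportional plot displays can also be computed directly as a table. A minimal sketch on made-up data follows; the tibble `toy` and its seven rows are illustrative only, and on the real data the same pipeline would start from shark_tank, where got_deal is coded 0/1 so its mean is the share of pitches that got a deal.

```r
library(dplyr)

# Made-up pitches in two industries (not the real counts).
toy <- tibble(
  industry = c("Automotive", "Automotive", "Automotive",
               "Travel", "Travel", "Travel", "Travel"),
  got_deal = c(1, 1, 0, 0, 0, 1, 0)
)

deal_rates <- toy |>
  group_by(industry) |>
  summarize(pitches   = n(),
            deal_rate = mean(got_deal)) |>   # mean of a 0/1 column = proportion of deals
  arrange(desc(deal_rate))

print(deal_rates)
```

Keeping the pitch count next to each rate helps flag industries whose rate rests on only a handful of pitches.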

Industry cannot enter a linear regression as a single numeric variable because its categories are not numerical, but in a later exploration it could be encoded as indicator (dummy) variables, one per industry. However, I do not think this would yield much new information beyond the plots.
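
One way to turn the industry categories into numeric columns is indicator (dummy) coding, which R applies automatically to factors in a model formula. The sketch below uses base R's model.matrix() on a made-up data frame (`toy` is illustrative, not the real data) to show what that encoding looks like: one 0/1 column per industry except the first level, which is absorbed into the intercept as the reference category.

```r
# Made-up rows; industry is stored as a factor.
toy <- data.frame(
  industry = factor(c("Automotive", "Travel", "Electronics", "Travel")),
  got_deal = c(1, 0, 0, 1)
)

# Build the design matrix a regression on industry would use.
X <- model.matrix(~ industry, data = toy)

print(colnames(X))
print(X)
```

With this encoding, each coefficient would be read as a difference from the reference industry rather than as a slope.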

library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.1.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.6     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
val_deal_fit <- linear_reg() |>
  fit(got_deal ~ valuation_requested, data = shark_tank)

tidy(val_deal_fit)
# A tibble: 2 × 5
  term                estimate     std.error statistic   p.value
  <chr>                  <dbl>         <dbl>     <dbl>     <dbl>
1 (Intercept)          5.88e-1 0.0184           32.0   6.36e-157
2 valuation_requested -3.14e-9 0.00000000319    -0.984 3.25e-  1

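The chunk above fits a linear regression of got_deal on valuation_requested. The estimated slope is about -3.14e-9 per dollar of requested valuation, with a p-value of 0.325, so this model shows no statistically significant linear relationship between the requested valuation and whether a pitch gets a deal. The intercept of 0.588 is close to the overall share of pitches that got a deal. A next step in exploration would be graphing this relationship, perhaps also including the initial ask amount.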
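
Because got_deal is a 0/1 outcome, a logistic regression is a common alternative to a linear fit, and taking log10 of the valuation is one way to handle its strong right skew. The sketch below is run on simulated data only: the data frame `sim`, the built-in negative effect, and every number in it are assumptions made for illustration, not findings about Shark Tank. It uses base R's glm() rather than the tidymodels interface used above.

```r
set.seed(1)
n <- 500

# Simulated valuations between 1e5 and about 3e7 (illustrative only).
sim <- data.frame(valuation_requested = 10^runif(n, 5, 7.5))

# Simulated outcome: the deal probability is *assumed* to fall with log10(valuation).
p_true <- plogis(9.4 - 1.5 * log10(sim$valuation_requested))
sim$got_deal <- rbinom(n, size = 1, prob = p_true)

# Logistic regression of the binary outcome on log10(valuation).
logit_fit <- glm(got_deal ~ log10(valuation_requested),
                 data = sim, family = binomial())

# Predicted deal probabilities at two example valuations.
new_pitches <- data.frame(valuation_requested = c(1e5, 1e7))
pred <- predict(logit_fit, newdata = new_pitches, type = "response")

print(coef(summary(logit_fit)))
print(pred)
```

Unlike the linear fit, the predicted values here always lie between 0 and 1, so they can be read as probabilities of getting a deal.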

Questions for reviewers


Do we need to get more specific with the research question? Is it all right if our ideas and plan change a little as the project goes on?