Deal or No Deal - Investigating Shark Tank Deals Throughout the Show

Proposal

library(tidyverse)
library(skimr)

Data 1 - Shark Tank

Introduction and data

  • Identify the source of the data.

    • The first data source, shark_tank.csv, is located in the data folder. This dataset is a collection of business pitches from the first 14 seasons of the American TV show “Shark Tank.” It contains 1038 observations, each a unique business pitch, and 52 columns of information about that pitch.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The dataset, which is available for public use via kaggle.com, was curated by a collaborator named Satya Thirumani. The page containing the dataset can be found at this link. The dataset is continually updated by this collaborator as Shark Tank airs new episodes and continues to release seasons of the show. The dataset was last updated on March 9, 2023. For the sake of this project, we will not include additional updates to this dataset past this March 9th update.
  • Write a brief description of the observations.

    • Each observation contains 52 columns regarding the business pitch, covering a wide range of information. First, there are details about the pitch as it relates to the Shark Tank show (i.e., season, episode, air date, etc.). There are columns about the company being pitched, such as its name and industry, as well as the entrepreneurs who run it. Finally, there is a great deal of information about the economics of each pitch: the amount of money requested, the company’s original valuation, and ultimately which sharks (investors), if any, invested in the company and at what new valuation. Because not all companies on the show receive investment, many cells are blank (i.e., NA).
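Because a shark's investment columns are NA whenever that shark did not invest, counting the non-missing values gives a quick tally of deals per shark and industry. The following is a minimal sketch on made-up rows, not our final analysis. The column names follow shark_tank.csv, while the industries, amounts, and the object names `pitches` and `shark_deals` are illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy rows with made-up values; only two of the shark columns are shown
pitches <- tribble(
  ~`Startup Name`, ~Industry, ~`Mark Cuban Investment Amount`, ~`Lori Greiner Investment Amount`,
  "A", "Food and Beverage", 100000, NA,
  "B", "Food and Beverage", NA,     50000,
  "C", "Technology",        200000, 200000,
  "D", "Technology",        NA,     NA,
  "E", "Technology",        300000, NA
)

shark_deals <- pitches %>%
  # One row per (pitch, shark) pair; NA amounts mean "did not invest" and are dropped
  pivot_longer(
    cols = ends_with("Investment Amount"),
    names_to = "shark",
    values_to = "amount",
    values_drop_na = TRUE
  ) %>%
  # Keep only the shark's name from the column label
  mutate(shark = sub(" Investment Amount$", "", shark)) %>%
  count(shark, Industry, name = "n_deals")

shark_deals
```

On the full data, `ends_with("Investment Amount")` would also pick up the `Guest Investment Amount` column, so guest sharks would need to be handled separately if they should be excluded.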

Research question

  • Research question(s)
    1. What makes a Shark Tank pitch more likely to reach a deal on the show?
      1. This question is important as it allows us to visualize trends in the types of Shark Tank pitches that ultimately reach a deal. The data may reveal that certain industries, investors, or investment deals prove to have higher success rates, which could serve as a guide to future entrepreneurs looking to pitch their business on the show.
    2. When a company makes a deal on Shark Tank, how often do they receive the deal they requested? Do companies usually have to lower their valuations to reach a deal?
      1. This question is important as it may show which terms proposed by entrepreneurs (amount requested, equity offered, and the resulting valuation) make a deal more or less likely. It may reveal a ‘sweet spot’: a range of valuations that tends to result in a deal with an investor.
    3. Which sharks (investors) on the show are most likely to invest in a company given certain metrics about the company? Do certain sharks typically invest in companies in certain industries, or in companies whose entrepreneurs come from a certain background?
      1. This question is important as it could unveil biases or tendencies in the deals that certain investors tend to make with entrepreneurs. Certain investors may be more inclined to reach a deal with a business if it meets the specific qualifications (if any) that they prefer in a pitch.
  • Description of research topic
    • Analysis of Shark Tank business pitches: how a company does or doesn’t reach a deal on the hit show “Shark Tank”
  • Hypotheses on the topic
    • Companies that reach deals on Shark Tank typically must lower their valuation from their original ask.

    • Sharks typically invest more often in companies that are in their respective areas of expertise.

    • Sharks are more likely to invest in companies whose entrepreneurs come from a similar background as themselves.

  • Types of variables in the research question
    • In order to answer the questions listed above, there are a number of categorical and quantitative variables that are needed. In terms of quantitative variables, there must be data regarding capital/equity that was initially requested at the onset of the pitch, as well as capital/equity that is agreed upon if a deal is ultimately reached. In terms of categorical data, there will have to be data points regarding the business such as its industry and name. There will also have to be data about the background of the pitchers/entrepreneurs. Moreover, there should be data about the background of the sharks (investors). There also must be a crucial variable - whether or not the pitch results in a deal.
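To make the valuation hypothesis concrete, a pitch's implied valuation can be computed as the amount divided by the equity share, and the requested and agreed valuations can then be compared for pitches that got a deal. This is a minimal sketch on made-up rows. The column names follow shark_tank.csv, but we are assuming (not yet verified) that the dataset's own `Valuation Requested` and `Deal Valuation` columns are derived the same way; the new column names are our own.

```r
library(dplyr)
library(tibble)

# Toy rows with made-up values; equity is expressed in percent
pitches <- tribble(
  ~`Startup Name`, ~`Original Ask Amount`, ~`Original Offered Equity`, ~`Got Deal`, ~`Total Deal Amount`, ~`Total Deal Equity`,
  "A", 100000, 10, 1, 100000, 20,
  "B", 200000, 10, 1, 200000, 10,
  "C", 500000,  5, 0, NA,     NA
)

valuation_change <- pitches %>%
  mutate(
    # Implied valuation = amount / (equity percent / 100)
    valuation_requested = `Original Ask Amount` * 100 / `Original Offered Equity`,
    deal_valuation      = `Total Deal Amount` * 100 / `Total Deal Equity`,
    lowered_valuation   = deal_valuation < valuation_requested
  )

# Among pitches that got a deal, what share closed below the requested valuation?
deal_summary <- valuation_change %>%
  filter(`Got Deal` == 1) %>%
  summarise(
    n_deals       = n(),
    share_lowered = mean(lowered_valuation)
  )

deal_summary
```

In the toy data, startup A asked for a $1,000,000 valuation and closed at $500,000, while B closed at its requested valuation, so half of the deals lowered the valuation.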

Glimpse of data

shark_tank <- read_csv("data/shark_tank.csv")
Rows: 1038 Columns: 52
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Season Start, Season End, Original Air Date, Startup Name, Industr...
dbl (38): Season Number, Episode Number, Pitch Number, Multiple Entrepreneur...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(shark_tank)
Data summary
Name shark_tank
Number of rows 1038
Number of columns 52
_______________________
Column type frequency:
character 14
numeric 38
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Season Start 0 1.00 9 9 0 14 0
Season End 7 0.99 9 9 0 13 0
Original Air Date 408 0.61 9 9 0 154 0
Startup Name 0 1.00 3 32 0 1036 0
Industry 0 1.00 6 23 0 15 0
Business Description 0 1.00 5 92 0 1036 0
Pitchers Gender 5 1.00 4 10 0 3 0
Pitchers City 540 0.48 3 18 0 250 0
Pitchers State 299 0.71 2 6 0 46 0
Pitchers Average Age 992 0.04 3 6 0 4 0
Entrepreneur Names 557 0.46 8 60 0 479 0
Company Website 570 0.45 9 65 0 466 0
Guest Name 837 0.19 9 17 0 24 0
Notes 899 0.13 9 191 0 129 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Season Number 0 1.00 6.76 3.11 1.00 4.00 7.00 9.00 1.400e+01 ▃▇▅▇▁
Episode Number 0 1.00 12.12 7.74 1.00 5.00 11.00 18.00 2.900e+01 ▇▆▅▅▂
Pitch Number 0 1.00 519.50 299.79 1.00 260.25 519.50 778.75 1.038e+03 ▇▇▇▇▇
Multiple Entrepreneurs 487 0.53 0.35 0.48 0.00 0.00 0.00 1.00 1.000e+00 ▇▁▁▁▅
US Viewership 416 0.60 6.10 1.35 2.31 5.15 6.38 7.11 8.640e+00 ▁▃▅▇▃
Original Ask Amount 0 1.00 281798.65 379843.24 10000.00 100000.00 200000.00 300000.00 5.000e+06 ▇▁▁▁▁
Original Offered Equity 0 1.00 14.64 8.91 1.50 10.00 10.00 20.00 1.000e+02 ▇▁▁▁▁
Valuation Requested 0 1.00 3163290.63 4804725.88 40000.00 600000.00 1485294.00 3333333.00 4.500e+07 ▇▁▁▁▁
Got Deal 0 1.00 0.58 0.49 0.00 0.00 1.00 1.00 1.000e+00 ▆▁▁▁▇
Total Deal Amount 436 0.58 290921.37 378899.37 0.00 100000.00 200000.00 300000.00 5.000e+06 ▇▁▁▁▁
Total Deal Equity 436 0.58 25.51 16.18 0.00 15.00 25.00 33.00 1.000e+02 ▇▇▂▁▁
Deal Valuation 436 0.58 2042821.14 3718413.81 0.00 336206.75 800000.00 2000000.00 3.600e+07 ▇▁▁▁▁
Number of sharks in deal 436 0.58 1.32 0.63 1.00 1.00 1.00 2.00 5.000e+00 ▇▂▁▁▁
Investment Amount Per Shark 436 0.58 245115.72 350301.99 0.00 75000.00 150000.00 300000.00 5.000e+06 ▇▁▁▁▁
Equity Per Shark 436 0.58 21.55 15.17 0.00 10.00 20.00 25.00 1.000e+02 ▇▅▁▁▁
Royalty Deal 987 0.05 1.00 0.00 1.00 1.00 1.00 1.00 1.000e+00 ▁▁▇▁▁
Loan 1001 0.04 1.00 0.00 1.00 1.00 1.00 1.00 1.000e+00 ▁▁▇▁▁
Barbara Corcoran Investment Amount 940 0.09 143520.41 137398.90 12500.00 50000.00 100000.00 200000.00 1.000e+06 ▇▂▁▁▁
Barbara Corcoran Investment Equity 940 0.09 23.98 13.09 5.00 15.00 20.00 32.25 5.500e+01 ▇▇▂▂▂
Mark Cuban Investment Amount 857 0.17 245649.17 278613.24 12500.00 75000.00 150000.00 300000.00 2.000e+06 ▇▁▁▁▁
Mark Cuban Investment Equity 857 0.17 18.80 15.40 2.50 10.00 15.00 25.00 1.000e+02 ▇▃▁▁▁
Lori Greiner Investment Amount 882 0.15 205993.59 198022.87 17500.00 75000.00 150000.00 250000.00 1.000e+06 ▇▂▁▁▁
Lori Greiner Investment Equity 882 0.15 16.61 12.03 0.00 10.00 12.50 20.00 6.500e+01 ▇▅▁▁▁
Robert Herjavec Investment Amount 938 0.10 290973.33 581148.81 5000.00 86458.33 150000.00 300000.00 5.000e+06 ▇▁▁▁▁
Robert Herjavec Investment Equity 938 0.10 18.66 13.36 0.00 10.00 15.00 25.00 1.000e+02 ▇▃▁▁▁
Daymond John Investment Amount 943 0.09 186805.26 319390.55 5000.00 50000.00 100000.00 240000.00 3.000e+06 ▇▁▁▁▁
Daymond John Investment Equity 943 0.09 26.06 16.18 0.00 15.82 25.00 33.30 1.000e+02 ▇▇▁▁▁
Kevin O Leary Investment Amount 942 0.09 236276.04 315926.33 20000.00 80000.00 150000.00 250000.00 2.500e+06 ▇▁▁▁▁
Kevin O Leary Investment Equity 942 0.09 15.83 11.65 0.00 8.56 10.83 25.00 5.000e+01 ▇▃▂▁▁
Guest Investment Amount 969 0.07 216606.28 239754.19 0.00 75000.00 125000.00 250000.00 1.250e+06 ▇▂▁▁▁
Guest Investment Equity 969 0.07 16.71 15.52 0.00 10.00 11.25 20.00 1.000e+02 ▇▂▁▁▁
Barbara Corcoran Present 143 0.86 0.56 0.50 0.00 0.00 1.00 1.00 1.000e+00 ▆▁▁▁▇
Mark Cuban Present 142 0.86 0.90 0.30 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇
Lori Greiner Present 142 0.86 0.75 0.43 0.00 0.75 1.00 1.00 1.000e+00 ▂▁▁▁▇
Robert Herjavec Present 142 0.86 0.88 0.33 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇
Daymond John Present 143 0.86 0.66 0.47 0.00 0.00 1.00 1.00 1.000e+00 ▅▁▁▁▇
Kevin O Leary Present 143 0.86 0.96 0.21 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇
Kevin Harrington Present 143 0.86 0.95 0.23 0.00 1.00 1.00 1.00 1.000e+00 ▁▁▁▁▇

Data 2 - March Madness

Introduction and data

  • Identify the source of the data.

    • The data source, march_madness.csv, is located in the data folder. This dataset is a collection of observations from March Madness (the NCAA college basketball playoff tournament). The data comes from Kaggle, and can be found at this link.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The Washington Post compiled this data from the records the NCAA has collected throughout past March Madness tournaments.
  • Write a brief description of the observations.

    • Each observation represents one game. There is an observation for every game in each round of all tournaments from 1985-2021.
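Since each row is a game rather than a team, questions about how far a team advances need the data reshaped to one row per team per tournament. Below is a minimal sketch of one way to do that on made-up games. The column names follow march_madness.csv; the team names and the helper objects `team_games`, `deepest_round`, and `avg_round_by_seed` are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy games with made-up teams; one row per game, as in march_madness.csv
games <- tribble(
  ~YEAR, ~ROUND, ~WTEAM,  ~WSEED, ~LTEAM,  ~LSEED,
  2021,  1,      "Alpha", 1,      "Delta", 16,
  2021,  1,      "Beta",  8,      "Gamma", 9,
  2021,  2,      "Alpha", 1,      "Beta",  8
)

# Stack winners and losers so every team appearance is its own row
team_games <- bind_rows(
  games %>% transmute(YEAR, ROUND, team = WTEAM, seed = WSEED),
  games %>% transmute(YEAR, ROUND, team = LTEAM, seed = LSEED)
)

# Last round each team played in a given year
deepest_round <- team_games %>%
  group_by(YEAR, team, seed) %>%
  summarise(last_round = max(ROUND), .groups = "drop")

# Average last round played, by seed
avg_round_by_seed <- deepest_round %>%
  group_by(seed) %>%
  summarise(avg_last_round = mean(last_round))

avg_round_by_seed
```

One limitation of this measure is that the champion and the runner-up share the same last round, so the final may need separate handling.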

Research question

  • Research question(s)

    • What is the relationship between a team’s seeding and how far it advances in the tournament?
    • What “underdog” seeds are most likely to win an upset?
  • Description of research topic

    • March Madness is known to have a degree of randomness in which teams win each game, as nobody has ever predicted every game winner accurately. We are wondering how this randomness affects the likelihood of a team making it far in the tournament.
  • Hypotheses on the topic

    • Seeding is more indicative of performance for the best-seeded and worst-seeded teams, while teams with middle-range seeds are less predictable.

    • Underdogs that are only slight underdogs (i.e., seeded close to their opponent) will have the best likelihood of upsetting a higher-ranked team.

  • Types of variables in the research question

    • The categorical variables are the seed and the year of the tournament. The quantitative variable is the average round a seed reaches.
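For the upset question, a game can be flagged as an upset when the winner's seed number is larger (worse) than the loser's, and upset rates can then be grouped by the underdog's seed. This is a minimal sketch on made-up games. The column names follow march_madness.csv, while `underdog_seed`, `underdog_won`, and `upset_rates` are names we introduce for illustration.

```r
library(dplyr)
library(tibble)

# Toy games with made-up results
games <- tribble(
  ~YEAR, ~ROUND, ~WSEED, ~LSEED,
  2018,  1,      16,     1,
  2019,  1,      1,      16,
  2019,  1,      12,     5,
  2021,  1,      5,      12,
  2021,  1,      12,     5,
  2021,  2,      1,      8
)

upset_rates <- games %>%
  # Games between equal seeds have no underdog
  filter(WSEED != LSEED) %>%
  mutate(
    underdog_seed = pmax(WSEED, LSEED),  # larger seed number = lower-ranked team
    underdog_won  = WSEED > LSEED
  ) %>%
  group_by(underdog_seed) %>%
  summarise(n_games = n(), upset_rate = mean(underdog_won)) %>%
  arrange(underdog_seed)

upset_rates
```

This treats any seed difference as an underdog matchup; a stricter definition (for example, a minimum seed gap) would be a separate design choice.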

Glimpse of data

march_madness <- read_csv("data/march_madness.csv")
Rows: 2317 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): WTEAM, LTEAM
dbl (6): YEAR, ROUND, WSEED, WSCORE, LSEED, LSCORE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(march_madness)
Data summary
Name march_madness
Number of rows 2317
Number of columns 8
_______________________
Column type frequency:
character 2
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
WTEAM 0 1 3 25 0 207 0
LTEAM 0 1 3 25 0 302 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
YEAR 0 1 2002.76 10.47 1985 1994 2003 2012 2021 ▇▇▇▇▇
ROUND 0 1 1.86 1.21 0 1 1 2 6 ▇▃▂▁▁
WSEED 0 1 4.98 3.84 1 2 4 7 16 ▇▃▂▂▁
WSCORE 0 1 76.87 11.84 43 69 76 84 149 ▂▇▂▁▁
LSEED 0 1 8.72 4.60 1 5 9 13 16 ▇▆▆▆▇
LSCORE 0 1 65.19 11.05 29 58 65 72 115 ▁▇▇▂▁

Data 3 - Goodreads and Google Books

Introduction and data

  • Identify the source of the data.

    • This data comes in two parts. The first is webscraped from Goodreads’ website, specifically their most popular list, “Books That Everyone Should Read At Least Once.” The link is: “https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once”. The second half of the data is from the Google Books API, which we queried by searching each title from the Goodreads list. We wanted to use Goodreads’ API, but it has been discontinued, so we switched to the Google Books API. This resulted in some missing data points: while we used 800 books from the Goodreads list, only around 380 had an equivalent match on Google Books. With more time, we can expand the data set back toward the recommended 800 data points. ISBN, which would have provided a more accurate way to search, was not available on the Goodreads website.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The Goodreads data was voted on by over 100,000 users. In total, the list features 20,000 books. Any user can vote, and the books are ranked by number of votes. The Google Books data was all collected by Google in its attempt to digitize the world’s books en masse. Some books are not on the site due to copyright, so the selection varies. Further, Google Books often has multiple editions of a book uploaded, and which edition ranks as most relevant changes with the query, making selection difficult. Verified users and Google developers can add data to the site.
  • Write a brief description of the observations.

    • This data set has 27 columns that cover everything from maturity rating to retail price. The relevant columns are title, author, publisher, publish date, description, page count, print type, categories (genre), average rating, number of ratings, maturity rating, language, links to preview or buy, subtitle, country, and price. The table can use a little more tidying and filtering.
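Part of the tidying involves the two duplicated price column pairs, which `read_csv()` renames to `amount...23`, `currencyCode...24`, `amount...25`, and `currencyCode...26`. A minimal sketch of giving them explicit names is below. Which pair is the list price and which is the retail price is our assumption based on the column order and has not been checked against the API response; the new names are our own choice.

```r
library(dplyr)
library(tibble)

# Toy rows with made-up values, using the repaired names readr produces
books <- tibble(
  title               = c("Book A", "Book B"),
  `amount...23`       = c(NA, 9.99),
  `currencyCode...24` = c(NA, "USD"),
  `amount...25`       = c(NA, 9.99),
  `currencyCode...26` = c(NA, "USD")
)

books_prices <- books %>%
  rename(
    list_price      = `amount...23`,        # assumed: list price
    list_currency   = `currencyCode...24`,
    retail_price    = `amount...25`,        # assumed: retail price
    retail_currency = `currencyCode...26`
  )

names(books_prices)
```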

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
    • Research Questions:

      1. How does the average rating of a book vary across different genres, publication dates, and book lengths? Is there a singular model of book that emerges as most popular?

      2. Do different publishing houses set different standards for qualities like page count and retail price of books? Does one of these models stand out as the most successful?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.
    • Our research topic with this data set involves an analysis of popular books based on their genre, publication date, book length, publishing house, page count, and retail price. The study aims to investigate the relationship between these variables and the average rating of books. Additionally, we will seek to explore whether certain models of books emerge as more popular than others and/or whether publishing houses set different standards for qualities like page count and retail price.

      Hypotheses:

      1. The average rating of books varies significantly across different genres, publication dates, and book lengths. However, because the list is ranked by user votes, we expect certain genres to mirror the trends of recent years; for example, Romance may have the largest presence on the list. Longer books may be perceived as having more depth and value, resulting in higher average ratings but significantly fewer ratings. We don’t believe that a singular model of book will emerge as most popular from this dataset, as readers’ preferences and tastes can vary widely.

      2. Different publishing houses may set different standards for qualities like page count and retail price of books. Some publishing houses may prioritize longer books with higher page counts, while others may prioritize shorter books that are more accessible to readers. However, we predict that they will be more consistent on genre and retail price than on page length. We expect to see the trends of smaller publishing firms differ from those of the dominant firms.

  • Identify the types of variables in your research question. Categorical? Quantitative?
    • Categorical variables: genre, language, and maturity rating.

    • Quantitative, discrete variables: publication date, page count, average rating, number of ratings.

    • Quantitative continuous: price.
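Two of these variables need preparation before they can be treated as quantitative: `publishedDate` is read as text with mixed formats (for example "1999-11-03" and "2018"), and `pageCount` contains zeros. The sketch below uses made-up rows and assumes every non-missing date string starts with a four-digit year and that a page count of 0 means the value is unknown; both assumptions would need checking on the real data.

```r
library(dplyr)
library(tibble)

# Toy rows with made-up values mirroring the formats seen in the data
books <- tibble(
  title         = c("Book A", "Book B", "Book C"),
  publishedDate = c("1999-11-03", "2018", NA),
  pageCount     = c(350, 0, 519)
)

books_clean <- books %>%
  mutate(
    # Assumption: the first four characters are always the year
    published_year = as.integer(substr(publishedDate, 1, 4)),
    # Assumption: a page count of 0 is a placeholder for "unknown"
    pageCount = na_if(pageCount, 0)
  )

books_clean
```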

Glimpse of data

goodreadsdata <- read_csv("data/books.csv")
New names:
• `amount...24` -> `amount...23`
• `currencyCode...25` -> `currencyCode...24`
• `amount...26` -> `amount...25`
• `currencyCode...27` -> `currencyCode...26`
Rows: 386 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): title, authors, publisher, publishedDate, description, printType, ...
dbl  (5): pageCount, averageRating, ratingsCount, amount...23, amount...25
lgl  (3): allowAnonLogging, comicsContent, isEbook

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(goodreadsdata)
Rows: 386
Columns: 27
$ title               <chr> "To Kill a Mockingbird 40th", "Harry Potter and th…
$ authors             <chr> "Harper Lee", "J.K. Rowling", "Jane Austen", "Anne…
$ publisher           <chr> "HarperCollins Christian Publishing", "Pottermore …
$ publishedDate       <chr> "1999-11-03", "2015-12-08", "2018", "2016", "2021-…
$ description         <chr> "The explosion of racial hate and violence in a sm…
$ pageCount           <dbl> 350, 311, 519, 0, 128, 117, 246, 165, 192, 0, 578,…
$ printType           <chr> "BOOK", "BOOK", "BOOK", "BOOK", "BOOK", "BOOK", "B…
$ categories          <chr> "FICTION", "Juvenile Fiction", "Courtship", "Amste…
$ averageRating       <dbl> 4.5, 4.5, NA, 4.0, 4.0, 4.0, NA, 3.5, 3.5, 4.5, 4.…
$ ratingsCount        <dbl> 2163, 2057, NA, 166, 7, 1906, NA, 3123, 485, 124, …
$ maturityRating      <chr> "NOT_MATURE", "NOT_MATURE", "NOT_MATURE", "NOT_MAT…
$ allowAnonLogging    <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRU…
$ contentVersion      <chr> "0.2.4.0.preview.0", "2.36.33.0.preview.3", "previ…
$ language            <chr> "en", "en", "en", "en", "en", "en", "en", "en", "e…
$ previewLink         <chr> "http://books.google.com/books?id=ayJpGQeyxgkC&pri…
$ infoLink            <chr> "http://books.google.com/books?id=ayJpGQeyxgkC&dq=…
$ canonicalVolumeLink <chr> "https://books.google.com/books/about/To_Kill_a_Mo…
$ subtitle            <chr> NA, NA, NA, NA, NA, NA, NA, "The Authorized Editio…
$ comicsContent       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ country             <chr> "US", "US", "US", "US", "US", "US", "US", "US", "U…
$ saleability         <chr> "NOT_FOR_SALE", "FOR_SALE", "NOT_FOR_SALE", "NOT_F…
$ isEbook             <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TR…
$ amount...23         <dbl> NA, 9.99, NA, NA, NA, NA, NA, 1.99, NA, NA, NA, NA…
$ currencyCode...24   <chr> NA, "USD", NA, NA, NA, NA, NA, "USD", NA, NA, NA, …
$ amount...25         <dbl> NA, 9.99, NA, NA, NA, NA, NA, 1.99, NA, NA, NA, NA…
$ currencyCode...26   <chr> NA, "USD", NA, NA, NA, NA, NA, "USD", NA, NA, NA, …
$ buyLink             <chr> NA, "https://play.google.com/store/books/details?i…
skim(goodreadsdata)
Data summary
Name goodreadsdata
Number of rows 386
Number of columns 27
_______________________
Column type frequency:
character 19
logical 3
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1.00 4 87 0 382 0
authors 0 1.00 4 65 0 295 0
publisher 51 0.87 2 44 0 151 0
publishedDate 6 0.98 4 20 0 325 0
description 27 0.93 14 4857 0 357 0
printType 3 0.99 4 4 0 1 0
categories 0 1.00 4 40 0 77 0
maturityRating 0 1.00 10 10 0 1 0
contentVersion 0 1.00 10 21 0 183 0
language 0 1.00 2 2 0 7 0
previewLink 0 1.00 77 210 0 385 0
infoLink 0 1.00 72 194 0 383 0
canonicalVolumeLink 0 1.00 59 102 0 383 0
subtitle 278 0.28 6 165 0 85 0
country 0 1.00 2 2 0 1 0
saleability 0 1.00 4 12 0 3 0
currencyCode...24 278 0.28 3 3 0 1 0
currencyCode...26 278 0.28 3 3 0 1 0
buyLink 263 0.32 104 104 0 121 0

Variable type: logical

skim_variable n_missing complete_rate mean count
allowAnonLogging 0 1.00 0.32 FAL: 262, TRU: 124
comicsContent 384 0.01 1.00 TRU: 2
isEbook 0 1.00 0.32 FAL: 263, TRU: 123

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
pageCount 12 0.97 340.86 279.84 0 164.00 297.00 448.00 1600 ▇▅▁▁▁
averageRating 90 0.77 4.04 0.53 2 4.00 4.00 4.50 5 ▁▁▃▇▅
ratingsCount 90 0.77 690.37 1217.53 1 11.75 118.50 482.00 4916 ▇▁▁▁▁
amount...23 278 0.28 9.72 5.82 0 5.99 9.99 12.99 35 ▅▇▂▁▁
amount...25 278 0.28 9.33 5.57 0 5.99 9.99 12.99 35 ▅▇▁▁▁