Team Elegant Evee

Preregistration of analyses

Analysis #1

Based on our project mentor’s feedback, we plan to thoroughly explore our second proposed question from the EDA (“Which release dates see the greatest profits?”) and pre-register our analysis for it below. The reason for this choice is, in many ways, intuitive. We do not expect films to see equal box office success throughout a year—some film studios reserve their biggest movies for certain periods of the year, suggesting that some months see better financial success than others.

Our proposed hypothesis is as follows:

  • Films released in December and the summer months, namely May through July, will see the greatest profits, whereas movies released in the months immediately after these peak periods will take in significantly less at the box office.

The reason for this hypothesis is twofold. First, films are more convenient for people to watch during these peak months (December and summer): people work fewer days because of holidays, and children tend to be out of school. Second, we expect that film studios contribute to a self-fulfilling prophecy of peak box office performance. If some months see greater profits than others, we expect major film studios to identify these months and save their biggest films for them. Releasing the strongest movies in months already considered financially favorable further entrenches the income advantage that those months have over others.

We can analyze this hypothesis in two key ways, among others. First, we can group the films by release month and create side-by-side boxplots, one per month, comparing the income distributions of each month’s films. Some months, for example, may have higher medians or more outliers, and boxplots are an intuitive way to display how income relates to release month. Second, we could filter our data to include only films grossing over $500 million and then create the same boxplots as above. This choice could help us more clearly identify when studios release their most successful films, as opposed to analyzing the income of all their films, and so help us further identify the months with the greatest box office hauls. Boxplots showing the monthly medians of high-grossing films may tell a very different story from boxplots showing the monthly medians of all films, regardless of income.
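The grouping and filtering steps described above can be sketched as follows. This is a minimal illustration rather than our final analysis code: the records, the function names (incomes_by_month, box_stats), and the `over` parameter are assumptions made for the example, and the values are made up rather than drawn from our dataset. It computes the five-number summary that each month's boxplot would display, first for all films and then for films grossing over $500 million.

```python
from collections import defaultdict
from datetime import date
from statistics import quantiles

# Hypothetical records: (release date, box office income in USD).
# Values are made up for illustration and are not from the project dataset.
films = [
    (date(2018, 12, 21), 1_100_000_000),
    (date(2017, 12, 15), 600_000_000),
    (date(2016, 12, 9), 100_000_000),
    (date(2019, 7, 5), 900_000_000),
    (date(2015, 7, 10), 300_000_000),
    (date(2014, 1, 17), 50_000_000),
    (date(2013, 1, 25), 150_000_000),
]


def incomes_by_month(records, over=None):
    """Group incomes by release month; keep only incomes above `over` if given."""
    groups = defaultdict(list)
    for released, income in records:
        if over is None or income > over:
            groups[released.month].append(income)
    return dict(groups)


def box_stats(values):
    """Five-number summary that a boxplot for one month would display."""
    ordered = sorted(values)
    if len(ordered) >= 2:
        q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    else:
        # A single film: every quartile collapses to that one value.
        q1 = med = q3 = ordered[0]
    return {"n": len(ordered), "min": ordered[0], "q1": q1,
            "median": med, "q3": q3, "max": ordered[-1]}


all_films = incomes_by_month(films)
high_grossers = incomes_by_month(films, over=500_000_000)

# Print one summary row per month for each of the two views of the data.
for label, groups in [("all films", all_films), ("over $500M", high_grossers)]:
    for month in sorted(groups):
        print(label, month, box_stats(groups[month]))
```

Comparing the two sets of monthly summaries side by side is what lets us see whether the high-grossing subset tells a different story from the full set of films. Months with very few qualifying films will have unstable quartiles, so the count `n` is reported alongside each summary.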

Analysis #2

Our third research question (“How strong is the relationship between a film’s release year and income versus its IMDb rating?”) is also quite promising. Intuitively, we understand that films released more recently may have different IMDb ratings from those released long ago. Just as grade inflation has affected college GPAs, generations of critics may score films differently. Additionally, the genres of films have shifted toward action over time, and ratings may reflect this change. We also suspect that films with higher ratings are likely of higher quality, thus helping their box office hauls. Based on these intuitions, we believe that there is significant analytic meat for this research question and propose two hypotheses.

Our hypotheses are as follows:

  • More recent film release years are associated with a lower IMDb rating, and a film’s release year is weakly related to its IMDb rating.

  • Films with higher incomes are associated with a higher IMDb rating, and a film’s income is moderately related to its IMDb rating.

As stated above, we believe that income and IMDb ratings are both measures of a film’s appeal and quality, although perhaps oblique ones. As such, we hope to explore how predictive income is of IMDb ratings to test this intuition. Additionally, conventional wisdom holds that film quality has declined over time. With the rise of streaming services and an increasing focus on movie sequels over original content, we expect that critics score older movies better than more recent ones.

We plan to test these hypotheses by fitting three linear models. First, we plan to fit two simple linear regression models, each with a single predictor: one regressing rating on release year and another regressing rating on income. Based on the correlation coefficients for these two models, we can comment on the strength of each relationship, as well as whether it is positive or negative. We also plan to fit one additive linear regression model to see how predictive these two variables (release year and income) are of ratings when used together. We will use the adjusted R^2 to evaluate what percentage of the variability in the response is explained by the model. Together, these approaches should be sufficient for evaluating the hypotheses.