Project title

Proposal

Authors

Matthew Roohan (mlr287)

Senhuang Cai (sc3322)

Noorejehan Umar (nu44)

Aishwarya Gupta (ag2469)

library(tidyverse)
library(skimr)
library(openxlsx)
library(scales)
library(readxl)

Data 1

Problem or question

Questions:

The questions we are trying to answer are:

What countries in the world are the happiest?
Does happiness correlate with other factors such as life expectancy, country GDP, social support, etc.?
Is there one particular factor that influences how ‘happy’ a country is?

Explain why you think this topic is important:

This topic is important because it allows a more nuanced understanding of what factors might be contributing to the happiness of different countries. Finding out if certain regions are happier than others can help us think about what those countries do differently that makes their citizens happier, and act as a basis for other countries to follow. It would also be useful to see if there is a significant correlation between happiness and other factors such as GDP, life expectancy, social support, etc. This information could then influence policymakers and governments to make decisions aimed at improving the overall happiness of their countries. For example, if there is a strong correlation between GDP and happiness, they could focus more on economic reform to improve quality of life for individuals.

Identify the types of data/variables you will use:

We’ll be using both categorical and quantitative variables. Categorical variables will include the country name, while quantitative variables will include the happiness index, life expectancy, GDP, household income, democratic quality, etc.

State the major deliverable(s) you will create to solve this problem/answer this question:

The major deliverable will be an interactive website meant to educate policymakers and the general public about which factors affect happiness and how happiness varies across regions.

Introduction and data

Source of the data:

The source of the data is Helliwell, J. F., Layard, R., Sachs, J. D., De Neve, J.-E., Aknin, L. B., & Wang, S. (Eds.). (2022). World Happiness Report 2022. New York: Sustainable Development Solutions Network. We got the data from this link: https://worldhappiness.report/ed/2022/

State when and how it was originally collected (by the original data curator, not necessarily how you found the data):

According to the report, the data was collected by sending out a survey to a representative sample of each country’s population and asking them to rate their happiness from 0-10.

Write a brief description of the observations:

We have two datasets we’ll be focusing on, both of which are from the World Happiness Report 2022.

The first dataset has 2089 rows and 12 columns. Some of the important columns we’ll be focusing on are explained below. Note that all the variables below, except GDP and life expectancy, were based on responses from the Gallup World Poll.

GDP per capita: The gross domestic product per person, which relates to the overall economic health of a country.
Social support: This was the national average of respondents’ answers about whether they have friends or relatives they can count on when in trouble.
Healthy life expectancy: This is determined using data from the World Health Organization from select years (2000, 2010, 2015, and 2019), matching the report’s sample period (2005-2021) through interpolation and extrapolation.
Freedom to make life choices: This was the national average of respondents’ responses about their satisfaction with their individual freedom to choose what they wish to do with their lives.
Generosity: This was the national average of respondents’ responses about donating money to charity in the past month, regressed on log GDP per capita.
Perceptions of corruption: This was the national average of respondents’ responses about the prevalence of corruption in government and businesses.
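
The interpolation and extrapolation step described for healthy life expectancy can be illustrated with a minimal sketch. The values below are hypothetical, not WHO figures, and the method shown (linear interpolation, with the last observed value carried forward past 2019) is an assumption; the report may use a different approach.

```r
# Hypothetical values for the four WHO reference years
# (illustrative numbers only, not taken from the WHO data).
who_years <- c(2000, 2010, 2015, 2019)
who_hle   <- c(60.0, 64.0, 66.0, 67.0)

# Years covered by the report's sample period.
report_years <- 2005:2021

# Linear interpolation between reference years; rule = 2 reuses the
# nearest observed value for years outside 2000-2019 (an assumption).
filled <- approx(x = who_years, y = who_hle, xout = report_years, rule = 2)

hle_by_year <- data.frame(year = filled$x, healthy_life_expectancy = filled$y)
print(hle_by_year)
```

With these made-up inputs, 2005 falls halfway between the 2000 and 2010 values, while 2020 and 2021 simply repeat the 2019 value.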

The second dataset has 147 rows and 12 columns. Some of the variables we will focus on are:

Happiness Score: The average happiness score for each country.
Explained by: These columns indicate how much of the happiness score can be explained by each of the GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption variables.
Dystopia: The dataset also contains a variable called Dystopia, which refers to what the researchers describe as a hypothetical country “named because it has values equal to the world’s lowest national averages for 2019-2021 for each of the six key variables used. We use Dystopia as a benchmark against which to compare contributions from each of the six factors.”
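
If this decomposition is additive, as the column names and the summary means further down suggest, the six "Explained by" columns plus the Dystopia-and-residual column should add up to the happiness score. A minimal sketch of that check, using a single made-up row in place of dataset1.2 (the values are illustrative only; the column names follow the skim output):

```r
# One made-up row standing in for dataset1.2 (illustrative values only).
toy <- data.frame(
  check.names = FALSE,
  `Happiness score` = 6.00,
  `Dystopia (1.83) + residual` = 2.00,
  `Explained by: GDP per capita` = 1.50,
  `Explained by: Social support` = 1.00,
  `Explained by: Healthy life expectancy` = 0.60,
  `Explained by: Freedom to make life choices` = 0.50,
  `Explained by: Generosity` = 0.20,
  `Explained by: Perceptions of corruption` = 0.20
)

# Columns that together should reproduce the score.
component_cols <- c(
  "Dystopia (1.83) + residual",
  grep("^Explained by:", names(toy), value = TRUE)
)

# Difference between the reported score and the sum of its components.
toy$component_sum <- rowSums(toy[, component_cols])
toy$gap <- toy$`Happiness score` - toy$component_sum
print(toy[, c("Happiness score", "component_sum", "gap")])
```

On the real data, a gap that is not close to zero for some country would be worth investigating before using the components in any visualization.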

The descriptions of the dataset variables have been taken from the researchers’ report, found here: https://worldhappiness.report/ed/2022/happiness-benevolence-and-trust-during-covid-19-and-beyond/#ranking-of-happiness-2019-2021

Ethical Concerns:

While analyzing the data, it will be crucial to think about the fact that the concept of happiness is very subjective and might be influenced by cultural, societal, and personal factors. We will need to interpret the results with caution and remember that the measures of happiness do not imply that one country’s culture or approach is superior to others. Furthermore, since a lot of the data is based on individual participants’ responses, we need to recognize that these can be highly biased.

Glimpse of data

# read the two World Happiness Report datasets
dataset1.1 <- read_excel("data/dataset1.1_happiness.xls")
dataset1.2 <- read_excel("data/dataset1.2_happiness.xls")


skimr::skim(dataset1.1)

Data summary
Name	dataset1.1
Number of rows	2089
Number of columns	12
_______________________
Column type frequency:
character	1
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Country name	0	1	4	25	0	166	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	2013.73	4.46	2005.00	2010.00	2014.00	2017.00	2021.00	▅▆▆▆▇
Life Ladder	0	1.00	5.47	1.12	2.18	4.65	5.41	6.29	8.02	▁▅▇▆▃
Log GDP per capita	27	0.99	9.38	1.14	5.53	8.47	9.46	10.35	11.67	▁▃▆▇▅
Social support	13	0.99	0.81	0.12	0.29	0.75	0.83	0.90	0.99	▁▁▂▆▇
Healthy life expectancy at birth	58	0.97	63.18	6.95	6.72	58.97	64.98	68.36	74.35	▁▁▁▃▇
Freedom to make life choices	32	0.98	0.75	0.14	0.26	0.65	0.77	0.86	0.99	▁▂▅▇▆
Generosity	80	0.96	0.00	0.16	-0.34	-0.11	-0.02	0.09	0.71	▃▇▃▁▁
Perceptions of corruption	113	0.95	0.75	0.19	0.04	0.69	0.80	0.87	0.98	▁▁▁▅▇
Positive affect	24	0.99	0.65	0.11	0.18	0.57	0.66	0.74	0.88	▁▁▆▇▅
Negative affect	16	0.99	0.27	0.09	0.08	0.21	0.26	0.32	0.70	▃▇▃▁▁
Confidence in national government	216	0.90	0.48	0.19	0.07	0.33	0.47	0.62	0.99	▃▇▇▅▂

skimr::skim(dataset1.2)

Data summary
Name	dataset1.2
Number of rows	147
Number of columns	12
_______________________
Column type frequency:
character	1
numeric	11
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Country	0	1	2	25	0	147	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
RANK	0	1.00	74.00	42.58	1.00	37.50	74.00	110.50	147.00	▇▇▇▇▇
Happiness score	1	0.99	5.55	1.09	2.40	4.89	5.57	6.31	7.82	▁▃▇▇▃
Whisker-high	1	0.99	5.67	1.07	2.47	5.01	5.68	6.45	7.89	▁▃▇▇▃
Whisker-low	1	0.99	5.43	1.11	2.34	4.75	5.45	6.19	7.76	▁▃▇▇▃
Dystopia (1.83) + residual	1	0.99	1.83	0.53	0.19	1.56	1.89	2.15	2.84	▁▂▅▇▃
Explained by: GDP per capita	1	0.99	1.41	0.42	0.00	1.10	1.45	1.79	2.21	▁▃▆▇▆
Explained by: Social support	1	0.99	0.91	0.28	0.00	0.73	0.96	1.11	1.32	▁▂▅▇▇
Explained by: Healthy life expectancy	1	0.99	0.59	0.18	0.00	0.46	0.62	0.72	0.94	▁▃▃▇▃
Explained by: Freedom to make life choices	1	0.99	0.52	0.15	0.00	0.44	0.54	0.63	0.74	▁▁▃▇▇
Explained by: Generosity	1	0.99	0.15	0.08	0.00	0.09	0.13	0.20	0.47	▆▇▅▁▁
Explained by: Perceptions of corruption	1	0.99	0.15	0.13	0.00	0.07	0.12	0.20	0.59	▇▅▂▁▁
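
A first pass at the correlation question could look like the sketch below. It uses a small made-up data frame with the same column names as dataset1.1. Because several columns have missing values (for example, 27 for Log GDP per capita), `use = "complete.obs"` drops incomplete rows, which is one of several reasonable choices rather than a settled decision.

```r
# Made-up rows standing in for dataset1.1 (illustrative values only).
toy <- data.frame(
  check.names = FALSE,
  `Life Ladder`        = c(4.2, 5.1, 5.9, 6.8, 7.4, 5.5),
  `Log GDP per capita` = c(7.9, 8.8, 9.6, 10.4, 10.9, NA),
  `Social support`     = c(0.60, 0.72, 0.81, 0.90, 0.95, 0.78)
)

# Pairwise Pearson correlations, using only rows with no missing values.
cors <- cor(toy, use = "complete.obs")

# Correlation of each variable with the happiness measure.
print(round(cors[, "Life Ladder"], 2))
```

A strong correlation here would not by itself show that a factor causes higher happiness, so the results would need to be presented with that caveat.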

Data 2

Problem or question

Questions:

The questions we are trying to answer are:

Are there significant differences between the representation of women vs men in comic books?
How has the representation of women evolved in comic books over the years?
Is there a difference between how DC and Marvel comics represent women?

Explain why you think this topic is important:

The underrepresentation of women in comic books and the lack of gender diversity among characters within the major comic book universes (DC and Marvel) is a pervasive issue. Our objective is to investigate and highlight the gender disparity among the characters featured in comic books. This topic is important because comic books are a significant part of popular culture, influencing perceptions and shaping societal norms. Understanding the gender dynamics within the comic book industry and the characters portrayed can help address issues of representation and equality.

Identify the types of data/variables you will use:

We will be using both quantitative and categorical variables. Categorical variables include ALIGN, SEX, ALIVE, FIRST APPEARANCE, YEAR, and GSM, while quantitative variables include APPEARANCES.

State the major deliverable(s) you will create to solve this problem/answer this question:

The major deliverable will be a comic-strip themed website that will include analyses and visualizations showcasing trends and patterns about gender representation in comic books. We will:

Identify changes in gender representation over the years.
Investigate correlations between gender representation and the number of appearances of the character in comic books.

Introduction and data

Source of the data:

The data comes from Marvel Wikia and DC Wikia. Characters were scraped on August 24, 2014 (README.md).

State when and how it was originally collected (by the original data curator, not necessarily how you found the data):

The data was collected from the Marvel and DC databases and encyclopedias, which house information on all the comic book characters. The appearance counts were scraped on September 2, 2014, and the month and year of the first issue each character appeared in were pulled on October 6, 2014.

Write a brief description of the observations:

The data is split into two files, for DC and Marvel, respectively:

The DC dataset has 13 columns and 6896 rows
The Marvel dataset has 13 columns and 16376 rows

Each file contains 13 variables: page_id, name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, APPEARANCES, FIRST APPEARANCE, YEAR

ALIGN: If the character is Good, Bad or Neutral
SEX: Sex of the character (e.g. Male, Female, etc.)
ALIVE: If the character is alive or deceased
FIRST APPEARANCE: The month and year of the character’s first appearance in a comic book, if available
YEAR: The year of the character’s first appearance in a comic book, if available (named Year in the Marvel file)
GSM: If the character belongs to a gender or sexual minority category
APPEARANCES: The number of appearances of the character in comic books (as of Sep. 2, 2014)

Ethical Concerns:

Since the data is based on fictional characters, there are no overarching ethical concerns. However, the issue of gender representation and gender identity can be a sensitive one, so we want to make sure we consider the sensitive nature of the subject when thinking about how to display our analyses.

Glimpse of data

# read the DC and Marvel character datasets
dataset2.1 <- read_csv("data/dataset2.1_comics_dc.csv")

Rows: 6896 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, FIRST APPEAR...
dbl  (3): page_id, APPEARANCES, YEAR

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset2.2 <- read_csv("data/dataset2.2_comics_marvel.csv")

Rows: 16376 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, FIRST APPEAR...
dbl  (3): page_id, APPEARANCES, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(dataset2.1)

Data summary
Name	dataset2.1
Number of rows	6896
Number of columns	13
_______________________
Column type frequency:
character	10
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	10	46	6896
urlslug	0	1.00	18	54	6896
ID	2013	0.71	15	16	3
ALIGN	601	0.91	14	18	4
EYE	3628	0.47	8	18	17
HAIR	2274	0.67	8	21	17
SEX	125	0.98	15	22	4
GSM	6832	0.01	19	21	2
ALIVE	3	1.00	17	19	2
FIRST APPEARANCE	69	0.99	4	15	774

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
page_id	0	1.00	147441.21	108388.63	1380	44105.5	141267	213203	404010	▇▆▅▃▂
APPEARANCES	355	0.95	23.63	87.38	1	2.0	6	15	3093	▇▁▁▁▁
YEAR	69	0.99	1989.77	16.82	1935	1983.0	1992	2003	2013	▁▁▂▇▆

skimr::skim(dataset2.2)

Data summary
Name	dataset2.2
Number of rows	16376
Number of columns	13
_______________________
Column type frequency:
character	10
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	4	69	16376
urlslug	0	1.00	6	71	16376
ID	3770	0.77	15	29	4
ALIGN	2812	0.83	14	18	3
EYE	9767	0.40	7	15	24
HAIR	4264	0.74	4	21	25
SEX	854	0.95	15	22	4
GSM	16286	0.01	13	22	6
ALIVE	3	1.00	17	19	2
FIRST APPEARANCE	815	0.95	6	6	832

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
page_id	0	1.00	300232.08	253460.40	1025	28309.5	282578	509077	755278	▇▂▂▅▃
APPEARANCES	1096	0.93	17.03	96.37	1	1.0	3	8	4043	▇▁▁▁▁
Year	815	0.95	1984.95	19.66	1939	1974.0	1990	2000	2013	▃▂▅▇▇
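
To study representation over time, the two files will likely need to be combined first. The year column is named `YEAR` in the DC file and `Year` in the Marvel file, so it has to be harmonised before stacking. The sketch below uses tiny made-up data frames, and the `SEX` labels ("Female Characters", "Male Characters") are assumed rather than confirmed from the output above.

```r
# Made-up rows standing in for dataset2.1 (DC) and dataset2.2 (Marvel).
dc <- data.frame(
  SEX  = c("Female Characters", "Male Characters", "Male Characters", NA),
  YEAR = c(1965, 1968, 1994, 1999)
)
marvel <- data.frame(
  SEX  = c("Male Characters", "Female Characters", "Female Characters", "Male Characters"),
  Year = c(1963, 1991, 1997, NA)
)

# Harmonise the year column name, tag the publisher, and stack the files.
names(marvel)[names(marvel) == "Year"] <- "YEAR"
dc$publisher <- "DC"
marvel$publisher <- "Marvel"
comics <- rbind(dc, marvel)

# Keep rows where both sex and year are known, then bin years into decades.
comics <- comics[!is.na(comics$SEX) & !is.na(comics$YEAR), ]
comics$decade <- floor(comics$YEAR / 10) * 10

# Share of female characters among characters introduced in each decade.
female_share <- tapply(comics$SEX == "Female Characters", comics$decade, mean)
print(female_share)
```

Dropping rows with a missing sex or year is a simplification; on the real data it would be worth checking whether those rows differ systematically by publisher or era.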

Data 3

Problem or question

Identify the question you will answer

Determining which gender more frequently engages in dialogue in films and calculating the ratio of their speaking roles.
Exploring whether an actor’s age correlates with the number of words they speak in movies, investigating if there is an age-related pattern in dialogue distribution.
Analyzing whether the quantity of words spoken by actors has any influence on the gross revenue generated by films, potentially highlighting a relationship between dialogue and financial success.
Exploring the impact of supporting actors on a film’s success and determining if there are gender imbalances within supporting roles.

Explain why you think this topic is important

This project seeks to analyze the frequency of gender imbalances in speaking roles and dialogues within the film industry, providing insights into which gender tends to exert greater influence in on-screen conversations. By doing so, we aim to gain a comprehensive understanding of the substantial gender gap present in films.

Identify the types of data/variables you will use

We will be using both quantitative and categorical variables. Categorical variables include gender, character name, and title. Quantitative variables include age and words; script IDs are numeric but serve only as identifiers.

State the major deliverable(s) you will create to solve this problem/answer this question

Our major deliverable will be an engaging website featuring compelling visual representations. This platform aims to enlighten future producers about the discernible trends in gender inequality, with the goal of promoting more equitable opportunities within the industry.

Introduction and data

If you are using a dataset:

Identify the source of the data

The data source for this project comes from Polygraph’s movie-dialog gender ratio database.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data)

The original data collection by the Polygraph team began with the acquisition of approximately 8,000 scripts from various screenplay repositories available on the internet. Each script was then matched to its IMDB page. Next, each script was parsed to attribute every line of dialogue to an individual character. Finally, each character was matched with the actor who played them.

Write a brief description of the observations

The data consists of three files.

Character_list5.csv

Has 5 columns - script_id, imdb_character_name, words, gender, age
Has 23048 rows of data

Some of the columns we will be focusing on are:

IMDB Character Name: This column contains the names of the characters in the movie as they are officially recognized on the Internet Movie Database (IMDB).
Words: This column quantifies the number of words spoken by each character in the film. It serves as a measure of the character’s speaking role and contributes to the analysis of dialogue distribution in the movie.
Gender: The “Gender” column indicates the gender of the actor who portrayed each character in the movie.
Age: The “Age” column provides details about the age of the actor who played each character in the film.

Character_mapping.csv

Has 5 columns - script_id, imdb_id, character_from_script, closest_character_name_from_imdb_match, closest_imdb_character_id
Has 99390 rows of data

Some of the columns we will be focusing on are:

Character from Script - This refers to a character as it appears in the screenplay
Closest Character Name - This indicates the most closely matching character name from the associated film’s IMDb database.

This pairing helps align the character in the script with its corresponding IMDb-recognized name from the movie, allowing for consistent referencing and analysis when comparing the written script to the actual film.

Meta_data.csv

Has 6 columns - script_id, imdb_id, year, gross, title, lines_data
Has 2000 rows of data

Some of the columns we will be focusing on are:

Gross - refers to the total revenue or earnings generated by a film, often at the box office or through various distribution channels.
Title - refers to the name or title of the film serving as its primary identifier.
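
Relating dialogue to revenue requires linking the files, which appears possible through the shared `script_id` column. A minimal sketch with made-up rows follows; films with a missing `gross` are dropped here, which is an assumption about how we would handle them rather than a final decision.

```r
# Made-up rows standing in for the character list and the film metadata.
characters <- data.frame(
  script_id = c(1, 1, 2, 2, 3),
  words     = c(500, 300, 1200, 150, 800)
)
meta <- data.frame(
  script_id = c(1, 2, 3),
  title     = c("Film A", "Film B", "Film C"),
  gross     = c(20, 130, NA)
)

# Total words of dialogue per film.
words_per_film <- aggregate(words ~ script_id, data = characters, FUN = sum)

# Attach film-level metadata, then keep films with known gross revenue.
films <- merge(words_per_film, meta, by = "script_id")
films <- films[!is.na(films$gross), ]
print(films)
```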

Ethical Concerns:

The project openly addresses its data collection process and methodology, acknowledging potential limitations and concerns, such as the composition of the sample and the exclusion of minor characters. It emphasizes its primary role as a data-gathering initiative, distinct from a formal academic study, as it operates on the internet domain. While not conducting extensive statistical analysis, the project’s aim is to provide the data openly for others to interpret and use for their own analyses of gender representation in film dialogue. This approach offers a valuable contribution to the broader conversation surrounding this issue.

Reference: https://medium.com/@matthew_daniels/faq-for-the-film-dialogue-by-gender-project-40078209f751

Glimpse of data

# read the three film dialogue datasets
dataset3.1 <- read_csv("data/dataset3.1_words_character-list.csv")

Rows: 23048 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdb_character_name, gender, age
dbl (2): script_id, words

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset3.2 <- read_csv("data/dataset3.2_words_character-mapping.csv")

Rows: 99390 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): imdb_id, character_from_script, closest_character_name_from_imdb_ma...
dbl (1): script_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset3.3 <- read_csv("data/dataset3.3_words_meta-data.csv")

Rows: 2000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdb_id, title, lines_data
dbl (3): script_id, year, gross

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(dataset3.1)

Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There were 50 warnings in `dplyr::summarize()`.
The first warning was:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
Caused by warning in `grepl()`:
! unable to translate '<e9>tienne' to a wide string
ℹ Run `dplyr::last_dplyr_warnings()` to see the 49 remaining warnings.

Data summary
Name	dataset3.1
Number of rows	23048
Number of columns	5
_______________________
Column type frequency:
character	3
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
imdb_character_name	1	1	15	17609
gender	1	1	1	3
age	1	1	4	105

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
script_id	0	1	4194.78	2472.99	280	2095	3694	6219.75	9254	▇▇▆▃▅
words	0	1	907.87	1399.59	101	193	396	980.00	28102	▇▁▁▁▁
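
The summary shows `age` stored as a character column, presumably because some entries are non-numeric placeholders, so it needs converting before any age analysis. The sketch below uses made-up rows; the placeholder "NULL" and the gender codes "m", "f" and "?" are assumptions about the file's contents, not values confirmed from the output above.

```r
# Made-up rows standing in for dataset3.1 (illustrative values only).
characters <- data.frame(
  gender = c("m", "f", "m", "f", "?"),
  age    = c("34", "NULL", "51", "28", "NULL"),
  words  = c(1000, 600, 400, 900, 100),
  stringsAsFactors = FALSE
)

# Convert age to numeric; anything that is not a number becomes NA.
characters$age <- suppressWarnings(as.numeric(characters$age))

# Share of all spoken words attributed to each gender code.
word_share <- tapply(characters$words, characters$gender, sum) / sum(characters$words)
print(round(word_share, 2))
```

Comparing word shares rather than character counts addresses the first research question directly, since a film can have many minor characters of one gender who say very little.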

skimr::skim(dataset3.2)

Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There were 171 warnings in `dplyr::summarize()`.
The first warning was:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
Caused by warning in `grepl()`:
! unable to translate 'maitre d<ed>' to a wide string
ℹ Run `dplyr::last_dplyr_warnings()` to see the 170 remaining warnings.

Data summary
Name	dataset3.2
Number of rows	99390
Number of columns	5
_______________________
Column type frequency:
character	4
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
imdb_id	0	1	9	9	2388
character_from_script	47	1	1	79	18540
closest_character_name_from_imdb_match	63	1	1	88	30651
closest_imdb_character_id	0	1	9	9	25728

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
script_id	0	1	4132.47	2646.04	1	1970	3587	6242	9254	▆▇▆▃▅

skimr::skim(dataset3.3)

Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There were 12 warnings in `dplyr::summarize()`.
The first warning was:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
  mangled_skimmers$funs)`.
Caused by warning in `grepl()`:
! unable to translate 'Alien<b3>' to a wide string
ℹ Run `dplyr::last_dplyr_warnings()` to see the 11 remaining warnings.

Data summary
Name	dataset3.3
Number of rows	2000
Number of columns	6
_______________________
Column type frequency:
character	3
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
imdb_id	1	9	9	2000
title	1	2	62	1994
lines_data	1	9	255	2000

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
script_id	0	1.00	4396.84	2506.36	280	2201.75	4108	6607.75	9254	▇▇▇▅▆
year	0	1.00	1998.10	14.81	1929	1992.00	2001	2009.00	2015	▁▁▁▃▇
gross	338	0.83	103.94	140.56	0	20.00	54	132.75	1798	▇▁▁▁▁
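
The `unable to translate ... to a wide string` warnings above suggest that some names are not valid UTF-8. One possible cause, which is an assumption here, is that the files are Latin-1 encoded; if so, passing `locale = locale(encoding = "latin1")` to `read_csv()` may avoid the warnings. The sketch below reproduces the problem on a single string without needing the data files:

```r
# Build the bytes e9 74 69 65 6e 6e 65, matching '<e9>tienne' in the warning.
raw_name <- rawToChar(as.raw(c(0xe9, 0x74, 0x69, 0x65, 0x6e, 0x6e, 0x65)))

# The byte e9 on its own is not a valid UTF-8 sequence.
print(validUTF8(raw_name))

# Re-interpret the bytes as Latin-1 (an assumed source encoding) and
# convert them to UTF-8.
fixed <- iconv(raw_name, from = "latin1", to = "UTF-8")
print(validUTF8(fixed))
```

If the converted names still look wrong on the real data, the source encoding is probably something other than Latin-1 and would need to be checked against the original files.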