library(tidyverse)
library(skimr)
library(openxlsx)
library(scales)
library(readxl)
Project title
Proposal
Data 1
Problem or question
Questions:
The questions we are trying to answer are:
What countries in the world are the happiest?
Does happiness correlate to other factors such as life expectancy, country GDP, social support, etc?
Is there one particular factor that influences how ‘happy’ a country is?
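One way we could approach the second and third questions is a pairwise correlation table. The sketch below uses a few made-up rows (not values from the report) with column names borrowed from the first dataset. The option `use = "pairwise.complete.obs"` is an assumption on our part, chosen because several columns in the data have missing values.

```r
# Made-up rows for illustration only; the real data is dataset1.1.
toy <- data.frame(
  `Life Ladder`        = c(7.5, 6.8, 5.9, 5.1, 4.2, 3.6),
  `Log GDP per capita` = c(11.0, 10.6, 9.9, NA, 8.4, 7.9),
  `Social support`     = c(0.95, 0.90, 0.85, 0.70, NA, 0.55),
  check.names = FALSE  # keep the spaces in the column names
)

# Pearson correlations, using every pair of rows where both values are present.
cors <- cor(toy, use = "pairwise.complete.obs")

# Correlation of each factor with the happiness measure.
cors["Life Ladder", ]
```

A correlation table like this only describes linear association, so it would be a starting point for the third question rather than an answer to it.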
Explain why you think this topic is important:
This topic is important because it allows a more nuanced understanding of what factors might be contributing to the happiness of different countries. Finding out if certain regions are happier than others can help us think about what those countries do differently that makes their citizens happier, and act as a basis for other countries to follow. It would also be useful to see if there is a significant correlation between happiness and other factors such as GDP, life expectancy, social support, etc. This information could then influence policymakers and governments to make decisions aimed at improving the overall happiness of their countries. For example, if there is a strong correlation between GDP and happiness, they could focus more on economic reform to improve quality of life for individuals.
Identify the types of data/variables you will use:
We’ll be using both categorical and quantitative variables. Categorical variables include the country name, while quantitative variables include the happiness index, life expectancy, GDP, household income, democratic quality, etc.
State the major deliverable(s) you will create to solve this problem/answer this question:
The major deliverable will be an interactive and educational website that is meant to educate policymakers as well as the general public about what factors affect happiness and how different regions vary by happiness.
Introduction and data
Source of the data:
The source of the data is Helliwell, J. F., Layard, R., Sachs, J. D., De Neve, J.-E., Aknin, L. B., & Wang, S. (Eds.). (2022). World Happiness Report 2022. New York: Sustainable Development Solutions Network. We got the data from this link: https://worldhappiness.report/ed/2022/
State when and how it was originally collected (by the original data curator, not necessarily how you found the data):
According to the report, the data was collected by sending out a survey to a representative sample of each country’s population and asking them to rate their happiness from 0-10.
Write a brief description of the observations:
We have two datasets we’ll be focusing on, both of which are from the World Happiness report from 2022.
The first dataset has 2089 rows and 12 columns. Some of the important columns we’ll be focusing on are explained below. Note that all the variables below, except GDP and life expectancy, were based on responses from the Gallup World Poll.
GDP per capita: The gross domestic product per person, which relates to the overall economic health of a country.
Social support: This was the national average of respondents’ responses about the availability of their friends and relatives that they could count on when in trouble.
Healthy life expectancy: This is determined using data from the World Health Organization from select years (2000, 2010, 2015, and 2019), matching the report’s sample period (2005-2021) through interpolation and extrapolation.
Freedom to make life choices: This was the national average of respondents’ responses about their satisfaction with their individual freedom to choose what they wish to do with their lives.
Generosity: This was the national average of respondents’ responses about donating money to charity in the past month, regressed on log GDP per capita.
Perceptions of corruption: This was the national average of respondents’ responses about the prevalence of corruption in government and businesses.
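The Generosity definition above describes a residual: the part of the donation response that log GDP per capita does not explain. A minimal sketch of that idea follows, using invented numbers and hypothetical column names (`log_gdp`, `donated`). It is not the report's actual procedure, only an illustration of what "regressed on" means here.

```r
# Toy data, invented for illustration only (not from the report).
toy <- data.frame(
  log_gdp = c(7.5, 8.2, 9.0, 9.8, 10.5, 11.0),
  donated = c(0.20, 0.35, 0.25, 0.40, 0.30, 0.55)
)

# Regress the donation share on log GDP per capita.
fit <- lm(donated ~ log_gdp, data = toy)

# Keep the residuals as the generosity measure.
toy$generosity <- residuals(fit)

# Residuals from a regression with an intercept average to zero.
mean(toy$generosity)
```

This may be why the Generosity column in the first dataset can take negative values and has a mean close to zero in the skim output.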
The second dataset has 147 rows and 12 columns. The variables we will focus on are:
Happiness score: The average happiness score for that country.
Explained by: These columns indicate how much of the happiness score can be explained by each of the GDP per capita, social support, healthy life expectancy, freedom, generosity, and corruption variables.
Dystopia: A variable defined by the researchers as “a hypothetical country called ‘Dystopia’, named because it has values equal to the world’s lowest national averages for 2019-2021 for each of the six key variables used. We use Dystopia as a benchmark against which to compare contributions from each of the six factors.”
The descriptions of the dataset variables are taken from the researchers’ report, found here: https://worldhappiness.report/ed/2022/happiness-benevolence-and-trust-during-covid-19-and-beyond/#ranking-of-happiness-2019-2021
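The "Explained by" columns and the Dystopia column appear to be an additive breakdown of the happiness score. We have not verified this against the real file, so the sketch below uses made-up rows with the column names from the second dataset and shows how the check could be written.

```r
library(dplyr)

# Made-up rows for illustration; the real file is dataset1.2.
toy <- tibble(
  Country = c("A", "B"),
  `Happiness score` = c(7.0, 4.5),
  `Dystopia (1.83) + residual` = c(2.0, 1.5),
  `Explained by: GDP per capita` = c(1.8, 1.0),
  `Explained by: Social support` = c(1.2, 0.8),
  `Explained by: Healthy life expectancy` = c(0.7, 0.5),
  `Explained by: Freedom to make life choices` = c(0.6, 0.4),
  `Explained by: Generosity` = c(0.2, 0.1),
  `Explained by: Perceptions of corruption` = c(0.5, 0.2)
)

# Columns that are assumed to add up to the happiness score.
parts <- grep("^(Explained by|Dystopia)", names(toy), value = TRUE)

# Rebuild the score from its parts and measure the gap to the reported score.
toy$rebuilt <- rowSums(toy[parts])
toy <- toy %>% mutate(gap = `Happiness score` - rebuilt)
```

On the real data, a `gap` column that is close to zero for every country would support reading the columns as a decomposition.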
Ethical Concerns:
While analyzing the data, it will be crucial to think about the fact that the concept of happiness is very subjective and might be influenced by cultural, societal, and personal factors. We will need to interpret the results with caution and remember that the measures of happiness do not imply that one country’s culture or approach is superior to others. Furthermore, since a lot of the data is based on individual participants’ responses, we need to recognize that these can be highly biased.
Glimpse of data
# add code here
dataset1.1 <- read_excel("data/dataset1.1_happiness.xls")
dataset1.2 <- read_excel("data/dataset1.2_happiness.xls")
skimr::skim(dataset1.1)
| Name | dataset1.1 |
| Number of rows | 2089 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Country name | 0 | 1 | 4 | 25 | 0 | 166 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 2013.73 | 4.46 | 2005.00 | 2010.00 | 2014.00 | 2017.00 | 2021.00 | ▅▆▆▆▇ |
| Life Ladder | 0 | 1.00 | 5.47 | 1.12 | 2.18 | 4.65 | 5.41 | 6.29 | 8.02 | ▁▅▇▆▃ |
| Log GDP per capita | 27 | 0.99 | 9.38 | 1.14 | 5.53 | 8.47 | 9.46 | 10.35 | 11.67 | ▁▃▆▇▅ |
| Social support | 13 | 0.99 | 0.81 | 0.12 | 0.29 | 0.75 | 0.83 | 0.90 | 0.99 | ▁▁▂▆▇ |
| Healthy life expectancy at birth | 58 | 0.97 | 63.18 | 6.95 | 6.72 | 58.97 | 64.98 | 68.36 | 74.35 | ▁▁▁▃▇ |
| Freedom to make life choices | 32 | 0.98 | 0.75 | 0.14 | 0.26 | 0.65 | 0.77 | 0.86 | 0.99 | ▁▂▅▇▆ |
| Generosity | 80 | 0.96 | 0.00 | 0.16 | -0.34 | -0.11 | -0.02 | 0.09 | 0.71 | ▃▇▃▁▁ |
| Perceptions of corruption | 113 | 0.95 | 0.75 | 0.19 | 0.04 | 0.69 | 0.80 | 0.87 | 0.98 | ▁▁▁▅▇ |
| Positive affect | 24 | 0.99 | 0.65 | 0.11 | 0.18 | 0.57 | 0.66 | 0.74 | 0.88 | ▁▁▆▇▅ |
| Negative affect | 16 | 0.99 | 0.27 | 0.09 | 0.08 | 0.21 | 0.26 | 0.32 | 0.70 | ▃▇▃▁▁ |
| Confidence in national government | 216 | 0.90 | 0.48 | 0.19 | 0.07 | 0.33 | 0.47 | 0.62 | 0.99 | ▃▇▇▅▂ |
skimr::skim(dataset1.2)
| Name | dataset1.2 |
| Number of rows | 147 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Country | 0 | 1 | 2 | 25 | 0 | 147 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| RANK | 0 | 1.00 | 74.00 | 42.58 | 1.00 | 37.50 | 74.00 | 110.50 | 147.00 | ▇▇▇▇▇ |
| Happiness score | 1 | 0.99 | 5.55 | 1.09 | 2.40 | 4.89 | 5.57 | 6.31 | 7.82 | ▁▃▇▇▃ |
| Whisker-high | 1 | 0.99 | 5.67 | 1.07 | 2.47 | 5.01 | 5.68 | 6.45 | 7.89 | ▁▃▇▇▃ |
| Whisker-low | 1 | 0.99 | 5.43 | 1.11 | 2.34 | 4.75 | 5.45 | 6.19 | 7.76 | ▁▃▇▇▃ |
| Dystopia (1.83) + residual | 1 | 0.99 | 1.83 | 0.53 | 0.19 | 1.56 | 1.89 | 2.15 | 2.84 | ▁▂▅▇▃ |
| Explained by: GDP per capita | 1 | 0.99 | 1.41 | 0.42 | 0.00 | 1.10 | 1.45 | 1.79 | 2.21 | ▁▃▆▇▆ |
| Explained by: Social support | 1 | 0.99 | 0.91 | 0.28 | 0.00 | 0.73 | 0.96 | 1.11 | 1.32 | ▁▂▅▇▇ |
| Explained by: Healthy life expectancy | 1 | 0.99 | 0.59 | 0.18 | 0.00 | 0.46 | 0.62 | 0.72 | 0.94 | ▁▃▃▇▃ |
| Explained by: Freedom to make life choices | 1 | 0.99 | 0.52 | 0.15 | 0.00 | 0.44 | 0.54 | 0.63 | 0.74 | ▁▁▃▇▇ |
| Explained by: Generosity | 1 | 0.99 | 0.15 | 0.08 | 0.00 | 0.09 | 0.13 | 0.20 | 0.47 | ▆▇▅▁▁ |
| Explained by: Perceptions of corruption | 1 | 0.99 | 0.15 | 0.13 | 0.00 | 0.07 | 0.12 | 0.20 | 0.59 | ▇▅▂▁▁ |
Data 2
Problem or question
Questions:
The questions we are trying to answer are:
Are there significant differences between the representation of women vs men in comic books?
How has the representation of women evolved in comic books over the years?
Is there a difference between how DC and Marvel comics represent women?
Explain why you think this topic is important:
The underrepresentation of women in comic books and the lack of diversity in character gender within major comic book universes (DC and Marvel) is a pervasive issue. Our objective is to investigate and highlight the gender disparity among the characters featured in comic books. This topic is important because comic books are a significant part of popular culture, influencing perceptions and shaping societal norms. Understanding the gender dynamics within the comic book industry and the characters portrayed can help address issues of representation and equality.
Identify the types of data/variables you will use:
We will be using both quantitative and categorical variables. The categorical variables include ALIGN, SEX, ALIVE, FIRST APPEARANCE, YEAR, and GSM, while the quantitative variables include APPEARANCES.
State the major deliverable(s) you will create to solve this problem/answer this question:
The major deliverable will be a comic-strip themed website that will include analyses and visualizations showcasing trends and patterns about gender representation in comic books. We will:
Identify changes in gender representation over the years.
Investigate correlations between gender representation and the number of appearances of the character in comic books.
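The first of these analyses could start from the share of female characters among those introduced each year. The sketch below uses a handful of invented rows. The SEX labels ("Female Characters", "Male Characters") are an assumption about how the column is coded and should be checked against the real files.

```r
library(dplyr)

# Invented rows for illustration only.
toy <- tibble(
  YEAR = c(1960, 1960, 1960, 1960, 2000, 2000, 2000, 2000),
  SEX  = c("Male Characters", "Male Characters", "Male Characters",
           "Female Characters", "Male Characters", "Female Characters",
           "Female Characters", NA)
)

share_by_year <- toy %>%
  filter(!is.na(SEX), !is.na(YEAR)) %>%   # drop characters with unknown sex or year
  group_by(YEAR) %>%
  summarise(
    n_characters = n(),
    share_female = mean(SEX == "Female Characters"),
    .groups = "drop"
  )

share_by_year
```

Dropping rows with a missing SEX is a design choice. The skim output shows that both files have missing values in this column, so the share could also be computed with "unknown" as its own category.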
Introduction and data
Source of the data:
The data comes from Marvel Wikia and DC Wikia. Characters were scraped on August 24, 2014 (README.md).
State when and how it was originally collected (by the original data curator, not necessarily how you found the data):
The data was collected from the Marvel and DC databases and encyclopedias, which house information on all the comic book characters. The appearance counts were scraped on September 2, 2014, and the month and year of the first issue each character appeared in were pulled on October 6, 2014.
Write a brief description of the observations:
The data is split into two files, for DC and Marvel, respectively:
The DC dataset has 13 columns and 6896 rows
The Marvel dataset has 13 columns and 16376 rows
Each file contains 13 variables: page_id, name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, APPEARANCES, FIRST APPEARANCE, YEAR
ALIGN: If the character is Good, Bad or Neutral
SEX: Sex of the character (e.g. Male, Female, etc.)
ALIVE: If the character is alive or deceased
FIRST APPEARANCE: The month and year of the character’s first appearance in a comic book, if available
YEAR: The year of the character’s first appearance in a comic book, if available
GSM: If character is of a gender or sexual minority category
APPEARANCES: The number of appearances of the character in comic books (as of Sep. 2, 2014)
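To compare DC and Marvel, the two files will probably need to be stacked into one table. The column specifications printed by `read_csv()` show the year column as `YEAR` in the DC file and `Year` in the Marvel file, so one of them has to be renamed first. A minimal sketch with invented one-row tables (the `publisher` column name is our own choice):

```r
library(dplyr)

# Invented one-row tables standing in for dataset2.1 and dataset2.2.
dc_toy <- tibble(name = "Hero A", SEX = "Female Characters",
                 APPEARANCES = 10, YEAR = 1990)
marvel_toy <- tibble(name = "Hero B", SEX = "Male Characters",
                     APPEARANCES = 5, Year = 1985)

# Align the year column, then stack the tables and record the source.
comics <- bind_rows(
  DC = dc_toy,
  Marvel = rename(marvel_toy, YEAR = Year),
  .id = "publisher"
)

comics
```

Without the rename, `bind_rows()` would keep `YEAR` and `Year` as two separate columns, each half filled with missing values.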
Ethical Concerns:
Since the data is based on fictional characters, there are no overarching ethical concerns. However, the issue of gender representation and gender identity can be a sensitive one, so we want to make sure we consider the sensitive nature of the subject when thinking about how to display our analyses.
Glimpse of data
# add code here
dataset2.1 <- read_csv("data/dataset2.1_comics_dc.csv")
Rows: 6896 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, FIRST APPEAR...
dbl (3): page_id, APPEARANCES, YEAR
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset2.2 <- read_csv("data/dataset2.2_comics_marvel.csv")
Rows: 16376 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, FIRST APPEAR...
dbl (3): page_id, APPEARANCES, Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(dataset2.1)
| Name | dataset2.1 |
| Number of rows | 6896 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 10 | 46 | 0 | 6896 | 0 |
| urlslug | 0 | 1.00 | 18 | 54 | 0 | 6896 | 0 |
| ID | 2013 | 0.71 | 15 | 16 | 0 | 3 | 0 |
| ALIGN | 601 | 0.91 | 14 | 18 | 0 | 4 | 0 |
| EYE | 3628 | 0.47 | 8 | 18 | 0 | 17 | 0 |
| HAIR | 2274 | 0.67 | 8 | 21 | 0 | 17 | 0 |
| SEX | 125 | 0.98 | 15 | 22 | 0 | 4 | 0 |
| GSM | 6832 | 0.01 | 19 | 21 | 0 | 2 | 0 |
| ALIVE | 3 | 1.00 | 17 | 19 | 0 | 2 | 0 |
| FIRST APPEARANCE | 69 | 0.99 | 4 | 15 | 0 | 774 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| page_id | 0 | 1.00 | 147441.21 | 108388.63 | 1380 | 44105.5 | 141267 | 213203 | 404010 | ▇▆▅▃▂ |
| APPEARANCES | 355 | 0.95 | 23.63 | 87.38 | 1 | 2.0 | 6 | 15 | 3093 | ▇▁▁▁▁ |
| YEAR | 69 | 0.99 | 1989.77 | 16.82 | 1935 | 1983.0 | 1992 | 2003 | 2013 | ▁▁▂▇▆ |
skimr::skim(dataset2.2)
| Name | dataset2.2 |
| Number of rows | 16376 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 4 | 69 | 0 | 16376 | 0 |
| urlslug | 0 | 1.00 | 6 | 71 | 0 | 16376 | 0 |
| ID | 3770 | 0.77 | 15 | 29 | 0 | 4 | 0 |
| ALIGN | 2812 | 0.83 | 14 | 18 | 0 | 3 | 0 |
| EYE | 9767 | 0.40 | 7 | 15 | 0 | 24 | 0 |
| HAIR | 4264 | 0.74 | 4 | 21 | 0 | 25 | 0 |
| SEX | 854 | 0.95 | 15 | 22 | 0 | 4 | 0 |
| GSM | 16286 | 0.01 | 13 | 22 | 0 | 6 | 0 |
| ALIVE | 3 | 1.00 | 17 | 19 | 0 | 2 | 0 |
| FIRST APPEARANCE | 815 | 0.95 | 6 | 6 | 0 | 832 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| page_id | 0 | 1.00 | 300232.08 | 253460.40 | 1025 | 28309.5 | 282578 | 509077 | 755278 | ▇▂▂▅▃ |
| APPEARANCES | 1096 | 0.93 | 17.03 | 96.37 | 1 | 1.0 | 3 | 8 | 4043 | ▇▁▁▁▁ |
| Year | 815 | 0.95 | 1984.95 | 19.66 | 1939 | 1974.0 | 1990 | 2000 | 2013 | ▃▂▅▇▇ |
Data 3
Problem or question
Identify the question you will answer
Determining which gender more frequently engages in dialogue in films and calculating the ratio of their speaking roles.
Exploring whether an actor’s age correlates with the number of words they speak in movies, investigating if there is an age-related pattern in dialogue distribution.
Analyzing whether the quantity of words spoken by actors has any influence on the gross revenue generated by films, potentially highlighting a relationship between dialogue and financial success.
Exploring the impact of supporting actors on a film’s success and determining if there are gender imbalances within supporting roles
Explain why you think this topic is important
This project seeks to analyze the frequency of gender imbalances in speaking roles and dialogues within the film industry, providing insights into which gender tends to exert greater influence in on-screen conversations. By doing so, we aim to gain a comprehensive understanding of the substantial gender gap present in films.
Identify the types of data/variables you will use
We will be using both quantitative and categorical variables. Categorical variables include gender, character name and titles. Meanwhile, quantitative variables include columns like age, words, and script IDs.
State the major deliverable(s) you will create to solve this problem/answer this question
Our major deliverable will be an engaging website featuring compelling visual representations. This platform aims to enlighten future producers about the discernible trends in gender inequality, with the goal of promoting more equitable opportunities within the industry.
Introduction and data
If you are using a dataset:
Identify the source of the data
The data source for this project comes from Polygraph’s movie-dialog gender ratio database.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data)
The original data collection by the Polygraph team began with the acquisition of approximately 8,000 scripts from various screenplay repositories available on the internet. Each script was then matched to its IMDB page. Next, the content of each script was parsed to extract all the lines of dialogue attributed to individual characters. Finally, each character was matched to their respective actor.
Write a brief description of the observations
The data has 3 files.
Character_list5.csv
Has 5 columns - script_id, imdb_character_name, words, gender, age
Has 23048 rows of data
Some of the columns we will be focusing on are:
IMDB Character Name: This column contains the names of the characters in the movie as they are officially recognized on the Internet Movie Database (IMDB).
Words: This column quantifies the number of words spoken by each character in the film. It serves as a measure of the character’s speaking role and contributes to the analysis of dialogue distribution in the movie.
Gender: The “Gender” column indicates the gender of the actor who portrayed each character in the movie.
Age: The “Age” column provides details about the age of the actor who played each character in the film.
Character_mapping.csv
Has 5 columns - script_id, imdb_id, character_from_script, closest_character_name_from_imdb_match, closest_imdb_character_id
Has 99390 rows of data
Some of the columns we will be focusing on are:
Character from Script - This refers to a character as it appears in the screenplay
Closest Character Name - This indicates the most closely matching character name from the associated film’s IMDb database.
This pairing helps align the character in the script with its corresponding IMDb-recognized name from the movie, allowing for consistent referencing and analysis when comparing the written script to the actual film.
Meta_data.csv
Has 6 columns - script_id, imdb_id, year, gross, title, lines_data
Has 2000 rows of data
Some of the columns we will be focusing on are:
Gross - refers to the total revenue or earnings generated by a film, often at the box office or through various distribution channels.
Title - refers to the name or title of the film serving as its primary identifier.
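The three files share a `script_id` column, so the character-level word counts can be summarised per film and joined to the film metadata. The sketch below uses invented rows. The gender codes ("m", "f") and the "NULL" placeholder for a missing age are assumptions about how the real files are coded.

```r
library(dplyr)

# Invented rows standing in for the character list and the metadata file.
characters_toy <- tibble(
  script_id = c(1, 1, 1, 2, 2),
  imdb_character_name = c("a", "b", "c", "d", "e"),
  words  = c(600, 300, 100, 500, 500),
  gender = c("m", "f", "m", "f", "m"),
  age    = c("45", "30", "NULL", "28", "52")  # age is read in as text
)
meta_toy <- tibble(
  script_id = c(1, 2),
  title = c("Film One", "Film Two"),
  gross = c(120, NA)
)

# Convert age to a number; entries that are not numbers become NA.
characters_toy <- characters_toy %>%
  mutate(age_num = suppressWarnings(as.numeric(age)))

# Share of words spoken by female characters in each film, with film details attached.
words_by_film <- characters_toy %>%
  group_by(script_id) %>%
  summarise(female_word_share = sum(words[gender == "f"]) / sum(words),
            .groups = "drop") %>%
  left_join(meta_toy, by = "script_id")

words_by_film
```

A `left_join()` keeps every film that has character data even when `gross` is missing, which matters because the metadata has missing revenue values.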
Ethical Concerns:
The project openly addresses its data collection process and methodology, acknowledging potential limitations and concerns, such as the composition of the sample and the exclusion of minor characters. It emphasizes its primary role as a data-gathering initiative, distinct from a formal academic study, as it operates on the internet domain. While not conducting extensive statistical analysis, the project’s aim is to provide the data openly for others to interpret and use for their own analyses of gender representation in film dialogue. This approach offers a valuable contribution to the broader conversation surrounding this issue.
Reference: https://medium.com/@matthew_daniels/faq-for-the-film-dialogue-by-gender-project-40078209f751
Glimpse of data
# add code here
dataset3.1 <- read_csv("data/dataset3.1_words_character-list.csv")
Rows: 23048 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdb_character_name, gender, age
dbl (2): script_id, words
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset3.2 <- read_csv("data/dataset3.2_words_character-mapping.csv")
Rows: 99390 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): imdb_id, character_from_script, closest_character_name_from_imdb_ma...
dbl (1): script_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset3.3 <- read_csv("data/dataset3.3_words_meta-data.csv")
Rows: 2000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): imdb_id, title, lines_data
dbl (3): script_id, year, gross
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(dataset3.1)
Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There were 50 warnings in `dplyr::summarize()`.
The first warning was:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
Caused by warning in `grepl()`:
! unable to translate '<e9>tienne' to a wide string
ℹ Run `dplyr::last_dplyr_warnings()` to see the 49 remaining warnings.
| Name | dataset3.1 |
| Number of rows | 23048 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| imdb_character_name | 0 | 1 | 1 | 15 | 0 | 17609 | 0 |
| gender | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
| age | 0 | 1 | 1 | 4 | 0 | 105 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| script_id | 0 | 1 | 4194.78 | 2472.99 | 280 | 2095 | 3694 | 6219.75 | 9254 | ▇▇▆▃▅ |
| words | 0 | 1 | 907.87 | 1399.59 | 101 | 193 | 396 | 980.00 | 28102 | ▇▁▁▁▁ |
skimr::skim(dataset3.2)
Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There were 171 warnings in `dplyr::summarize()`.
The first warning was:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
Caused by warning in `grepl()`:
! unable to translate 'maitre d<ed>' to a wide string
ℹ Run `dplyr::last_dplyr_warnings()` to see the 170 remaining warnings.
| Name | dataset3.2 |
| Number of rows | 99390 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| imdb_id | 0 | 1 | 9 | 9 | 0 | 2388 | 0 |
| character_from_script | 47 | 1 | 1 | 79 | 0 | 18540 | 0 |
| closest_character_name_from_imdb_match | 63 | 1 | 1 | 88 | 0 | 30651 | 0 |
| closest_imdb_character_id | 0 | 1 | 9 | 9 | 0 | 25728 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| script_id | 0 | 1 | 4132.47 | 2646.04 | 1 | 1970 | 3587 | 6242 | 9254 | ▆▇▆▃▅ |
skimr::skim(dataset3.3)
Warning: There was 1 warning in `dplyr::summarize()`.
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
ℹ In group 0: .
Caused by warning:
! There were 12 warnings in `dplyr::summarize()`.
The first warning was:
ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
mangled_skimmers$funs)`.
Caused by warning in `grepl()`:
! unable to translate 'Alien<b3>' to a wide string
ℹ Run `dplyr::last_dplyr_warnings()` to see the 11 remaining warnings.
| Name | dataset3.3 |
| Number of rows | 2000 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| imdb_id | 0 | 1 | 9 | 9 | 0 | 2000 | 0 |
| title | 0 | 1 | 2 | 62 | 0 | 1994 | 0 |
| lines_data | 0 | 1 | 9 | 255 | 0 | 2000 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| script_id | 0 | 1.00 | 4396.84 | 2506.36 | 280 | 2201.75 | 4108 | 6607.75 | 9254 | ▇▇▇▅▆ |
| year | 0 | 1.00 | 1998.10 | 14.81 | 1929 | 1992.00 | 2001 | 2009.00 | 2015 | ▁▁▁▃▇ |
| gross | 338 | 0.83 | 103.94 | 140.56 | 0 | 20.00 | 54 | 132.75 | 1798 | ▇▁▁▁▁ |
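The `skim()` warnings for the three film files ("unable to translate ... to a wide string") suggest that some names contain bytes that are not valid UTF-8, such as 0xE9, which is "é" in Latin-1. We have not confirmed the encoding of the real files, so the sketch below only shows the idea on a tiny temporary file that we create ourselves: declare the encoding when reading so that the text is converted to UTF-8.

```r
library(readr)

# Write a tiny CSV containing the Latin-1 byte 0xE9 to a temporary file.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name\n"), as.raw(0xE9), charToRaw("tienne\n")), path)

# Declare the (assumed) source encoding so readr converts the text to UTF-8.
fixed <- read_csv(path,
                  locale = locale(encoding = "latin1"),
                  show_col_types = FALSE)

fixed$name
```

If the same approach works on the real files, passing the `locale` argument in the `read_csv()` calls above should stop the warnings. Setting `show_col_types = FALSE` would also quiet the column specification messages.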