library(tidyverse)
library(skimr)
library(haven)
library(rvest)
library(robotstxt)
Team Elegant Evee
Proposal
Data 1
Introduction and data
Identify the source of the data.
- Our data come from the Current Population Survey (CPS), distributed by IPUMS and linked here: https://cps.ipums.org/cps/. IPUMS stands for Integrated Public Use Microdata Series.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- IPUMS CPS is an integrated set of data from the Current Population Survey (CPS) from 1962 forward. The Current Population Survey (CPS) is administered monthly by the U.S. Bureau of the Census to over 65,000 households. These surveys gather information on education, labor force status, demographics, and other aspects of the U.S. population.
Write a brief description of the observations.
- The full data set contains monthly observations starting from 1962 on demographic information (race, gender, etc.), education, income, tax, health insurance, welfare, child support, and more. It is well suited to tracking trends among the general population and contains millions of rows and hundreds of columns. The extract summarized in the glimpse below contains survey year, month, age, sex, race, employment status, occupation, and industry; every record in this extract comes from the March survey (month = 3).
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Question 1: Are racial gaps in STEM education narrowing?
Question 2: What is the proportion of required child support that goes unpaid, and how has this proportion changed over time?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Question 1:
Description: We want to explore how racial gaps in STEM education have evolved over the last few decades. STEM has historically been a predominantly white set of fields, but diversity, equity, and inclusion initiatives at the university and industry levels have aimed to address racial inequality in these settings. As such, our question aims to address these shifts.
Hypothesis: The racial gaps in STEM education are indeed narrowing and have narrowed significantly in the last 3 decades as compared to the decades prior.
Question 2:
Description: We want to explore how the proportion of required child support that goes unpaid has changed over the past couple of decades. Child support is a key indicator of economic progress and gender norms. Who pays child support, and how often payments are made, reflects the financial situations of American adults.
Hypothesis: The proportion of required child support that goes unpaid has remained roughly constant over the decades.
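The quantity in Question 2 can be written as total unpaid support divided by total support due, computed per year. The sketch below is only an illustration on made-up numbers: the column names `cs_due` and `cs_received` are hypothetical placeholders, not actual IPUMS CPS variable names, and the real analysis would also need survey weights.

```r
library(dplyr)

# Toy data with hypothetical column names (not real IPUMS CPS variables).
support <- tibble(
  year        = c(2000, 2000, 2010, 2010),
  cs_due      = c(1000, 3000, 2000, 2000),  # support owed
  cs_received = c( 500, 3000, 1000, 1500)   # support actually paid
)

# Proportion unpaid = sum(due - received) / sum(due), within each year.
unpaid_by_year <- support %>%
  group_by(year) %>%
  summarize(
    prop_unpaid = sum(cs_due - cs_received) / sum(cs_due),
    .groups = "drop"
  )

unpaid_by_year
```

Summing before dividing weights each household by the amount owed; averaging per-household ratios instead would answer a slightly different question.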
Identify the types of variables in your research question. Categorical? Quantitative?
The variables in our dataset are of mixed types. Some of them are character/categorical and others are numerical/quantitative variables. Centrally, the analysis of our research question requires use of the race variable (categorical), industry (categorical), year (categorical), and child support amount (numerical).
Glimpse of data
data1 <- read_dta("data/cps_00002.dta")
skim(data1)
Name | data1 |
Number of rows | 9697681 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 1995.14 | 16.65 | 1962 | 1981 | 1997 | 2009 | 2022 | ▅▆▆▇▇ |
month | 0 | 1 | 3.00 | 0.00 | 3 | 3 | 3 | 3 | 3 | ▁▁▇▁▁ |
age | 0 | 1 | 34.46 | 22.11 | 0 | 15 | 33 | 51 | 99 | ▇▇▆▃▁ |
sex | 0 | 1 | 1.52 | 0.50 | 1 | 1 | 2 | 2 | 2 | ▇▁▁▁▇ |
race | 0 | 1 | 142.48 | 132.05 | 100 | 100 | 100 | 100 | 830 | ▇▁▁▁▁ |
empstat | 0 | 1 | 14.84 | 12.77 | 0 | 10 | 10 | 31 | 36 | ▅▇▁▁▅ |
occ | 0 | 1 | 1020.01 | 1988.95 | 0 | 0 | 99 | 902 | 9840 | ▇▁▁▁▁ |
ind | 0 | 1 | 1384.11 | 2704.73 | 0 | 0 | 11 | 802 | 9890 | ▇▁▁▁▁ |
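The summary above reports all eight columns as numeric, so statistics such as the mean of `race` or `sex` describe category codes rather than quantities. A minimal sketch of one way to handle this, assuming the Stata extract carries value labels: convert coded columns to factors with `haven::as_factor()` before summarizing. The toy data and labels below are illustrative stand-ins, not rows from the extract.

```r
library(dplyr)
library(haven)

# Toy stand-in for the extract: `sex` is a labelled numeric column,
# similar to what read_dta() produces for coded variables.
toy <- tibble(
  year = c(1990, 1990, 2020),
  sex  = labelled(c(1, 2, 2), labels = c(Male = 1, Female = 2)),
  age  = c(30, 45, 27)
)

# as_factor() replaces each code with its value label.
toy_fct <- toy %>%
  mutate(sex = as_factor(sex))

# Counting categories is now meaningful; a mean of the codes was not.
count(toy_fct, sex)
```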
Data 2
Introduction and data
Identify the source of the data.
- Our General Data Protection Regulation (GDPR) fine data comes from a Kaggle data set (https://www.kaggle.com/datasets/andreibuliga1/gdpr-fines-20182020-updated-23012021?resource=download) sourced from https://www.privacyaffairs.com/gdpr-fines/, as compiled by Privacy Affairs, a website dedicated to tracking developments in cybersecurity and digital privacy. We plan to eventually scrape the website ourselves to obtain more up-to-date records, and we will investigate means of doing so in the coming weeks; the process is more complicated than just using SelectorGadget.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Privacy Affairs compiled a list of fines and relevant information from public records released by the data protection authorities of various European Union members in the years following the legislation's passage in 2016. The Kaggle data set includes 510 observations from this list, covering fines issued from January 2018 to December 2020, all sourced from reports by the regulatory agencies that issued them. Importantly, however, none of the original Privacy Affairs data is provided in a structured, analyzable format. Our team will work to scrape data from the linked page using SelectorGadget and rvest after conducting preliminary exploration with the Kaggle data.
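As a rough sketch of the rvest step, the example below parses a small inline HTML table rather than the live site, so it runs without network access. The markup, the `#fines` selector, and the column names are assumptions for illustration; the real page's structure is likely different, and if the page builds its list with JavaScript, `read_html()` alone may not see the records.

```r
library(rvest)

# Before scraping the real site, permission could be checked with
# robotstxt::paths_allowed() (not run here because it needs network access).

# Hypothetical markup standing in for the fines page.
page <- minimal_html('
  <table id="fines">
    <tr><th>country</th><th>price</th></tr>
    <tr><td>Spain</td><td>5000</td></tr>
    <tr><td>France</td><td>20000</td></tr>
  </table>
')

# Select the table by its (assumed) id and convert it to a tibble.
fines_scraped <- page %>%
  html_element("#fines") %>%
  html_table()

fines_scraped
```

For the live page, `minimal_html()` would be replaced by `read_html()` on the site URL, with a selector found using SelectorGadget.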
Write a brief description of the observations.
- Each observation/row constitutes a fine issued for violating a GDPR article. A company or individual who mishandled consumer data protected under this landmark privacy legislation and consequently incurred a fine is identified in this data set, along with the date the fine was issued, the relevant GDPR article violated, and the fine amount. As such, each observation includes both quantitative and qualitative values.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Question 1: How does the GDP of a country relate to the corresponding fine’s amount?
Question 2: How have the frequency and cost of fines changed over time?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
In devising our first question, we recognize that not all GDPR violations are equal in scale. Wealthier countries may intuitively feel comfortable with issuing larger fines: they worry less about disincentivizing corporations from doing business with their populations because of the lucrative nature of their markets, and they have more citizens whose data could be linked. With this motivating thought, we plan to use regression methods to assess the relationship between the wealth of a country and the amount an offender was fined. Our hypothesis is that wealthier countries like Spain, France, and Germany will tend to issue larger fines and that other factors have less of an effect on fines.
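One possible form for the regression is a log-log linear model, sketched below on made-up numbers. GDP is not a column in the fines data, so the sketch assumes it has already been joined in from an external source; the `gdp` column and all values are hypothetical. Because the fine amounts are heavily right-skewed and include zeros, the sketch drops zero fines before taking logs, which is one choice among several.

```r
library(dplyr)

# Toy data: `gdp` is a hypothetical joined-in column, values are invented.
toy <- tibble(
  country = c("A", "B", "C", "D", "E"),
  gdp     = c(1e10, 1e11, 1e12, 2e12, 4e12),
  price   = c(0, 5000, 20000, 100000, 250000)
)

# log(0) is undefined, so keep only strictly positive fines.
positive <- filter(toy, price > 0)

# The slope on log(gdp) estimates the percent change in fine size
# associated with a one percent change in GDP.
fit <- lm(log(price) ~ log(gdp), data = positive)

coef(fit)
```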
For our second question, we also recognize that new legislation is often tested in the judiciary, and the novelty of regulations may lead to greater willingness or hesitancy to penalize under their provisions. For example, the frequency of U.S. antitrust lawsuits has changed drastically over time, even under the same legal framework, so our group expects that GDPR fine frequencies and amounts have also changed over time. Our hypothesis is that GDPR fines are becoming more frequent and more costly as time goes on.
- Identify the types of variables in your research question. Categorical? Quantitative?
To answer the first question, we will primarily use numerical/quantitative variables, namely the GDP of a country and the fine amounts issued by it. These are both numerical columns containing dollar amounts.
To answer the second question, we will use a mix of categorical and numerical/quantitative variables, namely date and fine amount. Fine amounts are numerical, and dates are categorical (although the classification of dates is sometimes debated, we treat them as categorical because we would group fines by date rather than conduct mathematical operations on the dates themselves).
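A minimal sketch of how frequency and cost could be tallied by year, using toy rows rather than the real file. It assumes that missing dates are dropped and that dates before 2018 (such as the 1970-01-01 minimum in the summary below) are placeholders rather than real issuance dates; both choices would need to be checked against the data.

```r
library(dplyr)

# Toy rows mimicking the date problems seen in the data (values invented).
toy_fines <- tibble(
  date  = as.Date(c("1970-01-01", NA, "2019-05-10", "2019-11-02", "2020-03-02")),
  price = c(1000, 2000, 3000, 5000, 10000)
)

by_year <- toy_fines %>%
  # Drop missing dates and assumed placeholder dates before 2018.
  filter(!is.na(date), date >= as.Date("2018-01-01")) %>%
  # Treat the year as the grouping category.
  mutate(year = as.integer(format(date, "%Y"))) %>%
  group_by(year) %>%
  summarize(n_fines = n(), total_price = sum(price), .groups = "drop")

by_year
```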
Glimpse of data
fines <- read_csv("data/all_fines.csv")
Rows: 510 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): picture, country, authority, org_fined, articleViolated, type, sou...
dbl (2): id, price
date (1): date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(fines)
Name | fines |
Number of rows | 510 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
character | 8 |
Date | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
picture | 0 | 1.00 | 67 | 80 | 0 | 29 | 0 |
country | 0 | 1.00 | 5 | 14 | 0 | 29 | 0 |
authority | 0 | 1.00 | 14 | 86 | 0 | 46 | 0 |
org_fined | 0 | 1.00 | 2 | 93 | 0 | 385 | 0 |
articleViolated | 0 | 1.00 | 7 | 156 | 0 | 170 | 0 |
type | 1 | 1.00 | 7 | 143 | 0 | 26 | 0 |
source | 238 | 0.53 | 29 | 209 | 0 | 237 | 0 |
summary | 0 | 1.00 | 21 | 1590 | 0 | 497 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 290 | 0.43 | 1970-01-01 | 2020-12-11 | 2020-03-02 | 98 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 255.5 | 147.37 | 1 | 128.25 | 255.5 | 382.75 | 5.1e+02 | ▇▇▇▇▇ |
price | 0 | 1 | 558966.2 | 3519231.84 | 0 | 3000.00 | 10000.0 | 50000.00 | 5.0e+07 | ▇▁▁▁▁ |
Data 3
Introduction and data
Identify the source of the data.
- The source of the data is linked here: https://www.kaggle.com/datasets/georgescutelnicu/top-100-popular-movies-from-2003-to-2022-imdb.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The data covers movies from 2003 to 2022. The dataset's curator scraped the 100 most popular movies of each year in this 20-year period from IMDb, a movie rating website.
Write a brief description of the observations.
- The dataset contains movies spanning 2003 to 2022. It includes variables such as rating, year, run time, filming location, and more. It contains 13 columns and 2,000 rows, with 1,989 unique titles.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Question 1: Is there a relationship between a film’s location and its budget?
Question 2: Which release dates see the greatest profits?
A description of the research topic along with a concise statement of your hypotheses on this topic.
For Question 1, we will investigate how filming location affects the budget of a movie. Our hypothesis is that films shot outside the US have the largest budgets, because overseas locations tend to increase the cost of transporting the set, equipment, and cast/crew.
For Question 2, we will investigate when the best time to release a movie is. Our hypothesis is that films released in the summer see the greatest profits, since more people (especially younger people who are still in school) are free in the summer and have more time to watch movies.
Identify the types of variables in your research question. Categorical? Quantitative?
For Question 1, filming location is categorical, since it is sorted by country. Budget is quantitative, since it represents a dollar value; it is stored as text in the raw data and must be converted to a numerical value.
For Question 2, release month is categorical and can be grouped into the seasons of fall, winter, spring, and summer. We would also use the quantitative variable income to measure profits.
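Since Budget and Income are read in as character columns, they need converting before any comparison. The sketch below shows one way to do that and to map months to seasons, on invented rows. The string formats (such as "$150,000,000" and "Unknown") are assumptions about how the raw values look, and defining profit as income minus budget is a simplification.

```r
library(dplyr)
library(readr)

# Toy rows with assumed string formats (not taken from the real file).
toy_movies <- tibble(
  Month  = c("June", "December", "March"),
  Budget = c("$150,000,000", "$20,000,000", "Unknown"),
  Income = c("$400,000,000", "$15,000,000", "$1,000,000")
)

cleaned <- toy_movies %>%
  mutate(
    # parse_number() drops currency symbols and grouping commas;
    # "Unknown" is declared as a missing-value marker.
    budget_num = parse_number(Budget, na = c("", "NA", "Unknown")),
    income_num = parse_number(Income, na = c("", "NA", "Unknown")),
    season = case_when(
      Month %in% c("December", "January", "February") ~ "winter",
      Month %in% c("March", "April", "May")           ~ "spring",
      Month %in% c("June", "July", "August")          ~ "summer",
      Month %in% c("September", "October", "November") ~ "fall",
      TRUE ~ NA_character_  # any non-month value stays missing
    ),
    profit = income_num - budget_num
  )

cleaned
```

Because `parse_number()` ignores the currency symbol, budgets reported in different currencies would need separate handling before they can be compared.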
Glimpse of data
movies <- read_csv("data/movies.csv")
Rows: 2000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Title, Month, Certificate, Runtime, Directors, Stars, Genre, Filmi...
dbl (2): Rating, Year
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(movies)
Name | movies |
Number of rows | 2000 |
Number of columns | 13 |
_______________________ | |
Column type frequency: | |
character | 11 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Title | 0 | 1.00 | 1 | 62 | 0 | 1989 | 0 |
Month | 0 | 1.00 | 3 | 9 | 0 | 14 | 0 |
Certificate | 34 | 0.98 | 1 | 9 | 0 | 12 | 0 |
Runtime | 0 | 1.00 | 2 | 7 | 0 | 113 | 0 |
Directors | 0 | 1.00 | 3 | 194 | 0 | 1082 | 0 |
Stars | 0 | 1.00 | 38 | 83 | 0 | 1990 | 0 |
Genre | 0 | 1.00 | 5 | 28 | 0 | 244 | 0 |
Filming_location | 0 | 1.00 | 2 | 29 | 0 | 97 | 0 |
Budget | 0 | 1.00 | 3 | 15 | 0 | 305 | 0 |
Income | 0 | 1.00 | 4 | 14 | 0 | 1856 | 0 |
Country_of_origin | 0 | 1.00 | 5 | 100 | 0 | 406 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Rating | 1 | 1 | 6.66 | 0.91 | 1.9 | 6.10 | 6.7 | 7.30 | 9 | ▁▁▃▇▂ |
Year | 0 | 1 | 2012.50 | 5.77 | 2003.0 | 2007.75 | 2012.5 | 2017.25 | 2022 | ▇▇▇▇▇ |