Project proposal

Author

Dank Extra

library(tidyverse)

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

The Stack Overflow Annual Developer Survey 2024 dataset contains responses from more than 65,000 developers worldwide, covering demographics, education, work experience, technology usage, and perspectives on artificial intelligence. We use the TidyTuesday (2024-09-03) version of the data, which keeps only the single-response questions from the main survey sections: 65,437 rows (one per respondent) and 28 columns. The variables include numerical ones (e.g., years of experience, compensation) and categorical ones (e.g., education level, developer type). Most categorical responses are integer-coded, with the corresponding labels available in the crosswalk file; country and currency are stored as text. The dataset is well suited to comparing developer compensation across countries and to analyzing how opinions on AI vary among developers.

Our group includes several developers who frequently visit Stack Overflow for assistance, making us particularly curious about what the broader developer community looks like. We wanted to explore the dataset to better understand the demographics, trends, and challenges faced by developers worldwide. Additionally, given the increasing role of AI in software development, we were especially interested in analyzing developers’ perspectives on AI tools, automation, and how these technologies are shaping the industry. This dataset provides a unique opportunity to examine these insights while also reflecting on our own experiences as developers.

qname_levels_single_response_crosswalk <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-09-03/qname_levels_single_response_crosswalk.csv')

Rows: 122 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): qname, label
dbl (1): level

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

stackoverflow_survey_questions <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-09-03/stackoverflow_survey_questions.csv')

Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): qname, question

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

stackoverflow_survey_single_response <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2024/2024-09-03/stackoverflow_survey_single_response.csv')

Rows: 65437 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, currency
dbl (26): response_id, main_branch, age, remote_work, ed_level, years_code, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dim(stackoverflow_survey_single_response)

[1] 65437    28

head(stackoverflow_survey_single_response)

# A tibble: 6 × 28
  response_id main_branch   age remote_work ed_level years_code years_code_pro
        <dbl>       <dbl> <dbl>       <dbl>    <dbl>      <dbl>          <dbl>
1           1           1     8           3        4         NA             NA
2           2           1     3           3        2         20             17
3           3           1     4           3        3         37             27
4           4           2     1          NA        7          4             NA
5           5           1     1          NA        6          9             NA
6           6           4     8          NA        4         10             NA
# ℹ 21 more variables: dev_type <dbl>, org_size <dbl>,
#   purchase_influence <dbl>, buildvs_buy <dbl>, country <chr>, currency <chr>,
#   comp_total <dbl>, so_visit_freq <dbl>, so_account <dbl>,
#   so_part_freq <dbl>, so_comm <dbl>, ai_select <dbl>, ai_sent <dbl>,
#   ai_acc <dbl>, ai_complex <dbl>, ai_threat <dbl>, survey_length <dbl>,
#   survey_ease <dbl>, converted_comp_yearly <dbl>, r_used <dbl>,
#   r_want_to_use <dbl>
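
Because most columns hold integer codes, readable labels have to be attached from the crosswalk before plotting. The following is a minimal sketch of that join, assuming the crosswalk has the qname, level, and label columns shown in its column specification above. The two small tibbles and their labels are made-up stand-ins for the real tables so that the example runs on its own.

```r
library(dplyr)

# Toy stand-ins for the real tables; the label text is invented for illustration.
responses <- tibble(
  response_id = 1:4,
  ed_level    = c(2, 1, NA, 2)
)
crosswalk <- tibble(
  qname = c("ed_level", "ed_level", "remote_work"),
  level = c(1, 2, 1),
  label = c("Label A", "Label B", "Label C")
)

# Keep only the crosswalk rows for one question and rename the columns
# so that the join key matches the coded column in the responses.
ed_labels <- crosswalk %>%
  filter(qname == "ed_level") %>%
  select(ed_level = level, ed_level_label = label)

# A left join keeps every respondent; missing codes get a missing label.
decoded <- responses %>%
  left_join(ed_labels, by = "ed_level")

decoded
```

With the real data, responses would be stackoverflow_survey_single_response and crosswalk would be qname_levels_single_response_crosswalk; the same pattern can be repeated for each coded column.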

Questions

The two questions you want to answer.

How do education level, age, country, and programming language usage influence the average annual compensation of developers?

Note: We plan to compare compensation across different countries. We will explore two approaches:
- Normalization: Convert all pay to a single currency for direct comparison.
- Country-specific: Examine compensation within each country (e.g., via faceting by country in visualizations).

How do developers’ opinions about AI vary based on their work background and experience?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

Question 1: How do education level, age, country, and programming language usage influence the average annual compensation of developers?

Analysis Plan: No extra variables need to be created, and no extra data need to be loaded. The stackoverflow_survey_single_response.csv from the dataset will be loaded, from which all required variables will be extracted.

Plan:
- Filter out anomalies such as zeros or extremely large values in comp_total.
- Investigate two approaches to handling regional compensation differences:
  1. Normalize comp_total by converting it to a single currency (if necessary and feasible).
  2. Facet by country and compare compensation distributions separately for each country.
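
The filtering step and the country-specific approach could look roughly like the sketch below. It runs on toy data, and the cutoff upper_cap is a hypothetical value chosen only for illustration; the real threshold would need to be justified from the distribution of comp_total.

```r
library(dplyr)

# Toy data; the real column is stackoverflow_survey_single_response$comp_total.
comp <- tibble(
  country    = c("A", "A", "A", "B", "B", "B"),
  comp_total = c(0, 50000, 60000, 1e12, 70000, NA)
)

# Hypothetical upper bound for plausible values (an assumption, not a survey rule).
upper_cap <- 1e7

# Drop missing values, zeros, and implausibly large values.
comp_clean <- comp %>%
  filter(!is.na(comp_total), comp_total > 0, comp_total <= upper_cap)

# Country-specific view: summarise within each country instead of pooling.
comp_by_country <- comp_clean %>%
  group_by(country) %>%
  summarise(n = n(), median_comp = median(comp_total), .groups = "drop")

comp_by_country
```

The grouped summary is the tabular counterpart of faceting by country: values are only ever compared with others from the same country, so no currency conversion is needed.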

Visualizations:
- Boxplots comparing total compensation across education levels, countries, and programming language usage.
- Scatter plot with a regression line to illustrate relationships between age, professional coding experience, and compensation.

Modeling:
- Random forest and/or linear regression to identify which factors are most predictive of compensation while adjusting for the other variables.
  - For instance, a regression model estimates how compensation is associated with each factor while holding the others constant. With survey data this describes association rather than causation, so we read “influence” in that sense.

Missing values: Address missing values (NAs) in a consistent way (e.g., removal, or imputation if justifiable).
Required variables:
- comp_total (numeric) – Annual compensation
- age (integer) – Coded variable for developer’s age group
- country (character) – Developer’s country
- ed_level (integer) – Coded variable for highest education level
- r_used (integer) – Coded variable for R usage; the single-response file has no lang_worked_with column, so r_used (with r_want_to_use) is the available language-usage variable
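
As a rough illustration of the regression idea, the sketch below fits a linear model on simulated data. The data, the effect size, and the choice to model log compensation are all assumptions made for the example, not results or decisions from the survey. The point it shows is that integer-coded categories such as ed_level and age should enter the model as factors rather than as numbers.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the survey data (values are not real).
toy <- data.frame(
  ed_level       = factor(sample(1:3, n, replace = TRUE)),  # coded category
  age            = factor(sample(1:4, n, replace = TRUE)),  # coded category
  years_code_pro = runif(n, min = 0, max = 30)
)

# Assumed data-generating process: log compensation rises with experience.
toy$log_comp <- 10 + 0.03 * toy$years_code_pro + rnorm(n, sd = 0.5)

# Each coefficient is the association with log compensation,
# holding the other predictors constant.
fit <- lm(log_comp ~ ed_level + age + years_code_pro, data = toy)

coef(summary(fit))
```

Converting the codes with factor() makes lm() estimate one coefficient per category level instead of a single slope across arbitrary code numbers.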

Question 2: How do developers’ opinions about AI vary based on their work background and experience?

Plan:
- Consider creating an aggregated AI-opinion score by combining or normalizing the values of ai_select, ai_sent, ai_acc, ai_complex, and ai_threat into a single scale (where logically justifiable). This can make it easier to visualize overall sentiment.
- Focus on developers who use AI (ai_select > 1) to examine how strongly they feel about AI’s potential benefits vs. threats.
- Possible visualizations:
  - Bar chart or boxplot comparing AI sentiment (or the aggregated score) across different work backgrounds (e.g., dev_type, remote_work, org_size, years_code_pro).
  - Heatmap illustrating how work characteristics (e.g., type of work, organization size) and AI opinions correlate.
  - Stacked bar chart (or small-multiple bar charts) showing the percentage distribution of AI opinions (e.g., ai_sent, ai_acc, ai_complex, ai_threat).

Required variables:
- remote_work (integer) – Current work situation
- dev_type (integer) – Best current-job description
- org_size (integer) – People in the organization
- purchase_influence (integer) – Level of influence in purchasing new technology
- years_code_pro (integer) – Years the respondent has coded professionally
- ai_select (integer) – Use of AI in development process
- ai_sent (integer) – Stance on using AI tools as part of development
- ai_acc (integer) – Trust in AI accuracy
- ai_complex (integer) – Belief about how well AI handles complex tasks
- ai_threat (integer) – Belief that AI is a threat to current job
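
One possible way to build the aggregated AI-opinion score is to standardize each item and average the z-scores per respondent, as sketched below on toy data. This assumes that higher codes mean a more favorable opinion on every item, which would have to be checked against the crosswalk (and items recoded where needed) before using the real columns. The choice of three items here is only for brevity.

```r
library(dplyr)

# Toy responses; the values are invented for illustration.
ai <- tibble(
  ai_select  = c(3, 3, 1, 3),
  ai_sent    = c(1, 5, 3, NA),
  ai_acc     = c(2, 4, 3, 3),
  ai_complex = c(1, 5, 2, 4)
)

# Keep respondents with ai_select > 1, as in the plan above.
ai_users <- ai %>%
  filter(ai_select > 1)

# Standardize each item (mean 0, sd 1) so that items on different
# scales contribute equally, then average across items per respondent.
items <- c("ai_sent", "ai_acc", "ai_complex")
z <- scale(as.matrix(ai_users[items]))
ai_users$ai_score <- rowMeans(z, na.rm = TRUE)

ai_users
```

Using na.rm = TRUE lets a respondent with one unanswered item still receive a score from the remaining items; requiring complete answers instead is an equally defensible choice.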