Project Togepi

Proposal

library(tidyverse)
library(skimr)

Data 1

Problem or question

The objective of this project is to teach two data analysis skills, exploratory data analysis (EDA) and logistic regression, using telecom customer behavior and churn prediction as the case study. The plan is to conduct an in-depth EDA to identify notable customer behaviors, then apply predictive modeling to determine which customers are most likely to churn. The analysis aims to build proficiency in data analysis while addressing customer retention, a critical challenge in the telecommunications domain.
This topic is important because analyzing customer churn in the telecommunications industry is crucial for retaining revenue, optimizing resources, improving customer satisfaction, gaining a competitive edge, reducing costs, and fostering a data-driven decision-making culture.
The dataset includes the following types of variables:
1. Categorical Variables (Nominal): gender, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling, PaymentMethod, Churn. customerID is a unique identifier rather than an analysis variable.
2. Categorical Variables (Ordinal): Contract (ordered by commitment length: “Month-to-month,” “One year,” “Two year”)
3. Numerical Variables: tenure (stored as an integer), MonthlyCharges and TotalCharges (stored as doubles). SeniorCitizen is also stored as an integer, but it is a 0/1 indicator and is better treated as categorical.
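The variable types listed above can be made explicit in R before any modeling. The following is a minimal sketch, assuming a toy tibble (`toy`, with made-up rows) that reuses a few of the dataset's column names; it is not part of the original analysis.

```r
library(dplyr)

# Toy rows that mirror a few columns of the churn data.
# The values are illustrative only, not taken from the dataset.
toy <- tibble(
  customerID    = c("A-001", "A-002", "A-003"),
  SeniorCitizen = c(0L, 1L, 0L),
  Contract      = c("Month-to-month", "Two year", "One year"),
  tenure        = c(1L, 34L, 12L),
  Churn         = c("Yes", "No", "No")
)

typed <- toy %>%
  mutate(
    # 0/1 indicator: treat as a category rather than a quantity
    SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1), labels = c("No", "Yes")),
    # Contract length has a natural order, so use an ordered factor
    Contract = factor(
      Contract,
      levels = c("Month-to-month", "One year", "Two year"),
      ordered = TRUE
    ),
    # Put "No" first so that "Yes" is the modeled outcome in glm()
    Churn = factor(Churn, levels = c("No", "Yes"))
  )

glimpse(typed)
```

Whether Contract should enter a model as an ordered or an unordered factor is a design choice; an ordered factor changes the default contrasts that R uses.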
The major deliverables for developing educational content on EDA and logistic regression in R using the tidyverse are:
1. Educational Slides/Presentation: Structured presentation covering key concepts and methods.
2. Tutorial Notebooks with R Code: Interactive notebooks demonstrating step-by-step procedures.
3. Exercise Sheets with Solutions: Practice exercises to reinforce learning.
4. Supplementary Resources: Additional readings and recommendations for further learning, curated in a PDF.
5. Feedback and Q&A Session: Opportunity to seek clarification and provide feedback.

Introduction and data

Source: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
Collection Time and Method:
- The data represents a snapshot of 7043 customers in California during Q3, indicating a specific time frame for the observations.
- The data collection method, whether through surveys, transaction records, or other means, is not specified.
Brief Description of Observations:
- The dataset provides insights into customer churn in the telecommunications industry.
- It contains various attributes related to telecom customers, such as customer ID, gender, tenure, types of services used, contract details, billing information, etc.
- Each observation likely represents a unique customer and their corresponding attributes and behavior.
- The ‘Churn’ variable indicates whether a customer has left the platform (‘Yes’) or not (‘No’), serving as a critical factor for analysis and predictive modeling.
- The dataset is valuable for analyzing customer behavior and predicting customer churn, aiding in customer retention strategies and business decision-making.

Glimpse of data

library(readr)

# URL of the CSV file
url <- "https://drive.google.com/uc?id=1E32DjzTI9nt8ho3x0fs2bNsBSZlJb1bX"

churndata <- read.csv(url)

# Display a glimpse of the data
glimpse(churndata)

Rows: 7,043
Columns: 21
$ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
$ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
$ SeniorCitizen    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
$ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
$ tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
$ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
$ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
$ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
$ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
$ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
$ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
$ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
$ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
$ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
$ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
$ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
$ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
$ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
$ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
$ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
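To illustrate the two skills this proposal targets, here is a minimal sketch of an EDA summary followed by a logistic regression. It assumes simulated data (`sim`) that borrows the column names tenure, Contract, MonthlyCharges, and Churn; the simulated values and the choice of predictors are assumptions for illustration, not results from the real dataset.

```r
library(dplyr)

set.seed(1)
n_customers <- 200

# Simulated stand-in for churndata: real column names, hypothetical values
sim <- tibble(
  tenure         = sample(1:72, n_customers, replace = TRUE),
  Contract       = sample(c("Month-to-month", "One year", "Two year"),
                          n_customers, replace = TRUE),
  MonthlyCharges = round(runif(n_customers, 20, 110), 2)
) %>%
  mutate(
    # Assumed churn mechanism, used only to generate example labels
    p     = plogis(0.5 - 0.04 * tenure + 0.01 * MonthlyCharges),
    Churn = factor(ifelse(runif(n_customers) < p, "Yes", "No"),
                   levels = c("No", "Yes"))
  )

# EDA step: churn rate within each contract type
churn_by_contract <- sim %>%
  group_by(Contract) %>%
  summarise(customers = n(), churn_rate = mean(Churn == "Yes"), .groups = "drop")

# Modeling step: logistic regression, with "Yes" as the modeled outcome
fit <- glm(Churn ~ tenure + MonthlyCharges + Contract,
           data = sim, family = binomial)

# Predicted churn probability for each customer
sim$churn_prob <- predict(fit, type = "response")

print(churn_by_contract)
summary(fit)
```

With the real data, `sim` would be replaced by `churndata` after converting Churn to a factor with levels "No" and "Yes".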

Data 2

Problem or question

The main problem to solve in this project is to identify and analyze correlations between various parameters related to movies, such as film length, IMDB rating, distributing studio, genre, and worldwide gross. The goal is to investigate whether there are significant and meaningful relationships between these parameters, and if so, how they impact a movie’s success.
We think this topic is important because understanding correlations between movie-related parameters is crucial for the film industry. It provides insights into what factors influence a movie’s box office performance and critical reception. Filmmakers, producers, and distributors can use this information to make informed decisions about movie production, marketing strategies, and investments. Additionally, understanding audience preferences and their correlation with film characteristics helps in tailoring content to meet the demands of the market.
The dataset includes the following types of variables:
- Film Information: Genre (Main Genre, Genre_2, Genre_3), Length of the film, IMDB rating, Distributing studio, Title of the movie
- Financial Information: Worldwide gross
The major deliverable(s):
- Correlation Analysis Tool: A Shiny web application where users can select two parameters (e.g., film length vs. IMDB rating) and visualize their correlation. This tool will allow users to interactively explore and understand the relationships between different movie-related parameters.
- Correlation Insights and Visualizations: Visualizations (e.g., scatter plots, correlation matrices, trend lines) within the Shiny app to present the correlations between selected variables. Insights will be derived from these visualizations to help users understand the relationships between movie parameters.
- Summary Report: A summary report outlining key findings, significant correlations, and insights from the analysis, summarizing the main conclusions of the project and providing a comprehensive view of the relationships between movie parameters.
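The correlation tool described in the deliverables could be structured roughly as follows. This is a minimal sketch, assuming a small stand-in data frame (`movies`) with three numeric columns; the input IDs, layout, and the use of Pearson correlation are assumptions, not a finalized design.

```r
library(shiny)
library(ggplot2)

# Hypothetical stand-in for the movie data; the real app would use the
# full dataset with worldwide gross converted to a numeric column.
movies <- data.frame(
  imdb_rating  = c(7.4, 8.5, 7.8, 6.2, 7.8),
  length       = c(135, 156, 118, 129, 119),
  rank_in_year = 1:5
)
numeric_vars <- names(movies)

ui <- fluidPage(
  titlePanel("Movie parameter correlations"),
  sidebarLayout(
    sidebarPanel(
      selectInput("x", "X variable", choices = numeric_vars, selected = "length"),
      selectInput("y", "Y variable", choices = numeric_vars, selected = "imdb_rating")
    ),
    mainPanel(
      plotOutput("scatter"),
      textOutput("corr")
    )
  )
)

server <- function(input, output, session) {
  # Scatter plot of the two selected variables with a linear trend line
  output$scatter <- renderPlot({
    ggplot(movies, aes(x = .data[[input$x]], y = .data[[input$y]])) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
      labs(x = input$x, y = input$y)
  })

  # Pearson correlation between the two selected variables
  output$corr <- renderText({
    r <- cor(movies[[input$x]], movies[[input$y]], use = "complete.obs")
    sprintf("Pearson r = %.2f", r)
  })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui, server)
```

Categorical parameters such as genre or studio would need a different summary (for example, grouped box plots), since a correlation coefficient only applies to pairs of numeric variables.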

Introduction and data

Source of Data: CrowdFlower - Data for Everyone
Data Inclusions:
- Movie Titles
- Poster URLs
- Genre Information
- Run Time
- MPAA Ratings
- IMDB Rating
- Rotten Tomatoes Audience/Critic Ratings
- Box Office Receipts (Adjusted for Inflation)
Collection methodology: The dataset originated from a data categorization job conducted by CrowdFlower. The purpose of the task was to gather information about the ten most popular movies of each year from 1975 to 2015.
As described by the source, the dataset consists of 410 rows (ten films for each of those 41 years). The file loaded below has 437 rows and includes 2018 releases, so it appears to be an extended version, and its exact coverage should be confirmed during EDA. Either way, the data gives a broad view of popular movies over several decades and supports analyses of movie trends, ratings, and financial performance.

Glimpse of data

# URL of the CSV file (Google Drive link)
url <- "https://drive.google.com/uc?export=download&id=1psklLnp-K0A7ggoPexxykmWY7vB0DAds"

# Specify the column types
column_types <- cols(
  Main_Genre = col_character(),
  Genre_2 = col_character(),
  Genre_3 = col_character(),
  rating = col_character(),
  studio = col_character(),
  title = col_character(),
  worldwide_gross = col_character(),
  imdb_rating = col_double(),
  length = col_double(),
  rank_in_year = col_double(),
  year = col_double()
)

# Read the CSV file with specified column types
blockbusters <- read_csv(url, col_types = column_types)

# Display a glimpse of the data
glimpse(blockbusters)

Rows: 437
Columns: 11
$ Main_Genre      <chr> "Action", "Action", "Animation", "Action", "Action", "…
$ Genre_2         <chr> "Adventure", "Adventure", "Action", "Adventure", "Come…
$ Genre_3         <chr> "Drama", "Sci-Fi", "Adventure", "Drama", NA, "Drama", …
$ imdb_rating     <dbl> 7.4, 8.5, 7.8, 6.2, 7.8, 7.9, 7.2, 7.0, 6.9, 8.1, 7.2,…
$ length          <dbl> 135, 156, 118, 129, 119, 147, 118, 135, 112, 135, 152,…
$ rank_in_year    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8,…
$ rating          <chr> "PG-13", "PG-13", "PG", "PG-13", "R", "PG-13", "PG-13"…
$ studio          <chr> "Walt Disney Pictures", "Walt Disney Pictures", "Pixar…
$ title           <chr> "Black Panther", "Avengers: Infinity War", "Incredible…
$ worldwide_gross <chr> "$700,059,566", "$678,815,482", "$608,581,744", "$416,…
$ year            <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
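Because worldwide_gross is read as a character column (for example "$700,059,566"), it has to be converted to a number before any correlation can be computed. The following is a minimal sketch using `readr::parse_number()` on the first three rows shown in the glimpse above; the column name `gross_usd` is hypothetical.

```r
library(dplyr)
library(readr)

# First three rows of selected columns, copied from the glimpse output
films <- tibble(
  imdb_rating     = c(7.4, 8.5, 7.8),
  length          = c(135, 156, 118),
  worldwide_gross = c("$700,059,566", "$678,815,482", "$608,581,744")
)

films <- films %>%
  # parse_number() drops the "$" prefix and the "," grouping marks
  mutate(gross_usd = parse_number(worldwide_gross))

# Three rows are far too few for a meaningful estimate;
# this only demonstrates the mechanics.
cor(films$length, films$gross_usd)
```

The same `mutate()` call would apply unchanged to the full `blockbusters` data frame.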

Data 3

Problem or question

Creating a Lesson Plan: Analyzing Nutritional Content of McDonald’s Menu for Informed Dietary Choices through Accessible Data Visualizations using ggplot2 and Quarto

The dataset includes the following types of variables:

Categorical Data:
- Category: Describes the menu item category (e.g., breakfast, burgers, fries)
- Item: Names of specific menu items
- Serving Size: Describes the serving size of each menu item

Numerical Data:
- Calories, Total Fat, Saturated Fat, Trans Fat, Cholesterol, Sodium, Carbohydrates, Dietary Fiber, Sugars, Protein: Nutritional content of each menu item
- % Daily Value for Various Nutrients: The percentage of recommended daily intake for various nutrients

Lesson plan:
Accessible Data Visualizations:
- Visually appealing and accessible data visualizations using ggplot2, highlighting the nutritional content of McDonald’s menu items. These will include charts like bar charts, scatter plots, and pie charts, designed with accessibility principles in mind.
Integrated Quarto Document:
- Developing a Quarto document integrating the accessible visualizations, providing meaningful descriptions, context, and explanations. The Quarto document will ensure the information is accessible to a diverse audience and can be easily reproduced.
Insightful Analysis and Recommendations:
- Steps to derive insights and trends from the visualizations regarding the nutritional composition of McDonald’s menu items and provide recommendations for making informed dietary choices based on the analysis, promoting a healthier lifestyle.
Accessibility Guidelines Adherence:
- Ensuring all visualizations adhere to accessibility guidelines, incorporating features like alternative text, appropriate color contrast, and clear labels, to make the visualizations usable by everyone, including those with disabilities.

Introduction and data

Source of the Data:
- The dataset was obtained from Kaggle. (https://www.kaggle.com/datasets/mcdonalds/nutrition-facts)
Original Collection:
- The dataset's curator scraped the nutrition facts and menu items from the McDonald’s website.
Observations Description:
- The dataset provides a comprehensive nutrition analysis of every menu item available on the US McDonald’s menu. It covers a wide range of food and beverage categories including breakfast, beef burgers, chicken and fish sandwiches, fries, salads, soda, coffee and tea, milkshakes, and desserts. The dataset includes information on calories, total fat, saturated fat, trans fat, cholesterol, sodium, carbohydrates, dietary fiber, sugars, protein, and % daily value for various nutrients. These observations are crucial for analyzing the nutritional content of McDonald’s menu items and deriving insights to make informed dietary choices.

Glimpse of data

url <- "https://drive.google.com/uc?export=download&id=1w36RA0O7HueW814H-qRzGoJfMwsbDjqY"


# Read the CSV file, letting readr guess the column types
menu <- read_csv(url)

Rows: 260 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Category, Item, Serving Size
dbl (21): Calories, Calories from Fat, Total Fat, Total Fat (% Daily Value),...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display a glimpse of the data
glimpse(menu)

Rows: 260
Columns: 24
$ Category                        <chr> "Breakfast", "Breakfast", "Breakfast",…
$ Item                            <chr> "Egg McMuffin", "Egg White Delight", "…
$ `Serving Size`                  <chr> "4.8 oz (136 g)", "4.8 oz (135 g)", "3…
$ Calories                        <dbl> 300, 250, 370, 450, 400, 430, 460, 520…
$ `Calories from Fat`             <dbl> 120, 70, 200, 250, 210, 210, 230, 270,…
$ `Total Fat`                     <dbl> 13, 8, 23, 28, 23, 23, 26, 30, 20, 25,…
$ `Total Fat (% Daily Value)`     <dbl> 20, 12, 35, 43, 35, 36, 40, 47, 32, 38…
$ `Saturated Fat`                 <dbl> 5, 3, 8, 10, 8, 9, 13, 14, 11, 12, 12,…
$ `Saturated Fat (% Daily Value)` <dbl> 25, 15, 42, 52, 42, 46, 65, 68, 56, 59…
$ `Trans Fat`                     <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0…
$ Cholesterol                     <dbl> 260, 25, 45, 285, 50, 300, 250, 250, 3…
$ `Cholesterol (% Daily Value)`   <dbl> 87, 8, 15, 95, 16, 100, 83, 83, 11, 11…
$ Sodium                          <dbl> 750, 770, 780, 860, 880, 960, 1300, 14…
$ `Sodium (% Daily Value)`        <dbl> 31, 32, 33, 36, 37, 40, 54, 59, 54, 59…
$ Carbohydrates                   <dbl> 31, 30, 29, 30, 30, 31, 38, 43, 36, 42…
$ `Carbohydrates (% Daily Value)` <dbl> 10, 10, 10, 10, 10, 10, 13, 14, 12, 14…
$ `Dietary Fiber`                 <dbl> 4, 4, 4, 4, 4, 4, 2, 3, 2, 3, 2, 3, 2,…
$ `Dietary Fiber (% Daily Value)` <dbl> 17, 17, 17, 17, 17, 18, 7, 12, 7, 12, …
$ Sugars                          <dbl> 3, 3, 2, 2, 2, 3, 3, 4, 3, 4, 2, 3, 2,…
$ Protein                         <dbl> 17, 18, 14, 21, 21, 26, 19, 19, 20, 20…
$ `Vitamin A (% Daily Value)`     <dbl> 10, 6, 8, 15, 6, 15, 10, 15, 2, 6, 0, …
$ `Vitamin C (% Daily Value)`     <dbl> 0, 0, 0, 0, 0, 2, 8, 8, 8, 8, 0, 0, 0,…
$ `Calcium (% Daily Value)`       <dbl> 25, 25, 25, 30, 25, 30, 15, 20, 15, 15…
$ `Iron (% Daily Value)`          <dbl> 15, 8, 10, 15, 10, 20, 15, 20, 10, 15,…
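The accessibility goals in the lesson plan (alternative text, adequate color contrast, clear labels) can be sketched with ggplot2 as follows. This is a minimal example, assuming only the first two menu items shown in the glimpse above; the object names, the fill color, and the alt-text wording are assumptions for illustration.

```r
library(ggplot2)

# First two rows of the menu data, copied from the glimpse output
menu_small <- data.frame(
  Item     = c("Egg McMuffin", "Egg White Delight"),
  Calories = c(300, 250)
)

p <- ggplot(menu_small, aes(x = Calories, y = reorder(Item, Calories))) +
  # "#0072B2" is the blue from the Okabe-Ito colorblind-safe palette
  geom_col(fill = "#0072B2") +
  # Direct value labels, so the bars can be read without relying on the axis
  geom_text(aes(label = Calories), hjust = -0.2) +
  scale_x_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Calories per menu item",
    x     = "Calories (kcal)",
    y     = NULL,
    # Alternative text stored with the plot for screen-reader users
    alt   = "Horizontal bar chart of calories for two breakfast items: Egg McMuffin has 300 and Egg White Delight has 250."
  ) +
  theme_minimal(base_size = 14)

# Retrieve the stored alternative text
get_alt_text(p)
```

In a Quarto document, the same description can also be supplied through the `fig-alt` chunk option so that it is attached to the rendered image.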