Project proposal

Author

Proud Gecko

library(tidyverse)

Dataset

# Loading Pokemon Data
# renv::install('pokemon')
pokemon_df <- read_csv('data/pokemon_df.csv')

Rows: 949 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): pokemon, type_1, type_2, color_1, color_2, color_f, egg_group_1, e...
dbl (12): id, species_id, height, weight, base_experience, hp, attack, defen...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(pokemon_df)

# A tibble: 6 × 22
     id pokemon    species_id height weight base_experience type_1 type_2    hp
  <dbl> <chr>           <dbl>  <dbl>  <dbl>           <dbl> <chr>  <chr>  <dbl>
1     1 bulbasaur           1    0.7    6.9              64 grass  poison    45
2     2 ivysaur             2    1     13               142 grass  poison    60
3     3 venusaur            3    2    100               236 grass  poison    80
4     4 charmander          4    0.6    8.5              62 fire   <NA>      39
5     5 charmeleon          5    1.1   19               142 fire   <NA>      58
6     6 charizard           6    1.7   90.5             240 fire   flying    78
# ℹ 13 more variables: attack <dbl>, defense <dbl>, special_attack <dbl>,
#   special_defense <dbl>, speed <dbl>, color_1 <chr>, color_2 <chr>,
#   color_f <chr>, egg_group_1 <chr>, egg_group_2 <chr>, url_icon <chr>,
#   generation_id <dbl>, url_image <chr>

Dataset description: Pokémon

This project uses the Pokémon dataset distributed with the {pokemon} R package (CRAN/GitHub), which aggregates canonical Pokémon attributes into a single, analysis-ready table. The dataset was curated for this assignment by Frank Hull and provides standardized Pokémon information (the package also supports English and Brazilian Portuguese naming in its broader contents).

Provenance

Source: {pokemon} R package (CRAN and GitHub distribution)
Content origin: Compiled Pokémon metadata commonly drawn from public Pokédex-style resources (e.g., IDs, types, base stats) and paired with image/icon links.
Curation note: Provided/organized for the course dataset of the week by Frank Hull.

Dimensions and structure

Observations (rows): 949
Variables (columns): 22
Unit of analysis: One row per Pokémon entry (identified by id and pokemon name).

What’s included to support analysis

Identifiers

id (double): unique Pokédex-style ID for each Pokémon entry
species_id (double): species identifier (useful when multiple forms share the same species)
generation_id (double): which generation the Pokémon belongs to

Physical attributes

height (double): Pokémon height (likely meters, based on values like 0.7, 2.0)
weight (double): Pokémon weight (likely kilograms, based on values like 6.9, 100.0)
base_experience (double): base XP yield value

Typing

type_1 (string): primary type (e.g., grass, fire)
type_2 (string): secondary type (can be NA)

Battle stats

hp (double): health points
attack (double): base attack points
defense (double): base defense points
special_attack (double): base special attack points
special_defense (double): base special defense points
speed (double): base speed points

Colors

color_1, color_2, color_f (string): hex color codes (primary, secondary, and a “final” blended/processed color); color_2 and color_f often NA when only one dominant color is assigned

Egg groups

egg_group_1, egg_group_2 (string): breeding egg groups (egg_group_2 can be NA)

Media links

url_icon (string): icon URL path (missing https: prefix)
url_image (string): full image URL (appears to be a standardized sprite/art asset link)
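
Since url_icon is described as missing its https: prefix, a small helper could repair it before the links are used. This is a minimal sketch that assumes the stored paths are protocol-relative (begin with "//"); the helper name add_https and the example.org paths are hypothetical and are not taken from the dataset.

```r
library(tidyverse)

# Hypothetical helper: prepend "https:" only to values that start with "//".
# Values that already carry a scheme, and NA values, are returned unchanged.
add_https <- function(url) {
  if_else(str_starts(url, "//"), str_c("https:", url), url)
}

# Made-up example paths (not real dataset values)
example_paths <- c(
  "//example.org/icons/001.png",
  "https://example.org/images/001.png",
  NA
)

fixed_paths <- add_https(example_paths)
```

On the real table this would be applied as mutate(url_icon = add_https(url_icon)), after first checking that the stored values really do start with "//".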

Reason for choosing the dataset

We chose the Pokémon dataset because it balances a fun context with variables that lend themselves to strong visualizations. It has clear numeric measurements such as height, weight, and battle stats, and it also has categories that make comparisons natural, such as egg groups, types, and generation. That makes it easy to ask two different questions that don’t feel repetitive: one focused on body size (creating a BMI-style measure from height and weight and comparing it across egg groups), and another focused on performance, which lets us measure competitiveness and see how it changes for different types across generations. The dataset is also already organized and well documented through TidyTuesday, so we can spend more time designing polished plots and less time dealing with data cleaning issues.

Questions

The two questions we want to answer:

How does the body mass index (BMI) of Pokémon differ across egg groups?
Which Pokémon types are becoming more competitive across generations?

Analysis plan

How does the body mass index (BMI) of Pokémon differ across egg groups?
Numerical variables: height, weight (for computing BMI)
Categorical variable: egg_group_1
Create variable: poke_bmi = weight / height^2 (note: this is a fictional index, not a real-world BMI)
Plot: Box plot or violin plot of poke_bmi by egg_group_1

We will use egg_group_1 as the primary egg group label for grouping. We chose this for consistency: egg_group_2 is missing for a substantial subset of Pokémon, and treating both egg groups equally would duplicate Pokémon across groups, which would overweight dual-group Pokémon in distribution plots.

Using a BMI-style metric (weight / height^2) lets us compare “heaviness for size” rather than raw weight alone, which is strongly driven by height. Two Pokémon can have similar weights but very different heights, and BMI helps separate bulky builds from tall-but-light builds. We treat this as a descriptive index (not a health metric), mainly to standardize body size for fair comparisons across egg groups and generations.
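
The BMI step above can be sketched as follows. This is an illustrative draft, not the final analysis code: the helper name add_poke_bmi is hypothetical, the toy rows reuse the heights and weights printed by head(pokemon_df), and the egg group labels are placeholders rather than values from the real data.

```r
library(tidyverse)

# Hypothetical helper: add the BMI-style index weight / height^2.
# Missing or non-positive heights yield NA instead of Inf.
add_poke_bmi <- function(data) {
  data %>%
    mutate(
      poke_bmi = if_else(!is.na(height) & height > 0, weight / height^2, NA_real_)
    )
}

# Toy rows: height and weight copied from the head() output shown earlier;
# egg_group_1 labels are placeholders, NOT real dataset values.
toy <- tibble(
  pokemon = c("bulbasaur", "ivysaur", "venusaur", "charmander", "charmeleon", "charizard"),
  height = c(0.7, 1, 2, 0.6, 1.1, 1.7),
  weight = c(6.9, 13, 100, 8.5, 19, 90.5),
  egg_group_1 = rep(c("group_a", "group_b"), each = 3)
)

toy_bmi <- add_poke_bmi(toy)

# Box plot of the index by primary egg group
p_bmi <- ggplot(toy_bmi, aes(x = egg_group_1, y = poke_bmi)) +
  geom_boxplot() +
  labs(x = "Egg group (egg_group_1)", y = "poke_bmi (weight / height^2)")
```

With the real data, the same helper would be called as add_poke_bmi(pokemon_df), and geom_violin() could replace geom_boxplot() if the violin option is preferred.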

Which Pokémon types are becoming more competitive across generations?
Definition of competitiveness: sum of six base stats
Numerical variables: hp, attack, defense, special_attack, special_defense, speed
Categorical variables: type_1, generation_id
Create variable: base_stats = hp + attack + defense + special_attack + special_defense + speed
Summarize:
Group by type_1 and generation_id to calculate:
1. mean_stats = mean(base_stats)
2. sd_stats = sqrt(var(base_stats))
3. n = number of Pokémon in each group
4. 95% confidence interval around mean_stats, computed from sd_stats and n, because some types have more Pokémon than others
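
The summary steps above might look like the sketch below. The plan does not say how the interval is built, so this draft assumes a t-based interval (mean ± t × sd / sqrt(n)); the function name summarise_base_stats and the toy stat values are hypothetical and made up for illustration.

```r
library(tidyverse)

# Hypothetical helper: per type_1 x generation_id summary with a t-based 95% CI.
# Groups with a single Pokemon have no defined sd, so their CI is NA.
summarise_base_stats <- function(data) {
  data %>%
    mutate(
      base_stats = hp + attack + defense + special_attack + special_defense + speed
    ) %>%
    group_by(type_1, generation_id) %>%
    summarise(
      mean_stats = mean(base_stats),
      sd_stats = sd(base_stats),  # same value as sqrt(var(base_stats))
      n = n(),
      .groups = "drop"
    ) %>%
    mutate(
      se = sd_stats / sqrt(n),
      t_crit = if_else(n > 1, qt(0.975, df = pmax(n - 1, 1)), NA_real_),
      ci_lower = mean_stats - t_crit * se,
      ci_upper = mean_stats + t_crit * se
    )
}

# Toy rows with made-up stats (NOT real dataset values)
toy_stats <- tibble(
  type_1 = c("type_a", "type_a", "type_b"),
  generation_id = c(1, 1, 1),
  hp = c(50, 60, 45), attack = c(50, 60, 45), defense = c(50, 60, 45),
  special_attack = c(50, 60, 45), special_defense = c(50, 60, 45),
  speed = c(50, 60, 45)
)

stats_summary <- summarise_base_stats(toy_stats)
```

Using the t distribution rather than a fixed 1.96 multiplier widens the interval for small groups, which matches the concern that some types have few Pokémon per generation.
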
Plot: Line chart of mean(base_stats) across generation_id, colored by type_1, with error bars showing the confidence intervals.

Because the number of Pokémon varies by type, we will normalize the competitiveness metric by comparing per-type means rather than totals, so that types can be compared meaningfully. As some Pokémon fans may have noticed, legendary and pseudo-legendary Pokémon can be outliers in the data. We plan to address this with box plots that reveal any outliers and, optionally, with interquartile range (IQR) boundaries to filter them out. However, most legendary Pokémon are concentrated in a few types (Dragon, for example), so we do not expect them to change the cross-generation trends in competitiveness drastically.
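
One way to mark the outliers mentioned above is the usual box plot rule, which flags values more than 1.5 × IQR beyond the quartiles. The sketch below assumes the fences are computed over all rows passed in (grouping by type_1 first would be an alternative); the function name flag_iqr_outliers and the toy totals are hypothetical.

```r
library(tidyverse)

# Hypothetical helper: flag base_stats values outside Q1 - k*IQR or Q3 + k*IQR.
# Rows are only flagged, not dropped, so filtering stays an explicit choice.
flag_iqr_outliers <- function(data, k = 1.5) {
  data %>%
    mutate(
      q1 = quantile(base_stats, 0.25, na.rm = TRUE, names = FALSE),
      q3 = quantile(base_stats, 0.75, na.rm = TRUE, names = FALSE),
      iqr = q3 - q1,
      is_outlier = base_stats < q1 - k * iqr | base_stats > q3 + k * iqr
    ) %>%
    select(-q1, -q3, -iqr)
}

# Toy totals (made up): five typical values and one very high one
toy_totals <- tibble(
  pokemon = str_c("toy_", 1:6),
  base_stats = c(300, 310, 320, 330, 340, 680)
)

flagged <- flag_iqr_outliers(toy_totals)
```

Filtering would then be filter(flagged, !is_outlier), which keeps the flagged rows available for a side-by-side comparison of trends with and without them.
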
To strengthen statistical reasoning, we will report variability alongside the mean trends. Some type-by-generation groups may contain very few Pokémon or show wide variation in base stats, so plotting only the mean could make small-sample noise look like a trend. Adding standard deviations and 95% CIs makes it clearer which changes across generations are reliable.
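
The planned line chart with error bars could be drafted as below. The input is a toy summary table with made-up numbers, shaped like the planned output (type_1, generation_id, mean_stats, and CI bounds); the column names ci_lower and ci_upper are assumptions for this sketch.

```r
library(tidyverse)

# Toy summary with made-up values (NOT results from the real data)
toy_summary <- tibble(
  type_1 = rep(c("type_a", "type_b"), each = 3),
  generation_id = rep(1:3, times = 2),
  mean_stats = c(400, 410, 430, 380, 390, 385),
  half_width = c(20, 25, 30, 15, 40, 20),
  ci_lower = mean_stats - half_width,
  ci_upper = mean_stats + half_width
)

# One line per type, with points at each generation and CI error bars
p_trend <- ggplot(
  toy_summary,
  aes(x = generation_id, y = mean_stats, colour = type_1)
) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.15) +
  scale_x_continuous(breaks = 1:3) +
  labs(x = "Generation", y = "Mean base stat total", colour = "Primary type")
```

With many types on one panel the error bars will overlap, so faceting by type_1 with facet_wrap() may be worth trying as an alternative layout.
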
We will also summarize each type's change over time by computing the absolute change in mean(base_stats) from the first generation to the last. We can then rank the types by how much they increased, and the confidence intervals will help show whether those changes are stable or merely due to variability.
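
The ranking step could be sketched as follows. It assumes "first" and "last" mean the earliest and latest generation in which a type appears, and it reports the signed difference in raw stat points (not a percentage); the function name rank_type_change and the toy values are hypothetical.

```r
library(tidyverse)

# Hypothetical helper: change in mean_stats from each type's earliest
# generation to its latest, ranked from largest increase to largest decrease.
rank_type_change <- function(summary_tbl) {
  summary_tbl %>%
    group_by(type_1) %>%
    arrange(generation_id, .by_group = TRUE) %>%
    summarise(
      first_gen = first(generation_id),
      last_gen = last(generation_id),
      abs_change = last(mean_stats) - first(mean_stats),
      .groups = "drop"
    ) %>%
    arrange(desc(abs_change))
}

# Toy summary with made-up means (NOT results from the real data)
toy_means <- tibble(
  type_1 = rep(c("type_a", "type_b"), each = 3),
  generation_id = rep(1:3, times = 2),
  mean_stats = c(400, 410, 430, 380, 390, 385)
)

type_ranking <- rank_type_change(toy_means)
```

A first-to-last difference ignores what happens in between, so it is best read next to the line chart rather than on its own.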