Project proposal

Author

Proud Gecko

library(tidyverse)
# install.packages("pokemon")  # run once if not yet installed; version 0.1.3 was used here
library(pokemon)

Dataset

# Loading Pokemon Data
# renv::install('pokemon')
pokemon_df <- pokemon::pokemon

head(pokemon_df)
# A tibble: 6 × 22
     id pokemon    species_id height weight base_experience type_1 type_2    hp
  <dbl> <chr>           <dbl>  <dbl>  <dbl>           <dbl> <chr>  <chr>  <dbl>
1     1 bulbasaur           1    0.7    6.9              64 grass  poison    45
2     2 ivysaur             2    1     13               142 grass  poison    60
3     3 venusaur            3    2    100               236 grass  poison    80
4     4 charmander          4    0.6    8.5              62 fire   <NA>      39
5     5 charmeleon          5    1.1   19               142 fire   <NA>      58
6     6 charizard           6    1.7   90.5             240 fire   flying    78
# ℹ 13 more variables: attack <dbl>, defense <dbl>, special_attack <dbl>,
#   special_defense <dbl>, speed <dbl>, color_1 <chr>, color_2 <chr>,
#   color_f <chr>, egg_group_1 <chr>, egg_group_2 <chr>, url_icon <chr>,
#   generation_id <dbl>, url_image <chr>

Dataset description: Pokémon

This project uses the Pokémon dataset distributed with the {pokemon} R package (available on CRAN and GitHub), which aggregates canonical Pokémon attributes into a single, analysis-ready table. The dataset was curated by Frank Hull and provides standardized Pokémon information; the package itself also supports English and Brazilian Portuguese naming.

Provenance

  • Source: {pokemon} R package (CRAN and GitHub distribution)

  • Content origin: Compiled Pokémon metadata commonly drawn from public Pokédex-style resources (e.g., IDs, types, base stats) and paired with image/icon links.

  • Curation note: Provided/organized for the course dataset of the week by Frank Hull.

Dimensions and structure

  • Observations (rows): 949

  • Variables (columns): 22

  • Unit of analysis: One row per Pokémon entry (identified by id and pokemon name).
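
The dimensions and the unit of analysis can be checked against the loaded table. This is a minimal sketch, assuming the {pokemon} package is installed as in the setup above; the values in the comments are those reported in this proposal, not re-verified output.

```r
# Assumes the {pokemon} package is available (see the setup code above)
pokemon_df <- pokemon::pokemon

# Rows and columns; this proposal reports 949 rows and 22 columns
dim(pokemon_df)

# Returns 0 when every id appears exactly once, i.e. one row per Pokémon entry
anyDuplicated(pokemon_df$id)
```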

What’s included to support analysis

Identifiers

  • id (double): unique Pokédex-style ID for each Pokémon entry

  • species_id (double): species identifier (useful if multiple forms exist across the same species concept)

  • generation_id (double): which generation the Pokémon belongs to

Physical attributes

  • height (double): Pokémon height (likely meters, based on values like 0.7, 2.0)

  • weight (double): Pokémon weight (likely kilograms, based on values like 6.9, 100.0)

  • base_experience (double): base XP yield value

Typing

  • type_1 (string): primary type (e.g., grass, fire)

  • type_2 (string): secondary type (can be NA)

Battle stats

  • hp (double): health points

  • attack (double): base attack points

  • defense (double): base defense points

  • special_attack (double): base special attack points

  • special_defense (double): base special defense points

  • speed (double): base speed points

Colors

  • color_1, color_2, color_f (string): hex color codes (primary, secondary, and a “final” blended/processed color); color_2 and color_f are often NA when only one dominant color is assigned

Egg groups

  • egg_group_1, egg_group_2 (string): breeding egg groups (egg_group_2 can be NA)

Media links

  • url_icon (string): icon URL path (missing https: prefix)

  • url_image (string): full image URL (looks like a standardized sprite/art asset link).
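
Several of the columns above are described as possibly NA. A short sketch of how the extent of missingness could be counted before plotting, assuming {dplyr} is available through the tidyverse; the object name na_counts is our own choice for illustration.

```r
library(dplyr)

# Assumes the {pokemon} package is available (see the setup code above)
pokemon_df <- pokemon::pokemon

# Count missing values in the columns described above as possibly NA
na_counts <- pokemon_df |>
  summarise(across(c(type_2, color_2, color_f, egg_group_2), ~ sum(is.na(.x))))

na_counts
```

Knowing these counts up front helps decide whether a secondary column (such as egg_group_2) is complete enough to use for grouping.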

Reason for choosing the dataset

We chose the Pokémon dataset because it balances a fun context with variables that lend themselves to strong visualizations. It has clear numeric measurements such as height, weight, and battle stats, and it has categories that make comparisons natural, such as egg groups, types, and generation. That lets us ask two questions that do not feel repetitive. The first focuses on body size: we create a BMI-style measure from height and weight and compare it across egg groups. The second focuses on performance: we define competitiveness from the battle stats and examine how it changes for each type across generations. The dataset is also already organized and well documented through TidyTuesday, so we can spend more time designing polished plots and less time on data cleaning.

Questions

The two questions we want to answer:

  1. How does the Body Mass Index (BMI) of Pokémon differ across egg groups?
  2. Which Pokémon types are becoming more competitive across generations?

Analysis plan

  1. How does the Body Mass Index (BMI) of Pokémon differ across egg groups?
    Numerical variables: height, weight (for computing BMI)
    Categorical variable: egg_group_1
    Create variable: poke_bmi = weight / height^2 (note: this is a fictional index, not a real-world BMI)
    >> Box plot or violin plot of BMI by egg_group_1

    We will use egg_group_1 as the primary egg group label for grouping. We chose it for consistency: egg_group_2 is missing for a substantial subset of Pokémon, and treating both egg groups equally would duplicate Pokémon across groups, which would overweight dual-group Pokémon in the distribution plots.

    Using a BMI-style metric (weight / height^2) lets us compare “heaviness for size” rather than just raw weight, which is strongly driven by height. Two Pokémon can have similar weights but very different heights, and the index helps separate bulky builds from tall-but-light builds. We treat this as a descriptive index (not a health metric), mainly to standardize body size for fair comparisons across egg groups and generations.
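
The plan for question 1 could be sketched as follows. This is a rough draft under stated assumptions, not the final analysis: it assumes height is in meters and weight in kilograms (as guessed in the dataset description), drops rows where the index cannot be computed, and uses a log scale and median ordering, which are our own presentation choices rather than part of the plan above.

```r
library(dplyr)
library(ggplot2)

# Assumes the {pokemon} package is available (see the setup code above)
pokemon_df <- pokemon::pokemon

bmi_df <- pokemon_df |>
  # Keep only rows where the index is defined and a primary egg group exists
  filter(!is.na(egg_group_1), !is.na(height), !is.na(weight), height > 0) |>
  # BMI-style index: weight / height^2 (units assumed to be kg and m)
  mutate(poke_bmi = weight / height^2)

bmi_plot <- ggplot(
  bmi_df,
  aes(x = reorder(egg_group_1, poke_bmi, FUN = median), y = poke_bmi)
) +
  geom_boxplot() +
  # Log scale is an assumption on our part: ratios like this are usually right-skewed
  scale_y_log10() +
  coord_flip() +
  labs(
    x = "Primary egg group (egg_group_1)",
    y = "poke_bmi = weight / height^2 (log scale)"
  )

bmi_plot
```

Replacing geom_boxplot() with geom_violin() gives the violin-plot variant mentioned in the plan.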

  2. Which Pokémon types are becoming more competitive across generations?
    Definition of competitiveness: sum of the six base stats
    Numerical variables: hp, attack, defense, special_attack, special_defense, speed
    Categorical variables: type_1, generation_id
    Create variable: base_stats = hp + attack + defense + special_attack + special_defense + speed
    Summarize:
    Group by type_1 and generation_id to calculate:

    1. mean_stats = mean(base_stats)

    2. sd_stats = sd(base_stats)

    3. n = number of Pokémon in each group

    4. 95% confidence interval around mean_stats, computed from sd_stats and n, because some types have more Pokémon than others

    >> Line chart of mean(base_stats) across generation_id, colored by type_1, with error bars showing the confidence intervals

    Because the number of Pokémon varies by type, we will normalize the competitiveness metric by comparing per-group means rather than totals. As some Pokémon fans may have noticed, legendary or pseudo-legendary Pokémon can be outliers in the data. We plan to address this by using visualizations such as box plots to show any outliers and, optionally, by using interquartile range boundaries to filter them out. However, most legendary Pokémon are concentrated in a few types (Dragon, for example), so we do not expect them to change the trend in competitiveness across generations drastically.
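
The optional interquartile-range filter mentioned above could look like the sketch below. Two choices here are assumptions on our part and not fixed by the plan: the conventional 1.5 × IQR fences, and computing the fences within each type_1 rather than over the whole dataset.

```r
library(dplyr)

# Assumes the {pokemon} package is available (see the setup code above)
pokemon_df <- pokemon::pokemon

flagged <- pokemon_df |>
  mutate(base_stats = hp + attack + defense + special_attack + special_defense + speed) |>
  group_by(type_1) |>
  mutate(
    q1 = quantile(base_stats, 0.25, na.rm = TRUE),
    q3 = quantile(base_stats, 0.75, na.rm = TRUE),
    # Flag values beyond 1.5 * IQR from the quartiles (assumed convention)
    is_outlier = base_stats < q1 - 1.5 * (q3 - q1) |
      base_stats > q3 + 1.5 * (q3 - q1)
  ) |>
  ungroup()

# How many Pokémon would be flagged, without removing anything yet
count(flagged, is_outlier)
```

Flagging rather than deleting keeps the decision reversible: the same summaries can be run with and without filter(!is_outlier) to see whether the trends depend on these Pokémon.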

    To strengthen the statistical reasoning, we will report variability alongside the mean trends. Some type-by-generation groups may contain relatively many or few Pokémon, or show wide variation in base stats, so plotting only the mean could make small-sample noise look like a trend. Adding standard deviations and 95% CIs makes it clearer whether a change across generations is a stable shift or an artifact of variability.
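
A sketch of the grouped summary and the line chart with error bars, under stated assumptions: the interval is a t-based 95% interval (mean ± t × sd / sqrt(n)), which is one common choice and not specified in the plan; rows with a missing generation_id, if any, are dropped; and the names se, t_crit, ci_low, and ci_high are ours.

```r
library(dplyr)
library(ggplot2)

# Assumes the {pokemon} package is available (see the setup code above)
pokemon_df <- pokemon::pokemon

type_gen_summary <- pokemon_df |>
  filter(!is.na(generation_id)) |>
  mutate(base_stats = hp + attack + defense + special_attack + special_defense + speed) |>
  group_by(type_1, generation_id) |>
  summarise(
    mean_stats = mean(base_stats),
    sd_stats = sd(base_stats),   # NA when a group has a single Pokémon
    n = n(),
    .groups = "drop"
  ) |>
  mutate(
    se = sd_stats / sqrt(n),
    # t critical value; pmax() avoids df = 0, and single-Pokémon groups stay NA via sd_stats
    t_crit = qt(0.975, df = pmax(n - 1, 1)),
    ci_low = mean_stats - t_crit * se,
    ci_high = mean_stats + t_crit * se
  )

trend_plot <- ggplot(
  type_gen_summary,
  aes(x = generation_id, y = mean_stats, colour = type_1)
) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = ci_low, ymax = ci_high), width = 0.2) +
  labs(x = "Generation", y = "Mean base_stats", colour = "Primary type")

trend_plot
```

Groups with only one Pokémon have no defined standard deviation, so they appear as points without error bars, which itself signals that the estimate rests on a single observation.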
    We will also summarize each type's change over time by computing the absolute change in mean(base_stats) from the earliest generation to the latest (first to last). We can then rank the types by how much they increased, and the confidence intervals will help show whether those changes are stable or due to variability.
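
The first-to-last comparison could be sketched as below. It assumes "first" and "last" mean the earliest and latest generation in which a type appears as type_1, which may differ between types; the column names first_gen, last_gen, and change are ours.

```r
library(dplyr)

# Assumes the {pokemon} package is available (see the setup code above)
pokemon_df <- pokemon::pokemon

type_change <- pokemon_df |>
  filter(!is.na(generation_id)) |>
  mutate(base_stats = hp + attack + defense + special_attack + special_defense + speed) |>
  group_by(type_1, generation_id) |>
  summarise(mean_stats = mean(base_stats), .groups = "drop") |>
  group_by(type_1) |>
  summarise(
    first_gen = min(generation_id),
    last_gen = max(generation_id),
    # Mean in the latest generation minus mean in the earliest generation
    change = mean_stats[which.max(generation_id)] - mean_stats[which.min(generation_id)],
    .groups = "drop"
  ) |>
  # Largest increase first
  arrange(desc(change))

type_change
```

A type that appears in only one generation gets a change of 0 here, so first_gen and last_gen are kept in the output to make such cases visible.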