Visualizing the Network of NYC Public Computer Centers

Appendix to report

Data cleaning

Overview

Our project analyzed the geographic distribution of public computer centers across New York City’s five boroughs. The raw data came from NYCOpenData as a CSV file named “Public_Computer_Center.csv”. This appendix outlines the data cleaning steps we performed to turn the raw data into an analysis-ready dataset.

Data Cleaning Steps

  1. Loading Raw Data:

    • The raw data was loaded into R using the read.csv function and stored in the variable PCC_raw.
  2. Standardizing Column Names:

    • Column names were standardized to snake_case using the clean_names() function from the janitor package, which makes columns easier to reference in later steps.
  3. Language Standardization:

    • In the languages_offered column, language descriptions were standardized (e.g., “English only” to “English”, “Chinese (Traditional)” to “Chinese”) for consistency and ease of filtering.
  4. Value Standardization Across the Dataset:

    • A custom function, replace_function, was defined to standardize certain values across all columns of the dataset (e.g., “Not operating” to “Not Operating”, “Temporarily closed” to “Temporarily Closed”).
  5. Renaming Columns:

    • Specific columns were renamed for clarity and consistency, allowing users to better identify and filter computer centers based on services offered.
  6. Filtering Out Inactive Locations:

    • Rows where the operating_status was “Not Operating” were filtered out to focus on active computer centers.
  7. Column Removal for Relevance:

    • Unnecessary columns (e.g., borough, technology_related_courses, state) were removed to focus the dataset on relevant attributes.
  8. Updating Specific Columns:

    • Certain columns were updated to replace values like “N/A”, “ ”, and “Not sure” with “Information not provided”. This step was crucial for users to identify centers with incomplete information.
  9. Data Type Conversion:

    • latitude and longitude were converted to numeric data types for accurate geographical mapping.
  10. Splitting Language Data into Separate Rows:

    • The languages_offered column was split into multiple rows using a semicolon as a separator, allowing for more detailed language-specific filtering.
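
The ten steps above can be sketched as a single pipeline. This is a minimal illustration rather than the project's actual script. It uses a small invented data frame in place of read.csv("Public_Computer_Center.csv"), the body of replace_function is an assumption (only its name and purpose come from the description above), and the columns wheelchair_accessible, accessibility and facility_name are hypothetical examples. It also assumes the dplyr, tidyr and janitor packages are installed.

```r
library(dplyr)
library(tidyr)
library(janitor)

# Step 1: toy stand-in for read.csv("Public_Computer_Center.csv").
# All rows and most column names here are invented for illustration.
PCC_raw <- data.frame(
  `Facility Name` = c("Center A", "Center B", "Center C"),
  `Operating Status` = c("Operating", "Not operating", "Temporarily closed"),
  `Languages Offered` = c("English only", "English;Spanish",
                          "Chinese (Traditional);English"),
  `Wheelchair Accessible` = c("Yes", "N/A", "Not sure"),
  Latitude = c("40.7128", "40.6782", "40.7282"),
  Longitude = c("-74.0060", "-73.9442", "-73.7949"),
  State = "NY",
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Step 4 helper: assumed implementation that fixes capitalization of
# status values in any character column and leaves other types alone.
replace_function <- function(x) {
  if (!is.character(x)) return(x)
  x[x %in% "Not operating"] <- "Not Operating"
  x[x %in% "Temporarily closed"] <- "Temporarily Closed"
  x
}

PCC_clean <- PCC_raw %>%
  clean_names() %>%                                      # step 2: snake_case names
  mutate(                                                # step 3: language labels
    languages_offered = gsub("English only", "English",
                             languages_offered, fixed = TRUE),
    languages_offered = gsub("Chinese (Traditional)", "Chinese",
                             languages_offered, fixed = TRUE)
  ) %>%
  mutate(across(everything(), replace_function)) %>%     # step 4: dataset-wide values
  rename(accessibility = wheelchair_accessible) %>%      # step 5: hypothetical rename
  filter(operating_status != "Not Operating") %>%        # step 6: drop inactive centers
  select(-state) %>%                                     # step 7: drop unneeded column
  mutate(accessibility = if_else(                        # step 8: flag missing info
    trimws(accessibility) %in% c("N/A", "", "Not sure"),
    "Information not provided",
    accessibility
  )) %>%
  mutate(across(c(latitude, longitude), as.numeric)) %>% # step 9: numeric coordinates
  separate_rows(languages_offered, sep = ";")            # step 10: one row per language

print(PCC_clean)
```

Splitting languages last means a center that offers several languages appears once per language, so any count of centers taken after this step should be based on distinct facilities rather than on rows.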

Saving the Cleaned Data

The cleaned dataset was saved in RDS format as “cleaned_data_file.rds” for use in further analysis and visualization, particularly in the Shiny application developed for this project.
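
As a rough sketch of the save and reload step, the base R functions saveRDS() and readRDS() can round-trip a data frame. The data frame below is an invented stand-in for the cleaned dataset, and the file is written to a temporary directory for illustration; where the project actually stores “cleaned_data_file.rds” is not specified here.

```r
# Invented stand-in for the cleaned dataset.
cleaned <- data.frame(
  facility_name = c("Center A", "Center C"),
  latitude = c(40.7128, 40.7282),
  longitude = c(-74.0060, -73.7949),
  stringsAsFactors = FALSE
)

# Save to a temporary location (assumed path, for illustration only).
rds_path <- file.path(tempdir(), "cleaned_data_file.rds")
saveRDS(cleaned, rds_path)

# A Shiny app could read the file back once at startup.
restored <- readRDS(rds_path)

# Column types survive the round trip, unlike a CSV export.
str(restored)
```

Unlike CSV, the RDS format preserves column types, so the numeric latitude and longitude columns do not need to be converted again when the app loads the data.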

Data Source

The original data was obtained from the NYCOpenData portal (dataset: “Citywide Public Computer Centers”) and processed as described above.