Visualizing the Network of NYC Public Computer Centers
Appendix to report
Data cleaning
Overview
Our project involved analyzing the geographic distribution of public computer centers across New York City’s five boroughs. The raw data for this analysis was sourced from NYCOpenData and was provided in a CSV file named “Public_Computer_Center.csv”. This appendix outlines the data cleaning steps we performed to transform the raw data into a clean, analysis-ready dataset.
Data Cleaning Steps
Loading Raw Data:
- The raw data was loaded into R using the read.csv function and stored in the variable PCC_raw.
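A minimal sketch of this loading step follows. Because the real file is not bundled here, the example first writes a two-row stand-in CSV to a temporary directory; the column names and values in it are illustrative assumptions, not the actual NYCOpenData schema.

```r
# Write a tiny stand-in CSV so the example runs without the real download.
# The columns below are illustrative assumptions, not the real schema.
csv_path <- file.path(tempdir(), "Public_Computer_Center.csv")
writeLines(
  c("Operating Status,Languages Offered",
    "Open,English only",
    "Not operating,English; Spanish"),
  csv_path
)

# read.csv() returns a data frame. With the default check.names = TRUE it
# makes headers syntactically valid, so "Operating Status" becomes
# "Operating.Status".
PCC_raw <- read.csv(csv_path, stringsAsFactors = FALSE)

str(PCC_raw)
```

In the project itself, the path would point at the downloaded "Public_Computer_Center.csv" instead of a temporary file.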
Standardizing Column Names:
- Column names were standardized to snake_case using the clean_names() function from the janitor package, facilitating easier data handling and manipulation.
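The effect of this step can be sketched on a small stand-in data frame. The headers below are hypothetical, and the base-R fallback branch is only a rough approximation added so the example still runs where janitor is not installed; it is not a substitute for clean_names() on messier headers.

```r
# Stand-in data frame with untidy headers (hypothetical column names).
PCC_raw <- data.frame(
  "Operating Status" = c("Open", "Not operating"),
  "Languages Offered" = c("English only", "English; Spanish"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

if (requireNamespace("janitor", quietly = TRUE)) {
  # clean_names() converts headers to snake_case by default.
  PCC <- janitor::clean_names(PCC_raw)
} else {
  # Fallback (assumption): lower-case the headers and replace runs of
  # non-alphanumeric characters with "_". Handles only simple headers.
  PCC <- PCC_raw
  names(PCC) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(PCC_raw)))
}

names(PCC)
```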
Language Standardization:
- In the languages_offered column, language descriptions were standardized (e.g., “English only” to “English”, “Chinese (Traditional)” to “Chinese”) for consistency and ease of filtering.
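One way to perform this kind of substitution in base R is literal string replacement with gsub(). The sketch below covers only the two examples named above, on made-up rows; the full list of replacements used in the project is not shown here.

```r
PCC <- data.frame(
  languages_offered = c("English only",
                        "English; Chinese (Traditional)",
                        "Spanish"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats the pattern as a literal string, so the parentheses in
# "Chinese (Traditional)" are not interpreted as regex grouping.
PCC$languages_offered <- gsub("English only", "English",
                              PCC$languages_offered, fixed = TRUE)
PCC$languages_offered <- gsub("Chinese (Traditional)", "Chinese",
                              PCC$languages_offered, fixed = TRUE)

PCC$languages_offered
```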
Value Standardization Across the Dataset:
- A custom function, replace_function, was defined to standardize certain values across all columns of the dataset (e.g., “Not operating” to “Not Operating”, “Temporarily closed” to “Temporarily Closed”).
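The body of replace_function is not reproduced in this appendix, so the following is only a plausible sketch of such a helper: it maps the two example values through a lookup table and is applied to every column with lapply(). The toy data frame and its zip_code column are assumptions for illustration.

```r
# Hypothetical reconstruction: standardize known values in character
# columns and leave every other column type untouched.
replace_function <- function(x) {
  if (!is.character(x)) return(x)
  lookup <- c("Not operating"      = "Not Operating",
              "Temporarily closed" = "Temporarily Closed")
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

PCC <- data.frame(
  operating_status = c("Open", "Not operating", "Temporarily closed"),
  zip_code = c(10001, 10002, 11201),
  stringsAsFactors = FALSE
)

# Assigning into PCC[] keeps the data-frame structure while replacing
# every column with its standardized version.
PCC[] <- lapply(PCC, replace_function)

PCC$operating_status
```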
Renaming Columns:
- Specific columns were renamed for clarity and consistency, allowing users to better identify and filter computer centers based on services offered.
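The appendix does not list which columns were renamed, so the old and new names in this sketch are purely hypothetical. It shows one base-R pattern for renaming a chosen subset of columns through a named vector.

```r
PCC <- data.frame(
  wireless = c("Yes", "No"),
  assistive_tech = c("No", "Yes"),
  stringsAsFactors = FALSE
)

# Named vector: names are the old column names, values are the new ones.
# Both sets of names are hypothetical.
new_names <- c(wireless       = "wireless_access",
               assistive_tech = "assistive_technology")

# Locate each old name in the data frame and overwrite it in place.
idx <- match(names(new_names), names(PCC))
names(PCC)[idx] <- unname(new_names)

names(PCC)
```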
Filtering Out Inactive Locations:
- Rows where the operating_status was “Not Operating” were filtered out to focus on active computer centers.
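A base-R sketch of this filter, on made-up rows, is shown below. It assumes the status values have already been standardized so that inactive centers read exactly "Not Operating".

```r
PCC <- data.frame(
  center = c("A", "B", "C"),
  operating_status = c("Open", "Not Operating", "Temporarily Closed"),
  stringsAsFactors = FALSE
)

# Keep every row whose status is not "Not Operating". Using %in% rather
# than != means rows with a missing status are kept instead of becoming
# all-NA rows.
keep <- !(PCC$operating_status %in% "Not Operating")
PCC_active <- PCC[keep, , drop = FALSE]

PCC_active
```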
Column Removal for Relevance:
- Unnecessary columns (e.g., borough, technology_related_courses, state) were removed to focus the dataset on relevant attributes.
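Dropping columns by name can be sketched in base R as follows. The three dropped names come from the examples above; the remaining columns and all values are stand-ins.

```r
PCC <- data.frame(
  center = c("A", "B"),
  borough = c("Bronx", "Queens"),
  technology_related_courses = c("Yes", "No"),
  state = c("NY", "NY"),
  latitude = c("40.85", "40.73"),
  stringsAsFactors = FALSE
)

drop_cols <- c("borough", "technology_related_courses", "state")

# Keep only the columns whose names are not in drop_cols.
PCC <- PCC[, !(names(PCC) %in% drop_cols), drop = FALSE]

names(PCC)
```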
Updating Specific Columns:
- Certain columns were updated to replace values like “N/A”, “ ”, and “Not sure” with “Information not provided”. This step was crucial for users to identify centers with incomplete information.
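The replacement of placeholder values might look like the sketch below. The column chosen (wireless) is hypothetical, since the appendix does not name the affected columns, and the blank placeholder is assumed to be a single space.

```r
PCC <- data.frame(
  center = c("A", "B", "C", "D"),
  wireless = c("Yes", "N/A", " ", "Not sure"),
  stringsAsFactors = FALSE
)

# Values treated as "no usable answer" (the blank one is assumed to be a
# single space).
placeholders <- c("N/A", " ", "Not sure")

# Hypothetical list of columns to update.
cols_to_update <- c("wireless")

for (col in cols_to_update) {
  is_placeholder <- PCC[[col]] %in% placeholders
  PCC[[col]][is_placeholder] <- "Information not provided"
}

PCC$wireless
```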
Data Type Conversion:
- latitude and longitude were converted to numeric data types for accurate geographical mapping.
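A short sketch of the conversion, assuming the coordinates arrive as character strings (the values here are made up):

```r
PCC <- data.frame(
  latitude = c("40.7128", "40.8448"),
  longitude = c("-74.0060", "-73.8648"),
  stringsAsFactors = FALSE
)

# as.numeric() parses the strings; any entry that is not a valid number
# becomes NA with a warning, which is worth checking before mapping.
PCC$latitude <- as.numeric(PCC$latitude)
PCC$longitude <- as.numeric(PCC$longitude)

str(PCC)
```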
Splitting Language Data into Separate Rows:
- The languages_offered column was split into multiple rows using a semicolon as a separator, allowing for more detailed language-specific filtering.
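The row-splitting step can be sketched in base R as below, using made-up rows. This is one possible implementation, not necessarily the one used in the project; tidyr's separate_rows() offers a one-call alternative for the same reshaping.

```r
PCC <- data.frame(
  center = c("A", "B"),
  languages_offered = c("English; Spanish", "Chinese"),
  stringsAsFactors = FALSE
)

# Split each entry on ";" and trim the whitespace around each language.
parts <- lapply(strsplit(PCC$languages_offered, ";", fixed = TRUE), trimws)

# Repeat each original row once per language, then overwrite the column
# with the individual languages.
PCC_long <- PCC[rep(seq_len(nrow(PCC)), lengths(parts)), , drop = FALSE]
PCC_long$languages_offered <- unlist(parts)
rownames(PCC_long) <- NULL

PCC_long
```

After the split, a center offering two languages appears on two rows, so a filter on a single language matches it directly.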
Saving the Cleaned Data
The cleaned dataset was saved in RDS format as “cleaned_data_file.rds” for use in further analysis and visualization, particularly in the Shiny application developed for this project.
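Saving and reloading the data can be sketched with saveRDS() and readRDS(). The data frame here is a stand-in, and the file is written to a temporary directory only so the example leaves nothing behind; the project's actual output location is not stated in this appendix.

```r
PCC_clean <- data.frame(
  center = c("A", "B"),
  latitude = c(40.71, 40.84),
  stringsAsFactors = FALSE
)

# Temporary location used for illustration only.
rds_path <- file.path(tempdir(), "cleaned_data_file.rds")

# RDS stores a single R object and preserves column types, so numeric
# coordinates stay numeric when the file is read back.
saveRDS(PCC_clean, rds_path)

# A downstream script (for example, a Shiny app) would read it like this.
PCC_loaded <- readRDS(rds_path)

str(PCC_loaded)
```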
Data Source
The original data was obtained from the NYCOpenData portal (dataset: Citywide Public Computer Centers) and processed as described above.