COVID-19 Impact on NYC School Enrollment

Report

Introduction

Project Motivation

Our project was motivated by the question of how COVID-19 affected NYC school enrollment. We found data on school enrollment demographics across years and compared various aspects of enrollment (such as the number of students by grade level and economic factors) in the years leading up to COVID-19 and the years during it. We examine how enrollment variables changed from 2013 to 2021, compare the pre-COVID-19 years (2013-2019) with the COVID-19 years (2020-2021), and thus assess whether COVID-19 had an effect on aspects of school enrollment in NYC.

Objective

Our objective is to visualize and analyze trends in school enrollment demographic information from 2013 to 2021. We examined changes in enrollment (number of students) by grade level, race, gender, and other attributes, as well as changes in students’ poverty level and economic need index over the same years. Through those trends, we aim to identify which variables differ significantly in the COVID-19 period (2020-2021) compared to the pre-COVID-19 period (2013-2019), which can lend insight into which enrollment variables were affected by COVID-19 and which were not.

Through our analysis, we hope to help those who work in the NYC Department of Education gain a better understanding of how trends in school enrollment changed from 2013 to 2021. Our visualizations show that some aspects of school enrollment changed most dramatically during the COVID-19 years (2020-2021), while others were already changing before COVID-19. Recent news reports on educational statistics tend to tie changes to COVID-19. Our analysis suggests that while some changes in school enrollment do coincide with COVID-19, other aspects were changing before the pandemic began. We hope that knowing whether an aspect of school enrollment was affected by COVID-19 will point educators and policymakers in the NYC Department of Education toward the true causes of, and thus solutions to, changes in those aspects.

Main conclusions

  • Clear decrease of total and mean enrollment in 2020 and 2021 in NYC schools

  • Unclear how COVID affected the poverty levels of students in 2020 and 2021, as they seem to be within normal bounds.

  • COVID did not disproportionately affect school enrollment of students of color.

  • The percentage of students with disabilities slightly increased during COVID compared with pre-COVID.

  • The average number of English learners remained the same even during COVID.

Justifications of Approach and Design Process

There are four aspects to our deliverable: data visualizations built with Shiny, statistical testing, machine learning, and a public-facing website that compiles the results of the other three.

Our intended audience is policymakers in the NYC Department of Education. The deliverable will meet their needs because it visualizes enrollment changes in a clear, easy-to-understand, and visually appealing way. The statistical tests and machine learning offer policymakers quantifiable insights into the current changes and possible future trends. Lastly, our website is easily accessible to anyone with the link.

Visualization

As part of our exploratory data analysis (EDA), we first used the ggplot2 package to visualize trends in different school enrollment variables over the years. The visualizations address the following questions:

  • How did COVID-19 affect total school enrollment (faceted by grade level and for all grade levels)?

  • How did COVID-19 affect the poverty and economic need levels of students?

  • Did COVID-19 disproportionately affect school enrollment of minority students/students of color?

  • Did COVID-19 disproportionately affect school enrollment of students with disabilities?

  • Did COVID-19 disproportionately affect school enrollment of students who have challenges in learning with the English language?

These graphs can be found below in the “Exploratory Data Visualizations” section.

To turn the EDA into part of our deliverable, we used the shiny package to turn the plots into interactive visualization applications. We wanted the visualizations to be interactive because of the large scope of our data. Interactivity also allows several graphs to be combined into one, which makes the result more organized and user-friendly. For example, in our static “enrollment change in NYC schools over the years faceted by grade level” visualization, the graphs are cramped and small, and it is very difficult to see the changes throughout the years. With Shiny, we could add a scrolling tab that lets users choose which grade level’s graph they want to look at, making the layout neater and the graphs more readable.
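The idea of replacing a cramped faceted plot with a single plot plus a grade-level selector can be sketched as follows. This is a minimal illustration, not the project's shiny_web_app.R: the data frame `enrollment`, its values, the input id `grade`, and the use of `selectInput()` with base graphics are all assumptions made for the example.

```r
library(shiny)

# Synthetic stand-in for yearly enrollment by grade (values are made up).
enrollment <- data.frame(
  year  = rep(2013:2021, times = 2),
  grade = rep(c("grade_k", "grade_1"), each = 9),
  total = c(seq(900, 820, by = -10), seq(950, 870, by = -10))
)

ui <- fluidPage(
  titlePanel("Enrollment by grade level"),
  # A selector replaces a wall of small facets with one readable plot.
  selectInput("grade", "Grade level:", choices = unique(enrollment$grade)),
  plotOutput("trend")
)

server <- function(input, output, session) {
  output$trend <- renderPlot({
    # Keep only the rows for the grade the user picked.
    one_grade <- enrollment[enrollment$grade == input$grade, ]
    plot(one_grade$year, one_grade$total, type = "b",
         xlab = "Year", ylab = "Students enrolled", main = input$grade)
  })
}

app <- shinyApp(ui, server)
# In an interactive session, print `app` or call runApp(app) to launch it.
```

Only one grade is drawn at a time, so each plot gets the full panel width instead of sharing it with every other grade.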

The Shiny application can be found in the ‘shiny 1’ folder and in shiny_web_app.R. To run the application, open the shiny_web_app.R file and click the “Run App” button.

The Shiny application can also be found using this link. The link can also be found in the “Team Butterfree” page of our website.

The biggest difficulty we encountered in the visualization and Shiny section was creating a public link for our Shiny application.

Statistical Testing

We conducted four different types of statistical analysis/testing.

  1. ARIMA (AutoRegressive Integrated Moving Average) models are used to forecast future trends based on past data. They are designed to handle time series data, where the values at different points in time are correlated. ARIMA is appropriate here because the schools.csv dataset represents a time series (yearly enrollment numbers) and because part of our goal is to predict future school enrollment from it. The model forecasts future values by identifying and capturing the underlying patterns in the series.

  2. We used ANOVA to understand the differential impact of COVID-19 on various demographic groups. ANOVA assesses whether there are statistically significant differences in the means of three or more groups, and can be used to compare numeric variables such as total enrollment across categorical variables, such as years. In our dataset, the independent variable, year, is categorical, which is an ideal situation for using ANOVA.

  3. We used correlation analysis to examine the strength and direction of relationships between pairs of numeric variables without assuming causality. It allows us to check whether enrollment variables such as total enrollment, female/male enrollment, and poverty are related to one another.

  4. We used regression analysis to explore relationships between variables (e.g., how economic status or race relates to enrollment changes during COVID-19). Regression analysis is commonly used for predictive modeling, i.e., to estimate the value of a dependent variable (e.g., total enrollment) from one or more independent variables (e.g., year). Regression also quantifies the relationship between variables through coefficients, which lets us assess the magnitude and direction of the effect that year has on the dependent variable.
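The four methods above can be sketched with base R alone. The snippet below is only an illustration on synthetic data, not the analysis in stats-analysis.qmd: the simulated values, the random-walk order `c(0, 1, 0)` for ARIMA, and the choice of `percent_poverty` as the second variable are assumptions made for the example.

```r
set.seed(1)

# Synthetic school-by-year records (made-up values, not the project data).
schools <- expand.grid(school = 1:30, year = 2013:2021)
schools$total_enrollment <- 600 - 5 * (schools$year - 2013) -
  40 * (schools$year >= 2020) + rnorm(nrow(schools), sd = 30)
schools$percent_poverty <- 70 + rnorm(nrow(schools), sd = 8)

# 1. ARIMA on the citywide yearly totals (one value per year).
#    Order c(0, 1, 0) is a simple random walk, chosen here only for illustration.
yearly_total <- tapply(schools$total_enrollment, schools$year, sum)
fit_arima <- arima(ts(as.numeric(yearly_total), start = 2013), order = c(0, 1, 0))
forecast_next <- predict(fit_arima, n.ahead = 2)$pred

# 2. One-way ANOVA: does mean enrollment differ across years?
fit_aov <- aov(total_enrollment ~ factor(year), data = schools)

# 3. Pearson correlation between two numeric columns.
r <- cor(schools$total_enrollment, schools$percent_poverty)

# 4. Linear regression of enrollment on year.
fit_lm <- lm(total_enrollment ~ year, data = schools)

summary(fit_aov)
coef(fit_lm)
```

With only nine yearly totals, any ARIMA forecast is very uncertain, so a sketch like this is best read as a template rather than as a source of conclusions.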

More specific details on each method and the interpretation of the results can be found in the stats-analysis.qmd file. To view the analysis, open the file and run all chunks. Some packages may need to be installed before the chunks can be run. We recommend installing them via the Tools -> Install Packages menu, as we ran into errors when installing by running installation code.

It can also be found in the Statistic Analysis section on our website.

The biggest difficulty we encountered in the statistics section was deciding which tests to choose and how to interpret the results. We overcame this difficulty because one of our group mates studied mathematics as an undergraduate, and her theoretical knowledge was very helpful when selecting the tests and interpreting the results.

Machine Learning

  1. Decision tree model: A Decision Tree is a predictive modeling tool used in machine learning. It has a tree-like structure where each node represents a decision or a test on an attribute, each branch represents an outcome of that test, and each leaf node represents the predicted outcome or class label. We chose to use a decision tree model as our first model because they are relatively simple to understand and can handle both classification and regression tasks.

  2. XGBoost Model: XGBoost (eXtreme Gradient Boosting) implements gradient boosting, an ensemble learning method that builds a series of weak learners (typically decision trees) sequentially. It also allows users to define custom objective and evaluation functions for specific requirements. We chose this model because of its flexibility, efficiency, and high performance in a variety of tasks, including classification, regression, and ranking.

  3. Random Forest Model: The Random Forest model is also an ensemble learning method that can be used for both classification and regression tasks. It consists of a collection (ensemble) of decision trees each trained independently on a subset of the training data. We chose random forest because of its robustness, high accuracy, and ability to handle complex datasets.
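As a rough illustration of the first model in the list, the sketch below fits a regression tree with the rpart package on synthetic data and scores it on a held-out split. It is not the code in machine-learning.qmd: the simulated columns, the 80/20 split, and the RMSE metric are assumptions chosen for the example.

```r
library(rpart)  # ships with standard R installations as a recommended package

set.seed(42)

# Synthetic school-level data (made-up values, not the project data).
n <- 300
schools <- data.frame(
  year            = sample(2013:2021, n, replace = TRUE),
  percent_poverty = runif(n, 30, 100)
)
schools$total_enrollment <- 700 - 3 * schools$percent_poverty -
  60 * (schools$year >= 2020) + rnorm(n, sd = 25)

# Hold out 20% of rows to check how well the tree generalizes.
test_idx <- sample(n, size = 0.2 * n)
train <- schools[-test_idx, ]
test  <- schools[test_idx, ]

# method = "anova" fits a regression tree for a numeric outcome.
tree <- rpart(total_enrollment ~ year + percent_poverty,
              data = train, method = "anova")

# Root mean squared error on the held-out rows.
pred <- predict(tree, newdata = test)
rmse <- sqrt(mean((test$total_enrollment - pred)^2))
rmse
```

The same train/test split could be reused for a random forest or XGBoost model so that the three models are compared on identical held-out rows.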

The three models and their analysis can be found in machine-learning.qmd. Simply open the file and run all chunks. There may be packages that need to be installed before the chunks can be run.

It can also be found in the Machine Learning section on our website.

Our biggest difficulty when creating the machine learning models was choosing and interpreting them. We did not want to reuse models that we had already learned in class, but we were unfamiliar with new ones. We solved this issue because one of our group members had used the aforementioned models on different data in the MPS project course, and we were able to apply the skills he learned there to this project.

Website

We created a website to show our visualizations, statistical tests, and machine learning models on a public-facing platform. The website can be accessed via this link. The link can also be found in the _publish.yml file.

We didn’t encounter any difficulties when creating the website. However, the process did alert us to issues with our file organization, which we solved promptly.

The most important consideration our team faced in designing and constructing the final product was making sure that our deliverables are easily accessible and understandable to our target audience.

Load packages


This report loads the dplyr, lubridate, and tidyverse (version 2.0.0) packages.

Data Description

We chose two datasets from the NYC Department of Education. The first dataset, “nyc_schools_13_to_18”, contains enrollment information for NYC schools from 2013 to 2018. The second dataset, “nyc_schools_17_to_22”, contains enrollment information for NYC schools from 2017 to 2022. We combined and processed the two datasets into our final dataset, schools.csv.

nyc_schools_13_to_18

nyc_schools_17_to_22

More details on the specific steps taken to build our final dataset can be found in the appendices.qmd file or in the Appendices section of our website.
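Because the two source files overlap in some years, combining them requires a rule for which record to keep. The sketch below shows one plausible approach with dplyr; it is not the procedure documented in appendices.qmd. The tiny data frames, their values, and the rule "prefer the newer file for overlapping years" are assumptions made for illustration.

```r
library(dplyr)

# Tiny synthetic stand-ins for the two source files (made-up values).
schools_13_to_18 <- tibble(
  dbn = "01M015", year = 2016:2018, total_enrollment = c(178, 190, 175)
)
schools_17_to_22 <- tibble(
  dbn = "01M015", year = 2017:2021, total_enrollment = c(190, 174, 190, 193, 179)
)

schools <- bind_rows(schools_17_to_22, schools_13_to_18) %>%
  # distinct() keeps the first occurrence of each (dbn, year) pair, and rows
  # from the newer file come first, so the newer file wins for overlapping years.
  distinct(dbn, year, .keep_all = TRUE) %>%
  arrange(dbn, year)

schools
```

Whichever rule is chosen, checking afterwards that each (dbn, year) pair appears exactly once guards against double-counting a school in the overlapping years.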

Rows: 16,379
Columns: 45
$ dbn                                              <chr> "01M015", "01M015", "…
$ school_name                                      <chr> "P.S. 015 Roberto Cle…
$ year                                             <dbl> 2013, 2014, 2015, 201…
$ total_enrollment                                 <dbl> 190, 183, 176, 178, 1…
$ `grade_pk_(half_day_&_full_day)`                 <dbl> 26, 18, 14, 17, 17, 1…
$ grade_k                                          <dbl> 39, 27, 32, 28, 28, 2…
$ grade_1                                          <dbl> 39, 47, 33, 33, 32, 3…
$ grade_2                                          <dbl> 21, 31, 39, 27, 33, 3…
$ grade_3                                          <dbl> 16, 19, 23, 31, 23, 3…
$ grade_4                                          <dbl> 26, 17, 17, 24, 31, 2…
$ grade_5                                          <dbl> 23, 24, 18, 18, 26, 2…
$ grade_6                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_7                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_8                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_9                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_10                                         <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_11                                         <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_12                                         <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ female                                           <dbl> 93, 84, 83, 83, 99, 8…
$ percent_female                                   <dbl> 48.900, 45.900, 47.20…
$ male                                             <dbl> 97, 99, 93, 95, 91, 8…
$ percent_male                                     <dbl> 51.100, 54.100, 52.80…
$ asian                                            <dbl> 9, 8, 9, 14, 20, 24, …
$ percent_asian                                    <dbl> 4.700, 4.400, 5.100, …
$ black                                            <dbl> 72, 65, 57, 51, 52, 4…
$ percent_black                                    <dbl> 37.900, 35.500, 32.40…
$ hispanic                                         <dbl> 104, 107, 105, 105, 1…
$ percent_hispanic                                 <dbl> 54.700, 58.500, 59.70…
$ multiple_race_categories_not_represented         <dbl> 2, 1, 3, 4, NA, NA, N…
$ percent_multiple_race_categories_not_represented <dbl> 1.1, 0.5, 1.7, 2.2, N…
$ white                                            <dbl> 3, 2, 2, 4, 6, 6, 9, …
$ percent_white                                    <dbl> 1.600, 1.100, 1.100, …
$ `multi-racial`                                   <dbl> NA, NA, NA, NA, 1, 0,…
$ `percent_multi-racial`                           <dbl> NA, NA, NA, NA, 0.005…
$ native_american                                  <dbl> NA, NA, NA, NA, 1, 1,…
$ percent_native_american                          <dbl> NA, NA, NA, NA, 0.005…
$ `missing_race/ethnicity_data`                    <dbl> NA, NA, NA, NA, 0, 0,…
$ `percent_missing_race/ethnicity_data`            <dbl> NA, NA, NA, NA, 0.000…
$ students_with_disabilities                       <dbl> 65, 64, 60, 51, 49, 3…
$ percent_students_with_disabilities               <dbl> 34.200, 35.000, 34.10…
$ english_language_learners                        <dbl> 19, 17, 16, 12, 8, 8,…
$ percent_english_language_learners                <dbl> 10.000, 9.300, 9.100,…
$ poverty                                          <dbl> 171, 169, 149, 152, 1…
$ percent_poverty                                  <dbl> 90.0, 92.3, 84.7, 85.…
$ economic_need_index                              <dbl> NA, 93.5, 89.6, 89.2,…

What are the observations (rows) and the attributes (columns)?

The schools dataset contains 45 attributes (columns) and 16,379 observations (rows). The dataset is sorted by school DBN and then by year, so consecutive rows share the same DBN and school name with increasing years. In other words, the grain of the dataset is one row per school per year.
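A grain of one row per school per year can be checked directly by looking for repeated (dbn, year) pairs. The snippet below is a small sketch on made-up records; the sample rows, including the deliberate duplicate, are assumptions for illustration.

```r
# Synthetic records (made-up); the last row repeats a school-year on purpose.
schools <- data.frame(
  dbn  = c("01M015", "01M015", "01M019", "01M019", "01M019"),
  year = c(2013, 2014, 2013, 2014, 2014)
)

# duplicated() flags every repeat of an already-seen (dbn, year) pair.
dup <- duplicated(schools[, c("dbn", "year")])
violations <- schools[dup, ]

violations
```

An empty `violations` data frame would confirm that each school appears at most once per year.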

Descriptive information for each school includes: school_name and dbn.

Year goes from 2013 to 2021 for each school.

Data Dictionary:

  • dbn: DBN (District Borough Number) is the combination of the District Number, the letter code for the borough, and the number of the school.

  • school_name: full name of schools.

  • year: year (containing years from 2013 to 2021).

  • total_enrollment: total number of students enrolled in a particular school in a year.

  • grade_pk_(half_day_&_full_day): number of students enrolled in pre-kindergarten in a particular school in a year.

  • grade_k: number of students enrolled in kindergarten in a particular school in a year.

  • grade_1: number of students enrolled in grade 1 in a particular school in a year.

  • grade_2: number of students enrolled in grade 2 in a particular school in a year.

  • The same pattern continues for columns grade_3, grade_4, grade_5, grade_6, grade_7, grade_8, grade_9, grade_10, grade_11, and grade_12.

  • female: the number of female students enrolled in a particular school in a year.

  • percent_female: the percentage of female students enrolled in a particular school in a year.

  • male: the number of male students enrolled in a particular school in a year.

  • percent_male: the percentage of male students enrolled in a particular school in a year.

  • asian: the number of Asian students enrolled in a particular school in a year.

  • percent_asian: the percentage of Asian students enrolled in a particular school in a year.

  • black: the number of Black students enrolled in a particular school in a year.

  • percent_black: the percentage of Black students enrolled in a particular school in a year.

  • The same pattern applies to columns hispanic, percent_hispanic, multiple_race_categories_not_represented, percent_multiple_race_categories_not_represented, white, percent_white, multi-racial, percent_multi-racial, native_american, percent_native_american, missing_race/ethnicity_data, and percent_missing_race/ethnicity_data.

  • students_with_disabilities: the number of students with disabilities enrolled in a particular school in a year.

  • percent_students_with_disabilities: the percentage of students with disabilities enrolled in a particular school in a year.

  • english_language_learners: the number of students whose first language is not English enrolled in a particular school in a year.

  • percent_english_language_learners: the percentage of students whose first language is not English enrolled in a particular school in a year.

  • poverty: the number of students in poverty enrolled in a particular school in a year.

  • percent_poverty: the percentage of students in poverty enrolled in a particular school in a year.

  • economic_need_index: Economic need index by school. The economic need index is a number ranging from 0-1 (it can also be represented as a percentage).

The student’s Economic Need Value is 1.0 if:

  • The student is eligible for public assistance from the NYC Human Resources Administration (HRA);

  • The student lived in temporary housing in the past four years; or

  • The student is in high school, has a home language other than English, and entered the NYC DOE for the first time within the last four years.

If a school is more than 10 percentage points above the citywide average, it is skewed toward lower incomes; if a school is more than 10 percentage points below the citywide average, it is skewed toward higher incomes.

Source: Economic Need Index (https://data.cccnewyork.org/data/bar/1371/student-economic-need-index#1371/a/1/1622/127)
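The 10-percentage-point rule described above can be expressed as a small classification step. The sketch below uses made-up schools and, as an assumption, approximates the citywide average with the unweighted mean of the schools shown; the labels and the 0-100 scale are also choices made for the example.

```r
# Synthetic schools (made-up values); economic_need_index is on a 0-100 scale here.
schools <- data.frame(
  dbn = c("A", "B", "C", "D"),
  economic_need_index = c(93.5, 72.0, 55.0, 78.0)
)

# Assumption: approximate the citywide average by the unweighted mean of schools.
citywide_avg <- mean(schools$economic_need_index)

# Difference from the citywide average, in percentage points.
diff_pts <- schools$economic_need_index - citywide_avg
schools$skew <- ifelse(diff_pts > 10, "lower income",
                ifelse(diff_pts < -10, "higher income", "near average"))

schools
```

In practice the citywide figure would more likely come from the published source or from an enrollment-weighted average than from an unweighted mean.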

Why was this dataset created?

This dataset was created by the City of New York to provide a comprehensive overview of student demographics across NYC schools. The data can help stakeholders, including policymakers, educators, and researchers, understand the diverse student population within NYC and make informed decisions about educational programs, resource allocation, and policy planning.

Who funded the creation of the dataset?

The dataset appears to be related to NYC schools, so it was likely funded and maintained by the New York City Department of Education or a related governmental body.

What processes might have influenced what data was observed and recorded and what was not?

The data in the dataset needed to be measurable. It is likely that this dataset was drawn from a larger, more comprehensive collection of education data. This dataset focuses specifically on students’ ethnicity, disability, and economic status.

One factor that we noticed is that the data are aggregated, with no breakdown by individual student. This is likely due to privacy concerns, since aggregation masks student identities.

What preprocessing was done, and how did the data come to be in the form that you are using?

As mentioned above, this data set is aggregated. The data is likely gathered from various schools and centralized into a single data set.

  • Aggregations were performed to calculate percentages and other metrics.

  • Missing values, especially those for the Economic Need Index, might have been derived from aggregated data using averages or other methods.

  • To ensure privacy, certain values were replaced with generic labels like “Below 5%” and “Above 95%”.
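Masked labels such as “Below 5%” and “Above 95%” prevent a percentage column from being read as numeric. The sketch below shows one way such a column could be parsed. It is an illustration, not the cleaning code used in the project: the sample strings, the helper name `parse_percent`, and the decision to treat masked cells as missing are all assumptions.

```r
# Made-up examples of how masked and unmasked percentages might appear in a raw file.
raw_percent <- c("90.0%", "Below 5%", "Above 95%", "84.7%", NA)

# Hypothetical helper: convert percentage strings to numbers.
parse_percent <- function(x) {
  masked <- grepl("^(Below|Above)", x)
  # Strip everything except digits and the decimal point, then convert.
  value <- as.numeric(gsub("[^0-9.]", "", x))
  # Assumption: treat masked cells as missing rather than guessing a value.
  value[masked] <- NA
  value
}

parse_percent(raw_percent)
```

An alternative would be to substitute the bound itself (5 or 95), which keeps the row in later averages at the cost of biasing values toward the cutoff.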

More detailed data cleaning processes are in code chunk annotations.

If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

Given that this data relates to public schools, it is likely that parents signed a form allowing the school to collect their information for official and administrative purposes. Parents, guardians, and school staff were probably aware of the data collection. They might have expected the data to be used for administrative purposes, policy planning, resource allocation, and assessing the needs of the student population.

Especially in public schools, economic status information could be derived from records such as subsidies or food stamps. Race, gender, English level, and age are basic information collected upon school enrollment.

Exploratory Data Visualizations

How did covid-19 affect school total enrollment?

Because we could not see clear trends in student enrollment over the years when it was separated by grade level, we decided to visualize the changes with all grade levels combined.

How did Covid-19 affect school mean enrollment?

The graphs above show a steep drop from 2019 to 2020 and from 2020 to 2021, corresponding to the COVID-19 years. Both total enrollment and mean enrollment decreased during the COVID-19 years.
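The yearly totals and means behind plots like these can be computed with a grouped summary. The following is a minimal sketch on made-up records, not the report's own chunk: the three schools, their values, and the name `enrollment_by_year` are assumptions.

```r
library(dplyr)

# Synthetic school-by-year records (made-up values).
schools <- tibble(
  dbn  = rep(c("A", "B", "C"), each = 3),
  year = rep(2019:2021, times = 3),
  total_enrollment = c(500, 470, 440, 300, 290, 270, 820, 800, 760)
)

enrollment_by_year <- schools %>%
  group_by(year) %>%
  summarise(
    total = sum(total_enrollment, na.rm = TRUE),   # citywide total per year
    mean  = mean(total_enrollment, na.rm = TRUE),  # average school size per year
    .groups = "drop"
  )

enrollment_by_year
```

Setting `.groups = "drop"` returns an ungrouped result and, when grouping by more than one variable, also silences the informational message that `summarise()` otherwise prints.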

How did COVID-19 affect the poverty and economic need levels of students?

Did COVID-19 disproportionately affect school enrollment of minority students/students of color?

  percent_white  percent_black  percent_hispanic  percent_asian  percent_native_american
1     -39.42883      -11.29854          3.483335      -32.73358                 4.151661
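Since dplyr 1.1.0, passing extra arguments such as `na.rm = TRUE` through `across()` is deprecated in favor of an anonymous function. The sketch below shows that form while averaging the percent columns by period. The data, the `period` split at 2020, and the column subset are assumptions for illustration, and it does not reproduce the figures printed above.

```r
library(dplyr)

# Synthetic records (made-up values).
schools <- tibble(
  year          = c(2018, 2019, 2020, 2021),
  percent_white = c(10, 10, 9, NA),
  percent_asian = c(20, NA, 18, 18)
)

period_means <- schools %>%
  mutate(period = if_else(year >= 2020, "covid", "pre_covid")) %>%
  group_by(period) %>%
  # Anonymous-function form that replaces across(..., mean, na.rm = TRUE).
  summarise(across(starts_with("percent"), \(x) mean(x, na.rm = TRUE)),
            .groups = "drop")

period_means
```

The `\(x)` shorthand requires R 4.1 or later; on older versions, `function(x) mean(x, na.rm = TRUE)` is equivalent.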

Did Covid-19 disproportionately affect school enrollment of students with disabilities?

  mean_percent_disabilities
1                  4.899926

Did Covid-19 disproportionately affect school enrollment of students who have challenges in learning with the English language?

Limitations

  • The data does not contain a column for school type (elementary, middle, high school, etc.).

  • Scope of data: The dataset only includes NYC public schools, so the findings may not be representative of private, charter, or other types of schools in NYC.

  • Inconsistent variable definitions: Definitions of key variables, like “economic need” or “English language learners,” may have evolved over time, affecting trend analyses.

  • Categorical Limitations: Categories such as race/ethnicity and economic status are often complex and may not be captured fully by the available data.

  • Covid-19 Impact: The pandemic’s impact started in early 2020, so data from 2017-18 and 2018-19 might not capture the pre-pandemic baseline accurately.

One hurdle we failed to overcome was creating a column for school type (elementary, middle, high, vocational school, etc.). We tried to use specific prefixes and suffixes in school names as indicators of school type, but too many schools lack an indicative prefix or suffix, and searching for each school individually would require a large amount of time for little reward.
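The prefix-and-suffix approach described above might look like the sketch below. The school names are made up, and the patterns and the helper `classify_school` are illustrative guesses rather than the rules we actually tried; the final line measures how many names are left unclassified, which is the problem we ran into.

```r
# Made-up school names; the patterns below are illustrative guesses, not a vetted rule set.
school_name <- c("P.S. 001 Example School", "I.S. 100 Example",
                 "Example High School", "The Example Academy")

# Hypothetical helper: infer school type from common name prefixes and suffixes.
classify_school <- function(name) {
  ifelse(grepl("^P\\.S\\.", name), "elementary",
  ifelse(grepl("^(I\\.S\\.|M\\.S\\.|J\\.H\\.S\\.)", name), "middle",
  ifelse(grepl("High School", name, ignore.case = TRUE), "high",
         NA_character_)))
}

school_type <- classify_school(school_name)

# Share of names that no pattern could classify.
unclassified_share <- mean(is.na(school_type))
unclassified_share
```

A more reliable alternative would be to derive school type from the grade columns already in the data, for example by checking which grades have non-zero enrollment.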

If we had the opportunity to do this project again we would do the following to improve our deliverables:

  • We would make the Shiny application even more visually appealing.

  • We would try to find school enrollment data from further back to make our analysis and machine learning more robust.

  • We would combine school enrollment data with COVID-19 health-related data to gain a more well-rounded understanding of how COVID-19 impacted school enrollment.

Acknowledgments

We extend our sincere gratitude to NYC Open Data for providing the invaluable datasets that formed the foundation of our data analysis project. We would also like to thank the MPS Project course, where one of our teammates gained the machine learning skills used here.

We would also like to express our appreciation to Dr. Benjamin Soltoff for his teaching, guidance, and mentorship throughout the course.

Lastly, we would like to express our gratitude to CRAN, where we obtained all of the R packages used in our project.