COVID-19 Impact on NYC School Enrollment

Exploration

library(readr)
library(tidyr)
library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(snakecase)

Objective(s)

State the question(s) you are answering or the problem(s) you are solving clearly.

The data contains demographic information for NYC schools for the school years 2013-14 through 2021-22 (coded below as 2013 to 2021). Our objective for this project is to explore how COVID-19 affected NYC schools. We will consider how COVID-19 affected education through attributes such as race, disability status, and economic status. COVID-19 began disrupting schools in the US in March 2020.

We posed questions such as:

How did COVID-19 affect school enrollment, including total enrollment and enrollment by grade level (PK-12)?
How did COVID-19 affect the enrollment of schools with different levels of funding? Were underserved schools affected more than schools with sufficient funding?
Did COVID-19 disproportionately affect school enrollment of minority students/students of color?
Did COVID-19 disproportionately affect school enrollment of students with disabilities?
Did COVID-19 disproportionately affect school enrollment of students in poverty?
Did COVID-19 disproportionately affect school enrollment of students who face challenges learning in English (English language learners)?

Note: DBN (District Borough Number) is the combination of the District Number, the letter code for the borough, and the number of the school
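
As a small illustration of the DBN layout described in the note, the sketch below splits a DBN into its three parts. It assumes the fixed-width pattern seen in this data (two characters for the district, one letter for the borough, three characters for the school number); the helper name parse_dbn is hypothetical and is not part of the project code.

```r
# Split a DBN such as "01M015" into district, borough, and school number.
# Assumption: fixed-width layout (2 + 1 + 3 characters), as in the DBNs shown in this report.
parse_dbn <- function(dbn) {
  data.frame(
    dbn      = dbn,
    district = substr(dbn, 1, 2),  # district number, kept as text to preserve the leading zero
    borough  = substr(dbn, 3, 3),  # borough letter code
    school   = substr(dbn, 4, 6),  # school number
    stringsAsFactors = FALSE
  )
}

parsed <- parse_dbn(c("01M015", "01M019"))
parsed
```

Keeping the parts as character strings avoids dropping leading zeros, which matters if the district is later used as a grouping key.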

Data import and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns it into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc). Include code for data curation/cleaning, but not collection.

# upload data 

nyc_schools_17_to_22 <- read_csv("2017-18__-_2021-22_Demographic_Snapshot_20231020.csv")

Rows: 9251 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): DBN, School Name, Year, # Poverty, % Poverty, Economic Need Index
dbl (38): Total Enrollment, Grade 3K, Grade PK (Half Day & Full Day), Grade ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nyc_schools_13_to_18 <- read_csv("2013_-_2018_Demographic_Snapshot_School_20231024.csv")

Rows: 8972 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): DBN, School Name, Year, Economic Need Index
dbl (35): Total Enrollment, Grade PK (Half Day & Full Day), Grade K, Grade 1...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Sources of data

nyc_schools_17_to_22: source: https://data.cityofnewyork.us/Education/2017-18-2021-22-Demographic-Snapshot/c7ru-d68s data provided by New York City Department of Education

nyc_schools_13_to_18: source: https://data.cityofnewyork.us/Education/2013-2018-Demographic-Snapshot-School/s52a-8aq6 data provided by New York City Department of Education

Initial views of the data

nyc_schools_17_to_22

# A tibble: 9,251 × 44
   DBN    `School Name`             Year    `Total Enrollment` `Grade 3K`
   <chr>  <chr>                     <chr>                <dbl>      <dbl>
 1 01M015 P.S. 015 Roberto Clemente 2017-18                190          0
 2 01M015 P.S. 015 Roberto Clemente 2018-19                174          0
 3 01M015 P.S. 015 Roberto Clemente 2019-20                190          0
 4 01M015 P.S. 015 Roberto Clemente 2020-21                193          0
 5 01M015 P.S. 015 Roberto Clemente 2021-22                179          0
 6 01M019 P.S. 019 Asher Levy       2017-18                257          0
 7 01M019 P.S. 019 Asher Levy       2018-19                249          0
 8 01M019 P.S. 019 Asher Levy       2019-20                236          0
 9 01M019 P.S. 019 Asher Levy       2020-21                212          0
10 01M019 P.S. 019 Asher Levy       2021-22                176          9
# ℹ 9,241 more rows
# ℹ 39 more variables: `Grade PK (Half Day & Full Day)` <dbl>, `Grade K` <dbl>,
#   `Grade 1` <dbl>, `Grade 2` <dbl>, `Grade 3` <dbl>, `Grade 4` <dbl>,
#   `Grade 5` <dbl>, `Grade 6` <dbl>, `Grade 7` <dbl>, `Grade 8` <dbl>,
#   `Grade 9` <dbl>, `Grade 10` <dbl>, `Grade 11` <dbl>, `Grade 12` <dbl>,
#   `# Female` <dbl>, `% Female` <dbl>, `# Male` <dbl>, `% Male` <dbl>,
#   `# Asian` <dbl>, `% Asian` <dbl>, `# Black` <dbl>, `% Black` <dbl>, …

#glimpse of data 

glimpse(nyc_schools_17_to_22)

Rows: 9,251
Columns: 44
$ DBN                              <chr> "01M015", "01M015", "01M015", "01M015…
$ `School Name`                    <chr> "P.S. 015 Roberto Clemente", "P.S. 01…
$ Year                             <chr> "2017-18", "2018-19", "2019-20", "202…
$ `Total Enrollment`               <dbl> 190, 174, 190, 193, 179, 257, 249, 23…
$ `Grade 3K`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0…
$ `Grade PK (Half Day & Full Day)` <dbl> 17, 13, 14, 17, 15, 13, 10, 16, 13, 7…
$ `Grade K`                        <dbl> 28, 20, 29, 29, 30, 34, 30, 25, 23, 2…
$ `Grade 1`                        <dbl> 32, 33, 28, 29, 26, 38, 39, 27, 25, 2…
$ `Grade 2`                        <dbl> 33, 30, 38, 27, 24, 42, 43, 39, 27, 2…
$ `Grade 3`                        <dbl> 23, 30, 33, 30, 22, 46, 41, 45, 38, 2…
$ `Grade 4`                        <dbl> 31, 20, 29, 32, 33, 42, 44, 42, 44, 3…
$ `Grade 5`                        <dbl> 26, 28, 19, 29, 29, 42, 42, 42, 42, 3…
$ `Grade 6`                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 7`                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 8`                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 9`                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 10`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 11`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 12`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `# Female`                       <dbl> 99, 85, 94, 101, 96, 114, 114, 110, 9…
$ `% Female`                       <dbl> 0.521, 0.489, 0.495, 0.523, 0.536, 0.…
$ `# Male`                         <dbl> 91, 89, 96, 92, 83, 143, 135, 126, 11…
$ `% Male`                         <dbl> 0.479, 0.511, 0.505, 0.477, 0.464, 0.…
$ `# Asian`                        <dbl> 20, 24, 27, 26, 21, 23, 14, 11, 13, 1…
$ `% Asian`                        <dbl> 0.105, 0.138, 0.142, 0.135, 0.117, 0.…
$ `# Black`                        <dbl> 52, 48, 56, 53, 50, 49, 52, 49, 41, 3…
$ `% Black`                        <dbl> 0.274, 0.276, 0.295, 0.275, 0.279, 0.…
$ `# Hispanic`                     <dbl> 110, 95, 96, 102, 93, 166, 156, 148, …
$ `% Hispanic`                     <dbl> 0.579, 0.546, 0.505, 0.528, 0.520, 0.…
$ `# Multi-Racial`                 <dbl> 1, 0, 0, 1, 3, 3, 8, 8, 7, 8, 14, 12,…
$ `% Multi-Racial`                 <dbl> 0.005, 0.000, 0.000, 0.005, 0.017, 0.…
$ `# Native American`              <dbl> 1, 1, 2, 0, 0, 0, 1, 1, 1, 1, 4, 3, 5…
$ `% Native American`              <dbl> 0.005, 0.006, 0.011, 0.000, 0.000, 0.…
$ `# White`                        <dbl> 6, 6, 9, 11, 12, 16, 18, 19, 17, 21, …
$ `% White`                        <dbl> 0.032, 0.034, 0.047, 0.057, 0.067, 0.…
$ `# Missing Race/Ethnicity Data`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 1, 1, 1…
$ `% Missing Race/Ethnicity Data`  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.…
$ `# Students with Disabilities`   <dbl> 49, 39, 46, 44, 38, 90, 102, 94, 87, …
$ `% Students with Disabilities`   <dbl> 0.258, 0.224, 0.242, 0.228, 0.212, 0.…
$ `# English Language Learners`    <dbl> 8, 8, 17, 21, 11, 8, 8, 8, 9, 6, 86, …
$ `% English Language Learners`    <dbl> 0.042, 0.046, 0.089, 0.109, 0.061, 0.…
$ `# Poverty`                      <chr> "161", "147", "155", "161", "150", "1…
$ `% Poverty`                      <chr> "84.7%", "84.5%", "81.6%", "83.4%", "…
$ `Economic Need Index`            <chr> "89.0%", "88.8%", "86.7%", "86.4%", "…

#View(nyc_schools_17_to_22)

nyc_schools_13_to_18

# A tibble: 8,972 × 39
   DBN   `School Name` Year  `Total Enrollment` Grade PK (Half Day &…¹ `Grade K`
   <chr> <chr>         <chr>              <dbl>                  <dbl>     <dbl>
 1 01M0… P.S. 015 Rob… 2013…                190                     26        39
 2 01M0… P.S. 015 Rob… 2014…                183                     18        27
 3 01M0… P.S. 015 Rob… 2015…                176                     14        32
 4 01M0… P.S. 015 Rob… 2016…                178                     17        28
 5 01M0… P.S. 015 Rob… 2017…                190                     17        28
 6 01M0… P.S. 019 Ash… 2013…                285                     36        39
 7 01M0… P.S. 019 Ash… 2014…                270                     30        44
 8 01M0… P.S. 019 Ash… 2015…                270                     21        47
 9 01M0… P.S. 019 Ash… 2016…                271                     24        37
10 01M0… P.S. 019 Ash… 2017…                257                     13        34
# ℹ 8,962 more rows
# ℹ abbreviated name: ¹`Grade PK (Half Day & Full Day)`
# ℹ 33 more variables: `Grade 1` <dbl>, `Grade 2` <dbl>, `Grade 3` <dbl>,
#   `Grade 4` <dbl>, `Grade 5` <dbl>, `Grade 6` <dbl>, `Grade 7` <dbl>,
#   `Grade 8` <dbl>, `Grade 9` <dbl>, `Grade 10` <dbl>, `Grade 11` <dbl>,
#   `Grade 12` <dbl>, `# Female` <dbl>, `% Female` <dbl>, `# Male` <dbl>,
#   `% Male` <dbl>, `# Asian` <dbl>, `% Asian` <dbl>, `# Black` <dbl>, …

glimpse(nyc_schools_13_to_18)

Rows: 8,972
Columns: 39
$ DBN                                          <chr> "01M015", "01M015", "01M0…
$ `School Name`                                <chr> "P.S. 015 Roberto Clement…
$ Year                                         <chr> "2013-14", "2014-15", "20…
$ `Total Enrollment`                           <dbl> 190, 183, 176, 178, 190, …
$ `Grade PK (Half Day & Full Day)`             <dbl> 26, 18, 14, 17, 17, 36, 3…
$ `Grade K`                                    <dbl> 39, 27, 32, 28, 28, 39, 4…
$ `Grade 1`                                    <dbl> 39, 47, 33, 33, 32, 38, 4…
$ `Grade 2`                                    <dbl> 21, 31, 39, 27, 33, 36, 3…
$ `Grade 3`                                    <dbl> 16, 19, 23, 31, 23, 45, 3…
$ `Grade 4`                                    <dbl> 26, 17, 17, 24, 31, 47, 4…
$ `Grade 5`                                    <dbl> 23, 24, 18, 18, 26, 44, 4…
$ `Grade 6`                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 7`                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 8`                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 9`                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 10`                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 11`                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `Grade 12`                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `# Female`                                   <dbl> 93, 84, 83, 83, 99, 141, …
$ `% Female`                                   <dbl> 48.9, 45.9, 47.2, 46.6, 5…
$ `# Male`                                     <dbl> 97, 99, 93, 95, 91, 144, …
$ `% Male`                                     <dbl> 51.1, 54.1, 52.8, 53.4, 4…
$ `# Asian`                                    <dbl> 9, 8, 9, 14, 20, 41, 30, …
$ `% Asian`                                    <dbl> 4.7, 4.4, 5.1, 7.9, 10.5,…
$ `# Black`                                    <dbl> 72, 65, 57, 51, 52, 56, 4…
$ `% Black`                                    <dbl> 37.9, 35.5, 32.4, 28.7, 2…
$ `# Hispanic`                                 <dbl> 104, 107, 105, 105, 110, …
$ `% Hispanic`                                 <dbl> 54.7, 58.5, 59.7, 59.0, 5…
$ `# Multiple Race Categories Not Represented` <dbl> 2, 1, 3, 4, 2, 10, 8, 3, …
$ `% Multiple Race Categories Not Represented` <dbl> 1.1, 0.5, 1.7, 2.2, 1.1, …
$ `# White`                                    <dbl> 3, 2, 2, 4, 6, 30, 27, 16…
$ `% White`                                    <dbl> 1.6, 1.1, 1.1, 2.2, 3.2, …
$ `# Students with Disabilities`               <dbl> 65, 64, 60, 51, 45, 89, 8…
$ `% Students with Disabilities`               <dbl> 34.2, 35.0, 34.1, 28.7, 2…
$ `# English Language Learners`                <dbl> 19, 17, 16, 12, 8, 25, 18…
$ `% English Language Learners`                <dbl> 10.0, 9.3, 9.1, 6.7, 4.2,…
$ `# Poverty`                                  <dbl> 171, 169, 149, 152, 161, …
$ `% Poverty`                                  <dbl> 90.0, 92.3, 84.7, 85.4, 8…
$ `Economic Need Index`                        <chr> "No Data", "93.5%", "89.6…

# View(nyc_schools_13_to_18)

Data Cleaning

# deleting all rows with the years "2017-18" in nyc_schools_13_to_18
# information is already contained in nyc_schools_17_to_22

nyc_schools_13_to_18_fix <- nyc_schools_13_to_18 |> 
  filter(Year != "2017-18") 
  
nyc_schools_13_to_18_fix

# A tibble: 7,128 × 39
   DBN   `School Name` Year  `Total Enrollment` Grade PK (Half Day &…¹ `Grade K`
   <chr> <chr>         <chr>              <dbl>                  <dbl>     <dbl>
 1 01M0… P.S. 015 Rob… 2013…                190                     26        39
 2 01M0… P.S. 015 Rob… 2014…                183                     18        27
 3 01M0… P.S. 015 Rob… 2015…                176                     14        32
 4 01M0… P.S. 015 Rob… 2016…                178                     17        28
 5 01M0… P.S. 019 Ash… 2013…                285                     36        39
 6 01M0… P.S. 019 Ash… 2014…                270                     30        44
 7 01M0… P.S. 019 Ash… 2015…                270                     21        47
 8 01M0… P.S. 019 Ash… 2016…                271                     24        37
 9 01M0… P.S. 020 Ann… 2013…                631                     54       114
10 01M0… P.S. 020 Ann… 2014…                633                     51       102
# ℹ 7,118 more rows
# ℹ abbreviated name: ¹`Grade PK (Half Day & Full Day)`
# ℹ 33 more variables: `Grade 1` <dbl>, `Grade 2` <dbl>, `Grade 3` <dbl>,
#   `Grade 4` <dbl>, `Grade 5` <dbl>, `Grade 6` <dbl>, `Grade 7` <dbl>,
#   `Grade 8` <dbl>, `Grade 9` <dbl>, `Grade 10` <dbl>, `Grade 11` <dbl>,
#   `Grade 12` <dbl>, `# Female` <dbl>, `% Female` <dbl>, `# Male` <dbl>,
#   `% Male` <dbl>, `# Asian` <dbl>, `% Asian` <dbl>, `# Black` <dbl>, …
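
Dropping the 2017-18 rows from the older file is safe only if every one of those school-year records also appears in the newer file. The sketch below shows one way this could be checked with an anti-join. The two small tibbles are toy stand-ins for the raw files, and the DBN "99X999" is a made-up school used only to show what an unmatched row looks like.

```r
library(dplyr)

# Toy stand-ins for the two raw files (illustrative rows only)
old_file <- tibble(
  DBN  = c("01M015", "01M015", "99X999"),   # "99X999" is hypothetical
  Year = c("2016-17", "2017-18", "2017-18")
)
new_file <- tibble(
  DBN  = c("01M015", "01M015"),
  Year = c("2017-18", "2018-19")
)

# 2017-18 rows in the older file that have no DBN/Year match in the newer file
only_in_old <- old_file |>
  filter(Year == "2017-18") |>
  anti_join(new_file, by = c("DBN", "Year"))

only_in_old
```

If the same check on the real data returns zero rows, the filter above discards nothing that the newer file does not already cover.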

In nyc_schools_13_to_18, “# Poverty” and “% Poverty” are both of type double. In nyc_schools_17_to_22, both are character strings because schools with poverty above 95% are recorded as “Above 95%” in both the “# Poverty” and “% Poverty” columns.

The chunk below converts the “# Poverty” and “% Poverty” columns in nyc_schools_13_to_18 to character strings so the two data sets can be merged.

nyc_schools_13_to_18_fix <- nyc_schools_13_to_18_fix |> 
  mutate(`% Poverty` = as.character(`% Poverty`), 
         `# Poverty` = as.character(`# Poverty`))

Creating a joined data set

schools <- 
  # joining the two data sets.
  bind_rows(nyc_schools_13_to_18_fix, nyc_schools_17_to_22) |> 
  # turn values in Year column from YYYY-YY to YYYY, e.g. 2013-14 to 2013
  mutate("Year" = substr(Year, 1, 4)) |> 
  arrange(DBN)

# turning all column names to snakecase
colnames(schools) <- tolower(gsub(" ", "_", colnames(schools)))

# deleting  all '#' in column names to avoid confusing R
colnames(schools) <- gsub("#_", "", colnames(schools))

# turning all "%" in column names to "percent" to avoid confusing R
colnames(schools) <- gsub("%", "percent", colnames(schools))

schools

# A tibble: 16,379 × 46
   dbn    school_name      year  total_enrollment grade_pk_(half_day_&…¹ grade_k
   <chr>  <chr>            <chr>            <dbl>                  <dbl>   <dbl>
 1 01M015 P.S. 015 Robert… 2013               190                     26      39
 2 01M015 P.S. 015 Robert… 2014               183                     18      27
 3 01M015 P.S. 015 Robert… 2015               176                     14      32
 4 01M015 P.S. 015 Robert… 2016               178                     17      28
 5 01M015 P.S. 015 Robert… 2017               190                     17      28
 6 01M015 P.S. 015 Robert… 2018               174                     13      20
 7 01M015 P.S. 015 Robert… 2019               190                     14      29
 8 01M015 P.S. 015 Robert… 2020               193                     17      29
 9 01M015 P.S. 015 Robert… 2021               179                     15      30
10 01M019 P.S. 019 Asher … 2013               285                     36      39
# ℹ 16,369 more rows
# ℹ abbreviated name: ¹`grade_pk_(half_day_&_full_day)`
# ℹ 40 more variables: grade_1 <dbl>, grade_2 <dbl>, grade_3 <dbl>,
#   grade_4 <dbl>, grade_5 <dbl>, grade_6 <dbl>, grade_7 <dbl>, grade_8 <dbl>,
#   grade_9 <dbl>, grade_10 <dbl>, grade_11 <dbl>, grade_12 <dbl>,
#   female <dbl>, percent_female <dbl>, male <dbl>, percent_male <dbl>,
#   asian <dbl>, percent_asian <dbl>, black <dbl>, percent_black <dbl>, …
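
To make the effect of the three renaming steps concrete, the sketch below applies the same gsub() and tolower() calls to a few of the raw column names and then lists which cleaned names are still non-syntactic, meaning they need backticks in later code. The wrapper function clean_names_sketch is hypothetical and exists only for this illustration.

```r
# Same three renaming steps as above, wrapped in a helper for illustration
clean_names_sketch <- function(x) {
  x <- tolower(gsub(" ", "_", x))   # spaces to underscores, then lower case
  x <- gsub("#_", "", x)            # drop the "# " count prefix
  x <- gsub("%", "percent", x)      # spell out the percent sign
  x
}

raw_names <- c("School Name", "# Poverty", "% Poverty",
               "Grade PK (Half Day & Full Day)", "# Multi-Racial")

cleaned <- clean_names_sketch(raw_names)
cleaned

# Names that base R would alter to make syntactic, so they still need backticks
needs_backticks <- cleaned[cleaned != make.names(cleaned)]
needs_backticks
```

Parentheses, ampersands, hyphens, and slashes survive these steps, which is why names such as `multi-racial` are quoted with backticks in the select() call that follows.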

# organizing columns so they make sense 
schools <- schools |> 
  # deleting grade_3k because most schools don't have information on it. 
  select(-grade_3k) |> 
  # putting all the race and ethnicity columns next to each other 
  select(1:`percent_white`, 
         `multi-racial`, 
         `percent_multi-racial`, 
         `native_american`, 
         `percent_native_american`, 
         `missing_race/ethnicity_data`, 
         `percent_missing_race/ethnicity_data`,           
         everything())

schools

# A tibble: 16,379 × 45
   dbn    school_name      year  total_enrollment grade_pk_(half_day_&…¹ grade_k
   <chr>  <chr>            <chr>            <dbl>                  <dbl>   <dbl>
 1 01M015 P.S. 015 Robert… 2013               190                     26      39
 2 01M015 P.S. 015 Robert… 2014               183                     18      27
 3 01M015 P.S. 015 Robert… 2015               176                     14      32
 4 01M015 P.S. 015 Robert… 2016               178                     17      28
 5 01M015 P.S. 015 Robert… 2017               190                     17      28
 6 01M015 P.S. 015 Robert… 2018               174                     13      20
 7 01M015 P.S. 015 Robert… 2019               190                     14      29
 8 01M015 P.S. 015 Robert… 2020               193                     17      29
 9 01M015 P.S. 015 Robert… 2021               179                     15      30
10 01M019 P.S. 019 Asher … 2013               285                     36      39
# ℹ 16,369 more rows
# ℹ abbreviated name: ¹`grade_pk_(half_day_&_full_day)`
# ℹ 39 more variables: grade_1 <dbl>, grade_2 <dbl>, grade_3 <dbl>,
#   grade_4 <dbl>, grade_5 <dbl>, grade_6 <dbl>, grade_7 <dbl>, grade_8 <dbl>,
#   grade_9 <dbl>, grade_10 <dbl>, grade_11 <dbl>, grade_12 <dbl>,
#   female <dbl>, percent_female <dbl>, male <dbl>, percent_male <dbl>,
#   asian <dbl>, percent_asian <dbl>, black <dbl>, percent_black <dbl>, …

# cleaning column content structure and type 

schools <- schools |> 
  # turning all "No Data" to N/A in economic_need_index column 
  mutate(economic_need_index = ifelse(economic_need_index == "No Data", NA, 
                                      economic_need_index)) |> 
  # turning percentage and character values to numeric 
  mutate(`percent_poverty` = as.numeric(gsub("%", "", `percent_poverty`))) |> 
  mutate(`economic_need_index` = 
           as.numeric(gsub("%", "", `economic_need_index`))) |> 
  mutate(`poverty` = as.numeric(`poverty`)) |> 
  mutate(`year` = as.numeric(`year`))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `percent_poverty = as.numeric(gsub("%", "", percent_poverty))`.
Caused by warning:
! NAs introduced by coercion

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `economic_need_index = as.numeric(gsub("%", "",
  economic_need_index))`.
Caused by warning:
! NAs introduced by coercion

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `poverty = as.numeric(poverty)`.
Caused by warning:
! NAs introduced by coercion
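
These warnings are expected wherever a cell holds a label rather than a number, such as “Above 95%”: as.numeric() cannot parse the label and returns NA. The sketch below, on toy values in the same format as the raw columns, shows one possible way to flag such labels before the conversion so the information is not lost silently. The column names poverty_censored and percent_poverty_num are hypothetical.

```r
library(dplyr)

# Toy values in the same format as the raw "% Poverty" column
toy <- tibble(percent_poverty = c("84.7%", "Above 95%", "81.6%"))

toy <- toy |>
  mutate(
    # TRUE where the cell holds a label such as "Above 95%" instead of a number
    poverty_censored = grepl("^(Above|Below)", percent_poverty),
    # same conversion as above; labels become NA
    percent_poverty_num = suppressWarnings(as.numeric(gsub("%", "", percent_poverty)))
  )

toy

# number of values lost to coercion
n_coerced <- sum(is.na(toy$percent_poverty_num))
n_coerced
```

Counting the coerced values on the real data would show how many school-year records lose their poverty figures, which is worth knowing before analyzing poverty trends.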

schools

# A tibble: 16,379 × 45
   dbn    school_name       year total_enrollment grade_pk_(half_day_&…¹ grade_k
   <chr>  <chr>            <dbl>            <dbl>                  <dbl>   <dbl>
 1 01M015 P.S. 015 Robert…  2013              190                     26      39
 2 01M015 P.S. 015 Robert…  2014              183                     18      27
 3 01M015 P.S. 015 Robert…  2015              176                     14      32
 4 01M015 P.S. 015 Robert…  2016              178                     17      28
 5 01M015 P.S. 015 Robert…  2017              190                     17      28
 6 01M015 P.S. 015 Robert…  2018              174                     13      20
 7 01M015 P.S. 015 Robert…  2019              190                     14      29
 8 01M015 P.S. 015 Robert…  2020              193                     17      29
 9 01M015 P.S. 015 Robert…  2021              179                     15      30
10 01M019 P.S. 019 Asher …  2013              285                     36      39
# ℹ 16,369 more rows
# ℹ abbreviated name: ¹`grade_pk_(half_day_&_full_day)`
# ℹ 39 more variables: grade_1 <dbl>, grade_2 <dbl>, grade_3 <dbl>,
#   grade_4 <dbl>, grade_5 <dbl>, grade_6 <dbl>, grade_7 <dbl>, grade_8 <dbl>,
#   grade_9 <dbl>, grade_10 <dbl>, grade_11 <dbl>, grade_12 <dbl>,
#   female <dbl>, percent_female <dbl>, male <dbl>, percent_male <dbl>,
#   asian <dbl>, percent_asian <dbl>, black <dbl>, percent_black <dbl>, …
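
One thing the glimpse() outputs above suggest is that the two files store percentages on different scales: the older file uses 0 to 100 (for example 48.9 for “% Female”), while the newer file uses fractions (for example 0.521). After bind_rows(), a single column therefore mixes both scales. The sketch below shows one way the newer rows could be rescaled. It runs on toy rows built from values shown earlier, and the vector fraction_cols is an assumption that would need to be checked column by column against the real data; percent_poverty is left out because its values already arrive as strings like "84.7%".

```r
library(dplyr)

# Toy rows: one from the older file (2013) and one from the newer file (2017)
toy <- tibble(
  year            = c(2013, 2017),
  percent_female  = c(48.9, 0.521),
  percent_male    = c(51.1, 0.479),
  percent_poverty = c(90.0, 84.7)    # already on a 0-100 scale in both files
)

# Assumption: columns stored as fractions in the 2017+ file
fraction_cols <- c("percent_female", "percent_male")

toy_fixed <- toy |>
  mutate(across(all_of(fraction_cols),
                ~ if_else(year >= 2017, .x * 100, .x)))

toy_fixed
```

Rescaling by year rather than by value avoids misclassifying genuinely small percentages in the older rows, such as 0.5 for a small group.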

Data description

Have an initial draft of your data description section. Your data description should be about your analysis-ready data.

# glimpse of the analysis-ready data set 
glimpse(schools)

Rows: 16,379
Columns: 45
$ dbn                                              <chr> "01M015", "01M015", "…
$ school_name                                      <chr> "P.S. 015 Roberto Cle…
$ year                                             <dbl> 2013, 2014, 2015, 201…
$ total_enrollment                                 <dbl> 190, 183, 176, 178, 1…
$ `grade_pk_(half_day_&_full_day)`                 <dbl> 26, 18, 14, 17, 17, 1…
$ grade_k                                          <dbl> 39, 27, 32, 28, 28, 2…
$ grade_1                                          <dbl> 39, 47, 33, 33, 32, 3…
$ grade_2                                          <dbl> 21, 31, 39, 27, 33, 3…
$ grade_3                                          <dbl> 16, 19, 23, 31, 23, 3…
$ grade_4                                          <dbl> 26, 17, 17, 24, 31, 2…
$ grade_5                                          <dbl> 23, 24, 18, 18, 26, 2…
$ grade_6                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_7                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_8                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_9                                          <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_10                                         <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_11                                         <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ grade_12                                         <dbl> 0, 0, 0, 0, 0, 0, 0, …
$ female                                           <dbl> 93, 84, 83, 83, 99, 8…
$ percent_female                                   <dbl> 48.900, 45.900, 47.20…
$ male                                             <dbl> 97, 99, 93, 95, 91, 8…
$ percent_male                                     <dbl> 51.100, 54.100, 52.80…
$ asian                                            <dbl> 9, 8, 9, 14, 20, 24, …
$ percent_asian                                    <dbl> 4.700, 4.400, 5.100, …
$ black                                            <dbl> 72, 65, 57, 51, 52, 4…
$ percent_black                                    <dbl> 37.900, 35.500, 32.40…
$ hispanic                                         <dbl> 104, 107, 105, 105, 1…
$ percent_hispanic                                 <dbl> 54.700, 58.500, 59.70…
$ multiple_race_categories_not_represented         <dbl> 2, 1, 3, 4, NA, NA, N…
$ percent_multiple_race_categories_not_represented <dbl> 1.1, 0.5, 1.7, 2.2, N…
$ white                                            <dbl> 3, 2, 2, 4, 6, 6, 9, …
$ percent_white                                    <dbl> 1.600, 1.100, 1.100, …
$ `multi-racial`                                   <dbl> NA, NA, NA, NA, 1, 0,…
$ `percent_multi-racial`                           <dbl> NA, NA, NA, NA, 0.005…
$ native_american                                  <dbl> NA, NA, NA, NA, 1, 1,…
$ percent_native_american                          <dbl> NA, NA, NA, NA, 0.005…
$ `missing_race/ethnicity_data`                    <dbl> NA, NA, NA, NA, 0, 0,…
$ `percent_missing_race/ethnicity_data`            <dbl> NA, NA, NA, NA, 0.000…
$ students_with_disabilities                       <dbl> 65, 64, 60, 51, 49, 3…
$ percent_students_with_disabilities               <dbl> 34.200, 35.000, 34.10…
$ english_language_learners                        <dbl> 19, 17, 16, 12, 8, 8,…
$ percent_english_language_learners                <dbl> 10.000, 9.300, 9.100,…
$ poverty                                          <dbl> 171, 169, 149, 152, 1…
$ percent_poverty                                  <dbl> 90.0, 92.3, 84.7, 85.…
$ economic_need_index                              <dbl> NA, 93.5, 89.6, 89.2,…

write.csv(schools, "schools.csv", row.names=FALSE)

If using real-world data, describe it. A good model for this is presented in Gebru et al, 2018. Answer any relevant questions from sections 3.1-3.5 of the Gebru et al article, especially the following questions:

First, some notes:

DBN (District Borough Number) is the combination of the District Number, the letter code for the borough, and the number of the school.
The Economic Need Index refers to the estimated percentage of students facing economic hardship.

What are the observations (rows) and the attributes (columns)?

The schools dataset contains 45 attributes (columns) and 16,379 observations (rows). The dataset is sorted by DBN, so the rows for each school appear consecutively in increasing year order. In other words, the grain of the dataset is one row per school per year.
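
The stated grain can be verified directly. The sketch below, on a toy table with the same key columns, counts rows per dbn and year and looks for any combination that appears more than once; it also counts how many years each school contributes, since not every school necessarily appears in all years.

```r
library(dplyr)

# Toy table with the same key columns as schools
toy <- tibble(
  dbn  = c("01M015", "01M015", "01M019"),
  year = c(2013, 2014, 2013)
)

# Any dbn/year combination that occurs more than once breaks the stated grain
dupes <- toy |>
  count(dbn, year) |>
  filter(n > 1)

nrow(dupes) == 0

# Number of years available per school
years_per_school <- toy |>
  count(dbn, name = "n_years")

years_per_school
```

Running the same two steps on schools would confirm the one-row-per-school-per-year claim and show which schools have incomplete coverage.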

Descriptive information for each school includes school_name and dbn.

The year column ranges from 2013 to 2021, where each value is the starting year of the school year (e.g. 2013 for 2013-14).

Demographic attributes for each school include:

total_enrollment
enrollment by grade level (from grade_pk to grade_12)
number of students by sex (female, male)
percentage of students by sex (percent_female, percent_male)
number of students by race and ethnicity
percentage of students by race and ethnicity
number and percentage of students with disabilities
number and percentage of English language learners
number and percentage of students experiencing poverty, plus the economic_need_index
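
Because enrollment by grade is spread across one column per grade, grade-level comparisons are often easier in long format. The sketch below reshapes a single toy row (values taken from the first row shown above) with tidyr::pivot_longer(); the output column names grade and enrolled are hypothetical choices.

```r
library(dplyr)
library(tidyr)

# One toy row with three of the grade columns
toy <- tibble(dbn = "01M015", year = 2013,
              grade_k = 39, grade_1 = 39, grade_2 = 21)

# One row per school, year, and grade
long <- toy |>
  pivot_longer(starts_with("grade_"),
               names_to = "grade",
               names_prefix = "grade_",
               values_to = "enrolled")

long
```

On the full data, the pre-K column would need separate handling because its cleaned name still contains parentheses and an ampersand.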

Why was this dataset created?

This dataset was created by the New York City Department of Education to provide a comprehensive overview of student demographics across NYC schools. This data can help stakeholders, including policymakers, educators, and researchers, understand the diverse student population within NYC and make informed decisions related to educational programs, resource allocation, and policy planning.

Who funded the creation of the dataset?

The dataset appears to be related to NYC schools, so it was likely funded and maintained by the New York City Department of Education or a related governmental body.

What processes might have influenced what data was observed and recorded and what was not?

The data within the data set needed to be measurable. It is likely that this data set was drawn from a larger, more comprehensive collection of education data. This data set focuses specifically on students’ race and ethnicity, disability status, and economic status.

One factor that we noticed is that these are aggregate data without further breakdown of individuals. This is likely due to privacy concerns to mask student identities.

What preprocessing was done, and how did the data come to be in the form that you are using?

As mentioned above, this data set is aggregated. The data is likely gathered from various schools and centralized into a single data set.

Aggregations were performed to calculate percentages and other metrics.
Missing data, especially the ones related to the Economic Need Index, might have been derived from aggregated data using averages or other methods.
To ensure privacy, certain values were replaced with generic labels like “Below 5%” and “Above 95%”.
Data might have been cleaned and standardized to ensure consistency across the data set.

If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?

Because this data comes from public schools, it is likely that parents signed forms allowing the schools to collect their information for official and administrative purposes. Parents, guardians, and school staff were probably aware of the data collection. They might have expected the data to be used for administrative purposes, policy planning, resource allocation, and assessing the needs of the student population.

Especially in public schools, economic status information could be derived from things like subsidy or food stamps. The race, gender, English level and age are basic information that are collected upon school enrollment.

Data limitations

Identify any potential problems with your dataset.

No School Types: The data set does not record the type of each school.
Scope of Data: The data set only includes public schools, so the findings may not be representative of private, charter, or other types of schools in NYC.
Inconsistent Definitions: Variable definitions are not unified. Definitions of key variables, like “economic need” or “English language learners,” may have evolved over time, affecting trend analyses. For example, the older file reports “Multiple Race Categories Not Represented”, while the newer file reports “Multi-Racial”, “Native American”, and “Missing Race/Ethnicity Data” separately.
Categorical Limitations: Categories such as race/ethnicity and economic status are often complex and may not be captured fully by the available data.
Covid-19 Impact: The pandemic’s impact started in early 2020, so data from 2017-18 and 2018-19 might not capture the pre-pandemic baseline accurately.

Exploratory data analysis

Perform an (initial) exploratory data analysis.

The dataset breaks down enrollment by grade level and by demographic attributes (such as race/ethnicity, gender, economic need, and English learner status) for each school. This supports a range of analyses. For example, one can compare the racial composition of different schools or analyze enrollment trends of English learners over the years. The same data can be used to understand the percentage of minority students in each school, which can help in allocating and balancing special education resources and services.

We posed questions such as:

How did COVID-19 affect school enrollment, including total enrollment and enrollment by grade level (PK-12)?
How did COVID-19 affect the enrollment of schools with different levels of funding? Were underserved schools affected more than schools with sufficient funding?
Did COVID-19 disproportionately affect school enrollment of minority students/students of color?
Did COVID-19 disproportionately affect school enrollment of students with disabilities?
Did COVID-19 disproportionately affect school enrollment of students in poverty?
Did COVID-19 disproportionately affect school enrollment of students who face challenges learning in English (English language learners)?
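
As a starting point for the first question, the sketch below sums total enrollment by year and computes the year-over-year percentage change. It uses only the ten rows for two schools shown in the initial view of the data, so the numbers illustrate the method and are not a finding; the column names total and pct_change are hypothetical.

```r
library(dplyr)

# Enrollment for the two schools shown in the initial view (2017-18 to 2021-22)
toy <- tibble(
  dbn  = rep(c("01M015", "01M019"), each = 5),
  year = rep(2017:2021, times = 2),
  total_enrollment = c(190, 174, 190, 193, 179,
                       257, 249, 236, 212, 176)
)

enrollment_by_year <- toy |>
  group_by(year) |>
  summarise(total = sum(total_enrollment), .groups = "drop") |>
  # percentage change relative to the previous year (NA for the first year)
  mutate(pct_change = 100 * (total - dplyr::lag(total)) / dplyr::lag(total))

enrollment_by_year
```

Applied to schools, the same summary could be passed to ggplot() with geom_line() to compare the years before and after 2020, and grouping by an additional variable would extend it to the subgroup questions.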

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

Based on the initial exploratory data analysis, do the findings seem valid and are they presented clearly?

Are there any other angles or sub-questions that could add depth to the research?

Are there any critical aspects of the project that seem to be missing or need rethinking?