Project Title

Exploratory data analysis

Research question(s)

State your research question(s) clearly.

What is the impact of distance to supermarkets and grocery stores on the food accessibility and food security of low-income households in low-population counties of New York?

Data collection and cleaning

Have an initial draft of your data cleaning appendix. Document every step that takes your raw data file(s) and turns them into the analysis-ready data set that you would submit with your final project. Include text narrative describing your data collection (downloading, scraping, surveys, etc.) and any additional data curation/cleaning (merging data frames, filtering, transformations of variables, etc.). Include code for data curation/cleaning, but not collection.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2      ✔ purrr   1.0.0 
✔ tibble  3.2.1      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

raw_data <- read_csv(file = "data/food_access.csv")

Rows: 3142 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): County, State
dbl (23): Population, Housing Data.Residing in Group Quarters, Housing Data....

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# isolate New York
food_access_tidy <- raw_data |>
  filter(State == "New York") |>
  # select relevant variables
  select(County, 
         State, 
         Population, 
         `Low Access Numbers.Low Income People.1 Mile`, 
         `Low Access Numbers.Low Income People.1/2 Mile`, 
         `Low Access Numbers.Low Income People.10 Miles`, 
         `Low Access Numbers.Low Income People.20 Miles`) |>
  # normalize low-access, low-income counts to a percentage of county population
  mutate(
    `Low Access Low Income % 1 Mile` =
      100 * `Low Access Numbers.Low Income People.1 Mile` / Population,
    `Low Access Low Income % 1/2 Mile` =
      100 * `Low Access Numbers.Low Income People.1/2 Mile` / Population,
    `Low Access Low Income % 10 Miles` =
      100 * `Low Access Numbers.Low Income People.10 Miles` / Population,
    `Low Access Low Income % 20 Miles` =
      100 * `Low Access Numbers.Low Income People.20 Miles` / Population
  ) |>
  # drop the raw low-access counts; keep Population and the percentages
  select(County,
         State,
         Population,
         `Low Access Low Income % 1 Mile`,
         `Low Access Low Income % 1/2 Mile`,
         `Low Access Low Income % 10 Miles`,
         `Low Access Low Income % 20 Miles`)
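
Before moving on, it may be worth confirming that the normalized columns behave like percentages. The following is only a minimal sketch: `toy_tidy` is a made-up stand-in for `food_access_tidy` (the county names and values are invented for illustration), and `check_percent_columns()` is a hypothetical helper, not part of our pipeline.

```r
library(dplyr)
library(tibble)

# Toy stand-in for food_access_tidy; all values are invented for illustration.
toy_tidy <- tibble(
  County = c("County A", "County B", "County C"),
  State = "New York",
  Population = c(50000, 12000, 98000),
  `Low Access Low Income % 1 Mile` = c(12.5, 20.1, 8.3),
  `Low Access Low Income % 10 Miles` = c(0.4, 3.2, 0.0)
)

# Hypothetical helper: for every percentage column, count missing values
# and values outside the 0-100 range. Nonzero counts would suggest a
# problem in the cleaning step (for example, a zero or missing Population).
check_percent_columns <- function(data) {
  data |>
    summarize(
      across(
        starts_with("Low Access Low Income %"),
        list(
          n_missing = \(x) sum(is.na(x)),
          n_out_of_range = \(x) sum(x < 0 | x > 100, na.rm = TRUE)
        )
      )
    )
}

checks <- check_percent_columns(toy_tidy)
checks
```

Running the same helper on `food_access_tidy` should return zeros in every column if the normalization worked as intended.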

Data description

Have an initial draft of your data description section. Your data description should be about your analysis-ready data.

The observations are the counties of New York. Each observation has an identifying state attribute (New York), the county population, and the normalized low-access, low-income percentage of the population at distances of ½, 1, 10, and 20 miles from a supermarket. This dataset was created to look at the correlation between low-income populations and those populations’ distance from a supermarket in all counties in New York. The data come from the Food Access CSV file from the CORGIS Dataset Project by Ryan Whitcomb, Joung Min Choi, and Bo Guan.

One process that might have influenced the data is how the surveyors defined low-income and low-access. The preprocessing that was done was a count of the population deemed low-income and an analysis of that population’s distance from supermarkets. The data were collected in a 2010 census, so the participants were aware of the data collection, but nothing indicates that they knew what their data would be used for.

Our conclusion on food accessibility and security is based on variables that indicate food insecurity rather than prove it. No data on individuals’ meals or on school- or government-provided food were recorded. We assume that factors such as having a low income or living far from a grocery store contribute to food insecurity, but we do not know whether other factors in these people’s lives prevent food insecurity, such as free and reduced-price school lunch programs, soup kitchens, and farmland prevalence.

Data limitations

Identify any potential problems with your dataset.

One limitation of our dataset is that it reports only the population of each county, whereas the population density of each county would be more useful for this report. Knowing population density would allow us to differentiate between rural and urban communities.
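
If county land areas could be obtained from another source, density would be a simple ratio of population to area. The sketch below is an assumption-heavy illustration: `toy_land_area` and its `land_area_sq_mi` column are hypothetical (the limitation above implies our dataset has no such variable), and all numbers are invented.

```r
library(dplyr)
library(tibble)

# Toy county populations; values are invented for illustration.
toy_counties <- tibble(
  County = c("County A", "County B"),
  Population = c(50000, 12000)
)

# Hypothetical land-area table in square miles. Our dataset does not
# provide this; it would have to come from an external source.
toy_land_area <- tibble(
  County = c("County A", "County B"),
  land_area_sq_mi = c(500, 1200)
)

# Join on county name, then compute people per square mile.
county_density <- toy_counties |>
  left_join(toy_land_area, by = "County") |>
  mutate(density_per_sq_mi = Population / land_area_sq_mi)

county_density
```

A density column like this could then be used to separate rural from urban counties, although any cutoff would be our own choice.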

Another limitation of the dataset is that it is from 2010, so the data we are investigating are more than 10 years old and may not reflect current food insecurity trends.

Exploratory data analysis

Perform an (initial) exploratory data analysis.

food_access_summary <- food_access_tidy |>
  summarize(mean_1mile = mean(`Low Access Low Income % 1 Mile`), 
            sd_1mile = sd(`Low Access Low Income % 1 Mile`), 
            mean_halfmile = mean(`Low Access Low Income % 1/2 Mile`), 
            sd_halfmile = sd(`Low Access Low Income % 1/2 Mile`), 
            mean_10miles = mean(`Low Access Low Income % 10 Miles`), 
            sd_10miles = sd(`Low Access Low Income % 10 Miles`),
            mean_20miles = mean(`Low Access Low Income % 20 Miles`),
            sd_20miles = sd(`Low Access Low Income % 20 Miles`))
food_access_summary

# A tibble: 1 × 8
  mean_1mile sd_1mile mean_halfmile sd_halfmile mean_10miles sd_10miles
       <dbl>    <dbl>         <dbl>       <dbl>        <dbl>      <dbl>
1       15.6     8.83          21.6        8.89        0.771       1.95
# ℹ 2 more variables: mean_20miles <dbl>, sd_20miles <dbl>
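
The wide one-row summary above is truncated when printed, so the 20-mile statistics are hidden. One alternative, sketched here under assumptions, is to reshape the percentage columns into long format and summarize by distance, which yields one row per distance. `toy_tidy` is a made-up stand-in for `food_access_tidy` with invented values, and the names `distance`, `percent`, `mean_percent`, and `sd_percent` are our own choices for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for food_access_tidy; all values are invented for illustration.
toy_tidy <- tibble(
  County = c("County A", "County B", "County C"),
  `Low Access Low Income % 1/2 Mile` = c(20, 30, 10),
  `Low Access Low Income % 1 Mile` = c(10, 20, 6),
  `Low Access Low Income % 10 Miles` = c(0, 3, 0)
)

distance_summary <- toy_tidy |>
  # stack the percentage columns: one row per county-distance pair
  pivot_longer(
    cols = starts_with("Low Access Low Income %"),
    names_to = "distance",
    names_prefix = "Low Access Low Income % ",
    values_to = "percent"
  ) |>
  # mean and standard deviation across counties for each distance
  group_by(distance) |>
  summarize(mean_percent = mean(percent), sd_percent = sd(percent))

distance_summary
```

The long format also feeds directly into ggplot2 if we later want to plot the distribution of percentages by distance.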

Questions for reviewers

List specific questions for your peer reviewers and project mentor to answer in giving you feedback on this phase.

Are we narrowing our question so much that the project will become too simple? If so, do you have any tips for rephrasing the question to be broader or more relevant to the project scope?

Is there a requirement for the number of observations and columns that we use to create our visualizations and answer the research question?

Does our data description make sense for the research question, and is everything correct?