Covid Policies Data Across the United States
Appendix to report
Data cleaning
- Delete “county” column: Our research questions deal with states and regions, so we do not intend to use the county column; removing it reduces the amount of information we need to process.
- Delete “source” column: Our data analysis and visualization will not require us to reference the source of the policy.
- Delete “fips_codes” column: In a similar vein to the previous choice. FIPS stands for “Federal Information Processing Standards,” and a FIPS code here is a five-digit code used to uniquely identify counties in the United States. Because we will not be working with counties in our project, this information is not needed for analysis.
- Delete “geocoded_state” column: A geocode is a code that represents a geographic entity; in this case, it represents the geographic location of the state. Since we intend to look for patterns by state and by general region, we do not need a code for the specific physical location.
- Delete “total_phases” column: Although this data could have been useful, a significant number of rows contain NA in this column, so keeping it would cause the NA filtering step to discard too many observations.
- Rename “state_id” to just “state”: This makes the variable easier to reference. The state_id variable is simply the two-letter abbreviation of the state.
- Filter out the NA rows: We remove every row that still contains an NA value so that missing data does not distort our data representation and analysis.
- Separate the dataset into two datasets based on the “start_stop” column, one for “start” and another for “stop”: Splitting on the values in the start_stop column allows us to efficiently compare when each state or region starts and stops its policies. In the current arrangement, the comparison is difficult because both values for each state sit in the same column.
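
The cleaning steps above could be sketched as follows. This is a minimal illustration in Python using only the standard library, not necessarily the code used for the report. The columns policy_type and date, the function name clean, and all sample values are assumptions made up for the example, and None stands in for NA.

```python
# Columns removed in the cleaning steps listed above.
DROPPED_COLUMNS = {"county", "source", "fips_codes", "geocoded_state", "total_phases"}

# Hypothetical sample rows: "policy_type", "date", and every value are invented.
SAMPLE_ROWS = [
    {"state_id": "CA", "county": None, "fips_codes": None, "source": "src",
     "geocoded_state": "geo", "total_phases": None,
     "policy_type": "Policy A", "date": "2020-01-01", "start_stop": "start"},
    {"state_id": "CA", "county": None, "fips_codes": None, "source": "src",
     "geocoded_state": "geo", "total_phases": None,
     "policy_type": "Policy A", "date": "2020-02-01", "start_stop": "stop"},
    {"state_id": "TX", "county": None, "fips_codes": None, "source": "src",
     "geocoded_state": "geo", "total_phases": None,
     "policy_type": "Policy B", "date": None, "start_stop": "start"},
]


def clean(rows):
    """Apply the cleaning steps in order and return (start_rows, stop_rows)."""
    cleaned = []
    for row in rows:
        # Drop the unused columns first, so their NA values cannot remove rows later.
        kept = {key: value for key, value in row.items() if key not in DROPPED_COLUMNS}
        # Rename "state_id" to "state".
        kept["state"] = kept.pop("state_id")
        # Skip any row that still has a missing value in a remaining column.
        if any(value is None for value in kept.values()):
            continue
        cleaned.append(kept)
    # Split into one dataset per value of the "start_stop" column.
    start_rows = [r for r in cleaned if r["start_stop"] == "start"]
    stop_rows = [r for r in cleaned if r["start_stop"] == "stop"]
    return start_rows, stop_rows


start_rows, stop_rows = clean(SAMPLE_ROWS)
print(len(start_rows), len(stop_rows))  # 1 1
```

The order matters in this sketch: because the sparse columns are dropped before the NA filter runs, only missing values in the columns we keep (such as the TX row's missing date) cause a row to be removed.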