Project proposal

Author

Red Koala

library(tidyverse)
library(scales)
library(skimr)
library(emojifont)
library(janitor)
library(gridExtra)

Dataset

A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.

Make sure to load the data and use inline code for some of this information.

UFO Sightings Redux

Dataset-link: https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-06-20/readme.md

ufo_sightings <- read_csv('data/ufo_sightings.csv')
Rows: 96429 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): city, state, country_code, shape, reported_duration, summary, day_...
dbl  (1): duration_seconds
lgl  (1): has_images
dttm (2): reported_date_time, reported_date_time_utc
date (1): posted_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
places <- read_csv('data/places.csv')
Rows: 14417 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): city, alternate_city_names, state, country, country_code, timezone
dbl (4): latitude, longitude, population, elevation_m

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
day_parts_map <- read_csv('data/day_parts_map.csv')
Rows: 26409 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (2): rounded_lat, rounded_long
date (1): rounded_date
time (9): astronomical_twilight_begin, nautical_twilight_begin, civil_twilig...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(ufo_sightings)
Data summary
Name ufo_sightings
Number of rows 96429
Number of columns 12
_______________________
Column type frequency:
character 7
Date 1
logical 1
numeric 1
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1.00 3 26 0 10721 0
state 85 1.00 2 31 0 684 0
country_code 0 1.00 2 2 0 152 0
shape 2039 0.98 3 9 0 24 0
reported_duration 0 1.00 2 25 0 4956 0
summary 31 1.00 1 135 0 95898 0
day_part 2563 0.97 5 17 0 9 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
posted_date 0 1 1998-03-07 2023-05-19 2012-08-19 619

Variable type: logical

skim_variable n_missing complete_rate mean count
has_images 0 1 0 FAL: 96429

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
duration_seconds 0 1 31613.25 6399774 0 30 180 600 1987200000 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
reported_date_time 0 1 1925-12-29 2023-05-18 19:27:00 2012-02-05 03:00:00 86201
reported_date_time_utc 0 1 1925-12-29 2023-05-18 19:27:00 2012-02-05 03:00:00 86201
skim(places)
Data summary
Name places
Number of rows 14417
Number of columns 10
_______________________
Column type frequency:
character 6
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
city 0 1.0 3 26 0 10721 0
alternate_city_names 2953 0.8 3 2610 0 10792 0
state 32 1.0 2 31 0 684 0
country 0 1.0 3 31 0 197 0
country_code 0 1.0 2 2 0 152 0
timezone 0 1.0 9 30 0 222 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
latitude 0 1.00 37.76 13.00 -53.15 34.99 40.09 42.96 70.64 ▁▁▁▇▁
longitude 0 1.00 -75.36 47.96 -170.48 -95.46 -84.21 -74.82 179.19 ▂▇▁▁▁
population 0 1.00 86375.01 628045.45 0.00 1926.00 6085.00 21993.00 22315474.00 ▇▁▁▁▁
elevation_m 2285 0.84 288.23 387.96 -57.00 65.00 194.00 304.00 3097.00 ▇▁▁▁▁
skim(day_parts_map)
Data summary
Name day_parts_map
Number of rows 26409
Number of columns 12
_______________________
Column type frequency:
Date 1
difftime 9
numeric 2
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
rounded_date 0 1 1925-12-27 2023-05-21 2007-03-25 3146

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
astronomical_twilight_begin 951 0.96 40 secs 86391 secs 10:38:29.0 17153
nautical_twilight_begin 122 1.00 18 secs 86323 secs 11:09:04.0 17235
civil_twilight_begin 2 1.00 0 secs 86394 secs 11:40:42.0 17263
sunrise 2 1.00 46 secs 86395 secs 12:07:20.0 17205
solar_noon 0 1.00 16 secs 86376 secs 18:06:16.0 10336
sunset 2 1.00 0 secs 86398 secs 02:51:40.0 17225
civil_twilight_end 2 1.00 0 secs 86399 secs 02:40:25.0 17294
nautical_twilight_end 122 1.00 0 secs 86399 secs 02:43:58.0 17359
astronomical_twilight_end 951 0.96 2 secs 86393 secs 02:54:24.5 17145

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
rounded_lat 0 1 36.23 15.82 -50 30 40 40 70 ▁▁▁▇▂
rounded_long 0 1 -80.01 56.38 -170 -110 -90 -80 180 ▇▇▂▁▁

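The dataset is a combination of three CSV files:

  • ufo_sightings.csv
    • Records when and where each sighting occurred, together with details of the sighting itself such as the reported shape, duration, and a text summary (96,429 rows, 12 columns)
  • places.csv
    • Gives additional information about each place, such as its coordinates, population, elevation, and timezone (14,417 rows, 10 columns)
  • day_parts_map.csv
    • Gives the sunrise, sunset, solar noon, and twilight times for each rounded date and rounded latitude/longitude (26,409 rows, 12 columns)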
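The column specifications above show that city, state, and country_code appear in both ufo_sightings and places, so the two tables can plausibly be linked on those columns. The sketch below illustrates such a join on small made-up tibbles: toy_sightings and toy_places are hypothetical stand-ins, not the real data, and whether these three columns uniquely identify a place in the real files is an assumption that should be checked first.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for ufo_sightings and places (all values are made up).
toy_sightings <- tribble(
  ~city,     ~state, ~country_code, ~shape,
  "Tucson",  "AZ",   "US",          "light",
  "Tucson",  "AZ",   "US",          "disk",
  "Toronto", "ON",   "CA",          "orb"
)

toy_places <- tribble(
  ~city,     ~state, ~country_code, ~population,
  "Tucson",  "AZ",   "US",          500000,
  "Toronto", "ON",   "CA",          2600000
)

# Keep every sighting and attach the attributes of the matching place.
sightings_with_places <- toy_sightings %>%
  left_join(toy_places, by = c("city", "state", "country_code"))

# The dimensions of the real tables can be reported the same way.
dim(sightings_with_places)
```

A left join keeps sightings whose place has no match, which makes unmatched rows visible as missing values instead of silently dropping them.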

Questions

The two questions you want to answer.

1) Are UFO reports more likely to come from predominantly less educated states?

2) During which part of the day and which part of the year are UFO sightings most frequent, and how do these trends change over the years?

Analysis plan

A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).

How we will answer Q1

We will use the columns (city, state, country_code, population, duration_seconds, posted_date). Note that population comes from places.csv; the other columns are in ufo_sightings.csv.

Since we plan to relate UFO sightings to the literacy rates of states, we will import an external dataset that maps states/cities to their literacy rates and political alignment, and join it to the sightings data.
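One way this merge could look is sketched below, on made-up data. The external table toy_literacy, its column names (literacy_rate, population), and every value in it are assumptions for illustration only; the real external source has not been chosen yet. Scaling the counts by population is one option for making states of different sizes comparable, not a settled part of the plan.

```r
library(dplyr)
library(tibble)

# Made-up sightings with only the columns needed for this step.
toy_sightings <- tribble(
  ~state, ~country_code,
  "AZ",   "US",
  "AZ",   "US",
  "WA",   "US",
  "ON",   "CA"
)

# Hypothetical external table: column names and values are assumptions.
toy_literacy <- tribble(
  ~state, ~literacy_rate, ~population,
  "AZ",   0.85,           7000000,
  "WA",   0.90,           7500000
)

state_summary <- toy_sightings %>%
  filter(country_code == "US") %>%        # the toy external table covers US states only
  count(state, name = "n_sightings") %>%  # one row per state
  left_join(toy_literacy, by = "state") %>%
  mutate(sightings_per_100k = n_sightings / population * 1e5)

state_summary
```

The resulting table has one row per state, which is the shape a scatter plot of literacy rate against sightings would need.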

To explore the relationship between literacy rates and UFO sightings, we will use a scatter plot and a line plot showing the number of UFO sightings against the literacy rate of the city where each sighting was reported.

To explore the relationship between political affiliation and UFO sightings, we will draw a US map shaded by each state's political affiliation and overlay the UFO sightings as points. To complement this, we will use a bar chart comparing the number of sightings from conservative-leaning cities with the number from liberal-leaning cities.

In total, three graphs will answer this question: one addressing location, one addressing literacy rates, and one addressing political affiliation.

How we will answer Q2

We will use the columns (day_part, reported_date_time_utc)

We can use the day_part variable to create bar plots of how often UFO sightings occur during each part of the day. We can also extract the year from reported_date_time_utc and create line graphs of sighting frequency over the years. Finally, we can create a season column with the values “spring”, “summer”, “autumn”, and “winter” to determine whether sightings are related to the seasons.
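A minimal sketch of how the season column and the counts could be derived is shown below, using a made-up tibble in place of ufo_sightings. The season boundaries are an assumption: they follow the Northern Hemisphere meteorological convention (March to May is spring, and so on), which would mislabel sightings from the Southern Hemisphere. The day_part values are illustrative only.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Made-up stand-in for ufo_sightings with only the two columns used here.
toy_sightings <- tibble(
  reported_date_time_utc = ymd_hms(c(
    "2019-01-15 03:00:00",
    "2019-07-04 21:30:00",
    "2020-07-20 22:10:00",
    "2020-10-31 19:45:00"
  )),
  day_part = c("night", "night", "astronomical dusk", "civil dusk")
)

sightings_by_season <- toy_sightings %>%
  mutate(
    year  = year(reported_date_time_utc),
    month = month(reported_date_time_utc),
    # Assumed Northern Hemisphere meteorological seasons.
    season = case_when(
      month %in% 3:5  ~ "spring",
      month %in% 6:8  ~ "summer",
      month %in% 9:11 ~ "autumn",
      TRUE            ~ "winter"
    )
  )

# Counts that could feed the line graphs (by year and season) and bar plots (by day part).
yearly_counts   <- sightings_by_season %>% count(year, season, name = "n_sightings")
day_part_counts <- sightings_by_season %>% count(day_part, name = "n_sightings")
```

Keeping year and season as separate columns lets the same table drive both a yearly trend line and a per-season comparison.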