library(tidyverse)
library(scales)
library(skimr)
library(emojifont)
library(janitor)
library(gridExtra)
Project proposal
Dataset
A brief description of your dataset including its provenance, dimensions, etc. as well as the reason why you chose this dataset.
Make sure to load the data and use inline code for some of this information.
UFO Sightings Redux
Dataset-link: https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-06-20/readme.md
ufo_sightings <- read_csv('data/ufo_sightings.csv')
Rows: 96429 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): city, state, country_code, shape, reported_duration, summary, day_...
dbl (1): duration_seconds
lgl (1): has_images
dttm (2): reported_date_time, reported_date_time_utc
date (1): posted_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
places <- read_csv('data/places.csv')
Rows: 14417 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): city, alternate_city_names, state, country, country_code, timezone
dbl (4): latitude, longitude, population, elevation_m
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
day_parts_map <- read_csv('data/day_parts_map.csv')
Rows: 26409 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): rounded_lat, rounded_long
date (1): rounded_date
time (9): astronomical_twilight_begin, nautical_twilight_begin, civil_twilig...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(ufo_sightings)
Name | ufo_sightings |
Number of rows | 96429 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
character | 7 |
Date | 1 |
logical | 1 |
numeric | 1 |
POSIXct | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
city | 0 | 1.00 | 3 | 26 | 0 | 10721 | 0 |
state | 85 | 1.00 | 2 | 31 | 0 | 684 | 0 |
country_code | 0 | 1.00 | 2 | 2 | 0 | 152 | 0 |
shape | 2039 | 0.98 | 3 | 9 | 0 | 24 | 0 |
reported_duration | 0 | 1.00 | 2 | 25 | 0 | 4956 | 0 |
summary | 31 | 1.00 | 1 | 135 | 0 | 95898 | 0 |
day_part | 2563 | 0.97 | 5 | 17 | 0 | 9 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
posted_date | 0 | 1 | 1998-03-07 | 2023-05-19 | 2012-08-19 | 619 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
has_images | 0 | 1 | 0 | FAL: 96429 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
duration_seconds | 0 | 1 | 31613.25 | 6399774 | 0 | 30 | 180 | 600 | 1987200000 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
reported_date_time | 0 | 1 | 1925-12-29 | 2023-05-18 19:27:00 | 2012-02-05 03:00:00 | 86201 |
reported_date_time_utc | 0 | 1 | 1925-12-29 | 2023-05-18 19:27:00 | 2012-02-05 03:00:00 | 86201 |
skim(places)
Name | places |
Number of rows | 14417 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
city | 0 | 1.0 | 3 | 26 | 0 | 10721 | 0 |
alternate_city_names | 2953 | 0.8 | 3 | 2610 | 0 | 10792 | 0 |
state | 32 | 1.0 | 2 | 31 | 0 | 684 | 0 |
country | 0 | 1.0 | 3 | 31 | 0 | 197 | 0 |
country_code | 0 | 1.0 | 2 | 2 | 0 | 152 | 0 |
timezone | 0 | 1.0 | 9 | 30 | 0 | 222 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
latitude | 0 | 1.00 | 37.76 | 13.00 | -53.15 | 34.99 | 40.09 | 42.96 | 70.64 | ▁▁▁▇▁ |
longitude | 0 | 1.00 | -75.36 | 47.96 | -170.48 | -95.46 | -84.21 | -74.82 | 179.19 | ▂▇▁▁▁ |
population | 0 | 1.00 | 86375.01 | 628045.45 | 0.00 | 1926.00 | 6085.00 | 21993.00 | 22315474.00 | ▇▁▁▁▁ |
elevation_m | 2285 | 0.84 | 288.23 | 387.96 | -57.00 | 65.00 | 194.00 | 304.00 | 3097.00 | ▇▁▁▁▁ |
skim(day_parts_map)
Name | day_parts_map |
Number of rows | 26409 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
Date | 1 |
difftime | 9 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
rounded_date | 0 | 1 | 1925-12-27 | 2023-05-21 | 2007-03-25 | 3146 |
Variable type: difftime
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
astronomical_twilight_begin | 951 | 0.96 | 40 secs | 86391 secs | 10:38:29.0 | 17153 |
nautical_twilight_begin | 122 | 1.00 | 18 secs | 86323 secs | 11:09:04.0 | 17235 |
civil_twilight_begin | 2 | 1.00 | 0 secs | 86394 secs | 11:40:42.0 | 17263 |
sunrise | 2 | 1.00 | 46 secs | 86395 secs | 12:07:20.0 | 17205 |
solar_noon | 0 | 1.00 | 16 secs | 86376 secs | 18:06:16.0 | 10336 |
sunset | 2 | 1.00 | 0 secs | 86398 secs | 02:51:40.0 | 17225 |
civil_twilight_end | 2 | 1.00 | 0 secs | 86399 secs | 02:40:25.0 | 17294 |
nautical_twilight_end | 122 | 1.00 | 0 secs | 86399 secs | 02:43:58.0 | 17359 |
astronomical_twilight_end | 951 | 0.96 | 2 secs | 86393 secs | 02:54:24.5 | 17145 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
rounded_lat | 0 | 1 | 36.23 | 15.82 | -50 | 30 | 40 | 40 | 70 | ▁▁▁▇▂ |
rounded_long | 0 | 1 | -80.01 | 56.38 | -170 | -110 | -90 | -80 | 180 | ▇▇▂▁▁ |
The dataset is a combination of three CSV files:
ufo_sightings.csv
- Records when and where each sighting was reported, along with details of the sighting itself such as shape, duration, and a summary (96,429 rows, 12 columns)
places.csv
- Gives additional information about each place, such as latitude, longitude, population, elevation, and timezone (14,417 rows, 10 columns)
day_parts_map.csv
- Gives sunrise, sunset, solar noon, and twilight times for each rounded date and rounded latitude/longitude (26,409 rows, 12 columns)
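As a minimal sketch of how the sightings and places tables might be combined, the example below joins two toy data frames. It assumes that city, state, and country_code together identify a place in both tables; that key is an assumption based on the shared column names, and the toy values are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for ufo_sightings and places (values are invented)
ufo_toy <- data.frame(
  city = c("Phoenix", "Seattle", "Phoenix"),
  state = c("AZ", "WA", "AZ"),
  country_code = c("US", "US", "US"),
  shape = c("light", "disk", "triangle")
)
places_toy <- data.frame(
  city = c("Phoenix", "Seattle"),
  state = c("AZ", "WA"),
  country_code = c("US", "US"),
  population = c(1600000, 737000)
)

# Assumption: city + state + country_code is the join key in both tables.
# left_join keeps every sighting, even if no matching place is found.
sightings_with_places <- ufo_toy %>%
  left_join(places_toy, by = c("city", "state", "country_code"))

print(sightings_with_places)
```

With the real data, it would be worth checking that the key is unique in places before joining, since duplicate keys would duplicate sightings.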
Questions
The two questions you want to answer.
1) Are UFO reports more likely to come from predominantly less educated states?
2) During what part of the day and what part of the year are UFO sightings most frequent, and how do these trends change over the years?
Analysis plan
A plan for answering each of the questions including the variables involved, variables to be created (if any), external data to be merged in (if any).
How we will answer Q1
We will use the columns (city, state, country_code, population, duration_seconds, posted_date).
Since we plan to relate UFO sightings to the literacy rates of states, we will import an external dataset that maps states/cities to literacy rates and political alignment.
To compare literacy rates with UFO sightings, we will use a scatter plot and a line plot showing the number of UFO sightings against the literacy rate of the city where each sighting was reported.
To compare political affiliation with UFO sightings, we will shade a US map by each state's political affiliation and overlay the UFO sightings as points. Alongside it, a bar chart will compare the number of sightings from conservative-leaning cities with the number from liberal-leaning cities.
In total, three graphs will answer the question: one for location, one for literacy rates, and one for political affiliation.
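The Q1 plan could be prototyped roughly as follows. This is only a sketch: education_toy is a hypothetical stand-in for the external dataset (which has not been chosen yet), the column pct_bachelors and all values are invented, and the per-capita rate is one possible way to use population so that large states do not dominate the comparison.

```r
library(dplyr)

# Toy sightings (invented); the real table is ufo_sightings
sightings_toy <- data.frame(
  state = c("AZ", "AZ", "WA", "TX", "TX", "TX"),
  country_code = "US"
)

# Hypothetical external table: one row per state with an education measure
education_toy <- data.frame(
  state = c("AZ", "WA", "TX"),
  population = c(7000000, 7500000, 29000000),
  pct_bachelors = c(30, 37, 31)
)

state_summary <- sightings_toy %>%
  filter(country_code == "US") %>%                 # keep US sightings only
  count(state, name = "n_sightings") %>%           # sightings per state
  inner_join(education_toy, by = "state") %>%      # attach external data
  mutate(sightings_per_100k = n_sightings / population * 1e5)

print(state_summary)
```

A table shaped like state_summary could then feed the scatter plot of sightings against the education measure.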
How we will answer Q2
We will use the columns (day_part, reported_date_time_utc).
We can use the variable day_part to create bar plots of the frequency of UFO sightings during each part of the day. We can also extract the year from reported_date_time_utc and create line graphs of the frequency of UFO sightings over the years. Finally, we can create a season column with the values “spring”, “summer”, “autumn”, and “winter” to determine whether sightings are related to the seasons.
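One way the year and season variables might be derived is sketched below on toy data. It assumes meteorological seasons for the Northern Hemisphere (March to May as spring, and so on); sightings from the Southern Hemisphere would need the mapping reversed. The toy timestamps and day_part values are illustrative only.

```r
library(dplyr)
library(lubridate)

# Toy sightings (invented); the real table is ufo_sightings
sightings_toy <- data.frame(
  reported_date_time_utc = as.POSIXct(
    c("2010-01-15 03:00:00", "2010-07-04 21:30:00",
      "2011-10-31 19:00:00", "2011-04-10 05:15:00"),
    tz = "UTC"
  ),
  day_part = c("night", "civil dusk", "night", "astronomical dawn")
)

with_season <- sightings_toy %>%
  mutate(
    year = year(reported_date_time_utc),
    # Assumption: Northern Hemisphere meteorological seasons
    season = case_when(
      month(reported_date_time_utc) %in% 3:5  ~ "spring",
      month(reported_date_time_utc) %in% 6:8  ~ "summer",
      month(reported_date_time_utc) %in% 9:11 ~ "autumn",
      TRUE                                    ~ "winter"
    )
  )

# Counts that could feed the line graphs and bar plots
by_year_season <- with_season %>% count(year, season)
by_day_part <- with_season %>% count(day_part)

print(by_year_season)
print(by_day_part)
```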