library(tidyverse)
library(skimr)
library(haven)
Coffee Ratings
Proposal
Data 1
Introduction and data
- Identify the source of the data.
https://cps.ipums.org/cps-action/variables/group?
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
IPUMS includes data produced by a broad range of agencies, including the Census Bureau, the Bureau of Labor Statistics, the National Science Foundation, the National Center for Health Statistics, the Centers for Disease Control, and the National Aeronautics and Space Administration.
- Write a brief description of the observations.
Each observation is one individual in the survey, recording the survey year, the weeks they worked in the last year (wkswork2), their wage income (incwage), and their migration information (migrate1, migsta1, whymove).
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Is an individual’s migration status related to their income level in a given year?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
As transportation has become more efficient and convenient, moving is no longer rare. There are many reasons for people to leave their hometowns and move to a brand new place. We wonder whether the decision to move is related to a person’s economic level. In other words, we hope to learn whether a person’s financial situation affects their choice of where to move, if they move at all. We also wish to observe trends in moving over time, so we examine different years.
- Identify the types of variables in your research question. Categorical? Quantitative?
Migration status is categorical, income level is quantitative, and year is quantitative.
Glimpse of data
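# Read the IPUMS CPS extract (Stata format)
micro <- read_dta("data/cps_00008.dta")
# Keep rows with wkswork2 == 6 and wage income strictly between 0 and 9999998,
# then take the first 1000 rows
filtered <- filter(micro, wkswork2 == 6, incwage > 0 & incwage < 9999998) |>
  slice(1:1000)
skimr::skim(filtered)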
Name | filtered |
Number of rows | 1000 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
numeric | 14 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 2.020000e+03 | 0.000000e+00 | 2020.00 | 2.02000e+03 | 2.02000e+03 | 2.02000e+03 | 2.020000e+03 | ▁▁▇▁▁ |
serial | 0 | 1 | 9.930600e+02 | 4.937000e+02 | 1.00 | 5.85750e+02 | 1.06150e+03 | 1.44200e+03 | 1.768000e+03 | ▃▆▅▆▇ |
month | 0 | 1 | 3.000000e+00 | 0.000000e+00 | 3.00 | 3.00000e+00 | 3.00000e+00 | 3.00000e+00 | 3.000000e+00 | ▁▁▇▁▁ |
cpsid | 0 | 1 | 1.536704e+13 | 8.616170e+12 | 0.00 | 2.01812e+13 | 2.01903e+13 | 2.02001e+13 | 2.020030e+13 | ▂▁▁▁▇ |
asecflag | 0 | 1 | 1.000000e+00 | 0.000000e+00 | 1.00 | 1.00000e+00 | 1.00000e+00 | 1.00000e+00 | 1.000000e+00 | ▁▁▇▁▁ |
asecwth | 0 | 1 | 9.443200e+02 | 4.036900e+02 | 299.54 | 6.60760e+02 | 8.83460e+02 | 1.11061e+03 | 2.670760e+03 | ▆▇▂▁▁ |
pernum | 0 | 1 | 1.690000e+00 | 9.200000e-01 | 1.00 | 1.00000e+00 | 1.00000e+00 | 2.00000e+00 | 8.000000e+00 | ▇▁▁▁▁ |
cpsidp | 0 | 1 | 1.536704e+13 | 8.616170e+12 | 0.00 | 2.01812e+13 | 2.01903e+13 | 2.02001e+13 | 2.020030e+13 | ▂▁▁▁▇ |
asecwt | 0 | 1 | 9.760400e+02 | 4.280300e+02 | 299.54 | 6.69310e+02 | 9.12180e+02 | 1.16113e+03 | 3.699200e+03 | ▇▅▁▁▁ |
wkswork2 | 0 | 1 | 6.000000e+00 | 0.000000e+00 | 6.00 | 6.00000e+00 | 6.00000e+00 | 6.00000e+00 | 6.000000e+00 | ▁▁▇▁▁ |
incwage | 0 | 1 | 6.442857e+04 | 6.924175e+04 | 75.00 | 3.12000e+04 | 5.00000e+04 | 7.50000e+04 | 1.099999e+06 | ▇▁▁▁▁ |
migsta1 | 0 | 1 | 9.271000e+01 | 2.031000e+01 | 9.00 | 9.90000e+01 | 9.90000e+01 | 9.90000e+01 | 9.900000e+01 | ▁▁▁▁▇ |
whymove | 0 | 1 | 7.800000e-01 | 3.020000e+00 | 0.00 | 0.00000e+00 | 0.00000e+00 | 0.00000e+00 | 2.000000e+01 | ▇▁▁▁▁ |
migrate1 | 0 | 1 | 1.260000e+00 | 8.700000e-01 | 1.00 | 1.00000e+00 | 1.00000e+00 | 1.00000e+00 | 6.000000e+00 | ▇▁▁▁▁ |
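As a rough sketch of how the research question above could be explored, the helper below groups individuals by migration status and compares wage income across groups. The function name `summarise_income_by_migration` is hypothetical and not part of the proposal. It assumes only the `migrate1` and `incwage` columns shown in the skim output.

```r
library(dplyr)

# Hypothetical helper (not from the original proposal): summarise wage
# income within each migration-status category.
# Assumes the data has columns `migrate1` and `incwage`.
summarise_income_by_migration <- function(data) {
  data |>
    group_by(migrate1) |>
    summarise(
      n = n(),                          # individuals in the category
      median_incwage = median(incwage), # robust to the long right tail
      mean_incwage = mean(incwage),
      .groups = "drop"
    )
}
```

It could be called as `summarise_income_by_migration(filtered)`. The median is reported alongside the mean because the `incwage` histogram above is strongly right-skewed.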
Data 2
Introduction and data
- Identify the source of the data.
https://think.cs.vt.edu/corgis/csv/coffee/
https://github.com/rfordatascience/tidytuesday/tree/2e9bd5a67e09b14d01f616b00f7f7e0931515d24/data/2020/2020-07-07
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The original creator of this data set is BuzzFeed data scientist James LeDoux, who collected the data from the Coffee Quality Institute’s review pages in January 2018.
- Write a brief description of the observations.
Each observation contains information about a coffee’s country of origin, owner (of the farm), farm name, altitude, region, producer, bag weight, processing method, company (name), and other logistical details about its source. Each observation also includes many measurements of taste and quality, such as flavor, aftertaste, acidity, body, balance, uniformity, moisture, sweetness, and color.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Do certain regions around the world specialize in different tastes of coffee?
Which region produces the highest quality coffee?
Which components of a coffee’s quality and taste are affected by the altitude at which it is grown?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
Coffee is one of the most widely consumed beverages around the world, and with the world more interconnected than ever, we have access to different types of coffee from all across the globe. Researching the geographical differences among coffee sources would inform consumers worldwide about the options that might best suit different tastes. We expect that coffees from the same regions and altitudes will have similar ratings, and that the more geographically separated two regions are, the less similar their coffees will taste.
- Identify the types of variables in your research question. Categorical? Quantitative?
Categorical variables: region, country, farm name, processing method, company, color
Quantitative variables: all taste/quality scores out of 10 (flavor, aftertaste, acidity, etc.), altitude, bag weight
Glimpse of data
# add code here
coffee_ratings <- read_csv("data/coffee_ratings.csv")
Rows: 1339 Columns: 43
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coffee_ratings)
Name | coffee_ratings |
Number of rows | 1339 |
Number of columns | 43 |
_______________________ | |
Column type frequency: | |
character | 24 |
numeric | 19 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
species | 0 | 1.00 | 7 | 7 | 0 | 2 | 0 |
owner | 7 | 0.99 | 3 | 50 | 0 | 315 | 0 |
country_of_origin | 1 | 1.00 | 4 | 28 | 0 | 36 | 0 |
farm_name | 359 | 0.73 | 1 | 73 | 0 | 571 | 0 |
lot_number | 1063 | 0.21 | 1 | 71 | 0 | 227 | 0 |
mill | 315 | 0.76 | 1 | 77 | 0 | 460 | 0 |
ico_number | 151 | 0.89 | 1 | 40 | 0 | 847 | 0 |
company | 209 | 0.84 | 3 | 73 | 0 | 281 | 0 |
altitude | 226 | 0.83 | 1 | 41 | 0 | 396 | 0 |
region | 59 | 0.96 | 2 | 76 | 0 | 356 | 0 |
producer | 231 | 0.83 | 1 | 100 | 0 | 691 | 0 |
bag_weight | 0 | 1.00 | 1 | 8 | 0 | 56 | 0 |
in_country_partner | 0 | 1.00 | 7 | 85 | 0 | 27 | 0 |
harvest_year | 47 | 0.96 | 3 | 24 | 0 | 46 | 0 |
grading_date | 0 | 1.00 | 13 | 20 | 0 | 567 | 0 |
owner_1 | 7 | 0.99 | 3 | 50 | 0 | 319 | 0 |
variety | 226 | 0.83 | 4 | 21 | 0 | 29 | 0 |
processing_method | 170 | 0.87 | 5 | 25 | 0 | 5 | 0 |
color | 218 | 0.84 | 4 | 12 | 0 | 4 | 0 |
expiration | 0 | 1.00 | 13 | 20 | 0 | 566 | 0 |
certification_body | 0 | 1.00 | 7 | 85 | 0 | 26 | 0 |
certification_address | 0 | 1.00 | 40 | 40 | 0 | 32 | 0 |
certification_contact | 0 | 1.00 | 40 | 40 | 0 | 29 | 0 |
unit_of_measurement | 0 | 1.00 | 1 | 2 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
total_cup_points | 0 | 1.00 | 82.09 | 3.50 | 0 | 81.08 | 82.50 | 83.67 | 90.58 | ▁▁▁▁▇ |
number_of_bags | 0 | 1.00 | 154.18 | 129.99 | 0 | 14.00 | 175.00 | 275.00 | 1062.00 | ▇▇▁▁▁ |
aroma | 0 | 1.00 | 7.57 | 0.38 | 0 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
flavor | 0 | 1.00 | 7.52 | 0.40 | 0 | 7.33 | 7.58 | 7.75 | 8.83 | ▁▁▁▁▇ |
aftertaste | 0 | 1.00 | 7.40 | 0.40 | 0 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▁▁▁▇ |
acidity | 0 | 1.00 | 7.54 | 0.38 | 0 | 7.33 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
body | 0 | 1.00 | 7.52 | 0.37 | 0 | 7.33 | 7.50 | 7.67 | 8.58 | ▁▁▁▁▇ |
balance | 0 | 1.00 | 7.52 | 0.41 | 0 | 7.33 | 7.50 | 7.75 | 8.75 | ▁▁▁▁▇ |
uniformity | 0 | 1.00 | 9.83 | 0.55 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
clean_cup | 0 | 1.00 | 9.84 | 0.76 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
sweetness | 0 | 1.00 | 9.86 | 0.62 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
cupper_points | 0 | 1.00 | 7.50 | 0.47 | 0 | 7.25 | 7.50 | 7.75 | 10.00 | ▁▁▁▇▁ |
moisture | 0 | 1.00 | 0.09 | 0.05 | 0 | 0.09 | 0.11 | 0.12 | 0.28 | ▃▇▅▁▁ |
category_one_defects | 0 | 1.00 | 0.48 | 2.55 | 0 | 0.00 | 0.00 | 0.00 | 63.00 | ▇▁▁▁▁ |
quakers | 1 | 1.00 | 0.17 | 0.83 | 0 | 0.00 | 0.00 | 0.00 | 11.00 | ▇▁▁▁▁ |
category_two_defects | 0 | 1.00 | 3.56 | 5.31 | 0 | 0.00 | 2.00 | 4.00 | 55.00 | ▇▁▁▁▁ |
altitude_low_meters | 230 | 0.83 | 1750.71 | 8669.44 | 1 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
altitude_high_meters | 230 | 0.83 | 1799.35 | 8668.81 | 1 | 1100.00 | 1350.00 | 1650.00 | 190164.00 | ▇▁▁▁▁ |
altitude_mean_meters | 230 | 0.83 | 1775.03 | 8668.63 | 1 | 1100.00 | 1310.64 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
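The altitude columns in the summary above have a maximum of 190164 meters, which is not a plausible growing altitude, so the altitude question probably needs some cleaning first. The sketch below is one possible approach: drop missing and implausible altitudes, then correlate altitude with each taste score. The function name, the `score_cols` argument, and the 5000 m cutoff are illustrative assumptions, not choices made in the proposal.

```r
library(dplyr)

# Hypothetical helper: correlation between mean altitude and each taste score.
# `max_altitude_m` is an assumed cutoff for dropping implausible altitudes.
altitude_score_correlations <- function(
    data,
    score_cols = c("aroma", "flavor", "aftertaste", "acidity", "body", "balance"),
    max_altitude_m = 5000) {
  # Remove rows with missing or implausibly high altitude
  clean <- data |>
    filter(!is.na(altitude_mean_meters), altitude_mean_meters <= max_altitude_m)

  # One Pearson correlation per score column
  correlations <- vapply(
    score_cols,
    \(col) cor(clean$altitude_mean_meters, clean[[col]]),
    numeric(1)
  )

  tibble(score = score_cols, correlation = unname(correlations))
}
```

It could be called as `altitude_score_correlations(coffee_ratings)`. A correlation only captures a linear association, so a scatterplot of each score against altitude would be a sensible companion check.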
Data 3
Introduction and data
- Identify the source of the data.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QFTAPM
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The original data curator collected the data and published it on March 8, 2023. It was built from General Transit Feed Specification (GTFS) data, the format in which over 80 transit agencies in Canada publish their transit data.
- Write a brief description of the observations.
Each observation contains data about a bus route segment: its segment id, route id, direction, number of traversals, distance, the ids of its two stops, the geometry of the path, its start point, and its end point. Together, the observations describe the buses in Toronto - Greater Ontario, their routes, and the placement of their stops. The stop fields in each observation (stop 1 id, stop 2 id, start point, and end point) may seem repetitive because they are all locations, but they are critical to determining the placement of the stops along each route.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
Does the geometry of a bus route affect the distance and number of stops along the route?
What correlations can be observed between the stops along a bus path and the number of traversals?
Do buses traveling in opposite directions follow the same path and stops, and if not, how do they differ?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
As a major city with one of the largest land areas, Toronto and the Greater Ontario area require a complete and efficient transit system in order to function. Because buses are one of the most relied-upon methods of travel, bus routes in Toronto must cover most points on the map to allow for ease of travel in the city. Studying these routes will allow us to observe whether the current locations of bus stops, the distances between stops, and the geometry of the routes cover the map efficiently. We expect loop routes to have more stops, spaced more closely, than straight-line routes, since a straight-line route with as many stops would take too long to travel from start to end. We also expect bus paths that are shorter or have fewer stops to have more traversals, since they most likely serve areas with higher population that need more frequent service. Finally, we hypothesize that buses traveling in opposite directions follow the same path, since serving the same required stops in both directions makes more sense when separate routes can cover different paths. These questions are important for planning future bus paths or changing existing ones, for example by identifying unnecessarily repeated paths or points that lack transit.
- Identify the types of variables in your research question. Categorical? Quantitative?
Categorical variables: segment_id, route_id, direction_id, stop_id1, stop_id2, geometry, start_point, end_point
Quantitative variables: traversals, distance
Glimpse of data
# add code here
bus_spacing <- read_csv("data/spacings_with_geometry.csv")
Rows: 1857 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): segment_id, route_id, stop_id1, stop_id2, geometry, start_point, en...
dbl (3): direction_id, traversals, distance
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(bus_spacing)
Name | bus_spacing |
Number of rows | 1857 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
segment_id | 0 | 1 | 10 | 13 | 0 | 1212 | 0 |
route_id | 0 | 1 | 11 | 11 | 0 | 39 | 0 |
stop_id1 | 0 | 1 | 2 | 5 | 0 | 978 | 0 |
stop_id2 | 0 | 1 | 2 | 5 | 0 | 980 | 0 |
geometry | 0 | 1 | 99 | 35887 | 0 | 1230 | 0 |
start_point | 0 | 1 | 24 | 28 | 0 | 974 | 0 |
end_point | 0 | 1 | 24 | 28 | 0 | 977 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
direction_id | 0 | 1 | 0.51 | 0.50 | 0.00 | 0.0 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
traversals | 0 | 1 | 14.96 | 10.92 | 1.00 | 6.0 | 14.00 | 22.00 | 75.00 | ▇▅▂▁▁ |
distance | 0 | 1 | 3716.10 | 7919.67 | 121.87 | 417.5 | 821.52 | 2729.64 | 93664.84 | ▇▁▁▁▁ |
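As a possible first step toward the questions about direction and traversals, the sketch below summarises the rows for each route and direction, so that the two directions of a route can be compared side by side. The function name `summarise_routes` is hypothetical. It assumes the `route_id`, `direction_id`, `traversals`, and `distance` columns shown above, and that each row is one segment of a route.

```r
library(dplyr)

# Hypothetical helper: one summary row per route and direction.
# Assumes each input row is one segment of a route.
summarise_routes <- function(data) {
  data |>
    group_by(route_id, direction_id) |>
    summarise(
      n_segments = n(),                    # rows (segments) in this direction
      total_distance = sum(distance),      # summed segment distance
      mean_traversals = mean(traversals),  # average traversals per segment
      .groups = "drop"
    )
}
```

It could be called as `summarise_routes(bus_spacing)`. Routes whose two directions differ noticeably in `n_segments` or `total_distance` would be natural candidates for a closer look at whether both directions follow the same path.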