Coffee Ratings

Proposal

library(tidyverse)
library(skimr)
library(haven)

Data 1

Introduction and data

  • Identify the source of the data.

https://cps.ipums.org/cps-action/variables/group?

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

IPUMS includes data produced by a broad range of agencies, including the Census Bureau, the Bureau of Labor Statistics, the National Science Foundation, the National Center for Health Statistics, the Centers for Disease Control, and the National Aeronautics and Space Administration.

  • Write a brief description of the observations.

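Each observation is an individual. The extract records the survey year and month, the weeks the person worked in the last year (wkswork2), their wage income in that year (incwage), and their migration status over the past year (migrate1, migsta1, whymove).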

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is migration level related to an individual’s income level in a randomly selected year?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

Now that transportation has become more efficient and convenient, moving is no longer rare. People leave their hometowns and move to new places for many reasons. We want to know whether the decision to move is related to a person’s economic standing; in other words, how, if at all, a person’s financial situation affects their choice of where to move. We also want to observe social trends in moving, so we track different years.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Migration level is categorical, income level is quantitative and year is quantitative.
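One way to approach this question is to treat migrate1 as a labelled category and compare income across its levels. The sketch below runs on invented rows, not the CPS extract, and the mapping from codes to labels is an assumption that should be checked against the IPUMS codebook before use.

```r
library(dplyr)

# Toy rows standing in for the CPS extract; all values are invented.
toy_cps <- tibble(
  migrate1 = c(1, 1, 1, 4, 4, 5),
  incwage  = c(30000, 50000, 70000, 40000, 60000, 90000)
)

# Assumed mapping from migrate1 codes to labels (verify in the codebook).
migration_labels <- c(
  "1" = "Same house",
  "4" = "Moved within state",
  "5" = "Moved between states"
)

# Recode the numeric codes as a factor, then summarise income per category.
income_by_migration <- toy_cps |>
  mutate(migration = factor(migrate1,
                            levels = as.numeric(names(migration_labels)),
                            labels = unname(migration_labels))) |>
  group_by(migration) |>
  summarise(n = n(), median_income = median(incwage), .groups = "drop")

income_by_migration
```

The median is used here rather than the mean because the income summary for this data set is strongly right-skewed; either statistic could be swapped in.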

Glimpse of data

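# Read the IPUMS CPS extract (Stata format) with haven
micro <- read_dta("data/cps_00008.dta")
# Keep rows with wkswork2 == 6 and a positive wage income below the
# 9999998 cutoff, then take the first 1,000 rows
filtered <- micro |>
  filter(wkswork2 == 6, incwage > 0, incwage < 9999998) |>
  slice(1:1000)
skimr::skim(filtered)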
Data summary
Name filtered
Number of rows 1000
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.020000e+03 0.000000e+00 2020.00 2.02000e+03 2.02000e+03 2.02000e+03 2.020000e+03 ▁▁▇▁▁
serial 0 1 9.930600e+02 4.937000e+02 1.00 5.85750e+02 1.06150e+03 1.44200e+03 1.768000e+03 ▃▆▅▆▇
month 0 1 3.000000e+00 0.000000e+00 3.00 3.00000e+00 3.00000e+00 3.00000e+00 3.000000e+00 ▁▁▇▁▁
cpsid 0 1 1.536704e+13 8.616170e+12 0.00 2.01812e+13 2.01903e+13 2.02001e+13 2.020030e+13 ▂▁▁▁▇
asecflag 0 1 1.000000e+00 0.000000e+00 1.00 1.00000e+00 1.00000e+00 1.00000e+00 1.000000e+00 ▁▁▇▁▁
asecwth 0 1 9.443200e+02 4.036900e+02 299.54 6.60760e+02 8.83460e+02 1.11061e+03 2.670760e+03 ▆▇▂▁▁
pernum 0 1 1.690000e+00 9.200000e-01 1.00 1.00000e+00 1.00000e+00 2.00000e+00 8.000000e+00 ▇▁▁▁▁
cpsidp 0 1 1.536704e+13 8.616170e+12 0.00 2.01812e+13 2.01903e+13 2.02001e+13 2.020030e+13 ▂▁▁▁▇
asecwt 0 1 9.760400e+02 4.280300e+02 299.54 6.69310e+02 9.12180e+02 1.16113e+03 3.699200e+03 ▇▅▁▁▁
wkswork2 0 1 6.000000e+00 0.000000e+00 6.00 6.00000e+00 6.00000e+00 6.00000e+00 6.000000e+00 ▁▁▇▁▁
incwage 0 1 6.442857e+04 6.924175e+04 75.00 3.12000e+04 5.00000e+04 7.50000e+04 1.099999e+06 ▇▁▁▁▁
migsta1 0 1 9.271000e+01 2.031000e+01 9.00 9.90000e+01 9.90000e+01 9.90000e+01 9.900000e+01 ▁▁▁▁▇
whymove 0 1 7.800000e-01 3.020000e+00 0.00 0.00000e+00 0.00000e+00 0.00000e+00 2.000000e+01 ▇▁▁▁▁
migrate1 0 1 1.260000e+00 8.700000e-01 1.00 1.00000e+00 1.00000e+00 1.00000e+00 6.000000e+00 ▇▁▁▁▁
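The summary shows serial running only from 1 to 1768, which suggests the first 1,000 rows follow the file's household order and are not a random subset. If a random subset is preferred, a seeded slice_sample() is one option. This is a minimal sketch on an invented data frame; the seed value and the toy columns are assumptions.

```r
library(dplyr)

# Invented stand-in for the filtered CPS rows.
toy <- tibble(
  serial  = 1:5000,
  incwage = seq(1000, by = 10, length.out = 5000)
)

set.seed(2023)  # arbitrary seed so the sample is reproducible

# Draw 1,000 rows at random, without replacement.
sampled <- toy |>
  slice_sample(n = 1000)

nrow(sampled)
range(sampled$serial)
```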

Data 2

Introduction and data

  • Identify the source of the data.

https://think.cs.vt.edu/corgis/csv/coffee/

https://github.com/rfordatascience/tidytuesday/tree/2e9bd5a67e09b14d01f616b00f7f7e0931515d24/data/2020/2020-07-07

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The original creator of this data set is BuzzFeed data scientist James LeDoux, who collected the data from the Coffee Quality Institute’s review pages in January 2018.

  • Write a brief description of the observations.

Each observation contains information about a coffee’s country of origin, farm owner, farm name, altitude, region, producer, bag weight, processing method, and company name, along with further logistical details about its source. It also includes many measurements of taste and quality, such as flavor, aftertaste, acidity, body, balance, uniformity, moisture, sweetness, and color.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Do certain regions around the world specialize in different tastes of coffee?

Which region produces the highest quality coffee?

Which components of a coffee’s quality and taste are affected by the altitude at which it is grown?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

Coffee is one of the most widely consumed beverages in the world, and with today’s world more interconnected than ever, we have access to different types of coffee from across the globe. Researching geographical differences among coffee sources would inform consumers worldwide about the options that might best suit different tastes. We expect that coffees from the same regions and altitudes will have similar ratings, and that the more geographically separated two regions are, the less similar their coffees will taste.

  • Identify the types of variables in your research question. Categorical? Quantitative?

Categorical variables: region, country, farm name, processing method, company, color

Quantitative variables: all taste/quality scores out of 10 (flavor, aftertaste, acidity, etc.), altitude, bag weight
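The question of which region produces the highest quality coffee could start from a grouped summary of the scores. The sketch below uses invented rows with column names taken from coffee_ratings. Grouping by country_of_origin rather than region is an assumption made for illustration.

```r
library(dplyr)

# Invented rows; only the column names come from coffee_ratings.
toy_coffee <- tibble(
  country_of_origin = c("Ethiopia", "Ethiopia", "Brazil", "Brazil", "Colombia"),
  total_cup_points  = c(88, 86, 82, 80, 84),
  acidity           = c(8.2, 8.0, 7.4, 7.6, 7.8)
)

# Average the overall score and one taste score per country,
# then sort from highest to lowest overall score.
quality_by_country <- toy_coffee |>
  group_by(country_of_origin) |>
  summarise(
    n            = n(),
    mean_points  = mean(total_cup_points),
    mean_acidity = mean(acidity),
    .groups      = "drop"
  ) |>
  arrange(desc(mean_points))

quality_by_country
```

Keeping the count n next to each mean makes it easier to spot groups whose average rests on only a few coffees.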

Glimpse of data

# Load the coffee ratings data and summarise every column
coffee_ratings <- read_csv("data/coffee_ratings.csv")
Rows: 1339 Columns: 43
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coffee_ratings)
Data summary
Name coffee_ratings
Number of rows 1339
Number of columns 43
_______________________
Column type frequency:
character 24
numeric 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
species 0 1.00 7 7 0 2 0
owner 7 0.99 3 50 0 315 0
country_of_origin 1 1.00 4 28 0 36 0
farm_name 359 0.73 1 73 0 571 0
lot_number 1063 0.21 1 71 0 227 0
mill 315 0.76 1 77 0 460 0
ico_number 151 0.89 1 40 0 847 0
company 209 0.84 3 73 0 281 0
altitude 226 0.83 1 41 0 396 0
region 59 0.96 2 76 0 356 0
producer 231 0.83 1 100 0 691 0
bag_weight 0 1.00 1 8 0 56 0
in_country_partner 0 1.00 7 85 0 27 0
harvest_year 47 0.96 3 24 0 46 0
grading_date 0 1.00 13 20 0 567 0
owner_1 7 0.99 3 50 0 319 0
variety 226 0.83 4 21 0 29 0
processing_method 170 0.87 5 25 0 5 0
color 218 0.84 4 12 0 4 0
expiration 0 1.00 13 20 0 566 0
certification_body 0 1.00 7 85 0 26 0
certification_address 0 1.00 40 40 0 32 0
certification_contact 0 1.00 40 40 0 29 0
unit_of_measurement 0 1.00 1 2 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total_cup_points 0 1.00 82.09 3.50 0 81.08 82.50 83.67 90.58 ▁▁▁▁▇
number_of_bags 0 1.00 154.18 129.99 0 14.00 175.00 275.00 1062.00 ▇▇▁▁▁
aroma 0 1.00 7.57 0.38 0 7.42 7.58 7.75 8.75 ▁▁▁▁▇
flavor 0 1.00 7.52 0.40 0 7.33 7.58 7.75 8.83 ▁▁▁▁▇
aftertaste 0 1.00 7.40 0.40 0 7.25 7.42 7.58 8.67 ▁▁▁▁▇
acidity 0 1.00 7.54 0.38 0 7.33 7.58 7.75 8.75 ▁▁▁▁▇
body 0 1.00 7.52 0.37 0 7.33 7.50 7.67 8.58 ▁▁▁▁▇
balance 0 1.00 7.52 0.41 0 7.33 7.50 7.75 8.75 ▁▁▁▁▇
uniformity 0 1.00 9.83 0.55 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
clean_cup 0 1.00 9.84 0.76 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
sweetness 0 1.00 9.86 0.62 0 10.00 10.00 10.00 10.00 ▁▁▁▁▇
cupper_points 0 1.00 7.50 0.47 0 7.25 7.50 7.75 10.00 ▁▁▁▇▁
moisture 0 1.00 0.09 0.05 0 0.09 0.11 0.12 0.28 ▃▇▅▁▁
category_one_defects 0 1.00 0.48 2.55 0 0.00 0.00 0.00 63.00 ▇▁▁▁▁
quakers 1 1.00 0.17 0.83 0 0.00 0.00 0.00 11.00 ▇▁▁▁▁
category_two_defects 0 1.00 3.56 5.31 0 0.00 2.00 4.00 55.00 ▇▁▁▁▁
altitude_low_meters 230 0.83 1750.71 8669.44 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
altitude_high_meters 230 0.83 1799.35 8668.81 1 1100.00 1350.00 1650.00 190164.00 ▇▁▁▁▁
altitude_mean_meters 230 0.83 1775.03 8668.63 1 1100.00 1310.64 1600.00 190164.00 ▇▁▁▁▁
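The altitude columns reach 190164 in the summary above, which cannot be a real elevation in meters, and 230 values are missing. Before relating altitude to taste scores, those rows would likely need to be set aside. The sketch below shows one way to do that on invented rows. The 5000 m cutoff is an assumption chosen for illustration, not a value from the data source.

```r
library(dplyr)

# Invented rows; only the column names come from coffee_ratings.
toy_altitude <- tibble(
  altitude_mean_meters = c(1100, 1600, 190164, NA, 1310.64, 2000),
  flavor               = c(7.3, 7.8, 7.5, 7.6, 7.5, 8.0)
)

max_plausible_m <- 5000  # assumed cutoff for a believable farm altitude

# Drop missing and implausibly high altitudes.
cleaned <- toy_altitude |>
  filter(!is.na(altitude_mean_meters),
         altitude_mean_meters <= max_plausible_m)

# Pearson correlation between altitude and flavor on the cleaned rows.
altitude_flavor_cor <- cor(cleaned$altitude_mean_meters, cleaned$flavor)
altitude_flavor_cor
```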

Data 3

Introduction and data

  • Identify the source of the data.

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QFTAPM

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The original data curator published the data set on March 8, 2023. It was built from the General Transit Feed Specification (GTFS) feeds that more than 80 transit agencies in Canada use to publish their transit data.

  • Write a brief description of the observations.

Each observation describes one segment of a bus route: its segment id, route id, direction, number of traversals, distance, the ids of the two stops it connects, the geometry of the path, and its start and end points. Together, the observations describe the bus routes in Toronto and the Greater Ontario area and the placement of their stops. The stop-related fields (stop 1 id, stop 2 id, start point, and end point) may seem repetitive because they are all locations, but they are needed to determine where every stop sits along its route.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Does the geometry of a bus route affect its distance and the number of stops along it?

What correlations can be observed between the bus path stops and the number of traversals?

Do buses traveling in opposite directions follow the same path and stops, and if not, how do they differ?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

As a major city covering one of the largest land areas, Toronto and the Greater Ontario area require a complete and efficient transit system in order to function. Because buses are among the most relied-upon methods of travel, their routes must cover most points on the map to make travel within the city easy. Studying all of Toronto’s bus routes will let us assess whether the current stop locations, the distances between stops, and the geometry of the routes cover the map efficiently. We expect loop routes to have more stops, spaced more closely, than straight-line routes, because straight-line routes would otherwise take too long to travel from start point to end point. We also expect bus paths that are shorter or have fewer stops to have more traversals, since they most likely serve areas with higher population and need more frequent service over a shorter trip. We hypothesize that bus paths in opposite directions are mostly the same, because serving the same stops in both directions makes more sense for covering the map, while different routes can serve different paths. These questions matter for planning future bus paths or changing existing ones, for example by identifying unnecessarily repeated paths or points that lack transit.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Categorical variables: segment_id, route_id, direction_id, stop_id1, stop_id2, geometry, start_point, end_point

    Quantitative variables: traversals, distance
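To compare routes and directions, the segment-level rows could be rolled up per route and direction. The sketch below is a minimal illustration on invented rows with column names taken from the data set. It assumes each row is one segment between two consecutive stops, so the segment count is only a rough proxy for the number of stops.

```r
library(dplyr)

# Invented segments; only the column names come from the data set.
toy_segments <- tibble(
  route_id     = c("A", "A", "A", "A", "B", "B"),
  direction_id = c(0, 0, 1, 1, 0, 1),
  stop_id1     = c("1", "2", "3", "2", "7", "8"),
  stop_id2     = c("2", "3", "2", "1", "8", "7"),
  traversals   = c(10, 10, 12, 12, 4, 4),
  distance     = c(400, 600, 600, 400, 2500, 2500)
)

# One row per route and direction: segment count, total length,
# and average number of traversals.
route_summary <- toy_segments |>
  group_by(route_id, direction_id) |>
  summarise(
    n_segments      = n(),
    total_distance  = sum(distance),
    mean_traversals = mean(traversals),
    .groups         = "drop"
  )

route_summary
```

Placing the two directions of a route side by side in this table gives a first check on whether opposite directions cover the same length with the same number of segments.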

Glimpse of data

# Load the stop-spacing segments and summarise every column
bus_spacing <- read_csv("data/spacings_with_geometry.csv")
Rows: 1857 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): segment_id, route_id, stop_id1, stop_id2, geometry, start_point, en...
dbl (3): direction_id, traversals, distance

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(bus_spacing)
Data summary
Name bus_spacing
Number of rows 1857
Number of columns 10
_______________________
Column type frequency:
character 7
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
segment_id 0 1 10 13 0 1212 0
route_id 0 1 11 11 0 39 0
stop_id1 0 1 2 5 0 978 0
stop_id2 0 1 2 5 0 980 0
geometry 0 1 99 35887 0 1230 0
start_point 0 1 24 28 0 974 0
end_point 0 1 24 28 0 977 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
direction_id 0 1 0.51 0.50 0.00 0.0 1.00 1.00 1.00 ▇▁▁▁▇
traversals 0 1 14.96 10.92 1.00 6.0 14.00 22.00 75.00 ▇▅▂▁▁
distance 0 1 3716.10 7919.67 121.87 417.5 821.52 2729.64 93664.84 ▇▁▁▁▁