Coffee Ratings

Proposal

library(tidyverse)
library(skimr)
library(haven)

Data 1

Introduction and data

Identify the source of the data.

https://cps.ipums.org/cps-action/variables/group?

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

IPUMS includes data produced by a broad range of agencies, including the Census Bureau, the Bureau of Labor Statistics, the National Science Foundation, the National Center for Health Statistics, the Centers for Disease Control, and the National Aeronautics and Space Administration.

Write a brief description of the observations.

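Each observation is one individual. The extract records the weeks they worked in the last year (wkswork2), their wage income in a specific year (incwage), and their migration status (migrate1, migsta1, whymove), along with survey identifiers and weights.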

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Is migration level related to an individual’s income level in a randomly selected year?

A description of the research topic along with a concise statement of your hypotheses on this topic.

As transportation has become more efficient and convenient, moving is no longer rare. People leave their hometowns and move to new places for many reasons. We wonder whether the decision to move is related to a person’s economic level. In other words, we hope to learn whether, and how, a person’s financial situation is associated with where they choose to move. We also wish to observe the social trend of moving over time, so we track different years.

Identify the types of variables in your research question. Categorical? Quantitative?

Migration level is categorical, income level is quantitative and year is quantitative.
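
One way to make these variable types explicit in R is to convert the migration code to a factor before summarising income within each group. The sketch below uses a small invented tibble (toy) in place of the real extract, and the labels attached to the migrate1 codes are placeholders rather than the official IPUMS labels.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the filtered CPS extract; all values are invented.
toy <- tibble(
  migrate1 = c(1, 1, 1, 4, 4, 5),
  incwage  = c(50000, 31200, 75000, 42000, 60000, 90000)
)

# Treat the migration code as categorical. The labels are placeholders;
# the real category names should come from the IPUMS codebook.
toy <- toy |>
  mutate(migrate1 = factor(migrate1,
                           levels = c(1, 4, 5),
                           labels = c("code 1", "code 4", "code 5")))

# Income stays quantitative, so it can be summarised within each category.
income_by_migration <- toy |>
  group_by(migrate1) |>
  summarise(n = n(), median_income = median(incwage), .groups = "drop")

print(income_by_migration)
```

With the real data, haven::as_factor() could likely be used instead of factor() to carry over the value labels stored in the .dta file, assuming the extract includes them.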

Glimpse of data

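# Read the IPUMS CPS extract (Stata format)
micro <- read_dta("data/cps_00008.dta")

# Keep respondents in the top weeks-worked interval (wkswork2 == 6)
# with positive wage income below the cutoff, then take the first 1,000 rows
filtered <- micro |>
  filter(wkswork2 == 6, incwage > 0, incwage < 9999998) |>
  slice(1:1000)

skimr::skim(filtered)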

Data summary
Name	filtered
Number of rows	1000
Number of columns	14
_______________________
Column type frequency:
numeric	14
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2.020000e+03	0.000000e+00	2020.00	2.02000e+03	2.02000e+03	2.02000e+03	2.020000e+03	▁▁▇▁▁
serial	1	9.930600e+02	4.937000e+02	1.00	5.85750e+02	1.06150e+03	1.44200e+03	1.768000e+03	▃▆▅▆▇
month	1	3.000000e+00	0.000000e+00	3.00	3.00000e+00	3.00000e+00	3.00000e+00	3.000000e+00	▁▁▇▁▁
cpsid	1	1.536704e+13	8.616170e+12	0.00	2.01812e+13	2.01903e+13	2.02001e+13	2.020030e+13	▂▁▁▁▇
asecflag	1	1.000000e+00	0.000000e+00	1.00	1.00000e+00	1.00000e+00	1.00000e+00	1.000000e+00	▁▁▇▁▁
asecwth	1	9.443200e+02	4.036900e+02	299.54	6.60760e+02	8.83460e+02	1.11061e+03	2.670760e+03	▆▇▂▁▁
pernum	1	1.690000e+00	9.200000e-01	1.00	1.00000e+00	1.00000e+00	2.00000e+00	8.000000e+00	▇▁▁▁▁
cpsidp	1	1.536704e+13	8.616170e+12	0.00	2.01812e+13	2.01903e+13	2.02001e+13	2.020030e+13	▂▁▁▁▇
asecwt	1	9.760400e+02	4.280300e+02	299.54	6.69310e+02	9.12180e+02	1.16113e+03	3.699200e+03	▇▅▁▁▁
wkswork2	1	6.000000e+00	0.000000e+00	6.00	6.00000e+00	6.00000e+00	6.00000e+00	6.000000e+00	▁▁▇▁▁
incwage	1	6.442857e+04	6.924175e+04	75.00	3.12000e+04	5.00000e+04	7.50000e+04	1.099999e+06	▇▁▁▁▁
migsta1	1	9.271000e+01	2.031000e+01	9.00	9.90000e+01	9.90000e+01	9.90000e+01	9.900000e+01	▁▁▁▁▇
whymove	1	7.800000e-01	3.020000e+00	0.00	0.00000e+00	0.00000e+00	0.00000e+00	2.000000e+01	▇▁▁▁▁
migrate1	1	1.260000e+00	8.700000e-01	1.00	1.00000e+00	1.00000e+00	1.00000e+00	6.000000e+00	▇▁▁▁▁
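
Because the extract carries a person-level survey weight (asecwt), group comparisons of incwage would normally be weighted. The following is a minimal sketch on invented rows, assuming asecwt is the appropriate weight for this analysis; the tibble toy_cps and its values are hypothetical.

```r
library(dplyr)
library(tibble)

# Invented rows that mimic three columns of the extract.
toy_cps <- tibble(
  migrate1 = c(1, 1, 4, 4),
  incwage  = c(40000, 60000, 30000, 50000),
  asecwt   = c(1000, 3000, 500, 500)
)

# Compare the plain mean with the survey-weighted mean in each group.
weighted_income <- toy_cps |>
  group_by(migrate1) |>
  summarise(
    unweighted_mean = mean(incwage),
    weighted_mean   = weighted.mean(incwage, w = asecwt),
    .groups = "drop"
  )

print(weighted_income)
```

In the first group the weighted mean is pulled toward the respondent with the larger weight, which is the effect the weights are meant to have on population-level estimates.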

Data 2

Introduction and data

Identify the source of the data.

https://think.cs.vt.edu/corgis/csv/coffee/

https://github.com/rfordatascience/tidytuesday/tree/2e9bd5a67e09b14d01f616b00f7f7e0931515d24/data/2020/2020-07-07

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The original creator of this data set is BuzzFeed data scientist James LeDoux, who collected the data from the Coffee Quality Institute’s review pages in January 2018.

Write a brief description of the observations.

Each observation contains information about a coffee’s country of origin, owner (of the farm), farm name, altitude, region, producer, bag weight, processing method, company (name), and other logistical details about its source. It also contains many measurements of taste and quality, such as flavor, aftertaste, acidity, body, balance, uniformity, moisture, sweetness, and color.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Do certain regions around the world specialize in different tastes of coffee?

Which region produces the highest quality coffee?

What components of a coffee’s quality and taste are affected by the altitude at which it is grown?

A description of the research topic along with a concise statement of your hypotheses on this topic.

Coffee is one of the most widely consumed beverages in the world, and with the world more interconnected than ever, we have access to different types of coffee from all across the globe. Researching the geographical differences among coffee sources would inform consumers worldwide about the options that might best suit their tastes. We expect that coffees from the same regions and altitudes have similar ratings, and that the more geographically separated two regions are, the less similar their coffees taste.

Identify the types of variables in your research question. Categorical? Quantitative?

Categorical variables: region, country, farm name, processing method, company, color

Quantitative variables: all taste/quality scores out of 10 (flavor, aftertaste, acidity, etc.), altitude, bag weight
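
The skim output for this data set lists bag_weight as a character column, so it would need parsing before it can be treated as quantitative. The sketch below assumes the values look like "60 kg" or "2 lbs"; the example strings and the tibble toy_bags are invented for illustration, not copied from the file.

```r
library(dplyr)
library(tibble)
library(readr)

# Invented examples of how bag_weight might be recorded.
toy_bags <- tibble(bag_weight = c("60 kg", "69 kg", "2 lbs", "1 kg"))

toy_bags <- toy_bags |>
  mutate(
    # Pull the leading number out of the string.
    weight_value = parse_number(bag_weight),
    # Assume any value not marked "lbs" is in kilograms.
    weight_unit  = if_else(grepl("lbs", bag_weight, fixed = TRUE), "lbs", "kg"),
    # Convert pounds to kilograms so all rows share one unit.
    weight_kg    = if_else(weight_unit == "lbs",
                           weight_value * 0.45359237,
                           weight_value)
  )

print(toy_bags)
```

The same idea would apply to the raw altitude column, although the data set already provides numeric altitude_low_meters, altitude_high_meters, and altitude_mean_meters columns.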

Glimpse of data

# Load the coffee ratings data
coffee_ratings <- read_csv("data/coffee_ratings.csv")

Rows: 1339 Columns: 43
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (24): species, owner, country_of_origin, farm_name, lot_number, mill, ic...
dbl (19): total_cup_points, number_of_bags, aroma, flavor, aftertaste, acidi...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(coffee_ratings)

Data summary
Name	coffee_ratings
Number of rows	1339
Number of columns	43
_______________________
Column type frequency:
character	24
numeric	19
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
species	0	1.00	7	7	2
owner	7	0.99	3	50	315
country_of_origin	1	1.00	4	28	36
farm_name	359	0.73	1	73	571
lot_number	1063	0.21	1	71	227
mill	315	0.76	1	77	460
ico_number	151	0.89	1	40	847
company	209	0.84	3	73	281
altitude	226	0.83	1	41	396
region	59	0.96	2	76	356
producer	231	0.83	1	100	691
bag_weight	0	1.00	1	8	56
in_country_partner	0	1.00	7	85	27
harvest_year	47	0.96	3	24	46
grading_date	0	1.00	13	20	567
owner_1	7	0.99	3	50	319
variety	226	0.83	4	21	29
processing_method	170	0.87	5	25	5
color	218	0.84	4	12	4
expiration	0	1.00	13	20	566
certification_body	0	1.00	7	85	26
certification_address	0	1.00	40	40	32
certification_contact	0	1.00	40	40	29
unit_of_measurement	0	1.00	1	2	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total_cup_points	0	1.00	82.09	3.50	0	81.08	82.50	83.67	90.58	▁▁▁▁▇
number_of_bags	0	1.00	154.18	129.99	0	14.00	175.00	275.00	1062.00	▇▇▁▁▁
aroma	0	1.00	7.57	0.38	0	7.42	7.58	7.75	8.75	▁▁▁▁▇
flavor	0	1.00	7.52	0.40	0	7.33	7.58	7.75	8.83	▁▁▁▁▇
aftertaste	0	1.00	7.40	0.40	0	7.25	7.42	7.58	8.67	▁▁▁▁▇
acidity	0	1.00	7.54	0.38	0	7.33	7.58	7.75	8.75	▁▁▁▁▇
body	0	1.00	7.52	0.37	0	7.33	7.50	7.67	8.58	▁▁▁▁▇
balance	0	1.00	7.52	0.41	0	7.33	7.50	7.75	8.75	▁▁▁▁▇
uniformity	0	1.00	9.83	0.55	0	10.00	10.00	10.00	10.00	▁▁▁▁▇
clean_cup	0	1.00	9.84	0.76	0	10.00	10.00	10.00	10.00	▁▁▁▁▇
sweetness	0	1.00	9.86	0.62	0	10.00	10.00	10.00	10.00	▁▁▁▁▇
cupper_points	0	1.00	7.50	0.47	0	7.25	7.50	7.75	10.00	▁▁▁▇▁
moisture	0	1.00	0.09	0.05	0	0.09	0.11	0.12	0.28	▃▇▅▁▁
category_one_defects	0	1.00	0.48	2.55	0	0.00	0.00	0.00	63.00	▇▁▁▁▁
quakers	1	1.00	0.17	0.83	0	0.00	0.00	0.00	11.00	▇▁▁▁▁
category_two_defects	0	1.00	3.56	5.31	0	0.00	2.00	4.00	55.00	▇▁▁▁▁
altitude_low_meters	230	0.83	1750.71	8669.44	1	1100.00	1310.64	1600.00	190164.00	▇▁▁▁▁
altitude_high_meters	230	0.83	1799.35	8668.81	1	1100.00	1350.00	1650.00	190164.00	▇▁▁▁▁
altitude_mean_meters	230	0.83	1775.03	8668.63	1	1100.00	1310.64	1600.00	190164.00	▇▁▁▁▁
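
The summary above shows a minimum of 0 for every score and a maximum altitude of 190,164 meters, which look more like data-entry problems than real coffees. Below is a sketch of one possible cleaning step on invented rows; the cutoffs (scores above 0, altitude at most 5,000 meters) are our assumptions, not rules from the data source, and toy_coffee is hypothetical.

```r
library(dplyr)
library(tibble)

# Invented rows using three of the real column names.
toy_coffee <- tibble(
  country_of_origin    = c("Ethiopia", "Ethiopia", "Brazil", "Brazil", "Guatemala"),
  total_cup_points     = c(89.0, 88.0, 82.5, 0, 83.0),
  altitude_mean_meters = c(2075, 1900, 1100, 1000, 190164)
)

# Drop rows with a zero score or an implausible altitude.
# Note that filter() also drops rows where altitude is missing.
cleaned <- toy_coffee |>
  filter(total_cup_points > 0, altitude_mean_meters <= 5000)

# Average score and altitude per country, best-scoring first.
by_country <- cleaned |>
  group_by(country_of_origin) |>
  summarise(
    n             = n(),
    mean_points   = mean(total_cup_points),
    mean_altitude = mean(altitude_mean_meters),
    .groups = "drop"
  ) |>
  arrange(desc(mean_points))

print(by_country)
```

Since roughly 17% of the altitude values are missing in the real data, it may be preferable to keep those rows for questions that do not involve altitude.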

Data 3

Introduction and data

Identify the source of the data.

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QFTAPM

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The original data curator published the data on March 8, 2023. It was derived from General Transit Feed Specification (GTFS) feeds, the format in which over 80 transit agencies in Canada publish their transit data.

Write a brief description of the observations.

Each observation contains the bus’s segment id, route id, direction, number of traversals, distance, the id of each stop, the geometry of the path, the start point, and the end point. Together, the observations describe each bus in Toronto - Greater Ontario, its route, and the placement of its stops. The several stop-related fields in each observation (stop 1 id, stop 2 id, start point, and end point) may seem repetitive because they are all locations, but they are critical for determining the placement of every stop along the corresponding route.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Does the geometry of a bus route affect its distance and the number of stops along it?

What correlations can be observed between the bus path stops and the number of traversals?

Do buses traveling in opposite directions follow the same path and stops, and if not, how do they differ?

A description of the research topic along with a concise statement of your hypotheses on this topic.

As a major city covering one of the largest land areas, Toronto and the Greater Ontario area require a complete and efficient transit system in order to function. Because buses are among the most relied-upon methods of travel, bus routes in Toronto must cover most points on the map to allow for ease of travel in the city. Studying all of Toronto’s bus routes will allow us to observe whether the current locations of bus stops, the distances between stops, and the geometry of the routes cover the map efficiently. We expect loop routes to have more stops, spaced more closely, than straight-line routes, since a straight-line route with that many stops would take too long to travel from start to end. We also expect bus paths that are shorter or have fewer stops to have more traversals, since they most likely serve areas with higher population that require more frequent service. We hypothesize that buses traveling in opposite directions follow the same path, since serving the same stops in both directions makes the most sense for covering the map, and different routes can be used for different paths. These questions matter for planning future bus paths or changing existing ones, for example by identifying unnecessarily repeated paths or areas that lack transit.

Identify the types of variables in your research question. Categorical? Quantitative?

Categorical variables: segment_id, route_id, direction_id, stop_id1, stop_id2, geometry, start_point, end_point

Quantitative variables: traversals, distance

Glimpse of data

# Load the bus stop spacing data
bus_spacing <- read_csv("data/spacings_with_geometry.csv")

Rows: 1857 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): segment_id, route_id, stop_id1, stop_id2, geometry, start_point, en...
dbl (3): direction_id, traversals, distance

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(bus_spacing)

Data summary
Name	bus_spacing
Number of rows	1857
Number of columns	10
_______________________
Column type frequency:
character	7
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
segment_id	1	10	13	1212
route_id	1	11	11	39
stop_id1	1	2	5	978
stop_id2	1	2	5	980
geometry	1	99	35887	1230
start_point	1	24	28	974
end_point	1	24	28	977

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
direction_id	1	0.51	0.50	0.00	0.0	1.00	1.00	1.00	▇▁▁▁▇
traversals	1	14.96	10.92	1.00	6.0	14.00	22.00	75.00	▇▅▂▁▁
distance	1	3716.10	7919.67	121.87	417.5	821.52	2729.64	93664.84	▇▁▁▁▁
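
To relate these columns to the research questions, the rows can be rolled up to one row per route and direction, and the stop sets of the two directions can be compared. The sketch below uses invented rows with the same column names as bus_spacing; treating each row as one stop-to-stop segment is our reading of the data, and toy_bus is hypothetical.

```r
library(dplyr)
library(tibble)

# Invented segments for two routes, each with two directions.
toy_bus <- tibble(
  route_id     = c("A", "A", "A", "A", "B", "B"),
  direction_id = c(0, 0, 1, 1, 0, 1),
  stop_id1     = c("s1", "s2", "s3", "s2", "s7", "s8"),
  stop_id2     = c("s2", "s3", "s2", "s1", "s8", "s9"),
  traversals   = c(20, 20, 18, 18, 5, 5),
  distance     = c(400, 350, 350, 400, 2500, 2500)
)

# Segment count, total distance, and mean traversals per route and direction.
route_summary <- toy_bus |>
  group_by(route_id, direction_id) |>
  summarise(
    n_segments      = n(),
    total_distance  = sum(distance),
    mean_traversals = mean(traversals),
    .groups = "drop"
  )

# Collect the set of stops served in each direction, then check
# whether both directions of a route serve the same set.
same_stops <- toy_bus |>
  group_by(route_id, direction_id) |>
  summarise(stops = list(sort(unique(c(stop_id1, stop_id2)))), .groups = "drop") |>
  group_by(route_id) |>
  summarise(same_stops = length(unique(stops)) == 1, .groups = "drop")

print(route_summary)
print(same_stops)
```

In this toy example route A serves the same stops in both directions while route B does not, which is the kind of difference the third research question asks about.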