library(tidyverse)
library(skimr)
library(rvest)
library(lubridate)
library(robotstxt)
library(dplyr)
Project title
Proposal
Data 1
Introduction and data
Identify the source of the data.
https://www.niche.com/colleges/search/best-colleges/
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data was collected by us via web scraping. We consider this scraping ethical because, first, all of the information we scraped is publicly available on the website; second, we scraped only the information we need, not a large amount; and finally, we are using it only for this class project and for no other purpose.
The ranking considers many different factors, each with its own sources. The factors and their sources are listed below:
academic grade (U.S. Dept of Education, Niche users, The Center for Measuring University Performance)
value grade (U.S. Dept of Education, U.S. Census, Niche users)
professors grade (U.S. Dept of Education, Niche users, The Center for Measuring University Performance)
campus grade (U.S. Dept of Education, Niche users, FBI Uniform Crime Report, CDC)
diversity grade (U.S. Dept of Education, Niche users)
student life grade (U.S. Dept of Education, Niche users, U.S. Census, Wikipedia, Bureau of Labor Statistics, Tax Foundation, FBI Uniform Crime Report, CDC, NOAA)
student surveys on overall experience (Niche users)
local area grade (U.S. Census, Bureau of Labor Statistics, Tax Foundation, FBI Uniform Crime Report, CDC, Open Street Map, U.S. National Park Service, EPA)
safety grade (U.S. Census, Bureau of Labor Statistics, Tax Foundation, FBI Uniform Crime Report, CDC, Niche users)
Write a brief description of the observations.
There are 500 observations in the dataset, one per college; these are the top 500 universities in the United States according to Niche. There are eight variables for each college: name, ranking, location, number of reviews, acceptance rate, type of college, SAT score range, and net price.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What is the relationship between a college’s ranking and its acceptance rate in different regions of the United States?
Importance
This analysis is important because it can give students a sense of how hard it generally is to get into colleges at a given rank; for example, what acceptance rate to expect from colleges ranked 30-50. It can also show which regions higher-ranked colleges tend to be in, and how acceptance rates vary by region.
A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic is about the 500 best universities in the United States according to Niche. We will explore their ranking, acceptance rate, and regions they are in.
Hypothesis: Colleges that rank higher will have lower acceptance rates. Rankings and acceptance rates will vary by region, with more highly ranked colleges concentrated in the Northeast and on the West Coast.
Identify the types of variables in your research question. Categorical? Quantitative?
Ranking: quantitative
Region: categorical (this will need to be derived from the current data: we will separate the city and the state, then categorize colleges into regions by state)
Acceptance rate: quantitative
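The region conversion described above could be sketched as follows. This is a hypothetical helper under our own assumptions: the `state_to_region()` function and the Census-style four-region grouping are not part of the scraped data, and the final grouping may differ.

```r
library(tidyverse)

# Hypothetical sketch: map two-letter state abbreviations to Census regions.
# The region definitions here are an assumption, not the final scheme.
state_to_region <- function(state) {
  case_when(
    state %in% c("CT", "ME", "MA", "NH", "NJ", "NY", "PA", "RI", "VT") ~ "Northeast",
    state %in% c("IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND",
                 "OH", "SD", "WI") ~ "Midwest",
    state %in% c("AL", "AR", "DE", "DC", "FL", "GA", "KY", "LA", "MD", "MS",
                 "NC", "OK", "SC", "TN", "TX", "VA", "WV") ~ "South",
    state %in% c("AK", "AZ", "CA", "CO", "HI", "ID", "MT", "NM", "NV",
                 "OR", "UT", "WA", "WY") ~ "West",
    TRUE ~ NA_character_
  )
}
```

Assuming the scraped location column holds strings like "Ithaca, NY", the conversion would then be something like `separate(colleges, region, into = c("city", "state"), sep = ", ") |> mutate(region = state_to_region(state))`.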
Glimpse of data
# check that we can scrape data
# paths_allowed("https://www.niche.com/colleges/search/best-colleges/")

# scrape_review <- function(url) {
#   # pause for a couple of seconds to prevent rapid HTTP requests
#   Sys.sleep(2)
#
#   # read the page
#   page <- read_html(url)
#
#   # extract desired components
#   rank <- html_elements(x = page, css = ".search-result-badge") |>
#     html_text2()
#
#   school_name <- html_elements(x = page, css = ".nss-griot4 .nss-ihhgee") |>
#     html_text2()
#
#   sat_range <- html_elements(x = page, css = ".nss-griot4 .search-result-fact-list__item:nth-child(4) .nss-1l1273x") |>
#     html_text2()
#
#   net_price <- html_elements(x = page, css = ".nss-griot4 .search-result-fact-list__item:nth-child(3) .nss-1l1273x") |>
#     html_text2()
#
#   accept_rate <- html_elements(x = page, css = ".nss-griot4 .search-result-fact-list__item:nth-child(2) .nss-1l1273x") |>
#     html_text2()
#
#   region <- html_elements(x = page, css = ".nss-griot4 .search-result-tagline__item:nth-child(2) .nss-1l1273x") |>
#     html_text2()
#
#   college_type <- html_elements(x = page, css = ".nss-griot4 .search-result-tagline__item:nth-child(3) .nss-1l1273x") |>
#     html_text2()
#
#   review_num <- html_elements(x = page, css = ".nss-griot4 .review__stars__number__reviews") |>
#     html_text2()
#
#   # return a tibble with the extracted data
#   tibble(
#     rank = rank,
#     school_name = school_name,
#     sat_range = sat_range,
#     net_price = net_price,
#     accept_rate = accept_rate,
#     region = region,
#     college_type = college_type,
#     review_num = review_num
#   )
# }
#
# # create a vector of URLs (local copies of the search result pages)
# page_nums <- 1:20
# cr_urls <- str_glue("docs/college_data_web/{page_nums}.html")
#
# # map the scraping function over the URLs and combine the results,
# # trimming the trailing character from the region string
# cr_reviews <- map(.x = cr_urls, .f = scrape_review) |>
#   list_rbind() |>
#   mutate(region = substr(region, 1, nchar(region) - 1))
#
# # write data
# write_csv(x = cr_reviews, file = "data/school-data.csv")
#
# skimr::skim(cr_reviews)
Data 2
Introduction and data
Identify the source of the data.
https://think.cs.vt.edu/corgis/csv/coffee/
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
It was collected from a website provided on the course website as part of the "resources for datasets." According to the website, the data curator was Sam Donald, and the data was collected on 10/28/2022.
Write a brief description of the observations.
Each observation represents a type of coffee; for each observation, the dataset includes variables such as location, coffee species, and year.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
What is the relationship between coffee quality and its country of origin?
How does coffee’s sweetness/acidity/moisture/flavor vary based on its country of origin?
A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic focuses on Arabica and Robusta beans from many countries, professionally rated on categories such as acidity, sweetness, fragrance, and balance, with a total score on a 0-100 scale.
Hypothesis: Most coffee-producing countries are developing countries, and the flavor of each coffee depends on the climate of the region and the coffee species. The ratings and opinions on each coffee type are highly subjective, based on each individual's preferences.
Identify the types of variables in your research question. Categorical? Quantitative?
The variables on countries and coffee type are categorical, while the ratings (numerical values) are quantitative data.
Glimpse of data
coffee <- read_csv("data/coffee.csv")
Rows: 989 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(coffee)
| Name | coffee |
| Number of rows | 989 |
| Number of columns | 23 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Location.Country | 0 | 1 | 4 | 28 | 0 | 32 | 0 |
| Location.Region | 0 | 1 | 3 | 76 | 0 | 278 | 0 |
| Data.Owner | 0 | 1 | 3 | 50 | 0 | 263 | 0 |
| Data.Type.Species | 0 | 1 | 7 | 7 | 0 | 2 | 0 |
| Data.Type.Variety | 0 | 1 | 3 | 21 | 0 | 28 | 0 |
| Data.Type.Processing method | 0 | 1 | 3 | 25 | 0 | 6 | 0 |
| Data.Color | 0 | 1 | 4 | 12 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Location.Altitude.Min | 0 | 1 | 1640.08 | 9192.52 | 0 | 905.00 | 1300.00 | 1550.00 | 190164.00 | ▇▁▁▁▁ |
| Location.Altitude.Max | 0 | 1 | 1675.93 | 9191.96 | 0 | 950.00 | 1310.00 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
| Location.Altitude.Average | 0 | 1 | 1658.00 | 9192.06 | 0 | 950.00 | 1300.00 | 1600.00 | 190164.00 | ▇▁▁▁▁ |
| Year | 0 | 1 | 2013.55 | 1.66 | 2010 | 2012.00 | 2013.00 | 2015.00 | 2018.00 | ▁▇▃▃▁ |
| Data.Production.Number of bags | 0 | 1 | 151.76 | 125.67 | 1 | 15.00 | 170.00 | 275.00 | 600.00 | ▇▁▇▁▁ |
| Data.Production.Bag weight | 0 | 1 | 210.49 | 1666.71 | 0 | 1.00 | 60.00 | 69.00 | 19200.00 | ▇▁▁▁▁ |
| Data.Scores.Aroma | 0 | 1 | 7.57 | 0.40 | 0 | 7.42 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
| Data.Scores.Flavor | 0 | 1 | 7.52 | 0.42 | 0 | 7.33 | 7.50 | 7.75 | 8.83 | ▁▁▁▁▇ |
| Data.Scores.Aftertaste | 0 | 1 | 7.39 | 0.43 | 0 | 7.25 | 7.42 | 7.58 | 8.67 | ▁▁▁▁▇ |
| Data.Scores.Acidity | 0 | 1 | 7.54 | 0.40 | 0 | 7.33 | 7.58 | 7.75 | 8.75 | ▁▁▁▁▇ |
| Data.Scores.Body | 0 | 1 | 7.51 | 0.39 | 0 | 7.33 | 7.50 | 7.67 | 8.50 | ▁▁▁▁▇ |
| Data.Scores.Balance | 0 | 1 | 7.50 | 0.43 | 0 | 7.33 | 7.50 | 7.75 | 8.58 | ▁▁▁▁▇ |
| Data.Scores.Uniformity | 0 | 1 | 9.82 | 0.59 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
| Data.Scores.Sweetness | 0 | 1 | 9.83 | 0.69 | 0 | 10.00 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
| Data.Scores.Moisture | 0 | 1 | 0.09 | 0.04 | 0 | 0.10 | 0.11 | 0.12 | 0.28 | ▃▇▆▁▁ |
| Data.Scores.Total | 0 | 1 | 81.97 | 3.86 | 0 | 81.08 | 82.50 | 83.58 | 90.58 | ▁▁▁▁▇ |
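A first cut at the research question could summarize the professional scores by country of origin. This is a minimal sketch using the column names shown in the skim output above; the choice of which scores to average is ours, not prescribed by the dataset.

```r
library(tidyverse)

# Read the coffee ratings and compute per-country averages of a few of
# the professional scores, sorted by average total score.
coffee <- read_csv("data/coffee.csv")

coffee |>
  group_by(Location.Country) |>
  summarize(
    n              = n(),
    mean_flavor    = mean(Data.Scores.Flavor),
    mean_acidity   = mean(Data.Scores.Acidity),
    mean_sweetness = mean(Data.Scores.Sweetness),
    mean_total     = mean(Data.Scores.Total)
  ) |>
  arrange(desc(mean_total))
```

Countries with very few observations (small `n`) will need care when comparing averages, which this sketch does not yet handle.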
Data 3
Introduction and data
Identify the source of the data.
FiveThirtyEight. The data used for the article, “The Worst Tweeter In Politics Isn’t Trump”
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data was originally collected by Dhrumil Mehta, a database journalist. The tweets of all senators were gathered (perhaps through web scraping, though it is unclear) on October 19 and 20, 2017.
Write a brief description of the observations.
Each observation is one specific tweet by a senator that includes information on the date and time it was tweeted, the number of replies, retweets, favorites, the party and state affiliation of the senator who tweeted it, etc.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How do the time of year and time of day affect engagement with a tweet? What topic areas get the highest engagement? How do the party and state of the senator affect which topic areas get the most engagement?
A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic focuses on understanding how people perceive different topics of tweets by senators over time. Our hypothesis is that engagement will increase in the fall of election years, given that elections are when people pay the most attention to the news. In addition, the topic area that is most popular will vary by state and party. For example, social issues may be more popular with Democrats, whereas gun rights may be more popular with Republicans.
Identify the types of variables in your research question. Categorical? Quantitative?
There are both categorical and quantitative variables. For instance, the number of replies, retweets, and favorites would be considered quantitative variables. Depending on what we hope to visualize, the date and time information can be considered a categorical or quantitative variable. We will need to create another variable to classify the tweets by topic area based on the content of the tweet which will be a categorical variable.
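The derived variables described above could be sketched as follows. The `classify_topic()` helper, its keyword lists, and the timestamp format are all our own assumptions for illustration, not the final classification scheme.

```r
library(tidyverse)
library(lubridate)

# Hypothetical sketch: tag a tweet's topic area by keyword match.
# The topics and keywords here are placeholders, not the final scheme.
classify_topic <- function(text) {
  text <- str_to_lower(text)
  case_when(
    str_detect(text, "gun|firearm|second amendment") ~ "gun policy",
    str_detect(text, "health ?care|medicare|medicaid") ~ "health care",
    str_detect(text, "immigra") ~ "immigration",
    TRUE ~ "other"
  )
}
```

Assuming `created_at` parses as a month-day-year timestamp, the derived columns would then be something like `tweets |> mutate(created = mdy_hm(created_at), month = month(created, label = TRUE), hour = hour(created), topic = classify_topic(text))`, after which engagement can be summarized by `month`, `hour`, or `topic`.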
Glimpse of data
tweets <- read.csv("data/senators.csv")
skimr::skim(tweets)
| Name | tweets |
| Number of rows | 288615 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| created_at | 0 | 1 | 11 | 14 | 0 | 228527 | 0 |
| text | 0 | 1 | 1 | 158 | 0 | 286959 | 0 |
| url | 0 | 1 | 46 | 61 | 0 | 288615 | 0 |
| user | 0 | 1 | 8 | 15 | 0 | 100 | 0 |
| bioguide_id | 0 | 1 | 7 | 7 | 0 | 100 | 0 |
| party | 0 | 1 | 1 | 1 | 0 | 3 | 0 |
| state | 0 | 1 | 2 | 2 | 0 | 50 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| replies | 0 | 1 | 41.90 | 470.37 | 0 | 1 | 3 | 13 | 66872 | ▇▁▁▁▁ |
| retweets | 0 | 1 | 248.97 | 7793.34 | 0 | 3 | 8 | 32 | 3644423 | ▇▁▁▁▁ |
| favorites | 0 | 1 | 586.44 | 11155.10 | 0 | 3 | 13 | 72 | 2108865 | ▇▁▁▁▁ |