Project title

Proposal

library(tidyverse)
library(skimr)
library(rvest)
library(lubridate)
library(robotstxt)
library(readxl)

Data 1

Introduction and data

Identify the source of the data.

The source of this data is the College Scorecard of the U.S. Department of Education.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

Every year, the U.S. Department of Education collects aggregate data, including various qualitative and quantitative variables, on every institution of higher education. The Department obtains these data through federal reporting from the institutions, data on federal financial aid, and tax information. Much of the data also comes from reports to IPEDS, the Integrated Postsecondary Education Data System. The data set college_information contains the values of these variables for each institution of higher education in the U.S. for the 2020-21 school year, the most recent data collected.
Write a brief description of the observations.

Each row in the data set represents one institution of higher education in California. The data set also includes the city each institution is in, its zip code, the larger collection of schools that the university is a part of, the university's website, and various identification codes for the university. The variables in the data set include the SAT and ACT test scores of students admitted to the university, the university's admission rate, its average cost of attendance, and more.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

What is the relationship between the average SAT score of admitted Californian students and the average amount of student debt students have upon graduation for both private and public institutions of higher learning for the 2020-21 academic year?
A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic explores whether there is a relationship between the average SAT score of California students admitted to a university and the cumulative student debt these students hold when they leave it, whether by graduating or by withdrawing. It also examines how this relationship differs by the school's governance structure (i.e., public, private nonprofit, or private for-profit). We hypothesize that institutions that admit students with higher SAT scores will have lower student debt.
Identify the types of variables in your research question. Categorical? Quantitative?

The variables in our research question are both quantitative and categorical. The categorical variable is the institution's governance structure: public, private nonprofit, or private for-profit. The quantitative variables are the average SAT score of students admitted to the university and the cumulative median student debt, meaning the median loan debt accumulated at the institution by all federal student loan borrowers who separate (either graduate or withdraw) in a given fiscal year.
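One way to make this question concrete is to compute the correlation between average SAT score and median debt within each governance type. The sketch below is illustrative only: the column names `control`, `sat_avg`, and `debt_mdn`, and every value in the toy tibble, are assumptions invented for the example and are not taken from the College Scorecard file.

```r
library(dplyr)

# Toy data: all names and values are invented for illustration only.
toy_scorecard <- tibble::tibble(
  control  = rep(c("Public", "Private nonprofit", "Private for-profit"), each = 4),
  sat_avg  = c(1050, 1150, 1250, 1350,
               1100, 1200, 1300, 1400,
               950, 1000, 1050, 1100),
  debt_mdn = c(16000, 15000, 14500, 13000,
               24000, 22000, 21000, 19000,
               27000, 26500, 26000, 25000)
)

# Correlation between average SAT score and median debt,
# computed separately for each governance type.
debt_by_control <- toy_scorecard |>
  group_by(control) |>
  summarize(
    n_schools    = n(),
    sat_debt_cor = cor(sat_avg, debt_mdn),
    .groups      = "drop"
  )

debt_by_control
```

A negative `sat_debt_cor` within a group would be consistent with the hypothesis for that governance type; the toy values were chosen to show the mechanics, not to suggest a result.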

Glimpse of data

# add code here
sat_scorecard <- read_csv("data/school_scores.xlsx")

Multiple files in zip: reading '[Content_Types].xml'
Rows: 1 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): <?xml version="1.0" encoding="UTF-8" standalone="yes"?>

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(sat_scorecard)

Data summary
Name	sat_scorecard
Number of rows	1
Number of columns	1
_______________________
Column type frequency:
character	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
	0	1	1111	1111	0	1	0
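The summary above reports 1 row and 1 column because `read_csv()` was pointed at an `.xlsx` file: an Excel workbook is a zip archive, so readr parsed the archive's `[Content_Types].xml` rather than the worksheet. A minimal sketch of a fix follows, assuming the workbook's first sheet holds the data. The helper `read_scorecard()` is a hypothetical name introduced here; it chooses the reader from the file extension.

```r
library(readxl)
library(readr)

# Hypothetical helper: choose the reader from the file extension so that
# Excel workbooks go to readxl and delimited text files go to readr.
read_scorecard <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (ext %in% c("xlsx", "xls")) {
    read_excel(path)                       # reads the first sheet by default
  } else {
    read_csv(path, show_col_types = FALSE)
  }
}

# Path taken from the chunk above; only read it if the file is present.
path <- "data/school_scores.xlsx"
if (file.exists(path)) {
  sat_scorecard <- read_scorecard(path)
  dplyr::glimpse(sat_scorecard)
}
```

If the data sit on a different sheet, `read_excel()` accepts a `sheet` argument.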

Data 2

Introduction and data

Identify the source of the data.

The source of the data is the Harvard Dataverse.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

These data sets were collected in two ways. The first was crowdsourcing: human participants reported which emotions they believed a displayed face showed. The second was showing the same faces to different Emotion Analysis Services (EAS) to see which emotion each service predicted for a given face.
Write a brief description of the observations.

Each observation represents one displayed image and the corresponding response given by either a real person or an EAS. The variables include the emotions offered as response options and a demographic variable that indicates the race and gender of the person in the image. The EAS data set records the demographic of the picture along with the emotion options, to each of which the service assigns a probability. The real-person data set again records the demographic of the displayed person, alongside the respondent's 1st- and 2nd-choice emotions and the emotion they would not use to describe the face.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?

Is there an inherent racial bias in how image-tagging emotion analysis services read emotions, compared with how a real person reads them? Does this corroborate previous research on the bias by which Black individuals are erroneously perceived as angry or hostile from their expressions? If so, what implications does this have for visual cognitive services?

The research topic is algorithmic discrimination in image-tagging emotion analysis services. We hypothesize that the services will tag Black individuals as angry significantly more often than human respondents do. The variables in the research are mostly categorical (demographic group and emotion label); in the EAS data, each emotion also carries a quantitative probability.

Glimpse of data

# add code here
eas_data <- read_excel("data/MICROSOFT_DATASET.xlsx")
skimr::skim(eas_data)

Data summary
Name	eas_data
Number of rows	1207
Number of columns	10
_______________________
Column type frequency:
character	2
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Target	0	1	6	6	0	597	0
Emotion	0	1	1	2	0	5	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p50	p75	p100	hist
SADNESS	1	0.02	0.08	0.00	0.00	0.96	▇▁▁▁▁
NEUTRAL	1	0.56	0.46	0.84	0.99	1.00	▆▁▁▁▇
CONTEMPT	1	0.02	0.06	0.00	0.00	0.68	▇▁▁▁▁
DISGUST	1	0.01	0.04	0.00	0.00	0.56	▇▁▁▁▁
ANGER	1	0.04	0.15	0.00	0.00	1.00	▇▁▁▁▁
SURPRISE	1	0.07	0.22	0.00	0.00	1.00	▇▁▁▁▁
FEAR	1	0.02	0.11	0.00	0.00	1.00	▇▁▁▁▁
HAPPINESS	1	0.27	0.43	0.00	0.78	1.00	▇▁▁▁▃
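To compare the service's readings with a person's 1st-choice emotion, the eight probability columns summarized above can be collapsed to a single most-likely emotion per image. The sketch below shows one way to do that with `pivot_longer()` and `slice_max()`. The column names match the skim output, but the two rows of `toy_eas`, including the `Target` identifiers, are invented for illustration.

```r
library(dplyr)
library(tidyr)

# Toy rows mimicking the EAS columns; identifiers and values are invented.
toy_eas <- tibble::tibble(
  Target    = c("img001", "img002"),
  SADNESS   = c(0.01, 0.00),
  NEUTRAL   = c(0.90, 0.10),
  CONTEMPT  = c(0.02, 0.00),
  DISGUST   = c(0.00, 0.00),
  ANGER     = c(0.05, 0.85),
  SURPRISE  = c(0.01, 0.03),
  FEAR      = c(0.00, 0.02),
  HAPPINESS = c(0.01, 0.00)
)

# Reshape to one row per image-emotion pair, then keep the emotion
# with the highest probability for each image.
top_emotion <- toy_eas |>
  pivot_longer(SADNESS:HAPPINESS,
               names_to = "emotion", values_to = "probability") |>
  group_by(Target) |>
  slice_max(probability, n = 1, with_ties = FALSE) |>
  ungroup()

top_emotion
```

The resulting `emotion` column is categorical, so it could be tabulated against a demographic variable or compared with the human `1st_choice` responses once the two data sets are joined on a shared image identifier.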

person_data <- read_excel("data/PERSON_DATASET.xlsx")
skimr::skim(person_data)

Data summary
Name	person_data
Number of rows	1845
Number of columns	23
_______________________
Column type frequency:
character	11
logical	6
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
_channel	0	1.00	4	14	16
_country	0	1.00	3	3	1
_region	106	0.94	2	2	44
_city	111	0.94	3	17	230
_ip	0	1.00	10	15	310
1st_choice	0	1.00	4	12	8
2nd_choice	0	1.00	4	12	8
i_would_not_use	0	1.00	4	12	8
select_your_gender	0	1.00	4	9	5
select_your_race	0	1.00	4	6	8
image_url	0	1.00	38	39	610

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
_tainted	0	1	0	FAL: 1845
1st_choice_gold	1845	0	NaN	:
2nd_choice_gold	1845	0	NaN	:
i_would_not_use_gold	1845	0	NaN	:
select_your_gender_gold	1845	0	NaN	:
select_your_race_gold	1845	0	NaN	:

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
_unit_id	1	2.276232e+09	177.02	2276231480.0	2.276232e+09	2.276232e+09	2276231940.0	2276232092	▇▇▇▇▇
_id	1	4.896590e+09	36899584.69	4780811066.0	4.910562e+09	4.910811e+09	4911186464.0	4912209025	▁▁▁▁▇
_trust	1	7.500000e-01	0.12	0.4	6.700000e-01	7.300000e-01	0.8	1	▁▃▇▅▃
_worker_id	1	3.474014e+07	13541771.24	1853182.0	2.852137e+07	4.154501e+07	45205782.0	45331060	▁▁▁▂▇

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
_created_at	0	1	2019-05-07 12:44:00	2019-06-22 12:08:14	2019-06-21 14:37:59	1827
_started_at	0	1	2019-05-07 12:42:00	2019-06-22 12:04:48	2019-06-21 14:37:42	1825

Data 3

Introduction and data

Identify the source of the data.

The source of the data is Zillow. A function reads HTML from Zillow URLs and extracts information on homes that have recently been sold in Ithaca, NY. The data was scraped on March 15, 2023.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was scraped from the Zillow website on March 15, 2023.
Write a brief description of the observations.

The observations in the data include the date of sale, sale price, number of bedrooms, number of bathrooms, square footage, and realtor for each home sold. The data covers 10 pages of results from Zillow, with each page containing information on up to 40 homes.
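The extraction step described above can be sketched with rvest on a small inline HTML fragment. Everything about the markup here is an assumption: the class names `home`, `price`, `beds`, and `baths` and the listing values are hypothetical stand-ins, not Zillow's actual page structure, and a real scraper should first check the site's robots.txt and terms of use.

```r
library(rvest)
library(readr)

# Hypothetical markup standing in for one page of results.
page <- minimal_html('
  <ul>
    <li class="home">
      <span class="price">$350,000</span>
      <span class="beds">3 bds</span>
      <span class="baths">2 ba</span>
    </li>
    <li class="home">
      <span class="price">$512,500</span>
      <span class="beds">4 bds</span>
      <span class="baths">3 ba</span>
    </li>
  </ul>
')

# One node per listing; html_element() returns exactly one match per
# listing, so the columns stay aligned even if a field is missing.
homes <- html_elements(page, "li.home")

scraped <- tibble::tibble(
  price = homes |> html_element(".price") |> html_text2() |> parse_number(),
  beds  = homes |> html_element(".beds")  |> html_text2() |> parse_number(),
  baths = homes |> html_element(".baths") |> html_text2() |> parse_number()
)

scraped
```

`parse_number()` drops the currency symbol, grouping commas, and unit suffixes, which leaves the price, bedroom, and bathroom columns as quantitative variables.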

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Does the number of bedrooms or bathrooms have a stronger effect on the sale price of homes in Ithaca, NY?
A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic is the relationship between the number of bedrooms and bathrooms and the sale price of homes in Ithaca, NY. This topic falls within the broader field of real estate research, and the aim of this analysis is to identify which factor - the number of bedrooms or bathrooms - has a stronger effect on home sale prices.

Hypothesis: The number of bedrooms has a stronger effect on the sale price of homes in Ithaca, NY than the number of bathrooms.

Identify the types of variables in your research question. Categorical? Quantitative?

The independent variables in this research question are the number of bedrooms and the number of bathrooms, both of which are quantitative variables. The dependent variable is the sale price of the homes, which is also a quantitative variable.
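Because bedrooms and bathrooms are measured on different scales, one way to compare the strength of their effects is to standardize all three variables and compare the regression coefficients. The sketch below uses simulated data, with the coefficients in the simulation chosen arbitrarily, so the output says nothing about Ithaca home prices. The names `toy_homes`, `beds`, `baths`, and `price` are assumptions made for the example.

```r
set.seed(1)
n <- 200

# Simulated homes: the relationship below is invented for illustration.
toy_homes <- data.frame(
  beds  = sample(1:5, n, replace = TRUE),
  baths = sample(1:3, n, replace = TRUE)
)
toy_homes$price <- 100000 + 40000 * toy_homes$beds +
  25000 * toy_homes$baths + rnorm(n, sd = 20000)

# Standardize every column (mean 0, sd 1) so that coefficients are
# expressed in standard deviations and can be compared directly.
std_homes <- as.data.frame(scale(toy_homes))

fit <- lm(price ~ beds + baths, data = std_homes)
std_coefs <- coef(fit)[c("beds", "baths")]

std_coefs
```

The predictor with the larger absolute standardized coefficient has the stronger association with sale price in the fitted model. With real listings, the two predictors are likely to be correlated, so their confidence intervals are worth checking before declaring one stronger.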

Glimpse of data

# add code here
real_estate <- read_csv("data/real_estate.csv")

Rows: 9129 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): data.date, data.owned or leased, data.status, data.type, location....
dbl  (5): data.parking spaces, location.congressional district, location.reg...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(real_estate)

Data summary
Name	real_estate
Number of rows	9129
Number of columns	15
_______________________
Column type frequency:
character	10
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
data.date	1	1	10	1380
data.owned or leased	1	5	6	2
data.status	1	6	14	3
data.type	1	4	9	3
location.id	1	6	6	9129
data.disabilities.ADA Accessible	1	2	12	3
location.address.city	1	3	25	1964
location.address.county	1	4	30	947
location.address.line 1	1	3	57	8076
location.address.state	1	2	2	56

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data.parking spaces	0	1	50.95	189.31	0	0	8	39	6600	▇▁▁▁▁
location.congressional district	22	1	14.83	23.47	0	2	6	16	98	▇▁▁▁▁
location.region id	0	1	6.22	2.90	1	4	7	9	11	▆▇▆▆▅
data.disabilities.ansi usable	0	1	33351.82	94770.75	0	2535	7629	21353	1989117	▇▁▁▁▁
location.address.zip	0	1	466892590.74	324349643.23	605	200242104	432151052	785216762	999019998	▇▇▃▆▆