Visualizing the Network of NYC Public Computer Centers

Proposal

library(tidyverse)
library(skimr)

Data 1

Problem or question

Identify the problem you will solve or the question you will answer.
Explain why you think this topic is important.
Identify the types of data/variables you will use.
State the major deliverable(s) you will create to solve this problem/answer this question.

Question/Objective:

What are the trends in college tuition and how do they correlate with the diversity and salary potential of graduates?
How does the affordability of higher education, especially for different income groups, impact career prospects and college choices?

Importance:

This project is crucial for understanding the dynamics of higher education in the United States, as it directly affects students, educators, and policymakers. It can help guide decisions related to college choice, financial planning, and future career prospects.

Variables:

Categorical (e.g., college majors, school types) and Quantitative (e.g., tuition costs, salary potential).

Major Deliverable(s):

A comprehensive research report with data visualizations.

Introduction and data

If you are using a dataset:

Identify the source of the data.

Thomas Mock. (2019). College tuition, diversity, and pay. Kaggle.

https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

We obtained the dataset from Kaggle, but the original source of the data is the US Department of Education. The historical averages in the data were retrieved from the National Center for Education Statistics (NCES). Additional sources of the data include the Chronicle of Higher Education, Priceonomics, TuitionTracker.org, and payscale.com.

The data was obtained and cleaned by Thomas Mock on March 10th, 2020.

Write a brief description of the observations.

This dataset focuses on the multifaceted dynamics of higher education in the United States. It encompasses various critical dimensions of the higher education experience, including college tuition, diversity, affordability, and career prospects. By examining historical tuition trends, current tuition and fees data, diversity statistics, average net costs by income bracket, and salary potential data, researchers can gain a comprehensive understanding of the state of American higher education. Moreover, this dataset allows for an exploration of how these factors interact and influence college choices, financial decisions, and future career opportunities. It provides a holistic view of the challenges and opportunities in the higher education landscape, offering valuable insights to stakeholders in academia and beyond.

Glimpse of data

# Read the tuition cost data and summarize every column
tuition <- read.csv("data/tuition_cost.csv")
skim(tuition)

Data summary
Name	tuition
Number of rows	2973
Number of columns	10
_______________________
Column type frequency:
character	5
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	8	67	2938
state	52	0.98	4	14	50
state_code	0	1.00	2	2	55
type	0	1.00	5	10	4
degree_length	0	1.00	5	6	3

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
room_and_board	1094	0.63	10095.28	3288.55	30	7935	10000	12424.5	21300	▁▅▇▃▁
in_state_tuition	0	1.00	16491.29	14773.84	480	4890	10099	27124.0	59985	▇▂▂▁▁
in_state_total	0	1.00	22871.73	18948.39	962	5802	17669	35960.0	75003	▇▅▂▂▁
out_of_state_tuition	0	1.00	20532.73	13255.65	480	9552	17486	29208.0	59985	▇▆▅▂▁
out_of_state_total	0	1.00	26913.16	17719.73	1376	11196	23214	39054.0	75003	▇▅▅▂▁
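
As a first look at the tuition question, the sketch below compares median in-state tuition across school types. It is a minimal illustration, not the planned analysis: `tuition_toy` is a hypothetical stand-in for the `tuition` data frame, its values are made up, and the `type` labels are assumed rather than taken from the file. It uses base R only, so it runs without the data or extra packages.

```r
# Toy stand-in for data/tuition_cost.csv; values are invented for illustration
# and the school type labels are assumptions.
tuition_toy <- data.frame(
  type = c("Public", "Public", "Private", "Private", "For Profit"),
  in_state_tuition = c(4000, 6000, 30000, 40000, 15000),
  stringsAsFactors = FALSE
)

# Median in-state tuition for each school type
tuition_by_type <- aggregate(in_state_tuition ~ type, data = tuition_toy, FUN = median)
tuition_by_type
```

Running the same `aggregate()` call on `tuition` would give the medians for the real data. The median is used here because the histograms above suggest right-skewed tuition distributions.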

Data 2

Problem or question

Identify the problem you will solve or the question you will answer.
Explain why you think this topic is important.
Identify the types of data/variables you will use.
State the major deliverable(s) you will create to solve this problem/answer this question.

Question/Objective:

What factors influence the success of movies in terms of ratings and box office earnings?
How have movie trends, genres, and audience preferences evolved over different decades?

Importance:

Understanding the factors that contribute to a movie’s success is crucial for the film industry. Analyzing trends can inform production and marketing strategies.

Variables:

Categorical (e.g., genres, directors) and Quantitative (e.g., ratings, gross income).

Major Deliverable(s):

An R package for movie data analysis and visualization.

Introduction and data

If you are using a dataset:

Identify the source of the data.

Anonymous. (2023). 10,000 Movies Data(1915 - 2023). Kaggle. https://www.kaggle.com/datasets/dk123891/10000-movies-data

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data has been scraped from IMDb’s website for educational and research purposes.

Write a brief description of the observations.

This dataset is a comprehensive collection of IMDb’s Top 10,000 movies, offering a wealth of information on a wide array of films spanning various genres, release years, and user ratings. It includes essential details such as movie title, release year, IMDb user rating, Metascore rating, gross income, number of votes, runtime, genre, certification, plot summary, directors, and main cast.

Glimpse of data

# Read the movie data (readr prints the column specification on load)
movie <- read_csv("data/movie.csv")

New names:
• `` -> `...1`
Rows: 10000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Movie Name, Genre, Certification, Director, Stars, Description
dbl (7): ...1, Year of Release, Run Time in minutes, Movie Rating, Votes, Me...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(movie)

Data summary
Name	movie
Number of rows	10000
Number of columns	13
_______________________
Column type frequency:
character	6
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Movie Name	0	1.00	1	72	9632
Genre	0	1.00	9	39	425
Certification	369	0.96	1	9	24
Director	0	1.00	6	374	4162
Stars	0	1.00	2	102	9947
Description	0	1.00	51	648	9996

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
...1	0	1.00	4999.50	2886.90	0.0	2499.75	4999.5	7499.25	9999.0	▇▇▇▇▇
Year of Release	0	1.00	2001.41	18.60	1915.0	1994.00	2007.0	2015.00	2023.0	▁▁▁▃▇
Run Time in minutes	0	1.00	110.72	22.05	45.0	96.00	107.0	121.00	439.0	▇▂▁▁▁
Movie Rating	0	1.00	6.73	0.82	4.9	6.10	6.7	7.30	9.3	▃▇▇▃▁
Votes	0	1.00	92797.38	171650.90	10002.0	16851.75	34179.5	91546.00	2804443.0	▇▁▁▁▁
MetaScore	2026	0.80	59.17	17.27	7.0	47.00	60.0	72.00	100.0	▁▅▇▇▂
Gross	2915	0.71	40175003.53	67486580.89	0.0	2340000.00	16930000.0	48640000.00	936660000.0	▇▁▁▁▁
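
The summary above shows two practical hurdles for the ratings-versus-earnings question: the column names contain spaces, and `Gross` is missing for 2,915 films. The sketch below shows one way to handle both. It is only an illustration: `movie_toy` is a hypothetical stand-in with made-up values, and the snake_case naming is our choice.

```r
# Toy stand-in for data/movie.csv; values are invented for illustration.
movie_toy <- data.frame(
  `Movie Rating` = c(7.5, 6.0, 8.2, 5.5),
  Gross = c(5e7, NA, 1.2e8, 1e7),
  check.names = FALSE
)

# Convert names such as "Movie Rating" to snake_case for easier use in code
names(movie_toy) <- gsub(" ", "_", tolower(names(movie_toy)))

# Correlate rating with gross income, using only rows where both are present
rating_gross_cor <- cor(movie_toy$movie_rating, movie_toy$gross, use = "complete.obs")
rating_gross_cor
```

Dropping incomplete rows is the simplest option. Because about 29% of `Gross` values are missing, any result from the real data should be read with that in mind.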

Data 3

Problem or question

Identify the problem you will solve or the question you will answer.
Explain why you think this topic is important.
Identify the types of data/variables you will use.
State the major deliverable(s) you will create to solve this problem/answer this question.

Question/Objective:

What factors influence the pricing of laptops, and how do different brands compare in terms of price and specifications?
How has the laptop market evolved over time in terms of specifications, pricing, and brand dominance?

Importance:

Understanding the laptop market is vital for consumers, manufacturers, and retailers. It can inform purchasing decisions, product strategies, and pricing strategies.

Variables:

Categorical (e.g., laptop brands) and Quantitative (e.g., laptop prices, specifications).

Major Deliverable(s):

An R package for laptop market analysis and visualization.

Introduction and data

If you are using a dataset:

Identify the source of the data.

Ahmad. (2023). Laptop_Price_Analysis [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/6591382

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

Check open data repositories like Kaggle, UCI Machine Learning Repository, or government data portals for publicly available datasets related to laptops and prices.

Use a Python script to crawl information from various websites that display laptop prices and specifications.

Write a brief description of the observations.

Each observation in this dataset is a laptop model and its price, as listed by various manufacturers and brands. The dataset covers laptops from major brands such as Apple, Dell, HP, Lenovo, and more. Key details include the model name, the specifications (such as RAM size, storage type, and processor type), and the price. This project will use these variables to conduct a thorough analysis of the laptop market.

Glimpse of data

# Read the laptop price data (readr prints the column specification on load)
laptops_price <- read_csv("data/laptops_price.csv")

Rows: 977 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Manufacturer, Model Name, Category, Screen Size, Screen, CPU, RAM,...
dbl  (1): Price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(laptops_price)

Data summary
Name	laptops_price
Number of rows	977
Number of columns	13
_______________________
Column type frequency:
character	12
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Manufacturer	0	1.00	2	9	19
Model Name	0	1.00	6	45	488
Category	0	1.00	6	18	6
Screen Size	0	1.00	5	5	18
Screen	0	1.00	8	45	38
CPU	0	1.00	17	37	106
RAM	0	1.00	3	4	8
Storage	0	1.00	7	29	36
GPU	0	1.00	13	30	96
Operating System	0	1.00	5	9	7
Operating System Version	136	0.86	1	4	4
Weight	0	1.00	3	7	166

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Price	0	1	10018995	6306430	1706375	5326308	8527428	13115700	54232308	▇▃▁▁▁
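
The summary above shows that `Price` is the only numeric column; specifications such as `RAM` and `Weight` were read as text. To relate them to price, they first need to be converted to numbers. The sketch below shows one possible approach. It assumes values formatted like "8GB" and "1.37kg", which we have not verified against the file, and `laptops_toy` is a hypothetical stand-in with made-up values.

```r
# Toy stand-in for data/laptops_price.csv; values and string formats are assumptions.
laptops_toy <- data.frame(
  RAM = c("8GB", "16GB", "4GB"),
  Weight = c("1.37kg", "2.1kg", "0.92kg"),
  Price = c(100, 200, 50),
  stringsAsFactors = FALSE
)

# Drop everything except digits and the decimal point, then convert to numeric
laptops_toy$ram_gb <- as.numeric(gsub("[^0-9.]", "", laptops_toy$RAM))
laptops_toy$weight_kg <- as.numeric(gsub("[^0-9.]", "", laptops_toy$Weight))

# With numeric columns in place, specifications can be compared with price
cor(laptops_toy$ram_gb, laptops_toy$Price)
```

If the real columns mix units or contain other text, this pattern would need adjusting, so the converted values should be checked against the originals before any modeling.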

Data 4

Problem or question

Identify the problem you will solve or the question you will answer.
Explain why you think this topic is important.
Identify the types of data/variables you will use.
State the major deliverable(s) you will create to solve this problem/answer this question.

Question/Objective:

What is the distribution of public computer centers across New York City neighborhoods?
How accessible are these centers to the general public based on their operating hours?

Importance:

Having access to public computer centers can significantly impact digital literacy, job searching, educational purposes, and overall quality of life. It is vital to identify gaps in accessibility to ensure equitable opportunities for all New Yorkers, particularly in this digital age.

Variables:

Categorical: Borough, Neighborhood, Facility Type, Services Offered, Accessibility Features
Quantitative: Number of Public Computer Workstations, Operating Hours, Days of Operation

Major Deliverable(s):

A website showcasing the distribution of computer centers across NYC, with filters based on factors such as services, borough, and facility type.

Introduction and data

If you are using a dataset:

Identify the source of the data.

NYC Open Data. https://data.cityofnewyork.us/Social-Services/Citywide-Public-Computer-Centers/sejx-2gn3

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

This dataset was published on NYC Open Data on September 26, 2023 by the New York City Office of Technology and Innovation (OTI).

Write a brief description of the observations.

The dataset provides detailed insights into public computer centers located across New York City. It offers specifics on the borough, address, facility type, services provided, number of computer workstations, operating days and hours, and various accessibility features. This dataset is instrumental in understanding digital accessibility across different city neighborhoods.

Glimpse of data

# Read the public computer center data and summarize every column
PCC_raw <- read.csv("data/Public_Computer_Center.csv")
skim(PCC_raw)

Data summary
Name	PCC_raw
Number of rows	530
Number of columns	34
_______________________
Column type frequency:
character	24
numeric	10
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
Oversight.Agency	1	3	4	0	6
Location.Name	1	4	73	0	521
Operating.Status	1	4	18	0	5
Address.Number	1	1	9	0	448
Address.Prefix	1	0	5	435	5
Address.Street	1	3	24	0	361
Address.Suffix	1	0	13	24	18
City	1	5	16	0	45
State	1	2	2	0	1
Full.Location.Address	1	25	52	0	501
Borough	1	5	13	0	5
Wheelchair.Accessible	1	2	9	0	3
Assistive.Technology	1	2	8	0	3
Languages.Offered	1	3	57	0	30
Technology.Related.Courses	1	0	460	6	129
Affordability.Connectivity.Program	1	2	8	0	3
Productivity.Tools..ex...Using.Word..Excel..Powerpoint..Adobe.Acrobat..etc..	1	2	3	0	3
Job.Readiness..ex..resume.help..job.search..etc..	1	2	3	0	3
Education..ex..personal.growth..k.12.supports..reading.research..etc..	1	2	3	0	3
Creative.Expression...ex..making.art..videos..blogs..websites..etc..	1	2	3	0	3
Media.and.Entertainment..ex..consuming..producing..etc..	1	2	3	0	3
Certifications..ex..in.software..in.housing..in.professional.areas..etc..	1	2	3	0	3
Digital.Literacy	1	2	3	0	3
Neighborhood.Tabulation.Area..NTA...2020.	1	6	6	0	190

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Calendar.Year	0	1.00	2023.00	0.000000e+00	2023.00	2023.00	2023.00	2023.00	2023.00	▁▁▇▁▁
Object.Identification.Number	0	1.00	264.50	1.531400e+02	0.00	132.25	264.50	396.75	529.00	▇▇▇▇▇
Postcode	0	1.00	10827.41	5.555500e+02	10001.00	10307.25	11206.00	11234.00	11694.00	▆▃▁▇▅
Latitude	0	1.00	40.73	8.000000e-02	40.51	40.67	40.72	40.80	40.91	▁▅▇▆▃
Longitude	0	1.00	-73.92	8.000000e-02	-74.24	-73.97	-73.93	-73.87	-73.71	▁▁▇▆▂
Community.District	0	1.00	280.97	1.189200e+02	101.00	203.00	305.00	403.00	503.00	▆▅▇▆▁
Council.District	0	1.00	24.97	1.496000e+01	1.00	11.00	25.00	38.00	51.00	▇▆▅▆▆
BIN	12	0.98	2864074.56	1.263667e+06	1000000.00	2008096.25	3085294.00	4077471.25	5171789.00	▆▅▇▅▃
BBL	12	0.98	2756482478.46	1.206915e+09	1000167516.00	2026170051.00	3021760001.00	4014785010.00	5078990009.00	▆▅▇▆▁
Census.Tract..2020.	0	1.00	38895.13	3.448591e+04	102.00	13400.00	28601.00	51325.00	157902.00	▇▅▂▁▁
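
A natural first step toward the distribution question is to count centers per borough. The sketch below does this for centers that are currently open. It is only an illustration: `pcc_toy` is a hypothetical stand-in for `PCC_raw` with made-up rows, and the status labels "Open" and "Closed" are assumptions, since the actual values of `Operating.Status` are not shown above.

```r
# Toy stand-in for data/Public_Computer_Center.csv; rows are invented and the
# Operating.Status labels are assumptions.
pcc_toy <- data.frame(
  Borough = c("Brooklyn", "Queens", "Brooklyn", "Bronx", "Manhattan", "Brooklyn"),
  Operating.Status = c("Open", "Open", "Closed", "Open", "Open", "Open"),
  stringsAsFactors = FALSE
)

# Keep only centers assumed to be open
open_centers <- pcc_toy[pcc_toy$Operating.Status == "Open", ]

# Count open centers per borough, largest first
centers_by_borough <- sort(table(open_centers$Borough), decreasing = TRUE)
centers_by_borough
```

The same count could be repeated on `Neighborhood.Tabulation.Area..NTA...2020.` for a neighborhood-level view, and the `Latitude` and `Longitude` columns would supply the points for the map on the proposed website.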