library(tidyverse)
library(skimr)
Visualizing the Network of NYC Public Computer Centers
Proposal
Data 1
Problem or question
- Identify the problem you will solve or the question you will answer
- Explain why you think this topic is important.
- Identify the types of data/variables you will use.
- State the major deliverable(s) you will create to solve this problem/answer this question.
Question/Objective:
What are the trends in college tuition and how do they correlate with the diversity and salary potential of graduates?
How does the affordability of higher education, especially for different income groups, impact career prospects and college choices?
Importance:
This project is crucial for understanding the dynamics of higher education in the United States, as it directly affects students, educators, and policymakers. It can help guide decisions related to college choice, financial planning, and future career prospects.
Variables:
Categorical (e.g., college majors, school types) and Quantitative (e.g., tuition costs, salary potential).
Major Deliverable(s):
A comprehensive research report with data visualizations.
Introduction and data
If you are using a dataset:
- Identify the source of the data.
Thomas Mock. (2019). College tuition, diversity, and pay. Kaggle.
https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The dataset was obtained from Kaggle, but the original source of the data is the US Department of Education. The historical averages in the data were retrieved from the National Center for Education Statistics (NCES). Additional sources of the data include the Chronicle of Higher Education, Priceonomics, TuitionTracker.org, and payscale.com.
The data was obtained and cleaned by Thomas Mock on March 10th, 2020.
Write a brief description of the observations.
This dataset focuses on the multifaceted dynamics of higher education in the United States. It encompasses various critical dimensions of the higher education experience, including college tuition, diversity, affordability, and career prospects. By examining historical tuition trends, current tuition and fees data, diversity statistics, average net costs by income bracket, and salary potential data, researchers can gain a comprehensive understanding of the state of American higher education. Moreover, this dataset allows for an exploration of how these factors interact and influence college choices, financial decisions, and future career opportunities. It provides a holistic view of the challenges and opportunities in the higher education landscape, offering valuable insights to stakeholders in academia and beyond.
Glimpse of data
# add code here
tuition <- read.csv("data/tuition_cost.csv")
skim(tuition)
| Name | tuition |
| Number of rows | 2973 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 8 | 67 | 0 | 2938 | 0 |
| state | 52 | 0.98 | 4 | 14 | 0 | 50 | 0 |
| state_code | 0 | 1.00 | 2 | 2 | 0 | 55 | 0 |
| type | 0 | 1.00 | 5 | 10 | 0 | 4 | 0 |
| degree_length | 0 | 1.00 | 5 | 6 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| room_and_board | 1094 | 0.63 | 10095.28 | 3288.55 | 30 | 7935 | 10000 | 12424.5 | 21300 | ▁▅▇▃▁ |
| in_state_tuition | 0 | 1.00 | 16491.29 | 14773.84 | 480 | 4890 | 10099 | 27124.0 | 59985 | ▇▂▂▁▁ |
| in_state_total | 0 | 1.00 | 22871.73 | 18948.39 | 962 | 5802 | 17669 | 35960.0 | 75003 | ▇▅▂▂▁ |
| out_of_state_tuition | 0 | 1.00 | 20532.73 | 13255.65 | 480 | 9552 | 17486 | 29208.0 | 59985 | ▇▆▅▂▁ |
| out_of_state_total | 0 | 1.00 | 26913.16 | 17719.73 | 1376 | 11196 | 23214 | 39054.0 | 75003 | ▇▅▅▂▁ |
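The summary above suggests a natural first step toward the tuition questions: comparing costs across school types while keeping track of the missing `room_and_board` values. The following is a minimal sketch under stated assumptions, not the project's final analysis. The toy rows and the `type` labels are made up for illustration; only the column names come from the `skim()` output.

```r
library(dplyr)

# Toy rows standing in for data/tuition_cost.csv.
# All values (and the "Public"/"Private" labels) are illustrative assumptions.
tuition_toy <- data.frame(
  name                 = c("College A", "College B", "College C", "College D"),
  type                 = c("Public", "Public", "Private", "Private"),
  room_and_board       = c(9000, NA, 12000, 14000),
  in_state_tuition     = c(5000, 7000, 30000, 40000),
  out_of_state_tuition = c(15000, 17000, 30000, 40000),
  stringsAsFactors     = FALSE
)

# Median tuition per school type, plus the share of rows with no
# room-and-board figure (that column is about 37% missing in the real data).
tuition_by_type <- tuition_toy %>%
  group_by(type) %>%
  summarise(
    n_schools                = n(),
    median_in_state          = median(in_state_tuition),
    median_out_of_state      = median(out_of_state_tuition),
    share_missing_room_board = mean(is.na(room_and_board)),
    .groups = "drop"
  )

print(tuition_by_type)
```

Swapping `tuition_toy` for the `tuition` data frame read above should give the same summary on the full dataset, assuming the column names match the skim output.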
Data 2
Problem or question
- Identify the problem you will solve or the question you will answer
- Explain why you think this topic is important.
- Identify the types of data/variables you will use.
- State the major deliverable(s) you will create to solve this problem/answer this question.
Question/Objective:
What factors influence the success of movies in terms of ratings and box office earnings?
How have movie trends, genres, and audience preferences evolved over different decades?
Importance:
Understanding the factors that contribute to a movie’s success is crucial for the film industry. Analyzing trends can inform production and marketing strategies.
Variables:
Categorical (e.g., genres, directors) and Quantitative (e.g., ratings, gross income).
Major Deliverable(s):
An R package for movie data analysis and visualization.
Introduction and data
If you are using a dataset:
- Identify the source of the data.
Anonymous. (2023). 10,000 Movies Data(1915 - 2023). Kaggle. https://www.kaggle.com/datasets/dk123891/10000-movies-data
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data has been scraped from IMDb’s website for educational and research purposes.
Write a brief description of the observations.
This dataset is a comprehensive collection of IMDb’s Top 10,000 movies, offering a wealth of information on a wide array of films spanning various genres, release years, and user ratings. It includes essential details such as movie title, release year, IMDb user rating, Metascore rating, gross income, number of votes, runtime, genre, certification, plot summary, directors, and main cast.
Glimpse of data
# add code here
movie <- read_csv("data/movie.csv")
New names:
• `` -> `...1`
Rows: 10000 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Movie Name, Genre, Certification, Director, Stars, Description
dbl (7): ...1, Year of Release, Run Time in minutes, Movie Rating, Votes, Me...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(movie)
| Name | movie |
| Number of rows | 10000 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Movie Name | 0 | 1.00 | 1 | 72 | 0 | 9632 | 0 |
| Genre | 0 | 1.00 | 9 | 39 | 0 | 425 | 0 |
| Certification | 369 | 0.96 | 1 | 9 | 0 | 24 | 0 |
| Director | 0 | 1.00 | 6 | 374 | 0 | 4162 | 0 |
| Stars | 0 | 1.00 | 2 | 102 | 0 | 9947 | 0 |
| Description | 0 | 1.00 | 51 | 648 | 0 | 9996 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ...1 | 0 | 1.00 | 4999.50 | 2886.90 | 0.0 | 2499.75 | 4999.5 | 7499.25 | 9999.0 | ▇▇▇▇▇ |
| Year of Release | 0 | 1.00 | 2001.41 | 18.60 | 1915.0 | 1994.00 | 2007.0 | 2015.00 | 2023.0 | ▁▁▁▃▇ |
| Run Time in minutes | 0 | 1.00 | 110.72 | 22.05 | 45.0 | 96.00 | 107.0 | 121.00 | 439.0 | ▇▂▁▁▁ |
| Movie Rating | 0 | 1.00 | 6.73 | 0.82 | 4.9 | 6.10 | 6.7 | 7.30 | 9.3 | ▃▇▇▃▁ |
| Votes | 0 | 1.00 | 92797.38 | 171650.90 | 10002.0 | 16851.75 | 34179.5 | 91546.00 | 2804443.0 | ▇▁▁▁▁ |
| MetaScore | 2026 | 0.80 | 59.17 | 17.27 | 7.0 | 47.00 | 60.0 | 72.00 | 100.0 | ▁▅▇▇▂ |
| Gross | 2915 | 0.71 | 40175003.53 | 67486580.89 | 0.0 | 2340000.00 | 16930000.0 | 48640000.00 | 936660000.0 | ▇▁▁▁▁ |
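Because the questions for this dataset ask how movies have changed across decades, one possible preparation step is to drop the unnamed index column (`...1`), shorten the column names that contain spaces, and derive a decade from `Year of Release`. The sketch below uses made-up rows; the short names (`title`, `year`, `rating`, `gross`) and the `decade` column are hypothetical choices, not part of the original data.

```r
library(dplyr)

# Toy rows standing in for data/movie.csv; all values are illustrative.
movie_toy <- data.frame(
  idx               = 0:3,
  `Movie Name`      = c("Film A", "Film B", "Film C", "Film D"),
  `Year of Release` = c(1994, 1999, 2008, 2015),
  `Movie Rating`    = c(9.3, 8.8, 9.0, 7.1),
  Gross             = c(28340000, NA, 534860000, NA),
  check.names       = FALSE,
  stringsAsFactors  = FALSE
)
# Mimic the unnamed index column that read_csv() renamed to `...1`.
names(movie_toy)[1] <- "...1"

movie_by_decade <- movie_toy %>%
  select(-any_of("...1")) %>%                # drop the row-index column
  rename(
    title  = `Movie Name`,
    year   = `Year of Release`,
    rating = `Movie Rating`,
    gross  = Gross
  ) %>%
  mutate(decade = floor(year / 10) * 10) %>% # e.g. 1994 -> 1990
  group_by(decade) %>%
  summarise(
    n_movies        = n(),
    mean_rating     = mean(rating),
    n_missing_gross = sum(is.na(gross)),     # Gross is about 29% missing in the real data
    .groups = "drop"
  )

print(movie_by_decade)
```

Counting the missing `Gross` values per decade is included because any box-office comparison would rest only on the rows where that figure exists.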
Data 3
Problem or question
- Identify the problem you will solve or the question you will answer
- Explain why you think this topic is important.
- Identify the types of data/variables you will use.
- State the major deliverable(s) you will create to solve this problem/answer this question.
Question/Objective:
What factors influence the pricing of laptops, and how do different brands compare in terms of price and specifications?
How has the laptop market evolved over time in terms of specifications, pricing, and brand dominance?
Importance:
Understanding the laptop market is vital for consumers, manufacturers, and retailers. It can inform purchasing decisions, product strategies, and pricing strategies.
Variables:
Categorical (e.g., laptop brands) and Quantitative (e.g., laptop prices, specifications).
Major Deliverable(s):
An R package for laptop market analysis and visualization.
Introduction and data
If you are using a dataset:
- Identify the source of the data.
Ahmad. (2023). Laptop_Price_Analysis [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/6591382
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Check open data repositories like Kaggle, UCI Machine Learning Repository, or government data portals for publicly available datasets related to laptops and prices.
Use a Python script to crawl information from various websites that display laptop prices and specifications.
Write a brief description of the observations.
This dataset covers a diverse range of laptop models and their prices from various manufacturers, including major brands such as Apple, Dell, HP, and Lenovo. Key details include the model name, specifications (such as RAM size, storage type, and processor type), release date, and price. This project seeks to conduct a thorough analysis of the laptop market based on these models and their prices.
Glimpse of data
# add code here
laptops_price <- read_csv("data/laptops_price.csv")
Rows: 977 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Manufacturer, Model Name, Category, Screen Size, Screen, CPU, RAM,...
dbl (1): Price
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(laptops_price)
| Name | laptops_price |
| Number of rows | 977 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 12 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Manufacturer | 0 | 1.00 | 2 | 9 | 0 | 19 | 0 |
| Model Name | 0 | 1.00 | 6 | 45 | 0 | 488 | 0 |
| Category | 0 | 1.00 | 6 | 18 | 0 | 6 | 0 |
| Screen Size | 0 | 1.00 | 5 | 5 | 0 | 18 | 0 |
| Screen | 0 | 1.00 | 8 | 45 | 0 | 38 | 0 |
| CPU | 0 | 1.00 | 17 | 37 | 0 | 106 | 0 |
| RAM | 0 | 1.00 | 3 | 4 | 0 | 8 | 0 |
| Storage | 0 | 1.00 | 7 | 29 | 0 | 36 | 0 |
| GPU | 0 | 1.00 | 13 | 30 | 0 | 96 | 0 |
| Operating System | 0 | 1.00 | 5 | 9 | 0 | 7 | 0 |
| Operating System Version | 136 | 0.86 | 1 | 4 | 0 | 4 | 0 |
| Weight | 0 | 1.00 | 3 | 7 | 0 | 166 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Price | 0 | 1 | 10018995 | 6306430 | 1706375 | 5326308 | 8527428 | 13115700 | 54232308 | ▇▃▁▁▁ |
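The skim output shows that 12 of the 13 columns, including `RAM` and `Weight`, were read as character, so they would need to be converted before they can be related to `Price`. Here is a rough sketch of that conversion, assuming the values look like `"8GB"` and `"1.37kg"`. That format is a guess based on the string lengths in the summary and should be checked against the real file. The helper `to_number()` and the toy rows are hypothetical.

```r
library(dplyr)

# Toy rows standing in for data/laptops_price.csv.
# The value formats ("8GB", "1.37kg") and prices are assumptions for illustration.
laptops_toy <- data.frame(
  Manufacturer     = c("Apple", "Dell", "HP", "Dell"),
  RAM              = c("8GB", "16GB", "4GB", "8GB"),
  Weight           = c("1.37kg", "2.1kg", "1.86kg", "2.2kg"),
  Price            = c(11912523, 13329252, 5112900, 8000000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip everything except digits and the decimal point.
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))

laptops_clean <- laptops_toy %>%
  mutate(
    ram_gb    = to_number(RAM),
    weight_kg = to_number(Weight)
  )

# Compare brands on price and a numeric specification.
price_by_brand <- laptops_clean %>%
  group_by(Manufacturer) %>%
  summarise(
    n_models     = n(),
    median_price = median(Price),
    mean_ram_gb  = mean(ram_gb),
    .groups = "drop"
  ) %>%
  arrange(desc(median_price))

print(price_by_brand)
```

The same idea would extend to `Screen Size` and `Storage`, although `Storage` may mix several units and would likely need more careful parsing.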
Data 4
Problem or question
- Identify the problem you will solve or the question you will answer
- Explain why you think this topic is important.
- Identify the types of data/variables you will use.
- State the major deliverable(s) you will create to solve this problem/answer this question.
Question/Objective:
What is the distribution of public computer centers across New York City neighborhoods?
How accessible are these centers to the general public based on their operating hours?
Importance:
Having access to public computer centers can significantly impact digital literacy, job searching, educational purposes, and overall quality of life. It is vital to identify gaps in accessibility to ensure equitable opportunities for all New Yorkers, particularly in this digital age.
Variables:
Categorical: Borough, Neighborhood, Facility Type, Services Offered, Accessibility Features
Quantitative: Number of Public Computer Workstations, Operating Hours, Days of Operation
Major Deliverable(s):
A website showcasing the distribution of computer centers across NYC, with filters based on factors such as services, borough, and facility type.
Introduction and data
If you are using a dataset:
- Identify the source of the data.
NYC Open Data. https://data.cityofnewyork.us/Social-Services/Citywide-Public-Computer-Centers/sejx-2gn3
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
This dataset was published on NYC Open Data on September 26, 2023 by the New York City Office of Technology and Innovation (OTI).
Write a brief description of the observations.
The dataset provides detailed insights into public computer centers located across New York City. It offers specifics on the borough, address, facility type, services provided, number of computer workstations, operating days and hours, and various accessibility features. This dataset is instrumental in understanding digital accessibility across different city neighborhoods.
Glimpse of data
# add code here
PCC_raw <- read.csv("data/Public_Computer_Center.csv")
skim(PCC_raw)
| Name | PCC_raw |
| Number of rows | 530 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 24 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Oversight.Agency | 0 | 1 | 3 | 4 | 0 | 6 | 0 |
| Location.Name | 0 | 1 | 4 | 73 | 0 | 521 | 0 |
| Operating.Status | 0 | 1 | 4 | 18 | 0 | 5 | 0 |
| Address.Number | 0 | 1 | 1 | 9 | 0 | 448 | 0 |
| Address.Prefix | 0 | 1 | 0 | 5 | 435 | 5 | 0 |
| Address.Street | 0 | 1 | 3 | 24 | 0 | 361 | 0 |
| Address.Suffix | 0 | 1 | 0 | 13 | 24 | 18 | 0 |
| City | 0 | 1 | 5 | 16 | 0 | 45 | 0 |
| State | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
| Full.Location.Address | 0 | 1 | 25 | 52 | 0 | 501 | 0 |
| Borough | 0 | 1 | 5 | 13 | 0 | 5 | 0 |
| Wheelchair.Accessible | 0 | 1 | 2 | 9 | 0 | 3 | 0 |
| Assistive.Technology | 0 | 1 | 2 | 8 | 0 | 3 | 0 |
| Languages.Offered | 0 | 1 | 3 | 57 | 0 | 30 | 0 |
| Technology.Related.Courses | 0 | 1 | 0 | 460 | 6 | 129 | 0 |
| Affordability.Connectivity.Program | 0 | 1 | 2 | 8 | 0 | 3 | 0 |
| Productivity.Tools..ex...Using.Word..Excel..Powerpoint..Adobe.Acrobat..etc.. | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
| Job.Readiness..ex..resume.help..job.search..etc.. | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
| Education..ex..personal.growth..k.12.supports..reading.research..etc.. | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
| Creative.Expression...ex..making.art..videos..blogs..websites..etc.. | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
| Media.and.Entertainment..ex..consuming..producing..etc.. | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
| Certifications..ex..in.software..in.housing..in.professional.areas..etc.. | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
| Digital.Literacy | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
| Neighborhood.Tabulation.Area..NTA...2020. | 0 | 1 | 6 | 6 | 0 | 190 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Calendar.Year | 0 | 1.00 | 2023.00 | 0.000000e+00 | 2023.00 | 2023.00 | 2023.00 | 2023.00 | 2023.00 | ▁▁▇▁▁ |
| Object.Identification.Number | 0 | 1.00 | 264.50 | 1.531400e+02 | 0.00 | 132.25 | 264.50 | 396.75 | 529.00 | ▇▇▇▇▇ |
| Postcode | 0 | 1.00 | 10827.41 | 5.555500e+02 | 10001.00 | 10307.25 | 11206.00 | 11234.00 | 11694.00 | ▆▃▁▇▅ |
| Latitude | 0 | 1.00 | 40.73 | 8.000000e-02 | 40.51 | 40.67 | 40.72 | 40.80 | 40.91 | ▁▅▇▆▃ |
| Longitude | 0 | 1.00 | -73.92 | 8.000000e-02 | -74.24 | -73.97 | -73.93 | -73.87 | -73.71 | ▁▁▇▆▂ |
| Community.District | 0 | 1.00 | 280.97 | 1.189200e+02 | 101.00 | 203.00 | 305.00 | 403.00 | 503.00 | ▆▅▇▆▁ |
| Council.District | 0 | 1.00 | 24.97 | 1.496000e+01 | 1.00 | 11.00 | 25.00 | 38.00 | 51.00 | ▇▆▅▆▆ |
| BIN | 12 | 0.98 | 2864074.56 | 1.263667e+06 | 1000000.00 | 2008096.25 | 3085294.00 | 4077471.25 | 5171789.00 | ▆▅▇▅▃ |
| BBL | 12 | 0.98 | 2756482478.46 | 1.206915e+09 | 1000167516.00 | 2026170051.00 | 3021760001.00 | 4014785010.00 | 5078990009.00 | ▆▅▇▆▁ |
| Census.Tract..2020. | 0 | 1.00 | 38895.13 | 3.448591e+04 | 102.00 | 13400.00 | 28601.00 | 51325.00 | 157902.00 | ▇▅▂▁▁ |
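To start on the distribution question, the centers can be counted per borough, with an accessibility share alongside the counts. The sketch below is a minimal illustration on made-up rows. The column names `Borough` and `Wheelchair.Accessible` come from the skim output, but the `"Yes"`/`"No"` coding is an assumption (the real column has three distinct values) and should be verified before use.

```r
library(dplyr)

# Toy rows standing in for data/Public_Computer_Center.csv.
# Center names, boroughs, and the "Yes"/"No" coding are illustrative assumptions.
pcc_toy <- data.frame(
  Location.Name         = c("Center A", "Center B", "Center C", "Center D", "Center E"),
  Borough               = c("Brooklyn", "Brooklyn", "Queens", "Bronx", "Brooklyn"),
  Wheelchair.Accessible = c("Yes", "No", "Yes", "Yes", "Yes"),
  stringsAsFactors      = FALSE
)

centers_by_borough <- pcc_toy %>%
  group_by(Borough) %>%
  summarise(
    n_centers        = n(),
    share_accessible = mean(Wheelchair.Accessible == "Yes"),
    .groups = "drop"
  ) %>%
  arrange(desc(n_centers), Borough)  # largest boroughs first, ties alphabetical

print(centers_by_borough)
```

Grouping by `Neighborhood.Tabulation.Area..NTA...2020.` instead of `Borough` would give the neighborhood-level view that the first question asks about.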