library(tidyverse)
library(skimr)
Project title
Proposal
Data 1
Introduction and data
Identify the source of the data.
The data is provided by the Los Angeles Police Department (LAPD).
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data is listed as “Community Created Content” on the Los Angeles open data website. Community users are individuals who contribute their own data analyses to the site. Kyle Drives shared this dataset. The data was transcribed from original arrest reports (2010 to 2019) kept on computer and on paper.
Write a brief description of the observations.
Each observation represents the booking of an arrestee for prostitution along N Western Ave in Los Angeles, California. If the location is missing, the data records the address as (0.0000°, 0.0000°). Some addresses omit specific details to ensure safety and maintain privacy.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
1) Is child prostitution a significant problem in Hollywood?
2) What are the most common locations along N Western Ave where illegal sexual activities take place?
3) How do the races and genders of the arrestees for prostitution activities compare?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
Over the past ten years, prostitution along Western Avenue has been a major problem, one that has led the city's mayor and the Los Angeles Police Department to cooperate in addressing it. The data frame records prostitution-related arrests along the street since 2010. Understanding the dataset and answering these questions is important for predicting and preventing illegal prostitution in the area in the future. Our hypotheses are that 1) child prostitution is much less of a concern, 2) the Western cross-section is the busiest place where this type of activity occurs, and 3) most of the arrestees are women of color.
- Identify the types of variables in your research question. Categorical? Quantitative?
The main variables we analyze in the dataset are sex_code, cross_street, and descent_code (categorical) and age (quantitative).
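As a minimal sketch of how the first and third questions could be approached with these variables, the base R snippet below computes the share of arrestees under 18 and counts arrestees by sex. The `toy` data frame and its values are hypothetical stand-ins, not rows from the dataset, and the cutoff of 18 is an assumption about how a minor is defined.

```r
# Hypothetical toy rows standing in for the arrest data (not real records).
toy <- data.frame(
  age = c(16, 22, 25, 31, 17, 40, 19, 28),
  sex_code = c("F", "F", "M", "F", "F", "M", "F", "F")
)

# Assumed definition of a minor: younger than 18.
minor_cutoff <- 18

# Proportion of arrestees below the cutoff.
share_minor <- mean(toy$age < minor_cutoff)

# Number of arrestees in each sex category.
sex_counts <- table(toy$sex_code)

print(share_minor)
print(sex_counts)
```

On the real data, the same two lines would be applied to the age and sex columns after loading the file.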
Glimpse of data
prostitute <- read_csv("data/Prostitution.csv")
Rows: 1749 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Arrest Date, Time, Area ID, Area Name, Reporting District, Sex Cod...
dbl (3): Report ID, Age, Charge Group Code
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# add code here
skimr::skim(prostitute)
Name | prostitute |
Number of rows | 1749 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 13 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Arrest Date | 0 | 1 | 10 | 10 | 0 | 882 | 0 |
Time | 0 | 1 | 4 | 4 | 0 | 189 | 0 |
Area ID | 0 | 1 | 2 | 2 | 0 | 2 | 0 |
Area Name | 0 | 1 | 7 | 9 | 0 | 2 | 0 |
Reporting District | 0 | 1 | 4 | 4 | 0 | 57 | 0 |
Sex Code | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Descent Code | 0 | 1 | 1 | 1 | 0 | 6 | 0 |
Charge Group Description | 0 | 1 | 19 | 19 | 0 | 1 | 0 |
Arrest Type Code | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Charge | 0 | 1 | 6 | 11 | 0 | 5 | 0 |
Charge Description | 0 | 1 | 12 | 41 | 0 | 6 | 0 |
Address | 0 | 1 | 3 | 31 | 0 | 151 | 0 |
Cross Street | 0 | 1 | 7 | 34 | 0 | 8 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Report ID | 0 | 1 | 21602942.25 | 52341261.00 | 2118402 | 3222481 | 4007914 | 5132051 | 192022138 | ▇▁▁▁▁ |
Age | 0 | 1 | 26.67 | 8.82 | 13 | 20 | 24 | 30 | 67 | ▇▆▂▁▁ |
Charge Group Code | 0 | 1 | 13.00 | 0.00 | 13 | 13 | 13 | 13 | 13 | ▁▁▇▁▁ |
Data 2
Introduction and data
Identify the source of the data.
This dataset was created by The Echo Nest and LabROSA.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
It is not clearly stated where the data comes from, but my guess is that it is pulled from music streaming services, radio stations, and/or music charts.
Write a brief description of the observations.
The data contains standard information about the songs, such as artist name, title, and year released. It also contains more advanced information, for example the length of the song, how many musical bars the song has, and how long the fade-in lasts.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How are different music genres represented in different locations across the country?
A description of the research topic along with a concise statement of your hypotheses on this topic.
In urban centers and larger cities, you will find more diversity in the genres of music listened to.
Identify the types of variables in your research question. Categorical? Quantitative?
I think it would be mostly quantitative data: population numbers and percentages of listening by genre.
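One possible way to measure genre diversity by location is to count the distinct genre tags within each place. The base R sketch below does this on hypothetical toy rows. `artist.terms` follows the column name in the glimpse below, while `region` is an assumed grouping that does not exist in the dataset and would have to be derived, for example from `artist.latitude` and `artist.longitude`.

```r
# Hypothetical toy rows; `region` is an assumed grouping, not a dataset column.
toy <- data.frame(
  region = c("urban", "urban", "urban", "rural", "rural", "rural"),
  artist.terms = c("hip hop", "jazz", "rock", "country", "country", "rock")
)

# Number of distinct genre tags observed in each region.
genre_diversity <- tapply(
  toy$artist.terms, toy$region,
  function(x) length(unique(x))
)

print(genre_diversity)
```

A count of distinct tags is only one simple diversity measure; it ignores how evenly songs are spread across genres.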
Glimpse of data
# add code here
music <- read_csv("data/music.csv")
Rows: 10000 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): artist.id, artist.name, artist.terms, song.id
dbl (31): artist.familiarity, artist.hotttnesss, artist.latitude, artist.loc...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(music)
Name | music |
Number of rows | 10000 |
Number of columns | 35 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 31 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
artist.id | 0 | 1 | 18 | 18 | 0 | 3888 | 0 |
artist.name | 0 | 1 | 1 | 255 | 0 | 4412 | 0 |
artist.terms | 5 | 1 | 2 | 40 | 0 | 458 | 0 |
song.id | 0 | 1 | 18 | 51 | 0 | 10000 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
artist.familiarity | 0 | 1 | 0.57 | 0.16 | 0.00 | 0.47 | 0.56 | 0.67 | 1.00 | ▁▂▇▅▂ |
artist.hotttnesss | 0 | 1 | 0.39 | 0.14 | 0.00 | 0.33 | 0.38 | 0.45 | 1.08 | ▁▇▃▁▁ |
artist.latitude | 0 | 1 | 13.90 | 20.36 | -41.28 | 0.00 | 0.00 | 34.42 | 69.65 | ▁▇▁▃▁ |
artist.location | 0 | 1 | 0.08 | 7.80 | 0.00 | 0.00 | 0.00 | 0.00 | 780.00 | ▇▁▁▁▁ |
artist.longitude | 0 | 1 | -23.92 | 43.72 | -162.44 | -73.95 | 0.00 | 0.00 | 174.77 | ▁▂▇▁▁ |
artist.similar | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
artist.terms_freq | 0 | 1 | 224.89 | 22392.16 | 0.00 | 0.95 | 1.00 | 1.00 | 2239217.00 | ▇▁▁▁▁ |
release.id | 0 | 1 | 371024.06 | 236777.83 | 0.00 | 172858.00 | 333103.00 | 573532.50 | 823599.00 | ▇▇▅▆▅ |
release.name | 0 | 1 | 23.10 | 1322.90 | 0.00 | 0.00 | 0.00 | 0.00 | 85555.00 | ▇▁▁▁▁ |
song.artist_mbtags | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | ▇▁▁▁▁ |
song.artist_mbtags_count | 0 | 1 | 0.52 | 0.88 | 0.00 | 0.00 | 0.00 | 1.00 | 9.00 | ▇▁▁▁▁ |
song.bars_confidence | 0 | 1 | 0.24 | 0.29 | 0.00 | 0.04 | 0.12 | 0.35 | 8.86 | ▇▁▁▁▁ |
song.bars_start | 0 | 1 | 1.07 | 1.72 | 0.00 | 0.44 | 0.79 | 1.22 | 59.74 | ▇▁▁▁▁ |
song.beats_confidence | 0 | 1 | 0.61 | 0.32 | 0.00 | 0.41 | 0.69 | 0.88 | 1.00 | ▃▂▃▆▇ |
song.beats_start | 0 | 1 | 0.43 | 0.81 | -60.00 | 0.19 | 0.33 | 0.50 | 12.25 | ▁▁▁▁▇ |
song.duration | 0 | 1 | 240.62 | 246.08 | 1.04 | 176.03 | 223.06 | 276.38 | 22050.00 | ▇▁▁▁▁ |
song.end_of_fade_in | 0 | 1 | 0.76 | 1.86 | 0.00 | 0.00 | 0.20 | 0.42 | 43.12 | ▇▁▁▁▁ |
song.hotttnesss | 0 | 1 | -0.24 | 0.69 | -1.00 | -1.00 | 0.00 | 0.41 | 1.00 | ▇▁▃▆▂ |
song.key | 0 | 1 | 5.37 | 9.67 | 0.00 | 2.00 | 5.00 | 8.00 | 904.80 | ▇▁▁▁▁ |
song.key_confidence | 0 | 1 | 0.45 | 0.33 | 0.00 | 0.22 | 0.47 | 0.66 | 19.08 | ▇▁▁▁▁ |
song.loudness | 0 | 1 | -10.48 | 5.40 | -51.64 | -13.16 | -9.38 | -6.53 | 0.57 | ▁▁▁▆▇ |
song.mode | 0 | 1 | 0.69 | 0.46 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
song.mode_confidence | 0 | 1 | 0.48 | 0.19 | 0.00 | 0.36 | 0.49 | 0.61 | 1.00 | ▂▅▇▅▁ |
song.start_of_fade_out | 0 | 1 | 229.88 | 112.02 | -21.39 | 168.86 | 213.86 | 266.27 | 1813.43 | ▇▁▁▁▁ |
song.tatums_confidence | 0 | 1 | 0.51 | 0.33 | 0.00 | 0.24 | 0.50 | 0.77 | 9.23 | ▇▁▁▁▁ |
song.tatums_start | 0 | 1 | 0.30 | 0.51 | 0.00 | 0.11 | 0.19 | 0.29 | 12.25 | ▇▁▁▁▁ |
song.tempo | 0 | 1 | 122.90 | 35.20 | 0.00 | 96.96 | 120.16 | 144.01 | 262.83 | ▁▆▇▂▁ |
song.time_signature | 0 | 1 | 3.56 | 1.27 | 0.00 | 3.00 | 4.00 | 4.00 | 7.00 | ▂▁▇▁▁ |
song.time_signature_confidence | 0 | 1 | 0.60 | 8.99 | 0.00 | 0.10 | 0.55 | 0.86 | 898.89 | ▇▁▁▁▁ |
song.title | 0 | 1 | 10.01 | 945.49 | 0.00 | 0.00 | 0.00 | 0.00 | 94496.00 | ▇▁▁▁▁ |
song.year | 0 | 1 | 934.70 | 996.65 | 0.00 | 0.00 | 0.00 | 2000.00 | 2010.00 | ▇▁▁▁▇ |
Data 3
Introduction and data
Identify the source of the data.
The data is from the CDC.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data on county-level diagnosed diabetes was collected from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) and the US Census Bureau’s Population Estimates Program in 2018. The CDC used data from the years 2019 to 2020 collected by County Health Rankings for the data on free and reduced lunch enrollment. The data on children in poverty was collected from the 2020 Small Area Income and Poverty Estimates (SAIPE) Program by the US Census Bureau.
Write a brief description of the observations.
The observations are all the counties or county equivalents in the 50 US states and the District of Columbia. The values are reported as percentages for each county or county equivalent.
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How do access to food and poverty affect how often children are diagnosed with diabetes in specific counties in the United States?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic examines the poverty rate of counties, whether children there have access to free or reduced lunch, and the percentage of children in those areas diagnosed with diabetes.
Our hypothesis is that a higher poverty rate and a lack of access to food in a county are directly related to a higher percentage of children diagnosed with diabetes.
- Identify the types of variables in your research question. Categorical? Quantitative?
The types of variables in our research question are categorical (states and counties) and quantitative (percentage of children in poverty and enrolled in free or reduced lunch).
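In the glimpse below, the free or reduced lunch and children in poverty columns are read as character rather than numeric, and the two files share a `County_FIPS` column. A minimal base R sketch of converting such columns and joining the two tables follows. The toy rows, the short column names `lunch` and `poverty`, and the `"No Data"` placeholder are all assumptions for illustration; the actual reason the columns are read as text is not stated in the source.

```r
# Hypothetical toy rows mimicking the two files; values are illustrative only.
food <- data.frame(
  County_FIPS = c("00001", "00002", "00003"),
  lunch = c("45.2", "No Data", "81.0"),  # "No Data" is an assumed placeholder
  stringsAsFactors = FALSE
)
econ <- data.frame(
  County_FIPS = c("00001", "00002", "00003"),
  poverty = c("15.1", "13.4", "37.9"),
  stringsAsFactors = FALSE
)

# Convert text to numbers; entries that are not numeric become NA.
to_number <- function(x) suppressWarnings(as.numeric(x))
food$lunch <- to_number(food$lunch)
econ$poverty <- to_number(econ$poverty)

# Join the two tables on the shared county identifier.
counties <- merge(food, econ, by = "County_FIPS")

print(counties)
```

Keeping `County_FIPS` as character preserves leading zeros, which matters when the identifier is used as a join key.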
Glimpse of data
food_access_diabetes <- read_csv("data/food_access_diabetes.csv", skip = 2)
Rows: 3142 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): County_FIPS, County, State, Free or Reduced Lunch Enrollment
dbl (2): Year, Diagnosed Diabetes Percentage
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
economic_diabetes <- read_csv("data/economic_diabetes.csv", skip = 2)
Rows: 3142 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): County_FIPS, County, State, Children in Poverty
dbl (2): Year, Diagnosed Diabetes Percentage
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# add code here
skimr::skim(food_access_diabetes)
Name | food_access_diabetes |
Number of rows | 3142 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
County_FIPS | 0 | 1 | 5 | 5 | 0 | 3142 | 0 |
County | 0 | 1 | 10 | 31 | 0 | 1877 | 0 |
State | 0 | 1 | 4 | 20 | 0 | 51 | 0 |
Free or Reduced Lunch Enrollment | 0 | 1 | 1 | 7 | 0 | 754 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 2018.00 | 0.00 | 2018.0 | 2018.0 | 2018.0 | 2018.0 | 2018.0 | ▁▁▇▁▁ |
Diagnosed Diabetes Percentage | 0 | 1 | 8.72 | 1.79 | 3.8 | 7.3 | 8.4 | 9.7 | 17.9 | ▁▇▃▁▁ |
skimr::skim(economic_diabetes)
Name | economic_diabetes |
Number of rows | 3142 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
County_FIPS | 0 | 1 | 5 | 5 | 0 | 3142 | 0 |
County | 0 | 1 | 10 | 31 | 0 | 1877 | 0 |
State | 0 | 1 | 4 | 20 | 0 | 51 | 0 |
Children in Poverty | 0 | 1 | 1 | 7 | 0 | 401 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Year | 0 | 1 | 2018.00 | 0.00 | 2018.0 | 2018.0 | 2018.0 | 2018.0 | 2018.0 | ▁▁▇▁▁ |
Diagnosed Diabetes Percentage | 0 | 1 | 8.72 | 1.79 | 3.8 | 7.3 | 8.4 | 9.7 | 17.9 | ▁▇▃▁▁ |