Project title

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

  • Identify the source of the data.

    The data is provided by the Los Angeles Police Department (LAPD).

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The dataset is listed as “Community Created Content” on the Los Angeles open data website, where community users contribute their own data analyses. It was shared by Kyle Drives. The records were transcribed from original arrest reports (2010 to 2019) that were kept on computer and on paper.

  • Write a brief description of the observations.

    Each observation represents the booking of one arrestee for prostitution along N Western Ave in Los Angeles, California. If the location is missing, the data records it as (0.0000°, 0.0000°). Some addresses omit specific details to ensure safety and maintain privacy.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

1) Is child prostitution a significant problem in Hollywood?

2) In which areas along N Western Ave are illegal sexual activities most often conducted?

3) How do the races and genders of people arrested for prostitution activities compare?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

Over the past ten years, prostitution along Western Avenue has been a major problem, prompting the city's mayor and the Los Angeles Police Department to cooperate in an effort to stamp it out. The data frame records prostitution-related arrests along the street from 2010 onward. Understanding the dataset and answering these questions is important for predicting and preventing illegal prostitution in the area in the future. Our hypotheses are that 1) child prostitution is much less of a concern than adult prostitution, 2) the Western cross-section is the busiest place where this type of activity occurs, and 3) most of the arrestees are women of color.

  • Identify the types of variables in your research question. Categorical? Quantitative?

The main variables we analyze in the dataset are sex_code, cross_street, and descent_code (categorical) and age (quantitative).
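The variable names listed above are written in snake_case, while the raw CSV headers contain spaces (for example `Sex Code`). The following is a minimal sketch of one way to bridge that gap and take a first pass at the three questions. It runs on a few made-up rows rather than the real file. The street labels, the `to_snake()` helper, and the under-18 cutoff for "child" are assumptions for illustration only.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for data/Prostitution.csv (values are NOT real).
toy_arrests <- tibble(
  `Sex Code` = c("F", "F", "M", "F"),
  `Descent Code` = c("B", "H", "W", "B"),
  `Cross Street` = c("STREET A", "STREET B", "STREET A", "STREET A"),
  Age = c(17, 24, 31, 22)
)

# Hypothetical helper: lower-case a header and replace spaces with underscores.
to_snake <- function(x) gsub(" ", "_", tolower(x))

arrests <- toy_arrests %>% rename_with(to_snake)

# Question 1: share of arrestees under 18 (assumed cutoff for "child").
minor_share <- mean(arrests$age < 18)

# Question 2: cross streets ranked by number of arrests.
street_counts <- arrests %>% count(cross_street, sort = TRUE)

# Question 3: arrests broken down by sex and descent code.
sex_descent <- arrests %>% count(sex_code, descent_code)
```

On the real data, the same pipeline would start from `prostitute` instead of `toy_arrests`.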

Glimpse of data

prostitute <- read_csv("data/Prostitution.csv")
Rows: 1749 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Arrest Date, Time, Area ID, Area Name, Reporting District, Sex Cod...
dbl  (3): Report ID, Age, Charge Group Code

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(prostitute)
Data summary
Name prostitute
Number of rows 1749
Number of columns 16
_______________________
Column type frequency:
character 13
numeric 3
________________________
Group variables None

Variable type: character

skim_variable             n_missing  complete_rate  min  max  empty  n_unique  whitespace
Arrest Date                       0              1   10   10      0       882           0
Time                              0              1    4    4      0       189           0
Area ID                           0              1    2    2      0         2           0
Area Name                         0              1    7    9      0         2           0
Reporting District                0              1    4    4      0        57           0
Sex Code                          0              1    1    1      0         2           0
Descent Code                      0              1    1    1      0         6           0
Charge Group Description          0              1   19   19      0         1           0
Arrest Type Code                  0              1    1    1      0         2           0
Charge                            0              1    6   11      0         5           0
Charge Description                0              1   12   41      0         6           0
Address                           0              1    3   31      0       151           0
Cross Street                      0              1    7   34      0         8           0

Variable type: numeric

skim_variable      n_missing  complete_rate         mean           sd       p0      p25      p50      p75       p100  hist
Report ID                  0              1  21602942.25  52341261.00  2118402  3222481  4007914  5132051  192022138  ▇▁▁▁▁
Age                        0              1        26.67         8.82       13       20       24       30         67  ▇▆▂▁▁
Charge Group Code          0              1        13.00         0.00       13       13       13       13         13  ▁▁▇▁▁

Data 2

Introduction and data

  • Identify the source of the data.

    This dataset was created by the Echo Nest and LabROSA labs.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    It is not clearly stated where the data comes from, but my guess is that it is pulled from music streaming services, radio stations, and/or music charts.

  • Write a brief description of the observations.

    The data contains standard information about each song, such as artist name, title, and release year. It also contains more advanced information, for example the length of the song, how many musical bars long the song is, and how long the fade-in to the song is.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    How are different music genres represented in different locations across the country?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    In urban centers and larger cities you will find more diversity in the genres of music listened to.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    I think the data would be mostly quantitative (population numbers and percentages of listening by genre), while genre itself is categorical.
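As a rough illustration of how genre and location could be tabulated, the sketch below counts genres among artists that have a usable location. It uses made-up rows with the column names `artist.terms`, `artist.latitude`, and `artist.longitude` from the data file. It assumes that `artist.terms` holds the genre label and that a latitude and longitude of exactly 0 together mark an unknown location; neither assumption is confirmed by the data source.

```r
library(dplyr)
library(tibble)

# Made-up rows standing in for data/music.csv (values are NOT real).
toy_music <- tibble(
  artist.terms = c("hip hop", "blues-rock", "hip hop", "country rock", NA),
  artist.latitude = c(34.05, 0, 40.71, 35.15, 41.88),
  artist.longitude = c(-118.24, 0, -74.01, -90.05, -87.63)
)

# Keep rows with a genre label and with coordinates other than (0, 0).
# Assumption: (0, 0) is a placeholder for "location unknown".
located <- toy_music %>%
  filter(
    !(artist.latitude == 0 & artist.longitude == 0),
    !is.na(artist.terms)
  )

# Genres ranked by how many located rows carry them.
genre_counts <- located %>% count(artist.terms, sort = TRUE)

# Number of distinct genres among located rows, a simple diversity measure.
n_genres <- n_distinct(located$artist.terms)
```

Comparing urban and non-urban areas would additionally require mapping coordinates to places and population figures, which this dataset does not appear to include.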

Glimpse of data

music <- read_csv("data/music.csv")
Rows: 10000 Columns: 35
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): artist.id, artist.name, artist.terms, song.id
dbl (31): artist.familiarity, artist.hotttnesss, artist.latitude, artist.loc...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(music)
Data summary
Name music
Number of rows 10000
Number of columns 35
_______________________
Column type frequency:
character 4
numeric 31
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
artist.id 0 1 18 18 0 3888 0
artist.name 0 1 1 255 0 4412 0
artist.terms 5 1 2 40 0 458 0
song.id 0 1 18 51 0 10000 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
artist.familiarity 0 1 0.57 0.16 0.00 0.47 0.56 0.67 1.00 ▁▂▇▅▂
artist.hotttnesss 0 1 0.39 0.14 0.00 0.33 0.38 0.45 1.08 ▁▇▃▁▁
artist.latitude 0 1 13.90 20.36 -41.28 0.00 0.00 34.42 69.65 ▁▇▁▃▁
artist.location 0 1 0.08 7.80 0.00 0.00 0.00 0.00 780.00 ▇▁▁▁▁
artist.longitude 0 1 -23.92 43.72 -162.44 -73.95 0.00 0.00 174.77 ▁▂▇▁▁
artist.similar 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
artist.terms_freq 0 1 224.89 22392.16 0.00 0.95 1.00 1.00 2239217.00 ▇▁▁▁▁
release.id 0 1 371024.06 236777.83 0.00 172858.00 333103.00 573532.50 823599.00 ▇▇▅▆▅
release.name 0 1 23.10 1322.90 0.00 0.00 0.00 0.00 85555.00 ▇▁▁▁▁
song.artist_mbtags 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.33 ▇▁▁▁▁
song.artist_mbtags_count 0 1 0.52 0.88 0.00 0.00 0.00 1.00 9.00 ▇▁▁▁▁
song.bars_confidence 0 1 0.24 0.29 0.00 0.04 0.12 0.35 8.86 ▇▁▁▁▁
song.bars_start 0 1 1.07 1.72 0.00 0.44 0.79 1.22 59.74 ▇▁▁▁▁
song.beats_confidence 0 1 0.61 0.32 0.00 0.41 0.69 0.88 1.00 ▃▂▃▆▇
song.beats_start 0 1 0.43 0.81 -60.00 0.19 0.33 0.50 12.25 ▁▁▁▁▇
song.duration 0 1 240.62 246.08 1.04 176.03 223.06 276.38 22050.00 ▇▁▁▁▁
song.end_of_fade_in 0 1 0.76 1.86 0.00 0.00 0.20 0.42 43.12 ▇▁▁▁▁
song.hotttnesss 0 1 -0.24 0.69 -1.00 -1.00 0.00 0.41 1.00 ▇▁▃▆▂
song.key 0 1 5.37 9.67 0.00 2.00 5.00 8.00 904.80 ▇▁▁▁▁
song.key_confidence 0 1 0.45 0.33 0.00 0.22 0.47 0.66 19.08 ▇▁▁▁▁
song.loudness 0 1 -10.48 5.40 -51.64 -13.16 -9.38 -6.53 0.57 ▁▁▁▆▇
song.mode 0 1 0.69 0.46 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
song.mode_confidence 0 1 0.48 0.19 0.00 0.36 0.49 0.61 1.00 ▂▅▇▅▁
song.start_of_fade_out 0 1 229.88 112.02 -21.39 168.86 213.86 266.27 1813.43 ▇▁▁▁▁
song.tatums_confidence 0 1 0.51 0.33 0.00 0.24 0.50 0.77 9.23 ▇▁▁▁▁
song.tatums_start 0 1 0.30 0.51 0.00 0.11 0.19 0.29 12.25 ▇▁▁▁▁
song.tempo 0 1 122.90 35.20 0.00 96.96 120.16 144.01 262.83 ▁▆▇▂▁
song.time_signature 0 1 3.56 1.27 0.00 3.00 4.00 4.00 7.00 ▂▁▇▁▁
song.time_signature_confidence 0 1 0.60 8.99 0.00 0.10 0.55 0.86 898.89 ▇▁▁▁▁
song.title 0 1 10.01 945.49 0.00 0.00 0.00 0.00 94496.00 ▇▁▁▁▁
song.year 0 1 934.70 996.65 0.00 0.00 0.00 2000.00 2010.00 ▇▁▁▁▇

Data 3

Introduction and data

  • Identify the source of the data.

    The data is from the CDC.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data on county-level diagnosed diabetes was collected from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) and the US Census Bureau’s Population Estimates Program in 2018. The CDC used data from the years 2019 to 2020 collected by County Health Rankings for the data on free and reduced lunch enrollment. The data on children in poverty was collected from the 2020 Small Area Income and Poverty Estimates (SAIPE) Program by the US Census Bureau.

  • Write a brief description of the observations.

    Each observation is a county or county equivalent in the 50 US states and the District of Columbia. The values are reported as percentages for each county or county equivalent.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How do access to food and poverty affect how often children are diagnosed with diabetes in specific counties in the United States?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic looks at the poverty rate of counties, whether children there have access to free or reduced lunch, and the percentage of children in those areas diagnosed with diabetes.

Our hypothesis is that a higher poverty rate and a lack of access to food in a county are directly related to a higher percentage of children diagnosed with diabetes.

  • Identify the types of variables in your research question. Categorical? Quantitative?

The types of variables in our research question are categorical (states and counties) and quantitative (percentage of children in poverty and enrolled in free or reduced lunch).
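Both percentage columns are read in as character (`chr`), so they would need to be converted to numbers before they can be treated as quantitative. The sketch below shows one possible way to join the two tables on `County_FIPS` and parse the percentages with `readr::parse_number()`. The rows are made up, and the text formats ("45.2%" and a "No Data" placeholder) are assumptions about how the raw files encode values, as are the new column names `lunch_pct` and `poverty_pct`.

```r
library(dplyr)
library(readr)
library(tibble)

# Made-up rows standing in for the two CSV files (values are NOT real).
toy_food <- tibble(
  County_FIPS = c("00001", "00002", "00003"),
  `Free or Reduced Lunch Enrollment` = c("45.2%", "No Data", "61%"),
  `Diagnosed Diabetes Percentage` = c(9.1, 8.4, 12.3)
)
toy_econ <- tibble(
  County_FIPS = c("00001", "00002", "00003"),
  `Children in Poverty` = c("18.5%", "14%", "No Data")
)

combined <- toy_food %>%
  # One row per county, matched on the FIPS code.
  inner_join(toy_econ, by = "County_FIPS") %>%
  mutate(
    # Strip the percent sign; the assumed "No Data" placeholder becomes NA.
    lunch_pct = parse_number(`Free or Reduced Lunch Enrollment`, na = "No Data"),
    poverty_pct = parse_number(`Children in Poverty`, na = "No Data")
  )
```

With numeric columns in place, the relationship with `Diagnosed Diabetes Percentage` could then be explored with scatterplots or a correlation, dropping counties where either percentage is missing.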

Glimpse of data

food_access_diabetes <- read_csv("data/food_access_diabetes.csv", skip = 2)
Rows: 3142 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): County_FIPS, County, State, Free or Reduced Lunch Enrollment
dbl (2): Year, Diagnosed Diabetes Percentage

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
economic_diabetes <- read_csv("data/economic_diabetes.csv", skip = 2)
Rows: 3142 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): County_FIPS, County, State, Children in Poverty
dbl (2): Year, Diagnosed Diabetes Percentage

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skimr::skim(food_access_diabetes)
Data summary
Name food_access_diabetes
Number of rows 3142
Number of columns 6
_______________________
Column type frequency:
character 4
numeric 2
________________________
Group variables None

Variable type: character

skim_variable                     n_missing  complete_rate  min  max  empty  n_unique  whitespace
County_FIPS                               0              1    5    5      0      3142           0
County                                    0              1   10   31      0      1877           0
State                                     0              1    4   20      0        51           0
Free or Reduced Lunch Enrollment          0              1    1    7      0       754           0

Variable type: numeric

skim_variable                  n_missing  complete_rate     mean    sd      p0     p25     p50     p75    p100  hist
Year                                   0              1  2018.00  0.00  2018.0  2018.0  2018.0  2018.0  2018.0  ▁▁▇▁▁
Diagnosed Diabetes Percentage          0              1     8.72  1.79     3.8     7.3     8.4     9.7    17.9  ▁▇▃▁▁
skimr::skim(economic_diabetes)
Data summary
Name economic_diabetes
Number of rows 3142
Number of columns 6
_______________________
Column type frequency:
character 4
numeric 2
________________________
Group variables None

Variable type: character

skim_variable        n_missing  complete_rate  min  max  empty  n_unique  whitespace
County_FIPS                  0              1    5    5      0      3142           0
County                       0              1   10   31      0      1877           0
State                        0              1    4   20      0        51           0
Children in Poverty          0              1    1    7      0       401           0

Variable type: numeric

skim_variable                  n_missing  complete_rate     mean    sd      p0     p25     p50     p75    p100  hist
Year                                   0              1  2018.00  0.00  2018.0  2018.0  2018.0  2018.0  2018.0  ▁▁▇▁▁
Diagnosed Diabetes Percentage          0              1     8.72  1.79     3.8     7.3     8.4     9.7    17.9  ▁▇▃▁▁