Awesome Evee

Proposal

library(tidyverse)
library(skimr)

Data 1

Introduction and data

Identify the source of the data.
- The source of the data is the United States Census Bureau, from https://data.census.gov/table?tid=DECENNIALPL2020.P1.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Many datasets are available there, but the one I am specifically looking at is Decennial Census P1|Race from 2020. It was collected through the decennial census, in which the Census Bureau conducts a complete count of every person living in the United States every ten years, as mandated by the Constitution.
Write a brief description of the observations.
- This dataset contains the total population of each US state (plus the District of Columbia and Puerto Rico), as well as the number of people in each who belong to a given race or combination of races.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Do certain parts of the country have higher concentrations of certain races or combinations of races?
- Do states with liberal or conservative governors have a higher percentage of residents of a certain race? (This requires adding a column for each state's governor and political standing, which can be done easily.)
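Adding the governor information could look roughly like the sketch below. The table `governors`, the column `governor_party`, and every value shown are hypothetical placeholders, not real data; the only idea illustrated is a join on the state name.

```r
library(dplyr)

# Hypothetical state-level percentages (made-up values for illustration).
state_race <- tibble(
  state = c("Alabama", "Alaska", "Arizona"),
  pct_black = c(25.0, 3.0, 4.5)
)

# Hypothetical lookup table; party labels are placeholders, not real data.
governors <- tibble(
  state = c("Alabama", "Alaska", "Arizona"),
  governor_party = c("Party A", "Party B", "Party A")
)

# Attach the governor's party to each state by matching on the state name.
state_with_party <- state_race %>%
  left_join(governors, by = "state")
```

A left join keeps every state from the census table even if a governor entry is missing, which makes gaps in the lookup table easy to spot as `NA`.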
A description of the research topic along with a concise statement of your hypotheses on this topic.
- My target population is people who want to study where different races live in the US and whether certain races are concentrated in certain parts of the country. Additionally, this could serve people who want to study how a state's primary political ideology (which could be determined from the political ideology of its statewide leaders) correlates with the number of residents of a certain race in that state.
- Hypothesis 1: States farther north will have a higher percentage of white residents than states farther south.
- Hypothesis 2: States with liberal governors will have a higher percentage of African American residents than states with conservative governors.
Identify the types of variables in your research question. Categorical? Quantitative?
- My research questions involve both categorical and quantitative variables. The quantitative variables are the population counts, which will be converted to percentages so that states of different sizes can be compared. The categorical variables are state, race, governor, and political ideology.
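As a minimal sketch of the count-to-percentage conversion, assuming the counts have already been arranged with one row per state and race category (the table `counts` and its values are made up for illustration):

```r
library(dplyr)

# Made-up counts in long format: one row per state and race category.
counts <- tibble(
  state = c("State A", "State A", "State B", "State B"),
  race = c("Group 1", "Group 2", "Group 1", "Group 2"),
  population = c(600, 400, 150, 350)
)

# Within each state, express every category as a percentage of that state's total.
shares <- counts %>%
  group_by(state) %>%
  mutate(percent = 100 * population / sum(population)) %>%
  ungroup()
```

Note that in the real table the categories overlap (totals and sub-totals appear as rows), so the denominator would need to be the state's total-population row rather than a sum over all rows.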

Glimpse of data

# Read the 2020 Decennial Census P1 (Race) table
race_census_raw <- read.csv("data/DECENNIALPL2020.P1-2023-03-14T231744.csv")

skimr::skim(race_census_raw)

Data summary
Name	race_census_raw
Number of rows	71
Number of columns	53
_______________________
Column type frequency:
character	53
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Label..Grouping.	1	6	147	71
Alabama	1	1	9	64
Alaska	1	1	7	63
Arizona	1	1	9	68
Arkansas	1	1	9	61
California	1	2	10	70
Colorado	1	1	9	69
Connecticut	1	1	9	65
Delaware	1	1	7	57
District.of.Columbia	1	1	7	57
Florida	1	2	10	69
Georgia	1	1	10	67
Hawaii	1	1	9	69
Idaho	1	1	9	60
Illinois	1	1	10	66
Indiana	1	1	9	69
Iowa	1	1	9	62
Kansas	1	1	9	61
Kentucky	1	1	9	64
Louisiana	1	1	9	66
Maine	1	1	9	55
Maryland	1	1	9	68
Massachusetts	1	1	9	65
Michigan	1	1	10	65
Minnesota	1	1	9	67
Mississippi	1	1	9	62
Missouri	1	1	9	66
Montana	1	1	9	54
Nebraska	1	1	9	64
Nevada	1	1	9	68
New.Hampshire	1	1	9	58
New.Jersey	1	2	9	70
New.Mexico	1	1	9	64
New.York	1	2	10	69
North.Carolina	1	1	10	68
North.Dakota	1	1	7	56
Ohio	1	1	10	68
Oklahoma	1	1	9	63
Oregon	1	1	9	66
Pennsylvania	1	1	10	64
Rhode.Island	1	1	9	62
South.Carolina	1	1	9	65
South.Dakota	1	1	7	55
Tennessee	1	1	9	64
Texas	1	1	10	69
Utah	1	1	9	67
Vermont	1	1	7	50
Virginia	1	1	9	68
Washington	1	1	9	70
West.Virginia	1	1	9	59
Wisconsin	1	1	9	64
Wyoming	1	1	7	51
Puerto.Rico	1	1	9	62
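
The summary above shows that all 53 columns were read as character, with one column per state and one row per label. A possible way to reshape this for analysis is sketched below on a toy stand-in; the table `toy_raw`, its labels, and its counts are made up, and the assumption that the real counts are strings with comma separators should be checked against the file.

```r
library(dplyr)
library(tidyr)
library(readr)

# Toy stand-in with the same shape as race_census_raw (values are made up).
toy_raw <- data.frame(
  Label..Grouping. = c("Total:", "White alone", "Black or African American alone"),
  Alabama = c("1,000", "600", "400"),
  New.York = c("2,500", "1,500", "1,000"),
  stringsAsFactors = FALSE
)

race_long <- toy_raw %>%
  # Turn the state columns into rows: one row per label and state.
  pivot_longer(
    cols = -Label..Grouping.,
    names_to = "state",
    values_to = "population"
  ) %>%
  rename(category = Label..Grouping.) %>%
  mutate(
    # read.csv replaced spaces in state names with dots; undo that.
    state = gsub(".", " ", state, fixed = TRUE),
    # Convert strings such as "1,000" to numbers.
    population = parse_number(population)
  )
```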

Data 2

Introduction and data

Identify the source of the data.
- Billionaires CSV File - From the CORGIS Data-set Project
  By Ryan Whitcomb - Version 2.0.0, created 5-17-16
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Researchers have compiled a multi-decade database of the super-rich. Building off the Forbes World’s Billionaires lists from 1996-2014, scholars at Peterson Institute for International Economics have added a couple dozen more variables about each billionaire - including whether they were self-made or inherited their wealth. (Roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family, the authors estimate.)
Write a brief description of the observations.
- A compilation of data on billionaires. It records the source of their wealth in the US, Europe, and other regions, including whether that wealth is self-made or inherited.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How do billionaires in the US and Europe differ in the sources of their wealth, the industries they work in, and their worth in billions?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- This research topic explores the sources of billionaires' wealth, whether inherited or self-made, and how that wealth is distributed across industries, from technology to finance to commodity products. We will also compare the super-wealthy in America and Europe: whether they concentrate in different industries, whether they became billionaires in different ways, what is responsible for the wealth in each region, and how worth differs between regions.
- We hypothesize that American billionaires are more often self-made and work in a greater diversity of industries, whereas European billionaires more often become wealthy through inheritance and work mainly in the technology or financial sectors.
Identify the types of variables in your research question. Categorical? Quantitative?
- company.sector - categorical
- wealth.how.industry - categorical
- location.region - categorical
- wealth.type - categorical
- wealth.how.inherited - categorical
- wealth.worth in billions - quantitative
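
One way the self-made versus inherited comparison could be set up is sketched below. The table `toy_billionaires` and its rows are made up, and treating the label "not inherited" as the marker for self-made wealth is an assumption to verify against the actual values of `wealth.how.inherited`.

```r
library(dplyr)

# Toy stand-in for `billionaires`; rows and values are made up.
toy_billionaires <- tibble(
  location.region = c("North America", "North America", "North America",
                      "Europe", "Europe"),
  wealth.how.inherited = c("not inherited", "not inherited", "father",
                           "father", "not inherited")
)

# Proportion of self-made billionaires in each region.
inherit_share <- toy_billionaires %>%
  mutate(self_made = wealth.how.inherited == "not inherited") %>%
  group_by(location.region) %>%
  summarise(n = n(), prop_self_made = mean(self_made))
```

Because the real data spans several list years, the same person can appear more than once; filtering to a single `year` before summarising may be worth considering.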

Glimpse of data

# Read the CORGIS billionaires data
billionaires <- read_csv("data/billionaires.csv")

Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl  (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl  (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(billionaires)

Data summary
Name	billionaires
Number of rows	2614
Number of columns	22
_______________________
Column type frequency:
character	13
logical	3
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	5	45	2077
company.name	38	0.99	3	59	1576
company.relationship	46	0.98	3	46	73
company.sector	23	0.99	3	52	505
company.type	36	0.99	3	22	15
demographics.gender	34	0.99	4	14	3
location.citizenship	0	1.00	4	20	73
location.country code	0	1.00	3	6	74
location.region	0	1.00	1	24	8
wealth.type	22	0.99	9	24	5
wealth.how.category	1	1.00	1	18	9
wealth.how.industry	1	1.00	1	31	19
wealth.how.inherited	0	1.00	6	24	6

Variable type: logical

skim_variable	complete_rate	mean	count
wealth.how.from emerging	1	1	TRU: 2614
wealth.how.was founder	1	1	TRU: 2614
wealth.how.was political	1	1	TRU: 2614

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
rank	1	5.996700e+02	4.678900e+02	1	215.0	430	9.880e+02	1.565e+03	▇▅▃▂▃
year	1	2.008410e+03	7.480000e+00	1996	2001.0	2014	2.014e+03	2.014e+03	▂▂▁▁▇
company.founded	1	1.924710e+03	2.437800e+02	0	1936.0	1963	1.985e+03	2.012e+03	▁▁▁▁▇
demographics.age	1	5.334000e+01	2.533000e+01	-42	47.0	59	7.000e+01	9.800e+01	▁▂▁▇▃
location.gdp	1	1.769103e+12	3.547083e+12	0	0.0	0	7.250e+11	1.060e+13	▇▁▁▁▁
wealth.worth in billions	1	3.530000e+00	5.090000e+00	1	1.4	2	3.500e+00	7.600e+01	▇▁▁▁▁

Data 3

Introduction and data

Identify the source of the data.

The data comes from Sam Donald, through the CORGIS Dataset Project.

State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was published on 10/28/2022. It was collected from professional coffee tasters, who rated different coffees.

Write a brief description of the observations.

The data includes observations about where the coffee was grown, the production year, how much coffee was tested, and the ratings given across a variety of categories.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- How does altitude of farms affect coffee aroma?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- This research topic focuses on coffee taste and smell. Coffee taste and smell are affected by many different factors; we will be looking at one of them, altitude. We hypothesize that coffee grown at higher altitudes will have a stronger aroma.
Identify the types of variables in your research question. Categorical? Quantitative?
- Both variables are quantitative: aroma (scored 1-10) and altitude (ft/m).
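
A first look at the altitude and aroma relationship might be sketched as follows. The summary of this dataset reports altitudes ranging from 0 to 190164, so the sketch drops implausible values before computing a correlation. The table `toy_coffee`, its aroma values, and the 5000 cutoff are assumptions made for illustration only.

```r
library(dplyr)

# Toy stand-in for `coffee`; aroma scores are made up.
toy_coffee <- tibble(
  Location.Altitude.Average = c(0, 950, 1300, 1600, 190164),
  Data.Scores.Aroma = c(7.4, 7.5, 7.6, 7.8, 7.5)
)

# Keep only plausible altitudes (the upper bound of 5000 is an assumed cutoff).
coffee_clean <- toy_coffee %>%
  filter(Location.Altitude.Average > 0, Location.Altitude.Average < 5000)

# Pearson correlation between altitude and aroma score.
altitude_aroma_cor <- cor(
  coffee_clean$Location.Altitude.Average,
  coffee_clean$Data.Scores.Aroma
)
```

A scatterplot of the two variables would be a natural companion to the correlation, since a single coefficient can hide a non-linear pattern.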

Glimpse of data

# Read the CORGIS coffee data
library(readr)
coffee <- read_csv("data/coffee.csv")

Rows: 989 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): Location.Country, Location.Region, Data.Owner, Data.Type.Species, ...
dbl (16): Location.Altitude.Min, Location.Altitude.Max, Location.Altitude.Av...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skimr::skim(coffee)

Data summary
Name	coffee
Number of rows	989
Number of columns	23
_______________________
Column type frequency:
character	7
numeric	16
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
Location.Country	1	4	28	32
Location.Region	1	3	76	278
Data.Owner	1	3	50	263
Data.Type.Species	1	7	7	2
Data.Type.Variety	1	3	21	28
Data.Type.Processing method	1	3	25	6
Data.Color	1	4	12	5

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Location.Altitude.Min	1	1640.08	9192.52	0	905.00	1300.00	1550.00	190164.00	▇▁▁▁▁
Location.Altitude.Max	1	1675.93	9191.96	0	950.00	1310.00	1600.00	190164.00	▇▁▁▁▁
Location.Altitude.Average	1	1658.00	9192.06	0	950.00	1300.00	1600.00	190164.00	▇▁▁▁▁
Year	1	2013.55	1.66	2010	2012.00	2013.00	2015.00	2018.00	▁▇▃▃▁
Data.Production.Number of bags	1	151.76	125.67	1	15.00	170.00	275.00	600.00	▇▁▇▁▁
Data.Production.Bag weight	1	210.49	1666.71	0	1.00	60.00	69.00	19200.00	▇▁▁▁▁
Data.Scores.Aroma	1	7.57	0.40	0	7.42	7.58	7.75	8.75	▁▁▁▁▇
Data.Scores.Flavor	1	7.52	0.42	0	7.33	7.50	7.75	8.83	▁▁▁▁▇
Data.Scores.Aftertaste	1	7.39	0.43	0	7.25	7.42	7.58	8.67	▁▁▁▁▇
Data.Scores.Acidity	1	7.54	0.40	0	7.33	7.58	7.75	8.75	▁▁▁▁▇
Data.Scores.Body	1	7.51	0.39	0	7.33	7.50	7.67	8.50	▁▁▁▁▇
Data.Scores.Balance	1	7.50	0.43	0	7.33	7.50	7.75	8.58	▁▁▁▁▇
Data.Scores.Uniformity	1	9.82	0.59	0	10.00	10.00	10.00	10.00	▁▁▁▁▇
Data.Scores.Sweetness	1	9.83	0.69	0	10.00	10.00	10.00	10.00	▁▁▁▁▇
Data.Scores.Moisture	1	0.09	0.04	0	0.10	0.11	0.12	0.28	▃▇▆▁▁
Data.Scores.Total	1	81.97	3.86	0	81.08	82.50	83.58	90.58	▁▁▁▁▇