Hotelies & Co. 2950 Project

Proposal

library(tidyverse)
library(skimr)
library(foreign)

Data 1

Introduction and data

Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.

This dataset was created by our team by scraping as much information as we could from the Twitter pages of the 440 Congressional members whose pages we were able to find. The data wasn’t “collected” by traditional means; we used natural language processing to score the sentiment of each tweet by analyzing key words and returning a rating between -1 (negative) and 1 (positive). There are 20,352 observations (one per tweet) across all of these Congress members, with useful variables including the sentiment rating, total views of individual tweets, the political party affiliation of each Congress member, the time of day a tweet was posted, retweets, hashtags, URL links, and replies to other tweets. Ethical concerns are negligible because these tweets are public (and shared by Congress members at that).
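
As a rough illustration of what a keyword-based rating between -1 and 1 can look like, here is a minimal sketch in base R. It is not the scoring method used to build this dataset: the word lists and the `score_sentiment()` helper are hypothetical and exist only for this example.

```r
# Toy lexicons: hypothetical, for illustration only
positive_words <- c("great", "proud", "support", "win")
negative_words <- c("attack", "fail", "crisis", "bad")

# Score = (positive hits - negative hits) / (all hits), so it lies in [-1, 1].
# Text with no lexicon words gets a neutral score of 0.
score_sentiment <- function(text) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  n_pos <- sum(words %in% positive_words)
  n_neg <- sum(words %in% negative_words)
  if (n_pos + n_neg == 0) {
    return(0)
  }
  (n_pos - n_neg) / (n_pos + n_neg)
}

score_sentiment("Proud to support this great bill")
score_sentiment("The far left continues to attack")
```

A real scorer would also need to handle negation, intensity, and context, which this toy version ignores.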

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- What effects do positive and negative sentiment in tweets, and the use of buzzwords or hot-button issues within those tweets, have on total user engagement (both in the aggregate and proportional to existing follower counts) for members of Congress? Are Republicans or Democrats more likely to tweet positively?
- Findings from this dataset would be useful to any political figure who wants to curate content that appeals to the public. Increasing online presence is a great way to gain attention and political traction.
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Hypothesis: 1. Tweets with stronger negative sentiments that also contain negative buzzwords will likely receive the most engagement from Twitter users both in the aggregate and proportional to the follower count of an individual Congress member. 2. Republicans are more likely to resort to negative sentiment in their tweets.
Identify the types of variables in your research question. Categorical? Quantitative?
- Buzzwords: Categorical
- Follower Count: Quantitative
- Sentiment Rating: Quantitative
- View Count: Quantitative
- Retweet Count: Quantitative
- Party affiliation: Categorical
- Time posted: Quantitative (stored as text in `created_at`, so it must be parsed before use)
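
One possible way to turn the variables above into the quantities named in the research question is sketched below. The definition of engagement as favorites plus replies plus retweets is our assumption, not something fixed by the data. The toy rows are invented, although the column names match the data set, and the `"D"` party code is assumed.

```r
library(dplyr)

# Toy rows standing in for the real data; values are invented for illustration.
# Column names follow the data set; the "D" affiliation code is an assumption.
tweets <- tibble(
  affiliation = c("R", "R", "D", "D"),
  sentiment = c(0.14, -0.30, 0.25, 0.05),
  favorite_count = c(8, 30, 12, 4),
  reply_count = c(7, 31, 9, 9),
  retweet_count = c(2, 3, 4, 1),
  user_followers_count = c(32638, 32638, 10000, 10000)
)

# Aggregate engagement and engagement relative to follower count
tweets <- tweets %>%
  mutate(
    engagement = favorite_count + reply_count + retweet_count,
    engagement_rate = engagement / user_followers_count
  )

# Average sentiment and engagement rate by party
party_summary <- tweets %>%
  group_by(affiliation) %>%
  summarise(
    mean_sentiment = mean(sentiment),
    mean_engagement_rate = mean(engagement_rate)
  )

party_summary
```

View counts could be added to the engagement total or used as an alternative denominator; which choice is most meaningful is an open question for the analysis.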

Glimpse of data

data

# Load data set
twitter <- read_csv("data/reps_tweets_rd.csv")

Rows: 20352 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (28): created_at, full_text, hashtags_0, hashtags_1, hashtags_2, media_0...
dbl (16): conversation_id, favorite_count, id, reply_count, retweet_count, u...
lgl  (9): is_quote_tweet, user_default_profile_image, user_ext_has_nft_avata...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(twitter)

Rows: 20,352
Columns: 53
$ conversation_id                   <dbl> 1.63570e+18, 1.63386e+18, 1.63353e+1…
$ created_at                        <chr> "3/14/2023 17:46", "3/9/2023 16:11",…
$ favorite_count                    <dbl> 8, 7, 5, 4, 11, 8, 12, 30, 16, 31, 2…
$ full_text                         <chr> "As the far left continues to attack…
$ hashtags_0                        <chr> NA, "BeAllYouCanBe", NA, NA, NA, NA,…
$ hashtags_1                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ hashtags_2                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ id                                <dbl> 1.63570e+18, 1.63386e+18, 1.63353e+1…
$ is_quote_tweet                    <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FA…
$ media_0_media_url                 <chr> "https://pbs.twimg.com/media/FrMreHw…
$ media_0_type                      <chr> "photo", NA, "photo", "photo", "phot…
$ reply_count                       <dbl> 7, 11, 3, 9, 12, 15, 9, 31, 13, 22, …
$ replying_to_tweet                 <chr> NA, NA, "https://twitter.com/Robert_…
$ retweet_count                     <dbl> 2, 1, 1, 1, 4, 1, 4, 3, 5, 9, 7, 919…
$ start_url                         <chr> "https://twitter.com/Robert_Aderholt…
$ url                               <chr> "https://twitter.com/Robert_Aderholt…
$ urls_0_display_url                <chr> NA, NA, NA, NA, NA, NA, "newsnationn…
$ urls_0_expanded_url               <chr> NA, NA, NA, NA, NA, NA, "https://www…
$ urls_0_url                        <chr> NA, NA, NA, NA, NA, NA, "https://t.c…
$ user_created_at                   <chr> "9/22/2009 21:16", "9/22/2009 21:16"…
$ user_default_profile_image        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ user_description                  <chr> "Proudly serving Alabama's 4th Distr…
$ user_ext_has_nft_avatar           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ user_favourites_count             <dbl> 532, 532, 532, 532, 532, 532, 532, 5…
$ user_followers_count              <dbl> 32638, 32638, 32638, 32638, 32638, 3…
$ user_friends_count                <dbl> 506, 506, 506, 506, 506, 506, 506, 5…
$ user_geo_enabled                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ user_has_custom_timelines         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ user_id_str                       <dbl> 76452765, 76452765, 76452765, 764527…
$ user_is_translation_enabled       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ user_is_translator                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ user_listed_count                 <dbl> 1249, 1249, 1249, 1249, 1249, 1249, …
$ user_location                     <chr> "Haleyville, AL", "Haleyville, AL", …
$ user_media_count                  <dbl> 671, 671, 671, 671, 671, 671, 671, 6…
$ user_name                         <chr> "Robert Aderholt", "Robert Aderholt"…
$ user_normal_followers_count       <dbl> 32638, 32638, 32638, 32638, 32638, 3…
$ user_possibly_sensitive           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ user_profile_banner_url           <chr> "https://pbs.twimg.com/profile_banne…
$ user_profile_image_url            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ user_profile_image_url_https      <chr> "https://pbs.twimg.com/profile_image…
$ user_profile_sidebar_border_color <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ user_screen_name                  <chr> "Robert_Aderholt", "Robert_Aderholt"…
$ user_statuses_count               <dbl> 2905, 2905, 2905, 2905, 2905, 2905, …
$ user_translator_type              <chr> "none", "none", "none", "none", "non…
$ user_url                          <chr> "https://t.co/g8ir7PXxD4", "https://…
$ user_verified                     <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ user_verified_type                <chr> "Government", "Government", "Governm…
$ user_mentions_0_id_str            <dbl> NA, 8.775672e+06, NA, NA, NA, NA, NA…
$ user_mentions_0_name              <chr> NA, "U.S. Army", NA, NA, NA, NA, NA,…
$ user_mentions_0_screen_name       <chr> NA, "USArmy", NA, NA, NA, NA, NA, NA…
$ view_count                        <dbl> 641, 1468, 2395, 467, 833, 1160, 887…
$ affiliation                       <chr> "R", "R", "R", "R", "R", "R", "R", "…
$ sentiment                         <dbl> 0.14166667, 0.24567691, 0.12044326, …
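
Two details in the output above may need attention before analysis: `id` and `conversation_id` are read as doubles, which cannot represent 19-digit tweet IDs exactly, and `created_at` is read as text. The sketch below shows one way to handle both. It uses a two-row literal CSV with made-up IDs rather than the real file, and it assumes readr 2.0 or later for the `I()` literal-data syntax.

```r
library(readr)
library(lubridate)

# Two distinct (made-up) 19-digit IDs collapse to the same double
as.numeric("1635700000000000001") == as.numeric("1635700000000000002")

# Literal CSV standing in for the real file; ID values are invented
csv_text <- I(paste(
  "id,created_at",
  "1635700000000000001,3/14/2023 17:46",
  "1635700000000000002,3/9/2023 16:11",
  sep = "\n"
))

# Read IDs as character so no digits are lost
tweets <- read_csv(
  csv_text,
  col_types = cols(id = col_character(), created_at = col_character())
)

# Parse month/day/year hour:minute text, then extract the hour of day
tweets$created_at <- mdy_hm(tweets$created_at)
tweets$hour_posted <- hour(tweets$created_at)

tweets
```

Note that `mdy_hm()` assumes UTC by default; the time zone of the scraped timestamps is not stated in the data.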

Data 2

Introduction and data

Identify the source of the data.
- The data is from the study, “The Role of Colleges in Intergenerational Mobility”, conducted in 2017, by Chetty, Friedman, Saez, Turner, and Yagan.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- The study was released in 2017; the underlying records are older.
- The data was collected from two administrative data sources: federal tax records and Department of Education records spanning 1999-2013.
Write a brief description of the observations.
- The observations are colleges in the US whose students are eligible to participate in Federal Student Financial Assistance programs under Title IV regulations. Each of them is specified by an OPE-ID.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Is the ranking of a school related to the mobility of its students? In other words, are good colleges really more life-changing for their less fortunate students?
A description of the research topic along with a concise statement of your hypotheses on this topic.
- The hypothesis is that good colleges are more likely to provide mobility for their students.
Identify the types of variables in your research question. Categorical? Quantitative?
- The variables are quantitative because they are fractions.

Glimpse of data

# Load data set
mobility <- read_csv("data/College Income.csv")

Rows: 2202 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): name, czname, state
dbl (12): super_opeid, par_median, k_median, par_q1, par_top1pc, kq5_cond_pa...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(mobility)

Rows: 2,202
Columns: 15
$ super_opeid        <dbl> 2665, 7273, 2688, 7022, 1140, 2693, 2165, 2791, 283…
$ name               <chr> "Vaughn College Of Aeronautics And Technology", "CU…
$ czname             <chr> "New York", "New York", "New York", "New York", "Lo…
$ state              <chr> "NY", "NY", "NY", "NY", "CA", "NY", "MA", "NY", "NY…
$ par_median         <dbl> 30900, 42800, 35500, 32500, 36600, 41800, 83300, 68…
$ k_median           <dbl> 53000, 57600, 48500, 40700, 43000, 45200, 112700, 6…
$ par_q1             <dbl> 36.47788, 27.63224, 32.54647, 36.70749, 33.11693, 2…
$ par_top1pc         <dbl> 0.11981525, 0.55920202, 0.23351549, 0.00000000, 0.1…
$ kq5_cond_parq1     <dbl> 44.84354, 46.82423, 36.02156, 27.88297, 29.94980, 3…
$ ktop1pc_cond_parq1 <dbl> 1.766629900, 2.556827100, 1.408721400, 0.189634980,…
$ mr_kq5_pq1         <dbl> 16.357975, 12.938586, 11.723747, 10.235138, 9.91845…
$ mr_ktop1_pq1       <dbl> 0.644429150, 0.706508640, 0.458489180, 0.069610246,…
$ trend_parq1        <dbl> -7.998776, -9.186549, -9.801580, -5.733966, -13.313…
$ trend_bottom40     <dbl> -5.750611, -12.297223, -13.879366, -9.072347, -14.9…
$ count              <dbl> 207.6667, 1083.0000, 582.3333, 468.3333, 1179.6667,…
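
In the rows shown above, `mr_kq5_pq1` appears to equal `par_q1` times `kq5_cond_parq1` divided by 100, that is, the share of students from the bottom income quintile multiplied by the share of those students who reach the top quintile. The sketch below checks this on the first two rows as displayed. The relationship is inferred from the printed values, not taken from a codebook, so it should be confirmed against the study's documentation.

```r
library(dplyr)

# First two rows of the glimpse output, at the precision displayed
mobility_demo <- tibble(
  super_opeid = c(2665, 7273),
  par_q1 = c(36.47788, 27.63224),
  kq5_cond_parq1 = c(44.84354, 46.82423),
  mr_kq5_pq1 = c(16.357975, 12.938586)
)

# Recompute the mobility rate from its two apparent components (in percent)
mobility_demo <- mobility_demo %>%
  mutate(
    mr_check = par_q1 * kq5_cond_parq1 / 100,
    abs_diff = abs(mr_check - mr_kq5_pq1)
  )

mobility_demo
```

If the relationship holds across the full data set, the same `mutate()` call can be run on `mobility` to verify every row.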

Data 3

Introduction and data

Identify the source of the data.
- Final edited data found online (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4866586/)
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
- Income data was collected from tax records. Mortality rates were measured from Social Security Administration death records. Percentiles were determined after comparing household earnings to all other individuals of the same sex and age in the United States.
Write a brief description of the observations.
- The dataset covers mortality rates of individuals by gender, age, year, and individual income percentile.

Research question

A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Do gender and income status have a significant impact on life expectancy (as measured by mortality rate)? Has life expectancy risen over time?
- This question can be useful for the government to determine how well members of the US population can survive at certain income levels. It can help determine funding for social services and figure out how much money people need to live longer.
A description of the research topic along with a concise statement of your hypotheses on this topic.
- Individuals with higher incomes will have a much higher life expectancy than individuals with lower incomes. Gender has only a slight impact, with women living longer. Life expectancy has risen over time at higher income levels but has not changed at low income levels.
Identify the types of variables in your research question. Categorical? Quantitative?
- Gender is categorical while age, income, and mortality rate are quantitative.

Glimpse of data

# Load data set
mortalityincome <- read_csv(file = "data/health_ineq_online_table_16.csv")

Rows: 85400 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): gnd
dbl (8): indv_pctile, age_at_d, yod, lag, mortrate, indv_inc, deaths, count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(mortalityincome)

Rows: 85,400
Columns: 9
$ gnd         <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F"…
$ indv_pctile <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ age_at_d    <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 41…
$ yod         <dbl> 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010…
$ lag         <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ mortrate    <dbl> 0.001911863, 0.002626552, 0.001847171, 0.002564103, 0.0025…
$ indv_inc    <dbl> 1.6370009, 1.4019952, 0.9445581, 0.7256386, 0.6136361, 0.5…
$ deaths      <dbl> 40, 55, 38, 51, 45, 34, 44, 46, 37, 40, 47, 37, 43, 44, 56…
$ count       <dbl> 20922, 20940, 20572, 19890, 17482, 17831, 17211, 19226, 17…
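
In the rows shown above, `mortrate` appears to be `deaths` divided by `count`. The sketch below checks this on the first three rows as displayed and then computes a pooled rate. The relationship is inferred from the printed values, and pooling by summing deaths and counts is our assumption about how rates should be combined across rows.

```r
library(dplyr)

# First three rows of the glimpse output, at the precision displayed
mortality_demo <- tibble(
  gnd = c("F", "F", "F"),
  yod = c(2001, 2002, 2003),
  mortrate = c(0.001911863, 0.002626552, 0.001847171),
  deaths = c(40, 55, 38),
  count = c(20922, 20940, 20572)
)

# Recompute the mortality rate and compare it with the stored value
mortality_demo <- mortality_demo %>%
  mutate(
    mortrate_check = deaths / count,
    abs_diff = abs(mortrate_check - mortrate)
  )

# Pooled rate by gender: total deaths over total population at risk,
# which weights each row by its size instead of averaging the rates
pooled <- mortality_demo %>%
  group_by(gnd) %>%
  summarise(pooled_mortrate = sum(deaths) / sum(count))

pooled
```

The same grouping can be extended to `indv_pctile` or `yod` on the full `mortalityincome` data to compare income levels and years.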