At the outset of this project, our group wanted to select an area of study we would genuinely enjoy analyzing, while also challenging ourselves to create something unique. In a lighthearted discussion, we talked about how attention-grabbing the Twitter posts of some current political figures are. From there, we decided to web-scrape the Twitter posts of members of the United States Congress and analyze them. The goal of our data science project is to glean information about engagement with Twitter posts based on their language and content.
Key questions for our analysis that we set out to answer are included below:
What effect do positive and negative sentiments in tweets, as well as the use of buzzwords and hot-button issues within those tweets, have on the total engagement rate of online users (both in the aggregate and in proportion to existing follower counts) for members of Congress?
Hypothesis: Tweets with stronger negative sentiments that also contain negative buzzwords will likely receive the most engagement from Twitter users both in the aggregate and proportional to the follower count of an individual Congress member.
Are Republicans or Democrats more likely to tweet positively?
Hypothesis: Republicans are more likely to resort to negative sentiment in their tweets.
We initially anticipated that our findings would be useful for political figures who want to curate content that grabs the public's attention, since increasing online presence is an effective way to gain attention and political traction.
Before crunching the numbers, we planned the following analyses:
We are concerned with how the combination of a tweet's sentiment (scored from -1 to 1, negative to positive) and a Congress member's political affiliation plays into engagement with a tweet (views, likes, retweets, comments, etc.), both in total and in proportion to the member's follower count.
Using a database of buzzwords and their respective “bias” scores (which help indicate political leaning), we plan to analyze the relationship between these buzzwords (and their consequent bias score assignments) and the actual party affiliation of the congress members we’ve sampled tweets from. Using the congress members as a sample, this would give us the ability to predict whether a twitter user has a political leaning towards progressive or conservative ideologies.
After conducting our analyses, here’s a glimpse into what happened:
Democratic members of Congress have both a higher total view count on their posts and a higher total follower count than their Republican counterparts. However, we noted that proportionally, Republicans get higher engagement than Democrats when the sentiment of their tweets is more negative. Likewise, Republicans also get more relative engagement than Democrats for the same number of buzzwords included in a post (buzzwords here being those in the externally sourced buzzword dataframe we used for this analysis). The initial methods we used for our predictive models were largely ineffective; however, applying a random forest model to our data produced a predictive model with good, though not outstanding, accuracy and ROC AUC values. More details follow in the report.
Data description
The dataset was created by our team by scraping specific information from the Twitter pages of the 440 Congress members whose pages we were able to find. The data wasn't collected by "traditional" means; we used natural language processing to identify the sentiment of each tweet by analyzing key words and returning a rating (between -1 and 1, for negative to positive, respectively). There are 20,353 observations across the tweets of these Congress members, with useful variables including the sentiment rating, total views of individual tweets, each member's political party affiliation, the time of day a tweet was posted, retweets, hashtags, URL links, and replies to other tweets. Ethical concerns are negligible, as these tweets are public (and shared by Congress members at that, leaving them further open to scrutiny from a legal perspective).
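Our sentiment ratings came from an NLP library, but the keyword-based idea can be illustrated with a toy lexicon. In this sketch the word list and scores are made-up stand-ins, not our actual pipeline:

```r
# Toy lexicon: word -> sentiment score (our real pipeline used an NLP library)
lexicon <- c(great = 0.8, win = 0.6, bad = -0.7, crisis = -0.9)

# Score a tweet as the mean score of its lexicon words (0 if none match)
score_tweet <- function(text) {
  words <- tolower(strsplit(text, "\\s+")[[1]])
  hits <- lexicon[intersect(words, names(lexicon))]
  if (length(hits) == 0) return(0)
  mean(hits)
}

score_tweet("A great win for families")  # 0.7
score_tweet("This is a bad crisis")      # -0.8
```

A real scorer also has to handle negation, intensifiers, and sarcasm, which is exactly the limitation we return to in the conclusions.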
First, let’s explore the variables in the dataset:
reps <- read_csv("data/reps_tweets_buzzwords.csv")
Rows: 20061 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): full_text, hashtags_0, hashtags_1, hashtags_2, url, user_descript...
dbl (10): favorite_count, reply_count, retweet_count, user_favourites_count...
dttm (1): created_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(reps)
Data summary

| | |
|---|---|
| Name | reps |
| Number of rows | 20061 |
| Number of columns | 20 |
| Column type frequency | character: 9, numeric: 10, POSIXct: 1 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| full_text | 0 | 1.00 | 1 | 326 | 0 | 19627 | 0 |
| hashtags_0 | 16115 | 0.20 | 2 | 37 | 0 | 1265 | 0 |
| hashtags_1 | 19285 | 0.04 | 2 | 31 | 0 | 458 | 0 |
| hashtags_2 | 19926 | 0.01 | 2 | 22 | 0 | 119 | 0 |
| url | 0 | 1.00 | 52 | 62 | 0 | 20061 | 0 |
| user_description | 100 | 1.00 | 30 | 162 | 0 | 420 | 0 |
| user_location | 4858 | 0.76 | 2 | 34 | 0 | 274 | 0 |
| user_name | 0 | 1.00 | 8 | 40 | 0 | 422 | 0 |
| affiliation | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| favorite_count | 0 | 1.00 | 303.23 | 2323.38 | 0.00 | 2.00 | 13.00 | 39.00 | 76984.00 | ▇▁▁▁▁ |
| reply_count | 0 | 1.00 | 77.08 | 748.99 | 0.00 | 0.00 | 3.00 | 11.00 | 34764.00 | ▇▁▁▁▁ |
| retweet_count | 0 | 1.00 | 148.95 | 1388.12 | 0.00 | 2.00 | 5.00 | 17.00 | 87130.00 | ▇▁▁▁▁ |
| user_favourites_count | 0 | 1.00 | 1961.89 | 5997.41 | 0.00 | 183.00 | 645.00 | 1673.00 | 98739.00 | ▇▁▁▁▁ |
| user_followers_count | 0 | 1.00 | 122592.35 | 537781.94 | 260.00 | 12723.00 | 27276.00 | 48229.00 | 8136584.00 | ▇▁▁▁▁ |
| user_friends_count | 0 | 1.00 | 1578.20 | 3692.63 | 5.00 | 317.00 | 705.00 | 1363.00 | 48084.00 | ▇▁▁▁▁ |
| view_count | 6173 | 0.69 | 22920.32 | 198446.93 | 1.00 | 718.75 | 1492.00 | 3724.25 | 8643806.00 | ▇▁▁▁▁ |
| sentiment | 0 | 1.00 | 0.10 | 0.22 | -1.08 | -0.02 | 0.09 | 0.23 | 1.47 | ▁▂▇▁▁ |
| num_buzzwords | 0 | 1.00 | 13.07 | 5.86 | 0.00 | 8.00 | 13.00 | 18.00 | 31.00 | ▃▇▇▅▁ |
| avg_buzzword_bias_score | 89 | 1.00 | -0.10 | 0.08 | -0.91 | -0.15 | -0.10 | -0.06 | 0.59 | ▁▁▇▁▁ |

Variable type: POSIXct

| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| created_at | 0 | 1 | 2017-11-06 01:23:51 | 2023-03-15 02:50:49 | 2023-03-04 17:38:59 | 19038 |
1) Followers by political affiliation
To begin, let’s plot the relationship between the number of followers and political affiliation to see whether Democrats or Republicans have more “clout” on Twitter.
# A tibble: 2 × 2
affiliation median
<chr> <dbl>
1 D 199045.
2 R 122441.
Based on the graph and our calculations, Democrats have a higher median follower count on Twitter than Republicans.
2) Engagement rate by political affiliation
The number of followers by itself does not mean much, because a significant portion of them might be inactive or bots. Instead, let’s explore the relationship between political affiliation and user engagement. Let’s define user engagement rate as the sum of likes, replies, and retweets divided by the number of views.
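As a concrete sketch of that definition, using the same column names as our scraped dataset (the values below are illustrative):

```r
library(dplyr)

# Illustrative rows with the same column names as our scraped dataset
toy_counts <- tibble(
  favorite_count = c(10, 40),
  reply_count    = c(2, 5),
  retweet_count  = c(3, 15),
  view_count     = c(100, 1000)
)

# Engagement rate = (likes + replies + retweets) / views
toy_counts <- toy_counts |>
  mutate(engagement_rate = (favorite_count + reply_count + retweet_count) / view_count)

toy_counts$engagement_rate  # 0.15 0.06
```

Dividing by views rather than followers means the rate measures how compelling a tweet was to the people who actually saw it.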
4) Sentiment to engagement by political affiliation
To answer one of our research questions, we want to analyze if the sentiment of the tweet has any effect on engagement, and if this varies based on political affiliation. First, let’s plot the relationship between the sentiment of the tweet and its engagement rate for Republicans and Democrats using a scatter plot.
By plotting these lines, we can see that there seems to be an inverse relationship between the sentiment and the engagement rate of the tweet, and it is more dramatic for Republican representatives: as sentiment goes down, the engagement rate increases.
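A sketch of this kind of plot on toy data (the real figure uses our scraped per-tweet data frame; the simulated relationship below only mimics the shape we observed):

```r
library(ggplot2)

set.seed(1)
# Toy stand-in: sentiment in [-1, 1], engagement rate as a small proportion
toy_scatter <- data.frame(
  affiliation = rep(c("D", "R"), each = 200),
  sentiment = runif(400, -1, 1)
)
toy_scatter$engagement_rate <-
  pmax(0, 0.02 - 0.01 * toy_scatter$sentiment + rnorm(400, 0, 0.01))

# Scatter plot with one smoothed trend line per party
p <- ggplot(toy_scatter,
            aes(x = sentiment, y = engagement_rate, color = affiliation)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(x = "Tweet sentiment (-1 to 1)", y = "Engagement rate")
p
```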
Let’s fit a linear model to analyze the relationship between the variables:
reps_lm <- linear_reg() |>
  fit(engagement_rate ~ sentiment * factor(affiliation), data = reps_buzz_ratio)
glance(reps_lm)
As you can see, the points cluster around certain areas. In particular, Republicans cluster toward the top left corner, i.e., higher on the average bias score and lower on sentiment. The opposite holds for Democrats, whose tweets tend toward more positive sentiment and a lower average buzzword bias score. This data does not seem well suited to a linear model; however, because of the clustering, it might work well with a KNN classification model.
Based on the performance of our model, we found that we can predict whether a Congress member is a Democrat or a Republican with a 58% success rate. Further refinement of this model is needed before actionable steps can be taken to make use of these findings. For example, if the success rate were higher (say 75% or above), it would become increasingly worthwhile to analyze large volumes of public tweets to determine whether regular, non-public-figure Twitter users lean left or right politically.
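A minimal sketch of this kind of KNN classifier with tidymodels, fit here on simulated stand-in data (the column names match our dataset, but the values, the neighbor count, and the preprocessing are illustrative assumptions):

```r
library(tidymodels)

set.seed(123)
# Toy stand-in for our per-tweet features (illustrative values only)
toy_knn <- tibble(
  affiliation = factor(rep(c("D", "R"), each = 50)),
  sentiment = c(rnorm(50, 0.15, 0.2), rnorm(50, 0.00, 0.2)),
  avg_buzzword_bias_score = c(rnorm(50, -0.15, 0.05), rnorm(50, -0.05, 0.05))
)

# KNN needs predictors on a comparable scale, hence step_normalize()
knn_wf <- workflow() |>
  add_recipe(
    recipe(affiliation ~ sentiment + avg_buzzword_bias_score, data = toy_knn) |>
      step_normalize(all_numeric_predictors())
  ) |>
  add_model(
    nearest_neighbor(neighbors = 5) |>
      set_engine("kknn") |>
      set_mode("classification")
  )

knn_fit <- fit(knn_wf, data = toy_knn)
augment(knn_fit, new_data = toy_knn) |>
  accuracy(truth = affiliation, estimate = .pred_class)
```

In practice the model should be evaluated on held-out data rather than the training set as shown here.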
6) Predicting Affiliation based on the text in the Tweet
To try and reach our objective, we would like to try and predict political affiliation based on the content of the tweets themselves.
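The setup is TF-IDF features of the tweet text feeding a penalized logistic regression (glmnet). A minimal sketch on a tiny stand-in tibble follows; the token-filter size, lasso mixture, and tuning details are illustrative choices, not our exact settings:

```r
library(tidymodels)
library(textrecipes)

# Tiny stand-in for our tweet data (the real analysis uses the full dataset)
toy_text <- tibble(
  affiliation = factor(c("D", "R", "D", "R")),
  full_text = c(
    "protect voting rights and healthcare",
    "cut taxes and secure the border",
    "climate action now",
    "back the blue"
  )
)

# TF-IDF of the tweet text -> penalized logistic regression (glmnet)
text_rec <- recipe(affiliation ~ full_text, data = toy_text) |>
  step_tokenize(full_text) |>
  step_tokenfilter(full_text, max_tokens = 1000) |>
  step_tfidf(full_text)

glmnet_wf <- workflow() |>
  add_recipe(text_rec) |>
  add_model(
    logistic_reg(penalty = tune(), mixture = 1) |>
      set_engine("glmnet") |>
      set_mode("classification")
  )

# On the real data, the penalty is tuned over cross-validation folds and
# the best value is selected by ROC AUC, e.g.:
#   folds <- vfold_cv(reps, v = 5, strata = affiliation)
#   glmnet_tune <- tune_grid(glmnet_wf, resamples = folds)
```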
Although this model took over 40 minutes to fit, the result is impressive: with 86% accuracy, it can predict whether a tweet came from a Democrat or a Republican, opening up real-world use cases.
Now, let’s take a look at what words are important predictors for the political affiliation of a representative.
glmnet_best <- readRDS("trained_glmnet_model.rds")
glmnet_tune <- readRDS("glmnet_tune.rds")

# the error occurs below \/
glmnet_imp <- glmnet_best |>
  extract_fit_parsnip() |>
  vi(
    method = "model",
    lambda = select_best(x = glmnet_tune, metric = "roc_auc")$penalty
  )
# the error occurs above /\

glmnet_imp |>
  mutate(
    Variable = str_replace(Variable, "tfidf_full_text_", ""),
    Variable = str_to_title(str_replace_all(Variable, "_", " ")),
    Sign = case_when(
      Sign == "NEG" ~ "More likely from Democrat",
      Sign == "POS" ~ "More likely from Republican"
    ),
    Importance = abs(Importance)
  ) |>
  group_by(Sign) |>
  slice_max(order_by = Importance, n = 20) |>
  ggplot(mapping = aes(
    x = Importance,
    y = fct_reorder(Variable, Importance),
    fill = Sign
  )) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_brewer(type = "qual") +
  facet_wrap(facets = vars(Sign), scales = "free_y") +
  labs(
    y = NULL,
    title = "Most relevant features for predicting whether\na Tweet is by a Republican or a Democrat",
    subtitle = "Penalized regression model"
  )
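The bootstrap distribution plotted next can be built with the infer package. This is a sketch on simulated stand-in data; the real analysis bootstraps the engagement rates of the actual tweets, and the statistic choice and resample count here are illustrative:

```r
library(dplyr)
library(infer)

set.seed(42)
# Toy stand-in for our per-tweet engagement rates (illustrative values)
toy_rates <- tibble(
  affiliation = rep(c("D", "R"), each = 100),
  engagement_rate = c(rbeta(100, 2, 80), rbeta(100, 3, 80))
)

# Bootstrap the difference in mean engagement rate between parties;
# order = c("D", "R") makes the statistic D minus R
boot_df <- toy_rates |>
  specify(engagement_rate ~ affiliation) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("D", "R"))

# Percentile confidence interval from the bootstrap distribution
get_confidence_interval(boot_df, level = 0.95, type = "percentile")
```

An interval lying entirely below zero would indicate a higher Republican engagement rate.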
ggplot(boot_df, aes(x = stat)) +
  geom_dotplot() +
  geom_vline(xintercept = 0, color = "red")
Engagement Rates by Party

Model 2 predicts engagement rate from political party affiliation. To reiterate, engagement rate is the total activity on a tweet (likes, comments, shares) divided by the number of views on that tweet. Comparing engagement rates across the Democratic and Republican parties, we found that Republicans tend to have a slightly higher engagement rate overall. We visualized this comparison using side-by-side boxplots of engagement rate by party. Notably, the median engagement for Republicans was higher, along with the third quartile and the interquartile range. Both plots have high-engagement outliers, with the Republican party having the highest outlier between the two at a rate of over 8 percent. Calculating a confidence interval using the bootstrap method, we found that the engagement rate of Twitter posts was higher among Republican Congress members than among their Democratic counterparts (a statistically significant difference). One potential reason is that Republican representatives create content with more shock, entertainment, or relevance value; another is that right-leaning Twitter users are simply more loyal to the representatives they follow.
Buzzword Ratio vs Engagement

After calculating engagement rate, we also derived a new column called buzz_ratio, computed by dividing the number of buzzwords in a tweet by the tweet's total word count. This gives us the proportion of buzzwords within a tweet, which controls for tweet length when comparing against engagement rate. We visualized the relationship with a scatterplot and fit a model that takes interaction effects into account. Upon visual analysis of the scatterplot (using a loess line of best fit), we noticed that Republicans tend to have slightly higher engagement rates than Democrats for an equivalent buzzword ratio, particularly at the tail ends of the buzzword ratio. There is not a strong correlation between the two variables, though engagement appears stronger on the high end for buzzword ratio values between roughly 20 and 70. This range also seems to be where the posts with the highest engagement lie, suggesting a possible goldilocks zone for the ratio of buzzwords Congress members should use to achieve higher engagement from their viewers.
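A sketch of how a column like buzz_ratio can be derived. The word count here is a simple whitespace split, and the scaling by 100 (to match the 20-70 range discussed above) is our assumption about the units; the tweets and buzzword counts are made up:

```r
library(dplyr)
library(stringr)

# Toy rows mirroring our full_text and num_buzzwords columns
toy_buzz <- tibble(
  full_text = c("cut taxes now", "healthcare for all families today"),
  num_buzzwords = c(2, 1)
)

# buzz_ratio = buzzwords per total words in the tweet, scaled by 100
toy_buzz <- toy_buzz |>
  mutate(
    total_words = str_count(full_text, "\\S+"),
    buzz_ratio  = 100 * num_buzzwords / total_words
  )

toy_buzz$buzz_ratio  # ~66.67 20
```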
Bias vs. Political Affiliation KNN Prediction Model

Using a k-nearest-neighbors classification model on tweet average bias scores and tweet sentiment, we found that we can predict whether a Congress member is a Democrat or a Republican with a 58% success rate. Further refinement of this model is needed before actionable steps can be taken to make use of these findings. For example, if the success rate were higher (say 75% or above), it would become increasingly worthwhile to analyze large volumes of public tweets to determine whether regular, non-public-figure Twitter users lean left or right politically. Since our model's success rate is not much better than fifty percent, determining party affiliation from the content of a tweet may as well be left to a coin toss.
An immediately obvious area of refinement to increase the model's success rate would be more advanced language processing and more sophisticated calculation of the two variables used in this model. We would need to calculate bias scores based on longer phrases or chains of words rather than singular buzzwords. The same reasoning applies to the calculation of tweet sentiment: detecting rhetorical devices that go unnoticed by our simple scoring method is crucial to truly understanding a tweet's sentiment and assigning it a useful value for statistical analysis. All in all, this is a good start at predicting political leaning, but further refinement is needed to improve on our work thus far.
Sentiment vs Engagement
Comparing the overall sentiment of a tweet to its total engagement rate, there appears to be a weak negative correlation between the two variables: engagement rate generally trends downward as the sentiment of a post moves from negative to positive. Applying a line of best fit for each party, we notice that this correlation is stronger for Republican Congress members than for Democrats. This suggests that Republicans who post more negative content are more likely to see higher attention and traffic on their posts, and it could indicate that it is worthwhile for Republican Congress members to incorporate more criticism or shock value into their posts to maximize engagement with their content. That said, doing so could harm a personal brand, as being associated only with bad, mean, or shocking content online could damage one's credibility and popularity.
Interpretation and conclusions
Limitations
Acknowledgments
After determining the list of Twitter handles for all Congress members, we input these handles into Apify (https://blog.apify.com/), which pulled the content and metadata for the past 50 tweets of each Congress member. This was the primary tool used to generate our raw dataset prior to refinement.
We used an external dataset created by Jack Bandy (a PhD candidate at Northwestern University) that takes words (or two-word phrases) used by members of the United States Congress and assigns them bias scores on a scale from -1 to 1, where -1 indicates words used exclusively by Democrats and 1 indicates words used exclusively by Republicans. The methodology for calculating the scores, as well as the dataset itself, can be found at https://towardsdatascience.com/detecting-politically-biased-phrases-from-u-s-senators-with-natural-language-processing-tutorial-d6273211d331.