Analyzing Airbnb Data

What variables best predict the price of Airbnb listings in New York City?

Author

Wondrous Raichu
Boyang Li, Dave Tirtariyadi, Ethan Carlson,
Ribo Li, Tasnimul Taher

Published

May 5, 2023

Introduce the topic and motivation

Airbnb has quickly revolutionized the travel and hospitality industry since its launch in 2008.

With over six million listings worldwide, the company has disrupted traditional travel methods. In fact, research has shown that a 1% increase in Airbnb listings results in a 0.03% decrease in hotel room revenues in the top ten US cities (Dogru et al., 2019).

As Airbnb continues to grow, it's important to understand its implications for the market and its users; one step towards that is understanding its pricing mechanism.

Introduce the data

The main dataset we utilized comes from Inside Airbnb, an open platform that provides data on Airbnb listings in different locations around the world.

It contains information on over 42,000 listings in New York City as of March 6, 2023.

Variables we chose to examine include price, neighbourhood_group, amenities, and review_scores_rating.

Highlights from EDA

  • Manhattan and Brooklyn are the top two boroughs by number of listings.

  • Prices are heavily right-skewed.

  • Top 20 amenities included “dedicated workspace.”
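The right skew noted above is the usual motivation for modeling log(price) rather than price. The sketch below illustrates the effect on simulated data, not the Inside Airbnb listings: the lognormal parameters, the sample size, and the `skewness` helper are all assumptions made for illustration.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed "prices" (simulated, NOT the real listings).
random.seed(0)
prices = [random.lognormvariate(4.8, 0.7) for _ in range(5000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)      # strongly positive for right-skewed data
log_skew = skewness(log_prices)  # close to zero after the log transform

print(f"skewness of price:      {raw_skew:.2f}")
print(f"skewness of log(price): {log_skew:.2f}")
```

A skewness near zero on the log scale suggests that a linear model for log(price) is a more reasonable starting point than one for raw price.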

Location

Predicting by NYC boroughs:

\[
\begin{aligned}
\widehat{\log(\text{Airbnb price})} = {} & 4.495 + 0.221 \times \text{Brooklyn} + 0.631 \times \text{Manhattan} \\
& + 0.029 \times \text{Queens} + 0.141 \times \text{Staten Island}
\end{aligned}
\]

  • R-squared = 11.53%, adjusted-R-squared = 11.52%
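To make the borough coefficients easier to read, the sketch below converts the fitted values back to the dollar scale. It rests on two assumptions the slides do not state: that the log is the natural log, and that the Bronx is the baseline (it is the only borough without an indicator term). The names `INTERCEPT`, `BOROUGH_COEF`, and `predicted_price` are hypothetical.

```python
import math

# Coefficients copied from the borough model above.
INTERCEPT = 4.495
BOROUGH_COEF = {
    "Bronx": 0.0,  # assumed baseline: no indicator term in the equation
    "Brooklyn": 0.221,
    "Manhattan": 0.631,
    "Queens": 0.029,
    "Staten Island": 0.141,
}


def predicted_price(borough):
    """Back-transform the predicted log(price), assuming a natural log."""
    return math.exp(INTERCEPT + BOROUGH_COEF[borough])


for borough in BOROUGH_COEF:
    ratio = math.exp(BOROUGH_COEF[borough])  # multiplier relative to baseline
    print(f"{borough:14s} ${predicted_price(borough):7.2f}  ({ratio:.2f}x baseline)")
```

Because the model is fit on the log scale, each coefficient acts as a multiplier on the baseline price after exponentiating, and the back-transformed value is better read as a typical (median-like) price than as a mean.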

Predicting by NYC neighborhoods:

  • 220 different neighborhoods recorded in NYC, many coefficients

  • R-squared = 25.35%, adjusted-R-squared = 24.96%

Conclusion: The linear model using neighborhoods accounts for more variability in log(price) than the model using boroughs. However, boroughs might be the better choice once we add other variables to our future linear models, as neighborhood-level indicators might cause over-fitting.
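The gap between R-squared and adjusted R-squared shows how much each model is penalized for its number of coefficients. The sketch below applies the standard adjustment formula under two assumptions: a sample size of roughly 42,000 listings (the exact number used in the fits is not given) and one indicator per category level minus a baseline.

```python
def adjusted_r_squared(r2, n, p):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


N_LISTINGS = 42_000  # assumption: "over 42,000 listings", exact n unknown

# Boroughs: 5 levels -> 4 indicator coefficients.
borough_adj = adjusted_r_squared(0.1153, N_LISTINGS, 4)

# Neighborhoods: 220 levels -> 219 indicator coefficients (assumed).
neighborhood_adj = adjusted_r_squared(0.2535, N_LISTINGS, 219)

print(f"boroughs:      {borough_adj:.4f}")
print(f"neighborhoods: {neighborhood_adj:.4f}")
```

Under these assumptions the formula gives values in line with the adjusted figures reported above. The penalty is small at this sample size, so the over-fitting concern is more about sparsely populated neighborhoods than about the adjusted R-squared itself.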

Listing characteristics

Room type

  • Entire home/apt and shared room options are most common.
  • Ranked by price from highest to lowest: hotel room, entire home/apt, private room, shared room.
  • R-squared value is 16.3%.
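For a single categorical predictor such as room type, the R-squared of the simple linear model is the share of total variation explained by differences between group means. The sketch below computes it that way on a few made-up log prices; the data and the `r_squared_by_group` helper are hypothetical and are not the authors' code.

```python
from statistics import fmean


def r_squared_by_group(groups):
    """R-squared of a one-factor model: 1 - SS_within / SS_total."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = fmean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = sum(
        (v - fmean(values)) ** 2 for values in groups.values() for v in values
    )
    return 1 - ss_within / ss_total


# Made-up log(price) values per room type, for illustration only.
toy_log_prices = {
    "Hotel room": [5.6, 5.9, 6.1],
    "Entire home/apt": [5.0, 5.4, 5.2, 5.7],
    "Private room": [4.2, 4.6, 4.4, 4.9],
    "Shared room": [3.9, 4.1, 4.5],
}

toy_r2 = r_squared_by_group(toy_log_prices)
print(f"toy R-squared: {toy_r2:.3f}")
```

The toy value has no relation to the 16.3% reported above; it only shows the calculation.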

Extra amenities

  • The majority of listings do not have all the extras, and the distribution is similar across boroughs.
  • Confidence interval: we are 95% confident that the median price of listings with extra amenities is between $50 and $59 higher than that of listings without them.
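The slides do not say how the interval for the difference in medians was produced. One common approach is a percentile bootstrap, sketched below on simulated prices. The function name, the number of resamples, and both simulated groups are assumptions for illustration, so the resulting interval will not match the $50 to $59 range above.

```python
import random
import statistics


def bootstrap_median_diff_ci(group_a, group_b, reps=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for median(group_a) - median(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.median(resample_a) - statistics.median(resample_b))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * reps)]
    upper = diffs[int((1 - tail) * reps) - 1]
    return lower, upper


# Simulated prices (hypothetical, not the Inside Airbnb data).
data_rng = random.Random(1)
with_extras = [data_rng.lognormvariate(5.3, 0.5) for _ in range(500)]
without_extras = [data_rng.lognormvariate(4.8, 0.5) for _ in range(500)]

lower, upper = bootstrap_median_diff_ci(with_extras, without_extras)
print(f"95% CI for difference in medians: ${lower:.0f} to ${upper:.0f}")
```

Medians suit this comparison because the price distribution is heavily right-skewed, so a few very expensive listings would pull a difference in means upward.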

Conclusions + future work

  • In total, we have 22 variables.
  • Moving forward, we plan on fitting a multivariate regression model and utilizing backward elimination to remove insignificant predictors.

Top Five Predictor Variables Based on R-squared Values in Simple Linear Regression Model

Variable               R-squared
Number of bathrooms    0.290
Room type              0.163
Number of bedrooms     0.157
Host acceptance rate   0.060
Washer                 0.048
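The backward-elimination plan from the future-work bullet can be sketched as follows. This is a minimal illustration, not the authors' method: it swaps the significance test for an AIC comparison so that it runs on the standard library alone, and the helper names and simulated predictors are hypothetical.

```python
import math
import random


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def aic(columns, y):
    """AIC (up to a constant) of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations
    rss = sum(
        (yi - sum(b * x for b, x in zip(beta, row))) ** 2
        for row, yi in zip(design, y)
    )
    return n * math.log(rss / n) + 2 * k


def backward_eliminate(data, y):
    """Repeatedly drop the predictor whose removal lowers AIC the most."""
    kept = list(data)
    best = aic([data[name] for name in kept], y)
    while kept:
        scores = {
            name: aic([data[v] for v in kept if v != name], y) for name in kept
        }
        drop = min(scores, key=scores.get)
        if scores[drop] >= best:
            break  # no removal improves the model
        best = scores[drop]
        kept.remove(drop)
    return kept


# Simulated predictors (hypothetical): two real effects and one pure-noise column.
rng = random.Random(0)
n_rows = 300
data = {name: [rng.gauss(0, 1) for _ in range(n_rows)]
        for name in ("bathrooms", "bedrooms", "noise")}
log_price = [4.5 + 0.3 * ba + 0.2 * be + rng.gauss(0, 0.3)
             for ba, be in zip(data["bathrooms"], data["bedrooms"])]

kept = backward_eliminate(data, log_price)
print("predictors kept:", kept)
```

A p-value-based version would follow the same loop, dropping the predictor with the largest p-value above a chosen threshold at each step.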