Analyzing Airbnb data for Boston and Seattle
- Analyzing Airbnb data for Boston and Seattle
-
- 1. Business understanding
- 2. Data understanding
-
- 2.1 Load the data
- 2.2 Preview the data
- 3. Data preperation
-
- 3.1 Merge listings and calender
- 3.2 Put two dataframes in one dataframe(concat)
- 3.3 Working with missing data
-
- 3.3.1 Checking columns that contains more than 80% miss data
- 3.3.2 Checking each columns in df_80
- 3.3.3 Get into each features
- 3.4 Result DataFrame
- 4. Modeling
-
- 4.1 Find the factors that affect price
- 4.2 Heatmap
- 4.3 Prediction
-
- 4.3.1 linear_model
- 4.3.2 RandomForest
- 4.3.3 Modeling Process
- 5. Summary
Analyzing Airbnb data for Boston and Seattle
In this article, I will talk about some thing about Airbnb data for Boston and Seattle in 2018。And I will answer 3 questions by analyzing these data.
1. Business understanding
I collect one month data in 2018 from Airbnb in Boston and Seattle, and I will put them together; then I will choose some features to build a model; at last, I will answer 3 questions:
- What is top positive and negative factors that can affect the price of a house;
- Try to build a model to predict the score;
- What should a host do to improve the score(review_scores_rating) of his house from this model;
2. Data understanding
# import the basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
2.1 Load the data
# I chooose data from Boston and Seattle AirBNB
boston_cal = pd.read_csv('./data/Boston/calendar.csv', low_memory=False)
boston_list = pd.read_csv('./data/Boston/listings.csv', low_memory=False)
boston_rev = pd.read_csv('./data/Boston/reviews.csv', low_memory=False)
seattle_cal = pd.read_csv('./data/Seattle/calendar.csv', low_memory=False)
seattle_list = pd.read_csv('./data/Seattle/listings.csv', low_memory=False)
seattle_rev = pd.read_csv('./data/Seattle/reviews.csv', low_memory=False)
2.2 Preview the data
print('boston calender size: ', boston_cal.shape)
print('boston listings size: ', boston_list.shape)
print('boston reviews size: ', boston_rev.shape)
boston calender size: (2286360, 7)
boston listings size: (6264, 106)
boston reviews size: (225880, 6)

- We can see that ‘calender’ data can join ‘listings’ data.
- Calender data show the listing id, data, available, price, adjusted_price, minimum_nights and maximum_nights.
- Listings data show some detailed information. However, obviously, listings data contains too many unnecessary messages, it need to be Extract-Transform-Load later.
- We don’t need reviews data, because the most important information in it is comments from guests. And we can hardly use them.
3. Data preperation
3.1 Merge listings and calender
# Obviouly, 'listing_id' in calender data refers to 'id' in listings data
seattle_cal = seattle_cal.rename(columns={'listing_id': 'id'})
boston_cal = boston_cal.rename(columns={'listing_id': 'id'})
seattle_df = pd.merge(seattle_list,seattle_cal,on='id',how='outer')
boston_df = pd.merge(boston_list,boston_cal,on='id',how='outer')
boston_df.head()
3.2 Put two dataframes in one dataframe(concat)
boston_df['city'] = 'boston'
seattle_df['city'] = 'seattle'
df = pd.concat([boston_df,seattle_df],ignore_index=True)
df.shape
(5595815, 112)
3.3 Working with missing data
3.3.1 Checking columns that contains more than 80% miss data
df_80 = df.columns[df.count()/df.shape[0]<=0.80]
df_80
Index([‘space’, ‘neighborhood_overview’, ‘notes’, ‘transit’, ‘access’,
‘interaction’, ‘house_rules’, ‘thumbnail_url’, ‘medium_url’,
‘xl_picture_url’, ‘host_about’, ‘host_acceptance_rate’,
‘neighbourhood_group_cleansed’, ‘square_feet’, ‘weekly_price’,
‘monthly_price’, ‘security_deposit’, ‘license’],
dtype=‘object’)
3.3.2 Checking each columns in df_80
At Zeus, we focus exclusively on 30 day+ stays to meet the longer-term needs of our valued guests. When you stay in a Zeus property you will be met with top-notch amenities including premium mattresses and bedding, organic bath products, bath linens, a fully equipped kitchen, and super-fast wifi. Our staff is on call should anything come up that needs our attention or that can make your stay more enjoyable. 56210
The apartment is in a high-end building, conveniently located near South Lake Union’s technology center. With 24/7 guest support and concierge services, we combine a fully-furnished premium apartment with the consistent, high-quality experience and amenities that you expect from leading hotels. Expertly designed with customized furniture, the apartment’s furnishings are stylish, contemporary and specifically tailored for this apartment. We equip our apartments with everything you’ll need for your home away from home. Need more towels or anything additional during your stay? Our 24/7 guest support and concierge services are here for you. Kitchen: - Fully-stocked with cookware, glasses, mugs, plates, and everything you’ll need to host a dinner at your home away from home. - Glass top, Energy Efficient Range - Full-sized, Energy Efficient Refridgerator and Freezer - 1100 Watt Microwave - Paper Products and Trash Bags - Premium Dishwasher - Dish Soap, Dish Pods, Hand Soap - Coffee Maker w 25915
…
This column talks about the description of the house which given by the house owner. Obviously, ‘space’ column contains a lot of text information and we can’t group them easily. We can just drop it.
Located in central South Lake Union these apartments are a short walk to some well known technology businesses nearby (Amazon, Impinj, Northeastern University, Path). You can also enjoy the large park across the street. 22265
These downtown apartments are within a mile of Pike Place Market, the Space Needle, the Seattle Aquarium and a multitude of upscale restaurants and bars. You’re also just steps away from an assortment of venues where you can enjoy theater, comedy and every conceivable kind of music. Here, returning home for the afternoon certainly doesn’t mean an end to an exciting day. 17155
Fenway highlights - Walkability score - 95 - 2 minute walk to Sweet Cheeks Q (favorite family-style BBQ) - 10 minute walk to The Hawthorne (upscale bar serving craft cocktails) 17155
…
This column talks about the surrouding of the house. Similarly, we can hardly find some useful imformation from it. We can drop it.
We check all the columns in df_80, and put all columns that need to be deleted in a new list:
Now, we need to drop ‘space’, ‘neighborhood_overview’, ‘notes’, ‘transit’, ‘access’, ‘interaction’, ‘house_rules’, ‘thumbnail_url’, ‘medium_url’, ‘xl_picture_url’, ‘host_about’, ‘host_acceptance_rate’, ‘neighbourhood_group_cleansed’, ‘square_feet’, ‘license’.
At the same time we also need to fill up ‘weekly_price’,‘monthly_price’, ‘security_deposit’.
list_drop = ['space', 'neighborhood_overview', 'notes', 'transit', 'access',
'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
'xl_picture_url', 'host_about', 'host_acceptance_rate',
'neighbourhood_group_cleansed', 'square_feet', 'license']
def drop_columns(df, cols):
'''
Usage: Delete colums
Input:
df - input dataframe
cols - the feature to be process
Output:
df - dataframe which have been process
'''
df.drop(columns=list_drop, inplace=True)
return df
Now we need to fill up ‘weekly_price’, ‘monthly_price’, ‘security_deposit’.
def fill_colums(df, cols):
'''
Usage: Fill up colums
Input:
df - input dataframe
cols - the feature to be process
Output:
df - dataframe which have been process
'''
for col in cols:
df[col].fillna(value=0, inplace=True)
return df
Before we process these columns, we need to complete our list_drop and list_fill_zero.
3.3.3 Get into each features
df.select_dtypes(include=['float', 'int']).columns
Index([‘id’, ‘scrape_id’, ‘host_id’, ‘host_listings_count’,
‘host_total_listings_count’, ‘latitude’, ‘longitude’, ‘accommodates’,
‘bathrooms’, ‘bedrooms’, ‘beds’, ‘guests_included’, ‘minimum_nights_x’,
‘maximum_nights_x’, ‘minimum_minimum_nights’, ‘maximum_minimum_nights’,
‘minimum_maximum_nights’, ‘maximum_maximum_nights’,
‘minimum_nights_avg_ntm’, ‘maximum_nights_avg_ntm’, ‘availability_30’,
‘availability_60’, ‘availability_90’, ‘availability_365’,
‘number_of_reviews’, ‘number_of_reviews_ltm’, ‘review_scores_rating’,
‘review_scores_accuracy’, ‘review_scores_cleanliness’,
‘review_scores_checkin’, ‘review_scores_communication’,
‘review_scores_location’, ‘review_scores_value’,
‘calculated_host_listings_count’,
‘calculated_host_listings_count_entire_homes’,
‘calculated_host_listings_count_private_rooms’,
‘calculated_host_listings_count_shared_rooms’, ‘reviews_per_month’,
‘minimum_nights_y’, ‘maximum_nights_y’],
dtype=‘object’)
df.select_dtypes(include=['object']).columns
Index([‘listing_url’, ‘last_scraped’, ‘name’, ‘summary’, ‘description’,
‘experiences_offered’, ‘picture_url’, ‘host_url’, ‘host_name’,
‘host_since’, ‘host_location’, ‘host_response_time’,
‘host_response_rate’, ‘host_is_superhost’, ‘host_thumbnail_url’,
‘host_picture_url’, ‘host_neighbourhood’, ‘host_verifications’,
‘host_has_profile_pic’, ‘host_identity_verified’, ‘street’,
‘neighbourhood’, ‘neighbourhood_cleansed’, ‘city’, ‘state’, ‘zipcode’,
‘market’, ‘smart_location’, ‘country_code’, ‘country’,
‘is_location_exact’, ‘property_type’, ‘room_type’, ‘bed_type’,
‘amenities’, ‘price_x’, ‘weekly_price’, ‘monthly_price’,
‘security_deposit’, ‘cleaning_fee’, ‘extra_people’, ‘calendar_updated’,
‘has_availability’, ‘calendar_last_scraped’, ‘first_review’,
‘last_review’, ‘requires_license’, ‘jurisdiction_names’,
‘instant_bookable’, ‘is_business_travel_ready’, ‘cancellation_policy’,
‘require_guest_profile_picture’, ‘require_guest_phone_verification’,
‘date’, ‘available’, ‘price_y’, ‘adjusted_price’],
dtype=‘object’)
Obviouly, columns named ‘description’, ‘country’, ‘_url’, ‘state’, ‘street’… are useless.
list_drop = list_drop + ['country', 'country_code', 'description', 'host_name', 'host_picture_url',
'host_thumbnail_url', 'host_url', 'last_scraped', 'smart_location',
'host_verifications', 'listing_url', 'name', 'picture_url', 'last_scraped', 'market',
'state', 'street', 'summary', 'zipcode', 'requires_license',
'host_neighbourhood', 'experiences_offered', 'calendar_last_scraped',
'jurisdiction_names', 'is_business_travel_ready', 'price_y']
We can see that some object columns need to convert into int or float
def convert_to_flout(df, cols):
'''
Usage: Convert colums to float
Input:
df - input dataframe
cols - the feature to be process
Output:
df - dataframe which have been process
'''
for col in cols:
df[col]=df[col].replace('[$,]','',regex=True).astype(float)
return df
today 1939610
2 weeks ago 325945
yesterday 310980
a week ago 248565
3 weeks ago 217905
2 months ago 216445
3 days ago 185785
3 months ago 172280
4 weeks ago 143080
4 days ago 137605
5 weeks ago 129940
5 days ago 118625
4 months ago 112420
2 days ago 108770
…
def convert_calendar_int(x):
if('a week' in x):
return 1
elif('week' in x):
return str.split(x)[0]
elif ('month' in x):
return int(str.split(x)[0])*4
elif ('never' in x):
return int(0)
else:
return int(1)
df['calendar_updated']=df['calendar_updated'].apply(lambda x:convert_calendar_int(x))
3.4 Result DataFrame
list_numerical_drop = ['id', 'scrape_id', 'host_id', 'minimum_nights_x', 'maximum_nights_x',
'minimum_nights_y', 'maximum_nights_y']
def drop_row(df):
'''
Usage: Drop row that contains missing values
Input:
df - input dataframe
Output:
df - dataframe which have been process
'''
df_miss = df.columns[(df.count()/len(df)) < 1]
df.dropna(subset=df_miss, inplace=True, axis=0)
return df
Check the result Dataframe
df.select_dtypes(include=['object']).columns
Index([‘host_since’, ‘host_is_superhost’,
‘host_has_profile_pic’, ‘host_identity_verified’,
‘neighbourhood_cleansed’, ‘city’, ‘is_location_exact’, ‘property_type’,
‘room_type’, ‘bed_type’, ‘has_availability’,
‘instant_bookable’, ‘cancellation_policy’,
‘require_guest_profile_picture’, ‘require_guest_phone_verification’,
‘available’],
dtype=‘object’)
def get_dump(df):
list_cat = df.select_dtypes(include=['object']).columns
df =pd.get_dummies(df, columns=list_cat, drop_first=True)
return df
4. Modeling
4.1 Find the factors that affect price
corr_matrix = df.corr()
corr_matrix['price'].sort_values(ascending=False)
Q1 The top factors that affect price of house:
price 1.000000
host_total_listings_count 0.497184
accommodates 0.452723
beds 0.395042
bedrooms 0.392926
cleaning_fee 0.367075
guests_included 0.318110
bathrooms 0.306648
…
host_is_superhost_True -0.125351
calculated_host_listings_count_private_rooms -0.138105
number_of_reviews -0.149646
number_of_reviews_ltm -0.160282
cancellation_policy_moderate -0.164931
reviews_per_month -0.199273
room_type_Private room -0.326181
Name: price, Length: 185, dtype: float64
4.2 Heatmap
But we get 178 features, I don’t think this picture would help us.
# Generate a mask for the upper triangle
mask = np.zeros_like(corr_matrix, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
4.3 Prediction
score 1.000000
host_is_superhost_True 0.368763
amenities 0.183053
number_of_reviews_ltm 0.162046
reviews_per_month 0.151192
cancellation_policy_moderate 0.146941
number_of_reviews 0.145018
city_seattle 0.144506
property_type_Guest suite 0.103642
host_identity_verified_True 0.094759
…
price -0.080093
neighbourhood_cleansed_Mission Hill -0.081475
neighbourhood_cleansed_Back Bay -0.087674
availability_90 -0.097441
calculated_host_listings_count_shared_rooms -0.103292
require_guest_profile_picture_True -0.112825
instant_bookable_True -0.117803
require_guest_phone_verification_True -0.119340
neighbourhood_cleansed_Downtown -0.121227
cancellation_policy_strict_14_with_grace_period -0.138403
availability_60 -0.144002
cancellation_policy_strict -0.156851
availability_30 -0.181728
calculated_host_listings_count_private_rooms -0.195279
host_total_listings_count -0.225087
calculated_host_listings_count_entire_homes -0.230170
calculated_host_listings_count -0.255190
Name: score, Length: 185, dtype: float64
4.3.1 linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
X = df.drop(columns=['score','city_seattle'])
Y = df['score']
X_train, X_test, y_train, y_test = train_test_split(X, Y , test_size=0.2, random_state=42)
linear_model = LinearRegression(normalize=True)
linear_model.fit(X_train, y_train)
y_preds = linear_model.predict(X_test)
r2_scores = r2_score(y_test,y_preds)
r2_scores
0.24278702995958334
r2_score is really near zero, so LinearRegression is not best choice for us to predict the ‘score’.
This time I will use RamdonForest model, and chose 10 features I have selected above.
4.3.2 RandomForest
list_features = ['host_is_superhost_True',
'amenities',
'number_of_reviews_ltm',
'reviews_per_month',
'cancellation_policy_moderate',
'host_identity_verified_True',
'is_location_exact_True',
'host_response_rate',
'extra_people',
'security_deposit',
'calculated_host_listings_count',
'cancellation_policy_strict',
'neighbourhood_cleansed_Downtown',
'require_guest_phone_verification_True',
'require_guest_profile_picture_True',
'instant_bookable_True',
'require_guest_profile_picture_True',
'price',
'available_True',
'neighbourhood_cleansed_Chinatown']
X_rfg = df[list_features]
Y_rfg = df['score']
X_train, X_test, y_train, y_test = train_test_split(X_rfg, Y_rfg, test_size=0.3, random_state=42)
rfg_model = RandomForestRegressor(max_depth=8, random_state=42, n_estimators=100,
min_samples_leaf = 2, min_samples_split = 10)
rfg_model.fit(X_train, y_train)
y_rfg_preds = rfg_model.predict(X_test)
r2_rfg_scores = r2_score(y_test, y_rfg_preds)
r2_rfg_scores
0.38408818776902665
It’s better than lineargression, but still not good.
4.3.3 Modeling Process
from sklearn.model_selection import GridSearchCV
def find_best_rgf(df, cols):
'''
Usage: find best RandomRorestRegression model
Input:
df - input dataframe
col - features
Output:
best_clf - best model
'''
X_rfg = df[list_features2]
Y_rfg = df['score']
X_train, X_test, y_train, y_test = train_test_split(X_rfg, Y_rfg, test_size=0.3, random_state=42)
param_grid = {'max_depth':[8, 16, 20],
'min_samples_split':[8, 16, 20],
'min_samples_leaf':[2, 4]}
grid_search = GridSearchCV(rfg_model2, param_grid, cv=3)
grid_search.fit(X_train,y_train)
best_clf = grid_search.best_estimator_
return best_clf
5. Summary
Summary: RandomForest is a better model for predicting the score(scored by guests), both traing sets and testing set get an acceptable or a good re_score and MSE.
Q2: Find a model to predict the score.
def show_importances(model, df):
importances = model.feature_importances_
feat_names = df.columns
tree_result = pd.DataFrame({'feature': feat_names, 'importance': importances})
tree_result.sort_values(by='importance',ascending=True)[-10:].plot(x='feature', y='importance', kind='barh')
show_importances(best_clf,X_train)
Q2: The best model to predict the score: RandomForestRegressor( bootstrap=True, criterion=‘mse’, max_depth=20, max_features=‘auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=2, min_samples_split=16, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1, oob_score=False, random_state=42, verbose=0, warm_start=False)
Q3: Acording to this image, we should conclude that all guests are happy to live a great house with a low price.
1, The host should decrease the price of his house properly. If his house get a higher score, more guests would like to rent his house.
2, Price, amenities, wether the host is a superhost, the number of host listings, response rate, security deposit would be top 6 factors that would help the host and his house to get a higher score. And the other 4 factors can only help the hosts a little.