

Analyzing Airbnb data for Boston and Seattle

  • Analyzing Airbnb data for Boston and Seattle
    • 1. Business understanding
    • 2. Data understanding
      • 2.1 Load the data
      • 2.2 Preview the data
    • 3. Data preparation
      • 3.1 Merge listings and calendar
      • 3.2 Combine the two dataframes into one (concat)
      • 3.3 Working with missing data
        • 3.3.1 Checking columns with more than 20% missing data
        • 3.3.2 Checking each column in df_80
        • 3.3.3 Digging into each feature
      • 3.4 Result DataFrame
    • 4. Modeling
      • 4.1 Find the factors that affect price
      • 4.2 Heatmap
      • 4.3 Prediction
        • 4.3.1 linear_model
        • 4.3.2 RandomForest
        • 4.3.3 Modeling Process
    • 5. Summary

Analyzing Airbnb data for Boston and Seattle

In this article, I will explore Airbnb data for Boston and Seattle from 2018, and I will answer three questions by analyzing these data.

1. Business understanding

I collected one month of 2018 data from Airbnb for Boston and Seattle. I will put the two cities' data together, choose some features to build a model, and finally answer three questions:

  1. What are the top positive and negative factors that affect the price of a listing?
  2. Can we build a model to predict the score?
  3. Based on that model, what should a host do to improve the score (review_scores_rating) of a listing?

2. Data understanding

# import the basic libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
           

2.1 Load the data

# I choose data from the Boston and Seattle Airbnb datasets

boston_cal = pd.read_csv('./data/Boston/calendar.csv', low_memory=False)
boston_list = pd.read_csv('./data/Boston/listings.csv', low_memory=False)
boston_rev = pd.read_csv('./data/Boston/reviews.csv', low_memory=False)

seattle_cal = pd.read_csv('./data/Seattle/calendar.csv', low_memory=False)
seattle_list = pd.read_csv('./data/Seattle/listings.csv', low_memory=False)
seattle_rev = pd.read_csv('./data/Seattle/reviews.csv', low_memory=False)
           

2.2 Preview the data

print('boston calendar size: ', boston_cal.shape)
print('boston listings size: ', boston_list.shape)
print('boston reviews size: ', boston_rev.shape)
           

boston calendar size: (2286360, 7)

boston listings size: (6264, 106)

boston reviews size: (225880, 6)

(Previews of the calendar, listings and reviews dataframes omitted.)
  1. We can see that the 'calendar' data can be joined to the 'listings' data.
  2. The calendar data contains the listing id, date, available, price, adjusted_price, minimum_nights and maximum_nights.
  3. The listings data contains detailed information. However, it clearly has many unnecessary columns and will need to be cleaned (extract-transform-load) later.
  4. We don't need the reviews data: its most important content is free-text comments from guests, which we can hardly use here.

3. Data preparation

3.1 Merge listings and calendar

# Obviously, 'listing_id' in the calendar data refers to 'id' in the listings data
seattle_cal = seattle_cal.rename(columns={'listing_id': 'id'})
boston_cal = boston_cal.rename(columns={'listing_id': 'id'})
           
seattle_df = pd.merge(seattle_list,seattle_cal,on='id',how='outer')
boston_df = pd.merge(boston_list,boston_cal,on='id',how='outer')
boston_df.head()
           
(Preview of boston_df.head() omitted.)
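As a quick illustrative sanity check on the merge (my addition, not in the original notebook), we can compare row counts before and after:

# Illustrative sanity check: the outer merge should keep every calendar row
# (one row per listing per date) and every listing.
print(boston_cal.shape[0], boston_df.shape[0])       # calendar rows vs. merged rows
print(boston_df['id'].nunique(), boston_list.shape[0])  # distinct listings kept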

3.2 Combine the two dataframes into one (concat)

boston_df['city'] = 'boston'
seattle_df['city'] = 'seattle'
df = pd.concat([boston_df,seattle_df],ignore_index=True)
           
df.shape
           
(5595815, 112)

3.3 Working with missing data

3.3.1 Checking columns with more than 20% missing data

# Columns whose non-null fraction is at most 80%, i.e. more than 20% missing
df_80 = df.columns[df.count()/df.shape[0] <= 0.80]
df_80
           

Index(['space', 'neighborhood_overview', 'notes', 'transit', 'access',
       'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
       'xl_picture_url', 'host_about', 'host_acceptance_rate',
       'neighbourhood_group_cleansed', 'square_feet', 'weekly_price',
       'monthly_price', 'security_deposit', 'license'],
      dtype='object')
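To see how severe the gaps actually are, here is a small illustrative check (my addition) of the per-column missing ratios:

# Illustrative: fraction of missing values per column, worst first.
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head(10))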

3.3.2 Checking each column in df_80

Sample values from the 'space' column, with their value counts:

At Zeus, we focus exclusively on 30 day+ stays to meet the longer-term needs of our valued guests. When you stay in a Zeus property you will be met with top-notch amenities including premium mattresses and bedding, organic bath products, bath linens, a fully equipped kitchen, and super-fast wifi. Our staff is on call should anything come up that needs our attention or that can make your stay more enjoyable. 56210

The apartment is in a high-end building, conveniently located near South Lake Union’s technology center. With 24/7 guest support and concierge services, we combine a fully-furnished premium apartment with the consistent, high-quality experience and amenities that you expect from leading hotels. Expertly designed with customized furniture, the apartment’s furnishings are stylish, contemporary and specifically tailored for this apartment. We equip our apartments with everything you’ll need for your home away from home. Need more towels or anything additional during your stay? Our 24/7 guest support and concierge services are here for you. Kitchen: - Fully-stocked with cookware, glasses, mugs, plates, and everything you’ll need to host a dinner at your home away from home. - Glass top, Energy Efficient Range - Full-sized, Energy Efficient Refridgerator and Freezer - 1100 Watt Microwave - Paper Products and Trash Bags - Premium Dishwasher - Dish Soap, Dish Pods, Hand Soap - Coffee Maker w 25915
This column holds the description of the property written by the host. Clearly, 'space' contains a lot of free text that we cannot easily group, so we can simply drop it.

Sample values from the 'neighborhood_overview' column:

Located in central South Lake Union these apartments are a short walk to some well known technology businesses nearby (Amazon, Impinj, Northeastern University, Path). You can also enjoy the large park across the street. 22265

These downtown apartments are within a mile of Pike Place Market, the Space Needle, the Seattle Aquarium and a multitude of upscale restaurants and bars. You’re also just steps away from an assortment of venues where you can enjoy theater, comedy and every conceivable kind of music. Here, returning home for the afternoon certainly doesn’t mean an end to an exciting day. 17155

Fenway highlights - Walkability score - 95 - 2 minute walk to Sweet Cheeks Q (favorite family-style BBQ) - 10 minute walk to The Hawthorne (upscale bar serving craft cocktails) 17155

This column describes the surroundings of the property. Similarly, we can hardly extract useful information from it, so we can drop it as well.
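The samples above look like the most frequent values of each column; an illustrative way (my addition) to reproduce such a check:

# Illustrative: inspect the most common values of a free-text column.
print(df['space'].value_counts().head(2))
print(df['neighborhood_overview'].value_counts().head(3))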

We check all the columns in df_80 and put every column that needs to be deleted into a new list:

Now, we need to drop ‘space’, ‘neighborhood_overview’, ‘notes’, ‘transit’, ‘access’, ‘interaction’, ‘house_rules’, ‘thumbnail_url’, ‘medium_url’, ‘xl_picture_url’, ‘host_about’, ‘host_acceptance_rate’, ‘neighbourhood_group_cleansed’, ‘square_feet’, ‘license’.

At the same time, we need to fill in 'weekly_price', 'monthly_price' and 'security_deposit'.

list_drop = ['space', 'neighborhood_overview', 'notes', 'transit', 'access',
       'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
       'xl_picture_url', 'host_about', 'host_acceptance_rate',
       'neighbourhood_group_cleansed', 'square_feet', 'license']
           
def drop_columns(df, cols):
    '''
    Usage: Delete columns
    Input:
    df - input dataframe
    cols - the columns to be dropped
    Output:
    df - the processed dataframe
    '''
    df.drop(columns=cols, inplace=True)
    return df
           
Now we need to fill in 'weekly_price', 'monthly_price' and 'security_deposit'.

def fill_columns(df, cols):
    '''
    Usage: Fill missing values with zero
    Input:
    df - input dataframe
    cols - the columns to be filled
    Output:
    df - the processed dataframe
    '''
    for col in cols:
        df[col].fillna(value=0, inplace=True)
    return df
           
Before we process these columns, we need to complete our list_drop and list_fill_zero.

3.3.3 Digging into each feature

df.select_dtypes(include=['float', 'int']).columns
           

Index(['id', 'scrape_id', 'host_id', 'host_listings_count',
       'host_total_listings_count', 'latitude', 'longitude', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'guests_included', 'minimum_nights_x',
       'maximum_nights_x', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
       'minimum_nights_y', 'maximum_nights_y'],
      dtype='object')

df.select_dtypes(include=['object']).columns
           

Index(['listing_url', 'last_scraped', 'name', 'summary', 'description',
       'experiences_offered', 'picture_url', 'host_url', 'host_name',
       'host_since', 'host_location', 'host_response_time',
       'host_response_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed', 'city', 'state', 'zipcode',
       'market', 'smart_location', 'country_code', 'country',
       'is_location_exact', 'property_type', 'room_type', 'bed_type',
       'amenities', 'price_x', 'weekly_price', 'monthly_price',
       'security_deposit', 'cleaning_fee', 'extra_people', 'calendar_updated',
       'has_availability', 'calendar_last_scraped', 'first_review',
       'last_review', 'requires_license', 'jurisdiction_names',
       'instant_bookable', 'is_business_travel_ready', 'cancellation_policy',
       'require_guest_profile_picture', 'require_guest_phone_verification',
       'date', 'available', 'price_y', 'adjusted_price'],
      dtype='object')

Obviously, columns such as 'description', 'country', the various '_url' columns, 'state' and 'street' are useless for modeling.

list_drop = list_drop + ['country', 'country_code', 'description', 'host_name', 'host_picture_url',
            'host_thumbnail_url', 'host_url', 'last_scraped', 'smart_location',
            'host_verifications', 'listing_url', 'name', 'picture_url', 'market',
            'state', 'street', 'summary', 'zipcode', 'requires_license',
            'host_neighbourhood', 'experiences_offered', 'calendar_last_scraped',
            'jurisdiction_names', 'is_business_travel_ready', 'price_y']
           
We can see that some object columns need to be converted to int or float.

def convert_to_float(df, cols):
    '''
    Usage: Strip currency formatting and convert columns to float
    Input:
    df - input dataframe
    cols - the columns to be converted
    Output:
    df - the processed dataframe
    '''
    for col in cols:
        # remove '$' and ',' before casting, e.g. '$1,250.00' -> 1250.0
        df[col] = df[col].replace('[$,]', '', regex=True).astype(float)
    return df
           

df['calendar_updated'].value_counts() starts like this:

today           1939610
2 weeks ago      325945
yesterday        310980
a week ago       248565
3 weeks ago      217905
2 months ago     216445
3 days ago       185785
3 months ago     172280
4 weeks ago      143080
4 days ago       137605
5 weeks ago      129940
5 days ago       118625
4 months ago     112420
2 days ago       108770

def convert_calendar_int(x):
    '''Convert a 'calendar_updated' string into a number of weeks.'''
    if 'a week' in x:
        return 1
    elif 'week' in x:
        return int(str.split(x)[0])        # e.g. '3 weeks ago' -> 3
    elif 'month' in x:
        return int(str.split(x)[0]) * 4    # approximate a month as 4 weeks
    elif 'never' in x:
        return 0
    else:
        return 1                           # 'today', 'yesterday', 'N days ago'

df['calendar_updated'] = df['calendar_updated'].apply(convert_calendar_int)
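A quick illustrative check (my addition) that the conversion produced integers, in (approximate) units of weeks:

# Illustrative: the column should now hold integer weeks since the last update.
print(df['calendar_updated'].dtype)
print(df['calendar_updated'].value_counts().head())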
           

3.4 Result DataFrame

list_numerical_drop = ['id', 'scrape_id', 'host_id', 'minimum_nights_x', 'maximum_nights_x', 
                       'minimum_nights_y', 'maximum_nights_y']
           
def drop_row(df):
    '''
    Usage: Drop rows that contain missing values
    Input:
    df - input dataframe
    Output:
    df - the processed dataframe
    '''
    cols_miss = df.columns[(df.count()/len(df)) < 1]
    df.dropna(subset=cols_miss, inplace=True, axis=0)
    return df
           
Check the result dataframe:
df.select_dtypes(include=['object']).columns
           

Index(['host_since', 'host_is_superhost',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_cleansed', 'city', 'is_location_exact', 'property_type',
       'room_type', 'bed_type', 'has_availability',
       'instant_bookable', 'cancellation_policy',
       'require_guest_profile_picture', 'require_guest_phone_verification',
       'available'],
      dtype='object')

def get_dump(df):
    '''One-hot encode all remaining object columns.'''
    list_cat = df.select_dtypes(include=['object']).columns
    df = pd.get_dummies(df, columns=list_cat, drop_first=True)
    return df
           
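To tie Section 3 together, here is a minimal sketch (my addition, not the author's exact code) of how the helpers above might be chained. list_fill_zero, list_money and the renames are assumptions inferred from the surrounding text and from the column names used in Section 4:

# Illustrative cleaning pipeline, assuming the helpers defined above.
list_fill_zero = ['weekly_price', 'monthly_price', 'security_deposit']   # assumed, from 3.3.2
list_money = ['price_x', 'weekly_price', 'monthly_price',
              'security_deposit', 'cleaning_fee', 'extra_people']        # assumed currency columns

df = drop_columns(df, list_drop)
df = fill_columns(df, list_fill_zero)
df = convert_to_float(df, list_money)
df = df.drop(columns=list_numerical_drop)
df = df.rename(columns={'price_x': 'price', 'review_scores_rating': 'score'})  # assumed renames
df = drop_row(df)
df = get_dump(df)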

4. Modeling

4.1 Find the factors that affect price

corr_matrix = df.corr()
corr_matrix['price'].sort_values(ascending=False)
           

Q1: The top factors that affect the price of a listing:

price                                           1.000000
host_total_listings_count                       0.497184
accommodates                                    0.452723
beds                                            0.395042
bedrooms                                        0.392926
cleaning_fee                                    0.367075
guests_included                                 0.318110
bathrooms                                       0.306648
...
host_is_superhost_True                         -0.125351
calculated_host_listings_count_private_rooms   -0.138105
number_of_reviews                              -0.149646
number_of_reviews_ltm                          -0.160282
cancellation_policy_moderate                   -0.164931
reviews_per_month                              -0.199273
room_type_Private room                         -0.326181
Name: price, Length: 185, dtype: float64
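To read off Q1 directly, an illustrative snippet (my addition) that pulls the strongest positive and negative correlations:

# Illustrative: strongest positive and negative correlations with price.
price_corr = corr_matrix['price'].drop('price').sort_values(ascending=False)
print(price_corr.head(8))   # top positive factors
print(price_corr.tail(8))   # top negative factors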

4.2 Heatmap

But with 178 features, the full heatmap is too dense to be very helpful.

# Generate a mask for the upper triangle
mask = np.zeros_like(corr_matrix, dtype=bool)  # np.bool is deprecated in NumPy; plain bool works
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
           
(Heatmap of the full correlation matrix omitted.)
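Since the full matrix is unreadable, here is a sketch (my addition, assuming corr_matrix and cmap from above) of a heatmap restricted to the features most correlated with price:

# Illustrative: heatmap over the 10 features most correlated with price.
top_features = corr_matrix['price'].abs().sort_values(ascending=False)[:10].index
sns.heatmap(df[top_features].corr(), annot=True, fmt='.2f', cmap=cmap,
            square=True, linewidths=.5)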

4.3 Prediction

The correlations with 'score' (review_scores_rating), computed the same way as for price:

score                                              1.000000
host_is_superhost_True                             0.368763
amenities                                          0.183053
number_of_reviews_ltm                              0.162046
reviews_per_month                                  0.151192
cancellation_policy_moderate                       0.146941
number_of_reviews                                  0.145018
city_seattle                                       0.144506
property_type_Guest suite                          0.103642
host_identity_verified_True                        0.094759
...
price                                             -0.080093
neighbourhood_cleansed_Mission Hill               -0.081475
neighbourhood_cleansed_Back Bay                   -0.087674
availability_90                                   -0.097441
calculated_host_listings_count_shared_rooms       -0.103292
require_guest_profile_picture_True                -0.112825
instant_bookable_True                             -0.117803
require_guest_phone_verification_True              -0.119340
neighbourhood_cleansed_Downtown                   -0.121227
cancellation_policy_strict_14_with_grace_period   -0.138403
availability_60                                   -0.144002
cancellation_policy_strict                        -0.156851
availability_30                                   -0.181728
calculated_host_listings_count_private_rooms      -0.195279
host_total_listings_count                         -0.225087
calculated_host_listings_count_entire_homes       -0.230170
calculated_host_listings_count                    -0.255190
Name: score, Length: 185, dtype: float64

4.3.1 linear_model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
           
X = df.drop(columns=['score','city_seattle'])
Y = df['score']
X_train, X_test, y_train, y_test = train_test_split(X, Y , test_size=0.2, random_state=42)
linear_model = LinearRegression()  # the 'normalize' argument was removed in newer scikit-learn; scale features separately if needed
linear_model.fit(X_train, y_train)
y_preds = linear_model.predict(X_test)
r2_scores = r2_score(y_test,y_preds)
r2_scores
           

0.24278702995958334

The R² score is quite low (close to zero), so linear regression is not the best choice for predicting 'score'.

This time I will use a RandomForest model, with the features I selected above.

4.3.2 RandomForest

list_features = ['host_is_superhost_True',
                 'amenities',
                 'number_of_reviews_ltm',
                 'reviews_per_month',
                 'cancellation_policy_moderate',
                 'host_identity_verified_True',
                 'is_location_exact_True',
                 'host_response_rate',
                 'extra_people',
                 'security_deposit',
                 'calculated_host_listings_count',
                 'cancellation_policy_strict',
                 'neighbourhood_cleansed_Downtown',
                 'require_guest_phone_verification_True',
                 'require_guest_profile_picture_True',
                 'instant_bookable_True',
                 'price',
                 'available_True',
                 'neighbourhood_cleansed_Chinatown']
           
from sklearn.ensemble import RandomForestRegressor  # this import was missing

X_rfg = df[list_features]
Y_rfg = df['score']
X_train, X_test, y_train, y_test = train_test_split(X_rfg, Y_rfg, test_size=0.3, random_state=42)
rfg_model = RandomForestRegressor(max_depth=8, random_state=42, n_estimators=100,
                                  min_samples_leaf=2, min_samples_split=10)
rfg_model.fit(X_train, y_train)
y_rfg_preds = rfg_model.predict(X_test)
r2_rfg_scores = r2_score(y_test, y_rfg_preds)
r2_rfg_scores
           

0.38408818776902665

It is better than linear regression, but still not good.
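As an illustrative follow-up (my addition), the training-set score and MSE that the summary refers to could be computed like this:

# Illustrative: compare training and test performance of the fitted model.
print('train R2:', r2_score(y_train, rfg_model.predict(X_train)))
print('test  R2:', r2_rfg_scores)
print('test MSE:', mean_squared_error(y_test, y_rfg_preds))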

4.3.3 Modeling Process

from sklearn.model_selection import GridSearchCV
           
def find_best_rgf(df, cols):
    '''
    Usage: find the best RandomForestRegressor model via grid search
    Input:
    df - input dataframe
    cols - features to use
    Output:
    best_clf - best model
    '''
    X_rfg = df[cols]
    Y_rfg = df['score']
    X_train, X_test, y_train, y_test = train_test_split(X_rfg, Y_rfg, test_size=0.3, random_state=42)
    param_grid = {'max_depth': [8, 16, 20],
                  'min_samples_split': [8, 16, 20],
                  'min_samples_leaf': [2, 4]}
    # base model matching the best-estimator repr shown in the summary below
    base_model = RandomForestRegressor(n_estimators=10, random_state=42, n_jobs=-1)
    grid_search = GridSearchCV(base_model, param_grid, cv=3)
    grid_search.fit(X_train, y_train)
    best_clf = grid_search.best_estimator_
    return best_clf
	
           
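A minimal usage sketch (my addition); the best-estimator repr in the summary below suggests the winning parameters were max_depth=20, min_samples_split=16, min_samples_leaf=2:

best_clf = find_best_rgf(df, list_features)
print(best_clf)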

5. Summary

Summary: RandomForest is the better model for predicting the score (the rating given by guests); both the training set and the test set get an acceptable R² score and MSE.
Q2: Find a model to predict the score.

def show_importances(model, df):
    '''Plot the 10 most important features of a fitted tree-based model.'''
    importances = model.feature_importances_
    feat_names = df.columns
    tree_result = pd.DataFrame({'feature': feat_names, 'importance': importances})
    tree_result.sort_values(by='importance', ascending=True)[-10:].plot(
        x='feature', y='importance', kind='barh')

show_importances(best_clf, X_train)
           
(Bar chart of the top-10 feature importances omitted.)
Q2: The best model to predict the score:

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=20,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=2, min_samples_split=16,
                      min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                      oob_score=False, random_state=42, verbose=0,
                      warm_start=False)

Q3: According to this chart, we can conclude that guests are happiest in a great house at a low price.

1. The host should price the listing appropriately: a lower price is associated with a higher score, and a higher score attracts more guests.

2. Price, amenities, whether the host is a superhost, the number of host listings, response rate and security deposit are the top six factors that help a listing earn a higher score; the remaining four factors in the chart help only a little.
