AirBnB: Getting the Perfect Score
Analyzed and Created by: Caroline Deng, Jantima Somboonsong, Lauren Tai, Lauren Yee, Gordon Wong
Table of Contents
1. Company Overview
2. Business Understanding
3. Data Understanding
4. Data Preparation
5. Modeling
6. Evaluation
7. Deployment
8. Risks & Implications
1. Company Overview
Company Overview
“Airbnb is a trusted community marketplace for people to list, discover, and book unique accommodations around the world”
Countries: 191+
Cities: 34,000+
Listings Worldwide: 2,000,000+
Total Guests: 60 mil+
35,957 listings in New York City as of February 2, 2016
2. Business Understanding
Business Understanding
Goals
We wanted to dive into Airbnb ratings and reviews to see whether we could predict which listings would get a perfect rating, and what goes into one.
Perfect rating = Score of 100
Business Understanding: Rating System
How does the rating system work?
What factors go into it?
What impact does it have on the lister and the user?
3. Data Understanding
Data Understanding: The Data
A Snapshot:
- Taken from InsideAirbnb.com (not official Airbnb data)
- Last updated February 2, 2016
- CSV format: 124 MB
- 35,957 records
- 92 columns
- Mixture of numerical, categorical, and text
Data limitations:
- Messy
Word Cloud
4. Data Preparation
Data Preparation
Tools: OpenRefine, Python, Pandas, R, Excel
3 Steps:
1. Sample & Eliminate
2. Clean
3. Create New Variables
Data Preparation: Sample & Eliminate
Take random sample
- 10% of original data
Remove columns
- Discard irrelevant and repetitive columns
- Based on low correlation to review score
- 19 columns remain
Remaining columns (index, name):
0 id
1 boroughs
2 zipcode
3 property_type
4 room_type
5 bathrooms
6 bedrooms
7 beds
8 bed_type
9 amenities
10 price
11 security_deposit.2
12 cleaning_fee.2
13 guests_included
14 extra_people
15 availability_365
16 number_of_reviews
17 review_scores_rating
18 description
19 name
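The sample-and-eliminate step can be sketched in plain Python. This is only an illustration under assumptions: the slides do not state the random seed, the correlation measure, or the cutoff that was used, so `seed`, the Pearson correlation, and `threshold` below are hypothetical choices, and the toy rows are made up for the demonstration.

```python
import math
import random

def sample_rows(rows, fraction=0.10, seed=0):
    """Draw a random sample (without replacement) of roughly `fraction` of the rows."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def weak_columns(rows, target, threshold):
    """Numeric columns whose |correlation| with `target` is below `threshold`."""
    ys = [r[target] for r in rows]
    return [c for c in rows[0]
            if c != target
            and abs(pearson([r[c] for r in rows], ys)) < threshold]

# Toy rows (made up): `beds` tracks the rating exactly, the other column does not.
rows = [{"beds": i % 4,
         "availability_365": (i * 37) % 11,
         "review_scores_rating": 80 + 5 * (i % 4)} for i in range(100)]
sample = sample_rows(rows)
candidates_to_drop = weak_columns(rows, "review_scores_rating", threshold=0.5)
```

Columns returned by `weak_columns` are only candidates for removal; irrelevant or repetitive columns (URLs, IDs) still have to be judged by hand, as the slide describes.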
Data Preparation: Cleaning
- Remove all entries in which Review Score Rating is not a number
- Remove entries with an empty ‘Summary’
- Transform Boolean True/False values into 0s and 1s
- Map categorical variables to numeric codes
Data Preparation: Data Dictionary
Boroughs: 1 - Bronx, 2 - Brooklyn, 3 - Manhattan, 4 - Queens, 5 - Staten Island
Property Type: 1 - Apartment, 2 - Bed & Breakfast, 3 - Condominium, 4 - Dorm, 5 - House, 6 - Hut, 7 - Loft, 8 - Other, 9 - Townhouse, 0 - (blank)
Listing Type: 1 - Entire home/apt, 2 - Private room, 3 - Shared room
Bed Type: 1 - Airbed, 2 - Couch, 3 - Futon, 4 - Pull-out Sofa, 5 - Real Bed
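The cleaning rules and the data dictionary could be applied roughly as follows. This is a sketch, not the authors' actual script: the column names `summary` and `instant_bookable` are taken from Appendix 1, the raw Boolean encoding (`"t"`/`"f"`) is an assumption, and using code 0 for any unmapped category simply extends the "0 - (blank)" convention shown for Property Type.

```python
import math

# Codes copied from the data dictionary above.
BOROUGH_CODES = {"Bronx": 1, "Brooklyn": 2, "Manhattan": 3,
                 "Queens": 4, "Staten Island": 5}
ROOM_TYPE_CODES = {"Entire home/apt": 1, "Private room": 2, "Shared room": 3}

def is_number(value):
    """True if `value` parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False

def clean(rows):
    """Apply the cleaning rules to a list of dict rows."""
    cleaned = []
    for row in rows:
        # Rule 1: drop rows whose rating is not a number.
        if not is_number(row.get("review_scores_rating")):
            continue
        # Rule 2: drop rows with an empty summary.
        if not (row.get("summary") or "").strip():
            continue
        out = dict(row)
        out["review_scores_rating"] = float(row["review_scores_rating"])
        # Rule 3: Boolean values become 0/1 (assumed raw encoding: "t"/"f").
        out["instant_bookable"] = 1 if row.get("instant_bookable") in ("t", "True", True) else 0
        # Rule 4: categorical values become numeric codes (0 = unmapped, an assumption).
        out["boroughs"] = BOROUGH_CODES.get(row.get("boroughs"), 0)
        out["room_type"] = ROOM_TYPE_CODES.get(row.get("room_type"), 0)
        cleaned.append(out)
    return cleaned

raw_rows = [
    {"review_scores_rating": "100", "summary": "Cozy loft", "boroughs": "Brooklyn",
     "room_type": "Private room", "instant_bookable": "t"},
    {"review_scores_rating": "", "summary": "No rating yet", "boroughs": "Queens",
     "room_type": "Shared room", "instant_bookable": "f"},
    {"review_scores_rating": "95", "summary": "   ", "boroughs": "Bronx",
     "room_type": "Entire home/apt", "instant_bookable": "f"},
]
cleaned_rows = clean(raw_rows)
```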
Data Preparation: Create New Variables
Create a new variable for the target
- Binary variable of 1 and 0
- 1 - Score of 100
- 0 - Score of less than 100
Create a variable for the count of amenities
- Using R, count the number of amenities offered by each Airbnb listing
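Both derived variables are simple to compute. The sketch below uses Python rather than the R the authors mention, and it assumes the raw `amenities` field is a brace-wrapped, comma-separated string with multi-word items in double quotes; that format is an assumption about the scraped data, not something the slides state.

```python
import csv

def perfect_score(rating):
    """Target variable: 1 for a rating of exactly 100, otherwise 0."""
    return 1 if float(rating) == 100 else 0

def count_amenities(raw):
    """Count the items in an amenities string such as '{TV,"Cable TV",Internet}'."""
    inner = raw.strip().strip("{}")
    if not inner:
        return 0
    # csv.reader handles quoted items that contain spaces or commas.
    items = next(csv.reader([inner]))
    return sum(1 for item in items if item.strip())

example = '{TV,"Cable TV",Internet,"Wireless Internet"}'
n_amenities = count_amenities(example)
target = perfect_score("100")
```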
5. Modeling
Logistic Regression
Coefficients:
Boroughs: 0.0090
Zip Code: -0.00002
Property Type: -0.0003
Room Type: 0.0272
Bathrooms: 0.0404
Beds: -0.0254
Bed Type: 0.0225
Amenities Count: 0.0078
Price: 0.00010
Security Deposit: -0.00002
Cleaning Fee: -0.000075
Number of Guests: 0.0283
Extra People Cost: -0.0009
Availability 365: -0.0003
Number of Reviews: -0.0064
AUC: .674
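One way to read the logistic regression coefficients reported above: assuming they are raw log-odds coefficients on unscaled features (the slides do not say whether features were standardized), exponentiating a coefficient gives the multiplicative change in the odds of a perfect score per one-unit increase in that feature. The values below are copied from the slide; the interpretation is a sketch under that assumption.

```python
import math

# A few coefficients copied from the slide.
coefficients = {
    "Bathrooms": 0.0404,
    "Number of Guests": 0.0283,
    "Beds": -0.0254,
    "Number of Reviews": -0.0064,
}

# exp(beta) is the odds ratio for a one-unit increase in the feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: one more unit {direction} the odds by a factor of {ratio:.4f}")
```

Every ratio is close to 1, which is consistent with the modest AUC: no single feature moves the odds of a perfect score by much.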
Classification Tree (Base Rate = 0.280, AUC = 0.746)
Split variables: Number of Reviews, Amenities Count, Guests Included
Perfect Review:
- Less than 6.5 reviews
- Less than 1.5 reviews (aka 1)
- Include more than 3.5 guests
What to Avoid:
- More than 6.5 reviews
- More than 19.5 reviews
- Include less than 16.5 amenities
Classification Tree (Base Rate = 0.280, AUC = 0.754)
Split variables: Number of Reviews, Guests Included, Extra People, Price, Amenities Count, Security Deposit, 365 Availability
Perfect Review:
- Less than 6.5 reviews
- Less than 1.5 reviews (aka 1)
- Include more than 3.5 guests
- Include less than 21.5 amenities
What to Avoid:
- More than 6.5 reviews
- Less than 19.5 reviews
- Include less than 6.5 amenities
- Make them pay less than $425 for security deposit
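A tree path is just a conjunction of threshold tests. The sketch below encodes only the "Perfect Review" path of the larger tree (AUC 0.754) as summarized on the slide; the full tree structure is not given, so the function name and the way the four conditions are combined are an illustrative reading, not the fitted model.

```python
def on_perfect_review_path(number_of_reviews, guests_included, amenities_count):
    """True if a listing satisfies every split on the 'Perfect Review' path."""
    return (
        number_of_reviews < 6.5        # first split on number of reviews
        and number_of_reviews < 1.5    # second split: effectively a single review
        and guests_included > 3.5
        and amenities_count < 21.5
    )

new_listing = on_perfect_review_path(number_of_reviews=1, guests_included=4, amenities_count=15)
busy_listing = on_perfect_review_path(number_of_reviews=10, guests_included=4, amenities_count=15)
```

The second review split makes the first redundant on this path, which shows how strongly the tree leans on listings with a single review.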
Nearest Neighbor
AUC: 0.6620499
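For reference, a nearest-neighbor scorer can be written in a few lines. The slides do not state the value of k, the distance metric, or whether features were scaled, so `k=3` and Euclidean distance below are assumptions, and the toy points are made up.

```python
import math

def knn_score(train_X, train_y, query, k=3):
    """Fraction of the k nearest training points (Euclidean distance) labeled 1."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

# Toy data: two well-separated clusters with labels 1 and 0.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
train_y = [1, 1, 1, 0, 0, 0]
score_near_positives = knn_score(train_X, train_y, (0.2, 0.2))
score_near_negatives = knn_score(train_X, train_y, (5.5, 5.5))
```

Because distances mix features on very different scales (price in dollars, bed counts in single digits), scaling the features first usually matters for this kind of model.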
Text Mining
Description
• Count Vectorizer: AUC 0.534
• TFIDF: AUC 0.551
Space
• Count Vectorizer: AUC 0.578
• TFIDF: AUC 0.603
Name
• Count Vectorizer: AUC 0.514
• TFIDF: AUC 0.522
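The difference between the two text representations can be illustrated with a small sketch. It uses the textbook weighting tf × log(N / df) on whitespace tokens; libraries such as scikit-learn apply a smoothed IDF and length normalization by default, so exact values would differ from whatever tool produced the AUCs above. The example documents are made up.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()

def count_vectors(docs):
    """One term-count mapping per document."""
    return [Counter(tokenize(d)) for d in docs]

def tfidf_vectors(docs):
    """Textbook TF-IDF: term count times log(N / document frequency)."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Each Counter yields its distinct terms once, so this counts documents per term.
    doc_freq = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
            for c in counts]

docs = ["sunny loft in brooklyn",
        "cozy room in manhattan",
        "sunny room near park"]
counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
```

In the first document, "loft" and "in" both have a raw count of 1, but TF-IDF weights "loft" higher because it appears in only one document.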
6. Evaluation
Method: K-Fold Cross Validation
For the classification tree, a maximum depth of 4 achieves the highest area under the ROC curve
For logistic regression, varying the model complexity has little effect on the area under the ROC curve
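K-fold cross validation splits the data into k folds and holds each fold out once for testing. A minimal index-splitting sketch follows; the number of folds and the seed are assumptions, since the slides do not give them.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(10, k=5))
```

To compare settings such as tree depth, one would fit a model on each `train` split, score the matching `test` split, and average the resulting AUCs per setting.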
Metric: Area Under the ROC Curve
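The area under the ROC curve equals the probability that a randomly chosen positive example (a perfect score) receives a higher model score than a randomly chosen negative one, with ties counted as one half. A direct, quadratic-time sketch of that definition, using made-up scores:

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

perfect_ranking = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
partial_ranking = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

A value of 0.5 corresponds to random ranking, which puts the text-mining AUCs of roughly 0.51 to 0.60 in context next to the tree's 0.75.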
7. Risks & Limitations
Limitations
- Data may be skewed towards 100
- Users may face pressure to give higher ratings
- Fake reviews
Deployment Hypotheticals
Mobile app or online host recommendation system
● Suggestions to hosts:
○ Have fewer reviews
○ Increase the number of guests that you can accommodate
○ Have fewer than 16 amenities
Deployment
http://thenextweb.com/apps/2015/02/21/airbnb-launches-new-dashboard-updated-app-hosts/#gref
Questions?
Appendix 1: Original data set features
id
medium_url
host_thumbnail_url
state
bedrooms
listing_url
picture_url
host_picture_url
zipcode
beds
scrape_id
xl_picture_url
host_neighbourhood
market
bed_type
last_scraped
host_id
host_listings_count
smart_location
amenities
name
host_url
host_total_listings_count
country_code
square_feet
summary
host_name
host_verifications
country
price
space
host_since
host_has_profile_pic
latitude
weekly_price
description
host_location
host_identity_verified
longitude
monthly_price
experiences_offered
host_about
street
is_location_exact
security_deposit
neighborhood_overview
host_response_time
neighbourhood
property_type
cleaning_fee
notes
host_response_rate
neighbourhood_cleansed
room_type
guests_included
transit
host_acceptance_rate
neighbourhood_group_cleansed
accommodates
extra_people
thumbnail_url
host_is_superhost
city
bathrooms
maximum_nights
review_scores_accuracy
require_guest_phone_verification
calendar_updated
review_scores_cleanliness
calculated_host_listings_count
has_availability
review_scores_checkin
reviews_per_month
availability_30
review_scores_communication
availability_60
review_scores_location
availability_90
review_scores_value
availability_365
requires_license
calendar_last_scraped
license
number_of_reviews
jurisdiction_names
first_review
instant_bookable
last_review
cancellation_policy
review_scores_rating
require_guest_profile_picture
Count: 92 features