Machine Learning & Zillow Group


1 Machine Learning & Data @ Zillow Group
Jasjeet Thind, Vice President, Data Science & Engineering
@JasjeetThind

2 Agenda
- Zillow Group
- Machine Learning Use Cases
- Architectural Patterns
- Models
- Machine Learning Pipeline
- Data Quality
- Free Zillow Data

3 Zillow Group
Build the world's largest, most trusted and vibrant home-related marketplace.

4 Machine Learning Use Cases
- Personalization
- Ad Targeting
- Zestimate (AVMs)
- Premier Agent (B2B)
- Mortgages
- Deep Learning
- Social & Content
- Demographics & Community
- Business Analytics
- Forecasting home price trends

5 Architecture
- Real-Time Scoring APIs (Python, Flask)
- Data Collection Systems (Java/Python/SQL)
- Zillow Group Data Lake (AWS - S3 / Kinesis)
- Ranking (Spark)
- Featurization (Spark)
- User Profiles (Spark / HBase)
- Aggregate Features (Spark)
- Wedge Counting
- Collaborative Filtering (Spark)

6 Architectural Patterns
Stages: Transport [Collect], Data Lake (AWS) [Store], Application (Backstage) [Process], Serving System / Analytics [Answer]
[Diagram components: Kinesis (Firehose, Stream), ZG Data Lake (S3), Application (batch), Application (near real-time), Database (Serving), Analytics, Real-Time Scoring]
[Diagram operations: Put object, Get object, Get records, Put NRT records]

7 Machine Learning Models
- K-means clustering
- K-nearest neighbors
- Wedge Counting
- Random Forest
- Gradient Boosted Machines
- CNN (Deep Learning)
- NLP / TF-IDF / Word2vec / Bag of Words
- Linear Regression

8 Like vs. Dislike
- Predict homes per user using behavior of similar users
- Like = user actively engaged with property
- Dislike = user viewed property but weak engagement

Feature        Description
uid            unique id of user
pid            property id
first_visit    timestamp or 0
num_views      sigmoid(#views)
time_spent     time on page
num_contacts   # leads sent
num_saves      # saves on zpid
num_shares     # shares on zpid
num_photos     # photos viewed

[Diagram: users Spencer and Stan linked to the properties $19M, $22M and $664K by like (+) and dislike (-) edges; the edge between Stan and the $19M home is unknown (?).]
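
The feature table lists num_views as sigmoid(#views). A minimal sketch of that squashing is shown here, assuming the standard logistic function applied directly to the raw view count; the slide does not say whether counts are shifted or scaled first, and the function name `num_views_feature` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def num_views_feature(raw_views: int) -> float:
    """Squash a raw view count so heavy viewers do not dominate the feature.

    Assumption: the sigmoid is applied to the unscaled count.
    """
    return sigmoid(float(raw_views))


# Repeated views quickly saturate towards 1.0.
for views in (0, 1, 3, 10):
    print(views, round(num_views_feature(views), 4))
```

Applied to unscaled counts, the feature saturates after a handful of views, so it mostly separates "viewed once or twice" from "viewed many times".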

9 Wedge Count
For all user & property pairs to form a prediction, perform wedge count.
Does Stan like $19M?
Wedge #: 3 (wedge03_cnt), 5 (wedge05_cnt)
[Diagram 1: Spencer and Stan, $22M with edges + and -, $19M with edge ?]
[Diagram 2: Spencer and Stan, $664k with edges - and +, $19M with edge ?]
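
One plausible reading of wedge counting can be sketched as follows. For a target (user, property) pair, walk every path user -> other property -> other user -> target property and bucket the path by the like/dislike signs on its three edges, giving eight wedge types. The bit encoding of the types, the function name and the sample data are all assumptions made for illustration; the slides do not define how wedge00 through wedge07 are numbered.

```python
from collections import Counter, defaultdict


def wedge_counts(edges, user, prop):
    """Count signed 3-edge paths from `user` to `prop`.

    edges: dict mapping (uid, pid) -> +1 (like) or -1 (dislike).
    Returns a Counter keyed by names such as "wedge05_cnt".
    Assumption: wedge type = 3-bit number built from the signs of
    (user, other_prop), (other_user, other_prop), (other_user, prop).
    """
    by_user = defaultdict(dict)
    by_prop = defaultdict(dict)
    for (u, p), sign in edges.items():
        by_user[u][p] = sign
        by_prop[p][u] = sign

    counts = Counter()
    for other_prop, s1 in by_user[user].items():
        if other_prop == prop:
            continue
        for other_user, s2 in by_prop[other_prop].items():
            if other_user == user:
                continue
            s3 = by_user[other_user].get(prop)
            if s3 is None:
                continue  # the other user never interacted with the target
            wedge_type = (int(s1 > 0) << 2) | (int(s2 > 0) << 1) | int(s3 > 0)
            counts[f"wedge{wedge_type:02d}_cnt"] += 1
    return counts


# Hypothetical data: two users, two shared properties, one target property.
sample_edges = {
    ("u1", "pA"): +1, ("u2", "pA"): -1,
    ("u1", "pB"): -1, ("u2", "pB"): +1,
    ("u2", "pT"): +1,
}
print(wedge_counts(sample_edges, "u1", "pT"))
```

The resulting counts, one per wedge type, are the kind of per-pair features a downstream classifier can consume.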

10 Gradient Boosting Classifier
- Normalize wedge counts for popular users / properties
- Prediction for all user / property pairs

Does Stan like the $19M home?
uid: Stan, pid: $19M, features: the 16 wedge features listed below

Features:
wedge00_cnt, wedge01_cnt, wedge02_cnt, wedge03_cnt, wedge04_cnt, wedge05_cnt, wedge06_cnt, wedge07_cnt
wedge00_norm_cnt, wedge01_norm_cnt, wedge02_norm_cnt, wedge03_norm_cnt, wedge04_norm_cnt, wedge05_norm_cnt, wedge06_norm_cnt, wedge07_norm_cnt
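
The 16-element feature row for one (uid, pid) pair could be assembled roughly as below. The slide says only that wedge counts are normalized for popular users and properties; dividing by the square root of the product of the user's and the property's interaction counts is an assumption chosen for illustration, as are the function and parameter names.

```python
import math

WEDGE_TYPES = range(8)
FEATURE_NAMES = (
    [f"wedge{t:02d}_cnt" for t in WEDGE_TYPES]
    + [f"wedge{t:02d}_norm_cnt" for t in WEDGE_TYPES]
)


def wedge_feature_vector(counts, user_degree, prop_degree):
    """Return raw wedge counts followed by normalized wedge counts.

    counts: mapping such as {"wedge03_cnt": 4}; missing types count as 0.
    user_degree / prop_degree: number of interactions for the user / property.
    Assumption: normalization divides by sqrt(user_degree * prop_degree).
    """
    scale = math.sqrt(max(user_degree, 1) * max(prop_degree, 1))
    raw = [float(counts.get(f"wedge{t:02d}_cnt", 0)) for t in WEDGE_TYPES]
    normalized = [c / scale for c in raw]
    return raw + normalized


row = wedge_feature_vector({"wedge03_cnt": 4, "wedge05_cnt": 2}, 4, 4)
print(dict(zip(FEATURE_NAMES, row)))
```

Rows in this layout could then be passed to any gradient boosting classifier trained on the like/dislike labels.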

11 User Profile
- Signals: website, mobile app, and search queries
- Binary classification labels (like/dislike) same as wedge count model
- User profile model determines preference scores

Features (categorical variables):
Bath       0_bath, 0.5_bath, 1_bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
Bed        0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
Price      100_125_price, 125_150_price, 150_175_price
Use Code   condo, single_family, farm_land
Zipcode    zip_98109

Training rows: pid, uid, features (the categorical variables above), label (0 or 1)
[Example preference scores, most values lost in the transcript: 0_bed: _bed: _bed: _bed: 0.6]
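
A property can be turned into the categorical tokens named on this slide and then one-hot encoded, roughly as sketched here. The 25K price bucket width is inferred from the example names (100_125_price, 125_150_price), and the function names and signatures are hypothetical.

```python
def property_tokens(beds, baths, price, use_code, zipcode, bucket_width_k=25):
    """Map raw property attributes to categorical feature names.

    Assumption: price buckets are `bucket_width_k` thousand dollars wide and
    named by their lower and upper bounds in thousands.
    """
    lower = int(price // 1000 // bucket_width_k) * bucket_width_k
    return [
        f"{baths:g}_bath",                          # 2.5 -> "2.5_bath", 2 -> "2_bath"
        f"{beds}_bed",
        f"{lower}_{lower + bucket_width_k}_price",
        use_code,
        f"zip_{zipcode}",
    ]


def one_hot(tokens, vocabulary):
    """Encode tokens as a 0/1 vector ordered like `vocabulary`."""
    present = set(tokens)
    return [1 if name in present else 0 for name in vocabulary]


tokens = property_tokens(beds=3, baths=2.5, price=130_000,
                         use_code="condo", zipcode="98109")
print(tokens)
print(one_hot(tokens, ["2_bed", "3_bed", "condo", "single_family"]))
```

Because every feature is a 0/1 indicator, a per-feature preference score can be read directly as how strongly a user favours that attribute.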

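12 Ranking
- Property matrix feature space same as user profile
- Dot product of property matrix with user profile vector
- Linear regression with additional features (e.g. age decay)

[Figure: a property matrix (rows pid_0 to pid_3, columns 0_bed, 1_bed, 2_bed, 3_bed) is multiplied by the user profile vector of uid_0 to give one score per (uid, pid) pair, for example {"uId":" ", "pId":" "} with score 0.3364. The individual matrix entries are not recoverable from the transcript.]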
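
The dot-product step can be sketched as follows: each property row shares the user profile's feature space, and its score is the dot product with the user's preference vector. The feature columns, preference values and property ids are made up for illustration, and the additional linear-regression features mentioned on the slide (such as age decay) are not modelled.

```python
def rank_properties(user_profile, property_rows, top_k=None):
    """Score each property by its dot product with the user profile.

    user_profile: list of preference scores, one per feature column.
    property_rows: dict pid -> list of 0/1 indicators in the same column order.
    Returns (pid, score) pairs sorted from best to worst.
    """
    scored = [
        (pid, sum(w * x for w, x in zip(user_profile, row)))
        for pid, row in property_rows.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical columns: 0_bed, 1_bed, 2_bed, 3_bed
profile = [0.0, 0.1, 0.8, 0.6]
properties = {
    "pid_a": [0, 0, 1, 0],
    "pid_b": [0, 0, 0, 1],
    "pid_c": [0, 1, 0, 0],
}
print(rank_properties(profile, properties))
```

At scale the same computation is a single matrix-vector product, which fits the Spark-based ranking stage listed in the architecture.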

13 Machine Learning Pipeline
Collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.

[Diagram components:]
- Event API (Java)
- User Behavior (Kinesis / S3)
- Public Record (Kinesis / S3)
- Active Listings (Kinesis / S3)
- Property Data
- Listing Data
- Producer (Python), Training Set (S3), Training Data (Spark), Train Models (Spark)
- Producer (Python), Scoring Set (S3), Scoring Data (Spark)
- Models (Python)
- Filter (Spark)
- Score (Spark)
- User Store
- Recommendations Hashmap (Redis)

[Diagram annotations:]
- Spark job creates Hive table with user events (uid, pid) partitioned by date
- pid -> uid reverse index
- Wedge Counting / User Profile Models
- Wedge features or property features (user profile)
- Past and current user events
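
The pid -> uid reverse index over (uid, pid) events, and one way it could feed candidate generation, can be sketched as below. The slides do not describe how the candidate set is actually built; treating "properties seen by users who share a property with me" as candidates is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def build_reverse_index(events):
    """Build a pid -> set of uids index from (uid, pid) event pairs."""
    index = defaultdict(set)
    for uid, pid in events:
        index[pid].add(uid)
    return index


def candidate_pids(uid, events, index):
    """Properties seen by users who overlap with `uid`, minus those already seen.

    Assumption: overlap on at least one property is enough to be a neighbour.
    """
    seen = {p for u, p in events if u == uid}
    neighbours = set()
    for pid in seen:
        neighbours |= index[pid] - {uid}
    return {p for u, p in events if u in neighbours} - seen


events = [("u1", "p1"), ("u2", "p1"), ("u2", "p2"), ("u3", "p3")]
index = build_reverse_index(events)
print(candidate_pids("u1", events, index))
```

Restricting scoring to such a candidate set keeps the number of (uid, pid) pairs far below the full cross product of users and properties.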

14 Data Quality
Analytical pipelines that measure:
- Data integrity
- Attributes / outlier detection
- Missing data
- Expected # of records
- Latency
- Models - expected data
Build reports / alerts that drive action
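
Three of the checks listed above (missing data, expected number of records, outlier detection) can be sketched for a single numeric field as follows. The z-score rule, the thresholds and the report layout are assumptions for illustration, not the checks Zillow runs; latency checks are left out.

```python
import statistics


def quality_report(records, field, expected_count, tolerance=0.1, z_threshold=3.0):
    """Summarize basic data-quality signals for one numeric field.

    records: list of dicts; a missing key or None value counts as missing data.
    expected_count: number of records the batch is expected to contain.
    tolerance: allowed relative deviation from expected_count.
    z_threshold: values further than this many standard deviations from the
        mean are flagged as outliers (an assumed, simple rule).
    """
    values = [r[field] for r in records if r.get(field) is not None]
    outliers = []
    if len(values) >= 2:
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        if stdev > 0:
            outliers = [v for v in values if abs(v - mean) / stdev > z_threshold]
    return {
        "missing": len(records) - len(values),
        "count_ok": abs(len(records) - expected_count) <= tolerance * expected_count,
        "outliers": outliers,
    }


batch = [{"price": 100}] * 9 + [{"price": 1000}, {"price": None}]
print(quality_report(batch, "price", expected_count=11, z_threshold=2.0))
```

A report like this can back the alerts mentioned on the slide, for example by failing a pipeline run when `count_ok` is false or the outlier list is non-empty.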

15 Free Zillow Data
Zillow.com/data

Zillow Home Value Index (ZHVI)
- Top / Middle / Bottom Thirds
- Single Family / Condo / Co-op
- Median Home Value Per Sq Ft

Zillow Rent Index (ZRI)
- Multi-family / SFR / Condo / Co-op
- Median ZRI Per Sq ft
- Median Rent List Price

Other Metrics
- Median List Price
- Price-to-Rent ratio
- Homes Foreclosed
- For-sale Inventory / Age Inventory
- Negative Equity
- And many more…

Time Series: national, state, metro, county, city and ZIP code levels

ZTRAX: Zillow Transaction and Assessment Dataset
Previously inaccessible or prohibitively expensive housing data for academic and institutional researchers FOR FREE.
- More than 100 gigabytes
- 374 million detailed public records across more than 2,750 U.S. counties
- 20+ years of deed transfers, mortgages, foreclosures, auctions, property tax delinquencies and more for residential and commercial properties
- Assessor data including property characteristics, geographic information, and prior valuations on approximately 200 million parcels in more than 3,100 counties
(link missing in transcript) for more information

16 Thank you!
Related Blogs
- Zillow.com/data-science
- Trulia.com/blog/tech/

Hiring
- Machine Learning Engineer
- Data Scientist
- Product Manager
- Data Engineer

