SEARCH Final Project (ILS-Z534) Yelp Data Challenge

Under the supervision of Professor Xiaozhong Liu
Presented by: Milind Gokhale, Namrata Jagasia, Deepak Bharanikana, Sameedha Bairagi, Siddharth Jayasankar

Task 1 – Predicting categories for each business using an information retrieval approach
Input: Business ID
Output: List of categories for that business ID

Dataset Division
1.6M reviews and 500K tips for 61K businesses.
Data divided into a training set and a test set.
Training set (66%): ~38K businesses, used for category feature extraction.
Test set (33%): ~20K businesses, used for prediction evaluation.
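The split above could be sketched as follows. This is a minimal illustration, assuming a seeded random shuffle; the slides do not say how businesses were assigned to each set. The function name `split_businesses` and its parameters are hypothetical.

```python
import random

def split_businesses(business_ids, train_fraction=0.66, seed=42):
    """Shuffle business IDs and split them into (train, test) lists."""
    ids = list(business_ids)
    # A fixed seed keeps the split reproducible (an assumption, not from the slides).
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Toy usage with 100 placeholder IDs instead of the real 61K businesses.
train_ids, test_ids = split_businesses(range(100))
print(len(train_ids), len(test_ids))
```

Splitting on business IDs rather than on individual reviews keeps all text for one business on the same side of the split.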

Toolset
JSON handling, indexing + search, string utilities, POS tagging, database, Java (Eclipse)

Task 1 – Algorithm
Start
Indexing [Business ID, Reviews, Tips]
Create category feature map
Perform business search on categories
Rank categories found
Evaluate precision and recall
Comparison with ground truth
End

Task 1 – Method
Index creation using Lucene
Category feature extraction from training set
Business ID | Category | Reviews and Tips Text
10001 | Restaurant, Indian, Spicy | The chicken curry is great. Loved the food. ……
10002 | Restaurant, American, Donut | The donuts are delicious. The ambiance is good.
Category | Search Query
Indian | Curry, mutter, spicy…..
Italian | Pizza, Pasta, Alfredo…..
Features are the words with the highest TF-IDF scores among all the words in the reviews and tips text for the category.
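The feature extraction step could look roughly like the sketch below. It assumes each category is treated as one pooled "document" of all its businesses' reviews and tips, and it uses a plain term-frequency times log inverse-document-frequency weighting. The slides do not give the exact TF-IDF variant or the tokenizer, so those details, the function name `top_features`, and the toy input are all illustrative.

```python
import math
import re
from collections import Counter, defaultdict

def top_features(businesses, k=3):
    """businesses: list of (categories, text). Returns {category: [top-k words]}."""
    # Pool the text of every business tagged with a category into one "document".
    term_counts = defaultdict(Counter)
    for categories, text in businesses:
        tokens = re.findall(r"[a-z]+", text.lower())
        for category in categories:
            term_counts[category].update(tokens)

    # Document frequency: in how many category documents each word occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(term_counts)
    features = {}
    for category, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        features[category] = ranked[:k]
    return features

# Toy input (made up for illustration, loosely following the slide's examples).
features = top_features([
    (["Indian"], "The chicken curry is great. Spicy curry."),
    (["Italian"], "The pizza is great. Loved the pasta and pizza."),
])
print(features)
```

Words that occur in every category document get an IDF of zero, so they drop to the bottom of the ranking without an explicit stop-word list.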

Task 1 – Method
Category scores for businesses
Business ID | Result (rank, category, score)
10001 | 1 Indian; 2 Restaurant - 0.678; 3 Asian - 0.567; …; 783 Mexican - 0.0
10002 | Donut - 0.67; Cheese - 0.56; Restaurant - 0.43; Bar - 0.0
Predicted results
Business ID | Predicted categories
10001 | Indian, Restaurant, Asian, Authentic, traditional
10002 | Donut, Cheese, Restaurant, American, Icecream

Task 1 – Evaluation
Comparison of ground truth values (provided by Yelp) with calculated predictions.
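For a single business, the comparison against ground truth comes down to set-based precision and recall. The sketch below uses example category lists from the earlier slides as illustrative input. The function name `precision_recall` is hypothetical, and how the project averaged these values across businesses is not stated in the slides.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall for one business's category lists."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

# Predicted categories for business 10001 vs. its example category labels.
p, r = precision_recall(
    ["Indian", "Restaurant", "Asian", "Authentic", "traditional"],
    ["Restaurant", "Indian", "Spicy"],
)
print(p, r)
```

Here two of the five predictions are correct and two of the three true categories are recovered.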

Task 1 – Evaluation

Task 2 – Predict the most discussed attributes in each city
Input: City name
Output: List of attributes that are most talked about in the city

Task 2 - Algorithm
Start
Split the data into test and train sets, and index the reviews and tips for each city separately.
Using WordNet, create an attribute map with the attribute name as key and search text (related words) as values.
For the given input city, perform a search for each attribute and retrieve its score and rank using the BM25 ranking function. Perform this step on both test and train data.
Assign the top 10 ranked attributes to each city for both test and train data.
Calculate precision and recall for this model.
Compare the test results with the train results.
End
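The BM25 ranking step could be sketched in plain Python as below. The project used Lucene for search, so this is only an illustration of the scoring idea, not the project's code. It assumes `k1 = 1.2` and `b = 0.75` (common defaults, not values reported in the slides), and it assumes an attribute's score for a city is the sum of BM25 scores over that city's documents, which the slides do not specify. The names `bm25_scores` and `rank_attributes` and the toy data are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    if not documents:
        return []
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            # Non-negative IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization relative to the average document length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

def rank_attributes(attribute_map, city_documents, top_n=10):
    """Rank attributes by summed BM25 score over a city's documents (assumed aggregation)."""
    totals = {
        attribute: sum(bm25_scores(words, city_documents))
        for attribute, words in attribute_map.items()
    }
    return sorted(totals, key=totals.get, reverse=True)[:top_n]

# Toy attribute map and tokenized city documents, made up for illustration.
attribute_map = {
    "Music": ["jazz", "rock", "pop"],
    "Liquor": ["alcohol", "vodka", "rum"],
    "Smoking": ["cigar", "cigarette"],
}
city_documents = [
    ["great", "jazz", "and", "rock", "night"],
    ["cheap", "vodka", "shots"],
    ["live", "jazz", "band"],
]
ranking = rank_attributes(attribute_map, city_documents)
print(ranking)
```

Running the same ranking on the train index and the test index of a city yields the two top-10 lists that the evaluation compares.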

Task 2 - Method
Splitting and indexing of data (city-wise)
Sources: Business file, Review file, Tip file (MongoDB collections)
Final collection used to index: {BusinessID : "1001", City : "Las Vegas", Rev&Tips : ["Rev1", "Rev2", ….., "Tip1", "Tip2", …]}
Train indexes (one per city: Tempe, Phoenix, Las Vegas): Reviews & Tips (Review 1, Review 2, …, Tip 1)
Test indexes (one per city: Tempe, Phoenix, Las Vegas): Reviews & Tips (Review 101, Review 102, …, Tip 101)
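Grouping the final collection by city, before building one index per city, might look like the sketch below. It uses in-memory dictionaries instead of MongoDB and Lucene, keeps the field names shown on the slide, and invents records 1002 and 1003 for illustration. The function name `group_by_city` is hypothetical.

```python
from collections import defaultdict

def group_by_city(records):
    """Map each city to the review and tip texts of all its businesses."""
    by_city = defaultdict(list)
    for record in records:
        by_city[record["City"]].extend(record["Rev&Tips"])
    return dict(by_city)

# Record 1001 follows the slide; 1002 and 1003 are made-up examples.
records = [
    {"BusinessID": "1001", "City": "Las Vegas", "Rev&Tips": ["Rev1", "Rev2", "Tip1"]},
    {"BusinessID": "1002", "City": "Tempe", "Rev&Tips": ["Rev3"]},
    {"BusinessID": "1003", "City": "Las Vegas", "Rev&Tips": ["Tip2"]},
]
city_texts = group_by_city(records)
print(city_texts)
```

Each city's list of texts would then be indexed separately, once for the train split and once for the test split.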

Task 2 - Method
We used WordNet to create an attribute map.
For the given city, we ran a search for each attribute on both test and train data and retrieved the top 10 attributes for each.
Attribute map (built with WordNet)
Attribute | Related words
Good for Kids | Healthy, colorful, son, daughter, …….
Music | Jazz, Rock, Pop, melody…..
Liquor | Alcohol, spirits, vodka, rum….
Smoking | Cigar, cigarette, lighter, …..
Results from train data (60%), top 10 attributes: Liquor; Good for Kids - 3.5; Music - 2.1; Smoking - 1.6
Results from test data (40%), top 10 attributes: Liquor; Music - 4.0; Good for Kids - 2.5; Smoking - 0.5

Task 2 - Evaluation
We compared the predicted results of the test data with the predicted results of the train data (considered as ground truth) and calculated precision and recall.
Chart panels: Charlotte, Phoenix, Las Vegas

Challenges
Task 1
Data cleaning and pre-processing time.
Even after stop-word removal, many unwanted features had high TF-IDF scores.
Java heap space out-of-memory exception during feature extraction from categories.
Task 2
Data cleaning and pre-processing.
Manual removal of some features from WordNet to improve output.
Evaluation metric.

Questions ?

Thank You…!!
