Building a Real-time, Solr-powered Recommendation Engine


Building a Real-time, Solr-powered Recommendation Engine
Trey Grainger, Manager, Search Technology Development @ CareerBuilder
Lucene Revolution 2012 - Boston

Overview Overview of Search & Matching Concepts Recommendation Approaches in Solr: Attribute-based Hierarchical Classification Concept-based More-like-this Collaborative Filtering Hybrid Approaches Important Considerations & Advanced Capabilities @ CareerBuilder

My Background
Trey Grainger, Manager, Search Technology Development @ CareerBuilder.com

Relevant Background: Search & Recommendations; High-volume, N-tier Architectures; NLP, Relevancy Tuning, User Group Testing, & Machine Learning

Fun Side Projects: Founder and Chief Engineer @ .com; currently co-authoring the Solr in Action book… keep your eyes out for the early access release from Manning Publications

About Search @ CareerBuilder
- Over 1 million new jobs each month
- Over 45 million actively searchable resumes
- ~250 globally distributed search servers (in the U.S., Europe, & Asia)
- Thousands of unique, dynamically generated indexes
- Hundreds of millions of search documents
- Over 1 million searches an hour

Search Products @

Redefining “Search Engine” “Lucene is a high-performance, full-featured text search engine library…” Yes, but really…  Lucene is a high-performance, fully-featured token matching and scoring library… which can perform full-text searching.

Redefining “Search Engine” or, in machine learning speak: A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities. Think of each field as a matrix containing each term mapped to each document

The Lucene Inverted Index (traditional text example)

What you SEND to Lucene/Solr:
  doc1: once upon a time, in a land far, far away
  doc2: the cow jumped over the moon.
  doc3: the quick brown fox jumped over the lazy dog.
  doc4: the cat in the hat
  doc5: The brown cow said "moo" once.
  …

How the content is INDEXED into Lucene/Solr (conceptually):
  a -> doc1 [2x]
  brown -> doc3 [1x], doc5 [1x]
  cat -> doc4 [1x]
  cow -> doc2 [1x], doc5 [1x]
  …
  once -> doc1 [1x], doc5 [1x]
  over -> doc2 [1x], doc3 [1x]
  the -> doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
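The conceptual mapping above can be sketched in a few lines of Python. This is an illustrative toy, not how Lucene actually works; a real index uses compressed, segmented on-disk postings and proper analyzers rather than a whitespace split.

```python
from collections import defaultdict

# The toy corpus from the slide (punctuation dropped for simplicity)
docs = {
    "doc1": "once upon a time in a land far far away",
    "doc2": "the cow jumped over the moon",
    "doc3": "the quick brown fox jumped over the lazy dog",
    "doc4": "the cat in the hat",
    "doc5": "the brown cow said moo once",
}

# Inverted index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1
```

Looking up `index["brown"]` now answers "which documents contain this term, and how often" in one step, which is exactly the lookup the slide's table depicts.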

Match Text Queries to Text Fields

/solr/select/?q=jobcontent:(software engineer)

Job Content Field:
  engineer -> doc1, doc3, doc4, doc5
  mechanical -> doc2, doc4, doc6
  software -> doc1, doc3, doc4, doc7, doc8
  …

Matches:
  engineer only: doc5
  software AND engineer: doc1, doc3, doc4
  software only: doc7, doc8

Beyond Text Searching
Lucene/Solr is a text search matching engine. When Lucene/Solr searches text, it matches tokens in the query against tokens in the index. Anything that can be searched upon can form the basis of matching and scoring: text, attributes, locations, results of functions, user behavior, classifications, etc.

Business Case for Recommendations
For companies like CareerBuilder, recommendations can provide as much or even greater business value (i.e. views, sales, job applications) than user-driven search capabilities. Recommendations create stickiness that pulls users back to your company's website, app, etc. What are recommendations? … proactive searches of relevant content on behalf of a user.

Approaches to Recommendations

Content-based:
- Attribute-based (i.e. income level, hobbies, location, experience)
- Hierarchical (i.e. "medical//nursing//oncology", "animal//dog//terrier")
- Textual Similarity (i.e. Solr's MoreLikeThis request handler & search handler)
- Concept-based (i.e. Solr => "software engineer", "java", "search", "open source")

Behavior-based:
- Collaborative Filtering: "Users who liked that also liked this…"

Hybrid Approaches

Content-based Recommendation Approaches

Attribute-based Recommendations
Example: Match User Attributes to Item Attribute Fields

Janes_Profile: {
  Industry: "healthcare",
  Locations: "Boston, MA",
  JobTitle: "Nurse Educator",
  Salary: { min: 40000, max: 60000 },
}

/solr/select/?q=(jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10)
  AND ((city:"Boston" AND state:"MA")^15 OR state:"MA")
  AND _val_:"map(salary,40000,60000,10,0)"

// By mapping the importance of each attribute to weights based upon your business
// domain, you can easily find results which match your customer's profile without
// the user having to initiate a search.
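A sketch of how such a query might be assembled from a stored profile. The `build_attribute_query` helper is hypothetical (the field names and weights come from the slide, but the profile shape and string assembly are assumptions for illustration):

```python
def build_attribute_query(profile):
    """Turn a user profile into a weighted Solr query string (illustrative sketch)."""
    title = profile["JobTitle"].lower()
    city, state = profile["Locations"].split(", ")
    salary = profile["Salary"]
    clauses = [
        '(jobtitle:"%s"^25 OR jobtitle:(%s)^10)' % (title, title),
        '((city:"%s" AND state:"%s")^15 OR state:"%s")' % (city, state, state),
        '_val_:"map(salary,%d,%d,10,0)"' % (salary["min"], salary["max"]),
    ]
    return " AND ".join(clauses)

janes_profile = {
    "Industry": "healthcare",
    "Locations": "Boston, MA",
    "JobTitle": "Nurse Educator",
    "Salary": {"min": 40000, "max": 60000},
}
query = build_attribute_query(janes_profile)
```

The resulting string matches the slide's query; in practice you would URL-encode it and send it to `/solr/select/`.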

Hierarchical Recommendations
Example: Match User Attributes to Item Attribute Fields

Janes_Profile: {
  MostLikelyCategory: "healthcare//nursing//oncology",
  2ndMostLikelyCategory: "healthcare//nursing//transplant",
  3rdMostLikelyCategory: "educator//postsecondary//nursing",
  …
}

/solr/select/?q=category:(
  ("healthcare.nursing.oncology"^40 OR "healthcare.nursing"^20 OR "healthcare"^10)
  OR ("healthcare.nursing.transplant"^20 OR "healthcare.nursing"^10 OR "healthcare"^5)
  OR ("educator.postsecondary.nursing"^10 OR "educator.postsecondary"^5 OR "educator")
)
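The pattern on the slide, boosting the full category path highest and each ancestor at roughly half the weight of the level below it, can be generated mechanically. The `expand_category` helper below is a hypothetical sketch of that expansion, not CareerBuilder's actual code:

```python
def expand_category(path, top_boost):
    """Expand 'a//b//c' into boosted clauses for the full path and each ancestor,
    halving the boost at every level up (illustrative sketch)."""
    parts = path.split("//")
    clauses, boost = [], top_boost
    for depth in range(len(parts), 0, -1):
        clauses.append('"%s"^%d' % (".".join(parts[:depth]), boost))
        boost //= 2
    return " OR ".join(clauses)

clause = expand_category("healthcare//nursing//oncology", 40)
```

Each of the user's likely categories can be expanded this way (with a lower `top_boost` for the less likely ones) and OR-ed together into the final `category:` query.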

Textual Similarity-based Recommendations
Solr's MoreLikeThis request handler and search handler are good examples of this. Essentially, "important keywords" are extracted from one or more documents and turned into a search. This produces secondary search results which demonstrate textual similarity to the original document(s). See http://wiki.apache.org/solr/MoreLikeThis for example usage. There is currently no distributed search support (but a patch is available).

Concept-based Recommendations
Approaches:
1) Create a taxonomy/dictionary to define your concepts, and then either:
   a) manually tag documents as they come in (very hard to scale… see Amazon Mechanical Turk if you must do this), or
   b) create a classification system which automatically tags content as it comes in (supervised machine learning… see Apache Mahout)
2) Use an unsupervised machine learning algorithm to cluster documents and dynamically discover concepts, no dictionary required. This is already built into Solr using Carrot2!

How Clustering Works

Setting Up Clustering in SolrConfig.xml

<searchComponent name="clustering" enable="true" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="MultilingualClustering.defaultLanguage">ENGLISH</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" enable="true" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

Clustering Search in Solr

/solr/clustering/?q=content:nursing
  &rows=100
  &carrot.title=titlefield
  &carrot.snippet=titlefield
  &LingoClusteringAlgorithm.desiredClusterCountBase=25
  &group=false // clustering & grouping don't currently play nicely

This allows you to dynamically identify "concepts" and their prevalence within a user's top search results.

Search: Nursing

Search: .Net

Example Concept-based Recommendation
Stage 1: Identify Concepts

Original query: q=(solr OR lucene)
// can be a user's search, their job title, a list of skills,
// or any other keyword-rich data source

Clusters identified: Developer (22), Java Developer (13), Software (10), Senior Java Developer (9), Architect (6), Software Engineer (6), Web Developer (5), Search (3), Software Developer (3), Systems (3), Administrator (2), Hadoop Engineer (2), Java J2EE (2), Search Development (2), Software Architect (2), Solutions Architect (2)

Facets identified (occupation): Computer Software Engineers, Web Developers, ...

Example Concept-based Recommendation
Stage 2: Run Recommendations Search

q=content:("Developer"^22 OR "Java Developer"^13 OR "Software"^10 OR
  "Senior Java Developer"^9 OR "Architect"^6 OR "Software Engineer"^6 OR
  "Web Developer"^5 OR "Search"^3 OR "Software Developer"^3 OR "Systems"^3 OR
  "Administrator"^2 OR "Hadoop Engineer"^2 OR "Java J2EE"^2 OR
  "Search Development"^2 OR "Software Architect"^2 OR "Solutions Architect"^2)
  AND occupation:("Computer Software Engineers" OR "Web Developers")

// You can also add the user's location or the original keywords to the
// recommendations search if it helps results quality for your use case.
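Building the Stage 2 query from Stage 1's output is mechanical: each cluster label becomes a phrase boosted by its cluster size. The helper below is a hypothetical sketch of that translation:

```python
def build_concept_query(clusters, occupations):
    """Build a recommendation query from (label, cluster_size) pairs, boosting
    each concept by its prevalence, plus an occupation facet filter (sketch)."""
    content = " OR ".join('"%s"^%d' % (label, size) for label, size in clusters)
    occ = " OR ".join('"%s"' % o for o in occupations)
    return "content:(%s) AND occupation:(%s)" % (content, occ)

clusters = [("Developer", 22), ("Java Developer", 13), ("Software", 10)]
q = build_concept_query(clusters, ["Computer Software Engineers", "Web Developers"])
```

Feeding in the full cluster list from Stage 1 reproduces the slide's query.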

Example Concept-based Recommendation Stage 3: Returning the Recommendations …

Important Side-bar: Geography

Geography and Recommendations
Filtering or boosting results based upon geographical area or distance can help greatly for certain use cases (jobs/resumes, tickets/concerts, restaurants). For other use cases (books, songs, movies), location sensitivity is nearly worthless.

/solr/select/?q=(Standard Recommendation Query)
  AND _val_:"(recip(geodist(location, 40.7142, 74.0064),1,1,0))"

// There are dozens of well-documented ways to search/filter/sort/boost
// on geography in Solr… this is just one example.
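For reference, Solr's recip(x, m, a, b) function computes a / (m*x + b), so the boost above decays as 1/distance (geodist() returns kilometers by default). A quick sketch of the decay; note that with b=0, as on the slide, the boost grows without bound as distance approaches zero, so b is often set to a small positive value:

```python
def recip(x, m, a, b):
    """Solr's recip() function query: a / (m*x + b)."""
    return a / (m * x + b)

# With the slide's parameters (m=1, a=1, b=0), a doc 10 km away gets boost 0.1.
# With b=1 instead, the boost is 1.0 at the target and decays smoothly:
boost_far = recip(10, 1, 1, 0)
boost_here = recip(0, 1, 1, 1)
```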

Behavior-based Recommendation Approaches (Collaborative Filtering)

The Lucene Inverted Index (user behavior example)

What you SEND to Lucene/Solr ("users who bought this product" field):
  doc1: user1, user4, user5
  doc2: user2, user3
  doc3: user4
  doc4: user4, user5
  doc5: user4, user1
  …

How the content is INDEXED into Lucene/Solr (conceptually):
  user1 -> doc1, doc5
  user2 -> doc2
  user3 -> doc2
  user4 -> doc1, doc3, doc4, doc5
  user5 -> doc1, doc4
  …

Collaborative Filtering
Step 1: Find similar users who like the same documents

q=documentid:("doc1" OR "doc4")

"Users who bought this product" field:
  doc1: user1, user4, user5
  doc2: user2, user3
  doc3: user4
  doc4: user4, user5
  doc5: user4, user1
  …

Top scoring results (most similar users):
  user5 (2 shared likes)
  user4 (2 shared likes)
  user1 (1 shared like)

Collaborative Filtering
Step 2: Search for docs "liked" by those similar users

/solr/select/?q=userlikes:("user5"^2 OR "user4"^2 OR "user1"^1)

Most similar users:
  user5 (2 shared likes)
  user4 (2 shared likes)
  user1 (1 shared like)

Index:
  user1 -> doc1, doc5
  user2 -> doc2
  user3 -> doc2
  user4 -> doc1, doc3, doc4, doc5
  user5 -> doc1, doc4
  …

Top recommended documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match
// the above example ignores idf calculations
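The two steps above can be sketched end-to-end in Python. This is an illustrative toy using the slide's data; the real approach is two Solr queries, and this sketch (like the slide) ignores idf:

```python
from collections import Counter

# item -> users who "liked" it (the data from the slides above)
likes = {
    "doc1": {"user1", "user4", "user5"},
    "doc2": {"user2", "user3"},
    "doc3": {"user4"},
    "doc4": {"user4", "user5"},
    "doc5": {"user4", "user1"},
}

def recommend(seed_docs):
    # Step 1: weight users by how many of the seed docs they share
    similar = Counter()
    for doc in seed_docs:
        for user in likes[doc]:
            similar[user] += 1
    # Step 2: score every doc by the weights of the similar users who liked it
    scores = Counter()
    for doc, users in likes.items():
        for user in users:
            scores[doc] += similar.get(user, 0)
    return [doc for doc, score in scores.most_common() if score > 0]

ranked = recommend(["doc1", "doc4"])
```

On the slide's data this reproduces the slide's ranking: doc1, doc4, doc5, doc3, with doc2 excluded because no similar user liked it.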

Lots of Variations
Users -> Item(s)
User -> Item(s) -> Users
Item -> Users -> Item(s)
etc.

Note: Just because this example uses "users" doesn't mean you have to. You can map any entity to any other related entity and achieve a similar result.

Comparison with Mahout
Recommendations are much easier for us to perform in Solr:
- Data is already present and up-to-date
- Doesn't require writing significant code to make changes (just changing queries)
- Recommendations are real-time, as opposed to asynchronously processed off-line
- Allows easy utilization of any content and available functions to boost results

Our initial tests show our collaborative filtering approach in Solr significantly outperforms our Mahout tests in terms of results quality. Note: we believe that some portion of the quality issues we have with the Mahout implementation has to do with staleness of data, due to the frequency with which our data is updated.

Our general takeaway: we believe that Mahout might be able to return better matches than Solr with a lot of custom work, but it does not perform better for us out of the box.

Because we already scale… since we already have all of our data indexed in Solr (tens to hundreds of millions of documents), there's no need for us to rebuild a sparse matrix in Hadoop (your needs may be different).

Hybrid Recommendation Approaches

Hybrid Approaches
Not much to say here; I think you get the point.

/solr/select/?q=(category:("healthcare.nursing.oncology"^10 OR "healthcare.nursing"^5 OR "healthcare")
  OR title:"Nurse Educator"^15)
  AND _val_:"map(salary,40000,60000,10,0)"^5
  AND _val_:"(recip(geodist(location, 40.7142, 74.0064),1,1,0))"

Combining multiple approaches generally yields better overall results if done intelligently. Experimentation is key here.

Important Considerations & Advanced Capabilities @ CareerBuilder

Important Considerations @ CareerBuilder
- Payload Scoring
- Measuring Results Quality
- Understanding our Users

Custom Scoring with Payloads
In addition to boosting search terms and fields, content within the same field can be boosted differently using payloads (requires a custom scoring implementation):

Content field: design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten [3] / years [3] / experience [3] / careerbuilder [2] / design [2], …

Payload bucket mappings:
  jobtitle: bucket=[1] boost=10
  company: bucket=[2] boost=4
  jobdescription: bucket=[ ] boost=1
  experience: bucket=[3] boost=1.5

We can pass in a parameter to Solr at query time specifying the boost to apply to each bucket, i.e. …&bucketWeights=1:10;2:4;3:1.5;default:1

This allows us to map many relevancy buckets to search terms at index time and adjust the weighting at query time, without having to search across hundreds of fields. By making all scoring parameters overridable at query time, we are able to do A/B testing to consistently improve our relevancy model.
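The bucket-weighting idea can be sketched as follows. This is an illustrative toy, not the actual custom Lucene similarity; the token/bucket data comes from the slide, while the function and its behavior are assumptions for illustration:

```python
# Each indexed token carries a payload "bucket" identifying which logical
# section of the source document it came from (data from the slide; an
# empty bucket means the generic jobdescription section).
tokens = [("design", "1"), ("engineer", "1"), ("really", ""), ("great", ""),
          ("job", ""), ("ten", "3"), ("years", "3"), ("experience", "3"),
          ("careerbuilder", "2"), ("design", "2")]

def payload_score(query_terms, tokens, bucket_weights, default=1.0):
    """Sum the per-bucket weight of every indexed token matching the query,
    mimicking query-time bucketWeights overrides (sketch)."""
    score = 0.0
    for term, bucket in tokens:
        if term in query_terms:
            score += bucket_weights.get(bucket, default)
    return score

# Analogous to &bucketWeights=1:10;2:4;default:1 -- "design" matches once in
# the jobtitle bucket (10) and once in the company bucket (4):
score = payload_score({"design"}, tokens, {"1": 10.0, "2": 4.0})
```

Changing the `bucket_weights` dict reweights the same indexed content without reindexing, which is what makes query-time A/B testing of the relevancy model cheap.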

Measuring Results Quality
A/B testing is key to understanding our search results quality:
- Users are randomly divided between equal groups
- Each group experiences a different algorithm for the duration of the test
- We measure "performance" of the algorithm based upon changes in user behavior: for us, more job applications = more relevant results; for other companies, that might translate into products purchased, additional friends requested, or non-search pages viewed

We use this to test both keyword search results and recommendations quality.

Understanding our Users (given limited information)

Understanding Our Users Machine learning algorithms can help us understand what matters most to different groups of users. Example: Willingness to relocate for a job (miles per percentile)

Key Takeaways
Recommendations can be as valuable as keyword search, or even more valuable. If your data fits in Solr, then you have everything you need to build an industry-leading recommendation system. Even a single keyword can be enough to begin making meaningful recommendations; build up intelligently from there.

Contact Info Trey Grainger trey.grainger@careerbuilder.com http://www.careerbuilder.com @treygrainger And yes, we are hiring – come chat with me if you are interested.