
1 Building a Real-time, Solr-powered Recommendation Engine Trey Grainger Manager, Search Technology Lucene Revolution 2012 - Boston

2 Overview

Overview of Search & Matching Concepts

Recommendation Approaches in Solr:
– Attribute-based
– Hierarchical Classification
– Concept-based
– More-like-this
– Collaborative Filtering
– Hybrid Approaches

Important Considerations & Advanced Capabilities @ CareerBuilder

3 My Background

Trey Grainger, Manager, Search Technology, CareerBuilder.com

Relevant Background:
– Search & Recommendations
– High-volume, N-tier Architectures
– NLP, Relevancy Tuning, user group testing, & machine learning

Fun Side Projects:
– Founder and Chief
– Currently co-authoring the Solr in Action book… keep your eyes out for the early access release from Manning Publications

4 About CareerBuilder

Over 1 million new jobs each month
Over 45 million actively searchable resumes
~250 globally distributed search servers (in the U.S., Europe, & Asia)
Thousands of unique, dynamically generated indexes
Hundreds of millions of search documents
Over 1 million searches an hour

5 Search

6 Redefining "Search Engine"

"Lucene is a high-performance, full-featured text search engine library…"

Yes, but really…

Lucene is a high-performance, fully-featured token matching and scoring library… which can perform full-text searching.

7 Redefining Search Engine or, in machine learning speak: A Lucene index is a multi-dimensional sparse matrix… with very fast and powerful lookup capabilities. Think of each field as a matrix containing each term mapped to each document

8 The Lucene Inverted Index (traditional text example)

What you SEND to Lucene/Solr:

Document | Content Field
doc1 | once upon a time, in a land far, far away
doc2 | the cow jumped over the moon.
doc3 | the quick brown fox jumped over the lazy dog.
doc4 | the cat in the hat
doc5 | The brown cow said moo once.
… | …

How the content is INDEXED into Lucene/Solr (conceptually):

Term | Documents
a | doc1 [2x]
brown | doc3 [1x], doc5 [1x]
cat | doc4 [1x]
cow | doc2 [1x], doc5 [1x]
… | …
once | doc1 [1x], doc5 [1x]
over | doc2 [1x], doc3 [1x]
the | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
… | …
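
To make that structure concrete, here is a toy version of the term-to-documents mapping in Python. This is a simplified illustration of the concept, not Lucene's actual data structures:

from collections import defaultdict

docs = {
    "doc1": "once upon a time in a land far far away",
    "doc2": "the cow jumped over the moon",
    "doc3": "the quick brown fox jumped over the lazy dog",
    "doc4": "the cat in the hat",
    "doc5": "the brown cow said moo once",
}

index = defaultdict(lambda: defaultdict(int))
for doc_id, content in docs.items():
    for token in content.lower().split():
        index[token][doc_id] += 1   # term -> {doc -> term frequency}

print(dict(index["brown"]))  # {'doc3': 1, 'doc5': 1}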

9 Match Text Queries to Text Fields

/solr/select/?q=jobcontent:(software engineer)

Job Content Field | Documents
… | …
engineer | doc1, doc3, doc4, doc5
mechanical | doc2, doc4, doc6
… | …
software | doc1, doc3, doc4, doc7, doc8
… | …

engineer matches: doc1, doc3, doc4, doc5
software matches: doc1, doc3, doc4, doc7, doc8
software engineer matches: doc1, doc3, doc4

10 Beyond Text Searching

Lucene/Solr is a text search matching engine. When Lucene/Solr searches text, it is matching tokens in the query with tokens in the index.

Anything that can be searched upon can form the basis of matching and scoring:
– text, attributes, locations, results of functions, user behavior, classifications, etc.

11 Business Case for Recommendations

For companies like CareerBuilder, recommendations can provide as much or even greater business value (i.e. views, sales, job applications) than user-driven search capabilities.

Recommendations create stickiness to pull users back to your company's website, app, etc.

What are recommendations? … searches of relevant content for a user

12 Approaches to Recommendations

Content-based
– Attribute-based: e.g. income level, hobbies, location, experience
– Hierarchical: e.g. medical//nursing//oncology, animal//dog//terrier
– Textual Similarity: e.g. Solr's MoreLikeThis Request Handler & Search Handler
– Concept-based: e.g. Solr => software engineer, java, search, open source

Behavior-based
– Collaborative Filtering: "Users who liked that also liked this…"

Hybrid Approaches

13 Content-based Recommendation Approaches

14 Attribute-based Recommendations Example: Match User Attributes to Item Attribute Fields

Jane's_Profile: {
  "Industry": "healthcare",
  "Locations": "Boston, MA",
  "JobTitle": "Nurse Educator",
  "Salary": { "min": 40000, "max": 60000 },
  …
}

/solr/select/?q=(jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10)
  AND ((city:"Boston" AND state:"MA")^15 OR state:"MA")
  AND _val_:"map(salary,40000,60000,10,0)"

// By mapping the importance of each attribute to weights based upon your
// business domain, you can easily find results which match your customer's
// profile without the user having to initiate a search.
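
A minimal sketch of generating that query from a stored profile, assuming a local Solr instance and the field names shown above; the weights are illustrative, not tuned values:

import requests

profile = {
    "JobTitle": "Nurse Educator",
    "City": "Boston",
    "State": "MA",
    "Salary": {"min": 40000, "max": 60000},
}

# Build the boosted attribute query from the profile fields.
q = (
    '(jobtitle:"{title}"^25 OR jobtitle:({title})^10) '
    'AND ((city:"{city}" AND state:"{state}")^15 OR state:"{state}") '
    'AND _val_:"map(salary,{min},{max},10,0)"'
).format(title=profile["JobTitle"], city=profile["City"],
         state=profile["State"], **profile["Salary"])

resp = requests.get("http://localhost:8983/solr/select/",
                    params={"q": q, "wt": "json"})
recommendations = resp.json()["response"]["docs"]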

15 Hierarchical Recommendations Example: Match User Attributes to Item Attribute Fields

Jane's_Profile: {
  "MostLikelyCategory": "healthcare//nursing//oncology",
  "2ndMostLikelyCategory": "healthcare//nursing//transplant",
  "3rdMostLikelyCategory": "educator//postsecondary//nursing",
  …
}

/solr/select/?q=category:(
  (healthcare.nursing.oncology^40 OR healthcare.nursing^20 OR healthcare^10)
  OR (healthcare.nursing.transplant^20 OR healthcare.nursing^10 OR healthcare^5)
  OR (educator.postsecondary.nursing^10 OR educator.postsecondary^5 OR educator)
)
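
One way to generate those tiered boosts programmatically: walk up each category path, boosting ancestors less than their children. The halving-per-level scheme is an assumption that approximates the boosts above:

def category_clauses(path, base_boost):
    # "a//b//c" with boost 40 -> ["a.b.c^40", "a.b^20", "a^10"]
    parts = path.split("//")
    clauses, boost = [], base_boost
    for depth in range(len(parts), 0, -1):
        clauses.append("{}^{}".format(".".join(parts[:depth]), boost))
        boost //= 2
    return clauses

likely = [("healthcare//nursing//oncology", 40),
          ("healthcare//nursing//transplant", 20),
          ("educator//postsecondary//nursing", 10)]

all_clauses = [c for path, b in likely for c in category_clauses(path, b)]
q = "category:({})".format(" OR ".join(all_clauses))
# category:(healthcare.nursing.oncology^40 OR healthcare.nursing^20 OR
#           healthcare^10 OR healthcare.nursing.transplant^20 OR ...)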

16 Textual Similarity-based Recommendations

Solr's MoreLikeThis Request Handler / Search Handler are a good example of this. Essentially, important keywords are extracted from one or more documents and turned into a search. This results in a secondary set of search results which demonstrate textual similarity to the original document(s).

See http://wiki.apache.org/solr/MoreLikeThis for example usage.

Currently no distributed search support (but a patch is available).
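
A hedged example of calling the MoreLikeThis handler: given a source document id, pull back textually similar documents. The document id and field list are placeholders for your own schema:

import requests

resp = requests.get("http://localhost:8983/solr/mlt", params={
    "q": "id:doc12345",                 # the source document
    "mlt.fl": "jobtitle,jobcontent",    # fields to mine for important terms
    "mlt.mintf": 1,                     # min term frequency in the source doc
    "mlt.mindf": 2,                     # min document frequency in the index
    "mlt.interestingTerms": "details",  # also return the extracted terms
    "wt": "json",
})
similar_docs = resp.json()["response"]["docs"]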

17 Concept-based Recommendations Approaches:

1) Create a Taxonomy/Dictionary to define your concepts and then either:
  a) manually tag documents as they come in (very hard to scale… see Amazon Mechanical Turk if you must do this), or
  b) create a classification system which automatically tags content as it comes in (supervised machine learning; see Apache Mahout)

2) Use an unsupervised machine learning algorithm to cluster documents and dynamically discover concepts, no dictionary required (this is already built into Solr using Carrot2!)

18 How Clustering Works

19 Setting Up Clustering in SolrConfig.xml

<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="MultilingualClustering.defaultLanguage">ENGLISH</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="clustering.engine">default</str>
    <bool name="clustering">true</bool>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

20 Clustering Search in Solr

/solr/clustering/?q=content:nursing
  &rows=100
  &carrot.title=titlefield
  &carrot.snippet=titlefield
  &LingoClusteringAlgorithm.desiredClusterCountBase=25
  &group=false   //clustering & grouping don't currently play nicely

Allows you to dynamically identify concepts and their prevalence within a user's top search results
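
The same request expressed programmatically; the response parsing assumes the standard clustering component output (a top-level clusters array, each with labels and docs):

import requests

resp = requests.get("http://localhost:8983/solr/clustering", params={
    "q": "content:nursing",
    "rows": 100,
    "carrot.title": "titlefield",
    "carrot.snippet": "titlefield",
    "LingoClusteringAlgorithm.desiredClusterCountBase": 25,
    "group": "false",
    "wt": "json",
})
for cluster in resp.json()["clusters"]:
    # Each cluster: human-readable concept labels plus matching doc ids.
    print(cluster["labels"], len(cluster["docs"]))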

21 Search: Nursing

22 Search: .Net

23 Example Concept-based Recommendation

Stage 1: Identify Concepts

Original Query: q=(solr or lucene)
// can be a user's search, their job title, a list of skills,
// or any other keyword-rich data source

Clusters Identified (label, size): Developer (22), Java Developer (13), Software (10), Senior Java Developer (9), Architect (6), Software Engineer (6), Web Developer (5), Search (3), Software Developer (3), Systems (3), Administrator (2), Hadoop Engineer (2), Java J2EE (2), Search Development (2), Software Architect (2), Solutions Architect (2)

Facets Identified (occupation): Computer Software Engineers, Web Developers, ...

24 Example Concept-based Recommendation

Stage 2: Run Recommendations Search

q=content:("Developer"^22 OR "Java Developer"^13 OR "Software"^10 OR "Senior Java Developer"^9 OR "Architect"^6 OR "Software Engineer"^6 OR "Web Developer"^5 OR "Search"^3 OR "Software Developer"^3 OR "Systems"^3 OR "Administrator"^2 OR "Hadoop Engineer"^2 OR "Java J2EE"^2 OR "Search Development"^2 OR "Software Architect"^2 OR "Solutions Architect"^2) AND occupation:("Computer Software Engineers" OR "Web Developers")

// You can also add the user's location or the original keywords to the
// recommendations search if it helps results quality for your use-case.
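
A sketch of wiring Stage 1 into Stage 2: weight each discovered concept by its cluster size, then run the result as the recommendation search. The cluster data is abbreviated from the slide above:

clusters = [("Developer", 22), ("Java Developer", 13), ("Software", 10),
            ("Senior Java Developer", 9), ("Architect", 6)]  # truncated

concept_q = " OR ".join('"{}"^{}'.format(label, size)
                        for label, size in clusters)
occupations = ["Computer Software Engineers", "Web Developers"]
occupation_q = " OR ".join('"{}"'.format(o) for o in occupations)

q = "content:({}) AND occupation:({})".format(concept_q, occupation_q)
# content:("Developer"^22 OR "Java Developer"^13 OR ...) AND
# occupation:("Computer Software Engineers" OR "Web Developers")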

25 Example Concept-based Recommendation Stage 3: Returning the Recommendations …

26 Important Side-bar: Geography

27 Geography and Recommendations

Filtering or boosting results based upon geographical area or distance can help greatly for certain use cases:
– Jobs/Resumes, Tickets/Concerts, Restaurants

For other use cases, location sensitivity is nearly worthless:
– Books, Songs, Movies

/solr/select/?q=(Standard Recommendation Query) AND _val_:"recip(geodist(location, , ),1,1,0)"

// There are dozens of well-documented ways to search/filter/sort/boost
// on geography in Solr… this is just one example.
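
For instance, a distance decay added onto any recommendation query; the coordinates here are illustrative stand-ins (roughly downtown Boston), since the slide elides the real values:

import requests

base_q = "(Standard Recommendation Query)"
resp = requests.get("http://localhost:8983/solr/select/", params={
    # recip(distance,1,1,0) scores nearby docs higher, decaying with distance
    "q": base_q + ' AND _val_:"recip(geodist(location,42.35,-71.06),1,1,0)"',
    "wt": "json",
})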

28 Behavior-based Recommendation Approaches (Collaborative Filtering)

29 The Lucene Inverted Index (user behavior example)

What you SEND to Lucene/Solr:

Document | "Users who bought this product" Field
doc1 | user1, user4, user5
doc2 | user2, user3
doc3 | user4
doc4 | user4, user5
doc5 | user4, user1
… | …

How the content is INDEXED into Lucene/Solr (conceptually):

Term | Documents
user1 | doc1, doc5
user2 | doc2
user3 | doc2
user4 | doc1, doc3, doc4, doc5
user5 | doc1, doc4
… | …

30 Collaborative Filtering

Step 1: Find similar users who like the same documents

q=documentid:(doc1 OR doc4)

Document | "Users who bought this product" Field
doc1 | user1, user4, user5
doc2 | user2, user3
doc3 | user4
doc4 | user4, user5
doc5 | user4, user1
… | …

Top Scoring Results (Most Similar Users):
1) user5 (2 shared likes)
2) user4 (2 shared likes)
3) user1 (1 shared like)

31 Collaborative Filtering

Step 2: Search for docs liked by those similar users

/solr/select/?q=userlikes:(user5^2 OR user4^2 OR user1^1)

Term | Documents
user1 | doc1, doc5
user2 | doc2
user3 | doc2
user4 | doc1, doc3, doc4, doc5
user5 | doc1, doc4
… | …

Most Similar Users:
1) user5 (2 shared likes)
2) user4 (2 shared likes)
3) user1 (1 shared like)

Top Recommended Documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match
// above example ignores idf calculations
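
The two steps combined into one minimal client-side function. This assumes two cores: one of user profiles (field documentid = docs the user liked) and one of items (field userlikes = users who liked the item); core and field names follow the slides but are illustrative:

import requests

USERS = "http://localhost:8983/solr/users/select/"
ITEMS = "http://localhost:8983/solr/items/select/"

def recommend(liked_docs, top_users=3, rows=10):
    # Step 1: users whose likes overlap the input docs, ranked by overlap.
    r1 = requests.get(USERS, params={
        "q": "documentid:({})".format(" OR ".join(liked_docs)),
        "fl": "id,score", "rows": top_users, "wt": "json"})
    users = r1.json()["response"]["docs"]

    # Step 2: items liked by those users, each weighted by user similarity.
    clauses = ["{}^{:.2f}".format(u["id"], u["score"]) for u in users]
    r2 = requests.get(ITEMS, params={
        "q": "userlikes:({})".format(" OR ".join(clauses)),
        "rows": rows, "wt": "json"})
    return r2.json()["response"]["docs"]

print(recommend(["doc1", "doc4"]))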

32 Lots of Variations

User –> Item(s)
User –> Item(s) –> Users
Item –> Users –> Item(s)
etc.

(User x Item matrix: each X marks an item a given user liked)

Note: Just because this example tags items with users doesn't mean you have to. You can map any entity to any other related entity and achieve a similar result.

33 Comparison with Mahout

Recommendations are much easier for us to perform in Solr:
– Data is already present and up-to-date
– Doesn't require writing significant code to make changes (just changing queries)
– Recommendations are real-time as opposed to asynchronously processed off-line
– Allows easy utilization of any content and available functions to boost results

Our initial tests show our collaborative filtering approach in Solr significantly outperforms our Mahout tests in terms of results quality
– Note: We believe that some portion of the quality issues we have with the Mahout implementation have to do with staleness of data due to the frequency with which our data is updated.

Our general take-away:
– We believe that Mahout might be able to return better matches than Solr with a lot of custom work, but it does not perform better for us out of the box.

Because we already scale…
– Since we already have all of our data indexed in Solr (tens to hundreds of millions of documents), there's no need for us to rebuild a sparse matrix in Hadoop (your needs may be different).

34 Hybrid Recommendation Approaches

35 Hybrid Approaches

Not much to say here, I think you get the point.

/solr/select/?q=(category:(healthcare.nursing.oncology^10 OR healthcare.nursing^5 OR healthcare) OR title:"Nurse Educator"^15) AND _val_:"map(salary,40000,60000,10,0)"^5 AND _val_:"recip(geodist(location, , ),1,1,0)"

Combining multiple approaches generally yields better overall results if done intelligently. Experimentation is key here.

36 Important Considerations & Advanced Capabilities @ CareerBuilder

37 Important Considerations @ CareerBuilder: Payload Scoring; Measuring Results Quality; Understanding our Users

38 Custom Scoring with Payloads

In addition to boosting search terms and fields, content within the same field can also be boosted differently using Payloads (requires a custom scoring implementation):

Content Field: design [1] / engineer [1] / really [ ] / great [ ] / job [ ] / ten [3] / years [3] / experience [3] / careerbuilder [2] / design [2], …

Payload Bucket Mappings: jobtitle: bucket=[1] boost=10; company: bucket=[2] boost=4; jobdescription: bucket=[ ] boost=1; experience: bucket=[3] boost=1.5

We can pass in a parameter to Solr at query time specifying the boost to apply to each bucket, i.e. …&bucketWeights=1:10;2:4;3:1.5;default:1;

This allows us to map many relevancy buckets to search terms at index time and adjust the weighting at query time without having to search across hundreds of fields. By making all scoring parameters overridable at query time, we are able to do A/B testing to consistently improve our relevancy model.
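
The payload-aware scorer itself is custom Java inside Lucene's Similarity, but the query-time parameter is simple to picture. A sketch of parsing the bucketWeights parameter shown above, with the format assumed from the slide:

def parse_bucket_weights(param):
    # "1:10;2:4;3:1.5;default:1" -> {"1": 10.0, "2": 4.0, ...}
    weights = {}
    for pair in param.strip(";").split(";"):
        bucket, boost = pair.split(":")
        weights[bucket] = float(boost)
    return weights

assert parse_bucket_weights("1:10;2:4;3:1.5;default:1;") == {
    "1": 10.0, "2": 4.0, "3": 1.5, "default": 1.0}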

39 Measuring Results Quality

A/B Testing is key to understanding our search results quality:
– Users are randomly divided between equal groups
– Each group experiences a different algorithm for the duration of the test
– We can measure performance of the algorithm based upon changes in user behavior:
  – For us, more job applications = more relevant results
  – For other companies, that might translate into products purchased, additional friends requested, or non-search pages viewed

We use this to test both keyword search results and recommendation quality.
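
A sketch of the sticky random assignment such a test needs: hashing the user id keeps each user in the same group across visits while splitting traffic evenly in expectation. Group names are illustrative:

import hashlib

def ab_group(user_id, groups=("control", "variant")):
    # Deterministic hash -> same user always lands in the same group.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return groups[int(digest, 16) % len(groups)]

print(ab_group("user12345"))  # same user -> same group on every visit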

40 Understanding our Users (given limited information)

41 Understanding Our Users Machine learning algorithms can help us understand what matters most to different groups of users. Example: Willingness to relocate for a job (miles per percentile)

42 Key Takeaways

Recommendations can be as valuable as, or even more valuable than, keyword search.

If your data fits in Solr then you have everything you need to build an industry-leading recommendation system.

Even a single keyword can be enough to begin making meaningful recommendations. Build up intelligently from there.

43 Contact Info Trey Grainger. And yes, we are hiring – come chat with me if you are interested.

