Enhancing Discovery with Solr and Mahout. Grant Ingersoll, Chief Scientist, Lucid Imagination.


1 Enhancing Discovery with Solr and Mahout
Grant Ingersoll, Chief Scientist, Lucid Imagination
CONFIDENTIAL | Thinking Lucene, Think Lucid

2 Evolution
CONFIDENTIAL | Copyright Lucid Imagination
[Diagram labels: Documents; Models; Feature Selection; User Interaction; Clicks; Ratings/Reviews; Learning to Rank; Social Graph; Queries; Phrases; NLP; Content Relationships; Page Rank, etc.; Organization]

3 Minding the Intersection
[Diagram labels: Search; Discovery; Analytics]

4 Discussion Topics
Background
  – Apache Mahout
  – Apache Solr and Lucene
Recommendations with Mahout
  – Collaborative Filtering
Discovery with Solr and Mahout

5 Apache Lucene in a Nutshell
http://lucene.apache.org/java
Java based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
  – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet

6 Apache Solr in a Nutshell
http://lucene.apache.org/solr
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
  – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are taken care of in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices

7 Apache Mahout in a Nutshell
An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
The Three C’s:
  – Collaborative Filtering (recommenders)
  – Clustering
  – Classification
Others:
  – Frequent Item Mining
  – Primitive collections
  – Math stuff
http://dictionary.reference.com/browse/mahout

8 Recommendations with Mahout

9 Recommenders
Collaborative Filtering (CF)
  – Provide recommendations solely based on preferences expressed between users and items
  – “People who watched this also watched that”
Content-based Recommendations (CBR)
  – Provide recommendations based on the attributes of the items and user profile
  – ‘Modern Family’ is a sitcom, Bob likes sitcoms => Suggest Modern Family to Bob
Mahout geared towards CF, can be extended to do CBR
  – Classification can also be used for CBR
Aside: search engines can also solve these problems

10 To Rate or Not?
In many instances, users don’t provide actual ratings
  – Clicks, views, etc.
Non-Boolean ratings can also often introduce unnecessary noise
  – Even a low rating often has a positive correlation with highly rated items in the real world
Example: Should we recommend Frankenstein to Bob?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -
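The Bob and Mary example can be worked through numerically. The sketch below is plain Python, not Mahout's API, and the choice of Pearson correlation (for rated data) and the Tanimoto coefficient (for Boolean data) is an assumption made for illustration; the slide does not name a similarity measure. It uses only the ratings shown in the table.

```python
from math import sqrt

# Ratings from the slide; Bob's Frankenstein rating is the unknown,
# and nobody rated Java Programming, so both are simply absent.
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated."""
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in shared) / n
    mean_b = sum(b[i] for i in shared) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: item overlap, rating values ignored."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


rating_sim = pearson(ratings["Bob"], ratings["Mary"])
boolean_sim = tanimoto(ratings["Bob"], ratings["Mary"])
print(rating_sim, boolean_sim)
```

With rating values, Bob and Mary look like opposites (correlation of -1), which argues against recommending Frankenstein. With Boolean preferences they share two of three items, which argues for it. That tension is the point of the slide's question.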

11 Collaborative Filtering with Mahout
Extensive framework for collaborative filtering
Recommenders
  – User based
  – Item based
  – Slope One
Online and Offline support
  – Offline can utilize Hadoop

          Item 1   Item 2   …   Item m
User 1    -        0.5      …   0.9
User 2    0.1      0.3      …   -
…
User n    0.8      0.7      …   0.1

Recommendations for User X

12 User Similarity
[Diagram labels: Item 1, Item 2, Item 3, Item 4; User 1, User 2, User 3, User 4]
What should we recommend for User 1?

13 Item Similarity
[Diagram labels: Item 1, Item 2, Item 3, Item 4; User 1, User 2, User 3, User 4]
What should we recommend for User 1?

14 Slope One
Intuition: There is a linear relationship between rated items
  – Y = mX + b where m = 1
Solve for b upfront based on existing ratings: b = (Y - X)
  – Find the average difference in preference value for every pair of items
Online can be very fast, but requires up front computation and memory

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
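The Slope One arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch in plain Python, not Mahout's implementation; the function names `average_diffs` and `predict` are hypothetical, and weighting each difference by the number of users who contributed to it is one common variant, assumed here.

```python
from collections import defaultdict

# Ratings from the slide: user B has not rated Item 1.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}


def average_diffs(ratings):
    """Up-front step: average difference b = Y - X for every ordered item pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for y, y_val in prefs.items():
            for x, x_val in prefs.items():
                if x != y:
                    totals[(y, x)] += y_val - x_val
                    counts[(y, x)] += 1
    diffs = {pair: totals[pair] / counts[pair] for pair in totals}
    return diffs, dict(counts)


def predict(user_prefs, target, diffs, counts):
    """Online step: Y = X + b, averaged over the user's rated items."""
    numerator = 0.0
    denominator = 0
    for item, value in user_prefs.items():
        pair = (target, item)
        if pair in diffs:
            numerator += (value + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


diffs, counts = average_diffs(ratings)
prediction = predict(ratings["B"], "Item 1", diffs, counts)
print(prediction)
```

The split into two functions mirrors the slide's trade-off: `average_diffs` is the expensive, memory-hungry precomputation, while `predict` only does lookups and additions.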

15 Online and Offline Recommendations
Online
  – Predates Hadoop
  – Designed to run on a single node
    Matrix size of ~ 100M interactions
  – API for integrating with your application
Offline
  – Hadoop based
  – Designed to run on large cluster
  – Several approaches: RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

16 RecommenderJob
Essentially does matrix multiplication using distributed techniques
$MAHOUT_HOME/bin/examples/asf-email-examples.sh

        101   102   103   104   105       User A       Recs
101     7     2     0     1     3         3.0          30
102     2     8     3     5     2         0            37
103     0     3     3     6     4    X    4.0     =    38
104     1     5     6     4     7         3.0          53
105     3     2     4     7     9         2.0          64
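The multiplication on this slide can be checked on a single machine. The sketch below is plain Python and makes two assumptions the slide does not state: that the square matrix is an item-to-item co-occurrence (or similarity) matrix, and that a preference of 0 means User A has not yet rated that item. RecommenderJob itself distributes this work over Hadoop; none of that is modelled here.

```python
# Item IDs and the 5x5 item-item matrix from the slide.
items = [101, 102, 103, 104, 105]
cooccurrence = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preference vector, in the same item order.
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]

# Matrix-vector product: one score per item.
scores = [sum(c * p for c, p in zip(row, user_a)) for row in cooccurrence]

# Assumption: only items the user has not rated (preference 0) are candidates.
unrated = [(item, s) for item, s, p in zip(items, scores, user_a) if p == 0.0]
best = max(unrated, key=lambda pair: pair[1])

print(scores)
print(best)
```

Under those assumptions the only candidate is item 102, so it is the recommendation, with a score of 37.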

17 Discovery with Solr

18 Discovery with Solr
Goals:
  – Guide users to results without having to guess at keywords
  – Encourage serendipity
  – Never show empty results
Out of the Box:
  – Faceting
  – Spell Checking
  – More Like This
  – Clustering (Carrot2)
Extend
  – Clustering (with Mahout)
  – Frequent Item Mining (with Mahout)

19 Clustering
Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
Solr has search result clustering
  – Pluggable
  – Default implementation uses Carrot2
Mahout has Hadoop based large scale clustering
  – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.

20 Discovery In Action
Pre-reqs:
  – Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
  – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  – cd solr-trunk/solr/
  – ant example
  – cd example
  – java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2:
  – cd exampledocs; java -jar post.jar *.xml
http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

21 Solr + Mahout

22 Basics
Most Mahout tasks are offline
Solr provides many touch points for integration:
  – ClusteringEngine
    Clustering results
  – SearchComponent
    Suggestions: related searches, clusters, MLT, spellchecking
  – UpdateProcessor
    Classification of documents
  – FunctionQuery

23 Example: Frequent Itemset Mining
Discover frequently co-occurring items
Use Case: Related Searches from Solr Logs
Hadoop and sequential versions
  – Parallel FP Growth
Input:
  – TAB SPACE SPACE
  – Comma, pipe also allowed as delimiters
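To make "frequently co-occurring items" concrete, here is a small sketch that counts itemsets by brute force. It is not FP-Growth and not Mahout code: it enumerates every combination up to a size limit, which only works for tiny inputs. The transactions, the `frequent_itemsets` function, and its `min_support` and `max_size` parameters are all hypothetical, chosen for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each one is the token list of a single search.
transactions = [
    ["solr", "faceting"],
    ["solr", "faceting", "tutorial"],
    ["solr", "clustering"],
    ["mahout", "clustering"],
]


def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Count every itemset up to max_size and keep those seen min_support times."""
    counts = Counter()
    for tokens in transactions:
        unique = sorted(set(tokens))  # sort so each itemset has one canonical form
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


frequent = frequent_itemsets(transactions)
for itemset, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(itemset, support)
```

A pair such as ("faceting", "solr") surviving the support threshold is the kind of signal a related-searches feature can use: people who search for one term often include the other.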

24 FIM on Solr Query Logs
Goal:
  – Extract user queries from Solr logs
  – Feed into FIM to generate Related Keyword Searches
Context:
  – Solr Query logs
  – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
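The regular expression passed to `regexconverter` can be tried in isolation. The sketch below applies the same pattern with Python's `re` module to a few hypothetical log lines. URL-decoding the match and splitting it on whitespace is an assumption about what the `url` transformer and `fpg` formatter roughly do; the slide does not show their output, so treat this as an approximation rather than a description of Mahout's behaviour.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as on the slide: capture everything after "?q=" or "&q="
# up to the next "&" or the end of the line.
QUERY_RE = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")

# Hypothetical request lines; real Solr logs carry more fields.
log_lines = [
    "GET /solr/select?q=chris+hostetter&rows=10",
    "GET /solr/select?fq=type:webcast&q=faceted%20search",
    "GET /solr/admin/ping",
]


def extract_query(line):
    """Return the decoded query tokens for one log line, or None if no q= is present."""
    match = QUERY_RE.search(line)
    if match is None:
        return None
    # Assumed post-processing: URL-decode, then split into tokens.
    return unquote_plus(match.group(0)).split()


queries = [q for q in map(extract_query, log_lines) if q]
print(queries)
```

Note that the lookbehind requires `?` or `&` directly before `q=`, so the `fq=` filter parameter in the second line is skipped and only the real query is captured.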

25 Output
Key: Chris
Value:
  ([Chris, Hostetter],870),
  ([Chris],870),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
  ([Search, Faceted, Chris, Hostetter],18),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
  ([Solr, new, Chris, Hostetter, webcast, along],12),
  ([Solr, new, Chris, Hostetter, webcast],12),
  ([Solr, new, Chris, Hostetter],12)

26 Resources
http://lucene.apache.org
http://mahout.apache.org
http://manning.com/owen
http://manning.com/ingersoll
http://www.lucidimagination.com
grant@lucidimagination.com
@gsingers

27 Appendix

28 Mahout Overview
[Diagram labels: Applications; Examples; Recommenders; Clustering; Classification; Freq. Pattern Mining; Genetic; Utilities/Integration (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop]
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

