© 2009 IBM Corporation Extracting User Profiles from Large Scale Data Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass.

Presentation transcript:

Slide 1: Extracting User Profiles from Large Scale Data
Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass
IBM Haifa Research Lab
David Konopnicki

Slide 2: Motivating Example
Example keywords: san-francisco, peer, michael jackson, alive, analysis
User browsing: large-scale content analysis for a very large number of users; update user profiles.
Keyword modeling: for each user, report the most meaningful keywords to describe her profile.
Track statistics about readers' interests.
Diagram components: profiles database, dashboard, advertisement system.

Slide 3: Contributions
User profiling framework:
– User profile model
– KL approach to weighting the user profile
Large-scale implementation:
– MapReduce flow
Experiments:
– Quality analysis
– Scalability analysis

Slide 4: User Profiling Framework – Setting
Diagram labels: logging, targeting

Slide 5: User Profiling – Definitions
– Bag of words model (BOW)
– Profile maintenance
– User snapshot
– Community snapshot

Slide 6: User Profiling – Intuition
Find terms that are highly frequent in the user snapshot and that best separate the user snapshot from the community snapshot.
Example: { Travel, Tennis, Sport }

Slide 7: User Profiling – Naïve approach
Term frequency tf(t, d): the number of times a term t appears in document d.
Document frequency df(t, D): the number of documents in D containing the term t.
Annotations on the slide's formula:
– average tf over the user snapshot
– inverse document frequency (idf) of a term in the community snapshot
– probability of finding a term in the user snapshot
– frequent
– separate
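The slide's formula did not survive in the transcript, so the following is only a minimal sketch of how the quantities named above might be combined. It assumes a score of the form (average tf over the user snapshot) × log(|community snapshot| / df in the community snapshot); that combination, and the names `naive_score`, `user_docs` and `community_docs`, are illustrative assumptions rather than the authors' definition.

```python
import math
from collections import Counter


def tf(term, doc):
    """Number of times `term` appears in `doc` (a list of tokens)."""
    return Counter(doc)[term]


def df(term, docs):
    """Number of documents in `docs` that contain `term`."""
    return sum(1 for d in docs if term in d)


def naive_score(term, user_docs, community_docs):
    # ASSUMPTION: "frequent" is modeled as the average tf over the user
    # snapshot and "separate" as a log-idf over the community snapshot.
    avg_tf = sum(tf(term, d) for d in user_docs) / len(user_docs)
    community_df = df(term, community_docs)
    if community_df == 0:
        # idf is undefined for a term the community never uses; this
        # sketch simply scores it 0.
        return 0.0
    idf = math.log(len(community_docs) / community_df)
    return avg_tf * idf


# Toy data (hypothetical): the community snapshot includes the user's documents.
user_docs = [["tennis", "travel", "tennis"], ["tennis", "news"]]
community_docs = user_docs + [
    ["news", "politics"],
    ["news", "sport"],
    ["travel", "news"],
]
```

In this toy data "tennis" is frequent for the user and rare in the community, so it scores higher than "news", which appears in most community documents.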

Slide 8: Kullback-Leibler (KL) Divergence
Measures the difference between two probability distributions P1 and P2.
KL measures the distance between the community distribution and the user distribution.
Each term is scored according to its contribution to the KL distance between the community and the user distributions.
The top-scoring terms are then selected as the user's important terms.
Diagram labels: User, Community
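For reference, the standard definition is KL(P1 || P2) = sum over t of P1(t) · log(P1(t) / P2(t)), so each term t contributes one summand. The sketch below scores terms by that summand and keeps the top k. The slide does not state the direction of the divergence; taking the user distribution as P1 and the community distribution as P2 is an assumption, as are the function names and the toy probabilities.

```python
import math


def kl_term_scores(p_user, p_community):
    """Per-term summand P_u(t) * log(P_u(t) / P_c(t)) of KL(user || community)."""
    scores = {}
    for term, pu in p_user.items():
        pc = p_community.get(term, 0.0)
        # Terms with zero probability on either side are skipped here;
        # in practice smoothing would be applied first.
        if pu > 0 and pc > 0:
            scores[term] = pu * math.log(pu / pc)
    return scores


def top_k_terms(scores, k):
    """Terms with the k largest scores, highest first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Hypothetical distributions over a three-term vocabulary.
p_user = {"tennis": 0.5, "travel": 0.3, "news": 0.2}
p_community = {"tennis": 0.1, "travel": 0.2, "news": 0.7}
scores = kl_term_scores(p_user, p_community)
profile = top_k_terms(scores, 2)
```

Terms the user reads less than the community (here "news") get a negative contribution and fall to the bottom of the ranking.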

Slide 9: User Profiling – KL method
Community marginal term distribution
User marginal term distribution
Annotations on the slide's formulas:
– average tf over the community snapshot
– probability of finding a term t in the community snapshot
– probability normalization factor
– = 0.001
– smoothing with the community snapshot
– relative initial weight of term t
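The slide mentions smoothing the user distribution with the community snapshot, but the formula is not in the transcript. One common way to do this is linear interpolation, sketched below. Treating the slide's value 0.001 as the weight on the community distribution is an assumption (it may play a different role in the authors' formula), and the names `smooth_user_distribution` and `lam` are hypothetical.

```python
def smooth_user_distribution(p_user_raw, p_community, lam=0.001):
    """Linear interpolation of the raw user distribution with the community one.

    ASSUMPTION: `lam` is the weight given to the community distribution;
    the slide shows the value 0.001 without its symbol or formula.
    """
    terms = set(p_user_raw) | set(p_community)
    return {
        t: (1.0 - lam) * p_user_raw.get(t, 0.0) + lam * p_community.get(t, 0.0)
        for t in terms
    }


# Hypothetical distributions: the user never used the term "news".
p_user_raw = {"tennis": 0.6, "travel": 0.4}
p_community = {"tennis": 0.1, "travel": 0.2, "news": 0.7}
p_user_smoothed = smooth_user_distribution(p_user_raw, p_community)
```

After smoothing, every community term has a non-zero user probability, so a per-term log ratio between the two distributions is always defined.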

Slide 10: MapReduce Flow
Diagram labels: HDFS, TF, UDF, DF, average TF, N_j, |D_j(u)|, P(t|D_j)
Jobs, in the order they appear on the slide:
– Mapper: input (u, d); output (u, 1). Reducer: output (u, |D_j(u)|) // Sum
– Mapper: input (d, text); output ({t, d}, 1). Reducer: output ({t, d}, tf(t, d)) // Sum
– Mapper: input ({t, d}, tf(t, D_j)); output (t, 1). Reducer: output (t, {df(t, D_j), idf(t, D_j), cdf(t, D_j)})
– Mapper: input (t, tf(t, d), |D_j|); output (t, {tf(t, d), |D_j|, 1}). Reducer: output (t, tf(t, D_j)) // Avg
– Mapper: input ({t}, {tf(t, D_j), cdf(t, D_j)}); output (t, N_j). Reducer: identity
– Mapper: input ({t}, {tf(t, D_j), |D_j|, cdf(t, D_j), N_j}); output (t, P(t|D_j)). Reducer: identity
– Mapper: input ({u, t, d}, {tf(t, D_j(u)), |D_j(u)|}); output ({u, t, |D_j(u)|}, {1}). Reducer: output ({u, t}, {udf(t, D_j(u))})
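To make the key/value shapes above concrete, here is a minimal in-memory sketch of the tf job followed by the df job. It is plain Python, not the Hadoop API: `run_job` is a hypothetical stand-in for the map, shuffle and reduce phases, and the df reducer here only sums counts, whereas the slide's reducer also emits idf and cdf.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job (not the Hadoop API)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


def tf_mapper(doc_id, text):
    # input (d, text) -> output ({t, d}, 1) for every token occurrence
    for term in text.split():
        yield (term, doc_id), 1


def sum_reducer(key, values):
    # Sums the 1s emitted for each key.
    yield key, sum(values)


def df_mapper(term_doc, _tf):
    # input ({t, d}, tf) -> output (t, 1): one count per document containing t
    term, _doc_id = term_doc
    yield term, 1


docs = [("d1", "tennis travel tennis"), ("d2", "tennis news")]
tf_out = run_job(docs, tf_mapper, sum_reducer)    # ({t, d}, tf(t, d))
df_out = run_job(tf_out, df_mapper, sum_reducer)  # (t, df(t, D))
```

Chaining the jobs this way mirrors the slide's flow, where each job's output is written out and read back as the next job's input.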

Slide 11: MapReduce Flow (cont.)
Diagram labels: HDFS, P(t|D_j(u)), w

Slide 12: Experimental Data – Quality analysis
Open Directory Project (ODP):
– Categories are associated with manual labels
– Considered as ground truth in this work
– Examples:
   ODP: Science/Technology/Electronics; manual label: Electronics
   ODP: Society/Religion/and/Spirituality/Buddhism; manual label: Buddhism
Data collection:
– 100 different categories randomly selected from ODP
– 100 documents randomly selected per category
– A total collection size of about 10,000 Web pages
Evaluation:
– A suggested label counts as a match if it is identical to the manual label, an inflection of it, or a WordNet synonym of it

Slide 13: Results
Metric: the number of cases in which at least one of the top-K terms is correct.
KL outperforms all other approaches for feature selection.
ODP category label: top-5 KL important terms
– Bowling: bowl, bowler, lane, bowl center, league
– Buddhism: Buddhist, Buddhism, Buddha, Zen, dharma
– Ice Hockey: hockey, nhl, hockey league, coach, head coach
– Electronics: voltage, high voltage, circuit, laser, power supply
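The "at least one correct term in the top K" measure can be sketched as follows. This is a simplified illustration: `is_match` only checks membership in a caller-supplied set of accepted forms, so inflections and WordNet synonyms would have to be placed in that set by hand. The accepted sets below are hypothetical and contain the exact label only, so the numbers this produces are not the presentation's results.

```python
def is_match(term, accepted):
    """True if `term` (case-insensitive) is one of the accepted label forms."""
    return term.lower() in accepted


def hit_rate_at_k(ranked_terms, accepted_labels, k):
    """Fraction of categories with at least one accepted term in the top k."""
    hits = sum(
        1
        for category, terms in ranked_terms.items()
        if any(is_match(t, accepted_labels[category]) for t in terms[:k])
    )
    return hits / len(ranked_terms)


# Term rankings copied from the table above.
ranked_terms = {
    "Buddhism": ["Buddhist", "Buddhism", "Buddha", "Zen", "dharma"],
    "Electronics": ["voltage", "high voltage", "circuit", "laser", "power supply"],
}
# ASSUMPTION: exact label only; no inflections or synonyms are listed.
accepted_labels = {
    "Buddhism": {"buddhism"},
    "Electronics": {"electronics"},
}
```

With exact matching only, "Buddhism" is found at rank 2 and "Electronics" is never found, which shows why the evaluation criterion also admits inflections and synonyms.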

Slide 14: Experimental Data – Scalability analysis
Blogger.com data collection:
– We crawled 973,518 blog posts from March 2007 until January 2009
– Total collection size of 5.45 GB, with ~120,000 users
Cluster setting:
– 4-node cluster of commodity machines (each machine with 4 GB RAM, 60 GB HD, 4 cores)
– Hadoop
Figure label: Blog entry

Slide 15: Number of User Profiles
Plot series: time ratio, document ratio, user profile ratio
The runtime ratio is correlated with the user-profile ratio.

Slide 16: Data Size
Runtime increases linearly with data size.
Users: 18,000 users chosen from March–April 2007.

Slide 17: Related Work
Content-based user profiling:
– The profile contains a taxonomic hierarchy for the long-term model. The taxonomy is taken from the ODP. Short-term activities update the hierarchy.
– Adaptive user profile: uses words that appear in the Web pages and combines them using tf-idf, looking at a window and weighting terms according to the recency of the browsing.
KL approach to user tasks:
– Filter out new documents that are not related to the user, based on his profile.
– Annotate a URL with the most descriptive query term for a given user, based on his profile.
User targeting in large-scale systems:
– Behavioral targeting system over Hadoop MapReduce.
– Large-scale collaborative filtering (CF) technique for recommending movies to users.
– Incremental algorithm that constructs a user profile from monitoring and user feedback, trading off complexity against the quality of the profile.

Slide 18: Conclusions & Future Work
We proposed a scalable user profiling solution, implemented on top of Hadoop MapReduce.
We showed quality and scalability results.
We plan to extend the user model into a semantic model.
We plan to extend the user profile to include structured data.

Thank you!