
© 2009 IBM Corporation Extracting User Profiles from Large Scale Data Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass.


1 © 2009 IBM Corporation
Extracting User Profiles from Large Scale Data
Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass
IBM Haifa Research Lab
David Konopnicki

2 Motivating Example
Large-scale content analysis for a very large number of users.
Keyword modeling: for each user, report the most meaningful keywords to describe her profile.
[Figure labels: User Browsing; Update users' profiles; Profiles database; Track statistics about readers' interests; Dashboard; Advertisement System. Example keywords: san-francisco, peer, michael jackson, alive, analysis]

3 Contributions
User profiling framework:
– User profile model
– KL approach to weight the user profile
Large-scale implementation:
– MapReduce flow
Experiments:
– Quality analysis
– Scalability analysis

4 User Profiling Framework - Setting
[Figure labels: logging, targeting]

5 User Profiling - Definitions
– Bag of words model (BOW)
– Profile maintenance
– User snapshot
– Community snapshot

6 User Profiling - Intuition
Find terms that are highly frequent in the user snapshot and that best separate the user snapshot from the community snapshot.
Example: { Travel, Tennis, Sport }

7 User Profiling – Naïve approach
Term frequency tf(t,d): the number of times a term t appears in document d.
Document frequency df(t,D): the number of documents in D that contain the term t.
[Formula annotations: average tf over the user snapshot; inverse document frequency (idf) of a term in the community snapshot; probability of finding a term in the user snapshot. Formula labels: frequent, separate]
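The three annotated quantities can be combined in a few lines of Python. This is only a sketch: the slide's formula did not survive in the transcript, so multiplying the three factors, averaging tf over all documents in the user snapshot, and the function name `naive_weights` are assumptions made for illustration.

```python
import math
from collections import Counter

def naive_weights(user_docs, community_docs):
    """Score each term in the user snapshot (documents are lists of tokens).

    Assumed form (not taken from the slides):
        weight = avg_tf(user) * P(term in user snapshot) * idf(community)
    """
    n_user = len(user_docs)
    n_comm = len(community_docs)
    # df(t, D): number of documents containing t, per snapshot
    comm_df = Counter(t for doc in community_docs for t in set(doc))
    user_df = Counter(t for doc in user_docs for t in set(doc))
    # total term counts over the user snapshot
    user_tf = Counter(t for doc in user_docs for t in doc)

    weights = {}
    for term, total_tf in user_tf.items():
        avg_tf = total_tf / n_user            # "frequent" part
        p_user = user_df[term] / n_user       # chance of meeting the term in a user document
        idf = math.log(n_comm / max(1, comm_df[term]))  # "separate" part
        weights[term] = avg_tf * p_user * idf
    return weights

# Toy data: the user's two documents are part of a four-document community.
user_docs = [["tennis", "travel", "tennis"], ["tennis", "news"]]
community_docs = user_docs + [["news", "politics"], ["news", "weather"]]
weights = naive_weights(user_docs, community_docs)
best_term = max(weights, key=weights.get)
```

On this toy data a term that is frequent for the user but rare in the community ("tennis") outranks a term that is common everywhere ("news").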

8 Kullback-Leibler (KL) Divergence
Measures the difference between two probability distributions P1 and P2.
KL measures the distance between the community distribution and the user distribution.
Each term is scored according to its contribution to the KL distance between the community and the user distributions.
The top-scored terms are then selected as the user's important terms.
[Figure labels: User, Community]
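The standard definition is KL(P1 || P2) = sum over t of P1(t) * log(P1(t) / P2(t)), so each term contributes one summand. The Python sketch below scores terms by that summand. It assumes the user distribution plays the role of P1 and the community distribution the role of P2, which the transcript does not state; the function names are illustrative.

```python
import math

def kl_term_scores(p_user, p_comm):
    """Per-term contribution to KL(user || community).

    Assumes p_comm[t] > 0 for every term with p_user[t] > 0.
    """
    scores = {}
    for term, pu in p_user.items():
        if pu > 0:
            scores[term] = pu * math.log(pu / p_comm[term])
    return scores

def top_terms(scores, k):
    """Return the k terms with the largest contribution."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy distributions over a three-term vocabulary.
p_user = {"tennis": 0.5, "travel": 0.3, "news": 0.2}
p_comm = {"tennis": 0.1, "travel": 0.1, "news": 0.8}
scores = kl_term_scores(p_user, p_comm)
kl_divergence = sum(scores.values())
profile = top_terms(scores, 2)
```

Terms the user over-uses relative to the community get positive scores, and terms the user under-uses get negative ones, so keeping the top-scored terms selects what distinguishes the user.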

9 User Profiling – KL method
Formulas shown: community marginal term distribution; user marginal term distribution.
[Formula annotations: average tf over the community snapshot; probability of finding a term t in the community snapshot; probability normalization factor; = 0.001; smoothing with the community snapshot; relative initial weight of term t]
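Smoothing the user distribution with the community distribution keeps every community term at a non-zero probability for the user. The Python sketch below uses a linear interpolation. The interpolation form, the reading of 0.001 as the weight on the community component, and all names are assumptions, because the slide's formula is not preserved in the transcript.

```python
def smoothed_user_distribution(initial_weights, p_comm, community_weight=0.001):
    """Mix a user's relative term weights with the community distribution.

    initial_weights: term -> non-negative initial weight for the user
    p_comm:          term -> community probability (sums to 1, covers the user's terms)
    community_weight: assumed role of the 0.001 value on the slide
    """
    total = sum(initial_weights.values())
    smoothed = {}
    for term, pc in p_comm.items():
        # relative initial weight of the term for this user
        relative = initial_weights.get(term, 0.0) / total if total else 0.0
        smoothed[term] = (1 - community_weight) * relative + community_weight * pc
    return smoothed

initial_weights = {"tennis": 3.0, "travel": 1.0}
p_comm = {"tennis": 0.1, "travel": 0.1, "news": 0.8}
p_user = smoothed_user_distribution(initial_weights, p_comm)
```

With these inputs the unseen term "news" receives a small positive probability and the result still sums to 1.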

10 MapReduce Flow
[Diagram nodes: HDFS, TF, UDF, DF, average TF, N_j, |D_j(u)|, P(t|D_j)]
– Mapper: input (u, d); output (u, 1). Reducer: output (u, |D_j(u)|) // Sum
– Mapper: input (d, text); output ({t,d}, 1). Reducer: output ({t,d}, tf(t,d)) // Sum
– Mapper: input ({t,d}, tf(t,D_j)); output (t, 1). Reducer: output (t, {df(t,D_j), idf(t,D_j), cdf(t,D_j)})
– Mapper: input (t, tf(t,d), |D_j|); output (t, {tf(t,d), |D_j|, 1}). Reducer: output (t, tf(t,D_j)) // Avg
– Mapper: input ({t}, {tf(t,D_j), cdf(t,D_j)}); output (t, N_j). Reducer: identity
– Mapper: input ({t}, {tf(t,D_j), |D_j|, cdf(t,D_j), N_j}); output (t, P(t|D_j)). Reducer: identity
– Mapper: input ({u,t,d}, {tf(t,D_j(u)), |D_j(u)|}); output ({u,t,|D_j(u)|}, {1}). Reducer: output ({u,t}, {udf(t,D_j(u))})
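Two of the jobs above, term frequency and document frequency, can be imitated in memory to show how the key/value pairs chain from one job to the next. This is a plain-Python sketch, not Hadoop code; `run_job` is a hypothetical helper that stands in for the shuffle phase, and whitespace tokenization is an assumption.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]

def tf_mapper(record):
    """Input (d, text); emit ((t, d), 1) for every token occurrence."""
    doc_id, text = record
    for term in text.split():
        yield (term, doc_id), 1

def df_mapper(record):
    """Input ((t, d), tf); emit (t, 1) once per document containing t."""
    (term, _doc_id), _tf = record
    yield term, 1

def sum_reducer(key, values):
    """Sum the values collected for a key."""
    return key, sum(values)

docs = [("d1", "tennis travel tennis"), ("d2", "tennis news")]
tf = run_job(docs, tf_mapper, sum_reducer)   # ((t, d), tf(t, d))
df = run_job(tf, df_mapper, sum_reducer)     # (t, df(t, D))
```

The output of the first job is the input of the second, which mirrors how the slide's jobs pass intermediate results through HDFS.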

11 MapReduce Flow - cont.
[Diagram nodes: HDFS, P(t|D_j(u)), w]

12 Experimental Data - quality analysis
Open Directory Project (ODP):
– Categories are associated with manual labels
– The manual labels are considered ground truth in this work
– Examples:
  ODP: Science/Technology/Electronics; manual label: Electronics
  ODP: Society/Religion_and_Spirituality/Buddhism; manual label: Buddhism
Data collection:
– 100 different categories randomly selected from ODP
– 100 documents randomly selected per category
– A total collection size of about 10,000 Web pages
Evaluation:
– A suggested label counts as a match if it is identical to the manual label, an inflection of it, or a WordNet synonym of it

13 Results
Measure: the number of cases in which at least one correct term appears among the top-K terms.
KL outperforms all other approaches for feature selection.
ODP category label: top-5 KL important terms
– Bowling: bowl, bowler, lane, bowl center, league
– Buddhism: Buddhist, Buddhism, Buddha, Zen, dharma
– Ice Hockey: hockey, nhl, hockey league, coach, head coach
– Electronics: voltage, high voltage, circuit, laser, power supply
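The "at least one correct term in the top K" measure can be sketched in Python as follows. The `variants` mapping is a hypothetical stand-in for the inflection and WordNet checks, and treating "bowl" as an accepted variant of "Bowling" is an assumption for illustration, not a judgment taken from the slides. The computed rates describe only these three rows and are not the reported results.

```python
def has_match(top_terms, label, k, variants=None):
    """True if any of the first k terms equals the label or an accepted variant."""
    accepted = {label.lower()} | {v.lower() for v in (variants or ())}
    return any(term.lower() in accepted for term in top_terms[:k])

def match_rate(results, k, variants=None):
    """Fraction of categories with at least one match among the top-k terms."""
    variants = variants or {}
    hits = sum(
        has_match(terms, label, k, variants.get(label))
        for label, terms in results.items()
    )
    return hits / len(results)

# Three rows from the table above.
results = {
    "Bowling": ["bowl", "bowler", "lane", "bowl center", "league"],
    "Buddhism": ["Buddhist", "Buddhism", "Buddha", "Zen", "dharma"],
    "Electronics": ["voltage", "high voltage", "circuit", "laser", "power supply"],
}
variants = {"Bowling": {"bowl"}}  # hypothetical inflection list

rate_at_1 = match_rate(results, 1, variants)
rate_at_5 = match_rate(results, 5, variants)
```

Raising K can only keep or increase the rate, since a match found in the top 1 is also in the top 5.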

14 Experimental Data - scalability analysis
Blogger.com data collection:
– We crawled 973,518 blog posts from March 2007 until January 2009
– Total collection size of 5.45 GB, with ~120,000 users
Cluster setting:
– 4-node cluster of commodity machines (each machine with 4 GB RAM, 60 GB HD, 4 cores)
– Hadoop
[Figure label: Blog entry]

15 Number of User Profiles
[Figure labels: Time ratio; Document ratio; User profile ratio]
The runtime ratio is correlated with the user-profile ratio.

16 Data Size
Runtime increases linearly with data size.
Users: 18,000 users chosen from March-April 2007.

17 Related Work
Content-based user profiling:
– The profile contains a taxonomic hierarchy for the long-term model. The taxonomy is taken from the ODP. Short-term activities update the hierarchy.
– Adaptive user profile: uses words that appear in the Web pages and combines them using tf-idf over a window, weighting them according to the recency of the browsing.
KL approach to user tasks:
– Filter out new documents that are not related to the user, based on the user's profile.
– Annotate a URL with the most descriptive query term for a given user, based on the user's profile.
User targeting in large-scale systems:
– Behavioral targeting system over Hadoop MapReduce.
– Large-scale collaborative filtering (CF) technique for movie recommendations.
– Incremental algorithm that constructs a user profile from monitoring and user feedback, trading off complexity against profile quality.

18 Conclusions & Future Work
– We proposed a scalable user profiling solution
– Implemented on top of Hadoop MapReduce
– We showed quality and scalability results
– We plan to extend the user model into a semantic model
– We plan to extend the user profile to include structured data

19 Thank You!

