1 Efficient Monitoring, Mining and Analysis of User-generated Content Ka Cheung “Richard” Sia (UCLA) kcsia@cs.ucla.edu Sept 8 2008

2 Explosion of user-generated content Doubling every 5 months – by Technorati

3 Characteristics of content About 97%-98% of daily content is new (50-word shingles) 62% of weekly content is new on the web (“What's New on the Web? The Evolution of the Web from a Search Engine Perspective”, by Ntoulas et al., WWW 2004)

4 Characteristics of content Mostly consist of current event chatter  Politics  Technology  Entertainment  Sports

5 The Yahoo! Buzz service

6 Agenda Introduction: growth and characteristics of user-generated content Three aspects  Monitoring: How to deliver fresh content to users  Aggregation: How to efficiently deliver personalized results to users  Analysis of tagging data: Making tagging data useful for advertisers

7 Framework Pull model: A central server monitors data source changes and provides digested content to users Push model: data sources notify the server of updates

8 Overview New challenges  Content is updated more frequently, with recurring patterns  More time-sensitive requirements Modeling of post updates Definition of delay Strategies for allocation and scheduling

9 How updates are changed? Homogeneous Poisson model λ(t) = λ at any t Periodic inhomogeneous Poisson model λ(t) = λ(t-nT), n=1,2,…
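The periodic inhomogeneous Poisson model can be simulated with the standard thinning method. The sketch below is illustrative only: the function names, the sinusoidal rate, and the one-day period are assumptions, not taken from the talk.

```python
import math
import random

def sample_poisson_process(rate, rate_max, horizon, seed=0):
    """Sample event times on [0, horizon) from a Poisson process with
    intensity rate(t) <= rate_max, using thinning."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        # Candidate event from a homogeneous process with rate rate_max.
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return events
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.random() * rate_max <= rate(t):
            events.append(t)

def periodic_rate(t, period=1.0):
    # Hypothetical posting rate with lambda(t) = lambda(t - nT).
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t / period)

posts = sample_poisson_process(periodic_rate, rate_max=4.0, horizon=1000.0)
print(len(posts) / 1000.0)  # empirical average rate per period
```

A homogeneous model is the special case where `rate` returns a constant, in which case every candidate with `rate_max` equal to that constant is accepted.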

10 Definition of metrics Delay of a data source: sum of the elapsed time (from posting to retrieval) for every post Delay experienced by the aggregator
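As a rough illustration of the delay metric, the sketch below sums, for each post, the time until the first retrieval at or after it. Treating the aggregator's delay as the plain sum of the per-source delays is an assumption made here, and the function names are hypothetical.

```python
from bisect import bisect_left

def source_delay(post_times, retrieval_times):
    """Sum over posts of (first retrieval at or after the post) - (post time).
    Posts published after the last retrieval are not counted yet."""
    retrievals = sorted(retrieval_times)
    total = 0.0
    for t in post_times:
        k = bisect_left(retrievals, t)
        if k < len(retrievals):
            total += retrievals[k] - t
    return total

def aggregator_delay(sources):
    """Assumed here: the aggregator's delay is the sum of per-source delays.
    `sources` is a list of (post_times, retrieval_times) pairs."""
    return sum(source_delay(posts, retrievals) for posts, retrievals in sources)

# Posts at 0.2, 0.5 and 1.4; retrievals at 1.0 and 2.0.
d = source_delay([0.2, 0.5, 1.4], [1.0, 2.0])  # 0.8 + 0.5 + 0.6
```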

11 Approach Resource allocation  How often to contact data sources?  O 1 is more active than O 2, how much more often should we contact O 1 than O 2 ? Retrieval scheduling  When to contact a data source?  2 retrievals are allocated for O 1, when should these 2 retrievals be scheduled?

12 Single retrieval per period example λ(t) = 1, t ∈ [0,1], λ(t) = 0, t ∈ [1,2] Periodicity T=2  τ = 0.5, expected delay = 0.75  τ = 1, expected delay = 0.5  τ = 2, expected delay = 1.5
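A numerical sketch of this example, assuming the expected delay of a single retrieval at time τ is the integral of λ(t)(τ − t) over the period that ends at τ (every post waits for the next retrieval). The function names and the midpoint integration are illustrative choices. Under that assumption the code reproduces the values quoted for τ = 1 and τ = 2.

```python
def expected_delay_single(rate, period, tau, steps=20000):
    """Midpoint-rule estimate of the integral of rate(t) * (tau - t)
    for t in [tau - period, tau], with rate periodic in `period`."""
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = tau - period + (k + 0.5) * h
        total += rate(t % period) * (tau - t) * h
    return total

def step_rate(t):
    # lambda(t) = 1 on [0, 1), 0 on [1, 2)
    return 1.0 if t < 1.0 else 0.0

delay_at_1 = expected_delay_single(step_rate, period=2.0, tau=1.0)
delay_at_2 = expected_delay_single(step_rate, period=2.0, tau=2.0)
```

Retrieving right after the active interval (τ = 1) gives the smaller delay, which is the point of scheduling retrievals inside a period rather than only choosing how many to make.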

13 Multiple retrievals per period m retrievals per period are allocated, when scheduled at time τ 1, …, τ m, the expected delay is given by:

14 Example 6 retrievals for λ(t)=2+2sin(2πt)
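The formula for multiple retrievals did not survive in this transcript. A minimal sketch, assuming each post is picked up by the next scheduled retrieval so that the expected delay is the sum over j of the integral of λ(t)(τ_j − t) between consecutive retrieval times, evaluates a candidate schedule for this rate. The evenly spaced schedule is only a baseline chosen here, not the schedule shown on the slide.

```python
import math

def expected_delay(rate, period, taus, steps_per_gap=5000):
    """Expected delay per period for retrieval times `taus` in (0, period]."""
    taus = sorted(taus)
    prev = taus[-1] - period  # last retrieval of the previous period
    total = 0.0
    for tau in taus:
        h = (tau - prev) / steps_per_gap
        for k in range(steps_per_gap):
            t = prev + (k + 0.5) * h
            # Posts arriving at time t wait until the retrieval at tau.
            total += rate(t % period) * (tau - t) * h
        prev = tau
    return total

def rate(t):
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

uniform = [(j + 1) / 6 for j in range(6)]        # 6 evenly spaced retrievals
uniform_delay = expected_delay(rate, 1.0, uniform)  # about 1/6
```

An optimizer could call `expected_delay` on shifted schedules to search for retrieval times with a lower value than the evenly spaced baseline.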

15 Experiment Data – 10k RSS feeds from syndic8.com collected during Oct – Dec 2004 Typical power law distribution – good for resource allocation

16 Performance CGM03 (“Effective page refresh policies for Web crawlers”, by Cho and Garcia-Molina in ACM TODS 2003)  Homogeneous Poisson model  Optimize for “age” metric Ours – both resource allocation and retrieval scheduling

17 Size of estimation window Resource constraint: 4 retrievals per day per feed on average 2 weeks seems an appropriate choice

18 Consistency of posting rate 90% of the RSS feeds post consistently

19 Summary Resource allocation is aggressive Retrieval scheduling optimizes within individual data source  Significantly improved freshness of content Also considered user browsing pattern “Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho, Hyun-Kyu Cho, in IEEE TKDE 2007 “Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng in ICWSM 2007

21 Aggregate query over blogs User-generated content in the Blogosphere and Web 2.0 services contains rich information about recent events Aggregation of individual user opinions to show current popular trends

22 Motivation Global aggregation (examples from blogpulse.com)  Recent news got picked up quickly “Dark Knight” in the week of July 18 “Olympics” related phrases in the week of August 8  Potential drawbacks What if a user is not interested in entertainment at all? Groups of bloggers collaborated to promote advertisement videos Personal aggregation  Users selectively aggregate from different sources  Efficient strategy to handle a large number of users and sources

23 From global to personal aggregation
[Figure: bloggers linked to the items (phrases) they mention: Dark Knight, Olympics, Michael Phelps, WALL-E, Las Vegas]
Example posts:
"Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!"
"Um.. it will be good if there is a free show of Dark Knight and WALL-E"
"Michael Phelps performance in Olympics is awesome..."
"Finished watching Michael Phelps in Olympics, let me try the WALL-E DVD..."

24 Matrix formulation
Endorsement matrix (E): e.g. the number of times a blogger mentions an object (keywords / links) in his posts.
Trust matrix (T): e.g. how often a user reads from a blog.
Personalized score (TE): endorsement score weighted by a user’s trust vector.
E        o1    o2    o3
b1        3     2     0
b2        0     3     0
b3        1     0     1
b4        1     2     3
Total     5     7     4
T        b1    b2    b3    b4
u1      0.8   0.8     0     0
u2      0.2   0.2   0.6   0.6
u3        0     0   0.5   0.5
TE       o1    o2    o3
u1      2.4   4.0   0.0
u2      1.8   2.2   2.4
u3      1.0   1.0     2
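The personalized score is an ordinary matrix product. The sketch below uses the example values of E (bloggers b1..b4 by objects o1..o3) and T (users u1..u3 by bloggers) as read from this slide; the helper name `matmul` is an illustrative choice.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# Endorsement matrix E: rows b1..b4, columns o1..o3.
E = [[3, 2, 0],
     [0, 3, 0],
     [1, 0, 1],
     [1, 2, 3]]

# Trust matrix T: rows u1..u3, columns b1..b4.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]

TE = matmul(T, E)  # personalized scores: rows u1..u3, columns o1..o3
for user, row in zip(["u1", "u2", "u3"], TE):
    print(user, [round(x, 2) for x in row])
```

Each user gets a different ranking of the same objects: u1's top object is o2, while u2 and u3 both rank o3 first.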

25 Baseline implementations Endorsement (blog_id, item, score), Trust (user_id, blog_id, score)
Personal Aggregate Query: SELECT e.item, SUM(t.score * e.score) AS p_score FROM Endorsement e, Trust t WHERE e.blog_id = t.blog_id AND t.user_id = <user_id> GROUP BY e.item ORDER BY p_score DESC LIMIT 20
On-the-fly (OTF)
View

26 Optimizing the query Identify “template” users  Typical users interested in sports / politics / technology /...  Results of template users are pre-computed  Results of individual users are combined from partially computed results

27 Using NMF to discover user groups Factorize trust matrix Decompose T into two sub-matrices W and H  Non-negative matrix factorization  W: user – user group relationship  H: user group – blog relationship User 2’s trust vector is expressed as a linear combination of the trust vectors of template users 1 and 2 NMF as an approximation of the original trust matrix
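A minimal sketch of non-negative matrix factorization using the standard multiplicative update rules for the Frobenius-norm objective. The talk does not say which NMF algorithm was used, so this is a stand-in; the toy trust matrix, rank, iteration count and helper names are all assumptions.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(T, rank, iters=1000, seed=0, eps=1e-9):
    """Factor T (n x m, non-negative) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(T), len(T[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T T) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, T), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (T H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(T, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return W, H

def frobenius_error(T, W, H):
    approx = matmul(W, H)
    return sum((t - a) ** 2 for rt, ra in zip(T, approx)
               for t, a in zip(rt, ra)) ** 0.5

# Toy trust matrix: 3 users x 4 blogs, two underlying "template" profiles.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]
W, H = nmf(T, rank=2)
err = frobenius_error(T, W, H)
```

Each row of `H` plays the role of a template user's trust vector, and each row of `W` gives the non-negative weights that combine them into one user's trust vector.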

28 Reconstruction of results Personalized endorsement scores of template users are pre-computed; results of individual users are computed on request (HE) is maintained as sorted lists for all template users W * (HE) is the personal aggregation result  Computed using the Threshold Algorithm (Fagin et al., PODS 2001) Top-K list (HE) are sorted lists W * (HE) is a weighted linear combination
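The combination step can be sketched with a simplified version of the Threshold Algorithm: read the sorted lists in parallel, score each newly seen item by random access, and stop once the K-th best score reaches the threshold formed from the last scores seen. It assumes non-negative weights and scores, with missing items scoring 0. The example lists and weights are hypothetical, and this is not the author's implementation.

```python
import heapq

def threshold_topk(sorted_lists, weights, k):
    """sorted_lists[g] is a list of (item, score) sorted by score, descending.
    Returns the k best (total, item) pairs for
    total(item) = sum_g weights[g] * score_g(item)."""
    lookup = [dict(lst) for lst in sorted_lists]  # random access by item
    seen, best = set(), []                        # best: min-heap of (total, item)
    depth = max(len(lst) for lst in sorted_lists)
    for pos in range(depth):
        threshold = 0.0
        for g, lst in enumerate(sorted_lists):
            if pos >= len(lst):
                continue  # exhausted list: unseen items score 0 here
            item, score = lst[pos]
            threshold += weights[g] * score
            if item not in seen:
                seen.add(item)
                total = sum(w * d.get(item, 0.0)
                            for w, d in zip(weights, lookup))
                heapq.heappush(best, (total, item))
                if len(best) > k:
                    heapq.heappop(best)
        # No unseen item can score above the threshold.
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, reverse=True)

# Hypothetical pre-computed lists (rows of HE) for two template users.
he_lists = [[("o2", 4.0), ("o1", 2.4), ("o3", 0.0)],
            [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)]]
user_weights = [0.25, 1.2]  # hypothetical row of W for one user
top2 = threshold_topk(he_lists, user_weights, k=2)
```

The early stop is what saves work: in this example the scan ends after two positions per list instead of reading every entry.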

29 Partition of trust matrix Decomposition is useful when the matrix is dense Real-life data is often skewed (Akshay et al., ICWSM 2007) Hybrid method: uses decomposition only when it is effective
[Figure: partition of the trust matrix. Axes: users with more subscriptions, blogs with more subscribers. Labels: users with >30 subscriptions; feeds with >30 subscribers; 10k feeds, 24k users; ~1M subscription pairs; 2.7M subscription pairs. Regions: 1. OTF, 2. VIEW, 3. NMF]

30 Experiments Bloglines.com : online RSS reader Trust matrix T (1-0 version): subscription profile  91K users  487K RSS feeds Endorsement matrix E: blog – keywords occurrence  Feed content collected between Nov 2006-Jul 2007  Keywords filtered by nouns with high tf-idf values Platform  Python implementation of proposed scheme  MySQL server on linux with data stored on RAID

31 How different is personalization? Week 2007 Jan 7 – 2007 Jan 13, major event: iPhone announced Personal aggregation results differ from global aggregation
Top keywords, 2007-01-07 to 2007-01-13 (columns as extracted: User 91017, User 90550, User 90439, Global):
iran, google, quarter, phone
saddam, cathartik, prices, business
troops, video, companies, software
dept, kibbutz, apple, development
avenue, argentina, bush, management
views, vegas, iraq
president, search, chicago, manager
bush, reuters, iphone, apple
iraq, iguazu, beef, iphone
yorker, brazil, cattle, sales

32 How different is personalization? Overlap comparison of global aggregation and personal aggregation  L G – global top 20 items  L i – individual top 20 items of user i Personal aggregation results also differ among users Overlap degree with global aggregation result Pair-wise among users
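The overlap comparison can be sketched as below, assuming the overlap degree is the fraction of items two top-K lists share. The slide does not give the exact definition, and the short lists are made-up placeholders for top-20 lists.

```python
from itertools import combinations

def overlap(list_a, list_b):
    """Fraction of items shared by two top-K lists of equal length."""
    return len(set(list_a) & set(list_b)) / len(list_a)

# Hypothetical top-4 lists standing in for top-20 lists.
global_top = ["iphone", "apple", "iraq", "bush"]
user_tops = {
    "u1": ["iphone", "apple", "cattle", "beef"],
    "u2": ["google", "video", "brazil", "vegas"],
    "u3": ["iraq", "bush", "troops", "iran"],
}

# Overlap of each user's list with the global list.
with_global = {u: overlap(global_top, top) for u, top in user_tops.items()}

# Pair-wise overlap among users.
pairwise = {(a, b): overlap(user_tops[a], user_tops[b])
            for a, b in combinations(sorted(user_tops), 2)}
```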

33 Approximation accuracy Dense region of subscription matrix  >30 subscribers: 10152 feeds  >30 subscriptions: 24340 users L2 norm comparison Sparsity of W (23%), H (13%) NMF approximation is close to SVD, with the advantage of sparseness
Rank    SVD      NMF
80      848.5    856.9
90      841.6    850.1
100     835.1    844.6
110     829.0    837.9
120     823.2    833.0

34 Approximation accuracy How many items are approximated by NMF in top 20 list?  T i – top 20 items of user i computed by OTF  A i – top 20 items of user i computed by NMF About 70% of the top 20 items are recovered, with higher accuracy for higher-ranked items Correlation with rank

35 Efficiency of proposed method Update cost (for 1 week data)  OTF (222K) < NMF (3.2M) < VIEW (23.6M) Query response time  Average over 1000 users with highest number of subscriptions  OTF: execute SQL query on MySQL server  NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server Average query response time reduced by 75%, eliminated outliers of significant delay
Method   avg      std      max       min
OTF      2.05s    3.60s    84.42s    0.037s
NMF      0.46s    0.53s    2.84s     0.007s

36 Summary Deliver tailored results to users by personal aggregation Proposed a model for personal aggregate queries  Optimization by NMF & Threshold Algorithm Real-life dataset study shows query response time can be reduced significantly with acceptable approximation accuracy “Efficient Computation of Personal Aggregation Queries on Blogs”, with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008 “Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007

38 More than just tag-cloud

39 LDA of tagging data d – bookmark, w – tag (merged all users), z – topics Sample of topics
ID   Topic                      Tags (top 10 p(w|z))
6    Environmental Protection   environment energy green ibm item sustainability oil power solar alternative
27   Politics                   politics government war iraq usa activism bush military political terrorism
38   Literature                 books book literature ebooks free reading poetry ebook publishing writing
49   Copyrights                 law copyright legal drm sony rights remote vnc ethics creativcommons
55   Health                     health fitness medicine regex medical drugs exercise running diet training
63   Linux                      linux unix ubuntu debian os sysadmin shell kernel bash livecd
69   Photography                photography photo photos flickr camera gallery images pictures digital photoblog

40 Change of entropy Tags with increasing popularity in a period Correspond to developing topics where users are willing to explore new pages Correspond to well-established topics where users have common consensus

41 Change of topic association “Programmers” at Oct 2005 –programming, development, code, patterns, dev, coding, algorithms, scheme, software,... “Programmers” at Jan 2006 –programming, development, code, lisp, dev, coding, algorithms, scheme, software, cs,... –work, jobs, career, job, shell, sleep, uml, regex, scripting, bash,...

42 Specificity of word semantics Entropy vs idf metrics

43 Summary Features can be combined to build a classifier for words  Tag entropy change rate  KL-divergence of topic distribution  Entropy of semantics Assists advertisers in selecting better keywords for advertisement “Exploring Social Annotations for Word Usage Evolution” – work in progress
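Two of these features rest on standard quantities, sketched below: Shannon entropy of a count distribution and KL-divergence between two topic distributions. The talk does not specify what the distributions are taken over, so the inputs here (for example, how a tag's bookmarks spread over pages, or a tag's p(z|w) in two periods) are assumptions.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical counts of one tag over four pages in two periods.
h_before = entropy([4, 0, 0, 0])   # all uses on one page
h_after = entropy([1, 1, 1, 1])    # uses spread evenly
entropy_change = h_after - h_before

# Hypothetical topic distributions of one tag in two periods.
drift = kl_divergence([0.5, 0.5], [0.9, 0.1])
```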

44 Thank you!

45 Definition of metrics τ j – retrieval time λ(t) – posting rate Expected delay  Homogeneous Poisson model  Inhomogeneous Poisson model
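The formulas for this slide are missing from the transcript. A derivation sketch under the stated Poisson models, assuming that posts arriving between two consecutive retrievals τ_{j−1} and τ_j are all picked up at τ_j:

```latex
% Expected delay accumulated between consecutive retrievals \tau_{j-1} < \tau_j.
% Inhomogeneous Poisson model with posting rate \lambda(t):
\mathbb{E}[D_j] = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,(\tau_j - t)\,dt
% Homogeneous Poisson model, \lambda(t) = \lambda:
\mathbb{E}[D_j] = \lambda \int_{\tau_{j-1}}^{\tau_j} (\tau_j - t)\,dt
               = \frac{\lambda\,(\tau_j - \tau_{j-1})^2}{2}
```

For example, with λ(t) = 1 on [0,1] and a retrieval at τ = 1 preceded by no posts, the integral gives 1/2.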

46 Resource allocation Consider n data sources O 1, …, O n  λ i – posting rate of O i  w i – weight of O i  N – total number of retrievals per day  m i – number of retrievals per day allocated to O i Optimal allocation
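The optimal-allocation formula is not preserved in this transcript. As a sketch under simplifying assumptions (homogeneous Poisson posting, m i evenly spaced retrievals per day, so the weighted daily delay is the sum of w i λ i / (2 m i)), a Lagrange-multiplier argument gives m i proportional to the square root of w i λ i. The function names below are illustrative.

```python
import math

def allocate(rates, weights, total_retrievals):
    """Split `total_retrievals` so that m_i is proportional to sqrt(w_i * lambda_i)."""
    roots = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    scale = total_retrievals / sum(roots)
    return [scale * r for r in roots]

def weighted_delay(rates, weights, allocation):
    """Expected weighted delay per day with evenly spaced retrievals."""
    return sum(w * lam / (2.0 * m)
               for lam, w, m in zip(rates, weights, allocation))

rates, weights = [4.0, 1.0], [1.0, 1.0]   # O1 posts four times as often as O2
m_sqrt = allocate(rates, weights, 6)      # [4.0, 2.0]
m_prop = [4.8, 1.2]                       # proportional to the posting rate
m_even = [3.0, 3.0]                       # same budget, split evenly
```

In this toy case the square-root allocation contacts O1 twice as often as O2 (not four times), and its weighted delay is lower than both the proportional and the even split.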

47 Single retrieval per period For a data source with posting rate λ(t) and period T, the expected delay when retrieved at time τ is given by:

48 Blogs becoming inactive Detection of abandoned blogs to save resources [2] D.R. Cox “Regression models and life-tables (with discussion)” Journal of the Royal Statistical Society, B(34), 1972 [3] Gina Venolia “A Matter of Life or Death: Modeling Blog Mortality” Technical report, Microsoft Research

49 More examples

50 Major posting patterns K-means clustering

51 Threshold algorithm Proposed by Fagin et al. [2001] Efficient computation of top-K items from multiple lists with a monotone aggregate function [Figure labels: users, blogs, user groups]

52 Illustration of matrix partition [Figure labels: feeds with more subscribers; users with more subscriptions; 2 subscriptions, 8 subscriptions; 2 subscribers, 9 subscribers]
