
Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus.


1 Automatic Blog Monitoring and Summarization Ka Cheung “Richard” Sia PhD Prospectus

2 With/without organized access

3 Inaccessible? By AskJeeves

4 Introduction
- Organized access to blogs
  - Full coverage
  - Reflects changes quickly
  - Filtered and organized presentation
- Intended contributions
  - Efficient techniques to harvest blogs
  - Algorithms to monitor frequently changing data sources
  - Algorithms to reconstruct implicit networks and compose topic summaries

5 Modules
- Monitoring
- Collection (future work)
- Topic detection and tracking (future work)
- Conclusion

6 Monitoring Preliminary results

7 Framework
A central server monitors changes at the data sources and provides succinct summaries to users.

8 Overview
- New challenges
  - Content changes more rapidly, with recurring patterns
  - More time-sensitive requirements
- Modeling of posting updates
- Definition of delay
- Strategies for allocation and scheduling

9 Characteristics
- Homogeneous Poisson model: λ(t) = λ at any t
- Periodic inhomogeneous Poisson model: λ(t) = λ(t − nT), n = 1, 2, …
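A minimal Python sketch, assuming the periodic inhomogeneous Poisson model above, of how posting times could be simulated by thinning. The rate λ(t) = 2 + 2 sin(2πt) is borrowed from the example slide later in the deck; the function names and the choice of the thinning method are illustrative assumptions, not taken from the slides.

```python
import math
import random


def periodic_rate(t, period=1.0):
    """Periodic posting rate lambda(t) = 2 + 2*sin(2*pi*t/period); never exceeds 4."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t / period)


def simulate_posts(rate, rate_max, horizon, rng):
    """Sample post times on [0, horizon) from an inhomogeneous Poisson process.

    Thinning: draw candidate times from a homogeneous process with rate
    `rate_max` (an upper bound on `rate`), then keep each candidate with
    probability rate(t) / rate_max.
    """
    times = []
    t = 0.0
    while True:
        t += rng.expovariate(rate_max)  # gap to the next candidate post
        if t >= horizon:
            return times
        if rng.random() * rate_max < rate(t):
            times.append(t)


if __name__ == "__main__":
    posts = simulate_posts(periodic_rate, 4.0, 7.0, random.Random(0))
    print(len(posts), "posts simulated over 7 periods")
```

Because the rate averages 2 posts per period, a long simulation should yield roughly twice as many posts as periods, concentrated in the first half of each period.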

10 Definition of metrics
- Delay of a data source: the sum, over every post, of the time elapsed between its posting and its retrieval
- Delay experienced by the aggregator

11 Definition of metrics
- τ_j: retrieval time
- λ(t): posting rate
- Expected delay
  - Homogeneous Poisson model
  - Inhomogeneous Poisson model
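The expected-delay formulas on this slide were images and are not in the transcript. As a sketch, assuming the definition above (each post waits from its posting time until the next retrieval) and writing τ_{j-1} and τ_j for two consecutive retrieval times, the expected delay accumulated between them would take the following form. The symbol D is introduced here for illustration only.

```latex
% Inhomogeneous Poisson model with posting rate \lambda(t):
% a post made at time t waits (\tau_j - t) until the next retrieval.
D(\tau_{j-1}, \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,(\tau_j - t)\,dt

% Homogeneous special case, \lambda(t) = \lambda:
D(\tau_{j-1}, \tau_j) = \frac{\lambda\,(\tau_j - \tau_{j-1})^2}{2}
```

The homogeneous case follows from the first line by taking λ out of the integral, so the delay grows with the square of the gap between retrievals.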

12 Problem formulation
Minimize the expected delay experienced by the aggregator under the constraint of limited resources: schedule the retrieval times τ_j so that this expected delay is minimized.

13 Approach
- Resource allocation
  - How often should each data source be contacted?
  - If O_1 is more active than O_2, how much more often should we contact O_1 than O_2?
- Retrieval scheduling
  - When should a data source be contacted?
  - If 3 retrievals are allocated to O_1, when should these 3 retrievals be scheduled?

14 Resource allocation
Consider n data sources O_1, …, O_n
- λ_i: posting rate of O_i
- w_i: weight of O_i
- N: total number of retrievals per day
- m_i: number of retrievals per day allocated to O_i
Optimal allocation
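The optimal-allocation formula itself is missing from the transcript. A sketch under stated assumptions: if each source follows the homogeneous Poisson model and its m_i retrievals are evenly spaced over the day, the expected daily delay of O_i is λ_i / (2 m_i), and minimizing the weighted sum Σ w_i λ_i / (2 m_i) subject to Σ m_i = N (by a Lagrange multiplier) gives m_i proportional to sqrt(w_i λ_i). The Python below illustrates that derived rule; the function names are hypothetical, and fractional allocations are left unrounded.

```python
import math


def allocate_retrievals(rates, weights, total):
    """Split `total` retrievals per day in proportion to sqrt(w_i * lambda_i)."""
    scores = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    norm = sum(scores)
    return [total * s / norm for s in scores]


def weighted_delay(rates, weights, alloc):
    """Expected weighted daily delay with evenly spaced retrievals:
    the sum over sources of w_i * lambda_i / (2 * m_i)."""
    return sum(w * lam / (2.0 * m) for lam, w, m in zip(rates, weights, alloc))


if __name__ == "__main__":
    rates = [4.0, 1.0]      # O_1 posts four times as often as O_2
    weights = [1.0, 1.0]
    alloc = allocate_retrievals(rates, weights, total=6)
    print("square-root allocation:", alloc)
    print("weighted delay:", weighted_delay(rates, weights, alloc))
    # Allocation proportional to the posting rate, for comparison.
    print("proportional delay:", weighted_delay(rates, weights, [4.8, 1.2]))
```

Under these assumptions, a source that is four times as active receives only twice as many retrievals, which is one possible answer to the "how much more often" question posed on the Approach slide.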

15 Retrieval scheduling
If m retrievals per day are allocated to a data source O, how should these m retrievals be scheduled?
- m = 1
- m > 1

16 Single retrieval per period
λ(t) = 1 for t ∈ [0,1], λ(t) = 0 for t ∈ [1,2]; periodicity T = 2
- τ = 0.5, expected delay = 0.75
- τ = 1, expected delay = 0.5
- τ = 2, expected delay = 1.5

17 Single retrieval per period For a data source with posting rate λ(t) and period T, the expected delay when retrieved at time τ is given by:
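The formula on this slide was an image and is not in the transcript. A minimal sketch, assuming the delay definition from the metrics slides (each post waits until the next retrieval) and a retrieval repeated once every period: the expected delay per period would then be D(τ) = ∫ λ(t)(τ − t) dt taken over the window from τ − T to τ. The Python below evaluates that integral numerically; `expected_delay_single` and `on_off_rate` are hypothetical names. With the on/off rate from the preceding example it gives 0.5 at τ = 1 and 1.5 at τ = 2. Values at interior τ depend on how posts made after the retrieval are carried into the next period, which is an assumption of this sketch.

```python
def expected_delay_single(rate, period, tau, steps=20000):
    """Expected delay per period when retrieving once per period at time tau.

    Integrates rate(t) * (tau - t) over the window (tau - period, tau]
    with the midpoint rule. `rate` is defined on [0, period) and is
    extended periodically via the modulo operation.
    """
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = tau - period + (k + 0.5) * h   # midpoint of the k-th sub-interval
        total += rate(t % period) * (tau - t) * h
    return total


def on_off_rate(t):
    """Rate from the preceding example: 1 on [0, 1), 0 on [1, 2)."""
    return 1.0 if t < 1.0 else 0.0


if __name__ == "__main__":
    for tau in (1.0, 2.0):
        print(tau, expected_delay_single(on_off_rate, 2.0, tau))
```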

18 Multiple retrievals per period
When m retrievals per period are allocated and scheduled at times τ_1, …, τ_m, the expected delay is given by:

19 Example 6 retrievals for λ(t)=2+2sin(2πt)
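The multiple-retrieval formula and the plot for this example are not in the transcript. As a sketch under the same delay definition, the expected delay per period for retrievals at τ_1 < … < τ_m would be the sum over j of ∫ λ(t)(τ_j − t) dt from τ_{j-1} to τ_j, with τ_0 = τ_m − T wrapping around the period. The Python below evaluates this for λ(t) = 2 + 2 sin(2πt) with 6 retrievals and compares an evenly spaced schedule with one found by a naive coordinate search. The search is an illustrative stand-in, not the scheduling method of the slides, and all function names are hypothetical.

```python
import math


def sine_rate(t):
    """Posting rate from the example: lambda(t) = 2 + 2*sin(2*pi*t), period 1."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)


def expected_delay(rate, period, taus, steps=200):
    """Expected delay per period for retrievals at `taus` (midpoint rule).

    `rate` must accept any real t, i.e. already be periodic.
    """
    taus = sorted(taus)
    prev = taus[-1] - period            # last retrieval of the previous period
    total = 0.0
    for tau in taus:
        h = (tau - prev) / steps
        for k in range(steps):
            t = prev + (k + 0.5) * h
            total += rate(t) * (tau - t) * h
        prev = tau
    return total


def local_search(rate, period, taus, step=0.01, rounds=50):
    """Naive coordinate search: nudge one retrieval time at a time and
    keep the move whenever the expected delay decreases."""
    best = sorted(taus)
    best_delay = expected_delay(rate, period, best)
    for _ in range(rounds):
        improved = False
        for j in range(len(best)):
            for delta in (-step, step):
                cand = list(best)
                cand[j] += delta
                if not 0.0 < cand[j] <= period:
                    continue            # keep retrievals inside one period
                cand_delay = expected_delay(rate, period, cand)
                if cand_delay < best_delay:
                    best, best_delay = sorted(cand), cand_delay
                    improved = True
        if not improved:
            break
    return best, best_delay


if __name__ == "__main__":
    uniform = [(j + 1) / 6 for j in range(6)]
    print("uniform delay:", expected_delay(sine_rate, 1.0, uniform))
    schedule, delay = local_search(sine_rate, 1.0, uniform)
    print("searched schedule:", [round(t, 2) for t in schedule])
    print("searched delay:", delay)
```

For this rate the evenly spaced schedule has an expected delay of 1/6 per period, because the sine contributions cancel across six equally spaced retrievals. Moving retrievals toward the end of the busy half of the period lowers it.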

20 Experiment
Data: 10k RSS feeds, collected October to December 2004

21 Performance
- CGM03: optimizes for “age”
- Ours: optimizes both resource allocation and retrieval scheduling

22 Size of estimation window
- Resource constraint: 4 retrievals per day per feed on average
- 2 weeks is an appropriate choice

23 Predictability of posting rate 90% of the RSS feeds post consistently

24 Summaries and extensions
- Resource allocation is more aggressive
- Retrieval scheduling optimizes within an individual data source
- Include user access patterns
- Variable retrieval cost

25 Collection Future work

26 Collection
- Blog hosting website
- Central repository
  - ~5.3M URLs from weblogs.com
  - Limited and contaminated
- Crawling
  - Retrieve the maximum number of blogs while reducing the number of irrelevant pages downloaded
Domain              Count     Category
spaces.msn.com      839,663   Blog
blogspot.com        362,957   Blog
wretch.cc           116,161   Blog
search-net101.com    89,750   Spam/ads
abalty.com           86,329   Spam/ads
search-now854.com    80,109   Spam/ads
bigebiz.org          79,059   Spam/ads

27 Collection
- Blogs are interconnected (blogrolls)
- Selectively follow links to discover hubs for blogs
[1] Chakrabarti et al., “Focused Crawling: A New Approach to Topic-specific Web Resource Discovery”, The International WWW Conference, 1999

28 Relinquishment of blogs
- Detection of abandoned blogs to save resources
[2] D. R. Cox, “Regression models and life-tables (with discussion)”, Journal of the Royal Statistical Society, B(34), 1972
[3] Gina Venolia, “A Matter of Life or Death: Modeling Blog Mortality”, Technical report, Microsoft Research

29 Topic detection and tracking Future work

30 Overview
- Characteristics
  - Document stream
  - Traces of information propagation among blogs
- Challenges
  - Modeling growth and death of a topic
  - Ranking of blog articles
  - Malicious content

31 Influence network in blogs
- Information is “diffused” among blogs
- Indicator of popularity
- Social relationship among bloggers

32 Influence network in blogs
- Four major patterns of propagation
- Reconstruction of implicit network
  - Ranking (source authority)
  - Advertising campaign

33 Data characteristics
~97-98% of daily content is new

34 Data characteristics
The same content lasts for ~8 days

35 Topics
- Topics with different lifespans
  - Bursty
  - Mid-range
  - Sustaining
- Evolution of topics
[4] J. Kleinberg, “Bursty and Hierarchical Structure in Streams”, SIGKDD 2002
[5] J. Kleinberg, “Temporal Dynamics of On-Line Information Streams”, in Data Stream Management: Processing High-Speed Data Streams, Springer, 2005

36 Document similarity
- Sparse and diverse
- ~400 articles clustered into 21 clusters out of 10,000 daily articles (by DBSCAN)

37 Framework
- Document stream approach
  - Filtering
  - Aggregation

38 Problems
- Selecting a representative subset of documents from a topic cluster
  - Coverage
  - Distinctiveness among the subset
- Ranking of documents
  - Time
  - Source authority

39 Conclusion
1. Efficient collection of blogs and modeling of relinquishment
2. Monitoring and retrieval scheduling of rapidly changing data sources
3. Composing topic summaries
   1. Reconstruction of an implicit influence network
   2. Representative document selection problem

40 End Questions?

41 More examples

42 Major posting patterns
K-means clustering

