SEARCHING THE BLOGOSPHERE. Nilesh Bansal and Nick Koudas, University of Toronto. WebDB 2007.


1 Nilesh Bansal and Nick Koudas WebDB 2007 SEARCHING THE BLOGOSPHERE Nilesh Bansal Nick Koudas University of Toronto

2 BLOGOSPHERE

4 […]M KNOWN BLOGS; 100K NEW EVERY DAY; DOUBLING EVERY 200 DAYS

5 WHAT ARE THEY WRITING ABOUT? PERSONAL LIFE, PRODUCT REVIEWS, POLITICS, TECHNOLOGY, TOURISM, SPORTS, ENTERTAINMENT

6 WHY SHOULD WE CARE?

7 HUGE DATA REPOSITORY; WILL CONTINUE TO GROW; EXTRACT PUBLIC OPINION; VALUABLE INSIGHTS

8 KEY INSIGHTS: MARKET RESEARCH; PUBLIC RELATIONS STRATEGIES; CUSTOMER OPINION TRACKING

9 CHALLENGES AND OPPORTUNITIES

10 HUGE AMOUNTS OF UNSTRUCTURED TEXT

12 MACHINE-CREATED WEBLOGS: MORE THAN HALF OF BLOGSPOT IS SPAM; 33% OF WEB SPAM IS HOSTED AT BLOGSPOT

13 TEMPORAL DIMENSION

14 GEOGRAPHICAL ASSOCIATION

15 CONVERSATION

16 Gruhl et al., The Predictive Power of Online Chatter, KDD 2005; Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003; Chi et al., Eigen-trend: trend analysis in the blogosphere based on singular value decompositions, CIKM 2006; Mishne et al., MoodViews: Tools for Blog Mood Analysis, AAAI-CAAW 2006; Mei et al., Topic sentiment mixture: modeling facets and opinions in weblogs, WWW 2007

17 BLOGSCOPE

19 CRAWLER RUNNING 24x7; TRACKING 9M BLOGS; INDEXING 70M ARTICLES; AGGREGATION AND PREPROCESSING; INTERACTIVE SEARCH AND ANALYSIS

20 ANY STREAMING TEXT SOURCE: NEWS, MAILING LISTS, FORUMS, SOCIAL MEDIA

21 Hot Keywords

22 Related Terms, Popularity Curve, Search Results, Geo Search

23 Hawaii Earthquake, Taiwan Undersea Earthquake, Sumatra Earthquake

24 December, March

25 IPHONE ON JAN

26 Curves are usually correlated, except at one point

27 TECHNIQUES

28 CRAWLS RSS FEEDS; 250 THOUSAND NEW POSTS DAILY; PING SERVER: WEBLOGS.COM
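
As a rough illustration of the feed-crawling step, the sketch below pulls post metadata out of an RSS 2.0 document using only the Python standard library. The sample feed, the function name parse_rss_items, and the choice of fields are assumptions made for this example; the slides do not describe how the BlogScope crawler parses feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real crawler would fetch this over HTTP
# after a ping server reports that the blog has been updated.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 01 Jan 2007 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

def parse_rss_items(xml_text):
    """Return one dict per <item> element, with empty strings for missing fields."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return posts

posts = parse_rss_items(SAMPLE_FEED)
for post in posts:
    print(post["title"], post["link"])
```

Real feeds also come as Atom or as malformed XML, so a production crawler would need a more tolerant parser than this.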

29 [Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007; [Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004; [Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006. LINK-BASED ANALYSIS IS NOT EFFECTIVE; SPAMMERS ARE INTELLIGENT; WE USE HEURISTICS; ONGOING BATTLE

30 INTERACTIVE APPLICATION; TWO-SECOND RESPONSE TIME; HUGE AMOUNTS OF DATA; SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY; SCALABILITY

32 BURST DETECTION: [Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003; [Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005

33 POPULARITY = BASE + ZERO-MEAN GAUSSIAN; BURST = STATISTICAL OUTLIER
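
One way to read the model on this slide is as a z-score test: estimate the base popularity as the mean of the daily counts, treat the spread around it as Gaussian noise, and flag days that sit far above the base. The sketch below does this under assumptions that are not on the slide: the function name find_bursts, the threshold of 2 standard deviations, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev

def find_bursts(counts, threshold=2.0):
    """Return indices of days whose count is more than `threshold`
    sample standard deviations above the mean (the assumed base level)."""
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        # A perfectly flat curve has no outliers.
        return []
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > threshold]

# Hypothetical daily post counts for one keyword; day 6 is a spike.
daily_counts = [10, 12, 9, 11, 10, 13, 60, 11, 10, 12]
print(find_bursts(daily_counts))
```

Because the spike itself inflates both the mean and the standard deviation, a more careful version might estimate the base from a trailing window or use a median-based estimate instead.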

34 IDENTIFYING RELATED TERMS

35 COLLOCATIONS; POINTWISE MUTUAL INFORMATION; EXPENSIVE. [Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis; [Manning and Schütze] Foundations of Statistical Natural Language Processing; [Church and Hanks] Word Association Norms, Mutual Information, and Lexicography, ACL 1989
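
For reference, pointwise mutual information between terms a and b is log2(P(a, b) / (P(a) P(b))). The sketch below estimates these probabilities from document frequencies over a toy corpus; the function name pmi_scores, the whitespace tokenizer, and the sample documents are assumptions for illustration. It also shows why the slide calls the exact computation expensive: every co-occurring pair of terms in every document has to be counted.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs):
    """Compute document-level PMI for every pair of terms that co-occur."""
    n = len(docs)
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing both terms
    for doc in docs:
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        # Pairs are emitted in alphabetical order, so (a, b) has a < b.
        pair_df.update(combinations(terms, 2))
    scores = {}
    for (a, b), joint in pair_df.items():
        p_ab = joint / n
        p_a = term_df[a] / n
        p_b = term_df[b] / n
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# Hypothetical four-document corpus.
docs = [
    "apple iphone launch",
    "apple iphone review",
    "earthquake hawaii",
    "earthquake taiwan",
]
scores = pmi_scores(docs)
print(scores[("apple", "iphone")])
```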

36 FAST COMPUTATION OF RELATED TERMS: RANDOM SAMPLE; MUTUAL INFORMATION IN EXPECTATION; USE TF WITH PRECOMPUTED IDF
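
A minimal sketch of the idea on this slide, assuming it means: draw a random sample of the documents that match the query, count term frequencies in the sample, and weight them by an IDF table computed offline. The function name related_terms, its parameters, the tokenizer, and the IDF values below are all hypothetical; the slide does not give the actual scoring formula.

```python
import random
from collections import Counter

def related_terms(query, matching_docs, idf, sample_size=100, top_k=5, seed=0):
    """Rank terms co-occurring with the query by sampled TF times precomputed IDF."""
    rng = random.Random(seed)
    if len(matching_docs) > sample_size:
        sample = rng.sample(matching_docs, sample_size)
    else:
        sample = list(matching_docs)
    tf = Counter()
    for doc in sample:
        tf.update(doc.lower().split())
    query_terms = set(query.lower().split())
    scores = {
        term: count * idf.get(term, 0.0)
        for term, count in tf.items()
        if term not in query_terms
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Hypothetical documents matching the query "iphone" and a made-up IDF table.
matching = ["iphone apple launch", "iphone apple keynote", "iphone the apple"]
idf_table = {"iphone": 1.0, "apple": 2.0, "launch": 3.0, "keynote": 3.0, "the": 0.1}
top_two = related_terms("iphone", matching, idf_table, top_k=2)
print(top_two)
```

Sampling bounds the work per query regardless of how many documents match, which fits the interactive response-time goal stated earlier in the deck.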

37 COMPUTING HOT KEYWORDS

38 POPULAR DOES NOT MEAN HOT; INTERESTING = SURPRISING; MIXTURE OF DIFFERENT SCORING FUNCTIONS; DEVIATION FROM EXPECTED
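
The "deviation from expected" idea can be sketched with a single scoring function: compare today's count for each keyword with its historical average and rank by the relative surprise. This is only one hypothetical score, not the mixture of scoring functions the slide refers to; the function name hot_keywords, the smoothing constant, and the sample counts are assumptions.

```python
def hot_keywords(today, history, top_k=3, smoothing=1.0):
    """Rank keywords by how far today's count exceeds the historical mean,
    relative to that mean. `smoothing` avoids division by zero for new terms."""
    scores = {}
    for term, count in today.items():
        past = history.get(term, [])
        expected = sum(past) / len(past) if past else 0.0
        scores[term] = (count - expected) / (expected + smoothing)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Hypothetical counts: "the" is popular but not surprising; "iphone" is surprising.
today = {"the": 1000, "iphone": 300, "earthquake": 50}
history = {
    "the": [990, 1010, 1000],
    "iphone": [20, 30, 10],
    "earthquake": [45, 55, 50],
}
ranked = hot_keywords(today, history)
print(ranked)
```

Here "the" has the highest raw count but scores zero, which matches the slide's point that popular does not mean hot.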

39 INTELLIGENT ALERT SERVICE; BURST SYNOPSIS; AUTHORITATIVE RANKING

40 JUST THE BEGINNING. Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB; Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal)

41 THANK YOU. QUESTIONS? (Image source: xkcd.com)

