BlogVox: Separating Blog Wheat from Blog Chaff Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC) James Mayfield (JHU/APL)


1 BlogVox: Separating Blog Wheat from Blog Chaff
Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC)
James Mayfield (JHU/APL)

2 Motivation: Cleaning the Harvest
BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track.
Spam blogs (splogs) and extraneous content water down the quality of the index.
Narrowing down to the content of the post is essential because blogs lack clearly demarcated opinion sentences (unlike Epinions, IMDB, Amazon, etc.).
Noisy, unstructured text in the blogosphere can skew blog analytics and business intelligence tools (as observed in TREC 2006).

3 BlogVox Opinion Extraction System
TREC 06: finding opinionated posts, either positive or negative, about a query
2006 TREC Blog corpus: 80K blogs, 300K posts, 50 test queries
BlogVox opinion extraction system: document- and sentence-level scorers; scores combined using an SVM meta-learner; data cleaning (splogs and post identification)
BlogVox challenges: data cleaning and splog removal; slang; semantic orientation of words; contradictions, sarcasm, and ungrammatical text

4 Separating Blog Wheat from Blog Chaff
Data cleaning for splog removal
Post content identification

5 Spam in the Blogosphere
Types: comment spam, ping spam, splogs
Akismet: 87% of all comments are spam
75% of update pings are spam (ebiquity 2005)
56% of blogs are spam (ebiquity 2005)
20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads
Spings, or ping spam, are pings sent from spam blogs

6 Motivation: host ads

7 Motivation: index affiliates, promote PageRank

8 Data Cleaning: Splogs
Splog detection using an SVM
700 blogs and 700 splogs used for training
Model based on blog home page and local blog features
Host ads; index affiliates, promote PageRank; plagiarized content
Splog Detection Performance
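
A minimal sketch of the idea on this slide, not the authors' model: a linear SVM trained by subgradient descent on the hinge loss over bag-of-words features. The bag-of-words features, the four toy documents, the hyperparameters, the omitted bias term, and the names `bag_of_words`, `train_linear_svm`, and `predict` are all assumptions made for illustration; the slide only states that an SVM was trained on 700 blogs and 700 splogs using home-page and local blog features.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (an assumed feature set)."""
    return Counter(text.lower().split())


def train_linear_svm(examples, epochs=20, lr=0.1, lam=0.01):
    """Train weights on (Counter, label) pairs; label is +1 (splog) or -1 (blog).

    Subgradient descent on the L2-regularized hinge loss, without a bias term.
    """
    w = {}
    for _ in range(epochs):
        for x, y in examples:
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            # L2 regularization shrinks every weight slightly at each step.
            for f in w:
                w[f] *= 1.0 - lr * lam
            # The hinge loss contributes only when the example is inside the margin.
            if margin < 1.0:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + lr * y * v
    return w


def predict(w, text):
    """Return +1 (splog) if the linear score is positive, otherwise -1 (blog)."""
    score = sum(w.get(f, 0.0) * v for f, v in bag_of_words(text).items())
    return 1 if score > 0 else -1


# Toy training set (hypothetical text, far smaller than the 700 + 700 on the slide).
training = [
    (bag_of_words("cheap casino coupon casino"), 1),
    (bag_of_words("coupon codes cheap mortgage"), 1),
    (bag_of_words("went to school and had lunch today"), -1),
    (bag_of_words("photos from my trip today"), -1),
]
weights = train_linear_svm(training)
```

On this toy data, `predict(weights, "casino coupon codes")` returns 1 and `predict(weights, "lunch at school today")` returns -1, because the two classes share no vocabulary. Real splogs are much harder to separate, which is why the slide reports a trained model's performance instead.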

9 Nature of Splogs in TREC 2006
Around 83K identifiable blog home pages in the collection, with 3.2M permalinks
81K blogs could be processed
We use splog detection models developed on blog home pages; 87% accuracy
We identified 13,542 splogs
Blacklisted 543K permalinks from these splogs
~16% of the entire collection
~17% splog posts injected into the TREC dataset [1]
[1] C. Macdonald, I. Ounis, The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection
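
A quick arithmetic check of the proportions on this slide, using the rounded counts as given, so the percentages are approximate. The variable names and the choice of denominators are assumptions; the slide does not say which ratio its "~16%" refers to.

```python
# Rounded counts taken from the slide.
permalinks_total = 3_200_000        # "3.2M permalinks"
permalinks_blacklisted = 543_000    # "543K permalinks"
blogs_total = 83_000                # "around 83K" blog home pages
blogs_processed = 81_000            # "81K blogs could be processed"
splogs_found = 13_542


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100.0 * part / whole, 1)


post_share = percent(permalinks_blacklisted, permalinks_total)
blog_share_all = percent(splogs_found, blogs_total)
blog_share_processed = percent(splogs_found, blogs_processed)

print(post_share, blog_share_all, blog_share_processed)
```

With these rounded inputs, the blacklisted permalinks come to about 17.0% of all permalinks, and the detected splogs to about 16.3% of the 83K home pages or 16.7% of the 81K processed blogs.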

10 Impact of Splogs in TREC Queries
American Idol, Cholesterol, Hybrid Cars

11 Higher in Spam-Prone Contexts
Spam query terms based on analysis by Macdonald et al. 2006
Card, Interest, Mortgage

12 Separating Blog Wheat from Blog Chaff
Data cleaning for splog removal
Post content identification

13 Data Cleaning: Content Identification
Navigation, Post content, Ads, Recent Posts

14 Data cleaning: Baseline heuristic
Eliminate link a if there exists a link b such that:
a and b are within θ distance of each other
there are no title tags between the links
the average length of the text-bearing nodes is less than a threshold
b is the nearest link to a
An example DOM tree: navigational links, ads, post content, sidebar
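
The rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the DOM is flattened into a list of nodes, distance is counted in nodes, "title tags" are modeled as a `"title"` node kind, and the average text length is taken over the text nodes lying between the two links. The names `Node` and `extraneous_links` and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str  # "link", "text", or "title" (assumed node kinds)
    text: str


def extraneous_links(nodes, theta, max_avg_text_len):
    """Return the indices of links that the heuristic would eliminate."""
    link_positions = [i for i, n in enumerate(nodes) if n.kind == "link"]
    eliminated = []
    for a in link_positions:
        others = [p for p in link_positions if p != a]
        if not others:
            continue
        # b is the nearest link to a (ties go to the earlier link).
        b = min(others, key=lambda p: abs(p - a))
        if abs(b - a) > theta:
            continue  # not within theta distance
        between = nodes[min(a, b) + 1 : max(a, b)]
        if any(n.kind == "title" for n in between):
            continue  # a title separates the two links
        lengths = [len(n.text) for n in between if n.kind == "text"]
        avg_len = sum(lengths) / len(lengths) if lengths else 0.0
        if avg_len < max_avg_text_len:
            eliminated.append(a)
    return eliminated


# A toy page: three navigation links, then a titled post with one in-text link.
page = [
    Node("link", "Home"), Node("text", "|"),
    Node("link", "About"), Node("text", "|"),
    Node("link", "Archive"),
    Node("title", "My trip"),
    Node("text", "x" * 120),
    Node("link", "a photo"),
    Node("text", "y" * 120),
]
print(extraneous_links(page, theta=3, max_avg_text_len=20))  # [0, 2, 4]
```

The three navigation links are dropped because each has a close neighbor with only a short separator in between, while the link inside the post survives because a title sits between it and its nearest link.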

15 Data cleaning: SVM cleaner
Random collection of 150 blog posts
Human evaluation of 400 links, tagged as content or extraneous links
We trained an SVM with a linear kernel in this analysis
DOM features: tag features, position features, word features
Evaluation
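
A hedged sketch of what tag, position, and word features for a single link might look like before they are passed to a linear classifier. The slide names only the three feature families; the concrete features, the function name `link_features`, and its parameters are assumptions for illustration.

```python
def link_features(anchor_text, ancestor_tags, index, total_links):
    """Build a sparse feature dictionary for one link (hypothetical features)."""
    features = {}
    # Tag features: the tags that enclose the link in the DOM.
    for tag in ancestor_tags:
        features["tag=" + tag.lower()] = 1.0
    # Position feature: relative position among the page's links, in [0, 1].
    features["rel_pos"] = index / (total_links - 1) if total_links > 1 else 0.0
    # Word features: lowercased tokens of the anchor text.
    for word in anchor_text.lower().split():
        features["word=" + word] = 1.0
    return features


example = link_features("Recent Posts", ["div", "ul", "li"], index=9, total_links=10)
print(example)
```

A sidebar link such as "Recent Posts" yields list-related tag features, a late relative position, and anchor words typical of navigation, which is the kind of signal a classifier can use to separate extraneous links from content links.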

16 Data Cleaning: Effect of sidebar content

17 Related Work
Web spam detection
  Coverage: blog analytics engines don't look beyond the blogosphere
  Speed of detection is important: 150K posts/hour
  RSS feeds present new opportunities and challenges
Email spam detection
  Nature of spamming: links, RSS feeds, web graph, metadata
  Users targeted indirectly through search engines, e.g. N1ST is not relevant for a NIST query
Template detection
  Repeated structural components detected via sampling
  Customization and the use of JavaScript and AJAX are increasing
  Simple heuristics using DOM traversal work well in general cases
Sentiment analysis
  Open-domain opinion extraction is complex
  Opinions are part of a narrative
  The subject for which the opinion is being expressed is not easy to detect

18 Conclusions
Noisy content in the blogosphere presents a major challenge to the quality of blog analytics tools.
A combination of heuristics and ML can be used to clean the data effectively.
Ongoing work: DOM subtree elimination; identifying the subject of the opinion; slang; more training examples!

19 http://ebiquity.umbc.edu/ Thank you!

20 Backup Slides

21 Opinions in Social Media
"I went to school early so I would have time to grab some lunch. Which ended up consisting of a crappy sandwich from starbucks and a chai latte. Lacey came into Starbucks while I was there so we chatted for a little bit and she thought that I might be in her class. After I finished eating I headed to school and checked the board…" [1]
Expressed opinions; narrative
Reader's perspective: Starbucks sandwiches are bad!
Opinions can influence the buying decisions of customers
[1] http://annamay13x.livejournal.com/7061.html

22 Keyword Stuffed Blog coupon codes, casino

23 Post Stitching Excerpts scraped from other sources

24 Post Weaving Spam Links contextually placed in post

25 Link-roll spam With fully plagiarized text

26 Difficulty
We have been experimenting with multiple approaches since mid-2005
Data: http://ebiquity.umbc.edu/resource/html/id/212

27 Difficulty
Evolving spamming techniques and splog creation genres
Most basic spam techniques
  Generate content by stuffing key dictionary words
  Generate links to affiliates through link dumps on blogrolls, linkrolls, or after post content
Evolving spam techniques
  Scrape contextually similar content to generate posts
  RSS hijacking
  Aggregation software, e.g. Planet X
  Intersperse links randomly
  Make link placement meaningful
  Add spam comments and then ping. Repeat.

28 TREC Submissions (Topic Relevance)

29 TREC Submissions (Opinion Extraction)

