Blog Track Open Task: Spam Blog Detection Tim Finin Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin.


1 Blog Track Open Task: Spam Blog Detection Tim Finin http://ebiquity.umbc.edu/paper/html/id/318/ Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau University of Maryland, Baltimore County NIST Blog Pre-Track, 14 Nov 2006 James Mayfield Johns Hopkins University Applied Physics Laboratory

2 Blogosphere Reputation at Stake!

3 Spam in the Blogosphere Types: comment spam, ping spam, spam blogs Akismet: “87% of all comments are spam” 75% of update pings are spam (ebiquity 2005) 20% of blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005) “Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites” “Spings, or ping spam, are pings that are sent from spam blogs” (Definitions quoted from Wikipedia)

4 Auto-generated and/or Plagiarized Content Advertisements in Profitable Contexts Link Farms to promote affiliates

5 Why a problem? Splog content provides no additional value Splog content is often plagiarized Splogs demote value of authentic content Splogs steal advertising (referral) revenue from authentic content producers Splogs stress the blogosphere infrastructure Splogs can skew Blog Analytics, as was observed in TREC Blog Track 2006

6 Nature of Splogs in TREC 2006 Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks 81K blogs could be processed We use splog detection models developed on blog home-pages – 87% accurate We identified 13,542 splogs Blacklisted 543K permalinks from these splogs This accounts for 16% of the entire collection These percentages tally with those reported for the TREC dataset (Macdonald et al. 2006)

7 Impact of Splogs in TREC Queries American Idol Cholesterol Hybrid Cars

8 Higher in Spam-Prone Contexts Spam query terms based on analysis by Macdonald et al. 2006. Card Interest Mortgage

9 Splog Detection Task Proposal Motivation –Detecting and eliminating spam is a key competency requirement of any blog analysis –Splog detection has characteristics that set it apart from e-mail and web spam detection Constraint –Simulate how a blog search system operates Task Statement –Is an input permalink (post) spam?

10 Relation to E-mail Spam Detection TREC has an E-mail Spam Classification Task Similar in –Fast online spam detection Different in –Nature of spamming – links, RSS feeds –Users targeted indirectly through search engines – “n1st” not relevant for “nist” query

11 Relation to Web Spam Detection TREC does not have a web spam track Similar in –Spamming web link structure Different in –Coverage of Blog Analytics Engines, not beyond blogosphere –Speed of detection, crucial –Presence of structured text through RSS feeds presents new opportunities, and challenges

12 Difficulty We have been experimenting with multiple approaches since mid-2005 Dataset available at: –http://ebiquity.umbc.edu/blogger/

13 Difficulty Evolving spamming techniques, and splog creation genres Most basic technique –Generate content by stuffing random dictionary words –Generate links to affiliates, through link dumps on blogrolls, linkrolls or after post content Evolving techniques –Scrape contextually similar content to generate posts –Intersperse links randomly –Make link placement meaningful

14

15 Task Details - Dataset Creation Similar to TREC Blog 2006, a collection of feeds, blog home-pages and permalinks View dataset D as two sets – D base, D test D base to span (n-x) days, and D test to span the remaining x days, with x = 1 or less D could be collected as a combination of – D as collected in 2006 –Sample a subset of pings from a ping server over the period that D is collected
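The D base / D test division described on this slide can be sketched as a split by timestamp. This is only an illustration under assumptions: the function name `split_by_time`, the `(timestamp, permalink)` tuple layout, and the example URLs are hypothetical, and the slide does not say how boundary posts should be handled.

```python
from datetime import datetime, timedelta


def split_by_time(posts, x_days=1.0):
    """Split (timestamp, permalink) pairs into (d_base, d_test).

    d_test holds posts from the final x_days of the collection window;
    d_base holds everything at or before that cutoff (an assumption).
    """
    if not posts:
        return [], []
    last_seen = max(ts for ts, _ in posts)
    cutoff = last_seen - timedelta(days=x_days)
    d_base = [(ts, url) for ts, url in posts if ts <= cutoff]
    d_test = [(ts, url) for ts, url in posts if ts > cutoff]
    return d_base, d_test


# Hypothetical three-day crawl, so n = 3 and x = 1.
posts = [
    (datetime(2006, 11, 1, 9, 0), "http://example.org/a"),
    (datetime(2006, 11, 2, 9, 0), "http://example.org/b"),
    (datetime(2006, 11, 3, 8, 0), "http://example.org/c"),
    (datetime(2006, 11, 3, 9, 0), "http://example.org/d"),
]
d_base, d_test = split_by_time(posts, x_days=1.0)
```

Splitting on time rather than at random keeps every test post newer than anything in D base, which matches the stated aim of simulating a blog search system.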

16 Task Details - Assessment Assessors classify each spam post by the kind of spam that the post, or the blog hosting it, features –Non-blog –Keyword-stuffed –Post-stitching –Post-plagiarism –Post-weaving –Blog/link-roll Each assessment typically takes 1-2 mins Detailed assessment will enable participants to find which classes they do well on, and where they can improve
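A per-class breakdown of the kind this slide anticipates could look like the sketch below. The six labels come from the slide; the names `SpamKind` and `recall_by_kind`, the input layout, and the choice of recall as the per-class measure are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum


class SpamKind(Enum):
    """Assessment labels listed on the slide."""
    NON_BLOG = "non-blog"
    KEYWORD_STUFFED = "keyword-stuffed"
    POST_STITCHING = "post-stitching"
    POST_PLAGIARISM = "post-plagiarism"
    POST_WEAVING = "post-weaving"
    BLOG_LINK_ROLL = "blog/link-roll"


def recall_by_kind(assessed, flagged):
    """Fraction of assessed spam posts a system caught, per spam kind.

    assessed: dict mapping permalink -> SpamKind (assessor judgments).
    flagged:  set of permalinks the system marked as spam.
    Only kinds that occur in `assessed` appear in the result.
    """
    total = Counter(assessed.values())
    caught = Counter(kind for url, kind in assessed.items() if url in flagged)
    return {kind: caught[kind] / total[kind] for kind in total}


# Hypothetical judgments and system output.
assessed = {
    "p1": SpamKind.KEYWORD_STUFFED,
    "p2": SpamKind.KEYWORD_STUFFED,
    "p3": SpamKind.POST_WEAVING,
}
flagged = {"p1", "p3", "p9"}
per_kind = recall_by_kind(assessed, flagged)
```

In this made-up run the system catches all post-weaving spam but only half of the keyword-stuffed posts, which is the sort of gap the detailed labels are meant to expose.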

17 Evaluation D base distributed first, D test subsequently with 50 independent sets of permalinks D base, D test division will mimic how blog search engines operate –Build models to detect splogs – using individual posts, feeds or blog homepages of what is seen –Detect spam in an incoming stream of new blog postings Teams will be judged by how well they detect “spamminess” for new posts
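The build-then-detect workflow on this slide can be sketched with a toy classifier. The word-count Naive Bayes below is a stand-in chosen for brevity, not the detection models the authors used; the class name `ToyNaiveBayes`, the training texts, and the labels are all invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class ToyNaiveBayes:
    """Minimal word-count Naive Bayes; assumes both classes occur in training."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, labelled_posts):
        # labelled_posts: iterable of (text, is_spam) pairs drawn from D base.
        for text, is_spam in labelled_posts:
            self.doc_counts[is_spam] += 1
            self.word_counts[is_spam].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        # Add-one smoothing, with one extra slot for unseen words.
        denom = sum(counts.values()) + len(self.vocab) + 1
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for tok in tokens:
            score += math.log((counts[tok] + 1) / denom)
        return score

    def spam_probability(self, text):
        tokens = tokenize(text)
        diff = self._log_score(tokens, False) - self._log_score(tokens, True)
        if diff > 700:  # avoid overflow in exp for very ham-like posts
            return 0.0
        return 1.0 / (1.0 + math.exp(diff))


# Hypothetical D base: posts that have already been seen and labelled.
d_base = [
    ("cheap mortgage card interest click here", True),
    ("mortgage interest card offers", True),
    ("my trip to baltimore last weekend", False),
    ("notes from the trec workshop", False),
]
model = ToyNaiveBayes().fit(d_base)

# Hypothetical D test: an incoming stream of new posts to score.
d_test_stream = ["low interest mortgage card", "workshop notes and weekend trip"]
scores = [model.spam_probability(text) for text in d_test_stream]
```

The point of the sketch is the ordering: the model only ever sees D base during `fit`, and each D test post is scored on arrival without retraining.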

18 Input/Output Input: an individual set of test permalinks, each to be judged by participants. 50 such sets can be used, similar to how topics are used in the opinion identification task. Output format: {set Q0 docno rank prob runtag}
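One way to emit run lines with the six fields named on this slide is sketched below. The field order follows the slide; the function name `format_run`, ranking by descending spam probability, ranks starting at 1, four decimal places, and the example document numbers are assumptions, since the slide gives no further detail.

```python
def format_run(set_id, scored_permalinks, run_tag):
    """Return run lines of the form 'set Q0 docno rank prob runtag'.

    scored_permalinks: list of (docno, prob_spam) pairs for one test set.
    Posts are ranked from most to least likely spam (an assumption).
    """
    ranked = sorted(scored_permalinks, key=lambda pair: pair[1], reverse=True)
    return [
        f"{set_id} Q0 {docno} {rank} {prob:.4f} {run_tag}"
        for rank, (docno, prob) in enumerate(ranked, start=1)
    ]


# Hypothetical document numbers and scores for test set 1.
lines = format_run(
    1,
    [("BLOG06-x-0001", 0.12), ("BLOG06-x-0002", 0.97)],
    "demoRun",
)
for line in lines:
    print(line)
```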

19 Summary Spam blogs present a major challenge to the quality of blog mining/analytics Splog detection differs from spam detection on other communication platforms Developing a TREC task will help further the state of the art Task requirements can be easily aligned with the existing opinion identification task

