Adversarial Information System Tanay Tandon Web Enhanced Information Management April 5th, 2011.

Agenda
-- What is Adversarial Information Retrieval?
-- What is the goal of Adversarial Information Retrieval?
-- Issues with first generation engines
-- Resolution
-- Future improvements

What is Adversarial IR?
-- Information retrieval covers gathering, indexing, retrieving and ranking information.
-- In adversarial IR, a subset of that information has been manipulated maliciously.
-- The usual motive is financial gain.

What is the Goal of Adversarial IR?
-- Detect bad or inauthentic sites.
-- Improve the precision of search engines by eliminating such sites from the results.

Issues with First Generation Engines
-- What is term frequency? The number of times a term appears in a page.
-- First generation engines relied heavily on term frequency to determine a page's rank.
-- A page could therefore raise its rank by repeating the same word over and over again.
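
To make the keyword-repetition problem concrete, here is a minimal sketch in Python. The scoring function, the tokenizer and the two sample pages are illustrative assumptions, not the formula of any particular engine; the point is only that a score based on term frequency rises when a word is repeated.

```python
import re
from collections import Counter


def term_frequency(text: str, term: str) -> float:
    """Fraction of the tokens in `text` that equal `term` (case-insensitive)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[term.lower()] / len(tokens)


# Hypothetical pages: the second is the first with one keyword stuffed in.
honest = "We sell cheap flights to Paris and London"
stuffed = honest + " cheap" * 20

print(term_frequency(honest, "cheap"))   # 0.125 (1 of 8 tokens)
print(term_frequency(stuffed, "cheap"))  # 0.75  (21 of 28 tokens)
```

An engine that ranked only on this score would place the stuffed page above the honest one for the query "cheap", even though it carries no extra information.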

Search Engine Spamming
-- Link spam
-- Link-bombing
-- Spam blogs
-- Comment spam
-- Keyword spam
-- Malicious tagging

Google Bombing

TrustRank
Observation
-- Good pages tend to link to good pages.
Algorithm
-- Select a small subset of pages and let a human classify them.
-- Propagate the goodness of pages.

Propagation
Trust function T
-- T(p) returns the probability that page p is a good page.
Initial values
-- T(p) = 1, if p was found to be a good page.
-- T(p) = 0, if p was found to be a spam page.
Iterations
-- Propagate trust following out-links.
-- Run only a fixed number of iterations M.
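
The propagation step can be sketched as follows. This is a simplified illustration, not the presenter's implementation: the function name, the dictionary-based graph and the neutral value 0.5 for pages that are never reached are all assumptions. Trust of 1 is copied along out-links from the good seeds for at most M steps and is never passed onto a page a human has marked as spam.

```python
def m_step_trust(out_links, good_seeds, spam_seeds, max_steps, unknown=0.5):
    """Spread trust from good seeds along out-links for `max_steps` hops.

    out_links:  dict mapping a page to the list of pages it links to
    good_seeds: set of pages a human classified as good (T = 1)
    spam_seeds: set of pages a human classified as spam (T = 0)
    unknown:    assumed trust for pages never reached (not from the slides)
    """
    pages = set(out_links) | set(good_seeds) | set(spam_seeds)
    for targets in out_links.values():
        pages.update(targets)

    trust = {p: unknown for p in pages}
    for p in spam_seeds:
        trust[p] = 0.0
    for p in good_seeds:
        trust[p] = 1.0

    frontier = set(good_seeds)
    for _ in range(max_steps):  # only a fixed number of iterations M
        next_frontier = set()
        for p in frontier:
            for q in out_links.get(p, ()):
                if q in spam_seeds or trust[q] == 1.0:
                    continue  # keep the human verdict; skip pages already trusted
                trust[q] = 1.0
                next_frontier.add(q)
        frontier = next_frontier
    return trust


# Hypothetical five-page web: seed -> a -> b -> c, and spam -> c.
links = {"seed": ["a"], "a": ["b"], "b": ["c"], "c": [], "spam": ["c"]}
result = m_step_trust(links, {"seed"}, {"spam"}, max_steps=2)
print(result)
```

With M = 2 the pages a and b become trusted, while c (three hops away) keeps the assumed neutral value and the spam page stays at 0.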

Propagation Issues and Resolution
Problems with propagation
-- Pages reachable from good seeds might not be good.
-- The further away we are from the good seed pages, the less certain we are that a page is good.
Solution
-- Reduce trust as we move further away from the good seed pages (trust attenuation).
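
One possible form of trust attenuation is to multiply trust by a constant factor at every hop, so a page at distance k from the nearest good seed receives beta^k. The sketch below assumes this dampening scheme; the factor `beta`, the function name and the sample graph are illustrative choices, not values given in the slides.

```python
from collections import deque


def dampened_trust(out_links, good_seeds, beta=0.85):
    """Assign trust beta**distance, where distance is the number of hops
    from the nearest good seed. Unreachable pages are left out of the result.
    """
    trust = {p: 1.0 for p in good_seeds}
    queue = deque(good_seeds)
    while queue:
        p = queue.popleft()
        for q in out_links.get(p, ()):
            if q not in trust:  # breadth-first order: first visit is the shortest path
                trust[q] = beta * trust[p]
                queue.append(q)
    return trust


# Hypothetical chain: seed -> a -> b, with an unrelated page x.
chain = {"seed": ["a"], "a": ["b"], "b": [], "x": ["b"]}
scores = dampened_trust(chain, ["seed"], beta=0.5)
print(scores)
```

Because the search is breadth-first, each page is scored by its shortest path from a good seed, which gives it the highest trust that any path would yield under this scheme.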

Hubs and Authorities
-- A hub is good if it points to good authorities.
-- An authority is good if good hubs point to it.
-- The weights given to pages must keep track of their authenticity.
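
The mutual definition of hubs and authorities is usually computed by iterating the two rules until the scores settle. The following is a minimal sketch of that iteration; the function name, the fixed iteration count and the toy graph are assumptions made for illustration.

```python
import math


def hits(out_links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # An authority is good if good hubs point to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                auth[q] += hub[p]
        # A hub is good if it points to good authorities.
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
        # Normalise so the scores stay bounded from one iteration to the next.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: h1 links to a1 and a2, h2 links only to a1.
graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print(sorted(auth, key=auth.get, reverse=True)[0])  # a1
print(sorted(hub, key=hub.get, reverse=True)[0])    # h1
```

In this toy graph a1 ends up as the strongest authority because both hubs point to it, and h1 as the strongest hub because it points to both authorities.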

Further Improvements
Seed weighting
-- Instead of assigning equal weights to each seed, assign a weight proportional to its quality/importance.
Seed filtering
-- Filter out low quality pages that may exist in topic directories.
Finer topics
-- Use the lower layers of topic directories.
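
Seed weighting and seed filtering can be combined in a few lines. The sketch below is only an illustration: the quality scores, the `min_quality` threshold and the function name are hypothetical, and the slides do not say how quality would be measured.

```python
def seed_distribution(seed_quality, min_quality=0.0):
    """Turn per-seed quality scores into starting trust weights.

    Seeds scoring below `min_quality` are filtered out; the remaining
    seeds get a weight proportional to their quality, summing to 1.
    """
    kept = {p: q for p, q in seed_quality.items() if q >= min_quality}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no seed with positive quality left after filtering")
    return {p: q / total for p, q in kept.items()}


# Hypothetical quality scores for three directory pages.
quality = {"a": 3.0, "b": 1.0, "c": 0.1}
weights = seed_distribution(quality, min_quality=0.5)
print(weights)  # {'a': 0.75, 'b': 0.25}
```

The resulting weights could replace the equal starting values given to every good seed before propagation begins.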