Web search basics.

Web Challenges for IR
– Distributed data: documents spread over millions of different web servers.
– Volatile data: many documents change or disappear rapidly (e.g. dead links).
– Large volume: billions of separate documents.
– Unstructured and redundant data: no uniform structure, HTML errors, up to 30% (near-)duplicate documents.
– Quality of data: no editorial control; false information, poor-quality writing, typos, etc.
– Heterogeneous data: multiple media types (images, video, VRML), languages, character sets, etc. (VRML: Virtual Reality Modeling Language, used to design 3-D models.)

The Web document collection (Sec. 19.2)
– No design/coordination
– Distributed content creation, linking, democratization of publishing
– Content includes truth, lies, obsolete information, contradictions …
– Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases) …
– Scale much larger than previous text collections
– Growth has slowed from the initial "volume doubling every few months," but the Web is still expanding
– Content can be dynamically generated

Web Search Using IR (diagram): a web spider gathers the document corpus; the IR system answers the user's query with ranked documents (1. Page1, 2. Page2, 3. Page3, …).

Brief history
– Early keyword-based engines, ca. (circa) 1995-1997: Altavista, Excite, Infoseek, Inktomi, Lycos
– Paid search ranking: your search ranking depended on how much you paid
– 1998+: link-based ranking pioneered by Google; Google added paid search "ads" to the side, independent of the search results

Paid Search Ads (screenshot): a results page showing paid search ads placed alongside the algorithmic results.

Web search basics (Sec. 19.4.1, diagram): a web spider crawls the Web; the indexer builds the indexes (alongside separate ad indexes); the search interface answers the user's queries from those indexes.

User Needs (Sec. 19.4.1)
– Informational: want to learn about something, e.g. "low hemoglobin"
– Navigational: want to go to that page, e.g. "United Airlines"
– Transactional: want to do something (web-mediated)
  – Access a service, e.g. "Seattle weather"
  – Downloads, e.g. "Mars surface images"
  – Shop, e.g. "Canon S410"
– Gray areas
  – Find a good hub, e.g. "car rental Brasil"
  – Exploratory search: "see what's there"

How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users' empirical evaluation of results
– Quality of pages: relevance, plus other desirable (non-IR) qualities
  – Content: trustworthy, diverse, non-duplicated, well maintained
  – Web readability: displays correctly and fast
  – No annoyances: pop-ups, etc.
– Precision vs. recall
  – On the web, recall seldom matters
  – Recall matters when the number of matches is very small
  – Comprehensiveness: must be able to deal with obscure queries
– User perceptions may be unscientific, but they are significant

Users' empirical evaluation of engines
– Relevance and validity of results
– UI: simple, no clutter, error tolerant
– Trust: results are objective
– Pre/post-process tools provided
  – Mitigate user errors (automatic spell check, search assist, …)
  – Explicit: search within results, more like this, refine …
  – Anticipative: related searches
– Deal with web-specific idiosyncrasies
  – Web-specific vocabulary: impact on stemming, spell check, etc.
  – Web addresses typed in the search box

SPIDERING

Spiders (Robots/Bots/Crawlers)
– Web crawling is the process by which we gather pages from the Web.
– Start with a comprehensive set of root URLs from which to begin the search.
– Follow all links on these pages recursively to find additional pages.
– Must obey page-owner restrictions: robot exclusion.

Spidering Algorithm
Initialize queue Q with the initial set of known URLs.
Until Q is empty, or a page or time limit is exhausted:
  Pop URL L from the front of Q.
  If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, …), continue loop.
  If L has already been visited, continue loop.
  Download page P for L.
  If P cannot be downloaded (e.g. 404 error, robot excluded), continue loop.
  Index P (e.g. add to inverted index or store cached copy).
  Parse P to obtain a list of new links N.
  Append N to the end of Q.
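
A minimal Python sketch of this loop (illustrative only: the link extraction and non-HTML filter are crude, and a real crawler also needs politeness delays, robots.txt checks, and URL canonicalization):

    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re

    def crawl(seed_urls, page_limit=100):
        q = deque(seed_urls)               # frontier: initial set of known URLs
        visited = set()
        index = {}                         # URL -> cached page text
        while q and len(index) < page_limit:
            url = q.popleft()              # pop URL L from the front of Q
            if url.endswith((".gif", ".jpeg", ".ps", ".pdf", ".ppt")):
                continue                   # not an HTML page
            if url in visited:
                continue                   # already visited
            visited.add(url)
            try:
                page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except Exception:              # e.g. 404 error, network failure
                continue
            index[url] = page              # "index" P (here: store a cached copy)
            links = re.findall(r'href="([^"]+)"', page)      # parse P for new links N
            q.extend(urljoin(url, link) for link in links)   # append N to end of Q
        return index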

Queueing Strategy
– How new links are added to the queue determines the search strategy.
– FIFO (append to end of Q) gives breadth-first search.
– LIFO (add to front of Q) gives depth-first search.
– Heuristically ordering Q gives a "focused crawler" that directs its search towards "interesting" pages (see the sketch below).
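
In the crawl sketch above, the queue discipline is a one-line difference; a self-contained illustration with collections.deque, where score() stands in for a hypothetical page-interest heuristic:

    from collections import deque
    import heapq

    q = deque(["seed"])
    # breadth-first: append new links to the back, pop pages from the front (FIFO)
    q.append("new_link"); url = q.popleft()
    # depth-first: add new links to the front instead, still popping the front (LIFO)
    q.appendleft("new_link"); url = q.popleft()

    # focused crawler: order the frontier by an "interest" score with a heap
    score = lambda u: len(u)            # hypothetical interestingness heuristic
    pq = [(-score(u), u) for u in ["seed"]]
    heapq.heapify(pq)
    heapq.heappush(pq, (-score("new_link"), "new_link"))
    _, url = heapq.heappop(pq)          # most interesting URL first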

Search Strategies (diagram): breadth-first search.

Search Strategies, cont. (diagram): depth-first search.

Avoiding Page Re-spidering
– Must detect when revisiting a page that has already been spidered (the web is a graph, not a tree).
– Must efficiently index visited pages to allow a rapid recognition test.
– Index pages using the URL as a key.
  – Must canonicalize URLs (e.g. delete a trailing "/").
  – Does not detect duplicated or mirrored pages.
– Index pages using textual content as a key.
  – Requires first downloading the page.
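
A minimal sketch of URL canonicalization, using one plausible set of rules (real crawlers apply many more, such as resolving relative paths and stripping session IDs):

    from urllib.parse import urlsplit, urlunsplit

    def canonicalize(url):
        scheme, netloc, path, query, _ = urlsplit(url)   # discard the fragment
        netloc = netloc.lower()                          # host names are case-insensitive
        if netloc.endswith(":80") and scheme == "http":
            netloc = netloc[:-3]                         # drop the default port
        path = path.rstrip("/") or "/"                   # delete ending "/"
        return urlunsplit((scheme, netloc, path, query, ""))

    # canonicalize("HTTP://Example.COM:80/a/") == canonicalize("http://example.com/a")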

Robot Exclusion
– Web sites and pages can specify that robots should not crawl/index certain areas.
– Two components:
  – Robots Exclusion Protocol: site-wide specification of excluded directories.
  – Robots META tag: individual document tag to exclude indexing or following links.
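
For example, the standard robots META tag that asks engines neither to index a page nor to follow its links:

    <meta name="robots" content="noindex, nofollow">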

Robots Exclusion Protocol
– The site administrator puts a "robots.txt" file at the root of the host's web directory.
  – http://www.ebay.com/robots.txt
  – http://www.cnn.com/robots.txt
– The file is a list of excluded directories for a given robot.
– Exclude all robots from the entire site:
    User-agent: *
    Disallow: /

Robot Exclusion Protocol Examples
– Exclude specific directories:
    User-agent: *
    Disallow: /tmp/
    Disallow: /cgi-bin/
    Disallow: /users/paranoid/
– Exclude a specific robot:
    User-agent: GoogleBot
    Disallow: /
– Allow a specific robot:
    User-agent: GoogleBot
    Disallow:
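
Crawlers check robots.txt before fetching; a minimal sketch using Python's standard urllib.robotparser ("MyCrawler" is a hypothetical agent name):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://www.example.com/robots.txt")
    rp.read()                              # fetch and parse the robots.txt file
    url = "http://www.example.com/tmp/page.html"
    if rp.can_fetch("MyCrawler", url):
        print("allowed: download", url)
    else:
        print("excluded: skip", url)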

Keeping Spidered Pages Up to Date
– The web is very dynamic: many new pages, updated pages, deleted pages, etc.
– Periodically check spidered pages for updates and deletions:
  – Just look at header info (e.g. META tags on last update) to determine whether a page has changed; only reload the entire page if needed.
– Track how often each page is updated, and preferentially return to pages that are historically more dynamic.
– Preferentially update pages that are accessed more often, to optimize the freshness of more popular pages.
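
One common way to "look at header info" is an HTTP conditional GET; a sketch with Python's standard library, where If-Modified-Since makes the server reply 304 Not Modified if the page is unchanged:

    from urllib.request import Request, urlopen
    from urllib.error import HTTPError

    req = Request("http://www.example.com/page.html",
                  headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"})
    try:
        page = urlopen(req).read()   # 200 OK: page changed, re-index it
    except HTTPError as e:
        if e.code == 304:
            pass                     # Not Modified: keep the cached copy
        else:
            raise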

SPAM (SEARCH ENGINE OPTIMIZATION)

The trouble with paid search ads (Sec. 19.2.2)
– It costs money. What's the alternative?
– Search Engine Optimization (SEO):
  – "Tuning" your web page to rank highly in the algorithmic search results for selected keywords
  – An alternative to paying for placement, and thus intrinsically a marketing function
  – Performed by companies, webmasters, and consultants ("search engine optimizers") for their clients
  – Some perfectly legitimate, some very shady

Simplest forms (Sec. 19.2.2)
– First-generation engines relied heavily on tf-idf.
  – The top-ranked pages for the query "Qom University" were the ones containing the most occurrences of "Qom" and "University".
– SEOs responded with dense repetitions of the chosen terms, e.g. "Qom University Qom University Qom University".
– Often the repetitions were in the same color as the background of the web page:
  – Repeated terms got indexed by crawlers,
  – but were not visible to humans in browsers.
– Pure word density cannot be trusted as an IR signal.
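
An illustrative sketch of why raw word density is so easy to game: a keyword-stuffed page trivially maximizes a density score (density() is a hypothetical toy measure, not any engine's actual ranking formula):

    def density(term, text):
        words = text.lower().split()
        return words.count(term.lower()) / max(len(words), 1)

    honest  = "the university of qom offers degrees in computer science"
    stuffed = "qom university " * 50           # classic keyword stuffing
    print(density("qom", honest))              # about 0.11
    print(density("qom", stuffed))             # 0.5: half of all words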

Cloaking (Sec. 19.2.2): serve fake content to the search engine spider; the server checks whether the requester is a crawler and, if so, returns a different page than the one human visitors see.
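
A minimal sketch of the abuse, using a hypothetical request handler (real engines counter this by also crawling from undisclosed user agents and IP ranges):

    def handle_request(user_agent):
        # cloaking: decide what to serve based on who is asking
        crawler_signatures = ("Googlebot", "Bingbot", "Slurp")
        if any(sig in user_agent for sig in crawler_signatures):
            return "keyword-stuffed page built for the ranking function"
        return "the page human visitors actually see"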

More spam techniques (Sec. 19.2.2)
– Doorway pages: pages optimized for a single keyword that redirect to the real target page
– Link spamming: fake links
– Robots: fake query streams, rank-checking programs

The war against spam
– Quality signals: prefer authoritative pages based on
  – votes from authors (linkage signals)
  – votes from users (usage signals)
– Policing of URL submissions: anti-robot tests
– Limits on meta-keywords
– Robust link analysis
  – Ignore statistically implausible linkage (or text)
  – Use link analysis to detect spammers (guilt by association)
– Spam recognition by machine learning: training set based on known spam
– Family-friendly filters
  – Linguistic analysis, general classification techniques, etc.
  – For images: flesh-tone detectors, source-text analysis, etc.
– Editorial intervention
  – Blacklists
  – Top queries audited
  – Complaints addressed
  – Suspect pattern detection

More on spam
– Web search engines have policies on which SEO practices they tolerate or block
  – http://help.yahoo.com/help/us/ysearch/index.html
  – http://www.google.com/intl/en/webmasters/
– Adversarial IR: the unending (technical) battle between SEOs and web search engines
– Research: http://airweb.cse.lehigh.edu/

Sec. 19.6 DUPLICATE DETECTION

Duplicate documents (Sec. 19.6)
– The web is full of duplicated content.
– Strict duplicate detection (= exact match) is not as common.
– But there are many, many cases of near duplicates, e.g. two copies of a page where the last-modified date is the only difference.

Duplicate/Near-Duplicate Detection (Sec. 19.6)
– Duplication: exact matches can be detected with fingerprints.
– Near-duplication: approximate matching
  – Compute syntactic similarity.
  – Use a similarity threshold to detect near-duplicates, e.g. similarity > 80% => the documents are "near duplicates".
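
Exact-match fingerprinting reduces to hashing each document and comparing digests; a minimal sketch (MD5 is used here purely as an example digest):

    import hashlib

    def fingerprint(text):
        # identical documents produce identical digests
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    seen = set()
    def is_exact_duplicate(text):
        fp = fingerprint(text)
        if fp in seen:
            return True
        seen.add(fp)
        return False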

Computing Similarity (Sec. 19.6)
– Features: segments of a document, called shingles (word n-grams)
  – "a rose is a rose is a rose" → a_rose_is_a, rose_is_a_rose, is_a_rose_is
– Similarity measure between two documents (= two sets of shingles): the Jaccard coefficient, J(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
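
A minimal sketch of shingling and the Jaccard coefficient, using 4-word shingles as in the example above (real systems hash shingles and sketch the sets, e.g. with min-hashing, rather than comparing them directly):

    def shingles(text, n=4):
        words = text.lower().split()
        return {"_".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard(a, b):
        # |intersection| / |union| of the two shingle sets
        return len(a & b) / len(a | b) if (a | b) else 1.0

    s1 = shingles("a rose is a rose is a rose")
    s2 = shingles("a rose is a rose is a flower")
    print(sorted(s1))        # ['a_rose_is_a', 'is_a_rose_is', 'rose_is_a_rose']
    print(jaccard(s1, s2))   # 0.75: below an 80% threshold, so not near-duplicates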