
Standard Web Search Engine Architecture




1 Standard Web Search Engine Architecture
[Architecture diagram:] Crawl the web; check for duplicates and store the documents (DocIds); create an inverted index; search engine servers match the user query against the inverted index; show results to the user.

2 More detailed architecture, from Brin & Page 98
More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.

3 Indexes for Web Search Engines
Inverted indexes are still used, even though the web is so huge.
Most current web search systems partition the indexes across different machines; each machine handles different parts of the data (Google uses thousands of PC-class processors and keeps most things in main memory).
Other systems duplicate the data across many machines; queries are distributed among the machines.
Most do a combination of these.

4 Search Engine Querying
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
Each row can handle 120 queries per second.
Each column can handle 7M pages.
To handle more queries, add another row.
From a description of the FAST search engine, by Knut Risvik.
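
As a rough sketch of the partition-plus-replica idea (hypothetical names and API, not the FAST implementation): each document is hashed to a column, each column has several replica servers (the rows), and a query is fanned out to one replica of every column, after which the partial results are merged.

import hashlib

def column_for(doc_id, num_columns):
    # assign a document to a column (index partition) by hashing its id
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_columns

def search(query, columns, pick_replica):
    # 'columns' is a list of replica lists; 'pick_replica' spreads query
    # load across the rows (e.g. round-robin or random choice)
    merged = []
    for replicas in columns:
        server = pick_replica(replicas)          # one machine per column
        merged.extend(server.search(query))      # hypothetical server API
    return sorted(merged, key=lambda hit: hit.score, reverse=True)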

5 Querying: Cascading Allocation of CPUs
A variation on this that produces a cost savings:
Put high-quality/common pages on many machines.
Put lower-quality/less common pages on fewer machines.
A query goes to the high-quality machines first.
If no hits are found there, go to the other machines.

6 Google
Google maintains (probably) the world's largest Linux cluster (over 15,000 servers).
These are partitioned between index servers and page servers.
Index servers resolve the queries (massively parallel processing).
Page servers deliver the results of the queries.
Over 8 billion web pages are indexed and served by Google.

7 Search Engine Indexes Starting Points for Users include
Manually compiled lists: directories.
Page "popularity": frequently visited pages (in general); frequently visited pages as a result of a query.
Link "co-citation": which sites are linked to by other sites?

8 Starting Points: What is Really Being Used?
Today's search engines combine these methods in various ways.
Integration of directories: today most web search engines integrate categories into the results listings (Lycos, MSN, Google).
Link analysis: Google uses it; others are also using it. Words on the links seem to be especially useful.
Page popularity: many use DirectHit's popularity rankings.

9 Web Page Ranking
Varies by search engine: pretty messy in many cases; details usually proprietary and fluctuating.
Combining subsets of:
Term frequencies
Term proximities
Term position (title, top of page, etc.)
Term characteristics (boldface, capitalized, etc.)
Link analysis information
Category information
Popularity information
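
None of the real formulas are public, but the flavor of "combining subsets" can be sketched as a weighted sum of per-document signals. Everything below (field names, weights, methods) is hypothetical, not any engine's actual scoring function.

def score(doc, query_terms, w):
    # weighted combination of a few ranking signals for one document
    s = w["tf"] * sum(doc.term_frequency(t) for t in query_terms)
    s += w["proximity"] / (1.0 + doc.min_term_span(query_terms))  # closer terms score higher
    s += w["title"] * sum(1 for t in query_terms if t in doc.title_terms)
    s += w["link"] * doc.link_score          # e.g. a PageRank-style value
    s += w["popularity"] * doc.popularity    # e.g. DirectHit-style click data
    return s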

10 Ranking: Hearst '96
Proximity search can help get high-precision results if the query has more than one term.
Combine Boolean and passage-level proximity.
Shows significant improvements when retrieving the top 5, 10, 20, or 30 documents.
Results reproduced by Mitra et al. '98.
Google uses something similar.

11 Ranking: Link Analysis
Assumptions:
If the pages pointing to this page are good, then this is also a good page.
The words on the links pointing to this page are useful indicators of what this page is about.
References: Page et al. 98, Kleinberg 98

12 Ranking: Link Analysis
Why does this work?
The official Toyota site will be linked to by lots of other official (or high-quality) sites.
The best Toyota fan-club site probably also has many links pointing to it.
Lower-quality sites do not have as many high-quality sites linking to them.

13 Ranking: PageRank
Google uses PageRank.
We assume page A has pages T1...Tn which point to it (i.e., are citations).
The parameter d is a damping factor which can be set between 0 and 1; d is usually set to 0.85.
C(A) is defined as the number of links going out of page A.
The PageRank of a page A is given as follows:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
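
A minimal sketch of computing this iteratively on a toy graph (not Google's implementation; the graph and the choice d = 0.85 are only illustrative):

def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links out to
    pages = list(links)
    pr = {p: 1.0 for p in pages}                 # start every page at PR = 1
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # sum PR(Ti)/C(Ti) over every page Ti that links to a
            incoming = sum(pr[t] / len(links[t]) for t in pages if a in links[t])
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# toy graph: T1 and T2 both point to A
print(pagerank({"A": ["T1"], "T1": ["A", "T2"], "T2": ["A"]}))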

14 Note: these are not real PageRanks, since they include values >= 1
[Example link graph: pages T1-T8, X1 and X2 pointing to page A, with illustrative PageRank values (most pages at Pr = 1, T1 at Pr = .725).]

15 PageRank
Similar to calculations used in scientific citation analysis (e.g., Garfield et al.) and social network analysis (e.g., Wasserman et al.).
Similar to other work on ranking (e.g., the hubs and authorities of Kleinberg et al.).
How is Amazon similar to Google in terms of the basic insights and techniques of PageRank?
How could PageRank be applied to other problems and domains?

16 Today
Review: Web Crawling and Search Issues; Web Search Engines and Algorithms
Web Search Processing
Parallel Architectures (Inktomi – Eric Brewer)
Cheshire III Design
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer.

37 Digital Library Grid Initiatives: Cheshire3 and the Grid
Presentation from DLF Forum, April 2005.
Digital Library Grid Initiatives: Cheshire3 and the Grid
Ray R. Larson, University of California, Berkeley, School of Information Management and Systems
Rob Sanderson, University of Liverpool, Dept. of Computer Science
Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation.

38 Overview
The Grid, Text Mining and Digital Libraries: Grid Architecture; Grid IR Issues
Cheshire3, Bringing Search to Grid-Based Digital Libraries: Overview; Grid Experiments; Cheshire3 Architecture; Distributed Workflows

39 Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)
[Layered diagram:]
Applications: high energy physics, engineering, chemical, climate, astrophysics, cosmology, combustion, ...
Application Toolkits: remote computing, remote visualization, data grid, collaboratories, portals, remote sensors, ...
Grid Services (grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Fabric: storage, networks, computers, display devices, etc. and their associated local services.

40 Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[Layered diagram:]
Applications: high energy physics, bio-medical, digital libraries, engineering, chemical, humanities computing, astrophysics, climate, cosmology, combustion, ...
Application Toolkits: remote computing, remote visualization, search & retrieval, metadata management, text mining, collaboratories, data grid, portals, remote sensors, ...
Grid Services (grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
Fabric: storage, networks, computers, display devices, etc. and their associated local services.

41 Grid-Based Digital Libraries
Large-scale distributed storage requirements and technologies
Organizing distributed digital collections
Shared metadata – standards and requirements
Managing distributed digital collections
Security and access control
Collection replication and backup
Distributed information retrieval issues and algorithms

42 Grid IR Issues
Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed).
Very large-scale distribution of resources is a challenge for sub-second retrieval.
Different from most other typical Grid processes, IR is potentially less computing-intensive and more data-intensive.
In many ways Grid IR replicates the process (and problems) of metasearch or distributed search.

43 Cheshire3 Overview XML Information Retrieval Engine
3rd generation of the UC Berkeley Cheshire system, co-developed at the University of Liverpool.
Uses Python for flexibility and extensibility, but imports C/C++ based libraries for processing speed.
Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI, to name a few.
Grid capable: uses distributed configuration files, workflow definitions and PVM (currently) to scale from one machine to thousands of parallel nodes.
Free and Open Source Software (GPL licence).
(Under development!)

44 Cheshire3 Server Overview
[Server architecture diagram: the Cheshire3 server exposes an API via native calls and protocols (Z39.50, SOAP, OAI, SRW, OpenURL, JDBC, UDDI, WSRP, OGIS) to users, clients, and remote systems (any protocol); internally, protocol handlers, indexing, transformation, normalization and workflow components operate over local databases holding XML configuration and metadata, indexes, record stores, result sets, and user/access control, with a staff UI and server control.]

45 Cheshire3 Grid Tests
Running on a 30-processor cluster in Liverpool using PVM (Parallel Virtual Machine).
Using 16 processors with one "master" and 22 "slave" processes, we were able to parse and index MARC data at about records per second.
On a similar setup, 610 MB of TEI data can be parsed and indexed in seconds.

46 SRB and SDSC Experiments
We are working with SDSC to include SRB support.
We are planning to continue working with SDSC and to run further evaluations using the TeraGrid server(s) through a "small" grant for CPU hours.
SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel Itanium 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GB) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network.
Planned large-scale test collections include NSDL, the NARA repository, CiteSeer and the "million books" collections of the Internet Archive.

47 Cheshire3 Object Model
[Object model diagram: a Server holds ConfigStore, UserStore and Database objects; a Protocol Handler or Ingest Process feeds DocumentGroups of Documents through PreParser, Parser, Extracter and Normaliser objects to produce Records and index Terms, held in RecordStore, DocumentStore and IndexStore; Transformers turn Records back into Documents; Queries against Indexes yield ResultSets for a User.]

48 Cheshire3 Data Objects
DocumentGroup: a collection of Document objects (e.g. from a file, directory, or external search)
Document: a single item, in any format (e.g. PDF file, raw XML string, relational table)
Record: a single item, represented as parsed XML
Query: a search query, in the form of CQL (an abstract query language for Information Retrieval)
ResultSet: an ordered list of pointers to records
Index: an ordered list of terms extracted from Records

49 Cheshire3 Process Objects
PreParser: given a Document, transform it into another Document (e.g. PDF to Text, Text to XML)
Parser: given a Document as a raw XML string, return a parsed Record for the item
Transformer: given a Record, transform it into a Document (e.g. via XSLT, from XML to PDF, or XML to relational table)
Extracter: extract terms of a given type from an XML sub-tree (e.g. extract dates, keywords, exact string value)
Normaliser: given the results of an Extracter, transform the terms, maintaining the data structure (e.g. CaseNormaliser)
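
These roles chain together as a pipeline. The sketch below is not the real Cheshire3 API; the class and method names are invented purely to illustrate the PreParser, Parser, Extracter, Normaliser flow.

import xml.etree.ElementTree as ET

class TextToXmlPreParser:
    def process(self, raw_text):
        # Document -> Document: wrap plain text in a minimal XML record
        return "<record><title>%s</title><date>2005-04-25</date></record>" % raw_text

class XmlParser:
    def process(self, xml_string):
        # Document (raw XML string) -> parsed Record
        return ET.fromstring(xml_string)

class DateExtracter:
    def process(self, record):
        # pull terms of one type (dates) out of the parsed XML sub-tree
        return [e.text for e in record.findall(".//date")]

class CaseNormaliser:
    def process(self, terms):
        # transform the extracted terms, keeping the data structure
        return [t.lower() for t in terms]

# chain the process objects, roughly what a Workflow definition expresses
doc = TextToXmlPreParser().process("Cheshire3 and the Grid")
record = XmlParser().process(doc)
print(CaseNormaliser().process(DateExtracter().process(record)))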

50 Cheshire3 Abstract Objects
Server: a logical collection of databases
Database: a logical collection of Documents, their Record representations, and Indexes of extracted terms
Workflow: a 'meta-process' object that takes a workflow definition in XML and converts it into executable code

51 Workflow Objects
Workflows are first-class objects in Cheshire3 (though not represented in the model diagram).
All Process and Abstract objects have individual XML configurations with a common base schema with extensions.
We can treat configurations as Records and store them in regular RecordStores, allowing access via regular IR protocols.

52 Workflow References
Workflows contain a series of instructions to perform, with reference to other Cheshire3 objects.
Reference is via pseudo-unique identifiers ... pseudo because they are unique within the current context (Server vs Database).
Workflows are objects, so this enables server-level workflows to call database-specific workflows with the same identifier.

53 Distributed Processing
Each node in the cluster instantiates the configured architecture, potentially through a single ConfigStore. Master nodes then run a high level workflow to distribute the processing amongst Slave nodes by reference to a subsidiary workflow As object interaction is well defined in the model, the result of a workflow is equally well defined. This allows for the easy chaining of workflows, either locally or spread throughout the cluster.

54 Workflow Example 1
<subConfig id="buildWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <log>Starting Load</log>
    <object type="recordStore" function="begin_storing"/>
    <object type="database" function="begin_indexing"/>
    <for-each>
      <object type="workflow" ref="buildSingleWorkflow"/>
    </for-each>
    <object type="recordStore" function="commit_storing"/>
    <object type="database" function="commit_indexing"/>
    <object type="database" function="commit_metadata"/>
  </workflow>
</subConfig>

55 Workflow Example 2
<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>Loaded Record</log>
  </workflow>
</subConfig>

56 Workflow Standards
Cheshire3 workflows do not conform to any standard schema.
Intentional: workflows are specific to and dependent on the Cheshire3 architecture.
Replaces the distribution of lines of code for distributed processing; replaces many lines of code in general.
Needs to be easy to understand and create: a GUI workflow builder is coming (web and standalone).

57 External Integration
Looking at integration with existing cross-service workflow systems, in particular Kepler/Ptolemy.
Possible integration at two levels:
Cheshire3 as a service (black box) ... identify a workflow to call.
Cheshire3 object as a service (duplicate existing workflow function) ... but recall the access-speed issue.

58 Conclusions
Scalable Grid-based digital library services can be created and provide support for very large collections with improved efficiency.
The Cheshire3 IR and DL architecture can provide Grid (or single-processor) services for next-generation DLs.
Available as open source via: or


60 Plan for today
Wrap up spam
Crawling
Connectivity servers

61 Link-based ranking
Most search engines use hyperlink information for ranking.
Basic idea: peer endorsement. Web page authors endorse their peers by linking to them.
Prototypical link-based ranking algorithm: PageRank. A page is important if linked to (endorsed) by many other pages, more so if those pages are themselves important.
More later ...

62 Link spam
Link spam: inflating the rank of a page by creating nepotistic links to it, from own sites (link farms), from partner sites (link exchanges), or from unaffiliated sites (e.g. blogs, web forums, etc.).
The more links, the better: generate links automatically; use scripts to post to blogs; synthesize entire web sites (often an infinite number of pages); synthesize many web sites (DNS spam; e.g. *.thrillingpage.info).
The more important the linking page, the better: buy expired highly-ranked domains; post to high-quality blogs.

63 Link farms and link exchanges

64 More spam techniques Cloaking
Serve fake content to the search engine spider.
DNS cloaking: switch IP address; impersonate.
[Diagram: "Is this a search engine spider?" Yes: serve the spam page. No: serve the real document.]

65 Tutorial on Cloaking & Stealth Technology

66 More spam techniques
Doorway pages: pages optimized for a single keyword that redirect to the real target page.
Robots: fake query stream from rank-checking programs; "curve-fit" the ranking programs of search engines; millions of submissions via Add-URL.

67 Acid test
Which SEOs rank highly on the query "seo"?
Web search engines have policies on which SEO practices they tolerate or block; see pointers in Resources.
Adversarial IR: the unending (technical) battle between SEOs and web search engines.
See for instance

68 Crawling

69 Crawling Issues
How to crawl? Quality: "best" pages first. Efficiency: avoid duplication (or near-duplication). Etiquette: robots.txt, server load concerns.
How much to crawl? How much to index? Coverage: how big is the Web? How much do we cover? Relative coverage: how much do competitors have?
How often to crawl? Freshness: how much has changed? How much has really changed? (Why is this a different question?)

70 Basic crawler operation
Begin with known "seed" pages.
Fetch and parse them.
Extract the URLs they point to.
Place the extracted URLs on a queue.
Fetch each URL on the queue and repeat.
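
A minimal single-machine sketch of this loop using only Python's standard library; a real crawler adds politeness delays, robots.txt checks, URL normalization, content deduplication, and distribution across machines.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)            # queue of URLs still to fetch
    seen = set(seeds)                  # URLs already queued (rough de-duplication)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                   # skip unreachable or non-text pages
        fetched += 1
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)      # resolve relative URLs
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen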

71 Simple picture – complications
Web crawling isn't feasible with one machine: all of the above steps must be distributed.
Even non-malicious pages pose challenges: latency/bandwidth to remote servers vary; robots.txt stipulations; how "deep" should you crawl a site's URL hierarchy? Site mirrors and duplicate pages.
Malicious pages: spam pages (Lecture 1, plus others to be discussed); spider traps, including dynamically generated ones.
Politeness: don't hit a server too often.

72 Robots.txt
Protocol for giving spiders ("robots") limited access to a website, originally from 1994.
A website announces its requests about what can (and cannot) be crawled: for a site URL, create a file URL/robots.txt.
This file specifies the access restrictions.

73 Robots.txt example
No robot should visit any URL starting with "/yoursite/temp/", except the robot called "searchengine":

User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
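
Python's standard library can interpret such rules; a sketch using the example above (the host name is hypothetical, and in practice the file would be fetched with set_url and read rather than parsed from a string):

from urllib import robotparser

rules = """
User-agent: *
Disallow: /yoursite/temp/

User-agent: searchengine
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://www.yoursite.example/yoursite/temp/x.html"))            # False
print(rp.can_fetch("searchengine", "http://www.yoursite.example/yoursite/temp/x.html")) # True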

74 Crawling and Corpus Construction
Crawl order Distributed crawling Filtering duplicates Mirror detection

75 Where do we spider next?
[Diagram: the Web, with URLs crawled and parsed, and URLs in the queue (the frontier).]

76 Crawl Order
Want best pages first. Potential quality measures:
Final in-degree
Final PageRank (what's this?)

77 Crawl Order
Want best pages first. Potential quality measures: final in-degree; final PageRank (a measure of page quality we'll define later in the course).
Crawl heuristics: breadth-first search (BFS); partial in-degree; partial PageRank; random walk.

78 BFS & Spam (Worst case scenario)
Start page: normal average out-degree = 10.
BFS depth = 2: 100 URLs on the queue, including a spam page. Assume the spammer is able to generate dynamic pages with 1000 outlinks.
BFS depth = 3: 2000 URLs on the queue; 50% belong to the spammer.
BFS depth = 4: 1.01 million URLs on the queue; 99% belong to the spammer.
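
The queue sizes follow directly from the out-degrees; a quick check of the arithmetic (assuming exactly one of the 100 depth-2 URLs is the spam page):

normal_pages, spam_pages = 99, 1            # the 100 URLs at BFS depth 2
for depth in (3, 4):
    normal_pages *= 10                      # normal pages link to ~10 others
    spam_pages *= 1000                      # spam pages link to 1000 spam pages
    total = normal_pages + spam_pages
    print(depth, total, round(spam_pages / total, 2))
# depth 3: ~1990 URLs, 0.50 spam; depth 4: ~1,009,900 URLs, 0.99 spam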

79 Where do we spider next?
[Diagram: the Web, with URLs crawled and parsed, and URLs in the queue (the frontier).]

80 Where do we spider next?
Keep all spiders busy; keep spiders from treading on each others' toes.
Avoid fetching duplicates repeatedly.
Respect politeness/robots.txt.
Avoid getting stuck in traps.
Detect/minimize spam.
Get the "best" pages. What's best? Best for answering search queries.

81 Where do we spider next?
Complex scheduling optimization problem, subject to all the constraints listed, plus operational constraints (e.g., keeping all machines load-balanced).
Scientific study: limited to specific aspects. Which ones? What do we measure? What are the compromises in distributed crawling?

82 Parallel Crawlers
We follow the treatment of Cho and Garcia-Molina, which raises a number of questions in a clean setting for further study.
Setting: we have a number of c-procs (c-proc = crawling process).
Goal: we wish to spider the best pages with minimum overhead. What do these mean?

83 Distributed model
Crawlers may be running in diverse geographies: Europe, Asia, etc.
Periodically update a master index; incremental update, so this is "cheap" (compression, differential update, etc.).
Focus on communication overhead during the crawl; also results in dispersed WAN load.

84 c-procs crawling the web
[Diagram: several c-procs, each with URLs crawled and URLs in queues; which c-proc gets this URL?]
Communication: by URLs passed between c-procs.

85 Measurements
Overlap = (N - I)/I, where N = number of pages fetched and I = number of distinct pages fetched.
Coverage = I/U, where U = total number of web pages.
Quality = sum, over downloaded pages, of their importance; importance of a page = its in-degree.
Communication overhead = number of URLs the c-procs exchange per downloaded page.
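
Written out directly (the per-page normalization of the communication overhead follows the later "exchange mode communication" slide):

def overlap(n_fetched, n_distinct):
    # (N - I) / I: wasted work spent re-fetching pages
    return (n_fetched - n_distinct) / n_distinct

def coverage(n_distinct, total_web_pages):
    # I / U: fraction of the web that was actually downloaded
    return n_distinct / total_web_pages

def quality(downloaded_pages, in_degree):
    # sum of importance over downloaded pages; importance = in-degree here
    return sum(in_degree[p] for p in downloaded_pages)

def communication_overhead(urls_exchanged, pages_downloaded):
    # URLs passed between c-procs per downloaded page
    return urls_exchanged / pages_downloaded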

86 Crawler variations
Independent c-procs: fetch pages oblivious to each other.
Static assignment: web pages partitioned statically a priori, e.g., by URL hash (more to follow).
Dynamic assignment: a central coordinator splits URLs among c-procs.

87 Static assignment
Firewall mode: each c-proc only fetches URLs within its partition (typically a domain); inter-partition links are not followed.
Crossover mode: a c-proc may follow inter-partition links into another partition; possibility of duplicate fetching.
Exchange mode: c-procs periodically exchange URLs they discover in another partition.
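
Static assignment is typically done by hashing the host part of the URL, so an entire site stays in one partition. A sketch of the idea (hypothetical, not the Cho/Garcia-Molina code):

import hashlib
from urllib.parse import urlparse

def assigned_cproc(url, num_cprocs):
    # partition by host so a whole site belongs to one c-proc
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_cprocs

def handle_link(url, my_id, num_cprocs, frontier, exchange_buffer):
    owner = assigned_cproc(url, num_cprocs)
    if owner == my_id:
        frontier.append(url)                  # our partition: crawl it
    else:
        # firewall mode would drop it; exchange mode queues it for its owner
        exchange_buffer[owner].append(url)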

88 Experiments
40M-URL graph from the Stanford WebBase.
Open Directory (dmoz.org) URLs as seeds.
Should be considered a small Web.

89 Summary of findings Cho/Garcia-Molina detail many findings
We will review some here, both qualitatively and quantitatively You are expected to understand the reason behind each qualitative finding in the paper You are not expected to remember quantities in their plots/studies

90 Firewall mode coverage
The price of crawling in firewall mode

91 Crossover mode overlap
Demanding coverage drives up overlap

92 Exchange mode communication
Communication overhead sublinear Per downloaded URL

93 Connectivity servers

94 Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
Support for fast queries on the web graph: Which URLs point to a given URL? Which URLs does a given URL point to?
Stores mappings in memory from URL to outlinks, and from URL to inlinks.
Applications: crawl control; web graph analysis (connectivity, crawl optimization); link analysis. More on this later.

95 Most recent published work
Boldi and Vigna: WebGraph, a set of algorithms and a Java implementation.
Fundamental goal: maintain node adjacency lists in memory.
For this, compressing the adjacency lists is the critical component.

96 Adjacency lists
The set of neighbors of a node; assume each URL is represented by an integer.
Properties exploited in compression:
Similarity (between lists)
Locality (many links from a page go to "nearby" pages): use gap encodings in sorted lists
Distribution of gap values
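
Gap encoding itself is simple; a sketch of turning a sorted adjacency list into small gaps, which locality then makes cheap to store with a variable-length code:

def to_gaps(sorted_neighbors):
    # store the first id, then differences between consecutive ids
    gaps, previous = [], 0
    for node_id in sorted_neighbors:
        gaps.append(node_id - previous)
        previous = node_id
    return gaps

def from_gaps(gaps):
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids

print(to_gaps([1000003, 1000005, 1000012, 1000060]))   # [1000003, 2, 7, 48]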

97 Storage
Boldi/Vigna get down to an average of ~3 bits per link (URL-to-URL edge), for a 118M-node web graph.
How? Why is this remarkable?

98 Main ideas of Boldi/Vigna
Consider lexicographically ordered list of all URLs, e.g.,

99 Main ideas of Boldi/Vigna (continued)
Each of these URLs has an adjacency list.
Main thesis: because of templates, the adjacency list of a node is similar to one of the 7 preceding URLs in the lexicographic ordering. (Why 7?)
Express the adjacency list in terms of one of these. E.g., consider these adjacency lists:
1, 2, 4, 8, 16, 32, 64
1, 4, 9, 16, 25, 36, 49, 64
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
1, 4, 8, 16, 25, 36, 49, 64
The last list can be encoded relative to the second: reference (-2), remove 9, add 8.
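
A rough sketch of the reference idea (a simplification, not WebGraph's actual encoding): pick the cheapest of the preceding lists as a reference and store only the offset, the elements to drop, and the elements to add.

def encode_relative(current, previous_lists):
    # previous_lists: adjacency lists of the preceding nodes, most recent first
    # (WebGraph looks back over a small window, e.g. 7 lists)
    best = None
    for offset, reference in enumerate(previous_lists, start=1):
        removed = sorted(set(reference) - set(current))
        added = sorted(set(current) - set(reference))
        cost = len(removed) + len(added)
        if best is None or cost < best[0]:
            best = (cost, offset, removed, added)
    cost, offset, removed, added = best
    return offset, removed, added

previous = [
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],   # most recent
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 4, 8, 16, 32, 64],
]
print(encode_relative([1, 4, 8, 16, 25, 36, 49, 64], previous))
# -> (2, [9], [8]): point two lists back, remove 9, add 8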

100 Resources
www.robotstxt.org/wc/norobots.html
www2002.org/CDROM/refereed/108/index.html
www2004.org/proceedings/docs/1p595.pdf


