
Slide 1: Principles of Information Retrieval – Lecture 23: Web Searching
IS 240 – Spring 2007
Prof. Ray Larson, University of California, Berkeley, School of Information
Tuesday and Thursday, 10:30 am – 12:00 pm
http://courses.ischool.berkeley.edu/i240/s07

Slide 2: Mini-TREC Proposed Schedule
– February 15: Database and previous queries
– February 27: Report on system acquisition and setup
– March 8: New queries for testing…
– April 19: Results due (next Thursday)
– April 24 or 26: Results and system rankings
– May 8: Group reports and discussion

Slide 3: All Mini-TREC Runs
(recall/precision plot; not captured in the transcript)

Slide 4: All Groups – Best Runs
(recall/precision plot; not captured in the transcript)

Slide 5: All Groups – Best Runs + RRL
(recall/precision plot; not captured in the transcript)

Slide 6: Results Data
– trec_eval runs for each submitted file have been put into a new directory called RESULTS in your group directories
– The trec_eval parameters used for these runs are “-o” for the “.res” files and “-o -q” for the “.resq” files
– The “.dat” files contain the recall level and precision values used for the preceding plots
– The qrels for the Mini-TREC queries are available now in the /projects/i240 directory as “MINI_TREC_QRELS”

Slide 7: Mini-TREC Reports
– In-class presentations May 8th
– Written report due May 8th (last day of class): 4–5 pages
– Content:
  – System description
  – What approach/modifications were taken?
  – Results of official submissions (see RESULTS)
  – Results of “post-runs”: new runs made using MINI_TREC_QRELS and trec_eval

Slide 8: Term Paper
Should be about 8–15 pages on:
– Some area of IR research (or practice) that you are interested in and want to study further
– Experimental tests of systems or IR algorithms
– Building an IR system, testing it, and describing the system and its performance
Due May 8th (last day of class)

Slide 9: Today
– Review: web crawling and search issues; web search engines and algorithms
– Web search processing: parallel architectures (Inktomi – Brewer); Cheshire III design
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer.

Slide 10: Web Crawlers
How do the web search engines get all of the items they index? More precisely:
– Put a set of known sites on a queue
– Repeat the following until the queue is empty:
  – Take the first page off of the queue
  – If this page has not yet been processed:
    – Record the information found on this page (positions of words, links going out, etc.)
    – Add each link on the current page to the queue
    – Record that this page has been processed
In what order should the links be followed? A sketch of this loop follows below.
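
A minimal Python sketch of the crawl loop above. It is breadth-first because the queue is FIFO; the regex-based link extraction and the max_pages cap are simplifying assumptions, not how a production crawler works.

```python
from collections import deque
from urllib.parse import urljoin
import re
import urllib.request

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)       # put a set of known sites on a queue
    processed = set()              # record which pages have been processed
    index = {}                     # page URL -> information found on it
    while queue and len(processed) < max_pages:
        url = queue.popleft()      # take the first page off of the queue
        if url in processed:       # skip pages already processed
            continue
        processed.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue               # server unavailable, bad HTML, etc.
        # record the information found on this page (here: outgoing links)
        links = [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]
        index[url] = {"out_links": links}
        queue.extend(links)        # add each link on the current page to the queue
    return index
```

Swapping the deque for a stack (pop from the same end you push) would turn this into the depth-first ordering discussed on the next slide.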

Slide 11: Page Visit Order
Animated examples of breadth-first vs. depth-first search on trees:
http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html

Slide 12: Sites Are Complex Graphs, Not Just Trees
(diagram of six sites whose pages link both within and across sites; the exact link structure is not recoverable from the transcript)

Slide 13: Web Crawling Issues
– “Keep out” signs: a file called robots.txt tells the crawler which directories are off limits
– Freshness: figure out which pages change often, and recrawl these often
– Duplicates, virtual hosts, etc.: convert page contents with a hash function and compare new pages to the hash table
– Lots of problems: server unavailable, incorrect HTML, missing links, infinite loops
Web crawling is difficult to do robustly! A sketch of two of these safeguards follows below.
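
A hedged sketch of two of the issues above: honoring robots.txt with Python's standard-library parser, and detecting duplicate content with a hash table. The host and paths are illustrative assumptions.

```python
import hashlib
import urllib.robotparser

# "Keep out" signs: ask robots.txt before fetching a path
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # assumed host
rp.read()
allowed = rp.can_fetch("*", "http://example.com/private/page.html")

# Duplicates: convert page contents with a hash function,
# then compare new pages against the hash table
seen_hashes = set()

def is_duplicate(page_content: bytes) -> bool:
    digest = hashlib.sha1(page_content).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```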

Slide 14: Search Engines
– Crawling
– Indexing
– Querying

Slide 15: Web Search Engine Layers
From a description of the FAST search engine by Knut Risvik:
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

Slide 16: Standard Web Search Engine Architecture
(architecture diagram) Crawl the web; check for duplicates and store the documents; create an inverted index. Search engine servers run the user query against the inverted index, retrieve DocIds, and show results to the user.

Slide 17: More detailed architecture, from Brin & Page ’98. Only covers the preprocessing in detail, not the query serving.

Slide 18: Indexes for Web Search Engines
– Inverted indexes are still used, even though the web is so huge
– Most current web search systems partition the indexes across different machines: each machine handles different parts of the data (Google uses thousands of PC-class processors and keeps most things in main memory)
– Other systems duplicate the data across many machines: queries are distributed among the machines
– Most do a combination of these; a partitioning sketch follows below
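
A minimal sketch of a document-partitioned inverted index: each "machine" (here just an in-process object) indexes a different slice of the collection, and a query fans out to every partition and merges the hits. The partition count, routing rule, and documents are illustrative assumptions.

```python
from collections import defaultdict

class Partition:
    """Stands in for one index machine holding part of the data."""
    def __init__(self):
        self.postings = defaultdict(set)   # term -> doc ids on this machine
    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)
    def search(self, term):
        return self.postings.get(term, set())

partitions = [Partition() for _ in range(4)]    # 4 stand-in "machines"
docs = {1: "web search engines", 2: "inverted index", 3: "search index"}
for doc_id, text in docs.items():
    partitions[doc_id % len(partitions)].add(doc_id, text)  # route by id

def search(term):
    hits = set()
    for p in partitions:          # the query is distributed among machines
        hits |= p.search(term)
    return hits

print(search("search"))           # doc ids gathered from multiple partitions
```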

Slide 19: Search Engine Querying
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries.
– Each row can handle 120 queries per second
– Each column can handle 7M pages
– To handle more queries, add another row
From a description of the FAST search engine by Knut Risvik:
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

Slide 20: Querying: Cascading Allocation of CPUs
A variation on this that produces a cost savings:
– Put high-quality/common pages on many machines
– Put lower-quality/less common pages on fewer machines
– The query goes to the high-quality machines first
– If no hits are found there, go to the other machines
A sketch of this fallback follows below.
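
A small sketch of the cascading idea above: query the tier holding high-quality/common pages first, and fall through to the lower tier only when no hits are found. Plain dicts stand in for the two machine pools; the contents are illustrative assumptions.

```python
high_tier = {"toyota": ["toyota.com"], "news": ["cnn.com"]}
low_tier  = {"toyota": ["obscure-fan-page.example"], "zither": ["zither.example"]}

def cascading_search(term):
    hits = high_tier.get(term)
    if hits:                        # found on the well-provisioned tier
        return hits
    return low_tier.get(term, [])   # otherwise fall back to fewer machines

print(cascading_search("zither"))   # only the low tier has this rare term
```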

Slide 21: Google
– Google maintains (probably) the world’s largest Linux cluster (over 15,000 servers)
– These are partitioned between index servers and page servers:
  – Index servers resolve the queries (massively parallel processing)
  – Page servers deliver the results of the queries
– Over 8 billion web pages are indexed and served by Google

Slide 22: Search Engine Indexes
Starting points for users include:
– Manually compiled lists (directories)
– Page “popularity”: frequently visited pages (in general), and frequently visited pages as a result of a query
– Link “co-citation”: which sites are linked to by other sites?

Slide 23: Starting Points: What Is Really Being Used?
Today’s search engines combine these methods in various ways:
– Integration of directories: today most web search engines integrate categories into the results listings (Lycos, MSN, Google)
– Link analysis: Google uses it; others are also using it. Words on the links seem to be especially useful
– Page popularity: many use DirectHit’s popularity rankings

Slide 24: Web Page Ranking
Varies by search engine:
– Pretty messy in many cases
– Details usually proprietary and fluctuating
Combining subsets of:
– Term frequencies
– Term proximities
– Term position (title, top of page, etc.)
– Term characteristics (boldface, capitalized, etc.)
– Link analysis information
– Category information
– Popularity information

Slide 25: Ranking: Hearst ’96
– Proximity search can help get high-precision results when the query has more than one term
– Combine Boolean and passage-level proximity
– Shows significant improvements when retrieving the top 5, 10, 20, or 30 documents
– Results reproduced by Mitra et al. ’98
– Google uses something similar
A sketch of passage-level proximity scoring follows below.
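
A small sketch in the spirit of the Boolean + passage-level proximity combination above: a document passes a Boolean AND filter first, then scores higher when all query terms co-occur within a short passage window. The window size and binary scoring are illustrative assumptions, not Hearst's exact method.

```python
def proximity_score(doc_tokens, query_terms, window=10):
    # Boolean filter: every query term must occur somewhere in the document
    if not all(t in doc_tokens for t in query_terms):
        return 0.0
    # Passage-level proximity: do all terms fit in one sliding window?
    for start in range(len(doc_tokens)):
        span = doc_tokens[start:start + window]
        if all(t in span for t in query_terms):
            return 1.0
    return 0.5   # terms present, but never close together

doc = "the official toyota site links to many toyota dealers".split()
print(proximity_score(doc, ["toyota", "dealers"]))   # 1.0: terms co-occur
```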

Slide 26: Ranking: Link Analysis
Assumptions:
– If the pages pointing to this page are good, then this is also a good page
– The words on the links pointing to this page are useful indicators of what this page is about
References: Page et al. ’98, Kleinberg ’98

Slide 27: Ranking: Link Analysis
Why does this work?
– The official Toyota site will be linked to by lots of other official (or high-quality) sites
– The best Toyota fan-club site probably also has many links pointing to it
– Lower-quality sites do not have as many high-quality sites linking to them

Slide 28: Ranking: PageRank
Google uses PageRank. Assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; d is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one. A power-iteration sketch follows below.
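
A minimal power-iteration sketch of the formula above, using the (1 - d) + d * sum(PR(T)/C(T)) form exactly as given on the slide. The tiny three-page link graph is an illustrative assumption.

```python
def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it points to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}                       # initial ranks
    out_degree = {p: len(links.get(p, [])) for p in pages}   # C(T)
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # sum PR(T)/C(T) over all pages T that link to this page
            incoming = sum(pr[t] / out_degree[t]
                           for t, outs in links.items() if page in outs)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))   # converged ranks; C, with two in-links, scores highest
```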

Slide 29: PageRank
(worked example diagram: pages T1–T8, most with PR = 1, T1 with PR = 0.725 and T8 with PR = 2.46625, plus pages X1 and X2, all pointing to page A with PR = 4.2544375)
Note: these are not real PageRanks, since they include values >= 1.

Slide 30: PageRank
– Similar to calculations used in scientific citation analysis (e.g., Garfield et al.) and social network analysis (e.g., Wasserman et al.)
– Similar to other work on ranking (e.g., the hubs and authorities of Kleinberg et al.)
– How is Amazon similar to Google in terms of the basic insights and techniques of PageRank?
– How could PageRank be applied to other problems and domains?

Slide 31: Today
– Review: web crawling and search issues; web search engines and algorithms
– Web search processing: parallel architectures (Inktomi – Eric Brewer); Cheshire III design
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer.

Slides 32–51: (image-only slides on parallel search architectures, Inktomi / Eric Brewer; no text was captured in the transcript)

Slide 52: Grid-based Search and Data Mining Using Cheshire3
In collaboration with Robert Sanderson, University of Liverpool, Department of Computer Science
Presented by Ray R. Larson, University of California, Berkeley, School of Information

Slide 53: Overview
– The Grid, text mining, and digital libraries: grid architecture; grid IR issues
– Cheshire3, bringing search to grid-based digital libraries: overview; grid experiments; Cheshire3 architecture; distributed workflows

Slide 54: Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
(layer diagram, top to bottom)
– Applications: chemical engineering, climate, high-energy physics, cosmology, astrophysics, combustion, …
– Application Toolkits: data grid, remote computing, remote visualization, collaboratories, portals, remote sensors, …
– Grid Services (grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
– Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services

Slide 55: Grid Architecture (ECAI/AS Grid Digital Library Workshop)
(the same layer diagram, extended for digital libraries)
– Applications: chemical engineering, climate, high-energy physics, cosmology, astrophysics, combustion, bio-medical, humanities computing, digital libraries, …
– Application Toolkits: data grid, remote computing, remote visualization, collaboratories, portals, remote sensors, text mining, metadata management, search & retrieval, …
– Grid Services (grid middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
– Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services

Slide 56: Grid-Based Digital Libraries
– Large-scale distributed storage requirements and technologies
– Organizing distributed digital collections
– Shared metadata: standards and requirements
– Managing distributed digital collections
– Security and access control
– Collection replication and backup
– Distributed information retrieval issues and algorithms

Slide 57: Grid IR Issues
– We want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
– Very large-scale distribution of resources is a challenge for sub-second retrieval
– Unlike most other typical Grid processes, IR is potentially less computing-intensive and more data-intensive
– In many ways Grid IR replicates the process (and problems) of metasearch or distributed search

Slide 58: Introduction
Cheshire history:
– Developed originally at UC Berkeley
– A solution for library data (C1), then SGML (C2), then XML
– Monolithic applications for indexing and a retrieval server, in C plus TCL scripting
Cheshire3:
– Developed at Liverpool, plus Berkeley
– XML, Unicode, Grid scalable: standards based
– Object-oriented framework
– Easy to develop and extend in Python

Slide 59: Introduction
Today:
– Version 0.9.4
– Mostly stable, but needs thorough QA and docs
– Grid, NLP, and classification algorithms integrated
Near future:
– June: Version 1.0 (further DM/TM integration, docs, unit tests, stability)
– December: Version 1.1 (Grid out-of-the-box, configuration GUI)

Slide 60: Context
Environmental requirements:
– Very large scale information systems: terabyte scale (data grid); computationally expensive processes (computational grid); digital preservation; analysis of data, not just retrieval (data/text mining)
– Ease of extensibility and customizability (Python)
– Open source; integrate, not re-implement
– "Web 2.0": interactivity and dynamic interfaces

Slide 61: Context
(architecture diagram spanning three layers; queries, results, and documents flow between them)
– Data Grid layer: SRB, iRODS; process management via Kepler and iRODS rules
– Digital Library layer: Cheshire3 search/retrieve and index/store; document parsers (Multivalent, …); text mining tools for natural language processing and information extraction (Tsujii Labs, …); data mining tools for classification and clustering (Orange, Weka, …); term management (Termine, WordNet, …); process management (Kepler)
– Application layer: user interfaces via web browser, Multivalent, or a dedicated client; Apache + mod_python + Cheshire3 protocol handler; MySRB, PAWN

Slide 62: Cheshire3 Object Model
(object diagram; components include) Server, Database, Document Group, the ingest process, Document, PreParser, Parser, Record, Transformer, Extracter, Normaliser, Index, Terms, Query, ResultSet, Protocol Handler, User, and the storage classes RecordStore, DocumentStore, IndexStore, UserStore, and ConfigStore.

Slide 63: Object Configuration
– One XML 'record' per non-data object
– Very simple base schema, with extensions as needed
– Identifiers for objects are unique within a context (e.g., unique at the individual database level, but not necessarily between all databases)
– Allows workflows to reference objects by identifier but act appropriately within different contexts
– Allows multiple administrators to define objects without reference to each other

Slide 64: Grid
– Focus on ingest, not discovery (yet)
– Instantiate the architecture on every node
– Assign one node as master, the rest as slaves; the master then divides the processing as appropriate. Calls between slaves are possible
– Calls are as small and simple as possible: (objectIdentifier, functionName, *arguments); typically ('workflow-id', 'process', 'document-id')
A dispatch sketch follows below.
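
A sketch of the small call convention above: a node receives an (objectIdentifier, functionName, *arguments) tuple and dispatches it to the named object. The registry and the Workflow class are illustrative assumptions, not Cheshire3's actual classes.

```python
registry = {}   # objectIdentifier -> instantiated object on this node

class Workflow:
    """Stand-in for a configured workflow object."""
    def process(self, document_id):
        print(f"processing {document_id}")

registry["workflow-id"] = Workflow()

def handle_call(message):
    object_id, function_name, *args = message
    obj = registry[object_id]                  # look up the object by identifier
    return getattr(obj, function_name)(*args)  # invoke the named function

# the "typical" call from the slide
handle_call(("workflow-id", "process", "document-id"))
```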

Slide 65: Grid Architecture
(diagram) A master task sends (workflow, process, document) calls to slave tasks 1…N. Each slave fetches the document from the data grid (GPFS), processes it, and writes the extracted data to temporary storage.

Slide 66: Grid Architecture – Phase 2
(diagram) The master task sends (index, load) calls to the slave tasks, which fetch the extracted data from temporary storage and store the index in the data grid (GPFS).

Slide 67: Workflow Objects
– Written as XML within the configuration record
– Rewritten and compiled to Python code on object instantiation
– Current instructions: object, assign, fork, for-each, break/continue, try/except/raise, return, log (= send text to the default logger object)
– Yes, there is no 'if'!
A hedged sketch of the compile-on-instantiation idea follows below.
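
A hedged sketch of the idea above: a workflow held as data is rewritten into Python source and compiled once, at instantiation, rather than interpreted step by step. The instruction format here is invented for illustration and is not Cheshire3's actual XML schema.

```python
# a workflow as data: (instruction, expression) pairs standing in for XML
instructions = [
    ("log", "'Loaded Record: ' + str(input)"),   # the 'log' instruction
    ("return", "input"),                          # the 'return' instruction
]

def compile_workflow(instructions):
    lines = ["def workflow(input):"]
    for op, expr in instructions:
        if op == "log":
            lines.append(f"    print({expr})")    # send text to the logger
        elif op == "return":
            lines.append(f"    return {expr}")
    namespace = {}
    exec("\n".join(lines), namespace)   # rewrite + compile happens once here
    return namespace["workflow"]

wf = compile_workflow(instructions)
wf("record-1")   # prints: Loaded Record: record-1
```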

Slide 68: Workflow Example
(the XML configuration example did not survive the transcript; the recoverable fragments reference workflow.SimpleWorkflow, an “Unparsable Record” error branch, and a log message of ”Loaded Record:” + input.id)

Slide 69: Text Mining
Integration of natural language processing tools, including:
– Part-of-speech taggers (noun, verb, adjective, …)
– Phrase extraction
– Deep parsing (subject, verb, object, preposition, …)
– Linguistic stemming (is/be fairy/fairy vs. is/is fairy/fairi)
Planned: information extraction tools

Slide 70: Data Mining
– Integration of toolkits is difficult unless they support sparse vectors as input: text is high-dimensional, but has lots of zeroes
– Focus on automatic classification for predefined categories rather than clustering
– Algorithms integrated/implemented: Perceptron and neural network (pure Python); Naïve Bayes (pure Python); SVM (libsvm integrated with a Python wrapper); classification association rule mining (Java)
A pure-Python Naïve Bayes sketch follows below.
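
A tiny pure-Python Naïve Bayes text classifier in the spirit of the slide's "pure python" implementations. The training data and the add-one smoothing choice are illustrative assumptions, not the Cheshire3 code.

```python
import math
from collections import Counter

class NaiveBayes:
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)                       # class frequencies
        self.term_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for text, label in zip(docs, labels):
            for term in text.lower().split():
                self.term_counts[label][term] += 1
                self.vocab.add(term)

    def predict(self, text):
        scores = {}
        for c in self.classes:
            total = sum(self.term_counts[c].values())
            score = math.log(self.priors[c])
            for term in text.lower().split():
                # add-one smoothing over the vocabulary
                p = (self.term_counts[c][term] + 1) / (total + len(self.vocab))
                score += math.log(p)
            scores[c] = score
        return max(scores, key=scores.get)

nb = NaiveBayes()
nb.fit(["protein binding assay", "grid cluster node"], ["bio", "computing"])
print(nb.predict("cluster computing node"))   # -> "computing"
```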

Slide 71: Data Mining
– Modelled as a multi-stage PreParser object (training phase, prediction phase)
– Plus the need for an AccumulatingDocumentFactory to merge document vectors together into a single output for training some algorithms (e.g., SVM)
– The prediction phase attaches metadata (the predicted class) to the document object, which can be stored in a DocumentStore
– Document vectors are generated per index per document, so integrated NLP document normalization comes for free

Slide 72: Data Mining + Text Mining
– Testing the integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies
– A computational grid is used for distributing the expensive NLP analysis
– Results show better accuracy with fewer attributes (results chart not captured in the transcript)

Slide 73: Applications (1)
Automated Collection Strength Analysis
Primary aim: test whether data mining techniques could be used to develop a coverage map of items available in the London libraries.
The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic-level metadata records. This involved very large scale processing of records to:
– Deduplicate millions of records
– Enrich deduplicated records against a database of 45 million
– Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

Slide 74: Applications (1)
Data mining enhances collection mapping strategies by making a larger proportion of the data usable and by discovering hidden relationships between textual subjects and hierarchically based classification systems.
The graph (not captured in the transcript) compares the numbers of books classified in the domain of Psychology originally and after enhancement using data mining.

Slide 75: Applications (2)
Assessing the Grade Level of NSDL Education Material
– The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines at all grade levels. These are harvested into the SRB data grid.
– Working with SDSC, we assessed grade-level relevance by examining the vocabulary used in the material present at each registered URL. We determined the vocabulary-based grade level with the Flesch-Kincaid grade level assessment (a sketch follows below).
– The domain of each website was then determined using data mining techniques (a TF-IDF derived fast domain classifier).
– This processing was done on the TeraGrid cluster at SDSC.
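
A sketch of the Flesch-Kincaid grade-level assessment mentioned above, using the standard published formula: 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. The vowel-run syllable counter is a crude heuristic and an illustrative assumption, not what the project used.

```python
import re

def count_syllables(word):
    # rough heuristic: runs of vowels approximate syllables
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words)
            - 15.59)

print(flesch_kincaid_grade("The cat sat on the mat. It was happy."))
```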

Slide 76: Cheshire3 Grid Tests
– Running on a 30-processor cluster in Liverpool using PVM (parallel virtual machine)
– Using 16 processors with one “master” and 22 “slave” processes, we were able to parse and index MARC data at about 13,000 records per second
– On a similar setup, 610 MB of TEI data can be parsed and indexed in seconds

Slide 77: SRB and SDSC Experiments
– We are working with SDSC to include SRB support
– We plan to continue working with SDSC and to run further evaluations using the TeraGrid server(s) through a “small” grant for 30,000 CPU hours
  – SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GB) of physical memory per node. The cluster runs SuSE Linux and uses Myricom's Myrinet cluster interconnect network.
– Planned large-scale test collections include NSDL, the NARA repository, CiteSeer, and the “million books” collections of the Internet Archive

Slide 78: Conclusions
– Scalable grid-based digital library services can be created that support very large collections with improved efficiency
– The Cheshire3 IR and DL architecture can provide Grid (or single-processor) services for next-generation DLs
– Available as open source via: http://cheshire3.sourceforge.net or http://www.cheshire3.org/

