2013.04.24 - SLIDE 1IS 240 – Spring 2013 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

Slides:



Advertisements
Similar presentations
A Workflow Engine with Multi-Level Parallelism Supports Qifeng Huang and Yan Huang School of Computer Science Cardiff University
Advertisements

Classification & Your Intranet: From Chaos to Control Susan Stearns Inmagic, Inc. E-Libraries E204 May, 2003.
SLIDE 1FIST Shanghai Digging Into Data: Data Mining for Information Access Ray R. Larson University of California, Berkeley Paul Watry.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
Information Retrieval in Practice
SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000.
A metadata-based approach Marti Hearst Associate Professor BT Visit August 18, 2005.
SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
Tools and Services for the Long Term Preservation and Access of Digital Archives Joseph JaJa, Mike Smorul, and Sangchul Song Institute for Advanced Computer.
Recommender systems Ram Akella February 23, 2011 Lecture 6b, i290 & 280I University of California at Berkeley Silicon Valley Center/SC.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
PAWN: A Novel Ingestion Workflow Technology for Digital Preservation Mike Smorul, Joseph JaJa, Yang Wang, and Fritz McCall.
SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday.
Overview of Search Engines
SLIDE 1ISGC Taipei, Taiwan Grid-based Search and Data Mining Using Cheshire3 In collaboration with Robert Sanderson University of Liverpool.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Introduction to Parallel Programming MapReduce Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under.
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
Aurora: A Conceptual Model for Web-content Adaptation to Support the Universal Accessibility of Web-based Services Anita W. Huang, Neel Sundaresan Presented.
Search Engines and Information Retrieval Chapter 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
Using SRB and iRODS with the Cheshire3 Information Framework Building Data Grids with iRODS May, 2008 National e-Science Centre Edinburgh Dr Robert.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Apache OODT Yang Li Mar 9, What is OODT Object Oriented Data Technology Science data management Archiving Systems that span scientific.
SLIDE 1IS 240 – Spring 2013 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
1 Schema Registries Steven Hughes, Lou Reich, Dan Crichton NASA 21 October 2015.
Design of a Search Engine for Metadata Search Based on Metalogy Ing-Xiang Chen, Che-Min Chen,and Cheng-Zen Yang Dept. of Computer Engineering and Science.
SLIDE 1DID Meeting - Montreal Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Ray R. Larson University of California,
ICDL 2004 Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer Science Old Dominion University.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
CNI, 3rd April 2006 Slide 1 UK National Centre for Text Mining: Activities and Plans Dr. Robert Sanderson Dept. of Computer Science University of Liverpool.
SLIDE 1INFOSCALE Hong Kong Integrating Data Mining and Data Management Technologies for Scholarly Inquiry Paul Watry Richard Marciano.
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
Automatic Metadata Discovery from Non-cooperative Digital Libraries By Ron Shi, Kurt Maly, Mohammad Zubair IADIS International Conference May 2003.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
Feb 24-27, 2004ICDL 2004, New Dehli Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer.
1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.
Developing GRID Applications GRACE Project
SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.
September 2003, 7 th EDG Conference, Heidelberg – Roberta Faggian, CERN/IT CERN – European Organization for Nuclear Research The GRACE Project GRid enabled.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
1 Web Search Engines. 2 Search Engine Characteristics  Unedited – anyone can enter content Quality issues; Spam  Varied information types Phone book,
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Presenter: Yue Zhu, Linghan Zhang A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint.
Searching the Web for academic information Ruth Stubbings.
SLIDE 1NaCTeM Launch -Manchester National Center for Text Mining Launch Event Ray R. Larson University of California, Berkeley School of Information.
Data mining in web applications
Information Retrieval in Practice
Big Data is a Big Deal!.
Search Engine Architecture
Map Reduce.
CS6604 Digital Libraries IDEAL Webpages Presented by
Dept. of Computer Science University of Liverpool
Data Mining Chapter 6 Search Engines
Charles Tappert Seidenberg School of CSIS, Pace University
The Search Engine Architecture
Chaitali Gupta, Madhusudhan Govindaraju
Presentation transcript:

SLIDE 1IS 240 – Spring 2013 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture 21: Web Searching

SLIDE 2IS 240 – Spring 2013 Today MiniTREC initial results… Review –Web Search Engines and Algorithms –Web indexes and Ranking – PageRank –Hadoop/Nutch Processing Web Search Processing –Parallel Architectures (Inktomi - Brewer) –Cheshire III Design Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

SLIDE 3IS Spring 2011 Mini-TREC Proposed Schedule –February 11 – Database and previous Queries –March 4 – report on system acquisition and setup –March 6, New Queries for testing… –April 22, Results due –April 24, Results and system rankings –April 29 Group reports and discussion

SLIDE 4 MiniTREC Results 2013 IS 240 – Spring 2013

SLIDE 5 With my runs… IS 240 – Spring 2013

SLIDE 6IS 240 – Spring 2013 Mini-TREC Reports In-Class Presentations April 29 th No class during RRR week Written report due by or before May 15 th (During Final Exam week) – 5-10 pages Content –System description –What approach/modifications were taken? –results of official submissions (see RESULTS) –results of “post-runs” – new runs with results using MINI_TREC_QRELS and trec_eval

SLIDE 7IS 240 – Spring 2013 Final Term Paper Should be about 8-15 pages on: – some area of IR research (or practice) that you are interested in and want to study further –Experimental tests of systems or IR algorithms –Build an IR system, test it, and describe the system and its performance Due May 15 th (Final exam Week)

SLIDE 8IS 240 – Spring 2013 Today Review –Web Search Engines and Algorithms –Web indexes and Ranking – PageRank –Hadoop/Nutch Processing Web Search Processing –Parallel Architectures (Inktomi - Brewer) –Cheshire III Design Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

SLIDE 9IS 240 – Spring 2013 Web Search Engine Layers From description of the FAST search engine, by Knut Risvik

SLIDE 10IS 240 – Spring 2013 Standard Web Search Engine Architecture crawl the web create an inverted index Check for duplicates, store the documents Inverted index Search engine servers user query Show results To user DocIds

SLIDE 11IS 240 – Spring 2013 More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.

SLIDE 12IS 240 – Spring 2013 Indexes for Web Search Engines Inverted indexes are still used, even though the web is so huge Most current web search systems partition the indexes across different machines –Each machine handles different parts of the data (Google uses thousands of PC-class processors and keeps most things in main memory) Other systems duplicate the data across many machines –Queries are distributed among the machines Most do a combination of these

SLIDE 13IS 240 – Spring 2013 Search Engine Querying In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries. Each row can handle 120 queries per second Each column can handle 7M pages To handle more queries, add another row. From description of the FAST search engine, by Knut Risvik

SLIDE 14IS 240 – Spring 2013 Querying: Cascading Allocation of CPUs A variation on this that produces a cost- savings: –Put high-quality/common pages on many machines –Put lower quality/less common pages on fewer machines –Query goes to high quality machines first –If no hits found there, go to other machines

SLIDE 15IS 240 – Spring 2013 Today Review –Web Search Engines and Algorithms –Web indexes and Ranking – PageRank –Hadoop/Nutch Processing Web Search Processing –Parallel Architectures (Inktomi - Brewer) –Cheshire III Design Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

SLIDE 16IS 240 – Spring 2013 Ranking: Link Analysis Why does this work? –The official Toyota site will be linked to by lots of other official (or high-quality) sites –The best Toyota fan-club site probably also has many links pointing to it –Less high-quality sites do not have as many high-quality sites linking to them

SLIDE 17IS 240 – Spring 2013 Ranking: PageRank Google uses the PageRank for ranking docs: We assume page p i has pages p j...p N which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. d is usually set to L(p i ) is defined as the number of links going out of page p i. The PageRank (PR) of a page p i is given as follows: Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one

SLIDE 18IS 240 – Spring 2013 PageRank (from Wikipedia)

SLIDE 19IS 240 – Spring 2013 Today Review –Web Search Engines and Algorithms –Web indexes and Ranking – PageRank –Hadoop/Nutch Processing Web Search Processing –Parallel Architectures (Inktomi - Brewer) –Cheshire III Design Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

SLIDE 20 MapReduce for search processing As we have heard from our guest speakers, Mapreduce is widely used for search processing ANNOUNCEMENT Kroeber from 9:30 to 11 this Thursday (tommorrow) Doug Cutting – the developer of Hadoop, Lucene and Nutch will talk on Big Data As a short review… IS 240 – Spring 2013

SLIDE 21 MapReduce in Hadoop (1)

SLIDE 22 MapReduce in Hadoop (2)

SLIDE 23 MapReduce in Hadoop (3)

SLIDE 24 Data Flow in a MapReduce Program in Hadoop InputFormat Map function Partitioner Sorting & Merging Combiner Shuffling Merging Reduce function OutputFormat  1:many

SLIDE 25

SLIDE 26IS 240 – Spring 2013

SLIDE 27IS 240 – Spring 2013

SLIDE 28IS 240 – Spring 2013

SLIDE 29IS 240 – Spring 2013

SLIDE 30IS 240 – Spring 2013

SLIDE 31IS 240 – Spring 2013

SLIDE 32IS 240 – Spring 2013

SLIDE 33IS 240 – Spring 2013

SLIDE 34IS 240 – Spring 2013

SLIDE 35IS 240 – Spring 2013

SLIDE 36IS 240 – Spring 2013

SLIDE 37IS 240 – Spring 2013

SLIDE 38IS 240 – Spring 2013

SLIDE 39IS 240 – Spring 2013

SLIDE 40IS 240 – Spring 2013

SLIDE 41IS 240 – Spring 2013

SLIDE 42IS 240 – Spring 2013

SLIDE 43IS 240 – Spring 2013

SLIDE 44IS 240 – Spring 2013

SLIDE 45IS 240 – Spring 2013

SLIDE 46IS 240 – Spring 2013 Today Review –Web Search Engines and Algorithms –Web indexes and Ranking – PageRank –Hadoop/Nutch Processing Web Search Processing –Parallel Architectures (Inktomi - Brewer) –Cheshire III Design Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

SLIDE 47IS 240 – Spring 2010 Introduction Cheshire History: –Developed at UC Berkeley originally –Solution for library data (C1), then SGML (C2), then XML –Monolithic applications for indexing and retrieval server in C + TCL scripting Cheshire3: –Developed at Liverpool, plus Berkeley –XML, Unicode, Grid scalable: Standards based –Object Oriented Framework –Easy to develop and extend in Python

SLIDE 48IS 240 – Spring 2010 Introduction Today: –Version 1.0 –Mostly stable, but needs thorough QA and docs –Grid, NLP and Classification algorithms integrated

SLIDE 49IS 240 – Spring 2010 Context Environmental Requirements: –Very Large scale information systems Terabyte scale (Data Grid) Computationally expensive processes (Comp. Grid) Digital Preservation Analysis of data, not just retrieval (Data/Text Mining) Ease of Extensibility, Customizability (Python) Open Source Integrate not Re-implement "Web 2.0" – interactivity and dynamic interfaces

SLIDE 50IS 240 – Spring 2010 Context Data Grid Layer Data Grid SRB iRODS Digital Library Layer Application Layer Web Browser Multivalent Dedicated Client User Interface Apache+ Mod_Python+ Cheshire3 Protocol Handler Process Management Kepler Cheshire3 Query Results Query Results ExportParse Document Parsers Multivalent,... Natural Language Processing Information Extraction Text Mining Tools Tsujii Labs,... Classification Clustering Data Mining Tools Orange, Weka,... Query Results Search / Retrieve Index / Store Information System Cheshire3 User Interface MySRB PAWN Process Management Kepler iRODS rules Term Management Termine WordNet... Store

SLIDE 51IS 240 – Spring 2010 Cheshire3 Object Model UserStore User ConfigStore Object Database Query Record Transformer Records Protocol Handler Normaliser IndexStore Terms Server Document Group Ingest Process Documents Index RecordStore Parser Document Query ResultSet DocumentStore Document PreParser Extracter

SLIDE 52IS 240 – Spring 2010 Object Configuration One XML 'record' per non-data object Very simple base schema, with extensions as needed Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases) Allows workflows to reference by identifier but act appropriately within different contexts. Allows multiple administrators to define objects without reference to each other

SLIDE 53IS 240 – Spring 2010 Grid Focus on ingest, not discovery (yet) Instantiate architecture on every node Assign one node as master, rest as slaves. Master then divides the processing as appropriate. Calls between slaves possible Calls as small, simple as possible: (objectIdentifier, functionName, *arguments) Typically: ('workflow-id', 'process', 'document-id')

SLIDE 54IS 240 – Spring 2010 Grid Architecture Master Task Slave Task 1 Slave Task N Data Grid GPFS Temporary Storage (workflow, process, document) fetch document document extracted data

SLIDE 55IS 240 – Spring 2010 Grid Architecture - Phase 2 Master Task Slave Task 1 Slave Task N Data Grid GPFS Temporary Storage (index, load) store index fetch extracted data

SLIDE 56IS 240 – Spring 2010 Workflow Objects Written as XML within the configuration record. Rewrites and compiles to Python code on object instantiation Current instructions: –object –assign –fork –for-each –break/continue –try/except/raise –return –log (= send text to default logger object) Yes, no if!

SLIDE 57IS 240 – Spring 2010 Workflow example workflow.SimpleWorkflow Unparsable Record ”Loaded Record:” + input.id

SLIDE 58IS 240 – Spring 2010 Text Mining Integration of Natural Language Processing tools Including: –Part of Speech taggers (noun, verb, adjective,...) –Phrase Extraction –Deep Parsing (subject, verb, object, preposition,...) –Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi) Planned: Information Extraction tools

SLIDE 59IS 240 – Spring 2010 Data Mining Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes Focus on automatic classification for predefined categories rather than clustering Algorithms integrated/implemented: –Perceptron, Neural Network (pure python) –Naïve Bayes (pure python) –SVM (libsvm integrated with python wrapper) –Classification Association Rule Mining (Java)

SLIDE 60IS 240 – Spring 2010 Data Mining Modelled as multi-stage PreParser object (training phase, prediction phase) Plus need for AccumulatingDocumentFactory to merge document vectors together into single output for training some algorithms (e.g., SVM) Prediction phase attaches metadata (predicted class) to document object, which can be stored in DocumentStore Document vectors generated per index per document, so integrated NLP document normalization for free

SLIDE 61IS 240 – Spring 2010 Data Mining + Text Mining Testing integrated environment with 500,000 medline abstracts, using various NLP tools, classification algorithms, and evaluation strategies. Computational grid for distributing expensive NLP analysis Results show better accuracy with fewer attributes:

SLIDE 62IS 240 – Spring 2010 Applications (1) Automated Collection Strength Analysis Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries. The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records. This involved very large scale processing of records to: –Deduplicate millions of records –Enrich deduplicated records against database of 45 million –Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

SLIDE 63IS 240 – Spring 2010 Applications (1) Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems. The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining

SLIDE 64IS 240 – Spring 2010 Applications (2) Assessing the Grade Level of NSDL Education Material The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid. Working with SDSC we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL. We determined the vocabulary-based grade-level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier). This processing was done on the Teragrid cluster at SDSC.

SLIDE 65IS 240 – Spring 2010 Cheshire3 Grid Tests Running on an 30 processor cluster in Liverpool using PVM (parallel virtual machine) Using 16 processors with one “master” and 22 “slave” processes we were able to parse and index MARC data at about records per second On a similar setup 610 Mb of TEI data can be parsed and indexed in seconds

SLIDE 66IS 240 – Spring 2010 SRB and SDSC Experiments We worked with SDSC to include SRB support We are planning to continue working with SDSC and to run further evaluations using the TeraGrid server(s) through a “small” grant for CPU hours – SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GBs) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network. Planned large-scale test collections include NSDL, the NARA repository, CiteSeer and the “million books” collections of the Internet Archive

SLIDE 67IS 240 – Spring 2010 Conclusions Scalable Grid-Based digital library services can be created and provide support for very large collections with improved efficiency The Cheshire3 IR and DL architecture can provide Grid (or single processor) services for next-generation DLs Available as open source via: