Web Search Engines – Browsing Services Search Engine Services Web Pages Bag of Words Two semantics extremes Two service extremes ???

Presentation transcript:

Web ……

version: 1.0 // version number
url: URL
origin: original URL
date: Tue, 15 Apr :13:06 GMT // time of harvest
ip: // IP address
unzip-length: // if included, the data part is compressed
length: 18133 // data length
// a blank line
XXXXXXXX // the data part starts here
XXXXXXXX
….
XXXXXXXX // data end
// insert a new line
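A record in this layout could be read roughly as follows. This is only a sketch under assumptions the slide does not confirm: the `//` remarks are taken to be annotations rather than part of the stored file, lines are assumed to end in `\n`, and the function name `parse_record` and the sample record are hypothetical. Decompression for records carrying `unzip-length` is left out.

```python
def parse_record(raw: bytes):
    """Split one harvested-page record into (header dict, data bytes)."""
    # Assumption: a blank line separates the header fields from the data part.
    head, sep, rest = raw.partition(b"\n\n")
    if not sep:
        raise ValueError("no blank line after header")
    header = {}
    for line in head.decode("ascii", errors="replace").splitlines():
        # Split on the first colon only, so URLs keep their own colons.
        key, _, value = line.partition(":")
        header[key.strip()] = value.strip()
    length = int(header["length"])  # declared size of the data part in bytes
    data = rest[:length]
    if len(data) != length:
        raise ValueError("record truncated")
    return header, data


# Hypothetical sample record, much shorter than a real harvested page.
sample = b"version: 1.0\nurl: http://example.org/\nlength: 5\n\nhello\n"
header, data = parse_record(sample)
print(header["url"], data)
```

Reading exactly `length` bytes, rather than searching for an end marker, keeps the parser safe when the page content itself contains blank lines.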

File Organizations (Indexes)
Choices for accessing data during query evaluation:
Scan the entire collection
–Typical in early (batch) retrieval systems
–Computational and I/O costs are O(characters in collection)
–Practical only for small text collections
–Large memory systems make scanning feasible
Use indexes for direct access
–Evaluation time O(query term occurrences in collection)
–Practical for large collections
–Many opportunities for optimization
Hybrids: use a small index, then scan a subset of the collection
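The hybrid option can be illustrated with a small sketch. The toy documents, the `BLOCK_SIZE` value and the function names are all hypothetical; the idea shown is a coarse index that maps each term to blocks of documents, followed by a scan of only those blocks.

```python
from collections import defaultdict

# Hypothetical toy collection.
docs = [
    "inverted files map terms to documents",
    "signature files use hashing",
    "bitmaps store one bit per document",
    "scanning reads every document",
]
BLOCK_SIZE = 2  # assumed number of documents per block


def scan(term, doc_ids):
    """Plain scan: every listed document is read and tokenized."""
    return [i for i in doc_ids if term in docs[i].split()]


# Small index: term -> set of block numbers (coarser than term -> documents).
block_index = defaultdict(set)
for i, text in enumerate(docs):
    for term in text.split():
        block_index[term].add(i // BLOCK_SIZE)


def hybrid(term):
    """Use the small index to pick candidate blocks, then scan only those."""
    candidates = []
    for b in sorted(block_index.get(term, ())):
        start = b * BLOCK_SIZE
        candidates.extend(range(start, min(start + BLOCK_SIZE, len(docs))))
    return scan(term, candidates)


print(scan("files", range(len(docs))))  # scans all four documents
print(hybrid("document"))               # scans only the second block
```

The block index is smaller than a full inverted file because it stores block numbers instead of document ids, at the price of a short scan at query time.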

Indexes
What should the index contain?
Database systems index primary and secondary keys
–This is the hybrid approach
–Index provides fast access to a subset of database records
–Scan subset to find solution set
IR Problem: Cannot predict keys that people will use in queries
–Every word in a document is a potential search term
IR Solution: Index by all keys (words), i.e. full-text indexes

Index Contents
The contents depend upon the retrieval model
Feature presence/absence
–Boolean
–Statistical (tf, df, ctf, doclen, maxtf)
–Often about 10% the size of the raw data, compressed
Positional
–Feature location within document
–Granularities include word, sentence, paragraph, etc.
–Coarse granularities are less precise, but take less space
–Word-level granularity about 20-30% the size of the raw data, compressed
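The statistical features named above can be computed in a few lines. This is a sketch over a hypothetical two-document collection, reading the abbreviations as they are commonly used: tf (term frequency in a document), df (number of documents containing the term), ctf (term frequency over the whole collection), doclen (document length in tokens) and maxtf (largest tf within a document).

```python
from collections import Counter

# Hypothetical toy collection, whitespace-tokenized.
docs = {
    "d1": "pease porridge hot pease porridge cold",
    "d2": "pease porridge in the pot",
}

# tf: term frequency within each document.
tf = {d: Counter(text.split()) for d, text in docs.items()}
# doclen: number of tokens in each document.
doclen = {d: sum(c.values()) for d, c in tf.items()}
# maxtf: frequency of the most frequent term in each document.
maxtf = {d: max(c.values()) for d, c in tf.items()}
# df: number of documents in which each term occurs.
df = Counter(t for c in tf.values() for t in c)
# ctf: total occurrences of each term across the collection.
ctf = Counter()
for c in tf.values():
    ctf.update(c)

print(tf["d1"]["porridge"], df["porridge"], ctf["porridge"])
print(doclen, maxtf)
```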

Indexes: Implementation
Common implementations of indexes
–Bitmaps
–Signature files
–Inverted files
Common index components
–Dictionary (lexicon)
–Postings: document ids, word positions
No positional data indexed
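As an illustration of the two components, the sketch below keeps the dictionary as a Python mapping whose values are postings with document ids and word positions. The toy documents and the helper names `build_index` and `phrase_match` are hypothetical; a production index would compress the postings and keep them on disk.

```python
from collections import defaultdict


def build_index(docs):
    """Return {term: {doc_id: [word positions]}} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, first, second):
    """Doc ids in which `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical toy collection.
docs = {
    1: "pease porridge hot",
    2: "pease porridge in the pot",
    3: "some like it in the pot",
}
index = build_index(docs)
print(index["pot"])
print(phrase_match(index, "the", "pot"))
```

Word positions are what make adjacency tests such as `phrase_match` possible; an index holding only document ids could answer Boolean queries but not phrase queries.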

Inverted Files

Word-Level Inverted File

Inverted Search Algorithm
1. Find query elements (terms) in the lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the retrieval model
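The three steps can be sketched for the Boolean AND case as follows. The six-line toy collection and the function names are assumptions made for illustration; other retrieval models would replace the intersection in step 3 with their own manipulation of the postings.

```python
def build(docs):
    """Lexicon mapping each term to a sorted postings list of doc ids."""
    lexicon = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            lexicon.setdefault(term, []).append(doc_id)
    return lexicon


def intersect(a, b):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search_and(lexicon, query):
    postings = []
    for term in query.lower().split():
        if term not in lexicon:          # step 1: look the term up in the lexicon
            return []
        postings.append(lexicon[term])   # step 2: retrieve its postings
    if not postings:
        return []
    postings.sort(key=len)               # start with the shortest list
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)    # step 3: Boolean AND = intersection
    return result


# Hypothetical toy collection.
docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}
lexicon = build(docs)
print(search_and(lexicon, "porridge pot"))  # prints [2]
```

Processing the shortest postings list first keeps the intermediate result small, which is a common ordering for conjunctive queries.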

Word-Level Inverted File
Query:
1. porridge & pot (BOOL)
2. porridge pot (BOOL)
3. porridge pot (VSM)
[Figure: lexicon and postings]

A Brief History of Modern Information Retrieval
In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
The 1960s brought the SMART system, built by Gerard Salton and his students, and the Cranfield evaluations, carried out by Cyril Cleverdon.
The 1970s and 1980s saw many developments built on the advances of the 1960s.
In 1992, the Text REtrieval Conference (TREC) was launched.
From 1996, the algorithms developed in IR were employed for searching the Web.

Clustering of SIGIR papers by topic vs. year
[Figure; topic labels: Question answering; Clustering; Inverted files & Implementations; Message understanding & TDT; Filtering; Hypertext IR, Multiple evidence; Probabilistic & Language models; Distributed IR; Evaluation; Topic distillation & Linkage retrieval; Text categorisation; Document summarisation; Cross lingual]

CIIR, University of Massachusetts
LTI, Carnegie Mellon University
The Stanford University DB Group
Microsoft Research Asia
TREC

Lemur

Lemur Toolkit
LM and IR research system
–ad hoc retrieval, distributed retrieval, cross-language IR, summarization, filtering, and classification
–Simple Language Model
–Language Model
–C and C++
–Unix / Windows
–Current Version 3.1

MRA: Towards Next Generation Web Search
From Pages to Blocks
–Analyze the Web at finer granularity
From Surface Web to Deep Web
–Unleash the huge assets of high-value information
From Unstructured to Structured
–Provide well-organized results
From Relevance to Intelligence
–Contribute knowledge discovery with search
From Desktop Search to Mobile Search
–Bridge physical world search to digital world search

The Stanford Univ. DB Group
WebBase
–Crawling, storage, indexing, and querying of large collections of Web pages
Digital Libraries
–Infrastructure and services for creating, disseminating, sharing and managing information

TREC Conference
Established in 1992 to evaluate large-scale IR
–Retrieving documents from a gigabyte collection
Has run continuously since then
–TREC 2004 (13th) meeting is in November
Run by NIST's Information Access Division
Probably the best-known IR evaluation setting
–Started with 25 participating organizations in the 1992 evaluation
–In 2003, there were 93 groups from 22 different countries
Proceedings available on-line ( )
–Overview of TREC 2003 at f

TREC General Format
TREC consists of IR research tracks
–Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, …
Each track works on roughly the same model
–November: track approved by TREC community
–Winter: track members finalize format for track
–Spring: researchers train systems based on specification
–Summer: researchers carry out formal evaluation
  Usually a blind evaluation: researchers do not know the answers
–Fall: NIST carries out evaluation
–November: group meeting (TREC) to find out:
  How well your site did
  How others tackled the problem
–Many tracks are run by volunteers outside of NIST (e.g. Web)
"Coopetition" model of evaluation
–Successful approaches generally adopted in next cycle

TREC Tracks

Summary of VLC/Web Track evaluation

Tianwang

CWT100g !

/8.8 = 28.4%

TEAMNAME TD-RUNS NPHP-RUNS
APEX 5 5
ANS 3 2
TRS 5 2
MUMIAN1 3 1
MUMIAN2 2 1
SCUTDB 5 5
WLL 1
pooling google, yisou, baidu, sogou, zhongsou SE

TIANWANG_RUN


Vector Space Model
Document d, query q, m terms
TFIDF weighting: (tf, idf)
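A minimal sketch of vector-space ranking follows. It assumes one common weighting, tf × log(N/df), with cosine similarity between the query and document vectors; the slide does not say which TFIDF variant is meant, and the six-line toy collection and function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse tf-idf vector per document, plus the idf table."""
    tfs = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(t for tf in tfs.values() for t in tf)
    n = len(docs)
    idf = {t: math.log(n / df[t]) for t in df}  # assumed idf variant
    vecs = {d: {t: f * idf[t] for t, f in tf.items()} for d, tf in tfs.items()}
    return vecs, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(docs, query):
    """Return (doc_id, score) pairs with score > 0, best first."""
    vecs, idf = tfidf_vectors(docs)
    q_tf = Counter(query.lower().split())
    q = {t: f * idf[t] for t, f in q_tf.items() if t in idf}
    scored = [(d, cosine(q, v)) for d, v in vecs.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)


# Hypothetical toy collection.
docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}
ranking = rank(docs, "porridge pot")
print([d for d, _ in ranking])  # prints [2, 1, 5]
```

Unlike a Boolean AND, the ranking also returns documents that match only one query term, ordered by how much of their weight lies on the query terms.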

Query Answer
1. porridge & pot (BOOL)
–d2
2. porridge pot (BOOL)
–null
3. porridge pot (VSM)
–d2 > d1 > d5

CIIR: Center for Intelligent Information Retrieval
One of the leading research groups in IR
–improving the probabilistic models
–first description of a retrieval system based on statistical language models
–introduced and improved a number of techniques for text and query representation
–automatically representing databases and combining local searches for distributed IR (DIR)
–first high-capacity probabilistic filtering architecture
–defined and evaluated the first versions of event detection and tracking software
–earliest research on ranking and representation techniques for Asian languages
–first approaches to information extraction that emphasized learning
–novel techniques for indexing images and video

CIIR cont.
Research
–more than 500 journal and refereed conference papers over the past 12 years (52 submissions in 2003)
Industrial and government collaboration
–INQUERY
–licensed our software to nearly 300 sites
Education
–20 Ph.D.s, 29 M.S.
–123/145, 34/4 graduate/undergraduate

CIIR cont.
Personnel
–Faculty: 4 (W. Bruce Croft)
–Technical personnel: 10
–Graduate students: 34/10
Groups
–IESL: Information Extraction and Synthesis Laboratory
–IR: Information Retrieval Laboratory
–MIR: Multimedia Indexing and Retrieval Laboratory
The CIIR is currently concentrating on the unsolved long-term research problems that underlie effective information retrieval
–text representation
–query acquisition
–retrieval models

LTI: Language Technologies Institute
Machine Translation, Natural Language Processing, Speech, and Information Retrieval
IR Projects (Jamie Callan and Yiming Yang)
–Adaptive Information Filtering
–Distributed Information Retrieval / Federated Search
–Classification and Prioritization
–Minerva: Web Mining for Question Answering
–MuchMore: Translingual Information Retrieval
–JAVELIN: Open-Domain Question Answering