Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source:

Basic ideas  Information overload  The challenging byproduct of the information age  Huge amounts of information available -- how to find what you need when you need it  Think about addresses, messages, files of interesting articles, etc.  Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.  The web is a special case, as we will discuss.

Some distinctions  Data, information, knowledge  How do you distinguish among them?   Information sources  Very well organized, indexed, controlled  Totally unorganized, uncharacterized, uncontrolled  Something in between

Databases  Databases hold specific data items  Organization is explicit  Keys relate items to each other  Queries are constrained, but effective in retrieving the data that is there  Databases generally respond to specific queries with specific results  Browsing is difficult  Searching for items not anticipated by the designers can be difficult

The Web  The Web contains many kinds of elements  Organization?  There are no keys to relate items to each other  Queries are unconstrained; effectiveness depends on the tools used.  Web search tools generally respond to general queries with specific results  Browsing is possible, though somewhat complicated  There are no designers of the overall Web structure.  Describe how you frequently use the web  What works easily?  What has been difficult?

Digital Library  Something in between the very structured database and the unstructured Web.  Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)  Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.  Nature of the collection regulates indexing somewhat.

In all cases  Trying to connect an information user to the specific information wanted.  Concerned with efficiency and effectiveness  Effectiveness - how well did we do?  Efficiency - how well did we use available resources?

Effectiveness  Two measures:  Precision  Of the results returned, what percentage are meaningful to the goal of the query?  Recall  Of the materials available that match the query, what percentage were returned?  Ex. Search returns 590,000 responses and 195 are relevant. How well did we do?  Not enough information.  Did the 590,000 include all relevant responses? If so, recall is perfect.  195/590,000 is not good precision!
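The arithmetic in the example above can be worked through as a minimal sketch. The function names are illustrative, and the recall figure rests on an assumption the slide leaves open: that the collection holds exactly 195 relevant items, all of which were returned.

```python
def precision(relevant_returned: int, total_returned: int) -> float:
    """Fraction of the returned results that are relevant."""
    return relevant_returned / total_returned if total_returned else 0.0


def recall(relevant_returned: int, total_relevant: int) -> float:
    """Fraction of all relevant items in the collection that were returned."""
    return relevant_returned / total_relevant if total_relevant else 0.0


# Numbers from the slide: 590,000 responses, 195 of them relevant.
p = precision(195, 590_000)

# Assumption for illustration only: the collection contains exactly 195
# relevant items, so the search found every one of them.
r = recall(195, 195)

print(f"precision = {p:.6f}")
print(f"recall    = {r:.2f}")
```

Precision can be computed from the result list alone; recall cannot, because it needs the number of relevant items in the whole collection, which is usually unknown for the Web.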

The process  Query entered -> Query interpreted -> Index searched -> Items retrieved -> Results ranked

The Collection  Where does the collection come from?  How is the index created?  Those are important distinguishing characteristics  Inverted Index -- Ordered list of terms related to the collected materials. Each term has an associated pointer to the related material(s). 
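The inverted index described above can be sketched in a few lines. This is a minimal illustration, assuming a tiny made-up collection and whitespace tokenization; the document texts and the name build_inverted_index are hypothetical.

```python
from collections import defaultdict

# Hypothetical three-document collection, keyed by document id.
documents = {
    1: "information retrieval on the web",
    2: "digital library collections",
    3: "web search and retrieval",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Ordered list of terms, each pointing to its related documents.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(documents)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a term is then a single dictionary access, which is why the index, rather than the documents themselves, is searched at query time.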

Crawling the web  Misnomer as the spider or robot does not actually move about the web  Program sends a normal request for the page, just as a browser would.  Retrieve the page and parse it.  Look for anchors -- pointers to other pages. Put them on a list of URLs to visit  Extract key words (possibly all words) to use as index terms related to that page  Take the next URL and do it again  Actually, the crawling and processing are parallel activities
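The crawl loop above can be sketched as follows. To keep the example self-contained it reads pages from a hypothetical in-memory dictionary (PAGES) instead of sending real HTTP requests, and it runs sequentially, whereas the slide notes that real crawling and processing proceed in parallel.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would send a
# normal HTTP request for each page instead of reading this dictionary.
PAGES = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a> retrieval basics',
    "b.html": '<a href="a.html">A</a> inverted index',
    "c.html": "ranking results",
}


class PageParser(HTMLParser):
    """Collect anchor targets and the words found in the page text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start_url):
    """Visit pages breadth-first; return {url: words to use as index terms}."""
    to_visit = deque([start_url])
    seen = {start_url}
    terms_by_page = {}
    while to_visit:
        url = to_visit.popleft()      # take the next URL
        html = PAGES.get(url)         # "request" the page
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)             # parse the page
        parser.close()                # flush any buffered trailing text
        terms_by_page[url] = parser.words
        for link in parser.links:     # queue anchors not seen before
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return terms_by_page


crawled = crawl("a.html")
for url, words in crawled.items():
    print(url, words)
```

The seen set keeps the crawler from revisiting a page when two pages link to each other, as a.html and b.html do here.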

Responding to search queries  Use the query string provided  Form a Boolean query  Join all words with AND? With OR?  Find the related index terms  Return the information available about the pages that correspond to the query terms.  Many variations on how to do this. Usually proprietary to the company.
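The AND versus OR choice above can be illustrated with set operations over an inverted index. This is only a sketch under stated assumptions: the index contents and the function boolean_query are hypothetical, and real search engines use proprietary variations.

```python
# Hypothetical inverted index: term -> set of page ids containing the term.
INDEX = {
    "information": {1, 2, 4},
    "retrieval": {1, 4},
    "web": {2, 3, 4},
}


def boolean_query(query: str, index: dict[str, set[int]], mode: str = "AND") -> set[int]:
    """Join all query words with AND (intersection) or OR (union)."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    if not postings:
        return set()
    if mode == "AND":
        return set.intersection(*postings)
    if mode == "OR":
        return set.union(*postings)
    raise ValueError("mode must be 'AND' or 'OR'")


and_hits = boolean_query("information retrieval web", INDEX, mode="AND")
or_hits = boolean_query("information retrieval web", INDEX, mode="OR")
print(sorted(and_hits))
print(sorted(or_hits))
```

Joining with AND narrows the result to pages containing every word (favoring precision), while OR returns any page containing at least one word (favoring recall).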

Making the connections  Stemming  Making sure that simple variations in word form are recognized as equivalent for the purpose of the search: exercise, exercises, exercised, for example.  Indexing  A keyword or group of selected words  Any word (more general)  How to choose the most relevant terms to use as index elements for a set of documents.  Build an inverted file for the chosen index terms.
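Stemming and index building can be sketched together. The suffix list below is a deliberately crude assumption chosen to handle the slide's example words; it is not a published stemming algorithm, and the sample documents are made up.

```python
# Crude, illustrative suffix list; real stemmers use far more careful rules.
SUFFIXES = ("es", "ed", "s", "e")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stemmed_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Build an inverted file whose index terms are stems, not raw words."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(naive_stem(word), set()).add(doc_id)
    return index


# Hypothetical documents using the three word forms from the slide.
docs = {1: "daily exercise", 2: "exercises help", 3: "she exercised"}
stemmed_index = build_stemmed_index(docs)
print(naive_stem("exercise"), naive_stem("exercises"), naive_stem("exercised"))
print(sorted(stemmed_index[naive_stem("exercise")]))
```

Because the query word is stemmed the same way as the indexed words, a search for any one form retrieves documents containing the others.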

The Vector model
Let
  $N$ be the total number of documents in the collection
  $n_i$ be the number of documents that contain the index term $k_i$
  $\mathrm{freq}(i,j)$ be the raw frequency of $k_i$ within document $d_j$
A normalized tf (term frequency) factor is given by
  $\mathrm{tf}(i,j) = \dfrac{\mathrm{freq}(i,j)}{\max_l \mathrm{freq}(l,j)}$
  where the maximum is computed over all terms $k_l$ that occur within the document $d_j$
The idf (inverse document frequency) factor is computed as
  $\mathrm{idf}(i) = \log(N / n_i)$
  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
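The two factors above can be computed directly. The sketch below assumes a made-up three-document collection, uses the natural logarithm (the slide does not fix a base), and multiplies tf by idf to get a term weight, which is the usual combination but is not stated on the slide.

```python
import math
from collections import Counter

# Hypothetical collection: document id -> list of terms.
docs = {
    "d1": "web search web index".split(),
    "d2": "web library".split(),
    "d3": "library catalog index".split(),
}


def tf(term: str, doc_terms: list[str]) -> float:
    """Raw frequency divided by the frequency of the document's most common term."""
    counts = Counter(doc_terms)
    return counts[term] / max(counts.values())


def idf(term: str, collection: dict[str, list[str]]) -> float:
    """log(N / n_i); returns 0.0 for a term that appears in no document."""
    n_i = sum(1 for terms in collection.values() if term in terms)
    return math.log(len(collection) / n_i) if n_i else 0.0


def weight(term: str, doc_id: str) -> float:
    """Assumed combination: the product tf * idf."""
    return tf(term, docs[doc_id]) * idf(term, docs)


for term in ("web", "search"):
    print(term, round(weight(term, "d1"), 4))
```

In d1 the term "web" has the higher tf, but "search" occurs in fewer documents and so has the higher idf; the product balances the two.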

Anatomy of a web page  Metatags: Information about the page  Primary source of indexing information for a search engine.  Ex. Title. Never mind what has an H1 tag (though that may be considered), what is in the <title> brackets?  Other tags provide information about the page. This is easier for the search engine to use than determining the meaning of the text of the page.  Dealing with the cheaters  False information provided in the web page to make the search engine return this page  False metatags, invisible words (repeated many times), etc.
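Pulling the title and metatags out of a page can be sketched with the standard-library HTML parser. The sample page and the class name MetaExtractor are made up for illustration; how a real search engine weighs these fields is not covered here.

```python
from html.parser import HTMLParser

# Hypothetical page used only for this illustration.
PAGE = """<html><head>
<title>Basics of Information Retrieval</title>
<meta name="description" content="Introductory slides on IR">
<meta name="keywords" content="retrieval, indexing, web">
</head><body><h1>Welcome</h1></body></html>"""


class MetaExtractor(HTMLParser):
    """Record the <title> text and every <meta name=... content=...> pair."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") and attributes.get("content") is not None:
            self.meta[attributes["name"].lower()] = attributes["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


extractor = MetaExtractor()
extractor.feed(PAGE)
extractor.close()
print(extractor.title)
print(extractor.meta)
```

Nothing here checks whether the metatags are truthful, which is exactly the opening the "cheaters" exploit.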

Standard Metatags  The Dublin Core: 15 common items to use in labeling any web document

  Title        Contributor    Source
  Creator      Date           Language
  Subject      Resource type  Relation
  Description  Format         Coverage
  Publisher    Identifier     Rights

Hubs and authorities  Hub points to a lot of other places.  CITIDEL is a hub for computing information  NSDL is a hub for science, technology, engineering and mathematics education.  Authorities are pointed to by a lot of other places.  W3C.org is an authority for information about the web.  When Hub or Authority status is captured, the search can be more accurate.  If several pages match a query, and one is an authority page, it will be ranked higher.  When a hub matches a query, the pages it points to are likely to be relevant.
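One way hub and authority status can be captured is by iterating over the link graph. The slide does not name a method; the sketch below follows the commonly described HITS-style iteration, and the link graph, page names, and iteration count are all hypothetical.

```python
import math

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "auth1": [],
    "auth2": ["auth1"],
}


def hits(links: dict[str, list[str]], iterations: int = 20):
    """Return (hub, authority) score dictionaries after repeated updates."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


hub_scores, auth_scores = hits(LINKS)
print("best hub:", max(hub_scores, key=hub_scores.get))
print("best authority:", max(auth_scores, key=auth_scores.get))
```

The two scores reinforce each other: a page becomes a better hub by pointing to good authorities, and a better authority by being pointed to by good hubs.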

Some Digital Library examples  Between the chaos of the Web and the strict structure of a database, the digital library contains an organized collection.  We saw the digital collection at the Falvey library session.  See also:  NSDL  And the computing component, CITIDEL: citidel.villanova.edu  American Memory

Conclusions  The plan was to introduce the basic concepts of information retrieval in a form accessible to most students, before you have read anything about it.  We will look more deeply at these subjects in the coming weeks.  A word about the pattern for these slides …