Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.

Similar presentations


Presentation on theme: "Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL."— Presentation transcript:

1 Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL

2 2 What is Information Retrieval? Google ? Yahoo ? MSN search? What goes on behind search engine??? How do they work??????

3 3..Many Question What makes a system like Google search trick? How does it gather information? What trick does it use? Extending beyond the Web How can those approaches be made better? Natural language understanding? User interactions?

4 4..Many Question How can we do to make things work quickly? Faster computer? Caching? Compression? How do we decide it works well? For all queries? For special types of queries? On every collection of information?

5 5 Motivation IR: representation, storage, organization of, and access to information items Focus is on the user information need User information need: Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament. Emphasis is on the retrieval of information (not data)

6 6 Motivation Data retrieval which documents contain a set of keywords? Well defined semantics a single erroneous object implies failure! Information retrieval information about a subject or topic semantics is frequently loose small errors are tolerated IR system: interpret contents of information items generate a ranking which reflects relevance notion of relevance is most important

7 7 Motivation IR at the center of the stage IR in the last 20 years: classification and categorization systems and languages user interfaces and visualization Advent of the Web changed this perception once and for all universal repository of knowledge free (low cost) universal access no central editorial board many problems though: IR seen as key to finding the solutions!

8 8 Information Retrieval The indexing and retrieval of textual documents. Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

9 9 Typical IR Task Given: A corpus of textual natural-language documents. A user query in the form of a textual string. Find: A ranked set of documents that are relevant to the query.

10 10 Relevance Relevance is a subjective judgment and may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).

11 11 Relevance Much of IR depends upon idea that Similar vocabulary -> relevant to same queries Usually look for documents matching query words “Similar” can be measured in many ways String matching/comparison Same vocabulary used Probability that documents arise from same model Same meaning of text

12 12 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

13 13 Problems with Keywords May not retrieve relevant documents that include synonymous terms. “restaurant” vs. “café” May retrieve irrelevant documents that include ambiguous terms. “bat” (baseball vs. mammal) “Apple” (company vs. fruit) “bit” (unit of data vs. act of eating)

14 14 Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.

15 15 IR Basic Concepts The User Task Retrieval information or data purposeful Browsing glancing around F1; cars, Le Mans, France, tourism Retrieval Browsing Database

16 16 IR Basic Concepts Logical view of documents Document representation viewed as a continuum: logical view of documents might shift Docs structure Accents spacing stopwords Noun groups stemming Manual indexing structureFull textIndex terms

17 17 IR System Architecture User Interface Text Operations Query Operations Indexing Searching Ranking Index Text query user need user feedback ranked docs retrieved docs logical view inverted file DB Manager Module Text Database Text

18 18 IR System Components Text: Operations forms index words (tokens). Stopword removal Stemming Indexing: constructs an inverted index of word to document pointers. Searching: retrieves documents that contain a given query token from the inverted index. Ranking: scores all retrieved documents according to a relevance metric.

19 19 IR System Components User Interface: manages interaction with the user: Query input and document output. Relevance feedback. Visualization of results. Query Operations: transform the query to improve retrieval: Query expansion using a thesaurus. Query transformation using relevance feedback.

20 20 Web Search Application of IR to HTML documents on the World Wide Web. Differences : Must assemble document corpus by spidering the web. Can exploit the structural layout information in HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.

21 21 Web Search System Query String IR System Ranked Documents 1. Page1 2. Page2 3. Page3. Document corpus Web Spider

22 22 History of IR 1960-70’s: Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vector-space models of retrieval. Prof. Salton and his students at Cornell University are the leading researchers in the area.

23 23 History of IR 1980’s: Large document database systems, many run by companies: Lexis-Nexis Dialog MEDLINE

24 24 History of IR 1990’s: Searching FTP able documents on the Internet Archie WAIS Searching the World Wide Web Yahoo Altavista

25 25 History of IR 2000’s & continued : Link analysis for Web Search Google Multimedia IR Image Video Audio and music Document Summarization


Download ppt "Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL."

Similar presentations


Ads by Google