
1 1 Information Retrieval, Search, and Mining Introduction

2 2 Course Outline Introduction Information Retrieval –Basic Information Retrieval Models –Indexing, Compression, and Online Search –Evaluation Methods Web Search –Challenges –Link Analysis –Other advanced methods Text Mining –Text Categorization –Text Clustering –Recommendation systems –Information extraction

3 3 References Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. Intelligent Information Retrieval and Web Search. A course by Raymond Mooney, U Texas. 2002. –http://www.cs.utexas.edu/users/mooney/ir-course/ Stanford web search/mining class [Manning, Raghavan] –http://www.stanford.edu/class/cs276b/courseinfo.html Others: –S. Chakrabarti. 2003. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann. –MG = Managing Gigabytes, by Witten, Moffat, and Bell. –MIR = Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto.

4 4 Information Retrieval (IR) The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is the most recent “killer app.” Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

5 5 Typical IR Task Given: –A corpus of textual natural-language documents. –A user query in the form of a textual string. Find: –A ranked set of documents that are relevant to the query.

6 6 IR System [Diagram: a query string and a document corpus are the inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3.]

7 7 Relevance Relevance is a subjective judgment and may include: –Being on the proper subject. –Being timely (recent information). –Being authoritative (from a trusted source). –Satisfying the goals of the user and his/her intended use of the information (information need).

8 8 Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
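The two notions of relevance on this slide can be contrasted in a few lines. This is only a minimal sketch: the function names, the regex tokenizer, and the sample sentence are illustrative assumptions, not part of the course material.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def verbatim_match(query, document):
    """Strict notion: the query string appears as-is (ignoring case)."""
    return query.lower() in document.lower()

def bag_of_words_score(query, document):
    """Looser notion: total occurrences of the query words, in any order."""
    counts = Counter(tokenize(document))
    return sum(counts[word] for word in set(tokenize(query)))

# Made-up sample document for illustration.
doc = "The bat flew out of the cave. A second bat followed the first bat."

print(verbatim_match("bat cave", doc))      # the exact phrase never occurs
print(bag_of_words_score("cave bat", doc))  # word order does not matter
```

The verbatim test rejects the document because the phrase "bat cave" never occurs, while the bag-of-words score still credits it for containing both words.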

9 9 Problems with Keywords May not retrieve relevant documents that include synonymous terms. –“restaurant” vs. “café” –“PRC” vs. “China” May retrieve irrelevant documents that include ambiguous terms. –“bat” (baseball vs. mammal) –“Apple” (company vs. fruit) –“bit” (unit of data vs. act of eating)

10 10 Beyond Keywords We will cover the basics of keyword-based IR, but… We will focus on extensions and recent developments that go beyond keywords. We will cover the basics of building an efficient IR system, but… We will focus on basic capabilities and algorithms rather than systems issues that allow scaling to industrial-size databases.

11 11 Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.

12 12 IR System Architecture [Diagram; component labels: User Interface, User Need, Text, Text Operations, Logical View, Query Operations, Query, User Feedback, Indexing, Database Manager, Inverted file, Index, Text Database, Searching, Retrieved Docs, Ranking, Ranked Docs.]

13 13 IR System Components Text Operations form index words (tokens). –Stopword removal –Stemming Indexing constructs an inverted index of word to document pointers. Searching retrieves documents that contain a given query token from the inverted index. Ranking scores all retrieved documents according to a relevance metric.
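The four components above can be chained into a toy pipeline. The sketch below makes several assumptions that are not in the slides: a tiny hand-picked stopword list, a crude suffix stripper standing in for a real stemmer, and a TF-IDF sum as the relevance metric (the slide does not name one). All function names and the three-document corpus are hypothetical.

```python
import math
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list

def crude_stem(token):
    """Toy suffix stripper; an assumption, not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def text_operations(text):
    """Tokenize, drop stopwords, and stem to produce index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def build_index(corpus):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term in text_operations(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query, num_docs):
    """Searching and ranking: score docs containing any query term."""
    scores = defaultdict(float)
    for term in text_operations(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

corpus = {
    "Doc1": "Indexing and searching the documents",
    "Doc2": "Ranking documents by relevance",
    "Doc3": "Cooking recipes and restaurants",
}
index = build_index(corpus)
ranked = search(index, "searching documents", len(corpus))
print(ranked)
```

Doc1 ranks first because it matches both query terms, Doc2 matches only the more common term, and Doc3 is never retrieved because it shares no index term with the query.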

14 14 IR System Components (continued) User Interface manages interaction with the user: –Query input and document output. –Relevance feedback. –Visualization of results. Query Operations transform the query to improve retrieval: –Query expansion using a thesaurus. –Query transformation using relevance feedback.
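Query expansion with a thesaurus can be sketched as a simple lookup. The thesaurus below is a hypothetical four-entry table built from the synonym pairs on the "Problems with Keywords" slide; a real system would use a much larger resource, and the function name is an assumption.

```python
# Hypothetical miniature thesaurus; entries are lowercase.
THESAURUS = {
    "restaurant": {"café"},
    "café": {"restaurant"},
    "prc": {"china"},
    "china": {"prc"},
}

def expand_query(query):
    """Return the query terms plus thesaurus synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *sorted(THESAURUS.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query("restaurant in China"))
# ['restaurant', 'café', 'in', 'china', 'prc']
```

The expanded term list can then be handed to the searching component, so a document that mentions only "café" is still retrieved for a "restaurant" query.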

15 15 Web Search Application of IR to HTML documents on the World Wide Web. Differences: –Must assemble document corpus by spidering the web. –Can exploit the structural layout information in HTML (XML). –Documents change uncontrollably. –Can exploit the link structure of the web.
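Spidering and the link structure can both be illustrated with the standard library. The sketch below is a simplified assumption of how a spider might work: it crawls breadth-first, takes a `fetch` callable instead of doing real network I/O, and runs against a made-up three-page in-memory "web" (`FAKE_WEB`, with `example.org` URLs). Counting in-links is just one simple way of looking at link structure, not a method taken from the slides.

```python
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def spider(seed, fetch, limit=100):
    """Breadth-first crawl from seed; returns {url: [outgoing links]}."""
    link_graph = {}
    queue = deque([seed])
    while queue and len(link_graph) < limit:
        url = queue.popleft()
        if url in link_graph:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched
        parser = LinkExtractor(url)
        parser.feed(html)
        link_graph[url] = parser.links
        queue.extend(parser.links)
    return link_graph

# Hypothetical in-memory web standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}

graph = spider("http://example.org/", FAKE_WEB.get)
in_links = Counter(target for targets in graph.values() for target in targets)
print(in_links.most_common())
```

Passing `fetch` as a parameter keeps the crawl logic testable; a real spider would also need politeness delays, robots.txt handling, and re-crawling, since documents change uncontrollably.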

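16 16 Web Search System [Diagram: a web spider assembles the document corpus from the Web; the query string and the corpus are the inputs to the IR system, which outputs ranked documents: 1. Page1, 2. Page2, 3. Page3.]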

17 17 Other IR-Related Tasks Automated document categorization Information filtering (spam filtering) Information routing Automated document clustering Recommending information or products Information extraction Information integration Question answering

18 Topics: Text mining “Text mining” is a cover-all marketing term A lot of what we’ve already talked about is actually the bread and butter of text mining: –Text classification, clustering, and retrieval But we will focus in on some of the higher-level text applications: –Extracting document metadata –Topic tracking and new story detection –Cross document entity and event coreference –Text summarization –Question answering

19 Topics: Information extraction Getting semantic information out of textual data –Filling the fields of a database record E.g., looking at an events web page: –What is the name of the event? –What date/time is it? –How much does it cost to attend? Other applications: resumes, health data, … A limited but practical form of natural language understanding
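The events-page example can be made concrete with hand-written patterns. This is a deliberately naive sketch: the page text, the field names, and the regular expressions are all invented for illustration, and real extraction systems have to cope with far messier layouts.

```python
import re

# Made-up page text; the layout and values are illustrative assumptions.
EVENT_PAGE = """
Event: Spring Search Symposium
Date: 2009-04-17 at 14:30
Admission: $25.00
"""

# One hand-written pattern per database field.
PATTERNS = {
    "name": re.compile(r"^Event:\s*(.+)$", re.MULTILINE),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "time": re.compile(r"\b(\d{1,2}:\d{2})\b"),
    "cost": re.compile(r"\$(\d+(?:\.\d{2})?)"),
}

def extract_event(text):
    """Fill one record; a field is None when its pattern finds nothing."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else None
    return record

print(extract_event(EVENT_PAGE))
```

Each pattern answers one of the slide's questions (name, date/time, cost), and a missing field stays `None` rather than being guessed.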

20 Topics: Recommendation systems Using statistics about the past actions of a group to give advice to an individual E.g., Amazon book suggestions or NetFlix movie suggestions A matrix problem: but now instead of words and documents, it’s users and “documents” What kinds of methods are used? Why have recommendation systems become a source of jokes on late night TV? –How might one build better ones?
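The "matrix problem" view can be sketched with a small users-by-items rating table. The code below assumes one particular method, user-based collaborative filtering with cosine similarity, which the slide does not name; the users, items, and ratings are made up.

```python
import math

# Hypothetical sparse user-by-item rating matrix (missing = not rated).
RATINGS = {
    "ann": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob": {"book_a": 4, "book_b": 5, "book_d": 5},
    "cat": {"book_c": 5, "book_e": 4},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(user, ratings, top_n=3):
    """Rank unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        weight = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + weight * rating
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

print(recommend("ann", RATINGS))
```

Here "ann" is steered toward "book_d" because her past ratings resemble "bob"'s far more than "cat"'s, which is the sense in which the past actions of a group advise an individual.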

21 Topics: XML search The nature of semi-structured data Tree models and XML Content-oriented XML retrieval Query languages and engines

22 22 History of IR 1960-70’s: –Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents. –Development of the basic Boolean and vector-space models of retrieval. –Prof. Salton and his students at Cornell University are the leading researchers in the area.

23 23 IR History Continued 1980’s: –Large document database systems, many run by companies: Lexis-Nexis Dialog MEDLINE

24 24 IR History Continued 1990’s: –Searching FTPable documents on the Internet Archie WAIS –Searching the World Wide Web Lycos Yahoo Altavista

25 25 IR History Continued 1990’s continued: –Organized Competitions NIST TREC –Recommender Systems Ringo Amazon NetPerceptions –Automated Text Categorization & Clustering

26 26 Recent IR History 2000’s –Link analysis for Web Search Google Inktomi Teoma –Feedback based engine: DirectHit –Automated Information Extraction Whizbang Fetch Burning Glass –Question Answering TREC Q/A track Ask Jeeves

27 27 Recent IR History 2000’s continued: –Multimedia IR Image Video Audio and music –Cross-Language IR –Document Summarization

28 28 Related Areas Database Management Library and Information Science Artificial Intelligence Natural Language Processing Machine Learning

29 29 Database Management Focused on structured data stored in relational tables rather than free-form text. Focused on efficient processing of well-defined queries in a formal language (SQL). Clearer semantics for both data and queries. Recent move towards semi-structured data (XML) brings it closer to IR.

30 30 Library and Information Science Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization). Concerned with effective categorization of human knowledge. Concerned with citation analysis and bibliometrics (structure of information). Recent work on digital libraries brings it closer to CS & IR.

31 31 Artificial Intelligence Focused on the representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge and queries: –First-order Predicate Logic –Bayesian Networks Recent work on web ontologies and intelligent information agents brings it closer to IR.

32 32 Natural Language Processing Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.

33 33 Natural Language Processing: IR Directions Methods for determining the sense of an ambiguous word based on context (word sense disambiguation). Methods for identifying specific pieces of information in a document (information extraction). Methods for answering specific NL questions from document corpora.

34 34 Machine Learning Focused on the development of computational systems that improve their performance with experience. Automated classification of examples based on learning concepts from labeled training examples (supervised learning). Automated methods for clustering unlabeled examples into meaningful groups (unsupervised learning).

35 35 Machine Learning: IR Directions Text Categorization –Automatic hierarchical classification (Yahoo). –Adaptive filtering/routing/recommending. –Automated spam filtering. Text Clustering –Clustering of IR query results. –Automatic formation of hierarchies Learning for Information Extraction Text Mining
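Automated spam filtering is a small instance of supervised text categorization. The sketch below uses a multinomial Naive Bayes classifier with add-one smoothing; that choice of algorithm is an assumption made for illustration, since the slides do not name a method, and the four training messages are invented.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up labeled training examples.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap prize offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lecture notes on indexing", "ham"),
]
model = train(TRAINING)
print(classify("cash prize offer", model))
print(classify("monday lecture agenda", model))
```

The classifier learns only from the labeled examples, so adding more labeled messages changes its behavior without any change to the code.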

