Information Retrieval, Search, and Mining

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

An Introduction to Information Retrieval and Applications J. H. Wang Feb. 19, 2008.
Web Search – Summer Term 2006 I. General Introduction (c) Wolfgang Hürst, Albert-Ludwigs-University.
Web- and Multimedia-based Information Systems. Assessment Presentation Programming Assignment.
Information Retrieval in Practice
Search Engines and Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Modern Information Retrieval Chapter 1: Introduction
Web Information Retrieval and Extraction Chia-Hui Chang, Associate Professor National Central University, Taiwan
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
Intelligent Information Retrieval CS 336 Lisa Ballesteros Spring 2006.
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Information Retrieval in Practice
Web Information Retrieval and Extraction Chia-Hui Chang, Associate Professor National Central University, Taiwan Sep. 16, 2005.
1 Information Retrieval and Web Search Introduction.
1 Web Information Retrieval Web Science Course. 2.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
 IR: representation, storage, organization of, and access to information items  Focus is on the user information need  User information need:  Find.
IR Lecture 1. Information Retrieval  Information retrieval is concerned with representing, searching, and manipulating large collections of electronic.
1 Web Search and Advanced Internet Services 290N Class Introduction Tao Yang, 2014.
CS598CXZ Course Summary ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Web Technologies Search Engines
Search Engines and Information Retrieval Chapter 1.
1 Information Retrieval and Advanced Internet Services 290N Class Introduction Tao Yang, 2015
CS523 INFORMATION RETRIEVAL COURSE INTRODUCTION YÜCEL SAYGIN SABANCI UNIVERSITY.
고급정보검색론 Advanced Information Retrieval
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Information Retrieval and Knowledge Organisation Knut Hinkelmann.
Information Retrieval and Web Search Lecture 1. Course overview Instructor: Rada Mihalcea Class web page:
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
1 Information Retrieval, Search, and Mining Introduction.
Internet Information Retrieval Sun Wu. Course Goal To learn the basic concepts and techniques of internet search engines –How to use and evaluate search.
1 Query Operations Relevance Feedback & Query Expansion.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
CSCE 5300 Information Retrieval and Web Search Introduction to IR models and methods Instructor: Rada Mihalcea Class web page:
Search Engine Architecture
IT-522: Web Databases And Information Retrieval By Dr. Syed Noman Hasany.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
Information Retrieval CSE 8337 Spring 2007 Introduction/Overview Some Material for these slides obtained from: Modern Information Retrieval by Ricardo.
Information Retrieval and Web Search Course overview Instructor: Rada Mihalcea.
Information Retrieval
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Information Retrieval and Web Search Introduction to IR models and methods Rada Mihalcea (Some of the slides in this slide set come from IR courses taught.
ITEC547 Text Mining Web Technologies Search Engines.
I NFORMATION R ETRIEVAL AND W EB S EARCH Jianping Fan Department of Computer Science UNC-Charlotte 1.
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
CENG 776 Information Retrieval Nihan Kesim Çiçekli URL: 1/60.
Information Retrieval and Web Search Vasile Rus, PhD websearch/
Information Retrieval in Practice
Information Retrieval in Practice
Information Storage and Retrieval Fall Lecture 1: Introduction and History.
Search Engine Architecture
Information Retrieval (in Practice)
Information Retrieval and Web Search
Search Engine Architecture
Information Retrieval and Web Search
Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
CSE 635 Multimedia Information Retrieval
Lecture 8 Information Retrieval Introduction
Chapter 5: Information Retrieval and Web Search
Search Engine Architecture
Information Retrieval and Extraction
Information Retrieval and Web Search
ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB SEARCH
Presentation transcript:

Information Retrieval, Search, and Mining Introduction

Course Topics Information Retrieval and Web Search Text Mining Information Retrieval Models Indexing, Compression, and Online Search Ranking methods: link analysis. Evaluation Methods Text Mining Text Categorization and Clustering Recommendation, and Information extraction Systems Support Clustering for online servers and offline computation. MapReduce/Hadoop. Caching. Data storage and communication. Crawling and document parsing. Basic algorithms: Duplicate detection. SVD/Eigen computation. Constrained optimization

References Selected papers Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. Others: S. Chakrabarti. 2003. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann. MG = Managing Gigabytes, by Witten, Moffat, and Bell. MIR = Modern Information Retrieval, by Baeza-Yates and Ribeiro-Neto. Selected papers

Information Retrieval System Document corpus IR System Query String Ranked Documents 1. Doc1 2. Doc2 3. Doc3 .

Information Retrieval (IR) The indexing and retrieval of textual documents. Searching for pages on the World Wide Web is challenging Concerned firstly with retrieving relevant documents to a query. Concerned secondly with retrieving from large sets of documents efficiently.

Relevance Relevance is a subjective judgment and may include: Being on the proper subject. Being timely (recent information). Being authoritative (from a trusted source). Satisfying the goals of the user and his/her intended use of the information (information need).

Keyword Search Simplest notion of relevance is that the query string appears verbatim in the document. Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords May not retrieve relevant documents that include synonymous terms. “restaurant” vs. “café” “PRC” vs. “China” May retrieve irrelevant documents that include ambiguous terms. “bat” (baseball vs. mammal) “Apple” (company vs. fruit) “bit” (unit of data vs. act of eating)

Intelligent IR Taking into account the meaning of the words used. Taking into account the order of words in the query. Adapting to the user based on direct or indirect feedback. Taking into account the authority of the source.

IR System Architecture User Interface Text User Need Text Operations Logical View User Feedback Query Operations Indexing Database Manager Inverted file Searching Query Index Text Database Ranked Docs Retrieved Docs Ranking

IR System Components Text Operations forms index words (tokens). Stopword removal Stemming Indexing constructs an inverted index of word to document pointers. Searching retrieves documents that contain a given query token from the inverted index. Ranking scores all retrieved documents according to a relevance metric.

IR System Components (continued) User Interface manages interaction with the user: Query input and document output. Relevance feedback. Visualization of results. Query Operations transform the query to improve retrieval: Query expansion using a thesaurus. Query transformation using relevance feedback.

Web Search Application of IR to HTML documents on the World Wide Web. Differences: Must assemble document corpus by spidering the web. Can exploit the structural layout information in HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.

Web Search System Web Spider Document corpus IR Query String System Ranked Documents 1. Page1 2. Page2 3. Page3 .

Other IR-Related Tasks Automated document categorization Information filtering (spam filtering) Information routing Automated document clustering Recommending information or products Information extraction Information integration Question answering Advertisement placement

Topics: Text mining “Text mining” is a cover-all marketing term A lot of what we’ve already talked about is actually the bread and butter of text mining: Text classification, clustering, and retrieval But we will focus in on some of the higher-level text applications: Extracting document metadata Topic tracking and new story detection Cross document entity and event coreference Text summarization Question answering

Topics: Information extraction Getting semantic information out of textual data Filling the fields of a database record E.g., looking at an events web page: What is the name of the event? What date/time is it? How much does it cost to attend Other applications: resumes, health data, … A limited but practical form of natural language understanding

Topics: Recommendation systems Using statistics about the past actions of a group to give advice to an individual E.g., Amazon book suggestions or NetFlix movie suggestions A matrix problem: but now instead of words and documents, it’s users and “documents” What kinds of methods are used? Why have recommendation systems become a source of jokes on late night TV? How might one build better ones?

Topics: XML search The nature of semi-structured data Tree models and XML Content-oriented XML retrieval Query languages and engines

History of IR 1960-70’s: Initial exploration of text retrieval systems for “small” corpora of scientific abstracts, and law and business documents. Development of the basic Boolean and vector-space models of retrieval. Prof. Salton and his students at Cornell University are the leading researchers in the area.

IR History Continued 1980’s: Large document database systems, many run by companies: Lexis-Nexis Dialog MEDLINE

IR History Continued 1990’s: Searching FTPable documents on the Internet Archie WAIS Searching the World Wide Web Lycos Yahoo Altavista

IR History Continued 1990’s continued: Organized Competitions NIST TREC Recommender Systems Ringo Amazon NetPerceptions Automated Text Categorization & Clustering

Recent IR History 2000’s Link analysis for Web Search Google Inktomi Teoma Feedback based engine: DirectHit (Ask.com) Automated Information Extraction Whizbang Fetch Burning Glass Question Answering TREC Q/A track Ask.com/Ask Jeeves

Recent IR History 2000’s continued: Multimedia IR Cross-Language IR Image Video Audio and music Cross-Language IR Document Summarization

Related Areas Database Management Information Management Library and Information Science Artificial Intelligence/Machine Learning Natural Language Processing Large-scale systems Operating systems/networking support Parsing/compression/fast algorithms. Fault tolerance/paralle+distributed systems

Database Management Structured data stored in relational tables rather than free-form text. Efficient processing of well-defined queries in a formal language (SQL). Semi-structured data, XML

Library and Information Science Human user aspects of information retrieval (human-computer interaction, user interface, visualization). Effective categorization of human knowledge. Citation analysis and bibliometrics (structure of information). Digital libraries

Natural Language Processing&IR Syntactic, semantic, and pragmatic analysis of natural language text and discourse. Analyze syntax (phrase structure) and semantics to retrieve based on meaning rather than keywords. Sense of an ambiguous word based on context (word sense disambiguation). Identifying specific pieces of information in a document (information extraction). Answering specific NL questions from document corpora.

Artificial Intelligence/Machine Learning Representation of knowledge, reasoning, and intelligent action. Formalisms for representing knowledge First-order Predicate Logic Bayesian Networks Web ontologies and intelligent agents Machine learning: Automated classification (supervised learning). Clustering unlabeled examples (unsupervised learning).

Machine Learning&IR Text Categorization Text Clustering Automatic hierarchical classification . Adaptive filtering/routing/recommending. Automated spam filtering. Text Clustering Clustering of IR query results. Automatic formation of hierarchies Learning for Information Extraction Text Mining

System Supports for Internet-Scale Data-Intensive Search/Mining System Challenges for Large-scale Processing, Fast Response, and High Availability. Multiprocessing, Machine clustering, multiprocessing. Networking Thread/process/memory management. IO Middleware support. Software engineering (e.g. debugging) Data center management.

Fast Algorithms/Computing String comparison/manipulation Duplicates. Compression. Graph/matrix computation Ranking matrix. Sparse matrices Eigen computation and SVD Linear/nonlinear programming for optimization (e.g. used for SVM).