1 資訊檢索與擷取 Information Retrieval and Extraction
陳信希 Hsin-Hsi Chen, 台大資訊系 (Department of Computer Science and Information Engineering, National Taiwan University)
2 Information Retrieval
Generic information retrieval system: select and return to the user desired documents from a large set of documents, in accordance with criteria specified by the user.
Functions:
- document search: the selection of documents from an existing collection of documents
- document routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles
3 Detection Need
Definition: a set of criteria specified by the user that describes the kind of information desired.
- queries in the document search task
- profiles in the routing task
Forms:
- keywords
- keywords with Boolean operators
- free text
- example documents
- ...
4 Example
<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description:
Document must identify a company who has the capability to produce document management system by obtaining a turnkey-system or by obtaining and integrating the basic components.
<narr> Narrative:
To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.
5 Example (Continued)
<con> Concepts:
1. document management, document processing, office automation, electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management-The creation, storage and retrieval of documents containing, text, images, and graphics.
Image Scanner-A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures.
Optical Disk-A disk that is written and read by light, and are sometimes associated with the storage of digital images because of their high storage capacity.
6 search vs. routing
- The search process matches a single Detection Need against the stored corpus to return a subset of documents.
- Routing matches a single document against a group of Profiles to determine which users are interested in the document.
- Profiles stand as long-term expressions of user needs.
- Search queries are ad hoc in nature.
- A generic detection architecture can be used for both search and routing.
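The contrast between the two tasks can be sketched in a few lines of Python. This is only an illustration: the function names `search` and `route`, the whitespace tokenizer, and the rule "every query or profile term must occur in the document" are assumptions made for the example, not something the slides prescribe.

```python
def tokens(text):
    """Lowercase a text and split it on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def search(query, corpus):
    """Match ONE query against MANY stored documents; return matching doc ids."""
    wanted = tokens(query)
    return [doc_id for doc_id, text in corpus.items() if wanted <= tokens(text)]


def route(document, profiles):
    """Match ONE incoming document against MANY profiles; return interested users."""
    words = tokens(document)
    return [user for user, profile in profiles.items() if tokens(profile) <= words]


# Hypothetical sample data.
corpus = {
    "d1": "optical character recognition for document management",
    "d2": "stock market report",
}
profiles = {
    "alice": "document management",
    "bob": "stock market",
}

print(search("document management", corpus))              # ['d1']
print(route("new document management system", profiles))  # ['alice']
```

The two functions are mirror images of each other, which is why a single detection architecture can plausibly serve both tasks.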
7 Search
Retrieval of desired documents from an existing corpus. Retrospective search is frequently interactive.
Methods:
- indexing the corpus by keyword, stem and/or phrase
- applying statistical and/or learning techniques to better understand the content of the corpus
- analyzing free-text Detection Needs to compare with the indexed corpus or a single document
- ...
9 Document Detection: Search (Continued)
Document Corpus:
- the content of the corpus may have a significant effect on performance in some applications
Preprocessing of Document Corpus:
- stemming
- a list of stop words
- phrases, multi-term items
- ...
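A rough Python sketch of the first two preprocessing steps (stop-word removal and stemming) follows. The stop list and the three suffix rules are toy assumptions chosen for the example. A real system would use a fuller stop list and a proper stemming algorithm, and phrase detection is not covered here.

```python
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "is", "are", "in"}  # illustrative only
SUFFIXES = ("ing", "es", "s")  # toy rules, far cruder than a real stemmer


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation and stop words, then stem each remaining word."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


print(preprocess("The scanners are indexing the documents."))
# ['scanner', 'index', 'document']
```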
10 Document Detection: Search (Continued)
Building Index from Stems:
- key place for optimizing run-time performance
- cost to build the index for a large corpus
Document Index:
- a list of terms, stems, phrases, etc.
- frequency of terms in the document and corpus
- frequency of the co-occurrence of terms within the corpus
- index may be as large as the original document corpus
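As a minimal sketch of such a document index, the Python below records, for each term, its frequency in every document that contains it, and derives the corpus-wide frequency from that. The names `build_index` and `corpus_frequency` are hypothetical, terms are plain whitespace tokens rather than stems or phrases, and co-occurrence counts are left out to keep the example short.

```python
from collections import Counter, defaultdict


def build_index(corpus):
    """Map each term to {doc_id: frequency of the term in that document}."""
    index = defaultdict(dict)
    for doc_id, text in corpus.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)


def corpus_frequency(index, term):
    """Total number of occurrences of a term across the whole corpus."""
    return sum(index.get(term, {}).values())


# Hypothetical two-document corpus.
corpus = {
    "d1": "text retrieval and text management",
    "d2": "image scanner and text database",
}
index = build_index(corpus)
print(index["text"])                    # {'d1': 2, 'd2': 1}
print(corpus_frequency(index, "text"))  # 3
```

Even in this toy form the index stores one entry per distinct term per document, which hints at why a real index can approach the size of the corpus itself.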
11 Document Detection: Search (Continued)
Detection Need:
- the user’s criteria for a relevant document
Convert Detection Need to System Specific Query:
- first transformed into a detection query, and then into a retrieval query
- detection query: specific to the retrieval engine, but independent of the corpus
- retrieval query: specific to the retrieval engine, and to the corpus
12 Document Detection: Search (Continued)
Compare Query with Index
Resultant Rank Ordered List of Documents:
- Return the top ‘N’ documents
- Rank the list of relevant documents from the most relevant to the query to the least relevant
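The comparison and ranking step could look roughly like the following Python sketch. The scoring rule (the sum of query-term frequencies in a document) is an assumption picked for simplicity, and for brevity the function scans the corpus directly instead of consulting a prebuilt index. The name `rank` and the sample corpus are hypothetical.

```python
from collections import Counter


def rank(query, corpus, n=2):
    """Return up to n (doc_id, score) pairs, most relevant first."""
    query_terms = query.lower().split()
    scored = []
    for doc_id, text in corpus.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:  # documents sharing no term with the query are dropped
            scored.append((doc_id, score))
    # Sort by descending score; break ties by document id for a stable result.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


corpus = {
    "d1": "text retrieval and text management",
    "d2": "image scanner and text database",
    "d3": "optical disk storage",
}
print(rank("text retrieval", corpus))  # [('d1', 3), ('d2', 1)]
```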
14 Routing (Continued)
Profile of Multiple Detection Needs:
- A Profile is a group of individual Detection Needs that describes a user’s areas of interest.
- All Profiles will be compared to each incoming document (via the Profile index).
- If a document matches a Profile, the user is notified about the existence of a relevant document.
15 Routing (Continued)
Convert Detection Need to System Specific Query
Building Index from Queries:
- similar to building the corpus index for searching
- the quantity of source data (Profiles) is usually much less than a document corpus
- Profiles may have more specific, structured data in the form of SGML tagged fields
16 Routing (Continued)
Routing Profile Index:
- The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.
Document to be routed:
- A stream of incoming documents is handled one at a time to determine where each should be directed.
- Routing implementation may handle multiple document streams and multiple Profiles.
17 Routing (Continued)
Preprocessing of Document:
- A document is preprocessed in the same manner that a query would be set up in a search.
- The document and query roles are reversed compared with the search process.
Compare Document with Index:
- Identify which Profiles are relevant to the document.
- Given a document, which of the indexed Profiles match it?
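The role reversal can be illustrated with a small Python sketch in which the index is built over the Profiles and each incoming document acts as the query. The names `build_profile_index` and `route`, the sample Profiles, and the matching rule (every Profile term must appear in the document) are assumptions made for this example only.

```python
from collections import defaultdict


def build_profile_index(profiles):
    """Map each profile term to the set of users whose profile contains it."""
    index = defaultdict(set)
    for user, profile in profiles.items():
        for term in profile.lower().split():
            index[term].add(user)
    return index


def route(document, profile_index, profiles):
    """Return, sorted, the users whose every profile term occurs in the document."""
    doc_terms = set(document.lower().split())
    # Only profiles sharing at least one term with the document are candidates.
    candidates = set()
    for term in doc_terms:
        candidates |= profile_index.get(term, set())
    return sorted(
        user for user in candidates
        if set(profiles[user].lower().split()) <= doc_terms
    )


# Hypothetical profiles and document stream.
profiles = {"alice": "optical disk", "bob": "image scanner", "carol": "text retrieval"}
profile_index = build_profile_index(profiles)

stream = [
    "new optical disk and image scanner announced",
    "text database released",
]
for doc in stream:  # documents are handled one at a time
    print(route(doc, profile_index, profiles))
# ['alice', 'bob']
# []
```

Looking terms up in the Profile index first means only Profiles that share a term with the document are checked in full, so the work per document does not have to grow with the total number of Profiles.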
18 Routing (Continued)
Resultant List of Profiles:
- The list of Profiles identifies which users should receive the document.
19 Summary
- Generate a representation of the meaning or content of each object based on its description.
- Generate a representation of the meaning of the information need.
- Compare these two representations to select those objects that are most likely to match the information need.
20 Basic Architecture of an Information Retrieval System
[Figure: Documents, Queries, Document Representation, Query Representation, Comparison]
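One possible instantiation of this architecture, offered only as an illustration, represents both documents and queries as bag-of-words term counts and uses cosine similarity for the comparison. The slides do not commit to a particular representation or comparison function at this point, so `represent` and `cosine` are hypothetical names and the choice of cosine similarity is an assumption.

```python
import math
from collections import Counter


def represent(text):
    """Bag-of-words representation: term -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_repr = represent("text retrieval and text management")
query_repr = represent("text retrieval")
print(cosine(doc_repr, query_repr))  # closer to 1.0 means a closer match
```

The same `represent` function is applied to both sides, so the comparison operates on two objects of the same kind.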
21 Research Issues
Given a set of descriptions for objects in the collection and a description of an information need, we must consider:
Issue 1:
- What makes a good document representation?
- What are retrievable units and how are they organized?
- How can a representation be generated from a description of the document?
22 Research Issues (Continued)
Issue 2:
- How can we represent the information need, and how can we acquire this representation, either from a description of the information need or through interaction with the user?
Issue 3:
- How can we compare representations to judge the likelihood that a document matches an information need?
23 Research Issues (Continued)
Issue 4:
- How can we evaluate the effectiveness of the retrieval process?
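Precision and recall are two widely used effectiveness measures. The slide itself does not name a measure, so the Python sketch below is an illustrative assumption, and the document ids are made up. Precision is the fraction of retrieved documents that are relevant. Recall is the fraction of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
print(precision, recall)
```

In this example two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were found (recall 2/3).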
24 Information Extraction
Generic Information Extraction System:
An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
25 Information Extraction (Continued)
- What are the transducers or modules?
- What are their input and output?
- What structure is added?
- What information is lost?
- What is the form of the rules?
- How are the rules applied?
- How are the rules acquired?
26 Example: Parser
- transducer: parser
- input: the sequence of words or lexical items
- output: a parse tree
- information added: predicate-argument and modification relations
- information lost: none
- rule form: unification grammars
- application method: chart parser
- acquisition method: manually
27 Modules
- Text Zoner: turn a text into a set of text segments
- Preprocessor: turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes
- Filter: turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones
- Preparser: take a sequence of lexical items and try to identify various reliably determinable, small-scale structures
28 Modules (Continued)
- Parser: input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete
- Fragment Combiner: turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence
- Semantic Interpreter: generate a semantic structure or logical form from a parse tree or from parse tree fragments
29 Modules (Continued)
- Lexical Disambiguation: turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates
- Coreference Resolution, or Discourse Processing: turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text
- Template Generator: derive the templates from the semantic structures
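The idea of a cascade can be sketched in Python by chaining toy stand-ins for the first three modules (Text Zoner, Preprocessor, Filter). Everything here is a simplifying assumption: segments are split on blank lines, sentences on periods, lexical items are bare words without attributes, and relevance is a single hypothetical keyword. The later modules (parsing, semantic interpretation, template generation) are not attempted.

```python
from functools import reduce


def zone(text):
    """Text Zoner stand-in: split a text into segments on blank lines."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]


def to_sentences(segments):
    """Preprocessor stand-in: split segments into sentences, each a list of words."""
    sentences = []
    for seg in segments:
        for raw in seg.split("."):
            words = raw.split()
            if words:
                sentences.append(words)
    return sentences


def keep_relevant(sentences, keyword="company"):
    """Filter stand-in: discard sentences that do not mention the keyword."""
    return [s for s in sentences if keyword in (w.lower() for w in s)]


def cascade(*stages):
    """Compose stages left to right into a single function."""
    return lambda data: reduce(lambda value, stage: stage(value), stages, data)


extract = cascade(zone, to_sentences, keep_relevant)
text = "The company builds scanners. It rained.\n\nAnother company uses them."
print(extract(text))
# [['The', 'company', 'builds', 'scanners'], ['Another', 'company', 'uses', 'them']]
```

Each stage adds structure (segments, then sentences of words) and the filter discards information judged irrelevant, mirroring the definition of a transducer cascade given earlier.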
30 Topics
1. Introduction to Information Retrieval and Extraction
2. Conventional Text-Retrieval Systems (Salton, Chapter 8)
- Database Management and Information Retrieval
- Text Retrieval Using Inverted Indexing Methods
- Extensions of the Inverted Index Operations
- Typical File Organization
- Text-Scanning Systems
3. Automatic Indexing (Salton, Chapter 9)
- Indexing Environment
- Indexing Aims
- Single-Term Indexing Theories
- Term Relationships in Indexing
- Term-Phrase Formulation
- Thesaurus-Group Generation
31 Topics (Continued)
4. Advanced Information-Retrieval Models (Salton, Chapter 10)
- The Vector Space Model
- Automatic Document Classification
- Probabilistic Retrieval Model
- Extended Boolean Retrieval Model
5. File Structures (Frakes & Baeza-Yates, Chapters 3-5)
- Inverted Files
- Signature Files
- PAT trees
6. Term and Query Operations (Frakes & Baeza-Yates, Chapters 7-9, 10)
- Lexical Analysis and Stoplists
- Stemming Algorithms
- Thesaurus Construction
- Relevance Feedback
7. Evaluation Metrics (Jones & Willett, Chapter 4)
- The Pragmatics of Information Retrieval Experimentation, Revisited
- The TREC Conferences
32 Topics (Continued)
8. IR on the World Wide Web (Cheong, Chapter 4)
- Spiders for Indexing the Web
- Web Indexing Spiders
- WebCrawler: Finding What People Want
- Lycos: Hunting WWW Information
- Harvest: Gathering and Brokering Information
- WebAnts: Hunting in Packs
- Issues of Web Indexing
- Spiders of the Future
9. Cross-Language Information Retrieval (Hsin-Hsi Chen)
10. Information Extraction (Jerry R. Hobbs)
- What information extraction is
- What is involved in building information extraction systems, and some how-tos
- What kinds of resources and tools are needed, and how to access them
33 Information Sources
Books:
- Salton, G. (1989) Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Reading, MA: Addison-Wesley.
- Frakes, W.B. and Baeza-Yates, R. (Eds.) (1992) Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
- Cheong, F. (1996) Internet Agents: Spiders, Wanderers, Brokers, and Bots. Indianapolis, IN: New Riders.
- Sparck Jones, K. and Willett, P. (1997) Readings in Information Retrieval. CA: Morgan Kaufmann Publishers.
34 Information Sources
Conference Proceedings:
- ACM SIGIR Annual International Conference on Research and Development in Information Retrieval (1978-)
Journals:
- ACM Transactions on Information Systems
- Information Processing and Management (formerly Information Storage and Retrieval)
- Journal of the American Society for Information Science (formerly American Documentation)
- Journal of Documentation