
1 Information Extraction CS 652 Information Extraction and Integration

2
- Information Extraction (IE) Task
- Information Retrieval (IR) and IE
- History of IE
- Evaluation Metrics
- Approaches to IE
- Free, Structured, and Semistructured Text
- Web Documents
- IE Systems

3 IR and IE
- IR: retrieves relevant documents from collections; draws on information theory, probability theory, and statistics
- IE: extracts relevant information from documents; draws on computational linguistics and natural language processing

4 History of IE
- Motivated by the exponential growth in the amount of both online and offline textual data
- Message Understanding Conference (MUC): quantitative evaluation of IE systems
- MUC tasks: Latin American terrorism, joint ventures, microelectronics, company management changes

5 Evaluation Metrics
- Precision (P)
- Recall (R)
- F-measure
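In their standard form, with TP, FP, and FN denoting true positives, false positives, and false negatives, and F-measure given here as the balanced F1:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}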

6 Approaches to IE
- Knowledge engineering approach
  - Grammars are constructed by hand
  - Domain patterns are discovered by human experts through introspection and inspection of a corpus
  - Much laborious tuning and "hill climbing"
- Automatic training approach
  - Use statistical methods where possible
  - Learn rules from annotated corpora
  - Learn rules from interaction with the user

7 Knowledge Engineering
- Advantages
  - With skill and experience, well-performing systems are not conceptually hard to develop
  - The best-performing systems have been hand-crafted
- Disadvantages
  - Very laborious development process
  - Some changes to specifications can be hard to accommodate
  - The required expertise may not be available

8 Automatic Training
- Advantages
  - Domain portability is relatively straightforward
  - System expertise is not required for customization
  - "Data-driven" rule acquisition ensures full coverage of the examples
- Disadvantages
  - Training data may not exist and may be very expensive to acquire
  - A large volume of training data may be required
  - Changes to specifications may require reannotation of large quantities of training data

9 Texts
- Free text: natural language processing
- Structured text: textual information in a database or file following a predefined and strict format
- Semistructured text: ungrammatical, telegraphic (e.g., Web documents)

10 Web Document Categorization [Hsu, 1998]
- Structured: itemized information; uniform syntactic clues (e.g., delimiters, attribute order, …)
- Semistructured (e.g., missing attributes, multi-valued attributes, …)
- Unstructured (e.g., linguistic knowledge is required, …)

11 Free Text
- AutoSlog
- LIEP
- PALKA
- HASTEN
- CRYSTAL
- WebFoot
- WHISK

12 AutoSlog [1993] The Parliament building was bombed by Carlos.

13 LIEP [1995] The Parliament building was bombed by Carlos.

14 PALKA [1995] The Parliament building was bombed by Carlos.

15 HASTEN [1995] The Parliament building was bombed by Carlos.
Egraphs: (SemanticLabel, StructuralElement) pairs

16 CRYSTAL [1995] The Parliament building was bombed by Carlos.

17 CRYSTAL + WebFoot [1997]

18 WHISK [1999] The Parliament building was bombed by Carlos.
WHISK rule: * ( PhyObj ) * @passive *F 'bombed' * {PP 'by' *F ( Person )}
Context-based patterns
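WHISK induces such rules from annotated training examples; purely as an illustration of how the rule above reads, it can be approximated by a hand-written regular expression. The PhyObj and Person word lists below are toy stand-ins for WHISK's user-defined semantic classes, not part of the system.

import re

# Toy stand-ins for WHISK's semantic classes (in WHISK these are
# user-defined word classes, not fixed lists).
PHYOBJ = r"(?:parliament building|embassy|bridge)"
PERSON = r"(?:carlos|guerrillas)"

# Rough regex analogue of:
#   * ( PhyObj ) * @passive *F 'bombed' * {PP 'by' *F ( Person )}
# '*' skips arbitrary text; '@passive' is approximated by 'was/were'.
PATTERN = re.compile(
    rf"(?P<target>{PHYOBJ}).*?\b(?:was|were)\s+bombed\b.*?\bby\s+(?P<perp>{PERSON})",
    re.IGNORECASE,
)

m = PATTERN.search("The Parliament building was bombed by Carlos.")
if m:
    print(m.group("target"), "/", m.group("perp"))  # Parliament building / Carlos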

19 Comparison: the deck's table compares extraction granularity (single-slot vs. multi-slot rules), semantic class constraints, and syntactic constraints across AutoSlog, LIEP, PALKA, HASTEN, CRYSTAL, and WHISK; the individual check marks did not survive transcription.

20 Web Documents
- Semistructured and unstructured
  - RAPIER (M. E. Califf, 1997)
  - SRV (D. Freitag, 1998)
  - WHISK (S. Soderland, 1998)
- Semistructured and structured
  - ShopBot (R. B. Doorenbos, O. Etzioni, D. S. Weld, 1996/1997)
  - WIEN (N. Kushmerick, 1997)
  - SoftMealy (C.-H. Hsu, 1998)
  - STALKER (I. Muslea, S. Minton, C. Knoblock, 1998)

21 Inductive Learning
- Task: inductive inference
- Learning systems
  - Zero-order
  - First-order, e.g., Inductive Logic Programming (ILP)

22 RAPIER [1997]

23 SRV [1998]

24 WHISK [1998]

25 Comparison

IE system | Rule granularity | Semantic constraints | Other constraints | Induction
RAPIER    | single-slot      | WordNet              | POS, length       | bottom-up
SRV       | single-slot      | WordNet              | length            | top-down
WHISK     | multi-slot       | user-defined         |                   | top-down

26 Wrapper Generation
- Wrapper: an IE application for one particular information source
- Delimiter-based rules
- No use of linguistic constraints

27 WIEN [1997]
- Assumes items are always in a fixed, known order
- Introduces several types of wrappers
- Advantages
  - Fast to learn and extract
- Drawbacks
  - Cannot handle permutations or missing items
  - Must label entire pages
  - No use of semantic classes

28 WIEN Rule
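The rule image from this slide is not in the transcript. As a minimal sketch of the LR wrapper idea only (the page format and delimiters are hypothetical, not WIEN's learned output): each attribute k is bracketed by a fixed left/right delimiter pair (l_k, r_k), and extraction is a single left-to-right scan over the page.

def lr_extract(page, delimiters):
    """Apply an LR wrapper: delimiters is one (left, right) pair per
    attribute, in the fixed order the attributes appear on the page."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples              # no more rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples              # malformed tail; stop
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))

# Hypothetical page in the rigid format WIEN assumes:
page = "<b>France</b><i>Paris</i><b>Japan</b><i>Tokyo</i>"
print(lr_extract(page, [("<b>", "</b>"), ("<i>", "</i>")]))
# [('France', 'Paris'), ('Japan', 'Tokyo')]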

29 SoftMealy [1998]
- Learns a transducer
- Advantages
  - Also learns the order of items
  - Allows item permutations and missing items
  - Uses wildcards
- Drawbacks
  - Must see all possible permutations
  - Cannot use delimiters that do not immediately precede and follow the relevant items

30 SoftMealy Rule
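This rule image is also missing. A heavily simplified sketch of the transducer idea (the separators below are hypothetical stand-ins for SoftMealy's learned contextual rules): at each position, follow whichever attribute's separator matches next, which is what allows permuted and missing items.

# One contextual rule per attribute, simplified to literal
# (left separator, right separator) pairs.
RULES = {
    "name":  ("<b>", "</b>"),
    "email": ("<i>", "</i>"),
}

def softmealy_extract(page, rules):
    records, pos = [], 0
    while True:
        # take whichever attribute's left separator occurs next
        nxt = [(page.find(l, pos), attr, l, r)
               for attr, (l, r) in rules.items()
               if page.find(l, pos) != -1]
        if not nxt:
            return records
        start, attr, l, r = min(nxt)
        start += len(l)
        end = page.find(r, start)
        if end == -1:
            return records
        records.append((attr, page[start:end]))
        pos = end + len(r)

# Items permuted, second record missing its email:
page = "<i>alice@x.org</i><b>Alice</b><b>Bob</b>"
print(softmealy_extract(page, RULES))
# [('email', 'alice@x.org'), ('name', 'Alice'), ('name', 'Bob')]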

31 STALKER [1998, 1999, 2001]
- Hierarchical information extraction
- Embedded Catalog Tree (ECT) formalism
- Advantages
  - Handles nested data
- Drawbacks

32 STALKER
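As a minimal sketch of STALKER-style SkipTo rules (the page and landmarks are hypothetical, not rules STALKER actually induced): a start rule consumes a sequence of landmarks, an end landmark bounds the extract, and, mirroring the Embedded Catalog Tree, child nodes are extracted from their parent's extract rather than from the whole page.

def skip_to(page, landmarks, pos=0):
    """Consume each landmark in turn; return the position just after
    the last one, or -1 if any landmark is missing."""
    for lm in landmarks:
        pos = page.find(lm, pos)
        if pos == -1:
            return -1
        pos += len(lm)
    return pos

def extract(page, start_landmarks, end_landmark):
    start = skip_to(page, start_landmarks)
    end = page.find(end_landmark, start) if start != -1 else -1
    return page[start:end] if end != -1 else None

page = "Name: LA Weather <p> Temp: <b>72 F</b> today"
parent = extract(page, ["Temp:"], "today")   # parent node of the ECT
print(extract(parent, ["<b>"], "</b>"))      # child node -> 72 F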

33 Comparison

34 Commercial Systems
- Junglee [1996]
- Jango [1997]
- MySimon [1998]
- …?…

35 Automatically Constructing a Dictionary for Information Extraction Tasks
Ellen Riloff. Proceedings of the 11th National Conference on Artificial Intelligence (AAAI-93), 1993.

36 Overview
- Problem: building a domain-specific dictionary
- Solution: AutoSlog

37 Information Extraction
- Selective concept extraction
- CIRCUS: sentence analyzer
- Domain-specific dictionary
- Concept nodes (sketched below)
  - Lexical items
  - Linguistic context
  - Slots: syntactic expectation, hard and soft constraints
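As an illustration only (the field names below are mine, not CIRCUS's internal representation), a concept node definition of this kind can be pictured as a record tying a trigger word to its linguistic context and slot constraints:

from dataclasses import dataclass, field

@dataclass
class ConceptNode:
    """Illustrative stand-in for a CIRCUS concept node definition."""
    trigger: str                   # lexical item that activates the node
    linguistic_context: str        # e.g. a passive-verb construction
    slot: str                      # template slot the node fills
    syntactic_expectation: str     # constituent expected to hold the filler
    hard_constraints: list = field(default_factory=list)
    soft_constraints: list = field(default_factory=list)

# The deck's running example: '... was bombed' extracting the subject
# as the target of a bombing.
target_bombed = ConceptNode(
    trigger="bombed",
    linguistic_context="<subject> passive-verb",
    slot="target",
    syntactic_expectation="subject",
    hard_constraints=["filler is a physical object"],
)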

38 Extraction Procedure
Sentence → CIRCUS (+ concept node dictionary) → concept nodes

39 Observations
- News reports follow certain stylistic conventions
- The first reference to a targeted piece of information is most likely where the relationship between that information and the event is made explicit
- The immediate linguistic context surrounding the targeted information usually contains the words or phrases that describe its role in the event

40 Conceptual Anchor Points
- A conceptual anchor point is a word that should activate a concept (in CIRCUS: a trigger word)
- AutoSlog uses 13 heuristics to decide which word to propose (illustrated below)
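As a rough illustration of how such heuristics fire (only two of the 13 patterns are shown, and the flat clause dictionary is a simplified stand-in for CIRCUS's parse):

# Two AutoSlog-style heuristic patterns; the full system applies 13
# of them to real CIRCUS analyses.
HEURISTICS = [
    ("<subject> passive-verb",
     lambda c: c["voice"] == "passive" and c["target"] == "subject"),
    ("active-verb <dobj>",
     lambda c: c["voice"] == "active" and c["target"] == "dobj"),
]

def propose_concept(clause):
    for name, matches in HEURISTICS:
        if matches(clause):
            # the verb becomes the conceptual anchor point (trigger word)
            return {"pattern": name, "trigger": clause["verb"]}
    return None

# 'The Parliament building was bombed', with the subject as the
# targeted string:
clause = {"voice": "passive", "verb": "bombed", "target": "subject"}
print(propose_concept(clause))
# {'pattern': '<subject> passive-verb', 'trigger': 'bombed'}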

41 Domain Specifications
- A set of mappings from template slots to concept node slots
- Hard and soft constraints for each type of concept node slot
- A set of mappings from template types to concept node types

42 A Good Concept Node Def.

43 Another Good Concept Node Def.

44 A Bad Concept Node Def.

45 Discussion
AutoSlog produces bad definitions when:
- a sentence contains the targeted string but does not describe the event
- a heuristic proposes the wrong conceptual anchor point
- CIRCUS incorrectly analyzes the sentence

46 Comparative Results

System / Test set | Recall | Precision | F-measure
MUC-4 / TST3      | 46     | 56        | 50.51
AutoSlog / TST3   | 43     | 56        | 48.65
MUC-4 / TST4      | 44     | 40        | 41.90
AutoSlog / TST4   | 39     | 45        | 41.79

