1
Comparison of IE Approaches
Chia-Hui Chang
National Central University
Jan. 4, 2005
2
Introduction
Abundant information on the Web
– Static Web pages
– Searchable databases: Deep Web
Information Integration
– Information for life, e.g. shopping agents, travel agents
– Data for research purposes, e.g. bioinformatics, auction economy
3
Various IE Surveys
– Muslea
– Hsu and Dung
– Chang
– Kushmerick
– Laender
– Sarawagi
– Kuhlins and Tredwell
4
Related Work: Time
MUC Approaches
– AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
Post-MUC Approaches
– WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998], and STALKER [Muslea, 1999]
5
Related Work: Automation Degree
Hsu and Dung [1998]
– Hand-crafted wrappers using general programming languages
– Specially designed programming languages or tools
– Heuristic-based wrappers
– WI (wrapper induction) approaches
6
Related Work: Automation Degree
Chang and Kuo [2003]
– Systems that need programmers
– Systems that need annotation examples
– Annotation-free systems
– Semi-supervised systems
7
Related Work: Input and Extraction Rules
Muslea [1999]
– IE from free text, using extraction patterns based mainly on syntactic/semantic constraints
– Wrapper induction systems, which rely on delimiter-based rules
– IE from online documents, using patterns based on both delimiters and syntactic/semantic constraints
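To make the idea of a delimiter-based rule concrete, here is a minimal sketch that applies one (left, right) delimiter pair per attribute to a page. The page, the delimiters, and the function name `extract_lr` are all hypothetical, and the sketch only applies a rule that is already given; it does not show how a wrapper induction system would learn one.

```python
def extract_lr(page, delimiters):
    """Apply one (left, right) delimiter pair per attribute, record after record."""
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:          # no further left delimiter: extraction is done
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:            # unterminated attribute: stop here
                return records
            record.append(page[start:end])
            pos = end + len(right)   # continue scanning after this attribute
        records.append(tuple(record))


# Hypothetical page and rule, for illustration only.
page = "<html><b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br></html>"
rule = [("<b>", "</b>"), ("<i>", "</i>")]
print(extract_lr(page, rule))  # [('Congo', '242'), ('Egypt', '20')]
```

The rule carries no syntactic or semantic knowledge about the values; it depends only on the markup surrounding them, which is what separates this class from the free-text class.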
8
Related Work: Extraction Rules
Kushmerick [2003]
– Finite-state tools (regular expressions)
– Relational learning tools (logic rules)
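The two rule styles can be contrasted with a small sketch. The input text, the regular expression, and the predicate `is_price` are illustrative assumptions, not rules taken from any of the surveyed systems: the finite-state rule describes a whole record as one regular expression, while the relational-style rule fills a slot with any token that satisfies a conjunction of predicates.

```python
import re

# Hypothetical input, for illustration only.
text = "Title: Dune Price: 9 Title: Emma Price: 12"

# Finite-state style: one regular expression describes the whole record.
record_re = re.compile(r"Title: (?P<title>\w+) Price: (?P<price>\d+)")
fs_result = [(m["title"], m["price"]) for m in record_re.finditer(text)]

# Relational style: a token fills the "price" slot if every predicate holds.
tokens = text.split()


def is_price(i):
    """Token i is numeric AND its predecessor is the literal 'Price:'."""
    return tokens[i].isdigit() and i > 0 and tokens[i - 1] == "Price:"


rel_result = [tokens[i] for i in range(len(tokens)) if is_price(i)]

print(fs_result)   # [('Dune', '9'), ('Emma', '12')]
print(rel_result)  # ['9', '12']
```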
9
Related Work: Techniques
Laender [2002]
– Languages for wrapper development
– HTML-aware tools
– NLP-based tools
– Wrapper induction tools (e.g., WIEN, SoftMealy, and STALKER)
– Modeling-based tools
– Ontology-based tools
New Criteria
– Degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience, and adaptiveness
10
Related Work: Output Targets
Sarawagi [2002]
– Record-level
– Page-level
– Site-level
11
Related Work: Usability
Kuhlins and Tredwell [2002]
– Commercial
– Noncommercial
12
Three Dimensions
Task Domain
– Input (unstructured, semi-structured)
– Output targets (record-level, page-level, site-level)
Automation Degree
– Programmer-involved, learning-based, or annotation-free approaches
Techniques
– Regular expression rules vs. Prolog-like logic rules
– Deterministic finite-state transducers vs. probabilistic hidden Markov models
13
Classification by Automation Degree
Manual
– TSIMMIS, Minerva, WebOQL, W4F, XWrap
Supervised
– WIEN, STALKER, SoftMealy
Semi-supervised
– IEPAD, OLERA
Unsupervised
– DeLa, RoadRunner, EXALG
14
Task Domain: Input
15
Task Domain: Output
– Missing attributes
– Multi-valued attributes
– Multiple permutations
– Nested data objects
– Various templates for an attribute
– Common templates for various attributes
– Untokenized attributes
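Two of these variations, missing attributes and multi-valued attributes, can be illustrated with a short sketch. The records, the tag layout, and the rule below are hypothetical and hand-written; the point is only that an extraction rule has to allow an attribute to occur zero, one, or several times.

```python
import re

# Hypothetical records: the second has no author, the third has two.
records = [
    "<b>Dune</b><i>Herbert</i><u>9</u>",
    "<b>Emma</b><u>12</u>",
    "<b>Good Omens</b><i>Pratchett</i><i>Gaiman</i><u>8</u>",
]

# The author group is repeated zero or more times, so it covers both the
# missing-attribute case and the multi-valued case.
rule = re.compile(
    r"<b>(?P<title>.*?)</b>(?P<authors>(?:<i>.*?</i>)*)<u>(?P<price>.*?)</u>"
)

extracted = []
for rec in records:
    m = rule.fullmatch(rec)
    authors = re.findall(r"<i>(.*?)</i>", m["authors"])  # split the repeated part
    extracted.append((m["title"], authors, m["price"]))

for row in extracted:
    print(row)
```

A rule that required exactly one `<i>...</i>` per record would fail on the second and third records, which is why these variations appear as evaluation criteria for the task domain.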
16
Columns: Tools, PT, NHS, CP, EL, Nested, MA, MVA, MOA, FVF, UDA, UTA, SPA
Manual
Minerva: Semi-S, Yes, Record Level, Yes, No, Yes
TSIMMIS: Semi-S, Yes, Record Level, Yes, No, Yes, No, Yes, No
WebOQL: Semi-S, No, Yes, Record Level, Yes, No
W4F: Semi-S, No, Yes, Record Level, Yes, No, Yes
XWRAP: Semi-S, No, Yes, Record Level, Yes, No, Yes
Supervised
RAPIER: Free, Yes, Field Level, No, Yes, No
SRV: Free, Yes, Field Level, No, Yes, No
WHISK: Free, Yes, Record Level, No, Yes, No
NoDoSE: Semi-S, Yes, Record Level, Yes, No
DEByE: Semi-S, Yes, Record Level, Yes, No
WIEN: Semi-S, Yes, Record Level, No
STALKER: Semi-S, Yes, Record Level, Yes, No, Yes
SoftMealy: Semi-S, Yes, Record Level, Multi Pass, Yes, Limited, Yes, No, Yes
Semi-Supervised
IEPAD: Semi-S, No, Limited, Record Level, Limited, Yes, Limited, Yes, No, Yes
OLERA: Semi-S, No, Limited, Record Level, Limited, Yes, Limited, Yes, No, Yes
Unsupervised
RoadRunner: Semi-S, No, Limited, Page Level, Yes, No, Yes
EXALG: Semi-S, Yes, Limited, Page Level, Yes, No, Yes, No, Yes
DeLa: Semi-S, No, Limited, Record Level, Yes, Limited, Yes, No, Yes
17
Automation Degree
– Page-fetching support
– Annotation requirement
– Output support
– API support
18
Tools | GUI Support | Page-Fetching Support | Output Support | Training Examples | API Support
Minerva | No |  | XML | No | Yes
TSIMMIS | No |  | Text | No | Yes
WebOQL | No |  | Text | No | Yes
W4F | Yes |  | XML | Labeled | Yes
XWRAP | Yes |  | XML | Labeled | Yes
RAPIER | No |  | Text | Labeled | No
SRV | No |  | Text | Labeled | No
WHISK | No |  | Text | Labeled | No
NoDoSE | Yes | No | XML, OEM | Labeled | Yes
DEByE | Yes |  | XML, SQL DB | Labeled | Yes
WIEN | Yes | No | Text | Labeled | Yes
STALKER | Yes | No | Text | Labeled | Yes
SoftMealy | Yes |  | XML, SQL DB | Labeled | Yes
IEPAD | Yes | No | Text | Unlabeled | No
OLERA | Yes | No | XML | Unlabeled | No
RoadRunner | No | Yes | XML | Unlabeled | Yes
EXALG | No |  | Text | Unlabeled | No
DeLa | No | Yes | Text | Unlabeled | Yes
19
Technologies
– Scan passes
– Extraction rule types
– Learning algorithms
– Tokenization schemes
– Features used
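As a rough illustration of what a tokenization scheme decides, the sketch below tokenizes the same fragment at tag level and at word level. The fragment, the function names, and the `TEXT` placeholder are assumptions made for this example; the surveyed systems define their own token classes.

```python
import re

# Hypothetical HTML fragment, for illustration only.
html = "<li><b>Dune</b> by Frank Herbert</li>"


def tag_level(s):
    """Each tag is a token; each non-empty text run between tags becomes TEXT."""
    parts = re.findall(r"<[^>]+>|[^<]+", s)
    return [p if p.startswith("<") else "TEXT" for p in parts if p.strip()]


def word_level(s):
    """Each tag is a token and each whitespace-separated word is a token."""
    return re.findall(r"<[^>]+>|[^<\s]+", s)


print(tag_level(html))   # ['<li>', '<b>', 'TEXT', '</b>', 'TEXT', '</li>']
print(word_level(html))  # ['<li>', '<b>', 'Dune', '</b>', 'by', 'Frank', 'Herbert', '</li>']
```

Tag-level tokens hide the variation inside text runs, which helps when looking for a repeated page template; word-level tokens keep literal words such as "by" available as delimiters.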
20
Tools | Scan Pass | Extraction Rule Type | Features Used | Learning Algorithm | Tokenization Schemes
Minerva | Single | Regular exp. | HTML tags/Literal words | None | Manually
TSIMMIS | Single | Regular exp. | HTML tags/Literal words | None | Manually
WebOQL | Single | Regular exp. | Hypertree | None | Manually
W4F | Single | Regular exp. | DOM tree path addressing | None | Tag Level
XWRAP | Single | Context-Free | DOM tree | None | Tag Level
RAPIER | Multiple | Logic rules | Syntactic/Semantic | ILP (bottom-up) | Word Level
SRV | Multiple | Logic rules | Syntactic/Semantic | ILP (top-down) | Word Level
WHISK | Single | Regular exp. | Syntactic/Semantic | Set covering (top-down) | Word Level
NoDoSE | Single | Regular exp. | HTML tags/Literal words | Data Modeling | Word Level
DEByE | Single | Regular exp. | HTML tags/Literal words | Data Modeling | Word Level
WIEN | Single | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
STALKER | Multiple | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
SoftMealy | Both | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
IEPAD | Single | Regular exp. | HTML tags | Pattern Mining, String Alignment | Multi-Level
OLERA | Single | Regular exp. | HTML tags | String Alignment | Multi-Level
RoadRunner | Single | Regular exp. | HTML tags | String Alignment | Tag Level
EXALG | Single | Regular exp. | HTML tags/Literal words | Equivalent Class and Role Differentiation | Word Level
DeLa | Single | Regular exp. | HTML tags | Pattern Mining | Tag Level
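String alignment appears above as a learning technique for several of the annotation-free tools. The sketch below conveys the general idea on two tokenized records: tokens that align are treated as template, and mismatching positions are treated as data slots. It uses Python's `difflib.SequenceMatcher` as a stand-in aligner, which is not the algorithm used by any of the listed systems; the token sequences, the function name `align`, and the `#DATA` marker are hypothetical.

```python
from difflib import SequenceMatcher

# Two hypothetical records, already tokenized (tags and text tokens).
rec_a = ["<li>", "<b>", "Dune", "</b>", "9", "</li>"]
rec_b = ["<li>", "<b>", "Emma", "</b>", "12", "</li>"]


def align(a, b):
    """Return (template, slots): shared tokens form the template, and each
    mismatching region becomes a #DATA slot holding both variants."""
    template, slots = [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:  # 'replace', 'delete', or 'insert'
            template.append("#DATA")
            slots.append((a[i1:i2], b[j1:j2]))
    return template, slots


template, slots = align(rec_a, rec_b)
print(template)  # ['<li>', '<b>', '#DATA', '</b>', '#DATA', '</li>']
print(slots)     # [(['Dune'], ['Emma']), (['9'], ['12'])]
```

Because no labeled examples are involved, this style of inference matches the "Unlabeled" training-example entries reported for the semi-supervised and unsupervised tools.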
21
Conclusion
– Criteria for evaluating IE systems from the task domain
– Comparison of IE systems across degrees of automation
– The use of various techniques in IE systems
22
Future Work
Page Fetching
– XWrap, W4F, WNDL
Schema Mapping
– Full information
– Partial information
Query Interface Integration
– [He, Chang and Han, 2004]
23
References
C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, Criteria for Evaluating Web Information Extraction Systems.