Comparison of IE Approaches Chia-Hui Chang National Central University Jan. 4, 2005.


1 Comparison of IE Approaches Chia-Hui Chang National Central University Jan. 4, 2005

2 Introduction
Abundant information on the Web
– Static Web pages
– Searchable databases: Deep Web
Information Integration
– Information for life, e.g. shopping agents, travel agents
– Data for research purposes, e.g. bioinformatics, auction economy

3 Various IE Surveys
– Muslea
– Hsu and Dung
– Chang
– Kushmerick
– Laender
– Sarawagi
– Kuhlins and Tredwell

4 Related Work: Time
MUC Approaches
– AutoSlog [Riloff, 1993], LIEP [Huffman, 1996], PALKA [Kim, 1995], HASTEN [Krupka, 1995], and CRYSTAL [Soderland, 1995]
Post-MUC Approaches
– WHISK [Soderland, 1999], RAPIER [Califf, 1998], SRV [Freitag, 1998], WIEN [Kushmerick, 1997], SoftMealy [Hsu, 1998], and STALKER [Muslea, 1999]

5 Related Work: Automation Degree
Hsu and Dung [1998]
– hand-crafted wrappers using general programming languages
– specially designed programming languages or tools
– heuristic-based wrappers
– wrapper induction (WI) approaches

6 Related Work: Automation Degree
Chang and Kuo [2003]
– systems that need programmers
– systems that need annotation examples
– annotation-free systems
– semi-supervised systems

7 Related Work: Input and Extraction Rules
Muslea [1999]
– First class: IE from free text, using extraction patterns based mainly on syntactic/semantic constraints
– Second class: wrapper induction systems, which rely on delimiter-based rules
– Third class: IE from online documents, using patterns based on both delimiters and syntactic/semantic constraints
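To make the idea of a delimiter-based rule concrete, here is a minimal sketch in Python. It is not taken from any of the surveyed systems: the function name `extract_between`, the page fragment, and the delimiters are all assumptions invented for illustration. Each attribute is located purely by the strings that appear to its left and right, with no syntactic or semantic analysis.

```python
def extract_between(page, left, right):
    """Return every substring of `page` enclosed by the `left` and `right` delimiters."""
    values = []
    pos = 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values


# Hypothetical page fragment: one record per <li>, two attributes per record.
page = "<li><b>Congo</b> <i>242</i></li><li><b>Egypt</b> <i>20</i></li>"

# One (left, right) delimiter pair per attribute acts as the extraction rule.
countries = extract_between(page, "<b>", "</b>")
codes = extract_between(page, "<i>", "</i>")
records = list(zip(countries, codes))
print(records)
```

A rule of this kind works only while every record uses the same delimiters and no attribute is missing; pairing the two lists with `zip` would silently misalign records otherwise.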

8 Related Work: Extraction Rules
Kushmerick [2003]
– Finite-state tools (regular expressions)
– Relational learning tools (logic rules)

9 Related Work: Techniques
Laender [2002]
– Languages for wrapper development
– HTML-aware tools
– NLP-based tools
– Wrapper induction tools (e.g., WIEN, SoftMealy and STALKER)
– Modeling-based tools
– Ontology-based tools
New Criteria
– degree of automation, support for complex objects, page contents, availability of a GUI, XML output, support for non-HTML sources, resilience and adaptiveness

10 Related Work: Output Targets
Sarawagi [2002]
– Record-level
– Page-level
– Site-level

11 Related Work: Usability
Kuhlins and Tredwell [2002]
– Commercial
– Noncommercial

12 Three Dimensions
Task Domain
– Input (unstructured, semi-structured)
– Output targets (record-level, page-level, site-level)
Automation Degree
– Programmer-involved, learning-based or annotation-free approaches
Techniques
– Regular expression rules vs. Prolog-like logic rules
– Deterministic finite-state transducers vs. probabilistic hidden Markov models

13 Classification by Automation Degree
Manual
– TSIMMIS, Minerva, WebOQL, W4F, XWrap
Supervised
– WIEN, STALKER, SoftMealy
Semi-supervised
– IEPAD, OLERA
Unsupervised
– DeLa, RoadRunner, EXALG

14 Task Domain: Input

15 Task Domain: Output
– Missing attributes
– Multi-valued attributes
– Multiple permutations
– Nested data objects
– Various templates for an attribute
– Common templates for various attributes
– Untokenized attributes
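Several of these output variations are easier to picture with a concrete record type. The following Python sketch is purely illustrative: the `Book` and `Author` classes, their fields, and the sample values are assumptions, not data from the presentation. It shows a missing attribute (`None`), a multi-valued attribute (a list of strings), and a nested data object (a list of `Author` records inside a `Book`).

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Author:
    name: str
    affiliation: Optional[str] = None  # missing attribute when None


@dataclass
class Book:
    title: str
    price: Optional[str] = None                          # may be missing
    formats: List[str] = field(default_factory=list)     # multi-valued attribute
    authors: List[Author] = field(default_factory=list)  # nested data objects


def missing_attributes(record):
    """List the attribute names that have no value in this record."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical extraction results for two records on the same page.
books = [
    Book("Book A", price="$35", formats=["hardcover", "paperback"],
         authors=[Author("Author X", "University Y")]),
    Book("Book B", authors=[Author("Author Z")]),
]

for book in books:
    print(book.title, missing_attributes(book))
```

An extractor that assumes every record has the same flat set of attributes cannot represent the second record faithfully, which is why these variations serve as comparison criteria.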

16
Tools PTNHSCPELNestedMAMVAMOAFVFUDAUTASPA
Manual
Minerva: Semi-S, Yes, Record Level, Yes, No, Yes
TSIMMIS: Semi-S, Yes, Record Level, Yes, No, Yes, No, Yes, No
WebOQL: Semi-S, No, Yes, Record Level, Yes, No
W4F: Semi-S, No, Yes, Record Level, Yes, No, Yes
XWRAP: Semi-S, No, Yes, Record Level, Yes, No, Yes
Supervised
RAPIER: Free, Yes, Field Level, No, Yes, No
SRV: Free, Yes, Field Level, No, Yes, No
WHISK: Free, Yes, Record Level, No, Yes, No
NoDoSE: Semi-S, Yes, Record Level, Yes, No
DEByE: Semi-S, Yes, Record Level, Yes, No
WIEN: Semi-S, Yes, Record Level, No
STALKER: Semi-S, Yes, Record Level, Yes, No, Yes
SoftMealy: Semi-S, Yes, Record Level, Multi Pass, Yes, Limited, Yes, No, Yes
Semi-Supervised
IEPAD: Semi-S, No, Limited, Record Level, Limited, Yes, Limited, Yes, No, Yes
OLERA: Semi-S, No, Limited, Record Level, Limited, Yes, Limited, Yes, No, Yes
Unsupervised
RoadRunner: Semi-S, No, Limited, Page Level, Yes, No, Yes
EXALG: Semi-S, Yes, Limited, Page Level, Yes, No, Yes, No, Yes
DeLa: Semi-S, No, Limited, Record Level, Yes, Limited, Yes, No, Yes

17 Automation Degree
– Page-fetching support
– Annotation requirement
– Output support
– API support

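18
Tools | GUI Support | Page-Fetching Support | Output Support | Training Examples | API Support
Minerva | No | No | XML | No | Yes
TSIMMIS | No | No | Text | No | Yes
WebOQL | No | No | Text | No | Yes
W4F | Yes | Yes | XML | Labeled | Yes
XWRAP | Yes | Yes | XML | Labeled | Yes
RAPIER | No | No | Text | Labeled | No
SRV | No | No | Text | Labeled | No
WHISK | No | No | Text | Labeled | No
NoDoSE | Yes | No | XML, OEM | Labeled | Yes
DEByE | Yes | Yes | XML, SQL DB | Labeled | Yes
WIEN | Yes | No | Text | Labeled | Yes
STALKER | Yes | No | Text | Labeled | Yes
SoftMealy | Yes | Yes | XML, SQL DB | Labeled | Yes
IEPAD | Yes | No | Text | Unlabeled | No
OLERA | Yes | No | XML | Unlabeled | No
RoadRunner | No | Yes | XML | Unlabeled | Yes
EXALG | No | No | Text | Unlabeled | No
DeLa | No | Yes | Text | Unlabeled | Yes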

19 Technologies
– Scan passes
– Extraction rule types
– Learning algorithms
– Tokenization schemes
– Features used

20
Tools | Scan Pass | Extraction Rule Type | Features Used | Learning Algorithm | Tokenization Schemes
Minerva | Single | Regular exp. | HTML tags/Literal words | None | Manually
TSIMMIS | Single | Regular exp. | HTML tags/Literal words | None | Manually
WebOQL | Single | Regular exp. | Hypertree | None | Manually
W4F | Single | Regular exp. | DOM tree path addressing | None | Tag Level
XWRAP | Single | Context-Free | DOM tree | None | Tag Level
RAPIER | Multiple | Logic rules | Syntactic/Semantic | ILP (bottom-up) | Word Level
SRV | Multiple | Logic rules | Syntactic/Semantic | ILP (top-down) | Word Level
WHISK | Single | Regular exp. | Syntactic/Semantic | Set covering (top-down) | Word Level
NoDoSE | Single | Regular exp. | HTML tags/Literal words | Data Modeling | Word Level
DEByE | Single | Regular exp. | HTML tags/Literal words | Data Modeling | Word Level
WIEN | Single | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
STALKER | Multiple | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
SoftMealy | Both | Regular exp. | HTML tags/Literal words | Ad-hoc (bottom-up) | Word Level
IEPAD | Single | Regular exp. | HTML tags | Pattern Mining, String Alignment | Multi-Level
OLERA | Single | Regular exp. | HTML tags | String Alignment | Multi-Level
RoadRunner | Single | Regular exp. | HTML tags | String Alignment | Tag Level
EXALG | Single | Regular exp. | HTML tags/Literal words | Equivalent Class and Role Differentiation | Word Level
DeLa | Single | Regular exp. | HTML tags | Pattern Mining | Tag Level
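The difference between tag-level and word-level tokenization can be sketched in a few lines of Python. The definitions used here are assumptions made for illustration (tag level: each HTML tag is a token and each run of text collapses into a single `TEXT` token; word level: tags plus individual whitespace-separated words), and the function names and sample fragment are hypothetical rather than drawn from any listed tool.

```python
import re

# Split a page into HTML tags and the text runs between them.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")


def tag_level_tokens(html):
    """Keep each tag; collapse every non-empty text run into one TEXT token."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            tokens.append(piece)
        elif piece.strip():
            tokens.append("TEXT")
    return tokens


def word_level_tokens(html):
    """Keep each tag; split text runs into whitespace-separated words."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            tokens.append(piece)
        else:
            tokens.extend(piece.split())
    return tokens


# Hypothetical fragment of a semi-structured page.
html = "<b>Data Mining</b> <i>$35</i>"
print(tag_level_tokens(html))
print(word_level_tokens(html))
```

Under these assumptions the tag-level view abstracts content away, so two records with different text look identical, while the word-level view keeps literal words available as features for a rule.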

21 Conclusion
– Criteria for evaluating IE systems from the task domain
– Comparison of IE systems by degree of automation
– The use of various techniques in IE systems

22 Future Work
Page Fetching
– XWrap, W4F, WNDL
Schema Mapping
– Full information
– Partial information
Query Interface Integration
– [He, Chang and Han, 2004]

23 References C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, Criteria for Evaluating Web Information Extraction Systems.

