A Brief Survey of Web Data Extraction Tools (WDET)
Laender et al.
Introduction
- Web data is hard to query
- A lot of unstructured data
- Wrappers can help extract data
- A wrapper maps a page to a repository
- There are several ways to generate wrappers
- This paper surveys tools for generating wrappers
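To make "a wrapper maps a page to a repository" concrete, here is a minimal hand-written sketch in Python. It is not taken from the paper or from any surveyed tool. The class name `TableWrapper`, the function `wrap`, the field names, and the sample page are all hypothetical, and the "repository" is simply a list of dicts.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Toy wrapper: turns each <tr> of an HTML table into one record."""

    def __init__(self, fields):
        super().__init__()
        self.fields = fields    # hypothetical field names, one per column
        self.records = []       # the "repository": a list of dicts
        self._row = None        # cells of the row being read, or None
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a cell of the current row.
        if self._in_cell and self._row:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Keep only rows whose cell count matches the expected fields.
            if len(self._row) == len(self.fields):
                cells = [c.strip() for c in self._row]
                self.records.append(dict(zip(self.fields, cells)))
            self._row = None
            self._in_cell = False


def wrap(page, fields):
    """Map one page (an HTML string) to a list of records."""
    parser = TableWrapper(fields)
    parser.feed(page)
    parser.close()
    return parser.records


# Made-up sample page, for illustration only.
PAGE = """
<html><body><table>
<tr><td>Book A</td><td>Author X</td></tr>
<tr><td>Book B</td><td>Author Y</td></tr>
</table></body></html>
"""

records = wrap(PAGE, ["title", "author"])
```

A wrapper written by hand like this breaks as soon as the page layout changes, which is one reason tools that generate wrappers are of interest.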
Taxonomy of WDET
- Languages for Wrapper Development
- HTML-aware Tools
- NLP-based Tools
- Wrapper Induction Tools
- Modeling-based Tools
- Ontology-based Tools
Overview of WDET
- Languages for Wrapper Development: procedural programming languages (Minerva, TSIMMIS)
- HTML-aware Tools: W4F, XWRAP, RoadRunner
- NLP-based Tools: work on free text (RAPIER, SRV, WHISK)
Taxonomy of WDET
- Wrapper Induction Tools: generate wrappers from input examples (WIEN, SoftMealy, STALKER)
- Modeling-based Tools: based on hierarchies of objects (NoDoSE, DEByE)
- Ontology-based Tools: use conceptual models or ontologies (BYU tool)
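The idea of generating a wrapper from examples can be sketched with a toy delimiter learner. This is only an illustration under strong assumptions: it is not the algorithm of WIEN, SoftMealy, or STALKER. The functions `induce` and `extract` and the sample pages are hypothetical. Each page is assumed to hold one value of interest, and the wrapper is the pair of strings that always surround that value.

```python
import os


def common_prefix(strings):
    # os.path.commonprefix compares character by character,
    # so it works on arbitrary strings, not only on paths.
    return os.path.commonprefix(list(strings))


def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]


def induce(examples):
    """Learn (left, right) delimiters from (page, value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)            # first occurrence of the value
        lefts.append(page[:start])           # text before the value
        rights.append(page[start + len(value):])  # text after the value
    left, right = common_suffix(lefts), common_prefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found")
    return left, right


def extract(page, wrapper):
    """Apply a learned (left, right) wrapper to a new page."""
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Made-up training pages with the value to extract marked by hand.
EXAMPLES = [
    ("<p>Name: <b>Ana</b> (BR)</p>", "Ana"),
    ("<p>Name: <b>Li</b> (CN)</p>", "Li"),
]

wrapper = induce(EXAMPLES)
value = extract("<p>Name: <b>Omar</b> (EG)</p>", wrapper)
```

The learned delimiters here come only from what the training pages share, so the result depends heavily on how representative the examples are.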
Qualitative Analysis
- Degree of Automation
- Support for Complex Objects
- Page Contents: Semistructured data or text
- Ease of Use
- XML Output
- Support for Non-HTML Sources
- Resilience and Adaptiveness
Conclusions
Questions