1
Traditional Information Extraction: Summary
CS652, Spring 2004
2
Measurements
- Recall (100%: the whole truth)
- Precision (100%: nothing but the truth)
- F-Measure
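The three measures above can be made concrete with a small sketch. It assumes extraction results are compared as sets of (slot, value) pairs and that F-Measure means the usual weighted harmonic mean of precision and recall; the function name `precision_recall_f` and the sample data are hypothetical, not from the slides.

```python
def precision_recall_f(extracted, gold, beta=1.0):
    """Score one extraction run against a hand-labeled answer key.

    extracted, gold: iterables of hashable items, e.g. (slot, value) pairs.
    beta: weight of recall relative to precision (beta=1 gives F1).
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)

    # Precision: how much of what was extracted is right.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: how much of what should have been found was found.
    recall = correct / len(gold) if gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 4 facts in the answer key, 3 extracted, 2 correct.
gold = {("name", "Ann"), ("phone", "555-1234"), ("city", "Provo"), ("zip", "84602")}
extracted = {("name", "Ann"), ("phone", "555-1234"), ("city", "Orem")}

p, r, f = precision_recall_f(extracted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
# precision=0.667 recall=0.500 F1=0.571
```

Exact set matching is the strictest possible scoring; a real evaluation might instead give partial credit for overlapping values.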
3
RAPIER
Techniques: NLP, inductive learning
Advantages:
- Can handle free text
- Somewhat robust because of natural language
- Wrapper generation is automatic, given training data
Disadvantages:
- Cannot handle fully structured data (needs NLP tokens)
- Requires labeling training data
- Single-slot algorithm (but just for RAPIER)
4
RoadRunner
Technique: regular expression generation from patterns by direct comparison
Advantages:
- Fully automatic (when it works)
- Very low cost for wrapper maintenance (but without automatic detection)
Disadvantages:
- Requires well-formatted HTML pages (not manually created, and must be fully tagged)
- Requires a new wrapper for every different Web site
- Human must label data semantics
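The idea of generating a regular expression by directly comparing pages can be illustrated with a deliberately simplified sketch. This is not the actual RoadRunner algorithm, which also discovers optional and repeated structures. It only aligns two token sequences with Python's `difflib.SequenceMatcher`, keeps the tokens both pages share as literals, and turns each mismatch into a capture group. The names `tokenize` and `infer_wrapper` and the sample pages are hypothetical.

```python
import re
from difflib import SequenceMatcher

# A token is either a whole HTML tag or a run of non-space text.
TOKEN = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    return TOKEN.findall(html)


def infer_wrapper(page_a, page_b):
    """Generalize two sample pages from the same site into one regex.

    Tokens that align in both pages are treated as template and kept
    literally; every region where the pages differ becomes a data field.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    parts = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            parts.extend(re.escape(tok) for tok in a[i1:i2])
        else:
            # replace / insert / delete: the pages disagree here
            parts.append("(.*?)")
    # Allow arbitrary whitespace between consecutive tokens.
    return re.compile(r"\s*".join(parts), re.DOTALL)


page_a = "<html><b>Name:</b> John Smith <i>Provo</i></html>"
page_b = "<html><b>Name:</b> Ann Lee <i>Orem</i></html>"
wrapper = infer_wrapper(page_a, page_b)

# Apply the inferred wrapper to a third page from the same (hypothetical) site.
page_c = "<html><b>Name:</b> Mary Jones <i>Lehi</i></html>"
print(wrapper.match(page_c).groups())
# ('Mary Jones', 'Lehi')
```

The capture groups come out unnamed, which mirrors the last disadvantage listed above: the comparison finds where the data is, but a person still has to say what each field means.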
5
Hidden Markov Model (HMM)
Technique: create probabilistic automata and probabilities of words w.r.t. each class; determine how to traverse the automata
Advantages:
- Robust: works well for the domain, across all pages
- Works with free text
Disadvantages:
- Huge volume of training data needed
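One common way to "determine how to traverse the automata" is Viterbi decoding, which finds the most probable sequence of classes for a sequence of words. The sketch below assumes that choice; the slides do not name a decoding algorithm. The two states, the vocabulary, and all probabilities are made-up toy values. In practice they would be estimated from labeled training data, which is where the large data requirement comes from.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable state (class) for each word.

    Works in log space to avoid underflow. Words a state has never
    emitted get the small probability `unseen` (a crude smoothing
    assumption). All start and transition probabilities must be > 0.
    """
    def emit(state, word):
        return math.log(emit_p[state].get(word, unseen))

    # best[t][s] = (log prob of best path ending in s at time t, previous state)
    best = [{s: (math.log(start_p[s]) + emit(s, words[0]), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + emit(s, word), prev)
        best.append(column)

    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model (all numbers are illustrative assumptions).
states = ["other", "speaker"]
start_p = {"other": 0.9, "speaker": 0.1}
trans_p = {
    "other": {"other": 0.8, "speaker": 0.2},
    "speaker": {"other": 0.4, "speaker": 0.6},
}
emit_p = {
    "other": {"seminar": 0.3, "by": 0.3, "at": 0.2, "noon": 0.2},
    "speaker": {"dr": 0.4, "smith": 0.3, "jones": 0.3},
}

words = ["seminar", "by", "dr", "smith", "at", "noon"]
print(list(zip(words, viterbi(words, states, start_p, trans_p, emit_p))))
# [('seminar', 'other'), ('by', 'other'), ('dr', 'speaker'),
#  ('smith', 'speaker'), ('at', 'other'), ('noon', 'other')]
```

Words tagged `speaker` would fill the corresponding slot; because the model relies on word and transition statistics rather than page layout, the same model can be applied to free text.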
6
Lixto
Technique: by-example generation with lots of user input
Advantages:
- Ease of use with no training
- Commercial
Disadvantages:
- Not robust: sensitive to page changes
- Does not work with free text
7
BYU Ontos
Technique: ontology-driven
Advantages:
- Robust when pages change
- Adapts to Web pages from different sources, as long as they are for the same domain
- Works well when (multiple) record information is easy to locate
Disadvantages:
- Needs all record information together (or must preprocess information to generate record information); e.g., structured tables are hard
- Manual ontology construction
8
Wrapper Maintenance
Not a wrapper
Pages change & wrappers break
Can fix, maybe?!
Meng et al.'s approach
- Advantages: works for simple structural changes
- Disadvantages: hard to do in general
May be easier to make wrappers robust