WP3: FE Architecture Progress Report CROSSMARC Seventh Meeting Edinburgh 6-7 March 2003 University of Rome “Tor Vergata”

WHISK: a brief summary
WHISK is a wrapper induction algorithm that handles text ranging from highly structured to free text.
It learns rules in the form of regular-expression-like patterns, which can perform either single-slot or multi-slot extraction.
Every rule is a sequence of the following elements:
a text token;
the “*” symbol, used as a wildcard;
a semantic class;
any of the above enclosed in parentheses (extraction delimiters).

WHISK: a brief summary (2)
Input: a set of hand-tagged instances. Each instance is considered in turn as a “seed instance”, while the rest of the input is used as the training set.
WHISK induces rules top-down:
first, find the most general rule covering the seed;
then, extend the rule by adding terms one at a time;
test the results on the training set.
The metric used to select a new term is the Laplacian expected error of the rule, $\mathrm{Laplacian} = \frac{e + 1}{n + 1}$, where $n$ is the number of extractions made on the training set and $e$ is the number of errors among those extractions.
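The selection metric above can be sketched in a few lines of Python. This is only an illustration: the function name, the candidate rule names, and the counts are hypothetical and do not come from the CROSSMARC implementation.

```python
def laplacian_expected_error(extractions: int, errors: int) -> float:
    """Laplacian expected error (e + 1) / (n + 1) of a rule.

    extractions: n, number of extractions made on the training set.
    errors: e, number of those extractions that are wrong.
    """
    if extractions < 0 or errors < 0 or errors > extractions:
        raise ValueError("need 0 <= errors <= extractions")
    return (errors + 1) / (extractions + 1)


# Hypothetical candidate extensions of a rule, as (n, e) pairs.
candidates = {
    "rule_a": (10, 2),  # (2 + 1) / (10 + 1)
    "rule_b": (3, 0),   # (0 + 1) / (3 + 1)
    "rule_c": (0, 0),   # a rule that extracts nothing scores 1.0
}

# Keep the extension with the lowest expected error.
best = min(candidates, key=lambda name: laplacian_expected_error(*candidates[name]))
```

Note that a rule with no extractions gets the worst possible score, so the metric does not reward rules that avoid errors simply by never firing.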

WHISK rule representation
Pattern:: * (Digit) ‘ BR’ * ‘$’ (Number)
Output:: Rental {Bedrooms $1} {Price $2}
Applied to:
Capitol Hill – 1 br twnhme. fplc D/W W/D. Undrgrnd pkg incl $675. 3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. (206) 999-9999 (This ad last ran on 08/03/97.)
The wildcard “*” is limited to the nearest match.
The rule is re-applied to the remaining input after a successful match.
When a rule fails, filled slots are retained and the rule restarts on the remaining input, which avoids unlimited backtracking.
If a non-wildcard element follows an extraction slot, it must be matched for the extraction to succeed.
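As a rough illustration, the rental rule above can be approximated with an ordinary regular expression in Python. This is not WHISK itself: the semantic classes Digit and Number are reduced to `\d` and `\d+`, case-insensitive matching is an assumption, the "nearest match" wildcard is mimicked with a non-greedy `.*?`, and re-application to the remaining input is mimicked with `finditer`. The names `RENTAL_RULE` and `apply_rule` are hypothetical.

```python
import re

AD_TEXT = (
    "Capitol Hill – 1 br twnhme. fplc D/W W/D. Undrgrnd pkg incl $675. "
    "3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995. "
    "(206) 999-9999 (This ad last ran on 08/03/97.)"
)

# * (Digit) ' BR' * '$' (Number), approximated as a regex.
# The leading wildcard is implicit in the left-to-right search.
RENTAL_RULE = re.compile(r"(\d) br.*?\$(\d+)", re.IGNORECASE | re.DOTALL)


def apply_rule(text: str) -> list[dict[str, str]]:
    """Apply the rule repeatedly, restarting after each successful match."""
    return [
        {"Bedrooms": match.group(1), "Price": match.group(2)}
        for match in RENTAL_RULE.finditer(text)
    ]


rentals = apply_rule(AD_TEXT)
# rentals == [{"Bedrooms": "1", "Price": "675"}, {"Bedrooms": "3", "Price": "995"}]
```

The phone number and the date at the end of the ad produce no extraction, because no digit there is followed by " br".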

WHISK implementation
[Architecture diagram. Modules: WHISK_Corpus_Feeder, WHISK_Trainer, WHISK_Main, WHISK_Evaluator. Data: Training Set and Testing Set (XHTML pages, FE surrogates); WHISK Train Set and WHISK Test Set (tokenized XML text, product descriptions); RuleSet (WHISK training); WHISK Output (product descriptions, WHISK testing); Evaluation Results.]

WHISK implementation
WHISK_Corpus_Feeder module: prepares the training set for the learning and evaluation modules.
It merges web pages and FE surrogate files into XML files that carry NE and PDemarcator information as additional XML tags.
It tokenizes these XML files and produces Prolog lists of the tokens.
It extracts the target structured product descriptions from the FE surrogate files; these are used to calculate the Laplacian during training and to evaluate WHISK’s output during evaluation.
WHISK_Main module: the core implementation of the WHISK algorithm. It performs both training and testing.
WHISK_Evaluator module: evaluates product extractions on the testing set. Once it has obtained the extractions from WHISK_Main, it reports precision and recall statistics.
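The precision and recall figures reported by an evaluator of this kind can be sketched as follows. This is a minimal sketch, not the WHISK_Evaluator code: it assumes that each extraction is a (product, fact type, value) triple and that an extraction counts as correct only on an exact match with a target triple. The function name and the sample data are hypothetical.

```python
def precision_recall(extracted, targets):
    """Return (precision, recall) for a set of extractions.

    extracted: triples produced by the rules.
    targets: triples taken from the hand-made product descriptions.
    """
    extracted, targets = set(extracted), set(targets)
    correct = len(extracted & targets)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(targets) if targets else 0.0
    return precision, recall


# Hypothetical data: two of three extractions are correct,
# and two of three target facts are found.
extracted = {("p1", "ram", "128"), ("p1", "ram", "256"), ("p2", "ram", "64")}
targets = {("p1", "ram", "128"), ("p2", "ram", "64"), ("p3", "ram", "512")}
precision, recall = precision_recall(extracted, targets)
```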

WHISK improvements
Heavy semanticisation of WHISK rule construction
Base_1 construction:
Original Base_1 construction: * (Semantic Class|Token)
Modified Base_1 construction: * (series of [Tokens|Semantic Classes])
Example:
WHISK STANDARD: * (128 Mb) [‘128 Mb’ is not a semantic class]
RTV WHISK: * (Number, Capacity_mb) [‘128’ and ‘Mb’ each belong to a specific semantic class]
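The difference between the two Base_1 constructions can be illustrated with a small Python sketch. The class names Number and Capacity_mb come from the example above; the lookup rules in `semantic_class` and both builder functions are hypothetical stand-ins for the real semantic resources.

```python
import re


def semantic_class(token: str):
    """Return a semantic class for a token, or None (toy lookup, assumed)."""
    if re.fullmatch(r"\d+", token):
        return "Number"
    if token.lower() == "mb":
        return "Capacity_mb"
    return None


def base_1_standard(target_tokens):
    """Original construction: the target is kept as a literal token sequence."""
    return "* (" + " ".join(target_tokens) + ")"


def base_1_semantic(target_tokens):
    """Modified construction: each token is replaced by its semantic class
    when one is known, giving a series of tokens and semantic classes."""
    terms = [semantic_class(token) or token for token in target_tokens]
    return "* (" + ", ".join(terms) + ")"


standard_rule = base_1_standard(["128", "Mb"])   # "* (128 Mb)"
semantic_rule = base_1_semantic(["128", "Mb"])   # "* (Number, Capacity_mb)"
```

The semantic form generalizes to other capacities such as ‘256 Mb’, which the literal form would not match.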

WHISK improvements
Heavy semanticisation of WHISK rule construction
Base_2 construction:
Original Base_2 construction: left_token_delimiter (*) right_token_delimiter
Modified Base_2 construction: Left_semantic_class_delimiter (*) Right_semantic_class_delimiter
Example:
SENTENCE: TFT
WHISK STANDARD Base_2: > ( * ) <
RTV WHISK Base_2: Term_Tag ( * ) Close_Term_Tag

WHISK improvements
Two different windows while adding terms
Adding terms one at a time and re-testing the rules every time a term is added is very expensive.
We therefore take candidate terms only from a window of a fixed number of tokens near the element to be extracted.
Once semantic classes are recognized, the effective size of this window (measured in semantic elements rather than simple tokens) may vary dramatically.
Dilemma: small window size versus large window size.
Many elements of semantic classes contain many tokens (example: html tags) = 21 tokens
Small window size: the window may cover too few semantic elements if it contains large semantic classes.
Large window size: may require too much processing time.

WHISK improvements
Two different windows while adding terms
Solution: use two windows, a Token Window and a Semantic Window.
First, a Token Window of token_window_size tokens is created near the element to be extracted.
The elements (tokens) in the Token Window are converted to semantic elements.
Then semantic_window_size of these semantic elements are considered when adding terms to the WHISK expression.
This gives a more stable window for rule improvement.
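The two-window scheme can be sketched in Python as follows. Only the parameter names token_window_size and semantic_window_size come from the slide. Everything else is an assumption made for illustration: the window is taken from the left context only, an HTML tag is the only multi-token semantic element, and the class names Html_Tag, Number and Token are hypothetical.

```python
def to_semantic_elements(tokens):
    """Convert tokens to (class, tokens) pairs; a '<' ... '>' run becomes
    a single Html_Tag element (toy grouping, assumed)."""
    elements = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "<" and ">" in tokens[i:]:
            end = tokens.index(">", i)
            elements.append(("Html_Tag", tokens[i:end + 1]))
            i = end + 1
        else:
            kind = "Number" if tokens[i].isdigit() else "Token"
            elements.append((kind, [tokens[i]]))
            i += 1
    return elements


def candidate_terms(tokens, target_index, token_window_size, semantic_window_size):
    """Return the semantic elements from which new rule terms may be taken."""
    # Step 1: Token Window, a fixed number of tokens left of the target.
    start = max(0, target_index - token_window_size)
    token_window = tokens[start:target_index]
    # Step 2: convert the tokens of the window to semantic elements.
    elements = to_semantic_elements(token_window)
    # Step 3: Semantic Window, the elements nearest to the target.
    if semantic_window_size <= 0:
        return []
    return elements[-semantic_window_size:]


tokens = ["<", "td", "class", "=", "spec", ">", "Memory", ":",
          "<", "b", ">", "128", "Mb"]
terms = candidate_terms(tokens, target_index=11,
                        token_window_size=11, semantic_window_size=3)
term_classes = [kind for kind, _ in terms]
```

Here the 11-token window collapses to four semantic elements, and the Semantic Window keeps the three closest to the target, so the number of candidate terms no longer depends on how many tokens each tag contains. A Token Window that starts in the middle of a tag would leave stray tokens; the sketch does not handle that case.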

WHISK improvements
WHISK_Corpus_Feeder: multiple product reference
If a single named entity refers to multiple product descriptions, we use the following PDemarcator syntax to represent this:
WHISK handles this form of the attribute and tests against every product referred to by the tag.
We propose agreeing on this syntax and using it in new PDemarcator releases.

WHISK evaluation
Some experiments
Experiments have been conducted using a very simple set of base rules (one generic rule for every fact type).
There was no mutual exclusion in the application of the different rules: every rule extracts all the elements it matches, regardless of whether they have already been extracted by other rules.

WHISK evaluation
Application of base rules, results:

RuleType                Precision    Recall
manufacturerName        1            0.875433
processorName           1            1
preinstalledOS          0.779923     0.966507
processorSpeed          0.532957     1
price                   1            1
dvdSpeed                0.0772128    0.854167
hdCapacity              0.499006     0.992095
ram                     0.500994     0.992126
screenSize              0.842809     0.936803
warranty                0.824324     0.709302
batteryLife             0.175676     0.866667
preinstalledSoftware    0.220077     1
width                   0.0501672    0.681818
modelName               1            0.972561
modemSpeed              0.301318     0.958084
batteryType             0.270358     0.873684
screenType              0.67101      0.944954
cdromSpeed              0.0885122    1
screenResolution        1            0.666667
weight                  1            0.86
height                  0.0568562    0.586207
depth                   0.0501672    0.681818

WHISK: conclusions
We have recently concluded the WHISK modifications (at least we hope so!) and started experiments.
ISSUE: multi-slot extraction requires too many training instances, because different product descriptions present different sequences of slots (in number and/or order). At least one new rule is extracted for every different combination of slots; given the large amount of variation characterizing every single slot, providing multi-slot extraction would be unfeasible.
We decided to work only with single-slot extraction, leaving to rule priorities and to mutual exclusion between rule extractions the task of obtaining coherent product descriptions.
FUTURE WORK:
Heavy semanticisation of the elements of the domain.
Pre-semanticisation of the entire corpora before training on them.
