DATESO, April 14th 2005. Multimedia Information extraction from HTML product catalogues. Martin Labský¹, Vojtěch Svátek¹, Pavel Praks², Ondřej Šváb¹.


1 DATESO, April 14th 2005
Multimedia Information extraction from HTML product catalogues
Martin Labský¹, Vojtěch Svátek¹, Pavel Praks², Ondřej Šváb¹
{labsky, svatek, rainbow.vse.cz
¹ Dept. of Information and Knowledge Engineering, Prague University of Economics
² Dept. of Applied Mathematics, Technical University of Ostrava

2 Agenda: Information Extraction from Internet; Annotation using Hidden Markov Models; Extracting images; Instance composition guided by ontology; Bicycle search application

3 IE from Internet
Motivation
–Semantic and structured search over large document collections
Requirements
–Identify relevant documents
–Perform automatic IE
Documents are semi-structured, with heterogeneous layouts and formatting
Example: searching for objects of type Bicycle in the price range €500 - €900
Find structures (name, price, equipment)

4 Our approach to IE
[Pipeline diagram: Acquire new document (HTML) → Preprocessing → Annotation using HMMs (token sequence w_1 w_2 ... w_n, in which w_3 w_4 are labeled name, w_6 w_7 price and w_9 picture) → Instance extraction (a Bicycle offer instance with name = w_3 w_4, price = w_6 w_7, picture = w_9)]

5 Relevant documents

6 Agenda: Information Extraction from Internet; Annotation using Hidden Markov Models; Extracting images; Instance composition guided by ontology; Bicycle search application

7 Preprocessing (Annotation using HMMs)
HTML cleanup
–conversion to valid XHTML
Only potentially relevant blocks kept
–blocks that do not directly contain text or images omitted
Formatting tags
–attributes removed
–several rules matching common constructions (add-to-basket form, choose-amount button)
Images
–baseline: all images treated as a single token
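Two of the preprocessing steps above (dropping tag attributes and mapping every image to one token) can be sketched with Python's standard `html.parser`. This is only an illustration under assumptions: the token spelling `<IMG>`, the class name `Tokenizer` and whitespace tokenization are hypothetical choices, and the XHTML cleanup, block filtering and form/button rules from the slide are not implemented here.

```python
from html.parser import HTMLParser

IMAGE_TOKEN = "<IMG>"  # hypothetical spelling of the single image token


class Tokenizer(HTMLParser):
    """Flattens HTML into tokens: attribute-free tags, words, image tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Baseline from the slide: every image becomes the same token.
            self.tokens.append(IMAGE_TOKEN)
        else:
            # Attributes are dropped; only the tag name is kept.
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag != "img":
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Assumption: plain whitespace tokenization of text content.
        self.tokens.extend(data.split())


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

For example, `tokenize('<p class="x">TREK Session <img src="a.jpg"></p>')` yields the tag, word and image tokens in document order, with the `class` and `src` attributes gone.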

8 Preprocessing – example
TREK Session 77 ( 2005 ) OUR PRICE £ Select Size
TREK Session 77 (2005) OUR PRICE £ Select Size

9 Document modeling using HMMs
Generative model
Document = [w_1 c_1] [w_2 c_2]
P([w_1 c_1] [w_2 c_2]) = P(c_1) P(c_2 | c_1) P(w_1 | c_1) P(w_2 | c_2)
(c_1, c_2) = argmax_{i,j} P([w_1 c_i] [w_2 c_j])
[Diagram: states c_1 and c_2 linked by transition probabilities P(c_2 | c_1) and P(c_1 | c_2), with lexical probabilities P(w_1 | c_1) and P(w_1 | c_2); w = word, c = class. Transition and lexical probabilities are estimated from training data (frequencies).]
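The generative model on this slide can be illustrated with a small bigram HMM: probabilities are relative frequencies from labeled sequences, and the best class sequence is found with the Viterbi algorithm instead of enumerating all class pairs. This is a sketch under assumptions, not the authors' implementation: the function names, the `floor` probability for unseen events and the toy class labels are all hypothetical.

```python
from collections import Counter
import math


def train(sequences):
    """Counts for relative-frequency estimates from [(word, class), ...] lists."""
    init, trans, trans_from = Counter(), Counter(), Counter()
    emit, class_count = Counter(), Counter()
    for seq in sequences:
        prev = None
        for word, cls in seq:
            class_count[cls] += 1
            emit[(cls, word)] += 1
            if prev is None:
                init[cls] += 1
            else:
                trans[(prev, cls)] += 1
                trans_from[prev] += 1
            prev = cls
    return init, trans, trans_from, emit, class_count, len(sequences)


def viterbi(words, model, floor=1e-6):
    """Most probable class sequence for a non-empty word list."""
    init, trans, trans_from, emit, class_count, n_seq = model
    classes = list(class_count)

    def lp(num, den):
        # Log relative frequency; unseen events get a small floor (assumption).
        return math.log(max(num / den if den else 0.0, floor))

    # best[c] = (log-probability of the best path ending in c, that path)
    best = {c: (lp(init[c], n_seq) + lp(emit[(c, words[0])], class_count[c]), [c])
            for c in classes}
    for w in words[1:]:
        new = {}
        for c in classes:
            score, path = max(
                (best[p][0] + lp(trans[(p, c)], trans_from[p]), best[p][1])
                for p in classes)
            new[c] = (score + lp(emit[(c, w)], class_count[c]), path + [c])
        best = new
    return max(best.values())[1]
```

Working in log space avoids numerical underflow on long documents; the floor is a crude stand-in for proper smoothing.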

10 HMM Structure
States
–adopted from [Freitag, McCallum 99]
–Target, Prefix, Suffix and Background
–densely connected
Class trigram model
–P(name | name_prefix, name)
Variations
–word n-gram models for lexical probabilities of target states: P(w_i | w_{i-1}, name)
–state substructures instead of single target states, learned by EM
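A class trigram model conditions each class on the two preceding classes, as in P(name | name_prefix, name). A minimal sketch of estimating such probabilities by relative frequency follows; the start padding symbol `<s>` and the function name are assumptions, and no smoothing is applied.

```python
from collections import Counter


def trigram_transitions(label_sequences):
    """Estimate P(c | a, b) for class trigrams (a, b, c) by relative frequency."""
    tri, ctx = Counter(), Counter()
    for labels in label_sequences:
        # Assumption: pad with two start symbols so the first classes have context.
        padded = ["<s>", "<s>"] + list(labels)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return {key: count / ctx[key[:2]] for key, count in tri.items()}
```

The returned dictionary maps `(a, b, c)` to the estimate of P(c | a, b), so the slide's example corresponds to the key `("name_prefix", "name", "name")`.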

11 Agenda: Information Extraction from Internet; Annotation using Hidden Markov Models; Extracting images; Instance composition guided by ontology; Bicycle search application

12 Extracting Images
Baseline
–every image represented by the same token
–HMM only extracts product images based on context, e.g. P(product_picture | name, product_picture_prefix)
Use image classifier to preprocess images
–classifies into 3 classes – Pos, Neg, Unk
–before HMM annotation, each image occurrence in document is substituted by its class
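The substitution step can be sketched as follows. The token representation (strings for words, dictionaries for images), the token spelling `<IMG_Pos>` and the `classify` callable are hypothetical; the slide only states that each image occurrence is replaced by its predicted class before HMM annotation.

```python
def substitute_images(tokens, classify):
    """Replace each image with a class-specific token (Pos, Neg or Unk).

    tokens   -- list of str (words/tags) and dict (images), an assumed format
    classify -- callable mapping an image dict to "Pos", "Neg" or "Unk"
    """
    out = []
    for tok in tokens:
        if isinstance(tok, dict):
            out.append(f"<IMG_{classify(tok)}>")
        else:
            out.append(tok)
    return out
```

With this change the HMM sees three distinct image symbols instead of one, so its lexical probabilities can favor `<IMG_Pos>` in the product picture state.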

13 Image Classification – Features
Image size
–estimated 2-dimensional normal distribution from a set of 1000 unique bicycle images → N_C(x, y)
–estimated decision threshold (1-feature binary classifier) using held-out set of 150 images (60% positive)
Image similarity
–latent semantic similarity [Praks 2004] → sim(I_1, I_2)
–estimated decision threshold for 1-feature binary classifier
Does the image repeat in document?
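The image size feature can be illustrated by fitting a 2-dimensional normal distribution to (width, height) pairs and choosing a decision threshold on a held-out set. This is a sketch under assumptions: the maximum-likelihood covariance estimate, the use of log-density as the score and accuracy as the threshold criterion are choices made here, since the slide does not say how the threshold was estimated.

```python
import math


def fit_gaussian_2d(sizes):
    """Maximum-likelihood mean and covariance for (x, y) pairs."""
    n = len(sizes)
    mx = sum(x for x, _ in sizes) / n
    my = sum(y for _, y in sizes) / n
    sxx = sum((x - mx) ** 2 for x, _ in sizes) / n
    syy = sum((y - my) ** 2 for _, y in sizes) / n
    sxy = sum((x - mx) * (y - my) for x, y in sizes) / n
    return mx, my, sxx, syy, sxy


def log_density(params, x, y):
    """Log of the bivariate normal density N_C(x, y)."""
    mx, my, sxx, syy, sxy = params
    det = sxx * syy - sxy ** 2  # must be positive (non-degenerate data)
    dx, dy = x - mx, y - my
    # Quadratic form using the closed-form inverse of the 2x2 covariance.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * q - math.log(2 * math.pi) - 0.5 * math.log(det)


def best_threshold(scores, labels):
    """Threshold t maximizing held-out accuracy of the rule: score >= t."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == l for s, l in zip(scores, labels)) / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```

An image is then classified as a product picture when `log_density(params, width, height)` is at or above the chosen threshold.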

14 Image Classification
Combined binary classifier
–Multi-layer perceptron (Weka)
–Features: N_C(x, y), sim_C(I), repeats(I)
Performance of binary classifiers
–10-fold cross-validation, document-level folds
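Document-level folds mean that all images from one document fall into the same fold, so no page contributes to both training and test data. A minimal sketch of such a split is shown below; the round-robin assignment and the function name are assumptions, and the slide does not describe how documents were distributed over folds.

```python
def document_level_folds(image_docs, k=10):
    """Group image indices into k folds so that each document stays in one fold.

    image_docs -- list where image_docs[i] is the document id of image i
    """
    docs = sorted(set(image_docs))
    # Assumption: documents are assigned to folds round-robin.
    fold_of_doc = {doc: i % k for i, doc in enumerate(docs)}
    folds = [[] for _ in range(k)]
    for idx, doc in enumerate(image_docs):
        folds[fold_of_doc[doc]].append(idx)
    return folds
```

Each fold in turn serves as the test set while the classifier is trained on the images in the remaining folds.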

15 Annotation Results
Combined ternary classifier
–outputs Pos, Unk, Neg
–decision list based on predictions of all 3 single-feature ternary classifiers

16 Agenda: Information Extraction from Internet; Annotation using Hidden Markov Models; Extracting images; Instance composition guided by ontology; Bicycle search application

17 Instance Composition
[Diagram: the document annotated by the HMM and the presentation ontology feed the instance extraction algorithm, which outputs instances (XML) stored in the Sesame RDF repository]

18 Instance Composition
[Figures: Domain ontology; Presentation Ontology]

19 Instance extraction algorithm
Sequentially parses the annotated document
Adds annotated attributes to the working instance WI
If adding an attribute would cause an inconsistency, an empty working instance is created. The old working instance is saved only if it is consistent.
WI = empty_instance;
while (more_attributes) {
  A = next_attribute;
  if (cannot_add (WI, A)) {
    if (consistent (WI)) {
      store (WI);
    }
    WI = empty_instance;
  }
  add (WI, A);
}
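The algorithm on this slide translates almost line by line into Python. In this sketch the two predicates are passed in as callables, because the slide leaves their definition to the presentation ontology; the example predicates in the usage note are purely hypothetical. One deliberate addition is marked in the code: the last working instance is also stored after the loop, which the slide's pseudocode does not show.

```python
def extract_instances(attributes, cannot_add, consistent):
    """Compose instances from a sequence of (attribute_name, value) pairs.

    cannot_add(wi, name, value) -- True if adding would make wi inconsistent
    consistent(wi)              -- True if wi is a complete, valid instance
    """
    stored = []
    wi = {}  # working instance, initially empty
    for name, value in attributes:
        if cannot_add(wi, name, value):
            if consistent(wi):
                stored.append(wi)
            wi = {}
        wi[name] = value
    # Addition not on the slide: keep the final working instance if consistent.
    if consistent(wi):
        stored.append(wi)
    return stored
```

As a usage example under assumed rules, `cannot_add = lambda wi, name, value: name in wi` treats every attribute as single-valued, and `consistent = lambda wi: {"name", "price"} <= wi.keys()` requires a name and a price; an offer that has a name and picture but no price is then discarded when the next name appears.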

20 Agenda: Information Extraction from Internet; Annotation using Hidden Markov Models; Extracting images; Instance composition guided by ontology; Bicycle search application

21 Bicycle search application, powered by Sesame RDF DB

22 Future work
Learn to correct annotation errors
–use document structure to detect unlabeled attributes
–bootstrap from these new examples
–use ontology constraints on values (types, lists, regexps)
Population algorithm
–utilize scores for each annotated attribute
–augment presentation ontology with frequencies of attribute orderings
–use approximate name matching to identify instances
Improve search interface
–approximate name matching (word and char edit distance)
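Word-level and character-level edit distance can share one routine, since the Levenshtein recurrence works on any sequence. The sketch below is a generic illustration of that idea, not the planned implementation; unit costs for insertion, deletion and substitution are an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

Passing strings gives the character distance, e.g. `edit_distance("Sesion", "Session")`, while passing `name.split()` lists gives the word distance between two product names.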

23 Thank you! rainbow.vse.cz

