An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites
Center for E-Business Technology, Seoul National University


1 An Unsupervised Framework for Extracting and Normalizing Product Attributes from Multiple Web Sites
Tak-Lam Wong, Wai Lam, Tik-Shun Wong, The Chinese University of Hong Kong, SIGIR 2008
Nam, Kwang-hyun, Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea
Center for E-Business Technology, Seoul National University, Seoul, Korea

2 Copyright © 2009 by CEBT
Contents
• Introduction
• Problem Definition
• Model
• Inference Method
• Experimental Results
• Conclusions
• Discussion
IDS Lab Seminar

3 Introduction
• Motivation
[Two figures, each with a source credit; not captured in the transcript]

4 Introduction
• Information Extraction
  Prior knowledge about content
    – Sensor resolution
  Previously unseen attributes
    – Layout format: white balance, shutter speed
    – Mutual influence: light sensitivity

5 Introduction
• Attribute Normalization
  Samples of extracted text fragments from a page:
    – Cloudy, daylight, etc.
    – What do they refer to?
  A text fragment extracted from another page:
    – white balance auto, daylight, cloudy, etc.
  Attribute normalization
    – To cluster text fragments into the same group
    – Better indexing for product search
    – Easier understanding and interpretation

6 Introduction
• Existing Works
  Supervised wrapper induction
    – Needs training examples.
    – A wrapper learned from one Web site cannot be applied to other sites.
  Template-independent extraction (Zhu et al., 2007)
    – Cannot handle previously unseen attributes.
  Unsupervised wrapper learning (Crescenzi et al., 2001)
    – Extracted data are not normalized.

7 Introduction
• Contributions
  An unsupervised learning framework for jointly extracting and normalizing product attributes from multiple Web sites.
  Can extract an unlimited number of product attributes (Dirichlet process).
  Can visualize the semantic meaning of each product attribute.

8 Problem Definition (1)
• A product domain, e.g., the digital camera domain
• A set of reference attributes, e.g., “resolution”, “white balance”, etc.
  A special element represents “not-an-attribute”
• A collection of Web pages from any Web sites, each of which contains a single product
• Let x be any text fragment from a Web page

9 Problem Definition (2)
[Figure: a Web page excerpt in which the text fragment “White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom” is delimited by a line separator]

10 Problem Definition (3)
• A text fragment x carries four kinds of information, in this order: content information, layout information, target information, and attribute information
  e.g., x = (resolution 10,000,000 pixels, black and in small font size, 1, resolution)
• Information extraction
• Attribute normalization
• Joint attribute extraction and normalization
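The formulas that followed the three task names on this slide did not survive transcription. One way the tasks could be written, assuming the symbols C and L for content and layout information (T and A are the symbols the next slide uses for target and attribute information) and treating each task as a conditional probability, is sketched below. This is an illustrative reading, not the slide's recovered equations.

```latex
\begin{align*}
\text{Information extraction:} \quad & P(T \mid C, L) \\
\text{Attribute normalization:} \quad & P(A \mid C, L) \\
\text{Joint extraction and normalization:} \quad & P(T, A \mid C, L)
\end{align*}
```

Under this reading, the joint task infers both unknown fields at once instead of running extraction first and normalization second.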

11 Problem Definition (4)
• “White balance Auto, daylight, cloudy, tungsten, fluorescent, fluorescent H, custom”: T=1, A=“white balance”
• “Cloudy, daylight”: T=1, A=“white balance”
• “View larger image”: T=0, A=“not-an-attribute”
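The four-field view of a text fragment can be sketched as a small record type. The class name `TextFragment`, its field names, and the helper `group_by_attribute` are hypothetical and chosen only for illustration; the example values come from the slides, and layout is left as `None` where the slides do not state it.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

NOT_AN_ATTRIBUTE = "not-an-attribute"  # the special element from the slides


@dataclass(frozen=True)
class TextFragment:
    """Hypothetical container for the four kinds of information in a fragment."""
    content: str            # content information: the fragment's text
    layout: Optional[str]   # layout information (free-form here; None if unknown)
    target: int             # T: 1 if the fragment describes an attribute, else 0
    attribute: str          # A: the reference attribute the fragment refers to


fragments = [
    TextFragment("resolution 10,000,000 pixels",
                 "black and in small font size", 1, "resolution"),
    TextFragment("White balance Auto, daylight, cloudy, tungsten, "
                 "fluorescent, fluorescent H, custom", None, 1, "white balance"),
    TextFragment("Cloudy, daylight", None, 1, "white balance"),
    TextFragment("View larger image", None, 0, NOT_AN_ATTRIBUTE),
]


def group_by_attribute(items):
    """Collect the text of attribute-bearing fragments (T = 1) under their A value."""
    groups = defaultdict(list)
    for fragment in items:
        if fragment.target == 1:
            groups[fragment.attribute].append(fragment.content)
    return dict(groups)


groups = group_by_attribute(fragments)
```

Grouping by the attribute field is what normalization buys: the two differently worded white-balance fragments end up under one key, while the navigation text is dropped because its T value is 0.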

12 Model
[Figure: graphical model. Labels: Dirichlet process prior (infinite mixture model); N text fragments; S different Web pages; k-th component proportion; content information generation; target information generation; a set of layout distributions]
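The “k-th component proportion” label is consistent with the stick-breaking view of a Dirichlet process, although the slide does not spell out the construction. The following is a generic truncated stick-breaking sketch, not the authors' model: the function name, the concentration parameter `alpha`, and the truncation level are all assumptions made for illustration.

```python
import random


def stick_breaking(alpha, truncation, rng):
    """Draw the first `truncation` mixture proportions of a Dirichlet process.

    Each step breaks off a Beta(1, alpha) share of the stick that is left.
    Returns the proportions and the unassigned remainder of the stick.
    """
    remaining = 1.0
    weights = []
    for _ in range(truncation):
        share = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * share)    # proportion of this component
        remaining *= 1.0 - share             # what is left for later components
    return weights, remaining


weights, leftover = stick_breaking(alpha=1.0, truncation=20, rng=random.Random(0))
```

Because some stick always remains, further components can always be added, which is the sense in which the mixture can cover an unlimited number of attributes. A smaller `alpha` concentrates the mass on fewer components.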

13 Generation Process

14 Generation Process
• The joint probability of generating a particular text fragment given the model parameters
• Inference: intractable (exact computation is too expensive to carry out)

15 Variational Method
• Finding the true posterior distribution P is intractable
• Goal
  Design a tractable distribution Q that is as close to P as possible
• Kullback-Leibler (KL) divergence
  Since D(Q||P) ≥ 0, the log-likelihood is bounded below by a quantity that depends on Q, so the bound can be maximized over Q
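The step that follows from D(Q||P) ≥ 0 is the standard variational lower bound. It is written out here in generic notation, assuming x denotes the observed data and z the hidden variables (both symbols are assumptions, since the slide's own formulas were lost):

```latex
\begin{align*}
D\bigl(Q \,\|\, P\bigr)
  &= \mathbb{E}_{Q}\bigl[\log Q(z) - \log P(z \mid x)\bigr] \;\ge\; 0, \\
\log P(x)
  &= \underbrace{\mathbb{E}_{Q}\bigl[\log P(x, z) - \log Q(z)\bigr]}_{\mathcal{L}(Q)}
     + D\bigl(Q \,\|\, P\bigr), \\
\log P(x) &\ge \mathcal{L}(Q).
\end{align*}
```

Since log P(x) does not depend on Q, maximizing the bound over Q is the same as minimizing D(Q||P), and the bound involves only the joint distribution and Q, which is why it stays tractable when the posterior is not.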

16 Experiments
• We conducted experiments on four different domains:
  Digital camera: 85 Web pages from 41 different sites
  MP3 player: 96 Web pages from 62 different sites
  Camcorder: 111 Web pages from 61 different sites
  Restaurant: 29 Web pages from LA-Weekly Restaurant Guide
• In each domain, we conducted 10 runs of experiments.
• In each run, we randomly selected a Web page and used the attributes inside it as prior knowledge.

17 Evaluation on Attribute Normalization
• Baseline approach
  Agglomerative clustering
    – Considers only the text content of text fragments
• Evaluation metrics
  Recall (R)
  Precision (P)
  F1-measure (F)
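The slide names the metrics but not how a clustering is scored against the reference grouping. A common choice is pairwise counting, sketched below as an assumption and not as the paper's confirmed procedure; the function name `pairwise_prf` and the fragment identifiers are hypothetical.

```python
from itertools import combinations


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall and F1 between two clusterings.

    `predicted` and `gold` map a fragment id to a cluster label. A pair of
    fragments counts as a true positive when both clusterings group it together.
    """
    ids = sorted(set(predicted) & set(gold))
    tp = fp = fn = 0
    for a, b in combinations(ids, 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1  # grouped together by the system only
        elif same_gold:
            fn += 1  # grouped together in the reference only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up toy data: three white-balance fragments and one resolution fragment.
gold = {"f1": "white balance", "f2": "white balance",
        "f3": "white balance", "f4": "resolution"}
predicted = {"f1": 0, "f2": 0, "f3": 1, "f4": 1}
precision, recall, f1 = pairwise_prf(predicted, gold)
```

In the toy data the system splits one white-balance fragment off and merges it with the resolution fragment, so both precision and recall fall below 1.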

18 Results of Attribute Normalization

19 Visualize the Normalized Attributes
• The top five weighted terms in the ten largest normalized attributes in the digital camera domain

20 Evaluation on Attribute Extraction
• Surprisingly, in the restaurant domain, our framework achieves an F1-measure of 0.95, which is comparable to the supervised method (Muslea et al., 2001)

21 Conclusions
• Developed an unsupervised framework that simultaneously extracts and normalizes product attributes from Web pages collected from different sites.
• Developed a graphical model of the generation of text fragments in Web pages.
• Showed that content and layout information can work together to improve both extraction and normalization performance under our model.

22 Discussion
• Pros
  Good motivation and proposed solution
  Performance is good enough for real-world use.
• Cons
  Lacks explanation of the equations
  Some terms are used incorrectly

