Semi-automatic Product Attribute Extraction from Store Website

Presentation transcript:

Semi-automatic Product Attribute Extraction from Store Website. Yan Liu, Carnegie Mellon University, Sep 2, 2005, yanliu@cs.cmu.edu

Example from Dick’s Sporting Goods webpage: free text, product name, description, features, structured data.

Applications. Direct applications: product recommendation systems for customers; price estimates for auctions; sales amount prediction. More general applications: document organization; email prioritization; question answering; and many more text mining tasks.

Relationship with Previous Work. Information extraction: extract from documents salient facts about prespecified types of events, entities, or relationships; different from information retrieval. Previous work: finite state machines; sliding windows; sequential models, such as HMMs or CRFs; association and clustering. Major challenges: little training data (addressed by making better use of labeled and unlabeled data); unclear attribute definitions (addressed by making better use of user feedback through active learning).

Outline: introduction; general framework; detailed algorithms; experiment results; conclusion and discussion.

General Framework. Two stages: attribute identification, then name-value assignment. Example. Input (free text): 9.68-lb total weight (4.4-kg). Attribute identification (semi-supervised learning): 9.68-lb total weight (4.4-kg). Name-value assignment (statistical and grammatical association): 9.68-lb total weight (4.4-kg) yields weight: 9.68-lb, 4.4-kg; weight: CD-lb; weight: CD-kg; …. Feedback (active learning). Output: structured data.

Attribute Identification. Initial label acquisition: template matching; knowledge database. Semi-supervised learning: Yarowsky’s algorithm; co-training; co-EM; co-boosting; graph-based methods. Phrase identification: statistical associations between adjacent words; heuristic grammatical rules.

Attribute Identification (1): Initial Label Acquisition. Positive labels: template matching (templates extracted from data with a special format; noisy data); knowledge database (measurement units for length, weight, volume, etc.; material; country; color). Negative labels: partial stop word list.
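
The initial label acquisition step can be illustrated with a minimal Python sketch. It is not the author's implementation: the unit lists in UNIT_KB, the stop words in STOP_WORDS, the regular expression, and the function name seed_labels are all assumptions made for illustration. It treats a number followed by a known measurement unit as a positive seed and stop words as negative seeds.

```python
import re

# Hypothetical knowledge database: unit strings grouped by the attribute they measure.
UNIT_KB = {
    "weight": {"lb", "lbs", "kg", "oz", "g"},
    "length": {"in", "inch", "inches", "ft", "cm", "mm"},
    "volume": {"l", "ml", "gal"},
}

# Hypothetical partial stop word list used for negative seeds.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "with", "for", "up"}

# A number, an optional hyphen or space, then a unit: "9.68-lb", "12 inches".
VALUE_UNIT = re.compile(r"(\d+(?:\.\d+)?)[-\s]?([a-zA-Z]+)")


def seed_labels(text):
    """Return (positive, negative) seed labels for one free-text description."""
    positive = []
    for match in VALUE_UNIT.finditer(text):
        number, unit = match.group(1), match.group(2).lower()
        for attribute, units in UNIT_KB.items():
            if unit in units:
                positive.append((attribute, f"{number}-{unit}"))
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    negative = sorted({token for token in tokens if token in STOP_WORDS})
    return positive, negative


if __name__ == "__main__":
    print(seed_labels("9.68-lb total weight (4.4-kg)"))
    print(seed_labels("display team colors up to 12 inches"))
```

Seeds produced this way are noisy by design; the semi-supervised step is what is expected to generalize from them.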

Attribute Identification (2): Semi-supervised Learning. Co-training [Blum & Mitchell, 1998; Collins & Singer, 1999]. Separation of two views: contextual features; spelling features. Two kinds of features: stemmed words (Porter stemmer); POS tags (Brill’s tagger). Algorithm pseudocode.
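
As a rough illustration of co-training with two views, the Python sketch below alternates between a contextual view and a spelling view, letting each view label the unlabeled examples it is confident about so that the other view can learn from them. The count-based classifier, the confidence threshold, and the feature names in the demo are assumptions chosen to keep the example short; Blum & Mitchell's formulation uses probabilistic classifiers and adds a fixed number of examples per round.

```python
from collections import Counter


def train(view_examples):
    """Count how often each feature co-occurs with each label (a toy classifier)."""
    counts = {}
    for features, label in view_examples:
        for feature in features:
            counts.setdefault(feature, Counter())[label] += 1
    return counts


def predict(counts, features):
    """Return (label, confidence) by summing the per-feature label counts."""
    votes = Counter()
    for feature in features:
        votes.update(counts.get(feature, {}))
    if not votes:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())


def co_train(labeled, unlabeled, rounds=5, threshold=0.8):
    """labeled: [((context_feats, spelling_feats), label)]; unlabeled: [(context_feats, spelling_feats)]."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        added = False
        for view in (0, 1):  # 0 = contextual view, 1 = spelling view
            model = train([(x[view], y) for x, y in labeled])
            remaining = []
            for x in pool:
                label, confidence = predict(model, x[view])
                if label is not None and confidence >= threshold:
                    labeled.append((x, label))  # confident guess becomes training data
                    added = True
                else:
                    remaining.append(x)
            pool = remaining
        if not added:
            break
    return labeled, pool


if __name__ == "__main__":
    seeds = [((["weight"], ["has_digit"]), "value"), ((["the"], ["lowercase"]), "other")]
    pool = [(["weight"], ["has_hyphen"]), (["sold"], ["has_hyphen"])]
    print(co_train(seeds, pool))
```

In the demo, the contextual view labels the first pool item, which teaches the spelling view that has_hyphen signals a value, which in turn labels the second item.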

Attribute Identification (3): Phrase Identification. Difference from chunking: label propagation; category dependent. Statistical association: information gain; mutual information; Yule’s statistic. Example: display team colors up to 12 inches.
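
The association measures between adjacent words can be sketched in Python as follows. The slide does not say which form of Yule’s statistic is used; this sketch assumes Yule’s Q, (ad - bc) / (ad + bc), computed on a 2x2 contingency table of adjacent word pairs, alongside pointwise mutual information. The function names and the toy token sequence are illustrative.

```python
import math


def bigram_table(tokens, w1, w2):
    """2x2 contingency counts (a, b, c, d) over all adjacent token pairs.

    a: w1 followed by w2        b: w1 followed by another word
    c: another word before w2   d: neither
    """
    a = b = c = d = 0
    for first, second in zip(tokens, tokens[1:]):
        if first == w1 and second == w2:
            a += 1
        elif first == w1:
            b += 1
        elif second == w2:
            c += 1
        else:
            d += 1
    return a, b, c, d


def yules_q(a, b, c, d):
    """Yule's Q in [-1, 1]; 0.0 when the table is degenerate."""
    denominator = a * d + b * c
    return 0.0 if denominator == 0 else (a * d - b * c) / denominator


def pmi(a, b, c, d):
    """Pointwise mutual information (base 2) of the pair (w1, w2)."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + b) * (a + c)))


if __name__ == "__main__":
    tokens = "display team colors up to 12 inches display team colors".split()
    print(yules_q(*bigram_table(tokens, "team", "colors")))
    print(yules_q(*bigram_table(tokens, "team", "up")))
```

Pairs that score high, such as "team colors" in the toy sequence, are candidates for merging into one phrase; pairs that never co-occur score at the negative end.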

Name-value Assignment. Combination of three information sources. Semantic association: knowledge database; attribute name generation and pair assignment. Grammatical association: parse tree (Minipar); attribute name/value generation. Statistical association scores: Yule’s statistic (category dependent); pair assignment. Other association sources: WordNet.

User Feedback. Clustering-based active learning: clustering algorithm; novel attribute identification; merging and splitting attributes; better use of labeled examples. Clustering algorithm: sparse data problem; multiple clustering algorithms. Cluster selection: within-cluster coherence; novelty-based measurement.

User Feedback: Clustering Algorithm. Latent semantic indexing (LSI) [Deerwester et al., 1990]: singular value decomposition of the term-document matrix; mapping the words into hidden semantic concepts; similarity measure: cosine similarity. Clustering algorithms using CLUTO: K-means; bisecting K-means; agglomerative algorithms (single linkage, complete linkage, average linkage).
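
The three agglomerative linkage criteria can be compared with a small pure-Python sketch. It does not call CLUTO and does not perform the SVD; it assumes the input vectors are already concept-space vectors (for example, the output of LSI) and uses cosine similarity, as on the slide. The function names and the toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# With similarities (not distances), single linkage takes the closest pair (max)
# and complete linkage takes the farthest pair (min).
LINKAGE = {
    "single": max,
    "complete": min,
    "average": lambda sims: sum(sims) / len(sims),
}


def agglomerate(vectors, k, linkage="average"):
    """Merge the two most similar clusters until k remain; returns lists of indices."""
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[p], vectors[q])
                        for p in clusters[i] for q in clusters[j]]
                score = combine(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    concept_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    for name in LINKAGE:
        print(name, agglomerate(concept_vectors, 2, linkage=name))
```

The quadratic pair search is fine for a handful of vectors; a dedicated toolkit is the better choice at the scale of the dataset described in the experiment setup.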

User Feedback: Cluster Selection. Novel concepts: major difference from previous tasks; supervised novelty detection is difficult. Tradeoff between novelty and relevance: recently studied by the IR community [Carbonell and Goldstein, 1995; Zhang et al, 2003; Zhai et al, 2004]. Cluster selection criterion: maximal marginal relevance (MMR). Similarity measures: cosine similarity; KL-divergence.
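
A generic form of maximal marginal relevance can be sketched in Python. Each step picks the candidate that maximizes lam * relevance - (1 - lam) * redundancy, where redundancy is the highest similarity to anything already selected. How the slides define relevance for a cluster is not stated, so the query vector, the parameter name lam, and the use of cosine similarity for both terms are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mmr_select(candidates, query, k, lam=0.5):
    """Return the indices of up to k candidates chosen greedily by MMR."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(candidates[i], query)
        # Redundancy is 0 for the first pick, when nothing has been selected yet.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    cluster_centroids = [[1.0, 0.0], [0.99, 0.1], [0.5, 0.8]]
    query = [1.0, 0.2]
    print(mmr_select(cluster_centroids, query, k=2, lam=0.5))
    print(mmr_select(cluster_centroids, query, k=2, lam=1.0))
```

With lam=1.0 the selection reduces to ranking by relevance alone; lowering lam pushes the second pick toward the candidate least similar to the first.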

Outline: introduction; general framework; detailed algorithms; experiment results; conclusion and discussion.

Experiment Setup. Dataset: free text extracted from product descriptions on http://www.dickssportinggoods.com; subsets from two categories. Football (largest category): 52339 entries, 194273 words, 2926 predicted feature-value pairs. Tennis (medium category): 3840 entries, 12533 words, 419 predicted feature-value pairs. Evaluation measures: direct evaluation (precision on feature-value pairs); indirect evaluation in other applications (recommender systems).

Experiment Results: examples by step. Initial label acquisition; semi-supervised learning; phrase identification; semantic association; grammatical association; statistical association scores.

Experiment Results: human feedback. Examples by active learning. Sample files (link to file). Total labeling time of 5 minutes. Identified concepts: color, graphics, logo, design, fit, size, pocket, pad, set, adjustment, attachment, construction, strap.

Experiment Results: precision on the most frequent feature-value pairs. Most frequent 600 pairs. Assignment of 5 labels: fully correct, incorrect names, incorrect values, incorrect associations, nonsense. Human labeling took approximately 6 hours (thanks to Katharin and Marko). Results.
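
The direct evaluation can be expressed as a short Python sketch, assuming precision is the share of judged pairs labeled fully correct. The label strings and the judgments in the demo are placeholders for illustration, not results from the study.

```python
from collections import Counter

# The five judgment labels from the slide, as hypothetical string constants.
LABELS = (
    "fully correct",
    "incorrect name",
    "incorrect value",
    "incorrect association",
    "nonsense",
)


def precision(judgments):
    """Fraction of judged feature-value pairs labeled 'fully correct'."""
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    return counts["fully correct"] / total if total else 0.0


if __name__ == "__main__":
    # Placeholder judgments, not data from the experiments.
    toy = ["fully correct", "nonsense", "fully correct", "incorrect value"]
    print(precision(toy))
```

Keeping the four error labels separate, rather than collapsing them into one incorrect class, makes it possible to see whether errors come from name generation, value extraction, or pair assignment.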

Conclusion. Product attribute identification is a difficult task. Little training data: make use of labeled and unlabeled data through semi-supervised learning. Unclear attribute definitions: identify novel attributes through active learning. A framework of active learning combined with semi-supervised learning.

Text Learning Techniques. Text processing: stemming (Porter stemmer); POS tagging (Brill’s tagger); text chunking and parsing (Minipar); word semantics (WordNet, dependency-based thesaurus); latent semantic indexing (SVDPack). Machine learning: semi-supervised learning (co-training); active learning (MMR); classification (C4.5 decision tree, FOIL); clustering (K-means, CLUTO); information theory and statistical associations (information gain, Yule’s statistic).

Future Work. Associations of product attributes across categories or websites; more effective active learning algorithms; graphical models with application to information extraction.

Questions?