Large Scale Multimodal Automated Document Categorization in Ecommerce at Strata 2016, San Jose, CA.
Anurag Bhardwaj – Head, Quad Analytix
Sreeni Iyer – CTO/CIO/Founder @ Quad Analytix
INTRODUCTION
Quad Analytix Implementation @ Industrial Scale
Attempts to add structure to unstructured data:
- Politely execute distributed crawls across large swaths of the classic web and the social web.
- Semantic extraction from these crawled assets via machine-learning techniques, and via a crowd application that sits atop Odesk/Upwork and Mechanical Turk. Part of this extraction is to classify products/SKUs correctly into a canonical taxonomy.
- Data processing pipeline to ensure quality:
  - Normalize, e.g. Color: Blue, Teal, Azure -> Blue; Units: kg/oz./lb./pounds -> oz.; miles/km -> miles
  - Validate, ETL, and publish for analytical insights and an API
- SaaS app delivering visualizations:
  - For ecommerce marketers, merchants, manufacturers
  - For hedge funds
  - For consultants (Bain etc.)
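The normalization step above is essentially a lookup against canonical-value tables. A minimal sketch in Python, using the examples from the slide as illustrative mapping tables (Quad's real tables and unit-conversion logic are of course larger):

```python
# Hypothetical canonical-value tables, seeded from the slide's examples.
COLOR_CANON = {"blue": "Blue", "teal": "Blue", "azure": "Blue"}
UNIT_CANON = {"kg": "oz.", "oz.": "oz.", "lb.": "oz.", "pounds": "oz.",
              "miles": "miles", "km": "miles"}

def normalize_color(value: str) -> str:
    """Map a raw color string to its canonical color (unchanged if unknown)."""
    return COLOR_CANON.get(value.strip().lower(), value)

def normalize_unit(value: str) -> str:
    """Map a raw unit label to its canonical unit family."""
    return UNIT_CANON.get(value.strip().lower(), value)
```

Unknown values pass through unchanged, so unmapped attributes can later be flagged for validation or the crowd.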
DATA ACQUISITION/CONSUMPTION – The 4Vs
VARIETY – multiple data sources:
- WWW merchant sites: product/price/position; home page, department landing page, category landing page
- Social: Twitter, Facebook, promotion
- WWW aggregator sites; WWW deal sites; WWW opinions and reviews; other
VELOCITY: distributed/polite crawls at a specified crawl frequency
EXTRACTION:
- Semantic extraction of attributes: machine learning, humans
- Semantic extraction of meta-data: classification (via ML), type of page (via ML)
- Data processing pipeline: cleansing/normalization, validation/publication
VOLUME: N-dimensional hyper-cube via API, SaaS app
VERACITY: measure KPIs at each step + send a % to the crowd (Upwork, Mechanical Turk)
THE PROBLEM
What is the problem we seek to solve?
Adding structure to dirty, unstructured data: classify crawled products.
How so:
- The web (and hence the ecommerce portion of the web) is vast.
- No standardization exists across ecommerce sites in terms of taxonomic organization (levels and labels), quality/style of content (at category and product levels), or actual implementation.
If this content can be semantically understood, then we could solve several vexing use cases:
- Marketplace: categorize incoming product uploads into an existing ecommerce taxonomy.
- Canonical taxonomy: insight generation and comparisons need products to be normalized.
- Fix misclassification problems: our research has shown that typically anywhere from 10–20% of products are poorly classified, making them difficult to find.
Canonical Taxonomy Use Case: To unify the various merchant taxonomies
[Diagram] Merchant taxonomies (e.g. Merchant 1: Home > Kitchen Appliances > Cutlery / Blenders / Juicers / Coffee Makers …; Merchant N: Home > Kitchen Appliances > Knives / Blenders and Juicers / Coffee, Tea & Espresso / Espresso Appliances …) map onto a canonical taxonomy (Home and Kitchen > Electronic Appliances > Blenders, Juicers, Coffee Makers, Espresso Machines …) via 1:1, N:1, and 1:N label mappings.
CLASSIFICATION DETAIL
Classification Problem – What’s available to use
Multiple signals available on the product page:
- Text
  - Title (high amount of semantic content for the purpose of classification)
  - Breadcrumb (itself a classification label, based on a given merchant's specific taxonomy)
  - Product description (excluded – too noisy)
  - Product recommendations (excluded – too noisy)
- Image
  - Thumbnails (1 to many per page)
Quality of Signals @ Industrial Scale: several hundred merchants, several thousand item categories, several hundred million SKUs.
One big challenge is to generalize the approach to account for varying quality of signals across merchants and item categories.
- Title:
  - Good examples: "Blue Suede Floral Women's Desert Wedge Booties", "Chambray Women's Open Toe Alpargatas"
  - Misleading examples: "Denim Men's Classics" (is a shoe); "Crop kick pants" (yes, pants – but no gender signal)
  - Downright unusable examples: "Kupuna", "Mohalu" etc. (sandal model names at one merchant)
- Thumbnail images: often poor, especially in the Marketplace.
- Breadcrumbs: while most sites feature hierarchical breadcrumbs, some just do navigation and some have none at all.
PROGRESSION OF APPROACHES - BOW
Text Classification – Standard BOW (Bag-Of-Words)
Training data manually collected: example titles per item category (Item-Category-1: titles … Item-Category-N: titles).
- File processing: one training file per category – the category's own titles labeled 1, titles from other categories labeled 0.
- Text processing: basic steps (case normalization, punctuation removal, space tokenization), stop-word removal, stemming.
- Dictionary processing: build a dictionary of words, keeping only tokens with > K occurrences (the BoW vector).
- Classifier processing: SVM with linear kernel – the FINAL MODEL for the category.
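The text- and dictionary-processing steps above can be sketched in a few lines of pure Python. The stop-word list and the threshold K here are illustrative placeholders (the real pipeline also applies stemming and feeds the resulting vectors to a linear-kernel SVM):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "for", "with"}  # illustrative list

def preprocess(title):
    """Case normalization, punctuation removal, space tokenization, stop-word removal."""
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return [tok for tok in title.split() if tok not in STOP_WORDS]

def build_dictionary(titles, k=1):
    """Keep only tokens appearing more than k times across the training titles."""
    counts = Counter(tok for t in titles for tok in preprocess(t))
    return sorted(w for w, c in counts.items() if c > k)

def bow_vector(title, dictionary):
    """Count-based BoW vector over the dictionary (the SVM's input features)."""
    counts = Counter(preprocess(title))
    return [counts[w] for w in dictionary]
```

A title outside the dictionary's vocabulary simply contributes nothing, which is exactly the BoW weakness the next slides address.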
Issue with BoW: No Semantics
Example: "Running Shorts" vs. "Runner Rugs" – near-identical surface tokens, but entirely unrelated products.
PROGRESSION OF APPROACHES – WORD2VEC
Opportunity: Word2Vec (Mikolov et al., 2013)
Word2Vec: What it does for us
Explore semantics as relationships between words: a man is to a woman as a king is to a queen. Vectors "encode" these relationships.
What does word2vec do? It takes a text corpus as input and produces word vectors as output.
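The king/queen analogy can be illustrated with toy 2-D vectors. These are hand-crafted for the demo (axis 0 ≈ gender, axis 1 ≈ royalty); real word2vec vectors are learned from a corpus and are typically 100–300 dimensional:

```python
import math

# Hand-crafted toy vectors for illustration only (not learned embeddings).
VECS = {"man": (1.0, 0.0), "woman": (-1.0, 0.0),
        "king": (1.0, 1.0), "queen": (-1.0, 1.0)}

def cosine(a, b):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Word nearest to vec(a) - vec(b) + vec(c), i.e. 'a is to b as ? is to c'."""
    target = tuple(x - y + z for x, y, z in zip(VECS[a], VECS[b], VECS[c]))
    return max(VECS, key=lambda w: cosine(VECS[w], target))
```

Here king − man + woman lands exactly on queen, which is the "vectors encode relationships" claim in miniature.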
Word2Vec [Under-the-hood]
How does word2vec generate vectors? Words are "passed" through a neural-network model. It is NOT a deep network, but a shallow one. Words → Vectors.
Word2Vec [Under-the-hood]
How do these models work? Two architectures: continuous bag-of-words (CBOW) and continuous skip-gram (SKIP).
CBOW: predict a word given its context; faster training.
Example – 3-grams from the title "Hamilton Beach FlexBrew 49983 Drip Coffee Maker":
{ Hamilton Beach FlexBrew, Beach FlexBrew 49983, FlexBrew 49983 Drip, 49983 Drip Coffee, Drip Coffee Maker }
Word2Vec [Under-the-hood]
SKIP: predict the context given a word; slower training.
Example – 3-grams with up to 2 skips from "Hamilton Beach FlexBrew 49983 Drip Coffee Maker":
{ Hamilton Beach FlexBrew, Hamilton FlexBrew 49983, Hamilton 49983 Drip, Beach FlexBrew 49983, Beach 49983 Drip, Beach Drip Coffee, …, Drip Coffee Maker }
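The two window shapes above can be generated with one small function, following the standard k-skip-n-gram definition (ordered n-token subsets whose index span is at most n−1+k; k=0 gives plain contiguous n-grams):

```python
from itertools import combinations

def k_skip_ngrams(tokens, n, k):
    """All ordered n-token subsets of `tokens` spanning at most n-1+k positions.

    k=0 yields plain contiguous n-grams; k>0 allows up to k skipped positions.
    """
    return [tuple(tokens[i] for i in idxs)
            for idxs in combinations(range(len(tokens)), n)
            if idxs[-1] - idxs[0] <= n - 1 + k]
```

On the slide's title this gives the five contiguous trigrams for k=0, and for k=2 also skipped variants such as ("Hamilton", "FlexBrew", "Drip").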
Word2Vec [NN-Classification]
How do you use word2vec for classification (nearest neighbor)? Compute the distance (Word Mover's Distance, a variation of earth-mover distance) between the title and each item-category name from the taxonomy.
[Diagram] Title "Coffee Maker Black" → 3 × D matrix; category name "Women's Shoes" → 2 × D matrix; Word Mover Distance → similarity score = 0.1.
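A cheap sketch of this idea uses the "relaxed" word-mover distance – match each title word to its nearest category word and average – which is a common lower bound on the full earth-mover formulation. The 2-D word vectors here are hypothetical stand-ins for learned embeddings:

```python
import math

# Hypothetical 2-D word vectors for illustration only.
VEC = {"coffee": (1.0, 0.0), "maker": (0.9, 0.1), "black": (0.5, 0.5),
       "women": (0.0, 1.0), "shoes": (0.1, 0.9)}

def relaxed_wmd(title_words, cat_words):
    """Average distance from each title word to its nearest category word."""
    def dist(a, b):
        return math.dist(VEC[a], VEC[b])
    return sum(min(dist(t, c) for c in cat_words) for t in title_words) / len(title_words)

def classify(title_words, categories):
    """Pick the category name with the smallest relaxed WMD to the title."""
    return min(categories, key=lambda name: relaxed_wmd(title_words, categories[name]))
```

Because every title word only needs its nearest category word, this runs in O(|title| × |category|) per pair instead of solving a transport problem.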
Word2Vec [SVM-Classification]
How do you use word2vec for classification (SVM)? Train an SVM after "pooling" the vectors of all words from a given title.
[Diagram] Title "Coffee Maker Black" → 3 × D matrix → feature pooling → 1 × D matrix → SVM classifier.
Pooling options: average pooling, max pooling, Fisher-vector pooling.
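The first two pooling options reduce a T × D matrix of word vectors to a single 1 × D feature in one numpy call each (Fisher-vector pooling is omitted here, as it additionally requires a fitted Gaussian mixture model):

```python
import numpy as np

def average_pool(title_matrix):
    """T x D word-vector matrix -> 1 x D mean vector."""
    return title_matrix.mean(axis=0)

def max_pool(title_matrix):
    """T x D word-vector matrix -> 1 x D element-wise maximum."""
    return title_matrix.max(axis=0)
```

Either pooled vector then becomes one training row for the SVM, so titles of any length map to a fixed-size feature.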
Text Classification – W2Vec
Unsupervised word-vector embedding model M: K-dimensional vectors (e.g. walk, stroll, run land near one another).
Pipeline: training titles for Item-Cat-1 … Item-Cat-n → T × K matrix per title (T = number of tokens in the title) → Fisher-vector pooling → 1 × K matrix per title (an N × K matrix for N titles) → multi-class linear SVM.
FINAL MODEL: a single classifier for all categories.
Training statistics:
- # of unique product pages (input for model training): 1.45M
- # of words generated after HTML parsing: ~170M
- Total time taken: 12 hours (HTML parsing) + 4 hours (training)
- Vocabulary size: ~77K
- Vector size: 200-dimensional vector per word
- Classification data set: 3K titles
Word2Vec [Results]
Results are directionally positive:
- BoW: 57% accuracy
- Word2Vec + NN: 69% accuracy
- Word2Vec + SVM: 74% accuracy
Issues with Word2vec: Low-Quality Titles
PROGRESSION OF APPROACHES – IMAGES
Opportunity: Good Quality Images
Issues with Learning From Images (1/2)
Photometric invariance: brightness, exposure.
Issues with Learning From Images (2/2)
Geometric invariance: translation, rotation, scale.
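Why these invariances are hard can be shown in a few lines of numpy: a brightness shift or a one-pixel translation leaves the content identical to a human, yet moves the image far away in raw pixel space – which is what motivates learned, non-linear features in the next section:

```python
import numpy as np

img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0                  # a bright square on a dark background

brighter = np.clip(img + 0.3, 0, 1)  # photometric change: brightness shift
shifted = np.roll(img, 1, axis=1)    # geometric change: 1-pixel translation

def pixel_distance(a, b):
    """Euclidean distance between two images in raw pixel space."""
    return float(np.linalg.norm(a - b))
```

Both perturbed images sit at a substantial distance from the original even though they depict "the same square", so a raw-pixel nearest-neighbor classifier would treat them as different objects.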
Deep Learning: Learning non-linear features
Convolutional Neural Network – Full Network
Image Classification
- ImageNet: image database of 14M images organized via the WordNet hierarchy, where each node has > 5K images.
- AlexNet (in CAFFE): image processing provides a 224 × 224-pixel input → convolution and pooling layers (capturing basic shapes – curves, loops) → 7 × 7 × 512 activation volume → 2 fully connected layers and a last layer of output neurons → coarse-grained category classification.
- Quad-Net: trained on the Quad training set (training images for Item-Cat-1, Item-Cat-2, …: 100K images organized via the canonical Quad taxonomy, where each node has ~50–1K images) → fine-grained classification.
- Training time reduced from days to hours; number of images required reduced from 5,000 to 500.
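The two building blocks the pipeline above chains together – convolution and pooling – can be sketched in plain numpy. This is only the forward pass of one layer with a fixed kernel; in AlexNet/Quad-Net the kernels are learned in Caffe:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution (really cross-correlation, as in CNN frameworks)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with the given window size."""
    H, W = feature_map.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = feature_map[i * size:(i + 1) * size,
                                    j * size:(j + 1) * size].max()
    return out
```

Stacking many such layers (with learned kernels and non-linearities in between) is what produces the 7 × 7 × 512 activation volume the slide mentions.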
Image Classification [Results]
- Top-1 accuracy: 63%
- Top-5 accuracy: 91%
Most misclassifications are around gender signals. The Top-5 results are impressive, which suggests that combining them with other modalities should help.
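For reference, the Top-1/Top-5 numbers follow the standard top-k convention: a prediction counts as correct if the true label appears among the k highest-scoring classes. A minimal sketch:

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of examples whose true label is among the k highest scores.

    score_rows: list of {label: score} dicts, one per example.
    true_labels: the ground-truth label for each example.
    """
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if truth in top_k:
            hits += 1
    return hits / len(true_labels)
```

By construction Top-5 accuracy is always at least Top-1 accuracy, which is why the 63% vs. 91% gap leaves room for other modalities to break ties within the top 5.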
Issues with Image Classification: Fine-grained Categories
Example: "Women's Sneakers" vs. "Women's Hiking Shoes" – visually similar products in distinct fine-grained categories.
CLASSIFIER FUSION
Cascading Classifiers
Series architecture (BoW → Word2vec → ConvNet):
- Iterative filtering of classification candidates
- Need to decide the classification order
- Ill-suited for distributed processing
Parallel architecture (BoW, Word2vec, ConvNet → Fusion):
- Well-suited for distributed processing
- Robust to error propagation
- Need to combine results
Fusion Algorithm
Score-level classifier fusion: weighted aggregation across the K classifiers.
- Cons: requires score normalization; the classifiers have differing precision and recall; hard in practice to empirically arrive at a suitable weight for each classifier.
Decision-level classifier fusion: ignore scores and use only the predicted labels/responses.
- Majority voting: the label with the most votes is output as the final choice of the classifier-combination system.
  - Cons: can easily lead to biased results with a fusion of 3 or more classifiers where at least 2 classifiers are sub-optimal.
- Mutual agreement: if all classifiers agree on their final result, it is returned as the output; otherwise nothing is returned.
  - Cons: more restrictive – this strategy leads to lower recall but higher precision.
  - Pros: stable classification results irrespective of using a combination of sub-optimal classifiers in the first place.
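The two decision-level strategies reduce to a few lines each. In this sketch, mutual agreement returns None when the classifiers disagree – the abstain case that trades recall for precision (abstained items can then be routed to the crowd):

```python
from collections import Counter

def majority_vote(labels):
    """Label with the most votes across classifiers (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def mutual_agreement(labels):
    """The common label if all classifiers agree, else abstain (None)."""
    return labels[0] if len(set(labels)) == 1 else None
```

With three classifiers, majority voting always emits a label (possibly a wrong one backed by two weak classifiers), while mutual agreement emits nothing unless all three concur – exactly the precision/recall trade-off described above.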
Deployment Performance Metrics – Image Classification
Top-5 accuracy by weekly deployment:
- Before image classification: 132,792 tasks, 69% accuracy
- After image classification: 95,371 tasks, 84% accuracy
Deployment Performance Metrics – Image + Word2Vec Classification
Top-5 accuracy by weekly deployment:
- Before Word2Vec: 54,869 tasks, 84% accuracy
- After Word2Vec: 115,141 tasks, 93% accuracy
Decision Level Classifier Fusion – Mutual Agreement
Choices employed:
- Fusion-A: top-choice agreement between the BoW text classifier and the Word2Vec text classifier
- Fusion-B: top-choice agreement between the BoW text classifier and the ConvNets image classifier
- Fusion-C: Fusion-A or Fusion-B, i.e. (BOW && W2VEC) || (BOW && CNN)
Early results (a high-precision classifier):
Algorithm   Precision   Recall   F-Score
BoW         41.5%       100%     58.7%
Word2Vec    45.8%       100%     62.8%
ConvNet     39.8%       100%     57%
Fusion-A    92.7%       41.6%    57.4%
Fusion-B    94.7%       33.0%    48.9%
Fusion-C    92.2%       47.4%    62.6%
Take-Away Message: No one ring to rule them all.
- Exploit multiple sources of signals on the page.
- An integrated crowd approach enables large-scale, high-precision ecommerce classification.