TTI's Gender Prediction System using Bootstrapping and Identical-Hierarchy
Mohammad Golam Sohrab
Computational Intelligence Laboratory, Toyota Technological Institute
2015.05.20

Outline
 Introduction
 Original dataset
 Session augmentation
   Unique-ID decomposition
   Identical-Hierarchy
   Context window
 Text-to-vector representation
   Binary weighting
 Bootstrapping approach

Introduction
 Training and test dataset
   A single product-viewing log is composed of four columns:
   u10001, :02:14, :02:20, A00001/B00001/C00001/D00001/
    u10001  session ID
    :02:14  session.startTime  not used as features
    :02:20  session.endTime  not used as features
    A00001/B00001/C00001/D00001/  unique product ID  used as features
 Training and test dataset sizes: 15,000 (labeled), 15,000 (unlabeled)
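A minimal parsing sketch for one such log line, assuming comma-separated columns; the timestamps are treated as opaque strings because their full format is not preserved in this transcript, and the example timestamps below are hypothetical:

    # Parse one product-viewing log line into its four columns.
    def parse_log_line(line):
        session_id, start_time, end_time, unique_id = line.strip().split(",")
        path = [p for p in unique_id.split("/") if p]  # category path A/B/C/D
        return {"session_id": session_id,
                "start": start_time,   # session.startTime (not used as a feature)
                "end": end_time,       # session.endTime (not used as a feature)
                "path": path}          # unique-ID path, used as features

    # Hypothetical timestamps; the competition data's actual format may differ.
    line = "u10001,12:02:14,12:02:20,A00001/B00001/C00001/D00001/"
    print(parse_log_line(line)["path"])  # ['A00001', 'B00001', 'C00001', 'D00001']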

Session Augmentation Process
 Step 1: session augmentation using unique-ID decomposition
 Step 2: session augmentation using Identical-Hierarchy
 Step 3: session augmentation by generating history based on a context window:
   Session[i-2]  Session[i-1]  Session[i]  Session[i+1]  Session[i+2]

Session Augmentation: Unique-ID Decomposition
 Recall: training data
   u10001, :02:14, :02:20, A00001/B00001/C00001/D00001/
 To generate the text-to-vector representation, each unique ID A00001/B00001/C00001/D00001 can be decomposed into features using different combinations:
   uni-grams, bi-grams, tri-grams, and the full unique ID

Unique-ID Decomposition (cont.)
 Text-to-vector representation: uni-grams
   Each unique product ID in the data is decomposed into eight different features
   For the unique ID A00001/B00001/C00001/D00001:
   A00001, B00001, C00001, D00001, A00001-label, B00001-label, C00001-label, and D00001-label
 Adding more features
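A sketch of this uni-gram decomposition; note that the transcript does not define the "-label" suffix, so it is generated here as a literal string tag:

    # Uni-gram decomposition: one unique ID yields eight features, the four
    # category IDs plus their "-label" variants (suffix meaning unspecified
    # in the slides; reproduced literally here).
    def unigram_features(unique_id):
        parts = unique_id.strip("/").split("/")
        return parts + [p + "-label" for p in parts]

    print(unigram_features("A00001/B00001/C00001/D00001/"))
    # ['A00001', 'B00001', 'C00001', 'D00001',
    #  'A00001-label', 'B00001-label', 'C00001-label', 'D00001-label']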

Session Augmentation: Identical-Hierarchy
 First: generate the hierarchy
   The category hierarchy of A00001/B00001/C00001/D00001 is the chain
   A00001 → B00001 → C00001 → D00001

Second: Determining the Identical-Hierarchy
 Identical categories
   Product IDs that appear only in a certain category
   Compute the class space density in the female category, and
   compute the class space density in the male category
 Identical-Hierarchy
   The complete parent- and child-list of a certain identical category
   Identical-hierarchies are extracted from the training data
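A hedged sketch of the class-space-density step, under one plausible reading (the slide gives no formula): for each category, measure what fraction of its session occurrences carry each gender label, and call a category identical when all of its mass falls in a single class:

    from collections import Counter

    # sessions: iterable of (label, categories) with label in {"female", "male"}
    def class_space_density(sessions):
        counts = {}
        for label, categories in sessions:
            for c in set(categories):
                counts.setdefault(c, Counter())[label] += 1
        return {c: {lab: n / sum(cnt.values()) for lab, n in cnt.items()}
                for c, cnt in counts.items()}

    # A category is "identical" when it occurs in only one gender class.
    def identical_categories(density):
        return {c for c, d in density.items() if max(d.values()) == 1.0}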

Example Hierarchy
 Top nodes: A00001, A00002, …
 Intermediate nodes: B00001, B00002, …, B00091 and C00001, C00002, C00003, …, C00441
 Leaf nodes: D00001, D00002, D00003, …, D36122
 Training: 22,440 hierarchies
 Test: 22,304 hierarchies
 Training + Test: 36,731 hierarchies

Session Augmentation: Identical-Hierarchy
 Motivation
   Augment the training and test data with more features
 Why???
   Exchange information between training and test data through identical-hierarchies
 How???
   See the following slides

Analysis: Training Data based on the Hierarchy
 A00001/B00001/C00001/D00001
   A: most general categories  A00001 – A00011 (appear: all, missing: 0)
   B: sub-categories  B00001 – B00091 (appear: 86, missing: 5)
   C: sub-sub-categories  C00001 – C00441 (appear: 383, missing: 58)
   D: individual products  D00001 – D36122 (appear: 21,880, missing: 14,242)

Analysis: Test Data based on the Hierarchy
 A00001/B00001/C00001/D00001
   A: most general categories  A00001 – A00011 (appear: all, missing: 0)
   B: sub-categories  B00001 – B00091 (appear: 84, missing: 7)
   C: sub-sub-categories  C00001 – C00441 (appear: 392, missing: 49)
   D: individual products  D00001 – D36122 (appear: 21,739, missing: 14,383)

Building the Combined Hierarchy: Training + Test
 A00001/B00001/C00001/D00001
   A: most general categories  A00001 – A00011 (appear: all, missing: 0)
   B: sub-categories  B00001 – B00091 (appear: 91, missing: 0)
   C: sub-sub-categories  C00001 – C00441 (appear: 440, missing: 1)
   D: individual products  D00001 – D36122 (appear: 36,092, missing: 30)

Identical-Hierarchy based on the Combined Hierarchy
 Parent- and child-list of identical categories starting with 'B':
   A00003 → B00008 → {C00026, C00288, C00305}
 Parent- and child-list of identical categories starting with 'C':
   B00007 → C00025 → {D00889, D00892, D01583, D30012, D33674}

Why???
 The identical hierarchy B00007 → C00025 → {D00889, D00892, D01583, D30012, D33674} appears in the training data, and the same hierarchy appears in the test data

Adding Identical Categories from 'B'
 Session: A00003/B00008/C00026/D00070
 Extract the parent- and child-list from the Identical-Hierarchy:
   A00003 → B00008 → {C00026 (matched), C00288, C00305}
 Augmented session: A00003/B00008/C00026/D00070;C00288/C00305

Adding Identical Categories from 'C'
 Session: A00002/B00007/C00025/D00889
 Extract the parent- and child-list from the Identical-Hierarchy:
   B00007 → C00025 → {D00889 (matched), D00892, D01583, D30012, D33674}
 Augmented session: A00002/B00007/C00025/D00889;D00892/D01583/D30012/D33674
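A small sketch of this augmentation step under the reading above: for each category on the session's path that heads an identical-hierarchy, append its not-yet-viewed children after a ';'. The dictionary format is illustrative:

    # identical_hierarchy maps a category to its child list in the
    # Identical-Hierarchy, e.g. extracted from the combined hierarchy.
    def augment_session(path, identical_hierarchy):
        seen, extra = set(path), []
        for node in path:
            for child in identical_hierarchy.get(node, []):
                if child not in seen:
                    extra.append(child)
        base = "/".join(path)
        return base + ";" + "/".join(extra) if extra else base

    hierarchy = {"C00025": ["D00889", "D00892", "D01583", "D30012", "D33674"]}
    print(augment_session(["A00002", "B00007", "C00025", "D00889"], hierarchy))
    # A00002/B00007/C00025/D00889;D00892/D01583/D30012/D33674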

Session Augmentation: Generating History based on Window Size

Generating History: window size = 3
 For the current session:
   curSession.prevSession.endTime < curSession.startTime  build history
   curSession.endTime < curSession.nextSession.startTime  build history
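A hedged sketch of one reading of this step: with window size 3, a session inherits the features of its immediate neighbour on each side, provided their time spans do not overlap with the current session; the merge rule itself is an assumption, as the slide only states the two timestamp conditions:

    # sessions: chronologically ordered list of dicts with comparable
    # "start"/"end" timestamps and a "features" set.
    def build_history(sessions, i, window=3):
        cur = sessions[i]
        history = set(cur["features"])
        half = window // 2
        for j in range(max(0, i - half), min(len(sessions), i + half + 1)):
            if j == i:
                continue
            other = sessions[j]
            # prev session must end before the current starts; next session
            # must start after the current ends (the slide's two conditions)
            if (j < i and other["end"] < cur["start"]) or \
               (j > i and cur["end"] < other["start"]):
                history |= other["features"]
        return history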

Session Augmentation: Pros and Cons
 Pros
   Generates the text-to-vector representation for every session uniformly
   Increases the feature size
   Increases the system performance
 Cons
   Increases the system's computation time

Term Weighting
 Different weighting approaches
   Term frequency (TF)
   TF·IDF  IDF: inverse document frequency
   TF·IDF·ICSdF  ICSdF: inverse category space density frequency
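For reference, the standard TF·IDF definition, together with a hedged guess at ICSdF based only on its expansion above (the slides give no formula; N is the number of documents, df(t) the document frequency of term t, C the set of categories, and D_c the documents in category c):

    \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \log \frac{N}{\mathrm{df}(t)}
    \mathrm{ICSdF}(t) = \log \frac{|C|}{\sum_{c \in C} \mathrm{df}(t,c) / |D_c|}   % assumed formulation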

Term Weighting: Applied
 Binary weighting approach: a feature receives weight 1 if it occurs in the session and 0 otherwise
 Normalize the session vector
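A minimal sketch of binary weighting with normalization; the slide does not name the norm, so L2 is assumed here:

    import math

    def binary_vector(session_features, vocabulary):
        # 1/0 weighting over a fixed vocabulary, then L2 normalization (assumed).
        vec = [1.0 if f in session_features else 0.0 for f in vocabulary]
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec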

Bootstrapping: The Basic Idea
 Bootstrapping is a re-sampling method for estimating the precision of a sample statistic by using subsets of the available data
 In the re-sampling process, labels on data points are exchanged when performing significance tests

Bootstrapping Process
 Perform 4 iterations of re-sampling the data
 First iteration:
   Input: training data (15,000)
   10-fold cross-validation: 9 folds for training, 1 fold for development
   Build the training model
   Feed in the test data (15,000)
   Predict the test labels

Bootstrapping Process (cont.)
 Subsequent iterations:
   Input: training + test data (30,000)
   Assign labels: gold labels for the training part, predicted labels for the test part
   10-fold cross-validation: 9 folds for training, 1 fold for development
   Build the training model
   Feed in the test data (15,000)
   Obtain new predicted labels
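A condensed sketch of this loop, assuming dense feature matrices X_train, X_test and gold labels y_train already exist; scikit-learn's LinearSVC (built on LIBLINEAR) stands in for the classifier, and the 10-fold cross-validation used for development-set accuracy is omitted:

    import numpy as np
    from sklearn.svm import LinearSVC

    def bootstrap(X_train, y_train, X_test, iterations=4):
        # Iteration 0: train on the 15,000 gold-labeled sessions only.
        clf = LinearSVC().fit(X_train, y_train)
        y_test = clf.predict(X_test)
        # Iterations 1..n: retrain on gold + currently predicted test labels.
        for _ in range(iterations):
            X_all = np.vstack([X_train, X_test])
            y_all = np.concatenate([y_train, y_test])
            clf = LinearSVC().fit(X_all, y_all)
            new_labels = clf.predict(X_test)
            if np.array_equal(new_labels, y_test):  # unchanged, as in iteration 4
                break
            y_test = new_labels
        return y_test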

Classification: LIBLINEAR
 LIBLINEAR is a simple package for solving large-scale regularized linear classification
 Option parameters:
   -s 1  L2-regularized L2-loss support vector classification (dual)
   -c 1  cost parameter C
   -B 1  bias term
   -wi weight  set the parameter C of class i to weight*C; here weight = n_female/n_male
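Roughly the same configuration in scikit-learn, whose LinearSVC wraps LIBLINEAR; the class counts below are hypothetical, since the actual n_female/n_male values are not given on the slide:

    from sklearn.svm import LinearSVC

    n_female, n_male = 7500, 7500  # hypothetical counts, not from the slides
    clf = LinearSVC(
        penalty="l2",
        loss="squared_hinge",        # L2 loss, dual (-s 1)
        dual=True,
        C=1.0,                       # -c 1
        fit_intercept=True,
        intercept_scaling=1.0,       # -B 1
        class_weight={"female": n_female / n_male},  # -wi weight (class is a guess)
    )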

Results: Bootstrapping Approach with LIBLINEAR
 Iteration 0  Mean accuracy:  Accuracy for (female, male) = ,
 Iteration 1  Mean accuracy:  Accuracy for (female, male) = ,
 Iteration 2  Mean accuracy:  Accuracy for (female, male) = ,
 Iteration 3  Mean accuracy:  Accuracy for (female, male) = ,
 Iteration 4  Mean accuracy:  Accuracy for (female, male) = ,  (remained unchanged)

Final Results: Bootstrapping Approach with LIBLINEAR
 Predicted labels using bootstrapping
 Using the submission system: 85.47%
 Final result: %

Summary
 In this work:
   Session augmentation  Identical-Hierarchy, and generating conditional history using a context window
   Term weighting  binary weighting
   Re-sampling process  bootstrapping
   Classification problem  SVM classifier

!!! Thank you !!!