TTI's Gender Prediction System using Bootstrapping and Identical-Hierarchy Mohammad Golam Sohrab 2015.05.20 Computational Intelligence Laboratory Toyota.

1 TTI's Gender Prediction System using Bootstrapping and Identical-Hierarchy
Mohammad Golam Sohrab
2015.05.20
Computational Intelligence Laboratory, Toyota Technological Institute

2 Outline
• Introduction
  • Original dataset
• Session augmentation
  • Unique IDs decomposing
  • Identical-Hierarchy
  • Context window
• Text-to-vector representation
  • Binary weighting
• Bootstrapping approach

3 Introduction
• Training and test dataset
  • A single product viewing log is composed of four columns:
    u10001,2014-11-14 00:02:14,2014-11-14 00:02:20,A00001/B00001/C00001/D00001/
  • u10001 → session ID
  • 2014-11-14 00:02:14 → session.startTime (not a feature)
  • 2014-11-14 00:02:20 → session.endTime (not a feature)
  • A00001/B00001/C00001/D00001/ → unique ID (features)
• Training and test dataset sizes
  • 15,000 labeled (training), 15,000 unlabeled (test)
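A minimal sketch of reading one log line into its four columns. The `Session` class and `parse_log_line` are hypothetical names, and the ";" separator between several unique IDs in one session is an assumption (the slides only show ";" in the augmented IDs).

```python
from dataclasses import dataclass
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class Session:
    session_id: str
    start: datetime
    end: datetime
    product_paths: list  # one list of category levels per unique ID


def parse_log_line(line: str) -> Session:
    # Four comma-separated columns: session ID, start time, end time, unique IDs.
    session_id, start, end, ids = line.strip().split(",", 3)
    # Assumption: several unique IDs in one session are separated by ";".
    paths = [p.strip("/").split("/") for p in ids.split(";") if p.strip("/")]
    return Session(
        session_id,
        datetime.strptime(start, TIME_FORMAT),
        datetime.strptime(end, TIME_FORMAT),
        paths,
    )


session = parse_log_line(
    "u10001,2014-11-14 00:02:14,2014-11-14 00:02:20,A00001/B00001/C00001/D00001/"
)
print(session.session_id, session.product_paths)
```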

4 Session Augmentation Process
• Step 1: Session augmentation using unique IDs decomposition
• Step 2: Session augmentation using Identical-Hierarchy
• Step 3: Session augmentation by generating history based on a context window
[Figure: context window over Session [i-2], Session [i-1], Session [i], Session [i+1], Session [i+2]]

5 Session Augmentation: Unique IDs Decomposing
• Recall: training data
  • u10001,2014-11-14 00:02:14,2014-11-14 00:02:20,A00001/B00001/C00001/D00001/
• To generate the text-to-vector representation
  • Each unique ID can be decomposed into features using different combinations
  • A00001/B00001/C00001/D00001
  • Uni-gram, bi-gram, tri-gram
  • Unique

6 Unique IDs Decomposing (cont.)
• Text-to-vector representation: uni-gram
  • Each unique product ID in the data is decomposed into eight different features
  • For each unique ID A00001/B00001/C00001/D00001:
    A00001, B00001, C00001, D00001, A00001-label, B00001-label, C00001-label, and D00001-label
  • Adding more features
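One way the uni-gram, bi-gram and tri-gram decomposition could look, as a sketch only. It assumes that n-grams are contiguous runs of category levels, which the slides do not state, and it leaves out the "-label" features because their construction is not described. `level_ngrams` and `decompose` are hypothetical names.

```python
def level_ngrams(path, n):
    """Contiguous n-grams over the category levels of one unique ID."""
    return ["/".join(path[i:i + n]) for i in range(len(path) - n + 1)]


def decompose(unique_id, max_n=3):
    """Uni-gram up to max_n-gram features for an ID like 'A00001/B00001/C00001/D00001/'."""
    path = unique_id.strip("/").split("/")
    features = []
    for n in range(1, max_n + 1):
        features.extend(level_ngrams(path, n))
    return features


for feature in decompose("A00001/B00001/C00001/D00001/"):
    print(feature)
```

With four levels this yields four uni-grams, three bi-grams and two tri-grams.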

7 Session Augmentation: Identical-Hierarchy
• First: generate the hierarchy
  • The category hierarchy of A00001/B00001/C00001/D00001
[Figure: parent-child pairs A00001 → B00001, B00001 → C00001, C00001 → D00001]
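A sketch of generating the hierarchy as parent-to-children and child-to-parents maps from a list of unique IDs. `build_hierarchy` is a hypothetical name, and the second ID in the usage example is made up for illustration.

```python
from collections import defaultdict


def build_hierarchy(unique_ids):
    """Collect parent -> children and child -> parents links from ID paths."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for uid in unique_ids:
        path = uid.strip("/").split("/")
        # Each adjacent pair of levels is one parent-child link.
        for parent, child in zip(path, path[1:]):
            children[parent].add(child)
            parents[child].add(parent)
    return children, parents


children, parents = build_hierarchy([
    "A00001/B00001/C00001/D00001/",
    "A00001/B00001/C00002/D00002/",  # hypothetical second ID
])
print(sorted(children["B00001"]))
```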

8 Second: Determining the Identical-Hierarchy
• Identical categories
  • Product IDs that appear only in a certain category
  • Compute the class space density in the female category
  • Compute the class space density in the male category
• Identical-Hierarchy
  • The complete parent- and child-list of a given identical category
  • Identical-hierarchies are extracted from the training data
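A simplified sketch of finding identical categories in labeled training data. It assumes "a certain category" means one gender class, and it uses plain exclusivity (a node seen under only one label) in place of the class space density computation, whose formula is not given on the slide. `find_identical_categories` and the toy IDs are hypothetical.

```python
from collections import defaultdict


def find_identical_categories(labeled_sessions):
    """labeled_sessions: iterable of (label, [unique_id, ...]).

    Returns {node: label} for every node seen under exactly one label.
    """
    labels_seen = defaultdict(set)
    for label, unique_ids in labeled_sessions:
        for uid in unique_ids:
            for node in uid.strip("/").split("/"):
                labels_seen[node].add(label)
    return {
        node: next(iter(labels))
        for node, labels in labels_seen.items()
        if len(labels) == 1
    }


identical = find_identical_categories([
    ("female", ["A1/B1/C1/D1/"]),  # toy IDs, not from the dataset
    ("male", ["A1/B2/C2/D2/"]),
])
print(identical)
```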

9 Example Hierarchy
[Figure: example hierarchy tree]
• Top nodes: A00001, A00002, A00011, ...
• Intermediate nodes: B00001, B00002, B00003, ..., B00091; C00001, C00002, C00003, C00091, C00441, ...
• Leaf nodes: D00001, D00002, D00003, D00091, D36121, ..., D36122
• Training: 22,440 hierarchies
• Test: 22,304 hierarchies
• Training + Test: 36,731 hierarchies

10 Session Augmentation: Identical-Hierarchy
• Motivation
  • Augment the training and test data with more features
  • Why?
  • Exchange information between training and test data using the Identical-Hierarchy
  • How?

11 Analyze: Training Data Based on Hierarchy
• A00001/B00001/C00001/D00001
• A: most general categories
  • A00001 – A00011 (Appear: all, Missing: 0)
• B: sub-categories
  • B00001 – B00091 (Appear: 86, Missing: 5)
• C: sub-subcategories
  • C00001 – C00441 (Appear: 383, Missing: 58)
• D: individual products
  • D00001 – D36122 (Appear: 21880, Missing: 14242)

12 Analyze: Test Data Based on Hierarchy
• A00001/B00001/C00001/D00001
• A: most general categories
  • A00001 – A00011 (Appear: all, Missing: 0)
• B: sub-categories
  • B00001 – B00091 (Appear: 84, Missing: 7)
• C: sub-subcategories
  • C00001 – C00441 (Appear: 392, Missing: 49)
• D: individual products
  • D00001 – D36122 (Appear: 21739, Missing: 14383)

13 Building Combined Hierarchy: Training + Test
• A00001/B00001/C00001/D00001
• A: most general categories
  • A00001 – A00011 (Appear: all, Missing: 0)
• B: sub-categories
  • B00001 – B00091 (Appear: 91, Missing: 0)
• C: sub-subcategories
  • C00001 – C00441 (Appear: 440, Missing: 1)
• D: individual products
  • D00001 – D36122 (Appear: 36092, Missing: 30)
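The Appear/Missing counts on the three analysis slides can be reproduced with a small coverage check, sketched here under the assumption that IDs at each level run consecutively from 00001 to the largest index shown. `coverage` is a hypothetical helper and the observed sets are toy data.

```python
def coverage(prefix, max_index, observed_ids):
    """Return (appear, missing) for IDs prefix00001 .. prefix<max_index>."""
    universe = {f"{prefix}{i:05d}" for i in range(1, max_index + 1)}
    appear = len(universe & set(observed_ids))
    return appear, len(universe) - appear


# Toy observed sets; the real ones come from the training and test logs.
train_b = {"B00001", "B00002"}
test_b = {"B00002", "B00003"}

print(coverage("B", 4, train_b))           # training only
print(coverage("B", 4, test_b))            # test only
print(coverage("B", 4, train_b | test_b))  # combined hierarchy
```

Combining the two sets can only reduce the missing count, which matches the slides: for level D, 14242 IDs are missing from training and 14383 from test, but only 30 from the combined hierarchy.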

14 Identical-Hierarchy Based on Combined Hierarchy
• Parent- and child-list of identical categories starting with the letter ‘B’
  [Figure: A00003 → B00008 → C00026, C00288, C00305]
• Parent- and child-list of identical categories starting with the letter ‘C’
  [Figure: B00007 → C00025 → D00889, D00892, D01583, D30012, D33674]

15 Why?
[Figure: parent-child pairs B00007 → C00025, C00025 → D00089, C00025 → D00892, C00025 → D01583, C00025 → D30012, C00025 → D33674, next to the tree B00007 → C00025 → D00889, D00892, D01583, D30012, D33674; nodes are marked "Appears in Training" or "Appears in Test"]

16 Adding Identical Categories from ‘B’
• Original ID: A00003/B00008/C00026/D00070
• Extract the parent- and child-list from the hierarchy based on the Identical-Hierarchy
  • A00003 → B00008
  • B00008 → C00026, B00008 → C00288, B00008 → C00305
• Augmented ID: A00003/B00008/C00026/D00070;C00288/C00305
[Figure: A00003 → B00008 → C00026, C00288, C00305]

17 Adding Identical Categories from ‘C’
• Original ID: A00002/B00007/C00025/D00089
• Extract the parent- and child-list from the hierarchy based on the Identical-Hierarchy
  • B00007 → C00025
  • C00025 → D00089, C00025 → D00892, C00025 → D01583, C00025 → D30012, C00025 → D33674
• Augmented ID: A00002/B00007/C00025/D00089;D00892/D01583/D30012/D33674
[Figure: B00007 → C00025 → D00889, D00892, D01583, D30012, D33674]
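The augmentation on slides 16 and 17 can be sketched as appending the other children of any identical category that lies on an ID's path. `augment_with_identical` is a hypothetical name, and the input map is assumed to hold only identical categories with their child-lists in a fixed order.

```python
def augment_with_identical(unique_id, identical_children):
    """Append the other children of identical categories on the ID's path.

    identical_children: {identical_category: [child, ...]} (assumed input).
    """
    path = unique_id.strip("/").split("/")
    extras = []
    for parent, child in zip(path, path[1:]):
        for sibling in identical_children.get(parent, []):
            # Skip the child already on the path and avoid duplicates.
            if sibling != child and sibling not in extras:
                extras.append(sibling)
    base = "/".join(path)
    return base + ";" + "/".join(extras) if extras else base


identical_b = {"B00008": ["C00026", "C00288", "C00305"]}
print(augment_with_identical("A00003/B00008/C00026/D00070", identical_b))
```

For the ‘B’ example this produces the same augmented string as slide 16, with the two sibling categories appended after the ";".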

18 Session Augmentation: Generating History Based on Window Size

19 Generating History: Set window size = 3
• Current session:
  • curSession.prevSession.endTime < curSession.startTime → build history
  • curSession.endTime < curSession.nextSession.startTime → build history
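A sketch of the two time conditions above. How the window size maps onto neighbouring sessions is not spelled out on the slide; this sketch assumes the window counts the current session, so a window of 3 means one previous and one next session. `build_history` and the dict layout are hypothetical.

```python
from datetime import datetime


def build_history(sessions, i, window=3):
    """Collect IDs from neighbouring sessions that do not overlap session i.

    sessions: list of dicts with 'start', 'end', 'ids', sorted by start time.
    Assumption: window includes the current session (3 = prev, cur, next).
    """
    half = (window - 1) // 2
    cur = sessions[i]
    history = list(cur["ids"])
    for j in range(max(0, i - half), i):
        if sessions[j]["end"] < cur["start"]:  # prevSession.endTime < startTime
            history.extend(sessions[j]["ids"])
    for j in range(i + 1, min(len(sessions), i + half + 1)):
        if cur["end"] < sessions[j]["start"]:  # endTime < nextSession.startTime
            history.extend(sessions[j]["ids"])
    return history


def at(second):
    return datetime(2014, 11, 14, 0, 0, second)


sessions = [
    {"start": at(0), "end": at(10), "ids": ["p1"]},
    {"start": at(20), "end": at(30), "ids": ["p2"]},
    {"start": at(25), "end": at(40), "ids": ["p3"]},  # overlaps the middle session
]
print(build_history(sessions, 1))
```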

20 Session Augmentation: Pros and Cons
• Pros:
  • Generates the text-to-vector representation for a given session uniformly
  • Increases the feature size
  • Increases system performance
• Cons:
  • Increases the system's computational time

21 Term Weighting
• Different weighting approaches
  • Term frequency (TF)
  • TF.IDF
    • IDF → inverse document frequency
  • TF.IDF.ICSdF
    • ICSdF → inverse category space density frequency

22 Term Weighting: Applied
• Binary weighting approach
• Normalize the session
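A sketch of binary weighting followed by normalization. The slide does not say which norm is used; L2 normalization is assumed here. `binary_vector` and the toy vocabulary are hypothetical.

```python
import math


def binary_vector(features, vocabulary):
    """Sparse binary vector {feature_index: weight}, scaled to unit L2 norm.

    vocabulary: {feature: index}. Features outside the vocabulary are dropped.
    """
    present = {vocabulary[f] for f in set(features) if f in vocabulary}
    if not present:
        return {}
    # Every present feature has weight 1, so the L2 norm is sqrt(count).
    norm = math.sqrt(len(present))
    return {index: 1.0 / norm for index in present}


vocabulary = {"A00001": 1, "B00001": 2, "C00001": 3}  # toy vocabulary
print(binary_vector(["A00001", "B00001", "A00001", "D99999"], vocabulary))
```

Repeated features count once, which is what distinguishes binary weighting from term frequency.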

23 Bootstrapping: The Basic Idea
• Bootstrapping is a re-sampling method for estimating the precision of sample statistics by using subsets of the available data.
• The re-sampling process exchanges labels on data points when performing significance tests.

24 Bootstrapping Process
• Perform 4 iterations of re-sampling the data
• If first_iteration
  • Input: training data (15,000)
  • 10-fold cross-validation
    • 9 folds for training data
    • 1 fold for development data
  • Build training model
  • Provide test data (15,000)
  • Predict labels

25 Bootstrapping Process (cont.)
• If !first_iteration
  • Input: training + test (30,000)
  • Assign labels
    • Training: gold labels
    • Test: predicted labels
  • 10-fold cross-validation
    • 9 folds for training data
    • 1 fold for development data
  • Build training model
  • Provide test data (15,000)
  • New predicted labels
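The two bootstrapping slides can be sketched as a loop that retrains on gold-labeled training data plus the test data with its latest predicted labels. This is an illustration only: a toy per-feature voting classifier stands in for the real model, the 10-fold cross-validation step is omitted, and stopping early when predictions no longer change is an added assumption. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_vote_model(examples):
    """Toy stand-in classifier: count labels per feature."""
    counts = defaultdict(Counter)
    for features, label in examples:
        for feature in features:
            counts[feature][label] += 1
    return counts


def predict_vote(model, features, default):
    votes = Counter()
    for feature in features:
        votes.update(model.get(feature, {}))
    return votes.most_common(1)[0][0] if votes else default


def bootstrap(train, test, iterations=4, default="female"):
    """train: [(features, gold_label)], test: [features]."""
    # First iteration: train on the gold-labeled training data only.
    model = train_vote_model(train)
    predicted = [predict_vote(model, x, default) for x in test]
    for _ in range(iterations):
        # Later iterations: training (gold) + test (predicted labels).
        model = train_vote_model(train + list(zip(test, predicted)))
        new_predicted = [predict_vote(model, x, default) for x in test]
        if new_predicted == predicted:  # assumption: stop once labels settle
            break
        predicted = new_predicted
    return predicted


train = [(["a", "b"], "female"), (["c", "d"], "male")]
test = [["a", "x"], ["x", "y"], ["c"]]
print(bootstrap(train, test))
```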

26 Classification: LIBLINEAR
• LIBLINEAR is a simple package for solving large-scale regularized linear classification
• Option parameters:
  • -s 1 → L2-regularized L2-loss support vector classification
  • -c 1
  • -B 1
  • -wi weight: set the parameter C of class i to weight*C → nfemale/nmale
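A sketch of assembling these options into an argument list for LIBLINEAR's `train` program. The slide does not say which class label receives the nfemale/nmale weight, so `weighted_label` is left as a parameter; the counts and file names in the example are made up, and `liblinear_train_args` is a hypothetical helper.

```python
def liblinear_train_args(n_female, n_male, weighted_label, train_file, model_file):
    """Build the argument list: train -s 1 -c 1 -B 1 -w<label> <weight> ..."""
    weight = n_female / n_male  # class weight from the slide: nfemale/nmale
    return [
        "train",
        "-s", "1",  # L2-regularized L2-loss support vector classification
        "-c", "1",
        "-B", "1",
        f"-w{weighted_label}", f"{weight:g}",  # which label to weight is an assumption
        train_file,
        model_file,
    ]


# Hypothetical counts and file names, for illustration only.
print(" ".join(liblinear_train_args(300, 100, 2, "train.txt", "gender.model")))
```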

27 Results: Bootstrapping Approach with LIBLINEAR
• Iteration 0
  • Mean accuracy: 0.960156
  • Accuracy for (female, male) = 0.966761, 0.953551
• Iteration 1
  • Mean accuracy: 0.966785
  • Accuracy for (female, male) = 0.967530, 0.966040
• Iteration 2
  • Mean accuracy: 0.966834
  • Accuracy for (female, male) = 0.967188, 0.966480
• Iteration 3
  • Mean accuracy: 0.967122
  • Accuracy for (female, male) = 0.967444, 0.966800
• Iteration 4
  • Mean accuracy: 0.967122
  • Accuracy for (female, male) = 0.967444, 0.966800 (remains unchanged)
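The reported mean accuracy at each iteration equals the unweighted average of the female and male accuracies, which the following sketch checks against the numbers above. The function names are hypothetical, and the per-class computation from gold and predicted labels is shown only as one plausible way the per-class figures could be obtained.

```python
from collections import Counter


def per_class_accuracy(gold, predicted):
    """Fraction of correctly predicted items within each gold class."""
    totals, correct = Counter(), Counter()
    for g, p in zip(gold, predicted):
        totals[g] += 1
        correct[g] += int(g == p)
    return {label: correct[label] / totals[label] for label in totals}


def mean_accuracy(class_accuracies):
    """Unweighted mean over classes."""
    values = list(class_accuracies)
    return sum(values) / len(values)


# (female, male, reported mean) per iteration, copied from the slide.
reported = [
    (0.966761, 0.953551, 0.960156),
    (0.967530, 0.966040, 0.966785),
    (0.967188, 0.966480, 0.966834),
    (0.967444, 0.966800, 0.967122),
]
for female, male, mean in reported:
    assert abs(mean_accuracy([female, male]) - mean) < 1e-6
```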

28 Final Results: Bootstrapping Approach with LIBLINEAR
• Predicted labels using bootstrapping
  • Using the submission system → 85.47%
• Final result → 85.103191%

29 Summary
• In this work
  • Session augmentation
    • Identical-Hierarchy
    • Generating conditional history using a context window
  • Term weighting
    • Binary weighting
  • Re-sampling process
    • Bootstrapping
  • Classification problem
    • SVM classifier

30 Thank you!

