
1 Using Word Based Features for Word Clustering
The Thirteenth Conference on Language Engineering, 11-12 December 2013
Department of Electronics and Communications, Faculty of Engineering, Cairo University
Research Team: Farhan M. A. Nashwan, Prof. Dr. Mohsen A. A. Rashwan
Presented by: Farhan M. A. Nashwan

2 Contribution:
- Reduce the vocabulary size
- Increase recognition speed

3 Proposed Approach:
Generated word image → Preprocessing and word segmentation → Word grouping → Clustering → Groups and clusters for holistic recognition

4 Grouping:
- Extract subwords (PAWs: Pieces of Arabic Words)
- Extract dots and diacritics
- Use them to select the group

5 Grouping:
Generated word image → Preprocessing and word segmentation → Secondaries separation using contour analysis → Secondaries recognition using SVM → Grouping process → Groups

6 Grouping Example:
A grouping code is the triple (PAW count, down secondaries, upper secondaries):
- PAW=1, Down Sec.=2 & 1, Upper Sec.=2 → grouping code (1, 21, 2)
- PAW=3, Down Sec.=0, Upper Sec.=2 → grouping code (3, 0, 2)
- PAW=4, Down Sec.=1 & 1, Upper Sec.=1 & 2 → grouping code (4, 11, 12)
- PAW=3, Down Sec.=2, Upper Sec.=2 & 1 → grouping code (3, 2, 21)
- PAW=2, Down Sec.=0, Upper Sec.=2 → grouping code (2, 0, 2)
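The way these triples are formed can be sketched as below. The digit-concatenation rule for multiple secondaries (e.g. 1 & 1 → 11) is an assumption read off the slide's examples, and `grouping_code` is a hypothetical helper name, not the authors' code.

```python
def grouping_code(paw, down_secs, up_secs):
    """Form a grouping code (PAW count, down secondaries, upper secondaries).

    Multiple secondaries in one position are concatenated digit-wise,
    e.g. down secondaries 1 & 1 -> 11 (an assumption matching the
    examples on the slide); an empty list means 0.
    """
    join = lambda xs: int("".join(str(x) for x in xs)) if xs else 0
    return (paw, join(down_secs), join(up_secs))

# Examples from the slide:
print(grouping_code(1, [2, 1], [2]))     # (1, 21, 2)
print(grouping_code(4, [1, 1], [1, 2]))  # (4, 11, 12)
```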

7 Challenges:
- Sticking
- Sensitivity to noise
Treatments:
- Overlapping
- SVM
Grouping based on:
- PAWs
- Down secondaries
- Upper secondaries

8 Clustering:
- Complements the grouping step
- Uses the LBG algorithm
- Applied to groups containing many words
- Uses Euclidean distance
Pipeline: Groups → Feature extraction → Clustering using LBG → Clusters & groups
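As a rough illustration of this step, here is a generic Linde-Buzo-Gray (LBG) sketch with Euclidean distance: start from the global mean, repeatedly split each centroid, and refine with k-means passes. The function name, split factor, and iteration counts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lbg(features, n_clusters, eps=1e-3, n_iters=50):
    """LBG clustering sketch: grow the codebook by centroid splitting,
    refining assignments with Euclidean nearest-centroid updates."""
    codebook = features.mean(axis=0, keepdims=True)
    while len(codebook) < n_clusters:
        # Split: perturb each centroid in two opposite directions.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iters):
            # Assign each vector to its nearest centroid (Euclidean).
            d = np.linalg.norm(features[:, None, :] - codebook[None], axis=2)
            labels = d.argmin(axis=1)
            # Recompute centroids; keep the old centroid for empty cells.
            for k in range(len(codebook)):
                if np.any(labels == k):
                    codebook[k] = features[labels == k].mean(axis=0)
    return codebook[:n_clusters], labels

# Two well-separated blobs should yield centroids near (0,0) and (3,3).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
codebook, labels = lbg(X, 2)
```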

9 Features:
1. ICC: Image Centroid and Cells
2. DCT: Discrete Cosine Transform
3. BDCT: Block Discrete Cosine Transform
4. DCT-4B: Discrete Cosine Transform, 4 Blocks
5. BDCT+ICC: hybrid BDCT with ICC
6. ICC+DCT: hybrid DCT with ICC
7. ICZ: Image Centroid and Zone
8. DCT+ICZ: hybrid DCT and ICZ
9. DTW: Dynamic Time Warping
10. The moment invariant features

10 Results:

TABLE 1: CLUSTERING RATE OF SIMPLIFIED ARABIC FONT USING DIFFERENT FEATURES

Features  | Word/Cluster | Group ER (%) | Clustering ER (%) | Total ER (%) | Cluster Rate (%)
ICC       | 115 | 0.75 | 0.55  | 1.31  | 98.7
BDCT      | 118 | 0.75 | 2.47  | 3.22  | 96.8
DCT       | 129 | 0.75 | 0.05  | 0.81  | 99.2
DCT-4B    | 113 | 0.75 | 0.55  | 1.30  | 98.7
ICC+BDCT  | 117 | 0.75 | 0.91  | 1.66  | 98.3
ICC+DCT   | 114 | 0.75 | 0.23  | 0.98  | 99.0
ICZ       | 116 | 0.75 | 2.53  | 3.28  | 96.7
ICZ+DCT   | 115 | 0.75 | 0.59  | 1.34  | 98.7
DTW       | 154 | 0.75 | 1.17  | 1.92  | 98.1
Moments   | 176 | 0.75 | 16.64 | 17.39 | 82.6

11 TABLE 2: PROCESSING TIME FOR FEATURE EXTRACTION AND CLUSTERING OF SIMPLIFIED ARABIC FONT USING DIFFERENT FEATURES

Features  | Word/Cluster | Cluster Rate (%) | Feat_Ext_Time (ms) | Clus_Ave_Time (ms) | To_Ave_Time (ms)
ICC       | 115 | 98.7 | 0.04  | 0.25 | 0.29
BDCT      | 118 | 96.8 | 0.38  | 0.13 | 0.51
DCT       | 129 | 99.2 | 11.95 | 0.03 | 11.99
DCT-4B    | 113 | 98.7 | 1.87  | 0.02 | 1.90
ICC+BDCT  | 117 | 98.3 | 0.41  | 0.24 | 0.66
ICC+DCT   | 114 | 99.0 | 1.90  | 0.26 | 2.16
ICZ       | 116 | 96.7 | 0.01  | 0.04 | 0.05
ICZ+DCT   | 115 | 98.7 | 1.87  | 0.06 | 1.94
DTW       | 145 | 98.1 | 0.05  | 4.04 | 4.09
Moments   | 176 | 82.6 | 0.13  | 0.15 | 0.29

12 Conclusion:
Clustering words based on their holistic features:
- increases recognition speed;
- removes unnecessary entries from the vocabulary.
The total average time of ICC and Moments (0.29 ms) is better than that of the other methods, but their clustering rates are not the best (98.69% for ICC and 82.61% for Moments). The clustering rate of DCT (99.19%) is the best, but its time is the worst (~12 ms). Weighing both parameters (clustering rate and time), ICC may be a good compromise.

13 Thanks for your attention.

14 Image Centroid and Cells (ICC)
For each cell:
- count the number of black pixels;
- count the vertical transitions from black to white;
- count the horizontal transitions from black to white.
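The per-cell counts above can be sketched as follows. This is an illustration of the idea, not the authors' exact implementation; the grid size and feature ordering are assumptions.

```python
import numpy as np

def cell_features(binary_img, grid=(4, 4)):
    """Per-cell counts: black pixels, vertical black-to-white
    transitions, horizontal black-to-white transitions.
    binary_img: 2-D array with 1 = black (ink), 0 = white."""
    h, w = binary_img.shape
    rows, cols = grid
    feats = []
    for i in range(rows):
        for j in range(cols):
            cell = binary_img[i*h//rows:(i+1)*h//rows, j*w//cols:(j+1)*w//cols]
            black = int(cell.sum())
            # black(1) -> white(0) transitions scanning down each column
            vert = int(((cell[:-1] == 1) & (cell[1:] == 0)).sum())
            # black -> white transitions scanning across each row
            horiz = int(((cell[:, :-1] == 1) & (cell[:, 1:] == 0)).sum())
            feats.extend([black, vert, horiz])
    return np.array(feats)

img = np.zeros((8, 8), dtype=int)
img[2:6, 2:6] = 1  # a 4x4 black square
f = cell_features(img, grid=(2, 2))
```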

15 Discrete Cosine Transform (DCT)
- Apply the DCT to the whole word image.
- Extract the features as a vector by scanning the DCT coefficients in zigzag order.
- Usually the most significant DCT coefficients are kept (160 coefficients).
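A minimal sketch of these steps, assuming SciPy's 2-D DCT and a JPEG-style zigzag scan; names and the square-region simplification are illustrative, not the authors' code.

```python
import numpy as np
from scipy.fft import dctn

def zigzag_indices(n):
    """Index pairs of an n x n grid in JPEG-style zigzag order:
    anti-diagonals, alternating direction."""
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def dct_features(img, n_coeffs=160):
    """Whole-image 2-D DCT, coefficients scanned in zigzag order,
    first n_coeffs kept (the slide keeps 160)."""
    c = dctn(img.astype(float), norm="ortho")
    n = min(c.shape)  # simplification: scan the square part only
    coeffs = [c[i, j] for i, j in zigzag_indices(n)]
    return np.array(coeffs[:n_coeffs])
```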

16 Block Discrete Cosine Transform (BDCT)
- Divide the word image into cells and apply the DCT to each cell.
- Get the average of the differences between all the DCT coefficients.
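A rough sketch of the block-DCT idea. The slide's "average of the differences" step is not fully specified, so this version simply keeps a low-frequency corner of each block's coefficients; grid size and k are assumptions.

```python
import numpy as np
from scipy.fft import dctn

def bdct_features(img, grid=(4, 4), k=4):
    """Block DCT sketch: split the image into a grid of cells, apply a
    2-D DCT to each cell, and keep the k x k low-frequency corner of
    each block as the feature contribution for that cell."""
    h, w = img.shape
    rows, cols = grid
    feats = []
    for i in range(rows):
        for j in range(cols):
            cell = img[i*h//rows:(i+1)*h//rows, j*w//cols:(j+1)*w//cols]
            c = dctn(cell.astype(float), norm="ortho")
            feats.append(c[:k, :k].ravel())
    return np.concatenate(feats)
```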

17 Discrete Cosine Transform 4-Blocks (DCT-4B)
1. Compute the center of gravity of the input image.
2. Divide the word image into 4 parts, taking the center of gravity as the origin point.
3. Apply the DCT to each part.
4. Concatenate the features taken from each part to form the feature set of the given word.
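The four steps above can be sketched as below; keeping k coefficients per quadrant (and zero-padding tiny quadrants) is an assumption, and the function assumes a non-empty binary image.

```python
import numpy as np
from scipy.fft import dctn

def dct4b_features(img, k=8):
    """DCT-4B sketch: split the image into four quadrants around its
    center of gravity, DCT each quadrant, concatenate a few low-order
    coefficients per quadrant."""
    img = img.astype(float)
    ys, xs = np.nonzero(img)
    cy, cx = int(round(ys.mean())), int(round(xs.mean()))  # center of gravity
    quads = [img[:cy, :cx], img[:cy, cx:], img[cy:, :cx], img[cy:, cx:]]
    feats = []
    for q in quads:
        c = dctn(q, norm="ortho")
        feats.append(c.ravel()[:k])  # low-order coefficients (row-major)
    # Zero-pad so every quadrant contributes exactly k values.
    return np.concatenate([np.pad(f, (0, k - len(f))) for f in feats])
```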

18 Image Centroid and Zone (ICZ)
Divide the image into zones; for each zone, compute the average distance between the black pixels in that zone and the centroid of the word image.
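A minimal sketch of ICZ as described above; the 3x3 zone grid and the 0-for-empty-zone convention are assumptions.

```python
import numpy as np

def icz_features(binary_img, grid=(3, 3)):
    """ICZ sketch: for each zone, the mean Euclidean distance from the
    zone's black pixels to the whole-image centroid."""
    ys, xs = np.nonzero(binary_img)
    cy, cx = ys.mean(), xs.mean()  # centroid of the word image
    h, w = binary_img.shape
    rows, cols = grid
    feats = []
    for i in range(rows):
        for j in range(cols):
            y0, y1 = i*h//rows, (i+1)*h//rows
            x0, x1 = j*w//cols, (j+1)*w//cols
            zy, zx = np.nonzero(binary_img[y0:y1, x0:x1])
            if len(zy) == 0:
                feats.append(0.0)  # empty zone -> 0 (an assumption)
            else:
                d = np.hypot(zy + y0 - cy, zx + x0 - cx)
                feats.append(float(d.mean()))
    return np.array(feats)
```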

19 Dynamic Time Warping (DTW) Features
Three types of features are extracted from the binarized images and used in our DTW technique:
- X-axis and Y-axis histogram profiles
- Profile features (upper, down, left, right)
- Foreground/background transitions
DTW is an algorithm for measuring the similarity between two sequences. The distance between two time series x1...xM and y1...yN is D(M, N), calculated by dynamic programming using the standard recurrence D(i, j) = d(xi, yj) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)).
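The recurrence can be implemented directly; this is the textbook DTW with absolute difference as the local cost, a generic sketch rather than the authors' code.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW: D(i, j) = d(x_i, y_j) + min(D(i-1, j), D(i, j-1),
    D(i-1, j-1)), with d(a, b) = |a - b|."""
    M, N = len(x), len(y)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[M, N]

# Identical sequences have distance 0; DTW also absorbs repetitions.
print(dtw_distance([1, 2, 3], [1, 2, 3]))     # 0.0
print(dtw_distance([1, 2, 3], [1, 1, 2, 3]))  # 0.0
```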

20 Dynamic Time Warping (DTW) Features
Figure 1: The four profile features: (A) left profile, (B) up profile, (C) down profile, (D) right profile.

21 The Moment Invariant Features
Hu moments: Hu defined seven values, computed from central moments through order three, that are invariant to translation, scale, and rotation.
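The seven Hu moments can be computed from normalized central moments with the standard textbook formulas; this is a generic implementation, not the authors' code.

```python
import numpy as np

def hu_moments(img):
    """Hu's seven invariant moments from normalized central moments
    up to order three."""
    img = img.astype(float)
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    cy, cx = (ys * img).sum() / m00, (xs * img).sum() / m00

    def eta(p, q):  # normalized central moment of order p+q
        mu = (((xs - cx) ** p) * ((ys - cy) ** q) * img).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    h1 = n20 + n02
    h2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    h3 = (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2
    h4 = (n30 + n12) ** 2 + (n21 + n03) ** 2
    h5 = ((n30 - 3 * n12) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          + (3 * n21 - n03) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    h6 = ((n20 - n02) * ((n30 + n12) ** 2 - (n21 + n03) ** 2)
          + 4 * n11 * (n30 + n12) * (n21 + n03))
    h7 = ((3 * n21 - n03) * (n30 + n12)
          * ((n30 + n12) ** 2 - 3 * (n21 + n03) ** 2)
          - (n30 - 3 * n12) * (n21 + n03)
          * (3 * (n30 + n12) ** 2 - (n21 + n03) ** 2))
    return np.array([h1, h2, h3, h4, h5, h6, h7])
```

Translating the same shape leaves all seven values unchanged, which is the invariance the slide relies on.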


23 Moments
The moment invariant descriptors are calculated and fed into the feature vector.

