1 Pattern Recognition Members: Shihao Zhang Xusheng Zhang Zhuqing Zhang Yiying Zhao Qiang Zhou Ning Feng Ying Hu Yankai Liu Yifan Liu Donghao Luo Cheng Qian

2 Pattern recognition is a branch of machine learning that focuses on the recognition of patterns and regularities in data.

3 P-CNN: Pose-based CNN Features for Action Recognition Guilhem Chéron, Ivan Laptev, Cordelia Schmid (INRIA) 张世豪 115034910179

4 Introduction Design a new action descriptor based on human poses, combining motion and appearance features for body parts. Investigate P-CNN features both for automatically estimated and for manually annotated human poses. Combining the method with dense trajectory features improves the state of the art on both datasets.

5 Related Work Dense Trajectory (DT) features combined with Fisher Vector (FV) aggregation have shown outstanding results on a number of challenging benchmarks. IDT-FV is the improved version of DT with FV encoding. Convolutional Neural Networks (CNN) have made significant progress in image classification. We extend previous global CNN methods and address action recognition using CNN descriptors at the local level of human body parts. Local video descriptors fail to capture important spatio-temporal structure, which is essential for fine-grained action recognition. We design a new CNN-based representation for human actions combining positions, appearance and motion of human body parts.

6 P-CNN: Pose-based CNN features

7 The static video descriptor for part p is defined by concatenating its time-aggregated frame descriptors.
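Below is a minimal NumPy sketch of this kind of time aggregation, assuming per-frame CNN descriptors are already available for each body part; the use of min/max aggregation and the absence of any normalization are illustrative choices, not necessarily those of the paper.

```python
import numpy as np

def static_part_descriptor(frame_descs):
    """Aggregate per-frame CNN descriptors of one body part over time.

    frame_descs: array of shape (T, D), one D-dimensional descriptor per frame.
    Returns the concatenation of min- and max-aggregated descriptors.
    """
    return np.concatenate([frame_descs.min(axis=0), frame_descs.max(axis=0)])

def pcnn_static_video_descriptor(parts):
    """Concatenate the time-aggregated descriptors of all body parts.

    parts: list of (T, D) arrays, one per body part (e.g. hands, upper body, full image).
    """
    return np.concatenate([static_part_descriptor(p) for p in parts])

# Toy example: 3 parts, 10 frames, 4096-dimensional CNN features per frame.
video = [np.random.randn(10, 4096) for _ in range(3)]
v_static = pcnn_static_video_descriptor(video)   # shape: (3 * 2 * 4096,)
```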

8 State-of-the-art methods Pose estimation We have implemented a video pose estimator based on [8]. Following [8], we extract a large set of pose configurations in each frame and link them over time using Dynamic Programming (DP). The poses selected with DP are constrained to have a high score of the pose estimator [42]. At the same time, the motion of joints in a pose sequence is constrained to be consistent with the optical flow extracted at joint positions.

9 State-of-the-art methods High-level pose features (HLPF) High-level pose features (HLPF) encode spatial and temporal relations of body joint positions and were introduced in [19]. Positions of body joints are first normalized with respect to the person size. Then, the relative offsets to the head are computed for each pose in P. Dense trajectory features The DT method densely samples points which are tracked using optical flow. For each trajectory, four descriptors are computed in the aligned spatio-temporal volume: trajectory shape, HOG, HOF and MBH. Fisher Vector encoding results in state-of-the-art performance for action recognition in combination with DT features.

10 Datasets JHMDB contains 21 human actions, such as brush hair, climb, golf, run or sit. Video clips are restricted to the duration of the action. In our experiments we also use a subset of JHMDB, including 316 clips distributed over 12 actions in which the human body is fully visible. MPII Cooking Activities contains 64 fine-grained actions and an additional background class. Actions take place in a kitchen with a static background. We also define a subset of MPII Cooking with the classes washing hands and washing objects. We selected these two classes as they are visually very similar and differ mainly in the manipulated objects.

11 Experimental results Performance of human part features

12 Experimental results Robustness of pose-based features

13 Experimental results Comparison to the state of the art

14 Experimental results

15 Conclusion This paper introduces pose-based convolutional neural network features (P-CNN). P-CNN description is shown to be significantly more robust to errors in human pose estimation compared to existing pose-based features such as HLPF [19]. In particular, P-CNN significantly outperforms HLPF on the task of fine-grained action recognition in the MPII Cooking Activities dataset. Furthermore, P-CNN features are complementary to the dense trajectory features and significantly improve the current state of the art for action recognition when combined with IDT-FV. Our study confirms that correct estimation of human poses leads to significant improvements in action recognition. Pose-based action recognition methods have a promising future due to the recent progress in pose estimation.

16 ACTION RECOGNITION USING VISUAL ATTENTION Shikhar Sharma, Ryan Kiros & Ruslan Salakhutdinov Department of Computer Science University of Toronto, Toronto, ON M5S 3G4, Canada ICLR 2016 张旭升 115034910181

17 Introduction Visual cognition: humans focus their attention on different parts of the scene to extract relevant information. Attention-based models can infer the action happening in a video by focusing only on the relevant places in each frame. In this paper the authors propose a soft attention based recurrent model for action recognition.

18 Related Work Convolutional Neural Networks (CNNs) have been highly successful in classification and object recognition tasks. LSTMs have been recently shown to perform well in the domain of speech recognition. Many existing approaches also tend to have CNNs underlying the LSTMs and classify sequences directly or do temporal pooling of features prior to classification. Attention models add a dimension of interpretability by capturing where the model is focusing its attention when performing a particular task.

19 The Model: Convolutional Features Extract the last convolutional layer obtained by pushing the video frames through a GoogLeNet model trained on the ImageNet dataset. This last convolutional layer has D convolutional maps and is a feature cube of shape K × K × D. At each time-step t, they extract K² D-dimensional vectors, referred to as feature slices of the feature cube. Each of these K² vertical feature slices maps to a different overlapping region of the input frame.
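A small NumPy sketch of the reshaping step described above; the values of D and K are placeholders, not taken from the paper.

```python
import numpy as np

# Suppose the last convolutional layer produces a feature cube of shape (D, K, K),
# i.e. D convolutional maps of spatial size K x K (placeholder values below).
D, K = 1024, 7
feature_cube = np.random.randn(D, K, K)

# Reshape into K*K "feature slices": each slice is a D-dimensional vector that
# corresponds to one of the K*K overlapping regions of the input frame.
slices = feature_cube.reshape(D, K * K).T   # shape: (K*K, D)
print(slices.shape)                          # (49, 1024)
```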

20 The Model: The Attention Mechanism At each time-step t, the model predicts l_{t+1}, a softmax over the K × K locations, and y_t, a softmax over the label classes computed with an additional hidden layer with tanh activations.

21 The Model: The Attention Mechanism The location softmax l_t is defined over the K × K locations. It can be thought of as the probability with which the model believes the corresponding region in the input frame is important. The expected value of the input at the next time-step is then x_t = Σ_i l_{t,i} X_{t,i}, where X_t is the feature cube and X_{t,i} is the i-th slice of the feature cube at time-step t.
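A minimal sketch of this soft attention step, assuming the previous LSTM hidden state and a simple linear parameterization of the location scores (the exact parameterization in the paper may differ).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_attention(h_prev, W_loc, feature_slices):
    """Location softmax and expected input (sketch).

    h_prev:         previous LSTM hidden state, shape (H,)
    W_loc:          location-score weights, shape (K*K, H) -- illustrative parameterization
    feature_slices: feature cube slices X_t, shape (K*K, D)
    Returns (l, x): the location probabilities and x_t = sum_i l_{t,i} * X_{t,i}.
    """
    l = softmax(W_loc @ h_prev)      # probability that each region is important
    x = feature_slices.T @ l         # expected input: weighted sum of feature slices, shape (D,)
    return l, x
```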

22 The Model: The LSTM LSTM: Long Short-Term Memory. The recurrent part of the model follows the standard LSTM implementation.

23 The Model: Loss function They impose an additional constraint (regularization term) over the location softmax so that attention is spread over time rather than collapsing onto a single location. The loss function combines the classification cross-entropy with this attention penalty.

24 Experiments: Quantitative Analysis Datasets: UCF-11, HMDB-51 and Hollywood2. Table 1 reports accuracies on both the UCF-11 and HMDB-51 datasets and mean average precision (mAP) on Hollywood2. The results from Table 1 demonstrate that the attention model performs better than both average- and max-pooled LSTMs.

25 Experiments: Quantitative Analysis

26 Compare with other state-of-the-art action recognition models: The table is divided into three sections. Models in the first section use only RGB data, models in the second section use both RGB and optical flow data, and the model in the third section uses RGB, optical flow, as well as object responses of the videos. The proposed model performs competitively against deep learning models in its category.

27 Conclusion Developed recurrent soft attention based models for action recognition and analyzed where they focus their attention. The proposed model tends to recognize important elements in video frames based on the action being performed, and it performs better than baselines which do not use any attention mechanism. Further work: explore hard attention models as well as hybrid soft and hard attention approaches, which can reduce the computational cost of the model.

28 Thank You!

29 CONTEXTUAL ACTION RECOGNITION WITH R*CNN Georgia Gkioxari UC Berkeley gkioxari@berkeley.edu Ross Girshick Microsoft Research rbg@microsoft.com Jitendra Malik UC Berkeley malik@berkeley.edu Reporter : Zhuqing Zhang 115034910182

30 CONTENTS Background Purpose Experiment Conclusion References

31 BACKGROUND There are multiple cues in an image which reveal what action a person is performing. Related work: 1) Action recognition; 2) Scene and Context; 3) Multiple-Instance Learning.

32 PURPOSE This paper exploits the simple observation that actions are accompanied by contextual cues to build a strong action recognition system, R*CNN. Advantages of R*CNN: 1) Adapts RCNN to use more than one region for classification while still maintaining the ability to localize the action. 2) The action-specific models and the feature maps are trained jointly, allowing action-specific representations to emerge. 3) R*CNN is not limited to action recognition.

33 EXPERIMENT Fig. 1 shows the architecture of the network. Given an image, the primary region is the bounding box containing the person (knowledge of this box is given at test time in all action datasets). Bottom-up region proposals form the set of candidate secondary regions. For each action α, the most informative region is selected through the max operation and its score is added to the primary score. The softmax operation transforms scores into estimated posterior probabilities, which are used to predict action labels.
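A small sketch of the scoring rule described on this slide: the score of each action is the primary-region score plus the maximum score over candidate secondary regions, and a softmax turns the totals into probabilities. The array names are hypothetical.

```python
import numpy as np

def rstar_cnn_probs(primary_scores, secondary_scores):
    """R*CNN-style scoring (sketch).

    primary_scores:   shape (A,)    -- per-action scores of the person (primary) box
    secondary_scores: shape (R, A)  -- per-action scores of R candidate secondary regions
    """
    total = primary_scores + secondary_scores.max(axis=0)  # add the most informative region
    e = np.exp(total - total.max())
    return e / e.sum()                                      # estimated posterior probabilities
```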

34 EXPERIMENT R*CNN: based on Fast RCNN (FRCN). Learning: train R*CNN with stochastic gradient descent (SGD). Datasets: 1) PASCAL VOC Actions: contains 10 different actions; 2) MPII Human Pose dataset: contains 400 actions and consists of approximately 40,000 instances and 24,000 images.

35 EXPERIMENT PASCAL VOC Action Control experiments: 1) RCNN; 2) Random-RCNN; 3) Scene-RCNN; 4) R*CNN. Results table: We compare R*CNN to other approaches on the PASCAL VOC Action test set. The table shows the results. R*CNN outperforms all other approaches by a substantial margin. R*CNN seems to perform significantly better for actions which involve small objects and action-specific pose appearance, such as Phoning, Reading, Taking Photo and Walking.

36 EXPERIMENT MPII Human Pose Dataset Control experiments: 1) RCNN; 2) R*CNN. Results compared with published results: We evaluate R*CNN on the test set and achieve 26.7% mAP for frame-level recognition. Our approach does not use motion, which is a strong cue for action recognition in video, and yet manages to outperform DT by a significant margin. Evaluation on the test set is performed only at the frame-level.

37 CONCLUSION R*CNN: adapts RCNN to use more than one region in order to make a prediction, because contextual cues are also significant when making predictions. Setting: both features and models are learnt jointly, allowing action-specific representations to emerge. Results: 1) R*CNN outperforms all published approaches on two datasets. 2) The auxiliary information selected by R*CNN for prediction captures different contextual modes depending on the instance in question. Application: R*CNN is not limited to action recognition. It can also be used successfully for tasks such as attribute classification. Visualizations show that the secondary regions capture the region relevant to the attribute considered.

38 REFERENCE

39 THANK YOU!

40 Recurrent Convolutional Neural Network for Object Recognition 赵易颖 115034910184

41 Content 1. Introduction of CNN and RCNN 2. RCNN Model 3. Experiments 4. Conclusion

42 1. Introduction Convolutional Neural Network (CNN): a typical artificial neural network with a feed-forward architecture, widely used, especially in object recognition.

43 1. Introduction Context is important for object recognition. A feed-forward model can only capture context in higher layers; connections within the same layer are used to better exploit context.

44 2. RCNN Model Recurrent convolutional layer (RCL): for a unit located at (i, j) on the k-th feature map of an RCL, the activity (state) of this unit combines a feed-forward input from the previous layer with a recurrent input from the same layer at the previous time step.
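A hedged PyTorch sketch of such a recurrent convolutional layer, unrolled for T time steps; the kernel size, normalization settings and channel counts are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class RCL(nn.Module):
    """Recurrent convolutional layer (sketch): the state at each step combines a
    feed-forward convolution of the input with a recurrent convolution of the
    previous state, followed by ReLU and local response normalization."""

    def __init__(self, channels, steps=3):
        super().__init__()
        self.steps = steps
        self.ff = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.rec = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.lrn = nn.LocalResponseNorm(size=5)

    def forward(self, u):
        x = self.lrn(F.relu(self.ff(u)))           # t = 0: feed-forward computation only
        for _ in range(self.steps):                # unfold for T time steps
            x = self.lrn(F.relu(self.ff(u) + self.rec(x)))
        return x
```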

45 2. RCNN Model Overall architecture RCNN contains a stack of RCLs, optionally interleaved with max pooling layers. Max pooling layers are used in the middle, a global max pooling layer yields a feature vector representing the image, and a softmax layer is used to classify the feature vector.

46 2. RCNN Model The overall architecture of RCNN is shown on the right. An RCL is unfolded for T = 3 time steps, leading to a feed-forward subnetwork with a largest depth of 4 and a smallest depth of 1. At t = 0 only feed-forward computation takes place. The RCNN used in this paper contains one convolutional layer, four RCLs, three max pooling layers and one softmax layer.

47 3. Experiments Comparison with the baseline models on CIFAR-10 Comparison with existing models on CIFAR-100 Comparison with existing models on MNIST

48 4. Conclusion Basic idea: add recurrent connections within every convolutional layer of the feed-forward CNN, thus enabling the units to be modulated by other units in the same layer. Advantages: enhances the capability of the CNN to capture statistical regularities in the context of the object; increases the depth of the original CNN while keeping the number of parameters constant through weight sharing between layers.

49 THANK YOU

50 Bilinear CNN Models for Fine-grained Visual Recognition - ICCV 2015 Presented by Zhou Qiang Shanghai Jiao Tong University

51 ▪ Introduction ▪ Method ▪ Innovation ▪ Experiments & Results Bilinear CNN Models for Fine-grained Visual Recognition

52 Bilinear CNN Models for Fine-grained Visual Recognition What is CNN? Convolutional Neural Network. Figure. Architecture of LeNet-5, a convolutional network

53 Bilinear CNN Models for Fine-grained Visual Recognition What is Fine-grained? The visual difference between classes is slight.

54 ▪ Introduction ▪ Method ▪ Innovation ▪ Experiments & Results Bilinear CNN Models for Fine-grained Visual Recognition

55 Recognizing: pre-training, input, descriptor extraction, classification. Figure. A bilinear CNN model for image classification.

56 Bilinear CNN Models for Fine-grained Visual Recognition Recognizing: Pre-train on ImageNet the convolutional layers of the M-Net and D-Net used as feature extractors. Input an image containing an object. Extract features of the image with the two different networks. Take the outer product of the two features at each location of the image and pool over locations to obtain an image descriptor. Classify based on the image descriptor with one-vs-all linear SVMs.
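A compact NumPy sketch of the bilinear pooling step just described; the signed square-root and L2 normalization at the end are the post-processing commonly used with such descriptors and are included as an assumption.

```python
import numpy as np

def bilinear_descriptor(feat_a, feat_b):
    """Bilinear pooling (sketch).

    feat_a: features from network A, shape (L, Da) -- L spatial locations
    feat_b: features from network B, shape (L, Db)
    The outer product is taken at each location and sum-pooled over locations.
    """
    assert feat_a.shape[0] == feat_b.shape[0]
    pooled = feat_a.T @ feat_b                 # sum of per-location outer products, (Da, Db)
    x = pooled.reshape(-1)
    x = np.sign(x) * np.sqrt(np.abs(x))        # signed square-root (assumed post-processing)
    return x / (np.linalg.norm(x) + 1e-12)     # L2 normalization
```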

57 ▪ Introduction ▪ Method ▪ Innovation ▪ Experiments & Results Bilinear CNN Models for Fine-grained Visual Recognition

58 Challenges: The visual differences between the categories are small and can easily be overwhelmed by those caused by factors such as pose, viewpoint, or location of the object in the image. Common approach: first localize various parts of the object and then model the appearance conditioned on their detected locations. The parts are often defined manually and the part detectors are trained in a supervised manner.

59 Bilinear CNN Models for Fine-grained Visual Recognition Innovation: Bilinear CNN models, a recognition architecture that addresses several drawbacks of part-based approaches. The classifier is trained directly on pairwise features, one factor modeling appearance and the other modeling parts. E.g., gender recognition: train a gender-specific face detector, instead of training a gender-neutral face detector followed by a gender classifier.

60 ▪ Introduction ▪ Method ▪ Innovation ▪ Experiments & Results Bilinear CNN Models for Fine-grained Visual Recognition

61 Baselines: FC-CNN [M], FC-CNN [N]. The features are extracted from the last fully-connected layer before the softmax layer of the CNN (M or N). FV-CNN [M], FV-CNN [N]. Features are pooled from the convolutional layers at a single scale, building a descriptor using Fisher Vector pooling. FV-SIFT. An FV baseline using dense SIFT features extracted with VLFEAT. Paper methods: B-CNN. Three bilinear CNN models: B-CNN [M,M], B-CNN [M,N], B-CNN [N,N].

62 Bilinear CNN Models for Fine-grained Visual Recognition Datasets: Birds. The CUB-200-2011 dataset contains 11,788 images of 200 bird species.

63 Bilinear CNN Models for Fine-grained Visual Recognition Datasets: Aircraft. The FGVC-Aircraft dataset consists of 10,000 images of 100 aircraft variants, e.g., discriminating variants such as the Boeing 737-300 from the Boeing 737-400.

64 Bilinear CNN Models for Fine-grained Visual Recognition Datasets: Cars. The cars dataset contains 16,185 images of 196 classes. Categories are typically at the level of Make, Model, Year, e.g., "2012 Tesla Model S" or "2012 BMW M3 coupe".

65 Bilinear CNN Models for Fine-grained Visual Recognition Results: The proposed method works more efficiently than the baselines, and its performance is close to or better than that of previous work.

66 Bilinear CNN Models for Fine-grained Visual Recognition Mistakes: class pairs commonly confused by the fine-tuned B-CNN model look remarkably similar.

67 Thanks Bilinear CNN Models for Fine-grained Visual Recognition

68 A Neural-based Approach to Answering Questions about Images 2015 IEEE ICCV Reporter: Neil Feng 2016.5.3

69 Architecture of the article: Introduction, Related work, Approach, Experiments, Conclusions

70 Introduction A novel approach based on recurrent neural networks (RNN), named Neural-Image-QA, for answering questions about images. It outperforms prior work on this task, doubling the performance. It also proposes two new metrics sensitive to human consensus on the task.

71 Related work Convolutional Neural Networks for visual recognition; Recurrent Neural Networks (RNN) for sequence modeling; combining RNNs and CNNs for description of visual content; grounding of natural language and visual concepts; textual question answering; Visual Turing Test.

72 Approach In this paper, questions have multiple-word answers. The problem is formulated as prediction over a variable-length input/output sequence. Neural-Image-QA is a deep network built from a CNN and a Long Short-Term Memory (LSTM) network.

73 Approach The model answers questions about an image by predicting the answer words one at a time, choosing at step t the most likely word given the image, the question and the previously predicted words Â_{t−1} = {â_1, ..., â_{t−1}}.
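A sketch of the greedy decoding implied by this formulation: at each step the most probable answer word is chosen given the image, the question and the words predicted so far. `model.next_word_probs` is a hypothetical helper, not an API from the paper's code.

```python
def decode_answer(model, image, question, max_len=10, end_token="<END>"):
    """Greedy multi-word answer decoding (sketch).

    `model.next_word_probs(image, question, prev_words)` is assumed to return a
    dict {word: probability} over the answer vocabulary.
    """
    answer = []
    for _ in range(max_len):
        probs = model.next_word_probs(image, question, answer)
        word = max(probs, key=probs.get)   # pick the most likely next answer word
        if word == end_token:              # stop once the end-of-answer token is produced
            break
        answer.append(word)
    return answer
```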

74 Approach

75 LSTM has been recently shown to be effective in learning a variable length sequence-to-sequence mapping

76 Experiments Experimental protocol: the DAQUAR dataset, which provides 12,468 human question-answer pairs, and the WUPS score at thresholds {0.9, 0.0}. In order to study how much information is already contained in the questions, they also train a version of the model that ignores the visual input.

77 Experiments Table 1 shows the results of the Neural-Image-QA method on the full set ("multiple words") with 653 images and 5,673 question-answer pairs available at test time.

78

79 Conclusion Proposed a neural architecture for answering natural language questions about images; it outperforms prior work by doubling the performance. The model that does not use the image to answer the question performs only slightly worse and even outperforms a new human baseline. The existing DAQUAR dataset is extended to DAQUAR Consensus.

80

81 Multi-scale recognition with DAG-CNNs 胡颖 2016.05.03

82 Outline Models Steps Results Conclusions

83 Models CNN Convolutional Neural Nets DAG-CNN Directed Acyclic Graph-CNN

84 Hierarchical chain model: the output of the chain is task-dependent. Coarse classification is different from fine-grained classification, so multi-scale representations are needed.

85 Steps Multi-scale classification Multi-scale selection Multi-scale pooling
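A hedged PyTorch sketch of how such multi-scale prediction can be wired up, following the steps listed above: features from several selected layers are each globally pooled, classified, and the per-scale class scores are summed. The choice of layers, the use of average pooling and the summation of scores are assumptions for illustration.

```python
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """DAG-CNN-style multi-scale prediction (sketch): pool the feature maps of
    several selected layers, classify each pooled vector, and sum the scores."""

    def __init__(self, channels_per_scale, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global pooling per selected layer
        self.classifiers = nn.ModuleList(
            [nn.Linear(c, num_classes) for c in channels_per_scale])

    def forward(self, feature_maps):
        # feature_maps: list of tensors (N, C_i, H_i, W_i), one per selected layer
        scores = 0
        for fmap, clf in zip(feature_maps, self.classifiers):
            pooled = self.pool(fmap).flatten(1)   # (N, C_i)
            scores = scores + clf(pooled)         # add this scale's class scores
        return scores
```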

86 Classification & Selection

87 Pooling

88 Results

89 Fine-tuned

90 Conclusions Handles both coarse and fine-grained classification. DAG-structured, allowing for end-to-end training. Performs quite well.

91 Thank you !

92 Convolutional Neural Networks with Intra-layer Recurrent Connections for Scene Labeling Student ID: 115034910197 刘彦凯

93 1. Introduction Scene labeling: aims at fully parsing the input image by labeling the semantic category of each pixel. Challenge: it simultaneously solves both segmentation and recognition. Typical approach: 1. Extract local handcrafted or CNN features. 2. Integrate context information using probabilistic graphical models or other techniques (conditional random field (CRF), recursive parsing tree).

94 1. Introduction How to incorporate context modulation in neural networks? Recurrent connections. With recurrent neural networks (RNN), long-range context information can be captured by a fixed number of recurrent weights, treating scene labeling as a two-dimensional variant of sequence learning.

95 1. Introduction Modeling the relationship between pixels (units in the hidden layers of a CNN) in 2D space explicitly requires recurrent connections between units within layers. Characteristics of RCNN: feed-forward and recurrent connections co-exist; seamless integration of feature extraction and context modulation. Multi-scale technique: outputs of all networks are concatenated and fed to the next layer; networks at different scales share the same structure and weights.

96 2.1 RCNN Generic RNN: feed-forward input u(t), internal state x(t) and parameters θ; F is the function describing the dynamic behavior of the RNN, x(t) = F(x(t−1), u(t), θ). Recurrent convolutional layer (RCL): the RCL is a special two-dimensional RNN, whose feed-forward and recurrent computations both take the form of convolution.

97 2.1 RCNN Recurrent convolutional layer (RCL): the state of a unit is obtained by applying g, the widely used rectified linear function, followed by h, local response normalization (LRN), to the sum of its feed-forward and recurrent convolution responses.

98 2.1 RCNN During the training or testing phase, an RCL is unfolded for T time steps into a multi-layer subnetwork. The RCLs are unfolded one by one, and each RCL is unfolded for T time steps before feeding into the next RCL. Solid arrows denote feed-forward connections and dotted arrows denote recurrent connections.

99 2.2 Multi-scale RCNN Motivation: the model should be scale invariant. Framework: several RCNNs with shared weights are used to process images of different scales. Softmax layer: Cross entropy loss

100 2.2 Multi-scale RCNN The patch-wise approach is time consuming. Instead, the entire image is input to the network to obtain a down-sampled label map, which is then simply up-sampled to the same resolution as the input image (bilinear or other interpolation).
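A small PyTorch sketch of that whole-image inference step, assuming the network outputs a per-class score map at reduced resolution; bilinear interpolation followed by a per-pixel argmax is one straightforward realization.

```python
import torch.nn.functional as F

def labels_from_score_map(score_map, out_h, out_w):
    """Up-sample a down-sampled per-class score map to the input resolution (sketch).

    score_map: tensor of shape (N, num_classes, h, w) produced by the network.
    Returns a (N, out_h, out_w) label map after bilinear up-sampling and argmax.
    """
    up = F.interpolate(score_map, size=(out_h, out_w),
                       mode="bilinear", align_corners=False)
    return up.argmax(dim=1)
```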

101 3. Model Analysis Notations: u_0 is the static input (e.g., an image); u(t) is the input to the RCL; γ ∈ [0, 1] determines the tradeoff between the feed-forward and recurrent components. If γ = 0, the feed-forward component is totally discarded. The RCLs are unfolded one by one, and each RCL is unfolded for T time steps before feeding into the next RCL.

102 3. Model Analysis Model analysis over the SIFT Flow dataset. PA: the ratio of correctly classified pixels to the total pixels in testing images. CA: the average of all category-wise accuracies.

103 3. Model Analysis For RCNN with γ = 1, the performance monotonically increases with more time steps. With γ = 0, the network tends to over-fit with more iterations. RCNN-large has 4 RCLs, with more parameters and larger depth (γ = 1 performs better, γ = 0 worse).

104 3. Model Analysis CNN1 is constructed by removing all recurrent connections from RCNN and increasing the numbers of feature maps in each layer from 32, 64 and 128 to 60, 120 and 240. CNN2 is constructed by removing the recurrent connections and adding two extra convolutional layers. CNN2 has five convolutional layers, and the corresponding numbers of feature maps are 32, 64, 64, 128 and 128.

105 3. Model Analysis

106 4. Conclusion A multi-scale recurrent convolutional neural network is used for scene labeling. 1. It performs local feature extraction and context integration simultaneously in each parameterized layer. 2. It is an end-to-end approach and can be simply trained by the backpropagation through time algorithm (BPTT). 3. Experimental results over two benchmark datasets demonstrate the effectiveness and efficiency of the model.

107 Thank you !

108 Semantic Pose using Deep Networks Trained on Synthetic RGB-D 刘宜璠 115034910198

109 Outline Introduction Synthetic RGB-D scenes Network Architecture Training Experimental Evaluation Conclusions

110 Introduction Pose estimation Object detection Semantic segmentation Object classification

111 Introduction (related work) Silberman: over-segmentation; Couprie: multi-scale CNN; Hariharan: categories with an SVM, aggregation onto a coarse mask; Song and Xiao: renderings of 3D models; Guo and Hoiem: aggregation of low-level features; Gupta: CNN; Lin: candidate cuboids.

112 Synthetic RGB-D Scenes Obstacle: lack of large annotated datasets. Generating realistic RGB-D renderings: BlenSor sensor simulation toolbox, added Perlin noise, standard Blender pipeline. Models: Princeton ModelNet10 dataset. Object categories: bathtub, bed, chair, desk, dresser, monitor, nightstand, sofa, table, toilet.

113 Network Architecture Input: a 96x96 real-valued image consisting of five layers (an intensity layer, a depth layer, and three layers representing the surface normal vector). Depth is not considered when generating proposals. Four models: two "standard" Krizhevsky-style networks and two larger networks with "inception"-style layers.

114 Training Trained to output: a class label, a rotation around the floor normal axis, and a distance from the camera. Synthetic data: 1000 random scenes with 812 models from the same dataset. Training time: 12 hours for the simple models, 48 hours for the most complex model. Hardware: Titan X GPU with memory constrained to 12 GB.
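A hedged PyTorch sketch of the three outputs listed above (class label, rotation about the floor normal, distance from the camera) and a combined training loss; the rotation encoding, loss types and weights are assumptions, not details from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class PoseHeads(nn.Module):
    """Output heads (sketch): class label, rotation about the floor normal, distance."""

    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes)
        self.rot = nn.Linear(feat_dim, 2)    # rotation encoded as (sin, cos) -- an assumption
        self.dist = nn.Linear(feat_dim, 1)   # distance from the camera

    def forward(self, feats):
        return self.cls(feats), self.rot(feats), self.dist(feats)

def pose_loss(outputs, labels, rot_targets, dist_targets, w_rot=1.0, w_dist=1.0):
    # The weighting of the terms is an assumption, not taken from the paper.
    cls_out, rot_out, dist_out = outputs
    return (F.cross_entropy(cls_out, labels)
            + w_rot * F.mse_loss(rot_out, rot_targets)
            + w_dist * F.mse_loss(dist_out, dist_targets))
```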

115 Experimental Evaluation The difference between the models is not very substantial. Each recent work has reported classification accuracy slightly differently.

116 Experimental Evaluation

117 Conclusions Presented a method for generating realistic synthetic RGB-D scenes and showed that they are valid training data. Networks trained on synthetic RGB-D scenes can be adapted easily to work on the most challenging real data available. Three tasks are accomplished within a single network.

118 Multi-view Convolutional Neural Networks for 3D Shape Recognition 115034910199 Donghao Luo

119 Overview Background Basic Idea Method Experiment

120 Background A question: what kind of descriptors can be used to represent 3D shapes? Native 3D formats: voxel grid, polygon mesh. How about view-based descriptors, i.e., using 2D images to recognize 3D shapes?

121 Basic Idea Standard CNN architecture applied to each independent view of the 3D shape: the accuracy is far higher than recognition with 3D shape descriptors (plot: accuracy vs. number of views).

122 Basic Idea Novel CNN architecture: information from multiple views of the 3D shape is combined into a single shape descriptor.

123 Method Input: A Multi-view Representation Camera setup

124 Method Recognition with Multi-view Representations: multiple 2D image descriptors per 3D shape are integrated for the recognition task. Classification: sum up the SVM decision values over views and return the class with the highest sum.
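A one-function NumPy sketch of the classification rule stated above (sum the per-view one-vs-rest SVM decision values and take the class with the highest total).

```python
import numpy as np

def classify_shape(view_decision_values):
    """Classify a 3D shape from its rendered views (sketch).

    view_decision_values: array of shape (V, C) with one-vs-rest SVM decision
    values for V views and C classes.
    """
    return int(np.argmax(view_decision_values.sum(axis=0)))
```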

125 Method Multi-view CNN: Learning to Aggregate Views. Information from all views is synthesized into a single, compact 3D shape descriptor.
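A minimal PyTorch sketch of a view-pooling step that aggregates the per-view convolutional features into one shape descriptor; element-wise max across views is used here as a plausible choice of aggregation.

```python
import torch

def view_pool(view_features):
    """View pooling (sketch): element-wise max across the view dimension.

    view_features: tensor of shape (V, C, H, W) -- convolutional features of the
    V rendered views of one shape. The pooled (C, H, W) descriptor is then passed
    through the remaining layers of the network.
    """
    return view_features.max(dim=0).values
```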

126 Experiment

127 Thanks!

128 HD-CNN for Large Scale Visual Recognition By 钱程

129 Introduction Deep CNNs are well suited for large-scale supervised visual recognition tasks. Complications arise with large datasets: 1. It is easy to distinguish between coarse categories, but difficult to distinguish between fine categories within a coarse category. 2. How to learn such a category hierarchy from the training data itself. 3. A hierarchical CNN classifier consists of multiple CNN models and would be slower and more memory consuming.

130 Overview of HD-CNN HD-CNN Architecture HD-CNN comprises four parts: (i) shared layers (ii) a single component B to handle coarse categories (iii) multiple components for fine classification (iv) a single probabilistic averaging layer

131 Overview of HD-CNN Shared layers receive raw image pixels and extract low-level features. The coarse category component's independent layers B produce an intermediate fine prediction and a coarse category prediction.

132 Overview of HD-CNN Independent layers form a set of fine category classifiers. Both coarse and fine components share common layers because low-level features are useful for both coarse and fine classification tasks. The single probabilistic averaging layer produces a final prediction based on a weighted average.
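A tiny NumPy sketch of the probabilistic averaging described above: the final prediction is the average of the fine components' predictions, weighted by the coarse-category probabilities.

```python
import numpy as np

def probabilistic_average(coarse_probs, fine_preds):
    """HD-CNN probabilistic averaging layer (sketch).

    coarse_probs: shape (K,)   -- probabilities of the K coarse categories
    fine_preds:   shape (K, C) -- fine-category predictions of each fine component
    Returns the weighted average over components, shape (C,).
    """
    return coarse_probs @ fine_preds
```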

133 Learning a Category Hierarchy A top-down approach to learn the hierarchy from the training data is described in the paper. With disjoint coarse categories, the overall classification depends heavily on the coarse category classifier. Therefore, overlapping coarse categories are used to remove the separability constraint between coarse categories.

134 HD-CNN Training Since fine category components are embedded into HD-CNN, the training complexity and the risk of over-fitting increase. The following algorithm decomposes HD-CNN training into multiple steps. HD-CNN training algorithm step1: Pretrain HD-CNN step1.1: Initialize coarse category components step1.2: Pretrain fine category components step2: Fine-tune the complete HD-CNN

135 HD-CNN Testing To ensure HD-CNN is scalable for large-scale visual recognition, the paper develops conditional execution and layer parameter compression techniques. 1. It is not necessary to evaluate all fine category classifiers; conditional execution can accelerate HD-CNN classification. 2. Compressing the layer parameters at test time can reduce the memory footprint.

