
1 Cross-View Action Recognition via View Knowledge Transfer
Jingen Liu 1, Mubarak Shah 2, Benjamin Kuipers 1, Silvio Savarese 1
1 Department of EECS, University of Michigan, Ann Arbor, MI, USA
2 Department of EECS, University of Central Florida, Orlando, FL, USA
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011

2 Cross-View Action Recognition
– View 1: has labeled examples to train an action classifier F1
– View 2: has NO training examples for, e.g., "checking watch"
– Question: how can the knowledge of view 1 be used to recognize unknown actions in view 2?
[Figure: in view 1, a low-level feature representation feeds a classifier that outputs "checking watch"; in view 2, the same pipeline ends in a question mark.]

3 Cross-View Action Recognition
Can classifier F1 be used directly to recognize actions in view 2?
– No! Performance drops dramatically
– Motion appearance looks very different across views
[Figure: the view-1 classifier outputs "checking watch", but applied to the view-2 representation it outputs a question mark.]

4 Analogy to Text Analysis
Cross-lingual text categorization/retrieval [Bel et al. 2004, Pirkola 98]
– Translate documents into a common language, e.g., an interlingua, as used in machine translation [Hutchins et al. 92]
– Underlying assumption: a word-by-word translation is available
[Figure: documents in Chinese and in French are mapped into common languages or an interlingua.]

5 Our Proposal
An "action view interlingua"
– Treat each viewpoint as a language; construct a vocabulary per view
– Model an action by a Bag-of-Visual-Words (BoVW)
– Translate the two BoVW models into an "action view interlingua"
[Figure: videos from view 1 and view 2 are quantized with vocabularies V1 and V2 into histograms of visual words, which meet in an action view interlingua.]

6 Previous Work
Geometry-based approaches
– Geometric measurements of body joints (C. Rao et al. IJCV 2002, V. Parameswaran et al. IJCV 2006, etc.)
  Require stable body-joint detection and tracking
– 3D-reconstruction-based methods (D. Weinland et al. ICCV07, P. Yan et al. CVPR08, F. Lv et al. ICCV07, D. Gavrila et al. CVPR96, R. Li et al. ICCV07, etc.)
  Require strict alignment between views
  Computationally expensive reconstruction
Temporal self-similarity matrix [Junejo et al. ECCV08]
– No knowledge transfer
– Poor performance on the top view

7 Previous Work
Transfer-based approaches
– Farhadi et al. ECCV08
  Requires feature-to-feature correspondence at the frame level
  Mapping is provided by a trained predictor
  Mapping is conducted in one direction only
– Farhadi et al. ICCV09
  Abstracts discriminative aspects
  Trains a hash mapping
  No explicit model transfer

8 Our Contributions
Advantages of our approach
– More flexible: no geometric constraints, no body-joint detection and tracking, no 3D reconstruction
– No strict temporal alignment required
– Bidirectional mapping rather than one-directional
– No supervision needed for bilingual-word discovery
Fuse transferred multi-view knowledge using the Locally Weighted Ensemble method
[Figure: first-view and second-view features exchange information in both directions.]

9 Our Framework
Phase I: discovery of bilingual words
– Given N pairs of unlabeled videos captured from two views
– Learn two view-dependent visual vocabularies
– Discover bilingual words by bipartite graph partitioning
[Figure: training data from the first and second views form matrix M; vocabularies V1 and V2 give BoVW models, which are linked in a bipartite graph and partitioned into bilingual words.]

10 Our Framework
Phase I, continued: the partitioned bilingual words are used to re-encode each video as a Bag-of-Bilingual-Words (BoBW) model.
[Figure: the slide-9 pipeline extended, so that graph partitioning yields bilingual words (A … Z) that produce BoBW models.]

11 Our Framework
Phase II: cross-view novel action recognition
– Source view: training videos → Bag-of-Visual-Words → Bag-of-Bilingual-Words → action model learning (train classifier on the source view)
– Target view: testing videos → Bag-of-Visual-Words → Bag-of-Bilingual-Words → novel action recognition (test classifier on the target view)
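In code, the Phase II re-encoding step can be sketched as follows; the mapping from visual words to bilingual words is assumed given (produced by Phase I), and all names are illustrative:

```python
import numpy as np

def to_bobw(bovw_hist, word_to_bilingual, k):
    """Pool the bins of a view-specific Bag-of-Visual-Words histogram
    according to each word's bilingual-word membership, yielding the
    shared Bag-of-Bilingual-Words (BoBW) representation."""
    bobw = np.zeros(k)
    np.add.at(bobw, word_to_bilingual, bovw_hist)  # unbuffered scatter-add
    return bobw
```

A source-view classifier trained on such BoBW vectors can then score target-view videos encoded the same way.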

12 Low-level Action Representation
Acquiring the training matrix M
– Feature detection: extract 3D cuboids from each video
– Feature clustering: group cuboid descriptors into a visual vocabulary of d visual words (e.g., visual words A and B)
– Represent each video by its visual-word histogram: the Bag-of-Visual-Words (BoVW) model
[Figure: the d-dimensional histograms of paired examples from view 1 and view 2 are stacked into the training matrix M.]
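A minimal sketch of this step, with the paper's cuboid detector replaced by generic precomputed descriptor arrays and an illustrative vocabulary size d:

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def build_bovw(descriptor_sets, d=100):
    """Cluster the local descriptors of all training videos into d
    visual words, then represent each video as a normalized
    visual-word histogram (one row of the training matrix M)."""
    codebook, _ = kmeans2(np.vstack(descriptor_sets), d, minit='++')
    hists = []
    for desc in descriptor_sets:
        words, _ = vq(desc, codebook)          # assign each descriptor to a word
        h = np.bincount(words, minlength=d).astype(float)
        hists.append(h / max(h.sum(), 1.0))    # normalize the histogram
    return codebook, np.vstack(hists)
```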

13 Bipartite Graph Modeling
Build a bipartite graph between the two views
– Edge-weight matrix W = [0 S; Sᵀ 0], where S is a similarity matrix
Generate the similarity matrix S
– In the column space of M, each entry S(i,j) can be estimated from how similarly visual word i of view 1 and visual word j of view 2 occur across the paired video examples
– X: visual words of view 1; Y: visual words of view 2
[Figure: bipartite graph W between the visual words of the source and target views, built over the video examples.]
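The slide leaves the exact estimator for S(i,j) implicit; one plausible sketch, assuming S(i,j) is the cosine similarity between the occurrence profiles of word i (view 1) and word j (view 2) over the N paired videos:

```python
import numpy as np

def word_similarity(M1, M2, eps=1e-12):
    """M1 (N x d1) and M2 (N x d2) hold the BoVW histograms of the N
    paired videos in each view; each column is one word's occurrence
    profile. Returns the d1 x d2 cosine-similarity matrix S."""
    A = M1 / (np.linalg.norm(M1, axis=0) + eps)  # unit-normalize columns
    B = M2 / (np.linalg.norm(M2, axis=0) + eps)
    return A.T @ B
```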

14 Bipartite Graph Bi-Partitioning
Bipartite graph partitioning:
– [1] H. Zha, X. He, C. Ding, H. Simon & M. Gu, CIKM 2001
– [2] I. S. Dhillon, SIGKDD 2001
[Figure: A. before partitioning; B. after partitioning. Two clusters, (1, 2, 3; a, b) and (4, 5; c, d, e), yield two bilingual words.]
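A compact sketch of spectral bipartite co-clustering in the style of Dhillon (SIGKDD 2001); the embedding dimension and clustering step follow that paper's recipe, but the details here are illustrative rather than the authors' exact implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def bilingual_words(S, k):
    """Embed the words of both views with singular vectors of the
    degree-normalized similarity matrix, then k-means the joint
    embedding; each cluster mixes words from both views and plays
    the role of one bilingual word."""
    d1 = 1.0 / np.sqrt(np.maximum(S.sum(axis=1), 1e-12))
    d2 = 1.0 / np.sqrt(np.maximum(S.sum(axis=0), 1e-12))
    Sn = d1[:, None] * S * d2[None, :]          # D1^{-1/2} S D2^{-1/2}
    U, _, Vt = np.linalg.svd(Sn, full_matrices=False)
    l = max(int(np.ceil(np.log2(k))), 1)        # embedding dimension
    Z = np.vstack([d1[:, None] * U[:, 1:l + 1],  # view-1 word embeddings
                   d2[:, None] * Vt.T[:, 1:l + 1]])  # view-2 word embeddings
    _, labels = kmeans2(Z, k, minit='++')
    return labels[:S.shape[0]], labels[S.shape[0]:]
```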

15 IXMAS Data Set
IXMAS videos: 11 actions performed by 10 actors, captured from 5 views (C0–C4)
[Figure: example frames of check-watch, scratch-head, sit-down, wave-hand, kick, and pick-up.]

16 Data Partition
[Figure: the IXMAS data are split by action class into classes Y (check-watch, scratch-head, sit-down, wave-hand, …) and classes Z (kick, pick-up, …), each observed from a source view and a target view.]

17 Data Partition
[Figure: bilingual words are learned between view 1 (source) and view 2 (target); the classifier is trained on the Z classes in the source view and tested on the Z classes in the target view.]

18 Data Partition
[Figure: same split, but training now covers the Z+Y classes while testing stays on the Z classes of the target view.]

19 Results on View Knowledge Transfer
[Table: accuracy (%) for each training-view (rows, Cam 0–4) and testing-view (columns, Camera 0–4) pair, without ("W/O") and with ("W/") view knowledge transfer via the bag of bilingual words; the individual cell values are not recoverable from this transcript.]
Average: "W/O" = 10.9%, "W/" = 67.4%


24 Performance Comparison
– Low-level features: ST cuboids + shape-flow features [D. Tran et al. ECCV 2008]
– Columns "A": A. Farhadi et al. ECCV 2008
– Columns "B": I. N. Junejo et al. ECCV 2008
– Columns "C": A. Farhadi et al. ICCV 2009
[Table: accuracy (%) of "Ours" vs. A, B, and C for every camera pair (rows C0–C4, columns Camera 0–4) plus an average row; the individual cell values are not recoverable from this transcript.]


27 Transferred Knowledge Fusion
One target view vs. n−1 source views
– Each source view has its own action classifier
– How to fuse their knowledge into a final decision?
Locally Weighted Ensemble strategy [Gao et al. SIGKDD 08]
[Figure: the "+"/"−" decision regions of the source-1 and source-2 classifiers are fused into a single decision region R.]
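A simplified sketch of per-sample fusion. Gao et al.'s local weight measures how well each model's decision boundary matches the local clustering structure around the test point; here it is approximated by a kernel on the distance to each source's nearest (transferred) training example, an illustrative stand-in rather than the paper's exact weighting:

```python
import numpy as np

def locally_weighted_fusion(x, models, neighbor_sets, gamma=1.0):
    """Fuse class-probability vectors from several source-view models,
    weighting each model per test sample x. `models` are callables
    returning probability vectors; `neighbor_sets` hold each source's
    transferred training points (hypothetical names)."""
    fused = 0.0
    for predict_proba, neighbors in zip(models, neighbor_sets):
        d2 = np.min(((neighbors - x) ** 2).sum(axis=1))
        w = np.exp(-gamma * d2)        # local weight for this source
        fused = fused + w * predict_proba(x)
    return int(np.argmax(fused))       # fused class decision
```

The model whose evidence lies closest to the test sample dominates the vote, which is the intuition behind weighting models locally instead of globally.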

28 Knowledge Fusion Results
[Table: accuracy (%) of "Ours" vs. "Baseline" for each testing camera (Camera 0–4) and their average; the individual cell values are not recoverable from this transcript.]
– Each column denotes a testing (target) view; the remaining four views serve as source views

29 Knowledge Fusion Results
[Table: accuracy (%) per testing camera (Camera 0–4) and on average, comparing "Ours" and "Baseline" with Junejo et al. ECCV, Liu et al. CVPR, and Weinland et al. ECCV; most cell values are not recoverable from this transcript.]
– Each column denotes a testing (target) view; the remaining four views serve as source views

30 Detailed Recognition Rate

31 Summary
– Create an "action view interlingua" for cross-view action recognition
– Bilingual words serve as a bridge for view knowledge transfer
– Fuse knowledge transferred from multiple views using the Locally Weighted Ensemble method
– Our approach achieves state-of-the-art performance

32 Thank You!
Acknowledgements: UMich Intelligent Robotics Lab, UMich Computer Vision Lab, UCF Computer Vision Lab, NSF

33 Confusion Table

