
Shin’ichi Satoh National Institute of Informatics.


1 Shin’ichi Satoh National Institute of Informatics

2  Nowadays abundant multimedia information is available  Web, broadband networks, CATV, satellite...  digital cameras, mobile phones, ...

3  YouTube: 35 hours of video uploaded every minute

4  Flickr: 5 billion photos  Facebook: 3 billion photos per month

5  How can we utilize such huge amounts of multimedia?  Search could be one promising option  Any technical problems?  It seems like multimedia search is already available  Google, Yahoo!, Bing image search, Flickr, YouTube, etc...

6  Today, multimedia search is possible only via text search technology  This problem is especially prominent for visual media (audio can be converted into text via ASR)

7  But a major part of multimedia data has no text data  We checked a number of photos in Flickr and found that around 85% of photos have no tags or description  as long as we rely on text search-based technologies, such large amounts of multimedia remain completely inaccessible!

8  Moreover, text-based multimedia search is NOT perfect  searching for images of "people playing drums"  some results are good  but some results are very strange

9  Multimedia semantic content analysis is required  However, it’s difficult ◦ Multimedia is difficult for computers to handle ◦ Inherently difficult due to the “Semantic Gap”

10  Multimedia data is huge ◦ text: 1kb/s (10 words), audio: 100kb/s (MP3), video: 10Mb/s (MPEG2)  computers since the 1940s (ENIAC, 1946)  text processing by computer since the 1950s! (Turing test 1950, ELIZA and SHRDLU 1960s)  Project Gutenberg since 1971  CD-ROM (1985), DVD (1993), larger memory, external storage (hard disk drives)  multimedia data (audio/image/video) became manageable only after the 1990s

11  Please guess what this is. Water Lilies, Monet

12  Please guess what this is.

13

14  Computers are very good at handling text, but not so good at handling multimedia  text: artificial media, symbolic by nature  multimedia: ambiguous, cognition-dependent, natural media, not symbolized, etc...  humans can easily “see” or perceive  but we cannot explain how we “see” The quick brown fox jumps over the lazy dog

15  1980s  Landsat images, medical images, stock photos  Search using relational DBs  only via statistics and text  the issue was how to handle “huge” image data  less attention was paid to content analysis

16  CBIR: Image retrieval based on “content”  T. Kato, TRADEMARK & ART MUSEUM (1989)  IBM QBIC (1990s)  Take an image as a query and return “similar” images  Use “features,” e.g., color histogram, edge, shape, etc.  It worked for images without metadata  Assumes that images similar in the feature space are semantically similar as well  But this is not always true
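The feature-based retrieval idea above can be sketched in a few lines. This is an illustrative toy, not the QBIC implementation: each "image" is just a list of (r, g, b) pixel tuples, represented by a coarse color histogram, and the database is ranked by L1 histogram distance to the query. All names and data are invented for the example.

```python
# Toy CBIR sketch: coarse RGB color histograms + L1 distance ranking.
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Quantize each channel into bins and count pixels per (r, g, b) bin."""
    step = 256 // bins_per_channel
    hist = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    n = len(pixels)
    return {b: c / n for b, c in hist.items()}  # normalized frequencies

def l1_distance(h1, h2):
    """L1 distance between two sparse normalized histograms."""
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in set(h1) | set(h2))

def search(query, database):
    """Return database image names sorted by similarity to the query image."""
    qh = color_histogram(query)
    return sorted(database,
                  key=lambda name: l1_distance(qh, color_histogram(database[name])))

# Toy database: a "sunset" (reddish), a "forest" (greenish), a "sky" (bluish).
db = {
    "sunset": [(200, 80, 40)] * 100,
    "forest": [(30, 160, 50)] * 100,
    "sky":    [(90, 140, 230)] * 100,
}
query = [(210, 90, 50)] * 100  # another reddish image
print(search(query, db))  # "sunset" ranks first
```

Note that this also demonstrates the semantic-gap caveat: any reddish image (a rose, a fire truck) would rank next to the sunset, because closeness in color-histogram space does not imply closeness in meaning.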

17 [Figure: images that are nearby in the feature space may carry very different semantics, illustrating the “Semantic Gap”]

18  Let’s take face detection as an example...  Face detection is now a very stable technology  Before 1990, face detection was very unstable ◦ Shapes of facial features and their geometric relations were hard-coded  After the late 1990s, face detectors using machine learning achieved very stable performance ◦ Simply provide a lot of face image examples (a few thousand) to the system and let it learn
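The "learn from examples instead of hard-coding rules" shift above can be illustrated with a deliberately tiny detector (this is not Viola-Jones, just the underlying idea): average many positive example patches into a template, set a distance threshold from the training examples, then slide a window over an image and flag windows close to the template. All data and function names here are invented for the sketch.

```python
# Toy learned sliding-window detector: template = mean of positive examples.
def mean_patch(patches):
    n = len(patches)
    return [sum(p[i] for p in patches) / n for i in range(len(patches[0]))]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def train(pos_patches):
    """Learn a template and a threshold from positive examples only."""
    template = mean_patch(pos_patches)
    thresh = 1.1 * max(dist(p, template) for p in pos_patches)  # small margin
    return template, thresh

def detect(image, w, h, win, template, thresh):
    """Scan every win x win window in a flat row-major grayscale image."""
    hits = []
    for y in range(h - win + 1):
        for x in range(w - win + 1):
            patch = [image[(y + dy) * w + (x + dx)]
                     for dy in range(win) for dx in range(win)]
            if dist(patch, template) <= thresh:
                hits.append((x, y))
    return hits

# "Faces" here are bright 2x2 patches; three noisy training examples.
pos = [[200, 200, 200, 200], [210, 190, 205, 195], [195, 205, 190, 210]]
template, thresh = train(pos)

# A 4x4 image with one bright 2x2 blob at (1, 1).
image = [10,  10,  10, 10,
         10, 200, 200, 10,
         10, 200, 200, 10,
         10,  10,  10, 10]
print(detect(image, 4, 4, 2, template, thresh))  # → [(1, 1)]
```

The point of the example is that nothing about the target's shape is hard-coded: changing the training patches retargets the detector, which is what made the learning-based approach so much more robust than hand-written geometric rules.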

19  Following the success of machine-learning-based approaches in face detection, OCR, ASR, etc., researchers decided to “train” computers for media semantic content analysis  build a corpus: tens, hundreds, or thousands of images/video shots per concept, with manual annotation  extract features (low-level, but recently “local” features are known to be more effective)  train computers to automatically map low-level features to semantic categories using machine learning  Several corpora are available
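The corpus → features → learned mapping pipeline above can be sketched minimally. In this hypothetical example the feature vectors are given directly (standing in for color, edge, or local features), and the learner is a simple nearest-centroid model; real systems use far stronger learners, but the structure is the same.

```python
# Toy concept-learning pipeline: annotated corpus -> per-concept centroids.
def centroid(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def train_concepts(corpus):
    """corpus: {concept name: [feature vectors with manual annotation]}."""
    return {concept: centroid(vecs) for concept, vecs in corpus.items()}

def classify(model, features):
    """Map a low-level feature vector to the closest semantic concept."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda c: sqdist(model[c], features))

# Toy annotated corpus: 2-D features, a handful of examples per concept.
corpus = {
    "beach":  [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15)],
    "forest": [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85)],
}
model = train_concepts(corpus)
print(classify(model, (0.7, 0.3)))  # → beach
```

The sketch also makes the corpus problem discussed later concrete: the model is only as good as the annotated examples per concept, which is why corpus size and annotation quality dominate this line of research.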

20  Caltech 101 (2003), Caltech 256 (2007)  101/256 concepts  define the set of concepts first, then collect images (via image search engines)  manual selection, so clean annotation  up to a few hundred images per concept  standard benchmark datasets  “small world effect” anticipated  questionable selection of concepts

21  airplane, chair, elephant, faces, leopards, rhino  bonsai, brain, scorpion, trilobite, yin_yang...

22  ImageNet: large number of concepts, large number of images  #concepts: 10,000+  #images: 10,000,000+  concepts are systematically selected from WordNet (a computer-readable thesaurus)

23  Manual annotation by Amazon Mechanical Turk  Hard to control quality  Scalability issue

24  Currently researchers are focusing on the issue: how to effectively learn semantic concepts from a GIVEN training media corpus  Corpus: the larger, the better  But how to obtain a large corpus? ◦ CGM (Flickr, Web): noisy ◦ Manual annotation (AMT): costly, less scalable ◦ Other approaches such as the ESP game could be interesting

25  Timeline of milestones by media type, 1970–2010:  Text: Project Gutenberg, bag-of-words, TF/IDF, WSJ, TREC, PageRank  Audio/Speech: MFCC, Viterbi/HMM, 1000-word LVCSR, IBM ViaVoice  Image: USPS OCR (single digit), CMU-MIT Face DB, V-J Face Det., Caltech101, Pascal VOC, ImageNet  Video: TRECVID

26  Multimedia content analysis research: “just started”  More advanced results to come  Business value?  Killer applications?

