Genoa – May 23, 2006 LREC workshop From Media Crossing to Media Mining Franciska de Jong University of Twente/TNO ICT

fdejong@ewi.utwente.nl
http://hmi.ewi.utwente.nl/~fdejong

Overview
- Introduction
- Cluster-oriented browsing
- Reasoning
- Audio indexing
- Conclusion

Semantic Gap etc. (1)
- Need for access to information at the conceptual level is as old as the idea of Information Retrieval
- Manual annotation by documentalists
  - limitation: future user needs are hard to predict
- Text as second best:
  - full-text indexing (all words/phrases)
  - information extraction (some words/phrases)
  - …
  - but how about other modalities?

Semantic Gap etc. (2)
Approach for modalities other than text:
- exploit collateral linguistic elements
  - text: captions, teleprompter text, subtitles
  - transcribed speech
- enable automated semantic annotation
  - identification of top-n relevant concepts
  - train detectors (= automated learning of concepts based on low-level features)
  - ontological frameworks: translation of feature patterns into a concept hierarchy

Semantic Gap etc. (3)
W.V.O. Quine, ‘Word and Object’ (1960)
Motto: Ontology recapitulates philology (James Miller)
“Quine argues that the notion of a language-transcendent ‘sentence meaning’ must be rejected; meaningful studies in the semantics of reference can only be directed toward substantially the same language in which they are conducted”.

Media Crossing
Dominant search tasks:
- text-to-image
- speech-to-video
- concept-to-image/video

Media Crossing as promising concept
- limitations of monomedia analysis overcome
- fills the semantic gap between content features and user needs
- full range of available data can be exploited, including manual annotation records
- mature idea; early projects already in the ’90s (e.g., THISL)
- TRECVID demonstrates that it works
- …

Media Crossing as poor concept
- Little progress in 10 years
- Many projects, but few implementations of fully automated media crossing applications for real-life data sets/use cases
- Strong bias towards text-to-image
- Workflow in archives often prohibits adoption of possibilities
Why wait for the breakthrough of an old idea?

Other X-ing fields
Important parallels in Language Crossing
- Machine Translation
  - idea is even older (’50s)
  - successes are rare
  - number of languages covered is an indication of sophistication
  - interesting concepts: language-specific vs. interlingual vs. language-independent representations of meaning
- CLIR (Cross-lingual Information Retrieval)
  - on the agenda since the early ’90s (TREC, CLEF)
  - focus now on tasks for which heuristics play a huge role (QA, image retrieval)
  - few people (if any) got rich

Mining has a tradition
Important parallels
- Data Mining
- Text Mining
- Audio Mining
- Media Mining
- Reality Mining
- Virtual Reality Mining
- …

Media Mining
Ill-defined concept, but combines at least some of these features:
- Mining: finding patterns that haven’t been put in
- Content-oriented rather than query-based
- Format integration vs. format crossing
- Not limited to combinations of 2 modalities
- Not limited to text as starting point
- Not limited to uni-directional approaches
- Emphasis on automated analysis
- …

Initial steps
Three illustrations of initial steps in the right direction:
- content reduction via clustering: Novalist
- content merging via reasoning: MUMIS
- content enhancement via audio analysis: MultimediaN

case 1 - Novalist
- layered browser for a news corpus
- heterogeneous in type, format, source:
  - news-related broadcast programmes
  - newspapers
  - webpages
  - corporate documents
- 20+ titles, covering 2 years
- topic clustering (currently based on text only)
- multifaceted metadata extraction
- no explicit semantics
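The slide does not say how Novalist clusters its news items. As a rough illustration of text-only topic clustering in general, the sketch below uses a greedy single-pass scheme over bag-of-words vectors with cosine similarity. The function names, the `threshold` value and the sample headlines are all assumptions made for this example, not details of the Novalist system.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering (illustrative only).

    Each document joins the most similar existing cluster if the
    similarity to that cluster's summed term counts reaches the
    threshold; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["centroid"].update(vec)  # add the new document's counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


# Hypothetical headlines, invented for this example.
headlines = [
    "hurricane hits coast",
    "hurricane damage on coast",
    "parliament votes on budget",
    "budget vote in parliament",
]
groups = cluster(headlines)
```

A real news browser would at least add stop-word removal, term weighting and a time dimension; the sketch only shows the basic grouping step.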

Content reduction
Automatically generated cluster metadata:
- keywords
- thesaurus terms (via automatic classification)
- lists of named entities
- network presentation for named entities
- headlines
- summarization (via extraction technique)
- timeline
All metadata types can be queried. In addition: full-text search.

query: orkaan (‘hurricane’); overview of clusters per period

case 2 - MUMIS
- completed IST project (1999-2003)
- reports on EC soccer matches
- heterogeneous content base:
  - multiple sources (speech transcripts, webpages, ticker text, newspapers, tables)
  - multiple languages
- target:
  - searchable knowledge base
  - time links to video content base
  - reduced redundancy (each event covered only once)
  - error correction
- approach: merging results from Information Extraction
  - time-alignment, unification, re-ordering
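The three merging steps named on this slide (time-alignment, unification, re-ordering) can be pictured with a small sketch. It is not the MUMIS implementation. It assumes each extracted event carries a match minute, an event type and a dictionary of attributes, and that two events describe the same incident when they share a type, fall within a tolerance window and have no conflicting attributes. The `Event` class, the `tolerance` parameter and the sample events are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    minute: int                                # match minute reported by the source
    kind: str                                  # event type, e.g. "goal" or "foul"
    attrs: dict = field(default_factory=dict)  # extracted slots (player, team, ...)
    sources: set = field(default_factory=set)  # which sources reported the event


def unify(a, b):
    """Merge two attribute dicts; return None if any shared key conflicts."""
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            return None
        merged[key] = value
    return merged


def merge_events(events, tolerance=2):
    """Align events by time, unify compatible ones, return them in time order."""
    merged = []
    for ev in sorted(events, key=lambda e: e.minute):
        for target in merged:
            if target.kind != ev.kind or abs(target.minute - ev.minute) > tolerance:
                continue
            attrs = unify(target.attrs, ev.attrs)
            if attrs is not None:
                target.attrs = attrs
                target.sources |= ev.sources
                break
        else:
            # No compatible event found: keep this one as a new entry.
            merged.append(Event(ev.minute, ev.kind, dict(ev.attrs), set(ev.sources)))
    return sorted(merged, key=lambda e: e.minute)


# Hypothetical extraction results from three sources.
extracted = [
    Event(24, "goal", {"team": "Team A"}, {"newspaper"}),
    Event(23, "goal", {"player": "Player X"}, {"ticker"}),
    Event(60, "foul", {"player": "Player Y"}, {"speech"}),
]
result = merge_events(extracted)
```

Here the two goal reports collapse into one event with both attributes filled in, which corresponds to the "each event covered only once" target. Conflicting attributes are simply kept apart in this sketch; the error correction mentioned on the slide would need additional reasoning.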

Merging IE results

case 3 - MultimediaN
- Content enhancement via audio analysis
  - segmentation (topic, speaker)
  - speech recognition
  - time-alignment (enrichment of text with time-stamps)
  - linking of audio to newspaper archive
  - language model improvement via cross-media linking
- Emotion detection
- Application domains: news, meeting recordings, oral history

feature extraction from audio
extract features
- speech (words)
- speaker (who, when)
- structure (silence, music, speaker change)
- emotion

Time-alignment
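As an illustration of what "enrichment of text with time-stamps" can mean in practice, the sketch below aligns the words of an existing text with the timed output of a speech recognizer and copies the start times onto the matching words. It relies on `difflib.SequenceMatcher` from the Python standard library as a stand-in alignment method; the talk does not specify the technique actually used. The function name, the `(word, start_seconds)` input format and the sample data are assumptions.

```python
from difflib import SequenceMatcher


def align_timestamps(reference_words, recognized):
    """Attach recognizer start times to the words of a reference text.

    reference_words: list of words from the existing text.
    recognized: list of (word, start_seconds) pairs from a recognizer.
    Returns a list of (reference_word, start_seconds or None).
    """
    ref = [w.lower() for w in reference_words]
    hyp = [w.lower() for w, _ in recognized]
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    times = [None] * len(reference_words)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            times[block.a + offset] = recognized[block.b + offset][1]
    return list(zip(reference_words, times))


# Hypothetical data: the recognizer misheard "reached" as "reach".
text = ["The", "hurricane", "reached", "the", "coast"]
asr_output = [
    ("the", 0.0),
    ("hurricane", 0.4),
    ("reach", 1.1),
    ("the", 1.5),
    ("coast", 1.8),
]
aligned = align_timestamps(text, asr_output)
```

Words without a matching recognizer token keep `None` as their time; a fuller version might interpolate between neighbouring time-stamps.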

Next steps
Attention for:
- Heterogeneous content integration should get more attention
- Modalities other than text, speech and image
- Integration of search results via more abstract (medium-neutral) content models (e.g., probabilistic approaches to image search models, ‘visual words’ approaches)
- Exploitation of manually created annotations and surface features for video (context models) can help
- Mining is data-oriented; user interaction can offer additional information. Explore the concept of a parameterized search environment

Conclusion
Why wait for the breakthrough of an old idea?
The idea of Media Crossing has offered a useful playground for 10 years, and some applications based on it have added value, but it should not be seen as a concept rich enough to be the basis of a long-term research programme.

Thanks!
PS. Has this been recorded...?
