3 Automated Creation of Metadata Records Sometimes it is possible to generate metadata automatically from the content of a digital object. The effectiveness varies from field to field. Examples: images -- characteristics of color, texture, shape, etc. (crude); music -- optical recognition of score (good); bird song -- spectral analysis of sounds (good); fingerprints (good).
4 Automated Information Retrieval Using Feature Extraction Example: features extracted from images. Spectral features: color or tone, gradient, spectral parameter, etc. Geometric features: edge, shape, size, etc. Textural features: pattern, spatial frequency, homogeneity, etc. Features can be recorded in a feature vector space (as in a term vector space). A query can be expressed in terms of the same features. Machine learning methods, such as a support vector machine, can be used with training data to create a similarity metric between image and query. Example: searching satellite photographs for dams in California.
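A minimal sketch of the feature-vector idea on this slide, assuming each image is reduced to three numeric features and the query is expressed in the same space. It uses plain cosine similarity rather than the trained support vector machine the slide mentions, and the image names and feature values are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_a * norm_b)

def rank_images(query, images):
    """Return image names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in images.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

# Hypothetical feature vectors: [blue tone, edge density, texture homogeneity]
images = {
    "reservoir": [0.9, 0.4, 0.8],
    "forest": [0.1, 0.7, 0.3],
    "desert": [0.2, 0.1, 0.9],
}
query = [0.8, 0.5, 0.7]  # hypothetical query in the same feature space
ranking = rank_images(query, images)
print(ranking)
```

A learned metric would replace `cosine_similarity` with a function fitted to training data; the ranking step would stay the same.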
8 Effective Information Discovery With Homogeneous Digital Information Comprehensive metadata with Boolean retrieval: can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog). Full text indexing with ranked retrieval: can be excellent, but the methods were developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).
9 Mixed Content Examples: NSDL-funded collections at Cornell. Atlas: data sets of earthquakes, volcanoes, etc. Reuleaux: digitized kinematics models from the nineteenth century. Laboratory of Ornithology: sound recordings, images, videos of birds and other animals. Nuprl: logic-based tools to support programming and to implement formal computational mathematics.
10 Mixed Metadata: the Chimera of Standardization Technical reasons: (a) Characteristics of formats and genres (b) Differing user needs. Social and cultural reasons: (a) Economic factors (b) Installed base.
11 Information Discovery in a Messy World Building blocks: brute force computation; the expertise of users -- human in the loop. Methods: (a) Better understanding of how and why users seek information (b) Relationships and context information (c) Multi-modal information discovery (d) User interfaces for exploring information.
12 Understanding How and Why Users Seek Information Homogeneous content: All documents are assumed equal. Criterion is relevance (binary measure). Goal is to find all relevant documents (high recall). Hits ranked in order of similarity to query. Mixed content: Some documents are more important than others. Goal is to find the most useful documents on a topic and then browse. Hits ranked in an order that combines importance and similarity to query.
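The slide does not say how importance and similarity are combined. As one illustrative possibility, the sketch below assumes both are scores in [0, 1] and mixes them with a linear weight; the function names, the weight, and the sample hits are all hypothetical.

```python
def combined_score(similarity, importance, weight=0.5):
    """Weighted mix of query similarity and document importance, both in [0, 1]."""
    return weight * similarity + (1.0 - weight) * importance

def rank_hits(hits, weight=0.5):
    """Sort (doc_id, similarity, importance) tuples by combined score, best first."""
    return sorted(hits, key=lambda h: combined_score(h[1], h[2], weight), reverse=True)

# Hypothetical hits: (document id, similarity to query, importance)
hits = [
    ("obscure-page", 0.90, 0.10),
    ("major-portal", 0.70, 0.90),
    ("mid-article", 0.80, 0.50),
]

# weight=1.0 reproduces the homogeneous-content case: similarity only.
by_similarity = [doc for doc, _, _ in rank_hits(hits, weight=1.0)]
# weight=0.5 lets an important document outrank a slightly closer match.
by_combined = [doc for doc, _, _ in rank_hits(hits, weight=0.5)]
print(by_similarity)
print(by_combined)
```

With similarity alone the obscure page comes first; once importance is mixed in, the order reverses.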
13 Automatic Creation of Surrogates for Non-textual Materials Discovery of non-textual materials usually requires surrogates. How far can these surrogates be created automatically? Automatically created surrogates are much less expensive than manually created ones, but have high error rates. If surrogates have high rates of error, is it possible to have effective information discovery?
14 Example: Informedia Digital Video Library Collections: Segments of video programs, e.g., TV and radio news and documentary broadcasts. Cable News Network, British Open University, WQED television. Segmentation: Automatically broken into short segments of video, such as the individual items in a news broadcast. Size: More than 4,000 hours, 2 terabytes. Objective: Research into automatic methods for organizing and retrieving information from video. Funding: NSF, DARPA, NASA and others. Principal investigator: Howard Wactlar (Carnegie Mellon University).
15 Informedia Digital Video Library History Carnegie Mellon has broad research programs in speech recognition, image recognition, natural language processing. 1994. Basic mock-up demonstrated the general concept of a system using speech recognition to build an index from a sound track matched against spoken queries. (DARPA funded.) 1994-1998. Informedia developed the concept of multi-modal information discovery with a series of user interface experiments. (NSF/DARPA/NASA Digital Libraries Initiative.) 1998 onward. Continued research, particularly in human-computer interaction. Commercial spin-off failed.
16 The Challenge A video sequence is awkward for information discovery: Textual methods of information retrieval cannot be applied. Browsing requires the user to view the sequence. Fast skimming is difficult. Computing requirements are demanding (MPEG-1 requires 1.2 Mbits/sec). Surrogates are required.
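A quick check of what the MPEG-1 rate implies for storage, as a sketch. It assumes a constant 1.2 Mbits/sec with decimal units (1 Mbit = 10^6 bits, 1 terabyte = 10^12 bytes) and takes the 4,000-hour figure from the Informedia collection description on slide 14; the function name is hypothetical.

```python
MPEG1_BITS_PER_SECOND = 1.2e6  # from the slide: 1.2 Mbits/sec

def storage_bytes(hours, bits_per_second=MPEG1_BITS_PER_SECOND):
    """Bytes needed to store `hours` of video at a constant bit rate."""
    seconds = hours * 3600
    return seconds * bits_per_second / 8  # 8 bits per byte

one_hour_megabytes = storage_bytes(1) / 1e6
collection_terabytes = storage_bytes(4000) / 1e12  # 4,000 hours, per slide 14

print(f"one hour: {one_hour_megabytes:.0f} MB")
print(f"4,000 hours: {collection_terabytes:.2f} TB")
```

Under these assumptions one hour of video is about 540 MB and 4,000 hours is about 2.16 TB, which is consistent with the 2-terabyte collection size quoted for Informedia.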
17 Multi-Modal Information Discovery The multi-modal approach to information retrieval: computer programs to analyze video materials for clues, e.g., changes of scene; methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition; analysis of video track, sound track, closed captioning if present, any other information. Each mode gives imperfect information. Therefore use many approaches and combine the evidence.
18 Multi-Modal Information Discovery With mixed content and mixed metadata, the amount of information about the various resources varies greatly, but clues from many different sources can be combined. "The fundamental premise of the research was that the integration of these technologies, all of which are imperfect and incomplete, would overcome the limitations of each, and improve the overall performance in the information retrieval task." [Wactlar, 2000]
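The slides do not specify how Informedia combined its evidence. As an illustration of why several imperfect modes can outperform any single one, the sketch below uses a noisy-OR rule, which assumes the modes err independently. The mode names and confidence values are hypothetical.

```python
def combine_evidence(confidences):
    """Noisy-OR: probability that at least one independent mode is right.

    Each confidence is the estimated probability, in [0, 1], that a single
    mode has correctly matched the segment to the query.
    """
    miss = 1.0
    for p in confidences:
        miss *= (1.0 - p)  # probability that every mode so far has missed
    return 1.0 - miss

# Hypothetical per-mode confidences for one video segment
segment_evidence = {"speech": 0.6, "captions": 0.5, "screen_text": 0.3}
combined = combine_evidence(segment_evidence.values())
print(f"best single mode: {max(segment_evidence.values()):.2f}")
print(f"combined: {combined:.2f}")
```

The independence assumption is optimistic, since speech recognition and closed captions both derive from the same spoken words, so a real system would need a more careful combination rule.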
19 Informedia Library Creation [Diagram. Inputs: video, audio, text. Processes: speech recognition, image extraction, natural language interpretation, segmentation. Output: segments with derived metadata.]
20 Text Extraction Source. Sound track: Automatic speech recognition using Sphinx II and III recognition systems. (Unrestricted vocabulary, speaker independent, multi-lingual, background sounds.) Error rates of 25% and up. Closed captions: Digitally encoded text. (Not on all video. Often inaccurate.) Text on screen: Can be extracted by image recognition and optical character recognition. (Matches speaker with name.) Query. Spoken query: Automatic speech recognition using the same system as is used to index the sound track. Typed query: Typed by the user.
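The slide does not say how the error rate is measured. A common measure for speech recognition is word error rate: the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below assumes that measure; the transcripts are hypothetical and chosen to give a 25% rate.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference word deleted
                curr[j - 1] + 1,     # extra word inserted
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical transcripts: one substitution ("passed" -> "past") and one
# deletion ("late") against an eight-word reference.
reference = "the senate passed the budget bill late tonight"
hypothesis = "the senate past the budget bill tonight"
wer = word_error_rate(reference, hypothesis)
print(f"word error rate: {wer:.0%}")
```

At this rate roughly one word in four is wrong, which is why the index built from the sound track is treated as only one of several sources of evidence.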
24 Limits to Scalability Informedia has demonstrated effective information discovery with moderately large collections. Problems with increased scale: Technical -- storage, bandwidth, etc. Diversity of content -- difficult to tune heuristics. User interfaces -- complexity of browsing grows with scale.
25 Lessons Learned Searching and browsing must be considered integrated parts of a single information discovery process. Data (content and metadata), computing systems (e.g., search engines), and user interfaces must be designed together. Multi-modal methods compensate for incomplete or error-prone data.
26 Interoperability: The Problem Conventional approaches require partners to support agreements (technical, content, and business). But a Web-based digital library program needs thousands of very different partners, most of whom are not directly part of the program. The challenge is to create incentives for independent digital libraries to adopt agreements.
27 Approaches to interoperability The conventional approach Wise people develop standards: protocols, formats, etc. Everybody implements the standards. This creates an integrated, distributed system. Unfortunately... Standards are expensive to adopt. Concepts are continually changing. Systems are continually changing. Different people have different ideas.
28 Interoperability is about agreements Technical agreements cover formats, protocols, security systems so that messages can be exchanged, etc. Content agreements cover the data and metadata, and include semantic agreements on the interpretation of the messages. Organizational agreements cover the ground rules for access, for changing collections and services, payment, authentication, etc. The challenge is to create incentives for independent digital libraries to adopt agreements
29 Function versus cost of acceptance [Figure. Axes: function, cost of acceptance. Labels: many adopters, few adopters.]
30 Example: security [Figure. Axes: function, cost of acceptance. Points: public key infrastructure, IP address, login ID and password.]
31 Example: metadata standards [Figure. Axes: function, cost of acceptance. Points: MARC, free text, Dublin Core.]
32 NSDL: The Spectrum of Interoperability
Level        Agreements                                Example
Federation   Strict use of standards (syntax,          AACR, MARC, Z39.50
             semantic, and business)
Harvesting   Digital libraries expose metadata;        Open Archives metadata
             simple protocol and registry              harvesting
Gathering    Digital libraries do not cooperate;       Web crawlers and
             services must seek out information        search engines