Advanced Metadata Usage
Daan Broeder, TLA - MPI for Psycholinguistics / CLARIN
Metadata in Context, APA/CLARIN Workshop, September 2010, Nijmegen
Content
- Metadata for aggregations / collections
- Metadata and Vocabularies
- Matching Services and Data / Profile matching
CMDI record data model
- All metadata elements consist of a Name, a Value, a Scheme and a concept reference.
- Possible relations and pointers to Journal files (a special feature for workflow systems).
- Recursive structure of metadata components: an Actor component can contain a Language component, a Contact component, etc.
- A CMDI record can describe or point to resources, but also to other metadata descriptions.
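The data model above can be sketched as a small set of Python classes. This is only an illustration under stated assumptions, not the CMDI schema itself: the class names (`Element`, `Component`, `Record`), the `walk` helper, and all identifiers and values in the example are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    name: str
    value: str
    scheme: Optional[str] = None       # value scheme, e.g. a type or code list
    concept_ref: Optional[str] = None  # pointer to a concept definition


@dataclass
class Component:
    name: str
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)  # recursion

    def walk(self, path=""):
        """Yield (path, element) pairs for this component and all nested ones."""
        here = f"{path}/{self.name}"
        for element in self.elements:
            yield f"{here}/{element.name}", element
        for child in self.components:
            yield from child.walk(here)


@dataclass
class Record:
    root: Component
    resource_refs: List[str] = field(default_factory=list)  # described resources
    metadata_refs: List[str] = field(default_factory=list)  # other MD records
    journal_refs: List[str] = field(default_factory=list)   # journal files


# Hypothetical example values; none of these identifiers come from the slides.
language = Component("Language", [
    Element("Name", "Dutch", scheme="string",
            concept_ref="hypothetical:concept/language-name"),
])
contact = Component("Contact", [Element("Email", "actor@example.org")])
actor = Component("Actor", [Element("Role", "speaker")], [language, contact])
record = Record(actor,
                resource_refs=["hdl:example/resource-1"],
                metadata_refs=["hdl:example/record-2"])

paths = [path for path, _ in record.root.walk()]
print(paths)
```

Because a `Component` holds both elements and further components, the Actor/Language/Contact nesting from the slide needs no special casing.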
Metadata & Collections I
- Single resource: one metadata record describes one resource.
- Simple collection: one metadata record describes multiple resources. The metadata represents the collection.
- Flexible hierarchy of sub-collections using separate MD records; easy extension with new (sub)collections.
(Diagram: metadata records (MD) pointing to resources (R) and to sub-collection records.)
Issues with metadata record hierarchies
- Simple collection: one metadata record describes multiple resources; the metadata pertains to all referenced resources.
- Hierarchy of sub-collections: the metadata pertains to all implied resources.
- We do not want to duplicate metadata, so we make use of percolation, both downwards and upwards.
- But percolation has rules.
(Diagram: a hierarchy of MD records and resources with the language labels French, Dutch, Dutch/French and Dutch.)
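The slide does not spell out the percolation rules, so the following is only a sketch under two assumed rules: downwards, a record inherits any field it does not set itself; upwards, a collection's values for a field are the union of its own and its descendants' values. The `Node` class, the `Licence` field and the node names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Node:
    name: str  # assumed unique within the hierarchy
    own: Dict[str, Set[str]] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def percolate_down(node, inherited=None):
    """Effective metadata per node: own values override inherited ones."""
    effective = {**(inherited or {}), **node.own}
    result = {node.name: effective}
    for child in node.children:
        result.update(percolate_down(child, effective))
    return result


def percolate_up(node, key):
    """Values of `key` implied for a node: its own plus all descendants'."""
    values = set(node.own.get(key, set()))
    for child in node.children:
        values |= percolate_up(child, key)
    return values


french = Node("sub-fr", {"Language": {"French"}})
dutch = Node("sub-nl", {"Language": {"Dutch"}})
corpus = Node("corpus", {"Licence": {"academic"}}, [french, dutch])

print(percolate_up(corpus, "Language"))
print(percolate_down(corpus)["sub-nl"])
```

Under these assumed rules the top-level collection is implicitly Dutch/French without that value being stored twice, and each sub-collection sees the licence recorded only at the top.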
Virtual Collections
What makes a collection virtual?
- Not published by the resource creator/owner
- Created by a researcher for a particular research purpose
- Often consists of distributed resources
- It has no “natural” home, so it needs a VC registry
In research papers, a reference to the VC is preferable to referring to perhaps thousands of metadata records or resources.
(Diagram: a Virtual Collection Registry holding an MD record that points to MD records and resources (R) in Repository A and Repository B.)
Issues for Virtual Collections
- Registry of Virtual Collections
  - Maintained by a suitable authority that guarantees its persistence
  - Accessibility to users: who can create one?
- Unique & persistent identification of collections
  - The link from the VC metadata to the (sub)collection metadata must be as persistent as the links to the resources
- Searching & browsing collections
  - Should the VC be as visible as the collections that are maintained and published by bona fide archives?
METADATA & VOCABULARIES
Metadata Quality
If we learnt anything from the CLARIN VLO prototype, it is that the general metadata content quality is not that good:
- Formatting variants
- Spelling variants and errors
- Translations
- Misinterpretations of the metadata schema
If that basis is not good, anything we build on it is flawed. Therefore CLARIN wants to provide services that help improve the metadata content quality.
MD Quality improvement
- At the metadata creation stage
  - Provide metadata creation tools that constrain user input, e.g. pick lists in ARBIL (the CLARIN metadata editor)
  - However, there is no guarantee that these tools are used. Often the metadata is created from other metadata created by other tools
  - Expert knowledge about the resources is probably available
- After the metadata harvesting stage
  - Curation can be a central effort
  - No expert knowledge about the resources is available
But both possibilities need well-maintained vocabularies to help create or curate the metadata.
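As an illustration of vocabulary-based curation after harvesting, the sketch below maps formatting variants, known translations and near-miss spellings onto a controlled vocabulary. It is an assumption-laden toy, not a CLARIN service: the vocabulary, the variants table, the `curate` function and the similarity cutoff are all hypothetical, and the fuzzy step uses Python's standard `difflib.get_close_matches`.

```python
import difflib

VOCABULARY = ["Dutch", "French", "German"]  # hypothetical pick list
VARIANTS = {"nederlands": "Dutch", "français": "French"}  # hypothetical translations


def curate(value, vocabulary=VOCABULARY, variants=VARIANTS, cutoff=0.8):
    """Map a raw metadata value to a vocabulary term, or None if unresolved."""
    cleaned = " ".join(value.split()).casefold()  # fix formatting variants
    by_key = {term.casefold(): term for term in vocabulary}
    if cleaned in by_key:
        return by_key[cleaned]
    if cleaned in variants:  # known translations
        return variants[cleaned]
    # Spelling variants and errors: accept only a sufficiently close match.
    close = difflib.get_close_matches(cleaned, list(by_key), n=1, cutoff=cutoff)
    return by_key[close[0]] if close else None


for raw in ["  dutch ", "Nederlands", "Frensh", "Klingon"]:
    print(repr(raw), "->", curate(raw))
```

Values that cannot be resolved come back as `None`, so a central curator without expert knowledge of the resource can flag them rather than guess.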
Vocabularies for CLARIN MD
- General vocabularies: Country, Organization name, …
- Technical vocabularies: media type, mime-type, …
- Linguistic type vocabularies: Genre, linguistic subject, language names, language identifiers, …
Vocabularies for CLARIN MD
Where can we fetch them?
- The metadata schema itself
  - Good choice for a very stable vocabulary
  - No updates of vocabularies are possible once schemas are distributed
  - No support for a “recommendation” vocabulary
- ISO DCR, excellent for linguistic type vocabularies
  - Maintained and updated by the linguistic community
  - No support for a “recommendation” vocabulary
- ISO CDB, ISO language codes, country codes, ...
  - Limited accessibility
- A “new” special vocabulary (web) service
  - The technology is not difficult; several similar services exist
  - Providing updates is the real problem, e.g. when does a new country or organization get added?
Vocabulary one stop shop
The management and maintenance is done by several organizations for good reasons. On the other hand, we would like a central registry that can act as a one-stop shop, rather than having to maintain N different interfaces.
(Diagram: an MD Creator using the MD Editor and an MD Curator doing MD repair both consult a General Vocabulary Service, which is fed by ISO CDB, ISO language codes, ISO DCR linguistic vocabularies and the EC Org. DB, with vocabulary curation.)
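The one-stop-shop idea can be sketched as a thin registry in front of independently maintained sources. Everything here is a hypothetical stand-in: the `VocabularyRegistry` class, the vocabulary names, and the provider functions, which return hard-coded samples where a real service would query the maintaining organization.

```python
from typing import Callable, Dict, List


class VocabularyRegistry:
    """Single lookup interface in front of several vocabulary sources."""

    def __init__(self):
        self._providers: Dict[str, Callable[[], List[str]]] = {}

    def register(self, vocabulary, provider):
        # Maintenance stays with the provider; the registry only dispatches.
        self._providers[vocabulary] = provider

    def vocabularies(self):
        return sorted(self._providers)

    def terms(self, vocabulary):
        if vocabulary not in self._providers:
            raise KeyError(f"no provider registered for {vocabulary!r}")
        return sorted(self._providers[vocabulary]())


# Hypothetical providers with hard-coded sample values.
def language_codes():
    return ["nld", "fra"]


def country_codes():
    return ["NL", "FR"]


registry = VocabularyRegistry()
registry.register("language-codes", language_codes)
registry.register("country-codes", country_codes)

print(registry.vocabularies())
print(registry.terms("language-codes"))
```

An editor or curation tool then talks to one interface, while each vocabulary keeps its own maintainer and update cycle.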
MATCHING SERVICES AND DATA
Language Technology
- Think of parsers, taggers, translators, aligners, speech recognizers and synthesizers, …
- For efficiency we need to use a Service Oriented Architecture (SOA), where only remote processing takes place
- Add to that a way of combining distributed services in a workflow that interacts with resource repositories
- Tools like this exist in a more or less usable form: TAVERNA, GATE, WebLicht, ...
(Diagram: a Workflow Editor produces a WF specification for a Workflow Engine, which chains the web services WS1 (tokenizer), WS2 (POS tagger) and WS3 (Named Entity recognition) over resources (R).)
Profile matching
- The workflow editor should not show all available tools
- It must make a selection based on the metadata of the resource and that of the service
- Simple via a type system, or complicated via input and output schema matching
(Diagram: the Workflow Editor consults a metadata registry for services and data, holding resource metadata and service metadata, to match a resource (R) with WS1, the tokenizer.)
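The "simple via a type system" variant can be sketched as a filter over service metadata. This is an assumed, minimal scheme rather than how any particular workflow editor works: the profile classes, the type labels (`tokens`, `pos-tagged`, `entities`) and the language sets are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ResourceProfile:
    name: str
    data_type: str  # e.g. "text/plain", or a hypothetical "tokens" type
    language: str


@dataclass(frozen=True)
class ServiceProfile:
    name: str
    input_type: str
    output_type: str
    languages: FrozenSet[str]


def matching_services(resource, services):
    """Services whose declared input type and languages fit the resource."""
    return [s for s in services
            if s.input_type == resource.data_type
            and resource.language in s.languages]


def output_profile(service, resource):
    """Profile of what the service would produce (no processing is done)."""
    return ResourceProfile(f"{resource.name}>{service.name}",
                           service.output_type, resource.language)


services: List[ServiceProfile] = [
    ServiceProfile("tokenizer", "text/plain", "tokens", frozenset({"nld", "fra"})),
    ServiceProfile("POS tagger", "tokens", "pos-tagged", frozenset({"nld"})),
    ServiceProfile("NER", "pos-tagged", "entities", frozenset({"nld"})),
]

text = ResourceProfile("interview.txt", "text/plain", "nld")
step1 = matching_services(text, services)
step2 = matching_services(output_profile(step1[0], text), services)
print([s.name for s in step1], [s.name for s in step2])
```

The editor would offer only the tokenizer for the raw text, and only the POS tagger once a tokenized result is selected. Schema-level matching would replace the simple equality test on types.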
Thank you for your attention CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n°
Collections II (Bundles)
- Bundle: a few tightly related resources
- Also used for collections generated by workflow systems
- Metadata records are copied and extended, and so are complete
(Diagram labels: annotation, media file; Process 1, Process 2; resources R1 to R3 with metadata records MD and MD1 to MD3.)
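One possible reading of "copied and extended" is that each workflow step describes its output with a new record that embeds copies of the records of its inputs, so every record is complete on its own. The sketch below illustrates that reading only; the exact scheme in the slide's diagram is not recoverable, and the `MDRecord` class, `run_process` function and field names are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class MDRecord:
    resource: str
    description: dict
    sources: List["MDRecord"] = field(default_factory=list)  # embedded copies


def run_process(process_name, inputs, output_resource):
    """Describe a step's output with a record embedding copies of its inputs' records."""
    return MDRecord(
        resource=output_resource,
        description={"created_by": process_name},
        sources=[copy.deepcopy(md) for md in inputs],
    )


md1 = MDRecord("R1", {"type": "media file"})
md2 = run_process("Process 1", [md1], "R2")
md3 = run_process("Process 2", [md2], "R3")

# Later changes to the original record do not alter the embedded copies.
md1.description["type"] = "changed"
print(md3.sources[0].sources[0].description)
```

The cost of this design is duplication, which is the trade-off against the percolation approach used for collection hierarchies.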