Advanced Metadata Usage Daan Broeder TLA - MPI for Psycholinguistics / CLARIN Metadata in Context, APA/CLARIN Workshop, September 2010 Nijmegen.

1 Advanced Metadata Usage Daan Broeder TLA - MPI for Psycholinguistics / CLARIN Metadata in Context, APA/CLARIN Workshop, September 2010 Nijmegen

2 Content
- Metadata for aggregations / collections
- Metadata and vocabularies
- Matching services and data / profile matching


4 CMDI record data model
- All metadata elements consist of a name, a value, a scheme, and a concept reference.
- Possible relations and pointers to journal files (a special feature for workflow systems).
- Recursive structure of metadata components: an Actor component can contain a Language component, a Contact component, etc.
- A CMDI record can describe or point to resources, but also to other metadata descriptions.
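A minimal Python sketch of the record model described on slide 4, for illustration only. The class names (`Element`, `Component`, `Record`), the field names, and the sample values are assumptions made here; they are not taken from the CMDI specification or from any CLARIN tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """A metadata element: name, value, scheme, plus a concept reference."""
    name: str
    value: str
    scheme: Optional[str] = None       # hypothetical: value type or vocabulary name
    concept_ref: Optional[str] = None  # hypothetical: pointer to a concept definition


@dataclass
class Component:
    """A component holds elements and, recursively, other components."""
    name: str
    elements: List[Element] = field(default_factory=list)
    components: List["Component"] = field(default_factory=list)

    def walk(self, prefix=""):
        """Yield (path, element) pairs for every element in the tree."""
        path = f"{prefix}/{self.name}"
        for element in self.elements:
            yield f"{path}/{element.name}", element
        for child in self.components:
            yield from child.walk(path)


@dataclass
class Record:
    """A record can point to resources and to other metadata records."""
    root: Component
    resource_refs: List[str] = field(default_factory=list)
    metadata_refs: List[str] = field(default_factory=list)


# An Actor component that contains a Language component (sample values are made up).
language = Component("Language", [Element("Name", "Dutch", scheme="string")])
actor = Component("Actor", [Element("Role", "speaker")], [language])
record = Record(actor, resource_refs=["example-resource-1"])
paths = [path for path, _ in record.root.walk()]
```

Walking the tree yields one path per element, which shows how the nesting of components carries through to the individual values.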

5 Metadata & Collections I
- Single resource: one metadata record describes one resource.
- Simple collection: one metadata record describes multiple resources. The metadata represents the collection.
- Flexible hierarchy of sub-collections using separate MD records: easy extension with new (sub-)collections.
[Diagram: metadata records (MD) pointing to resources (R) and to other MD records]

6 Issues with metadata record hierarchies
- Simple collection: one metadata record describes multiple resources; the metadata pertains to all referenced resources.
- Hierarchy of sub-collections: the metadata pertains to all implied resources.
- We do not want to duplicate metadata, so we make use of percolation, both downwards and upwards.
- But percolation has rules.
[Diagram: hierarchy of MD records and resources (R) labelled French, Dutch, Dutch/French, Dutch]
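The percolation idea can be sketched in Python as follows. The two rules used here are assumptions, since the slide does not spell them out: downwards, a record inherits a value from its nearest ancestor unless it states its own; upwards, a collection's value is the union of what it and its sub-collections state (so a French and a Dutch sub-collection give a Dutch/French parent). `Node`, `inherited`, and `aggregated` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """One metadata record in a hierarchy of (sub-)collections."""
    name: str
    own: Dict[str, Set[str]] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def inherited(self, key: str) -> Set[str]:
        """Downward percolation: nearest value on this node or an ancestor."""
        node: Optional[Node] = self
        while node is not None:
            if key in node.own:
                return set(node.own[key])
            node = node.parent
        return set()

    def aggregated(self, key: str) -> Set[str]:
        """Upward percolation: union of values on this node and all descendants."""
        values = set(self.own.get(key, set()))
        for child in self.children:
            values |= child.aggregated(key)
        return values


# Sample hierarchy; the keys and values are illustrative only.
root = Node("corpus", {"project": {"demo"}})
fr = root.add(Node("sub-fr", {"language": {"French"}}))
nl = root.add(Node("sub-nl", {"language": {"Dutch"}}))
```

With these rules nothing is duplicated: `project` is stated once at the top and read from below, while `language` is stated on the sub-collections and computed for the parent.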

7 Virtual Collections
What makes a collection virtual?
- Not published by the resource creator/owner
- Created by a researcher for a particular research purpose
- Often distributed resources
- It has no “natural” home, so it needs a VC registry
- In research papers, a reference to the VC is preferable to referring to perhaps thousands of metadata records or resources
[Diagram: Virtual Collection Registry, with MD records and resources (R) in Repository A and Repository B]

8 Issues for Virtual Collections
- Registry of Virtual Collections
  - Maintained by a suitable authority that guarantees its persistence
  - Accessibility to users: who can create one?
- Unique & persistent identification of collections
  - The link from the VC metadata to the (sub-)collection metadata must be as persistent as the link to the resources
- Searching & browsing collections
  - Should the VC be as visible as the collections that are maintained and published by bona fide archives?


10 Metadata Quality
- If we learnt anything from the CLARIN VLO prototype, it is that the general metadata content quality is not that good:
  - Formatting variants
  - Spelling variants and errors
  - Translations
  - Misinterpretations of the metadata schema
- If that basis is not good, anything we build on it is flawed.
- Therefore CLARIN wants to provide services that help improve the metadata content quality.

11 MD Quality improvement
- At the metadata creation stage
  - Provide metadata creation tools that constrain user input, e.g. pick lists in ARBIL (the CLARIN metadata editor)
  - However, there is no guarantee that these tools are used. Often the metadata is created from other metadata created by other tools
  - Expert knowledge about the resources is probably available
- After the metadata harvesting stage
  - Curation can be a central effort
  - No expert knowledge about the resources is available
- Both possibilities need well-maintained vocabularies to help create or curate the metadata
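As a rough illustration of how a maintained vocabulary can support curation after harvesting, the Python sketch below maps formatting and spelling variants onto vocabulary terms. The function name `normalize`, the sample vocabulary, and the similarity cutoff are assumptions made for this example; this is not how ARBIL or any CLARIN service is known to work. It uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib
from typing import Iterable, Optional


def normalize(value: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the vocabulary term that `value` most plausibly stands for, or None."""
    cleaned = " ".join(value.split())  # collapse stray whitespace
    by_lower = {term.lower(): term for term in vocabulary}
    # Formatting variants: case and whitespace differences only.
    if cleaned.lower() in by_lower:
        return by_lower[cleaned.lower()]
    # Spelling variants and errors: closest term above the similarity cutoff.
    matches = difflib.get_close_matches(cleaned.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else None


# Hypothetical vocabulary of language names.
languages = ["Dutch", "French", "German"]
fixed_format = normalize("  dutch ", languages)
fixed_spelling = normalize("Frensh", languages)
unknown = normalize("Klingon", languages)
```

Values that match nothing come back as `None`, so a curator can review them instead of having them silently rewritten.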

12 Vocabularies for CLARIN MD
- General vocabularies: country, organization name, …
- Technical vocabularies: media type, MIME type, …
- Linguistic type vocabularies: genre, linguistic subject, language names, language identifiers, …

13 Vocabularies for CLARIN MD
Where can we fetch them?
- The metadata schema itself
  - Good choice for a very stable vocabulary
  - No updates of vocabularies possible once schemas are distributed
  - No support for a “recommendation” vocabulary
- ISO DCR, excellent for linguistic type vocabularies
  - Maintained and updated by the linguistic community
  - No support for a “recommendation” vocabulary
- ISO CDB, ISO 639-3 language codes, country codes, ...
  - Limited accessibility
- A “new” special vocabulary (web) service
  - The technology is not difficult; several similar services exist
  - Providing updates is the real problem, e.g. when does a new country or organization get added?

14 Vocabulary one stop shop
- The management & maintenance is done by several organizations for good reasons.
- On the other hand, we would like a central registry that can act as a one stop shop rather than maintain N different interfaces.
[Diagram labels: MD Editor; General Vocabulary Service; MD Creator; Vocabulary curation; ISO CDB; ISO 639-3 language codes; ISO DCR linguistic vocabularies; EC Org. DB; MD repair; MD Curator]


16 Language Technology
Think of parsers, taggers, translators, aligners, speech recognizers and synthesizers, …
- For efficiency, we need to use a Service Oriented Architecture (SOA), where only remote processing takes place
- Add to that a way of combining distributed services in a workflow
- with interaction with resource repositories
- Tools like this exist in a more or less usable form: TAVERNA, GATE, WebLicht, ...
[Diagram labels: WS1, WS2, WS3 (tokenizer, POS tagger, Named Entity recognition); resources (R); Workflow Engine; Workflow Editor; WF specification]

17 Profile matching
- The workflow editor should not show all available tools
- It must make a selection based on the metadata of the resource and that of the service
- Simple via a type system, or complicated via input and output schema matching
[Diagram labels: WS1 (tokenizer); resource (R); Workflow Editor; resource metadata; service metadata; metadata registry for services and data]
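The simple, type-based variant of profile matching can be sketched in Python as below. Everything named here is an assumption for illustration: the classes `ServiceProfile` and `ResourceProfile`, the functions `applicable` and `result_of`, the type labels such as `plain-text`, and the idea that each service declares exactly one input and one output type. Schema-based matching would need considerably more than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceProfile:
    """Metadata describing what a service accepts and what it returns."""
    name: str
    consumes: str
    produces: str


@dataclass(frozen=True)
class ResourceProfile:
    """Metadata describing a resource by its data type."""
    name: str
    data_type: str


def applicable(resource: ResourceProfile, services: List[ServiceProfile]) -> List[ServiceProfile]:
    """Select only the services whose input type matches the resource type."""
    return [service for service in services if service.consumes == resource.data_type]


def result_of(service: ServiceProfile, resource: ResourceProfile) -> ResourceProfile:
    """Describe the resource a service would produce, without running it."""
    return ResourceProfile(f"{resource.name}+{service.name}", service.produces)


# Hypothetical registry entries following the tokenizer / POS tagger / NER chain.
registry = [
    ServiceProfile("tokenizer", "plain-text", "tokens"),
    ServiceProfile("POS tagger", "tokens", "pos-tagged-tokens"),
    ServiceProfile("Named Entity recognition", "pos-tagged-tokens", "named-entities"),
]
text = ResourceProfile("interview.txt", "plain-text")
first_step = applicable(text, registry)
tokens = result_of(first_step[0], text)
second_step = applicable(tokens, registry)
```

A workflow editor built this way would offer only the tokenizer for a plain-text resource, and only the POS tagger once the tokenizer output is selected.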

18 CLARIN WF workspace
[Diagram labels: Workspace; WS1, WS2, WS3; Repository/Archive; provenance data; resource; WF engine; metadata]

19 Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230.

20 Collections II (Bundles)
- Bundle: a few tightly related resources; also used for collections generated by workflow systems.
- Metadata records are copied and extended, and so are complete.
[Diagram labels: resources R1, R2, R3; metadata records MD, MD1, MD2, MD3; Process 1; Process 2; annotation; media file]
