Depositors’ usage of IMDI metadata Daan Broeder & Alex Klassmann MPI Institute for Psycholinguistics DELAMAN meeting London 2006.


1 Depositors’ usage of IMDI metadata Daan Broeder & Alex Klassmann MPI Institute for Psycholinguistics DELAMAN meeting London 2006

2 IMDI metadata
Forms with ~150 possible descriptors
–Describes bundles of related resources
–Extensive set compared with DC/OLAC
–But only “name” descriptor is compulsory
Archive holds
–~40000 IMDI sessions or resource bundles
  +15000 non-local but available in our DB
–Describing ~150000 resources

3 IMDI Metadata
The descriptors are hierarchically ordered entries, which concern
–the event (recording location, date, etc.),
–the project,
–the languages involved,
–the participants,
–the type and nature of speech,
–technical information about the resources,
–access rights.
Values of descriptors can be closed or open vocabularies or free text.
Users can add prose descriptions at each of these levels, plus project/user-defined keys.

4 Metadata Use
Documentation of the resources
Retrieval and reuse: archive offers tools for:
–Browsing the archive’s corpora
–Structured metadata search
  High precision, low recall
–Unstructured Google-like metadata search
  High recall, low precision
Large set -> not all elements are always relevant
–Sparsely populated metadata space
–Search tool to show frequency counts for metadata values. Avoids fruitless searches.
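The frequency counts mentioned on this slide can be illustrated with a minimal Python sketch. It is not the archive's search tool: the `value_counts` function and the flat dict-per-session layout are assumptions made for illustration, not the IMDI file format.

```python
from collections import Counter

def value_counts(sessions, descriptor):
    """Count how many sessions carry each non-empty value of `descriptor`.

    `sessions` is a list of dicts mapping descriptor names to string values
    (a hypothetical flattened view of the metadata, assumed for this sketch).
    """
    counts = Counter()
    for session in sessions:
        value = (session.get(descriptor) or "").strip()
        if value:
            counts[value] += 1
    return counts

# Hypothetical sample data.
sessions = [
    {"Genre": "narrative", "Country": "Brazil"},
    {"Genre": "narrative", "Country": "Peru"},
    {"Genre": "conversation"},
    {"Country": "Peru"},
]

# Showing the counts up front tells the user which values are worth searching for.
for value, n in value_counts(sessions, "Genre").most_common():
    print(f"{value}: {n}")
```

A value that never appears in the counts cannot match any session, which is how such a display avoids fruitless searches.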

5

6 Depositor Guidance
In general, depositors are urged to be as complete as possible for documentation purposes
Some projects have an obligatory set of descriptors to fill in (CGN, DBD, …)
Provide training to get familiar with the set and tools
Provide documentation
Support by student assistants and corpus managers

7 Observations I
Often researchers do not fill in all the relevant data at their disposal.
Some tendency to avoid this time-consuming work, which is oriented to reuse by others.
The sheer size of the set may discourage people from starting to fill in data at all.
Training helps.
Best results in projects that decided beforehand which descriptors needed to be filled in.
Of course there are also very committed individuals!
Corpus managers/student assistants may clean things up
–but this is of limited use since only the researcher has the specific knowledge
–they can serve as intermediaries.

8 Observations II
Only that part of the archive where metadata was specified manually was considered (e.g. CGN was excluded, as were sessions outside the MPI)
Statistics on the basis of the ~25000 remaining sessions
The data give an impression of how often fields are actually filled in (i.e. not empty and not the default “unknown” or “unspecified”)
Cannot exclude “repairs” where obvious omissions were repaired by corpus management
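The "filled in" criterion above can be sketched in Python as follows. This is a minimal illustration, not the procedure the authors used: the function names, the dict-per-session layout and the case-insensitive matching of the default values are all assumptions.

```python
from collections import Counter

# Values treated as "not filled in", following the criterion on the slide.
# Matching them case-insensitively is an assumption of this sketch.
PLACEHOLDERS = {"", "unknown", "unspecified"}

def is_filled(value):
    """Return True if a descriptor value counts as actually filled in."""
    if value is None:
        return False
    return value.strip().lower() not in PLACEHOLDERS

def fill_rates(sessions):
    """Percentage of sessions in which each descriptor is filled in.

    `sessions` is a list of dicts mapping descriptor names to string values
    (a hypothetical flattened view, not the IMDI XML format).
    """
    total = len(sessions)
    if total == 0:
        return {}
    filled = Counter()
    names = set()
    for session in sessions:
        for name, value in session.items():
            names.add(name)
            if is_filled(value):
                filled[name] += 1
    return {name: round(100 * filled[name] / total) for name in names}

# Hypothetical sample data.
sample = [
    {"Country": "Netherlands", "Genre": "Unspecified"},
    {"Country": "Brazil", "Genre": "narrative"},
    {"Country": "", "Genre": "unknown"},
    {"Country": "Peru"},
]
print(fill_rates(sample))
```

A descriptor that is missing from a session counts as not filled in, the same as an empty or default value.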

9
Descriptor name                 total-25000  fl-12000  acqui-10000
Country                         93           93        99
Address                         15           21        15
Region                          7            10        11
Description                     48           30        77
Key                             33           17        58
Project.Name                    90           91        87
Content.Description             93           95        97
Genre                           29           44        15
SubGenre                        23           34        13
Task                            43           49        34
Modalities                      80           80        82
Subject                         3            6         2
Interactivity                   73           72        81
PlanningType                    53           51        73
Involvement                     70           71        72
SocialContext                   6            10        9
EventStructure                  7            9         9
Channel                         8            10        11
Content.Language.Description    43           25        67
Content.Language.Id             91           90        91
Content.Language.Name           91           90        94

10
Actor.Language.Description            33  14  61
Actor.Language.Id                     25  20  53
Actor.Language.Name                   47  37  83
Actor.Role                            94  97  99
Actor.Name                            94  95  99
Actor.FullName                        90  93  97
Actor.Code                            70  68  84
Actor.FamilySocialRole                24  31  18
Actor.EthnicGroup                     14  20  13
Actor.BirthDate                       5   8   8
Actor.Age                             44  47  50
Actor.Sex                             70  69  92
Actor.Education                       13  16  11
Actor.Description                     65  78  56
Actor.Key                             52  44  68
MediaFile.Type                        85  83  85
MediaFile.Format                      85  83  85
MediaFile.Quality                     18  8   31
WrittenResource.Type                  67  57  71
WrittenResource.SubType               30  19  35
WrittenResource.Format                56  42  70
WrittenResource.ContentEncoding       3   7   0
WrittenResource.CharacterEncoding     3   12  0
WrittenResource.LanguageId            4   1   1

11 Conclusions
As can be seen, the sets are far from complete.
But every field of the schema has been used in some sessions, so no field in the schema seems obsolete.
People find use for the description fields that are available at different levels (~50%)
The user/project-defined keys are also used (~50%) -> IMDI set is not big enough
Some keys are not much used
–Remove?
–But where then to put this information if it is available?

