Enrichment and Structuring of Archival Description Metadata
Kalliopi Zervanou*, Ioannis Korkontzelos**, Antal van den Bosch* & Sophia Ananiadou**
* Tilburg Centre for Cognition & Communication, The University of Tilburg, NL
** National Centre for Text Mining, The University of Manchester, UK
ACL/LaTeCH-Portland, June 24th 2011
Research on Metadata
Developing standards:
– collection-specific (e.g. EAD, MARC21)
– cross-collection (e.g. Dublin Core)
Providing mappings:
– across schemas
– to ontologies (ad hoc or standard, e.g. CIDOC-CRM)
Discarding metadata for IR (Koolen et al., 2007)
Exploiting metadata for IR (Zhang & Kamps, 2009)
The IISH EAD dataset
EAD: XML standard for encoding archival descriptions
Challenges:
– Variety of languages used
– Varying type and amount of information
– Style: enumerations, lists, incomplete sentences
Motivation & Objectives
Improved search and retrieval:
– content-based metadata document clustering
– content-based/semantic search
– support exploratory search
– link across collections, metadata formats & institutions
– create unified metadata knowledge resources
Method overview
Pre-processing
EAD/XML element selection & extraction
– EAD elements containing free text & archive content information
Language identification (n-gram method)
– Identifier trained on Europarl corpus
Text snippet length: ~20 tokens
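The element selection and extraction step could look roughly like the sketch below. It is only an illustration using Python's standard `xml.etree.ElementTree`: the set `FREE_TEXT_TAGS` is a hypothetical selection of free-text EAD elements (the slides do not list which elements were used), and the sample record is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical selection of free-text EAD elements; the slides do not
# say which elements were actually extracted.
FREE_TEXT_TAGS = {"scopecontent", "bioghist", "unittitle", "abstract"}

def local_name(tag):
    """Strip a namespace prefix such as '{urn:...}' from an element tag."""
    return tag.rsplit("}", 1)[-1]

def extract_free_text(ead_xml):
    """Return (tag, text) pairs for selected elements, whitespace-normalised."""
    root = ET.fromstring(ead_xml)
    snippets = []
    for elem in root.iter():
        if local_name(elem.tag) in FREE_TEXT_TAGS:
            # itertext() also collects text inside inline markup like <emph>.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                snippets.append((local_name(elem.tag), text))
    return snippets

# Invented sample record, for illustration only.
SAMPLE = """<ead><archdesc>
  <did><unittitle>Example collection</unittitle></did>
  <bioghist><p>Born in 1880;  active in the <emph>labour</emph> movement.</p></bioghist>
  <dsc><c01><did><unitid>12</unitid></did></c01></dsc>
</archdesc></ead>"""

snippets = extract_free_text(SAMPLE)
```

Note that if one selected element is nested inside another, its text would be collected twice; a real pipeline would need to decide how to handle that.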
Snippet length based on language
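The slides say only that language identification uses an n-gram method trained on Europarl. One common variant, assumed here purely for illustration, ranks character n-grams by frequency and compares a snippet's ranking with each language's ranking (the "out-of-place" measure of Cavnar and Trenkle). The function names, the `top_k` cut-off and the two toy training sentences are all hypothetical.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    """Rank the top_k most frequent character 1..n_max-grams of a text."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(snippet_profile, language_profile, top_k=300):
    """Sum of rank differences; n-grams unknown to the language cost top_k."""
    return sum(
        abs(rank - language_profile[gram]) if gram in language_profile else top_k
        for gram, rank in snippet_profile.items()
    )

def identify(snippet, profiles):
    """Return the language whose profile is closest to the snippet's."""
    snippet_profile = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(snippet_profile, profiles[lang]))

# Toy training text (assumption); the slides report training on Europarl.
TRAIN = {
    "en": "the archive holds the papers and letters of the trade union",
    "nl": "het archief bevat de stukken en brieven van de vakbond",
}
profiles = {lang: ngram_profile(text) for lang, text in TRAIN.items()}
guess = identify("brieven van het archief", profiles)
```

Short snippets yield sparse profiles, which is one plausible reason the snippet length matters for identification quality.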
Enrichment & Structuring
Topic detection: automatic term recognition using the C-value method
Agglomerative hierarchical term clustering:
– complete, single & average linkage criteria
– document co-occurrence & lexical similarity measures
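As a reminder of how C-value scores a candidate term, here is a minimal sketch of the standard formula: log2 of the term length times its frequency, where a term nested in longer candidates first has the mean frequency of those longer candidates subtracted. The candidate frequencies are invented, and the sketch omits the linguistic filtering and thresholds that a full term recogniser would apply.

```python
import math

def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )

def c_value(freq):
    """Map each candidate term (a tuple of words) to its C-value.

    `freq` holds total corpus frequencies, nested occurrences included.
    """
    scores = {}
    for term, f in freq.items():
        # Frequencies of the longer candidates that contain this term.
        longer = [freq[other] for other in freq if contains(other, term)]
        if longer:
            # Nested term: discount by the mean frequency of its containers.
            f = f - sum(longer) / len(longer)
        scores[term] = math.log2(len(term)) * f
    return scores

# Hypothetical candidate frequencies, for illustration only.
freq = {
    ("trade", "union"): 10,
    ("trade", "union", "congress"): 4,
    ("international", "trade", "union"): 2,
}
scores = c_value(freq)
```

With this plain form of the formula, single-word candidates always score zero because log2(1) = 0.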
Term results (auto eval)
Results
C-value best performance: candidates that occur as non-nested at least once
Average linkage criterion & document co-occurrence: provide broader and richer hierarchies
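To make the average linkage and document co-occurrence combination concrete, the sketch below clusters terms bottom-up, merging at each step the two clusters with the highest mean pairwise similarity. Jaccard overlap of the document sets is an assumption standing in for the co-occurrence measure, which the slides do not define, and the term-to-document data is invented.

```python
from itertools import combinations

def jaccard(docs_a, docs_b):
    """Document co-occurrence similarity (assumed measure): |A & B| / |A | B|."""
    return len(docs_a & docs_b) / len(docs_a | docs_b)

def average_linkage(term_docs):
    """Agglomerative clustering of terms; returns the list of merges.

    Each merge is (cluster_1, cluster_2, similarity), most similar first.
    Assumes every term occurs in at least one document.
    """
    clusters = [(term,) for term in sorted(term_docs)]
    merges = []

    def similarity(c1, c2):
        # Average linkage: mean similarity over all cross-cluster term pairs.
        pairs = [(t1, t2) for t1 in c1 for t2 in c2]
        return sum(jaccard(term_docs[t1], term_docs[t2]) for t1, t2 in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        merges.append((c1, c2, similarity(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Invented term-to-document-id sets, for illustration only.
term_docs = {
    "strike": {1, 2, 3},
    "trade union": {1, 2, 3, 4},
    "exile": {5, 6},
    "refugee": {5, 6, 7},
}
merges = average_linkage(term_docs)
```

Replacing the mean in `similarity` with `max` or `min` over the pairs would give the single and complete linkage criteria respectively.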
Questions? Check out our poster!