Previous works
Personal readings of journals & ISKO proceedings
- Query: was a query constructed and submitted to a database in order to retrieve records?
- Publications: reading / perusing of full texts?
- Records: bibliographic records (titles & abstracts)
Previous works
Major findings: – : McIlwaine & Williamson (1999); McIlwaine (2003)
- Classification schemes (UDC, DDC, LCSH,...)
- Bias in classification (gender, culture)
- Interoperability of KO vocabularies
- Rise of Internet technology, search engines, impact on KO
- Resource discovery
- Emerging trends in expert systems (NLP, ontologies, automatic indexing...)
- Terminology management problems
- Thesauri design
- Information visualisation in online context
Previous works
Major findings: ? – ?: Lopez-Huertas (1998)
- Mainstream research in KO is a reformulation of old problems (classification, thesauri)
- Recasting them in the web era gives them a new life, especially since KO is more & more entwined with sister fields
- 2 major driving forces of research in KO:
  * demand for quality & interoperability in a multilingual, multicultural world
  * managing emergent knowledge in KOS in the semantic web era
- Both are reformulations of the multidimensionality of knowledge
- This necessitates an inter- and multi-disciplinary effort, etc.
Previous works
Major findings (40 yrs!): pre & post-web era; Saumure & Shiri (1998)
- Organizing corporate or business information
- Machine-assisted knowledge organization
- Information professionals
- Interoperability
- Cataloging and classification
- Classifying the web
- Digital preservation and digital libraries
- Metadata applications and uses
- Cognition
- Education
- Indexing and abstracting
- Thesauri initiatives
Previous works
Major findings (40 yrs!): pre & post-web era; Saumure & Shiri (1998)
- Trends b/w the pre- (<1993, date of the first browser, Mosaic) and post-web eras
- KO research focused throughout on mainstream topics: cataloguing, classification
- Pre-web era: more focused on indexing and cataloguing
- Post-web era: metadata generation & harvesting, interoperability, thus a more technological thrust
Previous works
Summary
– Despite methodological differences in data collection and analysis methods
– Important overlaps in findings
– Mainstream research is still driving KO (classification research, cataloguing, thesauri, bias,...)
– Reformulations in the web era (interoperability, metadata creation & harvesting, assisted indexing & retrieval, terminology issues...)
Goal Trends survey of research on KO issues over past 2 decades ( ), 21 yrs. What can we get from automatic data analysis methods? Can they provide any useful insight?
Goal
Epistemology:
– Empiricism (how): methodology, observation of evidence from data
– Pragmatism (why): is it useful and for whom?
Some connection with bibliometrics, but the focus is not on mapping authors but on mapping contents
Methodological difference with mainstream data analysis techniques: symbolic (linguistic & terminology) vs bag-of-words approach
Data collection (1)
- ISKO proceedings: not indexed in a machine-processable format (database)
- No problem for peer-reviewed journals... but ambiguity of the KO concept!
- At the end of the day... a manual selection of KO & LIS-related journals
- Records downloaded from Web-of-Science (WoS)
Data collection (2)
- List of 31 selected journals at
- 931 records, out of which 838 came from KO & its ancestor (International Classification)
- words in titles & abstracts
- Research trends will portray mostly publications from the KO journal
- Not the entire realm of publications on KO, but we had to be content with that...
Sample record from ISI-WoS
PT J
AU RADA, R
   ROSSIMORI, A
   PATON, R
   RECTOR, A
   MAGLIANI, F
   ROBBE, PD
TI THE GALEN DREAM
SO INTERNATIONAL CLASSIFICATION
AB Outlines the origin, needs and principles of GALEN, the Generalized Architecture for Languages, Encyclopedias, and Nomenclatures as applicable to Medicine. Short-term and long-term plans of GALEN have been elaborated to cope with possible developments. ''Milestones'' are given indicating what should be reached when and how much funding will be required for each milestone. In two ''vision'' pictures the situation before and after the introduction of GALEN is shown and the responsibilities at 4 different levels are listed.
SN
PY 1992
VL 19
IS 4
BP 188
EP 191
UT ISI:A1992KH
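A record in this tagged layout can be read into a dictionary before any term extraction. The sketch below is only an illustration: the function name `parse_wos_record` is hypothetical, and it assumes each field starts with a two-character tag at the beginning of a line while continuation lines (e.g. further authors) are indented.

```python
def parse_wos_record(text):
    """Parse one tagged record into {tag: [values]}.

    Assumption: a field starts with a two-character tag in column 0,
    and indented lines continue the previous field.
    """
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if not line[0].isspace():
            # New field: first two characters are the tag, the rest is the value.
            tag, value = line[:2], line[3:].strip()
            record.setdefault(tag, [])
            if value:
                record[tag].append(value)
        elif tag is not None:
            # Indented line: another value for the current field.
            record[tag].append(line.strip())
    return record


sample = """PT J
AU RADA, R
   ROSSIMORI, A
TI THE GALEN DREAM
SO INTERNATIONAL CLASSIFICATION
PY 1992
VL 19
"""
rec = parse_wos_record(sample)
print(rec["AU"], rec["TI"], rec["PY"])
```

Only the TI and AB fields would feed the terminology analysis; the other fields (PY in particular) would serve to split the corpus into periods.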
Analysis methodology (1)
- Empirical observations of how terminology depicts knowledge artefacts (titles & abstracts)
  * Terminology engineering
- Descriptive text data analysis (automatically propose a partition of the data)
  * Hierarchical agglomerative clustering
- Mapping & visualisation
  * Multidimensional view of domain structure: symbolic & numerical information
- TermWatch system (SanJuan & Ibekwe-SanJuan 2006)
Analysis methodology (2)
- Corpus split in 2 periods
- Terminology modeling
  * Automatic extraction of terms
  * Term variant search
- Clustering by semantic relations
- Linking clusters by co-occurrence
- Mapping & visualization
Analysis methodology (3)
- Terminology modeling
  * Automatic extraction of terms
    * surface morpho-syntactic properties of terms
    * rule implementation
    * extraction of likely candidates
    * filtering: statistical measures or manual
    * Problem: statistical measures only work well on massive data
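To make the "extraction of likely candidates" step concrete, here is a deliberately simplified sketch. It does not reproduce the morpho-syntactic rules of the system described here: as a stand-in, it cuts the text at punctuation and at a small, hypothetical stopword list and keeps the multi-word sequences that remain. The names `STOPWORDS` and `candidate_terms` are illustrative only.

```python
import re
from collections import Counter

# Hypothetical, very small stopword list used as phrase boundaries.
STOPWORDS = {"the", "of", "and", "a", "an", "in", "for", "to",
             "is", "are", "as", "on", "by", "with"}


def candidate_terms(text, min_words=2):
    """Count multi-word sequences delimited by stopwords or punctuation."""
    candidates = Counter()
    # Punctuation is treated as a hard boundary between candidates.
    for chunk in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        # The trailing None flushes the last sequence of the chunk.
        for word in re.findall(r"[a-z][a-z-]*", chunk) + [None]:
            if word is None or word in STOPWORDS:
                if len(current) >= min_words:
                    candidates[" ".join(current)] += 1
                current = []
            else:
                current.append(word)
    return candidates


text = ("The design of a classification scheme and the use of "
        "a classification scheme in digital libraries.")
terms = candidate_terms(text)
print(terms.most_common())
```

On a small corpus such as this one, the counts are too low for statistical filtering to be meaningful, which is the problem noted on the slide; a manual pass over the candidates is the alternative.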
Analysis methodology (4)
- Terminology modeling
  * Term variant search
    * surface morpho-syntactic operations b/w terms
    * spelling variants
    * synonyms (USE/UF) (WordNet)
    * likely BT/NT candidates: syntactic information
    * likely RT: lexico-syntactic information
    * some errors and noise
    * but automation always involves a trade-off!
Analysis methodology (5)
Some term variants acquired
- Paradigmatic organization (BT/NT): classification scheme
  * universal classification scheme
  * generic classification scheme
  * knowledge classification scheme
- Library of Congress – LC (USE/UF)
- knowledge organisation scheme / knowledge organization tool (RT)
The system does not tag these relations as such; they are assumed to be implied by the variations
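One of the surface operations behind the BT/NT examples can be sketched as follows: a longer term that adds modifiers in front of a shorter term, keeping the same head, is treated as a likely narrower variant. This is a minimal illustration under that single assumption, not the system's actual rule set; the function name `left_expansions` is hypothetical.

```python
def left_expansions(terms):
    """Pair each term with longer terms that add modifiers before it."""
    tokenized = {t: t.split() for t in terms}
    pairs = []
    for base, base_toks in tokenized.items():
        for longer, long_toks in tokenized.items():
            # The longer term must end with all the words of the base term.
            if (len(long_toks) > len(base_toks)
                    and long_toks[-len(base_toks):] == base_toks):
                pairs.append((base, longer))
    return sorted(pairs)


terms = [
    "classification scheme",
    "universal classification scheme",
    "generic classification scheme",
    "knowledge classification scheme",
    "knowledge organization tool",
]
pairs = left_expansions(terms)
for broader, narrower in pairs:
    print(broader, "->", narrower)
```

As on the slide, the pairs are not labelled BT/NT by the code; the relation is only implied by the kind of variation that links the two terms.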
Analysis methodology (6)
Assumptions behind terminology modeling
- Consensus from studies on terminology/lexicography: new terms (denominations of concepts) are mostly created from existing terms
- Terms are rarely created ex nihilo
- Surface linguistic operations reveal semantic (conceptual?) relations between domain concepts
- Studying these operations and visualising how they relate terms reveals the conceptual structure of a domain
Analysis methodology (7)
Clustering: 3-tier process
- 1st: group terms by close semantic relations
- 2nd: hierarchical clustering by lesser semantic relations (many iterations)
- 3rd: link cluster labels by co-occurrence of labels or that of their variants
Visualisation
- Thematic maps (Pajek)
- Navigation interface (browser)
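The three tiers can be illustrated with a toy sketch. It is a strong simplification and not the actual algorithm: tier 1 takes connected components over "close" relations, tier 2 is collapsed into a single merging pass over "lesser" relations (the real process iterates), and tier 3 counts how often terms from two clusters co-occur in the same record. All names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def components(terms, relations):
    """Group terms into connected components of the relation pairs."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for a, b in relations:
        parent[find(a)] = find(b)
    groups = {}
    for t in terms:
        groups.setdefault(find(t), set()).add(t)
    # Deterministic order: clusters sorted by their sorted member lists.
    return sorted((frozenset(g) for g in groups.values()), key=sorted)


def cooccurrence_links(clusters, documents):
    """Count records in which terms of two different clusters co-occur."""
    owner = {t: i for i, c in enumerate(clusters) for t in c}
    links = Counter()
    for doc_terms in documents:
        present = sorted({owner[t] for t in doc_terms if t in owner})
        for i, j in combinations(present, 2):
            links[(i, j)] += 1
    return links


terms = ["classification scheme", "universal classification scheme",
         "thesaurus", "thesaurus construction", "metadata"]
close = [("classification scheme", "universal classification scheme")]
lesser = [("thesaurus", "thesaurus construction")]
documents = [
    {"classification scheme", "metadata"},
    {"universal classification scheme", "metadata"},
    {"thesaurus", "metadata"},
]

tier1 = components(terms, close)           # tier 1: close relations only
tier2 = components(terms, close + lesser)  # tier 2: add lesser relations
links = cooccurrence_links(tier2, documents)  # tier 3: co-occurrence links
print(len(tier1), len(tier2), dict(links))
```

The cluster indices and link counts from the last step are the kind of node and edge data that a network tool such as Pajek would take as input for the thematic maps.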
Results (2)
Main topics for period 1 ( )
– Global structure: typical « core - peripheral » layout
– Knowledge is the structuring pole
Classification
– Subjects gravitating around the Knowledge pole: analysis online vocabulary control standardization bibliographic information system indexing (automatic & manual) thesaurus construction and usage information documentation system translation
Results (3)
In the last decade ( ):
- Research network is much more intertwined
- No one center but several « core » issues connected to one another
- Major topics are intertwined: KO issues classification information theoretic indexing language user evaluation
- Newer topics: web issues, metadata, knowledge discovery, computer algorithm,...
Results (4) , equal divide b/w:
- theoretical research: information science, concept, classification theory, epistemological foundation,...
- user-oriented studies: user librarian, user-defined descriptor, user evaluation
- mainstream KO issues: classification, thesaurus, KO, term selection
- technology oriented handling of KO issues:
  * knowledge, system, transfer, knowledge representation, knowledge engineering, knowledge discovery, information processing, computer algorithm...
  * web, web designer, web document information retrieval, terminology structuring, metadata, metadata quality
Discussion
- Evaluation of clusters: information-theoretic problem. No solution. No gold standard
- Goal of the method: precisely to propose a partition amongst the data
- Is it the best one?
- Reliance on external criteria: human (expert) evaluation
- So response from the community needed!