IAEA International Atomic Energy Agency International Nuclear Information System (INIS) CAI, Thesaurus, Subject Categories and Metadata Extraction Tool.


1 IAEA International Atomic Energy Agency
International Nuclear Information System (INIS)
CAI, Thesaurus, Subject Categories and Metadata Extraction Tool (MET)
13th Joint INIS/ETDE Technical Committee Meeting, 20-22 October 2011, Vienna, Austria
Neviana Rashkova, INIS Subject Specialist

2 IAEA CONTENT
- COMPUTER ASSISTED INDEXING – CAI
- INIS/ETDE THESAURUS
- SUBJECT CATEGORIES
- INIS INPUT QUALITY CONTROL - UPDATE, in co-operation with L. Iliev, Computer Support Group
13th INIS/ETDE Joint Technical Committee Meeting, 20-21 October 2011

3 IAEA COMPUTER ASSISTED INDEXING – CAI
- Assists the indexer in choosing the subject category and descriptors, based on text analysis of the abstract and title
- Offers an opportunity for off-line work: batch indexing
- Incorporates the latest version of the INIS Thesaurus
- Uses “hidden terms” pointing to a valid Thesaurus term
Currently we have:
- 28 accounts created for Member States
- 19 countries with access to CAI
- 6 accounts created for external users
This year: 53 658 documents indexed - 55% of the input from: Springer, ELSEVIER, ANS, IOPP, IAEA, MemSt, AIP
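The “hidden terms” idea on this slide can be illustrated with a small lookup. This is only a sketch under stated assumptions: the hidden-term table and the function name `suggest_descriptors` are invented for illustration, and the real CAI text analysis is not described in the slides.

```python
import re

# Hypothetical hidden-term table: free-text variants (not valid descriptors
# themselves) that point to a valid Thesaurus term. Entries are invented.
HIDDEN_TERMS = {
    "bubble chamber": "BUBBLE CHAMBERS",
    "candu": "CANDU TYPE REACTORS",
    "phwr": "PHWR TYPE REACTORS",
}


def suggest_descriptors(text):
    """Return the valid descriptors whose hidden terms occur in the text.

    Matching is case-insensitive and anchored at a word start only, so
    "bubble chambers" also matches the hidden term "bubble chamber".
    """
    lowered = text.lower()
    found = []
    for hidden, descriptor in HIDDEN_TERMS.items():
        if re.search(r"\b" + re.escape(hidden), lowered) and descriptor not in found:
            found.append(descriptor)
    return found


if __name__ == "__main__":
    title = "Tracks in a heavy liquid bubble chamber near a CANDU plant"
    print(suggest_descriptors(title))
```

The indexer would still review such suggestions; the lookup only proposes candidates from the title and abstract.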

4 IAEA INIS/ETDE THESAURUS
A thesaurus is “a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge” (part of the UNESCO definition)
Types of relations for terms:
- BT – broader term (levels 1, 2 … 10)
- NT – narrower term (levels 1, 2 … 10)
- RT – related term
- UF(+) – used for
- SF – seen for
Contains:
- 21882 valid terms
- 8677 forbidden terms
- 30559 total

5 IAEA INIS/ETDE THESAURUS
Maintaining the INIS/ETDE Thesaurus
- Regularly updated simultaneously at INIS and ETDE
- New terms proposed by Member States
- Terms revised if needed
- Discussion Group of experts for new proposals and updates
Translations
- Original in English
- Other languages: German, French, Arabic, Russian, Chinese
- INIS Liaison Officers of the respective countries provide translations, with yearly updates for the new terms

6 IAEA USES OF INIS/ETDE THESAURUS
For indexing
- WinFibre
- CAI – hidden terms
- Independent use
For retrieval
- Incorporated in INIS search
- For independent advanced search
- For establishing a search strategy
As a dictionary

7 IAEA USES OF INIS/ETDE THESAURUS
Other potential applications
- Retrieval: for navigation search together with subject classification
- Automation in text analysis: provides a multiple-level taxonomy
- Learning tool: gives immediate structured information about the terms and their relations

Sample Thesaurus entries:

BRUCE-1 REACTOR
  Tiverton, Ontario, Canada.
  *BT1 candu type reactors
  *BT1 natural uranium reactors
  *BT1 phwr type reactors
  RT bruce site

BUBBLE CHAMBERS
  *BT1 gas track detectors
  NT1 cryogenic bubble chambers
  NT1 heavy liquid bubble chambers
  NT1 ultrasonic bubble chambers
  RT digitizers
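The two sample entries on this slide (BRUCE-1 REACTOR and BUBBLE CHAMBERS) suggest how the relations could be held in a program for navigation or text analysis. The following is a minimal sketch, assuming a simple in-memory dictionary; the `Entry` class and `broader_terms` function are hypothetical names, and only the two entries from the slide are loaded.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Relations of one Thesaurus term (class name is illustrative only)."""
    bt: list = field(default_factory=list)  # broader terms
    nt: list = field(default_factory=list)  # narrower terms
    rt: list = field(default_factory=list)  # related terms


# Only the two entries shown on the slide; the real Thesaurus is far larger.
THESAURUS = {
    "BRUCE-1 REACTOR": Entry(
        bt=["CANDU TYPE REACTORS", "NATURAL URANIUM REACTORS", "PHWR TYPE REACTORS"],
        rt=["BRUCE SITE"],
    ),
    "BUBBLE CHAMBERS": Entry(
        bt=["GAS TRACK DETECTORS"],
        nt=[
            "CRYOGENIC BUBBLE CHAMBERS",
            "HEAVY LIQUID BUBBLE CHAMBERS",
            "ULTRASONIC BUBBLE CHAMBERS",
        ],
        rt=["DIGITIZERS"],
    ),
}


def broader_terms(term, thesaurus):
    """Collect every term reachable through BT links, nearest level first."""
    seen = []
    queue = list(thesaurus.get(term, Entry()).bt)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        # Terms missing from the table simply contribute no further levels.
        queue.extend(thesaurus.get(current, Entry()).bt)
    return seen


if __name__ == "__main__":
    print(broader_terms("BRUCE-1 REACTOR", THESAURUS))
```

With a full Thesaurus loaded, the same walk would continue through BT2, BT3 and higher levels, which is the multiple-level taxonomy the slide refers to.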

8 IAEA INIS/ETDE SUBJECT CATEGORIES
INIS/ETDE subject categories update
- Review the existing subject categories to include newer concepts and/or areas of research and development
- Make the "ETDE only" categories available for INIS
- Consider the introduction of new categories
Four new subject categories
- S77 NANOSCIENCE AND NANOTECHNOLOGY
- S79 ASTROPHYSICS, COSMOLOGY AND ASTRONOMY
- S96 KNOWLEDGE MANAGEMENT AND PRESERVATION
- S97 MATHEMATICAL METHODS AND COMPUTING

9 IAEA INIS/ETDE SUBJECT CATEGORIES
ETDE/INIS Joint Reference Series No. 2 (Rev. 1), INIS Scope Descriptions
- The current categorization scheme contains 49 subject categories, both for INIS and ETDE
- The categories have three-character alphanumeric codes
- The document defines the subject categories and provides the scope descriptions
- A Subject Index is included as an aid to subject classifiers
- Cross references to other categories are provided where appropriate
- The tool is provided to Member States to assist in subject indexing

10 IAEA INIS INPUT QUALITY CONTROL UPDATE
INTRODUCTION
- The general goal of the procedure is to improve the quality of input
- Identifies documents with errors in input and extracts them for a manual check by a specialist
- Knowledge Base created from a large number of expert decisions made by human indexers: intellectual choices for the usage of a specific SC/D (subject category/descriptor) combination
- Implemented in a computer program, currently in use
- Uses documents from the immediately preceding time period
- At the time of implementation, 75% of identified records proved to be real errors

11 IAEA CURRENT PROCEDURE
Based on old statistics
- period 1980-1984
- 26 000 documents used
Subject categories changed several times
- new categories added
- artificially adjusted values to replace the real statistics
Thesaurus updated many times
- new descriptors
- new concepts

12 IAEA CURRENT PROCEDURE
THE RESULTS FROM THE QA PROCEDURE DO NOT REFLECT THE REAL SITUATION
- Too many false warnings (~50% of all documents)
- More bad records allowed in production
- No longer relevant: no consistent approach for all category/descriptor pairs
THE OLD QA PROCEDURE NEEDS REVISION

13 IAEA UPDATED PROCEDURE
- Based on real statistics using the whole INIS database
- Takes into account all subject categories
- Takes into account the accumulated experience about specific erroneous uses of category/descriptor combinations
- Flexible towards changes of descriptor weights
UPDATED PROCEDURE IS EXPECTED TO IMPROVE QUALITY AND SAVE TIME

14 IAEA PRELIMINARY ANALYSES
- Analysis of the documentation on the procedure for category match value (CMV) calculation:
  "An Expert System for Quality Control in Bibliographic Databases"*
  Claudio Todeschini, International Nuclear Information System, International Atomic Energy Agency, Wagramerstrasse 5, A-1400 Vienna, Austria
  Michael P. Farrell, Carbon Dioxide Information Center, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, U.S.A.
  *Based on work performed at Oak Ridge National Laboratory, operated for the U.S. Department of Energy under Contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. Work was partially supported by the Carbon Dioxide Research Division, U.S. Department of Energy.
- Analysis of the program for quality control and testing of the formula
- Criteria for category/descriptor combinations

15 IAEA WORK DONE
- Conversion of all existing categories to the currently used set of categories
- Calculation of frequencies: category/descriptor table
- Comparison between two statistics: new/all SC
- Decision about which period to use for the statistics
- Adjustment to avoid expected errors
- Identification of known combinations giving nearly 100% errors
- Creation of a table of “bad” combinations, assigned a different weight (to reach a very low CMV)
- Possibility to change weights manually
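The frequency table and the table of “bad” combinations listed on this slide can be sketched as follows. This is an illustration only: the slides do not give the CMV formula, so the `match_score` below is a stand-in (the mean relative frequency of the category among records carrying each descriptor), not the published calculation. The function names, the descriptors and the toy records are all invented; only the category codes S77 and S79 come from the presentation.

```python
from collections import Counter


def pair_weights(records, overrides=None):
    """Build {(category, descriptor): weight} from (category, descriptors) records.

    The weight is the share of records carrying the descriptor that were
    assigned to the category. Manual overrides (the "bad combinations"
    table) replace the computed value.
    """
    pair_counts = Counter()
    descriptor_counts = Counter()
    for category, descriptors in records:
        for descriptor in set(descriptors):
            pair_counts[(category, descriptor)] += 1
            descriptor_counts[descriptor] += 1
    weights = {
        (category, descriptor): count / descriptor_counts[descriptor]
        for (category, descriptor), count in pair_counts.items()
    }
    weights.update(overrides or {})
    return weights


def match_score(category, descriptors, weights):
    """Stand-in score (NOT the real CMV): mean weight over the record's descriptors."""
    if not descriptors:
        return 0.0
    total = sum(weights.get((category, d), 0.0) for d in descriptors)
    return total / len(descriptors)


# Toy records, invented for illustration.
RECORDS = [
    ("S77", ["NANOSTRUCTURES", "GROWTH"]),
    ("S77", ["NANOSTRUCTURES"]),
    ("S79", ["GALAXIES", "GROWTH"]),
    ("S79", ["GALAXIES"]),
]

# Hypothetical "bad" combination forced to a very low weight.
BAD_COMBINATIONS = {("S79", "GROWTH"): 0.0}

if __name__ == "__main__":
    weights = pair_weights(RECORDS, BAD_COMBINATIONS)
    print(match_score("S77", ["NANOSTRUCTURES", "GROWTH"], weights))
    print(match_score("S79", ["GALAXIES", "GROWTH"], weights))
```

Keeping the overrides in a separate table means the statistics can be recomputed from the database at any time without losing the manual adjustments.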

16 IAEA FINE TUNING
EXPECTED ERRORS – examples:
- Material Science: GROWTH – CRYSTAL GROWTH
- Plasma physics: IGNITION – THERMONUCLEAR IGNITION
- Physics of Elementary Particles and Fields: PRODUCTION – PARTICLE PRODUCTION
- COLOR, FLAVOR, HOLOGRAPHY, TRANSPORT, CAVITIES, … etc.
17 descriptors in 18 subject categories have been adjusted

17 IAEA TOOLS DEVELOPED
Tools were developed to perform the following steps:
- Scanning the records from the Reference DB to make full statistics for the subject category/descriptor pairs
- A report showing the differences between the current table and the one to replace it
- A table for manual “tuning” of some pairs
- A report (unfinished) showing the effect of changing the table on raw (unprocessed) and processed records

18 IAEA COMPARISON WITH IRPS (processed records)

19 IAEA COMPARISON WITH IRPS (unprocessed records)

20 IAEA THRESHOLD DETERMINATION

21 IAEA THRESHOLD DETERMINATION

22 IAEA THRESHOLD DETERMINATION

23 IAEA THRESHOLD DETERMINATION

24 IAEA DISCUSSION
- First analyses suggest a natural threshold value CMV ∈ (1,2)
- An analysis of the number of documents to be scanned for different threshold CMVs is necessary
- Tests are necessary to assess the errors that result from choosing the threshold value in different intervals
- Further testing over different sets of records is required before implementation
- Possibility for integration in WinFibre
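The analysis of how many documents would need to be scanned at each candidate threshold can be sketched as a simple sweep. This is a minimal sketch with assumptions stated up front: the scores below are invented, the function name `flagged_counts` is hypothetical, and a record is assumed to be flagged for manual checking when its CMV falls below the threshold (consistent with “bad” combinations being weighted to reach a very low CMV).

```python
def flagged_counts(scores, thresholds):
    """For each threshold, count the records whose score falls below it."""
    return {t: sum(1 for s in scores if s < t) for t in thresholds}


# Invented CMV-like scores for six records, for illustration only.
SCORES = [0.4, 0.9, 1.3, 1.7, 2.5, 3.1]

if __name__ == "__main__":
    # Candidate thresholds at the edges and middle of the interval (1, 2).
    for threshold, count in flagged_counts(SCORES, [1.0, 1.5, 2.0]).items():
        print(f"threshold {threshold}: {count} of {len(SCORES)} records flagged")
```

Running the same sweep over records whose true error status is known would also give the error rates per interval that the slide calls for.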

25 IAEA Thank you!

