Jean-Yves Le Meur - CERN Geneva Switzerland - GL'99 Conference 1.

2 AUTOMATIC KEYWORDING OF HIGH ENERGY PHYSICS GREY LITERATURE
HOW ARE DOCUMENTS SEARCHED BY SUBJECT?
- Two measures: recall and precision in searching
- Connecting subjects via references
- Searching data available in the document itself
WHY DO AUTOMATIC KEYWORDING?
- Adding new metadata to the documents
- Comparison between free keywords and fixed terms
- Influence of the keywording on search quality
TOWARDS THE AUTOMATION IN HEP
- Existing classifications in High Energy Physics
- Using an expert system to derive words
- CERN test: the status

3 Searching Documents by Subject (I): Two measures: recall and precision
- RECALL: number of relevant documents retrieved / total number of relevant documents (x 100)
- PRECISION: number of relevant documents retrieved / number of documents retrieved (x 100)
These two measures of search efficiency are not independent:
- Pushing the recall factor as high as possible => the search tends to pick up more “background” documents
- Wanting all retrieved documents to be relevant => risk of missing many relevant documents
Searching for a phrase of more than three words gives a low recall factor, because of the flexibility of natural language in expressing variations of meaning.
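
A minimal Python sketch of the two measures, using the standard definitions (recall: relevant documents retrieved over all relevant documents; precision: relevant documents retrieved over all documents retrieved), both expressed as percentages. The function name and the document identifiers are hypothetical and only serve to illustrate the trade-off between a narrow and a broad query.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) as percentages for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    recall = 100.0 * len(hits) / len(relevant) if relevant else 0.0
    precision = 100.0 * len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical document identifiers, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
narrow = {"d1", "d2"}                                        # strict query
broad = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}     # loose query

print(recall_precision(narrow, relevant))  # (50.0, 100.0)
print(recall_precision(broad, relevant))   # (100.0, 50.0)
```

The narrow query returns only relevant documents but misses half of them; the broad query finds them all but half of its results are background.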

4 Searching Documents by Subject (II)
Two main approaches when searching:
- REFERRAL APPROACH: search for a specific item that one already knows about.
- SUBJECT-BASED APPROACH: find documents that address a specific problem.
We are only interested in the second approach.

5 Searching Documents by Subject (III): Connecting documents via references
Core document => references => set of relevant documents
- Only past documents are covered
- Improvement with citation linking and databases
Is this the solution?
- Authors may not have referred to all the relevant material
- This method is not adequate to get an exhaustive list
- Very time-consuming
A subject cannot be covered efficiently by connecting citations.
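
The reference-following approach can be sketched as a breadth-first walk over reference lists. This is an illustration under assumed toy data, not a description of any existing CERN tool: the function name, the `max_depth` parameter, and the paper identifiers are all hypothetical.

```python
from collections import deque


def follow_references(core, references, max_depth=2):
    """Collect documents reachable from `core` by following reference lists."""
    seen = {core}
    queue = deque([(core, 0)])
    while queue:
        doc, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand documents beyond the depth limit
        for cited in references.get(doc, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    seen.discard(core)
    return seen


# Hypothetical toy data: each document lists the (older) documents it cites.
references = {
    "paper_1999": ["paper_1997", "paper_1995"],
    "paper_1997": ["paper_1990"],
    "paper_1998": ["paper_1995"],  # on the same subject, but not cited by the core
}

found = follow_references("paper_1999", references)
print(sorted(found))  # ['paper_1990', 'paper_1995', 'paper_1997']
```

Only older documents are reached, and `paper_1998` is never found because the core document does not cite it, which is the gap the slide points out.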

6 Searching Documents by Subject (IV): Searching data available in the documents (relevant to the subject)
The title:
- Too short to contain a complete description of the subject area
- Recall factor of a title-based search is low
- As the number of documents in the database increases, the precision of title-based searches decreases
The abstract:
- Improves the recall factor
- “Contrast” words indexed => poor precision
- Available from the CERN HEP database
The full text:
- Good recall factor
- Very bad precision
- Huge number of documents in HEP
Data from the documents does not provide a way to search a subject!

7 Why do Automatic Keywording?
1/ Adding new terms to documents
- Free keywords
- Fixed thesaurus terms
2/ Comparison between keywords and fixed terms
3/ Influence of the keywording on search quality

8 Why do Automatic Keywording? (I): Adding new terms to documents
Free keywords:
- Allow the use of terms not present in the document
- Allow adding words / phrases from the text (section headings, specific words…)
- Allow indexing terms containing special characters
- Allow adding synonyms of terms in the text

9 Why do Automatic Keywording? (I): Adding new terms to documents
Fixed thesaurus terms:
- Useful when the important terms do not appear in the same form in the text being processed
- This method requires a complete and up-to-date thesaurus

10 Why do Automatic Keywording? (II): Maintenance of free keywords and fixed terms
The thesaurus has to be modified to keep up with changes in the subject:
- DESY HEP Index thesaurus updated every 1-2 years
Free keywording should conform to a set of rules:
- Singular forms instead of plural
- Terms given as free keywords => thesaurus in practice => standardization of keywords
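
A rule such as "singular forms instead of plural" can be enforced mechanically. The sketch below is a deliberately naive illustration: the function names, the exception list, and the sample keywords are assumptions, and real English singularization needs far more care than this.

```python
import re

# Endings that look plural but usually are not (naive, illustrative list).
_NOT_PLURAL = ("ss", "us", "is", "ics")


def singularize(word):
    """Very naive English singularization; an illustration only."""
    if word.endswith(_NOT_PLURAL) or len(word) <= 3:
        return word
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s"):
        return word[:-1]
    return word


def normalize_keyword(keyword):
    """Lower-case, collapse whitespace, and singularize the last word."""
    words = re.sub(r"\s+", " ", keyword.strip().lower()).split(" ")
    words[-1] = singularize(words[-1])
    return " ".join(words)


# Hypothetical free keywords as different people might type them.
raw = ["Higgs bosons", "higgs  boson", "Symmetries", "Particle physics"]
print(sorted({normalize_keyword(k) for k in raw}))
# ['higgs boson', 'particle physics', 'symmetry']
```

Four raw entries collapse to three standardized keywords, which is how free keywords drift towards a thesaurus in practice.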

11 Why do Automatic Keywording? (III): Influence of the keywording on search quality
Free keywording and title:
- Improves recall and precision
- But results are no better than the title and abstract combination (most of the time, the free keywords are already present in the title or abstract)
Fixed thesaurus: series of thesaurus terms
- Both precision and recall are 100 % (in theory!)

12 Why do Automatic Keywording? (IV): Comparison of different searches performed in HEP databases

13 Why do Automatic Keywording? (V): Summary
- Searching metadata that belongs to the document gives bad results
- The added value of “thesaurus-type” keywording is obvious
- Especially in HEP, where the grey literature is huge and not classified
- The more documents you keep, the more you need keywording
- Indexing by subject specialists is costly in terms of:
  - time
  - requirement for highly qualified people
Need for automation

14 Towards the automation in HEP
1/ Existing classifications in High Energy Physics
2/ The HEP specificity
3/ Sokrates Learning System
- The term derivation
- The thesaurus term mapping
4/ CERN test: the status

15 Towards the automation in HEP (I): Existing Classifications in High Energy Physics
Manual keywording: DESY HEP Index
- Publication from 1963 to 1997
- Keywords still searchable on the Web from DESY and SLAC libraries
Searching by subject: CERN
- Free keywording and then HEP Index thesaurus from 1983 to 1992
- Now, a single subject is attributed to each document
Fixed commercial thesauri: INIS (International Nuclear Information System) and INSPEC (Physics, Computing and Electrical Engineering Abstracts)
- Built manually, and access is not free of charge
Keywords given by authors: PACS (American Physical Society)
Only the HEP Index is specialized enough.

16 Towards the automation in HEP (II): HEP specificity
Scientific literature:
- Meaning expressed through multi-word terms (noun phrases)
- Substantives more important than verbs
HEP particularity:
- Particle symbols, equations = new type of word
- Different knowledge bases for theoretical and experimental papers

17 Towards the automation in HEP (III-1): Sokrates Learning System
Definition: Self-organizing Object-oriented Keyterm Recognition And Text Editing System
- From natural language to key terms
- Similar to a compiler
- Free keywording type: the derived key terms exist in the text

18 Towards the automation in HEP (III-2): Sokrates Learning System
The term derivation: three main components
A dictionary of individual words:
- Continuously updated
- Two main attributes: a code / a frequency
A knowledge base: all the key terms and their frequency
The rules:
- They describe sentences using the word codes
- They are read by an inference engine

19 Towards the automation in HEP (III-3): Sokrates Learning System
The term derivation: the text parsing
With a new document:
First parsing:
- Extraction of all individual words
- Update of the dictionary: new words and frequency
- For new words: request help from an operator
Second parsing:
- Extraction of all possible noun phrases according to rules and dictionary
Third parsing:
- Derived key terms are compared to the knowledge base
- Selection of key terms according to their frequency and a threshold
Last parsing: only if necessary, when too few key terms are found
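
The first three parsing passes can be sketched in Python as follows. This is not the Sokrates implementation: the word codes ("N", "A", "X"), the rules, the knowledge-base counts, the threshold, and all function names are hypothetical, chosen only to show how a dictionary, a rule set, and a knowledge base could interact. The last pass is left out because the slides do not describe what it does.

```python
import re
from collections import Counter

# Hypothetical word codes: "N" noun, "A" adjective, "X" anything else.
dictionary = {
    "higgs": "N", "boson": "N", "mass": "N", "decay": "N",
    "rare": "A", "the": "X", "of": "X", "is": "X", "measured": "X", "in": "X",
}
word_frequency = Counter()

# Hypothetical rules: sequences of word codes that form a noun phrase.
rules = [("A", "N"), ("N", "N"), ("N", "N", "N")]

# Hypothetical knowledge base: known key terms and how often they were seen.
knowledge_base = {"higgs boson": 12, "higgs boson mass": 5, "rare decay": 1}


def first_pass(text):
    """Extract words, update frequencies, and list words needing an operator."""
    words = re.findall(r"[a-z]+", text.lower())
    word_frequency.update(words)
    unknown = sorted({w for w in words if w not in dictionary})
    return words, unknown


def second_pass(words):
    """Extract every word window whose code sequence matches a rule."""
    codes = [dictionary.get(w, "X") for w in words]
    phrases = set()
    for rule in rules:
        n = len(rule)
        for i in range(len(words) - n + 1):
            if tuple(codes[i:i + n]) == rule:
                phrases.add(" ".join(words[i:i + n]))
    return phrases


def third_pass(phrases, threshold=3):
    """Keep candidates whose knowledge-base frequency reaches the threshold."""
    return sorted(p for p in phrases if knowledge_base.get(p, 0) >= threshold)


text = "The Higgs boson mass is measured in rare decay channels"
words, unknown = first_pass(text)
candidates = second_pass(words)
key_terms = third_pass(candidates)
print(unknown)    # ['channels']
print(key_terms)  # ['higgs boson', 'higgs boson mass']
```

In this toy run, "channels" would be sent to the operator for coding, and "rare decay" and "boson mass" are extracted as candidates but rejected by the frequency threshold.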

20 Towards the automation in HEP (III-4): Sokrates Learning System
The thesaurus term mapping
- Key term exists in the thesaurus: mapping is straightforward
- Key term is similar: a dictionary of synonyms can be used
- Key term does not exist: clustering techniques can be used
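
The three mapping cases can be sketched as a simple fallback chain. The thesaurus entries, the synonym table, and the function name are hypothetical. Note that the third step here uses plain string similarity from Python's standard `difflib` module as a stand-in; it is not a clustering technique and only marks where one would plug in.

```python
import difflib

# Hypothetical thesaurus and synonym dictionary, for illustration only.
thesaurus = {"higgs particle", "neutrino oscillation", "quark gluon plasma"}
synonyms = {"higgs boson": "higgs particle"}


def map_to_thesaurus(key_term, cutoff=0.8):
    """Return (thesaurus_term, method), or (None, 'unmapped')."""
    if key_term in thesaurus:
        return key_term, "exact"
    if key_term in synonyms:
        return synonyms[key_term], "synonym"
    # Stand-in for the clustering step: nearest thesaurus term by string similarity.
    close = difflib.get_close_matches(key_term, sorted(thesaurus), n=1, cutoff=cutoff)
    if close:
        return close[0], "similarity"
    return None, "unmapped"


print(map_to_thesaurus("neutrino oscillation"))   # ('neutrino oscillation', 'exact')
print(map_to_thesaurus("higgs boson"))            # ('higgs particle', 'synonym')
print(map_to_thesaurus("neutrino oscillations"))  # ('neutrino oscillation', 'similarity')
print(map_to_thesaurus("detector cooling"))       # (None, 'unmapped')
```

Terms that end up unmapped are the ones that would need human review or an update of the thesaurus.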

21 Towards the automation in HEP (IV): CERN test: the status
The sample:
- 1400 abstracts of published articles (experimental, theoretical, technological fields)
- They have already been manually keyworded
- Sample 1 (700 abstracts): keywords given to Sokrates to tune the system
- Sample 2 (700 abstracts): keywords not given, to test the system
The results:
- 70000 words used as “learning text”: 200 words unknown to the existing dictionary for the last 4000 words processed
- 250 rules defined
- Thresholds still being refined: permanent evolution
The first phase of the test is the longest, because the knowledge base MUST be good.
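
The dictionary figure quoted above corresponds to an unknown-word rate of 200 / 4000 = 5 % over the most recent words. A small Python sketch of how such a rate could be tracked over a trailing window is given below; the function name, the window handling, and the toy word stream are assumptions, not part of the CERN test.

```python
def unknown_rate(words, dictionary, window=4000):
    """Percentage of the last `window` words that are missing from `dictionary`."""
    tail = words[-window:]
    if not tail:
        return 0.0
    unknown = sum(1 for w in tail if w not in dictionary)
    return 100.0 * unknown / len(tail)


# Figures quoted on the slide: 200 unknown words in the last 4000 processed.
print(100.0 * 200 / 4000)  # 5.0

# Hypothetical toy stream, for illustration only.
dictionary = {"higgs", "boson", "mass"}
stream = ["higgs", "boson", "mass", "pentaquark"] * 5
print(unknown_rate(stream, dictionary, window=8))  # 25.0
```

A rate that keeps falling as more learning text is processed would suggest that the dictionary is converging.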

22 Conclusion
Some statements:
- Importance of automatic keywording for HEP grey literature
- Confidence in building a valid knowledge base of noun phrases using Sokrates
- Valid mapping of this base to the HEP Index thesaurus remains uncertain
The ideal future:
For each new document (+ abstract) entered into the system:
- Quick delivery of a set of key terms
- If it maps to the thesaurus: the output is added to the database
Search and navigation enabled from the thesaurus ==> quick and easy way to get full coverage of a precise topic.

23 QUESTIONS?