Subject Analysis and Vocabulary Control Spring 2006, 6 March Bharat Mehra IS 520 (Organization and Representation of Information) School of Information.

Presentation transcript:

Subject Analysis and Vocabulary Control Spring 2006, 6 March Bharat Mehra IS 520 (Organization and Representation of Information) School of Information Sciences University of Tennessee

Subject and its Representation  Subject reveals what a work is about: the content of the work  Representing subjects of an information object in the most precise and concise linguistic format is necessary for computerized searching: word, phrase, sentence, etc.

Questions  Why can’t a computer do a good job in identifying the “aboutness” of a work?  How can you identify “aboutness” for nontextual materials?

Subject Analysis  Is the part of creating metadata that deals with the conceptual analysis of an information object to determine what it is about, and  with translating the “aboutness” of an info object into controlled vocabulary terms for subject headings and classification notations

Purpose of Subject Analysis  Provides meaningful subject access via retrieval tool  Provides collocation of objects of a like nature (Cutter)  Provides a logical location for similar objects  Saves user time

Conceptual Analysis  What is it? Philosophy, history  What is it for? For a farmer…  What is it about? D. W. Langridge, 1989

Methods in Conceptual Analysis  Purposive method: figure out the author’s purpose (statement of purpose)  Figure-ground method (what are the problems in this method?)  Objective method: counting of references (what are the problems in this method?)  Appealing to unity or to rules of selection and rejection: what has been said (selected) and not said (rejected)  P. Wilson, 1968

Identification of Concepts  Topics  Names (person, corporate bodies, geographic areas, other named entities)  Time periods  Form

Subject Access Process  Textual and non-textual info objects  What will be helpful for identifying the “aboutness” of the info object?  What did the user queries of the NLM’s Prints and Photographs Collection reveal?

Dewey Decimal Classification  Main classes => divisions => sections  The system is made up of ten categories:  000 Computers, information and general reference  100 Philosophy and psychology  200 Religion  300 Social sciences  400 Language  500 Science and mathematics  600 Technology  700 Arts and recreation  800 Literature  900 History and geography  Examples: 330 for economy + 94 for Europe = European economy; 973 for United States + form division for periodicals = periodicals concerning the United States generally

Dewey Decimal Classification  From the divine to the mundane (except 000)  Choosing decimals for its categories allows it to be purely numerical and infinitely hierarchical  Faceted classification: combines elements from different parts of the structure to construct a number representing the subject content  Except for general works and fiction, works are classified principally by subject, with extensions for subject relationships, place, time or type of material, producing classification numbers of not less than three digits but otherwise of indeterminate length, with a decimal point before the fourth digit, where present  Classmarks are to be read as numbers, in the order: 050, 220, , 331 etc.
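The two mechanical rules on this slide (a decimal point before the fourth digit, and classmarks read as numbers) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and real number building follows the DDC schedules and tables rather than simple concatenation.

```python
from decimal import Decimal


def format_classmark(digits):
    """Insert the decimal point before the fourth digit, where present."""
    if not digits.isdigit() or len(digits) < 3:
        raise ValueError("a classmark needs at least three digits")
    if len(digits) == 3:
        return digits
    return f"{digits[:3]}.{digits[3:]}"


def build_number(base, extension):
    """Simplified number building: append the extension digits to the base.

    Assumption for illustration only; actual DDC synthesis is rule-driven.
    """
    return format_classmark(base + extension)


def shelf_order(classmarks):
    """Sort classmarks by numeric value, so 330.94 files before 331."""
    return sorted(classmarks, key=Decimal)


print(build_number("330", "94"))
print(shelf_order(["331", "050", "330.94", "220"]))
```

Sorting on the numeric value rather than on the raw string is what lets a long number such as 330.94 file between 330 and 331.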

Subject Access--The Problems  diverse expressions  linguistic phenomena  cultural diversity  human cognitive factors  individual differences  differences in methods, lack of consistency  exhaustivity: summarization and depth indexing

Subject Access--Some Solutions  1. Vocabulary control in indexing  2. Classification systems arranging concepts in hierarchical structure  3. Citations: citing and being cited  4. Hyperlinks

Why is controlled vocabulary needed?

What can Vocabulary Control Do?  to promote the consistent representation of subject matter by indexer/cataloger and searchers;  to guide users on subject access by clarifying linguistic ambiguity and linking terms with related meanings;  to increase precision as well as recall.

Recall and Precision  Basic measures used in evaluating search strategies  Assumptions:  There is a set of records in the DB which is relevant to the search topic  Records are assumed to be either relevant or irrelevant (these measures do not allow for degrees of relevancy)  The actual retrieval set may not perfectly match the set of relevant records

Recall and Precision RECALL is the ratio of the number of relevant records retrieved to the total number of relevant records in the database. It is usually expressed as a percentage. PRECISION is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved. It is usually expressed as a percentage.
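As a small worked illustration of the two ratios defined above, the following Python sketch computes both as percentages from sets of record IDs. The record numbers are invented for the example and do not come from the slides.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) as percentages for two sets of record IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant records that were actually retrieved
    recall = 100 * len(hits) / len(relevant) if relevant else 0.0
    precision = 100 * len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: four relevant records exist, five records come back.
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5, 6, 7}
recall, precision = recall_precision(retrieved, relevant)
print(f"recall = {recall:.0f}%, precision = {precision:.0f}%")
```

Here two of the four relevant records are found (recall 50%), and two of the five retrieved records are relevant (precision 40%).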

PR Inverse Relationship  Why is there an inverse relationship? An issue of language  If the search goal is comprehensive retrieval, the searcher must include synonyms, related terms, and broad or general terms for each concept; the searcher may also decide to combine terms using Boolean rather than proximity operators, and secondary concepts may get omitted  Precision suffers:  Because synonyms may not be exact synonyms, the probability of retrieving irrelevant material increases  Broader terms may result in the retrieval of material which does not discuss the narrower search topic  Using Boolean operators rather than proximity operators may increase the probability that the terms won't be in context  Recall suffers in the opposite case, when the search is narrowed to improve precision

Other Problems with P and R  Records must be considered either relevant or irrelevant (what about records that are marginally relevant, somewhat irrelevant, very relevant, or completely irrelevant?)  Individual perception: what is relevant to one person may not be relevant to another  Measuring recall: it is difficult to know how many relevant records exist in the DB  Measures for estimating recall  Usefulness of P and R

Challenges in Vocabulary Control  Specific vs. general  Synonymous concepts  Word form and one-word forms (e.g., online)  Sequence and form for multiword terms and phrases; inverted order  Abbreviations and acronyms  Popular vs. technical names

What is a Controlled Vocabulary?  A limited set of terms for indexing (subject cataloging) and for searching  authorized terms (representing concepts)  scope notes  related concepts  lead-in terms (non-preferred synonym term, not for indexing or searching; a pointer to authorized ones)

Types of Control Terminology  Synonyms (more terms for one concept)  Homographs (more than one meaning): qualifiers or preferred term synonym  Homophones  Conceptual relationships  Hierarchical (narrower, broader)  Associative (related)  Cross References

Cross reference Structure of Controlled Vocabulary  Term-A  scope note: explains use of the term  UF lead-in term-B “used for”  BT term(s)  SA term(s) “see also”  NT term(s)  -- subdivision  Lead-in term-B  USE Term-A
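The cross-reference structure above maps naturally onto a small data structure. The sketch below is one possible Python representation, not a standard format: the class names, the field names, and the sample entry ("Automobiles", with the lead-in term "Cars") are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One authorized term with its cross references."""
    name: str
    scope_note: str = ""
    used_for: list = field(default_factory=list)   # UF: lead-in terms
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    see_also: list = field(default_factory=list)   # SA


class Vocabulary:
    def __init__(self):
        self.terms = {}  # authorized term name -> Term
        self.use = {}    # lead-in term -> authorized term name (USE reference)

    def add(self, term):
        self.terms[term.name] = term
        for lead_in in term.used_for:
            self.use[lead_in] = term.name

    def resolve(self, word):
        """Return the authorized term for a word, or None if it is unknown."""
        if word in self.terms:
            return word
        return self.use.get(word)


vocab = Vocabulary()
vocab.add(Term(
    name="Automobiles",
    scope_note="Hypothetical entry for illustration",
    used_for=["Cars"],
    broader=["Motor vehicles"],
    narrower=["Sports cars"],
))
print(vocab.resolve("Cars"))      # follows the USE reference
print(vocab.resolve("Bicycles"))  # not in the vocabulary
```

Keeping the USE references in their own lookup table mirrors the slide: a lead-in term is never used for indexing, it only points to the authorized term.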

Examples Subject Heading Lists  developed in library community  in favor of pre-coordination in card cataloging environment Thesauri  developed as part of IR systems  in favor of post-coordination and somewhat pre-coordination

Pre-Coordination  The combination of concepts at the time of cataloging or indexing, e.g.:  Library -- automation -- United States  The above example is one heading in a structured format: Topic -- subtopic -- geography  (LCSH is a highly pre-coordinated controlled vocabulary)  Indexer constructs subject strings with main terms followed by subdivisions

Post-Coordination  The combination of concepts at the time of searching for a compound concept, e.g.:  library  automation  United States  The above example indicates three descriptors assigned to a work; no structure exists between them  Examples: ERIC

Pre-Coordinated SH  Document Number: 195  Title: France importing crops from US and exporting wine to US  SH: crop--export--US  SH: crop--import--France  SH: wine--export--France  SH: wine--import--US  Document Number: 44  Title: US importing wine from France  SH: wine--export--France  SH: wine--import--US

Pre-Coordinated Indexes  crop--export--US: 195  crop--import--France: 195  wine--export--France: 44, 195  wine--import--US: 44, 195  These facet headings are clear about the direction of the trade between two countries. What happens if the concepts are not combined in the headings?

Post-Coordinated Indexes  crop: 195  export: 44, 195  import: 44, 195  France: 44, 195  US: 44, 195  wine: 44, 195  Let’s do a Boolean search: crop AND import AND US  Result: Document 195, which is irrelevant (it is France, not the US, that imports crops)
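The contrast between the two kinds of index can be reproduced with a short Python sketch. The documents and headings are the ones from the slides; the function names and the choice of sets for postings are assumptions made for this illustration.

```python
# Documents from the slides, with their pre-coordinated subject headings.
DOCS = {
    195: ["crop--export--US", "crop--import--France",
          "wine--export--France", "wine--import--US"],
    44: ["wine--export--France", "wine--import--US"],
}


def pre_coordinated_index(docs):
    """Map each complete heading string to the documents that carry it."""
    index = {}
    for doc_id, headings in docs.items():
        for heading in headings:
            index.setdefault(heading, set()).add(doc_id)
    return index


def post_coordinated_index(docs):
    """Index single descriptors; the links between them are discarded."""
    index = {}
    for doc_id, headings in docs.items():
        for heading in headings:
            for term in heading.split("--"):
                index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, *terms):
    """Intersect the posting sets of all the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


pre = pre_coordinated_index(DOCS)
post = post_coordinated_index(DOCS)
print(sorted(pre.get("crop--import--US", set())))         # no such heading
print(sorted(boolean_and(post, "crop", "import", "US")))  # false drop
```

The pre-coordinated lookup finds nothing for crop--import--US, which is correct. The post-coordinated Boolean search returns document 195 because all three words occur somewhere in its headings, even though they were never combined in that way.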

Subject Cataloging--Process  1. Conceptual analysis of a document to identify what the document is about  The methods:  purpose of the author (indicative statements)  figure-ground  objective analysis (statistics)

Subject Cataloging--Process (cont’d) 2. Translation of the conceptual analysis into a particular vocabulary The methods  look up subject headings  weighted headings  assign headings

Various Subjects in MARC  MARC tags  600 vs. 100 vs. 700  610 vs. 110 vs. 710  1XX fields (main entries)  4XX fields (series statements)  6XX fields (subject headings)  7XX fields (added entries other than subject or series)  8XX fields (series added entries)  X00 Personal names  X10 Corporate names  X11 Meeting names  X30 Uniform titles  X40 Bibliographic titles  X50 Topical terms  X51 Geographic names  For example, 610: subject heading that is a corporate name
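The tag pattern on this slide (first digit = function of the field, last two digits = type of heading) can be written as a small lookup. The Python sketch below only encodes the pattern as listed here; the function name is made up, and not every combination it accepts is a defined field in the MARC 21 bibliographic format.

```python
# First digit of the tag: what role the field plays in the record.
FIELD_GROUPS = {
    "1": "main entry",
    "4": "series statement",
    "6": "subject heading",
    "7": "added entry (other than subject or series)",
    "8": "series added entry",
}

# Last two digits of the tag: what kind of heading the field holds.
HEADING_TYPES = {
    "00": "personal name",
    "10": "corporate name",
    "11": "meeting name",
    "30": "uniform title",
    "40": "bibliographic title",
    "50": "topical term",
    "51": "geographic name",
}


def describe_tag(tag):
    """Describe a three-digit tag using the slide's pattern, or return None."""
    if len(tag) != 3 or not tag.isdigit():
        return None
    group = FIELD_GROUPS.get(tag[0])
    kind = HEADING_TYPES.get(tag[1:])
    if group is None or kind is None:
        return None
    return f"{group}: {kind}"


for tag in ("600", "100", "700", "610"):
    print(tag, "->", describe_tag(tag))
```

Read this way, 600, 100 and 700 all hold a personal name; only the role of that name in the record differs.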

Subject Cataloging Quality  Consistency: works on the same subjects are given the same headings  Exhaustivity: whether the headings cover all aspects of the work -- number of headings  Specificity: whether the heading assigned is at the same hierarchical level of the concept

Controlled Vocabularies  1. Subject heading lists: include phrases, precoordinated terms  LCSH  Sears List of SH  MeSH  2. Thesauri: single and bound terms (e.g., Type A Personality) representing single concepts (descriptors); strictly hierarchical; narrower in scope; can be multilingual  Art & Architecture Thesaurus (cultural heritage info)  Thesaurus of ERIC Descriptors (educational resources)  INSPEC Thesaurus (physics and engineering communities)

Controlled Vocabularies  1. and 2. provide subject access to info objects by providing terminology that can be consistent (controlled vocabulary)  Choose preferred terms and make references from non-used terms  Provide hierarchies: BT, NT, RT  3. Ontologies: bring together all variant ways of expressing a concept and show relationships via BT, NT, RT; do not select preferred terms; a “systematic account of existence”

Solution to the “Subject Problem” for Images: Natural Language Analysis  Natural language that people use (linguistic constructs, grammar relationships, syntax, communication vocabulary) can be used for describing and searching in visual information retrieval systems  Content-based natural language processing is understood in terms of syntactic structure in the spoken natural language  Concept-based natural language processing attempts to capture the semantics of an image

Critical Reflection 7: The GAME  User-Based Natural Language Analysis for Creation and Evaluation of Visual information Retrieval Systems in Library and Museum Settings  Your response: On the Black Board space, respond to the questions provided on the handout

Exercise 3: Authority Control  OBJECTIVES  to observe name authority control  to observe controlled vocabulary for subject access  Part I. Name authority  Go to the authority record database in the Library of Congress.  Search for the popular author, Samuel Clemens.  How many authorized headings are established for him? Attach the most complete MARC Authority record for each authorized heading.  For the MARC Authority format, explain the semantics (meanings) of the fields: 1xx, 4xx and 5xx. Make sure that you mention how authorized and unauthorized headings are cross-referenced.  For each authorized heading, how many bibliographic records are found in the LC collection using the heading? If an authorized heading is not used, why so?  Can the user just click on the authorized heading to retrieve bibliographic records by the author?

Exercise 3: Authority Control  Part II. Authorized subject headings  Go to the authority record database in the Library of Congress.  Search for an authorized subject heading for each of the topics: Teapot Dome scandal; Watergate scandal  What are the broader headings (BT)?  What are the narrower headings (NT)?  What are the related headings (RT)?  Construct an alphabetical subject headings list of the headings (BT, NT, RT, the heading itself) and their related headings, including both authorized headings and lead-in terms. Under each heading, cross-reference the related terms: Used-for, Use, BT, NT, RT.

Exercise 3: Authority Control  WHAT TO TURN IN?  The authority records for Samuel Clemens in MARC format and your answers to all the questions.  The authority records for the two subject headings in MARC format and the subject headings list.  A brief discussion of the roles of authority control in IR.