Information Organization


1 Information Organization
LBSC 670 Information Organization

2 Today Guest Speaker –Jeremy York – HathiTrust
Classification Thoughts and CV Overview & History Related concepts Examples A note on MARC specifications

3 Classification concepts
Aboutness, specificity, granularity “Words have power,” - classification systems exist within a socio-political context Classification methods Manual/automatic, Pre/Post coordinate, Hierarchical/faceted, formal/social Social tagging, folksonomies, taxonomies, classification, conceptual trees, resemblance, aboutness, binomial, type, metadata, topic/subtopic, map-of-knowledge, exceptions, overlaps, without prior planning, lump and split Classification, Information Architecture, Indexing, Information Retrieval

4 CV overview What are controlled vocabularies?
Types Basic concepts How are CVs created and maintained? Metadata standards Example Systems When does a CV turn into a KO? Term Lists, Thesauri, Taxonomies, Ontologies

5 Controlled Vocabularies
“organized lists of words and phrases, or notation systems, that are used to initially tag content, and then to find it through navigation or search.” (Warner via Leise, Fast) “the primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval” (ANSI Z39.19)

6 Knowledge Organization
“tools that present the organized interpretation of knowledge structures” (Hjørland) “classification schemes that organize materials at a general level…, subject headings that provide more detailed access, and authority files that control variant versions of key information” (Hodge)

7 Uses of controlled vocabulary
Define scope, content, and context of a body of knowledge Support discovery - Navigation, search, browsing Map information objects to user terminology Enforce term consistency and relationships

8 A+ A good CV. . . Removes ambiguity
Defines relationships between things Contextualizes information A+ Removes ambiguity Synonyms, Homonyms, polysemes, Defines relationships Equivalence, hierarchical, associative (BT, NT, RT, CR) reciprocity, Provides context Category, scope, qualifiers, modifiers, scope notes

9 CV Concepts Content Analysis Form Analysis Ambiguity Synonymy
Exhaustivity Specificity Co-extensivity Aboutness Semantic structure Warrant (User, Literary, Organization) Form Analysis Linguistics Grammar Semiotics Single / Multiple terms Indexing & Retrieval Pre vs. Post Coordinate Recall vs. Precision Natural language processing (NLP) There is a lot here. This is not a class about indexing. There is a great document that helps you understand these things in more detail Content analysis – e.g. how do we understand the ‘aboutness’? Form analysis – e.g. how does the form of the text impact our understanding and how do we code our thesaurus Indexing & retrieval – pre coordination vs post coordination, recall vs precision, NLP

10 Content Analysis Ambiguity Synonymy Specificity Exhaustivity
Each term should relate to a single concept Synonymy Each concept should be identified by a single entry Specificity Using the most specific words or phrase expressing the subject Exhaustivity The extent to which the entire document is indexed (Summarization, depth) Co-extensivity “Assign as many terms as needed to bring out the main theme, and according to guidelines sub-themes.” (p. 29, Lancaster) “nothing more, nothing less” Semantic Structure Terms can be related with equivalence, hierarchy, or associated relationships (Use, See, NT, BT, RT)

11 Content Analysis (2) Aboutness = Subject/topic? Wilson (1968)
Author intent, topicality, relationship to other resources, textual analysis Fairthorne (1969) Intentional aboutness (author), extensional aboutness (document) Maron (1977) objective about (document), subjective about (user), and retrieval about (information retrieval) Hjørland (2001) “Closely related to theories of meaning, interpretation, and epistemology”

12 Content Analysis (3) Wilson’s criteria for evaluating aboutness (1968)
Identify author’s purpose (intent) Weigh the predominant topics, elements (topical analysis) Group/count a document’s use of concepts and references (bibliometrics) Identify essential elements (text analysis) Subject Access to Information through: Evaluate, assess… Translate to where in the language system Assign the descriptor (term, class notation, code) Read the document [Intellectual reading] look for key features many indexers mark up the items rarely have time to read the whole document Determine aboutness [Conceptual analysis] Translate aboutness into the vocabulary or scheme you are using In general: Subject headings: 1-3 headings Descriptors, 5-8 descriptors Classification: 1 notation.

13 Content Analysis (4) Literary Warrant User Warrant
“The inclusion of a vocabulary term in a controlled vocabulary based on its appearance in one or more content items. For example, a medical text may use the term “oncology.” Based on literary warrant, that term would be included in the controlled vocabulary even though the general public uses the term “cancer.” (Glosso-Thesaurus) User Warrant “The inclusion of a vocabulary term in a controlled vocabulary based on use by users. Such terms can be identified through search log analysis or free listing.” (Glosso-Thesaurus) Organizational Warrant “Justification for the...selection of a preferred term due to the characteristics and context of the organization using the resource” (ANSI Z39.19)

14 Form Analysis Linguistics Semiotics Lexical structure
Syntax/Form (grammar) Morphology (internal word structure) Semantics (meaning) Pragmatics, discourse analysis (word/phrase use) Semiotics study of signs/symbols Lexical structure Document layout, markup, tags (think DOM)

15 Indexing & Retrieval Pre/Post-Coordinate Recall / Precision
Organization prior to retrieval Organization at the point of retrieval Recall / Precision Recall: number of relevant docs retrieved / total number of relevant docs in collection Precision: number of relevant docs retrieved / total number of docs retrieved Natural language processing Uses semantics and syntax to automatically distill ‘aboutness’

16 Recall & Precision CV Entry # of docs A collection of 100 documents
Searches “Vocabularies” Recall 100/100 = 1 Precision 100/100 = 1 “Facet” Recall 20/100 = .2 Precision 20/28 = .71 “OWL” Recall 1/100 = .01 Precision 1/1 = 1 CV Entry # of docs Controlled Vocabularies 100 Faceted analysis 20 Ontologies 5 OWL 1 RDF 3 Let's consider recall and precision Inverse relationship – the more recall the lower precision and vice versa A very specific thesaurus and rigorous indexing system leads to high precision and low recall – great for medical inquiry A broader thesaurus and looser indexing leads to high recall – use other factors to rank to create relevance Recall = # of relevant docs retrieved / total # of relevant docs in collection Precision = # of relevant docs retrieved / total # of docs retrieved
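
A minimal sketch of the two measures, using the standard information-retrieval definitions: recall is the share of relevant documents that a search retrieved, and precision is the share of retrieved documents that are relevant. The document ids and the relevant and retrieved sets below are hypothetical and are not taken from the slide's 100-document collection.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Share of the relevant documents that the search retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: set, relevant: set) -> float:
    """Share of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical data: documents are identified by integer ids.
relevant_docs = set(range(20))       # ids 0..19 are relevant to the query
retrieved_docs = set(range(12, 28))  # the search returned ids 12..27

r = recall(retrieved_docs, relevant_docs)     # 8 relevant hits / 20 relevant docs
p = precision(retrieved_docs, relevant_docs)  # 8 relevant hits / 16 retrieved docs
print(f"recall={r:.2f} precision={p:.2f}")
```

Retrieving more documents can only keep recall the same or raise it, but each irrelevant document retrieved lowers precision, which is the trade-off the slide describes.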

17 Types of Controlled Vocabularies
Term Lists Glossaries, Dictionaries, Gazetteers, Folksonomies Synonym rings Z39.19 example Oracle Text Taxonomies Website navigation scheme Thesauri / Ontologies Authority files, subject thesauri, topic maps

18

19 Thesauri & taxonomy examples
List of vocabularies Taxonomy warehouse Two Examples Health & Ageing Thesaurus Thesaurus of Geographic names

20 Interoperable system example
NCBI Entrez 35 databases using interoperable controlled vocabulary systems to provide rich meta-searching Cross-database discovery – search for “heart attack” Cross database linking – search for aconitase, follow the “other links” tab.

21 CV Structures Organization structures Hierarchical systems
Term Lists / Enumerative systems Hierarchies Trees Facets / Associative relationships Folksonomies

22 Hierarchies Features Inclusiveness “Is-a” relationship Inheritance
Transitivity Systematic Mutually exclusive Necessary and sufficient Issues Illusion of completeness Multiple perspectives Lack of comprehensive knowledge Difference in scale Lack of transitivity Strict rules Benefits Comprehensive Economy of notation Inheritance Inference Real definitions Holistic perspective High level view In contrast trees do not denote inheritance, merely relationships (e.g. partitive) but they do Rigidity One-way perspective Selective perspective (single attribute) Shows a primary relationship well Indicates distance between objects Shows relative frequency From
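
The "is-a", transitivity, and inheritance features listed for hierarchies can be sketched as below. This is an illustration only: the terms, the properties, and the function names are hypothetical, and the sketch assumes each term has a single broader class.

```python
# Hypothetical "is-a" hierarchy: each term maps to its single broader class.
IS_A = {
    "poodle": "dog",
    "dog": "mammal",
    "mammal": "animal",
}

# Hypothetical properties, each declared once at one level of the hierarchy.
PROPERTIES = {
    "animal": {"alive"},
    "mammal": {"warm-blooded"},
    "dog": {"barks"},
}


def broader_chain(term: str) -> list:
    """Follow is-a links upward; by transitivity every ancestor is a broader class."""
    chain = []
    seen = {term}
    while term in IS_A:
        term = IS_A[term]
        if term in seen:  # guard against an accidental cycle in the data
            break
        seen.add(term)
        chain.append(term)
    return chain


def inherited_properties(term: str) -> set:
    """A term inherits the properties of every class above it."""
    props = set(PROPERTIES.get(term, set()))
    for ancestor in broader_chain(term):
        props |= PROPERTIES.get(ancestor, set())
    return props


print(broader_chain("poodle"))
print(sorted(inherited_properties("poodle")))
```

Because properties are stored once and inferred downward, the hierarchy gets the "economy of notation" benefit, at the cost of the rigidity noted under issues.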

23 Relationships Equivalence ( Term Lists)
“use”, “see”, “isVersionOf”, “isFormatOf” Hierarchical (Thesauri, Taxonomies) Generic – “is a” Partitive – “is part of”, “has part”, “has conceptual part”, “member of” Instance – Associative (Facets, Ontologies) “isReferencedBy”, “isRequiredBy”, “hasDerivative”

24 Faceted vocabularies Issues Benefits
Lack of obvious relationships Difficult to navigate, visualize Harder to establish facets Benefits Accommodates Partial Knowledge Flexible, Hospitable Expressive Bottom-up, not top-down Multi-theoretical Multi-perspective Multi-dimensional, multi-relationship driven, Subject, Object, Predicate From
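
A minimal sketch of faceted description, assuming each item carries one value per facet and that facets are combined only at search time. The records, the facet names, and `facet_filter` are hypothetical.

```python
# Hypothetical records: each item carries one value per facet.
ITEMS = [
    {"id": 1, "format": "map", "period": "19th century", "place": "Egypt"},
    {"id": 2, "format": "photograph", "period": "20th century", "place": "Egypt"},
    {"id": 3, "format": "map", "period": "20th century", "place": "Yemen"},
]


def facet_filter(items, **facets):
    """Return the ids of items that match every requested facet value."""
    return [
        item["id"]
        for item in items
        if all(item.get(name) == value for name, value in facets.items())
    ]


print(facet_filter(ITEMS, format="map"))
print(facet_filter(ITEMS, format="map", place="Egypt"))
```

No facet is privileged over the others, so the same records can be approached by format, period, or place, which is the multi-perspective benefit listed above.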

25 Folksonomy Features Single level description Open vocabulary list
User supplied/harvested tags Issues Lack of controlled vocabulary Lack of relationship/hierarchy assignment Lack of definition of intent Benefits Flexible User-Centered Harvestable(?) – for what? Go to Talk about Egyptian protest Jan25 – the day the protests began – see how the trends hit Also show Yemen Show Justin Bieber And vamosmaickel Issues – tags are event driven, not consistent, Talk about Flickr More consistency, more resource based Incredible thing – Twitter was a major enabling factor, could it have happened if they had had to look up the LC heading first?

26 Term List Examples Authority files – Maps to preferred terms
Library of Congress Encoded Archival Context Union List of Artist Names Glossaries/Dictionaries –Words & definitions, sometimes topic focused Glosso-Thesaurus Folksonomies – Contextualization, Trend discovery, Personal Information Synonym rings – Used for back-end equivalence in searching Princeton Wordnet
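
Back-end equivalence with a synonym ring can be sketched as simple query expansion: every member of a ring is treated as equivalent, with no preferred term. The rings, the sample documents, and the function names below are hypothetical, and the matching is naive substring search.

```python
# Hypothetical synonym rings: all members of a ring are treated as equivalent.
SYNONYM_RINGS = [
    {"heart attack", "myocardial infarction"},
    {"controlled vocabulary", "cv"},
]

# Hypothetical document texts keyed by id.
DOCS = {
    1: "Treatment after myocardial infarction",
    2: "Heart attack warning signs",
    3: "Thesaurus construction",
}


def expand_query(term: str) -> set:
    """Return the whole ring containing the term, or just the term itself."""
    normalized = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            return set(ring)
    return {normalized}


def search(term: str) -> list:
    """Match documents containing any member of the expanded query."""
    terms = expand_query(term)
    return sorted(
        doc_id
        for doc_id, text in DOCS.items()
        if any(t in text.lower() for t in terms)
    )


print(search("Heart Attack"))
```

The searcher never sees the ring; the expansion happens behind the search box, which is why synonym rings suit back-end use.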

27 Choosing a framework Use questions Content questions System Questions
Who is your user, what are their needs? What systems are your users familiar with? Will this system be internal/external? Content questions How extensive, defined is the information? Is your subject matter static or fluid? What organizational framework best describes your content? System Questions What access are you trying to provide? What external pressures exist? What external entities/theories will interact with this system?

28 Thesauri Definitions “Guide to use of terms, showing relationships between them, for the purpose of providing standardized, controlled vocabulary for information storage and retrieval”(Monash) “A list of words showing similarities, differences, dependencies, and other relationships to each other”(USG)

29 Creating a CV (1) Design methods
Re-use existing, start with content & desired use ideas Committee / community approach Top-down Concept driven Bottom-up Document driven Empirical approach Deductive approach Select terms, create relationships, perform term control Inductive approach Establish CV at outset, build hierarchies on an as-needed basis

30 Creating a CV (2) Top-Down (deductive) Bottom-up (Inductive)
Identify audience Identify all topics, concepts, uses, and context of the domain Sort topics identified into an appropriate organization scheme (enumerative, hierarchical, faceted) Solidify structure and clean up gaps & redundancies Assign documents to categories, test retrieval Bottom-up (Inductive) Identify audience Survey documents for topics/concepts. Build system on the fly – let content drive structure and limits of system Identify gaps & redundancies in system Test retrieval

31 Creating a CV (3) Think about scope, use, content, maintenance
Gather Terms Based on existing systems, content Based on user needs/expectations Investigate issues of specificity, exhaustivity, granularity Build hierarchies, relationships Broader/narrower terms, Related terms, Use/Use for, see/see also Establish Rules Implement Evaluate Maintain

32 Evaluating a CV Goals Methods
Determine if the CV solves retrieval needs of user/system Determine if CV matches user’s content model/term expectations Methods Expert evaluation of CV User based card sorting compared to actual CV Identification of non-included documents Analysis of use of system - HCI Are required facets/hierarchies present? Is term entry form consistent Is the system balanced (NT/BT/RT) How does the content of the vocabulary map to the content of the resources

33 CV Maintenance Primary responsibility New terms Modified terms
Editor, board, committee New terms Is it really new or a different view What is the proper form & placement Modified terms Include a change log Use a “USE” reference to point to new term Deleted terms Unused / Overused terms May want to keep for historical retrieval purposes Modification history Use modification notes, date/time stamps
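
The maintenance practices above (change log, "USE" reference to the new term, keeping retired terms for historical retrieval, date stamps) can be sketched as follows. The class names, the method names, and the sample terms are hypothetical, and this is one possible design rather than a prescribed one.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Term:
    label: str
    use: Optional[str] = None                    # USE reference to the replacement term
    history: list = field(default_factory=list)  # date-stamped modification notes


class Vocabulary:
    def __init__(self):
        self.terms = {}

    def add(self, label: str, when: date) -> None:
        term = Term(label)
        term.history.append(f"{when.isoformat()} added")
        self.terms[label] = term

    def replace(self, old: str, new: str, when: date) -> None:
        """Retire `old` in favor of `new`, keeping it for historical retrieval."""
        if new not in self.terms:
            self.add(new, when)
        retired = self.terms[old]
        retired.use = new
        retired.history.append(f"{when.isoformat()} replaced by {new}")

    def resolve(self, label: str) -> str:
        """Follow USE references to the current preferred term."""
        seen = set()
        while self.terms[label].use is not None and label not in seen:
            seen.add(label)
            label = self.terms[label].use
        return label


cv = Vocabulary()
cv.add("Aeroplanes", date(2010, 1, 4))
cv.replace("Aeroplanes", "Airplanes", date(2011, 2, 7))
print(cv.resolve("Aeroplanes"))
print(cv.terms["Aeroplanes"].history)
```

The retired term is never deleted, so records indexed under the old heading can still be found and redirected.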

34 Case study - MeSH

35 Class exercise Protégé overview Replication of the Glosso-Thesaurus
Orientation Object types (Classes, Slots, Instances) Relationships (hierarchies, associative) Replication of the Glosso-Thesaurus Visit the Boxes & Arrows Glosso Thesaurus Look at the data there and come up with a structure in Protégé that allows replication of the thesaurus Some issues to consider are: Do you want terms to be classes or instances? What is the easiest way to show the relationships (broader term, narrower term, etc)? Do you need to allow multiple relationships for a given type (BT, RT, etc)? If you have multiple classes, at what level should you create the slots?

36 Thesauri Concepts Preferred terms Non-preferred terms
Semantic relations between terms How to apply terms (guidelines, rules) Scope notes Adding terms (How to produce terms that are not listed explicitly in the thesaurus)

37 Common thesaural identifiers
SN Scope Note Instruction, e.g. don’t invert phrases USE Use (another term in preference to this one) UF Used For BT Broader Term NT Narrower Term RT Related Term
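
A minimal sketch of how these identifiers might be stored so that reciprocal relationships stay in step: every BT implies an NT on the other term, RT is symmetric, and USE pairs with UF. The `Thesaurus` class, its method names, and the sample terms are hypothetical.

```python
from collections import defaultdict


class Thesaurus:
    def __init__(self):
        # relations[term][tag] is a set of target terms, e.g. relations["Dogs"]["BT"]
        self.relations = defaultdict(lambda: defaultdict(set))
        self.scope_notes = {}

    def _link(self, source, tag, target, reciprocal):
        """Record a relationship together with its reciprocal."""
        self.relations[source][tag].add(target)
        self.relations[target][reciprocal].add(source)

    def add_broader(self, term, broader):
        self._link(term, "BT", broader, "NT")

    def add_related(self, term, other):
        self._link(term, "RT", other, "RT")

    def add_use(self, non_preferred, preferred):
        self._link(non_preferred, "USE", preferred, "UF")

    def entry(self, term):
        """Format one term in the conventional tagged layout."""
        lines = [term]
        if term in self.scope_notes:
            lines.append(f"  SN {self.scope_notes[term]}")
        for tag in ("USE", "UF", "BT", "NT", "RT"):
            for target in sorted(self.relations[term][tag]):
                lines.append(f"  {tag} {target}")
        return "\n".join(lines)


t = Thesaurus()
t.scope_notes["Dogs"] = "Domestic dogs only"
t.add_broader("Dogs", "Mammals")
t.add_broader("Poodles", "Dogs")
t.add_related("Dogs", "Kennels")
t.add_use("Canines", "Dogs")
print(t.entry("Dogs"))
```

Writing both directions in one step means an editor cannot add a BT without the matching NT appearing on the broader term.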

38 Thesauri Guides National Information Standards Organization. (2005). Guidelines for the construction, format, and management of monolingual thesauri. ANSI/NISO Z Bethesda, MD: NISO Press. Aitchison, Jean & Gilchrist, Alan. Thesaurus Construction: A Practical Guide. 3rd ed. London: Aslib, 1997. Willpower Information Management Consultants

39 Thesaurus Exploration
Protégé introduction and tour What is Protégé? What is it used for? How will we use it this semester? Search for Mississippi Show the hierarchical relationships Open Protégé Give a short tour

40 When is a CV an Ontology? “The study of being or existence”
“A specification of a conceptualization” (Gruber) “An ontology formally defines a common set of terms that are used to describe and represent a domain.” (OWL)

41 Webster’s Dictionary Webster’s Third New International Dictionary defines Ontology as: A science or study of being, specifically a branch of metaphysics* relating to the nature and relations of being. A theory concerning the kinds of entities and specifically the kinds of abstract entities that are to be admitted to a language system. *Metaphysics: Nature of being “or” existence. Ontology will analyze the most general and abstract concepts or distinctions that underlie every more specific description of any phenomenon in the world e.g. time, space, matter, process, cause and effect, system

42 Next Week Work time for Protégé Exploration of ontologies

