DL:Lesson 5 Classification Schemas Luca Dini


1 DL:Lesson 5 Classification Schemas Luca Dini

2 Overview
The Dublin Core defines a number of metadata elements, but what about the values for those elements? Should they be unrestricted text values or come from pre-defined vocabularies? The answer is "it depends". We will discuss how to determine the appropriate approach for an organization's situation. We will also cover how pre-defined vocabularies should be sourced, structured, and maintained.

3 Vocabulary development and maintenance
Vocabulary development and maintenance is the LEAST of three problems:
The Vocabulary Problem: How are we going to build and maintain the lists of pre-defined values that can go into some of the metadata elements?
The Tagging Problem: How are we going to populate metadata elements with complete and consistent values? What can we expect to get from automatic classifiers? What kind of error detection and error correction procedures do we need?
The ROI Problem: How are we going to use content, metadata, and vocabularies in applications to obtain business benefits? More sales? Lower support costs? Greater productivity? How much content? How big an operating budget?
We need to know the answer to the ROI Problem before solving the Vocabulary Problem.

4 Definitions
Term: Definition
Metadata Element: A ‘field’ for storing information about one piece of content. Examples: Title, Creator, Subject, Date, …
Metadata Value: The ‘contents’ of one Metadata Element. Values may be text strings, or selections from a predefined vocabulary.
Metadata Schema: A defined set of metadata elements. The Dublin Core is one schema.
Free Text Value: An unconstrained text metadata value. Some text values are constrained to follow a format (e.g. YYYY-MM-DD).
Vocabulary: A list of predefined values for a metadata element.
Controlled Vocabulary: A vocabulary with a defined and enforced procedure for its update.
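As a rough illustration of these definitions, the sketch below treats one record as a mapping from metadata elements to values and checks two of them: a text value constrained to the YYYY-MM-DD format, and a value that must come from a predefined vocabulary. The element names follow Dublin Core; the TYPE_VOCABULARY list and the validate helper are hypothetical and only serve the example.

```python
import re

# Hypothetical vocabulary for the "type" element (not an official list).
TYPE_VOCABULARY = {"report", "memo", "presentation"}

# Format constraint for the "date" element: YYYY-MM-DD.
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate(record):
    """Return a list of problems found in one metadata record."""
    problems = []
    # Text value constrained to a format.
    if not DATE_FORMAT.fullmatch(record.get("date", "")):
        problems.append("date is not in YYYY-MM-DD format")
    # Value that must be selected from a predefined vocabulary.
    if record.get("type") not in TYPE_VOCABULARY:
        problems.append("type is not in the vocabulary")
    return problems


good = {"title": "Q3 sales review", "date": "2005-10-31", "type": "report"}
bad = {"title": "Untitled", "date": "31/10/2005", "type": "stuff"}
print(validate(good))
print(validate(bad))
```

The title stays unconstrained free text, so the sketch does not check it at all.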

5 Controlled vocabularies
Hierarchical classification of things into a tree structure
Linnaeus:
  Kingdom: Animalia
  Phylum: Chordata
  Class: Mammalia
  Order: Carnivora
  Family: Canidae
  Genus: Canis
  Species: C. familiaris
  …
UNSPSC:
  Segment: 44 - Office Equipment and Accessories and Supplies
  Family: .12 - Office Supplies
  Class: .17 - Writing Instruments
  Commodity: .05 - Mechanical pencils, .06 - Wooden pencils, .07 - Colored pencils
  …

6 Classification Schemes
Types of vocabularies, numbered by increasing complexity (1 to 6):
1. Term List: Simple list of terms with no internal structure or relations. Relation type: None.
2. Synonym Rings: List of sets of terms to regard as equivalent. Widely supported in search software. Relation type: Equivalence.
3. Authority Files: List of names for known entities (people, organizations, books, etc.). Relation type: Reference.
4. Classification Schemes: Hierarchical arrangement of concepts. Relation type: Loose Hierarchy.
5. Thesauri: Hierarchical arrangement of concepts plus supporting information and additional, non-hierarchical, relations. Relation type: “Is-a” Hierarchy plus Loose Relations.
6. Ontologies: Arrangement of concepts and relations based on a model of underlying reality, e.g. organs, symptoms, diseases & treatments in medicine. Relation type: Model-based Typed Relations.
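To make the synonym ring type concrete, here is a minimal sketch of how a search layer might use one: a query term is expanded to every term in the same ring. The SYNONYM_RINGS data and the expand function are assumptions made for illustration, not any particular search product's API.

```python
# Hypothetical synonym rings: each set lists terms to treat as equivalent.
SYNONYM_RINGS = [
    {"digital libraries", "electronic libraries", "virtual libraries"},
    {"car", "automobile"},
]


def expand(term):
    """Return the term plus every term that shares a ring with it."""
    expanded = {term}
    for ring in SYNONYM_RINGS:
        if term in ring:
            expanded |= ring
    return expanded


print(sorted(expand("virtual libraries")))
print(sorted(expand("thesauri")))
```

A ring has no preferred term: all members are equal, which is what separates it from a thesaurus entry.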

7 Vocabulary Control
The degree of control over a vocabulary is (mostly) independent of its type.
Uncontrolled: Anybody can add anything at any time and no effort is made to keep things consistent. Multiple lists and variations will abound.
Managed: Software makes sure there is a list that is consistent (no duplicates, no orphan nodes) at any one time. Almost anybody can add anything, subject to consistency rules (e.g. a file system hierarchy).
Controlled: A documented process is followed for the update of the vocabulary. Few people have authority to change the list. Software may help, but the emphasis is on human processes and custodianship (e.g. an employee list).
Term lists, synonym lists, … can be controlled, managed, or uncontrolled. Ontologies are managed.
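The consistency rules of a managed vocabulary (no duplicates, no orphan nodes) can be sketched as a simple check over (term, parent) pairs. The function name and the sample data are hypothetical; "Toner" and "Printer Supplies" are invented to trigger the orphan case.

```python
def consistency_problems(nodes):
    """Check a vocabulary given as (term, parent) pairs; parent is None for a root."""
    problems = []
    seen = set()
    # Rule 1: no term may appear twice.
    for term, _parent in nodes:
        if term in seen:
            problems.append(f"duplicate term: {term}")
        seen.add(term)
    # Rule 2: every non-root term must point at a parent that exists.
    for term, parent in nodes:
        if parent is not None and parent not in seen:
            problems.append(f"orphan node: {term} (missing parent {parent})")
    return problems


vocabulary = [
    ("Office Supplies", None),
    ("Writing Instruments", "Office Supplies"),
    ("Mechanical pencils", "Writing Instruments"),
    ("Wooden pencils", "Writing Instruments"),
    ("Wooden pencils", "Writing Instruments"),  # duplicate entry
    ("Toner", "Printer Supplies"),              # parent is not in the list
]
for problem in consistency_problems(vocabulary):
    print(problem)
```

This is the part software can enforce. Deciding who may add "Toner" in the first place is the human process that makes a vocabulary controlled.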

8 Type of controls
Controlled vocabularies are frequently mentioned. That does not mean they are always necessary.
Control comes at a cost, but can provide significant data quality benefits by reducing variations.
Is this a well-controlled vocabulary? No! It is an uncontrolled, but well-managed, term list.
Is this part of an appropriate solution to the ROI problem? Yes! There is no budget to do ongoing control and QA.

9 Likelihood of controlled values
Scale: (Virtually) Mandatory, Highly Likely, Maybe, Highly Unlikely, (Virtually) Impossible
Elements, with the candidate vocabulary where one is named: Language (RFC 3066), Format (IMT), Coverage (ISO 3166), Type (DCMI Type?), Subject (Custom), Creator (LDAP?), Publisher, Contributor, Identifier, Date (W3C DTF), Rights, Title, Relation, Source, Description

10 Mandatory
DC recommends specific best practices:
Language: RFC 3066 (which works with ISO 639)
Format: Internet Media Types (aka MIME)
These vocabularies are widely used throughout the Internet. If you want to do something else, it should be justified.
Describing physical objects? Use Extent and Medium refinements instead of Format.
Regional (vs. national) dialects? a) Why? b) Consider a custom element in addition to standard Language.

11 Likely
DC recommends specific best practices:
Coverage: ISO 3166
  ISO 3166 should be used unless you have good reasons to use something else.
  Consider the Getty Thesaurus of Geographic Names if you need cities, rivers, etc.
  DC provides Encodings for both.
Type: DCMITypes
  The DCMIType list is not necessarily a best practice.
  No widely accepted type list exists, so a custom list is likely.

12 Maybe
Creator, Contributor could come from an “authority file”
  LC NAF in library contexts
  LDAP Directory in corporate contexts
  Recommended where possible
  Many exceptions where the author is outside LDAP
Publisher could come from an authority file
  Org chart in corporate contexts, e.g. internal records management system
Identifier should be a URI
  The organization may manage these, but it's typically a text field, not a controlled list.

13 Subject and extensions
Best practice: Use pre-defined subject schemes, not user-selected keywords.
DC Encodings (DDC, LCC, LCSH, MeSH, UDC) are most useful in library contexts. They are not useful for most corporate needs.
Recommended: Factor “Subject” into separate facets.
  People, Places, Organizations, Events, Objects, Products & Services, Industry sectors, Content types, Audiences, Business Functions, Competencies, …
Store the different facets in different fields.
  Use DC elements where appropriate (coverage, type, audience, …)
  Extend with custom elements for other fields (industry, products, …)

14 Thesauri A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower and other related terms

15 Standards
National and International Standards for Thesauri
ANSI/NISO z: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
ANSI/NISO Draft Standard Z x: American National Standard Guidelines for Indexes in Information Retrieval
ISO 2788, Documentation: Guidelines for the establishment and development of monolingual thesauri
ISO 5964, Documentation: Guidelines for the establishment and development of multilingual thesauri

16 Thesaurus Examples
The ERIC Thesaurus of Descriptors
The Medical Subject Headings (MeSH) of the National Library of Medicine
The Art and Architecture Thesaurus

17 ERIC Thesaurus – Entry

18 ERIC Thesaurus – Online

19 MeSH

20 MeSH Online

21 Dewey
Dewey Decimal Classification System (DDC), first published in 1876 by Melvil Dewey
Most widely used classification system in the world (used in 135 countries)
In the United States, used primarily by public and school libraries
Owned by OCLC; editorial work has been carried out at the Library of Congress

22 Dewey
DDC is divided into ten main classes, each main class into ten divisions, and each division into ten sections.
The first digit in each three-digit number represents the main class.
  “500” = natural sciences and mathematics
The second digit in each three-digit number indicates the division.
  “500” is used for general works on the sciences
  “510” for mathematics
  “520” for astronomy
  “530” for physics

23 Dewey
The third digit in each three-digit number indicates the section.
  “530” is used for general works on physics
  “531” for classical mechanics
  “532” for fluid mechanics
  “533” for gas mechanics
A decimal point follows the third digit in a class number, after which division by ten continues to the specific degree of classification needed.
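The digit-by-digit structure can be sketched as a small function that derives the main class, division, and section from a class number. The function name is hypothetical, the captions are the ones quoted on these slides, and the class number 532.5 is used only as an example of a number with decimals.

```python
def dewey_levels(class_number):
    """Split a DDC class number into main class, division, and section."""
    whole, _, decimals = class_number.partition(".")
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError("expected three digits before the decimal point")
    return {
        "main_class": whole[0] + "00",  # first digit
        "division": whole[:2] + "0",    # first two digits
        "section": whole,               # all three digits
        "decimals": decimals,           # further subdivision, if any
    }


# Captions quoted from the slides; anything else prints as "?".
CAPTIONS = {
    "500": "natural sciences and mathematics",
    "530": "physics",
    "532": "fluid mechanics",
}

levels = dewey_levels("532.5")
for key in ("main_class", "division", "section"):
    number = levels[key]
    print(key, number, CAPTIONS.get(number, "?"))
```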

24 Library of Congress Subjects
Essentially an artificial indexing language
Based on literary warrant
Entry vocabulary provided in the form of a reference structure
Moving slowly towards a real thesaurus structure (not there yet)
Not faceted: subdivisions are pre-selected, based on the individual heading or a “pattern” heading

25 LCSH
Digital libraries
  see from “Electronic libraries”
  see from “Virtual libraries”
  see broader term: “Libraries”
  see also “Information storage and retrieval systems”
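The reference structure of this heading can be sketched as a small record with a lookup from entry terms to the preferred heading. The Heading class and resolve function are hypothetical names; the terms themselves come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    preferred: str
    used_for: list = field(default_factory=list)  # "see from" entry terms
    broader: list = field(default_factory=list)   # "see broader term"
    related: list = field(default_factory=list)   # "see also"


digital_libraries = Heading(
    preferred="Digital libraries",
    used_for=["Electronic libraries", "Virtual libraries"],
    broader=["Libraries"],
    related=["Information storage and retrieval systems"],
)

# Entry vocabulary: map every non-preferred term to its preferred heading.
entry_vocabulary = {
    term.lower(): digital_libraries.preferred
    for term in digital_libraries.used_for
}


def resolve(term):
    """Return the preferred heading for a term, or the term itself."""
    return entry_vocabulary.get(term.lower(), term)


print(resolve("virtual libraries"))
```

Unlike a synonym ring, one term is preferred here and the others only lead to it.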

26 Library of Congress Classification
21 basic classes, based on a single alphabetic character (K=law, N=art, etc.)
Subdivided into two or three alpha characters (KF=American Law, ND=painting, etc.)
Further subdivision by specific numeric assignment
Author numbers and dates arrange works by a particular author together and in chronological order

27 LCC
153##$aQL638.E55$hZoology$hChordates. Vertebrates$hFishes$hSystematic divisions$hOsteichthyes (Bony fishes). By family, A-Z$hFamilies$jEngraulidae (Anchovies)
$a = Classification number: single number or beginning number of span (R)
$h = Caption hierarchy
$j = Caption (lowest level, relating to the specific number in $a)
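A field displayed this way can be taken apart with a few lines of string handling. The sketch below is not a MARC library: it assumes the flat display form shown on the slide, with a three-character tag, two indicator characters, and "$" never occurring inside a value. The sample string is a shortened version of the slide's field.

```python
def parse_field(text):
    """Split a displayed field into tag, indicators, and (code, value) pairs."""
    tag, indicators, rest = text[:3], text[3:5], text[5:]
    subfields = [
        (chunk[0], chunk[1:].strip())  # first character is the subfield code
        for chunk in rest.split("$")
        if chunk
    ]
    return tag, indicators, subfields


# Shortened version of the field shown on the slide.
FIELD = (
    "153##$aQL638.E55$hZoology$hChordates. Vertebrates$hFishes"
    "$hSystematic divisions$jEngraulidae (Anchovies)"
)

tag, indicators, subfields = parse_field(FIELD)
number = next(value for code, value in subfields if code == "a")
hierarchy = [value for code, value in subfields if code == "h"]
caption = next(value for code, value in subfields if code == "j")
print(number)
print(" > ".join(hierarchy + [caption]))
```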

28 DMOZ: A worst case example of a unified ‘subject’
DMOZ has over 600k categories
Most are a combination of common facets: Geography, Organization, Person, Document Type, …
  e.g. Top: Regional: Europe: Spain: Travel and Tourism: Travel Guides

29 History of Faceted Navigation
Relatively new
  Taxonomies: Aristotle
S. R. Ranganathan: 1930s (Colon Classification, 1933)
  Issue of compound subjects
  The Universe consists of PMEST: Personality, Matter, Energy, Space, Time
Classification Research Group: 1950’s, 1970’s
  Based on Ranganathan, simplified, less doctrinaire
  Principles:
    Division: a facet must represent only one characteristic
    Mutual Exclusivity
Classification Theory to Web Implementation
  An idea waiting for a technology
  Multiple filters / dimensions

30 What are Facets?
Facets are not categories
  Entities or concepts belong to a category
  Entities have facets
Facets are metadata: properties or attributes
  Entities or concepts fit into one category
  All entities have all facets, defined by a set of values
Facets are orthogonal, mutually exclusive dimensions
  An event is not a person is not a document is not a place.
  A winery is not a region is not a price is not a color.
Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations
Combined using grammars of order and relation to form compound descriptions

31 Faceted Classification
Clearly distinguishes between semantic relationships and syntactic relationships
Semantic relationships
  Within a facet
  Containment relations
Syntactic relationships
  Across facets
  Combinatoric relations
Have a “syntax” for syntactic combination of semantic terms

32 Semantic and Syntactic Relationships
Semantic relationships
  Is-A (thing/kind, genus/species): Mammals > Primates > Humans
  Has-Parts: Human > Head > Eyes
Syntactic relationships
  Compounds: Wheat + harvesting = “wheat harvesting”
  Object + operation = operation on object

33 What is Faceted Navigation?
Not a Yahoo-style Browse
  Computer Stores under Computers and Internet
  One value per facet per entity
Faceted Navigation is not hierarchical
  Tree: travel up and down, not across
  Facets are filters, multidimensional
Facets are applied at search results time: post-coordination, not pre-coordination [Advanced Search]
Faceted Navigation is an active interface: a dynamic combination of search and browse

34 When to Use Faceted Navigation: Advantages
Systematic Advantages:
  Need fewer elements
    4 facets of 10 nodes = 10,000 node taxonomy
  Ability to handle compound subjects
Content Management Advantages:
  Easier to “categorize”, not as conceptual
  Fewer = simple, can use auto-classification better
  Flexible: can add new facets, elements in a facet
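The "4 facets of 10 nodes" arithmetic can be checked directly: the nodes to maintain add up, while the combinations they can express multiply. This sketch assumes independent facets with exactly one value chosen per facet; the function name is hypothetical.

```python
from math import prod


def taxonomy_equivalent(facet_sizes):
    """Return (nodes to maintain, distinct combinations) for independent facets."""
    return sum(facet_sizes), prod(facet_sizes)


nodes, combinations = taxonomy_equivalent([10, 10, 10, 10])
print(nodes, combinations)  # 40 10000
```

So 40 facet values do the work of a single pre-combined taxonomy with 10^4 = 10,000 leaf nodes.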

35 When to Use Faceted Navigation: Advantages (Implementation)
More intuitive: easy to guess what is behind each door
  Simplicity of internal organization
  20 questions: we know and use
Dynamic selection of categories
  Allow multiple perspectives
Trick users into “using” Advanced Search
  wine where color = red, price = x-y, etc.
  Click on color red, click on price x-y, etc.
Flexible: can be combined with other navigation elements
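The "click on color red, click on price x-y" interaction amounts to filtering a result set and recounting the remaining facet values. Here is a minimal in-memory sketch; the catalogue, the facet names, and both helper functions are invented for illustration.

```python
from collections import Counter

# Hypothetical catalogue; every wine has one value per facet.
WINES = [
    {"name": "Wine A", "color": "red", "region": "California", "price": 18},
    {"name": "Wine B", "color": "red", "region": "Australia", "price": 42},
    {"name": "Wine C", "color": "white", "region": "California", "price": 22},
    {"name": "Wine D", "color": "red", "region": "California", "price": 30},
]


def refine(items, **selected):
    """Keep the items that match every selected facet value."""
    return [w for w in items if all(w[f] == v for f, v in selected.items())]


def counts(items, facet):
    """Count the remaining items per value of a facet."""
    return Counter(w[facet] for w in items)


# First click: color = red. The region counts update for the remaining wines.
reds = refine(WINES, color="red")
print(counts(reds, "region"))

# Second click: a price range, applied on top of the first filter.
in_range = [w["name"] for w in reds if 25 <= w["price"] <= 50]
print(in_range)
```

The user never writes "color = red AND price BETWEEN 25 AND 50", yet each click builds exactly that query.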

36 When to Use Faceted Navigation: Disadvantages
Systematic Disadvantages:
  Lack of standards for faceted classifications
  Every project is a unique customization
Implementation Disadvantages:
  Loss of browse context
  Difficult to grasp scope and relationships
  No immediate support for popular subjects
Essential Limit of Faceted Navigation
  Limited domain applicability: type and size
  Entities, not concepts, documents, web sites

37 Developing Facet Structure: Selection of Facets: Theory
Issue: complete model of a domain
Ranganathan: PMEST
  Personality: person, animal, event
  Matter: what x is made of
  Energy: how x changes
  Space: where x is
  Time: when x happens
Three Planes: Idea, Verbal, Notational

38 Facets: an example
A Language
  a English
  b French
  c Spanish
B Genre
  a Prose
  b Poetry
  c Drama
C Period
  a 16th Century
  b 17th Century
  c 18th Century
  d 19th Century
Compound notations:
  Aa English Literature
  AaBa English Prose
  AaBaCa English Prose 16th Century
  AbBbCd French Poetry 19th Century
  BcCd Drama 19th Century
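The notation in this example is regular enough to decode mechanically: each pair of letters names a facet and a focus within it. The sketch below assumes that two-letter pairing; the FACETS table is transcribed from the slide and the decode function is a hypothetical helper.

```python
# Facets and foci transcribed from the slide.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Expand a compound notation such as 'AbBbCd' into its foci."""
    if len(notation) % 2:
        raise ValueError("notation must be facet/focus letter pairs")
    labels = []
    for i in range(0, len(notation), 2):
        facet, focus = notation[i], notation[i + 1]
        _name, foci = FACETS[facet]
        labels.append(foci[focus])
    return " ".join(labels)


print(decode("AaBa"))
print(decode("AbBbCd"))
print(decode("BcCd"))
```

A notation may skip facets, as "BcCd" does, but this sketch does not enforce the A, B, C citation order.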

39 Developing Facet Structure: Selection of Facets: Practice
Region: Australia, California
Type: Red Wine, White, Bubbly
Winery: Alphabetical listing
Price: $25 and below, $25-$50
Top Rated Wines: 90+ under $20
Top Sellers: Cabernet Sauvignon, Pinot Noir
Hot Features: Wine outlet, Sideways collection

40 Faceted Approach
Power
  4 independent categories of 10 nodes = 10,000 nodes (10^4)
Faster construction
  Use existing taxonomies in specific fields
Reduced maintenance cost
  More opportunity for data reuse
Can be easier to navigate with an appropriate UI
  60 nodes, 24,000 combinations

41 Organization
Either expose them directly in the user interface (post-coordination), or
combine them in a minimal hierarchy (pre-coordination), or
hide them from the user!
Post-coordination takes software support, which may be fancy or basic.
How many facets? log10(#documents) as a guide
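The log10 guide is easy to tabulate. In this sketch the rounding and the floor of one facet are assumptions added for the example; the slide gives only the logarithm as a rule of thumb.

```python
import math


def suggested_facets(document_count):
    """Rule-of-thumb number of facets: log10 of the collection size."""
    # Rounding to the nearest integer and the minimum of 1 are assumptions.
    return max(1, round(math.log10(document_count)))


for n in (1_000, 50_000, 1_000_000):
    print(n, suggested_facets(n))
```

Under these assumptions a thousand documents call for about three facets and a million for about six.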

42 Recipe metadata example
Columns: Element, Data Type, Length, Req. / Repeat, Source, Purpose
Asset Metadata
  Unique ID Integer Fixed 1 System supplied Basic accountability
  Recipe Title String Variable Licensed Content Text search & results display
  Recipe summary Content
  Main Ingredients List ? Main Ingredients vocabulary Key index to retrieve & aggregate recipes, & generate shopping list
Subject Metadata
  Meal Types * Meal Types vocab Browse or group recipes & filter search results
  Cuisines
  Courses Courses vocab
  Cooking Method Flag Cooking vocab
Link Metadata
  Recipe Image Pointer
  Product Group Merchandize products
Use Metadata
  Rating Filter, rank, & evaluate recipes
  Release Date Date Publish & feature new recipes
Dublin Core mapping: dc:identifier dc:title dc:description X dcterms:hasPart dc:date dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”

43 Project/exercise
Produce a faceted classification of your documents (at least 3 facets, min 5 foci each)
Encode the faceted classification as an extension of dc:subject
Attribute facets to your docs.
Check extensibility by adding 10 new docs
