1. DL: Lesson 5, Classification Schemas
Luca Dini
firstname.lastname@example.org

2. Overview
- The Dublin Core defines a number of metadata elements, but what about the values for those elements?
- Should they be unrestricted text values or come from pre-defined vocabularies? "It depends."
- We will discuss how to determine the appropriate approach for an organization's situation.
- We will also cover how pre-defined vocabularies should be sourced, structured, and maintained.

3. Vocabulary development and maintenance
Vocabulary development and maintenance is the LEAST of three problems:
- The Vocabulary Problem: How are we going to build and maintain the lists of pre-defined values that can go into some of the metadata elements?
- The Tagging Problem: How are we going to populate metadata elements with complete and consistent values?
  - What can we expect to get from automatic classifiers? What kind of error detection and error correction procedures do we need?
- The ROI Problem: How are we going to use content, metadata, and vocabularies in applications to obtain business benefits?
  - More sales? Lower support costs? Greater productivity?
  - How much content? How big an operating budget?
We need to know the answer to the ROI Problem before solving the Vocabulary Problem.

4. Definitions
Term: Definition
- Metadata Element: A ‘field’ for storing information about one piece of content. Examples: Title, Creator, Subject, Date, …
- Metadata Value: The ‘contents’ of one Metadata Element. Values may be text strings, or selections from a predefined vocabulary.
- Metadata Schema: A defined set of metadata elements. The Dublin Core is one schema.
- Free Text Value: An unconstrained text metadata value. Some text values are constrained to follow a format (e.g. YYYY-MM-DD).
- Vocabulary: A list of predefined values for a metadata element.
- Controlled Vocabulary: A vocabulary with a defined and enforced procedure for its update.

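The distinction between format-constrained text values and vocabulary values can be sketched as a small validator. This is only an illustration: the element names, the `TYPE_VOCABULARY` list, and the `validate` function are assumptions made for the example, not part of the Dublin Core.

```python
import re
from datetime import date

# Hypothetical predefined vocabulary for a "type" element (illustrative values).
TYPE_VOCABULARY = {"recipe", "article", "image"}


def is_valid_date(value):
    """Free text constrained to the YYYY-MM-DD format and a real calendar date."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate(record):
    """Return the names of elements whose values break their constraint."""
    problems = []
    # Title: unconstrained free text, only required to be non-empty here.
    if not record.get("title", "").strip():
        problems.append("title")
    # Date: free text constrained to a format.
    if not is_valid_date(record.get("date", "")):
        problems.append("date")
    # Type: value must come from the predefined vocabulary.
    if record.get("type") not in TYPE_VOCABULARY:
        problems.append("type")
    return problems


print(validate({"title": "Soup", "date": "2024-02-30", "type": "recipe"}))
```

Keeping the vocabulary in one place (here a set) is what makes it possible to later put an update procedure around it.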
5. Controlled vocabularies
Hierarchical classification of things into a tree structure

Linnaeus
- Kingdom: Animalia
- Phylum: Chordata
- Class: Mammalia
- Order: Carnivora
- Family: Canidae
- Genus: Canis
- Species: C. familiaris
…

UNSPSC
- Segment: 44 - Office Equipment and Accessories and Supplies
- Family: .12 - Office Supplies
- Class: .17 - Writing Instruments
- Commodity: .05 - Mechanical pencils; .06 - Wooden pencils; .07 - Colored pencils
…

6. Classification Schemes: Types of vocabularies
Vocabulary Type | Cplxty. | Description | Relation Type
Term List | 1 | Simple list of terms with no internal structure or relations. | None
Synonym Rings | 2 | List of sets of terms to regard as equivalent. Widely supported in search software. | Equivalence
Authority Files | 3 | List of names for known entities: people, organizations, books, etc. | Reference
Classification Schemes | 4 | Hierarchical arrangement of concepts. | Loose Hierarchy
Thesauri | 5 | Hierarchical arrangement of concepts plus supporting information and additional, non-hierarchical, relations. | “Is-a” Hierarchy plus Loose Relations
Ontologies | 6 | Arrangement of concepts and relations based on a model of underlying reality, e.g. organs, symptoms, diseases & treatments in medicine. | Model-based Typed Relations

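As a rough illustration of how search software might use a synonym ring, the sketch below expands a query term to every term in its ring. The ring contents and the `expand` function are assumptions chosen for the example.

```python
# Each ring is a set of terms treated as equivalent; no term is "preferred".
SYNONYM_RINGS = [
    {"digital libraries", "electronic libraries", "virtual libraries"},
    {"car", "automobile"},  # hypothetical ring
]


def expand(term):
    """Return all terms equivalent to `term`, including the term itself."""
    normalized = term.lower()
    for ring in SYNONYM_RINGS:
        if normalized in ring:
            return set(ring)
    # Terms outside every ring are searched as-is.
    return {normalized}


print(sorted(expand("Car")))
```

Unlike a thesaurus, the ring records only equivalence, which is why it sits low on the complexity scale.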
7. Vocabulary Control
The degree of control over a vocabulary is (mostly) independent of its type.
- Uncontrolled: Anybody can add anything at any time and no effort is made to keep things consistent. Multiple lists and variations will abound.
- Managed: Software makes sure there is a list that is consistent (no duplicates, no orphan nodes) at any one time. Almost anybody can add anything, subject to consistency rules. (e.g. File System Hierarchy)
- Controlled: A documented process is followed for the update of the vocabulary. Few people have authority to change the list. Software may help, but emphasis is on human processes and custodianship. (e.g. Employee list)
Term lists, synonym lists, … can be controlled, managed, or uncontrolled.
Ontologies are managed.

8. Type of controls
- Controlled vocabularies are frequently mentioned. That does not mean they are always necessary.
- Control comes at a cost, but can provide significant data quality benefits by reducing variations.
- Is this a well-controlled vocabulary?
  - No! It is an uncontrolled, but well-managed, term list.
- Is this part of an appropriate solution to the ROI problem?
  - Yes! There is no budget to do ongoing control and QA.
Source:

10. Mandatory
DC recommends specific best practices:
- Language: RFC 3066 (which works with ISO 639)
- Format: Internet Media Types (aka MIME)
- These vocabularies are widely used throughout the Internet. If you want to do something else, it should be justified.
- Describing physical objects? Use Extent and Medium refinements instead of Format.
- Regional (vs. national) dialects?
  a) Why?
  b) Consider a custom element in addition to standard Language.

11. Likely
DC recommends specific best practices:
- Coverage: ISO 3166
  - ISO 3166 should be used unless you have good reasons to use something else.
  - Consider the Getty Thesaurus of Geographic Names if you need cities, rivers, etc. (http://www.getty.edu/research/conducting_research/vocabularies/tgn/)
  - DC provides Encodings for both.
- Type: DCMITypes (http://dublincore.org/documents/dcmi-type-vocabulary/)
  - The DCMIType list is not necessarily a best practice.
  - No widely accepted type list exists, so a custom list is likely.

12. May be
- Creator, Contributor could come from an “authority file”
  - LC NAF in library contexts
  - LDAP Directory in corporate contexts
    - Recommended where possible
    - Many exceptions where the author is outside LDAP
- Publisher could come from an authority file
  - Org chart in corporate contexts, e.g. internal records management system.
- Identifier should be a URI
  - The organization may manage these, but it's typically a text field, not a controlled list.

13. Subject and extensions
- Best practice: Use pre-defined subject schemes, not user-selected keywords.
- DC Encodings (DDC, LCC, LCSH, MeSH, UDC) are most useful in library contexts.
  - Not useful for most corporate needs
- Recommended: Factor “Subject” into separate facets.
  - People, Places, Organizations, Events, Objects, Products & Services, Industry sectors, Content types, Audiences, Business Functions, Competencies, …
- Store the different facets in different fields
  - Use DC elements where appropriate (coverage, type, audience, …)
  - Extend with custom elements for other fields (industry, products, …)

14. Thesauri
A thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower and other related terms.

15. Standards
National and International Standards for Thesauri
- ANSI/NISO z: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
- ANSI/NISO Draft Standard Z x: American National Standard Guidelines for Indexes in Information Retrieval
- ISO 2788: Documentation, Guidelines for the establishment and development of monolingual thesauri
- ISO 5964: Documentation, Guidelines for the establishment and development of multilingual thesauri

16. Thesaurus Examples
- The ERIC Thesaurus of Descriptors
- The Medical Subject Headings (MeSH) of the National Library of Medicine
- The Art and Architecture Thesaurus

21. Dewey
- Dewey Decimal Classification System (DDC), first published in 1876 by Melvil Dewey
- Most widely used classification system in the world (used in 135 countries)
- In this country used primarily by public and school libraries
- Maintained by the Library of Congress

22. Dewey
- DDC is divided into ten main classes, each class into ten divisions, each division into ten sections.
- The first digit in each three-digit number represents the main class.
  - “500” = natural sciences and mathematics.
- The second digit in each three-digit number indicates the division.
  - “500” is used for general works on the sciences
  - “510” for mathematics
  - “520” for astronomy
  - “530” for physics

23. Dewey
- The third digit in each three-digit number indicates the section.
  - “530” is used for general works on physics
  - “531” for classical mechanics
  - “532” for fluid mechanics
  - “533” for gas mechanics
- A decimal point follows the third digit in a class number, after which division by ten continues to the specific degree of classification needed.

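The digit-by-digit reading of a Dewey class number can be sketched as follows. This is a minimal illustration that assumes a class number written as three digits with an optional decimal extension; the function name `dewey_levels` is hypothetical.

```python
import re


def dewey_levels(class_number):
    """Split a DDC class number into main class, division, and section."""
    match = re.fullmatch(r"([0-9]{3})(?:\.([0-9]+))?", class_number)
    if match is None:
        raise ValueError(f"not a Dewey class number: {class_number!r}")
    digits, extension = match.groups()
    return {
        "main_class": digits[0] + "00",  # first digit, e.g. 500
        "division": digits[:2] + "0",    # first two digits, e.g. 530
        "section": digits,               # all three digits, e.g. 532
        "extension": extension,          # digits after the decimal point, if any
    }


print(dewey_levels("532.5"))
```

Because every level is a prefix of the next, broadening a search is just a matter of truncating the number.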
24. Library of Congress Subjects
- Essentially an artificial indexing language
- Based on literary warrant
- Entry vocabulary provided in the form of a reference structure
- Moving slowly towards a real thesaurus structure (not there yet)
- Not faceted: subdivisions are pre-selected, based on individual heading or “pattern” heading

25. LCSH
Digital libraries
  see from “Electronic libraries”
  see from “Virtual libraries”
  see broader term: “Libraries”
  see also “Information storage and retrieval systems”

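The reference structure of this heading can be held in a small data structure and used as an entry vocabulary. The sketch below is illustrative only: the dictionary layout and the function names are assumptions, not an LCSH data format.

```python
# One heading with its references, taken from the example above.
HEADINGS = {
    "Digital libraries": {
        "see_from": ["Electronic libraries", "Virtual libraries"],
        "broader": ["Libraries"],
        "see_also": ["Information storage and retrieval systems"],
    },
}


def build_entry_index(headings):
    """Map every entry term (lowercased) to its authorized heading."""
    index = {}
    for heading, refs in headings.items():
        index[heading.lower()] = heading
        for variant in refs["see_from"]:
            index[variant.lower()] = heading
    return index


def authorized_heading(term, index):
    """Return the heading to use for `term`, or None if it is not known."""
    return index.get(term.lower())


ENTRY_INDEX = build_entry_index(HEADINGS)
print(authorized_heading("virtual libraries", ENTRY_INDEX))
```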
26. Library of Congress Classification
- 21 basic classes, based on a single alphabetic character (K=law, N=art, etc.)
- Subdivided into two or three alpha characters (KF=American Law, ND=painting, etc.)
- Further subdivision by specific numeric assignment
- Author numbers and dates arrange works by a particular author together and in chronological order

27. LCC
153 ## $aQL638.E55 $hZoology $hChordates. Vertebrates $hFishes $hSystematic divisions $hOsteichthyes (Bony fishes). By family, A-Z $hFamilies $jEngraulidae (Anchovies)
- $a = Classification number: single number or beginning number of span (R)
- $h = Caption hierarchy
- $j = Caption (lowest level, relating to the specific number in $a)

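To show how the subfield codes carry the hierarchy, here is a minimal sketch that splits the field content on the `$` delimiter. It assumes the textual `$x` display convention used on the slide and that no value contains a literal `$`; it is not a parser for binary MARC records.

```python
FIELD_153 = (
    "$aQL638.E55$hZoology$hChordates. Vertebrates$hFishes"
    "$hSystematic divisions$hOsteichthyes (Bony fishes). By family, A-Z"
    "$hFamilies$jEngraulidae (Anchovies)"
)


def parse_subfields(field):
    """Return (code, value) pairs in the order they appear in the field."""
    pairs = []
    for chunk in field.split("$"):
        if chunk:  # skip the empty piece before the first delimiter
            pairs.append((chunk[0], chunk[1:].strip()))
    return pairs


def summarize_153(field):
    """Collect the classification number, caption hierarchy, and caption."""
    pairs = parse_subfields(field)
    return {
        "number": next(value for code, value in pairs if code == "a"),
        "hierarchy": [value for code, value in pairs if code == "h"],
        "caption": next(value for code, value in pairs if code == "j"),
    }


SUMMARY = summarize_153(FIELD_153)
print(" > ".join(SUMMARY["hierarchy"] + [SUMMARY["caption"]]))
```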
28. DMOZ: A worst case example of a unified ‘subject’
- DMOZ has over 600k categories
- Most are a combination of common facets: Geography, Organization, Person, Document Type, …
- (e.g.) Top: Regional: Europe: Spain: Travel and Tourism: Travel Guides

29. History of Faceted Navigation
- Relatively new (taxonomies go back to Aristotle)
- S. R. Ranganathan, 1960’s
  - Issue of compound subjects
  - The universe consists of PMEST: Personality, Matter, Energy, Space, Time
- Classification Research Group, 1950’s, 1970’s
  - Based on Ranganathan, simplified, less doctrinaire
  - Principles:
    - Division: a facet must represent only one characteristic
    - Mutual exclusivity
- Classification theory to web implementation
  - An idea waiting for a technology
  - Multiple filters / dimensions

30. What are Facets?
- Facets are not categories
  - Entities or concepts belong to a category
  - Entities have facets
- Facets are metadata: properties or attributes
  - Entities or concepts fit into one category
  - All entities have all facets, defined by a set of values
- Facets are orthogonal, mutually exclusive dimensions
  - An event is not a person is not a document is not a place.
  - A winery is not a region is not a price is not a color.
- Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations
- Combined using grammars of order and relation to form compound descriptions

31. Faceted Classification
- Clearly distinguishes between semantic relationships and syntactic relationships
- Semantic relationships
  - Within a facet
  - Containment relations
- Syntactic relationships
  - Across facets
  - Combinatoric relations
- Has a “syntax” for the syntactic combination of semantic terms

32. Semantic and Syntactic Relationships
- Semantic relationships
  - Is-A (thing/kind, genus/species): Mammals > Primates > Humans
  - Has-Parts: Human > Head > Eyes
- Syntactic relationships
  - Compounds: Wheat + harvesting = “wheat harvesting”
  - Object + operation = operation on object

33. What is Faceted Navigation?
- Not a Yahoo-style browse
  - Computer Stores under Computers and Internet
- One value per facet per entity
- Faceted navigation is not hierarchical
  - Tree: travel up and down, not across
  - Facets are filters, multidimensional
- Facets are applied at search results time: post-coordination, not pre-coordination [Advanced Search]
- Faceted navigation is an active interface: a dynamic combination of search and browse

34. When to Use Faceted Navigation: Advantages
- Systematic advantages:
  - Need fewer elements
    - 4 facets of 10 nodes = 10,000 node taxonomy
  - Ability to handle compound subjects
- Content management advantages:
  - Easier to “categorize”, not as conceptual
  - Fewer = simple, can use auto-classification better
  - Flexible: can add new facets, or new elements in a facet

35. When to Use Faceted Navigation: Advantages (Implementation)
- More intuitive: easy to guess what is behind each door
  - Simplicity of internal organization
  - 20 questions: a game we know and use
- Dynamic selection of categories
  - Allow multiple perspectives
- Trick users into “using” Advanced Search
  - wine where color = red, price = x-y, etc.
  - Click on color red, click on price x-y, etc.
- Flexible: can be combined with other navigation elements

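The "click on color red, click on price x-y" interaction amounts to applying one filter per selected facet to the result set. The sketch below illustrates this under stated assumptions: the catalogue entries, the two price bands, and all function names are made up for the example.

```python
from collections import Counter

# Hypothetical catalogue; names and prices are invented for illustration.
WINES = [
    {"name": "Wine A", "color": "red", "region": "California", "price": 18},
    {"name": "Wine B", "color": "red", "region": "Australia", "price": 40},
    {"name": "Wine C", "color": "white", "region": "California", "price": 22},
]


def facet_value(item, facet):
    """Return the item's value for a facet; price is mapped onto a band."""
    if facet == "price":
        return "$25 and below" if item["price"] <= 25 else "above $25"
    return item[facet]


def apply_facets(items, selections):
    """Keep the items that match every selected facet value."""
    return [
        item for item in items
        if all(facet_value(item, facet) == value for facet, value in selections.items())
    ]


def facet_counts(items, facet):
    """Count the remaining items per value, as shown next to each clickable value."""
    return Counter(facet_value(item, facet) for item in items)


reds = apply_facets(WINES, {"color": "red"})
cheap_reds = apply_facets(reds, {"price": "$25 and below"})
print([wine["name"] for wine in cheap_reds])
print(dict(facet_counts(reds, "region")))
```

Each click narrows the current result set, so the user builds an advanced-search query without ever seeing a query form.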
36. When to Use Faceted Navigation: Disadvantages
- Systematic disadvantages:
  - Lack of standards for faceted classifications
  - Every project is a unique customization
- Implementation disadvantages:
  - Loss of browse context
    - Difficult to grasp scope and relationships
  - No immediate support for popular subjects
- Essential limit of faceted navigation
  - Limited domain applicability: type and size
  - Entities, not concepts, documents, web sites

37. Developing Facet Structure: Selection of Facets (Theory)
- Issue: complete model of a domain
- Ranganathan: PMEST
  - Personality: person, animal, event
  - Matter: what x is made of
  - Energy: how x changes
  - Space: where x is
  - Time: when x happens
- Three planes: Idea, Verbal, Notational

38. Facets: an example (Literature)
- A Language: a English, b French, c Spanish
- B Genre: a Prose, b Poetry, c Drama
- C Period: a 16th Century, b 17th Century, c 18th Century, d 19th Century

Compound notations:
- Aa: English Literature
- AaBa: English Prose
- AaBaCa: English Prose 16th Century
- AbBbCd: French Poetry 19th Century
- BcCd: Drama 19th Century

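The notation in this example pairs an uppercase facet letter with a lowercase focus letter, so a compound class mark can be decoded mechanically. The sketch below assumes exactly that two-character convention; the `decode` function and the table layout are illustrative choices.

```python
# Facet letter -> (facet name, {focus letter -> focus}).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {
        "a": "16th Century", "b": "17th Century",
        "c": "18th Century", "d": "19th Century",
    }),
}


def decode(notation):
    """Turn a class mark such as 'AbBbCd' into a facet -> focus mapping."""
    if len(notation) % 2 != 0:
        raise ValueError(f"incomplete facet/focus pair in {notation!r}")
    decoded = {}
    for position in range(0, len(notation), 2):
        facet_letter, focus_letter = notation[position], notation[position + 1]
        facet_name, foci = FACETS[facet_letter]  # KeyError for unknown letters
        decoded[facet_name] = foci[focus_letter]
    return decoded


print(decode("AbBbCd"))
```

A facet may be omitted from a class mark, which is how a notation for drama of a period in any language is formed.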
39. Developing Facet Structure: Selection of Facets (Practice)
Wine.com
- Region: Australia, California
- Type: Red Wine, White, Bubbly
- Winery: Alphabetical listing
- Price: $25 and below, $25-$50
- Top Rated Wines: 90+ under $20
- Top Sellers: Cabernet Sauvignon, Pinot Noir
- Hot Features: Wine outlet, Sideways collection

40. Faceted Approach Power
- 4 independent categories of 10 nodes = 10,000 nodes (10^4)
- Faster construction
  - Use existing taxonomies in specific fields
- Reduced maintenance cost
  - More opportunity for data reuse
- Can be easier to navigate with appropriate UI
- 60 nodes, 24,000 combinations

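The arithmetic behind the first bullet can be checked directly: the number of nodes to maintain is the sum of the facet sizes, while the number of expressible combinations is their product. The helper name `facet_stats` is an assumption for this sketch.

```python
import math


def facet_stats(facet_sizes):
    """Return (nodes to maintain, combinations that can be expressed)."""
    nodes = sum(facet_sizes)
    combinations = math.prod(facet_sizes)
    return nodes, combinations


# Four independent facets of ten values each.
print(facet_stats([10, 10, 10, 10]))
```

The gap between the sum and the product is the source of the construction and maintenance savings the slide claims.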
41. Organization
- Either expose the facets directly in the user interface (post-coordination), or
- Combine them in a minimal hierarchy (pre-coordination), or
- Hide them from the user!
- Post-coordination takes software support, which may be fancy or basic.
- How many facets?
  - log10(number of documents) as a guide

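The rule of thumb for the number of facets is easy to evaluate. In this sketch, rounding to the nearest whole number and the lower bound of one facet are assumptions added for the example; the slide only gives the logarithm as a guide.

```python
import math


def suggested_facet_count(document_count):
    """Rule-of-thumb number of facets: log10 of the collection size."""
    if document_count < 1:
        raise ValueError("document_count must be at least 1")
    # Rounding and the minimum of 1 are assumptions, not part of the guideline.
    return max(1, round(math.log10(document_count)))


for size in (1_000, 10_000, 1_000_000):
    print(size, suggested_facet_count(size))
```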
42. Recipe metadata example
Columns: Element; Data Type; Length; Req. / Repeat; Source; Purpose (cells that were empty on the slide are omitted)

Asset Metadata
- Unique ID; Integer; Fixed; 1; System supplied; Basic accountability
- Recipe Title; String; Variable; Licensed Content; Text search & results display
- Recipe summary; Content
- Main Ingredients; List; ?; Main Ingredients vocabulary; Key index to retrieve & aggregate recipes, & generate shopping list

Subject Metadata
- Meal Types; *; Meal Types vocab; Browse or group recipes & filter search results
- Cuisines
- Courses; Courses vocab
- Cooking Method; Flag; Cooking vocab

Link Metadata
- Recipe Image; Pointer
- Product Group; Merchandize products

Use Metadata
- Rating; Filter, rank, & evaluate recipes
- Release Date; Date; Publish & feature new recipes

Dublin Core mapping (in slide order): dc:identifier, dc:title, dc:description, X, dcterms:hasPart, dc:date
dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”

43. Project/exercise
- Produce a faceted classification of your documents (at least 3 facets, min 5 foci each).
- Encode the faceted classification as an extension of dc:subject.
- Attribute facets to your docs.
- Check extensibility by adding 10 new docs.