1 Workshop: Why and How to Use Dublin Core for Enterprise-Wide Metadata Applications
Ron Daniel & Joseph Busch, Taxonomy Strategies
2 Workshop goals
What is the Dublin Core?
Answer these enterprise-wide metadata ROI questions:
  What is the value proposition for adding metadata to content?
  Does metadata make content reusable? Findable? Improve productivity?
  How can metadata value be measured in a way that quantifies how it contributes to the bottom line?
Answer these business process questions:
  How is Dublin Core tagging being done on content to expose metadata to portals, search engines, and other metadata-aware applications?
  How are metadata value spaces (controlled vocabularies) maintained within an enterprise? Across enterprises?
Answer these technology questions:
  What tools exist to use Dublin Core and other metadata standards in enterprise information management environments?
3 Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn
4 Who we are: Joseph Busch
Over 25 years in the business of organized information
  Founder, Taxonomy Strategies
  Director, Solutions Architecture, Interwoven
  VP, Infoware, Metacode Technologies (acquired by Interwoven, November 2000)
  Program Manager, Getty Foundation
  Manager, Pricewaterhouse
Metadata and taxonomies community leadership
  President, American Society for Information Science & Technology
  Director, Dublin Core Metadata Initiative
  Adviser, National Research Council Computer Science and Telecommunications Board
  Reviewer, National Science Foundation Division of Information and Intelligent Systems
  Founder, Networked Knowledge Organization Systems/Services
5 Who we are: Ron Daniel, Jr.
Over 15 years in the business of metadata & automatic classification
  Principal, Taxonomy Strategies
  Standards Architect, Interwoven
  Senior Information Scientist, Metacode Technologies (acquired by Interwoven, November 2000)
  Technical Staff Member, Los Alamos National Laboratory
Metadata and taxonomies community leadership
  Chair, PRISM (Publishers Requirements for Industry Standard Metadata) working group
  Acting chair: XML Linking working group
  Member: RDF working groups
  Co-editor: PRISM, XPointer, 3 IETF RFCs, and Dublin Core 1 & 2 reports.
6 Recent & current projects
Commercial: Allstate Insurance, Blue Shield of California, Debevoise & Plimpton, Halliburton, Hewlett Packard, Motorola, PeopleSoft, Pricewaterhouse Coopers, Siderean Software, Sprint, Time Inc.
Commercial subcontracts: Agency.com (top financial services); Critical Mass (Fortune 50 retailer); Deloitte Consulting (big credit card); Gistics/OTB (direct selling giant)
NGOs: CEN, IDEAlliance, IMF, OCLC
Government: Commodity Futures Trading Commission, Defense Intelligence Agency, ERIC, Federal Aviation Administration, Federal Reserve Bank of Atlanta, Forest Service, GSA Office of Citizen Services (Head Start, Infocomm Development Authority of Singapore, NASA (nasataxonomy.jpl.nasa.gov), Small Business Administration, Social Security Administration, USDA Economic Research Service, USDA e-Government Program (Please see for brief descriptions of client projects.
7 What we do
Organize stuff.
Figure out how to organize stuff.
8 Who are you? Tell us: Your name Your organization Your job title The things you want to get from this workshop
10 Metadata: Different definitions
Library & Information Science
  Author/Title/Subject
  Controlled Vocabularies for Subject Codes (e.g. Dewey)
  Authority Files for Author Names
Database
  Tables/Columns/Datatypes/Relationships
  References for some values
11 Metadata: Why it matters
“Adding metadata to unstructured content allows it to be managed like structured content. Applications that use structured content work better.”
“Enriching content with structured metadata is critical for supporting search and personalized content delivery.”
“Content that has been adequately tagged with metadata can be leveraged in usage tracking, personalization and improved searching.”
“Better structure equals better access: Taxonomy serves as a framework for organizing the ever-growing and changing information within a company. The many dimensions of taxonomy can greatly facilitate Web site design, content management, and search engineering. If well done, taxonomy will allow for structured Web content, leading to improved information access.”
The WHY part
Sources:
  S. Phillips, E. Maguire, C. Shilakes. Content management: The new data infrastructure. Convergence and divergence out of chaos. Merrill Lynch, June 2001.
  P.R. Hagen. Must search stink? Forrester Research, June 2000.
  K. Hall. Content tagging strategies. Giga Information Group, February 2001.
12 Metadata: Supports core functions
Asset metadata (Who): Creator, Publisher, Contributor, Type, Format, Identifier
Subject metadata (What, Where & Why): Subject, Title, Description, Coverage
Relational metadata (Links between and to): Source, Relation
Use metadata (When & How): Date, Language, Rights
Diagram labels: Enabled Functionality; Complexity; Better navigation & discovery; More efficient editorial process
Metadata contains critical information about each content item: the who, what, when, where, and why for each content asset. This information is provided to meet certain needs. In general, those needs boil down to “better search” for existing material, and “better processes” for creating new material.
13 What is a taxonomy? Systematics view
Hierarchical classification of things into a tree structure
Linnaeus:
  Kingdom: Animalia
  Phylum: Chordata
  Class: Mammalia
  Order: Carnivora
  Family: Canidae
  Genus: Canis
  Species: C. familiaris
  …
UNSPSC:
  Segment: 44 - Office Equipment and Accessories and Supplies
  Family: .12 - Office Supplies
  Class: .17 - Writing Instruments
  Commodity: .05 - Mechanical pencils; .06 - Wooden pencils; .07 - Colored pencils
  …
15 Dublin Core: A little more complicated
Elements: Identifier, Title, Creator, Contributor, Publisher, Subject, Description, Coverage, Format, Type, Date, Relation, Source, Rights, Language
Refinements: Abstract, Access rights, Alternative, Audience, Available, Bibliographic citation, Conforms to, Created, Date accepted, Date copyrighted, Date submitted, Education level, Extent, Has format, Has part, Has version, Is format of, Is part of, Is referenced by, Is replaced by, Is required by, Issued, Is version of, License, Mediator, Medium, Modified, Provenance, References, Replaces, Requires, Rights holder, Spatial, Table of contents, Temporal, Valid
Encodings: Box, DCMIType, DDC, IMT, ISO3166, ISO639-2, LCC, LCSH, MESH, Period, Point, RFC1766, RFC3066, TGN, UDC, URI, W3CDTF
Types: Collection, Dataset, Event, Image, Interactive Resource, Moving Image, Physical Object, Service, Software, Sound, Still Image, Text
16 Dublin Core framework for corporate use
Not just 15 elements
A framework to enable cross-resource exploration and use
Dublin Core is a framework for “integration metadata” at BellSouth
Source: Todd Stephens, BellSouth
17 Metadata: A data specification – a recipe example
Columns: Element; Data Type; Length; Req. / Repeat; Source; Purpose
Asset Metadata
  Unique ID: Integer; Fixed; 1; System supplied; Basic accountability
  Recipe Title: String; Variable; Licensed Content; Text search & results display
  Recipe summary: Content
  Main Ingredients: List; ?; Main Ingredients vocabulary; Key index to retrieve & aggregate recipes, & generate shopping list
Subject Metadata
  Meal Types: *; Meal Types vocab; Browse or group recipes & filter search results
  Cuisines
  Courses: Courses vocab
  Cooking Method: Flag; Cooking vocab
Link Metadata
  Recipe Image: Pointer
  Product Group: Merchandize products
Use Metadata
  Rating: Filter, rank, & evaluate recipes
  Release Date: Date; Publish & feature new recipes
Dublin Core mapping column: dc:identifier, dc:title, dc:description, X, dcterms:hasPart, dc:date
Constant values: dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”
Legend: ? = 1 or more; * = 0 or more
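The Req. / Repeat column of a specification like this can be checked mechanically. The following is only a sketch under stated assumptions: the field names (`unique_id`, `main_ingredients`, `meal_types`) and the `FieldSpec` class are hypothetical, and only the three cardinality markers from the slide legend ("1", "?", "*") are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    repeat: str  # "1" = exactly one, "?" = one or more, "*" = zero or more (slide legend)


# Hypothetical subset of the recipe specification; field names are illustrative.
SPEC = [
    FieldSpec("unique_id", "1"),
    FieldSpec("main_ingredients", "?"),
    FieldSpec("meal_types", "*"),
]


def cardinality_errors(record, spec=SPEC):
    """Return one message per field whose value count violates its repeat rule."""
    errors = []
    for field in spec:
        count = len(record.get(field.name, []))
        if field.repeat == "1" and count != 1:
            errors.append(f"{field.name}: expected exactly 1 value, got {count}")
        elif field.repeat == "?" and count < 1:
            errors.append(f"{field.name}: expected at least 1 value, got 0")
    return errors


good = {"unique_id": [1001], "main_ingredients": ["eggs", "flour"]}
bad = {"unique_id": []}
```

Here `cardinality_errors(good)` returns an empty list, while `bad` is reported for both the missing ID and the missing ingredients.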
18 Why Dublin Core?
Dublin Core is a de facto standard across many other systems and standards
  RSS (1.0), OAI
  Inside organizations: portals, CMS, …
Mapping to DC elements from most existing schemes is simple
  Beware of force-fits
Why will metadata already exist?
  Because of search projects, portal integration projects, etc. that are creating it or standardizing a mapping.
Diagram labels: Taxonomies, Vocabularies, Ontologies; Dublin Core and Similar; Per-Source Data Types, Access Controls, etc.
Source: Todd Stephens, BellSouth
19 Creator
Refinements: None. Encodings: None.
“An entity primarily responsible for making the content of the resource”
In other words: Author, Photographer, Illustrator, …
Potential refinements by creative role: rarely justified.
Creators can be persons or organizations.
Key point (reminder): name variations are a big issue in data quality:
  Ron Daniel
  Ron Daniel, Jr.
  Ron Daniel Jr.
  R.E. Daniel
  Ronald Daniel
  Ronald Ellison Daniel, Jr.
  Daniel, R.
Name fields may contain other information:
  <dc:creator>Case, W. R. (NASA Goddard Space Flight Center, Greenbelt, MD, United States)</dc:creator>
Best practice: validate names against LDAP or other “Authority File”.
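A minimal sketch of validating creator names against an authority file follows. The in-memory `AUTHORITY` dictionary stands in for an LDAP directory or a real authority file, and the preferred form "Daniel, Ron, Jr." is an assumption made for illustration; only the variant spellings come from the slide.

```python
import re

# Hypothetical authority file: preferred form -> known variants (variants from the slide).
AUTHORITY = {
    "Daniel, Ron, Jr.": [
        "Ron Daniel", "Ron Daniel, Jr.", "Ron Daniel Jr.", "R.E. Daniel",
        "Ronald Daniel", "Ronald Ellison Daniel, Jr.", "Daniel, R.",
    ],
}


def _key(name):
    """Reduce a name to lowercase ASCII letters so punctuation and spacing differences vanish."""
    return re.sub(r"[^a-z]", "", name.lower())


# Index every known variant by its normalized key.
LOOKUP = {
    _key(variant): preferred
    for preferred, variants in AUTHORITY.items()
    for variant in variants
}


def resolve_creator(raw):
    """Return the preferred form for a known variant, or None so a person can review it."""
    return LOOKUP.get(_key(raw))
```

A lookup like this only recognizes variants someone has already recorded. It cannot tell two different people who share a name apart, so unmatched or doubtful names still need a manual error-correction step.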
20 Example – Name mismatches
One of these things is not like the other:
  Ron Daniel, Jr. and Carl Lagoze; “Distributed Active Relationships in the Warwick Framework”
  Hojung Cha and Ron Daniel; “Simulated Behavior of Large Scale SCI Rings and Tori”
  Ron Daniel; “High Performance Haptic and Teleoperative Interfaces”
Differences may not matter
If they do:
  This error cannot be reliably detected automatically
  Authority files and an error-correction procedure are needed
21 Contributor
Refinements: None. Encodings: None.
“An entity responsible for making contributions to the content of the resource.”
In practice: rarely used.
Difficult to distinguish from Creator.
Adds UI complexity for no real gain.
Best practice? Recommendation: don’t use.
22 Publisher
Refinements: None. Encodings: None.
“An entity responsible for making the resource available”.
Problems:
  All the name-handling stuff of Creator.
  Hierarchy of publishers (Bureau, Agency, Department, …)
23 Title
Refinements: Alternative. Encodings: None.
“A name given to the resource”.
Issues:
  Hierarchical titles, e.g. Conceptual Structures: Information Processing in Mind and Machine (The Systems Programming Series)
  Untitled works
  Metaphysics
24 Identifier
Refinements: Bibliographic Citation. Encodings: URI.
“An unambiguous reference to the resource within a given context”
Best practice: URL
Future best practice: URI?
Problems:
  Metaphysics
  Personalized URLs
  Multiple identifiers for same content
  Non-standard resolution mechanisms for URIs
Recommendation: plan how to introduce long-lived URLs
25 Date
Refinements: Created, Valid, Available, Issued, Modified, Date Accepted, Date Copyrighted, Date Submitted
Encodings: DCMI Period, W3C DTF (Profile of ISO 8601)
“A date associated with an event in the life cycle of the resource”
Woefully underspecified.
Typically the publication or last modification date.
Best practice: YYYY-MM-DD
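The YYYY-MM-DD recommendation is easy to enforce at tagging time. A minimal sketch using only the Python standard library; the function name `is_w3cdtf_day` is illustrative, and it covers only the complete-date form, not the shorter or longer W3C DTF variants.

```python
from datetime import date


def is_w3cdtf_day(value):
    """True only for a real calendar date written exactly as YYYY-MM-DD."""
    try:
        # Round-tripping rejects other ISO 8601 spellings that newer Pythons also parse.
        return date.fromisoformat(value).isoformat() == value
    except ValueError:
        return False
```

Values such as "03/15/2004" or the impossible "2004-02-30" are rejected, which catches the most common free-text date entries before they reach the repository.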
26 Subject
Refinements: None. Encodings: DDC, LCC, LCSH, MESH, UDC.
“The topic of the content of the resource.”
Best practice: use pre-defined subject schemes, not user-selected keywords.
Supported encodings are probably not useful for most corporate needs.
Factor “Subject” into separate facets:
  People, places, organizations, events, objects, services
  Industry sectors
  Content types, audiences, functions
  Topic
Some of the facets are already defined in DC (Coverage, Type) or DCTERMS (Audience).
27 Coverage
Refinements: Spatial, Temporal
Encodings: Box (for Spatial), ISO3166 (for Spatial), Point (for Spatial), TGN (for Spatial), W3CDTF (for Temporal)
“The extent or scope of the content of the resource”.
In other words: places and times as topics.
Key point: locations are important in SOME environments, irrelevant in others. Time periods as subjects are rarely important in commercial work.
Best practice: ISO ,
28 Description
Refinements: Abstract, Table of Contents. Encodings: None.
“An account of the content of the resource”.
In other words: an abstract or summary.
Key point: what’s the cost/benefit tradeoff for creating descriptions?
  Quality of auto-generated descriptions is low
  For search results, hit highlighting is probably better
29 Type
Refinements: None. Encodings: DCMI Type.
“The nature or genre of the content of the resource”
Best current practice: create a custom list of content types, and use that list for the values.
Try to avoid “image”, “audio”, and other format names in the list of content types; they can be derived from “Format”.
No broadly acceptable list has been found yet.
30 Format
Refinements: Extent, Medium. Encodings: IMT.
“The physical or digital manifestation of the resource.”
In other words: the file format
Best practice: Internet Media Types
Outliers: file sizes, dimensions of physical objects
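Because Internet Media Types can usually be derived from the file itself, Format is a good candidate for automatic population. A minimal sketch using Python's standard `mimetypes` module; the helper name `dc_format` is illustrative, and guessing from the file extension alone is an assumption that will not hold for mislabeled files.

```python
import mimetypes


def dc_format(filename):
    """Guess an Internet Media Type for dc:format from the file extension (None if unknown)."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type
```

For example, `dc_format("recipe.html")` yields `text/html`, matching the constant used in the recipe specification slide.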
31 Language
Refinements: None. Encodings: ISO639-2, RFC1766, RFC3066.
“A language of the intellectual content of the resource”.
Best practice: ISO 639, RFC 3066
Dialect codes: advanced practice
32 Relation
Refinements: Is Version Of, Has Version, Is Replaced By, Replaces, Is Required By, Requires, Is Part Of, Has Part, Is Referenced By, References, Is Format Of, Has Format, Conforms To
Encodings: URI
“A reference to a related resource”
Very weak meaning: not even as strong as “See also”.
Best practice: use a refinement element and URLs.
33 Source
Refinements: None. Encodings: URI.
“A reference to a resource from which the present resource is derived”
Original intent was for derivative works.
Frequently abused to provide bibliographic information for items extracted from a larger work, such as articles from a journal.
34 Rights
Refinements: Access Rights, License. Encodings: None.
“Information about rights held in and over the resource”
Could be a copyright statement, or a list of groups with access rights, or …
36 CEN/ISSS Workshop on Dublin Core. Guidance information for the deployment of Dublin Core metadata in Corporate Environments
37 Dublin Core: CEN/ISSS Workshop on Dublin Core Metadata – corporate uses
Applied Information Technique, AstraZeneca, BBC, BellSouth, Cisco, Daimler Chrysler, Giunti Labs, GSK, Halliburton, HP, IBM, Intel, John Wiley & Sons, Lilly, PeopleSoft, Rohm Haas, SAP, Software AG, Unisys
The CEN/ISSS Workshop on Dublin Core Metadata Guidance information for the deployment of Dublin Core metadata in Corporate Environments (ftp://ftp.cenorm.be/public/ws-mmi-dc/mmidc128.htm) is a draft CWA (CEN Workshop Agreement) under the 2004 Workplan of the CEN/ISSS Workshop on Dublin Core Metadata for Multimedia Information - Dublin Core (MMI-DC) of the European Committee for Standardization (CEN), prepared by Joseph Busch, Kerstin Forsberg, and Makx Dekkers.
38 How is Dublin Core used in corporate environments?
In corporate environments, Dublin Core is used:
  As the de facto descriptive metadata standard,
  because it is a simple & transparent metadata scheme.
Dublin Core is used to:
  Enable integrated access to multiple, heterogeneous information resources, and
  Address compliance requirements.
Base: 20 corporate information managers. CEN/ISSS Workshop on Dublin Core: Guidance information for the deployment of Dublin Core metadata in Corporate Environments
40 How Dublin Core is extended? Base: 20 corporate information managers CEN/ISSS Workshop on Dublin Core– Guidance information for the deployment of Dublin Core metadata in Corporate Environments
41 Custom business process document types? Ouch!
Oil & gas services company document types:
  analysis, appraisals, assessments, forecasts, predictions
  agendas, plans, designs, schedules, workflow
  applications, proposals, requests, requirements
  permits, consents, approvals, rejections, certificates
  work orders, correspondence
  auditing, compliance, testing, inspections, operations reports
  lessons learned, after-action reviews, meeting minutes, FAQs
  policies, procedures, training manuals, standards, best practices
  research notes, journal articles
  newsletters, bulletins, press releases
  ads, brochures, data sheets, technical notes, case studies, price lists
  checklists, templates, forms, logos, branding
  software, database forms
42 The power of taxonomy facets
4 independent categories of 10 nodes each have the same discriminatory power as one hierarchy of 10,000 nodes (10^4)
Easier to maintain
Can be easier to navigate
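The arithmetic behind the facet claim can be checked directly. In this sketch the facet names and values are placeholders invented for illustration; only the counts (4 facets, 10 values each) come from the slide.

```python
from itertools import product

# Four hypothetical facets with ten placeholder values each.
facets = {
    f"facet_{i}": [f"f{i}_value_{j}" for j in range(10)]
    for i in range(1, 5)
}

# Every distinct combination of one value per facet.
combinations = list(product(*facets.values()))

# Total number of terms someone actually has to maintain.
nodes_to_maintain = sum(len(values) for values in facets.values())
```

The facets distinguish `len(combinations)` = 10,000 cases while only `nodes_to_maintain` = 40 terms need upkeep, which is the maintenance advantage the slide points to.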
43 Taxonomic metadata example: Form SS-4. Employer Identification Number (EIN)
Facet: Values
  Agency: IRS
  Content Type: Information Submission
  Industry Impact: Generic
  Jurisdiction: Federal
  Programs & Services: Support Delivery of Services/General Government/Taxation Management
  Keyword Topic: Commerce/Employment taxes
  Audience: Business
Facet: Values
  Agency: IRS
  Content Type: Application [or Information Submission]
  Industry Impact: Generic
  Jurisdiction: Federal
  BRM Impact: Support Delivery of Services/General Government/Taxation Management
  Keyword Topic: Commerce/Employment taxes
  Audience: Business
45 Fundamentals of metadata ROI
Tagging content using metadata and a taxonomy is a cost, not a benefit.
There is no benefit without exposing the tagged content to users in some way that cuts costs or improves revenues.
Putting metadata and a taxonomy into operation requires UI changes and/or backend system changes, as well as data changes.
You need to determine those changes, and their costs, as part of the ROI.
46 Common metadata ROI scenarios
Catalog site: increased sales; increased productivity.
Customer support: cutting costs.
Compliance: avoiding penalties.
Knowledge worker productivity: less time searching, more time working.
Executive mandate: no ROI study, just someone with a vision and a budget.
47 Metadata ROI: Catalog site
Guided navigation
2-3 clicks to product
No dead ends
48 Metadata ROI: Catalog site
Increased sales
  Product findability.
  Product cross-sells and up-sells.
  Customer loyalty.
1-5% increase in sales
  $57.6B sales (’04)
  $2.1B net income (’04)
  Estimated benefit: $600M to $2B/year in sales; $21M to $105M/year in net income
1-5% increase in productivity
  $50K average cost per employee
  310,400 employees (’04)
  Estimated benefit: $155M to $776M/year
Enterprise portal cost: $6M
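The benefit ranges on this slide are simple percentages of the stated base figures. The sketch below recomputes them; the helper `benefit_range` is illustrative, and all inputs are the slide's own numbers. Note that 5% of $57.6B works out to about $2.88B, so the slide's "$2B" upper bound looks rounded down or truncated.

```python
def benefit_range(base, low_pct, high_pct):
    """Return (low, high) benefit for a percentage improvement on a base amount."""
    return (base * low_pct / 100, base * high_pct / 100)


# 1-5% increase on $57.6B sales and on $2.1B net income ('04 figures from the slide).
sales_low, sales_high = benefit_range(57.6e9, 1, 5)
income_low, income_high = benefit_range(2.1e9, 1, 5)

# 1-5% productivity gain on total employee cost: $50K x 310,400 employees.
employee_cost = 50_000 * 310_400
prod_low, prod_high = benefit_range(employee_cost, 1, 5)

portal_cost = 6e6  # one-time enterprise portal cost from the slide
```

Even the low end of each range is far above the $6M portal cost, which is the point of the comparison.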
49 Metadata ROI: Customer support model
Help on search page, not a click away.
Type and go to search for specific policies
Policy categories for browsing
Refine search offered with results
Good search results for policy topics, e.g., “pets”
50 Metadata ROI: Customer support model
Self service
  Fewer customer calls.
  Faster, more accurate CSR responses through better information access.
25-50% service efficiency increase
  300K customer service calls per month
  $6 cost per call
  Estimated benefit: $5.4M to $10.8M/yr
Manual processing
  100,000 documents
  2 pages per document
  $4 per page
  Cost: $800K
1-5% increased sales
  $18.6B sales (’04)
  ($761M) net income (’04)
  Estimated benefit: $186M to $930M/year in sales; net income of ($575M) to $169M/year
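The customer support figures can be reproduced step by step. This is a sketch of the slide's arithmetic only; the variable names are illustrative, and adding the full sales increase to net income follows the slide's own simplification.

```python
# Call-center savings: 300K calls/month at $6 per call, 25-50% efficiency gain.
annual_call_cost = 300_000 * 12 * 6                # $21.6M per year
savings_low = annual_call_cost * 25 // 100         # $5.4M
savings_high = annual_call_cost * 50 // 100        # $10.8M

# One-time manual tagging cost: 100,000 documents x 2 pages x $4 per page.
manual_tagging_cost = 100_000 * 2 * 4              # $800K

# 1-5% sales increase on $18.6B, applied to a ($761M) net loss.
sales = 18_600_000_000
net_income = -761_000_000
sales_gain_low = sales * 1 // 100                  # $186M
sales_gain_high = sales * 5 // 100                 # $930M
income_low = net_income + sales_gain_low           # ($575M)
income_high = net_income + sales_gain_high         # $169M
```

The one-time $800K tagging cost is small next to even the low-end annual saving of $5.4M.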
51 Metadata ROI: Compliance
Avoiding penalties for breaching regulations
  SOX: up to 5 years in jail
  SOX: up to $5M
Following required procedures
  Loss of company: $100B revenue (’00)
  Loss of partner companies: Arthur Andersen
  $100B
52 Knowledge workers spend up to 2.5 hours each day looking for information … but find what they are looking for only 40% of the time. (Kit Sims Taylor)
Source: K.S. Taylor. "The brief reign of the knowledge worker." Cited by Sue Feldman in her original article.
53 High cost of not finding information
“The amount of time wasted in futile searching for vital information is enormous, leading to staggering costs …” (Sue Feldman)
High cost of poor classification
Poor classification costs a 10,000 user organization $10M each year, about $1,000 per employee. (Jakob Nielsen, useit.com)
But “better search” itself is a weak ROI
Source: Sue Feldman. "The high cost of not finding information." 13:3 KM World (March 2004)
The Jakob Nielsen comment may be apocryphal. It was mentioned in several Delphi reports, including Taxonomy and content classification: market milestone report (2002) and Information intelligence: content classification and enterprise taxonomy practice (2004), but the original quote cannot be attributed.
54 Knowledge workers spend more time re-creating existing content than creating new content (Kit Sims Taylor)
Chart values: 9%, 26%
Source: K.S. Taylor. "The brief reign of the knowledge worker." Cited by Sue Feldman in her original article.
55 Metadata ROI: Productivity
Decreased cost to market
  Decreased development cost
  Increased R&D productivity
  Reduced time for sales & marketing
1-5% decrease in drug development cost
  $800M/drug
5-10% increase in R&D productivity
  13% of revenue
  $39B in sales (’04)
10-20% decrease in time for sales & marketing
Enterprise document management system cost: $10M
Estimated benefits: $8M to $16M/drug; $254M to $507M/year; $254M to $507M/year
Source: PBS Frontline. The Other Drug War: FAQs. (June 2003)
56 Metadata FAQ: Executive mandate is key
There is no ROI out of the box
Just someone with a vision … and the budget to make it happen.
What’s really needed?
  Demos and proofs of value,
  so that a stronger cost-benefit argument can be made for continuing the work.
57 Metadata FAQ: How do you sell it?
Don’t sell “metadata” or “taxonomy”; sell the vision of what you want to be able to do.
Clearly understand what the problem is and what the opportunities are.
Do the calculus (costs and benefits).
Design the taxonomy (in terms of LOE) in relation to the value at hand.
59 Overview of metadata practices
Identify the team
Use (or map to) Dublin Core for basic information.
Extend with custom elements for specific facts.
Use pre-existing, standard vocabularies as much as possible.
  ISO country codes for locations
  Product & service info from ERP system
  Validate author names with LDAP directory
Design a QC process
  Start with an error-correction process, then get more formal on error detection
  Large-scale ontologies may be valuable in automated error detection
60 Factor “Subject” into smaller facets
Size
  DMOZ tries to organize all web content, and has more than 600k categories!
Difficulty in navigating, maintaining
Hidden facet structure
“Classification Schemes” vs. “Taxonomies”
61 Sources for 7 common vocabularies
Columns: Vocabulary; Definition; Potential Sources
Organization: Organizational structure. Sources: FIPS 95-2, U.S. Government Manual, your organizational structure, etc.
Content Type: Structured list of the various types of content being managed or used. Sources: DC Types, AGLS Document Type, AAT Information Forms, records management policy, etc.
Industry: Broad market categories such as lines of business, life events, or industry codes. Sources: FIPS 66, SIC, NAICS, etc.
Location: Place of operations or constituencies. Sources: FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service, etc.
Function: Functions and processes performed to accomplish mission and goals. Sources: FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc.
Topic: Business topics relevant to your mission and goals. Sources: Federal Register Thesaurus, NAL Agricultural Thesaurus, LCSH, etc.
Audience: Subset of constituents to whom a piece of content is directed or intended to be used. Sources: GEM, ERIC Thesaurus, IEEE LOM, etc.
Products and Services: Names of products/programs & services. Sources: ERP system, your products and services, etc.
Dublin Core elements shown alongside: dc:publisher, dc:type, dc:coverage, dc:subject, dcterms:audience
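One way to use these vocabularies is to map each facet onto a Dublin Core element when records are exported. The sketch below assumes a particular pairing of facets to elements (Organization to dc:publisher, Content Type to dc:type, Location to dc:coverage, Topic to dc:subject, Audience to dcterms:audience); the slide lists the five elements but the transcript does not preserve which row each belongs to, so treat `FACET_TO_DC` as a plausible reading rather than the authors' stated mapping.

```python
# Assumed pairing of facets to Dublin Core elements (not confirmed by the transcript).
FACET_TO_DC = {
    "Organization": "dc:publisher",
    "Content Type": "dc:type",
    "Location": "dc:coverage",
    "Topic": "dc:subject",
    "Audience": "dcterms:audience",
}


def to_dublin_core(facets):
    """Translate facet/value pairs into (element, value) pairs, skipping unmapped facets."""
    return [
        (FACET_TO_DC[name], value)
        for name, value in facets.items()
        if name in FACET_TO_DC
    ]


# Illustrative record loosely based on the Form SS-4 example.
sample = {
    "Organization": "IRS",
    "Content Type": "Information Submission",
    "Industry": "Generic",
    "Audience": "Business",
}
```

Facets without a Dublin Core counterpart, such as Industry here, would be carried as custom extension elements instead.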
62 Cheap and Easy Metadata Some fields will be constant across a collection.In the context of a single collection those kinds of elements add no value, but they add tremendous value when many collections are brought together into one place, and they are cheap to create and validate.
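Collection-wide constants can be applied once and merged into every item record. A minimal sketch, assuming records are plain dictionaries keyed by element name; the default values shown (including "Example Agency") are hypothetical.

```python
# Hypothetical values that are constant across one collection.
COLLECTION_DEFAULTS = {
    "dc:publisher": "Example Agency",
    "dc:language": "en",
    "dc:format": "text/html",
}


def with_defaults(item, defaults=COLLECTION_DEFAULTS):
    """Fill in collection-level values; anything set on the item itself wins."""
    return {**defaults, **item}


page = with_defaults({"dc:title": "Pet Policy", "dc:language": "es"})
```

The item keeps its own `dc:language` of "es" and inherits the publisher and format, so the constant fields cost nothing per item yet are present when collections are combined.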
63 Taxonomy Business Processes
Taxonomies must change, gradually, over time if they are to remain relevant
Maintenance processes need to be specified so that the changes are based on rational cost/benefit decisions
A team will need to maintain the taxonomy on a part-time basis
Taxonomy team reports to some other steering committee
64 Definitions about the Controlled Vocabulary Governance Environment
1: Syndicated Terminologies change on their own schedule
2: CV Team decides when to update CVs
3: Team adds value via mappings, translations, synonyms, training materials, etc.
4: Updated versions of CVs published to consuming applications
Diagram labels: Change Requests & Responses; Published CVs and STs; Notifications; Syndicated Terminologies (ISO3166-1, Other External); Vocabulary Management System; CVs; Custodians; Other Controlled Items; Consuming Applications (Intranet Search, Intranet Nav., Web CMS, Archives, ERMS, ERP, DAM, Other Internal)
65 Other Controlled Items
Taxonomy Team will have additional items to manage:
  Charter, Goals, Performance Measures
  Editorial rules
  Team processes
  Tagger training materials (manual and automatic)
  Outreach & ROI
    Communication plan
    Website
    Presentations
    Announcements
  Roadmap
66 Taxonomy governance | Generic team charter
Taxonomy Team is responsible for maintaining:
  The Taxonomy, a multi-faceted classification scheme
  Associated taxonomy materials, such as:
    Editorial Style Guide
    Taxonomy Training Materials
    Metadata Standard
  Team rules and procedures (subject to CIO review)
Team evaluates costs and benefits of suggested changes
Taxonomy Team will:
  Manage relationship between providers of source vocabularies and consumers of the Taxonomy
  Identify new opportunities for use of the Taxonomy across the Enterprise to improve information management practices
  Promote awareness and use of the Taxonomy
67 Other Controlled Items - Editorial Rules
To ensure consistent style, rules are needed
Issues commonly addressed in the rules:
  Sources of Terms
  Abbreviations
  Ampersands
  Capitalization
  Continuations (More… or Other…)
  Duplicate Terms
  Hierarchy and Polyhierarchy
  Languages and Character Sets
  Length Limits
  “Other”: Allowed or Forbidden?
  Plural vs. Singular Forms
  Relation Types and Limits
  Scope Notes
  Serial Comma
  Spaces
  Synonyms and Acronyms
  Term Arrangement (Alphabetic or …)
  Term Label Order (Direct vs. Inverted)
Must also address the issue of what to do when rules conflict: which are more important?
Rule Name: Editorial Rule
  Use Existing Vocabularies: Other things being equal, reusing an existing vocabulary is preferred to creating a new one.
  Ampersands: The character '&' is preferred to the word ‘and’ in Term Labels. Example: Use Type: “Manuals & Forms”, not “Manuals and Forms”.
  Special Characters: Retain accented characters in Term Labels. Example: España
  Serial comma: If a category name includes more than two items, separate the items by commas. The last item is separated by the character ‘&’ which IS NOT preceded by a comma. Example: “Education, Learning & Employment”, not “Education, Learning, & Employment”.
  Capitalization: Use title case (where all words except articles are capitalized). Example: “Education, Learning & Employment”, NOT “Education, learning & employment”, NOT “EDUCATION, LEARNING & EMPLOYMENT”, NOT “education, learning & employment”
  …
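Several of these rules are mechanical enough to check automatically before a term is published. The sketch below covers only the ampersand, serial comma, and capitalization rules from the table; the function name and messages are illustrative, and the title-case test is deliberately simple (it would, for instance, flag a single all-caps acronym used as a label).

```python
import re

ARTICLES = {"a", "an", "the"}


def label_violations(label):
    """Return the editorial rules (from the slide's examples) that a term label breaks."""
    problems = []
    # Ampersands: '&' is preferred to the word 'and'.
    if re.search(r"\band\b", label, flags=re.IGNORECASE):
        problems.append("use '&' instead of 'and'")
    # Serial comma: the final '&' is not preceded by a comma.
    if re.search(r",\s*&", label):
        problems.append("no comma before '&'")
    # Capitalization: every word except articles starts with a capital; no all-caps labels.
    words = re.findall(r"[^\W\d_]+", label)
    checked = [w for w in words if w.lower() not in ARTICLES | {"and"}]
    if label.isupper() or any(not w[0].isupper() for w in checked):
        problems.append("use title case")
    return problems
```

Labels such as “Education, Learning & Employment” and “España” pass, while each counter-example from the table produces exactly one message.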
68 Roles in Two Taxonomy Governance Teams
Taxonomy Specialist
  Suggests potential taxonomy changes based on analysis of query logs, indexer feedback
  Makes edits to taxonomy, installs into system with aid of IT specialist
Content Owner
  Reality check on process change suggestions
Business Lead
Custodians
  Responsible for content in a specific CV.
Training Representative
  Develops communications plan, training materials
Work Practices Representative
  Develops processes, monitors adherence
IT Representative
  Backups, admin of CV Tool
Info. Mgmt. Representative
  Provides CV expertise, tie-in with larger IM effort in the organization.
Executive Sponsor
  Advocate for the taxonomy team
Business Lead
  Keeps team on track with larger business objectives
  Balances cost/benefit issues to decide appropriate levels of effort
  Specialists help in estimating costs
  Obtains needed resources if those in team can’t accomplish a particular task
Technical Specialist
  Estimates costs of proposed changes in terms of amount of data to be retagged, additional storage and processing burden, software changes, etc.
  Helps obtain data from various systems
Content Specialist
  Team’s liaison to content creators
  Estimates costs of proposed changes in terms of editorial process changes, additional or reduced workload, etc.
Small-scale Metadata QA Responsibility
Team structure at a different org.
69 Taxonomy governance | Where changes come from
Diagram labels: Firewall; Application UI; Application Logic; Tagging UI; Tagging Logic; Content; Taxonomy; Staff notes; Query log analysis; ‘missing’ concepts; End User; Tagging Staff; Taxonomy Editor; Taxonomy Team; Requests from other parts of the organization (in the notes: other parts of NASA)
The taxonomy must be changed over time.
Suggestions for changes can come from users, through query log analysis, and from staff, through a feedback form.
Governance structure is needed to make sure changes are justified.
Recommendations by Editor
  Small taxonomy changes (labels, synonyms)
  Large taxonomy changes (retagging, application changes)
  New “best bets” content
Team considerations
  Business goals
  Changes in user experience
  Retagging cost
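Query log analysis for ‘missing’ concepts can start very simply: count the searches that match no taxonomy term. This is a sketch under stated assumptions: queries are plain strings, matching is exact after lowercasing, and the `min_count` threshold is an arbitrary illustrative cutoff.

```python
from collections import Counter


def missing_concepts(queries, taxonomy_terms, min_count=2):
    """Return (query, count) pairs for repeated searches that match no taxonomy term."""
    known = {term.lower() for term in taxonomy_terms}
    counts = Counter(q.strip().lower() for q in queries)
    return [
        (query, n)
        for query, n in counts.most_common()
        if query not in known and n >= min_count
    ]


log = ["pets", "Pets", "parking", "pets ", "parking", "lease"]
terms = ["Parking", "Lease"]
candidates = missing_concepts(log, terms)
```

The result is a list of candidate terms for the Taxonomy Editor to review, not an automatic change; the team still weighs retagging cost and business goals before acting.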
70 PrinciplesBasic facets with identified items – people, places, projects, instruments, missions, organizations, … Note that these are not subjective “subjects”, they are objective “objects”.Clearly identify the Custodians of the facets, and the process for maintain and publishing them.Subjective views can be laid on top of the objective facts, but should be in a different namespace so they are clearly distinguishable.For example, labels like “Anarchist” or “Prime Minister” can be applied to the same person at different times (e.g. Nelson Mandela).
71 Enterprise Portal challenges when organizing content Multiple subject domains across the enterpriseVocabularies varyGranularity variesUnstructured information represents about 80%Information is stored in complex waysMultiple physical locationsMany different formatsTagging is time-consuming and requires SME involvementPortal doesn’t solve content access problemKnowledge is power syndromeIncentives to share knowledge don’t existFree flow of information TO the portal might be inhibitedContent silo mentality changes slowlyWhat content has changed?What exists?What has been discontinued?Lack of awareness of other initiativesThe complexity of storage of information makes it a significant challenge to integrate all the data stores to act as a single seamless repositoryContent silos result in poor communication among groups ; lots of extra work because one group doesn’t know what the other is doing or has already doneYahoo employs a completely manual approach to tagging. All content is considered by SMEs.
72 Challenges when organizing content on enterprise portals. Lack of content standardization and consistency: content messages vary among departments; how do users know which message is correct? Re-usability low to non-existent. Costs of content creation, management and delivery may not change when portal is implemented: similar subjects, BUT diverse media, diverse tools, different users. How will personalization be implemented? How will existing site taxonomies be leveraged? Taxonomy creation may surface “holes” in content.
73 Agenda. 3:30 Introductions: Us and you; 3:45 Background: Metadata & controlled vocabularies; 4:00 Dublin Core: Elements, issues, and recommendations; 4:30 Dublin Core in the wild: CEN study and remarks; 4:45 Enterprise-wide metadata ROI questions; 5:00 Break; 5:15 ROI (Cont.); 5:30 Business processes; 6:15 Tools & technologies; 6:30 Q&A; 6:45 Adjourn
74 Methods used to create & maintain metadata. Paper or web-based forms widely used: distributed resource origination metadata tagging; centralized clean-up and metadata entry. Automated tools & applications not widely used: auto-categorization tools; vocabulary/taxonomy editing tools; guided navigation applications; federated search and repository “wrappers”. Base: 20 corporate information managers. CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments
75 The Tagging Problem. How are we going to populate metadata elements with complete and consistent values? What can we expect to get from automatic classifiers?
76 Tagging. Province of authors (SMEs) or editors? Taxonomy often highly granular to meet task and re-use needs. Vocabulary dependent on originating department. The more tags there are (and the more values for each tag), the more hooks to the content. If there are too many, authors will resist and use “general” tags (if available). Automatic classification tools exist, and are valuable, but their results are not as good as what humans can do. “Semi-automated” is best. Degree of human involvement is a cost/benefit tradeoff.
77 Automatic categorization vendors | Analyst viewpoint. [Chart axes: Accuracy Level (high to low); Content Volumes.] Scalability requires simple creation of granular metadata and taxonomies. Better content architecture means more accurate categorization, and more precise content delivery. Surprisingly, most organizations are better off buying tools from the lower left quadrant. Their absolute accuracy is less, but it comes with a lot of other features (UI, versioning, workflow, storage) that provide the basis for building a QA process.
78 Considerations in automatic classifier performance. [Chart: Accuracy vs. Development Effort / Licensing Expense; labels: Regexps, Trained Librarians, potential performance gain.] Classification performance is measured by “inter-cataloger agreement”. Trained librarians agree less than 80% of the time. Errors are subtle differences in judgment, or big goofs. Automatic classification struggles to match human performance. Exception: entity recognition can exceed human performance. Classifier performance limited by algorithms available, which is limited by development effort. Very wide variance in one vendor’s performance depending on who does the implementation, and how much time they have to do it. 80/20 tradeoff where 20% of effort gives 80% of performance. Smart implementation of inexpensive tools will outperform naive implementations of world-class tools.
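As a rough illustration of inter-cataloger agreement, the sketch below computes simple percent agreement between two taggers, assuming each assigns exactly one category per document. The function name and the sample labels are hypothetical; real evaluations often use multi-valued tags or chance-corrected measures, which this sketch does not cover.

```python
def agreement_rate(tags_a, tags_b):
    """Fraction of documents on which two taggers assigned the same category."""
    if not tags_a or len(tags_a) != len(tags_b):
        raise ValueError("Both taggers must label the same non-empty document set")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


# Hypothetical labels for the same five documents.
librarian_1 = ["missions", "people", "instruments", "missions", "places"]
librarian_2 = ["missions", "people", "missions", "missions", "places"]
classifier = ["missions", "places", "missions", "missions", "places"]

human_baseline = agreement_rate(librarian_1, librarian_2)
machine_vs_human = agreement_rate(classifier, librarian_1)
print(human_baseline, machine_vs_human)  # 0.8 0.6
```

Comparing the classifier against the human baseline, rather than against 100%, reflects the slide's point that trained librarians themselves do not fully agree.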
81 Metadata tagging workflows. [Workflow diagram, sample of ‘author-generated’ metadata workflow. Labels: Compose in Template; Submit to CMS; Analyst; Editor; Review content; Problem? (Y/N); Copywriter; Copy Edit content; Hard Copy; Web site; Approve/Edit metadata; Automatically fill-in metadata; Tagging Tool; Sys Admin.] Even ‘purely’ automatic meta-tagging systems need a manual error correction procedure. Should add a QA sampling mechanism. Tagging models: author-generated; central librarians; hybrid (central auto-tagging service, distributed manual review and correction).
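The hybrid model with a QA sampling mechanism could be sketched as follows. This assumes the auto-tagging service returns a confidence score with each tag; the function name, the threshold, and the sampling rate are hypothetical values chosen for illustration, not recommendations from the workshop.

```python
import random


def route_tagged_items(items, threshold=0.8, qa_rate=0.1, seed=0):
    """Split auto-tagged items into manual review, accepted, and a QA sample.

    items: iterable of (doc_id, tag, confidence) tuples from an auto-tagger.
    Low-confidence items go to manual review; a random share of the
    accepted items is also pulled for QA spot checks.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    review, accepted, qa_sample = [], [], []
    for doc_id, _tag, confidence in items:
        if confidence < threshold:
            review.append(doc_id)
        else:
            accepted.append(doc_id)
            if rng.random() < qa_rate:
                qa_sample.append(doc_id)
    return review, accepted, qa_sample


# Hypothetical auto-tagger output.
auto_tagged = [
    ("doc-1", "missions", 0.95),
    ("doc-2", "people", 0.55),
    ("doc-3", "instruments", 0.88),
]

review, accepted, qa_sample = route_tagged_items(auto_tagged)
print(review)    # ['doc-2']
print(accepted)  # ['doc-1', 'doc-3']
```

Raising the threshold sends more items to people and raises cost, which is the cost/benefit tradeoff on human involvement noted earlier in the deck.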
82 Automatic categorization vendors | Pragmatic viewpoint. [Chart axes: Accuracy Level (high to low); Content Volumes.] Scalability requires simple creation of granular metadata and taxonomies. Better content architecture means more accurate categorization, and more precise content delivery. Surprisingly, most organizations are better off buying tools from the lower left quadrant. Their absolute accuracy is less, but it comes with a lot of other features (UI, versioning, workflow, storage) that provide the basis for building a QA process.
83 Seven practical rules for taxonomies. (1) Incremental, extensible process that identifies and enables users, and engages stakeholders. (2) Quick implementation that provides measurable results as quickly as possible. (3) Not monolithic: has separately maintainable facets. (4) Re-uses existing IP as much as possible. (5) A means to an end, and not the end in itself. (6) Not perfect, but it does the job it is supposed to do, such as improving search and navigation. (7) Improved over time, and maintained.
84 Agenda. 3:30 Introductions: Us and you; 3:45 Background: Metadata & controlled vocabularies; 4:00 Dublin Core: Elements, issues, and recommendations; 4:30 Dublin Core in the wild: CEN study and remarks; 4:45 Enterprise-wide metadata ROI questions; 5:00 Break; 5:15 ROI (Cont.); 5:30 Business processes; 6:15 Tools & technologies; 6:30 Summary, Q&A; 6:45 Adjourn
85 Summary: Categorize with a purpose. What is the problem you are trying to solve? Improve search; browse for content on an enterprise-wide portal; enable business users to syndicate content; otherwise provide the basis for content re-use. How will you control the cost of creating and maintaining the metadata needed to solve these problems? CMS with metadata tagging products; semi-automated classification; taxonomy editing tools; guided navigation tools.