Ron Daniel & Joseph Busch Taxonomy Strategies


1 Ron Daniel & Joseph Busch Taxonomy Strategies
Workshop: Why and How to Use Dublin Core for Enterprise-Wide Metadata Applications
Ron Daniel & Joseph Busch, Taxonomy Strategies

2 Workshop goals
Answer the question: What is the Dublin Core?
Answer these enterprise-wide metadata ROI questions:
- What is the value proposition for adding metadata to content? Does metadata make content reusable? Findable? Improve productivity?
- How can metadata value be measured in a way that quantifies how it contributes to the bottom line?
Answer these business-process questions:
- How is Dublin Core tagging being done on content to expose metadata to portals, search engines, and other metadata-aware applications?
- How are metadata value spaces (controlled vocabularies) maintained within an enterprise? Across enterprises?
Answer these technology questions:
- What tools exist to use Dublin Core and other metadata standards in enterprise information management environments?

3 Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

4 Who we are: Joseph Busch
Over 25 years in the business of organized information:
- Founder, Taxonomy Strategies
- Director, Solutions Architecture, Interwoven
- VP, Infoware, Metacode Technologies (acquired by Interwoven, November 2000)
- Program Manager, Getty Foundation
- Manager, Pricewaterhouse
Metadata and taxonomies community leadership:
- President, American Society for Information Science & Technology
- Director, Dublin Core Metadata Initiative
- Adviser, National Research Council Computer Science and Telecommunications Board
- Reviewer, National Science Foundation Division of Information and Intelligent Systems
- Founder, Networked Knowledge Organization Systems/Services

5 Who we are: Ron Daniel, Jr.
Over 15 years in the business of metadata & automatic classification:
- Principal, Taxonomy Strategies
- Standards Architect, Interwoven
- Senior Information Scientist, Metacode Technologies (acquired by Interwoven, November 2000)
- Technical Staff Member, Los Alamos National Laboratory
Metadata and taxonomies community leadership:
- Chair, PRISM (Publishers Requirements for Industry Standard Metadata) working group
- Acting chair, XML Linking working group
- Member, RDF working groups
- Co-editor: PRISM, XPointer, 3 IETF RFCs, and Dublin Core 1 & 2 reports

6 Recent & current projects
Commercial: Allstate Insurance, Blue Shield of California, Debevoise & Plimpton, Halliburton, Hewlett Packard, Motorola, PeopleSoft, PricewaterhouseCoopers, Siderean Software, Sprint, Time Inc.
Commercial subcontracts: Agency.com (top financial services), Critical Mass (Fortune 50 retailer), Deloitte Consulting (big credit card), Gistics/OTB (direct selling giant)
NGOs: CEN, IDEAlliance, IMF, OCLC
Government: Commodity Futures Trading Commission, Defense Intelligence Agency, ERIC, Federal Aviation Administration, Federal Reserve Bank of Atlanta, Forest Service, GSA Office of Citizen Services, Head Start, Infocomm Development Authority of Singapore, NASA (nasataxonomy.jpl.nasa.gov), Small Business Administration, Social Security Administration, USDA Economic Research Service, USDA e-Government Program
Please see for brief descriptions of client projects.

7 What we do
Organize stuff: figure out how to organize stuff.

8 Who are you? Tell us:
- Your name
- Your organization
- Your job title
- The things you want to get from this workshop

9 Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

10 Metadata: Different definitions
Library & Information Science view:
- Author/Title/Subject
- Controlled vocabularies for subject codes (e.g. Dewey)
- Authority files for author names
Database view:
- Tables/Columns/Datatypes/Relationships
- References for some values

11 Metadata: Why it matters
- "Adding metadata to unstructured content allows it to be managed like structured content. Applications that use structured content work better."
- "Enriching content with structured metadata is critical for supporting search and personalized content delivery."
- "Content that has been adequately tagged with metadata can be leveraged in usage tracking, personalization and improved searching."
- "Better structure equals better access: Taxonomy serves as a framework for organizing the ever-growing and changing information within a company. The many dimensions of taxonomy can greatly facilitate Web site design, content management, and search engineering. If well done, taxonomy will allow for structured Web content, leading to improved information access."
Sources:
- S. Phillips, E. Maguire, C. Shilakes. Content management: The new data infrastructure. Convergence and divergence out of chaos. Merrill Lynch, June 2001.
- P.R. Hagen. Must search stink? Forrester Research, June 2000.
- K. Hall. Content tagging strategies. Giga Information Group, February 2001.

12 Metadata: Supports core functions
- Asset metadata (Who): Creator, Publisher, Contributor, Type, Format, Identifier
- Subject metadata (What, Where & Why): Subject, Title, Description, Coverage
- Relational metadata (links between and to): Source, Relation
- Use metadata (When & How): Date, Language, Rights
Metadata contains critical information about each content item: the who, what, when, where, and why for each content asset. This information is provided to meet certain needs. In general, those needs boil down to "better search" for existing material and "better processes" for creating new material: better navigation & discovery, and a more efficient editorial process.
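To make the four groupings concrete, here is a small Python sketch. The record and the helper function are illustrative (not from the workshop); only the category-to-element mapping comes from the slide above.

```python
# Partition a flat Dublin Core record into the four metadata
# categories named on the slide. Elements outside the mapping
# are silently dropped; a real system would flag them.
DC_CATEGORIES = {
    "asset": {"creator", "publisher", "contributor", "type", "format", "identifier"},
    "subject": {"subject", "title", "description", "coverage"},
    "relational": {"source", "relation"},
    "use": {"date", "language", "rights"},
}

def partition(record):
    """Group a dict of DC element -> value by metadata category."""
    out = {cat: {} for cat in DC_CATEGORIES}
    for element, value in record.items():
        for cat, members in DC_CATEGORIES.items():
            if element in members:
                out[cat][element] = value
    return out

# Illustrative record (hypothetical values).
parts = partition({
    "title": "Q3 Sales Report",
    "creator": "Jane Smith",
    "date": "2004-10-01",
    "format": "application/pdf",
})
```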

13 What is a taxonomy? Systematics view
Hierarchical classification of things into a tree structure.
Linnaean example:
- Kingdom: Animalia
- Phylum: Chordata
- Class: Mammalia
- Order: Carnivora
- Family: Canidae
- Genus: Canis
- Species: C. familiaris Linnaeus …
UNSPSC example:
- Segment: 44 - Office Equipment and Accessories and Supplies
- Family: .12 - Office Supplies
- Class: .17 - Writing Instruments
- Commodity: .05 - Mechanical pencils; .06 - Wooden pencils; .07 - Colored pencils …

14 Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

15 Dublin Core: A little more complicated
Elements: Identifier, Title, Creator, Contributor, Publisher, Subject, Description, Coverage, Format, Type, Date, Relation, Source, Rights, Language
Refinements: Abstract, Access rights, Alternative, Audience, Available, Bibliographic citation, Conforms to, Created, Date accepted, Date copyrighted, Date submitted, Education level, Extent, Has format, Has part, Has version, Is format of, Is part of, Is referenced by, Is replaced by, Is required by, Issued, Is version of, License, Mediator, Medium, Modified, Provenance, References, Replaces, Requires, Rights holder, Spatial, Table of contents, Temporal, Valid
Encodings: Box, DCMIType, DDC, IMT, ISO3166, ISO639-2, LCC, LCSH, MESH, Period, Point, RFC1766, RFC3066, TGN, UDC, URI, W3CDTF
Types: Collection, Dataset, Event, Image, Interactive Resource, Moving Image, Physical Object, Service, Software, Sound, Still Image, Text

16 Dublin Core framework for corporate use
Not just 15 elements: a framework to enable cross-resource exploration and use.
Dublin Core is a framework for "integration metadata" at BellSouth.
Source: Todd Stephens, BellSouth

17 Metadata: A data specification – a recipe example
Asset metadata:
- Unique ID (dc:identifier): Integer, fixed length, required (1), system supplied. Basic accountability.
- Recipe Title (dc:title): String, variable length, from licensed content. Text search & results display.
- Recipe Summary (dc:description): from content.
Subject metadata:
- Main Ingredients: List, 1 or more, from Main Ingredients vocabulary. Key index to retrieve & aggregate recipes, & generate shopping list.
- Meal Types, Cuisines, Courses: 0 or more, from Meal Types and Courses vocabularies. Browse or group recipes & filter search results.
- Cooking Method: Flag, from Cooking vocabulary.
Link metadata:
- Recipe Image: Pointer.
- Product Group: Merchandize products.
Use metadata:
- Rating: Filter, rank, & evaluate recipes.
- Release Date (dc:date): Date. Publish & feature new recipes.
Other DC mappings listed on the slide: dcterms:hasPart; fixed values dc:type="recipe", dc:format="text/html", dc:language="en".
Legend: ? = 1 or more; * = 0 or more
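A specification like this can be captured as data and used to validate records. The sketch below is a hypothetical illustration in Python covering only a few of the elements above, with simplified field names; DC mappings are included only where the slide's pairing is unambiguous, and None marks the rest.

```python
# Illustrative encoding of part of the recipe metadata spec:
# element -> (DC mapping or None, required?, repeatable?)
SPEC = {
    "unique_id":    ("dc:identifier",  True,  False),
    "title":        ("dc:title",       True,  False),
    "summary":      ("dc:description", False, False),
    "ingredients":  (None,             True,  True),   # "?" = 1 or more
    "meal_types":   (None,             False, True),   # "*" = 0 or more
    "release_date": ("dc:date",        False, False),
}

def validate(record):
    """Return a list of problems found in a record (empty = valid)."""
    errors = []
    for field, (_, required, repeatable) in SPEC.items():
        value = record.get(field)
        if required and not value:
            errors.append(f"missing required field: {field}")
        if value is not None and not repeatable and isinstance(value, list):
            errors.append(f"field not repeatable: {field}")
    return errors
```

A valid record passes with an empty error list; a record missing its title, or repeating a non-repeatable field, is flagged.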

18 Why Dublin Core?
- Dublin Core is a de-facto standard across many other systems and standards: RSS (1.0), OAI; inside organizations – portals, CMS, …
- Mapping to DC elements from most existing schemes is simple. Beware of force-fits.
- Why will metadata already exist? Because of search projects, portal integration projects, etc. that are creating it or standardizing a mapping.
Source: Todd Stephens, BellSouth
[Slide graphic layers: "Taxonomies, Vocabularies, Ontologies"; "Dublin Core and Similar"; "Per-Source Data Types, Access Controls, etc."]

19 Creator
"An entity primarily responsible for making the content of the resource". In other words – Author, Photographer, Illustrator, …
Refinements: None. Potential refinements by creative role are rarely justified.
Encodings: None.
Creators can be persons or organizations.
Key point – Name variations are a big issue in data quality: Ron Daniel; Ron Daniel, Jr.; Ron Daniel Jr.; R.E. Daniel; Ronald Daniel; Ronald Ellison Daniel, Jr.; Daniel, R.
Name fields may contain other information: <dc:creator>Case, W. R. (NASA Goddard Space Flight Center, Greenbelt, MD, United States)</dc:creator>
Best practice – Validate names against LDAP or another "authority file".
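The name-variation problem above can be illustrated with a deliberately coarse normalization sketch. This is not a substitute for the recommended authority-file validation; it only shows why the listed variants collide on the same person.

```python
import re

# Reduce common variants of a personal name to a coarse comparison
# key (first initial + last name). Illustrative only: real systems
# should validate against an LDAP directory or authority file.
SUFFIXES = {"jr", "sr", "ii", "iii"}

def name_key(name):
    # Strip parenthetical affiliations, as in the NASA example above.
    name = re.sub(r"\(.*?\)", "", name)
    # Normalize "Last, First" (or "Last, Suffix First") to "First Last".
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    tokens = [t.strip(".").lower() for t in name.split()]
    tokens = [t for t in tokens if t and t not in SUFFIXES]
    return (tokens[0][0], tokens[-1]) if tokens else None
```

Under this key, "Ron Daniel, Jr.", "Ronald Ellison Daniel, Jr.", and "Daniel, R." all collide; the slide-20 problem (different people sharing a key) is exactly what such a coarse key cannot resolve.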

20 Example – Name mismatches
One of these things is not like the other:
- Ron Daniel, Jr. and Carl Lagoze; "Distributed Active Relationships in the Warwick Framework"
- Hojung Cha and Ron Daniel; "Simulated Behavior of Large Scale SCI Rings and Tori"
- Ron Daniel; "High Performance Haptic and Teleoperative Interfaces"
Differences may not matter. If they do, this error cannot be reliably detected automatically; authority files and an error-correction procedure are needed.

21 Contributor
"An entity responsible for making contributions to the content of the resource."
Refinements: None. Encodings: None.
In practice – rarely used. Difficult to distinguish from Creator; adds UI complexity for no real gain.
Recommendation – Don't use.

22 Publisher
"An entity responsible for making the resource available".
Refinements: None. Encodings: None.
Problems:
- All the name-handling issues of Creator.
- Hierarchy of publishers (Bureau, Agency, Department, …)

23 Title
"A name given to the resource".
Refinements: Alternative. Encodings: None.
Issues:
- Hierarchical titles, e.g. Conceptual Structures: Information Processing in Mind and Machine (The Systems Programming Series)
- Untitled works, e.g. Metaphysics

24 Identifier
"An unambiguous reference to the resource within a given context"
Refinements: Bibliographic Citation. Encodings: URI.
Best practice: URL. Future best practice: URI?
Problems:
- Personalized URLs
- Multiple identifiers for the same content
- Non-standard resolution mechanisms for URIs
Recommendation – Plan how to introduce long-lived URLs.

25 Date
"A date associated with an event in the life cycle of the resource"
Refinements: Created, Valid, Available, Issued, Modified, Date Accepted, Date Copyrighted, Date Submitted.
Encodings: DCMI Period, W3C DTF (a profile of ISO 8601).
Woefully underspecified; typically the publication or last-modification date.
Best practice: YYYY-MM-DD
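The YYYY-MM-DD best practice is easy to enforce in code. The helpers below are an illustrative Python sketch using only the standard library; the function names are my own.

```python
from datetime import date, datetime

# W3CDTF (a profile of ISO 8601) recommends YYYY-MM-DD for dc:date.
def dc_date(d):
    """Format a date (or datetime) in the YYYY-MM-DD form of W3CDTF."""
    return d.strftime("%Y-%m-%d")

def parse_dc_date(s):
    """Parse a YYYY-MM-DD string back into a date; raises on bad input."""
    return datetime.strptime(s, "%Y-%m-%d").date()
```

Rejecting malformed values at ingest (parse_dc_date raises ValueError) is cheaper than cleaning inconsistent date strings later.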

26 Subject
The topic of the content of the resource.
Refinements: None. Encodings: DDC, LCC, LCSH, MESH, UDC.
Best practice: Use pre-defined subject schemes, not user-selected keywords.
The supported encodings are probably not useful for most corporate needs.
Factor "Subject" into separate facets:
- People, places, organizations, events, objects, services
- Industry sectors
- Content types, audiences, functions
- Topic
Some of the facets are already defined in DC (Coverage, Type) or DCTERMS (Audience).

27 Coverage
"The extent or scope of the content of the resource". In other words – places and times as topics.
Refinements: Spatial, Temporal.
Encodings: Box, ISO3166, Point, TGN (for Spatial); W3CDTF (for Temporal).
Key point – Locations are important in SOME environments, irrelevant in others. Time periods as subjects are rarely important in commercial work.
Best practice – use the listed encodings (e.g. ISO 3166 for spatial values).

28 Description
"An account of the content of the resource". In other words – an abstract or summary.
Refinements: Abstract, Table of Contents. Encodings: None.
Key point – What's the cost/benefit tradeoff for creating descriptions?
- Quality of auto-generated descriptions is low.
- For search results, hit highlighting is probably better.

29 Type
"The nature or genre of the content of the resource"
Refinements: None. Encodings: DCMI Type.
Best current practice: Create a custom list of content types and use that list for the values. Try to avoid "image", "audio", and other format names in the list of content types; they can be derived from "Format".
No broadly acceptable list has yet been found.

30 Format
"The physical or digital manifestation of the resource." In other words – the file format.
Refinements: Extent, Medium. Encodings: IMT.
Best practice: Internet Media Types.
Outliers: file sizes, dimensions of physical objects.

31 Language
"A language of the intellectual content of the resource".
Refinements: None. Encodings: ISO639-2, RFC1766, RFC3066.
Best practice: ISO 639, RFC 3066.
Dialect codes are an advanced practice.

32 Relation
"A reference to a related resource"
Refinements: Is Version Of, Has Version, Is Replaced By, Replaces, Is Required By, Requires, Is Part Of, Has Part, Is Referenced By, References, Is Format Of, Has Format, Conforms To.
Encodings: URI.
Very weak meaning – not even as strong as "See also".
Best practice: Use a refinement element and URLs.

33 Source
"A reference to a resource from which the present resource is derived"
Refinements: None. Encodings: URI.
The original intent was for derivative works. Frequently abused to provide bibliographic information for items extracted from a larger work, such as articles from a journal.

34 Rights
"Information about rights held in and over the resource"
Refinements: Access Rights, License. Encodings: None.
Could be a copyright statement, a list of groups with access rights, or …

35 Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

36 CEN/ISSS Workshop on Dublin Core
Guidance information for the deployment of Dublin Core metadata in Corporate Environments

37 Dublin Core: CEN/ISSS Workshop on Dublin Core Metadata – corporate uses
Corporate examples: Applied Information Technique, AstraZeneca, BBC, BellSouth, Cisco, DaimlerChrysler, Giunti Labs, GSK, Halliburton, HP, IBM, Intel, John Wiley & Sons, Lilly, PeopleSoft, Rohm and Haas, SAP, Software AG, Unisys.
The CEN/ISSS Workshop on Dublin Core Metadata's Guidance information for the deployment of Dublin Core metadata in Corporate Environments (ftp://ftp.cenorm.be/public/ws-mmi-dc/mmidc128.htm) is a draft CWA (CEN Workshop Agreement) under the 2004 Workplan of the CEN/ISSS Workshop on Dublin Core Metadata for Multimedia Information - Dublin Core (MMI-DC) of the European Committee for Standardization (CEN), prepared by Joseph Busch, Kerstin Forsberg, and Makx Dekkers.

38 How is Dublin Core used in corporate environments?
In corporate environments, Dublin Core is used as the de facto descriptive metadata standard, because it is a simple & transparent metadata scheme. Dublin Core is used to:
- Enable integrated access to multiple, heterogeneous information resources, and
- Address compliance requirements.
Base: 20 corporate information managers.
CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments

39 Controlled Vocabularies
Taxonomy: e-Forms example. Facets and sample values:
- Agency: 0001 Legislative; 1000 Judicial; 1100 Executive Office of Pres; 0003 Exec Depts; 1200 Agriculture; 1300 Commerce; 9700 Defense; 9100 Education; 8900 Energy; 7500 HHS; 7000 DHS; 8600 HUD; 1400 Interior; 1500 Justice; 1600 Labor; 1900 State; 6900 Transport; 2000 Treasury; 3600 Veterans; Ind Agencies; Intl Orgs
- Form Type: Application; Approval; Claim; Information request; Information submission; Instructions; Legal filing; Payment; Procurement; Renewal; Reservation; Service request; Test; Other input; Other transaction
- Industry Impact: 00 Generic; 11 Agriculture; 21 Mining; 22 Utilities; 23 Construct; 31-33 Manuf; 42 Wholesale; 44-45 Retail; 48-49 Trans; 51 Info; 52 Finance; 54 Profession; 55 Mgmt; 56 Support; 61 Education; 62 Health Care; 71 Arts; 72 Hospitality; 81 Other Services; 92 Public Admin
- BRM Impact: Citizen Srvcs; Social Srvs; Defense; Disasters; Econ Dev; Education; Energy; Env Mgmt; Law Enf; Judicial; Correctional; Health; Security; Income Sec; Intelligence; Intl Affairs; Nat Resour; Transport; Workforce; Science; Delivery; Support; Management
- Keyword Topic: Agriculture & food; Commerce; Communications; Education; Energy; Env pro; Foreign rels; Govt; Health & safety; Housing & comm dev; Labor; Law; Named grps; National def; Nat resources; Recreation; Sci & tech; Social pgms; Transport
- Audience: All; General Citizen; Business; Govt Employee; Native American; Non-resident; Tourist; Special group
- Jurisdiction: Federal; State +; Local +; Other +

40 How is Dublin Core extended?
Base: 20 corporate information managers.
CEN/ISSS Workshop on Dublin Core – Guidance information for the deployment of Dublin Core metadata in Corporate Environments

41 Custom business process document types? Ouch!
Oil & gas services company document types:
- analyses, appraisals, assessments, forecasts, predictions
- agendas, plans, designs, schedules, workflow
- applications, proposals, requests, requirements
- permits, consents, approvals, rejections, certificates
- work orders, correspondence
- auditing, compliance, testing, inspections, operations reports
- lessons learned, after-action reviews, meeting minutes, FAQs
- policies, procedures, training manuals, standards, best practices
- research notes, journal articles
- newsletters, bulletins, press releases
- ads, brochures, data sheets, technical notes, case studies, price lists
- checklists, templates, forms, logos, branding
- software, database forms

42 The power of taxonomy facets
4 independent categories of 10 nodes each have the same discriminatory power as one hierarchy of 10,000 (10^4) nodes.
- Easier to maintain
- Can be easier to navigate
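The arithmetic above is easy to verify. This short Python sketch (facet values are placeholder strings) enumerates the combinations and counts the nodes that actually have to be maintained:

```python
from itertools import product

# Four independent facets of 10 values each distinguish
# 10**4 = 10,000 combinations -- the same discriminatory power
# as one 10,000-node hierarchy -- with only 40 nodes to maintain.
facets = [[f"facet{i}-value{j}" for j in range(10)] for i in range(4)]

combinations = list(product(*facets))
nodes_to_maintain = sum(len(f) for f in facets)
```

Forty maintained nodes versus ten thousand is the maintenance argument for facets in one line of arithmetic.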

43 Taxonomic metadata example: Form SS-4, Employer Identification Number (EIN)
Facet: Value
- Agency: IRS
- Content Type: Application [or Information Submission]
- Industry Impact: Generic
- Jurisdiction: Federal
- BRM Impact (Programs & Services): Support Delivery of Services / General Government / Taxation Management
- Keyword Topic: Commerce / Employment taxes
- Audience: Business

44 Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

45 Fundamentals of metadata ROI
- Tagging content using metadata and a taxonomy is a cost, not a benefit. There is no benefit without exposing the tagged content to users in some way that cuts costs or improves revenues.
- Putting metadata and a taxonomy into operation requires UI changes and/or backend system changes, as well as data changes. You need to determine those changes, and their costs, as part of the ROI.

46 Common metadata ROI scenarios
- Catalog site: increased sales, increased productivity.
- Customer support: cutting costs.
- Compliance: avoiding penalties.
- Knowledge worker productivity: less time searching, more time working.
- Executive mandate: no ROI study, just someone with a vision and a budget.

47 Metadata ROI: Catalog site
- Guided navigation
- 2-3 clicks to product
- No dead ends

48 Metadata ROI: Catalog site
Increased sales (product findability, cross-sells and up-sells, customer loyalty):
- Company figures: $57.6B sales ('04), $2.1B net income ('04); enterprise portal cost $6M.
- 1-5% increase in sales → $600M to $2B/year in sales, or $21M to $105M/year in net income.
Increased productivity:
- 1-5% increase in productivity; $50K average cost per employee; 310,400 employees ('04) → $155M to $776M/year.
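The ranges above follow directly from the slide's figures. This Python sketch recomputes them; the helper name is my own, and the 1-5% sensitivity range comes from the slide.

```python
# Recompute the catalog-site ROI ranges from the slide's figures.
sales = 57.6e9            # 2004 sales
net_income = 2.1e9        # 2004 net income
employees = 310_400
cost_per_employee = 50_000

def pct_range(base, lo=0.01, hi=0.05):
    """Apply the slide's 1-5% sensitivity range to a base figure."""
    return base * lo, base * hi

sales_uplift = pct_range(sales)        # the slide rounds this to $600M-$2B/year
income_uplift = pct_range(net_income)  # $21M to $105M/year
productivity = pct_range(employees * cost_per_employee)  # ~$155M to $776M/year
```

Against the $6M portal cost, even the low end of any of these ranges dominates by more than an order of magnitude.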

49 Metadata ROI: Customer support model
- Help on the search page, not a click away
- Type-and-go search for specific policies
- Policy categories for browsing
- "Refine search" offered with results
- Good search results for policy topics, e.g., "pets"

50 Metadata ROI: Customer support model
Self service (fewer customer calls; faster, more accurate CSR responses through better information access):
- 25-50% service efficiency increase; 300K customer service calls per month at $6 cost per call → $5.4M to $10.8M/year.
- Manual processing: 100,000 documents × 2 pages per document × $4 per page = $800K.
Increased sales:
- 1-5% increased sales; $18.6B sales ('04), ($761M) net income ('04) → $186M to $930M/year in sales, moving net income from ($575M) to $169M/year.

51 Metadata ROI: Compliance
Avoiding penalties for breaching regulations:
- SOX: up to 5 years in jail
- SOX: up to $5M
Following required procedures; failure can mean:
- Loss of the company ($100B revenue in '00)
- Loss of partner companies (Arthur Andersen → $100B)

52 Knowledge workers spend up to 2.5 hours each day looking for information … but find what they are looking for only 40% of the time. — Kit Sims Taylor
K.S. Taylor, "The brief reign of the knowledge worker," cited by Sue Feldman in her original article.

53 High cost of not finding information
"The amount of time wasted in futile searching for vital information is enormous, leading to staggering costs …" — Sue Feldman. "The high cost of not finding information." KMWorld 13:3 (March 2004).
High cost of poor classification: poor classification costs a 10,000-user organization $10M each year, about $1,000 per employee. — attributed to Jakob Nielsen, useit.com
But "better search" itself is a weak ROI.
The Jakob Nielsen comment may be apocryphal: it was mentioned in several Delphi reports, including Taxonomy and content classification: market milestone report (2002) and Information intelligence: content classification and enterprise taxonomy practice (2004), but the original quote cannot be attributed.

54 Knowledge workers spend more time re-creating existing content than creating new content — Kit Sims Taylor
[Chart values from the slide: 9% and 26%]
K.S. Taylor, "The brief reign of the knowledge worker," cited by Sue Feldman in her original article.

55 Metadata ROI: Productivity
Decreased cost to market / decreased development cost:
- 1-5% decrease in drug development cost; $800M/drug → $8M to $16M/drug. Enterprise document management system cost: $10M.
Increased R&D productivity:
- 5-10% increase in R&D productivity; R&D is 13% of revenue; $39B in sales ('04) → $254M to $507M/year.
Reduced time for sales & marketing:
- 10-20% decrease in time for sales & marketing → $254M to $507M/year.
Source: PBS Frontline. The Other Drug War: FAQs. (June 2003)

56 Metadata FAQ: Executive mandate is key
There is no ROI out of the box – just someone with a vision, and the budget to make it happen.
What's really needed? Demos and proofs of value, so that a stronger cost-benefit argument can be made for continuing the work.

57 Metadata FAQ: How do you sell it?
- Don't sell "metadata" or "taxonomy"; sell the vision of what you want to be able to do.
- Clearly understand what the problem is and what the opportunities are.
- Do the calculus (costs and benefits).
- Design the taxonomy (in terms of level of effort, LOE) in relation to the value at hand.

58 Agenda
3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies
4:00 Dublin Core: Elements, issues, and recommendations
4:30 Dublin Core in the wild: CEN study and remarks
4:45 Enterprise-wide metadata ROI questions
5:00 Break
5:15 ROI (Cont.)
5:30 Business processes
6:15 Tools & technologies
6:30 Q&A
6:45 Adjourn

59 Overview of metadata practices
- Identify the team.
- Use (or map to) Dublin Core for basic information. Extend with custom elements for specific facts.
- Use pre-existing, standard vocabularies as much as possible: ISO country codes for locations; product & service info from the ERP system; validate author names with an LDAP directory.
- Design a QC process. Start with an error-correction process, then get more formal about error detection. Large-scale ontologies may be valuable in automated error detection.

60 Factor "Subject" into smaller facets
- Size: DMOZ tries to organize all web content and has more than 600k categories! Difficulty in navigating and maintaining.
- Hidden facet structure: "Classification Schemes" vs. "Taxonomies"

61 Sources for 7 common vocabularies
- Organization: organizational structure. Potential sources: FIPS 95-2, U.S. Government Manual, your organizational structure, etc. (dc:publisher)
- Content Type: structured list of the various types of content being managed or used. Potential sources: DC Types, AGLS Document Type, AAT Information Forms, records management policy, etc. (dc:type)
- Industry: broad market categories such as lines of business, life events, or industry codes. Potential sources: FIPS 66, SIC, NAICS, etc.
- Location: place of operations or constituencies. Potential sources: FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service, etc. (dc:coverage)
- Function: functions and processes performed to accomplish mission and goals. Potential sources: FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc.
- Topic: business topics relevant to your mission and goals. Potential sources: Federal Register Thesaurus, NAL Agricultural Thesaurus, LCSH, etc. (dc:subject)
- Audience: subset of constituents to whom a piece of content is directed or intended to be used. Potential sources: GEM, ERIC Thesaurus, IEEE LOM, etc. (dcterms:audience)
- Products and Services: names of products/programs & services. Potential sources: ERP system, your products and services, etc.

62 Cheap and easy metadata
Some fields will be constant across a collection. In the context of a single collection those elements add no value, but they add tremendous value when many collections are brought together in one place, and they are cheap to create and validate.
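One cheap way to exploit this is to attach the collection-level constants automatically at tagging time. The sketch below is illustrative (the defaults and field names are hypothetical, not from the workshop); item-level values always win over collection defaults.

```python
# Collection-level constants, stamped onto every item in the
# collection. The values here are placeholders.
COLLECTION_DEFAULTS = {
    "dc:publisher": "Example Corp",
    "dc:language": "en",
    "dc:rights": "Internal use only",
}

def with_defaults(item_metadata, defaults=COLLECTION_DEFAULTS):
    """Merge item metadata over collection constants; item values win."""
    merged = dict(defaults)
    merged.update(item_metadata)
    return merged
```

An item tagged only with a title still comes out carrying publisher, language, and rights, which is exactly the metadata that becomes valuable once many collections share one index.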

63 Taxonomy business processes
- Taxonomies must change, gradually, over time if they are to remain relevant.
- Maintenance processes need to be specified so that changes are based on rational cost/benefit decisions.
- A team will need to maintain the taxonomy on a part-time basis.
- The taxonomy team reports to a steering committee.

64 Controlled Vocabulary Governance Environment
[Diagram: change requests & responses flow between custodians (internal and external), a vocabulary management system holding CVs, and consuming applications: intranet search, web CMS, archives, ERMS, intranet navigation, ERP, DAM, and other controlled items.]
1: Syndicated terminologies (e.g. ISO 3166-1) change on their own schedule.
2: The CV team decides when to update CVs.
3: The team adds value via mappings, translations, synonyms, training materials, etc.
4: Updated versions of CVs are published to consuming applications.

65 Other controlled items
The taxonomy team will have additional items to manage:
- Charter, goals, performance measures
- Editorial rules
- Team processes
- Tagger training materials (manual and automatic)
- Outreach & ROI: communication plan, website, presentations, announcements, roadmap

66 Taxonomy governance | Generic team charter
The taxonomy team is responsible for maintaining:
- The Taxonomy, a multi-faceted classification scheme
- Associated taxonomy materials, such as the editorial style guide, taxonomy training materials, and the metadata standard
- Team rules and procedures (subject to CIO review)
The team evaluates the costs and benefits of suggested changes.
The taxonomy team will:
- Manage the relationship between providers of source vocabularies and consumers of the Taxonomy
- Identify new opportunities for use of the Taxonomy across the enterprise to improve information management practices
- Promote awareness and use of the Taxonomy

67 Other controlled items – editorial rules
To ensure consistent style, rules are needed. Issues commonly addressed in the rules: sources of terms; abbreviations; ampersands; capitalization; continuations (More… or Other…); duplicate terms; hierarchy and polyhierarchy; languages and character sets; length limits; "Other" – allowed or forbidden?; plural vs. singular forms; relation types and limits; scope notes; serial comma; spaces; synonyms and acronyms; term arrangement (alphabetic or …); term label order (direct vs. inverted).
The rules must also address what to do when rules conflict – which are more important?
Example rules:
- Use Existing Vocabularies: Other things being equal, reusing an existing vocabulary is preferred to creating a new one.
- Ampersands: The character '&' is preferred to the word 'and' in term labels. Example: use "Manuals & Forms", not "Manuals and Forms".
- Special Characters: Retain accented characters in term labels. Example: España.
- Serial Comma: If a category name includes more than two items, separate the items by commas. The last item is separated by '&', which is NOT preceded by a comma. Example: "Education, Learning & Employment", not "Education, Learning, & Employment".
- Capitalization: Use title case (where all words except articles are capitalized). Example: "Education, Learning & Employment", not "Education, learning & employment", "EDUCATION, LEARNING & EMPLOYMENT", or "education, learning & employment".
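Rules like these can be enforced mechanically when terms are submitted. The checker below is an illustrative Python sketch covering three of the example rules above (ampersands, serial comma, capitalization); the rule messages and the small-word list are my own simplifications.

```python
import re

# Words exempt from the title-case check (a simplification of
# "all words except articles are capitalized").
SMALL_WORDS = {"a", "an", "the", "of", "and", "or", "&"}

def check_label(label):
    """Return a list of editorial-rule violations for a term label."""
    problems = []
    if " and " in label.lower():
        problems.append("prefer '&' to 'and'")
    if re.search(r",\s*&", label):
        problems.append("serial-comma rule: no comma before '&'")
    for word in label.split():
        if word.lower() in SMALL_WORDS:
            continue
        if word[0].isalpha() and not word[0].isupper():
            problems.append(f"capitalize '{word}' (title case)")
    return problems
```

Running the checker at submission time turns the style guide from a document people forget into a gate that bad labels cannot pass.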

68 Roles in two taxonomy governance teams
Team 1:
- Taxonomy Specialist: suggests potential taxonomy changes based on analysis of query logs and indexer feedback; makes edits to the taxonomy and installs it into the system with the aid of an IT specialist.
- Content Owner: reality check on process change suggestions.
- Business Lead
- Custodians: responsible for content in a specific CV.
- Training Representative: develops communications plan and training materials.
- Work Practices Representative: develops processes, monitors adherence.
- IT Representative: backups, administration of the CV tool.
- Information Management Representative: provides CV expertise, tie-in with the larger IM effort in the organization.
- Executive Sponsor: advocate for the taxonomy team.
Team 2 (structure at a different organization; small-scale metadata QA responsibility):
- Business Lead: keeps the team on track with larger business objectives; balances cost/benefit issues to decide appropriate levels of effort (specialists help in estimating costs); obtains needed resources if the team can't accomplish a particular task.
- Technical Specialist: estimates costs of proposed changes in terms of amount of data to be retagged, additional storage and processing burden, software changes, etc.; helps obtain data from various systems.
- Content Specialist: team's liaison to content creators; estimates costs of proposed changes in terms of editorial process changes, additional or reduced workload, etc.

69 Taxonomy governance | Where changes come from
[Diagram: end users generate query logs, and tagging staff leave notes on missing concepts; query log analysis, staff notes, and the Taxonomy Editor's own experience all feed suggestions through the Taxonomy Editor to the Taxonomy Team, crossing application, tagging-UI, and firewall boundaries.]
The taxonomy must be changed over time. Suggestions for changes can come from users (through query log analysis) and from staff (through feedback notes on missing concepts). A governance structure is needed to make sure changes are justified.
Recommendations by the Editor:
Small taxonomy changes (labels, synonyms)
Large taxonomy changes (retagging, application changes)
New "best bets" content
Team considerations:
Business goals
Changes in user experience
Retagging cost
Requests from other parts of the organization
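The query-log side of this flow can be sketched simply: count the queries that match no existing taxonomy term and hand the most frequent ones to the taxonomy editor as candidates. (This is a hypothetical helper, not any particular product's API.)

```python
from collections import Counter

def missing_concepts(query_log, taxonomy_terms, top_n=5):
    """Surface frequent search queries that match no taxonomy term --
    candidates for new terms or synonyms, to be reviewed by the
    taxonomy editor before any change is made."""
    known = {t.lower() for t in taxonomy_terms}
    misses = Counter(q.strip().lower() for q in query_log
                     if q.strip().lower() not in known)
    return misses.most_common(top_n)
```

The output is only a list of suggestions: per the governance model above, the editor and team still decide which changes are justified.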

70 Principles Basic facets with identified items – people, places, projects, instruments, missions, organizations, … Note that these are not subjective “subjects”, they are objective “objects”. Clearly identify the Custodians of the facets, and the process for maintain and publishing them. Subjective views can be laid on top of the objective facts, but should be in a different namespace so they are clearly distinguishable. For example, labels like “Anarchist” or “Prime Minister” can be applied to the same person at different times (e.g. Nelson Mandela).

71 Enterprise Portal challenges when organizing content
Multiple subject domains across the enterprise: vocabularies vary, granularity varies.
Unstructured information represents about 80%.
Information is stored in complex ways: multiple physical locations, many different formats.
Tagging is time-consuming and requires SME involvement.
The portal doesn't solve the content access problem:
"Knowledge is power" syndrome: incentives to share knowledge don't exist, so the free flow of information TO the portal might be inhibited.
Content silo mentality changes slowly: What content has changed? What exists? What has been discontinued? Lack of awareness of other initiatives.
Notes: The complexity of information storage makes it a significant challenge to integrate all the data stores to act as a single seamless repository. Content silos result in poor communication among groups, and lots of extra work because one group doesn't know what another is doing or has already done. Yahoo employs a completely manual approach to tagging: all content is considered by SMEs.

72 Challenges when organizing content on enterprise portals
Lack of content standardization and consistency: content messages vary among departments; how do users know which message is correct?
Re-usability is low to non-existent.
Costs of content creation, management, and delivery may not change when the portal is implemented: similar subjects, but diverse media, diverse tools, and different users.
How will personalization be implemented?
How will existing site taxonomies be leveraged?
Taxonomy creation may surface "holes" in content.

73 Agenda 3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies 4:00 Dublin Core: Elements, issues, and recommendations 4:30 Dublin Core in the wild: CEN study and remarks 4:45 Enterprise-wide metadata ROI questions 5:00 Break 5:15 ROI (Cont.) 5:30 Business processes 6:15 Tools & technologies 6:30 Q&A 6:45 Adjourn

74 Methods used to create & maintain metadata
Paper or web-based forms are widely used:
Distributed resource-origination metadata tagging
Centralized clean-up and metadata entry
Automated tools & applications are not widely used:
Auto-categorization tools
Vocabulary/taxonomy editing tools
Guided navigation applications
Federated search and repository "wrappers"
Base: 20 corporate information managers. Source: CEN/ISSS Workshop on Dublin Core, "Guidance information for the deployment of Dublin Core metadata in Corporate Environments".
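As a concrete illustration of exposing such metadata to portals and search engines, a record can be rendered as Dublin Core meta elements in an HTML page head, using the common "DC.element" naming convention. (The helper below is a sketch; real deployments would also emit a schema link and handle qualified elements.)

```python
from html import escape

def dc_meta_tags(record):
    """Render a metadata record (element -> value) as Dublin Core
    <meta> elements for embedding in an HTML page head."""
    return "\n".join(
        '<meta name="DC.%s" content="%s">' % (element, escape(value, quote=True))
        for element, value in record.items()
    )
```

A form-based workflow like the one described above would collect the record from the author and run a generator like this at publish time.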

75 The Tagging Problem How are we going to populate metadata elements with complete and consistent values? What can we expect to get from automatic classifiers?

76 Tagging Province of authors (SMEs) or editors?
The taxonomy is often highly granular to meet task and re-use needs, and the vocabulary is dependent on the originating department. The more tags there are (and the more values for each tag), the more hooks into the content; but if there are too many, authors will resist and use "general" tags (if available). Automatic classification tools exist and are valuable, but their results are not as good as what humans can do. "Semi-automated" is best: the degree of human involvement is a cost/benefit tradeoff.
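The "semi-automated" approach can be sketched as a two-step pipeline (illustrative only): a cheap automatic pass proposes controlled-vocabulary terms, and a human reviewer has the final say.

```python
def suggest_tags(text, vocabulary):
    """Automatic pass: propose every controlled-vocabulary term that
    appears in the text. Deliberately simple and cheap."""
    text_lower = text.lower()
    return [term for term in vocabulary if term.lower() in text_lower]

def review(suggestions, accept):
    """Human-in-the-loop pass: keep only the suggestions the reviewer
    accepts, so classifier mistakes are caught before publication."""
    return [t for t in suggestions if accept(t)]
```

The split makes the cost/benefit tradeoff explicit: raising the quality of `suggest_tags` reduces the reviewer's workload, and the review step can be sampled rather than exhaustive once the suggestions are trusted.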

77 Automatic categorization vendors | Analyst viewpoint
[Chart: vendors plotted by accuracy level (high/low) against content volumes.]
Scalability requires simple creation of granular metadata and taxonomies. Better content architecture means more accurate categorization, and more precise content delivery. Surprisingly, most organizations are better off buying tools from the lower-left quadrant: their absolute accuracy is less, but they come with a lot of other features (UI, versioning, workflow, storage) that provide the basis for building a QA process.

78 Considerations in automatic classifier performance
[Chart: accuracy vs. development effort/licensing expense, with regexps at the low end, trained librarians at the high end, and a "potential performance gain" region between them.]
Classification performance is measured by "inter-cataloger agreement": trained librarians agree less than 80% of the time, and errors are either subtle differences in judgment or big goofs. Automatic classification struggles to match human performance (exception: entity recognition can exceed it). Classifier performance is limited by the algorithms available, which are in turn limited by development effort. There is very wide variance in a single vendor's performance depending on who does the implementation and how much time they have to do it. It is an 80/20 tradeoff, where 20% of the effort gives 80% of the performance: smart implementation of inexpensive tools will outperform naive implementations of world-class tools.
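In its simplest form, inter-cataloger agreement is just the fraction of items two catalogers tagged identically. (A sketch; real studies often prefer chance-corrected measures such as Cohen's kappa.)

```python
def agreement(tags_a, tags_b):
    """Simple percent inter-cataloger agreement: the fraction of items
    to which both catalogers assigned the same tag."""
    assert len(tags_a) == len(tags_b), "both catalogers must tag the same items"
    same = sum(1 for a, b in zip(tags_a, tags_b) if a == b)
    return same / len(tags_a)
```

The same measure can score an automatic classifier against a human baseline, which is how the "less than 80%" human ceiling above becomes a realistic target for tool evaluation.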

79 Tagging tool example: Interwoven MetaTagger
Manual form fill-in w/ check boxes, pull-down lists, etc. Auto keyword & summarization

80 Tagging tool example: Interwoven MetaTagger
Auto-categorization Rules & pattern matching Parse & lookup (recognize names)

81 Metadata tagging workflows
[Workflow diagram, a sample 'author-generated' metadata workflow: a copywriter composes in a template and submits to the CMS; the tagging tool (administered by a sys admin) automatically fills in metadata; an analyst/editor reviews the content and approves or edits the metadata; if there is a problem, the copy goes back for editing; otherwise it is published to hard copy and the web site.]
Even 'purely' automatic meta-tagging systems need a manual error-correction procedure, and should add a QA sampling mechanism.
Tagging models:
Author-generated
Central librarians
Hybrid: central auto-tagging service, with distributed manual review and correction
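The QA sampling mechanism mentioned above can be as simple as drawing a random fraction of auto-tagged records for manual review (the rate and names here are illustrative):

```python
import random

def qa_sample(records, rate=0.05, seed=None):
    """Draw a random sample of auto-tagged records for manual QA review.
    The sampling rate is a cost/benefit judgment call; 5% is a placeholder.
    A seed makes the sample reproducible for audit purposes."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * rate))
    return rng.sample(records, k)
```

Agreement between the sample's automatic tags and the reviewer's corrections then gives an ongoing accuracy estimate for the tagging tool.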

82 Automatic categorization vendors | Pragmatic viewpoint
[Chart: vendors plotted by accuracy level (high/low) against content volumes.]
Scalability requires simple creation of granular metadata and taxonomies. Better content architecture means more accurate categorization, and more precise content delivery. Surprisingly, most organizations are better off buying tools from the lower-left quadrant: their absolute accuracy is less, but they come with a lot of other features (UI, versioning, workflow, storage) that provide the basis for building a QA process.

83 Seven practical rules for taxonomies
1. An incremental, extensible process that identifies and enables users, and engages stakeholders.
2. A quick implementation that provides measurable results as quickly as possible.
3. Not monolithic; has separately maintainable facets.
4. Re-uses existing IP as much as possible.
5. A means to an end, and not the end in itself.
6. Not perfect, but it does the job it is supposed to do, such as improving search and navigation.
7. Improved over time, and maintained.

84 Agenda 3:30 Introductions: Us and you
3:45 Background: Metadata & controlled vocabularies 4:00 Dublin Core: Elements, issues, and recommendations 4:30 Dublin Core in the wild: CEN study and remarks 4:45 Enterprise-wide metadata ROI questions 5:00 Break 5:15 ROI (Cont.) 5:30 Business processes 6:15 Tools & technologies 6:30 Summary, Q&A 6:45 Adjourn

85 Summary: Categorize with a purpose
What is the problem you are trying to solve?
Improve search
Browse for content on an enterprise-wide portal
Enable business users to syndicate content
Otherwise provide the basis for content re-use
How will you control the cost of creating and maintaining the metadata needed to solve these problems?
CMS with metadata tagging products
Semi-automated classification
Taxonomy editing tools
Guided navigation tools

86 Contact Info Ron Daniel Joseph Busch

