Quality Taxonomies. Jim Nisbet, Senior Vice President of Technology, Semio Corporation. Knowledge Technologies 2001, March 5th, 2001
Ontology / Taxonomy Root Ontology Taxonomy Generation Static Discovery Dynamic Discovery
What is Quality? “Best value for the money.” Under this definition, a costly product is expected to deliver high performance, and a low-cost product or service is expected to deliver less. For example, a loose demo delivery is both predictable and acceptable, because its quality is low conformance at low cost.
What is Quality? “Good Quality is Nominal Conformance.” Taxonomy quality is defined as taxonomy conformance to: valid requirements; explicitly documented development standards; and implicit characteristics expected of all professionally developed taxonomies, such as good maintainability.
Standards
ISO International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Monolingual Thesauri. 2nd ed. n.p.: ISO, (ISO (E)). (Available in the U.S. from American National Standards Institute)
ISO International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Multilingual Thesauri. n.p.: ISO, (ISO (E)). (Available in the U.S. from American National Standards Institute)
ANSI/NISO Z National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, p. (ANSI/NISO Z )
SEMIO Quality Plan v
ISO/IEC Topic Maps
RDF
Please refer to RDF at and XML at
Project Plan 1.Kick-off 2.Requirements Review 3.Lexicon Review 4.Taxonomy Review 5.Tags Review 6.Final Review
1. Kick-off Objectives Purpose Scope Scale Users Conditions of receipt Roles Supplier Customer –Admin –KE –Experts –Users Planning Training and Transfer
2. Requirements Review Sources Lexicon Ontology Install
Sources Dispersion (Multiplicity, Size, Homogeneity) Refresh Access
Typical Patterns Disparity: adjust sources; adjust crawl strategy; isolate communities / taxonomies
Lexicon Vocabularies, etc. Substitutions: Acronyms, Synonyms, etc. Preferred Keywords: Brand Names, etc. Banned Keywords
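The substitution and banned-keyword rules on this slide can be illustrated with a small sketch. It assumes the lexicon is a plain list of strings; the function name `normalize_terms` and the sample data are hypothetical and do not come from any Semio product. Preferred keywords are left out to keep the example short.

```python
def normalize_terms(terms, substitutions, banned):
    """Apply substitutions, then drop banned keywords and duplicates.

    Order of first appearance is preserved. All matching is done on
    lower-cased, whitespace-trimmed text (an assumption of this sketch).
    """
    seen = set()
    result = []
    for term in terms:
        key = term.strip().lower()
        # Map acronyms and synonyms onto their canonical form.
        key = substitutions.get(key, key)
        if key in banned or key in seen:
            continue
        seen.add(key)
        result.append(key)
    return result


# Hypothetical sample data.
SUBSTITUTIONS = {"a/c": "air conditioning", "ac": "air conditioning"}
BANNED = {"click here", "home page"}

terms = ["A/C", "Air Conditioning", "click here", "Brake System"]
print(normalize_terms(terms, SUBSTITUTIONS, BANNED))
```

Applying substitutions before the banned-keyword check means a banned canonical form also removes every synonym that maps to it.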
Typical Patterns Lack of requirements: use librarian resources
Ontology Thesaurus? Is the information domain analysis complete, consistent, and accurate? Is the partitioning of the problem complete?
Typical Patterns Directory versus Taxonomy: isolate “directory” branches. Thesaurus versus Taxonomy: put an ontology on top of the thesaurus; check as early as possible that the thesaurus generics match the extracted lexicon. Very high-level design for top-category requirements: plan to work bottom-up. See also Taxonomy (functions, combinations, etc.)
Install Implementation / Integration: Are external and internal interfaces properly defined? Are all requirements traceable to the system level? Has prototyping been conducted for the user/customer? Is performance achievable within the constraints imposed by other system elements? Are requirements consistent with schedule, resources, and budget?
Typical Patterns Scale Security Missing Documents
3. Lexicon Review Coverage Extracted words / Words (Extracted Index / Index) Sources benchmarking Coverage Extraction quality Topic distribution Structure Most Frequent Phrases Most Productive Generics Substitutions Exceptions
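The "Extracted words / Words" ratio can be sketched as below. This is one possible reading of the slide, assuming coverage is measured over distinct lower-cased words; the function name `coverage_ratio` and the sample data are hypothetical.

```python
def coverage_ratio(extracted_words, corpus_words):
    """Distinct extracted words divided by distinct corpus words."""
    corpus = {w.lower() for w in corpus_words}
    if not corpus:
        return 0.0
    # Only count extracted words that actually occur in the corpus.
    extracted = {w.lower() for w in extracted_words} & corpus
    return len(extracted) / len(corpus)


# Hypothetical sample: five distinct corpus words, two of them extracted.
corpus_words = ["loan", "term", "loan", "rate", "the", "of"]
extracted_words = ["loan", "rate"]
print(coverage_ratio(extracted_words, corpus_words))
```

The same function can be run per source to benchmark sources against each other.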
Typical Patterns Low level of frequency / quality for the most meaningful content: increase size of value corpus; filter and re-import lexicon
4. Taxonomy Review Taxonomy Operation: Correctness, Reliability, Usability, Integrity, Efficiency. Taxonomy Revision: Maintainability, Flexibility, Testability. Taxonomy Transition: Portability, Reusability, Interoperability
Tax Liability Loan Term loan Short-term loan Unique Beginner Life Form Generic Specific Varietal Folk Taxonomies Design The Berlin and Kay model: Taxonomy = Nomenclature + Terminology
Correctness Accuracy Completeness Consistency
Accuracy Precision Recall
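Precision and recall for a single category can be computed from two document sets, as in the sketch below. It assumes a hand-built set of relevant documents is available for comparison; the function name and document identifiers are hypothetical.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for one taxonomy category.

    assigned: documents the taxonomy placed under the category.
    relevant: documents a reviewer judged to belong there.
    """
    assigned, relevant = set(assigned), set(relevant)
    true_positives = len(assigned & relevant)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical sample: 4 documents assigned, 3 relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
print(p, r)
```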
Completeness Taxonomy Maps Lexicon Collection
Concentration Works Against Quality Lexicon Document Collection Maps Taxonomy Tagging Tagging Coverage Ontology Coverage Hook Coverage Map Coverage Lexical Coverage Collection Coverage
Consistency: Typical Patterns Objectivization Hyperonymy Speciation Necessity
Objectivization Employment Firing Hiring Salaries Avoid functional categories Don’t mix functions / objects Exhaust scripts Match idiomatic phrases
Genericity Parts Air Conditioning Belts and Hoses Body Brake System Chassis Engine Exhaust System Fuel System Glass Ignition Avoid meronymy Don’t mix meronymy / hyperonymy Exhaust prototypes
Speciation Person Unwelcome person Unpleasant person Selfish person Opportunist Backscratcher Avoid “strings” of categories Avoid (non-idioms) properties for categories (WordNet)
Necessity Avoid non-productive categories Avoid combinations of categories
Nomenclature (Design Structure) Quality Index Depth Width Balance
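The slide does not define how depth, width, and balance are measured, so the sketch below uses assumed definitions: depth is the number of levels, width is the largest number of categories on one level, and balance is the shallowest leaf depth divided by the deepest. The taxonomy is a hypothetical parent-to-children mapping and is assumed to be an acyclic tree.

```python
def structure_metrics(tree, root):
    """Compute depth, width, and balance of a taxonomy tree.

    tree maps each category to a list of its child categories;
    categories without an entry are treated as leaves.
    """
    level = [root]
    widths = []
    leaf_depths = []
    depth = 0
    while level:
        depth += 1
        widths.append(len(level))
        next_level = []
        for node in level:
            children = tree.get(node, [])
            if children:
                next_level.extend(children)
            else:
                leaf_depths.append(depth)
        level = next_level
    return {
        "depth": depth,
        "width": max(widths),
        "balance": min(leaf_depths) / max(leaf_depths),
    }


# Hypothetical taxonomy fragment.
tree = {
    "Liability": ["Loan", "Tax"],
    "Loan": ["Term loan"],
    "Term loan": ["Short-term loan"],
}
print(structure_metrics(tree, "Liability"))
```

A balance close to 1 means all branches bottom out at similar depths; a low value flags branches that are much deeper than their siblings.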
Complexity Index Cyclomatic complexity increases with the number of cross references within the taxonomy, giving an indication of complexity and difficulty of testing. Taxonomy Complexity Index combines: autonomy, closure, similarity, typicality, commonality, redundancy, stability
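The link between cross references and cyclomatic complexity can be shown with McCabe's graph formula, V(G) = E - N + 2P, where E is the number of edges, N the number of nodes, and P the number of connected components. Treating the taxonomy as a graph whose edges are hierarchy links plus cross references is an assumption of this sketch; the function name and sample data are hypothetical.

```python
def cyclomatic_complexity(nodes, hierarchy_links, cross_references, components=1):
    """McCabe's V(G) = E - N + 2P applied to a taxonomy graph."""
    edges = len(hierarchy_links) + len(cross_references)
    return edges - len(nodes) + 2 * components


# Hypothetical taxonomy: a 4-node tree has 3 hierarchy links.
nodes = ["Liability", "Loan", "Tax", "Term loan"]
hierarchy = [("Liability", "Loan"), ("Liability", "Tax"), ("Loan", "Term loan")]
cross_refs = [("Tax", "Loan"), ("Term loan", "Tax")]

print(cyclomatic_complexity(nodes, hierarchy, []))          # pure tree
print(cyclomatic_complexity(nodes, hierarchy, cross_refs))  # with cross references
```

For a single connected tree the value is 1, and each added cross reference raises it by one.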
Maturity Index The IEEE standard suggests a taxonomy maturity index to provide an indication of the stability of the taxonomy. Maturity Index combines: the number of modules in the current ontology / taxonomy; the number of modules in the current ontology / taxonomy that have been changed; the number of modules added to the current ontology / taxonomy; and the number of modules deleted from the previous version of the ontology / taxonomy.
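The four counts above combine in the formula usually quoted for the IEEE software maturity index, SMI = (M_T - (F_a + F_c + F_d)) / M_T. Applying it unchanged to taxonomy modules is an assumption of this sketch, and the function name and sample numbers are hypothetical.

```python
def maturity_index(total_modules, changed, added, deleted):
    """Maturity index: share of modules untouched since the previous version.

    total_modules: modules in the current ontology / taxonomy (M_T).
    changed: modules in the current version that were changed (F_c).
    added: modules added to the current version (F_a).
    deleted: modules deleted from the previous version (F_d).
    """
    if total_modules <= 0:
        raise ValueError("total_modules must be positive")
    return (total_modules - (added + changed + deleted)) / total_modules


# Hypothetical release: 120 modules, 10 changed, 5 added, 3 deleted.
print(maturity_index(120, changed=10, added=5, deleted=3))
```

The index approaches 1.0 as the taxonomy stabilizes from one version to the next.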
5. Tags Review Document coverage Concepts coverage Liability Federal Funds 0.746
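Document coverage and concept coverage can be estimated from scored tags, as in the sketch below. The definitions are assumptions: a tag counts only when its score reaches a chosen threshold, document coverage is the share of documents with at least one such tag, and concept coverage is the share of categories used by at least one such tag. The data layout, threshold, and names are hypothetical.

```python
def tag_coverage(doc_tags, categories, threshold=0.5):
    """Return (document_coverage, concept_coverage).

    doc_tags maps a document id to a list of (category, score) pairs.
    """
    tagged_docs = 0
    used_categories = set()
    for tags in doc_tags.values():
        kept = [cat for cat, score in tags if score >= threshold]
        if kept:
            tagged_docs += 1
            used_categories.update(kept)
    doc_cov = tagged_docs / len(doc_tags) if doc_tags else 0.0
    used = used_categories & set(categories)
    concept_cov = len(used) / len(categories) if categories else 0.0
    return doc_cov, concept_cov


# Hypothetical tagging results.
doc_tags = {
    "doc1": [("Liability", 0.746)],
    "doc2": [("Loan", 0.20)],
    "doc3": [],
}
categories = ["Liability", "Loan", "Tax", "Term loan"]
print(tag_coverage(doc_tags, categories))
```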
6. Final Review Receipt Maintenance