Synonymy Restroom, bathroom, toilet, loo, facilities, WC, ladies’ room, men’s room, little girls’ room, little boys’ room... Synonymy: Using different words to identify the same concept.
Another vocabulary problem What is mercury? What is bank? What is python? What is java?
Polysemy Polysemy: Using the same word (morphologically speaking) to identify different concepts. Java: Island in Indonesia, variety of coffee bean, generic term for coffee, object-oriented programming language.
Yet more vocabulary problems The White House has been lobbying Congress to support the proposed budget... Freedom of the press is an important value in the United States... I’m tired of taking the bus; I need some new wheels...
Metonymy and synecdoche Metonymy: Using a related concept to stand for another concept. Synecdoche: Using the word for part of something to stand for the entire thing.
Do people label consistently? No. Furnas and colleagues asked people (including subject experts) to label a variety of items (recipes, text editing operations, “common content objects”). Surprise, there was little agreement among the names submitted by participants. Conclusion: “The idea of an ‘obvious,’ ‘self-evident,’ or ‘natural’ term is a myth! Since even the best possible name is not very useful, it follows that there can exist no rules, guidelines or procedures for choosing a good name, in the sense of ‘accessible to the unfamiliar user.’”
What to do? Furnas and colleagues suggest that interface designers: Implement unlimited aliasing. Disambiguate terms that can be used in multiple senses by presenting possibilities to users and asking them to select the appropriate one.
Limitations of the Furnas study Participants were asked to label objects, not to describe how they would search for them. The study assumes a search interface, not a browsing (or menu-driven) interface. In a search interface, users must recall or guess an object’s name. In a browsing interface, users merely need to recognize the appropriate term.
Vocabulary problems and information systems Designers of information organization systems have long grappled with the ambiguities of language. Synonymy, polysemy, and so on complicate the goal of collocating, or bringing together, like items in an information system.
Vocabulary control In LIS, vocabulary control is similar to Furnas’s idea of aliasing: concepts are associated with their synonyms. One term is designated as preferred: this is the term used in a display. Other labels associated with the concept are used in searching. Example: Search Nordstrom.com for “frock” and get “dresses” instead.
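The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular retailer or catalog implements it: the `VOCABULARY` table, the function names, and the choice of equivalent terms are all assumptions made for the example.

```python
# Hypothetical vocabulary: each preferred term maps to its equivalent terms.
VOCABULARY = {
    "dresses": ["frock", "frocks"],
    "bathroom": ["restroom", "loo", "toilet", "WC"],
}

def build_entry_index(vocabulary):
    """Map every label (preferred or equivalent) to its preferred term."""
    index = {}
    for preferred, equivalents in vocabulary.items():
        index[preferred.casefold()] = preferred
        for term in equivalents:
            index[term.casefold()] = preferred
    return index

def preferred_term(query, index):
    """Return the preferred term for a query, or None if the query is not in the vocabulary."""
    return index.get(query.strip().casefold())

INDEX = build_entry_index(VOCABULARY)
print(preferred_term("frock", INDEX))
print(preferred_term("WC", INDEX))
```

The searcher can type any label in the index, but only the preferred term is displayed. Matching here ignores case and surrounding whitespace; that normalization is a design choice of the sketch, not something the slide specifies.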
Example of a controlled term Preferred term: bathroom Equivalent terms: restroom, loo, toilet, WC, ladies’ room, mens’ room, little girls’ room, little boys’ room, ladies room, ladys room, lady’s room, ladie’s room, ladys’ room...
Equivalence can be relative Similar concepts may be treated as equivalents; this is a design decision by the vocabulary creator. Example Vocabulary includes this preferred term: Beer These terms are designated as equivalents: ale, porter, stout, pilsner, bock, IPA, malt liquor, barley wine.
Disambiguation in vocabularies Polysemous terms are often identified by adding qualifying terms in parentheses. Mercury (chemical element) Mercury (god in Roman mythology) Search engines may ask users to select the sense they want.
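One way to picture qualifier-based disambiguation is a table from each bare term to its qualified headings, with the user asked to pick when there is more than one. The sketch below assumes a hypothetical `SENSES` table and a `choose` callback that stands in for the prompt an interface would show; the specific headings are illustrative.

```python
# Hypothetical sense inventory: each bare term maps to its qualified headings.
SENSES = {
    "mercury": ["Mercury (chemical element)", "Mercury (planet)"],
    "java": ["Java (island)", "Java (coffee)", "Java (programming language)"],
    "beer": ["Beer"],
}

def resolve(term, choose=None):
    """Return one heading for a term, deferring to `choose` when the term is ambiguous.

    `choose` receives the list of candidate headings and returns one of them.
    """
    candidates = SENSES.get(term.strip().casefold(), [])
    if not candidates:
        return None                      # term is not in the vocabulary
    if len(candidates) == 1:
        return candidates[0]             # unambiguous: no need to ask
    if choose is None:
        raise ValueError(f"{term!r} is ambiguous: {candidates}")
    return choose(candidates)

# Simulate a user who picks the last option offered.
print(resolve("Java", choose=lambda options: options[-1]))
```

Unambiguous terms pass straight through, so the user is interrupted only when the vocabulary actually records more than one sense.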
Digression into the library catalog Library catalogs have three traditional access points: author, title, and subject. In the old card catalog, these were the three ways that users could search. Each of these access points has associated vocabulary control.
Control of names In library cataloging, controlled vocabularies for authors, titles, and subjects are called authority files. Authority files both disambiguate names that identify multiple people or items and group variations for the same person or item (that is, they deal with polysemy and synonymy).
Authority file examples Example: headings for Patricia Williams in the UT author authority file. Names are disambiguated by using middle initials and dates of birth. Cross references are used for some authors. There may still be two headings for one person.
Fun digression: Pseudonyms in the catalog The current catalog maintains pseudonymous identities (in older catalogs, everything went under the author’s real name). For example, “Carolyn Keene,” the name used by multiple people as the author for the Nancy Drew novels, is maintained as an author entity in the authority file.
Thesauri Thesauri are a type of controlled vocabulary that include equivalence, hierarchical, and associative relationships. Thesauri can also be faceted (that is, represent multiple aspects of a concept...we will discuss facets in depth later). Thesauri are often developed to deal with subjects of documents, and we will talk a lot about this beginning in a few weeks.
Example thesaurus entry
Dark chocolate
BT Chocolate
RT Single-origin chocolate
UF Semisweet chocolate
   Baker’s chocolate
   Sweet chocolate
SN Chocolate without milk solids and with less than 70 percent chocolate mass.
BT: Broader term, one level up in a hierarchy
RT: Related term, in another facet or hierarchical branch
UF: Use for; synonyms, or non-preferred terms
SN: Scope note; definitions or usage guidelines
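The entry above can also be read as a small record with one field per relationship type. The following sketch assumes a hypothetical `ThesaurusEntry` class and helper functions; it is not the schema of any real thesaurus tool, only a way to make the BT, RT, UF, and SN roles concrete.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    preferred: str
    broader: list = field(default_factory=list)   # BT: one level up in the hierarchy
    related: list = field(default_factory=list)   # RT: associative relationships
    use_for: list = field(default_factory=list)   # UF: non-preferred equivalents
    scope_note: str = ""                          # SN: definition or usage guideline

dark_chocolate = ThesaurusEntry(
    preferred="Dark chocolate",
    broader=["Chocolate"],
    related=["Single-origin chocolate"],
    use_for=["Semisweet chocolate", "Baker's chocolate", "Sweet chocolate"],
    scope_note="Chocolate without milk solids and with less than 70 percent chocolate mass.",
)

def format_entry(entry):
    """Render an entry in a tagged layout, one term per line."""
    lines = [entry.preferred]
    for tag, terms in (("BT", entry.broader), ("RT", entry.related), ("UF", entry.use_for)):
        lines.extend(f"{tag} {term}" for term in terms)
    if entry.scope_note:
        lines.append(f"SN {entry.scope_note}")
    return "\n".join(lines)

def build_use_index(entries):
    """Map each non-preferred (UF) term to the preferred term that replaces it."""
    return {term.casefold(): e.preferred for e in entries for term in e.use_for}

print(format_entry(dark_chocolate))
print(build_use_index([dark_chocolate])["sweet chocolate"])
```

The UF index handles the equivalence relationship, directing a search for a non-preferred term to its preferred term. The BT and RT fields are what let a thesaurus support browsing up a hierarchy or across to related concepts.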
Controlled vocabulary example: MeSH and PubMed Medical Subject Headings (MeSH) are used to index journal articles in the PubMed database. Keyword searches in PubMed are automatically expanded with MeSH. Searches can also be explicitly limited to MeSH terms, which can increase precision. The comparison to a system like Google Scholar is illuminating.
Summary Controlled vocabularies increase precision and recall in searching by identifying equivalent terms. Authority files are types of controlled vocabularies. Thesauri are subject-based controlled vocabularies that include hierarchical and associative relationships in addition to equivalence relationships. Thesauri can also be used as browsing interfaces.
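To make the recall claim concrete, here is a toy calculation. Precision is the share of retrieved items that are relevant; recall is the share of relevant items that are retrieved. The four-document collection, the relevance judgments, and the `EQUIVALENTS` table are all invented for illustration, so the numbers show the mechanism and nothing more.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection and hand-assigned relevance for the concept "bathroom".
DOCS = {
    1: "restroom closed for cleaning",
    2: "the loo is down the hall",
    3: "bathroom renovation tips",
    4: "budget meeting agenda",
}
RELEVANT = {1, 2, 3}
EQUIVALENTS = {"bathroom": {"bathroom", "restroom", "loo"}}

def search(query, expand=False):
    """Match documents on whole words, optionally expanding the query with equivalents."""
    terms = EQUIVALENTS.get(query, {query}) if expand else {query}
    return {doc_id for doc_id, text in DOCS.items() if terms & set(text.split())}

print(precision_recall(search("bathroom"), RELEVANT))
print(precision_recall(search("bathroom", expand=True), RELEVANT))
```

In this toy collection the plain keyword search finds only document 3, so recall is one third; with equivalent terms added, all three relevant documents are found and recall reaches 1.0. Precision stays at 1.0 in both runs, so the sketch demonstrates only the recall effect. Gains in precision would come from disambiguating polysemous terms, which this example does not model.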