Download presentation
Presentation is loading. Please wait.
Published byWilfrid Small Modified over 6 years ago
1
INFO 202 “Information Organization & Retrieval” Fall 2014
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION INFO 202 “Information Organization & Retrieval” Fall 2014 Robert J. Glushko @rjglushko The Most Important Slides (from lectures 1-11) If it is here, you need to know it… You might need to know other things as well, but make sure you know this…
2
UNIVERSITY OF CALIFORNIA, BERKELEY
SCHOOL OF INFORMATION The Course In One Slide To organize is to create capabilities by intentionally imposing order and structure We organize things, we organize information, we organize information about things, and we organize information about information If we think abstractly about these activities, we can see commonalities that outweigh their differences; We select, organize, interact with, and maintain resources We organize resources as individuals, in informal association with other individuals, or as part of a more formal institutional or business context The methods and mechanisms we use to organize, and the tradeoffs we make, vary systematically in different domains and contexts We must recognize the profound impact of new technologies and their co-evolution with the nature of the organizing we do and the kinds of interactions that this organizing enables, but can't ignore the "classical" concepts and knowledge We also need to be skeptical of new information technology because it doesn't automatically solve all of our organizing problems 2
3
Motivating the Concept of “Organizing System”
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Motivating the Concept of “Organizing System” We can emphasize how all of these domains and types of collections differ… or we can emphasize what they have in common They are all “Organizing Systems” A collection of resources Intentionally arranged To enable some set of interactions 3
4
UNIVERSITY OF CALIFORNIA, BERKELEY
SCHOOL OF INFORMATION Key Concepts [1] RESOURCES are “anything of value that can support goal-oriented activity” A COLLECTION is a group of resources that have been selected for some purpose 4
5
UNIVERSITY OF CALIFORNIA, BERKELEY
SCHOOL OF INFORMATION Key Concepts [2] INTENTIONAL ARRANGEMENT captures the idea that the system requires explicit or implicit acts of organization by AGENTS – human or computational ones If the organization isn’t intentional, we can’t reuse or recreate it with our own efforts These arrangements follow or embody one or more ORGANIZING PRINCIPLES, ideally expressed in an implementation-neutral way 5
6
Organizing Principles [1]
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Organizing Principles [1] ORGANIZING PRINCIPLES use properties or DESCRIPTIONS that are associated with the resources; organizing and describing resources are inherently interconnected activities Almost any property of a resource might be used as a basis for an organizing principle, and multiple properties are often used simultaneously 6
7
Organizing Principles [2]
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Organizing Principles [2] For physical resources the properties are often perceptual, material ones, or task-oriented ones For information resources the properties are often semantic ones Some principles are domain-specific, but others can apply generally 7
8
Organizing Principles [3]
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Organizing Principles [3] Other typical arrangements are based on ownership, origin, taxonomic, or “taskonomic” properties (usage frequency, correlated usage) Any resource with a orderable name or identifier can have alphabetic or numeric ordering Any resource with an associated date (creation, acquisition) can have chronological ordering Principles should be expressed logically in a way that doesn’t assume an implementation 8
9
UNIVERSITY OF CALIFORNIA, BERKELEY
SCHOOL OF INFORMATION Principles don’t Specify Implementation: “Organize Spices Alphabetically” 9
10
The Three-Tier Architecture
Organizing Principles are logically separated from Implementation and Presentation Tiers
11
A “Design Space” or “Dimensional” Perspective
In addition to using categories like Library or Museum or Business Information System, consider a specific organizing system as a point in a multidimensional design space and these categories as regions in that space... This treats the familiar categories as “design patterns” that embody typical configurations of design choices 11
12
Consequences of Dimensional Thinking
Overcomes the bias and conservatism inherent in familiar categories Design patterns support multi-disciplinary work that cuts across familiar categories and applies knowledge about them to new domains Creates a design vocabulary for translating concepts and concerns from category and discipline-specific vocabularies 12
13
The 5 Dimensions of an Organizing System
What Is Being Organized? Why Is It Being Organized? How Much Is It Being Organized? When Is It Being Organized? Who (or What) is Organizing It? 13
14
1. What Is Being Organized?
Identifying the unit of analysis is a central problem in every intellectual or scientific discipline - and in every organizing system Resources that are aggregates or composites of other resources, or that have internal structure, or that can have many attributes, pose questions about the granularity of their "thingness” 14
15
2. Why Is It Being Organized?
The essential purpose of every Organizing System is to "bring like things together and differentiating among them” – enabling generic requirements of resource discovery, identification, access… But there are always more precise requirements and constraints to satisfy and more specific kinds of interactions to support Different stakeholders might not agree on these requirements, making it necessary to use multiple and possibly incompatible resource descriptions and organizing principles 15
16
Interactions –The Why of Organizing Systems
INTERACTIONS include any activity, function, or service supported by or enabled with respect to the resources in a collection or with respect the collection as a whole Interactions can include access, reuse, copying, transforming, translating, comparing, combining, visualizing, recommending… anything that a person or process can do with the resources… 16
17
3. How Much Is It Being Organized?
Not everything is equally organizable, because not everything is equally describable A controlled vocabulary can yield more consistent organization Are you organizing the resources you have, or do you need to create an organizing system that can apply to resources that you have not yet collected? The scope and size of a collection shapes how much it needs to be organized Making resources “smart” increases how much they can be organized 17
18
4. When Is It Being Organized?
When the resource is created When it is added to some collection Just in time Never All the time - continuous or incremental 18
19
5. Who or What Is Organizing?
Authors or creators Professional organizers Users “in the wild” Users "in institutional contexts“ Automated or computerized processes 19
20
The Organizing System Lifecycle: 4 Phases
Defining and scoping the domain Identifying requirements Design and implementation Operations and maintenance These phases are brief and mostly inseparable for some simple organizing systems, more sequential for others, and more systematic and iterative for complex organizing systems 20
21
Expected Lifetime Usually correlated with number of users and resources; small collections with single users are often ad hoc and don’t outlive the specific tasks for which they were created Think architecturally to enable more robustness and flexibility with respect to changes in technology or business contexts 21
22
Physical or Technological Environment
Physical environments create possibilities for organizing and interaction But there might be constraints that limit them “Installed base” or “legacy” concerns Be careful if you are designing or selecting technology for an organizing system Avoid resource-specific investments unless absolutely justified 22
23
Relationships with Other Organizing Systems
No organizing system exists in isolation There are always overlapping and adjacent systems from the user’s perspective Sometimes it is possible to design these to work together to achieve integration and interoperability, or standards / ecosystem can emerge At very least, try not to make this difficult 23
24
The Fundamental Tradeoff in an Organizing System
There is a tradeoff between the amount of work that goes into describing and organizing a collection of resources and the amount of work required to find and use them The more effort we put into describing and organizing resources, the more effectively they can support interactions The more effort we put into retrieving resources, the less they need to be organized first 24
25
The Fundamental Tradeoff in an Organizing System
We need to think in terms of investment, allocation of costs and benefits between the describer/organizer and user The allocation differs according to the relationship between them; who does the work and who gets the benefit? 25
26
The Activities in Organizing Systems
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION The Activities in Organizing Systems We can identify four activities in the lifecycle of every organizing system: Selecting resources Organizing resources Supporting resource-based interactions and services Maintaining resources 26
27
UNIVERSITY OF CALIFORNIA, BERKELEY
SCHOOL OF INFORMATION Selecting Resources SELECTION is the process by which resources are identified, evaluated, and added to a collection Selection is by definition an intentional process Selection methods and criteria vary across domains (as do the words that describe them); these can be subjective or objective What are some synonyms or near synonyms for selection? 27
28
Selection Principles Utility or relevance Intrinsic value
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Selection Principles Utility or relevance Intrinsic value Scarcity or uniqueness To support social or symbolic goals To establish a brand or reputation … many others 28
29
UNIVERSITY OF CALIFORNIA, BERKELEY
SCHOOL OF INFORMATION Organizing Resources Almost any property of resources can be used to organize them, but the most appropriate or effective properties differ across resource types and tasks Are you organizing the resources you have, or do you need to create an organizing system that can apply to resources that you have not yet collected or that might not even exist yet? 29
30
Interactions in Organizing Systems
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Interactions in Organizing Systems INTERACTIONS include any activity, function, or service supported by or enabled with respect to the resources in a collection or with respect the collection as a whole ACCESS is the fundamental interaction in all organizing systems – you wouldn’t organize anything you didn’t expect to access again – but most organizing systems have additional interactions to make access more efficient and to create additional value with the resources 30
31
Interactions Some interactions can be enabled with any type of resource, while others are tied to resource types Interaction can be direct, mediated or indirect, or limited to interactions with resource copies or descriptions The supported interactions depend on the nature and extent of the resource descriptions and arrangement Finding the optimal descriptions is an important goal but not always possible 31
32
The Choice of Properties Matters!
Organization using behavior or preferences can make interactions highly efficient for some users and the opposite for those who act differently When multiple properties are used to organize resources, their order determines the arrangement and the ease of interactions This implies the need to prioritize user and interaction types 32
33
The Four Big Ones for Libraries…
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION The Four Big Ones for Libraries… FINDING resources IDENTIFYING resources to confirm that that a found resource is the intended one by distinguishing it from similar ones SELECTING a resource from a collection when more than one might seem to fit the search criteria OBTAINING the resource if it isn’t immediately available 33
34
Digitization and Interactions
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Digitization and Interactions Digitization is the process of transforming a resource whose original format is physical so that it can be stored and manipulated by a computer Digitizing resource descriptions (e.g. bibliographic catalogs) for library resources can facilitate finding, identifying, and selecting interactions even if the resources themselves are not digitized 34
35
Digitization and Interactions
UNIVERSITY OF CALIFORNIA, BERKELEY SCHOOL OF INFORMATION Digitization and Interactions The variety and functions of interactions with digital resources are determined by the amount of structure and semantics in their encoding Access controls for the interactions involving digital resources can be very precisely defined and enforced 35
36
The Document Type Spectrum
UNIVERSITY OF CALIFORNIA, BERKELEY The Document Type Spectrum SCHOOL OF INFORMATION From Glushko & McGrath, DOCUMENT ENGINEERING, MIT Press 2005 36
37
The Document Type Spectrum: A Continuum
There is systematic and continuous variation in document instances and types and there is no clear boundary between “documents” and “data” But there are characteristic types of content, organizing structures, and presentation rules that characterize different regions of the spectrum Possible interactions with documents are largely determined by their location on the document type spectrum 37
38
Search in Organizing Systems
People often search for resources – especially information resources - that let them make better decisions or act more effectively "Search" is not a single type of interaction; because of the range of information finding requirements that users have, it is better to view it as continuum of interactions that need to be supported 38
39
Search in Organizing Systems
We can caricature two endpoints on a continuum of reasons for searching and what they're searching for They search for answers to specific questions They search to "find out about" 39
40
Understanding Recall and Precision
40
41
High Recall with Low Precision
41
42
Low Recall with High Precision
42
43
High Recall with High Precision
43
44
Iterative Models of Search
IR is an active process of Finding Out About It involves making hypotheses, evaluating whether process is on a useful path, proceeding forward (and backward) Users can recognize elements of a useful answer, even when incomplete Questions and understanding changes as the process continues 44
45
Search as Problem Solving, Iterative Model
“Information-seeking is a special case of problem solving. It includes recognizing and interpreting the information problem, establishing a plan of search, conducting the search, evaluating the results, and if necessary, iterating through the process again.” Figure 3.1 in Hearst, M. (2009). Search user interfaces. Cambridge University Press. 45
46
"Information Architecture" and Organizing Systems
An architecture describes a system's components (or "building blocks") and their relationships with each other This gives an "architected" system an explicit model, in contrast with systems that are implemented incrementally without a master plan 46
47
"Information Architecture" and Organizing Systems
Abstract patterns of information content or organization are sometimes called architectures, which means that IA and Organizing Systems could have very similar definitions: TDO "IA is designing an abstract and effective organization of information and then exposing that organization to facilitate navigation and information use“ 47
48
Maintaining Resources
Maintaining includes any activity whose purpose is to ensure that a resource will be available at some future time for use or reuse We can predict future use for some kinds of resources but not for others Note that the intended future "user" of a resource might be a computational process or service For many kinds of resources, you need to preserve more than the "primary resource" for it to have future use Maintenance has many synonyms or near-synonyms across domains: Preservation, curation, governance, storage… 48
49
What is a Resource? When you look at, refer to, or pick up a natural physical object, it usually seems that it is "one thing" Many things have highly visible parts or internal structures, and we must consider whether to treat them as assemblies or aggregations of these separate things For digital and computational resources, it is much harder to know how many parts we can or should identify Once you've decided what to treat as a resource, you can create a name or identifier so you can refer to it 49
50
Granularity: Whole vs. Parts
Resources that are aggregates or composites of other resources, or that have internal structure, pose questions about the granularity of their "thingness" We might need to organize and manage the granular resources, the composite resources, and the relationships between them - all at the same time The "part things" can be identified at various levels of contexts/containers/collections in which they occur The granularity of physical resources is more easily determined that for information resources 50
51
Abstraction - 1 We can identify a resource as a unique instance or as a member of a class of resources The size of this class - the number of resources that are treated as equivalent - is determined by the properties or characteristics we consider when we examine any instance The choice of these properties depends on context and intent, so the same resource can be identified abstractly in some situations and very concretely in others 51
52
Abstraction - 2 For tangible resources, there may be "natural" levels of abstraction (See Plato…) For intangible resources, there aren't 52
53
The “Abstraction Hierarchy” of the “Work”
WORK - an abstract entity; the distinct intellectual or artistic creation; it has no single material manifestation EXPRESSION - the multiple realizations of a work in some particular medium or notation, where it can actually be perceived MANIFESTATION - each of the formats of an expression that have the same appearance; but not necessarily the same implementation ITEM - a single exemplar of a manifestation; if we distinguish this level it is because otherwise identical manifestations have some differentiation 53
54
An Introduction to Modeling - 1
Models are simplified descriptions of a subject that abstract from its complexity to emphasize some features or characteristics while intentionally de- emphasizing others Models enable us to describe and communicate systems regardless of the specific domain or discipline that they represent 54
55
An Introduction to Modeling - 2
A model can represent a human activity, a natural system, or a designed system We can model structures – objects, their characteristics, their static relationships with each other like hierarchy, and reference We can model functions, processes, behaviors – dynamic activities that create and affect structures 55
56
The Classical Modeling Approach
56
57
Adapting the Classical Modeling Approach to TDO – 1
In an Organizing System context we can analyze a domain and analyze its resources and their processes or behaviors as we normally would But taking an architectural perspective that distinguishes organizing principles from implementation means that we emphasize the information or intangible content of the resources We only model those resources and their properties that are relevant to organizing them 57
58
Adapting the Classical Modeling Approach to TDO – 2
In some Organizing System contexts we start with a collection of resources to organize So we analyze the domain and the resources to determine appropriate properties to use in descriptions and organizing principles as we would in any other modeling activity But sometimes we concurrently design the Organizing System and the resources it will contain Systematic modeling of information components and documents to enable efficient reuse and transformation is essential 58
59
Model the Information Not the Thing <Book>
<Title>The Discipline of Organizing</Title> <Editor>Robert J. Glushko</Editor> <Publisher>MIT Press</Publisher> <PublicationDate>2013</PublicationDate> <ISBN> </ISBN> </Book> Not the Thing <Book> <Weight unit=“pound”>2.58</Weight> <Height unit=“inch”>9.25</Height> <Width unit=“inch”>8.25</Width> <Pages>540</Pages> <SpineColor>White</SpineColor> </Book> 59
60
Four Distinctions About Resources
60
61
Resource Domain Physical resources - often distinguished by material of composition and other visible properties Bibliographic resources - traditional focus of library and information science; most commonly used cataloguing rules have 11 categories that don't cleanly distinguish semantics and formats Information resources - the content of information- intensive artifacts and applications - "Document types" 61
62
Resource Format in TDO 62
63
Format Matters! 63
64
Resource Focus Any resource can have other resources associated with it, usually to describe it in some way to facilitate finding it, interacting with it, or interpreting it So we can designate a resource as a primary one or as a description resource Description resources are often called METADATA Metadata often defined as “data about data” but all three parts of that definition are wrong 64
65
Format x Focus 65
66
Resource Agency Passive or operand resources ("nouns") must be acted upon or interacted with to produce an effect Traditional organizing systems contain tangible and static resources that are "natural operands" Active or operant resources ("verbs") create effects or value on their own, sometimes when then initiate interactions with operand resources. Sensors, web-based services, information feeds are operant resources that often are combined to implement business processes or business models 66
67
Smart Things Sensors, RFID tags, GPS, other computation and communication built into physical objects (and phones) to identify them or their context Technical challenges in sorting and identifying the objects in the "information torrent" especially when the objects exhibit some intentional or discretionary behavior In an "Internet of Things" any object can have an IP address that uniquely identifies it 67
68
Resources over Time Organizing systems continually need to adapt and evolve in response to changes in content and context New resources are created or added, and other ones are destroyed or culled Resources might change in controlled ways or in uncontrolled ways The organizing system might need automated or formal mechanisms for ensuring persistence and authenticity 68
69
Putting them all Together
TDO Figure 3.6 69
70
What is a Name? A NAME is a label for some thing or some category that is used to distinguish one from another Many surnames are literally resource descriptions But most resource names do not convey properties of the resource If a name is used to refer to some thing and is unique in some context it is an IDENTIFIER 70
71
Issues with Names and Naming – 1
A specific resource or type or resource can often have multiple names; these are SYNONYMS or ALIASES Different things can sometimes have the same names -- these are HOMOGRAPHS or POLYSEMES 71
72
The Vocabulary Problem
People use a large variety of words for the same thing or concept Most people - especially system designers - are surprised by this because they think their own word choices are “intuitive” or "natural“ The extreme variability of word selection is an inescapable fact that has its roots in the nature of language and categorization 72
73
Identifiers - 1 An identifier is a special type of name assigned in a controlled way and governed by rules that define possible values and naming conventions; "some person or organization asserts the relationship between the string and the thing” (Coyle, 2006, p. 428) The same resource can have more than one identifier How many different identifiers can a person have? What if the answer is “none”? 73
74
Identifiers - 2 Identifiers are UNIQUE if they refer to one and only one resource within some defined context or scope Identifiers are PERSISTENT if they resolve to the same referent indefinitely, or as long as needed; but "persistence is a functions of organizations, not technology 74
75
Three Types of "Stuff“ or Kinds of Information
Content - "what does it mean" information Structure - "where is it" or "how it is organized or assembled" information Presentation - "how does it look" or "how is it displayed" information 75
76
Information (Content) Components
Information components - especially standardized ones - are semantic building blocks that can be assembled into different document types A component architecture for information is essential to enable efficient reuse, integration and operation of information-intensive organizing systems 76
77
Identifying Information Components
Any piece of information that can be addressed and manipulated by a person or process as a discrete entity Any piece of information with a unique name or identifier Any piece of information that is self-contained and comprehensible on its own 77
78
Separating Content from Structure and Presentation
An essential goal of modeling is to separate what something means from its structure and presentation We're expressing a distinction between information as conceptual or as content: and the physical container or medium, format, or technology in which the information is conveyed Separating content from its structure and presentation is the most important principle of Information Architecture 78
79
An Overview of Resource Description
What is a resource description? Why do we describe resources? What resource properties should be described? How are resource descriptions created? What makes a good resource description? 79
80
What is a Resource Description?
Information that is created intentionally and associated with a resource to enable it to be organized and interacted with A resource description is also a resource from the perspective of any other resource that uses it, ad infinitum Structured descriptions of information resources are often called “metadata” 80
81
Resource Descriptions as Substitutes
A resource description is often a functional substitute for the resource it describes when the latter can’t be accessed or used Computer-processable descriptions of physical resources are especially good substitutes because they enable manipulations and interactions otherwise impossible 81
82
Why We Describe Resources
We describe resources so we can refer to them, organize them, and interact with them Each purpose might require different descriptions and different methods of using them Different resource domains can have characteristic or standard resource descriptions (or description categories) 82
83
Why We Describe Resources
But different types of resources must also have differentiating properties, otherwise there would be no reason to distinguish them as different types But over time as a collection of resources grows and as requirements for interactions change, the reasons for describing resources will also change, and the descriptions must change as well We often combine descriptions... and we often compare them 83
84
Requirements about Resource Description
The most generic interactions use descriptions that can be associated with almost any type of resource, such as the name, creator, and date Different types of resources must have differentiating properties, otherwise there would be no reason or way to distinguish them as different types 84
85
Requirements about Resource Description
Business strategy and economics strongly influence the extent of resource description and the use of technology for automatic description The tradeoffs imposed by the extent and timing of resource description arise throughout the lifecycle, with the tradeoff between recall and precision being the most salient 85
86
Objectives of Bibliographic Description
Finding a resource that you know exists Identifying a resource to make sure you have the one you were looking for Selecting a resource from a set of candidates in a collection Obtaining the resource if what you have at this point is just a resource description 86
87
Classifying Resource Properties
88
Process of Describing Resources
Identify / scope the resources to be described Determine the purposes or uses of the descriptions Study the resource(s) to identify descriptive properties Design the description vocabulary Design the description form and implementation Create the descriptions (either "by hand" or by some automated / computational process Evaluate the descriptions (TDO FIGURE 4-3) 88
89
Scoping and Description
How explicit and thorough these 7 steps need to be depends on what we are calling "scope“ Do we know who the users are? Are the describers the users? Do we have all the resources, a representative set, or just the first set ... do we know what else is coming? 89
90
Nature and Number of Users
How precisely a collection of resources can be described and organized depends on well user types and requirements are known In some contexts, user types and individual users can be controlled and identified In others, types and users aren’t known until the organizing system is in operation 90
91
Designing the Description Vocabulary
Good descriptions use terms that their intended users might use Good descriptions don't contain details that aren't necessary Good descriptions are created systematically and follow standards 91
92
The Need for Controlled Vocabularies
The words people use to describe things or concepts are "embodied" in their context and experiences... so they are often different or even "bad" with respect to the words used by others These naturally-occurring words are an "uncontrolled vocabulary" 92
93
What is a Controlled Vocabulary?
A controlled vocabulary is a standardized set of terms (such as subject headings, names, classifications, etc.) assigned by organizers / cataloguers / indexers of resources A CV can be a content standard for the values used in (or as) descriptive elements A CV can be thought of as a fixed or closed dictionary in which everything must be defined using the same set of terms 93
94
Types of Controlled Vocabularies
Dictionaries Authoritative names Authority control for places and time periods Identifiers Code lists Subject heading lists (like the Library of Congress – LCSH) Thesauri Classification systems 94
95
Creating Resource Descriptions
By professionals: "Institutional" descriptions that follow standards By authors: Best knowledge of purpose and intended audience By users: Most variable, influenced by social purposes By automated processes: Most reliable, primarily technical / objective properties but semantic description capability is emerging 95
96
Evaluating Resource Descriptions
Quality should be evaluated with respect to intended purposes... but what if different stakeholders have different purposes? Creating resource descriptions can be costly; is it worth it? How and when measure the costs? Computational description is vastly less expensive, but is it good enough? How can we incent "community" or "crowdsourced" description to be of better quality? 96
97
Dublin Core: Motivation
When the web was just a few years old, professional “metadata-makers” were already concerned that web resources would be difficult to discover and identify because they lacked good descriptions An international, cross-disciplinary group of professionals from librarianship, computer science, text encoding, the museum community, and other related fields convened in 1995 to propose a standard set of metadata elements for web pages 97
98
More than a Controlled Vocabulary
A controlled vocabulary is a standardized set of terms (such as subject headings, names, classifications, etc.) assigned by organizers / cataloguers / indexers of resources A metadata schema like the Dublin Core controls the kinds of assertions about resources that you can make in the first place Controlled vocabularies can be very useful requirements or recommendations about the values that are contained in the assertions (the information content of the assertion) 98
99
Dublin Core “Content Model” is Unrestrictive
All elements are optional All elements are repeatable Elements can be in any order 99
100
Using the Dublin Core – Pragmatics and Problems
Some information may appear to belong in more than one metadata element There is potential semantic overlap between some elements There will occasionally be some judgment required from the person assigning the metadata 100
101
Descriptive Precision
Poets, painters, composers, sculptors, technical writers, programmers, and devices or machines all create resources The Dublin Core vocabulary has a single property for “creator” Easier to use, but loses a great deal of precision
102
Questions about Names How many names should be associated with a information object or resource? Should one be designated as the preferred or authoritative form? What references should be made from other possible forms of names that haven't been used? 102
103
Authority Control Authority control is concerned with creation and maintenance of a set of terms that have been chosen as the standard representatives for some resource based on some set of rules The Library of Congress maintains an "authority file" (in MARC format) for the names of persons, corporate entities, geographic names of political entities, and titles of works See and 103
104
Normative Forms of Names
When names appear in multiple forms, one form needs to be chosen; criteria include: Fullness (e.g., full names vs. initials only) Language of the name Spelling (choose predominant form) Entry element "Smith, John" not "John Smith" "Mao Zedong" or "Zedong, Mao" or "Mao Tse Tung" or ? 104
105
The Name Matching Problem
Name matching" is the task of determining when two different strings denote the same person, object, or other named entity It is ironic that this problem has many other names: Object consolidation Merge-purge Householding Fuzzy/approximate matching Co-reference resolution Duplicate detection Record linkage 105
106
Generic Approach to Name Matching
Naumann, F., & Herschel, M. (2010). Figure 1.1. 106
107
Data Quality Data quality problems can have any combination of technological causes, process causes, or managerial causes Does data have to be perfectly clean? Can it ever be? How can your own actions contribute to data quality problems or to their resolution? 107
108
Principles and Processes for Quality Information
Prioritize the data items Find the “headwaters” or data owners and involve / incent them Validate at the time of capture or creation Set realistic goals for data quality 108
109
Tagging - Overview Trant’s definition: “Publicly labeling or categorizing resources in a shared online environment” The aggregated results of individual tags have been described as: collaborative, cooperative, distributed, dynamic, community-based, folksonomic, wikified, ethnoclassification, democratic, user-assigned, or user-generated Tag sharing (or visibility) creates additional value through network effects 109
110
Design Dimensions for Tagging Systems
What can be tagged? (anything, photos, web resources, bibliographic resources…) Tagging lexicography? (whitespace delimited string, phrases, normalization) Who can tag? (self, permitted people, anyone) Tagging support? (none, suggested, previous tags visible) Aggregation model? (none, “bag”, labeled set) 110
111
“Tag Soup” Users are free to assign any number of labels or tags they choose No vocabulary control No text processing to handle derivational and morphological variants Key Question: When there are many tags for the same resource, how much statistical convergence naturally arises? Can it be encouraged by the tagging UI? 111
112
What is Multimedia? Media: Text Audio (speech, music, sound)
Graphics and Images Moving Images Multimedia -- Composed of more than one form of media Time-based media -- Audio and video 112
113
What's Different About Describing Multimedia?
Sensory Gap Semantic Gap Proliferation Problem 113
114
The Sensory Gap (2) There is a gap between an object and a computer's ability to sense and describe the object An infinite number of different "signals“ can be produced by the same object …And different objects can produce similar signals Human perceptual machinery excels at recognizing when different "signal patterns" are the same object or when similar patterns are different ones But the problem is difficult for computers 114
115
The Semantic Gap Instruments, devices, and sensors encode data in formats that are optimized for efficient capture, storage, decoding, or other processing This data is often limited to characteristics that involve coarse categorization and syntactic processing The representation of the object can't be (easily) processed to understand what the object "means" So there is a semantic gap between the descriptions that people assign and those that can be assigned by automated mechanisms or that are otherwise built into the storage format 115
116
Are these New Problems? Museums face some of the same or similar problems in describing art works and artifacts: There may be many artifacts that represent the same "work" - this is like the "sensory" gap The materials or medium in which the artifact is embodied don't convey semantics "on their surface" - this is the semantic gap There may be so many artifacts of a particular type that some get only limited descriptions - this is like the proliferation problem 116
117
Some Problems May Be New
The temporal structure of multimedia, especially video, mandates new descriptive vocabulary and new ways to identify meaningful components Video and music meet emotional/psychological needs that are more complex than those for "documents" - so the descriptions of the latter have to be able to address these needs People don't usually access or retrieve music or video "to satisfy information requirements“ 117
118
"Professional" Metadata for Multimedia and Non-textual Works
Museum works and other non-text artifacts have long been described by professionals using structured metadata according to classification rules The descriptions are often "layered" beginning with accessible properties with more analytic descriptions added for those objects that warrant them 118
119
"Panofsky's Three Levels of Meaning
Description ("Preiconographic“ or “Primary”) Identification ("Iconographic“ or “Secondary”) Interpretation ("Iconologic“) 119
120
Getty Categories for the Description of Works of Art (CDWA)
The CDWA is a massive metadata schema with 532 elements and sub-elements Each element typically has a controlled vocabulary or recommends one (e.g, Art and Architecture Thesaurus) There is a core set of 36 elements CDWA-Lite is the XML specification for another simple subset of CDWA 120
121
Comparing “Professional” Metadata with “Amateur" Metadata
Flickr, YouTube, and other collections of multimedia works are characterized by non- standard classification schemes There is no "professional" metadata to start with and build upon Result 1: Unstructured tagging with little use of controlled vocabularies within each collection Result 2: No standardization or interoperability across collections 121
122
Bridging the Semantic Gap in Describing Images
Bridging the gap means extracting features from multimedia content and finding ways to infer semantic-level descriptions from them Some key questions: What features best support semantic inference? What inference techniques work the best? Can we exploit multi-modal or cross-modal information? 122
123
Contextual Metadata Metadata about the context in which some content was "captured" Location, time, other people or things present are basic elements, but there are many more This kind of information has often been collected, but not usually analyzed and applied to description until afterwards 123
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.