What do we want to identify? Ketil Albertsen, Paradigma project National Library of Norway
Existing ID schemes and FRBR Work: DOI (?) Expression: ISTC, ISWC, DOI (?)
Existing ID schemes and FRBR Manifestation: ISBN, ISMN, ISRN, ISAN, ISRC, ISSN, SICI, DOI (?)
Existing ID schemes and FRBR Item: Often: non-standardized holding IDs, shelf IDs etc. Internet: URL (actually, URLs are location IDs, but are often used as object IDs). The museum world, rare and used book shops/collections etc. may have their own ID schemes. Several Item level schemes are really location IDs or a mix-up of location and object identification.
Existing ID schemes and FRBR Other entity groups: No widespread standards exist
Identifier issues Decisions made in Paradigma for ID assignment to objects that have no assigned ID, or whose assigned ID is not found appropriate/satisfactory. The analysis may aid the understanding of properties and limitations of existing ID schemes. For each identified issue, the survey states: Note: an explanation of the heading in more detail, where required; Pro: arguments in favor of the decision; Con: arguments against the decision; Confidence: our confidence in our own decision on a scale from 1 to 10; Remarks: additional remarks, where applicable.
ID value should carry no information about identified object (e.g. ID should be grey, opaque, unintelligent) Pro: Object retains ID even if its attributes change (without becoming inconsistent). All views equal with respect to ID; different views won't have conflicting wishes. No ambiguity with respect to how to form ID value. No restrictions on any object attribute, e.g. regarding single-value or uniqueness. Con: Readability is lower for ID values which appear meaningless to the user. Any form of ordering requires other attributes to be retrieved. Any lookup on object attributes requires use of an index mechanism. Confidence: 10 Remarks: Values may have a non-opaque prefix, identifying the ID scheme or responsible authority. We accept this as long as no object properties are implied by the prefix.
ID values should use a restricted character/symbol set Pro: No problems caused by limited internationalization features. No transcription problems. May avoid ambiguities such as UPPER/lower case distinctions or numeric base. Easier detection of typing/reading errors. Con: In cross-cultural contexts, users may have to work in unfamiliar symbol sets. Identifier length increases. Confidence: 7 Remarks: Opaque (grey) IDs usually satisfy this requirement. Special attention should be given to separator characters for increasing readability – use of different separators in a given ID is usually undesirable.
Check digits are included in IDs to be typed manually or read from print Pro: Errors are detected immediately, allowing retyping/rereading. Con: Identifier length is increased. Confidence: 9
Check digits are not stored internally as part of the ID Pro: Software (except input routines) need not relate to IDs known to be invalid. Con: There is no way to store/handle invalid IDs that cannot be retyped/reread. Confidence: 7
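The two check-digit decisions above can be sketched together: the digit is verified at the input boundary and then dropped, so only the payload is stored. The slides do not say which check-digit algorithm Paradigma uses; the sketch below borrows the ISBN-13 weighting (alternating 1 and 3, modulo 10) purely as a familiar example, and the function names are hypothetical.

```python
def check_digit(payload: str) -> str:
    """Mod-10 check digit with alternating weights 1 and 3 (ISBN-13 style).

    Assumption: the source names no algorithm; this one is only an example.
    """
    if not (payload.isascii() and payload.isdigit()):
        raise ValueError("payload must contain ASCII digits only")
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(payload))
    return str((10 - total % 10) % 10)


def to_external(payload: str) -> str:
    """External (printed/typed) form: stored payload plus check digit."""
    return payload + check_digit(payload)


def from_external(typed: str) -> str:
    """Input routine: verify the check digit, then return the payload to store."""
    payload, digit = typed[:-1], typed[-1:]
    if not payload or check_digit(payload) != digit:
        raise ValueError(f"check digit mismatch in {typed!r}")
    return payload
```

With this split, software behind the input routine never sees the check digit. The stated Con follows directly: an ID that fails `from_external` has no internal representation at all.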
IDs have fixed length Pro: Elementary error detection may be done very early in the data entry process. Internal handling may be simplified. Implementers are forced to prepare software for entire range of ID values. Con: Limits total size of ID space; one may run out of IDs. A large ID space gives longer IDs than required from a practical viewpoint. Confidence: 9 Remarks: In numeric IDs, leading zeros should be avoided (for several technical reasons), i.e. serial numbers should start at value 100…0
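The remark about leading zeros can be made concrete with a small sketch. It assumes purely decimal serial numbers of a chosen width; the function names and the width of 8 used in the note below are illustrative, not taken from Paradigma.

```python
def serial_range(width: int) -> range:
    """All serial numbers with exactly `width` decimal digits and no leading zero."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # The first serial is 100...0, the last is 999...9.
    return range(10 ** (width - 1), 10 ** width)


def is_well_formed(text: str, width: int) -> bool:
    """Early, elementary check possible at data entry: right length, digits, no leading zero."""
    return (
        len(text) == width
        and text.isascii()
        and text.isdigit()
        and text[0] != "0"
    )
```

For a width of 8, `serial_range(8)` starts at 10000000 and holds 90,000,000 values, so avoiding leading zeros costs one tenth of the nominal ID space.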
An external, readable format is rigidly defined Pro: Readability may be improved, e.g. by defined use of digit grouping characters. IDs can be directly compared for equality with no preprocessing/canonizing. Con: Users may feel that rules are too rigid. Confidence: 9
External format optionally identifies primary resolution service Pro: Provides necessary info to realize clickable links. Con: Bound to one specific identification method for resolution service. May be unsuitable for long-term archival (if service is identified by URL). Users generally won't distinguish between object ID part and resolution ID part. Confidence: 7
Binary IDs are displayed as decimal digits Pro: Improved readability; users relate better to digits than to arbitrary character strings. May be entered from a purely numeric keyboard. Con: Identifier length is higher than with arbitrary (or e.g. hexadecimal) display. Confidence: 9 Remarks: Binary IDs are usually only/primarily intended for internal, technical use
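The length trade-off mentioned under Con can be illustrated with a short sketch. It assumes a binary ID held as a byte string and read as an unsigned big-endian integer; both the 8-byte size in the note below and the function names are assumptions made for the example.

```python
def to_decimal(raw: bytes) -> str:
    """Display form: the binary ID read as an unsigned big-endian integer."""
    return str(int.from_bytes(raw, "big"))


def to_binary(text: str, size: int) -> bytes:
    """Inverse of to_decimal for an ID of `size` bytes."""
    return int(text).to_bytes(size, "big")
```

For the largest 8-byte value, `to_decimal` yields 20 decimal digits where hexadecimal display (`raw.hex()`) needs 16, which is the extra length the slide refers to.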
One ID scheme covers all object classes / information types Pro: Prepared to handle the proliferation of information types on the Internet. Simplifies internal workings of indexing mechanisms. May avoid ambiguities with respect to which ID scheme is referenced. Con: Object/metadata must be inspected to determine semantics. Requires coordinated ID allocation. Confidence: Paradigma: 9. In non-digital contexts: 3-6
An ID may be assigned to objects without digital value representation Note: Applies to e.g. physical objects and abstract concepts, and also to digital objects which cannot be, or are not, available as a storable object, e.g. a network or web site. Pro: One ID scheme is used for all different purposes. An ID mechanism is provided for objects which might otherwise be unidentifiable. Con: Retrieval/presentation functions must be prepared for value being unavailable. Confidence: Paradigma: 9. In non-digital contexts: 6 Remarks: In Paradigma, displayable agent objects are defined for un-digitizable objects.
One ID scheme handles static as well as dynamic resources: incrementally issued, integrating and streaming Pro: The only way that web documents may be handled properly by automatic mechanisms. Resources saved in ID management. Con: The semantics of the ID is limited to what is common to both static and dynamic objects. Confidence: 9
The contents of an object are either 100% specified, or the object is explicitly defined as an aggregate, e.g. a dynamic document Pro: Any ambiguity is deliberate and well known both to cataloguer and user. No need for qualified judgment regarding assignment of new IDs. Document contents never become inconsistent with ID – modified, specific contents should be assigned a new ID if needed. Con: Insignificant revisions cannot be ignored from an identifier point of view. Confidence: 9
An ID may identify a rule to be interpreted to determine the object components Pro: The only way that web pages with continuously varying contents may be identified. Con: The contents of an object identified by a rule cannot, by definition, be authenticated. Confidence: In a web-based, dynamic document context: 8. In static contexts: 3-5 Remarks: Even though the component set may vary from one moment to the next, any evaluation of the rule must result in a finite set of components.
An ID identifies a unique object in a given interpretation Note: An HTML file as a backup object is different from the same HTML code interpreted as the top level of a composite web page; these should have different IDs. A periodical as a publication forum is distinct from the complete set of printed issues. Pro: Ambiguities in the interpretation of a given ID are avoided. Con: Required size of ID space increases. An object may have multiple IDs; ID is not unique for the object. Confidence: In digital document contexts: 9, in other contexts 3-7 Remarks: Distinct IDs are essential when the set of components depends on the interpretation, or when the interpretations represent different abstraction levels.
An ID scheme should not assume that objects are atomic (non-composite) Pro: New object classes may be identified without conflicting with ID scheme philosophy. Con: Always being prepared for composite objects makes software more complex. Confidence: Context dependent. For referencing: 9, for storage: 1-3. Remarks: In Paradigma, physical IDs assume that the object is a static, atomic bit sequence. Logical IDs, provided to external users, assume that objects are composite.
Each distinct component of a composite must be identifiable Pro: Allows references to a specific part of composite objects. Allows different views of a document, defining different extents. Allows a component shared among several composites to have one ID. Con: Requires a larger ID space. Confidence: 9 if components stored as independent files/records, otherwise: 7 Remarks: Components are not necessarily assigned an ID, but it must be possible to do so when the need for identifying the component arises.
An object ID is totally independent of the location of the object Pro: Object can be moved around without losing its identity. Allows multiple copies of the same object to have the same ID. Con: Obtaining an object requires an explicit mapping from object ID to location. Confidence: 10
A location ID is totally independent of the identity of the object stored Pro: Object store may be reorganized without affecting object IDs. Con: Location IDs cannot be saved for later re-retrieval if store is subject to reorganizing. Confidence: 10
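The separation of object IDs from location IDs implies the explicit mapping mentioned under Con. A minimal in-memory sketch of such a mapping follows; the class name, the object ID `obj-1` and the location strings used in the note below are all hypothetical.

```python
class LocationMap:
    """Explicit mapping from opaque object IDs to the places copies are stored."""

    def __init__(self) -> None:
        self._locations: dict[str, set[str]] = {}

    def add_copy(self, object_id: str, location: str) -> None:
        # Several copies of one object share the same object ID.
        self._locations.setdefault(object_id, set()).add(location)

    def move_copy(self, object_id: str, old: str, new: str) -> None:
        # Reorganizing the store changes locations only; the object ID is untouched.
        copies = self._locations[object_id]
        copies.remove(old)
        copies.add(new)

    def resolve(self, object_id: str) -> list[str]:
        """Current locations of the object, or an empty list if none are known."""
        return sorted(self._locations.get(object_id, ()))
```

After `move_copy("obj-1", "store-a/17", "store-c/99")`, a reference that saved the object ID still resolves, while one that saved the old location does not, which is the point of both decisions.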
Making references to an object must not require contents interpretation Pro: The referencer need not know format details of documents he wants to reference. The document format may be changed, and references to it remain valid. References can be made at a more abstract (format independent) level. Con: References may have lower precision compared to format dependent references. Dereferencing may be more complex and resource consuming. Confidence: 7 Remarks: Relevant primarily to digital/Internet documents. In Paradigma, references are indirect, made through a format independent reference object containing one or more format specific direct references.
Assigning an ID from one scheme does not prohibit allocation of IDs from other schemes Pro: In a given context, a single scheme may be employed for all objects, even those with existing IDs. Con: There will not be a single, unique way to reference a given object. Confidence: 9
Index mechanisms provide a fallback for arbitrary URI format ID schemes Pro: A large number of current and future ID schemes are handled with a single mechanism. Con: The general mechanism cannot handle e.g. allowed variations in syntax for the same ID. Confidence: 9 Remarks: For pragmatic reasons (people use URLs as if they were URNs!), Paradigma decided to allow both URLs and URNs to be entered in the fallback index. For known schemes with known, allowed syntax variations, scheme dependent preprocessing (canonizing) will have to be done, e.g. to hide case differences.
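The fallback index with scheme-dependent preprocessing might look roughly like the sketch below. It is not Paradigma's implementation: the rule table, the class name, and the choice to strip hyphens and upper-case ISBN values are assumptions made for the example. The only fact relied on is that the `urn` prefix and the namespace identifier of a URN are case-insensitive.

```python
# Hypothetical per-namespace rules for known schemes with allowed syntax variations.
NSS_RULES = {
    "isbn": lambda nss: nss.replace("-", "").replace(" ", "").upper(),
}


def canonize(identifier: str) -> str:
    """Return the form under which an identifier is stored in the fallback index."""
    text = identifier.strip()
    scheme, sep, rest = text.partition(":")
    if sep and scheme.lower() == "urn":
        nid, sep, nss = rest.partition(":")
        if sep:
            nid = nid.lower()
            rule = NSS_RULES.get(nid, lambda value: value)
            return f"urn:{nid}:{rule(nss)}"
    # URLs and unknown schemes are indexed exactly as entered.
    return text


class FallbackIndex:
    """One lookup table for arbitrary URI-format identifiers (URLs and URNs alike)."""

    def __init__(self) -> None:
        self._entries: dict[str, str] = {}

    def add(self, identifier: str, object_id: str) -> None:
        self._entries[canonize(identifier)] = object_id

    def lookup(self, identifier: str):
        return self._entries.get(canonize(identifier))
```

An entry added as `URN:ISBN:0-306-40615-2` is then found under `urn:isbn:0306406152`. A URL that differs only in letter case is treated as a different key, which reflects the stated limitation of the general mechanism.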
Framework for ID schemes: The URI world [Diagram: uri: divides into urn: and url:. URL examples shown: gopher://gopher.uminn.edu/pub/sched, ftp://ftp.funet.fi/pub/at/8200.exe, … Further scheme examples shown: isbn:, issn:, doi:10.185/4dd, nbn:fi-34a3aea fcdc14773f4, …]
One ID refers to several object aspects: value, metadata, converted versions… Pro: The same ID is used to reference all information about an object. Information can be added to (or about) an object without requiring a new ID. Con: A resolution request must identify the relevant aspect; the ID alone is not sufficient. Managing value and metadata independently increases complexity of resolution. Confidence: 5. Context dependent
An ID scheme may prohibit multiple IDs for a given object (in that scheme) Pro: An unambiguous ID simplifies internal handling, especially with respect to storage. Object references can be directly compared for equality. ID value may be used to determine storage address. Con: May prohibit correct ID, e.g. for component used in multiple contexts. Confidence: 5. Context dependent. Remarks: In Paradigma, document elements at the lowest level are identified by physical IDs and have a single, unique ID, while logical IDs do not have this restriction.
IDs are assigned one by one from a single central office Pro: Procedures ensure that metadata is always available. The definition of the identified object is always known. No need to structure ID value, which may be opaque except for common prefix. Con: Channel to assignment authority may become a bottleneck. (Final) ID assignment cannot be done until the object definition is available. Confidence: In Paradigma: 9. Highly context dependent - in general contexts: 2-5 Remarks: In Paradigma, an ID can be reserved for a limited period of time prior to final assignment, to allow the ID to be inserted into the document text.
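The reserve-then-finalize procedure in the Remarks can be sketched as follows. The slides give no details, so the class name, the `example:` prefix, the serial-number ID form, and the rule that a lapsed reservation is rejected and never reissued are all assumptions. Time is passed in explicitly to keep the sketch deterministic.

```python
import itertools


class AssignmentOffice:
    """Single central office handing out opaque IDs behind a common prefix."""

    def __init__(self, prefix: str, hold_seconds: float) -> None:
        self._prefix = prefix
        self._hold = hold_seconds
        self._serials = itertools.count(1)
        self._reserved: dict[str, float] = {}   # ID -> expiry time
        self._assigned: dict[str, str] = {}     # ID -> object definition

    def reserve(self, now: float) -> str:
        """Reserve an ID so it can be inserted into the document text."""
        object_id = f"{self._prefix}{next(self._serials)}"
        self._reserved[object_id] = now + self._hold
        return object_id

    def finalize(self, object_id: str, definition: str, now: float) -> None:
        """Final assignment: only possible once the object definition is available."""
        expires = self._reserved.pop(object_id, None)
        if expires is None:
            raise KeyError(f"{object_id} is not reserved")
        if now > expires:
            raise TimeoutError(f"reservation of {object_id} has lapsed")
        self._assigned[object_id] = definition

    def definition(self, object_id: str):
        return self._assigned.get(object_id)
```

Because every finalized ID carries a definition, the Pro "the definition of the identified object is always known" holds by construction. The single `reserve`/`finalize` channel is also where the bottleneck named under Con would appear.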
An automated ID assignment service is provided Pro: Saves manual labor. IDs may be assigned 24/7. Rapid response to assignment requests. Con: Service must be protected against malicious use. Confidence: In Paradigma: 9. In general contexts: 5-7
ID assignment may be requested by any user with no particular authorization Pro: Allows user to create reference to object where the publisher has provided no ID. Con: An object may be assigned an arbitrary number of IDs. An automated assignment service must be protected against malicious use. Confidence: For referencing purposes (point/fragment IDs): 9. Remarks: The resolution service may treat IDs assigned by unauthorized users differently from IDs assigned by recognized and authorized users such as publishers.
A generally available, online ID resolution service is provided Pro: Satisfies user expectations for clickable links. Con: A complex infrastructure may be required to provide a high quality service. Confidence: 9 Remarks: The service may provide the object itself, metadata or other classes of information
A generally available authentication service is provided. Pro: Can be an essential aid in legal conflicts. Can be used to force storing of a snapshot of a dynamically changing resource. Con: Must be provided by an authority recognized by all relevant parties. Implementation requires significant resources; all documents must be held by provider. Ignoring pure syntax differences of no semantic importance is very difficult. Confidence: 6. For non-digital documents: Not applicable. Remarks: For various reasons, object may be unavailable for retrieval, only for authentication. The service may provide information about degree of discrepancy. Paradigma: Generally available metadata allows a user to authenticate document locally (for strict equality only).
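Local authentication "for strict equality only" can be sketched with a content digest published as part of the metadata. The slides do not say what metadata Paradigma publishes; the use of a SHA-256 digest and the function names here are assumptions.

```python
import hashlib
import hmac


def fingerprint(content: bytes) -> str:
    """Digest assumed to be published in the generally available metadata."""
    return hashlib.sha256(content).hexdigest()


def authenticate(content: bytes, published_digest: str) -> bool:
    """True only if the local copy is byte-for-byte identical to the registered one."""
    return hmac.compare_digest(fingerprint(content), published_digest.lower())
```

A copy that differs by a single trailing newline fails this test. That illustrates why ignoring pure syntax differences of no semantic importance is listed as very difficult: a digest can answer only "identical or not", not the degree of discrepancy.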
Resolution/authentication infrastructure is document format independent, and requires no modifications of object contents Pro: All current and future document formats can be handled. Management functions need not know a large number of document syntaxes. Con: The functions cannot be based on content attributes, but must be based on independently managed structures. Confidence: 10
The object must be available to the assignment authority at assignment time Pro: Users are given a higher level of service: All assigned IDs have a valid definition. Con: Delegation of ID assignment has to be restricted to those satisfying requirements. A complex infrastructure is required to make available and maintain definitions. Confidence: In Paradigma: 9, otherwise: 5 Remarks: In an archival context, such as Paradigma, objects never disappear, so all IDs will be valid forever. This is not necessarily the case in other contexts.
A minimum set of metadata must be specified for ID allocation Pro: Guarantees that the resolution service can provide some information about object. Con: Requires that ID allocation is managed by an actor enforcing this requirement. Supplied metadata may be misleading or without value. Confidence: 9
Point/fragment references Offered to users who need to reference other documents. Realized as a stored reference object: identity of a document + starting position and length information. May reference an abstract document (expression) or an aggregate; starting position and length may be specified for each instance / aggregate component. Has an ID like a document. Paradigma: Norwegian branch of the urn:nbn: space. Resolution service interprets ID as retrieval + positioning, if possible.
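A stored reference object of this kind could be modelled as in the sketch below. The field names, the byte-offset interpretation of "starting position and length", the made-up `urn:nbn:no-example-…` values in the note, and the fallback to plain retrieval when positioning is not possible are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentReference:
    """Stored reference object: its own ID, the target document, and an extent."""
    reference_id: str   # the reference is identified like a document
    document_id: str
    start: int          # assumed here to be a byte offset
    length: int


def resolve(ref: FragmentReference, store: dict[str, bytes]) -> bytes:
    """Retrieval plus positioning if possible, otherwise retrieval only."""
    content = store[ref.document_id]
    end = ref.start + ref.length
    if ref.start >= 0 and ref.length >= 0 and end <= len(content):
        return content[ref.start:end]
    # Assumption: when the extent does not fit, fall back to the whole document.
    return content
```

With a store holding `b"What do we want to identify?"` under `urn:nbn:no-example-doc`, a reference with start 8 and length 2 resolves to `b"we"`. The referenced document itself is not modified.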