BSIM0010 Digital Libraries: Documents (mainly based on Chapter 4 of Witten & Bainbridge’s book)

Slide 2 (October 2005): Why bother with documents in a DL?
Documents are a DL’s building blocks.
High-level considerations
  - How are documents organized? Would documents be classified into various collections?
  - What do they look like, e.g., web pages or file attachments?
Low-level considerations
  - How are documents represented?
  - Document types, formats, and their characteristics need to be considered.

Slide 3: Challenges in keeping documents in DLs
Probably the biggest challenge
  - Tension between fast-changing technologies and the long-term existence of libraries
Decision problem
  - Which document formats should be supported (and why)? Page description (e.g., PDF) or an editable format?
Does a DL need to support internationalization?
  - Displaying documents in various languages properly
  - Multilingual user interface
  - Ability to search text properly (which is not always easy)

Slide 4: Your turn …
Download a document called “sample.rtf” from ILN and view it with MS Word. Now open it with Notepad and count the occurrences of the word “bold” in it. Is the number in line with your expectation? What does your finding imply for the search design of a digital library?

Slide 5: Searching properly is not always easy
Internationalization can be problematic for searching
  - The same text (more exactly, the same character) may not be represented in the same way in various document types, e.g., the ‘ï’ in naïve is represented as
      - &iuml; (name code) or &#239; (number code) in HTML
      - A single character in extended ASCII (decimal equivalent 139)
Solution: Unicode
  - Helps with character representation only (a unique representation of each character)
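A minimal Python sketch of the point above. It assumes the HTML named and numeric entities for ‘ï’, and it assumes that "extended ASCII" means IBM code page 437, where byte 139 is ‘ï’ (other 8-bit code pages place the letter elsewhere). Three different stored forms decode to the same Unicode string, which is the form a search index would want to compare:

```python
import html

named = "na&iuml;ve"        # HTML named entity
numeric = "na&#239;ve"      # HTML numeric entity (decimal code point)
legacy = bytes([110, 97, 139, 118, 101])  # n, a, byte 139, v, e

# A naive substring search on the raw markup misses the word.
found_in_raw_markup = "na\u00efve" in named

# Decoding every form to Unicode first makes them comparable.
decoded_forms = {
    html.unescape(named),
    html.unescape(numeric),
    legacy.decode("cp437"),  # assumed code page
}
print(found_in_raw_markup, decoded_forms)
```

The design choice illustrated here is to decode to one internal representation before indexing, rather than teaching the search engine about every source format.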

Slide 6: Indexing can be difficult too
Indexes are often created out of “important” words from documents so as to facilitate rapid full-text searching. Automatic indexing is not easy because the following issues are involved:
  - How can “important” words be identified, as opposed to common words such as “is”, “the”, and “am”?
  - How can we segment meaningful words out of a document when the language is not written with spaces between words, e.g., Chinese?
  - Text abstraction is an area of research that aims to deal with the aforesaid problems. It requires knowledge about the language as well as domain knowledge.
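One common way to approximate "important" words is to drop a fixed list of stop words before building an inverted index. The sketch below is illustrative only: the stop-word list, the sample documents, and the function name `build_index` are assumptions, not part of the course material.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real systems use much longer, language-specific lists.
STOP_WORDS = {"is", "the", "am", "a", "of", "i"}

def build_index(docs):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Splitting on runs of letters only works for space-delimited scripts.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

docs = {
    1: "The library is digital",
    2: "I am the librarian of a digital archive",
}
index = build_index(docs)
print(sorted(index))
```

The regular expression relies on spaces and punctuation to find word boundaries, so it would not segment Chinese text; that case needs a dictionary-based or statistical segmenter.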

Slide 7: Document formats are not restricted to text
Text
  - Portable Document Format (PDF), Rich Text Format (RTF), MS Word, LaTeX, XML, etc.
Image
  - GIF, JPEG, PNG, TIFF
Video
  - MPEG, Apple’s QuickTime, Microsoft’s AVI
Audio
  - WAV, AIFF, WMA, AU, AAC, MP3, …

Slide 8: Are document format details important?
When striving for standardization, knowledge about how different formats work may help one to appreciate their strengths and weaknesses, e.g.,
  - Interactive features of PDF are lost when a document is converted into PostScript.
  - Converting GIF to JPEG can degrade image quality irreversibly.
  - Converting HTML to PostScript is easy, but converting PostScript to HTML with the visual features accurately replicated is extremely difficult.

Slide 9: Representing characters
Different computer vendors adopted different sets of codes in their products until the 7-bit American Standard Code for Information Interchange (ASCII) was announced by the American National Standards Institute (ANSI). Each bit has 2 possible values (0 or 1), so 7 bits give 128 combinations.
In addition to printable characters, ASCII includes non-printable characters too
  - BEL rings the bell
  - BS backspaces the cursor
  - STX marks the start of text in a transmission
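The arithmetic behind the 7-bit code can be checked directly. This is a small sketch; the dictionary name `CONTROL_CODES` is just an illustrative label for the three control characters listed above.

```python
# 7 bits, each 0 or 1, give 2**7 distinct codes.
ascii_code_count = 2 ** 7

# Standard ASCII code values of the control characters mentioned above.
CONTROL_CODES = {"STX": 0x02, "BEL": 0x07, "BS": 0x08}

for name, code in CONTROL_CODES.items():
    # Control characters are valid ASCII but are not printable.
    print(name, code, chr(code).isprintable())

# A printable character and its 7-bit pattern.
letter_bits = format(ord("A"), "07b")
print(ascii_code_count, letter_bits)
```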

Slide 10: Extended ASCII: a solution or not?
ASCII supports the English language only.
The International Organization for Standardization (ISO) extended ASCII by using the unused bit of the 8-bit code to include more characters of alphabet-based languages, e.g., French and German.
  - Not a solution for many other languages like Chinese and Hebrew, as the code can support up to 256 patterns only
  - Result: new competing standards arose, e.g., GB and Big-5 for Chinese (these support Chinese and English but not other languages)
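A short sketch of why competing 8-bit and national standards cause trouble. It uses codecs that ship with Python (`latin-1`, `cp437`, `big5`, `gb2312`); the particular byte and character chosen are only examples.

```python
# The same byte means different characters under different 8-bit code pages.
byte = bytes([0xE9])
as_latin1 = byte.decode("latin-1")  # e with acute accent in ISO 8859-1
as_cp437 = byte.decode("cp437")     # a different character in IBM code page 437

# One Chinese character under two competing national standards.
zhong = "\u4e2d"
big5_bytes = zhong.encode("big5")
gb_bytes = zhong.encode("gb2312")

# An 8-bit Western code page cannot represent the character at all.
try:
    zhong.encode("latin-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(as_latin1, as_cp437, big5_bytes, gb_bytes, fits_in_latin1)
```

A program that guesses the wrong standard will therefore display or index the wrong characters, even though every byte is "valid".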

Slide 11: Unicode
Work first began at Apple and Xerox in 1988
A consortium formed with many other interested parties was founded in 1991
ISO-10646, the “Universal Multiple-Octet (8-bit) Coded Character Set” (UCS), was released in 1993
Unicode aimed to represent the scripts of languages in use around the world (and this has been achieved).
Unicode offers round-trip compatibility with existing encoding schemes (e.g., 7-bit ASCII).
Current work addresses historic languages and notations such as music.

Slide 12: Support for Unicode
Computer languages
  - Java (built-in Unicode support)
  - C, Perl, Python, etc. (standard Unicode libraries)
All key operating systems, e.g., Windows 2000 and its successors
Application support, e.g., Web browsers
Default encoding for HTML and XML

Slide 13: Unicode character set
Comes in two parts (ISO 10646-1 and ISO 10646-2)
Specifies a total of 94,000 characters
ISO 10646-1 focuses on commonly used languages, called the Basic Multilingual Plane (BMP) (expresses all information, ensures uniqueness of each character)
  - Divided into 1,000 pages
  - Contains 49,000 characters
  - Code space divided into different scripts
Distinguishes characters by script and not by language, e.g., CJK ideographs for Chinese, Japanese, and Korean, and Hangul characters for Korean
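A small sketch of how a program can tell whether a character lies in the BMP and which script-based block it belongs to. The helper name `plane` and the sample characters are assumptions made for illustration; `unicodedata.name` is part of Python's standard library.

```python
import unicodedata

def plane(ch):
    """Return the plane number of a character; plane 0 is the BMP."""
    return ord(ch) >> 16

samples = ["A", "\u4e2d", "\ud55c", "\U0001D11E"]
for ch in samples:
    # Character names reflect the script, not the language that uses it.
    print(f"U+{ord(ch):04X}", plane(ch), unicodedata.name(ch))

cjk_name = unicodedata.name("\u4e2d")
hangul_name = unicodedata.name("\ud55c")
```

The ideograph is named as a CJK unified ideograph regardless of whether it appears in Chinese, Japanese, or Korean text, while the Hangul syllable has its own block.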

Slide 14: Unicode character code charts by script [figure]

Slide 15: CJK unified ideographs [figure; panels: Japanese, Simplified Chinese, Traditional Chinese]

Slide 16: Composite and combining characters (1/3)
A code point is a Unicode value, written as U+ followed by the numeric value in hexadecimal
A code point does not always represent a single character, e.g., the Latin small ligature fi
A code point may specify only part of a character, e.g., the combining diaeresis (i.e., the two dots above ‘ï’, ‘ä’, ‘ü’, etc.), which must be combined with a base character to form a meaningful character (2 codes make one character)
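The notation and the two special cases above can be inspected with Python's standard `unicodedata` module. This is a sketch; the helper name `code_point` is an illustrative assumption.

```python
import unicodedata

def code_point(ch):
    """Format a character's code point in the U+XXXX notation."""
    return f"U+{ord(ch):04X}"

ligature = "\ufb01"       # one code point that stands for two letters, f and i
diaeresis = "\u0308"      # a combining mark, not a character on its own
combined = "i" + diaeresis  # two code points that display as one character

print(code_point(ligature), unicodedata.name(ligature))
print(code_point(diaeresis), unicodedata.name(diaeresis))

# A non-zero combining class marks a code point that attaches to a base character.
is_combining = unicodedata.combining(diaeresis) > 0
print(len(combined), is_combining)
```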

Slide 17: Composite and combining characters (2/3)
To comply with the round-trip compatibility requirement, Unicode has single code points that correspond to precisely the same units as character combinations. This means that some characters have more than one representation.
Combining characters help compensate for omissions that could otherwise be fixed only by a lengthy standardization process

Slide 18: Composite and combining characters (3/3)
Problem
  - Alternative representations of the same character complicate:
      - Text searching (as alternative representations have to be considered)
      - String comparison
      - Sorting text into lexicographic order
Solution
  - Representing text in some normalized form; or
  - Excluding the use of combining characters (Unicode Level 1)
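The problem and the first solution can be demonstrated with `unicodedata.normalize` from Python's standard library. This is a minimal sketch; the sample word is just an example.

```python
import unicodedata

precomposed = "na\u00efve"   # single code point for the accented letter
decomposed = "nai\u0308ve"   # base letter followed by a combining diaeresis

# The two strings look identical on screen but compare as different.
equal_before = precomposed == decomposed

# NFC composes where possible; NFD decomposes where possible.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

# After normalizing both sides to the same form, comparison and search work.
equal_after = unicodedata.normalize("NFC", precomposed) == nfc
print(equal_before, equal_after, len(nfc), len(nfd))
```

A DL would typically normalize both the indexed text and the user's query to the same form, so that either representation matches.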

Slide 19: Unicode character encodings (1/3)
UTF-32
  - The ISO standard reserves 32 bits for each Unicode character with future extension in mind
  - The Unicode Consortium limits the range of code values to the first 21 bits
  - Rarely used in practice due to inefficient use of space
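A brief sketch of the fixed-width property of UTF-32, using Python's built-in `utf-32-be` codec. The helper `nth_code_point` is a hypothetical name introduced for illustration.

```python
# One ASCII letter, one BMP ideograph, and one character outside the BMP.
text = "A\u4e2d\U0001D11E"
encoded = text.encode("utf-32-be")

def nth_code_point(data, n):
    """Read the n-th code point directly, since every character takes 4 bytes."""
    return int.from_bytes(data[4 * n:4 * n + 4], "big")

# The largest Unicode code point fits in 21 bits.
max_bits = (0x10FFFF).bit_length()
print(len(encoded), hex(nth_code_point(encoded, 2)), max_bits)
```

Random access by character index is trivial, at the cost of four bytes even for plain ASCII letters.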

Slide 20: Unicode character encodings (2/3)
UTF-16
  - The Basic Multilingual Plane (BMP) can be expressed in 16 bits
  - A pair of surrogate characters (1 surrogate = 16 bits), defined inside the BMP, is used to represent each character outside the BMP (see http://en.wikipedia.org/wiki/UTF-16)
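The surrogate mechanism can be sketched as follows. The function name `to_surrogates` is an assumption made for illustration; the result is checked against Python's built-in `utf-16-be` codec.

```python
def to_surrogates(cp):
    """Split a code point outside the BMP into a high and a low surrogate."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = cp - 0x10000              # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

clef = "\U0001D11E"  # a musical symbol outside the BMP
high, low = to_surrogates(ord(clef))
manual_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(hex(high), hex(low), manual_bytes == clef.encode("utf-16-be"))
```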

Slide 21: Unicode character encodings (3/3)
UTF-8
  - Motivation
      - The majority of UNIX tools expect ASCII files and cannot read 16-bit words as characters without major modifications
  - Ken Thompson (inventor of the Unix operating system and of the programming language B, the predecessor of C) invented UTF-8 over a placemat on a dining table in September 1992

Slide 22: Encoding Unicode characters as UTF-8 [figure]

Slide 23: UTF-8 properties
UCS characters U+0000 to U+007F (ASCII) are encoded as bytes 0x00 to 0x7F (ASCII compatibility). Thus files and strings that contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All possible 2^31 UCS codes can be encoded.
UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long.
To avoid ambiguity, UTF-8 must use the shortest possible encoding.
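A hand-written encoder makes these properties concrete. This is a sketch, not a replacement for a real codec: the function name `utf8_encode` is hypothetical, it stops at U+10FFFF (four bytes), which is the range current Unicode and Python's own codec accept, and it does not reject surrogate code points.

```python
def utf8_encode(cp):
    """Encode one code point using the shortest possible UTF-8 form."""
    if cp < 0x80:                       # 7 bits: one byte, identical to ASCII
        return bytes([cp])
    if cp < 0x800:                      # up to 11 bits: two bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # up to 16 bits (rest of the BMP): three bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # outside the BMP: four bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range for this sketch")

samples = "A\u00ef\u4e2d\U0001D11E"
lengths = [len(utf8_encode(ord(ch))) for ch in samples]
matches_builtin = all(utf8_encode(ord(ch)) == ch.encode("utf-8") for ch in samples)
print(lengths, matches_builtin)
```

Checking the ranges from smallest to largest is what guarantees the shortest encoding for each code point.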

Slide 24: Using Unicode in a DL
Ensure that the right data type is selected in the underlying programming language to represent Unicode characters
If possible, work with the BMP and Unicode Level 1 to make the implementation of string-matching operations easy.
Save documents in UTF-8 to save storage space
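The storage argument can be checked by measuring encoded sizes. The two sample strings below are illustrative assumptions; results depend on the script of the text being stored.

```python
def encoded_sizes(text):
    """Return the byte length of a string under three Unicode encodings."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

english = "Digital library"
chinese = "\u6570\u5b57\u56fe\u4e66\u9986"  # five BMP ideographs

english_sizes = encoded_sizes(english)
chinese_sizes = encoded_sizes(chinese)
print(english_sizes)
print(chinese_sizes)
```

For ASCII-heavy text UTF-8 is the most compact of the three. For CJK text in the BMP, UTF-8 needs three bytes per character against two for UTF-16, so the saving depends on the collection's language mix.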

Slide 25: Documents in plain text (1/2)
Ensure that two communicating programs make the same assumption about the text coding; otherwise characters may not be displayed correctly
Paragraphs are separated by two successive line breaks, or the first line is indented with a tab character (which usually stops at every eighth character position).
Different operating systems adopt conflicting conventions for specifying line breaks
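A short sketch of what happens when the writer and the reader of a plain-text file assume different codings. The pairing of UTF-8 and Latin-1 is only one example of such a mismatch.

```python
original = "na\u00efve"

# The writer saves the text as UTF-8.
stored = original.encode("utf-8")

# A reader that wrongly assumes Latin-1 sees two characters in place of one.
misread = stored.decode("latin-1")

# A reader that makes the same assumption as the writer recovers the text.
correct = stored.decode("utf-8")
print(misread, correct, len(misread), len(correct))
```

Plain text carries no label saying which coding was used, so the assumption has to be agreed on outside the file.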

Slide 26: Documents in plain text (2/2)
Different operating systems adopt conflicting conventions for specifying line breaks
  - Microsoft Windows inserts a new line when encountering a carriage return followed by a line feed, but Apple Macintosh and Unix insert a new line when encountering a line feed alone (this may cause display problems when a text file is created in one computing environment and opened in another)
No metadata can be included explicitly
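One way a DL ingest step might cope with mixed line-break conventions is to convert everything to a single form. This is a sketch under assumptions: the function name `normalize_newlines` is hypothetical, and handling a bare carriage return as a line break is an extra assumption beyond the two conventions named above.

```python
def normalize_newlines(text):
    """Convert CR+LF and bare CR line breaks to a single line feed."""
    # Replace the two-character sequence first so it is not counted twice.
    return text.replace("\r\n", "\n").replace("\r", "\n")

windows_text = "first line\r\nsecond line\r\n"
unix_text = "first line\nsecond line\n"

same_after = normalize_newlines(windows_text) == normalize_newlines(unix_text)
line_count = len(normalize_newlines(windows_text).splitlines())
print(same_after, line_count)
```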

Slide 27: Data quality issues in DL
Three levels of data quality
  - Absolute data quality (quality of the original source)
      - Overall level of data quality of both digital objects and metadata within a DL
  - Faithful reproduction data quality (e.g., scanning)
      - Data quality of objects that originated elsewhere, e.g., are scanned-in documents correctly recognized? (data quality of the translation process)
  - Digital data quality
      - Data quality of objects and metadata that were born digital within a DL

Slide 28: Typographical errors (Gardner, 1992)
Errors of letter omission, letter insertion, letter substitution, and letter transposition
Typographical errors can also occur at the word or sentence level, e.g., omission of words or paragraphs
Typographical errors can hinder access for phrase and proximity searching, and typos can create greater obstacles when they occur in the metadata associated with an object
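The four letter-level error types correspond to the four operations of the optimal string alignment distance (a restricted form of Damerau-Levenshtein distance). The sketch below is one standard way to measure such errors; it is not taken from Gardner's paper, and the function name `osa_distance` is an illustrative choice.

```python
def osa_distance(a, b):
    """Count omissions, insertions, substitutions, and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # letter omitted from b
                          d[i][j - 1] + 1,         # letter inserted into b
                          d[i - 1][j - 1] + cost)  # letter substituted
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # letters transposed
    return d[-1][-1]

typos = {
    "omission": "libary",
    "insertion": "librrary",
    "substitution": "likrary",
    "transposition": "lirbary",
}
distances = {kind: osa_distance("library", typo) for kind, typo in typos.items()}
print(distances)
```

Each of the four sample typos is one edit away from the intended word, which is why a small distance threshold can catch all four error types.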

Slide 29: Scanning and data conversion errors
Scanning errors such as the insertion of an extra space or the misreading of a letter or character
Errors can also creep into text documents when they are converted from one format to another, e.g., conversion of a document from MS Word format to HTML. Use find-and-replace only with care.

Slide 30: Metadata errors
Metadata errors can block access to documents in a DL
Particularly bad for objects such as images, for which full-text searching is not available
Automatic creation of metadata by means of special software can lead to errors in the metadata. Metadata creation generally needs human intervention to be successful.
Translating metadata from one scheme or format to another can also be a source of errors.

Slide 31: Managing errors (problem of data quality)
Fix the error in the document (if you own it)
Make available a new document that replaces the document with the error
Make available a new document that contains a notice of the error and its correction (this document would be hyperlinked to and from the original document containing the error)
Use special search software to compensate for some errors
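As a rough illustration of the last option, approximate matching lets a query with a typo still find the intended entry. The sketch uses `difflib.get_close_matches` from Python's standard library; the title list, the function name `fuzzy_lookup`, and the cutoff of 0.8 are assumptions chosen for the example.

```python
import difflib

# Hypothetical list of titles held in a DL's metadata.
CATALOG_TITLES = ["digital library", "metadata quality", "unicode standard"]

def fuzzy_lookup(query, titles=CATALOG_TITLES, cutoff=0.8):
    """Return the closest title if it is similar enough, else an empty list."""
    return difflib.get_close_matches(query.lower(), titles, n=1, cutoff=cutoff)

typo_result = fuzzy_lookup("Digital Libary")
unrelated_result = fuzzy_lookup("zzzz")
print(typo_result, unrelated_result)
```

The cutoff trades recall against false matches, so a real system would tune it against its own collection.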

Slide 32: Document management issues
Document storage is only one of the DL issues; other important questions to ask include
  - Would documents in a “digital library” be updated? If yes,
      - who can perform the update?
      - would an update need someone to approve it before it takes effect?
      - would content changes be tracked?
  - How can a document be searched? By metadata, by content, or by both?

Slide 33: References
National Institute of Standards and Technology. (2002). Digital media file types: Survey of common formats. (http://www.immaculatetechnology.com/Tech%20References/Digital%20Media%20File%20Formats.pdf)
Beall, J. (2005). Metadata and Data Quality Problems in the Digital Library. Journal of Digital Information, 6(3), Article No. 355, 2005-06-12. (Available: http://jodi.tamu.edu/Articles/v06/i03/Beall/)
