Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library - Tsutomu SUZUKI, Waseda University


1 Character Codes Related Problems - UNICODE OPAC and Millennium at WASEDA Univ. Library -
Tsutomu SUZUKI (tsutomu@waseda.jp), Waseda University Library
4th Hong Kong INNOPAC Users Group Meeting, December 2003

2 WASEDA University Overview
- Founded in 1882
- Now has:
-- 10 undergraduate schools
-- 14 graduate schools
-- 5 large campus libraries & 27 small libraries
-- 2 university museums
-- 44,576 undergraduate and 6,147 graduate students (as of end of April 2002)

3 Library Overview (as of March 31, 2002)
- 4,705,597 books (2,980,352 CJK books + 1,725,245 Western books)
- 49,615 journal titles (19,509 currently subscribed)
- 879,336 items checked out per year
- ILL transactions:
-- 13,951 requests to other libraries
-- 18,491 requests received from other libraries
- Total number of Central Library visits: 1,197,731 (2002.4 – 2003.3)

4 Current Status of Our INNOPAC
Recent record numbers (Oct. 29, 2003) from M-I-F-S
- 1,752,690 bibliographic records
- 3,434,122 item records
- 52,133 check-in records
Public catalog searches from “ANALYZE patron searches”
- 5,149,322 searches (2002.4 – 2003.3)

5 Unicode Port on WEBPAC
- On November 17th, the Unicode OPAC was released to the public (some character code troubles still remain).
- Downloading Chinese & Korean bib data from OCLC.
- Record maintenance: AnzioWin
- Number of C & K bib records (as of 11th Nov.):
-- 15,971 bibs of Chinese materials
-- 157 bibs of Korean materials

6 Appearance - Chinese record -

7 Appearance - Korean record -

8 Character code issues
Columns: Display, Search, Glyph
Case 1: Mapping Error (NG)
Case 2: Shift_JIS to EACC issue (NG)
Case 3: EACC layers related issue (NG)
Case 4: Duplicate codes in EACC (NG)
Case 5: Not Unified characters in UNICODE (NG)

9 Case1: Mapping Error
The screen below shows my patron record in Millennium Circulation. One of the Katakana characters, “Zu”, is not displayed properly.

10 Case1: Mapping Error
If I search for “suzuki” on the Unicode OPAC, “zu” is ignored and “suki” is hit.

11 Case1: Mapping Error
Columns: SJIS, EACC, UNICODE
SJIS: 253A
EACC: 69253A
UNICODE: 30BA
This EACC character is NOT mapped to any UNICODE character. It should be mapped to 30BA in UNICODE.
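
A minimal sketch of how one missing table entry could turn “suzuki” into “suki”. The function name, the policy of silently dropping unmapped codes, and the EACC codes used for “su” and “ki” are assumptions made for illustration; only 69253A and 30BA come from the slide, and Innopac's real conversion code is not shown in this presentation.

```python
# Hypothetical EACC -> Unicode table. Only the 69253A entry is taken from the
# slide; the EACC codes for "su" and "ki" are illustrative placeholders.
EACC_TO_UNICODE = {
    0x692539: 0x30B9,  # KATAKANA SU (placeholder EACC code)
    0x69252D: 0x30AD,  # KATAKANA KI (placeholder EACC code)
    # 0x69253A (KATAKANA ZU) is missing, as described on the slide.
}


def eacc_to_text(codes, table):
    """Convert EACC codes to text, silently dropping unmapped codes."""
    return "".join(chr(table[c]) for c in codes if c in table)


suzuki = [0x692539, 0x69253A, 0x69252D]  # su, zu, ki

# With the incomplete table, "zu" disappears from the converted string.
broken = eacc_to_text(suzuki, EACC_TO_UNICODE)

# Adding the mapping the slide asks for restores the full string.
patched = dict(EACC_TO_UNICODE)
patched[0x69253A] = 0x30BA
fixed = eacc_to_text(suzuki, patched)

print(broken)
print(fixed)
```

A converter that raised an error or emitted a visible replacement character for unmapped codes would make this kind of table gap easier to notice than a silent drop.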

12 Case2: Shift_JIS to EACC Issue
When I search for this Han character on the Shift_JIS OPAC, Innopac returns only 9 records.

13 Case2: Shift_JIS to EACC Issue
Columns: SJIS, EACC, UNICODE
SJIS: 97E9
EACC: 214930
The EACC character “215D58” is not assigned any glyph, according to OCLC CJK 3.11. But the mapping from Shift_JIS to EACC works fine.

14 Case2: Shift_JIS to EACC Issue
On the other hand, when I searched for this Han character on the Unicode OPAC, Innopac returned more than 2,000 records!

15 Case2: Shift_JIS to EACC Issue
Columns: SJIS, EACC, UNICODE
UNICODE: 6FDB
EACC: 214930
SJIS: 97E9
EACC: 455564
No relationship
These Shift_JIS and Unicode characters have the same glyph, but Innopac stores them at two different EACC code positions. Therefore we can NOT search for both characters at once.

16 Case2: Shift_JIS to EACC Issue
Columns: SJIS, EACC, UNICODE
UNICODE: 6FDB
EACC: 214930
SJIS: 97E9
EACC: 455564
One of the solutions: change the mapping of this Shift_JIS character from 214930 to 455564.
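
The effect of the proposed remapping can be sketched with a toy index keyed by EACC code. This assumes, as the proposed fix implies, that the Shift_JIS character currently resolves to 214930 and the Unicode character to 455564; the record IDs, dictionary names, and `search` function are invented for illustration.

```python
# Toy model: the index is keyed by EACC code. Record IDs are made up.
index = {
    0x214930: {"b1", "b2"},        # records reached from the Shift_JIS side
    0x455564: {"b3", "b4", "b5"},  # records reached from the Unicode side
}

# Input-side mappings, assumed from the slides: the Shift_JIS character
# currently goes to 214930, the Unicode character to 455564.
sjis_to_eacc = {0x97E9: 0x214930}
unicode_to_eacc = {0x6FDB: 0x455564}


def search(code, mapping):
    """Look up the records indexed under the EACC code for an input code."""
    return index.get(mapping[code], set())


# Before the fix, the two interfaces return different result sets.
before_sjis = search(0x97E9, sjis_to_eacc)
before_unicode = search(0x6FDB, unicode_to_eacc)

# Proposed fix: point the Shift_JIS character at 455564 instead.
sjis_to_eacc[0x97E9] = 0x455564
after_sjis = search(0x97E9, sjis_to_eacc)

print(before_sjis == before_unicode)
print(after_sjis == before_unicode)
```

In this toy model the records already filed under 214930 become unreachable after the remap, so existing data would presumably need converting as well; the slides do not say how that would be handled.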

17 Case3: EACC Layers Related Issue
Shift_JIS Telnet screen sample (my record). The data is displayed correctly.

18 Case3: EACC Layers Related Issue
Columns: SJIS, EACC, UNICODE
SJIS: 97E9
EACC: 215D58
In the Shift_JIS environment, there is no trouble searching for or displaying this character.

19 Case3: EACC Layers Related Issue
We can see the same data properly on Millennium. {69253a} is a separate problem, already mentioned in Case 1.

20 Case3: EACC Layers Related Issue
Reviewing the same data AFTER editing an element (NOTE) on Millennium. EACC character codes are displayed directly in one of the name fields and in the address.

21 Case3: EACC Layers Related Issue
We can see the data correctly on Millennium even after editing.

22 Case3: EACC Layers Related Issue
Columns: SJIS, EACC, UNICODE
SJIS: 97E9
EACC: 215D58
EACC: 4B5D58
UNICODE: 9234
Relationship: same code position on other layers

23 Case3: EACC Layers Related Issue
Columns: SJIS, EACC, UNICODE
SJIS: 97E9
EACC: 215D58
EACC: 4B5D58
UNICODE: 9234
Relationship: same code position on other layers
No character assigned: {4B5D58}
If records including this character are saved on Millennium, this Han character is NOT stored as its original EACC code (215D58).
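
The “same code position on other layers” relationship can be checked mechanically. This sketch assumes the first byte of the three-byte EACC code identifies the layer and the remaining two bytes the position within it; the function names are hypothetical.

```python
def split_eacc(code):
    """Split a 3-byte EACC code into (layer byte, 2-byte position)."""
    return code >> 16, code & 0xFFFF


def same_position(a, b):
    """True if two distinct EACC codes differ only in their layer byte."""
    return a != b and split_eacc(a)[1] == split_eacc(b)[1]


original = 0x215D58    # code stored before editing on Millennium
after_save = 0x4B5D58  # code found after saving, with no character assigned

print([hex(part) for part in split_eacc(original)])
print([hex(part) for part in split_eacc(after_save)])
print(same_position(original, after_save))
```

A check like this could flag records whose codes moved to another layer on save, although it says nothing about why the layer byte changed.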

24 Case4: Duplicate codes in EACC

25 Case4: Duplicate codes in EACC
There are more than 1,000 records for “matsu” on the Shift_JIS OPAC.

26 Case4: Duplicate codes in EACC
There is ONLY one record for “matsu” on the Unicode OPAC. (The screen below shows the direct-hit result.)

27 Case4: Duplicate codes in EACC
Columns: SJIS, EACC, UNICODE
UNICODE: 677E
EACC: 21442D
SJIS: 8FBC
EACC: 276163
We can DISPLAY both 21442D and 276163 in the Unicode OPAC, but only 276163 is searchable. Because of this EACC code duplication, the search results are NOT the same between the Shift_JIS OPAC and the Unicode OPAC.
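
One possible way to make both duplicate codes searchable is to expand a query to every code in the same duplicate group. This is a sketch of that idea, not a description of how Innopac works; the record IDs and function names are invented, and only the pair 21442D / 276163 comes from the slide.

```python
# Pairs of EACC codes treated as duplicates. Only this pair is from the slide.
DUPLICATES = [(0x21442D, 0x276163)]


def build_groups(pairs):
    """Map each code to the full set of codes that duplicate it."""
    groups = {}
    for a, b in pairs:
        merged = groups.get(a, {a}) | groups.get(b, {b})
        for code in merged:
            groups[code] = merged
    return groups


def search(index, code, groups):
    """Union the hits for a code and for every duplicate of it."""
    hits = set()
    for c in groups.get(code, {code}):
        hits |= index.get(c, set())
    return hits


# Toy index with made-up record IDs filed under each code.
index = {0x21442D: {"b1", "b2", "b3"}, 0x276163: {"b4"}}
groups = build_groups(DUPLICATES)

print(sorted(search(index, 0x276163, {})))      # without expansion
print(sorted(search(index, 0x276163, groups)))  # with expansion
```

Folding duplicates to one canonical code at indexing time would give the same results with cheaper lookups, at the cost of re-indexing existing records.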

28 Case5: Not Unified Characters in UNICODE
Do you think these two characters are the same or not?
UNICODE: 5618
UNICODE: 5653

29 Case5: Not Unified Characters in UNICODE
The result of searching for “uso” on the Shift_JIS OPAC.

30 Case5: Not Unified Characters in UNICODE
The same search on the Unicode OPAC. The result does not seem correct.

31 Case5: Not Unified Characters in UNICODE
When the other “uso” is input by picking it from the code table, the result is the same as on the Shift_JIS OPAC.

32 Case5: Not Unified Characters in UNICODE
Columns: SJIS, EACC, UNICODE
UNICODE: 5618
EACC: 21373B
UNICODE: 5653
SJIS: 8952
NOT HIT!

33 Case5: Not Unified Characters in UNICODE
Columns: SJIS, EACC, UNICODE
UNICODE: 5618
EACC: 21373B
UNICODE: 5653
SJIS: 8952
This 5618 should be normalized as 5653 in searching.
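
The standard Unicode normalization forms do not merge 5618 and 5653, so the folding asked for here has to come from an extra table. A minimal sketch, assuming a hypothetical `VARIANT_FOLD` table applied to both query and index keys (the single entry follows the slide):

```python
import unicodedata

# Hypothetical search-time folding table; the single entry follows the slide.
VARIANT_FOLD = {0x5618: 0x5653}


def search_key(text):
    """Apply NFKC, then fold the listed variant characters for searching."""
    return unicodedata.normalize("NFKC", text).translate(VARIANT_FOLD)


typed = "\u5618"   # the character typed by the user
stored = "\u5653"  # the character that gets hits

# NFKC alone leaves the typed character unchanged.
nfkc_only = unicodedata.normalize("NFKC", typed)

print(nfkc_only == stored)
print(search_key(typed) == search_key(stored))
```

The same table could be extended with other variant groups, but which member of a group should be the search form is a cataloguing decision the slides leave open.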

34 Normalization issue
Some special characters are ignored when searching on the Unicode OPAC. In this sample, “Cho-on”, the Japanese prolonged sound symbol, does not work. This search term means “Harry Potter” in Katakana.
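
A sketch of a search-key function that strips punctuation but keeps the prolonged sound mark (U+30FC), which Unicode classifies as a modifier letter (Lm) rather than punctuation. The function and the exact Katakana spelling of the query (with a middle dot, U+30FB) are assumptions; the slide does not show the string that was typed.

```python
import unicodedata


def search_key(text):
    """Drop punctuation, symbols and spaces; keep letters, marks and digits."""
    kept = []
    for ch in unicodedata.normalize("NFKC", text):
        category = unicodedata.category(ch)
        if category[0] in "LMN":  # "L" includes Lm, the class of U+30FC
            kept.append(ch)
    return "".join(kept)


# Assumed Katakana spelling of "Harry Potter", written with escapes.
title = "\u30cf\u30ea\u30fc\u30fb\u30dd\u30c3\u30bf\u30fc"
key = search_key(title)

print(unicodedata.category("\u30fc"))
print(key)
```

Filtering by general category rather than by a hand-written list of “special” characters keeps the long-vowel mark in the key while still removing the middle dot.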

35 Example of NOT unified characters (Case5)
Unicode: 6236, 6237, 6238

36 Related Documents & Information
- The Library of Congress Homepage: MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media -- CHARACTER SETS: Part 3 -- Code Table 9: EAST ASIAN (June 16, 2003) http://www.loc.gov/marc/specifications/specchareacc.html
- The Unicode Standard, Version 3.0. The Unicode Consortium. ISBN 0201616335 (Version 4.0 has since been released)
- OCLC CJK and its contents in HELP http://www.oclc.org/cjk/

37 Unicode OPAC in Japan
- University of Tokyo: Multilingual OPAC of the University of Tokyo http://mulopac.dl.itc.u-tokyo.ac.jp/
- National Diet Library: NDL Asian Language Materials OPAC http://asiaopac.ndl.go.jp/index_e.html

38 Thank you!!
The Best Solution: Unicode + normalization scheme

