City University of Hong Kong Chinese University of Hong Kong The HKIUG Unicode Project, Philip Wong (CityU) & Ho Yee Ip (CUHK), Fourth HKIUG Meeting,


2 City University of Hong Kong Chinese University of Hong Kong The HKIUG Unicode Project, Philip Wong (CityU) & Ho Yee Ip (CUHK), Fourth HKIUG Meeting, 8-9 Dec 2003 Slide 1 The HKIUG Unicode Project Fourth Annual HKIUG Meeting 8-9 Dec, 2003 Lingnan University, Hong Kong Philip WONG, CityU Library HO Yee Ip, CUHK Library

Slide 2 Overview
Part I
- Background
- Problems
- Objective & Methodology
- Procedures
- Deliverables and Actions
Part II
- Follow-up
- Are the problems solved
- Future work

Slide 3 The HKIUG Unicode Project - Part I
by Philip Wong, City University of Hong Kong Library
December 8, 2003

Slide 4 Background
- Several different character sets support CJK.
- Big5 is common in Hong Kong and Taiwan; GB is used in Mainland China.
- CCCII and EACC are mainly used in libraries. EACC is the LC standard.
- Unicode is widely supported by operating systems, applications and the W3C.

Character set   No. of CJK chars   Code space    Released      Support              Linking feature
Big5            13,053             14,758        1984          Traditional          No
GB 18030        27,000             1.6 million   2000          Trad. & Simplified   No
CCCII           75,684             830,584       1980          Trad. & Simplified   Yes
EACC            15,728             830,584       1983          Trad. & Simplified   Yes
Unicode         82,270             1.1 million   2000 (v. 3)   Trad. & Simplified   No

Reference: KT Lam, "Overview of Chinese Character Encoding", http://www.lib.cuhk.edu.hk/seminar/unicode/kt_lam_files/frame.htm

Slide 5 Background
Different character sets assign different code points to the same character (more precisely, the same glyph).

Character set    Code point for 余 (yu)
Big5             A745
GB 18030-2000    5164
CCCII            213131, 216076
EACC             276076
Unicode          4F59
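The Big5 and Unicode rows of this table can be reproduced with Python's built-in codecs. This is only a minimal sketch: Python ships no EACC or CCCII codec, so those rows cannot be checked this way, and the variable names are illustrative.

```python
# Look up the slide's example character in the encodings Python knows about.
ch = "余"

unicode_cp = f"{ord(ch):04X}"                # Unicode code point
big5_hex = ch.encode("big5").hex().upper()   # Big5 double-byte code
utf8_hex = ch.encode("utf-8").hex().upper()  # UTF-8 byte sequence for the same code point

print("Unicode:", unicode_cp)
print("Big5:   ", big5_hex)
print("UTF-8:  ", utf8_hex)
```

The same character therefore has a different numeric identity in each client encoding, which is why a mapping table is needed between the client and the internal store.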

Slide 6 Background
- Innovative supports CJK by storing CJK text internally in EACC and CCCII.
- The internal code is not Unicode based.

Record as displayed:
100 1  |6880-01|aYu, Guangzhong,|d1928-
245 10 |6880-02|aYu Guangzhong shi xuan
880 1  |6100-01/$1|a余光中,|d1928-
880 10 |6245-02/$1|a余光中詩選

Same record in edit mode (Ctrl-W), showing the internal codes:
100 1  |6880-01|aYu, Guangzhong,|d1928-
245 10 |6880-02|aYu Guangzhong shi xuan
880 1  |6100-01/$1|a{213131}{213272}{213034},|d1928-
880 10 |6245-02/$1|a{213131}{213272}{213034}{21585c}{215c4f}
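The relationship between the two views of the 880 fields can be sketched as a lookup over braced codes. The five table entries below are read off this slide by matching the codes to the characters in order; the function name and the decision to leave unknown codes untouched are assumptions made for illustration, not Innovative's actual behaviour.

```python
import re

# Internal code -> character, taken from the slide's two views of the same record.
# Illustrative subset only; a real mapping table has many thousands of rows.
INTERNAL_TO_CHAR = {
    "213131": "余",
    "213272": "光",
    "213034": "中",
    "21585C": "詩",
    "215C4F": "選",
}

BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")

def decode_braced(text: str) -> str:
    """Replace each {xxxxxx} internal code with its character; keep unknown codes as-is."""
    def repl(match):
        return INTERNAL_TO_CHAR.get(match.group(1).upper(), match.group(0))
    return BRACED.sub(repl, text)

field = "|a{213131}{213272}{213034}{21585c}{215c4f}"
print(decode_braced(field))
```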

Slide 7 Background
- A mapping table is required to convert internal codes to and from client encodings.
- This was once a good solution, but it also created many problems.
- Many issues have been raised and discussed over the years:
  - Seminar on Chinese Information Processing in Libraries, HKUST, Jan 1998
  - Discussion list: LIB-CHINESE Listserv

Interface                   Client encoding   Internal code
Telnet Big5, WebPAC Big5    Big5              EACC/CCCII
Millennium, WebPAC UTF-8    UTF-8             EACC/CCCII

Slide 8 Problems
Problem 1: Multiple mapping of internal codes to one client code
- The code that is searched for or input may not be the one desired.
- The order of mappings may differ among local sites, giving inconsistent results in Z39.50 searching.
- In the III UTF-8 table there are 1150 multiple mapping cases (2232 characters), covering both EACC and CCCII, some with high usage frequency, e.g. 台 (U+53F0), 漢 (U+6F22).

Multiple mapping of 台 (tai) in UTF-8:

EACC/CCCII   Unicode   Meaning
283b7d       53F0      simplified form of the tai in "table" 檯
27605d       53F0      simplified form of the tai in "typhoon" 颱
213538       53F0      "tai" in its proper form
27542b       53F0      simplified form of the tai in "Taiwan" 臺
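Multiple mapping cases like this one can be found by grouping table rows on the Unicode side. The sketch below uses the four 台 rows from this slide plus one single-mapped row (多, taken from the closing slide) for contrast; the row layout and function name are assumptions, not the actual format of the III table.

```python
from collections import defaultdict

# (internal code, Unicode code point) pairs in table order.
ROWS = [
    ("283B7D", "53F0"),
    ("27605D", "53F0"),
    ("213538", "53F0"),
    ("27542B", "53F0"),
    ("21387D", "591A"),  # 多, only one internal code in this toy table
]

def find_multiple_mappings(rows):
    """Return {unicode: [internal codes]} for code points reached from more than one internal code."""
    by_unicode = defaultdict(list)
    for internal, ucs in rows:
        by_unicode[ucs].append(internal)
    return {ucs: codes for ucs, codes in by_unicode.items() if len(codes) > 1}

print(find_multiple_mappings(ROWS))
```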

Slide 9 Problems
Problem 2: In multi-mapping cases, EACC and CCCII may overlap
- Overlapping introduces more multiple mappings.
- It creates extra workload when exchanging records with international bibliographic services that only accept EACC.

E.g. in UTF-8:
214274 (CCCII)   U+65E6   旦 (dan)
27565A (EACC)    U+65E6   旦

E.g. in Big5:
213131 (CCCII)   A745   余 (yu)
276076 (EACC)    A745   余

Slide 10 Problems
Problem 3: The III mapping table contains other problems
In UTF-8 (Release 2002 Phase 3):
- Errors: 27615F is mapped to U+53CB 友; it should be U+53D1 发.
- Missing cases: 212F30 for U+3007 〇 is missing.
- Wrong types: 213538 (U+53F0; 台) is typed as non-EACC; it should be EACC.

Slide 11 Problems
Analysis of the UTF-8 mapping, done by local sites between April and June 2003
Total entries: 23,669 (Rel. 2002 Phase 3)

Type                             According to III   Studied by local sites (* by UST, # by CityU)
EACC                             15,290 (65%)       15,665 (66%)*; multi-mapping linked: 224; multi-mapping unlinked: 47
Non-EACC                         7,954 (34%)        8,004 (33%)*; 954 have EACC equivalents
"may be invalid internal code"   425 (1%)           EACC 188#; Non-EACC 237

Questions:
- Can local sites select their own preferences for multiple mappings?
- Can non-EACC codes be abandoned, and those with EACC equivalents be converted to EACC in the database?
- Can the correct EACC/CCCII type be re-assigned based on the standard?

Slide 12 Problems
Problem 4: Software inconsistency
What triggered the HKIUG Unicode Project was the inconsistent software mapping between Big5 and UTF-8 in multiple mapping cases:
- Big5 client: mapped to the first entry
- UTF-8 client: mapped to the last entry

Slide 13 Problems
Software inconsistency (cont)
Searching 才 (cai) in WebPAC Big5 (or Telnet Big5): mapped to the first entry.

Internal   Big5
213f7b     A47E   (first entry, used)
28736d     A47E

Slide 14 Problems
Software inconsistency (cont)
Searching 才 (cai) in WebPAC UTF-8 (or Millennium): mapped to the last entry.

Internal   UTF-8
213f7b     624D
28736d     624D   (last entry, used)
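The inconsistency described on Slides 12 to 14 can be sketched with two lookup functions over the same ordered rows. The rows are the 才 entries from the slides; the functions are a toy model of the two behaviours, not Innovative's code.

```python
# Table rows in file order: (internal code, client code).
BIG5_ROWS = [("213F7B", "A47E"), ("28736D", "A47E")]
UTF8_ROWS = [("213F7B", "624D"), ("28736D", "624D")]

def lookup_first(rows, client_code):
    """Model of the Big5 client: the first matching row wins."""
    for internal, code in rows:
        if code == client_code:
            return internal
    return None

def lookup_last(rows, client_code):
    """Model of the UTF-8 client: the last matching row wins."""
    found = None
    for internal, code in rows:
        if code == client_code:
            found = internal
    return found

print(lookup_first(BIG5_ROWS, "A47E"))  # internal code searched from a Big5 client
print(lookup_last(UTF8_ROWS, "624D"))   # internal code searched from a UTF-8 client
```

The same keystroke thus searches for a different internal code depending on the client, so a record that is found from one interface can be missed from the other.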

Slide 15 Objective & Methodology
- A seminar was organized by CUHK in July 2003: http://www.lib.cuhk.edu.hk/seminar/unicode/
- An HKIUG Working Group on the Unicode Project was formed. Members: CUHK, CityU, HKU, HKUST.

Objective
- Solve the software inconsistency between Big5 and UTF-8
- Decide on one-to-one mapping or many-to-one mapping
- Decide on pure EACC or EACC and CCCII
- Clean up errors, wrong types and missing cases
- Prepare to transfer to a Unicode-based database

Slide 16 Objective & Methodology (cont)
The working group further decided:
- Not to fix the Big5 table (small character set, supports only traditional Chinese, more multiple mappings, etc.)
- To propose a new UTF-8 mapping table to Innovative
- For EACC mapping, to follow the LC standard
- To allow multiple mappings of EACC; for unlinked cases, to decide on the preferences
- For multiple mappings of EACC and CCCII, to remove the CCCII
- To convert CCCII in the database to EACC equivalents
- To avoid missing characters by including pure CCCII (though it is a low percentage of the database)

Slide 17 Procedures
How diac.utf8.hkiug was created:

From LC EACC:
  15739 EACC merged
  Subtracted 66 substitutes for missing (U+3013)
  = 15673 EACC

From diac.utf8:
  7999 CCCII extracted
  Subtracted 955 with EACC equivalent
  = 7044 pure CCCII

15673 EACC + 7044 pure CCCII = 22717 EACC/CCCII in diac.utf8.hkiug

Further steps:
- Remapped 287 PUA
- Selected preferences in multi-mapping linked and unlinked cases
- Corrected LC mappings
- Prepared list for CCCII to EACC data conversion
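The merge and subtraction steps on this slide can be sketched on a toy scale. Everything structural here is an assumption made for illustration: the tables are modelled as dicts of internal code to Unicode code point, a CCCII code is treated as "having an EACC equivalent" when it maps to the same code point as an EACC row, and the internal code 7F7F7F is a hypothetical placeholder for a row mapped to the U+3013 substitute. The working group's actual procedure is not spelled out in the slides.

```python
SUBSTITUTE = 0x3013  # U+3013 GETA MARK, the "substitute for missing" target

def build_table(lc_eacc, iii_cccii):
    """Return (new_table, has_eacc) from two {internal code: code point} dicts."""
    # Step 1: keep LC EACC rows, dropping those that only map to the substitute.
    eacc = {code: cp for code, cp in lc_eacc.items() if cp != SUBSTITUTE}
    eacc_by_cp = {}
    for code, cp in eacc.items():
        eacc_by_cp.setdefault(cp, code)  # first EACC code seen for each code point

    # Step 2: add pure CCCII rows; set aside CCCII rows that have an EACC equivalent.
    new_table = dict(eacc)
    has_eacc = {}
    for code, cp in iii_cccii.items():
        if cp in eacc_by_cp:
            has_eacc[code] = eacc_by_cp[cp]  # candidate for data conversion
        else:
            new_table[code] = cp
    return new_table, has_eacc

# Toy data: real codes from the slides, except the hypothetical 7F7F7F row.
lc_eacc = {"276076": ord("余"), "27565A": ord("旦"), "7F7F7F": SUBSTITUTE}
iii_cccii = {"213131": ord("余"), "214274": ord("旦"),
             "217455": ord("坓"), "22483E": ord("洣")}

new_table, has_eacc = build_table(lc_eacc, iii_cccii)
print(sorted(new_table))
print(has_eacc)

# The counts on the slide are consistent with this subtract-then-merge shape.
eacc_count = 15739 - 66
cccii_count = 7999 - 955
print(eacc_count, cccii_count, eacc_count + cccii_count)
```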

Slide 18 Procedures
Source from LC
Merged the tables from LC's EACC to UCS/Unicode Mappings: http://www.loc.gov/marc/specifications/specchareacc.html

Slide 19 Procedures
Source from diac.utf8
Included pure CCCII from the UTF-8 table (Rel 2002 Phase 3)

CCCII with no EACC equivalents (pure CCCII)
  e.g. 217455 坓, 22483E 洣
  7,044 entries. Added to the new table.

CCCII with EACC equivalents
  e.g. 213131 (CCCII) 余, 276076 (EACC) 余
  955 entries. Excluded from the new table. Sent to III for data conversion.

Slide 20 Procedures
Re-mapped PUA
Re-mapped 297 Private Use Area (PUA) code points to suggested alternates

Slide 21 Procedures
Selected preference in EACC multiple mapping

Linked (same lower-order bytes)
  Example: 4B3178 倩, 213178 倩
  Number of cases: 160 (320 characters)
  Enhanced indexing: Yes
  Labeled as: "multi-mapping linked"
  Preference: does not matter

Unlinked (different lower-order bytes)
  Example: 283B7D 台, 27605D 台, 213538 台, 27542B 台
  Number of cases: 49 (108 characters)
  Enhanced indexing: No
  Labeled as: "multi-mapping unlinked"
  Preference: selected case by case (based on the HKUST study on word frequency and meaning)
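The linked/unlinked distinction can be sketched as a comparison of the lower-order two bytes of each three-byte code. The function names are illustrative, and treating "lower-order bytes" as the last four hex digits is an assumption that is consistent with both examples on this slide.

```python
def lower_bytes(code: str) -> str:
    """Return the lower-order two bytes (last four hex digits) of a three-byte internal code."""
    return code.upper()[2:]

def classify(codes) -> str:
    """Label a multiple-mapping case by whether all its codes share the same lower-order bytes."""
    return "linked" if len({lower_bytes(c) for c in codes}) == 1 else "unlinked"

print(classify(["4B3178", "213178"]))                      # the 倩 example
print(classify(["283B7D", "27605D", "213538", "27542B"]))  # the 台 example
```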

Slide 22 Procedures
Selected preference in EACC multiple mapping, linked (cont)
Linked cases: HKIUG preference indicated

Slide 23 Procedures
Selected preference in EACC multiple mapping, unlinked (cont)
Unlinked cases: HKIUG preference indicated

Slide 24 Procedures
Updated LC mappings
Referenced from other sources:
- Unihan
- OCLC
- USMARC Character Set for Chinese, Japanese, Korean (printed)

Examples:
273C67   LC mapped to U+E9D8   Remapped to U+5E72 (干)
4B3C2b   LC mapped to U+E9C7   Remapped to U+67C3 (柃)
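Both LC targets in these examples fall in Unicode's Private Use Area (U+E000 to U+F8FF in the Basic Multilingual Plane), which is what the re-mapping step addresses. A minimal sketch of applying such a re-map list follows; the dict layout and function names are assumptions, and the third table row is only there to show that non-PUA entries are left alone.

```python
def is_pua(code_point: int) -> bool:
    """True for code points in the BMP Private Use Area."""
    return 0xE000 <= code_point <= 0xF8FF

# Suggested alternates from the slide: internal code -> replacement code point.
REMAP = {"273C67": 0x5E72, "4B3C2B": 0x67C3}

def apply_remap(table, remap):
    """Return a copy of the table with PUA targets replaced where an alternate is listed."""
    out = dict(table)
    for code, cp in table.items():
        if is_pua(cp) and code in remap:
            out[code] = remap[code]
    return out

lc_table = {"273C67": 0xE9D8, "4B3C2B": 0xE9C7, "276076": 0x4F59}
fixed = apply_remap(lc_table, REMAP)
print({code: f"U+{cp:04X}" for code, cp in fixed.items()})
```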

Slide 25 Procedures
Prepared list for data conversion
CCCII with EACC equivalents: list of CCCII to EACC pairs for conversion

Slide 26 Deliverables and Actions
Deliverables to Innovative
1. diac.utf8.hkiug - HKIUG version of the UTF-8 mapping table
   EACC 15,673
   Pure CCCII 7,044
   Total 22,717
2. hasEACC.txt - CCCII with EACC equivalents - 955
3. Final Report - Hong Kong Innovative Users Group (HKIUG) III-UTF8 Working Group Report

Actions for Innovative
1. Endorse and install diac.utf8.hkiug
2. Replace the CCCII codes listed in hasEACC.txt with their EACC equivalents in the database

Note: local sites can choose whether or not to implement the above actions (e.g. while adopting the new table, CUHK chose to run its own data conversion).

Slide 27 The HKIUG Unicode Project - End of Part I

Slide 28 The HKIUG Unicode Project - Part II
by Ho Yee Ip, CUHK University Library System
December 8, 2003

Slide 29 Are the problems solved
- Resolve the Big5 and UTF-8 software inconsistency? Yes (if Big5 interfaces are abandoned)
- Use the same preferred mappings among local sites? Yes (if all sites adopt the new table)
- Able to search the desired code in multiple mapping? Yes (if added entries are created)
- No overlapping of EACC and CCCII in multiple mapping? Yes
- Clear up all errors and missing cases? No (on-going job)
- Switch 100% to Millennium? No (unfortunately, 2002 Phase 3 created more problems ...)

Slide 30 Are the problems solved
New problems in Release 2002 Phase 3
- Millennium editing implicitly converts non-preferred entries to the preferred entry (this may be an old problem from Phase 2).
- Worse, this "preferred" entry may not be the HKIUG preferred one. It is always mapped to the 2nd entry, which is wrong for multiple mappings with more than two entries.

Testing
1. In Millennium Cataloguing, input 台 in braced code {283B7D}
2. Save the record
3. Check in telnet edit mode (Ctrl-W): still {283B7D}
4. Re-save the record in Millennium with no further editing
5. Re-check in telnet: it has become {27542b}

Note: Global update or amending attached records will not invoke this conversion.
Millennium is not yet ready for CJK editing!

Slide 31 Are the problems solved
Installed sites
Reports from sites that have installed the new UTF-8 mapping table and run the data conversion:
- successful?
- failed?
- unexpected outcome?

Slide 32 Follow-up
Mapping table
Continue to clean up and supplement the mapping table
- Recommend updates and changes of the EACC mapping to LC and III.
- There are 169 mappings that differ between III and LC. HKIUG followed LC.
- Consider this case:
    III choice: 2D552E   U+82FA 苺
    LC choice:  2D552E   U+8393 莓
  The two are obviously different.
- Consult: USMARC character set for Chinese, Japanese, Korean. Washington, D.C.: Library of Congress, 1986. There, the glyph of 2D552E is 苺 (the same as III).
- Is III right or is LC right?
- Others: 232D42, 396B33, 23355C

Slide 33 Follow-up
Mapping table (cont)
Other differences between LC and III

232D42
  III choice: U+8842 衂
  LC choice:  U+4610 (2 dots)
  Minor variation
  US MARC (printed): 衂 (same as III)

396B33
  III choice: U+524F 剏
  LC choice:  U+5259 剙 (2 dots)
  Minor variation
  US MARC (printed): 剏 (same as III)

23355C
  III choice: U+8C63 豣
  LC choice:  U+86C3 蛃
  Obviously different
  US MARC (printed): 豣 (same as III)
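A comparison of this kind can be sketched as a diff over two tables keyed by internal code. The four differing rows are the ones quoted on Slides 32 and 33; the fifth row (276076) is added only as an example of agreement, and that agreement is assumed for illustration rather than taken from the LC table.

```python
# internal code -> Unicode code point, per table
III_TABLE = {"2D552E": 0x82FA, "232D42": 0x8842, "396B33": 0x524F,
             "23355C": 0x8C63, "276076": 0x4F59}
LC_TABLE = {"2D552E": 0x8393, "232D42": 0x4610, "396B33": 0x5259,
            "23355C": 0x86C3, "276076": 0x4F59}

def diff_tables(a, b):
    """Return {code: (a_value, b_value)} for codes present in both tables with different targets."""
    shared = sorted(a.keys() & b.keys())
    return {code: (a[code], b[code]) for code in shared if a[code] != b[code]}

for code, (iii_cp, lc_cp) in diff_tables(III_TABLE, LC_TABLE).items():
    print(f"{code}: III U+{iii_cp:04X} {chr(iii_cp)} / LC U+{lc_cp:04X} {chr(lc_cp)}")
```

A diff like this only finds where the tables disagree; deciding which side is right still needs a reference such as the printed USMARC character set.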

Slide 34 Follow-up
Mapping table (cont)
Continue to clean up and supplement the mapping table
- Supplement diac.utf8.hkiug with additional CCCII
  Source: latest Unihan database file (e.g. ftp://ftp.unicode.org/Public/4.0-Update1/Unihan-4.0.1d3b.zip)
- Amend diac.utf8.hkiug when LC updates its code standard
  Source: LC MARC 21 code standard (http://www.loc.gov/marc/specifications/specchareacc.html)

Slide 35 Follow-up
Added entries
Change of cataloguing practice
- Provide added entries for unlinked multi-mapping codes
- Source data may not be the preferred code (by meaning)
- Transcription should be faithful to the source
- Added entries enhance retrieval

e.g. 历 (U+5386), source: 万年历
  历 {274349}               linked to 曆 {214349}
  历 {27462A} (preferred)   linked to 歷 {21462A}

Slide 36 Follow-up
Added entries (cont)
Source: 万年历
  历 {274349}               linked to 曆 {214349}
  历 {27462A} (preferred)   linked to 歷 {21462A}

Data input: the non-preferred code, in braced format: 万年{274349}
  Data stored: {274F22} {213C65} {274349}
  Retrieval by glyphs: 萬年曆 (i.e. by traditional glyphs: {214F22}{213C65}{214349})
  Hit? Yes

Data input: create the added entry by inputting the glyphs: 万年历
  Data stored: {274F22} {213C65} {27462A}
  Retrieval by glyphs: 万年历 (i.e. by simplified glyphs: {274F22}{213C65}{27462A})
  Hit? Yes

Action: About 29 cases out of the 49 unlinked cases need attention
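Why the added entry is needed can be sketched with a toy search key. The assumption here, which the slides imply but do not state, is that linked retrieval compares only the lower-order two bytes of each code, so traditional and simplified forms in the same link group match each other. The function name and key format are hypothetical.

```python
import re

BRACED = re.compile(r"\{([0-9A-Fa-f]{6})\}")

def link_key(braced: str):
    """Toy search key: the lower-order two bytes of each braced internal code."""
    return tuple(code.upper()[2:] for code in BRACED.findall(braced))

stored_title = "{274F22}{213C65}{274349}"       # 万年历 transcribed with the non-preferred 历
query_traditional = "{214F22}{213C65}{214349}"  # 萬年曆
query_simplified = "{274F22}{213C65}{27462A}"   # 万年历 typed as glyphs, mapped to the preferred 历

print(link_key(stored_title) == link_key(query_traditional))  # same link group
print(link_key(stored_title) == link_key(query_simplified))   # different link group
```

Under this model the simplified query misses the faithful transcription, which is the gap the added entry {274F22}{213C65}{27462A} fills.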

Slide 37 Follow-up
Staff mode
- Since the Big5 mapping table is not fixed, Telnet Big5 mode can no longer be used; explore other software: AnzioWin, putty.
- In Telnet mode, the INNOPAC UTF-8 port cannot support full-screen editing; only line editing is feasible.
- CJK display is corrupted in full-screen editing.

Slide 38 Follow-up
Staff mode (cont)
- Some local sites, e.g. CUHK, use AnzioWin.
- When AnzioWin is set to CCCII mode, its mapping table CCCII.UNI can be used for Unicode mapping.
- Deficiency: CCCII.UNI is one-to-one, so non-preferred entries cannot be included, e.g.
    # 274349 53D1   # not preferred
    274C7B 53D1
- It is better to use the Innopac UTF-8 port when it is ready for editing.

Slide 39 Future
To migrate to a pure Unicode environment ...
- Abandoning EACC/CCCII will lose the linking of traditional, simplified and variant forms. How can these be linked?
    历 U+5386
    曆 U+66C6
    歷 U+6B77
- Linking information is available from the Unihan website: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5386
- Migration can be considered only if this linking is maintained by the vendor.
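One way such linking could be rebuilt from Unihan data is sketched below. The sample lines follow the tab-separated layout of the Unihan variant fields (code point, field name, value), but they are hand-written for the three characters on this slide and should be checked against the real database; the function name and the choice to make links symmetric are assumptions.

```python
from collections import defaultdict

# Hand-written sample in the Unihan "U+XXXX<TAB>field<TAB>value" layout (illustrative only).
SAMPLE = (
    "# sample variant data\n"
    "U+5386\tkTraditionalVariant\tU+66C6 U+6B77\n"
    "U+66C6\tkSimplifiedVariant\tU+5386\n"
    "U+6B77\tkSimplifiedVariant\tU+5386\n"
)

VARIANT_FIELDS = {"kTraditionalVariant", "kSimplifiedVariant"}

def parse_variants(text):
    """Build a symmetric {character: set of linked characters} map from variant lines."""
    links = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        source, field, value = line.split("\t", 2)
        if field not in VARIANT_FIELDS:
            continue
        src = chr(int(source[2:], 16))
        for token in value.split():
            # Drop any "<source" annotation, then the "U+" prefix.
            target = chr(int(token.split("<")[0][2:], 16))
            links[src].add(target)
            links[target].add(src)
    return links

links = parse_variants(SAMPLE)
print(sorted(links[chr(0x5386)]))
```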

Slide 40 Thank You
多 {21387D} U+591A
謝 {215938} U+8B1D
The HKIUG Unicode Project - The End

