Brian Hitchcock OCP DBA 8i Global Sales IT Sun Microsystems


Session #403
NLS and The Case of the Missing Kanji
NLS: National Language Support
Kanji: Japanese characters

Brian Hitchcock October 21, 2001 Page 4
How It All Started
• Existing Sybase database and application
• Needed to convert to Oracle
• Use Oracle Migration Workbench
  – OMWB works well
• I wasn’t told there was multi-byte data in the Sybase database
• After the migration to Oracle
  – Kanji data were missing

Kanji Become Lost
• How was Kanji stored in Sybase?
• How was the application working with Sybase?
• Why was it lost when migrated to Oracle?
• How to fix it in Oracle?
• Would the application work with Oracle?

Before Upgrade to Oracle
Diagram: Source System (EUC-JP character encoding) → Sybase Db (character set ISO1) → Application → Netscape Browser
• EUC-JP character code for the example Kanji character is 0xB0A1
• Source system inserts the bytes of Kanji characters into the db (0xB0A1)
• Application retrieves char data, generates HTML
• Browser examines each byte, detects multi-byte characters, displays the Kanji character
• Select “Japanese (Auto-Detect)” character set in Netscape to view Kanji characters

Moving Sybase Data to Oracle
Diagram: Existing Sybase Db (character set ISO1) → flat file → SQL*Loader → Oracle Db (character set WE8ISO8859P1) → Application → Netscape Browser
• EUC-JP character code for the example Kanji character is 0xB0A1
• Flat file produced using the Sybase bcp utility (0xB0A1)
• SQL*Loader: Oracle defaults to US7ASCII; 0xB0A1 goes in, 0x3021 comes out
• Application retrieves char data, generates HTML
• Browser displays the characters for byte codes 0x30 and 0x21, which are 0 and ! (0!)

Moving Sybase Data to Oracle
• What happened?
  – The Oracle load defaulted to the US7ASCII character set
    • 7 bits per character
  – Import stripped the 8th bit off each byte
    • 8th bit set to 0
  – 8-bit characters are now 7-bit characters
  – Original character data is lost
  – 8-bit characters can’t be represented in the US7ASCII character set

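US7ASCII to WE8ISO8859P1
• EUC-JP character code for the example Kanji character is 0xB0A1
• B0A1 hexadecimal:
    B    0    A    1
    1011 0000 1010 0001
• Strip off the 8th (highest order) bit of each byte, set this bit to 0:
    0011 0000 0010 0001
    3    0    2    1
• Character codes 0x30 and 0x21 are the characters 0 and !
• Browser reads each byte, sees the 8th bit set to 0, decides that each byte represents a single-byte character; character codes 30 and 21 represent the characters 0 and !

Hex  Decimal  Binary
0    0        0000
1    1        0001
2    2        0010
3    3        0011
4    4        0100
5    5        0101
6    6        0110
7    7        0111
8    8        1000
9    9        1001
A    10       1010
B    11       1011
C    12       1100
D    13       1101
E    14       1110
F    15       1111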
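The bit stripping described above can be reproduced outside the database. The following is a minimal sketch, not the code Oracle runs: the class and method names are hypothetical, and it only assumes that a 7-bit conversion clears the highest-order bit of each byte, as the slide states.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative and not from the slides.
final class SevenBitStrip {

    // Clear the 8th (highest order) bit of every byte, leaving 7-bit values.
    static byte[] stripHighBit(byte[] input) {
        byte[] out = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = (byte) (input[i] & 0x7F);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] eucJp = {(byte) 0xB0, (byte) 0xA1}; // EUC-JP bytes of the example Kanji
        byte[] stripped = stripHighBit(eucJp);
        // Show the surviving byte values in hex.
        System.out.printf("0x%02X 0x%02X%n", stripped[0] & 0xFF, stripped[1] & 0xFF);
        // Each 7-bit value is now read as one ASCII character.
        System.out.println(new String(stripped, StandardCharsets.US_ASCII));
    }
}
```

Running it prints 0x30 0x21 followed by 0!, the same two characters the browser displayed. The original high bits cannot be recovered from the stripped bytes, which is why the data had to be reloaded.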

Fix: Sybase Data to Oracle
Diagram: Existing Sybase Db (character set ISO1) → flat file → SQL*Loader → Oracle Db (character set WE8ISO8859P1) → Application → Netscape Browser
• EUC-JP character code for the example Kanji character is 0xB0A1
• Flat file produced using the Sybase bcp utility (0xB0A1)
• SQL*Loader run with NLS_LANG=WE8ISO8859P1; 0xB0A1 is loaded unchanged
• Oracle Db holds 0xB0A1
• Application retrieves char data, generates HTML
• Browser detects multi-byte characters, displays the Kanji character
• Note: WE8ISO8859P1 character set does not support Kanji characters

Current Oracle Production
Diagram: Source System (EUC-JP character encoding) → Oracle Db (character set WE8ISO8859P1) → Application → Netscape Browser
• EUC-JP character code for the example Kanji character is 0xB0A1
• Source system inserts the bytes of Kanji characters into the db (0xB0A1)
• Application retrieves char data, generates HTML
• Browser examines each byte, detects multi-byte characters, displays the Kanji character
• Select “Japanese (Auto-Detect)” character set in Netscape to view Kanji characters
• Note: WE8ISO8859P1 character set does not support Kanji characters

Existing Application
• How does it store/retrieve Kanji?
  – Multi-byte Kanji characters
  – Stored in WE8ISO8859P1 (single-byte) db
  – Application
    • JDBC retrieves bytes from WE Oracle db
    • Java generates HTML, sent to client browser
    • Netscape, view HTML using “Japanese (Auto-Detect)” character set
    • Display Kanji

Existing Application
Diagram: Oracle Db (character set WE8ISO8859P1) → Application → Netscape Browser (Japanese Auto-Detect); stage labels: ISO1, UCS2, ISO1
• Application retrieves character data, converts ISO1 to UCS2, generates HTML, sends HTML to client browser
• Browser detects multi-byte characters, displays the Kanji character
• UCS2: 2-byte Unicode

Existing Application
• Each piece of software makes some decision (default) about character set
• You need to understand this process for your application

What Really Happened?
• Source Kanji data
  – From EUC-JP character set
  – Multi-byte
• Kanji multi-byte stored in Sybase db
  – Default character set ISO-1
    • 8-bit, single-byte
• Kanji multi-byte stored in Oracle db
  – Character set WE8ISO8859P1
    • 8-bit, single-byte

Convert to UTF8
• Why?
  – Eliminate all the issues shown so far
  – Store multiple languages correctly
    • Correctly encoded
  – Support clients inserting data in languages other than Japanese Kanji
  – Existing application can only support languages based on Latin characters and Kanji

Conversion is Simple, Isn’t It?
• Export WE8ISO8859P1 database
  – Set export client NLS_LANG
  – AMERICAN_AMERICA.WE8ISO8859P1
• Import into UTF8 database
  – Set import client NLS_LANG
  – AMERICAN_AMERICA.WE8ISO8859P1
• Test application
  – Application works!
  – Is everything OK?

Meanwhile, Back at the Ranch
• While application testing is going on
  – Insert sample bytes for Kanji into WE db
    • Use Oracle SQL CHR() function
  – Export from WE db, import into UTF8 db
  – Examine same bytes in UTF8 db
  – Compare UTF8 bytes to manually generated UTF8 bytes for the Kanji characters
  – NOT the same bytes!
    • What does this mean?

UTF8 Encoding Process
• Think your life is boring?

JA16EUC to UTF8 Conversion
• JA16EUC encoding is 0xB0A1; Unicode code point for this character is 4E9C (no formula for this, Oracle uses a lookup table)
• Number of bytes used in UTF8 encoding is based on the Unicode code point:

Unicode            UTF8 bytes
0x0000 - 0x007f    0xxxxxxx
0x0080 - 0x07ff    110xxxxx 10xxxxxx
0x0800 - 0xffff    1110xxxx 10xxxxxx 10xxxxxx

• 4E9C requires 3 bytes in UTF8
• 4E9C bit pattern is 0100 1110 1001 1100
  – right-most 6 bits (011100) go to third UTF8 byte
  – next 6 bits (111010) go to second UTF8 byte
  – remaining 4 bits (0100) go in first UTF8 byte
• Result:
    1110 0100  1011 1010  1001 1100
    E    4     B    A     9    C
• UTF8 character code is 0xE4BA9C
• Metalink Doc ID: Note: Determining the codepoint for UTF8 characters
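The bit shuffling above is easy to check by hand in Java. This is a sketch of the standard UTF-8 bit patterns for code points up to U+FFFF, not Oracle's implementation; the class name Utf8Encode is hypothetical, and surrogate code points are not treated specially.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative and not from the slides.
final class Utf8Encode {

    // Encode one code point in the range U+0000..U+FFFF using the
    // 1-, 2- and 3-byte patterns from the table above.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0xFFFF) {
            throw new IllegalArgumentException("only U+0000..U+FFFF handled in this sketch");
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};                       // 0xxxxxxx
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                       // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))                      // 10xxxxxx
            };
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),                          // 1110xxxx
            (byte) (0x80 | ((cp >> 6) & 0x3F)),                  // 10xxxxxx
            (byte) (0x80 | (cp & 0x3F))                          // 10xxxxxx
        };
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E9C);
        // Cross-check against the JDK's own UTF-8 encoder.
        byte[] library = "\u4E9C".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual) + " matches JDK encoder: " + Arrays.equals(manual, library));
    }
}
```

For code point 4E9C the sketch produces 0xE4BA9C, the value worked out on the slide.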

Bytes is Bytes
Diagram: the same Kanji character shown in Unicode (UCS2), UTF8, ISO-2022-JP, JIS, EUC-JP and Shift-JIS; arrows between encodings are labeled either "Formula" or "Unicode lookup table"
Kanji character encodings shown for the various character sets:
• JIS row/cell values for this character are Row 16, Column 1
• ISO-2022-JP character code for this character is 0x3021
• EUC-JP byte code for this character is 0xB0A1
• Shift-JIS byte code for this character is 0x889F
• Unicode byte code for this character is 0x4E9C
• UTF8 byte code for this character is 0xE4BA9C
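The "Formula" arrows can be illustrated with plain arithmetic. The sketch below uses the commonly documented relationships between JIS X 0208 row/cell values and the JIS, EUC-JP and Shift-JIS byte codes; the slides do not spell these formulas out, so treat them as an assumption, and the class name JisCodes as hypothetical. The step to Unicode is not covered because it needs a lookup table.

```java
// Hypothetical demo class; the formulas are the widely documented JIS X 0208
// relationships, assumed here rather than taken from the slides.
final class JisCodes {

    // Row/cell (1..94 each) to the 2-byte JIS code used inside ISO-2022-JP.
    static int jis(int row, int cell) {
        return ((row + 0x20) << 8) | (cell + 0x20);
    }

    // EUC-JP sets the high bit of both JIS bytes.
    static int eucJp(int jis) {
        return jis | 0x8080;
    }

    // Shift-JIS packs two JIS rows into one lead byte.
    static int shiftJis(int jis) {
        int j1 = jis >> 8;
        int j2 = jis & 0xFF;
        int s1 = ((j1 - 0x21) >> 1) + 0x81;
        if (s1 > 0x9F) {
            s1 += 0x40;            // skip the single-byte katakana range
        }
        int s2;
        if ((j1 & 1) == 1) {       // odd JIS lead byte: first half of the trail range
            s2 = j2 + 0x1F;
            if (s2 >= 0x7F) {
                s2++;              // 0x7F is never used as a trail byte
            }
        } else {                   // even JIS lead byte: second half
            s2 = j2 + 0x7E;
        }
        return (s1 << 8) | s2;
    }

    public static void main(String[] args) {
        int jis = jis(16, 1);      // Row 16, Column 1 from the slide
        System.out.printf("JIS 0x%04X, EUC-JP 0x%04X, Shift-JIS 0x%04X%n",
                jis, eucJp(jis), shiftJis(jis));
    }
}
```

For Row 16, Column 1 this gives 0x3021, 0xB0A1 and 0x889F, matching the values listed above.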

Conversion Issue
Diagram: Application / Oracle Db (character set WE8ISO8859P1) → Export File → Import → Oracle Db (character set UTF8) → flat file → Netscape Browser (view file)
• EUC-JP character code for the example Kanji character is 0xB0A1
• Oracle export hard codes the source db character set into the export file
• Import loads the export file as character set WE8ISO8859P1; 0xB0A1 becomes 0xC2B0, 0xC2A1
• Select char data from the UTF8 db, spool to flat file
• Browser displays the characters for character codes 0xC2B0, 0xC2A1, which are the degree sign and the inverted exclamation mark: °¡
• Correct UTF8 byte code for this character is 0xE4BA9C

Import to UTF8 Conversion
Existing WE8ISO8859P1 data to UTF8 Conversion
• EUC-JP encoding for the character is 0xB0A1, but import detects that the data came from a single-byte export file (WE8ISO8859P1)
• Import reads each byte, one at a time: 0xB0A1 becomes 0xB0 followed by 0xA1, and import converts these to the Unicode (UCS2) equivalent
  – For single-byte character codes, the Unicode equivalent simply has a leading byte of 0's
  – 0xB0 and 0xA1 become U+00B0 and U+00A1
• Import then converts from UCS2 to UTF8
• Number of bytes used in UTF8 encoding is based on the Unicode code point:

Unicode            UTF8 bytes
0x0000 - 0x007f    0xxxxxxx
0x0080 - 0x07ff    110xxxxx 10xxxxxx
0x0800 - 0xffff    1110xxxx 10xxxxxx 10xxxxxx

• 00B0 and 00A1 both require 2 bytes in UTF8
• 00B0 bit pattern is 0000 0000 1011 0000
• 00A1 bit pattern is 0000 0000 1010 0001
  – right-most 6 bits go to second UTF8 byte
  – next 5 bits go in first UTF8 byte
• Result:
    1100 0010  1011 0000      1100 0010  1010 0001
    C    2     B    0         C    2     A    1
• WE8ISO8859P1 character code 0xB0A1 becomes UTF8 character codes 0xC2B0, 0xC2A1
• The correct conversion of this EUC-JP character code 0xB0A1 to UTF8 is 0xE4BA9C
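The byte-at-a-time conversion can be reproduced with the JDK's own charsets. This is a sketch of the effect only, not of Oracle import; it assumes Java's ISO-8859-1 is a fair stand-in for WE8ISO8859P1, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative and not from the slides.
final class ImportMisconversion {

    // Treat every input byte as one ISO-8859-1 character, then encode as UTF-8.
    static byte[] latin1ToUtf8(byte[] input) {
        String asLatin1 = new String(input, StandardCharsets.ISO_8859_1);
        return asLatin1.getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] eucJp = {(byte) 0xB0, (byte) 0xA1}; // one Kanji, two bytes
        byte[] converted = latin1ToUtf8(eucJp);
        // Two single-byte characters in, two 2-byte UTF-8 characters out.
        System.out.println(hex(eucJp) + " -> " + hex(converted));
    }
}
```

The output is B0 A1 -> C2 B0 C2 A1: the degree sign and inverted exclamation mark, not the three bytes E4 BA 9C of the Kanji.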

What Happened?
• Oracle did exactly what it was told to do
  – Take bytes from WE database
  – Convert to UTF8 bytes
  – Export file was made from WE database
  – WE is a single-byte character set
  – Convert each byte one at a time to UTF8
  – Kanji character consists of 2 bytes in WE db
  – Converting each byte to UTF8 is not the same as converting the pair of bytes to UTF8
• Yeah, but, application works! (?)

Where’s the Problem?
• UTF8 db has Kanji as 0xC2B0, 0xC2A1
• Correct UTF8 encoding is 0xE4BA9C
• If new, correctly encoded Kanji is inserted
  – Database contains two sets of bytes for same Kanji character
  – How does app deal with this?
• Existing app only works using Netscape Japanese (Auto-Detect) character set
  – App is not really UTF8, only works for Japanese characters

How Does Application Work?
• Review
  – Oracle db created using UTF8 character set
  – Java retrieves char data (bytes) from UTF8 db
    • Converts to UCS2 (Unicode)
  – Java code generates HTML
  – Client browser displays Kanji characters
    • Netscape, “Japanese (Auto-Detect)” char set
• Application still works
  – bytes in UTF8 db don’t represent UTF8 encoded Kanji

Application Works (?)
Diagram: Oracle Db (character set UTF8) → Application → Netscape Browser (Japanese Auto-Detect); stage labels: UTF8, UCS2, ISO1
• EUC-JP character code for the example Kanji character is 0xB0A1
• Export WE database, import into UTF8 db
• Bytes shown: 0xC2B0, 0xC2A1 in the UTF8 db; 0xB0, 0xA1 after the application
• Application retrieves character data, converts UTF8 to UCS2, generates HTML, sends HTML to client browser
• Browser detects multi-byte characters, displays the Kanji character

Test Application
• Insert bytes for correctly encoded Kanji
  – Into UTF8 db
  – Use CHR() function
• Display this data using existing application
  – Does NOT display Kanji!
    • Using “Japanese (Auto-Detect)” character set
  – Try Netscape UTF8 character set
    • Doesn’t display Kanji
  – UTF8 character set should work, shouldn’t it?

Where Are We?
• Correctly encoded UTF8 multi-byte character data for Kanji does not work with existing application
• Simply “converting” (export WE, import to UTF8) doesn’t result in correctly encoded UTF8 character data
• Need to figure out what app code is doing
  – Whoever wrote it is gone
  – The usual state of affairs

How To Debug App Code?
• Don’t use app code: write a very simple Java Servlet
  • (The Java Diva helps with this…)
  – Servlet simply retrieves character data from db
    • Runs in iPlanet web server
  – Generates HTML for client browser
• Use servlet to retrieve correct UTF8 Kanji
  – Does not display Kanji!
• Fix servlet, then can fix application code?

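Modified Servlet Code

// Declare the response as UTF-8 in the HTTP Content-Type header.
res.setContentType("text/html;charset=UTF-8");
// Wrap the response stream so characters are written as UTF-8 bytes (autoflush on).
PrintWriter out = new PrintWriter(
    new OutputStreamWriter(res.getOutputStream(), "UTF-8"), true);
// Repeat the character set in the page itself; DQ holds a double-quote character.
out.println("<META HTTP-EQUIV=" + DQ + "Content-Type" + DQ
    + " CONTENT=" + DQ + "text/html; charset=utf-8" + DQ + ">");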

Fix Application
• Make same changes to application code
• Browser displays Kanji correctly
  – Manually generated, correctly encoded UTF8
• Application interacts with Dynamo
  – Need to reconfigure Dynamo for UTF8 data
• Application fixed (?)
  – Works with correctly encoded UTF8 multi-byte data

Is Application Really Fixed?
• Fixed app retrieves correctly encoded UTF8 character data
• What about existing character data?
  – Data that was exported from WE and imported into UTF8 db
• Use fixed app code to retrieve existing data
  – Existing Kanji are not displayed
• Original app did display existing data...
  – Existing data is not correctly encoded UTF8

Fixed Application
Diagram: Oracle Db (character set UTF8) → Application → Netscape Browser (UTF8); stage labels: UTF8, UCS2, UTF8
• EUC-JP character code for the example Kanji character is 0xB0A1
• Bytes shown: 0xC2B0, 0xC2A1 in the UTF8 db; 0xB0, 0xA1 in the application; 0xC2B0, 0xC2A1 sent to the browser
• Application retrieves character data, converts UTF8 to UCS2, generates HTML, sends HTML to client browser
• Browser displays the characters for the UTF8 bytes 0xC2B0, 0xC2A1, which are the degree sign and the upside down exclamation point: °¡

How To Fix Existing Data?
• What’s wrong with existing data (UTF8 db)
  – Character data is not correctly encoded UTF8
  – It is UTF8 encoded Unicode of each single byte that was exported from WE database
• Before importing into UTF8 database?
  – EUC-JP character set (Latin ASCII and Kanji)
  – Stored in single-byte WE database
• Need to convert UTF8 of WE of EUC-JP to correct UTF8 bytes for Kanji

Review of Bytes is Bytes
• Original Kanji character 0xB0A1 (EUC-JP)
• Inserted into Oracle database
  – 0xB0, 0xA1 in WE8ISO8859P1 db
• Exported/imported into Oracle UTF8 db
  – Individual bytes converted to UTF8
    • Original Kanji character was 2 bytes
    • Became 4 bytes in UTF8 db
    • 0xC2B0, 0xC2A1
• Correct UTF8 bytes are 0xE4BA9C

How to Convert Existing Data?
• Fix in Oracle WE before export/import
  – No point, export/import will ‘corrupt’ character data, will need to fix after export/import
• Don’t export/import
  – SQL select each table to flat files from WE db
  – SQL*Loader into UTF8 database
    • Use CHARACTERSET JA16EUC option
    • More work moving each table one at a time

SQL*Loader Option
Diagram: Application / Oracle Db (character set WE8ISO8859P1) → flat file → SQL*Loader (CHARACTERSET JA16EUC) → Oracle Db (character set UTF8) → Application → Netscape Browser
• EUC-JP character code for the example Kanji character is 0xB0A1
• WE8ISO8859P1 db holds 0xB0A1
• SQL select to flat file for each table (0xB0A1)
• SQL*Loader with CHARACTERSET JA16EUC loads 0xB0A1; UTF8 db holds 0xE4BA9C
• Application retrieves char data, generates HTML
• Select “UTF8” character set to view Kanji characters

Convert Existing Data
• Fix data after import into UTF8 database
  – Export from WE, import into UTF8 database
  – Use Oracle SQL CONVERT() function
    • CONVERT() from UTF8 to WE8ISO8859P1
    • CONVERT() from JA16EUC to UTF8
  – Need to CONVERT() each column of each table that contains multi-byte data
    • How to be sure which columns to CONVERT()?
    • CONVERT() all columns that contain char data?
  – Must test using CONVERT() to verify it works

Fix After Import
Diagram: Application / Oracle Db (character set WE8ISO8859P1) → Export File → Import → Oracle Db (character set UTF8) / Application
• EUC-JP character code for the example Kanji character is 0xB0A1
• WE8ISO8859P1 db holds 0xB0A1, exported as the single bytes 0xB0, 0xA1
• Import loads the export file as character set WE8ISO8859P1; UTF8 db holds 0xC2B0, 0xC2A1
• Convert from UTF8 to WE8ISO8859P1: gives back the bytes of the original EUC-JP Kanji character
• Convert from EUC-JP to UTF8: gives 0xE4BA9C
• Correctly encoded UTF8 bytes for this Kanji are 0xE4BA9C
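The two-step repair can be tried on the example bytes before touching any table. This sketch mirrors the two conversions with JDK charsets instead of Oracle's CONVERT(); it assumes Java's ISO-8859-1 and EUC-JP charsets behave like WE8ISO8859P1 and JA16EUC for this character, and that the JDK in use ships the EUC-JP charset. The class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it imitates the two conversion steps with JDK
// charsets and is not a substitute for testing CONVERT() in the database.
final class Utf8Repair {

    static byte[] repair(byte[] misconverted) {
        // Step 1: UTF8 -> WE8ISO8859P1 recovers the original single bytes.
        byte[] original = new String(misconverted, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: read those bytes as EUC-JP and encode them as UTF8.
        Charset eucJp = Charset.forName("EUC-JP"); // assumed to be available
        return new String(original, eucJp).getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] inUtf8Db = {(byte) 0xC2, (byte) 0xB0, (byte) 0xC2, (byte) 0xA1};
        System.out.println(hex(inUtf8Db) + " -> " + hex(repair(inUtf8Db)));
    }
}
```

The repair is not safe to apply twice: feeding already-correct UTF8 bytes through step 1 asks ISO-8859-1 to represent a Kanji, which it cannot, so the character would be replaced rather than preserved.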

Oracle CONVERT()
• Syntax, examples
  – select CONVERT(<column>, <destination char set>, <source char set>)
    • select CONVERT(<column>, WE8ISO8859P1, UTF8)
    • select CONVERT(<column>, UTF8, JA16EUC)
  – Don’t re-run CONVERT() without testing
    • re-run may corrupt data
    • regenerate original source data, re-run CONVERT()

Overall Conversion Process
• What we did…
  – Identify tables/columns that contain multi-byte data
  – Export from WE database
  – Import into UTF8 database
    • rows=n, create tables, don’t load data
  – Widen columns for UTF8 multi-byte data
    • increase to 3 times original width
  – Import into UTF8 database (again)
    • ignore=y, load data into existing tables

Overall Conversion Process
• Continued
  – CONVERT() columns that contain multi-byte data
  – Test, compare with data from existing application/data
• Conversion includes converting all pieces of the application, not just the Oracle database

Details: Source Char Set?
• How did I determine this?
  – Original Kanji data was from EUC-JP
  – How was this determined?
    • Examine bytes of original character data
    • Display original Kanji characters
    • Find single Kanji in Japanese dictionary
      Gives row-cell code of Kanji in JIS-0208
    • Using other reference sources, manually generate bytes for the Kanji in various encodings
    • Compare with bytes of original Kanji data
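The manual comparison can be partly automated once one character in the data has been identified. The sketch below encodes a known Kanji in several candidate character sets and reports which ones reproduce the observed bytes. The class name, method name and candidate list are hypothetical, and it assumes the JDK in use supports the Japanese charsets; unsupported names are simply skipped.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; names are illustrative and not from the slides.
final class GuessEncoding {

    // Return the candidate charset names whose encoding of knownChar
    // is byte-for-byte equal to the bytes observed in the database.
    static List<String> matchingCharsets(String knownChar, byte[] observed, String... candidates) {
        List<String> hits = new ArrayList<>();
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // this JDK does not ship the charset
            }
            byte[] encoded = knownChar.getBytes(Charset.forName(name));
            if (Arrays.equals(encoded, observed)) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] observed = {(byte) 0xB0, (byte) 0xA1}; // bytes found in the db
        // U+4E9C is the Kanji at JIS X 0208 Row 16, Column 1.
        System.out.println(matchingCharsets("\u4E9C", observed,
                "EUC-JP", "Shift_JIS", "ISO-2022-JP", "UTF-8"));
    }
}
```

A single match is a strong hint rather than proof; checking several different characters from the real data narrows it down further.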

Rosetta Stone?
Oracle8i National Language Support Guide, Release 2 (8.1.6), December 1999, Part No. A, page 3-22

Reference Books Used
• The New Nelson Japanese-English Character Dictionary
  By John H. Haig, Andrew N. Nelson
  Published by Periplus Editions, Ltd
  Date Published: 11/1996
  ISBN:
• The Unicode Standard: With CD-ROM
  By Unicode Consortium
  Published by Addison Wesley Longman, Inc.
  Date Published: 04/1995
  ISBN:
• CJKV Information Processing
  By Ken Lunde, Gigi Estabrook (Editor)
  Published by O'Reilly & Associates, Incorporated
  Date Published: 01/1999
  ISBN:

Lessons Learned
• Oracle (and Sybase) don’t store characters
  – They store bytes, strings of bytes
• Normally, Oracle does NO checking of character set
  – does NOT check that bytes inserted represent correct characters in database character set
• Only under specific circumstances does Oracle “apply” a character set to char data
• Changing character set affects more than just the database

Lessons Learned
• Bytes of a character from any char set can be stored in a db of any char set
  – EUC-JP char in WE db, in UTF8 db
  – bytes in db are not ‘correct’ bytes for the character in the db character set
  – all apps, users, dbs must know that db contains char data from other char set
  – Any char set conversion may corrupt the char data, e.g. import WE into UTF8 db

Lessons Learned
• Simply exporting db, importing into UTF8 does not solve the problems
• Testing requires generating correctly encoded character data
• Every piece of an application makes a decision about character set (default)
• If all data in db really is in the db char set
  – export, import to db of other char set works
• Need to see original character data
  – Verify data after char set conversion
