Creating Multi-Lingual and Multi-Locale Databases


Creating Multi-Lingual and Multi-Locale Databases International Unicode Conference 19 Presented by Addison Phillips Globalization Architect webMethods, Inc.

Introduction
Audience: Beginning Developers
Presenter: Addison Phillips, Globalization Architect, webMethods, Inc. mailto:aphillips@webmethods.com
Presentation: http://www.inter-locale.com

Creating complex systems in a global environment requires more than internationalized code. Since most enterprise systems rely on relational databases, a global-ready system must also consider database design in order to be truly effective.

Tight timeframe for this discussion. Want to leave time for Q&A at the end. This presentation will be available at my personal website (shown). The whitepaper contains some information not in this presentation, in part because other presentations today contained similar information. Note: webMethods products generally are written in Java. Most of the examples and information in this presentation are related to

Our Problem
This presentation is based on the lessons learned in developing and deploying a B2B “conversation management” system (webMethods for Trading Networks) and “partnership management” software (webMethods PartnerConnect). The products we created share a central database that allows webMethods customers to manage their B2B trading partnerships.

Terminology:
  A “trading partner” is a company that you want to do business with.
  An “initiative” is a specific opportunity to work together.
A trading network generally has:
  a “hub”, which is the trading network itself;
  “partners”, the folks you’re trading with;
  “initiatives”, the particular opportunities to work together.

When designing the system, we needed to take into account the business requirements of our customers, mostly Global 2000 companies. Since the questionnaires are user-defined, they can be authored in any language. And suppliers may come from a variety of locales. But the system is centrally managed. We needed to provide:
  1. The ability to store multilingual data without having separate database instances.
  2. The ability for a trading network to localize a questionnaire and serve it according to the supplier’s language preferences.
  3. The ability to customize the questionnaire depending on the locale of both the supplier and the hub owner (since different locations, for example, might have different requirements for the same general initiative).

Trading Networks Companies need to store information about transactions and business relationships world wide and in real time. We call this “Global Business Visibility”

Partner Connect Goals
  Centrally served (one instance).
  Centrally managed (initiatives can be deployed anywhere).
  Localized (so partners can interact with initiatives in their own language).
  Cultural and market sensitivity (customized to fit different market conditions locally).
  Created and managed by the customer entirely through HTML interface.

Profile and Conversation Management

Enter the Database
  Serve both global and local initiatives from a single instance.
  Store data in multiple writing systems (scripts, languages).
  Provide for actual differences in the data due to user location (“locale”).
  Provide for localization of global content.
  Provide for local content management.

Basic Rules for a Global DB Schema
  Expand fields to support changes in character encoding.
  Expand fields to support differences in the storage requirements of other locales (cultural or linguistic expansion, as opposed to encoding).
  Classify data as locale-neutral, locale-intrinsic, or locale-related and re-normalize the tables accordingly.
  Create efficient access to both global and locale-specific information.

Selecting an Encoding
If a database instance will only serve a single locale (or compatible locales), then the character encoding can be selected based on local requirements (“legacy encoding”).
If the database must store data from many locales (or incompatible writing systems), then the character encoding selected must be a Unicode encoding.
Each database vendor has a unique approach to this.
Encodings vary in terms of performance and capability. Generally the two choices you have are:
  UTF-8
  UTF-16 (the successor to UCS-2; it adds surrogate pairs for characters beyond U+FFFF)
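To make the trade-off between the two Unicode encodings concrete, here is a minimal Python sketch that compares storage sizes for a few sample strings. The sample strings and the helper name encoded_sizes are illustrative assumptions, not part of the original presentation.

```python
# Sample strings are illustrative only; escapes are used so the exact
# code points are unambiguous.
samples = {
    "Latin": "abc",
    "Greek": "\u03b1\u03b2\u03b3",          # three Greek letters
    "Japanese": "\u65e5\u672c\u8a9e",       # three CJK ideographs
    "Supplementary": "\U00010400",          # one character beyond U+FFFF
}


def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # "-le" avoids counting a byte order mark
    )


for label, text in samples.items():
    print(label, encoded_sizes(text))
```

ASCII text is half the size in UTF-8, East Asian text is smaller in UTF-16, and a character beyond U+FFFF takes four bytes in either encoding.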

Character Encodings and DDL
Each database vendor provides their own encoding support.
Most support “legacy” encodings and their variants.
Need a Unicode encoding to support multiple languages (globally)*
Each vendor handles Unicode encodings differently.

  CREATE TABLE Address (
    cust_id    number,
    attn       varchar(50),
    department varchar(50),
    street1    varchar(50),
    street2    varchar(50),
    city       varchar(50),
    state      char(2),
    zip        varchar(5),
    country    varchar(18));

Example: Cloudscape
Cloudscape is a pure Java database.
Uses java.lang.String objects to store char and varchar data, so all string data is stored as UCS-2.
One java.lang.Character equals one unit in DDL.

  CREATE TABLE Address (
    cust_id    number,
    attn       varchar(50),
    department varchar(50),
    street1    varchar(50),
    street2    varchar(50),
    city       varchar(50),
    state      char(2),
    zip        varchar(5),
    country    varchar(18));

Example: Oracle
Oracle provides several native Unicode encodings. The most commonly used one is called “UTF8”.
Characters in UTF-8 range from one to four bytes*
Char and varchar2 types are defined in bytes, not characters. So a varchar2(30) can reliably store only 10 Unicode 2.1.8 characters (at up to three bytes each), and as many as 30 if every character is a single byte.
Note that a varchar2(60) is required to reliably store 10 characters that are encoded as surrogate pairs (six bytes each in this encoding).

  CREATE TABLE Address (
    cust_id    number,
    attn       varchar2(150),
    department varchar2(150),
    street1    varchar2(150),
    street2    varchar2(150),
    city       varchar2(150),
    state      char(6),
    zip        varchar2(15),
    country    varchar2(18));
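Because the column limit is counted in bytes, application code often has to check or trim a value before inserting it. The following Python sketch trims a string to a byte budget without splitting a character. It uses standard UTF-8 rather than Oracle's own "UTF8" character set, and the function name truncate_to_bytes is a hypothetical helper, not an Oracle or JDBC API.

```python
def truncate_to_bytes(text, byte_limit, encoding="utf-8"):
    """Return the longest prefix of text whose encoded form fits in byte_limit.

    Characters are added one at a time so a multibyte character is never
    cut in half.
    """
    kept = []
    used = 0
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > byte_limit:
            break
        kept.append(ch)
        used += size
    return "".join(kept)


# A 30-byte column holds 30 ASCII letters but only 10 three-byte characters.
print(len(truncate_to_bytes("a" * 40, 30)))
print(len(truncate_to_bytes("\u65e5" * 40, 30)))
```

The same 30-byte column therefore has a capacity anywhere between 10 and 30 characters, depending on the data.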

Oracle Example Create a table and insert values. Notice that “multibyte” values take more room to store.

Example: MS SQL Server 2000
SQL Server 2000 provides support for the UTF-16 encoding of Unicode via the nchar and nvarchar datatypes.
Char and varchar must use a legacy encoding, with sizes defined in bytes (so a varchar(30) can store as many as 30 and as few as 15 characters in Shift-JIS [CP932]).
Nchar and nvarchar are defined in characters, so an nvarchar(30) can store 30 characters.
Note that an nvarchar(30) can only store 15 characters beyond U+FFFF.

  CREATE TABLE Address (
    cust_id    integer,
    attn       nvarchar(50),
    department nvarchar(50),
    street1    nvarchar(50),
    street2    nvarchar(50),
    city       nvarchar(50),
    state      nchar(2),
    zip        nvarchar(5),
    country    nvarchar(18));
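The capacity figures above can be checked with a little arithmetic. This Python sketch counts UTF-16 code units (the unit in which the slide defines nvarchar sizes) and CP932 bytes (the unit for a legacy varchar). The helper names are hypothetical; nothing here talks to SQL Server.

```python
def utf16_units(text):
    """Number of 16-bit UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2


def fits_nvarchar(text, n):
    """True if text fits a column sized in n UTF-16 code units."""
    return utf16_units(text) <= n


def fits_legacy_varchar(text, n, encoding="cp932"):
    """True if text fits a column sized in n bytes of a legacy encoding."""
    return len(text.encode(encoding)) <= n


kanji = "\u65e5"          # one BMP character, two bytes in CP932
beyond = "\U00010400"     # one character beyond U+FFFF, two UTF-16 units

print(fits_nvarchar(kanji * 30, 30))        # 30 BMP characters fit
print(fits_nvarchar(beyond * 16, 30))       # 16 supplementary characters do not
print(fits_legacy_varchar(kanji * 15, 30))  # 15 double-byte characters fit
```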

SQL Server Example Create a table. Insert data using “multibyte” characters. Insert data using “single-byte” characters.

Oracle Unicode Encoding Variations
AL24UTFFSS. The original Unicode encoding supported by Oracle. It is not compatible with modern Unicode and should be avoided.
UTF8. A multibyte encoding used by most versions of Oracle. This version encodes Unicode Scalar Values larger than U+FFFF as the UTF-8 sequence for a pair of surrogate characters. This results in binary sorting sequences compatible with UTF-16 representations. This violates the Unicode “shortest form” requirement (note that this is invisible to my Java application). *(All JDBC drivers adjust the connection to use this encoding automatically.)
AL32UTF8. A UTF-8 encoding provided in Oracle 9i that correctly encodes Unicode Scalar Values larger than U+FFFF using the shortest form. Note that the sorting sequence is different.
nchar/nvarchar support for UTF-16 in Oracle 9i.
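The difference between the two UTF-8 variants can be illustrated in a few lines of Python. The function below is a sketch of the surrogate-pair scheme as described above (each UTF-16 surrogate written as its own three-byte sequence). It is an illustration written for this purpose, not Oracle code, and the name surrogate_pair_utf8 is hypothetical.

```python
def surrogate_pair_utf8(text):
    """Encode text with characters above U+FFFF written as two 3-byte
    sequences, one per UTF-16 surrogate (six bytes in total)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                # "surrogatepass" lets Python emit the bytes for a lone surrogate
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


bmp_high = "\uff5e"        # a BMP character above the surrogate range
supplementary = "\U00010000"

print(len(supplementary.encode("utf-8")))        # shortest form: 4 bytes
print(len(surrogate_pair_utf8(supplementary)))   # surrogate-pair form: 6 bytes

pair = [bmp_high, supplementary]
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_pair_form = sorted(pair, key=surrogate_pair_utf8)
by_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))
print(by_pair_form == by_utf16, by_utf8 == by_utf16)
```

In a binary sort, the surrogate-pair form orders the two sample characters the same way UTF-16 does, while shortest-form UTF-8 orders them by code point. That is the sorting difference the slide notes for AL32UTF8.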

MS SQL Server 2000 Issues
Code Page 65001: This is Microsoft’s code page for UTF-8. It cannot be used as a char/varchar/text encoding, even in the most recent versions of MS SQL Server. See http://support.microsoft.com/support/kb/articles/Q232/5/80.ASP for more information.
You can use a different encoding (by setting the collation) for each data column, but this is not a very convenient way to work in a global environment.
JDBC connections to SQL Server use the JDBC-ODBC driver. This driver cannot tell the difference between n-types and “regular” types, and thus cannot retrieve Unicode string values.
  Note that this also applies to several middleware products, notably Merant.
  Note that this also applies to use of variant text types in other databases (such as Oracle 9i).

Some Other Databases
Sybase
  ASE supports UTF-8.
  Sybase 11 supports UTF-8 via an add-on.
  ASE is adding support for UTF-16 via a new data type.
IBM DB/2
  Supports UTF-16 (as CCSID 13844).
  Supports UTF-8 as a database encoding.
MySQL
  MySQL doesn’t support Unicode. The Open Source folks need to get to work …

Modifying Size Constraints
UTF-8 has a maximum number of bytes-per-character of 4.
  The vast majority of characters use 3 or fewer.
  Older systems (JDK, for example) cannot access the 4-byte characters.
Determine size requirements: Specific constraint --or-- Arbitrary constraint.
Specific Size Limit:
  Check length using code (database fields have variable restrictions).
  For UTF-8: Multiply by 3 (or 4) bytes to get field length.
  Example: varchar2(10) becomes varchar2(30).
Arbitrary Size Limit:
  Multiply the desired maximum by 3 bytes to get approximate size.
  Adjust according to database and performance requirements.
  Example: varchar2(100) becomes varchar2(255). [Was able to store 100 characters, now a “minimum maximum” of 85.]
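The sizing rules on this slide reduce to simple arithmetic, sketched here in Python. The function names and the cap parameter are assumptions made for illustration. The cap stands in for "adjust according to database and performance requirements".

```python
def expand_specific(char_limit, bytes_per_char=3):
    """Byte length needed so that char_limit characters always fit."""
    return char_limit * bytes_per_char


def expand_arbitrary(desired_chars, cap, bytes_per_char=3):
    """Approximate byte length for a soft limit, held under a chosen cap."""
    return min(desired_chars * bytes_per_char, cap)


def minimum_maximum(byte_limit, bytes_per_char=3):
    """Worst-case number of characters a byte-sized field can hold."""
    return byte_limit // bytes_per_char


print(expand_specific(10))              # varchar2(10) -> varchar2(30)
print(expand_arbitrary(100, cap=255))   # varchar2(100) -> varchar2(255)
print(minimum_maximum(255))             # "minimum maximum" of 85 characters
```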

Cultural Data Expansion
Data also changes size (and sometimes type) because of “culture” or locale. Examples:
  Social Security Number is 13 digits in France.
  “Postal code” is not all numeric outside the USA and may be quite long.
  Different address units than “State”.
  Spanish users often have two or three “middle” names.
Avoid arbitrarily small char and varchar field lengths.
  Most databases optimize storage of variable length fields.
But avoid performance killing sizes.
  Oracle block size limitations.
  Oracle JDBC character conversion “latching” maximum (2000 bytes in 8.0.5).

Cultural Data Expansion
State (char2) becomes province (up to 85 characters).
ZIP code (varchar9) becomes postalcode (up to 50 characters and probably much more).
Address fields expand from 50 bytes to 255 bytes (or about 85 characters).
Don’t assume that the same fields will always represent the exact same data values.

  CREATE TABLE Address (
    address_id char(24),
    cust_id    char(24),
    contact_id char(24),
    country_id char(2),
    department varchar2(255),
    street1    varchar2(255),
    street2    varchar2(255),
    city       varchar2(255),
    province   varchar2(255),
    postalcode varchar2(150));

--Addr demo.
--Fine for simple table, now the complex bits.

What’s Left?
So far we’ve:
  Expanded storage to deal with character encodings.
  Expanded storage to deal with cultural and linguistic data expansion.
We still need to:
  Allow for localization of textual elements.
  Allow for relational changes due to cultural or linguistic requirements.

Basic Questionnaire Table Structure
--------
Q_ID      char(24)
QUES_ID   char(24)
TYPE_ID   char(2)
QUESTION  varchar(255)
SEQ_NUM   number

Structure with Localization

QUESTION
--------
Q_ID         char(24)      PK
QUES_ID      char(24)      PK
TYPE_ID      char(2)
DESCRIPTION  varchar(255)
SEQ_NUM      number

QUESTION_LOCALE
---------------
Q_ID     char(24)      PK, FK
QUES_ID  char(24)      PK, FK
LCID     number        PK
NAME     varchar(255)

Standard usage: DESC vs. NAME
  DESC is default locale value, so empty ResultSets can still function.
  LCID is a number in this particular design. Let’s examine why…

Selecting the Locale
How Java does it:
  <baseclass>+<specific language>+<specific country>+<specific variant>
  <baseclass>+<specific language>+<specific country>
  <baseclass>+<specific language>
  <baseclass>+<default language>+<default country>+<default variant>
  <baseclass>+<default language>+<default country>
  <baseclass>+<default language>
  <baseclass>
How can we replicate this in SQL?
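The lookup order listed above can be reproduced in a few lines. This Python sketch builds the candidate names from most to least specific. The function name candidate_names, the tuple layout, and the sample base name "Messages" are assumptions for illustration, and it approximates the order on the slide rather than reimplementing Java's ResourceBundle.

```python
def candidate_names(base, requested, default):
    """Return lookup names, most specific first.

    requested and default are (language, country, variant) tuples;
    use empty strings for parts that are not set.
    """
    names = []
    for locale in (requested, default):
        parts = list(locale)
        # Drop unset trailing parts, e.g. ("fr", "CA", "") -> ["fr", "CA"]
        while parts and not parts[-1]:
            parts.pop()
        # Strip one part at a time: language+country+variant, then
        # language+country, then language alone.
        for n in range(len(parts), 0, -1):
            name = "_".join([base] + parts[:n])
            if name not in names:
                names.append(name)
    names.append(base)  # the base class is always the last resort
    return names


for name in candidate_names("Messages", ("fr", "CA", ""), ("en", "US", "")):
    print(name)
```

Each candidate is a separate probe, which is why a naive SQL translation of this scheme ends up issuing one query per fallback step.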

One Method…

SELECT * FROM Questionnaire WHERE InitiativeID = ?
SELECT * FROM Question WHERE Q_ID = ?
(while more questions)
  (do)
    SELECT * FROM QuestionLocale WHERE Ques_ID = ? AND LCID=?
  (until you find a record…)
(wend)

QUESTIONLOCALE
--------------
Q_ID
QUES_ID
LANG_ID
TERRITORY_ID
QUESTION

Inefficient. Difficult to manage.

Our Solution
Concept of “installed locale”
  Create associated “installed locale” records at the hub or questionnaire level.
  Perform locale negotiation once.
  No additional searches required.
Let’s look…

SELECT * FROM Questionnaire WHERE Initiative_ID=?
SELECT * FROM Question WHERE Q_ID =?
SELECT * FROM QuestionLocale WHERE Q_ID =? AND Ques_ID=? AND LCID=?
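A minimal, runnable sketch of the "installed locale" idea follows, using Python's sqlite3 module as a stand-in database. Several things here are assumptions rather than the presenter's implementation: the sample rows, the negotiate and load_questions helpers, the use of Windows-style LCID numbers (1033 for en-US, 1036 for fr-FR, 3084 for fr-CA), and the LEFT JOIN with COALESCE, which replaces the separate per-question SELECT shown on the slide with a single query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE question (
    q_id TEXT, ques_id TEXT, type_id TEXT,
    description TEXT,            -- default-locale text
    seq_num INTEGER,
    PRIMARY KEY (q_id, ques_id));
CREATE TABLE question_locale (
    q_id TEXT, ques_id TEXT, lcid INTEGER, name TEXT,
    PRIMARY KEY (q_id, ques_id, lcid),
    FOREIGN KEY (q_id, ques_id) REFERENCES question (q_id, ques_id));
""")
conn.executemany("INSERT INTO question VALUES (?, ?, ?, ?, ?)", [
    ("Q1", "A", "TX", "Company name", 1),
    ("Q1", "B", "TX", "Postal code", 2),
])
# Only question A has been localized for LCID 1036 in this sample data.
conn.executemany("INSERT INTO question_locale VALUES (?, ?, ?, ?)", [
    ("Q1", "A", 1036, "Nom de l'entreprise"),
])


def negotiate(preferred, installed, default):
    """Pick the first preferred LCID that is installed, once per session."""
    for lcid in preferred:
        if lcid in installed:
            return lcid
    return default


def load_questions(conn, q_id, lcid):
    """Fetch question text for one LCID, falling back to the default text."""
    rows = conn.execute("""
        SELECT q.ques_id, COALESCE(l.name, q.description)
        FROM question q
        LEFT JOIN question_locale l
          ON l.q_id = q.q_id AND l.ques_id = q.ques_id AND l.lcid = ?
        WHERE q.q_id = ?
        ORDER BY q.seq_num""", (lcid, q_id))
    return rows.fetchall()


lcid = negotiate(preferred=[3084, 1036], installed={1033, 1036}, default=1033)
print(lcid, load_questions(conn, "Q1", lcid))
```

Because negotiation happens once against the set of installed locales, the query never has to retry with progressively less specific locales; a missing translation simply falls back to the default description.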

Appearance of Questions

Data and Locale
Some data is locale neutral.
  Formatted at display time to match user’s locale.
  Values don’t vary by locale.
  Note: It may be in a language.
Some data is locale related.
  Data locale implied by context.
  Formatting/Validation is supplied by context.
  Locale can be inherited or cascaded.
Problems:
  Responses to the same questionnaire should be pooled.
  Hub owners don’t want to manage multiple questionnaires (especially if the only difference is language).
Some data is locale intrinsic.
  Business Logic (format/validation) changes due to data’s locale.
  Locale must be tagged.
  Implies a separate table.

Simplify with Locale Related

QUESTIONS AND ANSWERS Presentation Available at http://www.inter-locale.com mailto:aphillips@webmethods.com