ObjectStudio for Unicode: Getting ready for global markets
Alexander Augustin

Presentation transcript:


Overview
  Problem description
  History of character sets and Encoding
  Goals and approach
  Features and technologies
  Limitations
  Conclusions

ObjectStudio
  ObjectStudio is an integrated Smalltalk environment for the Windows platform.
  Access to most common Windows services and database systems, like DLL functions, COM, ODBC, Oracle …
  It's Smalltalk, so almost anything is possible, except easy localization and processing of multilingual data.

ObjectStudio in a Unicode World
  [Diagram: ObjectStudio (ANSI/OEM) connected, with question marks, to Operating System (Unicode), Other programs (Unicode), and Data sources (Unicode)]

Go Multilingual!
  Applications in a global market must represent texts and names from Eastern Europe and Asia.
  User interfaces must be localizable.
  Applications must offer capabilities for handling multilingual data.
  All of this must be supported by the runtime environment and the development system.
  [Screenshot: Japanese version of Microsoft Word]

ObjectStudio
  Supports: ANSI (CP1252) and OEM (CP850) 8 bit characters
  Adequate for:
    Writing source code
    Creating English UIs
    Processing English text files
    Accessing databases with English texts
  [Screenshot: ObjectStudio environment]

Overview
  Problem description
  -> History of character sets and Encoding
  Goals and approach
  Features and technologies
  Limitations
  Conclusions

The history of character sets
  Punch card, late 18th century
    Enhanced by Hollerith (patented 1890)
  5 channel punch tape, 19th century
    2^5 = 32, not enough for 26 letters + 10 digits
    Solution: a shift code used as a prefix that switches the state
  8 channel punch tape, mid 20th century
    7 bit US-ASCII + parity
    No support for umlauts
  VT220 terminal invents ISO8859-L
    Similar to Microsoft codepage 1252
  Many character encodings for many languages
    EBCDIC, KOI8, ShiftJIS, …

Unicode
  Unicode is a standard defined by the Unicode Consortium.
  Unicode assigns a unique number (code point) to each character
  Version reserves more than code points
  Several transformation formats for binary representation of Unicode code points
    UCS-2 (2 bytes/char), UTF-8 (1-4 bytes/char), UTF-16 (2/4 bytes/char)
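The bytes-per-character figures on this slide can be checked with a short sketch. It is written in Python rather than ObjectStudio Smalltalk, purely for illustration; the helper names `encoded_lengths` and `fits_ucs2` are made up for this example and are not part of ObjectStudio. UCS-2 is modelled here as "any code point up to U+FFFF", which is an assumption based on its fixed 2-byte width.

```python
def encoded_lengths(ch):
    """Return the number of bytes needed for one character in UTF-8 and UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # utf-16-le avoids the 2-byte byte order mark that plain "utf-16" adds
        "utf-16": len(ch.encode("utf-16-le")),
    }


def fits_ucs2(ch):
    """UCS-2 is fixed at 2 bytes, so only code points up to U+FFFF fit (assumption)."""
    return ord(ch) <= 0xFFFF


# A, O-umlaut, euro sign, and a character outside the 16-bit range (U+1D11E)
samples = ["A", "\u00d6", "\u20ac", "\U0001D11E"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch), "fits UCS-2:", fits_ucs2(ch))
```

The last sample needs 4 bytes in both UTF-8 and UTF-16 and does not fit UCS-2 at all, which is the practical difference between the 2-byte and 2/4-byte formats named on the slide.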

Unicode
  World-wide unification effort for all characters of the world
  Supported by all major vendors!
  The solution for ObjectStudio!

Encoding
  [Diagram: Character, Code, Binary representation]
  Transforming characters into their binary representation in another encoding
  One main problem when accessing external data sources
  Distinguish between specialized encodings and Unicode

Byte Encodings
  Differ in the value that represents a character in the encoding
  Do not differ in the binary format of the code (always 1 byte)
  Cell format: decimal value / hexadecimal binary representation
    Encoding \ character   Ö        €
    CP                     214/D6   128/80
    CP852                  153/99   --
    ISO8859-L15            214/D6   164/A4
  [Diagram: Character, Code, Binary representation]
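The point of this slide, that single-byte encodings disagree on the value but not on the width, can be reproduced with a small Python sketch (Python is used only for illustration, it is not the deck's Smalltalk). The codec names are Python's own; `cp1252` is chosen here because the talk names it as the Windows ANSI code page, and `iso8859_15` is assumed to correspond to the slide's ISO8859-L15 row. The helper `byte_value` is hypothetical.

```python
# Python codec names; the mapping to the slide's row labels is an assumption.
CODE_PAGES = ["cp1252", "cp852", "iso8859_15"]


def byte_value(ch, codec):
    """Return the single byte value of ch in codec, or None if the code page lacks it."""
    try:
        encoded = ch.encode(codec)
    except UnicodeEncodeError:
        return None
    return encoded[0]


for codec in CODE_PAGES:
    for ch in "\u00d6\u20ac":  # O-umlaut and euro sign
        value = byte_value(ch, codec)
        shown = "--" if value is None else f"{value}/{value:02X}"
        print(f"{codec:12} {ch} {shown}")
```

A byte such as 0x99 therefore only means "Ö" if the reader already knows the data is CP852; the byte itself carries no hint about its code page.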

Unicode Encodings
  Do not differ in the value (code point) that is assigned to a character
  Differ in the binary format of the value
  Cell format: hexadecimal binary representation
    UTF \ character         Ö (code point 214)   € (code point 8364)
    UCS-2 (little-endian)   D6 00                AC 20
    UTF-8                   C3 96                E2 82 AC
  [Diagram: Character, Code Point, Binary representation]
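The byte sequences on this slide can be recomputed with a few lines of Python (used for illustration only). Python has no codec named UCS-2, so this sketch assumes `utf-16-le` as a stand-in, which produces the same bytes for characters up to U+FFFF. The helper name `hex_bytes` is invented for the example.

```python
def hex_bytes(ch, codec):
    """Encode one character and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in ch.encode(codec))


for ch in "\u00d6\u20ac":  # O-umlaut (code point 214) and euro sign (code point 8364)
    print(
        f"code point {ord(ch):5}",
        "| UCS-2 LE (via utf-16-le):", hex_bytes(ch, "utf-16-le"),
        "| UTF-8:", hex_bytes(ch, "utf-8"),
    )
```

The code point is identical in every row; only the transformation into bytes changes, which is the opposite situation from the single-byte code pages.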

Goals
  1. Enable Unicode!
    Extend encoding capabilities
    Provide native multilingual IO support
  2. Extend external access features
    Add Unicode file access
    Add Unicode database access

Changes
  Create a Unicode VM
    Make ObjectStudio a native Windows Unicode application
  Adapted class library
    Make Smalltalk String/Symbol objects 16 bit Unicode strings (UCS-2)
    Add encodings
  External interfaces and resources
    C calls
    Unicode file access
    Database access (ODBC, OCI)

Stream Encoding
  Ported from VisualWorks
  Use StreamEncoders and CharacterEncoders that "know" the encoding
  Can be applied to any kind of stream with a byte-like buffer to encode or decode data
  [Diagram: EncodedStream, Stream, StreamEncoder, Buffer, CharacterEncoder]
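The layering described here, a character stream wrapped around a byte buffer by something that knows the encoding, has a close analogue in Python's standard library. The sketch below is that analogue, not the ObjectStudio or VisualWorks implementation: `io.BytesIO` plays the byte-like buffer and `io.TextIOWrapper` plays the combined encoded stream and encoder. The function names `encode_stream` and `decode_stream` are invented for this example.

```python
import io


def encode_stream(text, encoding):
    """Write text through an encoding-aware wrapper and return the raw bytes."""
    buffer = io.BytesIO()  # the byte-like buffer
    writer = io.TextIOWrapper(buffer, encoding=encoding, newline="")
    writer.write(text)
    writer.flush()  # push pending characters down into the byte buffer
    data = buffer.getvalue()  # read the bytes before closing, close() also closes the buffer
    writer.close()
    return data


def decode_stream(raw, encoding):
    """Read raw bytes through an encoding-aware wrapper and return the text."""
    reader = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    try:
        return reader.read()
    finally:
        reader.close()


utf8_bytes = encode_stream("\u00d6\u20ac", "utf-8")
print(utf8_bytes)
print(decode_stream(utf8_bytes, "utf-8") == "\u00d6\u20ac")
```

Because the encoding is a parameter of the wrapper and not of the buffer, the same buffer type works for any encoding, which matches the slide's claim that the encoders can be applied to any stream with a byte-like buffer.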

Stream Encoding
  [Diagram: EncodedStream, Stream, StreamEncoder, CharacterEncoder, Buffer; Character, Code, Binary representation]

StreamEncoding use cases
  Accessing external services and storages without UCS-2 support (e.g. ANSI C calls)
  Examples
    Access to databases without UCS-2 support
    Calling ANSI DLL functions without UCS-2 support
    String transfer via TCP/IP
    Access to text files with foreign encodings

Text file access
  Read/write access to any kind of text file
    UTF8, UTF16, UCS-2 little-endian, …
    CP1252 (Windows ANSI)
    CP850 (Windows OEM)
    And many more
  Using EncodedStreams and NewFileStreams
  Example: read UTF-8 encoded file
    | fileStream encoder encodedStream result |
    "Open the file as raw bytes; decoding is left to the encoder."
    fileStream := NewFileStream
        file: 'example.txt'
        mode: #binary
        onError: [ self error: 'could not open file' ].
    "Create an encoder that knows UTF-8 and wrap the file stream with it."
    encoder := StreamEncoder new: #utf8.
    encodedStream := EncodedStream on: fileStream encodedBy: encoder.
    "Read the whole file as decoded characters, then close the stream."
    result := encodedStream upToEnd.
    encodedStream close
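The "foreign encoding" case from this slide, reading a file in one encoding and writing it in another, can be sketched in Python (for illustration only; this is not the ObjectStudio API). The function `transcode_file`, the file names, and the sample text are all made up for the example; the sample bytes assume that 0x94 is "ö" in CP850.

```python
import os
import tempfile


def transcode_file(src_path, src_encoding, dst_path, dst_encoding):
    """Read a text file in one encoding and write it out in another."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)
    return text


with tempfile.TemporaryDirectory() as workdir:
    oem_path = os.path.join(workdir, "example_oem.txt")    # hypothetical file names
    utf8_path = os.path.join(workdir, "example_utf8.txt")

    # Create a small CP850 (Windows OEM) file: the bytes for "Köln".
    with open(oem_path, "wb") as f:
        f.write(b"K\x94ln")

    decoded_text = transcode_file(oem_path, "cp850", utf8_path, "utf-8")

    with open(utf8_path, "rb") as f:
        utf8_bytes = f.read()

print(decoded_text, utf8_bytes)
```

Opening the source with the wrong encoding would not raise an error for most single-byte code pages; it would silently yield different characters, so the encoding has to be known from outside the file.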

External Database Access
  Supported Unicode database interfaces
    ODBC
    OCI (ORACLE Call Interface)
  Features
    Native access to Unicode data sources
    No application modifications needed
  Requirements
    ODBC: Version 3.5
    OCI: OCI Client Version (9i) or higher

Limitations
  Source files continue to be OEM encoded
    Store Unicode text data in text files or external databases
  UI sources can't contain Unicode strings
    Use external files/databases to store Unicode data for localizing UIs
    Some localization support is planned
  Implicit conversions between Strings and ByteArrays cannot be supported
    Use encoded streams or #asByteArrayEncoding:
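Why an implicit String to ByteArray conversion stops being possible once strings are Unicode can be illustrated with Python, which enforces the same rule (this is an analogy, not ObjectStudio behaviour). The helper name `to_bytes` is invented; it stands in for an explicit conversion such as the #asByteArrayEncoding: message mentioned above.

```python
def to_bytes(text, encoding):
    """Explicit conversion: the caller must name the encoding."""
    return text.encode(encoding)


# Python refuses to turn a string into bytes without an encoding.
try:
    bytes("\u00d6")
    implicit_ok = True
except TypeError:
    implicit_ok = False

ansi = to_bytes("\u00d6", "cp1252")
utf8 = to_bytes("\u00d6", "utf-8")

print("implicit conversion allowed:", implicit_ok)
print("cp1252:", ansi, "utf-8:", utf8)
```

The same one-character string becomes one byte in CP1252 and two bytes in UTF-8, so there is no single "natural" byte form a runtime could choose on the caller's behalf.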

Limitations
  Image files are not compatible
    Compile class files and create new images

Conclusion
  [Diagram: ObjectStudio Unicode connected to Operating System (Unicode), Other programs (Unicode), and Data sources (Unicode)]

Availability
  ObjectStudio 7.0 for Unicode is available on the new CINCOM Smalltalk CD together with VisualWorks 7.3

Contact Information
  We provide project support to internationalize your ObjectStudio application
  Georg Heeg eK, Baroper Str. 337, D Dortmund, Tel: Fax:
  Georg Heeg AG, Seestr. 131, CH-8027 Zürich, Tel:
  Georg Heeg eK, Mühlenstr. 19, D Köthen, Tel: Fax:

© 2004 Cincom Systems, Inc. All Rights Reserved. Developed in the U.S.A. CINCOM, and The World’s Most Experienced Software Company are trademarks or registered trademarks of Cincom Systems, Inc. All other trademarks belong to their respective companies.