IBM Globalization Center of Competency © 2006 IBM Corporation. IUC 29, Burlingame, CA, March 2006. Automatic Character Set Recognition. Eric Mader, IBM; Andy Heninger, IBM.

Presentation transcript:

Automatic Character Set Recognition
Eric Mader, IBM
Andy Heninger, IBM

Overview
• What is character set detection?
• How is it used?
• Character set detection libraries
• How ICU's library is implemented
• Conclusion

What is Character Set Detection?
• Tower of Babel
  - Dozens of character encodings in common use
  - Web pages, e-mails, plain text files
  - Protocols specify character encoding
• Encoding information may be missing or incorrect
  - Encoding information may be missing
  - Server may have incorrectly overridden it
  - Translator may have failed to update it
• Character set detection to the rescue!

How is Character Set Detection Used?
• Web browsers, search engines, e-mail
  - Web pages and e-mail have character encoding information
  - This information may be missing or incorrect
• File indexing
  - Must handle plain text files
  - Character encoding information may be incorrect

Character Set Detection Libraries
• Mozilla
  - C++ and Java versions
  - Incremental operation
• Windows API
  - IMultiLanguage2::DetectInputCodepage
  - IMultiLanguage2::DetectCodepageInIStream
• ICU
  - C and Java versions

ICU's Character Set Detection Library
• Detection function
  - Returns character set, confidence
• Conversion function
  - Converts data to Unicode
• Convenience functions to do both

Three Classes of Character Sets
• Single Byte
  - Each byte corresponds to one Unicode character
• Multi-Byte
  - Two or more bytes represent a single Unicode character
• Algorithmic
  - Encoding scheme produces distinctive byte patterns

Detecting Single Byte Character Sets
• Can't use byte patterns
  - Any byte is legal in any position
• Use statistical method
  - Have statistics for each language
  - Match statistics of input to each language
  - Assumes input is natural language plain text

Language Statistics
• Trigrams
  - Groups of three adjacent letters
  - Treat runs of punctuation, spaces as single space
• Data is list of most common trigrams
  - Computed from large, varied sample of text
• Compute trigrams for input, compare
  - Confidence based on number of common trigrams
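
The trigram comparison described on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not ICU's implementation: the function names, the profile size, and the scoring formula (share of the input's trigrams that appear in the language profile) are all assumptions made for the example.

```python
import re
from collections import Counter

def trigrams(text):
    """Count letter trigrams, collapsing runs of non-letters into one space."""
    # Lowercase, then treat any run of punctuation, digits, or whitespace
    # as a single space, as the slide describes.
    normalized = re.sub(r"[\W\d_]+", " ", text.lower()).strip()
    normalized = f" {normalized} "
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))

def build_profile(sample, size=50):
    """Keep the `size` most common trigrams of a training sample (hypothetical helper)."""
    return {gram for gram, _ in trigrams(sample).most_common(size)}

def confidence(raw, encoding, profile):
    """Percentage (0-100) of the input's trigram occurrences found in the profile."""
    text = raw.decode(encoding, errors="replace")
    counts = trigrams(text)
    total = sum(counts.values())
    if total == 0:
        return 0
    hits = sum(n for gram, n in counts.items() if gram in profile)
    return round(100 * hits / total)
```

In use, the same bytes would be scored once per candidate (encoding, language) pair and the highest score kept: Russian text encoded as KOI8-R should match a Russian profile well when decoded as KOI8-R and poorly when decoded as Windows-1251, because the wrong decoding permutes the letters.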

Single Byte Character Sets Detected By ICU

  Name           Languages
  ISO-8859-1     Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish
  ISO-8859-2     Czech, Hungarian, Polish, Romanian
  ISO-8859-5     Russian
  ISO-8859-6     Arabic
  ISO-8859-7     Greek
  ISO-8859-8     Hebrew
  ISO-8859-9     Turkish
  Windows-1251   Russian
  Windows-1256   Arabic
  KOI8-R         Russian

Multi-Byte Character Set Detection
• Used for Chinese, Japanese, Korean
• Can use byte patterns
  - Rules for which bytes can be in each position
  - Can reject data that breaks the rules
• Must also use statistics
  - List of most commonly used characters
  - Confidence based on percentage of common characters
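
The two-stage idea on this slide (structural rules first, then character statistics) might look like the Python sketch below. It assumes a simplified EUC-style rule, where a lead byte in A1-FE must be followed by a trail byte in A1-FE, and a caller-supplied set of "common" two-byte characters. Real detectors use per-encoding rules and much larger frequency lists; the names here are hypothetical.

```python
def iter_euc_chars(data):
    """Yield each character's bytes under a simplified EUC two-byte rule.

    Raises ValueError as soon as a byte breaks the rule.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead <= 0x7F:                      # ASCII: one byte
            yield data[i:i + 1]
            i += 1
        elif 0xA1 <= lead <= 0xFE:            # lead byte of a two-byte character
            if i + 1 >= len(data) or not 0xA1 <= data[i + 1] <= 0xFE:
                raise ValueError(f"bad trail byte at offset {i + 1}")
            yield data[i:i + 2]
            i += 2
        else:
            raise ValueError(f"illegal lead byte 0x{lead:02X} at offset {i}")

def multibyte_confidence(data, common_chars):
    """Return 0 if the bytes break the rules, else the percentage of
    two-byte characters that appear in `common_chars`."""
    try:
        chars = [c for c in iter_euc_chars(data) if len(c) == 2]
    except ValueError:
        return 0                              # structural rules violated: reject
    if not chars:
        return 0                              # pure ASCII gives no evidence
    hits = sum(1 for c in chars if c in common_chars)
    return round(100 * hits / len(chars))
```

With this rule, Shift-JIS input is rejected outright as soon as a lead byte in the 81-9F range appears, while well-formed EUC-KR input is scored against the frequency list.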

Chinese GB-2312, GBK, GB18030
• GB-2312 (1980)
  - 6,763 Han characters
• GBK (1995)
  - Extends GB-2312
  - Adds all Han characters from Unicode 2.0
• GB18030 (2000)
  - Extends GBK
  - Adds all of Unicode
• ICU always matches GB18030
  - Common characters are from GB-2312
  - GB18030 to Unicode converter will handle all three

Multi-Byte Character Sets Detected By ICU

  Name        Language
  Shift-JIS   Japanese
  EUC-JP      Japanese
  EUC-KR      Korean
  GB18030     Chinese
  Big5        Chinese

Algorithmic Character Sets
• Identified by distinctive byte sequences
  - Don't need language statistics
• UTF-8, UTF-16, UTF-32
• ISO-2022-CN, ISO-2022-JP, ISO-2022-KR

Algorithmic Character Sets: UTF-8
• Unicode encoding
• Represents characters as sequence of one to four bytes
• Can start with Byte Order Mark (BOM):
  - EF BB BF
• Very distinctive byte pattern

  # of Bytes   Allowable Values at Each Position
  1            [00-7F]
  2            [C0-DF] [80-BF]
  3            [E0-EF] [80-BF] [80-BF]
  4            [F0-F7] [80-BF] [80-BF] [80-BF]

  (These ranges are a loose structural pattern; strict UTF-8 also excludes lead bytes C0, C1, and F5-F7.)
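
The byte-pattern table above translates almost directly into code. The Python sketch below checks input against exactly those lead and trail byte ranges and counts the multi-byte sequences it finds. The function name and return shape are assumptions for illustration, and a truncated sequence at the end of the input is treated as a failure for simplicity.

```python
def looks_like_utf8(data):
    """Return (ok, multibyte_count) using the lead/trail ranges from the table."""
    i, multibyte = 0, 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            trail = 0                 # single-byte (ASCII)
        elif 0xC0 <= b <= 0xDF:
            trail = 1                 # two-byte sequence
        elif 0xE0 <= b <= 0xEF:
            trail = 2                 # three-byte sequence
        elif 0xF0 <= b <= 0xF7:
            trail = 3                 # four-byte sequence
        else:
            return False, multibyte   # stray trail byte or illegal lead byte
        seq = data[i + 1:i + 1 + trail]
        if len(seq) < trail or any(not 0x80 <= t <= 0xBF for t in seq):
            return False, multibyte   # truncated or malformed sequence
        if trail:
            multibyte += 1
        i += 1 + trail
    return True, multibyte
```

Pure ASCII passes with a count of zero, which is why a detector would want the count as well as the yes/no answer: many valid multi-byte sequences and no invalid ones is much stronger evidence than ASCII alone. Python's own `bytes.decode("utf-8")` applies stricter rules than this table.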

Algorithmic Character Sets: UTF-16
• Unicode encoding
• Represents characters as sequence of 16-bit words
• Starts with Byte Order Mark (BOM):
  - FE FF (big-endian)
  - FF FE (little-endian)
• Confidence based on presence of BOM
  - Could check for defined characters, script runs, etc.

Algorithmic Character Sets: UTF-32
• Unicode encoding
• Represents characters as 32-bit words
• Can start with Byte Order Mark (BOM):
  - 00 00 FE FF (big-endian)
  - FF FE 00 00 (little-endian)
• Confidence based on presence of characters in Unicode range
• Byte pattern is fairly distinctive
  - Lots of zero bytes
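
BOM checks for the three UTF encodings can be combined into one lookup, sketched below in Python. The table and function name are assumptions for illustration. The one subtlety is ordering: the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE), so the four-byte marks have to be tested first.

```python
# Longest marks first, so UTF-32LE is not mistaken for UTF-16LE.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A residual ambiguity remains: UTF-16LE text whose first character after the BOM is U+0000 is byte-for-byte identical to a UTF-32LE BOM, so a BOM match is strong evidence rather than proof.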

Algorithmic Character Sets: ISO-2022
• Used for Chinese, Japanese, Korean
  - Widely used in e-mail
• Uses embedded escape sequences, shift codes
  - e.g. 1B 24 29 43 (ESC $ ) C) is the Korean escape sequence
• Confidence based on escape sequences:
  - Presence of known sequences, absence of unknown
  - No overlap for Chinese, Japanese, Korean sequences
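
Escape-sequence scoring of this kind can be sketched in Python as below. The escape lists are a partial selection taken from the ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN definitions, and the scoring formula (known escapes as a share of all escapes seen) is an assumption made for the example, not the formula ICU uses.

```python
# Partial lists of designator escape sequences for each variant.
ESCAPES = {
    "ISO-2022-JP": [b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J"],
    "ISO-2022-KR": [b"\x1b$)C"],
    "ISO-2022-CN": [b"\x1b$)A", b"\x1b$)G"],
}

def iso2022_confidence(data, known):
    """Score 0-100: share of ESC bytes that begin a known sequence."""
    hits = misses = 0
    i = data.find(b"\x1b")
    while i != -1:
        if any(data.startswith(seq, i) for seq in known):
            hits += 1
        else:
            misses += 1
        i = data.find(b"\x1b", i + 1)
    if hits == 0:
        return 0
    return round(100 * hits / (hits + misses))

def detect_iso2022(data):
    """Return (name, confidence) for the best-scoring variant, or None."""
    scores = {name: iso2022_confidence(data, seqs) for name, seqs in ESCAPES.items()}
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else None
```

Because the three variants share no escape sequences, a single recognized sequence is enough to tell them apart; unknown sequences lower the score rather than disqualifying the input.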

Character Set Detection and Markup
• HTML documents contain headers, markup, JavaScript
• Can interfere with language-based detection
  - Not part of text content
  - Uses Latin alphabet
• ICU provides a basic markup filter
  - Use if text known to contain markup
  - Use for languages written in Latin alphabet
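
A basic markup filter in the spirit of this slide could be sketched in Python as follows. It works on raw bytes, since the encoding is not yet known, and assumes an ASCII-compatible encoding in which `<` and `>` keep their ASCII values. The regular expressions and the function name are assumptions for illustration; this is not ICU's filter and it is not a real HTML parser.

```python
import re

# Drop whole <script> and <style> elements, then any remaining tags.
_SCRIPT_STYLE = re.compile(rb"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(rb"<[^>]*>")

def strip_markup(data):
    """Replace markup with spaces so only text content reaches the detector."""
    without_blocks = _SCRIPT_STYLE.sub(b" ", data)
    return _TAG.sub(b" ", without_blocks)
```

Replacing tags with a space, rather than deleting them, keeps words on either side of a tag from running together, which matters for trigram statistics.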

How Much Text is Required?
• Good results with a few hundred bytes of plain text
• Complex web sites can have kilobytes of markup
  - Usually at the beginning
  - Our experience: 6 kilobytes is enough
• Trade-off between speed and accuracy
• Test results: [chart not captured in this transcript]


Language Detection
• Language detected as side effect
• No language for UTF encodings
  - We could adapt single-byte data
• Closely related languages may be confused
  - e.g. French, Spanish, Portuguese
• Use linguistic analysis libraries for more accuracy
• Test results: [chart not captured in this transcript]


Cautions
• Character set detection is not 100% reliable
  - Based on statistics
  - Assumes data is natural language text
  - Doesn't have data for all encodings
• Designed to work on plain text
  - Markup, etc. will confuse it
  - Won't work on binary formats, like word processing documents

Conclusions
• Can read and understand text in unknown encoding
• Any program that reads text from uncontrolled sources can benefit
• Freely available implementations make character set detection easy to use

Questions and Answers

Character Sets Detected by ICU

  Name          Type          Languages
  ISO-8859-1    Single Byte   English, German, French, Spanish, Danish
  ISO-8859-2    Single Byte   Czech, Hungarian, Polish
  ISO-8859-5    Single Byte   Russian
  ISO-8859-6    Single Byte   Arabic
  ISO-8859-7    Single Byte   Greek
  ISO-8859-8    Single Byte   Hebrew
  ISO-8859-9    Single Byte   Turkish
  KOI8-R        Single Byte   Russian
  Shift-JIS     Multi-Byte    Japanese
  EUC-JP        Multi-Byte    Japanese
  ISO-2022-JP   Algorithmic   Japanese
  GB18030       Multi-Byte    Chinese
  ISO-2022-CN   Algorithmic   Chinese
  Big5          Multi-Byte    Chinese
  EUC-KR        Multi-Byte    Korean
  ISO-2022-KR   Algorithmic   Korean
  UTF-8/16/32   Algorithmic   All (Unicode)