Charset to UTF

Good Old Old Days
Is there any other language but American?
- EBCDIC
- ASCII

Good Old Days
ASCII plus one extra alphabet per 8-bit charset:
- Latin: French, Italian, German, etc.
- or Greek, or Hebrew, or Russian, etc.

Multibyte
- Japanese: SJIS, EUC
- Chinese: Big5, GB
- Korean

Babel’s Tower

Many Languages
- Hebrew
- Japanese
- Arabic
In the same doc/line/screen

Unicode
- All languages
- Each char: 2 bytes (in the original 16-bit design; characters beyond U+FFFF need 4 bytes in UTF-16)
- Problem: not a byte string but wide chars
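To illustrate why 2-byte characters are "not a string" in the old byte-oriented sense, here is a minimal Perl sketch. It is not from the slides: the sample text and variable names are assumptions chosen for illustration. It counts characters versus bytes and shows that 16-bit text contains NUL bytes even for plain ASCII letters, which is what breaks byte-string handling.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text: HEBREW LETTER ALEF, BET, GIMEL.
my $text = "\x{05D0}\x{05D1}\x{05D2}";

my $chars = length($text);                 # number of characters
my $utf16 = encode("UTF-16BE", $text);     # 2 bytes per character here
my $utf8  = encode("UTF-8",    $text);     # also 2 bytes each for Hebrew

# Plain ASCII text in a 16-bit encoding: every other byte is NUL.
my $ascii16 = encode("UTF-16BE", "Ab");
my $nul_at  = index($ascii16, "\0");

printf "%d characters, %d bytes as UTF-16, %d bytes as UTF-8\n",
    $chars, length($utf16), length($utf8);
# prints: 3 characters, 6 bytes as UTF-16, 6 bytes as UTF-8
printf "\"Ab\" as UTF-16 is %d bytes, first NUL byte at offset %d\n",
    length($ascii16), $nul_at;
# prints: "Ab" as UTF-16 is 4 bytes, first NUL byte at offset 0
```

Code that treats a NUL byte as end-of-string would stop at the very first byte of the second sample.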

UTF-8
- One to one with Unicode
- Each Unicode char becomes 1-3 regular (8-bit) chars, or 4 for characters beyond U+FFFF
- Well-defined algorithm
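The "well-defined algorithm" can be sketched in a few lines of Perl. This is an illustrative sketch, not code from the slides: the function name `utf8_bytes` is hypothetical, and surrogate or out-of-range code points are not checked. The result is compared against the Encode module as a sanity check.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode one Unicode code point (an integer) as a string of UTF-8 bytes.
# Hypothetical helper; it does not reject surrogates or values > 0x10FFFF.
sub utf8_bytes {
    my ($cp) = @_;
    if ($cp < 0x80) {                       # 1 byte:  0xxxxxxx
        return pack("C", $cp);
    }
    if ($cp < 0x800) {                      # 2 bytes: 110xxxxx 10xxxxxx
        return pack("C2", 0xC0 | ($cp >> 6),
                          0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return pack("C3", 0xE0 | ($cp >> 12),
                          0x80 | (($cp >> 6) & 0x3F),
                          0x80 | ($cp & 0x3F));
    }
    return pack("C4", 0xF0 | ($cp >> 18),   # 4 bytes: 11110xxx + 3 continuation bytes
                      0x80 | (($cp >> 12) & 0x3F),
                      0x80 | (($cp >> 6) & 0x3F),
                      0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0x05D0, 0x20AC, 0x1F600) {
    my $bytes = utf8_bytes($cp);
    # Cross-check the hand-written encoder against Encode.
    die sprintf("mismatch for U+%04X", $cp)
        unless $bytes eq encode("UTF-8", chr($cp));
    printf "U+%04X -> %s\n", $cp, uc unpack("H*", $bytes);
}
# prints:
# U+0041 -> 41
# U+05D0 -> D790
# U+20AC -> E282AC
# U+1F600 -> F09F9880
```

ASCII characters keep their single-byte values, which is why existing ASCII data is already valid UTF-8.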

Hebrew to Unicode

Unicode  Charset byte  Name
05D0     60            HEBREW LETTER ALEF
05D1     61            HEBREW LETTER BET
05D2     62            HEBREW LETTER GIMEL
05D3     63            HEBREW LETTER DALET
05D4     64            HEBREW LETTER HE
05D5     65            HEBREW LETTER VAV
05D6     66            HEBREW LETTER ZAYIN
05D7     67            HEBREW LETTER HET
05D8     68            HEBREW LETTER TET
05D9     69            HEBREW LETTER YOD
05DA     6A            HEBREW LETTER FINAL KAF
05DB     6B            HEBREW LETTER KAF
05DC     6C            HEBREW LETTER LAMED
05DD     6D            HEBREW LETTER FINAL MEM
05DE     6E            HEBREW LETTER MEM

and likewise for each charset

Need for Conversion
- Existing data
- New data: editors work in specific charsets, not in UTF/Unicode

Brute Force
For each original char: convert to UTF
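A minimal sketch of the brute-force approach, assuming the byte-to-Unicode pairs listed on the "Hebrew to Unicode" slide (bytes 60-6E map to U+05D0-U+05DE). The names `%to_unicode` and `legacy_to_utf8` are hypothetical, and passing unmapped bytes through unchanged is an assumption rather than something the slides specify.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Lookup table from the slide's rows: legacy byte 0x60..0x6E
# maps to Unicode code point U+05D0..U+05DE.
my %to_unicode = map { (0x60 + $_, 0x05D0 + $_) } 0 .. 14;

# Convert a legacy byte string to UTF-8, one character at a time.
sub legacy_to_utf8 {
    my ($bytes) = @_;
    my $unicode = '';
    for my $byte (unpack("C*", $bytes)) {
        # Assumption: bytes missing from the table pass through unchanged.
        my $cp = exists $to_unicode{$byte} ? $to_unicode{$byte} : $byte;
        $unicode .= chr($cp);
    }
    return encode("UTF-8", $unicode);
}

# ALEF, BET, space, GIMEL in the legacy charset.
my $utf8 = legacy_to_utf8("\x60\x61 \x62");
print uc unpack("H*", $utf8), "\n";    # prints: D790D79120D792
```

A real converter would need one such table per source charset, which is the work the Encode module already does.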

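Perl way 1
    use Encode;                  # module names are case-sensitive: Encode, not ENCODE
    my ($if, $of) = @ARGV;       # input and output file names
    # The source charset name is truncated in the slide ("iso "); put the full name where "iso ..." appears.
    open my $in,  "<:encoding(iso ...)", $if or die "$if: $!";
    open my $out, ">:encoding(utf8)",    $of or die "$of: $!";
    while (<$in>) { print $out $_; }   # decoded on read, encoded as UTF-8 on write
    close $in;
    close $out or die "$of: $!";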

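Perl way 2
    perl -MEncode -e '($if, $of) = @ARGV; open my $in, "<:encoding(iso ...)", $if; open my $out, ">:encoding(utf8)", $of; while (<$in>) { print $out $_; }' infile outfile
(The source charset name is missing from the slide; put it where "iso ..." appears.)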