The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky Veronika.

Slides:



Advertisements
Similar presentations
CMSC 150 DATA REPRESENTATION CS 150: Mon 30 Jan 2012.
Advertisements

Some computer fundamentals and jargon Memory: Basic element is a bit – value = 0 or 1 Collection of “n” bits is a “byte” Collection of several bytes is.
A-Level Computing#BristolMet Session Objectives#8 express numbers in binary, octal and hexadecimal explain the use of code to represent a character set.
IT Systems What Number? EN230-1 Justin Champion C208 –
Binary Expression Numbers & Text CS 105 Binary Representation At the fundamental hardware level, a modern computer can only distinguish between two values,
1 The Information School of the University of Washington Nov 6fit more-digital © 2006 University of Washington Digital Information INFO/CSE 100,
מבנה מחשב תרגול 2 ייצוג תווים בחומרה. A programmer that doesn’t care about characters encoding in not much better than a medical doctor who doesn’t believe.
Data Representation Kieran Mathieson. Outline Digital constraints Data types Integer Real Character Boolean Memory address.
Data Representation (in computer system) Computer Fundamental CIM2460 Bavy LI.
CIS 234: Character Codes Dr. Ralph D. Westfall April, 2011.
Binary Conversions Number systems Binary to decimal Decimal to binary.
2.1.4 BINARY ASCII CHARACTER SETS A451: COMPUTER SYSTEMS AND PROGRAMMING.
Computer Systems Nat 4/5 Computing Science Data Representation Lesson 3: Storing Text.
Unicode, character sets, and a a little history. Historical Perspective First came EBCIDIC (6 Bits?) Then in the early 1960s came ASCII – Most computers.
CHARACTERS Data Representation. Using binary to represent characters Computers can only process binary numbers (1’s and 0’s) so a system was developed.
ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s.
1 © 2000, Cisco Systems, Inc. DNSSEC IDN Patrik Fältström
Introduction to Human Language Technologies Tomaž Erjavec Karl-Franzens-Universität Graz Tomaž Erjavec Lecture: Character sets
Binary Numbers and ASCII and EDCDIC Mrs. Cueni. Data Representation  Human speech is analog because it uses continuous signals (waves) that vary in strength.
Agenda Data Representation – Characters Encoding Schemes ASCII
Lecture 2 Character Codes and Low-Structure Text Document Formats.
Computer System Basics 1 Number Systems & Text Representation Computer Forensics BACS 371.
Chapter 1 Data Storage(2) Yonsei University 1 st Semester, 2014 Sanghyun Park.
Data Representation S2. This unit covers how the computer represents- Numbers Text Graphics Control.
Data Representation CS280 – 09/13/05. Binary (from a Hacker’s dictionary) A base-2 numbering system with only two digits, 0 and 1, which is perfectly.
Computer Structure & Architecture 7c - Data Representation.
INFOCODING BASICS & EXAMPLES OF CURRENT USE Introduction to Computer Science Using Ruby (c) 2010 Gideon Frieder.
Data Representation and Storage Lecture 5. Representations A number value can be represented in many ways: 5 Five V IIIII Cinq Hold up my hand.
Computing Theory – F453 Number Systems. Data in a computer needs to be represented in a format the computer understands. This does not necessarily mean.
Text and Graphics September 26, Unit 3.
1 3 Computing System Fundamentals 3.5 Data Representation.
Computer System Basics 1 Number Systems & Text Representation Computer Forensics BACS 371.
CISC1100: Binary Numbers Fall 2014, Dr. Zhang 1. Numeral System 2  A way for expressing numbers, using symbols in a consistent manner.  " 11 " can be.
Data Files on Computers Text Files (ASCII) Files that can be created by typing on the keyboard while using a text editor such as notepad or TextEdit.
Data Representation Conversion 24/04/2017.
Strings in MIPS. Chapter 2 — Instructions: Language of the Computer — 2 Character Data Byte-encoded character sets – ASCII: 128 characters 95 graphic,
Representing Characters in a computer Pressing a key on the computer a code is generated that the computer can convert into a symbol for displaying or.
THE BINARY NUMBER SYSTEM “There are only 10 types of people in this world: Those who understand BINARY and those who do not.”
Text Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See
Data Encoding COSC Computers and Data Computers store information as sequences of bits Computers store many types of data: numbers text audio images.
M204 - Data Representation
© 2001, Penn State University Encoding on the Internet Elizabeth J. Pyatt CETS.
MISSION CRITICAL COMPUTING SQL Server Special Considerations.
Characters CS240.
Representing Characters in a Computer System Representation of Data in Computer Systems.
Basics of Unicode (base upon a presentation by NRSI, SIL International)
THE CODING SYSTEM FOR REPRESENTING DATA IN COMPUTER.
1.4 Representation of data in computer systems Character.
Lecture Coding Schemes. Representing Data English language uses 26 symbols to represent an idea Different sets of bit patterns have been designed to represent.
Nat 4/5 Computing Science Data Representation Lesson 3: Storing Text
DATA REPRESENTATION - TEXT
Binary Representation in Text
Binary Representation in Text
Unit 2.6 Data Representation Lesson 2 ‒ Characters
Binary Numbers and ASCII and EDCDIC
Information Support and Services
Data Encoding Characters.
Data Representation ASCII.
Representing Characters
Data Representation Question: Characters
LING 388: Computers and Language
Fundamentals of Data Representation
COMS 161 Introduction to Computing
INFO/CSE 100, Spring 2005 Fluency in Information Technology
COMS 161 Introduction to Computing
Digital Encodings.
A-level Computer Science
INFOCODING BASICS & EXAMPLES OF CURRENT USE
Chapter 3 - Binary Numbering System
Lecture 36 – Unit 6 – Under the Hood Binary Encoding – Part 2
Presentation transcript:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky Veronika Bejdová, 2013/2014

The easiest way to understand this stuff is to go chronologically. 2/14

ASCII  It was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc.  7 bits  Codes below 32 were called unprintable and were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in. 3/14

The trouble  IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters... horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc.,a bunch of line drawing characters  on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel, so when Americans would send their résumés to Israel they would arrive r()sum()s  But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. 4/14

Unicode  Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:  A ->  In Unicode, a letter maps to something called a code point which is still just a theoretical concept  the letter A is a platonic ideal. It's just floating in heaven: A  This platonic A is different than B, and different from a, but the same as A and A and A. 5/14

Magic number – code point  U+0639 (Arabic letter Ain)  The U+ means "Unicode“  the numbers are hexadecimal  U+0041 (English letter A) 6/14

Hello U+0048 U+0065 U+006C U+006C U+006F 7/14

Encodings U+0048 U+0065 U+006C U+006C U+006F  let's just store those numbers in two bytes each: C 00 6C 00 6F  Couldn't it also be: C 00 6C 00 6F 00 ?  FE FF  FF FE 8/14

Look at all those zeros!  the brilliant concept of UTF-8UTF-8  every code point from is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes  This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong 9/14

Hello U+0048 U+0065 U+006C U+006C U+006F has changed to C 6C 6F which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet (Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice.) 10/14

Encoding of Unicode  UTF-16 (because it has 16 bits) FE FF or FF FE  UTF-8 11/14

Unicode code points can be encoded in any old-school encoding scheme, too!  you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up!  If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> � 12/14

Hundreds of traditional encodings  They can only store some code points correctly and change all the other code points into question marks  Windows-1252 (the Windows 9x standard for Western European languages)  ISO , aka Latin-1 (also useful for any Western European language) ISO  UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly 13/14

Sources:   - online converter 14/14