Unicode WTF is UTF? (for Secondary School Students) Jan Zidek Tieto Czech s.r.o. ☺ U+263A.


1 Unicode WTF is UTF? (for Secondary School Students) Jan Zidek Tieto Czech s.r.o. jan.j.zidek@tieto.com ☺ U+263A

2 Table of contents
Puzzles 3
Game 4
Questionnaire 5
What Is It? 6
What Contains More Characters? 7
Unicode Overview 8
Evolution of Character Encoding 9
Can you read it? 10
Unicode Standard 11
Timeline 12
Unicode Characters 13
Unicode on the Web 15
Unicode Script Blocks 17
17 Code Point Planes 18
BMP – Basic Multilingual Plane 19
Characters per Plane 20
Encoding 21
Unicode Encoding 22
UTF-32 23
UTF-8 25
UTF-16 29
From Unicode 1.0 to Unicode 2.0 30
Surrogates 31
UTF-16 Transformation 32
Surrogates Mapping 33
UTF-16 Encoding Example 34
Endianness 35
Endianness in Normal Life 36
Unicode BOM – Byte Order Mark 37
Encodings Summary 38
Properties of Each UTF 39
Unicode Encodings Length 40
Unicode Encoding – Example 1 41
Unicode Encoding – Example 2 42
Useful Stuff 43
Unicode 2016-02-26

3 Puzzles ¿ U+00BF
Unicode 2012-02-15

4 Game

5 Questionnaire
Who knows what binary code is?
Who can convert between decimal and binary?
Who knows what hexadecimal code is?
Who can convert between hexadecimal and binary?
Who has heard the word Unicode?
Who has heard of UTF-8, UTF-16, or UTF-32?
Who creates web pages?

6 What Is It?

7 What Contains More Characters? Unicode? UTF-8? UTF-16? UTF-32?

8 Unicode Overview U+1F026

9 Evolution of Character Encoding
Pre-standards
ASCII – 1960s – 7 bits – 128 characters
Extended ASCII – 8 bits – 128 more characters
→ Kód bratří Kamenických (the Kamenický brothers' encoding)
→ MS-DOS CP852, …
→ ISO 8859-1, ISO 8859-2, ISO 8859-3, …
→ Microsoft CP1252, CP1250, …
→ …, …, …
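
To see why so many competing 8-bit code pages were a problem, here is a minimal Python sketch. It assumes the standard-library codecs `cp852`, `iso8859_2`, and `cp1250` as stand-ins for the MS-DOS, ISO, and Microsoft families listed above; the sample letter is an arbitrary choice, not taken from the slides.

```python
# One Czech letter, three legacy 8-bit code pages, three different byte values.
text = "ž"
codecs_to_try = ["cp852", "iso8859_2", "cp1250"]  # MS-DOS, ISO, Windows

encoded = {name: text.encode(name) for name in codecs_to_try}
for name, raw in encoded.items():
    print(f"{name:10} 0x{raw.hex()}")

# Bytes written as Windows-1250 but read as CP852 come back as another character.
wrong = encoded["cp1250"].decode("cp852", errors="replace")
print("misread as:", repr(wrong))
```

The same byte means different things depending on which table the reader assumes, which is the situation Unicode was designed to end.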

10 Can you read it?

11 Unicode Standard
Character coding system
http://www.unicode.org/

12 Timeline
Year | Version | Features | Characters defined | Address space
1991 | Unicode 1.0 | Code space: 16 bits, U+0000 – U+FFFF (binary 1111 1111 1111 1111) | 7,161 | 65,536
1996 | Unicode 2.0 | Code space: 21 bits, U+0000 – U+10FFFF (binary 1 0000 1111 1111 1111 1111) | 38,950 | 1,114,112 (17 * 65,536)
2015 | Unicode 8.0 | | 120,737 | 1,114,112
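
The address-space figures in the timeline follow from simple arithmetic. A short Python sketch (variable names are illustrative only):

```python
# Unicode 1.0: a 16-bit code space.
bmp_size = 2 ** 16                 # 65,536 code points, U+0000 to U+FFFF

# Unicode 2.0 and later: 17 planes of 65,536 code points each.
planes = 17
code_space = planes * bmp_size     # 1,114,112 code points
last_code_point = code_space - 1   # U+10FFFF

print(bmp_size, code_space, f"U+{last_code_point:X}")
# The largest code point needs 21 bits to write down.
print(last_code_point.bit_length())
```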

13 Unicode Characters

14 Unicode Characters

15 Unicode on the Web

16 Unicode on the Web
Before HTML 5:
HTML 5:
NCR: a
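
A numeric character reference (NCR) writes a character as its code point number, in decimal (`&#97;`) or hexadecimal (`&#x61;`). The sketch below uses Python's standard `html.unescape` to resolve such references; the helper `to_ncr` and the sample references are illustrative assumptions, not the markup from the slide.

```python
import html


def to_ncr(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


print(to_ncr("a"))               # hexadecimal NCR for the letter a
print(to_ncr("☺"))               # hexadecimal NCR for U+263A

# A browser (or html.unescape) turns the reference back into the character.
print(html.unescape("&#97;"))    # decimal form
print(html.unescape("&#x61;"))   # hexadecimal form
```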

17 Unicode Script Blocks

18 17 Code Point Planes
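
Because each of the 17 planes holds 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch; the function name `plane_of` is hypothetical:

```python
def plane_of(code_point: int) -> int:
    """Return the plane (0..16) that a Unicode code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # same as code_point // 65536


for cp in (0x263A, 0x1F427, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Plane 0 is the Basic Multilingual Plane; the code points shown on the section title slides, such as U+1F427, fall in plane 1.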

19 BMP – Basic Multilingual Plane

20 Characters per Plane

21 Encoding U+1F427

22 Unicode Encoding
CEF: Character Encoding Form
UTF: Unicode Transformation Format

23 UTF-32 U+1F467

24 UTF-32
1:1 (each code point is stored as exactly one 32-bit code unit)
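
The 1:1 relationship can be checked directly: in UTF-32 the four bytes of a character are just its code point written as a 32-bit number. A small sketch using Python's built-in `utf-32-be` codec (big-endian chosen here only so the bytes read left to right):

```python
ch = "\U0001F467"                 # the character U+1F467 from the section title
raw = ch.encode("utf-32-be")      # big-endian, no byte order mark

print(raw.hex())                  # the code point, padded to 8 hex digits
print(len(raw))                   # always 4 bytes per character

# Reading the four bytes back as one integer gives the code point again.
print(int.from_bytes(raw, "big") == ord(ch))
```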

25 UTF-8 U+1F466

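26 UTF-8
Bits | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6
7 | 0000 007F | 0xxx xxxx
11 | 0000 07FF | 110x xxxx | 10xx xxxx
16 | 0000 FFFF | 1110 xxxx | 10xx xxxx | 10xx xxxx
21 | 001F FFFF | 1111 0xxx | 10xx xxxx | 10xx xxxx | 10xx xxxx
26 | 03FF FFFF | 1111 10xx | 10xx xxxx | 10xx xxxx | 10xx xxxx | 10xx xxxx
31 | 7FFF FFFF | 1111 110x | 10xx xxxx | 10xx xxxx | 10xx xxxx | 10xx xxxx | 10xx xxxx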
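
The bit patterns in the UTF-8 table can be turned into code almost line by line. The sketch below is a teaching aid, not a replacement for the built-in codec; the function name `utf8_encode` is hypothetical. It covers only the 1- to 4-byte rows, since the 5- and 6-byte forms are not needed for code points up to U+10FFFF.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by filling in the x bits."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:                       # 0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110x xxxx, 10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110 xxxx, 10xx xxxx, 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 1111 0xxx, then three 10xx xxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xBF, 0x263A, 0x1F466):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

Comparing the result with `chr(cp).encode("utf-8")` is a quick way to check the hand-written version.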

27 UTF-EBCDIC
Bits | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5
7 | 0000 007F | 0xxx xxxx
5 | 0000 009F | 100x xxxx
10 | 0000 03FF | 110x xxxx | 101x xxxx
14 | 0000 3FFF | 1110 xxxx | 101x xxxx | 101x xxxx
18 | 0003 FFFF | 1111 0xxx | 101x xxxx | 101x xxxx | 101x xxxx
21 | 0010 FFFF | 1111 100x | 101x xxxx | 101x xxxx | 101x xxxx | 101x xxxx

28 UTF-8 Example

29 UTF-16 U+1F467

30 From Unicode 1.0 to Unicode 2.0
65,536 characters ought to be enough for anybody
Workaround concept for backward compatibility
Surrogates
Planes (65,536 code points each)
Original Unicode 1.0 → Basic Multilingual Plane
Added 16 extra planes
Total 17 * 65,536 = 1,114,112 code points

31 Surrogates
Kind | Range | Hex | Mask
High | U+D800 – U+DBFF | D8 XX | 1101 1000 xxxx xxxx
 | | D9 XX | 1101 1001 xxxx xxxx
 | | DA XX | 1101 1010 xxxx xxxx
 | | DB XX | 1101 1011 xxxx xxxx
4 * 256 = 1,024 high surrogates
Low | U+DC00 – U+DFFF | DC XX | 1101 1100 xxxx xxxx
 | | DD XX | 1101 1101 xxxx xxxx
 | | DE XX | 1101 1110 xxxx xxxx
 | | DF XX | 1101 1111 xxxx xxxx
4 * 256 = 1,024 low surrogates
Combinations: 1,024 * 1,024 = 1,048,576 new code points

32 UTF-16 Transformation
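
The UTF-16 transformation for code points above U+FFFF can be sketched in a few lines: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits, one for the high surrogate and one for the low surrogate. The function name `utf16_code_units` is hypothetical.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the one or two 16-bit code units that encode a code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x10000:
        return [cp]                       # BMP: the code point itself
    offset = cp - 0x10000                 # 20 bits remain
    high = 0xD800 | (offset >> 10)        # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)       # bottom 10 bits
    return [high, low]


for cp in (0x263A, 0x1F467):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
```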

33 Surrogates Mapping
hi \ lo | DC00 | DC01 | DC02 | DC03 | … | DFFE | DFFF
D800 | 1 0000 | 1 0001 | 1 0002 | 1 0003 | … | 1 03FE | 1 03FF
D801 | 1 0400 | 1 0401 | 1 0402 | 1 0403 | … | 1 07FE | 1 07FF
D802 | 1 0800 | 1 0801 | 1 0802 | 1 0803 | … | 1 0BFE | 1 0BFF
D803 | 1 0C00 | 1 0C01 | 1 0C02 | 1 0C03 | … | 1 0FFE | 1 0FFF
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮
DBFB | 10 EC00 | 10 EC01 | 10 EC02 | 10 EC03 | … | 10 EFFE | 10 EFFF
DBFC | 10 F000 | 10 F001 | 10 F002 | 10 F003 | … | 10 F3FE | 10 F3FF
DBFD | 10 F400 | 10 F401 | 10 F402 | 10 F403 | … | 10 F7FE | 10 F7FF
DBFE | 10 F800 | 10 F801 | 10 F802 | 10 F803 | … | 10 FBFE | 10 FBFF
DBFF | 10 FC00 | 10 FC01 | 10 FC02 | 10 FC03 | … | 10 FFFE | 10 FFFF
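
Every cell of the mapping table can be computed rather than looked up. A minimal sketch of the reverse direction, from a surrogate pair back to a code point (the function name `from_surrogates` is hypothetical):

```python
def from_surrogates(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate must be in D800..DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate must be in DC00..DFFF")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Spot-check a few cells, including the corners of the table.
for high, low in [(0xD800, 0xDC00), (0xD801, 0xDC00),
                  (0xDBFB, 0xDC00), (0xDBFF, 0xDFFF)]:
    print(f"{high:04X} {low:04X} -> U+{from_surrogates(high, low):X}")
```

Each row of the table advances by 0x400 because one high surrogate pairs with 1,024 low surrogates.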

34 UTF-16 Encoding Example

35 Endianness
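
Endianness is only about the order in which the bytes of a multi-byte number are written down. A short Python sketch using the code point U+263A as an arbitrary sample value:

```python
value = 0x263A                         # the smiley ☺ as a 16-bit number

big = value.to_bytes(2, "big")         # most significant byte first
little = value.to_bytes(2, "little")   # least significant byte first

print(big.hex(), little.hex())

# The UTF-16 codecs with an explicit byte order produce the same bytes.
print("☺".encode("utf-16-be") == big)
print("☺".encode("utf-16-le") == little)
```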

36 Endianness in Normal Life
Language | 92 | Endian
English | ninety-two (90-2) | Big
German | zweiundneunzig (2-and-90) | Little
French | quatre-vingt-douze (4-20-12) |
Usage | Form | Endian
Java package | com.tieto.intra | Big
Domain name | intra.tieto.com | Little

37 Unicode BOM – Byte Order Mark
U+FEFF
BOM use is optional at the start of the text stream
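
Since U+FEFF is encoded differently by each UTF and byte order, its bytes at the start of a stream reveal the encoding. A minimal sketch using the BOM constants from Python's `codecs` module; the function name `sniff_bom` is hypothetical, and a stream without a BOM simply yields `None`.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("\ufeffhi".encode("utf-16-le")))
print(sniff_bom("\ufeffhi".encode("utf-32-be")))
print(sniff_bom(b"hi"))
```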

38 Encodings Summary U+1F3B8

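39 Properties of Each UTF
Name | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE | UTF-32 | UTF-32BE | UTF-32LE
Smallest code point | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 | 0000
Largest code point | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF
Code unit size | 8 bits | 16 bits | 16 bits | 16 bits | 32 bits | 32 bits | 32 bits
Byte order | N/A | by BOM | big-endian | little-endian | by BOM | big-endian | little-endian
Fewest bytes per character | 1 | 2 | 2 | 2 | 4 | 4 | 4
Most bytes per character | 4 | 4 | 4 | 4 | 4 | 4 | 4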

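40 Unicode Encodings Length
Bytes per code point. In GB 18030, code points from 00 0080 to 00 FFFF take 2 bytes for characters inherited from GB 2312/GBK (e.g. most Chinese characters) and 4 bytes for everything else.
Code range | UTF-8 | UTF-EBCDIC | UTF-16 | UTF-32 | GB 18030
00 0000 – 00 007F | 1 | 1 | 2 | 4 | 1
00 0080 – 00 009F | 2 | 1 | 2 | 4 | 2 or 4
00 00A0 – 00 03FF | 2 | 2 | 2 | 4 | 2 or 4
00 0400 – 00 07FF | 2 | 3 | 2 | 4 | 2 or 4
00 0800 – 00 3FFF | 3 | 3 | 2 | 4 | 2 or 4
00 4000 – 00 FFFF | 3 | 4 | 2 | 4 | 2 or 4
01 0000 – 03 FFFF | 4 | 4 | 4 | 4 | 4
04 0000 – 10 FFFF | 4 | 5 | 4 | 4 | 4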
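
The lengths in this table can be measured rather than memorised. The sketch below counts bytes with Python's built-in codecs; the sample characters are arbitrary picks from different code ranges, and UTF-EBCDIC is left out because the standard library has no codec for it.

```python
samples = ["A", "é", "☺", "\U0001F3B8"]   # U+0041, U+00E9, U+263A, U+1F3B8
encodings = ["utf-8", "utf-16-be", "utf-32-be", "gb18030"]

lengths = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

print("code point  " + "  ".join(f"{enc:>9}" for enc in encodings))
for ch, row in lengths.items():
    cells = "  ".join(f"{row[enc]:>9}" for enc in encodings)
    print(f"U+{ord(ch):06X}    {cells}")
```

The explicit `-be` codec names are used so that no byte order mark is added to the count.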

41 Unicode Encoding – Example 1

42 Unicode Encoding – Example 2

43 Useful Stuff א U+05D0

44 Useful Utilities
Online Character Converter http://code.cside.com/3rdpage/us/javaUnicode/converter.html
BabelMap 6.0.0.2 http://babelstone.co.uk/Software/BabelMap.html
Unibook 5.2.0 http://unicode.org/unibook/
Alan Wood’s Unicode Resources http://www.alanwood.net/unicode/index.html
Microsoft TrueTypeProperty Extension http://www.microsoft.com/typography/TrueTypeProperty21.mspx
Uniview 6.1 http://rishida.net/scripts/uniview/

45 Useful links
http://en.wikipedia.org/wiki/Unicode_font#Comparison_of_fonts

46 Good Night! U+1F4A4

