ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s.


1 ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s

2 Overview
It’s not your father’s character set
  – 8-bit characters
  – ASCII
  – The rest of the world wakes up to computers
Unicode
  – Character codes
  – Different flavors
Encoding and Decoding classes
Example

3 The Good Old Days
Focus on unaccented English letters
Every letter, number, capital, etc. represented by codes 0-127
  – Space, 32; “A”, 65; “a”, 97
Used 7 bits, leaving one bit free on most computers
  – WordStar and the 8th bit
Below 32: control characters
  – 7, beep; 12, form feed
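The codes quoted on this slide can be checked directly. A minimal sketch, assuming a .NET 6 or later console project with top-level statements; the variable names are illustrative only:

```csharp
using System;

// A char converts implicitly to its numeric code.
int space = ' ';
int upperA = 'A';
int lowerA = 'a';

// Codes below 32 are control characters; 7 is the bell (beep).
char bell = (char)7;

Console.WriteLine($"space={space}, A={upperA}, a={lowerA}");
Console.WriteLine($"Code 7 is a control character: {char.IsControl(bell)}");
```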

4

7 8th bit, values 128-255
Everybody had their own ideas
OEM character sets
IBM PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.)
Outside the U.S. -> different languages
  – Code 130: é in the US, gimel (ג) in Israel
  – Difficult to exchange documents
Code pages: regional definitions of byte values 128-255
  – Israel: code page 862
  – Greece: code page 737
  – ISO/ANSI code pages
Asia: alphabets had thousands of characters
  – No way to store in one byte (8 bits)
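The code 130 example can be reproduced by decoding the same byte under two code pages. This is a sketch under assumptions: on .NET Core and .NET 5+ the legacy code pages are not available until `CodePagesEncodingProvider` is registered (on the older .NET Framework that registration step is not needed), and code page 437 is used here to stand in for the US IBM PC character set.

```csharp
using System;
using System.Text;

// Assumption: running on .NET Core / .NET 5+, where legacy code pages
// must be registered before Encoding.GetEncoding can find them.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

byte[] data = { 130 };

// The same byte decoded under two regional code pages.
string us = Encoding.GetEncoding(437).GetString(data);      // original IBM PC code page
string israel = Encoding.GetEncoding(862).GetString(data);  // Hebrew code page

// Print code points rather than glyphs so the console font does not matter.
Console.WriteLine($"CP437: U+{(int)us[0]:X4}");
Console.WriteLine($"CP862: U+{(int)israel[0]:X4}");
```

One byte value, two different characters: this is why documents were hard to exchange between regions.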

8 Unicode
Not a 16-bit code
A new way of thinking about characters
Old way:
  – Character “A” maps to memory or disk bits
  – A -> 0100 0001
Unicode way:
  – Each letter in every alphabet maps to a “code point”
  – Abstract concept
  – “A” is a Platonic “form”: it just floats out there
  – A -> U+0041 (its code point)
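A code point is just a number, independent of how it is later stored. The sketch below, assuming .NET 6+ top-level statements, looks up the code points for the Latin letter A and, as a second illustrative example not taken from the slide, the Arabic letter ain:

```csharp
using System;

// char.ConvertToUtf32 returns the code point at a given string index.
int latinA = char.ConvertToUtf32("A", 0);
int arabicAin = char.ConvertToUtf32("\u0639", 0);

Console.WriteLine($"A   -> U+{latinA:X4}");
Console.WriteLine($"ain -> U+{arabicAin:X4}");

// The reverse direction: from a code point back to a string.
string back = char.ConvertFromUtf32(0x0041);
Console.WriteLine(back);
```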

9 Unicode
Hello -> U+0048 U+0065 U+006C U+006C U+006F
Storing in 2 bytes each:
  – 0048 0065 006C 006C 006F (big endian)
  – Or 4800 6500 6C00 6C00 6F00 (little endian)
  – A Byte Order Mark (BOM) at the beginning of the stream tells the reader which order was used
UTF-8 coding system
  – Stores Unicode code points (magic numbers) as 8-bit bytes
  – Code points 0-127 are stored in a single byte
  – Code points 128 and above are stored in 2, 3, or 4 bytes
  – For characters up to 127, UTF-8 looks just like ASCII
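The byte layouts for "Hello" can be inspected with the built-in encodings. A minimal sketch, assuming .NET 6+ top-level statements; `BitConverter.ToString` is used only to print the bytes as hex:

```csharp
using System;
using System.Text;

string hello = "Hello";

// The same five code points, stored three different ways.
string bigEndian = BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(hello));
string littleEndian = BitConverter.ToString(Encoding.Unicode.GetBytes(hello));
string utf8 = BitConverter.ToString(Encoding.UTF8.GetBytes(hello));

// GetPreamble returns the byte order mark each encoding would write.
string bomLittle = BitConverter.ToString(Encoding.Unicode.GetPreamble());
string bomBig = BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble());

Console.WriteLine($"UTF-16 BE: {bigEndian}  (BOM {bomBig})");
Console.WriteLine($"UTF-16 LE: {littleEndian}  (BOM {bomLittle})");
Console.WriteLine($"UTF-8:     {utf8}");
```

Because every character in "Hello" is below 128, the UTF-8 bytes are identical to the ASCII bytes.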

10 Unicode Encodings
UTF-8
UTF-16: characters stored in 2-byte, 16-bit (halfword) units; characters outside the Basic Multilingual Plane take two units. Its fixed-width predecessor was called UCS-2
UTF-32: characters stored in 4-byte, 32-bit units
UTF-7: forces a zero in the high-order bit (for 7-bit-only channels, e.g. firewalls)
ASCII encoding: anything that does not fit in 7 bits is lost
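To compare the flavors, the sketch below counts how many bytes each encoding needs for a few sample characters. The samples (A, Hebrew gimel, the euro sign, and an emoji outside the Basic Multilingual Plane) are illustrative choices, written as escapes so the source file's own encoding does not matter.

```csharp
using System;
using System.Text;

string[] samples = { "A", "\u05D2", "\u20AC", "\U0001F600" };

foreach (string s in samples)
{
    int codePoint = char.ConvertToUtf32(s, 0);
    int utf8 = Encoding.UTF8.GetByteCount(s);
    int utf16 = Encoding.Unicode.GetByteCount(s);
    int utf32 = Encoding.UTF32.GetByteCount(s);
    Console.WriteLine($"U+{codePoint:X4}: UTF-8={utf8}, UTF-16={utf16}, UTF-32={utf32} bytes");
}
```

UTF-8 grows from 1 to 4 bytes as the code point gets larger, UTF-16 uses 2 or 4, and UTF-32 is always 4.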

11 Definitions
.NET uses UTF-16 encoding internally to store text
Encoding:
  – Translates a set of Unicode characters into a sequence of bytes
  – Used when sending a string to a file or a network stream
Decoding:
  – Translates a sequence of bytes into a set of Unicode characters
  – Used when reading a string from a file or a network stream
StreamReader and StreamWriter default to UTF-8
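The two definitions correspond to `GetBytes` (encoding) and `GetString` (decoding) on an `Encoding` object. A minimal round-trip sketch, assuming .NET 6+ top-level statements and an illustrative sample string:

```csharp
using System;
using System.Text;

string original = "caf\u00E9";  // "café", written with an escape for the accented letter

// Encoding: Unicode characters -> bytes (what would go to a file or network stream).
byte[] wire = Encoding.UTF8.GetBytes(original);

// Decoding: bytes -> Unicode characters (what the receiving side does).
string restored = Encoding.UTF8.GetString(wire);

Console.WriteLine($"{original.Length} chars -> {wire.Length} bytes -> round trip ok: {original == restored}");
```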

12 Encoding/Decoding Classes
UTF32Encoding class
  – Converts characters to and from UTF-32 encoding
UnicodeEncoding class
  – Converts characters to and from UTF-16 encoding
UTF8Encoding class
  – Converts characters to and from UTF-8 encoding: 1, 2, 3, or 4 bytes per code point
ASCIIEncoding class
  – Converts characters to and from ASCII encoding; characters above 127 cannot be represented and are lost
System.Text.Encoding supports a wide range of ANSI/ISO encodings
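The four classes share the `Encoding` base class, so they can be exercised in a loop. The sketch below, with an illustrative sample string, shows that only `ASCIIEncoding` fails to round-trip a non-ASCII character; in current .NET it substitutes a question mark for anything above 127.

```csharp
using System;
using System.Text;

string text = "caf\u00E9";  // contains one character above 127

Encoding[] encoders =
{
    new ASCIIEncoding(),
    new UTF8Encoding(),
    new UnicodeEncoding(),   // UTF-16
    new UTF32Encoding()
};

foreach (Encoding enc in encoders)
{
    byte[] bytes = enc.GetBytes(text);
    string roundTrip = enc.GetString(bytes);
    Console.WriteLine($"{enc.GetType().Name}: {bytes.Length} bytes, lossless: {roundTrip == text}");
}
```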

13 Convert a string into a stream of encoded bytes
1. Get an encoding object
   Encoding e = Encoding.GetEncoding("Korean");
2. Use the encoding object’s GetBytes() method to convert a string into its byte representation
   byte[] encoded;
   encoded = e.GetBytes("I'm gonna be Korean!");
Demo: D:\_Framework 2.0 Training Kits\70-536\Chapter 03\EncodingDemo
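The same two steps can be tried with an encoding that is available without any extra setup. This sketch swaps the slide's Korean encoding for ISO-8859-1 (an assumption made for portability, since legacy code pages need a provider registered on .NET Core and later); the sample string is illustrative.

```csharp
using System;
using System.Text;

// Step 1: look up an encoding object by name.
Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

// Step 2: convert a string into its byte representation.
byte[] encodedBytes = latin1.GetBytes("r\u00E9sum\u00E9");  // "résumé"

// Each character, including the accented ones, becomes exactly one byte.
string hex = BitConverter.ToString(encodedBytes);
Console.WriteLine(hex);
```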

14 Write a file in encoded form
   FileStream fs = new FileStream("text.txt", FileMode.OpenOrCreate);
   ...
   StreamWriter t = new StreamWriter(fs, Encoding.UTF8);
   t.Write("This is in UTF8");
Read an encoded file
   FileStream fs = new FileStream("text.txt", FileMode.Open);
   ...
   StreamReader t = new StreamReader(fs, Encoding.UTF8);
   String s = t.ReadLine();
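A self-contained version of the write-then-read sequence might look like the sketch below. It differs from the slide in a few labeled ways: it writes to a temporary file with a hypothetical name, uses `FileMode.Create` so leftover content from an earlier run is truncated, and wraps the streams in `using` so they are flushed and closed.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical file name in the temp directory, used only for this demo.
string path = Path.Combine(Path.GetTempPath(), "encoding-demo.txt");

// Write: the StreamWriter encodes characters to UTF-8 bytes on the way out.
using (var fs = new FileStream(path, FileMode.Create))
using (var writer = new StreamWriter(fs, Encoding.UTF8))
{
    writer.WriteLine("This is in UTF8");
}

// Read: the StreamReader decodes the UTF-8 bytes back into characters.
string line;
using (var fs = new FileStream(path, FileMode.Open))
using (var reader = new StreamReader(fs, Encoding.UTF8))
{
    line = reader.ReadLine() ?? string.Empty;
}

Console.WriteLine(line);
File.Delete(path);
```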

15 Summary
ASCII is one of the oldest encoding standards.
Unicode provides multilingual support.
System.Text.Encoding provides static members for obtaining encoding objects, which then encode and decode text.
Use an overloaded StreamWriter constructor that accepts an Encoding object when writing a file.
When reading, specifying an Encoding object is optional: StreamReader defaults to UTF-8, so pass one explicitly if the file was written in a different encoding.

16 References
www.unicode.org
Unicode and .NET – what does .NET Provide? http://www.developerfusion.co.uk/show/4710/3/
Hello Unicode, Goodbye ASCII http://www.nicecleanexample.com/ViewArticle.aspx?TID=unicode_encoding
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) http://www.joelonsoftware.com/articles/Unicode.html

