ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s.

Presentation transcript:

Overview
It’s not your father’s character set
– 8-bit characters
– ASCII
– The rest of the world wakes up to computers
Unicode
– Character codes
– Different flavors
Encoding and Decoding classes
Example

The Good Old Days
Focus on unaccented English letters
Every letter, digit, capital, etc. is represented by a numeric code
– Space, 32; “A”, 65; “a”, 97
Used 7 bits, leaving one bit free on most computers
– WordStar and the 8th bit
Below 32: control characters
– 7, beep; 12, form feed
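A minimal sketch of the character-to-code mapping described on this slide. The class and method names (`AsciiCodes`, `CodeOf`, `IsSevenBit`, `Describe`) are made up for illustration and are not part of the original deck.

```csharp
using System;

public static class AsciiCodes
{
    // Returns the numeric code of a character, e.g. 'A' gives 65.
    public static int CodeOf(char c)
    {
        return (int)c;
    }

    // True when the code fits in 7 bits (0..127), i.e. plain ASCII.
    public static bool IsSevenBit(char c)
    {
        return c <= 127;
    }

    // Formats a character as "code = 8-bit binary" so the unused
    // high-order bit is visible as a leading zero.
    public static string Describe(char c)
    {
        return CodeOf(c) + " = " + Convert.ToString(CodeOf(c), 2).PadLeft(8, '0');
    }
}
```

For example, `AsciiCodes.Describe('A')` yields `65 = 01000001`: seven bits carry the code and the leading bit is the spare one.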

The 8th bit: values 128–255
Everybody had their own ideas
OEM character sets
IBM PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.)
Outside the U.S. -> different languages
– Code 130: é in the US, gimel (ג) in Israel
– Difficult to exchange documents
Code pages: regional definitions of the byte values
– Israel: code page 862
– Greek: code page 737
– ISO/ANSI code pages
Asia: alphabets had thousands of characters
– No way to store them in one byte (8 bits)
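The code-page problem can be sketched by decoding the same byte under two code pages. `CodePageDemo` and `DecodeByte` are hypothetical names. The `RegisterProvider` call is an assumption about the runtime: it is needed on modern .NET (Core and later), where legacy code pages are not registered by default, and it does not exist on .NET Framework 2.0, where the line would have to be removed.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Decodes a single byte under the given code page number.
    public static string DecodeByte(byte value, int codePage)
    {
        // Assumption: modern .NET. Registers the legacy code pages
        // (437, 862, ...) so that GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding = Encoding.GetEncoding(codePage);
        return encoding.GetString(new byte[] { value });
    }
}
```

`CodePageDemo.DecodeByte(130, 437)` should give é and `CodePageDemo.DecodeByte(130, 862)` should give ג: the byte is identical, only the table used to interpret it differs.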

Unicode
Not a 16-bit code
A new way of thinking about characters
Old way:
– Character “A” maps directly to bits in memory or on disk
– A -> 0100 0001
Unicode way:
– Each letter in every alphabet maps to a “code point”
– Abstract concept
– “A” is a Platonic “form” that just floats out there
– A -> U+0041 (code point)
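To make the idea of a code point concrete, here is a small sketch that lists the code points of a string in U+XXXX notation. `CodePoints.Of` is an illustrative helper, not something from the deck. It relies on `char.ConvertToUtf32`, which combines a surrogate pair into a single code point.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Lists the Unicode code points of a string in U+XXXX notation.
    public static List<string> Of(string text)
    {
        List<string> result = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            // Reads one code point, joining a surrogate pair if present.
            int codePoint = char.ConvertToUtf32(text, i);
            result.Add("U+" + codePoint.ToString("X4"));

            // A surrogate pair occupies two chars; skip the second half.
            if (char.IsHighSurrogate(text[i])) i++;
        }
        return result;
    }
}
```

With this helper, `CodePoints.Of("A")` contains `U+0041`, and the Arabic letter ain (`"\u0639"`) comes back as `U+0639`. Nothing here says how the number is stored; that is the job of an encoding.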

Unicode
Hello -> U+0048 U+0065 U+006C U+006C U+006F
Storing in 2 bytes each:
– 00 48 00 65 00 6C 00 6C 00 6F (big endian)
– Or 48 00 65 00 6C 00 6C 00 6F 00 (little endian)
A Byte Order Mark (BOM) at the beginning of the stream tells the reader which byte order was used
UTF-8 coding system
– Stores Unicode code points (magic numbers) as 8-bit bytes
– Values 0–127 go into a single byte
– Values 128 and above take 2, 3, or 4 bytes
– For characters up to 127, UTF-8 looks just like ASCII
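The byte layouts on this slide can be checked with the standard `System.Text.Encoding` objects. The wrapper class `HelloBytes` and its two methods are hypothetical helpers; `GetBytes`, `GetPreamble`, and `BitConverter.ToString` are the real framework calls.

```csharp
using System;
using System.Text;

public static class HelloBytes
{
    // Hex dump of the bytes an encoding produces for the text
    // (GetBytes does not include a byte order mark).
    public static string Hex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    // Hex dump of the byte order mark the encoding writes
    // at the start of a stream.
    public static string Bom(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }
}
```

`HelloBytes.Hex(Encoding.BigEndianUnicode, "Hello")` gives `00-48-00-65-00-6C-00-6C-00-6F`, `Encoding.Unicode` (little endian) gives `48-00-65-00-6C-00-6C-00-6F-00`, and `Encoding.UTF8` gives `48-65-6C-6C-6F`, the same bytes as ASCII. The BOMs are `FE-FF`, `FF-FE`, and `EF-BB-BF` respectively.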

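Unicode Encodings
UTF-8
UTF-16 – characters stored as 2-byte, 16-bit (halfword) units; characters outside the basic range take two units (a surrogate pair). Its fixed-width predecessor was called UCS-2
UTF-32 – characters stored in 4-byte, 32-bit sequences
UTF-7 – forces a zero in the high-order bit (for firewalls and other 7-bit channels)
ASCII encoding – 7 bits only; characters above 127 cannot be represented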
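One way to compare the encodings is to count the bytes each needs for the same text. This is a sketch with a made-up helper (`ByteCounts.For`) that returns the counts in the order UTF-8, UTF-16, UTF-32. Non-ASCII test characters are written as `\u` escapes so the example does not depend on how the source file itself is encoded.

```csharp
using System;
using System.Text;

public static class ByteCounts
{
    // Returns the encoded size of the text in bytes as
    // { UTF-8, UTF-16, UTF-32 }.
    public static int[] For(string text)
    {
        return new int[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text),  // UTF-16, little endian
            Encoding.UTF32.GetByteCount(text)
        };
    }
}
```

`ByteCounts.For("A")` gives 1, 2, 4; `"\u00E9"` (é) gives 2, 2, 4; `"\uD55C"` (a Hangul syllable) gives 3, 2, 4; and `"\uD83D\uDE00"` (an emoji outside the basic range) gives 4, 4, 4. UTF-8 and UTF-16 are variable width, while UTF-32 is always four bytes per code point.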

Definitions
.NET uses UTF-16 encoding internally to store text
Encoding:
– translates a set of Unicode characters into a sequence of bytes
– used when sending a string to a file or a network stream
Decoding:
– translates a sequence of bytes into a set of Unicode characters
– used when reading a string from a file or a network stream
StreamReader and StreamWriter default to UTF-8
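The two definitions can be sketched as a round trip: encode a string to bytes, then decode the bytes back. `RoundTrip.Transcode` is a hypothetical helper that deliberately lets the writer and reader disagree, to show what happens when bytes are decoded with the wrong encoding.

```csharp
using System;
using System.Text;

public static class RoundTrip
{
    // Encodes text with one encoding and decodes the resulting
    // bytes with another (pass the same encoding for a clean round trip).
    public static string Transcode(string text, Encoding writeAs, Encoding readAs)
    {
        byte[] bytes = writeAs.GetBytes(text);   // encoding: chars -> bytes
        return readAs.GetString(bytes);          // decoding: bytes -> chars
    }
}
```

`RoundTrip.Transcode("caf\u00E9", Encoding.UTF8, Encoding.UTF8)` returns the original string. Decoding the UTF-8 bytes of `"\u00E9"` as `Encoding.GetEncoding("iso-8859-1")` instead returns the two characters `Ã©`, the familiar garbled output of a mismatched decoder.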

Encoding/Decoding Classes
UTF32Encoding class
– Converts characters to and from UTF-32 encoding
UnicodeEncoding class
– Converts characters to and from UTF-16 encoding
UTF8Encoding class
– Converts to and from UTF-8 encoding: 1, 2, 3, or 4 bytes per character
ASCIIEncoding class
– Converts to and from ASCII encoding; characters above 127 are lost (replaced with “?”)
System.Text.Encoding supports a wide range of ANSI/ISO encodings
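A short sketch of using the encoding classes directly rather than through the static `Encoding` properties. `EncodingClasses` and its methods are illustrative names. As far as I know, current .NET replaces a character that ASCII cannot represent with `?` rather than removing it, which is what the first method shows.

```csharp
using System;
using System.Text;

public static class EncodingClasses
{
    // Pushes text through ASCIIEncoding and back; anything above
    // code 127 does not survive the trip.
    public static string ThroughAscii(string text)
    {
        ASCIIEncoding ascii = new ASCIIEncoding();
        byte[] bytes = ascii.GetBytes(text);
        return ascii.GetString(bytes);
    }

    // Length of the byte order mark a UTF8Encoding instance emits,
    // depending on the constructor flag.
    public static int Utf8PreambleLength(bool emitBom)
    {
        UTF8Encoding utf8 = new UTF8Encoding(emitBom);
        return utf8.GetPreamble().Length;
    }
}
```

`EncodingClasses.ThroughAscii("caf\u00E9")` returns `caf?`. `Utf8PreambleLength(true)` is 3 and `Utf8PreambleLength(false)` is 0, which is the main reason to construct `UTF8Encoding` yourself: it lets you choose whether a BOM is written.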

Convert a string into a stream of encoded bytes
1. Get an encoding object
Encoding e = Encoding.GetEncoding("Korean");
2. Use the encoding object’s GetBytes() method to convert a string into its byte representation
byte[] encoded;
encoded = e.GetBytes("I'm gonna be Korean!");
Demo: D:\_Framework 2.0 Training Kits\70-536\Chapter 03\EncodingDemo

Write a file in encoded form
FileStream fs = new FileStream("text.txt", FileMode.OpenOrCreate);
...
StreamWriter t = new StreamWriter(fs, Encoding.UTF8);
t.Write("This is in UTF8");
t.Close(); // flushes the buffered text and closes the file
Read an encoded file
FileStream fs = new FileStream("text.txt", FileMode.Open);
...
StreamReader t = new StreamReader(fs, Encoding.UTF8);
String s = t.ReadLine();
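The write and read steps above can be combined into a self-contained sketch. `EncodedFile`, `Write`, and `ReadFirstLine` are hypothetical names; the `StreamWriter` and `StreamReader` constructors are the standard ones. The reader is deliberately given no encoding, to show that it can work out a Unicode encoding from the byte order mark the writer emitted.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodedFile
{
    // Writes text to a file in the given encoding, replacing any
    // existing content.
    public static void Write(string path, string text, Encoding encoding)
    {
        // 'using' closes the writer, which flushes buffered text to disk.
        using (StreamWriter writer = new StreamWriter(path, false, encoding))
        {
            writer.Write(text);
        }
    }

    // Reads the first line without naming an encoding: StreamReader
    // assumes UTF-8 unless a byte order mark indicates otherwise.
    public static string ReadFirstLine(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            return reader.ReadLine();
        }
    }
}
```

Writing with `Encoding.Unicode` (UTF-16) and reading back with `ReadFirstLine` returns the original text, because the file starts with a BOM. A file in a legacy code page carries no such marker, so the reader would need the encoding passed explicitly.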

Summary
ASCII is one of the oldest encoding standards.
Unicode provides multilingual support.
System.Text.Encoding has static members for obtaining encoding objects, which encode and decode text.
Use an overloaded StreamWriter constructor that accepts an Encoding object when writing a file.
When reading, an Encoding object is often unnecessary: StreamReader defaults to UTF-8 and can detect Unicode encodings from a byte order mark.

References
Unicode and .NET – what does .NET Provide?
Hello Unicode, Goodbye ASCII spx?TID=unicode_encoding
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) html