Characters CS240.

Slides:



Advertisements
Similar presentations
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Advertisements

Information Representation
ICS312 Set 2 Representation of Numbers and Characters.
Lecture 10 : Huffman Encoding Bong-Soo Sohn Assistant Professor School of Computer Science and Engineering Chung-Ang University Lecture notes : courtesy.
Huffman Encoding 16-Apr-17.
CS 206 Introduction to Computer Science II 04 / 29 / 2009 Instructor: Michael Eckmann.
מבנה מחשב תרגול 2 ייצוג תווים בחומרה. A programmer that doesn’t care about characters encoding in not much better than a medical doctor who doesn’t believe.
CS430 © 2006 Ray S. Babcock Lossy Compression Examples JPEG MPEG JPEG MPEG.
Data Compression Basics & Huffman Coding
Data Storage. SIGN AND MAGNITUDE Storing and representing numbers.
CCE-EDUSAT SESSION FOR COMPUTER FUNDAMENTALS Date: Session III Topic: Number Systems Faculty: Anita Kanavalli Department of CSE M S Ramaiah.
Digital Data Patrice Koehl Computer Science UC Davis.
COMPUTER FUNDAMENTALS David Samuel Bhatti
Unicode, character sets, and a a little history. Historical Perspective First came EBCIDIC (6 Bits?) Then in the early 1960s came ASCII – Most computers.
CHARACTERS Data Representation. Using binary to represent characters Computers can only process binary numbers (1’s and 0’s) so a system was developed.
Management Information Systems Lection 06 Archiving information CLARK UNIVERSITY College of Professional and Continuing Education (COPACE)
CSE Lectures 22 – Huffman codes
Dale & Lewis Chapter 3 Data Representation
Introduction to Computing Using Python Chapter 6  Encoding of String Characters  Randomness and Random Sampling.
Chapter 5 Data representation.
UNICODE Character Sets and Coding Standards Han Unification and ISO10646 Encoding Evolution and Unicode Programming Unicode.
ASCII and Unicode.
Agenda Data Representation – Characters Encoding Schemes ASCII
Computer System Basics 1 Number Systems & Text Representation Computer Forensics BACS 371.
Computer Math CPS120: Data Representation. Representing Data The computer knows the type of data stored in a particular location from the context in which.
Data Representation CS280 – 09/13/05. Binary (from a Hacker’s dictionary) A base-2 numbering system with only two digits, 0 and 1, which is perfectly.
Huffman Codes. Encoding messages  Encode a message composed of a string of characters  Codes used by computer systems  ASCII uses 8 bits per character.
Lec 3: Data Representation Computer Organization & Assembly Language Programming.
Lecture Objectives  To learn how to use a Huffman tree to encode characters using fewer bytes than ASCII or Unicode, resulting in smaller files and reduced.
Binary, Decimal and Hexadecimal Numbers Svetlin Nakov Telerik Corporation
CS-2852 Data Structures LECTURE 13B Andrew J. Wozniewicz Image copyright © 2010 andyjphoto.com.
Computer System Basics 1 Number Systems & Text Representation Computer Forensics BACS 371.
Huffman coding Content 1 Encoding and decoding messages Fixed-length coding Variable-length coding 2 Huffman coding.
Huffman Code and Data Decomposition Pranav Shah CS157B.
The character data type char. Character type char is used to represent alpha-numerical information (characters) inside the computer uses 2 bytes of memory.
Huffman Codes Juan A. Rodriguez CS 326 5/13/2003.
Data Representation. What is data? Data is information that has been translated into a form that is more convenient to process As information take different.
Chapter 3 Data Representation. 2 Compressing Files.
Comp 335 File Structures Data Compression. Why Study Data Compression? Conserves storage space Files can be transmitted faster because there are less.
Data Encoding COSC Computers and Data Computers store information as sequences of bits Computers store many types of data: numbers text audio images.
Lossless Decomposition and Huffman Codes Sophia Soohoo CS 157B.
Coding and Formatting Data for Network Use Syntax – Format (2-bytes for port number) Semantics – Meaning of Allowed Content (binary 16-bit number, Port.
1Computer Sciences Department. 2 Advanced Design and Analysis Techniques TUTORIAL 7.
Information Coding Schemes Group Member : Yvonne Tiffany Jurifah bt Junaidi Clara Jane George.
1.4 Representation of data in computer systems Character.
Introduction to computer science Lec2 cs111. Extended Binary Coded Decimal Interchange Code (EBCDIC) is an 8- bit character encoding used mainly on.
Lecture Coding Schemes. Representing Data English language uses 26 symbols to represent an idea Different sets of bit patterns have been designed to represent.
Data Representation COE 308 Computer Architecture
3.3 Fundamentals of data representation
Machine level representation of data Character representation
Data Representation ICS 233
Lec 3: Data Representation
Lesson Objectives Aims You should be able to:
Design & Analysis of Algorithm Huffman Coding
Data Representation.
Assignment 6: Huffman Code Generation
Number Representation
Data Encoding Characters.
TOPICS Information Representation Characters and Images
Data Representation COE 301 Computer Organization
The Huffman Algorithm We use Huffman algorithm to encode a long message as a long bit string - by assigning a bit string code to each symbol of the alphabet.
Ch2: Data Representation
Advanced Algorithms Analysis and Design
Presenting information as bit patterns
Digital Encodings.
Data Representation ICS 233
Huffman Encoding.
Huffman Coding Greedy Algorithm
Data Representation COE 308 Computer Architecture
Presentation transcript:

Characters CS240

ASCII American Standard Code for Information Interchange based on the English alphabet Originally developed for telegraph/teleprinters. encodes 128 specified characters into 7 bits Letters, numbers, control characters 8th bit was used by computer vendors to “extend” ASCII by adding their own characters (i.e. line drawing characters)

ASCII Table

Unicode Assigns a number or “code point” to each character. 1,114,112 code points from 0 to 0x10FFFF. Latest version of Unicode standard contains more than 110,000 characters. First 128 code points are same as ASCII. A code point is written as “U+” followed by a hexadecimal number. i.e. A is U+0041

Unicode Encoding Can be implemented by different character encodings: UCS-2 – uses 2 bytes. Obsolete. It cannot encode every character in the current Unicode standard. UTF-8 – uses 1 to 4 bytes. Used by Unix and HTML. UTF-16 – uses 2 bytes or 4 bytes. Extends UCS-2. Used by Windows XP, Windows Vista, Windows 7

UTF-8 UCS Transformation Format—8-bit a variable-width encoding designed for backward compatibility with ASCII The first 128 characters of Unicode are encoded with the same binary value as ASCII encodes each of the 1,112,064 code points in the Unicode character set using one to four bytes

UTF-8 used in HTML <!DOCTYPE html> <html lang="en" dir="ltr" class="client-nojs"> <head> <meta charset="UTF-8" />

UTF-8 Encoding Scheme Code Points 0 to 127 - Same value as the ASCII code. It is represented with one byte, with the high order bit set to 0. Code points 128 and higher are represented by 2 to 6 bytes, with a leading byte and one or more continuation bytes. The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence. All continuation bytes contain ‘10’ in the high order bits. The code point’s binary value is encoded in the remaining bits, with 0’s padded to the left if necessary.

UTF-8 Encoding Scheme Example: $ = U+0024 U+0024 is within the range of U+0000 to U+007F, so it is stored in 1 byte. (24)16 is (36)10 or (100100)2 According to the specifications, the format for storing into one byte is: 0xxxxxxx where 0 is placed in the highest order bit and the binary is stored into the remaining 7 bits. 100100 is only 6 bits, so we need to pad it with a 0 to make it seven bits: 100100 -> 0100100 The resulting UTF-8 encoded byte will be: 00100100

UTF-8 Encoding Scheme Example: ¢ = U+00A2 U+00A2 is within the range of U+0080 to U+07FF, so it is stored in 2 bytes. (A2)16 is (162)10 or (10100010)2 According to the specifications, the format for storing into two bytes is: 1st byte: 110xxxxx where 110 is placed in the higher order bits, indicating that this is a leading byte of 2 bytes. 2nd byte: 10xxxxxx where 10 is placed in the higher order bits, indicating that this is a continuation byte. 10100010 is only 8 bits, so we need to pad it with a 0 to make it 11 bits: 10100010 -> 00010100010 The resulting UTF-8 encoded bytes will be: 11000010 10100010

Compression Lossless data compression – allows original data to be reconstructed from compressed data without loss of information. Lossy data compression – allows for some loss. Reconstructed data is an approximation of original data.

Huffman Encoding Assigns binary codes to characters to reduce the overall number of bits that would otherwise be needed to store a string of characters. Uses a binary tree to generate a code table for lossless data compression. A variable length binary code is assigned to each character Most frequent characters represented by the smallest binary numbers Least frequent characters represented by the largest binary numbers

Huffman Encoding ABRACADABRA 5 distinct characters. In fixed width encoding, would require 3 bits to represent the 5 distinct characters. i.e. A = 000 B = 001 R = 010 C = 011 D = 100 If each character is 3 bits, and ABRACADABRA has 11 characters, we would need 33 bits to store that word.

Generate a Huffman Table List all the characters of the word: ABRACADABRA List the frequency of each character A: 5 B: 2 R: 2 C: 1 D: 1

Generate a Huffman Table A: 5 B: 2 R: 2 C: 1 D: 1 2 C D

Generate a Huffman Table A: 5 B: 2 R: 2 2 4 R 2 C D

Generate a Huffman Table 6 B 4 R 2 C D

Generate a Huffman Table 11 A 6 B 4 R 2 C D

Generate a Huffman Table 11 Generate Code: Assign 0 to left branches Assign 1 to right branches A = 0 B = 10 R = 110 C = 1110 D = 1111 1 A 6 1 B 4 1 R 2 1 C D

Encode data Using: A = 0 B = 10 R = 110 C = 1110 D = 1111 10 110 1110 1111 23 bits to encode ‘ABRACADABRA’ vs 33 bits.