ASCII and Unicode

Learning Outcomes

Terms

Outline
- ASCII code
- The Unicode system
- Unicode's main objective within computer processing
- Computer processing before the development of Unicode
- Unicode vs. ASCII
- Different kinds of Unicode encodings
- Significance of Unicode in the modern world

From Bits & Bytes to ASCII
Bytes can represent any collection of items using a "look-up table" approach. ASCII, the American Standard Code for Information Interchange, is used to represent characters. http://en.wikipedia.org/wiki/ASCII

ASCII
ASCII is an acronym for the American Standard Code for Information Interchange. It is a standard seven-bit code that was first proposed by the American National Standards Institute (ANSI) in 1963 and finalized in 1968 as ANSI Standard X3.4. The purpose of ASCII was to provide a standard code for various symbols, both visible characters and invisible (control) characters.

ASCII In the ASCII character set, each binary value between 0 and 127 represents a specific character. Most computers extend the ASCII character set to use the full range of 256 characters available in a byte. The upper 128 characters handle special things like accented characters from common foreign languages.

In general, ASCII works by assigning standard numeric values to letters, numbers, punctuation marks, and other characters such as control codes. An uppercase "A," for example, is represented by the decimal number 65.

Bytes: ASCII
By looking at the ASCII table, you can clearly see a one-to-one correspondence between each character and its ASCII code. For example, 32 is the ASCII code for a space. We could expand these decimal numbers out to binary (so 32 = 00100000) if we wanted to be technically correct -- that is how the computer really deals with things.
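
A minimal sketch in C (added here for illustration, not part of the original slides) that shows this correspondence by printing a character's ASCII code in decimal and as an 8-bit pattern:

```c
#include <stdio.h>

/* Print a character together with its ASCII code in decimal and binary. */
static void show_ascii(char c) {
    printf("'%c' = %3d = ", c, c);
    for (int bit = 7; bit >= 0; bit--)       /* most significant bit first */
        putchar(((unsigned char)c >> bit) & 1 ? '1' : '0');
    putchar('\n');
}

int main(void) {
    show_ascii('A');   /* 'A' =  65 = 01000001 */
    show_ascii(' ');   /* ' ' =  32 = 00100000 */
    return 0;
}
```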

Bytes: ASCII
Computers store text documents, both on disk and in memory, using these ASCII codes. For example, if you use Notepad in Windows XP/2000 to create a text file containing the words "Four score and seven years ago," Notepad uses 1 byte of memory per character (including 1 byte for each space character between the words -- ASCII character 32). When Notepad stores the sentence in a file on disk, the file will also contain 1 byte per character and per space. Binary values are usually displayed in hexadecimal to save display space.

Take a look at a file's size now. Take a look at the free space on your P: drive.

Bytes: ASCII
If you were to look at the file as a computer looks at it, you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character. On disk, the bytes for the phrase "Four and seven" look like this:

Character:   F    o    u    r  (sp)   a    n    d  (sp)   s    e    v    e    n
ASCII code: 70  111  117  114   32   97  110  100   32  115  101  118  101  110
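
A short C sketch (added for illustration) that dumps a string the way it sits in memory, one ASCII code per byte:

```c
#include <stdio.h>

int main(void) {
    const char *text = "Four and seven";
    /* Walk the string byte by byte and print each character's ASCII code. */
    for (const char *p = text; *p != '\0'; p++)
        printf("%4d", *p);
    printf("\n");  /* prints:  70 111 117 114  32  97 110 100  32 115 101 118 101 110 */
    return 0;
}
```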

Externally, it appears that human beings use natural-language symbols to communicate with the computer. Internally, however, the computer converts everything into binary data and processes all information in that binary world. Finally, the computer converts the binary information back into human-understandable language.

When you type the letter A, the hardware logic built into the keyboard automatically translates that character into the ASCII code 65, which is then sent to the computer. Similarly, when the computer sends the ASCII code 65 to the screen, the letter A appears.

ASCII
ASCII stands for American Standard Code for Information Interchange. Work on the standard began on October 6, 1960, and the first edition was published in 1963. ASCII text is ultimately just binary data.

ASCII, part 2
ASCII is a character encoding scheme that encodes 128 different characters as 7-bit integers. Computers can only process numbers, so ASCII provides a numerical representation of characters, including special characters such as '%', '!', and '?'.

ASCII, part 3
The ASCII code assigns a number to each English character; each is assigned a number from 0 to 127. For example, an uppercase 'M' has the ASCII code 77. ASCII remained the most commonly used character encoding on the Web until 2007.

(This is a funny picture) 01010100 01101000 01101001 01110011 00100000 01101001 01110011 00100000 01100001 00100000 01100110 01110101 01101110 01101110 01111001 00100000 01110000 01101001 01100011 01110100 01110101 01110010 01100101 (Decoded as 8-bit ASCII, the bits above read "This is a funny picture".)
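
For illustration (not in the original deck), a small C program that decodes a space-separated string of 8-bit groups like the one above back into ASCII text:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *bits = "01010100 01101000 01101001 01110011"; /* spells "This" */
    size_t len = strlen(bits);
    unsigned char byte = 0;
    int count = 0;
    /* Accumulate bits; every 8 bits, emit one ASCII character. */
    for (size_t i = 0; i < len; i++) {
        char c = bits[i];
        if (c == '0' || c == '1') {
            byte = (unsigned char)((byte << 1) | (c - '0'));
            if (++count == 8) {
                putchar(byte);   /* prints the decoded character */
                byte = 0;
                count = 0;
            }
        }
    }
    putchar('\n');               /* output: This */
    return 0;
}
```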

Large files
Large files can contain several megabytes. One megabyte is 1,000,000 bytes (in the decimal convention; the binary unit of 2^20 = 1,048,576 bytes is properly called a mebibyte). Some applications on a computer may even take up several thousand megabytes of data.

Revisiting the "char" data type
In C, single characters are represented using the data type char, which is one of the most important scalar data types.

    char achar;
    achar = 'A';   /* assign the character literal 'A' */
    achar = 65;    /* identical effect: 65 is the ASCII code of 'A' */

Character and integer
A character and a small integer (spanning only 8 bits) are actually indistinguishable on their own. If you use a value as a char, it behaves as a char; if you use it as an integer, it behaves as an integer -- as long as you use the proper C statements to express your intentions.
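
A quick C sketch (added for illustration) of the same stored value being used both ways:

```c
#include <stdio.h>

int main(void) {
    char c = 65;             /* stored as the ASCII code of 'A' */
    printf("%c\n", c);       /* used as a character: prints A   */
    printf("%d\n", c);       /* used as an integer:  prints 65  */
    printf("%c\n", c + 1);   /* arithmetic on the code: prints B */
    return 0;
}
```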

General Understanding of the Unicode System http://www.youtube.com/watch?v=ot3VKnP4Mz0

What is Unicode? A worldwide character-encoding standard. Its main objective is to enable a single, unique character set capable of supporting all characters from all scripts, as well as the symbols commonly used in computer processing around the globe. Fun fact: Unicode has room for 1,114,112 code points (17 planes x 65,536 code points per plane)!

Before Unicode Began…
During the 1960s, each letter or character was represented by a number drawn from one of many different encoding schemes built around the ASCII code. Such schemes included code pages that held as many as 256 characters, with each character requiring eight bits (one byte) of storage. This made them insufficient for managing character sets consisting of thousands of characters, such as Chinese and Japanese. In short, character encoding was very limited in how much it could contain, and it did not allow character sets from different languages to be used together.

The ASCII Code
Acronym for the American Standard Code for Information Interchange. A computer processing code that represents English characters as numbers, with each letter assigned a number from 0 to 127. For instance, the ASCII code for uppercase M is 77. The standard ASCII character set uses just 7 bits per character. Some larger ASCII-based character sets incorporate 8 bits, which allows 128 additional characters used to represent non-English characters, graphics symbols, and mathematical symbols.

ASCII vs Unicode
[Slide figures: a comparison of what ASCII and Unicode are able to encode; how different characters are organized into a unique character set; how Unicode encodes characters from virtually every kind of language; and how Unicode-aware software can manipulate the style and size of each character.]

Various Unicode Encodings

Name                    UTF-8    UTF-16   UTF-16BE     UTF-16LE       UTF-32   UTF-32BE     UTF-32LE
Smallest code point     0000     0000     0000         0000           0000     0000         0000
Largest code point      10FFFF   10FFFF   10FFFF       10FFFF         10FFFF   10FFFF       10FFFF
Code unit size          8 bits   16 bits  16 bits      16 bits        32 bits  32 bits      32 bits
Byte order              N/A      <BOM>    big-endian   little-endian  <BOM>    big-endian   little-endian
Fewest bytes/character  1        2        2            2              4        4            4
Most bytes/character    4        4        4            4              4        4            4

http://www.unicode.org/faq/utf_bom.html

Unicode’s Growth Over Time This graph shows the number of defined code points in Unicode from its first release in 1991 to the present http://emergent.unpythonic.net/01360162755

ASCII vs Unicode
ASCII:
- Has 128 code points, 0 through 127
- Encodes each character in 7 bits
- Can only encode characters from the English language
Unicode:
- Has 1,114,112 code positions
- Can encode characters in 16 bits and more
- Can encode characters from virtually all languages
- Is a superset of ASCII
Both:
- Both are character codes
- The first 128 code positions of Unicode mean the same as in ASCII

Method of Encoding: Unicode Transformation Format (UTF)
An algorithmic mapping from virtually every Unicode code point to a unique byte sequence. Each UTF is reversible, so every UTF supports lossless round-tripping: mapping any Unicode coded character sequence S to a sequence of bytes and back produces S again. Most text in documents and webpages is encoded using one of the various UTF encodings. The conversions between all UTF encodings are algorithmic, fast, and lossless, which makes it easy to support data input or output in multiple formats while using a particular UTF for internal storage or processing.

Unicode Transformation Format Encodings
UTF-7
- Uses 7 bits for each character; it was designed to represent ASCII characters in email messages that required Unicode encoding
- Not used very often
UTF-8
- The most popular type of Unicode encoding
- Uses one byte for standard English letters and symbols, two bytes for additional Latin and Middle Eastern characters, and three bytes for Asian characters; any additional characters are represented using four bytes
- Backwards compatible with ASCII, since the first 128 characters map to the same values
A minimal sketch of this byte layout in C follows.
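
As an illustrative sketch (not from the original slides), a small C function that produces the UTF-8 byte layout just described; the function and variable names are invented for this example, and for brevity it does not reject the surrogate range D800-DFFF:

```c
#include <stdio.h>

/* Encode one Unicode code point (0..0x10FFFF) as UTF-8.
   Writes 1-4 bytes into out and returns the byte count (0 on invalid input). */
int utf8_encode(unsigned int cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                 /* 1 byte: 0xxxxxxx (ASCII-compatible) */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {         /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {        /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {      /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void) {
    unsigned char buf[4];
    int n = utf8_encode(0x20AC, buf);  /* U+20AC, the euro sign */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);       /* prints: E2 82 AC */
    printf("\n");
    return 0;
}
```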

UTF Encodings (Cont…)
UTF-16
- An extension of the "UCS-2" Unicode encoding; uses at least two bytes per character, enough to represent about 65,536 characters directly (see the surrogate-pair sketch below)
- Used by platforms such as Java and Qualcomm BREW
UTF-32
- A multi-byte encoding that represents each character with 4 bytes, which makes it space-inefficient
- Its main use is in internal APIs, where the data is single code points or glyphs rather than strings of characters
- Sometimes used on Unix systems for storage of information
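
A hedged sketch in C (added here, not in the deck) of how UTF-16 reaches code points beyond 65,536 by splitting them across two "surrogate" code units; the names are illustrative:

```c
#include <stdio.h>

/* Encode one Unicode code point as UTF-16 code units.
   Writes 1 or 2 units into out and returns the count. */
int utf16_encode(unsigned int cp, unsigned short out[2]) {
    if (cp < 0x10000) {                /* Basic Multilingual Plane: one unit */
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000;                     /* 20 bits left, split across a pair */
    out[0] = (unsigned short)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}

int main(void) {
    unsigned short units[2];
    int n = utf16_encode(0x1F600, units);  /* U+1F600, an emoji outside the BMP */
    for (int i = 0; i < n; i++)
        printf("%04X ", units[i]);         /* prints: D83D DE00 */
    printf("\n");
    return 0;
}
```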

What can Unicode be Used For?
- Encoding text for the creation of passwords
- Encoding characters used in email
- Modifying characters used in documents
- Encoding the characters displayed in webpages

Why is Unicode Important?
- By providing a unique code for each character, this systemized standard creates a simple yet efficient and fast way of handling computer-processing tasks
- Makes it possible for a single software product or a single website to be designed for multiple countries, platforms, and languages
- Can reduce costs compared with using legacy character sets -- no need for re-engineering!
- Unicode data can be used across a wide range of systems without the risk of data corruption
- Unicode serves as a common point in conversions between other character encoding schemes: it is a superset of all of the other common character encoding schemes, so it is possible to convert from one encoding scheme to Unicode, and then from Unicode to the other encoding scheme

Unicode in the Future…
- Unicode may become capable of encoding characters from every language across the globe
- Could become the most dominant and resourceful tool for encoding every kind of character and symbol
- Integrates all kinds of character encoding schemes into its operations

Summary
Unicode's ability to create a standard in which virtually every character can be represented has revolutionized the way computer processing is handled today. It has emerged as an effective tool for processing characters within computers, replacing older character encodings such as ASCII. Unicode's capacity has grown substantially since its development, and it continues to expand its ability to encode all kinds of characters and symbols from every language across the globe. It will remain a necessary component of the technological advances we will inevitably continue to produce, potentially creating new ways of encoding characters.

Pop Quiz!
1. What is the main purpose of the Unicode system?
- To enable a single, unique character set that is capable of supporting all characters from all scripts and symbols
2. How many code points is Unicode capable of encoding?
- 1,114,112 code points

References
Bahjat, Beshar, and Igor Cavalleri. "Unicode 101: An Introduction to the Unicode Standard." 2014. Web. 17 Sept. 2014. <http://www.interproinc.com/articles/unicode-101-introduction-unicode-standard>.
Constable, Peter. "Understanding Unicode." 13 June 2001. Web. 17 Sept. 2014. <http://scripts.sil.org/cms/scripts/page.php?item_id=IWS-Chapter04a>.
"UTF." TechTerms. 20 Apr. 2012. Web. 13 Nov. 2014. <http://www.techterms.com/definition/utf>.
"UTF-8, UTF-16, UTF-32 & BOM." Unicode.org FAQ. n.d. Web. 13 Nov. 2014. <http://www.unicode.org/faq/utf_bom.html>.