Computer Structure, Recitation 2: Character Representation in Hardware



A programmer who doesn't care about character encoding is not much better than a medical doctor who doesn't believe in germs. (Joel Spolsky)

Tamar Shrot, Noam Hazon

Introduction
Computers are considered "number crunchers", but humans work with characters. Character data isn't just alphabetic characters, but also numeric characters, punctuation, spaces, etc. Most keys on the central part of the keyboard (except modifiers such as Shift and Caps Lock) produce characters. Everything in a computer is represented by binary sequences, so we use standard encodings (agreed mappings from characters to binary sequences) to represent characters.

Introduction (2)
The two's complement method is used to represent signed integers because it has nice mathematical properties. Character data has no such properties, so assigning binary codes to characters is somewhat arbitrary. The most widely known character representation is ASCII, which stands for American Standard Code for Information Interchange. The ASCII code defines which character each binary sequence represents.

The ASCII code

The ASCII code (2)
There are two reasons to use ASCII:
- It is a way to represent characters.
- It is an accepted standard.
A different bit pattern is used for each character that needs to be represented.
A nice property: the lowercase letters, the uppercase letters, and the digits each occupy a contiguous range of codes. Applications:
- 'a' < 'b'; 'A' < 'B'; '0' < '1'.
- 'a' - 'A' = 'b' - 'B' = … = 'z' - 'Z' = 32.
- '1' - '0' = 1 - 0.
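The contiguity property can be checked directly. The following is a minimal Python sketch; the helper names `ascii_upper` and `digit_value` are made up for illustration, while `ord` and `chr` are Python built-ins that convert between a character and its code.

```python
def ascii_upper(ch: str) -> str:
    """Convert a lowercase ASCII letter to uppercase using the fixed offset of 32."""
    if 'a' <= ch <= 'z':
        return chr(ord(ch) - (ord('a') - ord('A')))
    return ch  # anything else is returned unchanged


def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit character, e.g. '7' -> 7."""
    if not '0' <= ch <= '9':
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) - ord('0')


print(ord('a') - ord('A'))  # 32
print(ascii_upper('q'))     # Q
print(digit_value('7'))     # 7
print(ord('0'))             # 48
```

Because each range is contiguous, case conversion and digit parsing need only one subtraction and no lookup table.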

The ASCII code (3)
Note:
- 'a' ≠ 'A'.
- 0 ≠ '0' ('0' = 48).
- Codes 0 to 31 are generally not printable (control characters that affect how text is processed, etc.). Code 32 is the space character.
- There are 128 (= 2^7) ASCII characters.
- The eighth bit was often used as a parity bit to detect transmission errors.
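To illustrate the parity bit, here is a small Python sketch. It assumes even parity (the slide does not say which convention is meant), and the function names `add_even_parity` and `parity_ok` are hypothetical.

```python
def add_even_parity(code: int) -> int:
    """Put a 7-bit ASCII code in a byte whose top bit makes the count of 1 bits even."""
    if not 0 <= code < 128:
        raise ValueError("ASCII codes are 7 bits wide (0..127)")
    ones = bin(code).count("1")
    return code | ((ones % 2) << 7)  # set bit 7 only when the count is odd


def parity_ok(byte: int) -> bool:
    """A received byte passes the check if it has an even number of 1 bits."""
    return bin(byte).count("1") % 2 == 0


sent = add_even_parity(ord("C"))  # 'C' = 0b1000011 has three 1 bits
corrupted = sent ^ 0b00000100     # flip one bit "in transit"
print(f"{sent:08b}", parity_ok(sent))            # 11000011 True
print(f"{corrupted:08b}", parity_ok(corrupted))  # 11000111 False
```

A single parity bit detects any odd number of flipped bits, but it cannot say which bit flipped and it misses an even number of flips.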

ASCII's disadvantage
The greatest disadvantage: it is biased toward the English-language character set. Missing:
- Mathematical symbols.
- Letters of other European languages (as well as Hebrew).
Solution: use the 8th bit as well (Extended ASCII). This raises the number of codes to 256, which is plenty for most alphabet-based languages.

Extended ASCII
Problems:
- Not enough for East Asian languages, whose writing systems have thousands of characters.
- Can't combine more than one language: the same code is é on a computer in France and ג on a computer in Israel, so text moved from one to the other, or vice versa, comes out wrong.
Code pages: different character encodings that are identical only in the first 128 codes (the ASCII part). This works reasonably well in small networks that use the same code page. Problem: the Internet!
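The é/ג clash can be reproduced with Python's built-in codecs. This sketch assumes the two code pages in question are the DOS code pages 437 (Western) and 862 (Hebrew), where the shared code is 130; the slide does not name them.

```python
code = bytes([130])  # one byte, value 0x82

print(code.decode("cp437"))  # é  (Western DOS code page)
print(code.decode("cp862"))  # ג  (Hebrew DOS code page)

# The two code pages agree on the ASCII half (codes 0..127).
ascii_part = bytes(range(128))
print(ascii_part.decode("cp437") == ascii_part.decode("cp862"))  # True
```

The byte itself never changes; only the table used to interpret it does, which is why the encoding has to travel with the data.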

Unicode
An effort to create a single character set that includes every reasonable writing system. In its original form, every character is stored in 2 bytes:
- For the ASCII characters, one byte holds the ASCII code and the other byte is empty (zero).
- Other characters use both bytes.
This encoding is UCS-2 (2-byte Universal Character Set, the fixed-width predecessor of UTF-16). Its disadvantages:
- Endianness.
- Doubles the file size.
- Doesn't support old files.
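A short Python sketch of the two-byte layout and the size doubling. Python has no separate UCS-2 codec, so this uses `utf-16-le`, which produces the same bytes as little-endian UCS-2 for the characters shown; that substitution is an assumption of the example.

```python
text = "Hello"
one_byte = text.encode("ascii")
two_byte = text.encode("utf-16-le")  # same bytes as UCS-2 for these characters

print(len(one_byte), len(two_byte))  # 5 10
print(two_byte)                      # b'H\x00e\x00l\x00l\x00o\x00'

# A non-ASCII character (Hebrew gimel, U+05D2) uses both bytes.
gimel = "\u05d2".encode("utf-16-le")
print(gimel.hex())                   # d205
```

Every second byte of the English text is zero, which is exactly the wasted half of the file.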

Endianness
Once characters are stored in more than one byte, the byte order (big endian or little endian) matters! It causes problems when transferring files between different computers.
Solution: the "Unicode Byte Order Mark" (BOM), 0xFEFF (in 16-bit Unicode).
- Always place the mark at the beginning of the character stream.
- When an input starts with 0xFFFE, the programmer knows the byte order is reversed and she must swap the two bytes of every 16-bit unit.
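The byte order mark check can be sketched in Python as follows. The function name `decode_utf16_with_bom` is hypothetical; `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are constants from the standard library.

```python
import codecs


def decode_utf16_with_bom(data: bytes) -> str:
    """Pick the byte order from the leading byte order mark, then decode."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    raise ValueError("no byte order mark; the order must be known some other way")


big = codecs.BOM_UTF16_BE + "Hi".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
print(big.hex())     # feff00480069
print(little.hex())  # fffe48006900
print(decode_utf16_with_bom(big), decode_utf16_with_bom(little))  # Hi Hi
```

In practice Python's plain `"utf-16"` codec performs the same detection on decode; the explicit version is shown only to make the mechanism visible.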

Unicode: cons
Yet:
- Not every Unicode string has a byte order mark at the beginning.
- Pure English files double in size for no reason.
- Old files must be converted.
Because of this, Unicode was largely ignored for several years (until 1992). Solution: UTF-8 (8-bit Unicode Transformation Format).

UTF-8
This is a variable-length character encoding. Every code point from 0 to 127 (ASCII's original codes) is stored in a single byte. Code points 128 and above are stored using 2 to 4 bytes, depending on the code point (the original design allowed up to 6 bytes). Outcomes:
- Pure English files are identical to ASCII files: no needlessly doubled files and no need to convert old files.
- The extra bytes enable a richer character set.
- Frequent characters use shorter encodings.
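A quick Python check of the variable length, using a handful of sample characters chosen for this example (they do not come from the slides):

```python
samples = {
    "A": "A",                # U+0041, ASCII
    "e-acute": "\u00e9",     # U+00E9
    "gimel": "\u05d2",       # U+05D2, Hebrew
    "euro": "\u20ac",        # U+20AC
    "emoji": "\U0001F600",   # U+1F600, outside the 16-bit range
}
for name, ch in samples.items():
    print(name, len(ch.encode("utf-8")))  # 1, 2, 2, 3, 4 bytes respectively

# Pure English text is byte-for-byte identical in ASCII and UTF-8.
english = "Plain old text"
print(english.encode("utf-8") == english.encode("ascii"))  # True
```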

UTF-8: how does it work?
If we have an ASCII character:
- It is placed in one byte, and the MSB is zero.
Otherwise, we need more than one byte:
- The first byte tells us how many bytes are used to encode the character: it starts (from the MSB) with a sequence of ones followed by a single zero, and the number of ones is the number of bytes used.
- Each additional byte starts with the bits 10.
- The rest of the bits are used to encode the character.
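The rules above can be written out as a hand-rolled encoder. This is a teaching sketch, not a replacement for Python's built-in `str.encode`: the function name `utf8_encode` is made up, and it skips validity checks such as rejecting surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point following the UTF-8 bit layout (no surrogate check)."""
    if code_point < 0x80:      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


def bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in data)


print(bits(utf8_encode(ord("A"))))  # 01000001
print(bits(utf8_encode(0x00E9)))    # 11000011 10101001
print(bits(utf8_encode(0x20AC)))    # 11100010 10000010 10101100
```

In the printed bit patterns, the count of leading ones in the first byte matches the total number of bytes, and every following byte starts with 10.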

Other encodings
There are hundreds of different encodings. UTF-7, UTF-8, UTF-16 and UTF-32 can all represent every Unicode code point, which makes them the most reliable choice when working with languages other than English. When passing a sequence of characters (strings, files, etc.), one must state which encoding is used. Otherwise:
- Gibberish.
- Question marks.
- Wrong representation of some characters.
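Two of these failure modes are easy to reproduce in Python. The sketch below uses Latin-1 as the "wrong" encoding, an arbitrary choice made only because it can decode any byte.

```python
original = "caf\u00e9"  # "café"

# Gibberish: UTF-8 bytes read as if they were Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Question marks: a character the target encoding cannot represent.
lossy = original.encode("ascii", errors="replace")
print(lossy)    # b'caf?'

# Stating the right encoding on both sides round-trips cleanly.
print(original.encode("utf-8").decode("utf-8") == original)  # True
```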

Standards
Header:
- Content-Type: text/plain; charset="UTF-8"
Web page tag:
- <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
…
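On the receiving side, the declared charset has to be read before the body is decoded. A minimal Python sketch using the standard library's `email.message.Message`; the helper name `charset_of` and the fallback to `"ascii"` are assumptions of this example.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "ascii") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset() or default


header = 'text/plain; charset="UTF-8"'
body = "\u05e9\u05dc\u05d5\u05dd".encode("utf-8")  # Hebrew "shalom" as raw bytes

charset = charset_of(header)
print(charset)  # utf-8
print(body.decode(charset) == "\u05e9\u05dc\u05d5\u05dd")  # True
```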

Libraries for managing encodings
There are many libraries that support different character encodings, e.g.:
- iconv (or a more stable implementation: libiconv); mostly Unix.
- The codecs module (Python).
- "International Components for Unicode" (ICU), with libraries for C/C++ and Java.
- UTF8-CPP (C++).
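As one example from this list, Python's `codecs` module can convert bytes from one encoding to another, similar in spirit to what iconv does on the command line. The function name `transcode` and the choice of code page 862 as the legacy encoding are assumptions made for illustration.

```python
import codecs


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character encoding to another."""
    text = codecs.decode(data, source)  # bytes -> str using the source encoding
    return codecs.encode(text, target)  # str -> bytes using the target encoding


word = "\u05e9\u05dc\u05d5\u05dd"            # Hebrew "shalom"
legacy = word.encode("cp862")                # one byte per letter in the old code page
modern = transcode(legacy, "cp862", "utf-8")

print(len(legacy), len(modern))              # 4 8
print(modern.decode("utf-8") == word)        # True
print(codecs.lookup("UTF8").name)            # utf-8
```

`codecs.lookup` is handy for normalizing the many spellings of an encoding name before comparing or storing it.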