PACS – 11/16/13 1 Unicode With everything becoming globalized these days, we need more characters than English alone requires, so that a wider array of languages can be represented. We'll look at Unicode as a solution. Unicode contains a repertoire of more than 110,000 characters covering 100 scripts. Think not only about Chinese, Cyrillic, Hebrew, etc. but also Cherokee, Runic, Mandaic, Bamum, Tagalog, and so on.

PACS – 11/16/13 2 Unicode Because more than one byte is often needed for a Unicode character, special handling for Unicode text is required. When special techniques are not followed to handle Unicode, web pages present boxes, question marks or jumbles of random characters instead of what was intended.
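A minimal Python sketch of how those boxes, question marks and jumbles arise. The sample string is made up for illustration; the point is only that the same bytes read back differently depending on which encoding the reader assumes.

```python
# The same UTF-8 bytes, decoded three different ways.
text = "café – naïve"                 # sample text (an assumption for this sketch)
utf8_bytes = text.encode("utf-8")

# Wrong guess: reading UTF-8 bytes as Windows-1252 gives a jumble.
garbled = utf8_bytes.decode("cp1252")  # 'cafÃ© â€“ naÃ¯ve'

# Wrong guess: forcing the bytes into ASCII gives replacement marks.
replaced = utf8_bytes.decode("ascii", errors="replace")

# Right guess: decoding with the encoding that was actually used.
restored = utf8_bytes.decode("utf-8")

print(len(text), "characters stored in", len(utf8_bytes), "bytes")
```

The data is never damaged in this sketch; only the interpretation changes, which is why declaring the encoding consistently matters so much.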

PACS – 11/16/13 7 Unicode From the Unicode Consortium: “Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”

PACS – 11/16/13 8 Unicode “Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.”

PACS – 11/16/13 9 Unicode Some history: Early computers just processed numbers. When people decided that characters needed to be processed also, different manufacturers came up with their own solutions. There were different word lengths (12, 16, 18, 24, 32, 36) and different byte lengths (4, 6, 8). Most implementations were for upper case only. Punctuation support was sporadic. Six-bit bytes only supported 64 different characters – not enough for upper and lower case.

PACS – 11/16/13 10 Unicode The American Standard Code for Information Interchange (ASCII) is a character-encoding scheme originally based on the English alphabet. It encodes 128 specified characters (the digits 0-9, the letters a-z and A-Z, some basic punctuation symbols, some control codes that originated with Teletype machines, and a blank space) as 7-bit binary integers. Work on the standard started in 1960; it was first published in 1963, revised in 1967, and most recently updated in 1986.

PACS – 11/16/13 12 Unicode ASCII used 7 bits for one character. This allowed an eighth bit to be used as a parity bit on paper tape or magnetic tape. Most early network links were 7-bit, and the SMTP spec called for 7-bit characters. That’s why you see Base64 encoding of binary files in e-mails: it converts the file to an equivalent string of 7-bit characters so that it can be transferred successfully. As 8-bit bytes became standard, more uses were found for the top 128 characters in the 256-character space.
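The Base64 idea can be sketched in a few lines of Python using the standard `base64` module. The sample bytes are arbitrary and chosen only because some of them are above 0x7F.

```python
import base64

# Arbitrary binary data; several bytes do not fit in 7 bits.
binary_data = bytes([0x00, 0xFF, 0x89, 0x50, 0x4E, 0x47])

# Base64 maps every 3 input bytes to 4 characters from a 64-character
# alphabet, all of which are plain 7-bit ASCII.
encoded = base64.b64encode(binary_data)
seven_bit_safe = all(b < 0x80 for b in encoded)

# The receiver reverses the mapping and gets the original bytes back.
decoded = base64.b64decode(encoded)

print(encoded.decode("ascii"), seven_bit_safe, decoded == binary_data)
```

The cost is size: the encoded form is about one third larger than the original.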

PACS – 11/16/13 13 Unicode Many systems used the top 128 characters for block graphics for gaming or charting. DEC first devised a Multinational Character Set, which had the accented characters needed by a majority of the European languages plus a few more special symbols. Apple made its own set, Mac OS Roman, which included math symbols in addition to the diacritical marks. PostScript had its own set. Microsoft came up with Windows-1252, which included more special symbols in the 0x80-0x9F positions. ANSI codified the 256-character extension of ASCII.

PACS – 11/16/13 15 Unicode What a mess! Enter Unicode. While ASCII is limited to 128 characters, Unicode supports more characters by separating the concepts of unique identification (using natural numbers called code points) and encoding (to 8-, 16- or 32-bit binary formats, called UTF-8, UTF-16 and UTF-32). To allow backward compatibility, the 128 ASCII and 256 ANSI or ISO (Latin 1) characters are assigned Unicode/UCS code points that are the same as their codes in the earlier standards.
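A short Python sketch of that separation: one code point, three different byte encodings. The character é is just a convenient sample; the big-endian codec variants are used here so that no byte-order mark is added.

```python
ch = "\u00e9"                       # é, chosen as a sample character

# Identification: the code point is a plain number.
code_point = ord(ch)                # 0xE9, i.e. 233

# Encoding: the same code point serialized three ways.
utf8 = ch.encode("utf-8")           # b'\xc3\xa9'          (2 bytes)
utf16 = ch.encode("utf-16-be")      # b'\x00\xe9'          (2 bytes)
utf32 = ch.encode("utf-32-be")      # b'\x00\x00\x00\xe9'  (4 bytes)

# Backward compatibility: ASCII and Latin-1 codes equal the code points.
ascii_matches = ord("A") == "A".encode("ascii")[0]
latin1_matches = ord(ch) == ch.encode("latin-1")[0]

print(f"U+{code_point:04X}", utf8.hex(), utf16.hex(), utf32.hex())
```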

PACS – 11/16/13 16 Unicode The most common encoding is UTF-8, which represents every character in 1 to 4 bytes, carrying up to 21 bits of code point data.

PACS – 11/16/13 17 Unicode The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This covers almost all Latin-derived alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane (which contains virtually all common characters). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK (Chinese, Japanese, Korean) characters and various historic scripts and mathematical symbols.
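The byte counts above can be checked with a small Python sketch. The helper name `utf8_len` and the sample characters are assumptions made for illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# One sample from each range described above.
samples = {
    "A": 1,            # U+0041, US-ASCII
    "\u00e9": 2,       # U+00E9, Latin small e with acute
    "\u042f": 2,       # U+042F, Cyrillic capital Ya
    "\u05d0": 2,       # U+05D0, Hebrew Alef
    "\u20ac": 3,       # U+20AC, euro sign (rest of the BMP)
    "\u4e2d": 3,       # U+4E2D, a common CJK ideograph
    "\U0001f600": 4,   # U+1F600, outside the BMP
}

for ch, expected in samples.items():
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")

# The two-byte range runs from U+0080 to U+07FF: 1,920 code points.
two_byte_count = 0x0800 - 0x0080
```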

PACS – 11/16/13 18 Unicode A few examples:

PACS – 11/16/13 19 Unicode Some sequences of bytes are invalid: bytes that the Unicode standard lists as never valid in UTF-8; an unexpected continuation byte; a start byte not followed by enough continuation bytes; an overlong encoding, i.e. one padded with more leading zero bits than needed; and a 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF.
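Each kind of invalid sequence can be demonstrated with Python's strict UTF-8 decoder. The helper `is_valid_utf8` and the specific byte strings are illustrative choices, not taken from the slides.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# One example per category of invalid sequence.
invalid_examples = {
    "byte that is never valid in UTF-8": b"\xff",
    "unexpected continuation byte": b"\x80",
    "start byte without enough continuation bytes": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "value greater than U+10FFFF": b"\xf4\x90\x80\x80",
}

for reason, data in invalid_examples.items():
    print(f"{data.hex()}: valid={is_valid_utf8(data)} ({reason})")

# For contrast, the complete 3-byte encoding of the euro sign is valid.
euro_ok = is_valid_utf8(b"\xe2\x82\xac")
```

Rejecting these sequences outright, rather than guessing at what was meant, is the safer default when the input comes from outside.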

PACS – 11/16/13 20 Unicode In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented.

PACS – 11/16/13 21 Unicode A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
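As a sketch, Python's standard library can produce and read numeric character references. The `xmlcharrefreplace` error handler emits the decimal form, and `html.unescape` accepts both decimal and hexadecimal forms; the helper name `to_numeric_refs` is made up for this example.

```python
import html

def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#N; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

escaped = to_numeric_refs("caf\u00e9")        # 'caf&#233;'

# Decimal and hexadecimal references name the same code point.
from_decimal = html.unescape("&#233;")
from_hex = html.unescape("&#xE9;")

print(escaped, from_decimal == from_hex)
```

Because the escaped result contains only ASCII characters, it survives transport in any ASCII-compatible encoding.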

PACS – 11/16/13 22 Unicode In HTML, there is also a standard set of 252 named character entities for characters that are either not found in certain character encodings or are markup sensitive in some contexts (for example angle brackets and quotation marks). Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.

PACS – 11/16/13 23 Unicode Character entities can be included in an HTML document via the use of entity references, which take the form &EntityName;, where EntityName is the name of the entity. For example, &mdash;, equivalent to &#8212; or &#x2014;, represents U+2014: the em dash character "—" even if the character encoding used doesn't contain that character.
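The equivalence of the three spellings can be checked with Python's `html` module, shown here as a small sketch. `html.entities.name2codepoint` is the standard library's table of HTML 4 entity names.

```python
import html
import html.entities

# Named, decimal and hexadecimal references for the same character.
named = html.unescape("&mdash;")
decimal = html.unescape("&#8212;")
hexadecimal = html.unescape("&#x2014;")

# The entity name maps to the code point U+2014.
mdash_code_point = html.entities.name2codepoint["mdash"]

print(f"U+{mdash_code_point:04X}", named == decimal == hexadecimal)
```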

PACS – 11/16/13 25 Unicode Even with all the care taken to preserve the proper bytes for the code points during transmission, you will still need a font that includes the characters needed. Arial has 3,415 characters; Arial Unicode MS has 38,917. If your font can’t display a given character, typically a box or question mark is shown.

PACS – 11/16/13 26 Unicode Most non-ASCII characters result from: ‘Fancy’ punctuation characters from MS Office apps (e.g. check the single quotes around ‘Fancy’ here!); special characters for cents, degrees, math symbols, etc.; and international languages needing diacritical marks or non-Latin letters.

PACS – 11/16/13 27 Unicode How does Unicode affect PHP coding? The character encoding for the HTML file may be set wrong by the server or inside the file. The main logic problems stem from the fact that the length of a string in bytes will probably be greater than the number of characters that will display. Note that this will affect field sizes in MySQL. Field size might go up 4x to handle the same number of characters.
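The gap between byte length and character count is easy to see in a short sketch. It is written in Python rather than PHP, and the sample string and the helper `max_utf8_bytes` are assumptions made for illustration.

```python
text = "na\u00efve \u65e5\u672c\u8a9e"     # "naïve" plus three CJK characters

char_count = len(text)                      # characters a reader would count
byte_count = len(text.encode("utf-8"))      # bytes a byte-oriented API would count

def max_utf8_bytes(n_chars: int) -> int:
    """Worst-case storage for n characters: 4 bytes each in UTF-8."""
    return 4 * n_chars

print(char_count, "characters,", byte_count, "bytes,",
      "worst case", max_utf8_bytes(char_count), "bytes")
```

Here 9 characters occupy 16 bytes, and a field sized in bytes would need up to 36 bytes to be certain of holding any 9 characters.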

PACS – 11/16/13 28 Unicode Set the encoding in the HTML itself, but make sure the server is saying the same thing. Scripts can use the header function: header('Content-Type: text/html; charset=UTF-8');

PACS – 11/16/13 29 Unicode Do not EVER use functions that convert case (strtolower, strtoupper, ucfirst, ucwords) or claim to be case-insensitive (str_ireplace, stristr, strcasecmp). Think twice before using functions that count characters (strlen will return bytes, not characters; str_split and wordwrap may corrupt a string).
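The corruption risk comes from cutting a string at byte positions that fall inside a multi-byte character. The sketch below reproduces the effect in Python by slicing raw UTF-8 bytes, as a stand-in for what a byte-oriented split does; it is not PHP code and the sample word is arbitrary.

```python
text = "h\u00e9llo"                          # "héllo"; the é takes 2 bytes
data = text.encode("utf-8")                  # b'h\xc3\xa9llo'

# Byte-oriented split into 2-byte pieces cuts the é in half.
byte_chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

def decodes(chunk: bytes) -> bool:
    """True if the chunk is valid UTF-8 on its own."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

broken = [chunk for chunk in byte_chunks if not decodes(chunk)]

# Character-oriented split keeps every character whole.
char_chunks = [text[i:i + 2] for i in range(0, len(text), 2)]

print(len(broken), "of", len(byte_chunks), "byte chunks are corrupt")
```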

PACS – 11/16/13 30 Unicode Sorting becomes a challenge. Consider the different representations of vowels with diacriticals. Regular expressions will have trouble deciding which characters are letters, among other problems. Character case conversion becomes much harder to do, as lower/upper pairs appear throughout the code tables.
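A Python sketch of these problems using the standard `unicodedata` module. The helper `strip_marks` is a hypothetical, deliberately crude sort key; real collation is locale-specific and needs a proper collation library.

```python
import unicodedata

# Two representations of the same visible character "é".
composed = "\u00e9"            # one code point
decomposed = "e\u0301"         # 'e' followed by a combining acute accent
same_bytes = composed == decomposed                      # False
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Plain sorting compares code points, so accented words land after 'z'.
words = ["zebra", "\u00e9clair", "apple", "eclair"]
naive_order = sorted(words)

def strip_marks(s: str) -> str:
    """Crude key: decompose, then drop combining marks (illustrative only)."""
    decomposed_s = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed_s if not unicodedata.combining(c))

friendlier_order = sorted(words, key=strip_marks)

# Case pairs are not always one-to-one: German sharp s uppercases to "SS".
sharp_s_upper = "\u00df".upper()

print(same_bytes, same_after_nfc, sharp_s_upper)
```

Normalizing all text to one form (for example NFC) on input avoids the first problem; the sorting and case issues need language-aware handling.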

PACS – 11/16/13 31 Unicode Generating Unicode characters is not easy either. Any editing must be done with a Unicode-aware program. Most editors will mangle files by converting Unicode European characters into ANSI equivalents. MS Word is a capable Unicode editor. See Unicode.org for information about generating characters. CJK languages have special challenges.

PACS – 11/16/13 32 Unicode Links: htmlpurifier.org/docs/enduser-utf8.html phputf8.sourceforge.net unicode.org