UTF-8, Perl and You By Rafael Almeria. Chapter 1: Introduction.


Chapter 1: Introduction

1 - Introduction This talk does not deal with the motivation for using utf-8.

1 - Introduction This talk is about: implementation details, understanding UTF-8, converting your data, and knowing how to fix common problems.

1 - Introduction Some assumptions: Language: Perl. Operating system: Unix. Input encoded as: ASCII, ISO-8859-1/Latin-1 or Windows-1252. Output encoded as: UTF-8.

1 - Introduction What we’ll cover in this talk: A Primer on Character Encoding, A Simplifying Principle, UTF-8, Perl & UTF-8, Making the Browser Happy, and Encoding Hell.

Chapter 2: A Very Brief Primer on Character Encoding.

2 - A Very Brief Primer on Character Encoding. What is a character encoding?

2 - A Very Brief Primer on Character Encoding. It’s a specific way to represent the characters in a given character set.

2 - A Very Brief Primer on Character Encoding. A character set may have a numerical ordering on it for use with a given character encoding.

2 - A Very Brief Primer on Character Encoding. The number given to a specific character in an ordered character set is its code point.

2 - A Very Brief Primer on Character Encoding. Do not confuse the character’s code point with its representation!

2 - A Very Brief Primer on Character Encoding. It may be the same for ASCII, ISO-8859-1 and Windows-1252 and…

2 - A Very Brief Primer on Character Encoding. it may be the same for 1-byte UTF-8 but…

2 - A Very Brief Primer on Character Encoding. it’s definitely not true for multi-byte UTF-8.

2 - A Very Brief Primer on Character Encoding. It’s a common problem. So don’t confuse them!
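To make the distinction concrete, here is a minimal sketch that prints a character's code point next to the bytes that represent it in UTF-8. The helper name describe_char is ours, not something from the talk; it only relies on the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the code point (as U+XXXX) and the
# UTF-8 representation (as space-separated hex bytes) of one character.
sub describe_char {
    my ($char) = @_;
    my $code_point = sprintf 'U+%04X', ord($char);
    my $utf8_bytes = encode('UTF-8', $char);
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes;
    return ($code_point, $hex);
}

# "A" is a 1-byte case, n-tilde (code point 0xF1) is a multi-byte case.
for my $char ('A', "\x{f1}") {
    my ($code_point, $hex) = describe_char($char);
    print "$code_point is stored as $hex\n";
}
```

For "A" the two values coincide (0x41), while for ñ the code point is 0xF1 but the stored bytes are C3 B1.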

Chapter 3: A Simplifying Principle

3 - A Simplifying Principle If all of our data is encoded using only the following encodings (code point ranges are in parentheses): ASCII (0x00 - 0x7F), ISO-8859-1/Latin-1 (0x00 - 0xFF), Windows-1252 (0x00 - 0xFF)

3 - A Simplifying Principle and if we only care about printable content, then ASCII ⊂ ISO-8859-1 ⊂ Windows-1252

3 - A Simplifying Principle We can treat everything as Windows-1252!

3 - A Simplifying Principle This should be ok if we are sure that the documents are from one of these three kinds of encodings but we’re not sure how each document is encoded.
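One way to convince yourself of this principle is to decode the same byte under all three encodings and compare the resulting code points. The sketch below does that with the core Encode module; the helper name code_points_for and the three sample bytes are our own choices for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a single byte under each of the three
# encodings and return the resulting code points as U+XXXX strings.
sub code_points_for {
    my ($byte) = @_;
    my %result;
    for my $encoding ('ascii', 'iso-8859-1', 'cp1252') {
        my $char = decode($encoding, $byte);
        $result{$encoding} = sprintf 'U+%04X', ord($char);
    }
    return \%result;
}

# 0x41 is plain ASCII, 0xF1 is n-tilde, 0x93 is a curly quote in Windows-1252.
for my $byte ("A", "\xF1", "\x93") {
    my $cps = code_points_for($byte);
    printf "byte %02X: ascii=%s latin1=%s cp1252=%s\n",
        ord($byte), @{$cps}{'ascii', 'iso-8859-1', 'cp1252'};
}
```

The byte 0x93 is where the encodings part ways: ISO-8859-1 maps it to a non-printable control code, Windows-1252 to a printable left double quotation mark. That is why treating everything as Windows-1252 is the safe choice when only printable content matters.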

Chapter 4: UTF-8. A Brave New World

4 - UTF-8. A Brave New World It supports every language you’ll probably ever need.

4 - UTF-8. A Brave New World No need for Windows-1252 this and Windows-1253 that.

4 - UTF-8. A Brave New World Its code point range is from 0x00 to 0x10FFFF

4 - UTF-8. A Brave New World It uses a variable (1 to 4) byte encoding.

4 - UTF-8. A Brave New World 1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.

4 - UTF-8. A Brave New World 1-byte UTF-8 = ASCII. The MSBit is 0. Code point = representation.

4 - UTF-8. A Brave New World Examples of 1-byte UTF-8: “A” -> 0x41, “&” -> 0x26, “5” -> 0x35

4 - UTF-8. A Brave New World 2-byte UTF-8 is used for code points in the range 0x0080 to 0x07FF.

4 - UTF-8. A Brave New World 2-byte UTF-8 code point != representation

4 - UTF-8. A Brave New World The code point is broken apart into two pieces.

4 - UTF-8. A Brave New World The five MSBits of the code point are assigned to the first byte and the six LSBits are assigned to the second byte.

4 - UTF-8. A Brave New World For the first byte of 2-byte UTF-8 The three MSBits are set to 110 The remaining bits are the five MSBits of the code point.

4 - UTF-8. A Brave New World For the second byte of 2-byte UTF-8 The two MSBits are set to 10 The remaining bits are the six LSBits of the code point.
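The 2-byte layout can be worked through by hand with a few bit operations. This is only a sketch of the bit layout described above, not how Perl encodes internally, and encode_2byte is a name we made up.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a code point in the range 0x80..0x7FF
# as two UTF-8 byte values, following the bit layout described above.
sub encode_2byte {
    my ($cp) = @_;
    die "code point out of 2-byte range\n" if $cp < 0x80 || $cp > 0x7FF;
    my $first  = 0b11000000 | ($cp >> 6);           # 110 + five MSBits
    my $second = 0b10000000 | ($cp & 0b00111111);   # 10  + six LSBits
    return ($first, $second);
}

# n-tilde has code point 0xF1; this prints "C3 B1".
printf "%02X %02X\n", encode_2byte(0xF1);
```

Written as 11 bits, 0xF1 is 00011 110001, so the first byte is 110 00011 (0xC3) and the second is 10 110001 (0xB1).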

4 - UTF-8. A Brave New World 3-byte UTF-8 is used for code points in the range 0x0800 to 0xFFFF.

4 - UTF-8. A Brave New World 3-byte UTF-8 code point != representation

4 - UTF-8. A Brave New World The code point is broken apart into three pieces.

4 - UTF-8. A Brave New World  The four MSBits of the code point are assigned to the first byte.  The middle six bits are assigned to the second byte.  The six LSBits are assigned to the third byte.

4 - UTF-8. A Brave New World For the first byte of 3-byte UTF-8 The four MSBits are set to 1110 The remaining bits are the four MSBits of the code point.

4 - UTF-8. A Brave New World For the second byte of 3-byte UTF-8 The two MSBits are set to 10 The remaining bits are the six middle bits of the code point.

4 - UTF-8. A Brave New World For the third byte of 3-byte UTF-8 The two MSBits are set to 10 The remaining bits are the six LSBits of the code point.

4 - UTF-8. A Brave New World 4-byte UTF-8 is used for code points in the range 0x10000 to 0x10FFFF.

4 - UTF-8. A Brave New World 4-byte UTF-8 code point != representation

4 - UTF-8. A Brave New World The code point is broken apart into four pieces.

4 - UTF-8. A Brave New World The three MSBits of the code point are assigned to the first byte. The next six bits are assigned to the second byte. The six bits after that are assigned to the third byte. The six LSBits are assigned to the fourth byte.

4 - UTF-8. A Brave New World For the first byte of 4-byte UTF-8 The five MSBits are set to 11110 The remaining bits are the three MSBits of the code point.

4 - UTF-8. A Brave New World For the second byte of 4-byte UTF-8 The two MSBits are set to 10 The remaining bits are the next six middle bits of the code point.

4 - UTF-8. A Brave New World For the third byte of 4-byte UTF-8 The two MSBits are set to 10 The remaining bits are the next six middle bits of the code point.

4 - UTF-8. A Brave New World For the fourth byte of 4-byte UTF-8 The two MSBits are set to 10 The remaining bits are the six LSBits of the code point.
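All four layouts follow the same pattern: peel six bits at a time off the low end for the continuation bytes, then put whatever is left into the lead byte. The sketch below generalises that and cross-checks the result against the core Encode module. The helper name utf8_bytes is hypothetical, and the sketch does not reject the surrogate range, which a strict encoder would.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one code point (0x00..0x10FFFF) as a list
# of UTF-8 byte values, choosing the length from the ranges given above.
sub utf8_bytes {
    my ($cp) = @_;
    die "code point out of range\n" if $cp < 0 || $cp > 0x10FFFF;
    return ($cp) if $cp <= 0x7F;                  # 1 byte: same as ASCII

    my $length = $cp <= 0x7FF ? 2 : $cp <= 0xFFFF ? 3 : 4;
    my @lead_prefix = (undef, undef, 0b11000000, 0b11100000, 0b11110000);

    my @bytes;
    for (2 .. $length) {                          # continuation bytes, last one first
        unshift @bytes, 0b10000000 | ($cp & 0b00111111);
        $cp >>= 6;
    }
    unshift @bytes, $lead_prefix[$length] | $cp;  # remaining MSBits go in the lead byte
    return @bytes;
}

# Cross-check against Perl's own encoder for one code point of each length.
for my $cp (0x41, 0xF1, 0x20AC, 0x1F600) {
    my $mine  = join ' ', map { sprintf '%02X', $_ } utf8_bytes($cp);
    my $perls = join ' ', map { sprintf '%02X', ord($_) } split //, encode('UTF-8', chr($cp));
    printf "U+%04X  %-12s %s\n", $cp, $mine, ($mine eq $perls ? 'matches Encode' : 'MISMATCH');
}
```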

Chapter 5: Perl & UTF-8

5 - Perl & UTF-8 If you want to create UTF-8 strings in your Perl code then all you have to do is use the following notation: \x{codepoint}

5 - Perl & UTF-8 For example, to create the string “niño”: my $str = "ni\x{f1}o";

5 - Perl & UTF-8 To write this string to STDOUT you might do this: binmode STDOUT, ":utf8"; print $str;

5 - Perl & UTF-8 To undo it, do this: binmode STDOUT; print $str;

5 - Perl & UTF-8 Or to write UTF-8 data to disk, you could do this: open(OFILE, ">:utf8", $filename); print OFILE $str;

5 - Perl & UTF-8 To read UTF-8 data from disk, you could do this: open(IFILE, "<:utf8", $filename); my $str = <IFILE>;
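Writing and reading can be combined into a small round trip to check that nothing gets lost on the way. This sketch departs from the slides in two ways that are our own choices: it uses lexical filehandles, and it uses the stricter ":encoding(UTF-8)" layer instead of ":utf8". The temporary file comes from the core File::Temp module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $str = "ni\x{f1}o";

# Get a temporary file name to work with (removed automatically at exit).
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write the string through a UTF-8 output layer...
open(my $out, '>:encoding(UTF-8)', $filename) or die "Cannot write $filename: $!";
print {$out} $str;
close $out or die "Cannot close $filename: $!";

# ...and read it back through a UTF-8 input layer.
open(my $in, '<:encoding(UTF-8)', $filename) or die "Cannot read $filename: $!";
my $read_back = <$in>;
close $in;

# On disk the n-tilde takes two bytes, so the file holds 5 bytes,
# but the string read back has 4 characters again.
printf "bytes on disk: %d, characters read: %d\n", (-s $filename), length($read_back);
```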

5 - Perl & UTF-8 To convert Windows-1252 to UTF-8, you could do something like this: use Text::Iconv; use Encode; my $utf8_str = Text::Iconv->new("WINDOWS-1252", "UTF-8")->convert($str); Encode::_utf8_on($utf8_str);
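If Text::Iconv is not installed, the same conversion can probably be done with the core Encode module alone. This is an alternative sketch, not the approach shown on the slide: decode returns a Perl character string directly, so there is no flag to switch on by hand. The helper name cp1252_to_chars is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn Windows-1252 bytes into a Perl character string.
sub cp1252_to_chars {
    my ($bytes) = @_;
    return decode('cp1252', $bytes);
}

# 0xF1 is n-tilde and 0x80 is the euro sign in Windows-1252.
my $chars = cp1252_to_chars("ni\xF1o cuesta 5 \x80");

# Either let an output layer do the encoding...
binmode STDOUT, ':encoding(UTF-8)';
print "$chars\n";

# ...or produce the UTF-8 bytes explicitly, e.g. to store them somewhere.
my $utf8_bytes = encode('UTF-8', $chars);
```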

Chapter 6: Making the Browser Happy

6 - Making the Browser Happy All the efforts up to now will be for naught if the browser doesn’t understand how the page is encoded.

6 - Making the Browser Happy To make the browser aware of the nature of the data either add…

6 - Making the Browser Happy Content-type: text/html; charset=utf-8

6 - Making the Browser Happy or if you want to tag each document…

6 - Making the Browser Happy for XML add this declaration at the top of the document: <?xml version="1.0" encoding="UTF-8"?>

6 - Making the Browser Happy for HTML add this declaration at the top of the <head> section of the document: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

6 - Making the Browser Happy for XHTML add this declaration at the top of the <head> section of the document: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
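Here is a minimal sketch of a CGI-style response that declares the encoding both in the HTTP header and inside the HTML itself. The helper name utf8_response, the page title and the markup are made up for illustration, and the sketch does no HTML escaping of its arguments.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build a complete response (header, blank line,
# HTML body) as UTF-8 bytes, declaring the encoding in both places.
sub utf8_response {
    my ($title, $body_text) = @_;
    my $html = <<"HTML";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>$title</title>
</head>
<body><p>$body_text</p></body>
</html>
HTML
    return "Content-type: text/html; charset=utf-8\r\n\r\n" . encode('UTF-8', $html);
}

binmode STDOUT;    # the response is already bytes, so no encoding layer here
print utf8_response('Saludo', "ni\x{f1}o");
```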

Chapter 7: Encoding Hell

7 - Encoding Hell So now we think we understand UTF-8…

7 - Encoding Hell …and we think we understand how to process this data in Perl but…

7 - Encoding Hell there is still SO MUCH OPPORTUNITY for things to go wrong!

7 - Encoding Hell The Byte Order Mark (0xFEFF code point) is one of them.

7 - Encoding Hell The intention is probably good but it can cause much grief.

7 - Encoding Hell The solution is to cut out the byte sequence EF BB BF from the beginning of the document.
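On raw bytes, cutting out the Byte Order Mark is a one-line substitution. The sketch below assumes the document has been read as bytes (no decoding layer); strip_utf8_bom is a name we made up.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a UTF-8 Byte Order Mark (EF BB BF) if, and
# only if, it sits at the very beginning of a byte string.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

my $document = "\xEF\xBB\xBF<html>...</html>";
print strip_utf8_bom($document), "\n";
```

Anchoring the pattern with \A matters: the same three bytes later in the document are left alone.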

7 - Encoding Hell Encoded Gibberish. (It takes several forms)

7 - Encoding Hell All Gibberish

7 - Encoding Hell If it’s all gibberish then maybe the data is ok but you’re looking at it using the wrong pair of glasses. Change the document encoding declaration. Or try changing your browser’s or application’s encoding setting.

7 - Encoding Hell Partially Gibberish (Two Cases)

7 - Encoding Hell First Case: What does it look like? Niño vs Ni?o Niño vs Ni o

7 - Encoding Hell You likely have the dreaded “mixed encoding” nightmare. Probably someone has poured ISO-8859-1 or Windows-1252 into a UTF-8 document or vice versa. You will need to figure out which bytes are which and clean the document up to make it pure UTF-8.
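One possible cleanup strategy, sketched below, is to walk the bytes, keep every run that already looks like UTF-8, and treat any other byte as Windows-1252. This is a heuristic of ours, not a method from the talk: the pattern is a loose UTF-8 matcher (it does not rule out overlong forms or surrogates), and a Windows-1252 sequence that happens to look like valid UTF-8 will be left untouched. The helper name fix_mixed is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Loose pattern for one UTF-8 encoded character (1 to 4 bytes).
my $utf8_char = qr/[\x00-\x7F]
                 | [\xC2-\xDF][\x80-\xBF]
                 | [\xE0-\xEF][\x80-\xBF]{2}
                 | [\xF0-\xF4][\x80-\xBF]{3}/x;

# Hypothetical helper: return pure UTF-8 bytes from a byte string that
# mixes UTF-8 with stray Windows-1252 bytes.
sub fix_mixed {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        if ($bytes =~ s/\A((?:$utf8_char)+)//) {
            $out .= $1;                               # already UTF-8: keep as is
        }
        else {
            my $byte = substr($bytes, 0, 1, '');      # take one stray byte
            $out .= encode('UTF-8', decode('cp1252', $byte));
        }
    }
    return $out;
}

# The first n-tilde is a lone Windows-1252 byte, the second is already UTF-8.
my $fixed = fix_mixed("ni\xF1o y ni\xC3\xB1o");
binmode STDOUT;
print "$fixed\n";
```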

7 - Encoding Hell Second Case: What does it look like? niÃ±o (viewed in UTF-8 mode) niÃƒÂ±o (viewed in Windows-1252 mode)

7 - Encoding Hell You likely have the double encoding problem. Sometimes some of the data gets encoded as UTF-8 twice! Again, you’ll need to look at the bytes and fix it.
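When the whole string has been encoded twice, the damage can often be reversed by decoding twice with a step back to bytes in between. The sketch below assumes the second encoding pass treated the UTF-8 bytes as ISO-8859-1, and that the entire string is affected; undo_double_utf8 is a made-up name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: undo one extra round of UTF-8 encoding.
sub undo_double_utf8 {
    my ($bytes) = @_;
    my $once = decode('UTF-8', $bytes);         # still wrong: e.g. U+00C3 U+00B1
    my $raw  = encode('ISO-8859-1', $once);     # back to the original UTF-8 bytes
    return decode('UTF-8', $raw);               # the intended characters
}

# "niño" encoded twice is 6E 69 C3 83 C2 B1 6F.
my $fixed = undo_double_utf8("ni\xC3\x83\xC2\xB1o");
binmode STDOUT, ':encoding(UTF-8)';
print "$fixed\n";
```

If only parts of the document are double-encoded, applying this blindly will damage the parts that were fine, so look at the bytes first.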

7 - Encoding Hell Now some odds and ends…

7 - Encoding Hell HTML::Entities::decode_entities doesn’t always do what you think. Sometimes it returns ISO-8859-1 instead of UTF-8. Caveat programmer!

7 - Encoding Hell Be careful if you’re using the encode or decode routines from Encode.pm: they may not set the string’s UTF-8 flag appropriately.

7 - Encoding Hell And as a checklist of sorts when you’re debugging…

7 - Encoding Hell When debugging, make sure that: the data has been encoded properly; the data has been flagged as UTF-8; it has been written out properly; the document has the appropriate encoding declaration; and your terminal or browser has been set to the correct encoding.
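For the first two items on the checklist, it helps to look at the string's flag and at its bytes side by side. The sketch below uses the built-in utf8::is_utf8 and utf8::encode functions; the helper name inspect and its output format are our own.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report whether a string carries the UTF-8 flag and
# show the bytes Perl holds for it, as space-separated hex.
sub inspect {
    my ($str) = @_;
    my $flag = utf8::is_utf8($str) ? 'on' : 'off';
    my $copy = $str;
    utf8::encode($copy) if utf8::is_utf8($copy);   # expose the UTF-8 bytes
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $copy;
    return "flag=$flag bytes=$hex";
}

print inspect("ni\x{f1}o"), "\n";                      # built from a code point below 0x100
print inspect(decode('UTF-8', "ni\xC3\xB1o")), "\n";   # decoded from UTF-8 bytes
```

Both strings hold the same four characters, yet the first is reported with the flag off and a single F1 byte, the second with the flag on and the bytes C3 B1. That difference is exactly what tends to bite when data from different sources gets concatenated or written out without an encoding layer.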

Conclusion

Navigating the transition from traditional encodings to UTF-8 is not easy, but with perseverance it is doable. We have covered the common encodings, how to process our data in this environment, and how to tackle the common issues that might arise.
