UNICODE DAY 12 - 9/22/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

Presentation transcript:

UNICODE DAY 12 - 9/22/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

Course organization
The syllabus is under construction.
22-Sept-2014, NLP, Prof. Howard, Tulane University

Review of Lists
The quiz was the review.

Open Spyder

6. Non-English characters: one code to rule them all

Did you know …
>>> unsorted =
>>> sorted(unsorted)
['*', '6', 'A', 'a']

Introduction
So your program is humming along, and it hits the string 'cañón' and chokes. For instance, it may try to find out the length of cañón:
>>> S = 'cañón'
>>> len(S)
>>> from re import findall
>>> findall(r'\w{5}',S)
>>> T = findall(r'.{5}',S)
>>> T
['ca\xc3\xb1\xc3']
>>> U = ''.join(T)
>>> print U
>>> findall(r'.{7}',S)
['ca\xc3\xb1\xc3\xb3n']
>>> T = findall(r'.{7}',S)
>>> U = ''.join(T)
>>> print U
cañón
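The `print U` statements in the session above indicate Python 2, where a plain string literal holds raw bytes (UTF-8 bytes, judging by the `\xc3\xb1` output). As a rough sketch of the same effect in Python 3, where text (`str`) and `bytes` are separate types, the following compares the two views of the string. The variable names are illustrative and do not come from the slides.

```python
import re

# 'cañón' written with escapes so the example does not depend on file encoding.
text = "ca\u00f1\u00f3n"        # str: a sequence of 5 code points
data = text.encode("utf-8")     # bytes: the UTF-8 encoding of the same text

n_chars = len(text)             # counts characters
n_bytes = len(data)             # counts bytes; ñ and ó take two bytes each

# On text, \w matches accented letters, so five word characters are found.
words = re.findall(r"\w{5}", text)

# On bytes, '.' matches one byte at a time, so a 5-byte match cuts ó in half.
five_bytes = re.findall(rb".{5}", data)
seven_bytes = re.findall(rb".{7}", data)

print(n_chars, n_bytes)         # 5 7
print(words == [text])          # True
print(five_bytes)               # [b'ca\xc3\xb1\xc3']
print(seven_bytes)              # [b'ca\xc3\xb1\xc3\xb3n']
```

Decoding `seven_bytes[0]` with `.decode("utf-8")` gives back the original five-character string, while decoding the truncated five-byte match raises a `UnicodeDecodeError`.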

6.1. English characters and ASCII
Computers were originally designed to use the English alphabet, and in particular, an encoding of it called the American Standard Code for Information Interchange, abbreviated ASCII and pronounced /ˈæski/ or “ass-kee”; see ASCII in Wikipedia.
ASCII is ultimately based on telegraph codes and represents the digits 0-9, the English letters a-z and A-Z, and the English punctuation symbols plus a blank space, along with control codes that originated with Teletype machines, some of which are now obsolete.

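ASCII characters
(row = high hex digit, column = low hex digit; – marks a non-printing control code; the first cell of row 2 is the space character)
   0 1 2 3 4 5 6 7 8 9 A B C D E F
0  – – – – – – – – – – – – – – – –
1  – – – – – – – – – – – – – – – –
2    ! " # $ % & ' ( ) * + , - . /
3  0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4  @ A B C D E F G H I J K L M N O
5  P Q R S T U V W X Y Z [ \ ] ^ _
6  ` a b c d e f g h i j k l m n o
7  p q r s t u v w x y z { | } ~ –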
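A table like the one above can be regenerated from the character codes themselves. The sketch below is one way to do it in Python 3; the function name `ascii_table` and the en dash placeholder for control codes are choices made here for illustration, not something taken from the slides.

```python
def ascii_table():
    """Return the 128 ASCII characters as 8 rows of 16 cells each."""
    rows = []
    for high in range(8):              # high hex digit: 0-7
        cells = []
        for low in range(16):          # low hex digit: 0-F
            code = high * 16 + low
            # Codes 0-31 and 127 are control codes with no printable glyph.
            if code < 32 or code == 127:
                cells.append("\u2013")  # en dash as a placeholder
            else:
                cells.append(chr(code))
        rows.append(cells)
    return rows


# Print a header of low hex digits, then one row per high hex digit.
print("  " + " ".join("0123456789ABCDEF"))
for high, cells in enumerate(ascii_table()):
    print(high, " ".join(cells))
```

Reading a cell as row digit followed by column digit gives the hexadecimal code, so `A` in row 4, column 1 is 0x41, which is 65 in decimal.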

So now you know …
>>> unsorted =
>>> sorted(unsorted)
['*', '6', 'A', 'a']
>>> ord(' ')
>>> ord('!')
>>> ord('~')
>>> chr(32)
>>> chr(33)
>>> chr(126)
>>> chr(127)
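The sort order follows from the character codes. In the sketch below the contents of `unsorted` are an assumption, since the original value did not survive in the transcript; any ordering of these four strings sorts the same way. It is written for Python 3, where `ord` and `chr` work on Unicode code points, which coincide with ASCII below 128.

```python
# Assumed input: the slide's original list is not preserved in the transcript.
unsorted = ["a", "A", "6", "*"]

# sorted() compares strings by code point, so the order is * < 6 < A < a.
result = sorted(unsorted)
codes = [ord(ch) for ch in result]

print(result)                          # ['*', '6', 'A', 'a']
print(codes)                           # [42, 54, 65, 97]

# ord() maps a character to its code; chr() is the inverse.
print(ord(" "), ord("!"), ord("~"))    # 32 33 126
print(repr(chr(32)), chr(33), chr(126))
print(repr(chr(127)))                  # '\x7f', the DEL control code
```

Codes 32 through 126 are the printable range, which is why `chr(127)` shows up as an escape sequence rather than a visible character.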

6.2. Unicode and UTF-8
Background

Character encoding in Python

Next time
7. NLTK and Internet corpora, but I am going to fold this chapter into §1 & §2, so the chapter numbering will change.