UNICODE DAY 12 - 9/22/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 UNICODE DAY 12 - 9/22/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
22-Sept-2014 NLP, Prof. Howard, Tulane University

3 Review of Lists
The quiz was the review.

4 Open Spyder

5 6. Non-English characters: one code to rule them all

6 Did you know …
    >>> unsorted = 'a*@A6'
    >>> sorted(unsorted)
    ['*', '6', '@', 'A', 'a']

7 Introduction
So your program is humming along, and it hits the string 'cañón' and chokes. For instance, it may try to find out the length of cañón:
    >>> S = 'cañón'
    >>> len(S)
    >>> from re import findall
    >>> findall(r'\w{5}',S)
    >>> T = findall(r'.{5}',S)
    >>> T
    ['ca\xc3\xb1\xc3']
    >>> U = ''.join(T)
    >>> print U
    >>> findall(r'.{7}',S)
    ['ca\xc3\xb1\xc3\xb3n']
    >>> T = findall(r'.{7}',S)
    >>> U = ''.join(T)
    >>> print U
    cañón
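The session above is Python 2, where 'cañón' is a string of UTF-8 bytes, so a five-unit slice cuts the ó in half. A minimal sketch of the same experiment in Python 3 follows, assuming the word is stored with precomposed ñ and ó; the variable names are only illustrative. It keeps the characters (str) and the bytes (bytes) apart so the two lengths can be compared directly.

```python
from re import findall

# 'cañón' written with escapes so the example does not depend on the
# encoding of the file it is pasted into (assumes precomposed ñ and ó).
S = 'ca\u00f1\u00f3n'
B = S.encode('utf-8')             # the bytes that Python 2 was counting

n_chars = len(S)                  # number of characters
n_bytes = len(B)                  # number of UTF-8 bytes

# In Python 3, \w matches Unicode letters, so the whole word is found.
five_chars = findall(r'\w{5}', S)

# Matching five *bytes* reproduces the slide's truncated result.
five_bytes = findall(rb'.{5}', B)

# Decoding the truncated bytes shows why printing them looks broken:
# the dangling first byte of ó becomes the replacement character U+FFFD.
broken = five_bytes[0].decode('utf-8', errors='replace')

# ascii() keeps the output printable on any terminal.
print(n_chars, n_bytes)
print(ascii(five_chars))
print(five_bytes)
print(ascii(broken))
```

Each of ñ and ó takes two bytes in UTF-8, which is why the byte pattern has to ask for seven units, not five, to cover the whole word.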

8 6.1. English characters and ASCII
Computers were originally designed to use the English alphabet, and in particular, an encoding of it called the American Standard Code for Information Interchange, abbreviated ASCII and pronounced /ˈæski/ or “ass-kee”; see ASCII in Wikipedia.
ASCII is ultimately based on telegraph codes and represents the digits 0-9, the English letters a-z and A-Z, and the English punctuation symbols plus a blank space, along with control codes that originated with Teletype machines, some of which are now obsolete.

9 ASCII characters
    0 1 2 3 4 5 6 7 8 9 A B C D E F
0   – – – – – – – – – – – – – – – –
1   – – – – – – – – – – – – – – – –
2     ! " # $ % & ' ( ) * + , - . /
3   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4   @ A B C D E F G H I J K L M N O
5   P Q R S T U V W X Y Z [ \ ] ^ _
6   ` a b c d e f g h i j k l m n o
7   p q r s t u v w x y z { | } ~ –
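The grid on this slide can be regenerated from the code numbers themselves. The sketch below is written for Python 3; ascii_table is a hypothetical helper name, not something from the course materials. It treats the row as the high hexadecimal digit and the column as the low one, and prints a hyphen for the control codes.

```python
def ascii_table():
    """Return the 8 rows of the ASCII grid as strings.

    Row = high hex digit (0-7), column = low hex digit (0-F).
    Codes 0-31 and 127 are control codes and are shown as '-'.
    """
    rows = []
    for high in range(8):
        cells = []
        for low in range(16):
            code = high * 16 + low
            cells.append(chr(code) if 32 <= code < 127 else '-')
        # Prefix each row with its hex label, as on the slide.
        rows.append(format(high, 'X') + ' ' + ''.join(cells))
    return rows

for row in ascii_table():
    print(row)
```

Reading a character's code off the grid is then a matter of joining the row and column labels: 'A' sits in row 4, column 1, so its code is hexadecimal 41, or 65 in decimal.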

10 So now you know …
    >>> unsorted = 'a*@A6'
    >>> sorted(unsorted)
    ['*', '6', '@', 'A', 'a']
    >>> ord(' ')
    >>> ord('!')
    >>> ord('~')
    >>> chr(32)
    >>> chr(33)
    >>> chr(126)
    >>> chr(127)
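The sorting result makes sense once each character is replaced by its ASCII number. A short sketch in Python 3 (variable names are illustrative only) lines the sorted characters up with their codes and checks the edges of the printable range:

```python
unsorted = 'a*@A6'
ordered = sorted(unsorted)

# sorted() compares characters by their code numbers, so listing the
# codes shows why punctuation, digits, capitals and lowercase letters
# come out in that order.
codes = [ord(ch) for ch in ordered]

print(ordered)   # ['*', '6', '@', 'A', 'a']
print(codes)     # [42, 54, 64, 65, 97]

# The printable ASCII characters run from 32 (space) to 126 (~);
# 127 is the DEL control code, which has no visible glyph.
first_printable = chr(32)
last_printable = chr(126)
delete = chr(127)
print(repr(first_printable), repr(last_printable), repr(delete))
```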

11 6.2. Unicode and UTF-8
Background

12 6.2.1. Character encoding in Python
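The transcript keeps only the heading of this slide, so the following is not the lecture's own example. It is a minimal Python 3 sketch of the general idea, assuming the standard str.encode and bytes.decode methods: a character has one Unicode code point, but different encodings turn it into different bytes, and ASCII cannot represent it at all.

```python
enye = '\u00f1'                      # the letter ñ

code_point = ord(enye)               # its Unicode number
utf8 = enye.encode('utf-8')          # two bytes in UTF-8
latin1 = enye.encode('latin-1')      # one byte in Latin-1

# Decoding with the same encoding recovers the original character.
round_trip = utf8.decode('utf-8')

# ASCII stops at 127, so encoding ñ as ASCII raises an error.
try:
    enye.encode('ascii')
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(code_point, hex(code_point))
print(utf8, latin1)
print(round_trip == enye, fits_in_ascii)
```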

13 Next time
7. NLTK and Internet corpora, but I am going to fold this chapter into §1 & §2, so the chapter numbering will change.

