UNICODE & CONTROL DAY 13 - 9/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 UNICODE & CONTROL DAY 13 - 9/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
- http://www.tulane.edu/~howard/LING3820/
- The syllabus is under construction.
- http://www.tulane.edu/~howard/CompCultEN/
24-Sept-2014, NLP, Prof. Howard, Tulane University

3 Review of Unicode

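4 ASCII characters
Row = first hexadecimal digit of the code, column = second digit; – marks a non-printing control character, and the first cell of row 2 is the space character.
  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 – – – – – – – – – – – – – – – –
1 – – – – – – – – – – – – – – – –
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ –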
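The grid above can be regenerated from the character codes themselves. The following is only a sketch, written for Python 3 (the slides use Python 2); the function name ascii_grid is made up for this illustration, and a plain hyphen stands in for the control characters.

```python
def ascii_grid():
    """Return 8 rows of 16 cells: row = high hex digit, column = low hex digit."""
    rows = []
    for high in range(8):
        cells = []
        for low in range(16):
            code = high * 16 + low
            # Codes 0-31 and 127 are control characters; show them as '-'.
            cells.append(chr(code) if 32 <= code < 127 else '-')
        rows.append(cells)
    return rows


grid = ascii_grid()
print('  ' + ' '.join('0123456789ABCDEF'))
for high, cells in enumerate(grid):
    print(str(high) + ' ' + ' '.join(cells))
```

Reading the grid, 'A' sits in row 4, column 1, which corresponds to hexadecimal 41 (decimal 65).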

5 6.2.1. Character encoding in Python

6 Open Spyder

7 6. Non-English characters: one code to rule them all

8 6.2.2. What happens when you type a non-ASCII character into a Python console?
>>> import sys
>>> sys.getdefaultencoding()
>>> special = 'ó'
>>> special
'\xc3\xb3'
>>> print special
ó
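The session above is Python 2, where a plain string literal holds bytes. As a hedged point of comparison, here is a sketch of the same experiment in Python 3, where str is already Unicode and the two bytes only appear after an explicit encode(). The variable name encoded is introduced for this illustration, and the character is written as an escape so the snippet does not depend on the file's encoding.

```python
import sys

# Python 3 assumption: str is Unicode text, bytes is a separate type.
special = '\u00f3'                  # the character o-acute
encoded = special.encode('utf-8')   # the byte sequence shown on the slide

print(sys.getdefaultencoding())     # 'utf-8' in Python 3
print(len(special))                 # 1 character
print(encoded)                      # b'\xc3\xb3', 2 bytes
```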

9 6.2.3. How to translate into and out of Unicode with decode() and encode()
>>> S1 = 'ca\xc3\xb1\xc3\xb3n'
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> len(uS1)
5
>>> utf8S1 = uS1.encode('utf8')
>>> print utf8S1
cañón
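The round trip above can be sketched in Python 3 as follows, assuming the input arrives as a bytes object (the b'' prefix is the Python 3 counterpart of the Python 2 byte string on the slide). The names raw, text and round_trip are illustrative only. The lengths show why decoding matters: the byte string has 7 units, the decoded text has 5 characters.

```python
raw = b'ca\xc3\xb1\xc3\xb3n'        # UTF-8 bytes, like S1 on the slide
text = raw.decode('utf-8')          # bytes -> Unicode text
round_trip = text.encode('utf-8')   # Unicode text -> bytes again

print(len(raw))           # 7 bytes: each accented letter takes two
print(len(text))          # 5 characters
print(round_trip == raw)  # True: encoding undoes decoding
```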

10 6.2.4.1. How to turn on non-ASCII character matching with re.UNICODE
>>> S1 = 'ca\xc3\xb1\xc3\xb3n' # same as before
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> import re
>>> lS1 = re.findall(r'\w{5}', uS1, re.U)
>>> lS1
[u'ca\xf1\xf3n']
>>> eS1 = ''.join(lS1)
>>> eS1
u'ca\xf1\xf3n'
>>> utf8S1 = eS1.encode('utf8')
>>> utf8S1
'ca\xc3\xb1\xc3\xb3n'
>>> print utf8S1
cañón
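To see what the re.UNICODE flag changes, it helps to compare it with ASCII-only matching. The sketch below assumes Python 3, where patterns applied to str are Unicode-aware by default and re.ASCII switches that off; this is the reverse of the Python 2 default shown on the slide. The variable names are illustrative.

```python
import re

word = b'ca\xc3\xb1\xc3\xb3n'.decode('utf-8')

# Unicode-aware matching: \w accepts the accented letters too.
unicode_hits = re.findall(r'\w{5}', word)

# ASCII-only matching: \w is limited to [a-zA-Z0-9_], so no run of
# five word characters exists in the string.
ascii_hits = re.findall(r'\w{5}', word, re.ASCII)

print(len(unicode_hits))  # 1
print(len(ascii_hits))    # 0
```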

11 6.2.5. How to translate between Unicode strings and numbers with ord() and unichr()
>>> 'ó'
'\xc3\xb3'
>>> 'ó'.decode('utf8')
u'\xf3'
>>> ord(u'\xf3')
243
>>> unichr(243)
u'\xf3'
>>> test = unichr(243).encode('utf8')
>>> print test
ó
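For readers on Python 3, a rough equivalent of the session above uses chr() in place of unichr(), which no longer exists there. This is a sketch under that assumption; the variable names are made up for the example.

```python
code_point = ord('\u00f3')    # character -> number
char = chr(243)               # number -> character (unichr() in Python 2)
utf8 = char.encode('utf-8')   # character -> UTF-8 bytes

print(code_point)             # 243
print(char == '\u00f3')       # True
print(utf8)                   # b'\xc3\xb3'
```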

12 Chapter numbering
I am going to fold the Unicode chapter into §1 & §2 and move the next chapter (§8) up a notch, so the chapter numbering will change.

13 8. Control
Up to now, your short programs have depended entirely on you to make decisions. This is fine for pieces of text that fit on a single line, but it is clearly insufficient for texts that run to hundreds of lines. You will want Python to make decisions for you. How to tell Python to do so is the topic of this chapter and falls under the rubric of control.

14 8.1. Conditions
The first step in making a decision is to distinguish those cases in which the decision applies from those in which it does not. In computer science, the test that draws this distinction is usually known as a condition.

15 8.1.1. How to check for the presence of an item with in
- Perhaps the simplest condition in text processing is whether an item is present or not. Python handles this in a way that looks a lot like English:
>>> greeting = 'Yo!'
>>> 'Y' in greeting
>>> 'o' in greeting
>>> '!' in greeting
>>> 'o!' in greeting
>>> 'Yo!' in greeting
>>> 'Y!' in greeting
>>> 'n' in greeting
>>> '?' in greeting
>>> '' in greeting
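The slide leaves the results for you to discover at the console. As a sketch of what to expect, the snippet below runs the same tests and collects the answers in a dictionary; the names probes and results are introduced here for illustration. For strings, in tests for a contiguous substring, which is why 'Y!' fails even though both characters occur, and why the empty string always succeeds.

```python
greeting = 'Yo!'
probes = ['Y', 'o', '!', 'o!', 'Yo!', 'Y!', 'n', '?', '']

# Map each probe to the truth value of "probe in greeting".
results = {probe: probe in greeting for probe in probes}

for probe, found in results.items():
    print(repr(probe), found)
```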

16 in & lists
- Lists behave like strings, with the proviso that the string being asked about must match an item of the list exactly:
>>> fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']
>>> 'apple' in fruit
>>> 'peach' in fruit
>>> 'app' in fruit
>>> '' in fruit
>>> [] in fruit
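A short sketch of the list case: membership compares the probe against whole items, so a substring of an item does not count. The last line goes one step beyond the slide and shows one possible way to ask the substring question of every item, using the built-in any(); treat it as an optional extra rather than part of the lesson.

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

whole_item = 'apple' in fruit      # True: exact match with an item
missing = 'peach' in fruit         # False: no such item
substring = 'app' in fruit         # False: 'app' is only part of an item
empty_string = '' in fruit         # False: no item is the empty string
empty_list = [] in fruit           # False: no item is an empty list

# Addition to the slide: test for a substring inside any item.
inside_some_item = any('app' in item for item in fruit)

print(whole_item, missing, substring, empty_string, empty_list)
print(inside_some_item)
```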

17 Python can understand sequences of in conditions
>>> 'app' in 'apple' in fruit
# 'app' in 'apple' -> True
# 'apple' in fruit -> True
>>> 'aple' in 'apple' in fruit
>>> 'pea' in 'peach' in fruit
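Python reads a sequence such as a in b in c as (a in b) and (b in c), the same chaining rule it applies to comparisons like 1 < 2 < 3. The sketch below spells this out; the variable names are illustrative, and fruit is redefined so the snippet runs on its own.

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

chained = 'app' in 'apple' in fruit
spelled_out = ('app' in 'apple') and ('apple' in fruit)

# The first link fails: 'aple' is not a substring of 'apple'.
first_link_fails = 'aple' in 'apple' in fruit

# The second link fails: 'peach' is not an item of fruit.
second_link_fails = 'pea' in 'peach' in fruit

print(chained, spelled_out)                 # True True
print(first_link_fails, second_link_fails)  # False False
```

Note that this is not the same as grouping from the left: ('app' in 'apple') in fruit would ask whether True is an item of the list.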

18 8.1.2. How to check for the absence of an item with not in
>>> not 'n' in greeting
>>> 'n' not in greeting
>>> 'Y' not in greeting
>>> 'Y!' not in greeting
>>> 'Yo' not in greeting
>>> '' not in greeting
>>> 'apple' not in fruit
>>> 'peach' not in fruit
>>> 'app' not in fruit
>>> '' not in fruit
>>> 'pee' not in 'peach' not in fruit
>>> 'pea' not in 'peach' not in fruit
>>> 'pea' not in 'apple' not in fruit
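A sketch of a few of these cases, with the results you should see at the console. It assumes the same greeting and fruit as the earlier slides and redefines them so that it runs on its own; the other names are made up for the example. The chained forms follow the same rule as chained in: both links must hold.

```python
greeting = 'Yo!'
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

# The two spellings are equivalent; 'n' not in greeting is the usual one.
prefix_form = not 'n' in greeting
infix_form = 'n' not in greeting

absent_pair = 'Y!' not in greeting      # True: 'Y!' is not contiguous
empty_in_string = '' not in greeting    # False: '' is in every string
empty_in_list = '' not in fruit         # True: no item is ''

# Chains: (x not in y) and (y not in z)
both_hold = 'pee' not in 'peach' not in fruit      # True and True
first_fails = 'pea' not in 'peach' not in fruit    # False and ...
second_fails = 'pea' not in 'apple' not in fruit   # True and False

print(prefix_form, infix_form, absent_pair)
print(empty_in_string, empty_in_list)
print(both_hold, first_fails, second_fails)
```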

19 Next time
More on control

