UNICODE & CONTROL DAY 13 - 9/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.



Course organization
The syllabus is under construction.
24-Sept-2014, NLP, Prof. Howard, Tulane University

Review of Unicode

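ASCII characters
   0 1 2 3 4 5 6 7 8 9 A B C D E F
0  – – – – – – – – – – – – – – – –
1  – – – – – – – – – – – – – – – –
2    ! " # $ % & ' ( ) * + , - . /
3  0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4  @ A B C D E F G H I J K L M N O
5  P Q R S T U V W X Y Z [ \ ] ^ _
6  ` a b c d e f g h i j k l m n o
7  p q r s t u v w x y z { | } ~ –
(– marks a non-printing control character; row 2, column 0 is the space.)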
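The layout of the table can be checked in code: the row label is the first hexadecimal digit of a character's code and the column label is the second. The following is a minimal sketch, assuming Python 3 rather than the Python 2 used in the slides; the helper name `ascii_row` is hypothetical and does not come from the course.

```python
def ascii_row(high_digit):
    """Return the 16 characters whose hex code starts with high_digit."""
    return ''.join(chr(high_digit * 16 + low) for low in range(16))

row_4 = ascii_row(4)       # codes 0x40 through 0x4F
row_6 = ascii_row(6)       # codes 0x60 through 0x6F
code_of_A = hex(ord('A'))  # 'A' sits in row 4, column 1

print(row_4)
print(row_6)
print(code_of_A)
```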

Character encoding in Python

Open Spyder

6. Non-English characters: one code to rule them all

What happens when you type a non-ASCII character into a Python console?
>>> import sys
>>> sys.getdefaultencoding()
>>> special = 'ó'
>>> special
'\xc3\xb3'
>>> print special
ó
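The console session above is Python 2, where a plain string holds raw bytes. As a hedged comparison, here is a sketch of the same experiment under Python 3 (an assumption, not the version used in class), where a string holds Unicode characters and the two UTF-8 bytes only appear after an explicit `encode()`.

```python
special = 'ó'                       # one Unicode character
encoded = special.encode('utf-8')   # the two bytes the Python 2 console displayed

print(len(special))   # number of characters
print(encoded)        # the UTF-8 bytes
print(len(encoded))   # number of bytes
```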

How to translate into and out of Unicode with decode() and encode()
>>> S1 = 'ca\xc3\xb1\xc3\xb3n'
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> len(uS1)
5
>>> utf8S1 = uS1.encode('utf8')
>>> print utf8S1
cañón
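The round trip can also be sketched in Python 3 (assumed here; the slides use Python 2), where the encoded form is a `bytes` object and the decoded form is a `str`. The variable names are illustrative only. Comparing the two lengths shows why decoding matters for counting characters.

```python
raw = b'ca\xc3\xb1\xc3\xb3n'   # UTF-8 bytes, the same sequence as S1
text = raw.decode('utf-8')     # bytes -> str (Unicode characters)
back = text.encode('utf-8')    # str -> bytes again

print(len(raw), len(text))     # byte count versus character count
print(back == raw)             # the round trip loses nothing
```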

How to turn on non-ASCII character matching with re.UNICODE
>>> S1 = 'ca\xc3\xb1\xc3\xb3n' # same as before
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> import re
>>> lS1 = re.findall(r'\w{5}', uS1, re.U)
>>> lS1
[u'ca\xf1\xf3n']
>>> eS1 = ''.join(lS1)
>>> eS1
u'ca\xf1\xf3n'
>>> utf8S1 = eS1.encode('utf8')
>>> utf8S1
'ca\xc3\xb1\xc3\xb3n'
>>> print utf8S1
cañón
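For comparison, a sketch under Python 3 (an assumption; the slides use Python 2): string patterns there are Unicode-aware by default, so `re.UNICODE` is redundant, and the `re.ASCII` flag restores the narrower behavior in which `\w` matches only ASCII letters, digits, and the underscore.

```python
import re

word = 'ca\xf1\xf3n'   # the five characters of 'cañón'

# Default in Python 3: \w matches accented letters too.
unicode_matches = re.findall(r'\w{5}', word)

# With re.ASCII, ñ and ó are not word characters, so no run of five exists.
ascii_matches = re.findall(r'\w{5}', word, re.ASCII)

print(len(unicode_matches), len(ascii_matches))
```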

How to translate between Unicode strings and numbers with ord() and unichr()
>>> 'ó'
'\xc3\xb3'
>>> 'ó'.decode('utf8')
u'\xf3'
>>> ord(u'\xf3')
243
>>> unichr(243)
u'\xf3'
>>> test = unichr(243).encode('utf8')
>>> print test
ó
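In Python 3 (assumed for this sketch; the slides use Python 2) the built-in `chr()` takes over the role of `unichr()`, while `ord()` is unchanged. The variable names are illustrative only.

```python
code_point = ord('\xf3')            # character 'ó' -> its Unicode code point
char = chr(243)                     # code point -> character
utf8_bytes = char.encode('utf-8')   # character -> its UTF-8 bytes

print(code_point, hex(code_point), utf8_bytes)
```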

Chapter numbering
I am going to fold the Unicode chapter into §1 & §2 and move the next chapter (§8) up a notch, so the chapter numbering will change.

8. Control
Up to now, your short programs have depended entirely on you to make the decisions. This is fine for pieces of text that fit on a single line, but it is clearly insufficient for texts that run to hundreds of lines. You will want Python to make decisions for you. How to tell Python to do so is the topic of this chapter, and it falls under the rubric of control.

Conditions
The first step in making a decision is to distinguish the cases in which the decision applies from those in which it does not. In computer science, the test that draws this distinction is usually known as a condition.

How to check for the presence of an item with in
Perhaps the simplest condition in text processing is whether an item is present or not. Python handles this in a way that looks a lot like English:
>>> greeting = 'Yo!'
>>> 'Y' in greeting
>>> 'o' in greeting
>>> '!' in greeting
>>> 'o!' in greeting
>>> 'Yo!' in greeting
>>> 'Y!' in greeting
>>> 'n' in greeting
>>> '?' in greeting
>>> '' in greeting
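The answers to the string tests can be collected in one pass. This is a sketch assuming Python 3; the names `probes` and `results` are illustrative. Two results are worth noticing: `'Y!'` fails because `in` looks for a contiguous substring, and the empty string counts as present in every string.

```python
greeting = 'Yo!'
probes = ['Y', 'o', '!', 'o!', 'Yo!', 'Y!', 'n', '?', '']

# Map each probe to the result of the membership test.
results = {probe: probe in greeting for probe in probes}

for probe, found in results.items():
    print(repr(probe), found)
```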

in & lists
Lists behave much like strings, with the proviso that the string being asked about must match an item of the list exactly:
>>> fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']
>>> 'apple' in fruit
>>> 'peach' in fruit
>>> 'app' in fruit
>>> '' in fruit
>>> [] in fruit
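A sketch of the list tests, assuming Python 3; `checks` and `has_app_inside` are illustrative names. With a list, `in` compares the probe against whole items, so a fragment such as `'app'` is not found; the last line shows one way to search inside the items instead.

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

# Membership in a list compares against whole items only.
checks = [probe in fruit for probe in ['apple', 'peach', 'app', '', []]]
print(checks)

# To look for a substring inside any item, test each item separately.
has_app_inside = any('app' in item for item in fruit)
print(has_app_inside)
```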

Python can understand sequences of in conditions
>>> 'app' in 'apple' in fruit
# 'app' in 'apple' > True
# 'apple' in fruit > True
>>> 'aple' in 'apple' in fruit
>>> 'pea' in 'peach' in fruit
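A chain of `in` tests is evaluated as the separate tests joined by `and`, not from left to right as nested expressions. A minimal sketch, assuming Python 3, with illustrative variable names:

```python
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

chained = 'app' in 'apple' in fruit
spelled_out = ('app' in 'apple') and ('apple' in fruit)

# Adding parentheses changes the meaning: this asks whether True is in fruit.
grouped = ('app' in 'apple') in fruit

misspelled = 'aple' in 'apple' in fruit    # first test fails
missing_item = 'pea' in 'peach' in fruit   # second test fails

print(chained, spelled_out, grouped, misspelled, missing_item)
```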

How to check for the absence of an item with not in
>>> not 'n' in greeting
>>> 'n' not in greeting
>>> 'Y' not in greeting
>>> 'Y!' not in greeting
>>> 'Yo' not in greeting
>>> '' not in greeting
>>> 'apple' not in fruit
>>> 'peach' not in fruit
>>> 'app' not in fruit
>>> '' not in fruit
>>> 'pee' not in 'peach' not in fruit
>>> 'pea' not in 'peach' not in fruit
>>> 'pea' not in 'apple' not in fruit
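A sketch of a few of the absence tests, assuming Python 3 and using illustrative variable names. The two spellings `not x in y` and `x not in y` give the same answer, and a chain of `not in` tests is true only when every link is true.

```python
greeting = 'Yo!'
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

same_answer = (not 'n' in greeting) == ('n' not in greeting)
empty_absent_from_string = '' not in greeting   # '' is in every string
empty_absent_from_list = '' not in fruit        # but it is not an item of fruit

# Each chain is (left test) and (right test).
chain_1 = 'pee' not in 'peach' not in fruit   # both links hold
chain_2 = 'pea' not in 'peach' not in fruit   # 'pea' is in 'peach'
chain_3 = 'pea' not in 'apple' not in fruit   # 'apple' is in fruit

print(same_answer, empty_absent_from_string, empty_absent_from_list)
print(chain_1, chain_2, chain_3)
```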

Next time
More on control