Programming for Linguists

An Introduction to Python 22/12/2011

Feedback

Ex. 1) Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of “men”, “women”, and “people” in each document. What has happened to the usage of these words over time?

    import nltk
    from nltk.corpus import state_union

    # One condition per file, one sample per word in that file
    cfd = nltk.ConditionalFreqDist(
        (fileid, word)
        for fileid in state_union.fileids()
        for word in state_union.words(fileids=fileid))
    fileids = state_union.fileids()
    search_words = ["men", "women", "people"]
    cfd.tabulate(conditions=fileids, samples=search_words)
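To see what the conditional frequency distribution is doing, here is a minimal sketch in plain Python without NLTK. The two file ids and their tokens are made-up illustration data, and `count_per_document` is a hypothetical helper name, not part of NLTK.

```python
from collections import Counter

# Hypothetical mini-corpus: file id -> list of tokens (illustration data only).
docs = {
    "1945-Example.txt": ["the", "men", "and", "women", "and", "men"],
    "2005-Example.txt": ["the", "people", "and", "the", "people"],
}
search_words = ["men", "women", "people"]


def count_per_document(docs, targets):
    """Return {fileid: {target: count}}, one row of counts per document."""
    table = {}
    for fileid, tokens in docs.items():
        counts = Counter(tokens)  # word -> frequency within this document
        table[fileid] = {w: counts[w] for w in targets}
    return table


table = count_per_document(docs, search_words)
# table["1945-Example.txt"] is {"men": 2, "women": 1, "people": 0}
```

Each file id plays the role of a condition and each search word the role of a sample, which is the same layout that `tabulate` prints.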

Ex 2) According to Strunk and White's Elements of Style, the word “however”, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage:

    However you advise him, he will probably do as he thinks best.

Use the concordance tool to study actual usage of this word in 5 NLTK texts.

    import nltk
    from nltk.book import *

    texts = [text1, text2, text3, text4, text5]
    for text in texts:
        # concordance() prints its results itself, so no print statement is needed
        text.concordance("however")
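The concordance shows every occurrence, so the sentence-initial cases still have to be picked out by eye. As a rough sketch of how that check could be automated in plain Python: `sentence_initial` is a hypothetical helper, the token list is invented, and treating ".", "!" and "?" as the only sentence enders is a simplifying assumption.

```python
def sentence_initial(tokens, target="however"):
    """Return (index, next three tokens) for each sentence-initial hit of target."""
    enders = {".", "!", "?"}  # assumption: only these tokens end a sentence
    hits = []
    for i, tok in enumerate(tokens):
        starts_sentence = i == 0 or tokens[i - 1] in enders
        if starts_sentence and tok.lower() == target:
            hits.append((i, tokens[i + 1:i + 4]))
    return hits


# Invented example tokens, including the Strunk and White pattern.
tokens = ["However", "you", "advise", "him", ",", "he", "will", "do", "it", ".",
          "He", "left", ".", "However", ",", "she", "stayed", "."]
hits = sentence_initial(tokens)
# hits == [(0, ["you", "advise", "him"]), (13, [",", "she", "stayed"])]
```

A comma directly after the word, as in the second hit, is a hint that it is being used in the "nevertheless" sense.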

Ex 3) Create a corpus of your own of at least 10 files containing text fragments. You can use your own texts, texts from the internet, and so on. Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool, and plot the 10 most frequent words.

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    corpus_root = "/Users/claudia/my_corpus"
    #corpus_root = "C:\Users\..."
    my_corpus = PlaintextCorpusReader(corpus_root, '.*')
    words = my_corpus.words()
    cfd = nltk.ConditionalFreqDist(
        (fileid, word)
        for fileid in my_corpus.fileids()
        for word in my_corpus.words(fileid))

    import re

    fileids = my_corpus.fileids()
    modals = ['can', 'could', 'may', 'might', 'must', 'will']
    cfd.tabulate(conditions=fileids, samples=modals)

    # Keep only tokens that do not start with a non-alphanumeric character.
    # A new list is built because removing items from a list while looping
    # over it skips elements.
    word_tokens = [t for t in words if not re.match(r'[^a-zA-Z0-9]+', t)]
    fd = nltk.FreqDist(word_tokens)
    # Plot the 10 most frequent words
    fd.plot(10)
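One pitfall when filtering tokens is worth seeing in isolation: removing items from a list while looping over that same list silently skips elements. The short sketch below uses an invented token list to compare that approach with building a new list.

```python
import re

tokens = ["can", ",", ".", "will", "!", "?", "must"]  # invented example tokens

# Pitfall: each removal shifts the remaining items one position to the left,
# so the loop never looks at the item that followed the removed one.
buggy = list(tokens)
for t in buggy:
    if re.match(r"[^a-zA-Z0-9]+", t):
        buggy.remove(t)
# buggy == ["can", ".", "will", "?", "must"]  (two punctuation tokens survive)

# Safer: build a new list and leave the original untouched.
clean = [t for t in tokens if not re.match(r"[^a-zA-Z0-9]+", t)]
# clean == ["can", "will", "must"]
```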

Ex 1) Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending with ‘ing’ and sort it by its values (in decreasing order).

Ex 2) Write the raw text from the previous exercise to an output file.

    import nltk
    import re
    from urllib import urlopen

    url = "website"
    htmltext = urlopen(url).read()
    rawtext = nltk.clean_html(htmltext)
    rawtext2 = rawtext.lower()
    tokens = nltk.wordpunct_tokenize(rawtext2)
    my_text = nltk.Text(tokens)
    # Keep only the tokens that end in "ing"
    wordlist_ing = [w for w in tokens if re.search('^.*ing$', w)]

    # Count how often each -ing word occurs
    freq_dict = {}
    for word in wordlist_ing:
        if word not in freq_dict:
            freq_dict[word] = 1
        else:
            freq_dict[word] = freq_dict[word] + 1

    # Sort the (word, count) pairs by count, highest first
    from operator import itemgetter
    sorted_wordlist_ing = sorted(freq_dict.iteritems(), key=itemgetter(1), reverse=True)
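The same counting and sorting can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative to the hand-written dictionary, not the approach used in the exercise solution. The sentence is invented, and ties are broken alphabetically here, which is an added assumption.

```python
import re
from collections import Counter

# Invented example text standing in for the tokenized web page.
tokens = "she is singing and dancing while singing along".split()

# Keep tokens ending in "ing", then count them.
ing_words = [w for w in tokens if re.search(r"ing$", w)]
freq = Counter(ing_words)

# Sort by count (descending), then alphabetically so ties have a stable order.
ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
# ranked == [("singing", 2), ("dancing", 1)]
```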

Ex 2)

    output_file = open("dir/output.txt", "w")
    output_file.write(str(rawtext2) + "\n")
    output_file.close()
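A variant worth knowing is the `with` statement, which closes the file automatically even if writing fails. In this sketch the text and the temporary output location are placeholders, not the values from the exercise.

```python
import os
import tempfile

raw_text = "some lowercased page text"  # placeholder standing in for rawtext2
# Hypothetical output location inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The file is closed as soon as the indented block ends.
with open(out_path, "w") as output_file:
    output_file.write(raw_text + "\n")

# Read the file back to confirm what was written.
with open(out_path) as check:
    written = check.read()
```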

Ex 3) Write a script that performs the same classification task as we saw today, using word bigrams as features instead of single words.
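As a starting point for this exercise, a feature extractor based on word bigrams could look roughly like the sketch below. `bigram_features` and the feature-name format are hypothetical choices, and the classification task from the lecture is not reproduced here. The only assumption about NLTK is that its classifiers are trained on (feature dictionary, label) pairs.

```python
def bigram_features(tokens):
    """Map a token list to a feature dict with one boolean feature per word bigram."""
    lowered = [t.lower() for t in tokens]
    # zip pairs each word with its successor: (w0, w1), (w1, w2), ...
    return {"bigram(%s %s)" % pair: True for pair in zip(lowered, lowered[1:])}


features = bigram_features(["A", "truly", "great", "film"])
# features == {"bigram(a truly)": True,
#              "bigram(truly great)": True,
#              "bigram(great film)": True}
```

A dictionary like this would take the place of the single-word feature dictionary when the training pairs are built.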

Some Mentioned Issues

Loading your own corpus in NLTK with no subcategories:

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    # Use one of the two locations; the raw string keeps the backslashes literal
    loc = "/Users/claudia/my_corpus"  # Mac
    #loc = r"C:\Users\claudia\my_corpus"  # Windows 7
    my_corpus = PlaintextCorpusReader(loc, ".*")

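Loading your own corpus in NLTK with subcategories:

    import nltk
    from nltk.corpus import CategorizedPlaintextCorpusReader

    # Use one of the two locations; the raw string keeps the backslashes literal
    loc = "/Users/claudia/my_corpus"  # Mac
    #loc = r"C:\Users\claudia\my_corpus"  # Windows 7
    # Read every .txt file except those under .svn; the category is the folder name
    my_corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(cat1|cat2)/.*')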
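The two regular expressions do different jobs: the first selects which files belong to the corpus, and the second extracts the category from each file id. The sketch below applies them directly with the `re` module to a handful of invented file ids. It illustrates the patterns only and assumes, rather than quotes, how the corpus reader applies them internally.

```python
import re

file_pattern = r"(?!\.svn).*\.txt"  # any .txt file whose path does not start with .svn
cat_pattern = r"(cat1|cat2)/.*"     # the first group captures the category folder

# Invented file ids, relative to the corpus root.
fileids = ["cat1/a.txt", "cat2/b.txt", ".svn/entries.txt", "cat1/notes.doc"]

# Assumption: the whole file id has to match, hence the added "$".
selected = [f for f in fileids if re.match(file_pattern + "$", f)]

categories = {}
for f in selected:
    m = re.match(cat_pattern, f)
    if m:
        categories[f] = m.group(1)
# selected == ["cat1/a.txt", "cat2/b.txt"]
# categories == {"cat1/a.txt": "cat1", "cat2/b.txt": "cat2"}
```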

Dispersion Plot

A dispersion plot shows the location of a word in the text: how many words from the beginning it appears. Each stripe represents an instance of a word, and each row represents the entire text.
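The data behind such a plot is simply a list of word offsets. A minimal sketch in plain Python follows; `word_offsets` is a hypothetical helper, the sentence is invented, and ignoring case is a choice made here for illustration.

```python
def word_offsets(tokens, target):
    """Return the positions (counted in words from the start) where target occurs."""
    target = target.lower()
    return [i for i, tok in enumerate(tokens) if tok.lower() == target]


# Invented example text.
tokens = "Freedom is never free and freedom must be defended".split()
offsets = word_offsets(tokens, "freedom")
# offsets == [0, 5]
```

Each offset corresponds to one stripe in the row for that word.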

Exercises

Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the text, and converts the words to lowercase. You can get a list of all punctuation marks with:

    import string
    print string.punctuation

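    import nltk, string

    def strip(filepath):
        f = open(filepath, 'r')
        text = f.read()
        f.close()
        tokens = nltk.wordpunct_tokenize(text)
        # Collect the results in a new list: removing items from a list while
        # looping over it skips elements, and the lowercased token must be stored.
        words = []
        for token in tokens:
            token = token.lower()
            if token not in string.punctuation:
                words.append(token)
        return words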
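The same exercise can also be approached without NLTK, working line by line with string methods only. This sketch takes a list of lines instead of a file path so that it runs as-is. `clean_words` is a hypothetical helper name and the input lines are invented.

```python
import string


def clean_words(lines):
    """Split lines into words, strip surrounding punctuation, and lowercase."""
    words = []
    for line in lines:
        for word in line.split():  # split() also discards surrounding whitespace
            word = word.strip(string.punctuation).lower()
            if word:  # skip tokens that consisted of punctuation only
                words.append(word)
    return words


words = clean_words(["Hello, World!  ", "It's -- a test."])
# words == ["hello", "world", "it's", "a", "test"]
```

Note that `strip` only removes punctuation at the edges of a word, so the apostrophe inside "it's" is kept.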

If you want to analyse a text but filter out a stop list first (e.g. one containing “the”, “and”, …), you need to make two dictionaries: one with all words from your text and one with all words from the stop list. Then you need to subtract the second from the first. Write a function subtract(d1, d2) which takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. You can set the values to None.

    def subtract(d1, d2):
        # Copy every key of d1 that does not occur in d2
        d3 = {}
        for key in d1.keys():
            if key not in d2:
                d3[key] = None
        return d3

Let’s try it out:

    import nltk
    from nltk.book import *
    from nltk.corpus import stopwords

    # Dictionary with all words from the text
    d1 = {}
    for word in text7:
        d1[word] = None

    # Dictionary with all words from the stop list
    wordlist = stopwords.words("english")
    d2 = {}
    for word in wordlist:
        d2[word] = None

    rest_dict = subtract(d1, d2)
    wordlist_min_stopwords = rest_dict.keys()
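Since only the keys matter here, the subtraction can also be expressed as a set difference. The sketch below is an alternative formulation with a tiny invented word list instead of text7 and the NLTK stop list.

```python
def subtract(d1, d2):
    """Return a dict with the keys of d1 that are not in d2; all values are None."""
    return dict.fromkeys(set(d1) - set(d2))


# Invented example data standing in for the text words and the stop list.
d1 = dict.fromkeys(["the", "cat", "and", "dog"])
d2 = dict.fromkeys(["the", "and"])

rest = subtract(d1, d2)
# rest == {"cat": None, "dog": None}
```

Because a set has no fixed order, the keys of the result should be sorted if their order matters.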

Questions?

Evaluation Assignment

Deadline: 23/01/2012
Conversation in the week of 23/01/2012
If you need any explanation about the content of the assignment, feel free to e-mail me.

Further Reading

This was only a short introduction to programming in Python. If you want to expand your programming skills further, see chapters 15–18 about object-oriented programming.

Think Python: How to Think Like a Computer Scientist
NLTK book
Official Python documentation
There is a newer version of Python available, but it is not (yet) compatible with NLTK.

Our research group: CLiPS (Computational Linguistics and Psycholinguistics Research Center)
Our projects:

Happy holidays and good luck with your exams!

