Programming for Linguists

1 Programming for Linguists
An Introduction to Python 20/12/2012

2 Exercise 1
    import string
    # remove punctuation function
    def removePunct(sent):
        no_punct = sent.translate(None, string.punctuation)
        return no_punct
    # split sentence into words function
    def getWords(sent):
        words = sent.split()
        return words

3 Exercise 1 (continued)
    # use previous functions to get the average word length
    def avWordLength(sent):
        # call removePunct function
        no_punct = removePunct(sent)
        # use the result in the getWords function
        words = getWords(no_punct)
        # work with the result from getWords
        lengths = []
        for w in words:
            lengths.append(len(w))
        av_word_length = sum(lengths) / float(len(words))
        return av_word_length
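The two-argument call `sent.translate(None, string.punctuation)` on slide 2 is Python 2 only. A minimal sketch of the same average-word-length calculation in Python 3 follows; the snake_case function names and the sample sentence are illustrative and not taken from the slides.

```python
import string

def remove_punct(sent):
    # Build a translation table that deletes every ASCII punctuation character.
    return sent.translate(str.maketrans('', '', string.punctuation))

def av_word_length(sent):
    words = remove_punct(sent).split()
    if not words:
        # Avoid dividing by zero for an empty sentence.
        return 0.0
    return sum(len(w) for w in words) / len(words)

print(av_word_length("Hello, world! It is me."))
```

The guard for an empty word list is an addition: the version on the slide would raise a `ZeroDivisionError` for an empty string.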

4 Exercise 2
    import re
    doc = open('/Users/claudia/Desktop/my_text.txt', 'r')
    my_text = doc.read()
    doc.close()
    # find every word that contains a doubled vowel
    def findWords(text):
        pattern = r'\S*(?:aa|ee|oo|uu)\S*'
        words = re.findall(pattern, text)
        return words
    doubleVowels = findWords(my_text)
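One detail of `re.findall` matters for this exercise: when the pattern contains a capturing group, `findall` returns the text of the group rather than the whole match. The sketch below (Python 3 syntax, with a made-up Dutch sample sentence) contrasts a non-capturing group `(?:...)` with a capturing one.

```python
import re

sample = "de maan en de zee zijn mooi"  # illustrative sentence, not from the slides

# Non-capturing group: findall returns the whole matched word.
whole_words = re.findall(r'\S*(?:aa|ee|oo|uu)\S*', sample)

# Capturing group: findall returns only what the group matched.
groups_only = re.findall(r'\S*(aa|ee|oo|uu)\S*', sample)

print(whole_words)
print(groups_only)
```

Because `\S` matches any non-whitespace character, punctuation attached to a word is part of the match unless it is removed first.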

5 Exercise 3
    from collections import defaultdict
    def wordFeats(text):
        # start above any real word length so the first word always counts as shortest
        short_val = float('inf')
        long_val = 0
        short_word = 'none'
        long_word = 'none'
        hapaxes = []
        wordFreqs = defaultdict(int)
        no_punct = removePunct(text)
        words = getWords(no_punct)
        for w in words:
            if len(w) > long_val:
                long_word = w
                long_val = len(w)
            if len(w) < short_val:
                short_word = w
                short_val = len(w)

6 Exercise 3 (continued)
            wordFreqs[w] += 1
        # a hapax is a word that occurs exactly once
        for word in wordFreqs:
            if wordFreqs[word] == 1:
                hapaxes.append(word)
        print 'shortest', short_word
        print 'longest', long_word
        print 'hapaxes', hapaxes
    wordFeats(my_text)
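The same three features can be computed more compactly with `collections.Counter` and the built-in `min` and `max`. This is a sketch in Python 3 under the assumption that the input is an already tokenised, non-empty list of words; the function name and sample data are illustrative.

```python
from collections import Counter

def word_feats(words):
    freqs = Counter(words)
    # Hapaxes are the words that occur exactly once; sorted for a stable order.
    hapaxes = sorted(w for w, n in freqs.items() if n == 1)
    # On ties, min and max return the first candidate in the list.
    shortest = min(words, key=len)
    longest = max(words, key=len)
    return shortest, longest, hapaxes

words = "the cat saw the elephant and the cat".split()
print(word_feats(words))
```

Like the explicit loop on the slides, `min` and `max` keep the first word they see when several words share the same length.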

7 Exercise 4
    def findWords2(text):
        no_punct = removePunct(text)
        # the articles de, het, een (either capitalisation), as whole words
        pattern1 = r'\b(?:[Dd]e|[Hh]et|[Ee]en)\b'
        # one or more non-whitespace characters followed by dt
        pattern2 = r'\S+dt'
        # an uppercase letter followed by non-whitespace characters
        pattern3 = r'[A-Z]\S+'
        print re.findall(pattern1, no_punct)
        print re.findall(pattern2, no_punct)
        print re.findall(pattern3, no_punct)
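When searching for short words such as articles, a pattern without word boundaries also matches inside longer words. The sketch below (Python 3 syntax, with an invented sample sentence) shows the effect of adding `\b` around the article pattern.

```python
import re

sample = "De hond wordt moe en Het kind vindt een bal onder de tafel"  # illustrative

# Without boundaries, the "de" inside "onder" is matched as well.
loose = re.findall(r'(?:[Dd]e|[Hh]et|[Ee]en)', sample)

# With \b on both sides, only free-standing articles are matched.
strict = re.findall(r'\b(?:[Dd]e|[Hh]et|[Ee]en)\b', sample)

# Words ending in -dt, and capitalised words.
dt_words = re.findall(r'\S+dt\b', sample)
capitalised = re.findall(r'\b[A-Z]\S+', sample)

print(loose, strict, dt_words, capitalised)
```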

8 Previous lesson, Exercise 1
    from nltk import *
    from nltk.corpus import gutenberg
    def getHapaxes(text):
        new_words = [word.lower() for word in gutenberg.words(text)]
        fdist = FreqDist(new_words)
        return fdist.hapaxes()
    print getHapaxes('shakespeare-hamlet.txt')

9 Exercise 2
    import nltk
    from nltk.corpus import brown
    cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
    genres = ['news', 'humor', 'government', 'science_fiction']
    prons = ['I', 'you', 'he', 'she', 'we', 'they']
    cfd.tabulate(conditions=genres, samples=prons)
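To see what a conditional frequency distribution holds, it can help to build one by hand. The sketch below uses only the standard library (Python 3) and a tiny invented corpus; `conditional_freqs` and `table_rows` are hypothetical helper names, not NLTK functions.

```python
from collections import Counter, defaultdict

def conditional_freqs(pairs):
    """Count samples separately for each condition."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

def table_rows(cfd, conditions, samples):
    """One row of counts per condition, in the order the samples are given."""
    return [[cfd[c][s] for s in samples] for c in conditions]

# Invented two-genre corpus standing in for brown.words(categories=genre).
toy_corpus = {
    "news": "he said they would go".split(),
    "humor": "I said you and I would laugh".split(),
}
pairs = [(genre, word) for genre, words in toy_corpus.items() for word in words]
cfd = conditional_freqs(pairs)

prons = ["I", "you", "he", "they"]
for genre, row in zip(["news", "humor"], table_rows(cfd, ["news", "humor"], prons)):
    print(genre, row)
```

Each condition (here a genre) gets its own counter, which is the idea behind tabulating pronoun counts per genre.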

10 Exercise 3
    from nltk import FreqDist
    from nltk.corpus import nps_chat
    # collect words longer than 5 characters that occur more than 5 times
    def findWords3(corpus):
        ok = []
        words = corpus.words()
        fdist = FreqDist(words)
        for word in fdist:
            if len(word) > 5 and fdist[word] > 5:
                ok.append(word)
        return ok
    print findWords3(nps_chat)

11 Exercise 4
    from nltk.corpus import PlaintextCorpusReader
    loc = "/Users/claudia/my_corpus"
    my_corpus = PlaintextCorpusReader(loc, ".*")
    # average lexical diversity (unique words / total words) over all files
    def lexDiv(corpus):
        results = []
        for fileid in corpus.fileids():
            totalWords = len(corpus.words(fileid))
            uniqueWords = len(set(corpus.words(fileid)))
            results.append(uniqueWords / float(totalWords))
        return sum(results) / len(results)
    print lexDiv(my_corpus)
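The lexical diversity calculation can be tried without a corpus on disk. A minimal sketch in Python 3, assuming each document is already a list of tokens; the function names and the two toy documents are illustrative.

```python
def lexical_diversity(words):
    # Number of distinct words divided by the total number of words.
    return len(set(words)) / len(words)

def mean_lexical_diversity(docs):
    # Skip empty documents so that no division by zero can occur.
    scores = [lexical_diversity(d) for d in docs if d]
    return sum(scores) / len(scores)

docs = [
    ["a", "b", "a", "b"],   # 2 distinct words out of 4
    ["x", "y", "z", "x"],   # 3 distinct words out of 4
]
print(mean_lexical_diversity(docs))
```

Averaging per-file scores, as on the slide, gives every file equal weight regardless of its length; pooling all words before dividing would give a different figure.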

12 Exercise 5
    import nltk
    from nltk.corpus import CategorizedPlaintextCorpusReader
    loc = "/Users/claudia/my_corpus"
    # categories are taken from the subdirectory names 10s, 20s and 30s
    my_corpus = CategorizedPlaintextCorpusReader(
        loc, r'(?!\.svn).*\.txt', cat_pattern=r'(10s|20s|30s)/.*')
    cfd = nltk.ConditionalFreqDist(
        (category, word)
        for category in my_corpus.categories()
        for word in my_corpus.words(categories=category))
    subcats = my_corpus.categories()
    chat = ['lol', 'omg', 'brb']
    cfd.tabulate(conditions=subcats, samples=chat)

13 Dispersion Plot
A dispersion plot shows the location of a word in the text: how many words from the beginning it appears. Each stripe represents an instance of a word, and each row represents the entire text.
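The data behind a dispersion plot is simply the list of token positions at which each target word occurs. The sketch below (Python 3, standard library only) computes those offsets and draws a crude text version of the rows; `word_offsets` and `text_row` are hypothetical helpers. NLTK's own `Text.dispersion_plot` draws the graphical version and needs a plotting library installed.

```python
def word_offsets(tokens, targets):
    """Map each target word to the positions at which it occurs."""
    offsets = {t: [] for t in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

def text_row(positions, n_tokens, width):
    """Draw one row: '|' marks an occurrence, '.' marks everything else."""
    row = ["."] * width
    for p in positions:
        # Scale the token position to a column of the row.
        row[p * width // n_tokens] = "|"
    return "".join(row)

tokens = "a b a c b a".split()  # toy text
offsets = word_offsets(tokens, ["a", "b"])
for word, positions in offsets.items():
    print(word, text_row(positions, len(tokens), width=len(tokens)))
```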

14 Remove stopwords
    import nltk
    from nltk import *
    from nltk.corpus import stopwords
    stopList = stopwords.words("english")
How do you remove these stopwords from, e.g., the words of the nps_chat corpus?

15 Remove stopwords (continued)
    from nltk.corpus import nps_chat
    words = nps_chat.words()
    filtered = [word for word in words if word not in stopList]
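Two refinements are worth considering when filtering stopwords: comparing in lowercase, so that a capitalised "The" is also removed (assuming the stop list itself is lowercase), and converting the list to a set so that each lookup is fast. A sketch in Python 3 with a small made-up stop list standing in for `stopwords.words("english")`:

```python
# Hypothetical miniature stop list; the real one comes from NLTK.
STOP_WORDS = ["the", "is", "a", "of", "and"]

def remove_stopwords(words, stop_words):
    stop_set = set(stop_words)  # set membership tests are fast
    # Compare in lowercase, but keep the original spelling in the result.
    return [w for w in words if w.lower() not in stop_set]

words = "The cat is a friend of the dog".split()

case_sensitive = [w for w in words if w not in STOP_WORDS]
case_insensitive = remove_stopwords(words, STOP_WORDS)

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before comparing depends on the task: for a chat corpus with erratic capitalisation it usually removes more of the intended words.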

16 Questions?

17 Further Reading
This was only a short introduction to programming in Python. If you want to expand your programming skills further, see: (official Python website) (questions forum)

18 Think Python. How to Think Like a Computer Scientist. http://www
NLTK book

19 If you are interested in our work in computational linguistics/doing your thesis:

20 Happy holidays and good luck with your exams
