LING 388: Computers and Language

Lecture 16

Administrivia
Homework 7 graded
Today's Topics:
Review Homework 7
Install nltk and nltk_data for Python (Homework 8)

Homework 7 Review
What went wrong on the High Street in 2018?
?intlink_from_url= ext/long-reads&link_location=live-reporting-story
File: hw7.txt
Using regexes in Python:
1. Find the numbers in the article. List them. How many of them are there?
2. Find all the named entities (approximately everything beginning with an uppercase letter denoting people, places, organizations etc.), e.g. Toys R Us or New Look. List them. How many of them are there?
3. How could you filter out the words at the beginning of each sentence that aren't really named entities? Show your code. How many named entities now?

Homework 7 Review
Find the numbers in the article. List them. How many of them are there?
169-year-old
2, %
One in every five pounds
Sample code:
>>> import re
>>> f = open('hw7.txt')
>>> text = f.read()
>>> re.findall(r'\b[\d,\.]+\b', text)
['2018', '12', '2018', '16', '93,000', '900,000', '125', '2,200', '2018', '2018', '40', '5,500', '10', '20', '2018', '31', '25', '30', '165', '.', '169', '59', '23', '2018', '2018', '2,692', '4,042', '2019']
>>> len(re.findall(r'\b[\d,.]+\b', text))
28
>>> re.findall(r'\b\d[\d,\.]*\b', text)
['2018', '12', '2018', '16', '93,000', '900,000', '125', '2,200', '2018', '2018', '40', '5,500', '10', '20', '2018', '31', '25', '30', '165', '169', '59', '23', '2018', '2018', '2,692', '4,042', '2019']
>>> len(re.findall(r'\b\d[\d,\.]*\b', text))
27
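The difference between the two patterns above can be seen on a small string. This is a minimal sketch: the sample sentence is made up for illustration and is not taken from hw7.txt. The first pattern accepts a lone period sitting between two word characters, which may be where the stray '.' in the first list comes from; the second pattern requires the match to start with a digit.

```python
import re

# Hypothetical sample text (an assumption for illustration, not the homework article).
sample = "In 2018, about 93,000 jobs went. See bbc.com for the 169-year-old chain."

# Pattern 1: any run of digits, commas and periods between word boundaries.
# The period inside "bbc.com" has a word character on each side, so it matches.
loose = re.findall(r'\b[\d,.]+\b', sample)

# Pattern 2: same idea, but the run must begin with a digit.
strict = re.findall(r'\b\d[\d,.]*\b', sample)

print(loose)   # ['2018', '93,000', '.', '169']
print(strict)  # ['2018', '93,000', '169']
```

Note that both patterns report only '169' for "169-year-old", because the hyphen ends the match.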

Homework 7 Review
Find all the named entities (approximately everything beginning with an uppercase letter denoting people, places, organizations etc.), e.g. Toys R Us or New Look. List them. How many of them are there?
Sample code:
>>> re.findall(r'\b[A-Z][a-z]*\b', text)
['What', 'High', 'Street', 'By', 'Emma', 'Simpson', 'Business', 'News',…]
>>> len(re.findall(r'\b[A-Z][a-z]*\b', text))
269
>>> re.findall(r'\b[A-Z][a-z]*\b(?:\s+[A-Z][a-z]*\b)*', text)
['What', 'High Street', 'By Emma Simpson Business', 'News\n\nA', 'Maplin', 'The', 'High Street', …]
>>> len(re.findall(r'\b[A-Z][a-z]*\b(?:\s+[A-Z][a-z]*\b)*', text))
211

Homework 7 Review
Find all the named entities (approximately everything beginning with an uppercase letter denoting people, places, organizations etc.), e.g. Toys R Us or New Look. List them. How many of them are there?
Sample code:
>>> re.findall(r'\b[A-Z][a-z]*\b(?:[^\S\n]+[A-Z][a-z]*\b)*', text)
['What', 'High Street', 'By Emma Simpson Business', 'News', 'A', 'Maplin', 'The', 'High Street', 'But', 'I', 'Poundworld', 'It', 'Jenny Evans', 'June', 'It', 'Nicola', 'Poundworld', 'Wolverhampton', 'We', 'But I', 'I', 'I', 'I', 'It', 'People', 'The', 'Jenny', 'She', 'Nicola', 'House', 'Fraser', 'Shrewsbury', 'Working', 'In', 'September', 'Office', 'National Statistics', 'Two', 'British Retail Consortium', 'That', 'Poundworld', 'Toys R Us', 'Maplin', 'British High Streets', 'Other', 'Homebase', 'Mothercare', 'Carpetright', 'New Look', 'And', 'Christmas', 'Its', 'I', 'I', 'Sir Ian Cheshire' …]
>>> len(re.findall(r'\b[A-Z][a-z]*\b(?:[^\S\n]+[A-Z][a-z]*\b)*', text))
214
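The switch from \s+ to [^\S\n]+ is easier to see on a short string. The sketch below uses a made-up two-paragraph sample (an assumption, not the article text). [^\S\n] reads as "not non-whitespace and not newline", i.e. any whitespace character except a newline, so a multi-word match can no longer run across a line break.

```python
import re

# Hypothetical sample with a paragraph break after the first line (illustration only).
sample = "Business News\n\nA spokesman for Toys R Us met Sir Ian Cheshire in Oxford Street."

# \s+ also matches newlines, so "News" and "A" are glued together across the break.
across = re.findall(r'\b[A-Z][a-z]*\b(?:\s+[A-Z][a-z]*\b)*', sample)

# [^\S\n]+ matches spaces and tabs but not newlines, so matches stay within a line.
within = re.findall(r'\b[A-Z][a-z]*\b(?:[^\S\n]+[A-Z][a-z]*\b)*', sample)

print(across)  # ['Business News\n\nA', 'Toys R Us', 'Sir Ian Cheshire', 'Oxford Street']
print(within)  # ['Business News', 'A', 'Toys R Us', 'Sir Ian Cheshire', 'Oxford Street']
```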

Homework 7 Review
How could you filter out the words at the beginning of each sentence that aren't really named entities? Show your code. How many named entities now?
Sample code:
>>> f.close()
>>> f = open('hw7.txt')
>>> lines = f.readlines()
>>> len(lines)
160
>>> lines[0]
'What went wrong on the High Street in 2018?\n'
>>> for line in lines:
...     if line.split() != []:
...         print('{} '.format(line.split()[0]),end='')
...
What By A The "I Jenny It "We "I Jenny Working In Two That Poundworld, And "I The But Retail Technology Chart "[The That's These "They Toys Maplin, Like Other As The Demand There's Retail Online "If "So Presentational Find Watch Presentational It To But It's "The Property Shoppers Many The Homebase, But The Drowning Creditors, Mike Presentational It's The Large "2018 It's Some And "On Town In Closures The Many "The Many Mike Even Model Asos The "It And The But "It's ''This But Additional

Homework 7 Review
How could you filter out the words at the beginning of each sentence that aren't really named entities? Show your code. How many named entities now?
Sample code:
>>> stopwords = ['What','A','The','This','That',"That's",'These','I','He','She','It','We','You','They','If','But','And', 'As','By','In']
>>> nes = re.findall(r'\b[A-Z][a-z]*\b(?:[^\S\n]+[A-Z][a-z]*\b)*', text)
>>> len([ne for ne in nes if ne.split()[0] not in stopwords])
141
>>> [ne for ne in nes if ne.split()[0] not in stopwords]
['High Street', 'News', 'Maplin', 'High Street', 'I', 'Poundworld', 'Jenny Evans', 'June', 'Nicola', 'Poundworld', 'Wolverhampton', 'We', 'I', 'I', 'I', 'People', 'Jenny', 'She', 'Nicola', 'House', 'Fraser', 'Shrewsbury', 'Working', 'September', 'Office', 'National Statistics', 'Two', 'British Retail Consortium', 'Poundworld', 'Toys R Us', 'Maplin', 'British High Streets', 'Other', 'Homebase', 'Mothercare', 'Carpetright', 'New Look', 'Christmas', 'Its', 'I', 'I', 'Sir Ian Cheshire', 'B', 'Q', 'Debenhams', 'First', 'Beast', 'East', 'February', 'World Cup', 'Retail', 'Technology', 'One', 'Chart', 'November', 'Sir Ian', 'Toys R Us', 'February', 'Toys R Us', 'They', 'If', 'Natalie Berg', 'Toys R', 'Maplin', 'Its', 'Amazon', 'Like Toys R Us', 'Maplin', 'Across', 'Other', 'Europe', 'Demand', 'There', 'We', 'Retail', 'Online', 'If', 'Sir Ian', 'So', 'Most', 'Presentational', 'Find', 'Watch The Retail Year', 'News Channel', 'December', 'Presentational', 'To', 'There', 'They', 'Sir Ian', 'Property', 'Sir Ian', 'Debenhams', 'Shoppers', 'Debenhams', 'Oxford Street', 'Getty Images', 'Many', 'Company Voluntary Arrangement', 'Homebase', 'Mothercare', 'New Look', 'Carpetright', 'House', 'Fraser', 'Its', 'Drowning', 'House', 'Fraser', 'Creditors', 'Sports Direct', 'Mike Ashley', 'Mike Ashley', 'Presentational', 'Mr Ashley', 'House', 'Fraser', 'Large', 'M', 'S', 'Debenhams', 'Natalie Berg', 'Ultimately', 'Some', 'On', 'Yet', 'Mark Williams', 'Revo', 'Town', 'Closures', 'Local Data Company', 'Many', 'Mark Williams', 'Many', 'Mike Ashley', 'November', 'Christmas', 'Even', 'Asos', 'City', 'December', 'Model', 'Asos', 'Asos', 'November', 'Richard Hyman', 'Brexit', 'Armageddon', 'Helen Dickinson', 'British Retail Consortium', 'Although', 'Retail', 'Additional', 'Lora Jones']
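The stopword filter can be tried end to end on a short string. This is a sketch under stated assumptions: the sample text and the helper name filter_candidates are made up for illustration; the stopword list and the regex are the ones from the slide. A candidate is dropped whenever its first word is a stopword, so a multi-word candidate that starts with one is lost as a whole.

```python
import re

# Stopword list as given on the slide.
stopwords = ['What', 'A', 'The', 'This', 'That', "That's", 'These', 'I', 'He', 'She',
             'It', 'We', 'You', 'They', 'If', 'But', 'And', 'As', 'By', 'In']

# Same candidate pattern as on the slide: capitalized words joined by non-newline whitespace.
NE_PATTERN = r'\b[A-Z][a-z]*\b(?:[^\S\n]+[A-Z][a-z]*\b)*'


def filter_candidates(text):
    """Return capitalized candidates whose first word is not a stopword (hypothetical helper)."""
    candidates = re.findall(NE_PATTERN, text)
    return [ne for ne in candidates if ne.split()[0] not in stopwords]


# Hypothetical sample text (illustration only, not the homework article).
sample = ("But the Poundworld chain closed in June. "
          "It was bought by Mike Ashley. "
          "The deal included House of Fraser.")

print(filter_candidates(sample))
# ['Poundworld', 'June', 'Mike Ashley', 'House', 'Fraser']
```

As in the slide output, "House of Fraser" comes out as two separate candidates because the lowercase "of" breaks the match.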

Platforms Today I'll go through installation for:
MacOS (and Linux)
Windows 10
Your homework assignment: install nltk, and check to see if it works (by next time)
In the next few lectures:

NLTK 3.4 Install
See http://www.nltk.org/install.html
Use pip3 (for python3) to install packages from the Python Package Index (PyPI):
sudo pip3 install -U nltk
This updated my nltk to 3.4.

NLTK Data Install
See http://www.nltk.org/data.html
python3
If you get an SSL certificate error message, run:
/Applications/Python 3.6/Install Certificates.command

Windows 10: setup
The PATH environment variable should be set to point to the Python 3 install directory.
Type in search: Edit environment variables for your account

Windows 10: install nltk On the command line:
pip3 install pyyaml nltk
Package pyyaml must be used somewhere in nltk …
Source: n2/faq.html

Windows 10: install numpy and test nltk
On the command line:
pip3 install numpy (the chunking algorithm uses it)
Let's test nltk:
.word_tokenize() converts a string into words
.pos_tag() does part-of-speech tagging
.ne_chunk() does named entity recognition
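The three calls can be chained as in the sketch below. It assumes nltk, numpy, and the nltk data packages used by the tokenizer, tagger, and chunker are already installed; the function name analyze and the example sentence are made up for illustration. The import sits inside the function so that the definition itself loads even on a machine where nltk is not yet installed.

```python
def analyze(sentence):
    """Tokenize, part-of-speech tag, and named-entity chunk one sentence (hypothetical helper)."""
    import nltk  # imported lazily; requires nltk and its data to be installed

    tokens = nltk.word_tokenize(sentence)  # string -> list of word tokens
    tagged = nltk.pos_tag(tokens)          # tokens -> list of (word, tag) pairs
    tree = nltk.ne_chunk(tagged)           # tagged tokens -> Tree with named-entity subtrees
    return tokens, tagged, tree


# Example call, to be run once the installation is complete:
# tokens, tagged, tree = analyze("Mike Ashley bought House of Fraser.")
# print(tagged)
# tree.draw()  # opens the tree in a pop-up window
```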

Windows 10: test nltk
.draw() takes a Tree object and draws it in a pop-up window

Windows 10: install nltk data
Install corpus data (from inside Python) using nltk.download()

Windows 10: test nltk data
A sample of the well-known Penn Treebank Wall Street Journal (WSJ) corpus is included.
Sample: 3,914 parsed sentences
Full corpus: 49,000+ parsed sentences

From the class exercise

nltk: where is it installed?
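One possible way to answer this from inside Python is to ask the import system where it would load the package from. This is a sketch, not necessarily the method shown in class; the helper name where_installed is made up for illustration, while importlib.util.find_spec is part of the standard library.

```python
import importlib.util


def where_installed(package_name):
    """Return the file a top-level package would be loaded from, or None if it is not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None  # the package is not installed for this Python
    return spec.origin  # for a regular package, the path to its __init__.py


# Prints a path ending in nltk/__init__.py if nltk is installed, otherwise None.
print(where_installed("nltk"))
```

If the result is None under python3, pip3 may have installed nltk for a different Python than the one being run.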
