CSA3180: Natural Language Processing. Text Processing 1 (October 2005)


1 CSA3180: Natural Language Processing (October 2005, CSA3180: Text Processing I)
Text Processing 1
- Language Encoding Issues
- Common Corpora
- Handling Large Document Collections
- Applications: Anatomy of a Search Engine
- NLTK

2 Language Encoding Issues
- Different encoding methods
- Different languages
- Unicode Standard
- Further information:
  - Unicode Consortium
  - Jukka Korpela tutorial: http://www.cs.tut.fi/~jkorpela/chars.html

3 Language Encoding Issues
- Character repertoire: a set of distinct characters
- Character code: a mapping between the characters of a repertoire and nonnegative integers (code numbers)
- Character encoding: an algorithm for representing those code numbers as sequences of octets
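
The three terms on this slide can be illustrated with a short Python sketch. The Maltese word used as sample text and all variable names are assumptions chosen for illustration; they do not come from the slides.

```python
# Sample text: a Maltese word containing two non-ASCII letters (illustrative choice).
word = "ħobż"

# Character repertoire: the set of distinct characters that occur in the text.
repertoire = sorted(set(word))

# Character code: each character maps to a nonnegative integer.
# Python's ord() returns the Unicode code number of a character.
code_numbers = [ord(ch) for ch in word]

# Character encoding: an algorithm that turns code numbers into octets.
# The same code numbers give different octet sequences under different encodings.
utf8_octets = word.encode("utf-8")
utf16_octets = word.encode("utf-16-be")

print(repertoire)
print([hex(n) for n in code_numbers])
print(utf8_octets.hex(" "), len(utf8_octets))
print(utf16_octets.hex(" "), len(utf16_octets))
```

The word has four characters, but its UTF-8 form takes six octets and its UTF-16 form takes eight, which is why the code and the encoding are kept as separate concepts.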

4 Language Encoding Issues
Encoding using octets
Common encodings:
- ASCII
- ISO Latin 1 (ISO 8859-1)
- ISO Latin 2 and Latin 3 extensions (ISO 8859-2 and ISO 8859-3; Latin 3 covers Maltese)
- Unicode and UTF-8
- ANSI (the usual informal name for Windows code pages such as Windows-1252)
- Cyrillic and Chinese encodings
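
As a rough illustration of why the choice of encoding matters, the sketch below tries to encode one Maltese word with several of the encodings listed above. The helper name `try_encode` and the sample word are hypothetical; the codec names are the ones Python's standard library uses.

```python
def try_encode(text, codec):
    """Return the encoded octets, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

word = "ħobż"  # illustrative Maltese sample text

# ASCII and Latin 1 lack the Maltese letters; Latin 3 and UTF-8 can represent them.
results = {
    codec: try_encode(word, codec)
    for codec in ("ascii", "iso-8859-1", "iso-8859-3", "utf-8")
}
for codec, octets in results.items():
    print(codec, "cannot encode" if octets is None else octets.hex(" "))

# Decoding octets with the wrong encoding does not fail here; it silently
# produces the wrong characters instead.
garbled = word.encode("utf-8").decode("iso-8859-1")
print(garbled)
```

Single-octet encodings such as Latin 3 use one octet per character, while UTF-8 uses a variable number, so the same text has different lengths on disk.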

5 Language Encoding Issues
Text encoding on the Web
- MIME standard
  - Content-Type: text/html; charset=iso-8859-1
  - Used in email and by web servers
  - Problems in implementation: few encodings are properly supported
  - UTF-8 recommended
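
A minimal sketch of how a client might read the charset from a Content-Type header and use it to decode a body, using Python's standard `email` package. The function names are hypothetical, and falling back to UTF-8 when no charset is declared is an assumption based on the recommendation on this slide, not a rule taken from the MIME standard.

```python
from email.message import Message

def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    The default of "utf-8" is an assumption made for this sketch.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or failobj if absent.
    return msg.get_content_charset(failobj=default)

def decode_body(body_octets, header_value):
    """Decode raw octets using the charset declared in the header."""
    return body_octets.decode(charset_from_content_type(header_value))

header = "text/html; charset=iso-8859-1"
print(charset_from_content_type(header))
print(decode_body(b"caf\xe9", header))
```

If the declared charset does not match the octets actually sent, decoding either raises an error or yields garbled text, which is one form of the implementation problems the slide mentions.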

6 Common Corpora
- WordNet
- TREC/ACE/TIDES corpora
- Linguistic Data Consortium (LDC)
  - Gigaword (news)
  - Treebanks
  - MUC (Message Understanding Conference)
  - TIPSTER (information retrieval)

7 Handling Large Document Collections
- Special issues involved in processing
- Hierarchical directory structures
- File indexes
- Batch processing: start, resume, pause, end
- Job scheduling
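
One possible way to combine these ideas is sketched below: walk a hierarchical directory, build a file index, and record progress in a checkpoint file so that a batch job can be paused and resumed. The function names, the JSON checkpoint format, and the `max_files` pause mechanism are all assumptions made for illustration.

```python
import json
import os

def build_file_index(root):
    """Return a sorted list of file paths under root, relative to root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(paths)

def run_batch(root, checkpoint_path, process, max_files=None):
    """Process every file under root that is not yet recorded in the checkpoint.

    checkpoint_path should live outside root so it is not indexed itself.
    max_files simulates a pause: the run stops after that many files.
    Returns the list of relative paths processed in this run.
    """
    done = set()
    if os.path.exists(checkpoint_path):  # resume from an earlier run
        with open(checkpoint_path, encoding="utf-8") as fh:
            done = set(json.load(fh))

    processed_now = []
    for rel_path in build_file_index(root):
        if rel_path in done:
            continue
        if max_files is not None and len(processed_now) >= max_files:
            break  # pause here; the checkpoint already holds our progress
        process(os.path.join(root, rel_path))
        done.add(rel_path)
        processed_now.append(rel_path)
        # Save progress after each file so an interruption loses little work.
        with open(checkpoint_path, "w", encoding="utf-8") as fh:
            json.dump(sorted(done), fh)
    return processed_now
```

Calling `run_batch` again with the same checkpoint path continues where the previous run stopped. Rewriting the checkpoint after every file is simple but slow for very large collections; writing it every N files would be a reasonable variation.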

8 Applications
- Anatomy of a Search Engine: "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Sergey Brin and Larry Page)
- Describes the internals of Google
- NLP in everyday life!
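
To give a feel for one core data structure of a search engine, here is a toy inverted index in Python. It is a heavily simplified sketch, not the design described by Brin and Page: the whitespace tokenizer, the function names, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():  # naive whitespace tokenization
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample documents.
docs = {
    1: "language encoding issues",
    2: "handling large document collections",
    3: "large language corpora",
}
index = build_inverted_index(docs)
print(search(index, "large language"))
```

A real engine adds much more on top of this lookup, such as ranking, word positions, and link analysis.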

9 Next Sessions…
- Natural Language Toolkit (NLTK)
- http://nltk.sourceforge.net/
- Please download and install!

