NOOJ 0.1 Max Silberztein Université de Franche-Comté 6th INTEX Workshop Sofia, Bulgaria, May 2003.

1 NOOJ 0.1 Max Silberztein Université de Franche-Comté max.silberztein@univ-fcomte.fr 6th INTEX Workshop Sofia, Bulgaria, May 2003

2 NOOJ v0.1
-- Rewritten entirely
-- Fully compatible with INTEX 4.3x
-- Corpus processor
-- Web support
-- Support for any text encoding & format
-- Object-oriented linguistic engine
-- Dynamic programming with published methods

3 Corpus Processor
-- A corpus is a set of homogeneous files: same language, same linguistic resources
-- A corpus may include tens of thousands of small files (e.g. Web pages) or large files (e.g. one year of Le Monde), stored anywhere
-- Different corpora can share certain text files at no extra cost
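One way to read the "no extra cost" point is that a corpus stores references to files rather than copies. The sketch below illustrates that reading only; the `Corpus` class, its fields, and the file paths are hypothetical and are not taken from NOOJ.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Corpus:
    """Hypothetical model: a corpus keeps file references, never file copies."""
    name: str
    language: str          # all files in one corpus share a language
    resources: tuple       # ... and the same linguistic resources
    files: set = field(default_factory=set)

    def add(self, path: str) -> None:
        # Only the path is stored, so the file may live anywhere.
        self.files.add(PurePosixPath(path))


# Two corpora that reference one common text file (paths are made up).
news = Corpus("news", "fr", ("dictionary-fr",))
politics = Corpus("politics", "fr", ("dictionary-fr",))
for corpus in (news, politics):
    corpus.add("/texts/shared/article-01.txt")
news.add("/texts/news/article-02.txt")

# The shared file appears in both corpora without being duplicated on disk.
shared = news.files & politics.files
```

Because only paths are stored, adding the same file to a second corpus costs one set entry, not a second copy of the text.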

4 Text Encoding & Format support
Native support means NO FILE CONVERSION
-- for TXT files: Windows Default (8 or 16 bits), DOS & ISO (any codepage), Unicode (7 bits, 8 bits, 16 bits, little- and big-endian)
-- for HTML (any encoding) and RTF (ASCII & Unicode) files
-- for Microsoft Word files, any version including Apple
-- No limit: XML, PDF, LaTeX, Outlook...
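Reading a file in its native encoding, rather than converting it first, can be sketched as follows. This is a minimal illustration, not NOOJ's implementation: the `decode_text` helper and its `fallback` parameter are assumptions, and only the UTF-8 and UTF-16 byte-order marks are recognised (UTF-32 and UTF-7 are not handled).

```python
import codecs

# Byte-order marks this sketch recognises, paired with a Python codec
# that consumes the mark while decoding.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_text(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode raw file bytes in place, without writing a converted copy.

    If a known byte-order mark is present it selects the codec; otherwise
    the caller-supplied codepage (an assumption of this sketch) is used.
    """
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return raw.decode(encoding)
    return raw.decode(fallback)
```

For example, `decode_text(data, fallback="cp850")` would read a DOS-codepage file, while a file starting with a UTF-16 byte-order mark is decoded with the right endianness regardless of the fallback. Bytes that are invalid in the chosen codepage raise `UnicodeDecodeError`.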

6 Web support
-- NOOJ includes a Web crawler that can import Web sites
-- Exploration is performed up to a user-declared depth, or until the Web site is fully explored (danger!)
-- Indirections are processed during the exploration; they may produce empty text files (i.e. no text unit in the Web page)
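The depth-limited exploration described above can be sketched as a breadth-first traversal. To stay self-contained, the sketch walks an in-memory link table instead of fetching pages; the `crawl` function, the `SITE` table, and its paths are hypothetical and do not reflect NOOJ's crawler.

```python
from collections import deque


def crawl(start, links, max_depth=None):
    """Breadth-first exploration of a site, stopping at max_depth.

    links maps each page to the pages it points to. With max_depth=None
    the whole reachable site is explored, which may be very large.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links beyond the declared depth
        for target in links.get(url, []):
            if target not in seen:  # avoid revisiting pages (and loops)
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site structure, including a link back to the start page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/a"],
    "/a/1": ["/a/1/x"],
}
```

With this table, `crawl("/", SITE, max_depth=1)` visits only the start page and the pages it links to directly, while omitting `max_depth` visits every reachable page.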

7 OO linguistic engine
-- The engine can be easily adapted:
   – inheritance means that one can quickly build a new module that inherits another module’s properties by default
   – override means that one needs to provide descriptions only for the methods that perform tasks differently
-- Dynamic programming:
   – NOOJ loads parts of the linguistic engine only when needed
   – describing extremely specific phenomena or behavior carries no cost for the overall architecture
-- Open interface:
   – applications access NOOJ from command-line programs (e.g. a shell), from object & class methods, from users’ programs, or from other applications (such as Microsoft Office or Adobe Acrobat)
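"Dynamic programming" on this slide refers to loading engine parts on demand. A minimal sketch of that idea follows, assuming a registry of module factories; the `Engine` class and the module names are hypothetical, not NOOJ's actual design.

```python
class Engine:
    """Hypothetical engine that instantiates a module only on first use."""

    def __init__(self, factories):
        self._factories = factories  # module name -> zero-argument constructor
        self._loaded = {}            # modules instantiated so far

    def module(self, name):
        # Build the module on first request, then reuse the same instance.
        if name not in self._loaded:
            self._loaded[name] = self._factories[name]()
        return self._loaded[name]

    def loaded(self):
        return sorted(self._loaded)


# Placeholder modules: plain dicts stand in for real engine components.
engine = Engine({
    "tokenizer": lambda: {"kind": "tokenizer"},
    "morphology": lambda: {"kind": "morphology"},
    "rare-phenomenon": lambda: {"kind": "rare-phenomenon"},
})

before = engine.loaded()          # nothing has been loaded yet
tokenizer = engine.module("tokenizer")
after = engine.loaded()           # only the requested module exists
```

Registering a very specific module costs nothing until some text actually requires it, which matches the "no cost for the overall architecture" point.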

8 Français.il
namespace Nooj {
  public class Français: Language // Language is a virtual class; MUST BE OVERRIDDEN
  {
    // tokenizer
    public override static bool rightToLeft () { return false; } // true for Arabic
    public override static bool oneCharPerToken () { return false; } // true for Chinese
    public override static bool transcription () { return false; } // true if text processed != text displayed
    ...
    // tokens’ properties
    public override static bool upperCaseLetter (char letter) {... }
    public override static bool lowerCaseLetter (char letter) {... }
    public override static bool lowerCaseForm (string token) {... }
    ...
    // dictionary & list lookup & match
    public override static int compareForms (string wform1, string wform2) {... }
    public override static int matchForm (string wform, string entry) {... }
    ...
    // localization
    ...
  }
}
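The same pattern can be written as a runnable sketch: a base class supplies defaults and each language overrides only what differs. The Python names below are adaptations of the slide's method names, and the `tokenize` helper and the case-insensitive `compare_forms` are illustrative assumptions rather than NOOJ behavior.

```python
class Language:
    """Base class with default behavior shared by most languages."""

    def right_to_left(self) -> bool:
        return False

    def one_char_per_token(self) -> bool:
        return False

    def compare_forms(self, wform1: str, wform2: str) -> int:
        # Assumed default: case-insensitive comparison, returning -1, 0 or 1.
        a, b = wform1.lower(), wform2.lower()
        return (a > b) - (a < b)


class French(Language):
    pass  # inherits every default


class Arabic(Language):
    def right_to_left(self) -> bool:
        return True  # the only method that needs a new description


class Chinese(Language):
    def one_char_per_token(self) -> bool:
        return True


def tokenize(language: Language, text: str) -> list:
    """Hypothetical tokenizer driven by the language's properties."""
    if language.one_char_per_token():
        return [ch for ch in text if not ch.isspace()]
    return text.split()
```

Here `tokenize(French(), "le chat dort")` splits on whitespace, whereas passing a `Chinese()` instance yields one token per character, without any change to the tokenizer itself.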

9 Perspectives
-- Text processing is fully operational; the linguistic engine is half operational
-- Morphological module by September
-- Dictionaries (new types & tools) by September
-- Grammar development tools (new types and tools) by the end of 2003
=> Alpha version by the end of 2003
-- All INTEX 4.3x functionalities by May 2004

