LINGUATECA FLUP/CLUP: The Corpógrafo – a Web-based environment for corpora research


LINGUATECA http://www.linguateca.pt
FLUP/CLUP http://www.letras.up.pt

The Corpógrafo – a Web-based environment for corpora research

[Workflow diagram labels: extract term candidates; term candidates list; store terminological entries, examples and meta-data in DB; regexp concordance; KWIC / window; N-grams]

The future…
Corpógrafo under GPL soon.
Multiple Corpógrafos installed in several university departments and countries: the “Corpógrafo Community”.
Centralized database to collect terminology / conceptual maps from the Corpógrafo Community.
Large-scale terminology / knowledge resources for specialized search engines, technical writing, translation, etc.

Motivation
Build an environment that helps users through the entire process of corpora research. The tool should not require advanced computer skills and should be easy to use for all types of users, from students to researchers.

Functionalities required:
Web access: use anywhere, anytime, from any computer. No software installations.
Collect texts: text extraction from structured files, downloading texts from the Web.
Text pre-processing: “cleaning” text, segmentation, text annotation, text encoding in a searchable or exchangeable format.
Corpus search: regular expression concordances, collocation extraction, frequency-based statistics (N-gram counts).
Information extraction: terminology, semantic relations, conceptual maps.
Knowledge-resource building: domain-specific glossaries, thesauri, terminological databases and ontologies; categorized word lists.
Comparable corpora studies: compilation and search over comparable corpora.
Exporting results to other formats and applications: to standard terminological databases, translation memories, etc.
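As an illustration of the corpus search functionality listed above (regular expression concordances in KWIC format and N-gram counts), here is a minimal Python sketch. It is not Corpógrafo's own code, which the poster says is built on Perl; the function names kwic and ngram_counts, the character-based context window, the simple word tokenizer and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample text, used only to exercise the functions below.
SAMPLE = ("The corpus search returns term candidates. "
          "Term candidates are stored in the term database.")


def kwic(text, pattern, window=30):
    """Return (left context, match, right context) for every regex match.

    `window` is the number of characters of context kept on each side
    (an assumption: a real tool might use a window measured in words).
    """
    rows = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        rows.append((left, m.group(0), right))
    return rows


def ngram_counts(text, n=2):
    """Count word N-grams after lowercasing and a simple \\w+ tokenization."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


if __name__ == "__main__":
    # Print a concordance with the matched keyword aligned in the middle.
    for left, key, right in kwic(SAMPLE, r"term\s+candidates"):
        print(f"{left:>30} [{key}] {right}")
    # Print the three most frequent bigrams in the sample.
    for gram, count in ngram_counts(SAMPLE, 2).most_common(3):
        print(" ".join(gram), count)
```

Frequent N-grams of this kind are one plausible starting point for a term candidates list, which a user would then filter and validate by hand.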
Corpógrafo’s workflow overview:
[Workflow diagram labels: Terminology Extraction; General Corpora Studies; Collect Texts; Text Extraction; Corpora (several languages); Web, DOC, TXT, PS, PDF, HTML; create and manage multilingual Terminology DBs; Text Pre-Processing and Categorization (Meta-Data); Corpora search; Term Definitions and Semantic Relations]

1. edit term meta-data (source, authors, morphology, etc.)
2. match bilingual equivalents
3. obtain statistical information from corpora about each term

1. query DB, navigate DB
2. export DB to XML file
3. automatic generation of documentation (HTML)

Media file repository (DCR, JPEG, WAV, QT, WMF). Associate:
1. explanation videos / pictures
2. sound file (pronunciation)

Linguateca – Our mission!
Improving processing and research of the Portuguese language.
Fostering collaboration among researchers.
Providing public and free-of-charge tools to the community.

Corpógrafo V3: two years after…
Two years after its debut at CL2003, Corpógrafo reaches version 3.
Corpógrafo is now a mature environment, ready to be further expanded.
More than 100 regular users. More than 400 user accounts.
Many lessons learned from practice: usability, technology, linguistics.
A corpus linguistics research community has grown along with Corpógrafo.
Large terminology / knowledge engineering projects are now possible.

Where to find Corpógrafo?
Have a look at http://www.linguateca.pt/corpografo (version 3 will be on-line in August 2005).

Under the hood
Corpógrafo is built over SAGI, a web operating system developed by Linguateca.
SAGI uses “LAMP”: Linux OS, Apache Web Server, MySQL RDBMS, Perl.
SAGI allows complete control over CGI processes and helps programmers build web interfaces.

Luís Sarmento las@letras.up.pt
Belinda Maia bmaia@mail.telepac.pt
Diana Santos Diana.Santos@sintef.no
Luís Cabral lcabral@letras.up.pt
Ana Sofia Pinto asofia@letras.up.pt

