Bruxelles, 2006-03-10 Computer Aided Document Indexing System (CADIS) with Eurovoc Bojana Dalbelo Bašić Faculty of Electrical Engineering and Computing.


1 Bruxelles, 2006-03-10 Computer Aided Document Indexing System (CADIS) with Eurovoc Bojana Dalbelo Bašić Faculty of Electrical Engineering and Computing University of Zagreb bojana.dalbelo@fer.hr Marko Tadić Faculty of Humanities and Social Sciences University of Zagreb marko.tadic@ffzg.hr

2 Bruxelles, 2006-03-10 Project AIDE
- idea for a project
  - September 2004, conference at JRC, Ispra
- interdisciplinary collaboration of 3 institutions
  - Croatian Information Documentation Referral Agency (HIDRA)
  - Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
  - Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

3 Bruxelles, 2006-03-10 AIDE – collaborating institutions
- HIDRA
  - collecting, processing, providing public access to and promoting the official documentation of the Republic of Croatia
  - coordinator: Maja Cvitaš, M.A.
- ZEMRIS
  - research in the field of artificial intelligence, neural networks, machine learning, data and text mining
  - coordinators: prof. Bojana Dalbelo Bašić and Jan Šnajder
- ZZL
  - computational linguistic research and building language technologies for Croatian
  - coordinator: prof. Marko Tadić

4 Bruxelles, 2006-03-10 AIDE – project objective Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

5 Bruxelles, 2006-03-10 AIDE – how?
- automatic indexing, how?
  - a program which “learns to index”
  - Joint Research Centre of the EC (JRC), Ispra, Italy
    - at least 10,000 manually indexed documents
    - 3-5 descriptors per document
    - 10-15 documents per descriptor
    - indexed documents stored in XML format
    - Steinberger (2003)
- compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors
- situation with Croatian documentation in 2004:
  - only a few hundred documents had been indexed
  - manual indexing: painfully slow
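The corpus requirements listed on this slide can be turned into a simple readiness check. The following is only a sketch: the function name, the dictionary layout, and the decision to treat 10,000 documents and 10 documents per descriptor as hard minimums are assumptions made for illustration, not part of the AIDE project or of the JRC tooling.

```python
from collections import Counter

# Thresholds read off the slide. Treating them as hard minimums is an assumption.
MIN_DOCUMENTS = 10_000
MIN_DOCS_PER_DESCRIPTOR = 10


def corpus_readiness(indexed_docs,
                     min_documents=MIN_DOCUMENTS,
                     min_docs_per_descriptor=MIN_DOCS_PER_DESCRIPTOR):
    """Summarise a manually indexed corpus.

    indexed_docs maps a document id to the list of descriptor ids
    that a human indexer assigned to that document.
    """
    n_docs = len(indexed_docs)
    # Count each descriptor at most once per document.
    per_doc = [len(set(descs)) for descs in indexed_docs.values()]
    docs_per_descriptor = Counter(
        d for descs in indexed_docs.values() for d in set(descs)
    )
    return {
        "documents": n_docs,
        "enough_documents": n_docs >= min_documents,
        "mean_descriptors_per_document": sum(per_doc) / n_docs if n_docs else 0.0,
        # Descriptors with too few training documents to learn from.
        "sparse_descriptors": sorted(
            d for d, count in docs_per_descriptor.items()
            if count < min_docs_per_descriptor
        ),
    }


if __name__ == "__main__":
    # Hypothetical toy corpus; the ids are placeholders, not Eurovoc ids.
    toy = {"doc1": ["a", "b", "c"], "doc2": ["a", "b"], "doc3": ["a"]}
    print(corpus_readiness(toy, min_documents=3, min_docs_per_descriptor=2))
```

With only a few hundred indexed documents, as in the 2004 situation described on the slide, such a check would report both too few documents and many sparse descriptors.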

6 Bruxelles, 2006-03-10 AIDE – how?
- how could we speed up the manual indexing?
- plan:
  - to develop a workstation for computer aided document indexing
  - to conduct research and development of algorithms in the field of computational linguistics/language technologies
  - to build that knowledge into the workstation and turn it into the Computer Aided Document Indexing System (CADIS)

7 Bruxelles, 2006-03-10 CADIS: two windows Document window Eurovoc browser window

8 Bruxelles, 2006-03-10 Document Window

9 Bruxelles, 2006-03-10

10 CADIS features  Enhanced user interface  list of descriptors appearing in document

11 Bruxelles, 2006-03-10 CADIS features  Descriptors and non-descriptors marked in document
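As a rough illustration of what marking descriptors and non-descriptors in a document involves, the sketch below finds thesaurus terms in a text by case-insensitive surface matching. It is not the CADIS implementation: the function name and data layout are assumptions, the example terms are made up rather than taken from Eurovoc, and plain string matching ignores inflection, which is presumably where a linguistic module would be needed for a language such as Croatian.

```python
import re


def mark_terms(text, descriptors, non_descriptors):
    """Return (start, end, surface, kind, descriptor) for each term found.

    descriptors:      iterable of preferred terms
    non_descriptors:  dict mapping a non-preferred term to the descriptor
                      it points to
    """
    terms = {d.lower(): ("descriptor", d) for d in descriptors}
    for term, target in non_descriptors.items():
        terms[term.lower()] = ("non-descriptor", target)
    if not terms:
        return []

    # Try longer terms first so that multiword terms win over their parts.
    alternatives = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )

    marks = []
    for match in pattern.finditer(text):
        info = terms.get(match.group(0).lower())
        if info is None:  # rare case-folding mismatch; skip it
            continue
        kind, descriptor = info
        marks.append((match.start(), match.end(), match.group(0), kind, descriptor))
    return marks


if __name__ == "__main__":
    # Hypothetical thesaurus entries, for illustration only.
    sample = "Fishing policy affects fisheries."
    for mark in mark_terms(sample, ["fishing policy"],
                           {"fisheries": "fishing industry"}):
        print(mark)
```

Each result carries the character offsets of the match, so a user interface could highlight the span and show the descriptor that a non-descriptor refers to.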

12 Bruxelles, 2006-03-10 CADIS features  Lists of n-grams

13 Bruxelles, 2006-03-10 CADIS features  Integration of corpus analysis  greyed n-grams are statistically relevant in the corpus
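The slides do not say which statistic CADIS uses to decide that an n-gram is relevant in the corpus. As a stand-in, the sketch below lists the n-grams of a document and keeps those whose tf-idf style score against a reference corpus reaches a threshold. The function names, the tokenisation, the smoothing, and the threshold value are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Lowercased word n-grams of a text, joined with single spaces."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relevant_ngrams(document, corpus, n=2, threshold=0.5):
    """Score the document's n-grams with a smoothed tf-idf against the corpus.

    Returns a dict of n-gram -> score for n-grams scoring at least threshold.
    tf-idf is a substitute measure here, not the one CADIS is known to use.
    """
    term_freq = Counter(ngrams(document, n))

    # Document frequency: in how many corpus documents each n-gram occurs.
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(ngrams(other, n)))

    scores = {}
    for gram, tf in term_freq.items():
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[gram]))
        scores[gram] = tf * idf
    return {gram: s for gram, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    # Hypothetical miniature corpus, for illustration only.
    corpus = ["the official gazette", "the official record", "the official act"]
    document = "the official gazette publishes the eurovoc thesaurus"
    for gram, score in sorted(relevant_ngrams(document, corpus).items()):
        print(f"{gram}: {score:.3f}")
```

An n-gram that appears in every corpus document gets a score of zero and is filtered out, while one that is rare in the corpus is kept; a collocation measure computed over the corpus would be another reasonable choice.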

14 Bruxelles, 2006-03-10 CADIS features  Manual marking of significant n-grams: an important step towards automatic indexing

15 Bruxelles, 2006-03-10 Eurovoc browser window

16 Bruxelles, 2006-03-10 Further development
- CADIS for other languages?
  - already available for Croatian and English
  - usable for other languages without the linguistic module
  - cooperation needed with the respective language technology experts to develop the linguistic module for other languages
  - partners for EU project proposals for the next step
- AIDE
  - research on machine learning and text mining
  - use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc
  - establishing a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

17 Bruxelles, 2006-03-10 http://textmining.zemris.fer.hr

18 Bruxelles, 2006-03-10 Conclusion
- CADIS is unique in Europe
- Web info at:
  - HIDRA: www.hidra.hr/hidra/aide/aide.htm
  - ZEMRIS: textmining.zemris.fer.hr
- for download contact: bojana.dalbelo@fer.hr

19 Bruxelles, 2006-03-10 Computer Aided Document Indexing System (CADIS) with Eurovoc Bojana Dalbelo Bašić Faculty of Electrical Engineering and Computing University of Zagreb bojana.dalbelo@fer.hr Marko Tadić Faculty of Humanities and Social Sciences University of Zagreb marko.tadic@ffzg.hr

