Leuven, 2007-05-22 Computer Aided Document Indexing System for Accessing Legislation A Joint Venture of Flanders and Croatia Bojana Dalbelo Bašić Faculty.


1 Leuven, 2007-05-22 Computer Aided Document Indexing System for Accessing Legislation A Joint Venture of Flanders and Croatia Bojana Dalbelo Bašić Faculty of Electrical Engineering and Computing, University of Zagreb bojana.dalbelo@fer.hr Marko Tadić Faculty of Humanities and Social Sciences, University of Zagreb marko.tadic@ffzg.hr Marie-Francine Moens Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven marie-francine.moens@law.kuleuven.ac.be

2 Leuven, 2007-05-22 Talk overview
- document indexing and computer aided document indexing
- project AIDE
- CADIS workstation: features
- project CADIAL
- eCADIS workstation: additional features
- machine learning techniques
- future developments
- conclusions

3 Leuven, 2007-05-22 Computer Aided Document Indexing
- document indexing
  - attachment of descriptors from a controlled thesaurus to a document
  - descriptors = labels representing the content of a document
- necessary for document retrieval in many document collections
  - parliamentary documentation
  - legislation
  - technical documentation
  - …
- usually done manually
  - tedious, error-prone, slow (max. 30-40 documents/day)
- could computers be of any help in this process?
  - if we build a Computer Aided Document Indexing System (CADIS)

4 Leuven, 2007-05-22 Project AIDE in Croatia
- idea for a project
  - September 2004
- interdisciplinary collaboration of 3 institutions
  - Croatian Information Documentation Referral Agency (HIDRA)
  - Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
  - Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

5 Leuven, 2007-05-22 AIDE – collaborating institutions
- HIDRA
  - collecting, processing, providing public access to and promoting the official documentation of the Republic of Croatia
  - coordinator Maja Cvitaš, M.A.
- ZEMRIS
  - research in the field of artificial intelligence, neural networks, machine learning, data and text mining
  - coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.
- ZZL
  - computational linguistic research and building language technologies for Croatian
  - coordinator prof. Marko Tadić

6 Leuven, 2007-05-22 AIDE – project objective
Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

7 Leuven, 2007-05-22 AIDE – how?
- AIDE = Automatic Indexing of Documents with Eurovoc
- automatic indexing, how?
  - a program which “learns to index” documents
  - conference in Joint Research Center of EC (JRC), Ispra, Italy, 2004-09
    - at least 10,000 manually indexed documents
    - 3-5 descriptors per document
    - 10-15 documents per descriptor
    - indexed documents stored in XML format
    - Steinberger (2003)
- compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors
- situation with Croatian documentation in 2004-09
  - only a few hundred documents had been indexed
  - manual indexing: painfully slow
  - how could we speed up the manual indexing?
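
The corpus requirements listed on this slide can be checked mechanically once the manual index is loaded. The following is a minimal sketch, assuming the index is already available as a mapping from document id to descriptor ids (the XML schema is not given in the slides, so no parsing is shown). The function name `corpus_stats` and the threshold parameter are hypothetical, and reading "10-15 documents per descriptor" as a minimum per descriptor is an assumption.

```python
from collections import Counter


def corpus_stats(index, min_docs_per_descriptor=10):
    """Summarise a manually indexed corpus.

    index maps a document id to the descriptor ids attached to it
    (hypothetical in-memory form; the real corpus is stored as XML).
    """
    per_descriptor = Counter()
    for descriptors in index.values():
        # Count each descriptor once per document, even if listed twice.
        per_descriptor.update(set(descriptors))
    total = sum(per_descriptor.values())
    return {
        "documents": len(index),
        "descriptors_per_document": total / len(index) if index else 0.0,
        "documents_per_descriptor": (
            total / len(per_descriptor) if per_descriptor else 0.0
        ),
        # Descriptors with too few example documents to learn from.
        "undertrained": sorted(
            d for d, n in per_descriptor.items() if n < min_docs_per_descriptor
        ),
    }


# Toy index with made-up ids, only to show the call.
toy_index = {"doc1": ["a", "b", "c"], "doc2": ["a", "b"], "doc3": ["a"]}
stats = corpus_stats(toy_index, min_docs_per_descriptor=2)
```

The `undertrained` list is the practically useful part: it names the descriptors for which more manually indexed documents would be needed before a classifier could be trained.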

8 Leuven, 2007-05-22 AIDE – activities
- investigate and develop algorithms in the field of computational linguistics/language technologies
- include that knowledge into the Computer Aided Document Indexing System (CADIS)
- demonstration of CADIS at the European Parliament (2006-03-10)

9 Leuven, 2007-05-22 CADIS: two parallel windows
- Document window
- Eurovoc browser window

10 Leuven, 2007-05-22 Document Window

11 Leuven, 2007-05-22

12 CADIS features  Enhanced user interface  list of descriptors literally appearing in the document

13 Leuven, 2007-05-22 CADIS features  Descriptors and non-descriptors marked in document
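
Finding descriptors and non-descriptors that appear literally in a document can be sketched as a plain string search. This is only an illustration under stated assumptions: the function name `find_literal_terms` is hypothetical, the thesaurus entries in the example are illustrative rather than taken from the slides, and non-descriptors are assumed to be synonyms that each point to one descriptor. Matching is literal and case-insensitive, so inflected forms are missed.

```python
import re


def find_literal_terms(text, descriptors, non_descriptors):
    """Locate thesaurus terms that occur literally in text.

    descriptors: iterable of descriptor labels.
    non_descriptors: dict mapping a synonym to the descriptor it stands for.
    Returns {descriptor: [(offset, matched text), ...]}.
    """
    candidates = [(label, label) for label in descriptors]
    candidates += list(non_descriptors.items())  # (surface form, descriptor)
    hits = {}
    for surface, descriptor in candidates:
        # Whole-word, case-insensitive match of the surface form.
        pattern = re.compile(r"\b" + re.escape(surface) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.setdefault(descriptor, []).append((match.start(), match.group()))
    return hits


# Illustrative data only.
sample = "The common fisheries policy sets a fishing quota for each state."
found = find_literal_terms(
    sample,
    descriptors=["catch quota", "fisheries policy"],
    non_descriptors={"fishing quota": "catch quota"},
)
```

The returned offsets are what a document window would need in order to highlight the matched spans, with non-descriptor matches reported under the descriptor they map to.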

14 Leuven, 2007-05-22 CADIS features  Lists of n-grams

15 Leuven, 2007-05-22 CADIS features  Integration of corpus analysis  greyed n-grams are statistically relevant in the corpus i.e. collocations
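
Listing n-grams and flagging the statistically relevant ones can be illustrated as follows. The slides do not say which association measure CADIS uses, so pointwise mutual information (PMI) over bigrams is an assumption here, as are the function names and the `min_count` cutoff.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_scores(corpus, min_count=2):
    """Score bigrams by pointwise mutual information.

    corpus: list of token lists, one per document.
    Bigrams seen fewer than min_count times are skipped, because PMI
    is unreliable for rare events.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(ngrams(tokens, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_first = unigrams[first] / n_uni
        p_second = unigrams[second] / n_uni
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Toy corpus, for illustration only.
toy_corpus = [
    ["official", "gazette", "of", "croatia"],
    ["the", "official", "gazette"],
    ["of", "the", "state"],
]
collocations = pmi_scores(toy_corpus)
```

A bigram with a clearly positive score occurs together more often than its parts would by chance, which is the sense in which an n-gram counts as a collocation. Without lemmatisation, inflected variants of the same expression are counted separately.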

16 Leuven, 2007-05-22 CADIS features  Manual marking of significant n-grams  important step towards further refinement of automatic indexing

17 Leuven, 2007-05-22 Eurovoc browser window

18 Leuven, 2007-05-22 AIDE – activities
- investigate and develop algorithms in the field of computational linguistics/language technologies
- include that knowledge into the Computer Aided Document Indexing System (CADIS)
- demonstration of CADIS at the European Parliament (2006-03-10)
- ca. 10,000 Croatian documents indexed in HIDRA using CADIS workstation during 2006
- joint project proposal with Katholieke Universiteit Leuven for CADIAL project

19 Leuven, 2007-05-22 CADIAL project
- Computer Aided Document Indexing for Accessing Legislation
- a joint Flemish-Croatian project
  - Department International Flanders, grant no. KRO/009/06
- partners:
  - Katholieke Universiteit Leuven (prof. Marie-Francine Moens)
  - University of Zagreb, Hidra (prof. Bojana Dalbelo Bašić)
- started: 2007-03
- duration: 2 years
- web: www.cadial.org
- the goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
- new version of CADIS (eCADIS) is one of the modules in this project
  - planned as a web-based service

20 Leuven, 2007-05-22 CADIAL project 2  used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian  used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English  included that training data into the next version: eCADIS (  -version)

21 Leuven, 2007-05-22 eCADIS (  ) features  Automatic suggestion of relevant descriptors i.e. automatic indexing  application of machine learning techniques

22 Leuven, 2007-05-22 eCADIS (  ) features  Compare it to manually attached indexes…

23 Leuven, 2007-05-22 eCADIS (  ) features  Manual marking of inappropriate suggestions  another step in further refinement of automatic indexing

24 Leuven, 2007-05-22 eCADIS (  ) on document in English

25 Leuven, 2007-05-22 eCADIS (  ) on document in English  Automatic suggestion of relevant descriptors i.e. automatic indexing

26 Leuven, 2007-05-22 eCADIS (  ) on document in English  Compare it to manually attached indexes…

27 Leuven, 2007-05-22 Training the classifiers
- already existing classifiers
  - profile classifier (Steinberger 2003)
  - K-nearest neighbours
  - binary classifiers
    - SVM, Logistic Regression, Rocchio, Bayes, …
- classifiers used for the preliminary training
  - ca. 3500 independent binary classifiers
  - need to be further evaluated
  - Logistic Regression used for 10,000 documents in Croatian
  - SVM used for 20,000 documents in English
- features
  - tokens, lemmas, stems, character n-grams
  - various feature selection methods and their combinations:
    - χ², information gain (ig), mutual information (mi), …
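
Because each descriptor gets its own independent binary classifier, feature selection can also be done per descriptor. Below is a minimal sketch of chi-square scoring, assuming that chi-square is the first feature selection method named on the slide. The data layout (documents as token sets, labels as descriptor sets) and the function names are hypothetical, and this is not the project's actual implementation.

```python
def chi_square(docs, labels, term, descriptor):
    """Chi-square statistic for one (term, descriptor) pair.

    docs: list of token sets; labels: list of descriptor sets, aligned
    with docs. Builds the 2x2 contingency table of term presence versus
    descriptor presence.
    """
    a = b = c = d = 0
    for tokens, descriptors in zip(docs, labels):
        has_term = term in tokens
        has_descriptor = descriptor in descriptors
        if has_term and has_descriptor:
            a += 1
        elif has_term:
            b += 1
        elif has_descriptor:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


def top_features(docs, labels, descriptor, k):
    """Return the k terms with the highest chi-square score for a descriptor."""
    vocabulary = set().union(*docs)
    # Sort by descending score, then alphabetically to break ties.
    ranked = sorted(
        vocabulary,
        key=lambda term: (-chi_square(docs, labels, term, descriptor), term),
    )
    return ranked[:k]


# Toy data, for illustration only.
toy_docs = [{"fish", "quota"}, {"fish", "fleet"}, {"tax", "law"}, {"tax", "quota"}]
toy_labels = [{"fisheries"}, {"fisheries"}, {"taxation"}, {"taxation"}]
selected = top_features(toy_docs, toy_labels, "fisheries", k=2)
```

Chi-square is symmetric, so a term that reliably signals the absence of a descriptor scores as high as one that signals its presence. Both are informative for a binary classifier. Running this for each of several thousand descriptors would produce one feature list per classifier.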

28 Leuven, 2007-05-22 Further development of eCADIS
- training with new features and feature selection methods
  - collocations, word n-grams, chunks
- new measures for evaluation of results
  - sensitive to thesaurus hierarchy
- web-interface for eCADIS for inclusion into the CADIAL system
- eCADIS for other languages
  - now only Croatian and English (  -version) covered
  - usable for other languages as it is, but less efficient without the linguistic module
    - no list of lemmas, but types
    - poor statistics for n-grams
  - cooperation with language technology experts in different languages for development of linguistic modules
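
One way to make evaluation sensitive to the thesaurus hierarchy is to expand both the suggested and the manually attached descriptors with their broader terms before computing precision and recall. The slides do not specify which measure is planned, so this is only one possible choice, sketched under assumptions: the `broader` mapping, the function names, and the example terms are all hypothetical.

```python
def with_ancestors(descriptors, broader):
    """Expand a set of descriptors with all of their broader terms.

    broader maps a descriptor to the list of its direct broader terms.
    """
    expanded = set()
    stack = list(descriptors)
    while stack:
        term = stack.pop()
        if term not in expanded:
            expanded.add(term)
            stack.extend(broader.get(term, ()))
    return expanded


def hierarchical_prf(suggested, manual, broader):
    """Precision, recall and F1 computed on ancestor-expanded sets."""
    s = with_ancestors(suggested, broader)
    m = with_ancestors(manual, broader)
    overlap = len(s & m)
    precision = overlap / len(s) if s else 0.0
    recall = overlap / len(m) if m else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Illustrative hierarchy: both terms share one broader term.
toy_broader = {
    "catch quota": ["fisheries policy"],
    "fishing fleet": ["fisheries policy"],
}
scores = hierarchical_prf({"catch quota"}, {"fishing fleet"}, toy_broader)
```

With flat set comparison this suggestion would score zero, because the two descriptors differ. On the expanded sets it gets partial credit for landing in the right branch of the thesaurus.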

29 Leuven, 2007-05-22 Further development of eCADIS
- … eCADIS for other languages
  - training the automatic indexing system for other languages
    - enables automatic suggestions of relevant descriptors in new, unseen documents
  - analysis of manual markings
    - descriptors, word n-grams, suggestions
- promote the use of eCADIS in other countries beyond the scope of CADIAL project
  - e.g. Belgium (Flanders)
  - linguistic module for Dutch and French needed
    - computational linguistics expertise
  - training data from Acquis can be used to make an automatic indexing system for Dutch and French
    - machine learning expertise

30 Leuven, 2007-05-22 Conclusion
- CADIAL
  - a joint Flemish-Croatian project sponsored by Flemish government
  - better public access to Croatian official documentation
    - faster and improved document indexing
    - automatic content metadata generation (Semantic Web)
    - easier document retrieval and exploration of legislation
    - multilingual access via standardized EU thesaurus Eurovoc
  - a test-case for the usage of such a system in Flanders
- Web information on CADIAL project and eCADIS
  - www.cadial.org
- contact:
  - bojana.dalbelo@fer.hr
  - marie-france.moens@law.kuleuven.ac.be

31 Leuven, 2007-05-22 Computer Aided Document Indexing System for Accessing Legislation A Joint Venture of Flanders and Croatia Bojana Dalbelo Bašić Faculty of Electrical Engineering and Computing, University of Zagreb bojana.dalbelo@fer.hr Marko Tadić Faculty of Humanities and Social Sciences, University of Zagreb marko.tadic@ffzg.hr Marie-Francine Moens Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven marie-francine.moens@law.kuleuven.ac.be

