Automatic Induction of Rules for Classification and Interpretation of Cultural Heritage Material S. Ferilli, F. Esposito, T.M.A. Basile and N. Di Mauro.


1 Automatic Induction of Rules for Classification and Interpretation of Cultural Heritage Material
S. Ferilli, F. Esposito, T.M.A. Basile and N. Di Mauro
Dipartimento di Informatica - Università di Bari, Italy
Trondheim, Norway, August 17-22, 2003

2 Overview
- The COLLATE project
- The INTHELEX system
- Experiments
- Conclusions

3 The COLLATE project
Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material
- IST Key Action III: Multimedia, Content and Tools
- Action line III.2.4: Digital Preservation of Cultural Heritage

4 COLLATE: the consortium
- Fraunhofer - IPSI (Project Coordinator)
- Deutsches Filminstitut - DIF
- Filmarchiv Austria - FAA
- Národní Filmový Archiv - NFA
- Risø National Laboratory, Systems Analysis Dept.
- University of Bari, Dip. di Informatica
- Sword ICT

5 The goal of the COLLATE project
Constructing a Web-based "Collaboratory in use" (collaborative laboratory) for archives, researchers and end-users working with digitized historical and cultural material
- Multimedia digital repository
  - European historic film documentation (1920s and 1930s)
  - XML metadata (cataloguing & content indexing)
- Ensure accessibility
  - Work environment for content indexing & annotation
  - Content-based retrieval
- Evaluate acceptability
  - Preservation case studies by film experts
  - Empirical studies of real-life user behavior

6 Collate Repository: Censorship Decisions
Source for:
- Film title
- Participants in the examination
- Notices on content
- Juridical legitimization for the decision
- Legal motivation
- Conditions for permission (e.g., cuts, change of title)
- Reference to previous decisions
- Costs for the procedure

7 Collate Repository: Film Registration Cards
Source for:
- Film title
- Production company
- Date and number of examination
- Length (after censoring)
- Number of acts
- Brief content
- Forbidden parts
- Staff

8 Collate Repository: Film Application Forms
Národní Filmový Archiv
Source for:
- Name of applicant (production or distribution company)
- Title of the film
- Year of production
- Length (before censorship)
- Brief content
- Information about earlier examinations

9 Collate Documents Text Structure

10 Collate Repository: Newspaper Clippings and Articles

11 INTHELEX: INcremental THEory Learner from EXamples
- Multi-purpose
- First-order logic hierarchical theories
- Object Identity assumption
- Fully incremental
- Closed loop
- Full memory storage
- Historical memory of all positive/negative examples
- Multi-conceptual
- Dependency graph
- Multi-strategy
- Induction (generalization, specialization)
- Abstraction, Abduction, Deduction

12 INTHELEX Architecture

13 Deduction in INTHELEX
- Based on a saturation operator
- Recognizes higher-level concepts, deduced via subsumption and/or resolution, in the example descriptions
Theory / BK (input to the saturation operator):
    father(X, Y) :- parent(X,Y), male(X).
Example:
    grand_father(mike, anne) :- parent(mike,john), male(mike), parent(john,anne), male(john).
Saturated example:
    grand_father(mike, anne) :- parent(mike,john), male(mike), father(mike,john), parent(john,anne), male(john), father(john,anne).

14 Abduction in INTHELEX
Abductive logic theory: a normal logic program that contains
- abducibles: predicates on which abductions can be made
- integrity constraints: indirect information about abducibles
Theory:
    father(X,Y) :- parent(X,Y), male(X).
Example:
    father(mike,mary) :- parent(mike,mary).
Integrity constraint:
    :- male(X), female(X).
Abduced assumption:
    male(mike)
Intertwining abductive and consistency derivations, INTHELEX starts from a goal and a set of initial assumptions and produces a set of consistent hypotheses.

15 Abstraction in INTHELEX
- Abstraction concerns the shift of representation language. It takes place at the world-perception level and then propagates to a higher level by means of a set of operators
- The abstraction theory, applied automatically by the system before processing the examples, is assumed to be already given to INTHELEX
- Abstraction operators in INTHELEX:
  - replacing a number of components by a compound object
  - decreasing the grain-size of a set of values
  - ignoring whole objects or just part of their features
  - neglecting the number of occurrences of some kind of object

16 Abstraction in INTHELEX: an example
Example description:
    grandfather(john,mary) :- parent(john,mike), male(john), parent(mike,mary).
An abstraction clause:
    father(X,Y) :- parent(X,Y), male(X).
Abstracted example:
    grandfather(john,mary) :- father(john,mike), parent(mike,mary).

17 Experimental Session
- Original dataset
  - 29 documents for the class FAA_registration_card (187 layout blocks)
  - 36 documents for the class DIF_censorship_decision (299 layout blocks)
  - 24 documents for the class NFA_application_forms (248 layout blocks)
  - 17 reject documents obtained from newspaper articles
- Methodology
  - 10-fold cross validation
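For readers unfamiliar with the methodology, a 10-fold split of a dataset with the class sizes listed on this slide could look like the sketch below. The document identifiers, the random seed and the helper `make_folds` are assumptions; the slide does not say how the folds were built or whether they were stratified.

```python
import random

# Class sizes as listed on the slide.
CLASS_COUNTS = {
    "FAA_registration_card": 29,
    "DIF_censorship_decision": 36,
    "NFA_application_forms": 24,
    "reject": 17,
}

def make_folds(items, k, seed=0):
    """Shuffle the items and deal them into k folds of near-equal size."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

# Hypothetical document identifiers: (name, class label).
documents = [
    (f"{label}_{i}", label)
    for label, count in CLASS_COUNTS.items()
    for i in range(count)
]

folds = make_folds(documents, k=10)

# Each fold serves once as the test set; the other nine form the training set.
splits = []
for i, test_fold in enumerate(folds):
    train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
    splits.append((train, test_fold))

print(len(documents), [len(fold) for fold in folds])
```

With 106 documents in total, six folds hold 11 documents and four hold 10.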

18 Document Pre-processing
The system WISDOM++
- Input: scanned document images (TIFF, 300 dpi)
- Identifies the layout blocks that make up a paper document, along with their type, size, position and relative position
- Output: a first-order description in terms of these layout blocks

19 Example of document pre-processing
    class_dif_censorship_card(d1) :- ...
        part_of(d1,b3), width(b3,103), height(b3,55), type_of_text(b3), x_pos_center(b3,115), y_pos_center(b3,323),
        part_of(d1,b4), width(b4,87), height(b4,57), type_of_text(b4), x_pos_center(b4, 297), y_pos_center(b4, 340),
        ... on_top(b3,b4), to_rigth(b3,b4), ...
(Figure: document d1 with layout blocks b3 and b4; the literals on_top(b3,b4) and to_rigth(b3,b4) describe the relative position between b3 and b4.)

20 Abstraction for handling numeric features
Example description:
    class_dif_censorship_card(d1) :- part_of(d1,b3), width(b3,103), height(b3,55), type_of_text(b3), x_pos_centre(b3,345), y_pos_centre(b3,323), ...
Abstraction theory for horizontal position:
    pos_left(X) :- x_pos_centre(X,Y), Y >= 0, Y =< 213.
    pos_center(X) :- x_pos_centre(X,Y), Y >= 214, Y =< 426.
    pos_right(X) :- x_pos_centre(X,Y), Y >= 427.
Matching: x_pos_centre(b3,345) matches x_pos_centre(X,Y) with Y = 345, so the clause for pos_center(X) applies.
Abstracted example for horizontal position:
    class_dif_censorship_card(d1) :- part_of(d1,b3), width(b3,103), height(b3,55), type_of_text(b3), pos_center(b3), y_pos_centre(b3,323), ...

21 Description Length
The description length does not change after the abstraction process on numeric features: each numeric value is replaced by exactly one corresponding symbolic value, so the number of literals neither increases nor decreases.

22 Experimental Results: DIF
The number of clauses needed to classify the documents belonging to this class is exactly 1.

23 Example of Clauses Learned
    class_dif_cen_decision(A) :-
        image_lenght_long(A), image_width_short(A),
        part_of(A, B), type_of_text(B), width_medium_large(B), height_very_very_small(B), pos_left(B), pos_upper(B),
        part_of(A, C), type_of_text(C), height_very_very_small(C), pos_left(C), pos_upper(C),
        on_top(C, D), type_of_text(D), width_medium_large(D), height_very_very_small(D), pos_left(D), pos_upper(D).
- The features in this description are common to all the definitions learned in the 10 folds. This suggests that the system was able to capture the significant components, and it explains why performance on this class is the best of all.
- Starting from descriptions whose average length was 215, the average number of literals in the learned rules is just 22.
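To illustrate how a learned clause of this kind can be checked against a document description, the sketch below tests coverage with a backtracking matcher in which distinct variables must be bound to distinct objects, in the spirit of the Object Identity assumption mentioned earlier in the talk. It is an assumption-laden illustration, not INTHELEX's coverage test: the helpers `is_var` and `covers`, the tuple encoding and the shortened clause body are all hypothetical.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()

def covers(clause_body, facts, binding=None):
    """True if some substitution maps every body literal onto a fact while
    binding distinct variables to distinct constants."""
    binding = dict(binding or {})
    if not clause_body:
        return True
    first, rest = clause_body[0], clause_body[1:]
    for fact in facts:
        if fact[0] != first[0] or len(fact) != len(first):
            continue
        new = dict(binding)
        ok = True
        for term, value in zip(first[1:], fact[1:]):
            if is_var(term):
                if term in new:
                    ok = new[term] == value
                elif value in new.values():
                    ok = False  # another variable already denotes this object
                else:
                    new[term] = value
            else:
                ok = term == value
            if not ok:
                break
        if ok and covers(rest, facts, new):
            return True
    return False

# Shortened, hypothetical body: two distinct left-aligned text blocks B and C.
body = [
    ("part_of", "A", "B"), ("type_of_text", "B"), ("pos_left", "B"),
    ("part_of", "A", "C"), ("type_of_text", "C"), ("pos_left", "C"),
]
two_blocks = [
    ("part_of", "d1", "b1"), ("type_of_text", "b1"), ("pos_left", "b1"),
    ("part_of", "d1", "b2"), ("type_of_text", "b2"), ("pos_left", "b2"),
]
one_block = two_blocks[:3]

print(covers(body, two_blocks))  # True
print(covers(body, one_block))   # False
```

The second call fails because B and C cannot both be bound to the single block b1; without the distinctness check the one-block document would be covered as well.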

24 Experimental Results: FAA

25 Experimental Results: NFA

26 Discussion
- INTHELEX was able to learn significant definitions both for the document classes and for the layout blocks of interest in each of them
- Predictive accuracy is always very high: it reaches 99.17% in one case and falls below 90% in only 2 cases out of 32 (86.69% and 89.85%)
- The best accuracy is obtained by a theory made up of only one clause, which grasped the target concept, and coincides with the best runtime (classification for class DIF)
- The learned rules are highly understandable for human experts
- The classification problem is easier than the interpretation one
- The number of clauses, the number of generalizations performed and the runtime all tend to increase from classification to interpretation

27 Discussion
- Runtime increases evidently, but the high predictive accuracy should ensure that few theory revisions can be expected when processing further documents
- Scalability should be ensured: very few documents will trigger a theory revision
- Symbolic representations allow the expert to choose the training examples properly, so that few of them are sufficient to reach a correct definition

28 Conclusions
- Symbolic (first-order logic) multistrategy learning techniques were applied to induce rules for automatic classification and interpretation of cultural heritage material
- The learning system used, INTHELEX, is incremental
- Experimental results show the benefits that such an approach can bring

