AUTINDEX: Automatic Indexing and Classification of Texts. Catherine Pease & Paul Schmidt, IAI, Saarbrücken. CIG Conference, Norwich, September 2006.


1 AUTINDEX: Automatic Indexing and Classification of Texts
Catherine Pease & Paul Schmidt, IAI, Saarbrücken
CIG Conference, Norwich, September 2006
{cath,paul}@iai.uni-sb.de
http://www.iai.uni-sb.de

2 Automatic Indexing and Classification of Texts
AUTINDEX:
- calculates keywords in texts
- places each text in its appropriate classification

3 Applications
- Information services, for indexing scientific articles
- Document management systems, for text classification according to content
- Libraries, for indexing incoming books and articles

4 Basic Components
- Morpho-syntactic analysis: tagging and lemmatisation
- Shallow parsing: resolution of grammatical ambiguities and identification of NPs

5 Linguistic Resources for Pre-processing
- Morphological analyser and morpheme dictionaries
- Grammar rules for shallow parsing

6 Morphological Analyser
Input: "Cost reduction"
cost:
  {lu=cost,ls=cost,c=verb,vtype=fiv}
  {lu=cost,ls=cost,c=verb,vtype=inf}
  {lu=cost,ls=cost,c=noun,nb=sg}
reduction:
  {lu=reduction,ls=reduce,c=noun,nb=sg}
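The analyser output above can be read as a list of attribute-value readings per token. The following is a minimal Python sketch of that data structure, not the AUTINDEX implementation: the function names are hypothetical, and the attribute names (lu, ls, c, vtype, nb) are simply copied from the slide without assuming more about their meaning than the slide shows.

```python
def parse_reading(text: str) -> dict[str, str]:
    """Parse one reading such as '{lu=cost,ls=cost,c=noun,nb=sg}' into a dict."""
    inner = text.strip().strip("{}")
    return dict(pair.split("=", 1) for pair in inner.split(","))


# Readings copied from the slide for the two tokens of "Cost reduction".
READINGS = {
    "cost": [
        "{lu=cost,ls=cost,c=verb,vtype=fiv}",
        "{lu=cost,ls=cost,c=verb,vtype=inf}",
        "{lu=cost,ls=cost,c=noun,nb=sg}",
    ],
    "reduction": ["{lu=reduction,ls=reduce,c=noun,nb=sg}"],
}


def readings_with_category(token: str, category: str) -> list[dict[str, str]]:
    """Return the parsed readings of a token whose category 'c' matches."""
    parsed = [parse_reading(r) for r in READINGS.get(token, [])]
    return [r for r in parsed if r["c"] == category]
```

The token "cost" has three readings (two verbal, one nominal), which is the ambiguity that shallow parsing later has to resolve.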

7 Shallow Parsing
Example: The company evaluated the cost reduction
- The company: NP
- evaluated: finite verb
- the cost reduction: NP ("cost" resolved as noun)
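To make the idea of ambiguity resolution plus NP identification concrete, here is a toy Python sketch. It is not the grammar used by AUTINDEX: the candidate table, the single disambiguation rule (prefer the noun reading after a determiner or noun) and the function names are all assumptions chosen for illustration.

```python
# Hypothetical candidate categories per token; a real system would take
# these from the morphological analyser.
CANDIDATES = {
    "the": {"det"},
    "company": {"noun"},
    "evaluated": {"verb"},
    "cost": {"verb", "noun"},
    "reduction": {"noun"},
}


def disambiguate(tokens: list[str]) -> list[str]:
    """Pick one category per token using a single toy context rule."""
    tags: list[str] = []
    for token in tokens:
        # Unknown words default to noun (an assumption of this sketch).
        candidates = CANDIDATES.get(token.lower(), {"noun"})
        if len(candidates) == 1:
            tag = next(iter(candidates))
        elif tags and tags[-1] in ("det", "noun") and "noun" in candidates:
            # Toy rule: after a determiner or noun, prefer the noun reading.
            tag = "noun"
        else:
            tag = sorted(candidates)[0]
        tags.append(tag)
    return tags


def chunk(tokens: list[str]) -> list[tuple[str, str]]:
    """Group determiner/noun sequences into NPs; keep other tokens as-is."""
    chunks: list[tuple[str, str]] = []
    current: list[str] = []
    for token, tag in zip(tokens, disambiguate(tokens)):
        if tag == "det" and current:
            # A determiner starts a new NP.
            chunks.append(("NP", " ".join(current)))
            current = []
        if tag in ("det", "noun"):
            current.append(token)
        else:
            if current:
                chunks.append(("NP", " ".join(current)))
                current = []
            chunks.append((tag, token))
    if current:
        chunks.append(("NP", " ".join(current)))
    return chunks
```

Calling `chunk("The company evaluated the cost reduction".split())` yields an NP, a verb and a second NP, with "cost" read as a noun inside the second NP.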

8 Controlled Indexing
- Identifies multiword terms and their syntactic variants
- Calculates keywords based on frequency and semantic weighting
- Checks thesaurus for relevant entry
- Classifies text

9 Linguistic Resources for Indexing
Multiword terms and variants:
- Direct match: cost reduction -> cost reduction
- Indirect match (inflectional differences): cost reduction -> cost reductions

10 Linguistic Resources for Indexing
- lexical synonyms: rise – increase
- derivational synonyms: biomagnetic – biomagnetism; air pollutant – air pollution

11 Linguistic Resources for Indexing
- structural variants: costs of reduction – reduction costs
- combined (structural plus derivational): transmitted DC power – DC power transmission; to calculate plane waves – plane wave calculation
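One way to picture how inflectional, derivational and structural variants can be matched to the same term is to reduce each multiword expression to a normalised key. The Python sketch below is an illustration only: the lookup table, the stopword list and the function name are hypothetical, and sorting the words is a deliberately coarse stand-in for real structural analysis (it discards head/modifier order, so it would also conflate terms that a proper grammar keeps apart).

```python
# Function words dropped before comparison (assumed list).
STOPWORDS = {"of", "to", "the"}

# Hypothetical table mapping surface forms to a shared base form.
# It covers both inflection (costs -> cost) and derivation
# (reduction -> reduce); a real system would use morpheme dictionaries.
BASE = {
    "costs": "cost",
    "reduction": "reduce",
    "reductions": "reduce",
    "transmitted": "transmit",
    "transmission": "transmit",
    "pollutant": "pollute",
    "pollution": "pollute",
}


def term_key(term: str) -> tuple[str, ...]:
    """Reduce a multiword term to an order-independent tuple of base forms."""
    words = [w for w in term.lower().split() if w not in STOPWORDS]
    return tuple(sorted(BASE.get(w, w) for w in words))
```

With this key, "cost reductions" matches "cost reduction", "costs of reduction" matches "reduction costs", and "transmitted DC power" matches "DC power transmission".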

12 Semantic Weighting
- 140 semantic types in dictionaries
- Weight assigned to nouns depending on semantic type
- Result of weighting: a set of keywords belonging to the most frequent semantic classes
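The slide does not give the weighting formula, so the following Python sketch only illustrates the general idea of combining frequency with a weight per semantic type. The semantic types, the weights and the choice to multiply frequency by weight are all assumptions of this sketch.

```python
from collections import Counter

# Hypothetical semantic types for a few noun lemmas.
SEMANTIC_TYPE = {
    "pollution": "process",
    "emission": "process",
    "filter": "instrument",
    "year": "time",
}

# Hypothetical weight per semantic type; untyped lemmas get 1.0.
TYPE_WEIGHT = {"process": 2.0, "instrument": 1.5, "time": 0.2}


def weighted_keywords(lemmas: list[str], top_n: int = 3) -> list[tuple[str, float]]:
    """Score each lemma as frequency * semantic-type weight (assumed formula)."""
    frequency = Counter(lemmas)
    scores = {
        lemma: count * TYPE_WEIGHT.get(SEMANTIC_TYPE.get(lemma), 1.0)
        for lemma, count in frequency.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

In this sketch a frequent but low-weight noun such as "year" ranks below a rarer noun of a more informative type, which is the effect semantic weighting is meant to achieve.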

13 Classification
- Descriptors annotated with classification code
- Hyperonym and synonym relations used
- Frequency used to calculate topic classification
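A frequency-based classification over code-annotated descriptors could look roughly like the Python sketch below. The descriptor-to-code table and the codes themselves are invented for illustration, and summing descriptor frequencies per code is an assumption; the slide only states that frequency is used.

```python
from collections import Counter
from typing import Optional

# Hypothetical thesaurus fragment: descriptor -> classification code.
DESCRIPTOR_CODE = {
    "air pollution": "ENV",
    "emission": "ENV",
    "cost reduction": "ECO",
}


def classify(descriptor_counts: dict[str, int]) -> Optional[str]:
    """Return the code whose descriptors occur most often, or None."""
    totals: Counter = Counter()
    for descriptor, count in descriptor_counts.items():
        code = DESCRIPTOR_CODE.get(descriptor)
        if code is not None:
            totals[code] += count
    if not totals:
        return None
    # Iterate codes alphabetically so ties resolve deterministically.
    return max(sorted(totals), key=totals.__getitem__)
```

For example, a text with three occurrences of "air pollution", two of "emission" and four of "cost reduction" would be assigned the code "ENV" here, because the two ENV descriptors together outweigh the single ECO descriptor.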

14 User-Specific Thesauri
- Keywords checked against thesaurus
- Hierarchical structure of thesaurus used to calculate descriptors:
  - hyperonym relations
  - synonym relations
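The use of synonym and hyperonym relations to derive descriptors can be sketched as a small lookup over a thesaurus. The Python example below is hypothetical throughout: the thesaurus entries are invented (apart from the rise/increase synonym pair, which appears earlier in the slides), and returning the whole chain of broader terms is one possible design, not necessarily what AUTINDEX does.

```python
# Hypothetical thesaurus fragment.
SYNONYM_OF = {"rise": "increase"}  # non-preferred term -> preferred term
BROADER = {  # term -> hyperonym (broader term)
    "air pollution": "pollution",
    "pollution": "environmental damage",
}


def descriptors_for(keyword: str) -> list[str]:
    """Map a keyword to its preferred term followed by its hyperonym chain."""
    preferred = SYNONYM_OF.get(keyword, keyword)
    chain = [preferred]
    seen = {preferred}
    while chain[-1] in BROADER:
        parent = BROADER[chain[-1]]
        if parent in seen:  # guard against cycles in the thesaurus
            break
        chain.append(parent)
        seen.add(parent)
    return chain
```

Here `descriptors_for("air pollution")` walks up the hierarchy to "pollution" and "environmental damage", while `descriptors_for("rise")` is first mapped to the preferred term "increase".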

15 Example Output
- Keywords:
  - List of descriptors from thesaurus plus weighting
  - List of free terms / free descriptors plus weighting
- Topic classification with relevant code

16 Free Indexing
- Free indexing follows the same steps as controlled indexing, but without the use of a thesaurus.
- The result is a list of free descriptors.

17 Architecture [diagram]

18 Bilingual Components
- Automatic language recognition
- Bilingual dictionaries
- Bilingual thesauri

19 Libraries & the Internet
Switch of focus from libraries to the Internet because of:
- Search engines, e.g. Google
- Poor access to library resources

20 Reasons for Poor Access
- search tools need full text match
- human indexation too general and inconsistent
- no flexibility in terms of semantic relations

21 AUTINDEX in Libraries
- A high percentage of all queries have no hit in the electronic library catalogue
- Of the remainder, a high percentage is not used

22 IntelligentCAPTURE
Complete processing chain for digital content in libraries:
- scanning of tables of contents
- treatment with OCR technology
- automatic indexation
- feeding results into library system
- integration of improved retrieval system

23 Dandelon database
- Supports 16 EU languages for multilingual retrieval
- Running in 4 countries at 9 libraries

24 Work Flow [diagram]

25 Summary
- AUTINDEX provides for controlled and free indexing
- Integrated in a complete processing chain
- AUTINDEX can be used to improve access to library resources through efficient methods of indexation

