Indexing Knowledge
Daniel Vasicek, 2014 March 27

2 Indexing Knowledge
Daniel Vasicek, 2014 March 27

3 Introduction
Basic topic is: All Human Knowledge
Who Cares?
Simple Examples

4 Basic Ideas
Concepts instead of keywords
– Thesauri instead of keywords
– Recognize emerging concepts
– Classification
Facilitate communication between environments (data translation)
Metadata for publications (XML, SQL, TXT)
– Indexing information

5 Topics to Cover
Programming language constructs needed.
What functionality do we need?
What do people pay Access Innovations to do?
Typical programming problems that I encounter.

6 Input Data
Formats
– XML-tagged metadata for publications
– SQL database
– Raw text
– Pictures of text
Quantities
– AIP: 304,910 authors as XML files; 807,005 XML files containing title, abstract, and metadata
– NICEM (National Information Center for Educational Media): 503,534 XML files describing available educational media; 26,144 XML files describing suppliers of educational media
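A minimal sketch of reading one XML metadata record of the kind listed above. The element names (`record`, `title`, `author`, `abstract`) and the sample values are hypothetical, since the slides do not show the actual AIP or NICEM schemas. Python is used here only for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: the real AIP and NICEM schemas are not shown in the slides.
SAMPLE = """<record>
  <title>Thermal noise in thin films</title>
  <author>A. Example</author>
  <author>B. Example</author>
  <abstract>We measure thermal noise in thin metallic films.</abstract>
</record>"""


def read_record(xml_text):
    """Return the title, abstract, and author list from one metadata record."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title", default=""),
        "abstract": root.findtext("abstract", default=""),
        "authors": [a.text for a in root.findall("author")],
    }


record = read_record(SAMPLE)
```

The title and abstract extracted this way are the text an indexer would then match against the thesaurus.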

7 Programming Languages Used
Visual Basic (1990s)
C++
Java (currently)

8 Who Cares?
AIP – American Institute of Physics (17 journals + conference proceedings)
IEEE – Institute of Electrical and Electronics Engineers (journals, standards, patents, …)
SPIE – International Society for Optics and Photonics
ACM – Association for Computing Machinery
Wolters Kluwer
PubMed

9 More Clients
Parliament of Victoria (5000 articles per day)
JSTOR (~10 million documents, some journals back to 1665)
PLOS (quick path to electronic publication)
DuPont
Dow
Council of Europe
Triumph Learning
ASCE, SAGE, SafetyLit, OSA, NICEM, NPR …

10 Useful Tools
Controlled Vocabulary: an organizational tool for capturing concepts
Proximity: a tool for capturing context
Hash Table (Content Addressable Array)
– Convenience
– Uniqueness
– Fast access
Regular Expressions

11 What’s a taxonomy?
Knowledge organization system
Words
– Controlled vocabulary for a subject area
Descriptive labels
Hierarchy
– Simple hierarchical view of a thesaurus
Storage and retrieval aid

12 Thesaurus Elements
Hierarchy
– Broader and narrower concepts
– Multiply connected “treelike” structure
Nodes in the thesaurus structure contain descriptions of concepts and links to broader, narrower, related, and similar concepts
Subject specific?

13 Structure of Controlled Vocabularies
[Figure: Flat List → Synonym Ring → Taxonomy → Thesaurus → Ontology, in order of increasing meaning and control. Features labeled across the figure: ambiguity, synonym, hierarchy, relationships, additional types of relationships. After ANSI/NISO Z39.19-2005, Figure 5.]

14 Thesaurus Node (Term)
[Diagram: the term Biology, with broader term Science and synonym Science of Life; the node also has a slot for a narrower term.]

15 Thesaurus Implementation
Terms (Concepts, Preferred Terms)
Broader Terms
Narrower Terms
Related Terms
Other Concepts
– Synonyms
– History
– Responsibility
– Backup
Rules to help identify the concept in text
Methods for maintaining the thesaurus
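One way to picture the node described on this slide is a small record type with one field per kind of link. This is only a sketch: the class name `Term`, the helper names, and the choice of sets are assumptions for illustration, not the implementation used by the author.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus node: a preferred term plus its links to other terms."""
    name: str
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)
    synonyms: set = field(default_factory=set)
    history: list = field(default_factory=list)  # free-text change notes


thesaurus = {}


def add_term(name):
    """Return the node for `name`, creating it on first use."""
    return thesaurus.setdefault(name, Term(name))


def link_broader(narrow, broad):
    """Record both directions of a broader/narrower relationship."""
    add_term(narrow).broader.add(broad)
    add_term(broad).narrower.add(narrow)


link_broader("Biology", "Science")
add_term("Biology").synonyms.add("Science of Life")
```

Writing both directions in a single helper is a design choice that keeps broader and narrower links in step with each other.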

16 Thesaurus Text Representation
Biology Science Science of Life Science Biology Science of Life

17 Thesaurus Problems
Missing terms: pointer links to a term that is not present
Broken loops
– Narrower term without matching broader term
– Broader term without matching narrower term
– Related term without a matching return relationship
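The problems listed on this slide can be detected mechanically. The sketch below assumes the thesaurus is held as three dictionaries (`bt`, `nt`, `rt`) that map a term to a set of linked terms; that layout, the function name, and the sample data are illustrative assumptions.

```python
def find_problems(terms, bt, nt, rt):
    """Return a list of (problem, term, target) tuples for a thesaurus.

    terms is the set of known terms; bt, nt, and rt map a term to the set
    of its broader, narrower, and related terms respectively.
    """
    problems = []
    # Missing terms: a link points at a term that is not present.
    for label, table in (("BT", bt), ("NT", nt), ("RT", rt)):
        for term, targets in table.items():
            for target in sorted(targets):
                if target not in terms:
                    problems.append((f"{label} points to missing term", term, target))
    # Broken loops: every link should have a matching link coming back.
    for term, targets in bt.items():
        for target in sorted(targets):
            if term not in nt.get(target, set()):
                problems.append(("BT without matching NT", term, target))
    for term, targets in nt.items():
        for target in sorted(targets):
            if term not in bt.get(target, set()):
                problems.append(("NT without matching BT", term, target))
    for term, targets in rt.items():
        for target in sorted(targets):
            if term not in rt.get(target, set()):
                problems.append(("RT without matching return RT", term, target))
    return problems


# Illustrative data: Physics lacks a return NT link, and Chemistry is not a known term.
terms = {"Science", "Biology", "Physics"}
bt = {"Biology": {"Science"}, "Physics": {"Science"}}
nt = {"Science": {"Biology"}}
rt = {"Biology": {"Chemistry"}}
problems = find_problems(terms, bt, nt, rt)
```

A link to a missing term is reported twice in this sketch, once as a missing term and once as a broken loop, because both descriptions apply.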

18 Proximity of Words
Adjacent
– Before
– After
Same sentence
Same paragraph
Within 50 words
Phrases (n-grams)
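Two of the proximity tests above, "within N words" and n-gram phrases, can be sketched in a few lines. The tokenizer here is deliberately crude and the function names are assumptions; sentence and paragraph tests would need a real text segmenter.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())


def within(tokens, a, b, window=50):
    """True if words a and b occur at most `window` token positions apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Biology is the science of life.")
near = within(tokens, "biology", "life", window=5)
phrases = ngrams(tokens, 3)
```

Setting `window=1` gives the adjacency test; comparing the two positions before taking the absolute value would distinguish "before" from "after".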

19 Content Addressable Array
T["Science"] = 1;
T["Biology"] = 1;
T["Science of Life"] = 1;
BT["Biology"] = "Science";
NT["Science"] = "Biology";
UF["Science of Life"] = "Biology";
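The same tables can be written as ordinary hash maps. The sketch below uses Python dictionaries and keeps the slide's table names; the two lookup helpers are additions for illustration.

```python
# Term table and relationship tables, keyed by the term text itself.
T = {"Science": 1, "Biology": 1, "Science of Life": 1}
BT = {"Biology": "Science"}          # broader term
NT = {"Science": "Biology"}          # narrower term
UF = {"Science of Life": "Biology"}  # synonym -> preferred term, as on the slide


def preferred(term):
    """Map a synonym to its preferred term; other terms map to themselves."""
    return UF.get(term, term)


def broader(term):
    """Return the broader term of a term or synonym, or None if there is none."""
    return BT.get(preferred(term))
```

Keying on the term string gives the three properties named on the "Useful Tools" slide: lookups are convenient, each key is stored once, and access does not slow down noticeably as the table grows.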

20 Regular Expressions
/^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$/ – Email addresses?
/ [A-Z][a-z]* / – Capitalized words
/[A-Z][a-zA-Z0-9,\"\- ]*\. / – Sentence? Paragraph?
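The three patterns can be tried out directly with Python's `re` module. The sample strings are made up for illustration, and the question marks on the slide are warranted: the email pattern, for example, rejects top-level domains longer than four letters.

```python
import re

# The slide's patterns, without the surrounding / delimiters.
EMAIL = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,4})$"
)
CAPITALIZED = re.compile(r" [A-Z][a-z]* ")
SENTENCE = re.compile(r'[A-Z][a-zA-Z0-9,"\- ]*\. ')

is_email = EMAIL.match("jane.doe@example.org") is not None
long_tld = EMAIL.match("curator@example.museum") is not None
words = CAPITALIZED.findall("the term Biology is narrower")
sentences = SENTENCE.findall("Biology is a science. Physics is too. ")
```

Note that the capitalized-word and sentence patterns require a trailing space, so a match at the very end of a text is missed unless the text is padded.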

21 Structure of Controlled Vocabularies
[Figure: Flat List → Synonym Ring → Taxonomy → Thesaurus → Ontology, in order of increasing meaning and control. Features labeled across the figure: ambiguity, synonym, hierarchy, relationships, additional types of relationships. After ANSI/NISO Z39.19-2005, Figure 5.]

