INFO624 -- Week 8 Subject Indexing & Knowledge Representation Dr. Xia Lin Assistant Professor College of Information Science and Technology Drexel University.

1 INFO624 -- Week 8
Subject Indexing & Knowledge Representation
Dr. Xia Lin, Assistant Professor
College of Information Science and Technology, Drexel University

2 Effective Information Retrieval
Data Structures
Knowledge Representation
  - From document representation to knowledge representation
User Interface and User Interaction

3 Document Representation
Vocabulary
Semantics
Implementation

4 Vocabulary
Controlled vocabulary
  - A list of terms selected for indexing purposes.
  - The terms are processed to reduce inconsistency and ambiguity.
  - Established selection rules and indexing rules
Uncontrolled vocabulary
  - Subject keywords
  - Metadata

5 Example: ACM record

6 Metadata
Data about data
Descriptive data
  - External to the meaning of the document
  - Dublin Core Metadata Element Set
  - Author, title, publisher, etc.
Semantic metadata
  - Subject keywords
Challenge: automatic generation of metadata for documents
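The split between descriptive and semantic metadata can be sketched in a few lines of Python. This is only an illustration: the element names follow the Dublin Core Metadata Element Set (which uses "creator" for the author), but the record values and the helper split_metadata are assumptions made for this example, not part of the lecture.

```python
# Dublin Core element names; the grouping into two sets follows the slide.
DESCRIPTIVE_ELEMENTS = {"title", "creator", "publisher", "date", "identifier"}
SEMANTIC_ELEMENTS = {"subject"}

# Hypothetical record; the values are assumed for illustration only.
record = {
    "title": "Subject Indexing & Knowledge Representation",
    "creator": "Xia Lin",
    "publisher": "Drexel University",
    "subject": ["subject indexing", "knowledge representation"],
}


def split_metadata(rec):
    """Separate descriptive elements from semantic (subject) elements."""
    descriptive = {k: v for k, v in rec.items() if k in DESCRIPTIVE_ELEMENTS}
    semantic = {k: v for k, v in rec.items() if k in SEMANTIC_ELEMENTS}
    return descriptive, semantic


descriptive, semantic = split_metadata(record)
```

The descriptive part can usually be copied from the document's title page, while the subject element is the part that is hard to generate automatically.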

7 Semantics
Semantics is the study of meaning
  - Relational semantics
      - Synonymy, hierarchy, etc.
  - Referential semantics
      - Homonyms; techniques used to limit the meanings or referents of terms
  - Category semantics
      - Facets or other partitions

8 Example: Mercury?
  - Mercury (car)
  - Mercury (planet)
  - Mercury (metal)
  - Mercury (Roman god)
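A parenthetical qualifier is one way to limit the referent of a homonym. The sketch below groups a few of the qualified terms from the example by their base word; the function names split_qualifier and senses are hypothetical, chosen only for this illustration.

```python
import re
from collections import defaultdict

# A subset of the qualified terms from the example above.
QUALIFIED_TERMS = ["Mercury (car)", "Mercury (planet)", "Mercury (metal)"]


def split_qualifier(term):
    """Split 'Mercury (planet)' into ('Mercury', 'planet')."""
    match = re.fullmatch(r"(.+?)\s*\((.+)\)", term)
    if match:
        return match.group(1), match.group(2)
    return term, None  # term carries no qualifier


def senses(terms):
    """Group qualifiers by base term so a homonym's senses can be listed."""
    by_base = defaultdict(list)
    for term in terms:
        base, qualifier = split_qualifier(term)
        by_base[base].append(qualifier)
    return dict(by_base)


mercury_senses = senses(QUALIFIED_TERMS)
```

A search interface could use such a grouping to ask the user which sense of "Mercury" is intended before running the query.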

9 Implementation
Standards
  - AACR2
  - ISO standard for indexing (ISO 5963)
  - ISO standard for thesaurus construction (ISO 2788)
Rules
  - Classification rules
  - Evaluation rules

10 Subject Indexing
A human analytic process for identifying, selecting, and representing document concepts
  - Create indexing languages
      - Use standardized, limited vocabularies for indexing purposes.
  - Assign indexing terms to documents
      - Use only the terms in the selected indexing language.

11 Basic Processes of Subject Indexing
Identify the concepts that represent the subject and purpose of a document.
Decide which of these concepts are important for retrieving the document.
Express the concepts needed for retrieval in the indexing language used.
Use uncontrolled vocabulary for concepts that the indexing language does not represent, or does not represent specifically enough.

12 Controlled Vocabulary
Goals:
  - To permit easy location of documents by topic.
  - To define topic areas, and hence relate one document to another.
  - To provide multiple access points to documents.
  - To enforce uniformity throughout an information retrieval system.

13 Controlled Vocabulary
Formats:
  - Hierarchical classified list
      - Hierarchical subject descriptors
      - Associative cross-references
      - Classification notation (codes)
  - Alphabetical list
      - Includes both descriptors and other lead-in terms

14 Main Components in a Controlled Vocabulary
Keyword/Descriptor
Synonymous Term
Broader Term
Narrower Term
Related Term

15 Example
Keyword: Cancer
Terms shown in the diagram: Malignancy; Malignant tumor; Cancer morphology; Diseases; Neoplasms; Malignant neoplasm of skin; Breast Cancer; Primary malignant neoplasm of liver; Abdominal Neoplasms; Hyperplasia; Seminoma
Relationship labels: Broader Terms, Related Terms, Narrower Terms, Synonyms

16 Example: MeSH – Medical Subject Headings
  - 22,568 descriptors
  - 139,000 headings (Supplementary Concept Records)
  - Thousands of cross-references
      - e.g., Vitamin C see Ascorbic Acid.
  - Used for indexing MEDLINE
MeSH Browser
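A "see" cross-reference maps a lead-in term that a searcher might type to the descriptor actually used for indexing. A minimal sketch, assuming a toy vocabulary that contains only the single cross-reference from the example (the names USE_REFERENCES, DESCRIPTORS, and resolve are hypothetical and do not reflect how MeSH itself is implemented):

```python
# Lead-in term (lowercased) -> preferred descriptor.
USE_REFERENCES = {"vitamin c": "Ascorbic Acid"}

# Descriptors that may be assigned to documents.
DESCRIPTORS = {"Ascorbic Acid"}


def resolve(term):
    """Return the preferred descriptor for a term, or None if it is unknown."""
    key = term.strip().lower()
    if key in USE_REFERENCES:
        return USE_REFERENCES[key]
    for descriptor in DESCRIPTORS:
        if descriptor.lower() == key:
            return descriptor
    return None
```

With this mapping, a search for "Vitamin C" and a search for "Ascorbic Acid" both end up at the same descriptor, which is the point of the cross-reference.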

17 MeSH Tree Structures - 2004
  1. Anatomy [A]
  2. Organisms [B]
  3. Diseases [C]
  4. Chemicals and Drugs [D]
  5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
  6. Psychiatry and Psychology [F]
  7. Biological Sciences [G]
  8. Physical Sciences [H]
  9. Anthropology, Education, Sociology and Social Phenomena [I]
  10. Technology and Food and Beverages [J]
  11. Humanities [K]
  12. Information Science [L]
  13. Persons [M]
  14. Health Care [N]
  15. Geographic Locations [Z]

18 ERIC Thesaurus
More than 10,000 terms or subject headings used in indexing and searching ERIC records.
A supplemental list of over 55,000 terms or subject headings, including
  - proper names (e.g., geographic, personal, institutional, project, equipment, and test names), or
  - concepts not yet represented by the controlled vocabulary of the ERIC Thesaurus.

19 Controlled Vocabulary
Examples:
  - Case studies: Descriptor
  - SN: Detailed analyses, usually focusing on a particular problem of an individual, group, or organization (note: do not confuse with “medical case histories”)
  - NT: Cross sectional studies; Longitudinal studies

20 Examples (Case Studies)
  - BT: Evaluation methods; Research
  - RT: Case records; Counseling; Qualitative research
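The "Case studies" entry above can be written down as a small data structure, which makes the roles of SN (scope note), BT, NT, and RT concrete. This is a sketch only: the class ThesaurusEntry and the function expand_query are hypothetical names, and broadening a search with narrower terms is just one possible use of the relationships.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    descriptor: str
    scope_note: str = ""                          # SN
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT


case_studies = ThesaurusEntry(
    descriptor="Case studies",
    scope_note=("Detailed analyses, usually focusing on a particular problem "
                "of an individual, group, or organization"),
    broader=["Evaluation methods", "Research"],
    narrower=["Cross sectional studies", "Longitudinal studies"],
    related=["Case records", "Counseling", "Qualitative research"],
)


def expand_query(entry, include_related=False):
    """Return the descriptor plus its narrower (and optionally related) terms."""
    terms = [entry.descriptor] + entry.narrower
    if include_related:
        terms += entry.related
    return terms
```

Calling expand_query(case_studies) yields the descriptor followed by its two narrower terms; passing include_related=True also adds the three related terms.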

21 Advantages of Subject Indexing
Facilitates concept search
  - Search by topics/subjects, not just by words
  - Link related documents by subject terms
  - Make implicit information explicit
Provides a standard terminology to index and search documents.
  - Use a small indexing vocabulary
  - Help the searcher find related terms

22 Disadvantages of Subject Indexing
Expensive manual operations
  - To construct the controlled vocabulary
  - To assign terms to documents
Difficult to keep up to date
  - Terminology changes very fast
  - New terms are added daily.
Inconsistent process of human indexing
  - Different indexers assign different indexing terms to the same documents.
  - The user may not search with the same terms that the indexer used to index the documents.

23 Document Representation
Inverted indexing
  - Represents a document as a list of the terms that occur in the document
  - Computer-based indexing
  - Statistics-based indexing
Subject indexing
  - Represents a document as a list of subject terms drawn from a controlled vocabulary.
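The two representations can be contrasted with a short sketch. The two document texts and the subject assignments below are invented for illustration; the subject terms stand in for what a human indexer would assign from a controlled vocabulary.

```python
import re
from collections import defaultdict

# Hypothetical documents (id -> text), assumed for this example.
docs = {
    1: "Vitamin C intake and the common cold",
    2: "Ascorbic acid levels in citrus fruit",
}

# Hypothetical subject terms assigned by an indexer from a controlled vocabulary.
subject_terms = {1: ["Ascorbic Acid"], 2: ["Ascorbic Acid"]}


def build_inverted_index(documents):
    """Map every word that occurs in a document to the ids of those documents."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def build_subject_index(assignments):
    """Map every assigned subject term to the ids of the documents carrying it."""
    index = defaultdict(set)
    for doc_id, terms in assignments.items():
        for term in terms:
            index[term].add(doc_id)
    return index


word_index = build_inverted_index(docs)
subject_index = build_subject_index(subject_terms)
```

In this toy collection the word "vitamin" finds only document 1 and "ascorbic" only document 2, while the subject term "Ascorbic Acid" retrieves both, because the indexer mapped two different wordings to one concept.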

24 Considerations of Document Representation
Any format of document representation needs to maintain a balance of its
  - Discriminating power
  - Descriptiveness
  - Similarity identification
  - Conciseness

25 Considerations of DR
Discriminating power
  - To identify a document uniquely
  - To reduce ambiguity
  - Examples:
      - ISBNs for books
      - Bar codes for products

26 Considerations of DR
Descriptiveness
  - Describe all the information as completely as possible
      - Full text
      - Abstracts
      - Extracts
      - Reviews
  - Completeness and correctness

27 Considerations of DR
Similarity identification
  - To group similar documents
      - Keywords or subject indexing
      - Book classification numbers
  - It is difficult for the computer to assign keywords, subject descriptors, or classification numbers to documents

28 Considerations of DR
Conciseness
  - Simple and clear
  - Reduces processing time and storage space
  - Examples:
      - Authors and titles

29 Relationships of the four considerations
Higher discriminating power may lower the capability of identifying similarities among documents.
Good descriptiveness may defeat conciseness.
What’s good for the computer may not always be good for the user.
A good representation should seek a balance of the four and take both the computer and the user into consideration.

30 What’s missing in DR?
Intelligent reasoning!
Knowledge base
  - Ontology
  - Semantic networks
Uncertainty (imprecision) handling

31 Knowledge Representation
Encoding human knowledge, in all its various forms, in such a way that the knowledge can be used.
  - A successful representation of some knowledge must be in a form that is understandable by humans, and must cause the system using the knowledge to behave as if it knows it.

32 Knowledge Representation
A knowledge representation (KR) is most fundamentally a surrogate, a substitute for the thing itself.
It is a set of ontological commitments, i.e., an answer to the question: In what terms should I think about the world?
It is a fragmentary theory of intelligent reasoning, expressed in terms of three components: (i) the representation's fundamental conception of intelligent reasoning; (ii) the set of inferences the representation sanctions; and (iii) the set of inferences it recommends.

33 Knowledge Representation
It is a medium for pragmatically efficient computation, i.e., the computational environment in which thinking is accomplished. One contribution to this pragmatic efficiency is supplied by the guidance a representation provides for organizing information so as to facilitate making the recommended inferences.
It is a medium of human expression, i.e., a language in which we say things about the world.
  - From http://medg.lcs.mit.edu/ftp/psz/k-rep.html
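The idea that a representation sanctions certain inferences can be illustrated with a very small semantic network. The sketch below reuses the broader-term links of the "Case studies" example from the ERIC Thesaurus slides and treats "broader term" as a transitive relation; that treatment, and the names BROADER, ancestors, and is_kind_of, are assumptions made for this illustration rather than anything the lecture prescribes.

```python
# Each term points to its directly broader terms (edges of the network).
BROADER = {
    "Longitudinal studies": {"Case studies"},
    "Cross sectional studies": {"Case studies"},
    "Case studies": {"Evaluation methods", "Research"},
}


def ancestors(term, broader=BROADER):
    """Collect every term reachable by following broader-term links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_kind_of(term, candidate):
    """Inferred fact: 'term' falls under 'candidate' directly or indirectly."""
    return candidate in ancestors(term)
```

Nothing in the stored links says that longitudinal studies fall under research, yet is_kind_of("Longitudinal studies", "Research") can derive it. A document representation alone stores the terms; the knowledge representation also supports this kind of reasoning over them.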

34 Intelligent Information Retrieval
Information retrieval supported by knowledge representation, rather than document representation.
Useful links
  - Stanford
  - Agent-based IR

