CS 502: Computing Methods for Digital Libraries. Lecture 12: Information Retrieval II


1 CS 502: Computing Methods for Digital Libraries. Lecture 12: Information Retrieval II

2 Administration
Open laptop examination
Laptops -- bring them
Wireless modems -- use them
Communication with others during examination -- NO!
Submission of answers: either email solutions to wya@cs.cornell.edu or write on paper

3 Vector Space Methods: Concept
The space has n dimensions, where n is the total number of different words in the set of documents. Each document is represented by a vector whose magnitude in each dimension equals the number of times the corresponding word appears in the document. Similarity between two documents is measured by the angle between their vectors: the smaller the angle, the more similar the documents.

4 Example
D1 -> ant ant bee
D2 -> bee hog ant dog
D3 -> cat gnu dog eel fox

      ant  bee  cat  dog  eel  fox  gnu  hog   length
D1     2    1    0    0    0    0    0    0    √5
D2     1    1    0    1    0    0    0    1    √4
D3     0    0    1    1    1    1    1    0    √5
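The table above can be reproduced with a few lines of Python. This is a minimal sketch: the names `docs`, `vocabulary`, `term_vector`, `vectors`, and `lengths` are illustrative and do not come from the lecture.

```python
import math
from collections import Counter

# The three example documents from the slide.
docs = {
    "D1": "ant ant bee",
    "D2": "bee hog ant dog",
    "D3": "cat gnu dog eel fox",
}

# One dimension per distinct word, in alphabetical order.
vocabulary = sorted({word for text in docs.values() for word in text.split()})

def term_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

vectors = {name: term_vector(text) for name, text in docs.items()}

# Euclidean length of each document vector.
lengths = {
    name: math.sqrt(sum(c * c for c in vec)) for name, vec in vectors.items()
}

for name in docs:
    print(name, vectors[name], round(lengths[name], 3))
```

Each printed row lists the counts in the column order ant, bee, cat, dog, eel, fox, gnu, hog, followed by the vector length.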

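5 Angle between two vectors
Vectors:
\[ x = (x_1, x_2, x_3, \ldots, x_n), \qquad y = (y_1, y_2, y_3, \ldots, y_n) \]
Inner product:
\[ x \cdot y = x_1 y_1 + x_2 y_2 + x_3 y_3 + \cdots + x_n y_n = \mathrm{len}(x)\,\mathrm{len}(y)\cos\theta \]
Similarity (one product per word, in the order ant, bee, cat, dog, eel, fox, gnu, hog):
\[ d(D_1, D_2) = \cos\theta = \frac{2 \cdot 1 + 1 \cdot 1 + 0 \cdot 0 + 0 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 1}{\sqrt{5}\,\sqrt{4}} = \frac{3}{2\sqrt{5}} \approx 0.67 \]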
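As a check on the same formula, the two remaining pairs from the example can be worked the same way. D2 and D3 share only the word dog, and D1 and D3 share no words, so every other product in the numerator is zero:

```latex
\[ d(D_2, D_3) = \frac{1 \cdot 1}{\sqrt{4}\,\sqrt{5}} = \frac{1}{2\sqrt{5}} \approx 0.22 \]
\[ d(D_1, D_3) = \frac{0}{\sqrt{5}\,\sqrt{5}} = 0 \]
```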

6 Example (continued)
Similarity of documents in example:

       D1     D2     D3
D1     1      0.67   0
D2     0.67   1      0.22
D3     0      0.22   1

This similarity measure reflects only the number of occurrences of words, not other characteristics of the documents.
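The whole similarity matrix can be computed directly from word counts. The sketch below is one possible way to do it; the `cosine` helper and the `matrix` dictionary are illustrative names, not part of the lecture.

```python
import math
from collections import Counter

# Word counts for the three example documents.
counts = {
    "D1": Counter("ant ant bee".split()),
    "D2": Counter("bee hog ant dog".split()),
    "D3": Counter("cat gnu dog eel fox".split()),
}

def cosine(a, b):
    """Cosine of the angle between two word-count vectors."""
    # Only words present in both documents contribute to the inner product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    len_a = math.sqrt(sum(c * c for c in a.values()))
    len_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (len_a * len_b)

# Pairwise similarities, rounded to two places as on the slide.
matrix = {
    (p, q): round(cosine(counts[p], counts[q]), 2)
    for p in counts
    for q in counts
}

for p in counts:
    print(p, [matrix[(p, q)] for q in counts])
```

Storing counts sparsely (only the words that occur) avoids building the full n-dimensional vector for each document, which matters once the vocabulary is large.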

7 Latent Semantic Indexing
General concept: the term-document matrix is very big, but very sparse (50,000 distinct words create 50,000 dimensions). Find a small set of dimensions (perhaps 100) that can substitute for the information in the larger space, using singular value decomposition.
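To make the idea of a reduced space concrete, the sketch below projects the three example documents onto their two strongest latent dimensions. It is an illustration under stated assumptions, not the lecture's method: instead of calling a library SVD routine it uses plain power iteration with deflation on the document-by-document inner-product matrix, which gives the same leading singular directions for a matrix this small. The function names and the choice of k = 2 are hypothetical.

```python
import math

# Document-term counts (rows: D1, D2, D3;
# columns: ant bee cat dog eel fox gnu hog).
M = [
    [2, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 1, 1, 0],
]

def gram(rows):
    """Document-by-document inner products (M times its transpose)."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

def top_eigenpair(g, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of g."""
    v = [1.0] * len(g)
    for _ in range(iterations):
        w = [sum(gij * vj for gij, vj in zip(row, v)) for row in g]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    gv = [sum(gij * vj for gij, vj in zip(row, v)) for row in g]
    eigenvalue = sum(a * b for a, b in zip(v, gv))  # Rayleigh quotient
    return eigenvalue, v

def reduced_coordinates(rows, k):
    """Coordinates of each document in the k strongest latent dimensions."""
    g = gram(rows)
    n = len(g)
    coords = [[] for _ in rows]
    for _ in range(k):
        lam, v = top_eigenpair(g)
        sigma = math.sqrt(max(lam, 0.0))  # singular value
        for i, vi in enumerate(v):
            coords[i].append(sigma * vi)
        # Deflate: remove the direction just found before the next pass.
        g = [[g[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

coords = reduced_coordinates(M, k=2)
for name, c in zip(("D1", "D2", "D3"), coords):
    print(name, [round(x, 3) for x in c])
```

Each document is now described by two numbers instead of eight. For a real collection a sparse SVD routine from a numerical library would be the practical choice.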

8 Statistical methods
General concept: certain groups of words frequently appear in sequence (Cornell University ...) or close to each other in a text (marriage, wedding, bride). Statistical methods can be used to relate similar documents.
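Both kinds of evidence, words in sequence and words near each other, can be gathered by simple counting. The following is a minimal sketch; the sample sentence, the function names, and the window size of five words are all made up for illustration.

```python
from collections import Counter

def adjacent_pairs(words):
    """Count pairs of words that appear directly in sequence."""
    return Counter(zip(words, words[1:]))

def window_pairs(words, window=5):
    """Count unordered pairs of distinct words within `window` positions."""
    counts = Counter()
    for i, word in enumerate(words):
        # Look at the next window-1 words after position i.
        for other in words[i + 1 : i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

# Hypothetical sample text, lowercased and split on spaces.
text = (
    "the bride arrived at the wedding and the wedding began at "
    "cornell university near cornell university library"
)
words = text.split()

sequence_counts = adjacent_pairs(words)
nearby_counts = window_pairs(words)

print(sequence_counts[("cornell", "university")])
print(nearby_counts[("bride", "wedding")])
```

Raw counts like these are only a starting point; in practice they would be compared against how often each word occurs on its own before treating a pair as related.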

9 Computational linguistics
Natural language processing is the branch of computer science that uses computers to interpret and manipulate words as part of a language. Computational linguistics deals with grammar and linguistics.
Morphology studies variants of words derived from the same stem, such as plurals (library, libraries), and verb forms (look, looks, looked).
Parsing analyzes the structure of sentences. It categorizes words by part of speech (verb, noun, adjective, etc.), groups them into phrases and clauses, and identifies structural elements (subject, verb, object, etc.).

10 Information retrieval: stemming
Stemming reduces morphological variants to a common stem, which is then used as the search term. It is superior to truncation: compare and comparison are morphological variants, but compare and company, which share the same leading letters, are morphologically different. In English, the stem is usually at the beginning of the word, so only endings need to be trimmed. In German, it is also necessary to trim at the beginning of words.
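The difference between stemming and truncation shows up even with a toy example. The sketch below strips a short, hand-picked list of English suffixes; the list, the minimum stem length, and the function names are assumptions made for illustration, and this is far cruder than a real stemming algorithm.

```python
# Hypothetical suffix list, checked in the order given (longest first).
SUFFIXES = ("ison", "ing", "ed", "es", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def truncate(word, n=4):
    """Truncation keeps only the first n letters, whatever the word is."""
    return word[:n]

for word in ("compare", "comparison", "company"):
    print(word, "stem:", toy_stem(word), "truncated:", truncate(word))
```

The stemmer maps compare and comparison to the same stem and leaves company alone, while four-letter truncation maps all three words to the same search term.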

11 Information retrieval: noun phrases
Noun phrases are groups of words that have the grammatical function of a noun within a sentence, for example:
the house with the white shutters
computational linguistics
negative advertising
Search methods that identify noun phrases within documents and in queries have high precision in retrieval.

12 Lexicon and thesaurus
A lexicon contains information about words, their morphological variants, and their grammatical usage.
A thesaurus relates words by meaning:
ship, vessel, sail; craft, navy, marine, fleet, flotilla
book, writing, work, volume, tome, tract, codex
search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)
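One way a retrieval system might use such groups is to expand each query word with its related words. The sketch below assumes that use; the word groups are the three quoted above, while `related_terms` and `expand_query` are hypothetical helper names.

```python
# Word groups copied from the thesaurus examples above.
GROUPS = [
    {"ship", "vessel", "sail", "craft", "navy", "marine", "fleet", "flotilla"},
    {"book", "writing", "work", "volume", "tome", "tract", "codex"},
    {"search", "discovery", "detection", "find", "revelation"},
]

def related_terms(word):
    """All words sharing a group with `word`, or just the word itself."""
    related = set()
    for group in GROUPS:
        if word in group:
            related |= group
    return related or {word}

def expand_query(query):
    """Replace each query word by its related words, without duplicates."""
    terms = []
    for word in query.lower().split():
        for term in sorted(related_terms(word)):
            if term not in terms:
                terms.append(term)
    return terms

print(expand_query("rare codex"))
```

Expansion of this kind finds more documents, at the risk of matching words whose meaning in context differs from the query.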

13 Art and Architecture Thesaurus
Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture.
Almost 120,000 terms for objects, textual materials, images, architecture and culture from all periods and all cultures.
Used by archives, museums, and libraries to describe items in their collections.
Used to search for materials.
Used by computer programs, for information retrieval, and natural language processing.
A project of the J. Paul Getty Trust

14 Art and Architecture Thesaurus
Categories: associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects.
Concept: a cluster of terms, one of which is established as the preferred term, or descriptor.
Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism.

15 MeSH -- medical subject headings
About 18,000 primary subject headings, plus a thesaurus of about 80,000 chemical terms.
Organized in a hierarchy of general terms, e.g., anatomy, organisms, and diseases. Anatomy is divided into sixteen topics, e.g., body regions and musculoskeletal system; body regions is divided into sections, e.g., abdomen, axilla, and back.
The National Library of Medicine provides MeSH subject headings for each of the 400,000 articles that it indexes every year.
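A hierarchy like this is naturally stored as a tree. The sketch below holds only the handful of headings named above in a nested dictionary; `MESH_SAMPLE` and the helper functions are illustrative names, and the real vocabulary is far larger than this fragment.

```python
# Only the headings mentioned above; everything else is omitted.
MESH_SAMPLE = {
    "anatomy": {
        "body regions": {"abdomen": {}, "axilla": {}, "back": {}},
        "musculoskeletal system": {},
    },
    "organisms": {},
    "diseases": {},
}

def path_to(term, tree=MESH_SAMPLE, prefix=()):
    """Path of headings from the top of the hierarchy down to `term`."""
    for heading, children in tree.items():
        path = prefix + (heading,)
        if heading == term:
            return path
        found = path_to(term, children, path)
        if found:
            return found
    return None

def subtree(term, tree=MESH_SAMPLE):
    """The part of the hierarchy directly below `term`, or None."""
    for heading, children in tree.items():
        if heading == term:
            return children
        found = subtree(term, children)
        if found is not None:
            return found
    return None

def narrower_terms(term):
    """Every heading below `term`, listed depth first."""
    children = subtree(term)
    if children is None:
        return []
    result = []
    for heading in children:
        result.append(heading)
        result.extend(narrower_terms(heading))
    return result

print(" > ".join(path_to("axilla")))
print(narrower_terms("anatomy"))
```

A search for a general heading can then be widened to include all of its narrower headings, and a specific heading can be displayed with its full path.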

