1 CS 502: Computing Methods for Digital Libraries Lecture 12 Information Retrieval II.


1 CS 502: Computing Methods for Digital Libraries Lecture 12 Information Retrieval II

2 Administration. Open laptop examination. Laptops -- bring them. Wireless modems -- use them. Communication with others during examination -- NO! Submission of answers: either solutions to [address not preserved] or write on paper.

3 Vector Space Methods: Concept. Documents are placed in an n-dimensional space, where n is the total number of different words in the set of documents. Each document is represented by a vector whose magnitude in each dimension equals the number of times the corresponding word appears in the document. Similarity between two documents is measured by the angle between their vectors: the smaller the angle, the more similar the documents.

4 Example
D1 -> ant ant bee
D2 -> bee hog ant dog
D3 -> cat gnu dog eel fox

      ant  bee  cat  dog  eel  fox  gnu  hog   length
D1     2    1    0    0    0    0    0    0    √5
D2     1    1    0    1    0    0    0    1    √4
D3     0    0    1    1    1    1    1    0    √5

5 Angle between two vectors. Vectors: $x = (x_1, x_2, x_3, \ldots, x_n)$ and $y = (y_1, y_2, y_3, \ldots, y_n)$. Inner product: $x \cdot y = x_1 y_1 + x_2 y_2 + x_3 y_3 + \cdots + x_n y_n = \mathrm{len}(x)\,\mathrm{len}(y)\cos\theta$. Similarity: $d(D1, D2) = \cos\theta = (2 \cdot 1 + 1 \cdot 1)/(\sqrt{5}\,\sqrt{4}) \approx 0.67$

6 Example (continued). Similarity of documents in example:

       D1     D2     D3
D1     1      0.67   0
D2     0.67   1      0.22
D3     0      0.22   1

Similarity reflects only the number of occurrences of words, not other characteristics of the documents.
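The example on slides 4 to 6 can be reproduced with a short script. This is a minimal sketch, assuming whitespace tokenization and raw term counts as described on the slides; the function names `term_vector`, `length`, and `cosine` are illustrative choices, not part of the lecture.

```python
import math
from collections import Counter

# The three example documents from the slides.
docs = {
    "D1": "ant ant bee",
    "D2": "bee hog ant dog",
    "D3": "cat gnu dog eel fox",
}

def term_vector(text):
    """Map each word to the number of times it occurs (sparse vector)."""
    return Counter(text.split())

def length(vec):
    """Euclidean length of a term vector."""
    return math.sqrt(sum(count * count for count in vec.values()))

def cosine(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    return dot / (length(a) * length(b))

vectors = {name: term_vector(text) for name, text in docs.items()}

# Print the similarity matrix, one row per document.
for x in vectors:
    print(x, [round(cosine(vectors[x], vectors[y]), 2) for y in vectors])
```

Storing each vector as a `Counter` keeps only the non-zero dimensions, which matters once the vocabulary grows to tens of thousands of words.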

7 Latent Semantic Indexing. General concept: the term-document matrix is very big, but very sparse (50,000 distinct words create 50,000 dimensions). Find a small set of dimensions (perhaps 100) that can substitute for the information in the larger space [singular value decomposition].
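The dimension reduction can be written out as a truncated singular value decomposition. The notation here is an assumption, not taken from the slide: $A$ is the $t \times d$ term-document matrix ($t$ terms, $d$ documents) and $k$ is the number of dimensions kept (perhaps 100).

```latex
% Full SVD of the term-document matrix A (t terms, d documents)
A = U \Sigma V^{T}

% Keep only the k largest singular values and their singular vectors
A_k = U_k \Sigma_k V_k^{T}, \qquad k \ll \min(t, d)

% Each document j is then represented by a k-dimensional vector,
% e.g. column j of \Sigma_k V_k^{T}, instead of a t-dimensional one.
```

Among all matrices of rank at most $k$, $A_k$ is the closest to $A$ in the least-squares (Frobenius norm) sense, which is why a small $k$ can stand in for the much larger space.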

8 Statistical methods. General concept: certain groups of words frequently appear in sequence (e.g., Cornell University) or close to each other in a text (e.g., marriage, wedding, bride). Statistical methods can be used to relate similar documents.
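One simple way to find such word groups is to count how often two words occur near each other. The following is a rough sketch under stated assumptions: the function name `cooccurrence_counts`, the `max_distance` parameter, and the sample sentence are all illustrative and do not come from the lecture.

```python
from collections import Counter

def cooccurrence_counts(tokens, max_distance=1):
    """Count unordered pairs of distinct words at most `max_distance` tokens apart."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only forward so that each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + max_distance]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

# Hypothetical sample text; max_distance=1 counts adjacent words only.
tokens = "cornell university library at cornell university".split()
counts = cooccurrence_counts(tokens, max_distance=1)
print(counts.most_common(3))
```

With `max_distance=1` the counts capture words that appear in sequence; a larger value captures words that merely appear close to each other.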

9 Computational linguistics Natural language processing is the branch of computer science that uses computers to interpret and manipulate words as part of a language. Computational linguistics deals with grammar and linguistics. Morphology studies variants of words derived from the same stem, such as plurals (library, libraries), and verb forms (look, looks, looked). Parsing analyzes the structure of sentences. It categorizes words by part of speech (verb, noun, adjective, etc.), groups them into phrases and clauses, and identifies structural elements (subject, verb, object, etc.).

10 Information retrieval: stemming. Stemming reduces morphological variants to a common stem, which is then used as the search term. It is superior to truncation: e.g., compare and comparison are morphological variants, but compare and company are morphologically different. In English, the stem is usually at the beginning of the word. In German, it is also necessary to trim at the beginning of words.
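The difference between stemming and truncation can be seen with a toy example. This is a deliberately tiny sketch: the suffix list and the names `toy_stem` and `truncation_match` are assumptions made for illustration, and a real stemmer uses a far larger set of rules.

```python
# Hypothetical, very small suffix list, checked in this order.
SUFFIXES = ("ison", "ed", "e", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def truncation_match(prefix, word):
    """Truncation search: any word that starts with the prefix matches."""
    return word.startswith(prefix)

words = ["compare", "comparison", "company"]

# Stemming groups the two true variants and keeps "company" apart.
print({w: toy_stem(w) for w in words})

# Truncating the query to "compa" matches all three words.
print([w for w in words if truncation_match("compa", w)])
```

Truncation treats any shared prefix as evidence of relatedness, whereas the stemmer only merges words that differ by a recognized ending.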

11 Information retrieval: noun phrases. Noun phrases are groups of words that have the grammatical function of a noun within a sentence, e.g., "the house with the white shutters", "computational linguistics", "negative advertising". Search methods that identify noun phrases within documents and in queries have a high precision in retrieval.

12 Lexicon and thesaurus. Lexicon contains information about words, their morphological variants, and their grammatical usage. Thesaurus relates words by meaning:
ship, vessel, sail; craft, navy, marine, fleet, flotilla
book, writing, work, volume, tome, tract, codex
search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)

13 Art and Architecture Thesaurus Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture. Almost 120,000 terms for objects, textual materials, images, architecture and culture from all periods and all cultures. Used by archives, museums, and libraries to describe items in their collections. Used to search for materials. Used by computer programs, for information retrieval, and natural language processing. A project of the J. Paul Getty Trust

14 Art and Architecture Thesaurus Categories: associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects. Concept: a cluster of terms, one of which is established as the preferred term, or descriptor. Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism.
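The idea of a concept as a cluster of terms with one preferred term (descriptor) can be sketched as a lookup table. The terms below are borrowed from the Roget example on slide 12 purely for illustration; they are not actual Art and Architecture Thesaurus entries, and the choice of descriptor for each cluster is an assumption.

```python
# Hypothetical concepts: descriptor -> cluster of terms (NOT real AAT data).
concepts = {
    "ship": {"ship", "vessel", "craft"},
    "book": {"book", "volume", "tome"},
}

# Invert the clusters so that any term leads to its descriptor.
to_descriptor = {
    term: descriptor
    for descriptor, terms in concepts.items()
    for term in terms
}

def normalize(term):
    """Return the preferred term for `term`, or None if it is not in the vocabulary."""
    return to_descriptor.get(term.lower())

print(normalize("vessel"))
print(normalize("Tome"))
```

Indexing items under the descriptor while accepting any term of the cluster in a query is what lets a controlled vocabulary serve both description and search.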

15 MeSH -- medical subject headings. About 18,000 primary subject headings, plus a thesaurus of about 80,000 chemical terms. Organized in a hierarchy: general terms, e.g., anatomy, organisms, and diseases; anatomy is divided into sixteen topics, e.g., body regions and musculoskeletal system; body regions is divided into sections, e.g., abdomen, axilla, back, etc. The National Library of Medicine provides MeSH subject headings for each of the 400,000 articles that it indexes every year.
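The hierarchy described above maps naturally onto a nested structure. This sketch uses only the handful of headings named on the slide, so it is a small fragment rather than real MeSH data; the names `mesh_sample` and `path_to` are illustrative.

```python
# Fragment of the hierarchy, limited to the headings named on the slide.
mesh_sample = {
    "anatomy": {
        "body regions": {"abdomen": {}, "axilla": {}, "back": {}},
        "musculoskeletal system": {},
    },
    "organisms": {},
    "diseases": {},
}

def path_to(term, tree, prefix=()):
    """Return the path of headings from the top level down to `term`, or None."""
    for heading, children in tree.items():
        path = prefix + (heading,)
        if heading == term:
            return path
        found = path_to(term, children, path)
        if found:
            return found
    return None

print(" > ".join(path_to("axilla", mesh_sample)))
```

Walking up the returned path gives the broader headings for a term, which is how a search on a general heading can be widened to include its narrower ones.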