1 CS 430: Information Discovery Lecture 1 Overview of Information Discovery.

Similar presentations
Configuration management
LHE 3252 Teaching the Language of Poetry Week 2 Matthew Arnold ( )
MARC 101 for Non-Catalogers Colorado Horizon Users Group Meeting Philip S. Miller Library Castle Rock, CO May 29, 2007.
1 CS 430 / INFO 430 Information Retrieval Lecture 19 Metadata 1.
1 CS 430 / INFO 430 Information Retrieval Lecture 15 Usability 3.
1 CS 430: Information Discovery Lecture 10 Cranfield and TREC.
1 CS 502: Computing Methods for Digital Libraries Lecture 20 Multimedia digital libraries.
1 CS 502: Computing Methods for Digital Libraries Lecture 12 Information Retrieval II.
William Y. Arms Corporation for National Research Initiatives March 22, 1999 Object models, overlay journals, and virtual collections.
CS 430 / INFO 430 Information Retrieval
Orientation to Libraries Research Methods and Data College of Advancing Studies Brendan Rapple.
Web of Science: An Introduction Peggy Jobe
1 CS 430: Information Discovery Lecture 20 The User in the Loop.
1 CS 502: Computing Methods for Digital Libraries Lecture 13 Descriptive Metadata I: cataloguing, classification, authority files.
1 CS 502: Computing Methods for Digital Libraries Lecture 17 Descriptive Metadata: Dublin Core.
1 CS 430: Information Discovery Lecture 1 Overview of Information Discovery.
1 CS 430: Information Discovery Lecture 2 Introduction to Text Based Information Retrieval.
CS 430 / INFO 430 Information Retrieval
1 CS 502: Computing Methods for Digital Libraries Lecture 11 Information Retrieval I.
Software Documentation Written By: Ian Sommerville Presentation By: Stephen Lopez-Couto.
Allyn & Bacon 2003 Social Work Research Methods: Qualitative and Quantitative Approaches Topic 12: Reviewing Literature and Report Writing.
This chapter is extracted from Sommerville’s slides. Text book chapter
Dr. Alireza Isfandyari-Moghaddam Department of Library and Information Studies, Islamic Azad University, Hamedan Branch
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( ) Jayalekshmy S. Nair ( )
IL Step 1: Sources of Information Information Literacy 1.
1 DATABASES By: Hanna Ben-Or Phone: October 2011.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
 To explain the importance of software configuration management (CM)  To describe key CM activities namely CM planning, change management, version management.
LIS510 lecture 3 Thomas Krichel information storage & retrieval this area is now more know as information retrieval when I dealt with it I.
Lecture Four: Steps 3 and 4 INST 250/4.  Does one look for facts, or opinions, or both when conducting a literature search?  What is the difference.
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
Information Retrieval and Web Search Lecture 1. Course overview Instructor: Rada Mihalcea Class web page:
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
1 CS 430: Information Discovery Lecture 26 Automated Information Retrieval.
1 CS 430: Information Discovery Lecture 22 Non-Textual Materials: Informedia.
By Matthew Arnold. The sea is calm tonight. The tide is full, the moon lies fair Upon the straits; on the French coast the light Gleams and is gone; the.
RESEARCH PROPOSAL: HOW TO REVIEW THE LITERATURE MNGT Özge Can.
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
Information Retrieval and Web Search Course overview Instructor: Rada Mihalcea.
Information Retrieval
1 CS 430 / INFO 430 Information Retrieval Lecture 8 Evaluation of Retrieval Effectiveness 1.
1 Chapter 12 Configuration management This chapter is extracted from Sommerville’s slides. Text book chapter 29 1.
1 CS 430: Information Discovery Lecture 18 Web Search Engines: Google.
Hailee Carpenter & Jessica Burnett
1 CS 430: Information Discovery Lecture 1 Overview of Information Discovery.
1 CS 430: Information Discovery Lecture 11 Cranfield and TREC.
1 CS 430: Information Discovery Lecture 8 Collection-Level Metadata Vector Methods.
Dramatic Monologue Ms. Campbell. What is a Dramatic Monologue? O A single person (NOT the poet) who utters the “speech” that makes up the whole poem O.
Chapter 20 Asking Questions, Finding Sources. Characteristics of a Good Research Paper Poses an interesting question and significant problem Responds.
Definition, purposes/functions, elements of IR systems Lesson 1.
1 CS 430 / INFO 430 Information Retrieval Lecture 10 Evaluation of Retrieval Effectiveness 1.
1 CS 430 / INFO 430 Information Retrieval Lecture 1 Overview of Information Retrieval.
1 CS 430: Information Discovery Lecture 28 (a) Two Examples of Cluster Analysis (b) Conclusion.
1 Midterm Examination. 2 General Observations Examination was too long! Most people submitted by .
Automated Information Retrieval
A few poetry terms to know…
Text Based Information Retrieval
By Matthew Arnold Instructor: Laurie Jui-hua Tseng
Software Documentation
Searching for and Accessing Information
Thanks to Bill Arms, Marti Hearst
Overview of Information Retrieval
CS 430: Information Discovery
CS 430: Information Discovery
Data Mining Chapter 6 Search Engines
DATABASES By: Hanna Ben-Or Phone:
English compulsory B.A.I
Introduction to Information Retrieval
Information Retrieval and Web Design
Presentation transcript:

1 CS 430: Information Discovery Lecture 1 Overview of Information Discovery

2 Course Administration
Web site:
Instructor: William Arms, Information Science, Room 104
Teaching assistants: Pavel Dmitriev, Ariful Gani, Heng-Scheng Chuang
Assistant: Anat Nidar-Levi, Information Science, Room 101
Sign-up sheet: include your NetID
Contact the course team: by email
Notices: see the home page of the course Web site

3 Discussion Classes
Format of Wednesday evening classes:
Topic announced on the Web site, with a chapter or article to read
Allow several hours to prepare for class by reading the materials
Class has a discussion format
One third of the grade is class participation
Class time is 7:30 to 8:30 in Upson B17 (note change of room)

4 Assignments Four individual programming assignments, in Java or C++.

5 Code of Conduct
Computing is a collaborative activity. You are encouraged to work together, but some tasks may require individual work.
Always give credit to your sources and collaborators. Making use of the expertise of others and building on previous work, with proper attribution, is good professional practice. Using the efforts of others without attribution is unethical and academic cheating.
Read and follow the University's Code of Academic Integrity.

6 Course Description
This course studies techniques and human factors in discovering information in online information systems. Methods covered include techniques for searching, browsing, and filtering information; descriptive metadata; and the use of classification systems and thesauri, with examples drawn from Web search systems and digital libraries.

7 Information Discovery
People have many reasons to look for information:
Known item: Where will I find the wording of the US Copyright Act?
Facts: What is the capital of Barbados?
Introduction or overview: How do diesel engines work?
Related information (annotation): Is there a review of this item?
Comprehensive search: What is known of the effects of global warming on hurricanes?

8 Information Discovery
People have many ways to look for information:
Where will I find the wording of the US Copyright Act? Is government information outside copyright? (Browse)
What is the capital of Barbados? (Search)
How do diesel engines work? (Search)
What is known of the effects of global warming on hurricanes? (Explore)

9 Searching and Browsing: The Human in the Loop
[Diagram: the user searches an index and is returned hits; the user browses a repository and is returned objects]

10 Definitions
Information retrieval: the subfield of computer science that deals with automated retrieval of documents based on their content.
Searching: seeking specific information within a body of information. The result of a search is a set of hits.
Browsing: unstructured exploration of a body of information.
Linking: moving from one item to another by following links, such as citations or references.

11 The Basics of Information Retrieval
Query: a string of text describing the information that the user is seeking. Each word of the query is called a search term. A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.
Full-text searching: methods that compare the query with every word in the text, without distinguishing the function of the various words.
Fielded searching: methods that search on specific bibliographic or structural fields, such as author or heading.
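As a concrete illustration (not part of the lecture slides), the following short Python sketch contrasts the two approaches on an invented two-document collection; the documents, field names, and function names are made up for the example.

# Minimal sketch of full-text vs. fielded searching over a toy collection.
# The documents and field names are invented for illustration.

documents = [
    {"author": "Matthew Arnold", "title": "Dover Beach",
     "text": "The sea is calm to-night. The tide is full, the moon lies fair..."},
    {"author": "John Keats", "title": "To Autumn",
     "text": "Season of mists and mellow fruitfulness..."},
]

def full_text_search(query, docs):
    """Return docs in which every search term appears somewhere in the text."""
    terms = query.lower().split()
    return [d for d in docs
            if all(t in d["text"].lower() for t in terms)]

def fielded_search(field, value, docs):
    """Return docs whose named field contains the value (case-insensitive)."""
    return [d for d in docs if value.lower() in d[field].lower()]

print(full_text_search("calm tide", documents))       # matches Dover Beach by its text
print(fielded_search("author", "Arnold", documents))  # matches by the author field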

12 Searching: Physical Collections
[Diagram: a catalog (index) is created from the document collection; the user searches the index to locate documents]


14 Descriptive metadata
Some methods of information discovery search descriptive metadata about the objects. The metadata typically consists of a catalog or indexing record, or an abstract, with one record for each object. The record acts as a surrogate for the object.
Usually the metadata is stored separately from the objects that it describes, but sometimes it is embedded in the objects. Usually the metadata is a set of text fields.
Textual metadata can be used to describe non-textual objects, e.g., software, images, music.

15 Documents and Surrogates
Document:
The sea is calm to-night. The tide is full, the moon lies fair Upon the straits;--on the French coast the light Gleams and is gone; the cliffs of England stand, Glimmering and vast, out in the tranquil bay. Come to the window, sweet is the night-air! Only, from the long line of spray Where the sea meets the moon-blanch'd land, Listen! you hear the grating roar Of pebbles which the waves draw back, and fling, At their return, up the high strand, Begin, and cease, and then again begin, With tremulous cadence slow, and bring The eternal note of sadness in.
Surrogate (catalog record):
Author: Matthew Arnold
Title: Dover Beach
Genre: Poem
Date: 1851
Notes: 1. The surrogate is also a document. 2. Every word is different!
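A small Python sketch (invented for this transcript, not part of the slides) makes note 2 concrete: a full-text search of the poem cannot find it by author or title, while a search of the surrogate can. The abbreviated strings below stand in for the full text.

# Sketch: why a surrogate matters when "every word is different".
# The strings are abbreviated; the variable names are illustrative only.

poem_text = ("The sea is calm to-night. The tide is full, the moon lies fair "
             "Upon the straits; on the French coast the light gleams and is gone...")

surrogate = "Author: Matthew Arnold Title: Dover Beach Genre: Poem Date: 1851"

def contains(term, text):
    return term.lower() in text.lower()

# A full-text search of the document itself cannot find it by author or title:
print(contains("Arnold", poem_text))   # False
print(contains("Dover", poem_text))    # False

# Searching the surrogate (catalog record) succeeds:
print(contains("Arnold", surrogate))   # True
print(contains("Dover", surrogate))    # True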

16 Descriptive metadata
Catalog: metadata records that have a consistent structure, organized according to systematic rules. (Example: Library of Congress Catalog)
Abstract: a free-text record that summarizes a longer document.
Indexing record: less formal than a catalog record, but with more structure than a simple abstract. (Examples: Inspec, Medline)

17 Surrogates for non-textual materials
A textual catalog record about a non-textual item (e.g., a photograph) is a surrogate. Text-based methods of information retrieval can search the surrogate for the photograph.

18 Library of Congress catalog record (part)
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U.S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS:
  Coolidge, Calvin,
  Presidents--United States
  Autographing--Colorado--Denver
  Denver (Colo.)
  Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)

19 Searching: Electronic Collections
[Diagram: an index is built from the document collection by automatic indexing; the user searches the index]


21 Automatic indexing
Creating catalog records manually is labor-intensive and hence expensive. The aim of automatic indexing is to build indexes and retrieve information without human intervention.
History: Much of the fundamental research in automatic indexing was carried out by Gerard Salton, Professor of Computer Science at Cornell, and his graduate students. The reading for Discussion Class 2 is a paper by Salton and others that describes the SMART system used for their research.
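The following minimal Python sketch (an illustration, not the SMART system) shows the core idea of automatic indexing: an inverted index built from the text alone, with no human cataloguing. The toy documents and naive tokenizer are assumptions made for the example.

# Sketch of automatic indexing: build a simple inverted index with no human
# intervention, mapping each word to the documents that contain it.
# The tokenization is deliberately naive; real systems also apply stopping,
# stemming, and term weighting.

import re
from collections import defaultdict

docs = {
    1: "Full text searching compares the query with every word in the text.",
    2: "Fielded searching searches specific fields such as author or title.",
}

inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        inverted_index[word].add(doc_id)

print(sorted(inverted_index["searching"]))  # -> [1, 2]
print(sorted(inverted_index["author"]))     # -> [2]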

22 Searching: Other Methods
Natural language processing (not part of this class)
Language support: lexicon, thesaurus
Classification: divide documents into groups based on subject

23 Lexicon and thesaurus
A lexicon contains information about words, their morphological variants, and their grammatical usage.
A thesaurus relates words by meaning:
ship, vessel, sail; craft, navy, marine, fleet, flotilla
book, writing, work, volume, tome, tract, codex
search, discovery, detection, find, revelation
(From Roget's Thesaurus, 1911)
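To illustrate, here is a rough Python sketch (not from the course materials) of thesaurus-based query expansion using the word groups quoted above; the tiny thesaurus dictionary and function name are invented for the example.

# Sketch of thesaurus-based query expansion: each query term is replaced by
# the set of related words from a tiny, hand-made thesaurus, so that a search
# for "ship" can also match documents that say "vessel" or "fleet".

thesaurus = {
    "ship": {"ship", "vessel", "sail", "craft", "navy", "marine", "fleet", "flotilla"},
    "book": {"book", "writing", "work", "volume", "tome", "tract", "codex"},
    "search": {"search", "discovery", "detection", "find", "revelation"},
}

def expand_query(terms):
    expanded = set()
    for term in terms:
        # Terms with no thesaurus entry are kept as-is.
        expanded |= thesaurus.get(term, {term})
    return expanded

print(sorted(expand_query(["ship", "metadata"])))
# 'metadata' has no entry, so it is kept alongside the whole ship/vessel/... group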

24 Searching: Distributed Collections (e.g., the Web)
[Diagram: a Web crawler gathers documents from many collections to build a central index; the user searches the index]


26 Multi-modal Information Discovery
Incorporate information from many sources into indexes.
Google (Web documents): full-text indexing; hyperlink structures; file names, anchor text, etc.
Informedia (video segments): speech recognition from the sound track; closed captions; text over video.
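One simple way to combine such sources, sketched below in Python purely as an illustration (this is not Google's or Informedia's actual method), is a weighted sum of per-source term matches; the page data and weights are invented.

# Sketch of multi-modal indexing: evidence about a web page can come from
# several sources (its own text, anchor text on incoming links, its file
# name), combined here as a weighted sum of per-source match counts.

page = {
    "text": "diesel engines compress air until fuel ignites",
    "anchor_text": "how diesel engines work",
    "file_name": "diesel-faq.html",
}
weights = {"text": 1.0, "anchor_text": 2.0, "file_name": 0.5}

def score(query, page):
    terms = query.lower().split()
    total = 0.0
    for source, content in page.items():
        matches = sum(1 for t in terms if t in content.lower())
        total += weights[source] * matches
    return total

print(score("diesel engines", page))  # anchor-text matches count double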

27 Measuring Effectiveness: Recall and Precision
If information retrieval were perfect, every hit would be relevant to the original query, and every relevant item in the body of information would be found.
Precision: the percentage of the hits that are relevant; the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.
Recall: the percentage of the relevant items that are found by the query; the extent to which the query found all the items that satisfy the requirement.

28 Recall and Precision: Example
A collection of 10,000 documents, 50 of which are on a specific topic.
An ideal search finds these 50 documents and rejects all others.
An actual search identifies 25 documents; 20 are relevant but 5 are on other topics.
Precision: 20/25 = 0.8
Recall: 20/50 = 0.4
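For completeness, a short Python sketch (not part of the slides) that reproduces these numbers; the document id ranges are invented so that exactly 20 of the 25 retrieved documents are relevant.

# Sketch: computing precision and recall for the worked example above.
# retrieved = the 25 documents the search returned; relevant = the 50
# on-topic documents in the collection; 20 documents are in both sets.

def precision_recall(retrieved, relevant):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved)
    recall = len(hits) / len(relevant)
    return precision, recall

retrieved = set(range(1, 26))   # 25 retrieved documents (ids 1..25)
relevant = set(range(6, 56))    # 50 relevant documents (ids 6..55)

print(precision_recall(retrieved, relevant))  # -> (0.8, 0.4)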

29 Measuring Precision and Recall
Precision is easy to measure: a knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined.
Recall is difficult to measure: to know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide whether it fits the criteria. In the example, all 10,000 documents must be examined.

30 History
Measuring the effectiveness of information discovery is an unresolved research topic. Much of the work on the evaluation of information retrieval derives from the ASLIB Cranfield projects led by Cyril Cleverdon, which began in the late 1950s. Recent work on evaluation has centered around the Text Retrieval Conferences (TREC), which began in 1992, led by Donna Harman and Ellen Voorhees of NIST.