CS 430: Information Discovery


1 CS 430: Information Discovery
Lecture 25 Automatic Indexing and Metadata-based Retrieval

2 Course Administration
Final examination:
Date: Tuesday, 15 May, 3:00 to 5:00 p.m.
Room: Kimball Hall B11
Early examination:
Date: Thursday, May 10th, 1:00 to 3:00 p.m.
Room: Upson Hall 5130
If you wish to take the early examination send to
Laptops: Return before examination and bring receipt to the examination.

3 Course Administration
Assignment 4: A new version of Assignment 4 has been posted:
• The test data has been corrected.
• The instructions for submission have been revised.
Please submit separate files for the various parts of the assignment.

4 Before Digital Libraries
Access to scientific, medical, legal information
In the United States:
-- excellent if you belonged to a rich organization (e.g., a major university)
-- very poor otherwise
In many countries of the world:
-- very poor for everybody
Arms 2000

5 Research Libraries are Expensive
-- library materials
-- buildings & facilities
-- staff
Arms 2000

6 The Potential of Digital Libraries
Diagram labels: open access?, computers & networks, materials, staff
Arms 2000

7 Automated Digital Libraries
How effectively can computers be used for the skilled tasks of professional librarianship?
-- Time horizon: 5 to 20 years
-- All materials in digital form
Computers cannot imitate intelligence. Can automated digital libraries provide equivalent services?
Arms 2000

8 Substitutes for Human Intelligence
Automated algorithms for information discovery
Closeness of match
-- vector space and statistical methods (Salton, et al., c. 1965)
Importance of digital object
-- Google ranks web pages by how many other pages link to them (NSF/DARPA/NASA Digital Libraries Initiative)
Arms 2000
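The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: "closeness of match" is shown as cosine similarity over raw term counts, and "importance" as a plain count of in-links, as the slide describes it. The function names and the toy link graph are hypothetical, and a production engine would use more refined weighting than either function here.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the raw term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def inlink_counts(links):
    """Count, for each page, how many distinct other pages link to it.

    links maps a page name to the pages it links to (hypothetical toy data).
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


if __name__ == "__main__":
    print(cosine_similarity("digital library search", "library search engine"))
    print(inlink_counts({"x": ["y", "z"], "y": ["z"], "z": []}).most_common())
```

Sorting pages by `inlink_counts` gives an importance ordering that is independent of the query, while `cosine_similarity` scores a document against a specific query; a search service can combine the two.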

9 Brute Force Computing: Archiving and Preservation
Internet Archive
-- Monthly, web crawler gathers every open access web page with associated images
-- Web pages are preserved for future generations
-- Files are available for scholarly research
Arms 2000

10 Brute Force Computing: Reference Linking
ResearchIndex (CiteSeer, ScienceIndex) (NEC)
-- fully automatic
-- all open access material in computer science
-- a free service
Contrast with the Web of Science (ISI)
-- input: combination of automatic means, skilled people
-- limited number of journals
-- very expensive
Arms 2000

11 Brute Force Computing: Automated Metadata Extraction
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television news.
Algorithms for:
-- dividing raw video into discrete items
-- generating short summaries
-- indexing the sound track using speech recognition
-- recognizing faces
(NSF/DARPA/NASA Digital Libraries Initiative)
Arms 2000

12 Example: Catalogs and Indexes
Catalog, index and abstracting records are very expensive when created by skilled professionals
-- only available for certain categories of material (e.g., monographs, scientific journals)
-- contain limited fields of information (e.g., no contents page)
-- restricted to static information
Arms 2000

13 Equivalent Services: Catalogs and Indexes
Cataloguing rules
-- Application of cataloguing rules is skilled
-- It is hard to imagine a computer system with these skills
but ...
-- Cataloguing rules are the means not the end
Arms 2000

14 Example: Catalogs and Indexes
Catalog, index and abstracting records are very expensive when created by skilled professionals, but ...
For information discovery, particularly with untrained users:
automated indexing of full text is at least as effective as manually produced indexes and catalogs
[Demonstrated repeatedly in experiments going back to the original Cranfield experiments.]
Arms 2001

15 Equivalent Services: Information Discovery
I used to be a heavy user of Inspec. Now I use Google instead.
Why are web search services the most widely used information discovery tools in universities today?
Arms 2000

16 Conventional Criteria
Web search services have many weaknesses
-- selection is arbitrary
-- index records are crude
-- no authority control
-- duplicate detection is weak
-- search precision is deplorable
yet they clearly satisfy some users ...
Arms 2000

17 Effectiveness of Web Search
Why I use Google instead of Inspec
=> Broader coverage
=> Better ranking
=> Immediate access to information (e.g., open access version of published paper)
Google is an equivalent service for information discovery (for some users)
Arms 2000

18 Brute Force Computing
Few people really understand Moore's Law
-- Computing power doubles every 18 months
-- Increases 100 times in 10 years
-- Increases 10,000 times in 20 years
Simple algorithms + immense computing power may outperform human intelligence
Arms 2000
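The growth figures on this slide follow directly from the 18-month doubling period, and a short calculation makes the arithmetic explicit. The function name and its default parameter are illustrative choices, not part of the lecture.

```python
def growth_factor(years, doubling_months=18):
    """Factor by which capacity grows over `years` if it doubles every
    `doubling_months` months."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


if __name__ == "__main__":
    for years in (10, 20):
        print(years, "years:", round(growth_factor(years)))
```

Ten years is about 6.7 doublings and twenty years about 13.3, which gives factors of about 102 and about 10,300, consistent with the rounded figures of 100 and 10,000 on the slide.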

19 Brute Force Computing Example
Creators of the world champion chess program (Deep Thought later Deep Blue)
-- moderate chess players
-- simple tree-search algorithm
-- very, very fast computer hardware
Arms 2000
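To give a sense of how simple a game tree search can be, here is a minimal minimax sketch. It assumes a toy game tree given as nested lists with numeric leaf scores. The slide says only "simple tree-search algorithm", so this is an illustration of the general idea and not a description of the actual chess program.

```python
def minimax(node, maximizing=True):
    """Return the best achievable score from `node`.

    A node is either a number (a leaf score, from the first player's point
    of view) or a list of child nodes. Players alternate at each level.
    """
    if not isinstance(node, list):
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)


if __name__ == "__main__":
    # Two moves for the first player, each answered by two replies.
    tree = [[3, 5], [2, 9]]
    print(minimax(tree))
```

The algorithm contains no chess knowledge at all. Playing strength comes from how many positions the hardware can examine, which is the point of the slide.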

20 Brute Force Computing: Web Search
Web search engines:
-- retrieve every page on the web
-- index every word
-- repeat every month
Arms 2000
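"Index every word" usually means building an inverted index, a map from each word to the pages that contain it. The sketch below is a minimal in-memory version. The page URLs, the tokenization rule, and the function names are assumptions made for illustration, and a real engine would also store positions and frequencies and keep the index on disk.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map every word to the set of page identifiers that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query (Boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    pages = {  # hypothetical crawled pages
        "a.example/1": "Digital libraries and information discovery",
        "a.example/2": "Information retrieval with web search engines",
    }
    index = build_inverted_index(pages)
    print(search(index, "information discovery"))
```

Repeating the crawl every month then amounts to rebuilding or updating this map from the freshly retrieved pages.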

21 Automated Indexing Compared to Manual Indexing
1967: Cleverdon (Cranfield) using a normalized recall measure from Salton (Cornell) tested 33 indexing schemes on the same corpus with the same queries. Indexing schemes based on single terms were convincingly more effective than controlled vocabulary or indexing by simple concepts. No subsequent experiment has ever reversed this finding.
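The slide does not give the formula for normalized recall. The sketch below assumes the definition commonly attributed to Salton: for a ranked collection of N documents containing n relevant ones at ranks r_1 ... r_n, the measure is 1 - (sum of r_i - sum of i for i = 1..n) / (n (N - n)). It is 1 when all relevant documents are ranked first and 0 when they are all ranked last. The function name is illustrative.

```python
def normalized_recall(relevant_ranks, collection_size):
    """Normalized recall for one query (assumed Salton-style definition).

    relevant_ranks: 1-based ranks at which the relevant documents appear.
    collection_size: total number of ranked documents, N.
    """
    n = len(relevant_ranks)
    if n == 0 or n >= collection_size:
        raise ValueError("need at least one relevant and one non-relevant document")
    ideal_sum = n * (n + 1) // 2  # ranks 1..n, the best possible placement
    excess = sum(relevant_ranks) - ideal_sum
    return 1 - excess / (n * (collection_size - n))


if __name__ == "__main__":
    # Three relevant documents in a collection of ten.
    print(normalized_recall([1, 2, 3], 10))   # best possible ranking
    print(normalized_recall([8, 9, 10], 10))  # worst possible ranking
    print(normalized_recall([1, 4, 7], 10))
```

Because the measure depends on the whole ranking and not on a cutoff, it allows many indexing schemes to be compared on the same corpus and queries with a single number per scheme.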

22 Abstracting and Indexing Services
Example 1: Chemical Abstracts
Chemical Abstracts uses 600 Ph.D.-level chemists to write abstracts of journal articles
Example 2: Westlaw
Westlaw has 200 lawyers indexing law reports (and answering the telephone)
Chemistry and law are rich fields

23

24 Web Search Engines
Infoseek (Kirsch, 1994)
Search engine: InQuery
Financial model: $10 per month subscription
Lycos (Mauldin, 1994)
Search engine: Pursuit
-- vector length normalization, term weighting (tf*idf), stemming, stop words, etc.
Emphasis on recall.
Financial model: advertising
Google (Brin and Page, 1997)
Search engine:
Emphasis on precision.
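Several of the techniques listed for Pursuit fit together in a few lines. The sketch below removes stop words, weights terms by tf*idf, and normalizes each document vector to unit length. It assumes one common variant (raw term frequency times log(N/df)); the slide does not say which variant Lycos used. The stop-word list and function names are illustrative, and stemming is left out for brevity.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:  # vector length normalization
            weights = {t: w / length for t, w in weights.items()}
        vectors.append(weights)
    return vectors


if __name__ == "__main__":
    docs = ["the cat sat", "the dog sat", "the bird sang"]
    for vec in tfidf_vectors(docs):
        print(vec)
```

In the example, "cat" occurs in one document and "sat" in two, so "cat" receives the higher weight in the first vector. Length normalization keeps long documents from dominating a ranking simply because they contain more words.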

