1 Information Management Lecture 3: Cataloging, Indexing, Searching J. Michael Moshell University of Central Florida Original image* by Moshell et al.

1 1 Information Management Lecture 3: Cataloging, Indexing, Searching J. Michael Moshell University of Central Florida Original image* by Moshell et al.

2 -2 - Cataloging and Indexing Why are we discussing this? I don't believe in memorizing a bunch of soon-obsolete facts. I DO believe that many of you will have to solve info-management problems. You will probably invent ways of doing it. So you should "steal from the best" – not reinvent the wheelbarrow.

3 -3 - How do we find things? 1) By starting in the neighborhood of similar things. 2) By using the name of the thing, and asking an "expert" or "resource".

4 -4 - How do we find things? 1) By starting in the neighborhood of similar things. 2) By using the name of the thing, and asking an "expert" or "resource". When reading a book: Look in the table of contents, for an ARTICLE. Look in the index, for a TOPIC.

5 -5 - How do we find things? 1) By starting in the neighborhood of similar things. 2) By using the name of the thing, and asking an "expert" or "resource". At the library: Go to the relevant section, browse shelves. Use the (card) catalog (really an index).

6 -6 - How do we find things? 1) By starting in the neighborhood of similar things. 2) By using the name of the thing, and asking an "expert" or "resource". On the Internet: Follow links from trusted sources (like cnet). Use the indexes, e.g. those provided by search engines, those provided by vendors (eBay, Amazon...), and those provided by facilitators (YouTube, craigslist).

7 -7 - What's an index? An index is a system that serves to optimize speed in finding relevant documents in a search. An index is a system that, given one or more search terms from either metadata or essence, efficiently reports the location of the essence. What's fast? What's efficient? here comes some math... (how we all love it!)

8 -8 - Order statistics A document contains k records. (perhaps k=1000). If you must examine EACH RECORD to find what you seek, the search is Order-k (written as O(k).) For ancient records, this is usually the only way. For instance, the Archivo General de Indias in Seville, Spain

9 -9 - Order statistics A document contains k records. (perhaps k=1000). If you must examine EACH RECORD to find what you seek, the search is Order-k (written as O(k).) For ancient records, this is usually the only way. On the average, you would look at 500 records (0.5*k) to find the one you are seeking. Let's say we seek a ship named Nuestra Senora de Atocha
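The "0.5*k on average" claim can be checked with a minimal sketch. The record names and the `linear_search` helper are hypothetical, invented only for illustration; they are not from the archive.

```python
def linear_search(records, target):
    """Scan records in order.

    Returns (position, records examined), or (None, records examined)
    if the target is not present.
    """
    examined = 0
    for position, record in enumerate(records):
        examined += 1
        if record == target:
            return position, examined
    return None, examined


# Hypothetical unsorted document with k = 1000 records.
records = [f"ship-{i}" for i in range(1000)]

# Average number of records examined, taken over every possible target.
total = sum(linear_search(records, name)[1] for name in records)
average_examined = total / len(records)
print(average_examined)
```

A record that is not in the document at all costs the full k examinations, which is why the search is called O(k).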

10 -10 - Indexing To prepare an index of all ships' names, captains' names, owners and dates in the archive, it would take O(k) time. Why? Because every document would be visited. Each index item contains a SEARCH TERM and a DOCUMENT NUMBER. BUT now (if the index is sorted, which it is) we can find S = Nuestra Senora de Atocha much faster, by playing "binary search". [Figure: sorted index running from A to Z; at the probed entry we ask "S > this?"]

11 -11 - Indexing If someone prepared an index of all ships, captains' names, owners and dates in the archive, this would take O(k) time. Why? Because every document would be visited. BUT now (if the index is sorted, which it is) we can find S = Nuestra Senora de Atocha much faster, by playing "binary search". [Figure: sorted index running from A to Z; at the probed entry we ask "S > this?" and the answer is "no"]

12 -12 - Indexing and binary Search
1 comparison distinguishes 2 records
2 comparisons distinguish 4 records
3 comparisons distinguish 8 records
10 comparisons distinguish 1024 records
20 comparisons distinguish over a million records.
Each comparison cuts the search space in half.
[Figure: sorted index running from A to Z]

13 -13 - Indexing and binary Search
1 comparison distinguishes 2 records
2 comparisons distinguish 4 records
3 comparisons distinguish 8 records
10 comparisons distinguish 1024 records
20 comparisons distinguish over a million records.
Each comparison cuts the search space in half: O(log k)
[Figure: sorted index running from A to Z]
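The "S > this?" game can be sketched as follows. This is a minimal illustration, assuming each index item is a (search term, document number) pair as described on the indexing slide; the `binary_search` function and the million made-up ship names are hypothetical.

```python
def binary_search(index, term):
    """Look up term in an index sorted by search term.

    index is a list of (search_term, document_number) pairs.
    Returns (document_number or None, comparisons made).
    """
    low, high = 0, len(index) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1
        mid_term, document_number = index[mid]
        if mid_term == term:
            return document_number, comparisons
        if term > mid_term:      # "S > this?" yes: discard the lower half
            low = mid + 1
        else:                    # no: discard the upper half
            high = mid - 1
    return None, comparisons


# Hypothetical sorted index with one million entries.
index = sorted((f"ship-{i:07d}", i) for i in range(1_000_000))

document_number, steps = binary_search(index, "ship-0123456")
print(document_number, steps)
```

Because every probe halves the remaining range, no lookup in this million-entry index needs more than 20 probes, whether or not the term is present.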

14 -14 - OMG, a Log? Puleeeeez.... Yep, this is college and you are a DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand
O(log k)

15 -15 - OMG, a Log? Puleeeeez.... Yep, this is college and you are a DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand
2^20 = 2*2*...*2 (twenty twos) = 1024 * 1024 = 1 meg, about a million
O(log k)

16 -16 - OMG, a Log? Puleeeeez.... Yep, this is college and you are a DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand
2^20 = 2*2*...*2 (twenty twos) = 1024 * 1024 = 1 meg, about a million
2^30 = 2*2*...*2 (thirty twos) = 1024 * 1024 * 1024 = 1 gig, about a billion
O(log k)

17 -17 - OMG, a Log? Puleeeeez.... Yep, this is college and you are a DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
2^10 = 2*2*...*2 (ten twos) = 1024 = 1 kilo, about a thousand
2^20 = 2*2*...*2 (twenty twos) = 1024 * 1024 = 1 meg, about a million
2^30 = 2*2*...*2 (thirty twos) = 1024 * 1024 * 1024 = 1 gig, about a billion
log2(k)    k
10         1 kilo
20         1 meg
30         1 gig

18 -18 - OMG, a Log? Puleeeeez.... Yep, this is college and you are a DIGITAL Media Major. So here goes.
2^0 = 1
2^1 = 2
2^2 = 2*2 = 4
2^3 = 2*2*2 = 8
2^10 = 1 kilo
2^20 = 1 meg
2^30 = 1 gig
You need to be able to tell me what is log2(k) for any k (power of two) between 1 and 1 meg. Example: 256k? 256 = 2^8, and 1k = 2^10. So 256k is 2*2*...*2 (18 twos): log2(256k) = 8 + 10 = 18.
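One way to practice the "count the twos" rule is a small sketch like the one below. The function name `log2_power_of_two` and the constants `KILO` and `MEG` are made up for this illustration.

```python
def log2_power_of_two(k):
    """Return log2(k) for a power of two by counting how many halvings reach 1."""
    if k < 1 or (k & (k - 1)) != 0:
        raise ValueError("k must be a power of two")
    twos = 0
    while k > 1:
        k //= 2
        twos += 1
    return twos


KILO = 2 ** 10   # 1024
MEG = 2 ** 20    # 1024 * 1024

print(log2_power_of_two(256 * KILO))  # 18, matching the 256k example
```

The halving loop is the same idea as binary search: the number of times you can cut k in half is exactly log2(k).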

19 -19 - OMG, a Log? Puleeeeez.... I will provide a Logarithm Practice Sheet on the website to help you study and practice for the midterm exam.

20 -20 - Indexing and binary Search
Linear Search      Binary Search
1000 items         10 steps
1 million items    20 steps
1 billion items    30 steps
Each comparison cuts the search space in half: O(log2(k))
[Figure: sorted index running from A to Z]

21 -21 - Sorting N Objects We will discuss sorting, a bit later After you recover from Math Anxiety Slcc.edu

22 -22 - Why not just keep books in order? Could you do 'binary search' directly on the books...? If they're on the shelf in that order, yes. Well, WHICH order? -by ship names? -by captains' names? -by year of construction? -by year of sinking or decommissioning? An index can be sorted on any data field, then searched. (Sorting k objects takes O(k * log k) time, so sorting a billion objects takes 1 billion * log2(1 billion) = 1 billion * 30 = 30 billion steps.)

23 -23 - Why not just keep books in order? -An index can be sorted on any data field, then searched. (Sorting k objects takes O(k log k) time, so sorting a billion objects takes 1 billion * log2(1 billion) = 1 billion * 30 = 30 billion steps.) (This can be done overnight, when computers aren't busy.) BUT – once sorted, inserting new information is O(log k) time (given an index structure, such as a balanced tree, that supports it). So, you can insert a new fact into our billion-item index in about 30 steps. Fast!
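A minimal sketch of "sort on any field, then search, then insert" using Python's standard `bisect` module. All record values below are made up for illustration (including the fields attached to the Atocha entry), and the helper names `build_index`, `insert_entry` and `lookup` are hypothetical.

```python
import bisect

# Hypothetical archive records; every field value here is invented.
records = [
    {"ship": "Nuestra Senora de Atocha", "captain": "Captain C", "year": 1620, "doc": 17},
    {"ship": "Example Ship B", "captain": "Captain A", "year": 1622, "doc": 4},
    {"ship": "Example Ship A", "captain": "Captain B", "year": 1618, "doc": 9},
]


def build_index(records, field):
    """Build a sorted list of (field value, document number) pairs: O(k log k)."""
    return sorted((record[field], record["doc"]) for record in records)


def insert_entry(index, value, doc):
    """Add one entry while keeping the index sorted."""
    bisect.insort(index, (value, doc))


def lookup(index, value):
    """Return the document numbers filed under value, located by binary search."""
    position = bisect.bisect_left(index, (value,))
    docs = []
    while position < len(index) and index[position][0] == value:
        docs.append(index[position][1])
        position += 1
    return docs


by_ship = build_index(records, "ship")
by_year = build_index(records, "year")

insert_entry(by_year, 1621, 23)
print(lookup(by_ship, "Nuestra Senora de Atocha"))
print(lookup(by_year, 1621))
```

One caveat about this sketch: `bisect.insort` finds the slot in O(log k) comparisons, but a Python list then shifts the later entries to make room. Tree-shaped index structures avoid that shifting, which is what the "about 30 steps" figure assumes.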

24 -24 - What terms shall we index? -For text, the essence yields keyword search -The dumbest but easiest kind of search, if essence=digital text.

25 -25 - What terms shall we index? -For text, the essence yields keyword search -The dumbest but easiest kind of search, if essence=digital text. -This was not true for traditional libraries. -Nobody had time to catalog every word of every book. - Professional catalogers had to develop techniques: - Author - Title - Publication Date - Subject (METADATA!) And this last one, Subject, took more work than all the rest together.
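When the essence is digital text, indexing every word is cheap. A minimal sketch of such a keyword index follows; the three sample documents and the function names are invented for illustration only.

```python
import re
from collections import defaultdict


def build_keyword_index(documents):
    """Map each word to the sorted list of document numbers that contain it.

    documents is a dict of document number -> text.
    """
    postings = defaultdict(set)
    for doc_number, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            postings[word].add(doc_number)
    return {word: sorted(docs) for word, docs in postings.items()}


def search(index, *words):
    """Return the document numbers that contain every search word."""
    if not words:
        return []
    result = set(index.get(words[0].lower(), []))
    for word in words[1:]:
        result &= set(index.get(word.lower(), []))
    return sorted(result)


# Hypothetical documents.
documents = {
    1: "The skills of a bartender",
    2: "A bartender and a captain",
    3: "The captain of the ship",
}
index = build_keyword_index(documents)
print(search(index, "bartender", "captain"))
```

This kind of search matches exact words only, which is why it is "dumb": a query for "barkeep" finds nothing here, and that gap is what subject cataloging tries to close.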

26 -26 - What's so hard about subject indexing? -The problem: restricting the vocabulary. Let's consider a fictional book: The Skills of a Nineteenth Century Bartender. Henry Macintosh, New York, 1889 How might someone seek this book? Or: what metadata fields might the librarian use? Occupations: bartender, barkeeper, barman, barkeep (Are there others we forgot to search for?) So catalogers established rules involving precedent to restrict vocabularies and establish standards
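A restricted vocabulary can be sketched as a lookup table from every accepted variant to one preferred term. The choice of "bartender" as the preferred heading is an assumption made for this illustration, not a real cataloging standard, and `normalize_subject` is a hypothetical helper.

```python
# Hypothetical controlled vocabulary: variant -> preferred subject term.
PREFERRED_TERM = {
    "bartender": "bartender",
    "barkeeper": "bartender",
    "barman": "bartender",
    "barkeep": "bartender",
}


def normalize_subject(term):
    """Return the preferred subject term, or raise if the term is not in the vocabulary."""
    key = term.strip().lower()
    if key not in PREFERRED_TERM:
        raise ValueError(f"'{term}' is not in the controlled vocabulary")
    return PREFERRED_TERM[key]


print(normalize_subject("Barkeep"))
```

Rejecting unknown terms, rather than silently accepting them, forces someone to decide how the vocabulary should be extended.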

27 -27 - Cataloging an Item for a Library The card catalog at Yale University (of course, it's all computerized now)

28 -28 - Cataloging an Item for a Library Problem #1: What book (or other object) are we talking about? - Each item has an accession number (that's easy to issue). - Each title has a catalog number, shared with all instances (sometimes separate copies are called .c1, .c3, etc.). Problem #2: What catalog number should I give this item? - Did someone else catalog it already? If so, use that. - If not, follow the International Standard Bibliographic Description (ISBD).

29 -29 - International Standard Bibliographic Description (ISBD): title and statement of responsibility (author or editor); edition; material-specific details (for example, the scale of a map); publication and distribution; physical description (for example, number of pages); series (e.g. this might be part 3 of a trilogy); notes; standard number (ISBN).

30 -30 - And then follow A complex set of rules Most English cataloging follows Anglo-American Cataloging Rules (AACR2) Germans follow Regeln für die alphabetische Katalogisierung Etc…

31 -31 - How to organize an index -Step 1: Deciding what fields to include (the ontology of the subject space). -Step 2: Deciding if each metadata field is open or controlled (CV, a controlled vocabulary). Open set: American family names. Closed set: Chinese family names. In software, CV fields are often presented as pulldown menus. -Step 3: Establishing the controlled vocabulary, and rules for extending it. -Step 4: Maintaining it. -(e.g. MIME types, subtypes.)

32 -32 - Concept: "Low-hanging fruit" -In any new domain, some ideas will come together that present opportunities not previously possible. -Some of them will be easy to do. -Get these first, and you may be rich. The cataloging of dynamic media such as video can take advantage of techniques for Content Logging. In this area, closed captions were low-hanging fruit.

33 -33 - Closed Captions for Content Logging -Originally for deaf... now for bars, etc. -"Closed" – not all viewers will see the captions -But they are built into most TV broadcasts. >> Indicates a new speaker has begun to talk.

34 -34 - Closed Captions for TV -Originally for deaf... now for bars, etc. -"Closed" – not all viewers will see the captions -But they are built into most TV broadcasts. >> Indicates a new speaker has begun to talk. But – isn't speech recognition still hard? - yes – but there are SCRIPTS and TELEPROMPTERS behind most TV programming. Live news feeds are a mix of scripted and unscripted. BBC developed a re-speak technology to maximize clarity. Sound effects and music are shown by # or notes.

35 -35 - Closed Captions for TV -now that CC exists, you can index it to produce metadata. -Services monitor in real-time for significant stories.
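Indexing captions to produce metadata can be sketched as below. The caption lines and time codes are invented, and a real caption feed would need decoding first; the sketch only assumes that each caption arrives as a (time code, text) pair and that ">>" marks a new speaker.

```python
import re
from collections import defaultdict

# Hypothetical decoded caption stream: (time code, caption text).
captions = [
    ("00:00:05", ">> GOOD EVENING. A STORM IS MOVING TOWARD THE COAST."),
    ("00:00:12", ">> THE STORM SHOULD ARRIVE BY MORNING."),
    ("00:00:20", "SCHOOLS ALONG THE COAST WILL CLOSE."),
]


def index_captions(captions):
    """Return (word -> list of time codes, list of time codes where the speaker changes)."""
    index = defaultdict(list)
    speaker_changes = []
    for time_code, text in captions:
        if text.startswith(">>"):
            speaker_changes.append(time_code)
        # Record each distinct word once per caption line.
        for word in sorted(set(re.findall(r"[a-z']+", text.lower()))):
            index[word].append(time_code)
    return dict(index), speaker_changes


caption_index, speaker_changes = index_captions(captions)
print(caption_index["storm"])
print(speaker_changes)
```

A monitoring service could watch such an index for chosen keywords and jump straight to the matching time codes in the video.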

36 -36 - Can you think of another TV "LHF"? Where is another source of already-in-text-form metadata about TV program contents? (I can think of two).

37 -37 - Can you think of another TV "LHF"? Where is another source of already-in-text-form metadata about TV program contents? (I can think of two). Electronic Program Guides, such as Tivo's TV programming schedule Broadcasters' Websites (e. g.

38 -38 - We've discussed third party logging But what about in-house logging (by materials' own producers.) Static metadata (exists independently of the essence) Production Notes, including original scripts Edit Decision List (part of production notes) Advanced Authoring Format (AAF) News Feed rundowns (cues for local broadcasters) Media Object Server (MOS) format

39 -39 - We've discussed third party logging But what about in-house logging (by materials' own producers.) Dynamic metadata (sampled from or derived from the essence) A hierarchy of proxy representations: -time code (ties it all together) -Proxy video (low res, maybe easier to scan – or harder!) -Keyframes (still images for pattern recognition) -Audio transcript -annotation – added by staff

40 -40 - Speech Analysis -Phoneme: the smallest unit of speech sound that distinguishes one word from another. English has about 44. -Phone: the 'rendering' of a phoneme by an individual. Infinitely many. -Recognition of words: difficult under good conditions, nearly impossible under noisy conditions. However, you don't need to get ALL the words to make the document searchable. Even getting SOME of the words is better than none.

41 -41 - Indexing things that aren't words -Built-in metadata (e.g. digital camera data, Adobe metadata) -Image libraries – cataloged by human beings (We will study some of the metadata standards used.) -Automatic pattern recognition -http://www.autonomy.com/content/Solutions/video-surveillance/index.en.html -Assignment: Download ONE of the "Autonomy Virage" documents, read it and be prepared to give a one-minute summary of its claims.

42 -42 - Recognizing Faces -FINDING a face in a scene is far easier than RECOGNIZING it. -Nikon's cameras can now find faces and focus on them. Face-priority AF in Nikon Coolpix Cameras But it's a rough rough world out there. The website listed below provides a list of vendors... many of which are 'dead links' as companies come and go.

43 -43 - And... where do we go from here? Go back through these slides. Make a list of the important words. If you can write a one-sentence explanation of every word on this list, AND answer logarithm questions, you're ready for the midterm.... at least with regard to Searching and Pattern Recognition. But now let's go talk about SORTING.

44 -44 - Sorting Why are we discussing this? It's a good example of DUMB vs. SMART algorithms. What's an algorithm? A systematic procedure for solving a problem. Programs are built on the basis of algorithms. But so are * carpentry * medical diagnosis * electronic repair.. Etc etc etc.

45 -45 - Sorting and Ignorance Two thousand name-tags Printed in NAME order Needed in COMPANY order So… they put Six temps to Work … For HOURS… Mnddc.org.

46 -46 - Sorting the Hard Way Spread 'em all on a long table Insert each one into the ordered pile. Problem: The pile gets bigger and bigger, so the insertion goes more & more slowly..

47 -47 - Sorting the Hard Way Spread 'em all on a long table. Insert each one into the ordered pile. Walk down the row (pass up to n badges), insert one. Do this n times. You have n * n distance to walk. This technique takes O(n^2), that's n squared: 2000 * 2000 = 4 million operations!
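The walk-and-insert procedure can be sketched with a step counter. This is an illustration under assumptions: badges are represented by plain numbers, and a "step" is counted each time we walk past one badge already in the ordered pile.

```python
def insertion_sort_counting(badges):
    """Insert each badge into an ordered pile, counting badges walked past."""
    pile = []
    steps = 0
    for badge in badges:
        position = 0
        # Walk down the row until we reach the spot for this badge.
        while position < len(pile) and pile[position] <= badge:
            position += 1
            steps += 1
        pile.insert(position, badge)
    return pile, steps


# Worst case for walking from the front: every badge belongs at the far end.
badges = list(range(2000))
pile, steps = insertion_sort_counting(badges)
print(steps)  # 1999000
```

The exact worst-case count is n * (n - 1) / 2, roughly half of n * n. Big-O notation ignores the constant factor, so this is still O(n^2), and the 4 million figure is a safe upper bound.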

48 -48 - Sorting, a smart way 1. Grab 20 badges, and sort them in a small group. Create 100 small, sorted batches. 2. Combine the batches 2 by 2, like this: etc

49 -49 - Sorting, a smart way 2. Combine the batches 2 by 2, like this: etc. Reminds you of binary search? Yes: merging twice as many groups only takes one more step (layer). 4 groups – 2 layers (3 operations). 8 groups – 3 layers (7 operations). Etc.

50 -50 - Sorting by 'merge-sort' Merge-Sort requires O(n log2(n)) operations to sort n objects. For 2000 name badges, log2(2000) = log2(1000) + 1. You recognize log2(1000) ~= log2(1k) = 10, so log2(2000) ~= 11. So our total estimate for sorting 2000 name badges is approximately 2000 * 11, or 22,000 steps, compared to 4 million steps (2000 * 2000) if doing the job the BFI (Brute Force & Ignorance) way!
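The 22,000-step estimate can be checked with a minimal merge-sort sketch that counts comparisons. It splits the pile in half recursively instead of starting from batches of 20, which is a simplification of the name-badge procedure; the badge numbers are randomly generated stand-ins.

```python
import random


def merge_sort(items):
    """Return (sorted copy of items, number of comparisons made)."""
    if len(items) <= 1:
        return list(items), 0
    middle = len(items) // 2
    left, left_count = merge_sort(items[:middle])
    right, right_count = merge_sort(items[middle:])

    merged = []
    comparisons = left_count + right_count
    i = j = 0
    # Merge two sorted halves by repeatedly taking the smaller front item.
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


random.seed(0)
badges = random.sample(range(10_000), 2000)  # 2000 hypothetical badge numbers
ordered, comparisons = merge_sort(badges)
print(comparisons)
```

For 2000 badges the comparison count stays below the 2000 * 11 = 22,000 estimate, against roughly two to four million steps for the walk-and-insert approach.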

51 -51 - Sorting by 'merge-sort' The moral of this story: 1) Do a little research before you undertake A major project. An hour's investigation might save WEEKS of work, And it might save your BUSINESS. 2) Ask an expert, if you have one. Become an expert, if you don't have one. Usfamily.net



