Presentation is loading. Please wait.

Presentation is loading. Please wait.

2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information.

Similar presentations


Presentation on theme: "2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information."— Presentation transcript:

1 2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information In Collections Automatic Classification Some slides in this lecture were originally created by Prof. Marti Hearst

2 2007.02.08 - SLIDE 2IS 240 – Spring 2007 Overview Introduction to Automatic Classification and Clustering Classification of Classification Methods Classification Clusters and Information Retrieval in Cheshire II The 4W project revisited…

3 2007.02.08 - SLIDE 3IS 240 – Spring 2007 Classification The grouping together of items (including documents or their representations) which are then treated as a unit. The groupings may be predefined or generated algorithmically. The process itself may be manual or automated. In document classification the items are grouped together because they are likely to be wanted together –For example, items about the same topic.

4 2007.02.08 - SLIDE 4IS 240 – Spring 2007 Automatic Indexing and Classification Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words. More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document. Automatic classification attempts to automatically group similar documents using either: –A fully automatic clustering method. –An established classification scheme and set of documents already indexed by that scheme.

5 2007.02.08 - SLIDE 5IS 240 – Spring 2007 Background and Origins Early suggestion by Fairthorne –“The Mathematics of Classification” Early experiments by Maron (1961) and Borko and Bernick(1963) Work in Numerical Taxonomy and its application to Information retrieval Jardine, Sibson, van Rijsbergen, Salton (1970’s). Early IR clustering work more concerned with efficiency issues than semantic issues.

6 2007.02.08 - SLIDE 6IS 240 – Spring 2007 Cluster Hypothesis The basic notion behind the use of classification and clustering methods: “Closely associated documents tend to be relevant to the same requests.” –C.J. van Rijsbergen

7 2007.02.08 - SLIDE 7IS 240 – Spring 2007 Classification of Classification Methods Class Structure –Intellectually Formulated Manual assignment (e.g. Library classification) Automatic assignment (e.g. Cheshire Classification Mapping) –Automatically derived from collection of items Hierarchic Clustering Methods (e.g. Single Link) Agglomerative Clustering Methods (e.g. Dattola) Hybrid Methods (e.g. Query Clustering)

8 2007.02.08 - SLIDE 8IS 240 – Spring 2007 Classification of Classification Methods Relationship between properties and classes –monothetic –polythetic Relation between objects and classes –exclusive –overlapping Relation between classes and classes –ordered –unordered Adapted from Sparck Jones

9 2007.02.08 - SLIDE 9IS 240 – Spring 2007 Properties and Classes Monothetic –Class defined by a set of properties that are both necessary and sufficient for membership in the class Polythetic –Class defined by a set of properties such that to be a member of the class some individual must have some number (usually large) of those properties, and that a large number of individuals in the class possess some of those properties, and no individual possesses all of the properties.

10 2007.02.08 - SLIDE 10IS 240 – Spring 2007 A B C D E F G H 1 + + + 2 + + + 3 + + + 4 + + + 5 + + + 6 + + + 7 + + + 8 + + + Monothetic vs. Polythetic Polythetic Monothetic Adapted from van Rijsbergen, ‘79

11 2007.02.08 - SLIDE 11IS 240 – Spring 2007 Exclusive Vs. Overlapping Item can either belong exclusively to a single class Items can belong to many classes, sometimes with a “membership weight”

12 2007.02.08 - SLIDE 12IS 240 – Spring 2007 Ordered Vs. Unordered Ordered classes have some sort of structure imposed on them –Hierarchies are typical of ordered classes Unordered classes have no imposed precedence or structure and each class is considered on the same “level” –Typical in agglomerative methods

13 2007.02.08 - SLIDE 13IS 240 – Spring 2007 Clustering Methods Hierarchical Agglomerative Hybrid Automatic Class Assignment

14 2007.02.08 - SLIDE 14IS 240 – Spring 2007 Coefficients of Association Simple Dice’s coefficient Jaccard’s coefficient Cosine coefficient Overlap coefficient

15 2007.02.08 - SLIDE 15IS 240 – Spring 2007 Hierarchical Methods 2.4 3.4.2 4.3.3.3 5.1.4.4.1 1 2 3 4 Single Link Dissimilarity Matrix Hierarchical methods: Polythetic, Usually Exclusive, Ordered Clusters are order-independent

16 2007.02.08 - SLIDE 16IS 240 – Spring 2007 Threshold =.1 Single Link Dissimilarity Matrix 2.4 3.4.2 4.3.3.3 5.1.4.4.1 1 2 3 4 2 0 3 0 0 4 0 0 0 5 1 0 0 1 1 2 3 4 2 1 3 5 4

17 2007.02.08 - SLIDE 17IS 240 – Spring 2007 Threshold =.2 2.4 3.4.2 4.3.3.3 5.1.4.4.1 1 2 3 4 2 0 3 0 1 4 0 0 0 5 1 0 0 1 1 2 3 4 2 1 3 5 4

18 2007.02.08 - SLIDE 18IS 240 – Spring 2007 Threshold =.3 2.4 3.4.2 4.3.3.3 5.1.4.4.1 1 2 3 4 2 0 3 0 1 4 1 1 1 5 1 0 0 1 1 2 3 4 2 1 3 5 4

19 2007.02.08 - SLIDE 19IS 240 – Spring 2007 Clustering Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered clusters are order-dependent. Doc 1. Select initial centers (I.e. seed the space) 2. Assign docs to highest matching centers and compute centroids 3. Reassign all documents to centroid(s) Rocchio’s method (similar to current K-means methods

20 2007.02.08 - SLIDE 20IS 240 – Spring 2007 Automatic Class Assignment Doc Search Engine 1. Create pseudo-documents representing intellectually derived classes. 2. Search using document contents 3. Obtain ranked list 4. Assign document to N categories ranked over threshold. OR assign to top-ranked category Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered clusters are order-independent, usually based on an intellectually derived scheme

21 2007.02.08 - SLIDE 21IS 240 – Spring 2007 Automatic Categorization in Cheshire II The Cheshire II system is intended to provide a bridge between the purely bibliographic realm of previous generations of online catalogs and the rapidly expanding realm of full-text and multimedia information resources. It is currently used in the UC Berkeley Digital Library Project and for a number of other sites and projects.

22 2007.02.08 - SLIDE 22IS 240 – Spring 2007 Overview of Cheshire II It supports SGML as the primary database type. It is a client/server application. Uses the Z39.50 Information Retrieval Protocol. Supports Boolean searching of all servers. Supports probabilistic ranked retrieval in the Cheshire search engine. Supports ``nearest neighbor'' searches, relevance feedback and Two-Stage Search. GUI interface on X window displays (Tcl/Tk). HTTP/CGI interface for the Web (Tcl scripting).

23 2007.02.08 - SLIDE 23IS 240 – Spring 2007 Cheshire II - Cluster Generation Define basis for clustering records. –Select field to form the basis of the cluster. –Evidence Fields to use as contents of the pseudo- documents. During indexing cluster keys are generated with basis and evidence from each record. Cluster keys are sorted and merged on basis and pseudo-documents created for each unique basis element containing all evidence fields. Pseudo-Documents (Class clusters) are indexed on combined evidence fields.

24 2007.02.08 - SLIDE 24IS 240 – Spring 2007 Cheshire II - Two-Stage Retrieval Using the LC Classification System –Pseudo-Document created for each LC class containing terms derived from “content-rich” portions of documents in that class (subject headings, titles, etc.) –Permits searching by any term in the class –Ranked Probabilistic retrieval techniques attempt to present the “Best Matches” to a query first. –User selects classes to feed back for the “second stage” search of documents. Can be used with any classified/Indexed collection.

25 2007.02.08 - SLIDE 25IS 240 – Spring 2007 Probabilistic Retrieval: Logistic Regression Estimates for relevance based on log-linear model with various statistical measures of document content as independent variables. Log odds of relevance is a linear function of attributes: Term contributions summed: Probability of Relevance is inverse of log odds:

26 2007.02.08 - SLIDE 26IS 240 – Spring 2007 Probabilistic Retrieval: Logistic Regression In Cheshire II probability of relevance is based on Logistic regression from a sample set of TREC documents to determine values of the coefficients. At retrieval the probability estimate is obtained by: For 6 attributes or “clues” about term usage in the documents and the query

27 2007.02.08 - SLIDE 27IS 240 – Spring 2007 Probabilistic Retrieval: Logistic Regression attributes Average Absolute Query Frequency Query Length Average Absolute Document Frequency Document Length Average Inverse Document Frequency Inverse Document Frequency Number of Terms in common between query and document (M) -- logged

28 2007.02.08 - SLIDE 28IS 240 – Spring 2007 Cheshire II Demo Examples from the: –SciMentor(BioSearch) project Journal of Biological Chemistry and MEDLINE data –CHESTER (EconLit) Journal of Economic Literature subjects –Unfamiliar Metadata & TIDES Projects Basis for clusters is a normalized Library of Congress Class Number Evidence is provided by terms from record titles (and subject headings for the “all languages” Five different training sets (Russian, German, French, Spanish, and All Languages Testing cross-language retrieval and classification –4W Project Search

29 2007.02.08 - SLIDE 29IS 240 – Spring 2007 References Christian Plaunt & Barbara Norgard, “An Association Based Method for Automatic Indexing with a Controlled Vocabulary”. JASIS 49(10), 1998. –Preprint available available on class web site Ray R. Larson, Jerome McDonough, Lucy Kuntz, Paul O’Leary & Ralph Moon, “Cheshire II: Designing a Next- Generation Online Catalog”. JASIS, 47(7) 555-567, 1996.

30 2007.02.08 - SLIDE 30IS 240 – Spring 2007 Developing a Metadata Infrastructure for Information Access: What, Where, When and Who? Prof. Ray R. Larson University of California, Berkeley School of Information

31 2007.02.08 - SLIDE 31IS 240 – Spring 2007 Overview Metadata as Infrastructure –What, Where, When and Who? What are Entry Vocabulary Indexes? –Notion of an EVI –How are EVIs Built Time Period Directories –Mining Metadata for new metadata

32 2007.02.08 - SLIDE 32IS 240 – Spring 2007 Metadata as Infrastructure The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?

33 2007.02.08 - SLIDE 33IS 240 – Spring 2007 Metadata as Infrastructure The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who. The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.

34 2007.02.08 - SLIDE 34IS 240 – Spring 2007 What? Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents. Two kinds of mapping in every search: –Documents are assigned to topic categories, e.g. Dewey –Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers. Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.

35 2007.02.08 - SLIDE 35IS 240 – Spring 2007 Texts ‘What’ searches involve mapping to controlled vocabularies Thesaurus/ Ontology

36 2007.02.08 - SLIDE 36IS 240 – Spring 2007 Start with a collection of documents.

37 2007.02.08 - SLIDE 37IS 240 – Spring 2007 Classify and index with controlled vocabulary Or use a pre- indexed collection. Index

38 2007.02.08 - SLIDE 38IS 240 – Spring 2007 Problem: Controlled Vocabularies can be difficult for people to use. “pass mtr veh spark ign eng” Index Use: “Economic Policy” In Library of Congress subj For: “Wirtschaftspolitik”

39 2007.02.08 - SLIDE 39IS 240 – Spring 2007 Solution: Entry Level Vocabulary Indexes. Index EVI pass mtr veh spark ign eng” = “Automobile”

40 2007.02.08 - SLIDE 40IS 240 – Spring 2007 “What” and Entry Vocabulary Indexes EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…

41 2007.02.08 - SLIDE 41IS 240 – Spring 2007 Has an Entry Vocabulary Module been built? User selects a subject domain of interest. Download a set of training data. Build associations between extracted terms & controlled vocabularies. Map user’s query to ranked list of controlled vocabulary terms Part of speech tagging Use an existing EVI. Extract terms (words and noun phrases) from titles and abstracts. User selects search terms from the ranked list of terms returned by the EVI. YES Building an Entry Vocabulary Module (EVI) Searching For noun phrases Internet DB indexed with a controlled vocabulary. Domains to select from: Engineering, Medicine, Biology, Social science, etc. User has question but is unfamiliar with the domain he wants to search. NO Building and Searching EVIs

42 2007.02.08 - SLIDE 42IS 240 – Spring 2007 Technical Details Download a set of training data. Build associations between extracted terms & controlled vocabularies. Part of speech tagging Extract terms (words and noun phrases) from titles and abstracts. Building an Entry Vocabulary Module (EVI) For noun phrases Internet DB indexed with a controlled vocabulary.

43 2007.02.08 - SLIDE 43IS 240 – Spring 2007 Association Measure C ¬C t a b ¬t c d Where t is the occurrence of a term and C is the occurrence of a class in the training set

44 2007.02.08 - SLIDE 44IS 240 – Spring 2007 Association Measure Maximum Likelihood ratio W(C,t) = 2[logL(p 1,a,a+b) + logL(p 2,c,c+d) - logL(p,a,a+b) – logL(p,c,c+d)] where logL(p,n,k) = klog(p) + (n – k)log(1- p) and p 1 = p 2 = p= a a+b c c+d a+c a+b+c+d Vis. Dunning

45 2007.02.08 - SLIDE 45IS 240 – Spring 2007 Alternatively Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion

46 2007.02.08 - SLIDE 46IS 240 – Spring 2007 Find Plutonium In Arabic Chinese Greek Japanese Korean Russian Tamil Statistical association Digital library resources

47 2007.02.08 - SLIDE 47IS 240 – Spring 2007 EVI example EVI 1 Index term: “pass mtr veh spark ign eng” User Query “Automobile” EVI 2 Index term: “automobiles” OR “internal combustible engines”

48 2007.02.08 - SLIDE 48IS 240 – Spring 2007 But why stop there? Index EVI

49 2007.02.08 - SLIDE 49IS 240 – Spring 2007 “Which EVI do I use?” Index EVI Index EVI Index EVI

50 2007.02.08 - SLIDE 50IS 240 – Spring 2007 EVI to EVIs Index EVI Index EVI Index EVI EVI 2

51 2007.02.08 - SLIDE 51IS 240 – Spring 2007 Find Plutonium In Arabic Chinese Greek Japanese Korean Russian Tamil Why not treat language the same way?

52 2007.02.08 - SLIDE 52IS 240 – Spring 2007 Texts Numeric datasets It is also difficult to move between different media forms Thesaurus/ Ontology EVI

53 2007.02.08 - SLIDE 53IS 240 – Spring 2007 Searching across data types Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results

54 2007.02.08 - SLIDE 54IS 240 – Spring 2007 Texts Numeric datasets But texts associated with numeric data can be mapped as well… Thesaurus/ Ontology captions EVI

55 2007.02.08 - SLIDE 55IS 240 – Spring 2007 EVI to Numeric Data example EVI LCSH marcnew query search results captions numeric table numeric database online catalog search interface 1 search interface 2 1 876 5 432 11 109

56 2007.02.08 - SLIDE 56IS 240 – Spring 2007 Texts Numeric datasets But there are also geographic dependencies… Thesaurus/ Ontology captionsMaps/ Geo Data EVI

57 2007.02.08 - SLIDE 57IS 240 – Spring 2007 WHERE: Place names are problematic… Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg,... Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar. Names changes: Bombay  Mumbai. Homographs:Vienna, VA, and Vienna, Austria; –50 Springfields. Anachronisms: No Germany before 1870 Vague, e.g. Midwest, Silicon Valley Unstable boundaries: 19th century Poland; Balkans; USSR Use a gazetteer!

58 2007.02.08 - SLIDE 58IS 240 – Spring 2007 WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map. Timebar 

59 2007.02.08 - SLIDE 59IS 240 – Spring 2007 Zoom on map. Click on place for a list of records. Click on record to display text.

60 2007.02.08 - SLIDE 60IS 240 – Spring 2007 Catalogs and gazetteers should talk to each other! Geographic sort / display of catalog search result. Catalog search Gazetteer search

61 2007.02.08 - SLIDE 61IS 240 – Spring 2007 Texts Numeric datasets So geographic search becomes part of the infrastructure Thesaurus/ Ontology GazetteerscaptionsMaps/ Geo Data EVI

62 2007.02.08 - SLIDE 62IS 240 – Spring 2007 WHEN: Search by time is also weakly supported… Calendars are the standard for time But people use the names of events to refer to time periods Named time periods resemble place names in being: –Unstable: European War, Great War, First World War –Multiple: Second World War, Great Patriotic War –Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc. Places have temporal aspects & periods have geographical aspects: When the Stone Age was, varies by region

63 2007.02.08 - SLIDE 63IS 240 – Spring 2007 Suggests a similar solution: A gazetteer- like Time Period Directory. Gazetteer: –Place name – Type – Spatial markers (Lat & long) -- When Time Period Directory: –Period name – Type – Time markers (Calendar) – Where Note the symmetry in the connections between Where and When. Similarity between place names and period names

64 2007.02.08 - SLIDE 64IS 240 – Spring 2007 Solution - Time Period Directories Initial development involved mining the Library of Congress Subject Authority file for named time periods…

65 2007.02.08 - SLIDE 65IS 240 – Spring 2007 LC MARC Authorities Records sh 00000613 Magdeburg (Germany) History Siege, 1550- 1551 g Sieges Germany Work cat.: 45053442: Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552. Cath. encyc. (Magdeburg: besieged (1550- 51) by the Margrave Maurice of Saxony) Ox. encyc. reformation (Magdeburg:... during the 1550-1551 siege of Magdeburg...)

66 2007.02.08 - SLIDE 66IS 240 – Spring 2007 timePeriodEntry  Time Period Directory Instance  Contains components described below - periodID  Unique identifier - periodName  Period name, can be repeated for alternative names  Information about language, script, transliteration scheme  Source information and notes (where was the period name mentioned) - descriptiveNotes  Description of time period - dates  Calendar and date format  Begin & end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing)  Notes, sources - periodClassification  Period type, e.g. Period of Conflict, Art movement  Can plug in different classification schemes  Can be repeated for several classifications - location  Associated places with time period  Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates  Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic names)  Recently added coordinates for direct use - relatedPeriod  Related time periods  periodID of related periods  Information about relationship type (part-of, successor etc.)  Can plug in different relationship type schemes - entryMetadata  Notes about creator / creation of instance  Entry date  Modification date

67 2007.02.08 - SLIDE 67IS 240 – Spring 2007

68 2007.02.08 - SLIDE 68IS 240 – Spring 2007 Time periods by named location

69 2007.02.08 - SLIDE 69IS 240 – Spring 2007 Catalog Search Result

70 2007.02.08 - SLIDE 70IS 240 – Spring 2007 Web Interface - Access by map

71 2007.02.08 - SLIDE 71IS 240 – Spring 2007 Zoomable interface gives access to geographically focused info…

72 2007.02.08 - SLIDE 72IS 240 – Spring 2007 Link initiates search of the Library of Congress catalog for all records relating to this time period. Web Interface - Access by timeline

73 2007.02.08 - SLIDE 73IS 240 – Spring 2007 WHEN and WHAT These named time periods are derived from Library of Congress catalog subject headings and so can be used for catalog searching which finds books on topics important for that time period

74 2007.02.08 - SLIDE 74IS 240 – Spring 2007 Time period directories link via the place (or time) Texts Numeric datasets Thesaurus/ Ontology GazetteerscaptionsMaps/ Geo Data EVI Time Period Directory Time lines, Chronologies

75 2007.02.08 - SLIDE 75IS 240 – Spring 2007 WHEN, WHERE and WHO Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

76 2007.02.08 - SLIDE 76IS 240 – Spring 2007 Place and time are broadly important across numerous tools and genres including, e.g. Language atlases, Library catalogs, Biographical dictionaries, Bibliographies, Archival finding aids, Museum records, etc., etc. Biographical dictionaries are heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970. Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.

77 2007.02.08 - SLIDE 77IS 240 – Spring 2007 A new form of biographical dictionary would link to all Texts Numeric datasets Thesaurus/ Ontology GazetteerscaptionsMaps/ Geo Data EVI Time Period Directory Time lines, Chronologies Biographical Dictionary

78 2007.02.08 - SLIDE 78IS 240 – Spring 2007 A Metadata Infrastructure CATALOGS Achives Historical Societies Libraries Museums Public Television Publishers Booksellers Audio Images Numeric Data Objects Texts Virtual Reality Webpages RESOURCES INTERMEDIA INFRASTRUCTURE Text and ImagesBiographical DictionaryWHO TimelinesTime Period DirectoryWHEN MapsGazetteer WHERE Syndetic StructureThesaurusWHAT Special Display ToolsAuthority ControlFacet Learners Dossiers

79 2007.02.08 - SLIDE 79IS 240 – Spring 2007 Acknowledgements Electronic Cultural Atlas Initiative project This work was partially supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries, award number LG-02-04-0041-04, Oct 2004 - Sept 2006 entitled “Supporting the Learner: What, Where, When and Who” – See: http://ecai.org/imls2004 Michael Buckland, Fred Gey, Vivien Petras, Matt Meiske, Kim Carl, Anya Kartavenko, Minakshi Mukherjee Contact: ray@sims.berkeley.edu


Download ppt "2007.02.08 - SLIDE 1IS 240 – Spring 2007 Lecture 8: Clustering University of California, Berkeley School of Information IS 245: Organization of Information."

Similar presentations


Ads by Google