Published by Patience Churchman. Modified over 9 years ago.
1
Mediated Information Retrieval – The WebCluster and MIR Projects – Gheorghe Muresan School of Communication, Information and Library Sciences Rutgers University
2
Structure of the talk The WebCluster project Design decisions in WebCluster The MIR project Integrated approach to interaction modeling, logging and analysis
3
WebCluster - Motivation Information Need Query Search engine (within some subject domain) WWW_SearchEngine Domain Gulfs: information need vs. query; structured subject domain vs. unstructured target collection (WWW)
4
Information need 1. Select library 2. Consult catalog 3. Browse shelves 4. Use inter-library scheme Information Need Formulation Interaction in the library
5
1. Select source collection Information Need Formulation 2. Explore source collection with ClusterBook Results Information need 3. Search WWW Can we simulate the library interaction ? Structured source collections
6
The mediated access interaction Information need Web search engine WebCluster Query Specialised source Target collection (WWW) Topical documents
7
Interaction model vs. prototype Structuring the source collection Document clustering Supervised classification Manual (intellectual) classification Exploring the structured source collection Metaphor – Library, book, encyclopaedia Visualization tool – Folder metaphor, hyperbolic tree, themescape, cone trees, thematic maps Search strategies supported – Best match or cluster-based searching, browsing
8
Model vs. prototype Interaction model Explicit (the user marks relevant documents) vs. implicit (cues on relevance are derived from user behavior/actions) Transparent (the user is aware) vs. opaque (the user is happy to see the effect of ‘magic’) Automatic vs. manual/intellectual generation of the mediated query Query model Language models (generative, Kullback-Leibler) Probabilistic models Rocchio or other RF-specific formulae
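The Rocchio option listed under "Query model" can be sketched as below. This is a minimal illustration under stated assumptions, not the formula used in WebCluster: the function name `rocchio`, the dict-based term vectors and the default `alpha`, `beta`, `gamma` weights are all hypothetical choices made for the example.

```python
from collections import Counter

def rocchio(query, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return a re-weighted query (term -> weight) using a Rocchio-style update.

    query and each document are dicts mapping a term to its weight.
    The alpha/beta/gamma defaults are illustrative assumptions.
    """
    new_query = Counter()
    # Keep the original query terms, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the centroid of relevant docs, subtract the centroid of non-relevant docs.
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms whose weight ends up non-positive are dropped from the query.
    return {term: w for term, w in new_query.items() if w > 0}

mediated_query = rocchio(
    {"wind": 1.0},
    relevant_docs=[{"wind": 1.0, "farm": 2.0}],
    nonrelevant_docs=[{"yacht": 1.0}],
)
```

In the explicit scenario the relevant documents would be the ones the user marked; in the implicit scenario they would be inferred from user actions.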
9
ClusterBook - Source collection
10
ClusterBook - Target collection
11
Informal experiments - Objectives - Test the users’ reaction to the mediated access concept Test the user satisfaction regarding the functionality of the system, and the relevance of the documents retrieved Formative usability testing - some volunteers were not only experienced searchers, but also had experience in evaluating IR systems Comparison of user generated queries vs. system generated queries Note. These experiments were run at different stages of the development
12
Informal experiments - Experimental procedure - Subjects received an introduction to the system Task assigned: “You are a trainee in a newspaper. You support the journalists by providing information for the topic of their articles.” Sample topics: The history of the Brazilian debt crisis How are the quotas for growing coffee set and controlled on a world-wide basis? Source collection: a sub-collection of Reuters (newspaper articles) Steps followed by users (explicit scenario): Formulate a query and record it Browse source collection, select ‘best’ cluster, edit query generated by system, submit it to the search engine Submit to the same search engine the initial, self-generated query Compare results of the two searches
13
Informal experiments - Results - Users found the mediation useful for unfamiliar topics The system nearly always proposed new, good query terms Users were not always good at recognizing ‘good’ query terms The system also proposed bad query terms (not specific to the topic), so the opaque scenario is not viable unless the query formulation is improved The two-step process was questioned when: the query formulation was considered easy, for a familiar topic the documents of the source collection were considered sufficient to cover the information need Complete link, group average – OK; single link – bad Overall, the system is usable
14
Consequences of informal experiments Formal experiments are needed to verify the main assumptions: The Cluster Hypothesis holds for a specialized collection Good clusters can be found with the search strategies provided Mediated queries can improve retrieval effectiveness The effect on retrieval performance of various parameters should be compared Weighting schemes Clustering methods Search strategies
15
Fixed Plants Coastal Wind Farms Pacific Rim Wind Farms Design of Coastal Wind Farms Design of …. Desert Wind Farms Inland Wind Farms... Portable Generators... Wind generators for yachts Power Generation Propulsion Wind Energy Critical issue: the label generation Document representatives: searching Cluster representatives: browsing, searching, mediation Collection representatives: collection selection
16
Mediation experiment - simulations Objectives: Test the potential of mediation to increase retrieval effectiveness Test the effect on effectiveness of a variety of parameters Search engine Simple query generator (baseline) Topic-based mediator (upperbound) Source collection Target collection Cluster-based mediation (realistic mediation)
17
Experimental setup Interactive track of TREC-8 Offers relevance judgments for complex topics, with a multitude of aspects Offers the experimental design for the user experiment Six topics with 12 to 56 aspects each Target collection: FT 1991-4, with 210,158 articles Source collection built based on relevance judgments: half of the relevant documents, their nearest neighbors, plus the documents judged non-relevant
18
Results – the cluster hypothesis Aspectual cluster hypothesis confirmed by an extended version of the van Rijsbergen – Sparck Jones separation test Similarity between pairs of docs covering the same aspect is higher than between pairs of docs covering the same topic, which is higher than between pairs of docs in the collection Consequence confirmed: clustering groups documents in pockets of relevance
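The ordering claimed on this slide (same aspect > same topic > whole collection) can be checked on a toy example. The sketch below is a simplified comparison of mean pairwise similarities, not the extended separation test itself; the documents, topic and aspect labels and the helper names are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical toy collection: (topic, aspect, term vector).
DOCS = [
    ("t1", "a1", {"wind": 1, "coast": 1}),
    ("t1", "a1", {"wind": 1, "coast": 1, "farm": 1}),
    ("t1", "a2", {"wind": 1, "yacht": 1}),
    ("t2", "b1", {"coffee": 1, "quota": 1}),
]

def mean_similarity(docs, keep):
    """Mean cosine similarity over the document pairs selected by keep(x, y)."""
    sims = [cosine(x[2], y[2]) for x, y in combinations(docs, 2) if keep(x, y)]
    return sum(sims) / len(sims) if sims else 0.0

same_aspect = mean_similarity(DOCS, lambda x, y: x[:2] == y[:2])
same_topic = mean_similarity(DOCS, lambda x, y: x[0] == y[0])
any_pair = mean_similarity(DOCS, lambda x, y: True)
print(same_aspect, same_topic, any_pair)
```

A real test would compare the full distributions of similarities rather than only their means.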
19
Results – retrieval effectiveness Tf-Idf > KL > RelFreq as weighting schemes for document representation Adding disambiguation terms to the query increases recall, but decreases precision Nearest-neighbor mediation (“more like this”) highly significantly improves both recall and precision, even if just one exemplary document is offered for each topic aspect Cosine and Dice perform similarly
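For reference, the two similarity measures compared on this slide can be sketched for weighted term vectors as follows. The dict-based representation and the function names are assumptions made for illustration; the weighted form of the Dice coefficient shown here is one common variant.

```python
import math

def dot(a, b):
    """Inner product of two term-weight dicts."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cosine(a, b):
    """Cosine: inner product divided by the product of the vector lengths."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def dice(a, b):
    """Weighted Dice: twice the inner product over the sum of squared lengths."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0

doc = {"wind": 3.0, "farm": 4.0}
query = {"wind": 3.0}
print(cosine(doc, query), dice(doc, query))
```

The two measures coincide when both vectors have the same length, and Dice is never larger than cosine, which is consistent with the observation that they behave similarly.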
20
Mediation results Upperbound experiment (all relevant docs known in source) Both recall and precision increase with query length Query term weights strongly affect performance No evidence that uniformity of term frequency affects performance Clustered source mediation Best cluster mediation increases P, decreases R “Fuse and search” – strong increase in R and P “Search and fuse” – good R, terrible P !
21
Contributions of WebCluster Proposes and explores system-based mediated access to very large heterogeneous document collections Explores the use of clustering for capturing the topical, semantic structure of a problem domain (as represented by a specialized collection) Explores the use of language models for building cluster and document representatives Offers a framework for building structured portals on the WWW Offers a framework for building collaborative environments
22
Structure of the talk The WebCluster project Design decisions in WebCluster The MIR project Integrated approach to interaction modeling, logging and analysis
23
Structure of the talk The WebCluster project Design decisions in WebCluster The MIR project Integrated approach to interaction modeling, logging and analysis
24
User experiment – effectiveness of mediated information retrieval for Web searches
Display (between-subjects) | Non-mediated (within-subjects) | Mediated (within-subjects) | Subjects
Linear ranking list | NL: Linear (Web) | ML: Linear (NJEDL and Web) | 16
Combination of linear ranking and classified display of search results | NC: Combination (Web) | MC: Combination (NJEDL and Web) | 16
Total subjects | | | 32
NL: Non-mediated and Linear; NC: Non-mediated and Combination; ML: Mediated and Linear; MC: Mediated and Combination
25
Research Hypotheses The mediated system is conducive to higher effectiveness than the non-mediated system The combination of linear/ranked display with a hierarchic/clustered display is conducive to higher effectiveness than simple ranked display
26
Mediation assumptions Relevant documents tend to be clustered together in the source collection The cluster hypothesis Subjects can identify relevant documents Subjects spend time exploring the source collection and some relevant documents Queries submitted after mediation are better Longer, higher clarity, wider vocabulary
27
Other areas of exploration Interaction models Transitions between states and activities The effect on search behavior of subject expertise or familiarity with a topic The subjects’ ability to recognize good documents and clusters based on snippets or labels Compare user-generated queries with system-generated queries in terms of performance
28
User experiment – no mediation
29
User experiment – mediated access
31
Structure of the talk The WebCluster project Design decisions in WebCluster The MIR project Integrated approach to interaction modeling, logging and analysis
32
Motivation Interest in studying Human Information Behavior and Interactive Information Retrieval Qualitative aspects Patterns of behavior → User models → Predictions of future behavior Quantitative aspects # queries, # query terms, # documents viewed / opened / saved, # errors / corrections, time spent → Conclusions regarding retrieval effectiveness & efficiency Typical tools Think-aloud protocols, video recording, questionnaires, interviews, activity logging
33
Motivation Logging – options: Commercial tools (Morae, uLog, Camtasia, etc) Expensive, less control over what is logged, format usually proprietary DIY – log events related to the research questions Rather inflexible – what if new research ideas come to light? Idea #1: log all semantic events Identified during interaction / interface design → integrate: Interface design Logger design Log analyzer design
34
Motivation - practical Frustration with existing practices in IR research Rutgers participation in Interactive TREC 2002 User interface Logging Idea #2: use state-based design, logging and analysis Advantages: Design tools are plentiful The entire research team can participate in design Once the design is completed, the procedures to generate the logging software and the log analyzer are deterministic
35
Typical interactive IR experiment Research Hypotheses Prepare experimental system Design system Build system Add functionality for logging interactions Run experiment Generate experimental data, including logs of interactions Analyze experimental data Draw conclusions
36
Problems with this experimental model (based on anecdotal evidence) The system is built and the extraction of experimental data from logs is done by “those who can program” in the research group Most researchers are not involved in specifying system requirements The design stage is often skipped The system may display usability problems, which affects the exploration of the research problems The logging functionality is added ad hoc, in an unsystematic fashion There is no standard format for interaction logs; additional software is needed for parsing logs and extracting useful data
37
Proposed integrated approach (Keywords: UML, DTD, XML) Research Hypotheses Conceptual design of the system (UML model of interactions) DTD model of interactions Experimental system Build system XML logger Run experiment Generate experimental data, including logs of interactions XML log format XML parser Analyze experimental data Data visualization
38
Summary Model the interaction using UML diagrams This allows the whole research team to contribute to the design of the user interfaces, and supports the documentation of the interface. Derive a DTD coding for the states of the user interface, the valid user actions in every state and the state transitions that take place based on user actions; Use XML to log the user actions and their outcomes based on the interaction DTD; Based on the DTD, generate a log parser for the log analysis; The log analysis provides interaction information and can also be used for generating a visual / graphic representation of the interaction
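The logging step in this summary can be sketched as below. This is an illustration under stated assumptions, not the MIR logger: the `log/record/message` nesting follows the path quoted later in the talk, but the class name, the `time` element, the event names other than `SaveDoc` and the attribute names are hypothetical. The `ALLOWED_EVENTS` set is only a stand-in for the DTD, since Python's standard `xml.etree.ElementTree` does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the interaction DTD: the events that may be logged.
ALLOWED_EVENTS = {"SubmitQuery", "OpenDoc", "SaveDoc"}

class XmlLogger:
    """Accumulates user actions as <record> elements under a <log> root."""

    def __init__(self):
        self.root = ET.Element("log")

    def log(self, event, timestamp, **attrs):
        # Reject events that the interaction model does not define.
        if event not in ALLOWED_EVENTS:
            raise ValueError(f"event not defined in the interaction model: {event}")
        record = ET.SubElement(self.root, "record")
        ET.SubElement(record, "time").text = str(timestamp)
        message = ET.SubElement(record, "message")
        # XML attribute values must be strings.
        ET.SubElement(message, event, {k: str(v) for k, v in attrs.items()})
        return record

    def to_string(self):
        return ET.tostring(self.root, encoding="unicode")

logger = XmlLogger()
logger.log("SubmitQuery", 0, text="wind farms")
logger.log("OpenDoc", 7, docid="d1")
logger.log("SaveDoc", 12, docid="d1")
print(logger.to_string())
```

Because every record has the same shape, a generic XML parser can read the log back without any experiment-specific parsing code.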
39
MIR – state diagram
40
MIR – DTD
41
MIR – XML log
42
MIR – logger class diagram
43
Design decisions Design patterns State – each state of the interface/system is modeled by a class Inheritance (class hierarchy) is used to model sub-states (states at different levels of granularity) Composition is used to model orthogonal states Visitor – decouples the strategy chosen for parsing / visiting the log from the actions taken in each node Strategy – supports different analysis strategies DOMLogAnalyzer – visits the entire log tree, for a comprehensive analysis XPathAnalyzer – visits only a selection of nodes, relevant for a certain RH (“log/record/message/SaveDoc”)
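The contrast between the two analysis strategies can be sketched with Python's standard XML library. This is a hypothetical illustration, not the DOMLogAnalyzer or XPathAnalyzer classes themselves: the sample log, the event names other than `SaveDoc` and the function names are assumptions. Note that `ElementTree` paths are relative to the root element, so the slide's "log/record/message/SaveDoc" becomes "record/message/SaveDoc" here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical log; only the log/record/message/SaveDoc nesting comes from the slide.
SAMPLE_LOG = """
<log>
  <record><message><SubmitQuery text="wind farms"/></message></record>
  <record><message><OpenDoc docid="d1"/></message></record>
  <record><message><SaveDoc docid="d1"/></message></record>
  <record><message><OpenDoc docid="d2"/></message></record>
  <record><message><SaveDoc docid="d2"/></message></record>
</log>
""".strip()

def analyze_whole_tree(root):
    """Comprehensive strategy: visit every message and count each event type."""
    counts = Counter()
    for message in root.iter("message"):
        for event in message:
            counts[event.tag] += 1
    return counts

def analyze_selection(root, path):
    """Selective strategy: visit only the nodes matched by a path expression."""
    return [node.attrib for node in root.findall(path)]

root = ET.fromstring(SAMPLE_LOG)
event_counts = analyze_whole_tree(root)
saved = analyze_selection(root, "record/message/SaveDoc")
print(event_counts, saved)
```

The selective strategy is enough when a research hypothesis concerns a single event type, such as the number of saved documents.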
44
Design decisions Singletons vs. multiple objects for states Singleton – one object for each class Adv: simple Disadv: only appropriate for cumulative data or summaries Multiple objects Adv: supports accurate, detailed analysis Explicit vs. implicit logging of states Explicit: allows a human reader to interpret the logs; redundancy; problems capturing orthogonal states Implicit: only events are captured, states are re-created
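With implicit logging, the analyzer has to re-create the states from the logged events. A minimal sketch of that replay step follows; the state names are taken from the talk, but the event names and the transition table are hypothetical and would in practice be derived from the interaction model.

```python
# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    ("Think", "StartTyping"): "EditQuery",
    ("EditQuery", "SubmitQuery"): "ViewTargetHitList",
    ("ViewTargetHitList", "OpenDoc"): "ViewTargetDoc",
    ("ViewTargetHitList", "StartTyping"): "EditQuery",
    ("ViewTargetDoc", "SaveDoc"): "SavingDoc",
    ("ViewTargetDoc", "CloseDoc"): "ViewTargetHitList",
    ("SavingDoc", "DocSaved"): "ViewTargetHitList",
}

def replay(events, start="Think"):
    """Re-create the sequence of states visited, given only the logged events."""
    states = [start]
    for event in events:
        key = (states[-1], event)
        if key not in TRANSITIONS:
            # An event that is invalid in the current state points to a
            # logging bug or to a mismatch with the interaction model.
            raise ValueError(f"event {event!r} is not valid in state {states[-1]!r}")
        states.append(TRANSITIONS[key])
    return states

visited = replay(["StartTyping", "SubmitQuery", "OpenDoc", "SaveDoc", "DocSaved"])
print(visited)
```

The same table doubles as a consistency check on the log, since an invalid event raises an error instead of being silently accepted.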
45
Types of analysis supported State transitions, user behavior Average user vs. individual user Levels of state granularity (Think, EditQuery, ViewResults ( ExploreSource ( ViewSourceHierarchy, ViewSourceHitList, ViewSourceDoc ), ExploreTarget ( ViewTargetHitList, ViewTargetDoc, ViewSavedDoc ))) Statistical analysis on qual data ANOVA shows no difference in number of saved docs between non-mediated condition (m=3.94, sd=1.76) and mediated condition (m=3.13, sd=1.62) State trace: Think 4; EditQuery 9; ViewTargetHitList 15; ViewTargetDoc 78; SavingDoc 16; ViewTargetHitList 6; ViewTargetDoc 31; ViewTargetDoc 9; ViewTargetDoc 35; SavingDoc 11; ViewTargetHitList 3; ViewTargetDoc 173; SavingDoc 16; ViewTargetHitList 14; EditQuery 7; ViewTargetHitList 4; ViewTargetDoc 17; ViewTargetDoc 59; ViewTargetDoc 51; ViewTargetDoc 39; EditQuery 13; ViewTargetHitList 25; ViewTargetDoc 38; SavingDoc 15; …
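A state trace like the one on this slide supports two simple summaries: a total per state and a count of state-to-state transitions. The sketch below uses the first eight entries of the trace and assumes that the number after each state is the time spent in it; the slide does not state the unit, so that reading is an assumption, as are the function names.

```python
from collections import Counter

# First eight (state, value) pairs of the trace shown on the slide.
# Assumption: the value is the time spent in the state (unit not stated).
TRACE = [
    ("Think", 4), ("EditQuery", 9), ("ViewTargetHitList", 15),
    ("ViewTargetDoc", 78), ("SavingDoc", 16), ("ViewTargetHitList", 6),
    ("ViewTargetDoc", 31), ("ViewTargetDoc", 9),
]

def total_per_state(trace):
    """Sum the values recorded for each state."""
    totals = Counter()
    for state, value in trace:
        totals[state] += value
    return totals

def transition_counts(trace):
    """Count how often each (from_state, to_state) pair occurs consecutively."""
    states = [state for state, _ in trace]
    return Counter(zip(states, states[1:]))

totals = total_per_state(TRACE)
transitions = transition_counts(TRACE)
print(totals.most_common())
print(transitions.most_common())
```

Summaries of this kind can be computed per subject or averaged over subjects, matching the "average user vs. individual user" distinction above.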
46
Conclusions Advantages of the proposed method Better teamwork All members contribute and are responsible for the design More accurate experimental results Increased usability of the experimental system Accurate data, due to accurate logging of events Less effort in testing and debugging, as well as in parsing and analyzing results DTD offers the interaction template XML logs support debugging Available open-source XML parsers
47
Future work Automatic (or semi-automatic) generation of the DTD model from the UML model Conceptual problem: designing a transition scheme between the two models Practical problem: interpreting the format that various modeling packages use to store UML models Visualization of the interaction Model: timeline of the interaction vs. summary Format: HTML vs. SVG vs. … Automatic generation: programming language (Java) vs. transformation template (XSLT)
48
Questions ?
49
Query formulation problems Vague information need Vocabulary mismatch Difficulty of query language syntax Lack of context, ambiguity of terms Lack of a search strategy No understanding of the underlying indexing/searching model Note. TREC experiments have shown that the quality of the query has a higher impact on retrieval effectiveness than weighting schemes or search algorithms.
50
Role of structure Computing Computer Screen Keyboard C++ Pascal Programming language... Mathematics... Algebra Computing, Mathematics Physics Science Reveals the semantic structure of the domain & its concepts Groups (semantically ?) similar documents Supports exploration and concept formation Supports term disambiguation (context) (Has potential for efficient retrieval) (Has potential for effective retrieval)
51
Browsing label (relative cluster representative) Coastal Wind Farms Inland Wind Farms Pacific Rim Wind Farms Design of Coastal Wind Farms Design of …. Desert Wind Farms Wind generators for yachts Fixed Plants... Portable Generators... Power Generation Propulsion Wind Energy
52
Searching label (absolute cluster representative) Coastal Wind Farms Inland Wind Farms Pacific Rim Wind Farms Design of Coastal Wind Farms Design of …. Desert Wind Farms Wind generators for yachts Fixed Plants... Portable Generators... Power Generation Propulsion Wind Energy
53
Mediation label (Expanded cluster representative) Fixed Plants Coastal Wind Farms Pacific Rim Wind Farms Design of Coastal Wind Farms Design of …. Desert Wind Farms Inland Wind Farms... Portable Generators... Wind generators for yachts Power Generation Propulsion Wind Energy
54
Topic model representations Exemplary representation Statistical representation Statistical analysis Language model Context analysis Typical terms, weighted Thresholding Mediated query Keyword representation
55
The cluster hypothesis Reminder: the original cluster hypothesis “Closely associated documents tend to be relevant to the same requests” (van Rijsbergen) Aspectual cluster hypothesis: Highly similar documents tend to be relevant to the same topic. However, documents relevant to the same topic may be quite dissimilar if they cover distinct aspects of the topic. Consequence: Clustering algorithms tend to group together documents that cover highly focused topics, or aspects of a complex topic. Documents covering distinct aspects of complex topics tend to be spread over the cluster structure.
56
Aspects of relevance in the mediated access process
57
Distribution of relevant documents in clusters
58
WebCluster scenario #1 Diagram labels: Document from the source collection; Document from the target collection (WWW); WebCluster; Web Search Engine; c0, c4, c5, c2, c1, c3; c’0, c’3, c’2, c’5; WWW Name: Transparent mediated access Targeted users: Experienced searchers Specific: The users are aware of the mediation process and of the separation between the source and target collections. The users have the option to edit the query generated (proposed) by the system. They understand the indexing / searching model.
59
WebCluster scenario #2 Diagram labels: WebCluster; c0, c4, c5, c2, c1, c3; WWW; c’0, c’3, c’2, c’5; Web Search Engine; Document from the source collection; Document from the target collection (WWW) Name: Opaque mediated access Targeted users: Naive / beginner searchers Specific: The users explore the structure of the domain, which contains sample documents, and have the option of asking for similar documents. The users are unaware of the mediation process: the query generation and target search are not visible.
60
Initial user interface (Java AWT)
61
History of the IR research group at RGU “Systemic” approach with an interest in building software frameworks for IR Eclair - An Extensible Class Library for Information Retrieval (Harper et al, ’92) Flair – A Flexible Architecture for Information Retrieval (Jose et al, ’96) Epic - A Photographic Retrieval System Based on Evidence Combination Approach Fireworks - An Architecture for Implementing Extensible Information-Seeking Environments (Hendry et al, ’96) SketchTrieve - An Informal Information-Seeking Environment
62
Initial plan of work for WebCluster Produce a flexible Clustering Framework that can apply a variety of clustering algorithms on: Static and dynamic (on the fly) document collections User profiles Sources of information “Play” with the CF to understand how clustering works on various collections Use the CF for structuring source collections in view of mediation Design, build and test a few user interfaces for mediation
63
The Clustering Framework User Application Clustering Framework Kernel CF-Web Search Engine Interface ECLAIR IRS CF-Document Collection Interface CF-User Interface CF-ObjectStore Interface CF-File System Interface File System ObjectStore CF-ECLAIR Interface WEB
64
Requirements of the CF Generality Not devoted to a particular document collection, nor to a particular IRS. Document Independence The documents are not managed by the CF, which handles only their representatives. Flexibility and Reusability Large variety of clustering methods, inter-document similarity measures, halt conditions,... Storage management independence We can use file system, OODBMS,..., to make the result persistent. Adaptability and Extensibility Users can add their own clustering methods. Transparency For reuse as a toolbox in other applications.
65
Original design of the CF
66
Example of clustering parameter file
GroupAverageCM CosineSimilarityMeasure InqueryTfIdfWeighting 060.00010.001
*. Clustering method. Possible values: CompleteLinkCM, GroupAverageCM, SingleLinkCM
*. Similarity measure. Possible values: CosineSimilarityMeasure, DiceCoefficientSimilarityMeasure
*. Weighting measure (indexing). Possible values: FreqWeighting, RelFreqWeighting, KLWeighting, KLRelWeighting, InqueryTfIdfWeighting
*. Cluster size threshold (don't agglomerate clusters with sufficient docs)
*. Halt condition (stop when I'm down to a certain number of clusters). Note. Value 0 means no restriction; stop when you can't cluster anymore.
*. Similarity measure threshold
*. Cluster cleaning threshold
67
Design patterns Used extensively, as they provide Flexibility and extensibility – for a research system, playing with parameters and plugging in more modules was more important than performance Combined Most operations (such as ClusteringMethod) combined Strategy, Singleton, Factory Method, Product-Trader, sometimes TemplateMethod For storage management we combined Strategy, Bridge and Serializer Adapted, rather than applied blindly For the cluster structure simply applying Composite (Cluster, SimpleCluster, ComplexCluster) was very inefficient; we combined it with a Mediator that indexes documents and clusters (Clustering)
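The Strategy idea behind pluggable clustering methods can be sketched as a registry of interchangeable linkage rules. This is a hypothetical illustration, not the Clustering Framework's design: only the method names (SingleLinkCM, CompleteLinkCM, GroupAverageCM) come from the talk, and the functions, the registry and the toy similarity are assumptions. Group average is shown in its simplest form, the mean over all cross-cluster pairs.

```python
def single_link(sims):
    """Clusters are as similar as their most similar pair of documents."""
    return max(sims)

def complete_link(sims):
    """Clusters are as similar as their least similar pair of documents."""
    return min(sims)

def group_average(sims):
    """Clusters are as similar as the average over all cross-cluster pairs."""
    return sum(sims) / len(sims)

# Registry keyed by the method names used in the parameter file.
CLUSTERING_METHODS = {
    "SingleLinkCM": single_link,
    "CompleteLinkCM": complete_link,
    "GroupAverageCM": group_average,
}

def cluster_similarity(method_name, cluster_a, cluster_b, doc_sim):
    """Similarity of two clusters under the linkage rule selected by name."""
    sims = [doc_sim(a, b) for a in cluster_a for b in cluster_b]
    return CLUSTERING_METHODS[method_name](sims)

# Toy "documents" are numbers; similarity decays with their distance.
def toy_sim(a, b):
    return 1.0 / (1.0 + abs(a - b))

for name in CLUSTERING_METHODS:
    print(name, cluster_similarity(name, [0, 1], [3, 5], toy_sim))
```

Adding a new method then only requires registering one more function, which matches the extensibility requirement stated for the framework.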
68
The client-server architecture Diagram labels: Document collection; Vocabulary, index, inverted file; Cluster hierarchy; (Informia) meta-search engine; Indexer; Clustering Framework; Cluster-based searcher; Ranked-based searcher; GetHits; GetCollections; GetClustering; SearchClusters; SearchDocuments; GetClusteredHits; Client (ClusterBook); Server; Query; Mediated query; steps 1 to 7
69
Client-server communication Client side: Client, Server_Proxy, Comms_Proxy Server side: Server, Client_Proxy, TCP_Proxy
70
Architecture / design decisions Good ones Software framework in the server Java for the user interface Refactoring as language support improved AWT implementation replaced by Swing AWT implementation –MVC, native tree representation, CellRenderers STL instead of own library (String, List, Map, Iterator) Tight coupling with Informia replaced by loose coupling Data-centered rather than software-centered Questionable ones The client-server connection (CGI, TCP, HTTP) Alternatives: RMI, CORBA, servlet & JNI ?
71
Bad example: Rutgers Interactive TREC 2002 (a)
72
Bad example: Rutgers Interactive TREC 2002 (b)
73
Bad example – Logging in Interactive TREC 2002 TREC-2002 START: 2002-08-15 17:58:57 QUERY: geneticly engineered foods safety SAVE DOCUMENT: [G13-84-2041245] Food Safety and Biotechnology: Are They Related? QUERY: problems genetically engineered foods SAVE DOCUMENT: [G40-01-0459199] International Information Programs, U.S. Department of State, Economic Perspectives, October 1999 FINALLY SAVED DOCUMENTS:[G40-01-0459199] International Information Programs, U.S. Department of State, Economic Perspectives, October 1999;[G13-84-2041245] Food Safety and Biotechnology: Are They Related? NUMBER OF VIEWED DOCUMENTS: 12 NUMBER OF UNIQUE VIEWED DOCUMENTS: 8 TREC-2002 STOP: 2002-08-15 18:03:50