
1 November 10, 2004. Dmitriy Fradkin, CIKM'04. A Design Space Approach to Analysis of Information Retrieval Adaptive Filtering Systems. Dmitriy Fradkin, Paul Kantor. DIMACS, Rutgers University.

2 What Is This Work About? Small-scale view: we analyze the differences between two implementations of the Rocchio method and discuss the choice of parameters. Large-scale view: constructing an IR/AF system can be seen as an optimization problem in a large design space. (Well-known methods are simply points in this space.)

3 Large-Scale View. Use optimization methods to find optimal choices of parameters. These optimal choices do not have to correspond to well-known methods or standard practices. Design space optimization methods have been suggested for designing VLSI chips [Bahuman et al. 2002], airplanes [Schwabacher and Gelsey, 1996; Zha et al. 1996] and HVAC systems [Szykman 1997].

4 What’s in a name? We find that even a single “name” involves an enormous number of design choices. TREC 2002 Adaptive Filtering: DIMACS used the Rocchio method; the Chinese Academy of Sciences (CAS) also used the Rocchio method. One method performs almost twice as well as the other.

5 For any system: Choose the data representation. Construct the initial classifier. Training phase: incorporate labeled examples; supplement with “pseudo positives” and “pseudo negatives”; set the threshold. Filtering phase, as new documents arrive: evaluate performance; update the classifier model; update the threshold.

6 All of these are usually characterized informally, as a choice and the exclusion of alternatives, and seen as points on a map. But to understand the significance of these choices we need to explore the real territory. So we must interpolate between the choices made in one method and those made in another.

7 Interpolation. Identify the corresponding design decisions. Develop a “path” between them, sometimes called a “homotopy”, from the topological concept of smoothly distorting one shape (say, a coffee cup) into another (say, a doughnut). Study the effectiveness along various paths among design options.
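For design decisions that are already numeric, the simplest such path is a straight line controlled by one parameter. The sketch below is only an illustration under that assumption: the function name `interpolate_design`, the parameter names and the values are hypothetical and are not taken from either system.

```python
from typing import Dict


def interpolate_design(a: Dict[str, float], b: Dict[str, float], lam: float) -> Dict[str, float]:
    """Return the design at position lam on the straight path from a to b.

    lam = 0 reproduces design a and lam = 1 reproduces design b.
    Values of lam outside [0, 1] extrapolate beyond the endpoints.
    """
    if a.keys() != b.keys():
        raise ValueError("both designs must define the same parameters")
    return {name: (1.0 - lam) * a[name] + lam * b[name] for name in a}


# Hypothetical numeric design decisions for two systems (illustrative values only).
system_a = {"update_weight": 1.0, "pseudo_negative_cutoff": 0.0}
system_b = {"update_weight": 0.5, "pseudo_negative_cutoff": 0.6}

midpoint = interpolate_design(system_a, system_b, 0.5)
```

Decisions that are not numeric (for example, two different weighting formulas) need a formula that contains both as special cases before a path like this can be drawn.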

8 Interpolation Aspects for IR/AF: term representation; term weighting; computing scores; setting the classifier threshold; document set representation; pseudo-labeled documents in training.

9 Interpolation Aspects (cont.): query initialization; unjudged documents in test; query update; quitting strategy.

10 Example: Term Representation. Here f’(t,d) is the number of times term t occurs in document d.

11 Example: Term Weighting DIMACS: CAS: Homotopy: i’(t) is the number of documents, in training set T, containing term t.

12 Example: Score Computation DIMACS: CAS: Homotopy: i’(t) is the number of documents, in training set T, containing term t. W is a diagonal matrix of weights

13 Example: Score Interpolation Same mapping for scores and for thresholds from CAS scale to DIMACS scale: Homotopy:

14 Example: Setting Thresholds DIMACS: CAS: Homotopy: is chosen to optimize utility Threshold for query q after seeing document i:

15 Example: Set Representation DIMACS CAS Homotopy

16 Example: Pseudo-labeled Documents. The CAS method does not use pseudo-labeled documents in the training stage. DIMACS method: given “density” parameters (d+ and d-) and “proportion” parameters (p+ and p-), score the unlabeled training documents and choose top and bottom sets according to the “proportion” parameters. Then pick documents out of these sets according to the corresponding “density”. Interpolate between the density and proportion parameters (DIMACS) and 0 (CAS).
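One possible reading of this selection step is sketched below. It rests on several assumptions that the slide does not confirm: “proportion” is taken to be the fraction of scored documents placed in the top (or bottom) set, “density” the fraction of that set actually kept, the kept documents are evenly spaced within the set, and a single `scale` factor shrinks all four parameters toward 0. All function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple


def thin(items: Sequence[int], density: float) -> List[int]:
    """Keep roughly a `density` fraction of items, evenly spaced (an assumption)."""
    keep = int(round(density * len(items)))
    if keep <= 0:
        return []
    step = len(items) / keep
    return [items[int(i * step)] for i in range(keep)]


def pseudo_label(
    scores: Sequence[float],
    p_pos: float,
    d_pos: float,
    p_neg: float,
    d_neg: float,
    scale: float = 1.0,
) -> Tuple[List[int], List[int]]:
    """Return (pseudo-positive indices, pseudo-negative indices).

    scale = 1 uses the given proportion and density parameters in full;
    scale = 0 drives them all to 0, so no pseudo-labeled documents are used.
    """
    # Rank document indices from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_top = int(round(scale * p_pos * len(order)))
    n_bottom = int(round(scale * p_neg * len(order)))
    top = order[:n_top]
    bottom = order[len(order) - n_bottom:] if n_bottom else []
    return thin(top, scale * d_pos), thin(bottom, scale * d_neg)
```

For example, with ten scored documents, `p_pos = p_neg = 0.4` and `d_pos = d_neg = 0.5`, the call at `scale=1.0` keeps two pseudo-positives and two pseudo-negatives, while `scale=0.0` keeps none.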

17 Example: Query Initialization General Formula: DIMACS: CAS: Homotopy:

18 Example: Unjudged Documents. A submitted document for which there is no label is “unjudged”. DIMACS ignores such documents. CAS considers such a document pseudo-negative if its score is less than 0.6. Can view this as a threshold:
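The threshold view can be sketched as follows, assuming scores are non-negative so that a cutoff of 0 never labels anything pseudo-negative (the "ignore" behaviour), and assuming the cutoff is interpolated linearly. The direction of `lam` and the function names are illustrative choices, not the slide's notation.

```python
def unjudged_cutoff(lam: float, cas_cutoff: float = 0.6) -> float:
    """Cutoff below which an unjudged document is treated as pseudo-negative.

    lam = 0 gives a cutoff of 0 (unjudged documents are effectively ignored,
    assuming non-negative scores); lam = 1 gives the 0.6 cutoff stated for CAS.
    """
    return lam * cas_cutoff


def is_pseudo_negative(score: float, lam: float) -> bool:
    """Decide whether an unjudged document with this score counts as negative."""
    return score < unjudged_cutoff(lam)
```

With this parameterization an unjudged document scoring 0.3 is pseudo-negative at `lam = 1` but not at `lam = 0`.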

19 Example: Query Update General Formula: DIMACS: CAS: Homotopy:
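As a point of reference, the textbook Rocchio update moves the query toward the centroid of the positive documents and away from the centroid of the negative ones. The sketch below shows that textbook form only; it is not claimed to be the general formula on this slide, and the coefficient names `alpha`, `beta`, `gamma` and the helper `blend` are assumptions for illustration. Two systems that differ only in these coefficients could be connected by blending each coefficient.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(docs: Sequence[Vector], dim: int) -> Vector:
    """Component-wise mean of the document vectors (zero vector if there are none)."""
    if not docs:
        return [0.0] * dim
    return [sum(d[j] for d in docs) / len(docs) for j in range(dim)]


def rocchio_update(
    query: Vector,
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
    alpha: float,
    beta: float,
    gamma: float,
) -> Vector:
    """Textbook Rocchio: alpha * query + beta * mean(pos) - gamma * mean(neg)."""
    dim = len(query)
    pos_c = centroid(positives, dim)
    neg_c = centroid(negatives, dim)
    return [alpha * query[j] + beta * pos_c[j] - gamma * neg_c[j] for j in range(dim)]


def blend(a: float, b: float, lam: float) -> float:
    """Linear interpolation between one system's coefficient (a) and another's (b)."""
    return (1.0 - lam) * a + lam * b
```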

20 Example: Quitting Strategy. DIMACS: if the utility is negative after 50 submissions, stop submitting for this topic. CAS: no quitting strategy. Alternatively:
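The two behaviours can be written as one rule with an adjustable submission count. This is a sketch under two assumptions: the rule is checked at every step once the count has been reached (the slide may instead mean a single check at exactly 50 submissions), and "no quitting strategy" is modelled as an infinite count. The name `check_after` is hypothetical.

```python
import math


def should_quit(num_submitted: int, utility: float, check_after: float = 50) -> bool:
    """Stop submitting for a topic once enough documents were submitted
    and the accumulated utility is still negative."""
    return num_submitted >= check_after and utility < 0


# A system with no quitting strategy corresponds to check_after = infinity.
never_quits = should_quit(10**6, -5.0, check_after=math.inf)
```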

21 Experimental Evaluation. TREC11 data: Reuters Corpus v1; 23,000 training documents; 800,000 test documents; 100 topics (50 assessor, 50 intersection); 3 positive and 0 negative examples per topic. Notation: T+ is all positive documents; D+ is submitted positive; D- is submitted negative; Du is submitted unlabelled.

22 Diagonal Interpolation

23 Documents Retrieved

24 Parameter Analysis. It is possible to analyze the effect of individual parameters at each point in the space by taking “small steps” along each parameter axis. This requires a lot of computational effort, and the results may not be easy to interpret.
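Taking small steps along each axis amounts to a finite-difference estimate of how the performance measure responds to each parameter. The sketch below assumes a hypothetical `evaluate` callable that stands in for a full filtering run returning one number; the step size of 0.05 is likewise only an illustrative choice.

```python
from typing import Callable, Dict


def sensitivities(
    evaluate: Callable[[Dict[str, float]], float],
    point: Dict[str, float],
    step: float = 0.05,
) -> Dict[str, float]:
    """Estimate the change in performance per unit change of each parameter.

    Each parameter is moved by `step` while the others stay fixed, so the
    cost is one extra evaluation per parameter at every point examined.
    """
    base = evaluate(point)
    result = {}
    for name in point:
        shifted = dict(point)
        shifted[name] = point[name] + step
        result[name] = (evaluate(shifted) - base) / step
    return result
```

Since every evaluation is a complete filtering run over the test stream, the number of runs grows with the number of parameters times the number of points examined, which is where the computational effort comes from.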

25 Example of Parameter Analysis. Effect of individual parameters on the number of relevant and nonrelevant documents retrieved around the 0.8 point.

26 Results Based on Topic Type. Comparison of CAS results and the 0.8 diagonal homotopy point.

27 Additional Experiments. Reordered the TREC documents. Experimented with 77 topics on the OHSUMED dataset (1987-1988 as training data, 1989-1991 as test data). The results are similar to those on the original TREC task.

28 Result of Experiments with Reordering. Average results on 5 re-orderings of the TREC test set:
Lambda: 0.0, 0.8, 1.0
Average T11SU: 0.108, 0.406, 0.391
Standard Deviation: 0.002 0.004

29 OHSUMED Results

30 Documents Retrieved: OHSUMED

31 Discussion. We demonstrate the design complexity hidden under the name “Rocchio method”. We provide specific models for interpolating between design choices. These interpolation options can also work for methods that differ more substantially (for example, Rocchio and SVM).

32 Discussion (cont.). These models should help researchers explore their systems and the regions “between systems”. This suggests a new approach to designing IR systems: finding a set of (interpolation) parameters that optimizes performance. This can be done with existing optimization methods.
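The simplest instance of such a search is a grid along the diagonal, where every design decision shares one interpolation value. The sketch below assumes a hypothetical `evaluate` callable that maps an interpolation value to a performance score (higher is better); grid search is used here only because it is easy to state, not because it is the method used in this work.

```python
from typing import Callable


def best_on_diagonal(evaluate: Callable[[float], float], steps: int = 10) -> float:
    """Return the grid point in [0, 1] with the highest evaluated performance."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=evaluate)
```

Searching off the diagonal means giving each design decision its own interpolation value, at which point the grid grows exponentially and a general-purpose optimizer becomes the more practical choice.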

33 A Note on Interpolation Limits. The need for two endpoint systems is not very restrictive: some interpolation parameters can be moved beyond the [0,1] interval, and the endpoints themselves can be moved.

34 Abstract Interpolation. More abstractly: do not interpolate every single parameter; work at higher abstraction levels (e.g., representation block, scoring block, thresholding block). This can be used with several systems. It operates at a lower level than ensembles of classifiers.

35 Caveat. In moving to a large design space we still face two major problems. First, the range of parameters cannot be explored exhaustively, and non-smooth optimization is needed. Second, the approach requires a lot of labeled data, which is usually produced manually and is in short supply.

36 Acknowledgments: KD-D group via NSF grant EIA-0087022; Andrei Anghelescu, Vladimir Menkov; Jamie Callan; members of the DIMACS MMS project; CAS researchers; Ian Soboroff; anonymous reviewers.

