
1 © Paul Kantor 2002 A Potpourri of Topics. Paul Kantor. Project overview and cartoon. How we did at TREC this year. Generalized performance plots. Remarks on the formal model of decision making.

2 © Paul Kantor 2002 Rutgers DIMACS: Automatic Event Finding in Streams of Messages. Retrospective/Supervised/Tracking pipeline: 1. Accumulated documents; 2. Unexpected event; 3. Initial profile; 4. Guided retrieval; 5. Clustering; 6. Revision and iteration with analysts; 7. Track new documents. Prospective/Unsupervised/Detection pipeline: 1. Accumulated documents; 2. Clustering; 3. Initial profile; 4. Anticipated event; 5. Guided retrieval.

3 © Paul Kantor 2002 Rutgers-DIMACS KD-D MMS Project Matrix

4 © Paul Kantor 2002 Communication. The process converges…. Central limit theorem… (What???) Pretty good fit. Confidence levels. (What???) And so on.

5 © Paul Kantor 2002 Measures of performance: Effectiveness. 1. Batch post-hoc learning. Here there is a large set of already discovered documents, and the system must learn to recognize future instances from the same family. 2. Adaptive learning of defined profiles. Here there is a small group of "seed documents" and thereafter the system must learn while it works. Realistic measures must penalize the system for sending the human experts documents that are of no interest to any analyst. 3. Discovery of new regions of interest. Here the focus is on unexpected patterns of related documents, which are far enough from established patterns to warrant sending them for human evaluation.

6 © Paul Kantor 2002 Measures of performance: Efficiency. Efficiency is measured in both the time and space resources required to accomplish a given level of effectiveness. Results are best visualized in a set of two- or three-dimensional plots, as suggested on the following slide.

7 © Paul Kantor 2002 Efficiency-Effectiveness Plots. [Plot: vertical axis = measure of effectiveness; horizontal axis = measure of time required (best baseline method / method plotted), up to 100%. Quadrant labels: Strong and slow; Strong and fast; Weak but fast; Not good enough for government work.]

8 © Paul Kantor 2002 The process. [Diagram: Incoming documents: N arrive, of which G are relevant. Our system sends n of them to the analyst. The analyst reports that g of those are relevant.]

9 © Paul Kantor 2002 Typical effectiveness measures. Basic concepts: precision p = g/n, where g = number of relevant documents flagged by our system and n = number that the analyst must examine; recall R = g/G, where G = total number that "should" be sent to the analyst, that is, the number of relevant documents. F-measures: the weighted harmonic mean of precision and recall, 1/F = a/p + (1-a)/R = (1/g)(an + (1-a)G), so F = g/[an + (1-a)G]. There is no persuasive argument for the choice of weight; in TREC 2002, a = 0.8, a 4:1 weighting.
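
As a rough illustration (not code from the project, and with a hypothetical function name), the following Python snippet computes precision, recall, and the weighted F-measure from the counts g, n, and G defined above; the weight a = 0.8 matches the TREC 2002 setting.

```python
def precision_recall_f(g, n, G, a=0.8):
    """Precision, recall, and weighted F-measure from the slide's counts.

    g: relevant documents flagged by the system
    n: documents the analyst must examine (all flagged documents)
    G: total relevant documents in the stream
    a: weight on precision (a = 0.8 gives the 4:1 weighting used in TREC 2002)
    """
    p = g / n if n else 0.0          # precision
    R = g / G if G else 0.0          # recall
    F = g / (a * n + (1 - a) * G)    # F = g / [a*n + (1-a)*G]
    return p, R, F

# Example: the system flags 40 documents, 30 are relevant, 50 relevant exist.
print(precision_recall_f(g=30, n=40, G=50))  # (0.75, 0.6, 0.714...)
```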

10 © Paul Kantor 2002 Typical measures used. Utility-based measures. Pure measure: U = vg - c(n - g) = -cn + g(v + c); in TREC 2002, v = 2 and c = 1. Note that sending irrelevant documents drives the score negative. "Training wheels": to protect groups from having to report astronomically negative results, U is replaced by T11SU = [max{U/(2G), -0.5} + 0.5]/1.5.
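
A minimal sketch (not project code; the function names are mine) of these two measures, using the TREC 2002 values v = 2, c = 1, and the -0.5 floor on normalized utility; the variable names follow the slide.

```python
def linear_utility(g, n, v=2.0, c=1.0):
    """Pure utility U = v*g - c*(n - g): reward relevant, penalize irrelevant."""
    return v * g - c * (n - g)

def t11su(g, n, G, v=2.0, c=1.0, min_nu=-0.5):
    """Scaled utility: normalize U by the best possible score (v*G),
    floor it at min_nu, then rescale to the [0, 1] range."""
    u = linear_utility(g, n, v, c)
    nu = u / (v * G)                 # normalized utility, 1.0 when perfect
    return (max(nu, min_nu) - min_nu) / (1.0 - min_nu)

# Example: 30 relevant and 10 irrelevant documents sent; 50 relevant exist.
print(linear_utility(30, 40))   # 2*30 - 1*10 = 50
print(t11su(30, 40, 50))        # (0.5 + 0.5)/1.5 = 0.666...
```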

11 © Paul Kantor 2002 How we have done: TREC 2002. Disclaimers and caveats: we report here only on those results that were achieved and validated at the TREC 2002 conference. These runs were done primarily to convince ourselves that we can manage the entire data pipeline, and were not selected to represent the best conceptual approaches we can think of.

12 © Paul Kantor 2002 Disclaimers and caveats (cont.). The TREC Adaptive rules are quite confusing to newcomers. It appears, from conference and post-conference discussions, that the two top-ranked systems may not have followed the same set of rules as the other competitors. If that is the case, our relative standing is actually better than reported here.

13 © Paul Kantor 2002 Using measure T11SU. Adaptive: on Assessor topics, 9th among all 14 teams (7th among those known to follow the rules); on Intersection topics, 7th among all 14 teams (5th among those known to follow the rules). Batch: 6th among all 10 groups on Assessor topics; 3rd among all 10 groups on Intersection topics; scored above the median on almost all topics, and was top on 24 of 50.

14 © Paul Kantor 2002 Fusion of Methods. Paul Kantor and Dmitriy Fradkin (supported in part by ARDA).

15 © Paul Kantor 2002 Fusion Models. Each of several systems gives scores to documents; call these s_j(d). Can these be combined so that the resulting score is a more accurate indication of the relevance of the document? The underlying mathematical concept is the conditional score distribution f(s, h) = Prob(document has score s, given relevance h), where the "hypothesis" h = R, N ("Relevant" or "Not relevant").
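
One hedged illustration of this idea (a sketch under my own assumptions, not the project's actual fusion method; all function names are hypothetical): estimate each system's conditional score distributions f(s, R) and f(s, N) from judged training data with simple histograms, then combine systems by summing per-system log-likelihood ratios, i.e. naive-Bayes fusion of the scores.

```python
import numpy as np

def fit_score_histograms(scores, labels, bins=20):
    """Histogram estimates of f(s, R) and f(s, N) for one system's scores."""
    edges = np.histogram_bin_edges(scores, bins=bins)
    f_rel, _ = np.histogram(scores[labels == 1], bins=edges, density=True)
    f_non, _ = np.histogram(scores[labels == 0], bins=edges, density=True)
    return edges, f_rel + 1e-9, f_non + 1e-9   # smooth to avoid log(0)

def fused_score(per_system_scores, models):
    """Sum of log-likelihood ratios log f(s,R)/f(s,N) across systems."""
    total = 0.0
    for s, (edges, f_rel, f_non) in zip(per_system_scores, models):
        i = np.clip(np.searchsorted(edges, s) - 1, 0, len(f_rel) - 1)
        total += np.log(f_rel[i] / f_non[i])
    return total

# Toy usage: two systems, synthetic training scores and relevance labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
sys1 = rng.normal(labels * 1.0, 1.0)       # system 1 scores
sys2 = rng.normal(labels * 0.5, 1.0)       # system 2 scores
models = [fit_score_histograms(sys1, labels), fit_score_histograms(sys2, labels)]
print(fused_score([1.2, 0.4], models))     # fused evidence for a new document
```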

16 © Paul Kantor 2002 Tools. We have built visualization tools to show these two distributions. It can be shown that all decision making needs to know only the so-called ROC curve, which is invariant under any monotone change of the score variable. We have also built tools which show the ROC. The simplest form gives a curve with coordinates (d(t), f(t)).

17 © Paul Kantor 2002 ROC. d(t) = Prob(score > t | document relevant); f(t) = Prob(score > t | document not relevant).
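
As an illustrative sketch (my own assumption of how the quantities would be estimated, not the project's tool), d(t) and f(t) can be computed empirically from scored, judged documents by sweeping the threshold t over the observed scores; the final assertion also checks the claimed invariance under a monotone transform of the scores.

```python
import numpy as np

def roc_points(scores, relevant):
    """Empirical ROC: for each threshold t, return (f(t), d(t)) where
    d(t) = P(score > t | relevant) and f(t) = P(score > t | not relevant)."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    thresholds = np.unique(scores)
    d = [(scores[relevant] > t).mean() for t in thresholds]
    f = [(scores[~relevant] > t).mean() for t in thresholds]
    return np.array(f), np.array(d)

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
rel    = [1,   1,   0,   1,   0,    0,   1,   0]
f1, d1 = roc_points(scores, rel)
f2, d2 = roc_points(np.exp(scores), rel)   # monotone rescaling of the scores
assert np.allclose(f1, f2) and np.allclose(d1, d2)   # ROC is unchanged
```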

18 © Paul Kantor 2002 Score Distributions

19 © Paul Kantor 2002 ROC Display Applet

20 © Paul Kantor 2002 ROC Display Applet

21 © Paul Kantor 2002 Formal Models David Madigan and Paul Kantor

22 © Paul Kantor 2002 Formal Models The BinWorld model Some heuristic ideas

23 © Paul Kantor 2002 BinWorlds: very simple models. Documents live in some number (L) of bins. Some bins contain only irrelevant (bad) documents; a few also contain relevant (good) ones. (In the formula below, b is the number of bad documents in a bin and g the number of good documents in the right bin.) Documents are delivered randomly from the world, labeled only by their bin numbers. The game has a horizon H, with a payoff v for each good document sent to be judged and a cost c for each bad document sent to be judged. We consider a hierarchy of models. For example, if only one bin contains good documents, the optimum strategy is either QUIT, or continue until seeing one good document and thereafter submit only documents from that bin to be judged. The expected value of the game is EV = -CostToLearnRightBin + GainThereafter. Since the expected time to learn the right bin is 1 + Lb/g, EV = -c(1 + Lb/g) + (H - (1 + Lb/g))(vg - cb)/(b + g). Increasing the horizon H increases EV, while increasing the number of candidate bins, L, makes the game harder.
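
A small sketch (my reading of the slide's formula: b bad documents per bin, g good documents in the single right bin; the function name and the example parameter values are assumptions) that evaluates the expected-value expression and shows how EV moves with the horizon H and the number of bins L.

```python
def binworld_ev(L, b, g, v, c, H):
    """EV = -c*(1 + L*b/g) + (H - (1 + L*b/g)) * (v*g - c*b) / (b + g).

    1 + L*b/g is the expected number of submissions needed to find the
    right bin; after that, each remaining step yields (v*g - c*b)/(b + g)
    in expectation from submitting only right-bin documents.
    """
    learn = 1 + L * b / g
    per_step_gain = (v * g - c * b) / (b + g)
    return -c * learn + (H - learn) * per_step_gain

# Longer horizons help; more candidate bins hurt.
print(binworld_ev(L=5,  b=1, g=2, v=2, c=1, H=100))   # 93.0
print(binworld_ev(L=50, b=1, g=2, v=2, c=1, H=100))   # 48.0
```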

24 © Paul Kantor 2002 The essential math. However, if we have failed once on a bin, perhaps it is not wise to test it again. At any step on the way to the horizon H, the decision maker can know only these things: the judgments on submitted documents, and the stage at which they were submitted. Let j_i = j(b, i) be the judgment received when a document from bin b was submitted at time step i.

25 © Paul Kantor 2002 The challenge. As a result of these judgments, the decision maker has a current Bayesian estimate of the chance that each bin is the right bin. Can we find a simple and effective heuristic based on the available history j_1 … j_i and the time remaining, H = K - i?
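
A hedged sketch of the Bayesian bookkeeping (the model and function name are assumptions, chosen to match the single-good-bin BinWorld above): start from a uniform prior over which bin is the right one, and update it after each judgment on a submitted document, where a document from the right bin is judged relevant with probability p and documents from other bins never are.

```python
def update_posterior(posterior, bin_id, judged_relevant, p):
    """Posterior over 'which bin is the right one' after one judgment.

    posterior: list of probabilities, one per bin
    bin_id: bin of the submitted document
    judged_relevant: analyst's judgment (True/False)
    p: P(relevant | document from the right bin); wrong bins yield 0
    """
    weights = []
    for k in range(len(posterior)):
        if judged_relevant:
            lik = p if k == bin_id else 0.0
        else:
            lik = (1 - p) if k == bin_id else 1.0
        weights.append(lik * posterior[k])
    total = sum(weights)
    return [w / total for w in weights]

# Five bins, uniform prior; two failures from bin 0 shift belief away from it.
post = [0.2] * 5
post = update_posterior(post, bin_id=0, judged_relevant=False, p=0.2)
post = update_posterior(post, bin_id=0, judged_relevant=False, p=0.2)
print(post)   # bin 0: (0.8^2*0.2)/(0.8^2*0.2 + 4*0.2) ≈ 0.138
```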

26 © Paul Kantor 2002 Example heuristic. Five bins, p = 0.2. If the current document's bin is the one with the largest number of failures to date, do not send it for judgment (yellow line); this gains slowly until the correct bin is discovered. The alternative is to submit always (mauve line).
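
A rough simulation sketch of that comparison, under assumed mechanics that the slide does not spell out (one right bin among five; a right-bin document is judged relevant with probability p = 0.2; payoff v = 6 and cost c = 1 are illustrative choices, not the slide's): it contrasts the skip-the-most-failed-bin heuristic with always submitting.

```python
import random

def play(horizon, skip_worst, L=5, p=0.2, v=6.0, c=1.0, seed=0):
    """Total payoff of one BinWorld game under the chosen submission policy."""
    rng = random.Random(seed)
    right = rng.randrange(L)                 # the (hidden) good bin
    failures = [0] * L                       # observed failures per bin
    found, total = False, 0.0
    for _ in range(horizon):
        b = rng.randrange(L)                 # a document arrives from bin b
        if found and b != right:
            continue                         # after discovery, submit only the right bin
        if skip_worst and not found and sum(failures) > 0 and failures[b] == max(failures):
            continue                         # heuristic: skip the most-failed bin
        relevant = (b == right) and (rng.random() < p)
        total += v if relevant else -c
        if relevant:
            found = True
        else:
            failures[b] += 1
    return total

print(play(2000, skip_worst=False))   # always submit
print(play(2000, skip_worst=True))    # skip-the-most-failed-bin heuristic
```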

27 © Paul Kantor 2002 Future work. Such a heuristic should exist, because the decision rule must be of the form: if the current estimate that a bin is the right one is below some critical value, don't submit from it. Note: this is "obvious but not yet proved." In more complex models, the chance of success in the right bin (g), the number of bins L, and even the number of good bins may be unspecified.

