© CvR SIGIR2002. Keith van Rijsbergen, Tampere, 12th August 2002. Landmarks in Information Retrieval: the message out of the bottle.

3 Introductory Remarks
Exclusions: IE, TM, ...
Commercial successes and failures
Caveats
Why we have survived
Where we were, where we are, where we are going

4 Pre-history
Smee (1850)
Wells (1936)
Bush (1945)
Bagley (1951), MIT
Fairthorne (1945-52), RAE
Luhn (1958)
Mooers (1952)

5 Experimental Methodology
Cleverdon        Cranfield
Lancaster        Medlars
Keen             Cranfield/Smart
Saracevic        CWRU
Salton           Smart
Sparck Jones     Ideal Test Collection
Blair & Maron    Stairs
Harman           TREC

6 Evaluation
ABNO/OBNA (Fairthorne)
Precision, Recall -> trade-off (Cleverdon)
Probabilistic versions (Swets)
Measure-theoretic (Bollman)
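
The precision/recall trade-off named on this slide can be illustrated with a small sketch. It assumes the usual set-based definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / all relevant), which the slide does not spell out. The ranking, the relevance judgments, and the function name `precision_recall_at_k` are hypothetical.

```python
def precision_recall_at_k(ranking, relevant, k):
    """Precision and recall after examining the top k ranked documents."""
    retrieved = ranking[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output and relevance judgments for one query.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d1", "d2"}

for k in range(1, len(ranking) + 1):
    p, r = precision_recall_at_k(ranking, relevant, k)
    print(f"k={k}  precision={p:.2f}  recall={r:.2f}")
```

Reading further down this toy ranking, recall never decreases, while precision drifts from 1.00 at k=1 to 0.50 at k=6.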

7 ‘the world in 1980 according to Belver Griffith’
Who is missing?

8 Landmarks
Luhn’s tf weighting
Architecture
Relevance Feedback
Stemming
Poisson Model -> BM25
Statistical weighting tf*idf
Various models
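
As a rough illustration of the tf*idf landmark, the sketch below weights each term by its within-document frequency times log(N / df). The slide gives no formula, so this particular variant is an assumption (many others exist). The function name `tf_idf`, the whitespace tokenisation, and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenised = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy collection.
docs = [
    "information retrieval evaluation",
    "probabilistic retrieval model",
    "vector space retrieval model",
]
weights = tf_idf(docs)
```

In this toy collection "retrieval" occurs in every document, so its weight is zero, while a term confined to one document, such as "information", receives the highest weight.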

9 Luhn’s curve

10 What about evaluation?
[Diagram labels: Information Problem, Indexed Objects, Query, Fictive Objects, Representation, Compare]

11 Architecture (Brenda Gerrie, 1983)

12 Time I (highlights for me)
1952 Mooers coins IR
1958 International Conference on Scientific Information
1960 Cranfield I
1960 Maron and Kuhns paper
1961 Towards IR, RAF
1961 (-1965) Smart built
1964 Washington conference on Association Methods
1966 Cranfield II
1968 Salton’s first book
197- Cranfield conferences
1975 CvR’s book
1975 Ideal test collection
1976 KSJ/SER JASIS paper

13 Time II
1978 1st SIGIR
1979 1st BCSIRSG
1980 1st joint ACM/BCS conference on IR
1981 KSJ book on IR Experiments
1982 Belkin et al. ASK hypothesis
1983 - Okapi started
1985 RIAO-1
1986 CvR logic model
1990 Deerwester et al., LSI paper
1991 CoLIS 1 (in Tampere!)
1991 - Inquiry started
1992 Ingwersen’s book
1992 TREC-1
1998 Croft Ponte paper on language models

14 Dimensions
Matching            Exact Match      Partial (best) Match
Inference           Deduction        Induction
Model               Deterministic    Probabilistic
Classification      Monothetic       Polythetic
Query Language      Artificial       Natural
Query Definition    Complete         Incomplete
Query Dependence    Yes              No
Items wanted        Matching         Relevant
Error response      Sensitive        Insensitive
Logic               Classical        Non-classical
Representation      a priori         a posteriori
Language Models     Logical          Statistical

15 Probabilistic Retrieval
Maron and Kuhns
Miller (following Goffman)
SER/KSJ
Croft

16 Vector Space Model
Salton
Murray
Rocchio
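
The slide only names Rocchio. As an illustration of the kind of vector-space relevance feedback commonly associated with that name, here is a minimal sketch of the widely cited update q' = alpha*q + beta*mean(relevant) - gamma*mean(non-relevant). The parameter defaults, the dict-based vectors, and the function name `rocchio` are assumptions for this example, not taken from the talk.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move a query vector towards relevant and away from non-relevant documents.

    All vectors are {term: weight} dicts; parameter values are illustrative.
    """
    new_query = defaultdict(float)
    for term, w in query.items():
        new_query[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_query[term] += beta * w / len(relevant)
    for doc in non_relevant:
        for term, w in doc.items():
            new_query[term] -= gamma * w / len(non_relevant)
    # Terms whose weight ends up non-positive are dropped (a common convention).
    return {t: w for t, w in new_query.items() if w > 0}


# Hypothetical query and judged documents.
query = {"retrieval": 1.0}
judged_relevant = [{"retrieval": 1.0, "feedback": 1.0}]
judged_non_relevant = [{"cooking": 1.0}]
expanded = rocchio(query, judged_relevant, judged_non_relevant)
```

Here the expanded query gains the term "feedback" from the relevant document, while "cooking" is pushed negative and dropped.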

17 Logical Model
For
Mooers/Fairthorne           1960+
Hillman                     1965
Cooper/Maron                1970+
CvR                         1986
Nie/Amati/Bruza/Huibers     1990+
Against
Bar-Hillel                  1950+
Kasher                      1966

18 Buried Treasure
Dependence                     e.g. C.T. Yu
Unified Probabilistic Model    Maron/Cooper/SER
Co-relevance                   Ivie
Stochastic Processes           Mandelbrot/Herdan
Brouwerian Logics              Hillman
Error Analysis                 Hughes/Cover/Duda

19 Hypotheses/Principles
P & R trade-off - ABNO/OBNA
Exhaustivity/Specificity
Cluster Hypothesis
Association Hypothesis
Probability Ranking Principle
Logical Uncertainty Principle
ASK
Polyrepresentation
Items may be associated without apparent meaning but exploiting their association may help retrieval

20 Postulates of Impotence (according to Swanson, 1988)
An information need cannot be expressed independent of context.
It is impossible to instruct a machine to translate a request into adequate search terms.
A document’s relevance depends on other seen documents.
It is never possible to verify whether all relevant documents have been found.
Machines cannot recognise meaning -> can’t beat human indexing, etc.

21 ... more postulates
Word-occurrence statistics can neither represent meaning nor substitute for it.
The ability of an IR system to support an iterative process cannot be evaluated in terms of single-iteration human relevance judgment.
You can have either subtle relevance judgments or highly effective mechanised procedures, but not both.
Thus, consistently effective fully automatic indexing and retrieval is not possible.

22 ? Conclusions

23 Matching
Co-ordination is positively correlated with external relevance. (Jackson, 1969 - Association Hypothesis)
The larger the number of matching descriptive items, for a request and document, the more likely the document is to be relevant to the request. (Sparck Jones, 1971 - Relevance Hypothesis)
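
The idea that more matching descriptive items make a document more likely to be relevant can be sketched as co-ordination level ranking: count the terms a request and a document share, then order documents by that count. This is only an illustration of the hypothesis on the slide; the function names, the tie-breaking by document id, and the toy data are hypothetical.

```python
def coordination_level(query_terms, doc_terms):
    """Number of distinct terms shared by the request and the document."""
    return len(set(query_terms) & set(doc_terms))


def rank_by_coordination(query_terms, docs):
    """Order document ids by descending co-ordination level (ties by id)."""
    scored = [
        (coordination_level(query_terms, terms), doc_id)
        for doc_id, terms in docs.items()
    ]
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical indexed documents and request.
docs = {
    "d1": ["boolean", "logic"],
    "d2": ["probabilistic", "retrieval", "model"],
    "d3": ["retrieval", "evaluation"],
}
request = ["probabilistic", "retrieval", "evaluation"]
ranking = rank_by_coordination(request, docs)
```

With this data, d2 and d3 each share two terms with the request and are ranked ahead of d1, which shares none.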

24 Inference
It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of Boole (1847) is the appropriate formalism for retrieval design ... The ‘logic’ of Brouwer, as invoked by Fairthorne, is one such weakening of the postulate system, ... (Mooers, 1961)
Another one: Logical Uncertainty Principle (CvR, 1986)

25 Classification
Co-occurrence [of terms] as a basis for grouping makes for good swops, i.e. permits substitutions which retrieve relevant rather than irrelevant documents. (Sparck Jones, 1971 - Classification Hypothesis)
If an index term is good at discriminating relevant from non-relevant documents then any closely associated index term is also likely to be good at this. (CvR, 1979 - Association Hypothesis)
Closely associated documents tend to be relevant to the same requests. (CvR, 1971 - Cluster Hypothesis)

26 Models
Vector Space/LSI
Probabilistic
Logical

27 Query Language
Artificial/Natural
Multilingual/cross-lingual
images
none at all

28 Query Definition
Complete/Incomplete
Independence/Dependence
Weighted/Unweighted
Query Expansion/one shot (feedback, web)
Sense disambiguation
Cross-lingual

29 Query Dependence
Relevance Feedback
Ostensive Retrieval
Context
Query Expansion

30 Items wanted
Relevance
ASK: Anomalous State of Knowledge
Situated Relevance

31 Error response
Precision and Recall

32 Logic
standard/non-standard
probabilistic logic
information flow/logic

33 Representation
Discrimination/Representation
Specificity/Exhaustivity

34 Language Models
NLP
Montague Semantics
Stochastic

