I.R. Intro1 Dave Inman South Bank University, London SE1 0AA, UK +44 (0)20 7815 7446 email@example.com Introduction to Information Retrieval An overview of the topics to choose for your CW
I.R. Intro 2: Problems in I.R.
Searching for information in a vast unstructured digital world can be tricky. We have probably all found two problems:
- Too many hits: you are looking at the first page of hundreds of hits. Many are duplicates. None seem relevant.
- Too few hits: you enter your key terms and get back nothing relevant, or nothing at all. Rarer, probably, but still a nuisance.
I.R. Intro 3: Scope
- I.R. is about pointing the user to sources of information
- Not about processing the sources for the user
- Not database retrieval
- Sometimes the sources must be processed in order to see if they are a hit or not
The job is done when the user is shown relevant sources.
I.R. Intro 4: Objectives
- Retrieve ALL relevant documents
- Retrieve NO irrelevant documents
- Show the results sensibly
I.R. Intro 5: Basic questions
1. What problems have you had in I.R.?
2. What is a document?
3. What is a relevant document?
4. What is sensible output?
5. How might you measure the performance of an I.R. system?
I.R. Intro 7: Measures of effectiveness
[Venn diagram: within the set of All Docs, the Relevant Docs (R) overlap the Hits (H); their intersection is the Relevant Hits (RH).]
I.R. Intro 8: Measures of effectiveness
Precision = RH / H
- 0 if no hits are relevant
- 1 if all hits are relevant
Recall = RH / R
- 0 if no relevant docs are found
- 1 if all relevant docs are found
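As a sketch, the two measures can be computed from sets of document IDs; the IDs and sets below are illustrative, not from the slides:

```python
def precision_recall(hits, relevant):
    """hits: set of retrieved doc IDs; relevant: set of relevant doc IDs."""
    rh = len(hits & relevant)                       # relevant hits RH
    precision = rh / len(hits) if hits else 0.0     # RH / H
    recall = rh / len(relevant) if relevant else 0.0  # RH / R
    return precision, recall

# 2 of the 4 hits are relevant, and 2 of the 3 relevant docs were found
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
print(p, r)  # precision 0.5, recall 2/3
```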
I.R. Intro 9: Techniques: simple binary match
Query terms q1, q2, …; document d is a hit if it contains q1 or q2 …
What is wrong with this? How could you do better?
I.R. Intro 10: Techniques: better binary match
- Remove stop words
- Weight by where the match occurs (heading? title? sentence subject?)
- Weight by number of matches (more matches = more relevant?)
- Boolean queries: allow the user to specify AND, OR etc. Maybe a distance this applies to? (e.g. sentence, paragraph, document)
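The first three ideas can be sketched in a few lines; the stop-word list, the title weight, and the field names are illustrative, not from the slides:

```python
# A "better" binary match: stop words removed, title matches weighted
# higher, and more matches scoring higher.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "for"}

def score(query, title, body, title_weight=3):
    """Score a document (title + body) against a free-text query."""
    terms = {t for t in query.lower().split() if t not in STOP_WORDS}
    title_words = title.lower().split()
    body_words = body.lower().split()
    s = 0
    for t in terms:
        s += title_weight * title_words.count(t)  # weight by where the match occurs
        s += body_words.count(t)                  # weight by number of matches
    return s

print(score("the user interface", "EPS user interface", "system for user tasks"))
```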
I.R. Intro 11: Better binary match: significant words
I.R. Intro 12: Better binary match: inverted index
- Term = word
- IDF = inverse document frequency: based on how many documents in the collection contain the term; rarer terms are weighted higher (commonly log(N/df))
- DOC = document identifier
- TF = term frequency: how often the term occurs within a document
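A minimal inverted index along these lines, using an illustrative two-document collection and the common log(N/df) form of IDF:

```python
# Inverted index sketch: term -> {doc_id: term frequency (TF)},
# plus an IDF that down-weights terms appearing in many documents.
import math
from collections import defaultdict

docs = {  # illustrative toy collection
    1: "human machine interface for computer applications",
    2: "a survey of user opinion of computer system response",
}

index = defaultdict(dict)            # term -> {doc_id: TF}
for doc_id, text in docs.items():
    for word in text.split():
        index[word][doc_id] = index[word].get(doc_id, 0) + 1

def idf(term, n_docs=len(docs)):
    df = len(index.get(term, {}))    # number of docs containing the term
    return math.log(n_docs / df) if df else 0.0

print(index["computer"])  # appears once in each document
print(idf("computer"))    # log(2/2) = 0: a term in every doc carries no weight
```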
I.R. Intro 13: Vector matching I.R.
- Documents represented as a vector
- Size of vector is usually the number of terms in the document space
- Queries represented as a pseudo-document vector (usually very sparse)
- Matching by dot product or cosine similarity
I.R. Intro 14: Vector matching I.R.
cosine(d, q) = Σt (dt × qt) / ( √Σt dt² × √Σt qt² )
where t = term, d = document vector, q = query vector
I.R. Intro 15: An example document space: computing titles
C1: Human machine interface for ABC computer applications
C2: A survey of user opinion of computer system response times
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user perceived response time to error measurement
I.R. Intro 16: An example document space: maths titles
M1: The generation of random binary ordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: Widths of trees and well quasi ordering
M4: Graph minors: A survey
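Titles like these can be turned into term-count vectors and matched against a query by cosine similarity. A sketch that omits stop-word removal and weighting:

```python
import math
from collections import Counter

def vec(text):
    """Represent a text as a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors (Counter lookups default to 0)."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

c3 = vec("The EPS user interface management system")
c4 = vec("System and human system engineering testing of EPS")
query = vec("EPS system")  # the query is just a (very sparse) pseudo-document
print(cosine(query, c3), cosine(query, c4))
```

C4 scores higher than C3 here because "system" occurs twice in it, which is exactly the ranking behaviour a binary match cannot provide.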
I.R. Intro 19: Specification errors
- Any words NOT in the query are assumed to have ZERO relevance
- Fine for irrelevant terms, but what about synonyms?
- What about distance between terms? "PC" is very close to "personal computer", close to "computer", far from "dog"
I.R. Intro 21: Probabilistic models
- Soft matching
- Rank hits by relevance
- Vector space allows this
- Bayes' theorem / fuzzy logic
I.R. Intro 22: Relevance feedback
- Why let the user interact?
- The query can be refined based on the user's input
- One good hit leads to another: the best queries are long (e.g. a whole document!)
- Learn about the user for future searches (e.g. what hits did they follow up? Was there a pattern?)
I.R. Intro 23: Classification
- Manual, e.g. Yahoo!
- Automatic, e.g. an automatic key-word extractor with a fixed classification tree (Dewey, for example)
- Automatic, e.g. use a classification algorithm such as ID3 to learn the classification tree as well (the tree can look bizarre)
I.R. Intro 24: Collaboration
- Why not look at links followed by other users after a search event?
- Why not look at links TO or FROM a page? (These might be relevant.)
What types of collaboration between a connected community could you imagine?
I.R. Intro 25: Visualisation
Show hits in a way that makes sense. Is a list a good way to do this? What other ways could you imagine?
I.R. Intro 26: User modelling
- Why not know something about the user?
- Their past search events might be informative.
What could you model?
I.R. Intro 27: Text analysis
- How could you examine a page of text and find out what it is about?
- Use HTML tags?
- Use metadata?
- Parse text to find the subject of each sentence?
How would you try to find key words from a text?
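A naive key-word extractor along the lines asked about: strip stop words, count term frequencies, and return the most frequent terms. The stop-word list is illustrative:

```python
import re
from collections import Counter

STOP = {"the", "of", "a", "to", "and", "in", "is", "for", "it"}

def keywords(text, k=3):
    """Return the k most frequent non-stop-words as candidate key words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP)
    return [w for w, _ in counts.most_common(k)]

print(keywords("The system parses the text; the system then indexes the text."))
```

Frequency alone is crude; combining it with the IDF idea from the inverted-index slide would filter out words that are frequent everywhere.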
I.R. Intro 28: NLP approaches
- Attempt to understand the content of a document
- Syntax (the structure of text) is easy; semantics (meaning) is ad hoc
- Possible for very limited domains with little ambiguity
- The essential problem is implicit context, or common sense
I.R. Intro 29: Conclusions
- I.R. is needed!
- I.R. is hard!
- A range of approaches is competing
- No winners yet
- Find out about these, be creative and critical, contribute and win!
I.R. Intro 30: IR topics
B. Author identification
C. Classification
D. Collaboration
E. Commercial Systems
F. Data Mining
G. Distributed IR
I.R. Intro 31: IR topics
H. Evaluation
I. Latent Semantic Indexing
K. Multimedia IR
L. Probabilistic Retrieval
M. Query Languages
N. Relevance Feedback
O. Search technologies
I.R. Intro 32: IR topics
P. Text Analysis
Q. Text level analysis
R. Thesauri in IR
S. User interfaces and visualisation
T. User modelling
U. Web Search
I.R. Intro 33: References
- Chapter 1 of Information Retrieval by C. J. van Rijsbergen; see http://www.dcs.glasgow.ac.uk/Keith/Preface.html
- Chapter 1 of Finding Out About by Richard Belew
- Chapter 1 of Modern Information Retrieval by Ricardo Baeza-Yates & Berthier Ribeiro-Neto