Information Retrieval

Yu Hong and Heng Ji
October 15, 2014
Hi, guys. Nice to meet you. My name is Yu, and I am very glad to spend the coming three hours with you. The topic of this talk is information retrieval. Although popular, it is a big subject, big enough to fill an entire semester. So I will skip the long story about the background and history of the field, and briefly introduce the basic information theory and some interesting techniques.

Outline
Introduction
IR Approaches and Ranking
Query Construction
Document Indexing
IR Evaluation
Web Search
INDRI
To be simple or to be useful?

Information
Before discussing information retrieval, we should understand the meaning of information. You can define information however you like, but we should at least agree on its basic function. Please look at this picture. Can you tell me the meaning of the sign? Yes, it is a stop sign. Through the sign, you perceive that somebody, say the police, is asking you to stop your car here.

Basic Function of Information
Information = transmission of thought
[Diagram: Thoughts -> Encoding -> Words (Writing) or Sounds (Speech) -> Decoding -> Thoughts. Telepathy?]
So now we should realize the basic function of information: the transmission of thought. Obviously, we do not have the capacity for telepathy, so we have to encode our thoughts into signals, languages, pictures, images, and so on in order to transmit them to other people. On the other side, the people who want to know what we are thinking must decode those signals, languages, pictures, and images into their own thoughts. In this procedure, the signals, languages, pictures, and images are the so-called information.

Information Theory
Better called “communication theory”
Developed by Claude Shannon in the 1940s
Concerned with the transmission of electrical signals over wires
How do we send information quickly and reliably?
Underlies modern electronic communication:
Voice and data traffic…
Over copper, fiber optic, wireless, etc.
Famous result: Channel Capacity Theorem
Formal measure of information in terms of entropy
Information = “reduction in surprise”
In the literature, Claude Shannon was the first person to study information theory in detail. Shannon was concerned with the transmission of electrical signals over wires and tried to solve the problems of transmission rate and reliability. Even now, the big problems in the field of information processing are still the speed and reliability of information acquisition. Shannon's work underlies modern electronic communication, such as voice and data traffic. His famous result is the Channel Capacity Theorem, and his theory measures information formally in terms of entropy: information is the reduction in surprise. For example, look at these pictures. It may be surprising to see such a picture suddenly. But after seeing many similar pictures, you will find them boring, or at least no longer surprising. That is because you already have enough related information about what is in the pictures. The more such information you have, the less surprise remains.
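To make “information = reduction in surprise” concrete, here is a minimal Python sketch of the standard entropy formula. The function names and the example probabilities are illustrative assumptions, not part of the lecture.

```python
import math

def surprisal(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy in bits: the average surprisal of a distribution."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# A rare event is more surprising than a common one.
rare = surprisal(1 / 1024)    # 10.0 bits
common = surprisal(1 / 2)     # 1.0 bit

# A fair coin is maximally uncertain; a biased coin is less so.
fair_coin = entropy([0.5, 0.5])      # 1.0 bit
biased_coin = entropy([0.9, 0.1])    # about 0.47 bits
```

The picture that you have seen many times corresponds to the common event: its probability is high for you, so it carries few bits of surprise.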

The Noisy Channel Model
Information transmission = producing the same message at the destination as was sent at the source
The message must be encoded for transmission across a medium (called the channel)
But the channel is noisy and can distort the message
[Diagram: Source -> Transmitter -> message -> channel (+ noise) -> message -> Receiver -> Destination]
In summary, in Shannon's information theory, the goal of information transmission is to produce the same message at the destination as was sent at the source. The message must be encoded for transmission across a medium. The medium is the perceptible information (signals, languages, pictures, images, and so on), and it is also called the channel, the same channel as in Shannon's Channel Capacity Theorem. In addition, we must remember that there is always noise in the channel, and the noise normally misleads the recipient. For example, if the channel looks like this, the receiver at the destination will be a little confused; if it looks like this, seriously confused; and if it looks like this, a Chinese stop sign, non-native people will be totally confused. Therefore, this model of information transmission is called the noisy channel model.

A Synthesis
Information retrieval as communication over time and space, across a noisy channel
[Diagram: Sender -> Encoding (indexing/writing) -> message -> storage (+ noise) -> message -> Decoding (acquisition/reading) -> Recipient]
Modern information retrieval inherits the noisy channel model. It has to consider encoding at the sender's end, storage of information in the channel, and decoding at the receiver's end. For example, in text retrieval, senders all over the world provide many kinds of texts, such as news, blogs, stories, and research papers. The encoding techniques should translate these texts into information with a uniform pattern and structure, such as indexed documents formatted in XML (Extensible Markup Language). In the channel, large-scale information is stored in a specific database, such as cloud storage built on a distributed system. At the destination, the decoding techniques should mine or search the channel for related information, translate it back into normal text, and submit the text to the corresponding receiver. Of course, there is a lot of noise in the channel, and neither encoding nor decoding can filter it out effectively. So you know what? Every day it is unavoidable that you receive noise, just like what I am talking about here.

What is Information Retrieval?
Most people equate IR with web search
highly visible, commercially successful endeavors
leverage 3+ decades of academic research
IR: finding any kind of relevant information
web pages, news events, answers, images, …
“relevance” is a key notion
Normally, when people talk about an information retrieval system, the first thing that comes to mind is a search engine, such as Google. But a search engine is just one kind of online retrieval application. There should be a richer understanding of what an information retrieval system is. For example, a library is also an information retrieval system. The authors of the books provide many kinds of knowledge, serving as the message senders. Within a book, the knowledge is organized through a catalogue, which is the prototype of modern indexing techniques. The books sorted and stored on the shelves can be regarded as the channel, through which readers browse, search, and read books to take in the corresponding knowledge. Likewise, a bilingual dictionary or a machine translation system is a kind of information retrieval system: the comparable and parallel bilingual materials stored in the system act as the information channel for communicating thoughts among speakers of different languages. Question answering systems, the knowledge base Wikipedia, filtering systems, and product recommendation systems in electronic commerce can all be regarded as information retrieval systems.

What is Information Retrieval (IR)?
So we can give a simple definition of information retrieval. It is a combined mechanism of information storage, organization, and acquisition, with the aim of finding any kind of relevant information quickly and reliably for the message recipients. Please remember that the relevance between the recipients' information need and the information in the channel is the key issue in any information retrieval system. That is, not all the information in the channel should be delivered to the recipients, only the information that matches their need. In this lecture, I concentrate only on state-of-the-art retrieval techniques, especially those widely used in popular search engines. But once again, please be advised that a search engine is not the same thing as information retrieval. It is just one application of retrieval techniques.

Interesting Examples
Google image search
Google video search
People Search
Social Network Search
Here are some examples of search engines that search images, videos, entities, and social networks, respectively. But the underlying technique is actually full-text retrieval.

For example, if I want to find some pictures of cats, I type the word “cat” into the search box and click the search button. Although the search results are all pictures, they are produced by strict matching between the word “cat” that I typed into the search box and the same word in the text surrounding the pictures on the web pages. In a pure image search engine, the input would be a picture, and the engine would need to match the input picture against large-scale image resources. For the detailed reasons why such an image search engine has not been widely used until recently, we should ask the experts in image processing. But I am happy to give a rough explanation here. Firstly, let's take a look at the time complexity of text retrieval. From our look at Shannon's information theory, we know that an information retrieval system should select relevant information from the channel and deliver it to the recipients, where relevance is the decisive factor for a successful and welcome delivery. Secondly, relevance is normally measured by the similarity between pieces of information. For example, if the input word is “cat”, the word roughly reflects the recipient's information need. If “cat” occurs frequently in a text, we say that the text is similar to the information need: it is a good match, and thus the text is relevant. In this procedure, for each text, the search engine only needs to check whether the word “cat” occurs in the text and how many times. The worst-case time complexity of this check is O(n), where n is the length of the text. But for images, things are very different. Please follow me to the blackboard. Here is a circle, and here is another circle.
We assume that this circle is the one submitted to the search box, and this one is a piece of information in the channel. We aim to use an image search engine to determine the relevance between the two circles. Obviously, we can see that the two circles are very similar, so the one in the channel should be delivered to the recipient as relevant information. But for the image search engine, this is a very challenging task. Let me explain why. As we know, displaying a picture on a monitor relies on the signal shown at each pixel. Therefore, we can divide each picture into pixels in a two-dimensional space, and set the signal to 1 for the pixels on the circle and 0 for the rest. We then get two matrices filled with 1s and 0s. Now let's measure the similarity between the pictures using these matrices. Firstly, you can see that the two matrices are very different. The similarity is 0; in other words, the relevance is 0. Thus this picture would be regarded as irrelevant. OK, matrix similarity doesn't work. So, secondly, it seems a good idea to consider the frequency of the circle segment in a specific pixel, just like the frequency of a word in a full-text search engine. For example, we can check whether this segment occurs frequently in this picture, and if so, we can determine the similarity. But you will find that no circle segment is consistent between the pixels of the two pictures. In this picture, every circle segment corresponds to an angle of 90 degrees, while in this picture, every segment corresponds to an angle of nearly 30 degrees. Thus the occurrence frequency of any circle segment in the other picture is 0, and the relevance is still 0. Maybe you think that we can shrink the pixels further until every segment is small enough that the segments become comparable or even consistent with each other.
That would mean we can find one or more identical segments in this picture, so the frequency of the segment is now available for the similarity calculation. But for each segment in this picture, we have to traverse all the pixels in the other picture to find the one similar to it. And to match the two pictures, we need to calculate the frequency for all the segments in this picture. Thus, if there are n pixels in the two-dimensional space, the time complexity is O(n^2), the square of n. That is really time consuming even for a single pair of images. It is unimaginable to apply this method to image search, which deals with millions of information needs.
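The O(n) text check described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names, the whitespace tokenization, and the sample texts are hypothetical, and real search engines use far richer matching.

```python
def count_term(text, term):
    """Count occurrences of `term` in `text` with one pass over its words."""
    term = term.lower()
    count = 0
    for word in text.lower().split():
        # Strip common punctuation so that "cat." still matches "cat".
        if word.strip(".,!?;:\"'") == term:
            count += 1
    return count

def rank_by_term(texts, term):
    """Return (count, text) pairs for matching texts, most matches first."""
    scored = [(count_term(t, term), t) for t in texts]
    matches = [pair for pair in scored if pair[0] > 0]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)

texts = [
    "My cat chased another cat.",
    "A dog barked at the cat!",
    "Stop signs are red.",
]
ranking = rank_by_term(texts, "cat")
```

Each text is scanned once, so the cost grows linearly with its length. The segment matching on the blackboard instead compares every segment of one picture with every pixel of the other, which is where the square comes from.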

IR System
[Diagram: Query String and Document corpus -> IR System -> Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...). In noisy channel terms: Sender -> Encoding (indexing/writing) -> storage (+ noise) -> Decoding (acquisition/reading) -> Recipient]
OK. Now, let us come back to full-text search engines. As mentioned a moment ago, any information retrieval system follows the noisy channel model. Therefore a text search engine also has a noisy channel. The message recipients at the destination of the channel are the information searchers, also called users, while the message sender at the source end is the document corpus, which provides the information resources to be delivered to users through the channel. The information retrieval system controls the whole noisy channel to organize, store, and acquire information. Normally, users describe their information need in natural language, generating one query string per search and submitting it to the search engine. In the noisy channel, the retrieval system selects the information relevant to the query and filters out the noise, so as to deliver high-quality relevant information to users. Such information is also called the search results, and it is ranked by degree of relevance. That is why we always see a list of search results. Normally, a text search engine places the most relevant results at the top of the list.

The IR Black Box
[Diagram: Documents and Query -> black box -> Results]
So, put simply, the input of a text search engine is the user query, the output is the list of search results, and the results are drawn from the document corpus. But what happens in the channel? In other words, what does an information retrieval system do in the channel to control the organization, storage, and delivery of information?

Inside The IR Black Box
[Diagram: Query -> Representation Function -> Query Representation; Documents -> Representation Function -> Document Representation -> Index; Query Representation and Index -> Comparison Function -> Results]
The internal components of a search engine include a representation model, a comparison model, and an indexing model. The representation model describes the meaning of the query and the documents. The comparison model determines whether a query and a document have a similar meaning; if they do, they are related to each other, and the relevant documents are delivered as search results. The indexing model builds the index of the documents, just like an electronic catalogue of the documents. But here there is another question: where do we get the documents? Users input the query and the search engine outputs the results, but where do the documents originate? Without the documents, the information channel is empty, and there is nothing the search engine can do.
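As a rough illustration of how the three internal components fit together, the following Python sketch uses a bag-of-words representation, an inverted index, and a term-overlap comparison. All three choices, the function names, and the sample documents are assumptions made for the example; the lecture does not prescribe them.

```python
from collections import defaultdict

def represent(text):
    """Representation function: a lowercase bag of words (an assumed, very simple choice)."""
    return text.lower().split()

def build_index(documents):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index

def compare(query, index):
    """Comparison function: score each document by how many query terms it contains."""
    scores = defaultdict(int)
    for term in set(represent(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

documents = {
    "doc1": "cat pictures and cat videos",
    "doc2": "stop sign pictures",
    "doc3": "information theory",
}
index = build_index(documents)
results = compare("cat pictures", index)
```

Here `results` is a ranked list with `doc1` ahead of `doc2`, and `doc3` is left out because it shares no term with the query. The index plays the role of the electronic catalogue: the comparison only touches documents that contain a query term.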

Building the IR Black Box
Fetching model
Comparison model
Representation model
Indexing model
So, besides the representation, comparison, and indexing models inside the search engine, there is an external model, the fetching model, which is used specifically to fetch the documents. It is indispensable.

Fetching models: crawling model, gentle crawling model
Comparison models: Boolean model, vector space model, probabilistic models, language models, PageRank
Representation models: How do we capture the meaning of documents? Is meaning just the sum of all terms?
Indexing models: How do we actually store all those words? How do we access indexed terms quickly?
Here you can see that there are many concrete models. For example, to compare the similarity between a query and a document, we can use the Boolean model, the vector space model, probabilistic models, language models, and PageRank. But first, let's study the crawling model, which is used specifically to fetch web pages.

Fetching model: Crawling
[Diagram: Web pages -> Documents -> Search Engines]
We are now talking about full-text search engines, the most widely used information retrieval applications. For this kind of search engine, as you know, the available document corpus normally consists of web pages.

Crawling
[Diagram: World Wide Web -> Fetching Function -> Documents -> Representation Function -> Document Representation -> Index; Query -> Representation Function -> Query Representation; Index and Query Representation -> Comparison Function -> Results]
So we need to fetch web pages from the World Wide Web for use. But here there are two questions. One is how many web pages we should fetch. The other is how to fetch them.

Q1: How many web pages should we fetch?
As many as we can.
More web pages = richer knowledge
[Diagram: an intelligent search engine. Query String and Document corpus -> IR System -> Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
For the first question, the answer is “as many as we can”. It would be really good news if we could fetch all the web pages in the world, but that is impossible. You may ask why we need so many web pages. It is because an intelligent search engine should store rich knowledge before responding to users' queries, and obviously a large collection of web pages contains richer knowledge.

Q1: How many web pages should we fetch?
As many as we can.
The fetching model enriches the knowledge in the brain of the search engine
[Diagram: Fetching Function -> IR System, which says “I know everything now, hahahahaha!”]
So the fetching model traverses the whole World Wide Web, brings lots of web pages back to the search engine, and stores the knowledge in those pages in the engine's brain. Then the search engine has intelligence. The more web pages are fetched and stored, the more intelligent the engine will be. It is just like this: I know there are nearly 7 billion people in the world, but if you ask me who Jimmy, Tommy, or Mary is, sorry, I don't know them, unless “somebody named fetching model” takes me around the world and introduces each person to me.

Q2: How to fetch the web pages?
First, we should know the basic network structure of the web
Basic structure: nodes and links (hyperlinks)
[Diagram: the World Wide Web and its basic structure]
For the second question, how do we fetch the web pages? Here we introduce an automatic fetching model called the crawling model. Before looking into the model, let's review the basic network structure of the web, which is very important for understanding how the crawling technique works. The web looks very complicated, jointly built from people, many kinds of hardware devices, and lots of communication software. But its basic structure is a network consisting of nodes and links. The nodes can be servers, personal computers, iPhones, routers, and so on, while a link between nodes can be regarded as the combination of the web addresses of the nodes and the access control programs for communication between them. Such a link is also called a hyperlink.

Q2: How to fetch the web pages?
The crawling program (crawler) visits each node in the web through hyperlinks.
[Diagram: IR System and the basic network structure]
For the crawling program, it is sufficient to know the hyperlinks, because it only needs to visit the nodes through the links. A crawling program is also called a crawler. Its job is to traverse all the nodes in the network of the web and steal the information in the nodes. So the crawler looks like a bad spider crawling on a cobweb. Normally, the search engine dispatches the crawler to some nodes of the web. The first nodes the crawler arrives at are called the seed nodes. From the seed nodes, the crawler begins to work independently: finding the addresses of unknown nodes that are linked from the known nodes, automatically visiting the newly found nodes through the hyperlinks, stealing the information in the nodes, sending the information back to the search engine for storage, and searching for new unknown nodes. The crawler repeats the procedure until it finds no new unknown nodes in the web.

Q2: How to fetch the web pages?
Q2-1: What are the known nodes?
The crawler knows the addresses of the nodes
The nodes are web pages
So the addresses are URLs (URL: Uniform Resource Locator)
Such as: etc.
Q2-2: What are the unknown nodes?
The crawler doesn't know the addresses of the nodes
The seed nodes are the known ones
Before dispatching the crawler, a search engine introduces some web page addresses to the crawler. These web pages are the earliest known nodes (the so-called seeds).
In the crawling procedure, I repeatedly mentioned two kinds of nodes: known nodes and unknown nodes. What are they? From the perspective of the crawler, if the address of a web page has already been noted, the page is a known node; otherwise it is unknown. The address mentioned here is the Uniform Resource Locator, such as yahoo.com, sohu.com, sina.com, and so on. Through the addresses, the crawler can use the access control program (ACP) of the hyperlink to visit the nodes directly. Please note that the seed nodes are known nodes. Before dispatching the crawler to the web, a search engine must introduce some web page addresses to the crawler. For the crawler, these pages are the earliest known nodes, and therefore they are the starting points for traversing the whole network of the web.

Q2: How to fetch the web pages?
Q2-3: How can the crawler find the unknown nodes?
[Diagram: a known node holding a document, surrounded by unknown nodes; the crawler says “I can do this. Believe me.”]
Now the crawler encounters a new problem: how to find the unknown nodes in the web. Don't forget that the crawler knows some nodes, such as the seed nodes introduced by the search engine. Because the crawler knows the addresses of these nodes, it can visit them using the access control program (ACP). After visiting the nodes, in other words, after visiting the known web pages, the crawler can browse the whole content of the pages: not only the documents in the pages, such as texts, pictures, and videos, but also the underlying source code.

The source code of a web page is text formatted in XML (Extensible Markup Language) or HTML (HyperText Markup Language). Both are structured languages that use uniform tags to generate structured data. Please check the source code here. It is structured using the markup language; see the tags highlighted by the boxes. The markup explains the content of each structure. For example, the lang attribute denotes the natural language used in the web page, here American English, written as en-US for short, while the tag <title> denotes that the string following it is the title of the text body of the web page. More importantly, some attributes denote the addresses of hyperlinks, such as src and href here. See the blue boxes. The strings following them are all addresses: the addresses of the web pages that the current web page links to.

See this web page. In the page, all of these are hyperlinks. If you click any of them, you will open another web page. That is because the source code of this web page records the web addresses of the hyperlinks you click. When you click a hyperlink, the access control program visits the address and helps you open the corresponding web page.

That means that if a parser can parse the XML or HTML source code of a web page, it can locate the markup that holds web addresses and extract the addresses for use in accessing new web pages.

[Diagram: with a PARSER, the nodes linked from the known node become known; the crawler says “Good news for me.”]
This is really good news for the crawler. Outstanding programmers have embedded a source code parser in the crawler. Thus, given a known node, in other words, a known web page, the crawler can use the parser to identify and extract the web addresses in the source code of the page. Then, for all the web pages that the given page links to, the crawler now knows their addresses. That means the crawler knows them, knows where they are, and can visit them and steal their information through their source code whenever it wants. These nodes are now known to the crawler.
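A minimal sketch of such a source code parser can be built on Python's standard `html.parser` module. The class name `LinkParser`, the sample page, and the example addresses are hypothetical; the sketch simply collects the values of `href` and `src` attributes, which is one plausible reading of the extraction step described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the values of href and src attributes found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative addresses against the page's own address.
                self.links.append(urljoin(self.base_url, value))

# A made-up page standing in for the source code of a known node.
page = """
<html lang="en-US">
  <head><title>Example page</title></head>
  <body>
    <a href="/news.html">News</a>
    <a href="http://other.example/index.html">Other site</a>
    <img src="cat.png">
  </body>
</html>
"""
parser = LinkParser("http://site.example/home.html")
parser.feed(page)
```

After `feed`, `parser.links` holds three absolute addresses, each of which is a node that the crawler can now treat as known.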

Q2: How to fetch the web pages?
Q2-3: How can the crawler find the unknown nodes?
In summary, if you introduce a web page to the crawler (let it know the web address), the crawler will use a source code parser to mine lots of new web pages. Of course, the crawler then knows their addresses. But if you don't tell the crawler anything, it will go on strike because it can do nothing. That is why we need the seed nodes (seed web pages) to awaken the crawler.
[The crawler says: “Give me some seeds.”]

Q2: How to fetch the web pages?
To traverse the whole network of the web, the crawler needs some auxiliary equipment:
A register with a FIFO (first in, first out) data structure, such as a queue
An Access Control Program (ACP)
A Source Code Parser (SCP)
Seed nodes
[Diagram: the crawler with its FIFO register, ACP, and SCP; it says “I need some equipment.”]
First, the crawler needs temporary storage, also called a register, to store the web nodes. The register must have a FIFO data structure. FIFO means first in, first out: the first item stored in the register is taken out first, and an item further back can never be taken out earlier than the ones ahead of it. So the structure of the register is a queue. Besides, the crawler needs an access control program, ACP for short, which is used specifically to visit a web node. If you are interested in how the ACP works, please look it up in a computer networks textbook and focus on the TCP/IP protocols, sockets, network ports, and the three-way handshake. Further, the crawler needs a source code parser, SCP for short. We need to develop the SCP ourselves after gaining a brief understanding of the structured languages XML and HTML. For the crawler of a full-text search engine, the SCP only needs to identify the structures that hold the title, language type, text body, keywords, and URL resources, and to extract the corresponding content from them. The crawler must steal this content because the title, text body, keywords, and language type are very important for the linguistics-based relevance measurement between queries and web pages. More importantly, the stolen URL resources give the crawler new targets to visit. Finally, the crawler needs to know some seed nodes.

Q2: How to fetch the web pages?
Robotic crawling procedure (only five steps)
Initialization: push the seed nodes (known web pages) into the empty queue
Step 1: Take a node out of the queue (FIFO) and visit it (ACP)
Step 2: Steal the necessary information from the source code of the node (SCP)
Step 3: Send the stolen text information (title, text body, keywords, and language) back to the search engine for storage (ACP)
Step 4: Push the newly found nodes into the queue
Step 5: Repeat Steps 1-4 iteratively
[The crawler says: “I am working now.”]
Supported by this equipment, the crawler traverses the network of the web in five steps. Before crawling, we should initialize the register of the crawler, that is, push some seed nodes into the empty queue in the register. During crawling, the crawler first takes one node out of the register, always the one at the head of the queue, and visits the node using the access control program. Secondly, the crawler steals the necessary information from the source code of the node, including the text information, such as the title, language, keywords, and text body, and the addresses of unknown nodes. After this step, the crawler has found some new nodes. Thirdly, the crawler sends the text information back to the search engine for storage. Fourthly, the crawler pushes the newly found nodes into the register. From now on, these nodes are known to the crawler. In the fifth step, the crawler repeats all the steps above.
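The five steps can be sketched in Python with `collections.deque` as the FIFO register. Everything web-related here is a stand-in: `TOY_WEB` is a hypothetical in-memory web, `visit` plays the role of the ACP and SCP together, and `max_visits` is an added safety cap that is not part of the five steps.

```python
from collections import deque

# Hypothetical in-memory "web": address -> (page text, addresses it links to).
TOY_WEB = {
    "seed": ("seed page", ["a", "b"]),
    "a": ("page a", ["c"]),
    "b": ("page b", []),
    "c": ("page c", []),
}

def visit(address):
    """Stand-in for the ACP and SCP: return a node's text and outgoing links."""
    return TOY_WEB[address]

def crawl(seeds, max_visits=100):
    """Run the five-step procedure; max_visits is an added safety cap."""
    queue = deque(seeds)                  # Initialization: push the seed nodes.
    storage = []                          # Stand-in for the search engine's storage.
    visits = 0
    while queue and visits < max_visits:
        address = queue.popleft()         # Step 1: take the head of the queue (FIFO).
        text, links = visit(address)      # Steps 1-2: visit the node, extract text and links.
        storage.append((address, text))   # Step 3: send the text back for storage.
        queue.extend(links)               # Step 4: push the newly found nodes.
        visits += 1                       # Step 5: repeat.
    return storage

stored = crawl(["seed"])
order = [address for address, _ in stored]
```

Because the register is a queue, the nodes are visited level by level from the seed: `seed`, then `a` and `b`, then `c`. The toy web has no links pointing back to earlier pages, so the queue empties on its own here.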

Q2: How to fetch the web pages?
Through these steps, the number of known nodes grows continuously
This is the underlying reason why the crawler can traverse the whole web
The crawler keeps working until the register is empty
Although the register is then empty, the information of all the nodes in the web has been stolen and stored on the search engine's server.
[Animation: starting from the seeds, the slots of the register fill with new nodes; the crawler says “I control this.”]
During the crawling procedure, in each iteration, the crawler takes a known node out of the register, then brings the newly found nodes back from the web and pushes them into the register. So the number of known nodes grows continuously until few new nodes can be found. The crawler stops working when the register is empty. Ideally, when the register is empty, the crawler cannot find any new nodes in the web. That means the crawler has traversed all the nodes in the web and stolen all the necessary information from them.

Problems
1) In practice, the crawler cannot traverse the whole web. For example, it can run into an infinite loop when it falls into a partially closed-circle network (a snare) in the web.
[Diagram: nodes linked in a closed circle]
OK, that was some basic knowledge of the fetching model, and I have shown a widely used crawling technique. But please be advised that this crawler is just a prototype. To develop a practical crawler, we still need to overcome many problems. For example, in practice the crawler cannot traverse the whole web. Sometimes it falls into a partially closed-circle network in the web and thus iterates endlessly within the circle.
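The snare is easy to reproduce on a toy example. The sketch below is a hypothetical three-node circle crawled with a plain FIFO queue; the `max_visits` cap exists only so that the demonstration terminates, and it is not a solution to the problem.

```python
from collections import deque

# Hypothetical closed circle: A links to B, B to C, and C back to A.
CIRCLE = {"A": ["B"], "B": ["C"], "C": ["A"]}

def naive_crawl(seed, links, max_visits):
    """FIFO crawl with no memory of visited nodes, stopped after max_visits."""
    queue = deque([seed])
    visited_order = []
    while queue and len(visited_order) < max_visits:
        node = queue.popleft()
        visited_order.append(node)
        queue.extend(links[node])
    return visited_order, len(queue)

order, still_queued = naive_crawl("A", CIRCLE, max_visits=7)
```

After seven visits the crawler has gone around the circle more than twice (`A, B, C, A, B, C, A`) and the register still holds a node, so without the cap it would never stop.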

Fetching model: Crawling. Problems 2) Crude crawling. A portal web site causes a series of homologous nodes in the register. Abiding by the FIFO rule, the iterative crawling of these nodes continuously visits their common server. This is crude crawling. [Figure: the network of the web and the register; a class of homologous web pages linking to a portal site.] Another important problem is crude crawling. Normally, the crawler becomes crude when it starts the iterative crawling of a series of homologous nodes in the register, such as the nodes in the yellow boxes. They all derive from the server of Yahoo: Yahoo Games, Yahoo Mobile, Yahoo Weather and so on. Because the crawler sticks to the rule of first in, first out, it visits these nodes one by one. That means the crawler visits the server of Yahoo over and over again, and the server of Yahoo has to serve the crawler ceaselessly. This is a serious interruption, because the server has almost no time to serve normal users while the crawler is visiting it. And this phenomenon occurs frequently in the crawling procedure. It occurs after the crawler visits the homepage of a portal web site, such as the homepage of yahoo.com, and steals all of the URLs in the source code of the page. Such a page has so many hyperlinks that, after stealing their addresses, the crawler pushes lots of newly found nodes into the queue of the register one after another, generating a series of homologous nodes.

Homework 1) How to overcome the infinite loop caused by a partially closed-circle network in the web? 2) Please find a way to crawl the web like a gentleman (not crudely). Please select one of the problems as the topic of your homework. A short paper is required, of no more than 500 words. Please include at least your idea and a methodology. The methodology can be described in natural language, as a flow diagram, or as an algorithm. Send it to me. Thanks. I hope you can select one of the problems as the topic of the homework. Please check the details on this slide.

Fetching models: Crawling model, Gentle Crawling model
Comparison models: Boolean model, Vector space model, Probabilistic models, Language models, PageRank
Representation Models: How do we capture the meaning of documents? Is meaning just the sum of all terms?
Indexing Models: How do we actually store all those words? How do we access indexed terms quickly?
In the rest of our time, I will briefly introduce the comparison models and concentrate on the Boolean model and the vector space model.

[Figure: the search engine framework. The Query and the Documents each pass through a Representation Function, giving the Query Representation and the Document Representation; the Document Representation is stored in the Index; the Comparison Function produces the Results.]
We already know that the crawler helps the search engine obtain large-scale documents from the web. Now we can compare the user’s information need with the documents, detect the relevance between the information need and the documents, and deliver the relevant documents to the user. Here, the user’s information need is directly described by the query words the user types into the search box. Therefore, the job of the comparison model is to measure the relevance between the query and the documents. In the framework of a search engine, the comparison model is the most important part because it directly influences the quality of the search results. People may ask: can we ignore the indexing model when we measure and determine relevance? The answer is yes. The index is a complex catalogue, but it is still a catalogue, used specifically to speed up the search procedure, and it has little direct influence on the relevance between query and documents. So in this chapter, we temporarily ignore it.

[Figure: the same search engine framework, with the Index marked “Ignore Now”.]
Before measuring the relevance between a query and the documents, we need to represent the query and the documents. For a full-text search engine, both the query and the documents can be represented as sets of independent words. For the query, the representation is the set of query words typed in by the user. For a document, do you remember the information stolen by the crawler, including the title, keywords and text body of a web page? Yes, a document consists of such information. Thus, for a document, the representation is the set of words from the title, the keywords and the text body. So, for a pair of query and document, the comparison model measures the relevance between two sets of words: the word set of the query and that of the document.

A heuristic formula for IR (Boolean model)
Rank docs by similarity to the query. Suppose the query is “spiderman film”. Relevance = # query words in the doc, which favors documents with both “spiderman” and “film”. Mathematically: $\mathrm{Rel}(Q, D) = \sum_{q \in Q} 1(q \in D)$. Logical variations (set-based): Boolean AND (require all words): $\mathrm{Rel}_{AND}(Q, D) = \prod_{q \in Q} O(q, D)$. Boolean OR (any of the words): $\mathrm{Rel}_{OR}(Q, D) = 1 - \prod_{q \in Q} (1 - O(q, D))$.
The simplest comparison model follows a heuristic rule: if the document contains the query words, then the document is relevant to the query. For example, if the query is “spiderman film”, then the relevant documents should contain at least one of the two words “spiderman” and “film”. Of course, the search engine favors the documents which contain both words. Mathematically, the heuristic rule can be modeled as the number of query words contained in a document D. Please check the first formula: the lowercase q is a query word, the uppercase Q is the set of query words, the uppercase D denotes the set of words in the document, and the indicator 1 takes the value 1 when the word q occurs in the word set D. This is the basic Boolean comparison model. The model has many logical variations, such as the Boolean AND model and the Boolean OR model. According to the AND model, a relevant document must contain all of the query words. Mathematically, the relevance is the product of the Boolean values of the occurrences of the query words. See the second formula, where O denotes the Boolean value of occurrence, equal to 1 when the query word q occurs in the document D, and 0 otherwise. That is, if any query word does not occur in the document D, the value of the Boolean AND model is 0. In other words, the document is irrelevant to the query. In contrast, according to the Boolean OR model, as long as the document contains one query word, the relevance equals 1, meaning the document is relevant to the query. But both of these logical variations have serious shortcomings. For the Boolean AND, the restriction is so strong that the comparison model normally misses lots of relevant documents.
For example, a document which contains the words “spiderman” and “cinema” is relevant to the query “spiderman film”, but the Boolean AND model judges it irrelevant because it does not contain all of the query words. On the other side, for the Boolean OR, the restriction is relaxed so much that the comparison model introduces lots of irrelevant documents into the search results. For example, a document which contains the words “Wolverine” and “film” is judged relevant by the Boolean OR model because it contains one of the query words, “film”.
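The three Boolean scores described above can be sketched as follows. This is a minimal sketch that represents the query and each document as a Python set of lowercase words; the function names are illustrative and not taken from any library.

```python
def count_score(query, doc):
    """Number of query words that occur in the document."""
    return sum(1 for q in query if q in doc)

def boolean_and(query, doc):
    """1 if every query word occurs in the document, else 0."""
    return int(all(q in doc for q in query))

def boolean_or(query, doc):
    """1 if at least one query word occurs in the document, else 0."""
    return int(any(q in doc for q in query))

query = {"spiderman", "film"}
doc_cinema = {"spiderman", "cinema"}      # relevant, but missed by AND
doc_wolverine = {"wolverine", "film"}     # irrelevant, but accepted by OR

print(count_score(query, doc_cinema), boolean_and(query, doc_cinema))
print(count_score(query, doc_wolverine), boolean_or(query, doc_wolverine))
```

The two example documents reproduce the shortcomings discussed above: AND rejects the “spiderman cinema” document, while OR accepts the “Wolverine film” document.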

Term Frequency (TF)
Observation: key words tend to be repeated in a document. Modify our similarity measure: give more weight if a word occurs multiple times: $\mathrm{sim}(Q, D) = \sum_{q \in Q} tf_D(q)$, where $tf_D(q)$ is the number of occurrences of q in D. Problem: biased towards long documents, spurious occurrences. Normalize by length: $\mathrm{sim}(Q, D) = \sum_{q \in Q} tf_D(q) / |D|$.

Inverse Document Frequency (IDF)
Observation: rare words carry more meaning: cryogenic, apollo. Frequent words are linguistic glue: of, the, said, went. Modify our similarity measure: give more weight to rare words … but don’t be too aggressive (why?): $\mathrm{sim}(Q, D) = \sum_{q \in Q} \frac{tf_D(q)}{|D|} \cdot \log \frac{|C|}{df(q)}$, where $tf_D(q)$ is the number of occurrences of q in D, |C| … total number of documents, df(q) … total number of documents that contain q.

TF normalization
Observation: D1={cryogenic,labs}, D2={cryogenic,cryogenic}. Which document is more relevant? Which one is ranked higher? (df(labs) > df(cryogenic)) Correction: the first occurrence is more important than a repeat (why?); “squash” the linearity of TF.
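The D1/D2 question can be checked numerically. The sketch below is an illustration under stated assumptions, not the slide's own formula: it assumes IDF is computed as log(|C|/df(q)), it uses log(1 + tf) as one possible squashing function, and the collection size and document frequencies are made-up values chosen only so that df(labs) > df(cryogenic).

```python
import math
from collections import Counter

C = 1000                              # assumed collection size |C|
df = {"cryogenic": 10, "labs": 50}    # assumed document frequencies

def idf(term):
    """Assumed IDF: log of collection size over document frequency."""
    return math.log(C / df[term])

def score(query, doc, squash):
    """Sum of TF * IDF over query words; TF is log-squashed if requested."""
    tf = Counter(doc)
    total = 0.0
    for q in query:
        if tf[q] > 0:
            weight = math.log(1 + tf[q]) if squash else tf[q]
            total += weight * idf(q)
    return total

query = ["cryogenic", "labs"]
D1 = ["cryogenic", "labs"]
D2 = ["cryogenic", "cryogenic"]

linear = (score(query, D1, squash=False), score(query, D2, squash=False))
squashed = (score(query, D1, squash=True), score(query, D2, squash=True))
print(linear, squashed)
```

With these assumed numbers, linear TF ranks D2 above D1 even though D1 matches both query words; after squashing, the repeated “cryogenic” counts for less and D1 moves ahead.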

State-of-the-art Formula
Common words → less important. Repetitions of query words → good. More query words → good. Penalize very long documents.

Strengths and Weaknesses
Strengths: Precise, if you know the right strategies. Precise, if you have an idea of what you’re looking for. Implementations are fast and efficient.
Weaknesses: Users must learn Boolean logic. Boolean logic is insufficient to capture the richness of language. No control over the size of the result set: either too many hits or none. When do you stop reading? All documents in the result set are considered “equally good”. What about partial matches? Documents that “don’t quite match” the query may be useful also.

Vector-space approach to IR
[Figure: documents made of the terms cat, dog and pig drawn as vectors in term space, with an angle θ between vectors.] Assumption: Documents that are “close together” in vector space “talk about” the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”).

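Some formulas for Similarity
For a document vector $D = (d_1, \dots, d_n)$ and a query vector $Q = (q_1, \dots, q_n)$ over terms $t_1, \dots, t_n$:
Dot product: $\mathrm{Sim}(D, Q) = \sum_i d_i q_i$
Cosine: $\mathrm{Sim}(D, Q) = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2} \sqrt{\sum_i q_i^2}}$
Dice: $\mathrm{Sim}(D, Q) = \frac{2 \sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2}$
Jaccard: $\mathrm{Sim}(D, Q) = \frac{\sum_i d_i q_i}{\sum_i d_i^2 + \sum_i q_i^2 - \sum_i d_i q_i}$
[Figure: vectors D and Q in the plane of terms t1 and t2.]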

An Example A document space is defined by three terms:
hardware, software, users the vocabulary A set of documents are defined as: A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) A7=(1, 1, 1) A8=(1, 0, 1). A9=(0, 1, 1) If the Query is “hardware and software” what documents should be retrieved?

An Example (cont.)
In Boolean query matching: documents A4, A7 will be retrieved (“AND”); retrieved: A1, A2, A4, A5, A6, A7, A8, A9 (“OR”).
In similarity matching (cosine): q=(1, 1, 0)
S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
Document retrieved set (with ranking) = {A4, A7, A1, A2, A5, A6, A8, A9}
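The cosine scores in this example can be reproduced with a short script. This is a minimal sketch in plain Python; the helper name `cosine` is illustrative, documents with a score of zero are left out of the ranking, and ties are kept in document order.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Term order: (hardware, software, users)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)   # query "hardware and software"

scores = {name: round(cosine(q, vec), 2) for name, vec in docs.items()}
# Stable sort: documents with equal scores stay in their original order.
ranked = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])
print(scores)
print(ranked)
```

Unlike the Boolean AND result, the cosine ranking keeps partially matching documents such as A1 and A5, but places them below A4 and A7.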

Probabilistic model
Given D, estimate P(R|D) and P(NR|D)
P(R|D) = P(D|R) * P(R) / P(D) ∝ P(D|R) (P(D), P(R) constant)
D = {t1=x1, t2=x2, …}

Prob. model (cont’d) For document ranking

Prob. model (cont’d)
How to estimate pi and qi? A set of N relevant and irrelevant samples:

                Rel. doc.    Irrel. doc.         Doc.
with ti         ri           ni - ri             ni
without ti      Ri - ri      N - Ri - ni + ri    N - ni
total           Ri           N - Ri              N

Prob. model (cont’d)
Smoothing (Robertson-Sparck-Jones formula)
When no sample is available: pi = 0.5, qi = (ni + 0.5)/(N + 0.5) ≈ ni/N. May be implemented as VSM.
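For concreteness, here is a sketch of a smoothed term weight in the notation of the sample counts (ri, ni, Ri, N). It uses the commonly cited form of the Robertson-Sparck-Jones weight with 0.5 added to each cell; this is an assumption about which variant the slide intended, and the function name is illustrative.

```python
import math

def rsj_weight(r, n, R, N):
    """Smoothed Robertson-Sparck-Jones term weight (commonly cited form).

    r: relevant samples containing the term
    n: samples containing the term
    R: relevant samples
    N: all samples
    """
    numerator = (r + 0.5) * (N - R - n + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)

# A term concentrated in the relevant samples gets a positive weight.
w_good = rsj_weight(r=8, n=10, R=10, N=100)
# A frequent term absent from the relevant samples gets a negative weight.
w_bad = rsj_weight(r=0, n=50, R=10, N=100)
# With no relevance samples at all (R = r = 0) the weight is IDF-like.
w_idf_like = rsj_weight(r=0, n=10, R=0, N=100)
print(w_good, w_bad, w_idf_like)
```

The last case illustrates why the model may be implemented as a VSM: without relevance information the weight depends only on N and ni, much like IDF.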

An Appraisal of Probabilistic Models
Among the oldest formal models in IR Maron & Kuhns, 1960: Since an IR system cannot predict with certainty which document is relevant, we should deal with probabilities Assumptions for getting reasonable approximations of the needed probabilities: Boolean representation of documents/queries/relevance Term independence Out-of-query terms do not affect retrieval Document relevance values are independent

An Appraisal of Probabilistic Models
The difference between ‘vector space’ and ‘probabilistic’ IR is not that great: In either case you build an information retrieval scheme in the exact same way. Difference: for probabilistic IR, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory

Language-modeling Approach
The query is a random sample from a “perfect” document. Words are “sampled” independently of each other. Rank documents by the probability of generating the query. [Figure: a document D and a four-word query; P(query | D) is the product of the four per-word probabilities = 4/9 * 2/9 * 4/9 * 3/9.]

Naive Bayes and LM generative models
Naive Bayes: We want to classify document d. LM: We want to classify a query q.
Naive Bayes: Classes: geographical regions like China, UK, Kenya. LM: Each document in the collection is a different class.
Naive Bayes: Assume that d was generated by the generative model. LM: Assume that q was generated by a generative model.
Naive Bayes: Key question: Which of the classes is most likely to have generated the document? LM: Which document (=class) is most likely to have generated the query q?
Naive Bayes: Or: for which class do we have the most evidence? LM: For which document (as the source of the query) do we have the most evidence?

Using language models (LMs) for IR
LM = language model We view the document as a generative model that generates the query. What we need to do: Define the precise generative model we want to use Estimate parameters (different parameters for each document’s model) Smooth to avoid zeros Apply to query and find document most likely to have generated the query Present most likely document(s) to user Note that x – y is pretty much what we did in Naive Bayes.

What is a language model?
We can view a finite state automaton as a deterministic language model. I wish I wish I wish I wish Cannot generate: “wish I wish” or “I wish I”. Our basic model: each document was generated by a different automaton like this except that these automata are probabilistic. 59

A probabilistic language model
This is a one-state probabilistic finite-state automaton (a unigram language model) and the state emission distribution for its one state q1. STOP is not a word, but a special symbol indicating that the automaton stops. For the string “frog said that toad likes frog STOP”: P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12

A different language model for each document
String: frog said that toad likes frog STOP
P(string|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12
P(string|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 12 · 10^-12
P(string|Md1) < P(string|Md2). Thus, document d2 is “more relevant” to the string “frog said that toad likes frog STOP” than d1 is.

Using language models in IR
Each document is treated as (the basis for) a language model. Given a query q, rank documents based on P(d|q) = P(q|d) P(d) / P(q). P(q) is the same for all documents, so ignore it. P(d) is the prior, often treated as the same for all d. But we can give a higher prior to “high-quality” documents, e.g., those with high PageRank. P(q|d) is the probability of q given d. So, when the prior is the same for all d, ranking documents by relevance to q according to P(q|d) and according to P(d|q) is equivalent.

Where we are In the LM approach to IR, we attempt to model the query generation process. Then we rank documents by the probability that a query would be observed as a random sample from the respective document model. That is, we rank according to P(q|d). Next: how do we compute P(q|d)?

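How to compute P(q|d) We will make the same conditional independence assumption as for Naive Bayes: $P(q|M_d) = \prod_{1 \le k \le |q|} P(t_k|M_d)$ (|q|: length of q; $t_k$: the token occurring at position k in q). This is equivalent to: $P(q|M_d) = \prod_{\text{distinct term } t \text{ in } q} P(t|M_d)^{tf_{t,q}}$, where $tf_{t,q}$ is the term frequency (# occurrences) of t in q. Multinomial model (omitting constant factor).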

Parameter estimation Missing piece: Where do the parameters P(t|Md) come from? Start with maximum likelihood estimates (as we did for Naive Bayes): $\hat{P}(t|M_d) = tf_{t,d} / |d|$ (|d|: length of d; $tf_{t,d}$: # occurrences of t in d). As in Naive Bayes, we have a problem with zeros. A single t with P(t|Md) = 0 will make P(q|Md) zero. We would give a single term “veto power”. For example, for the query [Michael Jackson top hits], a document about “top songs” (but not using the word “hits”) would have P(q|Md) = 0. That’s bad. We need to smooth the estimates to avoid zeros.

Smoothing Key intuition: A nonoccurring term is possible (even though it didn’t occur), but no more likely than would be expected by chance in the collection. Notation: Mc: the collection model; $cf_t$: the number of occurrences of t in the collection; $T = \sum_t cf_t$: the total number of tokens in the collection. $\hat{P}(t|M_c) = cf_t / T$. We will use $\hat{P}(t|M_c)$ to “smooth” P(t|d) away from zero.

Mixture model P(t|d) = λP(t|Md) + (1 - λ)P(t|Mc)
Mixes the probability from the document with the general collection frequency of the word. High value of λ: “conjunctive-like” search, which tends to retrieve documents containing all query words. Low value of λ: more disjunctive, suitable for long queries. Correctly setting λ is very important for good performance.

Mixture model: Summary
$P(q|d) \propto \prod_{1 \le k \le |q|} (\lambda P(t_k|M_d) + (1 - \lambda) P(t_k|M_c))$. What we model: The user has a document in mind and generates the query from this document. The equation represents the probability that the document that the user had in mind was in fact this one.

Example Collection: d1 and d2
d1: Jackson was one of the most talented entertainers of all time
d2: Michael Jackson anointed himself King of Pop
Query q: Michael Jackson
Use mixture model with λ = 1/2
P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
Ranking: d2 > d1
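The numbers in this example can be reproduced with a small script. This is a minimal sketch of the mixture model with λ = 1/2; tokenization is a plain whitespace split and matching is case-sensitive, which is an assumption that happens to suit these two documents.

```python
from collections import Counter

def mixture_score(query, doc, collection, lam=0.5):
    """P(q|d) under the mixture of document model and collection model."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    p = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)                # P(t|Md), maximum likelihood
        p_coll = coll_tf[t] / len(collection)       # P(t|Mc), collection model
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
collection = d1 + d2          # 11 + 7 = 18 tokens
query = ["Michael", "Jackson"]

p1 = mixture_score(query, d1, collection)
p2 = mixture_score(query, d2, collection)
print(round(p1, 3), round(p2, 3))
```

Because of the collection term, d1 still receives a nonzero score although it never mentions “Michael”; without smoothing its score would be zero.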

Exercise: Compute ranking
Collection: d1 and d2
d1: Xerox reports a profit but revenue is down
d2: Lucene narrows quarter loss but revenue decreases further
Query q: revenue down
Use mixture model with λ = 1/2
P(q|d1) = [(1/8 + 2/16)/2] · [(1/8 + 1/16)/2] = 1/8 · 3/32 = 3/256
P(q|d2) = [(1/8 + 2/16)/2] · [(0/8 + 1/16)/2] = 1/8 · 1/32 = 1/256
Ranking: d1 > d2

LMs vs. vector space model (1)
LMs have some things in common with vector space models. Term frequency is directly in the model. But it is not scaled in LMs. Probabilities are inherently “length-normalized”. Cosine normalization does something similar for vector space. Mixing document and collection frequencies has an effect similar to idf. Terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.

LMs vs. vector space model (2)
LMs vs. vector space model: commonalities. Term frequency is directly in the model. Probabilities are inherently “length-normalized”. Mixing document and collection frequencies has an effect similar to idf.
LMs vs. vector space model: differences. LMs: based on probability theory. Vector space: based on similarity, a geometric/linear algebra notion. Collection frequency vs. document frequency. Details of term frequency, length normalization etc.

Language models for IR: Assumptions
Simplifying assumption: Queries and documents are objects of the same type. Not true! There are other LMs for IR that do not make this assumption. The vector space model makes the same assumption.
Simplifying assumption: Terms are conditionally independent. Again, the vector space model (and Naive Bayes) makes the same assumption.
Cleaner statement of assumptions than vector space. Thus, better theoretical foundation than vector space … but “pure” LMs perform much worse than “tuned” LMs.

Relevance Using Hyperlinks
Number of documents relevant to a query can be enormous if only term frequencies are taken into account Using term frequencies makes “spamming” easy E.g., a travel agency can add many occurrences of the words “travel” to its page to make its rank very high Most of the time people are looking for pages from popular sites Idea: use popularity of Web site (e.g., how many people visit it) to rank site pages that match given keywords Problem: hard to find actual popularity of site Solution: next slide

Relevance Using Hyperlinks (Cont.)
Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site Count only one hyperlink from each site (why? - see previous slide) Popularity measure is for site, not for individual page But, most hyperlinks are to root of site Also, concept of “site” difficult to define since a URL prefix like cs.yale.edu contains many unrelated pages of varying popularity Refinements When computing prestige based on links to a site, give more weight to links from sites that themselves have higher prestige Definition is circular Set up and solve system of simultaneous linear equations Above idea is basis of the Google PageRank ranking mechanism

PageRank in Google To be simple or to be useful?

PageRank in Google (Cont’)
Assign a numeric value to each page. The more a page is referred to by important pages, the more important this page is. d: damping factor (0.85). Many other criteria: e.g. proximity of query words: “…information retrieval …” is better than “… information … retrieval …”.
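The circular definition of importance can be solved by simple iteration. The sketch below is an assumption-laden illustration, not Google's actual implementation: it uses the normalized textbook update PR(p) = (1 - d)/N + d · Σ PR(i)/C(i), summed over the pages i that link to p, where C(i) is the number of outgoing links of i; the four-page link graph is hypothetical and every page in it has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterative PageRank on a dict mapping each page to its outgoing links.

    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start from a uniform score
    for _ in range(iterations):
        new_rank = {p: (1 - d) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)    # score is split over outlinks
            for target in outlinks:
                new_rank[target] += d * share
        rank = new_rank
    return rank

# Hypothetical link graph: C is referred to by A, B and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

Page D, which nobody links to, keeps only the baseline (1 - d)/N, while C collects score from three pages and passes most of it on to A.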

Relevance Using Hyperlinks (Cont.)
Connections to social networking theories that ranked prestige of people E.g., the president of the U.S.A has a high prestige since many people know him Someone known by multiple prestigious people has high prestige Hub and authority based ranking A hub is a page that stores links to many pages (on a topic) An authority is a page that contains actual information on a topic Each page gets a hub prestige based on prestige of authorities that it points to Each page gets an authority prestige based on prestige of hubs that point to it Again, prestige definitions are cyclic, and can be got by solving linear equations Use authority prestige when ranking answers to a query

HITS: Hubs and authorities

HITS update rules A: link matrix h: vector of hub scores
a: vector of authority scores
HITS algorithm: Compute h = Aa. Compute a = A^T h. Iterate until convergence. Output (i) list of hubs ranked according to hub score and (ii) list of authorities ranked according to authority score.
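The two update rules can be iterated directly on a small link matrix. This is a minimal sketch in plain Python, assuming A[i][j] = 1 when page i links to page j; the normalization after each round is an added assumption to keep the scores bounded, and the four-page matrix is hypothetical.

```python
import math

def hits(A, iterations=50):
    """Iterate h = A a and a = A^T h on an n x n link matrix A."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # h = A a: a page is a good hub if it links to good authorities
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        # a = A^T h: a page is a good authority if good hubs link to it
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        # Normalize to unit length so the scores do not grow without bound.
        h_norm = math.sqrt(sum(x * x for x in h)) or 1.0
        a_norm = math.sqrt(sum(x * x for x in a)) or 1.0
        h = [x / h_norm for x in h]
        a = [x / a_norm for x in a]
    return h, a

# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hub, authority = hits(A)
best_hub = max(range(len(hub)), key=lambda i: hub[i])
best_authority = max(range(len(authority)), key=lambda i: authority[i])
print(best_hub, best_authority)
```

Here page 0 comes out as the best hub and page 2 as the best authority, since page 2 is pointed to by both hubs.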

Keyword Search The simplest notion of relevance is that the query string appears verbatim in the document. A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).

Problems with Keywords
May not retrieve relevant documents that include synonymous terms: “restaurant” vs. “café”, “PRC” vs. “China”. May retrieve irrelevant documents that include ambiguous terms: “bat” (baseball vs. mammal), “Apple” (company vs. fruit), “bit” (unit of data vs. act of eating).

Query Expansion Most errors caused by vocabulary mismatch query: “cars”, document: “automobiles” solution: automatically add highly-related words Thesaurus / WordNet lookup: add semantically-related words (synonyms) cannot take context into account: “rail car” vs. “race car” vs. “car and cdr” Statistical Expansion: add statistically-related words (co-occurrence) very successful

Indri Query Examples <parameters><query>#combine( #weight( #1(explosion) #1(blast) #1(wounded) #1(injured) #1(death) #1(deaths)) #weight( #1(Davao City international airport) #1(Tuesday) #1(DAVAO) #1(Philippines) #1(DXDC) #1(Davao Medical Center)))</query></parameters>

Synonyms and Homonyms
Synonyms: E.g., document: “motorcycle repair”, query: “motorcycle maintenance”. Need to realize that “maintenance” and “repair” are synonyms. System can extend the query as “motorcycle and (repair or maintenance)”.
Homonyms: E.g., “object” has different meanings as noun/verb. Can disambiguate meanings (to some extent) from the context.
Extending queries automatically using synonyms can be problematic. Need to understand the intended meaning in order to infer synonyms, or verify synonyms with the user. Synonyms may have other meanings as well.

Concept-Based Querying
Approach For each word, determine the concept it represents from context Use one or more ontologies: Hierarchical structure showing relationship between concepts E.g., the ISA relationship that we saw in the E-R model This approach can be used to standardize terminology in a specific field Ontologies can link multiple languages Foundation of the Semantic Web (not covered here)

Indexing of Documents An inverted index maps each keyword Ki to a set of documents Si that contain the keyword. Documents are identified by identifiers. The inverted index may record keyword locations within the document, to allow proximity-based ranking, and counts of the number of occurrences of the keyword, to compute TF. “and” operation: finds documents that contain all of K1, K2, ..., Kn: the intersection S1 ∩ S2 ∩ ... ∩ Sn. “or” operation: documents that contain at least one of K1, K2, …, Kn: the union S1 ∪ S2 ∪ ... ∪ Sn. Each Si is kept sorted to allow efficient intersection/union by merging. “not” can also be efficiently implemented by merging of sorted lists.

Indexing of Documents Goal = Find the important meanings and create an internal representation Factors to consider: Accuracy to represent meanings (semantics) Exhaustiveness (cover all the contents) Facility for computer to manipulate What is the best representation of contents? Char. string (char trigrams): not precise enough Word: good coverage, not precise Phrase: poor coverage, more precise Concept: poor coverage, precise Coverage (Recall) Accuracy (Precision) String Word Phrase Concept

Indexer steps Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Multiple term entries in a single document are merged.
Frequency information is added.
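The indexer steps can be sketched on the two short example documents. This is a minimal sketch, assuming a simple normalization (lowercasing and keeping only letters and apostrophes) that may differ from the “modified tokens” of the original slides; the function names are illustrative.

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    """Assumed normalization: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(docs):
    """Return {term: {doc_id: term frequency}}, sorted by term and by doc ID."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Multiple entries of a term in one document are merged into a count.
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return {term: dict(sorted(postings.items()))
            for term, postings in sorted(index.items())}

index = build_index(docs)
print(index["caesar"], index["killed"], index["brutus"])
```

Each dictionary entry now carries a postings list of document IDs together with the frequency information for that term.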

An example

Stopwords / Stoplist function words do not bear useful information for IR of, in, about, with, I, although, … Stoplist: contain stopwords, not to be used as index Prepositions Articles Pronouns Some adverbs and adjectives Some frequent words (e.g. document) The removal of stopwords usually improves IR effectiveness A few “standard” stoplists are commonly used.

Stemming Reason: Stemming:
Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them Stemming: Removing some endings of word computer compute computes computing computed computation comput

Lemmatization: transform to a standard form according to syntactic category. E.g. verb + ing → verb, noun + s → noun. Needs POS tagging. More accurate than stemming, but needs more resources. It is crucial to choose the stemming/lemmatization rules: noise vs. recognition rate, a compromise between precision and recall. Light/no stemming: -recall, +precision. Severe stemming: +recall, -precision.

Simple conjunctive query (two terms)
Consider the query: BRUTUS AND CALPURNIA. To find all matching documents using the inverted index: Locate BRUTUS in the dictionary. Retrieve its postings list from the postings file. Locate CALPURNIA in the dictionary. Retrieve its postings list from the postings file. Intersect the two postings lists. Return the intersection to the user.

Intersecting two posting lists
This is linear in the length of the postings lists. Note: This only works if postings lists are sorted.
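The linear-time intersection can be sketched as a merge of two sorted lists. This is a minimal sketch; the document IDs in the two postings lists are illustrative values, not taken from a real collection.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # same document in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance the pointer with the smaller ID
            i += 1
        else:
            j += 1
    return answer

# Illustrative postings lists (sorted document IDs).
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths; on unsorted lists the pointer-advancing rule would skip matches.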

Does Google use the Boolean model?
On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn. Cases where you get hits that do not contain one of the wi: anchor text; page contains a variant of wi (morphology, spelling correction, synonym); long queries (n large); the Boolean expression generates very few hits. Simple Boolean vs. ranking of the result set: Simple Boolean retrieval returns matching documents in no particular order. Google (and most well-designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.

IR Evaluation Efficiency: time, space Effectiveness:
How well can a system retrieve relevant documents? Is one system better than another? Metrics often used (together): Precision = retrieved relevant docs / retrieved docs. Recall = retrieved relevant docs / relevant docs. [Figure: overlap of the set of relevant documents and the set of retrieved documents.]

IR Evaluation (Cont’) Information-retrieval systems save space by using index structures that support only approximate retrieval. May result in: false negative (false drop) - some relevant documents may not be retrieved. false positive - some irrelevant documents may be retrieved. For many applications a good index should not permit any false drops, but may permit a few false positives. Relevant performance metrics: precision - what percentage of the retrieved documents are relevant to the query. recall - what percentage of the documents relevant to the query were retrieved.

IR Evaluation (Cont’) Recall vs. precision tradeoff:
Can increase recall by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would be fetched, reducing precision Measures of retrieval effectiveness: Recall as a function of number of documents fetched, or Precision as a function of recall Equivalently, as a function of number of documents fetched E.g., “precision of 75% at recall of 50%, and 60% at a recall of 75%” Problem: which documents are actually relevant, and which are not

General form of precision/recall
Precision change w.r.t. Recall (not a fixed point) Systems cannot compare at one Precision/Recall point Average precision (on 11 points of recall: 0.0, 0.1, …, 1.0)

An illustration of P/R calculation
List Rel? Doc1 Y Doc2 Doc3 Doc4 Doc5 … Assume: 5 relevant docs.

MAP (Mean Average Precision)
$MAP = \frac{1}{n} \sum_{Q_i} \frac{1}{|R_i|} \sum_{D_j \in R_i} \frac{j}{r_{ij}}$, where rij = rank of the j-th relevant document for Qi, |Ri| = #rel. doc. for Qi, n = # test queries. E.g. Rank: 1st rel. doc. 5 8 2nd rel. doc. 10 3rd rel. doc.
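A short sketch may help to see how MAP is computed from the ranks of the relevant documents. The ranks used below are hypothetical values chosen for illustration: two test queries with three relevant documents each, where the second query retrieves only two of them.

```python
def average_precision(relevant_ranks, num_relevant):
    """Average of j / r_j over the retrieved relevant documents,
    divided by the total number of relevant documents for the query."""
    ranks = sorted(relevant_ranks)
    return sum((j + 1) / r for j, r in enumerate(ranks)) / num_relevant

def mean_average_precision(queries):
    """queries: list of (ranks of retrieved relevant docs, #relevant docs)."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)

# Hypothetical results for two test queries.
queries = [
    ([1, 5, 10], 3),   # relevant documents found at ranks 1, 5 and 10
    ([4, 8], 3),       # only two of the three relevant documents retrieved
]
ap1 = average_precision(*queries[0])
ap2 = average_precision(*queries[1])
map_score = mean_average_precision(queries)
print(round(ap1, 3), round(ap2, 3), round(map_score, 3))
```

A relevant document that is never retrieved contributes zero to the sum but still counts in the denominator, so missing documents lower the score.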

Some other measures Noise = retrieved irrelevant docs / retrieved docs
Silence = non-retrieved relevant docs / relevant docs Noise = 1 – Precision; Silence = 1 – Recall Fallout = retrieved irrel. docs / irrel. docs Single value measures: F-measure = 2 P * R / (P + R) Average precision = average at 11 points of recall Precision at n document (often used for Web IR) Expected search length (no. irrelevant documents to read before obtaining n relevant doc.)
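The set-based measures above follow directly from the retrieved and relevant document sets. This is a minimal sketch; the document identifiers are hypothetical and the function name is illustrative.

```python
def evaluate(retrieved, relevant):
    """Set-based effectiveness measures for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "noise": 1 - precision,     # retrieved irrelevant docs / retrieved docs
        "silence": 1 - recall,      # non-retrieved relevant docs / relevant docs
    }

# Hypothetical query: 4 documents retrieved, 5 documents relevant, 2 in common.
result = evaluate(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6", "d7"])
print(result)
```

The F-measure is the harmonic mean of precision and recall, so it stays low unless both are reasonably high.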

Interactive system’s evaluation
Definition: Evaluation = the process of systematically collecting data that informs us about what it is like for a particular user or group of users to use a product/system for a particular task in a certain type of environment. Most of this is typically taught in HCI or Human Factors courses.

Problems Attitudes: Designers assume that if they and their colleagues can use the system and find it attractive, others will too Features vs. usability or security Executives want the product on the market yesterday Problems “can” be addressed in versions 1.x Consumers accept low levels of usability “I’m so silly” The photocopier story

Two main types of evaluation
Formative evaluation is done at different stages of development to check that the product meets users’ needs. Part of the user-centered design approach Supports design decisions at various stages May test parts of the system or alternative designs Summative evaluation assesses the quality of a finished product. May test the usability or the output quality May compare competing systems

What to evaluate Iterative design & evaluation is a continuous process that examines: Early ideas for conceptual model Early prototypes of the new system Later, more complete prototypes Designers need to check that they understand users’ requirements and that the design assumptions hold.

Four evaluation paradigms
‘quick and dirty’ usability testing field studies predictive evaluation

Quick and dirty ‘quick & dirty’ evaluation describes the common practice in which designers informally get feedback from users or consultants to confirm that their ideas are in-line with users’ needs and are liked. Quick & dirty evaluations are done any time. The emphasis is on fast input to the design process rather than carefully documented findings.

Usability testing Usability testing involves recording typical users’ performance on typical tasks in controlled settings. Field observations may also be used. As the users perform these tasks they are watched & recorded on video & their key presses are logged. This data is used to calculate performance times, identify errors & help explain why the users did what they did. User satisfaction questionnaires & interviews are used to elicit users’ opinions.

Usability testing It is very time consuming to conduct and analyze
Explain the system, do some training Explain the task, do a mock task Questionnaires before and after the test & after each task Pilot test is usually needed Insufficient number of subjects for ‘proper’ statistical analysis In laboratory conditions, subjects do not behave exactly like in a normal environment

Field studies
Field studies are done in natural settings. The aim is to understand what users do naturally and how technology impacts them. In product design, field studies can be used to:
- identify opportunities for new technology
- determine design requirements
- decide how best to introduce new technology
- evaluate technology in use

Predictive evaluation
Experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems. Another approach involves theoretically based models. A key feature of predictive evaluation is that users need not be present. It is relatively quick & inexpensive.

The TREC experiments
Once per year:
- A set of documents and queries is distributed to the participants; the standard answers are unknown (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit the answers (1000 per query) at the deadline (July)
- NIST people manually evaluate the answers and provide correct answers (and classification of IR systems) (July – August)
- TREC conference (November)

TREC evaluation methodology
- Known document collection (>100K) and query set (50)
- Submission of 1000 documents for each query by each participant
- Merge the first 100 documents of each participant -> global pool
- Human relevance judgment of the global pool
- The other documents are assumed to be irrelevant
- Evaluation of each system (with 1000 answers)
- Partial relevance judgments, but stable for system ranking
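The pooling steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the run format (system name -> query id -> ranked list of document ids), the function names, and the toy judgments are hypothetical and are not the actual TREC tooling.

```python
def build_pool(runs, depth=100):
    """Merge the top-`depth` documents of every run into one pool per query."""
    pool = {}
    for run in runs.values():
        for query_id, ranked_docs in run.items():
            pool.setdefault(query_id, set()).update(ranked_docs[:depth])
    return pool


def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k documents judged relevant.

    Documents outside the judged pool are simply absent from `relevant`,
    so they count as irrelevant, as the methodology assumes.
    """
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k


# Toy data (hypothetical): two systems, one query, pool depth 2 instead of 100.
runs = {
    "sysA": {"q1": ["d1", "d2", "d3", "d4"]},
    "sysB": {"q1": ["d3", "d5", "d1", "d6"]},
}
pool = build_pool(runs, depth=2)
qrels = {"q1": {"d1", "d5"}}  # assessors judged only the pooled documents
```

Only the pooled documents are ever shown to the assessors, which is why the judgments are partial; each system is still scored on its full submitted ranking.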

Tracks (tasks)
- Ad Hoc track: given document collection, different topics
- Routing (filtering): stable interests (user profile), incoming document flow
- CLIR: Ad Hoc, but with queries in a different language
- Web: a large set of Web pages
- Question-Answering: When did Nixon visit China?
- Interactive: put users into action with system
- Spoken document retrieval
- Image and video retrieval
- Information tracking: new topic / follow up

CLEF and NTCIR
CLEF = Cross-Language Evaluation Forum
- For European languages
- Organized by Europeans
- Once per year (March – Oct.)
NTCIR:
- Organized by NII (Japan)
- For Asian languages
- Cycle of 1.5 years

Impact of TREC Provide large collections for further experiments
Compare different systems/techniques on realistic data Develop new methodology for system evaluation Similar experiments are organized in other areas (NLP, Machine translation, Summarization, …)

IR on the Web No stable document collection (spider, crawler)
Invalid document, duplication, etc. Huge number of documents (partial collection) Multimedia documents Great variation of document quality Multilingual problem …

Web Search Application of IR to HTML documents on the World Wide Web.
Differences: Must assemble document corpus by spidering the web. Can exploit the structural layout information in HTML (XML). Documents change uncontrollably. Can exploit the link structure of the web.

Web Search System
[Figure: a Web spider crawls the Web to build the document corpus; the IR system takes a query string and the corpus and returns ranked documents (1. Page1, 2. Page2, 3. Page3, …).]

Challenges
- Scale, distribution of documents
- Controversy over the unit of indexing: What is a document? (hypertext) What does the user expect to be retrieved?
- High heterogeneity: document structure, size, quality, level of abstraction / specialization; user search or domain expertise, expectations
- Retrieval strategies: What do people want?
- Evaluation

Web documents / data
No traditional collection
- Huge: time and space to crawl, index; IRSs cannot store copies of documents
- Dynamic, volatile, anarchic, un-controlled
- Homogeneous sub-collections
Structure
- In documents (un-/semi-/fully-structured)
- Between docs: network of inter-connected nodes
- Hyper-links: conceptual vs. physical documents

Web documents / data
Mark-up
- HTML – look & feel
- XML – structure, semantics
- Dublin Core Metadata
- Can webpage authors be trusted to correctly mark-up / index their pages?
Multi-lingual documents
Multi-media

Theoretical models for indexing / searching
Content-based weighting
- As in traditional IRS, but trying to incorporate hyperlinks and the dynamic nature of the Web (page validity, page caching)
Link-based weighting
- Quality of webpages
- Hubs & authorities
- Bookmarked pages
- Iterative estimation of quality
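The “hubs & authorities” idea with iterative estimation of quality can be illustrated with a small sketch in the style of Kleinberg's HITS algorithm. The slide does not name a specific algorithm, so treat this as one possible reading; the graph format (page -> list of pages it links to), the function name, and the iteration count are assumptions.

```python
def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict mapping page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = dict.fromkeys(pages, 1.0)
    authority = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # A page is a good authority if good hubs point to it.
        authority = dict.fromkeys(pages, 0.0)
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]
        # A page is a good hub if it points to good authorities.
        hub = {p: sum(authority[t] for t in links.get(p, ())) for p in pages}
        # Normalize (Euclidean norm) so scores stay bounded across iterations.
        for scores in (hub, authority):
            norm = sum(v * v for v in scores.values()) ** 0.5
            if norm > 0:
                for p in scores:
                    scores[p] /= norm
    return hub, authority


# Toy graph (hypothetical): pages a and b both link to c.
links = {"a": ["c"], "b": ["c"], "c": []}
hub, authority = hits(links)
```

In the toy graph, c ends up as the only authority and a and b share the hub score, which is the mutual-reinforcement behaviour the slide alludes to.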

Architecture
Centralized
- Main server contains the index, built by an indexer, searched by a query engine
- Advantage: control, easy update
- Disadvantage: system requirements (memory, disk, safety/recovery)
Distributed
- Brokers & gatherers
- Advantage: flexibility, load balancing, redundancy
- Disadvantage: software complexity, update

User variability
Power and flexibility for expert users vs. intuitiveness and ease of use for novice users
Multi-modal user interface
- Distinguish between experts and beginners, offer distinct interfaces (functionality)
- Advantage: can make assumptions on users
- Disadvantage: habit formation, cognitive shift
Uni-modal interface
- Make essential functionality obvious
- Make advanced functionality accessible

Search strategies
- Web directories
- Query-based searching
- Link-based browsing (provided by the browser, not the IRS)
- “More like this”
- Known site (bookmarking)
- A combination of the above

Support for Relevance Feedback
- RF can improve search effectiveness … but is rarely used
- Voluntary vs. forced feedback
- At document vs. word level
- “Magic” vs. control

Some techniques to improve IR effectiveness
Interaction with user (relevance feedback)
- Keywords only cover part of the contents
- User can help by indicating relevant/irrelevant documents
The use of relevance feedback
- To improve query expression: Qnew = α·Qold + β·Rel_d − γ·NRel_d
  where Rel_d = centroid of relevant documents, NRel_d = centroid of non-relevant documents, and α, β, γ are weighting coefficients
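The query update above is the Rocchio-style formula Qnew = α·Qold + β·Rel_d − γ·NRel_d. Here is a minimal sketch of it over sparse term-weight vectors. The dictionary representation, the default coefficient values, and the choice to drop terms whose weight becomes non-positive are assumptions made for illustration, not part of the slide.

```python
def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts)."""
    if not vectors:
        return {}
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Qnew = alpha*Qold + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    rel_c = centroid(relevant)
    nrel_c = centroid(non_relevant)
    new_query = {}
    for term in set(query) | set(rel_c) | set(nrel_c):
        weight = (alpha * query.get(term, 0.0)
                  + beta * rel_c.get(term, 0.0)
                  - gamma * nrel_c.get(term, 0.0))
        if weight > 0:  # assumption: negative weights are discarded
            new_query[term] = weight
    return new_query


# Toy example (hypothetical vectors): the user marks two documents as
# relevant and one as non-relevant for the query "jaguar".
new_q = rocchio(
    {"jaguar": 1.0},
    relevant=[{"jaguar": 1.0, "cat": 1.0}, {"cat": 1.0}],
    non_relevant=[{"car": 1.0}],
)
```

The new query gains the term cat from the relevant documents, while car, which only appears in the non-relevant document, is pushed out.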

Modified relevance feedback
Users usually do not cooperate (e.g. AltaVista in early years)
Pseudo-relevance feedback (Blind RF)
- Using the top-ranked documents as if they are relevant: select m terms from the n top-ranked documents
- One can usually obtain about 10% improvement
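Selecting m terms from the n top-ranked documents could look like the sketch below. The slide does not say how the terms are scored; raw frequency across the top documents is used here purely as an assumption, and the function name and toy documents are hypothetical.

```python
from collections import Counter


def expansion_terms(ranked_docs, query_terms, n=10, m=5):
    """Pick the m most frequent non-query terms from the n top-ranked documents.

    `ranked_docs` is a list of token lists, best document first.
    """
    counts = Counter()
    for tokens in ranked_docs[:n]:
        counts.update(t for t in tokens if t not in query_terms)
    return [term for term, _ in counts.most_common(m)]


# Toy ranking (hypothetical): the fourth document is outside the top n = 3.
ranked = [
    ["dirichlet", "smoothing", "prior"],
    ["smoothing", "prior", "language"],
    ["prior", "language", "model"],
    ["unrelated"],
]
extra = expansion_terms(ranked, {"smoothing"}, n=3, m=2)
```

The selected terms would then be appended to the original query, possibly with lower weights, before a second retrieval pass.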

Term clustering
Based on ‘similarity’ between terms
- Collocation in documents, paragraphs, sentences
Based on document clustering
- Terms specific for bottom-level document clusters are assumed to represent a topic
Use
- Thesauri
- Query expansion

User modelling
Build a model / profile of the user by recording
- the ‘context’
- topics of interest
- preferences
based on interpreting his/her actions:
- Implicit or explicit relevance feedback
- Recommendations from ‘peers’
Customization of the environment

Personalised systems
Information filtering
- Ex: in a TV guide only show programs of interest
Use user model to disambiguate queries
- Query expansion
Update the model continuously
Customize the functionality and the look-and-feel of the system
- Ex: skins; remember the levels of the user interface

Autonomous agents
Purpose: find relevant information on behalf of the user
Input: the user profile
Output: pull vs. push
Positive aspects:
- Can work in the background, implicitly
- Can update the master with new, relevant info
Negative aspects: control
Integration with collaborative systems

Document Representation
<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes …
<h1>Agriculture</h1> …
<h1>Chemistry</h1> …
<h1>Computer Science</h1> …
<h1>Electrical Engineering</h1> …
…
<h1>Zoology</h1>
</body>
</html>
<title> context: <title>department descriptions</title>
<title> extents: 1. department descriptions
<body> context: <body>the following list describes … <h1>agriculture</h1> … </body>
<body> extents: 1. the following list describes <h1>agriculture </h1> …
<h1> context: <h1>agriculture</h1> <h1>chemistry</h1>… <h1>zoology</h1>
<h1> extents: 1. agriculture 2. chemistry … 36. zoology
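To make the context / extent distinction concrete, here is a small sketch that collects the extents of a few fields from an HTML document using Python's standard `html.parser`. It is not how INDRI parses documents; the class name is hypothetical, and unlike the slide's <body> extent it strips nested markup and keeps only the text.

```python
from html.parser import HTMLParser


class ExtentCollector(HTMLParser):
    """Collect, per field (tag name), the lowercased text of each extent."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.open_fields = []                    # stack of (tag, text chunks)
        self.extents = {field: [] for field in self.fields}

    def handle_starttag(self, tag, attrs):
        if tag in self.fields:
            self.open_fields.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every field that is currently open (nesting).
        for _, chunks in self.open_fields:
            chunks.append(data)

    def handle_endtag(self, tag):
        if self.open_fields and self.open_fields[-1][0] == tag:
            _, chunks = self.open_fields.pop()
            text = " ".join(" ".join(chunks).lower().split())
            self.extents[tag].append(text)


page = ("<html><head><title>Department Descriptions</title></head>"
        "<body>The following list describes "
        "<h1>Agriculture</h1> <h1>Chemistry</h1></body></html>")
collector = ExtentCollector(["title", "body", "h1"])
collector.feed(page)
```

Each field ends up with a numbered list of extents, and a context language model can then be estimated from all extents of one field taken together.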

Model
Based on original inference network retrieval framework [Turtle and Croft ’91]
Casts retrieval as inference in simple graphical model
Extensions made to original model
- Incorporation of probabilities based on language modeling rather than tf.idf
- Multiple language models allowed in the network (one per indexed context)

Model
[Figure: inference network. The model hyperparameters α,βtitle, α,βbody, α,βh1 (observed) and the document node D (observed) feed the context language models θtitle, θbody, θh1. Each context language model feeds representation nodes r1 … rN (terms, phrases, etc.). These feed the belief nodes q1, q2 (#combine, #not, #max), which feed the information need node I (itself a belief node).]

Model
[Figure: inference network diagram, repeated.]

P( r | θ )
Probability of observing a term, phrase, or “concept” given a context language model
- ri nodes are binary
Assume r ~ Bernoulli( θ )
- “Model B” – [Metzler, Lavrenko, Croft ’04]
Nearly any model may be used here
- tf.idf-based estimates (INQUERY)
- Mixture models

Model
[Figure: inference network diagram, repeated.]

P( θ | α, β, D )
Prior over context language model determined by α, β
Assume P( θ | α, β ) ~ Beta( α, β )
- Bernoulli’s conjugate prior
- αw = μP( w | C ) + 1
- βw = μP( ¬ w | C ) + 1
- μ is a free parameter
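Because the Beta prior is conjugate to the Bernoulli, observing a document just adds counts to α and β. The sketch below works this through under two assumptions that the slide does not state: the document contributes tf occurrences of w out of |D| term positions, and the point estimate is the posterior mode. With P( ¬w | C ) = 1 − P( w | C ), the mode reduces to the familiar smoothed estimate (tf + μP(w|C)) / (|D| + μ). Function and parameter names are hypothetical.

```python
def beta_hyperparameters(p_w_collection, mu):
    """alpha_w = mu*P(w|C) + 1 and beta_w = mu*P(not w|C) + 1, as on the slide."""
    alpha = mu * p_w_collection + 1
    beta = mu * (1 - p_w_collection) + 1
    return alpha, beta


def posterior_mode(tf, doc_len, p_w_collection, mu):
    """Mode of the Beta posterior after tf successes in doc_len Bernoulli trials."""
    alpha, beta = beta_hyperparameters(p_w_collection, mu)
    a = alpha + tf                    # prior pseudo-count plus occurrences of w
    b = beta + (doc_len - tf)         # prior pseudo-count plus non-occurrences
    return (a - 1) / (a + b - 2)      # mode of Beta(a, b), valid for a, b > 1


def smoothed_estimate(tf, doc_len, p_w_collection, mu):
    """Closed form that the posterior mode simplifies to."""
    return (tf + mu * p_w_collection) / (doc_len + mu)


# Toy numbers (hypothetical): w occurs 3 times in a 100-word document and
# has collection probability 0.01; mu = 100.
estimate = posterior_mode(tf=3, doc_len=100, p_w_collection=0.01, mu=100)
```

The “+ 1” in the hyperparameters is exactly what cancels in the mode, which is why the result has the same form as Dirichlet-smoothed language model estimates.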

Model
[Figure: inference network diagram, repeated.]

P( q | r ) and P( I | r )
Belief nodes are created dynamically based on query
Belief node CPTs are derived from standard link matrices
- Combine evidence from parents in various ways
- Allows fast inference by making marginalization computationally tractable
Information need node is simply a belief node that combines all network evidence into a single value
Documents are ranked according to: P( I | α, β, D)

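Example: #AND
[Figure: belief node Q with parent nodes A and B.]
P(Q=true|a,b) | A | B
0 | false | false
0 | false | true
0 | true | false
1 | true | true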
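The #AND link matrix sets P(Q = true | a, b) to 1 only when both parents are true. The sketch below marginalizes over all parent configurations, assuming the parents are independent given the document, and shows that for #AND the sum collapses to the product of the parent beliefs. The function name and the numeric beliefs are hypothetical.

```python
from itertools import product

# P(Q = true | A, B) for #AND: true only when both parents are true.
AND_MATRIX = {
    (False, False): 0.0,
    (False, True): 0.0,
    (True, False): 0.0,
    (True, True): 1.0,
}


def belief(link_matrix, parent_beliefs):
    """P(Q = true), summing over every true/false configuration of the parents.

    Assumes the parents are independent, so a configuration's probability
    is the product of the individual parent probabilities.
    """
    total = 0.0
    for config in product([False, True], repeat=len(parent_beliefs)):
        p_config = 1.0
        for is_true, p in zip(config, parent_beliefs):
            p_config *= p if is_true else (1.0 - p)
        total += link_matrix[config] * p_config
    return total


p_q = belief(AND_MATRIX, [0.8, 0.5])
```

The generic sum visits 2^n configurations, whereas the closed form (here simply 0.8 × 0.5) needs one pass over the parents; this is the sense in which standard link matrices make marginalization tractable.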

Query Language
Extension of INQUERY query language
- Structured query language
- Term weighting
- Ordered / unordered windows
- Synonyms
Additional features
- Language modeling motivated constructs
- Added flexibility to deal with fields via contexts
- Generalization of passage retrieval (extent retrieval)
Robust query language that handles many current language modeling tasks

Terms
Type | Example | Matches
Stemmed term | dog | All occurrences of dog (and its stems)
Surface term | "dogs" | Exact occurrences of dogs (without stemming)
Term group (synonym group) | <"dogs" canine> | All occurrences of dogs (without stemming) or canine (and its stems)
Extent match | #any:person | Any occurrence of an extent of type person

Date / Numeric Fields
Operator | Example | Matches
#less | #less(URLDEPTH 3) | Any URLDEPTH numeric field extent with value less than 3
#greater | #greater(READINGLEVEL 3) | Any READINGLEVEL numeric field extent with value greater than 3
#between | #between(SENTIMENT 0 2) | Any SENTIMENT numeric field extent with value between 0 and 2
#equals | #equals(VERSION 5) | Any VERSION numeric field extent with value equal to 5
#date:before | #date:before(1 Jan 1900) | Any DATE field before 1900
#date:after | #date:after(June ) | Any DATE field after June 1, 2004
#date:between | #date:between(1 Jun Sep 2001) | Any DATE field in summer 2000

Proximity
Type | Example | Matches
#odN(e1 … em) or #N(e1 … em) | #od5(saddam hussein) or #5(saddam hussein) | All occurrences of saddam and hussein appearing ordered within 5 words of each other
#uwN(e1 … em) | #uw5(information retrieval) | All occurrences of information and retrieval that appear in any order within a window of 5 words
#uw(e1 … em) | #uw(john kerry) | All occurrences of john and kerry that appear in any order within any sized window
#phrase(e1 … em) | #phrase(#1(willy wonka) #uw3(chocolate factory)) | System dependent implementation (defaults to #odm)
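A simplified two-term reading of the ordered and unordered window operators can be sketched as follows. This is an illustration of the matching idea only, not INDRI's implementation: the function names are hypothetical, and the exact boundary conventions (“within N words”, window size) are assumptions.

```python
def ordered_window(tokens, first, second, n):
    """Pairs (i, j) where `first` at position i is followed by `second`
    at position j, with j at most n positions after i (two-term #odN)."""
    matches = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        for j in range(i + 1, min(i + n + 1, len(tokens))):
            if tokens[j] == second:
                matches.append((i, j))
    return matches


def unordered_window(tokens, term_a, term_b, n):
    """Pairs (i, j) where both terms fit, in any order, inside a window
    of n consecutive words (two-term #uwN)."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [j for j, t in enumerate(tokens) if t == term_b]
    return [(i, j) for i in pos_a for j in pos_b if i != j and abs(i - j) < n]


text = "retrieval of information and information retrieval".split()
od1 = ordered_window(text, "information", "retrieval", 1)
uw3 = unordered_window(text, "information", "retrieval", 3)
```

With n = 1 the ordered window only matches the adjacent phrase at the end of the text, while the unordered window of 3 words also picks up “retrieval of information” at the start.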

Context Restriction
Example | Matches
yahoo.title | All occurrences of yahoo appearing in the title context
yahoo.title,paragraph | All occurrences of yahoo appearing in both a title and a paragraph context (may not be possible)
<yahoo.title yahoo.paragraph> | All occurrences of yahoo appearing in either a title context or a paragraph context
#5(apple ipod).title | All matching windows contained within a title context

Context Evaluation
Example | Evaluated
google.(title) | The term google evaluated using the title context as the document
google.(title, paragraph) | The term google evaluated using the concatenation of the title and paragraph contexts as the document
google.figure(paragraph) | The term google restricted to figure tags within the paragraph context

Belief Operators
INQUERY | INDRI
#sum / #and | #combine
#wsum* | #weight
#or | #or
#not | #not
#max | #max
* #wsum is still available in INDRI, but should be used with discretion

Extent / Passage Retrieval
Example | Evaluated
#combine[section](dog canine) | Evaluates #combine(dog canine) for each extent associated with the section context
#combine[title, section](dog canine) | Same as previous, except that it is evaluated for each extent associated with either the title context or the section context
#combine[passage100:50](white house) | Evaluates #combine(white house) over 100-word passages, treating every 50 words as the beginning of a new passage
#sum(#sum[section](dog)) | Returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent
#max(#sum[section](dog)) | Same as previous, except that it returns the maximum score
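The passage100:50 notation describes overlapping fixed-size windows: 100-word passages with a new passage starting every 50 words. A minimal sketch of how such extents could be enumerated is given below. The function name, the (begin, end) convention with an exclusive end, and the decision to keep a shorter final passage are assumptions.

```python
def passage_extents(num_tokens, length=100, step=50):
    """(begin, end) token offsets of overlapping passages, end exclusive."""
    if num_tokens <= 0:
        return []
    extents = []
    for begin in range(0, num_tokens, step):
        extents.append((begin, min(begin + length, num_tokens)))
        if begin + length >= num_tokens:
            break  # this passage already reaches the end of the document
    return extents


extents_220 = passage_extents(220)  # a hypothetical 220-word document
```

Each extent would then be scored as if it were a document, exactly as in the field-based examples above.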

Extent Retrieval Example
Query: #combine[section]( dirichlet smoothing )
<document>
<section><head>Introduction</head> Statistical language modeling allows formal methods to be applied to information retrieval. ... </section>
<section><head>Multinomial Model</head> Here we provide a quick review of multinomial language models. ... </section>
<section><head>Multiple-Bernoulli Model</head> We now examine two formal methods for statistically modeling documents and queries based on the multiple-Bernoulli distribution. ... </section>
…
</document>
- Treat each section extent as a “document”
- Score each “document” according to #combine( … )
- Return a ranked list of extents
[Figure: the sections are annotated with scores 0.15, 0.50 and 0.05; the result table has columns SCORE, DOCID, BEGIN, END, and its top row has score 0.50.]

Other Operators
Type | Example | Description
Filter require | #filreq( #less(READINGLEVEL 10) ben franklin ) | Requires that documents have a reading level less than 10. Documents are then ranked by the query ben franklin
Filter reject | #filrej( #greater(URLDEPTH 1) microsoft ) | Rejects (does not score) documents with a URL depth greater than 1. Documents are then ranked by the query microsoft
Prior | #prior( DATE ) | Applies the document prior specified for the DATE field
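The two filter operators amount to “select first, then rank”. The sketch below mimics that behaviour on in-memory records; it is not INDRI code, and the record layout, the precomputed score field, and the function names are hypothetical.

```python
def filter_require(documents, condition, score):
    """Keep only documents satisfying `condition`, then rank them by `score`."""
    kept = [doc for doc in documents if condition(doc)]
    return sorted(kept, key=score, reverse=True)


def filter_reject(documents, condition, score):
    """Drop documents satisfying `condition`, then rank the rest by `score`."""
    return filter_require(documents, lambda doc: not condition(doc), score)


# Hypothetical records: `score` stands in for the belief of the ranking query.
docs = [
    {"id": "d1", "READINGLEVEL": 8, "URLDEPTH": 1, "score": 0.2},
    {"id": "d2", "READINGLEVEL": 12, "URLDEPTH": 3, "score": 0.9},
    {"id": "d3", "READINGLEVEL": 5, "URLDEPTH": 1, "score": 0.6},
]
easy = filter_require(docs, lambda d: d["READINGLEVEL"] < 10, lambda d: d["score"])
shallow = filter_reject(docs, lambda d: d["URLDEPTH"] > 1, lambda d: d["score"])
```

Note that d2 has the highest score but never reaches the ranking in either case, because the filter condition is applied before scoring.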

System Overview
Indexing
- Inverted lists for terms and fields
- Repository consists of inverted lists, parsed documents, and document vectors
Query processing
- Local or distributed
- Computing local / global statistics
Features
