# LIS618 lecture 2 Thomas Krichel 2004-02-08. Structure Theory: information retrieval performance Practice: more advanced dialog.


LIS618 lecture 2 Thomas Krichel 2004-02-08

Structure Theory: information retrieval performance Practice: more advanced dialog.

Retrieval performance evaluation. "Recall" and "precision" are two classic measures of information retrieval performance for a single query. Both assume that there is an answer set of documents that contain the answer to the query. Performance is optimal if
– the database returns all the documents in the answer set
– the database returns only documents in the answer set
Recall is the fraction of the relevant documents that the query result has captured. Precision is the fraction of the retrieved documents that is relevant.
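In set notation, the two definitions can be sketched as follows. The symbols are labels assumed here for illustration, not taken from the slides: R is the answer set (the relevant documents) and A is the set of documents the query returned.

```latex
\[
\text{recall} = \frac{|R \cap A|}{|R|},
\qquad
\text{precision} = \frac{|R \cap A|}{|A|}
\]
```

Both reach 100% exactly when the returned set equals the answer set, which is the optimal case described above.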

Recall and precision curves. Assume that all the retrieved documents arrive at once and are examined one by one. During that process, the user discovers more and more relevant documents, so recall increases. During the same process, at least eventually, there will be fewer and fewer useful documents, so precision (usually) declines. This can be represented as a curve.

Example. Let the answer set be {0,1,2,3,4,5,6,7,8,9} and let non-relevant documents be represented by letters. A query returns the following result: 7,a,3,b,c,9,n,j,l,5,r,o,s,e,4. After the first document, (recall, precision) is (10%,100%), after the third (20%,66%), after the sixth (30%,50%), after the tenth (40%,40%), and after the last (50%,33%).
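The points in this example can be recomputed with a short Python sketch. The function and variable names are made up for illustration, and the code assumes the result list contains no duplicates.

```python
# Worked example from the slide: digits are relevant, letters are not.
answer_set = set("0123456789")
results = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")


def recall_precision_points(results, answer_set):
    """Return (rank, recall, precision) at each rank where a relevant document appears."""
    points = []
    found = 0
    for rank, doc in enumerate(results, start=1):
        if doc in answer_set:
            found += 1
            recall = found / len(answer_set)   # share of the answer set seen so far
            precision = found / rank           # share of examined documents that are relevant
            points.append((rank, recall, precision))
    return points


points = recall_precision_points(results, answer_set)
for rank, recall, precision in points:
    print(f"rank {rank:2d}: recall {recall:.0%}, precision {precision:.0%}")
```

The precision at the third document is 2/3, which this sketch rounds to 67%, while the slide truncates it to 66%.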

Recall/precision curves. Such curves can be formed for each query. An average curve can be calculated over several queries by averaging the precision at each recall level. Recall and precision levels can also be used to calculate two single-valued summaries:
– average precision at seen document
– R-precision

R-precision. This is a pretty ad-hoc measure. Let R be the size of the answer set. Take the first R results of the query, find the number of relevant documents among them, and divide by R. In our example, the R-precision is 40%. An average can be calculated for a number of queries.
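As a minimal sketch of that recipe, using the example data from the earlier slide (the function name is hypothetical):

```python
# Example data from the slides: digits are relevant, letters are not.
answer_set = set("0123456789")
results = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")


def r_precision(results, answer_set):
    """Precision over the first R results, where R is the size of the answer set."""
    r = len(answer_set)
    top_r = results[:r]  # the first R results of the query
    relevant_in_top = sum(1 for doc in top_r if doc in answer_set)
    return relevant_in_top / r


value = r_precision(results, answer_set)
print(f"R-precision: {value:.0%}")
```

Here R is 10, and four of the first ten results (7, 3, 9 and 5) are relevant, which gives the 40% quoted on the slide.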

Average precision at seen document. To find it, sum the precision levels at each new relevant document discovered by the user and divide by the number of relevant documents discovered. In our example, it is (100+66+50+40+33)/5=57.8%. This measure favors retrieval methods that get the relevant documents to the top.
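A small Python sketch of this calculation follows. It mirrors the slide's arithmetic by dividing by the five relevant documents that were actually retrieved. The function name is hypothetical.

```python
# Example data from the slides: digits are relevant, letters are not.
answer_set = set("0123456789")
results = "7,a,3,b,c,9,n,j,l,5,r,o,s,e,4".split(",")


def average_precision_at_seen(results, answer_set):
    """Mean of the precision values taken at each relevant document in the result list."""
    precisions = []
    found = 0
    for rank, doc in enumerate(results, start=1):
        if doc in answer_set:
            found += 1
            precisions.append(found / rank)  # precision at this relevant document
    if not precisions:
        return 0.0  # no relevant document was retrieved
    return sum(precisions) / len(precisions)


value = average_precision_at_seen(results, answer_set)
print(f"average precision at seen documents: {value:.1%}")
```

With exact fractions the result is 58.0%. The slide's 57.8% comes from truncating 2/3 and 1/3 to 66% and 33% before averaging.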

Critique of recall & precision. Recall has to be estimated by an expert. Recall is very difficult to estimate in a large collection. Both measures focus on one query only. No serious user works like this. There are some other measures, but they are more for an advanced course in IR.

Looking at database structure. Up until now, we have looked at commands that take a full-text view of the database. Such commands can be executed for every database. If we want to make more precise queries, we have to take account of database structure.

Bluesheet. Each database name is linked to a blueish pop-up window called the bluesheet for the database. It contains the details of the database.

Closer look at the bluesheet.
– file description
– subject coverage (free vocabulary)
– format options, lists all formats
  – by number (internal)
  – by Dialog web format (external, i.e. cross-database)
– search options
  – basic index, i.e. subject contents
  – additional index, i.e. non-subject

Basic vs additional index.
The basic index
– has information that is relevant to the substantive contents of the data
– usually is indexed by word, i.e. connectors are required
The additional index
– has data that is not relevant to the substantive matter
– usually is indexed by phrase, i.e. connectors are not required

Search options: basic index. A select without qualifiers searches in all fields of the basic index. The bluesheet lists the field indicators available for a database. Also note whether a field is indexed by word or by phrase. Proximity searching only works with word indices. When phrases are indexed, you don't need proximity indicators.

Search in basic index. A field in the basic index is queried through term/IN, where term is a search term and IN is a field indicator. Thomas calls this an appending indicator. Several field indicators can be ORed by giving a comma-separated list. For example, mate/ti,de searches for mate in the title or descriptor fields.

Limiters and sorting. Some databases allow you to restrict the search using limiters. For example
– /ABS requires an abstract to be present
– /ENG English-language publication
Some fields are sortable with the sort command, i.e. records can be sorted by the values in the fields. Example: sort s1/all/ti. Such features are database specific.

Additional indices. The additional indices list those terms that can lead a query. Often, these are phrase indexed. Such fields are queried by the prefix IN=term, where IN is the field abbreviation and term is the search term. Thomas calls this a pre-pending indicator.

Expanding queries. Names have to be entered as they appear in the database. The "expand" command can be used to see varieties of spelling of a value. It has to be used in conjunction with a field identifier. Examples:
– expand au=cruz, b?
– expand au=barrueco?
to search for misspellings of José Manuel Barrueco Cruz.

Expanding queries II. The expand command produces results of the form Ref Items Index-term
– Ref is a reference number
– Items is the number of items where the index term appears
– Index-term is the index term
"s Ref" searches for the term with that reference number.

Expand topics. You can also expand a topic in a database to see what index terms are available that start with the term.
Example: b 155 ; e cold
If you expand an entry in the expansion list again, you can see a list of terms related to it, if such a list is available.

Example. How many domain names are currently registered in Novosibirsk, Russia? Hint: use the domain name database, file 225. Note that this database also covers non-current domains.

Ranking. The rank command can be used to show the most frequent values of a phrase-indexed field in a search set. Examples:
– rank au s1 shows the most frequent authors
– rank de s1 shows the most frequent descriptors
Read the screens following the rank command for instructions.

Example. Who wrote on interest rates and growth rates? Use EconLit:
– b 139
– s interest(n)rate? and growth(n)rate?
– rank au s1
You can then select some authors you are interested in, 1-5 for example. Type exit to leave rank and confirm with yes. Type exs to search for those authors.

Topic searches. Often we want to know what literature is available on a certain topic. Many times authors do not use the obvious words that occur to the searcher. Using descriptors can be very helpful.
– Conduct a search
– Look for descriptors
– Use those in other searches

Initial file selection. On the main menu, go to the database menu. After the principal menu, you get a search box. There you can enter full-text queries for all the databases. You can then select the database you want and get to the begin databases stage.

Database categories. To help people find databases (files), DIALOG has grouped databases by category. Categories are listed at
– http://library.dialog.com/bluesheets/html/blo.html
– http://library.dialog.com/bluesheets
'b category' selects the databases in the named category at the start. 'sf category' selects the files belonging to the named category at other times.

Add/repeat.
– add number, number adds databases by file number to the last query, for example "add 297" to see what the Bible says about it
– repeat repeats the previous query with the databases added

To find publications. Sometimes you want to find out whether a certain publication, say a serial, is available on Dialog. http://library.dialog.com/bluesheets/ has a search box specifically for journal data.

http://openlib.org/home/krichel Thank you for your attention!
