
1 Basics of Information Retrieval W Arms Digital Libraries 1999 Manuscript as background reading

2 Information discovery  Searching vs browsing  When do you use one over the other?  Do we need both?  Is one a special case of the other?  Types of information seeking  Comprehensive search  Known (specific) item  Facts  Introduction or overview  Related information

3 Item descriptions  Metadata  Catalogs  Library catalog records are time-consuming to produce. They contain more than just easily available information about the item.  Services produce the catalog records and distribute them to libraries.  OCLC: Online Computer Library Center http://www.oclc.org/us/en/default.htm  Abstracting and indexing services  Alternative to a catalog, with more detailed descriptions  Specific to a discipline  Automating the process is a subject of research and experiment

4 Paper topic suggestion  What is the state of the art in automatic indexing and abstracting? In what fields is this most fully researched? Who is leading the efforts?  What has been accomplished in automatic e-mail indexing and summarizing?

5 Controlled Vocabularies and Ontologies  Effective description of materials requires unambiguous descriptive terms  Natural language is inherently ambiguous  Controlled vocabularies force use of a restricted set of terms  ACM CCS http://www.acm.org/class/1998/  Regularly updated, difficult to use  Computing Ontology

6 Dublin Core  Standard set of metadata fields for entries in digital libraries:  Title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights

7 Dublin Core elements  see: http://dublincore.org/documents/dces/  Title  Creator - entity primarily responsible for making the content of the resource  Subject - C  Description  Publisher - entity making the resource available  Contributor - contributor to the content of the resource  Date - YYYY-MM-DD  Type - C; ex: collection, dataset, event, image  Format - C; what is needed to display or operate the resource  Identifier - unambiguous ID  Source - resource from which this one was derived  Language - standards RFC 3066, ISO 639  Relation - reference to a related resource  Coverage - C; space, time, jurisdiction  Rights - rights management information  C = controlled vocabulary recommended

8 Metadata  What does metadata look like?  Metadata is data about data  Information about a resource, encoded in the resource or associated with the resource.  The language of metadata: XML  eXtensible Markup Language

9 XML  XML is a markup language  XML describes features of the data  There is no standard set of XML tags  Use XML to create a resource type  Separately develop software to interact with the data described by the XML codes.  Source: tutorial at w3schools.com

10 XML rules  Easy rules, but very strict  First line gives the version and character set used, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>  The rest is user-defined tags  Every tag has an opening and a closing
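The slide's inline example did not survive the transfer to plain text. A minimal well-formed document illustrating both rules; the tag names here are hypothetical, not from the slide:

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>  <!-- first line: version and character set -->
<note>                        <!-- user-defined opening tag -->
  <to>Students</to>           <!-- every tag is opened and closed -->
  <body>Class at 10.</body>
</note>                       <!-- matching closing tag -->
```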

11 Element naming  XML elements must follow these naming rules:  Names can contain letters, numbers, and other characters  Names must not start with a number or punctuation character  Names must not start with the letters xml (or XML or Xml..)  Names cannot contain spaces

12 Elements and attributes  Use elements to describe data  Use attributes to present information that is not part of the data  For example, the file type or some other information that would be useful in processing the data, but is not part of the data.
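As a sketch of this distinction, using hypothetical names drawn from the recipe example that appears later in the deck: the element holds the data itself, while the attribute carries information useful for processing it.

```xml
<!-- hypothetical names: the element content is the data;
     the unit of measure helps processing but is not the data -->
<ingredient>
  <amount unit="cup">1</amount>
  <name>sugar</name>
</ingredient>
```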

13 Repeating elements  Naming an element means it appears exactly once.  Name+ means it appears one or more times.  Name* means it appears 0 or more times.  Name? means it appears 0 or one time.

14 Using XML - an example Define the fields of a recipe collection: ISO 8859 is a character set. See http://www.bbsinc.com/iso8859.html

15 Processing the XML data  How do we know what to do with the information in an XML file?  Document Type Definition (DTD)  Put in the same file as the data -- immediate reference  Put a reference to an external description  Provides the definition of the legitimate content for each element

16 Document Type Definition   <!DOCTYPE recipe [   ]> Repeat 0 or more times
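The element declarations inside the brackets were lost in extraction. A plausible reconstruction of the internal DTD, with hypothetical element names matching the recipe example; the `*` on `ingredient*` corresponds to the slide's "repeat 0 or more times" annotation:

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE recipe [
  <!ELEMENT recipe (title, ingredient*, directions)>
  <!ELEMENT title (#PCDATA)>
  <!-- ingredient* : repeat 0 or more times -->
  <!ELEMENT ingredient (amount, name)>
  <!ELEMENT amount (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT directions (#PCDATA)>
]>
```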

17 Meringue cookies 3 egg whites 1 cup sugar 1 teaspoon vanilla 2 cups mini chocolate chips Beat the egg whites until stiff. Stir in sugar, then vanilla. Gently fold in chocolate chips. Place in warm oven at 200 degrees for an hour. Alternatively, place in an oven at 350 degrees. Turn oven off and leave overnight. Not the way that I want to see a recipe in a magazine! What could we do with a large collection of such entries? How would we get the information entered into a collection? External reference to DTD
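The tags in this slide's example were stripped when it became plain text. A sketch of how the meringue recipe might be marked up, assuming hypothetical recipe elements and the external DTD reference the slide mentions:

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE recipe SYSTEM "recipe.dtd">  <!-- external reference to the DTD -->
<recipe>
  <title>Meringue cookies</title>
  <ingredient><amount>3</amount><name>egg whites</name></ingredient>
  <ingredient><amount>1 cup</amount><name>sugar</name></ingredient>
  <ingredient><amount>1 teaspoon</amount><name>vanilla</name></ingredient>
  <ingredient><amount>2 cups</amount><name>mini chocolate chips</name></ingredient>
  <directions>Beat the egg whites until stiff. Stir in sugar, then vanilla.
  Gently fold in chocolate chips. Place in warm oven at 200 degrees for an
  hour. Alternatively, place in an oven at 350 degrees. Turn oven off and
  leave overnight.</directions>
</recipe>
```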

18 XML exercise  Design an XML schema for an application of your choice. Keep it simple.  Examples -- address book, TV program listing, DVD collection, …

19 Another example  A paper with content encoded with XML: http://tecfaseed.unige.ch/staf18/modules/ePBL/uploads/proj3/paper81.xml  First few lines:  Standards E-learning and their possible support for a rich pedagogic approach in a 'Integrated Learning' context  Rodolophe Borer  http://tecfa.unige.ch/perso/staf/borer/  "ePBLpaper11.dtd" shown on next slide  This paper is no longer available online

20 %foreign-dtd; Source: http://tecfa.unige.ch/staf/staf-j/vuilleum/staf18/p6/

21 Vocabulary  Given the need for processing, do you want free text or restricted entries?  Free text gives more flexibility for the person making the entry  Controlled vocabulary helps with  Consistent processing  Comparison between entries  Controlled vocabulary limits  Options for what is said

22 Vocabulary example  Recipe example  What text should be controlled?  What should be free text?  Ingredients  Ingredient-amount  Ingredient-name  Should we revise how we coded ingredient amount?  Directions

23 A DSpace example  CITIDEL: http://citidel.villanova.edu

24 IEEE - LOM  Example of a specialized metadata scheme  Learning Object Metadata  Specifically for collections of educational materials  Includes all of Dublin Core  See http://projects.ischool.washington.edu/sasutton/IEEE1484.html

25 Information Retrieval  Until now, information description  Now, how to match the information need to the resources available  Query - expresses the information need  Composed of individual words or symbols called search terms  Search types  Full text search  Compare search terms to every word in the text  Fielded search  Match the search terms to the relevant parts of the text

26 Information retrieval techniques  Eliminate stop words  Words that do not contribute to identifying useful resources  Typically: articles (a, an, the), prepositions (in, of, with, to, on, …), conjunctions (and, or), pronouns (he, she, it, they, them, …), auxiliary verbs or verb parts (to, be, was, …)  Making the stop list is not trivial: Arms gives the example of the query to be or not to be, composed entirely of words usually considered stop words.
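Stop-word elimination can be sketched in a few lines. The stop list below is illustrative, not a standard one; note how Arms's example query vanishes entirely:

```python
# Minimal sketch of stop-word removal (the stop list is hypothetical).
STOP_WORDS = {"a", "an", "the", "in", "of", "with", "to", "on",
              "and", "or", "he", "she", "it", "they", "them",
              "be", "was", "not"}

def remove_stop_words(text):
    """Return the terms of the text that are not on the stop list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("the history of computing"))  # ['history', 'computing']
print(remove_stop_words("to be or not to be"))        # []
```

The second call shows why building the stop list is not trivial: a list tuned for typical queries silently destroys this one.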

27 IR techniques - 2  Inverted files  List the words in the whole document collection and append a pointer to each place the word appears  Extract all the words  Alphabetize  "Tokenize": break the text into individual word tokens, stripping punctuation  Stemming: reduce each word to its basic stem  Index: link each word to its location in the document  Example - handout and exercise  Interesting resource: http://www.rhymezone.com/shakespeare/
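These steps can be sketched as a small inverted-file builder. The stemming here is faked with a tiny lookup table (real systems use an algorithm such as Porter's stemmer), and word positions are counted from 1 as in the exercise that follows:

```python
from collections import defaultdict

# Sketch of an inverted file: word -> {doc_id: [positions]}.
# Hypothetical stem table; "compute" is the stem used in the exercise.
STEMS = {"computing": "compute", "computer": "compute", "computers": "compute"}

def tokenize(text):
    """Split into word tokens, stripping punctuation and case."""
    return [w.strip(".,?!").lower() for w in text.split()]

def build_inverted_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text), start=1):
            word = STEMS.get(word, word)   # stemming
            index[word][doc_id].append(pos)
    return index

docs = {1: "Is computer science relevant?", 2: "Computing majors are needed."}
index = build_inverted_index(docs)
print({d: p for d, p in index["compute"].items()})  # {1: [2], 2: [1]}
```

The stem "compute" points into both documents, with its position in each, which is exactly what the exercise asks each team to produce by hand.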

28 Inverted file exercise  Given the short documents,  Each team takes a document and produces an alphabetical list of all the words in the document.  Make a stop list (what words will you put on it?)  Reduce each word to its stem. (For computer, computing, etc. use “compute” as the stem.)  List the location of each word by counting the word position in the document.  Make your list show the word, then the document number, then the number of times the word occurs in that document and the locations in which the word appears.

29 Using our inverted file  Search for Computer Science and Election  How well does our inverted file serve our purpose?

30 Search result evaluation  Basic search question -- Is this word (or are these words) in the document?  Answer -- yes or no.  Is that good enough? Are all yes responses the same?  Boolean search --  Using the inverted list, we can find which terms are in the document and also the relative positions of the words.  Basic Boolean search is for an exact match.  Tokenizing and stemming improve performance, but more is possible.

31 The Vector model  Simplest version -- Boolean vectors  Vector for a document  Each position represents a word  There is a 1 if the document contains the word and a 0 if it does not  Vector for a query  Each position is the same as for the document  There is a 1 wherever the word corresponds to a term in the query and a 0 everywhere else.

32 Example boolean vector  Consider the “documents”  (1) Is computer science relevant?  (2) Computing majors are needed.  Index terms:  compute major need science relevant  Document vectors:  (1) {1 0 0 1 1}  (2) {1 1 1 0 0}  Consider a query: Computing relevance  Query vector {1 0 0 0 1}

33 Compare document and query vectors  Document vectors:  (1) {1 0 0 1 1}  (2) {1 1 1 0 0}  Consider a query: Computing relevance  Query vector {1 0 0 0 1}  Doc 1 & Query: {1 0 0 0 1}  Doc 2 & Query: {1 0 0 0 0}  Note that this tells us that document 1 contains exactly the terms of the query, but does not tell us how many occurrences there are, or the relative positions of the terms. If we had many documents and computed the same vectors for several, how would we decide which is best? Can we rank the results? Does the notion of a vector make sense?
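The document and query comparison above can be reproduced directly with Boolean vectors over the five index terms from the example (stemming assumed already done):

```python
# Index terms from the slide's example, in order.
TERMS = ["compute", "major", "need", "science", "relevant"]

def to_vector(words):
    """Boolean vector: 1 in each position whose term appears in the word set."""
    return [1 if t in words else 0 for t in TERMS]

doc1 = to_vector({"compute", "science", "relevant"})   # [1, 0, 0, 1, 1]
doc2 = to_vector({"compute", "major", "need"})         # [1, 1, 1, 0, 0]
query = to_vector({"compute", "relevant"})             # [1, 0, 0, 0, 1]

def boolean_and(doc, q):
    """Position-wise AND of a document vector and a query vector."""
    return [d & x for d, x in zip(doc, q)]

print(boolean_and(doc1, query))  # [1, 0, 0, 0, 1] -- all query terms present
print(boolean_and(doc2, query))  # [1, 0, 0, 0, 0] -- only "compute" matches
```

As the slide notes, the AND result shows which terms match but carries no occurrence counts or positions, so by itself it cannot rank one matching document above another.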

34 The vector model  Let  k i be an index term (keyword)  N be the total number of documents in the collection  n i be the number of documents that contain k i  freq(i,j) be the raw frequency of k i within document d j  A normalized tf (term frequency) factor is given by  tf(i,j) = freq(i,j) / max l freq(l,j)  where the maximum is computed over all terms l that occur within the document d j  The idf (inverse document frequency) factor is computed as  idf i = log (N/n i )  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k i.

35 Consider these terms  tf(i,j) = freq(i,j) / max(freq(i,j))  idf i = log (N/n i )

36 Vector Model - 2  These expressions allow us to give weights to terms within documents.  tf: term frequency, quantifies intra-document occurrence (also called term density in a document)  idf: inverse document frequency, quantifies inter-document differentiation. If a word is common to nearly all the documents in the collection, it will not be very useful in finding good matches to a query.  The weight assigned to word i in document j is  w ij = tf(i,j) * idf i  This is called the tf-idf weighting scheme  This method generally does as well as any other ranking scheme and has the advantage of simplicity and computational efficiency.  Weights may also be assigned to words in the query.
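The tf-idf computation is short enough to sketch end to end. The two-document collection below is hypothetical; it is chosen so that a term appearing in every document visibly gets weight 0, illustrating the point about common words:

```python
import math

# Sketch of the slide's tf-idf weighting:
#   tf(i,j) = freq(i,j) / max_l freq(l,j)
#   idf(i)  = log(N / n_i)
#   w(i,j)  = tf(i,j) * idf(i)
def tf_idf_weights(docs):
    """docs: a list of token lists; returns one {term: weight} dict per document."""
    N = len(docs)
    n = {}  # n[i]: number of documents containing term i
    for doc in docs:
        for term in set(doc):
            n[term] = n.get(term, 0) + 1
    weights = []
    for doc in docs:
        freq = {}  # raw frequency of each term in this document
        for term in doc:
            freq[term] = freq.get(term, 0) + 1
        max_freq = max(freq.values())
        weights.append({term: (f / max_freq) * math.log(N / n[term])
                        for term, f in freq.items()})
    return weights

# "compute" appears in every document, so its idf -- and hence its weight -- is 0.
docs = [["compute", "science", "compute"], ["compute", "major"]]
w = tf_idf_weights(docs)
print(w[0]["compute"])            # 0.0
print(round(w[0]["science"], 3))  # 0.347
```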

37 Some examples  The following examples come from slides provided by the author of the textbook:  Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://people.ischool.berkeley.edu/~hearst/irbook/  Addison Wesley Longman Publishing Company  All the slides that go with the book are available there.

38 The Vector Model: Example I  [Figure from the original slide: a table relating documents d1-d7 to keywords k1-k3, with a column showing how well each document matches the query]  Here, all the keywords are equally weighted in the documents and in the query.

39 The Vector Model: Example II  [Figure from the original slide: the same document-keyword table]  Query terms are weighted, but the documents are not.

40 The Vector Model: Example III  [Figure from the original slide: the same document-keyword table]  Document and query terms are weighted.  Compare the three results for document recommendations.

41 Evaluating IR results  Precision  Of the results returned, what percentage were relevant?  Recall  Of the relevant documents in the collection, what percentage were returned?
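Both measures are one-line set computations. The returned and relevant sets below are hypothetical, chosen only to show the two ratios diverging:

```python
# Sketch of precision and recall over sets of document ids.
def precision(returned, relevant):
    """Fraction of the returned results that are relevant."""
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Fraction of the relevant documents that were returned."""
    return len(returned & relevant) / len(relevant)

returned = {1, 2, 3, 4}   # documents the search returned (hypothetical)
relevant = {2, 3, 5}      # documents actually relevant (hypothetical)
print(precision(returned, relevant))  # 0.5
print(recall(returned, relevant))     # 0.6666666666666666
```

Here two of the four returned documents are relevant (precision 1/2), but one relevant document, 5, was missed (recall 2/3).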

42 This session  Talked about the way that content is described.  Looked at how a document is indexed  Looked at how a query is matched to a document  Looked at the value of weighting the occurrence of words in a document  Some specific things: Dublin Core, XML, Boolean and Vector Space Modeling of Information Retrieval.

