Information Access I Information Representation and Text Searching GSLT, Göteborg, September 2003 Barbara Gawronska, Högskolan i Skövde.


1 Information Access I Information Representation and Text Searching GSLT, Göteborg, September 2003 Barbara Gawronska, Högskolan i Skövde

2 Requirements on Information Representation
- Discriminating power
- Descriptive power
- Similarity identification
- Ambiguity minimization
- Conciseness
These requirements may conflict with one another...

3 Traditional descriptors
- Classification codes (e.g. Universal Decimal Classification)
- Subject headings
- Key words
Problems:
- Standardized lists of subject headings are needed
- Different spelling conventions
- Morphology: inflectional and derivational, compounding
- Semantic relations

4 Strategies for linking related words and phrases
Different spelling conventions:
- Spelling checkers
- In proper names: counting the number of identical letters or identical bigrams (letter pairs). This could be improved by adding some phonological knowledge (metathesis etc.), e.g. BARBARA GAWRONSKA vs. BARBRO GRAVONSKA
Relations on the morphological level:
- Truncation: finding the common part of a string; requires no language-specific morphological knowledge. Problem: too many unrelated words may pass through.
    ren#: renen, renar, rena, rent, renad...
    ren$$: renen, renar, renad...
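The bigram count mentioned for proper names can be sketched as follows. This is a minimal illustration, not the lecture's own procedure: the function names are hypothetical, and counting repeated bigrams as a multiset is an assumption.

```python
from collections import Counter


def bigrams(word):
    """Return the multiset of adjacent letter pairs in a single word."""
    word = word.upper()
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def shared_bigrams(a, b):
    """Count the bigrams two words have in common (multiset intersection)."""
    return sum((bigrams(a) & bigrams(b)).values())


# The two spellings from the slide, compared name by name.
print(shared_bigrams("BARBARA", "BARBRO"))       # BA, AR, RB are shared
print(shared_bigrams("GAWRONSKA", "GRAVONSKA"))  # ON, NS, SK, KA are shared
```

The count treats the metathesis in GAWR/GRAV as four unrelated bigrams, which is the kind of case where the phonological knowledge mentioned above could help.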

5 Strategies for linking related words and phrases (2)
- Lemmatization: identifying the lexical form
- Stemming: a strategy between truncation and lemmatization
The general principle for English (Lovins 1968, Paice 1990): remove the ending and, if needed, transform the ending of the remaining string.
Stemming algorithms are language-dependent; consider e.g. Indonesian:
    infinitive   active     gloss
    tawar        menawar    "bargain"
    pikir        memikir    "think"
    beri         memberi    "give"
    sewa         menyewa    "rent"
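The Indonesian pairs show why suffix-oriented stemming does not transfer across languages: the active form changes the beginning of the word. The sketch below encodes only the four patterns visible in the examples. The rule table and the name `active_form` are assumptions made for illustration, not a full description of Indonesian prefixation.

```python
# Toy rules read off the four example pairs only; NOT a complete
# account of Indonesian morphology.
# initial consonant -> (prefix, whether the initial consonant is dropped)
RULES = {
    "t": ("men", True),    # tawar -> menawar
    "p": ("mem", True),    # pikir -> memikir
    "s": ("meny", True),   # sewa  -> menyewa
    "b": ("mem", False),   # beri  -> memberi
}


def active_form(root):
    """Build the active form for roots starting with t, p, s or b."""
    prefix, drop_initial = RULES[root[0]]  # KeyError for other initials
    return prefix + (root[1:] if drop_initial else root)


for root in ("tawar", "pikir", "beri", "sewa"):
    active = active_form(root)
    # Plain truncation cannot link the pair: the active form does not
    # start with the root.
    print(root, active, active.startswith(root))
```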

6 Strategies for linking related words and phrases (3)
Multi-word entries: context operators, e.g.
- Exact distance between words
    retrieval$information: "retrieval of information", "retrieval with information loss"
- Maximal distance between words
    text##retrieval: "text retrieval", "text and data retrieval"
- Unspecified word order
    information#,retrieval: "information retrieval", "retrieval of information"
Plus: word pair co-occurrence rate
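A minimal sketch of how such context operators might be evaluated over a tokenized text. The function `within` and its parameters are hypothetical, and the operator readings are inferred from the examples: `$` is taken as exactly one intervening word, `##` as at most two, and `#,` as either word order.

```python
def within(text, first, second, min_gap, max_gap, ordered=True):
    """True if `first` and `second` occur with min_gap..max_gap words between them.

    With ordered=False the two words may appear in either order.
    """
    tokens = text.lower().split()
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [i for i, t in enumerate(tokens) if t == second]
    for i in first_pos:
        for j in second_pos:
            gap = abs(j - i) - 1  # number of intervening words
            if (j > i or not ordered) and min_gap <= gap <= max_gap:
                return True
    return False


# retrieval$information: exactly one word in between (assumed reading)
print(within("retrieval with information loss", "retrieval", "information", 1, 1))
# text##retrieval: at most two words in between (assumed reading)
print(within("text and data retrieval", "text", "retrieval", 0, 2))
# information#,retrieval: either order (the gap limit of 1 is an assumption)
print(within("retrieval of information", "information", "retrieval", 0, 1, ordered=False))
```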

7 Strategies for linking related words and phrases (4)
Semantic relations: thesauri, lexicons, and semantic nets as tools for term expansion. Some examples:
- ERIC Thesaurus of Descriptors (the Dialog Corporation)
- Roget's Thesaurus
- KL-ONE
- WordNet...
Commonly used relations: broader/narrower term, related term, synonym, "used for"/"use" (identifies a preferred synonym); also entailment (WordNet) and role (KL-ONE).
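Term expansion over such relations can be sketched with a small dictionary. The thesaurus entries and the `expand` helper below are hypothetical and purely illustrative; they do not reproduce the structure of any of the resources listed.

```python
# Hypothetical mini-thesaurus; entries are illustrative only.
THESAURUS = {
    "cat": {"broader": ["feline"], "narrower": ["domestic cat", "wildcat"]},
    "wildcat": {"broader": ["cat"], "narrower": ["lynx"]},
}


def expand(term, relations=("synonym", "narrower")):
    """Return the term followed by the terms it is linked to via `relations`."""
    entry = THESAURUS.get(term, {})
    expanded = [term]
    for relation in relations:
        for related in entry.get(relation, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


print(expand("cat"))                     # query broadened with narrower terms
print(expand("wildcat", ("broader",)))   # or generalized via the broader term
```

Which relations to follow is a design choice: narrower terms tend to raise recall without drifting off topic, while broader terms can pull in many unrelated documents.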

8 Thesauri: Top-down classification - monohierarchy

9 Thesauri: Polyhierarchy

10 Thesauri: Polydimensional hierarchy

11

12 Thesauri: WordNet, some problems
feline mammal usually having thick soft fur and being unable to roar; domestic cats; wildcats
- any domesticated member of the genus Felis
- any small or medium-sized cat resembling the domestic cat and living in the wild

13 Thesauri: WordNet, some problems
feline mammal usually having thick soft fur and being unable to roar; domestic cats; wildcats
- any domesticated member of the genus Felis
  - female cat
  - a long-haired breed similar to the Persian cat
  - a slender short-haired blue-eyed breed of cat having a pale coat with dark ears paws face and tail tip
    - Siamese cat having a bluish cream body and dark gray points
  - a cat proficient at mousing
  - a long-haired breed of cat
  - a short-haired breed with body similar to the Siamese cat but having a solid dark brown or gray coat
  - a short-haired bluish-gray cat breed
  - a small slender short-haired breed of African origin having brownish fur with a reddish undercoat
  - homeless cat
  - ….

14 Thesauri: WordNet, some problems
feline mammal usually having thick soft fur and being unable to roar…..
- any small or medium-sized cat resembling the domestic cat and living in the wild
  - widely distributed wildcat of Africa and Asia Minor
  - long-bodied long-tailed tropical American wildcat
  - small spotted wildcat found from Texas to Brazil
  - bushy-tailed European wildcat resembling the domestic tabby and regarded as the ancestor of the domestic cat
  - medium-sized wildcat of Central and South America having a dark-striped coat
  - small Asiatic wildcat
  - a desert-dwelling wildcat
  - ….
  - short-tailed wildcats with usually tufted ears; valued for their fur
    - of northern Eurasia
    - of southern Europe
    - small lynx of North America
    - of deserts of northern Africa and southern Asia
    - of northern North America

15 Thesauri: Bottom-up classification
    Attribute A: size    Attribute B: fur    Attribute C: colour    Attribute D: eye colour
    A1: medium           B1: short           C1: pale               D1: blue
    A2: small            B2: long            C2: dark               D2: green
    A3: big                                  C3: striped

16 Finding significant words
- Significance as a function of rank (Luhn 1958)
- A simple frequency-based indexing method: frequent words minus stop-list words, plus truncation/conflation
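The simple frequency-based method can be sketched in a few lines. The stop list, the fixed-length prefix used as a crude form of conflation, and the function name are all assumptions for illustration; a real system would use a proper stop list and stemmer.

```python
from collections import Counter

# Hypothetical stop list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "for", "at", "is"}


def index_terms(text, prefix_len=5, top_n=3):
    """Return the top_n most frequent truncated non-stop words with counts."""
    tokens = [w.strip('.,;:!?"()').lower() for w in text.split()]
    # Truncating to a fixed prefix conflates e.g. "brewers" and "brewing".
    stems = [t[:prefix_len] for t in tokens if t and t not in STOP_WORDS]
    return Counter(stems).most_common(top_n)


sample = "Brewing the beer: brewers brew beer and the brewer brews."
print(index_terms(sample, prefix_len=4, top_n=2))
```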

17 Finding significant words (2)
Term weighting: Salton & McGill 1983
The "tf x idf" method (term frequency times inverse document frequency) gives a term a high weight when it is frequent in a document but occurs in few documents of the collection.
"tf x idf" can be combined with similarity measures, e.g. the vector space model.
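A minimal sketch of one common tf x idf formulation, with weight = tf * log(N / df). The exact variant used in the lecture is not preserved in the transcript, so the logarithmic idf and the function name are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: weight = tf * log(N / df), where N is the number of
    documents and df the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({t: term_freq[t] * math.log(n_docs / doc_freq[t])
                        for t in term_freq})
    return weights


docs = [["beer", "beer", "wheat"], ["beer", "ale"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document, such as "beer" here, gets weight zero: it has no discriminating power in this collection.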

18 Similarity measures
- Models for comparing texts normally make use of the words the texts have in common.
- Some models also take into account the size of the documents and/or the number of words the texts do not have in common.

19 Similarity measures (2)
[Formula not preserved in the transcript.] Legend:
- (weight symbol not preserved) = the weight of an occurrence of term j in document i
- T = the maximum number of terms in both documents combined
No attention is paid to the size of a document.
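The surviving legend is consistent with an unnormalized inner product of the two term-weight vectors, summed over the T terms. That reading is an assumption, as is the function name; the sketch shows why such a measure ignores document size.

```python
def inner_product(doc1, doc2):
    """Sum of weight products over the terms two documents share.

    doc1, doc2: {term: weight} dicts. A missing term has weight 0, so only
    shared terms contribute to the sum.
    """
    return sum(weight * doc2[term] for term, weight in doc1.items() if term in doc2)


short_doc = {"beer": 2, "wheat": 1}
query = {"beer": 1, "ale": 3}
long_doc = {term: 10 * weight for term, weight in short_doc.items()}

print(inner_product(short_doc, query))
# A ten times longer document with the same term proportions scores ten
# times higher, because nothing normalizes for length.
print(inner_product(long_doc, query))
```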

20 Similarity measures (3)
- Dice's coefficient
- Jaccard's coefficient
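The set-based forms of the two coefficients can be sketched as below. The slide's own notation is not preserved, so treating documents as sets of terms (binary weights) and returning 0.0 for two empty sets are assumptions.

```python
def dice(a, b):
    """Dice's coefficient: 2|A and B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard's coefficient: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


doc1 = {"beer", "wheat", "ale"}
doc2 = {"beer", "ale", "malt", "hops"}
print(dice(doc1, doc2), jaccard(doc1, doc2))
```

Both measures normalize the overlap by document size, unlike a plain count of shared terms; Dice is never lower than Jaccard for the same pair.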

21 Similarity measures (4)
- The cosine coefficient (the cosine of the angle between two vectors)
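A minimal sketch of the cosine coefficient over sparse term-weight vectors. Representing a document as a `{term: weight}` dict and returning 0.0 for an all-zero vector are choices made here for illustration.

```python
import math


def cosine(doc1, doc2):
    """Cosine of the angle between two {term: weight} vectors."""
    dot = sum(weight * doc2[term] for term, weight in doc1.items() if term in doc2)
    norm1 = math.sqrt(sum(w * w for w in doc1.values()))
    norm2 = math.sqrt(sum(w * w for w in doc2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)


# Scaling a vector does not change the angle, so document length is ignored.
print(cosine({"beer": 1, "ale": 2}, {"beer": 10, "ale": 20}))
print(cosine({"beer": 1}, {"wheat": 1}))
```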

22 Similarity measures (5)
- Clustering by similarity matrices (Jaccard's coefficient applied to attribute/value matrices)
- Document signature matching (documents coded into very compact binary representations, so-called signatures)
- Discriminator words (Williams 1963): the discrimination coefficient assigns high values to words that occur with a probability much different from the mean probability
For the latest advances in document clustering, wait for Hercules Dalianis' lecture!
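Document signature matching can be illustrated with superimposed coding: each word sets one bit in a fixed-width bit string, and a query can only match documents whose signature contains all the query bits. This is one well-known signature scheme, assumed here rather than taken from the lecture; the width, the use of CRC32 as the hash, and the function names are hypothetical.

```python
import zlib

SIG_BITS = 64  # hypothetical signature width


def signature(words, bits=SIG_BITS):
    """OR together one hashed bit per word into a compact integer signature."""
    sig = 0
    for word in words:
        position = zlib.crc32(word.lower().encode("utf-8")) % bits
        sig |= 1 << position
    return sig


def may_contain(doc_sig, query_sig):
    """True if every query bit is set in the document signature.

    False positives are possible (different words can share a bit), but a
    document that contains all query words is never rejected.
    """
    return (doc_sig & query_sig) == query_sig


doc = "many brewers start to plan their Oktoberfest brewing".split()
print(may_contain(signature(doc), signature(["brewers", "oktoberfest"])))
```

Candidate documents that pass the signature test still have to be checked against the full text to weed out false positives.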

23 Which words should count as common to both documents?

As summer turns to fall, many brewers start to plan their Oktoberfest brewing. This installment of "Brewing in Styles" looks at the materials and techniques used for brewing traditional and modern Maerzen beers and offers some radical tips for brewing Oktoberfest-like ales. Ein prosit!

Several people called in response to the last installment of "Brewing in Styles" ("American Wheat," BrewingTechniques 1 [1], May/June 1993) to say that they were confused because many pubs and micros in the Midwest brew wheat beers in the traditional German manner, complete with the 4-vinylguaiacol clovelike character. Many fine German-style Weizenbiers are brewed in America.

*****

Republished from BrewingTechniques' July/August

What to do with that unfortunate mistake of a recipe? Design another beer that is out of balance in an opposite and complementary way. It invariably happens, even to the best of us. The beer that should have been so good ends up out of balance and undrinkable. Not being the type to accept less-than-perfect products graciously, I decided to take a page from the Belgian book of brewing. Belgian brewers have long used the practice of blending to even out inconsistent, wild fermentations

24 Relevance estimation
The Retrieval Status Value (RSV): the measure of closeness between the query and the document
- In strictly Boolean systems: 0 or 1
- Fuzzy (weighted) Boolean retrieval: values between 0 and 1; however, "false drops" are very probable because of the way the retrieval functions are defined
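The contrast between the two kinds of RSV can be sketched as follows. The slide does not say which retrieval functions it has in mind; the min/max operators from fuzzy set theory are assumed here, and all function names are hypothetical.

```python
def boolean_and(doc_terms, query_terms):
    """Strict Boolean RSV: 1 if all query terms occur in the document, else 0."""
    return 1 if all(term in doc_terms for term in query_terms) else 0


def fuzzy_and(doc_weights, query_terms):
    """Assumed fuzzy AND: the minimum of the query term weights in the document."""
    return min(doc_weights.get(term, 0.0) for term in query_terms)


def fuzzy_or(doc_weights, query_terms):
    """Assumed fuzzy OR: the maximum of the query term weights in the document."""
    return max(doc_weights.get(term, 0.0) for term in query_terms)


doc = {"beer": 0.9, "wheat": 0.1}
query = ["beer", "wheat"]
print(boolean_and(set(doc), query))  # both terms present
print(fuzzy_and(doc, query))         # decided by the weakest term only
print(fuzzy_or(doc, query))          # decided by the strongest term only
```

With these definitions the score depends on a single term weight, so documents that differ greatly on the remaining query terms can receive the same RSV. That is one way a definition of this kind can let poorly matching documents through.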

25 Relevance estimation (2)
- The vector space model: the closeness of the query and document vectors, computed using one of the previously mentioned similarity measures (Dice, Jaccard, or cosine)

26 Relevance estimation (3)
- The probabilistic model (a feedback model)
[Formula not preserved in the transcript.] Quantities involved:
- the query term weight
- the number of relevant documents in which the term occurs
- the total number of documents
- the number of documents in which the term occurs
- the total number of documents that are relevant for query q (a non-trivial problem!)
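The quantities listed match those of the Robertson/Sparck Jones relevance weight, so a sketch of that weight is given here. Whether the lecture used exactly this form is an assumption, as are the symbol names r, n, R, N and the 0.5 smoothing terms.

```python
import math


def term_weight(r, n, R, N):
    """Robertson/Sparck Jones style query term weight (assumed form).

    r: number of relevant documents in which the term occurs
    n: number of documents in which the term occurs
    R: total number of documents relevant for the query
    N: total number of documents
    The 0.5 terms avoid division by zero and log(0).
    """
    odds_in_relevant = (r + 0.5) / (R - r + 0.5)
    odds_in_nonrelevant = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(odds_in_relevant / odds_in_nonrelevant)


# A term found in 5 of 10 relevant documents and in 10 of 100 documents overall.
print(term_weight(r=5, n=10, R=10, N=100))
```

R is normally unknown when a query is first issued, which is why the model relies on relevance feedback to estimate it.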

