Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations Alexander Gelbukh www.Gelbukh.com.

1 Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations Alexander Gelbukh

2 Previous chapter: Conclusions
- Modeling of text helps predict the behavior of systems
  - Zipf's law, Heaps' law
- Describing the structure of documents formally allows part of their meaning to be treated automatically, e.g., for search
- Languages to describe document syntax
  - SGML: too expensive
  - HTML: too simple
  - XML: a good combination
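As a toy illustration of both laws, the sketch below (function names and corpus invented for illustration) tabulates rank × frequency, which Zipf's law predicts to be roughly constant, and tracks vocabulary growth, which Heaps' law predicts to be sublinear:

```python
from collections import Counter

def zipf_table(tokens, top=5):
    """Rank words by frequency; under Zipf's law, rank * freq is roughly constant."""
    counts = Counter(tokens).most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]

def heaps_vocabulary_growth(tokens):
    """Vocabulary size after each token; Heaps' law predicts growth ~ k * n**beta."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth
```

On a real corpus, plotting the last column of `zipf_table` (or the growth list on a log-log scale) makes the predicted behavior visible.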

3 Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

4 Linguistic operations
Purpose: convert words to meanings
- Synonyms or related words
  - Different words, same meaning; morphology
  - foot / feet, woman / female
- Homonyms
  - Same word, different meanings; word senses
  - river bank / financial bank
- Stopwords
  - Words with no meaning of their own; functional words
  - the

5 For good or for bad?
- More exact matching
  - Less noise, better recall
- Unexpected behavior
  - Difficult for users to grasp
  - Harmful if it introduces errors
- More expensive
  - Adds a whole new technology
  - Maintenance; language-dependent
  - Slows processing down
Good if done well, harmful if done badly

6 Document preprocessing
- Lexical analysis (punctuation, case)
  - Simple, but must be done carefully
- Stopwords: reduce index size and processing time
- Stemming: connected, connection, connections, ...
  - Multiword expressions: hot dog, B-52
  - Here, all the power of linguistic analysis can be used
- Selection of index terms
  - Often nouns; noun groups: computer science
- Construction of a thesaurus
  - Synonymy: a network of related concepts (words or phrases)
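The first two steps above can be sketched in a few lines (the stopword list and the tokenizing regex are illustrative assumptions, not those of a real IR system):

```python
import re

# Tiny illustrative stopword list; real systems use lists of hundreds of words.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def preprocess(text):
    """Lexical analysis (lowercase, strip punctuation), then stopword removal."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

Stemming and index-term selection would follow as further passes over the token list.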

7 Stemming
Methods
- Linguistic analysis: complex, expensive maintenance
- Table lookup: simple, but needs data
- Statistical (Avetisyan): no data needed, but imprecise
- Suffix removal
Suffix removal
- Porter algorithm (Martin Porter); ready code on his website
- Substitution rules: sses → ss, s → ∅
  - stresses → stress
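The substitution-rule idea can be sketched as follows; this is only a fragment in the spirit of Porter's algorithm, with an invented function name and just three rules, not the real multi-step algorithm:

```python
def simple_stem(word):
    """Apply a few Porter-style suffix substitution rules; first match wins."""
    rules = [("sses", "ss"), ("ies", "i"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word
```

The full Porter algorithm applies several ordered rule steps with conditions on the remaining stem, which is what keeps it from over-stripping.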

8 Better stemming
The whole problematics of computational linguistics
- POS disambiguation
  - "well": adverb or noun? Oil well.
  - Statistical methods: Brill tagger
  - Syntactic analysis: syntactic disambiguation
- Word sense disambiguation
  - bank1 and bank2 should be different stems
  - Statistical methods
  - Dictionary-based methods: Lesk algorithm
  - Semantic analysis
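A minimal sketch of the Lesk idea, with invented glosses and sense labels: pick the sense whose dictionary definition shares the most words with the query context:

```python
def lesk(context_words, sense_definitions):
    """Simplified Lesk: choose the sense whose gloss overlaps the context most."""
    context = set(context_words)
    def overlap(sense):
        return len(context & set(sense_definitions[sense].split()))
    return max(sense_definitions, key=overlap)
```

Real implementations also weight the overlapping words and expand glosses with related senses; this sketch only shows the core overlap criterion.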

9 Thesaurus
Terms (controlled vocabulary) and relationships
- Terms
  - used for indexing
  - represent a concept; one word or a phrase, usually nouns
  - sense: a definition or notes to distinguish senses: key (door)
- Relationships
  - Paradigmatic: synonymy, hierarchical (is-a, part-of), non-hierarchical
  - Syntagmatic: collocations, co-occurrences
- WordNet, EuroWordNet
  - synsets

10 Use of thesaurus
- To help the user formulate the query
  - Navigation in the hierarchy of words
  - Yahoo!
- For the program, to collate related terms
  - woman ↔ female
  - fuzzy comparison: woman ≈ 0.8 * female; path length
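The path-length idea can be sketched over a toy is-a hierarchy; all names and the 1/(1 + length) scoring below are illustrative assumptions (WordNet-style tools use comparable path measures):

```python
def path_similarity(term_a, term_b, parent):
    """Similarity = 1 / (1 + shortest path length through the is-a hierarchy)."""
    def ancestors(term):
        chain, node = [term], term
        while node in parent:          # parent maps child concept -> parent concept
            node = parent[node]
            chain.append(node)
        return chain
    chain_a, chain_b = ancestors(term_a), ancestors(term_b)
    for depth_a, node in enumerate(chain_a):
        if node in chain_b:            # first common ancestor found
            return 1.0 / (1 + depth_a + chain_b.index(node))
    return 0.0                         # no common ancestor
```

With `woman → female → person`, the score for woman/female is higher than for woman/man, which is exactly the fuzzy-comparison effect the slide describes.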

11 Yahoo! vs. thesaurus
- The book says Yahoo! is based on a thesaurus. I disagree
- Thesaurus: the words of a language organized in a hierarchy
- Document hierarchy: documents attached to a hierarchy
  - This is word sense disambiguation
- I claim that Yahoo! is based on (manual) WSD
  - It also uses a thesaurus for navigation

12 Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

13 Document clustering
An operation on the whole collection
- Global vs. local
- Global: the whole collection
  - At compile time; a one-time operation
- Local
  - Cluster the results of a specific query
  - At runtime, with each query
  - This is more a query transformation operation
  - Already discussed in Chapter 5

14 Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

15 Compression
- Gain: storage, transmission, search
- Loss: time spent compressing and decompressing
- In IR: need for random access
  - Blocks do not work
- Also: pattern matching on compressed text

16 Compression methods
Statistical
- Huffman: fixed size per symbol
  - More frequent symbols get shorter codes
  - Allows decompression to start from any symbol
- Arithmetic: dynamic coding
  - Must decompress from the beginning
  - Not for IR
Dictionary
- Pointers to previous occurrences: Lempel-Ziv
  - Again, not for IR

17 Compression ratio
Size compressed / size decompressed
- Huffman, units = words: up to 2 bits per char
  - Close to the limit = entropy. Only for large texts!
  - Other methods: similar ratio, but no random access
- Shannon: the optimal code length for a symbol with probability p is -log2 p
- Entropy: the limit of compression
  - Average length with optimal coding
  - A property of the model
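Shannon's formula can be checked directly; the sketch below computes the order-0 (independent-symbols) entropy of a string, the average of -log2 p over the text:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(text):
    """Shannon entropy of the text's symbol distribution: sum of -p * log2(p).
    This is the compression limit (bits/symbol) for an order-0 model."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Two equiprobable symbols give exactly 1 bit per symbol, four give 2 bits; skewed distributions fall below these values, which is what a Huffman coder exploits.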

18 Modeling
Find the probability of the next symbol
- Adaptive, static, semi-static
  - Adaptive: good compression, but must start from the beginning
  - Static (for the language): poor compression, random access
  - Semi-static (for the specific text; two passes): both OK
- Word-based vs. character-based
  - Word-based: better compression and search

19 Huffman coding
- Each symbol is encoded sequentially
- More frequent symbols have shorter codes
- No code is a prefix of another one
- How to build the tree: see the book
- Byte codes are better
  - They allow for sequential search
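A compact sketch of the tree construction using a priority queue; for brevity it merges partial code tables instead of building an explicit tree (an illustrative variant, not the book's construction):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Repeatedly merge the two least frequent subtrees; symbols in the first
    get a '0' prefix, symbols in the second a '1' prefix. The result is a
    prefix-free code with shorter codes for more frequent symbols."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)           # unique counter so dicts are never compared
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

The test below checks the two slide properties directly: frequent symbols get shorter codes, and no code is a prefix of another.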

20 Dictionary-based methods
- Static (simple, poor compression), dynamic, semi-static
- Lempel-Ziv: references to previous occurrences
  - Adaptive
- Disadvantages for IR
  - Must decode from the very beginning
  - Newer statistical methods perform better
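The references-to-previous-occurrences idea can be sketched in the LZ78 style, where each output pair points into a phrase dictionary built adaptively during the scan (a simplified illustration; real Lempel-Ziv variants differ in detail):

```python
def lz78_compress(text):
    """LZ78: emit (index of longest known phrase, next char), growing the
    phrase dictionary as we go -- this is why decoding must start at the front."""
    dictionary = {"": 0}
    output, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                     # extend the matched phrase
        else:
            output.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                               # flush a trailing matched phrase
        output.append((dictionary[current], ""))
    return output

def lz78_decompress(pairs):
    """Rebuild the same phrase dictionary while decoding."""
    entries, out = [""], []
    for index, ch in pairs:
        phrase = entries[index] + ch
        entries.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Note that every pair refers back to an earlier dictionary entry, which illustrates the IR disadvantage on the slide: there is no way to start decoding in the middle.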

21 21 Comparison of methods

22 Compression of inverted files
- Inverted file: words + lists of the docs where they occur
- The lists of docs are ordered, so they can be compressed
- Seen as lists of gaps
  - Short gaps occur more frequently
  - Statistical compression
- Our work: reorder the docs for better compression
  - We code runs of docs
  - Minimize the number of runs
  - Distance: the number of differing words
  - TSP
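The gap idea can be sketched together with a simple variable-byte coder for the gaps (an illustrative scheme; the statistical compression on the slide would replace this byte code):

```python
def to_gaps(doc_ids):
    """Turn a sorted, non-empty posting list into its first id plus gaps;
    small gaps dominate, which is what makes the lists compressible."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(gaps):
    """Variable-byte code: 7 data bits per byte, most significant byte first;
    the high bit marks the last byte of each number."""
    out = bytearray()
    for n in gaps:
        chunk = []
        while True:
            chunk.append(n % 128)    # least significant 7 bits first
            n //= 128
            if n == 0:
                break
        chunk[0] |= 128              # flag the final (least significant) byte
        out.extend(reversed(chunk))  # emit most significant byte first
    return bytes(out)
```

Small gaps fit in one byte each, so a dense posting list costs roughly one byte per document instead of a full integer.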

23 Research topics
- All of computational linguistics
  - Improved POS tagging
  - Improved WSD
- Uses of a thesaurus
  - for user navigation
  - for collating similar terms
- Better compression methods
  - Searchable compression
  - Random access

24 Conclusions
- Text transformation: meaning instead of strings
  - Lexical analysis
  - Stopwords
  - Stemming; POS, WSD, syntax, semantics
  - Ontologies to collate similar stems
- Text compression
  - Searchable
  - Random access
  - Word-based statistical methods (Huffman)
- Index compression

25 Thank you! Until the compensation lecture

