Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations Alexander Gelbukh www.Gelbukh.com.

1 Special Topics in Computer Science The Art of Information Retrieval Chapter 7: Text Operations Alexander Gelbukh

2 Previous chapter: Conclusions
- Modeling of text helps predict the behavior of systems
  - Zipf's law, Heaps' law
- Describing the structure of documents formally allows part of their meaning to be treated automatically, e.g., for search
- Languages to describe document syntax
  - SGML: too expensive
  - HTML: too simple
  - XML: a good combination
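As a toy illustration of both laws, the sketch below (function names and corpus invented for illustration) tabulates rank × frequency, which Zipf's law predicts to be roughly constant, and tracks vocabulary growth, which Heaps' law predicts to be sublinear:

```python
from collections import Counter

def zipf_table(tokens, top=5):
    """Rank words by frequency; under Zipf's law, rank * freq is roughly constant."""
    counts = Counter(tokens).most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]

def heaps_vocabulary_growth(tokens):
    """Vocabulary size after each token; Heaps' law predicts growth ~ k * n**beta."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth
```

On a real corpus, plotting the last column of `zipf_table` (or the growth list on a log-log scale) makes the predicted behavior visible.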

3 Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

4 Linguistic operations
Purpose: convert words to meanings
- Synonyms or related words
  - Different words, same meaning; morphology
  - foot / feet, woman / female
- Homonyms
  - Same word, different meanings; word senses
  - river bank / financial bank
- Stopwords
  - Words with no meaning of their own; functional words
  - the

5 For good or for bad?
- More exact matching
  - Less noise, better recall
- Unexpected behavior
  - Difficult for users to grasp
  - Harmful if it introduces errors
- More expensive
  - Adds a whole new technology
  - Maintenance; language-dependent
  - Slows processing down
Good if done well, harmful if done badly

6 Document preprocessing
- Lexical analysis (punctuation, case)
  - Simple, but must be done carefully
- Stopwords: reduce index size and processing time
- Stemming: connected, connection, connections, ...
  - Multiword expressions: hot dog, B-52
  - Here, all the power of linguistic analysis can be used
- Selection of index terms
  - Often nouns; noun groups: computer science
- Construction of a thesaurus
  - Synonymy: a network of related concepts (words or phrases)
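The first two steps above can be sketched in a few lines (the stopword list and the tokenizing regex are illustrative assumptions, not those of a real IR system):

```python
import re

# Tiny illustrative stopword list; real systems use lists of hundreds of words.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def preprocess(text):
    """Lexical analysis (lowercase, strip punctuation), then stopword removal."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

Stemming and index-term selection would follow as further passes over the token list.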

7 Stemming
Methods
- Linguistic analysis: complex, expensive maintenance
- Table lookup: simple, but needs data
- Statistical (Avetisyan): no data needed, but imprecise
- Suffix removal
Suffix removal
- Porter algorithm (Martin Porter); ready code on his website
- Substitution rules: sses → ss, s → ∅
  - stresses → stress
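The substitution-rule idea can be sketched as follows; this is only a fragment in the spirit of Porter's algorithm, with an invented function name and just three rules, not the real multi-step algorithm:

```python
def simple_stem(word):
    """Apply a few Porter-style suffix substitution rules; first match wins."""
    rules = [("sses", "ss"), ("ies", "i"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word
```

The full Porter algorithm applies several ordered rule steps with conditions on the remaining stem, which is what keeps it from over-stripping.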

8 Better stemming
The whole problematics of computational linguistics
- POS disambiguation
  - "well": adverb or noun? Oil well.
  - Statistical methods: Brill tagger
  - Syntactic analysis: syntactic disambiguation
- Word sense disambiguation
  - bank1 and bank2 should be different stems
  - Statistical methods
  - Dictionary-based methods: Lesk algorithm
  - Semantic analysis
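A minimal sketch of the Lesk idea, with invented glosses and sense labels: pick the sense whose dictionary definition shares the most words with the query context:

```python
def lesk(context_words, sense_definitions):
    """Simplified Lesk: choose the sense whose gloss overlaps the context most."""
    context = set(context_words)
    def overlap(sense):
        return len(context & set(sense_definitions[sense].split()))
    return max(sense_definitions, key=overlap)
```

Real implementations also weight the overlapping words and expand glosses with related senses; this sketch only shows the core overlap criterion.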

9 Thesaurus
Terms (controlled vocabulary) and relationships
- Terms
  - used for indexing
  - represent a concept; one word or a phrase, usually nouns
  - sense: a definition or notes to distinguish senses: key (door)
- Relationships
  - Paradigmatic: synonymy, hierarchical (is-a, part-of), non-hierarchical
  - Syntagmatic: collocations, co-occurrences
- WordNet, EuroWordNet
  - synsets

10 Use of thesaurus
- To help the user formulate the query
  - Navigation in the hierarchy of words
  - Yahoo!
- For the program, to collate related terms
  - woman ↔ female
  - fuzzy comparison: woman ≈ 0.8 * female; path length
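The path-length idea can be sketched over a toy is-a hierarchy; all names and the 1/(1 + length) scoring below are illustrative assumptions (WordNet-style tools use comparable path measures):

```python
def path_similarity(term_a, term_b, parent):
    """Similarity = 1 / (1 + shortest path length through the is-a hierarchy)."""
    def ancestors(term):
        chain, node = [term], term
        while node in parent:          # parent maps child concept -> parent concept
            node = parent[node]
            chain.append(node)
        return chain
    chain_a, chain_b = ancestors(term_a), ancestors(term_b)
    for depth_a, node in enumerate(chain_a):
        if node in chain_b:            # first common ancestor found
            return 1.0 / (1 + depth_a + chain_b.index(node))
    return 0.0                         # no common ancestor
```

With `woman → female → person`, the score for woman/female is higher than for woman/man, which is exactly the fuzzy-comparison effect the slide describes.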

11 Yahoo! vs. thesaurus
- The book says Yahoo! is based on a thesaurus. I disagree
- Thesaurus: the words of a language organized in a hierarchy
- Document hierarchy: documents attached to a hierarchy
  - This is word sense disambiguation
- I claim that Yahoo! is based on (manual) WSD
  - It also uses a thesaurus for navigation

12 Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

13 Document clustering
An operation on the whole collection
- Global vs. local
- Global: the whole collection
  - At compile time; a one-time operation
- Local
  - Cluster the results of a specific query
  - At runtime, with each query
  - This is more a query transformation operation
  - Already discussed in Chapter 5

14 Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

15 Compression
- Gain: storage, transmission, search
- Loss: time spent compressing and decompressing
- In IR: need for random access
  - Blocks do not work
- Also: pattern matching on compressed text

16 Compression methods
Statistical
- Huffman: fixed size per symbol
  - More frequent symbols get shorter codes
  - Allows decompression to start from any symbol
- Arithmetic: dynamic coding
  - Must decompress from the beginning
  - Not for IR
Dictionary
- Pointers to previous occurrences: Lempel-Ziv
  - Again, not for IR

17 Compression ratio
Size compressed / size decompressed
- Huffman, units = words: up to 2 bits per char
  - Close to the limit = entropy. Only for large texts!
  - Other methods: similar ratio, but no random access
- Shannon: the optimal code length for a symbol with probability p is -log2 p
- Entropy: the limit of compression
  - Average length with optimal coding
  - A property of the model
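Shannon's formula can be checked directly; the sketch below computes the order-0 (independent-symbols) entropy of a string, the average of -log2 p over the text:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(text):
    """Shannon entropy of the text's symbol distribution: sum of -p * log2(p).
    This is the compression limit (bits/symbol) for an order-0 model."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Two equiprobable symbols give exactly 1 bit per symbol, four give 2 bits; skewed distributions fall below these values, which is what a Huffman coder exploits.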

18 Modeling
Find the probability of the next symbol
- Adaptive, static, semi-static
  - Adaptive: good compression, but must start from the beginning
  - Static (for the language): poor compression, random access
  - Semi-static (for the specific text; two passes): both OK
- Word-based vs. character-based
  - Word-based: better compression and search

19 Huffman coding
- Each symbol is encoded sequentially
- More frequent symbols have shorter codes
- No code is a prefix of another one
- How to build the tree: see the book
- Byte codes are better
  - They allow for sequential search
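A compact sketch of the tree construction using a priority queue; for brevity it merges partial code tables instead of building an explicit tree (an illustrative variant, not the book's construction):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Repeatedly merge the two least frequent subtrees; symbols in the first
    get a '0' prefix, symbols in the second a '1' prefix. The result is a
    prefix-free code with shorter codes for more frequent symbols."""
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)           # unique counter so dicts are never compared
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

The test below checks the two slide properties directly: frequent symbols get shorter codes, and no code is a prefix of another.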

20 Dictionary-based methods
- Static (simple, poor compression), dynamic, semi-static
- Lempel-Ziv: references to previous occurrences
  - Adaptive
- Disadvantages for IR
  - Must decode from the very beginning
  - Newer statistical methods perform better
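The references-to-previous-occurrences idea can be sketched in the LZ78 style, where each output pair points into a phrase dictionary built adaptively during the scan (a simplified illustration; real Lempel-Ziv variants differ in detail):

```python
def lz78_compress(text):
    """LZ78: emit (index of longest known phrase, next char), growing the
    phrase dictionary as we go -- this is why decoding must start at the front."""
    dictionary = {"": 0}
    output, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                     # extend the matched phrase
        else:
            output.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                               # flush a trailing matched phrase
        output.append((dictionary[current], ""))
    return output

def lz78_decompress(pairs):
    """Rebuild the same phrase dictionary while decoding."""
    entries, out = [""], []
    for index, ch in pairs:
        phrase = entries[index] + ch
        entries.append(phrase)
        out.append(phrase)
    return "".join(out)
```

Note that every pair refers back to an earlier dictionary entry, which illustrates the IR disadvantage on the slide: there is no way to start decoding in the middle.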

21 21 Comparison of methods

22 Compression of inverted files
- Inverted file: words + lists of the docs where they occur
- The lists of docs are ordered, so they can be compressed
- Seen as lists of gaps
  - Short gaps occur more frequently
  - Statistical compression
- Our work: reorder the docs for better compression
  - We code runs of docs
  - Minimize the number of runs
  - Distance: the number of differing words
  - TSP
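The gap idea can be sketched together with a simple variable-byte coder for the gaps (an illustrative scheme; the statistical compression on the slide would replace this byte code):

```python
def to_gaps(doc_ids):
    """Turn a sorted, non-empty posting list into its first id plus gaps;
    small gaps dominate, which is what makes the lists compressible."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(gaps):
    """Variable-byte code: 7 data bits per byte, most significant byte first;
    the high bit marks the last byte of each number."""
    out = bytearray()
    for n in gaps:
        chunk = []
        while True:
            chunk.append(n % 128)    # least significant 7 bits first
            n //= 128
            if n == 0:
                break
        chunk[0] |= 128              # flag the final (least significant) byte
        out.extend(reversed(chunk))  # emit most significant byte first
    return bytes(out)
```

Small gaps fit in one byte each, so a dense posting list costs roughly one byte per document instead of a full integer.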

23 Research topics
- All of computational linguistics
  - Improved POS tagging
  - Improved WSD
- Uses of a thesaurus
  - for user navigation
  - for collating similar terms
- Better compression methods
  - Searchable compression
  - Random access

24 Conclusions
- Text transformation: meaning instead of strings
  - Lexical analysis
  - Stopwords
  - Stemming; POS, WSD, syntax, semantics
  - Ontologies to collate similar stems
- Text compression
  - Searchable
  - Random access
  - Word-based statistical methods (Huffman)
- Index compression

25 Thank you! Until the compensation lecture

