Special Topics in Computer Science
The Art of Information Retrieval
Chapter 7: Text Operations
Alexander Gelbukh, www.Gelbukh.com

Previous chapter: Conclusions
- Modeling of text helps predict the behavior of systems
  - Zipf's law, Heaps' law
- Describing the structure of documents formally allows part of their meaning to be treated automatically, e.g., in search
- Languages to describe document syntax:
  - SGML: too expensive
  - HTML: too simple
  - XML: a good combination

Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

Linguistic operations
- Purpose: convert words to "meanings"
- Synonyms or related words: different words, same meaning
  - Morphology: foot / feet; woman / female
- Homonyms: same word, different meanings (word senses)
  - river bank / financial bank
- Stopwords: words with no meaning of their own (functional words)
  - the

For good or for bad?
- For: more exact matching; less noise, better recall
- Against:
  - Unexpected behavior, difficult for users to grasp
  - Harms if it introduces errors
  - More expensive: adds a whole new technology; maintenance; language-dependent
  - Slows processing down
- Good if done well, harmful if done badly

Document preprocessing
- Lexical analysis (punctuation, case): simple, but one must be careful
- Stopwords: reduce index size and processing time
- Stemming: connected, connection, connections, ...
- Multiword expressions: hot dog, B-52
  - Here, all the power of linguistic analysis can be used
- Selection of index terms: often nouns; noun groups: computer science
- Construction of a thesaurus: synonymy, a network of related concepts (words or phrases)
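
As an illustration, a minimal sketch of this preprocessing chain in Python; the stopword list and suffix rules below are toy placeholders, not real linguistic resources:

```python
# Sketch of the preprocessing chain from the slide:
# lexical analysis (case, punctuation), stopword removal, crude suffix stripping.
import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}  # tiny illustrative list
SUFFIXES = ("ions", "ion", "ed", "ing", "s")                          # naive, not Porter

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword elimination
    stems = []
    for t in tokens:                                      # crude suffix stripping
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

print(preprocess("The connections of the network are connected."))
# ['connect', 'network', 'connect'] -- related forms collate to one index term
```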

Stemming methods
- Linguistic analysis: complex, expensive maintenance
- Table lookup: simple, but needs data
- Statistical (Avetisyan): needs no data, but imprecise
- Suffix removal: Porter algorithm (Martin Porter; ready-made code on his website)
  - Substitution rules: sses → ss, s → (empty); e.g., stresses → stress
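
For concreteness, a sketch of the quoted substitution rules; this covers only the first step of the Porter algorithm (step 1a), applied as an ordered list where the first matching rule wins:

```python
# Ordered substitution rules of Porter step 1a; first match wins.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1a(word):
    for suffix, repl in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + repl
    return word

for w in ("stresses", "ponies", "caress", "cats"):
    print(w, "->", step1a(w))
# stresses -> stress, ponies -> poni, caress -> caress, cats -> cat
```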

Better stemming
- The whole problem area of computational linguistics
- POS disambiguation: well → adverb or noun? (oil well)
  - Statistical methods: Brill tagger
  - Syntactic analysis: syntactic disambiguation
- Word sense disambiguation: bank1 and bank2 should be different stems
  - Statistical methods
  - Dictionary-based methods: Lesk algorithm
  - Semantic analysis
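
A much-simplified sketch of the dictionary-based (Lesk-style) idea: pick the sense whose gloss shares the most words with the context. The glosses below are invented for illustration, not taken from a real dictionary:

```python
# Simplified Lesk: choose the sense with the largest gloss/context word overlap.
GLOSSES = {                       # toy glosses, not a real lexical resource
    "bank/1": "sloping land beside a river or lake",
    "bank/2": "financial institution that accepts deposits and makes loans",
}

def lesk(context_words, glosses=GLOSSES):
    context = set(context_words)
    def overlap(sense):
        return len(context & set(glosses[sense].split()))
    return max(glosses, key=overlap)

print(lesk("he sat on the bank of the river".split()))            # bank/1
print(lesk("the bank raised the interest on deposits".split()))   # bank/2
```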

Thesaurus
- Terms (controlled vocabulary) and relationships
- Terms used for indexing represent a concept; one word or a phrase; usually nouns
  - One sense per term; definitions or notes to distinguish senses: key (door)
- Relationships
  - Paradigmatic: synonymy, hierarchical (is-a, part-of), non-hierarchical
  - Syntagmatic: collocations, co-occurrences
- WordNet, EuroWordNet: synsets

Use of a thesaurus
- To help the user formulate the query
  - Navigation in the hierarchy of words: Yahoo!
- For the program, to collate related terms
  - woman → female
  - Fuzzy comparison: woman ≈ 0.8 * female; path length
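
A tiny sketch of weighted term collation in this spirit; the thesaurus fragment and the weights are made up for illustration:

```python
# Fuzzy term comparison via a (hypothetical) weighted thesaurus.
RELATED = {
    ("woman", "female"): 0.8,
    ("car", "automobile"): 0.9,
}

def term_similarity(a, b):
    if a == b:
        return 1.0
    return RELATED.get((a, b)) or RELATED.get((b, a)) or 0.0

print(term_similarity("woman", "female"))   # 0.8
print(term_similarity("woman", "car"))      # 0.0
```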

Yahoo! vs. thesaurus
- The book says Yahoo! is based on a thesaurus; I disagree
  - Thesaurus: words of the language organized in a hierarchy
  - Document hierarchy: documents attached to a hierarchy
  - This is word sense disambiguation
- I claim that Yahoo! is based on (manual) WSD
  - It also uses a thesaurus for navigation

Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

Document clustering
- Operation on the whole collection: global vs. local
- Global: the whole collection
  - At compile time; a one-time operation
- Local: cluster the results of a specific query
  - At runtime, with each query
  - More of a query transformation operation
  - Already discussed in Chapter 5
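
A minimal sketch of the local case: group the documents returned for a query by cosine similarity of their term-frequency vectors. The threshold and the link-to-cluster-seed strategy are arbitrary choices made only for illustration:

```python
# Group query results by cosine similarity of term-frequency vectors.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.5):
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = []
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(v, vectors[c[0]]) >= threshold:   # compare to the cluster seed
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

results = ["river bank fishing", "fishing on the river bank", "bank interest rates"]
print(cluster(results))   # [[0, 1], [2]]
```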

Text operations
- Linguistic operations
- Document clustering
- Compression
- Encryption (not discussed here)

Compression
- Gain: storage, transmission, search
- Loss: time spent compressing and decompressing
- In IR: need for random access; block-oriented schemes do not work
- Also: pattern matching on compressed text

Compression methods
- Statistical
  - Huffman: fixed code per symbol; more frequent symbols get shorter codes
    - Allows starting decompression from any symbol
  - Arithmetic: dynamic coding
    - Needs to decompress from the beginning; not for IR
- Dictionary
  - Pointers to previous occurrences: Lempel-Ziv
  - Again, not for IR

Compression ratio
- Compressed size / uncompressed size
- Huffman with words as symbols: about 2 bits per character
  - Close to the limit, the entropy; only for large texts!
- Other methods: similar ratio, but no random access
- Shannon: the optimal code length for a symbol with probability p is -log2(p)
- Entropy:
  - The limit of compression
  - The average code length under optimal coding
  - A property of the model
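
A small sketch of this bound: -log2(p) bits per symbol, averaged over the model. Word-based and character-based models of the same toy string give different limits:

```python
# Entropy of a symbol distribution: sum of p * -log2(p), the compression limit.
from collections import Counter
from math import log2

def entropy(symbols):
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum((c / total) * -log2(c / total) for c in counts.values())

text = "to be or not to be"
print(f"word-model entropy: {entropy(text.split()):.2f} bits/word")
print(f"char-model entropy: {entropy(text):.2f} bits/char")
```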

Modeling
- Find the probability of the next symbol
- Adaptive, static, semi-static
  - Adaptive: good compression, but decompression must start from the beginning
  - Static (for the language as a whole): poor compression, but random access
  - Semi-static (for the specific text; two-pass): both OK
- Word-based vs. character-based
  - Word-based: better compression and search

Huffman coding
- Each symbol is encoded sequentially
- More frequent symbols have shorter codes
- No code is a prefix of another one
- How to build the tree: see the book
- Byte codes are better: they allow sequential search
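
A compact sketch of standard Huffman code construction over word frequencies (a semi-static, word-based model); this follows the usual bit-level algorithm, not the byte-oriented variant mentioned above:

```python
# Build Huffman codes by repeatedly merging the two least frequent subtrees.
import heapq
from collections import Counter

def huffman_codes(symbols):
    counts = Counter(symbols)
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("to be or not to be".split())
print(codes)   # a prefix-free code; with skewed frequencies, frequent words get shorter codes
```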

Dictionary-based methods
- Static (simple, poor compression), dynamic, semi-static
- Lempel-Ziv: references to previous occurrences; adaptive
- Disadvantages for IR:
  - Need to decode from the very beginning
  - Newer statistical methods perform better

Comparison of methods

Compression of inverted files
- Inverted file: words + lists of the documents where they occur
- Lists of documents are ordered, so they can be compressed
  - Seen as lists of gaps; short gaps occur more frequently
  - Statistical compression
- Our work: reorder the documents for better compression
  - We code runs of documents; minimize the number of runs
  - Distance: number of differing words; reduces to the traveling salesman problem (TSP)
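
A sketch of the gap representation of a posting list, plus one simple variable-byte coding of the gaps (one of several possible codings, chosen here only for illustration):

```python
# Posting list -> gaps -> variable-byte code (7 data bits per byte,
# high bit set on the final byte of each number).
def to_gaps(doc_ids):
    return [d - p for d, p in zip(doc_ids, [0] + doc_ids[:-1])]

def vbyte_encode(gaps):
    out = bytearray()
    for g in gaps:
        chunk = []
        while True:
            chunk.append(g & 0x7F)
            g >>= 7
            if not g:
                break
        chunk[0] |= 0x80                 # marks the last byte of this number
        out.extend(reversed(chunk))      # emit high-order groups first
    return bytes(out)

postings = [3, 7, 8, 150, 152]
gaps = to_gaps(postings)                 # [3, 4, 1, 142, 2] -- small gaps dominate
print(gaps, len(vbyte_encode(gaps)), "bytes")
```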

Research topics
- All of computational linguistics
  - Improved POS tagging
  - Improved WSD
- Uses of a thesaurus
  - For user navigation
  - For collating similar terms
- Better compression methods
  - Searchable compression
  - Random access

Conclusions
- Text transformation: meaning instead of strings
  - Lexical analysis
  - Stopwords
  - Stemming
  - POS, WSD, syntax, semantics
  - Ontologies to collate similar stems
- Text compression
  - Searchable, with random access
  - Word-based statistical methods (Huffman)
- Index compression

Thank you! Till the compensation lecture.