Hokkaido University Lecture on Information Knowledge Network "Information retrieval and pattern matching" Laboratory of Information Knowledge Network,

Presentation transcript:

Hokkaido University
Lecture on Information Knowledge Network "Information retrieval and pattern matching"
Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA
2011/11/22

The 6th: Pattern matching on compressed text
– About data compression
– Motivation and aim of this study
– Pattern matching on Huffman encoded text
– Pattern matching on LZW compressed text
– Unified framework: Collage system
– Speeding up pattern matching by text compression: BPE compression

About data compression
Lossless compression
– Entropy encoding: Huffman encoding, arithmetic encoding
– Non-universal encoding: run-length
– Universal encoding
  – dictionary-based: LZ77, LZ78, LZW
  – grammar-based: Sequitur, BPE
  – sort-based: BWT
  – statistical: PPM
Lossy compression (used for image and voice data)
– JPEG, MPEG, MP3
Reference: I. H. Witten, A. Moffat, and T. C. Bell: Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann Pub., 1999.

Aim of this study
Usual approach: compressed text → decompress → original text → ordinary pattern matching machine.
Aim: compressed text → pattern matching machine for compressed texts.

Example of applications
Personal databases: directories, schedule tables, E2J/J2E dictionaries, business cards, short memos, e-books (KOJIEN).
We want to pack as much data as possible into a small computer such as a mobile phone or a PDA!
Because the amount of memory is small, constructing an extra index structure isn't a good solution!
However, we want to retrieve at high speed!
[Pictured devices: sharp mi110, V601T]

Difficulty of PM on compressed texts
[Figure: document files and the corresponding compressed document files]
1. The starting position of each codeword is invisible.
2. The representation of each string is not unique.

Our goal
– Decompress-then-search method
– Search-on-the-fly method
– Search-without-decompress method
Goal: do pattern matching faster than the above!

Lempel-Ziv-Welch (LZW) compression
Text T (phrase by phrase): a b ab ab ba b c aba bc abab
Compressed text E(T): one dictionary code per phrase [code numbers shown in the figure]
LZW is used in the UNIX compress command, the GIF image format, and so on.
T. A. Welch: A technique for high performance data compression, IEEE Comput., Vol. 17, pp. 8-19.
Let D be the set of strings entered in the dictionary trie:
D = {a, b, c, ab, ba, bc, ca, aba, abb, bab, bca, abab}
|D| = O(compressed text length)
Dictionary trie: the root has the children a, b, c; the added nodes are 4 = ab, 5 = ba, 6 = aba, 7 = abb, 8 = bab, 9 = bc, 10 = ca, 11 = abab, 12 = bca.
D is constructed adaptively.
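The adaptive construction of D can be illustrated with a minimal LZW encoder. This is only a sketch under stated assumptions: the function name `lzw_compress` is hypothetical, the alphabet {a, b, c} is numbered 1 to 3, and new entries are numbered from 4 in the order they are created, which agrees with the node numbers of the dictionary trie above.

```python
def lzw_compress(text, alphabet):
    """Return (codes, dictionary) for `text`; code numbers start at 1."""
    # Initial dictionary: one entry per alphabet symbol, numbered 1, 2, 3, ...
    dictionary = {ch: i for i, ch in enumerate(alphabet, start=1)}
    codes = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the current phrase
        else:
            codes.append(dictionary[phrase])  # emit the longest known phrase
            dictionary[phrase + ch] = len(dictionary) + 1  # new trie node
            phrase = ch
    if phrase:
        codes.append(dictionary[phrase])
    return codes, dictionary


# The text of the slide: a b ab ab ba b c aba bc abab
codes, dictionary = lzw_compress("abababbabcababcabab", "abc")
print(codes)
print(sorted(dictionary, key=dictionary.get))
```

Each emitted code corresponds to one phrase of T, and every step except the last adds exactly one string to D, which is why |D| is proportional to the compressed text length.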

Move of the Aho-Corasick (AC) pattern matching machine
AC machine for the pattern set Π = {aba, ababb, abca, bb}
[Figure: the AC machine with its goto function, failure function, and output sets { }; the non-empty output sets are {aba}, {ababb, bb}, {abca}, and {bb}]
Text: abababba
Output: aba, aba, bb, ababb
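The move of the machine on this example can be reproduced with a small sketch of the textbook Aho-Corasick construction. The names `build_ac`, `step`, and `ac_search` are hypothetical, and positions are reported as 1-based end positions of the occurrences.

```python
from collections import deque


def build_ac(patterns):
    """Build the goto, failure and output tables; state 0 is the root."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:  # goto function: a trie of the patterns
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        out[q].add(p)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:  # breadth-first computation of the failure function
        q = queue.popleft()
        for ch, r in goto[q].items():
            queue.append(r)
            f = fail[q]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[r] = goto[f].get(ch, 0)
            out[r] |= out[fail[r]]  # inherit outputs along the failure link
    return goto, fail, out


def step(goto, fail, q, ch):
    """One move of the machine: follow failure links until a goto edge exists."""
    while q and ch not in goto[q]:
        q = fail[q]
    return goto[q].get(ch, 0)


def ac_search(patterns, text):
    goto, fail, out = build_ac(patterns)
    q, hits = 0, []
    for i, ch in enumerate(text, start=1):
        q = step(goto, fail, q, ch)
        hits.extend((i, p) for p in sorted(out[q]))
    return hits


print(ac_search(["aba", "ababb", "abca", "bb"], "abababba"))
```

The state reached after reading ababb carries both ababb and bb, because the output set of a state includes the outputs found along its failure links.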

Idea for doing pattern matching on LZW texts
To simulate the move of the AC machine on LZW compressed texts.
[Figure: the AC machine for Π = {aba, ababb, abca, bb} (goto function, failure function, output sets { }) reading the text abababba one compressed phrase at a time; output: aba, aba, bb, ababb]
T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern matching in LZW compressed text, Proc. Data Compression Conference, IEEE Computer Society, Mar. 1998.

Core functions: Jump and Output
Can we compute the two functions Jump and Output efficiently?
– Function Jump(q, u) simulates the consecutive transitions caused by string u in O(1) time. Its domain is Q × D, and it returns the state number of the AC machine. A naïve implementation needs $O(m|D|)$ space, but it can be realized in $O(m^2 + |D|)$ space!
– Function Output(q, u) reports, in O(r) time, the occurrences within the string obtained by concatenating the string corresponding to state q and the string u. Its domain is Q × D, and it returns the set of pattern IDs. A naïve implementation needs $O(m|D|)$ space, but it can be realized in $O(m^2 + |D|)$ space!

function Jump
Let δ(q, u) be the (extended) state transition function of the AC machine: δ(q, u) returns the state reached from state q by string u.
First formulation, with $O(m^3)$ space for the first case and $O(|D|)$ space for the second:
$\mathrm{Jump}(q, u) = \begin{cases} \delta(q, u) & \text{if } u \text{ is a factor of some pattern,} \\ \delta(\varepsilon, u) & \text{otherwise.} \end{cases}$
Improved formulation, with $O(m^2)$ space for the first case and $O(|D|)$ space for the second:
$\mathrm{Jump}(q, u) = \begin{cases} \mathrm{Ancestor}(N'_1(q, u'),\ |u'| - |u|) & \text{if } u \text{ is a factor of some pattern,} \\ \delta(\varepsilon, u) & \text{otherwise.} \end{cases}$
Here u' is the string corresponding to the nearest ancestor node of u that is also explicit on the generalized suffix trie for P.

function Output
Let $\tilde{u}$ be the longest prefix of u that is also a suffix of a pattern.
$A(u) = \{ \langle i, p \rangle \mid p \in \Pi,\ |\tilde{u}| < i \le |u|,\ |p| \le i,\ \text{and } u[i - |p| + 1 .. i] = p \}$
$\mathrm{Output}(q, u) = \mathrm{Output}(q, \tilde{u}) \cup A(u)$
[Figure: occurrences p1 and p2 in the concatenation of q and u; those overlapping q end inside $\tilde{u}$]
The two parts take $O(m^2)$ and $O(|D|)$ space.
Note that state q corresponds to a prefix of some pattern.

Pseudo code of the algorithm of Kida et al. [1998]
PMonLZW(E(T) = u_1 u_2 … u_n, Π: pattern set)
  Construct the AC machine and the generalized suffix trie for Π;
  Initialize the dictionary trie for E(T);
  Preprocess Jump(q, u) and Output(q, u) for any state q and any u that is a factor of a pattern π ∈ Π;
  l ← 0;
  q ← q_0;
  for i ← 1 … n do
    for each ⟨d, π⟩ ∈ Output(q, u_i) do
      report that pattern π occurs at position l + d;
    end of for
    q ← Jump(q, u_i);
    l ← l + |u_i|;
    Update the dictionary trie; /* enter the string for node u_{i+1} into D */
    Update the variables for Jump(q, u_{i+1}) and Output(q, u_{i+1}); /* compute δ(ε, u_{i+1}), A(u_{i+1}), ũ_{i+1}, and |u_{i+1}| by using its parent's information */
  end of for
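The control flow of the pseudo code can be tried out with a deliberately naive sketch. It is not the $O(m^2 + |D|)$-space technique of the slides: here Jump and Output for a pair (state, code) are obtained by running the AC machine over the phrase once and memoizing the result, and phrases are stored as plain strings. All names (`build_delta`, `search_lzw`, `jump_output`) are hypothetical, and the dictionary is updated on the decoder side (previous phrase plus the first symbol of the current one).

```python
from collections import deque


def build_delta(patterns, alphabet):
    """AC machine as a complete DFA: delta[q][ch] -> state, out[q] -> patterns."""
    delta, out = [{}], [set()]
    for p in patterns:  # trie of the patterns (goto function)
        q = 0
        for ch in p:
            if ch not in delta[q]:
                delta.append({})
                out.append(set())
                delta[q][ch] = len(delta) - 1
            q = delta[q][ch]
        out[q].add(p)
    fail = [0] * len(delta)
    queue = deque()
    for ch in alphabet:  # complete the root
        if ch in delta[0]:
            queue.append(delta[0][ch])
        else:
            delta[0][ch] = 0
    while queue:  # breadth-first: fold the failure function into delta
        q = queue.popleft()
        out[q] |= out[fail[q]]
        for ch in alphabet:
            if ch in delta[q]:
                fail[delta[q][ch]] = delta[fail[q]][ch]
                queue.append(delta[q][ch])
            else:
                delta[q][ch] = delta[fail[q]][ch]
    return delta, out


def search_lzw(codes, patterns, alphabet):
    """Report (end position, pattern) pairs while reading LZW codes one by one."""
    delta, out = build_delta(patterns, alphabet)
    phrases = {i: ch for i, ch in enumerate(alphabet, start=1)}  # code -> string
    cache = {}  # (state, code) -> (next state, [(offset, pattern), ...])

    def jump_output(q, code):
        if (q, code) not in cache:  # naive: walk the phrase once, then reuse
            hits, s = [], q
            for d, ch in enumerate(phrases[code], start=1):
                s = delta[s][ch]
                hits.extend((d, p) for p in sorted(out[s]))
            cache[(q, code)] = (s, hits)
        return cache[(q, code)]

    q, pos, prev, result = 0, 0, None, []
    for code in codes:
        if prev is not None:  # dictionary update, as an LZW decoder does it
            first = phrases[code][0] if code in phrases else phrases[prev][0]
            phrases[len(phrases) + 1] = phrases[prev] + first
        q_next, hits = jump_output(q, code)
        result.extend((pos + d, p) for d, p in hits)
        q, pos, prev = q_next, pos + len(phrases[code]), code
    return result


print(search_lzw([1, 2, 4, 4, 5, 2, 3, 6, 9, 11], ["aba", "ababb", "abca", "bb"], "abc"))
```

The memoized table can grow to one entry per pair of state and code, which is the $O(m|D|)$ space of the naive way; the constructions on the preceding slides are what bring this down.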

The result of Kida et al. [1998]
The original idea is from
– A. Amir, G. Benson, and M. Farach: Let sleeping files lie: Pattern matching in Z-compressed files, J. Computer and System Sciences, Vol. 52. It simulates KMP on LZW compressed texts.
By simulating the Aho-Corasick (AC) pattern matching machine, we can do multiple pattern matching.
– It takes $O(m^2 + |D|)$ time and space for preprocessing.
– It scans compressed texts in $O(n + r)$ time with $O(m^2 + |D|)$ space for multiple patterns, and reports all the occurrences.
This first appeared in T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern matching in LZW compressed text, Proc. Data Compression Conference, IEEE Computer Society, Mar. 1998.
The journal version appears in T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple Pattern Matching in LZW Compressed Text, Journal of Discrete Algorithms, 1(1), Hermes Science Publishing, Dec.

Idea for applying bit-parallel technique
[Figure: Shift-And mask bits for a pattern P over the alphabet {a, b, c} and a text T; a whole compressed phrase is processed with a single "Jump!"]
T. Kida, M. Takeda, A. Shinohara, and S. Arikawa: Shift-And approach to pattern matching in LZW compressed text, Proc. CPM'99, LNCS 1645, pp. 1-13, Springer-Verlag, Jul. 1999.

Extended state updating function f
For any $a \in \Sigma$, $u \in \Sigma^*$, and $S \subseteq \{1, \ldots, m\}$, we define as follows, where $S \ll k$ denotes $\{ i + k \mid i \in S,\ i + k \le m \}$:
– $M(a) = \{ 1 \le i \le m \mid P[i] = a \}$
– $f(S, a) = ((S \ll 1) \cup \{1\}) \cap M(a)$
– $f(S, \varepsilon) = S$ and $f(S, ua) = f(f(S, u), a)$
– $M(u) = f(\{1, \ldots, m\}, u)$
Then, for any $u \in \Sigma^*$ and $S \subseteq \{1, \ldots, m\}$,
– $f(S, u) = ((S \ll |u|) \cup \{1, \ldots, |u|\}) \cap M(u)$
f(S, u) is computed in O(1) time; the masks M(u) for all u in D take O(|D|) time and space.
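The identity for f(S, u) can be checked with a short bit-mask sketch. It assumes that position i of the pattern is stored in bit i-1 of a Python integer; the function names are hypothetical, and M(u) is recomputed on every call here, whereas the slides store it once per dictionary entry.

```python
def char_masks(pattern):
    """M(a): bit i-1 is set iff the i-th symbol of the pattern equals a."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks


def f_char(S, a, masks):
    """One Shift-And step: f(S, a) = ((S << 1) | {1}) & M(a)."""
    return ((S << 1) | 1) & masks.get(a, 0)


def string_mask(u, pattern, masks):
    """M(u) = f({1..m}, u), obtained by folding f_char over the symbols of u."""
    S = (1 << len(pattern)) - 1
    for a in u:
        S = f_char(S, a, masks)
    return S


def f_string(S, u, pattern, masks):
    """Jump over a whole phrase: f(S, u) = ((S << |u|) | {1..|u|}) & M(u)."""
    return ((S << len(u)) | ((1 << len(u)) - 1)) & string_mask(u, pattern, masks)


# Illustrative pattern and phrases (not taken from the slides).
pattern = "abab"
masks = char_masks(pattern)
S = f_string(0, "ab", pattern, masks)   # state after reading the phrase "ab"
S = f_string(S, "ab", pattern, masks)   # state after reading "abab"
found = bool((S >> (len(pattern) - 1)) & 1)  # bit m set: an occurrence ends here
```

Because M(u) only has bits inside the pattern length, the final AND also discards anything shifted beyond position m.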

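function Output (bit-parallel type)
Definition, where $m \ominus S$ denotes $\{ m - i \mid i \in S \}$:
– $\mathrm{Output}(S, u) = \{ 1 \le j \le |u| \mid m \in f(S, u[1..j]) \}$
– $U(u) = \{ 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = P[m-i+1..m] \}$
– $A(u) = \{ 1 \le i \le |u| \mid m \le i \text{ and } u[i-m+1..i] = P \}$
– $\mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup A(u)$
U(u) and A(u) for all u in D take O(|D|) time and space.
[Figure: occurrences of P in the concatenation of q and u; those overlapping q are given by $(m \ominus S) \cap U(u)$, those inside u by $A(u)$]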
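The agreement between the definition of Output(S, u) and its closed form can be checked on small inputs. The sketch below uses Python sets of 1-based positions rather than machine words, purely for readability; the function names are hypothetical.

```python
def step(S, a, P):
    """f(S, a) on sets of 1-based positions: shift by one, add 1, keep matches."""
    return {i + 1 for i in S | {0} if i < len(P) and P[i] == a}


def output_by_definition(S, u, P):
    """All j such that m is in f(S, u[1..j]), found by reading u symbol by symbol."""
    hits = set()
    for j, a in enumerate(u, start=1):
        S = step(S, a, P)
        if len(P) in S:
            hits.add(j)
    return hits


def output_by_formula(S, u, P):
    """((m - S) & U(u)) | A(u), using only S and tables that depend on u alone."""
    m = len(P)
    # U(u): proper suffixes of P that are prefixes of u (occurrences crossing into u)
    U = {i for i in range(1, len(u) + 1) if i < m and u[:i] == P[m - i:]}
    # A(u): occurrences of P lying entirely inside u
    A = {i for i in range(m, len(u) + 1) if u[i - m:i] == P}
    return ({m - i for i in S} & U) | A


# Illustrative values: the prefix "ab" of P = "aba" has been matched, then u = "aba".
print(output_by_definition({2}, "aba", "aba"), output_by_formula({2}, "aba", "aba"))
```

The closed form separates the part that depends on the current state S from the tables U(u) and A(u), which depend on the dictionary entry alone.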

The result of Kida et al. [1999]
Kida et al. [1999] applied the bit-parallel technique based on the Shift-And method to the functions Jump and Output to speed them up.
– It uses $O(m + |\Sigma|)$ time and space for preprocessing.
– For a given pattern, it scans a given compressed text in $O(n + r)$ time and $O(m + |D|)$ space, and it reports all the occurrences.
– Like the Shift-And method, it is easy to extend:
  – pattern matching for a generalized pattern
  – pattern matching allowing k mismatches
  – multiple pattern matching

Achievement of our aim!
[Figure: CPU time (sec.) against pattern length. Machine: AlphaStation XP1000 (Alpha 21264, 667 MHz), Tru64 UNIX V4.0F. Data: Genbank DNA base sequences, 17.1 Mbyte. Curves: compress (LZW) + KMP; gunzip (LZ77) + KMP; T. Kida et al. [1998]; speeding-up by bit-parallelism [1999]. The curves are grouped into search-on-the-fly methods and search-without-decompress methods.]

Take a breath
[Photo: RG Park]

A new goal!
Why do you need compressed PM? We have enough storage space now. Why do you compress small data like text documents?
If … the time for doing compressed pattern matching < the time for doing pattern matching on the original text.
This is Goal 2.

A new goal!
[Figure: CPU time (sec.) against pattern length, same setting as before: AlphaStation XP1000 (Alpha 21264, 667 MHz), Tru64 UNIX V4.0F; Genbank DNA base sequences, 17.1 Mbyte. Curves: compress (LZW) + KMP; gunzip (LZ77) + KMP; T. Kida et al. [1998]; speeding-up by bit-parallelism [1999]; and, added, matching by KMP on the original text.]
Goal: to be overwhelmingly faster than matching by KMP on the original text!

Byte Pair Encoding (BPE) method
Text (18 bytes): ABABCDEBDEFABDEABC
After substituting G for AB: GGCDEBDEFGDEGC
After substituting H for DE: GGCHBHFGHGC
After substituting I for GC (9 bytes): GIHBHFGHI
Dictionary: G = AB, H = DE, I = GC
Size of each symbol = 1 byte
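The substitutions above can be reproduced with a minimal greedy sketch. It assumes that the most frequent adjacent pair is replaced first, that the unused symbols G, H, I are supplied by the caller, and that substitution stops when no pair occurs at least twice; the names `bpe_compress` and `bpe_expand` are hypothetical.

```python
from collections import Counter


def bpe_compress(text, fresh_symbols):
    """Greedy byte pair encoding; returns (compressed text, {symbol: pair})."""
    rules = {}
    for sym in fresh_symbols:
        # Count adjacent pairs (overlapping pairs such as "AA" in "AAA" are
        # counted twice here; a careful implementation would avoid that).
        pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # replacing a pair that occurs once saves nothing
            break
        text = text.replace(pair, sym)
        rules[sym] = pair
    return text, rules


def bpe_expand(text, rules):
    """Undo the substitutions, newest rule first."""
    for sym in reversed(list(rules)):
        text = text.replace(sym, rules[sym])
    return text


compressed, rules = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
print(compressed, rules)
print(bpe_expand(compressed, rules))
```

Since every symbol of the compressed text still occupies one byte, codeword boundaries stay visible, which is the property that makes this format convenient for pattern matching.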

Achievement of Goal 2
[Figure: CPU time (sec.) against pattern length. Machine: AlphaStation XP1000 (Alpha 21264, 667 MHz), Tru64 UNIX V4.0F. Data: Medline English text, 60.3 Mbyte. Curves: matching by KMP on the original text; Agrep on the original text (the fastest among the previous methods); compressed PM on BPE, KMP type (search-without-decompress method); compressed PM on BPE, BM type, Shibata et al. (2000) (search-without-decompress method).]

Summarize the above…
[Figure: ranking, relative to the GOAL, of four settings: the original uncompressed text with the ordinary method, text compressed by BPE with the method for BPE, text compressed by LZW with the method for LZW, and text compressed by LZSS with the method for LZSS. Rank labels shown: 1, 3, 4, 2. Compression levels shown: low, medium, high.]
BPE gives low compression … but it is the most suitable for PM!

Paradigm shift 1
– Develop pattern matching algorithms for each compression method.
– Choosing a suitable compression method enables us to accelerate pattern matching!
– Develop a novel compression method which is suitable for pattern matching!

Data compression methods for PM
Dense coding type
– [ETDC] Nieves R. Brisaboa, Eva Lorenzo Iglesias, Gonzalo Navarro, and Jose R. Parama: An efficient compression code for text databases, In ECIR2003.
– [SCDC] Nieves R. Brisaboa, Antonio Farina, Gonzalo Navarro, and Maria F. Esteller: (s, c)-dense coding: An optimized compression code for natural language text databases, In SPIRE2003.
– [FibC] Shmuel Tomi Klein and Miri Kopel Ben-Nissan: Using Fibonacci compression codes as alternatives to dense codes, In DCC2008.
– [SVVC] Nieves R. Brisaboa, Antonio Farina, Juan-Ramon Lopez, Gonzalo Navarro, and Eduardo R. Lopez: A new searchable variable-to-variable compressor, In DCC2010.
VF coding type (including grammar-based compressions)
– [BPEX] Shirou Maruyama, Yohei Tanaka, Hiroshi Sakamoto, and Masayuki Takeda: Context-sensitive grammar transform: Compression and pattern matching, In SPIRE2008, LNCS 5280, Nov.
– [DynC] Shmuel T. Klein and Dana Shapira: Improved variable-to-fixed length codes, In SPIRE2008.
– [STVF] Takashi Uemura, Satoshi Yoshida, Takuya Kida, Tatsuya Asai, and Seishi Okamoto: Training parse trees for efficient VF coding, In SPIRE2010.

Paradigm shift 2
– We use data compression technology to reduce the cost of storing and transferring data.
– We can speed up pattern matching by compressing the data.
– Overcome the difficulties of various kinds of processing by using compression technology!

Doing something by using compression
– Speeding up the calculation of similarity between two long strings by a compression technique.
  A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices, M. Crochemore, G. M. Landau, and M. Ziv-Ukelson, Proceedings of the 13th Symposium on Discrete Algorithms, 2002.
– Processing a very large graph structure in memory at high speed by a compression technique.
  Shinichi Nakano (Gunma University): graph compression with query support. Their method can represent a triangulated planar graph in 2m + o(n) bits and can moreover support some queries on it.
– Speeding up query processing for XML data by a compression technique.
  Tetsuya Maita and Hiroshi Sakamoto (Kyushu Institute of Technology).

The 6th summary
Pattern matching algorithms on compressed texts
– Pattern matching on Huffman encoded text: an automaton with synchronization
– Pattern matching on LZW compressed text: simulating the move of KMP (AC) on the compressed text
Unified framework: Collage system
– A formal system to represent a text compressed by a dictionary-based compression method
– We have clarified what kinds of compression methods are suitable for pattern matching.
Speeding up pattern matching by compression
– BPE compression: it has a low compression ratio, but it can speed up pattern matching.
– Our experimental results showed that we could do pattern matching faster than on the original texts.
A big paradigm shift
– Data compression technology can be used for purposes other than reducing the data size.
The next theme (which is the final topic of "Information retrieval and pattern matching")
– Various topics I didn't mention