1
Association Clusters Definition
The frequency of a stem s_u in a document d_j is referred to as f_{u,j}. Let m = (m_{uj}) be an association matrix with |V| rows (one per stem) and |D| columns (one per document), where m_{uj} = f_{u,j}. Let m^t be the transpose of m. The matrix s = m m^t is a local stem-stem association matrix. Each element s_{u,v} in s expresses a correlation c_{u,v} between the stems s_u and s_v, namely

    c_{u,v} = \sum_{d_j \in D} f_{u,j} \times f_{v,j}    (5.5)

The correlation can be normalized as

    s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}    (5.6)
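A minimal Python sketch (not from the textbook) of how Eqs. (5.5) and (5.6) might be computed; the frequency matrix and all names here are illustrative assumptions.

# freq[u][j] = f_{u,j}: frequency of stem u in document j (made-up numbers)
freq = [
    [2, 0, 1, 3],
    [1, 1, 0, 2],
    [0, 4, 1, 0],
]

def correlation(u, v):
    # Eq. (5.5): c_{u,v} = sum over documents of f_{u,j} * f_{v,j}
    return sum(fu * fv for fu, fv in zip(freq[u], freq[v]))

def normalized(u, v):
    # Eq. (5.6): s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})
    c_uv = correlation(u, v)
    return c_uv / (correlation(u, u) + correlation(v, v) - c_uv)

print(correlation(0, 1))  # 8
print(normalized(0, 1))   # 8 / (14 + 6 - 8) = 0.666...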
2
Example for Association Clusters We have four documents d1, d2, d3, and d4; f_{u,j} is the frequency with which the stem s_u appears in document d_j.
3
Example for Association Clusters
5
Metric Clusters Definition
Let the distance r(k_i, k_j) between two keywords k_i and k_j be given by the number of words between them in a same document. If k_i and k_j are in distinct documents, we take r(k_i, k_j) = \infty. A local stem-stem metric correlation matrix s is defined as follows. Each element s_{u,v} of s expresses a metric correlation c_{u,v} between the stems s_u and s_v, namely

    c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}

where V(s_u) and V(s_v) are the sets of keywords whose stems are s_u and s_v. The correlation can be normalized as

    s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}
6
Metric Clusters
7
Example for Metric Clustering
… word □ □ □ □ □ polish …   (five words between "word" and "polish", so r = 5)
… words □ □ □ polishing …   (three words between "words" and "polishing", so r = 3)
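A small Python sketch of the metric correlation for this example; the distance values (5 and 3) are read off the two snippets above, and the keyword sets are assumptions for illustration.

# Keywords grouped by stem (the sets V(s_u) in the definition)
V_word = ["word", "words"]
V_polish = ["polish", "polishing"]

# r(ki, kj): number of words between the two keywords; infinity if the
# pair never co-occurs in a document (its contribution 1/r is then 0.0)
distances = {("word", "polish"): 5.0, ("words", "polishing"): 3.0}

c_uv = sum(1.0 / distances.get((ki, kj), float("inf"))
           for ki in V_word for kj in V_polish)
print(c_uv)  # 1/5 + 1/3 = 0.533...

# The normalized variant divides by the sizes of the keyword sets
print(c_uv / (len(V_word) * len(V_polish)))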
9
Chapter 7: Document Preprocessing (textbook) Document preprocessing is a procedure which can be divided mainly into five text operations (or transformations): (1) Lexical analysis of the text with the objective of treating digits, hyphens, punctuation marks, and the case of letters. (2) Elimination of stop-words with the objective of filtering out words with very low discrimination values for retrieval purposes.
10
Document Preprocessing (3) Stemming of the remaining words with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms (e.g., connect, connecting, connected, etc.). (4) Selection of index terms to determine which words/stems (or groups of words) will be used as indexing elements. Usually, the decision on whether a particular word will be used as an index term is related to the syntactic nature of the word. In fact, nouns frequently carry more semantics than adjectives, adverbs, and verbs.
11
Document Preprocessing (5) Construction of term categorization structures, such as a thesaurus, or extraction of structure directly represented in the text, to allow the expansion of the original query with related terms (usually a useful procedure).
12
Lexical Analysis of the Text Task: convert a stream of characters into a sequence of words.
– Spaces: multiple spaces are treated as a single space.
– Digits: ignoring numbers is common, but special cases matter: 1999 and 2000 standing for specific years are important, mixed alphanumerics such as 510B.C are important, and 16-digit numbers might be credit card numbers.
– Hyphens: "state-of-the-art" and "state of the art" should be treated as the same.
– Punctuation marks: remove them, with exceptions such as 510B.C.
– Case: lower and upper case letters are treated as the same; there are many exceptions, so in practice the process is semi-automatic.
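A minimal tokenizer sketch along these lines (illustrative, not the textbook's algorithm); exceptions such as 510B.C would need dedicated rules.

import re

def tokenize(text):
    text = text.lower()                    # fold case
    text = text.replace("-", " ")          # "state-of-the-art" -> four words
    return re.findall(r"[a-z0-9]+", text)  # drop remaining punctuation

print(tokenize("State-of-the-art systems,   in 1999."))
# ['state', 'of', 'the', 'art', 'systems', 'in', '1999']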
13
Elimination of Stopwords Words that appear too often are not useful for IR. Stopwords: words that appear in more than 80% of the documents in the collection are considered stopwords and are filtered out as potential index words.
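A sketch of this 80% rule, assuming documents are represented as sets of words; the tiny corpus is made up for illustration.

docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "cat", "ran"},
]
vocab = set().union(*docs)
# Words appearing in more than 80% of the documents become stopwords
stopwords = {w for w in vocab
             if sum(w in d for d in docs) / len(docs) > 0.8}
print(stopwords)  # {'the'}: appears in 3/3 = 100% of documents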
14
Stemming A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes or suffixes). Example: connect is the stem for {connected, connecting, connection, connections}. Porter's algorithm: uses a suffix list for suffix stripping, e.g., the rules sses → ss, s → (nothing), etc.
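A toy suffix-stripping sketch using just the two rules mentioned above; the real Porter algorithm has many more rules and side conditions.

def strip_suffix(word):
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss, e.g. "possesses" -> "possess"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]   # plain plural: "connections" -> "connection"
    return word

for w in ["connections", "possesses", "caress"]:
    print(w, "->", strip_suffix(w))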
15
Index Term Selection Identification of noun groups: treat nouns that appear close together as a single component, e.g., "computer science".
16
Huffman Codes Binary character code: each character is represented by a unique binary string. A data file can be coded in two ways:

    character              a    b    c    d    e    f
    frequency (%)         45   13   12   16    9    5
    fixed-length code    000  001  010  011  100  101
    variable-length code   0  101  100  111 1101 1100

The first way needs 100 × 3 = 300 bits. The second way needs 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
17
Variable-Length Code Need some care to read the code.
– 001011101 (codewords: a=0, b=00, c=01, d=11)
– Where to cut? 00 can be interpreted as either aa or b.
Prefixes of 0011: 0, 00, 001, and 0011.
Prefix codes: no codeword is a prefix of some other codeword (prefix-free). Prefix codes are simple to encode and decode.
18
Using the Codewords in the Table to Encode and Decode Encode: abc = 0.101.100 = 0101100
– (just concatenate the codewords)
Decode: 001011101 = 0.0.101.1101 = aabe

    character              a    b    c    d    e    f
    frequency (%)         45   13   12   16    9    5
    fixed-length code    000  001  010  011  100  101
    variable-length code   0  101  100  111 1101 1100
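A runnable sketch of encoding and decoding with this table; the greedy left-to-right decoder works precisely because the code is prefix-free.

CODE = {"a": "0", "b": "101", "c": "100",
        "d": "111", "e": "1101", "f": "1100"}

def encode(s):
    return "".join(CODE[ch] for ch in s)   # just concatenate codewords

def decode(bits):
    rev = {v: k for k, v in CODE.items()}  # codeword -> character
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in rev:                     # first match is the only match
            out.append(rev[cur])
            cur = ""
    return "".join(out)

print(encode("abc"))        # 0101100
print(decode("001011101"))  # aabe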
19
Encode: abc = 0.101.100 = 0101100
– (just concatenate the codewords)
Decode: 001011101 = 0.0.101.1101 = aabe
– (use the binary tree on the right, below)
[Figure: left, the code tree for the fixed-length codewords; right, the code tree for the variable-length codewords, with leaves a:45, b:13, c:12, d:16, e:9, f:5, internal node weights 14, 25, 30, 55, and edges labeled 0/1.]
20
Binary Tree Every nonleaf node has two children. The fixed-length code in our example is not optimal. The total number of bits required to encode a file is

    B(T) = \sum_{c \in C} f(c) \, d_T(c)

– f(c): the frequency (number of occurrences) of c in the file
– d_T(c): the depth of c's leaf in the tree T
21
Constructing an Optimal Code Formal definition of the problem: Input: a set of characters C = {c_1, c_2, …, c_n}, where each c ∈ C has frequency f[c]. Output: a binary tree representing codewords so that the total number of bits required for the file is minimized. Huffman proposed a greedy algorithm to solve the problem.
22
[Figure: (a) the initial queue, sorted by frequency: f:5, e:9, c:12, b:13, d:16, a:45. (b) the two least-frequent nodes f:5 and e:9 are merged into an internal node of weight 14, with edges labeled 0 and 1.]
23
[Figure: (c) c:12 and b:13 are merged into an internal node of weight 25. (d) the node of weight 14 and d:16 are merged into an internal node of weight 30.]
24
[Figure: (e) the nodes of weight 25 and 30 are merged into an internal node of weight 55. (f) a:45 and the node of weight 55 are merged into the root of weight 100, completing the tree for the variable-length code.]
25
HUFFMAN(C)
1  n := |C|
2  Q := C
3  for i := 1 to n-1 do
4      z := ALLOCATE_NODE()
5      x := left[z] := EXTRACT_MIN(Q)
6      y := right[z] := EXTRACT_MIN(Q)
7      f[z] := f[x] + f[y]
8      INSERT(Q, z)
9  return EXTRACT_MIN(Q)
26
The Huffman Algorithm This algorithm builds the tree T corresponding to the optimal code in a bottom-up manner. C is a set of n characters, where each character c in C has a defined frequency f[c]. Q is a priority queue, keyed on f, used to identify the two least-frequent objects to merge. The result of a merger is a new object (an internal node) whose frequency is the sum of the frequencies of the two merged objects.
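A runnable Python sketch of HUFFMAN(C), using heapq as the priority queue Q; the tie-breaking counter is only there so the heap never has to compare two subtrees of equal frequency.

import heapq

def huffman(freq):
    # Queue entries are (frequency, tiebreak, tree); a tree is either a
    # character or a (left, right) pair representing an internal node.
    q = [(f, i, c) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(q)
    counter = len(q)
    for _ in range(len(freq) - 1):                     # n-1 merges
        fx, _, x = heapq.heappop(q)                    # x := EXTRACT_MIN(Q)
        fy, _, y = heapq.heappop(q)                    # y := EXTRACT_MIN(Q)
        heapq.heappush(q, (fx + fy, counter, (x, y)))  # f[z] := f[x] + f[y]
        counter += 1
    _, _, tree = q[0]                                  # the final tree

    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):
            walk(node[0], path + "0")                  # left edge: 0
            walk(node[1], path + "1")                  # right edge: 1
        else:
            codes[node] = path or "0"                  # single-character file
    walk(tree, "")
    return codes

print(huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
# {'a': '0', 'c': '100', 'b': '101', 'f': '1100', 'e': '1101', 'd': '111'}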
27
Time Complexity Lines 4-8 are executed n-1 times. Each heap operation in lines 4-8 takes O(lg n) time, so the total time required is O(n lg n). Note: the details of the heap operations will not be tested; the O(n lg n) time complexity should be remembered.
28
Another example: frequencies e:4, a:6, c:6, b:9, d:11.
[Figure: the two least-frequent nodes e:4 and a:6 are merged into an internal node of weight 10, leaving c:6, b:9, d:11, and the new node in the queue.]
29
[Figure: next, c:6 and b:9 are merged into a node of weight 15; then d:11 and the node of weight 10 are merged into a node of weight 21.]
30
[Figure: finally, the nodes of weight 15 and 21 are merged into the root of weight 36, completing the code tree.]
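For cross-checking, the huffman sketch from the earlier slide applied to these frequencies; the exact 0/1 labels depend on tie-breaking, but the code lengths match the tree above.

print(huffman({"e": 4, "a": 6, "c": 6, "b": 9, "d": 11}))
# e.g. {'c': '00', 'b': '01', 'e': '100', 'a': '101', 'd': '11'}
# cost = 4*3 + 6*3 + 6*2 + 9*2 + 11*2 = 82 bits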
31
Correctness of Huffman's Greedy Algorithm (Fun Part, not required) Again, we use our general strategy. Let x and y be the two characters in C having the lowest frequencies (the first two characters selected by the greedy algorithm). We will show two properties:
1. There exists an optimal solution T_opt (a binary tree representing codewords) such that x and y are siblings in T_opt.
2. Let z be a new character with frequency f[z] = f[x] + f[y] and C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get T_opt from T' by replacing the leaf z with an internal node z whose children are x and y.
32
Proof of Property 1 Look at the deepest pair of siblings in T_opt, say b and c. Exchange x with b and y with c to obtain T_new. Then B(T_opt) - B(T_new) ≥ 0 since f[x] and f[y] are the smallest frequencies. As T_opt is optimal, B(T_new) = B(T_opt), so T_new is also optimal and has x and y as siblings. Property 1 is proved.
[Figure: T_opt with siblings b, c at the deepest level and x, y elsewhere; T_new after exchanging x with b and y with c.]
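Spelled out, the exchange computation behind this step (a standard argument, filled in here for completeness; d(·) denotes leaf depth in the tree):

    B(T_opt) - B(T_new)
      = f[x] d(x) + f[b] d(b) - (f[x] d(b) + f[b] d(x))
      = (f[b] - f[x]) (d(b) - d(x)) ≥ 0,

since f[b] ≥ f[x] (x has minimum frequency) and d(b) ≥ d(x) (b is a deepest leaf). The same computation applies to the exchange of y and c.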
33
2. Let z be a new character with frequency f[z] = f[x] + f[y] and C' = C - {x, y} ∪ {z}. Let T' be an optimal tree for C'. Then we can get T_opt from T' by replacing the leaf z with an internal node z whose children are x and y.
Proof: Let T be the tree obtained from T' by replacing the leaf z with the three nodes. B(T) = B(T') + f[x] + f[y]. … (1) (The codes for x and y are 1 bit longer than the code for z.)
Now we prove T = T_opt by contradiction. Suppose T ≠ T_opt; then B(T) > B(T_opt). … (2)
By Property 1, x and y are siblings in T_opt. Thus, we can delete x and y from T_opt (replacing their parent with the leaf z) and get another tree T'' for C'. Then
B(T'') = B(T_opt) - f[x] - f[y] < B(T) - f[x] - f[y] = B(T'),
where the inequality uses (2) and the last equality uses (1). Thus B(T'') < B(T'). Contradiction! --- T' is optimal for C'.