
A pTree organization for text mining



Presentation transcript:

1 A pTree organization for text mining

[Figure: data cube with axes Position (1, 2, 3, ...), Term (Vocab) and doc (JSE, HHS, LMM, ...). Level 0 is the DocTrmPos pTreeSet; level 1 holds term existence (te) and term frequency (tf, with bit slices tf2, tf1, tf0) per doc; level 2 holds document frequency (df) per term.]

Example values from the figure for doc=1, in the term order printed on the slide:

Term      te  tf  tf2 tf1 tf0  df
are        0   0    0   0   0   1
April      0   0    0   0   0   1
apple      0   0    0   0   0   1
and        1   3    0   1   1  13
an         0   0    0   0   0   1
always.    0   0    0   0   0   1
all        0   0    0   0   0   3
again      0   0    0   0   0   1
a          1   2    0   1   0   8

Level-1 annotations on the figure: te for doc=1 uses predicate NOTpure0; the tf slices for doc=1 are labeled tf doc=1,2, tf doc=1,1 and tf doc=1,0 (the last with prd = modct/2), alongside tf (count). Level-2 annotation: df count.

Data cube layout: the stride is the position-map pTree for doc=1, term=1 (P 1,1); stride length = universal doc length. (So we assume a fixed max doc length and a fixed vocabulary.)
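The relationship between the tf counts and the te, tf2, tf1 and tf0 rows on this slide can be sketched in a few lines of Python. This is only an illustration using uncompressed bit lists, not the author's pTree implementation; the helper name bit_slices is hypothetical, and the tf vector is chosen to agree with the slide's tf1 and tf0 rows.

```python
def bit_slices(counts, nbits=3):
    """Split non-negative integers into bit slices, most significant first.

    Returns a list of nbits bit vectors: [slice_{nbits-1}, ..., slice_0].
    (Hypothetical helper; plain lists stand in for compressed pTrees.)
    """
    if any(c < 0 or c >= (1 << nbits) for c in counts):
        raise ValueError("count does not fit in nbits")
    return [[(c >> b) & 1 for c in counts] for b in range(nbits - 1, -1, -1)]


# Term order as printed on the slide:
# are April apple and an always. all again a
tf = [0, 0, 0, 3, 0, 0, 0, 0, 2]

# Term existence is 1 wherever the term occurs at least once.
te = [1 if c > 0 else 0 for c in tf]

# Three slices are enough for counts up to 7.
tf2, tf1, tf0 = bit_slices(tf, nbits=3)

print("te ", te)
print("tf2", tf2)
print("tf1", tf1)
print("tf0", tf0)
```

Each count is recovered as 4*tf2 + 2*tf1 + tf0, which is why only the slices need to be stored.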

2 Level-0 pTrees and Level-1 pTrees (term freq/exist)

I have put together a pBase of 75 Mother Goose Rhymes or Stories. Created a pBase of the 15 documents with 30 words (Universal Document Length, UDL), using as vocabulary all white-space separated strings.

document-1: Little Miss Muffet
Text: Little Miss Muffet sat on a tuffet eating of curds and whey. There came a big spider and sat down...

Level-1 columns (te, tf, tf1, tf0) and level-0 position bits for positions 1-8 (Little Miss Muffet sat on a tuffet eating). The slide's position columns continue to 60. Only rows with te = 1 are listed; every other vocabulary row visible on the slide (again. through flock, and your) is 0 in all of these columns.

VOCAB    te tf tf1 tf0   1 2 3 4 5 6 7 8
a         1  2   1   0   0 0 0 0 0 1 0 0
and       1  3   1   1   0 0 0 0 0 0 0 0
away.     1  1   0   1   0 0 0 0 0 0 0 0
beside    1  1   0   1   0 0 0 0 0 0 0 0
big       1  1   0   1   0 0 0 0 0 0 0 0
came      1  1   0   1   0 0 0 0 0 0 0 0
curds     1  1   0   1   0 0 0 0 0 0 0 0
down      1  1   0   1   0 0 0 0 0 0 0 0
eating    1  1   0   1   0 0 0 0 0 0 0 1
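As a rough sketch of how the level-0 position maps for one document could be built, the following Python splits the text on white space (as the slide describes) and sets one bit per position in a fixed-length vector. The function name position_maps and the choice to reject over-long documents are assumptions made here, and plain lists stand in for pTrees. The text is only the portion of the rhyme quoted on the slide, so counts for the full rhyme can differ.

```python
def position_maps(text, udl):
    """Build one fixed-length position bit vector per term (level 0).

    text: document text; terms are white-space separated strings.
    udl:  universal document length (every vector has this length).
    (Hypothetical helper; rejecting long documents is an assumption.)
    """
    tokens = text.split()
    if len(tokens) > udl:
        raise ValueError("document longer than the universal document length")
    maps = {}
    for pos, term in enumerate(tokens):
        # Position 1 on the slide is index 0 here.
        maps.setdefault(term, [0] * udl)[pos] = 1
    return maps


doc1 = ("Little Miss Muffet sat on a tuffet eating of curds and whey. "
        "There came a big spider and sat down")
maps = position_maps(doc1, udl=30)

# Level 1 follows from level 0: tf counts the 1-bits, te tests for any 1-bit.
tf = {term: sum(bits) for term, bits in maps.items()}
te = {term: int(any(bits)) for term, bits in maps.items()}

print(maps["a"][:8])       # position bits 1-8 for the term "a"
print(tf["a"], te["a"])
```

Documents shorter than the universal document length simply leave trailing zeros, which is what makes a fixed stride possible.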

3 Level-2 pTrees (document frequency)

document-2: Humpty Dumpty (labeled 05HDS on the slide)

Level-1 columns (te, tf, tf1, tf0) and level-0 position bits for positions 1-8 (Humpty Dumpty sat on a wall. Humpty Dumpty). The slide's position columns continue to 60. Only rows with te = 1 are listed; every other vocabulary row visible on the slide (always through finger, and your) is 0 in all of these columns.

VOCAB    te tf tf1 tf0   1 2 3 4 5 6 7 8
a         1  2   1   0   0 0 0 0 1 0 0 0
again.    1  1   0   1   0 0 0 0 0 0 0 0
all       1  2   1   0   0 0 0 0 0 0 0 0
and       1  1   0   1   0 0 0 0 0 0 0 0
cannot    1  1   0   1   0 0 0 0 0 0 0 0
Dumpty    1  3   1   1   0 1 0 0 0 0 0 1
fall.     1  1   0   1   0 0 0 0 0 0 0 0

Level-2 document frequency (df, with bit slices df3, df2, df1, df0) and term existence for documents 04, 05, 08, 09, 27, 29 and 34:

VOCAB      df3 df2 df1 df0  df  te04 te05 te08 te09 te27 te29 te34
a            1   0   0   0   8     1    1    0    1    0    0    0
again.       0   0   0   1   1     0    1    0    0    0    0    0
all          0   0   1   1   3     0    1    0    0    0    0    0
always       0   0   0   1   1     0    0    0    0    0    1    0
an           0   0   0   1   1     0    0    0    0    0    0    0
and          1   1   0   1  13     1    1    1    1    1    1    1
apple        0   0   0   1   1     0    0    0    0    0    0    0
April        0   0   0   1   1     0    0    0    0    0    0    0
are          0   0   0   1   1     0    0    0    0    0    0    0
around       0   0   0   1   1     0    0    0    0    0    0    0
ashes,       0   0   0   1   1     0    0    0    0    0    0    0
away         0   0   1   0   2     0    0    0    0    0    1    0
away.        0   0   0   1   1     1    0    0    0    0    0    0
baby         0   0   0   1   1     0    0    0    0    1    0    0
baby.        0   0   0   1   1     0    0    0    1    0    0    0
bark!        0   0   0   1   1     0    0    0    0    0    0    0
beans        0   0   0   1   1     0    0    0    0    0    0    1
beat         0   0   0   1   1     0    0    0    0    0    0    0
bed,         0   0   0   1   1     0    0    0    0    0    1    0
Beggars      0   0   0   1   1     0    0    0    0    0    0    0
begins.      0   0   0   1   1     0    0    0    0    0    0    0
beside       0   0   0   1   1     1    0    0    0    0    0    0
between      0   0   0   1   1     0    0    1    0    0    0    0
big          0   0   0   1   1     1    0    0    0    0    0    0
Birds        0   0   0   1   1     0    0    0    0    0    0    0
boiled       0   0   0   1   1     0    0    0    0    0    0    1
both         0   0   0   1   1     0    0    1    0    0    0    0
bread        0   0   0   1   1     0    0    0    0    0    0    0
bring        0   0   0   1   1     0    0    0    0    0    0    0
butter.      0   0   0   1   1     0    0    0    0    0    0    1
came         0   0   0   1   1     1    0    0    0    0    0    0
cannot       0   0   0   1   1     0    1    0    0    0    0    0
cheese,      0   0   0   1   1     0    0    0    0    0    0    0
choice       0   0   0   1   1     0    0    0    0    0    0    0
clean.       0   0   0   1   1     0    0    1    0    0    0    0
clear.       0   0   0   1   1     0    0    0    1    0    0    0
come         0   0   0   1   1     0    0    0    0    0    0    1
coming       0   0   0   1   1     0    0    0    0    0    0    0
could        0   0   0   1   1     0    0    1    0    0    0    0
Cry          0   0   0   1   1     0    0    0    0    1    0    0
crying       0   0   0   1   1     0    0    0    0    0    0    0
cry.         0   0   0   1   1     0    0    0    0    1    0    0
curds        0   0   0   1   1     1    0    0    0    0    0    0
Daddy        0   0   0   1   1     0    0    0    1    0    0    0
do           0   0   0   1   1     0    0    0    0    0    0    0
dogs         0   0   0   1   1     0    0    0    0    0    0    0
down         0   0   1   0   2     1    0    0    0    0    0    0
down.        0   0   0   1   1     0    0    0    0    0    0    0
drink?       0   0   0   1   1     0    0    0    0    0    0    0
Dumpty       0   0   0   1   1     0    1    0    0    0    0    0
eat          0   0   1   0   2     0    0    1    0    0    0    0
eating       0   0   0   1   1     1    0    0    0    0    0    0
eye          0   0   0   1   1     0    0    0    0    1    0    0
fall         0   0   0   1   1     0    0    0    0    0    0    0
fall.        0   0   0   1   1     0    1    0    0    0    0    0
fat.         0   0   0   1   1     0    0    1    0    0    0    0
feather      0   0   0   1   1     0    0    0    0    0    0    0
finger       0   0   0   1   1     0    0    0    0    1    0    0
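The level-2 rollup can be illustrated as a count of term-existence bits across documents, followed by the same bit slicing used for df3 through df0. The sketch below is a minimal illustration, assuming plain dictionaries in place of pTrees; the function names are hypothetical, and only three documents and four terms from the slide are used, so the partial counts differ from the slide's df values, which cover the whole collection.

```python
def document_frequency(te_by_doc):
    """Count, for each term, the documents whose te bit is 1.

    te_by_doc: dict mapping doc id -> dict mapping term -> 0 or 1.
    (Hypothetical helper; dicts stand in for level-1 pTrees.)
    """
    df = {}
    for te in te_by_doc.values():
        for term, bit in te.items():
            df[term] = df.get(term, 0) + bit
    return df


def to_bits(n, width=4):
    """Return the bits of n, most significant first (df3, df2, df1, df0)."""
    if n < 0 or n >= (1 << width):
        raise ValueError("value does not fit in the given width")
    return [(n >> b) & 1 for b in range(width - 1, -1, -1)]


# te bits for three of the slide's documents and four of its terms.
te_by_doc = {
    "te04": {"a": 1, "and": 1, "curds": 1, "Dumpty": 0},
    "te05": {"a": 1, "and": 1, "curds": 0, "Dumpty": 1},
    "te08": {"a": 0, "and": 1, "curds": 0, "Dumpty": 0},
}
partial_df = document_frequency(te_by_doc)
print(partial_df)

# The slide's whole-collection df values for "a" and "and", as bit slices.
print(to_bits(8), to_bits(13))
```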

4 Latent semantic indexing (LSI) is an indexing and retrieval method that uses singular value decomposition (SVD) to identify patterns in the relationships between terms and concepts in text. LSI is based on the principle that words used in the same contexts tend to have similar meanings.

A key LSI feature is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. [1]

LSI overcomes synonymy and polysemy, which cause mismatches in information retrieval [3] and cause Boolean keyword queries to fail.

LSI performs automated document categorization: the assignment of docs to predefined categories based on their similarity to the conceptual content of the categories. [5] LSI uses example docs to establish the conceptual basis for each category. The concepts in the docs being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the docs based on the similarities between the concepts they contain and the concepts contained in the example docs.

Mathematics of LSI: LSI uses linear algebra techniques to learn the conceptual correlations in a collection of text. Construct a weighted term-document matrix, do a Singular Value Decomposition on it, and use the result to identify the concepts contained in the text.

Term-Document Matrix, A: each of the m terms is represented by a row and each of the n docs by a column, with each matrix cell, a_ij, initially representing the number of times the associated term appears in the indicated document, tf_ij. This matrix is usually large and very sparse. Once a term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. Local weighting functions [13]: Binary (whether the term exists in the doc), TermFrequency. Global weighting functions [13]: Binary, Normal, GfIdf, Idf, Entropy.

SVD basically reduces the dimensionality of the matrix to a tractable size by finding the singular values and keeping only the largest ones. It involves matrix operations and may not be amenable to pTree operations (i.e., horizontal methods are highly developed and may be best). We should study it, though, to see if we can identify a pTree-based breakthrough for creating the reduction that SVD achieves.

Is a new SVD run required for every new query, or is it a one-time thing? If it is one-time, there is probably little advantage in searching for pTree speedups. If and when it is not a one-time application to the original data, pTree speedups may hold promise. Even if it is one-time, we might take the point of view that we do the SVD reduction (using standard horizontal methods) and then convert the result to vertical pTrees for the data mining (which would be done over and over again). That pTree-ization of the end result of the SVD reduction could be organized as in the previous slides.

Here is a good paper on the subject of LSI and SVD: http://www.cob.unt.edu/itds/faculty/evengelopoulos/dsci5910/LSA_Deerwester1990.pdf
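For reference, the rank-k reduction discussed on this slide is usually written as below. The symbols U_k, Σ_k, V_k, k, q and q-hat are notation introduced here for illustration and do not appear on the slides; A is the m-by-n weighted term-document matrix.

```latex
% Rank-k truncated SVD of the m x n term-document matrix A
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{T},
\qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1,\dots,\sigma_k),\quad
V_k \in \mathbb{R}^{n \times k},
\qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k > 0 .

% Folding a query (an m-vector q of term weights) into the k-dimensional space
\hat{q} \;=\; \Sigma_k^{-1}\, U_k^{T}\, q .
```

In this standard formulation a query is projected with the existing factors and then compared against the rows of V_k, so the decomposition itself does not have to be recomputed per query. That points toward the one-time (or occasionally refreshed) case raised above, in which the reduced document vectors are what would be converted to vertical pTrees.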

