
1 Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn, lwthucs@gmail.com
Presented at ICDM 2010, Sydney, Australia

2 Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

3 Introduction

4 How can we overcome the disadvantage?

5 Introduction (cont'd)
Three general approaches:
– To improve the algorithm itself
– To use high-performance platforms: multicore processors and distributed machines [Chu et al., 2006]
– Feature selection approaches [Yi, 2003] (the line we follow)

6 Introduction (cont'd)
Basic assumption: each term carries more or less information.
Our goal: a subspace of features (terms) with minimal information loss.
Conventional solution:
– Step 1: Measure each individual term, or its dependency on a group of terms, e.g., by information gain or mutual information.
– Step 2: Remove features that score low on importance or high on redundancy.

7 Introduction (cont'd)
Limitations of the conventional solution:
– Is there a generic definition of the information loss of a single term, and of a set of terms, under an arbitrary metric?
– What about the information loss of each individual document, and of a set of documents?
Why should we consider the information loss on documents?
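A minimal sketch of the conventional two-step filter described on slide 6, assuming a labeled corpus and using information gain as the scoring function (all names here are illustrative, not from the paper):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(term, docs, labels):
        # IG = H(labels) minus the weighted entropy of the
        # "term present" and "term absent" document partitions.
        present = [l for d, l in zip(docs, labels) if term in d]
        absent = [l for d, l in zip(docs, labels) if term not in d]
        ig = entropy(labels)
        for part in (present, absent):
            if part:
                ig -= len(part) / len(labels) * entropy(part)
        return ig

    def filter_terms(docs, labels, keep_ratio=0.5):
        # Step 1: score every term; Step 2: keep the highest-scoring fraction.
        vocab = {t for d in docs for t in d}
        ranked = sorted(vocab, key=lambda t: information_gain(t, docs, labels),
                        reverse=True)
        return ranked[:int(len(ranked) * keep_ratio)]

    docs = [{"kernel", "svm"}, {"kernel", "margin"}, {"soccer", "goal"}]
    labels = ["ml", "ml", "sports"]
    print(filter_terms(docs, labels))

Note that this scores each term in isolation, which is exactly the limitation raised above: it says nothing about the loss incurred by individual documents.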

8 Introduction (cont'd)
[Figure: terms A and B occurring in Doc 1 and Doc 2]

9 Introduction (cont'd)
[Figure: terms A and B occurring in Doc 1 and Doc 2]
What we have done:
– Defined information loss for both terms and documents
– Formulated the Term Filtering with Bounded Error (TFBE) problem
– Developed efficient algorithms

10 Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

11 Problem Definition – Superterm
Group terms into superterms!

12 Problem Definition – Information Loss
[Figure: terms A and B in Doc 1 and Doc 2 are merged into a single superterm S]

13 Problem Definition – Information Loss (cont'd)
[Figure: terms A and B in Doc 1 and Doc 2 are merged into a single superterm S]
The representative weight of a superterm can be chosen with different methods: winner-take-all, average-occurrence, etc.
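A small sketch of the two representative-weight choices named on this slide, under one plausible reading (the slide does not define them; the weights for A and B are illustrative):

    import numpy as np

    def average_occurrence(term_vectors):
        # Superterm weight = per-document mean of the merged terms' weights.
        return term_vectors.mean(axis=0)

    def winner_take_all(term_vectors):
        # Superterm weight = weights of the term with the largest total mass.
        return term_vectors[term_vectors.sum(axis=1).argmax()]

    # Terms A and B over Doc 1 and Doc 2 (illustrative weights):
    AB = np.array([[2.0, 0.0],    # A
                   [1.0, 1.0]])   # B
    S = average_occurrence(AB)
    # Per-document information loss when both A and B are replaced by S:
    doc_loss = np.abs(AB - S).sum(axis=0)
    print(S, doc_loss)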

14 Problem Definition – TFBE
Find a mapping function from terms to superterms such that the information loss is bounded by user-specified errors.
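One plausible formalization of TFBE, reconstructed from the slide text (the error symbols and the exact loss definitions are assumptions, not taken from the paper): find the smallest superterm set S and mapping phi such that

    \min_{S,\ \phi: V \to S} |S|
    \quad \text{s.t.} \quad
    \mathrm{loss}(t) \le \varepsilon_T \ \ \forall t \in V,
    \qquad
    \mathrm{loss}(d) \le \varepsilon_D \ \ \forall d \in D

where V is the vocabulary, D the document collection, and loss(.) the information loss induced by replacing each term t with its superterm phi(t).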

15 Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

16 Lossless Term Filtering
Special case: NO information loss for any term or document.
Theorem: the exact optimal solution is obtained by grouping terms that share the same vector representation.
Algorithm:
– Step 1: Find "local" superterms within each document (terms with the same occurrences in that document).
– Step 2: Add, split, or remove superterms in the global superterm set.

17 Lossless Term Filtering (cont'd)
Is this special case applicable in practice? YES. On the Baidu Baike dataset (1,531,215 documents), the vocabulary shrinks from 1,522,576 to 714,392 terms, a reduction of more than 50%.
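A minimal sketch of the lossless case, collapsing the paper's two-step (local, then global) procedure into a single global grouping: terms with identical occurrence vectors merge into one superterm with zero loss (variable names are illustrative):

    from collections import defaultdict

    def lossless_superterms(term_vectors):
        # term_vectors: {term: tuple of weights over all documents}
        groups = defaultdict(list)
        for term, vec in term_vectors.items():
            groups[vec].append(term)      # identical vectors share one key
        return list(groups.values())      # each group becomes one superterm

    vectors = {"car": (1, 0, 2), "auto": (1, 0, 2), "ball": (0, 3, 0)}
    print(lossless_superterms(vectors))   # [['car', 'auto'], ['ball']]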

18 Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

19 Lossy Term Filtering
The general problem is NP-hard. The proposed algorithm needs far fewer iterations than the vocabulary size (#iterations << |V|).
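The slides do not spell out the merging procedure, so the following is only a heavily simplified greedy sketch, assuming average-occurrence representatives and a Euclidean per-term error; it merges groups pair by pair while both user-specified bounds still hold:

    import numpy as np

    def within_bounds(vecs, eps_term, eps_doc):
        rep = vecs.mean(axis=0)                          # average-occurrence
        term_err = np.linalg.norm(vecs - rep, axis=1)    # per-term error
        doc_err = np.abs(vecs - rep).sum(axis=0)         # per-document error
        return term_err.max() <= eps_term and doc_err.max() <= eps_doc

    def greedy_merge(term_vecs, eps_term, eps_doc):
        groups = [[v] for v in term_vecs]     # start: one term per group
        merged = True
        while merged:                         # far fewer passes than |V| in practice
            merged = False
            for i in range(len(groups)):
                for j in range(i + 1, len(groups)):
                    cand = np.array(groups[i] + groups[j])
                    if within_bounds(cand, eps_term, eps_doc):
                        groups[i] += groups.pop(j)       # merge j into i
                        merged = True
                        break
                if merged:
                    break
        return groups

Testing every pair of groups per pass is the expensive part, which motivates the candidate reduction on the next slide.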

20 Lossy Term Filtering (cont'd)
Can we further reduce the size of the candidate superterm set? Yes, with locality-sensitive hashing (LSH).

21 Efficient Candidate Superterm Generation: LSH
Terms with the same hash value (under several randomized hash projections) are close to each other in the original space. (Figure modified from http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)

22 Efficient Candidate Superterm Generation: LSH (cont'd)
Each hash uses a projection vector drawn from the Gaussian distribution and a bucket width assigned empirically.

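A sketch of the p-stable hash family of Datar et al. (2004) behind this step, h(v) = floor((a . v + b) / w), with projection vector a drawn from a Gaussian and bucket width w set empirically; terms agreeing on every hash value fall into the same bucket and become candidates for merging (names and parameter values are illustrative):

    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(0)

    def make_hash(dim, w):
        a = rng.normal(size=dim)      # projection vector ~ Gaussian (2-stable)
        b = rng.uniform(0, w)         # random offset within one bucket
        return lambda v: int(np.floor((a @ v + b) / w))

    def candidate_superterms(term_vectors, w=2.0, n_hashes=4):
        dim = len(next(iter(term_vectors.values())))
        hashes = [make_hash(dim, w) for _ in range(n_hashes)]
        buckets = defaultdict(list)
        for term, vec in term_vectors.items():
            key = tuple(h(np.asarray(vec, float)) for h in hashes)
            buckets[key].append(term)
        return [g for g in buckets.values() if len(g) > 1]

    vecs = {"car": (1, 0, 2), "auto": (1.1, 0, 1.9), "ball": (0, 3, 0)}
    print(candidate_superterms(vecs))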

24 Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

25 Experiments
Settings:
– Datasets: ArnetMiner academic corpus (10,768 papers, 8,212 terms); 20-Newsgroups (18,774 postings, 61,188 terms)
– Baselines: task-irrelevant feature selection methods: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA); a supervised method (classification only): chi-square statistic (CHIMAX)

26 Experiments (cont'd)
Filtering results under the Euclidean metric. Findings:
– As the error bounds increase, the ratio of retained terms decreases.
– For a fixed term bound, raising the document bound lowers the term ratio.
– For a fixed document bound, raising the term bound lowers the term ratio.

27 Experiments (cont'd)
The information loss in terms of error. [Figures: average errors and maximum errors]

28 Experiments (cont'd)

29 Experiments (cont'd)
Applications:
– Clustering (ArnetMiner)
– Classification (20-Newsgroups)
– Document retrieval (ArnetMiner)

30 Outline: Introduction, Problem Definition, Lossless Term Filtering, Lossy Term Filtering, Experiments, Conclusion

31 Conclusion
– Formally defined the term filtering problem and performed a theoretical investigation
– Developed efficient algorithms for both the lossless and the lossy variants
– Validated the approach through an extensive set of experiments on multiple real-world text mining applications

32 Thank you!

33 References
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06, pp. 281-288.
Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007). Feature Selection Methods for Text Classification. In KDD '07, pp. 230-239.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. In SCG '04, pp. 253-262.
Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06, pp. 178-185.
Yi, L. (2003). Web Page Cleaning for Web Mining through Feature Weighting. In IJCAI '03, pp. 43-50.

