Presentation transcript:

1 Term Filtering with Bounded Error
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, China
Presented at ICDM 2010, Sydney, Australia

2 Outline
– Introduction
– Problem Definition
– Lossless Term Filtering
– Lossy Term Filtering
– Experiments
– Conclusion

3 Introduction

4 How can we overcome the disadvantage?

5 Introduction (cont'd)
Three general methods:
– Improve the algorithm itself
– Use high-performance platforms: multicore processors and distributed machines [Chu et al., 2006]
– Use feature selection approaches [Yi, 2003] (we follow this line)

6 Introduction (cont'd)
Basic assumption: each term captures more or less information.
Our goal: a subspace of features (terms) with minimal information loss.
Conventional solution (a code sketch follows this slide):
– Step 1: measure each individual term, or its dependency with respect to a group of terms, e.g. by information gain or mutual information.
– Step 2: remove features with low importance scores or high redundancy scores.
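As a concrete illustration of this two-step recipe, the following minimal Python sketch scores terms by mutual information with class labels on synthetic data; the matrix, labels, and cutoff are illustrative and not from the paper.

    import numpy as np

    def mutual_information(x, y):
        """Mutual information (in nats) between a binary term-presence vector x and labels y."""
        mi = 0.0
        for xv in np.unique(x):
            for yv in np.unique(y):
                pxy = np.mean((x == xv) & (y == yv))
                if pxy > 0:
                    mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
        return mi

    rng = np.random.default_rng(0)
    X = (rng.random((50, 200)) < 0.2).astype(int)  # 50 documents x 200 terms (presence/absence)
    y = rng.integers(0, 2, size=50)                # synthetic class labels

    # Step 1: score every term; Step 2: keep only the highest-scoring ones.
    scores = np.array([mutual_information(X[:, t], y) for t in range(X.shape[1])])
    kept = np.argsort(scores)[-100:]               # retain the 100 most informative terms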

7 Introduction (cont'd)
Limitations of the conventional solution:
– Is there a generic definition of information loss, for a single term and for a set of terms, under an arbitrary metric?
– What about the information loss of each individual document, and of a set of documents?
Why should we consider the information loss on documents?

8 Introduction (cont'd)
[Figure: a toy example with two documents, where Doc 1 contains term A and Doc 2 contains terms A and B.]

9 Introduction (cont'd)
[Same example figure as slide 8.]
What we have done:
– Defined information loss for both terms and documents
– Formulated the Term Filtering with Bounded Error (TFBE) problem
– Developed efficient algorithms

10 Outline (next: Problem Definition)

11 Problem Definition – Superterm
Group terms into superterms! A superterm stands for a set of terms and is represented by a single occurrence vector over the documents.

12 Problem Definition – Information loss
[Figure: terms A and B in Doc 1 and Doc 2 are merged into a superterm S; the change in each term's and each document's vector representation is the information loss.]

13 Problem Definition – Information loss (cont'd)
[Same figure as slide 12.]
The representative vector of a superterm can be chosen with different methods: winner-take-all, average-occurrence, etc.
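A small Python sketch of the two strategies named above; the paper's exact definition of winner-take-all may differ, and this sketch assumes it picks the member term with the largest total occurrence.

    import numpy as np

    def superterm_vector(member_vectors, method="average-occurrence"):
        """Choose a superterm's representative occurrence vector
        from the occurrence vectors of its member terms."""
        V = np.asarray(member_vectors, dtype=float)
        if method == "winner-take-all":
            return V[np.argmax(V.sum(axis=1))]  # assumption: the most frequent member wins
        return V.mean(axis=0)                   # average-occurrence

    members = [[1, 0, 2], [3, 0, 2]]            # two terms over three documents
    print(superterm_vector(members, "winner-take-all"))  # [3. 0. 2.]
    print(superterm_vector(members))                     # [2. 0. 2.]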

14 Problem Definition – TFBE
Find a mapping function from terms to superterms such that the information loss of every term and every document is bounded by user-specified errors.
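In symbols, a plausible formalization of TFBE reads as follows (a LaTeX sketch; the paper's exact objective and loss definitions may differ, with \epsilon_T and \epsilon_D denoting the user-specified term and document error bounds):

    \min_{\mathcal{S},\; \phi : V \to \mathcal{S}} \; |\mathcal{S}|
    \quad \text{s.t.} \quad
    \mathrm{loss}(t) \le \epsilon_T \;\; \forall t \in V,
    \qquad
    \mathrm{loss}(d) \le \epsilon_D \;\; \forall d \in D

Here \phi maps each term to its superterm, and the losses compare a term's (or document's) original vector with its representation after filtering.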

15 Outline (next: Lossless Term Filtering)

16 Lossless Term Filtering
Special case: NO information loss for any term or document.
Theorem: the exact optimal solution is obtained by grouping terms that share the same vector representation.
Algorithm (sketched in code after slide 17):
– Step 1: find "local" superterms within each document (terms with the same occurrences in that document)
– Step 2: add, split, or remove superterms from the global superterm set

17 Lossless Term Filtering (cont'd)
Is this special case applicable in practice? YES! On the Baidu Baike dataset (1,531,215 documents), lossless filtering shrinks the vocabulary from 1,522,576 to 714,392 terms, a reduction of more than 50%.
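A minimal Python sketch of the lossless grouping, assuming a dense term-by-document count matrix (all names illustrative): terms with identical rows are merged into one superterm, which is exactly the grouping the theorem calls for. The paper's algorithm builds the grouping incrementally per document; this sketch does it in one pass for clarity.

    from collections import defaultdict
    import numpy as np

    def lossless_superterms(X):
        """Group terms (rows of X) whose occurrence vectors are identical
        across all documents; each group forms one superterm."""
        groups = defaultdict(list)
        for t in range(X.shape[0]):
            groups[X[t].tobytes()].append(t)  # exact-match hashing of the whole row
        return list(groups.values())

    X = np.array([[1, 0, 2],   # term 0
                  [1, 0, 2],   # term 1: identical to term 0 -> same superterm
                  [0, 3, 1],   # term 2
                  [0, 0, 1]])  # term 3
    print(lossless_superterms(X))  # [[0, 1], [2], [3]]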

18 Outline (next: Lossy Term Filtering)

19 Lossy Term Filtering
The general problem is NP-hard; the proposed iterative algorithm needs far fewer iterations than the vocabulary size (#iterations << |V|).
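The transcript does not preserve the algorithm itself, so the following generic greedy merge is only a hedged illustration of lossy filtering under a per-term Euclidean error bound eps (the document bound is omitted for brevity; all names are illustrative, and this is not the paper's algorithm).

    import numpy as np
    from itertools import combinations

    def greedy_merge(X, eps):
        """Repeatedly merge two superterms if every member term stays
        within eps of the merged average-occurrence representative."""
        X = np.asarray(X, dtype=float)
        groups = [[t] for t in range(X.shape[0])]
        merged = True
        while merged:
            merged = False
            for i, j in combinations(range(len(groups)), 2):
                members = groups[i] + groups[j]
                m = X[members].mean(axis=0)  # candidate representative
                if max(np.linalg.norm(X[t] - m) for t in members) <= eps:
                    groups[i] = members
                    del groups[j]
                    merged = True
                    break
        return groups

    print(greedy_merge([[1, 0, 2], [1, 0, 2.1], [0, 3, 1]], eps=0.2))  # [[0, 1], [2]]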

20 Lossy Term Filtering (cont'd)
Can we further reduce the size of the candidate superterm set? Yes, with locality-sensitive hashing (LSH).

21 Efficient Candidate Superterm Generation – LSH
Vectors that receive the same hash value (under several randomized hash projections) are likely to be close to each other in the original space.
(Figure modified from berlin.de/pdci08/imageflight/nn_search.html)

22 Efficient Candidate Superterm Generation – LSH (cont'd)
Each hash function is built from a projection vector (drawn from the Gaussian distribution) and a bucket width (assigned empirically).
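A minimal Python sketch of this p-stable hashing scheme (cf. Datar et al., 2004); the projection count k and bucket width w below are illustrative values, not the paper's settings. Terms that land in the same bucket become candidate members of the same superterm, so only those pairs need exact checking.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_lsh(dim, k, w):
        """Build an LSH function from k Gaussian projection vectors
        and bucket width w: h(x) = floor((a.x + b) / w), concatenated."""
        A = rng.normal(size=(k, dim))  # 2-stable (Gaussian) projections
        b = rng.uniform(0, w, size=k)  # random offsets
        return lambda x: tuple(np.floor((A @ x + b) / w).astype(int))

    h = make_lsh(dim=100, k=4, w=2.0)
    x = rng.normal(size=100)
    y = x + 0.01 * rng.normal(size=100)  # a nearby term vector
    print(h(x) == h(y))                  # likely True: close vectors share a bucket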

24 Outline (next: Experiments)

25 Experiments
Settings
– Datasets:
  – Academic: ArnetMiner (10,768 papers and 8,212 terms)
  – 20-Newsgroups (18,774 postings and 61,188 terms)
– Baselines:
  – Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)
  – Supervised method (only for classification): Chi-statistic (CHIMAX)

26 Experiments (cont'd)
Filtering results under the Euclidean metric.
Findings:
– As the error bounds increase, the term ratio decreases.
– With the term bound fixed, increasing the document bound decreases the term ratio.
– With the document bound fixed, increasing the term bound decreases the term ratio.

27 Experiments (cont'd)
[Figure: information loss measured in terms of error, reported as average errors and maximum errors.]

28 Experiments (cont'd)

29 Experiments (cont'd)
Applications:
– Clustering (ArnetMiner)
– Classification (20-Newsgroups)
– Document retrieval (ArnetMiner)

30 Outline (next: Conclusion)

31 Conclusion
– Formally defined the term filtering problem and performed a theoretical investigation
– Developed efficient algorithms for the lossless and lossy settings
– Validated the approach through an extensive set of experiments on multiple real-world text mining applications

32 Thank you!

33 References
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y., and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06.
Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., and Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD '07.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In SCG '04.
Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06.
Yi, L. (2003). Web page cleaning for web mining through feature weighting. In IJCAI '03.