Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics www.crysys.hu.


1 Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics

2 Introduction
Laboratory of Cryptography and System Security (CrySyS), Adat- és Rendszerbiztonság Laboratórium
Generic model
–Document preprocessing
–Text mining methods

3 Text Mining Tasks
Classification (supervised learning)
–Binary classification
–Single-label (multi-class) classification
–Multi-label classification
–Multi-level (hierarchical) classification
Clustering (unsupervised learning)
Summarization
–Extraction: uses only parts of the original text
–Abstraction: introduces text that is not included in the original text

4 Solutions
Classification
–Decision tree
–Neural network
–Bayes network
Clustering
–k-means

5 Document preprocessing
Goal: represent any text compactly, with a fixed number of parameters
Representation: vector space model

6 Vector space model
The text is tokenized into words
The words are reduced to their base forms (for example by stemming or lemmatization); we refer to these base words as terms
A dictionary is built: the set of terms that occur in the documents
Each document is represented as a vector: the i-th element of the vector is the number of times the i-th term of the dictionary occurs in the document
The collection of documents is represented by the term-document matrix
Problem: the number of dimensions is too large
Solution: feature selection
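The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the author's implementation: the function names are hypothetical, and `to_term` is a toy stand-in for real stemming that only strips a trailing plural "s".

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_term(word):
    """Toy stand-in for stemming (assumption): strip a trailing plural 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def term_document_matrix(documents):
    """Return (dictionary, matrix); matrix[i][j] counts term i in document j."""
    counts = [Counter(to_term(w) for w in tokenize(doc)) for doc in documents]
    # The dictionary is the sorted set of all terms seen in any document.
    dictionary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[term] for c in counts] for term in dictionary]
    return dictionary, matrix

docs = ["Cats chase mice.", "Dogs chase cats.", "Mice eat cheese."]
dictionary, matrix = term_document_matrix(docs)
for term, row in zip(dictionary, matrix):
    print(f"{term:8s} {row}")
```

Each column of `matrix` is the vector of one document. Even in this three-sentence example the dictionary already has six terms, which hints at why the number of dimensions becomes a problem for real collections.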

7 Dimension Reduction
Feature Selection: find a subset of the original variables
–Document Frequency Thresholding
  Omit the words that occur in more documents than an upper threshold, because these words are not discriminative
  Omit the words that occur in fewer documents than a lower threshold, because these words do not carry much information
–Information gain based feature selection (information theory)
–Chi-square based feature selection (statistics)
Feature Extraction: transform the data to fewer dimensions
–Latent Semantic Indexing (LSI)
–Principal Component Analysis (PCA)
–Nonlinear methods
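Document frequency thresholding, the simplest of the feature selection methods listed above, might look like the following sketch. The parameter names `min_df` and `max_df` and the tiny example matrix are assumptions made for illustration; rows are terms and columns are documents.

```python
def document_frequency(matrix):
    """Number of documents in which each term (row) occurs at least once."""
    return [sum(1 for count in row if count > 0) for row in matrix]

def df_threshold(dictionary, matrix, min_df, max_df):
    """Keep only the terms whose document frequency lies in [min_df, max_df]."""
    dfs = document_frequency(matrix)
    keep = [min_df <= df <= max_df for df in dfs]
    kept_terms = [t for t, k in zip(dictionary, keep) if k]
    kept_rows = [row for row, k in zip(matrix, keep) if k]
    return kept_terms, kept_rows

# Hypothetical term-document counts for four documents.
dictionary = ["the", "cat", "dog", "zebra"]
matrix = [
    [2, 1, 3, 1],  # "the" occurs everywhere: not discriminative
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],  # "zebra" occurs once: carries little information
]
kept_terms, kept_rows = df_threshold(dictionary, matrix, min_df=2, max_df=3)
print(kept_terms)
```

With these thresholds the very frequent term and the very rare term are dropped, and the two middle terms remain.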

8 Latent Semantic Indexing (LSI)
Singular value decomposition (SVD) is applied to the term-document matrix
The features belonging to the k largest singular values approximate the term-document matrix well, so these features are used
LSI regards documents with many common words as being semantically near
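A truncated SVD of a small term-document matrix illustrates the idea. This is a sketch assuming NumPy is available; the matrix values, the function name `lsi`, and the use of cosine similarity to compare documents are illustrative choices, not taken from the slides.

```python
import numpy as np

def lsi(term_document, k):
    """Truncated SVD: keep the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(term_document, full_matrices=False)
    # NumPy returns singular values in descending order, so the first k are the largest.
    doc_coords = (s[:k, None] * Vt[:k]).T  # one k-dimensional row per document
    return U[:, :k], s[:k], doc_coords

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical counts: rows are terms, columns are documents d0..d3.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U_k, s_k, docs_k = lsi(A, k=2)
print("singular values:", s_k)
print("similarity d0-d1:", cosine(docs_k[0], docs_k[1]))
print("similarity d0-d2:", cosine(docs_k[0], docs_k[2]))
```

In the reduced two-dimensional space, d0 and d1 (which share all their terms) end up pointing in the same direction, while d0 and d2 (which share none) are orthogonal.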

9 Principal Component Analysis (PCA)
Also called the Karhunen-Loève transform (KLT)
A linear technique
Maps the data to a lower-dimensional space in such a way that the variance of the low-dimensional representation is maximized
The algorithm
–The covariance matrix of the data is constructed (or the correlation matrix, if the variables are standardized first)
–The eigenvectors and eigenvalues of this matrix are calculated
–The original space is reduced to the space spanned by the eigenvectors that belong to the largest eigenvalues
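The three algorithm steps can be sketched with NumPy as follows. The sketch uses the covariance matrix of the centered data; standardizing the columns first would make it the correlation-matrix variant. The function name `pca` and the sample data are assumptions for illustration.

```python
import numpy as np

def pca(data, k):
    """data has one row per observation; returns (projected, components, eigenvalues)."""
    centered = data - data.mean(axis=0)
    # Step 1: covariance matrix of the variables (columns).
    cov = np.cov(centered, rowvar=False)
    # Step 2: eigen-decomposition; eigh suits symmetric matrices and sorts ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: keep the eigenvectors that belong to the k largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    return centered @ components, components, eigenvalues[order]

# Hypothetical 2-D points lying exactly on the line y = 2x.
data = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
projected, components, eigenvalues = pca(data, k=1)
print("principal direction:", components[:, 0])
print("variance captured:", eigenvalues[0])
```

Because the sample points lie on a line, a single component along the direction (1, 2) captures all of the variance, and the one-dimensional projection loses nothing.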

10 Kernel PCA
A nonlinear method
PCA + kernel trick
Kernel trick (generally)
–we map observations from a general set S into a higher-dimensional space V
–we hope that a general (nonlinear) classification problem in S reduces to a linear classification problem in V
–the trick lets us avoid explicitly computing the mapping of the observations from S to V
  We use a learning algorithm that needs only the dot product operation in V
  We use a mapping that allows us to calculate the dot product in V with a kernel function K evaluated in S (the original space)
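A small numeric check makes the kernel trick concrete. The sketch below assumes S is the plane and picks the degree-2 polynomial kernel K(x, y) = (x · y)^2 as an example; its corresponding mapping `phi` into a three-dimensional V is written out only to show that the kernel gives the same dot product without ever computing it.

```python
import math

def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit mapping from S (2-D) to V (3-D) for the degree-2 polynomial kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def kernel(x, y):
    """K(x, y) = (x . y)^2, evaluated entirely in the original space S."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))  # maps to V first, then takes the dot product
implicit = kernel(x, y)         # never leaves S
print(explicit, implicit)
```

Both routes give the same value, so an algorithm that only needs dot products in V (such as PCA expressed through a Gram matrix) can work with `kernel` alone.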

11 Manifold learning techniques
They minimize a cost function that retains local properties of the data
Methods
–Locally Linear Embedding (LLE)
–Hessian LLE
–Laplacian Eigenmaps
–Local Tangent Space Alignment (LTSA)
–Maximum Variance Unfolding (MVU)

12 Locally Linear Embedding (LLE)

13 Locally Linear Embedding (LLE)

14 Maximum Variance Unfolding (MVU)
Instead of defining a fixed kernel, it tries to learn the kernel using semidefinite programming
Exactly preserves all pairwise distances between nearest neighbors
Maximizes the distances between points that are not nearest neighbors
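The three statements above correspond to a semidefinite program over the kernel (Gram) matrix K of the embedded points. The formulation below is the commonly cited one, written here as a sketch; the symbols x_i for the input points and N(i) for the nearest neighbors of point i are notation introduced for this illustration.

```latex
\begin{aligned}
\max_{K} \quad & \operatorname{tr}(K) \\
\text{subject to} \quad & K \succeq 0, \\
& \sum_{i,j} K_{ij} = 0, \\
& K_{ii} - 2K_{ij} + K_{jj} = \lVert x_i - x_j \rVert^2
  \quad \text{for all } j \in N(i).
\end{aligned}
```

The equality constraints preserve the distances between nearest neighbors, the zero-sum constraint centers the embedding, and for a centered embedding maximizing the trace is the same as maximizing the total squared distance between all pairs of points. The low-dimensional coordinates are then read off from the leading eigenvectors of the learned K.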

