Presentation on theme: "Fast Supervised Feature Extraction from Structured Representation of Text Data" (presentation transcript)

1 Fast Supervised Feature Extraction from Structured Representation of Text Data
Ondřej Háva havaondr@fel.cvut.cz
Pavel Kordík, Miroslav Skrbek
http://cig.felk.cvut.cz
Computational Intelligence Group, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague

2 Agenda
Structured representation of text documents
Dimensionality reduction
Two-stage supervised feature extraction
Neural network implementation
Experiments
Notes and future plans

3 Data mining, text mining, dimensionality reduction
Text documents are a popular source of data for data mining tasks: articles, web pages, notes, blogs, free survey questions, ...
Written language is rich: synonyms, inflection, ...
The goal of text mining algorithms is to extract the important topics from documents, i.e. to transform the bag-of-words into a reliable and easy-to-use representation of the documents.

4 Representation of text data
Focus on collections of unstructured text documents: categorization, classification, merging with structured data, information retrieval, ...
Transformation of free text into a structured data matrix: row = document, column = dictionary item
Pipeline: documents in text format, with linguistic tagging and dictionaries, are turned into a document-term matrix of term frequencies, followed by possible feature reduction

5 Structured representation of documents
Dimensions: M terms > N documents > K categories
Linguistic items are extracted from the documents; topics are assigned to the documents externally
The document-term part holds frequency-based weights and serves unsupervised learning; the document-category part holds the topic coverage of each document and serves supervised learning
Text records: web pages, paragraphs, documents, patents, medical records, operator's notes, press releases, blogs, etc.

6 Dimensionality reduction
Typical dimensionality of the document matrix: 10^3 terms (M), 10^2 documents (N), 10^1 categories (K)
Classes of dimensionality reduction techniques: feature selection vs. feature extraction, standalone algorithm vs. wrapper, supervised vs. unsupervised
Probably the most popular dimensionality reduction technique in text mining is Singular Value Decomposition (SVD): extraction, standalone, unsupervised; the extracted features represent latent semantic concepts
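For comparison, a minimal NumPy sketch of unsupervised, SVD-based feature extraction on a document-term matrix; the dimensions and variable names are illustrative only, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((100, 1000))       # document-term matrix (N x M), e.g. tf-idf weights

# Truncated SVD: D ~ U_k diag(s_k) V_k^T; the k right singular vectors span
# the latent semantic concepts.
U, sv, Vt = np.linalg.svd(D, full_matrices=False)
k = 10
doc_features = U[:, :k] * sv[:k]  # training documents in the k-dimensional concept space

d = rng.random((1, 1000))         # a new document is projected into the same space
d_features = d @ Vt[:k].T
```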

7 Objective
Develop a supervised, standalone feature extraction method suitable for the document-term matrix, i.e. applicable to the structured representation of text
The method should be fast enough to process new unlabeled documents efficiently
It should exploit the useful information in the labeled training documents

8 Solution
Two-stage pipeline:
1st stage: document-term matrix (NxM) is mapped, using the training dictionary, to a document-document matrix (NxN) of similarities to the training documents
2nd stage: the document-document matrix (NxN) is mapped, using the training documents' categories, to a document-category matrix (NxK) of similarities to the training category assignment

9 Notation
D ... document-term training matrix without target columns (NxM)
C ... document-category indicator training matrix (NxK)
S ... matrix of similarities among the training documents (NxN)
R ... document-extracted-feature matrix for the training documents (NxK)
d ... row vector of a new unclassified document (1xM)
s ... row vector of similarities between the new document and the training documents (1xN)
r ... row vector of extracted features for the new document (1xK)

10 First stage
Positioning of new documents in the training document space
Coordinates are cosine similarities with the training documents
Unsupervised phase: labels are unnecessary
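A minimal sketch of the first stage in NumPy, using the notation from slide 9; the dictionary size 5320 comes from slide 17, the number of training documents is illustrative, and the code is an interpretation of the slides rather than the authors' implementation:

```python
import numpy as np

def cosine_rows(A, B):
    """Cosine similarity between every row of A and every row of B."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T

rng = np.random.default_rng(0)
D = rng.random((100, 5320))   # document-term training matrix (N x M), e.g. tf-idf weights
d = rng.random((1, 5320))     # term-frequency row vector of a new unlabeled document

S = cosine_rows(D, D)         # similarities among the training documents (N x N)
s = cosine_rows(d, D)         # first stage: coordinates of the new document (1 x N)
```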

11 Second stage
Each training document is expressed in the category space by its supervised labeling
A new document can also be placed into the category space, because its similarities to the training documents are known from the first stage
Its coordinates are the (weighted) mean of the training documents' coordinates in category space, weighted by the similarities with the new document
After normalization this is again a cosine similarity, this time to the category columns of the training document-category matrix
Supervised phase: it utilizes the assignment of the training documents to categories
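Continuing the sketch above (reusing cosine_rows, rng and s), the second stage is one more cosine step against the category columns of C; again an interpretation of the slide, not the authors' code:

```python
C = (rng.random((100, 8)) < 0.2).astype(float)   # binary document-category indicators (N x K)

# Second stage: weighted mean of the training documents' category coordinates,
# weighted by the similarities s from the first stage; after normalization this
# equals the cosine similarity between s and each category column of C.
r = cosine_rows(s, C.T)                          # extracted features of the new document (1 x K)
```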

12 Modified second stage
Assignments of training documents to categories are usually binary (0/1): a document is or isn't a member of a particular category
Due to the richness of written language, real-valued coordinates in the document-category matrix are more pragmatic than binary ones: some documents represent a particular category better than others, and some documents represent more than one category
Training documents that are similar to training documents from the same category are better representatives of that category
Therefore, substitute the binary document-category matrix with a real-valued matrix that consists of the sums of the similarities to the documents of the same category
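Continuing the sketch, one reading of "sums of the similarities to the same-category documents" is the product of the training similarity matrix with the binary indicators; this is my interpretation of the slide, not necessarily the authors' exact formula:

```python
# Modified second stage: entry (i, k) of C_real sums the similarities of training
# document i to all training documents labeled with category k.
C_real = S @ C

r_mod = cosine_rows(s, C_real.T)   # extracted features with the real-valued category matrix (1 x K)
```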

13 Neural implementation
Cosine similarity can be easily simulated by an artificial neuron: the weighted sum of inputs (action potential) is realized by the matrix multiplication in the numerator, and the activation function transforms the action potential by the normalization in the denominator
Two types of neurons: similarity to a training document, and expression of that similarity in category space
First stage neuron (similarity to a training document): w_i ... term frequencies of the training document, x_i ... term frequencies of the new unlabeled document
Second stage neuron (similarity to a particular category): w_i ... category indicators of the training documents, x_i ... similarities of the new unlabeled document to the training documents
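Written out, each neuron's output is the cosine of its weight and input vectors (reconstructed from the verbal description on this slide, with w_i and x_i as defined above):

$$ y = \frac{\sum_i w_i x_i}{\sqrt{\sum_i w_i^2}\,\sqrt{\sum_i x_i^2}} = \cos(\mathbf{w}, \mathbf{x}) $$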

14 Neural network
Layer sizes follow M > N > K (M input, N hidden, K output neurons); the output layer can use either the original or the modified second stage
The input layer only transfers the term frequencies to the hidden layer
The hidden layer computes the similarities to the training documents
The output layer expresses those similarities in the category space

15 Experimental design
645 press releases by the Czech News Agency (ČTK) or Grand Prince (GP)
The length of each document is approximately 5 KB
The press releases are manually divided into eight categories: cars, housing, travel, culture, Prague, domestic news, health, foreign news; the categories are roughly equally occupied
Random split into training (65%) and test (35%) sets
Comparison of the proposed feature extraction method (SFX) with standard SVD
Binary logistic regression classifier for each category
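A hedged sketch of the per-category evaluation step using scikit-learn; the paper does not name a library or its settings, so LogisticRegression here simply stands in for "binary logistic regression classifier for each category", and the feature and label arrays are random placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
R_train = rng.random((419, 8))          # extracted features of training documents (65% of 645)
R_test = rng.random((226, 8))           # extracted features of test documents (35% of 645)
y_train = rng.integers(0, 2, (419, 8))  # binary labels, one column per category
y_test = rng.integers(0, 2, (226, 8))

# One binary logistic regression classifier per category.
for k in range(8):
    clf = LogisticRegression(max_iter=1000).fit(R_train, y_train[:, k])
    acc = clf.score(R_test, y_test[:, k])
    print(f"category {k}: test accuracy {acc:.3f}")
```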

16 Experimental setup
Pipeline: construction of the document-term matrix, then SFX (as an alternative to SVD), then the logistic regression classifier

17 Document weighting
Terms are constructed from words; no lemmatization or stemming
Dictionary items are selected by frequency filters:
gf > 2 ... the word appears at least two times in the training collection
n/df > 2 ... the word does not appear in more than half of the documents
gf/df > 1.2 ... the average tf is at least 1.2
The dictionary includes the 5320 most frequent words from the training set
Term frequencies are expressed by the popular tf-idf weights
tf ... term frequency in a particular document (local feature of the term, derived from the particular document)
gf ... term frequency in the whole collection (global feature of the term, derived from the training collection)
df ... number of documents in which the term is present
n ... number of documents in the collection
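A small NumPy sketch of the frequency filters and a common tf-idf weighting consistent with the definitions above; the exact tf-idf variant used in the paper is not specified on the slide, so this is only one plausible reading:

```python
import numpy as np

def build_weights(counts):
    """counts: raw term-count matrix (n documents x vocabulary size).
    Applies the frequency filters from the slide and returns tf-idf weights
    for the surviving terms."""
    n = counts.shape[0]
    gf = counts.sum(axis=0)                  # term frequency in the whole collection
    df = (counts > 0).sum(axis=0)            # number of documents containing the term
    df_safe = np.maximum(df, 1)              # avoid division by zero for unused terms
    keep = (gf > 2) & (n / df_safe > 2) & (gf / df_safe > 1.2)

    tf = counts[:, keep]                     # term frequency in a particular document
    idf = np.log(n / df[keep])               # inverse document frequency
    return tf * idf, keep                    # tf-idf weights and the surviving columns

rng = np.random.default_rng(0)
counts = rng.integers(0, 4, size=(20, 50))   # tiny illustrative collection
weights, kept = build_weights(counts)
```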

18 Experimental results
[result charts; annotations: better quality but sometimes overfit; even better quality without overfit]

19 Summary
SFX is easy to implement: no matrix decomposition, inverse, or eigenvector computation
SFX is fast: 100 times faster than SVD
SFX can deal with multi-topic documents: a simple modification of the training matrix
The extracted features are interpretable: they correspond to the training categories
SFX performs better than SVD: it utilizes the training labels
Simple neural simulation of SFX: the topology is derived from the training documents and their labels; there is no iterative learning algorithm, the synaptic weights are just values from the training matrices

20 Note: Similarity to RBF networks
RBF: 1. a classifier; 2. the hidden neurons and the connections between the input and hidden layers represent cluster centers; 3. similarity is measured by Euclidean distance transformed by a radial basis function; 4. the weights between the hidden and output layers are assessed by iterative learning
SFX: 1. a feature extractor; 2. the hidden neurons and the connections between the input and hidden layers represent training documents; 3. cosine similarity is used in both the hidden and output layers; 4. the weights between the hidden and output layers represent the labeling

21 Future plans
Unification of all measures to a common scale: term weighting, similarities, labeling; preserve the simplicity of feature extraction by matrix multiplication
Selection of the best documents for the training set: in the topology of the neural network, the neurons in the hidden layer can influence the labeling

22 Thank you

