1 Representation of hypertext documents based on terms, links and text compressibility
Julian Szymański – Department of Computer Systems Architecture, Gdańsk University of Technology, Poland
Włodzisław Duch – Department of Informatics, Nicolaus Copernicus University, Toruń, Poland; School of Computer Engineering, Nanyang Technological University, Singapore

2 Outline
Text representations: words, references, compression
Evaluation of text representations: Wikipedia data, SVM & PCA
Experimental results and conclusions
Future directions

3 Text representation
The amount of information on the Internet grows rapidly, so machine support is needed for:
- Categorization (supervised or unsupervised)
- Searching / retrieval
Humans understand text; machines don't. To process text, a machine needs it in a computable form, and the results of text processing strongly depend on the method used for text representation.
Several approaches to processing natural language:
- Logic (ontologies)
- Statistical processing of large text corpora
- Geometry, mainly used in machine learning
Machine learning for NLP uses text features. The aim of the experiments presented here is to find a hypertext representation suitable for:
- Automatic categorization
- Information retrieval
- The Wiki project – improvement of the existing Wikipedia category system

4 Text representation with features
A convenient form of a text for machine processing is a vector of features. A text set is then represented as a matrix relating each of the N features to text k by a weight c:
d_k = [c_k1, c_k2, ..., c_kN]
where: k – document number, n – feature number, c – feature value.
Where do the features come from?
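A minimal sketch of this matrix form in Python (the documents, features and weights below are hypothetical illustrations, not from the slides):

```python
import numpy as np

# Hypothetical feature matrix: 3 documents (rows k) x 4 features (columns n).
# Entry C[k, n] holds the weight c of feature n in document k.
C = np.array([
    [0.0, 1.2, 0.0, 0.4],   # document k=0
    [0.7, 0.0, 0.0, 0.0],   # document k=1
    [0.0, 0.3, 2.1, 0.0],   # document k=2
])

print(C.shape)   # (3, 4): k documents, N features
print(C[2, 2])   # weight of feature n=2 in document k=2 -> 2.1
```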

5 Words
The most intuitive approach is to take words as features: the words of a text should describe its subject well. The n-th word has value c in the context of the k-th document, calculated as:
c_kn = tf_kn · idf_n
where:
tf – term frequency: how many times word n appears in document k.
idf – inverse document frequency: how seldom word n appears in the whole text set; the proportion of the number of all documents to the number of documents containing the given word.
Problem: high-dimensional sparse vectors. This is a Bag of Words (BOW), which loses syntax.
Preprocessing: stopword removal, stemming. Features -> terms.
Other possibilities: n-grams, profiles of letter frequencies.
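A minimal sketch of this tf·idf weighting (the toy documents are invented, and the log-scaling of idf is a common convention assumed here; the slide only specifies the proportion):

```python
import math
from collections import Counter

docs = [
    "the tree grows in the forest",
    "algebra studies symbols and rules",
    "the volcano erupts in the forest",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df[w]: number of documents in which word w appears
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(word, doc_tokens):
    tf = doc_tokens.count(word)      # term frequency in this document
    idf = math.log(N / df[word])     # inverse document frequency (log-scaled)
    return tf * idf

print(tfidf("forest", tokenized[0]))   # in 2 of 3 docs -> low idf, ~0.405
print(tfidf("algebra", tokenized[1]))  # in 1 of 3 docs -> high idf, ~1.099
```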

6 References
Scientific articles contain bibliographies; web documents contain hyperlinks. These can be used as a representation space in which a document is represented by the other documents it references. Typically a binary vector containing:
0 – lack of a reference to the given document
1 – existence of the reference
Some possible extensions (see the sketch below):
- Not all articles are equal. Ranking algorithms such as PageRank or HITS measure the importance of documents and can provide, instead of a binary value, a weight describing the importance carried when one article points to another.
- We can use references of higher order, capturing references not only from direct neighbours but also looking further.
Like the word representation this gives sparse vectors, but of much lower dimension; it is poor at capturing semantics.
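A minimal sketch of the binary reference representation (the link structure below is hypothetical):

```python
import numpy as np

# Hypothetical link structure: links[i] lists the documents that
# article i references (hyperlinks / bibliography entries).
links = {
    0: [1, 2],
    1: [2],
    2: [],
    3: [0, 2],
}
n_docs = len(links)

# Binary representation: R[i, j] = 1 if document i references document j.
R = np.zeros((n_docs, n_docs), dtype=int)
for i, targets in links.items():
    R[i, targets] = 1

print(R)
# Extension from the slide: replace the 1s with PageRank/HITS-style
# importance weights of the referenced documents instead of binary values.
```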

7 Compression
Usually we need to show differences and similarities between texts in the repository. They can be calculated using e.g. cosine distance, which is suitable for high-dimensional, sparse vectors, yielding a square matrix describing text similarity.
Another possibility is to build the representation space on algorithmic information, estimated using standard file compression techniques.
Key idea: if two documents are similar, compressing their concatenation leads to a file only slightly larger than a single compressed file. Two similar files compress better together than two different ones.
The complexity-based similarity measure is the fraction by which the sum of the separately compressed files exceeds the size of the jointly compressed file:
s(A, B) = (|A_p| + |B_p| − |(AB)_p|) / |(AB)_p|
where A and B denote text files, and the suffix p denotes the compression operation.
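A minimal sketch of this measure in Python, using zlib as a stand-in for whatever compressor was actually used; the documents are invented for illustration:

```python
import zlib

def csize(text: str) -> int:
    """Size in bytes of the compressed text."""
    return len(zlib.compress(text.encode("utf-8")))

def compression_similarity(a: str, b: str) -> float:
    # Fraction by which the separately compressed sizes exceed
    # the size of the jointly compressed concatenation.
    joint = csize(a + b)
    return (csize(a) + csize(b) - joint) / joint

doc1 = "the volcano erupted and lava flowed down the mountain " * 20
doc2 = "the volcano erupted and ash covered the mountain slopes " * 20
doc3 = "linear algebra studies vector spaces and linear maps " * 20

print(compression_similarity(doc1, doc2))  # similar texts -> higher value
print(compression_similarity(doc1, doc3))  # different texts -> lower value
```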

8 The data
The three ways of generating numerical representations of texts have been compared on a set of articles selected from Wikipedia: articles belonging to subcategories of the super-category Science:
Chemistry → Chemical compounds
Biology → Trees
Mathematics → Algebra
Computer science → MS (Microsoft) operating systems
Geology → Volcanology

9 Rough view of the class distribution
PCA projections of the data onto the two principal components with the highest variance: projections of the dataset for the text representations based on terms, links and compression.
Number of components needed to reach 90% of the variance, and the cumulative sum of the principal components' variance, for the successive text representations.

10 SVM classification
Classification may be used as a method for validating text representations: the better the results a classifier obtains, the better the representation. The information extracted by the different text representations may be estimated by comparing classifier errors in the various feature spaces.
Multiclass SVM classification with a one-versus-rest approach has been used, with two-fold cross-validation repeated 50 times for accurate averaging of the results.
The raw representation based on complexity gives the best results. Reducing the dimensionality by removing features related to only one article improves the results. Introducing a cosine kernel improves the results considerably.
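A minimal sketch of this evaluation protocol with scikit-learn, on random stand-in data; note that SVC with a precomputed kernel uses its built-in one-versus-one multiclass scheme, a slight departure from the one-versus-rest setup on the slide:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical stand-in data: rows are documents in some feature space
# (terms, links or compression-based); y holds the five category labels.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 5, size=100)

# Cosine kernel computed explicitly and passed to the SVM as precomputed;
# cross_val_score slices both rows and columns of K for pairwise estimators.
K = cosine_similarity(X)
clf = SVC(kernel="precomputed")

# Two-fold cross-validation repeated 50 times, as on the slide.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=50, random_state=0)
scores = cross_val_score(clf, K, y, cv=cv)
print(scores.mean(), scores.std())
```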

11 SVM and PCA reduction
Selecting the components that account for 90% of the variance has been used for dimensionality reduction.
It worsens the classification results for terms and links (too high a reduction?).
PCA does not influence the complexity representation. As in the previous results, introducing the cosine kernel improves classification; for terms it is even slightly better.
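A minimal sketch of the 90%-variance selection with scikit-learn (the data matrix is a random stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((100, 20))  # hypothetical document-feature matrix

# Keep the smallest number of components whose cumulative explained
# variance reaches 90%, as used on the slide for dimensionality reduction.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])                      # number of retained components
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance curve
```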

12 Summary
The complexity measure allows a much more compact representation, as seen from the cumulative contribution of the principal components, and achieved the best accuracy in the PCA-reduced space with only 36 dimensions.
After applying the cosine kernel, the term-based representation is slightly more accurate.
Explicit representation of the kernel spaces and the use of a linear SVM classifier allow finding important reference documents for a given category, as well as identifying collocations and phrases that are important for characterizing each category.
Distance-based kernels improve results and reduce dimensionality for the term and link representations. There is also an improvement for the complexity-based representation, where the distance-based similarity is a second-order transformation.

13 Future directions
Different methods of representation extract different information from texts and show different aspects of the documents. In the future we plan to combine the representations and use one joint representation.
We plan to introduce more background knowledge and capture some semantics:
- WordNet can be used as a semantic space onto which the words of an article are mapped.
- WordNet is a network of interconnected synsets – elementary atoms that carry meaning.
- The mapping requires word-sense disambiguation techniques.
- This allows using activations of the WordNet semantic network and then calculating distances between them, which should give better semantic similarity measures.
A large-scale classifier for the whole of Wikipedia.

14 Thank you for your attention

