Representation of hypertext documents based on terms, links and text compressibility Julian Szymański Department of Computer Systems Architecture, Gdańsk University of Technology.


Representation of hypertext documents based on terms, links and text compressibility
Julian Szymański – Department of Computer Systems Architecture, Gdańsk University of Technology, Poland
Włodzisław Duch – Department of Informatics, Nicolaus Copernicus University, Toruń, Poland; School of Computer Engineering, Nanyang Technological University, Singapore

Outline
- Text representations: words, references, compression
- Evaluation of text representations: Wikipedia data, SVM & PCA
- Experimental results and conclusions
- Future directions

Text representation
The amount of information on the Internet grows rapidly, so machine support is needed for:
- categorization (supervised or unsupervised),
- searching / retrieval.
Humans understand text; machines don't. To process text, a machine needs it in a computable form, and the results of text processing strongly depend on the method used for text representation. There are several approaches to processing natural language:
- logic (ontologies),
- statistical processing of large text corpora,
- geometry, mainly used in machine learning.
Machine learning for NLP uses text features. The aim of the experiments presented here is to find a hypertext representation suitable for:
- automatic categorization,
- information retrieval,
- the Wiki project – improvement of the existing Wikipedia category system.

Text representation with features
A convenient form for machine processing of text is a vector of features. A text set is then represented as a matrix in which each of the N features is related to text k by a weight c:
- k – document number,
- n – feature number,
- c_kn – value of feature n in document k.
Where do the features come from?

Words
The most intuitive approach is to take words as features, since the words' content should describe the subject of the text well. The n-th word gets value c_kn in the context of the k-th document, calculated as c_kn = tf_kn * idf_n, where:
- tf – term frequency: how many times word n appears in document k,
- idf – inverse document frequency: how seldom word n appears in the whole text set, i.e. the ratio of the number of all documents to the number of documents containing the given word.
Problem: high-dimensional sparse vectors. BOW (Bag of Words) loses syntax. Preprocessing: stopwords, stemming; features -> terms. Some other possibilities: n-grams, profiles of letter frequencies.
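A minimal sketch of this weighting in Python (the toy corpus and whitespace tokenization are assumptions for the example; the slide's plain-ratio idf is used, though implementations commonly take its logarithm):

```python
from collections import Counter

# Toy corpus standing in for the Wikipedia articles (assumption).
docs = [
    "algebra is the study of mathematical symbols",
    "a volcano is a rupture in the crust of a planet",
    "algebra uses symbols to represent numbers",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(k):
    """Weights c_kn = tf_kn * idf_n for every word n of document k."""
    tf = Counter(tokenized[k])
    # Plain-ratio idf, as described on the slide.
    return {w: tf[w] * (n_docs / df[w]) for w in tf}

print(tfidf(0))  # 'algebra' (in 2 docs) scores lower than 'mathematical' (in 1)
```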

References
Scientific articles contain bibliographies; web documents contain hyperlinks. These can be used as a representation space in which a document is represented by the other documents it references. Typically it is a binary vector containing:
- 0 – lack of a reference to the given document,
- 1 – existence of the reference.
Some possible extensions:
- Not all articles are equal. Ranking algorithms such as PageRank or HITS allow measuring the importance of documents and can provide, instead of a binary value, a weight describing how important one article is when it points to another.
- We can use references of higher order, which capture references not only from neighbours but also look further.
As with the words representation the vectors are sparse, but of much lower dimension; the representation is poor at capturing semantics.
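A minimal sketch of the binary link representation (the toy link structure is an assumption for the example):

```python
import numpy as np

# Toy hyperlink structure: article -> articles it links to (assumption).
links = {
    "Algebra": ["Mathematics", "Symbol"],
    "Volcano": ["Geology", "Lava"],
    "Lava": ["Volcano", "Geology"],
}

# One dimension per document that can be referenced.
targets = sorted({t for ts in links.values() for t in ts})

def link_vector(article):
    """Binary vector: 1 where the article references the target, else 0."""
    refs = set(links[article])
    return np.array([int(t in refs) for t in targets])

print(targets)               # ['Geology', 'Lava', 'Mathematics', 'Symbol', 'Volcano']
print(link_vector("Lava"))   # [1 0 0 0 1]
```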

Compression
Usually we need to show differences and similarities between the texts in a repository. They can be calculated using e.g. the cosine distance, which is suitable for high-dimensional sparse vectors, yielding a square matrix describing text similarity. Another possibility is to build the representation space on algorithmic information, estimated using standard file compression techniques. Key idea: if two documents are similar, their concatenation compresses to a file only slightly larger than a single compressed file; two similar files compress together better than two different ones. The complexity-based similarity measure is the fraction by which the sum of the separately compressed files exceeds the size of the jointly compressed file:

s(A, B) = (|A_p| + |B_p| - |(AB)_p|) / |(AB)_p|

where A and B denote text files, the suffix p denotes the compression operation, and |.| is the file size.
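A minimal sketch of this measure (zlib as the compressor and the sample strings are assumptions; the slides do not name a specific compression algorithm):

```python
import zlib

def csize(text: str) -> int:
    """Size |X_p| of the compressed text."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def similarity(a: str, b: str) -> float:
    """Fraction by which the separately compressed sizes exceed the joint size."""
    joint = csize(a + b)
    return (csize(a) + csize(b) - joint) / joint

doc1 = "the quick brown fox jumps over the lazy dog " * 20
doc2 = "the quick brown fox jumps over the lazy cat " * 20
doc3 = "completely unrelated content about volcano geology " * 20

print(similarity(doc1, doc2))  # high: the two texts share almost all structure
print(similarity(doc1, doc3))  # lower: little shared content to exploit
```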

The data
The three ways of generating numerical representations of texts have been compared on a set of articles selected from Wikipedia. The articles belong to subcategories of the super-category Science:
- Chemistry → Chemical compounds,
- Biology → Trees,
- Mathematics → Algebra,
- Computer science → MS (Microsoft) operating systems,
- Geology → Volcanology.

Rough view of the class distribution
[Figure: PCA projections of the data onto the two principal components with the highest variance, for the representations based on terms, links and compression.]
[Figure: the number of components needed to account for 90% of the variance, and the cumulative sum of principal-component variance, for the successive text representations.]

SVM classification
Classification may be used as a method of validating text representations: the better the results a classifier obtains, the better the representation. The information extracted by the different text representations may be estimated by comparing classifier errors in the various feature spaces. Multiclass SVM classification in the one-versus-rest approach has been used, with two-fold cross-validation repeated 50 times for accurate averaging of the results.
- The raw representation based on complexity gives the best results.
- Reducing the dimensionality by removing features related to only one article improves the results.
- Introducing a cosine kernel improves the results considerably.
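A sketch of this evaluation protocol with scikit-learn (the random matrix stands in for the real term/link/complexity features; passing cosine similarity as a callable kernel is one way to realize the cosine kernel from the slide):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Random data standing in for a documents-by-features matrix (assumption).
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 5, size=200)  # 5 classes, like the Wikipedia subcategories

# One-versus-rest SVM with a cosine kernel.
clf = OneVsRestClassifier(SVC(kernel=cosine_similarity))

# Two-fold cross-validation repeated 50 times, as on the slide.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```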

SVM and PCA reduction
Selecting the components that account for 90% of the variance has been used for dimensionality reduction:
- It worsens the classification results for terms and links (too strong a reduction?).
- PCA does not influence the complexity representation.
- As in the previous results, introducing the cosine kernel improves classification; for terms it is then even slightly better.
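A sketch of this reduction step (scikit-learn's PCA accepts a fractional n_components, interpreted as the share of variance to retain; the data is again a stand-in):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((200, 50))  # stand-in for a term/link/complexity matrix

# Keep the smallest number of components explaining >= 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.sum())  # cumulative variance retained
```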

Summary
- The complexity measure allowed a much more compact representation, as seen from the cumulative contribution of the principal components, and achieved the best accuracy in the PCA-reduced space, with only 36 dimensions. After applying the cosine kernel, the term-based representation is slightly more accurate.
- Explicit representation of the kernel spaces and the use of a linear SVM classifier make it possible to find the reference documents important for a given category, as well as to identify collocations and phrases important for characterizing each category.
- Distance-type kernels improve the results and reduce the dimensionality of the terms and links representations. There is also an improvement for the complexity-based representation, where the distance-based similarity is a second-order transformation.

Future directions
Different methods of representation extract different information from texts and show different aspects of the documents. In the future we plan to combine them into a single joint representation. We also plan to introduce more background knowledge and capture some semantics:
- WordNet can be used as a semantic space onto which words from an article are mapped.
- WordNet is built as a network of interconnected synsets – elementary atoms that carry meaning.
- The mapping requires word-sense disambiguation techniques.
- This allows using activations of the WordNet semantic network and calculating distances between them, which should give better semantic similarity measures.
Also planned: a large-scale classifier for the whole of Wikipedia.

Thank you for your attention.