Background Knowledge for Ontology Construction Blaž Fortuna, Marko Grobelnik, Dunja Mladenić, Institute Jožef Stefan, Slovenia.

Bag-of-words

Documents are encoded as vectors: each element of the vector corresponds to the frequency of one word. Each word can also be weighted according to its importance. There exist various ways of selecting word weights; in our paper we propose a method to learn them!

Example document: "Computers are used in increasingly diverse ways in Mathematics and the Physical and Life Sciences. This workshop aims to bring together researchers in Mathematics, Computer Science, and Sciences to explore the links between their disciplines and to encourage new collaborations."

Word counts, learned weights, and the resulting weighted vector:

  word         count   weight   weighted
  computer     2       0.9      1.8
  mathematics  2       0.8      1.6
  are          1       0.01     0.01
  and          4       0.01     0.04
  science      3       0.9      2.7

Important words (computer, mathematics, science) receive high weights; noise words (are, and) are suppressed.
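The encoding above can be sketched in a few lines of Python. The example document and the weight values are taken from the slide; the function names and the small default weight for words without a learned weight are illustrative choices, not part of the original method.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Encode a document as raw term frequencies (a sparse vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def apply_weights(tf, weights, default=0.01):
    """Scale each term frequency by its learned word weight; words
    without a learned weight get a small default (treated as noise)."""
    return {w: c * weights.get(w, default) for w, c in tf.items()}

doc = ("Computers are used in increasingly diverse ways in Mathematics "
       "and the Physical and Life Sciences.")
tf = bag_of_words(doc)                          # e.g. tf["and"] == 2
weights = {"computers": 0.9, "mathematics": 0.8, "sciences": 0.9,
           "are": 0.01, "and": 0.01}            # learned elsewhere
weighted = apply_weights(tf, weights)           # e.g. weighted["and"] == 0.02
```

The weighted vector is what the downstream methods (clustering, classification) consume instead of the raw counts.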

SVM feature selection

Input: a set of documents, a set of categories, and for each document the subset of categories assigned to it.
Output: a ranking of words according to importance.
Intuition: a word is important if it discriminates documents according to categories.
Basic algorithm: learn a linear SVM classifier for each of the categories; a word is important if it is important for classification into any of the categories.
Reference: Brank J., Grobelnik M., Milic-Frayling N. & Mladenic D.: Feature selection using support vector machines.
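The intuition can be demonstrated end to end on toy data. The sketch below uses a simple perceptron as a lightweight stand-in for the linear SVM (both produce a normal vector whose components score word importance); the two-category news corpus and all names are invented for illustration.

```python
def train_linear(docs, labels, epochs=20):
    """Train a linear separator; the learned weight vector plays the
    role of the SVM normal vector in the feature-selection recipe."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    wvec, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for d, y in zip(docs, labels):          # y in {+1, -1}
            score = bias + sum(wvec[idx[w]] for w in d)
            if y * score <= 0:                  # misclassified: update
                for w in d:
                    wvec[idx[w]] += y
                bias += y
    return vocab, wvec

# Toy corpus: +1 = finance category, -1 = sports category.
docs = [{"stock", "market", "price"}, {"stock", "trade"},
        {"goal", "match", "team"}, {"team", "player"}]
labels = [1, 1, -1, -1]
vocab, wvec = train_linear(docs, labels)
weight = dict(zip(vocab, wvec))
# Rank words by the magnitude of their weight: a large |weight| means
# the word discriminates between the categories well.
ranking = sorted(vocab, key=lambda w: -abs(weight[w]))
```

Words that appear only in one category end up with weights far from zero; words that do not help separate the categories stay near zero and sink to the bottom of the ranking.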

Word weight learning

The word weight learning method is based on SVM feature selection: besides ranking the words, it also assigns them weights based on the SVM classifiers.

Notation:
N – number of documents
{x_1, …, x_N} – documents
C(x_i) – set of categories for document x_i
n – number of words
{w_1, …, w_n} – word weights
{n^j_1, …, n^j_n} – SVM normal vector for the j-th category

Algorithm:
1. Train a linear SVM classifier for each category.
2. Calculate word weights for each category from the SVM normal vectors; the weight for the i-th word and the j-th category is a function of the normal-vector component n^j_i.
3. Final word weights are calculated separately for each document, from the categories C(x_i) assigned to it.
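The slide's exact weighting formula is not reproduced here, so the combination rule in the sketch below is an assumption, not the paper's equation: it weights word i in a document by the largest non-negative normal-vector component n^j_i over the document's categories C(x). The structure (per-category normals in, per-document word weights out) matches the algorithm above.

```python
def document_word_weights(doc_categories, normals):
    """Per-document word weights from per-category SVM normal vectors.
    Illustrative combination rule (an assumption, not the slide's exact
    formula): for each word i, take the largest non-negative component
    n^j_i over the document's categories j in C(x)."""
    n_words = len(normals[0])
    return [max((max(normals[j][i], 0.0) for j in doc_categories), default=0.0)
            for i in range(n_words)]

# Two categories, three words; normals[j][i] stands for n^j_i.
normals = [[0.9, -0.2, 0.1],   # normal vector of category 0
           [0.3,  0.8, -0.5]]  # normal vector of category 1
w = document_word_weights({0, 1}, normals)   # -> [0.9, 0.8, 0.1]
```

Note how the same word can get different weights in different documents: a document tagged only with category 0 would give the second word weight 0, since -0.2 is clipped to zero.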

OntoGen system

System for semi-automatic ontology construction.
– Why semi-automatic? The system only gives suggestions to the user; the user always makes the final decision.
– The system is data-driven and can scale to large collections of documents.
– The current version is focused on the construction of topic ontologies; the next version will be able to deal with more general ontologies. It can import/export RDF.

There is a big divide between unsupervised and fully supervised construction tools, and both approaches have weak points:
– it is difficult to obtain the desired results using unsupervised methods, e.g. because of limited background knowledge;
– manual tools (e.g. Protégé, OntoStudio) are time consuming, and the user needs to know the entire domain.

We combined these two approaches in order to eliminate these weaknesses:
– the user guides the construction process,
– the system helps the user with suggestions based on the document collection.

How does OntoGen help?

By identifying the topics and the relations between them, using k-means clustering:
– a cluster of documents => a topic;
– documents are assigned to clusters => the subject-of relation;
– clustering can be repeated on the subset of documents assigned to a specific topic => this identifies subtopics and the subtopic-of relation.

By naming the topics:
– using the centroid vector: the centroid vector of a given topic is the average document of this topic (the normalised sum of the topic's documents); the most descriptive keywords for the topic are the words with the highest weights in the centroid vector.
– using a linear SVM classifier: an SVM classifier is trained to separate the documents of the given topic from the other documents in the context; the words found most important for this classification are selected as keywords for the topic.
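The centroid-based naming step can be sketched as follows; the toy topic and its term counts are invented for illustration.

```python
import math
from collections import Counter

def centroid(topic_docs):
    """Centroid of a topic: the normalised sum of the bag-of-words
    vectors of the topic's documents (each a Counter of term freqs)."""
    total = Counter()
    for d in topic_docs:
        total.update(d)
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {w: v / norm for w, v in total.items()}

def keywords(topic_docs, k=3):
    """Most descriptive keywords: highest-weighted words in the centroid."""
    c = centroid(topic_docs)
    return [w for w, _ in sorted(c.items(), key=lambda p: -p[1])[:k]]

# Toy topic consisting of two documents about ontologies.
topic = [Counter({"ontology": 3, "rdf": 2, "the": 1}),
         Counter({"ontology": 2, "owl": 1, "the": 1})]
kws = keywords(topic, 2)   # -> ["ontology", "rdf"]
```

In practice this is applied to the weighted vectors from the earlier slides rather than raw counts, so stop words like "the" are already suppressed before the keywords are chosen.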

[Screenshot: the OntoGen user interface, showing the topic ontology visualization, suggestions of subtopics, topic keywords, all documents, the selected topic, outlier detection, and the topic's documents.]

Topic ontology of Yahoo! Finance

Background knowledge in OntoGen

All of the methods in OntoGen are based on the bag-of-words representation. By using different word weights we can tune these methods according to the user's needs. The user only needs to group the documents into categories; this can be done efficiently using active learning.
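One common way to make this grouping efficient is uncertainty sampling: repeatedly ask the user to label the document the current classifier is least sure about. The slide does not state which active-learning criterion OntoGen uses, so the sketch below is a generic illustration of the idea.

```python
def most_uncertain(scores):
    """Uncertainty sampling: pick the document whose classifier score
    is closest to the decision boundary (score 0) for the user to label."""
    return min(range(len(scores)), key=lambda i: abs(scores[i]))

# Hypothetical SVM margins for four unlabelled documents.
scores = [2.1, -0.05, -1.7, 0.4]
query = most_uncertain(scores)   # -> 1 (the least confident document)
```

After the user labels the queried document, the classifier is retrained and the loop repeats, so each label is spent where it is most informative.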

Influence of background knowledge

Data: Reuters news articles. Each article is assigned two different sets of tags:
– Topics
– Countries
Each set of tags offers a different view on the same documents: a Topics view and a Countries view.

Links
OntoGen:
Text Garden: