Learning Object Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel.

Makrehchi & Kamel 2 of 46
Outline
– Metadata Mining
– Metadata Representation Model
– Class-Term Matrix
– Case Study
– Concluding Remarks

Metadata Mining
Metadata definition
– Data about data, for example a library catalogue
Metadata applications
– Cataloguing (items and collections)
– Resource discovery
– Electronic commerce and digital signatures
– Intelligent software agents
– Content rating
– Intellectual property rights
– Semantic Web
– Learning objects
LOM standards: IEEE LOM, DC, SCORM, CANCORE

Metadata Mining
Definition
– Extraction of implicit, previously unknown, and potentially useful information from metadata
Methods
– Classification, clustering, summarization, mining association rules, ontology extraction, information integration, keyword extraction, automatic title generation

Metadata Mining
Why metadata mining?
– There is no access to the data itself (lack of raw data)
– The data is not convenient for mining (heterogeneous and non-text formats)
– Metadata standards are diverse, and different metadata repositories need to be merged
– Ontology extraction is much easier at the metadata level

Metadata Mining
[Figure: conceptual data architecture]

Metadata Mining
Applications
– Metadata mining instead of raw data mining
– Metadata enrichment (keyword extraction)
– (Semi-)automatic ontology extraction
– Mining RDF, OWL, and other semantically tagged scripts
– Information integration (LO aggregation and integration)

Metadata Mining
– Statistical methods, based on word frequency analysis
– Syntactic methods, based on linguistic parsing and pattern matching
– Structural methods, which study the outline of the document
– Conceptual (semantic) methods, which use a knowledge base to interpret meaning

Metadata Mining
We do not use
– Natural language processing (NLP)
– Semantic analysis and processing
– Graphs, trees, and other sophisticated data structures and models
– Dictionaries, thesauri, or any other global vocabularies (only a simple Porter stemmer)

Metadata Representation Model
We treat metadata as a text document in a semi-structured format.
The only measures are
– Statistical measures (such as frequency)
– Geometric features (such as the location of a specific term, or the order of words in a term or phrase)

Metadata Representation Model
Vector Space Model
[Figure labels: T, d_i, Vocabulary]

Metadata Representation Model
Multi-Partition Vector Space Model
[Figure labels: T, d_i, Vocabulary]

Metadata Representation Model
Converting to standard vector model

Metadata Representation Model
Weight of each partition
– To be determined by an expert, for example W_abstract = 1.0, W_title = 1.5
Membership degree of each term in every partition
– By an expert
– Frequency-based measures (tf-idf)
– Geometric measures (location of each term in the partition)
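
The transcript does not preserve the formula used to convert a multi-partition vector into a standard one. As a rough sketch only, one plausible reading is a weighted sum of per-partition term counts, using the example weights from the slide. The function names, the whitespace tokenizer, and the default weight of 1.0 for unlisted partitions are all assumptions made for illustration.

```python
from collections import Counter

# Example weights from the slide; any other partition falls back to 1.0 (assumption).
PARTITION_WEIGHTS = {"title": 1.5, "abstract": 1.0}

def partition_vector(text):
    """Raw term counts for one partition (lowercased whitespace tokens)."""
    return Counter(text.lower().split())

def to_standard_vector(record, weights=PARTITION_WEIGHTS):
    """Collapse per-partition term counts into one weighted term vector.

    This weighted sum is an assumed conversion, not the authors' formula.
    """
    combined = Counter()
    for partition, text in record.items():
        w = weights.get(partition, 1.0)
        for term, count in partition_vector(text).items():
            combined[term] += w * count
    return dict(combined)

# Hypothetical metadata record with two partitions.
record = {"title": "metadata mining", "abstract": "mining learning object metadata"}
vector = to_standard_vector(record)
```

With these weights, a term that appears once in the title and once in the abstract ends up with weight 2.5, while a term that appears only in the abstract keeps weight 1.0.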

Class-Term Matrix
Document-Term Matrix (Collection × Vocabulary)
– The matrix is very large (thousands of documents in the collection and millions of terms in the vocabulary).
– The matrix is sparse: usually only a small number of its elements are nonzero (Zipf's law).
– The matrix is dual with respect to terms and documents.

Class-Term Matrix
Class-Term Matrix (Class × Vocabulary)
– The matrix is large (tens of classes and millions of terms in the vocabulary).
– The matrix is less sparse.
– The matrix is still dual with respect to terms and classes.
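
To make the relationship between the two matrices concrete, the sketch below builds a class-term matrix by summing term counts over all documents that share a class label. It assumes each document carries exactly one label; the function name and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict

def class_term_matrix(documents):
    """Sum term counts over all documents that share a class label.

    documents: iterable of (class_label, list_of_terms) pairs.
    Returns {class_label: Counter({term: class-term frequency})}.
    """
    matrix = defaultdict(Counter)
    for label, terms in documents:
        matrix[label].update(terms)
    return dict(matrix)

# Toy collection: three documents, two classes (made-up data).
docs = [
    ("databases", ["query", "index", "query"]),
    ("databases", ["index", "transaction"]),
    ("networks", ["packet", "routing", "query"]),
]
ctm = class_term_matrix(docs)
```

Three document rows collapse into two class rows, which is why the class-term matrix is both smaller and less sparse than the document-term matrix.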

Class-Term Matrix
– Class-term frequency
– Term significance measure
– Normalized term significance measure

Class-Term Matrix
Terminology
– All terms which occur in a class (or concept)
– A fuzzy set of all terms in the vocabulary

Class-Term Matrix
Definition
– All concepts (classes) to which the term belongs
– A fuzzy set of all concepts (classes)
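
The membership functions behind these fuzzy sets are not preserved in the transcript. The sketch below therefore substitutes a simple stand-in: raw class-term frequency scaled by the row maximum (for a terminology) or by the column maximum (for a definition). That scaling, the function names, and the toy matrix are assumptions, not the authors' term significance measure.

```python
def terminology(ctm, label):
    """Fuzzy set over terms for one class: frequency scaled by the class maximum."""
    row = ctm[label]
    peak = max(row.values())
    return {term: freq / peak for term, freq in row.items()}

def definition(ctm, term):
    """Fuzzy set over classes for one term: frequency scaled by the term's maximum."""
    column = {label: row.get(term, 0) for label, row in ctm.items()}
    peak = max(column.values())
    if peak == 0:
        # The term occurs in no class, so every membership degree is zero.
        return {label: 0.0 for label in column}
    return {label: freq / peak for label, freq in column.items()}

# Toy class-term matrix (made-up counts).
ctm = {
    "databases": {"query": 4, "index": 2},
    "networks": {"query": 1, "packet": 5},
}
term_set = terminology(ctm, "databases")
concept_set = definition(ctm, "query")
```

Reading a row of the matrix yields a terminology, and reading a column yields a definition, which is the duality noted on the earlier slides.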

Case Study
Data set
– No LO metadata repository is available
– Citeseer computer science directory
– ~400,000 terms (vocabulary size)
– 17 classes
– 2,912 documents
– Instead of the data itself (in PDF or PS), we collected BibTeX data (a kind of metadata or catalogue) and the abstracts of the articles

Case Study
Types of frequency measures
– Within document: document-term frequency (such as tf-idf)
– Within class: class-term frequency (such as term significance)
– Within collection: collection-term frequency (such as the mean of term significances)

Case Study
Term clustering: categorizing all terms into three main groups
– Features: terms that are more frequent within a class
– Keywords: terms that are more frequent within some documents belonging to a given class
– Stopwords: terms that are more frequent in all classes
Introducing the class-collection map
– To visualize the location of each category
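
The slides assign terms to these groups with fuzzy inference over the class-collection map, and the rules themselves are not in the transcript. As a deliberately crude stand-in, the sketch below uses a single crisp threshold on two scores, one at class level and one at collection level. The scores, the threshold of 0.5, and the reading that a keyword is a term scoring low on both axes are all assumptions.

```python
def categorize(class_score, collection_score, threshold=0.5):
    """Crisp stand-in for the fuzzy inference; both scores are assumed to lie in [0, 1]."""
    if collection_score >= threshold:
        return "stopword"   # frequent across all classes
    if class_score >= threshold:
        return "feature"    # frequent in this class but not collection-wide
    return "keyword"        # frequent only in some documents of the class

# Hypothetical (class_score, collection_score) pairs for one class.
scores = {"the": (0.9, 0.9), "routing": (0.8, 0.1), "ospf": (0.2, 0.05)}
labels = {term: categorize(c, g) for term, (c, g) in scores.items()}
```

A fuzzy version would replace the hard threshold with overlapping membership functions, so that terms near a boundary belong partly to two groups.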

Case Study
Extraction of stopwords (words that do not contribute to the meaning of the document)
– General stopwords (a, an, the, in, …)
– Domain-specific stopwords
  Politics: Government, State
  Medicine: Patient
  Education: Learner, Instructor
  Social sciences: Society
  Anthropology: Human

Case Study
Why do we need to remove domain-specific stopwords?
– Dimensionality reduction
– More accurate feature selection (information gain has the drawback of selecting noise as features)
– Based on stopwords, we can find and separate phrases (by our definition, a phrase is a set of words between two stopwords)
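
The phrase definition above translates directly into a small routine: scan the tokens and cut a phrase whenever a stopword is met. The sketch below is a minimal illustration; the stopword list, the punctuation stripping, and the choice to also treat the start and end of the text as phrase boundaries are assumptions.

```python
def extract_phrases(text, stopwords):
    """Split a text into phrases, treating every stopword as a delimiter."""
    phrases, current = [], []
    for token in text.lower().split():
        word = token.strip(".,;:()")
        if not word or word in stopwords:
            # A stopword closes the phrase collected so far, if any.
            if current:
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        # Assumption: the end of the text also closes a phrase.
        phrases.append(" ".join(current))
    return phrases

# Small hypothetical stopword list.
stopwords = {"a", "an", "the", "in", "of", "for", "is"}
phrases = extract_phrases("The vector space model is a model for text documents", stopwords)
```

Adding domain-specific stopwords to the list gives more cut points, so long runs of words break into shorter and more specific phrases.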

Case Study
Dimensionality reduction process
– ~400,000 terms
– Using metadata: 15,971 terms
– Stemming: 12,044 terms
– Multi-partition document vector space model: 5,605 terms
– Fuzzy-based term clustering: 507 stopwords, 4,872 keywords, 226 features
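
A quick arithmetic check on the reported counts: the three term clusters add up exactly to the 5,605 terms that enter the clustering step. The sketch below verifies this and computes the share of the vocabulary kept at each stage. The counts come from the slide; the stage labels reflect one reading of the flattened diagram and should be treated as assumptions.

```python
# Term counts as reported on the slide; stage labels are an assumed reading of the diagram.
stages = [
    ("full vocabulary (approximate)", 400_000),
    ("after using metadata", 15_971),
    ("after stemming", 12_044),
    ("after multi-partition vector space model", 5_605),
]
clusters = {"stopwords": 507, "keywords": 4_872, "features": 226}

# The three term clusters should partition the final vocabulary.
assert sum(clusters.values()) == stages[-1][1]

# Share of the original vocabulary kept at each stage.
retained = {name: count / stages[0][1] for name, count in stages}

# Share of the final vocabulary that ends up as classification features.
feature_share = clusters["features"] / stages[-1][1]
```

Since the starting figure is approximate, the retained shares are only indicative.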

Concluding Remarks
– Most statistics-based data mining methods do not use domain knowledge.
– Metadata (semi-structured data) mining uses the domain knowledge embedded in tags and partitions.
– We introduced the multi-partition document vector space model.
– We mine the class-term matrix in addition to the document-term matrix.

Concluding Remarks
Based on the visualization model (class-collection map) and fuzzy inference, we can cluster the vocabulary for each class and extract three essential categories:
– Features: to classify unknown documents
– Keywords: for indexing and access to specific documents in IR applications
– Stopwords: for dimensionality reduction and noise removal

Concluding Remarks
Based on the class-term matrix, we defined
– Terminologies as fuzzy sets of all terms in the vocabulary
– Definitions as fuzzy sets of all concepts

Concluding Remarks
Future work
– Collecting LO metadata and constructing an LO metadata repository
– A keyword recall method to test and validate the extracted keywords
– Implementing an average classifier (k-NN or fuzzy classifier) to test and validate the selected features
– Applying a multi-classifier architecture to the metadata mining problem