1 Learning Object Metadata Mining
Masoud Makrehchi
Supervisor: Prof. Mohamed Kamel

2 Outline
Metadata Mining
Metadata Representation Model
Class-Term Matrix
Case Study
Concluding Remarks

3 Metadata Mining
Metadata definition
–Data about data; for example, a library catalogue
Metadata applications
–Cataloging (items and collections)
–Resource discovery
–Electronic commerce and digital signatures
–Intelligent software agents
–Content rating
–Intellectual property rights
–Semantic Web
–Learning objects
LOM standards: IEEE LOM, DC, SCORM, CANCORE

4 Metadata Mining
Definition
–The extraction of implicit, previously unknown, and potentially useful information from metadata.
Methods
–Classification, clustering, summarization, association rule mining, ontology extraction, information integration, keyword extraction, automatic title generation.

5 Metadata Mining
Why metadata mining?
–No access to the data itself (lack of raw data),
–The data is not convenient for mining (heterogeneous formats, non-text formats),
–Diversity of metadata standards and the need to merge different metadata repositories,
–Ontology extraction is much easier at the metadata level.

6 Metadata Mining
Conceptual data architecture [figure]

7 Metadata Mining
Applications
–Metadata mining instead of raw data mining,
–Metadata enrichment (keyword extraction),
–(Semi-)automatic ontology extraction,
–Mining RDF, OWL, and other semantically tagged scripts,
–Information integration (LO aggregation and integration).

8 Metadata Mining
Statistical methods, based on word frequency analysis,
Syntactic methods, based on linguistic parsing and pattern matching,
Structural methods, studying the outline of the document,
Conceptual (semantic) methods, based on the use of a knowledge base to interpret meaning.

9 Metadata Mining
We don't use
–Natural language processing (NLP),
–Semantic analysis and processing,
–Graphs, trees, and other sophisticated data structures and models,
–Dictionaries, thesauri, or any other global vocabularies (only a simple Porter stemmer).

10 Outline
Metadata Mining
Metadata Representation Model
Class-Term Matrix
Case Study
Concluding Remarks

11 Metadata Representation Model
We treat metadata as a text document (semi-structured format).
The only measures are
–statistical measures (e.g., frequency),
–geometric features (e.g., the location of a specific term, or the order of words in a term or phrase).

12 Metadata Representation Model
Vector Space Model [figure: document vector d_i over the vocabulary T]

13 Metadata Representation Model
Multi-Partition Vector Space Model [figure: document vector d_i over the vocabulary T]

14 Metadata Representation Model
Multi-Partition Vector Space Model [figure]

15 Metadata Representation Model
Converting to the standard vector model [figure]

16 Metadata Representation Model
Weight of each partition
–To be determined by an expert, for example: W_abstract = 1.0, W_title = 1.5.
Membership degree of each term in every partition
–Assigned by an expert,
–Frequency-based measures (tf-idf),
–Geometric measures (location of each term in the partition); see the sketch below.
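
The conversion from the multi-partition representation (slides 13–15) back to a standard document vector can be sketched as follows. This is an illustrative reading of the model, not the authors' code: each metadata record is assumed to be a map from partition name to per-term membership degrees, and the partition weights follow the example on slide 16.

```python
from collections import defaultdict

# Expert-chosen partition weights, following the example on slide 16.
PARTITION_WEIGHTS = {"title": 1.5, "abstract": 1.0}

def to_standard_vector(partitions):
    """partitions: {partition_name: {term: membership_degree}}.
    Returns one {term: weight} vector over the whole vocabulary."""
    vector = defaultdict(float)
    for name, terms in partitions.items():
        w = PARTITION_WEIGHTS.get(name, 1.0)
        for term, mu in terms.items():
            vector[term] += w * mu        # weighted sum over partitions
    return dict(vector)

record = {
    "title": {"metadata": 1.0, "mining": 0.8},
    "abstract": {"metadata": 0.4, "ontology": 0.6},
}
print(to_standard_vector(record))
# {'metadata': 1.9, 'mining': 1.2, 'ontology': 0.6}
```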

17 Outline
Metadata Mining
Metadata Representation Model
Class-Term Matrix
Case Study
Concluding Remarks

18 Class-Term Matrix
Document-Term Matrix (collection × vocabulary)
–The matrix is very large (thousands of documents in the collection, millions of terms in the vocabulary),
–The matrix is sparse: usually only a small number of elements are non-zero (Zipf's law),
–The matrix is dual with respect to terms and documents.

19 Class-Term Matrix
Class-Term Matrix (classes × vocabulary)
–The matrix is large (tens of classes, millions of terms in the vocabulary),
–The matrix is less sparse,
–The matrix is still dual with respect to terms and classes; see the sketch below.
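
As an illustration of how the two matrices relate, the sketch below (not from the slides) collapses a toy document-term matrix into a class-term matrix by summing the rows of the documents in each class.

```python
import numpy as np

def class_term_matrix(doc_term, doc_labels, n_classes):
    """doc_term: (n_docs, n_terms) count matrix; doc_labels: class index per document."""
    ct = np.zeros((n_classes, doc_term.shape[1]))
    for row, label in zip(doc_term, doc_labels):
        ct[label] += row                  # accumulate term counts per class
    return ct

doc_term = np.array([[2, 0, 1],           # 3 documents, 3 terms
                     [1, 1, 0],
                     [0, 3, 1]])
labels = [0, 0, 1]                        # first two documents belong to class 0
print(class_term_matrix(doc_term, labels, n_classes=2))
# [[3. 1. 1.]
#  [0. 3. 1.]]
```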

20 Class-Term Matrix
–Class-term frequency
–Term significance measure
–Normalized term significance measure
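
The formulas behind these three measures appear only as images in the slides, so the definitions used below are plausible stand-ins rather than the authors' exact measures: class-term frequency counts a term inside one class, term significance is that count as a share of the class's total term count, and the normalized version rescales each term's significances to sum to one across classes.

```python
import numpy as np

# Class-term frequency matrix (classes x terms), e.g. the toy output above.
ct = np.array([[3.0, 1.0, 1.0],
               [0.0, 3.0, 1.0]])

# Assumed definitions (not the original formulas):
significance = ct / ct.sum(axis=1, keepdims=True)                    # share within each class
normalized = significance / significance.sum(axis=0, keepdims=True)  # comparison across classes

print(significance)   # [[0.6 0.2 0.2], [0. 0.75 0.25]]
print(normalized)     # approx. [[1. 0.21 0.44], [0. 0.79 0.56]]
```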

21 Class-Term Matrix [figure]

22 Class-Term Matrix
Terminology
–All terms that occur in a class (or concept),
–A fuzzy set of all terms in the vocabulary.

23 Class-Term Matrix
Definition
–All concepts (classes) to which the term belongs,
–A fuzzy set of all concepts (classes); see the sketch below.
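
Reading the class-term matrix both ways gives the two fuzzy sets above: a row is the terminology of a class, a column is the definition of a term. The class names and membership degrees in this sketch are invented for illustration.

```python
import numpy as np

vocabulary = ["metadata", "mining", "ontology"]
classes = ["AI", "Databases"]                      # hypothetical class names
membership = np.array([[1.0, 0.2, 0.4],            # toy normalized memberships (classes x terms)
                       [0.0, 0.8, 0.6]])

def terminology(class_name):
    """Fuzzy set of all terms that occur in the given class."""
    i = classes.index(class_name)
    return {t: float(mu) for t, mu in zip(vocabulary, membership[i]) if mu > 0}

def definition(term):
    """Fuzzy set of all classes (concepts) the term belongs to."""
    j = vocabulary.index(term)
    return {c: float(mu) for c, mu in zip(classes, membership[:, j]) if mu > 0}

print(terminology("AI"))        # {'metadata': 1.0, 'mining': 0.2, 'ontology': 0.4}
print(definition("ontology"))   # {'AI': 0.4, 'Databases': 0.6}
```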

24 Outline
Metadata Mining
Metadata Representation Model
Class-Term Matrix
Case Study
Concluding Remarks

25 Case Study
Data set
–There is no available LO metadata repository,
–CiteSeer computer science directory (http://citeseer.ist.psu.edu/directory.html),
–~400,000 terms (vocabulary size),
–17 classes,
–2,912 documents,
–Instead of the data itself (PDF or PS files), we collected BibTeX records (a kind of metadata or catalogue) and the abstracts of the articles.
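
For concreteness, here is a minimal sketch of turning one collected record (a BibTeX entry plus its abstract) into the partitioned metadata document used by the model. The entry is a fabricated toy example, and the field extractor handles only simple, unnested "field = {value}" lines; this is not the authors' collection pipeline.

```python
import re

def parse_bibtex_fields(entry):
    """Tiny extractor for simple 'field = {value}' pairs (no nested braces)."""
    return {k.lower(): v.strip() for k, v in re.findall(r"(\w+)\s*=\s*\{([^{}]*)\}", entry)}

entry = """@article{toy_entry,
  title  = {Learning Object Metadata Mining},
  author = {Makrehchi, Masoud and Kamel, Mohamed},
}"""
abstract = "We mine learning object metadata instead of the raw documents."

fields = parse_bibtex_fields(entry)
record = {                                # partitions of the metadata document
    "title": fields.get("title", ""),
    "author": fields.get("author", ""),
    "abstract": abstract,
}
print(record)
```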

26 Case Study [figure]

27 Case Study [figure]

28 Case Study
Types of frequency measures (contrasted in the sketch below)
–Within a document: document-term frequency (e.g., tf-idf),
–Within a class: class-term frequency (e.g., term significance),
–Within the collection: collection-term frequency (e.g., mean of term significances).
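
The three scopes can be contrasted on a toy matrix. tf-idf is standard; the class-level and collection-level measures reuse the assumed term-significance definition from the class-term matrix sketch above, not the talk's exact formulas.

```python
import numpy as np

doc_term = np.array([[2.0, 0.0, 1.0],     # documents x terms
                     [1.0, 1.0, 0.0],
                     [0.0, 3.0, 1.0]])
labels = np.array([0, 0, 1])              # class of each document

# Within a document: tf-idf.
tf = doc_term / doc_term.sum(axis=1, keepdims=True)
idf = np.log(len(doc_term) / (doc_term > 0).sum(axis=0))
tfidf = tf * idf

# Within a class: class-term frequency and its within-class share (term significance).
ct = np.vstack([doc_term[labels == c].sum(axis=0) for c in (0, 1)])
significance = ct / ct.sum(axis=1, keepdims=True)

# Within the collection: mean of the class-level significances per term.
collection_level = significance.mean(axis=0)

print(tfidf, significance, collection_level, sep="\n\n")
```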

29 Case Study
Term clustering: categorizing all terms into three main groups
–Features: terms more frequent within a single class,
–Keywords: terms more frequent within some documents belonging to a given class,
–Stopwords: terms more frequent in all classes.
Introducing the class-collection map
–To visualize the location of each category (see the sketch below).
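
The talk performs this grouping with a class-collection map and fuzzy inference; the rule-based sketch below is a rough stand-in with invented thresholds, using only class-level significances: terms strong in every class become stopwords, terms dominant in one class become features, and the rest are treated as keywords.

```python
import numpy as np

def cluster_terms(significance, vocabulary, high=0.5, spread=0.8):
    """significance: (classes x terms) within-class shares for each term."""
    peak = significance.max(axis=0)                                # strength in the best class
    evenness = significance.min(axis=0) / np.maximum(peak, 1e-12)  # 1.0 => equally frequent everywhere
    groups = {"stopwords": [], "features": [], "keywords": []}
    for term, p, e in zip(vocabulary, peak, evenness):
        if e >= spread:              # frequent in all classes
            groups["stopwords"].append(term)
        elif p >= high:              # dominant in one class
            groups["features"].append(term)
        else:                        # neither a stopword nor a class-level feature
            groups["keywords"].append(term)
    return groups

vocab = ["the", "ontology", "wavelet"]
sig = np.array([[0.30, 0.60, 0.05],          # class 1
                [0.28, 0.02, 0.10]])         # class 2
print(cluster_terms(sig, vocab))
# {'stopwords': ['the'], 'features': ['ontology'], 'keywords': ['wavelet']}
```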

30 Case Study [figure]

31 Case Study [figure]

32 Case Study [figure]

33 Case Study
Extraction of stopwords (words that do not contribute to the meaning of the document)
–General stopwords (a, an, the, in, ...),
–Domain-specific stopwords:
  Politics: government, state,
  Medicine: patient,
  Education: learner, instructor,
  Social sciences: society,
  Anthropology: human.

34 Case Study
Why do we need to remove domain-specific stopwords?
–Dimensionality reduction,
–More accurate feature selection (information gain tends to select such noise as features),
–Stopwords let us find and separate phrases (by our definition, a phrase is the set of words between two stopwords; see the sketch below).
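
A sketch of the phrase definition on this slide: split the token stream at stopwords and keep the runs of words in between. The stopword list here is a tiny illustrative sample, not the extracted domain-specific list from the case study.

```python
STOPWORDS = {"a", "an", "the", "in", "of", "for", "is", "are", "and", "we"}

def extract_phrases(text):
    """Return the runs of non-stopword tokens between two stopwords."""
    phrases, current = [], []
    for word in text.lower().split():
        if word in STOPWORDS:
            if current:                       # a stopword closes the current phrase
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases

print(extract_phrases("we mine the metadata of learning objects in a digital repository"))
# ['mine', 'metadata', 'learning objects', 'digital repository']
```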


41 Case Study
Dimensionality reduction process using metadata [figure]:
~400,000 terms → 15,971 (stemming) → 12,044 (multi-partition document vector space model) → 5,605 (fuzzy-based term clustering: 507 stopwords, 4,872 keywords, 226 features)

42 Outline
Metadata Mining
Metadata Representation Model
Class-Term Matrix
Case Study
Concluding Remarks

43 Concluding Remarks
Most statistics-based data mining methods do not use domain knowledge.
Metadata (semi-structured data) mining uses the domain knowledge embedded in tags and partitions.
We introduced the multi-partition document vector space model.
We mine the class-term matrix in addition to the document-term matrix.

44 Concluding Remarks
Based on the visualization model (class-collection map) and fuzzy inference, we can cluster the vocabulary for each class and extract three essential categories:
–Features: to classify unknown documents,
–Keywords: for indexing and access to specific documents in IR applications,
–Stopwords: for dimensionality reduction and noise removal.

45 Concluding Remarks
Based on the class-term matrix, we defined
–Terminologies as fuzzy sets of all terms in the vocabulary,
–Definitions as fuzzy sets of all concepts.

46 Concluding Remarks
Future work
–Collecting LO metadata and constructing an LO metadata repository,
–A keyword recall method to test and validate the extracted keywords,
–Implementing a standard classifier (KNN or a fuzzy classifier) to test and validate the selected features,
–Applying a multi-classifier architecture to the metadata mining problem.

