Progress Report (Concept Extraction) Presented by: Mohsen Kamyar.

Progress Report (Concept Extraction) Presented by: Mohsen Kamyar

Outline Motivations of “Semantic Web” “Semantic Web” structure Motivations of “Automatic Ontology Extraction” “Concept Extraction” Problems Main ideas

Motivations of “ Semantic Web ” Computation History: W3C definition: “ Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries ”. Huge Computations Gathering Human Data Sharing Data Processing Huge Data First Computers First Applications Inventing Web Semantic Web

Motivations of “ Semantic Web ” Some Efforts:  http://dbpedia.org/ ( RDF enabled Wikipedia Data ) http://dbpedia.org/  http://www.foaf-project.org/ ( Friend of a Friend ) http://www.foaf-project.org/  http://sioc-project.org/ ( Semantically-Interlinked Online Communities ) http://sioc-project.org/  http://openguid.net/ http://openguid.net/  http://simile.mit.edu/ ( Semantic Interoperability of Metadata and Information in unLike Environments ) http://simile.mit.edu/

Motivations of “ Semantic Web ” Different Ontologies and Metadata and Their Overlaps

Motivations of “ Semantic Web ” Different Projects and Their Overlaps

“ Semantic Web ” structure W3C Suggestion for Semantic Web Stack

“ Semantic Web ” structure Road from “zero” to Semantic Web Having an Ontology Annotating Documents Submit Query on Semantic Enabled Data Having an Ontology Generating Data According to Ontology Submit Query on Semantic Enabled Data Existing Data

Motivations of “ Automatic Ontology Extraction ” Ontology is domain specific In each domain we need a team of experts to create an ontology Existing ontologies cover small ranges of knowledge.  WordNet  Data about members, projects and … in Universities  DBLP, …

Motivations of “ Automatic Ontology Extraction ”  DailyMed for drugs  FOAF  MySpace  BBC  Widely used one is AudioSrobbler Based on event ontology with 600 million entry but it is a simple ontology. Problem: Representing Knowledge in a Machine Readable Format.

“ Concept Extraction ” An ontology is a set of “Concepts”, properties of concepts (and constraints on them) and “Rules” between “Concepts”. If we can’t generate a general ontology in a domain, “Concepts” can be used in determining the relevance of documents and their ranks.

Main Components in existing methods are as below:  Similarity Measures Euclidean Measures: Cosine Distance Problem: If we have a document “B” that is the subset of a document “A” then the cosine distance will be large Problems

 Importance Measures Frequency: Has big false positive error (on technical words) and false negative error (on common words) TFiDF (Term Frequency. Inverse Document Frequency):  Has more accuracy  But if we have two semantic correlated words this measure will make mistake on main purpose of the document.  Also for long documents it will fail  Information about order of words will be lost  If we consider stems and synonyms or other related set of words it can work fine, otherwise many problems.

Problems  LSI (Latent Semantic Indexing/Analysis) This method uses SVD (singular value decomposition) After applying SVD we have term×concept (U matrix), eigenvalues, concept×document (V matrix). Matrices U and V are eigenvectors of terms correlation matrix and document correlation matrix. Eigenvectors have property of clustering the nodes of a graph (for example in PageRank algorithm we use same property) But there are problems.

Problems  This method assumes some probabilistic properties for term-document matrix that may be not hold, for example Gaussian Model properties.  This method is not customized for sparse matrices  This method is offline  This method has big computation complexity  Some efforts have been made such as SDD  This method again is based on cosine distance and similar measures

Main ideas on this process are as below  Using more general spaces instead of Euclidian space, such as “Banach Spaces” and defining a new distance measure that can serve desirable properties of term-document space. Banach spaces have interesting properties for high dimensional problems because of their generlity.  Using Linguistics theorems in determining the importance of a term in a document instead of TFiDF

Main ideas  Customizing LSI for sparse matrices and creating a framework for any method for matrix decomposing.  Giving an online version of LSI  LSI (SVD) basically is an approximation for term- document matrix but also have big computation complexity and can be approximated to.  We use clustering properties of eigenvectors in LSI, so we can substitute it with any clustering that is not based on Euclidian as mentioned.

Progress Report (Concept Extraction) Presented by: Mohsen Kamyar.

Similar presentations

Presentation on theme: "Progress Report (Concept Extraction) Presented by: Mohsen Kamyar."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Progress Report (Concept Extraction) Presented by: Mohsen Kamyar.

Similar presentations

Presentation on theme: "Progress Report (Concept Extraction) Presented by: Mohsen Kamyar."— Presentation transcript:

Similar presentations

About project

Feedback