On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning Kassel, 22 July 2005 Georgios Paliouras Software & Knowledge Engineering Lab.

Presentation transcript:

On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning Kassel, 22 July 2005 Georgios Paliouras Software & Knowledge Engineering Lab Inst. of Informatics & Telecommunications NCSR “Demokritos”

Kassel, 22/07/2005, ICCS’05
Outline
Motivation and state of the art
SKEL research
– Vision
– Information integration in CROSSMARC.
– Meta-learning for information extraction.
– Context-free grammar learning.
– Ontology enrichment.
– Bootstrapping ontology evolution with multimedia information extraction.
Open issues

Motivation
Practical information extraction requires a conceptual description of the domain (e.g. an ontology) and a grammar.
Manual creation and maintenance of these resources is expensive.
Machine learning has been used to:
– Learn ontologies based on extracted instances.
– Learn extraction grammars, given the conceptual model.
Aim: study how the two processes interact and whether they can be combined.

Information extraction
Common approach: shallow parsing with regular grammars.
Limited use of deep analysis to improve extraction accuracy (HPSGs, concept graphs).
Linking of extraction patterns to ontologies (e.g. information extraction ontologies).
Initial attempts to combine syntax and semantics (Systemic Functional Grammars).
Learning simple extraction patterns (regular expressions, HMMs, tree-grammars, etc.).

Ontology learning
Deductive approach to ontology modification: driven by linguistic rules.
Inductive identification of new concepts/terms.
Clustering, based on lexico-syntactic analysis of the text (subcat frames).
Formal Concept Analysis for term clustering and concept identification.
Clustering and merging of conceptual graphs (conceptual graph theory).
Deductive learning of extraction grammars in parallel with the identification of concepts.


SKEL - vision
Research objective: innovative knowledge technologies for reducing the information overload on the Web.
Areas of research activity:
– Information gathering (retrieval, crawling, spidering)
– Information filtering (text and multimedia classification)
– Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
– Personalization (user stereotypes and communities)
– Ontology learning and population


CROSSMARC Objectives
Develop technology for Information Integration that can:
– crawl the Web for interesting Web pages,
– extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
– process Web pages written in several languages,
– be customized semi-automatically to new domains and languages,
– deliver integrated information according to personalized profiles.

CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; the only surviving label is "Ontology".]

CROSSMARC Ontology
[Figure: ontology fragment with the labels … Laptops, Processor, Processor Name and the instance "Intel Pentium 3" …
Lexicon entries for the instance: Intel Pentium III, Pentium III, P3, PIII.
Greek lexicon entry: Όνομα Επεξεργαστή ("Processor Name").]


Meta-learning for Web IE
Motivation:
– There are many different learning methods, producing different types of extraction grammar.
– In CROSSMARC we had four different approaches, with significant differences in the extracted information.
Proposed approach:
– Use meta-learning to combine the strengths of the individual learning methods.

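Meta-learning for Web IE
Stacked generalization
[Figure, training: the base-level dataset D is split into D \ D_j and D_j. The learners L_1 … L_N are trained on D \ D_j, giving classifiers C_1(j) … C_N(j), whose predictions on D_j form the meta-level data MD_j. The meta-learner L_M is trained on the meta-level dataset MD to give the meta-classifier C_M.
Figure, run time: L_1 … L_N give classifiers C_1 ... C_N; a new vector x is classified by C_1 ... C_N, their predictions form the meta-level vector, and C_M outputs the class value y(x).]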
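The stacking scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system used in CROSSMARC: the two base learners (`train_majority`, `train_1nn`), the table-lookup meta-learner (`train_table`) and the interleaved fold split are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import Counter

def train_majority(data):
    """Hypothetical base learner: always predicts the most frequent class."""
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label

def train_1nn(data):
    """Hypothetical base learner: 1-nearest-neighbour on numeric vectors."""
    def predict(x):
        nearest = min(data, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
        return nearest[1]
    return predict

def train_table(meta_data):
    """Hypothetical meta-learner: majority class per meta-level vector."""
    by_vector = {}
    for vector, y in meta_data:
        by_vector.setdefault(vector, Counter())[y] += 1
    default = Counter(y for _, y in meta_data).most_common(1)[0][0]
    return lambda v: by_vector[v].most_common(1)[0][0] if v in by_vector else default

def stack_train(data, learners, meta_learner, folds=2):
    """Build MD from held-out predictions, then train C_M and the final C_1..C_N."""
    meta_data = []
    for j in range(folds):
        held_out = data[j::folds]                                   # D_j
        rest = [ex for i, ex in enumerate(data) if i % folds != j]  # D \ D_j
        classifiers = [learn(rest) for learn in learners]           # C_1(j)..C_N(j)
        for x, y in held_out:
            meta_data.append((tuple(c(x) for c in classifiers), y)) # MD_j
    meta_classifier = meta_learner(meta_data)                       # C_M
    base = [learn(data) for learn in learners]                      # C_1..C_N on all of D
    return lambda x: meta_classifier(tuple(c(x) for c in base))     # y(x)

DATA = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"),
        ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b")]
stacked = stack_train(DATA, [train_majority, train_1nn], train_table)
```

The key design point is that the meta-level vectors come only from classifiers that did not see the example during training, so the meta-classifier learns how far to trust each base learner.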

Meta-learning for Web IE
Information Extraction is not naturally a classification task.
In IE we deal with text documents, paired with templates.
Each template is filled with instances.
Example text: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Template T:
t(s, e)            s, e     Field f
Transport ZX       47, 49   model
15”                56, 58   screenSize
TFT                59, 60   screenType
Intel Pentium III  63, 67   procName
600 MHz            67, 69   procSpeed
256 MB             76, 78   ram

Meta-learning for Web IE
Combining Information Extraction systems
Example text: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
T1, filled by the IE system E1:
t(s, e)            s, e     f
Transport ZX       47, 49   model
15”                56, 58   screenSize
TFT                59, 60   screenType
Intel Pentium III  63, 67   procName
600 MHz            67, 69   procSpeed
256 MB             76, 78   ram
1 GB               81, 83   ram
T2, filled by the IE system E2:
t(s, e)            s, e     f
Transport ZX       47, 49   manuf
TFT                59, 60   screenType
Intel Pentium      63, 66   procName
600 MHz            67, 69   procSpeed
256 MB             76, 78   ram
1 GB               81, 83   HDcapacity

Meta-learning for Web IE
Creating a stacked template
Example text: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Stacked template (ST):
s, e     t(s, e)            Field by E1   Field by E2   Correct field
47, 49   Transport ZX       model         manuf         model
56, 58   15”                screenSize    -             screenSize
59, 60   TFT                screenType    screenType    screenType
63, 66   Intel Pentium      -             procName      -
63, 67   Intel Pentium III  procName      -             procName
67, 69   600 MHz            procSpeed     procSpeed     procSpeed
76, 78   256 MB             ram           ram           ram
81, 83   1 GB               ram           HDcapacity    -
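As a rough sketch of how such a stacked template could be assembled, the Python below merges the templates filled by E1 and E2 with the hand-filled template T from the earlier slides. The function name `stacked_template` and the dictionary representation (span to field) are assumptions made for illustration; "-" marks a span that a system did not extract or that is absent from the hand-filled template.

```python
def stacked_template(system_templates, gold=None):
    """Merge filled templates into rows (span, fields per system, correct field).

    system_templates: list of dicts mapping (s, e) spans to field names.
    gold: optional hand-filled template in the same form (training time only).
    """
    spans = sorted(set().union(*system_templates))
    rows = []
    for span in spans:
        fields = tuple(t.get(span, "-") for t in system_templates)
        correct = gold.get(span, "-") if gold is not None else None
        rows.append((span, fields, correct))
    return rows

# Templates transcribed from the slides (spans are token offsets).
T1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}
GOLD = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
        (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram"}

rows = stacked_template([T1, T2], GOLD)
```

Each row then serves as one meta-level example: the fields proposed by the systems are the features and the correct field is the class.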

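Meta-learning for Web IE
Training in the new stacking framework
D = set of documents, paired with hand-filled templates
MD = set of meta-level feature vectors
[Figure: D is split into D \ D_j and D_j. The learners L_1 … L_N are trained on D \ D_j, giving IE systems E_1(j) … E_N(j); their output on D_j is merged into stacked templates ST_1, ST_2, …, which form MD_j. The meta-learner L_M is trained on MD to give C_M, and L_1 … L_N give the IE systems E_1 … E_N.]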

Meta-learning for Web IE
Stacking at run-time
[Figure: a new document d is processed by E_1, E_2, …, E_N, which fill templates T_1, T_2, …, T_N. These are merged into a stacked template, from which C_M produces the final template T.]

Experimental results
F1-scores (combined recall and precision) on four benchmark domains and one of the CROSSMARC domains.
[Table: columns Domain, Best base, Stacking; rows Courses, Projects, Laptops, Jobs, Seminars. The scores are not preserved in this transcript.]
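For reference, the F1-score is conventionally the harmonic mean of precision and recall. The symbols P and R are introduced here for illustration and do not appear on the slide.

```latex
F_1 = \frac{2\,P\,R}{P + R}
```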


Learning CFGs
Motivation:
– Provide more complex extraction patterns for less structured text.
– Learn more compact and human-comprehensible grammars.
– Process large corpora containing only positive examples.
Proposed approach:
– Efficient learning of context-free grammars from positive examples, guided by Minimum Description Length.

Learning CFGs
Introducing eg-GRIDS
Infers context-free grammars.
Learns from positive examples only.
Overgeneralisation is controlled through a heuristic based on MDL.
Two basic and three auxiliary learning operators.
Two search strategies:
– Beam search.
– Genetic search.

Learning CFGs
Minimum Description Length (MDL)
Model Length (ML) = GDL + DDL
Grammar Description Length (GDL): bits required to encode the grammar G.
Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
[Figure: the space of hypotheses, from the overly specific grammar to the overly general grammar, with GDL and DDL marked.]
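The trade-off between GDL and DDL can be made concrete with a toy calculation. The encoding below is a deliberately simplified assumption, not the encoding used by eg-GRIDS: every symbol occurrence in the grammar costs the same number of bits, and every derivation step costs log2 of the number of alternative rules for the expanded non-terminal. The grammars `SPECIFIC` and `GENERAL` and the function names are hypothetical.

```python
import math

def gdl(grammar):
    """Bits to encode the grammar under a flat per-symbol code (assumed encoding)."""
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    bits_per_symbol = math.log2(len(symbols) + 1)  # +1 for an end-of-rule marker
    # Each rule contributes its head, its body symbols and one end-of-rule marker.
    occurrences = sum(len(body) + 2 for bodies in grammar.values() for body in bodies)
    return occurrences * bits_per_symbol

def ddl(grammar, derivations):
    """Bits to encode the examples: one rule choice per expanded non-terminal."""
    return sum(math.log2(len(grammar[nt])) for d in derivations for nt in d)

def model_length(grammar, derivations):
    return gdl(grammar) + ddl(grammar, derivations)

# Training examples: "a b", "a b b", "a b b b".
# Overly specific grammar: one rule per example.
SPECIFIC = {"S": [("a", "b"), ("a", "b", "b"), ("a", "b", "b", "b")]}
SPECIFIC_DERIVS = [["S"], ["S"], ["S"]]

# More general grammar: S -> a B, B -> b | b B.
GENERAL = {"S": [("a", "B")], "B": [("b",), ("b", "B")]}
GENERAL_DERIVS = [["S", "B"], ["S", "B", "B"], ["S", "B", "B", "B"]]
```

Under these assumptions the general grammar is cheaper to describe but makes each example more expensive to derive, and its total model length is the smaller of the two, which is the kind of comparison an MDL-guided search relies on.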

Learning CFGs
eg-GRIDS Architecture
[Figure: eg-GRIDS architecture. Labels: Training Examples; Overly Specific Grammar; Beam of Grammars; Operator Mode; Learning Operators (Merge NT Operator, Create NT Operator, Create Optional NT, Detect Center Embedding, Body Substitution); Search Organisation (Evolutionary Algorithm, Selection, Mutation); decision "Any Inferred Grammar better than those in beam?" (YES / NO); Final Grammar.]

Experimental results
The Dyck language with k=1: S → S S | ( S ) | ε
Errors of:
Omission: failures to parse sentences generated from the “correct” grammar (longer test sentences than in the training set).
– Indicates an overly specific grammar.
Commission: failures of the “correct” grammar to parse sentences generated by the inferred grammar.
– Indicates an overly general grammar.
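The two error types can be illustrated with a small Python sketch. It is a simplification of the evaluation described above: instead of sampling sentences from grammars, it enumerates all bracket strings up to a given length, and the two "inferred" languages (`too_specific`, `too_general`) are hypothetical membership tests invented for the example.

```python
from itertools import product

def is_dyck(s):
    """Membership test for the Dyck language with k=1 (balanced brackets)."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return depth == 0

def all_strings(max_len):
    """Every string over '(' and ')' with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product("()", repeat=n):
            yield "".join(chars)

def omission_rate(accepts, max_len):
    """Fraction of valid sentences that the inferred language rejects."""
    valid = [s for s in all_strings(max_len) if is_dyck(s)]
    return sum(not accepts(s) for s in valid) / len(valid)

def commission_rate(accepts, max_len):
    """Fraction of sentences of the inferred language that are not valid."""
    generated = [s for s in all_strings(max_len) if accepts(s)]
    if not generated:
        return 0.0
    return sum(not is_dyck(s) for s in generated) / len(generated)

def too_specific(s):
    """Hypothetical inferred language: balanced, but nesting is never allowed."""
    return is_dyck(s) and "((" not in s

def too_general(s):
    """Hypothetical inferred language: only the bracket counts must match."""
    return s.count("(") == s.count(")")
```

With strings up to length 4, `too_specific` misses "(())" and so shows errors of omission only, while `too_general` accepts strings such as ")(" and shows errors of commission only.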

Experimental results
[Figure: probability of parsing a valid sentence (1 - errors of omission).]

Experimental results
[Figure: probability of generating a valid sentence (1 - errors of commission).]


Ontology Enrichment
The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain.
Highly evolving domain (e.g. laptop descriptions):
– New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
– New surface appearance of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
We concentrate on instances.

Ontology Enrichment
[Figure: ontology enrichment cycle. Labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; machine learning; Information extraction; Additional annotations; Validation; Domain Expert; Ontology Enrichment / Population.]

Finding synonyms
The number of instances for validation increases with the size of the corpus and the ontology.
There is a need for supporting the enrichment of the ‘synonymy’ relationship.
Automatically discover different surface appearances of an instance (CROSSMARC synonymy relationship).
Issues to be handled:
– Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
– Orthographical: ‘Intel p3’ - ‘intell p3’
– Lexicographical: ‘Hewlett Packard’ - ‘HP’
– Combination: ‘Intell Pentium 3’ - ‘P III’

COCLU
COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when adding a candidate string.
Huffman trees are used as models of the clusters.
COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search.
The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
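The idea behind CCDiff can be sketched as follows. This is an illustrative approximation, not the published COCLU implementation: the code length of a cluster is taken to be the number of bits a Huffman code built from the cluster's own character frequencies needs to encode all of its strings, and clustering is a single greedy pass. The function names and the threshold value are assumptions.

```python
import heapq
from collections import Counter

def huffman_bits(strings):
    """Total bits to Huffman-encode all characters of the given strings."""
    freq = Counter("".join(strings))
    if not freq:
        return 0
    if len(freq) == 1:                 # a single symbol still needs 1 bit each
        return sum(freq.values())
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:               # total cost = sum of merged node weights
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def ccdiff(cluster, candidate):
    """Increase in the cluster's code length when the candidate is added."""
    return huffman_bits(cluster + [candidate]) - huffman_bits(cluster)

def coclu(strings, threshold):
    """Greedy pass: join the closest cluster, or open a new one above threshold."""
    clusters = []
    for s in strings:
        best = min(clusters, key=lambda c: ccdiff(c, s), default=None)
        if best is not None and ccdiff(best, s) <= threshold:
            best.append(s)
        else:
            clusters.append([s])
    return clusters

clusters = coclu(["aaaa", "aaab", "zzzz", "zzzy"], threshold=6)
```

A string that shares most of its characters with a cluster adds few bits to that cluster's encoding, so a small CCDiff signals typographic similarity. Because the raw difference grows with string length, a practical threshold would need tuning on real instance names such as ‘Intel Pentium 3’ and ‘PIII’.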

Experimental results
Discovering lexical synonyms: assign an instance to a group, while proportionally decreasing the number of instances initially available in each group.
[Figure: Accuracy (%) against Instances removed (%); the values are not preserved in this transcript.]
Discovering new instances: hide part of the known instances, then evolve the ontology and grammars to recover them.
Initial   2nd iter.
15/58     48/58
28/58     56/58
40/58     57/58

BOEMIE: Bootstrapping ontology evolution with multimedia information extraction

BOEMIE - motivation
Multimedia content is growing at an increasing rate in public and proprietary webs.
It is hard to provide semantic indexing of multimedia content.
Significant advances in automatic extraction of low-level features from visual content.
Little progress in the identification of high-level semantic features.
Little progress in the effective combination of semantic features from different modalities.
Great effort in producing ontologies for semantic webs.
It is hard to build and maintain domain-specific multimedia ontologies.

BOEMIE - approach
[Figure: BOEMIE architecture. Labels:
Content collection (crawlers, spiders, etc.); multimedia content.
Semantics extraction: from visual content, from non-visual content, from fused content; semantics extraction results.
Semantics extraction toolkit: visual extraction tools, text extraction tools, audio extraction tools, information fusion tools.
Ontology evolution: initial ontology, population & enrichment, intermediate ontology, coordination, evolved ontology; other ontologies.
Ontology evolution toolkit: learning tools, reasoning engine, matching tools, ontology management tool.]


KR issues
Is there a common formalism to capture the necessary semantic + syntactic + lexical knowledge for IE?
Is that better than having separate representations for different tasks?
Do we need an intermediate formalism (e.g. grammar + CG + ontology)?
Do we need to represent uncertainty (e.g. using probabilistic graphical models)?

ML issues
What types and which aspects of grammars and conceptual structures can we learn?
What training data do we need? Can we reduce the manual annotation effort?
What background knowledge do we need and what is the role of deduction?
What is the role of multi-strategy learning, especially if complex representations are used?

Content-type issues
What is the role of semantically annotated content in learning, e.g. as training data?
What is the role of hypertext as a graph?
Can we extract information from multimedia content?
How can ontologies and learning help improve extraction from multimedia?

Acknowledgements
This is the research of many current and past members of SKEL.
CROSSMARC is joint work of the project consortium (NCSR “Demokritos”, Uni of Edinburgh, Uni of Roma ‘Tor Vergata’, Veltinet, Lingway).