On the Need to Bootstrap Ontology Learning with Extraction Grammar Learning
Kassel, 22 July 2005
Georgios Paliouras
Software & Knowledge Engineering Lab, Inst. of Informatics & Telecommunications, NCSR “Demokritos”
http://www.iit.demokritos.gr/~paliourg

Kassel, 22/07/2005, ICCS’05
Slide 2: Outline
Motivation and state of the art
SKEL research
–Vision
–Information integration in CROSSMARC
–Meta-learning for information extraction
–Context-free grammar learning
–Ontology enrichment
–Bootstrapping ontology evolution with multimedia information extraction
Open issues

Slide 3: Motivation
Practical information extraction requires a conceptual description of the domain (e.g. an ontology) and a grammar.
Manual creation and maintenance of these resources is expensive.
Machine learning has been used to:
–Learn ontologies based on extracted instances.
–Learn extraction grammars, given the conceptual model.
Aim: study how the two processes interact and whether they can be combined.

Slide 4: Information extraction
Common approach: shallow parsing with regular grammars.
Limited use of deep analysis to improve extraction accuracy (HPSGs, concept graphs).
Linking of extraction patterns to ontologies (e.g. information extraction ontologies).
Initial attempts to combine syntax and semantics (Systemic Functional Grammars).
Learning simple extraction patterns (regular expressions, HMMs, tree-grammars, etc.).

Slide 5: Ontology learning
Deductive approach to ontology modification: driven by linguistic rules.
Inductive identification of new concepts/terms.
Clustering, based on lexico-syntactic analysis of the text (subcat frames).
Formal Concept Analysis for term clustering and concept identification.
Clustering and merging of conceptual graphs (conceptual graph theory).
Deductive learning of extraction grammars in parallel with the identification of concepts.

Slide 7: SKEL - vision
Research objective: innovative knowledge technologies for reducing the information overload on the Web.
Areas of research activity:
–Information gathering (retrieval, crawling, spidering)
–Information filtering (text and multimedia classification)
–Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
–Personalization (user stereotypes and communities)
–Ontology learning and population

Slide 9: CROSSMARC Objectives
Develop technology for Information Integration that can:
–crawl the Web for interesting Web pages,
–extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
–process Web pages written in several languages,
–be customized semi-automatically to new domains and languages,
–deliver integrated information according to personalized profiles.

Slide 10: CROSSMARC Architecture
[Figure: architecture diagram; the only label that survives in the transcript is "Ontology".]

Slide 11: CROSSMARC Ontology
[Figure: an ontology fragment linked to its lexicons.]
Ontology: … Laptops, Processor, Processor Name, Intel Pentium 3 …
Lexicon: Intel Pentium III, Pentium III, P3, PIII
Greek Lexicon: Όνομα Επεξεργαστή ("Processor Name")

Slide 13: Meta-learning for Web IE
Motivation:
–There are many different learning methods, producing different types of extraction grammar.
–In CROSSMARC we had four different approaches, with significant differences in the extracted information.
Proposed approach:
–Use meta-learning to combine the strengths of individual learning methods.

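Slide 14: Meta-learning for Web IE
Stacked generalization
[Figure: stacking diagram. Labels and the flow they describe:]
Training: the base-level dataset D is split into D \ D_j and D_j. The base learners L_1 … L_N are trained on D \ D_j, giving the classifiers C_1(j) … C_N(j); their predictions on D_j form MD_j, part of the meta-level dataset MD. The meta-level learner L_M is trained on MD, giving the meta-level classifier C_M.
Run-time: a new vector x is classified by C_1 … C_N (trained by L_1 … L_N); their outputs form the meta-level vector, from which C_M produces the class value y(x).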
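The stacked generalization scheme on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the CROSSMARC implementation: the three toy learners (`majority_learner`, `nearest_neighbour_learner`, `table_learner`), the function `stack`, the fold scheme and the toy data are all assumptions made for the example.

```python
from collections import Counter

def majority_learner(train):
    """Toy base learner: always predicts the most frequent class in train."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbour_learner(train):
    """Toy base learner: 1-nearest neighbour on squared Euclidean distance."""
    def classify(x):
        nearest = min(train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
        return nearest[1]
    return classify

def table_learner(train):
    """Toy meta-level learner L_M: majority class for each meta-level vector."""
    votes = {}
    for x, y in train:
        votes.setdefault(tuple(x), Counter())[y] += 1
    fallback = Counter(y for _, y in train).most_common(1)[0][0]
    def classify(x):
        seen = votes.get(tuple(x))
        return seen.most_common(1)[0][0] if seen else fallback
    return classify

def stack(data, base_learners, meta_learner, folds=3):
    """Build the meta-level dataset MD fold by fold, then train C_M on it."""
    meta_data = []
    for j in range(folds):
        held_out = data[j::folds]                                    # D_j
        rest = [ex for i, ex in enumerate(data) if i % folds != j]   # D \ D_j
        classifiers = [learn(rest) for learn in base_learners]       # C_1(j) .. C_N(j)
        meta_data += [([c(x) for c in classifiers], y) for x, y in held_out]  # MD_j
    base = [learn(data) for learn in base_learners]                  # C_1 .. C_N on all of D
    meta = meta_learner(meta_data)                                   # C_M
    return lambda x: meta([c(x) for c in base])                      # y(x)

# Toy one-dimensional dataset with two well-separated classes.
data = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"),
        ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b")]
classify = stack(data, [majority_learner, nearest_neighbour_learner], table_learner)
```

The meta-level classifier only ever sees the predictions of the base classifiers, so it can learn which of them to trust: here it learns to follow the nearest-neighbour output and to ignore the constant majority prediction.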

Slide 15: Meta-learning for Web IE
Information Extraction is not naturally a classification task.
In IE we deal with text documents, paired with templates.
Each template is filled with instances.
Example text: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Template T:
t(s, e)            s, e     Field f
Transport ZX       47, 49   model
15”                56, 58   screenSize
TFT                59, 60   screenType
Intel Pentium III  63, 67   procName
600 MHz            67, 69   procSpeed
256 MB             76, 78   ram

Slide 16: Meta-learning for Web IE
Combining Information Extraction systems
Example text: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
T1, filled by the IE system E1:
t(s, e)            s, e     f
Transport ZX       47, 49   model
15”                56, 58   screenSize
TFT                59, 60   screenType
Intel Pentium III  63, 67   procName
600 MHz            67, 69   procSpeed
256 MB             76, 78   ram
1 GB               81, 83   ram
T2, filled by the IE system E2:
t(s, e)            s, e     f
Transport ZX       47, 49   manuf
TFT                59, 60   screenType
Intel Pentium      63, 66   procName
600 MHz            67, 69   procSpeed
256 MB             76, 78   ram
1 GB               81, 83   HDcapacity

Slide 17: Meta-learning for Web IE
Creating a stacked template
Example text: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Stacked template (ST):
s, e     t(s, e)            Field by E1  Field by E2  Correct field
47, 49   Transport ZX       model        manuf        model
56, 58   15”                screenSize   -            screenSize
59, 60   TFT                screenType   screenType   screenType
63, 66   Intel Pentium      -            procName     -
63, 67   Intel Pentium III  procName     -            procName
67, 69   600 MHz            procSpeed    procSpeed    procSpeed
76, 78   256 MB             ram          ram          ram
81, 83   1 GB               ram          HDcapacity   -
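The construction of a stacked template can be sketched as a merge over text spans. In this illustration each template is a dictionary from a span (s, e) to the field assigned by one IE system, using the values of T1 and T2 from the slides; the function name `stack_templates` and the dictionary representation are assumptions made for the example, not the representation used in CROSSMARC.

```python
def stack_templates(templates):
    """Merge the templates of N systems into rows (span, fields by E1..EN).

    A system that did not extract a given span contributes "-".
    """
    spans = sorted(set().union(*templates))  # every span proposed by any system
    return [(span, tuple(t.get(span, "-") for t in templates)) for span in spans]

# Templates filled by the two IE systems E1 and E2 in the example.
t1 = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
      (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "ram"}
t2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}

stacked = stack_templates([t1, t2])
for span, fields in stacked:
    print(span, fields)
```

Each row is one meta-level feature vector. During training it is paired with the correct field from the hand-filled template (or "-" when the span is not a correct instance), which is what the meta-level classifier learns to predict.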

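Slide 18: Meta-learning for Web IE
Training in the new stacking framework
D = set of documents, paired with hand-filled templates
MD = set of meta-level feature vectors
[Figure: training diagram. The base learners L_1 … L_N are trained on D \ D_j, giving the IE systems E_1(j) … E_N(j); applied to D_j, these produce the stacked templates ST_1, ST_2, …, which form MD_j. The learner L_M trained on MD gives C_M; L_1 … L_N trained on the whole of D give E_1 … E_N.]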

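Slide 19: Meta-learning for Web IE
Stacking at run-time
[Figure: run-time diagram. A new document d is processed by E_1, E_2, …, E_N, which fill the templates T_1, T_2, …, T_N. These are merged into a stacked template, from which C_M produces the final template T.]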

Slide 20: Experimental results
Domain    Best base  Stacking
Courses   65.73      71.93
Projects  61.64      70.66
Laptops   63.81      71.55
Jobs      83.22      85.94
Seminars  86.23      90.03
F1-scores (combined recall and precision) on four benchmark domains and one of the CROSSMARC domains.
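For reference, the F1-score is the harmonic mean of precision P and recall R. The slide does not spell the formula out; this is the standard definition, and the table reports the values as percentages.

```latex
F_1 = \frac{2\,P\,R}{P + R}
```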

Slide 22: Learning CFGs
Motivation:
–Provide more complex extraction patterns for less structured text.
–Learn more compact and human-comprehensible grammars.
–Process large corpora containing only positive examples.
Proposed approach:
–Efficient learning of context-free grammars from positive examples, guided by Minimum Description Length.

Slide 23: Learning CFGs
Introducing eg-GRIDS
Infers context-free grammars.
Learns from positive examples only.
Overgeneralisation is controlled through a heuristic based on MDL.
Two basic and three auxiliary learning operators.
Two search strategies:
–Beam search.
–Genetic search.

Slide 24: Learning CFGs
Minimum Description Length (MDL)
Model Length (ML) = GDL + DDL
Grammar Description Length (GDL): bits required to encode the grammar G.
Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
[Figure: the space of hypotheses between the overly specific grammar and the overly general grammar, annotated with GDL and DDL.]
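The trade-off behind ML = GDL + DDL can be made concrete with a small calculation. The sketch below uses a deliberately simplified encoding, which is an assumption for illustration and not the encoding used by eg-GRIDS: every grammar symbol costs a fixed number of bits, and every derivation step costs log2 of the number of alternative rules for the expanded non-terminal. The function names and the toy grammars are likewise hypothetical.

```python
import math

def grammar_description_length(grammar):
    """GDL under a simplified uniform code: bits to write down every rule."""
    symbols = set(grammar) | {s for bodies in grammar.values() for body in bodies for s in body}
    bits_per_symbol = math.log2(len(symbols) + 1)   # +1 for an end-of-rule marker
    # Each rule is written as: head, body symbols, end-of-rule marker.
    n_symbols = sum(len(body) + 2 for bodies in grammar.values() for body in bodies)
    return n_symbols * bits_per_symbol

def derivations_description_length(grammar, derivations):
    """DDL: bits to choose one alternative at every step of every derivation.

    Each derivation is given as the list of non-terminals expanded, in order.
    """
    return sum(math.log2(len(grammar[head])) for steps in derivations for head in steps)

def model_length(grammar, derivations):
    return (grammar_description_length(grammar)
            + derivations_description_length(grammar, derivations))

# Training examples: "()", "(())", "()()", "((()))".
# Overly specific grammar: one rule per training example, one step per example.
specific = {"S": [tuple("()"), tuple("(())"), tuple("()()"), tuple("((()))")]}
specific_derivations = [["S"]] * 4

# More general grammar S -> S S | ( S ) | empty; steps counted by hand:
# "()" needs 2 rule applications, "(())" 3, "()()" 5, "((()))" 4.
general = {"S": [("S", "S"), ("(", "S", ")"), ()]}
general_derivations = [["S"] * n for n in (2, 3, 5, 4)]

ml_specific = model_length(specific, specific_derivations)
ml_general = model_length(general, general_derivations)
```

Under this toy encoding the specific grammar is expensive to write down but cheap to derive with, while the general grammar is the reverse; MDL prefers whichever has the smaller sum.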

Slide 25: Learning CFGs
eg-GRIDS Architecture
[Figure: architecture diagram. Labels that survive in the transcript:]
Input and output: Training Examples, Overly Specific Grammar, Final Grammar
Search Organisation: Beam of Grammars, Operator Mode, Selection
Learning Operators: Merge NT Operator, Create NT Operator, Create Optional NT, Detect Center Embedding, Body Substitution
Evolutionary Algorithm: Mutation
Decision: Any Inferred Grammar better than those in beam? (YES / NO)

Slide 26: Experimental results
The Dyck language with k=1: S → S S | ( S ) | ε
Errors of:
Omission: failures to parse sentences generated from the “correct” grammar (longer test sentences than in the training set).
–Overly specific grammar.
Commission: failures of the “correct” grammar to parse sentences generated by the inferred grammar.
–Overly general grammar.
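The two error types can be illustrated with a recogniser for the Dyck language with k=1 and two rate functions. This is a minimal sketch: the names `is_dyck`, `omission_rate` and `commission_rate`, and the tiny hand-picked sentence lists, are assumptions for the example and do not reproduce the experimental setup of the talk.

```python
def is_dyck(sentence):
    """Recognise the Dyck language with k=1 (balanced parentheses, including the empty string)."""
    depth = 0
    for ch in sentence:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a closing bracket with nothing open
                return False
        else:
            return False           # symbol outside the alphabet
    return depth == 0

def omission_rate(inferred_accepts, correct_sentences):
    """Fraction of sentences from the correct grammar that the inferred grammar fails to parse."""
    return sum(not inferred_accepts(s) for s in correct_sentences) / len(correct_sentences)

def commission_rate(correct_accepts, inferred_sentences):
    """Fraction of sentences generated by the inferred grammar that the correct grammar rejects."""
    return sum(not correct_accepts(s) for s in inferred_sentences) / len(inferred_sentences)

# An overly specific "grammar" that has only memorised two training sentences.
memorised = {"()", "(())"}
omission = omission_rate(lambda s: s in memorised, ["()", "(())", "()()", "((()))"])

# Sentences produced by an overly general grammar (any string of parentheses).
commission = commission_rate(is_dyck, ["()", ")(", "((", "(())"])
```

An overly specific grammar shows up as a high omission rate, an overly general one as a high commission rate, which is why both are reported.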

Slide 27: Experimental results
[Figure: probability of parsing a valid sentence (1 - errors of omission).]

Slide 28: Experimental results
[Figure: probability of generating a valid sentence (1 - errors of commission).]

Slide 30: Ontology Enrichment
The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain.
Highly evolving domain (e.g. laptop descriptions):
–New instances characterize new concepts, e.g. ‘Pentium 2’ is an instance that denotes a new concept if it doesn’t exist in the ontology.
–New surface appearance of an instance, e.g. ‘PIII’ is a different surface appearance of ‘Intel Pentium 3’.
We concentrate on instances.

Slide 31: Ontology Enrichment
[Figure: ontology enrichment cycle. Labels that survive in the transcript:]
Multi-Lingual Domain Ontology
Corpus
Annotating Corpus Using Domain Ontology
Information extraction, machine learning
Additional annotations
Validation (Domain Expert)
Ontology Enrichment / Population

Slide 32: Finding synonyms
The number of instances for validation increases with the size of the corpus and the ontology.
There is a need for supporting the enrichment of the ‘synonymy’ relationship.
Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
Issues to be handled:
–Synonym: ‘Intel pentium 3’ - ‘Intel pIII’
–Orthographical: ‘Intel p3’ - ‘intell p3’
–Lexicographical: ‘Hewlett Packard’ - ‘HP’
–Combination: ‘Intell Pentium 3’ - ‘P III’

Slide 33: COCLU
COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
CCDiff is defined as the difference in the code length of a cluster (i.e. of its instances) when adding a candidate string.
Huffman trees are used as models of the clusters.
COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search.
The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
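The description of CCDiff suggests the following rough sketch. It follows only what the slide states: a Huffman code models each cluster, and CCDiff is the change in the cluster's code length when a candidate string is added. Everything else is an assumption made for illustration, including rebuilding the Huffman code from scratch, the single greedy pass in `coclu`, the function names, and the absence of any length normalisation; the published COCLU algorithm may differ in all of these.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} of a Huffman code for the given frequencies."""
    if len(freqs) <= 1:
        return {sym: 1 for sym in freqs}       # a lone symbol still needs one bit
    tie = count()                               # tie-breaker so dicts are never compared
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def code_length(strings):
    """Bits needed to encode all characters of the cluster with its own Huffman code."""
    freqs = Counter("".join(strings))
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)

def cc_diff(cluster, candidate):
    """Increase in the cluster's code length caused by adding the candidate string."""
    return code_length(cluster + [candidate]) - code_length(cluster)

def coclu(strings, threshold):
    """Greedy single pass: join the closest cluster, or open a new one above the threshold."""
    clusters = []
    for s in strings:
        best = min(clusters, key=lambda c: cc_diff(c, s), default=None)
        if best is not None and cc_diff(best, s) <= threshold:
            best.append(s)
        else:
            clusters.append([s])
    return clusters

clusters = coclu(["aab", "ab", "xy"], threshold=3)
```

A string that reuses the characters already frequent in a cluster adds few bits, while a string with unseen characters forces a larger code and a bigger CCDiff, which is what separates "ab" from "xy" in the toy run.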

Slide 34: Experimental results
Discovering lexical synonyms: assign an instance to a group, while decreasing proportionally the number of instances available initially in each group.
[Figure: accuracy (%) against instances removed (%).]
Discovering new instances: hide part of the known instances. Evolve ontology and grammars to recover them.
Initial  2nd iter.
15/58    48/58
28/58    56/58
40/58    57/58

Slide 35: Outline
Motivation and state of the art
SKEL research
–Vision
–Information integration in CROSSMARC
–Meta-learning for information extraction
–Context-free grammar learning
–Ontology enrichment
–BOEMIE: Bootstrapping ontology evolution with multimedia information extraction
Open issues

Slide 36: BOEMIE - motivation
Multimedia content is growing at an increasing rate in public and proprietary webs.
Hard to provide semantic indexing of multimedia content.
Significant advances in automatic extraction of low-level features from visual content.
Little progress in the identification of high-level semantic features.
Little progress in the effective combination of semantic features from different modalities.
Great effort in producing ontologies for semantic webs.
Hard to build and maintain domain-specific multimedia ontologies.

Slide 37: BOEMIE - approach
[Figure: architecture diagram. Labels that survive in the transcript:]
Content Collection (crawlers, spiders, etc.); MULTIMEDIA CONTENT
SEMANTICS EXTRACTION: FROM VISUAL CONTENT, FROM NON-VISUAL CONTENT, FROM FUSED CONTENT
SEMANTICS EXTRACTION TOOLKIT: VISUAL EXTRACTION TOOLS, TEXT EXTRACTION TOOLS, AUDIO EXTRACTION TOOLS, INFORMATION FUSION TOOLS
SEMANTICS EXTRACTION RESULTS
ONTOLOGY EVOLUTION: POPULATION & ENRICHMENT, COORDINATION
ONTOLOGY EVOLUTION TOOLKIT: LEARNING TOOLS, REASONING ENGINE, MATCHING TOOLS, ONTOLOGY MANAGEMENT TOOL
Ontologies: INITIAL ONTOLOGY, INTERMEDIATE ONTOLOGY, EVOLVED ONTOLOGY, OTHER ONTOLOGIES

Slide 39: KR issues
Is there a common formalism to capture the necessary semantic + syntactic + lexical knowledge for IE?
Is that better than having separate representations for different tasks?
Do we need an intermediate formalism (e.g. grammar + CG + ontology)?
Do we need to represent uncertainty (e.g. using probabilistic graphical models)?

Slide 40: ML issues
What types and which aspects of grammars and conceptual structures can we learn?
What training data do we need? Can we reduce the manual annotation effort?
What background knowledge do we need and what is the role of deduction?
What is the role of multi-strategy learning, especially if complex representations are used?

Slide 41: Content-type issues
What is the role of semantically annotated content in learning, e.g. as training data?
What is the role of hypertext as a graph?
Can we extract information from multimedia content?
How can ontologies and learning help improve extraction from multimedia?

Slide 42: Acknowledgements
This is research of many current and past members of SKEL.
CROSSMARC is joint work of the project consortium (NCSR “Demokritos”, Uni of Edinburgh, Uni of Roma ‘Tor Vergata’, Veltinet, Lingway).
