Machine Learning for Information Integration on the Web Georgios Paliouras Software & Knowledge Engineering Lab Inst. of Informatics & Telecommunications.


1 Machine Learning for Information Integration on the Web Georgios Paliouras Software & Knowledge Engineering Lab Inst. of Informatics & Telecommunications NCSR Demokritos http://www.iit.demokritos.gr/skel Dagstuhl, February 15, 2005

2 Dagstuhl 15/2/2005 Machine Learning for Information Integration 2 SKEL Introduction
SKEL's research objective: innovative knowledge technologies for reducing the information overload on the Web.
Areas of research activity:
– Information gathering (retrieval, crawling, spidering)
– Information filtering (text and multimedia classification)
– Information extraction (named entity recognition and classification, role identification, wrappers, grammar and lexicon learning)
– Personalization (user stereotypes and communities)

3 Structure of the talk
– Web Information integration in CROSSMARC
– Learning Context Free Grammars
– Meta-learning for Web Information Extraction
– Machine Learning for Ontology Maintenance
– Conclusions

4 CROSSMARC consortium
– National Centre for Scientific Research "Demokritos" (GR)
– University of Edinburgh (UK)
– Universita di Roma Tor Vergata (IT)
– VeltiNet A.E. (GR)
– Lingway (FR)

5 CROSSMARC Objectives
Develop technology for Information Integration that can:
– crawl the Web for interesting Web pages,
– extract information from pages of different sites without a standardized format (structured, semi-structured, free text),
– process Web pages written in several languages,
– be customized semi-automatically to new domains and languages,
– deliver integrated information according to personalized profiles.

6 CROSSMARC Architecture
[Figure: CROSSMARC architecture diagram; the only label recovered is "Ontology".]

7 CROSSMARC Ontology
[Figure: fragment of the CROSSMARC ontology and its lexicons.]
Ontology: … Laptops, Processor, Processor Name, Intel Pentium 3 …
Lexicon: Intel Pentium III, Pentium III, P3, PIII
Greek Lexicon: Όνομα Επεξεργαστή ("Processor Name")

8 Structure of the talk
– Web Information integration in CROSSMARC
– Learning Context Free Grammars
– Meta-learning for Web Information Extraction
– Machine Learning for Ontology Maintenance
– Conclusions

9 Learning Context Free Grammars
Introducing eg-GRIDS
– Infers context-free grammars.
– Learns from positive examples only.
– Overgeneralisation is controlled through a heuristic based on MDL.
– Two basic and three auxiliary learning operators.
– Two search strategies:
  – Beam search.
  – Genetic search.

10 Learning Context Free Grammars
Minimum Description Length (MDL)
Model Length (ML) = GDL + DDL
– Grammar Description Length (GDL): bits required to encode the grammar G.
– Derivations Description Length (DDL): bits required to encode all training examples, as encoded by the grammar G.
[Figure: the space of hypotheses between the overly specific grammar and the overly general grammar, with GDL and DDL marked.]
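
The trade-off ML = GDL + DDL can be illustrated with a small Python sketch. The slides do not give the encoding used by eg-GRIDS, so the encoding below is a simplified assumption: a fixed-length code per grammar symbol for GDL, and log2 of the number of alternatives per expansion for DDL. The function names and the toy grammars are hypothetical.

```python
import math

def grammar_bits(grammar):
    """Toy GDL: bits needed to write down every rule of the grammar."""
    symbols = set(grammar)
    for bodies in grammar.values():
        for body in bodies:
            symbols.update(body)
    # Assumed fixed-length code over the distinct symbols plus one end-of-rule marker.
    bits_per_symbol = math.log2(len(symbols) + 1)
    # Each rule costs its head, its body symbols and the end-of-rule marker.
    occurrences = sum(len(body) + 2 for bodies in grammar.values() for body in bodies)
    return occurrences * bits_per_symbol

def derivation_bits(grammar, derivations):
    """Toy DDL: each expansion costs log2(number of alternatives of that non-terminal)."""
    return sum(math.log2(len(grammar[nt])) for derivation in derivations for nt in derivation)

def model_length(grammar, derivations):
    """ML = GDL + DDL."""
    return grammar_bits(grammar) + derivation_bits(grammar, derivations)

# Training examples: "a b", "a a b b", "a a a b b b".
# Overly specific grammar: one rule per training example.
specific = {"S": [["a", "b"], ["a", "a", "b", "b"], ["a", "a", "a", "b", "b", "b"]]}
specific_derivations = [["S"], ["S"], ["S"]]  # non-terminals expanded per example

# More general grammar: S -> a S b | a b
general = {"S": [["a", "S", "b"], ["a", "b"]]}
general_derivations = [["S"], ["S", "S"], ["S", "S", "S"]]

candidates = {
    "specific": model_length(specific, specific_derivations),
    "general": model_length(general, general_derivations),
}
best = min(candidates, key=candidates.get)
```

In this toy setting the specific grammar has the larger GDL and the smaller DDL, and the general grammar wins on the sum, which is the behaviour the MDL heuristic is meant to reward.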

11 Learning Context Free Grammars
eg-GRIDS Architecture
[Figure: flowchart of the eg-GRIDS architecture. Recovered labels:]
– Data: Training Examples, Overly Specific Grammar, Beam of Grammars, Final Grammar
– Learning Operators: Merge NT Operator, Create NT Operator, Create Optional NT, Detect Center Embedding, Body Substitution
– Search Organisation: Operator Mode, Evolutionary Algorithm, Mutation, Selection
– Decision: "Any Inferred Grammar better than those in beam?" (YES / NO)

12 Structure of the talk
– Web Information integration in CROSSMARC
– Learning Context Free Grammars
– Meta-learning for Web Information Extraction
– Machine Learning for Ontology Maintenance
– Conclusions

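13 Meta-learning for Web IE
Stacked generalization
Training: the base-level dataset D is split into folds. For each fold j, the base learners L1…LN are trained on D \ Dj, and the resulting classifiers C1(j)…CN(j) are applied to Dj to produce the meta-level data MDj. The meta-level learner LM is trained on the meta-level dataset MD to give the meta-level classifier CM, and L1…LN are trained on all of D to give C1…CN.
Run-time: a new vector x is passed through C1…CN to form a meta-level vector, from which CM predicts the class value y(x).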
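
The stacked generalization scheme on this slide can be sketched in plain Python. This is a minimal sketch, not the implementation used in the talk: the three toy learners (majority_learner, nearest_mean_learner, table_learner) and the one-dimensional dataset are hypothetical stand-ins for the real base-level and meta-level learners.

```python
from collections import Counter, defaultdict

def majority_learner(data):
    """Base learner: always predicts the most frequent class of its training data."""
    label = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: label

def nearest_mean_learner(data):
    """Base learner for 1-D inputs: predicts the class whose mean is closest to x."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in data:
        sums[y][0] += x
        sums[y][1] += 1
    means = {y: total / n for y, (total, n) in sums.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def table_learner(data):
    """Meta-level learner: most frequent class per meta-level vector, with a default."""
    votes = defaultdict(Counter)
    for x, y in data:
        votes[x][y] += 1
    default = Counter(y for _, y in data).most_common(1)[0][0]
    return lambda x: votes[x].most_common(1)[0][0] if x in votes else default

def train_stacking(D, learners, meta_learner, k=3):
    """Build MD by cross-validation, train C_M on MD, retrain the base learners on D."""
    folds = [D[j::k] for j in range(k)]
    MD = []
    for j, D_j in enumerate(folds):
        rest = [ex for i, fold in enumerate(folds) if i != j for ex in fold]  # D \ D_j
        C_j = [L(rest) for L in learners]                                     # C_1(j)..C_N(j)
        MD += [(tuple(C(x) for C in C_j), y) for x, y in D_j]                 # MD_j
    C_M = meta_learner(MD)
    C = [L(D) for L in learners]                                              # C_1..C_N
    # Run-time: x -> meta-level vector -> class value y(x)
    return lambda x: C_M(tuple(c(x) for c in C))

D = [(0.1, "neg"), (0.2, "neg"), (0.3, "neg"), (0.9, "pos"), (1.0, "pos"), (1.1, "pos")]
predict = train_stacking(D, [majority_learner, nearest_mean_learner], table_learner)
```

Building MD only from predictions on held-out folds keeps the meta-level learner from seeing outputs of classifiers on their own training data.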

14 Meta-learning for Web IE
Information Extraction is not naturally a classification task.
In IE we deal with text documents, paired with templates.
Each template is filled with instances.
Document: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Template T
t(s, e) | s, e | Field f
Transport ZX | 47, 49 | model
15 | 56, 58 | screenSize
TFT | 59, 60 | screenType
Intel Pentium III | 63, 67 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram

15 Meta-learning for Web IE
Combining Information Extraction systems
Document: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
T1, filled by the IE system E1
t(s, e) | s, e | f
Transport ZX | 47, 49 | model
15 | 56, 58 | screenSize
TFT | 59, 60 | screenType
Intel Pentium III | 63, 67 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram
1 GB | 81, 83 | ram
T2, filled by the IE system E2
t(s, e) | s, e | f
Transport ZX | 47, 49 | manuf
TFT | 59, 60 | screenType
Intel Pentium | 63, 66 | procName
600 MHz | 67, 69 | procSpeed
256 MB | 76, 78 | ram
1 GB | 81, 83 | HDcapacity

16 Meta-learning for Web IE
Creating a stacked template
Document: …TransPort ZX 15" XGA TFT Display Intel Pentium III 600 MHZ 256k Mobile processor 256 MB SDRAM up to 1GB…
Stacked template (ST)
s, e | t(s, e) | Field by E1 | Field by E2 | Correct field
47, 49 | Transport ZX | model | manuf | model
56, 58 | 15 | screenSize | - | screenSize
59, 60 | TFT | screenType | screenType | screenType
63, 66 | Intel Pentium | - | procName | -
63, 67 | Intel Pentium III | procName | - | procName
67, 69 | 600 MHz | procSpeed | procSpeed | procSpeed
76, 78 | 256 MB | ram | ram | ram
81, 83 | 1 GB | ram | HDcapacity | -
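
The construction of a stacked template can be sketched as follows, using the span offsets and field names from the example templates on the slides. The function name build_stacked_template and the dictionary representation (span to field) are assumptions made for illustration; "-" marks a span that a system, or the hand-filled template, did not fill.

```python
def build_stacked_template(system_templates, correct=None):
    """Merge templates of the form {(s, e): field} into one row per text span.

    Each row holds the field proposed by every IE system ("-" if none) and,
    when a hand-filled template is given, the correct field for that span.
    """
    spans = sorted(set().union(*system_templates))
    rows = {}
    for span in spans:
        votes = tuple(t.get(span, "-") for t in system_templates)
        label = correct.get(span, "-") if correct is not None else None
        rows[span] = (votes, label)
    return rows

# Hand-filled template T and the templates T1, T2 produced by systems E1, E2.
T = {(47, 49): "model", (56, 58): "screenSize", (59, 60): "screenType",
     (63, 67): "procName", (67, 69): "procSpeed", (76, 78): "ram"}
T1 = {**T, (81, 83): "ram"}
T2 = {(47, 49): "manuf", (59, 60): "screenType", (63, 66): "procName",
      (67, 69): "procSpeed", (76, 78): "ram", (81, 83): "HDcapacity"}

ST = build_stacked_template([T1, T2], correct=T)
for (s, e), (votes, label) in ST.items():
    print(s, e, *votes, label)
```

Each row then acts as one meta-level feature vector (the systems' votes) with a class value (the correct field); at run-time the same function can be called without the hand-filled template.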

17 Meta-learning for Web IE
Training in the new stacking framework
D = set of documents, paired with hand-filled templates
MD = set of meta-level feature vectors
[Figure: for each fold j, the learners L1…LN are trained on D \ Dj to give the IE systems E1(j)…EN(j); their output on Dj is combined into stacked templates ST1, ST2, …, which form MDj. LM is trained on MD to give CM, and L1…LN are trained on D to give E1…EN.]

18 Meta-learning for Web IE
Stacking at run-time
[Figure: a new document d is processed by E1, E2, …, EN, which fill the templates T1, T2, …, TN; these are merged into a stacked template, from which CM produces the final template T.]

19 Structure of the talk
– Web Information integration in CROSSMARC
– Learning Context Free Grammars
– Meta-learning for Web Information Extraction
– Machine Learning for Ontology Maintenance
– Conclusions

20 Ontology Enrichment
The poor performance of many Information Integration systems is due to their inability to handle the evolving nature of the domain they cover.
Highly evolving domain (e.g. laptop descriptions):
– New instances characterize new concepts, e.g. Pentium 2 is an instance that denotes a new concept if it doesn't exist in the ontology.
– New surface appearance of an instance, e.g. PIII is a different surface appearance of Intel Pentium 3.
We concentrate on instances.

21 Ontology Enrichment
[Figure: the ontology enrichment cycle. Recovered labels: Multi-Lingual Domain Ontology; Annotating Corpus Using Domain Ontology; Corpus; Information extraction (machine learning); Additional annotations; Validation (Domain Expert); Ontology Enrichment / Population.]

22 Enrichment with synonyms
The number of instances for validation increases with the size of the corpus and the ontology.
There is a need for supporting the enrichment of the synonymy relationship.
Discover automatically different surface appearances of an instance (CROSSMARC synonymy relationship).
Issues to be handled:
– Synonym: Intel pentium 3 - Intel pIII
– Orthographical: Intel p3 - intell p3
– Lexicographical: Hewlett Packard - HP
– Combination: Intell Pentium 3 - P III

23 Compression-based Clustering
COCLU (COmpression-based CLUstering): a model-based algorithm that discovers typographic similarities between strings (sequences of elements, i.e. letters) over an alphabet (ASCII characters), employing a new score function, CCDiff.
CCDiff is defined as the difference in the code length of a cluster (i.e., of its instances) when adding a candidate string. Huffman trees are used as models of the clusters.
COCLU iteratively computes the CCDiff of each new string from each cluster, implementing a hill-climbing search. The new string is added to the closest cluster, or a new cluster is created (threshold on CCDiff).
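
A rough Python sketch of the idea follows. It is an approximation under stated assumptions rather than the COCLU implementation: the code length of a cluster is taken to be the number of bits needed to Huffman-encode all its characters, with the Huffman code rebuilt from character frequencies each time, and a one-symbol alphabet is charged one bit per character. The function names and the threshold value are hypothetical.

```python
import heapq
from collections import Counter

def code_length(strings):
    """Bits needed to Huffman-encode every character of the given strings."""
    freq = Counter("".join(strings))
    if not freq:
        return 0
    if len(freq) == 1:
        return sum(freq.values())  # assumed: 1 bit per character for a single symbol
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    # The encoded length equals the sum of the weights of all merged nodes.
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def cc_diff(cluster, candidate):
    """Increase in the cluster's code length when the candidate string is added."""
    return code_length(cluster + [candidate]) - code_length(cluster)

def coclu(strings, threshold):
    """Greedy pass: add each string to its closest cluster or open a new one."""
    clusters = []
    for s in strings:
        best = min(clusters, key=lambda c: cc_diff(c, s), default=None)
        if best is not None and cc_diff(best, s) <= threshold:
            best.append(s)
        else:
            clusters.append([s])
    return clusters

clusters = coclu(["abab", "ab", "cd"], threshold=4)
```

A string that reuses the cluster's characters adds few bits, while a string over unseen characters lengthens the codes of the whole cluster. Because this raw difference grows with the length of the candidate, a practical threshold would probably need to account for string length.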

24 Structure of the talk
– Web Information integration in CROSSMARC
– Learning Context Free Grammars
– Meta-learning for Web Information Extraction
– Machine Learning for Ontology Maintenance
– Conclusions

25 Conclusions
– Information integration can benefit from machine learning.
– Grammar learning methods have become efficient.
– Combining IE systems improves performance.
– Ontologies can be used to annotate examples to learn IE systems and enrich ontologies.
– Grammar learning in parallel or in combination with ontology learning?

26 Acknowledgements
This is research of many current and past members of SKEL.
CROSSMARC is joint work of the project consortium.

27 Announcement
IJCAI workshop: Workshop on Grammatical Inference Applications: Successes and Future Challenges
IJCAI-05, Edinburgh, Scotland, July 31, 2005
Paper submission deadline: March 19, 2005
URL: http://www.ics.mq.edu.au/~menno/IJCAI05/

