Information processing Michal Laclavík, Ladislav Hluchý (Email research, information extraction, information retrieval, contextual recommendation)



Bratislava, 10th November 2011
Primary Research Team & Capabilities
Dept. of Parallel and Distributed Computing
Research and Development Areas:
– Large-scale HPCN, Grid and MapReduce applications
– Intelligent and Knowledge oriented Technologies
Experience from IST:
– 3 projects in FP5: ANFAS, CrossGrid, Pellucid
– 6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
– 4 projects in FP7: Commius, Admire, Secricom, EGEE III
Several National Projects (SPVV, VEGA, APVT)
IKT Group Focus:
– Information Processing (Large Scale)
– Graph Processing
– Information Extraction and Retrieval
– Semantic Web
– Knowledge oriented Technologies
– Parallel and Distributed Information Processing
Solutions:
– SGDB: Simple Graph Database
– gSemSearch: Graph based Semantic Search
– Ontea: Pattern-based Semantic Annotation
– ACoMA: KM tool in email
– EMBET: Recommendation System
– Experts on MapReduce and IR (Nutch, Solr, Lucene)
Director & leader of PDC: Dr. Dipl. Ing. Ladislav Hluchý
URL:

Approach and Solutions

Large scale Text and Graph data processing
Core Technology
Web crawling
– Nutch + plugins
Full text indexing and search
– Lucene, Solr
Information Extraction
– Ontea, GATE
All of the above at large scale
– Hadoop, S4
Graph processing and Querying
– Simple Graph Database (SGDB)
– gSemSearch
– Neo4j
– Blueprints
Technologies developed by IISAS (underlined on the original slide): Ontea, SGDB, gSemSearch

Ontea: Information Extraction Tool
Regex patterns
Gazetteers
Results
– Key-value pairs
– Structured into trees and graphs
Transformers, Configuration
– Automatic loading of extractors
Visual Annotation Tool
Integration with external tools
– GATE, Stemmers, Hadoop …
Multilingual tests
– English, Slovak, Spanish, Italian
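A minimal sketch of how regex patterns can yield key-value pairs, as listed on the slide above. The key names, the two regexes and the function `extract` are assumptions for illustration only; they are not Ontea's actual configuration or API.

```python
import re

# Hypothetical patterns: each key maps to a regex whose first group is the value.
PATTERNS = {
    "contact:Email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)\b"),
    "contact:Phone": re.compile(r"(\+\d{1,3}(?:[ -]?\d{1,4}){2,4})"),
}


def extract(text):
    """Return a list of (key, value) pairs found in the text, in pattern order."""
    results = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((key, match.group(1)))
    return results
```

Calling `extract` on a message body returns a flat result list, which later steps could group into trees or graphs.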

Search Prototype
Use of Social Network from email
Includes extracted objects
Full text of extracted objects
Related objects discovered and ordered by spreading activation on the social network graph
Faceted search, navigation

gSemSearch: Graph based Semantic Search
Graph/Network of interacting (interconnected) entities
Discovering relations in the graph (network) using the spreading activation algorithm
Showing relations of a concrete type, e.g. telephone numbers related to a person
Navigation over related entities
Full-text search of the entities
User interface for search
User interaction with data (merging, deleting entities) with immediate impact on discovered relations
Tested on Enron Corpus
– Social Network Search
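To make the spreading activation idea concrete, here is a small sketch under simplifying assumptions: the graph is a plain adjacency dict, each node fires only once, and the `decay` and `threshold` parameters are illustrative. It is not the gSemSearch implementation, and the sample entities are invented.

```python
from collections import defaultdict


def spread_activation(graph, start, decay=0.5, threshold=0.01):
    """Propagate activation from `start`; return other nodes ranked by activation.

    graph: dict mapping node -> list of neighbour nodes
    (undirected edges are listed in both directions).
    """
    activation = defaultdict(float)
    activation[start] = 1.0
    frontier = [(start, 1.0)]
    visited = {start}
    while frontier:
        next_frontier = []
        for node, energy in frontier:
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            # Split the decayed energy evenly among the neighbours.
            share = energy * decay / len(neighbours)
            if share < threshold:
                continue
            for nb in neighbours:
                activation[nb] += share
                if nb not in visited:
                    visited.add(nb)
                    next_frontier.append((nb, share))
        frontier = next_frontier
    # Highest activation first; ties are broken by node name.
    return sorted(
        ((n, a) for n, a in activation.items() if n != start),
        key=lambda item: (-item[1], item[0]),
    )
```

Filtering the ranked list by a node-type prefix (for example `phone:`) would give relations of one concrete type for the starting entity.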

SGDB: Simple Graph Database
Storage for graphs
Optimized for graph traversing and spreading activation
Faster than Neo4j for graph traversing operations
Supports Blueprints API
Graph Database Benchmarks
– Graph Traversal Benchmark for Graph Databases
– Blueprints API: possibility to test compliant graph databases

Future Direction: Relations Discovery in Large Graph Data
Motivation
– Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
– Text can also be converted to a graph.
– Interconnecting graph data and searching for relations is crucial.
Approach
– Forming semantic trees and graphs from text, web, communication, databases and LinkedData
– User interaction with graph data in order to achieve integration and data cleansing
– Users will do it if their effort has an immediate impact on search results

Ontea: Pattern based information extraction and semantic annotation
Text processing

Ontea: Information Extraction (Features)
Regex patterns
Visual Annotation Tool
Integration with external tools
– GATE, Stemmers, Hadoop …
Gazetteers
IE System configuration
– Automatic loading of extractors
Patterns
Multilingual tests
– Spanish, Slovak, English, Italian

Information Extraction Model
Extraction
Processing
Address and product patterns
Address patterns
– 3 words macro
– ZIP macro
– Street number macro
– Street name macro
– City name macro
– Country macro
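One way to read the macro idea above is that an address pattern is assembled from smaller named regex fragments. The sketch below assumes that reading; the macro names, the fragments and the `{MACRO}` placeholder syntax are hypothetical and do not reproduce Ontea's pattern format.

```python
import re

# Hypothetical macro definitions; each one is a regex fragment.
MACROS = {
    "STREET_NAME": r"[A-Z][a-z]+(?: [A-Z][a-z]+)*",
    "STREET_NUMBER": r"\d+[A-Za-z]?",
    "ZIP": r"\d{3} ?\d{2}",
    "CITY": r"[A-Z][a-z]+(?: [A-Z][a-z]+)*",
}


def expand(template):
    """Replace each {MACRO} placeholder with a named group built from its fragment."""
    def substitute(match):
        name = match.group(1)
        return "(?P<%s>%s)" % (name, MACROS[name])
    return re.sub(r"\{(\w+)\}", substitute, template)


# An address pattern composed from the macros above.
ADDRESS = re.compile(expand(r"{STREET_NAME} {STREET_NUMBER}, {ZIP} {CITY}"))


def extract_address(text):
    """Return the address parts as a dict, or None when no address is found."""
    match = ADDRESS.search(text)
    return match.groupdict() if match else None
```

Keeping the fragments in one table means a change to, say, the ZIP format is made once and picked up by every pattern that uses it.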

Segmentation
Sentences
Paragraphs
Objects (Address, Product..)

Information Extraction: Gazetteers configuration
Gazetteer
– Can extract information that cannot be properly extracted by regular expression patterns (such as given names, product names, etc.)
– Gazetteer extraction is combined with regular expression based extraction. For example, personal full names can be extracted with higher precision.
– A gazetteer is easy to update because it is configured by simple text files.
Gazetteer lists: simple text files with keywords
Gazetteer configuration: simple text file with :
Information extractor rules
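The combination of a gazetteer with regular expressions can be sketched as follows. A regex tokenizes the text, and a word pair counts as a full name only if the first word is in a given-name list. The list contents and function names are assumptions; a real gazetteer would be loaded from the text files mentioned above.

```python
import re


def load_gazetteer(lines):
    """Build a set of lower-cased keywords from the lines of a plain text list."""
    return {line.strip().lower() for line in lines if line.strip()}


# Hypothetical given-name list; in practice this would come from a text file.
GIVEN_NAMES = load_gazetteer(["Michal", "Ladislav", "Martin"])


def extract_full_names(text, given_names=GIVEN_NAMES):
    """Return 'Given Family' pairs whose first word is a known given name."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    names = []
    for first, last in zip(words, words[1:]):
        capitalised = first[0].isupper() and last[0].isupper()
        if capitalised and first.lower() in given_names:
            names.append(first + " " + last)
    return names
```

A purely regex-based rule for two capitalised words would also accept "Project Manager"; the gazetteer check is what removes such false positives.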

Information Extraction: Rules configuration
IE System configuration
– IE dynamically loads and runs its components (XMLRegexExtractor, Gazetteer, RuleTransformer) according to the settings in the IE rules file
– IE components execute consecutively and operate on a set of information extraction results
Diagram labels: Information extractor rules file; IE result set; Modified IE result set; IE components: regex based, gazetteer, result set transformer
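The consecutive execution of components on a shared result set might look like the sketch below. The class names, the `REGISTRY` lookup and the rule format (a list of component name plus settings) are stand-ins invented for illustration; they are not the XMLRegexExtractor, Gazetteer or RuleTransformer classes, nor the real rules file syntax.

```python
import re


class RegexExtractor:
    """Adds one (key, value) result per regex match."""
    def __init__(self, key, pattern):
        self.key = key
        self.pattern = re.compile(pattern)

    def process(self, text, results):
        return results + [(self.key, m.group(0)) for m in self.pattern.finditer(text)]


class GazetteerExtractor:
    """Adds one result per keyword that occurs in the text."""
    def __init__(self, key, keywords):
        self.key = key
        self.keywords = sorted(keywords)

    def process(self, text, results):
        return results + [(self.key, w) for w in self.keywords if w in text]


class RenameTransformer:
    """Rewrites the key of existing results without touching the text."""
    def __init__(self, old, new):
        self.old, self.new = old, new

    def process(self, text, results):
        return [(self.new if k == self.old else k, v) for k, v in results]


# Components are looked up by name, so the rules decide what gets loaded.
REGISTRY = {
    "regex": RegexExtractor,
    "gazetteer": GazetteerExtractor,
    "rename": RenameTransformer,
}


def run_pipeline(rules, text):
    """Instantiate each configured component and pass the result set along."""
    results = []
    for name, settings in rules:
        results = REGISTRY[name](**settings).process(text, results)
    return results
```

Because every component takes and returns a result set, extractors and transformers can be reordered or added in the rules without changing the driver loop.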

Semantic Annotation
The concept
InformationExtractor: IE produces a set of extraction results
SemanticAnnotator: SA consumes the IE result set and builds trees convertible to Ontology instances or objects according to an XML schema, e.g. Core Components
– SA first builds an intermediate tree of IE results on which it operates
– Upon creation, the tree is not compliant with the Core Components specification and needs to be transformed
– Therefore tree transformers transform the IE result tree into compliant trees
Diagram labels: Text/email; Set of IE results; Tree of IE results; Object trees

Semantic Annotation
Tree transformers
– Input is a tree of IE results and output is the modified tree of IE results
– Tree transformers execute consecutively and operate on a tree of information extraction results
– Tree transformers, which delete, create, rename, move, switch and order nodes, are configured in the SA rules file
Diagram labels: Tree of IE results; Tree transformer; Modified tree of IE results
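A rough sketch of consecutive tree transformers, covering three of the operations named above (delete, rename, order). The `Node` class and the transformer functions are assumptions made for this example; the SA rules file format is not shown in the slides and is not modelled here.

```python
class Node:
    """A node in a tree of IE results."""
    def __init__(self, name, value=None, children=None):
        self.name = name
        self.value = value
        self.children = list(children or [])

    def to_tuple(self):
        return (self.name, self.value, [c.to_tuple() for c in self.children])


def rename(old, new):
    """Build a transformer that renames every node called `old`."""
    def transform(node):
        if node.name == old:
            node.name = new
        for child in node.children:
            transform(child)
        return node
    return transform


def delete(name):
    """Build a transformer that drops every descendant called `name`."""
    def transform(node):
        node.children = [transform(c) for c in node.children if c.name != name]
        return node
    return transform


def order(node):
    """Transformer that sorts children by name at every level."""
    node.children = sorted((order(c) for c in node.children), key=lambda c: c.name)
    return node


def apply_transformers(tree, transformers):
    """Run the transformers one after another on the same tree."""
    for transformer in transformers:
        tree = transformer(tree)
    return tree
```

Each transformer takes a tree and returns a tree, so a configured sequence can be applied with a simple loop.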

Social Networks
Social network reconstruction:
– probabilistic inference using spreading activation
– relies on the output of the information extractor (IE) in the form of complex objects
Preliminary results on a set of 50 Spanish emails (phone/name):
– Precision 60% (due to lower recall in IE)
– Precision 85% (achievable with better IE)
– self-healing (with new incoming emails)
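For reference, precision and recall over extracted phone/name pairs can be computed as in the sketch below. The function name is an assumption, and any pairs passed to it in examples are invented; this is not the evaluation code behind the figures above.

```python
def precision_recall(predicted, gold):
    """Compare predicted (name, phone) pairs with a hand-checked gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold pairs that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```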

Social Networks
Results as XML or HTML (via XSL Transformations)
Future:
– DataSource for the Search for Partner module
– Improve the recall of the Information Extractor
– Exploit a multi-pass algorithm and named entity recognition: things learned in the first pass will be used in the next, e.g. possible names with initials, etc.
– Build an enhanced statistical reasoning procedure on top of the present Social Network Extractor/Correlator

Email Research
Acoma

Acoma Architecture
Connected to email protocols on desktop or server
No need to change working practices – emails are received and sent as before
Received email is processed by Acoma and enriched with useful information
Extensible with OSGi modules

System Connectors
Connection of Acoma to existing systems
– Document Archives
– Internet or Intranet Systems
– Databases
Access or import of data
Key-value pair transformation
Diagram labels: Meta-Connector; Web Connector; SpreadSheet Connector; Database Connector; Key-value; Transformed Key-value
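The key-value pair transformation step can be pictured with a small spreadsheet example: rows are read as raw key-value pairs and then renamed to shared object keys. The column names, the `KEY_MAP` table and both function names are hypothetical; Acoma's connector interfaces are not described in the slides.

```python
import csv
import io

# Hypothetical mapping from spreadsheet column names to shared object keys.
KEY_MAP = {
    "Company": "org:Name",
    "Tax ID": "org:TaxNo",
    "Phone": "contact:Phone",
}


def spreadsheet_connector(csv_text):
    """Return one dict of raw key-value pairs per spreadsheet row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(record, key_map=KEY_MAP):
    """Rename known keys to shared object keys; drop unknown or empty ones."""
    return {
        key_map[key]: value.strip()
        for key, value in record.items()
        if key in key_map and value and value.strip()
    }
```

A web or database connector would only need to produce the same kind of raw dicts for the transformation step to be reused.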

Acoma architecture: Message Post Processing
Useful hints with links are included in the enriched email
Links lead to internal or external systems (Internet, Intranet)

Business objects in emails
A study of 6 organizations shows:
– Objects can be identified by patterns and gazetteers
– It is possible to define a set of common objects
Objects identified:
– Organization: org:Name, org:RegNo, org:TaxNo
– Person: person:Name, person:Function
– Contact: contact:Phone, contact: , contact:Webpage
– Address: address:ZIP, address:Street, address:Settlement
– Product: product:Name, product:Module, product:Component, product:BOID
– Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
– Inventory: inventory:ResID, inventory:ResType
– Other business object ID: BOID

Social Networks and Graph Data
Relations among objects
Support for search


Context based Recommendation, Knowledge Sharing
EMBET, Acoma

Objective: Recommend and provide users with information or knowledge in context
EMBET: proactive information and knowledge provision
Collaboration among users
Knowledge sharing
Active knowledge provision
Reuse of knowledge: notes and other resources

EMBET: Achievements
Software with the following functionality
– User Problem description
– Displaying Knowledge
– Adding Knowledge
– Knowledge Reuse
– Permanent Notes Storage
– Voting on Notes
EMBET architecture: Core, GUI
Context detection
Context Matching to display information & knowledge
Plain text analysis using Advanced Semantic Annotation Algorithms – OnTeA
Theory of different context matching algorithms

Acoma: Hint Recommendation

Information Retrieval and Information Extraction lectures

IR Lectures
Introduction to Information Retrieval
Text Operations, Text Analysis, stemming
Crawling, link processing
IR Models, Indexing techniques
IR Software libraries and systems
Ranking by Graph Algorithms (PageRank, HITS, …) and Searching
Information Extraction
Regular Expressions
Large Scale Data Processing on MapReduce Architecture
Multimedia Information Retrieval
Evaluation Techniques, Precision, Recall
Google
Semantics and IR, Semantic Web Standards
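As an illustration of the "Ranking by Graph Algorithms" topic, here is a compact PageRank power iteration. It is a textbook-style sketch, not material from the lectures; the damping factor of 0.85 and the fixed iteration count are conventional choices assumed for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a graph given as node -> list of link targets."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank
```

On a three-page graph where `a` links to `b` and `c`, `b` links to `c`, and `c` links back to `a`, page `c` ends up ranked highest because it collects links from both other pages.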

Lectures conditions
Every student gets a project focused on
– Crawling
– Indexing
– Ranking
– Information Extraction
– Large Scale Information Processing
They have to consult on the project 3 times during the semester
Availability of data from day one
Lectures are available at: