Information processing Michal Laclavík, Ladislav Hluchý (Email research, information extraction, information retrieval, contextual recommendation)

Primary Research Team & Capabilities
Bratislava, 10th November 2011
Dept. of Parallel and Distributed Computing
Research and Development Areas:
–Large-scale HPCN, Grid and MapReduce applications
–Intelligent and Knowledge oriented Technologies
Experience from IST:
–3 projects in FP5: ANFAS, CrossGrid, Pellucid
–6 projects in FP6: EGEE II, K-Wf Grid, DEGREE (coordinator), EGEE, int.eu.grid, MEDIGRID
–4 projects in FP7: Commius, Admire, Secricom, EGEE III
Several National Projects (SPVV, VEGA, APVT)
IKT Group
Focus:
–Information Processing (Large Scale)
–Graph Processing
–Information Extraction and Retrieval
–Semantic Web
–Knowledge oriented Technologies
–Parallel and Distributed Information Processing
Solutions:
–SGDB: Simple Graph Database
–gSemSearch: Graph based Semantic Search
–Ontea: Pattern-based Semantic Annotation
–ACoMA: KM tool in Email
–EMBET: Recommendation System
–Experts on MapReduce and IR (Nutch, Solr, Lucene)
Director & leader of PDC: Dr. Dipl. Ing. Ladislav Hluchý
URL: http://ikt.ui.sav.sk

Approach and Solutions

Large scale Text and Graph data processing
Core Technology
Web crawling
–Nutch + plugins
Full text indexing and search
–Lucene, Solr
Information Extraction
–Ontea, GATE
All of the above at large scale
–Hadoop, S4
Graph processing and Querying
–Simple Graph Database (SGDB)
–gSemSearch
–Neo4j
–Blueprints
Underlined are the technologies developed by IISAS

Ontea: Information Extraction Tool
Regex patterns
Gazetteers
Results
–Key-value pairs
–Structured into trees, graphs
Transformers, Configuration
Automatic loading of extractors
Visual Annotation Tool
Integration with external tools
–GATE, Stemmers, Hadoop …
Multilingual tests
–English, Slovak, Spanish, Italian
http://ontea.sf.net
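The idea of regex patterns producing key-value pairs can be sketched as follows. This is a minimal Python illustration, not Ontea's actual API: the pattern table, the `extract` function and the sample text are assumptions, and the keys only borrow the `contact:Email` / `contact:Phone` naming used later in the deck.

```python
import re

# Hypothetical pattern table: key -> regex with one capture group.
# The expressions are deliberately simple and only illustrative.
PATTERNS = {
    "contact:Email": re.compile(r"\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)"),
    "contact:Phone": re.compile(r"(\+\d[\d ]{7,}\d)"),
}


def extract(text):
    """Return a list of (key, value) pairs for every pattern match in text."""
    results = []
    for key, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            results.append((key, match.group(1)))
    return results


pairs = extract("Contact: jan.novak@example.org, tel. +421 2 5941 1256")
for key, value in pairs:
    print(key, "=>", value)
```

Each match becomes one key-value pair, so later steps (gazetteers, transformers, tree building) can work on a uniform result set instead of raw text.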

Email Search Prototype
Use of Social Network from email
Includes extracted objects
Full text of extracted objects
Related objects discovered and ordered by spreading activation on the social network graph
Faceted search, navigation

gSemSearch: Graph based Semantic Search
Graph/Network of interacting (interconnected) entities
Discovering relations in the graph (network) using the spreading activation algorithm
Showing relations of a concrete type, e.g. telephone numbers related to a person
Navigation over related entities
Full-text search of the entities
User interface for search
User interaction with data (merging, deleting entities) with immediate impact on discovered relations
Tested on the Enron Email Corpus
–Email Social Network Search
–http://ikt.ui.sav.sk/esns/
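Spreading activation over a graph of entities can be sketched roughly as below. This is a generic textbook-style variant, not gSemSearch's implementation: the decay factor, the threshold, the even split of activation among neighbours and the toy graph are all assumptions made for illustration.

```python
from collections import defaultdict


def spread_activation(graph, start, decay=0.5, threshold=0.01, max_steps=3):
    """Propagate activation from `start` over a graph.

    graph: dict mapping node -> list of neighbour nodes.
    In each step every frontier node passes decay * its activation on,
    split evenly among its neighbours. Returns (node, activation) pairs
    for all reached nodes except `start`, highest activation first.
    """
    activation = defaultdict(float)
    frontier = {start: 1.0}
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            neighbours = graph.get(node, [])
            if not neighbours:
                continue
            share = decay * energy / len(neighbours)
            if share < threshold:
                continue  # too weak to propagate further
            for neighbour in neighbours:
                next_frontier[neighbour] += share
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    activation.pop(start, None)
    return sorted(activation.items(), key=lambda item: -item[1])


# Toy email graph (hypothetical data): a person, two emails, extracted objects.
graph = {
    "alice": ["email1", "email2"],
    "email1": ["alice", "phone:555-0100"],
    "email2": ["alice", "phone:555-0100", "bob"],
    "phone:555-0100": ["email1", "email2"],
    "bob": ["email2"],
}
related = spread_activation(graph, "alice")
```

In this toy graph the phone number appears in both of Alice's emails, so it accumulates more activation than `bob`, who appears in only one; filtering the ranked list by entity type gives relations such as "telephone numbers related to a person".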

SGDB: Simple Graph Database
Storage for graphs
Optimized for graph traversing and spreading activation
Faster than Neo4j for graph traversing operations
Supports Blueprints API
https://simplegdb.svn.sourceforge.net/svnroot/simplegdb/Sgdb3
Graph Database Benchmarks
–Graph Traversal Benchmark for Graph Databases
–http://ups.savba.sk/~marek/gbench.html
–Blueprints API - possibility to test compliant graph databases

Future Direction: Relations Discovery in Large Graph Data
Motivation
–Graph/Network data are everywhere: social networks, web, LinkedData, transactions, communication (email, phone).
–Text can also be converted to a graph.
–Interconnecting graph data and searching for relations is crucial.
Approach
–Forming semantic trees and graphs from text, web, communication, databases and LinkedData
–User interaction with graph data in order to achieve integration and data cleansing
–Users will do it if their effort has an immediate impact on search results

Ontea: Pattern based information extraction and semantic annotation Text processing

Ontea: Information Extraction (Features)
Regex patterns
Visual Annotation Tool
Integration with external tools
–GATE, Stemmers, Hadoop …
Gazetteers
IE System configuration
Automatic loading of extractors
Patterns
Multilingual tests
–Spanish, Slovak, English, Italian

Information Extraction Model
[Diagram labels: Address and product patterns; Extraction; Processing; Address patterns; macros: 3 words, ZIP, Street number, Street name, City name, Country]
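The macro labels on this slide suggest that address patterns are assembled from smaller named fragments. A minimal Python sketch of that idea follows; the macro names, the regular expressions, the `{MACRO}` template syntax and the sample address are all hypothetical and not taken from Ontea.

```python
import re

# Hypothetical macros: small named regex fragments (ASCII-only for brevity).
MACROS = {
    "STREET_NAME": r"[A-Z][a-z]+(?: [A-Za-z][a-z]+)*",
    "STREET_NO": r"\d+[A-Za-z]?",
    "ZIP": r"\d{3} ?\d{2}",
    "CITY": r"[A-Z][a-z]+",
}


def expand(template):
    """Replace each {MACRO} placeholder with its regex as a named group."""
    def substitute(match):
        name = match.group(1)
        return f"(?P<{name}>{MACROS[name]})"
    return re.sub(r"\{(\w+)\}", substitute, template)


# An address pattern written in terms of macros instead of raw regex.
ADDRESS = re.compile(expand(r"{STREET_NAME} {STREET_NO}, {ZIP} {CITY}"))

match = ADDRESS.search("Office: Hlavna ulica 12, 811 01 Bratislava")
address = match.groupdict() if match else {}
```

Keeping the fragments in one table means a fix to, say, the ZIP macro is picked up by every pattern that uses it.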

Segmentation
Sentences
Paragraphs
Objects (Address, Product, ...)

Information Extraction: Gazetteers configuration
Gazetteer
A gazetteer can extract information that regular expression patterns cannot capture properly (such as given names, product names, etc.).
The gazetteer approach is combined with regular-expression-based extraction. For example, personal full names can be extracted with higher precision.
A gazetteer is easy to update, because it is configured by simple text files.
[Diagram labels: Gazetteer lists (simple text files with keywords); Gazetteer configuration (simple text file); Information extractor rules]
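Combining a gazetteer with a regular expression, as this slide describes for full names, could look like the sketch below. It assumes one keyword per line in the gazetteer list; the function names, the sample names and the candidate pattern are illustrative and not Ontea's configuration format.

```python
import re


def load_gazetteer(lines):
    """Build a keyword set from the lines of a plain-text gazetteer list."""
    return {line.strip() for line in lines if line.strip()}


# In practice the lines would come from a text file; inlined here.
GIVEN_NAMES = load_gazetteer(["Michal", "Ladislav", "Maria"])

# Regex candidate: two consecutive capitalised words (ASCII-only for brevity).
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")


def extract_full_names(text):
    """Keep only regex candidates whose first word is a known given name."""
    return [
        f"{first} {last}"
        for first, last in CANDIDATE.findall(text)
        if first in GIVEN_NAMES
    ]


names = extract_full_names("Best Regards, Michal Novak from New York")
```

The regex alone would also accept "Best Regards" and "New York"; the gazetteer check is what removes those false positives and raises precision.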

Information Extraction: Rules configuration
IE System configuration
–IE dynamically loads and runs its components (XMLRegexExtractor, Gazetteer, RuleTransformer) according to the settings in the IE rules file
–IE components are executed consecutively and operate on a set of information extraction results
[Diagram labels: Information extractor rules file; IE result set; Modified IE result set; IE component; Regex based IE component; Gazetteer IE component; Result set transformer IE component]
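The consecutive execution of components over a shared result set can be sketched as a small pipeline. The classes below are simplified stand-ins, not the XMLRegexExtractor, Gazetteer or RuleTransformer classes themselves, and the `RULES` list is a hypothetical substitute for the rules file, whose real format is not shown on the slide.

```python
import re


class RegexExtractor:
    """Adds a (key, match) result for every regex match in the text."""
    def __init__(self, key, pattern):
        self.key = key
        self.pattern = re.compile(pattern)

    def process(self, text, results):
        found = [(self.key, m.group(0)) for m in self.pattern.finditer(text)]
        return results + found


class KeywordGazetteer:
    """Adds a (key, keyword) result for every listed keyword in the text."""
    def __init__(self, key, keywords):
        self.key = key
        self.keywords = keywords

    def process(self, text, results):
        return results + [(self.key, w) for w in self.keywords if w in text]


class RenameTransformer:
    """Rewrites result keys; it reads the result set, not the text."""
    def __init__(self, old, new):
        self.old = old
        self.new = new

    def process(self, text, results):
        return [(self.new if k == self.old else k, v) for k, v in results]


COMPONENT_TYPES = {
    "regex": RegexExtractor,
    "gazetteer": KeywordGazetteer,
    "rename": RenameTransformer,
}

# Hypothetical stand-in for the rules file: component type + arguments.
RULES = [
    ("regex", ("zip", r"\b\d{3} \d{2}\b")),
    ("gazetteer", ("city", ["Bratislava", "Madrid"])),
    ("rename", ("zip", "address:ZIP")),
]


def run_pipeline(rules, text):
    """Instantiate each component by name and run them one after another."""
    results = []
    for kind, args in rules:
        component = COMPONENT_TYPES[kind](*args)
        results = component.process(text, results)
    return results


results = run_pipeline(RULES, "Hlavna ulica 12, 811 01 Bratislava")
```

Because every component takes and returns the same result-set shape, extractors and transformers can be reordered or added purely by editing the rules.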

Semantic Annotation
The concept
InformationExtractor - IE produces a set of extraction results
SemanticAnnotator - SA consumes the IE result set and builds trees convertible to Ontology instances or to objects according to an XML schema, e.g. Core Components
–SA first builds an intermediate tree of IE results on which it operates
–Upon its creation, the tree is not compliant with the Core Components specification and needs to be transformed
–Therefore tree transformers transform the IE result tree into object trees
[Diagram labels: Text/Email; Set of IE results; Tree of IE results; Object trees]

Semantic Annotation
Tree transformers
–Input is a tree of IE results and output is the modified tree of IE results
–Tree transformers are executed consecutively and operate on a tree of information extraction results
–Tree transformers, which delete, create, rename, move, switch and order nodes, are configured in the SA rules file
[Diagram labels: Tree of IE results; Tree transformer; Modified tree of IE results]
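A chain of tree transformers of the kind listed on this slide can be sketched as functions from tree to tree. Only three of the operations (delete, rename, order) are shown; the `Node` class, the factory functions and the sample tree are hypothetical, and the real transformers are configured in the SA rules file rather than in code.

```python
class Node:
    """A node in the tree of IE results."""
    def __init__(self, name, value=None, children=None):
        self.name = name
        self.value = value
        self.children = children or []


def rename(old, new):
    """Transformer factory: rename every node called `old` to `new`."""
    def transform(node):
        if node.name == old:
            node.name = new
        for child in node.children:
            transform(child)
        return node
    return transform


def delete(name):
    """Transformer factory: drop every descendant node called `name`."""
    def transform(node):
        node.children = [transform(c) for c in node.children if c.name != name]
        return node
    return transform


def order():
    """Transformer factory: sort children by name at every level."""
    def transform(node):
        node.children = sorted(
            (transform(c) for c in node.children), key=lambda c: c.name
        )
        return node
    return transform


def apply_all(tree, transformers):
    """Run the transformers consecutively; each sees the previous output."""
    for transformer in transformers:
        tree = transformer(tree)
    return tree


def to_tuple(node):
    """Convert a tree to nested tuples for easy inspection."""
    return (node.name, node.value, [to_tuple(c) for c in node.children])


tree = Node("address", children=[
    Node("zip", "811 01"),
    Node("noise", "xx"),
    Node("city", "Bratislava"),
])
result = apply_all(tree, [delete("noise"), rename("zip", "ZIP"), order()])
```

The order of the chain matters: renaming `zip` before sorting changes where that node ends up among its siblings.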

Social Networks
Social network reconstruction:
–probabilistic inference using spreading activation
–relies on the output of the information extractor (IE) in the form of complex objects
Preliminary results on a set of 50 Spanish emails (phone/name):
–Precision 60% (due to lower recall in IE)
–Precision 85% (achievable with better IE)
–self-healing (with new incoming emails)

Social Networks
Results as XML or HTML (via XSL Transformations)
Future:
–DataSource for Search for Partner module
–Improve the recall of Information Extractor
–Exploit multi-pass algorithm and named entity recognition: things learned in the first pass will be used in the next, e.g. possible names with initials, etc.
–Build an enhanced statistical reasoning procedure on top of the present Social Network Extractor/Correlator

Email Research Acoma

Acoma Architecture
Connected to email protocols on desktop or server
No need to change working practices
–Emails are received and sent as before
Received email is processed by Acoma and enriched with useful information
Extensible with OSGi modules

System Connectors
Connection of Acoma to existing systems
–Document Archives
–Internet or Intranet Systems
–Databases
Access or import of data
Key-value pair transformation
[Diagram labels: Meta-Connector; Web Connector; SpreadSheet Connector; Database Connector; Key-value; Transformed Key-value]

Acoma architecture: Message Post Processing
Useful hints with links are included in the enriched email
Links lead to internal or external systems (Internet, Intranet)

Business objects in Emails
A study of 6 organizations shows:
–Objects can be identified by patterns and gazetteers
–It is possible to define a set of common objects
Objects identified:
–Organization: org:Name, org:RegNo, org:TaxNo
–Person: person:Name, person:Function
–Contact: contact:Phone, contact:Email, contact:Webpage
–Address: address:ZIP, address:Street, address:Settlement
–Product: product:Name, product:Module, product:Component, product:BOID
–Document: doc:Invoice, doc:Order, doc:Contract, doc:ChangeRequest
–Inventory: inventory:ResID, inventory:ResType
–Other business object ID: BOID

Social Networks and Graph Data
Relations among objects
Support for search

Email Search Prototype
Use of Social Network from email
Includes extracted objects
Full text of extracted objects
Related objects discovered and ordered by spreading activation on the social network graph
Faceted search, navigation

Context based Recommendation, Knowledge Sharing EMBET, Acoma

Objective: Recommend and provide information or knowledge to the user in context
EMBET: proactive information and knowledge provision
Collaboration among users
Knowledge sharing
Active knowledge provision
Reuse of knowledge: notes and other resources
http://ups.savba.sk/kwfgrid/uaa/

EMBET: Achievements
Software with the following functionality
–User Problem description
–Displaying Knowledge
–Adding Knowledge
–Knowledge Reuse
–Permanent Notes Storage
–Voting on Notes
EMBET architecture: Core, GUI
Context detection
Context Matching to display information & knowledge
Plain text analysis using Advanced Semantic Annotation Algorithms – OnTeA
Theory of different context matching algorithms

Acoma: Hint Recommendation

Information Retrieval and Information Extraction lectures

IR Lectures
Introduction to Information Retrieval
Text Operations, Text Analysis, stemming
Crawling, link processing
IR Models, Indexing techniques
IR Software libraries and systems
Ranking by Graph Algorithms (PageRank, HITS, …) and Searching
Information Extraction
Regular Expressions
Large Scale Data Processing on MapReduce Architecture
Multimedia Information Retrieval
Evaluation Techniques, Precision, Recall
Google Semantics and IR, Semantic Web Standards

Lecture conditions
Every student gets a project focused on
–Crawling
–Indexing
–Ranking
–Information Extraction
–Large Scale Information Processing
Students have to consult on their project 3 times during the semester
Availability of data from day one
Lectures are available at:
–http://vi.ikt.ui.sav.sk/Témy_prednášok
