BIOMEDICAL DATA INTEGRATION BASED ON METAQUERIER ARCHITECTURE
GROUP MEMBERS: NAIEEM KHAN, EUSUF ABDULLAH MIM, M SAMIULLAH CHOWDHURY
ADVISOR: KHONDKER SHAJADUL HASAN
CO-ADVISOR: JAVED SIDDIQUE

Three basic parts of the project:
 DATA INTEGRATION
 METAQUERIER ARCHITECTURE
 BIOMEDICAL DATA

DATA INTEGRATION
What does it mean? Data integration is the process of combining data residing at different sources and providing the user with a unified view of these data. This process arises in a variety of situations, both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories). Data integration is needed with increasing frequency as the volume of existing data, and the need to share it, explodes.

DATA INTEGRATION
Simple schematic for a data warehouse. The information from the source databases is extracted, transformed, then loaded into the data warehouse.
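The warehouse approach boils down to an extract-transform-load (ETL) pipeline. Below is a minimal, hypothetical Python sketch of that pipeline; the source tables, field names, and the unified schema are assumptions made for illustration, not part of the project.

```python
# Minimal ETL sketch (hypothetical sources and schema, for illustration only).
from dataclasses import dataclass

@dataclass
class ProteinRecord:              # assumed unified (warehouse) schema
    name: str
    source_db: str
    concentration_mg: float

def extract(source_db):
    """Pull raw rows from a source database (stubbed with in-memory data)."""
    return source_db["rows"]

def transform(raw_rows, source_name):
    """Map heterogeneous source fields onto the unified schema."""
    unified = []
    for row in raw_rows:
        unified.append(ProteinRecord(
            name=row.get("protein") or row.get("name"),
            source_db=source_name,
            # convert grams to milligrams when the source reports grams
            concentration_mg=row["conc"] * (1000 if row.get("unit") == "g" else 1),
        ))
    return unified

def load(warehouse, records):
    """Append transformed records into the warehouse (a list standing in for a table)."""
    warehouse.extend(records)

warehouse = []
source_a = {"rows": [{"protein": "Membrane Protein", "conc": 0.5, "unit": "g"}]}
source_b = {"rows": [{"name": "Hemoglobin", "conc": 12.0, "unit": "mg"}]}
load(warehouse, transform(extract(source_a), "source_a"))
load(warehouse, transform(extract(source_b), "source_b"))
print(warehouse)
```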

Difficulties of Data Integration
 Huge web databases
 Database content is now dynamic
 Necessity of an efficient data crawler
 Accurate and precise query interfaces
 Time efficiency
 Depth
 Volume of data to handle

DATA INTEGRATION
Importance
 Integration of data from web databases.
 In order to get necessary information from different sources, data integration is very important.
 In order to achieve large-scale integration.
 Efficient and accurate query answers.
Consider a user who is moving to a new town. To start with, different queries need different sources to answer: Where can she look for real estate listings? (e.g., realtor.com.) Shopping for a new car? (cars.com.) Looking for a job? (monster.com.) Further, different sources support different query capabilities: after source hunting, the user must then learn the grueling details of querying each source.

METAQUERIER ARCHITECTURE

There are different approaches and paradigms for data integration; some of them are:
 Materialized: a physical, integrated repository is created here.
 Data Warehouses: physical repositories of selected data extracted from a collection of DBs and other information sources.
 Mediated: data stay at the sources; a virtual integration system is created.
 Federated and cooperative: DBMSs are coordinated to collaborate.
 Exchange: data is exported from one system to another.
 Peer-to-Peer data exchange: many peers exchange data without a central control mechanism. Data is passed from peer to peer upon request, as query answers.

METAQUERIER ARCHITECTURE
Two basic goals of the MetaQuerier:
 First, to make the deep Web systematically accessible, it will help users find online databases useful for their queries.
 Second, to make the deep Web uniformly usable, it will help users query online databases.

The MetaQuerier architecture rests on two basic principles:
 Dynamic Discovery: as sources are constantly changing, they must be dynamically discovered for integration; there are no preselected sources.
 On-the-Fly Integration: as queries are ad hoc, MetaQuerier must mediate them on the fly over the relevant sources; there are no preconfigured sources.

METAQUERIER ARCHITECTURE

PARTS OF THE METAQUERIER SYSTEM
 Front end: Source Selection, Query Translation, Results Compilation
 Back end: Database Crawler, Interface Extraction, Source Clustering, Schema Matching
 Deep Web Repository

METAQUERIER ARCHITECTURE
PROCESSES OF THE METAQUERIER SYSTEM ARCHITECTURE:
DATA CRAWLER, INTERFACE EXTRACTION, SOURCE CLUSTERING, SCHEMA MATCHING, RESULT COMPILATION, QUERY TRANSLATION, SOURCE SELECTION

PROCESSES OF THE METAQUERIER ARCHITECTURE
Data Crawler
A process to gather certain information from web databases and other resources. It is similar to web crawling, the process used by search engines to search the Internet as queried. The Data Crawler searches data by filtering and categorizing to make the system efficient. It has two different segments:
 Site Crawler
 Shallow Crawler

PROCESSES OF THE METAQUERIER ARCHITECTURE
Data Crawler Workflow:
o The Site Crawler needs an efficient query interface.
o It takes user-queryable keywords from the interface and filters the query.
o The Site Crawler goes through the root page.
o It identifies IP addresses.
o The Shallow Crawler follows those found IP addresses.
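A minimal sketch of this two-stage crawl is shown below; it is an illustration under assumed behavior (a site crawler that collects candidate root pages and a shallow crawler that keeps only pages exposing an HTML form), not the project's actual crawler. It uses the common requests and BeautifulSoup libraries, and the seed URL is hypothetical.

```python
# Hypothetical two-stage crawl sketch: site crawler finds candidate sites,
# shallow crawler checks each root page for a query interface (an HTML <form>).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def site_crawler(seed_url, limit=20):
    """Collect candidate site root URLs linked from a seed page."""
    html = requests.get(seed_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    links = {urljoin(seed_url, a["href"]) for a in soup.find_all("a", href=True)}
    return list(links)[:limit]

def shallow_crawler(root_urls):
    """Keep only root pages that expose a query interface (contain a form)."""
    deep_web_sources = []
    for url in root_urls:
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                      # skip unreachable sites
        if BeautifulSoup(page, "html.parser").find("form"):
            deep_web_sources.append(url)
    return deep_web_sources

if __name__ == "__main__":
    candidates = site_crawler("https://example.org/biomedical-directory")  # assumed seed
    print(shallow_crawler(candidates))
```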

PROCESSES OF THE METAQUERIER ARCHITECTURE
Data Crawler Advantages
 Two major challenges can be addressed through data crawling:
   Dynamic Discovery
   Deep-web searching
 Dynamic Discovery is covered by the Site Crawler.
 Deep-web searching is covered by the Shallow Crawler.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Interface Extraction
 Interface Extraction extracts the required data from the query interfaces.
 Query interfaces share similar query patterns, but sometimes differ.
 Different query patterns arise due to hidden information or attributes.
 These attributes are not visible on the interface.
Workflow:
 The Data Crawler hands over a huge amount of unsorted and hidden data.
 Interface Extraction generates a query which extracts the found data.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Interface Extraction Key Features
 It takes query interfaces in HTML format.
 Then it functions as a visual-language parser.
 Interface Extraction tokenizes the page, parses the tokens and then merges potentially multiple parse trees.
 Finally it generates the query capabilities.
 The basic idea of interface extraction is to extract query capabilities from query interfaces.
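As a rough illustration of the end product, the sketch below pulls a crude set of query capabilities (field names and their candidate values) out of an HTML query interface. The real MetaQuerier extractor is a best-efforts visual-language parser with hidden-syntax handling, so this simple form-walking code is only an assumed simplification; the sample form is invented.

```python
# Simplified capability extraction from an HTML query form (illustrative only;
# the actual MetaQuerier component parses the page as a visual language).
from bs4 import BeautifulSoup

SAMPLE_INTERFACE = """
<form action="/search">
  Protein name: <input type="text" name="name">
  Source: <select name="source">
    <option>all</option><option>Membrane Protein</option>
  </select>
  <input type="checkbox" name="reviewed" value="yes"> reviewed only
</form>
"""

def extract_capabilities(html):
    """Return {field name: list of candidate values} for each form element."""
    soup = BeautifulSoup(html, "html.parser")
    capabilities = {}
    for form in soup.find_all("form"):
        for element in form.find_all(["input", "select"]):
            name = element.get("name")
            if not name:
                continue
            if element.name == "select":
                values = [opt.get_text(strip=True) for opt in element.find_all("option")]
            elif element.get("type") == "checkbox":
                values = [element.get("value", "on")]
            else:
                values = ["<free text>"]      # text boxes accept arbitrary input
            capabilities[name] = values
    return capabilities

print(extract_capabilities(SAMPLE_INTERFACE))
# e.g. {'name': ['<free text>'], 'source': ['all', 'Membrane Protein'], 'reviewed': ['yes']}
```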

PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Selection
 Given a common mediated schema defined for all data sources, we need to match and map the data sources to that mediated schema.
 The target user may understand the concepts in their own domain but may not know what is in other domains.
 The solution is to decide which sources to include in the data integration and which mediated schema to use.
 All ontologies are stored in a common repository.
 The system identifies which ontology will be used based on the user-submitted query.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Result Compilation
 The last process of the data integration.
 It aggregates query results for the user.
 It compiles result data from different sources into coherent pieces.
 It will also be used for extracting data from schema matching and for matching other attributes across different sources.
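A minimal illustration of this aggregation step: records returned from several sources are merged on a shared key into one coherent answer set. The join key and record layout below are assumptions for the sketch, not the project's actual result model.

```python
# Hypothetical result-compilation sketch: merge per-source results on protein name.
from collections import defaultdict

def compile_results(per_source_results):
    """per_source_results: {source name: list of {'name': ..., other fields}}."""
    merged = defaultdict(dict)
    for source, records in per_source_results.items():
        for record in records:
            key = record["name"].lower()          # assumed join key
            merged[key].update(record)            # later sources fill in extra fields
            merged[key].setdefault("found_in", []).append(source)
    return list(merged.values())

results = {
    "uniprot-like source": [{"name": "Hemoglobin", "organism": "human"}],
    "pdb-like source":     [{"name": "Hemoglobin", "structure_id": "1A3N"}],
}
print(compile_results(results))
```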

PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering
 Collaborates with Source Selection, which works in the front end.
 Clusters sources according to subject domain (e.g., edu, org, etc.).
 Sorts data as a mediated process, which provides data to the schema-matching process.
 Its main task is to construct a hierarchy of clusters from a given set of query capabilities.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
CHARACTERISTICS OF DOMAIN ELEMENTS AND CONSTRAINT ELEMENTS:
 Textboxes cannot be used as constraint elements.
 Radio buttons, checkboxes or selection lists may appear as constraint elements.
 An attribute consisting of a single element cannot have constraint elements.
 An attribute consisting of only radio buttons or checkboxes does not have constraint elements.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
HOW TO DIFFERENTIATE BETWEEN DOMAIN & CONSTRAINT ELEMENTS:
A simple two-step method can be used:
1. First, identify the attributes that contain only one element, or whose elements are all radio buttons, checkboxes or textboxes.
2. Second, an Element Classifier is needed to process the other attributes, which may contain both domain elements and constraint elements. Each element is represented by four features: element name, element format type, element relative position in the element list, and element values.
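The sketch below illustrates this two-step method under simplifying assumptions: step 1 applies the characteristics listed above directly, and step 2 stands in for the Element Classifier with a trivial rule over the four features (a real implementation would use a learned classifier).

```python
# Illustrative two-step separation of domain vs. constraint elements.
# Each element is a dict with the four features named on the slide.

def classify_elements(attribute):
    """attribute: list of element dicts with keys
    'name', 'format', 'position', 'values'. Returns (domain, constraint) lists."""
    formats = {e["format"] for e in attribute}
    # Step 1: attributes with one element, or whose elements are all
    # radio buttons, checkboxes or textboxes, have no constraint elements.
    if len(attribute) == 1 or formats <= {"radio", "checkbox", "textbox"}:
        return attribute, []
    # Step 2: stand-in for the Element Classifier -- here a naive rule:
    # an element whose name mentions a matching mode is treated as a constraint.
    domain, constraint = [], []
    for e in attribute:
        looks_like_constraint = any(w in e["name"].lower()
                                    for w in ("match", "exact", "any", "all"))
        (constraint if looks_like_constraint else domain).append(e)
    return domain, constraint

title_attr = [
    {"name": "title", "format": "textbox", "position": 0, "values": []},
    {"name": "title match mode", "format": "select", "position": 1,
     "values": ["exact phrase", "any words"]},
]
print(classify_elements(title_attr))
```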

PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
DERIVING INFORMATION FROM ATTRIBUTES:
Four types of information are defined for each attribute (only for domain elements):
1. Domain type: indicates how many distinct values can be used for an attribute in queries. Four domain types are defined in our model:
 range
 finite
 infinite
 Boolean
2. Value type: each attribute on a search interface has its own semantic value type.
 All input values are treated as text values.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Source Clustering (Cont.)
3. Default Value:
 Indicates some semantics of the attribute.
 May occur in a selection list, a group of radio buttons or a group of checkboxes.
 Always marked as "checked" or "selected".
4. Unit:
 Defines the meaning of an attribute value.
 e.g., kilogram is a unit for weight.
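Collecting the four kinds of attribute information into one record might look like the sketch below; the field names and the example attribute are assumptions made for illustration.

```python
# Hypothetical container for the four kinds of attribute information.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class DomainType(Enum):
    RANGE = "range"
    FINITE = "finite"
    INFINITE = "infinite"
    BOOLEAN = "boolean"

@dataclass
class AttributeInfo:
    name: str
    domain_type: DomainType
    value_type: str = "text"            # all input values treated as text by default
    default_value: Optional[str] = None # the option pre-marked "checked"/"selected"
    unit: Optional[str] = None          # e.g. "kilogram" for a weight attribute
    values: List[str] = field(default_factory=list)

concentration = AttributeInfo(
    name="concentration",
    domain_type=DomainType.RANGE,       # two textboxes give a [low, high] range
    value_type="numeric",
    unit="milligrams",
)
print(concentration)
```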

PROCESSES OF THE METAQUERIER ARCHITECTURE
Schema Matching
 A schema defines the tables, the fields in each table, and the relationships between fields and tables. It is the graphical representation of a database structure.
 Schema matching is the process of identifying whether two objects are semantically related, while mapping refers to the transformations between the objects.
 In data integration, schema matching finds the semantically corresponding domain values among the attributes that have been found through the query interfaces.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Schema Matching
 Uses data from the query capabilities and organizes the data as required.
 It provides data to Source Selection and Query Translation, and finally sends the data to users at the front end.
 MetaQuerier redesigns the process in terms of holistic, complex matching instead of matching schemas one by one.
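As a toy illustration of the idea, the sketch below matches attribute names across two extracted interfaces using a hand-made synonym table and word overlap; the actual MetaQuerier approach does holistic statistical matching over many interfaces at once, so this pairwise sketch (and its synonym table) is only an assumed simplification.

```python
# Toy pairwise attribute matcher (the real component matches holistically
# across many query interfaces; the synonym table below is an assumption).
SYNONYMS = {
    "source": {"category", "origin"},
    "name": {"protein name", "title"},
    "concentration": {"concentration range", "amount"},
}

def normalize(attr):
    return attr.strip().lower()

def is_match(a, b):
    a, b = normalize(a), normalize(b)
    if a == b:
        return True
    # direct synonym lookup, in either direction
    if b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set()):
        return True
    # fall back to word overlap, e.g. "concentration range" ~ "concentration"
    return bool(set(a.split()) & set(b.split()))

schema_q1 = ["name", "category", "concentration range", "onlooker's age"]
schema_q2 = ["name", "source", "concentration", "age"]
matches = [(a, b) for a in schema_q1 for b in schema_q2 if is_match(a, b)]
print(matches)
```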

PROCESSES OF THE METAQUERIER ARCHITECTURE
Schema Matching

PROCESSES OF THE METAQUERIER ARCHITECTURE
Query Translation
 A front-end process.
 Translation is necessary to match and express query conditions in terms of what an interface accepts.
 It is critical to interpret queries automatically.
Steps for complete query translation:
Step 1: extract constraint templates from a query interface.
Step 2: find matching templates from the given source and target constraint templates.

PROCESSES OF THE METAQUERIER ARCHITECTURE
Query Translation
 Constraint mapping: the objective is to find the target constraint with the closest semantic meaning to the source constraint.
 Query mediation: mediates queries across multiple sources. The problem is abstracted as answering queries using views. The focus is to decompose a user query into sub-queries across multiple sources.
 Schema mapping: translates a set of data values from one source to another, according to a given matching. It is concerned only with the equality relation between different schemas.

BIOMEDICAL DATA – PROTEIN
WHAT IS PROTEIN: Any of a large group of nitrogenous organic compounds that are essential constituents of living cells; consist of polymers of amino acids; essential in the diet of animals for growth and for repair of tissues; can be obtained from meat, eggs, milk and legumes.
TYPE OF: MACROMOLECULE, SUPERMOLECULE
PARTS: AMINO ACID (AMINOALKANOIC ACID), POLYPEPTIDE

BIOMEDICAL DATA – PROTEIN
SOME EXAMPLES OF PROTEIN INFORMATION

BIOMEDICAL DATA – PROTEIN
AVAILABLE WEB SERVICES ABOUT PROTEINS

Source Clustering (Example)
DERIVING INFORMATION FROM ATTRIBUTES:
1. Domain type: range, finite, infinite and Boolean.
 Here, two textboxes are used to represent a range for the attribute Production Year; thus the attribute has the range domain type.
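A small sketch of how the domain type might be inferred from an attribute's form elements is given below; the inference rules are assumptions distilled from the examples on these slides, not the project's actual algorithm.

```python
# Assumed rules for inferring the domain type of an attribute from its elements.
def infer_domain_type(elements):
    """elements: list of dicts with 'format' and optional 'values'."""
    textboxes = [e for e in elements if e["format"] == "textbox"]
    listed_values = [v for e in elements for v in e.get("values", [])]
    if len(textboxes) == 2:
        return "range"                  # e.g. two boxes for Production Year low/high
    if textboxes:
        return "infinite"               # free text admits unbounded distinct values
    if set(v.lower() for v in listed_values) <= {"yes", "no", "true", "false"}:
        return "boolean"
    return "finite"                     # a fixed list of selectable values

year_range = [{"format": "textbox"}, {"format": "textbox"}]
reviewed = [{"format": "radio", "values": ["yes", "no"]}]
print(infer_domain_type(year_range), infer_domain_type(reviewed))  # range boolean
```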

Source Clustering (Example)
2. Value type: distinct values.
 For example, the attribute Onlooker's age or Reader age semantically has integer values, and Production date has date values.

Source Clustering (Cont.)
3. Default Value:
 In the previous figure, the attribute Onlooker's age has a default value "all ages".
4. Unit:
 One search interface may use "Milligrams/grams" as the unit of its Concentration attribute, while another may use "Litres" for its Concentration attribute.

Query Translation (Example)
Two biomedical data query interfaces and their matching.
Name of biomedical data: Proteins
Constraint templates (look at the interfaces):
 T1: name
 T2: source
 T3: onlooker's age
 T4: concentration
 S1: name
 S2: category
 S3: concentration; [between; $low, $high]
 S4: onlooker's age; [in; {[18:65], …}]

Query Translation (Example)
 S1: name
 S2: category
 S3: concentration; [between; $low, $high]
 S4: onlooker's age; [in; {[18:65], …}]
 T1: name
 T2: source
 T3: onlooker's age
 T4: concentration
 The focus is to translate between "matching" constraint templates: S2 in Q1 matches T2 in Q2.
 We need to extract the constraint templates (T1, …, T4).
 Given source and target constraint templates (from Q1 and Q2 respectively), we need to find the matching templates.

Query Translation (Example)
Constraint mapping across query interfaces (Q1 and Q2):
Constraint mapping is to instantiate T2 into t2 = [source; all words; "Membrane Protein"], the best translation of the source constraint s2, i.e., s2 → t2.

Query Translation (Example)
Translation rules T12 between Q1 and Q2. To translate queries we need the following mapping rules:
 r1: [category; contain; $s] → emit: [source; all; $s]
 r2: [name; contain; $t] → emit: [name; contain; $t]
 r3: [concentration range; between; $s, $t] → $p = ChooseClosestNum($s), emit: [concentration; less than; $p]
 r4: [onlooker's age; between; $s] → $r = ChooseClosestRange($s), emit: [age; between; $r]
Table: Translation Rules
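To make the rule table concrete, here is a minimal sketch of how rules like r1–r3 could be applied to a source constraint; the ChooseClosestNum helper, its allowed target values, and the encoding of constraints as Python tuples are assumptions for illustration.

```python
# Illustrative application of translation rules r1-r3 (constraints are modeled
# as (attribute, operator, values) tuples; choose_closest_num is a stand-in helper).
def choose_closest_num(value, allowed=(1, 5, 10, 50, 100)):
    """Pick the allowed target value closest to the source value (assumed helper)."""
    return min(allowed, key=lambda a: abs(a - value))

def translate(constraint):
    attribute, operator, values = constraint
    if attribute == "category" and operator == "contain":               # rule r1
        return ("source", "all", values)
    if attribute == "name" and operator == "contain":                   # rule r2
        return ("name", "contain", values)
    if attribute == "concentration range" and operator == "between":    # rule r3
        low, _high = values
        return ("concentration", "less than", (choose_closest_num(low),))
    raise ValueError(f"no translation rule for {constraint}")

print(translate(("category", "contain", ("Membrane Protein",))))
print(translate(("concentration range", "between", (7, 20))))
```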

Query Translation (Example)
 Text-type constraints: operators any, all, exact, start, and string values.
 Numeric-type constraints: operators equal, greater than, less than, between, and numeric values.

Query Translation (Example)
The constraint mapping framework:
 Input: a source constraint s and a target constraint template T.
 Output: the closest target constraint t_opt that T can generate for s.
 The type recognizer identifies the type of the constraints, and then dispatches them accordingly to the type handler.

Query Translation (Example)
 The type handler performs a search to find a good instantiation among the possible ones described by T, which is then returned as the mapping.
 The type recognizer takes the source constraint s and the target constraint template T as input, and infers the data type by analyzing the constraints syntactically.
 The type handler takes the constraints dispatched by the type recognizer as input and searches among the possible instantiations of the target constraint template for the best one.

Query Translation (Example)
Mapping the constraints between category in Q1 and source in Q2:
 The source constraint s = [category; contain; "Membrane Protein"] is instantiated from the template S = [category; contain; $val] by populating $val = "Membrane Protein".
 The target constraint template T = [source; $op; $val] accepts operators $op from {"any words", "all words"} and value $val from any string.
Candidate target constraints:
 t1: [source; any; "Membrane Protein"]
 t2: [source; all; "Membrane Protein"]
 t3: [source; any; "Membrane"]
 t4: [source; any; "Protein"]
Among the candidate target constraints t1, t2, … from I(T), the constraint mapping searches for the element that is closest to the source.
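The closeness search over candidate instantiations could be sketched as below; the enumeration of candidates and the similarity score (word overlap against the source value, with a small bonus for "all words") are assumptions made to illustrate the idea, not the actual scoring used by MetaQuerier.

```python
# Illustrative search for the target instantiation closest to the source constraint.
from itertools import product

def candidate_instantiations(operators, source_value):
    """Enumerate instantiations of T = [source; $op; $val]: each operator paired
    with the full source value or one of its words (assumed enumeration)."""
    values = [source_value] + source_value.split()
    return [("source", op, val) for op, val in product(operators, values)]

def closeness(source_value, candidate):
    """Assumed score: fraction of source words kept, small bonus for 'all words'."""
    _attr, op, val = candidate
    kept = len(set(val.split()) & set(source_value.split())) / len(source_value.split())
    return kept + (0.1 if op == "all words" else 0.0)

source_val = "Membrane Protein"
candidates = candidate_instantiations(["any words", "all words"], source_val)
best = max(candidates, key=lambda c: closeness(source_val, c))
print(best)   # ('source', 'all words', 'Membrane Protein')
```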

EXAMPLE OF AN INTERFACE

EXAMPLE OF AN INTERFACE
 "Invention date" implies the attribute is semantically of a date data type.
 Two elements are used to specify a range query condition, with different roles in specifying the condition.
 Such semantic information is hidden from computers; it is not defined on the query interfaces.
 This HIDDEN information about each attribute needs to be revealed and defined to enrich the schema matching.

OVERVIEW OF THE SYSTEM
 SOURCES ARE NOT PREDEFINED AND PRECONFIGURED, SO THEY MUST BE FOUND DYNAMICALLY ACCORDING TO THE USER'S AD HOC INFORMATION NEEDS.
 AFTER DISCOVERY OF THE WEB DATABASES, THEIR QUERY CAPABILITIES NEED TO BE EXTRACTED, ALSO AUTOMATICALLY AND ON THE FLY.
 THEN, WHEN QUERYING THE SOURCES, THE QUERY IS TRANSLATED ON THE FLY, SINCE THE SOURCES ARE UNSEEN.

OVERVIEW OF THE SYSTEM
WORKFLOW OF THE SYSTEM
BACK END  SEMANTICS DISCOVERY
- DATA CRAWLER: automatically collects sources from the deep web
- INTERFACE EXTRACTION: extracts query capabilities from interfaces
- SOURCE CLUSTERING: clusters interfaces into subdomains
- SCHEMA MATCHING: discovers semantic matchings
FRONT END  EXECUTION OF QUERY
- PROVIDE THE USER A DOMAIN CATEGORY
- FOR EACH CATEGORY, A UNIFIED INTERFACE IS GENERATED BY SCHEMA MATCHING (SM)
- SELECT APPROPRIATE SOURCES TO RUN THE QUERY (SOURCE SELECTION, SS)
- QUERIES FOR THE SELECTED SOURCES ARE TRANSLATED BY QUERY TRANSLATION
- FINALLY, AGGREGATE THE RESULTS BY RESULT COMPILATION

CONCLUSION
 Our target is to deploy MetaQuerier as an efficient data integration architecture.
 The implementation can be carried out successfully based on biomedical data.
 Inside the subsystems of MetaQuerier, some conceptual changes can be made to improve the efficiency of handling huge amounts of unsorted data.

THANK YOU. ANY QUESTIONS?