State of the Nation in Data Integration for Bioinformatics Carole Goble and Robert Stevens Presented by: Daya Wimalasuriya.



2 Outline
- Introduction
- Why integration of bioinformatics sources is especially hard
- Use of traditional data integration techniques in bioinformatics
- Use of new integration techniques
- Where to go from here?

3 Introduction
- As a discipline, bioinformatics is based on a range of diverse, complex and distributed data resources (900 or more).
- Because so many sources exist, data integration is critically important in this field.
- Integrating these sources is a challenging task.
- It is said that bioinformaticians should be a little ashamed of this situation.
- It has been stated that a “Bioinformatics Nation” should be developed from the current set of competing “Princely States”.

4 Introduction (contd.)
Why so many sources?
- The Web makes it (too) easy to publish data.
- Being a resource provider is one way to make a reputation.
- Each new sub-discipline develops its own data representations, skewed to its biases.
- Each type of data has many resources, with many overlaps between them.
In comparison, areas such as particle physics have few centralized data resources.

5 Introduction (contd.)
Additional problems of these sources:
- Each group is highly autonomous and routinely creates different data resources and designs.
- Different groups provide different interfaces (e.g., “flat files”, XML, APIs).
- Users are often independent and decoupled from data providers.
- Many groups don’t have the expertise or the resources needed to survive (only about 18% have a sustained future).

6 Special Challenges of Data Integration (DI) in Bioinformatics (BI)
- Compared with fields such as astronomy and particle physics, bioinformatics data are not very large.
- The problem is the complexity of the data, which arises from several factors:
  - describing a sample and its originating context
  - diversity of sources of a sample
  - large number of inter-links, etc.

7 How to Handle the Complexity of Data?
- Common, shared identities and names
  - “A biologist would rather share their toothbrush than their gene name”
- Shared semantics
  - Ontologies can help, but political and theoretical wrangling hinders their development.
- Shared and stable access mechanisms
  - In 2007, BioMART altered its interfaces four times, breaking any client software that used them.

8 How to Handle the Complexity of Data? (contd.)
- Adhering to standards, a “blue collar” science
- Explicitly stating collection policies and governance
- Balancing “curation” with ease of use
These issues have to be handled while preserving the freedom to innovate rapidly.

9 Different Data Integration Techniques used with Bioinformatics Sources

10 Traditional Data Integration Techniques
Link Integration (Search):
- Directly cross-references an entry in one data source with an entry in another data source.
- Implemented using hyperlinks.
- Widely used by bioinformatics systems such as SRS, Entrez and Integr8.
- In practice, the interlinks are created in a haphazard manner.
- Vulnerable to name changes, updates, etc.
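Link integration and its fragility can be sketched in a few lines. This is a minimal illustration only: the two in-memory "sources", their identifiers (`P001`, `G100`, ...) and the helper names are all hypothetical and do not come from SRS, Entrez or Integr8.

```python
# Hypothetical sources; identifiers and field names are illustrative only.
protein_db = {
    "P001": {"name": "kinase A", "xrefs": {"gene_db": "G100"}},
    "P002": {"name": "kinase B", "xrefs": {"gene_db": "G200"}},
}
gene_db = {
    "G100": {"symbol": "kinA"},
    # The provider renamed G200 to G201, so the link from P002 now dangles.
    "G201": {"symbol": "kinB"},
}


def follow_link(entry_id, source, targets, target_name):
    """Follow the cross-reference from one entry into another source.

    Returns the linked record, or None when the link is missing or broken.
    """
    target_id = source[entry_id]["xrefs"].get(target_name)
    return targets.get(target_id)


def broken_links(source, targets, target_name):
    """List the entries whose cross-reference no longer resolves."""
    return [entry_id for entry_id in source
            if follow_link(entry_id, source, targets, target_name) is None]
```

Calling `broken_links(protein_db, gene_db, "gene_db")` reports `P002`, whose target was renamed without the linking source being told: the "name changes, updates" vulnerability in miniature.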

11 Traditional Data Integration Techniques (contd.)
Data Warehousing (Materialization):
- Data are extracted, cleaned and stored in a separate, integrated database.
- Some people believe that this is the only data integration technique that actually works.
- Requires a predetermined, encompassing model.
- Involves high initial and maintenance costs; is hard to change; is commonly decoupled from data providers.
- Often results in “data mortuaries”.
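The extract, clean and store cycle can be sketched with Python's built-in `sqlite3` module. The two source extracts, the `protein` table and its columns are assumptions made up for this illustration; a real warehouse would have a far richer schema.

```python
import sqlite3

# Hypothetical extracts from two sources that follow different conventions.
source_a = [("P001", "Kinase A "), ("P002", "kinase b")]
source_b = [
    {"acc": "p002", "organism": "E. coli"},
    {"acc": "P003", "organism": "S. cerevisiae"},
]


def build_warehouse(conn):
    """Extract, clean and load both sources into one predetermined schema."""
    conn.execute(
        "CREATE TABLE protein (acc TEXT PRIMARY KEY, name TEXT, organism TEXT)"
    )
    for acc, name in source_a:
        # Cleaning step: normalise accession case and tidy the name.
        conn.execute(
            "INSERT INTO protein (acc, name) VALUES (?, ?)",
            (acc.upper(), name.strip().lower()),
        )
    for row in source_b:
        acc = row["acc"].upper()
        # Update the entry if source A already supplied it, otherwise insert it.
        cur = conn.execute(
            "UPDATE protein SET organism = ? WHERE acc = ?",
            (row["organism"], acc),
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO protein (acc, organism) VALUES (?, ?)",
                (acc, row["organism"]),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_warehouse(conn)
rows = conn.execute(
    "SELECT acc, name, organism FROM protein ORDER BY acc"
).fetchall()
```

The copy is only as fresh as the last load. Once the sources change and nobody reruns the load, the warehouse drifts toward the “data mortuary” described above.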

12 Traditional Data Integration Techniques (contd.)
View Integration (Mediation):
- Data remain in the source databases, but a virtual warehouse is constructed using mappings.
- Uses models such as Global-As-View (GAV) and Local-As-View (LAV).
- Popular among database theorists and vendors.
- Developing a global model is costly, mappings are often brittle, and the result is a complex environment.
- Automated processes are necessary to make this method practically useful.
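A Global-As-View mediator can be sketched as follows. Everything here is hypothetical: the two wrapper functions stand in for live source queries, and the global relation `protein(acc, name)` is invented for the example.

```python
# Hypothetical wrappers over source databases that stay in place;
# nothing is copied into a separate store.
def query_source_a():
    return [{"id": "P001", "label": "kinase A"}]


def query_source_b():
    return [{"accession": "P002", "protein_name": "kinase B"}]


# GAV mapping: the global relation protein(acc, name) is defined as a view
# over each source, written here as a wrapper plus a field renaming
# (global field -> source field).
GAV_MAPPINGS = [
    (query_source_a, {"acc": "id", "name": "label"}),
    (query_source_b, {"acc": "accession", "name": "protein_name"}),
]


def global_protein(predicate=lambda row: True):
    """Answer a query on the global schema by unfolding it onto the sources."""
    for fetch, mapping in GAV_MAPPINGS:
        for record in fetch():
            row = {g_field: record[s_field]
                   for g_field, s_field in mapping.items()}
            if predicate(row):
                yield row
```

For instance, `list(global_protein(lambda r: r["acc"] == "P002"))` returns the single row contributed by the second source. If that source renamed `accession`, the mapping would fail with a `KeyError`, a small picture of why such mappings are called brittle.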

13 Traditional Data Integration Techniques (contd.)
Integration Application (Ad hoc methods):
- Applications specifically designed for a particular integration task.
- Generally use a combination of integration techniques and provide more options for the user.
- Avoid the “Big I Challenge”.
Workflows coordinate a transient flow between data services and analytical tools, and expose the integration methods.

14 New Data Integration Techniques
Service Oriented Architectures:
- Include technologies such as CORBA and Web Services.
- Data integration has to be achieved by “plumbing” these services together.
- CORBA is generally considered too heavyweight despite its technical sophistication.
- SOAP-based web services have shown promise but suffer from problems such as poor documentation.
- These services are necessary to do away with the widespread practice of “screen scraping”.

15 New Data Integration Techniques (contd.)
Mashups:
- A Web 2.0 idea based on taking data from more than one web-based resource to build a new web-based application (e.g., combining a feed of earthquake measurements with Google Maps).
- Delivered through the Web, open and light.
- Platforms such as Microsoft Popfly and Yahoo! Pipes are already available.
- Have been used in applications such as tracking the spread of avian flu.

16 Mashups (contd.)
- Emphasize the role of the user in creating a specific, light-touch, on-demand integration.
- Rely on the existence of APIs and lightweight tools for development.
- Preferred by bioinformaticians over heavier general engineering solutions.
- Just as vulnerable as other integration techniques to identity clashes and concept ambiguities.
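The core of a mashup is a light join between two feeds. In the sketch below the feeds are hard-coded JSON strings standing in for web API responses; the place names, fields and the `mash` helper are invented for illustration, and a real mashup would fetch the data over HTTP and hand the markers to a mapping service.

```python
import json

# Stand-ins for two web API responses (hypothetical data).
outbreak_feed = json.loads(
    '[{"place": "Site A", "cases": 3}, {"place": "Site B", "cases": 7}]'
)
location_feed = json.loads(
    '[{"place": "Site A", "lat": 10.0, "lon": 20.0}]'
)


def mash(outbreaks, locations):
    """Join the two feeds on place name and build map markers.

    Returns the markers plus the place names that could not be matched.
    """
    coords = {loc["place"]: (loc["lat"], loc["lon"]) for loc in locations}
    markers, unmatched = [], []
    for item in outbreaks:
        if item["place"] in coords:
            lat, lon = coords[item["place"]]
            label = f'{item["place"]}: {item["cases"]} cases'
            markers.append({"lat": lat, "lon": lon, "label": label})
        else:
            unmatched.append(item["place"])
    return markers, unmatched


markers, unmatched = mash(outbreak_feed, location_feed)
```

The join key is a plain name, so `Site B` falls into `unmatched`: the same identity problem that affects the heavier techniques.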

17 New Data Integration Techniques (contd.)
- Semantic Web applications are also expected to help the integration of bioinformatics sources; they are called “smashups”.
- Smashups should use ontologies and support reasoning.
- The communication mechanisms may be the same as those used by mashups (e.g., AJAX on the client side).

18 Architecture of Smashups

19 Requirements of Smashups
- Simple and stable APIs that can be used by third parties
- Publishing data as RDF and supporting SPARQL endpoints (or supporting conversion to these formats from databases)
- Clarifying the semantics
- Tackling the problem of object reconciliation
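Two of these requirements, triple-shaped data and object reconciliation, can be illustrated without any Semantic Web tooling. The sketch uses plain Python tuples as a stand-in for an RDF store and a wildcard match as a stand-in for a SPARQL triple pattern; it is not RDF or SPARQL itself, and the `ex:` and `db:` identifiers are made up.

```python
# Hypothetical triples (subject, predicate, object).
TRIPLES = [
    ("ex:P001", "ex:name", "kinase A"),
    ("db:prot_1", "ex:organism", "E. coli"),
    # Reconciliation statement: both identifiers denote the same object.
    ("ex:P001", "ex:sameAs", "db:prot_1"),
]


def match(pattern, triples=TRIPLES):
    """Return the triples matching (s, p, o); None acts as a wildcard."""
    return [t for t in triples
            if all(want is None or want == got
                   for want, got in zip(pattern, t))]


def aliases(node, triples=TRIPLES):
    """Identifiers declared equivalent to node (one hop, either direction)."""
    same = {node}
    for s, _, o in match((None, "ex:sameAs", None), triples):
        if s == node:
            same.add(o)
        elif o == node:
            same.add(s)
    return same


def describe(node, triples=TRIPLES):
    """Collect the properties of node and of everything reconciled with it."""
    return {p: o
            for alias in aliases(node, triples)
            for _, p, o in match((alias, None, None), triples)
            if p != "ex:sameAs"}
```

With the reconciliation triple in place, `describe("ex:P001")` merges the name from one source with the organism from the other. Without it, the two records stay separate even though they describe the same protein.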

20 Where to go from here?
- Better naming standards and better handling of object reconciliation are essential for the success of any data integration technique in bioinformatics.
- Most promising techniques:
  - Web Services
  - Mashups (Web 2.0 Applications)
  - Smashups (Semantic Web Applications)

21 Thank You! Questions?