State of the Nation in Data Integration for Bioinformatics Carole Goble and Robert Stevens Presented by: Daya Wimalasuriya.


2 Outline Introduction Why integration of bioinformatics sources is especially hard Use of traditional data integration techniques in bioinformatics Use of new integration techniques Where to go from here?

3 Introduction As a discipline, bioinformatics is based on a range of diverse, complex and distributed data resources (900 or more). Because of the existence of such a large number of sources, data integration is critically important in this field. Integration of these sources is a challenging task. It is said that bioinformaticians should be a little ashamed of this situation.  It has been stated that a “Bioinformatics Nation” should be developed from the current set of competing “Princely States”.

4 Introduction (contd.) Why so many sources?  The Web makes it (too) easy to publish data.  Being a resource provider is one way to make a reputation.  Each new sub-discipline develops its own data representations skewed to its biases.  Each type of data has many resources which have many overlaps. In comparison, areas such as particle physics have few centralized data resources.

5 Introduction (contd.) Additional problems of these sources  Each group is highly autonomous and routinely creates different data resources and designs.  Different interfaces are provided by different groups (e.g., “flat files”, XML, APIs).  Users are often independent and decoupled from data providers.  Many groups don’t have the expertise or the resources needed to survive (only about 18% have a sustained future).

6 Special Challenges of DI in BI When compared with fields such as astronomy and particle physics, bioinformatics data are not very large. The problem is the complexity of data, arising from several factors  describing a sample and its originating context  diversity of sources of a sample  large number of inter-links, etc.

7 How to Handle the Complexity of Data? Common, shared identities and names: “A biologist would rather share their toothbrush than their gene name”. Shared semantics  Ontologies can help, but political and theoretical wrangling hinders their development. Shared and stable access mechanisms  In 2007, BioMart altered its interfaces four times, breaking any client software that used them.

8 How to Handle the Complexity of Data? (contd.) Adhering to standards, a “blue collar” science. Explicitly stating collection policies and governance. Balancing “curation” with ease of use. These issues have to be handled while keeping the freedom of rapid innovation.

9 Different Data Integration Techniques used with Bioinformatics Sources

10 Traditional Data Integration Techniques Link Integration (Search):  Directly cross-references a data entry in a data source with another entry in another data source.  Implemented using hyperlinks.  Widely used by bioinformatics systems such as SRS, Entrez and Integr8.  This technique actually represents interlinks created in a haphazard manner.  Vulnerable to name changes, updates, etc.
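The fragility of link integration can be illustrated with a minimal Python sketch. It assumes two hypothetical sources held as dictionaries; the identifiers, field names and record contents are invented for illustration and do not come from SRS, Entrez or Integr8.

```python
# Two hypothetical sources; every identifier and record here is invented.
GENE_DB = {
    "G001": {"name": "geneA", "protein_ref": "P100"},
    "G002": {"name": "geneB", "protein_ref": "P200"},
}
PROTEIN_DB = {
    "P100": {"description": "protein for geneA"},
    # The provider renamed P200 to P201, so the link stored in G002 dangles.
    "P201": {"description": "protein for geneB"},
}


def follow_link(gene_id):
    """Follow the cross-reference from a gene entry to its protein entry.

    Returns the protein record, or None when the link no longer resolves.
    """
    ref = GENE_DB[gene_id]["protein_ref"]
    return PROTEIN_DB.get(ref)


def broken_links():
    """List gene IDs whose cross-reference points at a missing entry."""
    return [gene_id for gene_id in GENE_DB if follow_link(gene_id) is None]
```

Calling `broken_links()` here reports `G002`: the link was valid when it was written and broke only because the target source changed an identifier, which is the vulnerability noted on the slide.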

11 Traditional Data Integration Techniques (contd.) Data Warehousing (Materialization):  Data are extracted, cleaned and stored in a separate, integrated database.  Some people believe that this is the only data integration technique that actually works.  Requires a pre-determined encompassing model.  Involves a high initial cost as well as high maintenance costs; hard to change; commonly decoupled from data providers.  Often results in “data mortuaries”.
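The extract, clean and store steps can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the two source extracts, the `gene` table and the cleaning rule are hypothetical, and a real warehouse would need a far richer model.

```python
import sqlite3

# Hypothetical extracts from two sources with inconsistent conventions.
SOURCE_A = [("BRCA1", "human"), ("tp53 ", "Human")]
SOURCE_B = [("TP53", "HUMAN"), ("MYC", "human")]


def clean(row):
    """Normalize case and whitespace so records from both sources align."""
    gene, organism = row
    return gene.strip().upper(), organism.strip().lower()


def build_warehouse(*sources):
    """Extract, clean and load all sources into one in-memory database."""
    conn = sqlite3.connect(":memory:")
    # The pre-determined model: a single table fixed before any data is loaded.
    conn.execute(
        "CREATE TABLE gene (symbol TEXT, organism TEXT, "
        "PRIMARY KEY (symbol, organism))"
    )
    for source in sources:
        rows = [clean(row) for row in source]
        # INSERT OR IGNORE drops records that several sources have in common.
        conn.executemany("INSERT OR IGNORE INTO gene VALUES (?, ?)", rows)
    conn.commit()
    return conn


def gene_symbols(conn):
    """Return all gene symbols held in the warehouse, sorted."""
    rows = conn.execute("SELECT symbol FROM gene ORDER BY symbol").fetchall()
    return [row[0] for row in rows]
```

Once `build_warehouse(SOURCE_A, SOURCE_B)` has run, the copy is frozen: later changes at the sources are not reflected until the load is repeated, which is one way a warehouse drifts toward a “data mortuary”.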

12 Traditional Data Integration Techniques (contd.) View Integration (Mediation):  Data is still in the source databases but a virtual warehouse is constructed using mappings.  Uses models such as Global-As-View (GAV) and Local-As-View (LAV).  Popular among database theorists and vendors.  Developing a global model is costly, mappings are often brittle, results in a complex environment.  Automated processes are necessary to make this method practically useful.
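A Global-As-View mapping can be sketched in a few lines of Python. The two source functions, their field names and the global relation `gene(symbol, accession, function)` are all hypothetical; the point is only that the global view is defined as a query over the sources and evaluated on demand, with no data copied.

```python
# Hypothetical sources, each exposing its own local schema.
def source_sequences():
    return [{"acc": "X1", "gene": "geneA"}, {"acc": "X2", "gene": "geneB"}]


def source_annotations():
    return [{"gene_name": "geneA", "role": "kinase"}]


def global_gene_view():
    """GAV mapping: the global relation gene(symbol, accession, function)
    is defined as a join over the local sources, computed at query time."""
    roles = {a["gene_name"]: a["role"] for a in source_annotations()}
    for seq in source_sequences():
        yield {
            "symbol": seq["gene"],
            "accession": seq["acc"],
            # None marks a gene that the annotation source does not cover.
            "function": roles.get(seq["gene"]),
        }


def query(symbol):
    """Answer a query posed against the global schema only."""
    return [row for row in global_gene_view() if row["symbol"] == symbol]
```

The brittleness mentioned on the slide shows up in the mapping itself: if either source renames a field such as `gene_name`, `global_gene_view` breaks even though no client query has changed.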

13 Traditional Data Integration Techniques (contd.) Integration Application (Ad hoc methods)  Applications specifically designed for a particular integration task.  Generally use a combination of integration techniques and provide more options for the user.  Avoid the “Big I Challenge”. Workflows coordinate a transient workflow between data services and analytical tools and expose the integration methods.

14 New Data Integration Techniques Service Oriented Architectures:  Include technologies such as CORBA and Web Services.  Data integration has to be achieved by “plumbing” these services.  CORBA is generally considered too heavy despite its technical sophistication.  SOAP-based web services have shown promise but have problems such as poor documentation.  These services are necessary to do away with the widespread practice of “screen scraping”.

15 New Data Integration Techniques (contd.) Mashups:  A Web 2.0 idea based on taking data from more than one web-based resource to build a new web-based application (e.g., combining a feed of earthquake measurements with Google Maps).  Delivered through the Web, open and light.  Platforms such as Microsoft Popfly and Yahoo! Pipes are already available.  Have been used in applications such as tracking the spread of avian flu.

16 Mashups (contd.) Emphasize the role of the user in creating a specific, light-touch, on-demand integration. Rely on the existence of APIs and lightweight tools for development. Preferred by bioinformaticians over heavier general engineering solutions. Just as vulnerable as other integration techniques to identity clashes and concept ambiguities.
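The earthquake-and-map example can be sketched as a join of two feeds on a shared key. The JSON payloads below are hypothetical stand-ins for two independent web resources, and the marker format is an assumption, not the API of any real mapping service.

```python
import json

# Hypothetical payloads standing in for two independent web feeds.
QUAKE_FEED = (
    '[{"place": "site-1", "magnitude": 4.2},'
    ' {"place": "site-2", "magnitude": 5.1}]'
)
GEO_FEED = '{"site-1": [10.0, 20.0], "site-2": [30.5, 40.5]}'


def mashup(quake_json, geo_json, min_magnitude=0.0):
    """Combine a measurement feed with a coordinate feed into map markers."""
    quakes = json.loads(quake_json)
    coords = json.loads(geo_json)
    markers = []
    for quake in quakes:
        # The join relies on both feeds using the same place names.
        if quake["magnitude"] >= min_magnitude and quake["place"] in coords:
            lat, lon = coords[quake["place"]]
            markers.append(
                {"lat": lat, "lon": lon, "label": f'M{quake["magnitude"]}'}
            )
    return markers
```

The join key is where the identity clashes noted above bite: if one feed wrote `Site 1` and the other `site-1`, the record would silently drop out of the result.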

17 New Data Integration Techniques (contd.) Semantic Web applications are also expected to help the integration of bioinformatics sources. They are called “smashups”. Smashups should use ontologies and support reasoning. The communication mechanisms may be the same as those used by mashups (e.g., AJAX on the client side).

18 Architecture of Smashups

19 Requirements of Smashups Simple and stable APIs that can be used by third parties Publishing data as RDF and supporting SPARQL endpoints (or supporting conversion to these formats from databases) Clarifying the semantics Tackling the problem of object reconciliation.
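Object reconciliation over RDF-style data can be sketched with plain Python tuples. This is not an RDF library or a SPARQL engine: the triples, the `srcA:`, `srcB:` and `ex:` identifiers are invented, `owl:sameAs` is handled only one level deep, and `match` is a hypothetical helper that mimics a single triple pattern with `None` as a wildcard.

```python
# Triples from two hypothetical sources; all identifiers are invented.
TRIPLES = [
    ("srcA:gene1", "rdfs:label", "geneA"),
    ("srcB:g-77", "ex:encodes", "srcB:prot9"),
    # A reconciliation statement: both identifiers denote the same gene.
    ("srcA:gene1", "owl:sameAs", "srcB:g-77"),
]


def reconcile(triples):
    """Rewrite aliases to one representative per object.

    Only direct owl:sameAs statements are applied; chains are not followed.
    """
    alias = {o: s for s, p, o in triples if p == "owl:sameAs"}
    merged = []
    for s, p, o in triples:
        if p == "owl:sameAs":
            continue
        merged.append((alias.get(s, s), p, alias.get(o, o)))
    return merged


def match(triples, s=None, p=None, o=None):
    """Return triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]
```

After `reconcile(TRIPLES)`, a query for `srcA:gene1` returns both its label from the first source and the `ex:encodes` statement from the second, which neither source could answer alone.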

20 Where to go from here? Better naming standards and handling of object reconciliation are essential for the success of any data integration technique in bioinformatics. Most promising techniques:  Web Services  Mashups (Web 2.0 Applications)  Smashups (Semantic Web Applications)

21 Thank You! Questions?
