Biological Data Extraction and Integration A Research Area Background Study Cui Tao Department of Computer Science Brigham Young University.

Presentation transcript:

Biological Data Extraction and Integration A Research Area Background Study Cui Tao Department of Computer Science Brigham Young University

2 Research Field Overview
My research: Semantic Web, Data Integration, Schema Matching, Information Extraction, Bioinformatics

3 Information Extraction “Information extraction systems process text documents and locate a specific set of relevant items.” [Califf99]

4 Information Extraction “Information extraction systems process text documents and locate a specific set of relevant items.” [Califf99] “Because the WWW consists primarily of text, information extraction is central to all effort that would use the web as a resource for knowledge discovery.” [Freitag98]

5 Information Extraction
Traditional information extraction
Hidden web crawling
Biological data extraction

6 Traditional Information Extraction
Different groups of IE tools [Laender02]:
–Wrapper generation tools
–NLP-based and learning-based tools
–Ontology-based tools

7 Traditional Information Extraction
Wrapper generation tools
–Lixto [Baumgartner01]
   Supervised wrapper generation
   Semi-automatic
   Not robust; does not work well with unstructured data
–ROADRUNNER [Crescenzi01]
   Fully automatic wrapper generation
   Does not generate robust and general wrappers
   Only works for highly regular web pages

8 Traditional Information Extraction
NLP-based and learning-based tools
–SRV [Freitag98]
   Top-down learner
   Learns based on simple and relational features
   Single slot filling
–RAPIER [Califf99]
   Bottom-up learner
   Learns pre-filler, slot filler, and post-filler patterns
   Only works for free text
   Single slot filling
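A minimal sketch of the pre-filler / slot filler / post-filler idea, written as one hand-coded regular expression. RAPIER learns such patterns from training examples; the rule, the sample sentence, and the `extract_slot` helper below are illustrative assumptions and do not reproduce RAPIER's actual rule format.

```python
import re

# Hypothetical hand-written rule; a learner such as RAPIER would induce
# rules of this shape from annotated examples instead.
RULE = re.compile(
    r"(?P<pre>located in )"       # pre-filler: context required before the slot
    r"(?P<filler>[A-Z][a-z]+)"    # slot filler: the value to extract
    r"(?P<post>,)"                # post-filler: context required after the slot
)

def extract_slot(text):
    """Return the first slot filler matched by RULE, or None if nothing matches."""
    match = RULE.search(text)
    return match.group("filler") if match else None

print(extract_slot("The lab is located in Provo, Utah."))  # prints "Provo"
```

The rule fills a single slot, which mirrors the "single slot filling" limitation listed for SRV and RAPIER.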

9 Traditional Information Extraction
Ontology-based tools
–BYU Ontos [Embley99]
   Based on domain-specific extraction ontologies
   Robust to changes
   Multiple slot filling
   Ontologies have to be built manually

10 Hidden Web Crawling
Traditional IE tools: publicly indexable web pages
Hidden web crawling
–Crawl the hidden web according to a user’s query
–HiWE (Hidden Web Exposer) [Raghavan01]
   Source form representation → task-specific DB concepts
   Fill out and submit forms
   Retrieve information hidden behind the form
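The form-filling steps can be sketched as follows: match each form label to a concept in a task-specific database, assign the concept's value, and encode the result for submission. The form description, the concept table, and the exact-label matching are assumptions made for illustration; HiWE's own matching is not reproduced here.

```python
from urllib.parse import urlencode

# Hypothetical form description: internal field name -> label shown on the page.
FORM_FIELDS = {"org": "Organism", "gene": "Gene Name"}

# Hypothetical task-specific database: concept -> value to submit for this task.
CONCEPT_VALUES = {"organism": "Homo sapiens", "gene name": "BRCA1"}

def fill_form(form_fields, concept_values):
    """Match each form label to a concept (by normalized label) and assign its value."""
    assignment = {}
    for field_name, label in form_fields.items():
        concept = label.strip().lower()
        if concept in concept_values:
            assignment[field_name] = concept_values[concept]
    return assignment

def build_query(assignment):
    """Encode the filled-in form as a URL query string ready for submission."""
    return urlencode(assignment)

query = build_query(fill_form(FORM_FIELDS, CONCEPT_VALUES))
print(query)  # prints "org=Homo+sapiens&gene=BRCA1"
```

Submitting the query and parsing the result page are left out; they depend on the individual site.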

11 Biological Data Extraction
Mainly from plain text
Extract biological terms
–Dictionary-based
–Rule-based
Extract relationships between biological terms/elements
Example systems
–BLAST-based name identifier [Krauthammer00]
–PASTA (Protein Active Site Template Acquisition) [Gaizauskas03]
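A minimal sketch of dictionary-based term extraction: tokenize the text and look each token up in a term dictionary. The dictionary entries, the tokenizer, and the `find_terms` helper are assumptions for illustration and are not taken from the systems cited above.

```python
import re

# Hypothetical dictionary of known biological terms and their types.
TERM_DICTIONARY = {"p53": "protein", "brca1": "gene", "insulin": "protein"}

def find_terms(text, dictionary):
    """Return (token, type, start offset) for every token found in the dictionary."""
    hits = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        token = match.group(0)
        term_type = dictionary.get(token.lower())  # case-insensitive lookup
        if term_type is not None:
            hits.append((token, term_type, match.start()))
    return hits

hits = find_terms("Mutations in BRCA1 alter p53 activity.", TERM_DICTIONARY)
print(hits)  # prints [('BRCA1', 'gene', 13), ('p53', 'protein', 25)]
```

A rule-based extractor would replace the dictionary lookup with patterns over token shape and context, at the cost of writing or learning those rules.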

12 The Semantic Web
Machine-understandable web
Gives information a well-defined meaning
Allows automation of tasks
Provides biologists
–Intelligent information services
–Personalized web resources
–Semantically empowered search engines

13 The Semantic Web
Semantic web languages
–XOL (XML-based Ontology Exchange Language)
–SHOE (Simple HTML Ontology Extensions)
–OML (Ontology Markup Language)
–RDF(S) (Resource Description Framework (Schema))
–OIL (Ontology Interchange Language)
–DAML+OIL (DARPA Agent Markup Language + OIL)
–OWL (Web Ontology Language)
Semantic annotation
–Old: indexing of publications in libraries
–New: information extraction

14 Schema Matching
Previous methods [Raghavan01]:
–Individual matchers vs. combining matchers
–Schema-based matchers vs. instance-based matchers
–Learning-based matchers vs. rule-based matchers
–Element-level matchers vs. structure-level matchers

15 Schema Matching
LSD (Learning Source Descriptions) [Doan01]
–Semi-automatic
–Learning-based
–Both schema-level and instance-level
–Only 1-1 mappings
GLUE & CGLUE [DMD+03]
–Ontology alignment
–CGLUE: complex (non-1-1) mappings

16 Schema Matching
Cupid [Madhavan01]
–Rule-based matcher
–Both element-level and structure-level
–Schema-based
–Works on hierarchical schemas with schema tree
–Linguistic similarity & structure similarity
–Matches tree elements by weighted similarities
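A minimal sketch of matching by weighted similarities: a linguistic score on element names is combined with a structural score on the elements' children. The string-ratio measure, the Jaccard overlap on child names, and the default weight of 0.5 are simplified stand-ins chosen for illustration, not Cupid's actual measures or parameters.

```python
from difflib import SequenceMatcher

def linguistic_sim(name_a, name_b):
    """Stand-in linguistic similarity: character-level ratio of the two names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def structural_sim(children_a, children_b):
    """Stand-in structural similarity: Jaccard overlap of the child element names."""
    a, b = set(children_a), set(children_b)
    if not a and not b:
        return 1.0  # two leaves are treated as structurally identical
    return len(a & b) / len(a | b)

def weighted_sim(elem_a, elem_b, w_struct=0.5):
    """Combine both scores; each element is a (name, list of child names) pair."""
    name_a, children_a = elem_a
    name_b, children_b = elem_b
    lsim = linguistic_sim(name_a, name_b)
    ssim = structural_sim(children_a, children_b)
    return w_struct * ssim + (1 - w_struct) * lsim

# Same name, disjoint children: linguistic score 1.0, structural score 0.0.
print(weighted_sim(("Gene", ["id"]), ("Gene", ["symbol"])))  # prints 0.5
```

Raising `w_struct` makes the matcher trust tree structure more than element names, which is the trade-off a weighted combination exposes.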

17 Schema Matching
COMA (COmbining MAtch) [Do02]
–Combines different matchers
–Interactive with users
–Also an evaluation platform for different matchers

18 Biological Data Integration
Challenges:
–Huge amount of data, growing rapidly
–Highly diverse in granularity and variety
–Different terminologies, ID systems, units
–Unstable and unpredictable
–Different interfaces and querying capabilities

19 Biological Data Integration
SRS (Sequence Retrieval System) [Etzold96]
–Keyword-based retrieval system
–Returns simple aggregation of matched records
–Only works for relational databases
BioKleisli [Davidson97]
–Integrated digital library in biomedical domain
–No global schema or ontology
–A mediator works on top of source-specific wrappers
–Horizontal integration

20 Biological Data Integration
DiscoveryLink [Haas01]
–Mediator-based, wrapper-oriented
–Provides virtual DB access from different sources
–Cannot deal with complex source data
–Hard to add new sources
–Requires knowledge of specific query language
TAMBIS (Transparent Access to Multiple Bioinformatics Information Sources) [Stevens00]
–Mediator-based
–Uses global ontology and schema
–Maps source and target concepts manually
–Not robust to changes
–Hard to add new sources
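A minimal sketch of the mediator-and-wrapper arrangement: each wrapper translates one source's field names into a shared vocabulary, and the mediator sends a query to every wrapper and collects the answers. The class names, the hand-written field maps, and the in-memory records are made up for illustration and do not reflect the design of DiscoveryLink or TAMBIS.

```python
class SourceWrapper:
    """Hypothetical wrapper: exposes one source through the mediator's field names."""

    def __init__(self, name, records, field_map):
        self.name = name
        self.records = records
        self.field_map = field_map  # source field name -> mediator field name

    def lookup(self, gene):
        """Return this source's records for a gene, translated to mediator fields."""
        results = []
        for record in self.records:
            translated = {
                self.field_map[key]: value
                for key, value in record.items()
                if key in self.field_map
            }
            if translated.get("gene") == gene:
                translated["source"] = self.name
                results.append(translated)
        return results


class Mediator:
    """Sends one query to every wrapper and concatenates the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, gene):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.lookup(gene))
        return results


# Made-up sources that use different field names for the same concept.
source_a = SourceWrapper("A", [{"symbol": "TP53", "chrom": "17"}],
                         {"symbol": "gene", "chrom": "chromosome"})
source_b = SourceWrapper("B", [{"gene_name": "TP53", "fn": "tumor suppressor"},
                               {"gene_name": "BRCA1", "fn": "DNA repair"}],
                         {"gene_name": "gene", "fn": "function"})

mediator = Mediator([source_a, source_b])
print(mediator.query("TP53"))
```

The field maps are written by hand here, which is the same manual mapping effort that makes new sources hard to add in the systems above.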

21 Bioinformatics
Biological ontology
Bioinformatics data source discovery
Trustworthiness and provenance

22 Bioinformatics
Biological ontology
–GO (Gene Ontology) [Ashburner00]
   Controlled vocabulary
     –Molecular Function (7278 terms)
     –Biological Process (8151 terms)
     –Cellular Component (1379 terms)
   Represents knowledge hierarchically
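A minimal sketch of what a hierarchical representation allows: given each term's parents, all broader terms can be collected by walking upward. The four-term fragment and the `ancestors` helper are a toy illustration; they are not GO's data format or tooling.

```python
# Toy fragment of a term hierarchy: term -> list of direct parent terms.
PARENTS = {
    "mitotic cell cycle": ["cell cycle"],
    "cell cycle": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents):
    """Return every term reachable by following parent links upward from `term`."""
    seen = []
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # a term may be reachable along several paths
            seen.append(current)
            stack.extend(parents.get(current, []))
    return seen

print(ancestors("mitotic cell cycle", PARENTS))
# prints ['cell cycle', 'cellular process', 'biological process']
```

With such a traversal, a record annotated with a specific term can also be retrieved by a query on any of its broader terms.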

23 Bioinformatics
Biological ontology
–LinKBase [Verschelde03]
   Originally a biomedical ontology
     –Over 2,000,000 medical concepts
     –Over 5,300,000 instantiations
     –543 relations
   Expanded using GO
   Only describes simple binary relationships

24 Bioinformatics
Bioinformatics data source discovery
–First step in integrating or answering queries
–Example system [Rocco03]:
   Pre-defined classes with class descriptions
   Tries to map a source to a class
Trustworthiness and provenance
–Trustworthiness:
   Consistency
   Reliability
   Competence
   Honesty
–Provenance:
   Record
   History
   Transformations
   Annotations updates

28 Summary and Future Work
Overcome drawbacks of existing systems
Develop new algorithms for locating and extracting data from heterogeneous biological sources
My research: Semantic Web, Schema Matching, Information Extraction, Bioinformatics