A Tool for Supporting Integration Across Multiple Flat-File Datasets Xuan Zhang, Gagan Agrawal Ohio State University.


2 A Tool for Supporting Integration Across Multiple Flat-File Datasets Xuan Zhang, Gagan Agrawal Ohio State University

3 Outline Motivation System Overview System Implementation –Languages –Query Execution Experiments

4 Motivation Biological research calls for –Accessing multiple heterogeneous data sources, which lack a common data model and data format –Tracking multiple objects A motivating example: protein sequence analysis

5 An Example Goal: to predict protein function. Workflow: unknown sequence (>unknown sequence. ……MCMFGSSVIECPNPRIWFVWPYEFPLFLLPGGDRMEI……) → NCBI protein-protein BLAST service → list of similar sequences → clustering analysis (ClustalW, RiPE, etc.). Format: BLAST output is fixed column width; clustering input is FASTA. Information: BLAST output gives a partial name; clustering input needs the full name.

6 Current Solution Manual –Copy-and-paste keyword search –Format conversion programs –NCBI link-out Database –Load data (BLAST output, sequence database) –Parse input; Re-format output

7 Our Approach Join request between BLAST output and SWISSPROT (sequence database) Data maintained in flat files Query specification and data description are high-level and declarative Data parsing and query processing happen behind the scenes

8 Advantages Retrieve multiple pieces of information all at once Data easily available Declarative languages only High flexibility Low overhead

9 System Overview [Figure: system architecture. Components: query, query parser, dataset descriptors, descriptor parser, metadata collection (source/target names, schema and layout information, mappings), application analyzer, QUERYINFOR, DataReader, DataWriter, Synchronizer, source data files, target data file.]

10 Outline Motivation System Overview System Implementation –Languages Query Language Metadata Description Language –System Query Analysis Query Execution Experiments

11 Query Language Declarative, SQL-like Projection, selection, cross product, join queries Example:
AUTOWRAP POSTBLAST (target dataset)
FROM BLASTP, SWISSPROT (source datasets)
BY BLASTP.SP_ID = SWISSPROT.ID (join criteria)
WHERE (attribute pairs)
  POSTBLAST.QUERY = BLASTP.QUERY
  POSTBLAST.SP_AC = BLASTP.SP_AC
  POSTBLAST.SP_ID = BLASTP.SP_ID
  POSTBLAST.FULL_DESCR = SWISSPROT.DE
  POSTBLAST.SEQUENCE = SWISSPROT.SQ
  POSTBLAST.SCORE = BLASTP.SCORE
  POSTBLAST.E_VALUE = BLASTP.E_VALUE
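To make the four parts of the example query concrete (target dataset, source datasets, join criteria, attribute pairs), here is a minimal Python sketch that splits the query text into those parts. The clause order and the regular expressions are assumptions inferred from this single example; the slides do not show the tool's actual grammar or parser, and `parse_query` is a hypothetical name.

```python
import re

# Query text transcribed from the slide. The grammar below is inferred from
# this one example and is an assumption, not the tool's real query parser.
QUERY = """
AUTOWRAP POSTBLAST
FROM BLASTP, SWISSPROT
BY BLASTP.SP_ID = SWISSPROT.ID
WHERE POSTBLAST.QUERY = BLASTP.QUERY
      POSTBLAST.SP_AC = BLASTP.SP_AC
      POSTBLAST.SP_ID = BLASTP.SP_ID
      POSTBLAST.FULL_DESCR = SWISSPROT.DE
      POSTBLAST.SEQUENCE = SWISSPROT.SQ
      POSTBLAST.SCORE = BLASTP.SCORE
      POSTBLAST.E_VALUE = BLASTP.E_VALUE
"""

# One pattern for the clause skeleton, one for "DATASET.ATTR = DATASET.ATTR".
CLAUSES = re.compile(
    r"AUTOWRAP\s+(\w+)\s+FROM\s+(.+?)\s+BY\s+(.+?)\s+WHERE\s+(.+)", re.S
)
PAIR = re.compile(r"(\w+\.\w+)\s*=\s*(\w+\.\w+)")


def parse_query(text):
    """Split a query into target, sources, join criteria, and attribute pairs."""
    match = CLAUSES.search(text)
    if match is None:
        raise ValueError("query does not follow AUTOWRAP/FROM/BY/WHERE order")
    target, sources, by_clause, where_clause = match.groups()
    return {
        "target": target,
        "sources": [name.strip() for name in sources.split(",")],
        "join": PAIR.findall(by_clause),
        "mappings": PAIR.findall(where_clause),
    }


info = parse_query(QUERY)
print(info["target"], info["sources"], info["join"])
for target_attr, source_attr in info["mappings"]:
    print(f"{target_attr} <- {source_attr}")
```

Each attribute pair reads as "target attribute is filled from source attribute", which is how the later slides describe the mappings being used.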

12 Metadata Description Language One descriptor for each flat file dataset. –Schema –Layout Re-usable by different queries Can be learned semi-automatically using data mining techniques Example: BLAST output

13 Schema Descriptors Written in XML DTD format Example

14 Layout Descriptors Example:
DATASET "BLASTP" { (dataset name)
  DATATYPE {BLASTP} (schema name)
  DATASPACE LINESIZE = 90 { … } (file layout)
  DATA {data/Blast_htm.txt} (file location)
}

15 Description of File Layout
Layout descriptor:
"BLASTP" VERSION … "Query=" QUERY "\nDatabase:" DB_NAME < "\nsp|" SP_AC "|" SP_ID " " DESCR " " SCORE " " E_VALUE > "\n\nALIGNMENT" DUMMY
Actual data file:
BLASTP 2.2.11 [Jun-05-2005]
Reference: … …
RID: … …
Query= Random 50 residue protein sequence.
Database: Non-redundant SwissProt sequences 175,661 sequences; 64,716,374 total letters
Score E
Sequences producing significant alignments: (Bits) Value
sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosin... 30.0 1.5
sp|P00352|AL1A1_HUMAN Oxygen-independent coproporphyrinogen II... 28.1 5.7
sp|P40530|YIE2_YEAST Oxygen-independent coproporphyrinogen II... 28.1 5.7
ALIGNMENTS
>sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosine-specific …
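The layout descriptor pairs each delimiter with the attribute that follows it. A rough Python sketch of delimiter-driven extraction on a shortened copy of the BLAST output might look like this. It is only an illustration: `parse_hit` and `parse_blast` are hypothetical names, the sample text is abridged, and splitting DESCR from SCORE and E_VALUE by taking the last two whitespace-separated tokens is my assumption (the descriptor's LINESIZE suggests the real tool may rely on fixed-width lines instead).

```python
# Abridged BLAST output based on the slide; the header lines are shortened.
SAMPLE = """BLASTP 2.2.11 [Jun-05-2005]
Query= Random 50 residue protein sequence.
Database: Non-redundant SwissProt sequences
                                                  Score    E
Sequences producing significant alignments:      (Bits)  Value
sp|P11884|AL1A1_SHEEP Modification methylase MwoI (N-4 cytosin...  30.0  1.5
sp|P00352|AL1A1_HUMAN Oxygen-independent coproporphyrinogen II...  28.1  5.7
sp|P40530|YIE2_YEAST Oxygen-independent coproporphyrinogen II...  28.1  5.7

ALIGNMENTS
>sp|P11884|AL1A1_SHEEP Modification methylase MwoI
"""


def parse_hit(line):
    """Parse one 'sp|AC|ID description score e-value' summary line."""
    if not line.startswith("sp|"):
        return None
    _, sp_ac, rest = line.split("|", 2)           # delimiters "sp|" and "|"
    sp_id, rest = rest.split(" ", 1)              # delimiter " " after SP_ID
    descr, score, e_value = rest.rsplit(None, 2)  # assumption: last two tokens
    return {
        "SP_AC": sp_ac,
        "SP_ID": sp_id,
        "DESCR": descr.strip(),
        "SCORE": float(score),
        "E_VALUE": float(e_value),
    }


def parse_blast(text):
    """Extract QUERY and the repeated hit entries from BLAST output text."""
    start = text.index("Query=") + len("Query=")
    end = text.index("\nDatabase:", start)        # next delimiter ends QUERY
    query = text[start:end].strip()
    hits = []
    for line in text[end:].splitlines():
        if line.startswith("ALIGNMENT"):          # delimiter closing the list
            break
        hit = parse_hit(line)
        if hit is not None:
            hits.append(hit)
    return query, hits


query, hits = parse_blast(SAMPLE)
print(query)
for hit in hits:
    print(hit["SP_AC"], hit["SP_ID"], hit["SCORE"], hit["E_VALUE"])
```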

16 Query Analysis [Figure: the system architecture diagram again. Components: query, query parser, dataset descriptors, descriptor parser, metadata collection (source/target names, schema and layout information, mappings), application analyzer, QUERYINFOR, DataReader, DataWriter, Synchronizer, source data files, target data file.]

17 Terminology DLM-VAR node/pair –A pairing of a delimiter and an attribute value –E.g., "Query=" QUERY Reachability –DLM-VAR node r is reachable from node a iff the configuration “ar” is allowed by the layout description Regular vs. Semi-structured Attribute –Regular: fixed number of values per entry –Semi-structured: varying number of values per entry Number vs. Index –Label for a layout node vs. label for a schema node –Each number maps to one index; each index may have one or more numbers
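As an illustration of reachability, the sketch below computes, for each DLM-VAR node, which nodes may directly follow it. The representation is an assumption made for this example: the layout is a flat Python list of attribute names, a nested list stands for the repeated `< … >` group, and the group is assumed to occur one or more times. The slides do not show how the tool builds its reachable look-up table.

```python
# Attribute order taken from the BLASTP layout descriptor; the nested list
# stands for the repeated "< ... >" group (assumed to repeat one or more times).
LAYOUT = [
    "VERSION",
    "QUERY",
    "DB_NAME",
    ["SP_AC", "SP_ID", "DESCR", "SCORE", "E_VALUE"],
    "DUMMY",
]


def first_node(item):
    """Return the first node of a layout item (a name or a repeated group)."""
    return item[0] if isinstance(item, list) else item


def reachable(layout):
    """Map each node to the set of nodes that may directly follow it."""
    succ = {}
    for i, item in enumerate(layout):
        following = first_node(layout[i + 1]) if i + 1 < len(layout) else None
        if isinstance(item, list):
            # Inside the group, each node is followed by the next one.
            for a, r in zip(item, item[1:]):
                succ.setdefault(a, set()).add(r)
            # The last node may loop back to the start of the group...
            succ.setdefault(item[-1], set()).add(item[0])
            # ...or leave the group and continue with the next item.
            if following is not None:
                succ[item[-1]].add(following)
        else:
            succ.setdefault(item, set())
            if following is not None:
                succ[item].add(following)
    return succ


table = reachable(LAYOUT)
for node, successors in table.items():
    print(node, "->", sorted(successors))
```

With this table, a reader that has just started E_VALUE knows the value ends at whichever comes first: the delimiter of SP_AC (another hit) or the delimiter of DUMMY (end of the list).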

18 Application Analyzer 1. Label schema and layout trees 2. Query analysis –Record layout information: Delimiter look-up table –Draw correspondence between schema and layout: Label look-up table –Collect constant values in query: Pseudo-label look-up table –Calculate reachable nodes: Reachable look-up table –Other information: Parameters Output: QUERYINFOR

19 QUERY-PROC Structure Three general action modules –DataReader –DataWriter –Synchronizer One query-specific data module –QUERYINFOR [Figure: QUERYINFOR, DataWriter, DataReader, Synchronizer, Source 1, Source 2, Target]

20 QUERY-PROC Structure (cont.) One value buffer –Configuration varies from query to query –Accessible to the three general modules [Figure: value buffer with slots QUERY, SP_ID, SCORE, E_VALUE, SP_AC for BLASTP; labels: Source 1, Source 2, Regular, Semi-structured]

21 QUERY-PROC Action DataReader –Extract attribute value Start: Delimiter look-up table End: Reachable look-up table –Fill value buffer: Label look-up table DataWriter –Retrieve from value buffer: Label look-up table –Write target file: Delimiter look-up table Truncate or wrap: Reachable look-up table + label look-up table

22 QUERY-PROC Action (cont.) Synchronizer –Set up pseudo-attributes: Pseudo-label look-up table –Call DataReader on source 1 and 2, call DataWriter on target: Parameters –Test join conditions: Parameters –Clean value buffer: Parameters

23 Outline Motivation System Overview System Implementation –Languages –System Experiments

24 Post-BLAST Query Enhance BLAST output Join query between BLAST output and SWISSPROT Results in FASTA format 2 modes –UNIQUE: halt once a match is found in source 2 –ALL: search all source 2 entries
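The difference between the two modes can be sketched as a nested-loop join over in-memory records. This is only an illustration under stated assumptions: the actual tool streams entries from flat files through DataReader, Synchronizer, and DataWriter, whereas `join` below is a hypothetical helper working on Python dictionaries, and the sequence strings are placeholders rather than real data.

```python
def join(source1, source2, key1, key2, mode="ALL"):
    """Nested-loop join; UNIQUE stops scanning source 2 at the first match."""
    if mode not in ("UNIQUE", "ALL"):
        raise ValueError("mode must be 'UNIQUE' or 'ALL'")
    results = []
    for left in source1:
        for right in source2:
            if left[key1] == right[key2]:
                results.append({**left, **right})
                if mode == "UNIQUE":
                    break  # halt once a match is found in source 2
    return results


# Placeholder records: IDs follow the slides, SQ values are made up.
blast_hits = [
    {"SP_ID": "AL1A1_SHEEP", "SCORE": 30.0},
    {"SP_ID": "AL1A1_HUMAN", "SCORE": 28.1},
]
swissprot = [
    {"ID": "AL1A1_HUMAN", "SQ": "PLACEHOLDER_SEQ_1"},
    {"ID": "AL1A1_SHEEP", "SQ": "PLACEHOLDER_SEQ_2"},
    {"ID": "AL1A1_SHEEP", "SQ": "PLACEHOLDER_SEQ_3"},  # duplicate key on purpose
]

unique_rows = join(blast_hits, swissprot, "SP_ID", "ID", mode="UNIQUE")
all_rows = join(blast_hits, swissprot, "SP_ID", "ID", mode="ALL")
print(len(unique_rows), len(all_rows))
```

UNIQUE saves the rest of the scan over source 2 for every matched entry, which matters when source 2 is a large sequence database and keys are known to be unique.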

25 Chip-supplement Query Look up gene information for microarray genes Join query between protein array and yeast genome database Results in tabular form 2 queries –Chip-Supplement: array join genome –Chip-Supplement-Sorted: genome join array

26 OMIM-plus Query Add reverse links from proteins to the disease database Join query between OMIM database and SWISSPROT database Results in OMIM form 86.38 seconds/entry * 12,158 OMIM entries = 291.7 hours
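The total running time on this slide can be checked with a few lines of Python, using only the two figures given (seconds per entry and number of OMIM entries); the variable names are mine.

```python
# Figures from the slide.
seconds_per_entry = 86.38
omim_entries = 12_158

total_seconds = seconds_per_entry * omim_entries
total_hours = total_seconds / 3600   # 3600 seconds per hour
total_days = total_hours / 24

print(f"{total_hours:.1f} hours, about {total_days:.1f} days")
```

So the full OMIM-plus query takes roughly twelve days of wall-clock time at the measured per-entry cost.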

27 Summary A data integration tool Answers queries on flat-file datasets Lightweight –Modest programming effort –No DBMS –Various flat-file formats supported

