
1 Efficient Maintenance of Semistructured Schema
Katsaros Dimitrios
Aristotle University of Thessaloniki, Hellas
2001 Dimitrios Katsaros, Panhellenic Conference on Informatics (ΕΠΥ’8)

2 Introduction (1/3)
Semistructured data
–Sources: HTML, BibTeX, SGML, etc.
–Characteristics: no rigid structure, but some implicit structure, i.e., a “schema”
–Knowledge of the “schema” is crucial for:
  Querying/browsing information sources
  Building indexes/views
  Storage in relational/object-oriented databases
  Query processing

3 Introduction (2/3)
Figure 1: Semistructured “movie” objects. [OEM graph with objects &1, &2, &3 and edge labels Movie, Title, Director, Review, Award, Name, Nationality, Biography]

4 Introduction (3/3)
–Discovering the common “schema”
  Large volume / Irregularity of data
–Solution: Mining the “schema”
  Scalable / Can deal with irregularity
  Association rules proposed by Wang & Liu [6]
–Issue: How to deal with dynamic data?

5 Motivation
Our contributions
–Maintenance of the discovered schema under insertions of new objects
–Schema for the new objects
–Performance evaluation of the method

6 Presentation Outline
Problem definition
Algorithm description
Performance evaluation
Conclusion
References

7 Object Exchange Model
An Object Exchange Model (OEM) object
–Identifier o (i.e., &o)
–Value
  Atomic (integer, float, string)
  Complex
  –List: $\langle l_1{:}\&o_1,\ l_2{:}\&o_2,\ \ldots,\ l_k{:}\&o_k \rangle$
  –Bag: $\{ l_1{:}\&o_1,\ l_2{:}\&o_2,\ \ldots,\ l_k{:}\&o_k \}$
where:
  $l_i$ are labels (“roles”)
  ? denotes the wild card matching any label
  $\bot$ is the nil structure that contains no label
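The OEM object described above can be sketched as a small Python data structure. This is only an illustration: the class name `OEMObject`, its field names, and the sample movie values are assumptions made for the example and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Atomic = Union[int, float, str]

@dataclass
class OEMObject:
    """One OEM object: an identifier plus an atomic or a complex value."""
    oid: str                                  # identifier, written &oid on the slide
    atomic: Union[Atomic, None] = None        # set only for atomic objects
    # complex value: (label, object) pairs, i.e. l_1:&o_1, ..., l_k:&o_k
    children: List[Tuple[str, "OEMObject"]] = field(default_factory=list)
    ordered: bool = True                      # True: list value, False: bag value

    def is_atomic(self) -> bool:
        return not self.children and self.atomic is not None

    def labels(self) -> List[str]:
        """Labels ("roles") of the direct subobjects, in stored order."""
        return [label for label, _ in self.children]

# Hypothetical movie object in the spirit of Figure 1 (values are made up).
title = OEMObject("t1", atomic="Some Title")
name = OEMObject("n1", atomic="Some Name")
director = OEMObject("d1", children=[("Name", name)])
movie = OEMObject("1", children=[("Title", title), ("Director", director)])
```

Keeping the label on the edge rather than on the child mirrors the `l:&o` notation, so the same object can play different roles under different parents.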

8 Tree-Expressions
Definition
1. The nil structure $\bot$ is a tree-expression.
2. Let $te_i$ be tree-expressions of objects $o_i$. If $val(o) = \langle l_1{:}\&o_1,\ l_2{:}\&o_2,\ \ldots,\ l_k{:}\&o_k \rangle$ and $\langle i_1, i_2, \ldots, i_r \rangle$ is a subsequence of $\langle 1, 2, \ldots, k \rangle$, then $\langle l_{i_1}{:}te_{i_1},\ l_{i_2}{:}te_{i_2},\ \ldots,\ l_{i_r}{:}te_{i_r} \rangle$ is a tree-expression of object o.
Representation
A tree-expression $\langle l_{i_1}{:}te_{i_1},\ l_{i_2}{:}te_{i_2},\ \ldots,\ l_{i_r}{:}te_{i_r} \rangle$ consists of r subtrees $te_{i_j}$, each labeled $l_{i_j}$.
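The recursive definition above can be made concrete with a short enumeration sketch. It assumes that an object is given directly by its value (an atomic Python value, or a list of `(label, child_value)` pairs), that the nil structure is a tree-expression of every object, and that a subsequence has at least one element. The names `NIL` and `tree_expressions` are hypothetical.

```python
from itertools import combinations, product

NIL = "nil"  # stands for the nil structure

def tree_expressions(value):
    """Yield every tree-expression of an object given by its value.

    value is either atomic or a list of (label, child_value) pairs.
    """
    yield NIL                                   # rule 1: nil is a tree-expression
    if not isinstance(value, list):
        return                                  # atomic objects have no subtrees
    k = len(value)
    for r in range(1, k + 1):
        for idx in combinations(range(k), r):   # subsequences i_1 < ... < i_r
            # every choice of one tree-expression per selected child (rule 2)
            options = [list(tree_expressions(value[i][1])) for i in idx]
            for choice in product(*options):
                yield tuple((value[i][0], te) for i, te in zip(idx, choice))

# Made-up movie value: a Title leaf and a Director with a Name leaf.
movie = [("Title", "T"), ("Director", [("Name", "N")])]
all_tes = list(tree_expressions(movie))
```

Even for this two-label object the enumeration yields six tree-expressions, which hints at why support counting over tree-expressions is the expensive part of schema mining.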

9 Incremental Schema Mining
Problem definition
Input
1. A collection of transaction objects in an OEM graph, denoted as DB
2. A minimum support threshold MINSUP
3. The frequent tree expressions for DB
4. A number of new objects added into the collection, denoted as db
The incremental schema maintenance problem is to discover all tree expressions which have support in $DB \cup db$ greater than or equal to MINSUP.
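The reason stored counts help in this problem is that the support of a tree expression in the union follows from its counts in DB and in db separately. The sketch below assumes that support is a fraction of the transaction objects, which is how the minsup values in the evaluation read; the function names and the numbers are made up for illustration.

```python
def combined_support(count_DB, size_DB, count_db, size_db):
    """Support in DB ∪ db from the two separate occurrence counts."""
    return (count_DB + count_db) / (size_DB + size_db)

def is_frequent(count_DB, size_DB, count_db, size_db, minsup):
    """True if the support in DB ∪ db is greater than or equal to minsup."""
    return combined_support(count_DB, size_DB, count_db, size_db) >= minsup

# Frequent in DB alone (100 of 1000 = 0.10) but rare in the increment.
before = 100 / 1000
after = combined_support(100, 1000, 5, 200)      # 105 / 1200
still_ok = is_frequent(100, 1000, 5, 200, minsup=0.10)
```

An expression whose count in DB is already stored needs only a scan of the increment db; a full rescan of DB is needed only for expressions whose DB count was never recorded.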

10 DeltaSSD
DeltaSSD utilizes Negative Borders
Definition [Negative Border]
Given a collection $S \subseteq P(R)$ of tree expressions, closed with respect to the “weaker than” relation [6], the negative border $Bd^-(S)$ of $S$ consists of the minimal tree expressions $X \subseteq R$ not in $S$.
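To illustrate the negative border, the sketch below swaps tree expressions for plain itemsets and the “weaker than” relation for set inclusion. That substitution is an assumption made to keep the example short; it is not the slide's setting. The function name `negative_border` is hypothetical.

```python
def negative_border(S, R):
    """Minimal itemsets over R that are not in the subset-closed collection S.

    An itemset X is minimal outside S when X is not in S but every subset
    obtained by dropping one item is in S (the empty set counts as in S).
    """
    S = {frozenset(x) for x in S} | {frozenset()}
    border = set()
    for x in S:
        for item in R:
            if item in x:
                continue
            cand = x | {item}                 # extend a member of S by one item
            if cand in S:
                continue
            if all(cand - {i} in S for i in cand):
                border.add(cand)
    return border

R = {"a", "b", "c"}
nb1 = negative_border([{"a"}, {"b"}, {"a", "b"}], R)
nb2 = negative_border([{"a"}, {"b"}, {"c"}, {"a", "b"}], R)
```

Every minimal itemset outside S is a one-item extension of some member of S, so generating candidates that way loses nothing.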

11 DeltaSSD (notation)
$DB$, $db$, $DB^+$ ($= DB \cup db$): Regular, increment, current database
$L_{DB}$, $L_{db}$, $L_{DB^+}$: Frequent tree expressions of $DB$, $db$, $DB^+$
$N_{DB}$, $N_{db}$, $N_{DB^+}$: Negative border of $DB$, $db$, $DB^+$
$TE_{DB}$: $L_{DB} \cup N_{DB}$
$L$, $N$: $L_{DB^+} \cap (L_{DB} \cup N_{DB})$. Negative border of $L$
SupportOf(set, database): Updates the support count of the tree expressions in set w.r.t. the database
NB(set): Computes the negative border of the set
LargeOf(set, database): Returns the tree expressions in set which have support count above MINSUP in the database

12 DeltaSSD
BEGIN
  SupportOf(TE_DB, db)
  L = LargeOf(TE_DB, DB+)
  Small = TE_DB – L
  If ( L == L_DB )
    RETURN( L_DB, N_DB )
  N = NB( L )
  If ( N ⊆ Small )
    RETURN( L, N )
  N^u = N – Small
  SupportOf(N^u, db)
  C = LargeOf( N^u )
  Small_db = N^u – C
  If ( |C| )
    C = C ∪ L
    repeat
      C = C ∪ NB( C )
      C = C – (Small ∪ Small_db)
    until ( C does not grow )
    C = C – ( L ∪ N^u )
  if ( |C| ) then SupportOf(C, db)
  ScanDB = LargeOf(C ∪ N^u, db)
  N’ = NB(L ∪ ScanDB) – Small
  SupportOf(N’ ∪ ScanDB, DB)
  L_DB+ = L ∪ LargeOf(ScanDB, DB+)
  N_DB+ = NB(L_DB ∪ db)
END
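The pseudocode above can be followed end to end in a small Python sketch. This is not the authors' implementation. It makes these assumptions:

- Tree expressions are replaced by itemsets, and “weaker than” by set inclusion.
- `LargeOf(N^u)` is taken relative to db.
- The candidate-closure loop is read as nested under the `|C|` test.
- The final scan counts the ScanDB candidates in the original DB and tests them against the combined database.
- The support bookkeeping for N’ is left out.

All function names here are hypothetical.

```python
from itertools import combinations

def count(X, database):
    """Number of transactions in `database` that contain itemset X."""
    return sum(1 for t in database if X <= t)

def negative_border(S, R):
    """Minimal itemsets over R not in the subset-closed collection S."""
    S = set(S) | {frozenset()}
    return {x | {i} for x in S for i in R
            if i not in x and (x | {i}) not in S
            and all((x | {i}) - {j} in S for j in x | {i})}

def mine(database, minsup, R):
    """Brute-force baseline: every frequent non-empty subset of R."""
    subsets = (frozenset(c) for r in range(1, len(R) + 1)
               for c in combinations(sorted(R), r))
    return {X for X in subsets if count(X, database) >= minsup * len(database)}

def delta_ssd(DB, db, L_DB, N_DB, count_DB, minsup, R):
    n_db, n_all = len(db), len(DB) + len(db)
    TE = L_DB | N_DB
    cnt_db = {X: count(X, db) for X in TE}                # SupportOf(TE_DB, db)
    L = {X for X in TE if count_DB[X] + cnt_db[X] >= minsup * n_all}
    small = TE - L
    if L == L_DB:                                         # nothing changed
        return L_DB, N_DB
    N = negative_border(L, R)
    if N <= small:                                        # border already known small
        return L, N
    Nu = N - small                                        # border with unknown support
    for X in Nu:
        cnt_db[X] = count(X, db)
    C = {X for X in Nu if cnt_db[X] >= minsup * n_db}     # assumed: large in db
    small_db = Nu - C
    if C:
        C = C | L
        while True:                                       # grow candidates to a fixpoint
            grown = (C | negative_border(C, R)) - (small | small_db)
            if grown == C:
                break
            C = grown
        C = C - (L | Nu)
    for X in C:                                           # one pass over db
        cnt_db[X] = count(X, db)
    scan = {X for X in C | Nu if cnt_db[X] >= minsup * n_db}
    # single scan of the old DB, only for candidates that survived in db
    L_new = L | {X for X in scan
                 if count(X, DB) + cnt_db[X] >= minsup * n_all}
    return L_new, negative_border(L_new, R)

fs = frozenset
R = {"a", "b", "c"}
DB = [fs("ab"), fs("ab"), fs("a"), fs("c")]
db = [fs("ac"), fs("ac"), fs("ac"), fs("abc")]
L_DB = mine(DB, 0.5, R)
N_DB = negative_border(L_DB, R)
count_DB = {X: count(X, DB) for X in L_DB | N_DB}
L_new, N_new = delta_ssd(DB, db, L_DB, N_DB, count_DB, 0.5, R)
```

In this made-up example the increment demotes `b` and `ab` and promotes `c` and `ac`. The old DB is consulted only for `ac`, the one candidate whose DB count was not already stored, and the result agrees with recomputing from scratch on the combined data.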

13 Experimental settings
Generation of synthetic data
One dataset:
–$(L_1, N_1) = (25, 1000)$
–$(L_2, N_2, T_2, I_2, P_2) = (25, 1000, 4, 2, 50)$
–$(N_3, T_3, I_3, P_3) = (3000, 4, 2, 50)$
Relatively small database, 3000 objects.
Short and “bushy” transactions (thus, few database scans).

14 Performance Evaluation
Database scans
          Wang              ZJZT              DeltaSSD
minsup    Scan DB  Scan db  Scan DB  Scan db  Scan DB  Scan db
0.08      3        3        3        3        1        2
0.10      3        3        3        3        1        2
0.12      3        3        3        3        1        2
0.14      3        3        3        3        1        2

15 Performance Evaluation
Operations (CPU time)
Wang, ZJZT, DeltaSSD
minsup 10% 20% 30% 10% 20% 30%
0.08 186027564619811689141621995282572033102756234679212
0.10 825558341810365912411928262529732966426332888325
0.12 362021235920263484362156256069752826808830724951
0.14 10842098877101733113010252620532758348229764508

16 Conclusions
DeltaSSD is very efficient in terms of database scans
DeltaSSD incurs excessive processing in terms of tree matchings
Re-computing the frequent tree-expressions is inefficient
Future work includes:
–Investigation of the complete closure approach
–Techniques to reduce the processing cost of tree matching

17 References
1. Y. Aumann, R. Feldman, O. Lipshtat and H. Mannila, "Borders: An Efficient Algorithm for Association Generation in Dynamic Databases", Journal of Intelligent Information Systems, vol. 12, no. 1, pp. 61-73, 1999.
2. R. Feldman, Y. Aumann, A. Amir and H. Mannila, "Efficient algorithms for discovering frequent sets in incremental databases", Proceedings of the ACM Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'97), 1997.
3. H. Mannila and H. Toivonen, "Levelwise Search and Borders of Theories in Knowledge Discovery", Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 241-258, 1997.
4. V. Pudi and J. Haritsa, "Quantifying the utility of the past in mining large databases", Information Systems, vol. 25, no. 5, pp. 323-343, 2000.
5. S. Thomas, S. Bodagala, K. Alsabti and S. Ranka, "An efficient algorithm for the incremental updation of association rules in large databases", Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'97), pp. 263-266, 1997.
6. K. Wang and H. Liu, "Discovering Structural Association of Semistructured Data", IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 353-371, 2000.
7. A. Zhou, Jinwen, S. Zhou and Z. Tian, "Incremental Mining of Schema for Semistructured Data", Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), pp. 159-168, 1999.

