A comparative study for XML change detection (Etude comparative sur la détection de changements en XML)
Grégory Cobéna (INRIA), Talel Abdessalem (ENST), Yassine Hinnach (ENST)

22/10/2002 - BDA'02 - Grégory Cobéna (INRIA)

Context
Consider change control in XML data warehouses.
We want to understand changes.
We have only the old and new versions of the documents.
A diff needs to be computed.

Organization
Motivations
Data Model
Representing Changes
–Version Management and Querying
–Comparison of Change representation models
–Experiments
Detecting Changes
–State of the art in change detection
–Performance analysis and experiments
–Quality analysis and experiments
Summary

Motivations

Motivations: Representing Changes
Version management: the representation should allow for effective storage strategies
Temporal databases: support for persistent identification of nodes is mandatory
Monitoring: information about changes is used to support triggers or detect events
Note: HTML or XHTML documents may be used

Motivations: Detecting Changes
Correctness: the diff program misses no changes
Minimality: a minimal result is important to save storage space and network bandwidth
Semantics: some algorithms take more of the semantics of XML documents into account
Performance: with dynamic services and/or large amounts of data, high speed and low memory usage are mandatory
‘Move’ operations: some algorithms support move operations whereas others don’t. This impacts both the performance of the tool and the quality of the results.

Data Model

Data Model (quick overview)
Operations are:
–(i) insert and delete, applied to leaves or subtrees
–(ii) update of text nodes
–(iii) move, applied to a subtree root, moving the entire subtree
An edit cost is assigned to each operation. Usually, the cost is 1 per node touched.
The semantics of move is to identify subtrees even when their context has changed.
We use the notion of a mapping between the two trees. Each node in document A (or B) that is not deleted (or inserted) is matched to the corresponding node in B (or A).
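The cost rule above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `op_cost` helper and the element names are hypothetical, and charging a flat cost of 1 for a move is an assumption (the slides only say that the usual cost is 1 per node touched).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A minimal ordered-tree node (hypothetical helper, not from the talk)."""
    label: str
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def op_cost(kind, subtree=None):
    """Edit cost of one operation, at 1 per node touched."""
    if kind in ("insert", "delete"):
        # Inserting or deleting a subtree touches every node in it.
        return size(subtree)
    if kind == "update":
        # An update changes the value of a single text node.
        return 1
    if kind == "move":
        # Assumption: a move is charged once, whatever the subtree size.
        return 1
    raise ValueError(f"unknown operation: {kind}")


# Hypothetical product subtree with two children.
product = Node("product", [Node("name"), Node("price")])
print(op_cost("delete", product))  # 3 nodes touched
print(op_cost("move", product))
```

Under this assumption, moving a large subtree is much cheaper than deleting it and inserting it again, which is why support for move changes the cost of the resulting script.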

Data Model: Intuition
Tai’s model: delete ‘b’
Selkow’s model: delete ‘b’
[Figure: the same tree (nodes root, b, c, a, y, x) shown twice, once for each model]
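The difference between the two models can be sketched as follows. In Tai's model, deleting an inner node promotes its children to its parent; in Selkow's model, insertions and deletions apply to leaves only, so removing an inner node removes its whole subtree. The tree shape used here (y and x as children of b) is an assumption, since the figure did not survive extraction, and the function names are hypothetical.

```python
def delete_tai(tree, label):
    """Delete nodes named `label`; their children move up to the parent."""
    name, children = tree
    kept = []
    for child in children:
        if child[0] == label:
            # The deleted node's children take its place, in order.
            kept.extend(delete_tai(grandchild, label) for grandchild in child[1])
        else:
            kept.append(delete_tai(child, label))
    return (name, kept)


def delete_selkow(tree, label):
    """Delete nodes named `label` together with their whole subtree."""
    name, children = tree
    return (name, [delete_selkow(c, label) for c in children if c[0] != label])


# Assumed shape: root has children b, c, a; b has children y, x.
tree = ("root", [("b", [("y", []), ("x", [])]), ("c", []), ("a", [])])

print(delete_tai(tree, "b"))     # y and x become children of root
print(delete_selkow(tree, "b"))  # y and x disappear with b
```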

Representing Changes

Representing Changes
Version Management
–There are several version management strategies. For instance, when only deltas are stored, their size must be kept small
–We also consider the performance of reconstructing a document given the delta and the previous document. It is linear in all cases.
–A simple text-based version management is possible but cannot be used for querying.
Querying Changes
–Labeling nodes with prefix+postfix identifiers improves querying algorithms
–Labeling nodes with persistent identifiers improves temporal databases
–There is no short labeling scheme that is good for both
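One reason prefix+postfix labels help querying is that an ancestor test becomes two integer comparisons. The sketch below assumes a tree of `(name, children)` tuples with unique names; `label_tree` and `is_ancestor` are hypothetical helpers, not part of any of the tools compared here.

```python
def label_tree(tree):
    """Map each node name to its (prefix, postfix) rank in the tree."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        name, children = node
        counters["pre"] += 1          # rank when the node is entered
        pre = counters["pre"]
        for child in children:
            visit(child)
        counters["post"] += 1         # rank when the node is left
        labels[name] = (pre, counters["post"])

    visit(tree)
    return labels


def is_ancestor(a, b):
    """True if the node labeled `a` is a proper ancestor of the node labeled `b`."""
    return a[0] < b[0] and a[1] > b[1]


# Hypothetical catalog with two products.
catalog = ("catalog", [("product1", [("name1", [])]), ("product2", [])])
labels = label_tree(catalog)
print(is_ancestor(labels["catalog"], labels["name1"]))
print(is_ancestor(labels["product2"], labels["name1"]))
```

These ranks shift whenever a node is inserted or deleted, so they are not stable across versions, which is consistent with the remark that no short labeling scheme is good for both querying and temporal identification.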

Our Example
Old version: Notebook 2200MHz Pentium4 $1999; Digital Camera Fuji FinePix 2600Z Not Available
New version: Notebook 2200MHz Pentium4 $1999; Digital Camera Fuji FinePix 2600Z $299

Different representations

Change Models: XUpdate
<xupdate:insert-after select="/catalog[1]/product[2]/description[1]" > $299
<xupdate:remove select="/catalog[1]/product[2]/status[1]" />
Nodes are selected by an XPath expression

Change Models: DeltaXML (Example)
Not Available $399
Mentions some unchanged nodes
The order is important (no ids, no move)
Same look and feel as the document

Change Models: XyDelta (Example)
<xydelta v1_XidMap="(1-30)" v2_XidMap="(1-14;18-23;31-33;24-30)">
Not Available $399
Persistent identifiers
What is the parent node?
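The two XidMap attributes in the example can be read mechanically. The sketch below assumes that each map lists the persistent identifiers of the nodes of one version as `;`-separated ranges; `parse_xid_map` is a hypothetical helper and not part of the XyDelta tooling.

```python
def parse_xid_map(text):
    """Expand a map such as "(1-14;18-23)" into a list of identifiers."""
    xids = []
    for part in text.strip("()").split(";"):
        low, _, high = part.partition("-")
        high = high or low            # a single identifier such as "7"
        xids.extend(range(int(low), int(high) + 1))
    return xids


v1 = parse_xid_map("(1-30)")
v2 = parse_xid_map("(1-14;18-23;31-33;24-30)")

# Identifiers are persistent, so set differences give the changed nodes.
deleted = sorted(set(v1) - set(v2))
inserted = sorted(set(v2) - set(v1))
print(deleted)
print(inserted)
```

Because identifiers never change between versions, the nodes present in only one map are exactly those deleted or inserted by the delta.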

Change Models: Microsoft XDL (Example)
<xd:xmldiff srcDocHash="fd452bab54320191" xmlns:xd="http://schemas.microsoft.com/xmltools/2002/xmldiff">
$299
Updates an element node
Verify consistency
Identify nodes

Summary
Unique advantages of XyDelta
–A formal model and nice mathematical properties
–Persistent identification of nodes (at least as an option)
Still missing for all of them
–A framework for querying
Useful features that some of them lack
–Validation by a DTD (may be a problem for DeltaXML, XyDelta)
–Verify the source document (only XDL)
–Support of ‘move’ operations (only XyDelta and XDL)
–Backward deltas (only XyDelta)
–Monitoring the delta (only XUpdate and DeltaXML)

Storage Experiments
Identifiers save space when there are few updates

Change Models: Conclusion
Change monitoring is easier with DeltaXML and XUpdate
Temporal queries are easier to evaluate with XyDelta (persistent identifiers)
Future work:
–It is not yet clear how to query changes
–Define transaction or synchronization protocols

Detecting Changes

State of the art
Based on the String Edit Problem (1966)
Tree-to-tree correction algorithms:
–find the Minimum Edit Script
–in O(m*n) time and space, where m and n are the sizes of the two documents
Other algorithms
–Run in linear or near-linear time
–Match nodes or subtrees depending on their content
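The O(m*n) bound comes from filling a table with one cell per pair of prefixes. The sketch below is the classic dynamic program for the string edit problem with unit costs, not one of the tree algorithms compared in the talk; it works on any two sequences.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dist[i][j] is the distance between the first i items of a
    # and the first j items of b: (m+1)*(n+1) cells in total.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                # delete all i items
    for j in range(n + 1):
        dist[0][j] = j                # insert all j items
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete a[i-1]
                dist[i][j - 1] + 1,                 # insert b[j-1]
                dist[i - 1][j - 1] + substitution,  # keep or substitute
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance(["b", "c", "a"], ["c", "a"]))  # 1
```

Both time and memory grow with the product of the two sizes, which explains why the faster algorithms give up minimality and match nodes or subtrees by content instead.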

Experiments: Speed of several algorithms

Algorithms: Overview
[Figure: the example documents, From and To]
The cheapest choice would be to move the two elements (cost=2)
But finding the best script with ‘move’ operations is NP-hard
The minimum edit script consists of deleting the two elements and then inserting them (cost=4) (MMDiff)
Preprocessing often consists of mapping identical subtrees. In this case, an additional ‘move’ operation will be needed (cost=5)
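The preprocessing step, mapping identical subtrees, is commonly done by giving each subtree a signature computed from its label and its children's signatures. The sketch below is a generic illustration of that idea and not the implementation of any of the tools discussed; the tuple-based tree, the helper names and the use of SHA-1 are all assumptions.

```python
import hashlib


def index_subtrees(tree, path=(), index=None):
    """Return (signature, index); index maps a signature to subtree paths."""
    if index is None:
        index = {}
    label, children = tree
    digest = hashlib.sha1(label.encode("utf-8"))
    for position, child in enumerate(children):
        child_signature, _ = index_subtrees(child, path + (position,), index)
        # Child order matters: the signature depends on the sequence.
        digest.update(b"/" + child_signature.encode("ascii"))
    signature = digest.hexdigest()
    index.setdefault(signature, []).append(path)
    return signature, index


def match_identical_subtrees(old, new):
    """Pair up subtrees of `old` and `new` that have the same signature."""
    _, old_index = index_subtrees(old)
    _, new_index = index_subtrees(new)
    matches = []
    for signature, old_paths in old_index.items():
        for pair in zip(old_paths, new_index.get(signature, [])):
            matches.append(pair)
    return matches


# Hypothetical documents: the two children of the root swap places.
old = ("catalog", [("product", [("name", [])]), ("status", [])])
new = ("catalog", [("status", []), ("product", [("name", [])])])
for old_path, new_path in match_identical_subtrees(old, new):
    print(old_path, "->", new_path)
```

Each node is hashed once, so the indexing pass is linear in the document size. A matched subtree whose path differs between the two versions is a candidate for a ‘move’ operation.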

Experiments: Quality (measured by the Edit Cost)

Experiments: Speed (focus on DeltaXML)

Comparison summary
Many other algorithms exist that have no advantages
MMDiff is the reference for quality
DeltaXML and XyDiff are good quality/performance compromises, but the performance of XyDiff is more regular
Performance measurements for Microsoft's tool will be available soon; it seems comparable in performance to DeltaXML

Other issues
Constrained Diff is often interesting:
–Using ‘keys’ to match specific nodes (e.g. DeltaXML)
–Using XMLSchema or DTD information
–Time-constrained diff (e.g. XyDiff)
Postprocessing of results?

Summary

What’s next?
Representing Changes:
–Unify and improve existing features
–Support Queries!
–Chain versions?
Change Detection:
–We are currently working on Microsoft’s XML Diff
–Use XMLSchema (or DTD) information
–Mining changes? Use learning?
