Diego Milano, Monica Scannapieco and Tiziana Catarci Università di Roma “La Sapienza” Dipartimento di Informatica e Sistemistica

Structure-Aware XML Object Identification
{milano,monscan,catarci}@dis.uniroma1.it

Context
Object Identification problem: identifying different data instances that refer to the same real-world entity.
Complex:
  no shared identifiers
  errors (e.g. misspellings)
Well-studied for relational data, but still an open problem for semistructured data.
Needed for Semantic Data Integration.

Outline
Issues in XML Object Identification
Drawbacks of existing tree similarity measures
A structure-aware distance for XML data
Experimental Evaluation

XML Object Identification
Relational data:
  Flat and fixed structure
  Tuples compared pairwise, field by field
  String similarity functions often used to compare fields
XML data:
  Tree-like and flexible structure
  Optional data
  Unbounded-length lists
  Structural correspondence more difficult

Contribution
We propose a new distance for XML data, the structure-aware XML distance:
  Structure-aware: data comparison is driven by the tree structure
  Tailored to XML Object Identification: avoids the issues that arise when existing tree similarity measures are used for Object Identification

Tree-edit distance
Measures the cost of making a tree isomorphic to another one by node insertions, deletions and relabellings.
  A cost is defined for each operation
  Distance = cost of a minimal-cost sequence of operations
Works well when only the tree structure is important and labels carry no semantics.
In XML, data is present on the leaves as text, and the structure partially describes its meaning.

Issues
The XML model is ordered, but Object Identification requires unordered comparisons:
  Schema languages constrain only structural order, not data order.
But computing the tree-edit distance is NP-complete for unordered trees.
Note: other edit-based distances, such as the Alignment Distance, are also NP-complete in the unordered case.

Examples
[Figure: example XML trees rooted at movie and dog elements, with child elements title, year, director, actor, actress, awards, owner and name, and leaf values including “Lassie”, “1994”, “T. Guiry”, “D. Petrie”, “H. Slater” and “Oscar”.]
Tree-edit distance compares topology, not data:
  differences in optional elements influence identification
  it is difficult to define a cost model that preserves the semantics of labels

Structure-Aware XML Distance
Preserves element-label semantics
Ignores differences due to optional data
Polynomial even for unordered trees

Overlays
An overlay $O$ of two trees $T_1$ and $T_2$ is a subset of $T_1 \times T_2$ such that, for nodes $v_i, v_i'$ and inner nodes $n_i$ in $T_i$:
1. one-to-one: if $(v_1, v_2), (v_1', v_2') \in O$ then $v_1 = v_1'$ iff $v_2 = v_2'$
2. same-path: if $(v_1, v_2) \in O$ then $\mathrm{path}(v_1) = \mathrm{path}(v_2)$
3. to-leaves: $(n_1, n_2) \in O$ iff $\exists\, (v_1, v_2) \in O$ s.t. $n_1 = \mathrm{parent}(v_1) \wedge n_2 = \mathrm{parent}(v_2)$
We are interested in maximal overlays: $\nexists\, O'$ s.t. $O \subset O'$
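The three overlay conditions can be checked mechanically. The Python sketch below is only an illustration under assumptions that are not in the slides: each tree is encoded as a hypothetical dictionary mapping a node id to its parent id and label, and the names `path`, `inner_nodes` and `is_overlay` are invented here. It does not test maximality.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical encoding: node id -> (parent id, or None for the root, label).
Tree = Dict[str, Tuple[Optional[str], str]]


def path(tree: Tree, node: str) -> Tuple[str, ...]:
    """Sequence of labels from the root down to `node`."""
    labels = []
    current: Optional[str] = node
    while current is not None:
        parent, label = tree[current]
        labels.append(label)
        current = parent
    return tuple(reversed(labels))


def inner_nodes(tree: Tree) -> Set[str]:
    """Nodes that are the parent of at least one other node."""
    return {parent for parent, _ in tree.values() if parent is not None}


def is_overlay(t1: Tree, t2: Tree, pairs: Set[Tuple[str, str]]) -> bool:
    lefts = [a for a, _ in pairs]
    rights = [b for _, b in pairs]
    # 1. one-to-one: no node takes part in two different pairs.
    one_to_one = len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)
    # 2. same-path: matched nodes have the same label path from the root.
    same_path = all(path(t1, a) == path(t2, b) for a, b in pairs)
    # 3. to-leaves: the matched inner pairs are exactly the parents of matched pairs.
    inner1, inner2 = inner_nodes(t1), inner_nodes(t2)
    matched_inner = {(a, b) for a, b in pairs if a in inner1 and b in inner2}
    parents_of_matches = {
        (t1[a][0], t2[b][0])
        for a, b in pairs
        if t1[a][0] is not None and t2[b][0] is not None
    }
    return one_to_one and same_path and matched_inner == parents_of_matches


t1: Tree = {"a": (None, "A"), "c": ("a", "C"), "f1": ("c", "F"), "f2": ("c", "F")}
t2: Tree = {"a": (None, "A"), "c": ("a", "C"), "f1": ("c", "F"), "k": ("a", "K")}

print(is_overlay(t1, t2, {("a", "a"), ("c", "c"), ("f1", "f1")}))  # True
print(is_overlay(t1, t2, {("f1", "f1")}))  # False: the parents are not matched
```

In this encoding a leaf matched without its ancestors, or an inner pair matched without any matched children, both violate the to-leaves condition.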

Example
[Figure: an overlay of two example trees $T_1$ and $T_2$, with inner nodes labelled A, C, D, F, G, H and K and leaf values including “jon”, “john”, “mary”, “lise”, “lisa”, “karl” and “tom”.]

Distance
The cost of matching two nodes is zero if they are inner nodes, and equal to a string-similarity measure on their textual values if they are leaves.
  Any string similarity measure can be used. We use the string-edit distance.
Cost of an overlay $O$ = sum of the costs of all matches in $O$.
An optimal overlay has minimal cost among all possible overlays.
The structure-aware XML distance of two XML trees $T_1$ and $T_2$ is the cost of any optimal overlay of $T_1$ and $T_2$.
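As a rough illustration of the cost definition, the Python sketch below computes a unit-cost string-edit (Levenshtein) distance and sums it over the matches of a given overlay. The representation of an overlay as a list of text pairs, with `None` standing for inner nodes, and the names `sdist` and `overlay_cost` are assumptions made for this example only.

```python
from typing import List, Optional, Tuple


def sdist(a: str, b: str) -> int:
    """String-edit distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


# Hypothetical overlay encoding: one (text1, text2) pair per match;
# inner-node matches carry no text and are written as (None, None).
Match = Tuple[Optional[str], Optional[str]]


def overlay_cost(matches: List[Match]) -> int:
    """Sum of match costs: zero for inner nodes, sdist for leaves."""
    total = 0
    for text1, text2 in matches:
        if text1 is not None and text2 is not None:
            total += sdist(text1, text2)
    return total


overlay = [(None, None), (None, None), ("john", "jan"), ("lisa", "lisa")]
print(overlay_cost(overlay))  # 2
```

Inner-node matches contribute nothing, so only the leaf text pairs determine the cost of an overlay.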

Example
[Figure: two trees $T_1$ and $T_2$ with inner nodes labelled A, C, D, F and K and leaf values “john”, “mary”, “lisa”, “jan” and “Karl”.]
Distance calculated as:
  sdist(“john”, “jan”) + sdist(“lisa”, “lisa”) = 2
  sdist(“john”, “jona”) + sdist(“karl”, “karl”) + sdist(“mary”, “tom”) = 10

Computing Overlays
Best Assignment
[Figure: the same two trees $T_1$ and $T_2$ as in the previous example, with leaf values “john”, “mary”, “lisa”, “jan” and “Karl”.]

Complexity
Computes overlays bottom-up.
For each pair of nodes, solves a minimum-weight bipartite matching problem using a variant of the Munkres algorithm.
The cost is bounded by $O(|T_1| \times |T_2| \times (\mathrm{deg}_1 + \mathrm{deg}_2)^3)$.
Since only nodes with the same label are matched, average performance is better.
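The bottom-up computation can be sketched in Python as follows. This is not the authors' implementation: the `Node` class and all function names are hypothetical, a brute-force search over assignments stands in for the Munkres variant (so it is exponential and suitable only for tiny trees), and it assumes that, per label, the preferred assignment matches as many children as possible and then minimises cost.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import List, Optional


@dataclass
class Node:
    label: str
    text: Optional[str] = None                      # set on leaves only
    children: List["Node"] = field(default_factory=list)


def sdist(a: str, b: str) -> int:
    """Unit-cost string-edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def best_assignment(costs):
    """Brute-force stand-in for Munkres. costs[i][j] is a cost, or None if
    the pair cannot be matched; rows <= columns. Returns (matched, total)."""
    rows, cols = len(costs), len(costs[0])
    candidates = []
    for perm in permutations(range(cols), rows):
        picked = [costs[i][j] for i, j in enumerate(perm) if costs[i][j] is not None]
        candidates.append((-len(picked), sum(picked)))  # most matches, then cheapest
    matched, total = min(candidates)
    return -matched, total


def distance(n1: Node, n2: Node) -> Optional[int]:
    """Cost of the best overlay matching n1 with n2, or None if no overlay does."""
    if n1.label != n2.label:
        return None
    leaf1, leaf2 = not n1.children, not n2.children
    if leaf1 or leaf2:
        return sdist(n1.text or "", n2.text or "") if leaf1 and leaf2 else None
    matched, total = 0, 0
    # Children are grouped by label, so sibling order is ignored.
    for label in {c.label for c in n1.children} & {c.label for c in n2.children}:
        left = [c for c in n1.children if c.label == label]
        right = [c for c in n2.children if c.label == label]
        if len(left) > len(right):
            left, right = right, left
        m, t = best_assignment([[distance(a, b) for b in right] for a in left])
        matched, total = matched + m, total + t
    return total if matched else None  # inner nodes need a matched child


t1 = Node("A", children=[Node("C", children=[
    Node("F", text="john"), Node("F", text="mary"), Node("F", text="lisa")])])
t2 = Node("A", children=[
    Node("C", children=[Node("F", text="jan"), Node("F", text="lisa")]),
    Node("K", text="Karl")])

print(distance(t1, t2))  # 2
```

Here the optional K element and the unmatched “mary” leaf add no cost, and reordering siblings leaves the result unchanged. A polynomial assignment solver would take the place of `best_assignment` to stay within the stated bound.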

Evaluation (cont’d)

Conclusions
XML Object Identification requires comparing tree-like, flexibly structured data.
We have proposed a structure-aware distance for XML data.
More satisfactory than existing tree similarity measures:
  Respects XML structure
  Efficient even on unordered data

Ongoing/Future Work
Extension to XML data with structural differences (almost done!)
More efficient algorithm(s)?
