
Jianxin Li, Chengfei Liu, Rui Zhou Swinburne University of Technology, Australia Wei Wang University of New South Wales, Australia Top-k Keyword Search.


1 Top-k Keyword Search over Probabilistic XML Data
Jianxin Li, Chengfei Liu, Rui Zhou (Swinburne University of Technology, Australia)
Wei Wang (University of New South Wales, Australia)
ICDE 2011: The 27th IEEE International Conference on Data Engineering, Hannover, Germany, 11-16 April 2011

2 Outline
- Introduction
- Problem and Challenge
- Our solution
- Experiments
- Conclusions

3 Outline
- Introduction
  - Keyword search on deterministic XML
  - Probabilistic XML
- Problem and Challenge
- Our solution
- Experiments
- Conclusions

4 Keyword search on deterministic XML
Why keyword search on XML (or structured data)?
- It is the most popular way of searching for information
- There is no need to learn complex structured query languages
- There is no need to know the underlying schema or content, which is often difficult to learn
LCA (Lowest Common Ancestor) based approaches
- SLCA (Smallest LCA) [Xu and Papakonstantinou SIGMOD05]
- ELCA (Exclusive LCA) [XRank - Guo et al. SIGMOD03, Xu and Papakonstantinou EDBT08, Zhou et al. EDBT10]
- Other LCA variants, some of which impose conditions on the LCA nodes or refine the returned fragments, e.g., XSEarch (interconnection) [Cohen et al. VLDB03], MLCA [Y Li et al. VLDB2004], XSeek [Liu and Chen SIGMOD07], CVLCA [G Li et al. CIKM07], etc.
[Figure: example XML tree with root r and nodes x1 to x4, a1 to a3, b1 to b4]

5 Probabilistic XML and models
- Uncertain data arises in many settings, e.g., information extraction, NLP, data cleaning, data integration
- Much raw data comes from the web, so it is natural to use XML
- Simple dependencies in the data are easily captured by parent-child relationships
A popular model
- PrXML{ind, mux} (first proposed as ProTDB [Nierman and Jagadish VLDB02])
- ind: independent; mux: mutually exclusive
Other probabilistic XML models
- PrXML^C, where C is a subset of {ind, mux, det, exp, cie} [Abiteboul et al. VLDB Journal 09]
- det: deterministic; exp: explicit; cie: conjunction of independent events

6 A p-document
[Figure: example p-document rooted at r, with distributional nodes (IND, MUX), ordinary nodes B1 to B5, C1 to C6, D1, D2, E1, E2, and edge probabilities 0.15, 0.25, 0.3, 0.6, 0.5, 0.1, 0.3, 0.7, 0.9, 0.5]

7 Possible world semantics
[Figure: the subtree rooted at C1. C1 has a MUX child with branches D1 (0.5), E1 (0.3) and an IND node (0.1); the IND node has children D2 (0.7) and E2 (0.9).]
Possible worlds of this subtree and their probabilities:
- C1 with D1: 0.5
- C1 with E1: 0.3
- C1 with D2: 0.007 = 0.1*0.7*(1-0.9)
- C1 with D2 and E2: 0.063 = 0.1*0.7*0.9
- C1 with E2: 0.027 = 0.1*(1-0.7)*0.9
- C1 alone (IND branch chosen, neither child kept): 0.003 = 0.1*(1-0.7)*(1-0.9)
- C1 alone (no MUX branch chosen): 0.1
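The enumeration on slide 7 can be reproduced with a short Python sketch. The tuple-based tree encoding and the names `worlds` and `cross` are assumptions made for illustration and do not come from the slides; the branch probabilities are the ones read from the figure. The slides stress that a real algorithm should avoid this enumeration, so the sketch only serves to check the arithmetic.

```python
from itertools import product

# Hypothetical encoding (not from the slides):
#   ("ord", label, [children])      ordinary node
#   ("mux", [(prob, child), ...])   at most one branch is kept
#   ("ind", [(prob, child), ...])   each branch is kept independently
C1 = ("ord", "C1", [
    ("mux", [
        (0.5, ("ord", "D1", [])),
        (0.3, ("ord", "E1", [])),
        (0.1, ("ind", [(0.7, ("ord", "D2", [])),
                       (0.9, ("ord", "E2", []))])),
    ]),
])

def cross(a, b):
    """Combine two independent world distributions by taking label unions."""
    out = {}
    for (la, pa), (lb, pb) in product(a.items(), b.items()):
        out[la | lb] = out.get(la | lb, 0.0) + pa * pb
    return out

def worlds(node):
    """Return {frozenset of ordinary labels: probability} for a subtree."""
    kind = node[0]
    if kind == "ord":
        result = {frozenset([node[1]]): 1.0}
        for child in node[2]:
            result = cross(result, worlds(child))
        return result
    if kind == "mux":
        branches = node[1]
        # Leftover mass: no branch is chosen at all.
        result = {frozenset(): 1.0 - sum(p for p, _ in branches)}
        for p, child in branches:
            for labels, q in worlds(child).items():
                result[labels] = result.get(labels, 0.0) + p * q
        return result
    # kind == "ind": every branch is an independent keep/drop decision.
    result = {frozenset(): 1.0}
    for p, child in node[1]:
        option = {frozenset(): 1.0 - p}
        for labels, q in worlds(child).items():
            option[labels] = option.get(labels, 0.0) + p * q
        result = cross(result, option)
    return result

for labels, prob in sorted(worlds(C1).items(), key=lambda kv: -kv[1]):
    print(sorted(labels), round(prob, 3))
```

Because a world is represented here as a set of labels, the two "C1 alone" cases from the slide (0.003 and 0.1) are merged into a single entry with probability 0.103.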

8 Finding information in a p-document
An important issue: how to query a p-document
- Twig queries: [Kimelfeld et al. VLDB Journal 09], [Chang et al. EDBT 09]
- Keyword queries (search): an open problem, and the focus of this work

9 Outline
- Introduction
- Problem and Challenges
  - Keyword Search on Probabilistic XML
- Our solution
- Experiments
- Conclusions

10 Keyword search on probabilistic XML
The setup:
- A p-document encodes a large number of possible worlds, so we should avoid generating all possible worlds from the p-document.
Some questions:
- First, what are the semantics of keyword search on a p-document?
- Second, can we use any traditional method?
- Third, if not, what should we do?

11 Semantics of keyword search on a p-document
We model the results of a keyword query on a p-document T as a set of 2-tuples (v, f):
- v is an ordinary node in T
- f is the probability (confidence) that v is an SLCA, taken over all possible worlds
The results are defined on possible worlds, but we avoid generating possible worlds when computing them.

12 Can we use any traditional method?
Naively finding traditional SLCAs on a p-document causes trouble:
- Distributional nodes are not answers;
- MUX semantics does not allow two branches to coexist;
- An SLCA's parent may also be an SLCA in some possible worlds;
- Many nodes can be SLCAs, so we may need the answers with the top-k probabilities (ranking the 2-tuples (v, f) by the confidence field f).

13 Outline
- Introduction
- Problem and Challenge
- Our solution
  - How to compute SLCA probability of a node
  - Two top-k search algorithms
- Experiments
- Conclusions

14 Computing SLCA Probability of a Node
Idea:
- IND nodes and MUX nodes: no need to compute;
- Ordinary nodes: use the keyword distribution probabilities of the child nodes (these are introduced first).

15 Keyword distribution probabilities
Keyword distribution probabilities (local probabilities)
- For each node v in a p-document, we keep a table recording the probabilities of the keyword distribution under v.
For leaf nodes
- One field is 1, the others are 0
For internal nodes
- Computed in a bottom-up way
General table: {} = p0, {k1} = p1, {k2} = p2, {k1,k2} = p3
Example leaf table: {} = 0, {k1} = 1, {k2} = 0, {k1,k2} = 0

16 Ordinary node
p is an ordinary node with children c1 and c2.
c1: {} = 0.1, {k1} = 0.3, {k2} = 0.5, {k1,k2} = 0.1
c2: {} = 0.2, {k1} = 0.4, {k2} = 0.3, {k1,k2} = 0.1
p: {} = p0, {k1} = p1, {k2} = p2, {k1,k2} = p3
p0 = 0.1 * 0.2 = 0.02

17 Ordinary node (same child tables as slide 16)
p1 = 0.1 * 0.4 + 0.3 * 0.2 + 0.3 * 0.4 = 0.22

18 Ordinary node (same child tables as slide 16)
p2 = 0.1 * 0.3 + 0.5 * 0.2 + 0.5 * 0.3 = 0.28

19 Ordinary node (same child tables as slide 16)
p3 = 1 - p0 - p1 - p2 = 0.48
or, computed directly, p3 = … = 0.48
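The arithmetic on slides 16 to 19 can be checked with a small Python sketch. The bitmask indexing of the table and the function name `combine_ordinary` are illustrative assumptions, not the authors' code. The sketch also assumes that both children exist, that their keyword distributions are independent, and that p itself contains no keyword.

```python
# Tables are lists indexed by a bitmask over the keywords:
# 0 = {}, 1 = {k1}, 2 = {k2}, 3 = {k1,k2}.
def combine_ordinary(a, b):
    """Distribution of the union of the keyword sets found under two children."""
    out = [0.0] * len(a)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i | j] += pa * pb   # union of keyword sets = bitwise OR
    return out

c1 = [0.1, 0.3, 0.5, 0.1]
c2 = [0.2, 0.4, 0.3, 0.1]
p = combine_ordinary(c1, c2)
print([round(x, 2) for x in p])  # [0.02, 0.22, 0.28, 0.48]
```

The last entry is obtained by summing every pair of fields whose union is {k1,k2}, which matches the shortcut p3 = 1 - p0 - p1 - p2 on slide 19.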

20 IND node (similar to ordinary node)
p is an IND node; children c1 and c2 are attached with edge probabilities λ1 and λ2.
c1: {} = 0.1, {k1} = 0.3, {k2} = 0.5, {k1,k2} = 0.1
c2: {} = 0.2, {k1} = 0.4, {k2} = 0.3, {k1,k2} = 0.1
Fold each edge probability into the child's table, so that both edges can then be treated as having probability 1:
c1: {} = 0.1*λ1 + 1-λ1, {k1} = 0.3*λ1, {k2} = 0.5*λ1, {k1,k2} = 0.1*λ1
c2: {} = 0.2*λ2 + 1-λ2, {k1} = 0.4*λ2, {k2} = 0.3*λ2, {k1,k2} = 0.1*λ2
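A hedged Python sketch of the IND case follows. The helper names are assumptions made for illustration, and the values λ1 = 0.5 and λ2 = 0.8 are hypothetical, since the slide leaves λ1 and λ2 symbolic. Children are absorbed one at a time, in the spirit of the progressive computation on slide 23, and the table layout assumes exactly two keywords.

```python
def absorb_ind_edge(table, lam):
    """Fold an IND edge probability into a child table: the child is absent
    with probability 1 - lam, which adds to the empty keyword set."""
    out = [lam * x for x in table]
    out[0] += 1.0 - lam
    return out

def combine(a, b):
    """Union-combine two independent tables (index = keyword bitmask)."""
    out = [0.0] * len(a)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i | j] += pa * pb
    return out

def ind_node(children):
    """children: list of (lam, table) pairs for an IND node."""
    result = [1.0, 0.0, 0.0, 0.0]   # no child absorbed yet: only {} is possible
    for lam, table in children:
        result = combine(result, absorb_ind_edge(table, lam))
    return result

c1 = [0.1, 0.3, 0.5, 0.1]
c2 = [0.2, 0.4, 0.3, 0.1]
# Hypothetical edge probabilities, chosen only for this example.
print(ind_node([(0.5, c1), (0.8, c2)]))
```

With λ1 = λ2 = 1 the function reduces to the ordinary-node computation of slides 16 to 19.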

21 MUX node
p is a MUX node; children c1 and c2 are attached with edge probabilities λ1 and λ2.
c1: {} = 0.1, {k1} = 0.3, {k2} = 0.5, {k1,k2} = 0.1
c2: {} = 0.2, {k1} = 0.4, {k2} = 0.3, {k1,k2} = 0.1
Scale each child's table by its edge probability, so that both edges can then be treated as having probability 1:
c1: {} = 0.1*λ1, {k1} = 0.3*λ1, {k2} = 0.5*λ1, {k1,k2} = 0.1*λ1
c2: {} = 0.2*λ2, {k1} = 0.4*λ2, {k2} = 0.3*λ2, {k1,k2} = 0.1*λ2

22 MUX node (continued)
Scaled child tables:
c1: {} = 0.1*λ1, {k1} = 0.3*λ1, {k2} = 0.5*λ1, {k1,k2} = 0.1*λ1
c2: {} = 0.2*λ2, {k1} = 0.4*λ2, {k2} = 0.3*λ2, {k1,k2} = 0.1*λ2
Table of p:
p0 = 0.1*λ1 + 0.2*λ2 + 1 - λ1 - λ2
p1 = 0.3*λ1 + 0.4*λ2
p2 = 0.5*λ1 + 0.3*λ2
p3 = 0.1*λ1 + 0.1*λ2
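The MUX formulas on slide 22 amount to a weighted sum of the child tables, with the leftover mass 1 - λ1 - λ2 assigned to the empty keyword set. A minimal Python sketch, in which the function name `mux_node` and the values λ1 = 0.3 and λ2 = 0.6 are hypothetical:

```python
def mux_node(children):
    """children: list of (lam, table) pairs; the lam values sum to at most 1.
    At most one child exists, so tables are mixed rather than union-combined."""
    size = len(children[0][1])
    out = [0.0] * size
    for lam, table in children:
        for i, x in enumerate(table):
            out[i] += lam * x
    # No branch chosen: no keyword is contributed.
    out[0] += 1.0 - sum(lam for lam, _ in children)
    return out

c1 = [0.1, 0.3, 0.5, 0.1]
c2 = [0.2, 0.4, 0.3, 0.1]
# Hypothetical edge probabilities, chosen only for this example.
p = mux_node([(0.3, c1), (0.6, c2)])
print([round(x, 2) for x in p])  # [0.25, 0.33, 0.33, 0.09]
```

Unlike the IND case, no union of keyword sets is taken, because two MUX branches never coexist.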

23 Progressive computing with multiple children
[Figure: node p (IND, MUX or ordinary) with children c1, c2, c3 and edge probabilities λ1, λ2, λ3. The children are merged into p's table one after another, with absorbed edges shown as probability 1; the table after some of the children is an intermediate result, and the table after all of them is the final result.]

24 Computing SLCA Probability of a Node
Idea: use the keyword distribution probabilities of the children.
p is an ordinary node with children c1 and c2. Assume we have already obtained these tables:
c1: {} = 0.1, {k1} = 0.3, {k2} = 0.5, {k1,k2} = 0
c2: {} = 0.2, {k1} = 0.4, {k2} = 0.3, {k1,k2} = 0
P_slca(p) = 0.3*0.3 + 0.5*0.4 = 0.29
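The value on slide 24 can be reproduced with the following Python sketch. The function name `slca_probability` and the bitmask indexing are illustrative assumptions. The sketch covers only the two-child case shown on the slide, and assumes that p exists and contains no keyword itself; it does not model the extra field mentioned on slide 25.

```python
def slca_probability(a, b):
    """Probability that two children jointly cover all keywords while
    neither child covers all of them on its own (index = keyword bitmask)."""
    full = len(a) - 1            # bitmask with every keyword present
    total = 0.0
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            if i != full and j != full and (i | j) == full:
                total += pa * pb
    return total

c1 = [0.1, 0.3, 0.5, 0.0]
c2 = [0.2, 0.4, 0.3, 0.0]
print(round(slca_probability(c1, c2), 2))  # 0.29
```

Only the pairs ({k1}, {k2}) and ({k2}, {k1}) contribute, which gives the two products 0.3*0.3 and 0.5*0.4 from the slide.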

25 Computing SLCA Probability Progressively
[Figure: node p with children c1, c2, c3, shown before and after the children are merged into p]
- Intermediate result: the same as the keyword distribution probabilities, plus an extra field
- Final result: the SLCA probability of node p may be non-zero

26 Algorithms
Integrate the SLCA probability computation into two algorithms:
- First algorithm: PrStack, stack-based, scans all keyword inverted lists once
- Second algorithm: EagerTopK, applies some important pruning strategies

27 PrStack algorithm
- Scans all keyword inverted lists once, in document order
- Uses extended Dewey codes
- Computes a node's SLCA probability when the node is popped from the stack
- A node's SLCA probability is therefore computed after all its children have been processed
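The stack discipline described on slide 27 can be illustrated with a generic Python sketch. This is not the PrStack algorithm itself: it uses plain Dewey codes (tuples of integers) rather than the extended Dewey codes of the slides, computes no probabilities, and the function name `pop_order` is hypothetical. It only shows why a node is popped after all of its descendants when codes arrive in document order.

```python
def pop_order(dewey_codes):
    """Return node paths in the order a stack-based scan would pop them.
    dewey_codes: tuples of ints, already sorted in document order."""
    stack = []      # current root-to-node path, one Dewey component per entry
    popped = []
    for code in dewey_codes:
        # Length of the common prefix between the stack and the new code.
        common = 0
        while (common < len(stack) and common < len(code)
               and stack[common] == code[common]):
            common += 1
        # Everything below the common prefix is complete: pop it.
        while len(stack) > common:
            popped.append(tuple(stack))
            stack.pop()
        stack.extend(code[common:])
    while stack:    # flush the remaining path up to the root
        popped.append(tuple(stack))
        stack.pop()
    return popped

print(pop_order([(1, 1, 1), (1, 1, 2), (1, 2)]))
# [(1, 1, 1), (1, 1, 2), (1, 1), (1, 2), (1,)]
```

In an algorithm of this style, the point where a path is appended to `popped` is where a node's table would be finalized and merged into its parent.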

28 EagerTopK Algorithm
- First, find the traditional SLCAs (disregarding node types) using the algorithm in [Xu and Papakonstantinou SIGMOD 05];
- Then, starting from these initial SLCAs, trace up towards the root, pick out ordinary nodes as SLCAs, and compute their probabilities;
- Several upper bounds are used for pruning.
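The role of upper bounds in a top-k search can be sketched generically in Python. This is not the EagerTopK algorithm: the candidate names, scores and bounds below are invented for illustration, and the `upper_bound` and `exact` callbacks stand in for the pruning properties and the SLCA probability computation, which are not reproduced here.

```python
import heapq

def top_k(candidates, k, upper_bound, exact):
    """Keep the k candidates with the largest exact score, skipping the exact
    computation whenever the upper bound cannot beat the current k-th best."""
    heap = []           # min-heap of (score, candidate); heap[0] is the k-th best
    evaluated = 0
    for cand in candidates:
        if len(heap) == k and upper_bound(cand) <= heap[0][0]:
            continue    # pruned: cannot enter the top k
        score = exact(cand)
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, cand))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, cand))
    return sorted(heap, reverse=True), evaluated

# Hypothetical data: exact probabilities and cheap upper bounds per node.
exact_scores = {"a": 0.5, "b": 0.4, "c": 0.1, "d": 0.3}
bounds = {"a": 0.6, "b": 0.5, "c": 0.2, "d": 0.35}
result, evaluated = top_k(["a", "b", "c", "d"], 2,
                          bounds.__getitem__, exact_scores.__getitem__)
print(result, evaluated)  # [(0.5, 'a'), (0.4, 'b')] 2
```

In this toy run the exact score is computed for only two of the four candidates, because the bounds of c and d cannot exceed the current second-best score.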

29 Pruning Properties
Pruning a chain of nodes
- Property 1 (IND and ordinary nodes)
- Property 2 (MUX nodes)
- Property 3 (all types, looser than Properties 1 and 2)
Pruning a single node
- Property 4 (ordinary nodes)

30 Pruning a chain of nodes
Property 1 (IND and ordinary nodes)
[Figure: root r, node p (IND or ordinary) with children c1 and c2, and an upper-bound inequality (≤) that applies when c1 exists and contains all the keywords]

31 Pruning a chain of nodes
Property 2 (MUX nodes)
[Figure: root r, MUX node p with children c1 and c2, and an upper-bound inequality (≤) that applies when c1 exists and contains all the keywords]

32 Pruning a chain of nodes
Property 3 (all types of nodes)
[Figure: root r, node p (any type) with descendants d1 and d2, a node c, and two upper-bound inequalities (≤)]

33 Pruning a single node
Property 4 (ordinary nodes)
[Figure: root r, ordinary node p with children c1 and c2, annotated with the local SLCA probability of node p and the existence probability of node p]

34 Outline
- Introduction
- Problem and Challenge
- Our solution
- Experiments
- Conclusions

35 Experiments
Experimental setup
- Intel P4 3.0GHz CPU, 2GB RAM, Windows XP, Java
- Datasets: DBLP (large and shallow), Mondial (deep, complex and small) and XMark (tunable depth and size)
- Distributional nodes are inserted randomly into the test datasets using the same method as [Kimelfeld et al. SIGMOD 08]
- Keyword queries are randomly selected for each dataset.

36 Test datasets and keyword queries
[Tables: test datasets; keyword queries]

37 Time cost and memory cost

38 Varying k

39 Varying document size
4 XMark datasets, with sizes from 10MB to 80MB; top k = 10

40 Outline
- Introduction
- Problem and Challenge
- Our solution
- Experiments
- Conclusions

41 Conclusions
Study: keyword search on probabilistic XML data
Contributions:
- Result semantics for keyword search on a probabilistic XML document: SLCA semantics on a p-document
- SLCA probability computation without generating possible worlds
- Algorithms
  - PrStack: easy to implement
  - EagerTopK: faster at returning top-k answers, using a few upper bounds
- Experiments conducted

42 Thanks! Questions?

