Approximate Validity of XML Streaming Data


Presentation on theme: "Approximate Validity of XML Streaming Data"— Presentation transcript:

1 Approximate Validity of XML Streaming Data
HUANG Cheng, LI Jun: University Paris-Sud & Huazhong University of Science and Technology. Michel DE ROUGEMONT: University Paris II.

2 Streaming Data from different sources Approximate decisions
Motivation: Streaming data from different sources. Approximate decisions: correct, robust, statistics-based computations.
These days, people use more and more streaming data, such as video and radio on the internet and data from sensors. Because the error rate in such data is relatively high, we need to make approximate decisions based on statistics. 2018/12/6

3 Generalized Statistics On Trees Statistics allow Approximate validity
Plan:
1. Generalized statistics on trees.
2. Statistics allow approximate validity on words and trees, based on property testing (edit distance with moves). References: Property testing for regular tree languages (ICALP 2004); Approximate Satisfiability and Equivalence (LICS 06).
3. Approximate validity on streaming data.
In our work, we focus on approximate validity on streaming data. First, as preliminaries, we introduce the generalized statistics and approximate validity.

4 Edit Distances with Moves
Classical edit distance: insertions, deletions, modifications. Edit distance with moves. Edit distance with moves generalizes to ordered trees.
We are familiar with the classical operations on words: insertion, deletion and modification. We now introduce another operation, the move, which moves a sub-word from one position to a new one. There is an example on the slide. We define the edit distance with moves between two words as the minimal number of operations needed to transform one word into the other. If this number is less than εn, where n is the maximal length of the two words, we say that the distance is less than ε. The distance between a word w and a language L is the minimal distance between w and w’, where w’ ranges over the words of L. Furthermore, the notions of move and of edit distance with moves generalize to ordered trees. For example, we can transform one tree into another by moving a sub-tree, and the distance between two trees is again computed from the number of operations.
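The definitions on this slide can be written compactly as follows. This is a sketch in our own notation: the symbol dist and the explicit n = max(|w|, |w'|) follow the spoken description and are not formulas shown on the slide.

```latex
\begin{align*}
\mathrm{dist}(w, w') &= \min \{\, m : w \text{ is transformed into } w' \text{ by } m
    \text{ insertions, deletions, modifications or moves} \,\} \\
\text{$w$ and $w'$ are at distance less than } \varepsilon
    &\iff \mathrm{dist}(w, w') < \varepsilon \cdot n, \qquad n = \max(|w|, |w'|) \\
\mathrm{dist}(w, L) &= \min_{w' \in L} \mathrm{dist}(w, w')
\end{align*}
```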

5 Statistics on words (k-gram)
Word W of length n: n-k+1 blocks of length k = 1/ε. For the example word W: k = 2, n-k+1 = 11.
We now introduce the k-gram statistics, where k = 1/ε. Let w be a word. The uniform statistics of w, written u.stat(w), is a unit vector (its entries sum to 1) that gives the density of each sub-word of length k in w. In the example, k is 2 and u.stat(w) is as shown on the slide: the sub-word 00 occurs only once in the word, so its numerator is 1, and 11 is the total number of sub-words. Why are we interested in uniform statistics? Because the distance between two words w and w’ is approximately equal to the one-norm of the difference between u.stat(w) and u.stat(w’). When we take samples instead of all the sub-words, we obtain the sample statistics Y(w). We have proved that Y(w) is ε-close to u.stat(w).
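As a minimal sketch of the k-gram statistics described here, the following Python computes u.stat(w), a sampled estimate Y(w), and the one-norm between two statistics. The function names, the sample count and the example word are our own assumptions; the word is a hypothetical one of length 12 chosen so that 00 occurs once among 11 blocks, not the word from the slide.

```python
import random
from collections import Counter


def ustat(w, k):
    """Density of every sub-word of length k among the n-k+1 blocks of w."""
    n_blocks = len(w) - k + 1
    counts = Counter(w[i:i + k] for i in range(n_blocks))
    return {gram: c / n_blocks for gram, c in counts.items()}


def sample_stat(w, k, samples, rng):
    """Estimate ustat(w, k) from uniformly sampled block positions."""
    n_blocks = len(w) - k + 1
    counts = Counter()
    for _ in range(samples):
        i = rng.randrange(n_blocks)
        counts[w[i:i + k]] += 1
    return {gram: c / samples for gram, c in counts.items()}


def l1(p, q):
    """One-norm of the difference between two statistics vectors."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


# Hypothetical example word (length 12, so 11 blocks for k = 2).
W = "010011011101"
U = ustat(W, 2)
Y = sample_stat(W, 2, samples=2000, rng=random.Random(0))
print(U)          # density of each 2-gram
print(l1(U, Y))   # how far the sampled estimate is from the exact statistics
```

The exact statistics read every block once, while the sampled version only needs random access to a few positions, which is what makes it suitable for property testing on stored words.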

6 Statistics for unranked ordered trees
Transformation: Rabin encoding. [Figure: an unranked tree with nodes a, b, d and the corresponding extended 2-ranked tree.]
We now turn to the transformation of trees, from an unranked tree to an extended 2-ranked tree.
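A minimal sketch of such an encoding, assuming it is the usual first-child / next-sibling construction (the first child becomes the left child, the next sibling becomes the right child). This reading is consistent with the path a b c g h used in the streaming example later in the talk. The class names are hypothetical, and any padding that makes the tree "extended" is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Unranked:
    label: str
    children: List["Unranked"] = field(default_factory=list)


@dataclass
class Binary:
    label: str
    left: Optional["Binary"] = None    # first child in the unranked tree
    right: Optional["Binary"] = None   # next sibling in the unranked tree


def encode(node: Unranked, siblings: Sequence[Unranked] = ()) -> Binary:
    """First-child / next-sibling encoding of an unranked ordered tree."""
    encoded = Binary(node.label)
    if node.children:
        encoded.left = encode(node.children[0], node.children[1:])
    if siblings:
        encoded.right = encode(siblings[0], siblings[1:])
    return encoded


# Hypothetical example: a root a with two children b and d.
tree = Unranked("a", [Unranked("b"), Unranked("d")])
binary = encode(tree)
print(binary.left.label, binary.left.right.label)
```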

7 Statistics on Trees: generalized k-gram
We abbreviate “author” by a, “db” by b and “work” by w. Types of sub-paths: 00, 01, 10, 11.
We generalize the k-gram from words to trees. The uniform statistics of a tree, u.stat(t), is a matrix. Each column of this matrix corresponds to one type of sub-path in the Rabin encoding. A left edge has type 0 and a right edge has type 1. For example, the slide shows an unranked tree and the extended 2-ranked tree associated with it. Let k = 3: there are 4 types of sub-paths, called 00, 01, 10 and 11 respectively. The first column of u.stat(t) corresponds to the sub-paths of type 00, and we can see that there is one sub-path b-w-a of this type, so this column contains a 1.

8 Statistics on Trees: generalized k-gram
From u.stat(t), we obtain two classes of vectors: the global statistics gstat(t) and the local statistics lstat(t). gstat(t) is a vector that gives the density of each type of sub-path. For example, the number of sub-paths of type 00 is 1, so the corresponding entry is 1. lstat(t) is a set of vectors, each giving the density of every sub-path of a given type. The slide shows the lstat(t) of type 01, which corresponds to the second column of the statistics matrix.
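The following sketch shows one way to compute these statistics on a binary-encoded tree. It assumes that a sub-path is a downward path of k nodes, typed by its k-1 edge directions, and that gstat and lstat are normalized to densities. The tuple representation, the function names and the example tree (a db node with two work children, each with one author) are our own assumptions, not the tree from the slide.

```python
from collections import Counter

# A binary node is (label, left, right); None marks a missing child.


def ustat_counts(root, k):
    """Count every downward path of k nodes, keyed by (type, labels)."""
    counts = Counter()

    def visit(node, labels, types):
        if node is None:
            return
        label, left, right = node
        labels = labels + [label]
        if len(labels) >= k:
            key = ("".join(types[-(k - 1):]), tuple(labels[-k:]))
            counts[key] += 1
        visit(left, labels, types + ["0"])    # left edge has type 0
        visit(right, labels, types + ["1"])   # right edge has type 1

    visit(root, [], [])
    return counts


def gstat(counts):
    """Density of each type of sub-path."""
    total = sum(counts.values())
    per_type = Counter()
    for (path_type, _), c in counts.items():
        per_type[path_type] += c
    return {t: c / total for t, c in per_type.items()}


def lstat(counts, path_type):
    """Density of each labelled sub-path among those of one type."""
    chosen = {lab: c for (t, lab), c in counts.items() if t == path_type}
    total = sum(chosen.values())
    return {lab: c / total for lab, c in chosen.items()}


# Hypothetical encoded tree: b -left-> w -left-> a, and w -right-> w -left-> a.
second_work = ("w", ("a", None, None), None)
first_work = ("w", ("a", None, None), second_work)
TREE = ("b", first_work, None)

COUNTS = ustat_counts(TREE, 3)
print(gstat(COUNTS))
print(lstat(COUNTS, "00"))
```

Each path is counted at its last node, so one traversal of the tree is enough; k is assumed to be at least 2.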

9 2. Approximate validity based on Property Testing
Let F be a property on a class K of structures U. An ε-tester for F is a probabilistic algorithm A such that:
If U |= F, A accepts.
If U is ε-far from F, A rejects with high probability.
A property F is testable if there exists a probabilistic algorithm A such that, for all ε, A is an ε-tester for F and Time(A) is independent of n = |U|.
References: R. Rubinfeld, M. Sudan, Robust characterizations of polynomials, 1994. O. Goldreich, S. Goldwasser and D. Ron, Property Testing and its connection to Learning and Approximation, 1996.
A tester usually implies a linear-time corrector. (ε1, ε2)-tolerant tester.
After introducing the generalized statistics, we come to the second part, approximate validity. Validity is a decision problem. To approximate a decision problem associated with a property F, we look for a randomized algorithm A such that, if a structure U satisfies F, A accepts, and if U is ε-far from F, A rejects with high probability.

10 Regular membership on words
H = {u.stat(w) : w in r} is a union of polytopes. 2 polytopes for r. Membership tester: Y(w).
We now take regular membership as an example to explain approximate validity further. r is a regular expression and H is the set of uniform statistics of the words w that belong to r. Geometrically, H is a union of polytopes. Here r is represented by 2 polytopes, each corresponding to one part of r. Each vertex of a polytope represents a loop of r, i.e. the density of the sub-words in that loop. Y(w), the statistics on samples mentioned before, is represented by a point. Approximate validity then works as follows. If the point Y(w) lies in the polytopes H or ε-close to them, we say that w belongs to r. If the point Y(w) lies ε-far from the polytopes H, we decide that w does not belong to r, with high probability. If the point Y(w) is only ε-close to H without lying in it, we cannot guarantee anything about exact membership.
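The decision rule can be sketched as follows, under strong simplifying assumptions: every polytope is a segment between two vertices, and the distance is the one-norm, found by ternary search along the segment (the distance is convex in the segment parameter). The regular expression 0*1* + (01)*, its two polytopes over the 2-gram coordinates (00, 01, 10, 11), the function names and the threshold are hypothetical illustrations, not taken from the slides.

```python
def l1(p, q):
    """One-norm of the difference between two vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def dist_to_segment(y, u, v, iters=60):
    """One-norm distance from y to the segment [u, v], by ternary search."""
    def point(t):
        return [(1 - t) * a + t * b for a, b in zip(u, v)]

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if l1(y, point(m1)) <= l1(y, point(m2)):
            hi = m2   # a minimizer lies in [lo, m2]
        else:
            lo = m1   # a minimizer lies in [m1, hi]
    return l1(y, point((lo + hi) / 2))


def membership_test(y, polytopes, eps):
    """Accept when y is within eps of at least one polytope of H."""
    return min(dist_to_segment(y, u, v) for u, v in polytopes) <= eps


# Hypothetical H for r = 0*1* + (01)*, coordinates (00, 01, 10, 11):
# 0*1* mixes the loops 0* and 1*; (01)* is a single loop, hence a single point.
H = [
    ((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 1.0)),
    ((0.0, 0.5, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)),
]

print(membership_test((0.48, 0.02, 0.0, 0.5), H, eps=0.1))     # near 0*1*
print(membership_test((0.25, 0.25, 0.25, 0.25), H, eps=0.1))   # far from both
```

For general polytopes with more vertices, the distance would typically be computed with a linear program rather than a one-dimensional search.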

11 3. Streaming Data
The goal: decide whether a given XML file is valid for a DTD. Our work: propose an algorithm that computes a statistics matrix sustat(t), which approximates the matrix ustat(t), using constant space.
In the case of streaming data, we cannot obtain the approximate statistics matrix by sampling, since a stream is read once, in order, and does not offer random access to the data. So we propose an algorithm that constructs the statistics in an online manner and yields an approximate statistics matrix sustat(t).

12 Data structure for Streaming Data
Stream:<a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a>
[Figure: the unranked tree of the stream and its extended 2-ranked tree (e2rt), on the nodes a to k.]
Data structure.
First, we explain the data structure for streaming data. The slide shows an example of a stream. From this stream we obtain a tree structure and, after transformation, the e2rt. We maintain three queues: tagpath, typepath and width. In tagpath, we store the name of each node on the path of the e2rt from the root to the current tag. For example, when we meet the open tag h, the path is a b c g h, so these names are pushed onto tagpath. In typepath, we store the type of each edge on the path from the root to the current tag: 0 denotes a left edge and 1 a right edge. Here we have left, right, left, right, so we push 0 1 0 1. In width, we store the number of siblings we have passed on each level of the unranked tree, so we have 2 2, corresponding to the two nodes b, c and the two nodes g, h. After storing the information of a node in these three queues, we update the statistics matrix. With unbounded data structures the matrix is updated correctly, but the memory required is not bounded.

13 Unbounded data structures
Stream:<a><b></b><c><g></g><h></h></c><d><i></i><j><k></k></j></d><e></e><f></f></a>
[Figure: the unranked tree of the stream and its e2rt, on the nodes a to k.]
Here is another example, focused on the pop operation. The stream is the same as the previous one. The slide shows the state of the three queues just before we meet </d>. When we meet </d>, we pop two nodes from tagpath and pop their types from typepath, because the last number in width is 2. After that, we pop this number 2 from width. These operations keep the queues equal to the path of the current node in the e2rt. We can see that, in the unbounded case, the maximal size of each queue is equal to the depth of the e2rt.
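The push and pop rules of the last two slides can be sketched as follows, using unbounded stacks. This is our own reconstruction: the rule "an open tag right after another open tag is a first child (type 0), otherwise it is a next sibling (type 1)" and all function names are assumptions, chosen so that the states match the ones described in the talk (a b c g h, 0 1 0 1 and 2 2 at the open tag h, and a pop of two nodes at </d>).

```python
import re
from collections import Counter

STREAM = ("<a><b></b><c><g></g><h></h></c><d><i></i>"
          "<j><k></k></j></d><e></e><f></f></a>")


def events(stream):
    """Yield ('open' | 'close', tag) pairs from a tag-only XML stream."""
    for match in re.finditer(r"<(/?)(\w+)>", stream):
        yield ("close" if match.group(1) else "open", match.group(2))


def stream_ustat(stream, k):
    """Exact k-node path counts of the e2rt, computed in one pass (k >= 2)."""
    tagpath, typepath, width = [], [], []
    counts = Counter()
    last_was_open = False
    for kind, tag in events(stream):
        if kind == "open":
            if tagpath:                       # every node except the root
                if last_was_open:             # first child: left edge
                    typepath.append("0")
                    width.append(1)
                else:                         # next sibling: right edge
                    typepath.append("1")
                    width[-1] += 1
            tagpath.append(tag)
            if len(tagpath) >= k:             # one new path ends at this node
                key = ("".join(typepath[-(k - 1):]), tuple(tagpath[-k:]))
                counts[key] += 1
            last_was_open = True
        else:
            if not last_was_open:             # the closed node had children
                n = width.pop()               # number of children to drop
                del tagpath[-n:]
                del typepath[-n:]
            last_was_open = False
    return counts


COUNTS = stream_ustat(STREAM, 3)
for (path_type, labels), c in sorted(COUNTS.items()):
    print(path_type, "-".join(labels), c)
```

Closing a leaf pops nothing, because in the e2rt its next sibling hangs below it as a right child.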

14 Bounded data Structure
Suppose the length of the queues is limited to a constant, here 4. Some of the matrix entries will be missing.
Now I will explain the case of bounded data structures, using the previous example. This time, we suppose that the length of the queues is limited to 4. When we try to store the information of the nodes from the root to the current node h, we necessarily lose some information because of the bounded size, so a question mark is shown in place of the lost entry. Consequently, in some cases we cannot update the statistics matrix correctly.

15 Streaming algorithm
Definition: a k-fork is a node with 2 distinct paths of length more than 2k.
Streaming algorithm. Input: on an open tag <a>, bounded push, then update sustat(t); on a close tag </a>, pop, recover, then update sustat(t). Output: the matrix sustat(t).
Example with k = 3. Entries missed: b-f-d… Entries recovered: d-c-d…
With bounded data structures, some errors are introduced. However, these errors only happen at the k-forks. A k-fork is a node that has two long sub-paths in the e2rt, both of length more than 2k. Based on the bounded data structure, we propose a streaming algorithm. On the one hand, when we meet the open tag of a node, we push its name and type onto the queues; if the queues are full, we discard the oldest node. We then update the approximate statistics matrix sustat(t). In this process, errors are generated when we return to a fork. In the example, f is a fork: we parse the stream in the order shown on the slide… and when we come back to f and go on to d, we lose the entry b-f-d of the statistics matrix. On the other hand, when we meet the close tag of a node, we pop nodes according to the last number in the queue width. In some cases, we can recover the lost nodes. For example (d-c-d), because of a long path, the node d is lost, so we would lose this sub-path in the matrix. However, after parsing the long path, we meet the close tag of this node d, so we can recover it and the sub-path, and then update the matrix. With this streaming algorithm, we obtain the approximate matrix sustat(t).
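A simplified sketch of the bounded push and pop, to show where entries go missing. It is not the algorithm of the talk: the recover step is omitted, the width stack is left unbounded to keep the code short (the talk bounds all three queues), and the first-child / next-sibling rule, the depth counter and all names are our own assumptions. The stream is the one from the data-structure slides, with k = 3 and a queue limit of 4.

```python
import re
from collections import Counter, deque

STREAM = ("<a><b></b><c><g></g><h></h></c><d><i></i>"
          "<j><k></k></j></d><e></e><f></f></a>")


def events(stream):
    """Yield ('open' | 'close', tag) pairs from a tag-only XML stream."""
    for match in re.finditer(r"<(/?)(\w+)>", stream):
        yield ("close" if match.group(1) else "open", match.group(2))


def stream_sustat(stream, k, limit):
    """Approximate k-node path counts with a path queue of bounded length."""
    path = deque(maxlen=limit)   # (tag, edge type); oldest entry dropped when full
    width = []                   # simplification: not bounded in this sketch
    counts, missed, depth = Counter(), 0, 0
    last_was_open = False
    for kind, tag in events(stream):
        if kind == "open":
            if depth == 0:                    # root: no incoming edge
                edge = None
            elif last_was_open:               # first child: left edge
                edge = "0"
                width.append(1)
            else:                             # next sibling: right edge
                edge = "1"
                width[-1] += 1
            path.append((tag, edge))          # bounded push
            depth += 1
            if len(path) >= k:
                last = list(path)[-k:]
                key = ("".join(e for _, e in last[1:]),
                       tuple(t for t, _ in last))
                counts[key] += 1
            elif depth >= k:                  # a path exists but was forgotten
                missed += 1
            last_was_open = True
        else:
            if not last_was_open:             # the closed node had children
                n = width.pop()
                depth -= n
                for _ in range(min(n, len(path))):
                    path.pop()
            last_was_open = False
    return counts, missed


COUNTS, MISSED = stream_sustat(STREAM, 3, limit=4)
print(sum(COUNTS.values()), "entries counted,", MISSED, "missed")
```

On this stream the only lost entry is c-d-e: after the sub-tree below d has been parsed and popped, the queue has forgotten c, so the path ending at e cannot be reported.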

16 Streaming algorithm Key Lemma : #forks
Theorem: sustat(t) approximates ustat(t). If Memory = 2*k, the errors are less than εn.
Although some errors in the statistics occur at forks, the main observation is that an e2rt has few forks compared with the size of the tree. In our work, we have proved that, for a given k, the number of forks is smaller than this…… Furthermore, we obtain a theorem: the statistics matrix sustat(t) produced by the streaming algorithm approximates the ideal matrix ustat(t). If the memory is equal to 2k, the errors are less than ε times n. The larger the memory, the fewer the errors.

17 Approximate validity on streaming data
Streaming test (Memory = 2*k). [Figure: the points Y(t), ustat(t) and sustat(t) relative to the polytope of the DTD.]
We now come to approximate validity with the streaming algorithm. Because sustat(t) approximates ustat(t), we can test whether a streaming file is close to or far from a DTD by considering sustat(t) alone. Geometrically, a DTD can be represented by a polytope, like a regular expression, and the statistics matrices can be represented by points. We have proved that if the distance between the point sustat(t) and the polytope is smaller than 2ε, we can accept that the streaming file belongs to the DTD. Otherwise, the stream does not belong to the DTD with high probability, and we reject.

18 Results: http://www.up2.fr/xmlstream/
Gstat(t). XML file source: XMark.
You can use our statistics tool online. On this website, you upload your XML file and enter the value of the parameter k, and you get the global statistics as a result. Here is an example. We use an XML file generated by a project called XMark. The file has about 200 thousand nodes. Here k is equal to 4. The result shows that more than 1/3 of the sub-trees are of one type, and just 2% of the sub-trees are of another type, so we know that this tree is wider than it is deep.

19 Results: http://www.up2.fr/xmlstream/
Lstat(t). XML file source: XMark.
When you click a rectangle in the results of the global statistics, you get the local statistics of that type. In this example, the sub-tree ……… occurs more frequently.

20 Approximate statistics on Streaming Data Approximate validity
Conclusion. Statistics of trees: generalization of a k-gram, easy to compute on a DOM. Approximate statistics on streaming data. Approximate validity. Data exchange. Data integration. Constant space.

