# UPA and Restriction for All-Groups and Numeric Exponents (Matthew Fuchs, PhD, Westbridge Technology; Allen Brown, PhD, Microsoft)


UPA and Restriction for All-Groups and Numeric Exponents. Matthew Fuchs, PhD, Westbridge Technology, matt@westbridgetech.com. Allen Brown, PhD, Microsoft.

Why Bother? Numeric exponents were introduced by the W3C XML Schema WG. Restriction is a subsumption relation among content models. And-groups have long been cherished by the markup community. UPA (Unique Particle Attribution) is an old constraint on content models in WXS (W3C XML Schema). What is the cost of combining them?

Naïve Algorithms: Exponential or worse: –All-groups: try all exponentially many cases. –Numeric exponents: unroll, which is doubly exponential. First unroll: (a{0,3} | b){10,20} => ((a | aa | aaa | b) … (a | …) …). Then determinise. –Used by XSV, Xerces, Sun. Not trying to do better is simply remiss.
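To get a feel for the blow-up, here is a minimal sketch that counts how many terminal occurrences one common unrolling produces. The tuple encoding of expressions and the rule "replace e{m,n} by n copies of e" are assumptions made for illustration; they are not taken from XSV, Xerces, or the slides.

```python
# Hypothetical encoding (not from the slides):
#   ("sym", name) | ("seq", [e, ...]) | ("alt", [e, ...]) | ("rep", e, m, n)

def unrolled_size(expr):
    """Count terminal occurrences after replacing e{m,n} by n copies of e
    (m mandatory copies followed by n - m optional ones)."""
    kind = expr[0]
    if kind == "sym":
        return 1
    if kind in ("seq", "alt"):
        return sum(unrolled_size(child) for child in expr[1])
    if kind == "rep":
        _, child, _m, n = expr
        return n * unrolled_size(child)
    raise ValueError(f"unknown node kind: {kind}")

# (a{0,3} | b){10,20}
inner = ("rep", ("sym", "a"), 0, 3)
example = ("rep", ("alt", [inner, ("sym", "b")]), 10, 20)
print(unrolled_size(example))  # 20 * (3 + 1) = 80
```

Under this assumption the size is the product of the nested maximum exponents, so it grows exponentially in the number of digits used to write them, and that is before determinising.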

UPA Testing: Generally we just need to check follow sets. Numeric exponents are a problem for {m,m}. For example: –(a₁, b₂){2,2}, a₃ => ababa –((a₁, b₂){1,3}, a₃) => aba or ababa or abababa. Is a₁ in follow(b₂)?
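The two examples can be compared by listing their words directly. The sketch below hard-codes the shape (a, b){m,n}, a from the slide; the helper names `words` and `ambiguous_prefixes` are hypothetical. A word that is also a proper prefix of a longer word marks a point where the final "a" could be either a₃ (ending the match) or a₁ (starting another iteration).

```python
def words(m, n):
    """All strings matched by (a, b){m,n}, a."""
    return ["ab" * k + "a" for k in range(m, n + 1)]

def ambiguous_prefixes(candidates):
    """Words that are also proper prefixes of another word in the list."""
    return [w for w in candidates
            if any(v != w and v.startswith(w) for v in candidates)]

print(words(2, 2))                      # ['ababa']
print(ambiguous_prefixes(words(2, 2)))  # []
print(words(1, 3))                      # ['aba', 'ababa', 'abababa']
print(ambiguous_prefixes(words(1, 3)))  # ['aba', 'ababa']
```

With {2,2} the iteration count alone decides which particle each "a" belongs to; with {1,3} it does not, which is why the follow set of b₂ matters.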

Problem for All-groups: Again, are different branches in each other's follow sets? (a & b & c) => follow(a) = {b, c}. (a & b? & c) => follow(a) = {b, c} union follow(b) => {a, b, c}. ((a, b?) & b & c) => violates UPA.

Five properties of particles: particles(p) => all particles within p, recursively defined. opaque(p) => a particle is opaque if it cannot match the empty string. first(p) => particles in p that can match the first letter of a string matching p. follow(p) => particles in the outer expression that can match a letter that comes right after the substring matched by p. confusion(p) => particles in p which could conflict with follow(p); for example, in (a, b?), b is in confusion((a, b?)).
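Two of these properties, opaque and first, can be computed by a simple recursion over the expression. The following is a minimal sketch under an assumed tuple encoding; the encoding and the function bodies are illustrative and are not the authors' implementation. Particles are identified by a (name, index) pair so that two occurrences of the same name stay distinct.

```python
# Hypothetical encoding (not from the slides):
#   ("sym", name, index) | ("seq", [p, ...]) | ("alt", [p, ...])
#   | ("all", [p, ...]) | ("rep", p, m, n)

def opaque(p):
    """True if p cannot match the empty string."""
    kind = p[0]
    if kind == "sym":
        return True
    if kind in ("seq", "all"):
        return any(opaque(child) for child in p[1])
    if kind == "alt":
        return all(opaque(child) for child in p[1])
    if kind == "rep":
        _, child, m, _n = p
        return m > 0 and opaque(child)
    raise ValueError(f"unknown node kind: {kind}")

def first(p):
    """Particles (name, index) that can match the first letter of a match of p."""
    kind = p[0]
    if kind == "sym":
        return {(p[1], p[2])}
    if kind == "seq":
        result = set()
        for child in p[1]:
            result |= first(child)
            if opaque(child):  # later children cannot come first
                break
        return result
    if kind in ("alt", "all"):
        return set().union(*(first(child) for child in p[1]))
    if kind == "rep":
        _, child, _m, n = p
        return first(child) if n > 0 else set()
    raise ValueError(f"unknown node kind: {kind}")

a = ("sym", "a", 1)
b = ("sym", "b", 2)
print(first(("seq", [a, ("rep", b, 0, 1)])))          # (a, b?)  -> only a
print(sorted(first(("seq", [("rep", a, 0, 1), b]))))  # (a?, b)  -> a and b
```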

Special Considerations: follow(p) is restricted as follows: –(((a?,b){m,m}),c) => follow(b) = {c} –(((a?,b){m,n}),c) => follow(b) = {c, a, b} –((a & b & c), d) => follow(c) = {d}

Sources of UPA Violation: Consider P in –(α, β{0,1}, P, γ) –(α, (β | P), γ) –(α, (β & P), γ). A UPA violation requires 2 terminals: –One before P, one inside P: need first(P) –Both inside P: in a moment –One inside P, one after P: need confusion(P) –One before P, one after P: possible only when opaque(P) is false

Internal Consistency: For P{m, m}, even if P obeys UPA there is a violation when confusion(P) intersection first(P) != {}. If P is (α & β & γ), then there is a violation when –the first sets overlap, or –confusion(α) intersection (first(β) U first(γ)) != {} –and so on for β and γ.

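UPA Algorithm: UPA(P) holds when the condition for the matching case holds:

$$
\begin{aligned}
P = a &:\quad \text{if } b_i, b_j \in \mathrm{follow}(a) \text{ then } i = j \\
P = Q\{m,n\} &:\quad \mathrm{UPA}(Q) \;\wedge\; \mathrm{first}(Q) \cap \mathrm{confusion}(Q) = \emptyset \\
P = (P_1 \mid \dots \mid P_n) &:\quad \bigcap_{i=1}^{n} \mathrm{first}(P_i) = \emptyset \;\wedge\; \bigwedge_{i=1}^{n} \mathrm{UPA}(P_i) \\
P = (P_1 \,\&\, \dots \,\&\, P_n) &:\quad \bigcap_{i=1}^{n} \mathrm{first}(P_i) = \emptyset \;\wedge\; \bigwedge_{i=1}^{n} \Big( \mathrm{UPA}(P_i) \;\wedge\; \mathrm{confusion}(P_i) \cap \bigcup_{j \neq i} \mathrm{first}(P_j) = \emptyset \Big) \\
P = (P_1, \dots, P_n) &:\quad \mathrm{UPA}(P_1) \;\wedge\; \mathrm{UPA}((P_2, \dots, P_n))
\end{aligned}
$$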

Subsumption for Exponents: Two steps: –For fixed exponents –For exponent ranges. Most of the machinery carries over. We will use B or b to refer to the base model, and R or r to refer to the restricted model.

Traditional: Subsumption is checked by transforming both models into automata. The intersection of the automata (R intersects not(B)) should be empty, where not(B) is obtained by inverting the accepting states of a complete deterministic automaton for B. Once again, this is too huge when everything is unrolled.
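The traditional check can be sketched for small deterministic automata as a search over pairs of states: R is not subsumed by B exactly when some reachable pair has R accepting while B is not. The triple representation `(start, accepting, delta)` and the function name `included` are assumptions made for this illustration, and both automata are assumed deterministic.

```python
from collections import deque

DEAD = None  # stands for "B has no transition here", so B can no longer accept

def included(r, b):
    """True if every word accepted by DFA r is accepted by DFA b.
    Each DFA is (start, accepting_states, {(state, letter): next_state})."""
    r_start, r_accepting, r_delta = r
    b_start, b_accepting, b_delta = b
    seen = {(r_start, b_start)}
    queue = deque(seen)
    while queue:
        r_state, b_state = queue.popleft()
        # A word in R intersect not(B) ends in exactly such a pair.
        if r_state in r_accepting and b_state not in b_accepting:
            return False
        for (source, letter), r_next in r_delta.items():
            if source != r_state:
                continue
            b_next = b_delta.get((b_state, letter), DEAD)
            pair = (r_next, b_next)
            if pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return True

# R = (a, a) and B = a{1,3}, both already unrolled into states by hand.
r_aa = (0, {2}, {(0, "a"): 1, (1, "a"): 2})
b_a13 = (0, {1, 2, 3}, {(0, "a"): 1, (1, "a"): 2, (2, "a"): 3})
print(included(r_aa, b_a13))  # True
print(included(b_a13, r_aa))  # False: "a" is in B but not in R
```

The number of pairs is bounded by the product of the two state counts, which is what becomes unmanageable once large exponents have been unrolled into states.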

Our Machines: Represent the regex as a graph. Forward edges, matching terminals, form a DAG. Back edges, matching exponents, form connected components. Each back edge is marked with its arity.

Execution Model: Letters are matched by going forward along edges. The machine is trapped when a back edge is entered. It cannot leave until the obligation (the value of the back edge) is fulfilled. Edge constraints are fulfilled in LIFO order. A stack maintains the current iterations.

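Example: (a, ((a,b)^2 | b))^2. [Figure: machine for this expression, with forward edges labeled a, a, b, b and two back edges of arity 2.]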
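The idea of counting iterations instead of unrolling can be tried on this example with a much simpler device than the graph machine described in the talk. The sketch below is not that machine: it is a plain set-of-positions matcher over an assumed tuple encoding, where each repetition keeps a loop counter and nested repetitions finish in last-in, first-out order through ordinary recursion. It only supports finite exponents.

```python
# Hypothetical encoding (not from the slides):
#   ("sym", letter) | ("seq", [e, ...]) | ("alt", [e, ...]) | ("rep", e, m, n)

def ends(expr, s, starts):
    """Positions in s where a match of expr can end, given possible start positions."""
    kind = expr[0]
    if kind == "sym":
        return {i + 1 for i in starts if i < len(s) and s[i] == expr[1]}
    if kind == "seq":
        for child in expr[1]:
            starts = ends(child, s, starts)
        return starts
    if kind == "alt":
        return set().union(*(ends(child, s, starts) for child in expr[1]))
    if kind == "rep":
        _, child, m, n = expr
        result = set(starts) if m == 0 else set()
        current = set(starts)
        for count in range(1, n + 1):   # the counter replaces unrolled copies
            current = ends(child, s, current)
            if not current:
                break
            if count >= m:
                result |= current
        return result
    raise ValueError(f"unknown node kind: {kind}")

def matches(expr, s):
    return len(s) in ends(expr, s, {0})

ab = ("seq", [("sym", "a"), ("sym", "b")])
body = ("seq", [("sym", "a"), ("alt", [("rep", ab, 2, 2), ("sym", "b")])])
example = ("rep", body, 2, 2)           # (a, ((a,b)^2 | b))^2
print(matches(example, "abaabab"))      # True: "ab" then "a" + "abab"
print(matches(example, "ab"))           # False: only one outer iteration
```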

Subsumption Checking: Start as usual. When entering the head of a back edge, add an entry to the machine's stack. When both machines reach a repeated state: –Tail of a back edge –Previously seen in the list of traversed states. Determine if there is a matched component. Maximally reduce the exponents for matched edges.

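For Example: (a, (a,b,a,b)^6, b^3, c) <= (a, ((a,b)^2 | b)^9, c)

| State (r, b) | Letter | Next (r, b) | r stack | b stack |
|---|---|---|---|---|
| (0,0) | a | (1,1) | [] | [] |
| (1,1) | a | (2,2) | [0] | [0,0] |
| (2,2) | b | (3,3) | [0] | [1,0] |
| (3,1) | a | (4,2) | [0] | [1,0] |
| (4,2) | b | (5,3) | [1] | [2,0] |
| (5,3) | X | (5,1) | [] | [6] |
| (5,1) | b | (6,3) | [1] | [] |
| (6,3) | X | (6,3) | [] | [] |
| (6,3) | c | (7,4) | [] | [] |

[Figure: graphs of the two machines with edges labeled a, b, c; the base machine's back edges carry arities 2 and 9.]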

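Reducing Exponents: Find the cross-product back edge ($start_r$ and $start_b$). Get $d_r$ and $d_b$ (the number of iterations of each per matched cycle). Get the leftover: $l_r = total_r - start_r$, and likewise $l_b$. Compute $l_r \operatorname{div} d_r = quot_r$ with remainder $rem_r$, etc. Then $new_r = l_r - (d_r \cdot \min(quot_r, quot_b)) + start_r$.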

Why So Complicated: Compare (a,a,a)^7 and (a,a)^12. Must go 3 rounds of (a,a) for 2 rounds of (a,a,a). $l_r = 7$, $l_b = 12$, $d_r = 2$, $d_b = 3$. $l_r \operatorname{div} d_r = 3$ rem 1; $l_b \operatorname{div} d_b = 4$ rem 0. $new_r = 7 - (2 \cdot 3) + 0 = 1$; $new_b = 12 - (3 \cdot 3) + 0 = 3$. Hence, max 6 rounds of (a,a,a) and 9 of (a,a).
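The arithmetic on this slide can be replayed in a few lines. In the sketch below, `reduce_exponents` follows the formula new = l − d · min(quot_r, quot_b) + start, with both start values taken to be 0 for this example. The helper `rounds_per_cycle`, which derives d_r and d_b from the loop-body lengths via a least common multiple, is an assumption that only fits single-letter bodies like these; the slides do not say how d is obtained in general.

```python
from math import gcd

def rounds_per_cycle(len_r, len_b):
    """Rounds of each loop needed to consume the same number of letters
    (assumption: bodies of len_r and len_b copies of one letter)."""
    common = len_r * len_b // gcd(len_r, len_b)
    return common // len_r, common // len_b

def reduce_exponents(l_r, d_r, start_r, l_b, d_b, start_b):
    """Strip as many whole matched cycles as both leftovers allow."""
    cycles = min(l_r // d_r, l_b // d_b)
    new_r = l_r - d_r * cycles + start_r
    new_b = l_b - d_b * cycles + start_b
    return new_r, new_b

d_r, d_b = rounds_per_cycle(3, 2)              # (a,a,a) versus (a,a)
print(d_r, d_b)                                # 2 3
print(reduce_exponents(7, d_r, 0, 12, d_b, 0)) # (1, 3)
```

The result leaves 1 round of (a,a,a) and 3 rounds of (a,a), meaning 6 and 9 rounds respectively were absorbed.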

Generalized Exponents: Must keep track of the minimum and maximum possible transitions. Edges can contribute to min, to max, or to both. Cannot exit until max > min allowed. Must exit before min > max allowed.

So… Generate as few min r/b as possible. –If they exceed max r/b, you're screwed. Generate as many max r/b as possible. –Means you can use a forward transition –Use parsimoniously to maximize the amount matched

More Complex Machinery: Back edge constraints have a min and a max. Some back edges increment just the max value. Other back edges increment both the min and max values. Max means maximum possible match. Min means minimum possible match.

Example: ((a, b?){3,5}, c). [Figure: machine for this expression, with edges labeled a, b, c, c and a back edge marked 3,5.]

Four Kinds of Pairs: When hitting a min-edge/min-edge: –Calculate min/min values (previous algorithm with min exponents) –Calculate max/max values (previous algorithm with max exponents) –Move forward when possible –If min ever exceeds max, fail. When hitting a max-edge/max-edge: –Calculate min/min values –Calculate max/max values –When max > min, you can progress (when leaving a cycle, set min to the passing value) –Else fail. Etc.

After exiting the loop, some iterations remain. As all unabsorbed transitions are attempted, all possibilities are tried. Given a base $(P)\{m_b, n_b\}$ and a restriction $(P)\{m_r, n_r\}, (P)\{m'_r, n'_r\}$, ensure $m_r + m'_r \ge m_b$ and $n_r + n'_r \le n_b$.
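The final bounds check amounts to comparing sums. The sketch below reads the two comparisons as non-strict (at least the base minimum, at most the base maximum), which is an interpretation rather than something the transcript states outright; the function name `ranges_fit` and the tuple representation of ranges are hypothetical.

```python
def ranges_fit(base, parts):
    """base is (m_b, n_b); parts is a list of (m, n) ranges applied in sequence
    to the same particle. True if the summed range lies within the base range."""
    m_b, n_b = base
    total_min = sum(m for m, _ in parts)
    total_max = sum(n for _, n in parts)
    return total_min >= m_b and total_max <= n_b

print(ranges_fit((2, 4), [(1, 2), (1, 2)]))  # True: 2..4 fits in 2..4
print(ranges_fit((2, 4), [(1, 2), (1, 3)]))  # False: up to 5 iterations
print(ranges_fit((2, 4), [(0, 1), (1, 2)]))  # False: as few as 1 iteration
```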

If the rest of the expression matches for both the longest and the shortest case (i.e., having matched m or having matched n), then it will match for all iterations. Matching the longest will try all alternatives. Matching the shortest will try the fewest alternatives. As first sets repeat, UPA shows there must be optionality or iteration.

Nested Exponents: (P){m,n}{m,n}, for example (a{m,n} | b){m,n}. Edges in the machine have multiple exponents. A nesting depth of n makes $2^{n-1}$ ranges. Each must be tried. Requires tracking scope. Requires lookahead.

Cost: Without nesting, the algorithm is exponential in the number of exponents, since each exponent requires testing min and max. With nesting, it remains exponential, as nesting does not affect the number of exponents. Still a huge improvement over unrolling.

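Example: ((a?, b{8,9}){2,3}, c) > (a, (b,b){3,3}, (b,b){6,6}, c). First 6 b's at level 2, remaining 12 iterate both levels. At higher levels ranges overlap, so all possibilities need to be checked. [Figure: base machine with edges a₁, b₂, c₀ and back edges {8,9} and {2,3}; restricted machine with edges a, b, b, b, b, c and back edges of arity 3 and 6.]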

((a?, b{8,9}){2,9}, c) > (a, (b,b){3,3}, (b,b){6,6}, c). 8*9=72, 9*8=72. Need to check the ending of 8 and the start of 9. Need lookahead to choose. Represented as ranges at all levels. [Figure: base machine with edges a₁, b₂, c₀ and back edges {8,9} and {2,9}; restricted machine with edges a, b, b, b, b.]
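The overlap mentioned in these two examples can be seen by tabulating, for each outer iteration count k, the range of total b's that the nested exponent b{8,9}{…} allows. This is a small illustrative sketch with hypothetical helper names; it ignores the optional a and only counts b's.

```python
def nested_totals(inner, outer):
    """For each outer count k, the (min, max) total of inner matches."""
    m, n = inner
    lo, hi = outer
    return {k: (k * m, k * n) for k in range(lo, hi + 1)}

def overlapping_counts(totals):
    """Adjacent outer counts whose total ranges touch or overlap."""
    ks = sorted(totals)
    return [(k1, k2) for k1, k2 in zip(ks, ks[1:])
            if totals[k1][1] >= totals[k2][0]]

print(nested_totals((8, 9), (2, 3)))                      # {2: (16, 18), 3: (24, 27)}
print(overlapping_counts(nested_totals((8, 9), (2, 3))))  # []
print(overlapping_counts(nested_totals((8, 9), (2, 9))))  # [(8, 9)]
```

For {8,9}{2,9}, eight outer rounds of 9 and nine outer rounds of 8 both give 72, so a total of 72 b's cannot be attributed to one outer count without looking ahead.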

Conclusions Numeric exponents are hard to work with for subsumption. All-groups are not that difficult. Interaction will be even more annoying. Need to implement and test.
