Published by Laureen Hood. Modified over 9 years ago.
1
November 2003
CSA4050: Advanced Topics in NLP
Computational Morphology IV: xfst
2
What is xfst?
xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers.
xfst and the other Xerox tools employ a notation very close to the one we have been using so far.
For full documentation on the syntax and semantics of Xerox REs, see
– http://www.fsmbook.com
3
Simple Commands
command line (via babe)
  > xfst
define: give a name to an RE
print: print information
read: read information
various stack operations
file interaction
4
define command
define name regexp ;
  xfst[0]: define foo [d o g] | [c a t];
  xfst[0]: define R1 [a | b | c | d];
  xfst[0]: define R2 [d | e | f | g];
  xfst[0]: define R3 [f | g | h | i | j];
  xfst[0]: define baz R1 & R2;
5
print words
print words name: see the words in the language called name.
  xfst[0]: print words R1
  d
  c
  b
  a
  xfst[0]:
6
print net
print net name: see detailed information about the network called name.
  xfst[0]: define z R1 & R2;
  xfst[0]: define baz R1 & R2;
  xfst[0]: print net z
  Sigma: a b c d e f g
  Size: 7
  Net: FC370
  Flags: deterministic, pruned, minimized, epsilon_free, loop_free
  Arity: 1
  s0: d -> fs1.
  fs1: (no arcs)
  xfst[0]:
7
Some Properties of Networks
epsilon-free: there are no arcs labeled with the epsilon symbol.
deterministic: no state has more than one outgoing arc with the same label.
minimised: there is no other network with exactly the same paths that has fewer states.
These properties make sense for FSAs, but not necessarily for FSTs.
8
Equivalent?
[Figure: transducers A and B; extracted arc labels: a:0 ab a a b]
For each network: number of states? number of paths? relation encoded?
9
Remarks
A and B encode the same relation {, }
They are both deterministic and minimal.
They have different numbers of states.
Arcs labeled with a pair containing an epsilon on one side can sometimes be redistributed or eliminated, reducing the number of states.
This situation does not occur with FSAs.
10
FST Determinism: Sequential vs. Unambiguous
Unambiguous: for any input there is at most one output.
– Transducer A is unambiguous in either direction.
Sequential: no state has more than one arc with the same symbol on the input side.
– Transducer A is not sequential in one direction.
A transducer is sequentiable if the relation it encodes is unambiguous and all the local ambiguities resolve themselves in a fixed number of steps.
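The sequentiality condition above can be checked mechanically. The following is a minimal Python sketch, not xfst functionality: the arc representation (source state, input symbol, output symbol, target state) and both toy transducers are assumptions made up for illustration, and neither of them is the transducer A from the slides.

```python
from collections import Counter

def is_sequential(arcs):
    """True if no state has two outgoing arcs with the same input-side symbol.

    Each arc is a hypothetical tuple: (source, input_symbol, output_symbol, target).
    """
    seen = Counter((source, input_symbol)
                   for source, input_symbol, _output, _target in arcs)
    return all(count == 1 for count in seen.values())

# Hypothetical toy transducers, for illustration only.
toy_sequential = [(0, "a", "b", 1), (0, "c", "d", 1)]
toy_not_sequential = [(0, "a", "b", 1), (0, "a", "c", 2)]

print(is_sequential(toy_sequential))       # True
print(is_sequential(toy_not_sequential))   # False
```

Swapping the input and output positions of every arc gives the same check for the other direction of application.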
11
Basic Stack Operations
read regex: push a network onto the stack
print stack: list the items on the stack
print net: detailed information on the top stack item
pop stack: remove the top item from the stack
define name: set name to the value of the top stack item
12
Stack Operations: intersect net, union net, etc.
Load the stack with N suitable arguments.
Ensure that the arguments are pushed onto the stack in the correct (reverse) order.
Issue the command, e.g. intersect net.
The arguments are popped from the stack, the operation is performed, and the result is written back onto the stack.
13
Stack Example 1
  xfst[0]: clear stack;
  xfst[0]: read regex [d | c | e | b | w];
  xfst[1]: read regex [b | s | h | w];
  xfst[2]: read regex [s | d | c | f | w];
  xfst[3]: print stack
  xfst[3]: intersect net
  xfst[1]: print stack
  xfst[1]: print net
  xfst[1]: print words
x1
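The stack discipline in this example can be mimicked in a few lines of Python. This is only a sketch under simplifying assumptions: finite sets of strings stand in for networks, and the helper names read_regex and intersect_net are borrowed from the xfst commands purely as labels, not as an xfst API.

```python
# Sets of strings stand in for networks; a Python list stands in for the xfst stack.
stack = []

def read_regex(words):
    """Push a language (here a finite set of one-symbol strings) onto the stack."""
    stack.append(set(words))

def intersect_net():
    """Pop every language on the stack and push back their intersection."""
    result = stack.pop()
    while stack:
        result &= stack.pop()
    stack.append(result)

read_regex("dcebw")          # [d | c | e | b | w]
read_regex("bshw")           # [b | s | h | w]
read_regex("sdcfw")          # [s | d | c | f | w]
depth_before = len(stack)    # 3, matching the xfst[3] prompt
intersect_net()
depth_after = len(stack)     # 1, matching the xfst[1] prompt
print(sorted(stack[-1]))     # ['w']
```

Only w occurs in all three languages, so the single network left on the stack accepts just that word.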
14
Stack Example 2
  xfst[0]: clear stack;
  xfst[0]: read regex [e d | i n g | s | []];
  xfst[1]: read regex [t a l k | k i c k];
  xfst[2]: print stack
  xfst[2]: print net
  xfst[2]: print words
  xfst[2]: concatenate net
  xfst[1]: print words
x2/a
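The effect of concatenate net in this example can be sketched in Python with sets of strings standing in for networks. The sketch assumes that the topmost network is taken as the first operand, which is what the earlier remark about pushing arguments in reverse order suggests; the function name concatenate is an illustrative label only.

```python
from itertools import product

def concatenate(top, below):
    """Concatenate every string of the top language with every string of the one below."""
    return {first + second for first, second in product(top, below)}

suffixes = {"ed", "ing", "s", ""}   # pushed first, so it sits lower in the stack
stems = {"talk", "kick"}            # pushed second, so it is on top

words = concatenate(stems, suffixes)
print(sorted(words))
# ['kick', 'kicked', 'kicking', 'kicks', 'talk', 'talked', 'talking', 'talks']
```

Pushing the two languages in the opposite order would instead produce strings such as edtalk, which is why the order of the read regex commands matters.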
15
Creating Relations
A simple example of a transducer can be shown using the crossproduct operator:
  xfst[0]: clear stack
  xfst[0]: define Y [d o g | c a t];
  xfst[0]: define Z [c h i e n | c h a t];
  xfst[0]: read regex Y .x. Z;
We can now use apply up and apply down to test the transducer's behaviour.
x3ab
16
apply up; apply down
applyup(arg, R) = {x | <x, arg> in R}
applydown(arg, R) = {x | <arg, x> in R}
  xfst[0]: read regex [d o g | c a t] .x. [c h i e n | c h a t];
  xfst[1]: apply up chien
  dog
  cat
  xfst[1]: apply down cat
  chien
  chat
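The crossproduct and the two apply operations can be modelled with a set of (upper, lower) string pairs. This is a plain-Python sketch of the set definitions rather than of how xfst walks a network; the names R, apply_up and apply_down are chosen here for illustration.

```python
from itertools import product

Y = {"dog", "cat"}
Z = {"chien", "chat"}
R = set(product(Y, Z))   # Y .x. Z: every upper string paired with every lower string

def apply_down(arg, relation):
    """All lower strings x such that (arg, x) is in the relation."""
    return {lower for upper, lower in relation if upper == arg}

def apply_up(arg, relation):
    """All upper strings x such that (x, arg) is in the relation."""
    return {upper for upper, lower in relation if lower == arg}

print(sorted(apply_up("chien", R)))    # ['cat', 'dog']
print(sorted(apply_down("cat", R)))    # ['chat', 'chien']
```

Because the crossproduct pairs every English word with every French word, each lookup returns both candidates, which is why this relation does not yet translate correctly.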
17
Exercise for .x.
What RE would perform the correct translations? Define it in xfst.
Define an RE in xfst which relates the surface forms "sing", "sang" and "sung" to the lexical form "sing".
x3c
18
Replace Rules
Xerox RE notation includes replace rules.
Replace rules do not increase the descriptive power of REs; however, they do provide a powerful, abbreviated, rule-like notation.
There are two main types of replace rules: unconditional and conditional.
19
Unconditional Replace Rules
The most straightforward kind of unconditional replace rule is:
  a -> b
This denotes an FS relation in which every symbol a in the upper language corresponds to a symbol b in the lower language.
Checkpoint: how does this differ from a:b? What is the FST that computes this relation?
20
Unconditional Replace
e.g.
  xfst[0]: read regex c -> r;
  xfst[1]: apply down cat
  xfst[1]: apply down dog
Where there is no match, the string is identity mapped.
The general pattern for simple replace rules is A -> B, where A and B are REs denoting arbitrarily complex languages (not relations).
x4ab
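For the single-symbol rule c -> r, the downward direction behaves like a plain string substitution. The Python sketch below covers only that special case, assuming one-character symbols; it does not model the general A -> B rule over arbitrary languages, and the function name is made up for illustration.

```python
def apply_down_c_to_r(word):
    """Downward application of c -> r: every c becomes r, everything else is unchanged."""
    return word.replace("c", "r")

print(apply_down_c_to_r("cat"))   # rat
print(apply_down_c_to_r("dog"))   # dog (no match, so the string is identity mapped)
```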
21
Definition of A -> B
  A -> B = [ no_A [A .x. B] ]* no_A
where
  no_A = ~$[A - 0]
N.B. If A does not contain the empty string, then ~$[A - 0] = ~$[A]. Otherwise ~$[A] is null, whereas ~$[A - 0] contains at least the empty string.
22
Conditional Replace Rules
More complex replace rules can also specify left and right context, as in
  A -> B || L _ R
Each lexical substring A is related to a substring B when the left context ends with L and the right context starts with R.
A, B, L and R are REs denoting languages, not relations.
x4c
23
Special Cases
The symbol .#. refers to the absolute beginning or end of the string in left and right contexts. For example:
  e -> i || .#. p _ r
Checkpoint: write a replace rule that brings lexical "go" into correspondence with surface "went".
24
The kaNpat exercise
Suppose we have a language in which kaNpat is a lexical string consisting of the morpheme kaN concatenated with the suffix pat.
The nasal N occurring just before p is realised as m.
p occurring just after an m is realised as m.
25
kaNpat rules
We can write the following two rules to account for this behaviour:
Rule 1. [N -> m || _ p]
Notice that the left-hand context is empty, meaning that any context will do.
Rule 2. [p -> m || m _]
Note that the linguist must keep track of the order in which the rules are applied.
26
Derivation of kammat
Lexical: kaNpat
  apply [N -> m || _ p]
Intermediate: kampat
  apply [p -> m || m _]
Surface: kammat
The first rule feeds the second.
Checkpoint: what happens if the rules are applied in reverse order?
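The derivation can be traced in Python by approximating each replace rule with a regular-expression substitution. This is a sketch of the downward direction only, assuming one-character symbols; lookahead and lookbehind are used so that the contexts are tested on the input string, and the function names rule1 and rule2 are illustrative.

```python
import re

def rule1(s):
    """N -> m || _ p : N becomes m when the next symbol is p."""
    return re.sub(r"N(?=p)", "m", s)

def rule2(s):
    """p -> m || m _ : p becomes m when the previous symbol is m."""
    return re.sub(r"(?<=m)p", "m", s)

lexical = "kaNpat"
intermediate = rule1(lexical)      # kampat
surface = rule2(intermediate)      # kammat
print(lexical, intermediate, surface)

# Reverse order: rule2 finds no m before the p, so only rule1 has any effect.
reverse_order = rule1(rule2(lexical))
print(reverse_order)               # kampat
```

In the reverse order the feeding relationship is lost: the m that would trigger the second rule does not exist yet when that rule is applied.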
27
Composing the Relations
Each rule describes a certain relation: call these R1 and R2.
If R1 maps X to Y and R2 maps Y to Z, then there must exist a single relation which maps directly from X to Z without passing through Y.
Mathematically, that relation is the composition of R1 and R2.
28
Composing the Rules
Each rule is compiled into an FST.
If Rule 1 compiles to F1 and Rule 2 to F2, then there must be an F3 which computes the composition of F1 and F2.
Checkpoint: write the RE corresponding to the composition of the original two rules.
29
Testing the kaNpat grammar
First get the rules onto the stack:
  xfst[0]: read regex [N -> m || _ p] .o. [p -> m || m _];
Try the following and explain:
– apply down (kaNpat; kampat; kammat)
– apply up kammat
– Try the above but with the rules in reverse order
X5ab
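The behaviour of the composed grammar can be explored in Python before trying it in xfst. The sketch below approximates each rule with a regular-expression substitution and finds the apply up results by brute force; it assumes one-character symbols, and the candidate enumeration relies on the fact that these two rules can only turn N or p into m. All function names are illustrative.

```python
import re
from itertools import product

def rule1(s):
    """N -> m || _ p"""
    return re.sub(r"N(?=p)", "m", s)

def rule2(s):
    """p -> m || m _"""
    return re.sub(r"(?<=m)p", "m", s)

def apply_down(lexical):
    """Downward application of rule1 .o. rule2: rule1 first, then rule2."""
    return rule2(rule1(lexical))

def apply_up(surface):
    """All lexical strings that the composed grammar maps to the given surface string."""
    # A surface m may come from lexical m, N or p; every other symbol maps to itself.
    options = [("m", "N", "p") if ch == "m" else (ch,) for ch in surface]
    candidates = ("".join(choice) for choice in product(*options))
    return {c for c in candidates if apply_down(c) == surface}

for word in ("kaNpat", "kampat", "kammat"):
    print(word, "->", apply_down(word))    # each one surfaces as kammat

print(sorted(apply_up("kammat")))          # ['kaNpat', 'kammat', 'kampat']
```

All three lexical strings collapse onto the same surface form, so the upward direction is ambiguous even though the downward direction gives a single result for each input.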
30
Practical use of xfst
Regular expression files (text)
  xfst[0]: read regex < regexpfile
Binary files (compiled networks)
  xfst[1]: save stack binfile
  xfst[0]: load stack binfile
Scripts (xfst commands)
  xfst[0]: source scriptfile
  % xfst -f myscript
  % xfst -l myscript
31
A’ is the sequentiable
[Figure: transducers A and A’; extracted arc labels: a:0 ab a a 0:b b:a]
For each network: number of states? number of paths? relation encoded?