# Grammar Induction: An ADIOS Review


ADIOS in outline

Composed of three main elements:

- A representational data structure
- A segmentation criterion (MEX)
- A generalization ability

We will consider each of these in turn.

The Model: graph representation with words as vertices and sentences as paths.

[Figure: a graph whose nodes are words, plus BEGIN and END, and whose numbered edges trace four example sentences as paths 101 to 104: "Is that a dog?", "Is that a cat?", "Where is the dog?", "And is that a horse?"]
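The representation on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_pseudograph` and its return layout are hypothetical names, not part of the ADIOS implementation, and punctuation is treated as an ordinary token.

```python
from collections import defaultdict

BEGIN, END = "BEGIN", "END"

def build_pseudograph(sentences):
    """Return (paths, edges): every sentence as a path of vertices,
    plus a map from each edge (u, v) to the ids of the paths crossing it."""
    paths = []
    edges = defaultdict(list)
    for pid, sentence in enumerate(sentences):
        # Assumption: whitespace tokenization, with BEGIN/END added to each path.
        path = [BEGIN] + sentence.lower().split() + [END]
        paths.append(path)
        for u, v in zip(path, path[1:]):
            edges[(u, v)].append(pid)
    return paths, dict(edges)

corpus = [
    "is that a dog ?",
    "is that a cat ?",
    "where is the dog ?",
    "and is that a horse ?",
]
paths, edges = build_pseudograph(corpus)
# Paths that share the edge is -> that are aligned automatically.
print(edges[("is", "that")])
```

Because several paths can cross the same pair of vertices, each edge keeps a list of path ids rather than a single flag, which is what makes the structure a pseudograph rather than a simple graph.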

Detecting significant patterns

- Identifying patterns becomes easier on a graph
- Sub-paths are automatically aligned

Rewiring the graph

Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.
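As a rough illustration of the rewiring step, the sketch below replaces every occurrence of a pattern's sub-path with a single new vertex. The function name `rewire` and the label `P1` are hypothetical, and paths are plain lists of tokens rather than the real ADIOS data structures.

```python
def rewire(paths, pattern, new_vertex):
    """Replace each occurrence of the sub-path `pattern` (a list of vertices)
    in every path by the single vertex `new_vertex`."""
    n = len(pattern)
    rewired = []
    for path in paths:
        out, i = [], 0
        while i < len(path):
            if path[i:i + n] == pattern:
                out.append(new_vertex)  # the merged vertex stands for the whole sub-path
                i += n
            else:
                out.append(path[i])
                i += 1
        rewired.append(out)
    return rewired

paths = [
    ["BEGIN", "is", "that", "a", "dog", "?", "END"],
    ["BEGIN", "where", "is", "the", "dog", "?", "END"],
]
print(rewire(paths, ["is", "that", "a"], "P1"))
```

Running `rewire` again on its own output, with a pattern that contains `P1`, gives the hierarchical nesting the slide describes.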

Motif EXtraction

The Markov Matrix

- The top right triangle defines the P_L probabilities, the bottom left triangle the P_R probabilities
- The matrix is path-dependent
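The slide does not spell out how the matrix entries are computed. A minimal sketch, assuming P_R is the probability of extending a sub-path one vertex to the right and P_L one vertex to the left, both estimated from occurrence counts along a given search path (the function names are hypothetical):

```python
def count_subpath(paths, sub):
    """Number of occurrences of the contiguous sub-path `sub` across all paths."""
    n = len(sub)
    return sum(
        path[k:k + n] == sub
        for path in paths
        for k in range(len(path) - n + 1)
    )

def p_right(paths, search_path, i, j):
    """Assumed estimate of P_R(e_i; e_j) for i < j:
    count(e_i..e_j) / count(e_i..e_{j-1})."""
    return (count_subpath(paths, search_path[i:j + 1])
            / count_subpath(paths, search_path[i:j]))

def p_left(paths, search_path, j, i):
    """Assumed estimate of P_L(e_j; e_i) for i < j:
    count(e_i..e_j) / count(e_{i+1}..e_j)."""
    return (count_subpath(paths, search_path[i:j + 1])
            / count_subpath(paths, search_path[i + 1:j + 1]))

paths = [
    ["BEGIN"] + s.split() + ["END"]
    for s in ["is that a dog ?", "is that a cat ?",
              "where is the dog ?", "and is that a horse ?"]
]
search_path = paths[0]
print(p_right(paths, search_path, 1, 2))  # "is" extended to "is that"
print(p_left(paths, search_path, 4, 3))   # "dog" extended leftwards to "a dog"
```

The counts depend on which search path supplies the indices, which is one way to read the remark that the matrix is path-dependent.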

Pattern significance

Say we found a potential pattern edge from node 1 to node n. Define:

- m: the number of paths from 1 to n
- r: the number of paths from 1 to n+1

Because it is a pattern edge, we know that [inequality not preserved in the transcript].

Suppose the true probability of n+1 given 1 through n is [symbol not preserved in the transcript]. The ratio r/m is our best estimate of it, but only an estimate.

What are the odds of observing r and m while still having [condition not preserved in the transcript]?

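Pattern significance

Assume [assumption not preserved in the transcript]. The odds of getting the result r and m, or a more extreme one, are then given by [formula not preserved in the transcript].

If this is smaller than a predetermined α, we say the pattern-edge candidate is significant.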
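The slide's own formula did not survive in the transcript, so the following is not a reconstruction of it. It is a sketch of one standard way to turn r and m into a significance decision: a one-sided binomial test against a hypothesized continuation probability. The parameter `p0` and both function names are assumptions introduced here for illustration.

```python
from math import comb

def binomial_lower_tail(r, m, p0):
    """P(X <= r) for X ~ Binomial(m, p0): the chance of seeing r or fewer
    continuations out of m paths if the true probability were p0."""
    return sum(comb(m, x) * p0**x * (1 - p0) ** (m - x) for x in range(r + 1))

def is_significant(r, m, p0, alpha=0.01):
    """Declare the candidate significant if the tail probability is below alpha."""
    return binomial_lower_tail(r, m, p0) < alpha

# Hypothetical numbers: 40 paths reach node n, only 1 continues to n+1.
print(is_significant(r=1, m=40, p0=0.5))
# With 18 of 40 continuing, the drop is not convincing at alpha = 0.01.
print(is_significant(r=18, m=40, p0=0.5))
```

The default `alpha=0.01` mirrors the default significance level quoted later for the `-S` option.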

The algorithm so far

- Initialization: load data into the pseudograph
- Until no more patterns are found:
  - For each path:
    - Detect all sub-paths that meet the MEX criterion
    - Pick the best pattern, add it to the graph and rewire the paths

How to choose patterns

- The more significant the pattern, the better
- When segmenting text, it helps to choose longer patterns first
  - This lowers the probability of accidentally linking words
- It also helps to increase alpha gradually

Syntagmatic and Paradigmatic relations

Words can take part in two kinds of relations with other words:

- Syntagmatic relations: the words appear together in some contexts
- Paradigmatic relations: the words can replace one another in a given context

Syntagmatic relations are discovered by MEX. Candidates for paradigmatic relations are established during a preprocessing step for each search path.

Generalization: defining an equivalence class

- show me flights from philadelphia to san francisco on wednesdays
- list all flights from boston to san francisco with the maximum number of stops
- show flights from dallas to san francisco
- may i see the flights from denver to san francisco please

Generalized search path: show me flights from [boston | philadelphia | denver | dallas] to san francisco on wednesdays

Generalization

Generalized search path: show me flights from [boston | philadelphia | denver | dallas] to san francisco on wednesdays

Paths that match it:

- i need to fly from boston to baltimore please give me…
- which airlines fly from dallas to denver
- please give me a flight from philadelphia to atlanta before ten a m in the morning
- list all flights going from boston to atlanta on wednesday…

Resulting pattern: P1: from _E1 to, where _E1 = {boston, philadelphia, denver, dallas}

Generalization

Context-sensitive generalization

- Slide a context window of size L across the current search path
- For each 1≤i≤L:
  - Look at all paths that are identical to the search path for 1≤k≤L, except for k=i
  - Define an equivalence class containing the nodes at index i of these paths
  - Replace the i'th node with the equivalence class
  - Find significant patterns using the MEX criterion
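The equivalence-class step above can be sketched as follows. This is an illustration only: `equivalence_class` is a hypothetical helper, paths are plain token lists, and slot indices are 0-based where the slide counts from 1.

```python
def equivalence_class(paths, window, i):
    """Collect the vertices found at slot i of every sub-path that matches
    `window` at all other slots (0-based index i)."""
    L = len(window)
    members = set()
    for path in paths:
        for start in range(len(path) - L + 1):
            candidate = path[start:start + L]
            if all(candidate[k] == window[k] for k in range(L) if k != i):
                members.add(candidate[i])
    return members

corpus = [
    "show me flights from philadelphia to san francisco on wednesdays",
    "list all flights from boston to san francisco with the maximum number of stops",
    "show flights from dallas to san francisco",
    "may i see the flights from denver to san francisco please",
]
paths = [s.split() for s in corpus]
window = ["from", "philadelphia", "to", "san", "francisco"]  # context window, L = 5
print(sorted(equivalence_class(paths, window, 1)))
```

With a window this wide the four city names are grouped only because the surrounding four tokens agree, which is the context sensitivity that L controls.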

Determining L

Involves a tradeoff:

- A larger L will demand more context sensitivity in the inference
  - This will hamper generalization
- A smaller L will detect more patterns
  - But many might be spurious

The effects of context window width

When it all goes wrong

- john believes that to please is easy
- john thinks that to please is fun
- jack and john believe that to please is hard

Generalized search path: john [believes | thinks | believe] that to please is easy

Bootstrapping

Search path: what are the cheapest flights from denver to boston that stop in atlanta

A pre-existing equivalence class: {boston, philadelphia, denver, dallas}

Generalized search path I: what are the cheapest flights from [boston | philadelphia | denver | dallas] to [boston | philadelphia | denver | dallas] that stop in atlanta

Bootstrapping

Generalized search path I: what are the cheapest flights from [boston | philadelphia | denver | dallas] to [boston | philadelphia | denver | dallas] that stop in atlanta

Paths that match it:

- what is the cheapest fare from denver to philadelphia and from pittsburgh to atlanta
- i would… like the cheapest airfare from boston to denver december twenty sixth
- show me the cheapest flight from philadelphia to dallas which arrives…

Bootstrapping

Generalized search path II: what are the cheapest [flight | flights | airfare | fare] from [boston | philadelphia | denver] to [denver | philadelphia | dallas] that stop in atlanta

Resulting pattern: _P2: the cheapest _E2 from _E3 to _E4, where

- _E2 = {flight, flights, airfare, fare}
- _E3 = {boston, philadelphia, denver}
- _E4 = {denver, philadelphia, dallas}

Bootstrapping

- Slide a context window of length L along the current search path
- Consider all sub-paths of length L that begin in a_1 and end in a_L
  - These are the candidate paths
- For each 1≤i≤L:
  - For each 1≤k≤L, k≠i:
    - Replace node k with the EC that contains node k and maximally overlaps the set of nodes at index k of the candidate paths
  - Continue as before

The ADIOS algorithm

- Initialization: load all data into a pseudograph
- Until no more patterns are found:
  - For each path P:
    - Create generalized search paths from P
    - Detect significant patterns using MEX
    - If found, add the best new pattern and equivalence classes and rewire the graph

Alternative rewiring tacks

- Single mode: as just mentioned, the best pattern is selected and added to the graph
- Multiple mode: all patterns from the current search path are added to the graph in order of significance
- Batch mode: the search is conducted over all paths, and the best patterns are added at the end

Another example

More Patterns

Evaluating performance

Define:

- Recall: the probability of ADIOS recognizing an unseen grammatical sentence
- Precision: the proportion of grammatical ADIOS productions

Recall can be assessed by leaving out some of the training corpus. Precision is trickier, unless we are learning a known CFG.
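Both measures reduce to simple proportions once an acceptance test is available. In the sketch below, `learner_accepts` and `target_accepts` are hypothetical callables standing in for the trained learner and for a known reference grammar; neither is an ADIOS API.

```python
def recall(held_out_sentences, learner_accepts):
    """Fraction of held-out grammatical sentences that the learner accepts."""
    accepted = sum(1 for s in held_out_sentences if learner_accepts(s))
    return accepted / len(held_out_sentences)

def precision(generated_sentences, target_accepts):
    """Fraction of learner-generated sentences that the reference grammar accepts."""
    grammatical = sum(1 for s in generated_sentences if target_accepts(s))
    return grammatical / len(generated_sentences)

# Toy stand-ins: each "grammar" is just a set of sentences it accepts.
learned = {"the dog barked", "a cat slept", "a dog slept"}
target = {"the dog barked", "a cat slept", "the cat barked", "a dog barked"}

held_out = ["the dog barked", "the cat barked"]
generated = ["the dog barked", "a dog slept"]

print(recall(held_out, lambda s: s in learned))
print(precision(generated, lambda s: s in target))
```

The need for `target_accepts` is exactly why precision is hard to measure on natural corpora and straightforward when the target is an artificial CFG.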

An ADIOS drawback

- ADIOS is inherently a heuristic and greedy algorithm
  - Once a pattern is created it remains forever, so errors compound
- Sentence ordering affects the outcome
  - Running ADIOS with different orderings gives patterns that 'cover' different parts of the grammar

An ad-hoc solution

- Train multiple learners on the corpus
  - Each on a different sentence ordering
  - This creates a 'forest' of learners
- To create a new sentence:
  - Pick one learner at random
  - Use it to produce the sentence
- To check the grammaticality of a given sentence:
  - If any learner accepts the sentence, declare it grammatical
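The forest logic can be sketched independently of how each learner is trained. `ToyLearner` below is a hypothetical stand-in with `generate()` and `accepts()` methods; a real learner would wrap a trained ADIOS model.

```python
import random

class ToyLearner:
    """Hypothetical stand-in for one trained learner: it knows a fixed set of sentences."""
    def __init__(self, sentences):
        self.sentences = list(sentences)

    def generate(self):
        return self.sentences[0]

    def accepts(self, sentence):
        return sentence in self.sentences

class Forest:
    """A collection of learners, each trained on a different sentence ordering."""
    def __init__(self, learners, seed=None):
        self.learners = list(learners)
        self.rng = random.Random(seed)

    def generate(self):
        # Pick one learner at random and let it produce the sentence.
        return self.rng.choice(self.learners).generate()

    def accepts(self, sentence):
        # Grammatical if any learner accepts it.
        return any(learner.accepts(sentence) for learner in self.learners)

forest = Forest([ToyLearner(["a dog barked"]), ToyLearner(["the cat slept"])], seed=0)
print(forest.accepts("the cat slept"), forest.accepts("a horse barked"))
```

Accepting on any single vote favours recall; a stricter vote threshold would be a different design choice from the one on the slide.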

The ADIOS executables

A C++/Linux implementation. There are 4 relevant executables:

- `adios.exe`: the actual implementation of the algorithm
- `create_graph.exe`: loads a corpus into the ADIOS pseudograph
- `scrambler.exe`: randomizes the order of sentences in a corpus
- `convert_grammar.exe`: converts a CFG to an ADIOS representation

Preparing the corpus

- Each path should be on a line of its own
- Each line starts with a '*' and ends with a '#'
  - These represent the BEGIN and END nodes, respectively
- Words (nodes) are separated by spaces

Examples:

- `* Jim and Cindy have a winning personality #`
- `* Beth won't be released until Friday #`
- `* a horse barked #`
- `* the dog loved a cat #`
- `* the cats are living very far away #`
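A small helper can put raw sentences into this layout. The sketch assumes whitespace tokenization and does nothing about punctuation or case; `to_adios_line` and `format_corpus` are hypothetical names.

```python
def to_adios_line(sentence):
    """Wrap one sentence in the '*' (BEGIN) and '#' (END) markers,
    with words separated by single spaces."""
    return "* " + " ".join(sentence.split()) + " #"

def format_corpus(sentences):
    """One path per line; empty sentences are skipped."""
    return "\n".join(to_adios_line(s) for s in sentences if s.strip())

raw = ["a horse   barked", "", "the dog loved a cat"]
print(format_corpus(raw))
```

The resulting text can be written to the corpus file that is later passed to `create_graph.exe`.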

Creating the graph

Done by `create_graph.exe`:

`./create_graph.exe -f corpus_file -o proj_name`

Two files will be created:

- `proj_name.idx`: an index file containing the list of nodes (the lexicon) and a numeric code for each node
- `proj_name.grp`: a text file describing the pseudograph

Running ADIOS

General usage:

`./adios.exe [-options] -o proj_name`

ADIOS continuously updates and saves the current graph and pattern files:

- `graph.dat`
- `patterns.dat`
- `sysparams.dat`

These files, along with the index file, are important for all other ADIOS operations.

Training ADIOS

To train, usually the following parameters are used:

`./adios.exe -a train -i proj.idx -g proj.grp -E 0.8 -S 0.01 -o proj`

Some parameters:

- `-a`: the action to perform (train / test / generate / print)
- `-i`: the index file name
- `-g`: the graph file name
- `-E`: eta (the threshold used by MEX), default 0.8
- `-S`: alpha (the significance level required by MEX), default 0.01
- `-o`: the project name, which will be used for output and log files

Some additional parameters

- `-W`: the context window width, default 5 (use 1000 for no ECs)
- `-r`: rewiring mode
  - 0: no rewiring
  - 1: single (the most commonly used)
  - 2: multiple
  - 3: batch (used for text segmentation)
- `-A`: largest pattern size; all patterns above this size will be treated as equal in the rewiring process (default 1)

Result files

- `proj.trace.log`: a summary of the algorithm's run
  - Includes several statistics throughout the processing of the corpus
- `proj.results.txt`: the set of patterns the algorithm has detected, along with a 'pattern spectra' analysis
  - Best viewed with Excel

Resuming training

If ADIOS stalls for some reason, or if you want to continue a run with different parameters (e.g. when incrementing alpha), use:

`./adios.exe -a train -i proj_name.idx`

ADIOS will use the existing `graph.dat`, `patterns.dat` and `sysparams.dat` files to resume its operation.

Testing ADIOS

`./adios.exe -a test -i idx_file -I test_file -R 10 -o proj`

- `-I`: the file containing the test sentences
- `-R`: determines the maximum depth of the parse trees. Paths that require deeper parse trees will not be accepted. Default value: 10.

Assumes `graph.dat` and `patterns.dat` are in the same directory.

Testing ADIOS

Output files:

- `proj.test.results.txt`: a detailed text file listing the partial parses of each test path
- `proj.test.summary.txt`: a summary file, listing for each test path the patterns accepted on it and whether it is accepted as a whole
- `proj.test.classify.txt`: a text file with a 0/1 result for each test path (number of accepted sentences = number of lines with a '1' in this file)

Testing multiple learners

- Running `adios.exe` on a second learner will not overwrite `proj.classify.txt`
- Each line will contain the number of learners that accepted the corresponding sentence
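Under the 'any learner accepts' rule, the per-sentence counts translate directly into an accepted total. The sketch assumes each line of the classification file holds a single non-negative integer, which the slides imply but do not state; `accepted_by_any` is a hypothetical helper that works on lines already read from the file.

```python
def accepted_by_any(classify_lines):
    """Return (accepted, total): a sentence counts as accepted
    when at least one learner accepted it."""
    counts = [int(line) for line in classify_lines if line.strip()]
    accepted = sum(1 for c in counts if c > 0)
    return accepted, len(counts)

# Hypothetical file contents after testing three learners on four sentences.
lines = ["3\n", "0\n", "1\n", "2\n"]
accepted, total = accepted_by_any(lines)
print(accepted, total, accepted / total)
```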

Generating new sentences

`./adios.exe -a generate -i proj.idx -n 100 -R 10 -o proj_name`

- `-i`: the index file
- `-n`: number of sentences to generate
- `-R`: maximum parse depth
- `-o`: project name

The generator's output

- The output file is `proj.generate.txt`
  - It will contain the new sentences in the ADIOS format
- Some sentences may be 'incomplete' because of the `-R` option
  - In these, a ~ symbol will appear
  - Before using the generated sentences, these should be removed
  - Use the 'sed' command as explained on the webpage
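The sed command itself is on the webpage and is not reproduced in the slides. As a stand-in, the filter below drops incomplete sentences in Python, assuming that 'these should be removed' means discarding every line that contains a ~ symbol; `drop_incomplete` is a hypothetical helper.

```python
def drop_incomplete(lines):
    """Keep only generated sentences without the '~' marker of an incomplete parse."""
    return [line for line in lines if "~" not in line]

# Hypothetical generator output in the ADIOS format.
generated = [
    "* the dog loved a cat #",
    "* a horse ~ #",
    "* the cats are living very far away #",
]
print(drop_incomplete(generated))
```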

Scrambling sentences

Before creating the graph, the sentences in the input corpus can be scrambled using `scrambler.exe`. Usage:

`./scrambler.exe -f input_file -o output_file`
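For readers without the executable at hand, the effect can be approximated in Python. This is not how `scrambler.exe` is implemented; it is a plain random shuffle of corpus lines, with a hypothetical `seed` parameter so that each learner's ordering is reproducible.

```python
import random

def scramble(lines, seed=None):
    """Return the corpus lines in a random order, leaving the input list untouched."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled

corpus = [
    "* a horse barked #",
    "* the dog loved a cat #",
    "* the cats are living very far away #",
]
# One ordering per learner, each from a different seed.
orderings = [scramble(corpus, seed=s) for s in range(3)]
print(orderings[0])
```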

Using an artificial CFG

- An artificial CFG in a proper format can be converted to an ADIOS representation
  - For testing precision/recall
  - Using `convert_grammar.exe`
- The CFG should be stored in two files:
  - `CFG_lex.txt`: a lexicon file, e.g. `TA1_lex.txt`
  - `CFG_grammar.txt`: the rewrite rules, e.g. `TA1_grammar.txt`

Convert grammar

Usage:

`./convert_grammar.exe -l lex_file -g grammar_file -o proj_name`

Output files:

- `proj_name.idx`: index file
- `graph.dat`: the graph
- `patterns.dat`: the patterns

Displaying patterns

- First print the ADIOS learner's results: `./adios.exe -a print -i proj.idx`
- Open Matlab and set its workspace to the ADIOS directory
- Use the `pattern.m` script: `pattern(123, 'proj_name')` will graphically display the pattern/EC whose ID is 123 from the project named proj_name

Using scripts

- Running ADIOS entails running many executables, potentially a large number of times
  - Slowly increasing alpha when segmenting text
  - Training multiple learners
- This process can be greatly streamlined using scripts
  - See `train.sh` for an example
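The contents of `train.sh` are not shown in the slides, so the following is not a rendering of it. It is a sketch of how the command lines for a gradually increasing alpha could be assembled, using only flags quoted in the slides above. Passing `-S` on a resumed run and the particular alpha schedule are assumptions, and `training_commands` is a hypothetical helper that only builds the argument lists without running anything.

```python
def training_commands(proj, alphas, eta=0.8):
    """Build one adios.exe argument list per alpha value.
    The first run starts from the graph file; later runs resume from the
    saved state (assumption: -S may be changed when resuming)."""
    commands = []
    for step, alpha in enumerate(alphas):
        if step == 0:
            cmd = ["./adios.exe", "-a", "train", "-i", f"{proj}.idx",
                   "-g", f"{proj}.grp", "-E", str(eta), "-S", str(alpha), "-o", proj]
        else:
            cmd = ["./adios.exe", "-a", "train", "-i", f"{proj}.idx", "-S", str(alpha)]
        commands.append(cmd)
    return commands

# Hypothetical alpha schedule.
for cmd in training_commands("proj", [0.01, 0.05, 0.1]):
    print(" ".join(cmd))
```

Each list could then be handed to `subprocess.run(cmd, check=True)` from the ADIOS directory, once the flags have been checked against the actual executable.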
