1 A Random Text Model for the Generation of Statistical Language Invariants Chris Biemann University of Leipzig, Germany HLT-NAACL 2007, Rochester, NY, USA Monday, April 23, 2007
2 Outline: Previous random text models; Large-scale measures for text; A novel random text model; Comparison to natural language text
3 Necessary property: Zipf's Law. Zipf: Ordering the words in a corpus by descending frequency, the relation between the frequency of a word at rank r and its rank is given by $f(r) \sim r^{-z}$, where z is the exponent of the power law and corresponds to the slope of the curve in a log-log plot. For word frequencies in NL, $z \approx 1$. Zipf-Mandelbrot: $f(r) \sim (r + c_1)^{-(1 + c_2)}$ approximates the lower frequencies at very high ranks.
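As a rough illustration of how the exponent z can be read off a rank-frequency list, the sketch below fits a least-squares line in log-log space. The function names `rank_frequency` and `zipf_exponent` are hypothetical, and a plain regression over all ranks is an assumed simplification, not the estimation procedure used in the talk.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word frequencies in descending order (index 0 is rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_exponent(freqs):
    """Estimate z as the negated least-squares slope of log f(r) over log r."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic frequencies constructed to follow f(r) = 1000 / r exactly.
ideal = [1000.0 / r for r in range(1, 201)]
z_ideal = zipf_exponent(ideal)  # close to 1 by construction
```

On real text the fit is usually restricted to a middle range of ranks, since the highest and lowest ranks deviate from a pure power law.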
4 Previous Random Text Models. B. B. Mandelbrot (1953): Sometimes called the "monkey at the typewriter". At each step, a word separator is generated with probability w, and each letter from an alphabet of size N is generated with probability (1-w)/N. H. A. Simon (1955): No alphabet of single letters. At each time step, a previously unseen new word is added to the stream with probability $\alpha$, whereas with probability $(1-\alpha)$ the next word is chosen amongst the words at previous positions. This yields a frequency distribution that follows a power law with exponent $z = (1-\alpha)$. Modified by Zanette and Montemurro (2002): –sublinear growth for higher exponents –Zipf-Mandelbrot law by a maximum probability threshold
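The two earlier models can be sketched in a few lines each. This is a minimal illustration under stated assumptions: the names `mandelbrot_text` and `simon_text`, the three-letter alphabet, the parameter name `alpha` for Simon's new-word probability, and the use of integer word IDs are all choices made here, not taken from the talk.

```python
import random


def mandelbrot_text(n_chars, alphabet="ABC", w=0.2, seed=0):
    """Monkey at the typewriter: a separator with probability w,
    otherwise a uniformly chosen letter, i.e. (1-w)/N per letter."""
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < w:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    # split() discards the empty words produced by consecutive separators.
    return "".join(chars).split()


def simon_text(n_words, alpha=0.1, seed=0):
    """Simon's model: with probability alpha emit a new word (a fresh ID),
    otherwise copy the word at a uniformly chosen earlier position."""
    rng = random.Random(seed)
    stream = [0]          # the stream starts with one word, ID 0
    next_id = 1
    for _ in range(n_words - 1):
        if rng.random() < alpha:
            stream.append(next_id)
            next_id += 1
        else:
            # Frequent words occupy more positions, so they are copied more often.
            stream.append(rng.choice(stream))
    return stream


monkey_words = mandelbrot_text(5000)
simon_words = simon_text(5000)
```

Copying from a random earlier position is what makes frequent words more likely to recur in Simon's model, while the Mandelbrot generator never looks at the stream it has produced.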
5 Critique of Previous Models. Mandelbrot: All words of the same length are equiprobable, as all letters are equiprobable. Ferrer i Cancho and Solé (2002): Initialisation with letter probabilities obtained from natural language text solves this problem, but where do these letter frequencies come from? Simon: No concept of "letter" at all. Both: –no concept of sentence –no word order restrictions: Simon = bag of words; Mandelbrot does not take the generated stream into account at all
6 Large-scale Measures for Text. Zipf's law and lexical spectrum: the rank-frequency plot should follow a power law with $z \approx 1$; the frequency spectrum (probability of frequencies) should follow a power law with $z \approx 2$ (Pareto distribution). Word length: Should be distributed as in natural language text, according to a variant of the gamma distribution (Sigurd et al. 2004). Sentence length: Should also be distributed as in NL, following the same gamma distribution. Significant neighbour-based co-occurrence graph: Should be similar in terms of degree distribution and connectivity in random text and NL.
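The lexical spectrum mentioned above can be computed directly from token counts. The sketch below is a minimal version; the function name `lexical_spectrum` is hypothetical, and normalising by the number of word types is an assumption about what "probability of frequencies" means here.

```python
from collections import Counter


def lexical_spectrum(tokens):
    """Map each frequency f to the fraction of word types occurring exactly f times."""
    type_freqs = Counter(tokens)                   # word -> frequency
    freq_of_freqs = Counter(type_freqs.values())   # frequency -> number of types
    n_types = len(type_freqs)
    return {f: c / n_types for f, c in sorted(freq_of_freqs.items())}


spectrum = lexical_spectrum("a a a b b c d".split())
# {1: 0.5, 2: 0.25, 3: 0.25}
```

Plotting these values on log-log axes gives the curve that, per the slide, should fall off with an exponent of about 2.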
7 A Novel Random Text Model. Two parts: Word Generator and Sentence Generator. Both follow the principle of beaten tracks: –Memorize what has been generated before. –Generate with higher probability what has been generated more often before. Inspired by Small World network generation, especially (Kumar et al. 1999).
8 Word Generator. Initialisation: –Letter graph of N letters. –Vertices are connected to themselves with weight 1. Choice: –When generating a word, the generator chooses a letter x according to its probability P(x), which is computed as the normalized weight sum of outgoing edges: $P(x) = \frac{\mathrm{weightsum}(x)}{\sum_{v \in V} \mathrm{weightsum}(v)}$ with $\mathrm{weightsum}(y) = \sum_{u \in \mathrm{neigh}(y)} \mathrm{weight}(y, u)$. Parameter: –At every position, the word ends with probability $w \in (0,1)$, or a next letter is generated according to the letter production probability given above. Update: –For every letter bigram, the weight of the directed edge between the preceding and the current letter in the letter graph is increased by one. Effect: self-reinforcement of letter probabilities: –the more often a letter is generated, the higher its weight sum will be in subsequent steps, –leading to an increased generation probability.
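The word generator described on this slide can be sketched as follows. The class name `WordGenerator`, the three-letter alphabet, updating weights immediately after each letter, and placing the end-of-word check after each letter (so that no word is empty) are assumptions made for this illustration; the slide does not fix these details.

```python
import random
from collections import defaultdict


class WordGenerator:
    """Letter graph with self-reinforcing edge weights (illustrative sketch)."""

    def __init__(self, letters="ABC", w=0.4, seed=0):
        self.letters = list(letters)
        self.w = w                      # word end probability
        self.rng = random.Random(seed)  # seeded for reproducibility
        self.weight = defaultdict(int)  # (from_letter, to_letter) -> weight
        for x in self.letters:
            self.weight[(x, x)] = 1     # initial self-loops of weight 1

    def weight_sum(self, x):
        """Sum of the weights of all edges leaving letter x."""
        return sum(self.weight[(x, y)] for y in self.letters)

    def letter_probabilities(self):
        """P(x): weight sum of x, normalised over all letters."""
        sums = {x: self.weight_sum(x) for x in self.letters}
        total = sum(sums.values())
        return {x: s / total for x, s in sums.items()}

    def generate_word(self):
        word = []
        while True:
            probs = self.letter_probabilities()
            letter = self.rng.choices(
                self.letters, weights=[probs[x] for x in self.letters]
            )[0]
            if word:
                # Update: reinforce the edge preceding letter -> current letter.
                self.weight[(word[-1], letter)] += 1
            word.append(letter)
            # Assumption: the end check follows each letter, so words are non-empty.
            if self.rng.random() < self.w:
                return "".join(word)


gen = WordGenerator(letters="ABC", w=0.4, seed=42)
words = [gen.generate_word() for _ in range(1000)]
```

Because each bigram raises the outgoing weight sum of its first letter, letters that happen to be drawn early become steadily more probable, which is the self-reinforcement effect named on the slide.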
9 Word Generator Example. The small numbers next to the edges are edge weights. The probabilities of the letters for the next step are P(A)=0.4, P(B)=0.4, P(C)=0.2.
10 Measures on the Word Generator Word Generator fulfills measures much better than the Mandelbrot model. For other measures, we need something extra...
11 Sentence Generator I. Initialisation: –The word graph is initialized with a begin-of-sentence (BOS) and an end-of-sentence (EOS) symbol, with an edge of weight 1 from BOS to EOS. Word graph (directed): –Vertices correspond to words. –Edge weights correspond to the number of times two words were generated in sequence. Generation: –A random walk on the directed edges starts at the BOS vertex. –With probability (1-s), where s is the new word probability, an existing edge is followed from the current vertex to the next vertex. –The probability of choosing endpoint X from the endpoints of all outgoing edges from the current vertex C is given by $P(X \mid C) = \frac{\mathrm{weight}(C, X)}{\sum_{Y \in \mathrm{neigh}(C)} \mathrm{weight}(C, Y)}$.
12 Sentence Generator II. Parameter: –With probability $s \in (0,1)$, a new word is generated by the word generator model. –The next word is then chosen from the word graph in proportion to its weighted indegree: the probability of choosing an existing vertex E as successor of a newly generated word N is given by $P(E \mid N) = \frac{\mathrm{indgw}(E)}{\sum_{v \in V} \mathrm{indgw}(v)}$, where $\mathrm{indgw}(X) = \sum_{u} \mathrm{weight}(u, X)$ is the weighted indegree of X. Update: –For each sequence of two words generated, the weight of the directed edge between them is increased by 1.
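The random walk of the sentence generator can be sketched as below. Several details are assumptions made for this illustration and are not stated on the slides: the class name `SentenceGenerator`, applying edge updates immediately rather than at the end of a sentence, counting the fresh in-edge of a new word when choosing its successor, and the stand-in `simple_word` function, which draws letters uniformly instead of using the reinforced letter graph.

```python
import random
from collections import defaultdict

BOS, EOS = "<BOS>", "<EOS>"


def simple_word(rng, letters="ABC", w=0.4):
    """Stand-in word source (assumption): uniform letters, end probability w."""
    word = rng.choice(letters)
    while rng.random() >= w:
        word += rng.choice(letters)
    return word


class SentenceGenerator:
    def __init__(self, new_word, s=0.08, seed=0):
        self.new_word = new_word   # callable returning a word string
        self.s = s                 # new word probability
        self.rng = random.Random(seed)
        self.out = defaultdict(lambda: defaultdict(int))  # out[c][x] = weight(c, x)
        self.indegree = defaultdict(int)                  # weighted indegree
        self._add_edge(BOS, EOS)   # initial edge of weight 1

    def _add_edge(self, a, b):
        self.out[a][b] += 1
        self.indegree[b] += 1

    def _weighted_choice(self, table):
        keys = list(table)
        return self.rng.choices(keys, weights=[table[k] for k in keys])[0]

    def generate_sentence(self):
        sentence, current = [], BOS
        while current != EOS:
            if self.rng.random() < self.s:
                # New word, then a successor chosen by weighted indegree.
                word = self.new_word()
                self._add_edge(current, word)
                sentence.append(word)
                nxt = self._weighted_choice(self.indegree)
                self._add_edge(word, nxt)
            else:
                # Follow an existing edge in proportion to its weight.
                nxt = self._weighted_choice(self.out[current])
                self._add_edge(current, nxt)
            if nxt != EOS:
                sentence.append(nxt)
            current = nxt
        return sentence


word_rng = random.Random(1)
sgen = SentenceGenerator(lambda: simple_word(word_rng), s=0.08, seed=1)
sentences = [sgen.generate_sentence() for _ in range(200)]
non_empty = [sent for sent in sentences if sent]  # empty sentences are dropped
```

Early on, most walks go straight from BOS to EOS and yield empty sentences; as new words attach to the graph, the walk has more paths to take before it reaches EOS.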
13 Sentence Generator Example In the last step, the second CA was generated as a new word from the word generator. The generation of empty sentences happens frequently. These are omitted in the output.
14 Comparison to Natural Language. Corpus for comparison: the first 1 million words of the BNC, spoken English. 26 letters, uppercase, punctuation removed (same in the word generator). 125,395 sentences (set s=0.08, remove the first 50K sentences). Average sentence length: words. Average word length: letters (w=0.4). Sample: OOH ERM WOULD LIKE A CUP OF THIS ER MM SORRY NOW THAT S NO NO I DID NT I KNEW THESE PEWS WERE HARD OOH I DID NT REALISE THEY WERE THAT BAD I FEEL SORRY FOR MY POOR CONGREGATION
15 Word Frequency: Zipf-Mandelbrot distribution; smooth curve; similar to English
16 Word Length: More 1-letter words in the sentence generator; longer words in the sentence generator; curve is similar. Gamma distribution here: $f(x) \sim x^{1.5} \cdot 0.45^{x}$
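To see what shape this gamma variant implies for word lengths, the sketch below normalises $x^{1.5} \cdot 0.45^{x}$ over integer lengths. The function name `gamma_variant_pmf`, the parameter names `a` and `b`, and the truncation at `max_len` are assumptions made for the illustration.

```python
def gamma_variant_pmf(max_len=30, a=1.5, b=0.45):
    """Normalise f(x) = x**a * b**x over the lengths 1..max_len."""
    raw = {x: (x ** a) * (b ** x) for x in range(1, max_len + 1)}
    total = sum(raw.values())
    return {x: v / total for x, v in raw.items()}


pmf = gamma_variant_pmf()
mode = max(pmf, key=pmf.get)  # most probable length under these parameters
```

With a = 1.5 and b = 0.45 the unnormalised values are 0.45 at length 1, about 0.573 at length 2, and about 0.474 at length 3, so the curve peaks at two letters and then decays.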
17 Sentence Length: Longer sentences in English; more 2-word sentences in English; curve is similar
18 Neighbor-based Co-occurrence Graph. Min. co-occurrence frequency = 2, min. log likelihood ratio = 3.84. The NB-graph is a small world. Qualitatively, English and the sentence generator are similar. The word generator shows far fewer co-occurrences. Factor 2 in clustering coefficient and number of vertices. [Table: columns English sample, sentence gen., word gen., random graph (ER); rows number of vertices, avg. shortest path, avg. degree, clustering coefficient, z]
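A significance filter of this kind could be sketched as follows, using the log likelihood ratio (G²) of a 2x2 contingency table for each pair of adjacent words. The function names, the treatment of pairs as ordered, and the exact construction of the contingency table are assumptions made here; the slide gives only the two thresholds.

```python
import math
from collections import Counter


def _xlogx(k):
    return k * math.log(k) if k > 0 else 0.0


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 = 2 * sum(O * ln(O / E)) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells - rows - cols + _xlogx(n))


def significant_neighbours(sentences, min_freq=2, min_llr=3.84):
    """Return the adjacent word pairs that pass both thresholds."""
    pair_counts = Counter()
    for sent in sentences:
        pair_counts.update(zip(sent, sent[1:]))
    left, right = Counter(), Counter()
    for (a, b), c in pair_counts.items():
        left[a] += c    # how often a occurs as the first word of a pair
        right[b] += c   # how often b occurs as the second word of a pair
    n = sum(pair_counts.values())
    edges = set()
    for (a, b), k11 in pair_counts.items():
        if k11 < min_freq:
            continue
        k12 = left[a] - k11
        k21 = right[b] - k11
        k22 = n - k11 - k12 - k21
        if log_likelihood_ratio(k11, k12, k21, k22) >= min_llr:
            edges.add((a, b))
    return edges


toy = [["A", "B"]] * 5 + [["C", "D"]] * 5
toy_edges = significant_neighbours(toy)
```

The threshold 3.84 is the 5% critical value of the chi-square distribution with one degree of freedom. Note that G² by itself does not distinguish attraction from repulsion, so a fuller version would also check that the observed count exceeds the expected one.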
19 Formation of Sentences Word graph grows and contains the full vocabulary used so far for generating in every time step. Random walks starting from BOS always end in EOS. Sentence length slowly increases: random walk has more possibilities before finally arriving at the EOS vertex. Sentence length is influenced by both parameters of the model: –the word end probability w in the word generator –the new word probability s in the sentence generator.
20 Conclusion. The novel random text model: obeys Zipf's law; obeys the word length distribution; obeys the sentence length distribution; shows similar nb-cooccurrence data. First model that: produces a smooth lexical spectrum without initial letter probabilities; incorporates the notion of a sentence; models word order restrictions.
21 Sentence generator at work Beginning: Q. U. RFXFJF. G. G. U. R. U. RFXFJF. XXF. RFXFJF. U. QYVHA. RFXFJF. R TCW. CV. Z U. G. XXF. RFXFJF. M XXF. Q. G. RFXFJF. U. RFXFJF. RFXFJF. Z U. G. RFXFJF. RFXFJF. M XXF. R. Z U. Later: X YYOXO QO OEPUQFC T TYUP QYFA FN XX TVVJ U OCUI X HPTXVYPF. FVFRIK. Y TXYP VYFI QC TPS Q UYYLPCQXC. G QQE YQFC XQXA Z JYQPX. QRXQY VCJ XJ YAC VN PV VVQF C XJN JFEQ QYVHA. U VIJ Q YT JU OF DJWI QYM U YQVCP QOTE OD XWY AGFVFV U XA YQYF AVYPO CDQQ TY NTO FYF QHT T YPXRQ R GQFRVQ. MUHVJ Q VAVF YPF QPXPCY Q YYFRQQ. JP VGOHYY F FPYF OM SFXNJJ A VQA OGMR L QY. FYC T PNXTQ. R TMQCQ B QQTF J PVX YT DTYO RXJYYCGFJ CYFOFUMOCTM PQRYQQYC AHXZQJQ JTW O JJ VX QFYQ YTXJTY YTYYFXK. RFXFJF JY XY RVV J YURQ CM QOXGQ QFMVGPQ. OY FDXFOXC. N OYCT. L MMYMT CY YAQ XAA J YHYJ MPQ XAQ UYBX RW XXF O UU COF XXF CQPQ VYYY XJ YACYTF FN. TA KV XJP O EGV J HQY KMQ U.
22 Questions? Danke sdf sehr gf thank fdgf you g fd tusen sd ee takk erte dank we u trew wel wwd muchas werwe ewr gracias werwe rew merci mille werew re ew ee ew grazie d fsd ffs df d fds spassiva fs fdsa rtre trerere rteetr trpemma eedm