Presentation on theme: "Quantitative aspects of literary texts Adam J. Callahan & Gary E. Davis Department of Mathematics University of Massachusetts."— Presentation transcript:
Quantitative aspects of literary texts. Adam J. Callahan & Gary E. Davis, Department of Mathematics, http://physicsoftext.wordpress.com, University of Massachusetts Dartmouth. Sigma Xi Research Exhibition, April 29th & 30th, 2008
Type-token ratio The distribution of word frequencies in text has been studied extensively, from at least the time of Zipf in 1936 to the present. For a text, the type-token ratio is $T(n) = \mathrm{types}(n)/n$, where types(n) is the number of word types in the first n words of the text. The type-token ratio is just the running average of the number of new words in an initial text segment of length n. Typical decay of the type-token ratio with the number of words, n: What sort of curve is this? These data are for the text With the Turks in Palestine, by A. Aaronsohn.
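The definition above (number of word types in the first n words, divided by n) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `type_token_ratios` and the toy tokenization (lowercase, split on whitespace) are assumptions made for the example.

```python
def type_token_ratios(words):
    """Return the list [types(n) / n for n = 1..len(words)]."""
    seen = set()      # word types encountered so far
    ratios = []
    for n, word in enumerate(words, start=1):
        seen.add(word)
        ratios.append(len(seen) / n)
    return ratios


# Toy text; the tokenization below is an assumption for illustration only.
text = "the cat sat on the mat and the dog sat on the cat"
words = text.lower().split()
ratios = type_token_ratios(words)
print(ratios[-1])  # 7 types in 13 words
```

For a real text the ratio starts at 1 and decays as words begin to repeat, which is the curve the slide asks about.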
Power laws A log-log plot of the type-token ratio $T(n)$ versus $n$ yields a good straight-line fit ($r^2 = 0.964$): This gives an analytical expression for the type-token ratio: $T(n) \approx A\,n^{-d}$. In the case of the Aaronsohn text, $A \approx 3.150$ and $d \approx 0.270$. This is an approximate power-law decay of the type-token ratio with the number of words. This line might not look geometrically quite straight, but the correlation coefficient is quite high: r = 0.982.
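A straight-line fit on log-log axes can be sketched as an ordinary least-squares regression of the log of the type-token ratio on log n: the slope gives the exponent d (with sign reversed) and the intercept gives log A. The helper names `linear_fit` and `fit_power_law` are hypothetical. The data below are synthetic values generated from an exact power law, not the Aaronsohn text, so they only show that the procedure recovers known parameters.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; also returns r^2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept, sxy * sxy / (sxx * syy)


def fit_power_law(ratios):
    """Fit ratio(n) ~ A * n**(-d) by regressing log(ratio) on log(n)."""
    xs = [math.log(n) for n in range(1, len(ratios) + 1)]
    ys = [math.log(t) for t in ratios]
    slope, intercept, r2 = linear_fit(xs, ys)
    return math.exp(intercept), -slope, r2


# Synthetic series from an exact power law (illustrative values only).
synthetic = [3.15 * n ** -0.27 for n in range(1, 2001)]
A, d, r2 = fit_power_law(synthetic)
print(round(A, 3), round(d, 3))
```

On real type-token data the fit would be run on the measured ratios instead of `synthetic`, and $r^2$ would fall below 1.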
Very slowly varying tails A power law for the type-token ratio $T(n)$, namely $T(n) \approx A\,n^{-d}$, says that the product $n^{d}\,T(n)$ should be approximately constant, equal to A. A plot of $n^{d}\,T(n)$ versus n shows that, typically, this is only true from some point on: How can we determine a “turnover” point n*, beyond which the type-token ratio is a genuine power law? The apparent downward slope from about 5000 words on is something of an illusion due to scale: the slope of the line is approximately 0.000042, a slope of 1 in 24,000.
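The constancy check described above can be sketched by multiplying each type-token ratio by n raised to the fitted exponent d and looking at how much the result varies. The function name `flattened_ratio` is hypothetical, and the input is a synthetic exact power law, so the product is constant up to rounding; for a real text it would level off only beyond the turnover point.

```python
def flattened_ratio(ratios, d):
    """Return [n**d * ratio(n) for n = 1..len(ratios)]."""
    return [n ** d * t for n, t in enumerate(ratios, start=1)]


# Synthetic exact power law (illustrative values, not measured data).
A, d = 3.15, 0.27
exact = [A * n ** -d for n in range(1, 1001)]
flat = flattened_ratio(exact, d)

# For an exact power law the spread is at rounding-error level.
spread = max(flat) - min(flat)
print(spread < 1e-9)
```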
Regression coefficient analysis We plot the $r^2$ for a straight-line fit to $\log T(n)$ versus $\log n$ for $n \ge n_0$, against $n_0$, where $T(n)$ is the type-token ratio: For the Aaronsohn text we see a local maximum for $r^2$ of 0.9975 (r = 0.9988) at n* = 4293. The corresponding least-squares value for the index d is 0.383. For $n \ge n^*$, $T(n) \approx A\,n^{-d}$ with $r^2 \approx 1$, an almost perfect fit to a power law. For $n < n^*$, the type-token ratio is better described as a decreasing logarithmic function of n.
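One way to sketch this scan is to recompute $r^2$ for the log-log fit on the tail n ≥ n0 for a grid of starting points n0, then look for where $r^2$ peaks. The names `r_squared` and `tail_r_squared` are hypothetical, and the test series is synthetic: it is built to decay logarithmically before n = 200 and as a power law afterwards, an assumption chosen to mimic the behavior described above rather than data from any text.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation of xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def tail_r_squared(ratios, n0_values):
    """Map each n0 to the r^2 of log(ratio) versus log(n) over n >= n0."""
    total = len(ratios)
    log_n = [math.log(n) for n in range(1, total + 1)]
    log_t = [math.log(t) for t in ratios]
    return {n0: r_squared(log_n[n0 - 1:], log_t[n0 - 1:]) for n0 in n0_values}


# Synthetic series: logarithmic decay below N_STAR, power law from N_STAR on.
N_STAR, A, D = 200, 3.15, 0.38
t_star = A * N_STAR ** -D
b = (1.0 - t_star) / math.log(N_STAR)   # makes the two pieces meet at N_STAR
ratios = [1.0 - b * math.log(n) if n < N_STAR else A * n ** -D
          for n in range(1, 1001)]

scan = tail_r_squared(ratios, range(1, 801, 10))
print(scan[1] < scan[201])
```

This naive version refits from scratch for every n0, which is fine for a sketch. A long text would call for running sums or a coarser grid.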
Entropy The i-th word has a relative frequency of occurrence $f(i,n)$ in the first n words of a text. We regard $f(i,n)$ as the probability of occurrence of the i-th word in the first n words of a text. For this probability distribution the Shannon entropy of the initial segment of text of length n is $H(n) = -\sum_i f(i,n)\,\log f(i,n)$. This amounts to treating each initial text segment as a self-contained text, for statistical purposes, with the aim of examining how the entropy changes as the text is enlarged by the addition of a new word or a previously used word. We examined the variation of H(n) with n for a variety of literary texts. When a new word is added to an existing text segment, the entropy necessarily increases. When a previously used word is added to an initial segment of text, the entropy will generally rise if the word is used rarely, but fall if it is used often. How does the entropy H(n) vary with n? Empirically we find that H(n) increases approximately logarithmically with n ($r^2 = 0.923$ for the Aaronsohn text): The average statistical “surprise” of the text, with the addition of a new or previously used word, rises approximately logarithmically with the length of the text.
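The entropy of every initial segment can be computed in one pass. If word type i occurs c_i times in the first n words, its relative frequency is c_i / n, and the Shannon entropy can be rewritten as log n minus (1/n) times the sum of c_i log c_i. The sketch below keeps that sum up to date as each word arrives. The function name `running_entropy` is hypothetical, and the natural logarithm is an assumption, since the slides do not state the base.

```python
import math
from collections import Counter


def running_entropy(words):
    """Return [H(1), ..., H(N)] in nats for the initial segments of `words`."""
    counts = Counter()
    s = 0.0            # running sum of c * log(c) over all word types
    entropies = []
    for n, word in enumerate(words, start=1):
        c = counts[word]
        if c > 0:
            s -= c * math.log(c)       # remove this word's old contribution
        c += 1
        counts[word] = c
        s += c * math.log(c)           # add its new contribution
        entropies.append(math.log(n) - s / n)
    return entropies


# Repeating a common word lowers H; a brand-new word raises it.
h = running_entropy(["a", "b", "a", "a", "c"])
print(h[3] < h[2], h[4] > h[3])
```

In the toy sequence, the fourth word repeats an already common word and the entropy falls, while the fifth word is new and the entropy rises, matching the behavior described on the slide.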
The Voynich manuscript The Voynich manuscript is MS 408 of the Beinecke library at Yale University. It is a still-mysterious, undeciphered manuscript written using unusual symbolic forms, but apparently representing a text with linguistic structure [G. Landini, Evidence of linguistic structure in the Voynich manuscript using spectral analysis, Cryptologia, (2001)]. Using the Takahashi transcription of these symbolic forms, we plotted the entropy H(n) of the first n words of the Voynich text as a function of n. As for all the other texts we examined, H(n) varies approximately logarithmically with n. However, there is a significantly large block of the Voynich text, about 5000 words starting after the first 12,000 words (approximately 16% of the total text), for which the entropy decreases. This necessarily indicates a large degree of repetition of words that have been used significantly often in the text before this point. The Voynich text is becoming significantly statistically less surprising between 12,000 and 17,000 words.
The Voynich manuscript also shows unusual behavior when we plot the $r^2$ for a straight-line fit to $\log T(n)$ versus $\log n$ for $n \ge n_0$, against $n_0$, where $T(n)$ is the type-token ratio: These successive local maxima and local minima in the plot suggest a variety of different stages of usage of new word types throughout the manuscript. A similar, but less variable, situation holds for Darwin’s Origin of Species: The dip around 53,000 words is approximately where Darwin starts Chapter 6: “Difficulties on Theory”. The shaded area corresponds approximately to the region of decreasing entropy.
Distribution of log returns The distribution of word frequencies is highly skewed, a fact well known even before Zipf quantified it in 1936. Borrowing an idea from finance, we look at the distribution of the log returns $\log\big(f(i+1)/f(i)\big)$, where $f(i)$ is the frequency of the i-th word in the entire text. This distribution is typically highly symmetric, with mean close to 0, but with low kurtosis (broad shoulders), reminiscent of a modified raised cosine distribution rather than a normal distribution:
[Figure panels: Distribution of frequencies; Distribution of -log frequencies; Distribution of log returns]
References
G. K. Zipf, Human Behaviour and the Principle of Least Effort, Addison-Wesley, 1949.
G. Landini, Evidence of linguistic structure in the Voynich manuscript using spectral analysis, Cryptologia, 4 (2001).
L. L. Goncalves, L. B. Goncalves, Fractal power laws in literary English, Physica A, 360 (2), 557-575 (2006).
S. I. Resnick, Heavy-Tail Phenomena, Springer, 2007.
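As in finance, a log return is the logarithm of the ratio of consecutive values, here the frequencies of consecutive words. The sketch below makes one loud assumption: word types are indexed in order of first appearance in the text. The slides do not state the ordering, and a frequency-ranked ordering could not give a symmetric distribution with mean near 0, since every return would then be non-positive. The names `log_returns` and `excess_kurtosis` are hypothetical.

```python
import math
from collections import Counter


def log_returns(words):
    """Log ratios of consecutive word-type frequencies.

    Assumption: types are ordered by first appearance in `words`
    (Counter keeps insertion order in Python 3.7+).
    """
    counts = Counter(words)
    total = len(words)
    freqs = [c / total for c in counts.values()]
    return [math.log(b / a) for a, b in zip(freqs, freqs[1:])]


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal law)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


# Toy text: frequencies are a = 3/6, b = 1/6, c = 2/6.
returns = log_returns(["a", "a", "b", "a", "c", "c"])
print(returns)
print(excess_kurtosis(returns))
```

A negative excess kurtosis on a real text would correspond to the broad-shouldered shape described above.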