1 A Variety of Literary Puzzles 1. The Works of Homer The Iliad and the Odyssey are generally attributed to a single individual named Homer. But both are derived from a long oral tradition, and it is not clear whether they are ultimately the work of a single author (or group of authors). Since historical evidence is almost non-existent in this case, the only way of addressing the issue is to look at internal evidence within the texts.
2 2. The Letters of Saint Paul The New Testament contains a large number of letters traditionally attributed to Saint Paul, but it is not clear whether he wrote them all. In particular, the Letter to the Hebrews has long been viewed with suspicion on theological grounds, but these are hard to make precise. It is interesting to ask whether such theological speculations can be backed up with hard statistical data derived from textual analysis (e.g. does Hebrews have similar word frequencies and sentence lengths to Paul's known letters?).
3 3. The Dialogues of Plato Plato developed his philosophy in the form of dialogues, putting his own doctrines into the mouth of his teacher Socrates. Little of Socrates' own work has survived, and there is clear evidence of development in the philosophy of Plato's dialogues in various respects, but in which direction did the development take place? Knowing the order of his dialogues could be crucial for the understanding of his ideas. Stylometric methods have been used to try to place the dialogues in the correct order.
4 4. The Ethics of Aristotle There are two famous works of Ethics by Aristotle, both transcribed from lecture notes by others. These are the Eudemian Ethics and the Nicomachean Ethics. However, books 5-7 of the Nicomachean Ethics are identical to books 4-6 of the Eudemian Ethics, raising the question of which of the two sources these books originally came from. This issue is of philosophical as well as literary interest, and has been settled using stylometric research by philosopher Anthony Kenny.
5 5. The Plays of Shakespeare Notoriously, Shakespeare's plays have been attributed to a variety of authors, e.g. Bacon, Marlowe and various noblemen. There are also a number of plays (for example Edward III and The Two Noble Kinsmen) supposed by some to have been written by Shakespeare alone or co-authored by him. Stylometry has the potential to bring objective analysis to bear on such (heated!) controversies, and was first attempted by Mendenhall in 1901 (without the benefit of computers, of course!).
6 6. The Federalist Papers The Federalist Papers are a series of articles published in with the aim of promoting the ratification of the new constitution. They were written by three authors, Jay, Hamilton and Madison, under the pseudonym Publius. Some of the papers are of known (and in some cases joint) authorship but others are disputed. Stylometric methods were famously brought to bear by Mosteller and Wallace in the early 1960s in an attempt to resolve the disputed cases. The question is now considered settled.
7 The Signature Stylometric Program The Signature computer program provides a number of tools for literary detection. It can be downloaded from together with various resource files (e.g. Federalist Papers, various novels, Shakespearean texts). Unpack all these into a suitable base directory (e.g. C:\Signature), run signature.exe, and use Load... from the File menu to select your texts. When loading texts, note that you can select many at once by control-clicking, i.e. holding down the Ctrl key while you select the texts you want.
8 Frequency Tests within Signature Having loaded your texts, highlight them within the file list box in the usual way (again control-clicking if necessary to select several). As you do this, an appropriate graph and table will display automatically on the right-hand side of the screen. Five different graphs (and tables) can be chosen using the tabs at the bottom of the screen: these display the frequency of word length, sentence length, paragraph length, letters, and punctuation. Analysis of specific word frequencies is more advanced, and will be dealt with later.
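The kind of counting behind the word length and sentence length graphs can be sketched in a few lines of Python. This is only an illustration, not Signature's own code: the function names and the rules for what counts as a word or a sentence ending are assumptions made here.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, possibly with internal apostrophes.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def word_length_counts(text):
    """Count how many words of each length occur in the text."""
    return Counter(len(w.replace("'", "")) for w in WORD.findall(text))

def sentence_length_counts(text):
    """Count sentences by their length in words."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return Counter(len(WORD.findall(s)) for s in sentences)

sample = "Sing, goddess, the anger of Achilles. It was a ruinous anger!"
print(sorted(word_length_counts(sample).items()))
print(sorted(sentence_length_counts(sample).items()))
```

A real text would need more careful handling of abbreviations, numbers and hyphenated words than this sketch attempts.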
9 Comparing Texts by Eye The graphs produced by the frequency tests can give a good visual impression of the similarities and differences between texts. This is usually much more striking if you select Display percentages rather than absolute frequencies (using the radio buttons at the top), as then the relative frequencies within the various texts appear on a matching scale. Graphs can be displayed flat or deep using the 2D or 3D buttons. Options gives access to many other display possibilities.
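To see why percentages put texts of different sizes on a matching scale, here is a minimal Python sketch. The function name and the two sets of counts are invented for illustration and do not come from Signature or from any real text.

```python
from collections import Counter

def to_percentages(counts):
    """Convert absolute frequencies into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: 100.0 * n / total for key, n in sorted(counts.items())}

# Hypothetical word length counts: a 40-word text and a 1000-word text.
short_text = Counter({2: 10, 3: 20, 4: 10})
long_text = Counter({2: 250, 3: 500, 4: 250})

print(to_percentages(short_text))
print(to_percentages(long_text))
```

The absolute counts differ by a factor of 25, but the two percentage profiles are identical, which is what makes the shapes of the graphs directly comparable.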
10 Adding Rigour (1) - large data Simply comparing one text with another will not necessarily give a reliable indication of common (or different) authorship – any single text, especially a short one, could be idiosyncratic for many different reasons. Tests are far more reliable if they are carried out on large bodies of data, so if you want to check, say, a play for possible authorship by Shakespeare, it is far better to test it against ALL the relevant Shakespearean corpus (e.g. all the known tragedies, or comedies, or even all his plays) rather than just one other play.
11 Combining Texts To treat several texts as a single corpus, select them with the mouse and click on the Combine Files into Corpus button under the file list box. You'll be invited to give a name to the corpus: take this opportunity to give a meaningful name (e.g. Austen, Shakespeare tragedies) as this will appear in your graphs and tables. (To change a corpus name or a file's alias, double-click on it.) All corpora will appear in the lower box. These can be graphed exactly as though they were single files (so you can combine files and corpora within a graph; to deselect all, right-click the relevant box).
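Conceptually, combining files into a corpus amounts to pooling their frequency counts. The Python sketch below shows the idea; the function name and the counts are hypothetical, and Signature may well do this differently (for example by re-reading the texts).

```python
from collections import Counter

def combine_into_corpus(name, file_counts):
    """Pool the per-file frequency counts into one named corpus."""
    total = Counter()
    for counts in file_counts:
        total.update(counts)  # adds counts category by category
    return name, total

# Invented word length counts for two files.
hamlet = Counter({3: 120, 4: 90})
macbeth = Counter({3: 80, 4: 60, 5: 10})

name, corpus = combine_into_corpus("Shakespeare tragedies", [hamlet, macbeth])
print(name, sorted(corpus.items()))
```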
12 Adding Rigour (2) - Statistics You can add more rigour to your stylometric tests by replacing mere comparison of graphs by eye (which can be extremely misleading) with objective statistical measurement. For this purpose the Signature system provides an option under the Statistics menu that performs a Chi-Square comparison test. It is not necessary to understand the details of how this test works (the calculations can just be left to the computer), but you do need to be able to interpret the results that it yields.
13 Principles of Statistical Testing Most statistical testing works on the principle of calculating some measure (in this case, a measure of how far two texts differ in some respect) which is known to follow some standard distribution (in this case, the so-called Chi-Square distribution). The form of the distribution can be looked up in statistical tables (or stored in a computer program), and then compared against the calculated measure. The result of the test is a p-value, which gives the probability that mere random variation between two samples would give rise to a difference measure of at least the calculated magnitude.
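For readers who would like to see the calculation, the standard (Pearson) form of the measure is given below. That Signature computes exactly this form is an assumption. Here O_i is the observed count in category i of the sample (e.g. the number of 3-letter words), E_i is the count expected from the reference, and k is the number of categories.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad
p = \Pr\left(X \ge \chi^2\right), \quad X \sim \chi^2_{k-1}
```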
14 Suppose we generate word length frequency data on two texts S and R, R being a large reference text (e.g. all the known works of some author) and S a sample text (e.g. a work of unknown authorship). Then applying the Chi-Square test to these data can be equivalent to asking: Suppose text R gives a reliable guide to the word length frequencies used by an author (so if R contains a 20% proportion of 3-letter words, this reflects a general tendency of that author to use 3-letter words on average one word in five). Then what is the probability that such an author, writing a text of the same length as S, would purely by chance produce a text with a word length frequency (WLF) spectrum which differs from that of R by as much as the actual WLF spectrum of S differs?
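The question posed above can be answered quite literally by simulation: repeatedly let an imaginary author "write" a text of the same length as S using R's proportions, and see how often chance alone produces a difference at least as large as the real one. The Python sketch below does this. It is not how Signature works (Signature uses the Chi-Square distribution directly); the function names, the number of trials and the example figures are all assumptions, and every reference proportion is assumed to be greater than zero.

```python
import random

def chi_square(observed, expected):
    """Pearson Chi-Square measure of how far observed counts are from expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulated_p_value(sample_counts, ref_proportions, trials=2000, seed=0):
    """Estimate the chance of a difference at least as large as the actual one."""
    rng = random.Random(seed)
    n = sum(sample_counts)                       # length of the sample text S
    expected = [n * p for p in ref_proportions]  # what R's proportions predict
    actual = chi_square(sample_counts, expected)
    categories = list(range(len(ref_proportions)))
    hits = 0
    for _ in range(trials):
        # One imaginary text of length n drawn at random from R's proportions.
        draws = rng.choices(categories, weights=ref_proportions, k=n)
        counts = [0] * len(categories)
        for c in draws:
            counts[c] += 1
        if chi_square(counts, expected) >= actual:
            hits += 1
    return hits / trials

reference = [0.2, 0.3, 0.5]  # hypothetical proportions taken from text R
print(simulated_p_value([20, 30, 50], reference))  # sample matches R exactly
print(simulated_p_value([50, 30, 20], reference))  # sample very unlike R
```

The first sample agrees perfectly with R, so every simulated text differs at least as much; the second differs so much that chance essentially never matches it.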
15 The Meaning of the Chi-Square Value The Chi-Square value is calculated as a number, and the larger the number, the less likely it is that a text like the sample (in the measured respect, e.g. word length frequencies) could have arisen by chance from an author having the overall characteristics manifested in the reference text. This number is usually compared against a standard value (e.g. the 5% value or the 1% value), to indicate whether it counts as significant to the corresponding degree (typically in scientific research, only a result which is at least significant at the 5% level is taken to be worthy of note).
16 If, for example, the test yields a Chi-Square value which is between the 5% and the 1% values, then Signature will display a message saying that the difference between the sample and reference texts is significant at the 5% level (i.e. sufficiently great that you'd only expect such large differences to arise by chance on average at most 1 time in 20). Signature will also display the actual Chi-Square value given by the test, and also the two standard values between which it lies (in the case above, the 5% and the 1% values). This enables you to see roughly how close the actual value came to these limits (e.g. you might be able to see that it very nearly achieved significance at the 1% level).
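The comparison against standard values can be sketched as a simple table lookup. The 5% and 1% figures below are the usual published critical values of the Chi-Square distribution for 1 to 10 degrees of freedom; the function name and the wording of the returned messages are made up here and are not Signature's actual output.

```python
# Upper-tail critical values of the Chi-Square distribution,
# keyed by degrees of freedom: (5% value, 1% value).
CRITICAL = {
    1: (3.841, 6.635), 2: (5.991, 9.210), 3: (7.815, 11.345),
    4: (9.488, 13.277), 5: (11.070, 15.086), 6: (12.592, 16.812),
    7: (14.067, 18.475), 8: (15.507, 20.090), 9: (16.919, 21.666),
    10: (18.307, 23.209),
}

def describe(chi2, df):
    """Say which standard values the Chi-Square result exceeds."""
    five, one = CRITICAL[df]
    if chi2 >= one:
        return "significant at the 1% level"
    if chi2 >= five:
        return "significant at the 5% level"
    return "not significant at the 5% level"

# With 6 degrees of freedom the standard values are 12.592 and 16.812.
for value in (12.0, 14.0, 20.0):
    print(value, describe(value, 6))
```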
17 Statistical and Genuine Significance Texts vary more systematically than do the repeatable scientific phenomena for which statistical tests are most typically used. Even a difference at the 1% level might not be genuinely significant in context: it might be that almost any pair of texts you took, even by the same author, would differ just as much! So when assessing the real significance of Signature's statistical results, it's important to make appropriate comparisons of the Chi-Square results you get. For this purpose, there's an option under the File menu to divide single texts into halves, so you can see what results you get from testing one half against the other.
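Dividing a text into halves for this kind of baseline check might look like the following Python sketch. Splitting at the midpoint of the word sequence is an assumption (Signature might split by characters, lines or paragraphs), and the helper names are invented.

```python
from collections import Counter

def split_in_halves(text):
    """Split a text into two halves at the midpoint of its word sequence."""
    words = text.split()
    mid = len(words) // 2
    return words[:mid], words[mid:]

def length_counts(words):
    """Word length counts, ignoring punctuation attached to a word."""
    return Counter(len(w.strip(".,;:!?\"'()")) for w in words)

first, second = split_in_halves("one two three four five six")
print(sorted(length_counts(first).items()))
print(sorted(length_counts(second).items()))
```

Each half's counts can then be fed into the same Chi-Square comparison used for two different texts, giving a feel for how much a single author varies within one work.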
18 What Statistical Tests Do Not Prove! Many people who use statistical tests fall victim to a natural misinterpretation of their results. The p-value does NOT give the probability that the sample (in our case, the sample text) actually came from the same population as (in our case, had the same author as) the reference (the reference text). Rather, it gives the probability that a difference of at least the magnitude measured between the sample and the reference would arise by chance if a sample of similar size were to be taken at random from the reference population. THESE TWO ARE QUITE DIFFERENT!
19 How to Use the Chi-Square Result To assess the actual probability that the sample came from the same population as the reference requires all sorts of background information. It might, for example, be known that the two works had the same (or different) author, or that one author's style is notoriously inconsistent, or that two authors' styles are remarkably similar. Any other relevant information bearing on the case (e.g. from conventional historical and literary studies) would also need to be taken into account in any assessment of the actual probability. Use the Chi-Square test, therefore, to confirm the objective significance of apparent differences, rather than as a definitive assessment of probability.
20 Degrees of Freedom The Chi-Square test can operate on any chosen range of data, and its results can depend very largely on what range is chosen. In a word length frequency test, for example, you could apply the test to words of ALL lengths, or just to the length range, say, 2 to 8. The larger the range of data included, the larger the degrees of freedom (i.e. the scope for mere random variation to give rise to a large Chi-Square value). Hence it can be more difficult for the test to give a genuinely significant result. It's best to restrict the test to the visually crucial and reliable data range, where there's plenty of data, and to exclude the long thin tail of the graphs where proportionately large random variation is to be expected.
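Restricting the test range is just a matter of discarding the categories outside it, as in the Python sketch below. The counts are invented, the function names are hypothetical, and taking the degrees of freedom to be one less than the number of categories is the usual convention for this kind of comparison rather than something confirmed for Signature.

```python
def restrict_range(counts, low, high):
    """Keep only the categories (e.g. word lengths) from low to high inclusive."""
    return {k: v for k, v in counts.items() if low <= k <= high}

def degrees_of_freedom(counts):
    """Assumed convention: one less than the number of categories tested."""
    return len(counts) - 1

# Invented word length counts with a long thin tail.
counts = {1: 40, 2: 170, 3: 210, 4: 180, 5: 110, 6: 80, 7: 60,
          8: 40, 9: 25, 10: 12, 11: 5, 12: 2, 13: 1}

core = restrict_range(counts, 2, 8)
print(degrees_of_freedom(counts), degrees_of_freedom(core))
```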
21 Reference and Sample Texts We have so far taken the reference and sample texts to be quite distinct, but if both are quite short, it can be best to include the sample text within the reference (there is a checkbox in Signature to do this automatically; this is ticked by default, but will be greyed out if the texts involve overlapping corpora). Applying the test is then like asking: Suppose we have an author whose style is typified by the combination of the reference and the sample texts, R and S. Then what is the probability that such an author, writing a text of the same length as S, would purely by chance... The principle here is simply that statistical results are more reliable when they're based on more data.
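One way to read "including the sample within the reference" is that the expected counts for S are taken from the pooled proportions of R and S together. The Python sketch below follows that reading; it is an assumption about the calculation, not a description of Signature's internals, and the function name and figures are invented. Every pooled category is assumed to contain at least one item.

```python
def pooled_chi_square(sample_counts, reference_counts):
    """Chi-Square of the sample against proportions pooled from R and S."""
    pooled = [s + r for s, r in zip(sample_counts, reference_counts)]
    total = sum(pooled)
    n = sum(sample_counts)
    # Counts a text of length n would show if it followed the pooled style.
    expected = [n * c / total for c in pooled]
    return sum((s - e) ** 2 / e for s, e in zip(sample_counts, expected))

print(pooled_chi_square([10, 20], [100, 200]))  # same proportions in S and R
print(pooled_chi_square([10, 30], [30, 10]))    # opposite proportions
```

A fuller treatment would also measure how far R lies from the pooled proportions (the two-sample form of the test); this sketch stays with the one-sided question asked above.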
22 Word Lists The tests looked at so far are relatively crude from a literary point of view. Stylistic differences can be shown more sensitively by testing for authors' usage of particular words rather than mere word lengths etc. Signature provides a Word List facility (under the Wordlists menu). Select Edit wordlists, then click on New List, give your word list a name, then type into the Memo the words you want the list to contain, before clicking on OK (also indicate if you want your list to be case-sensitive). Now when you select any texts, they will if necessary be re-read, and a graph of the listed word occurrences displayed on the Words tab (see bottom-right corner of the screen).
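Counting the occurrences of a chosen word list, with an optional case-sensitive mode, can be sketched as follows. The function name, the tokenising rule and the example sentence are assumptions for illustration only.

```python
import re
from collections import Counter

def word_list_counts(text, word_list, case_sensitive=False):
    """Count how often each word in word_list occurs in the text."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
        word_list = [w.lower() for w in word_list]
    counts = Counter(tokens)
    # Words that never occur are reported with a count of zero.
    return {w: counts[w] for w in word_list}

text = "Upon the hill, upon the shore, while others waited on the quay."
words = ["upon", "while", "whilst", "on"]
print(word_list_counts(text, words))
print(word_list_counts(text, words, case_sensitive=True))
```

In case-sensitive mode the capitalised "Upon" at the start of the sentence is no longer counted as an occurrence of "upon".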
23 Key Words Signature's key words facility can help you decide which words are most useful for author identification. To illustrate this, load the five Federalist Papers files provided, click on Hamilton and Madison, then select Key words from the Wordlists menu. A table will be generated showing the top 50 keywords that tell in favour of Hamilton (+ values), followed by the top 50 that tell in favour of Madison (– values). Move down the list, control-clicking to select the words "upon", "while", "whilst" and "on". Then click on the cyan label, and you'll enter the word list facility with the chosen words already included. Use this to identify the true author of the Unknown Federalist Papers!
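How Signature scores its key words is not described here, so the Python sketch below uses a simple stand-in: the difference between the two texts in a word's rate per thousand words, with positive values favouring the first text and negative values the second. The function names are invented, and the two toy strings are made up for illustration; they are not taken from the Federalist Papers.

```python
import re
from collections import Counter

def rates_per_thousand(text):
    """Occurrences of each word per 1000 words of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {w: 1000.0 * n / len(tokens) for w, n in counts.items()}

def key_words(text_a, text_b, top=2):
    """Words that most favour text_a (+ scores) and text_b (- scores)."""
    a, b = rates_per_thousand(text_a), rates_per_thousand(text_b)
    scores = {w: a.get(w, 0.0) - b.get(w, 0.0) for w in set(a) | set(b)}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top], ranked[-top:][::-1]

# Toy texts, invented for this example.
text_a = "upon the whole upon reflection while"
text_b = "whilst on the whole on reflection"

favour_a, favour_b = key_words(text_a, text_b)
print([word for word, score in favour_a])
print([word for word, score in favour_b])
```

With real corpora a more robust score would be needed, since a raw rate difference is dominated by very common words.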
24 Other Features of Signature The File menu enables you to read texts into multiple files (where tags are present), and to reload texts (useful if editing has taken place meanwhile, or if loading was previously aborted for some reason). You can also hide texts to avoid cluttering the lists. The View menu provides a text viewer into which texts can be read (at the cost of speed). This helps when defining an appropriate filter (also accessible through the View menu) for removing unwanted textual artefacts (e.g. HTML entities, or changing Mr. to Mr to avoid misidentification of sentence endings). The Graphs and Tables menus provide a variety of formats for copying graphs/tables to the clipboard.
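A filter of the kind described could be sketched in Python as below. This is not Signature's filter format: the function name is hypothetical, HTML entities are decoded rather than deleted, and only the three titles listed are handled.

```python
import html
import re

def clean_text(raw):
    """Remove artefacts that would distort the frequency counts."""
    # Turn HTML entities such as &amp; back into ordinary characters.
    text = html.unescape(raw)
    # Drop the full stop after common titles so that it is not
    # mistaken for the end of a sentence.
    text = re.sub(r"\b(Mr|Mrs|Dr)\.", r"\1", text)
    return text

print(clean_text("Mr. Darcy &amp; Mrs. Bennet arrived."))
```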