Corpora in literary and stylistic studies Corpus Linguistics Richard Xiao
Aims of this session Lecture – An overview of applications of corpora in literary and stylistic studies – Case study: Culpeper's (2002) keyword analysis of six characters in Romeo and Juliet Lab session – To duplicate Culpeper's (2002) study
Corpora vs. literary stylistics Stylistic shifts in usage may be observed with reference to features associated with either particular situations of use or particular groups of speakers (cf. Schilling-Estes 2002: 375) – In this sense, similar to registers and genres or dialects and language varieties – …but stylisticians are typically more interested in individual works by individual authors rather than language or language variety as such The use of corpora in stylistics and literary studies is presently very limited
Potential uses of corpora Study of prose style Study of individual authorial styles Authorship attribution Literary appreciation and criticism Teaching of stylistics Study of literariness in discourses other than literary texts (e.g. Carter 1999)
Study of prose style In stylistics, there is a long tradition of focusing on the representation of speech and thought in fiction Leech and Short's (1981) influential model of speech and thought presentation – Style in Fiction, Longman, 1981 Further refined in Short, Semino and Culpeper (1996), and Semino, Short and Culpeper (1997)
S&TP: Lancaster Speech, Thought and Writing Presentation Corpus Developed during – Written: 260,000 words in size, three narrative genres: prose fiction, newspaper reportage and (auto)biography, which are further divided into serious and popular sections – Spoken: created with the express aim of comparing S&TP in spoken and written languages systematically, 260,000 words, 60 samples from BNCdemo, and 60 samples from oral history archives in the Centre for North West Regional Studies at Lancaster Download:
S&TP categories Direct category, e.g. – direct speech, direct thought and direct writing Free direct category, e.g. – free direct speech, free direct thought, free direct writing Indirect category, e.g. – indirect speech, indirect thought, indirect writing Free indirect category – free indirect speech, free indirect thought, free indirect writing Representation of speech/thought/writing act category Representation of voice/internal state/writing category Report category, e.g. – report of speech, report of thought, report of writing
Authorial styles of individual authors Typically specialized corpora of the works of individual authors, e.g. – A corpus composed of their early and later works to track any stylistic shift over time – A corpus composed of their works belonging to different genres (e.g. plays and essays) to compare their styles across genres – A corpus composed of works by different authors to compare their different authorial styles Large general corpora can provide a means of establishing a norm for comparison when discussing features of literary style (Hunston 2002: 128)
Techniques of studying authorial styles Corpus stylistics goes well beyond simple counting and relies heavily on sophisticated statistical approaches – MDA (e.g. Watson 1994) – Principal Component Analysis (e.g. Binongo and Smith 1999) – Multivariate analysis (or more specifically, cluster analysis, e.g. Watson 1999; Hoover 2003) Stylistics + computation + statistics – stylometry, stylometrics, computational stylistics, statistical stylistics, corpus stylistics
Authorship attribution Is the work by Shakespeare or Marlowe? Cluster analysis of frequent words, frequent word sequences, and frequent collocations provides an accurate and robust method for authorship attribution (Hoover 2001, 2002, 2003a, 2003b) Corpus-based authorship attribution has been used as linguistic evidence in court (forensic linguistics) – Confession/witness statements (e.g. Coulthard 1993) – Blackmail/ransom/suicide notes (Baldauf 1999) Plagiarism detection in academic and education settings (e.g. Turnitin UK)
The Derek Bentley case Derek Bentley was hanged in the UK in 1953 for allegedly encouraging his young companion Chris Craig (a minor) to shoot a policeman – The evidence that weighed against him was a confession statement which he signed in police custody but which, he later claimed at the trial, the police had helped him (to?) produce The case was re-opened in 1993, 40 years after Derek was hanged – Malcolm Coulthard, a forensic linguist, was commissioned by Bentley's family to examine the confession as part of an appeal to get a posthumous pardon for Derek
The Derek Bentley case The appeal was initially rejected by the Home Secretary In 1998, another court of appeal overturned the original conviction and found Derek Bentley innocent In 1999 the Home Secretary awarded compensation to the Bentley family
The Derek Bentley case In Bentley's confession, the word "then" was unusually frequent – It occurred 10 times in his 582-word confession statement, ranking as the 8th most frequent word in the statement – It ranked 58th in a corpus of spoken English, and 83rd in the Bank of English (on average once every 500 words) Six witness statements – 3 made by other witnesses: "then" occurs just once in 980 words – 3 by police officers, including two involved in the Bentley case: "then" occurs 29 times – once in every 78 words!
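The rates quoted on this slide are simple ratios of text length to number of occurrences. As a rough check, the Python sketch below recomputes them from the counts given above; the function name is illustrative only and does not come from any tool used in the lecture.

```python
def words_per_occurrence(total_words: int, occurrences: int) -> float:
    """Average number of running words per occurrence of an item."""
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return total_words / occurrences


# Counts taken from the slide.
bentley = words_per_occurrence(582, 10)     # Bentley's confession statement
witnesses = words_per_occurrence(980, 1)    # three ordinary witness statements
general = 500.0                             # Bank of English: about once every 500 words

print(f"Bentley: one 'then' every {bentley:.1f} words")
print(f"Other witnesses: one 'then' every {witnesses:.0f} words")
print(f"Bentley's rate is about {general / bentley:.1f} times the general rate")
```

Expressing every source as "one occurrence per N words" puts texts of very different lengths on the same scale, which is what makes the comparison with the police statements (once in every 78 words) meaningful.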
The Derek Bentley case The position of "then" – Subject + then (e.g. "I then", "Chris then") was unusually frequent in Bentley's confession "I then" occurs three times (once every 190 words) In a 1.5-million-word corpus of spoken English, the sequence occurs just nine times (once every 165,000 words) No instance of "I then" was found in ordinary witness statements Nine occurrences were found in the police statement In the spoken BoE, "then I" was 10 times as frequent as "I then"
The Derek Bentley case The sequence subject + then was characteristic of the police statement Although the police denied Bentley's claim and said that the statement was a verbatim record of what Bentley had actually said, the unusual frequency of "then" and its abnormal position could be taken to be indicative of some intrusion of the policemen's register in the statement
Culpeper (2002) Culpeper, Jonathan (2002) Computers, language and characterisation: An analysis of six characters in Romeo and Juliet. In U. Melander-Marttala, C. Ostman and Merja Kyto (eds.), Conversation in Life and in Literature. Uppsala: Universitetstryckeriet, pp – nks/Keywords-Culpeper.pdf
Aim of Culpeper (2002) The broad aim of this paper is to show how the study of an important area within stylistics, namely characterisation, can benefit from an empirical approach, specifically, a methodology for identifying what might be the key words of a text … Such an approach can reveal significant lexical and grammatical patterns without reliance on speculations about what the relevant dimensions are (Culpeper 2002: 12)
Keywords vs. style-markers Enkvist (1964: 29) – Style is concerned with frequencies of linguistic items in a given context, and thus with contextual probabilities. – To measure the style of a passage, the frequencies of its linguistic items […] must be compared with the corresponding features in another text or corpus which is regarded as a norm and which has a definite relationship with this passage. Style as a matter of frequencies, probabilities and norms – We may […] define style markers as those linguistic items that only appear, or are most or least frequent in, one group of contexts. In other words, style markers are contextually bound linguistic elements… (ibid. 34-5) – Elements that are not style markers are stylistically neutral. (ibid. 35) Style-markers…are words whose frequencies differ significantly from their frequencies in a norm (Culpeper 2002: 13) – Keywords (positive and negative)
Preparing the text Problem 1: Which text to use … original version or modern version? – Culpeper opted for a modern edition (to get round the problem of spelling variation: sweet vs. sweete, etc.) Problem 2: Shakespeare's plays are full of dialogue – How can we get the tool to distinguish between different characters? – Culpeper used a simple tagging scheme, e.g. … …
Who is worth concentrating on …? Culpeper chose his characters based on the number of words that they spoke
Character         Total no. of words spoken
Romeo             5031
Juliet            4564
Friar Lawrence    2901
Nurse             2369
Capulet           2292
Mercutio          2254
Benvolio          1293
Choosing a reference corpus Culpeper opted to make 6 reference corpora – one for each character, e.g. – RC for Romeo = whole play minus Romeo's contributions – RC for Juliet = whole play minus Juliet's contributions – RC for Nurse = whole play minus Nurse's contributions – … Why use a reference corpus of the same play? – Characters are partly shaped by their context. Thus, it makes little sense to compare, say, the characters of Romeo and Juliet with the characters of Macbeth or Anthony and Cleopatra, since the fictional worlds of Italy, Scotland and Egypt provide very different contextual influences. Furthermore, characters, like people, are partly perceived in terms of whom they interact with … (Culpeper 2002: 16)
Alternative reference corpora …? Scott and Tribble (2006) have compared Romeo and Juliet against – The Complete Works of Shakespeare – Plays only – Tragedies only – The BNC Interestingly … they found that – A robust core of keywords occur whichever reference corpus is used. These include personal and place names like Benvolio, Romeo, Juliet and Mantua but also terms like banished, county, love and night In contrast to Scott and Tribble (2006), Culpeper (2002) found that his results were more meaningful - in terms of characterisation - when using the other Romeo and Juliet characters (minus the target character) as a reference corpus
Making wordlists for each character Making the characters' word lists – Involves telling WordSmith to only include … – Procedure … Wordlist – Settings – Wordlist specific – Tags – Only part of file – Sections to keep – [specifying start/end tags] Making the reference corpora – Involves telling WordSmith to exclude anything between … – Procedure … Wordlist – Settings – Wordlist specific – Tags – Only part of file – Sections to cut out – [specifying start/end tags]
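Outside WordSmith, the same "sections to keep" / "sections to cut out" logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag format (`<ROMEO>…</ROMEO>`) is hypothetical, since the actual start/end tags are not shown on the slides, and the function name `wordlists` is invented for this example. The sample lines are the opening of the tagged-text sample used later in the session.

```python
import re
from collections import Counter

# Hypothetical mark-up: each speech is wrapped in a tag named after its speaker.
SPEECH = re.compile(r"<(?P<who>[A-Z_]+)>(?P<text>.*?)</(?P=who)>", re.DOTALL)
TOKEN = re.compile(r"[a-z']+")


def wordlists(tagged_text, character):
    """Return (target, reference) frequency lists for one character.

    target    = only that character's speeches ("sections to keep")
    reference = every other tagged speech      ("sections to cut out")
    Untagged material, such as stage directions, is ignored.
    """
    target, reference = Counter(), Counter()
    for match in SPEECH.finditer(tagged_text):
        tokens = TOKEN.findall(match.group("text").lower())
        if match.group("who") == character:
            target.update(tokens)
        else:
            reference.update(tokens)
    return target, reference


sample = (
    "<BENVOLIO>Good morrow, cousin.</BENVOLIO>"
    "<ROMEO>Is the day so young?</ROMEO>"
    "<BENVOLIO>But new struck nine.</BENVOLIO>"
    "<ROMEO>Ay me! sad hours seem long.</ROMEO>"
)
romeo_tc, romeo_rc = wordlists(sample, "ROMEO")
print(romeo_tc.most_common(3))
print(sum(romeo_tc.values()), "running words for Romeo;",
      sum(romeo_rc.values()), "in his reference corpus")
```

A single pass over the text yields both lists, which guarantees that the target and reference corpora never overlap and together cover every tagged speech.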
Top 10 on wordlists (frequency)
ROMEO: AND, I, THE, TO, MY, THAT, A, OF, ME, IN
JULIET: I, TO, AND, MY, THE, THAT, THOU, IS, A, BE
CAPULET: TO, YOU, AND, A, MY, I, IS, THE, HER, NOT
NURSE: I, A, AND, THE, YOU, TO, IT, IS, MY, O
MERCUTIO: A, THE, OF, AND, TO, THAT, I, IS, IN, THOU
FRIAR L: AND, THE, TO, IN, THY, THOU, OF, IS, THAT, A
PLAY: AND, THE, I, TO, A, OF, MY, THAT, IS, IN
PRES-DAY SPOKEN ENGLISH: THE, I, YOU, AND, IT, A, S, TO, OF, THAT
PRES-DAY WRITTEN ENGLISH: THE, OF, AND, A, IN, TO (INF), IS, TO (PR), WAS, IT
Q: Do they tell us anything interesting/worthwhile and, if so, what?
Positive keywords for the six characters Romeo, Juliet, Capulet, Nurse, Mercutio, Friar L Beauty Blessed Love Eyes More Mine Rich Dear Yonder Farewell Me Sick Lips Stars Fair Thine Hand Banished If Or Sweet Be News My Night I Would Yet Thou Words Name Nurse Tybalt Send Husband That swear Go Wife Thank Ha You Thursday Her Child Welcome We Haste Gentlemen 'Tis Our Make Now Daughter Well Day He's You Quoth Woeful God Warrant Madam Lord Lady Hie It Your Faith Said Ay She About Ever Sir Marry Ah Fall Well A Hare Very Of He The O'er Thy From Thyself Mantua Part Heaven Forth Her Alone Time Married Letter What differences can you spot between the results here and the results on the previous table?
What keywords can tell us about characterisation … Romeo's top three keywords – beauty, blessed, love Expected? Surprising? … the lover of the play – Other keywords related to love talk = dear, stars, fair – Keywords relating to body parts – eyes, lips, hand – obsessed with the physical? Juliet's top keywords – if, or, be, yet, would (conditional + modals) – Reflecting her state of mind – anxiety and uncertainty? Capulet's most key keyword – go – Context reveals that it is mostly used as an imperative … Capulet, as head of the household, directs other people (see also make and haste), e.g. Go wake Juliet, go and trim her up… The Nurse's keywords are surge features (i.e. reflecting outbursts of emotion) – god, warrant, woeful, faith, marry, ah
Negative keywords for the six characters Romeo, Juliet, Capulet, Nurse, Mercutio, Friar L You Romeo He Go Her The You And Go Thou That The Of And With Thou My I What I You A Have My IMPORTANT: These represent words that are used unusually infrequently (statistically speaking) by these characters. Do you notice anything interesting?
Use of pronouns within Romeo and Juliet Characters: Juliet, Romeo, Capulet, Nurse, Mercutio, Friar L POS: MY I THOU ME MINE THINE YOU WE 'TIS OUR HE'S YOU IT YOUR SHE HE THY THYSELF NEG: YOU HE THOU MY I YOU MY Romeo and Juliet use first and second person pronouns – Expected? - at the heart of the social interaction in the play But compare Romeo's use of me/mine with Juliet's use of I … – Culpeper's (2002) conclusion: Juliet spends much time in the play bearing her soul … whereas Romeo is much more conscious of his own role as a lover and of the effect of the circumstances upon him (ibid: 24) What about Capulet? – you, we, our, why? Thou-forms vs. you-forms to be covered
Culpeper's conclusion (2002: 27) In some cases, my analysis provided solid evidence for what one might have guessed (e.g. Romeo's keywords beauty and love) … … in others, it revealed what I think would be very difficult to guess but fits well a possible interpretation (e.g. Juliet's keywords if and yet). … keywords analysis also offers a way into analysing function words, such as pronouns, and accounting for their contribution to style and meaning
What should we take note of …? How he was able to come to his conclusions – The importance of having the right reference corpus – The need to use mark-up (as a means of identifying the different characters) – Knowing how to use WordSmith … To make the different wordlists To make the keyword lists
Any potential weaknesses … It did not attempt to lemmatize the word forms … so that, for example, loves would form part of the word count of love (Culpeper 2002: 27) Contractions (e.g. I'll) would also have been counted separately Keyword analysis … – makes us focus on statistical deviations from a relative norm, and ignores the significance of relatively infrequent deviations from absolute norms (i.e. what your given texts may have in common) – ignores one-off occurrences of words
Now it's your turn… Duplicating Culpeper (2002)
The Romeo text Download the Oxford Shakespeare version of Romeo and Juliet – – Local copy available Using tags to separate stage directions from dialogues – Did Culpeper do this? Tag words spoken by each character Alternatively, you can use a local version I have prepared
Sample of tagged text Good morrow, cousin. Is the day so young? But new struck nine. Ay me! sad hours seem long. Was that my father that went hence so fast? It was. What sadness lengthens Romeo's hours? Not having that, which having, makes them short. In love? Out Of love? Out of her favour, where I am in love.
Separating words by apostrophes Clear the apostrophe (') from this box and press OK
Making a wordlist for each character Start wordlist function Load the text Setting – Tags – Only part of File - Sections to keep – type in the start/end tags given below – Ignore is default setting – ignore stage directions Make a wordlist for – Romeo_TC ( … ) – Juliet_TC ( … ) – Capulet_TC ( … ) – Nurse_TC ( … ) – Mercutio_TC ( … ) – Friar_L_TC ( … )
Tag and markup Only Part of file
Making a reference list for each character Setting – Tags – Only part of File - Sections to cut out – type in the start/end tags given below – Excluding what is said by the target character Make a wordlist for – Romeo_RC ( … ) – Juliet_RC ( … ) – Capulet_RC ( … ) – Nurse_RC ( … ) – Mercutio_RC ( … ) – Friar_L_RC ( … )
Running words Columns: Character, In our file, Culpeper (2002) Rows: Romeo, Juliet, Friar Lawrence, Capulet, Nurse, Mercutio
Discrepancies: Some explanations Different tagging – We ignored stage directions – We tried what Culpeper (2002) suggested at the end of his paper, treating contracted words such as I'll as two words A potential problem of this approach with Shakespearean texts – danc'd, disturb'd, and rais'd etc. all became two words! – Is there a need to annotate the text? Not done here or in Culpeper (2002), but worth the effort – the city's side – let's away – Where's this girl? Want to have a try? –
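The effect of the apostrophe setting on running-word counts can be seen with two toy tokenisers in Python. This is a minimal sketch, not WordSmith's actual tokenisation: the two regular expressions are assumptions chosen for illustration, and the test sentence is assembled from the examples on this slide rather than quoted from the play.

```python
import re

# Apostrophe allowed inside a word: "danc'd" stays one token.
KEEP_APOSTROPHE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")
# Apostrophe treated as a separator: "danc'd" becomes "danc" + "d".
SPLIT_APOSTROPHE = re.compile(r"[A-Za-z]+")

# Illustrative sentence built from the slide's examples (not a line from the play).
line = "Where's this girl? She danc'd by the city's side."

one_word = KEEP_APOSTROPHE.findall(line)
two_words = SPLIT_APOSTROPHE.findall(line)

print(len(one_word), one_word)
print(len(two_words), two_words)
```

Splitting at the apostrophe adds one running word per contraction and creates frequent pseudo-words such as "s" and "d", which is one plausible source of the differences between word counts. Note that neither pattern handles word-initial apostrophes such as 'tis, which would need separate treatment.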
Top 10 on wordlists Romeo Juliet Capulet Nurse Mercutio Friar L whole play
Keyword settings Selected statistic formula Min. Frequency Cutoff p value
Making a keyword list per character Romeo_kw – Romeo_TC + Romeo_RC Juliet_kw – Juliet_TC + Juliet_RC Capulet_kw – Capulet_TC + Capulet_RC Nurse_kw – Nurse_TC + Nurse_RC Mercutio_kw – Mercutio_TC + Mercutio_RC Friar_L_kw – Friar_L_TC + Friar_L_RC
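Each keyword list compares a target wordlist (_TC) with its reference wordlist (_RC). The Python sketch below shows one common way to score keyness, Dunning's log-likelihood statistic. It is an assumption that this matches the formula selected in the lab settings, which the slides do not name. The cut-off of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom; the minimum frequency, the function names and the toy counts are all hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness.

    a, b: frequency of the word in the target / reference list
    c, d: total running words in the target / reference list
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in target
    e2 = d * (a + b) / (c + d)  # expected frequency in reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target, reference, min_freq=2, cutoff=3.84):
    """Return (word, keyness, 'positive' | 'negative') sorted by keyness."""
    c, d = sum(target.values()), sum(reference.values())
    rows = []
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]
        if a + b < min_freq:  # simplification: threshold on combined frequency
            continue
        ll = log_likelihood(a, b, c, d)
        if ll >= cutoff:
            sign = "positive" if a / c > b / d else "negative"
            rows.append((word, ll, sign))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Toy counts, invented purely to exercise the function.
toy_tc = Counter({"love": 30, "the": 70})
toy_rc = Counter({"love": 10, "the": 190})
for word, ll, sign in keywords(toy_tc, toy_rc):
    print(f"{word:6s} {ll:6.2f} {sign}")
```

Positive keywords are those relatively more frequent in the target than in the reference list; negative keywords are relatively less frequent. The statistic itself is unsigned, so the direction has to be read off the two relative frequencies.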
Romeo's keywords by keyness Positive keywords – Aboutness: beauty, love, blessed, dream, joy, sin, kiss, death, poison, soul … Love talk: dear, farewell, stars Body parts: eyes, lips, hand Pronouns: mine, me, thine, thee, my Negative keywords – Himself: Romeo, he, him Both: you, we Movement: come, go, up
Juliet's keywords by keyness Positive keywords – People in interaction: nurse, Romeo, sweet, husband, mother, father State of mind: if, or, be, yet, would Pronouns: my, I, thou Aboutness: news, words, night, swear, send, tongue, speak Negative keywords – Herself: her Both: we, you Movement: here, go
Why nurse and husband? (vocative function)
You-forms vs. thou-forms – Plural: ye, you, your, yours, yourself – Singular: thou, thee, thy, thine, thyself You-forms vs. thou-forms (thou, thine, thee) – socio-pragmatic implications – Romeo and Juliet prefer thou-forms (positive) and avoid you-forms (negative) High status social equals use you-forms You-forms are dispassionate and emotionally unmarked Thou-forms are strongly expressive: positive (affection and love) or negative (anger and contempt) – intimacy, love talk – Friar Lawrence prefers thou-forms: He is engaged in intimate and emotionally charged discourse – Capulet and the Nurse prefer you-forms: used among social superiors, or individuals of low status talking to people of high social status
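A character's preference for one set of forms over the other can be checked by counting both sets in that character's speeches. The Python sketch below uses the two form lists given on this slide; the function name is illustrative, and the test passage is a well-known pair of lines spoken by Juliet, chosen only as a small worked example.

```python
import re
from collections import Counter

# The two pronoun sets as listed on the slide.
YOU_FORMS = {"ye", "you", "your", "yours", "yourself"}
THOU_FORMS = {"thou", "thee", "thy", "thine", "thyself"}


def pronoun_profile(text):
    """Return (number of you-forms, number of thou-forms) in a text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    you = sum(counts[form] for form in YOU_FORMS)
    thou = sum(counts[form] for form in THOU_FORMS)
    return you, thou


juliet = ("O Romeo, Romeo! wherefore art thou Romeo? "
          "Deny thy father and refuse thy name")
you, thou = pronoun_profile(juliet)
print(f"you-forms: {you}, thou-forms: {thou}")
```

Raw counts like these only become evidence of preference when set against a reference corpus, as in the keyword procedure; a short passage on its own shows tendency, not keyness.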
Capulet's keywords by keyness Negative keywords – Pron: thy, thou Others: the, of, that, etc. [full of actions, not a nouny style] Positive keywords – Directions: go, haste, make, now, look (imperatives) Pronouns: you, we, her, our (directing and speaking on behalf of the household) etc… [you vs. thou: imperative; less emotional]
Nurse's keywords by keyness Positive keywords – Emotional: ay, ah, O, God, woeful, warrant, faith Pronouns: you, your, he, I Address terms: lady, madam, lord, sir Why day? - O day! O day! O day! O hateful day! Negative keywords – Pron: thou Why d?
Culpeper might have made the correct decision to treat contractions as one word?
Mercutio's keywords by keyness Positive keywords – Noun-y style: a, of, the, an – akin to written, less interactive Negative keywords – Less interactive style: Lack of question word: what Lack of 1st person pron: I, my
Friar L's keywords by keyness Positive keywords – Pronouns: thy, thyself, thou - involved in intimate and emotionally charged discourse, "emotional mirror" A man of the Church: heaven, from (heaven) Roles he played in facilitating the plot: Mantua, letter Negative keywords – Less emotional (than Nurse): O Pronouns: my, you, I
Planning your own study … What should I do first …? – Choose your data and/or tool – Determine what interests you about the data – Come up with some hypotheses that you'd like to test out This can be data-driven (what seems to jump out at you from your data) This can be theory-oriented (i.e. testing out something about the language that's taken for granted)