Corpora in literary and stylistic studies

Presentation on theme: "Corpora in literary and stylistic studies"— Presentation transcript:

1 Corpora in literary and stylistic studies
Corpus Linguistics Richard Xiao

2 Aims of this session
Lecture: An overview of applications of corpora in literary and stylistic studies. Case study: Culpeper’s (2002) keyword analysis of six characters in Romeo and Juliet
Lab session: To duplicate Culpeper’s (2002) study

3 Corpora vs. literary stylistics
Stylistic shifts in usage may be observed with reference to features associated with either particular situations of use or particular groups of speakers (cf. Schilling-Estes 2002: 375) In this sense, similar to registers and genres or dialects and language varieties …but stylisticians are typically more interested in individual works by individual authors rather than language or language variety as such The use of corpora in stylistics and literary studies is presently very limited

4 Potential uses of corpora
Study of prose style Study of individual authorial styles Authorship attribution Literary appreciation and criticism Teaching of stylistics Study of literariness in discourses other than literary texts (e.g. Carter 1999)

5 Study of prose style In stylistics, there is a long tradition of focusing on the representation of speech and thought in fiction Leech and Short’s (1981) influential model of speech and thought presentation Style in Fiction, Longman, 1981 Further refined in Short, Semino and Culpeper (1996), and Semino, Short and Culpeper (1997)

6 S&TP: Lancaster Speech, Thought and Writing Presentation Corpus
Developed during Written: 260,000 words in size, three narrative genres: prose fiction, newspaper reportage and (auto)biography, which are further divided into ‘serious’ and ‘popular’ sections Spoken: created with the express aim of comparing S&TP in spoken and written languages systematically, 260,000 words, 60 samples from BNCdemo, and 60 samples from oral history archives in the Centre for North West Regional Studies at Lancaster Download:

7 S&TP categories
Direct category, e.g. direct speech, direct thought and direct writing
Free direct category, e.g. free direct speech, free direct thought, free direct writing
Indirect category, e.g. indirect speech, indirect thought, indirect writing
Free indirect category, e.g. free indirect speech, free indirect thought, free indirect writing
Representation of speech/thought/writing act category
Representation of voice/internal state/writing category
Report category, e.g. report of speech, report of thought, report of writing

8 Authorial styles of individual authors
Typically specialized corpora of the works of individual authors, e.g. A corpus composed of their early and later works to track any stylistic shift over time A corpus composed of their works belonging to different genres (e.g. plays and essays) to compare their styles across genres A corpus composed of works by different authors to compare their different authorial styles Large general corpora can provide ‘a means of establishing a norm for comparison when discussing features of literary style’ (Hunston 2002: 128)

9 Techniques of studying authorial styles
Corpus stylistics goes well beyond simple counting and relies heavily on sophisticated statistical approaches
MDA (e.g. Watson 1994)
Principal Component Analysis (e.g. Binongo and Smith 1999)
Multivariate analysis (or more specifically, cluster analysis, e.g. Watson 1999; Hoover 2003)
Stylistics + computation + statistics = stylometry, stylometrics, computational stylistics, statistical stylistics, corpus stylistics

10 Authorship attribution
Is the work by Shakespeare or Marlowe? Cluster analysis of frequent words, frequent word sequences, and frequent collocations provides an accurate and robust method for authorship attribution (Hoover 2001, 2002, 2003a, 2003b) Corpus-based authorship attribution has been used as linguistic evidence in court (“forensic linguistics”) Confession/witness statements (e.g. Coulthard 1993) Blackmail/ransom/suicide notes (Baldauf 1999) Plagiarism detection in academic and education settings (e.g. Turnitin UK)

11 The Derek Bentley case Derek Bentley was hanged in the UK in 1953 for allegedly encouraging his young companion Chris Craig (a minor) to shoot a policeman The evidence that weighed against him was a confession statement which he signed in police custody but later claimed at the trial that the police had ‘helped’ him (to?) produce The case was re-opened in 1993, 40 years after Derek was hanged Malcolm Coulthard, a forensic linguist, was commissioned by Bentley’s family to examine the confession as part of an appeal to get a posthumous pardon for Derek

12 The Derek Bentley case
The appeal was initially rejected by the Home Secretary
In 1998, the Court of Appeal overturned the original conviction
In 1999 the Home Secretary awarded compensation to the Bentley family

13 The Derek Bentley case
In Bentley’s confession, the word then was unusually frequent
It occurred 10 times in his 582-word confession statement, ranking as the 8th most frequent word in the statement
It ranked 58th in a corpus of spoken English, and 83rd in the Bank of English (on average once every 500 words)
Six witness statements
3 made by other witnesses: then occurs just once in 980 words
3 by police officers, including two involved in the Bentley case: then occurs 29 times, i.e. once in every 78 words!
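The rates quoted for then can be checked with a few lines of arithmetic. The sketch below only uses the counts given on this slide; the function name is made up for illustration.

```python
def words_per_occurrence(total_words, hits):
    """Average number of running words per occurrence of an item."""
    return total_words / hits

# Counts quoted on the slide
bentley = words_per_occurrence(582, 10)    # Bentley's confession statement
witnesses = words_per_occurrence(980, 1)   # three ordinary witness statements
bank_of_english = 500                      # "once every 500 words", as quoted

print(f"Bentley: once every {bentley:.1f} words")
print(f"Other witnesses: once every {witnesses:.0f} words")
print(f"Bentley vs. Bank of English: {bank_of_english / bentley:.1f} times as frequent")
```

On these figures Bentley’s statement uses then far more densely than the general corpus, and at a rate much closer to the police officers’ one in 78 than to the other witnesses’ one in 980.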

14 The Derek Bentley case
The position of then
Subject + then (e.g. I then, Chris then) was unusually frequent in Bentley’s confession
I then occurs three times (once every 190 words)
In a 1.5-million-word corpus of spoken English, the sequence occurs just nine times (once every 165,000 words)
No instance of I then was found in ordinary witness statements
Nine occurrences were found in the police statement
In the spoken BoE, then I was 10 times as frequent as I then

15 The Derek Bentley case The sequence subject + then was characteristic of the police statement Although the police denied Bentley’s claim and said that the statement was a verbatim record of what Bentley had actually said, the unusual frequency of then and its abnormal position could be taken to be indicative of some intrusion of the policemen’s register in the statement

16 Culpeper (2002) Culpeper, Jonathan (2002) Computers, language and characterisation: An analysis of six characters in Romeo and Juliet. In U. Melander-Marttala, C. Ostman and Merja Kyto (eds.), Conversation in Life and in Literature. Uppsala: Universitetstryckeriet, pp

17 Aim of Culpeper (2002) ‘The broad aim of this paper is to show how the study of an important area within “stylistics”, namely characterisation, can benefit from an empirical approach, specifically, a methodology for identifying what might be the “key” words of a text … Such an approach can reveal significant lexical and grammatical patterns without reliance on speculations about what the relevant dimensions are’ (Culpeper 2002: 12)

18 Keywords vs. style-markers
Enkvist (1964: 29) ‘Style is concerned with frequencies of linguistic items in a given context, and thus with contextual probabilities.’ ‘To measure the style of a passage, the frequencies of its linguistic items […] must be compared with the corresponding features in another text or corpus which is regarded as a norm and which has a definite relationship with this passage.’ Style as a matter of ‘frequencies’, ‘probabilities’ and ‘norms’ ‘We may […] define style markers as those linguistic items that only appear, or are most or least frequent in, one group of contexts. In other words, style markers are contextually bound linguistic elements…’ (ibid. 34-5) ‘Elements that are not style markers are stylistically neutral.’ (ibid. 35) ‘Style-markers…are words whose frequencies differ significantly from their frequencies in a norm’ (Culpeper 2002: 13) Keywords (positive and negative)

19 Preparing the text Problem 1: Which text to use … original version or modern version? Culpeper opted for a modern edition (to get round problem of spelling variation: sweet vs. sweete, etc.) Problem 2: Shakespeare plays are full of dialogue How can we get the tool to distinguish between different characters? Culpeper used a simple tagging scheme, e.g. <ROM>…<\ROM> <JUL>…<\JUL>
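To see why a simple tagging scheme is enough for a tool to separate characters, here is a minimal Python sketch. It assumes the <ROM>…<\ROM> style shown above; the <BEN> tag, the sample lines and the function names are illustrative assumptions, not Culpeper’s actual files or procedure.

```python
import re

# Tiny illustrative sample; the tag names are assumptions in the <ROM>...<\ROM> style
SAMPLE = r"""<ROM> Is the day so young? <\ROM>
<BEN> But new struck nine. <\BEN>
<ROM> Ay me! sad hours seem long. <\ROM>"""

def speeches_by_tag(text, tag):
    """Return every stretch of text enclosed in <tag> ... <\\tag>."""
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)<\\{re.escape(tag)}>", re.DOTALL)
    return [speech.strip() for speech in pattern.findall(text)]

def running_words(speeches):
    """Count word tokens, keeping apostrophes inside words."""
    return sum(len(re.findall(r"[A-Za-z']+", speech)) for speech in speeches)

romeo = speeches_by_tag(SAMPLE, "ROM")
print(romeo)
print(running_words(romeo))
```

Once each speech carries a speaker tag, the per-character word totals used to decide who is worth studying fall out of a simple count.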

20 Who is worth concentrating on …?
Character: total no. of words spoken
Romeo: 5031
Juliet: 4564
Friar Lawrence: 2901
Nurse: 2369
Capulet: 2292
Mercutio: 2254
Benvolio: 1293
Culpeper chose his characters based on the number of words that they “spoke”

21 Choosing a reference corpus
Culpeper opted to make 6 reference corpora – one for each character, e.g. RC for Romeo = whole play minus Romeo’s contributions RC for Juliet = whole play minus Juliet’s contributions RC for Nurse = whole play minus Nurse’s contributions Why use a reference corpus of the same play? ‘Characters are partly shaped by their context. Thus, it makes little sense to compare, say, the characters of Romeo and Juliet with the characters of Macbeth or Anthony and Cleopatra, since the fictional worlds of Italy, Scotland and Egypt provide very different contextual influences. Furthermore, characters, like people, are partly perceived in terms of whom they interact with …’ (Culpeper 2002: 16)

22 Alternative reference corpora …?
Scott and Tribble (2006) have compared Romeo and Juliet against The Complete Works of Shakespeare Plays only Tragedies only The BNC Interestingly … they found that A ‘robust core’ of keywords occur whichever reference corpus is used. These include personal and place names like “Benvolio”, “Romeo”, “Juliet” and “Mantua” but also terms like “banished”, “county”, “love” and “night” In contrast to Scott and Tribble (2006), Culpeper (2002) found that his results were more meaningful - in terms of characterisation - when using the other Romeo and Juliet characters (minus the target character) as a reference corpus

23 Making wordlists for each character
Making the characters’ word lists Involves telling Wordsmith to only include <…> … <\…> Procedure … Wordlist – Settings – Wordlist specific – Tags – Only part of file – Sections to keep – [specifying start/end tags] Making the reference corpora Involves telling Wordsmith to exclude anything between <…> … <\…> Wordlist – Settings – Wordlist specific – Tags – Only part of file – Sections to cut out – [specifying start/end tags]
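Outside Wordsmith, the same “sections to keep” versus “sections to cut out” logic can be sketched in Python. This is an illustration under assumptions: the tag names, sample lines and function names below are made up for the example, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter

# Illustrative tagged text; tag names are assumptions
TAGGED = r"""<ROM> Is the day so young? <\ROM>
<BEN> But new struck nine. <\BEN>
<ROM> Ay me! sad hours seem long. <\ROM>
<BEN> It was. What sadness lengthens Romeo's hours? <\BEN>"""

# A speech is <who> ... <\who>; the closing tag must repeat the opening name
SPEECH = re.compile(r"<(?P<who>[^<>\\]+)>(?P<text>.*?)<\\(?P=who)>", re.DOTALL)

def tokenize(text):
    """Lower-case word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z']+", text.lower())

def target_and_reference(tagged, target):
    """Word list for the target character (keep) and for everyone else (cut out)."""
    target_counts, reference_counts = Counter(), Counter()
    for match in SPEECH.finditer(tagged):
        bucket = target_counts if match.group("who") == target else reference_counts
        bucket.update(tokenize(match.group("text")))
    return target_counts, reference_counts

romeo_tc, romeo_rc = target_and_reference(TAGGED, "ROM")
print(sum(romeo_tc.values()), sum(romeo_rc.values()))
```

Text that is not wrapped in a matching pair of speaker tags never reaches either list, which is how untagged material would drop out of both the target and the reference counts.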

24 Top 10 on wordlists (frequency)

25 Positive keywords for the six characters
Romeo: Beauty, Blessed, Love, Eyes, More, Mine, Rich, Dear, Yonder, Farewell, Me, Sick, Lips, Stars, Fair, Thine, Hand, Banished
Juliet: If, Or, Sweet, Be, News, My, Night, I, Would, Yet, Thou, Words, Name, Tybalt, Send, Husband, That, swear
Capulet: Go, Wife, Thank, Ha, You, Thursday, Her, Child, Welcome, We, Haste, Gentlemen, Tis, Our, Make, Now, Daughter, Well
Nurse: Day, He’s, You, Quoth, Woeful, God, Warrant, Madam, Lord, Lady, Hie, It, Your, Faith, Said, Ay, She, About, Ever, Sir, Marry, Ah, Fall
Mercutio: A, Hare, Very, Of, He, The, O’er
Friar L: Thy, From, Thyself, Mantua, Part, Heaven, Forth, Alone, Time, Married, Letter
What differences can you spot between the results here and the results on the previous table?

26 What key words can tell us about characterisation …
Romeo’s top three key words: ‘beauty’, ‘blessed’, ‘love’. Expected? Surprising? … the lover of the play
Other keywords related to ‘love talk’: ‘dear’, ‘stars’, ‘fair’
Keywords relating to body parts: ‘eyes’, ‘lips’, ‘hand’. Obsessed with the physical?
Juliet’s top key words: ‘if’, ‘or’, ‘be’, ‘yet’, ‘would’ (conditionals + modals). Reflecting her state of mind: anxiety and uncertainty?
Capulet’s most ‘key’ key word: ‘go’. Context reveals that it is mostly used as an imperative … Capulet as head of the household directing other people (see also ‘make’ and ‘haste’), e.g. Go wake Juliet, go and trim her up…
Nurse’s keywords are surge features (i.e. reflecting outbursts of emotion): ‘god’, ‘warrant’, ‘woeful’, ‘faith’, ‘marry’, ‘ah’

27 Negative key words for the six characters
[Table columns: Romeo, Juliet, Capulet, Nurse, Mercutio, Friar L; the assignment of words to columns is not preserved in the transcript]
You He Go Her The And Thou That Of With My I What A Have
IMPORTANT: These represent words that are used unusually infrequently (statistically speaking) by these characters. Do you notice anything interesting?

28 Use of Pronouns within Romeo and Juliet
[Table of key pronouns by character (columns include Capulet, Nurse, Mercutio, Friar L); cell layout not preserved in the transcript]
POS: MY I THOU ME MINE THINE YOU WE TIS OUR HE’S IT YOUR SHE HE THY THYSELF
NEG: (entries not preserved)
Romeo and Juliet use first and second person pronouns. Expected? “at the heart of the social interaction in the play”
But compare Romeo’s use of ‘me/mine’ with Juliet’s use of ‘I’ …
Culpeper’s (2002) conclusion: ‘Juliet spends much time in the play bearing her soul … whereas Romeo is much more conscious of his own role as a lover and of the effect of the circumstances upon him’ (ibid: 24)
What about Capulet? “you”, “we”, “our”: why?
Thou-forms vs. you-forms to be covered

29 Culpeper’s Conclusion (2002: 27)
“In some cases, my analysis provided solid evidence for what one might have guessed (e.g. Romeo’s keywords ‘beauty’ and ‘love’) …” “… in others, it revealed what I think would be very difficult to guess but fits well a possible interpretation (e.g. Juliet’s keywords ‘if’ and ‘yet’).” “… keywords analysis also offers a way into analysing function words, such as pronouns, and accounting for their contribution to style and meaning”

30 What should we take note of …?
How he was able to come to his conclusions The importance of having the right reference corpus The need to use mark-up (as a means of identifying the different characters) Knowing how to use Wordsmith … To make the different wordlists To make the keyword lists

31 Any potential weaknesses …
It did not attempt to lemmatize the word forms … so that, for example, ‘loves’ would form part of the word count of ‘love’ (Culpeper 2002: 27) Contractions (e.g. I’ll) would also have been counted separately Key word analysis … makes us focus on ‘statistical deviations from a relative norm, and ignores the significance of relatively infrequent deviations from absolute norms’ (i.e. what your given texts may have in common) ignores one-off occurrences of words

32 Duplicating Culpeper (2002)
Now it’s your turn… Duplicating Culpeper (2002)

33 The Romeo text Download the “Oxford Shakespeare” version of Romeo and Juliet Local copy available Using tags to separate stage directions from dialogues Did Culpeper do this? Tag words spoken by each character Alternatively, you can use a local version I have prepared

34 Sample of tagged text
<Exeunt MONTAGUE and LADY. ROMEO. >
<Ben.> Good morrow, cousin. <\Ben.>
<Rom.> Is the day so young? <\Rom.>
<Ben.> But new struck nine. <\Ben.>
<Rom.> Ay me! sad hours seem long. Was that my father that went hence so fast? <\Rom.>
<Ben.> It was. What sadness lengthens Romeo’s hours? <\Ben.>
<Rom.> Not having that, which having, makes them short. <\Rom.>
<Ben.> In love? <\Ben.>
<Rom.> Out— <\Rom.>
<Ben.> Of love? <\Ben.>
<Rom.> Out of her favour, where I am in love. <\Rom.>

35 Separating words by apostrophes
clear ‘ from this box and press OK

36 Making a wordlist for each character
Start wordlist function
Load the text
Setting – Tags – Only part of File – “Sections to keep” – type in the start/end tags given below
Ignore <*> is the default setting, so stage directions are ignored
Make a wordlist for
Romeo_TC (<Rom.>…<\Rom.>)
Juliet_TC (<Jul.>…<\Jul.>)
Capulet_TC (<Cap.>…<\Cap.>)
Nurse_TC (<Nurse.>…<\Nurse.>)
Mercutio_TC (<Mer.>…<\Mer.>)
Friar_L_TC (<Fri._L.>…<\Fri._L.>)

37 Tag and markup Only Part of file

38 Making a reference list for each character
Setting – Tags – Only part of File – “Sections to cut out” – type in the start/end tags given below
This excludes what is said by the target character
Make a wordlist for
Romeo_RC (<Rom.>…<\Rom.>)
Juliet_RC (<Jul.>…<\Jul.>)
Capulet_RC (<Cap.>…<\Cap.>)
Nurse_RC (<Nurse.>…<\Nurse.>)
Mercutio_RC (<Mer.>…<\Mer.>)
Friar_L_RC (<Fri._L.>…<\Fri._L.>)

39 Running words
Character: in our file / in Culpeper (2002)
Romeo: 4842 / 5031
Juliet: 4438 / 4564
Friar Lawrence: 2860 / 2901
Capulet: 2282 / 2292
Nurse: 2250 / 2369
Mercutio: 2169 / 2254

40 Discrepancies: Some explanations
Different tagging
We ignored stage directions
We tried what Culpeper (2002) suggested at the end of his paper, treating contracted words such as “I’ll” as two words
A potential problem of this approach with Shakespearean texts: danc’d, disturb’d, rais’d etc. all became two words!
Is there a need to annotate the text? Not done here or in Culpeper (2002), but worth the effort, e.g. to tell apart the different uses of ’s in: the city’s side; let’s away; Where’s this girl?
Want to have a try?
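The effect of treating the apostrophe as a word separator can be illustrated with a small sketch. The example line is invented for the demonstration, and the two regular expressions are assumptions standing in for the tool’s setting, not Wordsmith’s actual tokenizer.

```python
import re

def tokenize(text, split_on_apostrophe):
    """Tokenize with the apostrophe either inside words or acting as a separator."""
    if split_on_apostrophe:
        pattern = r"[A-Za-z]+"                   # apostrophe breaks the word
    else:
        pattern = r"[A-Za-z]+(?:'[A-Za-z]+)*"    # apostrophe kept inside the word
    return re.findall(pattern, text)

line = "I'll say she danc'd; let's away"   # invented example line

whole = tokenize(line, split_on_apostrophe=False)
split = tokenize(line, split_on_apostrophe=True)
print(len(whole), whole)
print(len(split), split)
```

Splitting raises the running-word count and creates fragments such as d and s as separate types, which is one way stray items can end up on a keyword list.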

41 Top 10 on wordlists
[Table not preserved in the transcript: top 10 words for Romeo, Juliet, Capulet, Nurse, Mercutio, Friar L and the whole play]

42 Keyword settings
[Screenshot not preserved in the transcript; settings shown: cutoff p value, selected statistic formula, min. frequency]

43 Making a keyword list per character
Romeo_kw: Romeo_TC + Romeo_RC
Juliet_kw: Juliet_TC + Juliet_RC
Capulet_kw: Capulet_TC + Capulet_RC
Nurse_kw: Nurse_TC + Nurse_RC
Mercutio_kw: Mercutio_TC + Mercutio_RC
Friar_L_kw: Friar_L_TC + Friar_L_RC
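What combining a target list (_TC) with a reference list (_RC) amounts to can be sketched as a keyness calculation. The code below uses the common two-term log-likelihood formula as an assumption; the statistic, cutoff and minimum frequency actually selected in the tool may differ, and the counts in the example are invented, not taken from the play.

```python
import math
from collections import Counter

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Signed two-term log-likelihood: positive if relatively more frequent in the target."""
    total = freq_target + freq_ref
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    ll = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:                      # 0 * log(0) is treated as 0
            ll += observed * math.log(observed / expected)
    ll *= 2
    if freq_target / size_target < freq_ref / size_ref:
        ll = -ll                              # negative keyword
    return ll

def keyword_list(target, reference, min_freq=2, critical=3.84):
    """Keywords sorted by keyness; 3.84 is the chi-square critical value for p < 0.05, 1 d.f."""
    n_target, n_ref = sum(target.values()), sum(reference.values())
    rows = []
    for word in set(target) | set(reference):
        f_t, f_r = target[word], reference[word]
        if f_t + f_r < min_freq:
            continue
        score = log_likelihood(f_t, n_target, f_r, n_ref)
        if abs(score) >= critical:
            rows.append((word, score))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Invented counts, for illustration only
target_tc = Counter({"love": 30, "the": 20})
reference_rc = Counter({"love": 5, "the": 95})
result = keyword_list(target_tc, reference_rc)
for word, score in result:
    print(f"{word}\t{score:+.1f}")
```

Positive scores mark words the target character uses unusually often relative to the reference list, negative scores words used unusually rarely.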

44 Romeo’s keywords by keyness
Positive keywords
Aboutness: beauty, love, blessed, dream, joy, sin, kiss, death, poison, soul …
Love talk: dear, farewell, stars
Body parts: eyes, lips, hand
Pronouns: mine, me, thine, thee, my
Negative keywords
Himself: Romeo, he, him
Both: you, we
Movement: come, go, up

45 Juliet’s keywords by keyness
Positive keywords
People in interaction: nurse, Romeo, sweet, husband, mother, father
State of mind: if, or, be, yet, would
Pronouns: my, I, thou
Aboutness: news, words, night, swear, send, tongue, speak
Negative keywords
Herself: her
Both: we, you
Movement: here, go

46 Why “nurse” and “husband”?
(vocative function)

47 You-forms vs. thou-forms
Plural: ye, you, your, yours, yourself Singular: thou, thee, thy, thine, thyself You-forms vs. thou-forms (thou, thine, thee) – socio-pragmatic implications Romeo and Juliet prefer thou-forms (positive) and avoid you-forms (negative) High status social equals use you-forms You-forms are dispassionate and emotionally unmarked Thou-forms are strongly expressive: positive (affection and love) or negative (anger and contempt) – intimacy, love talk Friar Laurence prefers thou-forms: He is engaged in intimate and emotionally charged discourse Capulet and the Nurse prefer you-forms: used among social superiors, or individuals of low status talking to people of high social status
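A character’s preference for thou-forms or you-forms can be checked by counting the two sets of forms listed above. This is a minimal sketch: the example line is invented, and the function name is an assumption for illustration.

```python
import re
from collections import Counter

# Form sets as listed on the slide
THOU_FORMS = {"thou", "thee", "thy", "thine", "thyself"}
YOU_FORMS = {"ye", "you", "your", "yours", "yourself"}

def pronoun_profile(text):
    """Count thou-forms and you-forms in a stretch of a character's speech."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        "thou": sum(counts[form] for form in THOU_FORMS),
        "you": sum(counts[form] for form in YOU_FORMS),
    }

# Invented line, for illustration only
profile = pronoun_profile("I give thee my hand; thy love is thine own, and you know it")
print(profile)
```

Raw counts like these only become evidence of preference when set against the other characters’ speech, which is what the keyword comparison does.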

48 Capulet’s keywords by keyness
Positive keywords
Pronouns: you, we, her, our (directing and speaking on behalf of the household)
Directions: go, haste, make, now, look (imperatives)
Negative keywords
Pron: thy, thou [you vs. thou: imperative; less emotional]
Others: the, of, that, etc. [full of actions, not a ‘nouny’ style]

49 Nurse’s keywords by keyness
Positive keywords
Emotional: ay, ah, O, God, woeful, warrant, faith
Pronouns: you, your, he, I
Address terms: lady, madam, lord, sir
Why “day”? “O day! O day! O day! O hateful day!”
Negative keywords
Pron: thou
Why ’d?

50 Why “d”? Culpeper might have made the correct decision to treat contractions as one word?

51 Mercutio’s keywords by keyness
Positive keywords
“Noun-y” style: a, of, the, an (akin to written, less interactive)
Negative keywords
Less interactive style
Lack of question word: what
Lack of 1st person pron: I, my

52 Friar L’s keywords by keyness
Positive keywords
Pronouns: thy, thyself, thou (involved in intimate and emotionally charged discourse, "emotional mirror")
A man of the Church: heaven, from (heaven)
Roles he played in facilitating the plot: Mantua, letter
Negative keywords
Less emotional (than Nurse): O
Pronouns: my, you, I

53 Planning your own study …
What should I do first …? Choose your data and/or tool Determine what interests you about the data Come up with some “hypotheses” that you’d like to test out This can be data-driven (what seems to “jump out” at you from your data) This can be theory-oriented (i.e. testing out something about the language that’s taken for granted)
