Homing in on the Text-Initial Cluster. Mike Scott, School of English, University of Liverpool. Aston Corpus Symposium, Friday May 4th 2007.


1 Homing in on the Text-Initial Cluster. Mike Scott, School of English, University of Liverpool. Aston Corpus Symposium, Friday May 4th 2007. This presentation is at www.lexically.net/downloads/corpus_linguistics

2 Starting Questions 1. Are clusters like “Once upon a time” and “lived happily ever after” oddities in marking text position? 2. Or do many n-grams characterise the beginnings, middles or ends of certain kinds of text? 3. If so, are there any common patterns in text-initial clusters?

3 Context: Textual Priming Project, University of Liverpool. Michael Hoey, Michaela Mahlberg, Matthew O’Donnell, Mike Scott

4 Textual Priming Project: Aims: (1) to investigate how many (and what types of) lexical items are primed to appear in text-initial or paragraph-initial position; (2) to identify lexico-grammatical patterns and see how these patterns can be functionally interpreted in the textual contexts; (3) to relate these lexical and corpus-driven facts to current textual descriptions of (hard) news stories that might provide explanations for the positive primings of relevant lexis. (From O’Donnell et al 2007.)

5 Hard News Corpus: “Home News” sections of the Guardian and Observer, 1998 to 2004. 115,654 articles, divided thus: headline & lead; 1st sentence of 1st paragraph (TISC); all other sentences. TISC contains 3.2 million tokens; the rest, 51.2 million tokens. About 470 words per article.
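
The "about 470 words per article" figure can be checked with a line of arithmetic. The sketch below assumes that the 51.2 million tokens of "the rest" cover everything outside TISC, so that the whole corpus is 3.2 + 51.2 million tokens; the variable names are illustrative only.

```python
# Figures quoted on the slide.
ARTICLES = 115_654
TISC_TOKENS = 3_200_000
REST_TOKENS = 51_200_000  # assumption: covers all non-TISC text

total_tokens = TISC_TOKENS + REST_TOKENS
words_per_article = total_tokens / ARTICLES   # mean article length
tisc_share = TISC_TOKENS / total_tokens       # fraction of the corpus in TISC

print(f"{words_per_article:.0f} words per article")
print(f"TISC is {tisc_share:.1%} of the corpus")
```

On this assumption the text-initial sentences make up only a small fraction of the corpus, which is why the rest of the text can serve as a large reference corpus.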

6 Research Questions Using the hard news corpus, 1. How many 3-5 word clusters are found to be key in TISC sections? 2. How many are positively and how many are negatively key? 3. What recurrent patterns can be found in the two types of key cluster?

7 Methods (1) 1. Format the corpus in XML and separate out all TISC sections (done by Matt O’Donnell) 2. Use WordSmith’s WordList tool to compute wordlist indexes of (a) all the text and (b) all the TISC sections 3. Using WordList, compute 3-5 word clusters for each index, save as .lst
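
The talk computes clusters with WordSmith's WordList tool. As a rough illustration of what step 3 involves, here is a minimal Python sketch that counts every contiguous 3-5 word sequence. It is not WordSmith's implementation: the tokenisation rule, the decision not to cross sentence boundaries, and the toy sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Uppercase the text and keep runs of letters/apostrophes (an assumed rule)."""
    return re.findall(r"[A-Z']+", text.upper())

def count_clusters(sentences, n_min=3, n_max=5):
    """Count every contiguous n-gram of n_min..n_max words within each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Invented toy sentences, not taken from the corpus.
toy = [
    "It emerged yesterday that the plan failed.",
    "The plan failed, it emerged yesterday.",
]
clusters = count_clusters(toy)
for cluster, freq in clusters.most_common(3):
    print(freq, cluster)
```

Running the same function once over all the text and once over the TISC sentences would give the two frequency lists that the keyness comparison needs.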

8 Top clusters, all sections: GUARDIAN CO UK; ONE OF THE; A HREF HTTP, WWW GUARDIAN CO and similar web links; THE PRIME MINISTER; THE END OF; AS WELL AS; THE NUMBER OF; THERE IS A; SOME OF THE; THERE IS NO

9 Top clusters, TISC: ONE OF THE; ACCORDING TO A; LAST NIGHT AFTER; FOR THE FIRST; THE FIRST TIME; IS TO BE; FOR THE FIRST TIME; THE MURDER OF; ARE TO BE; THE DEATH OF; OF THE MOST; THE HOME SECRETARY; WAS LAST NIGHT; IT EMERGED YESTERDAY; AS PART OF; AN ATTEMPT TO; THE UNITED STATES; THE NUMBER OF; ONE OF THE MOST; ACCORDING TO THE

10 Methods (2) 4. Use KeyWords tool to compute KWs for the TISC 3-5 word clusters using all the text as a reference corpus 5. Identify patterns in the KW clusters

11 TISC key clusters: ACCORDING TO A; LAST NIGHT AFTER; IT EMERGED YESTERDAY; WAS LAST NIGHT; ARE TO BE; THE MURDER OF; LAST NIGHT WHEN; THE GOVERNMENT YESTERDAY; LAST NIGHT AS; IS TO BE; WERE LAST NIGHT; YESTERDAY AFTER A; TONY BLAIR YESTERDAY; COURT HEARD YESTERDAY; WAS TOLD YESTERDAY; WAS JAILED FOR; THE DEATH OF; YEAR OLD BOY; YESTERDAY WHEN THE; WITH THE MURDER OF

12 Numbers of Key Clusters

13 RQs 1 & 2: Numbers of KW clusters. Using a p value of 0.0000001, a minimum frequency of 3 and the log likelihood statistic, there are 8,132 key clusters altogether (in 3.2 million words of text), of which 7,631 were positively key and 501 negatively key, though there is repetition as these are 3-5 word n-grams.
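
The slides name the statistic and the thresholds but not the formula. The sketch below uses the common two-corpus form of the log-likelihood statistic, with the p value taken from the chi-square distribution with one degree of freedom; whether WordSmith's KeyWords tool computes it in exactly this way is an assumption. Using token counts as corpus sizes and applying the minimum frequency to the study corpus are also assumptions, and the cluster frequencies in the usage lines are invented.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood (G2) for one item."""
    combined = freq_study + freq_ref
    total = size_study + size_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return max(2 * ll, 0.0)  # guard against tiny negative rounding errors

def p_value(ll):
    """Upper-tail probability of chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(ll / 2))

def keyness(freq_study, size_study, freq_ref, size_ref,
            p_max=1e-7, min_freq=3):
    """Return 'positive', 'negative', or None if the item is not key."""
    if freq_study < min_freq:  # assumption: threshold applies to the study corpus
        return None
    ll = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    if p_value(ll) > p_max:
        return None
    study_rate = freq_study / size_study
    ref_rate = freq_ref / size_ref
    return "positive" if study_rate > ref_rate else "negative"

TISC_SIZE = 3_200_000    # tokens, from the corpus description
ALL_SIZE = 54_400_000    # tokens, assuming 3.2 + 51.2 million

# Invented frequencies, for illustration only.
print(keyness(400, TISC_SIZE, 500, ALL_SIZE))    # far commoner in TISC
print(keyness(5, TISC_SIZE, 5000, ALL_SIZE))     # far rarer in TISC
print(keyness(10, TISC_SIZE, 170, ALL_SIZE))     # same rate in both
```

A cluster is positively key when it is significantly more frequent in TISC than its overall rate would predict, and negatively key when it is significantly less frequent.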

14 Repetition: YESTERDAY FOUND GUILTY; YESTERDAY FOUND GUILTY OF; YESTERDAY FROM A; YESTERDAY FROM THE; YESTERDAY GAVE A; YESTERDAY GAVE HIS; YESTERDAY GAVE THE; YESTERDAY GIVEN A; YESTERDAY GIVEN THE; YESTERDAY GIVEN THE GO; YESTERDAY GIVEN THE GO AHEAD
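
One way to gauge how much of this repetition comes from shorter clusters nested inside longer ones is to keep only the clusters that are not contained in any longer cluster in the list. The sketch below is an illustration of that idea, not a step reported in the talk; the function names are hypothetical, and the cluster boundaries are inferred from the slide text.

```python
def is_contained(short, long):
    """True if `short` is a contiguous word sequence inside the longer `long`."""
    s, l = short.split(), long.split()
    if len(s) >= len(l):
        return False
    return any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))

def maximal_clusters(clusters):
    """Keep clusters that are not nested inside another cluster in the list."""
    return [c for c in clusters
            if not any(is_contained(c, other) for other in clusters)]

# Clusters as listed on the Repetition slide.
SLIDE_14 = [
    "YESTERDAY FOUND GUILTY", "YESTERDAY FOUND GUILTY OF",
    "YESTERDAY FROM A", "YESTERDAY FROM THE",
    "YESTERDAY GAVE A", "YESTERDAY GAVE HIS", "YESTERDAY GAVE THE",
    "YESTERDAY GIVEN A", "YESTERDAY GIVEN THE",
    "YESTERDAY GIVEN THE GO", "YESTERDAY GIVEN THE GO AHEAD",
]

kept = maximal_clusters(SLIDE_14)
print(len(SLIDE_14), "clusters listed,", len(kept), "not nested in a longer one")
```

Clusters such as YESTERDAY GIVEN THE and YESTERDAY GIVEN THE GO drop out because both sit inside YESTERDAY GIVEN THE GO AHEAD.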

15 Negatively key: A LOT OF; A SPOKESMAN FOR; THERE IS NO; HE SAID THE; SAID IT WAS; THERE IS A; THIS IS A; THE FACT THAT; AS WELL AS; IT WOULD BE; SPOKESMAN FOR THE; PER CENT OF; WE HAVE TO; SAID THAT THE; BUT IT IS; AT A TIME; A SPOKESMAN FOR THE; SAID HE WAS; IT IS NOT; THERE WAS NO

16 RQ 1: Numbers of KW clusters. Is 8 thousand a large number of distinct key text-initial clusters? In the same amount of text there are 84 thousand 3-5 word clusters of frequency at least 5 altogether, so about one in 10 is associated with text-initial position at the .0000001 level of significance.

17 RQ 1, continued: … is 1 in 10 a large number to be key? In the case of SISC (sentences from paragraphs with only one sentence in), we get 507 thousand clusters, of which 2,192 are key (1,747 positively and 445 negatively), which is about 1 in 230.
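
The two "1 in N" figures follow directly from the counts quoted on the slides. A short check, using the rounded totals of 84 thousand and 507 thousand as given (the helper name is illustrative):

```python
def one_in(total_clusters, key_clusters):
    """Express the proportion of key clusters as '1 in N'."""
    return total_clusters / key_clusters

# TISC: 7,631 positively key + 501 negatively key clusters.
tisc_key = 7_631 + 501
tisc_ratio = one_in(84_000, tisc_key)

# SISC: 1,747 positively key + 445 negatively key clusters.
sisc_key = 1_747 + 445
sisc_ratio = one_in(507_000, sisc_key)

print(f"TISC: {tisc_key} key clusters, about 1 in {tisc_ratio:.1f}")
print(f"SISC: {sisc_key} key clusters, about 1 in {sisc_ratio:.0f}")
```

On these figures a cluster in the TISC list is over twenty times more likely to be key than one in the SISC list.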

18 PATTERNS

19 RQ 3: patterns. Recency: in the top 200, seventy express time, generally using yesterday or last night.

20 Recency clusters: COURT HEARD YESTERDAY; TONY BLAIR YESTERDAY; YESTERDAY AFTER A; WERE LAST NIGHT; LAST NIGHT AS; THE GOVERNMENT YESTERDAY; LAST NIGHT WHEN; WAS LAST NIGHT; IT EMERGED YESTERDAY; LAST NIGHT AFTER; YESTERDAY IN A; IT EMERGED LAST NIGHT; A COURT HEARD YESTERDAY; YESTERDAY WHEN A; YESTERDAY AFTER THE; EMERGED LAST NIGHT; LAST NIGHT TO; YESTERDAY AS THE; YESTERDAY WHEN THE; WAS TOLD YESTERDAY
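
Since the recency clusters "generally" use yesterday or last night, a first-pass tagger can simply look for those two markers. The sketch below is an assumption about how such a tally could be automated, not the procedure used in the talk; the marker list is deliberately limited to the two expressions named on the slide, and the sample mixes clusters from the recency and superlative slides.

```python
def is_recency(cluster):
    """True if the cluster contains YESTERDAY or the sequence LAST NIGHT."""
    words = cluster.split()
    if "YESTERDAY" in words:
        return True
    return any(words[i:i + 2] == ["LAST", "NIGHT"]
               for i in range(len(words) - 1))

# A small sample drawn from the slides.
sample = [
    "COURT HEARD YESTERDAY",
    "IT EMERGED LAST NIGHT",
    "LAST NIGHT TO",
    "ONE OF THE MOST",
    "FOR THE FIRST TIME",
]
recency = [c for c in sample if is_recency(c)]
print(len(recency), "of", len(sample), "sample clusters express recency")

# Seventy of the top 200 key clusters, as reported on the slide.
share_of_top_200 = 70 / 200
print(f"{share_of_top_200:.0%} of the top 200")
```

A tagger this simple would miss other time expressions, so it gives a lower bound on the recency count rather than the full figure.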

21 Superlatives: ONE OF BRITAIN'S MOST; ONE OF THE MOST; OF THE WORLD'S; THE FIRST TIME; OF BRITAIN'S MOST; FOR THE FIRST; FOR THE FIRST TIME

22 Research, Report etc.: ACCORDING TO A REPORT; A COURT HEARD (YESTERDAY); ACCORDING TO RESEARCH; TO A SURVEY; IT EMERGED LAST NIGHT; IT WAS ANNOUNCED YESTERDAY; IT WAS REVEALED YESTERDAY; A REPORT PUBLISHED; ACCORDING TO A STUDY; TO RESEARCH PUBLISHED

23 Attention-grabbers: IT EMERGED THAT; OBSERVER CAN REVEAL; THE OBSERVER CAN REVEAL

24 Indefinite articles positively key: A BABY GIRL; A BAN ON; A BEACH IN; A BID TO; A BITTER ROW; A BLACK MAN; A BLISTERING ATTACK ON; A JURY WAS TOLD YESTERDAY; A LABOUR MP; A LANDMARK RULING; A LAST DITCH ATTEMPT TO; A LAST MINUTE; A LEADING BRITISH; A LEADING SCIENTIST; A LEGAL BATTLE; A LEGAL CHALLENGE

25 Indefinite articles negatively key: A KIND OF; A COUPLE OF; A GREAT DEAL; A KIND OF; A LOT MORE

26 IT + reporting verb, positively key: IT WAS ANNOUNCED LAST NIGHT; IT WAS CLAIMED LAST NIGHT; IT WAS CONFIRMED LAST NIGHT; IT IS REVEALED TODAY

27 IT otherwise negatively key: IT IS A; IT IS ABOUT; IT IS EXPECTED; IT IS GOING; IT IS ONLY; IT IS POSSIBLE; IT SEEMS TO

28 SAID YESTERDAY, positively key: SAID YESTERDAY AFTER; SAID YESTERDAY THAT HE; SAID YESTERDAY THEY HAD

29 SAID without time, negatively key: SAID AT THE; SAID HE HAD; SAID HE WOULD; SAID THE GOVERNMENT; SAID THERE WAS NO

30 Conclusions: The “once upon a time” syndrome seems to be much more common than might be thought. In text-initial sections of 115 thousand hard news stories (3.2 m. words), out of 84 thousand 3-5 word clusters, about 8 thousand (roughly 1 in 10) had text-initial significance, whereas in non text-initial sections only about 1 in 230 was key.

31 Other patterns: recency; superlatives; research, report; attention-grabbers; indefinite articles; IT + reporting verb; SAID + time

32 References: O’Donnell, Matthew, Mike Scott, Michaela Mahlberg & Michael Hoey (forthcoming) ‘When the text counts’ Exploring the Implications of ‘text’ as unit in corpus linguistics. Paper presented at PALC, Łódź, April 2007.

