Discovering Academic English using Sketch Engine


1 Discovering Academic English using Sketch Engine
Maha AlHarthi, Assistant Professor of Applied Linguistics at PNU and Vice-Dean of Graduate Studies at PNU

2 What is Sketch Engine? A corpus building, corpus querying and text analysis tool whose algorithms analyze authentic texts of billions of words to identify what is typical in language and what is rare, unusual or emerging usage. Designed for text analysis or text mining applications. Developed by Lexical Computing Limited since 2003. Used by linguists, lexicographers, translators, students and teachers. First choice solution for publishers, universities, translation agencies and national language institutes throughout the world.

3 Contains 400 ready-to-use corpora in 90+ languages, each having a size of up to 20 billion words to provide a representative sample of language. Users can build their own corpora using a fully automated process. A very wide range of grammar, vocabulary and discourse features can be explored by teachers and students.  The Sketch Engine website offers many ready-to-use corpora, and tools for users to build, upload and install their own corpora.

4 English Academic Corpora in SkE
60-million-word ACL Anthology Reference Corpus (conference and journal papers in natural language processing); 2.6-billion-word DOAJ corpus (journals from many fields covering academic topics); 3-million-word Cambridge Academic corpus (academic language at undergraduate and post-graduate level from a range of US and UK institutions); 79-million-word CAJA, the Corpus of Academic Journal Articles.

5 BAWE The British Academic Written English (BAWE) corpus was developed at the Universities of Warwick, Reading and Oxford Brookes under the direction of Hilary Nesi and Sheena Gardner (formerly of the Centre for Applied Linguistics [previously called CELTE], Warwick), Paul Thompson (Department of Applied Linguistics, Reading) and Paul Wickens (Westminster Institute of Education, Oxford Brookes), as part of the project An investigation of genres of assessed writing in British Higher Education. The project was funded by the Economic and Social Research Council.

6 Accessing the BAWE corpus
The BAWE corpus can be downloaded for research purposes via the Oxford Text Archive for use with WordSmith Tools, AntConc and similar tools. Information about the corpus, ranging from wordlists to academic publications, is available online, as is the guide Using Sketch Engine with BAWE (Nesi and Thompson). Writing for a Purpose: materials based on the project research are available for learners and teachers on the British Council Learn English website. The corpus can be freely accessed using the open version of the corpus query tool Sketch Engine; register for the full version for greater capability.

7 Using Sketch Engine with BAWE
Making a simple concordance search Click on the line which says ‘British Academic Written English Corpus’:

8 In the ‘Query’ box, write the word that you are interested in investigating. In this example, we have chosen the word ‘factor’.

9 Query Types Simple searches are case-insensitive (so a search for catch finds catch, Catch and CATCH). They match either the word form or the lemma (so a search for catch finds catch, catching, catches and caught, while a search for caught finds just caught). They can also cover more than one item, with a space as separator: a simple search for catch fire finds caught fire, catching fire and catch fire.
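The matching rules for a simple search can be illustrated with a toy sketch. This is not Sketch Engine code: the lemma table LEMMAS and the functions token_matches and simple_search are hypothetical names invented here, and a real corpus would take its lemmas from the corpus annotation rather than from a hand-written dictionary.

```python
# Toy illustration of simple-search matching (case-insensitive, form or lemma).
# LEMMAS is a hypothetical hand-made table, not data from Sketch Engine.
LEMMAS = {
    "catch": "catch", "catches": "catch", "catching": "catch", "caught": "catch",
    "fire": "fire", "fires": "fire",
}


def token_matches(query_word, token):
    """A token matches if it equals the query word or has it as its lemma."""
    q = query_word.lower()
    t = token.lower()
    return t == q or LEMMAS.get(t) == q


def simple_search(query, tokens):
    """Return every token sequence that matches the space-separated query."""
    words = query.split()
    hits = []
    for i in range(len(tokens) - len(words) + 1):
        window = tokens[i:i + len(words)]
        if all(token_matches(q, t) for q, t in zip(words, window)):
            hits.append(" ".join(window))
    return hits


EXAMPLE_TOKENS = "The barn caught fire while Catch and CATCH were catching fire".split()
```

With this example text, simple_search("catch fire", EXAMPLE_TOKENS) picks up both "caught fire" and "catching fire", whereas simple_search("caught", EXAMPLE_TOKENS) returns only the form "caught".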

10 ‘Character search’ is designed for languages which do not put spaces between words (Chinese, Japanese). CQL is the corpus query language: used for more sophisticated querying of corpus data.  

11 Click on ‘Make concordance’. You will get a page of results like this:

12 It can be sorted, sampled, filtered (for example by Context, or Text Type) or saved.
A range of frequency analyses are available, including collocation reports and analysis by text types (where the corpus has text types defined). At the individual level of the hit, the user can click on the search term for more context, or on the item in the ‘reference’ column to see the metadata for the item.


14 View options: length and number of concordance lines

15 View options: Information about assignments
Every assignment in the BAWE corpus has been coded for these categories of information. ‘text discgroup’ stands for disciplinary group: ‘AH’ for Arts and Humanities, ‘LS’ for Life Sciences, ‘PS’ for Physical Sciences and ‘SS’ for Social Sciences.

16 This tells us that the text was written by a female first-year Sociology student aged 25 or older, whose first language is English, and who received all her secondary education in the UK. The assignment received a distinction grade and contained 1632 words.


18 The word sketch The function that gives the Sketch Engine its name.
A one-page summary of a word’s grammatical and collocational behavior.


20 For catch (verb), just looking at the first column (objects of the verb), we immediately see a number of meanings, idioms and set phrases. We catch a glimpse of or catch sight of something. Fishermen, fishers and anglers (column 2) catch fish, trout and bass. You often want to catch someone’s attention. You sometimes catch your breath and things sometimes catch your eye. Sportsmen and women, in a range of sports, catch passes and balls. Things catch fire. We all sometimes catch buses.

21 The second column, for subject, introduces a couple of complications. Surprise relates to the expression caught by surprise. Touchdown catches is a term from American football: the word sketch succeeds in bringing it to our attention, though catches is a noun which has been misanalysed as a verb. Police introduces a new meaning of the verb (police catch criminals) and Anyone brings to our attention the related pattern Anyone caught [doing X] will be [punished].

22 The third column, and/or, tells us more about the police and sports meanings. Overheat goes with catch fire. Tangle and snag introduce a new meaning where, if a rope or line or piece of cotton or string or wire catches with something else, it no longer runs free.

23 The fourth table draws our attention to the phrasal verbs catch up, catch on, catch out; the fifth, to the reflexive use (I caught myself wondering…). The next set of tables shows us what we might be caught in (the crossfire, a trap, the headlights), on (videotape, CCTV), by and with (your pants down). The final column takes us back to the police, with people being caught red-handed and unprepared.

24 Thesaurus The Sketch Engine prepares a ‘distributional thesaurus’ for a corpus (a thesaurus created on the basis of shared collocates). If two words have many collocates in common, they will appear in each other’s thesaurus entry. It works as follows: if we find instances of both drink tea and drink coffee, there is evidence that tea and coffee are similar. We can say that they ‘share’ the collocate drink (verb). For all pairs of words, we compute how many collocates they share, and the ones that share the most (after normalisation) are the ones that appear in a word’s thesaurus entry. Distributional thesauruses are a topic of great interest in computational linguistics.
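The shared-collocate idea can be sketched in a few lines. This is a minimal illustration, not Sketch Engine’s actual algorithm: the triples in TRIPLES are invented, the function names are hypothetical, and the normalisation used here (the Jaccard coefficient, shared collocates divided by all collocates of either word) is a simple stand-in assumed for the example.

```python
from collections import defaultdict

# Invented (word, grammatical relation, collocate) triples for illustration.
TRIPLES = [
    ("tea", "object_of", "drink"),
    ("coffee", "object_of", "drink"),
    ("tea", "object_of", "brew"),
    ("coffee", "object_of", "brew"),
    ("tea", "modifier", "hot"),
    ("coffee", "modifier", "hot"),
    ("water", "object_of", "drink"),
    ("water", "modifier", "hot"),
    ("book", "object_of", "read"),
]


def collocate_sets(triples):
    """Group the (relation, collocate) pairs seen with each word."""
    sets = defaultdict(set)
    for word, relation, collocate in triples:
        sets[word].add((relation, collocate))
    return dict(sets)


def similarity(a, b, sets):
    """Jaccard coefficient: shared collocates over all collocates (assumed normalisation)."""
    set_a = sets.get(a, set())
    set_b = sets.get(b, set())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def thesaurus_entry(word, sets, top_n=3):
    """List the words sharing the most collocates with `word`, best first."""
    scored = [(other, similarity(word, other, sets)) for other in sets if other != word]
    scored = [(other, score) for other, score in scored if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


SETS = collocate_sets(TRIPLES)
```

Here thesaurus_entry("tea", SETS) ranks coffee first, because the two words share all three collocates, then water, which shares two; book shares none and is left out.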


26 Examining collocations
In Sketch Engine we can also use the collocation tool to obtain statistical information about how strong a collocation is (that is, whether the words occur together within a given range more often than chance alone would predict).


29 Measures of collocation: T-score, Mutual Information, LogDice
It is highly recommended to keep the default statistical measures: T-score, MI score and LogDice. Collocates from a T-score calculation tend to be more frequent words, while collocates from an MI calculation tend to be less frequent words (Hunston 2002: 72-75).

30 T-score It expresses the certainty with which we can argue that there is an association between the words (that their co-occurrence is not random). The value is affected by the frequency of the whole collocation, which is why very frequent word combinations tend to reach a high T-score despite not being significant as collocations.
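A small calculation shows the frequency effect. The sketch below uses one common formulation of the T-score from the corpus-linguistics literature, (observed - expected) / sqrt(observed), where the expected co-occurrence count is f_x * f_y / N. The function name t_score and all the counts are invented for illustration and do not come from BAWE.

```python
import math


def t_score(f_xy, f_x, f_y, n):
    """T-score for a word pair.

    f_xy: co-occurrence count; f_x, f_y: corpus frequencies of each word;
    n: corpus size in tokens.
    """
    expected = f_x * f_y / n  # co-occurrences expected by chance
    return (f_xy - expected) / math.sqrt(f_xy)


N_TOKENS = 1_000_000  # hypothetical corpus size

# A very frequent grammatical pairing (invented counts).
T_FREQUENT = t_score(f_xy=5_000, f_x=30_000, f_y=60_000, n=N_TOKENS)

# A rare pairing whose two words almost always occur together (invented counts).
T_RARE = t_score(f_xy=20, f_x=25, f_y=30, n=N_TOKENS)
```

With these counts the frequent pairing scores roughly ten times higher than the rare one, even though the rare pair is the more tightly bound combination.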

31 Mutual Information This score expresses the extent to which words co-occur compared with the number of times they appear separately. The MI score is strongly affected by frequency: low-frequency words tend to reach a high MI score, which may be misleading. This is why Sketch Engine allows a limit to be set, so that words with a frequency below this limit are not included in the calculation.
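The effect of the frequency limit can be sketched as follows, using the usual formulation MI = log2(f_xy * N / (f_x * f_y)). The function names, the min_freq parameter and every count below are assumptions made for this example; in particular, applying the limit to the collocate’s corpus frequency is one reading of the setting described above.

```python
import math


def mi_score(f_xy, f_x, f_y, n):
    """Mutual Information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def rank_by_mi(f_node, candidates, n, min_freq=1):
    """Rank collocates of a node word by MI, skipping words rarer than min_freq.

    candidates maps each collocate to (co-occurrence count, corpus frequency).
    """
    scored = [
        (word, mi_score(f_xy, f_node, f_word, n))
        for word, (f_xy, f_word) in candidates.items()
        if f_word >= min_freq
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


N_TOKENS = 1_000_000   # hypothetical corpus size
F_FACTOR = 1_000       # hypothetical frequency of the node word "factor"
CANDIDATES = {         # invented counts: (with "factor", in whole corpus)
    "the": (600, 60_000),
    "contributory": (8, 10),
    "zymogen": (1, 1),
}

UNFILTERED = rank_by_mi(F_FACTOR, CANDIDATES, N_TOKENS)
FILTERED = rank_by_mi(F_FACTOR, CANDIDATES, N_TOKENS, min_freq=5)
```

Without a limit, the word that occurs only once in the corpus tops the list; with min_freq=5 it is dropped and the more informative collocate comes first.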

32 Defining the range of collocation
If you are interested in the word that immediately precedes “factor” or “factors”, you can change the range to -1 and 0. Typing in 0 to 1 would show the words that immediately follow the key word.
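The range settings can be pictured as offsets from the key word. The sketch below is a hypothetical illustration (the function collocates_in_range and the example sentence are invented): a range of -1 to 0 counts only the word immediately before the key word, and 0 to 1 only the word immediately after it.

```python
from collections import Counter


def collocates_in_range(tokens, node_forms, left, right):
    """Count the words found at offsets left..right around each key word.

    Offset 0 is the key word itself and is never counted.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() not in node_forms:
            continue
        for offset in range(left, right + 1):
            if offset == 0:
                continue
            j = i + offset
            if 0 <= j < len(tokens):
                counts[tokens[j].lower()] += 1
    return counts


EXAMPLE = "a key factor and other risk factors include a major factor here".split()
NODE_FORMS = {"factor", "factors"}

PRECEDING = collocates_in_range(EXAMPLE, NODE_FORMS, -1, 0)
FOLLOWING = collocates_in_range(EXAMPLE, NODE_FORMS, 0, 1)
```

For this sentence, PRECEDING contains key, risk and major, while FOLLOWING contains and, include and here.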


34 Users & Uses of Sketch Engine

35 Users of SkE in Universities
In linguistics and languages departments: teaching and research Lexicography, Language teaching and learning, Teaching translation, Terminology, Sociolinguistics, (Critical) discourse analysis, Historical studies. In computing departments: teaching and research in relation to language technology (also called Natural Language Processing, Computational Linguistics) (the home area of all Sketch Engine team members).

36 Lexicography The first users were lexicographers, with Macmillan as the first user for the word sketches, and Oxford University Press as the first for the Sketch Engine. Lexicography requires very large corpora, so there is evidence for rare words and phrases. This is facilitated with the “big data” that can be created from the web.

37 The English learners’ dictionaries had a growing market, and were highly profitable, and were competing intensively with each other to produce ‘the best’ dictionary. Four of the five main dictionary publishers in the UK (Cambridge University Press, Harper Collins, Macmillan, Oxford University Press) used SkE intensively. At CUP and Macmillan, this is just for English; at Collins for the main European languages, and at OUP for large bilingual-dictionary projects for Arabic, Chinese and Portuguese.


39 Language Teaching English Language Teaching and the teaching of other languages including Chinese, Japanese and Arabic. The ‘Teaching and Language Corpora’ community has been exploring ways of bringing corpus methods into language-teaching practice since Tim Johns’ work in the 1980s. Johns worked in Birmingham, UK, alongside the COBUILD project for using corpora for lexicography, and the uses of corpora for ELT can be seen as having two parts: indirect use, in the preparation of dictionaries and coursebooks, and direct use, in the classroom.

40 Second/Foreign language learning and teaching
Two central questions arise: 1) What are learners saying and writing? 2) What should they be saying and writing? For the first question, there are learner corpora: Learner corpora are valuable for finding out what learners, at various levels, do, and for research into the process of language learning as well as the practicalities of curricula, course development, and testing. In the Sketch Engine there are learner corpora for Arabic, Slovene, Czech and English.

41 For the second question: what should they be saying and writing?
The general language corpora meet that need. One large population of language learners would like to study at an English-medium university; their target is the English that is spoken in seminars and written in university-level essays. The British Academic Spoken English (BASE) and British Academic Written English (BAWE) corpora have been created as samples of these target varieties.

42 Translators Translators find corpora (of specific domains: legal, medical, business etc.) useful for identifying the terminology and phraseology of the domain, in the language they are translating into. A number of professional translators are Sketch Engine users.

43 Terminologists One of the challenges for terminologists is finding the concepts and terms. The Sketch Engine can be used for term-finding. 

44 Sociolinguistics Sociolinguists are interested in how language varies between social groups, across age groups, with movements of populations, and between communities. A corpus designed to study these topics is the London English corpus (Kerswill et al. 2013).

45 (Critical) Discourse Analysis
Discourse analysis examines a particular kind of language for what it tells us about the attitudes, power relations and perspectives of the participants. This kind of work takes place in a range of departments in the humanities and social sciences. Examples include the analysis of British newspaper discourse on migrants and migration, a study on the representation of Islam in the British press, the portrayal of science in the news, and knowledge dissemination through personal blogs.

46 Sketching Muslims: A Corpus Driven Analysis of Representations Around the Word ‘Muslim’ in the British Press 1998–2009. Paul Baker, Costas Gabrielatos and Tony McEnery (2013). Broadsheet: The Daily Telegraph, The Times, The Independent, The Guardian, The Observer. Tabloid: The People, The Star, The Sun, The Express, The Mirror, The Daily Mail.

47 Historical Studies A central topic for linguists is language development and change. Corpora that look back over the history of a language support this kind of research: LatinISE (Latin from the third century B.C. to the twentieth century A.D.; McGillivray and Kilgarriff 2013) and the English Dialogues Corpus (sixteenth–eighteenth centuries; Culpeper and Kytö 2010). For the Arabic world and Islam, Classical Arabic is the language of the Quran and of the culture that the region shares. The different countries each have their own dialect, and the lingua franca, MSA, is closer to Classical Arabic than to the dialects. The King Saud University Corpus of Classical Arabic (KSUCCA) brings together many of the central texts of this language, culture and religion, including the Quran and the Hadith.

48 Thank You! mnalharthi@pnu.edu.sa

