So much of everything
Adam Kilgarriff, Lexical Computing Ltd.



To the web on its thirteenth birthday

Teenager now, growth spurts wild and luxuriant,
Your gangling form changes month on month.
Born from the womb of academe in 1994
Pornography was your wetnurse
Growing you big and strong
Teaching you new tricks
(webcams, streaming video, cash payments)
that your parents never dreamed of.
Teenagers must have independence!
So Web-2.0 has arrived.
Now everybody feeds you.

New life for democracy - are you aren't you? - are you aren't you?
Giving voice to millions
or cancer
Spam spammer spamming
Phishing, viruses, Trojans
Appropriating mutating multiplying.
New life or cancer?
Both of course
Like nuclear physics and the Russian Revolution.
You are huge like Exxon, Zeus and China
We tremble and we lick our lips as we watch you grow.

AK

Changes

Then           Now
Documents      + social media, audio, video
Library        + phone exchange, marketplace
Passive        + Active
Information    + Interaction
Computer       + tablet, smartphone
Public         + degrees of privacy

Dick Whittington off to London where the streets are paved with gold to make his fortune

Hero or Villain

As linguists we can
Study the web
– new
– fascinating
– mostly language: we are well-placed
Use the web as infrastructure
– Click and get it
Use the web as a source of data
All important: all different

Sketch Engine
Online corpus query tool
– Ready-to-use corpora
  All major world languages
  Many others
– Install your own corpora
– Build corpora from the web

Language varieties
Get a corpus
How does it compare with a reference
– Qualitatively
– Quantitatively
Biber 1988: Variation across Speech and Writing

Qualitative
Take keyword lists
– [a-z]{3,}
– Lemma if lemmatisation identical, else word
– C1 vs C2, top 100/200
– C2 vs C1, top 100/200
– study
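The `[a-z]{3,}` filter on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `is_candidate`, the sample items, and the choice to require the whole item to match the pattern are assumptions, not details taken from the talk.

```python
import re

# Assumption: an item qualifies only if the WHOLE string matches [a-z]{3,},
# i.e. lowercase ASCII letters only, at least three of them.
CANDIDATE = re.compile(r"[a-z]{3,}")


def is_candidate(item: str) -> bool:
    """Return True if the item may appear in the keyword list."""
    return CANDIDATE.fullmatch(item) is not None


# Made-up sample items, purely for illustration.
items = ["corpus", "of", "Web-2.0", "wubble", "2009", "the"]
kept = [w for w in items if is_candidate(w)]
print(kept)
```

Short items, numbers, and anything with capitals or punctuation drop out, which keeps the top 100/200 lists readable.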

Example: OCC and OEC
OEC: general reference corpus
– BrE and AmE
– All 21st century
– Some fiction
OCC: writing for children
– Mostly BrE
– Some 21st century, some earlier
– Mostly fiction
Compare 21st century fiction subcorpora
– (not enough British 21st century fiction in OEC)

It’s ever so interesting
Do it

Liverpool, July 2009. Kilgarriff: Simple Maths

Simple maths for keywords
“This word is twice as common here as there”

“This word is twice as common here as there”
What does it mean?
– For word wubble
– Ratio = 2:
– wubble is twice as common in fc as rc

                      Freq (f)   Corp size   Per million
Focus corp (fc)       40         10m         4
Reference corp (rc)   50         25m         2
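The wubble arithmetic can be checked with a short Python sketch. The function names `per_million` and `simple_ratio` are made up for this illustration; the counts and corpus sizes are the ones in the slide's table.

```python
def per_million(freq: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million tokens."""
    return freq * 1_000_000 / corpus_size


def simple_ratio(fc_freq: int, fc_size: int, rc_freq: int, rc_size: int) -> float:
    """Ratio of per-million frequencies: focus corpus over reference corpus."""
    return per_million(fc_freq, fc_size) / per_million(rc_freq, rc_size)


# wubble: 40 hits in a 10m-word focus corpus, 50 hits in a 25m-word reference corpus
wubble_ratio = simple_ratio(40, 10_000_000, 50, 25_000_000)
print(wubble_ratio)  # 2.0
```

Note that the raw count is higher in the reference corpus (50 vs 40); only after normalising by corpus size does the word come out as twice as common in the focus corpus.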

“This word is twice as common here as there”
Not just words
– Grammatical constructions
– Suffixes
– …
Keyword list
– Calculate ratio for all words
– Sort
– Keywords: at top of list

Good enough for keywords?
Almost, but
1. Are corpora well matched?
2. Burstiness
3. You can’t divide by zero
4. High ratios more common for rare words

1 Are corpora well matched?
Proportionality
– If fiction contains more American, newspaper more British…
– genre compromised by region
Usual problem
Issue in corpus design
Not here

2 Burstiness

Word          BNC freq   BNC files
mucosa        10319
theology
unfortunate

Discount frequency for bursty words
Gries, CL 2007, also CL journal
We use ARF (average reduced frequency)
Not here

3 You can’t divide by zero

word        fc     rc   ratio
buggle      10     0    ?
stort       100    0    ?
nammikin    1000   0    ?

Standard solution: add one

word        fc     rc   ratio
buggle      11     1    11
stort       101    1    101
nammikin    1001   1    1001

Problem solved
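The add-one fix is easy to reproduce. The sketch below uses the slide's three nonsense words and their counts; the function name `add_one_ratio` is invented for the example.

```python
def add_one_ratio(fc: float, rc: float) -> float:
    """Ratio after adding one to both frequencies, so rc = 0 is safe."""
    return (fc + 1) / (rc + 1)


# Focus-corpus counts from the slide; every word has rc = 0.
focus_counts = {"buggle": 10, "stort": 100, "nammikin": 1000}
ratios = {word: add_one_ratio(fc, 0) for word, fc in focus_counts.items()}
print(ratios)  # {'buggle': 11.0, 'stort': 101.0, 'nammikin': 1001.0}
```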

4 High ratios more common for rarer words

word   fc   rc   ratio   interesting?
spug   101               no
grod                     yes

some researchers: grammar, grammar words
some researchers: lexis, content words
No right answer
Slider?

Solution
Don’t just add 1, add n:

n=1
word        fc   rc   fc+n   rc+n   Ratio   Rank
obscurish
middling
common

n=100
word        fc   rc   fc+n   rc+n   Ratio   Rank
obscurish
middling
common

Solution

n=1000
word        fc   rc   fc+n   rc+n   Ratio   Rank
obscurish
middling
common

Summary
word        fc rc   n=1   n=100   n=1000
obscurish   100     1st   2nd     3rd
middling            2nd   1st     2nd
common              3rd           1st
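The effect of the add-n parameter on the ranking can be sketched as follows. The counts in `counts` are invented for illustration and are not the slide's figures; `add_n_ratio` and `ranking` are hypothetical helper names. The point is only that a small n pushes the rare word to the top and a large n pushes the common word to the top.

```python
def add_n_ratio(fc: float, rc: float, n: float) -> float:
    """Keyword score with n added to both frequencies."""
    return (fc + n) / (rc + n)


# Invented (fc, rc) counts for a rare, a mid-frequency and a common word.
counts = {
    "obscurish": (10, 0),
    "middling": (30, 10),
    "common": (105_000, 100_000),
}


def ranking(word_counts: dict, n: float) -> list:
    """Words sorted from highest to lowest add-n ratio."""
    return sorted(word_counts,
                  key=lambda w: add_n_ratio(*word_counts[w], n),
                  reverse=True)


for n in (1, 100, 1000):
    print(n, ranking(counts, n))
```

With these made-up counts the top word moves from the rare item at n=1, to the mid-frequency item at n=100, to the common item at n=1000, which is the behaviour a "slider" over n would expose.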

But what about
Mutual information
Log-likelihood
Chi-square
Fisher’s test
…
Don’t they use cleverer maths?

Yes but
Clever maths is for hypothesis testing
– Can you defeat null hypothesis?
Language is not random, so … you always can
Null hypothesis never true
Hypothesis-testing not informative
Clever maths irrelevant
– Kilgarriff 2006, CLLT

Moreover…
just one answer
– grammar words vs content words?
– does not help
confuses and obscures

Example
BAWE
– British Academic Written English
  Nesi and Thompson, completed last year
– Student essays
  Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences
– fc: ArtsHum, rc: SocSci
– With n=10 and n=1000

Quantitative
Methods, evaluation
– Kilgarriff 2001, Comparing Corpora, Int J Corp Ling
– Then: not many corpora to compare
– Now: Many
  Ad hoc, from web
– First question: is it any good, how does it compare
Let’s make it easy: offer it in Sketch Engine

A digital native who still likes books and crayons
Thank you

Original method
C1 and C2:
– Same size, by design
– Put together, find 500 highest freq words
For each of these words
– Freqs: f1 in C1, f2 in C2, mean = (f1+f2)/2
– (f1-f2)^2 / mean (chi-square statistic)
Sum
Divide by 500: CBDF
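The steps on this slide can be sketched in Python. This is a minimal illustration, not the original implementation: the function name `cbdf`, the token-list input, and the fallback of dividing by the number of words actually used (when the corpora have fewer than 500 distinct words) are assumptions. As on the slide, the two corpora are assumed to be the same size.

```python
from collections import Counter


def cbdf(tokens1: list, tokens2: list, top_n: int = 500) -> float:
    """Chi-square-based distance over the top_n most frequent joint words."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    joint = c1 + c2  # "put together"
    top_words = [w for w, _ in joint.most_common(top_n)]
    if not top_words:
        return 0.0
    total = 0.0
    for w in top_words:
        f1, f2 = c1[w], c2[w]
        mean = (f1 + f2) / 2  # never zero: w occurs at least once overall
        total += (f1 - f2) ** 2 / mean
    # Assumption: divide by the number of words used (500 on the slide).
    return total / len(top_words)


# Two tiny made-up "corpora" of equal size.
score = cbdf(["a", "a", "b"], ["a", "b", "b"])
print(score)
```

Identical corpora score 0, and the score grows as the frequency profiles of the common words diverge.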

Evaluated
Known-similarity corpora
– Shows it worked
– Used to set parameter (500)
– CBDF better than alternative measures tested

Adjustments for SkE
Problem: non-identical tokenisation
– Some awkward words: can’t
– undermine stats as one corpus has zero
Solution
– commonest 5000 words in each corpus
– intersection only
– commonest 500 in intersection
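The word-selection fix for mismatched tokenisation might look like the sketch below. The function name `comparison_words` is hypothetical, and ranking the intersection by combined raw frequency is an assumption, since the slide does not say how "commonest in intersection" is measured.

```python
from collections import Counter


def comparison_words(tokens1: list, tokens2: list,
                     per_corpus: int = 5000, final: int = 500) -> list:
    """Pick the words on which two differently tokenised corpora are compared."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    top1 = {w for w, _ in c1.most_common(per_corpus)}
    top2 = {w for w, _ in c2.most_common(per_corpus)}
    shared = top1 & top2  # intersection only
    # Assumption: "commonest" means highest combined frequency; ties break alphabetically.
    return sorted(shared, key=lambda w: (-(c1[w] + c2[w]), w))[:final]


# One corpus keeps "can't" whole, the other splits it into "ca" + "n't".
corpus1 = "can't can't the the the cat".split()
corpus2 = "ca n't the the dog cat".split()
words = comparison_words(corpus1, corpus2)
print(words)  # ['the', 'cat']
```

Items that exist in only one tokenisation, such as "can't" versus "ca" and "n't", never reach the comparison, so they cannot produce spurious zero counts.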

Adjustments for SkE
Corpus size highly variable
– Chi-square not so dependable
– Also not consistent with our keyword lists
Link to keyword lists
– link quant to qual
Keyword lists
– nf = normalised (per million) frequencies
– Keyword lists: (nf1+k)/(nf2+k)
– Default value for k=100
– We use: if nf1>nf2, (nf1+k)/(nf2+k), else (nf2+k)/(nf1+k)
Evaluated on Known-Sim Corpora
– as good as/better than chi-square
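The two scores on this slide differ only in direction: the keyword score is directional, while the comparison score always puts the larger frequency on top. A small Python sketch, with the hypothetical function names `keyword_score` and `symmetric_score` and made-up input frequencies; how per-word scores are combined into a single corpus distance is not stated on the slide and is not attempted here.

```python
def keyword_score(nf1: float, nf2: float, k: float = 100.0) -> float:
    """Directional score used for keyword lists: (nf1 + k) / (nf2 + k)."""
    return (nf1 + k) / (nf2 + k)


def symmetric_score(nf1: float, nf2: float, k: float = 100.0) -> float:
    """Per-word comparison score: larger normalised frequency on top, so >= 1."""
    if nf1 > nf2:
        return (nf1 + k) / (nf2 + k)
    return (nf2 + k) / (nf1 + k)


# Made-up per-million frequencies for one word in two corpora.
a = symmetric_score(40, 20)
b = symmetric_score(20, 40)
print(a, b)
```

Because the symmetric version gives the same value whichever corpus is called C1, it suits a distance measure, while the directional version suits "C1 vs C2" keyword lists.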

Simple maths for keywords
This word is twice as common in this text type as that

              N      freq   Freq per m
Focus corp    2m     80     40
Ref corp      15m    300    20
ratio                       2

Intuitive
Nearly right but:
– How well matched are corpora
  Not here
– Burstiness
  Not here
– Can’t divide by zero
– Commoner vs. rarer words

What’s missing
Heterogeneity
“how similar is BNC to WSJ”?
– We need to know heterogeneity before we can interpret
The leading diagonal
2001 paper: randomising halves
– Inelegant and inefficient
– Depended on standard size of document

New definition, method (Pavel)
Heterogeneity (def)
– Distance between most different partitions
Cluster to find ‘most different partitions’
Bottom-up clustering
– until largest cluster has over one third of data
– Rest: the other partition
Problem
– n x n distance matrix where n > 1 million
– Solution: do it in steps

Summary
Extrinsic evaluation
– Method defined
– Big project for coming months
Corpus comparison
– Qualitative: use keywords
– Quantitative
  On beta
Heterogeneity (to complete the task) to follow (soon)

you should understand the maths you use

The Sketch Engine
Leading corpus query tool
Widely used by dictionary publishers, at universities
Large corpora for many languages available
Word sketches
Web service
Since last week:
– Implements SimpleMaths

Thank you

Language is never ever ever random
