Presentation on theme: "School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Coursework exercise: Localization of English on the World Wide Web Natural."— Presentation transcript:
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Coursework exercise: Localization of English on the World Wide Web Natural Language Processing Eric Atwell, Language Research Group
English on the World Wide Web This assignment involves knowledge management, data mining and text analytics from real-world data, sourced from the World Wide Web. This exercise will give you practical experience of methods in data mining, knowledge discovery, text analytics, the CRISP-DM data-mining methodology, reporting and explaining technologies for knowledge management to professional clients or customers.
The imaginary scenario English is used world-wide as the common language of science, engineering, business, commerce, and education; but in any given part of the world, is there a localized regional variety of English? Advertising and marketing literature may need to be localized or adapted to suit local preferences. We want to persuade Canon marketing department that Data Mining and Text Mining are useful for their work. To do this, you are going to produce a short video research report, demonstrating that text analytics can provide empirical evidence to help answer a research question: Are the English-language web-pages from a selected region closer to British English or American English?
Deliverable: report The end result will be a summary of your research aimed at Canon marketing managers. Your 5-10 minute video summary of your findings should include Introduction, Methods, Results, and Conclusions; you should include screenshots with examples of evidence. Students can work in pairs, to share the workload; this means you have someone else to discuss the details with, and also should halve the workload of each individual, though there is some extra overhead in collaborating and coordinating the work. However, if you prefer, you can work on your own and produce an individual report.
1) Background research Read research papers on this topic, including (Atwell et al 2009, Atwell et al 2007) which describe similar exercises undertaken previously, and (Baroni and Bernadini 2004), which describes their use of BootCat and Google to collect Italian medical documents from the Web, and then data-mine these for Italian medical terminology. You should also explore the SketchEngine website which includes WebBootCat, a simple web-interface to BootCat and Google; and the World Wide English Corpus collected by past students, using WebBootCat and Google.
2) Choose region and audience Choose a specific region to study, for which you can find WWW data. Select a regional subcorpus sample of texts, covering between 3 and 6 national domains relevant to your chosen region, from the World Wide English corpus website. You may use SketchEngine and WebBootCat to collect additional national English 200,000-word sample(s) for countries not yet included in the World Wide English Corpus; you will need to register for a free trial username, then follow instructions.
SketchEngine WebBootCat Advanced
WebBootCat Advanced options
Harvest your corpus Tick the URLs you want (default: yes!) WebBootCat visits URLs, downloads, scrapes text This may take some time... Or even stall When finished, download Horizontal or Vertical - you want Vertical, one word per line BUT: check it is 200,000+ words; If not, try again, maybe with different parameter settings
3) Tell me your plans your choice of 3+ WWW domains to for verification; and if you choose to work as a pair, please include NAMES of both partners. If I find the set of domains unsuitable or someone else has already targeted the same group of countries, I may ask you to choose another.
4) Features: differences UK v US Question: is your regional English more like UK or US? Consult expert sources on the English language to identify significant, measurable terms or features which show the difference between British and American English, eg relative freq of color v colour, or center v centre. You have to also think about how to extract feature-values from the data; do not choose features which are difficult to extract or compute (for example, features based on differences in grammar or meaning; these are hard to extract computationally). Some possible expert sources are suggested in Wikipedia; also try appropriate Google searches, and/or Leeds Uni Library. You can try Raysons Log-likelihood calculator or Sharoffs compare-fq-lists.pl to compare two corpus-derived word- frequency lists and highlight words which are noticeably more frequent in one or other corpus.Log-likelihood calculator compare-fq-lists.pl
cslin-gps% perl compare-fq-lists.pl A tool for comparing frequency lists from Fname2 against Fname1 as the reference corpus Usage compare-fq-lists.pl -1 Fname -2 Fname2 [-f N] [-l] [-o] [-s] [-t N] The format of frequency lists is either frq word OR rank frq word -f N --functionwords Do not count the top N most frequent words (50 is a good approximation) -i --ignorezero Do not count words not occurring in the reference corpus -l --latex Output a Latex table -o --odds Compute log-odds instead of loglikelihood (used by Marco Baroni) -s --simplified Only words and scores are output -t N --threshold No more than N words
cslin-gps% perl compare-fq-lists.pl -1 /home/www/eric/db32/uk/ukw -2 /home/www/eric/db32/us/usw | more /home/www/eric/db32/uk/ukw: /home/www/eric/db32/us/usw: Comparing two corpora /home/www/eric/db32/uk/ukw ( ) vs /home/www/eric/db32/us/usw ( ) Word Frq1 Frq2 LL-score Ã CALL NUMBER shall State or state Texas
Contd... Texas Mr program PN Posted Board County percent Video license Department Program States school vehicle Ms ORS Bush City property PM court ÃÃÂ§ Center CA Home Commission department Oregon
cslin-gps% perl compare-fq-lists.pl -1 /home/www/eric/db32/us/usw -2 /home/www/eric/db32/uk/ukw | more /home/www/eric/db32/us/usw: /home/www/eric/db32/uk/ukw: Comparing two corpora /home/www/eric/db32/us/usw ( ) vs /home/www/eric/db32/uk/ukw ( ) Word Frq1 Frq2 LL-score UK BBC London programme Wales aircraft Experience Scotland Ireland British aerospace
Contd... aerospace it programmes i album University you aero aviation Centre Tel Scottish Subjects Cambridge research DeweyClass Author scheme details Manchester colour Britain organisation IRA
Compare-fq-lists.pl results Many terms are local places, names US: Texas, state, Bush; UK: Wales, Manchester, IRA – not really linguistic preferences? Many terms are noise due to unrepresentative sample, skewed to a few topics/domains US: County, Board, vehicle; UK: University, aviation A few terms are recognisably linguistic differences US: program, center; UK: programme, centre, colour
5) Start your report: intro, methods Arabic and Arab English in the Arab World (Atwell et al 2009) provides a model for what should appear in your report; plan your own report IN YOUR OWN WORDS to show you understand what you are doing, and replace Arab English results with your own results. Start by drafting the Introduction, to explain why the question and approach are relevant to language researchers in this region, and your initial assumption about the type of English preferred in the countries you have chosen to investigate (e.g. you might assume that countries in the Commonwealth, mainly ex-British colonies, might prefer British English). Next, expand on the Methods, explaining your approach to finding empirical evidence, and your choice of features for text-mining.
6) Extract feature values For each feature you have chosen to use, you need to extract feature-values for each National English. For example, if you decide that relative frequency of color v colour is a feature you want to include in your model, you will need to extract values of this feature from each National English sample; To help you, word-frequency lists are available on the website for each national text sample.website Create data-files formatted for data-mining (eg WEKA.arff files) encoding these features: for the Gold Standard UK and US samples on the website, and for your national domain web-as-corpus samples
-> center centre centerpercent color colour colorpercent english 1,32,3, 0,20,0, UK 0,25,0, 0,12,0, UK 9,27,25, 0,84,0, UK 0,19,0, 0,24,0, UK 0,16,0, 0,14,0, UK 0,16,0, 0,12,0, UK 0,21,0, 0,38,0, UK 0,25,0, 0,34,0, UK 2,26,7, 2,3,40, UK 2,32,5, 1,59,2, UK 31,0,100, 55,0,100, US 61,0,100, 26,0,100, US 24,0,100, 11,0,100, US 12,1,92, 21,4,84, US 8,0,100, 4,2,67, US 10,0,100, 8,0,100, US 19,0,100, 22,0,100, US 14,0,100, 7,0,100, US 14,0,100, 6,0,100, US 8,5,62, 24,0,100, US
-> center centre centerpercent color colour colorpercent english 10,5,33, 0,20,0, UK test.arff has data (values for 7 features) For ONE other nation (UK English assumed)
7) Train and test models Use these data-mining feature files in a DM tool (eg in WEKA, train data-mining models such as ZeroR, OneR, J-rip rule-based classifier, or J48 decision-tree) with the UK and US datasets, to produce a model which predicts whether a new dataset is UK or US English then test this with your selected national domains, to find how they are classified by the model. Note that if you use WEKA, the arff files must include UK or US as the final feature; this should be your initial assumption of the preference for each country. If the model classifies the country correctly this means WEKA agrees with your assumption; but if the model is wrong, this actually signifies that your assumption was wrong. Make sure you save lots of snapshots of evidence from your data-mining experiments, to paste into your video.
8) Localized English words EXTRA WORK: combine your 3-6 national corpora into a single corpus representing English for the whole region – like the.ARAB Arab English corpus in (Atwell et al 2009). Derive a word-frequency list for your regional English corpus, and use Sharoffs compare-fq-lists.pl to compare the regional word-frequency list with the UK and/or US word-frequency list. Highlight words which are noticeably more frequent in one or other corpus; and try to think of possible explanations for the differences found.compare-fq-lists.pl
9) Add to Report: results, conclusions Extend the research report plan to include Results (evidence, including summary of the evidence showing Data-Mining output) and Conclusions (summarise the evidence showing which variety of English is dominant in each national domain, and in the region overall; discuss noticeable differences comparing regional English with UK and/or US English; and discuss any limitations of your approach). (you may also want to tidy up intro and methods)
Submit Research Video upload the movie to Youtube; then with URL of your movie, AND state whether you agree to let me add this URL to the course website, to let other students see it. - if you want to keep your video private, let me know - please do NOT submit your video report direct to a Journal! (but AFTER it is marked, if its really good I may suggest submitting a paper to the journal...)
References Atwell, Eric; Al-Sulaiti, Latifa; Sharoff, Serge Arabic and Arab English in the Arab World. Proceedings of CL2009 International Conference on Corpus Linguistics. Atwell, Eric; Arshad, Junaid; Lai, Chien-Ming; Nim, Lan; Rezapour Ashregi, Noushin; Wang, Josiah; Washtell, Justin Which English dominates the World Wide Web, British or American? Proceedings of Corpus Linguistics Baroni, Marco; and Bernardini, Silvia BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC2004 LREC: Language Resources and Evaluation Conferences website World Wide English Corpus collected by past students Sharoff, Serge Tools for Processing Corpora (including compare-fq-lists.pl )
Summary: practical experience of … This assignment involves NLP, text analytics and text mining from data sourced from the World Wide Web. This exercise will give you practical experience of methods in data mining, knowledge discovery, text analytics, the CRISP-DM data-mining methodology, and explaining technologies for KM to professional clients or customers – you are the KM Consultant!