School of Computing, FACULTY OF ENGINEERING. Combining research and teaching in knowledge management and corpus linguistics. Based on Corpus Linguistics 2007 presentation: "Which English dominates the World Wide Web, British or American?" by Eric Atwell, Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, and Justin Washtell, School of Computing, Leeds University
Outline. Introduction: Experiments to combine research and teaching in Knowledge Management and Corpus Linguistics, using an AI-inspired intelligent agent architecture, but casting students as the intelligent agents. Each student applies KM/Data-mining to a corpus, then we combine results. Methods: Students given detailed coursework spec, write up as a research paper. Results: Draft journal papers by Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, Justin Washtell (+66 more?). Conclusions: Spamming journals? More research questions for next year's classes?
Background assumptions. The aim of research is to generate conference/journal papers (for RAE, for publicity, for promotion, ?). Computing students should learn to apply ICT to practical, real/useful tasks. Research-led teaching and learning is a Leeds Univ strength, L&T08 conference theme. SO … students could learn by applying ICT to research questions, and writing conference/journal papers on results? BUT … research is hard – surely a student can't come up with the ideas and results for a publishable research paper?! Maybe one student can't … but…
Intelligent Agent Architecture. Wikipedia: In computer science, an intelligent agent (IA) is a software agent that exhibits some form of artificial intelligence that assists the user and will act on their behalf, in performing repetitive computer-related tasks. While the working of software agents used for operator assistance or data mining (sometimes referred to as bots) is often based on fixed pre-programmed rules, "intelligent" here implies the ability to adapt and learn … a multi-agent system (MAS) is a system composed of several agents, collectively capable of reaching goals that are difficult to achieve by an individual agent or monolithic system … A multiple agent system (MAS) is a distributed parallel computer system built of many very simple components, each using a simple algorithm, and each communicating with other components. A paradigm of an ant colony or bee swarm is used many times.
Students as intelligent agents. Bio-Inspired Computing researchers aim to develop software which behaves like ants, bees, etc. to model complex systems. Why not use students as super-intelligent agents?? Prof David Cliff: this is cheating – his goal is software agents. BUT my goals are CL research, student research-led learning. I am not trying to build bio-inspired computing software! Let's see what happens if I apply agent-based architecture to student coursework exercises…
2005-06: PASCAL MorphoChallenge2005. Cognitive Systems and Multidisciplinary Informatics MSc students in my Computational Modelling class. Coursework: build an unsupervised machine learning program to learn morphological analysis of English, Finnish, Turkish, e.g. seaside => sea side. Systems developed ranged from minimalist to very successful! My hybrid voting system performed better than any individual student's system … Atwell, Eric; Roberts, Andrew. Combinatory Hybrid Elementary Analysis of Text (CHEAT) in: Kurimo, M, Creutz, M & Lagus, K (editors) Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes. 2006.
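The slide does not say how the hybrid voting system combined the students' outputs. As a minimal sketch, assuming a plain majority vote over the segmentations proposed for one word (the function name `vote` and the sample proposals are hypothetical, not taken from the CHEAT paper):

```python
from collections import Counter

def vote(segmentations):
    """Return the segmentation proposed by the most systems.

    Assumption: simple majority vote; ties go to the analysis that
    was seen first, which is how Counter.most_common orders equal counts.
    """
    counts = Counter(segmentations)
    return counts.most_common(1)[0][0]

# Hypothetical outputs of four student systems for the word "seaside".
proposals = ["sea side", "seaside", "sea side", "seas ide"]
print(vote(proposals))
```

A committee like this can beat its members when their errors are uncorrelated, which is one plausible reading of the result reported on the slide.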
2006-07: UK v US English on WWW. 93 Computing students studying Computational Modelling and Technologies for Knowledge Management were given the data-mining coursework task of harvesting and analysing a Data Warehouse from WWW, using WWW-BootCat web-as-corpus technology (Baroni et al 2006). Each student/agent collected English-language web-pages from a specific national top-level domain. The analysis task involved comparing each national sample web-as-corpus with given gold standard samples from UK and US domains, to assess whether national WWW English terminology / ontology was closer to UK or US English. Then, MSc students summarised groups of results for CL07.
Methods: CRISP-DM; WWW-BootCat and Google; Compare to .UK and .US; Follow-up: regional overviews
CRISP-DM. The task was cast as an exercise in applying the CRISP-DM methodology for computational modelling: the Cross-Industry Standard Process for Data Mining projects. CRISP-DM specifies a series of phases or sub-tasks in a data-mining project; it is a recipe to follow, allowing novices and non-experts to carry out data mining experiments:
Business Understanding: map UK v US English on WWW
Data Understanding: English text from web-pages
Data Preparation: extract word-frequency list, key features?
Modelling: compare national wordlist against UK, US
Evaluation: are results reasonable, convincing?
Deployment: write a report – to submit for assessment!
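The Data Preparation phase (extracting a word-frequency list) could be sketched as follows. The tokenisation rule is an assumption made for illustration; the coursework spec is not quoted on the slide, and `word_frequencies` is a hypothetical helper name.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Build a word-frequency list from raw page text.

    Assumption: lower-case everything and treat runs of ASCII letters
    (optionally with one internal apostrophe) as words.
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

freqs = word_frequencies("The colour of the sea isn't the color of the sky.")
print(freqs.most_common(3))
```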
WWW-BootCat and Google. WWW-BootCat: easy-to-use web front-end to BootCat. User supplies seed terms, typical English words (Sharoff 06). Constrain search to Domain (e.g. .fr), Language (e.g. English). WWW-BootCat uses Google to find and download web-pages … hey presto: 200,000-word national English corpus! Problems: Technical, e.g. user licences/keys required, server downtime, …; Small national domains, e.g. South Georgia Island; Legal restrictions, e.g. Algerian law promotes Arabic over French (et al.)
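To make the seed-term step concrete, here is a rough sketch of how seed words might be combined into domain-restricted search queries. It assumes the BootCat-style approach of drawing random tuples of seeds and a `site:` restriction for the national domain; `build_queries`, its parameters, and the seed list are hypothetical and do not reproduce the WWW-BootCat interface.

```python
import itertools
import random

def build_queries(seeds, domain, n_queries=5, tuple_size=3, rng_seed=0):
    """Combine seed words into query strings limited to one domain.

    Assumptions: queries are random tuples of distinct seeds, and the
    search engine accepts a "site:" restriction such as "site:.fr".
    """
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)  # fixed seed keeps the sketch repeatable
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(words) + f" site:{domain}" for words in chosen]

# Hypothetical seed list of common English words.
seeds = ["the", "with", "which", "people", "time", "about"]
for query in build_queries(seeds, ".fr"):
    print(query)
```

The returned pages would still need downloading, cleaning and language filtering before they form a corpus.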
Compare to .UK and .US. Next, each agent/student had to decide if their national sample was closer to British or American English. Computing students/agents could not use linguistic expertise. Instead, compute similarity to .UK and .US gold standards (also collected via WWW-BootCat and Google): Word-frequency Log-Likelihood profiles and averages; Occurrences of selected words (color/colour, tap/faucet); Lexical analysis only – not syntax or pronunciation
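The log-likelihood comparison can be illustrated with the standard two-corpus log-likelihood statistic for word frequencies. This is a sketch under assumptions: the slide does not give the exact formula or averaging procedure the students used, so the mean over the joint vocabulary and the names `mean_ll` and `closer_variety` are illustrative choices.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for one word.

    a, b: the word's frequency in corpus 1 and corpus 2
    c, d: total token counts of corpus 1 and corpus 2
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def mean_ll(sample, reference):
    """Average log-likelihood over the joint vocabulary (assumed measure)."""
    n1, n2 = sum(sample.values()), sum(reference.values())
    vocab = set(sample) | set(reference)
    if not vocab:
        return 0.0
    total = sum(log_likelihood(sample[w], reference[w], n1, n2) for w in vocab)
    return total / len(vocab)

def closer_variety(sample, uk, us):
    """Pick the gold standard with the lower average log-likelihood."""
    return "UK" if mean_ll(sample, uk) <= mean_ll(sample, us) else "US"

# Toy frequency lists, invented for illustration only.
sample = Counter({"colour": 5, "the": 50})
uk = Counter({"colour": 10, "the": 100})
us = Counter({"color": 10, "the": 100})
print(closer_variety(sample, uk, us))
```

A lower average means the national sample's word frequencies diverge less from that gold standard; real corpora would need far larger samples than this toy data.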
Follow-up: regional overviews This yielded 93 reports on national web-as-corpus analyses… … but still difficult to collate results, see patterns. Follow-up coursework for MSc students: collate and compare results across a group of countries in a single geographical or political region, to produce overviews of English in the region. Students could base their regional overview on the results gathered in the first exercise, though some chose to collate and analyse their own web-as-corpus data afresh. Each regional report was to be written as a research journal paper, targeted at a journal specific to the region.
Results. Draft journal papers (accepted for CL2007, BUT they couldn't afford time or fees): Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, Justin Washtell. More draft journal papers by Precious CHIVESE, Binita DUTTA, Dureid EL-MOGHRABY, Sanaz GHODOUSI, Olatomiwale MALOMO, Anh NGUYEN
Junaid Arshad: Analysis of English used in a web corpus from the Middle East … Jordan and Egypt English corpora were closer to UK than US English; English websites in Saudi Arabia, Lebanon, Israel, Kuwait, and Bahrain were more similar to US English than UK English; and UAE and Iran English websites contained a mix of UK and US English, with neither dominant…
Chien-Ming Lai Studying Influences of British English and American English on World Wide Web in Southeast Asia by Applying Web as Corpus … The countries studied were Indonesia, Malaysia, Philippines, Singapore, Thailand and Vietnam. Among these countries, only Philippines and Singapore recognize English as official language, but English is widely used in the other countries … the English texts used in most of the chosen countries in the Southeast Asia are closer to the American English…
Lan Nim: The Dominant English Type within the World Wide Web Domains of France and its Former Colonies … This paper investigates the English used in the WWW domains of France (.fr) and its former colonies of Vietnam (.vn), Laos (.la), Mauritius (.mu) and Senegal (.sn) … British English is more dominant overall in Francophone domains compared to American English. However, some local variation was observed: American English is more widespread in Vietnam, probably due to American political influence after the end of French colonization; and, more surprisingly, American English seems more prevalent than British English in the .FR domain of France.
Noushin Rezapour Asheghi Which English dominates the World Wide Web in countries where English is a native language: British or American? … The results from Log-Likelihood technique in modelling phase indicate that English used in Australian, South African and Irish web sites is closer to British English and text in New Zealand, Jamaican and Canadian web sites are more similar to American English. However, there is not a great difference between the results of comparing these corpora with British and American English… and British spelling is used predominantly in the New Zealand domain…
Josiah Wang: Dominance of British and American English on the World Wide Web in Malaysia, Singapore and Brunei … Malaysia, Singapore and Brunei have a history as British post-colonial countries... As a comparison, we have also included three neighbouring countries … Former British colonies like Malaysia, Singapore and Brunei still favour British English on the World Wide Web. In addition, Indonesia and Papua New Guinea which are indirectly influenced by British English (i.e. through the Netherlands and Australia) also tend to lean towards British English. The Philippines on the other hand still continue to exhibit America's influence with their preference for American English on the Internet.
Justin Washtell: The Polynesian influence on English in the World Wide Web of Pacific island nations … This study analyses the effect of indigenous Polynesian languages upon the balance of a core of function (non-lexical) words in sample English web corpora taken from Polynesian island nation domains: from a selection of New Zealand, Cook Islands and French Polynesian websites. These corpora are compared to those recovered from .uk and .us domains and significant grammatical differences are sought. Noted differences are compared with those found between a French corpus from France and one captured from French Polynesian websites using an identical technique…
Conclusions: research. We expected US English to dominate the WWW: Computing generally has been American-led; US-owned companies might base national websites on US originals. Result: British English is holding its own; no clear winner? It is hard to find major differences; International English? Main differences are in pronunciation, not lexis?
Conclusions: student learning. Students spent a lot of time on collecting the corpus… a painting-by-numbers exercise – little intelligence needed? OR – practical experience of using web-as-corpus tools? Student feedback: many relished the challenge of a real exercise with large-scale data, contributing to a big result. MSc students with papers accepted for MorphoChallenge and Corpus Linguistics conferences were very pleased! (even though I couldn't pay for them to go!) … it would be even better if EVERY student had a chance to publish…
2007-08: 66 Journal papers? http://www.comp.leeds.ac.uk/db32/assessment.htm Re-use the web-as-corpus samples from last year, more time for Data-Mining with WEKA: select key features, e.g. colour v color; train classifier with UK and US samples; classify unseen national samples. Classifier is a novel empirical model of UK/US English differences … and more time to write Report: each student to choose a language-related Journal and draft a paper for this!
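The select-train-classify steps can be sketched in a few lines. This is not WEKA (a Java toolkit) and not the coursework's actual feature set: it is a pure-Python nearest-centroid stand-in, and the spelling pairs in `PAIRS`, the function names and the toy texts are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical key features: (UK spelling, US spelling) pairs.
PAIRS = [("colour", "color"), ("centre", "center"), ("organise", "organize")]

def features(text):
    """One value per pair: share of UK spellings (0.5 if neither occurs)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vec = []
    for uk, us in PAIRS:
        total = counts[uk] + counts[us]
        vec.append(counts[uk] / total if total else 0.5)
    return vec

def train(samples):
    """samples: list of (text, label). Returns label -> mean feature vector."""
    sums, n = {}, Counter()
    for text, label in samples:
        acc = sums.setdefault(label, [0.0] * len(PAIRS))
        for i, value in enumerate(features(text)):
            acc[i] += value
        n[label] += 1
    return {label: [v / n[label] for v in acc] for label, acc in sums.items()}

def classify(model, text):
    """Assign the label whose centroid is nearest (squared Euclidean)."""
    vec = features(text)
    def dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(vec, centroid))
    return min(model, key=lambda label: dist(model[label]))

model = train([("the colour of the centre", "UK"),
               ("the color of the center", "US")])
print(classify(model, "a colour chart from an unseen national sample"))
```

In the coursework itself, feature vectors like these would be exported for WEKA, where students could try different learners and inspect the trained model.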
And finally… I want to run a similar exercise next year: casting students as intelligent agents to combine teaching and research… I need other web-as-corpus research questions to answer, … to be divided into 50+ subtasks, one for each student … with computable metrics, for Computing students SUGGESTIONS WELCOME!