Combining research and teaching in knowledge management and corpus linguistics (School of Computing, Faculty of Engineering)


Combining research and teaching in knowledge management and corpus linguistics
Based on the Corpus Linguistics 2007 presentation "Which English dominates the World Wide Web, British or American?" by Eric Atwell, Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, and Justin Washtell
School of Computing, Faculty of Engineering, Leeds University

Outline
Introduction: experiments to combine research and teaching in Knowledge Management and Corpus Linguistics, using an AI-inspired intelligent agent architecture, but casting students as the intelligent agents. Each student applies KM/data-mining to a corpus, then we combine results.
Methods: students are given a detailed coursework spec and write up their work as a research paper.
Results: draft journal papers by Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, Justin Washtell (+66 more?)
Conclusions: spamming journals? More research questions for next year's classes?

Background assumptions
The aim of research is to generate conference/journal papers (for RAE, for publicity, for promotion, ?)
Computing students should learn to apply ICT to practical, real / useful tasks
Research-led teaching and learning is a Leeds University strength, and the L&T08 conference theme
SO … students could learn by applying ICT to research questions, and writing conference/journal papers on the results?
BUT … research is hard: surely a student can't come up with the ideas and results for a publishable research paper?!
Maybe one student can't … but…

Intelligent Agent Architecture
Wikipedia: In computer science, an intelligent agent (IA) is a software agent that exhibits some form of artificial intelligence that assists the user and will act on their behalf, in performing repetitive computer-related tasks. While the working of software agents used for operator assistance or data mining (sometimes referred to as bots) is often based on fixed pre-programmed rules, "intelligent" here implies the ability to adapt and learn …
a multi-agent system (MAS) is a system composed of several agents, collectively capable of reaching goals that are difficult to achieve by an individual agent or monolithic system …
A multiple agent system (MAS) is a distributed parallel computer system built of many very simple components, each using a simple algorithm, and each communicating with other components. A paradigm of an ant colony or bee swarm is used many times.

Students as intelligent agents
Bio-Inspired Computing researchers aim to develop software which behaves like ants, bees, etc. to model complex systems
Why not use students as super-intelligent agents??
Prof David Cliff: this is cheating, because his goal is software agents
BUT my goals are CL research and student research-led learning; I am not trying to build bio-inspired computing software!
Let's see what happens if I apply agent-based architecture to student coursework exercises…

PASCAL MorphoChallenge2005
Cognitive Systems and Multidisciplinary Informatics MSc students in my Computational Modelling class.
Coursework: build an unsupervised machine learning program to learn morphological analysis of English, Finnish, Turkish, e.g. seaside => sea side
Systems developed ranged from minimalist to very successful!
My hybrid voting system performed better than any individual student's system …
Atwell, Eric; Roberts, Andrew. Combinatory Hybrid Elementary Analysis of Text (CHEAT). In: Kurimo, M., Creutz, M. & Lagus, K. (editors), Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes.
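The idea of a hybrid voting system can be illustrated with a minimal sketch: each student system proposes a segmentation for a word, and the combined system keeps the proposal made by the most systems. This is only an assumption about what "voting" might look like here; the slides do not describe how CHEAT actually combined the outputs, and the function name and the example proposals are hypothetical.

```python
from collections import Counter

def vote_segmentation(candidates):
    """Return the segmentation proposed by the most systems.

    `candidates` holds one segmentation per system, each a tuple of
    morphemes. Ties go to the segmentation that was proposed first.
    """
    counts = Counter(candidates)
    best = max(counts.values())
    for segmentation in candidates:  # first-seen order breaks ties
        if counts[segmentation] == best:
            return segmentation

# Hypothetical outputs of four student systems for the word "seaside".
proposals = [("sea", "side"), ("seaside",), ("sea", "side"), ("seas", "ide")]
winner = vote_segmentation(proposals)
print(" ".join(winner))
```

A simple majority vote like this can only beat the individual systems when their errors differ, which is one reason to combine many independently built student systems.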

UK v US English on WWW
93 Computing students studying Computational Modelling and Technologies for Knowledge Management were given the data-mining coursework task of harvesting and analysing a Data Warehouse from WWW, using WWW-BootCat web-as-corpus technology (Baroni et al 2006). Each student/agent collected English-language web-pages from a specific national top-level domain. The analysis task involved comparing each national sample web-as-corpus with given gold standard samples from UK and US domains, to assess whether national WWW English terminology / ontology was closer to UK or US English. Then, MSc students summarised groups of results for CL07.

Methods
CRISP-DM
WWW-BootCat and Google
Compare to .UK and .US
Follow-up: regional overviews

CRISP-DM
The task was cast as an exercise in applying the CRISP-DM methodology for computational modelling: the Cross-Industry Standard Process for Data Mining projects. CRISP-DM specifies a series of phases or sub-tasks in a data-mining project; it is a recipe to follow, allowing novices and non-experts to carry out data mining experiments:
Business Understanding: map UK v US English on WWW
Data Understanding: English text from web-pages
Data Preparation: extract word-frequency list, key features?
Modelling: compare national wordlist against UK, US
Evaluation: are results reasonable, convincing?
Deployment: write a report, to submit for assessment!
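The Data Preparation phase, extracting a word-frequency list from the collected text, could look roughly like the sketch below. The tokenisation rule (lower-cased alphabetic tokens, with an optional apostrophe part) is an assumption made for illustration; the coursework spec is not given in the slides.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lower-case the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

# Hypothetical two-sentence sample standing in for a downloaded web page.
sample = "The colour of the sea. The sea's colour changes."
freq = word_frequencies(sample)
for word, count in freq.most_common(2):
    print(word, count)
```

The resulting `Counter` is the kind of national wordlist that the Modelling phase then compares against the UK and US samples.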

WWW-BootCat and Google
WWW-BootCat: easy-to-use web front-end to BootCat.
User supplies seed terms, typical English words (Sharoff 06).
Constrain search to Domain (e.g. .fr), Language (e.g. English).
WWW-BootCat uses Google to find and download web-pages … hey presto: 200,000-word national English corpus!
Problems:
Technical, e.g. user licences/keys required; server downtime, …
Small national domains, e.g. South Georgia Island
Legal restrictions, e.g. Algerian law promotes Arabic over French (et al)
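As a rough illustration of how seed terms can be turned into domain-restricted searches, the sketch below combines seeds into small random tuples and appends a `site:` restriction. This is an assumption about the general approach, not the actual WWW-BootCat query format, which the slides do not show; `build_queries`, its parameters, and the seed words are hypothetical.

```python
import itertools
import random

def build_queries(seeds, domain, tuple_size=3, n_queries=5, rng_seed=0):
    """Combine seed words into random tuples and restrict each query to a domain."""
    combos = list(itertools.combinations(seeds, tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so the sketch is repeatable
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(combo) + f" site:{domain}" for combo in chosen]

# Hypothetical seed terms: common English words.
seeds = ["time", "people", "year", "work", "government"]
queries = build_queries(seeds, "fr")
for query in queries:
    print(query)
```

Using several small tuples rather than one long query is a way to reach a wider spread of pages from the same domain.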

Compare to .UK and .US
Next, each agent/student had to decide if their national sample was closer to British or American English
Computing students/agents could not use linguistic expertise
Instead, compute similarity to .UK and .US gold standards (also collected via WWW-BootCat and Google):
Word-frequency Log-Likelihood profiles and averages;
Occurrences of selected words (color/colour, tap/faucet);
Lexical analysis only – not syntax or pronunciation
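A word-frequency Log-Likelihood comparison can be sketched as follows, assuming the standard corpus-linguistics log-likelihood statistic for a word seen a times in a corpus of c tokens and b times in a corpus of d tokens. Averaging the score over all words, and treating the lower average as "closer", is one plausible reading of "profiles and averages"; the exact procedure the students used is not stated. All frequency counts below are invented toy data.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word with frequency a in c tokens and b in d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def average_ll(freq_a, freq_b):
    """Mean log-likelihood over every word seen in either Counter."""
    c, d = sum(freq_a.values()), sum(freq_b.values())
    words = set(freq_a) | set(freq_b)
    return sum(log_likelihood(freq_a[w], freq_b[w], c, d) for w in words) / len(words)

# Invented toy frequency lists; real lists would come from the web corpora.
national = Counter({"the": 50, "colour": 4, "color": 1, "centre": 3})
uk = Counter({"the": 100, "colour": 9, "color": 1, "centre": 6})
us = Counter({"the": 100, "colour": 1, "color": 9, "center": 6})

closer = "UK" if average_ll(national, uk) < average_ll(national, us) else "US"
print("Closer to", closer)
```

A score of zero means a word is used at the same relative rate in both corpora, so a lower average indicates a more similar vocabulary profile.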

Follow-up: regional overviews
This yielded 93 reports on national web-as-corpus analyses… but it was still difficult to collate results and see patterns.
Follow-up coursework for MSc students: collate and compare results across a group of countries in a single geographical or political region, to produce overviews of English in the region.
Students could base their regional overview on the results gathered in the first exercise, though some chose to collate and analyse their own web-as-corpus data afresh.
Each regional report was to be written as a research journal paper, targeted at a journal specific to the region.

Results
Draft journal papers (accepted for CL2007, BUT they couldn't afford time or fees): Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, Justin Washtell
More draft journal papers by Precious CHIVESE, Binita DUTTA, Dureid EL-MOGHRABY, Sanaz GHODOUSI, Olatomiwale MALOMO, Anh NGUYEN

Junaid Arshad
Analysis of English used in a web corpus from the Middle East
… Jordan and Egypt English corpora were closer to UK than US English; English websites in Saudi Arabia, Lebanon, Israel, Kuwait, and Bahrain were more similar to US English than UK English; and UAE and Iran English websites contained a mix of UK and US English, with neither dominant…

Chien-Ming Lai
Studying Influences of British English and American English on World Wide Web in Southeast Asia by Applying Web as Corpus
… The countries studied were Indonesia, Malaysia, Philippines, Singapore, Thailand and Vietnam. Among these countries, only the Philippines and Singapore recognize English as an official language, but English is widely used in the other countries … the English texts used in most of the chosen countries in Southeast Asia are closer to American English…

Lan Nim
The Dominant English Type within the World Wide Web Domains of France and its Former Colonies
… This paper investigates the English used in the WWW domains of France (.fr) and its former colonies of Vietnam (.vn), Laos (.la), Mauritius (.mu) and Senegal (.sn) … British English is more dominant overall in Francophone domains compared to American English. However, some local variation was observed: American English is more widespread in Vietnam, probably due to American political influence after the end of French colonization; and, more surprisingly, American English seems more prevalent than British English in the .FR domain of France.

Noushin Rezapour Asheghi
Which English dominates the World Wide Web in countries where English is a native language: British or American?
… The results from the Log-Likelihood technique in the modelling phase indicate that English used in Australian, South African and Irish web sites is closer to British English, and text in New Zealand, Jamaican and Canadian web sites is more similar to American English. However, there is not a great difference between the results of comparing these corpora with British and American English… and British spelling is used predominantly in the New Zealand domain…

Josiah Wang
Dominance of British and American English on the World Wide Web in Malaysia, Singapore and Brunei
… Malaysia, Singapore and Brunei have a history as British post-colonial countries... As a comparison, we have also included three neighbouring countries … Former British colonies like Malaysia, Singapore and Brunei still favour British English on the World Wide Web. In addition, Indonesia and Papua New Guinea, which are indirectly influenced by British English (i.e. through the Netherlands and Australia), also tend to lean towards British English. The Philippines, on the other hand, still continues to exhibit America's influence with its preference for American English on the Internet.

Justin Washtell
The Polynesian influence on English in the World Wide Web of Pacific island nations
… This study analyses the effect of indigenous Polynesian languages upon the balance of a core of function (non-lexical) words in sample English web corpora taken from Polynesian island nation domains: from a selection of New Zealand, Cook Islands and French Polynesian websites. These corpora are compared to those recovered from .uk and .us domains and significant grammatical differences are sought. Noted differences are compared with those found between a French corpus from France and one captured from French Polynesian websites using an identical technique…

Conclusions: research
We expected US English to dominate the WWW:
Computing generally has been American-led
US-owned companies might base national websites on US originals
Result: British English is holding its own; no clear winner?
It is hard to find major differences; International English?
Main differences are in pronunciation, not lexis?

Conclusions: student learning
Students spent a lot of time on collecting the corpus…
A painting-by-numbers exercise, with little intelligence needed? OR practical experience of using web-as-corpus tools?
Student feedback: many relished the challenge of a real exercise with large-scale data, contributing to a big result
MSc students with papers accepted for MorphoChallenge and Corpus Linguistics conferences were very pleased! (even though I couldn't pay for them to go!)
… it would be even better if EVERY student had a chance to publish…

66 Journal papers?
Re-use the web-as-corpus samples from last year, leaving more time for Data-Mining with WEKA:
Select key features, e.g. colour v color
Train classifier with UK and US samples
Classify unseen national samples
Classifier is a novel empirical model of UK/US English differences
… and more time to write the Report: each student to choose a language-related Journal and draft a paper for this!
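The select/train/classify steps can be sketched in a few lines. The plan above names WEKA, a Java toolkit; the sketch below is not WEKA but a hand-written stand-in, a small naive Bayes classifier over spelling-variant features with add-one smoothing and equal class priors. The feature list and the training tokens are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical key features: UK/US spelling variants.
FEATURES = ["colour", "color", "centre", "center", "organise", "organize"]

def train(samples):
    """Build per-label log-probabilities from (label, tokens) pairs."""
    counts = {}
    for label, tokens in samples:
        counts.setdefault(label, Counter()).update(t for t in tokens if t in FEATURES)
    model = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(FEATURES)  # add-one smoothing
        model[label] = {f: math.log((c[f] + 1) / total) for f in FEATURES}
    return model

def classify(model, tokens):
    """Return the label whose feature distribution best explains the tokens."""
    scores = {
        label: sum(logp[t] for t in tokens if t in logp)
        for label, logp in model.items()
    }
    return max(scores, key=scores.get)

# Invented training samples standing in for the UK and US gold standards.
model = train([
    ("UK", "colour centre colour organise".split()),
    ("US", "color center organize color".split()),
])
print(classify(model, "the colour of the centre".split()))
```

If a sample contains none of the selected features, every label scores zero and the first label is returned, so a real experiment would need a richer feature set or an explicit "undecided" outcome.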

And finally…
I want to run a similar exercise next year: casting students as intelligent agents to combine teaching and research…
I need other web-as-corpus research questions to answer,
… to be divided into 50+ subtasks, one for each student
… with computable metrics, for Computing students
SUGGESTIONS WELCOME!