Japan Advanced Institute of Science and Technology


Corpus linguistics: Pitfalls and problems. John Blake, Japan Advanced Institute of Science and Technology.

Research context: Postgraduate research institute in Japan. Scientific and technology abstracts in Japanese and English. Theory to underpin a tool to help researchers draft abstracts in English.

Four-step (linear/recursive) process: Design, Construction, Annotation, Analysis. Research is presented in a linear manner (Latour & Woolgar, 1986), but research is often a messy, recursive, non-linear process.

Critique the method. 19 questions arose from critically evaluating a 5-sentence description.

i. A corpus-based study of scientific research abstracts (SRAs) written in English in the field of Information Science.

1. Why choose a corpus study? Fastest growing methodology in linguistics (Gries, forthcoming). Ad populum? Insufficiency of relying on intuition (Hunston, 2002; Reppen, 2010). Hasty generalization? Importance of frequency and recurrence (Stubbs, 2007).

2. Why choose a corpus-based approach? Choices: corpus-driven (Tognini-Bonelli, 2001), corpus-based, corpus-informed. Problems: confirmation bias, cherry picking.

3. Why focus on SRAs? Problems: drafting SRAs for novice researchers and NNESs. Gap in research: few large-scale studies of RAs; no holistic studies of SRAs; no corpus studies in information science. Importance: addresses the problems and fills the gap; meets the needs of doctoral students and faculty.

4. Why study this topic? Importance of SRAs (already discussed). “Publish in English or perish” (Ventola, 1992, p.191). Growth in information science.

ii. A tailor-made corpus of all abstracts (n=1581) published in 2012 in 5 IEEE journals was created.

5. Why create a corpus rather than use an existing one? Purpose is paramount (Nelson, 2010). The corpus needs to be representative of the language under investigation (Reppen, 2010). No existing corpus of SRAs in information science.

6. Why choose that sample size? Size is a vexed issue (Carter & McCarthy, 2001). Bigger is better (Sinclair, 1991): hapax legomena. Balanced and representative is better, e.g. domain-specific research …smaller corpora (Hunston, 2002). Ballpark figures: 80% of all corpus studies on RAs (n<100 texts); 5% of all corpus studies on RAs (n>100 texts).
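To make the hapax legomena point concrete, the sketch below lists the words that occur exactly once in a text. The function name and the simple regular-expression tokenizer are assumptions made for illustration; a real study would use the tokenization of its chosen corpus tool.

```python
import re
from collections import Counter


def hapax_legomena(text):
    """Return the sorted list of word forms that occur exactly once.

    Tokenization is a simplifying assumption: lowercase runs of
    letters, digits, apostrophes and hyphens count as words.
    """
    tokens = re.findall(r"[a-z0-9'-]+", text.lower())
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)


print(hapax_legomena("the cat saw the dog"))  # ['cat', 'dog', 'saw']
```

In a small corpus a large share of the vocabulary tends to fall into this list, which is one reason size matters when rare items are the object of study.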

7. Why 1581 texts? Is it balanced? Size related to practicality (McEnery & Hardie, 2012). Isotextual vs. isolexical (Oakey, 2009). “the solution may be to include all issues of the selection of publications from a given week, month or year. This will allow the proportions to determine themselves.” (Hunston, 2002)

8. How much time or money is needed? Collect 50, then estimate for whole sample. Estimate efficiency gains. Automate? Outsource?
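The "collect 50, then estimate" step is simple proportional arithmetic, sketched below. The pilot timing (100 minutes for 50 texts) and the `speedup` factor are hypothetical values chosen for illustration, not figures from the study.

```python
def estimate_total_hours(pilot_minutes, pilot_texts, total_texts, speedup=1.0):
    """Extrapolate total collection time from a timed pilot batch.

    speedup is an assumed efficiency gain (2.0 means twice as fast
    as the pilot, for example after automating repetitive steps).
    """
    minutes_per_text = pilot_minutes / pilot_texts
    return minutes_per_text * total_texts / speedup / 60


# Hypothetical pilot: 50 abstracts collected in 100 minutes.
print(round(estimate_total_hours(100, 50, 1581), 1))  # 52.7
```

Running the same function with different `speedup` values gives a rough basis for deciding whether automating or outsourcing is worth the setup cost.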

9. Why 5 IEEE journals? Selection of publications. Why 5? Representativity, balance, size. Why journals, not conference proceedings? Why IEEE?

10. Permission necessary? Permission from editors, authors? Terms of use of IEEE Xplore permit downloading texts but prohibit any form of sharing texts. Problem: the corpus cannot be shared, so replicability must be ensured through a detailed method.

iii. The corpus was collected manually according to a fixed protocol, checked, and then clean versions stored securely in triplicate.

11. How to collect the corpus? Automatic or manual? Copy, paste and save each text in its own text file (.txt). Concatenate files later if necessary. Create a hotkey for repetitive keystrokes.
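Keeping one abstract per file makes later concatenation straightforward. A minimal sketch, assuming UTF-8 `.txt` files in a single folder; the function name and the blank-line separator are illustrative choices, not part of the protocol described in the talk.

```python
from pathlib import Path


def concatenate_texts(corpus_dir, output_file, separator="\n\n"):
    """Join every .txt file in corpus_dir into one file.

    Files are read in sorted filename order so the result is
    reproducible. Write output_file outside corpus_dir so it is
    not picked up as input on a later run.
    Returns the number of files that were joined.
    """
    files = sorted(Path(corpus_dir).glob("*.txt"))
    texts = [f.read_text(encoding="utf-8").strip() for f in files]
    Path(output_file).write_text(separator.join(texts) + "\n", encoding="utf-8")
    return len(files)
```

The individual files remain untouched, so per-text analysis is still possible after concatenation.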

12. How to create an error-free corpus? Standard operating procedure (protocol): systematic; written and verifiable (a method is needed anyway). Built-in checks: reduce accuracy errors. Similarity analysis to identify duplicates, using tools such as Ferret (Lyon, Malcolm and Dickerson, 2001). Found a duplicate SRA, but the journal had published one article twice!
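A similarity check for duplicates can be approximated in a few lines. The sketch below is not Ferret; it is a swapped-in illustration that scores each pair of texts by the overlap of their word trigrams (a Jaccard coefficient). The function names and the 0.8 threshold are assumptions.

```python
from itertools import combinations


def trigrams(text):
    """Set of consecutive three-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def resemblance(text_a, text_b):
    """Jaccard overlap of word trigrams: 0.0 (disjoint) to 1.0 (identical)."""
    tri_a, tri_b = trigrams(text_a), trigrams(text_b)
    if not tri_a or not tri_b:
        return 0.0
    return len(tri_a & tri_b) / len(tri_a | tri_b)


def find_near_duplicates(texts, threshold=0.8):
    """texts maps a file name to its content; returns suspicious pairs."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(texts.items(), 2):
        score = resemblance(text_a, text_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs
```

Comparing every pair is quadratic in the number of texts, which is still manageable for a corpus of about 1,500 short abstracts.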

13. How and where to store? Securely, e.g. encrypted (protect data). Three locations (offline, online and working). Fire in office destroys offline and working; virus destroys online and working. In both cases, one corpus survives.
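Three copies only help if they are actually identical. One way to check, sketched here under the assumption that each copy is reachable as a file path, is to compare SHA-256 checksums; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def copies_match(paths):
    """True if every file in paths has the same SHA-256 digest."""
    digests = {sha256_of(p) for p in paths}
    return len(digests) == 1
```

Recording the digest alongside the protocol also gives a simple way to show later that the analysed corpus is the one that was collected.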

14. Should nonsensical characters and typos be deleted from the text files? Clean corpus policy. Version 1: clean for record. Version 2: deleted nonsensical characters, corrected erroneous typos, deleted within breaks.
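Part of the Version 2 cleaning can be scripted. The sketch below assumes that "nonsensical characters" means control and other non-printing characters, and that the breaks to delete are line breaks inside a text; both readings are assumptions. Typo correction is left out because it needs human judgement.

```python
import re
import unicodedata


def clean_text(raw):
    """Return a cleaned copy of raw; the original string is not modified.

    Assumed policy: drop characters in Unicode "C" categories
    (control, format, unassigned) except newline and tab, then
    replace line breaks with single spaces and collapse runs of
    spaces and tabs.
    """
    kept = "".join(
        ch for ch in raw
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    joined = re.sub(r"\s*\n\s*", " ", kept)
    return re.sub(r"[ \t]+", " ", joined).strip()


print(clean_text("An ab\x00stract\nof text.  "))  # An abstract of text.
```

Because the function returns a new string, the Version 1 files can be kept as the record copy and Version 2 written to a separate location.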

iv. The corpus was annotated using UAM Corpus Tool, in layers using specially-created code sets and part-of-speech coding.

15. Why the UAM Corpus Tool? Designed for functional analysis. Code in multiple layers. Easy to use, but ver. 2.8 crashed multiple times. Spanish explanatory videos and help forum. Ver. 3.0 (O'Donnell, 2014) is much more stable.

16. Selection of annotation code sets? Standard tag sets: part-of-speech (tag set & tagger or built-in concordance tool); FOG tag set, e.g. some UAM corpus tool. Tailor-made tag sets: categories necessary for research purpose …

[Figure: Coding for titles, UAM Corpus Tool v. 2.8.14. Panels: 1 Title type; 2 Features & sub-features.]

v. Specialist informants coded a selection of SRAs and reliability was compared.

17. Inter- and intra-coder reliability? Coders and coding: specialist vs. linguist; intra and inter. Statistical comparison: kappa statistic (Carletta, 1996). Sufficient degree of accuracy (gold standard?). Ontological units give very different results (word, sentence, text).
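The kappa statistic corrects raw agreement for the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The sketch below implements Cohen's kappa for two coders, which is one common variant; the slides do not say which variant was used, and the code labels in the example are hypothetical.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who labelled the same units."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must label the same non-empty set of units")
    n = len(codes_a)
    # Proportion of units on which the coders agree.
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement from each coder's own label distribution.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical move labels for four sentences.
specialist = ["Method", "Method", "Result", "Result"]
linguist = ["Method", "Method", "Result", "Method"]
print(cohens_kappa(specialist, linguist))  # 0.5
```

Here raw agreement is 0.75 but kappa is only 0.5. The same function can be run on codes aligned by word, sentence or text to see how much the choice of unit changes the result.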

18. Resolution of differences between coders? Resolve differences through discussion, majority vote, expert? Viability of automatic coding? Bag-of-words, linguistic. Subjectivity (individual, cultural, linguistic). State assumptions and limitations.
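Of the resolution options, majority vote is the easiest to automate. A minimal sketch, assuming each unit has a list of codes from several coders; returning `None` on a tie, so that the unit can go to discussion or an expert, is an illustrative design choice.

```python
from collections import Counter


def majority_vote(codes):
    """Return the most frequent code, or None if there is no clear winner."""
    if not codes:
        return None
    ranked = Counter(codes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: refer to discussion or an expert
    return ranked[0][0]


print(majority_vote(["Method", "Method", "Result"]))  # Method
print(majority_vote(["Method", "Result"]))  # None
```

With an odd number of coders and two categories a tie cannot occur, but with more categories it can, which is why the fallback matters.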

19. How to prove your hypothesis? True or false? All swans are white. “I loves you” is correct. What is the rule to predict the next number: 2, 4, 8, ?

With big data and selective sampling of data, most hypotheses can be proved. Seek to disprove your hypothesis and don't fall foul of confirmation bias.

References
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249-254.
Carter, R. & McCarthy, M. J. (2001). Size isn't everything: Spoken English, corpus and the classroom. TESOL Quarterly, 35(2), 337-340.
Gries, S. Th. (forthcoming). Some current quantitative problems in corpus linguistics and a sketch of some solutions. Language and Linguistics.
Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Latour, B. & Woolgar, S. (1986). Laboratory Life: The Construction of Scientific Facts (2nd edition). Princeton, NJ: Princeton University Press.
Lyon, C., Malcolm, J. & Dickerson, B. (2001). Detecting short passages of similar text in large document collections. In Proceedings of Conference on Empirical Methods in Natural Language Processing. SIGDAT Special Interest Group of the ACL.
McEnery, T. & Hardie, A. (2012). Corpus Linguistics. Cambridge: Cambridge University Press.

References (continued)
Nelson, M. (2010). Building a written corpus: What are the basics? In A. O'Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 53-65). Oxon: Routledge.
Oakey, D. (10 February 2009). The lexical bundle revisited: Isolexical and isotextual comparisons. English Language Research seminar: Corpus Linguistics and Discourse. University of Birmingham.
O'Donnell, M. (2014). UAM Corpus Tool [software].
Reppen, R. (2010). Building a corpus: What are the key considerations? In A. O'Keeffe & M. McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics (pp. 31-37). Oxon: Routledge.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: John Benjamins.
Ventola, E. (1992). Writing scientific English: Overcoming intercultural problems. International Journal of Applied Linguistics, 2(2), 191-220.

Any questions, comments or suggestions? johnb@jaist.ac.jp