Practical Free-text Plagiarism Investigation Fintan Culwin School of Computing London South Bank University London SE1 0AA

2 Important Disclaimer(?)
On 18th August, the JISC plagiarism service launched an improved version, addressing some of the points that are raised in these notes! On the same day all SBU servers were taken off-line due to the slammer worm!! Some of the comments made on the JISC service may have been addressed by the improvements! These notes and the JISC brochure refer largely to the old version; comments on the new version are shown in red.

3 General Description, as advertised
This tutorial will introduce the processes involved in using the JISC service and give examples of its use from the 300 plus final year computing and BIT projects that were processed in [...]. The limitations of the service as discovered will be explored and the design and implementation of additional tools needed to complement the JISC service will be presented.
Free text plagiarism is a large and growing problem. Tools to assist with its detection are sadly necessary but unfortunately not sufficient. Some of the reasons why some students resort to cheating will be explored and some of the pedagogic responses that can possibly forestall it will be presented.

4 Specific Learning Objectives
Following the tutorial attendees will have knowledge of:
The nature and limitations of the JISC plagiarism detection service.
The operation of the JISC service.
Interpretation of the results of the JISC service and use of the JiscView utility.
The use of the OrCheck tool to follow up a JISC investigation.
The use of the Praise tool to detect intra-corporal plagiarism.
The use of Freestyler to investigate single documents.

5 My Qualifications
Much of the background and experience for this tutorial results from processing over 300 final year projects through the JISC system in the summer of [...].
Additionally, a number of utilities and systems have been developed at SBU for free text originality investigation.
Previously, I had experience of developing and operating source code plagiarism detection systems since circa [...]. This led to the JISC commissioned report on source code plagiarism detection.

6 Available Services
JISC plagiarism detection service using iParadigms (aka TurnItIn) technology. Free to UK institutions for at least the next year.
URKUND, originally a Swedish service now based in Brussels (awaiting evaluation).
CopyCatch, desktop intra-corporal system, now free of charge.
OrCheck (also PRAISE, VAST & FreeStyler), free of charge from SBU.
Various other systems with varying degrees of capability and availability (FindSame, HowOriginal etc.).

7 Classification space
intra-corporal / extra-corporal
document / corpus
desktop / server
commercial / free
database / stylistics
text-only / styled documents
open / proprietary
in house / remote

8 Why do students cheat?
because the task they have been set is too difficult for them
because they are not capable of doing the task set
because they are capable but not sufficiently organised
because they are capable but want a better mark
because everyone else is cheating
because cheating has become a habit
because they do not agree that they are cheating
because the resources required are not available
because the tutor connives with the cheating
because they are not prepared to devote the amount of time the task would take
because the number of assessment tasks set is unreasonable
because they have devoted the time and feel they deserve the mark
because their families want them to get a better mark
because the institution is inhumane

9 essentially...
Because the perceived chances of being caught and the perceived punishment if caught are less than the perceived benefit of cheating, at the time when the cheating occurs.

10 JISC Plagiarism Report
"... technology can only assist us, it will never replace the expertise of humans... the answer to problems usually lies in process and procedures not technology alone. Electronic detection has its place in institutions but the real solutions lie in appropriate assessment mechanisms, supportive institutional culture, clear definitions of plagiarism and policies for dealing with it and adequate training for staff and students. If these areas are improved, the need, desire, and appeal of plagiarism can be taken away for most students."

11 Implications for Practice
change the assignment specification for every presentation
assess process as well as product
assess at a higher level (of Bloom's taxonomy)
individualise assessment tasks
it is your responsibility to educate your registrar about the exact nature of academic misconduct
it is your responsibility to educate your students about the boundaries between cooperation, collusion and copying
it is your responsibility to ensure that an average student can complete an assessed task in a reasonable time
participate in groupworks
innovate assessment techniques

12 4 Stage Process
collection, detection, confirmation, investigation

13 Detection is (always?) capricious
Source: Downloading Detectives, Satterwhite & Gerein.
None located a verbatim passage from the on-line Encyclopedia Britannica.

14 Capricion in Practice
3% as reported by the JISC service.
8½% following tutor manual Google search.
9% following OrCheck on Ch4.
? ~11% following full OrCheck investigation???

15 Ordered Originality List
not in order within bands

16 Originality Report

17 Final Year Project
All ~315 projects were submitted to the JISC system (mostly by the students but some manually from our in-house submission system).
In addition, first and second markers were asked to flag any that they thought suspicious (more capricion?).
Flagged reports were investigated using OrCheck and a number that had been reported comparatively clean by JISC were shown to be significantly non-original.
Some that remained suspicious still reported clean (one was adjudged suitable for non-evidential investigation but dropped for lack of resources).
About 50 originality reports were visually examined and a number cleared (excessively long cited appendixes and common JavaScript in technical appendixes).

18 Final Year Project
About 20 reports were categorised as extensive, substantive, or significant.
Summary notes were made on all of these and JiscView and/or OrCheck visualisations produced.
The project panel decided to proceed with the 9 extensive and substantive cases.
First supervisors (some of whom should have known better!) were prepared to excuse extensive (~50%) demonstrated non-originality and/or suggested informal capping.
Of the 9 cases processed formally, penalties ranged from cancellation of all level 3 marks (and award of DipHE), through cancellation of the project mark (and award of unclassified), to cancellation of the project mark (but allowed to resubmit next year).

19 Quantitative Corpus Analysis
[Chart annotations: ColdFusion area; hypothesised real line]

20 Revised Service - Ordered List

21 Revised Service - Originality Report

22 Revised Service - Side by Side Comparison
the two panes are not hyperlinked

23 Comments on the JISC service 1
The nature of the detection engine is unknown (although guesses can be made).
It is (necessarily) administratively cumbersome.
There is no facility for batch enrolment of students onto the system. (Possibly addressed.)
There is no batch submission of documents (although a tutor can submit on behalf of a student). (Possibly addressed.)
There is no facility for batch downloading. (I had to manually review about 50 originality reports over a weekend and had to obtain each one individually to take them home.)
There is no batch submission of additional URLs; each has to be submitted individually (with a re-analysis after each one).
The four hour turnaround on reanalysis of a document made semi-manual investigation cumbersome. (Addressed in the upgrade.)

24 Comments on the JISC service 2
There is no facility to integrate it with WebCT or BlackBoard.
The system has some aspects of an MLE (e.g. peer review, on-line grading). (Not in the JISC version.)
The precise quantitative degree of similarity is not stated or used to precisely order the list. (Possibly addressed.)
There is no side by side comparison of submission and hit(s). (Addressed in the upgrade.)
The significance and extent of the non-originality within the document can be unclear, particularly with large documents. (See the JiscView utility.)
The system can lose some hits (i.e. a hit reported may disappear if a reanalysis of the document is requested). (Addressed in the upgrade.)
There is no management reporting capability (e.g. a convenient printer friendly list of all submissions received, etc.).

25 Comments on the JISC service 3
The sensitivity of the detection cannot be controlled (e.g. only consider runs of n words, exclude this domain, exclude anything from this base document, exclude hits below n%).
The use of red highlighting confused some tutors (they assumed that it was more significant than other colours). (A lesson for all tool designers!)
The submission of styled documents (RTF, *.DOC etc.) can be impacted by firewalls and congestion. (Inevitable with any such system and large documents.)
There is no facility to exclude non-discursive content or appendixes. (Many of the 15% hits reported were due to JavaScript in technical appendixes supplied by tools such as ColdFusion.)
The open use of the system, where students can view the originality reports, may mislead students (and tutors?!) regarding the true nature of the document.
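
The kind of sensitivity control asked for in the first comment above could look something like the following. This is only a sketch of the idea, not anything the JISC service offers: the Hit record, its fields, the filter_hits function and the example URLs are all hypothetical, and it assumes that hits could be obtained as structured data at all.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Hit:
    """Hypothetical record for one matched source (not a JISC data format)."""
    url: str          # source the submission was matched against
    run_length: int   # longest run of consecutive matching words
    percent: float    # share of the submission attributed to this source


def filter_hits(hits, min_run=1, min_percent=0.0, excluded_domains=()):
    """Keep only hits that pass all three sensitivity controls."""
    kept = []
    for hit in hits:
        host = urlparse(hit.url).hostname or ""
        # A domain is excluded if the host is that domain or a subdomain of it.
        excluded = any(host == d or host.endswith("." + d) for d in excluded_domains)
        if hit.run_length >= min_run and hit.percent >= min_percent and not excluded:
            kept.append(hit)
    return kept


hits = [
    Hit("http://www.example.org/essay", 12, 5.0),
    Hit("http://docs.example.com/script.js", 40, 15.0),
    Hit("http://example.net/page", 3, 0.5),
]
kept = filter_hits(hits, min_run=6, min_percent=1.0, excluded_domains=("example.com",))
print([h.url for h in kept])
```

Excluding a whole base document, the remaining control mentioned on the slide, would need access to the matched text itself and is not attempted here.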

26 JiscView
The JISC textual representations, whilst adequate for small documents, proved less useful for large projects. The colour coding did not give a precise quantitative measure and the relative location of the various non-original parts was also unclear.
To address these problems a small utility, JiscView, was developed to provide a high level, non-interactive, map of a JISC non-originality report.
The utility may have been invalidated by the revised JISC service. It is only available upon request with many caveats and no documentation.

27 JiscView in Operation
A JiscView image contains one pixel for every character, colour coded as in the originality report.
The width is arbitrary (just wide enough to accommodate the text at the top).
It gives a precise quantitative measure of non-originality, in this case 24%.
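
The one-mark-per-character idea can be illustrated with a few lines of Python. This is a sketch only and not the JiscView code: it assumes the report has already been reduced to one true/false flag per character, and it prints text marks instead of drawing coloured pixels. The function names are made up for the illustration.

```python
def non_originality_map(flags, width):
    """Split per-character flags into fixed-width rows and measure coverage.

    flags: list of bool, one per character; True means flagged as non-original.
    """
    rows = [flags[i:i + width] for i in range(0, len(flags), width)]
    percent = 100.0 * sum(flags) / len(flags) if flags else 0.0
    return rows, percent


def render(rows):
    """Draw one mark per character: '#' for flagged text, '.' for the rest."""
    return "\n".join("".join("#" if f else "." for f in row) for row in rows)


# A 50 character document whose last 12 characters were flagged.
flags = [False] * 38 + [True] * 12
rows, percent = non_originality_map(flags, width=10)
print(render(rows))
print(f"{percent:.0f}% non-original")
```

Because every character contributes exactly one mark, the percentage is simply the flagged share of the marks, and the position of the flagged runs within the document is visible at a glance.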

28 OrCheck
OriginalityChecker is an in-house, desktop, single-document, free-of-charge, database (Google) driven, text only, non-proprietary tool.
Essentially, it provides some assistance with the process of manually performing a Google driven keyword search and (in particular) with interpreting the extent and significance of any matches in the documents returned.
In the final year project investigation it was used to locate URLs to manually feed into the JISC service. It was also used in passive mode to prepare evidential reports for the investigation phase.
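
How OrCheck chooses its search terms is not described in these notes. As a rough illustration of the first step of a keyword driven search, the sketch below builds a simple concordance (word counts) and proposes the most frequent content words as candidate terms. The stop-word list, the ranking rule and the function names are assumptions made for the example, and no search engine is actually called.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "that", "for", "on", "with", "as", "by"}


def concordance(text):
    """Count how often each content word occurs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def search_terms(text, count=3):
    """Propose the most frequent content words as candidate search terms."""
    return [word for word, _ in concordance(text).most_common(count)]


text = ("Plagiarism detection tools assist detection. "
        "Detection of plagiarism is hard.")
print(concordance(text))
print(" ".join(search_terms(text, count=2)))
```

The tutor would still judge which terms are distinctive enough to search on; the concordance only narrows the choice.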

29 OrCheck in Operation 1
document loaded; concordance generated

30 OrCheck in Operation 2
search in progress; hits obtained

31 OrCheck in Operation 3
textual comparison; graphical representation

32 PRAISE
Prioritised Ring to Assist In Similarity Evaluations is an in-house, desktop, intra-corporal, free-of-charge, stylistic, (text only), non-proprietary tool.
It is used to detect and display the degree of similarity between the documents in a corpus. Although designed for text-only use it will operate upon styled texts (though its behaviour is somewhat unknown).
It uses the words2 metric, shown in Thomas Lancaster's thesis to be efficient and effective.
It is intended to allow an OrCheck and/or VAST viewer to be spawned from it for detailed investigation.
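
These notes do not define the words2 metric, so the sketch below does not reproduce it. As a stand-in it scores each pair of documents by the Jaccard overlap of their consecutive word pairs, which is enough to show what an intra-corporal comparison computes: one similarity value for every pair of documents in the corpus. All names here are hypothetical.

```python
import re
from itertools import combinations


def word_pairs(text):
    """Return the set of consecutive word pairs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return set(zip(words, words[1:]))


def pair_similarity(a, b):
    """Jaccard overlap of word pairs (a stand-in, not the words2 metric)."""
    pa, pb = word_pairs(a), word_pairs(b)
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


def corpus_similarities(corpus):
    """Score every unordered pair of documents in a {name: text} corpus."""
    return {(x, y): pair_similarity(corpus[x], corpus[y])
            for x, y in combinations(sorted(corpus), 2)}


corpus = {
    "a": "the cat sat on the mat",
    "b": "the cat sat on the rug",
    "c": "dogs bark loudly at night",
}
for pair, score in corpus_similarities(corpus).items():
    print(pair, round(score, 2))
```

The number of pairs grows quadratically with the size of the corpus, which is one reason the efficiency of the chosen metric matters.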

33 PRAISE in Operation 1

34 PRAISE in Operation 2

35 PRAISE in Operation 3
The documents are arranged on the torc in gross similarity sequence.
Controls are provided to vary the number of documents and the degree of similarity shown.
When one document is selected all other documents linked to it, at or above the similarity level, are also shown. (From here an OrCheck visualisation will be launched.)
When two documents (i.e. one link) are selected details of that degree of similarity are shown. (From here a VAST visualisation will be launched.)
An alternative tabular view of the information also needs to be provided.
Extra-corporal Web sourced documents can be included and are shown in a different colour. (An OrCheck style capability to obtain such documents needs to be included.)
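
The selection behaviour described above can be sketched as two small functions over a table of pairwise similarity scores. This is an illustration under stated assumptions, not the PRAISE implementation: the scores are invented, and "gross similarity sequence" is interpreted here as ordering documents by the sum of their similarities to all others, which the notes do not confirm.

```python
def linked_documents(similarities, selected, threshold):
    """List documents linked to the selected one at or above the threshold."""
    linked = []
    for (x, y), score in similarities.items():
        if score >= threshold and selected in (x, y):
            linked.append((y if x == selected else x, score))
    # Strongest links first; names break ties so the order is repeatable.
    return sorted(linked, key=lambda item: (-item[1], item[0]))


def ring_order(similarities):
    """Order documents by total similarity to all others (an assumption)."""
    totals = {}
    for (x, y), score in similarities.items():
        totals[x] = totals.get(x, 0.0) + score
        totals[y] = totals.get(y, 0.0) + score
    return sorted(totals, key=lambda name: (-totals[name], name))


# Invented pairwise scores for four documents.
similarities = {("a", "b"): 0.8, ("a", "c"): 0.3, ("b", "c"): 0.5, ("c", "d"): 0.1}
print(ring_order(similarities))
print(linked_documents(similarities, "a", threshold=0.3))
print(linked_documents(similarities, "a", threshold=0.5))
```

Raising the threshold removes the weaker links, which is the effect of the similarity control described on the slide.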

36 VAST
Visual Analysis of Similarity Tool is an in-house, desktop, double-document, free-of-charge, stylistic driven, text only, non-proprietary tool.
It provides a detailed OrCheck-like visualisation and investigation of a pair of documents.
VAST is more capable of fuzzy matching than OrCheck and so is more capable of detecting similarity beneath superficial disguises. However it is less precise in its highlighting and is unable to give a (precise) quantitative value to the similarity.
VAST can also be used to track changes in the drafts of a document.
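
The matching technique used by VAST is not described in these notes. To illustrate the two uses mentioned above, seeing through a light disguise and tracking changes between drafts, the sketch below uses Python's standard difflib module as a stand-in. The function names are made up for the example.

```python
import difflib


def fuzzy_ratio(a, b):
    """Word-level similarity in [0, 1]; tolerant of small substitutions."""
    return difflib.SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def draft_changes(old, new):
    """List the word-level edits that turn the old draft into the new one."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    return [(tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


original = "the quick brown fox jumps over the lazy dog"
disguised = "the quick brown fox leaps over the lazy dog"
print(round(fuzzy_ratio(original, disguised), 2))
print(draft_changes(original, disguised))
```

A single substituted word barely lowers the ratio, whereas an exact phrase search for the original sentence would fail outright.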

37 VAST in Operation

38 FreeStyler
FreeStyler is an in-house, desktop, single-document, free-of-charge, stylistic, text only, non-proprietary tool.
It provides rolling-average, interactive graphs of various stylistic measurements.
The intention is that if there is more than one voice in a document, the differences should become visible in the graphs. (In practice this has not proved to be so easy!)
FreeStyler can also be used as a writing tool (checking reading age across a document, ensuring consistency of voice and spelling conventions etc.).
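
A rolling average of a stylistic measurement can be sketched as follows. The measurement chosen here, words per sentence, and the crude sentence splitting are assumptions for the example; the notes do not list which measurements FreeStyler actually plots.

```python
import re


def sentence_lengths(text):
    """Words per sentence, splitting crudely on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def rolling_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


text = ("Short one. Another short one. "
        "This sentence is a good deal longer than the others. "
        "So is this one, which also runs on.")
lengths = sentence_lengths(text)
print(lengths)
print(rolling_average(lengths, window=2))
```

A sustained shift in the averaged series is the sort of signal that might suggest a change of voice, though as noted above this is rarely clear cut in practice.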

39 FreeStylerIII in Operation

40 Final Year Project 2004
Inform students clearly and demonstrate the technology at the first project lecture (as was done in 2002/3).
Have students sign and return the JISC DPA form as part of project registration.
Encourage final year core unit tutors to use the JISC service routinely.
Require students to submit the body of the report (only) to JISC, but to submit the full report in-house.
Provide staff development and clear, agreed guidelines for all tutors regarding the significance of non-originality.
Have agreed time relief for coordinating the systems and advising on issues.

