A New Approach for Cross-Language Plagiarism Analysis Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande.

A New Approach for Cross-Language Plagiarism Analysis
Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante
Universidade Federal do Rio Grande do Sul
CLEF 2010

Outline: Introduction; Related Work; The Proposed Approach; Experiments; Summary and Future Work

Introduction
Plagiarism is one of the most serious forms of academic misconduct.
It is defined as “the use of another person's written work without acknowledging the source”.
A study of over 80,000 students in the US and Canada found that many of them had already committed a plagiarism offense:
– 36% of undergraduate students
– 24% of graduate students
There are several types of plagiarism:
– Word-for-word, paraphrasing, text translation, etc.

Introduction
Cross-language plagiarism is becoming more common:
– Evolution of automatic translation systems
– Increasing availability of textual content in many different languages
Common scenario:
– A student downloads a paper, translates it using an automatic translation tool, corrects some translation errors, and presents it as his own work
It can also involve self-plagiarism:
– Usually aims at increasing the number of publications

Introduction
What is the task?
– Detect the plagiarized passages in the suspicious documents and their corresponding text fragments in the source documents, even if the documents are written in different languages
This task is known as external plagiarism analysis.

Related Work
Monolingual plagiarism analysis
– Fingerprints, fuzzy-fingerprints, ...
Cross-language plagiarism analysis
– Statistical bilingual dictionary + bilingual text alignment
– Use of EuroWordNet to transform words into a language-independent representation
PAN competition
– Enables different methods to be compared against each other
– It was held as an evaluation lab in conjunction with CLEF 2010

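The Proposed Approach
[Figure: pipeline of the method. (1) Language Normalization turns the suspicious and original documents into normalized suspicious and normalized original documents; the normalized original documents are indexed. (2) For each suspicious document, Retrieval against the index yields candidate documents. (3) Feature Selection + Classifier Training on a training corpus yields a classification model. (4) Plagiarism Analysis applies the model and produces a preliminary result. (5) Post-Processing produces the final result.]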

(1) Language Normalization
All documents are converted into a common language.
English was chosen:
– More translation resources are available
– It is one of the easiest languages to translate into
A language guesser and an automatic translation tool were used.

(2) Retrieval of Candidate Documents
Problem: it is not feasible to compare each suspicious document exhaustively against every source document.
Solution: use passages from the suspicious document as queries sent to an IR system.
Documents are divided into subdocuments (paragraphs) to reduce the amount of text that must be analyzed.
At the end of this phase, there is a list of at most ten candidate subdocuments for each passage in the suspicious document.
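The retrieval step can be illustrated with a minimal sketch. The slides use the Terrier IR system; the tiny in-memory TF-IDF index below is only a stand-in, and the names `split_subdocuments`, `TinyIndex`, and `search`, as well as the scoring formula, are assumptions made for illustration rather than the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def split_subdocuments(doc_id, text):
    """Split a document into paragraphs: (doc_id, paragraph_index, text)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(doc_id, i, p) for i, p in enumerate(paragraphs)]


class TinyIndex:
    """Hypothetical stand-in for a real IR system such as Terrier."""

    def __init__(self, subdocs):
        self.subdocs = subdocs
        self.tfs = [Counter(tokenize(p)) for _, _, p in subdocs]
        df = Counter()
        for tf in self.tfs:
            df.update(tf.keys())
        n = len(subdocs)
        # Assumed weighting: a simple smoothed inverse document frequency.
        self.idf = {t: math.log(1 + n / d) for t, d in df.items()}

    def search(self, query, k=10):
        """Return at most k (score, subdocument) pairs, best first."""
        terms = tokenize(query)
        scored = []
        for sub, tf in zip(self.subdocs, self.tfs):
            score = sum(tf[t] * self.idf.get(t, 0.0) for t in terms)
            if score > 0:
                scored.append((score, sub))
        scored.sort(key=lambda pair: -pair[0])
        return scored[:k]
```

Each passage of a suspicious document would be passed to `search` as the query, and the default `k=10` mirrors the limit of ten candidate subdocuments per passage.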

(3) Feature Selection and Classifier Training
The goal is to build a classification model that learns to distinguish between plagiarized and non-plagiarized text passages.
Annotated synthetic examples were used for training.
The J48 classification algorithm was used.
Features:
– The cosine similarity between the suspicious passage and the candidate subdocument
– The similarity score assigned by the IR system
– The position of the candidate subdocument in the ranking generated by the IR system
– The lengths (in characters) of the suspicious passage and the candidate subdocument
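As a rough sketch, the features listed above could be assembled for one (suspicious passage, candidate subdocument) pair as follows. The slides do not say how terms are weighted for the cosine similarity, so raw term frequencies are assumed here; the function names and dictionary keys are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Raw term-frequency vector (an assumed weighting scheme)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    va, vb = term_vector(text_a), term_vector(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_features(passage, candidate, ir_score, rank):
    """Feature record for one passage/candidate pair (hypothetical keys)."""
    return {
        "cosine": cosine_similarity(passage, candidate),
        "ir_score": ir_score,             # score assigned by the IR system
        "rank": rank,                     # position in the ranked list
        "passage_len": len(passage),      # length in characters
        "candidate_len": len(candidate),  # length in characters
    }
```

Records of this kind, labeled as plagiarized or non-plagiarized, would form the training instances for the classifier.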

(4) Plagiarism Analysis
Submit the test instances to the trained classifier and let it decide whether the suspicious passage is, in fact, plagiarized from one of the candidate subdocuments.
[Figure: each passage of the suspicious document (Passage 1, Passage 2, ...) is sent to index retrieval; the retrieved subdocuments (SubDoc 1, SubDoc 2, ...) go to the classifier, which outputs the class label Plagiarized or Non-Plagiarized.]

(5) Result Post-Processing
Join the contiguous plagiarized passages detected by the method in order to decrease the granularity score.
The granularity score measures whether the method reports a plagiarized passage as a whole or as several small plagiarized passages.
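The joining step can be sketched as a standard interval merge. The representation of a detection as a `(start, end)` pair of character offsets and the `max_gap` parameter are assumptions for illustration; in practice the merge would presumably be applied separately for each pair of suspicious and source documents.

```python
def merge_contiguous(detections, max_gap=0):
    """Merge detections whose character ranges touch or overlap.

    detections: list of (start, end) offsets, end exclusive.
    max_gap: largest gap (in characters) still treated as contiguous;
             a hypothetical knob, not mentioned in the slides.
    """
    merged = []
    for start, end in sorted(detections):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous detection instead of reporting a new one.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged
```

For example, `merge_contiguous([(0, 100), (100, 180), (400, 450)])` reports two detections instead of three, which lowers the granularity for that plagiarism case.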

Experiments
Multilingual test collection
– ECLaPA collection assembled from the Europarl Parallel Corpus (English, Portuguese and French)
– An analogous monolingual corpus was also assembled
– Available at [URL missing from the transcript]
Terrier IR System (Porter stemmer + stop-word removal)
Weka (J48 classification algorithm)
Google Translator (as language guesser)
LEC Power Translator
Evaluation measures (PAN competition)
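The PAN evaluation measures combine precision, recall, and granularity into one overall score. As defined for the PAN competition, the overall score is the F-measure divided by log2(1 + granularity), so a granularity of 1 (each case reported as a whole) leaves the F-measure unchanged. The sketch below takes the three measures as given inputs; the function names are hypothetical and the numbers in the usage note are illustrative, not results from the slides.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overall_score(precision, recall, granularity):
    """PAN-style overall score: F-measure penalized by granularity."""
    if granularity < 1:
        raise ValueError("granularity is at least 1 by definition")
    return f_measure(precision, recall) / math.log2(1 + granularity)
```

With illustrative inputs, `overall_score(0.8, 0.8, 1.0)` is about 0.8, while the same precision and recall with a granularity of 3.0 give about 0.4, which shows why joining contiguous detections matters.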

Experiments - Results
Monolingual vs. multilingual
Recall was the most affected measure
– Loss of information due to the translation process
The multilingual run reached 86% of the overall score of the monolingual baseline
[Table: Recall, Precision, F-Measure, Granularity and Overall Score for the Monolingual and Multilingual runs, with a "% of Monolingual" column; the values are not preserved in the transcript.]

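Experiments - Results
Detailed analysis
The larger the passage, the easier the detection
Plagiarized passages detected
– Monolingual 90% vs. multilingual 77%
[Table: detected passages, total passages and percentage detected for short, medium and large passages, in the Monolingual and Multilingual runs; the values are not preserved in the transcript.]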

Summary
We proposed and evaluated a new method for cross-language plagiarism analysis (CLPA)
– Used a classification algorithm to decide whether a text passage is plagiarized or not
We assembled an artificial cross-language plagiarism test collection to evaluate the method
– It is freely available
The cross-language experiment achieved 86% of the performance of the monolingual baseline

Future Work
Reduce the time spent analyzing each suspicious document
– Analyze each suspicious passage on a different computer?
Test other features during the classifier training phase
Evaluate the method on plagiarism between documents written in unrelated languages
– English vs. Chinese/Japanese
– Many plagiarism cases happen between these pairs of languages
Citation analysis

Questions?