The Ferret Copy Detector: Finding short passages of similar text in large document collections


1 The Ferret Copy Detector Finding short passages of similar text in large document collections. Relevance to natural computing: the system is based on processing short sequences, exploits sequencing characteristics of natural language, and has a natural analogue in human sequencing processors. Caroline Lyon, University of Hertfordshire

2 Human sequencing functions Primitive sequencing processors in the sub-cortical basal ganglia part of the brain control motor functions, e.g. walking. These sub-cortical sequencing processors also contribute to cognitive processing, e.g. language, complementing cortical functions. Reference: Human Language and Our Reptilian Brain, P. Lieberman, 2000.

3 Sequencing in human speech and language Sequential processing is necessary at many levels: phonetic, syllabic, lexical, and syntactic. Phonetics: speakers must control a sequence of independent motor acts to produce speech sounds.

4 Sequencing in speech and language (2) Phonetic segments can only be combined in certain ways to produce phonemes and then syllables. Different languages have different phonemic systems, but all have sequential constraints. Syllables combine to make words, which combine to make phrases, which combine to make sentences. All have constraints.

5 The need for sequential processing Many of our most frequently used words are homophones (e.g. to/too/two). This is true of other languages too, yet it does not seem to impede communication. Our primary method of disambiguation is sequential processing of short strings of words: in context, a short phrase typically has only one interpretation.

6 Alternative method of avoiding word ambiguity A recent mathematical model of human language asserts that there are unique mappings from sounds to meanings, and that absence of word ambiguity is a mark of evolutionary fitness. [Computational and evolutionary aspects of language, M. Nowak et al., Nature, June 2002, vol. 417; and other references] This is a logical suggestion, but it is not how human language works.

7 Language models Language can be modelled by a regular grammar – a linear sequence of symbols. Chomsky showed that this model is inadequate for natural language; however, it has produced effective practical applications: speech recognition systems are typically based on Markov models. The Ferret is based on a model of simple linear sequences.

8 Concepts underlying the Ferret (1) A text can be converted into a set of short sequences of adjacent words – bigrams, trigrams, etc. Example with trigrams: "A storm was forecast for today" becomes (a storm was) (storm was forecast) (was forecast for) (forecast for today).
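The conversion described on this slide can be sketched in a few lines of Python (a minimal illustration, not the Ferret's actual implementation):

```python
def to_ngrams(text, n=3):
    """Convert a text into the set of its word n-grams (trigrams by default)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# The slide's example sentence yields exactly four trigrams:
trigrams = to_ngrams("A storm was forecast for today")
# {('a', 'storm', 'was'), ('storm', 'was', 'forecast'),
#  ('was', 'forecast', 'for'), ('forecast', 'for', 'today')}
```

Using a set (rather than a list) discards trigram frequency and position, which is all the later comparison step needs.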

9 Concepts underlying the Ferret (2) To find similar passages in two documents, both texts are converted to sets of trigrams, and the sets are compared for matches. Independently written texts have only a sprinkling of matches, but copied passages (not necessarily identical) produce a significant number of matches, above a threshold.

10 Zipfian distribution of words (or why this method works) A small number of words occur frequently, but most words occur rarely. This phenomenon is even more pronounced for bigrams and trigrams: the trigram profile of a text will contain a few frequent trigrams, but most will be rare. Reference: Prediction and Entropy of Printed English, C. Shannon, 1951; many publications in the speech recognition literature.
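Statistics like the singleton percentages reported on the next slide could be gathered with a sketch along these lines (illustrative only; the function name and interface are my own, not from the Ferret):

```python
from collections import Counter

def trigram_singleton_stats(words):
    """Return (number of distinct trigrams, number occurring exactly once)."""
    counts = Counter(tuple(words[i:i + 3]) for i in range(len(words) - 2))
    singletons = sum(1 for c in counts.values() if c == 1)
    return len(counts), singletons

# `words` would be a tokenised corpus, e.g. the Wall Street Journal text;
# on real corpora the singleton fraction is typically very high.
```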

11 Statistics from Wall Street Journal corpus (1)

Number of words | Distinct trigrams | Singleton trigrams | % singletons
972,868 | 648,482 | 556,185 | 86%
4,513,716 | 2,420,168 | 1,990,507 | 82%
38,532,517 | 14,096,109 | 10,907,373 | 77%

From Handbook of Standards and Resources for Spoken Language Systems, Gibbon et al., 1997

12 Statistics from Wall Street Journal (WSJ) corpus (2) WSJ is a narrow domain: topics are revisited, and on close dates subjects may be very similar. Yet even after more than 38 million words have been analysed, a new article will on average contain 77% previously unseen trigrams.

13 The Ferret and speech recognition systems "Sparse data" is a key problem in speech recognition: new input to a system typically contains a number of previously unseen trigrams. The Ferret exploits this problem: sparse data means a text has characteristic features that do not appear in other texts unless passages are copied.

14 Comparison metrics in the Ferret Set-theoretic measures are used to compare documents. Two texts of comparable length have Resemblance R. If N_A and N_B are the sets of trigrams in texts A and B, then R = |N_A ∩ N_B| / |N_A ∪ N_B|. There is a threshold for R, found empirically, above which texts are suspiciously similar.
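Reading the set-theoretic measure on this slide as the Jaccard coefficient of the two trigram sets (a plausible interpretation, assumed here rather than quoted from the Ferret), the computation is one line:

```python
def resemblance(ngrams_a, ngrams_b):
    """Resemblance R = |intersection| / |union| of two n-gram sets."""
    if not (ngrams_a or ngrams_b):
        return 0.0  # two empty texts: define R as 0 to avoid dividing by zero
    return len(ngrams_a & ngrams_b) / len(ngrams_a | ngrams_b)
```

Identical texts score 1.0, texts with no shared trigrams score 0.0, and suspiciously similar pairs sit above the empirically found threshold.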

15 Benchmarking the Resemblance threshold Experiments were conducted on The Federalist Papers, a very well-known set of essays written in support of the American Constitution. 81 of the papers, by 2 authors, were used; all are on related topics. The maximum resemblance between any two of these essays suggests an upper limit on similarity between independently written texts.

16 The Ferret process To find similar passages in large document collections: 1. Documents are converted to .txt from Word (or, shortly, from .pdf). 2. Each text is converted to a set of trigrams; in this form, each is compared with every other. 3. A table showing the Resemblance between each pair of texts is displayed in ranked order. The user can select any pair, display the two texts side by side, see matching sections highlighted, and save the result if wanted.
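The pairwise comparison and ranking step of the process above can be sketched as follows (a simplified illustration; `ngrams_of` is a caller-supplied helper, not part of the Ferret):

```python
from itertools import combinations

def rank_pairs(documents, ngrams_of):
    """Rank every pair of documents by trigram-set resemblance, highest first.

    `documents` maps a name to its text; `ngrams_of` converts a text to
    its set of trigrams (both names are hypothetical, for illustration).
    """
    sets = {name: ngrams_of(text) for name, text in documents.items()}
    scores = []
    for a, b in combinations(sets, 2):
        union = sets[a] | sets[b]
        r = len(sets[a] & sets[b]) / len(union) if union else 0.0
        scores.append((r, a, b))
    return sorted(scores, reverse=True)  # most similar pairs first
```

Comparing each text with every other is quadratic in the number of documents, which is workable for cohort-sized collections like those in the demonstration.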

17 The Ferret as plagiarism detector for students' work Detects plagiarism or collusion in work from large cohorts of students. Short sections of similar text can be identified, even with some insertions and deletions. Documents from the web can be included in a semi-automatic process: the top 50 hits from a search are converted to .txt and added to the other texts. Reference: Experiments in Electronic Plagiarism Detection, C. Lyon et al., TR 388, Computer Science Dept., University of Hertfordshire, 2003.

18 Ferret demonstration Aim: to find whether there are similar passages in any two documents. Data: 320 texts of 10,000 words each, taken from the Project Gutenberg site, with copying simulated by pasting passages of 100 to 400 words from one text into another; 100 texts of student work, 2,000–5,000 words each; 34 documents from Dutch students. Please bring other data to try.