
Measures to Detect Word Substitution in Intercepted Communication David Skillicorn, SzeWang Fong School of Computing, Queen’s University Dmitri Roussinov W.P. Carey School of Business, Arizona State University

What my lab does:
1. Detecting anomalies when their attributes have been chosen to try and make them undetectable;
2. Deception detection in text (currently Enron, the Canadian parliament, Canadian election speeches, trial evidence);
3. Markers in text for anomalies and hidden connections.

Governments intercept communication as a defensive measure. Increasingly, this requires `domestic' interception as well as the more longstanding external interception – a consequence of asymmetric warfare. A volatile issue!

Organizations increasingly intercept communication (e.g. email) to search for intimidation, harassment, fraud, or other malfeasance. This may happen in an online way, e.g. as a response to the due diligence requirements of post-Enron financial regulation; or it may happen forensically, after an incident.

There's no hope of human processing of all of this communication. Indeed, there's too much for most organizations to afford sophisticated data mining of all of it. Ways to automatically select the interesting messages are critical.

Early-stage filtering of communication traffic must:
* be cheap
* have a low false negative rate (critical)
* but a high false positive rate doesn't matter (too much)

The goal is to be sure about innocence. This is important to communicate to the public – not all messages/calls/emails are examined equally.

First technique for message selection: use a list of words whose presence (with the right frequency) in a message indicates an interesting message.

This seems like a very weak technique, because the obvious defense is not to use words that might be on the watchlist. However…
* although the existence of the list might be public, it's much harder to guess what's on it and where it ends, e.g. `nuclear' yes, `bomb' yes, `ammonium nitrate' ??, `Strasbourg cathedral' ??
* the list's primary role is to provoke a reaction in the guilty (but not in the innocent)
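
As a toy illustration (not from the talk itself), a first-stage watchlist filter might look like the sketch below; the listed words and the threshold are purely illustrative.

```python
# Toy sketch of a first-stage watchlist filter: flag a message when
# watchlist terms appear with sufficient frequency. The entries and the
# threshold here are illustrative only, not an actual watchlist.

WATCHLIST = {"nuclear", "bomb"}   # illustrative entries only

def watchlist_hits(message: str, watchlist=WATCHLIST) -> int:
    """Count occurrences of watchlist terms in the message text."""
    text = message.lower()
    return sum(text.count(term) for term in watchlist)

def is_interesting(message: str, min_hits: int = 2) -> bool:
    """Select the message for further processing if enough terms appear."""
    return watchlist_hits(message) >= min_hits
```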

One possible reaction: encryption – but this is a poor idea, since encryption draws attention to the message.

Another reaction: replace words that might be on the watchlist by other, innocuous, words. Which words to choose as replacements? If the filtering is done by humans, then substitutions should `make sense', e.g. the Al Qaeda substitution `attack' → `wedding' works well, because weddings happen at particular places, and require a coordinated group of people to travel and meet.

Of course, large-scale interception cannot be handled by human processing. If the filtering is done automatically, substitutions should be syntactically appropriate – e.g. of similar frequency.

Can substitutions like this be detected automatically? YES, because they don't fit as well into the original sentence: the semantic differences can be detected using syntactic markers and oracles for the natural frequency of words, phrases, and bags of words.

We define a set of measures that can be applied to a sentence with respect to a particular target word (usually a noun).

1. Sentence oddity (SO), enhanced sentence oddity (ESO)

SO = f(bag of words, target word removed) / f(entire bag of words)

ESO = f(bag of words, target word excluded) / f(entire bag of words)

Intuition: when a contextually appropriate word is removed, the frequency doesn't change much; when a contextually inappropriate word is removed, the frequency may increase sharply.

increase → possible substitution

Example original sentence: "we expect that the attack will happen tonight"
Substitution: `attack' → `campaign', giving "we expect that the campaign will happen tonight"

f(we expect that the attack will happen tonight) = 2.42M
f(we expect that the will happen tonight) = 5.78M
SO = 2.4

f(we expect that the campaign will happen tonight) = 1.63M
f(we expect that the will happen tonight) = 5.78M
SO = 3.5
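
A minimal sketch of the SO computation, using the page counts quoted above (in millions of pages) and leaving the frequency oracle itself abstract:

```python
# Minimal sketch of the sentence-oddity (SO) measure. The bag-of-words
# frequencies are taken from the slide above; the oracle that produces
# them (a search-engine page count) is abstracted away here.

def sentence_oddity(freq_bag_without_target, freq_whole_bag):
    """SO = f(bag with the target word removed) / f(entire bag of words).
    A sharp increase when the target is removed suggests a substitution."""
    if freq_whole_bag == 0:
        return float("inf")      # the full bag never occurs at all
    return freq_bag_without_target / freq_whole_bag

# Original sentence: "we expect that the attack will happen tonight"
so_attack = sentence_oddity(5.78, 2.42)    # ~2.4
# Substituted sentence: "we expect that the campaign will happen tonight"
so_campaign = sentence_oddity(5.78, 1.63)  # ~3.5  (higher -> more suspicious)
print(so_attack, so_campaign)
```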

2. Left, right and average k-gram frequencies

Many short exact (quoted) strings do not occur, even in large repositories!! k-grams estimate frequencies of target words in context, but must keep the context small (or else the estimate is 0).

* left k-gram = frequency of the exact string from the closest non-stopword to the left of the target word, up to and including the target word.
* right k-gram = frequency of the exact string from the target word up to and including the closest non-stopword to its right.
* average k-gram = average of the left and right k-grams.

small k-gram → possible substitution

Examples of exact string frequencies:

"the attack will happen tonight"  f = 1 – even though this seems like a plausible, common phrase.

Left k-gram: "expect that the attack"  f = 50
Right k-gram: "attack will happen"  f = 9260

Left k-gram: "expect that the campaign"  f = 77 (this should be smaller than 50, but may be inflated by `election campaign')
Right k-gram: "campaign will happen"  f = 132
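
A sketch of how the left and right k-gram strings might be extracted; whitespace tokenisation and the stopword list here are illustrative assumptions, and the frequencies of the returned exact (quoted) strings would then come from the oracle.

```python
# Sketch: extract the left and right k-grams around a target word.
# The stopword list and whitespace tokenisation are illustrative assumptions.

STOPWORDS = {"the", "a", "an", "that", "will", "of", "to", "and", "is", "we"}

def left_right_kgrams(sentence, target):
    """Return the exact strings whose (quoted) frequencies give the left and
    right k-gram measures: from the nearest non-stopword on each side of the
    target word, up to and including the target word."""
    words = sentence.lower().split()
    i = words.index(target)

    # walk left to the closest non-stopword
    j = i - 1
    while j >= 0 and words[j] in STOPWORDS:
        j -= 1
    left = " ".join(words[max(j, 0):i + 1])

    # walk right to the closest non-stopword
    k = i + 1
    while k < len(words) and words[k] in STOPWORDS:
        k += 1
    right = " ".join(words[i:min(k, len(words) - 1) + 1])

    return left, right

print(left_right_kgrams("we expect that the attack will happen tonight", "attack"))
# -> ('expect that the attack', 'attack will happen')
```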

3. Maximum, minimum, average hypernym oddity (HO)

The hypernym of a word is the word or phrase above it in a taxonomy of meaning, e.g. `cat' → `feline'.

If a word is contextually appropriate, replacing it by its hypernym creates an awkward (pompous) sentence, with lower frequency. If a word is contextually inappropriate, replacing it by its hypernym tends to make the sentence more appropriate, with greater frequency.

HO = f(bag of words with the hypernym substituted) − f(original bag of words)

increase → possible substitution

Hypernym examples

Original sentence: "we expect that the attack will happen tonight"  f = 2.42M
With the hypernym: "we expect that the operation will happen tonight"  f_H = 1.31M

Sentence with a substitution: "we expect that the campaign will happen tonight"  f = 1.63M
With the hypernym: "we expect that the race will happen tonight"  f_H = 1.97M

Hypernyms are semantic relationships, but we can get them automatically using WordNet (wordnet.princeton.edu). Most words have more than one hypernym, because of their different senses. We can compute the maximum, minimum and average hypernym oddity over the possible choices of hypernym.
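
A sketch of how the hypernym candidates and the HO values could be obtained; the use of NLTK's WordNet interface is an assumed tooling choice, and `freq` stands in for the bag-of-words frequency oracle.

```python
# Sketch: enumerate hypernym replacements for a target word using WordNet
# (via NLTK's interface, an assumed tooling choice), then compute
# HO = f(bag with hypernym) - f(original bag) for each candidate.
# freq() stands in for the bag-of-words frequency oracle.

from nltk.corpus import wordnet as wn

def hypernym_lemmas(word, pos=wn.NOUN):
    """All single-word hypernym lemmas of `word`, across its senses."""
    lemmas = set()
    for synset in wn.synsets(word, pos=pos):
        for hyper in synset.hypernyms():
            for name in hyper.lemma_names():
                if "_" not in name:          # keep single-word hypernyms only
                    lemmas.add(name)
    return sorted(lemmas)

def hypernym_oddities(sentence_words, target, freq):
    """Return the HO values over all hypernym choices; the max, min and
    average of this list give the three hypernym-oddity measures."""
    original = freq(sentence_words)
    oddities = []
    for h in hypernym_lemmas(target):
        replaced = [h if w == target else w for w in sentence_words]
        oddities.append(freq(replaced) - original)
    return oddities

# e.g. for "attack", the hypernym choices include words like "operation",
# as in the example on the previous slide.
```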

4. Pointwise mutual information (PMI)

PMI = f(target word) × f(adjacent region) / f(target word + adjacent region)

where the adjacent region can be on either side of the target. We use the maximum PMI calculated over all adjacent regions that have non-zero frequency (frequency drops to zero quickly as the region gets longer). PMI looks for the occurrence of the target word as part of some stable phrase.

increase → possible substitution
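
A sketch of this measure as defined above; the maximum region length and the exact querying conventions are assumptions, and `freq` again stands in for the frequency oracle.

```python
# Sketch of the PMI-style measure defined above:
#   PMI = f(target) * f(adjacent region) / f(target + adjacent region),
# maximised over adjacent regions on either side of the target that have
# non-zero frequency. freq() takes a list of words and stands in for the
# frequency oracle; max_region_len is an illustrative choice.

def pmi_measure(words, target_index, freq, max_region_len=4):
    """Maximum PMI over adjacent regions to the left and right of the target."""
    target = words[target_index]
    f_target = freq([target])
    best = 0.0
    for length in range(1, max_region_len + 1):
        left = words[max(0, target_index - length):target_index]
        right = words[target_index + 1:target_index + 1 + length]
        for region, phrase in ((left, left + [target]), (right, [target] + right)):
            if not region:
                continue
            f_region, f_phrase = freq(region), freq(phrase)
            if f_region > 0 and f_phrase > 0:   # skip zero-frequency regions
                best = max(best, f_target * f_region / f_phrase)
    return best
```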

Frequency oracles: we use Google and Yahoo as sources of natural frequencies for words, quoted strings, and bags of words.

Some issues:
* we use frequency of pages as a surrogate for frequency of words;
* we don't look at how close together words appear in each page, only whether they all occur;
* search engines handle stop words in mysterious ways;
* order of words matters, even in bag-of-words searches;
* although Google and Yahoo claim to index about the same number of documents, their reported frequencies for the same word differ by a factor of at least 6 in some cases.
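
Since the real lookups go through search-engine interfaces (with quotas and varying behaviour), the sketch below only shows the shape of a cached frequency oracle; `raw_page_count` is a stub the reader would have to wire to an actual engine.

```python
# Sketch of a frequency oracle with caching. Page counts come from a search
# engine (Google / Yahoo in the talk); the actual lookup is left as a stub
# because engines, endpoints and quotas vary.

from functools import lru_cache

def raw_page_count(query: str) -> int:
    """Stub: return the number of pages a search engine reports for `query`.
    Quoted queries give exact-string frequencies; unquoted queries give
    bag-of-words frequencies (page counts are a surrogate for word counts)."""
    raise NotImplementedError("hook up a web-search lookup here")

@lru_cache(maxsize=100_000)
def _cached(query: str) -> int:
    return raw_page_count(query)

def freq(words, exact=False):
    """Frequency of a bag of words, or of an exact quoted string, with caching
    so repeated sub-queries (k-grams, bags with one word removed, hypernym
    variants) are only fetched once."""
    query = " ".join(words)
    return _cached(f'"{query}"' if exact else query)
```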

Test data: we want text that is relatively informal, because most intercepted messages will not be polished text (email, phone calls). We selected sentences of length 5-15 from the Enron corpus. Many of these sentences are informal (some are bizarre).

We constructed a set of sentences containing substitutions by replacing the first noun in each original sentence by a frequency-matched substitute. We discarded sentences where the first noun wasn't in the BNC corpus, or did not have a hypernym known to WordNet.
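
A rough sketch of this construction, assuming NLTK for tokenisation and POS tagging and a `word_freq` dictionary standing in for the BNC frequency list; the ±10% matching tolerance is an illustrative choice, and the WordNet hypernym check is omitted for brevity.

```python
# Sketch of the substitution-set construction described above: replace the
# first noun of a sentence with a frequency-matched word. NLTK's tokenizer
# and POS tagger are assumed tools; word_freq stands in for the BNC
# frequency list, and the 10% tolerance is illustrative.

import nltk

def substitute_first_noun(sentence, word_freq, tolerance=0.10):
    """Replace the first noun with a different word of similar corpus
    frequency, or return None if the sentence should be discarded."""
    tokens = nltk.word_tokenize(sentence)
    for i, (word, tag) in enumerate(nltk.pos_tag(tokens)):
        if tag.startswith("NN"):                       # first noun
            f = word_freq.get(word.lower())
            if f is None:                              # not in the frequency list
                return None                            # -> sentence discarded
            lo, hi = f * (1 - tolerance), f * (1 + tolerance)
            candidates = [w for w, g in word_freq.items()
                          if lo <= g <= hi and w != word.lower()]
            if not candidates:
                return None
            tokens[i] = candidates[0]                  # frequency-matched substitute
            return " ".join(tokens)
    return None                                        # no noun found
```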

We built a set of 1714 `normal' sentences, and 1714 sentences with a substitution (results were very stable once more than about 200 sentences were used). We also constructed a similar, smaller set from the Brown corpus (which contains much more formal, and older, texts).

For each measure, we built a decision tree predicting normal vs. substitution, using the measure value as the single attribute. This gives us insight about the boundary between normal and substitution sentences for each kind of measure.

Measure                          Boundary: odd if
Semantic oddity                  > 4.6
Enhanced semantic oddity         > 0.98
Left k-gram                      < 155
Right k-gram                     < 612
Average k-gram                   < 6173
Min hypernym oddity              > (value not preserved)
Max hypernym oddity              > -6
Average hypernym oddity          > -6
Pointwise mutual information     > 1.34

These are newer results than those in the paper.
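
The talk used Weka's J48 on these per-measure trees; as an assumed stand-in, the sketch below uses scikit-learn, fitting a depth-1 tree on a single measure to recover its decision boundary.

```python
# Sketch of deriving a per-measure decision boundary. The talk uses Weka's
# J48; scikit-learn's DecisionTreeClassifier is used here as a stand-in.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boundary_for_measure(values, labels):
    """Fit a depth-1 tree on one measure's values (labels: 0 = normal,
    1 = substitution) and return the root split threshold, i.e. the
    'odd if > / <' boundary for that measure."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    y = np.asarray(labels)
    tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
    return tree.tree_.threshold[0]
```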

Individual measures are very weak detectors (75%/25% split, J48 decision tree, single attribute, Weka).

[Table: detection rate (%), false positive rate (%), and area under the ROC curve for each individual measure – semantic oddity, enhanced semantic oddity, left/right/average k-gram, min/max/average hypernym oddity, pointwise mutual information. The numeric values were not preserved in the transcript.]

Single-measure predictors make their errors on different sentences. Combining them produces much stronger predictors.

Combining using a decision tree trained on the full set of measure values:
[Table: detection rate (%), false positive rate (%), and area under the ROC curve for the combined decision tree – values not preserved in the transcript.]

Combining using a random forest (50 trees, Mtry = 4):
Random forest: detection rate 90%, false positive rate 11%.

Surprising that this isn't better.
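
As an assumed stand-in for Weka, a sketch of the combined classifier using scikit-learn's random forest with the parameters quoted above (50 trees, Mtry = 4 mapped to max_features=4) and the 75%/25% split:

```python
# Sketch of combining all measures into one classifier. scikit-learn's
# RandomForestClassifier stands in for the random forest reported in the
# talk (50 trees, Mtry = 4).

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def combined_classifier(X, y):
    """X: (n_sentences, n_measures) matrix of measure values;
    y: 0 = normal, 1 = substitution. Returns the forest plus the
    detection rate and false positive rate on a held-out 25% split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y)
    forest = RandomForestClassifier(n_estimators=50, max_features=4).fit(X_tr, y_tr)
    tn, fp, fn, tp = confusion_matrix(y_te, forest.predict(X_te)).ravel()
    detection_rate = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return forest, detection_rate, false_positive_rate
```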

The families of measures are almost completely independent…

…and each sentence’s classification is almost completely determined by its score with respect to one measure, i.e. most sentences have a neutral score on all but one measure (family) – something deeper is going on here.

We expected better results for the Brown corpus, reasoning that context should be more helpful in more careful writing. In fact, the results for the Brown corpus are worse. This may reflect changes in language use since the 1960s; our oracles are much better representatives of recent writing. But it is puzzling…

[Table: detection rate (%), false positive rate (%), and area under the ROC curve for the combined decision tree on the Brown corpus – values not preserved in the transcript.]
Random forest (Brown corpus): detection rate 83%, false positive rate 13%.

Results are similar (within a few percentage points) across different oracles – Google, Yahoo, MSN – despite their apparent differences.

Results are also similar if the substituted word is much less frequent than the word it replaces: there is no extra performance from rarity of the replacement word (cf. Skillicorn, ISI 2005, where this was critical). But there is some loss of performance if the substituted word is much more frequent than the word it replaces. This is expected, since common words fit into more contexts.

Why do the measures make errors? Looking at the first 100 sentences manually…
* some of the original sentences are very strange already, written in a hurry or with strange abbreviations or style;
* there's only one non-stopword in the entire sentence, so no real context;
* the substitution happens to be appropriate in the context.

There's some fundamental limit to how well substitutions can be detected because of these phenomena. Both the detection rate and the false positive rate may be close to their limits.

Mapping sentence predictions to message predictions: There’s considerable scope to get nicer properties on a per-message basis by deciding how many sentences should be flagged as suspicious before a message is flagged as suspicious. It’s likely that an interesting message contains more than 1 sentence with a substitution. So a rule like: “select messages with more than 4 suspicious sentences, or more than 10% suspicious sentences” reduces the false positive rate, without decreasing the detection rate much.
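
A minimal sketch of that message-level rule, with the thresholds taken from the rule quoted above:

```python
# Sketch of the message-level rule quoted above: flag a message if it has
# more than 4 suspicious sentences, or more than 10% suspicious sentences.

def message_is_suspicious(sentence_flags, min_count=4, min_fraction=0.10):
    """sentence_flags: list of booleans, one per sentence (True = flagged
    as containing a suspected substitution)."""
    if not sentence_flags:
        return False
    n_flagged = sum(sentence_flags)
    return n_flagged > min_count or n_flagged / len(sentence_flags) > min_fraction
```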

Summary: a good way to separate `bad' from `good' messages is to deploy a big, visible detection system (whose details, however, remain hidden), and then watch for reactions to the visible system. Often this reaction is easier to detect than the innate differences between `bad' and `good'. Even knowing this two-pronged approach, senders of `bad' messages have to react, or else risk being detected by the visible system.

For messages, the visible system is a watchlist of suspicious words. The existence of the watchlist can be known, without knowing which words are on it. Senders of ‘bad’ messages are forced to replace any words that might be on the watchlist – so they probably over-react. These substitutions create some kind of discontinuity around the places where they occur. This makes them detectable, although a variety of (very) different measures must be used – and, even then, decent performance requires combining them. So far, detection performance is ~95% with a ~10% false positive rate.

?