Selecting Suspicious Messages in Intercepted Communication
David Skillicorn, School of Computing, Queen's University
Research in Information Security, Kingston (RISK)
Math and CS, Royal Military College

Legal interception occurs in three main contexts:
1. Government broad-spectrum interception of communication for national defence and intelligence (e.g. Echelon). This usually excludes communication between citizens, who are hard to identify in practice, so a simple surrogate rule is usually applied.
2. Law enforcement interception pursuant to a warrant.
3. Organizational interception, typically of email and IM, looking for improper behaviour (e.g. SOX violations), criminal activity, and industrial espionage.
Lots of other communication takes place in public, and so is always available for examination: chat, blogs, web pages.

For governments and organizations, the volumes of data intercepted are large: 3 billion messages per day for Echelon; 1 TB/day for the CIA. Finding anything interesting in this torrent is a challenge – the interesting fraction is 1 in a million or less. Early-stage processing must concentrate on finding the definitely uninteresting, so that it can be discarded. Selecting the potentially interesting can be done downstream, with more sophistication, because the volumes are much smaller.

The main approach for selection: use a set of keywords whose presence causes a message to be selected for further analysis. Example: the German Federal Intelligence Service, as of 2000, used keyword lists for nuclear proliferation (2000 terms), arms trade (1000), terrorism (500), and drugs (400) – certainly changed now. It also seems plausible that a range of other techniques are applied, based on properties such as content overlap, sender/receiver identities, times of transmission, specialized word use, etc. (social network analysis).
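A minimal sketch of this kind of keyword selection, in Python; the watchlist terms below are invented placeholders, not drawn from any real agency list:

    # Hypothetical watchlist; real lists contain hundreds or thousands of terms per topic.
    WATCHLIST = {"nuclear", "centrifuge", "detonator"}

    def select_for_analysis(message: str) -> bool:
        """Return True if any watchlist term appears in the message."""
        tokens = {t.strip(".,;:!?").lower() for t in message.split()}
        return bool(tokens & WATCHLIST)

    messages = ["The wedding is on Friday", "Shipment of centrifuge parts arrived"]
    flagged = [m for m in messages if select_for_analysis(m)]   # only the second message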

General strategy considerations

1. Models that assume that the problem is to discover the boundary between good and bad, based on some fixed set of properties, can be defeated easily. This is the 'carnival booth' approach – probe to learn the boundary, then avoid it. Looking for a fixed set of anomalies means missing an unexpected anomaly. Randomizing the boundary can help. It is better to look for anything unusual rather than only the expected unusual.

2. It is hard for humans to behave unnaturally in a natural way. This is even more true when the behaviour is subconscious, e.g. Stephen Potter's Oneupmanship for tennis players; customs interviews; digit choices on tax returns, accounts, and false invoices. So there is an inherent signature to unnatural behaviour, in any context.

3. Create a big, obvious, primary detection system … then create a secondary detection system that looks for reaction to (evasion of) the first system! Innocent people either don't know about or don't react to such a system; but those who are being deceptive cannot afford not to. (The more the primary system looks for markers that are subconsciously generated, the harder it is to react appropriately.) The boundary between innocence and reaction is often easier to detect than the boundary between innocence and deception.

How does this apply to communication? Most informal communication relies on subconscious mechanisms governing textual markers such as word choice, and voice markers such as pitch. Awareness of simple surveillance measures may cause problems with these mechanisms, creating detectable changes. The presence of a watchlist of words suggests substituting innocuous words – but word choice is also partly a subconscious process.

Detecting substitution in conversations

Replacing words that might be on the keyword watch list by other words or locutions could prevent messages from being selected based on their content. But knowing that there is a watch list is not the same thing as knowing what's on it: 'bomb' is probably a word not to use; what about 'fertilizer', 'meeting', 'suicide', …? A keyword watch list plays the role of a primary selection mechanism; it doesn't matter that its existence is known, but it does matter that some of its details are unknown. Randomization can even be useful.

Substitution can be: * based on a codebook (e.g. 'attack' = 'wedding') * generated on the fly. We expect that most substitutions made on the fly will replace a word with a new word whose natural frequency is quite different: 'attack' is the 1072nd most common English word; 'wedding' is the 2912th most common English word. This can be avoided, but only with some attention – more later.

The use of a substitution with the 'wrong' frequency in a number of messages may make the entire conversation unusual enough to be detected. This has the added advantage that it can group together messages that belong to the same conversation, even if their endpoints have been obscured.

Linguistic background: the frequency of words in English (and many other languages) follows a Zipf distribution – frequent words are very frequent, and frequency drops off very quickly. We restrict our attention to nouns. In English, the most common noun is 'time'; the 3262nd most common noun is 'quantum'.

A message-frequency matrix has a row corresponding to each message and a column corresponding to each noun. The ij-th entry is the frequency of noun j in message i. The matrix is very sparse. We generate artificial datasets using a Poisson distribution with mean f * 1/(j+1), where f models the base frequency. We add 10 extra rows representing the correlated threat messages: a block of 6 columns, at fixed positions, filled with uniformly random 0s and 1s.
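A hedged sketch of this artificial-dataset construction; the matrix sizes, the base frequency f, and the position of the threat block are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_messages, n_nouns, f = 1000, 2000, 3.0             # assumed sizes and base frequency
    j = np.arange(1, n_nouns + 1)

    # Zipf-like noun frequencies: Poisson counts with mean f / (j + 1) for column j.
    freq_matrix = rng.poisson(lam=f / (j + 1), size=(n_messages, n_nouns))

    # 10 correlated "threat" rows: a block of 6 columns of uniformly random 0/1,
    # placed at an arbitrary (assumed) column offset.
    threat = np.zeros((10, n_nouns), dtype=int)
    block_start = 100                                     # assumed position
    threat[:, block_start:block_start + 6] = rng.integers(0, 2, size=(10, 6))

    A = np.vstack([freq_matrix, threat])                  # the message-frequency matrix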

[Figure: the message-frequency matrix – rows are messages, columns are nouns]

Technology – matrix decompositions. The basic idea: * treat the dataset as a matrix A with n rows and m columns; * factor A into the product of two matrices, C and F: A = C F, where C is n x r, F is r x m, and r is smaller than m. Think of F as a set of underlying 'real' somethings, and C as a way of 'mixing' these somethings together to get the observed attribute values. Choosing r smaller than m forces the decomposition to represent the data more compactly.

Two matrix decompositions are useful: Singular value decomposition (SVD) – the rows of F are orthogonal axes such that the maximum possible variation in the data lies along the first axis, the maximum of what remains along the second, and so on; the rows of C are coordinates in this space. Independent component analysis (ICA) – the rows of F are statistically independent factors, and the rows of C describe how to mix these factors to produce the original data. Strictly speaking, the rows of C are not coordinates, but we can plot them to get some idea of structure.
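A hedged sketch of both decompositions applied to such a matrix, using numpy's SVD and scikit-learn's FastICA; the placeholder matrix and the choice r = 3 (matching the 'first 3 dimensions' plots) are assumptions:

    import numpy as np
    from sklearn.decomposition import FastICA

    # Placeholder for the message-frequency matrix from the earlier sketch.
    rng = np.random.default_rng(0)
    A = rng.poisson(1.0, size=(200, 500)).astype(float)
    r = 3

    # SVD: A ~ U S Vt; rows of Vt are the axes (F), and U * S gives message coordinates (C).
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    svd_coords = U[:, :r] * S[:r]

    # ICA: the independent components play the role of F; fit_transform returns the
    # per-message mixing coefficients (C), which can be plotted to look for outliers.
    ica = FastICA(n_components=r, random_state=0)
    ica_coords = ica.fit_transform(A)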

First 3 dimensions – SVD. The messages with correlated unusual word usage are marked with red circles.

First 3 dimensions – ICA

(Fortunately) both unusual word use and correlated word use are necessary to make such messages detectable. Correlation with proper word frequencies (SVD): ordinary conversations don't show up as false positives!

Correlation with proper word frequencies (ICA)

Uncorrelated with unusual word frequencies (SVD): conversations about unusual things don't show up as false positives either!

Uncorrelated with unusual word frequencies (ICA)

This trick permits a new level of sophistication in connecting related messages into conversations when the usual indicators are not available. It does exactly the right thing – ignoring conversations about ordinary topics and conversations about genuinely unusual topics, but homing in on conversations that use words of inappropriate frequency in a correlated way. Because the dataset is sparse, SVD takes time linear in the number of messages. The complexity of ICA is less clear, but there are direct hardware implementations (SOSUS).

Message-rank matrices are useful because they defend against countermeasure rules like 'use the word 5 ranks below the one you want to use'. A message-rank matrix has a row corresponding to each message; the j-th column holds the rank, in English, of the j-th most frequent noun in the message. Message-rank matrices have many fewer columns, which makes them easier and faster to work with (e.g. the Enron dataset has 200,000+ words, but the average number of nouns per message is <200).
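A hedged sketch of building a message-rank matrix; english_rank stands in for an assumed noun-to-rank table (e.g. derived from the BNC rankings), and message_nouns is a list of noun lists, one per message:

    import numpy as np
    from collections import Counter

    def message_rank_matrix(message_nouns, english_rank, max_nouns=200):
        R = np.zeros((len(message_nouns), max_nouns), dtype=int)
        for i, nouns in enumerate(message_nouns):
            # order the message's nouns by how often they occur within the message
            by_msg_freq = [w for w, _ in Counter(nouns).most_common(max_nouns)]
            for j, w in enumerate(by_msg_freq):
                R[i, j] = english_rank.get(w, 0)   # 0 marks a noun with unknown rank
        return R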

[Figure: the message-rank matrix – rows are messages, column j holds the English rank of the j-th most frequent noun in the message]

Replacing words with those, say, five positions down the list does not show up in the SVD of a message-frequency matrix:

But it's very clear in the SVD of a message-rank matrix:

Detecting substitutions in individual messages

What if the substitution is a word with the same natural frequency? Can we still detect the substitution because of a bump in the flow of the sentence? The graph of adjacent words in English has a small-world property – paths outwards to rare words quickly return to the common-word centre. So looking at frequencies of pairs (or triples) of words is not very revealing: in 'Everything is ready for the watch', the phrases 'the watch', 'for the watch', 'ready for the watch', 'is ready for the watch' get only slowly more unusual.

We've developed a number of measures for the oddity of a word in a context. Each one independently is quite weak; however, combining them produces a usable detector. We use Google's responses as a surrogate for the frequencies of words, quoted phrases, and bags of words in English. Google sees a lot of text… but it's a blunt instrument, because we only use the number of documents returned as a measure of frequency (this doesn't seem to matter), and Google's treatment of stop words is a bit awkward.

Measures I: contextualized frequency. When a word is appropriate in a sentence, the frequencies f{the, cat, sat, on, the, mat} and f{the, sat, on, the, mat} should be quite similar. But f{the, unicorn, sat, on, the, mat} and f{the, sat, on, the, mat} should be very different. This could signal that 'unicorn' is a substituted word.

So we define sentence oddity to be

sentence oddity = f(bag of words with the word of interest omitted) / f(bag of words containing all the words)

The larger this measure is, the more likely that the word of interest is a substitution (we hope). We use the frequency of a bag of words because most strings of any length don't occur at all, even at Google. However, short strings might occur with measurable frequency – this is the basis of our second measure.
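A hedged sketch of the sentence-oddity computation; bag_frequency stands in for an assumed bag-of-words frequency oracle (such as document counts from a search engine), not a real API:

    def sentence_oddity(words, index, bag_frequency):
        """Ratio of the bag-of-words frequency without the word of interest
        to the frequency with it; larger values suggest a substitution."""
        with_word = bag_frequency(words)
        without_word = bag_frequency(words[:index] + words[index + 1:])
        if with_word == 0:
            return float("inf")
        return without_word / with_word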

Measures II: k-gram frequency. Given a word of interest, its left k-gram is the string preceding the word of interest, up to and including the first non-stopword; its right k-gram is the string following the word of interest, up to and including the first non-stopword. Example: 'A nine mile walk is no joke' (f = 33); left k-gram: 'mile walk' (f = 50); right k-gram: 'walk is no joke' (f = 876,000).

Using a k-gram avoids the problems of the small-world adjacency graph – it ignores visits to the (boring) middle region of the graph, but captures connections between close visits to the outer layers. It's a way to get a kind of 2-gram, both of whose words are non-trivial. If the word of interest is a substitute, both its left and right k-grams should have small frequencies. Left and right k-grams measure very different properties of sentences.
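A hedged sketch of extracting left and right k-grams; the stopword list is a small illustrative set, not the one actually used:

    STOPWORDS = {"a", "an", "the", "is", "of", "for", "no", "on", "in", "to"}

    def left_kgram(words, i):
        # from the nearest preceding non-stopword up to and including the word of interest
        j = i - 1
        while j >= 0 and words[j].lower() in STOPWORDS:
            j -= 1
        return " ".join(words[max(j, 0):i + 1])

    def right_kgram(words, i):
        # from the word of interest up to and including the nearest following non-stopword
        j = i + 1
        while j < len(words) and words[j].lower() in STOPWORDS:
            j += 1
        return " ".join(words[i:min(j, len(words) - 1) + 1])

    sentence = "A nine mile walk is no joke".split()
    # left_kgram(sentence, 3) -> "mile walk"; right_kgram(sentence, 3) -> "walk is no joke"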

Measures III: Hypernym oddity The hypernym of a noun is a more general term that includes the class of things described by the noun. e.g. broodmare – mare – horse – equine – odd-toed ungulate – hoofed mammal – mammal – vertebrate Notice that the chain oscillates between ordinary words and technical terms. In informal text, ordinary words are much more likely than technical terms. However, a substitution might be a much less ordinary word in this context.

We define the hypernym oddity to be

hypernym oddity = f(bag of words with the word of interest replaced by its hypernym) – f(bag of words with the word of interest)

We expect this measure to be positive when the word of interest is a substitution, and close to zero or negative when the word is appropriate. Although hypernyms are semantic relatives of the original words, we can get them automatically using Wordnet – although there are usually multiple hypernyms and we can't tell automatically which one is 'right'.
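A hedged sketch of the hypernym-oddity computation using NLTK's WordNet interface; taking the first noun sense's first hypernym is a simplifying assumption, and bag_frequency is the same assumed frequency oracle as before:

    from nltk.corpus import wordnet as wn   # requires the WordNet data to be downloaded

    def hypernym_oddity(words, index, bag_frequency):
        synsets = wn.synsets(words[index], pos=wn.NOUN)
        if not synsets or not synsets[0].hypernyms():
            return 0.0
        hyper = synsets[0].hypernyms()[0].lemma_names()[0].replace("_", " ")
        swapped = words[:index] + [hyper] + words[index + 1:]
        return bag_frequency(swapped) - bag_frequency(words)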

Pointwise mutual information (PMI):

PMI = max over adjacent regions of f(word + adjacent region) / (f(word) × f(adjacent region))

where + is concatenation in either direction, and the maximum is taken over all adjacent regions that have non-zero frequencies. PMI blends some of the properties of sentence oddity and k-grams: it looks for stable phrases (those with high frequency).
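A hedged sketch of a PMI-style score; the exact set of adjacent regions and the normalisation on the original slide are not fully recoverable, so this is an approximation, with phrase_frequency as an assumed phrase-frequency oracle:

    def pmi_measure(words, index, phrase_frequency):
        # max over nearby regions of f(word + region) / (f(word) * f(region))
        word, best = words[index], 0.0
        regions = []
        for k in (1, 2, 3):                       # consider short spans on each side
            if index - k >= 0:
                regions.append((" ".join(words[index - k:index]), "left"))
            if index + k < len(words):
                regions.append((" ".join(words[index + 1:index + 1 + k]), "right"))
        for region, side in regions:
            joint = phrase_frequency(f"{region} {word}" if side == "left" else f"{word} {region}")
            f_w, f_r = phrase_frequency(word), phrase_frequency(region)
            if joint > 0 and f_w > 0 and f_r > 0:
                best = max(best, joint / (f_w * f_r))
        return best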

We use the Enron email corpus (and the Brown news corpus), and extract sentences at random. In email, such sentences are often very unusual – typos, shorthand, technical terms – so it is difficult data to work with. We replaced the first noun in each Enron sentence by the noun with the closest frequency, using the BNC frequency rankings, and removed sentences where the new noun wasn't known to Wordnet or where the sentence (as a bag of words) occurred too infrequently at Google. This left a set of 1714 ordinary sentences and a set of 1714 sentences containing a noun substitution. Having two sets of sentences allowed us to train a decision tree on each of the measures, to determine a good boundary value between ordinary and substitution sentences.

Enron dataset:

Measure                     Detection rate   False positive rate   Boundary
Sentence oddity
Enhanced sentence oddity
Left k-gram
Right k-gram
Average k-gram
Minimum hypernym
Maximum hypernym
Average hypernym
PMI
Combined                    95               11

Each individual measure is very weak. However, they make their errors on different sentences, so combining their predictions does much better than any individual measure. Results for the Brown corpus are similar, although (surprisingly) a little weaker – we expected that more formal language would make substitutions easier to spot. This may reflect changing writing styles, under-represented at Google. Results are the same when Yahoo is used as a frequency oracle.
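A hedged sketch of the combination step, using a scikit-learn decision tree over per-sentence measure scores; the feature rows below are illustrative placeholders, not real measurements:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Each row: (sentence oddity, left k-gram f, right k-gram f, hypernym oddity, PMI)
    X = np.array([[1.2, 50.0, 876000.0, -3.0, 1e-9],     # placeholder "ordinary" sentence
                  [9.7,  2.0,      4.0,  2.0, 1e-12]])   # placeholder "substitution" sentence
    y = np.array([0, 1])                                  # 0 = ordinary, 1 = substitution

    clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    predictions = clf.predict(X)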

Detecting offline connections using online word usage

Analysing a matrix whose rows represent individuals, whose columns represent words, and whose entries are the frequencies with which each individual uses each word in their Enron emails allows us to address questions such as: * Does word usage vary with company role, either explicit or implicit? * Do people who communicate offline develop similarities in their word usage? * Can changing word usage over time reveal changing offline relationships? YES to all three.
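A hedged sketch of building one person-by-word frequency matrix per time period and projecting it with SVD so that people's positions can be compared across periods; the input format (a list of (sender, words) pairs per period) is an assumption:

    import numpy as np

    def person_word_matrix(emails, people, vocab):
        # emails: list of (sender, list_of_words) pairs for one time period
        M = np.zeros((len(people), len(vocab)))
        p_idx = {p: i for i, p in enumerate(people)}
        w_idx = {w: j for j, w in enumerate(vocab)}
        for sender, words in emails:
            for w in words:
                if sender in p_idx and w in w_idx:
                    M[p_idx[sender], w_idx[w]] += 1
        return M

    def svd_coords(M, r=3):
        U, S, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, :r] * S[:r]   # per-person positions; compare across periods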

Enron 1999

Enron 2000

Enron 1st half of 2001

Enron 2nd half of 2001

Detecting mental state using word usage

Humans leak information about their mental states quite strongly – but we are not wired to notice. The leakage comes via frequencies, and frequency changes, in little words such as pronouns. Detection via software is straightforward. Detecting mental state means that we can: * decide which parts of bin Laden's messages he believes and which are pitched for particular audiences * distinguish between testosterone and terrorism on Salafist websites * assess the truthfulness of witnesses (and politicians).

We've had some success with detecting: * deceptive emails in the Enron corpus * speeches with spin in the Winter 2006 federal election * testimony at the Gomery commission. Validation is still an issue; so are differences in the signature of deception in different contexts.

Summary: Language production is mostly a subconscious process, so it is hard to use language unnaturally. Even with knowledge of detection systems, it is difficult to adjust language production to remain concealed. This can be exploited using layered detection systems, with the second layer looking for reaction to the existence of the first layer. This kind of 'shallow' analysis of language hasn't been explored much, so there's lots of potential for new, powerful detection techniques.

?