1 Hidden Markov Models for Software Piracy Detection Shabana Kazi Mark Stamp HMMs for Piracy Detection 1

2 Intro
• Here, we apply metamorphic analysis to software piracy detection
• Very similar to techniques used in malware detection
  o But, problem is completely different
  o Has nothing to do with malware
• We show that there are other applications of such techniques

3 Software Piracy
• Software piracy is a major problem
  o By 2009 estimate, $3 to $4 lost to piracy for every $1 in software sales
• Usually, piracy consists of taking software without modification
• In some cases, software is modified
  o Commercial theft of intellectual property
  o Thief really doesn’t want to get caught…

4 Software Piracy
• We assume software is stolen
  o And modified, making it hard to detect
  o If completely rewritten from scratch, we won’t detect it by our approach
• Want to make life hard for bad guys
  o Ideally, major modifications required
• How much modification is needed before we cannot reliably detect?

5 Goals
• Technique applicable to any software
• No special effort by developer
  o Nothing extra inserted into code
• We only require access to exe file
• Not a watermarking scheme
  o More like software “birthmark” analysis
• Also not plagiarism detection
  o Here, want a “deeper” analysis

6 Use Case
• You work for Alice’s Software Company (ASC)
  o And you develop fancy software for ASC
• Trudy’s Software Company (TSC) develops suspiciously similar product
• You suspect TSC of stealing your code
  o Not identical, but seems similar
• What can you do?
  o We’ve got some ideas that might help…

7 Use Case
• Using the technique discussed here
• Can easily measure code similarity
• Low similarity?
  o Then no hope of proving code is stolen
• High similarity?
  o Further (costly) analysis is warranted
• High similarity does not prove code is stolen
  o But a good reason to take a closer look

8 Background
• Metamorphic software
  o Metamorphic techniques (dead code, permutation, substitution)
• HMM
  o Basic ideas and notation
  o The 3 problems and their solutions (discussed at a high level)
• We’ve seen all of this before

9 Overview
• Training and scoring
• Train HMM on slightly morphed copies of given “base” software
  o Slight morphing to avoid overfitting
• Score morphed copies and other files
  o Here, morphing serves to simulate modifications by attacker
• Want to know how much morphing is required before detection fails
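
To make the scoring step concrete, here is a minimal sketch of how an already trained HMM could score an opcode sequence. It assumes the score is the log-likelihood per opcode computed with the scaled forward algorithm; the slides do not state the exact score, and the function name and the toy model parameters are illustrative only.

```python
import math

def log_likelihood_per_symbol(obs, pi, A, B):
    """Scaled forward algorithm: returns log P(obs | model) / len(obs).

    pi[i]   : initial probability of state i
    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability that state i emits symbol k
    obs     : non-empty list of symbol indices (e.g. opcode ids)

    Assumption: every observed symbol has nonzero probability under the
    model, otherwise math.log raises ValueError.
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transition matrix, then emit.
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        # Rescale to avoid underflow; the log of the scale factors
        # accumulates to log P(obs | model).
        scale = sum(alpha)
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob / len(obs)

# Toy 2-state model over 2 symbols (made-up numbers, for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
score = log_likelihood_per_symbol([0, 1], pi, A, B)
```

Dividing by the sequence length makes scores comparable across files of different sizes, which matters when a threshold is applied to them.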

10 Metamorphic Generator
• Built our own metamorphic generator
• Morph based on extracted opcodes
  o Morphing consists of dead code insertion
  o Specify a dead code percentage and number of blocks to insert
• Do not require that morphed code works
  o Makes detection more difficult, not easier
  o A worst-case scenario, detection-wise
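
The dead code insertion described above can be sketched as follows. This is not the authors' generator: the function name `morph`, the reading of the percentage as relative to the original length, the even split into blocks, and the uniformly random insertion points are all assumptions made for illustration.

```python
import random

def morph(opcodes, dead_code, percent, num_blocks, rng=None):
    """Return a copy of `opcodes` with dead code inserted.

    Assumptions (not from the source): `percent` is measured against the
    original length, the inserted opcodes are split into `num_blocks`
    blocks of near-equal size, and each block is a contiguous run taken
    from `dead_code` (wrapping around if it is short).
    """
    rng = rng or random.Random()
    total = round(len(opcodes) * percent / 100)
    if total == 0 or num_blocks <= 0:
        return list(opcodes)
    num_blocks = min(num_blocks, total)
    base, extra = divmod(total, num_blocks)
    sizes = [base + (1 if i < extra else 0) for i in range(num_blocks)]
    # Insert from the highest position down so that earlier insertions
    # do not shift the positions still to be used.
    positions = sorted(
        (rng.randrange(len(opcodes) + 1) for _ in sizes), reverse=True
    )
    result = list(opcodes)
    for pos, size in zip(positions, sizes):
        start = rng.randrange(len(dead_code))
        block = [dead_code[(start + k) % len(dead_code)] for k in range(size)]
        result[pos:pos] = block
    return result

# Example: morph a 100-opcode sequence by 10% using 2 blocks.
base_ops = ["mov", "add", "push", "pop"] * 25
normal_ops = ["nop", "xchg", "inc"]
morphed = morph(base_ops, normal_ops, 10, 2, random.Random(0))
```

Because the morphed sequence is never executed, the sketch makes no attempt to keep the program functional, in line with the worst-case setting on the slide.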

11 Training
• Given a base executable file…
• Extract its opcode sequence
• Generate 100 slightly morphed copies
  o Each morphed 10%, using dead code extracted from random “normal” file
• Train HMM on morphed copies
  o Using 5-fold cross validation
  o Note: We train one model for each “fold”
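
A minimal sketch of the 5-fold split over the 100 morphed copies, assuming each fold trains on 80 copies and holds out the remaining 20 for scoring. The helper name `k_fold_splits` and the placeholder file names are hypothetical; the HMM training call itself is omitted.

```python
def k_fold_splits(items, k=5):
    """Yield (train, held_out) pairs; each item is held out exactly once."""
    for i in range(k):
        held_out = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out

# Placeholder names standing in for the 100 slightly morphed copies.
copies = [f"morphed_{n:03d}" for n in range(100)]
splits = list(k_fold_splits(copies))

for train, held_out in splits:
    # One HMM would be trained on `train` here (80 copies);
    # `held_out` (20 copies) is kept aside for scoring.
    pass
```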

12 Training
• Illustration of training process
  o Slightly morphed copies of base program

13 Determine Threshold
• For each of the 5 folds
  o Train HMM
  o Score 20 morphed files (match set) and 15 normal files (nomatch set)
• Determine threshold based on scores
  o Threshold is highest score of a normal file
  o Implies FPR = 0; equivalently, TNR = 1 (for the given “fold”)
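
The threshold rule can be sketched as below. It assumes that a higher score means a closer match to the base software and that a file is flagged only when its score is strictly above the threshold; the scores are made-up numbers, not results from the experiments.

```python
def threshold_from_nomatch(nomatch_scores):
    """Threshold is the highest score of any normal (nomatch) file."""
    return max(nomatch_scores)

def is_match(score, threshold):
    """Flag a file as similar to the base only if it beats the threshold."""
    return score > threshold

def rates(match_scores, nomatch_scores, threshold):
    """Return (true positive rate, false positive rate) for one fold."""
    tp = sum(is_match(s, threshold) for s in match_scores)
    fp = sum(is_match(s, threshold) for s in nomatch_scores)
    return tp / len(match_scores), fp / len(nomatch_scores)

# Illustrative scores only.
match_scores = [-2.1, -2.3, -2.0, -4.0]
nomatch_scores = [-5.0, -3.5, -6.2]

thr = threshold_from_nomatch(nomatch_scores)          # -3.5
tpr, fpr = rates(match_scores, nomatch_scores, thr)   # 0.75, 0.0
```

With the strict comparison, no normal file can score above the threshold, which is why the false positive rate is zero by construction for the fold.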

14 Setting a Threshold
• Process used to set threshold

15 Experiments
• Want to determine robustness
• For each base file tested…
• Train to obtain HMM and threshold
• Morph base file at various percentages
  o Using various morphing strategies
  o Refer to this morphing as tampering
• Score each tampered copy
  o Classify, based on threshold
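
As a rough sketch of the classification step, the function below maps each tampering percentage to the fraction of tampered copies that are still detected. It assumes higher scores indicate greater similarity; the function name, the threshold, and the scores are hypothetical.

```python
def detection_rates(scores_by_percent, threshold):
    """Fraction of tampered copies scoring above the threshold, per level."""
    return {
        pct: sum(s > threshold for s in scores) / len(scores)
        for pct, scores in scores_by_percent.items()
    }

# Illustrative scores for three tampering levels (made-up numbers).
tampered = {
    10: [-2.0, -2.2],
    50: [-3.0, -3.6],
    90: [-4.0, -4.2],
}
result = detection_rates(tampered, threshold=-3.5)
# result == {10: 1.0, 50: 0.5, 90: 0.0}
```

Reading the rates from low to high tampering shows where detection starts to fail, which is the robustness question the experiments ask.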

16 Experiments
• Scoring tampered files

17 Experiment Details
• For each base file
  o 6 models
  o 10 tamper percentages for each
  o 100 files each
  o So, 6000 scores!

18 Experiment Details
• Tested 10 base files, each data point
  o So 60,000 scores computed…

19 Experiment Details
• Repeated entire experiment 6 times
  o Using a different number of blocks in training phase
  o Training made little difference on scores
  o So, here we only give results where 1 block used in training phase
• In total 360,000 scores computed
  o And 360 “models” generated
  o That is, 1800 HMMs (one per fold)
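
The counts on the experiment slides can be checked with a few lines of arithmetic. The input figures come from the slides; the decomposition of the 360 models as 6 models × 10 base files × 6 repetitions is an inference from those figures, not something the slides state explicitly.

```python
models_per_base = 6      # models per base file
tamper_levels = 10       # tamper percentages per model
files_per_level = 100    # tampered files scored per percentage
base_files = 10          # base files tested
repetitions = 6          # experiment repeated with different block counts
folds = 5                # one HMM per cross-validation fold

scores_per_base = models_per_base * tamper_levels * files_per_level
scores_per_run = scores_per_base * base_files
total_scores = scores_per_run * repetitions
total_models = models_per_base * base_files * repetitions
total_hmms = total_models * folds

print(scores_per_base, scores_per_run, total_scores)  # 6000 60000 360000
print(total_models, total_hmms)                       # 360 1800
```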

20 Results: Bar Graph

21 Results: 3-d Plot

22 Conclusions
• Results look very promising
  o Robust: high degree of morphing required before base file goes undetected
  o Practical: only requires exe, no special effort when developing
  o Applies to any exe, at any time
• Overall, strong software “birthmark” strategy with practical implications

23 Future Work
• Statistical analysis somewhat weak
  o Results may be stronger than they appear
• Many other scores/combinations of scores can be tested
  o Results can only get better
• Consider other morphing techniques
  o And other file types (e.g., bytecode)
  o And mitigations for 1-block morphing …

24 References
• S. Kazi and M. Stamp, Hidden Markov models for software piracy detection, Information Security Journal: A Global Perspective, 22:140-149, 2013

