Jhih-sin Jheng 2009/09/01 Machine Learning and Bioinformatics Laboratory.


1 Jhih-sin Jheng 2009/09/01 Machine Learning and Bioinformatics Laboratory

2 Reference
Measurement and Classification of Humans and Bots in Internet Chat.
Steven Gianvecchio, Mengjun Xie, Zhenyu Wu, and Haining Wang.
Department of Computer Science, The College of William and Mary.
USENIX Security, 2008.

3 Outline
- Background
- Measurement
- Classification System
- Experimental Evaluation
- Conclusion

5 Chat Bots vs. BotNets
- BotNets – networks of compromised machines
  - some use chat systems (IRC) for command and control (C&C); others use P2P, HTTP, etc.
  - abuse various systems
- Chat Bots – automated chat programs
  - some are helpful, e.g., chat loggers
  - others abuse chat systems and their users: they send spam, spread malicious software, and mount phishing attacks
- Our focus is on the Yahoo! Chat system.

7 Measurement
- August-November 2007 – we collect data
- August 2007 – Yahoo! adds CAPTCHA
  - very few chat bots
- October 2007 – bots are back

8 Measurement
- August and November 2007 – many chat bots
- 1,440 hours of chat logs
- 147 chat logs
- 21 chat rooms

9 Measurement
- To create our dataset, we read the chat logs and label each chat user as human, bot, or ambiguous.
- In total, we recognized 14 different types of chat bots, which differ in
  - triggering mechanisms
  - text generation techniques

10 Types of Chat Bots
- Periodic Bots – send messages based on periodic timers
- Random Bots – send messages based on random timers
- Responder Bots – respond to messages of other users
- Replay Bots – replay messages of other users

11 Humans
- inter-message delay – evidence of a heavy tail
- message size – well fit by an Exponential distribution (λ=0.034)
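
For an exponential distribution the mean is 1/λ, so the fitted λ=0.034 corresponds to a mean message size of roughly 29.4 (the slide does not state the unit). The sketch below shows how such a rate could be estimated from data by maximum likelihood. The function name `fit_exponential_rate` and the synthetic samples are illustrative assumptions; the slides do not say how the authors performed their fit.

```python
import random

def fit_exponential_rate(samples):
    """Maximum-likelihood rate of an exponential distribution: 1 / sample mean."""
    if not samples:
        raise ValueError("need at least one sample")
    return len(samples) / sum(samples)

# Synthetic stand-in for human message sizes (NOT the authors' data):
# draw from an exponential distribution with the rate quoted on the slide.
rng = random.Random(0)
synthetic_sizes = [rng.expovariate(0.034) for _ in range(50_000)]

estimated_rate = fit_exponential_rate(synthetic_sizes)
print(f"estimated rate: {estimated_rate:.4f}")
print(f"implied mean size: {1 / estimated_rate:.1f}")
```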

12 Periodic Bots
- inter-message delay – several clusters with high probabilities
- message size – messages built from templates approximate a normal distribution

13 Random Bots
- inter-message delay – Equilikely distribution at 40, 64, and 88; Uniform distribution over 45-125
- message size – messages selected from a small database
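
The two random timers described on this slide can be imitated with a few lines of Python. This is only a sketch under assumptions: the values 40, 64, 88 and the range 45-125 come from the slide, the unit is presumably seconds (the slide does not say), and the function names are hypothetical.

```python
import random

def equilikely_delays(rng, count, choices=(40, 64, 88)):
    """Delays drawn with equal probability from a small set of timer values."""
    return [rng.choice(choices) for _ in range(count)]

def uniform_delays(rng, count, low=45.0, high=125.0):
    """Delays drawn uniformly from the interval [low, high]."""
    return [rng.uniform(low, high) for _ in range(count)]

rng = random.Random(1)
equilikely_sample = equilikely_delays(rng, 1000)
uniform_sample = uniform_delays(rng, 1000)

# The equilikely bot only ever produces three distinct delays, while the
# uniform bot spreads its delays over the whole range.
print(sorted(set(equilikely_sample)))
print(min(uniform_sample), max(uniform_sample))
```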

14 Responder Bots
- inter-message delay – human-like timing
- message size – multiple templates of different lengths

15 Replay Bots
- inter-message delay – cluster with high probabilities (replay bots are periodic)
- message size – human-like size, well fit by an Exponential distribution (λ=0.028)

17 Classification System
- Entropy Classifier
  - detects abnormal behavior based on message sizes and inter-message delays
  - accurate but slow
- Machine Learning Classifier
  - detects "learned" patterns based on message content
  - fast but must be trained

18 Entropy Classifier
- Observation – chat bots are less complex than humans, and thus lower in entropy
- exploits the low entropy of chat bots
- Corrected Conditional Entropy Test (CCE) – estimates higher-order entropy
- Entropy Test (EN) – estimates first-order entropy
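
The two tests can be sketched in Python as follows. The slide gives no formulas, so this assumes the commonly used definition of corrected conditional entropy: the conditional entropy of order m plus the fraction of length-m patterns that occur only once, multiplied by the first-order entropy, minimized over m. The equal-width binning, the bin count, the maximum order, and all function names are illustrative assumptions, not the authors' settings.

```python
import math
import random
from collections import Counter

def bin_values(values, num_bins):
    """Map raw values (delays or sizes) to bin indices using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

def entropy(symbols):
    """First-order (Shannon) entropy in bits of a sequence of hashable symbols."""
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in Counter(symbols).values())

def patterns(symbols, m):
    """All overlapping length-m patterns of the sequence."""
    return [tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1)]

def corrected_conditional_entropy(symbols, max_order=5):
    """Minimum over orders 1..max_order of CE(m) + unique_fraction(m) * EN."""
    first_order = entropy(symbols)
    best = float("inf")
    previous_entropy = 0.0  # entropy of length-0 patterns
    for m in range(1, max_order + 1):
        pats = patterns(symbols, m)
        if not pats:
            break
        current_entropy = entropy(pats)
        # Clamp tiny negative values caused by the shorter pattern list.
        conditional = max(0.0, current_entropy - previous_entropy)
        counts = Counter(pats)
        unique_fraction = sum(1 for c in counts.values() if c == 1) / len(pats)
        best = min(best, conditional + unique_fraction * first_order)
        previous_entropy = current_entropy
    return best

# A strictly alternating sequence versus a seeded random one.
periodic_symbols = [0, 1] * 50
rng = random.Random(0)
random_symbols = [rng.randrange(4) for _ in range(2000)]
cce_periodic = corrected_conditional_entropy(periodic_symbols)
cce_random = corrected_conditional_entropy(random_symbols)
print(cce_periodic, cce_random)
```

The alternating sequence is fully predictable from its previous symbol, so its higher-order estimate collapses toward zero even though its first-order entropy is one bit. This is the kind of regularity the entropy classifier is described as exploiting.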

19 Machine Learning Classifier
- Observation – chat spam, like email spam, is a text classification problem
- exploits message content of chat bots
- CRM114 – a powerful text classification system

20 Hybrid Classification System
- entropy classifier builds and maintains the bot corpus
- machine learning classifier uses the bot and human corpora
[Diagram labels: INPUT, ENTROPY CLASSIFIER, MACHINE LEARNING CLASSIFIER, BOT CORPUS, HUMAN CORPUS, CLASSIFY AS CHAT BOT, CLASSIFY AS HUMAN]
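
The division of labor between the two classifiers might look roughly like the sketch below. Everything in it is hypothetical: the class `HybridSketch`, the `entropy_is_bot` callback standing in for the entropy classifier, and the small word-count naive Bayes scorer used here in place of CRM114 (whose API is not shown in the slides). Equal class priors are assumed.

```python
import math
from collections import Counter

def words_of(messages):
    """Lower-cased whitespace tokens of a list of messages."""
    return [w for msg in messages for w in msg.lower().split()]

class HybridSketch:
    def __init__(self, entropy_is_bot, human_corpus):
        # entropy_is_bot: callable(list of messages) -> bool, a stand-in
        # for the (slow) entropy classifier.
        self.entropy_is_bot = entropy_is_bot
        self.bot_words = Counter()
        self.human_words = Counter(words_of(human_corpus))

    def learn_from_entropy(self, messages):
        """Slow path: the entropy verdict decides whether to grow the bot corpus."""
        is_bot = self.entropy_is_bot(messages)
        if is_bot:
            self.bot_words.update(words_of(messages))
        return is_bot

    def _log_likelihood(self, words, corpus):
        # Laplace-smoothed word likelihoods under one corpus.
        vocab = len(set(self.bot_words) | set(self.human_words)) + 1
        total = sum(corpus.values())
        return sum(math.log((corpus[w] + 1) / (total + vocab)) for w in words)

    def ml_is_bot(self, messages):
        """Fast path: compare the text against the bot and human corpora."""
        words = words_of(messages)
        return (self._log_likelihood(words, self.bot_words)
                > self._log_likelihood(words, self.human_words))

# Toy usage with a made-up entropy rule: "all messages identical" means bot.
sketch = HybridSketch(lambda msgs: len(set(msgs)) == 1,
                      ["hi how are you", "anyone here from texas"])
sketch.learn_from_entropy(["visit my site for free pics"] * 5)
print(sketch.ml_is_bot(["free pics visit my site"]))
```

The point of the design is that the slow classifier supplies labels for previously unseen bots, after which the fast text classifier can recognize the same bots from far fewer messages.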

22 Experimental Evaluation
- Types of Chat Bots
  - Periodic Bots
  - Random Bots
  - Responder Bots
  - Replay Bots
- Classifiers
  - entropy classifier – 100 messages
  - machine learning classifier – 25 messages

23 Experimental Evaluation
Classification Tests
- Ent – entropy classifier
- SupML – fully-supervised ML classifier, trained on AUG BOTS
- SupMLre – fully-supervised ML classifier, retrained on NOV BOTS
- EntML – entropy-trained ML on AUG BOTS

24 Entropy Classifier

            AUG BOTS (TP)                  NOV BOTS (TP)                  human (FP)
test        periodic  random   respond     periodic  random   replay
EN(imd)     121/121   68/68    1/30        51/51     109/109  40/40       7/1713
CCE(imd)    121/121   49/68    4/30        51/51     109/109  40/40       11/1713
EN(ms)      92/121    7/68     8/30        46/51     34/109   0/40        7/1713
CCE(ms)     77/121    8/68     30/30       51/51     6/109    0/40        11/1713
OVERALL     121/121   68/68    30/30       51/51     109/109  40/40       17/1713

- EN – entropy
- CCE – corrected conditional entropy
- (imd) – inter-message delay
- (ms) – message size

25 EN(imd) and CCE(imd) (entropy classifier results table as on slide 24)
- problems against responder bots
- detect most other chat bots

26 EN(ms) and CCE(ms) (entropy classifier results table as on slide 24)
- problems against random and replay bots
- detect most other chat bots

27 OVERALL (entropy classifier results table as on slide 24)
- detects all chat bots
- false positive rate is ~0.01
- 100 messages

28 Entropy and Machine Learning Classifiers

            AUG BOTS (TP)                  NOV BOTS (TP)                  human (FP)
test        periodic  random   respond     periodic  random   replay
Ent         121/121   68/68    30/30       51/51     109/109  40/40       17/1713
SupML       121/121   68/68    30/30       14/51     104/109  1/40        0/1713
SupMLre     121/121   68/68    30/30       51/51     109/109  40/40       0/1713
EntML       121/121   68/68    30/30       51/51     109/109  40/40       1/1713

- Ent – entropy classifier (from last slide)
- SupML – fully-supervised ML classifier, trained on AUG BOTS
- SupMLre – fully-supervised ML classifier, retrained on NOV BOTS
- EntML – entropy-trained ML on AUG BOTS

29 Ent (classifier comparison table as on slide 28)
- OVERALL results from previous slide

30 SupML and SupMLre (classifier comparison table as on slide 28)
- SupML
  - has problems against November bots
  - needs to be retrained for new bots
- SupMLre
  - detects all bots

31 EntML (classifier comparison table as on slide 28)
- false positive rate is ~0.0005 (Ent is ~0.01)
- 25 messages
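
The quoted rates follow directly from the counts in the results table: 17/1713 is about 0.0099 and 1/1713 is about 0.00058. The short check below recomputes them and also aggregates SupML's detections on the November bots; the counts are copied from the table, while the variable names are illustrative.

```python
# (detected, total) pairs copied from the classifier comparison table.
human_false_positives = {
    "Ent": (17, 1713),
    "SupML": (0, 1713),
    "SupMLre": (0, 1713),
    "EntML": (1, 1713),
}
supml_november = {"periodic": (14, 51), "random": (104, 109), "replay": (1, 40)}

def rate(hits, total):
    """Fraction of users flagged as bots."""
    return hits / total

fp_rates = {name: rate(*pair) for name, pair in human_false_positives.items()}

# SupML was trained on August bots only, so pool its November results.
nov_hits = sum(hits for hits, _ in supml_november.values())
nov_total = sum(total for _, total in supml_november.values())
supml_november_rate = rate(nov_hits, nov_total)

for name, value in fp_rates.items():
    print(f"{name}: false positive rate {value:.5f}")
print(f"SupML on November bots: {nov_hits}/{nov_total} detected")
```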

33 Conclusion
- Measurements
  - overall, chat bots are less complex than humans
  - some chat bots are more human-like
- Classification System
  - exploits benefits of both classifiers
  - quickly classifies known chat bots
  - accurately classifies unknown chat bots

34 Thank you!

