
1 Cognitive Support for Intelligent Survivability Management Dec 18, 2007

2 Outline
Project Summary and Progress Report
–Goals/Objectives
–Changes
–Current status
Technical Details of Ongoing Tasks
–Event Interpretation (OLC)
–Response Selection (OLC)
–Rapid Response (ILC)
–Learning Augmentation
–Simulated test bed (simulator)
Next steps
–Development and Integration
–Red team evaluation

3 Project Summary & Progress Report Partha Pal

4 Background
Outcome of DARPA OASIS Dem/Val program
–Survivability architecture: protection, detection and reaction (defense mechanisms); synergistic organization of overlapping defenses & functionality
–Demonstrated in the context of an AFRL exemplar (JBI)
–With knowledge about the architecture, human defenders can be highly effective, even against a sophisticated adversary with significant inside access and privilege (learning exercise runs)
The survivability architecture provides the dials and knobs, but an intelligent control loop (in the form of human experts) was needed for managing them
What was this knowledge? How did the human defenders use it? Can the intelligent control loop be automated?
Managing = making effective decisions

5 Incentives and Obstacles
Incentives
–Narrowing of the qualitative gap in "automated" cyber-defense decision making
–Self-managed survivability architecture; self-regenerative systems
–Next generation of adaptive system technology: from hard-coded adaptation rules to cognitive rules to evolutionary ones
Obstacles (at various levels)
–Concept: insight (sort of), but no formalization
–Implementation: architecture, tool capability & choice
–Evaluation: how to create a reasonably complex context & wide range of incidents (a real system?)
–Evaluation: how to quantify and validate usefulness and effectiveness; measuring technological advancement

6 CSISM Objectives
Design and implement an automated cyber-defense decision making mechanism
–Expert-level proficiency
–Drive a typical defense-enabled system
–Effective, reusable, easy to port and retarget
Evaluate it in a wider context and scope
–Nature and type of events and observations
–Size and complexity of the system
Readiness for a real system context
–Understanding the residual issues & challenges

7 Main Problem
Making sense of low-level information (alerts, observations) to drive low-level defense mechanisms (block, isolate, etc.) such that higher-level objectives (survive, continue to operate) are achieved
Doing it as well as human experts
–And also as well as in other disciplines
Additional difficulties
–Rapid and real-time decision-making and response
–Uncertainty due to incomplete and imperfect information
–Widely varying operating conditions (no alerts to 100s of alerts per second)
–New symptoms and changes in the adversary's strategy

8 For Example…
Consider a missing protocol message alert
–Observable: a system-specific alert: A accuses B of omission
–Interpretation: A is not dead (it reported the alert); Is A lying? (corrupt); B is dead; B is not dead, just behaving badly (corrupt); A and B cannot communicate
–Refinement (depending on what else we know about the system, the attacker's objective, …): other related communications between A and B; a service is dead if its host is dead; OS platform and likelihood of multi-platform exploits
–Response selection: now or later? Many options:
dead(svc) => restart(svc) | restart(host of svc)
cannot-communicate(host1, host2) => ping | retry operation
corrupt(host) => reboot(host) | block(host) | quarantine(host)
Now consider a large number of hosts, a sequence of alerts, various adversary objectives, and trying to keep the mission going
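The interpretation step above can be sketched in code. This is a hypothetical illustration, not the actual CSISM implementation: the method and hypothesis strings are invented names showing how a single omission accusation fans out into the candidate hypotheses listed on this slide.

```java
import java.util.List;

// Hypothetical sketch: expanding an omission accusation ("A accuses B of
// omission") into the candidate hypotheses discussed on the slide.
// Names and string forms are illustrative, not the real CSISM types.
public class OmissionExpansion {
    // Each hypothesis is rendered as a simple string for illustration.
    public static List<String> hypothesesFor(String accuser, String accused) {
        return List.of(
            "dead(" + accused + ")",                        // B is dead
            "corrupt(" + accused + ")",                     // B behaving badly
            "corrupt(" + accuser + ")",                     // A may be lying
            "comm-broken(" + accuser + "," + accused + ")", // link failure
            "not-dead(" + accuser + ")"                     // A reported it, so A is alive
        );
    }
}
```

A single alert thus yields several competing explanations; the refinement and response-selection stages must then narrow them down.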

9 Approach
The main control loop is partitioned into 2 main parts: event interpretation & response selection
[Diagram: the System emits a stream of events and observations; Interpret produces hypotheses; Respond and React produce actions back into the System; Learn modifies parameters or policy]
Multiple levels of reasoning; varying spatial and temporal scope; different techniques

10 Concrete Goals
A working prototype integrating
–Policy-based reactive (cyber-defense) response
–"Cognitive" control loop for system-wide (cyber-defense) event interpretation and response
–Learning augmentation to modify defense parameters and policies
Achieve expert-level proficiency
–In making appropriate cyber-defense decisions
Evaluation by
–"Ground truth": ODV operator responses to symptoms caused by the red team
–Program metrics

11 Current State
Accomplished quite a bit in 1 year
–KR and reasoning framework for handling cyber-defense events well developed
–Proof-of-concept capability demonstrated for various components at multiple levels: OLC, ILC, Learner and Simulator (e.g., Prover9, Soar in various iterations)
–Began integration and tackling incidental issues
–Evaluation ongoing (internal + external)
Slightly behind in terms of response implementation and integration
–Various reasons (inherent complexity, and the fact that it is very hard to debug the reasoning mechanism)
–Longer-term issues: Can we have confidence in such a cognitive engine? Is a system-wide scope really tenable? Is it possible to build better debugging support?
–Taken mitigating steps (see next)

12 Significant Changes
Recall the linear flow using various types of knowledge? That was what we were planning in June. This evolved, and the actual flow looks like the following:
[Diagram: alerts and observations undergo translation & map-down into accusations and evidence; accusations and evidence are processed into a constraint network, which is built, pruned (coherence and proof), refined, and garbage-collected. The stages use knowledge about bad behavior (bin 1), information flow (bin 2), attacker goals (bin 3), and protocols and scenarios (bin 4).]

13 Significant Changes (contd)
Response mechanism
–Do it in Jess/Java instead of Soar
–Issue: getting the state accessible to Jess/Java
Viewers
–Dual purpose: usability and debugging
–Was: rule driven; write a Soar rule to produce what to display
–Now: get the state from Soar and process it

14 Schedule
Midterm release (Aug 2007) [done]
Red team visit (early 2008)
Next release (Feb 2008)
Code freeze (April 2008)
Red team exercises (May/June 2008)

15 Event Interpretation and Response (OLC) Franklin Webber

16 OLC Overall Goals
Interpret alerts and observations
–(sometimes a lack of observations triggers alerts)
Find appropriate responses
–(sometimes it may decide that no response is necessary)
Housekeep
–Keep history
–Clean up

17 OLC Components
Event Interpretation, Response Selection, Summary, History, Learning
[Diagram: accusations and evidence flow into Event Interpretation; responses flow out of Response Selection]

18 Event Interpretation
Main objectives:
–Essential event interpretation: interpreting events in terms of hypotheses and models; uses deduction and coherence to decide which hypotheses are candidates for response
–Incidental undertakings: protecting the interpretation mechanisms from attack (flooding and resource consumption)
–Current status and plans (items with a * are in progress)
Event interpretation creates candidate hypotheses which can be responded to.

19 Event Interpretation Decision Flow
[Diagram: a generator produces hypotheses; theorem proving and coherence, together with history, yield claims and dilemmas that flow to Response Selection, the Summary, and learning]

20 Knowledge Representation
Turn very specific knowledge into an intermediate form amenable to reasoning
–e.g., "Q2SM sent a malformed Spread message" -> "Q2SM is Corrupt"
–Specific system inputs are translated into a reusable intermediate form which is used for reasoning
Create a graph of inputs and intermediate states to enable reasoning about the whole system
–Accusations and Evidence
–Hypotheses
–Constraints between them
Use the graph to enable deduction via proof and to perform a coherence search

21 Preparing to Reason
Observations and Alerts are transformed into Accusations and Evidence
–Currently translation is done in Soar, but it may move outside to keep the translation and reasoning separate*
Definitions:
–Alert: notification of an anomalous event; Accusation: generic alert
–Observation: notification of an expected event; Evidence: generic observation
Alerts and Observations are turned into Accusations and Evidence that can be reasoned about.

22 Alerts and Accusations
By using accusations, the universe of bad behavior used in reasoning is limited, with limited loss of fidelity. The five accusations below are representative of attacks in the system:
–Value: accused sent malformed data
–Policy: accused violated a security policy
–Timing: accused sent well-formed data at the wrong time
–Omission: expected data was never received from accused
–Flood: accused is sending much more data than expected
CSISM uses 5 types of accusations to reason about a potentially infinite number of bad actions that could be reported.
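The five-category abstraction can be sketched as an enum plus a toy classifier. The keyword-matching rules here are assumptions made for the example; the real CSISM translation is done with Soar/Jess rules, not string matching.

```java
// Hypothetical sketch of the five accusation categories and a toy
// classifier from raw alert descriptions; the keyword rules are
// illustrative assumptions, not the real CSISM translation logic.
public class AccusationKinds {
    public enum Kind { VALUE, POLICY, TIMING, OMISSION, FLOOD }

    public static Kind classify(String alert) {
        String a = alert.toLowerCase();
        if (a.contains("malformed")) return Kind.VALUE;          // bad data content
        if (a.contains("policy")) return Kind.POLICY;            // security policy violation
        if (a.contains("wrong time")) return Kind.TIMING;        // well-formed but mistimed
        if (a.contains("never received")) return Kind.OMISSION;  // expected data missing
        if (a.contains("flood")) return Kind.FLOOD;              // excessive traffic
        throw new IllegalArgumentException("unrecognized alert: " + alert);
    }
}
```

The point of the design is the fixed, small target vocabulary: however many distinct alerts the sensors emit, reasoning only ever sees five accusation kinds.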

23 Evidence*
While accusations capture unexpected behavior, evidence is used for expected behavior. Evidence limits the universe of expected behavior used in reasoning, with limited loss of fidelity.
–Alive: the subject is alive
–Timely: the subject participated in a timely exchange of information
Specific "historical" data about interactions is used by the OLC, just not in event interpretation
CSISM uses two types of evidence to represent the occurrence of expected actions for event interpretation.

24 Hypotheses
When an accusation is created, a set of hypotheses is proposed that explains the accusation
–For example, a value accusation means either the accuser or the accused is corrupt, and that the accuser is not dead
The following hypotheses (both positive and negative) can be proposed
–Dead: subject is dead; fail-stop failure
–Corrupt: subject is corrupt
–Communication-Broken: subject has lost connectivity
–Flooded: subject is starved of critical resources
–OR: a meta-hypothesis that one of a number of related hypotheses is true
Accusations lead to hypotheses about the cause of the accusation.

25 Reasoning Structure
Hypotheses, Accusations, and Evidence are connected using constraints. The resulting graph is used for
–Coherence search
–Proving system facts
A graph is created to enable reasoning about hypotheses.
[Diagram: an accusation constrained to an OR node over hypotheses such as host dead, comm broken, and host corrupt]

26 26 Proofs about the System The OLC needs to derive as much certain information as it can, but it needs to do this very quickly. The OLC does model-theoretic reasoning to find hypotheses that are theorems (i.e., always true) or necessarily false For example, it can assume the attacker has a single platform exploit, and consider each platform in turn, finding which hypotheses are true or false in all cases. Then it can assume the attacker has exploits for two platforms and repeat the process A hypothesis can be proven true or proven false or have an unknown proof status Claims: Hypotheses that are proven true “Claims” are definite candidates for response
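The case analysis described above can be sketched as follows. This is an illustrative simplification, not CSISM's actual prover: a hypothesis is treated as proven under the "single-platform exploit" assumption exactly when it holds no matter which one platform the attacker can exploit. The platform list and the hypothesis predicate are assumptions for the example.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch of proof by case analysis: assume the attacker has
// an exploit for exactly one platform, evaluate the hypothesis under each
// case, and call it proven only if it holds in all of them.
public class CaseAnalysis {
    public static boolean provenUnderSingleExploit(
            List<String> platforms, Predicate<String> holdsIfExploited) {
        // A hypothesis is a theorem iff it is true no matter which single
        // platform the attacker can exploit.
        return platforms.stream().allMatch(holdsIfExploited);
    }
}
```

Repeating the same sweep under a two-platform assumption (pairs of platforms) extends the analysis the way the slide describes.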

27 Coherence
Coherence partitions the system into clusters that make sense together
–For example, for a single accusation either the accuser or the accused may be corrupt, but these hypotheses will cluster apart
Responses can be made on the basis of the partition, or partition membership, when a proof is not available*
In the absence of provable information, coherence may enable actions to be taken.

28 Protection and Cleanup
Without oversight, resources can be overwhelmed
–Due to flooding: we rate-limit incoming messages*
–Due to excessive information accumulation
We take two approaches to mitigate excessive information accumulation*
–Removing outdated information by making it inactive: if some remedial action has cleared up a past problem; if new information makes previous information outdated or redundant; if old information contradicts new information. If an inconsistency occurs, we remove low-confidence information until the inconsistency is removed
–When resources are very constrained, more drastic measures are taken: hypotheses that have not been acted upon for some time will be removed, along with related accusations
Resources are reclaimed and managed to prevent uncontrolled data loss or corruption.
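The "remove low-confidence information until the inconsistency is removed" step can be sketched as a small loop. The `Fact` record, the confidence scores, and the pluggable consistency check are assumptions for the example; in CSISM, consistency would come from the prover, not a predicate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

// Sketch of low-confidence pruning: while the knowledge base is
// inconsistent, drop the lowest-confidence item. The consistency check
// is a stand-in predicate, not the real theorem prover.
public class ConflictPruner {
    public record Fact(String text, double confidence) {}

    public static List<Fact> prune(List<Fact> facts, Predicate<List<Fact>> consistent) {
        List<Fact> kb = new ArrayList<>(facts);
        kb.sort(Comparator.comparingDouble(Fact::confidence)); // lowest confidence first
        while (!consistent.test(kb) && !kb.isEmpty()) {
            kb.remove(0); // discard the least-confident remaining fact
        }
        return kb;
    }
}
```

The loop terminates because each pass removes one fact, and an empty knowledge base is trivially consistent.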

29 Current Status and Future Plans
Knowledge Representation
–Accusation translation is implemented; it may need to change to better align with the evidence
–Evidence implementation in process; will leverage the code and structure for accusation generation
–Use of the coherence partition in response selection: ongoing
Protection and Cleanup are being implemented
–Flood control development is ongoing
–The active/inactive distinction is designed and ready to implement
–Drastic hypothesis removal is still being designed
Much work has been accomplished; work still remains.

30 Response Selection
Main objectives:
–Decide promptly how to react to an attack
–Block the attack in most situations
–Make "gaming" the system difficult: reaction based on high-confidence event interpretation; history of responses is taken into account when selecting the next response; not necessarily deterministic

31 Response Selection Decision Flow
[Diagram: claims and dilemmas from Event Interpretation feed Response Selection, which proposes and then prunes responses using history, the Summary, and learning, emitting potentially useful responses]

32 Response Terminology
A response is an abstract OLC action, described generically
–Example: quarantine(X), where X could be a host, file, process, memory segment, network segment, etc.
A response will be carried out in a sequence of response steps
–Steps for quarantine(X) && isHost(X) include: reconfigure process protection domains on X; reconfigure firewall local to X; reconfigure firewalls remote to X
–Steps for quarantine(X) && isFile(X) include: mark the file non-executable; take a specimen, then delete
A command is the input to actuators that implement a single response step
–Use "/sbin/iptables" to reconfigure software firewalls
–Use ADF Policy Server commands to reconfigure ADF cards
–Use tripwire commands to scan file systems
[Diagram: a response specializes into alternative responses (or); each response decomposes into steps (and), and each step into commands]
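The response-to-steps specialization above can be sketched directly. This is a hedged illustration: the step strings mirror the slide's quarantine examples, but the method and its labels are invented for the sketch, not actual actuator inputs.

```java
import java.util.List;

// Sketch of the response/step hierarchy: a generic quarantine(X) response
// specializes into different step sequences depending on whether X is a
// host or a file. Each step would in turn map to concrete actuator
// commands (iptables, ADF Policy Server, tripwire).
public class ResponsePlanner {
    public static List<String> stepsFor(String target, boolean isHost) {
        if (isHost) {
            return List.of(
                "reconfigure process protection domains on " + target,
                "reconfigure firewall local to " + target,
                "reconfigure firewalls remote to " + target);
        }
        // otherwise treat the target as a file
        return List.of(
            "mark " + target + " non-executable",
            "take specimen of " + target + " then delete");
    }
}
```

Keeping the response abstract and specializing late is what lets the OLC reason about "quarantine" without committing to a particular actuator.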

33 Kinds of Response
Refresh: e.g., start from checkpoint
Reset: e.g., start from scratch
Isolate: permanent
Quarantine/unquarantine: temporary
Downgrade/upgrade: services and resources
Ping: check liveness
Move: migrate component
The DPASA design used all of these except "move". The OLC design has similar emphasis.

34 Response Selection Phases
Phase I: propose
–A set of claims (hypotheses that are likely true) implies a set of possibly useful responses
Phase II: prune
–Discard lower priority
–Discard based on history
–Discard based on lookahead
–Choose between incompatible alternatives
–Choose unpredictably if possible
The learning algorithm will tune Phase II parameters

35 Example
Event interpretation claims "Q1PSQ is corrupt"
Relevant knowledge:
–PSQ is not checkpointable
Propose:
–(A) Reset Q1PSQ, i.e., reboot, or
–(B) Quarantine Q1PSQ using the firewall, or
–(C) Isolate Quad 1
Prune:
–Reboot has already been tried, so discard (A)
–Q1PSQ is not critical, so no need to discard (B)
–Prefer (B) to (C) because it is more easily reversible, but override if there have been too many previous anomalies in Quad 1
Learning:
–Modify the definition of "too many" used when pruning (B)
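The history and reversibility pruning in this example can be sketched as a small chooser. The reversibility ranking is an assumption made for the sketch (quarantine most reversible, isolate least), and the string-encoded responses are illustrative, not the OLC's actual representation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the propose/prune example: discard a proposed response
// that has already been tried, then prefer the most easily reversible of
// what remains. Reversibility ranks are illustrative assumptions.
public class ResponsePruner {
    // lower rank = more easily reversible
    static final Map<String, Integer> REVERSIBILITY =
        Map.of("quarantine", 1, "reset", 2, "isolate", 3);

    public static String choose(List<String> proposed, Set<String> alreadyTried) {
        return proposed.stream()
            .filter(r -> !alreadyTried.contains(r))               // history-based discard
            .min(Comparator.comparingInt(r ->                      // prefer reversible
                REVERSIBILITY.getOrDefault(r.split("\\(")[0], 99)))
            .orElseThrow();
    }
}
```

The "override if too many anomalies" rule and the unpredictable tie-breaking from the previous slide would layer on top of this baseline ordering.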

36 Using Lookahead for Pruning
Event interpretation provides an intelligent guess about the attacker's capability
OLC rules encode knowledge about the attacker's possible goals
Lookahead estimates the potential future state, given assumptions about capability, goals, and response selection
If response X has a better potential future than Y, favor X

37 Status
Design
–Rules for proposing responses encoded in first-order logic
–Corresponding pruning rules described in English
Implementation
–Mounting responses for given hypotheses prototyped in Soar
–Actual response logic is being moved outside Soar (risk mitigation step)
–Some responses are specific to a particular Learning Exercise run
Much less complete than Event Interpretation, but we are getting there…

38 Fast Containment Response and Policies Michael Atighetchi (On behalf of Steve Harp)

39 Inner Loop Controller (ILC) Objectives
Policy-driven: implement policies and tactics from the OLC on a single host
Autonomous: high-speed response can work when disconnected from the OLC by an attack or failure
Flexible: policies can be updated at any time
Adaptive: use learned characteristics of the host and monitored services to tune the policy
Low impact on mission: able to back out of defensive decisions when warranted
Goal: attempt to contain and correct problems at the earliest stage possible

40 40 Current ILC Policies Generic... Excessive resource consumption –Memory, CPU load Tampering with component files Unexpected service termination Unexpected network connections Specialized... JBI client registration failure ADF Agent failure modes SELinux enforcement status

41 Policy Example
[Diagram: a failed file integrity check on local file F1 sets status=corrupted and raises a File Tampering Problem; the File Cleanup subgoal blocks the network, suspends the affected services, quarantines and restores the file, then removes the block/suspend reasons]
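The policy diagram above can be linearized as a reaction plan. This is a hedged sketch: the action names are invented labels for the boxes in the diagram, not actual ILC API calls (the real policy is expressed as Jess rules).

```java
import java.util.List;

// Illustrative linearization of the file-tampering policy: block the
// network, suspend affected services, quarantine and restore the file,
// then remove the block/suspend reasons. Action names are illustrative.
public class FileTamperPolicy {
    public static List<String> reactionPlan(String file) {
        return List.of(
            "block-network",
            "suspend-affected-services",
            "quarantine-file " + file,
            "restore-file " + file,
            "remove-block-and-suspend");
    }
}
```

Note the symmetry the slide emphasizes: the containment actions at the start are explicitly undone at the end, which is what lets the ILC "back out of defensive decisions when warranted".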

42 ILC: Recent Work
ILC configuration
–Made this driven by an (ontology) model of DPASA
–No need for configuration files for each host
–ILC runs on each simulated host and automatically loads the required policy components
Integration
–Hitherto the ILC has been (unit) tested stand-alone
–Initial integration with the Jess DPASA simulator complete; broadening support for multiple policies
–Adjustments to the API to match the simulator

43 ILC: Current Status
ILC policy to handle various applications
Model-driven configuration
Metrics
–Rules: 94; Functions: 134; Frames: 24; Globals: 20
–Base reaction time (in unit test): ~4 ms, measuring the inference part only. Target reaction time: < 100 ms

44 ILC: Ongoing Work
Complete integration with the rest of the CSISM framework:
–DPASA simulator
–ILC-OLC interaction: designed; integration TBD
Testing
–Verify correct reactions in the simulator to various simulated attacks
–Measure reaction times

45 Learning Augmentation Michael Atighetchi (On behalf of Karen Haigh)

46 Learning Augmentation: Motivation
Why learning?
–Extremely difficult to capture all the complexities of the system, particularly interactions among activities
–The system is dynamic (a static configuration gets out of date)
Core challenge: adaptation is the key to survival
Training settings compared:
–Human: + good data, - complex environment, - dynamic system
–Offline training: + good data, + complex environment, - dynamic system
–Online training: - unknown data, + complex environment, + dynamic system
–CSISM's experimental sandbox: + good data (self-labeled), + complex environment, + dynamic system
Very hard for an adversary to "train" the learner!
The sandbox approach was successfully tried in SRS phase 1

47 Development Plan for Learning in CSISM
Shown at June PI meeting:
1. Responses under normal conditions (calibration): an important first step because it learns how to respond to normal conditions
Since June:
2. Situation-dependent responses under attack conditions
3. Multi-stage attacks

48 Calibration Results for all Registration Times (June07 PI meeting)
[Plot: two "shoulder" points indicate upper and lower limits. As more observations are collected, the estimates become more confident of the range of expected values (i.e., tighter estimates to observations).]

49 Multistage Attacks
Multistage attacks involve a sequence of actions that span multiple hosts and take multiple steps to succeed.
–A sequence of actions with causal relationships: an action A must occur to set up the initial conditions for action B; action B would have no effect without previously executing action A.
Challenge: identify which observations indicate the necessary and sufficient elements of an attack (credit assignment). Complications include:
–Incidental observations that are either side effects of normal operations, or chaff explicitly added by an attacker to divert the defender
–Concealment (e.g., to remove evidence)
–Probabilistic actions (e.g., to improve probability of success)
Status: not yet implemented.

50 Architectural Schema for Learning of Attack Theories and Situation-Dependent Responses
[Diagram: CSISM sensors (ILC, IDS) produce observations ending in failure of the protected system; only some are essential. An attack theory experimenter and a defense measures experimenter exercise candidate theories (e.g., ABC, ABD, AC, BD) against a "sandbox", exchanging observations, actions, and failure signals, yielding viable attack theories and viable defense strategies and detection rules.]

51 Multi-Stage Learner
do {
–Generate theory according to heuristic (the hard part!)
  (the complete set of theories is all permutations of all members of Powerset(observations))
–Test theory
–Incrementally update OLC/ILC rulebase
} while theories remain
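The generate-and-test loop above can be sketched in a few lines. For brevity this sketch enumerates only order-preserving subsequences of the observations rather than the full "all permutations of the powerset" space the slide describes, and the sandbox is a stand-in predicate, not the real experimenter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the multi-stage learner's generate-and-test loop: enumerate
// candidate attack theories (here, order-preserving subsequences of the
// observed actions), run each in a stand-in "sandbox", and keep the ones
// that reproduce the failure.
public class TheoryLearner {
    public static List<List<String>> validTheories(
            List<String> observations, Predicate<List<String>> sandbox) {
        List<List<String>> valid = new ArrayList<>();
        int n = observations.size();
        for (int mask = 1; mask < (1 << n); mask++) {     // every non-empty subset
            List<String> theory = new ArrayList<>();
            for (int i = 0; i < n; i++)
                if ((mask & (1 << i)) != 0) theory.add(observations.get(i));
            if (sandbox.test(theory)) valid.add(theory);   // theory reproduces failure
        }
        return valid;
    }
}
```

Even this reduced space is exponential in the number of observations, which is why the heuristics on the next slide matter.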

52 Heuristics & Structure of Results
Primary goal: find all shortest valid attacks (i.e., the minimum required subset) as soon as possible
–Example: in ABCDE, AC and DE may both be valid
Secondary goal: find all valid attacks as soon as possible
–Example: in ABCDE, ABC may also be valid
Heuristics
–Shortest first
–Longest first
–Edit distance to original
–Dynamic resort to valid set: initially, edit distance to the original attack; remaining theories are compared to all valid attacks and the edit distance is averaged
–Dynamic resort / free to remove "chaff": same as "dynamic resort to valid set", but the cost of deletion is zero
Worst-case comparison: sort theories so that the shortest valid attack is found last and all valid attacks are at the end
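The "edit distance to original" heuristic orders candidate theories by how far they are from the originally observed sequence. As an illustration, here is the standard dynamic-programming edit distance over action symbols; treating each observation as one character is an assumption made to keep the sketch small.

```java
// Standard Levenshtein edit distance over action symbols, used here to
// illustrate the "edit distance to original" ordering heuristic. Each
// observation is encoded as a single character for simplicity.
public class EditDistance {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;  // deletions only
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;  // insertions only
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // delete
                                            d[i][j - 1] + 1),  // insert
                                   d[i - 1][j - 1] + sub);     // substitute/match
            }
        return d[a.length()][b.length()];
    }
}
```

Under this metric, the candidate AC is 3 edits away from the observed ABCDE (delete B, D, E), so it is tried before more distant candidates; the "free to remove chaff" variant would set the deletion cost to zero instead of one.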

53 Comparison of Heuristics
4-observation, 3-stage attack
–4 obs = 64 potential trials
–10 obs = 10 million potential trials

54 Incremental Hypothesis Generation
An enhanced query learner generates attack hypotheses
–incrementally, with low memory overhead, so it is able to explore large observation spaces (>>8 steps)
–in heuristic order, to acquire the concept rapidly
Heuristic bias:
–look for shorter attacks first (adjustable prior)
–suspect the order of steps has an influence
–suspect steps interact positively (for the attacker)
–performance comparable to edit-dist+length

55 Incremental Hypothesis Generation: Results
Target concept: the disjunction ".*A.*B.*" or ".*B.*C.*"
Scores represent the sum of trial numbers for elementary concepts
Notes:
1. There are many possible observation sequences that could generate these target concepts; each score is an average over 8 of the sequences
2. For observation sequences longer than 8, learners that pre-enumerate and sort their queries run out of memory
[Table comparing heuristics by #Obs and Len: EditDist, EditDist+Len, and SONNI; the numeric results were not preserved in this transcript]
SONNI: Short-Ordered-NonNegative Incremental Hypothesis Generator

56 Status, Development Plan & Future Steps
June07 PI meeting:
1. Responses under normal conditions (calibration)
   a. Analyze DPASA data (done)
   b. Integrate with ILC (single node) (done)
   c. Add experimentation sandbox (single-node)
   d. Calibrate across nodes
2. Situation-dependent responses under attack conditions
3. Multi-stage attacks
Since June:
1. Development of sandbox, and initial integration efforts with learner (done): attack actions, observations, and control actions; quality signal
2. Development of multistage algorithm (version 1.0 done): theories with sandbox; incremental generation of theories
TODO: ILC input / OLC output

57 Simulated Testbed Michael Atighetchi (on behalf of Michael Atighetchi)

58 Why Simulation?
From the defense-enabled JBI as tested under OASIS Dem/Val to a simulation of the defense-enabled system:
–Use a specification
–Use as integration middleware
–Use for red team experimentation

59 JessSim: The JESS Simulator
Generated via a Protégé plugin
Implemented via JESS rules and functions

60 JessSim Current Status
Implemented protocols (#=14): Plumbing (5 rules); Alert (6 rules); Registration (8 rules); SELinux (1 rule); Reboot (3 rules); LC message (3 rules); ADF (3 rules); Heartbeat (1 rule); PSQ (3 rules); Tripwire (3 rules); ServiceControl (1 rule); POSIX signals (1 rule); Process memory/CPU status (2 rules); Host memory/CPU status (2 rules)
Implemented attacks (#=8):
–Avail: Disable SELinux service
–Avail: Shut down a host
–Avail: Cause a Downstream Controller to crash
–Avail: Cause corruption of endpoint references in SMs
–Avail: Kill processes via kill -9
–Integrity: Corrupt files
–Policy violation: Create a new (rogue) process
–Avail: Cause a process to overload the CPU
Test coverage:
–Unit tests: 28 JUnit tests covering protocol details
–OASIS Dem/Val: main events of DPASA Run #6
Fidelity: focused on application-level protocols

61 JessSim Ongoing Work
Increase fidelity of network simulation
–checks for network connectivity [crash(router) => comm-broken(A,B)]
–simulation of TCP/IP flows for the ILC
Increase fidelity of host simulation for the ILC
–install-network-block / remove-network-block
–note-network-connection / reset-network-connection
–quarantine-file / restore-file / delete-file / checkpoint-file
–note-selinux-down / note-selinux-up
–shun-network-address / unshun-network-address
–enable-interface / disable-interface
–set-boot-option
Protocols for ILC/OLC communication
–forward-to-olc()
Cleanup
–Convert all time units to seconds in all scenarios

62 Next Steps: Integration and Evaluation Partha Pal

63 Learning Integration
ILC-Learning
–Pre-deployment calibration: learn threshold parameters for registration times
–Calibrate across nodes
OLC-Learning
–Results from learning with the experimentation sandbox
–Parameter tuning
–New rules/heuristics

64 ILC-OLC Integration
ILC -> OLC
–Calls to the OLC implemented in ILC policies via calls to ilc-api
–ILC as an informant to the OLC
–ILC as a henchman of the OLC
OLC -> ILC
–The OLC can process alerts forwarded to it from the ILC
–Consider the ILC as a mechanism during response selection

65 JessSim Integration
ILC integration with JessSim
–"ArrestRunawayProcess" loop working
–Implement the file, network, and reboot protocols necessary to support the other existing ILC loops
OLC integration with JessSim
–OLC fully integrated with JessSim
–Adjust integration given changes due to: moving transcription logic (Alerts -> Accusations, Observations -> Evidence) into Jess; performing response selection in Jess
Integration framework
–All components execute within a single JVM
–Support execution of ILC and OLC on dedicated hosts to measure timeliness

66 Integration Framework: Current Status

67 Integration Framework: Needed for Red Team Experimentation

68 Evaluation
Interaction with red and white teams
–Initial telecon (late October)
–Continued technical interchange about CSISM capabilities
–Potential gaps/disagreements: how to use the simulator; evaluation goals
Next steps
–Demonstration of the system
–Red team visit
–Code drop
