1 Model-based Assessment of Adaptation Capability of Networked Systems
Presented by: Kaliappa Ravindran
Other student contributors at CUNY: Arun Adiththan, Khurshid Fayzullaev, Mohammad Rabby
Kaliappa Ravindran and Yassine Wardei, Department of Computer Science, City University of New York (CUNY - City College), New York, NY 10031, USA
Steven Drager, Trusted Resilient Systems Division, Air Force Research Laboratory, Rome, NY, USA

2 Organization of talk
1. System-level assessment: a pre-condition for self-managing systems
2. Algorithm verification versus system assessment
3. Our approach for externalized system assessment methods (advantages: reusable and reconfigurable assessment tools; the need for generality grows as systems become more complex)
4. Case study: adaptive server placement in content distribution networks
5. Conclusions and future work
[Non-functional properties: e.g., stability, convergence, agility, adaptivity]

3 Topics pertaining to system-level trust & resilience
1. Intrinsic complexity of adaptive network systems: functional requirements and non-functional requirements
2. Autonomic system assessment & reconfiguration (adaptation leads to reconfiguration)
3. Concrete understanding/analysis of how the system as a whole deals with network failures and resource outages
4. Is the existing technology for algorithm verification useful? It only tells whether an algorithm is correct or not; it does not address the probabilistic aspects of trust & resilience [probabilistic verification tools such as PRISM (UK) and PAT (Singapore) are useful]
Adaptation granularity: parametric level, algorithm level, system level

4 Software Engineering view of self-managing network systems
The adaptive network system S is a programmable system providing core functionality: service-oriented algorithms behind a service interface, processes/procedures and their interactions, and system components of the network infrastructure (VMs, network links, data storage, compute servers, ...). Resources R are exercised as determined by internal parameters par and the algorithmic processes of the adaptation logic.
External environment conditions E* (e.g., errors, threats, outages) are incident on the software/hardware sub-systems; the parametric representation E captures only the environment parameters known to the designer (E ⊆ E*). The internal state s* is mapped to the state visible at the interface events.
Given the specs of desired QoS (q) and the final QoS actually achieved (q'), an external assessment module H observes QoS-related meta-data and reasons about how good the system S is in meeting the QoS specs.

5 Motto: A system that is good but is not verifiably good is as good as not being good!!

6 Assessing the quality of QoS support mechanisms [holistic evaluation of the service support system PLUS the adaptation logic]
The adaptive networked system S (core functions PLUS adaptation functions) takes the input QoS specs (q), operates under a hostile external environment E* (e.g., failures, attacks, outages, ...), and we observe the QoS achieved (q'), with 0 < q' ≤ q.
A coarse measure of the "goodness" of system S under a harsh condition e ∈ E*:
    ζ(S,q,e) = 1 - (q - q')/q,   with 0.0 < ζ(S,q,e) ≤ 1.0
ζ = 1.0 means 100% trustworthiness is bestowed on S for meeting the QoS specs; ζ = 0.0 means not trustworthy at all. The hostility of the environment condition e ranges from 0.0 to 1.0 (most severe); the closer the achieved QoS q' tracks the prescribed QoS q as e grows, the higher the adaptivity exhibited by S. A measure of the system's capability to adapt is [1 - |dq'/de|].
Instances of S may differ in their algorithms and/or parameters. Examples of QoS: performance, availability, resilience, ...
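The slide's "goodness" and adaptivity measures are simple enough to compute directly; a minimal sketch follows, assuming the reconstructed formulas ζ(S,q,e) = 1 - (q - q')/q and [1 - |dq'/de|] above (the numeric values are illustrative).

```python
def trustworthiness(q_spec: float, q_achieved: float) -> float:
    """Return the coarse goodness measure in (0, 1]; 1.0 means the spec q was fully met."""
    if not (0.0 < q_achieved <= q_spec):
        raise ValueError("expected 0 < q' <= q")
    return 1.0 - (q_spec - q_achieved) / q_spec

def adaptivity(dq_de: float) -> float:
    """Capability-to-adapt measure [1 - |dq'/de|]: smaller QoS drift per unit
    increase in environment hostility e means higher adaptivity."""
    return 1.0 - abs(dq_de)

# Example: spec q = 100 units, achieved q' = 92 units under some condition e
print(trustworthiness(100.0, 92.0))   # 0.92
print(adaptivity(0.15))               # 0.85
```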

7 A day-to-day example of complex systems
Students in a class; a professor controls the coverage & pace of the course materials and the conduct of tests/exams, adjusting based on the tracking error. QoS: how much has a student learned? (grades, student attributes, % of materials covered, ...) relative to the intended level of course coverage. Uncontrolled external events: a student falling sick, a student forgetting about an exam, long absence from class due to work conflicts, ...
The department chairman evaluates the system being assessed: Is the professor doing the right job? Are the course coverage and depth OK? How much do the students learn?
Core requirement of a professor: the ability to teach a course on the subject. Non-functional attribute of the professor: how good and effective is the teaching?

8 Management view of system adaptation processes and external assessment
A control-theoretic loop realizes the QoS-to-resource mapping over the service-support system S (say, a cloud-based service infrastructure of storage, processing and network connections behind a service interface INT). Given the reference QoS q (input), the Observer maps the system state visible at INT to the QoS currently delivered, and the Controller (with the infrastructure model programmed into it) computes resource adjustments as corrective actions on the adaptive application. Hostile environment conditions E* are incident on the infrastructure; the application experiences the actual QoS, settling at the achieved QoS q' in steady state, with QoS tracking error Δ = q - q'.
An external assessment entity, attached via the management interface, analyzes the logged meta-data [Δ, q, ..., E] and reasons about the system behavior (resilience, performance, robustness, ...). What is the "gold standard" for comparison??
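A minimal sketch of such an observer/controller loop is shown below; the toy state-to-QoS mapping and the proportional gain K are illustrative assumptions, not the authors' actual controller, and the logged records mimic the meta-data consumed by the external assessor.

```python
def run_loop(q_ref: float, steps: int = 20, K: float = 0.5):
    resource = 1.0            # current resource allocation (normalized)
    log = []                  # meta-data log read by the external assessment entity
    for t in range(steps):
        q_achieved = 10.0 * resource          # toy state-to-QoS mapping (Observer)
        error = q_ref - q_achieved            # QoS tracking error Δ = q - q'
        resource += K * error / 10.0          # corrective action (Controller)
        log.append({"t": t, "q_ref": q_ref, "q": q_achieved, "error": error})
    return log

trace = run_loop(q_ref=8.0)
print(trace[-1]["error"])     # the tracking error shrinks toward 0 in steady state
```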

9 Assessing non-functional attributes of system behavior
The distributed network system S consists of core-functional elements FUNC (algorithms, cloud resources, ...) plus the adaptation logic & processes, with signal flows between them. The application-level user supplies the input reference QoS q and experiences the actual output QoS q', while uncontrolled environment events E* are incident on S. A management entity H monitors S and is triggered to verify the core functional behavior [FUNC(S), q, q'] and to evaluate the para-functional behavior [PARA(S), q, q', E], producing an assessment assess(S) ∈ {95%-good, 80%-good, ..., bad, ...}.
Axioms of system assessment:
  [quality(S)=good] AND [assess(S)=good] ⇒ [accuracy(H)=high];
  [quality(S)=good] AND [assess(S)=bad] ⇒ [accuracy(H)=low];
  [quality(S)=bad] AND [assess(S)=bad] ⇒ [accuracy(H)=high];
  [quality(S)=bad] AND [assess(S)=good] ⇒ [accuracy(H)=low];
  [quality(S)=70%-good] AND [assess(S)=90%-good] ⇒ [accuracy(H)=medium].
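The axioms translate directly into a small classifier for the accuracy of the assessor H. In this sketch, quality and assessment are graded on [0, 1] and the graded case ("70%-good" assessed as "90%-good" giving medium accuracy) is mapped to a tolerance band, which is an assumption on my part.

```python
def accuracy_of_H(quality: float, assessment: float, tol: float = 0.1) -> str:
    """quality/assessment in [0, 1]; returns 'high', 'medium' or 'low'."""
    gap = abs(quality - assessment)
    if gap <= tol:
        return "high"        # e.g., quality good and assessed good
    if gap <= 2 * tol:
        return "medium"      # e.g., quality 70%-good, assessed 90%-good
    return "low"             # e.g., quality good but assessed bad (or vice versa)

print(accuracy_of_H(0.7, 0.9))   # 'medium'
print(accuracy_of_H(0.9, 0.3))   # 'low'
```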

10 Modeling of the external environment
The QoS specs q, the algorithm parameters par, and the system resource allocation R are usually controllable inputs. In contrast, environment parameters e ∈ E* are often uncontrollable and/or unobservable, yet they do impact the service-level performance (e.g., component failures, network traffic fluctuations, attacks).
Environment parameter space: E* = E(yk) ∪ E(nk) ∪ E(ck), where E(yk) are parameters that the designer knows about, E(nk) are parameters that the designer does not currently know about, and E(ck) are parameters that the designer can never know about.
Algorithm design decisions face this uncertainty, so the designer makes certain assumptions about the environment (e.g., no more than 2 nodes will fail during the execution of a distributed algorithm). When the assumptions get violated, say due to attacks, algorithms fall short of what they are designed to achieve ⇒ evaluate how well an algorithm performs under strenuous conditions.
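A small sketch of the environment partition and of a design-assumption check is given below; the event names and the "no more than 2 node failures" threshold (taken from the example on the slide) are illustrative.

```python
# Environment-space partition E* = E(yk) ∪ E(nk) ∪ E(ck); event names are hypothetical.
E_known      = {"single-node-crash", "double-node-crash"}   # designer knows about (yk)
E_unknown    = {"correlated-rack-outage"}                   # not currently known (nk)
E_unknowable = {"zero-day-attack"}                          # can never be known (ck)
E_star = E_known | E_unknown | E_unknowable

def assumption_holds(failed_nodes: int, max_designed_failures: int = 2) -> bool:
    """False when the environment violates the design assumption, i.e., when the
    algorithm may fall short of what it was designed to achieve."""
    return failed_nodes <= max_designed_failures

print(len(E_star), assumption_holds(3))   # 4 False -> evaluate how the algorithm degrades
```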

11 Our proposed mechanisms for system evaluation
ACTUAL SYSTEM BEING ASSESSED: a network-based realization of the service-support system S. The distributed application supplies the desired QoS specs q (input) and receives the actual QoS achieved q' (output); a service-oriented distributed algorithm exercises the infrastructure resources to provision the QoS. E* denotes the incidence of external uncontrolled events (attacks, failures, resource depletions, ...). QoS trace data are collected into a QoS meta-data log.
SIMULATED MODEL OF SYSTEM: a computational model of S (mathematical formulas & computational procedures for the QoS-to-resource mapping) reproduces the input QoS specs under a simulated input environment E (⊆ E*) and generates the simulated output QoS in a fast-forward mode.
An intelligent management entity reads the observed output trace and the system trace & error data, determines the QoS tracking error and the system modeling error, and reasons about the behavior of system S.
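A minimal sketch of the two error measures used by the management entity follows, assuming simple trace averages: the QoS tracking error of the real system against its specs, and the modeling error between the observed trace and the fast-forwarded simulated trace. The traces are illustrative.

```python
def qos_tracking_error(q_spec, q_observed):
    """Mean shortfall of the achieved QoS q' against the input specs q."""
    return sum(qs - qo for qs, qo in zip(q_spec, q_observed)) / len(q_spec)

def modeling_error(q_observed, q_simulated):
    """Mean absolute deviation between the actual system trace and the output of
    the computational model replayed on the same inputs."""
    return sum(abs(o - s) for o, s in zip(q_observed, q_simulated)) / len(q_observed)

q_spec      = [8.0, 8.0, 8.0, 8.0]
q_observed  = [7.1, 7.6, 7.8, 7.9]
q_simulated = [7.0, 7.5, 7.9, 8.0]
print(qos_tracking_error(q_spec, q_observed), modeling_error(q_observed, q_simulated))
```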

12 Information needed to assess the adaptation capability of a network
Given a networked system S, externalize the system-internal algorithmic processes employed by S for QoS adaptation.
Externalization (**) ⇒ exposure of meta-data and information about the service-layer algorithms & processes to a trusted external entity across a secure interface (the native algorithms & processes themselves execute within the system boundary):
  - a closed-form computational model of the algorithmic processes;
  - a declarative specification of the computational model.
[Similar to publishing technical reports on the various elements of a project and providing the necessary software hooks for external evaluation!!]
Externalization merely requires that the algorithmic processes be specified explicitly (using a declarative language such as Haskell), in contrast with the implicit encoding of these processes in the currently prevalent assessment methods. The system-internal algorithms themselves do not change because of externalization.
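A minimal sketch of what might be exposed across the secure management interface is given below. The slide suggests a declarative language such as Haskell; a plain data structure is used here purely for illustration, and the field names are assumptions rather than the authors' actual schema.

```python
# Hypothetical declarative spec exposed by "externalization": it describes the
# adaptation algorithm and the meta-data it logs, without changing the algorithm.
replication_control_spec = {
    "algorithm": "server-replication-control",
    "inputs":    ["desired_response_time_T", "desired_fault_tolerance"],
    "outputs":   ["t_resp", "achieved_fault_tolerance"],
    "internal_parameters": {
        "N_v": "number of VMs running server instances",
        "K":   "number of VMs leased from the cloud",
        "f_m": "max number of server faults tolerated by design",
    },
    "exposed_meta_data": ["request_rate", "t_s", "T", "t_resp", "N_v", "K", "f_m"],
}

# The external assessor reads this spec (plus the logged meta-data); the
# system-internal algorithm itself executes unchanged inside the system boundary.
print(sorted(replication_control_spec["internal_parameters"]))
```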

13 Sample illustration of "externalization" in a replicated web service
CLIENT APPLICATIONS access a WEB SERVICE over a DATA CHANNEL; the service runs on a CLOUD INFRASTRUCTURE (which manages resources: VMs, storage, ...) and leases K = 6 VMs from the cloud. A server replication control algorithm instantiates server instances on VMs (number of VMs running the server: N_v = 3); the remaining VMs are idle nodes (available cycles: V_c/V_s). X marks an attack on a VM: the number of VMs actually suffering failure is f_a = 2, whereas f_m = 1 is the maximum number of server faults the algorithm is designed to tolerate. The infrastructure interface handles the management of server redundancy, security, etc.
QoS specs arrive over the CONTROL CHANNEL (desired response time T, desired fault-tolerance) together with server task requests (arrival rate, task size t_s, ...); the actual QoS observed is the response time t_resp and the achieved fault-tolerance.
Across the system management interface, externalization exposes a meta-data log for QoS assessment [request rate, t_s, T, desired fault-tolerance, t_resp, ...], a declarative spec of the algorithm [N_v, K, f_m, ...], and the estimation of hard-to-observe parameters [f_a, V_c, ...].
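One way an external assessor could use the exposed meta-data is sketched below: compare the (estimated, hard-to-observe) number of actual VM failures f_a against the designed tolerance f_m, and check the observed response time against the spec. The numeric values are the ones shown on the slide (f_a = 2, f_m = 1); the response-time figures are illustrative.

```python
def assess_replicated_service(f_a: int, f_m: int, t_resp: float, T_spec: float):
    within_design = f_a <= f_m            # did failures stay within the design envelope?
    qos_met       = t_resp <= T_spec      # was the response-time spec met?
    return {"within_design_envelope": within_design, "qos_met": qos_met}

print(assess_replicated_service(f_a=2, f_m=1, t_resp=1.4, T_spec=1.0))
# f_a > f_m: the algorithm is operating outside its design assumptions, which
# explains (rather than excuses) the missed QoS.
```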

14 Scientific merits of our approach
- Reduce the software complexity of network systems by separating the functional requirements from the non-functional requirements.
- The probabilistic verification methods of network systems that underlie our system assessment approach are useful in building resilient and correctly functioning autonomous systems.
- Less domain-specific knowledge is needed in our assessment methods ⇒ cross-domain applicability (e.g., UAVs and autonomous cars, smart supply-chain systems, big-city surveillance systems).
- Existing works on adaptation elsewhere [Sanders et al.; Abdelzaher (UIUC)] focus primarily on domain-specific approaches to assess network systems. The difficulty of separating the functional and non-functional properties of the system under study in these methods limits their domain-neutrality and portability.

15 Case study: Content Distribution over wide-area networks (e.g., Akamai, Digital Island, ...)
A CONTENT server R processes content pages (P_a, P_b, P_c, ...) and stores page descriptors; external content (e.g., news, posts, streaming, ...) feeds the server. A content relay network delivers pages to client terminals c1, c2, c3, ..., cN (E-readers, smart-phones, pocket-PCs, ...).
Pull (e.g., pull(P_a)): a client reaches the server site to read content. Push (e.g., push(P_c)): the server reaches the client site to publish content.

16 Layered view: pull CDN realized on cloud infrastructure
The CORE SYSTEM S comprises:
  - the content server R & clients 1-3 (SaaS), with agents (agent-R, agent-1, agent-2, agent-3) at the server and proxy sites;
  - the content delivery algorithms (SaaS): distribution tree setup, proxy placement, content push/pull, ...;
  - the cloud infrastructure (IaaS + PaaS): nodes/links, storage, processing, with content push/pull-capable proxy nodes, content-forwarding proxy nodes, idle nodes & interconnects, and local access links;
  - the CDN service interface.
Content pages (e.g., p-a, p-b) flow from the server R toward the clients; U({x}) denotes an update message for pages {x} (push), e.g., U({p-b}), U({p-a, p-b}). Attacks, outages, dynamics, mobility, ... constitute the environment E*.
Adaptation processes: an agent-based latency monitor & control logic, driven by the latency specs L and the observed latency L', instantiates components and logs the QoS parameters (L, L', [proxy setup], ...) into a meta-data log. A management entity H reasons about the QoS capability via model-based analysis.

17 Reasons for adjusting proxy placement
1. Load balancing and latency reduction
2. Recovery from the crash of proxy nodes
3. Changes in infrastructure characteristics (say, excluding a node from the placement sets for privacy & security)
4. MTD (moving-target-defense) considerations
...
Determining the optimal proxy placement is an NP-complete problem.

18 Illustration of the search for cost-optimal QoS
The cost incurred for content delivery varies over the space of proxy placements [the ordering of placements is not known], with several local minima and one global minimum. Partial searches triggered by the controller may be successful (moving to a lower-cost placement) or unsuccessful (followed by a rollback).
A heuristics-based controller may forcibly escape from local minima, wherein the system traverses a limited region to seek QoS improvements relative to the current QoS. It may not necessarily find the global optimum!!
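A minimal sketch of such a heuristic placement search is given below: from the current placement, probe a few neighboring placements (a "partial search"); keep a move only if it lowers the cost, otherwise roll back. The toy cost surface and the neighborhood definition are illustrative stand-ins; in the CDN study the cost would come from the DES or analytical model.

```python
import random

def partial_search(current, cost, neighbors, probes=3, rounds=10, rng=random):
    best, best_cost = current, cost(current)
    for _ in range(rounds):
        nbrs = neighbors(best)
        candidates = rng.sample(nbrs, k=min(probes, len(nbrs)))
        improved = [(cost(p), p) for p in candidates if cost(p) < best_cost]
        if improved:                                   # successful partial search
            best_cost, best = min(improved, key=lambda t: t[0])
        # else: unsuccessful probes -> rollback, keep the current placement
    return best, best_cost    # may still be a local minimum, not the global optimum

# toy usage: a placement is a frozenset of proxy-node ids drawn from nodes 0..8
nodes = list(range(9))
def neighbors(p):             # swap one chosen proxy for one currently idle node
    return [frozenset(p - {a} | {b}) for a in p for b in nodes if b not in p]
toy_cost = lambda p: abs(sum(p) - 10)                  # stand-in cost surface
print(partial_search(frozenset({0, 1}), toy_cost, neighbors))
```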

19 Discrete-event simulation (DES) of CDN configurations
DES provides a low-cost but accurate model of the system configuration under study (such as a CDN):
  - topology configurations and algorithm/system parameters are easy to change;
  - operational models and queuing-theoretic models are useful for setting a good set of initial conditions to start the solution search, but they are limited in their accuracy (partly due to the Poissonian and Markovian assumptions made to keep the models tractable).
We employed a "fluid-flow" model of traffic (instead of using NS2-like tools), for two reasons: 1. our goal is not to assess a CDN per se, but to evaluate our assessment methodology as applied to a CDN; 2. it avoids the traffic non-stationarity aspects that are often simulated with NS2-like tools, while retaining the essence of the proxy control algorithms employed in the CDN.
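A bare-bones skeleton of a discrete-event simulator of this kind is sketched below, using a time-ordered event heap. The fluid-flow idea is only loosely reflected here by treating client reads and server updates as deterministic average rates rather than per-packet arrivals; the rates and event handlers are illustrative, not the authors' simulator.

```python
import heapq

def simulate(until=100.0, read_rate=0.1, update_rate=0.05):
    clock, events, stats = 0.0, [], {"reads": 0, "updates": 0}
    heapq.heappush(events, (1.0 / read_rate, "read"))       # first client pull
    heapq.heappush(events, (1.0 / update_rate, "update"))   # first server push
    while events and clock < until:
        clock, kind = heapq.heappop(events)                  # next event in time order
        if kind == "read":                                   # client pull of a page
            stats["reads"] += 1
            heapq.heappush(events, (clock + 1.0 / read_rate, "read"))
        else:                                                # server content update
            stats["updates"] += 1
            heapq.heappush(events, (clock + 1.0 / update_rate, "update"))
    return stats

print(simulate())   # {'reads': 10, 'updates': 4} for the rates above
```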

20 Client-driven update of proxies (use of timestamps to determine an out-of-date copy)
The server R holds a content page p whose global timestamp GTS advances (GTS = 1, 2, 3, ...) as new content update events occur; a proxy X(R) keeps a cached local copy of p with a local timestamp LTS. A client issues request(p) to the proxy and receives notify(p) in return. When the client learns that GTS has advanced past the proxy's LTS (e.g., LTS = 1 while GTS = 2), it sends update_TS(p,2) to the proxy; the proxy then issues get_page(p)/update_page(p) to the server to refresh its local copy and sets LTS = GTS = 2, after which subsequent request(p)/notify(p) exchanges are served from the updated local copy. The scheme assumes the client read rate is much higher than the server update rate (λ_c >> λ_s).
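A minimal sketch of the timestamp-based staleness check is given below, assuming the LTS/GTS comparison described above; the class and function names are hypothetical and the refresh is collapsed into the request path for brevity.

```python
class Proxy:
    def __init__(self):
        self.pages = {}                      # page -> (LTS, cached content)

    def request(self, page, gts, fetch_from_server):
        lts, content = self.pages.get(page, (0, None))
        if lts < gts:                        # cached copy is stale (or missing)
            content = fetch_from_server(page)    # get_page / update_page at server R
            self.pages[page] = (gts, content)    # LTS := GTS
        return content                       # serve from the (now fresh) local copy

proxy = Proxy()
fetch = lambda p: f"content-of-{p}@server"
print(proxy.request("p", gts=2, fetch_from_server=fetch))  # miss -> fetch from server
print(proxy.request("p", gts=2, fetch_from_server=fetch))  # hit  -> no server contact
```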

21 Queuing-network model of the sample CDN simulated
Each node X in the distribution tree is represented by queuing elements in the equivalent queuing-theoretic representation: NE-X(u), the network element of node X that forwards control messages along the upstream path; NE-X(d), the network element that forwards control/data messages downstream; and SE-X, the storage element that reads/writes content from/to disk (each with its own service rate). The figure shows these elements for the server R and the proxy nodes along the paths, together with the flow of transactions (GTS updates, LTS checks, read/update requests on disk, and completed requests returned to the clients c and c').
Read requests arrive from clients c and c' (rates λ_c, λ_c') and content updates arrive from the server R (rate λ_s). A "fluid-flow" model of traffic splits is used: for example, a client's rate λ_c entering a node splits into a control-message rate u(c,q)·λ_c that continues upstream and a data-message rate [1-u(c,a)]·u(c,q)·λ_c that returns downstream, with u(.) denoting the per-hop split fractions.
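The sketch below illustrates one reading of the fluid-flow split: a client's average read rate is split deterministically at each hop by the fraction satisfied there, with the remainder continuing upstream. This interpretation and the split fractions are assumptions for illustration, not the exact split semantics of the authors' model.

```python
def fluid_split(lam_c, served_fraction_per_hop):
    """Return the control-message rate reaching each successive hop when the
    fraction served at a hop is removed from the flow that continues upstream."""
    rates, rate = [], lam_c
    for u in served_fraction_per_hop:
        rates.append(rate)          # average request rate arriving at this hop
        rate *= (1.0 - u)           # the unserved fraction flows further upstream
    return rates

# client c with λ_c = 0.1; 40% served at the first proxy, 70% at the next, rest at R
print(fluid_split(0.1, [0.4, 0.7]))   # [0.1, 0.06] up to float rounding
```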

22 Simulation results: latency and overhead versus server update rate
[Figure: the simulated topology has a server R, clients x, y, z, u, and ten intermediate nodes; proxy nodes are either content store-&-forward capable or content-forward only, connected by path segments of the distribution tree (some nodes/links are not in the tree). The graphs plot the content read latency L' and the per-client read overhead O' (normalized units) experienced by the client feeds (feed x, feed y, feed z, feed u) against the server update rate λ_s, for cached content pages and per-client read rates λ_c(x) = λ_c(y) = λ_c(z) = λ_c(u) of 0.05, 0.1, and 0.16. Queues overflow for λ_s > 0.1, λ_s > 0.75, and λ_s > 1.5 in the different client-load settings. λ_s and λ_c(.) are expressed in normalized units relative to the storage & network bandwidth.]

23 Modeling client displeasure as a cost
U(L, L') is the utility value of the CDN service in accessing event data p, given the client-prescribed QoS tolerance for an event read L(p) and the actual QoS enforced L' (a lower value of L' means better QoS): U = 1.0 means the highest pleasure (good quality), U = 0.0 means no pleasure at all (degraded quality). The degree of client displeasure in using the CDN service, i.e., the actual QoS L' being lower than the desired QoS L, is modeled as a cost (e.g., a soldier receiving delayed updates of battle status is less effective).
Net displeasure experienced by a client in using the CDN service = k·[1 - U(L, L')], for k > 0. The objective is to minimize the total cost incurred by the CDN in serving M clients: the weighted sum of the client displeasures plus the total resources expended, where k_i is the weight assigned to the i-th client (i.e., the degree to which the CDN service provider considers the displeasure of the i-th client as important, given as a value between 0 and 1, higher meaning more important) and R_o, R_l are normalization constants.

24 Utility function and revenue loss
Utility function: U(L') = 1.0 if L' <= L_0 (the minimum delay); otherwise U(L') = exp[-γ × (L' - L_0)], where γ is the sensitivity to delay. Client displeasure: D(L') = 1 - U(L'). Revenue loss for the CDN provider (i.e., penalty): Φ = Σ_i D_i(L') × α_i + β × O', where α_i is the client weight, β is the overhead weight and O' is the overhead cost. In military applications, the "penalty" depicts a reduction in the system's ability to meet mission-criticality.
[Figures: (a) exhaustive search to find the lowest revenue loss for the 9-node topology with 2 proxy nodes; (b) selective search to find the lowest revenue loss for the 18-node topology with 3 proxy nodes. The plots show the revenue loss Φ (normalized units, roughly in the range 3 to 5.5) over the proxy-node placements in the distribution tree, marking the global minimum, other local minima, the best-so-far placement, and partial searches with rollback. The solution search cycles are triggered by the CDN controller.]
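A direct transcription of the utility and penalty formulas on this slide is sketched below; the numeric inputs (latencies, L_0, γ, weights, overhead) are illustrative and are not taken from the experiments.

```python
import math

def utility(L_prime: float, L0: float, gamma: float) -> float:
    """U(L') = 1 if L' <= L0, else exp(-γ (L' - L0))."""
    return 1.0 if L_prime <= L0 else math.exp(-gamma * (L_prime - L0))

def revenue_loss(latencies, L0, gamma, client_weights, beta, overhead):
    """Φ = Σ_i D_i(L') · α_i + β · O', with D(L') = 1 - U(L')."""
    displeasure = [1.0 - utility(L, L0, gamma) for L in latencies]
    return sum(d * a for d, a in zip(displeasure, client_weights)) + beta * overhead

# 3 clients, minimum delay L0 = 0.2 s, delay sensitivity γ = 2.0
print(revenue_loss([0.15, 0.35, 0.60], L0=0.2, gamma=2.0,
                   client_weights=[1.0, 0.8, 0.5], beta=0.3, overhead=1.2))
```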

25 Model-based estimation of overhead & latency
The CDN simulator takes the parameters of the client workloads & QoS specs (request arrivals from clients i, j, k with different content sizes/types; overhead and latency parameters O_CL/SR, L_CL/SR) and performs task planning & scheduling of the task events (based on the combined client request-arrival specs) onto the resources at the proxy nodes. The initial inputs are the topology graph G(V,E), the system parameters and the utility functions; the distribution tree T(V',E') ⊆ G(V,E) is the plug-in CDN model.
The Controller compares the QoS specs q ≡ [L,O] against the observed QoS q' ≡ [L',O'] through the overall cost measure, i.e., the error between the target and observed cost, and places proxies V'' ⊆ V' to reduce that error, using an optimization algorithm A for proxy placement (evolutionary, greedy, ...; the optimal placements were hand-computed for reference). State feedback (node/link usage) is returned for CDN assessment.
[Results collected: overhead & latency versus the percentage of nodes used as content-distributing proxies (5%, 10%, 20%, 30%), for |V'| = 15 nodes and 5 client clusters; the average number of hops traversed by a client request is 3.]

26 Comparison against a 'gold standard'
The steady-state QoS output of the actual CDN system (as determined from the DES) is compared against the QoS output computed with an analytical model of the CDN system (with Poisson-approximated client traffic), which serves as the 'gold standard' for comparison purposes. The system employs genetic and/or greedy search methods to find a (sub-)optimal proxy placement; the i-th placement is computed as optimal in the face of the unknown event space (E* - E_i) from the external environment, where an event type e_ij ∈ E_i has a known value v_ij (i.e., v_ij is measurable). SL denotes the QoS deviation due to server laxity when only E_i ⊆ E* is accounted for.
[Figure: the 18-node CDN topology studied, with server R, clients c_a-c_e, proxy nodes for content store & forwarding, content-forwarding nodes, and nodes/links not part of the distribution path; the flow of client requests & server update traffic has an average feed rate of 1.5/sec, and the latencies seen by c_a-c_e are 188, 198, 259, 372, 363 msec. The plot shows the QoS (a weighted sum of the latency L, incurred for content access on a per-pull basis, and the overhead O) as a normalized cost incurred for successive placements, lower being better; 'cost' is computed as an inverse function of the latency L', and the marked values include 2.236, 2.475, 3.521 and 4.290 (the costs for placements 2 and 5 are shown).]

27 Future research plans
- Injection of attack and stressor events on the network system being tested
- Incorporation of system utility functions and SLA penalties as part of the audit analysis of network systems
- Identification of probabilistic measures of system adaptation quality (e.g., determination of "measurement bias")
- Machine-intelligence and Markov decision tools for system analysis in the face of uncertain events
- Autonomic methods for system reconfigurations

