Published by Isabella Briggs. Modified over 6 years ago.
1
Network Reliability and Security
Network Planning with Reliability Considerations
Takács György
Halterv
4
The four viewpoints of QoS
5
ITU-T E.800 definitions
Availability: Availability of an item to be in a state to perform a required function at a given instant of time or at any instant of time within a given time interval, assuming that the external resources, if required, are provided ([ITU-T E.802]).
Reliability: The probability that an item can perform a required function under stated conditions for a given time interval.
Disaster recovery, business continuity: All activities associated with the restoration of a network-provided service after disasters. Examples of such disasters are fire, earthquakes, vandalism, bombings, or software malfunctioning.
6
ITU-T E.800 definitions
Service level agreement (SLA): A formal document listing a set of performance characteristics and target values (or range) to be delivered for a service or portfolio of services by the service provider.
Fault: The inability of an item to perform a required function, excluding that inability due to preventive maintenance, lack of external resources or planned actions.
Mean time between failures (MTBF): The expectation of the time between failures computed from a statistically significant number of samples, usually expressed as the arithmetic mean.
7
Security-specific terms I.
Security: The term 'security' is used in the sense of minimizing the vulnerabilities of assets and resources. An asset is anything of value. A vulnerability is any weakness that could be exploited to violate a system or the information it contains ([ITU-T X.800]).
Information security: Preservation of confidentiality, integrity and availability of information ([ITU-T X.1051]).
Data security: Preservation of integrity and availability of data.
Privacy: The right of individuals to control or influence what information related to them may be collected and stored and by whom and to whom that information may be disclosed.
8
Security-specific terms II.
Password: Confidential authentication information usually composed of a string of characters ([ITU-T X.800]).
Confidentiality: The property that information is not made available or disclosed to unauthorized individuals, entities, or processes ([ITU-T X.800]).
Data confidentiality: A service that can be used to provide for protection of data from unauthorized disclosure. The authentication framework supports the data confidentiality service. It can be used to protect against data interception ([ITU-T X.509]).
9
Security-specific terms III.
Integrity: The property that data have not been altered in an unauthorized manner ([ITU-T H.235.0]).
Data integrity: The property that data have not been altered or destroyed in an unauthorized manner ([ITU-T X.800]).
Malware: A generic name for software which intentionally performs actions which can damage data or disrupt systems.
Hacking: The term used to describe malicious acts of a wide-ranging nature such as overcoming access controls, denial of service, theft of information or installation of malware.
Phishing: Creating a replica of an existing web page to fool a user into submitting personal, financial, or password data.
10
Security-specific terms IV.
Virus (computer virus): A computer program that can copy itself and infect a computer without the permission or knowledge of the user.
Worm: A self-replicating computer program. Worms typically harm the network, if only by consuming bandwidth.
Trojan horse
Fraud
Spam
11
Fundamentals of reliability issues in network planning
Networks must be dimensioned with higher parameters than the exact specification calculated from the demand parameters. Networks always need some spare capacity. Plan for unpredictable situations! Reliability and availability dimensioning is an important part of network dimensioning and planning.
12
Organizations are increasingly reliant on computer networks for business or mission-critical applications. The scope and size of these networks have expanded so rapidly over the past two decades that considerable effort and expense are now targeted at keeping network resources available, sometimes 24 hours a day, all year. Traditionally this area of network design has been the preserve of large mainframe sites and those sites requiring high levels of protection (such as nuclear power plants). However, the explosion of Web-based business methods means that many more organizations are now eager to maintain high availability in order to minimize service losses.
13
If the network is poorly designed, and insufficient attention is paid to providing availability in core systems, users can experience anything from slow response times to complete loss of service (referred to as downtime) for extended periods. The technical issues in maintaining high availability are both complex and subtle, and it is the network designer’s job to balance loss probability against cost, providing guidance to senior management on the likelihood of failures and their impact on the business.
14
Networks are rarely static environments, and budgets are finite. In practice network designers are required to make a range of pragmatic and technical decisions that address, accept, mitigate, or transfer the risks of failure, all within the constraints of a budget. The designer must also ensure that the solutions provided are scalable, so that additional nodes, services, and capacity can be added without major upheaval and without adversely affecting existing users. Downtime for truly business- and mission-critical systems can equate to losses of millions of dollars per minute; these organizations, therefore, demand high-availability (HA) networks and are often prepared to go to extraordinary lengths to achieve them.
17
Failure knows no boundaries in a network design, and the smallest component failure can effectively bring down a whole business without warning (e.g., a failed hard disk controller on your core e-business server could stop all transactions). For practical reasons organizations are invariably broken down into teams responsible for different aspects of IT (desktop support, communications, applications, database, cabling, etc.). When a problem occurs, it is all too common for application staff to blame the network and vice versa. To maintain HA networks, different disciplines must work together, both at the design phase and subsequently. Good diagnostic, monitoring, and management tools can also help.
18
Planning for failure
When designing a reliable data network, network designers are well advised to keep two quotations in mind at all times:
"Anything that can go wrong, will go wrong." (Murphy)
"Whatever can go wrong will go wrong at the worst possible time and in the worst possible way . . . Expect the unexpected." (Douglas Adams, The Hitchhiker’s Guide to the Galaxy)
19
Failure refers to a situation where the observed behavior of a system differs from its specified behavior. A failure occurs because of an error, caused by a fault. The time lapse between the error occurring and the resulting failure is called the error latency. Faults can be hard (permanent) or soft (transient). For example, a cable break is a hard fault, whereas intermittent noise on the line is a soft fault.
20
Planning for failure
Single Point of Failure (SPOF) indicates that a system or network can be rendered inoperable, or significantly impaired in operation, by the failure of one single component. For example, a single hard disk failure could bring down a server; a single router failure could break all connectivity for a network.
Multiple points of failure indicate that a system or network can be rendered inoperable through a chain or combination of failures (as few as two). For example, failure of a single router, plus failure of a backup modem link, could mean that all connectivity is lost for a network. In general it is much more expensive to cope with multiple points of failure, and often financially impractical.
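As a rough illustration of the SPOF idea, the sketch below removes each node of a small topology in turn and checks whether the remaining nodes can still reach one another. The node names and the topology are invented for the example, and only node failures (not link failures) are considered.

```python
from collections import deque


def is_connected(links, nodes):
    """Return True if every node in `nodes` can reach every other one."""
    nodes = set(nodes)
    if not nodes:
        return True
    # Build an undirected adjacency map restricted to the surviving nodes.
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    # Breadth-first search from an arbitrary surviving node.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes


def single_points_of_failure(links):
    """List nodes whose individual failure disconnects the remaining nodes."""
    nodes = {n for link in links for n in link}
    return sorted(n for n in nodes if not is_connected(links, nodes - {n}))


# Hypothetical topology: two LANs reach the WAN through one router.
single_router = [("lan-a", "router-1"), ("lan-b", "router-1"), ("router-1", "wan")]

# Same topology with a second, parallel router added.
dual_router = single_router + [
    ("lan-a", "router-2"), ("lan-b", "router-2"), ("router-2", "wan"),
]

spof_single = single_points_of_failure(single_router)
spof_dual = single_points_of_failure(dual_router)
print(spof_single, spof_dual)
```

In the first topology only the router is reported; once a second router is added in parallel, no single node failure breaks connectivity, although a combination of two failures still can.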
21
Fault tolerance indicates that every component in the chain supporting the system has redundant features or is duplicated. A fault-tolerant system will not fail because any one component fails (i.e., it has no single point of failure). The system should also provide recovery from multiple failures. Components are often overengineered or purposely underutilized to ensure that while performance may be affected during an outage, the system will perform within predictable, acceptable bounds.
22
Fault resilience (rapid recovery from the faulty state) implies that at least one of the modules or components within a system is backed up with a spare (e.g., a power supply). This may be in hot standby, cold standby, or load-sharing mode. In contrast with fault-tolerant systems, not all modules or components are necessarily redundant (i.e., there may be several single points of failure). For example, a fault-resilient router may have multiple power supplies but only one routing processor. By definition, one fault-resilient component does not make the entire system fault tolerant.
23
Disaster recovery is the process of identifying all potential failures, their impact on the system/network as a whole, and planning the means to recover from such failures.
24
Calculating the true cost of downtime
Network designers are largely unfamiliar with financial models. It is, however, imperative in designing reliable networks that the designer gathers some basic financial data in order to cost-justify and direct suitable technical solutions. The data may come from line managers or financial support staff and may not be readily collated. Without these data the scale of the problem is undefined, and it will be hard to convince senior financial and operational management that additional features are necessary.
25
To illustrate the point let us consider a hypothetical consumer-oriented business (such as an airline, car rental, vacation, or hotel reservation call center). The call center is required to be online 24 hours a day, 7 days a week, 365 days a year. The business has 800 staff involved in call handling (transactions), each with an average burdened cost of $25 an hour (i.e., the cost of providing a desk, heating, lighting, phone, data point, etc.). There is a small profit made on each transaction, plus a large profit on any actual sale that can be closed. We assume here that there are on average three sales closed per hour.
26
Cost of Idle Staff is calculated as (Headcount × Burdened Cost × Downtime).
Production Losses are calculated as (Headcount × Transactions per Hour × Profit per Transaction × Downtime).
Lost Sales are calculated as (Headcount × Sales per Hour × Profit per Sale × Downtime).
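The three formulas can be combined into a small calculator. The sketch below uses the headcount (800), burdened cost ($25 an hour), and three sales per hour from the call-center example; the transactions per hour and the two profit figures are not given in the text and are invented purely for illustration. It also assumes the sales rate is per staff member, as the Lost Sales formula implies.

```python
from dataclasses import dataclass


@dataclass
class CallCenter:
    headcount: int                 # staff involved in call handling
    burdened_cost_per_hour: float  # dollars per staff member per hour
    transactions_per_hour: float   # per staff member (assumed value)
    profit_per_transaction: float  # dollars (assumed value)
    sales_per_hour: float          # per staff member
    profit_per_sale: float         # dollars (assumed value)


def downtime_cost(site: CallCenter, downtime_hours: float) -> dict:
    """Apply the three downtime formulas and return each component."""
    idle_staff = site.headcount * site.burdened_cost_per_hour * downtime_hours
    production_losses = (site.headcount * site.transactions_per_hour
                         * site.profit_per_transaction * downtime_hours)
    lost_sales = (site.headcount * site.sales_per_hour
                  * site.profit_per_sale * downtime_hours)
    return {
        "idle_staff": idle_staff,
        "production_losses": production_losses,
        "lost_sales": lost_sales,
        "total": idle_staff + production_losses + lost_sales,
    }


site = CallCenter(
    headcount=800,
    burdened_cost_per_hour=25.0,
    transactions_per_hour=20.0,   # hypothetical
    profit_per_transaction=0.5,   # hypothetical
    sales_per_hour=3.0,
    profit_per_sale=10.0,         # hypothetical
)
costs = downtime_cost(site, downtime_hours=1.0)
for name, value in costs.items():
    print(f"{name}: ${value:,.0f}")
```

Because every term is linear in downtime, doubling the outage length doubles each component; the real figures depend entirely on the profit values gathered from the business.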
29
Developing a disaster recovery plan
All networks are vulnerable to disruption. Sometimes these disruptions may come from the most unlikely sources. Natural events such as flooding, fire, lightning strikes, earthquakes, tidal waves, and hurricanes are all possible, as well as fuel shortages, electricity strikes, viruses, hackers, system failures, and software bugs. History shows us that these events do happen regularly. In 1999 and 2000 we saw the seemingly impossible: power shortages in California threatened to cripple Silicon Valley, and a combination of fuel shortages, train safety issues, and massive flooding ….
30
In fact, various studies indicate that the majority of system failures can be attributed to a relatively small set of events. These include, in decreasing order of frequency, natural disaster, power failure, systems failure, sabotage/viruses, fire, and human error. There is also a general consensus that companies that take longer than a full business week to get back online run a high risk of being forced out of business entirely (some analysts state as high as 50 percent).
31
A general approach to the creation of a Disaster Recovery (DR) plan:
Benchmark the current design: Perform a full risk assessment for all key systems and the network as a whole. Identify key threats to system and network integrity. Analyze core business requirements and identify core processes and their dependence upon the network. Assign monetary values to loss of service or systems.
Define the requirements: Based on business needs, determine an acceptable recovery window for each system and the network as a whole. If practical, specify a worst-case recovery window and a target recovery window. Specify priorities for mission- or business-critical systems.
32
Define the technical solution: Determine the technical response to these challenges by evaluating alternative recovery models, and select solutions that best meet the business requirements. Ensure that a full cost analysis of each solution is provided, together with the recovery times anticipated under catastrophic failure conditions and lesser degrees of failure.
Develop the recovery strategy: Formulate a crisis management plan identifying the processes to be followed and key personnel response to failure scenarios. Describe where automation and manual intervention are required. Set priorities to clearly identify the order in which systems should be brought back online.
33
Develop an implementation strategy: Determine how new/additional technology is to be deployed and over what time period. Document changes to the existing design. Identify how new/additional processes and responsibilities are to be communicated.
Develop a test program: Determine how business- and mission-critical systems may be exercised and what the expected results should be. Define procedures for rectifying test failures. Run tests to see if the strategy works; if not, make refinements until satisfied.
Implement continuous monitoring and improvements: Once the disaster recovery plan is established, hold regular reviews to ensure that the plan stays synchronized as the network grows or design features are modified.
34
Disaster recovery models
35
Tape or CD site backup: Tape or CD-ROM backup and restore are widely used DR methods for sites. Traditionally, key data repositories and configuration files are backed up nightly or every other night. Backup media are transported and securely stored at a different location. This enables complete data recovery should the main site systems be compromised. If the primary site becomes inoperable, the plan is to ship the media back, reboot, and resume normal operations.
Pros and Cons: This is a low-cost solution, but the recovery window could range from a few hours to several days; this may prove unacceptable for many businesses. Media reliability may not be 100 percent and, depending upon the backup frequency, valuable data may be lost.
36
Electronic vaulting: With remote electronic vaulting, data are archived automatically to tape or CD over the network to a secure remote site. Electronic vaulting ideally requires a dedicated network connection to support large or frequent background data transfers; otherwise, archiving must be performed during off-peak periods or low-utilization periods (e.g., via a nightly backup). Backup procedures can, however, be optimized by archiving only incremental changes since the last archive, reducing both traffic levels and network unavailability.
Pros and Cons: The operating costs for electronic vaulting can be up to four times more expensive than simple tape or CD backup; however, this approach can be entirely automated. Unlike simple media backup there is no requirement to transport backup data physically. Recovery still depends on the most recent backup copy, but this is likely to be more recent due to automation. Electronic vaulting is more reliable and significantly decreases the recovery window (typically, just a few hours).
37
Data replication/disk mirroring: Remote disk mirroring provides faster recovery and less data loss than remote electronic vaulting. Since data are transferred to disk rather than tape, performance impacts are minimized. With disk mirroring you can maintain a complete replica file system image at the backup site; all changes made to production data are tracked and automatically backed up. Data are typically synchronized in the background, and when the recovery site is initialized or when a failed site comes back online, all data are resynchronized from the replica to production storage. Note that data may be available only in read-only mode at the recovery site if the original site fails (to ensure at least one copy is protected), so services will recover, but applications that need to update data may be somewhat compromised until the primary storage comes back online, unless some form of local data cache is available. A disk mirroring solution should ideally be able to use a variety of disks with industry-standard interfaces (e.g., SCSI, Fibre Channel, etc.).
38
Data replication/disk mirroring
Pros and Cons: Data replication is more expensive than the previous two models, and for large sites it can generate considerable traffic volumes. Ideally, a private storage network should be deployed to separate storage traffic from user traffic. Although this model performs better, it requires more maintenance than the earlier models.
39
Server mirroring and clustering: These techniques can be used to significantly reduce the recovery time to acceptable levels. Ideally, servers should be running live and in parallel, distributing load between them but located at different physical locations. If incremental changes are frequently synchronized between servers, then switching to the backup could be a matter of seconds, and only a few transactions may be lost (assuming there isn’t large-scale telecommunications or power disruption and staff are well briefed on what to do and what not to do in such circumstances). The increasing focus on electronic commerce and large-scale applications such as ERP means that this configuration is becoming increasingly common.
40
Server mirroring and clustering
Pros and Cons: This approach is widely used at data centers for major financial and retail institutions but is often too expensive to justify for small businesses. Server mirroring requires more infrastructure to achieve (high-speed wide area links, more routers, more firewalls, and tight management and control systems).
41
Storage Area Networks (SANs) and Optical Storage Networks (OSNs): There is increasing interest in moving mission- and business-critical data off the main network and offloading it onto a privately managed infrastructure called a Storage Area Network (SAN). Storage can be optically attached via standard high-speed interfaces such as Fibre Channel and SCSI (with optical extenders), providing a physical separation of storage from 600 meters to 10 kilometers. Servers are directly attached to this network (typically via Fibre Channel or ESCON/FICON interfaces [5]) and are also attached to the main user network. SANs may be further extended (to thousands of kilometers) via technologies such as Dense Wave Division Multiplexing (DWDM), forming optical storage networks. This allows multiple sites to share storage over reliable high-speed private links.
42
Storage Area Networks (SANs) and Optical Storage Networks (OSNs)
Pros and Cons: This approach is an excellent model for disaster recovery and storage optimization. It significantly increases complexity and cost (though storage consolidation may recover some of these costs), and it is, therefore, appropriate only for major enterprises at present. One big attraction for many large enterprises is that the whole storage infrastructure can be outsourced to a Storage Service Provider (SSP). This facilitates a very reliable DR model (some providers are currently quoting four-nines (99.99 percent) availability).
43
Quantifying availability
A% = (Operational Time / Total Time) × 100
50
Leased line services
Our continuously supervised and managed digital telecommunications network provides a permanent connection between any two points. With our service, which is also available at the enhanced quality levels Silver (99.7% availability) and Gold (99.9% availability), you can flexibly build your own communications system. Gold: 8.76 hours of downtime per year! 43.8 minutes of downtime per month!
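The downtime figures quoted for the Gold level follow directly from the availability percentage. A minimal sketch of the conversion, assuming a 365-day year and twelve equal months (the function name is illustrative):

```python
from typing import Tuple

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime(availability_percent: float) -> Tuple[float, float]:
    """Return (hours of downtime per year, minutes of downtime per month)."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    hours_per_year = unavailable_fraction * HOURS_PER_YEAR
    minutes_per_month = hours_per_year * 60.0 / 12.0
    return hours_per_year, minutes_per_month


gold_hours, gold_minutes = allowed_downtime(99.9)
silver_hours, silver_minutes = allowed_downtime(99.7)
print(f"Gold:   {gold_hours:.2f} h/year, {gold_minutes:.1f} min/month")
print(f"Silver: {silver_hours:.2f} h/year, {silver_minutes:.1f} min/month")
```

For 99.9% this reproduces the 8.76 hours per year and 43.8 minutes per month quoted above; the same arithmetic gives roughly 26.28 hours per year for the 99.7% Silver level.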
51
Mean Time Between Service Outages (MTBSO) or Mean Time Between Failures (MTBF) is the average time (expressed in hours) that a system works between service outages and is typically greater than 2,000 hours. Since modern network devices may have a short working life (typically five years), MTBF is often a predicted value, based on stress-testing systems and then forecasting availability in the future. Devices with moving mechanical parts such as disk drives often exhibit lower MTBFs than systems that use fixed components (e.g., flash memory).
52
Mean Time To Repair (MTTR) is the average time to repair systems that have failed and is usually several orders of magnitude less than MTBF. MTTR values may vary markedly, depending upon the type of system under repair and the nature of the failure. Typical values range from 30 minutes through to 3 or 4 hours. A typical MTTR for a complex system with little inherent redundancy might be several hours.
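MTBF and MTTR are commonly combined into a steady-state availability estimate, A = MTBF / (MTBF + MTTR). That relation is the standard textbook one and is assumed here, since no formula survives in the slide text. The sketch applies it to the typical values mentioned above.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# MTBF of 2,000 hours with the two ends of the typical MTTR range.
slow_repair = availability(2000.0, 4.0)   # 4-hour repair
fast_repair = availability(2000.0, 0.5)   # 30-minute repair
print(f"4 h repair:    {slow_repair * 100:.3f}%")
print(f"30 min repair: {fast_repair * 100:.3f}%")
```

With the same MTBF, cutting the repair time from four hours to thirty minutes raises the estimate from about 99.80% to about 99.98%, which is why repair time matters as much as failure rate.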
54
For a series system:
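The formula on this slide is not present in the text. For reference, the standard relation for components in series, assuming independent failures and that every component is needed for the system to work, is the product of the individual availabilities. The symbols A_i and n are notation chosen here.

```latex
A_{\text{series}} = \prod_{i=1}^{n} A_i = A_1 \cdot A_2 \cdots A_n
% Example: three components in series, each 99.9% available
A_{\text{series}} = 0.999^{3} \approx 0.9970
```

A chain is therefore always less available than its weakest element: three 99.9% components in series give only about 99.7%.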
64
Recommended reading: Network Reliability: Models, Measures and Analysis