Sun’s weak points in UE10000
4/5/2001 11.21 Partitions Review
DSD/DR is not used by customers
– Sun will not provide DSD reference sites [Giga].
– A regular system administrator cannot make DSD/DR changes; it takes a very skilled system administrator to handle them [Giga].
– Very few customers use DSD/DR in database-related production environments. DSD/DR are used more often in testing environments [Giga].
– Few customers use DSDs. Those who do say it works fine most of the time [Gartner].
Quality problems
– Terrible problems with USII last year [unable to do root cause analysis]. Some customers won’t return to Sun, but will stay in the Sun fold with Fujitsu [Giga].
– An E-cache problem does not only bring down the affected domain; it brings the whole UE10K down.
– Sun has had great difficulty designing reliable enterprise-level servers. Because of its background as a workstation vendor, it is behind in “design for reliability” technology.
– The UltraSPARC II based systems did not have ECC in cache memory, with all the reliability problems as a result. The USIII now supports ECC in level-2 cache, but Sun is still behind, as it has no chip-kill technology or DMR.
No Virtual Partitions
No Goal-based and Multi-System Workload Management
Single Points of Failure (SPOF)
HP has the lowest SPOF failure rate:
– The SPOF failure rate between partitions in Superdome (called the 'infrastructure failure rate') is lower than the infrastructure failure rate of S390 LPARs and certainly much lower than that of Sun UE10K domains.
– How can this be, when Sun quotes that the UE10K has “Complete Hardware Redundancy”?
Sun’s definition of SPOF:
– Looking carefully at the literature, “Complete Hardware Redundancy” means: a fully redundant system will always recover from a system crash by using (booting from) standby hardware.
– Therefore, this “complete hardware redundancy” is really a collection of ‘single points of failure’ by HP’s definition (the one the customer cares about).
Source: Ken Pomaranski, Hardware HA Architect
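To make the 'infrastructure failure rate' idea concrete, here is a minimal sketch of how such a rate could be estimated. It assumes constant failure rates and that every listed component is shared by all partitions, so that any one failure takes all of them down (a series model). The function names and the component names in the usage note are hypothetical; no figures from HP, IBM or Sun are used.

```python
def infrastructure_failure_rate(shared_component_fits):
    """Combined failure rate of components shared by all partitions.

    Rates are in FIT (failures per 1e9 hours). In a series model with
    constant failure rates, the rates of the shared components add up.
    """
    return sum(shared_component_fits.values())


def mtbf_hours(fit):
    """Mean time between failures, in hours, for a rate given in FIT."""
    return 1e9 / fit
```

For example, calling `infrastructure_failure_rate({"backplane": 100, "clock": 150})` with these made-up values gives a combined rate of 250 FIT, and `mtbf_hours(250)` gives 4,000,000 hours. Under this model, a component that recovers only by rebooting onto standby hardware still counts toward the sum, which is the point the slide makes.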
Does Sun really understand reliability?
From the UE10K RAS manual:
– “Sun has made the time required for a module replacement much shorter [over time]. These enhancements, coupled with improved diagnostic capabilities, have reduced the cycle time on systems, simultaneously increasing reliability and availability.”
  Isn’t reliability about ‘keeping systems running’?
– “There is currently no industry adopted means to measure MTBF. Therefore, comparisons between vendors is of questionable use.”
  How then does Sun track server reliability?
– “Each UE10K can be configured to have 100% HW redundancy.”
  Shouldn’t the UE10K then never fail?
Sun’s Customers Understand!
“Topping their list of complaints are the frequency of server crashes caused by the problem [memory], fixes that don't work and Sun's tendency to initially blame the problem on other factors before acknowledging it - often only under a nondisclosure agreement.” – Computerworld, 9/04/2000
"They treated the whole thing like a cover-up," said one user at a large utility in the Western U.S. who asked not to be named. – Computerworld, 9/04/2000
“The long-standing nature of the problem and Sun's handling of the issue raise troubling questions about the quality of Sun's hardware and support.” – Gartner Group
“Engineers have long known that memory chips can be disrupted by radiation and other environmental factors. That is why Hewlett-Packard and IBM use error-correcting code, or ECC, which detects cache errors and restores bits that were changed by mistake.” – Forbes, 11/13/2000
“Sun servers lack ECC protection. "Frankly, we just missed it. It's something we regret at this point," Shoemaker [Sun executive VP] says.” – Forbes, 11/13/2000
What else have they ‘missed’?
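The Forbes quote describes ECC as detecting errors and restoring flipped bits. The slides do not say which code HP or IBM use in their caches, so the sketch below uses the textbook Hamming(7,4) code purely as an illustration of single-bit correction. The function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are 1..7; parity bits sit at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # syndrome names the flipped bit
    if error_position:
        c[error_position - 1] ^= 1  # restore the bit that was changed
    return [c[2], c[4], c[5], c[6]], error_position
```

If one bit of a stored codeword is flipped, for instance by radiation, `hamming74_correct` recovers the original four data bits and reports which position was hit. A cache without such a code has no way to tell that the bit changed.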
Sun’s UE10K Dynamic Reconfiguration Weaknesses
Sun’s UE10K implementation of DR is not quite as dynamic as Sun would have you believe. It’s a marketing tale!
– Hot swapping I/O requires that CPU and memory also be brought down.
– Any DR activity requires that the database be shut down, making applications unavailable during the process.
– DR cannot be used in combination with memory interleaving across system boards, which reduces maximum performance. Sun customers have to choose between good system performance and DR functionality; they cannot get both at the same time!
– DR is not supported in combination with SunCluster fail-over. Because the system halts during a DR operation, SunCluster considers the system to be failing and starts a fail-over procedure to another system. Sun customers have to choose between a true multi-system, high-availability solution and the use of DR; they cannot get both at the same time!
– DR conflicts with Intimate Shared Memory (ISM), which is used by demanding applications. To improve performance, most memory-intensive applications, like databases, make use of the ISM capability in the E10000. Most applications using ISM do not allow dynamic addition or removal of their shared memory allocation. Running memory-intensive applications with ISM (like large databases) and making the most efficient use of partitions therefore prevents the use of DR.
– Deactivating or moving a system board with full memory can take 15 minutes (to back up and rearrange memory contents). All activities in the affected partition(s) have to be paused during that time! (To compensate, Sun introduced TurboDR boards with just CPUs, no memory...)
Source: John Wiltschut, BSTO Marketing
Why Sun is being defensive: Superdome vs. E10000
Sun blames HP and IBM for copying the E10000
The truth is:
– Superdome is more original than the E10000 has ever been: the E10K is an exact copy of the Cray CS6400.
– Sun is just playing catch-up with the E10000’s inferior performance, reliability and functionality.
– The E10000 is an end-of-line product based on old technology and without future expansion capabilities.
– Superdome is built as an advanced architecture based on the latest technology and with a very strong growth potential.
– Sun has never developed a high-end server by themselves.
The E10000 is COPIED by Sun (from Cray)
– The CS6400 was developed by Cray and announced in 1993. It supported up to 64 SuperSPARC processors (60 MHz) and ran CRS-OS, based on Solaris but modified by Cray.
– Most CS6400 systems used fewer than 30 CPUs, as it did not scale very well.
– In 1996 Sun purchased this technology from Cray/SGI and introduced a copy in 1997 under the name E10000.
– All basic technology was already present in the CS6400, and Sun has never added any breakthrough improvements.
Sun claims: SMP CPUs in Single Cabinet (supported with Solaris since [...])
HP Superdome supports 64 CPUs in a single system with SMP functionality. Superdome is built as an advanced architecture based on the latest technology and with a very strong growth potential. The modular packaging allows you to use only half the size for up to 32 processors. Superdome has 3 base cabinet configurations. The E10K comes in full size, even with only a few CPUs. A 48-CPU Superdome delivers 71% more performance* in a system that is only 20% wider than a 64-CPU E10000.
The reality: The Cray CS6400 (announced in 1993) was not developed by Sun, ran CRS-OS and had very limited scalability. The E10K is a copy of the CS6400 without significant breakthrough technology added by Sun.
* based on TPC benchmark with Oracle
Sun claims: Full Dynamic Partitioning (supported with Solaris since 1997)
HP is the first vendor to provide the full spectrum of partitioning: Hyperplex, nPartitions, virtual partitions and automatic resource partitioning. The different levels of partitioning can be combined as desired. nPartitions can be added and removed within an active Superdome. Virtual Partitions are dynamic at the CPU level, not just the cell level.
The reality: Sun still does not support “full” dynamic partitioning (it does not support dynamic control by applications). Dynamic System Domains (DSD) require operator intervention and usually a reboot. The use of DSD has many limitations: it cannot be combined with memory interleaving, SunCluster fail-over or Intimate Shared Memory*. Domains always have to be multiples of 4 CPUs.
* see whitepaper DSD and DR -- the true story
only hp offers the full spectrum of partitioning
(from most isolation to most flexibility)
hyperplex: hard partitions with multiple nodes
–complete hardware and software isolation
–multiple OS images
nPartitions (new!): hard partitions within a node
–hardware isolation per cell
–complete software isolation
–multiple OS images
virtual partitions: virtual partitions within hard partitions
–software isolation
–multiple OS images
–dynamic resource allocation
resource partitions: prm (Process Resource Mgr), hp-ux wlm (Workload Manager)
–automatic goal-based resource allocation via set slo’s
–1 OS image
...Sun can’t match
suncluster
–no high-speed interconnect
–8 node max.
–doesn’t work with sun’s dr
dynamic system domains (dsd)
–require reboot in most situations
–difficult to modify configuration (sun experts are usually needed)
solaris resource manager (srm)
–expensive
–doesn’t manage i/o
–not goal-based like hp-ux wlm
Sun claims: Automated DR* / Hot-swap CPU + Memory (supported with Solaris since 2000/1997)
HP-UX can dynamically deallocate processors and memory with DPR and DMR (dynamic processor and memory resilience) in case of failures. This is a fully automatic process. Cell boards can be added and removed in an active Superdome. HP has been using error checking and correcting in cache memory to prevent most processor and system failures. Sun hasn’t in the US II.
The reality: Automated DR is nothing more than scripting of an otherwise manual cell board replacement process. Dynamic Reconfiguration (DR) has many limitations (similar to those of DSDs**). If a processor fails, the domain crashes and a reboot is required. This is neither automatic nor dynamic.
* DR = Dynamic Reconfiguration
** see whitepaper DSD and DR -- the true story
Sun claims: Interdomain Networking (supported with Solaris since 1999)
HP supports other high-speed communication links like Hyperfabric, Fibre Channel etc., and recommends not using IDN because of the lack of isolation between partitions.
The reality: Interdomain networking (IDN) uses shared memory, and the connected domains are not isolated from failures in the other domains. As IDN violates hardware isolation (the main reason for partitioning), it increases the risk of downtime. Sun does not support a high-speed interconnect like Hyperfabric for high-bandwidth data transfer between nodes and partitions.
Sun claims: Clustered File Systems (supported with Solaris since December 2000)
HP supports multiple file system options depending on customer needs. CIFS/9000 is a global file system supporting multi-platform, multi-OS file systems. MC/ServiceGuard provides a superior, mature solution with support for up to 16 nodes and hundreds of applications, and has more than 45,000 installations. Hyperplex supports hundreds of clustered nodes.
The reality: This was promised for SunCluster 3.0 but was never delivered (confirmed during the press conference). Sun tries to get around it by using marketing terms like ‘cluster-aware file system’ and ‘cluster file service’. Sun’s clustering solutions have always been behind, and customers have always preferred other solutions. Even now SunCluster 3.0 only supports 8 nodes and is focused on Solaris only.
Sun claims: Global Network Services (supported with Solaris since December 2000)
HP’s MC/ServiceGuard already provides flexible IP addresses so that applications can fail over to other nodes in a cluster without any problem. HP is focused on supporting multi-platform, multi-OS environments based on customer demand.
The reality: This is mainly about abstracting an IP service from a network interface, such that applications can be moved in a cluster (HA fail-over). To speak in Sun terms: nothing new... Sun is focused on Solaris-only solutions with no support for multi-OS.
What Sun does not say...
Reliability
–Sun’s current systems do not have Error Checking and Correcting, Dynamic Processor and Memory Resilience or Chip-Kill technology.
–Analysts and press have reported serious problems with Sun E10000 systems at customer sites. See the Forbes and Gartner articles.
Performance
–The US II processor lacks performance compared to HP’s current offerings, resulting in much lower system performance. Even the US III will barely meet the current PA-RISC performance levels.
I/O bandwidth
–Today’s applications, like broadband and data warehousing, require high I/O bandwidth, which Sun does not deliver.
Investment protection
–Current Sun products are basically end-of-life. The US III requires new boxes and runs only the Solaris 8 OS.
Multi-platform support
–Sun’s vision is limited to Solaris/SPARC only, not multi-platform environments.
Sun’s systems are lagging in all these areas.
Who is really playing Catch-Up?
leadership: performance, flexibility, availability
[Chart: hp superdome vs. sun e10000, each item rated leadership / limited / weakness. Rows: performance/scalability (CPU, memory, I/O, tpm); flexibility (hyperplex, nPartitions, virtual partitions, resource partitions, utility pricing, iCOD, IA-64, Multi-OS); availability (multi-system, single system); investment protection. Surviving tpm figures: K+, 115K/156K.]
Sun’s Dark Secret
“Sun Screen: Sun Microsystems’ servers have been crashing for more than a year. Sun has kept the flaw secret--and hasn’t yet fixed it.” 11/13/2000
Sun and HP Reliability Comparisons
Why HP can fulfill the customer needs better than Sun
HP understands what available systems really mean. Availability is the BASE upon which all other features are built:
– High Quality / Resilient Hardware (hardware that keeps running). Nothing matters without this!
– Hard Partitions
– Virtual Partitions
– Flexible Compute Management
– Multi-system HA
– Event Mgmt
Reliability Comparison
Feature                               HP     UE10K / SUNFIRE
Internal cache error correction       YES    NO
Dynamic processor resilience          YES    SOME
Chip kill protection                  YES    NO
HW scrubbing                          YES    NO
Dynamic memory resilience             YES    NO
PCI bus error isolation               YES    NO
Full PCI OLAR                         YES    NO
Address bus ECC                       YES    NO
Redundant DC / DC converters          YES    NO
Full stuck-at bit correction          YES    NO
Interconnect reliability experience   YES    NO
Row groups on the slide: CPU, MEMORY, IO, BACKPLANE
Reliability Comparison (2)
SOLUTION LEVEL
Feature                               HP        UE10K / SUNFIRE
5 nines solution availability         YES       NO
Data center wide HA solutions         YES       NO
Customer care for quality issues      YES (*)   NO
Proven domain isolation               YES       NO
Solution level verification           YES       ??
‘Cosmic ray’ tolerance                YES       NO
HP projects that the above reliability ‘oversights’ result in Sun systems with 2-4x greater failure rates than HP systems. This has been proven by field experience.
(*) Rather than blame customers for quality problems, HP closely tracks field data and works PROACTIVELY to fix potential field quality problems.
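For readers unfamiliar with the term, "5 nines" means 99.999% availability. The short sketch below shows the arithmetic behind the downtime that figure allows per year. It assumes a 365-day year; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability):
    """Maximum yearly downtime, in minutes, for an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR
```

With `availability=0.99999` this comes to roughly 5.3 minutes of downtime per year, while "3 nines" (0.999) allows close to 9 hours.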