1 The x86 Server Platform
.. Resistance is futile…. Dec 6, 2004

2 Server shipments – Total vs x86

3 Market Share: Servers, United States, 2Q04
United States: Vendor Revenue by Operating System (Millions of Dollars)

            2Q03     3Q03     4Q03     1Q04     2Q04   Share 2Q03  Share 2Q04  Growth 2Q03-2Q04  Growth 1Q04-2Q04
Windows  1,534.1  1,692.3  1,671.6  1,645.6  1,665.5       34.79%      36.18%              8.6%              1.2%
Unix     1,622.6  1,474.6  1,554.1  1,374.2  1,471.9       36.79%      31.98%             -9.3%              7.1%
Others     820.2    823.7  1,142.4    897.2    852.6       18.60%      18.52%              3.9%             -5.0%
Linux      433.2    497.3    552.5    555.0    613.2        9.82%      13.32%             41.5%             10.5%
Total    4,410.2  4,487.9  4,920.7  4,472.1  4,603.1      100.00%     100.00%              4.4%              2.9%

Michael McLaughlin, "Market Share: Servers, United States, 2Q04", 7 October 2004, Gartner

4 x86 Platform CPUs
Intel
  Xeon MP – Gallatin (future is Potomac)
  Xeon SP/DP – EM64T – Nocona
  Itanium II MP – Madison (future is Montecito)
AMD
  Opteron

5 Gallatin – MP
130 nm
3 GHz
4 MB L3 cache
FSB – MHz

6 ES7000 – 32 Gallatins

7 Nocona – Single Processor with EM64T
90 nm
Clock Speed – GHz
L3 – 4 MB
FSB – 800 MHz

8 Itanium II – Madison
130 nm
9 MB L3 cache
1.6 GHz
FSB – 400 MHz

11 STOP – Why Multi-Core? … and while we're at it, why Multi-Threading?
It's all about the balance of
  Silicon real estate
  Compiler technology
  Cost
  Power
… to meet the constant pressure to double performance every 18 months

12 Memory Latency vs CPU Speed
[Chart: microprocessor on-chip clock frequency (GHz) and commodity DRAM access frequency ((10⁻⁹ sec)⁻¹), log scale 0.01–10, versus production year 1990–2010]

13 Processor Architecture
When latency → 0 and bandwidth → ∞, we will have the perfect CPU
A great deal of innovation has centered around approximating this perfect world:
  CISC
  CPU Cache
  RISC
  EPIC
  Multi-Threading
  Multiple Cores

14 Complex Instruction Set Computer
Hardware implements assembler instructions: for MULT A, B, the hardware loads registers, multiplies, and stores the result
Multiple clocks are needed for one instruction
RAM requirements are relatively small
Compilers translate high-level languages down to assembler instructions – Von Neumann hardware

15 CPU Cache
When CPU speeds started to increase, memory latency emerged as a bottleneck
CPU caches were used to keep local references "close" to the CPU
For SMP systems, memory banks were more than a clock away
It is not uncommon today to find 3 orders of magnitude between the fastest and slowest memory latency
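
A minimal C sketch of the effect described on this slide (array sizes, iteration count, and the pointer-chasing technique are illustrative choices of mine, not taken from the presentation): it times dependent loads through a working set that fits in cache and through one that does not; on typical hardware the per-access cost differs by an order of magnitude or more.

```c
/* Sketch: average access time for a cache-resident vs. memory-resident
 * working set.  Sizes and iteration counts are arbitrary illustrative choices. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_access(size_t n_elems, long iters)
{
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) { perror("malloc"); exit(1); }

    /* Build a single random cycle (Sattolo's algorithm) so the hardware
     * prefetcher cannot guess the next address. */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (((size_t)rand() << 15) ^ (size_t)rand()) % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    size_t pos = 0;
    clock_t t0 = clock();
    for (long k = 0; k < iters; k++)
        pos = next[pos];                 /* each load depends on the previous one */
    clock_t t1 = clock();

    volatile size_t sink = pos; (void)sink;  /* keep the loop from being optimized away */
    free(next);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)iters * 1e9;
}

int main(void)
{
    long iters = 20 * 1000 * 1000;
    printf("cache-resident set (~32 KB):  %6.1f ns/access\n",
           ns_per_access(4 * 1024, iters));
    printf("memory-resident set (~64 MB): %6.1f ns/access\n",
           ns_per_access(8 * 1024 * 1024, iters));
    return 0;
}
```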

16 Reduced Instruction Set Computer
Hardware is simplified – fewer transistors are needed for the full instruction set
RAM requirements are higher, to store intermediate results and more code
Compilers are more complex
Clock speeds increase because instructions are simpler
Deterministic, simple instructions allow pipelining
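
To make the CISC/RISC contrast of the last two slides concrete, here is a toy C sketch (the instruction repertoire, register count, and memory layout are invented for illustration): the CISC-style MULT reads both operands from memory, multiplies, and writes the result back in one instruction, while the RISC-style equivalent is a sequence of simple load, multiply, and store steps with registers holding the intermediates.

```c
/* Toy interpreter contrasting a CISC-style memory-to-memory multiply with
 * the equivalent RISC-style load/store sequence.  Everything here is an
 * invented illustration, not a real instruction set. */
#include <stdio.h>

static long mem[16];   /* toy data memory   */
static long reg[4];    /* toy register file */

/* CISC style: one instruction does everything, which is why it needs
 * several internal steps (clock cycles) in hardware. */
static void cisc_mult(int dst, int src)        /* MULT A, B  ->  A = A * B */
{
    long a = mem[dst];        /* step 1: fetch first operand from memory  */
    long b = mem[src];        /* step 2: fetch second operand from memory */
    mem[dst] = a * b;         /* step 3: multiply and store the result    */
}

/* RISC style: only simple, single-purpose instructions; the compiler
 * strings them together and keeps intermediates in registers. */
static void risc_load (int r, int addr)          { reg[r] = mem[addr]; }
static void risc_mul  (int rd, int rs1, int rs2) { reg[rd] = reg[rs1] * reg[rs2]; }
static void risc_store(int r, int addr)          { mem[addr] = reg[r]; }

int main(void)
{
    mem[0] = 6; mem[1] = 7;               /* A = 6, B = 7 at toy addresses 0 and 1 */
    cisc_mult(0, 1);                      /* CISC:  MULT A, B       */
    printf("CISC result: %ld\n", mem[0]);

    mem[0] = 6; mem[1] = 7;
    risc_load(0, 0);                      /* RISC:  LD  r0, A       */
    risc_load(1, 1);                      /*        LD  r1, B       */
    risc_mul (2, 0, 1);                   /*        MUL r2, r0, r1  */
    risc_store(2, 0);                     /*        ST  r2, A       */
    printf("RISC result: %ld\n", mem[0]);
    return 0;
}
```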

17 Pipelining
[Diagram: pipeline stage utilization rising from 25% busy through 60% and 80% toward 100% busy – higher clock speeds!]
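
To put rough numbers on the utilization labels in the diagram, a small sketch assuming a 4-stage in-order pipeline fed with back-to-back independent instructions (both assumptions are mine, not from the slide): a single instruction keeps only one of the four stages busy at a time, while a continuously full pipeline approaches 100% utilization.

```c
/* Sketch: stage utilization of a simple in-order pipeline.  The 4-stage
 * depth and the instruction counts are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const int stages = 4;
    const int counts[] = { 1, 4, 12, 1000 };   /* back-to-back instructions issued */

    for (unsigned i = 0; i < sizeof counts / sizeof counts[0]; i++) {
        int n = counts[i];
        /* n instructions drain in (n + stages - 1) cycles and occupy
         * n * stages stage-cycles out of stages * (n + stages - 1) available. */
        double util = (double)n / (double)(n + stages - 1);
        printf("%4d instructions: %5.1f%% of stage-cycles busy\n", n, 100.0 * util);
    }
    /* One instruction: 25% busy.  Keeping the pipeline full pushes utilization
     * toward 100%, which is what lets clock speeds rise without idle hardware. */
    return 0;
}
```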

18 Branch Prediction
While processing in parallel, branches occur
Branch prediction is used to guess which path a branch will take, so the pipeline can keep being filled
If the prediction is incorrect, the pipeline is "dead" and the CPU stalls
Statistics
  10%-20% of instructions are branches
  Predictions are incorrect about 10% of the time
As the pipeline deepens, the probability of a miss increases and more cycles are discarded
  80-deep pipeline / 20% branches / 10% miss => ~80% chance of a miss somewhere in flight, with a penalty of up to 80 cycles
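
The sketch below reproduces the slide's arithmetic, assuming mispredicted branches are spread independently through the instruction stream: with about 20% branches and a 10% miss rate, roughly 2% of instructions flush the pipeline, and the chance that an 80-deep pipeline holds at least one of them is about 80%.

```c
/* Sketch of the slide's arithmetic: probability that a pipeline of a given
 * depth contains at least one mispredicted branch, assuming independence. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double branch_rate = 0.20;  /* fraction of instructions that are branches   */
    const double miss_rate   = 0.10;  /* fraction of branch predictions that are wrong */
    const int    depths[]    = { 10, 20, 40, 80 };

    double p_bad = branch_rate * miss_rate;   /* ~2% of instructions cause a flush */

    for (unsigned i = 0; i < sizeof depths / sizeof depths[0]; i++) {
        int d = depths[i];
        double p_flush = 1.0 - pow(1.0 - p_bad, d);
        printf("depth %2d: %4.0f%% chance of a flush, penalty up to %d cycles\n",
               d, 100.0 * p_flush, d);
    }
    return 0;
}
```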

19 Itanium II EPIC Instruction Set – Explicitly Parallel Instruction Computing
The compiler can indicate code that can be executed in parallel
Both branches are pipelined
No lost cycles due to misprediction
The pipeline can be deeper
Complexity continues to move into the compiler
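
The "both branches are pipelined" point is essentially predication (if-conversion): the compiler emits both sides of a condition plus a select, so there is no control transfer left to mispredict. A minimal sketch of the idea in plain C follows; an Itanium compiler would use predicate registers for this, while the arithmetic select below is only an illustration.

```c
/* Sketch of if-conversion (predication): replace a data-dependent branch
 * with straight-line code that computes both sides and selects the result. */
#include <stdio.h>

/* Branchy version: the CPU must predict the comparison. */
static int clamp_branchy(int x, int limit)
{
    if (x > limit)
        return limit;
    return x;
}

/* If-converted version: both "sides" are computed, a predicate selects one.
 * No control transfer, so nothing to mispredict. */
static int clamp_predicated(int x, int limit)
{
    int take_limit = (x > limit);                        /* the predicate */
    return take_limit * limit + (1 - take_limit) * x;    /* select        */
}

int main(void)
{
    for (int x = 95; x <= 105; x++)
        printf("x=%3d  branchy=%3d  predicated=%3d\n",
               x, clamp_branchy(x, 100), clamp_predicated(x, 100));
    return 0;
}
```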

20 Multi-Threading

22 Multiple Cores
Fabrication sizes continue to diminish
The additional real estate has been used to put more and more memory on the die
Multi-core technology provides a new way to exploit the additional space
Clock rates cannot continue to climb, due to the excessive heat: P = C × V² × f (see the sketch below)
  C – switch capacitance
  V – supply voltage
  f – clock frequency
Multiple cores are the next step to providing faster execution times for applications
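
Plugging illustrative numbers into P = C × V² × f shows why adding a second core costs less power than doubling the clock; the capacitance, voltage, and frequency values below are invented, and the assumption that a higher clock needs a higher supply voltage is mine.

```c
/* Sketch: why doubling cores is cheaper in power than doubling frequency,
 * using the slide's P = C * V^2 * f.  All numbers are invented. */
#include <stdio.h>

static double power(double c, double v, double f)
{
    return c * v * v * f;   /* dynamic power: P = C * V^2 * f */
}

int main(void)
{
    double C = 1.0;   /* switch capacitance (arbitrary units) */

    /* Baseline: one core at 2 GHz, 1.2 V */
    double base = power(C, 1.2, 2.0);

    /* Option A: chase frequency - one core at 4 GHz, assumed here to need 1.4 V */
    double fast = power(C, 1.4, 4.0);

    /* Option B: add a core - two cores at 2 GHz, 1.2 V */
    double dual = 2.0 * power(C, 1.2, 2.0);

    printf("1 core  @ 2 GHz : %.2f (baseline)\n", base);
    printf("1 core  @ 4 GHz : %.2f (%.1fx baseline)\n", fast, fast / base);
    printf("2 cores @ 2 GHz : %.2f (%.1fx baseline)\n", dual, dual / base);
    return 0;
}
```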

23 (End of 2005?)

30 AMD Opteron 800 Series
130 nm
Clock Speed – 1.4–2.4 GHz
L2 – 1 MB
6.4 GB/s HyperTransport

31 Architectural Comparison
[Diagram: four Opterons connected by 6.4 GB/s HyperTransport™ links, each with its own 144-bit DDR memory and PCI-X bridges, compared with a four-Xeon design built around an SNC, Memory Address Buffers, I/O Hubs, and PCI-X bridges]

32 Mapping Workloads onto Architecture
Consider a dichotomy of workloads:
Large Memory Model – this needs a large, single system image and a large amount of coherent memory
  Database apps – SQL Server / Oracle
  Business Intelligence – Data Warehousing + Analytics
  Memory-resident databases
  64-bit architectures allow memory addressability above 1 TB
Small/Medium Memory Model – this can be cost-effective in workloads that do not require extensive shared memory/state
  Stateless Applications and Web Services
  Web Servers
  Clusters of systems for parallelized applications and grids

33 Large Server Vendors
Intel Announcement (Nov 19): Otellini said product development, marketing and software efforts (for Itanium) will all now be aimed at "greater than four-way systems". He also said, "The mainframe isn't dead. That's where I'd like to push Itanium over time."
The size of the SMP is affected by Intel's chip set support for coherent memory
OEM Vendors (Unisys, HP, SGI, Fujitsu, IBM)
  Each has a unique "chip set" to build basic four-ways into large SMP systems
IBM has Power5, which is a direct competitor
Intel 32-bit and EM64T
  This could emerge as the flagship product

34 Where Are We Going?
Since the early CISC computers, we have moved more and more of the complexity out to the compiler to achieve parallelism and fully exploit the silicon "real estate"
The power requirements, along with the smaller fabrication sizes, have pushed the CPU vendors to exploit multiple cores
The key to performance for these future machines will be the application's ability to exploit parallelism

