Presentation on theme: "By Mansur ARSLAN.  What is Nehalem?  Hyperthreading and MultiCore  Nehalem Cache Structure  Nehalem Intercomm Architecture : Quick Path  Inside Nehalem."— Presentation transcript:

1 By Mansur ARSLAN

2  What is Nehalem?  Hyper-Threading and Multi-Core  Nehalem Cache Structure  Nehalem Interconnect Architecture: QuickPath  Inside Nehalem Processing  Front-end  Branch prediction  Loop Stream Decoder  Power & Energy Management  Turbo Boost  Power gates

3

4  Nehalem is the codename for an Intel processor microarchitecture, the successor to the Core microarchitecture.  The first processor released on the Nehalem architecture was the desktop Core i7, launched in November 2008.  The microarchitecture is named after the Nehalem Native American nation in Oregon.

5  "Tick-Tock" is a model Intel has followed since 2007, alternating process-technology shrinks with microarchitectural changes.  Every "tick" is a shrink of the process technology of the previous microarchitecture.  Every "tock" is a new microarchitecture.  Each year is expected to bring one tick or one tock.

6

7  Hyper-threading works by duplicating the sections of the processor that store the architectural state, such as  Instruction flag registers  Interrupt mask registers  Memory-management-unit registers  Status registers  but not duplicating the main execution resources.  The OS sees one core as two processors, so when one thread stalls (e.g., on a cache miss) another thread can run on the same execution resources.  Performance is better than without HT,  but not as good as multiple full CPUs, despite the added cost.
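
The stall-hiding idea above can be illustrated with a toy issue model: one core, one issue slot per cycle, two threads whose loads sometimes miss. This is a deliberately simplified sketch (the miss penalty and op mix are made up), not a model of Nehalem's actual pipeline:

```python
MISS_LAT = 10  # assumed cache-miss penalty in cycles (illustrative)

def run(threads, smt):
    """Cycles to retire every op; 'c' = 1-cycle op, 'm' = op that
    blocks its thread for MISS_LAT cycles after issuing."""
    pcs = [0] * len(threads)       # next op index per thread
    ready = [0] * len(threads)     # earliest cycle each thread may issue
    cycle = 0

    def done(i):
        return pcs[i] >= len(threads[i])

    while not all(done(i) for i in range(len(threads))):
        if smt:
            # SMT: any ready thread may use the issue slot this cycle
            candidates = [i for i in range(len(threads)) if not done(i)]
        else:
            # no SMT: run one thread to completion before the next
            candidates = [next(i for i in range(len(threads)) if not done(i))]
        for i in candidates:
            if ready[i] <= cycle:
                if threads[i][pcs[i]] == 'm':
                    ready[i] = cycle + MISS_LAT  # thread stalls on the miss
                pcs[i] += 1
                break                            # one issue slot per cycle
        cycle += 1
    return cycle

A = ['c', 'm', 'c', 'm', 'c']
B = ['c', 'm', 'c', 'm', 'c']
print("without SMT:", run([A, B], smt=False), "cycles")
print("with SMT:   ", run([A, B], smt=True), "cycles")
```

With SMT the second thread fills cycles that would otherwise be wasted waiting on misses, so the combined workload finishes sooner, while a single thread gains nothing — matching the slide's "better than non-HT, but not as good as two CPUs."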

8  A multi-core processor is an integrated circuit on which two or more individual processors (cores) are combined.  The cores are typically integrated onto a single die (known as a chip multiprocessor, or CMP), or they may be placed on multiple dies in a single chip package.  Better than multiple discrete CPUs for cache coherence and inter-core communication.  Still does not solve the stall inefficiency.

9  Nehalem has both multi-core technology and enhanced Hyper-threading,  so it gets the advantages of both,  while reducing the flaws of each through enhancement methods we will see.

10

11

12  The L1 caches are:  32 KB for instructions and 32 KB for data  These are really fast, each backed by its own TLB  The L2 cache is:  Slower than L1, but still low-latency  At 256 KB it is smaller than in previous designs. Why?

13  The results of Nehalem cache/memory latency tests show that this solution is efficient:  As can be seen, Nehalem's L1 cache is a little slower, but to compensate it keeps the L2 cache smaller than Penryn's, which lowers L2 latency considerably.  And it has an 8 MB L3 cache with modest latency.

14  The L3 cache is inclusive, i.e. the data stored in the L1 and L2 caches is duplicated in the L3 cache.  Thus inter-core communication is unnecessary on a cache miss.  The L3 cache has flags showing which core the data came from.  If a core modifies data in the L3 cache that initially belonged to other core(s), the L1/L2 caches of those cores are simply updated, using the MESIF protocol.
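
The benefit of inclusivity plus per-core flags can be sketched as a toy directory: because L3 holds a superset of every core's L1/L2, a line absent from L3 cannot be in any private cache, so a miss needs no snooping at all. The structure below is a hypothetical illustration, not Intel's implementation:

```python
class InclusiveL3:
    """Toy inclusive L3 with per-core 'valid' flags per line."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.lines = {}          # addr -> set of core ids that may hold it

    def core_fill(self, core, addr):
        """A core's L1/L2 fill also records the line (and owner) in L3."""
        self.lines.setdefault(addr, set()).add(core)

    def cores_to_snoop(self, requester, addr):
        """Which other cores must be checked on this core's miss?"""
        if addr not in self.lines:
            return set()         # inclusivity: no private cache has it
        return self.lines[addr] - {requester}

l3 = InclusiveL3(n_cores=4)
l3.core_fill(0, 0x1000)
l3.core_fill(2, 0x1000)
print(l3.cores_to_snoop(1, 0x1000))   # cores 0 and 2 may hold the line
print(l3.cores_to_snoop(1, 0x2000))   # empty set: no snoop traffic needed
```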

15  What is the MESIF protocol?  Every cache line is marked with one of the states: Modified, Exclusive, Shared, Invalid, or Forward.  Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main-memory state. The write-back changes the line to the Exclusive state.

16  Exclusive: The cache line is present only in the current cache, but is clean. It may be changed to the Shared state at any time, in response to a read request, or to the Modified state when written to.  Shared: The cache line may be stored in other caches of the machine and is clean; it matches main memory. The line may be discarded (changed to the Invalid state) at will.  Invalid: The cache line is invalid.  Forward: The cache should act as the designated responder for any requests for the given line.
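
The transitions named on these two slides can be collected into a small state table. This is a minimal sketch covering only the transitions described above (the handling of the Forward state, in particular, is simplified), not the full Intel protocol:

```python
# One cache's view of a single line under a reduced MESIF state machine.
M, E, S, I, F = "Modified", "Exclusive", "Shared", "Invalid", "Forward"

TRANSITIONS = {
    (M, "write_back"):  E,  # dirty data flushed to memory -> clean, still exclusive
    (E, "remote_read"): S,  # another cache read the line -> shared
    (E, "local_write"): M,  # we modified our exclusive copy -> dirty
    (S, "discard"):     I,  # a clean shared line may be dropped at will
    (S, "remote_read"): S,  # still shared
    (F, "remote_read"): S,  # simplification: responder hands off F, keeps a shared copy
}

def next_state(state, event):
    """Events not listed above leave the line's state unchanged."""
    return TRANSITIONS.get((state, event), state)

# A line we hold exclusively, then write, then flush back:
line = E
for ev in ("local_write", "write_back"):
    line = next_state(line, ev)
print(line)   # Exclusive again after the write-back
```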

17  A prefetcher is a mechanism that observes memory access patterns and tries to anticipate which data will be needed several cycles in advance.  In the Conroe architecture Intel used hardware prefetchers that increased performance in desktop applications,  but they cost performance on servers and had to be disabled there.  Nehalem (and actually Core 2 as well) uses prefetch only to bring data and instructions into L1 and L2.

18

19  QuickPath Architecture: a platform architecture that provides high-speed connections between microprocessors and external memory, and between microprocessors and the I/O hub.  Its main concepts are  Scalable shared memory  Integrated memory controllers  The QuickPath Interconnect (QPI)

20

21  Instead of a single shared pool of memory connected to all processors through FSBs and memory controller hubs:  Each processor will have its own dedicated memory that it accesses directly through an Integrated Memory Controller.  Integrating the memory controller into the silicon die improves memory access latency  Available memory bandwidth scales with the number of processors added

22  If a processor needs to access the dedicated memory of another processor, it can do so through a high-speed QPI that links all the processors.

23  QPI is point-to-point: there is no single bus that all processors must share and contend for to reach memory and I/O.  This improves scalability by eliminating competition between processors for bus bandwidth.  Available memory bandwidth also scales with the number of processors added.

24  Let’s Remember The Die Structure

25  QPI is specified as a five-layer architecture, with separate physical, link, routing, transport, and protocol layers.  Physical layer: Transmits a 20-bit phit on a single clock edge across 20 lanes when all 20 are available, or across 10 or 5 lanes when the link is reconfigured after a failure.  Link layer: Each 80-bit flit consists of 4 × 20-bit phits and carries an 8-bit CRC generated by the link-layer transmitter. If the link-layer receiver detects a CRC error, it notifies the transmitter via a flit on the return link, and the transmitter resends the flit.

26  Routing layer: Sends a 72-bit unit consisting of an 8-bit header and a 64-bit payload; the header carries the destination and the message type.  Transport layer: Sends and receives data across the QPI network from peers on other devices that may not be directly connected (i.e., the data may have been routed through an intervening device).  Protocol layer: Sends and receives packets on behalf of the device; a typical packet is a cache line. The protocol layer also participates in cache-coherency maintenance by sending and receiving coherency messages.
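
The link layer's check-and-resend handshake can be sketched as an 8-bit CRC over a 72-bit payload, giving the 80-bit flit described above. The polynomial here (x^8 + x^2 + x + 1, i.e. 0x07) is a common generic CRC-8 chosen for illustration — it is not necessarily the polynomial QPI actually specifies:

```python
def crc8(data, poly=0x07):
    """Bitwise CRC-8 over a byte string (assumed polynomial 0x07)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def send_flit(payload):
    """Transmitter appends the CRC: 9-byte payload -> 10-byte (80-bit) flit."""
    return payload + bytes([crc8(payload)])

def receive_flit(flit):
    """Receiver recomputes the CRC; a mismatch would trigger a resend."""
    payload, check = flit[:-1], flit[-1]
    if crc8(payload) != check:
        return None                     # notify transmitter, request resend
    return payload

good = send_flit(b"\x12\x34\x56\x78\x9a\xbc\xde\xf0\x11")  # 72-bit payload
bad = bytes([good[0] ^ 0x01]) + good[1:]                   # flip one bit in flight
print(receive_flit(good) is not None)  # True: clean flit accepted
print(receive_flit(bad) is None)       # True: corruption detected
```

Any CRC-8 with more than one polynomial term catches every single-bit error, which is why the flipped bit above is always detected.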

27

28  The front-end fetches and predecodes instructions.  These instructions then go into branch prediction.

29  If the branch predictor predicts the branch is taken,  the branch target buffer (BTB) is responsible for predicting the target address.  Nehalem uses a 2-level BTB, most probably with the same branch-prediction algorithm at both levels:  L1 could hold a smaller but faster history (e.g., 256-512 entries)  L2 could hold a bigger but slower one (e.g., 2-8K target RIPs) for more detailed branch-target prediction, especially for large-footprint applications.
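
The two-level lookup the slide speculates about can be sketched as a small fast table backed by a larger slower one. The sizes, latencies, and eviction policy below are illustrative guesses, not Intel's documented design:

```python
class TwoLevelBTB:
    """Toy two-level branch target buffer: tiny fast L1, larger slower L2."""

    def __init__(self, l1_size=256, l2_size=2048):
        self.l1, self.l2 = {}, {}
        self.l1_size, self.l2_size = l1_size, l2_size

    def update(self, branch_pc, target):
        """Record a taken branch's target in both levels."""
        for table, cap in ((self.l1, self.l1_size), (self.l2, self.l2_size)):
            if branch_pc not in table and len(table) >= cap:
                table.pop(next(iter(table)))   # evict oldest entry (crude FIFO)
            table[branch_pc] = target

    def predict(self, branch_pc):
        """Return (target, latency): an L1 hit is cheapest, then L2."""
        if branch_pc in self.l1:
            return self.l1[branch_pc], 1       # fast path
        if branch_pc in self.l2:
            return self.l2[branch_pc], 3       # slower, but still a hit
        return None, 0                          # miss: no target known

btb = TwoLevelBTB(l1_size=2, l2_size=4)        # shrunk so eviction shows
for pc, tgt in [(0x10, 0x40), (0x20, 0x50), (0x30, 0x60)]:
    btb.update(pc, tgt)
print(btb.predict(0x30))   # (0x60, 1): still resident in the tiny L1
print(btb.predict(0x10))   # (0x40, 3): evicted from L1, found in larger L2
```

This is the point of the two-level design: a large-footprint program overflows the small L1, but its branch targets survive in L2 instead of being lost outright.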

30  After the branch-prediction phase comes the instruction queue, which performs alignment and fusion.  Macro-op fusion can fuse certain pairs of instructions into one complex instruction (e.g., a comparison followed by a conditional branch, as in Core 2, but covering many more cases).  There are 3 simple decoders and 1 complex decoder,  so up to 4 instructions can be decoded in one cycle.
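
The fusion step can be sketched as a pass over the instruction stream that merges qualifying adjacent pairs before decode. The fusable-pair set below is a stand-in for illustration — Nehalem fuses more combinations than Core 2, but the exact set is not modeled here:

```python
# Assumed fusable pairs: a compare or test followed by a conditional jump.
FUSABLE = {("cmp", "jcc"), ("test", "jcc")}

def fuse(instructions):
    """Merge adjacent fusable pairs so they occupy one decode slot."""
    out, i = [], 0
    while i < len(instructions):
        if i + 1 < len(instructions) and \
           (instructions[i], instructions[i + 1]) in FUSABLE:
            out.append(instructions[i] + "+" + instructions[i + 1])
            i += 2                     # two instructions consumed, one op emitted
        else:
            out.append(instructions[i])
            i += 1
    return out

stream = ["mov", "cmp", "jcc", "add", "test", "jcc"]
print(fuse(stream))   # ['mov', 'cmp+jcc', 'add', 'test+jcc']
```

Six instructions become four ops, so the same 4-wide decode delivers more work per cycle — which is the payoff of fusion.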

31  Better performance due to the new placement of the Loop Stream Detector:  if a loop is detected, branch prediction, instruction fetch, and decoding are all disabled,  and the buffer now holds 28 micro-ops instead of 18 instructions.  There is also a Return Stack Buffer, responsible for correctly predicting functions' return addresses.

32

33

34

35

36  Source: Consists of doped silicon, i.e. silicon containing impurities that lower its resistance.  Drain: Doped with impurities in the same way as the source; a transistor is completely symmetrical, meaning current can flow from source to drain or from drain to source.  Channel: The region where current flows when the transistor is in the 'on' state.  Silicon dioxide: A good insulator. For a gate dielectric, a thin silicon dioxide layer is desirable for high performance, but the thinner the layer, the higher the leakage.  Old 65 nm gates

37  High-k metal 45 nm gates  A high-k material  has good insulating properties  and a high dielectric constant k (hence "high-k"), strengthening the field effect between the gate and channel.  Also, because high-k layers can be thicker than silicon dioxide, while retaining the same desirable properties, they greatly reduce leakage.

38  Jeff Casazza (Intel), First the Tick, Now the Tock: Intel Microarchitecture (Nehalem), 2009  George Alfs, Nick Knupfler, Intel Core Microarchitecture White Paper, 2007  Robert Chau, Suman Datta, Mark Doczy, Brian Doyle, Jack Kavalieros and Matthew Metz, High-K/Metal-Gate Stack and its MOSFET Characteristics, IEEE Electron Device Letters, Vol. 25, June 2004  Intel QuickPath Architecture White Paper  Intel High-K/Metal Gate Transistor Glossary of Terms  Intel 45-nm Fact Sheet  Intel Turbo Boost Technology in Intel Core White Paper  www.intel.com/technology/architecture-silicon/45nm-core2/  www.intel.com/pressroom/kits/45nm  spectrum.ieee.org/semiconductors/design/the-highk-solution  www.pcper.com/article.php?aid=608  www.xbitlabs.com/articles/cpu/display/nehalem-microarchitecture_3.html

39  www.hardwarezone.com/articles  en.expreview.com  www.notebookcheck.net/  www.3dnews.ru  www.xcpus.com  www.dvhardware.net/article29268.htm  www.intel.com/business/resources/  en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)  www.tomshardware.com/reviews/Intel-i7-nehalem-cpu  chip-architect.com/news/Shanghai_Nehalem.jpg  www.xbitlabs.com/articles/cpu/display/nehalem-microarchitecture_5.html  www.realworldtech.com/page.cfm?ArticleID=RWT040208182719  www.intel.com/technology/platform-technology/hyper-threading/index.htm  en.wikipedia.org/wiki/Hyper-threading  en.wikipedia.org/wiki/Architectural_state  en.wikipedia.org/wiki/Multi-core_(computing)  en.wikipedia.org/wiki/Intel_QuickPath_Interconnect

40  www.intel.com/technology/quickpath/  theinteli7.com/2009/02/intel-processors/quick-path-technology-adopted-in-core-i7/  en.wikipedia.org/wiki/Front-side_bus  www.intel.com/technology/quickpath/whitepaper.pdf  computer.howstuffworks.com/nehalem-microprocessor-microarchitecture3.htm  www.realworldtech.com/page.cfm?ArticleID=RWT040208182719  www.lostcircuits.com/  www.maximumpc.com/article/features/everything_you_need_know_about_nehalems_turbo_mode  www.osnews.com/story/20199/Intel_Gives_turbo_Boost_to_Nehalem  computer.howstuffworks.com/nehalem-microprocessor-microarchitecture4.htm  www.atomicmpc.com.au/News/126302,idf-wrap-up-nehalem-turbo-boost.aspx  www.benchmark.rs/tests/editorial/Nehalem_munich/presentations/Technology_Insight_Nehalem.pdf

41

