
1  Vector Class on Limited Local Memory (LLM) Multi-core Processors
Ke Bai, Di Lu and Aviral Shrivastava
Compiler Microarchitecture Lab, Arizona State University, USA

2  Summary (Compiler Microarchitecture Lab, http://www.aviral.lab.asu.edu)
 Cannot improve performance without improving power-efficiency
   Cores are becoming simpler in multicore architectures
   Caches are not scalable (in both power and performance)
 Limited Local Memory (LLM) multicore architectures
   Each core has a scratch pad memory (e.g., the Cell processor)
   Explicit DMAs are needed to communicate with global memory
 Objective:
   How can the vector data structure (dynamic arrays) be enabled on the LLM cores?
 Challenges:
   1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
   2. Dynamic global memory management, and arbitration of core requests
   3. How can pointers be used when the data pointed to may have moved?
 Experiments:
   Vectors of any size are supported
   All SPUs may use the vector library simultaneously, and the scheme is scalable

3  From multi- to many-core processors
[chip photos: IBM XCell 8i, GeForce 9800 GT, Tilera TILE64]
 Simpler design and verification
 Reuse of the cores
 Can improve performance without much increase in power
   Each core can run at a lower frequency
 Tackle thermal and reliability problems at core granularity

4  Memory Scaling Challenge
[chip photos: Intel 48-core chip, StrongARM 1100]
 In Chip Multi Processors (CMPs), caches guarantee data coherency
   Bring the required data from wherever it is into the cache
   Make sure that the application gets the latest copy of the data
 Caches consume too much power
   44% of power, and greater than 34% of area
 Cache coherency protocols do not scale well
   The Intel 48-core Single-chip Cloud Computer has non-coherent caches

5  Limited Local Memory Architecture
[diagram: a PPE and SPEs 0–7 (each an SPU with a Local Store) on the Element Interconnect Bus (EIB), connected to off-chip global memory. PPE: Power Processor Element; SPE: Synergistic Processor Element; LS: Local Store]
 Cores have small local memories (scratch pads)
 A core can only access its local memory
 Accesses to global memory go through explicit DMAs in the program
 e.g., the IBM Cell architecture, which is in the Sony PS3

6  LLM Programming
 Task-based programming, MPI-like communication

Main Core:
    #include
    extern spe_program_handle_t hello_spu;
    int main(void) {
        int speid, status;
        speid = spe_create_thread(&hello_spu);
    }

Local Core (the same program runs on each local core):
    int main(speid, argp) {
        printf("Hello world!\n");
    }

 Extremely power-efficient computation
   If all code and data fit into the local memory of the cores
   Otherwise, efficient data management is required!

7  Managing data

Original Code:
    int global;
    f1(){
        int a,b;
        global = a + b;
        f2();
    }

Local Memory Aware Code:
    int global;
    f1(){
        int a,b;
        DMA.fetch(global)
        global = a + b;
        DMA.writeback(global)
        DMA.fetch(f2)
        f2();
    }
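The fetch/compute/write-back pattern above can be simulated on an ordinary host. In this minimal sketch, `std::memcpy` stands in for the DMA engine; the names `dma_fetch` and `dma_writeback` are illustrative, not the Cell SDK API (which uses MFC DMA commands):

```cpp
#include <cstring>
#include <cassert>

// Hypothetical "global memory" and "local store" copies of the variable.
// On a real LLM core, the transfers below would be explicit DMA commands.
static int global_mem = 0;   // lives in off-chip global memory
static int local_copy = 0;   // buffered copy in the core's local store

void dma_fetch(int *local, const int *global)     { std::memcpy(local, global, sizeof(int)); }
void dma_writeback(const int *local, int *global) { std::memcpy(global, local, sizeof(int)); }

void f1() {
    int a = 3, b = 4;
    dma_fetch(&local_copy, &global_mem);      // bring the global variable into local store
    local_copy = a + b;                       // compute only on the local copy
    dma_writeback(&local_copy, &global_mem);  // publish the result back to global memory
}
```

The key point the slide makes is that every access to `global` in the original code becomes an explicit fetch or write-back in the transformed code.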

8  Vector Class Introduction
 One of the classes in the Standard Template Library (STL) for C++
 Implemented as a dynamic array; a sequential container
   Elements are stored in contiguous storage locations
   Elements can be accessed using iterators, or using offsets on regular pointers to elements
 Compared to arrays:
   Vectors can easily be resized
   Capacity increases and decreases are handled automatically
   They usually consume more memory than arrays when their capacity is handled automatically
   This is in order to accommodate extra storage space for future growth
The vector class is a widely used library for programming!
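The resizing and contiguity properties described above can be checked directly. A small sketch (the growth factor is implementation-defined, so only the invariants are asserted, not an exact capacity):

```cpp
#include <vector>
#include <cassert>

std::vector<int> grow_demo(int n) {
    std::vector<int> v;
    for (int i = 0; i < n; ++i)
        v.push_back(i);              // reallocates automatically when full
    // capacity() may exceed size(): slack is reserved for future growth
    assert(v.capacity() >= v.size());
    // elements stay in one contiguous block, like a plain array
    assert(&v[n - 1] == &v[0] + (n - 1));
    return v;
}
```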

9  Vector Class Management
SPE code:
    main() {
        vector<int> vec;
        for(int i = 0; i < N; i++)
            vec.push_back(i);
    }
The maximum N is 8192. But 8192 ints is only 32 KB, far less than the 256 KB of local memory. Why does it crash so early?
 All code and data need to be managed
 This paper focuses on vector data management
 Vector management is difficult
   Vector size is dynamic and can be unbounded
   The Cell programming manual suggests "Use dynamic data at your own risk"
   Restricting the usage of dynamic data is restrictive for programmers

10  Outline of the Talk
 Motivation
 Related Work on Vector Data Management
 Our Approach to Vector Data Management
 Experiments

11  Related Works
[diagram: an SPE's local memory and global memory in an LLM architecture, with DMA between them]
 Different threads can access a vector concurrently, whether it lives in one address space or across different address spaces
 These libraries provide efficient parallel implementations, abstract away platform details, give programmers an interface to express the parallelism of their problems, and automatically translate data from one space to another
   Shared memory: MPTL [Baertschiger2006], MCSTL [Singler2007] and Intel TBB [Intel2006]
   Distributed memory: POOMA [Reynders1996], AVTL [Sheffler1995], STAPL [Buss2010] and PSTL [Johnson1998]
They ensure data coherency across different spaces. But what if the local memory is small?

12  Space Allocation and Reallocation
 push_back & insert
   Add elements
   The vector must be re-allocated into a larger space when there is no unused space left
[diagram: (a) the vector uses up its allocated space (0x010100–0x010200); (b) a larger space (0x010500–0x010700) is allocated and all data is moved]
An unlimited vector requires evicting older vector data to global memory and reallocating more global memory!
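The grow-and-move step in the figure can be sketched as follows. This is a minimal host-side illustration; the doubling policy is an assumption (the paper's growth factor may differ), and `RawVec` is a hypothetical name:

```cpp
#include <cstdlib>
#include <cstring>

// When the vector uses up its allocated space, allocate a larger
// region and move all existing elements over, as in figure (b).
struct RawVec {
    int   *data;
    size_t size, cap;
};

void push_back(RawVec &v, int x) {
    if (v.size == v.cap) {                        // no unused space left
        size_t new_cap = v.cap ? 2 * v.cap : 4;   // assumed doubling policy
        int *p = (int *)std::malloc(new_cap * sizeof(int));
        if (v.size)
            std::memcpy(p, v.data, v.size * sizeof(int)); // move old data
        std::free(v.data);
        v.data = p;
        v.cap  = new_cap;
    }
    v.data[v.size++] = x;
}
```

On an LLM core the new region lives in global memory and the copy is done by DMA (or by the main core on the SPE's behalf, as the next slide shows), but the logical structure is the same.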

13  Space Allocation and Reallocation
 A static buffer?
   Small vector -> low utilization; large vector -> overflow
 An SPU thread can't call malloc() and free() on global memory
 Hybrid: DMA + mailbox

    struct msgStruct {
        int vector_id;
        int request_size;
        int data_size;
        int new_gAddr;
    };

Protocol between the SPE thread and the PPE thread:
(1) the SPE transfers the parameters by DMA;
(2) the SPE sends the operation type through the mailbox;
(3) the PPE operates on the vector data in global memory and updates new_gAddr in the data structure;
(4) the PPE sends a restart signal;
(5) the SPE gets the new vector address by DMA.
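The five-step handshake can be sketched as a host-side simulation. PPE and SPE are mimicked with ordinary function calls, and every name besides `msgStruct` and its fields is illustrative, not the real mailbox API:

```cpp
#include <cstdlib>

// Message the SPE DMAs to the PPE before signalling via the mailbox.
// Field names follow the slide; new_gAddr is filled in by the PPE.
// (A pointer is used here instead of the slide's int to stay portable.)
struct msgStruct {
    int   vector_id;
    int   request_size;
    int   data_size;
    void *new_gAddr;
};

// Step (3), simulated: the PPE thread services the request by calling
// malloc() on global memory on the SPE's behalf, then records the new
// address back into the message structure.
void ppe_service_request(msgStruct &m) {
    m.new_gAddr = std::malloc(m.request_size);
}

// Steps (1), (2), (4), (5) collapsed into one call: the SPE "sends"
// the message, waits for the restart signal, then "DMAs" back the
// new global address. In reality each arrow is a DMA or mailbox op.
void *spe_request_global_alloc(int vector_id, int bytes) {
    msgStruct m{vector_id, bytes, 0, nullptr};
    ppe_service_request(m);   // stands in for the DMA + mailbox round trip
    return m.new_gAddr;
}
```

The design point is that only the PPE side ever calls the global allocator, which also gives it a natural place to arbitrate requests arriving from several SPEs.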

14  Element Retrieving
 Each block contains a block index in addition to its data; blocks are kept in a linked list
 Block index: the index of the 1st element in the block
[diagram: block size is 16; block 0 holds the 0th–15th elements, and the block with index 128 holds the 128th–143rd elements]
 For the 133rd element: block index = 133 / 16 * 16 = 128 (integer division)
 Global address: based on the global address, we can know whether this block is in the local memory or not. If not, fetch it.
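The index arithmetic on this slide is integer division rounding down to a block boundary. A one-function sketch with the slide's 16-element blocks:

```cpp
// Block index = index of the first element in the block that holds
// element i: round i down to a multiple of the block size.
// Integer division discards the remainder, so (i / B) * B != i in general.
constexpr int BLOCK_SIZE = 16;            // elements per block (slide's example)

int block_index(int i) {
    return (i / BLOCK_SIZE) * BLOCK_SIZE; // e.g. element 133 -> block index 128
}
```

The runtime then uses this block index (a global address) as the lookup key into the local-store buffer, fetching the block by DMA on a miss.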

15  Vector Function Implementation
 In order to preserve semantics, we implemented all functions; only the insert function is shown here.
 The original insertion can take advantage of pointers:

    for (……)
        (*b++) = (*a++);

 But shifting elements is now a challenging task on an LLM architecture
   Because we cannot use pointers in local memory to access global memory, and DMA requires alignment
[diagram: inserting a new element shifts the later elements, staged between local memory and global memory]
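The shifting step can be sketched on a host: a chunked back-to-front move through a small staging buffer stands in for the DMA transfers, and the chunk size and names are illustrative (the real implementation must also respect DMA alignment, which this sketch ignores):

```cpp
#include <cstring>

// Insert by shifting: elements at positions [pos, n) move one slot to
// the right, in fixed-size chunks staged through a small "local store"
// buffer, instead of walking global memory with pointers directly.
const int CHUNK = 4;                      // illustrative DMA chunk size

void insert_shift(int *global, int n, int pos, int value) {
    int buf[CHUNK];                       // staging buffer in local memory
    // Move the tail up by one, back to front, so chunks never clobber
    // data that has not been staged yet.
    for (int end = n; end > pos; ) {
        int start = (end - CHUNK > pos) ? end - CHUNK : pos;
        std::memcpy(buf, global + start, (end - start) * sizeof(int));      // "fetch"
        std::memcpy(global + start + 1, buf, (end - start) * sizeof(int));  // "write back"
        end = start;
    }
    global[pos] = value;                  // place the new element
}
```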

16  Pointer Problem
 In order to support limitless vector data, global memory must be leveraged
 Two address spaces co-exist; no matter what scheme is implemented, the pointer issue exists
[diagram: (a) a struct field int* ptr in local memory points to a vector element in local memory; (b) the vector element has been moved to global memory, leaving the pointer dangling]
The pointer problem needs to be solved!

17  Pointer Resolution

(a) Original Program:
    main() {
        vector<int> vec;
        int* a = &vec.at(index);
        int sum = 1 + *a;
        int* b = a;
    }

(b) Transformed Program:
    main() {
        vector<int> vec;
        int* a = ppu_addr(vec, index);
        a = ptrChecker(a);
        int sum = 1 + *a;
        a = s2p(a);
        int* b = a;
    }

 ppu_addr: returns the global address ptr pointing to the vector element
 ptrChecker:
   checks whether ptr is pointing to vector data;
   guarantees that the data pointed to is in the local memory;
   returns the local address
 s2p: transforms the local address back into a global address
A local address should not be used to identify the data.

18  Experimental Setup
 Hardware
   PlayStation 3 with the IBM Cell BE
 Software
   Operating system: Linux Fedora 9 and IBM SDK 3.1
   Benchmarks: some possible applications using vector data

19  Unlimited Vector Data
[chart: vector element sizes 4 B, 8 B, 16 B, …, 2^(n+2) B] Why?

20  Impact of Block Size

21  Impact of Buffer Space
 buffer_size = number_of_blocks × block_size

22  Impact of Associativity
 Higher associativity -> more computation spent looking up the data structure, but a lower miss ratio

23  Scalability

24  Summary
 Cannot improve performance without improving power-efficiency
   Cores are becoming simpler in multicore architectures
   Caches are not scalable (in both power and performance)
 Limited Local Memory (LLM) multicore architectures
   Each core has a scratch pad memory (e.g., the Cell processor)
   Explicit DMAs are needed to communicate with global memory
 Objective:
   How can the vector data structure (dynamic arrays) be enabled on the LLM cores?
 Challenges:
   1. Use the local store as a temporary buffer (e.g., a software cache) for vector data
   2. Dynamic global memory management, and arbitration of core requests
   3. How can pointers be used when the data pointed to may have moved?
 Experiments:
   Vectors of any size are supported
   All SPUs may use the vector library simultaneously, and the scheme is scalable

