
Dynamic Representation and Prefetching of Linked Data Structures (or Introspection for Low-Level Data Prefetching) Mark Whitney & John Kubiatowicz ROC Retreat 2002

Memory Prefetching Problem
- On-chip data cache to execution latency: 1-3 compute cycles
- Main memory to cache latency: ~50 compute cycles
- Want data in the cache before it is requested by the execution core
- But all the data will not fit
- How do we prefetch data in a timely manner without wasting cache space?

Example linked list and its traversal

    struct node_t {
        int datum1;
        char *datum2;
        struct node_t *next;
    };

    struct node_t *node;
    ...
    for (node = head; node != NULL; node = node->next) {
        if (key1 == node->datum1 && !strcmp(key2, node->datum2))
            break;
    }

List layout in memory:

    0x1000: datum1 = 1    0x1004: datum2 -> "ford"        0x1008: next = 0x2500
    0x2500: datum1 = 2    0x2504: datum2 -> "trillian"    0x2508: next = 0x3000
    0x3000: datum1 = 3    0x3004: datum2 -> "marvin"      0x3008: next = NULL

Address accesses: 0x1000, 0x1004, 0x1008, 0x2500, 0x2504, 0x2508, 0x3000, ...
The obvious structure is not apparent in this simple list of memory locations.

Prefetching Linked Data Structures
- In the Olden benchmark suite, pointer loads contributed 50-90% of data cache misses
- Linked data structure elements can occupy non-contiguous portions of memory, so prefetching sequential blocks of memory is not effective
- Solution: prefetch the elements of the data structure sequentially, using information from the element just fetched to find the next element to prefetch
- Split the prefetching task into two parts:
  1. Pick out the structure of linked data elements and build a representation of the data structure
  2. Prefetch the data structure effectively using that representation

Linked Data Structure Manifestation

Loop in C:

    for (node = head; node != NULL; node = node->next) {
        if (key1 == node->datum1 && !strcmp(key2, node->datum2))
            break;
    }

Compiled loop (MIPS):

    loop:    lw   $10,0($16)
             lw   $4,4($16)
             bne  $10,$6,endif
             jal  strcmp
             bne  $7,$0,endif
             j    outloop
    endif:   lw   $16,8($16)
             bne  $16,$0,loop
    outloop:

Dynamic memory accesses (loads only):

    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    ...

How do we get from the dynamic access stream back to the data structure (node, next, datum1, datum2)?

Register Use Analysis

Problem: there could be up to 50 unrelated load instructions for every load instruction relevant to a particular data structure.
Solution: extract load dependencies through register use analysis and analysis of store memory addresses.

Keep a table mapping each register to the address of the instruction that produced its value:

    0x118  lw $16,8($16)    table: $16 -> producing instruction 0x118
    0x100  lw $3,0($16)     $16 was produced by 0x118:
                            found dependency between loads 0x118 & 0x100
    0x124  lw $5,64($10)    unrelated loads pass through
    0x12c  lw $2,0($11)

Data Flow Graph of Dependent Loads

Spill table: same idea, but keeps track of loads that produced values which were temporarily saved (spilled) to memory.

Keep another table of load instruction addresses and the values they produce, and we get a chain of dependent loads:

    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    ...

Dependency arcs (labeled offset:8, offset:0, offset:4, offset:8, ...) link each load to the load that produced its base register.

Compression of Data Flow Graph

The chain of dependencies is still linear in the number of memory accesses, so apply compression to the dependency chain.

The SEQUITUR compression algorithm produces a hierarchical grammar:

    start = rule1 x 3
    rule1 = offset:8 lds back:3
            offset:0 lds back:1
            offset:4 lds back:2

Each grammar symbol describes a load by its address offset and how many loads back in the chain its base value was produced; e.g. "offset:0 lds back:1" is lw $10,0($16), whose base register was written by the immediately preceding load.

    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)
    lw $16,8($16)
    lw $10,0($16)
    lw $4,4($16)

Data Structure Found in Benchmark

Loop from the "health" benchmark in the Olden suite:

    while (list != NULL) {
        prefcount++;
        i = village->hosp.free_personnel;
        p = list->patient;            /* This is a bad load */
        if (i > 0) {
            village->hosp.free_personnel--;
            p->time_left = 3;
            p->time += p->time_left;
            removeList(&(village->hosp.waiting), p);
            addList(&(village->hosp.assess), p);
        } else {
            p->time++;                /* so is this */
        }
        list = list->forward;
    }

Representation generated:

    start = rule1 x 10
    rule1 = offset:8 lds back:3
            offset:0 lds back:1
            offset:4 lds back:1

Prefetching

Dependent load chains with compression provide compact representations of linked data structures. Representations that compress well can be cached and then used later to prefetch the data structure.

To prefetch, unroll the compressed representation, grabbing the root pointer value from the stream of incoming load instructions:

    rule1: offset:8 lds back:3    initiating load instruction produces 0x1000
           offset:0 lds back:1    fetch data at memory location 0x1000 (gives 1)
           offset:4 lds back:2    fetch data at memory location 0x1004 (gives "ford")
    rule1: offset:8 lds back:3    fetch data at memory location 0x1008 (gives 0x2500)
           offset:0 lds back:1    fetch data at memory location 0x2500 (gives 2)
           offset:4 lds back:2    fetch data at memory location 0x2504 (gives "trillian")
    ...

Prefetch Flow

Representations are cached, keyed on the initiating load instruction's address:

    0x118  ->  start = rule1 x 3
               rule1 = offset:8 lds back:3
                       offset:0 lds back:1
                       offset:4 lds back:2

When the load instruction at address 0x118 is executed again, initiate a prefetch, grabbing the result of the initiating load as the first memory location prefetched:

    0x118  lw $16,8($16)    returns value 0x4000
    prefetch memory location 0x4000
    prefetch memory location 0x4004
    prefetch memory location 0x4008
    prefetch memory location 0x4100
    ...

“Introspective” System Architecture for Prefetcher

The representation-building and prefetching tasks are rather modular, so they could be split across different processors:

    compute   --->  load stream (0x118 lw $16,8($16); 0x100 lw $3,0($16); ...)  --->  build
    build     --->  representation (rule1 x 3; rule1 = offset:8 lds back:3,
                    offset:0 lds back:1, offset:4 lds back:2; start value = 0x1000)  --->  prefetch
    prefetch  --->  prefetch 0x1000, prefetch 0x1004, ...;
                    prefetch results (1, "ford", ...)  --->  compute

Preliminary results: correct, timely prefetches?

Most prefetches are later accessed by regular instructions. Great!
Unfortunately, the prefetched data is usually already in the cache. Too bad!

Representations work, prefetches late...

The representation phase of the algorithm seems to work reasonably well for linked lists; it works less well on trees and arrays.
It provides no performance gains, due to not enough prefetching and to prefetches issued too late.
– More work is being done to start prefetching farther along in the data structure.