Trends in HW/SW Co-design, Jun-Dong Cho (조준동), VLSI Algorithmic Design Automation Lab.

School of Information and Telecommunication Sungkyunkwan Univ.

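Module 1: Need for and Overview of Integrated Design Methodology
Learning objectives: As chip integration density rises, the concurrency among processing components also increases, as in MPSoC and NoC. System design and verification then become even harder problems, so this module looks at the outlook for integrated design technology that prepares for this trend.
Prerequisites: logic design, computer architecture
Copyrightⓒ2005 J.D.Cho,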

Contents
- Motivation
- Multiprocessor platform
- On-chip communication (Network on Chip)
- Low power design
- HW/SW codesign methodology

Post PC = Mobile computing + Intelligent environment
- 10^9 times the bandwidth and 10^6 times the power consumption
- About 3 GOPS is needed to find a song within 0.5 s by humming against a database of 2000 songs; 3D TV also requires several GOPS.
- According to the National Technology Roadmap for Semiconductors, by 2010 a single chip would integrate 4 billion transistors at 50 nm, with a clock speed of 10 GHz.
- A new design methodology is required to handle wiring delay and intrinsic electrical noise.
- Ultra low energy (Mops/mW), ultra low cost
- S/W and H/W co-design, S/W-driven design reuse (e.g., Software-Defined Radio)

Predicting the future
- 1899, Charles H. Duell, U.S. Patent Office: "Everything that can be invented has been invented."
- 1943, Thomas J. Watson, Chairman of the Board, IBM: "I think there is a world market of about five computers."
- 1948, IBM: "The computer has no commercial value."
- 1981, Bill Gates, Chairman, Microsoft: "640 kilobytes of RAM ought to be enough for anybody."

McKinsey Curve: dynamics of R&D disciplines
[Figure: maturity of a discipline plotted against year. Labelled phases: fundamental issues, consolidation, saturation (limitations met). A new discipline is then built on top of it by innovation.]

EDA Industry Revolutions
EDA industry paradigm switching every 7 years, moving closer to the programmer's mind set (courtesy Keutzer / Newton):
- 1978: Transistor entry: Applicon, Calma, CV ...
- 1985: Schematics entry: Daisy, Mentor, Valid ...
- 1992: Synthesis: Cadence, Synopsys ...
- 1999: HLLs, (Co-)Compilation, data-stream-based DPU arrays
- 2006

Embedded Systems and Portable Computing
- 92% of market
- Knowledge base needed: Hardware/Software Codesign

A Multimedia Embedded Chip

What are the properties of these Ambient Intelligence architectures?

Silicon technology roadmap

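Development Directions
- Wireless processing systems demand high throughput and heavy computation, yet operate under strict power constraints.
- Reconfigurable SoC implementations pursue higher performance through parallelism and rely on IP reuse.
- Algorithm partitioning guided by performance prediction of hot-spot bottlenecks
- Communication bandwidth is being widened to meet the large burst data transfers required by the growing range of multimedia applications.
- Dual-core architecture (ARM+DSP) -> Multiprocessor SoC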

Key Challenges With Chip Design

HIERARCHY OF PLATFORMS

Recent Research Trends
- Intel's Reconfigurable Radio Architecture (mesh + nearest neighbor)
- Reconfigurable Baseband Processing, Picochip
- Portable Components using Containers for Heterogeneous Platforms, Mercury Computer Systems, Inc.
- Configurable platforms: Altera Excalibur, Xilinx Virtex FPGA
- Adaptive Computing Machine, Quicksilver Tech.
- Mercury, Sky, Galileo, Tundra (crossbars, bridges)
- Virginia Tech's reconfigurable hardware

Full Application Platform
Users design full applications on top of hardware and software architectures.
- Nexperia
- Texas Instruments' OMAP multimedia platform
- Infineon's M-Gold 3G wireless platform
- Parthus' Bluetooth platforms
- ARM's PrimeXsys wireless platform

OMAP™ (Open Multimedia Application Platform)
- The OMAP architecture includes SW/OS support that can control the clocking and idle modes of the entire platform.
- The dual-core architecture makes it possible to assign each task to the processor best suited to it.

Processor-centric platform
Focuses on access to a configurable processor but doesn't model complete applications.
- Program-In Chip-Out (PICO), HP Labs
- GARP, UC Berkeley
- Improv Systems
- ARC
- Tensilica
- Triscend

Fully programmable platform
Consists of FPGA logic and a processor core: System on a Programmable Chip (SOPC).
- Altera's Excalibur, Xilinx's Virtex-II Pro, and QuickLogic's QuickMIPS
- Xilinx-IBM XBlue architecture

Coarse grain Reconfigurable Computing, Reiner Hartenstein, TU Kaiserslautern
The new machine paradigm: Configware is going mainstream. Hardware / Configware / Software co-design is the new mind set for digital systems engineering.

Configurable Logic Block (CLB)
Fine-grain Morphware: Drawbacks (Source: R. Hartenstein)
- FPGA architectures: SRAM-based look-up tables (LUTs) inside configurable logic blocks (CLBs)
- Reconfigurable interconnect (switching boxes)
- Problems:
  - Routing reduces performance
  - Bad ratio of active to passive elements

Reconfigurability Overhead
- S: resources needed for reconfigurability
- L: area used by the application, partly for configuration code storage ("hidden RAM" not shown)
This slide shows the Xilinx XC4000E FPGA segmentation architecture. A routing channel comprises three types of subchannels: single-length lines (shown in blue), double-length lines (brown), and longlines (red).

Merit of Coarse grain Approach
[Figure: energy efficiency in MOPS/mW (0.001 to 1000) versus feature size (2 to 0.07 µm) for hardwired logic, rDPAs (reconfigurable computing), FPGAs (reconfigurable logic), DSPs, standard microprocessors, and instruction set processors. Sources: T. Claasen et al., ISSCC 1999; R. Hartenstein, ISIS 1997 (rDPA curve).]
Wiring by abutment (a 32-bit KressArray example): if coarse-grain cells are full custom and mesh-connected, with second-level interconnect resources laid out over the cells, the array is almost as area-efficient as hardwired logic.

Communication- centric platform
Provides an interconnect architecture but doesn't typically provide a processor or a full application.
- Sonics' SiliconBackplane
- PalmChip's CoreFrame architectures

The History of Paradigm Shifts
“Mainstream silicon application is switching every 10 years”
“The Programmable System-on-a-Chip is the next wave”
[Figure: Makimoto's Wave, swinging between standard and custom over 1957, 1967, 1977, 1987, 1997, 2007. Labels: TTL; LSI, MSI; µproc., memory; ASICs, accel's; reconfigurable; 1st design crisis; 2nd design crisis; "Published in 1989"; "?"]
- Makimoto's 1st wave: TTL parts (NAND gate, NOR gate, flip-flop, etc.) are general purpose; chips for pocket calculators, radio, TV, etc. are application-specific.
- Makimoto's 2nd wave: microprocessor, microcontroller, and RAM are general purpose; graphics, multimedia, communication chips, etc. are application-specific.
- Makimoto's 3rd wave: FPGAs (gates and flip-flops) are general purpose; question: will the second half of the wave go application-specific?
What's coming next?

The anti universe
[Figure: hydrogen and anti-hydrogen]
- Paul Dirac predicted a complete anti universe consisting of antimatter: "There are regions in the universe, which consist of antimatter ... But there are asymmetries."
- When a particle hits its antiparticle, both are converted into energy: annihilation.
- Think back to the "playdough" example: we cut out a star, and its "negative image" appeared in the dough. If we now put the star back into its "negative image" hole, there are no more stars, and all the original dough (= energy) is back.
- We are not aware that there is a new area in computing sciences which consists of the antimatter of computing.
- Reconfigurable Computing is made from this antimatter: data-stream-based computing.

Machine and Anti Machine
- Machine paradigm („von Neumann“): CPU, instruction stream; 1st electronic computer (Konrad Zuse); 1st microprocessor (Ted Hoff)
- Anti machine paradigm: DPU, data streams (systolic array: Kung / Leiserson); anti machine paradigm published; rDPA / DPSS (supersystolic: Rainer Kress); novel compilation techniques

IBM's CoreConnect
Bandwidth has been extended from the initial 32 bits up to 128 bits.

Sonics Smart Interconnect IP

SMART (Sonics Methodology and Architecture for Rapid Time-to-Market)
- Plug-and-play on-chip communications network
- Packet-based
- 50 employees in a year
- Provides IP and a design environment; supports SoC design
- Allied with Cadence
- SiliconBackplane III targets communications + media

Nexperia Digital Video Platform
Designing the initial platform, along with the pnx8500, wasn't quick and easy. It involved about 300 hardware, software and systems people working between 1999 and 2001, of which 60 were involved with hardware.

Microprocessor Architecture Research
- Wave Pipelining: Prof. Mike Flynn at Stanford
- Multithreaded Processors
- Single-Chip Multiprocessors: Prof. Kunle Olukotun at Stanford
- Vector/Stream Processors: Prof. Bill Dally at Stanford
- Intelligent RAM: Prof. Dave Patterson at U.C. Berkeley
- Reconfigurable Computing: DARPA program
Don Alpert

Single-Chip Multiprocessors
Hydra Project (Prof. Kunle Olukotun at Stanford)
- Targets thread-level parallelism
- 4 CPUs on a chip
- 3-level cache hierarchy
- Parallelizing compiler technology
Don Alpert

Advantages of multi-processors:
- Performance: possibility to exploit thread-level parallelism combined with ILP
- Energy: low energy cost per instruction by customizing the nodes (ASIPs), plus an effective memory hierarchy and a distributed, customisable organisation
- Flexibility: programmable nodes
- Scalability: memory bandwidth is scalable (if a good memory hierarchy is used)

SoC designs with special-purpose processor accelerators attached to the common bus have been used
IBM PowerNP
- Operate in parallel with the processor
- Dedicated to specific tasks
- Programmable and flexible
C.J. Georgiou, V. Salapura, M. Denneau, "A Programmable Scalable Platform for Next Generation Networking," Network Processor Design: Issues and Practices, Vol. 2, Morgan Kaufmann Publishers, 2004

Advantages of multiprocessor subsystems in SoC design
- The multiprocessor subsystem is connected to the SoC bus via a bridge.
  - This separation accommodates different speeds, bus widths, signals, and signaling protocols between the SoC bus and the multiprocessor interconnect.
  - The subsystem interconnect fabric (i.e., switch) is optimized for multiprocessor operation.
  - Only data traffic flows between the multiprocessor and the rest of the SoC.
- The computational capacity of the multiprocessor subsystem is parameterized.
  - The number of processor clusters, embedded memories, and memory sizes can be optimized for the particular application.
- Software development is simplified.
  - Basic communication and system management primitives are already available to the designer of the application.

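NoC (Network on Chip): transplanting a communication-network structure onto a single chip
- U.C. Berkeley
- Transplants a communication-network structure onto a single semiconductor chip
- Transfer protocols are defined according to the OSI model
- DSP, microprocessor, memory, etc. are connected within a single chip using H/W-S/W co-design
- Code optimization and construction of a low-power software IP library
- Bus structure for connecting modules
- Components:
  - Region: an area that permits a special topology or network structure
  - Backbone
  - Wrapper: converts transmitted messages into the appropriate form; complex
- Suited to complex, large systems

This slide describes the basis of Network on Chip design: every component in the system is given an address, and data is exchanged as packets that carry a destination address and travel through the communication network from source to destination. Network on Chip (NoC) is a concept that began to be studied at U.C. Berkeley in the United States and elsewhere, as an attempt to transplant a communication-network structure into the system design of a single semiconductor chip. The transplanted network structure is defined by transfer protocols such as the OSI 7-layer model. The research started in order to support structures in which heterogeneous large subsystems such as DSP, µProcessor, memory, and ASIC blocks are connected within one chip and exchange data in random patterns. NoC design was made more concrete, and its components were defined, through research at the Royal Institute of Technology (Sweden), VTT Electronics (Finland), and others. Their NoC, named CLICHÉ, consists of links arranged in a two-dimensional mesh, switches at the intersections, and large subsystems (resources) attached to the switches; its concrete form is described on the next slide. The components they defined are as follows.

[1] Region: An insulated, special area within the two-dimensional switch network that covers a single switch or a range of switches, together with the resources attached to them. A region may use a special topology or a particular internal communication structure that differs from the rest of the NoC. It is used when a structure such as CLICHÉ is unsuitable for performance reasons. It is not the same concept as a sub-network; it is regarded as a small mechanism for organizing communication efficiently.

[2] Backbone: The backbone is a comprehensive development platform for NoC-based systems. Its role is to provide a solid starting point for ASIC design that offers both design guidance and flexibility. If the NoC design process is divided into three stages (backbone, platform, and application development), the backbone stage decides the types of regions and prepares the communication channels and switches, the network interfaces and resources, and the communication protocols. Using these basic elements, the designer can map software or configurable hardware through the backbone.

[3] Wrapper: Converts transmitted messages into the appropriate form for communication between heterogeneous resources and for data exchange between different regions; this conversion is complex.

Because the network organization of such a NoC is relatively complex and packet transfer handles random data transfers efficiently, the NoC structure is well suited to building large systems.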
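The packet addressing described above can be illustrated with a small sketch. It assumes dimension-ordered (XY) routing on a two-dimensional mesh, which is a common choice for mesh NoCs; the slides do not state which routing algorithm CLICHÉ uses, and the names `Packet` and `xy_route` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[int, int]  # (column, row) address of a switch in the mesh


@dataclass
class Packet:
    src: Coord        # address of the sending resource's switch
    dst: Coord        # destination address carried in the packet header
    payload: bytes


def xy_route(packet: Packet) -> List[Coord]:
    """Return the switches a packet visits from src to dst.

    Assumed policy (not from the slides): travel along X until the
    column matches, then along Y until the row matches.
    """
    x, y = packet.src
    dst_x, dst_y = packet.dst
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    hops = xy_route(Packet(src=(0, 0), dst=(2, 1), payload=b"data"))
    print(hops)
```

Each switch only needs the destination address in the header to pick the next hop, which is the property the slide attributes to packet-based on-chip communication.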

Adaptive System on Chip

Scheduled Communication
A tiled architecture
- Each tile is a computational core, and the tile interfaces together form the network.
- The core interface supports heterogeneous processing that spans one or more tiles.
- The system is connected by a statically scheduled mesh interconnect.
- Data moves between neighboring tiles through a communication pipeline, which allows a fast clock rate and time-division sharing of interconnect resources.
- The ability to reconfigure cores and the run-time interconnect enables dynamic power management.
In summary, aSoC is built from tiles. Each tile holds a computational core and an interface attached to the mesh interconnect. Pipelined links between neighboring tiles give high-speed communication and time-shared use of interconnect resources, and the reconfigurability of the cores and the run-time interconnect enables dynamic power management.

Communication Interface
A detailed view of the interconnect memory sequencer and interface crossbar (minus tags and FIFOs) appears in Figure 3. Each instruction accessed from the interconnect memory contains a number of data fields. Each of the six three-bit fields represents the source port for one of the interface crossbar output ports. Enabling pass transistors within the crossbar are driven by the output of three-to-five decoders. Additional fields within each communication instruction indicate the interconnect memory branch address, a comparison-select enable bit, and a bit to force a sequencer jump. In most cases the PC increments after each communication clock cycle to point to the next instruction in the interconnect memory. In the figure it can be seen that a path from the crossbar to the interface control does allow for some run-time routing decision-making regarding transfer lengths.

The functionality of this circuitry is best illustrated through a brief example. Consider the multi-step transfer of data from the local IP core (FIFO port Cin) to a set of destination ports. It is known at compile time that the multi-step transfer will be performed a number of times, but the exact number of sequences is not known. One iteration of the sequence is as follows:
1: FIFO out (Cin) -> South port (Sout)
2: FIFO out (Cin) -> West port (Wout)
3: FIFO out (Cin) -> East port (Eout)
4: FIFO out (Cin) -> Interface port (Iout)
The first three transfers can be accomplished by three consecutive state words stored in the interconnect memory which, after decoding, configure the local crossbar. For these three steps, the PC is incremented following each step completion. During the fourth step, a count value from the ...

- This paper dynamically configures an MPEG-4-based Motion Estimation (ME) core with one of three search methods: full, spiral, and three-step.
- The choice of method depends on the power consumption of the ME core and on overall video quality.
- The control tile, which has access to the global power and quality requirements, sends the best search method and the associated configuration stream to the ME tile.
- Independently designed heterogeneous cores need independent clock domains, and reconfigurable IP cores need reconfigurable clock domains as well.
- A simple configurable clock reference is the global clock multiplied or divided by 2^n, where n is a 3-bit binary value. This n is loaded into the controller at run time through the tile's local configuration line. This removes unnecessary core computation slack and reduces current spikes, which saves power.
- System communication can be rescheduled during execution: jump or load instructions, delivered to the controller over the local configuration line, modify the interconnect memory to install a new schedule. This is used to eliminate data transfers for unused tiles.
- Stream data that passes through a communication interface is scheduled for a specific communication clock cycle based on data link availability. The result of scheduling for each interface is a set of instructions for its associated interconnect memory.
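The two mechanisms described above, six three-bit source fields per communication instruction and a local clock derived from the global clock by a factor of 2^n, can be sketched as follows. The bit layout (field 0 in the least significant bits) and both function names are assumptions made for illustration; the source does not give the actual instruction encoding.

```python
from typing import List

NUM_OUTPUT_PORTS = 6   # six crossbar output ports, per the text
FIELD_BITS = 3         # each source-select field is three bits wide


def decode_source_fields(word: int) -> List[int]:
    """Split the low 18 bits of an instruction word into six 3-bit selectors.

    Assumption: field 0 occupies the least significant three bits.
    """
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (i * FIELD_BITS)) & mask for i in range(NUM_OUTPUT_PORTS)]


def local_clock_hz(global_hz: float, n: int, divide: bool = True) -> float:
    """Scale the global clock by 2**n, where n is a 3-bit value (0..7)."""
    if not 0 <= n <= 7:
        raise ValueError("n must fit in three bits (0..7)")
    factor = 2 ** n
    return global_hz / factor if divide else global_hz * factor


if __name__ == "__main__":
    # Hypothetical instruction word: output port i selects source i.
    word = 0b101_100_011_010_001_000
    print(decode_source_fields(word))
    print(local_clock_hz(100e6, n=2))  # global clock divided by 4
```

Keeping n to three bits means a tile can select one of eight scaling factors in each direction with a single short configuration write.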

Evaluation Methodology
To evaluate the benefits of aSOC interconnect, architectural simulators for PGA logic, a MIPS R4000, and a multiply-accumulate unit were used. These simulators were integrated with NSIM [15], an interconnect simulator that supports both static and dynamic routing protocols.

From buses to networks on chip?
- Low power wireless networks
  - Uncertain knowledge of physical medium
  - Communication is the dominant energy consumer
- Can we adapt design and optimization techniques from (wireless) networking to SoC communication?
- Packetized communication on chip
  - Requires an overhaul of architectures, CAD, and software to be communication-centric
- Protocol stack: simple (3 layer: physical, data link, network) vs. more complex (7 layer ISO/OSI)

What is NoC?
- Fewer but 'programmable' wires, by introducing switches (routers)
- Shared bus: communication bottleneck
- Point-to-point connection: many under-utilized long wires
- Structured approach to interconnect: wires are short, either to get on the network or from router to router
- Separation of computation (IPs) and communication (NoC)

Future SoC Interconnect Challenges

Network Architectures and control
Giovanni De Micheli

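Application Areas
- Supports protocols that guarantee selectable QoS, which suits real-time applications and applications that demand large data bandwidth
- SoC design for high-volume multimedia applications such as high frame rate video and 3D graphics
- Builds a platform-based design environment that packages the core on-chip network IP and design support tools into one platform, and applies it to a variety of SoC designs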

What next from networking?
[Figure: ideas flowing from computer networking to System on Chip design: packet switching, error correction, CDMA, on-demand wakeup, rumor routing, ???]

On chip communications
- Evolving from deterministic baseband signaling interconnects to on-chip networking and communications
- Indeed, complete integration of all layers of a networked node on a single chip:
  - physical -> transceiver, modem
  - link/MAC -> packet scheduling
  - routing -> routing protocols
  - transport -> TCP
  - application -> adaptive buffering
- The IC designer is also a networked system designer.

Technology challenges: Global interconnect wires
- Gate delay decreasing 25% per generation
- Wire delay increasing 100%
- Communicating across a chip:
  - 1 clock at 400 MHz in 0.35 μm
  - 12.4 clocks at 1 GHz in 0.1 μm
- Global wires violate scaling laws
- Global communication structures become performance and power bottlenecks
[ITRS Roadmap 2001]

Architecture: Bus based systems
- Advantages: simple, extensible, area efficient
- Disadvantages: communication bottleneck (poor scaling), arbitration overhead
- Widely used: AMBA, IBM CoreConnect, Wishbone
- Techniques to increase efficiency: bus splitting, burst mode transfers, split transactions
[IBM CoreConnect Spec.] [AMBA Spec.]

Case study: Bus Splitting
- Used in several SoC buses
- Reduced capacitive load, smaller sized drivers
- 16% to 50% energy savings, depending on communication patterns
- Architectural implications: more concurrency in transactions
- Split can be vertical or horizontal:
  - Split along bus length (multi-bus system)
  - Split across bus width
- Tools to guide the splitting; related to floor planning
[Hsieh, TCAD02]
[Figure: modules M1 to M5 on a bus split along its length, and on a bus split across its width.]

Physical design: limitations come from interconnect physics
Giovanni De Micheli
- Delay on global wires and delay uncertainty
  - Crosstalk due to capacitive coupling among wires
- Electrical signalling techniques
  - Trade-off: noise immunity vs. energy vs. speed
  - Sense small swings -> low energy and fast transitions
- Synchronization across large chips
  - Is synchronization possible at high clock rates?
  - What is the probability of synchronization failure?

Reliability of information
Giovanni De Micheli
- Information transfer is inherently unreliable at the electrical level, due to:
  - Timing errors
  - Cross-talk
  - Electro-magnetic interference (EMI)
  - Soft errors
- The problem will become increasingly acute as technology scales down.

Systems on chips: a communication-centric view
- Design component interconnection under:
  - Uncertain knowledge of physical medium
  - Incomplete knowledge of data traffic
- Design interconnection as a micro-network
  - Leverage network design technology
- Manage information flow
  - To provide for performance
- Power-manage components based on activity
  - To reduce energy consumption
Giovanni De Micheli

Network design objectives
- Low communication latency
  - Streamlined control protocols
- High communication bandwidth
  - To support demanding SW applications
- Low energy consumption
  - Wiring switched capacitance dominates
- High system-level reliability
  - Correct communication errors, data loss
Giovanni De Micheli

Framework for NoC modelling
Three types of basic components for a system-level model:
- Tasks
- RTOS services: task scheduling, resource allocation, execution synchronization
- Communication network

Compiler Research Issues
- Synthesis of RTOS elements in the compiler
  - On the application side: generation of an efficient application-specific static/run-time scheduler and synchronization
  - On the hardware side: generation of device drivers, memory management primitives, etc. using hardware specifications
- Automatic retargetability for a family of target architectures
- Automatic application partitioning
  - Mapping of process/task-level concurrency onto multiple PEs, using programmer guidance in the programmer's model

Embedded Multiprocessor SoC Memory Management
Given that current computers waste much time transferring data between compute and storage units, it is appealing to combine significant processing power and a large amount of memory on the same chip. Designers of multiprocessor SoCs with heterogeneous processing elements and significant on-chip memory have to decide whether memory allocation will be static or dynamic. Static allocation makes on-chip memory utilization inefficient, especially for applications whose memory requirements change significantly during run time. Allocating memory among the PEs dynamically can make utilization more efficient. But dynamic memory allocation is not deterministic; moreover, it typically requires hundreds or thousands of clock cycles in the worst case, and dynamic memory management can consume a large share of a program's execution time. To reduce the execution time of dynamic memory management routines and make that time deterministic, many researchers have proposed hardware accelerators for dynamic memory management.
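To make the determinism concern concrete, here is a minimal software sketch of a fixed-size block pool, in which every allocation and release is a constant-time operation. It is only an illustration of bounded-time dynamic allocation, not one of the hardware accelerators the slide refers to; the class name `BlockPool` and its interface are hypothetical.

```python
from typing import List, Optional, Set


class BlockPool:
    """Fixed-size block allocator over a memory region of num_blocks blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self._free: List[int] = list(range(num_blocks))  # stack of free block indices
        self._in_use: Set[int] = set()

    def alloc(self) -> Optional[int]:
        """Return the byte offset of a free block, or None if the pool is empty."""
        if not self._free:
            return None
        idx = self._free.pop()          # constant time: no search, no splitting
        self._in_use.add(idx)
        return idx * self.block_size

    def free(self, offset: int) -> None:
        """Release a block previously returned by alloc()."""
        idx, remainder = divmod(offset, self.block_size)
        if remainder or idx not in self._in_use:
            raise ValueError("offset does not refer to an allocated block")
        self._in_use.remove(idx)
        self._free.append(idx)          # constant time: push back on the stack


if __name__ == "__main__":
    pool = BlockPool(num_blocks=4, block_size=256)
    a = pool.alloc()
    b = pool.alloc()
    pool.free(a)
    print(a, b, pool.alloc())
```

The trade-off is internal fragmentation: requests smaller than `block_size` still consume a whole block, which is the price paid for a predictable worst case.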

Dynamic Power Management
- Dynamic power management saves power by lowering frequency, using different clock domains according to the run-time variation of data content.
- Pre-computation removes repetitive switching.
- Links are connected only for valid stream data, which removes unnecessary switching.
- Reconfigurable clock based system balancing creates an environment of just-in-time computing, which can reduce overall power usage.
- Prefetch many frames in an optimal-sized buffer.
aSoC dynamic power management uses the equation presented on the slide together with the run-time variation of data content. What this paper implements is power saving through the effective capacitance term of that equation and through frequency reduction using different clock domains. Effective capacitance is reduced by eliminating unnecessary switching activity.
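The notes refer to an equation that the transcript does not reproduce. As a sketch, assume the standard CMOS dynamic power model P = C_eff · V_dd² · f, where C_eff is the effective switched capacitance (physical capacitance weighted by switching activity). The function names and the numeric values below are hypothetical and chosen only to show how the two levers mentioned above, effective capacitance and frequency, affect power.

```python
def dynamic_power(c_eff_farads: float, vdd_volts: float, freq_hz: float) -> float:
    """Dynamic power under the assumed model P = C_eff * Vdd^2 * f (watts)."""
    return c_eff_farads * vdd_volts ** 2 * freq_hz


def relative_saving(baseline_watts: float, managed_watts: float) -> float:
    """Fraction of baseline power saved by power management."""
    return 1.0 - managed_watts / baseline_watts


if __name__ == "__main__":
    # Illustrative values only, not measurements from the paper.
    baseline = dynamic_power(c_eff_farads=1e-9, vdd_volts=1.8, freq_hz=100e6)

    # Lever 1: run the tile in a clock domain at half the frequency.
    slower = dynamic_power(1e-9, 1.8, 50e6)

    # Lever 2: suppress unnecessary switching, lowering effective capacitance.
    quieter = dynamic_power(0.7e-9, 1.8, 100e6)

    print(f"baseline: {baseline:.3f} W")
    print(f"half frequency saves {relative_saving(baseline, slower):.0%}")
    print(f"30% less switched capacitance saves {relative_saving(baseline, quieter):.0%}")
```

Under this model both levers act linearly, so halving the clock of an idle tile and removing invalid transfers compound multiplicatively.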

Power Metric
Based on network activity and HSPICE circuit simulation of the interconnect, the network power consumption (P_int) is expressed in terms of the following quantities:
- T: the number of tiles
- P_IF/D: overhead of the instruction memory fetch and decode
- s: the number of streams
- N_vs and N_ivs: the number of valid and invalid transfers for stream s
- P_s: the power consumed in transferring 1 bit through stream s

Energy Issue in On-chip Bus Arbitration
- Centralized bus arbitration
  - Becomes energy inefficient as the bus scales up
  - The energy cost of communicating with the arbiter, and the arbiter complexity, grow more than linearly
- Distributed bus arbitration
  - Code division multiple access (ISSCC'00)
- Research on this problem has only just begun.

HW/SW Codesign Methodology

Hardware/Software Co-design: Definition
- Systematic and efficient design of systems that combine hardware (ASIC, FPGA) and software (DSP, MCU)
- Meeting system-level objectives by exploiting the synergism of hardware and software through their concurrent design
- To hardware if speed, power, area and special ...
- Use software as a means of differentiating products based on the same hardware platform.
- Performance requirements: some functions are easier to implement in hardware, such as blocks that are used repeatedly and blocks structured for parallel execution.
- Modifiability: blocks implemented in software are easy to modify.
- Implementation cost: blocks implemented in hardware can be shared.
- Scheduling: SW and HW can be scheduled concurrently as long as there are no data or control dependencies.

HW/SW Co-Synthesis: Pareto Point

Time-Space Exploration
- Enumerate all trade-offs and select the one with the most benefit.
- Branch and bound method for estimating SoC metrics.
Jiang Xu and Wayne Wolf, Princeton University
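A minimal sketch of the enumeration idea: given candidate HW/SW partitions scored by execution time and area, keep only the Pareto-optimal points. This is a plain exhaustive filter, not the branch and bound method of Xu and Wolf; the function name `pareto_front` and the sample design points are hypothetical.

```python
from typing import List, Tuple

Design = Tuple[str, float, float]  # (name, execution time, area); lower is better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Return the designs that no other design dominates, sorted by time.

    A design is dominated if another one is no worse in both metrics
    and strictly better in at least one.
    """
    front = []
    for name, time, area in designs:
        dominated = any(
            (t2 <= time and a2 <= area) and (t2 < time or a2 < area)
            for _, t2, a2 in designs
        )
        if not dominated:
            front.append((name, time, area))
    return sorted(front, key=lambda d: d[1])


if __name__ == "__main__":
    # Illustrative candidates only.
    candidates = [
        ("all_sw", 10.0, 1.0),
        ("mixed", 5.0, 3.0),
        ("all_hw", 2.0, 8.0),
        ("poor_mix", 6.0, 4.0),   # slower and larger than "mixed"
    ]
    for design in pareto_front(candidates):
        print(design)
```

The designer then picks one point from the front according to the system's time or area budget; the filter only removes choices that are never worth taking.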

Methodology Requirement:
Need for revolutionary design methods enabling:
- Faster time to market through IP reuse, standard communication interfaces, and a scalable interconnect topology (NoC)
- Increased flexibility through SW programmability and configurable HW
- Mapping an application to a platform, to increase the productivity of a platform user

Integrated H/W and S/W Low-Power Design
[Figure: design flow. Steps: algorithm selection; S/W core energy estimation; clustering; SW energy-efficiency calculation; cluster scheduling; HW/SW integration; HW energy-efficiency calculation; system-level energy estimation; cluster selection; H/W synthesis and energy estimation. Tools named: Matlab/SPW, Cossap, Synopsys, DSP Station, ORINOCO, Seamless, Co-centric, Signal-master.]

Reconfigurable Platform-Based Design Method
- Real-time reconfiguration architecture with minimum configuration time
- Design space exploration
- Dynamic Memory and Power management On a Chip (MPoC)
Low power signal processing using aSoC
- Introduction
- Overview of the aSoC architecture
- Survey of the VLSI power management techniques currently in use, as a basis for aSoC dynamic power management
- Experimental approach
- Results

Why ASIPs? The Energy-Flexibility Gap

Hw/Sw Partitioning on Single-Chip Platforms
- Numerous single-chip commercial devices with uP and FPGA:
  - Triscend E5 (shown)
  - Triscend A7
  - Atmel FPSLIC
  - Xilinx Virtex II Pro
  - Altera Excalibur
  - More sure to come…
- Make hw/sw partitioning even more attractive
[Figure: Triscend E5 block diagram with configurable logic, uP and peripherals, and cache/memory.]

iSoC
- iSoC is an on-chip communication architecture for improving the scalability and flexibility of SoC design.
- Dynamic configuration
- Its regular, flexible structure provides a predictable framework for modeling the traffic, power, speed, and area requirements of global communication.
aSoC is an on-chip communication architecture for improving scalability and flexibility in SoC design; this paper demonstrates dynamic power management that exploits these properties.

iSOC Compiler
- Divides applications into parts, each of which fits into a specific core.
- Determines data communications between the cores in a space-time fashion.
- Generates interconnect memory contents for each individual interface.
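The space-time scheduling step can be sketched as a greedy slot assignment: each stream follows a fixed path of links, advances one link per communication clock cycle, and starts at the earliest cycle at which every link on its path is free when the stream reaches it. This is an assumed, simplified policy for illustration, not the compiler's actual algorithm; `schedule_streams` and the link names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def schedule_streams(streams: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign each stream a start cycle so that no link is used twice per cycle.

    streams maps a stream name to the ordered list of links it traverses;
    hop k of a stream occupies its k-th link during cycle start + k.
    """
    busy: Set[Tuple[str, int]] = set()   # (link, cycle) pairs already reserved
    start_cycle: Dict[str, int] = {}
    for name, links in streams.items():
        start = 0
        # Slide the whole pipelined transfer later until every hop is conflict-free.
        while any((link, start + hop) in busy for hop, link in enumerate(links)):
            start += 1
        for hop, link in enumerate(links):
            busy.add((link, start + hop))
        start_cycle[name] = start
    return start_cycle


if __name__ == "__main__":
    # Hypothetical three-tile row: t0 -> t1 -> t2.
    demo = {
        "a": ["t0-t1", "t1-t2"],
        "b": ["t0-t1"],
        "c": ["t1-t2"],
    }
    print(schedule_streams(demo))
```

The reserved (link, cycle) pairs for one interface correspond to the per-cycle crossbar settings that would be written into that interface's interconnect memory.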

Application-specific multiprocessor SoC design flow

Mission Statement
To carry out R&D programs which are 3 to 10 years ahead of today's industrial needs in the field of design technology for integrated information and communication systems for human well-being:
- Reconfigurable SoC
- Multi-media, multi-mode terminals
- BAN for health monitoring

Conclusion
- New computing architecture paradigm
- Architectural exploration tools
- Dynamic real-time reconfiguration architecture with minimum configuration time
- Dynamic Memory and Power management On a Chip (MPoC)

References
- Jian Liang, Sriram Swaminathan, and Russell Tessier, "aSOC: A Scalable, Single-Chip Communications Architecture," Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA
- Krishna Sekar, Kanishka Lahiri, and Sujit Dey, "Configurable Platforms With Dynamic Platform Management: An Efficient Alternative to Application-Specific System-on-Chips," Dept. of ECE, UC San Diego, La Jolla, CA; NEC Laboratories America, Princeton, NJ

References
- Ackland et al., A Single Chip, 1.6 Billion, 16b MAC/s Multiprocessor DSP, IEEE JSSC, March 2000
- Agarwal, Raw Computation, Scientific American, August 1999
- Benini and De Micheli, Networks on Chip: A New SoC Paradigm, IEEE Computer, January 2002
- Benini and De Micheli, Powering Networks on Chip, Proceedings ISSS, October 2001
- Bertozzi, Benini and De Micheli, Low-Power Error-Resilient Codes for On-Chip Data Busses, DATE 2002
- Dally and Towles, Route Packets, not Wires, DAC 2001
- Guerrier and Grenier, A Generic Architecture for On-Chip Packet Switched Interconnections, DATE 2000
- Ho, Mai and Horowitz, The Future of Wires, IEEE Proceedings, January 2001
- Hu and Marculescu, Energy Aware Mapping for Tile-Based NoC Architectures, ASPDAC 2003
- Rijpkema et al., Trade off in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip, DATE 2003
- Yoshimura et al., DS-CDMA Wired Bus with Simple Interconnection Topology for Parallel Processing System LSIs, ISSCC 2000
- Worm, Ienne, Thiran and De Micheli, An Adaptive, Low-Power Transmission Scheme for On-Chip Networks, ISSS 2002
- Ye, De Micheli, Benini, Packetized On-chip Interconnect Communication Analysis for MPSoCs, DATE 2003
- Zhang et al., A 1V Heterogeneous Reconfigurable DSP IC for Wireless Baseband Digital Signal Processing, JSSC, November 2000

Co-design On-line Sites
- IMEC ftp reports (Cathedral): ftp://ftp.imec.be/pub/vsdm/reports/
- Stanford Tech Reports
- Synopsys Research Publications
- URLs to Hardware/Software Co-Design Research
- Bibliography of Hardware/Software Codesign
- Ralf Niemann's Codesign Links and Literature
