Published by Marilyn Barrett. Modified over 9 years ago.
1
A Multiprocessor System-on-Chip for Real-Time Biomedical Monitoring and Analysis: Architectural Design Space Exploration
Rustam Nabiev, Biomedical Engineering Dept., Karolinska University Hospital Huddinge, Stockholm, Sweden
Iyad Al Khatib, IMIT, ICT, KTH Royal Institute of Technology, Stockholm, Sweden
Davide Bertozzi, ENDIF, University of Ferrara, Ferrara, Italy
Mohamed Bechara, ECE, FEA, American University of Beirut, Beirut, Lebanon
Luca Benini, DEIS, University of Bologna, Bologna, Italy
Axel Jantsch, IMIT, ICT, KTH Royal Institute of Technology, Stockholm, Sweden
Francesco Poletti, DEIS, University of Bologna, Bologna, Italy
Hasan Khalifeh, ECE, FEA, American University of Beirut, Beirut, Lebanon
43rd Design Automation Conference (DAC 06)
2
Outline
Motivation
MPSoC for ECG analysis
  - ECG analysis algorithm
  - Architectural bottleneck analysis
Architecture exploration
  - Architecture tuning and optimization
  - Scalability analysis
  - Comparison with state-of-the-art solutions
Conclusions
3
Motivation
Heart disorders are by far the leading cause of death in the world for both women and men (29.2%) [UN World Health Organization Report 2003]
50% of these deaths could be avoided with a reliable combination of cost-effective monitoring and analysis
World market for biomedical devices for ECG monitoring > $1B [Novosense05]
[Figure: deaths by cause in the United States, 2003 (Heart Disease, Stroke, Other CVD, Cancer, COPD, Alzheimer), shown for all ages, <85, and 85+; axis from 0 to 1,000,000 deaths. Source: American Heart Association, Heart Disease and Stroke Statistics, 2006 Update.]
4
State of the art
The limited processing power and tight power budgets of Holter devices have traditionally restricted their functionality to data acquisition:
  - ECG recorded continuously for 24 hrs of normal activity
  - Recording of full ECG traces or abnormal events
  - Offline record analysis and ECG diagnosis at the medical center
Remote real-time monitoring through a communication link involves:
1. Transmission of a huge amount of life-critical data
2. A 100% functional always-on connection
5
Real-time ECG analysis
Real-time in-situ ECG monitoring and analysis aims to:
1. Promptly react to life-threatening heart malfunctions
2. Relax requirements on telemedicine links
Challenges
  - Physiological variability of QRS complexes
  - Baseline wander
  - Muscle noise
  - Artifacts due to electrode motion
  - Power-line interference
  - Preserve patient mobility
  - Moving from 3-lead to 12-lead analysis
  - Larger sampling frequencies
  - Tight power budgets
Sensor technology evolution
Scalable energy-efficient HW-SW platforms
Algorithm development
6
Contribution of this work
A flexible and scalable HW-SW MPSoC platform for real-time ECG analysis
1. Remove HW bottlenecks
  - Scalable computational horsepower
  - Scalable communication architecture
2. Remove SW bottleneck
  - Scalable algorithm for RT ECG analysis
  - Parallelization strategy
4. Create a functionally and timing-accurate virtual platform
  - 0.13 µm industrial technology-homogeneous power models
  - Integrates industrial IP cores, interconnect fabric, IOs
5. Explore the design space
  - Demonstrates 12-lead analysis at > 1 kHz
  - Performance and power analysis and tuning
[Figure: Tile 1 ... Tile N, each with a DSP, local memory, and private memory, connected to a shared memory.]
7
Medical background
ECG is an electrical recording of the heart activity
1-lead ECG signal: PQRSTU peaks
  - Each peak and inter-peak distance is related to a different heart activity
Sampling frequencies: 250 Hz, 1 kHz
  - Higher sampling frequencies might enhance analysis accuracy (e.g., resolve two peaks very close to each other)
A common analysis algorithm is Pan-Tompkins
  - QRS detection
  - Cascade of 4 filters: band pass, differentiator, squaring operation, and finally a moving window integrator
Traditionally, 3 sensors used (3-lead ECG)
Recently: 12-lead ECG analysis (9 sensors)
  - Much improved resolution
  - Heavy storage and computation requirements
[Figure: ECG waveform, voltage (mV) vs. time, with the P, Q, R, S, T, U peaks labeled.]
[Figure: electrode placement: chest electrodes V1 to V6; limb electrodes RA, LA, LL, RL (ground); Leads I, II, III and aVR, aVL, aVF.]
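The four-stage cascade named on this slide can be sketched in a few lines. This is a simplified illustration, not the published Pan-Tompkins filter design: the band-pass stage is a stand-in (difference of a short and a long moving average), and the function names and window lengths are assumptions chosen for readability.

```python
def moving_average(x, width):
    """Causal moving average; the first samples average over a shorter window."""
    out, acc = [], 0.0
    for i, v in enumerate(x):
        acc += v
        if i >= width:
            acc -= x[i - width]
        out.append(acc / min(i + 1, width))
    return out


def band_pass(x, short=3, long=25):
    """Stand-in band-pass (assumption): short moving average minus long one."""
    fast = moving_average(x, short)
    slow = moving_average(x, long)
    return [f - s for f, s in zip(fast, slow)]


def differentiate(x):
    """Central difference; the two end samples are left at zero."""
    d = [0.0] * len(x)
    for i in range(1, len(x) - 1):
        d[i] = (x[i + 1] - x[i - 1]) / 2.0
    return d


def square(x):
    """Point-wise squaring: makes all values positive and emphasizes steep slopes."""
    return [v * v for v in x]


def qrs_energy(ecg, window=30):
    """Band-pass -> differentiate -> square -> moving-window integrate."""
    return moving_average(square(differentiate(band_pass(ecg))), window)


# Toy input: flat baseline with one narrow spike standing in for an R peak.
signal = [0.0] * 200
signal[100] = 1.0
energy = qrs_energy(signal)
peak_index = max(range(len(energy)), key=energy.__getitem__)
```

The integrator output rises shortly after the spike and stays high for roughly one window length, which is why a QRS detector built on this cascade still needs a separate thresholding step to locate the R peak itself.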
8
Proposed ECG system
MPSoC system with off-chip SDRAM memory
Processing chain: interconnection of up to 9 sensors (up to 12 channels) -> A/D (16-bit A/D converters 1 to N) -> filtering (IIR filters 1 to N) -> storage in 16-bit binary format (off-chip SDRAM) -> MPSoC for RT analysis
  - Commercial off-the-shelf sensors: Ambu Inc. silver/silver chloride "Blue sensor R" (www.ambuusa.com)
  - A/D conversion: up to 10 kHz
  - IIR filters to eliminate sensor noise and the effect of patient movements
  - 64 Mbyte off-chip SDRAM memory
  - ECG MPSoC based on STMicroelectronics components
  - Computation performed on chunks of 4 sec of recorded data (~4-beat cycles)
[Figure: 12-lead electrode placement (V1 to V6, RA, LA, LL, RL ground, Leads I, II, III, aVR, aVL, aVF).]
9
ECG analysis algorithm
ECG analysis starts from a reference point in the heart cycle
  - The R-peak is commonly used
Accurate detection of the R-peak of the QRS complex is a prerequisite for the reliable functioning of ECG analyzers [Bobbie2004]
ECG signal variability is high
  - R-peak detection might be inaccurate (e.g., R-T peak detection instead of R-R peak detection)
  - As a consequence, other QRS parameters will be inaccurate
Traditional techniques may fail to detect some serious heart disorders
  - R-on-T complex (premature ventricular complexes)
  - Risk of ventricular fibrillation
10
Novel approach to ECG analysis
Instead of looking for the R-peaks and then detecting the period, detect the period first (via autocorrelation) and then look for the peaks
  - By autocorrelation, derive the period without looking for peaks
  - Accurately find peaks in a time window equal to the period
[Equation: autocorrelation function of y, where y is the filtered input signal.]
But this comes at a higher computational complexity: 3.5 MMUL (1.75 million multiplications in our implementation)
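The formula on this slide did not survive extraction. For reference, the standard unnormalized autocorrelation of a length-N signal y at lag k is given below; whether the authors used exactly this normalization and lag range is an assumption.

```latex
R_y[k] = \sum_{n=0}^{N-1-k} y[n]\, y[n+k], \qquad k = 0, 1, \dots, N-1
```

For a periodic signal, R_y[k] has its largest peak away from k = 0 at the lag equal to the period, which is why the period can be read off without first locating individual R-peaks.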
11
Autocorrelation analysis
To estimate the heartbeat period, the ACF needs at least 4 sec of ECG data to give accurate results: 100% on the MIT-BIH database
If a function is periodic, its derivative is periodic
The autocorrelation function of the derivative gives the period
[Figure: ECG trace with the P, Q, R, S, T, U peaks and the next R peak, together with its derivative (R').]
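A minimal sketch of this idea, assuming a direct (non-FFT) autocorrelation and a plausible search range for the heartbeat period. The function names, the 0.3 s to 2.0 s search range, and the synthetic Gaussian "beats" are illustrative assumptions, not taken from the presentation.

```python
import math


def derivative(y):
    """First difference of the filtered signal."""
    return [y[i + 1] - y[i] for i in range(len(y) - 1)]


def autocorrelation(x, max_lag):
    """Direct autocorrelation for lags 0..max_lag (one multiply per term)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def estimate_period(y, fs, min_period_s=0.3, max_period_s=2.0):
    """Return the lag, in seconds, of the largest ACF value in the search range."""
    d = derivative(y)
    lo = int(min_period_s * fs)
    hi = min(int(max_period_s * fs), len(d) - 1)
    acf = autocorrelation(d, hi)
    best_lag = max(range(lo, hi + 1), key=lambda k: acf[k])
    return best_lag / fs


# Synthetic input (assumption): narrow Gaussian "beats" every 0.8 s,
# sampled at 250 Hz for 4 s, matching the 4-second window on the slide.
fs = 250
beat_period_s = 0.8
y = [
    sum(math.exp(-((n / fs - c * beat_period_s - 0.4) ** 2) / (2 * 0.01 ** 2))
        for c in range(5))
    for n in range(4 * fs)
]
period = estimate_period(y, fs)
```

Once the period is known, the peak search can be restricted to one window of that length, which is the second step described on the previous slide.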
12
MPSoC architecture
We exploit industrial IP cores (200 MHz system)
  - ST220 4-issue VLIW DSPs with 32 kB instruction and data caches
  - STBus interconnect from STMicroelectronics
  - In-house optimized memory controller with DMA capability
Whole system modeled with the MPSIM virtual platform
  - Cycle accurate and bus-signal accurate
  - Up to 200 kcycles/sec (Pentium 4, 3.5 GHz clock)
  - 0.13 µm technology-homogeneous industrial power models
[Figure: PE1 ... PEn, each with a 512 kB private memory, attached to the system global interconnect together with an 8 kB shared memory, hardware semaphores, an interrupt controller, and the memory controller to the off-chip SDRAM.]
13
The memory bottleneck
"Push" memory channel
  - Control block keeps a table of objects to be moved
    Table entries can be programmed by different cores
  - Transfer engine moves data
    Triggers bus and SDRAM transactions
  - Memory controller handles SDRAM accesses
[Figure: off-chip memory interface unit (controller, transfer engine, memory controller) between the interconnect and the SDRAM, showing the programming path from the core and the data transfer path.]
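The division of labor described here (cores program table entries, a transfer engine walks the table and moves the data) can be illustrated with a small software model. Everything in this sketch is hypothetical: the class and field names, the one-shot clearing of the table, and the use of Python lists for SDRAM and private memories are assumptions for illustration, not the actual controller design.

```python
from dataclasses import dataclass


@dataclass
class TransferEntry:
    """One table row: a block to copy from off-chip SDRAM to a core's private memory."""
    core_id: int
    src_offset: int   # start offset in the SDRAM model
    length: int       # number of 16-bit samples to move


class PushChannel:
    def __init__(self, sdram, num_cores):
        self.sdram = sdram                                   # list standing in for SDRAM
        self.private_mem = [[] for _ in range(num_cores)]    # one buffer per core
        self.table = []                                      # control-block table

    def program(self, entry):
        """Called by a core to register a transfer (the programming path)."""
        self.table.append(entry)

    def run_transfers(self):
        """Transfer engine: push each registered block to its destination."""
        for e in self.table:
            block = self.sdram[e.src_offset:e.src_offset + e.length]
            self.private_mem[e.core_id] = list(block)
        self.table.clear()   # simplification: entries are one-shot in this model


sdram = list(range(24))      # toy contents: 2 leads x 12 samples
channel = PushChannel(sdram, num_cores=2)
channel.program(TransferEntry(core_id=0, src_offset=0, length=12))
channel.program(TransferEntry(core_id=1, src_offset=12, length=12))
channel.run_transfers()
```

The point of the push model is that a core only writes a small descriptor; the bulk data movement happens without the core issuing individual SDRAM reads.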
14
STBus interconnect
Advanced features with respect to the widely used AMBA AHB
AMBA AHB
  - Straightforward shared bus topology
  - 2 data links, but only 1 active at a time
  - In-order completion
  - Transaction pipelining
STBus
  - Split request and response channels
  - Wait states can be masked, depending on the depth of slave FIFO buffers
  - Multiple outstanding transactions
  - Out-of-order completion
  - Low-latency arbitration
[Figure: forward and backward channels for AMBA AHB and STBus.]
15
Flexible bus topology
STBus can be instantiated as a shared bus or as a partial or full crossbar
[Figure: partial crossbar and full crossbar topologies.]
16
Crossbar-based interconnect
  - Each private memory sits on a crossbar branch, accessible by its DSP or by the MemCtrl master port
  - MemCtrl slave port for DMA programming
  - Partial grouping of initiators and targets may result in marginal performance penalties while reducing interconnect area (partial vs. full crossbars)
[Figure: two crossbar configurations connecting DSP 1 ... DSP N, private memories 1 ... N, the memory controller, the interrupt controller, the shared memory, and the semaphores.]
17
Data management - I
Local storage of input data via DMA
Each DSP programs the DMA engine to periodically transfer input data chunks (4 sec of ECG signal) to its private on-chip memory
  - With 1 kHz sampling frequency and 12 processors, the required bandwidth is 6 Mbyte/sec (DMA programming plus actual data transfers)
  - Negligible with respect to the STBus bandwidth (with 1-wait-state memory, it exceeds 400 Mbyte/sec)
[Figure: communication architecture with PE1 ... PEn, 512 kB private memories, 8 kB shared memory, hardware semaphores, interrupt controller, memory controller, and off-chip SDRAM.]
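As a rough sanity check on the sizes involved, the sketch below computes the size of one 4-second chunk and the sustained sample payload, assuming 16-bit samples and one lead per DSP. It counts sample payload only; the 6 Mbyte/sec figure on the slide also covers DMA programming and is not rederived here. The constant and function names are illustrative.

```python
SAMPLE_BYTES = 2                    # 16-bit samples from the A/D stage
CHUNK_SECONDS = 4                   # analysis window per transfer
LEADS = 12                          # assumption: one lead per DSP
PRIVATE_MEM_BYTES = 512 * 1024      # private memory size from the architecture slide


def chunk_bytes(fs_hz):
    """Bytes in one 4-second chunk of a single lead."""
    return fs_hz * CHUNK_SECONDS * SAMPLE_BYTES


def payload_rate_bytes_per_s(fs_hz, leads=LEADS):
    """Sustained sample payload for all leads, excluding DMA programming traffic."""
    return fs_hz * SAMPLE_BYTES * leads


for fs in (250, 1000):
    fits = chunk_bytes(fs) <= PRIVATE_MEM_BYTES
    print(fs, chunk_bytes(fs), payload_rate_bytes_per_s(fs), fits)
```

Under these assumptions a single-lead chunk at 1 kHz is 8000 bytes, far below the 512 kB private memory, which is consistent with the slide's point that input data movement is not the limiting factor.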
18
Data management - II
Independent computation of each DSP in its private memory
High communication bandwidth requirement on the interconnect
More leads can be processed by the same DSP
  - The RTEMS OS supports multiple tasks
[Figure: same communication architecture diagram, highlighting cache line refills.]
19
Data management - III
64 bytes of output data are written to the shared memory
  - Negligible bus bandwidth
When the shared memory gets filled beyond a certain level, stored output data can be swapped to the off-chip SDRAM
  - 8 hours of history can be recorded
  - Data can also be transmitted via a telemedicine link
[Figure: same communication architecture diagram, highlighting the output path to the 8 kB shared memory.]
20
PE efficiency
We compared the performance of ST220 VLIW DSPs with that of ARM7TDMI cores
  - Same cache size (32 kB)
  - Processing of 1 ECG lead on 1 core
  - 250 Hz sampling frequency
High-quality VLIW code generation
  - ARM7 (no Thumb) executable is 1.7 times larger
  - Static IPC for the 4-issue ST220 VLIW DSP: 2.9
Result for the ST220:
  - 9 times faster
  - 2.5 times more energy-efficient
21
Architectural tuning
Let us configure the system to satisfy application requirements at the minimum hardware cost
Processing of 4 sec of input data (250 Hz sampling frequency), 12-lead ECG
Execution time scales linearly
  - Communication architecture (shared STBus) is well tuned
  - Peak memory controller bandwidth satisfies the performance requirement
22
Architectural tuning
Processing of 4 sec of input data (1 kHz sampling frequency), 12-lead ECG
Load increases quadratically with sampling frequency
  - About 3 sec for 1 DSP to process 12 leads
Employing more processors is more effective here
  - Smoother energy degradation
  - Larger margin for heart disorder diagnosis
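The quadratic growth follows from the autocorrelation step if it is computed directly over all lags, which is an assumption about the implementation. With a fixed window of T = 4 sec, the number of samples N grows linearly with the sampling frequency and the multiplication count C grows with its square:

```latex
N = f_s \, T, \qquad C(f_s) \propto N^2 = (f_s T)^2
\quad\Longrightarrow\quad
\frac{C(1\,\text{kHz})}{C(250\,\text{Hz})} = \left(\frac{1000}{250}\right)^2 = 16
```

So moving from 250 Hz to 1 kHz multiplies the per-lead autocorrelation workload by about 16, not 4.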
23
Looking forward
What is the maximum achievable sampling frequency while meeting real-time requirements?
Setup: 12 processors running, 12-lead ECG, 3.5 sec real-time requirement
Typical state-of-the-art frequency range is 250 Hz to 1 kHz
  - 2.2 kHz achievable with a shared bus
  - About 4 kHz with an optimized partial crossbar
24
Interconnect optimization
Shared bus: system interconnect saturation limits performance scalability
  - 100% busy at 2.2 kHz
Crossbar: parallel topology removes scalability limitations
  - Doubled system performance
  - Average and minimum bus transaction latencies are close to each other
  - High bus efficiency: 72% of the bus busy time
  - Partial crossbar almost equals full crossbar performance with almost 3 times less hardware resources: 5x5 instead of 13x13
Now the architecture is computation-limited
25
Comparison with research/commercial ECG SoCs
Let us compare two of our MPSoC platform instances with similar designs in research and on the market

Application results | Pre-filter | Leads per SoC | Real-time analysis window | Freq. (Hz) | Memory | Data bits | Solution
Only QRS; only decides if healthy or unhealthy | Notch | 1 | No info | 250 | 8 kB cache | 10 | [1]
Only QRS; only decides if healthy or unhealthy | IIR | 8 | No info | 800 | No info | 12 | [2]
Heart period; P, Q, R, S, T, U peaks; potential disease detection | IIR | 12 | < 3.5 s | 4000 | 512 kB private memories, 32 kB I-cache and 32 kB D-cache | 16 | Partial crossbar
Same as above | IIR | 12 | < 4 s | 2200 | Same as above | 16 | Shared bus

[1] Chang, M. et al., Design of a System-on-Chip for ECG Signal Processing, 2004 IEEE Asia-Pacific Conference on Circuits and Systems, December 2004.
[2] Freescale Semiconductor, Personal Electrocardiogram (ECG) Monitor, http://www.freescale.com/
26
Conclusions
Real-time nomadic ECG analysis challenges
  - 12-lead, multi-kHz frequency
  - Algorithmic robustness
  - Software parallelization
  - Hardware bottlenecks (computation and communication architecture)
  - Real-time diagnosis
Autocorrelation-based algorithm is a promising alternative to traditional techniques
  - MPSoC required to handle increased computational requirements
HW-SW platform exploration
  - VLIW DSP more energy-efficient than RISC core
  - Bus-based interconnect limits rate to 2 kHz
  - VLIW core becomes the bottleneck at 4 kHz
Future: explore DVFS and power management
27
Filtering stage
  - Filters out DC offsets and signal interferences
  - Hardware-implemented order-3 IIR filter
  - Output results in 16-bit binary format
  - Facilitates peak resolution and makes heartbeat period computation more precise
[Figure: two confusing R peaks before filtering; one clear R peak after filtering.]
28
The bus bottleneck
Bus bandwidth saturation limits scalability of state-of-the-art SoCs
Trends
  - Evolution of communication protocols (AMBA AHB, STBus, CoreConnect, AMBA AXI)
  - Evolution of bus topology (shared bus, partial/full crossbar, multi-layer architecture)