
An Adaptive Implementation of a Dynamically Reconfigurable K-Nearest Neighbor Classifier On FPGA (2012)
Hanaa M. Hussain, Khaled Benkrid — School of Engineering, Edinburgh University, Edinburgh, Scotland, U.K. ({h.hussain, k.benkrid}@ed.ac.uk)
Huseyin Seker — Bio-Health Informatics Research Group, De Montfort University, Leicester, England, U.K. (hseker@dmu.ac.uk)
Presented by: Dunia Jamma, PhD Student
Course Instructor: Prof. Shawki Ariebi
School of Engineering, University of Guelph

Outline
- Introduction
- Background of KNN
- KNN and FPGA
- The proposed architectures
- Dynamic Partial Reconfiguration (DPR) part
- The achievements
- Advantages and disadvantages
- Conclusion

Introduction
- K-nearest neighbour (KNN) is a supervised classification technique.
- Applications of KNN include data mining and image processing of satellite and medical images, etc.
- KNN is known to be robust and simple to implement when dealing with small data sets.
- KNN performs slowly when the data are large and high-dimensional.
- The KNN classifier is sensitive to the parameter K, the number of nearest neighbours.
- The label of a new query is selected by voting among those K points.
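The voting scheme above can be sketched in a few lines. This is a minimal software illustration, not the paper's hardware design; the function and variable names are ours, and the Manhattan metric is used since that is the distance the paper adopts.

```python
from collections import Counter

def knn_classify(query, samples, labels, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    Distances use the Manhattan metric, as in the paper.
    `samples` is a list of feature vectors, `labels` the matching class labels.
    """
    dists = [sum(abs(q - s) for q, s in zip(query, sample))
             for sample in samples]
    # Indices of the k smallest distances
    nearest = sorted(range(len(samples)), key=lambda i: dists[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy example: two classes in a 2-D feature space
samples = [(1, 1), (1, 2), (8, 8), (9, 8)]
labels = ["A", "A", "B", "B"]
print(knn_classify((2, 1), samples, labels, k=3))  # "A"
```

Note the sensitivity to K mentioned on the slide: with a different K, the vote among the nearest points can change, and so can the predicted label.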

[Figure: examples of 1-nearest-neighbour and 3-nearest-neighbour classification]

KNN Distance Methods
- The Manhattan distance is used to calculate the distance between the new queries and the stored training points: D(X, Y) = sum over i of |Xi - Yi|.
- The Manhattan distance was chosen in this work for its simplicity and lower cost compared to the Euclidean distance.
- Xi: the new query's matrix; Yi: the trained samples' matrix; K: number of samples.
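The cost argument on this slide can be seen directly: Manhattan distance needs only subtraction, absolute value, and addition, while Euclidean distance needs multiplications and a square root, which map to much costlier FPGA logic. A small software comparison (function names are ours):

```python
import math

def manhattan(x, y):
    # Only subtract, absolute value, add: cheap to build in hardware
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Needs multiplications and a square root: costlier FPGA logic
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = (3, 7, 1), (6, 2, 4)
print(manhattan(x, y))   # 3 + 5 + 3 = 11
print(euclidean(x, y))   # sqrt(9 + 25 + 9) ~= 6.557
```

Since KNN only ranks neighbours by distance, the cheaper metric often gives comparable classification results, which is what makes the trade-off attractive here.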

KNN and FPGA
- KNN classifiers can benefit from the parallelism offered by FPGAs.
- Distance computation is the time-consuming part, so it is the part that is parallelized.
- The authors propose two adaptive FPGA architectures (A1 and A2) of the KNN classifier, and compare each of them with an equivalent implementation running on a general-purpose processor (GPP).
- They also propose a novel dynamic partial reconfiguration (DPR) architecture of the KNN classifier for the parameter K.

Used Tools
Hardware implementation:
- ML403 platform board, which carries a Xilinx XC4VFX12 FPGA chip
- JTAG cable
- Xilinx PlanAhead 12.2 tool along with the Xilinx partial reconfiguration (DPR) flow
- Verilog as the hardware description language
Software implementation:
- Matlab (R2009b) Bioinformatics Toolbox
- Workstation with an Intel Pentium Dual-Core E5300 running at 2.60 GHz and 3 GB RAM

The Used Data
- M: number of training samples
- N: number of training vectors
- L: label
- Y: the trained data
- X: the new query
[Figure: the Y and X matrices]

The Proposed Architectures
The KNN classifier has been divided into three modular blocks (distance computation, KNN finder, and query label finder) plus FIFO memory.
- A1 architecture: M Dist PEs and K KNN PEs; total PEs = M + K + 1
- A2 architecture: N Dist PEs and N KNN PEs; total PEs = 2N + 1
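The PE totals above suggest which architecture is leaner for a given problem shape. A quick check, assuming the counts stated on this slide (M + K + 1 for A1, 2N + 1 for A2); the helper names and example numbers are ours, not from the paper:

```python
def pe_count_a1(m, k):
    # A1: M distance PEs, K KNN PEs, plus one label-finder block
    return m + k + 1

def pe_count_a2(n):
    # A2: N distance PEs and N KNN PEs, plus one label-finder block
    return 2 * n + 1

# Example shape with many training vectors (N) but few samples (M):
# A1 stays small while A2 grows with N, matching the conclusion that
# A1 suits N >> M and A2 suits N << M.
print(pe_count_a1(m=16, k=3))  # 20
print(pe_count_a2(n=128))      # 257
```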

The Functionality of PEs
[Figure: PE datapath — previous accumulated distance, inputs Dist1/Dist2 with labels L1/L2, Yi, and min/max selection]

Distance Computation
- The distance computations are performed in parallel every clock cycle.
- The latency of a Dist PE is M cycles.
- A1: the throughput is one distance result every clock cycle.
- A2: the throughput is one distance result every M clock cycles.
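A naive cycle-count model makes the throughput difference concrete. This assumes only what the slide states (M-cycle Dist PE latency; A1 producing one result per cycle thereafter, A2 one result every M cycles); the function names and example numbers are ours:

```python
def cycles_for_n_results_a1(m, n):
    # A1: pipeline fills in M cycles, then one distance result per cycle
    return m + (n - 1)

def cycles_for_n_results_a2(m, n):
    # A2: one distance result every M clock cycles
    return m * n

# With M = 8 samples and N = 100 training vectors:
print(cycles_for_n_results_a1(m=8, n=100))  # 107
print(cycles_for_n_results_a2(m=8, n=100))  # 800
```

The gap widens with N, which is consistent with A1 being the better fit when N >> M.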

Dist PE Inner Architecture
[Figure: internal datapath of the Dist PE]

K-Nearest Neighbour Finder
- This block becomes active after M clock cycles.
- Its function is completed after M + N clock cycles.

Dynamic Partial Reconfiguration (DPR) Part
- The value of the parameter K is dynamically reconfigured, while N, M, B, and C remain fixed for a given classification problem.
- Two cores (A1): the distance computation core is static; the KNN core (KNN PE, Label PE) is dynamic.
- The size of the reconfigurable partition (RP) is made large enough to accommodate the logic resources required by the largest K.
- Advantages: savings in reconfiguration time and power.
- Difficulties: resource limitations, the cost, and verifying the interfaces between the static region and the RP for all reconfigurable modules (RMs).

The Achievement
This DPR implementation offers a 5x speed-up in the reconfiguration time of a KNN classifier on FPGA.

Advantages
- Flexibility: the user can select the most appropriate architecture for the targeted application (available resources, performance, cost).
- Enhanced performance: parallelism provides classification speed-up, and DPR reduces reconfiguration time.
- Efficiency in terms of KNN performance, thanks to the DPR for the parameter K.
- The Manhattan distance offers simplicity and lower cost.

Disadvantages
- The amount of resources used.
- The achieved DPR speed-up (5x) is modest compared to the resources and effort it requires.
- Area constraints in the A2 architecture and the DPR part.
- Latency introduced by the pipelined manner of producing the results.

Conclusion
- An efficient design for different KNN classifier applications.
- Two architectures, A1 and A2, from which the user can choose: A1 targets applications where N >> M, whereas A2 targets applications where N << M.
- The DPR part could be reproduced with ICAP.
- Achievements compared to the GPP implementation: 76x speed-up for A1 and 68x for A2, plus a 5x speed-up in reconfiguration time with DPR.

Any questions?

Extra Slides

Memory
- Each FIFO is associated with one distance PE.
- The query vectors are streamed to the PEs and stored in registers, since they are required every clock cycle.
- B: the sample wordlength; M: the number of samples; N: the number of training vectors.

Class Label Finder
- The block consists mainly of C counters, each associated with one of the class labels.
- The hardware resources depend on the user-defined parameters K and C.
- The architecture of this block is identical in both A1 and A2.

A2 Architecture
- N FIFOs are used to store the training set, each with a depth of M.
- The class labels are streamed and stored in registers within the distance PEs.
- A2 requires more CLB slices than A1 when N, M, and K are the same.
- The first distance result becomes ready after all samples are processed, i.e., after M clock cycles.

DPR for K
- Maximum bandwidth of JTAG: 66 Mbps
- Maximum bandwidth of ICAP: 3.2 Gbps
- ICAP is more than 48x faster than JTAG
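The 48x figure follows directly from the two bandwidths quoted on the slide. A quick sanity check; the 100 KB partial-bitstream size below is a made-up example for illustration, not a figure from the paper:

```python
jtag_bw = 66e6   # bits per second (66 Mbps, per the slide)
icap_bw = 3.2e9  # bits per second (3.2 Gbps, per the slide)

ratio = icap_bw / jtag_bw
print(round(ratio, 1))  # 48.5 -> ICAP is roughly 48x faster

# Time to load a hypothetical 100 KB partial bitstream over each interface
bitstream_bits = 100 * 1024 * 8
print(f"JTAG: {bitstream_bits / jtag_bw * 1e3:.2f} ms")  # ~12.41 ms
print(f"ICAP: {bitstream_bits / icap_bw * 1e6:.2f} us")  # ~256.00 us
```

This is why the extra slides note that swapping JTAG for ICAP would shrink the reconfiguration time well beyond the reported 5x.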

Dynamic Partial Reconfiguration (DPR) Part
- JTAG was used (bandwidth = 66 Mbps).
- Using ICAP instead would decrease the configuration time (bandwidth = 3.2 Gbps).