UNIVERSITY OF MORATUWA

PERFORMANCE EVALUATION OF VISION ALGORITHMS ON FPGA

By
Mahendra Gunathilaka Samarawickrama

This thesis is submitted to the Department of Electronic & Telecommunication Engineering of the University of Moratuwa in partial fulfillment of the requirements for the degree of Master of Science in Full Time Research.

University of Moratuwa, Sri Lanka
July, 2010

DECLARATION

I certify that this thesis does not incorporate without acknowledgement any material previously submitted for a degree or diploma in any university. Furthermore, it does not contain any material previously published, written or orally communicated by another person, except where due reference is made in the text, in the figure captions or in the table captions.

Mahendra G. Samarawickrama

To the best of our knowledge the above particulars are true and accurate.

Dr. A.A. Pasqual
Research Supervisor, Senior Lecturer, Electronic and Telecommunication Engineering.

Dr. Ranga Rodrigo
Research Supervisor, Senior Lecturer, Electronic and Telecommunication Engineering.

Abstract

Modern FPGAs enable system designers to develop high-performance computing (HPC) applications with a large amount of parallelism. Real-time image processing is one such application, demanding much more processing power than a conventional processor can deliver. In this research, we implemented software-based and hardware-based architectures on FPGA to achieve real-time image processing. Furthermore, we benchmark and compare our implemented architectures with existing architectures. The operational structures of those systems consist of on-chip processors or custom vision coprocessors implemented in a parallel manner with efficient memory and bus architectures. Performance properties such as accuracy, throughput and efficiency are measured and presented.
According to the results, FPGA implementations are faster than the DSP and GPP implementations for algorithms that can exploit a large amount of parallelism. Our image pre-processing architecture is nearly two times faster than the optimized software implementation on an Intel Core 2 Duo GPP. However, because DSPs and GPPs run at higher clock frequencies, sequential computations are slower on the on-chip processors in FPGAs than on DSPs/GPPs. These on-chip processors are well suited to multi-processor systems that exploit software-level parallelism. Our quad-Microblaze architecture achieved a 75-80% performance improvement compared to its single-Microblaze counterpart. Moreover, the quad-Microblaze design is faster than the single-PowerPC implementation on the FPGA. Therefore, multi-processor architectures with customised coprocessors are effective for implementing custom parallel architectures that achieve real-time image processing.

To my parents, family and teachers for giving me constant support and motivation.

Acknowledgment

I wish to thank my supervisors, Dr. Ajith Pasqual and Dr. Ranga Rodrigo, for their support and encouragement during this research. Their insight, guidance, feedback and especially their constructive criticism contributed enormously to the production of this thesis.

I am grateful to Dr. E.C. Kulasekere, the coordinator of this research, Dr. Chathura De Silva, the chairman of the progress review committees, and Prof. (Mrs.) I.J. Dayawansa, the postgraduate research advisor, for their feedback, kind advice and invaluable suggestions. I am deeply indebted to the other academics and administrators who provided helpful advice and knowledge during this research. I also wish to extend my gratitude to Zone24×7 (Pvt) Ltd. for providing laboratory facilities.
I acknowledge the financial support given by the University of Moratuwa Senate Research Committee through grant SRC-297, which enabled me to conduct the master's program at the University of Moratuwa.

Finally, I am thankful to my parents, family and friends for the care, commitment and support they extended to me during this research program.

Mahendra G. Samarawickrama
July 2010

Contents

List of Figures
List of Tables
Abbreviations
Notations

CHAPTER 1 Introduction
  1.1 Background
  1.2 Design Challenges
CHAPTER 2 Design and Implementation
  2.1 Image Pre-Processing Architecture
  2.2 Image Convolution Coprocessor
  2.3 Standalone PowerPC Architecture
  2.4 Single Microblaze Architecture
  2.5 Multiple-Microblaze Architecture
CHAPTER 3 Results
  3.1 Summary
CHAPTER 4 Conclusion
  4.1 Future Directions
References
APPENDIX A Virtex-5 FPGA ML505 Evaluation Platform
APPENDIX B Image Pre-processing Architecture
APPENDIX C Processor-Based Vision Architectures
APPENDIX D Fixed-Point Digital Signal Processors
APPENDIX E Sample Codes

List of Figures

1.1 Block diagram of Texas machine vision solution
1.2 Growth of FPGA memory, logic resources and bandwidth
1.3 Technology gap between demand and performance of a real time system
1.4 Execution times of Gaussian pyramid function under different memory configurations on a C64x DSP
1.5 Maximum operating frequency curves for one token in a linear pipeline of n stages
2.1 Layered architecture of a vision system
2.2 Design of a typical vision system
2.3 Image pre-processing architecture
2.4 Implemented memory-bank schematics
2.5 16KB memory-block architecture
2.6 Basic memory-bank implementation
2.7 Model of mapping pixel array into memory array
2.8 Internal data flow of the vision core for a 3×3 convolution
2.9 Bus interface of vision architecture developed with PowerPC-405
2.10 Bus interface of vision architecture developed with single Microblaze
2.11 Bus interface of quad-processor vision architecture developed with four Microblazes
2.12 Software operation-sequence of quad-processor vision architecture
3.1 Results of Gaussian smoothing by a 3×3 mask
3.2 Results of the edge detection by a 3×3 mask
3.3 Results of the right-angle-corner detector
3.4 Timings for a 3×3 Sobel edge detector for a 512×512 image on different platforms
3.5 Results of Gaussian smoothing with different masks
3.6 Results of Laplace sharpening with different masks
3.7 Speed of the histogram equalization for a 128×128 8-bit image, implemented with XCL and DDR-RAM
3.8 Performance improvement comparison from single-Microblaze to quad-Microblazes
A.1 Front view of the ML505 FPGA board
A.2 Back view of the ML505 FPGA board
B.1 Image pre-processing architecture developed in Verilog
B.2 Large (multiple 16KB) memory-bank architecture
C.1 Bus interface of vision architecture developed with single Microblaze
C.2 Overview of vision architecture developed with single Microblaze
C.3 Bus interface of vision architecture developed with PowerPC-405
C.4 Overview of vision architecture developed with PowerPC-405
C.5 Bus interface of quad-processor vision architecture developed with four Microblazes
C.6 Overview of quad-processor vision architecture developed with four Microblazes
D.1 Specification of TMS320C6414 digital signal processor
E.1 Send and receive an image without performing any operation on that image
E.2 Send and receive an image while performing the histogram-equalization algorithm on that image

List of Tables

1.1 HLL vs. HDL comparisons in core design
3.1 Device utilization summary for image pre-processing architecture
3.2 Timings for a 3×3 Sobel edge detector for a 512×512 image on different platforms
3.3 Speed of the 2-D convolution for an 8-bit 800×600 image with different mask sizes
3.4 Device utilization summary for 2-D convolution with different mask sizes
3.5 Speed of the histogram equalization for a 128×128 8-bit image
3.6 Speed of the histogram equalization for 8-bit images in different resolutions
3.7 Histogram equalization time per unit image area (100×100)
3.8 Performance improvement factor from single-processor to quad-processor for different image resolutions

Abbreviations

The following abbreviations and acronyms are used in this thesis.

Abbreviation/acronym   Meaning
ADDR                   Address: Memory location for read/write data
BRAM                   Block RAM
CLK                    Clock
CMP                    Chip Multiprocessor
DDR                    Double Data Rate
DIN                    Data Input: Data written into memory
DOUT                   Data Output: Synchronous output of the memory
DSP                    Digital Signal Processor
EDK                    Embedded Development Kit
EN                     Enable: Enables access to memory
EEPROM                 Electrically Erasable Programmable ROM
FPGA                   Field-Programmable Gate Array
GPP                    General Purpose Processor
HDL                    Hardware Description Language
HLL                    High-Level Language
LMB                    Local Memory Bus
LUT                    Lookup Table
ROM                    Read-Only Memory
PIF                    Performance Improvement Factor
PLB                    Processor Local Bus
SoC                    System on Chip
WE                     Write Enable: Allows data transfer into memory
XCL                    Xilinx Cache Link
XPS                    Xilinx Platform Studio

Nomenclature

The following symbols and notations are used in this thesis.

Notation   Meaning
T_nbhd     Time to read the neighborhood pixels around the first pixel of the image
N_mask     Kernel dimension
f_clk      Clock frequency
T_img      Total time needed to process all the pixels of the image
M_img      Number of pixels per image
T_SM       Execution time on the single-Microblaze architecture
T_QM       Execution time on the quad-Microblaze architecture