Hardware acceleration of k-means clustering for satellite image compression
Francis J. Fattori, Alan D. George
Department of Electrical and Computer Engineering
Francis Fattori is a junior computer engineering student from Oxford, PA with interests in parallel processing methodologies, heterogeneous computing architectures, and space-based computing applications. Upon completing his undergraduate career, Francis hopes to continue conducting research in these areas while pursuing a master’s degree.
Dr. George is Department Chair, R&H Mickle Endowed Chair, and Professor of Electrical and Computer Engineering at the University of Pittsburgh (Pitt), and Fellow of the IEEE. His research interests focus upon high-performance architectures, apps, networks, services, systems, and missions for reconfigurable, parallel, distributed, and dependable computing, from satellites to supercomputers.
Significance Statement
This article explores the applicability of a hybrid CPU-FPGA system-on-chip design in accelerating a color-quantization application for satellite image compression, highlighting the improvements in both performance and energy efficiency against traditional linear computing methods.
Category: Experimental Research
Key Words: hybrid system-on-chip (SoC) platform, field-programmable gate array (FPGA), parallel computing, space-based computing.
Abstract
Image compression is a vital component of remote camera modules on satellites to facilitate file storage and network transfer. K-means clustering is an effective algorithm for lossy image compression, but its computational complexity can render the algorithm inefficient when implemented with serially functioning processors. Issues of execution latency are magnified for space-based embedded platforms, which contain radiation-hardened processors and memory units of lower overall performance and efficiency. This paper introduces a hybrid system-on-chip (SoC) design involving both a Central Processing Unit (CPU) and a Field-Programmable Gate Array (FPGA) to serve as an accelerator for a k-means clustering image-compression application on board satellites. A PYNQ-Z2 development board housing a Xilinx Zynq-7020 SoC was used for application testing. Multiple program executions with various test images revealed that the hybrid accelerator performed k-means clustering roughly 100 times faster than the software-only baseline while consuming only 1.19% of the energy. The application functioned at a compression ratio of 4:1 and produced output images with only minor losses in image quality.
1. Introduction
1.1 Onboard Image Compression for EO Satellites
Improvements in space camera units have enabled Earth Observation (EO) satellites to capture high-resolution images. In order for these photos to be retained for future use, image information must be stored in onboard memory units and/or downlinked to a database on Earth. However, radiation-tolerant flash memory systems present in most satellites are limited in storage capacity, and typical extraplanetary telecommunication networks do not have a bandwidth capable of downlinking full-resolution, raw images obtained by satellite cameras. Thus, a data-compression protocol is required to encode digital image information using fewer bits than the original representation.
The classification of data-compression algorithms as lossy or lossless, and the respective roles of these algorithm types in onboard image-compression modules, are described in [1]. To avoid transferring and retaining valueless high-resolution images, the common procedure for satellite photo transmission begins with downlinking lossy compressed images for preliminary analysis. Only after an image has been classified as meaningful for the particular application is the image information produced by lossless compression conveyed to ground stations. The work in this paper addresses the first component of this procedure by developing and analyzing a hybrid CPU-FPGA architecture to accelerate a lossy-compression algorithm known as k-means clustering.
In alignment with the recommendations of the Consultative Committee for Space Data Systems, most modern satellite image-compression units utilize complex algorithms involving wavelet transforms and bit-plane encoding [2]. Although k-means clustering is more limited in use, it can serve as the foundation of a simple yet flexible compression application for satellite subsystems with stringent power and performance constraints. Additionally, the procedures and results of this study provide a general guidepost for the FPGA acceleration of other clustering algorithms in color-quantization applications.
1.2 K-Means Clustering
Clustering is a method by which the observations of a dataset are partitioned into a specific number of groups. K-means clustering is a commonly used unsupervised algorithm that performs this task by classifying data points into k disjoint clusters via the use of centroids. The steps that characterize Lloyd’s algorithm, the traditional k-means clustering process, are described in [3]. The output of this algorithm is a set of k clusters, and a labeling of each data point that specifies the centroid to which it is assigned. With slight modifications, k-means clustering can be applied to digital images to achieve color quantization by grouping similarly colored pixels [4]. The information provided by the algorithm when applied in this manner enables a fundamental reformatting of image data that consumes less storage space.
1.3 Heterogeneous System-on-Chip Computing
A system-on-chip is an integrated circuit that consolidates all major computing elements on a single silicon substrate. This unification of what would otherwise appear as a multi-chip system permits a more compact form factor, reduces energy expenditure, and facilitates rapid data transfer. Given that space-based computing demands both low-latency application execution and minimal power consumption, the SoC is a fitting computing tool for on-orbit embedded systems. A hybrid SoC platform containing both a CPU and FPGA introduces a reconfigurable logic region to the conventional processing system. The FPGA fabric can be programmed to accelerate a particular algorithm, as repetitive data processing operations will execute faster on programmable logic hardware than in processing system software. Advancements in satellite-imaging technologies and the resultant increase in image resolution have spawned a research field concerned with the hardware acceleration of on-orbit image processing and compression. The inclusion of a reconfigurable FPGA fabric has become standard practice for algorithm acceleration in this computing domain [1].
Traditional software implementations of the k-means clustering algorithm are time-consuming. The serial processing nature of a CPU is deficient for k-means clustering, which requires multiple passes through highly repetitive distance calculations to achieve convergence. A hybrid SoC platform is better suited for a program of this type. The reconfigurable, parallelizable architecture of the FPGA fabric enables a fully-pipelined design capable of clustering data points at a much higher rate [5]. When paired with the rapid data allocation and control-flow capabilities of the CPU, the resulting accelerator can apply the target algorithm with significant speedup and energy savings.
Xilinx, Inc. is a semiconductor manufacturing company that specializes in onboard, programmable SoC production. Xilinx’s Zynq-7000 product line offers cost-optimized SoCs that integrate both software and hardware, making these chips ideal for space-based embedded systems. Xilinx also delivers the hardware development tools necessary to leverage high-level design techniques for their embedded platforms, namely the Vivado Design Suite and SDSoC.
2. Methods
2.1 K-Means Clustering for Image Compression
The application accelerated in this study employs k-means clustering to reduce the number of distinct colors in 3-channel digital images [3]. The k value is selected in such a way as to maximize compressibility while maintaining sufficient image quality, which is explored in Section 3. The k centroids μ1 … μk are initialized to the RGB values of randomly selected pixels in the image, ensuring that all clusters C1 … Ck contain at least one pixel at the conclusion of the algorithm. Considering an image of N total pixels, each pixel P1 … PN is assigned to the centroid μi of closest RGB value. To simplify computations and conserve FPGA resources, Manhattan distance is utilized in place of Euclidean distance [6]. This process is expressed by the formulas:
$$d(P_n, \mu_i) = |P_n - \mu_i| \tag{1}$$
$$C_i = \{\, P_n \mid d(P_n, \mu_i) \le d(P_n, \mu_j) \,\} \tag{2}$$
where 1 ≤ i, j ≤ k and i ≠ j. All arithmetic operations performed on pixels and clusters involve separate calculations for R, G, and B attributes. Next, each centroid μi must be set to the mean RGB value of all pixels Pn in its cluster Ci, as defined in the equation
$$\mu_i = \frac{1}{|C_i|} \sum_{P_n \in C_i} P_n \tag{3}$$
where the resulting RGB values of μi are truncated to integers to avoid expensive floating-point operations. The processes represented by formulas (2) and (3) are repeated until the average change in centroid RGB value is less than or equal to a threshold distance of 2. The final RGB values of the centroids are the k unique colors that will appear in the color-quantized image. Given that the optimization objective of k-means clustering is to minimize the Manhattan distance between the centroid and member pixels of each cluster
$$\sum_{i=1}^{k} \sum_{P_n \in C_i} d(P_n, \mu_i) \tag{4}$$
with respect to Ci and μi, the colors provided by the final centroids are adequate in preserving the visual information of the image. In the compressed image, all pixels will assume the RGB value of the centroid to which they are assigned.
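The procedure described in this section can be sketched in software as follows. This is a minimal illustration, not the code used in the study: the function names, the fixed random seed, the iteration cap, the handling of empty clusters, and the choice to measure centroid change as the Manhattan distance between old and new centroids are all assumptions made for the example.

```python
import random

def manhattan(p, q):
    """Manhattan distance between two RGB triples (sum of per-channel differences)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def nearest(pixel, centroids):
    """Index of the closest centroid; ties go to the lowest index."""
    return min(range(len(centroids)), key=lambda i: manhattan(pixel, centroids[i]))

def kmeans_quantize(pixels, k, threshold=2, seed=0, max_iter=100):
    """Cluster a list of (R, G, B) integer triples; return (centroids, labels)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    # Initialize centroids to the colors of randomly selected pixels.
    centroids = [tuple(p) for p in rng.sample(pixels, k)]
    for _ in range(max_iter):  # iteration cap is an assumption
        sums = [[0, 0, 0] for _ in range(k)]
        counts = [0] * k
        # Assignment step: accumulate each pixel into its closest cluster.
        for p in pixels:
            i = nearest(p, centroids)
            counts[i] += 1
            for ch in range(3):
                sums[i][ch] += p[ch]
        # Update step: integer (truncated) mean per channel.
        updated = []
        for i in range(k):
            if counts[i]:
                updated.append(tuple(s // counts[i] for s in sums[i]))
            else:
                updated.append(centroids[i])  # assumption: keep an empty cluster's centroid
        # Average centroid movement, compared against the threshold of 2.
        change = sum(manhattan(a, b) for a, b in zip(centroids, updated)) / k
        centroids = updated
        if change <= threshold:
            break
    labels = [nearest(p, centroids) for p in pixels]
    return centroids, labels

# Toy input: four dark and four bright pixels.
pixels = [(10, 10, 10), (12, 8, 11), (9, 13, 10), (11, 11, 12),
          (240, 240, 240), (238, 242, 239), (241, 237, 243), (239, 241, 240)]
centroids, labels = kmeans_quantize(pixels, k=2)
quantized = [centroids[i] for i in labels]  # every pixel takes its centroid's color
```

The `labels` list is what a compressed file would store per pixel, and `centroids` is the color table needed to reconstruct the quantized image.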
2.2 Evaluating Image Quality
The compression ratio of this application can be varied according to the number of colors k to which the image is quantized. Although greater compression is advantageous in minimizing image file size, it is accompanied by reductions in image quality. Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index are two common image quality metrics that measure the power of corrupting noise and feature dissimilarities in a compressed image as compared to an uncompressed standard. MATLAB was used to determine PSNR and SSIM values for images compressed with this application.
All image quality measurements were conducted on 10 Earth observation photos captured by digital space cameras mounted on the STP-H5-CSP experimental pallet of the NSF Center for Space, High-Performance, and Resilient Computing (SHREC). These photos were deliberately selected from this image set to provide a diverse range of geomorphological and meteorological scenes of various colors and patterns. The resolution of all test images is 489×410 pixels.
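For readers without access to MATLAB, the PSNR metric mentioned above can be approximated with a few lines of code. The sketch below is illustrative only and is not the script used in the study; it assumes 8-bit channels (peak value 255) and computes the mean squared error over all three channels of every pixel. SSIM is omitted because it requires windowed local statistics.

```python
import math

def psnr(original, compressed, peak=255):
    """PSNR in dB between two equal-length lists of (R, G, B) triples."""
    if len(original) != len(compressed):
        raise ValueError("images must contain the same number of pixels")
    squared_error = sum(
        (a - b) ** 2
        for p, q in zip(original, compressed)
        for a, b in zip(p, q)
    )
    mse = squared_error / (3 * len(original))  # mean over all channel samples
    if mse == 0:
        return math.inf  # identical images: no corrupting noise
    return 10 * math.log10(peak ** 2 / mse)

# Example: one channel of one pixel differs by a single intensity level.
reference = [(0, 0, 0), (255, 255, 255)]
degraded = [(0, 0, 0), (255, 255, 254)]
value_db = psnr(reference, degraded)
```

Higher values indicate less distortion, so a curve of PSNR against k is expected to rise as more colors are retained.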
2.3 Accelerator Architecture & Design
Figure 1: System Architecture Block Diagram. The high-level operations of each SoC computing region and the dataflow within and between the SoC and DDR memory are illustrated.
For this project, the Vivado Design Suite and SDSoC were coupled to enable Xilinx SoC programming using C-based design procedures. All hardware compilations targeted the PYNQ-Z2 development board, containing a Xilinx Zynq-7020 SoC. The final system architecture is illustrated in Figure 1. The CPU begins application execution by parsing the input image into 32-bit pixel elements that report individual RGB intensity values. Using this pixel information, the initial centroid values that seed the k-means clustering algorithm are computed. The CPU places both the pixel and centroid arrays in physically contiguous memory spaces in DDR, and invokes the direct memory access (DMA) controller to transfer these arrays to the FPGA. The instantiation of the DMA offloads data transfer tasks that would otherwise encumber the CPU and create a software bottleneck.
The FPGA is responsible for assigning each pixel to the closest centroid and updating cluster properties accordingly. Pixel units are stored in registers and enter a data pipeline that determines the cluster to which the pixel belongs. The use of parallel processing methodologies permits concurrent distance calculations and creates a throughput of one pixel assignment per clock cycle, which serves as a significant improvement in data processing efficiency. Upon exiting the pipeline, a pixel unit is evaluated for cluster membership and the corresponding elements of an RGB accumulation array are incremented appropriately. Once the pipeline is depleted and all pixels have been grouped, program control returns to software. The DMA delivers hardware output to the CPU, which then computes new centroid values and evaluates algorithm convergence.
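The division of labor between the two computing regions can be modeled in software as shown below. This is a behavioral sketch under stated assumptions, not the SDSoC source: the 0x00RRGGBB packing layout, the per-cluster pixel counters that accompany the RGB accumulation array, and all function names are hypothetical.

```python
def pack_rgb(r, g, b):
    """Pack three 8-bit channels into one 32-bit word (assumed layout 0x00RRGGBB)."""
    return (r << 16) | (g << 8) | b

def unpack_rgb(word):
    """Recover (R, G, B) from a packed 32-bit pixel word."""
    return (word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF

def assign_and_accumulate(words, centroids):
    """Model of one hardware pass: per-cluster RGB sums and pixel counts."""
    k = len(centroids)
    sums = [[0, 0, 0] for _ in range(k)]
    counts = [0] * k
    for word in words:
        r, g, b = unpack_rgb(word)
        # In the FPGA these k distances would be evaluated concurrently;
        # here they are computed one after another.
        dists = [abs(r - cr) + abs(g - cg) + abs(b - cb) for cr, cg, cb in centroids]
        i = dists.index(min(dists))
        sums[i][0] += r
        sums[i][1] += g
        sums[i][2] += b
        counts[i] += 1
    return sums, counts

def update_centroids(sums, counts, previous):
    """Model of the CPU step: truncated integer means from the accumulators."""
    updated = []
    for i, n in enumerate(counts):
        if n:
            updated.append(tuple(s // n for s in sums[i]))
        else:
            updated.append(previous[i])  # assumption: empty clusters are left unchanged
    return updated

# One pass over three pixels with two centroids.
words = [pack_rgb(10, 20, 30), pack_rgb(12, 22, 32), pack_rgb(200, 210, 220)]
start = [(0, 0, 0), (255, 255, 255)]
sums, counts = assign_and_accumulate(words, start)
new_centroids = update_centroids(sums, counts, start)
```

Returning only the accumulators, rather than a label per pixel, keeps the data sent back to the CPU proportional to k instead of to the image size, which is one plausible reason for the arrangement described above.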
3. Results
Compressed image quality was assessed for various k values in the clustering algorithm, and thus for various compression ratios in the application. Ten test images were compressed at seven different k values, each of which is a power of 2 within the range of 4 to 256 inclusive. As k increases along this set of selected values, each doubling adds one bit to every compressed pixel unit, and the corresponding compression ratio decreases accordingly. k values are restricted to powers of 2 since these numbers mark an upper boundary after which compressed pixel units must assume an additional bit to specify cluster membership. PSNR and SSIM were averaged across all test images after compression at each k value. The results are presented in Figure 2.
Figure 2: MATLAB Image Quality Plot. The blue curve corresponds to the left axis, depicting PSNR against k value. The orange curve corresponds to the right axis, depicting SSIM against k value. The k values analyzed are 4, 8, 16, 32, 64, 128, and 256.
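The relationship between k and compression ratio can be worked through numerically. The sketch below assumes that each source pixel occupies 24 bits (three 8-bit channels) and that each compressed pixel stores only a cluster index; under those assumptions, k = 64 yields the 4:1 ratio reported in this paper. The second function, which also charges for storing the k centroid colors, is an added illustration and not a figure from the study.

```python
def bits_per_index(k):
    """Bits needed to label one of k clusters (1 bit minimum)."""
    return max(1, (k - 1).bit_length())

def compression_ratio(k, source_bits=24):
    """Ratio of source bits to index bits per pixel, ignoring the color table."""
    return source_bits / bits_per_index(k)

def compression_ratio_with_palette(k, n_pixels, source_bits=24):
    """Same ratio, but including k centroid colors stored alongside the indices."""
    original = n_pixels * source_bits
    compressed = n_pixels * bits_per_index(k) + k * source_bits
    return original / compressed

# Ratios for the seven k values evaluated in Figure 2.
ratios = {k: compression_ratio(k) for k in (4, 8, 16, 32, 64, 128, 256)}

# Effect of the color table for one 489x410 test image at k = 64.
with_palette = compression_ratio_with_palette(64, 489 * 410)
```

Under these assumptions the ratio falls from 12:1 at k = 4 to 3:1 at k = 256, and the color table changes the k = 64 result only slightly for an image of this size.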
Upon designing the application for a 4:1 compression ratio, Vivado implementation reports delineated the degree of FPGA resource utilization required to realize the hardware accelerator. The availability of each of the four primary computing resources on the Zynq-7020 is compared to the resources demanded by the hardware function. The resource usage metrics highlighted in Table 1 apply exclusively to the compression of 489×410 images, as image resolution dictates the size of the pipeline in the accelerator.
| Resource Type | LUT | FF | BRAM | DSP |
| --- | --- | --- | --- | --- |
| # Available | 53200 | 106400 | 280 | 220 |
| # Used | 11009 | 17353 | 8 | 0 |
| % Utilization | 20.96 | 16.31 | 2.86 | 0 |
Table 1: Vivado resource utilization report for the FPGA accelerator. Hardware usage is measured over four component types: Lookup Table (LUT), Flip Flop (FF), Block Random Access Memory (BRAM), and Digital Signal Processor (DSP).
The k-means clustering compression program was implemented not only as a hybrid software-hardware accelerated application, but also as a software-only baseline application. Each implementation compressed the same 10 test images, and the average runtime and energy requirements are presented in Table 2.
| | Power Draw | Execution Time | Energy Consumption |
| --- | --- | --- | --- |
| Software Only | 1.402 W | 0.69276 s | 0.9712 J |
| Hybrid | 1.672 W | 0.00690 s | 0.0115 J |
| Relative Reduction | 0.839 | 100.404 | 84.191 |
Table 2: Application performance results for the software-only application and the hybrid application. Relative reduction metrics are obtained as the ratio of software-only results to hybrid results.
The execution times displayed in Table 2 were collected by physically running the application on the Zynq-7020 SoC at a clock frequency of 100 MHz. The power metrics were acquired from Vivado power analysis reports and serve as estimates for actual application execution. Energy consumption was computed as the product of instantaneous power draw and the execution time for each application.
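The energy figures follow directly from the power and time columns, as the short calculation below illustrates. It uses the rounded values printed in Table 2, so the resulting ratios can differ from the tabulated ones in the last digits; the variable and function names are chosen for this example only.

```python
def energy_joules(power_watts, time_seconds):
    """Energy as the product of power draw and execution time."""
    return power_watts * time_seconds

# Values as printed in Table 2.
sw_power, sw_time = 1.402, 0.69276
hy_power, hy_time = 1.672, 0.00690

sw_energy = energy_joules(sw_power, sw_time)
hy_energy = energy_joules(hy_power, hy_time)

speedup = sw_time / hy_time                      # execution-time reduction
energy_reduction = sw_energy / hy_energy         # energy reduction
energy_fraction_pct = 100 * hy_energy / sw_energy  # hybrid energy as % of baseline
```

Because the hybrid design draws more power (a power ratio below 1), the energy reduction is smaller than the speedup; the two are related by `energy_reduction = speedup * sw_power / hy_power`.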
4. Discussion
The plots of Figure 2 assume the expected pattern for image quality assessed at varying cluster quantities k. Beginning at high k values and moving left across the x-axis, initial reductions in k are met with slight degradations in PSNR and SSIM. However, more dramatic regressions in image quality appear upon dropping below a particular k value, which in this case is k=64. Since the marginal increase in image quality is negligible beyond k=64, this cluster quantity was selected for application development and testing. Although the exact k-value that constitutes the elbow of such image quality curves depends largely on the color composition of the captured images, the reverse exponential plot structure applies universally to any multispectral image set.
The resource utilization metrics of Table 1 reveal that the accelerator requires a relatively small fraction of FPGA computing components. The proposed design consumes very few BRAM units, as the majority of data arrays are partitioned completely to permit uniform data access in the clustering pipeline. This, in turn, consumes a greater number of FFs, which are required to store the individual elements of the arrays. Hence, FFs along with LUTs perform the vast majority of storage, logic, and arithmetic operations in the accelerator. The low proportion of resource usage suggests that this application can be scaled to compress images of higher resolution and/or implement additional clusters for greater compressed image fidelity.
The inclusion of the FPGA fabric in the k-means clustering compression application led to significant performance improvements. Although both solutions generate identical image output, the FPGA parallelized approach to pixel clustering operates roughly 100 times faster than the CPU serial implementation. At any given moment during application execution, the amount of power needed by the hybrid design slightly exceeds that of the software-only design. These results were anticipated, as the use of the accelerator activates additional resources on the SoC. However, the substantial reduction in execution time in the hybrid application more than compensated for this increase in power. Over a full execution, the accelerated application consumed roughly 84 times less energy than the software baseline. These savings are particularly significant for lossy compression applications of this type, as they will likely be invoked with every satellite image acquisition.
5. Conclusion
This paper proposed a hybrid CPU-FPGA accelerator for an onboard lossy compression application for satellite images. K-means clustering was employed to quantize the image color gamut to 64 colors and offer a compression ratio of 4:1, producing files that consume less space in flash memory and can be transferred to ground stations at faster speeds with lower energy. The accelerated design proved successful in offering considerable reductions in application runtime and energy consumption. These performance improvements are vital for space-based embedded SoC devices, as they are limited in computing capability and energy access.
6. Acknowledgements
Funding was provided by the Mascaro Center for Sustainable Innovation (MCSI) Undergraduate Research Program coordinated by Dr. David V.P. Sanchez, Gena Kovalcik, and Ellen Cadden. Additional assistance was provided by the Center for Space, High-performance, and Resilient Computing (SHREC) Summer Undergraduate Research Group.
7. References
[1] S. Lopez et al., The Promise of Reconfigurable Computing for Hyperspectral Imaging Onboard Systems: A Review and Trends, Proceedings of the IEEE, vol. 101, no. 3 (2013), pp. 698-722.
[2] C. Thiebaut and R. Camarero, CNES Studies for On-Board Compression of High-Resolution Satellite Images, in: B. Huang (Ed.), Satellite Data Compression, Springer, New York, 2011, pp. 29-46.
[3] S. Na, L. Xumin and G. Yong, Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm, Third International Symposium on Intelligent Information Technology and Security Informatics (2010), pp. 63-67.
[4] T. Saegusa and T. Maruyama, Real-Time Segmentation of Color Images based on the K-means Clustering on FPGA, International Conference on Field-Programmable Technology (2007), pp. 329-332.
[5] M. Gokhale et al., Experience with a Hybrid Processor: K-Means Clustering, The Journal of Supercomputing, vol. 26, no. 2 (2003), pp. 131-148.
[6] M. Estlick, M. Leeser, J. Theiler, and J. Szymanski, Algorithmic transformation in the implementation of k-means clustering on reconfigurable hardware, ACM/SIGDA 9th International Symposium on Field Programmable Gate Arrays (2001), pp. 103-110.