ETPL VLSI - 001
MACS: A Highly Customizable Low-Latency Communication Architecture
Networks-on-chips (NoCs) are an increasingly popular communication infrastructure in single-chip VLSI design for enhancing parallelism and system scalability. Processing elements (PEs) connect to a communication topology via NoC switches, which are responsible for runtime establishment and management of inter-PE communication channels. Since NoC switch design directly affects overall system performance and exploited communication parallelism, much previous work focused on efficient NoC switch design. In this paper, we present MACS, a highly parametric NoC switch architecture that provides reduced data transfer latency, increased designer flexibility, and scalability as compared to previous architectures by combining and enhancing several NoC design strategies. MACS enhances inter-PE communication using a circuit switching technique with minimal adaptive routing and a simple and fair path resolution algorithm to maximize bandwidth utilization. We evaluate area and performance of an FPGA implementation of MACS and show that, compared to previous work, MACS offers a 2× to 7× decrease in average channel setup latency, a 1.7× to 2× reduction in area requirements, similar average packet latency, up to a 6× increase in the network saturation point, and up to a 1.4× increase in bandwidth utilization. Additionally, we illustrate MACS's low average channel setup latency using six network traffic patterns and eight parallel JPEG decompression core trace simulations.
ETPL VLSI - 002
Low-Cost High-Performance VLSI Architecture for Montgomery Modular Multiplication
This paper proposes a simple and efficient Montgomery multiplication algorithm such that a low-cost and high-performance Montgomery modular multiplier can be implemented accordingly. The proposed multiplier receives and outputs the data in binary representation and uses only a one-level carry-save adder (CSA) to avoid carry propagation at each addition operation. This CSA is also used to perform operand precomputation and format conversion from the carry-save format to the binary representation, leading to a low hardware cost and short critical path delay at the expense of extra clock cycles for completing one modular multiplication. To overcome this weakness, a configurable CSA (CCSA), which could be one full-adder or two serial half-adders, is proposed to reduce the extra clock cycles for operand precomputation and format conversion by half. In addition, a mechanism that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay is developed. As a result, the extra clock cycles for operand precomputation and format conversion can be hidden and high throughput can be obtained. Experimental results show that the proposed Montgomery modular multiplier can achieve higher performance and significant area-time product improvement when compared with previous designs.
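As background for this abstract, the following is a minimal sketch of textbook radix-2 (bit-serial) Montgomery multiplication. It is not the paper's CCSA architecture: carry-save representation, operand precomputation, and operation skipping are omitted, and the function name and parameters are illustrative assumptions.

```python
def montgomery_multiply(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2**(-k) mod n for an odd modulus n < 2**k.

    Assumes 0 <= a, b < n. One bit of `a` is consumed per iteration,
    which mirrors the one-addition-per-cycle structure of a hardware
    multiplier (illustrative software model only).
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add b when bit i of a is set
        if s & 1:                 # adding the odd modulus makes s even
            s += n
        s >>= 1                   # exact division by 2
    # The loop keeps s < 2n, so one conditional subtraction suffices.
    return s - n if s >= n else s
```

The result carries an extra factor of 2**(-k), so operands are normally converted into the Montgomery domain (multiplied by 2**k mod n) before a chain of multiplications and converted back once at the end.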
ETPL VLSI - 003
DFSB-Based Thermal Management Scheme for 3-D NoC-Bus Architectures
Three-dimensional network-on-chip (NoC)-bus hybrid architectures are motivated to achieve lower propagation latency and higher bandwidth in the vertical direction by taking advantage of the short interwafer distances in 3-D integrated circuits. However, 3-D integration technology increases the power density of the chip, and thus results in thermal-related problems. Therefore, to ensure that the chip operates within the safe temperature range while keeping the traffic performance undegraded, this paper proposes a proactive thermal management scheme based on dynamic frequency scaling bus (DFSB) for developing thermal-aware 3-D NoC-bus architectures. The novel solution includes a thermal-aware frequency scaling policy (TFSP) and frequency-aware adaptive routing (FAAR), for the temporal and spatial management, respectively. TFSP dynamically and proactively adjusts the frequency of the DFSB, according to the predicted thermal variation, to throttle the data flow for heat dissipation. Meanwhile, FAAR cooperates with TFSP by migrating the data flow to balance the distribution of traffic and heat, so that unacceptable local data congestion and latency are avoided. In order to show the effectiveness of the proposed solution, we compare it against global throttling and downward routing thermal management solutions in a 4 × 4 × 4 3-D NoC-bus architecture. Experimental results show that, under the thermal limitation of 378.15 K, our proposed solution outperforms the other two solutions by 24% and 56.2% improvement in throughput, and 33.1% and 45.7% reduction in latency.
ETPL VLSI - 004
LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter
In this paper, we analyze the contents of lookup tables (LUTs) of distributed arithmetic (DA)-based block least mean square (BLMS) adaptive filter (ADF) and based on that we propose intra-iteration LUT sharing to reduce its hardware resources, energy consumption, and iteration period. The proposed LUT optimization scheme offers a saving of 60% LUT content for block size 8 and still higher saving for larger block sizes over the conventional design approach. We also present here the design of a register-based LUT matrix for maximal sharing of LUT contents and full-parallel LUT-update operation. Based on the proposed design approach, we have derived a DA-based architecture for the BLMS ADF, which is scalable for larger block sizes as well as higher filter lengths. We find that the hardware complexity of the proposed structure increases less than proportionately with input block size and filter length. It offers a saving of 60% LUT-update per output and 59% LUT access per output over the recently proposed DA-based BLMS ADF structure for block size 8 and filter length 64. Besides, the proposed structure involves nearly 30% saving in the iteration period over the other for 16-bit coefficient word length. Application specific integrated circuit (ASIC) synthesis result shows that the proposed structure for block size 8 offers a saving of 48% area-delay product (ADP) and 53% energy per sample (EPS) over the existing DA-based BLMS ADF structure on average for different filter lengths, and offers 30% higher sampling rate due to its shorter iteration period. Compared with the existing DA-based LMS ADF structure, the proposed structure involves 68% less ADP and 1.6× less EPS.
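To make the role of the LUT in distributed arithmetic concrete, here is a minimal sketch of a DA inner product. It assumes unsigned input samples and a fixed weight vector; the paper's intra-iteration LUT sharing and LUT-update scheme are not reproduced, and all names are hypothetical.

```python
def build_da_lut(weights):
    """LUT[addr] = sum of the weights whose index bit is set in addr."""
    n = len(weights)
    return [sum(w for k, w in enumerate(weights) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_inner_product(lut, samples, bits):
    """Inner product of the LUT's weights with unsigned `bits`-bit samples.

    No multiplier is used: each bit plane of the samples forms one LUT
    address, and the looked-up partial sums are shift-accumulated.
    """
    acc = 0
    for b in range(bits):
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k   # gather bit b of every sample
        acc += lut[addr] << b             # weight the partial sum by 2**b
    return acc


weights = [3, -1, 4, 2]
samples = [5, 9, 0, 15]
result = da_inner_product(build_da_lut(weights), samples, bits=4)
```

The LUT holds 2**N entries for N weights, which is why LUT content and LUT updates dominate the cost of a DA-based adaptive filter whose weights change every iteration.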
ETPL VLSI - 005
High-Speed and Energy-Efficient Carry Skip Adder Operating Under a Wide Range of Supply Voltage Levels
In this paper, we present a carry skip adder (CSKA) structure that has a higher speed yet lower energy consumption compared with the conventional one. The speed enhancement is achieved by applying concatenation and incrementation schemes to improve the efficiency of the conventional CSKA (Conv-CSKA) structure. In addition, instead of utilizing multiplexer logic, the proposed structure makes use of AND-OR-Invert (AOI) and OR-AND-Invert (OAI) compound gates for the skip logic. The structure may be realized with both fixed stage size and variable stage size styles, wherein the latter further improves the speed and energy parameters of the adder. Finally, a hybrid variable latency extension of the proposed structure, which lowers the power consumption without considerably impacting the speed, is presented. This extension utilizes a modified parallel structure for increasing the slack time, and hence, enabling further voltage reduction. The proposed structures are assessed by comparing their speed, power, and energy parameters with those of other adders using a 45-nm static CMOS technology for a wide range of supply voltages. The results that are obtained using HSPICE simulations reveal, on average, 44% and 38% improvements in the delay and energy, respectively, compared with those of the Conv-CSKA. In addition, the power-delay product was the lowest among the structures considered in this paper, while its energy-delay product was almost the same as that of the Kogge-Stone parallel prefix adder with considerably smaller area and power consumption. Simulations on the proposed hybrid variable latency CSKA reveal reduction in the power consumption compared with the latest works in this field while having a reasonably high speed.
ETPL VLSI - 006
A New XOR-Free Approach for Implementation of Convolutional Encoder
This letter presents a new algorithm to construct an XOR-free architecture of a power-efficient convolutional encoder. Optimization of XOR operators, which consume a significant amount of dynamic power, is the main concern when implementing polynomials over GF(2). The proposed approach completely removes the XOR-processing operation of a chosen nonsystematic, feed-forward generator polynomial and reduces the logical operators, and thereby the encoding cost. Hardware (HW) implementation of the proposed design uses read-only memory (ROM) with preprocessed addressing operations to reduce ROM size by nearly 50%. The new architecture reduces the dynamic power by up to 21.4% and HW cost by up to 15%, with lower design complexity as compared to the conventional method. The hardware cosimulation of the architecture is first validated and then implemented on a Xilinx Virtex-V FPGA.
ETPL VLSI - 007
A Low-Latency List Successive-Cancellation Decoding Implementation for Polar Codes
ETPL VLSI - 008
A Low-Power Broad-Bandwidth Noise Cancellation VLSI Circuit Design for In-Ear Headphones
Conventional active noise cancelling (ANC) headphones often perform well in reducing the low-frequency noise and isolating the high-frequency noise by earmuffs passively. The existing ANC systems often use high-speed digital signal processors to cancel out disturbing noise, which results in high power consumption for a commercial ANC headphone. The contribution of this paper can be classified into: 1) proper filter length selection; 2) low-power storage mechanism for convolution operation; and 3) high-throughput pipelining architecture. With these novel techniques, we develop an area-/power-efficient ANC circuit by using the TSMC 90-nm CMOS technology for in-ear headphone applications. The proposed feedforward filtered-x least mean square ANC circuit design uses a lower operating frequency and consumes much less power, which facilitates better performance than the conventional ANC headphones. To verify the effectiveness of the proposed design, a series of physical measurements is executed in an anechoic chamber. Measurement results show that the proposed high-performance/low-power circuit design can reduce disturbing noise of various frequency bands very well, and outperforms the existing works. The proposed design can attenuate 15 dB for broadband pink noise between 50 and 1500 Hz when operated at 20-MHz clock frequency at the costs of 84.2 k gates and power consumption of 6.59 mW only. Compared with the existing designs, the proposed work achieves 3 dB higher noise cancellation performance and saves 97% power consumption.
ETPL VLSI - 009
Low-Power High-Density STT MRAMs on a 3-D Vertical Silicon Nanowire Platform
In recent years, researchers have focused on reducing power dissipation and cell size to employ spin-transfer torque (STT) magnetic random-access memories (MRAMs) for embedded applications. Hence, the magnetic tunnel junctions (MTJs) with an optimized structure and magnetic properties are being explored to reduce the switching current. However, the switching current reduction in the MTJs generally lowers the data-retention capability. Hence, a different approach to reduce power dissipation using a novel select device should be considered. This paper, therefore, explores the STT MRAM with vertical silicon nanowire gate all around (GAA) high-k select device for superior performance. The MTJ is stacked above the vertical GAA device, so that both occupy the same footprint area to achieve high array density. Furthermore, enhancement of current drive using high-k gate dielectric and its impact on the STT MRAMs are analyzed at different feature sizes. The proposed STT MRAM cell with high-k dielectric (HfO2) lowers the power dissipation by 8%-25% and increases the write margins (WMs) up to 38%, with negligible increment in delay in comparison with the GAA device using low-k dielectric (SiO2). Moreover, asymmetry is introduced in the device configuration to achieve power savings of 25%-30% at high VDD. The proposed asymmetric high-k cell offers a substantially larger tradeoff window between high WMs and low power dissipation.
ETPL VLSI - 010
On the Total Power Capacity of Regular-LDPC Codes With Iterative Message-Passing Decoders
Motivated by recently derived fundamental limits on total (transmit + decoding) power for coded communication with VLSI decoders, this paper investigates the scaling behaviour of the minimum total power needed to communicate over AWGN channels as the target bit-error-probability tends to zero. We focus on regular-LDPC codes and iterative message-passing decoders. We analyse scaling behaviour under two VLSI complexity models of decoding. One model abstracts power consumed in processing elements (node model), and another abstracts power consumed in wires which connect the processing elements (wire model). We prove that a coding strategy using regular-LDPC codes with Gallager-B decoding achieves order-optimal scaling of total power under the node model. However, we also prove that regular-LDPC codes and iterative message-passing decoders cannot meet existing fundamental limits on total power under the wire model. Furthermore, if the transmit energy-per-bit is bounded, total power grows at a rate that is worse than uncoded transmission. Complementing our theoretical results, we develop detailed physical models of decoding implementations using post-layout circuit simulations. Our theoretical and numerical results show that approaching fundamental limits on total power requires increasing the complexity of both the code design and the corresponding decoding algorithm as communication distance is increased or error-probability is lowered.
ETPL VLSI - 011
Assessing the Suitability of King Topologies for Interconnection Networks
In recent years, many different interconnection networks have been used, with two main tendencies. One is characterized by the use of high-degree routers with long wires, while the other uses routers of much smaller degree. The latter rely on two-dimensional mesh and torus topologies with shorter local links. This paper focuses on doubling the degree of common 2D meshes and tori while still preserving an attractive layout for VLSI design. By adding a set of diagonal links in one direction, diagonal networks are obtained. By adding a second set of links, networks of degree eight are built, named king networks. This research presents a comprehensive study of these networks which includes a topological analysis, the proposal of appropriate routing procedures, and an empirical evaluation. King networks exhibit a number of attractive characteristics which translate to reduced execution times of parallel applications. For example, the execution times of the NPB suite are reduced by up to 30 percent. In addition, this work reveals other properties of king networks, such as perfect partitioning, that deserve further attention for their convenient exploitation in forthcoming high-performance parallel systems.
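The effect of the two added sets of diagonal links can be illustrated with a small sketch comparing hop counts in a plain 2D torus and a king torus. It assumes a square size × size network with nodes addressed by (x, y) coordinates; the function names are hypothetical, and the paper's routing procedures are not reproduced.

```python
def ring_distance(a: int, b: int, size: int) -> int:
    """Shortest distance between two positions on a ring of `size` nodes."""
    d = abs(a - b) % size
    return min(d, size - d)


def torus_distance(p, q, size):
    """Hops in a 2D torus (degree 4): per-dimension distances add up."""
    return ring_distance(p[0], q[0], size) + ring_distance(p[1], q[1], size)


def king_torus_distance(p, q, size):
    """Hops in a king torus (degree 8): a diagonal hop advances both
    dimensions at once, so the larger per-dimension distance dominates."""
    return max(ring_distance(p[0], q[0], size), ring_distance(p[1], q[1], size))


def king_neighbors(p, size):
    """The eight neighbors of node p, as reached by chess-king moves."""
    x, y = p
    return [((x + dx) % size, (y + dy) % size)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
```

For example, in an 8 × 8 network the nodes (0, 0) and (3, 5) are 6 hops apart in the plain torus but only 3 hops apart in the king torus.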
ETPL VLSI - 012
A 6 b 5 GS/s 4 Interleaved 3 b/Cycle SAR ADC
This paper presents a 4× time-interleaved 6-bit 5 GS/s 3 b/cycle SAR analog-to-digital converter (ADC). Hardware overhead induced by a 3 b/cycle architecture is eased by an interpolation technique where around 1/3 of the hardware is saved. In addition, complicated switching controls are simplified with a proposed fractional DAC array switching scheme, thus reducing the design complexity and the hardware burden. A boundary detection code overriding (BDCO) is introduced to reduce error probability at the large error magnitude, by utilizing the extended time when the comparator is at reset and the DAC at settling. The floorplan of the front-end is optimized for important interleaving clock distributions, and a master-clock-control bootstrapped-switch technique is adopted to suppress the timing-skew effect among the channels. The unit capacitor has been designed to suit the DAC structure, which allows top-plate sharing in both directions, and the offset is calibrated on-chip with a clocking variable biasing transistor pair at the latch. Measurement results show that the prototype can achieve 5 GS/s with a total power consumption of 5.5 mW at 1 V supply in 65 nm CMOS technology. Besides, it exhibits a 30.76 dB SNDR and 43.12 dB SFDR at Nyquist, which yields a Walden FoM of 39 fJ/conversion-step.
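The reported figure of merit can be checked from the other numbers in the abstract. The sketch below uses the common definitions ENOB = (SNDR - 1.76) / 6.02 and Walden FoM = P / (2^ENOB × fs); using the full sample rate fs in the denominator is an assumption about how the figure was computed, and the function names are illustrative.

```python
def enob(sndr_db: float) -> float:
    """Effective number of bits from SNDR in dB."""
    return (sndr_db - 1.76) / 6.02


def walden_fom(power_w: float, sndr_db: float, sample_rate_hz: float) -> float:
    """Walden figure of merit in joules per conversion step."""
    return power_w / (2 ** enob(sndr_db) * sample_rate_hz)


# Values quoted in the abstract: 5.5 mW, 30.76 dB SNDR, 5 GS/s.
fom = walden_fom(5.5e-3, 30.76, 5e9)
print(f"ENOB = {enob(30.76):.2f} bits, FoM = {fom * 1e15:.1f} fJ/conversion-step")
```

With these inputs the ENOB is about 4.8 bits and the result is about 39 fJ/conversion-step, consistent with the figure stated above.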
ETPL VLSI - 013
Algorithm and Architecture of Configurable Joint Detection and Decoding for MIMO Wireless Communications with Convolutional Codes
This paper presents an algorithm and a VLSI architecture of a configurable joint detection and decoding (CJDD) scheme for multiple-input multiple-output (MIMO) wireless communication systems with convolutional codes. A novel tree-enumeration strategy is proposed such that the MIMO detection and decoding of convolutional codes can be conducted in a single stage using a tree-searching engine. Moreover, this design can be configured to support different combinations of quadrature amplitude modulation (QAM) schemes as well as encoder code rates, and thus can be more practically deployed to real-world MIMO wireless systems. A formal outline of the proposed algorithm will be given and simulation results for 16-QAM and 64-QAM with rate-1/2 and rate-1/3 codes will be presented showing that, compared with the conventional separate scheme, the CJDD algorithm can greatly improve bit error rate (BER) performance with different system settings. In addition, the VLSI architecture and implementation of the CJDD approach will be illustrated. The architectures and circuits are designed to support configurability and flexibility while maintaining high efficiency and low complexity. The postlayout experimental results for 16-QAM and 64-QAM with rate-1/2 and rate-1/3 codes show that, compared with the previous configurable design, this architecture can achieve reduced or comparable complexity with improved BER performance.
ETPL VLSI - 014
Power/Energy Minimization Techniques for Variability-Aware High-Performance 16-nm 6T-SRAM
Power and energy minimization is a critical concern for the battery life, reliability, and yield of many minimum-sized SRAMs. In this paper, we extend our previously proposed hybrid analytical-empirical model for minimizing and predicting the delay and delay variability of SRAMs, VAR-TX, to a new enhanced version, exVAR-TX, to minimize and predict the power/energy and power/energy variability of a 16-nm 6T-SRAM under the influence of the three major types of variations: Fabrication, Operation, and Implementation. Using exVAR-TX for architectural optimization [exhaustively computing and comparing the range of feasible architectures subject to interdie (die-to-die/D2D) and intradie (within-die/WID) process and operation variations (PVT), electromigration (EM), negative bias temperature instability (NBTI), and soft-errors, among others] on top of deploying the most recent state of the art effective mitigation techniques we show that energy and energy-delay-product (EDP) of 64KB 16-nm 6T-SRAM could be reduced by ~12.5X and ~33%, respectively, as compared to the existing conventional designs.
ETPL VLSI - 015
A Single-Ended With Dynamic Feedback Control 8T Subthreshold SRAM Cell
A novel 8-transistor (8T) static random access memory cell with improved data stability in subthreshold operation is designed. The proposed single-ended with dynamic feedback control 8T static RAM (SRAM) cell enhances the static noise margin (SNM) for ultralow power supply. It achieves write SNM of 1.4× and 1.28× as that of isoarea 6T and read-decoupled 8T (RD-8T), respectively, at 300 mV. The standard deviation of write SNM for 8T cell is reduced to 0.4× and 0.56× as that for 6T and RD-8T, respectively. It also possesses another striking feature of high read SNM 2.33×, 1.23×, and 0.89× as that of 5T, 6T, and RD-8T, respectively. The cell has hold SNM of 1.43×, 1.23×, and 1.05× as that of 5T, 6T, and RD-8T, respectively. The write time is 71% less than that of single-ended asymmetrical 8T cell. The proposed 8T consumes less write power 0.72×, 0.6×, and 0.85× as that of 5T, 6T, and isoarea RD-8T, respectively. The read power is 0.49× of 5T, 0.48× of 6T, and 0.64× of RD-8T. The power/energy consumption of 1-kb 8T SRAM array during read and write operations is 0.43× and 0.34×, respectively, of 1-kb 6T array. These features enable ultralow power applications of 8T.
ETPL VLSI - 016
A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation
The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem, known as the utilization wall or dark silicon, is becoming increasingly serious. With the introduction of 3-D integrated circuits (ICs), it is likely to become more severe. Thus, how to take advantage of the extra transistors, made available by Moore's law and the onset of 3-D ICs, within the power budget poses a significant challenge to system designers. To address this challenge, we propose a 3-D hybrid architecture consisting of a CPU layer with multiple cores, a field-programmable gate array (FPGA) layer, and a DRAM layer. The architecture is designed for low power without sacrificing performance. The FPGA layer is capable of supporting a large number of accelerators. It is placed adjacent to the CPU layer, with a communication mechanism that allows it to access CPU data caches directly. This enables fast switches between these two layers. This architecture reduces the power and energy significantly, at better or similar performance. This then alleviates the dark silicon problem by letting us power ON more components to achieve higher performance. We evaluate the proposed architecture through a new framework we have developed. Relative to the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power by 6.9× and energy-delay product (EDP) by 7.2×, and application-level power by 1.9× and EDP by 2.2×, while delivering similar performance. For the entire system, this translates to a 47.5% power reduction relative to a baseline system that consists of a CPU layer and a DRAM layer. This also translates to a 72.9% power reduction relative to an alternative system that consists of a CPU layer, an L3 cache layer, and a DRAM layer.
ETPL VLSI - 017
Streaming Elements for FPGA Signal and Image Processing Accelerators
Field-programmable gate array (FPGA) devices boast abundant resources with which custom accelerator components for signal, image, and data processing may be realized; however, realizing high-performance, low-cost accelerators currently demands manual register transfer level design. Software-programmable soft processors have been proposed as a way to reduce this design burden, but they are unable to support performance and cost comparable to custom circuits. This paper proposes a new soft processing approach for FPGA that promises to overcome this barrier. A high-performance, fine-grained streaming processor, known as a streaming accelerator element, is proposed, which realizes accelerators as large-scale custom multicore networks. By adopting a streaming execution approach with advanced program control and memory addressing capabilities, typical program inefficiencies can be almost completely eliminated to enable performance and cost that are unprecedented among software-programmable solutions. When used to realize accelerators for fast Fourier transform, motion estimation, matrix multiplication, and Sobel edge detection, the proposed architecture is shown to enable real-time performance, with performance and cost comparable with hand-crafted custom circuit accelerators and up to two orders of magnitude beyond existing soft processors.
ETPL VLSI - 018
High-Performance Pipelined Architecture of Elliptic Curve Scalar Multiplication Over GF(2^m)
This paper proposes an efficient pipelined architecture of elliptic curve scalar multiplication (ECSM) over GF(2^m). The architecture uses a bit-parallel finite field (FF) multiplier accumulator (MAC) based on the Karatsuba-Ofman algorithm. The Montgomery ladder algorithm is modified for better sharing of execution paths. The data path in the architecture is well designed, so that the critical path contains few extra logic primitives apart from the FF MAC. In order to find the optimal number of pipeline stages, scheduling schemes with different pipeline stages are proposed and the ideal placement of pipeline registers is thoroughly analyzed. We implement ECSM over the five binary fields recommended by the National Institute of Standards and Technology on Xilinx Virtex-4 and Virtex-5 field-programmable gate arrays. The three-stage pipelined architecture is shown to have the best performance, which achieves a scalar multiplication over GF(2^163) in 6.1 μs using 7354 Slices on Virtex-4. Using Virtex-5, the scalar multiplication for m = 163, 233, 283, 409, and 571 can be achieved in 4.6, 7.9, 10.9, 19.4, and 36.5 μs, respectively, which are faster than previous results.
ETPL VLSI - 019
Read Bitline Sensing and Fast Local Write-Back Techniques in Hierarchical Bitline Architecture for Ultralow-Voltage SRAMs
Voltage scalable decoupled SRAMs operating at a subthreshold region have various challenges, such as deteriorated read bitline (RBL) swing resulting in read sensing failure and degraded cell stability due to the half-select write. This paper proposes an equalized bitline scheme to eliminate the leakage dependence on data pattern and thus improves RBL sensing and its resilience against process, voltage, and temperature variations. In addition, we propose a fast local write-back (WB) technique to implement a half-select-free write operation. With hierarchical bitline architecture, it facilitates a local read and a subsequent fast WB action to secure the original data without performance degradation. A 16-kb SRAM test chip has been fabricated in a 65-nm CMOS technology and achieved the minimum operating voltage of 0.24 V with a read access time of 4.88
ETPL VLSI - 020
A High-Performance FIR Filter Architecture for Fixed and Reconfigurable Applications
Transpose form finite-impulse response (FIR) filters are inherently pipelined and support multiple constant multiplications (MCM) technique that results in significant saving of computation. However, transpose form configuration does not directly support the block processing unlike direct-form configuration. In this paper, we explore the possibility of realization of block FIR filter in transpose form configuration for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. Based on a detailed computational analysis of transpose form configuration of FIR filter, we have derived a flow graph for transpose form block FIR filter with optimized register complexity. A generalized block formulation is presented for transpose form FIR filter. We have derived a general multiplier-based architecture for the proposed transpose form block filter for reconfigurable applications. A low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves significantly less area-delay product (ADP) and less energy per sample (EPS) than the existing block implementation of direct-form structure for medium or large filter lengths, while for the short-length filters, the block implementation of direct-form FIR structure has less ADP and less EPS than the proposed structure. Application-specific integrated circuit synthesis result shows that the proposed structure for block size 4 and filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure.
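The difference between the two configurations named in this abstract can be seen in a small behavioral sketch. It models sample-by-sample (non-block) direct-form and transpose-form FIR filters and is not the paper's block formulation or MCM design; the function names are hypothetical.

```python
def fir_direct(coeffs, samples):
    """Direct form: a delay line of past inputs feeds the multipliers."""
    delay = [0] * len(coeffs)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]   # shift the new sample into the delay line
        out.append(sum(h * d for h, d in zip(coeffs, delay)))
    return out


def fir_transpose(coeffs, samples):
    """Transpose form: the current input multiplies every coefficient at
    once, and the products accumulate through a chain of registers."""
    n = len(coeffs)
    regs = [0] * (n - 1)           # registers between the adders
    out = []
    for x in samples:
        products = [h * x for h in coeffs]   # one input times many constants
        out.append(products[0] + (regs[0] if regs else 0))
        # Move each partial sum one register closer to the output.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < n - 1 else 0)
                for i in range(n - 1)]
    return out
```

Because every multiplier in the transpose form sees the same input sample, the products can share partial terms, which is the property that multiple constant multiplication exploits.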
ETPL VLSI - 021
A Novel Quantum-Dot Cellular Automata X-bit × 32-bit SRAM
Application of quantum-dot cellular automata (QCA) technology as an alternative to CMOS technology on the nanoscale has a promising future; QCA is an interesting technology for building memory. The proposed design and simulation of a new memory cell structure based on QCA with a minimum delay, area, and complexity is presented to implement a static random access memory (SRAM). This paper presents the design and simulation of a 16-bit x 32-bit SRAM with a new structure in QCA. Since QCA is a pipeline, this SRAM has a high operating speed. The 16-bit x 32-bit SRAM has a new structure with a 32-bit width designed and implemented in QCA. It has the ability of a conventional logic SRAM that can provide read/write operations frequently with minimum delay. The 16-bit x 32-bit SRAM is generalized and an n x 16-bit x 32-bit SRAM is implemented in QCA. Novel 16-bit decoders and multiplexers (MUXs) in QCA are presented that have been designed with a minimum number of majority gates and cells. The new SRAM, decoders, and MUXs are designed, implemented, and simulated in QCA using a signal distribution network to avoid the coplanar problem of crossing wires. The QCA-based SRAM cell was compared with the SRAM cell based on CMOS. Results show that the proposed SRAM is more efficient in terms of area, complexity, clock frequency, latency, throughput, and power consumption.
ETPL VLSI - 022
Process Variation Delay and Congestion Aware Routing Algorithm for Asynchronous NoC Design
ETPL VLSI - 023
Designing Tunable Subthreshold Logic Circuits Using Adaptive Feedback Equalization
Ultralow-power subthreshold logic circuits are becoming prominent in embedded applications with limited energy budgets. Minimum energy consumption of digital logic circuits can be obtained by operating in the subthreshold regime. However, in this regime process variations can result in up to an order of magnitude variations in ION/IOFF ratios leading to timing errors, which can have a destructive effect on the functionality of the subthreshold circuits. These timing errors become more frequent in scaled technology nodes where process variations are highly prevalent. Therefore, mechanisms to mitigate these timing errors while minimizing the energy consumption are required. In this paper, we propose a tunable adaptive feedback equalizer circuit that can be used with a sequential digital logic to mitigate the process variation effects and reduce the dominant leakage energy component in the subthreshold digital logic circuits. We also present detailed energy-performance models of the adaptive feedback equalizer circuit. As part of the modeling approach, we also develop an analytical methodology to estimate the equivalent resistance of MOSFET devices in subthreshold regime. For a 64-bit adder designed in 130 nm, our proposed approach can reduce the normalized variation of the critical path delay from 16.1% to 11.4% while reducing the energy-delay product by 25.83% at minimum energy supply voltage.
ETPL VLSI - 024
Flexible DSP Accelerator Architecture Exploiting Carry-Save Arithmetic
Hardware acceleration has been proved an extremely promising implementation strategy for the digital signal processing (DSP) domain. Rather than adopting a monolithic application-specific integrated circuit design approach, in this brief, we present a novel accelerator architecture comprising flexible computational units that support the execution of a large set of operation templates found in DSP kernels. We differentiate from previous works on flexible accelerators by enabling computations to be aggressively performed with carry-save (CS) formatted data. Advanced arithmetic design concepts, i.e., recoding techniques, are utilized enabling CS optimizations to be performed in a larger scope than in previous approaches. Extensive experimental evaluations show that the proposed accelerator architecture delivers average gains of up to 61.91% in area-delay product and 54.43% in energy consumption compared with the state-of-art flexible datapaths.
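To illustrate the carry-save (CS) format the abstract relies on, here is a minimal Python sketch of 3:2 compression on non-negative integers. The function names are hypothetical and this is not the accelerator's datapath; it only shows that a running total can be kept as a (sum, carry) pair and resolved with a single carry-propagating addition at the end.

```python
def carry_save_add(a, b, c):
    """3:2 compression: reduce three operands to a (sum, carry) pair.

    Each bit position is handled independently, so no carry ripples
    across the word during this step.
    """
    partial_sum = a ^ b ^ c                        # per-bit sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1     # per-bit majority, weighted by 2
    return partial_sum, carry


def cs_accumulate(operands):
    """Accumulate operands while keeping the running total in carry-save form."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s, c


def resolve(s, c):
    """Convert a carry-save pair back to binary with one ordinary addition."""
    return s + c


if __name__ == "__main__":
    s, c = cs_accumulate([13, 7, 22, 5])
    print(resolve(s, c))
```

The invariant is that `s + c` always equals the sum of the operands consumed so far, which is why the expensive carry-propagating addition can be deferred to the last step.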
ETPL VLSI - 025
A Low-Power Robust Easily Cascaded PentaMTJ-Based Combinational and Sequential Circuits
Advanced computing systems embed spintronic devices to improve the leakage performance of conventional CMOS systems. High speed, low power, and infinite endurance are important properties of the magnetic tunnel junction (MTJ), a spintronic device, which support its use in memories and logic circuits. This paper presents a PentaMTJ-based logic gate, which provides easy cascading, self-referencing, a reduced voltage-headroom problem in the precharge sense amplifier, and low area overhead, in contrast to existing MTJ-based gates. PentaMTJ is used here because it provides guaranteed disturbance-free reading and increased tolerance to process variations, along with compatibility with the CMOS process. The logic gate is validated by simulation at the 45-nm technology node using a VerilogA model of the PentaMTJ.
ETPL VLSI - 026
Design of Modified Second-Order Frequency Transformations Based Variable Digital Filters with Large Cutoff Frequency Range and Improved Transition Band Characteristics
The frequency transformation based filters (FT filters) provide an absolute control over the cutoff frequency. However, the cutoff frequency range (Ωc_range) of the FT filters is limited. The second-order frequency transformations combined with coefficient decimation technique based filter (FTCDM filter) has wider Ωc_range compared with the FT filter; however, the ratio of transition bandwidth of the transformed filter to that of the prototype filter, tbwFT/tbwmod, is large over a significant portion of Ωc_range. In this paper, we propose a novel idea of relaxing the one-to-one mapping condition between the frequency variables, to overcome the issue of limited Ωc_range for tbwFT ≤ tbwmod. In the proposed modified second-order frequency transformation based filter (MSFT filter), we relax the one-to-one mapping condition between the frequency variables and use low-pass to high-pass transformation on the prototype filter to achieve wider Ωc_range with tbwFT ≤ tbwmod. Design example shows that the MSFT filter provides 3 and 1.22 times wider Ωc_range compared to FT and FTCDM filters, respectively.
ETPL VLSI - 027
High-Speed, Low-Power, and Highly Reliable Frequency Multiplier for DLL-Based Clock Generator
A high-speed, low-power, and highly reliable frequency multiplier is proposed for a delay-locked loop-based clock generator to generate a multiplied clock with a high frequency and wide frequency range. The proposed edge combiner achieves a high-speed and highly reliable operation using a hierarchical structure and an overlap canceller. In addition, by applying the logical effort to the pulse generator and multiplication-ratio control logic design, the proposed frequency multiplier minimizes the delay difference between positive- and negative-edge generation paths, which causes a deterministic jitter. Finally, a numerical analysis is performed to analyze and compare the performance of the proposed frequency multiplier with that of previous frequency multipliers. The proposed frequency multiplier is fabricated using a 0.13-μm CMOS process technology, and has the multiplication ratios of 1, 2, 4, 8, and 16, and an output range of 100 MHz-3.3 GHz. The frequency multiplier achieves a power consumption to a frequency ratio of 2.9 μW/MHz.
ETPL VLSI - 028
Knowledge-Based Neural Network Model for FPGA Logical Architecture Development
This paper proposes a knowledge-based neural network (KBNN) modeling approach for field-programmable gate array (FPGA) logical architecture design. The KBNN embeds the existing FPGA analytical models (AMs) into an NN. The NN can complement the AMs according to their needs to provide further increased model accuracy, while maintaining the meaningful trends successfully captured in the AMs. The obtained KBNN predicts the routing channel width required by circuit implementations on various FPGA architectures, which can be used by architects to quickly and accurately evaluate various FPGA architectures in early development stages. Experimental results show that the KBNN-based approach achieves an average error of 2%, which shows 75% accuracy enhancement over the existing AMs for routing channel width estimation of a set of benchmark circuits and FPGA architectures. The KBNN model has been applied to three FPGA architecture development scenarios to demonstrate its practical application and effectiveness.
ETPL VLSI - 029
Built-In Self-Test and Digital Calibration of Zero-IF RF Transceivers
We propose a self-test method for zero-IF radio frequency transceivers using primarily loopback, aided by a small built-in self-test (BIST) circuitry, to determine critical performance parameters, such as I/Q imbalance and nonlinearity coefficients. The transceiver is placed in the loopback mode by couplers, specifically designed to be asymmetric with respect to the primary path and the BIST path. The loopback path is also designed to include two traces with slightly different delays to enable parameter deembedding. Transceiver parameters are analytically computed using baseband and signals over two frames, each of which is 200 in duration. Overall, measurement time is <10 ms, including computation time. In addition to loopback hardware support and the associated parameter deembedding methodology, we propose a complementary BIST circuit to measure the transmitter (TX) gain. The measured parameters can be used for predistortion or postdistortion to calibrate the transceiver, both at production time and in the field. Both simulation and hardware measurement results show that the proposed method can determine the target performance parameters with adequate accuracy for digital calibration. Measurement and the subsequent calibration are shown to reduce TX error vector magnitude more than fivefold, even for significantly impaired systems.
ETPL VLSI - 030
A Systematic Design Methodology of Asynchronous SAR ADCs
Successive approximation register (SAR) analog-to-digital converters (ADCs) are widely used in biomedical and portable/wearable electronic systems due to their excellent power efficiency. However, both the design and the optimization of high-performance SAR ADCs are time consuming, even for well-experienced circuit designers. For system designers, it is also hard to quickly evaluate the feasibility of a given specification in a process node. This paper presents a systematic sizing procedure for asynchronous SAR ADCs based on design considerations. A sizing tool based on the proposed design procedure is also implemented, the sizing results of which are highly competitive in comparison with other state-of-the-art manual works. Moreover, the sizing time is relatively short due to the efficient and effective search algorithms employed. In addition to the simulation results, two silicon proofs with different specifications and process nodes are provided to demonstrate the feasibility of this design methodology.
ETPL VLSI - 031
Test Pattern Modification for Average IR-Drop Reduction
This paper presents a novel technique that modifies automatic test pattern generation (ATPG) test patterns to reduce the time-averaged IR drop of a test pattern. We propose a fast average IR-drop estimation, which is very close to the time-averaged IR drop obtained from time-consuming transient simulation (R² = 0.99). We calculate the contribution of every node to the nodes inside the IR-drop hotspot so that we can effectively modify only a few don't-care bits in the test patterns to reduce IR drop. The experimental results show that our technique successfully reduces time-averaged IR drop by 10% with almost no fault coverage loss and no test pattern inflation.
ETPL VLSI - 032
Input-Based Dynamic Reconfiguration of Approximate Arithmetic Units for Video Encoding
The field of approximate computing has received significant attention from the research community in the past few years, especially in the context of various signal processing applications. Image and video compression algorithms, such as JPEG, MPEG, and so on, are particularly attractive candidates for approximate computing, since they are tolerant of computing imprecision due to human imperceptibility, which can be exploited to realize highly power-efficient implementations of these algorithms. However, existing approximate architectures typically fix the level of hardware approximation statically and are not adaptive to input data. For example, if a fixed approximate hardware configuration is used for an MPEG encoder (i.e., a fixed level of approximation), the output quality varies greatly for different input videos. This paper addresses this issue by proposing a reconfigurable approximate architecture for MPEG encoders that optimizes power consumption with the goal of maintaining a particular Peak Signal-to-Noise Ratio (PSNR) threshold for any video. Toward this end, we design reconfigurable adder/subtractor blocks (RABs), which have the ability to modulate their degree of approximation, and subsequently integrate these blocks in the motion estimation and discrete cosine transform modules of the MPEG encoder. We propose two heuristics for automatically tuning the approximation degree of the RABs in these two modules during runtime based on the characteristics of each individual video. Experimental results show that our approach of dynamically adjusting the degree of hardware approximation based on the input video respects the given quality bound (PSNR degradation of 1%-10%) across different videos while achieving a power saving up to 38% over a conventional nonapproximated MPEG encoder architecture. 
Note that although the proposed reconfigurable approximate architecture is presented for the specific case of an MPEG encoder, it can be easily extended to other DSP applications.
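As a rough illustration of an adder whose degree of approximation can be changed at runtime, the sketch below uses a simplified lower-part OR scheme: the lowest `k` bits are OR-ed instead of added, and the upper bits are added exactly. This is an assumed stand-in, not the RAB design or the tuning heuristics from the paper; the function names, the 16-bit width, and the error-budget search are all hypothetical.

```python
def approx_add(a, b, k, width=16):
    """Add two unsigned `width`-bit values, approximating the lowest k bits.

    The low part is a bitwise OR (no carries are generated there); the high
    part is an exact addition. k = 0 gives the exact sum.
    """
    low_mask = (1 << k) - 1
    low = (a | b) & low_mask
    high = ((a >> k) + (b >> k)) << k
    return (high | low) & ((1 << (width + 1)) - 1)   # keep the carry-out bit


def pick_degree(samples, max_error, width=16):
    """Return the largest k whose mean absolute error on `samples` fits the budget.

    `samples` is a list of (a, b) operand pairs, standing in for statistics
    gathered from the current input video.
    """
    best = 0
    for k in range(width + 1):
        total = sum(abs((a + b) - approx_add(a, b, k, width)) for a, b in samples)
        if total / len(samples) <= max_error:
            best = k
    return best


if __name__ == "__main__":
    operands = [(1234, 4321), (255, 1), (1000, 24)]
    k = pick_degree(operands, max_error=8.0)
    print(k, [approx_add(a, b, k) for a, b in operands])
```

A larger `k` removes more carry logic from the active path, which is where power would be saved in hardware, at the cost of a larger arithmetic error; the search simply picks the most aggressive setting that the sampled inputs tolerate.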
ETPL VLSI - 033
A Mismatch-Insensitive Skew Compensation Architecture for Clock Synchronization in 3-D ICs
Traditional die-to-die (DTD) clock skew compensation topologies require matched delay lines or equal through-silicon via (TSV) delays. Unlike previous techniques, the proposed mismatch-insensitive skew compensation architecture can maintain a synchronous clock signal between two dies, while completely eliminating any skew arising from code-dependent mismatch in delay lines or unequal TSV delays. The performance of our design is verified in theory and simulation in light of mismatch/finite resolution of delay lines, clock jitter, phase detector dead zone, TSV delay, and buffer mismatch. Postsynthesis timing verification of this cell-based design was done in a 65-nm CMOS process. Under similar worst-case mismatch conditions, the residual skew in the proposed architecture was limited to 32 ps at 1 GHz, compared with 116 ps for a recent DTD topology, while consuming only 2.1 mW.
ETPL VLSI - 034
High-Density and High-Reliability Nonvolatile Field-Programmable Gate Array With Stacked 1D2R RRAM Array
The huge area overhead of the interconnect is one of the critical issues in static random access memory (SRAM)-based field-programmable gate arrays (FPGAs), resulting in high power consumption and slow operation speed. Another critical issue is the volatile feature of the SRAM, which leads to high standby leakage current and long power-ON time. Resistive random access memory (RRAM) with a high resistance ratio and zero standby power possesses great potential in the FPGA applications. The conventional RRAM-based nonvolatile FPGAs (NVFPGAs) may use one-transistor 2-RRAM (1T2R) storage element to replace the SRAM or the one RRAM (1R) cell to replace both nMOS switch and SRAM. However, those NVFPGA schemes may suffer from the issues of low reliability, high configuration power, and high active leakage power. In this paper, we propose a novel element [one-diode two-RRAM (1D2R) cells] to replace the nMOS switch and 6 Transistors (6T) SRAM. Meanwhile, the novel block structures of the logic block, connection block, switch block, and the FPGA architecture based on the 1D2R element are proposed. Compared with the conventional 1T2R-based NVFPGA, our novel structure could improve the operation speed by 53% with a 40.5% lower operation power. Compared with the conventional 1R-based NVFPGA, the proposed scheme could greatly reduce the write error rate by eight orders with more than 20 times lower write power.
ETPL VLSI - 035
In-Field Test for Permanent Faults in FIFO Buffers of NoC Routers
We propose an integrated, energy-efficient resource allocation framework for overcommitted clouds. The framework achieves substantial energy savings by 1) minimizing Physical Machine (PM) overload occurrences via VM resource usage monitoring and prediction, and 2) reducing the number of active PMs via efficient VM migration and placement. Using real Google data consisting of 29-day traces collected from a cluster containing more than 12K PMs, we show that our proposed framework outperforms existing overload avoidance techniques and prior VM migration strategies by reducing the number of unpredicted overloads, minimizing migration overhead, increasing resource utilization, and reducing cloud energy consumption.
ETPL VLSI - 036
A Comparator-Based Rail Clamp
A comparator-based rail clamp for handling electrostatic discharge (ESD) events is presented. The new circuit technique allows the use of a time constant that can be much smaller than a traditional RC and inverter-based clamp. The new clamp is more area-efficient and dissipates ESD events with little residual energy. The design is able to support applications with power-ON time slower than 4 μs, is immune to latch-ON, and recovers very quickly if falsely triggered. Experimental results and performance comparisons with the traditional circuit are presented.
ETPL VLSI - 037
A SUC-Based Full-Binary 6-bit 3.1-GS/s 17.7-mW Current-Steering DAC in 0.038 mm²
A 6-bit full-binary compact and low-power current-steering digital-to-analog converter (DAC) designed for 60-GHz Wireless Personal Area Network applications is presented. The closely located circuit components based on the stacked unit cell minimize the parasitic capacitance and enhance the high-frequency dynamic linearity. The proposed binary structure realizes a compact DAC by eliminating the need for additional circuits, such as thermometer decoders, and thus reduces power consumption. A prototype 6-bit 3.1-GS/s full-binary DAC was fabricated in a 90-nm CMOS process. The DAC exhibits a spurious-free dynamic range of >37.2 dB up to 3.1 GS/s over the Nyquist input. The chip consumes 17.7 mW of power and occupies 0.038 mm² of core size.
ETPL VLSI - 038
Glitch Energy Reduction and SFDR Enhancement Techniques for Low-Power Binary-Weighted Current-Steering DAC
This brief proposes a glitch reduction approach based on dynamic capacitance compensation of binary-weighted current switches in a current-steering digital-to-analog converter (DAC). The method was successfully demonstrated with a 10-bit 400-MHz pure binary-weighted current-steering DAC with a minimum number of retiming latches. The experimental results show very low glitch energy during major carry transitions at the output, which is <1 pVs. This brief also utilizes a layout structure to improve the spurious-free dynamic range at high signal frequencies. The chip was implemented in a standard 0.18-μm CMOS technology and consumes 20.7 mW at 400 MS/s.
ETPL VLSI - 039
Computing Seeds for LFSR-Based Test Generation From Nontest Cubes
In test data compression methods that are based on the use of a linear-feedback shift register (LFSR), a seed that produces a test for a target fault is computed based on a test cube for the fault. With a given LFSR, a seed may not exist for a given test cube, even though a seed may exist for a different test cube that detects the same fault. This issue is addressed in this brief by computing seeds for LFSR-based test generation without using test cubes. Instead, the procedure described in this brief is based on the use of nontest cubes. A nontest cube for a fault must be avoided in any test or test cube for the fault in order to allow the fault to be detected. Therefore, nontest cubes do not limit the ability of the procedure to compute seeds with a given LFSR. Experimental results demonstrate the advantages that the use of nontest cubes provides, and the associated computational cost.
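The difference between matching a test cube and avoiding nontest cubes can be sketched with a tiny brute-force search. Everything here is an assumption made for illustration: the 4-bit LFSR, its taps, the use of the raw LFSR output stream as the test, and the function names. Practical flows solve linear equations over GF(2) rather than enumerating seeds, and the paper's procedure is not reproduced here.

```python
def lfsr_stream(seed, taps, nbits, length):
    """Output `length` bits from a Fibonacci LFSR (hypothetical configuration).

    Each step emits the least significant state bit, then shifts right and
    feeds the XOR of the tapped state bits into the top position.
    """
    state = seed
    out = []
    for _ in range(length):
        out.append(state & 1)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (nbits - 1))
    return out


def matches(cube, bits):
    """True if every specified bit of the cube ('0'/'1') agrees; 'x' is don't care."""
    return all(c == "x" or int(c) == b for c, b in zip(cube, bits))


def seeds_for_test_cube(cube, taps, nbits):
    """Nonzero seeds whose output stream matches one given test cube."""
    return [s for s in range(1, 1 << nbits)
            if matches(cube, lfsr_stream(s, taps, nbits, len(cube)))]


def seeds_avoiding(nontest_cubes, taps, nbits, length):
    """Nonzero seeds whose output stream matches none of the nontest cubes."""
    return [s for s in range(1, 1 << nbits)
            if not any(matches(c, lfsr_stream(s, taps, nbits, length))
                       for c in nontest_cubes)]


if __name__ == "__main__":
    taps, nbits = (0, 1), 4
    print(seeds_for_test_cube("1x0x", taps, nbits))
    print(seeds_avoiding(["1x0x"], taps, nbits, 4))
```

Requiring a match restricts the search to seeds that reproduce one particular cube, whereas the avoidance formulation accepts any seed whose pattern stays outside the forbidden cubes, which is the flexibility the abstract points to.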
ETPL VLSI - 040
Design for Testability of Sleep Convention Logic
Testability is a major concern in industry for today's complex system-on-chip design. Design-for-testability (DFT) techniques are essential for any logic style, including asynchronous logic styles in order to reduce the test cost. Sleep convention logic (SCL) is a new promising asynchronous logic style that is based on the more well-known asynchronous logic style NULL convention logic (NCL). In contrast to the NCL, there are currently no design for testability methodologies existing for the SCL. The aim of this paper is to analyze the various faults within SCL pipelines and propose a scan-based DFT methodology to make the SCL testable. The proposed DFT methodology is then validated through a number of experiments, showing that the methodology provides a high test coverage (>99%). The complete DFT methodology as well as the scan chain and scan cell design are presented.
ETPL VLSI - 041
An Efficient Single and Double-Adjacent Error Correcting Parallel Decoder for the (24,12) Extended Golay Code
Memories that operate in harsh environments, such as space, suffer a significant number of errors. Error correction codes (ECCs) are routinely used to ensure that those errors do not cause data corruption. However, ECCs introduce overheads both in terms of memory bits and decoding time that limit speed. In particular, this is an issue for applications that require strong error correction capabilities. A number of recent works have proposed advanced ECCs, such as orthogonal Latin squares or difference set codes, that can be decoded with relatively low delay. The price paid for the low decoding time is that in most cases, the codes are not optimal in terms of memory overhead and require more parity check bits. On the other hand, codes like the (24,12) Golay code that minimize the number of parity check bits have a more complex decoding. A compromise solution has been recently explored for Bose-Chaudhuri-Hocquenghem codes. The idea is to implement a fast parallel decoder to correct the most common error patterns (single and double adjacent) and use a slower serial decoder for the rest of the patterns. In this brief, it is shown that the same scheme can be efficiently implemented for the (24,12) Golay code. In this case, the properties of the Golay code can be exploited to implement a parallel decoder that corrects single- and double-adjacent errors and that is faster and simpler than a single-error correction decoder. The evaluation results using a 65-nm library show significant reductions in area, power, and delay compared with the traditional decoder that can correct single and double-adjacent errors. In addition, the proposed decoder is also able to correct some triple-adjacent errors, thus covering the most common error patterns.
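The split between a fast path for common error patterns and a slower fallback can be sketched in software with a plain syndrome lookup. The construction below is an assumption chosen for illustration (a cyclic (23,12) Golay code with generator polynomial 0xC75, extended by an overall parity bit, with the parity bit stored in the least significant position); it is a generic table-based decoder, not the parallel logic described in the paper.

```python
# Generator polynomial of the cyclic (23,12) Golay code:
# x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1
G_POLY = 0xC75


def poly_mod(value):
    """Remainder of value(x) divided by G_POLY over GF(2)."""
    for shift in range(value.bit_length() - 12, -1, -1):
        if value & (1 << (shift + 11)):
            value ^= G_POLY << shift
    return value


def parity(value):
    """1 if the number of set bits is odd, else 0."""
    return bin(value).count("1") & 1


def encode(message):
    """Systematically encode a 12-bit message into a 24-bit extended codeword."""
    shifted = message << 11
    word23 = shifted | poly_mod(shifted)       # message bits followed by 11 check bits
    return (word23 << 1) | parity(word23)      # append the overall parity bit


def syndrome(word24):
    """12-bit syndrome: 11-bit polynomial remainder plus the overall parity."""
    return (poly_mod(word24 >> 1) << 1) | parity(word24)


def build_fast_table():
    """Map syndromes of all single and double-adjacent errors to their patterns."""
    patterns = [1 << i for i in range(24)] + [0b11 << i for i in range(23)]
    return {syndrome(e): e for e in patterns}


FAST_TABLE = build_fast_table()


def fast_decode(word24):
    """Correct single and double-adjacent errors; return None for anything else.

    None stands for handing the word over to a slower, more general decoder.
    """
    s = syndrome(word24)
    if s == 0:
        return word24
    pattern = FAST_TABLE.get(s)
    return None if pattern is None else word24 ^ pattern


if __name__ == "__main__":
    codeword = encode(0xABC)
    corrupted = codeword ^ (0b11 << 7)          # double-adjacent error
    print(fast_decode(corrupted) == codeword)
```

Because the extended Golay code has minimum distance 8, the 47 targeted patterns all have distinct syndromes, so the table is unambiguous; the stored message is recovered from a corrected word as `word24 >> 12`.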
ETPL VLSI - 042
Low-Power ECG-Based Processor for Predicting Ventricular Arrhythmia
This paper presents the design of a fully integrated electrocardiogram (ECG) signal processor (ESP) for the prediction of ventricular arrhythmia using a unique set of ECG features and a naive Bayes classifier. Real-time and adaptive techniques for the detection and the delineation of the P-QRS-T waves were investigated to extract the fiducial points. Those techniques are robust to any variations in the ECG signal with high sensitivity and precision. Two databases of the heart signal recordings from the MIT PhysioNet and the American Heart Association were used as a validation set to evaluate the performance of the processor. Based on application-specific integrated circuit (ASIC) simulation results, the overall classification accuracy was found to be 86% on the out-of-sample validation data with 3-s window size. The architecture of the proposed ESP was implemented using a 65-nm CMOS process. It occupied an area of 0.112 mm² and consumed 2.78 μW of power at an operating frequency of 10 kHz and an operating voltage of 1 V. It is worth mentioning that the proposed ESP is the first ASIC implementation of an ECG-based processor that is used for the prediction of ventricular arrhythmia up to 3 h before the onset.
ETPL VLSI - 043
Sequence-Aware Watermark Design for Soft IP Embedded Processors
This paper describes a design approach for incorporating sequence-aware watermarks in soft intellectual property (IP) embedded processors. The influence of watermark sequence parameters on detection, area, and power overheads is examined, and consequently a method for incorporating sequence-aware watermarks in soft IP embedded processors is proposed. The intrinsic parameters of sequences, such as the activity factor and the overlapping factor, are introduced, and their impact on correlation results is demonstrated. Measurement and application-specific integrated circuits validate the design approach and demonstrate the resulting IP protection and subsequent costs for constrained embedded processors. Results presented in this paper show that a tradeoff occurs between the watermark robustness against third-party IP attacks and hardware implementation costs. An analysis of this tradeoff is provided, and an application-specific watermark implementation is proposed.
ETPL VLSI - 044
A Configurable Parallel Hardware Architecture for Efficient Integral Histogram Image Computing
Integral histogram image can accelerate the computing process of feature algorithm in computer vision, but exhibits high computation complexity and inefficient memory access. In this paper, we propose a configurable parallel architecture to improve the computing efficiency of integral histogram. Based on the configurable design in the architecture, multiple integral objects for integral histogram image, such as image intensity, image gradient, and local binary pattern, are well supported. Meanwhile, by means of the proposed strip-based memory partitioning mechanism, this architecture processes the integral histogram quickly with maximal parallelism in a pipeline manner. Besides, in this architecture, the proposed data correlation memory compression mechanism effectively solves the expansion problem of integral histogram memory caused by storing the histogram data. It fully reduces the data redundancy in the integral histograms, and saves a lot of memory resources. Experiments using Cyclone IV-based field-programmable gate array platform and 65-nm technology-based postsynthesis show that our architecture improves the average computing speed by 8.6 times with high power efficiency compared with the state-of-the-art works.
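For readers unfamiliar with the data structure, the following sketch shows a plain sequential integral histogram and the four-corner lookup it enables. It assumes the image has already been quantized to bin indices, and the function names are hypothetical; the strip-based partitioning and memory compression from the abstract are not modeled.

```python
def integral_histogram(image, num_bins):
    """Build an integral histogram from a 2-D list of bin indices.

    H[y][x][b] holds the number of pixels with bin b in rows < y and
    columns < x, so H has one extra row and column of zeros.
    """
    height, width = len(image), len(image[0])
    H = [[[0] * num_bins for _ in range(width + 1)] for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            cell = H[y + 1][x + 1]
            for b in range(num_bins):
                # Standard integral-image recurrence, applied per bin.
                cell[b] = H[y][x + 1][b] + H[y + 1][x][b] - H[y][x][b]
            cell[image[y][x]] += 1
    return H


def region_histogram(H, top, left, bottom, right):
    """Histogram of rows top..bottom-1 and columns left..right-1 via four lookups."""
    num_bins = len(H[0][0])
    return [H[bottom][right][b] - H[top][right][b]
            - H[bottom][left][b] + H[top][left][b]
            for b in range(num_bins)]


if __name__ == "__main__":
    image = [[0, 1, 1],
             [2, 0, 1],
             [2, 2, 0]]
    H = integral_histogram(image, num_bins=3)
    print(region_histogram(H, 1, 1, 3, 3))
```

The table stores one full histogram per pixel position, which is the memory expansion the abstract refers to, and each entry depends on its upper and left neighbors, which is what makes parallel access patterns non-trivial.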
ETPL VLSI - 045
A Universal Hardware-Driven PVT and Layout-Aware Predictive Failure Analytics for SRAM
The impact of device variability, temperature, and technology CAD-based layout parasitics on low-voltage static random access memory (SRAM) yield is explored using a novel variability-aware statistical methodology. Threshold voltage, Vt, mismatches for planar 22- and 14-nm FinFET SRAM transistors are characterized based on unique array-like structures for capturing process voltage and temperature (PVT) impact on variability. In general, the mismatches are shown to be a consistent and unique function of Vdd, doping, and temperature across the two technologies. Stronger Vt mismatch impact is observed as a function of Vdd and doping in the 22-nm technology, with higher mismatch recorded at lower temperatures. In the 14-nm technology, doping is found to have the strongest impact on Vt mismatch, and the mismatch increases with Vdd despite the reduced drain induced barrier lowering effects. Similar to the 22-nm technology, the mismatch increases at lower temperatures. Front-end-of-the-line capacitance effects are found to be more significant than back-end-of-the-line effects in 14-nm technologies, as opposed to planar technologies. Accurate parasitic capacitance modeling along with PVT-aware variability process variations for different 22-/14-nm cell arrangements are incorporated into a physics-based statistical analysis methodology for accurate Vmin analysis. The yield analysis results are corroborated with hardware yield using 4-16-Mb inline SRAM macro monitors. The methodology is unique in the industry, gives insight into the technology-circuit interactions, and is able to effectively predict the SRAM yield bounds.
ETPL VLSI - 046
Error Resilient and Energy Efficient MRF Message-Passing-Based Stereo Matching
Message-passing-based inference algorithms have immense importance in real-world applications. In this paper, error resiliency of a message passing based Markov random field (MRF) stereo matching hardware is explored and enhanced through the application of statistical error compensation. Error resiliency is of particular interest for subnanometer and postsilicon devices. The inherent robustness of iteration-based MRF inference algorithms is explored and shows that small errors are tolerable, while large errors degrade the performance significantly. Based on these error characteristics, algorithmic noise tolerance (ANT) has been applied at the arithmetic, iteration, and system levels. Introducing timing errors via voltage overscaling, at the arithmetic level, results show that the ANT-based hardware can tolerate an error rate of 21.3%, with performance degradation of only 3.5% at an overhead of 97.4%, compared with an error-free hardware with an energy savings of 39.7%. To reduce compensation complexity, iteration and system-level compensation was explored. Results show that, compared with arithmetic level, system-level compensation reduces overhead to 59%, while maintaining stereo matching performance with only 2.5% degradation with 16% additional power savings. These results are verified via FPGA emulation with timing errors induced within the message passing unit via relaxed synthesis.
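Algorithmic noise tolerance is commonly described as pairing an error-prone main block with a low-complexity estimator and falling back to the estimate when the two disagree by more than a threshold. The sketch below shows that decision rule on a small dot product; the reduced-precision estimator, the threshold value, and the injected bit flip are all assumptions for illustration and do not model the stereo matching hardware.

```python
def exact_dot(xs, ws):
    """Full-precision dot product (the main block when no timing error occurs)."""
    return sum(x * w for x, w in zip(xs, ws))


def reduced_precision_dot(xs, ws, drop_bits=2):
    """Cheap estimator: drop low-order operand bits before multiplying."""
    total = sum((x >> drop_bits) * (w >> drop_bits) for x, w in zip(xs, ws))
    return total << (2 * drop_bits)


def ant_select(main_output, estimate, threshold):
    """Keep the main output unless it deviates too far from the estimate."""
    if abs(main_output - estimate) <= threshold:
        return main_output
    return estimate


if __name__ == "__main__":
    xs, ws = [200, 150, 90], [30, 60, 120]
    estimate = reduced_precision_dot(xs, ws)
    clean = exact_dot(xs, ws)
    faulty = clean ^ (1 << 15)   # hypothetical timing error hitting a high-order bit
    print(ant_select(clean, estimate, threshold=1024))
    print(ant_select(faulty, estimate, threshold=1024))
```

This matches the error characteristics noted in the abstract: small deviations pass through untouched, while a large error is replaced by a value that is only slightly off.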
ETPL VLSI - 047
Unequal-Error-Protection Error Correction Codes for the Embedded Memories in Digital Signal Processors
In many digital signal processing applications, some parts of a word stored in the embedded static random access memories (SRAMs) are more important than other parts of the word. Due to the differences in importance, memory failures that occur in more important bit locations generally give rise to relatively larger system performance degradation than those in less important locations. This brief presents a low-complexity unequal-error-protection error correcting code (UEEP-ECC) approach for the embedded memories in digital signal processors. In the proposed UEEP-ECC, repetition code is combined with the Bose-Chaudhuri-Hocquenghem code to selectively provide stronger error correction capabilities on more important data portions without a large hardware overhead. An efficient UEEP-ECC generation algorithm that can find the UEEP-ECC code with a minimum power of memory core and ECC logics is also presented. The experimental results show that the UEEP-ECC scheme achieves considerable power savings and data quality improvements in both the H.264 and fast Fourier transform applications.
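The idea of spending extra redundancy only on the important bits can be sketched with the repetition half of such a scheme. In this simplified illustration the most significant bits of a word are stored several times and recovered by majority vote, while the remaining bits are stored once; the BCH component is omitted, and the word width, number of protected bits, and function names are assumptions.

```python
def encode_uep(word, width=8, protected=2, repeats=3):
    """Store the `protected` most significant bits `repeats` times, the rest once.

    Returns (copies, lsb): a list of MSB-field copies and the unprotected
    low-order field.
    """
    msb = word >> (width - protected)
    lsb = word & ((1 << (width - protected)) - 1)
    return [msb] * repeats, lsb


def decode_uep(copies, lsb, width=8, protected=2):
    """Recover the word, taking a per-bit majority vote over the MSB copies."""
    msb = 0
    for bit in range(protected):
        ones = sum((c >> bit) & 1 for c in copies)
        if ones * 2 > len(copies):
            msb |= 1 << bit
    return (msb << (width - protected)) | lsb


if __name__ == "__main__":
    word = 0b10110101
    copies, lsb = encode_uep(word)
    copies[1] ^= 0b10                       # memory failure in one MSB copy
    print(decode_uep(copies, lsb) == word)
```

A failure in a protected bit is outvoted, whereas a failure in the unprotected field would pass through; that asymmetry is acceptable when low-order bits contribute little to output quality.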
ETPL VLSI - 048
Design of a High-Performance System for Secure Image Communication in the Internet of Things
Image or video exchange over the Internet of Things (IoT) is a requirement in diverse applications, including smart health care, smart structures, and smart transportations. This paper presents a modular and extensible quadrotor architecture and its specific prototyping for automatic tracking applications. The architecture is extensible and based on off-the-shelf components for easy system prototyping. A target tracking and acquisition application is presented in detail to demonstrate the power and flexibility of the proposed design. Complete design details of the platform are also presented. The designed module implements the basic proportional-integral-derivative control and a custom target acquisition algorithm. Details of the sliding-window-based algorithm are also presented. This algorithm performs 20× faster than comparable approaches in OpenCV with equal accuracy. Additional modules can be integrated for more complex applications, such as search-and-rescue, automatic object tracking, and traffic congestion analysis. A hardware architecture for the newly introduced Better Portable Graphics (BPG) compression algorithm is also introduced in the framework of the extensible quadrotor architecture. Since its introduction in 1987, the Joint Photographic Experts Group (JPEG) graphics format has been the de facto choice for image compression. However, the new compression technique BPG outperforms the JPEG in terms of compression quality and size of the compressed file. The objective is to present a hardware architecture for enhanced real-time compression of the image. Finally, a prototyping platform of a hardware architecture for a secure digital camera (SDC) integrated with the secure BPG (SBPG) compression algorithm is presented. The proposed architecture is suitable for high-performance imaging in the IoT and is prototyped in Simulink.
To the best of our knowledge, this is the first ever proposed hardware architecture for SBPG compression integrated with an SDC.
ETPL VLSI - 049
Histogram-Based Ratio Mismatch Calibration for Bridge-DAC in 12-bit 120 MS/s SAR ADC
This brief reports a 120 MS/s 12-bit successive approximation register analog-to-digital converter (ADC). The conversion nonlinearity in a bridge digital-to-analog converter is analyzed, and its corresponding histogram-based ratio mismatch (HBRM) calibration is presented in detail. Verified by behavioral simulations as well as measured results, the solution improves both the dynamic performance and the static performance of the ADC. The measurement results demonstrate that the HBRM calibration effectively improves the signal-to-noise-and-distortion ratio from 56.9 to 63.7 dB at dc input, with a sampling frequency of 120 MS/s.
ETPL VLSI - 050
Energy and Area Efficient Three-Input XOR/XNORs With Systematic Cell Design Methodology
In this brief, we propose three efficient three-input XOR/XNOR circuits as the most significant blocks of digital systems with a new systematic cell design methodology (SCDM) in hybrid-CMOS logic style. SCDM, which is an extension of CDM, plays the essential role in designing efficient circuits. At first, it is deliberately given priority to general design goals in a base structure of circuits. This structure is generated systematically by employing binary decision diagram. After that, concerning high flexibility in design targets, SCDM aims to specific ones in the remaining three steps, which are wise selections of basic cells and amend mechanisms, as well as transistor sizing. In the end, the resultant three-input XOR/XNORs enjoy full-swing and fairly balanced outputs. They perform well with supply voltage scaling, and their critical path contains only two transistors. They also outperform their counterparts exhibiting 27%-77% reduction in average energy-delay product in HSPICE simulation based on TSMC 0.13-μm technology. The symmetric schematic topologies significantly simplify and minimize the layout, as 26%-32% improvement in area is demonstrated.