INTERNATIONAL JOURNAL FOR TRENDS IN ENGINEERING & TECHNOLOGY VOLUME 4 ISSUE 1 – APRIL 2015 - ISSN: 2349 - 9303
Performance Improved Network on Chip Router for Low Power Applications
Ms. A. Sambavi (1), Mr. I. Koil Pandi (2)
(1) Ratna Vel Subramaniam College of Engineering & Technology, ECE Department, sambavirama7@email.com
(2) Ratna Vel Subramaniam College of Engineering & Technology, ECE Department, gracemejar@email.com
Abstract— On-chip routers typically have buffers dedicated to their input or output ports for temporarily storing packets when contention occurs. Buffers consume a significant portion of router area. While running a traffic trace, however, not all input ports of a router have incoming packets to transfer simultaneously, so a large number of buffer queues in the network are empty while other queues are mostly busy. This observation motivates us to design Router architecture with Shared Queues (RoShaQ), a router architecture that maximizes buffer utilization by allowing multiple buffer queues to be shared among input ports. In NoC design, the most essential elements are the network topology and the routing algorithm. Routers route packets according to the algorithm they use, and every system has its own requirements for the routing algorithm. A new adaptive weighted XY routing algorithm for an eight-port router architecture is proposed in order to decrease the latency of the network-on-chip router.
Index Terms— adaptive weighted routing, Networks on Chip, router architecture, shared buffer.
1 INTRODUCTION
Network on chip (NoC) is a communication subsystem on an integrated circuit. In the design of NoCs, high throughput and low latency are both important design parameters, and the router microarchitecture plays a vital role in achieving these performance goals. In a typical router, each input port has an input buffer for temporarily storing packets in case the output channel is busy. This buffer can be a single queue, as in a wormhole (WH) router, or multiple queues in parallel, as in Virtual Channel (VC) routers. High-throughput routers allow an NoC to satisfy the communication needs of multicore and many-core applications; alternatively, the higher achievable throughput can be traded off for power savings, with fewer resources being used to attain a target bandwidth. Further, achieving high throughput is also critical from a delay perspective for applications with heavy communication workloads, because queueing delays grow rapidly as the network approaches saturation. Another approach is to share buffer queues, which allows idle buffers to be utilized or an output-buffered router to be emulated to obtain higher throughput. Our work differs from those router designs by allowing packets at input ports to bypass the shared queues; hence, it achieves lower zero-load latency. In addition, the proposed router architecture has simple control circuitry, so it dissipates less energy per packet than VC routers, and it achieves higher throughput by letting queues share workloads when the network load becomes heavy. The proposed routing algorithm, which is transparent with respect to the router implementation, is presented, discussed, and assessed by means of simulation on synthetic and real traffic scenarios. The analysis takes into account several aspects and metrics of the design.
2 RELATED WORKS AND CONTRIBUTIONS
Peh et al. [19] and Mullins et al. [16] proposed speculative techniques for VC routers that allow a packet to arbitrate simultaneously for both VCA and SA, giving non-speculative packets higher priority to win SA; this reduces zero-load latency when the probability of failed speculation is small. This low latency, however, comes with highly complex SA circuitry and wastes power each time the speculation fails. Sophisticated extensions to input-buffered router (IBR) microarchitectures have been proposed for improving throughput, latency, and power. For throughput, techniques such as flit-reservation flow control, variable allocation of VCs, and express VCs [13] have been proposed. As these designs are input-buffered, they are only able to multiplex arriving packets from their input ports across the crossbar switch, unlike our proposed router architecture, which can shuffle incoming packet flows from all input ports and then send them onto the crossbar switch. Recently, Passas et al. [18] designed a 128 × 128 crossbar that connects 128 tiles while occupying only 6% of their total area. This fact encourages us to build RoShaQ, which has two crossbars while sharing the costly buffer queues. Nicopoulos et al. [17] proposed ViChaR, a router architecture that allows packets to share flit slots inside the buffer queues so that it can achieve higher throughput. Our design manages buffers at a coarser grain, at the queue level rather than at the flit level; hence it allows reuse of an existing generic queue design, which makes the buffer and router design much simpler and more straightforward. Ramanujam et al. [21] recently proposed a router architecture with shared queues, named DSB, which emulates an output-buffered router. The majority of state-of-the-art on-chip router designs utilize input-queuing buffers; a few output-queuing router architectures can, however, be found in the literature [9]. Looking at the whole network, buffers at a router's output port act the same as the input buffers of its downstream router. Depending on the network load, RoShaQ can dynamically adapt to use the bypass paths or the shared queues.
————————————————
Ms. A. Sambavi is currently pursuing a master's degree program in applied electronics at Ratna Vel Subramaniam College of Engineering & Technology, Dindigul, PH-624005. E-mail: sambavirama7@mail.com
Mr. I. Koil Pandi is an Assistant Professor at Ratna Vel Subramaniam College of Engineering & Technology, Dindigul, PH-624005. E-mail: gracemajar@mail.com
3 DESCRIPTION OF THE METHOD
Routing on an NoC is quite similar to routing on any network. A routing algorithm determines how data is routed from sender to receiver. Oblivious algorithms route packets without any information about traffic volume or network conditions, deterministic algorithms always route packets along the same route, and stochastic routing is based on randomness.
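To make the distinction between deterministic and stochastic routing concrete, the following minimal Python sketch contrasts the two on a 2-D mesh. It is an illustration only: dimension-order (XY) routing is used here as a common example of a deterministic algorithm, the function names are hypothetical, and the coordinate convention (y decreasing toward north, x increasing toward east) is an assumption.

```python
import random

def productive_ports(cur, dst):
    """Ports that move a packet closer to dst on a 2-D mesh (x port listed first)."""
    (x, y), (xd, yd) = cur, dst
    ports = []
    if xd > x:
        ports.append("E")
    elif xd < x:
        ports.append("W")
    if yd < y:
        ports.append("N")  # assumed convention: y decreases toward north
    elif yd > y:
        ports.append("S")
    return ports

def deterministic_xy(cur, dst):
    """Resolve the x offset first, then y: the same route every time."""
    ports = productive_ports(cur, dst)
    return ports[0] if ports else "LOCAL"

def stochastic(cur, dst, rng):
    """Pick one productive port at random."""
    ports = productive_ports(cur, dst)
    return rng.choice(ports) if ports else "LOCAL"
```

Both functions ignore traffic conditions, so in the terminology above they are also oblivious; an adaptive algorithm would additionally consult network state before choosing a port.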
3.1 WH Router architecture
The router is the most important component in the design of an NoC system. Fig. 1 shows how a flit travels through a WH router. First, when a flit arrives at an input port, it is written to the corresponding buffer queue. This step is called buffer write or Queue Write (QW). Assuming there are no other packets ahead of it in the queue, the packet then computes the output port for its next router (based on the destination information contained in its head flit) rather than for the current router; this is known as Lookahead Routing Computation (LRC). Simultaneously, it arbitrates for its output port at the current router, because multiple packets from different input queues may request the same output port. This step is called Switch Allocation (SA). If it wins SA, it traverses the crossbar. This step is called crossbar traversal or Switch Traversal (ST). It then traverses the output link toward the next router. This step is called Link Traversal (LT).
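The stage sequence described above can be sketched as a simple timeline. This is a minimal illustration, assuming one cycle per stage; the function name and the `sa_retries` parameter (the number of cycles in which the head flit loses switch allocation) are hypothetical and not part of the original design description.

```python
def wh_head_flit_timeline(sa_retries=0):
    """Return (cycle, stage) pairs for a head flit in a WH router.

    Assumes one cycle per stage. LRC is computed once, in parallel with
    the first SA attempt; each lost arbitration adds one more SA cycle.
    """
    stages = ["QW", "LRC+SA"] + ["SA"] * sa_retries + ["ST", "LT"]
    return list(enumerate(stages, start=1))
```

With no contention the head flit needs four cycles from queue write to link traversal; each failed arbitration delays ST and LT by one cycle.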
3.2 VC Router architecture
A Virtual Channel Router (VCR) is a router that uses a source routing algorithm and wormhole flow control with virtual channels. It is suitable for on-chip networks with two-dimensional topologies. The datapath of the router consists of buffers and a switch. In this VC router design, an input buffer has multiple queues in parallel, each of which is called a VC. This allows packets in different queues to bypass each other and advance to the crossbar stage instead of being blocked by a packet at the head of the queue (however, all queues at one input port can still be blocked if none of them wins SA or if all corresponding output VC queues are full). Because an input port now has multiple VC queues, each packet has to choose a VC of its next router's input port before arbitrating for the output switch. The routing operation comprises four steps: Routing Computation (RC), virtual channel allocation (VCA), switch allocation (SA), and switch traversal (ST). These are often implemented as four pipeline stages in modern virtual-channel routers. When a head flit (the first flit of a packet) arrives at an input channel, the router stores the flit in the buffer for the allocated virtual channel and determines the next-hop node for the packet. This is the RC stage. Given the next hop, the router then allocates a virtual channel in the next hop. This is the VCA stage. The next-hop node and virtual channel decision is then used for the remaining flits of the packet, and the virtual channel is allocated exclusively to that packet until its transmission completes. Finally, if the next hop can accept the flit, the flit competes for the switch (the SA stage) and then moves to the output port (the ST stage). Fig. 2 shows the VC router architecture.
Fig. 2. VC Router architecture.
An output VC is granted to a packet by a Virtual Channel Allocator (VCA), and this VC allocation is performed in parallel with the LRC. A VC router achieves higher saturation throughput than a WH router with the same number of buffer entries per input port, but it also has higher zero-load latency due to its deeper pipeline.
Fig. 1. WH Router architecture.
Both LRC and SA are performed by the head flit of each packet; body and tail flits follow the route already reserved by the head flit, and the tail flit releases the reserved resources once it leaves the queue. In a WH router, if the packet at the head of a queue is blocked (because it is not granted by the SA or because the corresponding input queue of the downstream router is full), all packets behind it also stall.
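The difference between a single WH queue and parallel VC queues under blocking can be illustrated with a small Python sketch. It is a simplification under stated assumptions: each queue is modeled as a deque of (packet, output port) pairs, the function name is hypothetical, and switch allocation among the eligible packets is ignored.

```python
from collections import deque

def packets_that_can_advance(queues, blocked_outputs):
    """Return the head packet of every queue whose output port is not blocked.

    queues: list of deques of (packet_id, output_port). One deque models a
    WH input port; several deques model the VCs of one input port.
    """
    return [q[0][0] for q in queues if q and q[0][1] not in blocked_outputs]

# Packet A wants the east port, packet B the north port.
wh_port = [deque([("A", "E"), ("B", "N")])]        # single queue, B behind A
vc_port = [deque([("A", "E")]), deque([("B", "N")])]  # one VC per packet
```

When the east output is blocked, nothing in the WH port can advance because B is stuck behind A, whereas in the VC port B is still eligible to move toward the crossbar.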
3.3 Shared Buffer Router Architecture
When an input port receives a packet, it calculates the packet's output port for the next router (lookahead routing); at the same time, the packet arbitrates for both its decided output port and the shared queues. If it receives a grant from the output port allocators (OPAs), it advances to its output port in the next cycle. Otherwise, if it receives a grant to a shared queue, it is written to that shared queue in the next cycle. If it receives both grants, it gives priority to advancing to the output port. The shared queue allocator (SQA) receives requests from all input queues and grants their packets permission to access non-full shared queues. Fig. 3 shows the shared buffer router architecture. Packets from input queues are allowed to write to a shared queue only if the shared queue is empty or contains packets with the same output port as the requesting packet. The OPA receives requests from both input queues and shared queues. Both the SQA and the OPA grant these requests in a round-robin manner to guarantee fairness and to avoid starvation and livelock.
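The allocation rules of this section can be sketched in Python as follows. This is a behavioral illustration only, not the authors' hardware: a shared queue is modeled as a list of the output-port labels of the packets it holds, and all names (`can_write_shared`, `RoundRobinArbiter`, `next_step`, `capacity`) are hypothetical.

```python
def can_write_shared(shared_queue, capacity, out_port):
    """Write rule: the queue is not full, and is empty or holds only
    packets bound for the same output port as the requester."""
    if len(shared_queue) >= capacity:
        return False
    return all(port == out_port for port in shared_queue)

class RoundRobinArbiter:
    """Grants one requester per call and rotates priority after each grant."""

    def __init__(self, n):
        self.n = n
        self.next = 0  # index that currently has the highest priority

    def grant(self, requests):
        for offset in range(self.n):
            i = (self.next + offset) % self.n
            if requests[i]:
                self.next = (i + 1) % self.n  # winner gets lowest priority next
                return i
        return None  # no requests this cycle

def next_step(opa_grant, sqa_grant):
    """Action of a packet in the next cycle; an output-port grant wins."""
    if opa_grant:
        return "output_port"
    if sqa_grant:
        return "shared_queue"
    return "wait"
```

Because the winner of each arbitration moves to the back of the priority order, a requester that keeps asserting its request is granted within at most n arbitration rounds, which is the fairness property the round-robin scheme is chosen for.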
Fig. 3. Shared buffer router architecture.
The input queue, output port, and shared queue state registers maintain the status (idle, wait, or busy) of all queues and output ports, and cooperate with the SQA and OPA to control the overall operation of the router. Only the input queues of RoShaQ have routing computation logic, because packets in the shared queues were written from input queues and hence already carry their output port information. RoShaQ has the same I/O interface as a typical router, that is, the same number of I/O channels with flit-level flow control and credit-based backpressure management.
3.4 RoShaQ Datapath Pipeline
After a packet is written into an input queue in the first cycle, in the second cycle it simultaneously performs three operations: LRC, OPA, and SQA. LRC stands for Lookahead Routing Computation, OPA for Output Port Allocator, and SQA for Shared Queue Allocator. At low network load, there is a high chance that the packet wins the OPA because of low congestion at its desired output port; it is then granted permission to traverse the output crossbar and the output link toward the next router.
3.5 NoC Performance Parameters
Routing algorithms are evaluated primarily by average message latency and average system throughput. The hardware requirements, in terms of the buffer size required per node and the number of virtual channels per physical channel, are also used in comparing routing algorithms. The performance of a network on chip can be evaluated by three parameters: bandwidth, throughput, and latency.
1. Bandwidth: Bandwidth refers to the maximum rate of data propagation once a message is in the network. The unit of measure for bandwidth is bits per second (bps), and it usually considers the whole packet, including the bits of the header, payload, and tail.
2. Throughput: Throughput is defined as the maximum traffic accepted by the network, that is, the maximum amount of information delivered per time unit. Throughput is measured in messages per second or messages per clock cycle. A normalized throughput (independent of the size of the messages and of the network) can be obtained by dividing it by the size of the messages and by the size of the network.
3. Latency: Latency is the time elapsed between the beginning of the transmission of a message (or packet) and its complete reception at the target node. Latency is measured in time units and is mostly used as a basis for comparison among different design choices. It can also be expressed in terms of simulator clock cycles. Normally, the latency of a single packet is not meaningful, and the average latency is used to evaluate network performance. On the other hand, when some messages have a much higher latency than the average, this may be important.
4 WEIGHTED ROUTING ALGORITHM
In a physically static NoC, the routing decision can be distributed, or a source-based deterministic routing scheme may be employed. The weighted XY routing algorithm assigns each output port a weight based on its available bandwidth and on the distance between the current node and the destination node: dx, the distance along the x coordinate (columns), or dy, the distance along the y coordinate (rows). This ideally gives the packet the maximum number of sensible routing choices along its route, as it allows the packet to be routed toward its destination in both the x and y directions. The weight is also proportional to the available bandwidth. If the output port with the highest available bandwidth is chosen, the used bandwidth is distributed as evenly as possible among the output ports. Thus, the other output ports are more likely to be able to accommodate future transactions. The weight of each port is given as
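$$W_N = \begin{cases} B_N \times |y_d - y| + B_{\max}, & y_d - y < 0 \\ 0, & B_N < B_p \\ B_N, & \text{else} \end{cases} \tag{1}$$

$$W_S = \begin{cases} B_S \times (y_d - y) + B_{\max}, & y_d - y > 0 \\ 0, & B_S < B_p \\ B_S, & \text{else} \end{cases} \tag{2}$$

$$W_W = \begin{cases} B_W \times |x_d - x| + B_{\max}, & x_d - x < 0 \\ 0, & B_W < B_p \\ B_W, & \text{else} \end{cases} \tag{3}$$

$$W_E = \begin{cases} B_E \times (x_d - x) + B_{\max}, & x_d - x > 0 \\ 0, & B_E < B_p \\ B_E, & \text{else} \end{cases} \tag{4}$$

$$W_{NE} = \begin{cases} (B_N + B_E) \times (y_d - x_d) + B_{\max}, & y_d - x_d < 0 \\ 0, & (B_N + B_E) < B_p \\ B_N + B_E, & \text{else} \end{cases} \tag{5}$$

$$W_{NW} = \begin{cases} (B_N + B_W) \times (y_d + x_d) + B_{\max}, & y_d + x_d > 0 \\ 0, & (B_N + B_W) < B_p \\ B_N + B_W, & \text{else} \end{cases} \tag{6}$$

$$W_{SE} = \begin{cases} (B_S + B_E) \times (y_d - x_d) + B_{\max}, & y_d - x_d < 0 \\ 0, & (B_S + B_E) < B_p \\ B_S + B_E, & \text{else} \end{cases} \tag{7}$$

$$W_{SW} = \begin{cases} (B_S + B_W) \times (y_d + x_d) + B_{\max}, & y_d + x_d > 0 \\ 0, & (B_S + B_W) < B_p \\ B_S + B_W, & \text{else} \end{cases} \tag{8}$$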
In the above equations, the bandwidths of the north, south, west, and east ports are denoted $B_N$, $B_S$, $B_W$, and $B_E$. $B_{\max}$ is the maximum bandwidth and $B_p$ is the available port bandwidth. The weights of the north, south, west, east, northeast, northwest, southeast, and southwest ports are denoted $W_N$, $W_S$, $W_W$, $W_E$, $W_{NE}$, $W_{NW}$, $W_{SE}$, and $W_{SW}$. Each weight is proportional to both the distance from the current node to the destination and the available bandwidth if the output direction faces the destination, and proportional to the available bandwidth alone if it does not. If there is not enough bandwidth available, the weight is zero. The route chosen is the direction with the highest weight. If a packet enters the west port of the router, it is delivered through one of the other router ports, subject to the port constraints. The latency of a packet is measured from the time its head flit is generated by the source to the time its tail flit is consumed by the destination. Average network latency is the mean latency of all packets in the network.
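The weight rules for the four cardinal ports, Eqs. (1) to (4), can be sketched in Python as below. This is an illustrative reading rather than the authors' implementation. Two assumptions are made explicit: the zero-weight case is checked first, following the statement that weights are zero when there is not enough bandwidth, and the south and east ports are taken to face the destination when $y_d - y > 0$ and $x_d - x > 0$ respectively, mirroring the north and west cases. The function names and the dictionary layout are hypothetical, and the diagonal ports of Eqs. (5) to (8) are omitted.

```python
def port_weight(bandwidth, offset_toward_port, b_max, b_p):
    """Weight of one cardinal port.

    bandwidth: available bandwidth of the port (B_N, B_S, B_W or B_E)
    offset_toward_port: distance to the destination measured along this
        port's direction; positive only when the port faces the destination
    """
    if bandwidth < b_p:          # assumed to take priority: not enough bandwidth
        return 0
    if offset_toward_port > 0:   # port faces the destination
        return bandwidth * offset_toward_port + b_max
    return bandwidth             # port faces away: bandwidth only

def cardinal_weights(cur, dst, bw, b_max, b_p):
    """Weights W_N, W_S, W_W, W_E for current node cur and destination dst."""
    (x, y), (xd, yd) = cur, dst
    return {
        "N": port_weight(bw["N"], y - yd, b_max, b_p),  # faces dst when yd - y < 0
        "S": port_weight(bw["S"], yd - y, b_max, b_p),
        "W": port_weight(bw["W"], x - xd, b_max, b_p),  # faces dst when xd - x < 0
        "E": port_weight(bw["E"], xd - x, b_max, b_p),
    }

def choose_port(weights):
    """The route chosen is the direction with the highest weight."""
    return max(weights, key=weights.get)
```

For example, with the current node at (2, 2), the destination at (4, 1), port bandwidths N=3, S=5, W=4, E=2, $B_{\max}=10$ and $B_p=1$, the weights are N=13, S=5, W=4, E=14, so the east port is selected; if the east port's bandwidth drops below $B_p$, its weight becomes zero and the north port is chosen instead.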
5 RESULTS AND DISCUSSIONS
Our proposed router architecture achieves both the low latency of input-buffered routers and the high throughput of output-buffered routers without the need for internal router speedup. Input packets can also bypass the shared queues to achieve low latency when the network load is low. RoShaQ thus forms a router with fine-grained shared buffers, which could further improve network performance. To demonstrate the effectiveness of the proposed design, the simulation is performed for an eight-port network-on-chip router that routes according to the weight on each port. Fig. 4 shows the output of the eight-port router for a case in which the packet is delivered to the northwest port. In this case, the destination latitude is greater than the node latitude and, at the same time, the destination longitude is less than the node longitude.
Fig. 5 shows the RTL schematic, which clearly indicates the inputs and the outputs.
Fig. 5. RTL schematic.
Fig. 6 shows the technology synthesis diagram, which gives a cross-sectional front view of the IC.
Fig. 6. Technology Synthesis.
Fig. 7 shows the synthesis report, which indicates an average latency of 12.171 ns for the eight-port router.
Fig. 7. Synthesis Report.
Fig. 4. Northwest port output.
6 CONCLUSION
In this paper, an eight-port network-on-chip router using shared buffers was proposed. A new routing algorithm that supports real-time traffic direction arbitration while avoiding deadlock and starvation has been presented. Experimental results verified that the proposed eight-port router architecture can significantly reduce latency. Compared to a conventional router, the bandwidth utilization of our eight-port router is also more efficient. The eight-port network-on-chip router is designed based on the weight on each port of the router, and the proposed design achieves an average latency of 12.171 ns.
REFERENCES
[1] A. Banerjee, P. T. Wolkotte, R. D. Mullins, S. W. Moore, and G. J. M. Smit, "An energy and performance exploration of network-on-chip architectures," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 3, pp. 319–329, Mar. 2009.
[2] E. Beigne, "An asynchronous power aware and adaptive NoC based circuit," IEEE J. Solid-State Circuits, vol. 44, no. 4, pp. 1167–1177, Apr. 2009.
[3] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli, "NoC synthesis flow for customized domain specific multiprocessor systems-on-chip," IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 2, pp. 113–129, Feb. 2005.
[4] A. Bianco, P. Giaccone, G. Masera, and M. Ricca, "Power control for crossbar-based queued switches," IEEE Trans. Comput., vol. 62, no. 1, pp. 74–84, Jan. 2013.
[5] S. T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching output queueing with a combined input/output-queued switch," IEEE J. Sel. Areas Commun., vol. 17, no. 6, pp. 1030–1039, Jun. 1999.
[6] M. Galles, "Spider: A high-speed network interconnect," IEEE Micro, vol. 17, no. 1, pp. 34–39, Jan. 1997.
[7] D. Gebhardt, J. You, and K. S. Stevens, "Design of an energy-efficient asynchronous NoC and its optimization tools for heterogeneous SoCs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 9, pp. 1387–1399, Sep. 2011.
[8] O. He, S. Dong, W. Jang, J. Bian, and D. Z. Pan, "UNISM: Unified scheduling and mapping for general networks on chip," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 8, pp. 1496–1509, Aug. 2012.
[9] M. G. Hluchyj and M. J. Karol, "Queueing in high-performance packet switching," IEEE J. Sel. Areas Commun., vol. 6, no. 9, pp. 1587–1597, Dec. 1988.
[10] J. Hu and R. Marculescu, "Energy- and performance-aware mapping for NoC architecture," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 4, pp. 551–562, Apr. 2005.
[11] H. Ito, M. Kimura, K. Miyashita, T. Ishii, K. Okada, and K. Masu, "A bi-directional and multi-drop-transmission-line interconnect for multipoint-to-multipoint on-chip communications," IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 1020–1029, Apr. 2008.
[12] J. Kim, C. Nicopoulos, P. Dongkook, V. Narayanan, M. S. Yousif, and C. R. Das, "A gracefully degrading and energy-efficient modular router architecture for on-chip networks," in Proc. 33rd IEEE/ACM ISCA, Jun. 2006, pp. 4–15.
[13] A. Kumar, L. S. Peh, P. Kundu, and N. K. Jha, "Towards ideal on-chip communication using express virtual channels," IEEE Micro, vol. 28, no. 1, pp. 80–90, Jan. 2008.
[14] Y. C. Lan, H. A. Lin, S. H. Lo, Y. H. Hu, and S. J. Chen, "A bidirectional NoC (BiNoC) architecture with dynamic self-reconfigurable channel," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 3, pp. 427–440, Mar. 2011.
[15] B. Lin and I. Keslassy, "The concurrent matching switch architecture," in Proc. IEEE Comput. Commun. Soc. (INFOCOM), Apr. 2006.
[16] R. Mullins, A. West, and S. Moore, "Low-latency virtual-channel routers for on-chip networks," in Proc. 31st ISCA, Mar. 2004, p. 188.
[17] C. A. Nicopoulos, P. Dongkook, K. Jongman, N. Vijaykrishnan, M. S. Yousif, and C. R. Das, "ViChaR: A dynamic virtual channel regulator for network-on-chip routers," in Proc. 39th IEEE/ACM Int. Symp. Microarchitect. (MICRO), Dec. 2006, pp. 333–346.
[18] G. Passas, M. Katevenis, and D. Pnevmatikatos, "A 128 × 128 × 24 Gb/s crossbar interconnecting 128 tiles in a single hop and occupying 6% of their area," in Proc. NOCS, 2010, pp. 87–95.
[19] L. Peh and W. J. Dally, "A delay model and speculative architecture for pipelined routers," in Proc. Int. Symp. HPCA, Jan. 2001, pp. 255–266.
[20] A. Prakash, "Randomized parallel schedulers for switch-memory-switch routers: Analysis and numerical studies," in Proc. IEEE INFOCOM, vol. 3, Mar. 2004, pp. 2026–2037.
[21] R. S. Ramanujam, V. Soteriou, B. Lin, and L.-S. Peh, "Extending the effective throughput of NoCs with distributed shared-buffer routers," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 4, pp. 548–561, Apr. 2011.
[22] D. Seo, A. Ali, W.-T. Lim, and N. Rafique, "Near-optimal worst-case throughput routing for 2-D mesh networks," in Proc. 32nd IEEE/ACM ISCA, Jun. 2005, pp. 432–443.