E-book, English, 382 pages
Ma / Huang / Lai Networks-on-Chip
1st edition, 2014
ISBN: 978-0-12-801178-2
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: 6 - ePub Watermark
From Implementations to Programming Paradigms
Sheng Ma received the B.S. and Ph.D. degrees in computer science and technology from the National University of Defense Technology (NUDT) in 2007 and 2012, respectively. He visited the University of Toronto from Sept. 2010 to Sept. 2012. He is currently an Assistant Professor at the College of Computer, NUDT. His research interests include on-chip networks, SIMD architectures, and arithmetic unit designs.
Authors/Editors
Further information & material
1;Front Cover;1
2;Networks-on-Chip: From Implementations to Programming Paradigms;4
3;Copyright;5
4;Contents in Brief;6
5;Contents;8
6;Preface;16
7;About the Editor-in-Chief and Authors;20
7.1; Editor-in-Chief;20
7.2; Authors;20
8;Part I: Prologue;22
8.1;Chapter 1: Introduction;24
8.1.1;1.1 The dawn of the many-core era;24
8.1.2;1.2 Communication-centric cross-layer optimizations;26
8.1.3;1.3 A baseline design space exploration of NoCs;28
8.1.3.1;1.3.1 Topology;29
8.1.3.2;1.3.2 Routing algorithm;30
8.1.3.3;1.3.3 Flow control;32
8.1.3.4;1.3.4 Router microarchitecture;34
8.1.3.5;1.3.5 Performance metric;37
8.1.4;1.4 Review of NoC research;38
8.1.4.1;1.4.1 Research on topologies;38
8.1.4.2;1.4.2 Research on unicast routing;39
8.1.4.3;1.4.3 Research on supporting collective communications;40
8.1.4.4;1.4.4 Research on flow control;41
8.1.4.5;1.4.5 Research on router microarchitecture;43
8.1.5;1.5 Trends of real processors;44
8.1.5.1;1.5.1 The MIT Raw processor;44
8.1.5.2;1.5.2 The Tilera TILE64 processor;45
8.1.5.3;1.5.3 The Sony/Toshiba/IBM Cell processor;47
8.1.5.4;1.5.4 The U.T. Austin TRIPS processor;49
8.1.5.5;1.5.5 The Intel Teraflops processor;50
8.1.5.6;1.5.6 The Intel SCC processor;51
8.1.5.7;1.5.7 The Intel Larrabee processor;53
8.1.5.8;1.5.8 The Intel Knights Corner processor;55
8.1.5.9;1.5.9 Summary of real processors;57
8.1.6;1.6 Overview of the book;59
8.1.7; References;60
9;Part II: Logic implementations;72
9.1;Chapter 2: A single-cycle router with wing channels;74
9.1.1;2.1 Introduction;74
9.1.2;2.2 The router architecture;76
9.1.2.1;2.2.1 The overall architecture;77
9.1.2.2;2.2.2 Wing channels;81
9.1.3;2.3 Microarchitecture designs;83
9.1.3.1;2.3.1 Channel dispensers;83
9.1.3.2;2.3.2 Fast arbiter components;85
9.1.3.3;2.3.3 SIG managers and SIG controllers;86
9.1.4;2.4 Experimental results;88
9.1.4.1;2.4.1 Simulation infrastructures;88
9.1.4.2;2.4.2 Pipeline delay analysis;88
9.1.4.3;2.4.3 Latency and throughput;89
9.1.4.4;2.4.4 Area and power consumption;94
9.1.5;2.5 Chapter summary;95
9.1.6; References;95
9.2;Chapter 3: Dynamic virtual channel routers with congestion awareness;98
9.2.1;3.1 Introduction;98
9.2.2;3.2 DVC with congestion awareness;100
9.2.2.1;3.2.1 DVC scheme;100
9.2.2.2;3.2.2 Congestion avoidance scheme;102
9.2.3;3.3 Multiple-port shared buffer with congestion awareness;103
9.2.3.1;3.3.1 DVC scheme among multiple ports;103
9.2.3.2;3.3.2 Congestion avoidance scheme;105
9.2.4;3.4 DVC router microarchitecture;106
9.2.4.1;3.4.1 VC control module;107
9.2.4.2;3.4.2 Metric aggregation and congestion avoidance;109
9.2.4.3;3.4.3 VC allocation module;111
9.2.5;3.5 HiBB router microarchitecture;112
9.2.5.1;3.5.1 VC control module;113
9.2.5.2;3.5.2 VC allocation and output port allocation;113
9.2.5.3;3.5.3 VC regulation;116
9.2.6;3.6 Evaluation;117
9.2.6.1;3.6.1 DVC router evaluation;117
9.2.6.2;3.6.2 HiBB router evaluation;119
9.2.7;3.7 Chapter summary;123
9.2.8; References;123
9.3;Chapter 4: Virtual bus structure-based network-on-chip topologies;128
9.3.1;4.1 Introduction;129
9.3.2;4.2 Background;130
9.3.3;4.3 Motivation;131
9.3.3.1;4.3.1 Baseline on-chip communication networks;131
9.3.3.1.1;4.3.1.1 Transaction-based bus;131
9.3.3.1.2;4.3.1.2 Packet-based NoC;132
9.3.3.2;4.3.2 Analysis of NoC problems;132
9.3.3.2.1;4.3.2.1 Multihop problem;133
9.3.3.2.2;4.3.2.2 Multicast problem;134
9.3.3.3;4.3.3 Advantages of a transaction-based bus;134
9.3.4;4.4 The VBON;135
9.3.4.1;4.4.1 Interconnect structures;135
9.3.4.1.1;4.4.1.1 Wire delay consideration;136
9.3.4.2;4.4.2 The VB mechanism;137
9.3.4.2.1;4.4.2.1 The VB construction;137
9.3.4.2.2;4.4.2.2 VB arbitration;138
9.3.4.2.3;4.4.2.3 Packet format;140
9.3.4.2.4;4.4.2.4 VB operation;142
9.3.4.2.5;4.4.2.5 A simple example for VB communication;144
9.3.4.3;4.4.3 Starvation and deadlock avoidance;144
9.3.4.4;4.4.4 The VBON router microarchitecture;145
9.3.5;4.5 Evaluation;146
9.3.5.1;4.5.1 Simulation infrastructures;147
9.3.5.1.1;4.5.1.1 Router choices for comparison;147
9.3.5.1.2;4.5.1.2 Network configuration;148
9.3.5.1.3;4.5.1.3 Traffic generation;149
9.3.5.2;4.5.2 Synthetic traffic evaluations;150
9.3.5.2.1;4.5.2.1 Single-level 4 × 4 VBON;150
9.3.5.2.2;4.5.2.2 Hierarchical 8 × 8 VBON;151
9.3.5.3;4.5.3 Real application evaluations;153
9.3.5.4;4.5.4 Power consumption analysis;153
9.3.5.5;4.5.5 Overhead analysis;153
9.3.6;4.6 Chapter summary;156
9.3.7; References;157
10;Part III: Routing and flow control;160
10.1;Chapter 5: Routing algorithms for workload consolidation;162
10.1.1;5.1 Introduction;163
10.1.2;5.2 Background;164
10.1.3;5.3 Motivation;166
10.1.3.1;5.3.1 Insufficient information;166
10.1.3.2;5.3.2 Intraregion interference;166
10.1.3.3;5.3.3 Inter-region interference;168
10.1.4;5.4 Destination-based adaptive routing;169
10.1.4.1;5.4.1 Destination-based selection strategy;169
10.1.4.1.1;5.4.1.1 Congestion information propagation network;169
10.1.4.1.2;5.4.1.2 DBSS router microarchitecture;171
10.1.4.2;5.4.2 Routing function design;173
10.1.4.2.1;5.4.2.1 Offered path diversity;173
10.1.4.2.2;5.4.2.2 VC reallocation scheme;175
10.1.5;5.5 Evaluation;176
10.1.5.1;5.5.1 Evaluation of routing functions;177
10.1.5.2;5.5.2 Single-region performance;179
10.1.5.2.1;5.5.2.1 Synthetic traffic results;179
10.1.5.2.2;5.5.2.2 Application results;180
10.1.5.3;5.5.3 Multiple-region performance;182
10.1.5.3.1;5.5.3.1 Results for a small regular region;182
10.1.5.3.2;5.5.3.2 Irregular-region results;183
10.1.5.3.3;5.5.3.3 Summary;184
10.1.5.4;5.5.4 CMesh evaluation;184
10.1.5.4.1;5.5.4.1 Configuration;184
10.1.5.4.2;5.5.4.2 Performance;184
10.1.5.5;5.5.5 Hardware overhead;187
10.1.5.5.1;5.5.5.1 Wiring overhead;187
10.1.5.5.2;5.5.5.2 Router overhead;187
10.1.5.5.3;5.5.5.3 Power consumption;187
10.1.6;5.6 Analysis and discussion;188
10.1.6.1;5.6.1 In-depth analysis of interference;188
10.1.6.2;5.6.2 Design space exploration;190
10.1.6.2.1;5.6.2.1 Number of propagation wires;190
10.1.6.2.2;5.6.2.2 DBSS scalability;190
10.1.6.2.3;5.6.2.3 Congestion propagation delay;190
10.1.7;5.7 Chapter summary;190
10.1.8; References;191
10.2;Chapter 6: Flow control for fully adaptive routing;196
10.2.1;6.1 Introduction;197
10.2.2;6.2 Background;200
10.2.2.1;6.2.1 Deadlock avoidance theories;200
10.2.2.2;6.2.2 Fully adaptive routing algorithms;200
10.2.3;6.3 Motivation;201
10.2.3.1;6.3.1 VC reallocation;201
10.2.3.2;6.3.2 Routing flexibility;201
10.2.4;6.4 Flow control and routing designs;202
10.2.4.1;6.4.1 Whole packet forwarding;203
10.2.4.2;6.4.2 Aggressive VC reallocation for EVCs;206
10.2.4.3;6.4.3 Maintain routing flexibility;209
10.2.4.4;6.4.4 Router microarchitecture;209
10.2.5;6.5 Evaluation on synthetic traffic;211
10.2.5.1;6.5.1 Performance of synthetic workloads;212
10.2.5.2;6.5.2 Buffer utilization of routing algorithms;213
10.2.5.3;6.5.3 Sensitivity to network design;215
10.2.5.3.1;6.5.3.1 SFP ratio;215
10.2.5.3.2;6.5.3.2 VC depth;217
10.2.5.3.3;6.5.3.3 VC count;218
10.2.5.3.4;6.5.3.4 Network size;219
10.2.6;6.6 Evaluation of PARSEC workloads;220
10.2.6.1;6.6.1 Methodology and configuration;220
10.2.6.2;6.6.2 Performance;221
10.2.7;6.7 Detailed analysis of flow control;222
10.2.7.1;6.7.1 The detailed buffer utilization;222
10.2.7.1.1;6.7.1.1 Allowable EVCs;222
10.2.7.1.2;6.7.1.2 Performance analysis;224
10.2.7.2;6.7.2 The effect of flow control on fairness;225
10.2.8;6.8 Further discussion;228
10.2.8.1;6.8.1 Packet length;228
10.2.8.2;6.8.2 Dynamically allocated multiqueue and hybrid flow controls;229
10.2.9;6.9 Chapter summary;230
10.2.10; Appendix: Logical Equivalence of Alg and Alg + WPF;230
10.2.11; References;232
10.3;Chapter 7: Deadlock-free flow control for torus networks-on-chip;236
10.3.1;7.1 Introduction;237
10.3.2;7.2 Limitations of existing designs;239
10.3.2.1;7.2.1 Dateline;239
10.3.2.2;7.2.2 Localized bubble scheme;240
10.3.2.3;7.2.3 Critical bubble scheme;240
10.3.2.4;7.2.4 Inefficiency with variable-size packets;241
10.3.3;7.3 Flit bubble flow control;242
10.3.3.1;7.3.1 Theoretical description;242
10.3.3.2;7.3.2 FBFC-localized;243
10.3.3.3;7.3.3 FBFC-critical;244
10.3.3.4;7.3.4 Starvation;245
10.3.4;7.4 Router microarchitecture;246
10.3.4.1;7.4.1 FBFC routers;246
10.3.4.2;7.4.2 VCT routers;247
10.3.5;7.5 Methodology;248
10.3.6;7.6 Evaluation on 1D tori (rings);249
10.3.6.1;7.6.1 Performance;249
10.3.6.2;7.6.2 Buffer utilization;251
10.3.6.3;7.6.3 Latency of short and long packets;252
10.3.7;7.7 Evaluation on 2D tori;252
10.3.7.1;7.7.1 Performance for a 4 × 4 torus;252
10.3.7.2;7.7.2 Sensitivity to SFP ratios;254
10.3.7.3;7.7.3 Sensitivity to buffer size;255
10.3.7.4;7.7.4 Scalability for an 8 × 8 torus;257
10.3.7.5;7.7.5 Effect of starvation;257
10.3.7.6;7.7.6 Real application performance;259
10.3.7.7;7.7.7 Large-scale systems and message passing;260
10.3.8;7.8 Overheads: Power and area;261
10.3.8.1;7.8.1 Methodology;261
10.3.8.2;7.8.2 Power efficiency;262
10.3.8.3;7.8.3 Area;265
10.3.8.4;7.8.4 Comparison with meshes;266
10.3.9;7.9 Discussion and related work;269
10.3.9.1;7.9.1 Discussion;269
10.3.9.2;7.9.2 Related work;269
10.3.10;7.10 Chapter summary;270
10.3.11; References;270
11;Part IV: Programming paradigms;274
11.1;Chapter 8: Supporting cache-coherent collective communications;276
11.1.1;8.1 Introduction;277
11.1.2;8.2 Message combination framework;279
11.1.2.1;8.2.1 MCT format;281
11.1.2.2;8.2.2 Message combination example;281
11.1.2.3;8.2.3 Insufficient MCT entries;284
11.1.3;8.3 BAM routing;284
11.1.4;8.4 Router pipeline and microarchitecture;286
11.1.5;8.5 Evaluation;288
11.1.5.1;8.5.1 Performance;290
11.1.5.1.1;8.5.1.1 Overall network performance;290
11.1.5.1.2;8.5.1.2 Multicast transaction performance;291
11.1.5.1.3;8.5.1.3 Real application performance;292
11.1.5.2;8.5.2 Comparing multicast VN configurations;293
11.1.5.2.1;8.5.2.1 Unicast performance;293
11.1.5.2.2;8.5.2.2 Multicast performance;294
11.1.5.3;8.5.3 MCT size;295
11.1.5.4;8.5.4 Sensitivity to network design;297
11.1.5.4.1;8.5.4.1 VC count;297
11.1.5.4.2;8.5.4.2 Multicast ratio;298
11.1.5.4.3;8.5.4.3 Destinations per multicast;298
11.1.5.4.4;8.5.4.4 Network size;299
11.1.6;8.6 Power analysis;299
11.1.7;8.7 Related work;301
11.1.7.1;8.7.1 Message combination;301
11.1.7.2;8.7.2 NoC multicast routing;301
11.1.8;8.8 Chapter summary;302
11.1.9; References;302
11.2;Chapter 9: Network-on-chip customizations for message passing interface primitives;306
11.2.1;9.1 Introduction;307
11.2.2;9.2 Background;308
11.2.3;9.3 Motivation;310
11.2.3.1;9.3.1 MPI adaption in NoC designs;310
11.2.3.2;9.3.2 Optimizations of MPI functions;311
11.2.4;9.4 Communication customization architectures;311
11.2.4.1;9.4.1 Architecture overview;311
11.2.4.2;9.4.2 The customized NoC design: VBON;313
11.2.4.3;9.4.3 The MPI primitive implementation: MU;313
11.2.4.3.1;9.4.3.1 The architecture of the MU;313
11.2.4.3.2;9.4.3.2 MPI processing unit;316
11.2.4.3.3;9.4.3.3 The collective operation implementation;318
11.2.4.3.4;9.4.3.4 Communication protocols;320
11.2.5;9.5 Evaluation;323
11.2.5.1;9.5.1 Methodology;323
11.2.5.2;9.5.2 Experimental results;324
11.2.5.2.1;9.5.2.1 The effect of point-to-point communication: Bandwidth;324
11.2.5.2.2;9.5.2.2 The effect of collective communication: Broadcast operations;325
11.2.5.2.3;9.5.2.3 The effect of collective communication: Barrier operations;327
11.2.5.2.4;9.5.2.4 The effect of collective communication: Reduce operation;328
11.2.5.2.5;9.5.2.5 The effect of application communication: Performance;329
11.2.5.2.6;9.5.2.6 The effect of application communication: Power and scalability;331
11.2.5.2.7;9.5.2.7 Implementation overheads;332
11.2.6;9.6 Chapter summary;333
11.2.7; References;333
11.3;Chapter 10: Message passing interface communication protocol optimizations;338
11.3.1;10.1 Introduction;339
11.3.2;10.2 Background;340
11.3.2.1;10.2.1 Communication protocols in MPI;340
11.3.2.2;10.2.2 Existing problems;341
11.3.2.2.1;10.2.2.1 Correctness problems;341
11.3.2.2.2;10.2.2.2 Retry problems;342
11.3.2.2.3;10.2.2.3 Performance problems;345
11.3.2.3;10.2.3 Related work;346
11.3.3;10.3 Motivation;347
11.3.4;10.4 Adaptive communication mechanisms;349
11.3.4.1;10.4.1 Goals and approaches;349
11.3.4.2;10.4.2 Baseline MPI-accelerated NoC designs;350
11.3.4.3;10.4.3 ADCM architectural support;352
11.3.4.3.1;10.4.3.1 ADCM hardware;352
11.3.4.3.2;10.4.3.2 Adaptive algorithm implementation;354
11.3.4.3.3;10.4.3.3 The packet format;357
11.3.4.4;10.4.4 Comparison with the ideal protocol;358
11.3.5;10.5 Evaluation;359
11.3.5.1;10.5.1 Methodology;359
11.3.5.2;10.5.2 Synthetic traffic results;361
11.3.5.2.1;10.5.2.1 Round-trip traffic pattern;361
11.3.5.2.2;10.5.2.2 Hotspot traffic pattern;362
11.3.5.3;10.5.3 Real application results;364
11.3.5.4;10.5.4 Sensitivity analysis;367
11.3.5.5;10.5.5 The hardware overhead;367
11.3.6;10.6 Chapter summary;368
11.3.7; References;369
12;Part V: Epilogue;372
12.1;Chapter 11: Conclusions and future work;374
12.1.1;11.1 Conclusions;374
12.1.2;11.2 Future work;376
13;Index;378
A single-cycle router with wing channels
Abstract
With increasing numbers of cores, the communication latency of networks-on-chip becomes a dominant problem owing to the complex operations performed at each node. In this chapter, we reduce the communication latency by proposing a single-cycle router architecture with wing channels, which forward incoming packets to free ports immediately after inspecting the switch allocation results. In addition, the incoming packets assigned to wing channels can fill the time slots of the crossbar switch and reduce contention with subsequent packets, thereby increasing the throughput effectively. We design the proposed router using a 65 nm CMOS process, and the results show that it supports different routing schemes and outperforms the express virtual channel, prediction, and Kumar’s single-cycle routers in terms of latency and throughput. When compared with the speculative router, it provides a latency reduction of 45.7% and a throughput improvement of 14.0%. Moreover, we show that the proposed design incurs a modest area overhead of 8.1%, but the power consumption is reduced by 7.8% owing to fewer arbitration activities.
Keywords
Single-cycle router
Wing channel
Switch allocation inspection
Low communication latency
Chapter outline
2.1 Introduction
2.2 The Router Architecture
2.2.1 The Overall Architecture
2.2.2 Wing Channels
2.3 Microarchitecture Designs
2.3.1 Channel Dispensers
2.3.2 Fast Arbiter Components
2.3.3 SIG Managers and SIG Controllers
2.4 Experimental Results
2.4.1 Simulation Infrastructures
2.4.2 Pipeline Delay Analysis
2.4.3 Latency and Throughput
2.4.4 Area and Power Consumption
2.5 Chapter Summary
References
2.1 Introduction
As semiconductor technology continues to advance into the nanometer regime, a single chip will soon be able to integrate thousands of cores. There is a wide consensus in both industry and academia that the many-core chip is the only efficient way to utilize the billions of transistors and that it represents the trend of future processor architectures. Recently, industry and academia have delivered several commercial or prototype many-core chips, such as the Teraflops [5], TILE64 [18], and Kilocore [10] processors. Traditional bus and crossbar interconnection structures face several challenges in the many-core era, including sharply increasing wire delay and poor scalability. The network-on-chip (NoC) addresses the increasing interconnection complexity by introducing a packet-switched fabric for on-chip communication [1].
Although the NoC mitigates the long wire delay problem better than the traditional structures, the communication latency is still a dominant challenge with increasing core counts. For example, the average communication latencies of the 80-core Teraflops and 64-core TILE64 processors are close to 41 and 31 cycles, respectively, since packets forwarded between cores must undergo complex operations at each hop through five-stage or four-stage routers. The mean minimal path of an n × n mesh is given by the formula 2n/3 - 1/3n [20], so the communication latency grows linearly with n as the network expands. In this way, the communication latency easily becomes the bottleneck of application performance for many-core chips.
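The linear trend can be checked by brute force. The following is a minimal sketch that is not taken from the chapter: it assumes uniform traffic between distinct nodes and counts hops as Manhattan distance. The helper name `mean_hops` is hypothetical, and the exact constant terms of a closed form such as the one quoted from [20] depend on counting conventions that this sketch does not attempt to match.

```python
from itertools import product


def mean_hops(n: int) -> float:
    """Mean Manhattan distance between distinct nodes of an n x n mesh.

    Assumption (not from the chapter): every ordered pair of distinct
    nodes is equally likely, and routing is minimal.
    """
    nodes = list(product(range(n), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1), (x2, y2) in product(nodes, repeat=2):
        if (x1, y1) != (x2, y2):
            total += abs(x1 - x2) + abs(y1 - y2)
            pairs += 1
    return total / pairs


# Doubling the mesh dimension roughly doubles the mean hop count.
for n in (2, 4, 8):
    print(n, round(mean_hops(n), 3))
```

Multiplying the mean hop count by the per-hop router delay gives a rough zero-load latency estimate, which is why shortening the router pipeline pays off at every hop.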
There has been significant research on reducing the NoC communication latency via several approaches, such as designing novel topologies and developing fast routers. Bourduas and Zilic [2] proposed a hybrid topology which combines the mesh and hierarchical ring to provide fewer transfer cycles. In theory, architects prefer to adopt high-radix topologies to further reduce average hop counts; however, for complex structures such as a flattened butterfly [6], finding an efficient wiring layout during the back-end design flow is a challenge in its own right.
Recently, many aggressive router architectures with single-cycle hop latencies have been developed. Kumar et al. [8] proposed the express virtual channel (EVC) to reduce the communication latency by bypassing intermediate routers in a completely nonspeculative fashion. This method efficiently closes the gap between speculative routers and ideal routers; however, it does not work well at some nonintermediate nodes and is suitable only for deterministic routing. Moreover, it sends a starvation token upstream every fixed n cycles to stop the EVC flits, so that the normal flits of high-load nodes are not starved. This starvation prevention scheme forces many packets at the EVC source node to be forwarded via a normal virtual channel (VC), which increases average latencies.
A predictive switching scheme is proposed in Refs. [14, 16], where the incoming packets are transferred without waiting for the routing computation (RC) and switch allocation (SA) if the prediction hits. Matsutani et al. [11] analyzed the prediction rates of six algorithms, and found that the average hit rate of the best one was only 70% under different traffic patterns. This means that many packets still require at least three cycles to go through a router when the prediction misses or several packets conflict. Kumar et al. [7] presented a single-cycle router pipeline which uses advanced bundles to remove the control setup overhead. However, their proposed design works well only at a low traffic rate, since it relies on the input buffer being empty when the advanced bundle arrives. Finally, the preferred path design [12] is also prespecified to offer the ideal latency, but it cannot adapt to different network environments.
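To put the 70% hit rate in perspective, the expected per-router delay of a predictive router can be estimated as a weighted average. This is only an illustrative sketch: the one-cycle delay on a hit is an assumption, the three-cycle delay on a miss is the lower bound stated above, and the function name is hypothetical.

```python
def expected_router_cycles(hit_rate: float,
                           hit_cycles: int = 1,
                           miss_cycles: int = 3) -> float:
    """Expected cycles per router for a predictive router.

    hit_cycles = 1 is an assumption; miss_cycles = 3 is the lower
    bound quoted in the text, so the result is itself a lower bound.
    """
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# With the best reported hit rate of 70%, about 1.6 cycles per hop.
print(expected_router_cycles(0.7))
```

Under these assumptions, even the best predictor stays well above the single-cycle ideal, and packet conflicts would push the figure higher still.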
In addition to the single-cycle transfer property exhibited by some of the techniques mentioned above, we emphasize three other important properties for the design of an efficient low-latency router:
(1) A preferred technique that accelerates a specific traffic pattern should also work well for other patterns, and ideally it should suit different routing schemes, including both deterministic and adaptive ones.
(2) In addition to low latencies under light network loads, high throughput and low latencies under different loads are also important, since the traffic rate on an NoC can change easily.
(3) Complex hardware mechanisms, including prediction, speculation, retransmission, and abort detection logic, should be avoided to keep the design cost-efficient.
To achieve these three desired properties, we propose a novel low-latency router architecture with wing channels in Section 2.2. Regardless of the traffic rate, the proposed router inspects the SA results, and then selects some new packets without port conflicts to enter the wing channel and fill the time slots of crossbar ports, thereby bypassing the complex two-stage allocations and directly forwarding the incoming packets downstream in the next cycle. Here, no matter what the traffic pattern or routing scheme is, once there is no port conflict, the new packet at the current router can be delivered within one cycle, which is the optimal case in our opinion. Moreover, as the packets of the wing channel make full use of the crossbar time slots and reduce contention with subsequent packets, the network throughput is also increased effectively.
We then modify a traditional router at little additional cost, and present the detailed microarchitecture and circuit schematics of our proposed router in Section 2.3. In Section 2.4 we estimate the timing and power consumption using commercial tools, and evaluate the network performance via a cycle-accurate simulator considering different routing schemes under various traffic rates or patterns. Our experimental results show that the proposed router outperforms the EVC router, the prediction router, and Kumar's single-cycle router in terms of latency and throughput metrics. Compared with the state-of-the-art speculative router, our proposed router provides a latency reduction of 45.7% and a throughput improvement of 14.0% on average. The evaluation results for the proposed router also show that although the router area is increased by 8.1%, its average power consumption is reduced by 7.8% owing to fewer arbitration activities at low rates. Finally, Section...
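The selection step described above can be illustrated with a toy behavioral model. This sketch is not the chapter's hardware design: the function name, the port labels, the treatment of crossbar input conflicts, and the fixed-priority tie-break between new packets are all assumptions made for illustration.

```python
from typing import Dict


def select_wing_packets(sa_grants: Dict[str, str],
                        new_requests: Dict[str, str]) -> Dict[str, str]:
    """Pick newly arrived packets that can use a wing channel.

    sa_grants:    {input_port: output_port} already granted by switch
                  allocation for the next crossbar cycle.
    new_requests: {input_port: output_port} requested by new packets.

    A new packet is selected only if neither its crossbar input nor
    its requested output is already taken (assumed conflict rule).
    """
    busy_inputs = set(sa_grants.keys())
    busy_outputs = set(sa_grants.values())
    chosen: Dict[str, str] = {}
    # Assumed tie-break: fixed priority in sorted port-name order.
    for in_port in sorted(new_requests):
        out_port = new_requests[in_port]
        if in_port not in busy_inputs and out_port not in busy_outputs:
            chosen[in_port] = out_port
            busy_inputs.add(in_port)
            busy_outputs.add(out_port)
    return chosen


grants = {"north": "east"}
arrivals = {"west": "east", "south": "local"}
print(select_wing_packets(grants, arrivals))
```

In this example the packet arriving from the west loses because the east output is already granted, while the packet from the south fills an otherwise idle crossbar slot.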





