E-book, English, Volume 96, 266 pages
Series: Advances in Computers
Namasudra / Milutinovic: Dataflow Processing
1st edition, 2015
ISBN: 978-0-12-802342-6
Publisher: Elsevier Science & Techn.
Format: EPUB
Copy protection: Adobe DRM (see system requirements)
Table of Contents
Front Cover
Dataflow Processing
Copyright
Contents
Preface
Chapter 1: An Overview of Selected Heterogeneous and Reconfigurable Architectures
  1. Introduction
  2. Problem Statement
  3. Existing Solutions and Their Criticism
    3.1. Presentation of Existing Solutions
      3.1.1. NVIDIA Fermi GPU
      3.1.2. AMD ATI
      3.1.3. Cell Broadband Engine Architecture
      3.1.4. ClearSpeed
      3.1.5. Maxeler Dataflow Engines
      3.1.6. SGI's Reconfigurable Application-Specific Computing
      3.1.7. Convey's Reconfigurable COP
    3.2. Classification of Presented Solutions
  4. Summary of Presented Solutions
    4.1. Computation and Memory Capacity
    4.2. Programming Considerations
  5. Performance Comparison of Presented Solutions
    5.1. Analytical Comparison
    5.2. Experimental Comparison
  6. Conclusion
  Acknowledgments
  References
  About the Author
Chapter 2: Concurrency, Synchronization, and Speculation—The Dataflow Way
  1. Introduction
  2. Dataflow Concepts
    2.1. Dataflow Formalism
    2.2. Generalized Firing Semantic Sets (FSS)
    2.3. Graphical Notations
    2.4. Concurrency
    2.5. Synchronization
    2.6. Speculation or Greedy Execution
  3. Dataflow Languages
    3.1. Dataflow Structures
      3.1.1. Direct Access Method
      3.1.2. Indirect Access Method
      3.1.3. Dennis's Method
      3.1.4. I-Structures
      3.1.5. Hybrid Scheme
    3.2. Id: Irvine Dataflow Language
    3.3. VAL
    3.4. Streams and Iteration in a Single Assignment Language
  4. Historical Dataflow Architectures
    4.1. Dataflow Instructions
      4.1.1. Classical Architectures
    4.2. Static Dataflow Architectures
    4.3. Dynamic Dataflow Architectures
    4.4. Explicit Token Store
    4.5. Dataflow Limitations
      4.5.1. Localities and Memory Hierarchy
      4.5.2. Granularity of Computations
      4.5.3. Synchronous Execution
      4.5.4. Memory Ordering
    4.6. Hybrid Dataflow/Controlflow Architectures
      4.6.1. Coarse-Grain Dataflow
      4.6.2. Complex Dataflow
      4.6.3. RISC Dataflow
      4.6.4. Threaded Dataflow
  5. Recent Dataflow Architectures
    5.1. Tera-op Reliable Intelligently Adaptive Processing System
    5.2. Data-Driven Workstation Network
    5.3. WaveScalar
    5.4. Teraflux
    5.5. Maxeler
    5.6. Codelet
    5.7. Recent Architectures Summary
  6. Case Study: Scheduled Dataflow
    6.1. Organization
    6.2. Instruction Format
    6.3. Execution Paradigm
    6.4. Architecture
      6.4.1. Synchronization Pipeline
      6.4.2. Execution Pipeline
      6.4.3. Scheduling Unit
    6.5. Support for Thread-Level Speculation
      6.5.1. TLS Schema for SDF
      6.5.2. Speculation Extensions to SDF
  7. Conclusions
  References
  About the Author
Chapter 3: Dataflow Computing in Extreme Performance Conditions
  1. Introduction
  2. Dataflow Computing
  3. Maxeler Multiscale DFEs
  4. Development Process
    4.1. Analysis
    4.2. Transformation
    4.3. Partitioning
    4.4. Implementation
  5. Programming with MaxCompiler
  6. Dataflow Clusters
    6.1. Power Efficient Computing
    6.2. Data Storage
    6.3. Interconnections
    6.4. Cluster-Level Management
    6.5. Reliability
  7. Case Study: Meteorological Limited-Area Model
    7.1. The Application
    7.2. The Model
    7.3. The Algorithm
    7.4. Analysis
    7.5. Partitioning
    7.6. Transformation
    7.7. Parallelization
    7.8. Experimental Setup
    7.9. Results
    7.10. Speedup
  8. Conclusion
  References
  About the Author
Chapter 4: Sorting Networks on Maxeler Dataflow Supercomputing Systems
  1. Introduction
  2. Motivation
  3. Sorting Algorithms
    3.1. Sequential Sorting
      3.1.1. Optimal Algorithm for Small Input Data Sizes
    3.2. Parallel Sorting
    3.3. Network Sorting
  4. Sorting Networks
    4.1. Basic Properties
    4.2. Constructing Sorting Networks
    4.3. Testing Sorting Networks
    4.4. Network Sorting Algorithms
      4.4.1. Bubble Sorting
      4.4.2. Odd-Even Sorting
      4.4.3. Bitonic Merge Sorting
      4.4.4. Odd-Even Merge Sorting
      4.4.5. Pairwise Sorting
    4.5. Comparison of Network Sorting Algorithms
    4.6. Network Sorting Algorithm Operation Example
    4.7. Network Sorting Versus Sequential and Parallel Sorting
  5. Implementation
    5.1. Dataflow Computer Code (MAX2 Card)
    5.2. Control Flow Computer Code (Host PC)
  6. Setting Up the Experiment
  7. Experimental Results
    7.1. FPGA Usage
    7.2. Performance of Network Sorting Algorithms on the MAX2 Card
    7.3. Comparison of Sorting Algorithms on the MAX2 Card and the Host CPU
  8. Conclusion
  References
  About the Author
Chapter 5: Dual Data Cache Systems: Architecture and Analysis
  1. Introduction
  2. A DDC Systems Classification Proposal
  3. Existing DDC Systems
    3.1. General Uniprocessor Compiler-Not-Assisted
      3.1.1. The DDC
      3.1.2. The STS Data Cache
    3.2. General Uniprocessor Compiler-Assisted
      3.2.1. The Northwestern Solution
    3.3. General Multiprocessor Compiler-Not-Assisted
      3.3.1. The Split Data Cache in Multiprocessor System
    3.4. General Multiprocessor Compiler-Assisted
    3.5. Special Uniprocessor Compiler-Not-Assisted
      3.5.1. The Reconfigurable Split Data Cache
    3.6. Special Uniprocessor Compiler-Assisted
      3.6.1. The Data Type-Dependent Cache for MPEG Application
    3.7. Special Multiprocessor Compiler-Not-Assisted
      3.7.1. The Texas Solution
    3.8. Special Multiprocessor Compiler-Assisted
      3.8.1. The Time-Predictable Data Cache
    3.9. A Comparison of the Existing Solutions
  4. Conclusion of the Survey Part
  5. Problem Statement for the Analysis
  6. Critical Analysis of Existing Solutions
  7. Generalized Solution
    7.1. Basic Design of the Solution
    7.2. Expected Improvement Analysis
  8. Determining Locality
    8.1. Compile-Time Resolution
    8.2. Profile-Time Resolution
    8.3. Run-Time Resolution
  9. Modified STS in a Multicore System
  10. Conditions and Assumptions of the Analysis Below
  11. Simulation Strategy
    11.1. Overall Performance
    11.2. Effects of the Temporal Latency
    11.3. Effects of the Temporal Cache Associativity
    11.4. The X Limit Parameter
    11.5. Power Consumption
  12. Conclusions of the Analysis Part
  13. The Table of Abbreviations
  Acknowledgments
  References
  About the Author
Author Index
Subject Index
Contents of Volumes in This Series
Chapter Two: Concurrency, Synchronization, and Speculation—The Dataflow Way

Krishna Kavi*, Charles Shelor*, Domenico Pace†
* Department of Computer Science and Engineering, University of North Texas, Denton, Texas, USA
† Information Technology and Services, Accenture, Turin, Italy

Abstract
This chapter provides a brief overview of dataflow, including concepts, languages, historical architectures, and recent architectures. It is intended to serve as an introduction to and summary of the development of the dataflow paradigm during the past 45 years. Dataflow has inherent advantages in concurrency, synchronization, and speculation over control-flow or imperative implementations. However, dataflow has its own set of challenges to efficient implementation. This chapter addresses the advantages and challenges of dataflow to set a context for the remainder of this issue.

Keywords: Dataflow; Architecture; Survey

Abbreviations
ABI: address buffer identifier
AQ: acknowledgment queue of D2NOW
CU: computing unit in Codelet
D2NOW: data-driven network of workstations
DDM: data-driven multithreading model
DFE: dataflow engine in Maxeler
D-TSU: distributed thread scheduling unit in Teraflux
DU: data cache unit in TRIPS
EDGE: explicit data graph execution in TRIPS
EP: execution pipeline of SDF
EPN: epoch number of SDF
ETS: explicit token store
EU: execution unit in TRIPS
EXC: execution continuation of SDF
FP: frame pointer
GM: graph memory of D2NOW
GU: global control unit in TRIPS
IP: instruction pointer
IU: instruction cache unit in TRIPS
L-TSU: local thread scheduling unit in Teraflux
NIU: network interface unit of D2NOW
PE: processing element of WaveScalar
PLC: preload continuation of SDF
PSC: poststore continuation of SDF
RIP: retry instruction pointer of SDF
RQ: ready queue of D2NOW
RS: register set identifier
RU: register unit in TRIPS
SC: synchronization count in SDF
SDF: scheduled dataflow
SISAL: Streams and Iteration in a Single Assignment Language
SM: synchronization memory of D2NOW
SP: synchronization pipeline of SDF
SU: scheduling unit in Codelet; scheduling unit of SDF
TLS: thread-level speculation
TP: threaded procedure in Codelet
TRIPS: Tera-op Reliable Intelligently Adaptive Processing System
TSU: thread synchronization unit of D2NOW
VAL: value-based programming language
WTC: waiting continuation of SDF

1 Introduction
Achieving high performance is possible when multiple activities (or multiple instructions) can be executed concurrently, and the concurrency must not incur large overheads if it is to be effective. A second issue that must be addressed while executing concurrent activities is synchronization and/or coordination of the activities. These actions often lead to sequentialization of parallel activities, thus defeating the potential performance gains of concurrent execution. Effective use of synchronization and coordination is therefore essential to achieving high performance. One way to achieve this goal is through speculative execution, whereby it is speculated either that concurrent activities need no synchronization or coordination, or that the nature of the coordination can be predicted. Successful speculation reduces the sequential portions of parallel programs, but misprediction may add to execution time and power consumption, since speculatively executed activities must be undone and restarted with correct synchronization.

The dataflow model of computation is a natural choice for achieving concurrency, synchronization, and speculation. In its basic form, activities in a dataflow model are enabled when they receive all the necessary inputs; no other triggers are needed. Thus, all enabled activities can be executed concurrently if functional units are available, and the only synchronization among computations is the flow of data. In a broader sense, coordination or predicate results can be viewed as data that enable or coordinate dataflow activities. The functional nature of dataflow (eliminating the side effects that plague imperative programming models) makes it easier to use speculation or greedy execution: unnecessary or unwanted computations can simply be discarded without any need for complex recovery on mis-speculation.

In this chapter, we introduce well-understood dataflow models of computation in Section 2. We review programming languages that adhere to dataflow principles in Section 3. We present historical attempts at developing architectures implementing the dataflow models, along with a discussion of the limitations of dataflow encountered by these architectures, in Section 4. We present some recent variations of the dataflow paradigm that could potentially lead to efficient architectures in Section 5. As a case study in Section 6, we show how one such architecture, scheduled dataflow (SDF), implements concurrency, synchronization, and speculation. Finally, in Section 7, we include our conclusions and a prognosis on the future of the model as a viable alternative to control-flow systems.
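As a simple illustration of these ideas, the following minimal Python sketch (our own illustration, not code from the chapter) expresses a small computation as operations that fire purely on the arrival of their operands. Operations with no mutual data dependence may run concurrently, and the only synchronization is waiting for input tokens; there are no explicit locks or barriers.

from concurrent.futures import ThreadPoolExecutor

def node(op, *operand_futures):
    # A dataflow "actor": blocks until all of its input tokens (futures) are
    # available, then applies its operation and produces one output token.
    return op(*(f.result() for f in operand_futures))

with ThreadPoolExecutor(max_workers=8) as pool:
    a = pool.submit(lambda: 6)   # constant-producing actors (input tokens)
    b = pool.submit(lambda: 7)
    # prod and summ share no data dependence, so they may execute concurrently;
    # their only "synchronization" is waiting for the operand tokens a and b.
    prod = pool.submit(node, lambda x, y: x * y, a, b)
    summ = pool.submit(node, lambda x, y: x + y, a, b)
    # The final actor is enabled only once both prod and summ have produced tokens.
    diff = pool.submit(node, lambda x, y: x - y, prod, summ)
    print(diff.result())         # (6 * 7) - (6 + 7) = 29

Here the futures play the role of tokens: the two middle actors fire as soon as a and b are available, and the final subtraction fires only when both of its operands have been produced.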
2 Dataflow Concepts

2.1 Dataflow Formalism
The dataflow model of computation is based on the use of graphs to represent computations and the flow of information among the computations. The signal processing community uses a visual dataflow model to specify computations; some examples of environments used by this community include Signal [1, 2], Lustre [3], and Ptolemy [4]. In this chapter, our focus is on general-purpose programming and general-purpose computer architectures. Hence, we will not review programming models, languages, or environments for specific or restricted domains; however, Lee [5] provides a good survey of such languages and models.

One of the earliest dataflow models is Kahn process networks [6]. A program in this model consists of channels, processes, and links connecting these processes. Figure 1 shows an example program and the corresponding graphical representation of the process network. In principle, channels can be viewed as arrays or lists (sequences of values). It is also possible to associate firing semantics with a process to define necessary input sequences and nondeterminism; we again refer the reader to the survey by Lee [5]. A process takes the required sequences of inputs from its input channels and creates sequences on output channels. To facilitate continuous execution and concurrency, one can view the sequences of data on channels as partially ordered. A set of inputs may contain multiple, partially ordered inputs, where the lack of ordering among subsequences allows for concurrency.

Figure 1: Kahn process networks example.

There have been several formalisms that extend Kahn process networks. We use specific notations to describe a process network, which resemble the most common view of dataflow graphs for graphically representing computations; we use the notations from Ref. [7] in the rest of this section. A dataflow graph is a directed graph in which the elements are called links and actors [8]. In this model, actors (nodes) describe operations, while links receive data from a single actor and transmit values to one or more actors by way of arcs; arcs can be considered as channels of communication, while links can be viewed as placeholders or buffers. Formally, a dataflow graph can be written as G = (A, L, E), where A = {a1, a2, …, an} is the set of actors, L = {l1, l2, …, lm} is the set of links, and E ⊆ (A × L) ∪ (L × A) is the set of edges connecting links and actors. In its basic form, activities (or nodes) are enabled for execution when all input arcs contain tokens and no output arcs contain tokens. However, if we consider arcs as (potentially...
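To make the basic firing rule concrete, here is a minimal sketch of a single-token (static) dataflow interpreter in Python. It is an illustration under our own assumptions, not the notation of Ref. [7] or the design of any particular machine: the names Link, Actor, and run are hypothetical, each link buffers at most one token, and an actor fires only when all of its input links hold tokens and all of its output links are empty.

class Link:
    # A link: a placeholder/buffer holding at most one token.
    def __init__(self):
        self.token = None

class Actor:
    # An actor describes an operation; it is enabled when every input link
    # holds a token and every output link is empty (the firing rule above).
    def __init__(self, op, inputs, outputs):
        self.op, self.inputs, self.outputs = op, inputs, outputs

    def enabled(self):
        return (all(l.token is not None for l in self.inputs)
                and all(l.token is None for l in self.outputs))

    def fire(self):
        args = [l.token for l in self.inputs]
        for l in self.inputs:            # consume the input tokens
            l.token = None
        result = self.op(*args)
        for l in self.outputs:           # place the result on the output links
            l.token = result

def run(actors):
    # Repeatedly fire any enabled actor until no actor can fire;
    # only token availability, not program order, drives execution.
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.enabled():
                actor.fire()
                fired = True

# Graph for (x + y) * (x - y); each input value is duplicated onto its own
# link so that every link feeds exactly one consumer in this simple sketch.
x1, x2, y1, y2, s, d, out = (Link() for _ in range(7))
graph = [
    Actor(lambda a, b: a + b, [x1, y1], [s]),
    Actor(lambda a, b: a - b, [x2, y2], [d]),
    Actor(lambda a, b: a * b, [s, d], [out]),
]
x1.token = x2.token = 5                  # initial input tokens
y1.token = y2.token = 3
run(graph)
print(out.token)                         # (5 + 3) * (5 - 3) = 16

Running the sketch fires the addition and subtraction actors in either order (conceptually in parallel, since neither depends on the other), and the multiplication actor fires only after both of its input links have received tokens, printing 16.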