
August 12, 2005

Asynchronous Heterogeneity: The Next Big Thing In HEC

Commentary from the High-End Crusader

 A number of computer vendors have quietly moved into heterogeneous processing, whether they use 1) heterogeneous cores within the processor architecture, or 2) heterogeneous processors within the system architecture (or both), as a natural response to application diversity.  After all, given that different applications in a workload, or different portions of the same application, can have significant differences in algorithmic characteristics, it is just common sense to offload portions of an application with special algorithmic characteristics to a coprocessor that is ideally suited to computing portions of that type.  Algorithmic variability within a single application arises naturally from that application’s dynamic mix of different forms of parallelism, different forms of locality, different forms of communication, and/or different forms of synchronization.  This offloading may be loosely or tightly coupled.  The standard textbook view of coprocessors is that they extend the instruction-set architecture (ISA) of the main processor to enrich its functionality.  In this view, offloading essentially consists of a remote procedure call to the coprocessor, where the main processor waits for the coprocessor to return.  But such tight coupling results in sequential processing, which is not what we want.  To obtain the benefits of parallel processing, we must configure these processors (or cores) of different types as self-contained, autonomous computers, with distinct instruction sets and distinct _uncoupled_ program counters.  We must also be extremely wary of synchronization, which easily serializes the computation.  Not every system uses asynchronous heterogeneity to accomplish the same goals. 
Indeed, some design teams have not fully addressed one fundamental problem in system-level asynchronous heterogeneity, viz., what styles of interaction, i.e., what styles of communication and synchronization, minimize the performance degradation of application portions of a given type that run on processors of the corresponding type? If large amounts of the computation run on processors of a given type, you don’t want to slow that type of processor down, because of Amdahl’s law. Pragmatically, application portions are threads of a given type marshaled by the parallelizing compiler from diverse sets of operations in the computation according to well-specified, system-defined criteria. In this article, the term “asynchronous heterogeneity” refers indifferently to loosely coupled heterogeneous processing at either the processor or the system level. There is already a pre-thesis here: If portions of an application are (somewhat uniformly) distributed across processors of different types, and if each type is expected to perform a significant fraction of the computation, then, given our concern with response times, there are clearly better and worse styles of interaction between processors (or cores) of different types. Quite simply, an interaction style is better if it does not reduce parallelism that could otherwise have been exploited.

Heterogeneity is all around us. Japan is spending close to a billion dollars to build a “usable” 10-PFs/s system, which is due out in 2011, to simulate climate change and galaxy formation, and to predict the behavior of new drugs. This will be a specialized design, one reads, perhaps incorporating several different breeds of processor in a single system. “The architecture … is a hybrid [mixing vector processors, scalar processors, and special-purpose compute engines]”, says Jack Dongarra. “To get to 10 PFs/s, I think you’d need that”.
Dongarra is quoted as saying that the planned system may even include some chips that are designed to perform just one type of calculation, albeit at lightning speed.  Whether conventional commodity clusters can scale to 10 PFs/s may be a controversial question to some, but not to others.  We are already seeing limits to “commodity scaling”.  Recall that the phrase “commodity processor” typically means a single-core RISC superscalar processor optimized for single-thread performance and dependent for its performance both on its cache and on complex, power-hungry control circuitry for out-of-order execution.  Recently, processor vendors with complex microarchitectures trying to scale to higher single-processor performance have been forced by the laws of physics to move to multicore microarchitectures (2X, 4X, 8X, …), where each core is just a smaller instance of the _same_ conventional, complex, power-hungry design.  This doesn’t particularly suit cluster-vendor integrators.  Why not? Vast agglomerations of multi-complex-core chips are proving to be extremely difficult to cool.  Who knows?  Homogeneity may simply annihilate conventional killer micros in large clusters.  (Your correspondent suspects that most multi-simple-core processors will be heterogeneous—unless they are just one processor type in a larger heterogeneous parallel system).  Standing somewhat apart, Blue Gene/Light is not the best example of scaling difficulties because its processors have been explicitly designed to be less power hungry.  Closer to home, Cray Inc., SGI, and SRC (among others) have introduced largely autonomous FPGAs into some of their processor architectures.  For example, in SGI Altix 3000 Series Superclusters (think of the large Columbia Project supercluster at NASA Ames), the NUMAlink interconnect is used in several ways: 1) to scale memory, 2) to scale I/O (distinct I/O processors), 3) to scale application-specific units (FPGAs), and 4) to scale visualization (GPUs). 
These are just multichip heterogeneous processors. For example, two Itanium 2 CPUs are linked to an interface chip, two FPGAs are linked to a separate interface chip, the two interface chips have a NUMAlink interconnect between them, and finally the first interface chip is linked to memory. This heterogeneous processor is then used as the building block of a homogeneous system architecture. Without question, the most visible example of asynchronous heterogeneity is the Sony-Toshiba-IBM (STI) CELL processor architecture, which was initially designed for the PlayStation 3, but which is being adapted by IBM for more general-purpose uses. While the CELL is a heterogeneous processor microarchitecture, rather than a heterogeneous system architecture, it vividly illustrates the power of heterogeneous processing.

The IBM CELL Processor Architecture

In some sense, the CELL processor architecture is a scaled-down parallel vector processor on a single chip. Each CELL is a single-chip, nine-core microarchitecture with a control processor (the “scalar core”), eight workhorse vector processors (the “vector cores”), an off-chip local memory, and high-speed I/O channels to send/receive messages (in part to/from other CELLs). One of the most striking features of the CELL is the way in which both off-chip local memory and on-chip “local stores” are used to guarantee a sustained high-bandwidth stream of data operands to each of the vector cores. A CELL’s local memory is one (possibly more) GB of RAM connected to the CELL by two Rambus XDR memory controllers. This dual-bus memory channel can feed the CELL with data at the rate of 25.6 GBs/s. Now, can, say, a vector core directly load and store from the CELL’s local memory? Not exactly. Every vector core has a local store consisting of 256 KBs of SRAM. It also has 128 128-bit vector registers. The vector loads and stores of the vector core read and write the local store of that vector core.
In addition, a vector core can replenish its local store by requesting a block transfer from the CELL’s local memory, with permissible block sizes ranging anywhere from 1 to 16 KBs. (Of course, block transfer is bidirectional). The local store of a vector core is thus a virtual second-level vector register file that is “paged” against the CELL’s local memory. Obviously, if CELLs are used as building blocks in medium-scale parallel systems, the compiler will be responsible for some sophisticated data distribution if this “paging” scheme is to be made to work. Indeed, the CELL is a genuine RISC processor because it leaves all the Really Interesting Stuff to the Compiler. :-)

The CELL chip will be built using 65-nm VLSI process technology. The scalar core is a dual-issue, dual-threaded, in-order-execution RISC processor with a slightly modified conventional cache but without elaborate branch-prediction support. This much simpler design uses considerably less power. The primary function of the scalar core is to queue up vectorized tasks (i.e., vector threads) for execution by the vector cores. The single scalar core thus offloads all (vectorized) compute-intensive work to the vector cores. It also keeps the “outermost” scalar code for itself. Is the scalar performance of the scalar core an issue? (As is the custom, each vector processor has an embedded scalar control unit that is tightly coupled to the execution of vector instructions, but this is not our focus). To repeat, the scalar core is responsible for executing the nonvectorizable portion of any application that runs on the CELL, although vector cores could, in a pinch, execute scalar threads. Without much ILP, the scalar core must pray that very little application work falls on its shoulders, and fairly regular work at that. Is the scalar core responsible for keeping the CELL’s local memory fresh? The CELL does have 76.8 GBs/s of I/O bandwidth.
Is the scalar core constantly requesting block transfers from outside the CELL to the CELL’s local memory? In any case, it appears—at least in the uniprocessor case—that having a scalar thread running on the scalar core _wait_ for the completion of some compute-intensive vector thread it has offloaded onto a vector core is not an immediately obvious performance problem.  (Any scalar threads that might run on vector cores are another matter entirely).  Each vector core is a self-contained vector processor.  Like the scalar core, vector cores implement in-order execution.  Each vector core has a peak arithmetic performance of 32 GFs/s, which gives all eight of them a combined peak of 256 GFs/s.  Let us examine the memory bandwidth that is required to support that much raw vector functional-unit performance.  Recall that the Cray-1 used vector registers primarily to reduce the demand on limited memory bandwidth, and that the Cray-2 extended this to a two-level scheme.  Specifically, the Cray-2 had 16K words of (processor) local memory that had to be explicitly managed by the compiler/programmer.  Compare this to the CELL.  Each CELL vector core has 128 128-bit vector registers, 256 KBs of computational memory, and a backing store for this computational memory, which is just the local memory of the CELL.  The local store of a vector core allows us to do without a cache of any kind.  Imagine a vector thread that, over its lifetime, needs hundreds or thousands of bytes of data to run to completion, and has perfect foreknowledge of what those data are.  This much data will easily fit into a vector core’s local store.  (The local store must contain both the program and the data).  
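The figures quoted so far can be combined into a rough bandwidth budget. The following is only a back-of-the-envelope sketch in Python: the constants are the numbers given in the text, while the function name, the dictionary keys, and the unit conventions (1 KB = 1024 bytes, 1 GB = 10^9 bytes) are assumptions made for illustration.

```python
# Figures quoted in the article for one CELL chip.
VECTOR_CORES = 8
PEAK_GFLOPS_PER_CORE = 32.0      # peak arithmetic rate of one vector core
MEMORY_BANDWIDTH_GB_S = 25.6     # off-chip local memory bandwidth
LOCAL_STORE_BYTES = 256 * 1024   # assumption: 1 KB = 1024 bytes
VECTOR_REGISTERS = 128
VECTOR_REGISTER_BITS = 128


def cell_budget():
    """Derive a few ratios from the quoted figures (illustrative only)."""
    aggregate_gflops = VECTOR_CORES * PEAK_GFLOPS_PER_CORE
    register_file_bytes = VECTOR_REGISTERS * VECTOR_REGISTER_BITS // 8
    # Peak flops that must be performed per byte fetched from local memory
    # if the vector cores are to stay busy (assumption: 1 GB = 1e9 bytes).
    flops_per_memory_byte = aggregate_gflops / MEMORY_BANDWIDTH_GB_S
    # Time to fill one local store if a single core had the whole channel.
    fill_time_us = LOCAL_STORE_BYTES / (MEMORY_BANDWIDTH_GB_S * 1e9) * 1e6
    return {
        "aggregate_gflops": aggregate_gflops,
        "register_file_bytes": register_file_bytes,
        "flops_per_memory_byte": flops_per_memory_byte,
        "local_store_fill_time_us": fill_time_us,
    }


if __name__ == "__main__":
    for name, value in cell_budget().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the chip needs roughly ten peak flops for every byte that crosses the memory channel, which is why operands have to be reused out of the registers and the local store rather than streamed from local memory.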
So, if all of the data for the vector thread can be prestaged in the local memory, a vector core can obtain a “prestaged” vector thread, use the DMAC to transfer the thread’s program and all the thread’s data from the CELL’s local memory to the vector core’s local store, and then run the vector thread to completion. Since the local store is a computational memory and not a cache, it can sustain high-bandwidth operand delivery over the entire lifetime of the vector thread. Specifically, it can sustain 64 GBs/s on perfectly vectorized codes, which is one register per cycle. Once started, vector threads do not stop, i.e., do not pause or wait. They do not synchronize in midstream. The effective thread state of a vector thread running on a vector core consists of the contents of the vector register file _plus_ the contents of the vector core’s local store. A context switch is unthinkable. Vector threads do not start until all their data has been localized (first in the local memory and then in their local store) and, once they have started, they do not stop for anything.

System-Level Asynchronous Heterogeneity

The Cray Cascade supercomputer design effort is one of three projects left in DARPA’s HPCS program. (The other vendors are IBM and Sun). The competition is heating up as each vendor must decide soon whether to commit to actually building a general-purpose petascale machine. HPCS program managers would prefer a potential product line consistent with the vendor’s market strategy rather than a showpiece machine that leads nowhere. The Japanese have committed large amounts of money to build a heterogeneous petascale architecture in roughly the same time frame as the HPCS program. The purpose of building such a machine is to give real users a comprehensible and programmable parallel system that would allow them to routinely achieve sustained petascale computing on their important science, engineering, and national-security applications.
The roadblock is that these (future) petascale applications are themselves extremely heterogeneous in their algorithmic characteristics, and—given the very real constraints on system cost, power, size, and programmability—there is no single processor architecture, no single clever hybrid of the dataflow and von Neumann computing paradigms, no single heterogeneous processor architecture even, that could be used as the _only_ processor building block in an efficient, general-purpose system-level homogeneous petascale architecture.  Similar concerns echo in the reconfigurability community.  A recent conference announcement states, “Reconfigurable high-performance computing systems have the potential to exploit coarse-grained functional parallelism through conventional parallel processing as well as fine-grained parallelism through direct hardware execution”.  Thus, many of us see value in decomposing (and molding) applications into component computations of a given algorithmic type, and then running these components (i.e., threads of a given type) on distinct hardware subsystems, most naturally on processors of a given type.  But if decomposition is good, what is the good decomposition?  What kind of heterogeneity might a single application have?  Well, is it vectorizable or not?  More precisely, what is the respective potential for vector-level and thread-level parallelism?  Is it communication intensive or not?  More precisely, is it temporally local or not? spatially local or not? engaged in frequent long-range communication or not? engaged in frequent short-range communication or not? with what granularity? with what regularity of its memory-access pattern?  Finally, is it loosely synchronized or not? (The opposite of loose synchronization is fine-grained synchronization).  More precisely, is the synchronization overhead, i.e., the context-switch cost, ruinously expensive or not?  
Synchronization latency is more a matter of an application’s being communication intensive.  Cascade is heterogeneous in that it combines efficient processing components for heavyweight threads that exhibit high temporal locality and other efficient processing components for lightweight threads that exhibit little or no temporal locality.  Heavyweight processors are responsible for executing the compute-intensive portions of an application, and combine the best aspects of multithreading, vector computing, and stream processing for very high-throughput execution of the compute-intensive portions.  Within each locale, a number of lightweight (multithreaded scalar) processors will be near or within the locale’s memory for high-bandwidth and low-latency in-memory computation.  Locales, which are the fundamental Cascade building blocks, contain one heavyweight-processor chip and eight multicore lightweight-processor chips, which are either PIM chips properly so called or “PIM equivalent” chips.  Each core is a multithreaded scalar processor. Within the locale, PIM chips are backed by non-PIM DRAM.  Locales are interconnected by next-generation hierarchical symmetric Cayley-graph networks for high specific bandwidth.  In one sense, Cascade’s asynchronous heterogeneity is contained within individual locales; the Cascade system as a whole is then a homogeneous interconnection of locales.  However, since processors of both weights have long-range interactions, this gives us full system-level asynchronous heterogeneity.  The main chip in a locale implements a heavyweight processor including its caches, an interface to multiple DRAM memory chips with intermixed lightweight processors, and a network router to interconnect to other locales.  Several design decisions were taken up front.  First, Cascade is a shared-memory machine.  Second, Cascade is dependent on exceptional global system bandwidth, although this is by far the most-expensive part of the system.  
Third, both heavyweight and lightweight processors are expected to tolerate the latency of long-range communication. In short, both heavyweight and lightweight threads are presumed to be potentially communication intensive. Prestaging, i.e., localizing data in a locale’s memory for a locale’s heavyweight processor is not totally absurd, but there is little point if the localization consumes as much global bandwidth as simply doing the long-range communication in the first place. That is to say, localization in Cascade is not absurd but it is subtle. For example, how perfect is our foreknowledge of what data we need? Also, what is the net effect of localization on “return on bandwidth” (see below)? Some early conceptualizations of Cascade focused on locality, i.e., on different communication (memory-access) patterns. Today, we more accurately describe the differences between heavyweight and lightweight threads, and their respective capabilities, by focusing on the enormous difference in the amount of thread state. Communication still matters, but this new conceptualization allows us to cleanly handle differences in synchronization granularity. Heavyweight processors adopt the conventional approach to exploiting temporal locality, albeit with a different cache architecture and an abundance of processor parallelism to generate memory-reference concurrency. Temporal locality comes in two forms. In data reuse, a data value that has been loaded into a processor store is used, somewhat promptly, many times before it is discarded. In intermediate-value use (aka “producer-consumer locality”), intermediate values written to a processor store are read from that store, somewhat promptly, and used as operands, possibly many times, before they are discarded. The conventional approach to exploiting temporal locality is to use a large data cache and/or a large register set to provide significant processor storage for live data values.
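To make data reuse concrete, here is a small sketch in Python that counts flops per word loaded for a tiled matrix multiply. The choice of matrix multiplication, the function name, and the simplification that the output tile stays resident in the processor store (so only the input tiles are charged as loads) are assumptions made for this illustration and do not come from any Cascade design document.

```python
def tiled_matmul_traffic(n, b):
    """Count flops and input words loaded for an n x n matrix multiply
    computed in b x b tiles, assuming tiles of A, B and C fit in the
    processor store and C tiles are never reloaded."""
    if n % b != 0:
        raise ValueError("tile size must divide the matrix size")
    tiles_per_side = n // b
    flops = 2 * n ** 3  # one multiply and one add per inner-product term
    # Each of the tiles_per_side**3 tile products loads one b x b tile
    # of A and one b x b tile of B.
    words_loaded = tiles_per_side ** 3 * 2 * b * b
    return flops, words_loaded, flops / words_loaded


if __name__ == "__main__":
    for tile in (1, 4, 16, 64):
        flops, words, ratio = tiled_matmul_traffic(256, tile)
        print(f"tile={tile:3d}  words loaded={words:10d}  flops/word={ratio:.1f}")
```

In this model the flops-per-word ratio equals the tile edge, so a store large enough to hold bigger tiles directly buys more arithmetic per word of bandwidth.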
The bigger these stores are, the more temporal locality one can exploit. Exploiting temporal locality in compute-intensive work is of great performance value because it can dramatically increase the “return on bandwidth” (ROB). Another term for this is “arithmetic intensity”, i.e., how many flops you can execute for each word you load. The system’s arithmetic return on bandwidth is perhaps its most important objective function since global system bandwidth is, by far, the most critical system resource. In truth, there are many, many bandwidths in a computer architecture, some more critical than others. But there is a fatal flaw. Allowing the thread state proper to grow without limit, and remembering that rebuilding a thread’s cache footprint after a context switch has major performance implications, we find ourselves faced with colossal effective-state computers, which should never be allowed to synchronize, i.e., block (well, hardly ever). The context-switch cost is absurdly expensive. If a compute-intensive, heavyweight thread absolutely must synchronize, so be it. However, we try to avoid this as much as possible. Efficient execution of compute-intensive code goes with large thread state, which in turn goes with real synchronization difficulty because of the context-switch overhead. Indeed, a homogeneous, pure multithreaded, cacheless architecture is unlikely to result in a usable petascale computer (of course, there are wonderful terascale architectures that adopt this approach). Together with their compiler-directed data caches, heavyweight Cascade processors have traded away their ability to synchronize well in exchange for exceptional performance on compute-intensive codes.

The True Role Of Lightweight Processors

PIM processors (see qualification above) are often characterized by saying that every part of memory has a processor that is _very_ nearby, resulting in extremely high bandwidth and low latency between that processor and that part of memory.
In Cascade, lightweight processors have two additional architectural features.  First, they have been explicitly designed to have very low thread state, i.e., there are very few words in the register set owned by an individual thread.  Second, there is an extremely high degree of processor multithreading.  (In MTA terms, we would say that the number of processor-resident active threads is extremely high).  Lightweight and heavyweight processors are two intermediate design points on an architectural continuum whose extrema are dataflow computing and von Neumann computing (this will be explained in the conclusion).  In lightweight processors, an enormous number of program counters each initiate a moderate number of concurrent operations per program counter.  In heavyweight processors, a moderate number of program counters each initiate an enormous number of concurrent operations per program counter.  In both cases, the processor initiates an enormous number of concurrent operations per cycle.  We can see some minor differences in compiler strategy: a compiler can freely unroll loops for a heavyweight processor without fear of register spill, but should probably stick to software pipelining for a lightweight processor.  Lightweight threads were originally explained as a correction to the conventional approach to exploiting spatial locality.  Today, microprocessors use a long cache line to concurrently prefetch data.  (This is a latency-tolerance scheme that does nothing to reduce the demand on bandwidth). But an application, depending on its memory-access pattern, need not be cache friendly: bandwidth is actually wasted if the fetched data aren’t used, and cache-coherence schemes quickly lead to the problem of false sharing.  Still, suppose there is a block of data localized in a part of memory somewhere that is relevant to some computation.  Say this block is ‘n’ words long.  
A heavyweight thread that depends on the result of this computation might well be tempted, not to load the ‘n’ words into its locale, but rather to spawn a lightweight thread that migrates to the lightweight processor closest to this memory part, and somehow obtains the result without causing ‘n’ words to be transferred to the original locale. This process iterates, so that if the lightweight thread falls off the end of a block, it simply migrates to pick up at the start of the next block. A simple example is computing the dot product of two vectors A and B that are both distributed across memory in multiple independent contiguous blocks. A single lightweight thread can repeatedly migrate to track the blocks of vector A, whose components it reads locally, while issuing remote loads to obtain the corresponding elements of B. This locality strategy reduces remote loads by a factor of two, which more than offsets the cost of repeated migration. Moreover, the remote bandwidth is not charged against the originating locale’s “bandwidth budget”. The use of lightweight PIM processors to exploit spatial locality seems very natural, whether this is easily migratable threads pursuing spatial locality across the memory, or somehow using lightweight threads for localization, i.e., copying data nearby. But the standard Cascade mantra is: The compiler packages the temporally local loops into heavyweight threads and all other loops into lightweight threads. (“Loops” is a tad restrictive here). So, while spatial locality makes spawning lightweight threads worthwhile, there are many, many other things that make it equally worthwhile. Let’s take a more abstract view. Some (portions of) computations _require_ high thread state; hence, they must be scheduled on heavyweight processors. Other (portions of) computations _require_ low thread state; hence, they must be scheduled on lightweight processors.
Any (portion of a) computation that requires neither should be scheduled on lightweight processors for efficient execution. For example, fine-grained dataflow problems are naturally programmed using fine-grained synchronization, and only run effectively on lightweight processors. Here is a simple example. The parallel prefix-sum problem is to compute the sum of the first j numbers for j equals 1 to P. Memory locations are equipped with full/empty bits to enable fine-grained producer-consumer synchronization in memory. The problem is solved by migrating P lightweight threads, each initialized with one of the original P numbers, to one or more PIM nodes with the collective capacity to store a logP by P array. After logP time steps, the P prefix sums have been computed. Here is the SPMD code for cyclic reduction (SPMD style is a rotten way for humans to design code).

Cyclic Reduction
________________

    // code for thread j, which is assigned to column j of a logP x P matrix
    // upon termination, the jth prefix sum is in 'sum'

    shared  a   future int array[1..logP, 1..P] := undefined;
    private sum int := j;
            hop int := 1;

    do level = 1, logP ->
        if j <= P-hop -> a[level,j] := sum       fi   // set location full
        if j >  hop   -> sum +:= a[level,j-hop]  fi   // accumulate sum
        hop := 2*hop
    od

Each component of the ‘a’ matrix is a _future_ variable. Each thread iterates logP times. Each time through the loop it synchronizes, i.e., potentially blocks, waiting for the relevant future variable to become defined. It does this _every other instruction_! This constant fine-grained synchronization mandates execution of all thread code on one or more lightweight processors. Spatial locality is irrelevant. Indeed, focusing on thread state opens new vistas: a group of lightweight threads can exploit, not just pre-existing spatial locality, but spatial localizability! We may have to adjust the Cascade mantra.
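The prefix-sum pseudocode above can be approximated in ordinary Python for readers who want to run it. This is only a sketch under stated assumptions: operating-system threads stand in for lightweight threads, a `concurrent.futures.Future` stands in for a memory word with a full/empty bit, indices are 0-based, and each thread starts from `values[j]` rather than from `j`. The function name `prefix_sums` is hypothetical.

```python
import threading
from concurrent.futures import Future


def prefix_sums(values):
    """Inclusive prefix sums using one thread per element and
    write-once cells for producer-consumer synchronization."""
    p = len(values)
    levels = (p - 1).bit_length() if p > 0 else 0  # ceil(log2(p))
    # a[level][j] is written exactly once, then read by at most one thread.
    a = [[Future() for _ in range(p)] for _ in range(levels)]
    result = [None] * p

    def worker(j):
        total = values[j]
        hop = 1
        for level in range(levels):
            if j < p - hop:
                a[level][j].set_result(total)        # set location full
            if j >= hop:
                total += a[level][j - hop].result()  # block until full
            hop *= 2
        result[j] = total

    threads = [threading.Thread(target=worker, args=(j,)) for j in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result


if __name__ == "__main__":
    print(prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))
```

Every thread writes its cell for a level before it reads its neighbor's cell for that level, so the blocking reads cannot deadlock. On conventional hardware the cost of blocking this often is exactly the overhead the article argues a low thread-state processor is designed to avoid.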
We come now to one of the fundamental problems of system-level asynchronous heterogeneity: What styles of interaction, i.e., what styles of communication and synchronization, minimize the performance degradation of either lightweight or heavyweight threads?  Imagine a set of heavyweight threads, at least one of which is dependent on the result—which can be almost anything—of a subcomputation that most properly runs as one or more lightweight threads executing on lightweight processors.  How should heavyweight and lightweight threads coordinate?  The solution is straightforward: any executing heavyweight thread may spawn any number of lightweight threads but cannot _itself_ wait for results to be returned from any of them.  Imagine the performance hit if, say, an executing compute-intensive, heavyweight thread were to RPC a parallel prefix summation, and then patiently wait for the result to be returned.  In point of fact, lightweight threads have many, many ways to satisfy the (initial) task or data dependences of heavyweight threads.  It is just that this is done upfront, before the heavyweight thread has ever been scheduled to run on a heavyweight processor.  Think of a dataflow model in which dataflow constraints schedule the first instruction of a thread, but then the thread runs normally.  If one takes an extremely abstract view of “prestaging”, then the compiler can identify many kinds of work—whether this be rapid updates of contiguous data blocks, tree/graph walking, efficient and concurrent array transpose, fine-grained manipulation of sparse and irregular data structures, parallel prefix operations, in-memory data movement—in fact, any computation at all that executes efficiently in a low thread-state environment—that should be completed before a _particular_ heavyweight thread is first scheduled for execution on a heavyweight processor.  Remember, the compiler chooses what to put into each thread.  
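The interaction style described here, in which a heavyweight thread is enabled by dataflow-style dependences and never waits once it is running, can be sketched in Python. Everything named below (`HeavyTask`, `dependence_satisfied`, `run_demo`, the use of a `queue.Queue` as the ready queue) is hypothetical scaffolding for illustration, not a description of Cascade's actual runtime.

```python
import queue
import threading


class HeavyTask:
    """A heavyweight thread that becomes runnable only after a fixed
    number of initial dependences have been satisfied."""

    def __init__(self, num_deps, body):
        self.body = body
        self._remaining = num_deps
        self._lock = threading.Lock()

    def dependence_satisfied(self, ready_queue):
        # Called by lightweight threads; the last one enqueues the task.
        with self._lock:
            self._remaining -= 1
            enabled = self._remaining == 0
        if enabled:
            ready_queue.put(self)


def run_demo():
    ready = queue.Queue()  # stands in for a locale's work queue
    staged = {}            # data prestaged by the lightweight threads

    def heavy_body():
        # Runs to completion; it never blocks on a lightweight thread.
        return sum(staged.values())

    consumer = HeavyTask(num_deps=3, body=heavy_body)

    def light(k):
        staged[k] = k * k  # some small piece of prestaging work
        consumer.dependence_satisfied(ready)

    lights = [threading.Thread(target=light, args=(k,)) for k in range(3)]
    for t in lights:
        t.start()
    task = ready.get()     # the "heavyweight processor" takes enabled work
    value = task.body()
    for t in lights:
        t.join()
    return value


if __name__ == "__main__":
    print(run_demo())
```

The only synchronization the heavyweight task ever sees is the dependence count that gates its first instruction; all blocking is confined to the cheap side of the interaction.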
As a group, heavyweight processors offload work of various kinds, work that suits them poorly, onto lightweight processors. Moreover, as a group, heavyweight threads offload work of various kinds onto lightweight threads. The trick, which isn’t exactly rocket science, is that one heavyweight thread spawns a set of lightweight threads whose completion satisfies some initial dependence of _another_ heavyweight thread. Conversely, a group of lightweight threads can cause a compiler-specified set of dependence constraints for the initial scheduling of a heavyweight thread to be satisfied. Mechanically, the lightweight threads place the newly enabled heavyweight thread onto the work (or ready) queue, from which it is later retrieved when the heavyweight processor becomes free. This is dynamic scheduling of compute-intensive tasks from the locale’s central work queue.

Conclusion

The deeper meaning of system-level asynchronous heterogeneity is that it can be viewed as the natural implementation of a new “ontological” model of parallel computation that identifies the single most important dimension of intra-application variability in petascale applications. In a word, parallel computing is an irreducible mixture of high thread-state and low thread-state computing. Historically, dataflow and von Neumann computing have been viewed as radically different, and perhaps irreconcilable, models of computation. However, it is now understood that the models are simply the extrema of an architectural continuum. The von Neumann architecture can be extended into a multithreaded model by replicating the program counter and register set, and by providing efficient primitives to synchronize among several threads of control. What was the main issue in trying to make dataflow hybrids work?
Just one, really: how to provide new mechanisms to reduce the cost of fine-grained synchronization while retaining the performance efficiencies of sequential scheduling, i.e., control-flow threads and registers. Dataflow computing could synchronize well, but extended von Neumann computing could actually compute efficiently. Speaking _very_ loosely, fine-grained dataflow gives us efficient synchronization, and coarse-grained dataflow, in which threads actually compute something appreciable before synchronizing, gives us high-performance computing. Pure dataflow is zero thread-state computing. Pure MTA-like multithreading is moderate thread-state computing. Vector processing and conventional RISC superscalar processing are high thread-state computing (although, of the two, only vectors support scalable latency tolerance). Alas, no single choice of thread state can possibly satisfy all of computing’s needs. No _single_ intermediate design point on this architectural continuum efficiently supports general-purpose computing. In order of increasing thread state, the architectural continuum contains: 1) dataflow computing, 2) lightweight processing, 3) heavyweight processing, and 4) von Neumann computing. This leaves many questions open. How light should lightweight be? How heavy should heavyweight be? Can we adjust the effective thread state of processors dynamically? Cascade’s model of parallel computation is that petascale applications contain a (somewhat balanced and possibly hidden) irreducible mixture of portions of the computation suitable for coarse-grained dataflow computing, i.e., high thread-state computing, and other portions suitable for fine-grained dataflow computing, i.e., low thread-state computing. Portions suitable for coarse-grained dataflow computing are executed as compute-intensive, heavyweight threads that essentially “synchronize” precisely once, before they are scheduled on the processor for the very first time.
Portions suitable for fine-grained dataflow computing are executed as agility-intensive, lightweight threads that have no problem stopping and restarting any number of times _during_ thread execution because the thread state is so low.  (Low thread state creates thread agility, which is essential for either migration or synchronization).  By the way, low thread-state computing is a fascinating engineering discipline, whose many possibilities have only been hinted at here.  Cascade’s system architecture follows from its model of parallel computation in a natural way.  If parallel computation is an irreducible mixture of high thread-state and low thread-state computing, a parallel system architecture should be split into separate subsystems that support the global parallel computation precisely by efficiently executing both the high thread-state code and the low thread-state code separately in parallel.  There is complete interaction symmetry here.  Just as lightweight threads push heavyweight threads onto the central work queue in a locale (from which they are dynamically scheduled onto the heavyweight processor), so heavyweight threads (via spawning) push lightweight threads onto the work queue of some lightweight processor (from which they are dynamically scheduled onto that lightweight processor).  The compiler splits the entire application into compute-intensive, heavyweight portions and agility-intensive, lightweight portions, possibly with dependences between portions of the same type, but with the rigid style of interaction between portions of opposite type outlined above.  Your correspondent’s sense is that such decomposition will, in general, always be possible.  We are only beginning to understand this compilation process and its effect on independent subsystem utilization.  There is a clear design choice here.  
Should we build explicitly reconfigurable components that morph into different execution modes when the application changes, or should we design two separate processing subsystems that operate in parallel and have been optimized separately to support the efficient execution of the two basic kinds of computation we expect in petascale applications?  In fairness, only time will tell.  On the architectural continuum from dataflow to von Neumann computing, there is no single good intermediate design point for parallel computing; all parallel computing is an irreducible mixture of dataflow and von Neumann computing; hence, we must build separate subsystems to handle each portion separately.  This is a liberating view that lets us see design options we never saw before.  —  The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous.  He alone bears responsibility for these commentaries.  Replies are welcome and may be sent to HPCwire editor Tim Curns at