Asynchronous Heterogeneity: The Next Big Thing In HEC

Commentary by the High-End Crusader

August 12, 2005

A number of computer vendors have quietly moved into heterogeneous processing, whether they use 1) heterogeneous cores within the processor architecture, or 2) heterogeneous processors within the system architecture (or both), as a natural response to application diversity. After all, given that different applications in a workload, or different portions of the same application, can have significant differences in algorithmic characteristics, it is just common sense to offload portions of an application with special algorithmic characteristics to a coprocessor that is ideally suited to computing portions of that type. Algorithmic variability within a single application arises naturally from that application’s dynamic mix of different forms of parallelism, different forms of locality, different forms of communication, and/or different forms of synchronization.

This offloading may be loosely or tightly coupled. The standard textbook view of coprocessors is that they extend the instruction-set architecture (ISA) of the main processor to enrich its functionality. In this view, offloading essentially consists of a remote procedure call to the coprocessor, where the main processor waits for the coprocessor to return. But such tight coupling results in sequential processing, which is not what we want. To obtain the benefits of parallel processing, we must configure these processors (or cores) of different types as self-contained, autonomous computers, with distinct instruction sets and distinct _uncoupled_ program counters. We must also be extremely wary of synchronization, which easily serializes the computation. (A minimal sketch of the two coupling styles appears at the end of this introduction.)

Not every system uses asynchronous heterogeneity to accomplish the same goals. Indeed, some design teams have not fully addressed one fundamental problem in system-level asynchronous heterogeneity, viz., what styles of interaction, i.e., what styles of communication and synchronization, minimize the performance degradation of application portions of a given type that run on processors of the corresponding type? If large amounts of the computation run on processors of a given type, you don’t want to slow that type of processor down, because of Amdahl’s law. Pragmatically, application portions are threads of a given type marshaled by the parallelizing compiler from diverse sets of operations in the computation according to well-specified, system-defined criteria. In this article, the term “asynchronous heterogeneity” refers indifferently to loosely coupled heterogeneous processing at either the processor or the system level.

There is already a pre-thesis here: If portions of an application are (somewhat uniformly) distributed across processors of different types, and if each type is expected to perform a significant fraction of the computation, then—given our concern with response times—there are clearly better and worse styles of interaction between processors (or cores) of different types. Quite simply, an interaction style is better if it does not reduce parallelism that could otherwise have been exploited.

Heterogeneity is all around us. Japan is spending close to a billion dollars to build a “usable” 10-PFs/s system, which is due out in 2011, to simulate climate change and galaxy formation, and to predict the behavior of new drugs. This will be a specialized design, one reads, perhaps incorporating several different breeds of processor in a single system.
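To make the coupling distinction concrete before turning to real systems, here is a minimal sketch in ordinary C, with two POSIX threads standing in for a main processor and a coprocessor of a different type. The kernels are placeholders; the point is only who waits for whom. In the tightly coupled version the "main processor" offloads a kernel RPC-style and waits for it, so the two run sequentially; in the loosely coupled version it starts the kernel under an uncoupled thread of control and keeps computing, synchronizing only when the result is finally needed.

/* Sketch: tight vs. loose coupling between a "main processor" and a
 * "coprocessor", using two POSIX threads as an analogy.  The kernels are
 * placeholders; the point is only who waits for whom.                      */
#include <pthread.h>
#include <stdio.h>

static void *coprocessor_kernel(void *arg)   /* work suited to the coprocessor */
{
    double *acc = arg;
    for (int i = 0; i < 1000000; i++) *acc += 1.0;
    return NULL;
}

static void main_processor_work(double *acc) /* work kept by the main processor */
{
    for (int i = 0; i < 1000000; i++) *acc += 2.0;
}

int main(void)
{
    double co = 0.0, host = 0.0;
    pthread_t t;

    /* Tight coupling: an RPC-style offload -- start the kernel and wait for
     * it before doing anything else.  The two "processors" run sequentially. */
    pthread_create(&t, NULL, coprocessor_kernel, &co);
    pthread_join(t, NULL);
    main_processor_work(&host);

    /* Loose coupling: start the kernel and keep computing under an uncoupled
     * program counter; synchronize only when the result is actually needed.  */
    pthread_create(&t, NULL, coprocessor_kernel, &co);
    main_processor_work(&host);
    pthread_join(t, NULL);

    printf("co=%.0f host=%.0f\n", co, host);
    return 0;
}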
Of the planned Japanese system, Jack Dongarra says, “The architecture … is a hybrid [mixing vector processors, scalar processors, and special-purpose compute engines]”. “To get to 10 PFs/s, I think you’d need that”. Dongarra is quoted as saying that the planned system may even include some chips that are designed to perform just one type of calculation, albeit at lightning speed.

Whether conventional commodity clusters can scale to 10 PFs/s may be a controversial question to some, but not to others. We are already seeing limits to “commodity scaling”. Recall that the phrase “commodity processor” typically means a single-core RISC superscalar processor optimized for single-thread performance and dependent for its performance both on its cache and on complex, power-hungry control circuitry for out-of-order execution. Recently, processor vendors with complex microarchitectures trying to scale to higher single-processor performance have been forced by the laws of physics to move to multicore microarchitectures (2X, 4X, 8X, …), where each core is just a smaller instance of the _same_ conventional, complex, power-hungry design.

This doesn’t particularly suit cluster-vendor integrators. Why not? Vast agglomerations of multi-complex-core chips are proving to be extremely difficult to cool. Who knows? Homogeneity may simply annihilate conventional killer micros in large clusters. (Your correspondent suspects that most multi-simple-core processors will be heterogeneous—unless they are just one processor type in a larger heterogeneous parallel system). Standing somewhat apart, Blue Gene/L is not the best example of scaling difficulties because its processors have been explicitly designed to be less power hungry.

Closer to home, Cray Inc., SGI, and SRC (among others) have introduced largely autonomous FPGAs into some of their processor architectures. For example, in SGI Altix 3000 Series Superclusters (think of the large Columbia Project supercluster at NASA Ames), the NUMAlink interconnect is used in several ways: 1) to scale memory, 2) to scale I/O (distinct I/O processors), 3) to scale application-specific units (FPGAs), and 4) to scale visualization (GPUs). These are just multichip heterogeneous processors. For example, two Itanium2 CPUs are linked to an interface chip, two FPGAs are linked to a separate interface chip, the two interface chips have a NUMAlink interconnect between them, and finally the first interface chip is linked to memory. This heterogeneous processor is then used as the building block of a homogeneous system architecture.

Without question, the most visible example of asynchronous heterogeneity is the Sony-Toshiba-IBM (STI) CELL processor architecture, which was initially designed for the PlayStation 3, but which is being adapted by IBM for more general-purpose uses. While the CELL is a heterogeneous processor microarchitecture, rather than a heterogeneous system architecture, it vividly illustrates the power of heterogeneous processing.

— The IBM CELL Processor Architecture

In some sense, the CELL processor architecture is a scaled-down parallel vector processor on a single chip. Each CELL is a single-chip, nine-core microarchitecture with a control processor (the “scalar core”), eight workhorse vector processors (the “vector cores”), an off-chip local memory, and high-speed I/O channels to send/receive messages (in part to/from other CELLs).
One of the most striking features of the CELL is the way in which both off-chip local memory and on-chip “local stores” are used to guarantee a sustained high-bandwidth stream of data operands to each of the vector cores. A CELL’s local memory is one (possibly more) GB of RAM connected to the CELL by two Rambus XDR memory controllers. This dual-bus memory channel can feed the CELL with data at the rate of 25.6 GBs/s.

Now, can, say, a vector core directly load and store from the CELL’s local memory? Not exactly. Every vector core has a local store consisting of 256 KBs of SRAM. It also has 128 128-bit vector registers. The vector loads and stores of the vector core read and write the local store of that vector core. In addition, a vector core can replenish its local store by requesting a block transfer from the CELL’s local memory, with permissible block sizes ranging anywhere from 1 to 16 KBs. (Of course, block transfer is bidirectional). The local store of a vector core is thus a virtual second-level vector register file that is “paged” against the CELL’s local memory. Obviously, if CELLs are used as building blocks in medium-scale parallel systems, the compiler will be responsible for some sophisticated data distribution if this “paging” scheme is to be made to work. Indeed, the CELL is a genuine RISC processor because it leaves all the Really Interesting Stuff to the Compiler. 🙂

The CELL chip will be built using 65-nm VLSI process technology. The scalar core is a dual-issue, dual-threaded, in-order-execution RISC processor with a slightly modified conventional cache but without elaborate branch-prediction support. This much simpler design uses considerably less power. The primary function of the scalar core is to queue up vectorized tasks (i.e., vector threads) for execution by the vector cores. The single scalar core thus offloads all (vectorized) compute-intensive work to the vector cores. It also keeps the “outermost” scalar code for itself.

Is the scalar performance of the scalar core an issue? (As is the custom, each vector processor has an embedded scalar control unit that is tightly coupled to the execution of vector instructions, but this is not our focus). To repeat, the scalar core is responsible for executing the nonvectorizable portion of any application that runs on the CELL, although vector cores could—in a pinch—execute scalar threads. Without much ILP, the scalar core must pray that very little application work falls on its shoulders, and fairly regular work at that. Is the scalar core responsible for keeping the CELL’s local memory fresh? The CELL does have 76.8 GBs/s of I/O bandwidth. Is the scalar core constantly requesting block transfers from outside the CELL to the CELL’s local memory? In any case, it appears—at least in the uniprocessor case—that having a scalar thread running on the scalar core _wait_ for the completion of some compute-intensive vector thread it has offloaded onto a vector core is not an immediately obvious performance problem. (Any scalar threads that might run on vector cores are another matter entirely).

Each vector core is a self-contained vector processor. Like the scalar core, vector cores implement in-order execution. Each vector core has a peak arithmetic performance of 32 GFs/s, which gives all eight of them a combined peak of 256 GFs/s. Let us examine the memory bandwidth that is required to support that much raw vector functional-unit performance.
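Before comparing this with the classic Cray designs, here is a minimal sketch of the “paging” discipline just described, in ordinary C: operands move between the CELL’s local memory and a vector core’s local store in blocks of up to 16 KBs, and all arithmetic is done out of the local store. The dma_get, dma_put, and dma_wait primitives are hypothetical stand-ins (stubbed here with memcpy), not the actual CELL SDK calls.

/* Sketch: paging operands through a 256-KB local store in 16-KB blocks.
 * The dma_get/dma_put/dma_wait primitives are hypothetical stand-ins for
 * whatever block-transfer interface the vector core exposes; here they are
 * stubbed with memcpy so the sketch is self-contained.                    */
#include <string.h>
#include <stddef.h>

#define BLOCK_BYTES  (16 * 1024)                 /* max block transfer      */
#define BLOCK_DBLS   (BLOCK_BYTES / sizeof(double))

/* Hypothetical DMA primitives: local store <-> CELL local memory. */
static void dma_get(void *ls, const void *mem, size_t n) { memcpy(ls, mem, n); }
static void dma_put(void *mem, const void *ls, size_t n) { memcpy(mem, ls, n); }
static void dma_wait(void) { /* block until outstanding transfers finish */ }

/* y := a*x + y, with x and y resident in the CELL's local memory and the
 * arithmetic done entirely out of the vector core's local store.          */
void daxpy_paged(double a, const double *x, double *y, size_t n)
{
    static double ls_x[BLOCK_DBLS], ls_y[BLOCK_DBLS];   /* the "local store" */
    for (size_t i = 0; i < n; i += BLOCK_DBLS) {
        size_t chunk = (n - i < BLOCK_DBLS) ? n - i : BLOCK_DBLS;
        dma_get(ls_x, x + i, chunk * sizeof(double));   /* page blocks in   */
        dma_get(ls_y, y + i, chunk * sizeof(double));
        dma_wait();
        for (size_t k = 0; k < chunk; k++)              /* compute locally  */
            ls_y[k] += a * ls_x[k];
        dma_put(y + i, ls_y, chunk * sizeof(double));   /* page block out   */
        dma_wait();
    }
}

Real CELL code would typically overlap the transfers with the arithmetic (double buffering), but the structure is the same: the local store, not a cache, is what the arithmetic reads and writes.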
Recall that the Cray-1 used vector registers primarily to reduce the demand on limited memory bandwidth, and that the Cray-2 extended this to a two-level scheme. Specifically, the Cray-2 had 16K words of (processor) local memory that had to be explicitly managed by the compiler/programmer. Compare this to the CELL. Each CELL vector core has 128 128-bit vector registers, 256 KBs of computational memory, and a backing store for this computational memory, which is just the local memory of the CELL. The local store of a vector core allows us to do without a cache of any kind.

Imagine a vector thread that, over its lifetime, needs hundreds or thousands of bytes of data to run to completion, and has perfect foreknowledge of what those data are. This much data will easily fit into a vector core’s local store. (The local store must contain both the program and the data). So, if all of the data for the vector thread can be prestaged in the local memory, a vector core can obtain a “prestaged” vector thread, use the DMAC to transfer the thread’s program and all the thread’s data from the CELL’s local memory to the vector core’s local store, and then run the vector thread to completion. Since the local store is a computational memory and not a cache, it can sustain high-bandwidth operand delivery over the entire lifetime of the vector thread. Specifically, it can sustain 64 GBs/s on perfectly vectorized codes, which is one register per cycle.

Once started, vector threads do not stop, i.e., do not pause or wait. They do not synchronize in midstream. The effective thread state of a vector thread running on a vector core consists of the contents of the vector register file _plus_ the contents of the vector core’s local store. A context switch is unthinkable. Vector threads do not start until all their data has been localized (first in the local memory and then in their local store) and, once they have started, they do not stop for anything.

— System-Level Asynchronous Heterogeneity

The Cray Cascade supercomputer design effort is one of three projects left in DARPA’s HPCS program. (The other vendors are IBM and Sun). The competition is heating up as each vendor must decide soon whether to commit to actually building a general-purpose petascale machine. HPCS program managers would prefer a potential product line consistent with the vendor’s market strategy rather than a showpiece machine that leads nowhere. The Japanese have committed large amounts of money to build a heterogeneous petascale architecture in roughly the same time frame as the HPCS program.

The purpose of building such a machine is to give real users a comprehensible and programmable parallel system that would allow them to routinely achieve sustained petascale computing on their important science, engineering, and national-security applications. The roadblock is that these (future) petascale applications are themselves extremely heterogeneous in their algorithmic characteristics, and—given the very real constraints on system cost, power, size, and programmability—there is no single processor architecture, no single clever hybrid of the dataflow and von Neumann computing paradigms, no single heterogeneous processor architecture even, that could be used as the _only_ processor building block in an efficient, general-purpose system-level homogeneous petascale architecture. Similar concerns echo in the reconfigurability community.
A recent conference announcement states, “Reconfigurable high-performance computing systems have the potential to exploit coarse-grained functional parallelism through conventional parallel processing as well as fine-grained parallelism through direct hardware execution”. Thus, many of us see value in decomposing (and molding) applications into component computations of a given algorithmic type, and then running these components (i.e., threads of a given type) on distinct hardware subsystems, most naturally on processors of a given type. But if decomposition is good, what is the good decomposition?

What kind of heterogeneity might a single application have? Well, is it vectorizable or not? More precisely, what is the respective potential for vector-level and thread-level parallelism? Is it communication intensive or not? More precisely, is it temporally local or not? spatially local or not? engaged in frequent long-range communication or not? engaged in frequent short-range communication or not? with what granularity? with what regularity of its memory-access pattern? Finally, is it loosely synchronized or not? (The opposite of loose synchronization is fine-grained synchronization). More precisely, is the synchronization overhead, i.e., the context-switch cost, ruinously expensive or not? Synchronization latency is more a matter of an application’s being communication intensive.

Cascade is heterogeneous in that it combines efficient processing components for heavyweight threads that exhibit high temporal locality and other efficient processing components for lightweight threads that exhibit little or no temporal locality. Heavyweight processors are responsible for executing the compute-intensive portions of an application, and combine the best aspects of multithreading, vector computing, and stream processing for very high-throughput execution of the compute-intensive portions. Within each locale, a number of lightweight (multithreaded scalar) processors will be near or within the locale’s memory for high-bandwidth and low-latency in-memory computation.

Locales, which are the fundamental Cascade building blocks, contain one heavyweight-processor chip and eight multicore lightweight-processor chips, which are either PIM chips properly so called or “PIM equivalent” chips. Each core is a multithreaded scalar processor. Within the locale, PIM chips are backed by non-PIM DRAM. Locales are interconnected by next-generation hierarchical symmetric Cayley-graph networks for high specific bandwidth. In one sense, Cascade’s asynchronous heterogeneity is contained within individual locales; the Cascade system as a whole is then a homogeneous interconnection of locales. However, since processors of both weights have long-range interactions, this gives us full system-level asynchronous heterogeneity. The main chip in a locale implements a heavyweight processor including its caches, an interface to multiple DRAM memory chips with intermixed lightweight processors, and a network router to interconnect to other locales.

Several design decisions were taken up front. First, Cascade is a shared-memory machine. Second, Cascade is dependent on exceptional global system bandwidth, although this is by far the most-expensive part of the system. Third, both heavyweight and lightweight processors are expected to tolerate the latency of long-range communication. In short, both heavyweight and lightweight threads are presumed to be potentially communication intensive.
Prestaging, i.e., localizing data in a locale’s memory for that locale’s heavyweight processor, is not totally absurd, but there is little point if the localization consumes as much global bandwidth as simply doing the long-range communication in the first place. That is to say, localization in Cascade is not absurd but it is subtle. For example, how perfect is our foreknowledge of what data we need? Also, what is the net effect of localization on “return on bandwidth” (see below)?

Some early conceptualizations of Cascade focused on locality, i.e., on different communication (memory-access) patterns. Today, we more accurately describe the differences between heavyweight and lightweight threads, and their respective capabilities, by focusing on the enormous difference in the amount of thread state. Communication still matters, but this new conceptualization allows us to cleanly handle differences in synchronization granularity.

Heavyweight processors adopt the conventional approach to exploiting temporal locality, albeit with a different cache architecture and an abundance of processor parallelism to generate memory-reference concurrency. Temporal locality comes in two forms. In data reuse, a data value that has been loaded into a processor store is used—somewhat promptly—many times before it is discarded. In intermediate-value use (aka “producer-consumer locality”), intermediate values written to a processor store are read from that store—somewhat promptly—and used as operands, possibly many times, before they are discarded. The conventional approach to exploiting temporal locality is to use a large data cache and/or a large register set to provide significant processor storage for live data values. The bigger these stores are, the more temporal locality one can exploit.

Exploiting temporal locality in compute-intensive work is of great performance value because it can dramatically increase the “return on bandwidth” (ROB). Another term for this is “arithmetic intensity”, i.e., how many flops you can execute for each word you load (a short sketch appears below). The system’s arithmetic return on bandwidth is perhaps its most important objective function since global system bandwidth is, by far, the most critical system resource. In truth, there are many, many bandwidths in a computer architecture, some more critical than others.

But there is a fatal flaw. Allowing the thread state proper to grow without limit, and remembering that rebuilding a thread’s cache footprint after a context switch has major performance implications, we find ourselves faced with colossal effective-state computers, which should never be allowed to synchronize, i.e., block—well, hardly ever. The context-switch cost is absurdly expensive. If a compute-intensive, heavyweight thread absolutely must synchronize, so be it. However, we try to avoid this as much as possible. Efficient execution of compute-intensive code goes with large thread state goes with real synchronization difficulty because of the context-switch overhead. Indeed, a homogeneous, pure multithreaded, cacheless architecture is unlikely to result in a usable petascale computer (of course, there are wonderful terascale architectures that adopt this approach). Together with their compiler-directed data caches, heavyweight Cascade processors have traded away their ability to synchronize well in exchange for exceptional performance on compute-intensive codes.
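Return on bandwidth is easy to make concrete. The sketch below, in ordinary C with the usual back-of-the-envelope flop and word counts in the comments, contrasts a DAXPY, which performs roughly two flops for every three words it moves no matter how large the cache, with a blocked matrix multiply, whose flops-per-word ratio grows with the block size that fits in processor storage.

/* Sketch: "return on bandwidth" (flops per word moved) for two kernels.
 * The counts in the comments are standard back-of-the-envelope figures,
 * not measurements on any particular machine.                            */
#include <stddef.h>

/* DAXPY: 2*n flops against roughly 3*n words moved (load x, load y,
 * store y), i.e., about 2/3 flop per word -- no amount of cache helps,
 * because nothing is reused.                                             */
void daxpy(double a, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Blocked n x n matrix multiply (C assumed zero-initialized): each b x b
 * block brought into processor storage is reused about b times per
 * element, so flops per word grows roughly linearly with b -- the bigger
 * the cache or register set, the more temporal locality is exploited.    */
void matmul_blocked(const double *A, const double *B, double *C,
                    size_t n, size_t b)
{
    for (size_t ii = 0; ii < n; ii += b)
        for (size_t kk = 0; kk < n; kk += b)
            for (size_t jj = 0; jj < n; jj += b)
                for (size_t i = ii; i < ii + b && i < n; i++)
                    for (size_t k = kk; k < kk + b && k < n; k++)
                        for (size_t j = jj; j < jj + b && j < n; j++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}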
— The True Role Of Lightweight Processors

PIM processors (see qualification above) are often characterized by saying that every part of memory has a processor that is _very_ nearby, resulting in extremely high bandwidth and low latency between that processor and that part of memory. In Cascade, lightweight processors have two additional architectural features. First, they have been explicitly designed to have very low thread state, i.e., there are very few words in the register set owned by an individual thread. Second, there is an extremely high degree of processor multithreading. (In MTA terms, we would say that the number of processor-resident active threads is extremely high).

Lightweight and heavyweight processors are two intermediate design points on an architectural continuum whose extrema are dataflow computing and von Neumann computing (this will be explained in the conclusion). In lightweight processors, an enormous number of program counters each initiate a moderate number of concurrent operations per program counter. In heavyweight processors, a moderate number of program counters each initiate an enormous number of concurrent operations per program counter. In both cases, the processor initiates an enormous number of concurrent operations per cycle. We can see some minor differences in compiler strategy: a compiler can freely unroll loops for a heavyweight processor without fear of register spill, but should probably stick to software pipelining for a lightweight processor.

Lightweight threads were originally explained as a correction to the conventional approach to exploiting spatial locality. Today, microprocessors use a long cache line to concurrently prefetch data. (This is a latency-tolerance scheme that does nothing to reduce the demand on bandwidth). But an application, depending on its memory-access pattern, need not be cache friendly: bandwidth is actually wasted if the fetched data aren’t used, and cache-coherence schemes quickly lead to the problem of false sharing.

Still, suppose there is a block of data localized in a part of memory somewhere that is relevant to some computation. Say this block is ‘n’ words long. A heavyweight thread that depends on the result of this computation might well be tempted, not to load the ‘n’ words into its locale, but rather to spawn a lightweight thread that migrates to the lightweight processor closest to this memory part, and somehow obtains the result without causing ‘n’ words to be transferred to the original locale. This process iterates, so that if the lightweight thread falls off the end of a block, it simply migrates to pick up at the start of the next block.

A simple example is computing the dot product of two vectors A and B that are both distributed across memory in multiple independent contiguous blocks. A single lightweight thread can repeatedly migrate to track the blocks of vector A, whose components it reads locally, while issuing remote loads to obtain the corresponding elements of B. This locality strategy reduces remote loads by a factor of two, which more than offsets the cost of repeated migration. Moreover, the remote bandwidth is not charged against the originating locale’s “bandwidth budget”. The use of lightweight PIM processors to exploit spatial locality seems very natural, whether this is easily migratable threads pursuing spatial locality across the memory, or somehow using lightweight threads for localization, i.e., copying data nearby.
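Here is a minimal sketch of that migrating dot product, again in ordinary C. The migrate_to and remote_load primitives are hypothetical stand-ins for whatever a PIM runtime would actually provide (they are stubbed so the sketch is self-contained); the point is simply that the thread’s state is essentially one accumulator, so hopping between the blocks of A is cheap, and only the elements of B cross the machine.

/* Sketch of the migrating dot product described above.  migrate_to() and
 * remote_load() are hypothetical lightweight-thread primitives, stubbed
 * here so the sketch is self-contained; in a real PIM runtime, migrate_to()
 * would move the (tiny) thread state next to the memory holding a block of
 * A, and remote_load() would be the only traffic that crosses the machine. */
#include <stddef.h>

typedef struct {
    int           home_node;   /* which PIM node holds this block */
    const double *data;
    size_t        len;
} block_t;

static void   migrate_to(int node)          { (void)node; /* hop to PIM node */ }
static double remote_load(const double *ea) { return *ea; /* one remote word */ }

/* A and B are each distributed as nblocks contiguous blocks; corresponding
 * elements live in blocks with the same index but possibly on different
 * nodes.  The thread reads A locally and B remotely: about n remote loads
 * instead of 2n.                                                           */
double dot_migrating(const block_t *A, const block_t *B, size_t nblocks)
{
    double sum = 0.0;                        /* roughly the entire thread state */
    for (size_t blk = 0; blk < nblocks; blk++) {
        migrate_to(A[blk].home_node);        /* chase spatial locality          */
        for (size_t i = 0; i < A[blk].len; i++)
            sum += A[blk].data[i] * remote_load(&B[blk].data[i]);
    }
    return sum;
}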
But the standard Cascade mantra is: The compiler packages the temporally local loops into heavyweight threads and all other loops into lightweight threads. (“Loops” is a tad restrictive here). So, while spatial locality makes spawning lightweight threads worthwhile, there are many, many other things that make it equally worthwhile.

Let’s take a more abstract view. Some (portions of) computations _require_ high thread state; hence, they must be scheduled on heavyweight processors. Other (portions of) computations _require_ low thread state; hence, they must be scheduled on lightweight processors. Any (portion of a) computation that requires neither should be scheduled on lightweight processors for efficient execution. For example, fine-grained dataflow problems are naturally programmed using fine-grained synchronization, and only run effectively on lightweight processors.

Here is a simple example. The parallel prefix-sum problem is to compute the sum of the first j numbers for j equals 1 to P. Memory locations are equipped with full/empty bits to enable fine-grained producer-consumer synchronization in memory. The problem is solved by migrating P lightweight threads, each initialized with one of the original P numbers, to one or more PIM nodes with the collective capacity to store a logP by P array. After logP time steps, the P prefix sums have been computed. Here is the SPMD code for cyclic reduction (SPMD style is a rotten way for humans to design code).

Cyclic Reduction
________________

-- code for thread j, which is assigned to column j of a logP x P matrix
/* upon termination, the jth prefix sum is in ‘sum’ */

‘shared’  a   ‘future’ ‘int’ ‘array’[1..logP,1..P] := undefined;
‘private’ sum ‘int’ := j;
          hop ‘int’ := 1;

‘do’ level = 1,logP -->
    ‘if’ j <= P-hop --> a[level,j] := sum         ‘fi’  // set location full
    ‘if’ j >  hop   --> sum +:= a[level,j-hop]    ‘fi’  // accumulate sum
    hop := 2*hop
‘od’

Each component of the ‘a’ matrix is a _future_ variable. Each thread iterates logP times. Each time through the loop it synchronizes, i.e., potentially blocks, waiting for the relevant future variable to become defined. It does this _every other instruction_! This constant fine-grained synchronization mandates execution of all thread code on one or more lightweight processors. Spatial locality is irrelevant. Indeed, focusing on thread state opens new vistas: a group of lightweight threads can exploit, not just pre-existing spatial locality, but spatial localizability! We may have to adjust the Cascade mantra.

We come now to one of the fundamental problems of system-level asynchronous heterogeneity: What styles of interaction, i.e., what styles of communication and synchronization, minimize the performance degradation of either lightweight or heavyweight threads? Imagine a set of heavyweight threads, at least one of which is dependent on the result—which can be almost anything—of a subcomputation that most properly runs as one or more lightweight threads executing on lightweight processors. How should heavyweight and lightweight threads coordinate? The solution is straightforward: any executing heavyweight thread may spawn any number of lightweight threads but cannot _itself_ wait for results to be returned from any of them. Imagine the performance hit if, say, an executing compute-intensive, heavyweight thread were to RPC a parallel prefix summation, and then patiently wait for the result to be returned.
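The alternative coordination style, developed in the text that follows, can be sketched in a few lines of ordinary C: a heavyweight thread spawns lightweight work and immediately returns to its own computation; the lightweight threads decrement a dependence counter belonging to a different heavyweight thread and, when the counter reaches zero, push that thread onto the locale’s ready queue. The queue, counter, and task types here are illustrative, not Cascade’s actual runtime interface.

/* Sketch of the fire-and-forget coordination style.  The types and the
 * queue are hypothetical; the counter would be atomic in a real runtime. */
#include <stdio.h>

#define QCAP 64

typedef struct {                     /* a heavyweight thread awaiting its  */
    const char *name;                /* initial dependences                */
    int deps_remaining;
} hw_task_t;

typedef struct {                     /* the locale's central work queue    */
    hw_task_t *slots[QCAP];
    int head, tail;
} ready_queue_t;

static void enqueue(ready_queue_t *q, hw_task_t *t)
{
    q->slots[q->tail++ % QCAP] = t;
}

/* Called by a lightweight thread when its piece of the prestaging work is
 * done (e.g., one block of an in-memory prefix sum or data reshuffle).    */
static void lw_complete(hw_task_t *successor, ready_queue_t *q)
{
    if (--successor->deps_remaining == 0)  /* last one in turns the light on */
        enqueue(q, successor);             /* now schedulable on a HW proc   */
}

int main(void)
{
    ready_queue_t rq = { {0}, 0, 0 };
    hw_task_t next_phase = { "next compute-intensive phase", 4 };

    /* An executing heavyweight thread "spawns" four lightweight threads and
     * goes straight back to its own compute-intensive work -- it never
     * waits.  Here the spawns are simulated as direct calls.               */
    for (int i = 0; i < 4; i++)
        lw_complete(&next_phase, &rq);

    printf("ready-queue depth: %d\n", rq.tail - rq.head);   /* prints 1 */
    return 0;
}

Nothing here ever makes the spawning heavyweight thread block; the only synchronization the enabled heavyweight thread sees happens before it is scheduled for the first time.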
In point of fact, lightweight threads have many, many ways to satisfy the (initial) task or data dependences of heavyweight threads. It is just that this is done up front, before the heavyweight thread has ever been scheduled to run on a heavyweight processor. Think of a dataflow model in which dataflow constraints schedule the first instruction of a thread, but then the thread runs normally. If one takes an extremely abstract view of “prestaging”, then the compiler can identify many kinds of work—whether this be rapid updates of contiguous data blocks, tree/graph walking, efficient and concurrent array transpose, fine-grained manipulation of sparse and irregular data structures, parallel prefix operations, in-memory data movement—in fact, any computation at all that executes efficiently in a low thread-state environment—that should be completed before a _particular_ heavyweight thread is first scheduled for execution on a heavyweight processor. Remember, the compiler chooses what to put into each thread.

As a group, heavyweight processors offload work of various kinds—work that suits them poorly—onto lightweight processors. Moreover, as a group, heavyweight threads offload work of various kinds onto lightweight threads. The trick, which isn’t exactly rocket science, is that one heavyweight thread spawns a set of lightweight threads whose completion satisfies some initial dependence of _another_ heavyweight thread. Conversely, a group of lightweight threads can cause a compiler-specified set of dependence constraints for the initial scheduling of a heavyweight thread to be satisfied. Mechanically, the lightweight threads place the newly enabled heavyweight thread onto the work (or ready) queue, from which it is later retrieved when the heavyweight processor becomes free. This is dynamic scheduling of compute-intensive tasks from the locale’s central work queue.

— Conclusion

The deeper meaning of system-level asynchronous heterogeneity is that it can be viewed as the natural implementation of a new “ontological” model of parallel computation that identifies the single most important dimension of intra-application variability in petascale applications. In a word, parallel computing is an irreducible mixture of high thread-state and low thread-state computing.

Historically, dataflow and von Neumann computing have been viewed as radically different, and perhaps irreconcilable, models of computation. However, it is now understood that the models are simply the extrema of an architectural continuum. The von Neumann architecture can be extended into a multithreaded model by replicating the program counter and register set, and by providing efficient primitives to synchronize among several threads of control. What was the main issue in trying to make dataflow hybrids work? Just one, really: how to provide new mechanisms to reduce the cost of fine-grained synchronization while retaining the performance efficiencies of sequential scheduling, i.e., control-flow threads and registers. Dataflow computing could synchronize well, but extended von Neumann computing could actually compute efficiently. Speaking _very_ loosely, fine-grained dataflow gives us efficient synchronization, and coarse-grained dataflow, in which threads actually compute something appreciable before synchronizing, gives us high-performance computing. Pure dataflow is zero thread-state computing. Pure MTA-like multithreading is moderate thread-state computing.
Vector processing and conventional RISC superscalar processing are high thread-state computing (although, of the two, only vectors support scalable latency tolerance). Alas, no single choice of thread state can possibly satisfy all of computing’s needs. No _single_ intermediate design point on this architectural continuum efficiently supports general-purpose computing. In order of increasing thread state, the architectural continuum contains: 1) dataflow computing, 2) lightweight processing, 3) heavyweight processing, and 4) von Neumann computing. This leaves many questions open. How light should lightweight be? How heavy should heavyweight be? Can we adjust the effective thread state of processors dynamically?

Cascade’s model of parallel computation is that petascale applications contain a (somewhat balanced and possibly hidden) irreducible mixture of portions of the computation suitable for coarse-grained dataflow computing, i.e., high thread-state computing, and other portions suitable for fine-grained dataflow computing, i.e., low thread-state computing. Portions suitable for coarse-grained dataflow computing are executed as compute-intensive, heavyweight threads that essentially “synchronize” precisely once, before they are scheduled on the processor for the very first time. Portions suitable for fine-grained dataflow computing are executed as agility-intensive, lightweight threads that have no problem stopping and restarting any number of times _during_ thread execution because the thread state is so low. (Low thread state creates thread agility, which is essential for either migration or synchronization). By the way, low thread-state computing is a fascinating engineering discipline, whose many possibilities have only been hinted at here.

Cascade’s system architecture follows from its model of parallel computation in a natural way. If parallel computation is an irreducible mixture of high thread-state and low thread-state computing, a parallel system architecture should be split into separate subsystems that support the global parallel computation precisely by efficiently executing both the high thread-state code and the low thread-state code separately in parallel. There is complete interaction symmetry here. Just as lightweight threads push heavyweight threads onto the central work queue in a locale (from which they are dynamically scheduled onto the heavyweight processor), so heavyweight threads (via spawning) push lightweight threads onto the work queue of some lightweight processor (from which they are dynamically scheduled onto that lightweight processor).

The compiler splits the entire application into compute-intensive, heavyweight portions and agility-intensive, lightweight portions, possibly with dependences between portions of the same type, but with the rigid style of interaction between portions of opposite type outlined above. Your correspondent’s sense is that such decomposition will, in general, always be possible. We are only beginning to understand this compilation process and its effect on independent subsystem utilization.

There is a clear design choice here. Should we build explicitly reconfigurable components that morph into different execution modes when the application changes, or should we design two separate processing subsystems that operate in parallel and have been optimized separately to support the efficient execution of the two basic kinds of computation we expect in petascale applications? In fairness, only time will tell.
On the architectural continuum from dataflow to von Neumann computing, there is no single good intermediate design point for parallel computing; all parallel computing is an irreducible mixture of dataflow and von Neumann computing; hence, we must build separate subsystems to handle each portion separately. This is a liberating view that lets us see design options we never saw before.

—

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.
