Heterogeneous Processing Needs Software Revolutions

By the High-End Crusader

June 23, 2006

Many seasoned observers of high-end computing feel it in their bones that heterogeneous processing is rising like the Fundy tide (to the east of Maine and then north), a massive natural phenomenon of gravity and hydrodynamics. In the United States notably, under the auspices of DARPA's HPCS program, there is Cray's Cascade project and IBM's infuriating "don't ask; don't tell" Percs project. Admittedly, IBM _has_ made strong representations to DARPA concerning Percs' heterogeneity content. In Japan, the heterogeneous 10-PFs/s Keisoku Keisanki---the poetry is lost in translation---will be built. In Japan, the money and political will are there.

So, can we just sit back and wait for good things to happen? Will this inevitable transition to heterogeneous processing in high-end computing usher in a refreshingly beneficial paradigm shift in parallel computing? Ha! What have you been smoking? Unlike the Fundy tide, which needs only natural law to work its awesome magic, heterogeneous processing---which encompasses all computing on any member of a widely diverse set of heterogeneous system architectures---needs, in each and every nontrivial instance, a revolution in system software. In the absence of appropriate system software, a nontrivial heterogeneous-system project will just abort. In Japan, where the proposed hardware architecture for the Keisoku Keisanki is very clean, the Japanese inability to handle the software revolution will---in all likelihood---simply stop the Keisoku Keisanki dead in its tracks.

But the U.S. has recognized expertise in sophisticated system software, programming languages, language processing, runtime systems, program-development tools, models of parallel computation, and all that, does it not? Perhaps the U.S. has such expertise in principle, but it clearly has not yet worked the issues of developing system software to manage, operate, and exploit next-generation heterogeneous systems.

More generally, in the area of heterogeneous processing for high-end computing, there is a total absence of leadership---from vendors, from academia, from government. Whether we consider heterogeneous system architectures, hardware technology, language processing, etc., etc., there are no private-sector computer architects or government agencies who either can, or are willing to, assume a leadership role; there are no compelling visions of heterogeneous processing from which to choose. We are lacking even a simple roadmap for heterogeneous processing. And DARPA has---or so it seems---bought into the Army's oxymoronic concept of low-footprint counter-insurgency, i.e., it is not pressuring vendors to bite the bullet of revolutionary heterogeneous processing---notably, in the areas of developing sophisticated system software and providing sufficient global system bandwidth.

What might a concerned citizen (say, a high-end crusader) do? Start at the beginning, start with what we know, and slowly build up the foundations until we see the clear choices in heterogeneous system architectures and the deep challenges in system software that accompany each one. To anticipate, the key task for heterogeneous-system system software lies in scheduling strategies and other system functions that maximize the performance extracted from scarce system resources, notably the heterogeneous system's limited global system bandwidth. This is the punchline of this article.
--- Latency Avoidance and Latency Tolerance

In the von Neumann model, processors are separated from memories, from which they fetch operand data to feed their arithmetic functional units. We say that a high-value processor suffers from _latency disease_ if it idles much of the time waiting for data operands to arrive. For simplicity, we focus on memory latency (or network/memory latency). Since arithmetic functional units are of exceedingly low value, they cannot suffer from latency disease; no one gives a hoot about their degree of utilization. Only _critical_ system resources can suffer from either low _or_ foolishly extravagant utilization, where "critical" means "costs a lot" or "is a primary performance bottleneck" or both.

In the early 90s, when control-flow processors were reasonably critical resources, the memory-latency story was simple. Latency is avoided by copying data nearby. Latency is tolerated by doing something else while waiting. Avoidance scales up if locality increases with size. Tolerance scales up if parallelism increases with size.

Multiprocessing (a/k/a multiprogramming) is based on the latter premise. In multiprocessing, a job requests I/O and then blocks, performing a context switch to another job. There is a dependence because the first job cannot continue until its I/O has completed. The compute processor offloads work onto the I/O processor. No deep thinking is required because only the compute processor can handle computation and only the I/O processor can handle I/O, assuming that I/O is DMA. The compute processor _tolerates_ disk latency in order to maximize total system throughput. There is no interest in individual job latency, which often increases. Multiprocessing doesn't scale down because of context-switch cost.

The heterogeneous multicore Cell processor avoids this problem by using both nonpreemptive vector threads and software control of the memory hierarchy. Think of a vector core's SRAM local store as if it were a very nearby local memory (whose latency is not an issue), think of the Cell processor's DRAM external memory as if it were the vector core's disk, and think of the data movement between the Cell's DRAM and its SRAM as if it were I/O. Using the "I/O-request (DMA)" instruction in its ISA, the vector compute processor offloads work onto the I/O processor, i.e., the scalar core. Moreover, this "I/O" is orchestrated so that data from the "disk" is always present in the local store well before it is required by the vector processor, so this processor never needs to wait for data (and never _should_ be made to wait).

Here, a _little_ thinking is required. After all, the scalar core and the vector cores are _compute_ processors. A decision must be taken that some work is best performed by offloading it from one processor subsystem onto the other processor subsystem. In a heterogeneous multicore system, it may be quite important that work be scheduled on a core of appropriate type.

Memory-latency avoidance techniques include processor registers, caches used temporally, and nearby memory. Memory-latency tolerance techniques include vector pipelining, caches used spatially (i.e., long cache lines), prefetching (a/k/a precommunication), and multithreading. Every parallel machine contains some mix of these, or similar, techniques. For example, the MTA uses the following memory-latency techniques: processor registers, nearby memory, prefetching (i.e., explicit-dependence lookahead), and of course multithreading.
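To make the Cell-style software control of the memory hierarchy a bit more concrete, here is a minimal Python sketch of the double-buffering pattern described above. The names, and the thread pool standing in for a DMA engine, are illustrative assumptions, not the Cell's actual DMA interface.

    # Sketch only: overlap the "I/O" that fills a small local store with the
    # compute on the previously fetched block, so the compute loop never stalls.
    from concurrent.futures import ThreadPoolExecutor

    def fetch(block):
        # Stand-in for an asynchronous DMA from external DRAM into the local store.
        return list(block)

    def process_blocks(external_blocks):
        result = 0.0
        if not external_blocks:
            return result
        with ThreadPoolExecutor(max_workers=1) as dma_engine:
            pending = dma_engine.submit(fetch, external_blocks[0])   # prime the pipeline
            for i in range(len(external_blocks)):
                nxt = (dma_engine.submit(fetch, external_blocks[i + 1])
                       if i + 1 < len(external_blocks) else None)    # prefetch the next block
                local = pending.result()     # ideally already complete: no stall
                result += sum(local)         # the "vector" compute step on local data
                pending = nxt
        return result

    # e.g. process_blocks([[1.0] * 1024 for _ in range(8)]) == 8192.0

The only point is the overlap: the fetch of block i+1 is in flight while block i is being consumed, so the vector side ideally never waits on its "disk".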
Note that the MTA predated any understanding of heterogeneous processing as a key enabler of scalable high-end computing, or the practical desire to scale to sustained petaflops and beyond, for that matter.

In processor-based latency tolerance, the processor supplies a steady stream of memory references that eventually fill the memory pipeline (i.e., outgoing network, memory subsystem, incoming network). The operand values returned by the memory satisfy dependences on these values in requesting threads. Little's law prescribes how many memory references must be outstanding in order to sustain, in the face of a given network/memory latency, the desired bandwidth of returned operand values.

But what, abstractly, _is_ processor-based latency tolerance? (The answer points the way to system-level latency tolerance). The processor issues a steady stream of _dependence requests_. The processor receives a steady stream of _dependence satisfiers_. This stream of dependence satisfiers guarantees that a processor's work queue is constantly stocked with ready threads, and hence that the processor is always constructively occupied. This abstraction _crashes and burns_ if there is not enough global system bandwidth to transport these streams of dependence requests and dependence satisfiers. No system software for a heterogeneous system can begin to compensate for a too-significant underprovisioning of the system's most critical resource---its global system bandwidth. The reader should remember that, in this writer's view, high-end computing properly targets the "difficult" applications, which are not easily localizable and have other interesting attributes (see "Hard Questions While Waiting For The HPCS Downselect", HPCwire, May 5, 2006).

For quite a few reasons, neither the MTA nor the MTA-2 had a D-cache. Most of the heavy lifting was done by the processor parallelism, and hence the memory-reference concurrency, generated by the MTA's multithreaded processors. Although people use the term "latency-tolerant processor", latency tolerance is actually the joint result of processor parallelism and network bandwidth. In any case, as scaling parallel systems to tens and hundreds of sustained petaflops on nonlocalizable applications was contemplated, it became more and more clear that the combination of processor parallelism and network bandwidth---when used in isolation---simply does not scale to handle the large system diameters in petascale systems. Obviously, a hybrid approach incorporating an _extended_ mix of latency-tolerance and latency-avoidance techniques is required.

The affordable, industrial-strength solution to the problem of scaling parallel machines to tens and hundreds of sustained petaflops on difficult applications, which are profoundly cluster unsuitable, lies in increasing the system's parallelism, for superior latency tolerance, and increasing the system's locality, for superior latency avoidance. There are feasible (heterogeneous) and infeasible (homogeneous) ways of attempting to do this. As heterogeneous processing began to be understood, it became clear that we have to generate parallelism both inside and _outside_ of the processors (at the system level, as it were) and to exploit heterogeneity to create entirely new approaches to generating locality.
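The Little's-law bookkeeping mentioned above is worth writing down, if only to see how quickly the required concurrency grows. A tiny sketch, with purely illustrative numbers rather than measurements of any real machine:

    def outstanding_references(desired_words_per_cycle, round_trip_cycles):
        # Little's law: required concurrency = desired throughput * latency.
        return desired_words_per_cycle * round_trip_cycles

    # Sustaining one returned operand per cycle against a 600-cycle
    # network/memory round trip needs roughly 600 references in flight,
    # supplied by some mix of vector pipelining, prefetching, and multithreading.
    print(outstanding_references(1.0, 600))   # -> 600.0

The same arithmetic explains why large system diameters are so punishing: the concurrency the processors must generate grows in direct proportion to the round-trip latency.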
The amazing thing is that intelligent, _bidirectional_ offloading of work (both hardware-controlled and software-controlled) from either of two complementary processor subsystems onto the other turns out to be the key to both endeavors, and leads to generalized notions of latency tolerance, latency avoidance, dynamic (thread) scheduling, and load balancing.

--- Scheduling Multithreaded Computations

Scheduling of multithreaded computations on parallel machines is complicated by the dynamic nature of such computations, which is only fair since both hardware and software multithreading have been proposed as general solutions to the problem of exploiting dynamic, unstructured parallelism. Here, dynamically created threads cooperate in solving the problem at hand. However, for efficient execution, we must have efficient runtime thread placement and scheduling. Although general thread placement for optimal processor utilization is NP-hard, schedulers based on simple heuristics have been proposed that work well for a broad class of applications.

Historically, implementations of the thread scheduling and placement task at the very core of the runtime system have had two possibly conflicting goals: 1) maintain related threads on the same processor to minimize communication cost, and 2) migrate threads to other processors as required for dynamic load balancing. Previous work on runtime scheduling assumed that threads and processors were homogeneous.

There are two main approaches. In _work sharing_, whenever a processor generates new threads, the scheduler attempts to migrate some of them to (potentially) underutilized processors. In _work stealing_, only processors that are actually underutilized attempt to "steal" threads from other processors. Work stealing minimizes the communication cost of thread migration---if processors have enough work to do, no thread migration takes place.

In the multithreaded world, a work-stealing scheduler must also handle "dataflow" computations, in which threads may stall due to a data dependence. This usually involves dynamic scheduling of the threads present in the processor's work (or ready) queue---when an execution unit finishes a task, it automatically reaches into the work queue to fetch another task. What makes sense here critically depends on the precise communication cost of thread migration, and on what performance advantages might accrue from running a thread on a processor with special characteristics.

But this can mean many things. In "Hard Questions" (op. cit.), much was made of threads extracted from either serial code, vector code, or multithreaded code. Obviously, a given code type mandates execution on a core of matching type (i.e., either serial, vector, or multithreaded). In this section, we largely abstract from "serial/vector/multithreaded" code heterogeneity---we take this form of heterogeneity for granted, and seek to uncover and exploit deeper forms.

A multithreaded computation may be represented as a directed acyclic graph (dag) where the vertices are tasks (possibly elementary operations) and the edges are either flow dependences or spawn operations. During execution, a thread may create, or _spawn_, other threads. In this representation, the only edges within a given thread are flow dependences (computations arising from individual threads are themselves partial orders). An edge that crosses from one thread to another is either a spawn operation or a flow dependence between a task in one thread and a task in another thread.
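A minimal sketch of the dag representation just described; the names are illustrative, not those of any particular runtime.

    # Vertices are tasks, grouped into threads; edges are either flow
    # dependences (intra- or inter-thread) or spawns.
    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        thread: str                                     # the thread this task belongs to
        flow_deps: list = field(default_factory=list)   # tasks whose results this task needs
        spawns: list = field(default_factory=list)      # names of threads this task creates
        done: bool = False

    def is_ready(task):
        # A task may run once every flow dependence into it has been satisfied.
        return all(dep.done for dep in task.flow_deps)

    # Two tasks in thread A, one of which spawns thread B; B's first task has an
    # inter-thread flow dependence on A's second task.
    a1 = Task("a1", thread="A", spawns=["B"])
    a2 = Task("a2", thread="A", flow_deps=[a1])          # intra-thread flow dependence
    b1 = Task("b1", thread="B", flow_deps=[a2])          # inter-thread flow dependence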
An important special case is where an inter-thread dependence edge (or edges) functions as an _abstract_ spawn operation. Suppose that only the initial tasks of a given thread are flow dependent on tasks in other threads. Until these dependences are satisfied, the thread is blocked and is not uploaded to the processor's work queue. Only when all initial dependences of the thread have been satisfied (when the closure becomes full, as it were) does this uploading take place. Essentially, a "fresh" thread has been created that now competes for the processor's execution units. We have seen such threads before, in our discussion of the Cell processor.

Threads with only initial dependences are instances of nonpreemptive threads, which---once scheduled on an execution unit---run to completion without stalling or blocking. Nonpreemptive threads are only justified in general-purpose computing when the thread's state becomes so colossal that context switching is not an option. These threads lack the expressiveness and flexibility required to exploit dynamic, unstructured parallelism. More familiar preemptive threads, which may stall or block many times during execution, are often used exclusively; in any case, preemptive threads are mandatory for the "dataflow" portions of a multithreaded computation.

Heterogeneous processing's potential lies in providing mechanisms that allow us to combine approaches that were once thought to be mutually exclusive. So, task the system software with mixing and matching (agile) preemptive threads _and_ (cumbersome) nonpreemptive threads. Let the compiler and the runtime cooperate to mix and match dependence-aware thread scheduling _and_ dependence-oblivious thread scheduling. In fact, for performance beyond the limits of processor parallelism and for dynamic load balancing, let us attempt to combine work stealing _and_ work sharing. But we are getting ahead of ourselves.

Multithreaded computations with arbitrary data dependences are often impossible to schedule efficiently. Dynamic scheduling within and across processors is normally dependence oblivious. The standard scheduling heuristic organizes each processor work queue as a _ready deque_ (double-ended queue) with a _top_ and a _bottom_. Threads are inserted on the bottom, and can be removed from either end. Briefly, an idle processor normally fetches its next thread from the bottom of its ready deque. Locally spawned or enabled threads are placed on the bottom of the ready deque. An idle processor confronted with an empty ready deque attempts to steal the topmost thread from the ready deque of a randomly chosen remote processor. This "local depth-first, remote breadth-first" scheduling heuristic captures only some of the dependence order of threads but is computationally feasible as a scheduling algorithm. (A minimal sketch of this deque discipline appears below.)

--- System Software for Heterogeneous Parallelism

The grand challenge in this area is designing system software to manage a heterogeneous system architecture with full-scale generation and exploitation of locality. The software tasks are so revolutionary that very few people even begin to understand them. On a more mundane level, your correspondent finds it challenging to describe these software tasks, and the idealized heterogeneous system architecture to which they correspond, in a clear and coherent fashion. One fact that complicates matters is that there is no uniquely preferred implementation of this idealized system.
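Returning for a moment to the ready-deque heuristic of the previous section, here is the minimal sketch promised above. All synchronization, which a real runtime must get right, is omitted, and picking only among non-empty victims is a small simplification of "choose a random remote processor and retry".

    import random
    from collections import deque

    class Worker:
        def __init__(self, wid):
            self.wid = wid
            self.ready = deque()               # bottom = right end, top = left end

        def make_ready(self, thread):
            self.ready.append(thread)          # locally spawned/enabled: push on the bottom

        def next_thread(self, all_workers):
            if self.ready:
                return self.ready.pop()        # local depth-first: take from the bottom
            victims = [w for w in all_workers if w is not self and w.ready]
            if victims:
                return random.choice(victims).ready.popleft()   # steal the topmost thread
            return None                        # nothing to run and nothing to steal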
So much of the heated debate about competing implementations has been misdirected! Design teams have suffered major attrition. Well, there is no point crying over spilt milk. The idealized architecture does appear to provide the maximal exploitable parallelism _and_ the maximal exploitable locality in the presence of "difficult" applications (remember, this is a term of art). There can be no doubt that different implementations have a _major_ effect on the granularity with which we can exploit either parallelism or locality, but a coarse-grained implementation of the right architecture is better than any implementation of the wrong architecture. After the excitements of 2005, your correspondent is resigned to incremental implementation; that is why it is so important to be clear about the idealized heterogeneous system architecture that we would _all_ like to implement.

Given the difficulty of exposition, we pause to discuss system software to manage heterogeneous parallelism before moving on to discuss the more subtle problem of system software to manage heterogeneous locality.

Homogeneous multicore processors, which are now becoming quite standard in industry, create conflicting requirements in core complexity. Ironically, for this very reason, even "latency-intolerant" commodity processors will soon be forced to bite the bullet of heterogeneous multicore. Some applications do not parallelize (or at least have not been parallelized yet). Other applications may have "serial" portions. In such cases, single-thread performance becomes the relevant figure of merit. When this is so, the appropriate core is a larger, higher-power core that implements _out-of-order_ execution of individual threads. But out-of-order cores, with associated major increases in area, power, and design complexity, lead to only modest increases in application performance (leaving aside the effects of technology scaling).

Other applications can be easily decomposed into parallel threads. When we replace single-thread performance as the figure of merit with system throughput of the set of threads that have been extracted from an application, the appropriate choice is a large number (say, ten) of smaller, lower-power cores, each of which implements _in-order_ execution of individual threads. Given a decomposition of our application into ten threads, the ten in-order cores simply blow the single out-of-order core out of the water. (The Achilles heel of out-of-order cores is the use of register renaming, which increases the length of the instruction pipeline).

In reality, applications dynamically change their stripes during execution, sometimes behaving as serial code, at other times behaving as threaded code. Since no single choice for core complexity makes sense, we _should_ design heterogeneous multicore processors with both one large out-of-order core and ten small in-order cores---in this case, almost certainly on the same die. (Rescuing latency-intolerant commodity processors with heterogeneous multicore is an example, not your correspondent's dream microarchitecture. But the idea is good).

One possibility is to use the same ISA for both types of core, with only performance differences (the power-hungry cores do run threads faster). The compiler will extract threads from both the portions of the code it deems serial and the portions of the code it deems threadable. The compiler will decide on an appropriate core to execute each thread at the moment of its extraction.
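A deliberately crude sketch of that same-ISA core-assignment decision for the hypothetical one-big-plus-ten-small configuration described above. The heuristic is an illustrative stand-in, not a real compiler's cost model.

    BIG_OOO_CORE = "ooo0"                                  # one large out-of-order core
    SMALL_CORES = [f"inorder{i}" for i in range(10)]       # ten small in-order cores

    def assign_cores(extracted_threads):
        # Serial-ish work (a single extracted thread) goes to the out-of-order core,
        # where single-thread performance is the figure of merit; parallel work fans
        # out across the in-order cores, where throughput is.
        if len(extracted_threads) == 1:
            return {extracted_threads[0]: BIG_OOO_CORE}
        return {t: SMALL_CORES[i % len(SMALL_CORES)]
                for i, t in enumerate(extracted_threads)}

    # e.g. assign_cores(["t0"]) maps t0 to the big core, while
    # assign_cores([f"t{i}" for i in range(10)]) gives each thread its own in-order core.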
More interesting is to have different ISAs in different cores. This may lead to having multiple versions of code, where the compiler and runtime working together make the final decision. The Cell processor uses different ISAs: the scalar core runs scalar code and the vector cores run vector code. In the applications that have been ported to the Cell so far, scalar versus vector appears to be a fairly trivial decision. The Cell processor is so cut-and-dried that it may not actually need a software revolution.

Even this simple example provides a preliminary answer to the general question: what software revolutions are required to bring out the potential of heterogeneous processing? At first glance, it seems that we only need to build tools for program development, design a decent compiler that takes advantage of the possibilities of using threads creatively, build a runtime system that solves the problem of thread scheduling, and, oh yes, agree on a computational model of heterogeneous processing that will allow us to integrate our hardware and software efforts.

The example also points the way to a central theme of the next section. Pretend that the single out-of-order core forms one processor subsystem and that the ten in-order cores form a distinct processor subsystem. In this way, we may view the transition from a serial portion of the computation to a threaded portion as one thread running on a heavyweight processor subsystem offloading work onto a lightweight processor subsystem by spawning ten new threads. More interesting, the decision to offload work is based not only on the fact that the work can be done faster on the other subsystem, but also on the fact that a critical system resource will be more moderately consumed. Here, the system resource is the cooling capacity, which must offset power dissipation (we are---reasonably---assuming that ten in-order cores can be made to dissipate less power than one monster out-of-order core). Hardware and software control of general-purpose heterogeneous-processing systems is all about scheduling heterogeneous tasks wisely onto distributed, heterogeneous system resources.

Heterogeneous multicore can also mean a microarchitecture in which (some of) the cores are full-custom designs of vector and/or multithreaded latency-tolerant processors. With serial/vector/multithreaded heterogeneity, the compiler must generate code for all three types of core (execution unit). Obviously, these cores have very different ISAs. But some of the decisions (initially) taken by the compiler are decidedly nontrivial. For example, how do you decide correctly, almost all of the time, whether a given loop should be vectorized or whether it should be multithreaded? Perhaps there is no single right answer at compile time. Again, this looks like an area where it would be advantageous to generate multiple versions of the code. The programmer has some knowledge of his application and should not be ignored, but even the simple question of code generation and core assignment will require closer cooperation between the compiler and the hardware, with the runtime system often making the final call.

Major advances in compilers, runtime systems, development tools, debugging tools, etc., etc., are required just to manage heterogeneous parallelism. Things get really interesting when we task the system software with integrated joint management of heterogeneous parallelism _and_ heterogeneous locality. This is the theme of the next section.
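Before turning to locality, a minimal sketch of the multi-versioning idea just described: the compiler emits both a vector version and a multithreaded version of a loop, and the runtime makes the final call from whatever it can observe. The selection criterion below is a made-up stand-in for whatever counters and cost models a real system would consult.

    def loop_vector_version(data):
        return sum(data)      # stand-in for the code generated for the vector cores

    def loop_multithreaded_version(data):
        return sum(data)      # stand-in for the code generated for the multithreaded cores

    def run_loop(data, trip_count, unit_stride):
        # The runtime, not the compiler, makes the final vectorize-vs-multithread call.
        if unit_stride and trip_count >= 1024:
            return loop_vector_version(data)        # long, regular loop: vectorize
        return loop_multithreaded_version(data)     # short or irregular loop: multithread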
--- Heterogeneous Parallelism and Locality

Consider a _virtual_ heterogeneous system architecture (i.e., don't ask if distinct processor types imply the existence of distinct full-custom chip designs or polymorphic processors or whatever). Make the stipulation that global system bandwidth is the heterogeneous system's most precious, and most critical, resource. Further stipulate that global system bandwidth is _somewhat_ underprovisioned, by necessity, not by choice. The goal of heterogeneity is to make the best possible use of this limited global system bandwidth by extracting the maximum possible performance from each unit of consumed bandwidth. In a heterogeneous system, the hardware provides distinct mechanisms, and the system software makes sophisticated decisions about which mechanisms to use.

Processors, i.e., execution units (indeed, individual cores), are divided into two types: _heavyweight processors_ (HWPs) and _lightweight processors_ (LWPs). (Do not ask if a given individual core can morph at runtime from being an HWP into being an LWP---these are virtual processors and such questions are inappropriate and/or proprietary). HWPs have the property that they allow executing threads to accumulate a colossal amount of thread state, which is normally stored in a D-cache and/or a large register set. In contrast, LWPs have the property that they can accommodate executing threads that do not, or should not, accumulate large amounts of thread state.

Logically, we imagine both an HWP processor subsystem and an LWP processor subsystem. Physically, both the set of HWPs and the set of LWPs are distributed throughout system memory, at the same or different granularities. We assume nonuniform-memory-access (NUMA) shared memory. More importantly, we suppose a memory and bandwidth hierarchy in which processors enjoy more bandwidth to closer ("more local") memory and less bandwidth to farther ("more global") memory. This memory and bandwidth hierarchy may be quantized, e.g., by providing special high-bandwidth interconnect within well-defined locality regions (locales or places).

Furthermore, we assume that the bulk of the HWPs, and _all_ the LWPs, support processor-based latency tolerance. This means that, given the need for global communication, both HWPs and LWPs can be _starved_ by insufficient availability of global bandwidth. (LWPs starve more easily). Lack of global bandwidth can also starve system-level latency tolerance in a heterogeneous system, simply because it inhibits both thread migration and the return of dependence satisfiers.

The desire for distinct processor types is in reality the desire for distinct thread types: high-state threads (a/k/a heavyweight or immobile threads) and low-state threads (a/k/a lightweight or mobile threads). Your correspondent also uses "cumbersome" and "agile" to make this distinction. Heterogeneous systems exploit these two distinct thread types to make more productive use of limited global system bandwidth. System software decides which type of thread to run and where to run it.

So, what is the $64,000 question that system software needs to answer? Recall that a processor that sustains an average network bandwidth of 'b' words/cycle over an average network distance of 'd' links consumes system bandwidth at the rate of (b * d) network words/cycle. This must be matched against some sustained performance, say, 's' flops/cycle.
For efficient execution, we need to maximize the ratio of bang to buck, i.e., the value of 's' divided by (b * d).

Consider a hypothetical daemon who combines the powers of an incremental compiler and a sophisticated runtime system, and perhaps even has access to hardware instrumentation that gives him tips about network bandwidth and network distance. We will place one daemon at each HWP, leaving unsaid whether there is also a daemon at each LWP. This daemon is constantly asking himself: At this point in my (local) computation, do I have access to a ready thread with high internal locality (a/k/a temporal locality; see "Hard Questions", op. cit.)? If so, does the arithmetic intensity (the arithmetic per delivered operand) justify the still significant consumption of global bandwidth? What is the ratio of bang to buck? In other words, should I schedule this high-state thread to execute here, at this (heavyweight) processor?

In contrast, do I have access to a ready (spawnable) thread with low internal locality? Is there a (programmed) concentration of thread-relevant data somewhere in the system to which I could migrate this thread? At the remote (lightweight) processor, it might not enjoy much arithmetic intensity, but the average network distance would be considerably reduced, thus maintaining an acceptable ratio of bang to buck.

The decision tree is exceedingly complex. Very crudely, considering 1) a potential thread's internal locality, 2) the bandwidth to the thread's data structures, 3) the network distance, and 4) whether there is a favorable nonuniformity of data distribution, should work 'xyz' be accomplished by running a marshaled high-state thread here or by migrating a marshaled low-state thread to execute there?

Both threads and processors morph. For example, a lightweight thread arriving at a rich clump of data puts on weight, i.e., has its further migration inhibited, until it has consumed the clump---at which point it promptly regains its former weight. Or, a multithreaded processor might temporarily shrink the size of its execution contexts, guaranteeing threads with very low thread state, which synchronize _most_ efficiently.

In our heterogeneous system, we imagine separate per-processor virtual work queues of high-state and low-state threads; some of this work will be done locally and some will be offloaded to run elsewhere. The tough decision is: do I perform a given task by running this thread here or by migrating that thread there? Some of the time, both choices will be equally good.

Now, dream deeply. Imagine a daemon with a rich supply of nonpreemptive high-state threads with only initial dependences, which the daemon keeps in a blocked queue. Imagine that the daemon also has a rich supply of preemptive low-state threads that satisfy these dependences, and that he can easily migrate these low-state threads to system regions where _each_ thread is physically colocated near the center of mass of a reasonably compact set of thread-relevant data. My goodness! We have turned thread migration into system-level latency tolerance.

By offloading work onto the LWP processor subsystem, the daemon directs the HWP processor to issue a steady stream of _dependence requests_. These requests may circulate transitively within the LWP processor subsystem for some length of time. However, the LWP processor subsystem eventually returns a steady stream of _dependence satisfiers_ to this HWP processor.
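Very crudely, the daemon's bang-to-buck comparison might be sketched as follows; 's', 'b', and 'd' are as defined above, and every number and threshold is made up purely for illustration.

    def bang_per_buck(s, b, d):
        # 's' flops/cycle of sustained performance bought with (b * d)
        # words*links/cycle of consumed global system bandwidth.
        consumed = b * d
        return s / consumed if consumed else float("inf")

    def decide(run_here, migrate_there):
        # Each argument is a dict of estimated (s, b, d) for the candidate thread.
        here = bang_per_buck(**run_here)
        there = bang_per_buck(**migrate_there)
        return ("run the high-state thread here" if here >= there
                else "migrate the low-state thread there")

    # Illustrative only: a high-locality thread run at the local HWP
    # (s=2.0, b=0.1, d=10 gives a ratio of 2.0) beats migrating a low-state
    # thread close to its data (s=0.5, b=0.2, d=2 gives 1.25):
    print(decide({"s": 2.0, "b": 0.1, "d": 10}, {"s": 0.5, "b": 0.2, "d": 2}))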
As a result of this steady return stream, the HWP's work queue is constantly stocked with ready high-state threads, and hence the HWP is always constructively occupied. In a word, we have moved from tolerating memory latency to tolerating work latency, and we have done so in a fashion that minimizes the consumption of global system bandwidth. We have also augmented memory pipelining with work pipelining.

--- Conclusion

In and of itself, processor-based latency tolerance consumes precious, limited global system bandwidth. A locality-aware heterogeneous system mitigates foolishly extravagant utilization of global bandwidth by optimizing bandwidth usage to extract the maximum possible performance from each unit of consumed bandwidth. This requires quite sophisticated system software to schedule heterogeneous threads onto heterogeneous execution resources. Heterogeneous systems are required both to support different styles of computation and to make better use of critical resources.

The significance of bidirectional offloading of work between two distinct types of processor subsystems is that we thereby allow performance scaling beyond the limits of processor-based latency tolerance. The system-level (work) streams of dependence requests and dependence satisfiers bring the latency that must be tolerated down to the point where it can be handled by the processor's parallelism (and possibly its D-cache), because some of the work has already been accomplished or for other reasons. But moving from memory pipelining to work pipelining changes everything.

There are performance considerations. To migrate a thread, we send a full continuation no larger than a network packet. To receive a dependence satisfier, we should receive something of roughly the same size. If either 1) you do not have bite-sized (work) traffic in both directions, or 2) your network does not have the bisection bandwidth required to handle both the "dependence" traffic _and_ the "operand" traffic that is required to feed your latency-tolerant (i.e., vector and multithreaded) processors, then the whole system falls apart, or rather limps along when it should be charging.

Caveat: Making the receipt of a dependence satisfier the same thing as transporting a large amount of data up close and personal to an HWP has not enjoyed unalloyed success in previous designs, from HTMT forward. If only for system balance, processor-based latency tolerance and system-level latency tolerance should meet each other half way. If the cycle time is exceedingly small, this is an open problem.

HPCS must accommodate different computation styles because of the need to compute informatics graphs. It must accommodate performance scaling for reasons too numerous to mention. What is HPCS' proper focus? If we agree that thread migrations are bite-sized dependence requests, and that dependence satisfiers are bite-sized responses to thread migrations, then HPCS should _demand_ 1) global system bandwidth sufficient to carry both dependence and operand traffic, and 2) sophisticated system software that extracts appropriate heterogeneous threads from difficult applications and dynamically schedules them onto heterogeneous execution resources in order to use limited global system bandwidth well. This is the _key problem_, for Pete's sake! Stop talking only about flops and megawatts per dollar.

-----

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries.
Replies are welcome and may be sent to HPCwire editor Michael Feldman at [email protected]. 