Heterogeneous Processing Needs Software Revolutions

By the High-End Crusader

June 23, 2006

Many seasoned observers of high-end computing feel it in their bones that heterogeneous processing is rising like the Fundy tide (to the east of Maine and then north), a massive natural phenomenon of gravity and hydrodynamics.

In the United States notably, under the auspices of DARPA's HPCS program, there is Cray's Cascade project and IBM's infuriating "don't ask; don't tell" Percs project.  Admittedly, IBM _has_ made strong representations to DARPA concerning Percs' heterogeneity content.  In Japan, the heterogeneous 10-PFs/s Keisoku Keisanki---the poetry is lost in translation---will be built.  In Japan, the money and political will are there.

So, can we just sit back and wait for good things to happen?  Will this inevitable transition to heterogeneous processing in high-end computing usher in a refreshingly beneficial paradigm shift in parallel computing?  Ha!  What have you been smoking?

Unlike the Fundy tide, which needs only natural law to work its awesome magic, heterogeneous processing---which encompasses all computing on any member of a widely diverse set of heterogeneous system architectures---needs, in each and every nontrivial instance, a revolution in system software.  In the absence of appropriate system software, a nontrivial heterogeneous-system project will just abort.  In Japan, where the proposed hardware architecture for the Keisoku Keisanki is very clean, the Japanese inability to handle the software revolution will---in all likelihood---simply stop the Keisoku Keisanki dead in its tracks.

But the U.S. has recognized expertise in sophisticated system software, programming languages, language processing, runtime systems, program-development tools, models of parallel computation, and all that, does it not?  Perhaps the U.S. has such expertise in principle, but it clearly has not yet worked the issues of developing system software to manage, operate, and exploit next-generation heterogeneous systems.
More generally, in the area of heterogeneous processing for high-end computing, there is a total absence of leadership---from vendors, from academia, from government.  Whether we consider heterogeneous system architectures, hardware technology, language processing, etc., etc., there are no private-sector computer architects or government agencies who either can, or are willing to, assume a leadership role; there are no compelling visions of heterogeneous processing from which to choose.  We are lacking even a simple roadmap for heterogeneous processing.

And DARPA has---or so it seems---bought into the Army's oxymoronic concept of low-footprint counter-insurgency, i.e., it is not pressuring vendors to bite the bullet of revolutionary heterogeneous processing---notably, in the areas of developing sophisticated system software and providing sufficient global system bandwidth.

What might a concerned citizen (say, a high-end crusader) do?  Start at the beginning, start with what we know, and slowly build up the foundations until we see the clear choices in heterogeneous system architectures and the deep challenges in system software that accompany each one.

To anticipate, the key task for heterogeneous-system system software lies in scheduling strategies and other system functions that maximize the performance extracted from scarce system resources, notably the heterogeneous system's limited global system bandwidth.  This is the punchline of this article.

--- Latency Avoidance and Latency Tolerance

In the von Neumann model, processors are separated from memories, from which they fetch operand data to feed their arithmetic functional units.  We say that a high-value processor suffers from _latency disease_ if it idles much of the time waiting for data operands to arrive.  For simplicity, we focus on memory latency (or network/memory latency).
Since arithmetic functional units are of exceedingly low value, they cannot suffer from latency disease; no one gives a hoot about their degree of utilization.  Only _critical_ system resources can suffer from either low _or_ foolishly extravagant utilization, where "critical" means "costs a lot" or "is a primary performance bottleneck" or both.

In the early 90s, when control-flow processors were reasonably critical resources, the memory-latency story was simple.  Latency is avoided by copying data nearby.  Latency is tolerated by doing something else while waiting.  Avoidance scales up if locality increases with size.  Tolerance scales up if parallelism increases with size.  Multiprocessing (a/k/a multiprogramming) is based on the latter premise.

In multiprocessing, a job requests I/O and then blocks, performing a context switch to another job.  There is a dependence because the first job cannot continue until its I/O has completed.  The compute processor offloads work onto the I/O processor.  No deep thinking is required because only the compute processor can handle computation and only the I/O processor can handle I/O, assuming that I/O is DMA.  The compute processor _tolerates_ disk latency in order to maximize total system throughput.  There is no interest in individual job latency, which often increases.  Multiprocessing doesn't scale down because of context-switch cost.

The heterogeneous multicore Cell processor avoids this problem by using both nonpreemptive vector threads and software control of the memory hierarchy.  Think of a vector core's SRAM local store as if it were a very nearby local memory (whose latency is not an issue), think of the Cell processor's DRAM external memory as if it were the vector core's disk, and think of the data movement between the Cell's DRAM and its SRAM as if it were I/O.  Using the "I/O-request (DMA)" instruction in its ISA, the vector compute processor offloads work onto the I/O processor, i.e., the scalar core.
Moreover, this "I/O" is orchestrated so that data from the "disk" is always present in the local store well before it is required by the vector processor, so this processor never needs to wait for data (and never _should_ be made to wait).  Here, a _little_ thinking is required.  After all, the scalar core and the vector cores are _compute_ processors.  A decision must be taken that some work is best performed by offloading it from one processor subsystem onto the other processor subsystem.  In a heterogeneous multicore system, it may be quite important that work be scheduled on a core of appropriate type.

Memory-latency avoidance techniques include processor registers, caches used temporally, and nearby memory.  Memory-latency tolerance techniques include vector pipelining, caches used spatially (i.e., long cache lines), prefetching (a/k/a precommunication), and multithreading.  Every parallel machine contains some mix of these, or similar, techniques.  For example, the MTA uses the following memory-latency techniques: processor registers, nearby memory, prefetching (i.e., explicit-dependence lookahead), and of course multithreading.  Note that the MTA predated any understanding of heterogeneous processing as a key enabler of scalable high-end computing, or the practical desire to scale to sustained petaflops and beyond, for that matter.

In processor-based latency tolerance, the processor supplies a steady stream of memory references that eventually fill the memory pipeline (i.e., outgoing network, memory subsystem, incoming network).  The operand values returned by the memory satisfy dependences on these values in requesting threads.  Little's law prescribes how many memory references must be outstanding in order to sustain, in the face of a given network/memory latency, the desired bandwidth of returned operand values.

But what, abstractly, _is_ processor-based latency tolerance?  (The answer points the way to system-level latency tolerance).
The processor issues a steady stream of _dependence requests_.  The processor receives a steady stream of _dependence satisfiers_.  This stream of dependence satisfiers guarantees that a processor's work queue is constantly stocked with ready threads, and hence that the processor is always constructively occupied.

This abstraction _crashes and burns_ if there is not enough global system bandwidth to transport these streams of dependence requests and dependence satisfiers.  No system software for a heterogeneous system can begin to compensate for a too-significant underprovisioning of the system's most critical resource---its global system bandwidth.  The reader should remember that, in this writer's view, high-end computing properly targets the "difficult" applications, which are both not easily localizable and have other interesting attributes (see "Hard Questions While Waiting For The HPCS Downselect", HPCwire, May 5, 2006).

For quite a few reasons, neither the MTA nor the MTA-2 had a D-cache.  Most of the heavy lifting was done by the processor parallelism, and hence the memory-reference concurrency, generated by the MTA's multithreaded processors.  Although people use the term "latency-tolerant processor", latency tolerance is actually the joint result of processor parallelism and network bandwidth.

In any case, as scaling parallel systems to tens and hundreds of sustained petaflops on nonlocalizable applications was contemplated, it became more and more clear that the combination of processor parallelism and network bandwidth---when used in isolation---simply does not scale to handle the large system diameters in petascale systems.  Obviously, a hybrid approach incorporating an _extended_ mix of latency-tolerance and latency-avoidance techniques is required.
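Little's law, invoked above, is easy to make concrete.  A sketch (the function name and the numbers are illustrative, not drawn from any real machine):

```python
# Little's law for memory pipelines: to sustain a bandwidth of
# B returned operand words per cycle against a round-trip
# network/memory latency of L cycles, a processor must keep
# B * L memory references outstanding at all times.

def outstanding_references(bandwidth_words_per_cycle: float,
                           latency_cycles: float) -> float:
    """Concurrency required to hide a given latency at a given bandwidth."""
    return bandwidth_words_per_cycle * latency_cycles

# Illustrative numbers: sustaining 1 word/cycle over a 500-cycle
# round trip requires 500 references in flight.
print(outstanding_references(1.0, 500))   # -> 500.0
```

Doubling either the latency or the desired bandwidth doubles the concurrency the processor must generate, which is why large system diameters strain processor-based latency tolerance.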
The affordable, industrial-strength solution to the problem of scaling parallel machines to tens and hundreds of sustained petaflops on difficult applications, which are profoundly cluster unsuitable, lies in increasing the system's parallelism, for superior latency tolerance, and increasing the system's locality, for superior latency avoidance.  There are feasible (heterogeneous) and infeasible (homogeneous) ways of attempting to do this.

As heterogeneous processing began to be understood, it became clear that we have to generate parallelism both inside and _outside_ of the processors (at the system level, as it were) and to exploit heterogeneity to create entirely new approaches to generating locality.  The amazing thing is that (both hardware-controlled and software-controlled) intelligent _bidirectional_ offloading of work onto the other of two complementary processor subsystems turns out to be the key to both endeavors, and leads to generalized notions of latency tolerance, latency avoidance, dynamic (thread) scheduling, and load balancing.

--- Scheduling Multithreaded Computations

Scheduling of multithreaded computations on parallel machines is complicated by the dynamic nature of such computations, which is only fair since both hardware and software multithreading have been proposed as general solutions to the problem of exploiting dynamic, unstructured parallelism.  Here, dynamically created threads cooperate in solving the problem at hand.  However, for efficient execution, we must have efficient runtime thread placement and scheduling.  Although general thread placement for optimal processor utilization is NP-hard, schedulers based on simple heuristics have been proposed that work well for a broad class of applications.
Historically, implementations of the thread scheduling and placement task at the very core of the runtime system have had two possibly conflicting goals: 1) maintain related threads on the same processor to minimize communication cost, and 2) migrate threads to other processors as required for dynamic load balancing.  Previous work on runtime scheduling assumed that threads and processors were homogeneous.

There are two main approaches.  In _work sharing_, whenever a processor generates new threads, the scheduler attempts to migrate some of them to (potentially) underutilized processors.  In _work stealing_, only processors that are actually underutilized attempt to "steal" threads from other processors.  Work stealing minimizes the communication cost of thread migration---if processors have enough work to do, no thread migration takes place.

In the multithreaded world, a work-stealing scheduler must also handle "dataflow" computations, in which threads may stall due to a data dependence.  This usually involves dynamic scheduling of the threads present in the processor's work (or ready) queue---when an execution unit finishes a task, it automatically reaches into the work queue to fetch another task.  What makes sense here critically depends on the precise communication cost of thread migration, and on what performance advantages might accrue from running a thread on a processor with special characteristics.

But this can mean many things.  In "Hard Questions" (op. cit.), much was made of threads extracted from either serial code, vector code, or multithreaded code.  Obviously, a given code type mandates execution on a core of matching type (i.e., either serial, vector, or multithreaded).  In this section, we largely abstract from "serial/vector/multithreaded" code heterogeneity---we take this form of heterogeneity for granted, and seek to uncover and exploit deeper forms.
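A minimal sketch may help fix ideas.  Work stealing is commonly realized with per-processor ready deques; the class names and policies below are this sketch's choices (local work popped at one end, stealing from the opposite end of a randomly chosen victim), not any particular runtime's:

```python
import random
from collections import deque

class Processor:
    """One processor's ready deque in a work-stealing scheduler (sketch)."""

    def __init__(self, pid: int):
        self.pid = pid
        self.ready = deque()               # right end = "bottom"

    def spawn(self, thread):
        self.ready.append(thread)          # locally spawned/enabled: bottom

    def next_thread(self, others):
        if self.ready:
            return self.ready.pop()        # local work: pop from the bottom
        victims = [p for p in others if p.ready]
        if victims:
            victim = random.choice(victims)
            return victim.ready.popleft()  # steal from a victim's top
        return None                        # nothing to run anywhere

p0, p1 = Processor(0), Processor(1)
p0.spawn("a"); p0.spawn("b")
print(p0.next_thread([p1]))   # -> b   (its own most recent work)
print(p1.next_thread([p0]))   # -> a   (stolen from p0's other end)
```

Note that migration traffic appears only when p1 actually runs dry, which is exactly the property the article credits to work stealing.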
A multithreaded computation may be represented as a directed acyclic graph (dag) where the vertices are tasks (possibly elementary operations) and the edges are either flow dependences or spawn operations.  During execution, a thread may create, or _spawn_, other threads.  In this representation, the only edges within a given thread are flow dependences (computations arising from individual threads are themselves partial orders).  An edge that crosses from one thread to another is either a spawn operation or a flow dependence between a task in one thread and a task in another thread.

An important special case is where an inter-thread dependence edge (or edges) functions as an _abstract_ spawn operation.  Suppose that only initial tasks of a given thread are flow dependent on tasks in other threads.  Until these dependences are satisfied, the thread is blocked and is not uploaded to the processor's work queue.  Only when all initial dependences of the thread have been satisfied (when the closure becomes full, as it were) does this uploading take place.  Essentially, a "fresh" thread has been created that now competes for the processor's execution units.  We have seen such threads before, in our discussion of the Cell processor.

Threads with only initial dependences are instances of nonpreemptive threads, which---once scheduled on an execution unit---run to completion without stalling or blocking.  Nonpreemptive threads are only justified in general-purpose computing when the thread's state becomes so colossal that context switching is not an option.  These threads lack the expressiveness and flexibility required to exploit dynamic, unstructured parallelism.  More familiar preemptive threads, which may stall or block many times during execution, are often used exclusively; in any case, preemptive threads are mandatory for the "dataflow" portions of a multithreaded computation.
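The "abstract spawn" above amounts to dependence counting, and can be sketched in a few lines (the names are illustrative): a blocked thread carries a count of unsatisfied initial dependences and is uploaded to the ready queue only when that count reaches zero.

```python
class BlockedThread:
    """A thread whose only dependences are initial ones (sketch)."""

    def __init__(self, name: str, n_initial_deps: int):
        self.name = name
        self.pending = n_initial_deps      # unsatisfied initial dependences

class WorkQueue:
    def __init__(self):
        self.ready = []                    # threads competing for execution

    def satisfy(self, thread: BlockedThread):
        """Deliver one dependence satisfier to a blocked thread."""
        thread.pending -= 1
        if thread.pending == 0:            # the closure is now full:
            self.ready.append(thread)      # a "fresh" thread is uploaded

q = WorkQueue()
t = BlockedThread("t0", 2)
q.satisfy(t)                               # one satisfier: still blocked
q.satisfy(t)                               # second satisfier: now ready
print([x.name for x in q.ready])           # -> ['t0']
```

Once uploaded, such a thread can run nonpreemptively to completion, precisely because no further dependences can stall it.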
Heterogeneous processing's potential lies in providing mechanisms that allow us to combine approaches that were once thought to be mutually exclusive.  So, task the system software with mixing and matching (agile) preemptive threads _and_ (cumbersome) nonpreemptive threads.  Let the compiler and the runtime cooperate to mix and match dependence-aware thread scheduling _and_ dependence-oblivious thread scheduling.  In fact, for performance beyond the limits of processor parallelism and for dynamic load balancing, let us attempt to combine work stealing _and_ work sharing.  But we are getting ahead of ourselves.

Multithreaded computations with arbitrary data dependences are often impossible to schedule efficiently.  Dynamic scheduling within and across processors is normally dependence oblivious.  The standard scheduling heuristic organizes each processor work queue as a _ready deque_ (double-ended queue) with a _top_ and a _bottom_.  Threads are inserted on the bottom, and can be removed from either end.  Briefly, an idle processor normally fetches its next thread from the bottom of its ready deque.  Locally spawned or enabled threads are placed on the bottom of the ready deque.  An idle processor confronted with an empty ready deque attempts to steal the topmost thread from the ready deque of a randomly chosen remote processor.  This "local depth-first, remote breadth-first" scheduling heuristic captures only some of the dependence order of threads but is computationally feasible as a scheduling algorithm.

--- System Software for Heterogeneous Parallelism

The grand challenge in this area is designing system software to manage a heterogeneous system architecture with full-scale generation and exploitation of locality.  The software tasks are so revolutionary that very few people even begin to understand them.
On a more mundane level, your correspondent finds it challenging to describe, in a clear and coherent fashion, these software tasks, and the idealized heterogeneous system architecture to which they correspond.  One fact that complicates matters is that there is no uniquely preferred implementation of this idealized system.  So much of the heated debate about competing implementations has been misdirected!  Design teams have suffered major attrition.  Well, there is no point crying over spilt milk.

The idealized architecture does appear to provide the maximal exploitable parallelism _and_ the maximal exploitable locality in the presence of "difficult" applications (remember, this is a term of art).  There can be no doubt that different implementations have a _major_ effect on the granularity with which we can exploit either parallelism or locality, but a coarse-grained implementation of the right architecture is better than any implementation of the wrong architecture.  After the excitements of 2005, your correspondent is resigned to incremental implementation; that is why it is so important to be clear about the idealized heterogeneous system architecture that we would _all_ like to implement.

Given the difficulty of exposition, we pause to discuss system software to manage heterogeneous parallelism before moving on to discuss the more subtle problem of system software to manage heterogeneous locality.

Homogeneous multicore processors, which are now becoming quite standard in industry, create conflicting requirements in core complexity.  Ironically, for this very reason, even "latency-intolerant" commodity processors will soon be forced to bite the bullet of heterogeneous multicore.  Some applications do not parallelize (or at least have not been parallelized yet).  Other applications may have "serial" portions.  In such cases, single-thread performance becomes the relevant figure of merit.
When this is so, the appropriate core is a larger, higher-power core that implements _out-of-order_ execution of individual threads.  But out-of-order cores, with associated major increases in area, power, and design complexity, lead to only modest increases in application performance (leaving aside the effects of technology scaling).

Other applications can be easily decomposed into parallel threads.  When we replace single-thread performance as the figure of merit by system throughput of the set of threads that have been extracted from an application, the appropriate core is a large number (say, ten) of smaller, lower-power cores, each of which implements _in-order_ execution of individual threads.  Given a decomposition of our application into ten threads, the ten in-order cores simply blow the single out-of-order core out of the water.  (The Achilles heel of out-of-order cores is the use of register renaming, which increases the length of the instruction pipeline).

In reality, applications dynamically change their stripes during execution, sometimes behaving as serial code, at other times behaving as threaded code.  Since no single choice for core complexity makes sense, we _should_ design heterogeneous multicore processors with both one large out-of-order core and ten small in-order cores---in this case, almost certainly on the same die.  (Rescuing latency-intolerant commodity processors with heterogeneous multicore is an example, not your correspondent's dream microarchitecture.  But the idea is good).

One possibility is to use the same ISA for both types of core, with only performance differences (the power-hungry cores do run threads faster).  The compiler will extract threads from both the portions of the code it deems serial and the portions of the code it deems threadable.  The compiler will decide on an appropriate core to execute the thread at the moment of its extraction.  More interesting is to have different ISAs in different cores.
This may lead to having multiple versions of codes, where the compiler and runtime working together make the final decision.  The Cell processor uses different ISAs: the scalar core runs scalar code and the vector cores run vector code.  In the applications that have been ported to the Cell so far, scalar versus vector appears to be a fairly trivial decision.  The Cell processor is so cut-and-dried that it may not actually need a software revolution.

Even this simple example provides a preliminary answer to the general question: what software revolutions are required to bring out the potential of heterogeneous processing?  At first glance, it seems that we only need to build tools for program development, design a decent compiler that takes advantage of the possibilities of using threads creatively, build a runtime system that solves the problem of thread scheduling, and, oh yes, agree on a computational model of heterogeneous processing that will allow us to integrate our hardware and software efforts.

The example also points the way to a central theme of the next section.  Pretend that the single out-of-order core forms one processor subsystem and that the ten in-order cores form a distinct processor subsystem.  In this way, we may view the transition from a serial portion of the computation to a threaded portion as one thread running on a heavyweight processor subsystem offloading work onto a lightweight processor subsystem by spawning ten new threads.

More interesting, the decision to offload work is based not only on the fact that the work can be done faster on the other subsystem, but also on the fact that a critical system resource will be more moderately consumed.  Here, the system resource is the cooling capacity, which must offset power dissipation (we are---reasonably---assuming that ten in-order cores can be made to dissipate less power than one monster out-of-order core).
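A toy model makes the figure-of-merit switch concrete.  All numbers here are illustrative assumptions (a 2x single-thread speedup for the out-of-order core, a pool of ten in-order cores), not measurements:

```python
# Toy model: one big out-of-order core runs any single thread
# 'speedup' times faster, but runs threads one after another;
# the pool of small in-order cores runs up to n_cores at once.

def time_on_big_core(n_threads: int, speedup: float = 2.0) -> float:
    return n_threads / speedup            # threads execute serially

def time_on_little_cores(n_threads: int, n_cores: int = 10) -> int:
    return -(-n_threads // n_cores)       # ceiling: waves of n_cores threads

print(time_on_big_core(1), time_on_little_cores(1))    # -> 0.5 1  (big core wins)
print(time_on_big_core(10), time_on_little_cores(10))  # -> 5.0 1  (little cores win)
```

With one thread, single-thread performance decides and the out-of-order core wins; with ten threads, throughput decides and the in-order cores win, even under a generous speedup assumption for the big core.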
Hardware and software control of general-purpose heterogeneous-processing systems is all about scheduling heterogeneous tasks wisely onto distributed, heterogeneous system resources.

Heterogeneous multicore can also mean a microarchitecture in which (some of) the cores are full-custom designs of vector and/or multithreaded latency-tolerant processors.  With serial/vector/multithreaded heterogeneity, the compiler must generate code for all three types of core (execution unit).  Obviously, these cores have very different ISAs.  But some of the decisions (initially) taken by the compiler are decidedly nontrivial.  For example, how do you decide correctly, almost all of the time, whether a given loop should be vectorized or whether it should be multithreaded?  Perhaps there is no single right answer at compile time.  Again, this looks like an area where it would be advantageous to generate multiple versions of the code.  The programmer has some knowledge of his application and should not be ignored, but even the simple question of code generation and core assignment will require closer cooperation between the compiler and the hardware, with the runtime system often making the final call.

Major advances in compilers, runtime systems, development tools, debugging tools, etc., etc., are required just to manage heterogeneous parallelism.  Things get really interesting when we task the system software with integrated joint management of heterogeneous parallelism _and_ heterogeneous locality.  This is the theme of the next section.

--- Heterogeneous Parallelism and Locality

Consider a _virtual_ heterogeneous system architecture (i.e., don't ask if distinct processor types imply the existence of distinct full-custom chip designs or polymorphic processors or whatever).  Make the stipulation that global system bandwidth is the heterogeneous system's most precious, and most critical, resource.
Further stipulate that global system bandwidth is _somewhat_ underprovisioned, by necessity, not by choice.  The goal of heterogeneity is to make the best possible use of this limited global system bandwidth by extracting the maximum possible performance from each unit of consumed bandwidth.  In a heterogeneous system, the hardware provides distinct mechanisms, and the system software makes sophisticated decisions about which mechanisms to use.

Processors, i.e., execution units (indeed, individual cores), are divided into two types: _heavyweight processors_ (HWPs) and _lightweight processors_ (LWPs).  (Do not ask if a given individual core can morph at runtime from being an HWP into being an LWP---these are virtual processors and such questions are inappropriate and/or proprietary).  HWPs have the property that they allow executing threads to accumulate a colossal amount of thread state, which is normally stored in a D-cache and/or a large register set.  In contrast, LWPs have the property that they can accommodate executing threads that do not, or should not, accumulate large amounts of thread state.

Logically, we imagine both an HWP processor subsystem and an LWP processor subsystem.  Physically, both the set of HWPs and the set of LWPs are distributed throughout system memory, at the same or different granularities.  We assume nonuniform-memory-access (NUMA) shared memory.  More importantly, we suppose a memory and bandwidth hierarchy in which processors enjoy more bandwidth to closer ("more local") memory and less bandwidth to farther ("more global") memory.  This memory and bandwidth hierarchy may be quantized, e.g., by providing special high-bandwidth interconnect within well-defined locality regions (locales or places).  Furthermore, we assume that the bulk of the HWPs, and _all_ the LWPs, support processor-based latency tolerance.
This means that, given the need for global communication, both HWPs and LWPs can be _starved_ by insufficient availability of global bandwidth.  (LWPs starve more easily).  Lack of global bandwidth can also starve system-level latency tolerance in a heterogeneous system, simply because it inhibits both thread migration and the return of dependence satisfiers.

The desire for distinct processor types is in reality the desire for distinct thread types: high-state threads (a/k/a heavyweight or immobile threads) and low-state threads (a/k/a lightweight or mobile threads).  Your correspondent also uses "cumbersome" and "agile" to make this distinction.  Heterogeneous systems exploit these two distinct thread types to make more productive use of limited global system bandwidth.  System software decides which type of thread to run and where to run it.  So, what is the $64,000 question that system software needs to answer?

Recall that a processor that sustains an average network bandwidth of 'b' words/cycle over an average network distance of 'd' links consumes system bandwidth at the rate of (b * d) network words/cycle.  This must be matched against some sustained performance, say, 's' flops/cycle.  For efficient execution, we need to maximize the ratio of bang to buck, i.e., the value of 's' divided by (b * d).

Consider a hypothetical daemon who combines the powers of an incremental compiler, a sophisticated runtime system, and perhaps even has access to hardware instrumentation that gives him tips about network bandwidth and network distance.  We will place one daemon at each HWP, leaving unsaid whether there is also a daemon at each LWP.  This daemon is constantly asking himself: At this point in my (local) computation, do I have access to a ready thread with high internal locality (a/k/a temporal locality; see "Hard Questions", op. cit.)?
If so, does the arithmetic intensity (the arithmetic per delivered operand) justify the still significant consumption of global bandwidth?  What is the ratio of bang to buck?  In other words, should I schedule this high-state thread to execute here, at this (heavyweight) processor?

In contrast, do I have access to a ready (spawnable) thread with low internal locality?  Is there a (programmed) concentration of thread-relevant data somewhere in the system to where I could migrate this thread?  At the remote (lightweight) processor, it might not enjoy much arithmetic intensity, but the average network distance would be considerably reduced, thus maintaining an acceptable ratio of bang to buck.

The decision tree is exceedingly complex.  Very crudely, considering 1) a potential thread's internal locality, 2) the bandwidth to the thread's data structures, 3) the network distance, and 4) whether there is a favorable nonuniformity of data distribution, should work 'xyz' be accomplished by running a marshaled high-state thread here or migrating a marshaled low-state thread to execute there?

Both threads and processors morph.  For example, a lightweight thread arriving at a rich clump of data puts on weight, i.e., has its further migration inhibited, until it has consumed the clump---at which point it promptly regains its former weight.  Or, a multithreaded processor might temporarily shrink the size of its execution contexts, guaranteeing threads with very low thread state, which synchronize _most_ efficiently.

In our heterogeneous system, we imagine separate per-processor virtual work queues of high-state and low-state threads; some of this work will be done locally and some will be offloaded to run elsewhere.  The tough decision is: do I perform a given task by running this thread here or migrating that thread there?  Some of the time, both choices will be equally good.  Now, dream deeply.
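The daemon's comparison can be put in toy numbers.  The formula s / (b * d) is the article's ratio of bang to buck; the specific values below are purely illustrative assumptions:

```python
def bang_per_buck(s_flops: float, b_words: float, d_links: float) -> float:
    """Sustained flops/cycle per unit of consumed global bandwidth (b * d)."""
    return s_flops / (b_words * d_links)

# Run a high-state thread here: good arithmetic intensity, but
# operands travel a long average network distance.
run_here = bang_per_buck(s_flops=4.0, b_words=1.0, d_links=8.0)

# Migrate a low-state thread to its data: less arithmetic
# intensity, but the average network distance collapses.
migrate = bang_per_buck(s_flops=1.0, b_words=1.0, d_links=1.0)

print(run_here, migrate)   # -> 0.5 1.0  (migration wins on these numbers)
```

On other numbers the comparison flips, which is why the decision must be made dynamically, per thread, by the runtime rather than fixed at compile time.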
Imagine a daemon with a rich supply of nonpreemptive high-state threads with only initial dependences, which the daemon keeps in a blocked queue.  Imagine that the daemon also has a rich supply of preemptive low-state threads that satisfy these dependences, and that he can easily migrate these low-state threads to system regions where _each_ thread is physically colocated near the center of mass of a reasonably compact set of thread-relevant data.

My goodness!  We have turned thread migration into system-level latency tolerance.  By offloading work onto the LWP processor subsystem, the daemon directs the HWP processor to issue a steady stream of _dependence requests_.  These requests may circulate transitively within the LWP processor subsystem for some length of time.  However, the LWP processor subsystem eventually returns a steady stream of _dependence satisfiers_ to this HWP processor.  As a result, the HWP's work queue is constantly stocked with ready high-state threads, and hence the HWP is always constructively occupied.

In a word, we have moved from tolerating memory latency to tolerating work latency, and we have done so in a fashion that minimizes the consumption of global system bandwidth.  We have also augmented memory pipelining with work pipelining.

--- Conclusion

In and of itself, processor-based latency tolerance consumes precious, limited global system bandwidth.  A locality-aware heterogeneous system mitigates foolishly extravagant utilization of global bandwidth by optimizing bandwidth usage to extract the maximum possible performance from each unit of consumed bandwidth.  This requires quite sophisticated system software to schedule heterogeneous threads onto heterogeneous execution resources.  Heterogeneous systems are required both to support different styles of computation and to make better use of critical resources.
The significance of bidirectional offloading of work between two distinct types of processor subsystems is that we thereby allow performance scaling beyond the limits of processor-based latency tolerance.  The system-level (work) streams of dependence requests and dependence satisfiers bring the latency that must be tolerated down to the point where it can be handled by the processor's parallelism (and possibly its D-cache), because some of the work has already been accomplished or for other reasons.

But moving from memory pipelining to work pipelining changes everything.  There are performance considerations.  To migrate a thread, we send a full continuation no larger than a network packet.  To receive a dependence satisfier, we should receive something of roughly the same size.  If either 1) you do not have bite-sized (work) traffic in both directions, or 2) your network does not have the bisection bandwidth required to handle both the "dependence" traffic _and_ the "operand" traffic that is required to feed your latency-tolerant (i.e., vector and multithreaded) processors, then the whole system falls apart, or rather limps along when it should be charging.  Caveat: Making receiving a dependence satisfier into the same thing as transporting a large amount of data up close and personal to an HWP has not enjoyed unalloyed success in previous designs, from HTMT forward.  If only for system balance, processor-based latency tolerance and system-level latency tolerance should meet each other half way.  If the cycle time is exceedingly small, this is an open problem.  HPCS must accommodate different computation styles because of the need to compute informatics graphs.  It must accommodate performance scaling for reasons too numerous to mention.  What is HPCS' proper focus?
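The bisection-bandwidth condition above is just arithmetic, and a back-of-the-envelope check makes the balance requirement concrete.  Every number in the following sketch is invented for illustration; the only claim carried over from the text is the structure of the inequality: dependence traffic (two packet-sized messages per migration) plus operand traffic must fit within the network's bisection bandwidth, or the system limps.

```python
# Back-of-the-envelope balance check for work pipelining.
# All numeric values are invented assumptions for illustration.

packet_bytes = 128             # a full continuation fits in one network packet
migrations_per_sec = 5e8       # system-wide rate of thread migrations
operand_bytes_per_sec = 4e11   # operand traffic feeding the latency-tolerant
                               # (vector and multithreaded) processors

# Each migration implies bite-sized traffic in both directions:
# one dependence request out, one dependence satisfier back.
dependence_traffic = 2 * packet_bytes * migrations_per_sec

total_demand = dependence_traffic + operand_bytes_per_sec

bisection_bw = 6e11  # bytes/s deliverable across the network bisection

print("balanced" if total_demand <= bisection_bw else "limping")
```

With these made-up numbers the dependence traffic is a modest fraction of the operand traffic, which is the point of keeping both requests and satisfiers packet-sized: the moment satisfiers start hauling bulk data to the HWP, the inequality flips.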
If we agree that thread migrations are bite-sized dependence requests, and that dependence satisfiers are bite-sized responses to thread migrations, then HPCS should _demand_ 1) global system bandwidth sufficient to carry both dependence and operand traffic, and 2) sophisticated system software that extracts appropriate heterogeneous threads from difficult applications and dynamically schedules them onto heterogeneous execution resources in order to use limited global system bandwidth well.  This is the _key problem_, for Pete's sake!  Stop talking only about flops and megawatts per dollar.

-----

The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous.  He alone bears responsibility for these commentaries.  Replies are welcome and may be sent to HPCwire editor Michael Feldman at [email protected]