People to Watch 2016

Thomas Sohmers
Founder and CEO
REX Computing

Thomas Sohmers is the founder and CEO of REX Computing, a high performance computing startup developing a new processor architecture aimed at performance, energy efficiency, and scalability for modern HPC and supercomputing workloads. Having started tinkering with simple electronics kits when he was six, Thomas taught himself basic electrical engineering and computer science skills, culminating in his first creation, the EyePC monocular head-mounted display, in 2009, when he was 13. Shortly afterwards, he began an internship at the Institute for Soldier Nanotechnologies, an Army research lab at the Massachusetts Institute of Technology, where he worked for three years. While he started off working on rapid prototyping of soldier utilities, he found himself more interested in embedded and high performance computing systems.

In 2013, Thomas joined the Thiel "20 Under 20" Fellowship, and later that year started REX Computing.

In his free time, he enjoys augmented and virtual reality systems, general hardware hacking and tinkering, science fiction movies/shows/books/games, and playing guitar.

HPCwire: Hi Thomas. Congratulations on being selected as an HPCwire 2016 Person to Watch. Under your leadership, REX Computing is developing a new chip architecture that is slated to be more powerful and 25 times more energy efficient than anything industry giants are offering today. What do you expect to accomplish in 2016 toward that end?

Thomas Sohmers: The real big thing that’s happening this year is that we’re going to be getting our first silicon back. Most people think that getting silicon is a huge, expensive process that requires hundreds of people and tens to hundreds of millions of dollars, but right now we’re a team of six people with just $2 million, getting our first silicon back later this summer. We’re taping out our first 16-core prototype test chip in May, and should be getting those chips back in the July-August timeframe. This is a test chip, so it’s not a full commercial product, but we will be evaluating it with early customers this year.

HPCwire: I’m sure our readers would love to hear more about what differentiates your design from any existing architectures, and in particular what elements have opened the door for these performance milestones. Could you share a bit more about the technical aspects of the Neo processor?

Sohmers: Sure, I’ll start with the overview. The key thing from a high level is that we have a two-dimensional tiled mesh architecture. The test chip we’re making is 16 cores, so four by four, and our full production chip, which is slated for early-to-mid 2017, will have 256 of these cores.
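For readers new to tiled architectures, here is a minimal sketch of the generic 2D-mesh idea: cores sit on a grid and messages hop between nearest neighbors. REX has not published its network-on-chip details, so the addressing scheme and the XY-routing hop count below are textbook illustrations, not Neo's actual interconnect.

```c
/* Generic 2D mesh addressing (illustrative only; REX's NoC is not public).
 * SIDE would be 16 for the 256-core production chip; the test chip is 4x4. */
#define SIDE 16

/* Linear core ID from grid coordinates. */
int core_id(int x, int y) { return y * SIDE + x; }

/* Hop count between two cores under simple dimension-ordered (XY) routing:
 * communication cost grows with Manhattan distance on the grid. */
int mesh_hops(int x0, int y0, int x1, int y1)
{
    int dx = x1 > x0 ? x1 - x0 : x0 - x1;
    int dy = y1 > y0 ? y1 - y0 : y0 - y1;
    return dx + dy;
}
```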

Each core is a fully capable 64-bit VLIW core with a double-precision, IEEE-compatible floating-point unit, an ALU, and two load/store units. So every single one of our cores is quad-issue, and we’re aiming to have three of the four functional units operating every single core cycle, and in some cases all four. The VLIW concept has scared a lot of people we have talked to, but we have very good reason to believe that we have solved a large part of the VLIW scheduling problem that has kept these sorts of architectures out of most systems. We will be talking more about how we have done this later this year.
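To make the quad-issue idea concrete, here is a minimal sketch of the general VLIW concept. The slot layout and the software-pipelined dot-product schedule in the comments are illustrative assumptions; REX's actual instruction encoding has not been disclosed.

```c
#include <stdint.h>

/* Illustrative only -- not REX's real encoding. A VLIW word carries one
 * slot per functional unit, and the compiler fills the slots statically
 * (inserting NOPs where no useful work fits). In a software-pipelined dot
 * product, a steady-state bundle might look like:
 *   FPU  slot: acc = fma(a_reg, b_reg, acc)
 *   ALU  slot: i = i + 1
 *   LSU0 slot: load a[i+1] into a_reg
 *   LSU1 slot: load b[i+1] into b_reg
 * which keeps three to four of the four units busy every cycle, as
 * described above. */
typedef struct {
    uint32_t fpu_op;   /* double-precision floating-point operation */
    uint32_t alu_op;   /* integer / address arithmetic              */
    uint32_t lsu0_op;  /* first load/store slot                     */
    uint32_t lsu1_op;  /* second load/store slot                    */
} vliw_bundle_t;
```

The trade-off named in the answer is visible here: the hardware never reorders anything, so all of the scheduling intelligence has to live in the compiler.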

Putting this all together, we have our 256-core chip, which will be our first real product: we’re aiming at full IEEE 754-2008 double-precision floating-point operation, 256 gigaFLOPS, all on a power budget of about 5 watts.

The fundamental element that’s saving most of our energy is what we call “scratchpad memory” in each of our cores. Instead of having a traditional cache hierarchy, as you have on virtually every other architecture nowadays, CPU or GPU, with a lot of extra hardware to automatically manage memory movement on the processor, we offload the vast majority of that complexity to our software stack, mostly in our toolchain. On an Intel Haswell processor, for example, a double-precision floating-point operation costs roughly 100 picojoules, while moving those 64 bits from DRAM to the registers takes about 4200 picojoules. Most would think the majority of that 4200 picojoules goes to moving the data from the DRAM chip to the processor (through the PCB), but the reality is that over 60% of that energy is spent by the hardware caching system. Hardware caching like this was created to make programmers’ lives easier, but it was designed long before compiler techniques were anywhere close to being able to handle the job. With our advancements in compile-time analysis and automated memory management techniques, we are able to move this hardware complexity into software. While this adds a small amount of extra time to compilation, we feel it is better to spend even 2x more time compiling a program and get an order of magnitude more efficient runtime. We have a lot of other smaller improvements, but this is the big aspect where we’re saving power while also increasing throughput and performance.

But this is not a new process or memory design; it’s that we’re removing a ton of logic from the TLBs, MMUs, et cetera. In doing so, we’re able to have much larger memories: each core has up to 128 kilobytes of SRAM, which in our full 256-core chip adds up to 32 megabytes of SRAM with 1-cycle access latency from its own core. Compare that to a high-end Intel Xeon chip, where it takes a little over 50 cycles to access the L3 cache, which is only then starting to reach the same amount of SRAM we have. The result is that we’re able to have an order-of-magnitude higher throughput while also greatly reducing power costs.
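As an illustration of what "moving the caching complexity into software" looks like, here is a hedged sketch of a streaming kernel tiled to fit a software-managed scratchpad. The plain C arrays and memcpy calls stand in for whatever explicit data-movement operations the REX toolchain actually emits, which are not public; only the 128 KB per-core capacity figure comes from the interview.

```c
#include <stddef.h>
#include <string.h>

/* Tiles sized so the working set fits in scratchpad:
 * 3 arrays x 4096 doubles x 8 bytes = 96 KB < 128 KB per core. */
#define TILE 4096

static double spm_a[TILE], spm_b[TILE], spm_c[TILE];  /* the "scratchpad" */

/* c[i] = a[i] + s * b[i], processed tile by tile out of local memory. */
void triad(double *c, const double *a, const double *b, double s, size_t n)
{
    for (size_t off = 0; off < n; off += TILE) {
        size_t len = (n - off < TILE) ? (n - off) : TILE;

        /* Copy-in: explicit data movement replaces a hardware cache fill. */
        memcpy(spm_a, a + off, len * sizeof(double));
        memcpy(spm_b, b + off, len * sizeof(double));

        /* Compute entirely out of the 1-cycle local memory. */
        for (size_t i = 0; i < len; i++)
            spm_c[i] = spm_a[i] + s * spm_b[i];

        /* Copy-out: explicit write-back replaces cache eviction. */
        memcpy(c + off, spm_c, len * sizeof(double));
    }
}
```

A real toolchain would also double-buffer the copies against the compute to hide transfer latency; the point here is only that every byte of data movement is explicit and can be scheduled at compile time, which is where the claimed energy savings come from.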

The other big part of how we make that a reality, and the part where people think we’re exceptionally crazy (or a doomed startup like all the rest), is the software aspect. We’ve had far more people than I can count tell us that it is just flat-out impossible.

Our initial funding came through DARPA for the compiler and software efforts, which are principally led by my co-founder Paul Sebexen. (He’s 23, so he’s a bit older than I am.) He’s the real brains behind most of this; Paul previously worked at LBNL, and we’ve been working together for just over two years now. I would love to say more, but out of all of this, this is the real secret sauce.

So unfortunately, while we will be releasing more information in the future, all we can say for now is: stay tuned. We’ll have chips out this summer and hope to be showing all of this off at SC.

HPCwire: You’ve mentioned that machine learning calculations and GPU-oriented jobs are the next target for Neo. Could you elaborate a bit more on your goals there?

Sohmers: One of the things that has been our main focus is our 10-25x energy efficiency advantage (256 gigaFLOPS at about 5 watts, or about 51 gigaFLOPS per watt for double-precision floating-point operations and 102 gigaFLOPS per watt for single-precision). A big part of our overall proposition is that those aren’t pie-in-the-sky theoretical numbers that you will never see in reality. Unlike the GPUs and coprocessors making waves out here with fantastical teraflops of peak performance, we actually have the memory bandwidth, both on chip and off chip, to sustain our theoretical peak. If you compare what the top GPUs get on a benchmark like LINPACK (which is not a good proxy for real-world applications and will get you pretty close to theoretical peak) with what the same GPU gets on a closer-to-reality benchmark like HPCG (High Performance Conjugate Gradient, which stresses memory bandwidth the way most modern applications do), you see only a very small fraction of that peak. Tianhe-2, for example, is ranked on the TOP500 list at 33 petaFLOPS using LINPACK, but when measured with HPCG it gets a mere 0.580 petaFLOPS. Our design has focused on being able to actually achieve our quoted speeds in real applications, unlike most systems since the original Seymour Cray machines.
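One standard way to reason about that gap is the roofline model, in which attainable performance is the lesser of peak compute and memory bandwidth times a kernel's arithmetic intensity. The sketch below uses made-up numbers for a hypothetical accelerator; it is not a model of any specific chip, but it reproduces the LINPACK-versus-HPCG pattern described above.

```c
#include <stdio.h>

/* Roofline estimate (standard model, illustrative numbers only):
 * attainable = min(peak_flops, bandwidth * arithmetic_intensity).
 * LINPACK's dense matrix multiply does many flops per byte moved, so it
 * runs near the compute peak; HPCG's sparse kernels do very few flops per
 * byte, so they are pinned to the memory-bandwidth roof instead. */
double roofline(double peak_gflops, double bw_gbs, double flops_per_byte)
{
    double mem_bound = bw_gbs * flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    double peak = 1000.0;  /* hypothetical 1 TFLOP/s accelerator    */
    double bw   = 200.0;   /* hypothetical 200 GB/s DRAM bandwidth  */

    /* DGEMM-like kernel, ~30 flops/byte: compute-bound, near peak. */
    printf("LINPACK-like: %.0f GFLOP/s\n", roofline(peak, bw, 30.0));
    /* SpMV-like kernel, ~0.1 flops/byte: bandwidth-bound. */
    printf("HPCG-like:    %.0f GFLOP/s\n", roofline(peak, bw, 0.1));
    return 0;
}
```

With these illustrative numbers the sparse kernel achieves about 2% of peak, the same order as Tianhe-2's LINPACK-to-HPCG ratio, which is why provisioning bandwidth to match peak FLOPS matters.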

But this 10-25x figure specifically reflects what we’ve been optimizing for (FFTs), and it’s more about streaming-style applications. That’s been our focus from the beginning. We see this as being a great signal-processing processor to begin with, but we’re not limiting ourselves to that.

So the thing we’re going to start focusing on closer to the end of this year is supporting the existing tools and libraries people are already using with GPUs and other architectures. There’s going to be a performance and efficiency hit for not programming specifically for our chip, but we’re trying to make the porting process as straightforward as possible.

HPCwire: Beyond the incredible work you’re doing with Neo, your own personal story is remarkable. At 13 you began working at a research lab at MIT, and by the 11th grade you had dropped out of high school to join Peter Thiel’s 20 Under 20 Fellowship. But what probably makes you stand out even more to our readers is that you represent young talent in an industry that doesn’t always get traction with emerging computer scientists and engineers. What is your take on this issue and what would be your advice to students who may be interested in HPC?

Sohmers: The first time I went to SC was when I was 16, and I was fairly certain I was the youngest person there. I’ve really loved the entire HPC community because it’s genuinely focused on technical achievement and on getting that edge up. Being at the very top of the entire computing industry is just a really exciting thing, and getting to focus on building the fastest, most energy-efficient machines is one of the best challenges there is. Personally, I wouldn’t want to be in any other part of the computing industry.

As far as getting other young people interested, I think it will be difficult, just because most of the appeal for a lot of young people today is the modern Silicon Valley story: make an app and sell it to Facebook for a billion dollars. That has happened a few times, but the number of people who actually succeed at it is a small fraction of a fraction of a percent. Still, that’s the prevailing mentality at this point.

But my personal goal reflects where my real passion lies: the bleeding edge of performance and the lowest levels of computer architecture, design, and implementation. That is my vision with REX Computing. Even if we’re not successful in a market sense, we want to show that in 2016 it’s possible to start a semiconductor company and, with a very small team and not that much money, have a shot at taking on the big guys.

HPCwire: Final question: What can you share about yourself that you think your colleagues or peers would be surprised to learn?

Sohmers: Being in the HPC community with so many heroes and pioneers, people who frankly are two to three times my age, I’ve surprised a lot of them with my knowledge of historical computer architectures. I have my own personal vintage computer collection, but beyond that, you can ask me about pretty much anything that would be considered a computer, from the Analytical and Difference Engines up to today’s systems. Everything I’m doing today is built on that collective knowledge and a quest to learn as much as I can about any and all systems that have been built. So I shock a lot of people (especially considering my age) by knowing stuff that came 50 years before me.