Exascale: Are we ready for the next generation of supercomputers?

Cristin Merritt (00:02):
Why does everybody keep building these big, huge telescopes to look out into space? Because you're getting more and more and more detail. And to move up to that kind of scale, you need more specialization, you need better hardware, you need better software, you need better people working on these types of skills. So the bigger the machine gets, the more complicated it becomes, but the output is so much more detailed that you get more benefits long term. That's probably the best way to think of exascale compared to a standard HPC machine.

Aubrey Lovell (00:33):
Within the world of supercomputers, numbers matter. Megaflops, gigaflops, petaflops: we're in a seemingly endless push for more power and more capability as demand grows for more accurate simulations, better predictive power for things like weather forecasting, drug discovery, and ever more precise manufacturing techniques. But within all of this, there is a class of supercomputer which, up until recently, was only theoretical. A supercomputer of supercomputers: an exascale computer. With a huge amount of collaboration between HPE and various other organizations, that theory is now a reality.

Michael Bird (01:11):
And that's what we're looking at this week. The rise of the exascale computer, why it matters and is it worth the cost? You are listening to Technology Untangled, a show which looks at the rapid evolution of technology and unravels the way it's changing our world. We're your hosts, Michael Bird.

Aubrey Lovell (01:41):
And Aubrey Lovell.

Michael Bird (01:47):
The world of supercomputers. It's a funny old place. For a start, there's no real definition of what a supercomputer actually is, and what defines one is a constantly moving goalpost. The first computer fast enough to be called a supercomputer was the CDC 6600, designed by Seymour Cray and built in 1964. It weighed six tons, took 30 kilowatts of power and had a processing power of three megaflops, or three million floating point operations per second. You'll hear a lot about flops in this episode. That's the measure of computing performance used when talking about supercomputers.

(02:30):
It stands for floating point operations per second. Now, to put that in context, a modern smartphone has something like five to 15 teraflops of performance, and they're designed to be as energy efficient as possible. We're talking 10 or 15 watts of power.
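
To put those prefixes side by side, here's a minimal back-of-the-envelope sketch in Python. The figures are the rough ones quoted above (3 megaflops for the CDC 6600, roughly 10 teraflops for a modern phone), so treat the ratios as order-of-magnitude illustrations, not measurements.

```python
# Rough flops figures quoted in this episode (orders of magnitude only).
CDC_6600_FLOPS = 3e6       # ~3 megaflops, 1964
SMARTPHONE_FLOPS = 10e12   # ~5-15 teraflops for a modern phone; 10 used as a midpoint
EXASCALE_FLOPS = 1e18      # 10^18 flops, the exascale threshold

print(f"Exascale vs CDC 6600:   {EXASCALE_FLOPS / CDC_6600_FLOPS:.1e}x")
print(f"Exascale vs smartphone: {EXASCALE_FLOPS / SMARTPHONE_FLOPS:.1e}x")
```

Run as written, that prints roughly 3.3e+11 and 1.0e+05: an exascale machine is on the order of a few hundred billion times the 1964 machine and about a hundred thousand times a phone.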

Aubrey Lovell (02:46):
But it was a lot more capable than anything else in the world at the time. Hence, the phrase supercomputer, and that's still the definition that sticks. A number that's occasionally thrown around is that the best consumer computers have the processing power of a 30-year-old supercomputer.

Michael Bird (03:02):
In essence, an exascale computer can do a quintillion, I promise that is a real number, mathematical calculations a second.

Aubrey Lovell (03:10):
But what exactly is exascale?

Mike Woodacre (03:13):
So my name's Mike Woodacre. I'm the Chief Technology Officer for the HPC and AI business unit at Hewlett Packard Enterprise. So exascale in the purest sense is 10 to the 18 operations per second. I think people are probably familiar with kilobytes of information and megabytes. And then as we move up the scale, you get into gigabytes and terabytes and petabytes and ultimately exabytes. And we did achieve the first publicly recorded exaflop result last May with the Frontier supercomputer at Oak Ridge National Lab in the US. It took technology from many companies, as well as investment from the Department of Energy in the US, to bring that capability together and let us break through that barrier.

(04:08):
So that was a tremendous breakthrough for the industry as a whole. It used to be that supercomputers would kind of sit in the corner of a lab somewhere. People would fire off a program on one and it might take months to come back with an answer. And then with an exascale system, we're now able to ask many questions at once. We can explore design spaces, whether it's how we're going to build fusion reactors to help us with energy and the drive to net zero, that sort of stuff. You can just get results much quicker in many different dimensions. Suddenly I can make much faster progress.

Michael Bird (04:46):
So Aubrey, Mike mentioned Frontier, which came online at Oak Ridge National Laboratory in 2022 and is 2.5 times faster than the second-fastest ranked computer in the world. That is quite incredible, isn't it? I mean, it's taken so long to get there, but I guess exascale is a ridiculous number. It's what, 10 to the 18, I think you said. Until recently, Doug Kothe was the director of the Exascale Computing Project at Oak Ridge.

Aubrey Lovell (05:19):
He's hugely excited for the opportunities of exascale too, particularly in its ability to use its vast powers of data processing and analysis to create ever more accurate models of the universe, from galactic down to microscopic scales.

Doug Kothe (05:34):
What also comes with these computers is a tremendous amount of memory, to be able to simulate very complex phenomena in our world. And so the more powerful the computer you have, the more complicated and complex the phenomena you can simulate. A good example is: let's simulate the entire Earth system. Let's simulate climate for the entire Earth. What happens on the land? What happens in the atmosphere? What happens in the oceans? What happens with ice? To simulate the Earth system, you're talking about a tremendously complicated system that certainly your laptop can't compute.

(06:17):
So an exascale computer now opens up a wide range of opportunities to be able to devise models for how the Earth behaves. And now you're able to actually get reasonably accurate, high-quality, high-confidence solutions because the computer is so powerful. You don't have to make assumptions that compromise the accuracy of the model. So certainly in the science area, there are all kinds of compelling cases to make for an exascale system being a game changer and being able to elucidate breakthroughs in science and engineering that we can only imagine. Frankly, with Frontier, I'm going to be very surprised if we don't see Nobel Prize work come off of that system.

Aubrey Lovell (07:09):
It's so interesting to see the progression of technology and the solutions that we have around it, right? Storage, compute, all of that. I mean, now look how fast we're moving and how much greater capacity these supercomputers have, just a few years apart from each other, right? So it's pretty cool to see that.

Michael Bird (07:27):
So do you remember when the big exascale computer announcement happened at HPE? There was a bit of a fanfare.

Aubrey Lovell (07:33):
Absolutely. I think it's really become kind of a flagship for our company. There was a lot of buzz; a lot of people were talking about it, its value and its purpose. It was pretty cool to hear.

Michael Bird (07:44):
But Frontier took over three years to build. Why?

Aubrey Lovell (07:54):
Here's Mike Woodacre.

Mike Woodacre (07:56):
When you're doing something for the first time, inevitably you're pushing the boundaries. And Frontier has literally got 60 million components: about 38,000 GPUs, around 9,000 CPUs, and 90 miles of high-performance Slingshot Ethernet interconnecting the system. One of the big bottlenecks is feeding the processing elements with data. And so you want very high bandwidth access to memory, but the challenge is that it's built out of DRAM, and DRAM doesn't like heat, right? It's a reliability issue.

(08:33):
So cooling these chips to make them reliable is very challenging when you're running a problem across the whole machine for potentially days or weeks at a time. If something goes wrong, you could end up restarting that run. So there's a lot of redundancy we increasingly put into the hardware to make sure data's not just delivered, but is actually the data you intended to be delivered. One of the biggest challenges is how you maintain the uptime of that resource to let you run large-scale problems.

(09:07):
And one of the biggest challenges we face today is the power consumption of these systems: how do we continue to improve processing performance without a linear increase in power consumption, which just wouldn't be sustainable for any of the deployments of these systems? They're already in the tens of megawatts. But those are the engineering challenges, and many of us, including myself, love working through them to keep pushing these boundaries.

Michael Bird (09:39):
90 miles of cabling, 38,000 GPUs and 9,000 CPUs. Oh my goodness. That is quite something.

Aubrey Lovell (09:52):
Absolutely. I mean, it's not something you can just buy off the shelf or manufacture right off a factory line. There are a lot of different components and frameworks that go into this. It's not something that's instantaneous. It really puts into perspective the whole picture of how these things come together and how much work goes into it.

Michael Bird (10:10):
I now sort of understand why it's so complicated.

Aubrey Lovell (10:14):
Absolutely. But exascale isn't just about compute. It's no longer enough for a supercomputer to have raw processing power alone. It needs to perform across a whole load of different disciplines. HPE is also working in partnership with Intel on Aurora, an exascale computer at Argonne National Laboratory in Illinois.

Michael Bird (10:32):
Professor Rick Stevens is Argonne's associate laboratory director for Computing, Environment and Life Sciences. He's been working on exascale computing since the early days of the concept, and says that the demands and requirements on supercomputers have evolved significantly over time.

Prof. Rick Stevens (10:51):
Some things are happening that were not happening when we started the journey. So if we go back to 2007: if you search the internet for exascale, the first time it started to appear was 2007, in these workshops where we were working on it. We weren't thinking about AI at that time. AI was deep into an AI winter; no one was able to get funding to work on AI at that point. And we weren't really even thinking about data. We were still almost entirely focused on simulation.

(11:21):
But by the time we got to actually specifically designing the exascale machines, around 2015 or so, trying to tie everything together, it became pretty clear that these machines had to not only do simulations well, but also be able to process large amounts of data, petabytes to exabytes of data. And they also needed to be good at doing AI, even though in 2015, which was only about three years into the modern third wave of AI, it still wasn't clear what that meant.

(11:53):
And I don't want to say it was a complete accident, but it was a partial accident that we ended up in that place. Because the fundamental technology in these machines is GPUs, and the GPU market right now is largely driven by gaming, which is where they came from, but more importantly, looking forward, by their use as engines for AI training. And so I don't want to say it's serendipity, but the fact that we have these building blocks that are good at all these things is super helpful.

(12:25):
So the resulting systems that we have at Argonne, and the sister systems we have at Oak Ridge and the systems that we have at Livermore and so on, are quite well-balanced systems for doing simulation, for doing large-scale data streaming analysis and so on, but also for training large AI models and running inference on them.

Michael Bird (12:47):
So what's the big deal? Why does this all matter? And frankly, what's the wider return on value here? Cristin Merritt is the chief marketing officer at Alces Flight, an HPC solutions provider that works with businesses of all scales to give them access to supercomputing, on site or through resource sharing on the cloud. With years in the industry, she's well placed to talk about the commercial demand for, and value of, exascale.

Cristin Merritt (13:13):
I think the thing is that the commercial need for exascale, if you want to argue it, is actually happening right now, because these machines are actually being built by commercial companies and then the researchers can use them. So what you're going to have is this sort of back and forth. You build the machine, research happens, out of the research come the results, and businesses start forming around all of that research. So if you want to think of the universities and the governments, think of them as the F1 teams, the really big, fancy, beautiful cars. They're zooming around the track and everybody watches them in awe.

(13:48):
You have to think about the fact that the F1 teams are some of the people that have pioneered things such as crash and safety regulations. So the seat belts in our cars, the stuff that filters all the way back down to the people who manufacture the bits and components that improve our day-to-day existence, started in those big institutions. And people are picking off bits and pieces and turning them into their own. So I think we're going to see this wonderful back and forth happening, this pendulum swing of commercial people building, then the researchers improving and improving, and everybody going back and forth until you've got this wonderful balance.

Michael Bird (14:29):
So exascale offers the opportunity for an entire ecosystem to revolve around these computers and the compute power they unlock. But what exactly are they going to be used for? After all, you don't just spend hundreds of millions, if not billions, of dollars just, well, because. For Rick Stevens, it opens up an entire world of possibilities, and of particular interest to him are the medical research opportunities, which he's exploring through a project called CANDLE.

Prof. Rick Stevens (15:03):
CANDLE is really building a software environment; it stands for Cancer Distributed Learning Environment. And the idea is to put together in one software package all the methods that we need to make exascale systems useful to the cancer research community that's using AI methods. Many problems in cancer can be attacked with artificial intelligence methods, with machine learning methods. The challenge is that these exascale machines are not so easy to use for training large-scale models, improving the models, or managing the simulations needed to drive the AI and so forth.

(15:45):
And we focus on three problems to make it have traction in the community. One of them is focused on trying to accelerate and improve the way we do molecular-level simulations for problems in cancer. That's focusing on a problem called the RAS problem: RAS is a particular protein that gets stuck in the on position and keeps telling the cell to make copies of itself, and that's not what you want. But it's a very difficult protein to drug, a very difficult protein to develop a therapy for. And so the idea with CANDLE on that problem was to use very large-scale simulations and AI to understand its behavior so that we could ultimately figure out how to manage cancers that are RAS-based.

(16:30):
Another one is using AI to predict the response of tumors to drugs: how to come up with the best combination, either one drug plus other therapies or combinations of drugs, to treat that tumor. You have basically, I mean not infinite, but something like 10 to the hundredth power of possible combinations, or more. And so I can't do all of them, so I have to somehow be clever in searching through that space. And because there are so many possibilities, this is a case where we want to use AI to build models by learning across data from many, many cancer patients and many, many laboratory experiments, to try to build the best possible predictors.

(17:14):
And that's a large-scale computational problem, to do that and to build models that are really good at it. And the third one is actually using AI to analyze pathology reports from millions of cancer patients. The idea is to build AI that can read all of those, not just the ones that an individual physician might know about, but read all of them, and learn from all of them the patterns that we see across entire populations. And to use that for a couple of reasons. One is to understand the different kinds of trajectories that happen for different cancer types and different patient demographics. And the second is to be able to use that data to ideally match patients to clinical trials, which is typically very difficult.

Michael Bird (18:04):
That level of detail really matters, and it's one of the standout opportunities of exascale. It's also about the ability to get those detailed answers in a fraction of the time taken by previous generations of supercomputers. And that alone is hugely important.

Aubrey Lovell (18:21):
But exascale isn't just about running thousands of simulations and permutations simultaneously to solve one problem or get exceptional levels of detail, a kind of deep compute, if you like. It also has the ability to open up in bulk to thousands of users, each with their own problems to solve. You could call it supercomputing in breadth. Here's Mike.

Mike Woodacre (18:42):
It's not just that peak exaflop capability. Now, effectively with an exascale machine I can run 1,000 petascale jobs, so I can explore spaces that I just couldn't before. And actually, with Frontier, I was recently talking to Doug Kothe and he was saying they're literally just opening it up to their big user base, giving quite a lot of users access to an unbelievable amount of computing resource so they can explore different spaces, in a timeframe that means they can usefully ask the next question.
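
Mike's framing is straight prefix arithmetic: an exaflop is a thousand petaflops, so the same machine can be viewed as a thousand petascale allocations running side by side, or as one job finishing a thousand times sooner. Here's a minimal sketch in Python (the 10^21-operation workload is an invented, purely illustrative number):

```python
EXAFLOPS = 1e18   # operations per second of a full exascale system
PETAFLOPS = 1e15  # operations per second of a petascale-class allocation

# Breadth: how many petascale-sized jobs fit side by side.
print(int(EXAFLOPS / PETAFLOPS))  # 1000

# Depth: an illustrative workload of 1e21 total operations, run flat out.
work = 1e21
print(work / PETAFLOPS / 86400, "days at petascale")   # ~11.6 days
print(work / EXAFLOPS / 3600, "hours at exascale")     # ~0.28 hours
```

Either way you slice it, the same hardware buys you a thousand separate questions at once, or one answer in a thousandth of the time.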

Michael Bird (19:21):
Exascale presents the opportunity to keep research or design moving at an incredible pace, not just by giving more detailed answers, simulations or suggestions, but by doing it in timescales which were previously unheard of. For the first time, we could see parameter changes in complex simulations being modeled in real time, rather than having to wait for a simulation to finish before changing a variable and restarting it.

Aubrey Lovell (19:51):
That's an incredible tool for making research more efficient while keeping everyone involved engaged, innovative and on their toes. Frankly, this is a computer which could change the way we think about scientific research. So we've got one exascale computer online, and another will soon be reliably up and running too. So is that problem solved? Well, no, not quite.

Michael Bird (20:13):
The fact that no one has ever really created an exascale platform or UI before makes exascale decidedly non-standard. And that's a big, bold, shiny red flag for the technology as a commercially viable offer right now, as Cristin explained when I asked her if she'd consider pitching exascale for sale to her clients.

Cristin Merritt (20:36):
Oh goodness, not right now. It's the fact that this is technology that people know about being used in a different way. So why is it non-standard? I want to quote my husband, who works in vaccine research. He says that a lot of people think you just punch out these vaccines and they're always the same, but you have this horrible thing called biology that gets in the way. So you can have bad batches and you can have things go wrong. And what's happened is we've taken another step up. We've made this machine bigger, which means that stuff that works at small scale does not work, or does not push through, when you get to larger numbers.

(21:16):
The machine doesn't function the same way. A prop plane and a jumbo jet are two completely different things. They run off the same principles, but the complexity increases as you go up. So that's what I mean by non-standard. And the workloads and the applications have all been written to a certain point; they process to a certain number of zeros. You go to exascale, you're adding a whole bunch more, and what worked at one level doesn't work at the next level up. Stuff that works for two nodes won't work for 10,000.
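
Cristin doesn't name it, but the textbook way to quantify "works for two nodes, fails for 10,000" is Amdahl's law: whatever fraction of a job stays serial, or is spent coordinating, is invisible at small scale and dominates at large scale. A minimal sketch in Python, with an assumed 1% serial fraction chosen purely for illustration:

```python
# Amdahl's law: speedup on n nodes when a fraction s of the work cannot be parallelised.
def amdahl_speedup(n: int, serial_fraction: float) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

for nodes in (2, 100, 10_000):
    print(f"{nodes:>6} nodes -> {amdahl_speedup(nodes, 0.01):.1f}x speedup")
# 2 nodes looks near-perfect (~2.0x), 100 nodes already loses half the ideal (~50x),
# and 10,000 nodes tops out around 99x, nowhere near 10,000x.
```

That is why code that behaves perfectly on a small cluster has to be rethought, not just rerun, on an exascale machine.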

Michael Bird (21:48):
I think what's really interesting here is that because this is such a pioneering, breakthrough technology, as Cristin is saying, there are just so many elements that don't scale in the way you would think.

Aubrey Lovell (22:00):
But that's kind of to be expected, and solving the software problem is just one of the ongoing challenges for Doug.

Doug Kothe (22:07):
Here you're talking about millions of lines of computer code that have to be either redesigned and refactored relative to pre-exascale, as we call it, or born from scratch. And the software has to be able to essentially will this large computer to do what we want it to do. Think of maybe an orchestra conductor who is conducting each one of these nodes to work together. The computer itself can't think without the software to think for it or to tell it what to do. And so how could one actually design and write high-quality software that can orchestrate such a large system?

(22:49):
And so that's been a tremendous challenge as well, and I think we have, as a community, really met and overcome those challenges.

Michael Bird (23:01):
So what happens next? Well, that's a big question with a whole lot of different answers depending on your perspective. For Cristin, it's all about giving both the technology and the people involved in it time to mature and deepen their collaboration to get the best use out of the technology.

Cristin Merritt (23:19):
When I think about exascale as opposed to a standard install, I think about the fact that it would require more partnerships and people working together. So it's a change in thought in how businesses operate. It's the same thing that cloud native did for us, isn't it? It's the idea that we've come into this new capability and now we're going to spend the next few years figuring out exactly what we want to do with it and how we want to sell it. It's just like when cloud came in and everybody panicked a bit, until they realized that it's another tool, and how do we ingrain that tool into what we do?

(23:53):
So you build a large system like that, and you have someone like Alces come in and manage it. It might actually be that we're in a partnership to manage a system of that size, because of the user types, because of the business that needs to happen on it. It could be a shared system. So there are a lot of ways you could approach this from a business side that would possibly be new or novel, just as much as the technology is. But the fact of the matter is it's on the floor and now people can get onto it, and it's the people who are going to make this go.

Michael Bird (24:22):
For Rick, it's about continuing the journey by using exascale as both an exciting new tool for research, study and innovation, and as a machine capable of planning its own replacement. What a thought.

Prof. Rick Stevens (24:36):
So when we think about next generation machines, we think about, well, what is the bottleneck in the current way I'm doing it? Maybe the bottleneck is the AI part, and in the next generation machine I want to have a lot more capability at AI. Or maybe I need to shift entirely to thinking about this problem in a way I can solve with a quantum computer. Or maybe I need to lean in both of these directions. But of course, all of these are kind of bets, and so I have to think about, well, how do I hedge my bet?

(25:05):
So what exascale will teach us: it allows us to do a lot of stuff right now, but it'll also allow us to think forward about what the next 10 years of advances in hardware and software need to look like to keep making progress. It's not a destination as much as it's just a waypoint, right? It's just a place where we can pause, make everything work, and get the community back up to speed. I mean, they've been running for 10 years trying to catch up with all this.

(25:35):
They get to take a breather, have some Gatorade or whatever, and then we're going to run again, right? We're going to keep going, because the race you're in is really a race against time and ideas; you're really racing against the future. These are tools for making things better. And in order to actually use the tool to make things better, you'd better make the tool better.

Aubrey Lovell (25:58):
For Doug, it's about opening up the world of exascale to the rest of us, making this seemingly monolithic technology accessible to the highest number of users possible.

Doug Kothe (26:08):
With regard to the exascale work we've been doing here at Oak Ridge National Lab and across the [inaudible 00:26:15] complex, we're developing a lot of fundamental cross-cutting technologies that many, many businesses are going to be able to directly use and exploit. I'd like to think that we're developing an app store for the nation, where a given app will be able to simulate Earth systems, simulate how wind farms behave, how a nuclear reactor behaves, or how we can use machine learning to better understand precision medicine for oncology, for cancer.

(26:45):
So in other words, we are developing a number of fundamental technologies, or apps, just like the apps on your phone, that I think businesses and US industry will be able to directly pick up and use for their products and services. I believe we're developing the scientific software stack, if you want to call it that, for the nation. And on top of that stack, killer apps in science, technology and engineering that cover a broad range, from very fundamental chemistry to materials, to energy production and energy transmission, to more fundamental science like the origin of the universe.

(27:27):
So an exascale computer and the software technologies we're developing really are targeting this wide range of things that make our world go around. And it's going to be exciting to see these technologies continue to evolve and be used for the betterment of our society and our quality of life.

Michael Bird (27:45):
And that's something we can all look forward to and benefit from. You've been listening to Technology Untangled. We've been your hosts, Aubrey Lovell and Michael Bird. And a huge thanks to Mike Woodacre, Cristin Merritt, Rick Stevens, and Doug Kothe. You can find more information on today's episode in the show notes. And this is the fifth episode in the fourth series of Technology Untangled.

(28:16):
Next time we are exploring the increasing use of AI in healthcare. Do subscribe on your podcast app of choice so you don't miss out, and check out the last three series.

Aubrey Lovell (28:27):
Today's episode was written and produced by Sam [inaudible 00:28:31], Michael Bird, and myself, Aubrey Lovell. Sound design and editing was by Alex Bennett with production support from Harry Morton, Alison Paisley, Alicia Kempson, Camilla Patel, Alyssa Mitri, and Alex Podmore. Technology Untangled is a Lower Street production for Hewlett Packard Enterprise.
