Semiconductor Engineering sat down to discuss power optimization with Oliver King, CTO at Moortec; João Geada, chief technologist at Ansys; Dino Toffolon, senior vice president of engineering at Synopsys; Bryan Bowyer, director of engineering at Mentor, a Siemens Business; Kiran Burli, senior director of marketing for Arm’s Physical Design Group; Kam Kittrell, senior product management group director for Cadence’s Digital & Signoff Group; Saman Sadr, vice president of product marketing for IP cores at Rambus; and Amin Shokrollahi, CEO of Kandou. What follows are excerpts of that discussion. To view part one of this discussion, click here. Part two is here.
SE: Where does security fit into the power and performance equation? There is definitely overhead in terms of resources, right?
Sadr: Yes, it’s a tradeoff between risk and cost, and you can think of that cost as financial or as power. Are you willing to risk the security, which is going to cost much more than incremental power? So it is definitely a tradeoff, but anytime that tradeoff is being discussed, the security risk overpowers the concerns about power. This doesn’t change the concerns about power and area, but if you look at any complex SoC today you will see there is a designated area for encoding related to security. There are a lot of architectures, even with FPGAs, where you have a security processor that has to be separated from the main processor, and there is definitely an area impact. There’s always pressure to reduce that impact, but the risk factors usually override those concerns.
Kittrell: There’s no doubt this has to be built into the modern systems, from the cloud to the edge to your cell phone, because so much valuable data is on everything, and it’s being transmitted digitally now. It’s part of the cost, and people are coming to it in fits and starts. This also is the impetus for some of the architectural changes, because they may have had a previous generation solution that didn’t handle security so well. Maybe the processor didn’t have a security protocol, and now they’ve got to upgrade to something else. They’re looking at how to change their architectures, even though there is a longstanding line of processors — especially for microcontroller type designs — that they would stay attached to. Security is definitely a driving force in all of these designs.
Geada: You do pay a penalty for security, but usually the security on most reasonable size designs is a small part of the system overhead. There is a penalty to make it absolutely secure from both logical timing and side channel attacks, but I don’t see it as a tradeoff. It’s something you have to have on a modern system. You need to design for it. And the actual penalty when you look at the full system perspective is small enough that it’s a fool’s errand to ignore it. There are a lot more side channels today. Information can leak through inductive means, through thermal and power signatures, through voltage monitoring, and through timing attacks.
SE: Can you build systems that will be secure over time?
Shokrollahi: One of the reasons people like the chiplet approach is that they want to create a chip today that deals with today’s problems, and maybe later replace that with another chip that solves tomorrow’s problems. In terms of power, we are in need of a type of software that goes well beyond EDA. So EDA helps us design the chips. But if you’re looking at a larger system, we need software that also models the system and the way these chips are interconnected and work together to solve specific problems. We are a long way from the rollout of chiplets, and one of the reasons is that software like this is lacking. If someone wants to produce an MCM, they don’t know how these things will work together. If everything is on the same chip, there is an EDA tool to do that. But in a package this doesn’t exist. There’s also a question of who is going to supply those chips.
Burli: This crosses into reliability, which has become a very big deal. If the temperature goes up, resistance goes up. And if resistance goes up, then EM (electromigration) starts becoming a problem. Then you start putting margin everywhere and widening metals. So that’s why, when you start thinking about reliability, aging, EM, and all of those kinds of things, you need to start thinking about how can you keep the thermal budget low. Temperature is your biggest enemy. You need a lot of sensors to make sure you can monitor the system really well and that you can do something dynamically, where either you’re changing the clock frequency as you go — or you can put in circuits that give you some boost in terms of voltage, so that you can minimize the IR drop and stuff like that. Sensing is going to be quite critical moving forward.
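The sense-and-respond loop Burli describes — monitor temperature, then throttle the clock dynamically to keep the thermal budget low — can be sketched in a few lines. This is a minimal illustration, not a real power-management implementation; the sensor readings and the frequency table are hypothetical.

```python
# Minimal sketch of a thermally driven frequency-scaling policy.
# The temperature readings and the frequency/limit table are hypothetical;
# a real SoC would get these from on-die sensors and PMU firmware.

def choose_frequency(temps_c, freq_table):
    """Pick the highest clock whose thermal limit the hottest sensor respects.

    temps_c:    temperature readings from the on-die sensors, in Celsius
    freq_table: list of (temp_limit_c, freq_mhz) pairs, sorted by
                ascending temperature limit
    """
    hottest = max(temps_c)
    for temp_limit, freq in freq_table:
        if hottest <= temp_limit:
            return freq
    return freq_table[-1][1]  # over every limit: fall back to the lowest clock

# Example policy: full speed below 70°C, then throttle in steps as heat rises.
FREQ_TABLE = [(70, 3000), (85, 2200), (95, 1400), (999, 800)]
```

In practice the loop would run continuously in firmware, and could also raise the voltage briefly to ride through IR-drop events, as Burli notes.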
SE: As we start getting down into the most advanced nodes and advanced packages, we start running into all sorts of effects that we didn’t have to deal with in the past. Things like electromagnetic interference and power-related noise sat on everybody’s back burner for a long time; nobody ever really worried about them. How do we deal with these issues?
Geada: One approach is to use digital twins of physical entities. We are able to model large systems with interacting components, where you model some of the little chiplets only to the necessary level of detail, while others you may model all the way down. Some you just leave as abstractions. We have the capability of simulating various interactions, such as what happens when my cores activate and change from one pattern to another. What happens when my SerDes does this while the memory is doing that? All of these things are actually fairly straightforward to simulate.
Toffolon: EM, electromagnetic coupling and aging require a delicate balance. The key is having a solid design methodology, because optimizing for any one of those individually is essentially a dead end. If you try to size your design to deal with the worst EM limits across the board, there’s no way you’re going to hit your power budget. Or if you try to space out your design to reduce the electromagnetic coupling effects, you’re going to end up having larger clock routing between blocks, and that’s going to blow your power budget. This is why it’s important to understand how to balance mission profiles so that you’re not over-designing, and how to add redundancy in circuits to potentially swap in different paths if a device ages to the point where it could break down. In general, you need to design circuits with aging in mind. And I mean designing for aging, not just simulating for aging. For example, ensure all your clocks stop so they apply common stress on circuits. Those are fundamental things that you need to design into your methodology to have any shot at optimizing all those parameters.
SE: Some of this stuff is in motion, too, right? So it is being used for a particular use case in a particular way or for a particular application.
Toffolon: Right, and that’s where firmware and software-based control over the PHYs really comes into play. For example, a lot of these wide, parallel die-to-die links are actively monitoring the aging and doing constant loopback testing on the links to check for functionality. In a lot of cases they also add redundancy, so they’ll add in a redundant lane, for example. That provides some provision for recovering from a fatal failure, and in some cases even for mechanical failures in the packaging. You can do in situ loopback testing, detect the failure, and swap in a redundant lane.
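The repair flow Toffolon outlines — run loopback tests on the active lanes, then remap a failing lane onto a spare — can be sketched as follows. This is a hedged illustration of the control logic only; the `loopback_ok` test is a hypothetical stand-in for the PHY firmware’s in-situ loopback check.

```python
# Sketch of redundant-lane repair for a wide, parallel die-to-die link.
# loopback_ok is a hypothetical callable standing in for the PHY's
# in-situ loopback test; real links do this under firmware control.

def repair_lanes(active_lanes, spare_lanes, loopback_ok):
    """Return the lane map after swapping failed lanes for spares.

    active_lanes: lane ids currently carrying data
    spare_lanes:  redundant lane ids, consumed in order
    loopback_ok:  callable(lane_id) -> bool, True if the lane passes loopback
    """
    spares = list(spare_lanes)
    lane_map = []
    for lane in active_lanes:
        if loopback_ok(lane):
            lane_map.append(lane)          # lane is healthy, keep it
        elif spares:
            lane_map.append(spares.pop(0))  # swap in a redundant lane
        else:
            raise RuntimeError(f"lane {lane} failed and no spares remain")
    return lane_map
```

Run periodically, this lets the link recover not only from aging-induced degradation but, as noted above, even from some mechanical failures in the packaging.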
Bowyer: The key is making sure that the person who can fix the problem knows about it. There’s a lot of data to sort through at the end of this process. In an extreme case, imagine you’ve got an architect working in MATLAB who makes a decision that’s going to mess up your power budget or cause communication problems because they’re completely unaware of these problems. Either the tools need better integration and better ability to handle these things automatically, or there has to be some improvement in feeding this data back into the system so either people or the tools can fix them. It feels like with every new generation, or every process node, you’ve got 10 more things to worry about that nobody knew about or has ever dealt with before.
Geada: This is one of the challenges. When you design a chip there are terabytes of data that get generated, and very little of it actually gets analyzed in detail. People look at the top thousands of paths, or they look at the hotspots here and there, but very few systems are capable of looking at the entire design holistically and doing large-scale analytics over it to try to figure out patterns. Most EDA tools don’t make it easy for you to run inferencing and machine learning. ‘Maybe this is a variant of a design that I’ve done three times. What are the problems I’ve run into before that are starting to show symptoms here?’ You really need a platform that makes it easy to do large-scale design analytics for things that haven’t already been pre-cooked into the software and which can be done on the customer side. You need customer-side large-scale analytics that can look at these terabytes of data and give you meaningful information that you can feed to an architect. There’s no way on Earth that EDA guys, sitting at their desks somewhere, are going to be able to foresee all of these problems. This is stuff that has to happen on the customer side. And you need to have tools and capabilities to do large-scale analytics and give your designers, your architects, some actionable items they can actually deal with.
King: We’re talking a lot here about chip design, and how you push all the margins to get the best out of a piece of silicon, which is very expensive. You want to get all the margins down to nothing, or close to it. The chip then goes out into the field, and by and large, with a few exceptions, it’s no longer really part of the same system. The people who design it don’t know in 1 year, 2 years, 5 years, or 10 years down the road, how that chip really performs. That’s very rarely folded back into the system. And then on the flip side, when you’re manufacturing chips, all you really know when you get a die is that it passed the wafer acceptance test. So it’s a good wafer. But how good? Is it a really good wafer, or not so good? Some of that is analyzed, and some people during production test will bin parts. But how much of the margin is there that you’ve left on the table, even at the point where the chip goes out the door from the fab? And then, when it goes into the field, what do you do from there? Some of the developments that we’re going to start to see are about being able to make use of that. You can’t necessarily do it today through designing differently, because you don’t know that data yet. You have to have chips that have been in the field for a decade to see how well you did with your original objectives. But what you can do is build chips that adapt, either through voltage scaling or various dynamic effects like that. We’re going to see a lot more analysis of chips in the field, especially in data centers, but also in consumer spaces. There is just too much cost involved to leave all that margin lying around in various places.
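The adaptive behavior King describes — chips that reclaim margin in the field through voltage scaling — can be sketched as a simple adaptive voltage scaling (AVS) step. This is a minimal sketch under assumed numbers: the margin sensor, voltage rails, and thresholds are all hypothetical placeholders for what real on-die monitors and regulators would provide.

```python
# Hedged sketch of an adaptive voltage scaling (AVS) step: trim the supply
# while an on-die margin monitor (e.g., a timing-slack sensor) reports
# headroom, back off when margin gets thin. All numbers are illustrative.

def avs_step(vdd_mv, margin_ps, target_margin_ps=50, step_mv=5,
             vmin_mv=650, vmax_mv=900):
    """Return the next supply setpoint (mV) given the measured timing margin (ps)."""
    if margin_ps > target_margin_ps and vdd_mv - step_mv >= vmin_mv:
        return vdd_mv - step_mv   # headroom to spare: reclaim wasted margin
    if margin_ps < target_margin_ps and vdd_mv + step_mv <= vmax_mv:
        return vdd_mv + step_mv   # too close to failure: back off
    return vdd_mv                 # within the target band, or at a rail limit
```

Iterated over the chip’s lifetime, a loop like this converges each individual die to its own operating point rather than the worst-case corner, which is exactly the margin King says is being left on the table today.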
Kittrell: A lot of customers are talking about data analytics and aging. How do you predict aging? They get some guidance from the foundry, but it depends on the use model, so they still have to make a lot of judgment calls on their own for this. There is a lot of redundancy being built in, in order to prevent old age infirmities before their time on these chips. There is a lot of data, and it’s like the Internet without Google. It’s dumped in various places, and so having an analytics-centric environment is important to customers. The feedback we’re getting is that if you can collect the data and store it in a certain way, then give these customers applications they can build on top of it, that would be the ideal solution. And if they have multiples of these systems, all of that needs to fit together in a framework, tied together by industry-standard interfaces and integrated into product lifecycle management. A lot of times, product lifecycle management starts whenever you are designing the end system, such as when you’re putting the PCB together. But whenever you’re first designing the RTL or picking out IP, those beginning steps are important parameters, as well. So we’re seeing increasingly complex power problems, but we’re also seeing some amazing deliveries, on schedule, for really, really difficult designs. 7nm hasn’t been out that long and Nvidia just announced its A100, which is an enormous chip. To do this on schedule requires this type of insight to tie everything together.