June 8, 2015, edited transcript
I would like to talk to you about the impact of variation on custom IP. We design circuits and make sure they work in a server-class microprocessor. Before I start, I would like to acknowledge my co-author, (Abriham Chandrashekar), for helping with this work.
Let me introduce to you some of the issues with designs, what parts of the designs are sensitive to device variation, and where variation would have a meaningful impact in the way that we do designs.
Areas where variation affects functionality include design hold paths, ratioed logic, memory writes, and level shifting. Areas where variation affects performance include dynamic logic, differential signaling, and memory reads.
Let me talk a little bit about some of the historical issues with IC design before we step into a case study. Identifying the worst-case stimulus and worst-case PVT for each circuit is simulation- and setup-intensive. A corner-based analysis alone is insufficient and can’t expose the true worst-case device variation that applies to each design. Correctly measuring a standard deviation from a low number of simulated samples is another issue. And the runtime is linear in the number of samples, so it becomes very high for a high-sigma analysis.
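To make the two sampling issues above concrete, here is a minimal sketch. It assumes roughly normal data, for which the relative standard error of a sample standard deviation is about 1/sqrt(2(N-1)), and it assumes a hypothetical farm (10 seconds per simulation, 1,000 parallel jobs). Neither number comes from the talk; the function names are illustrative only.

```python
import math


def sigma_relative_error(n_samples: int) -> float:
    """Approximate relative standard error of a sample standard deviation.

    Valid for roughly normal data: about 1 / sqrt(2 * (n - 1)).
    """
    if n_samples < 2:
        raise ValueError("need at least two samples to estimate sigma")
    return 1.0 / math.sqrt(2.0 * (n_samples - 1))


def brute_force_hours(n_samples: int, seconds_per_sim: float, parallel_jobs: int) -> float:
    """Wall-clock hours for brute-force Monte Carlo; linear in n_samples."""
    return n_samples * seconds_per_sim / parallel_jobs / 3600.0


if __name__ == "__main__":
    # Hypothetical farm: 10 s per simulation, 1,000 simulations in parallel.
    for n in (1_000, 10_000, 100_000, 10_000_000_000):
        err = sigma_relative_error(n)
        hours = brute_force_hours(n, seconds_per_sim=10.0, parallel_jobs=1_000)
        print(f"{n:>14,d} samples: sigma error ~{err:.2%}, ~{hours:,.1f} hours")
```

The error on sigma shrinks only with the square root of the sample count, while the cost grows linearly, which is why brute force becomes impractical at high sigma.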
The bottom line is that when you just run a basic corner-based analysis and a brute-force Monte Carlo, you usually get some number of runs, let’s say 1,000 or 10,000, whatever your farm can support. You get a mean and a sigma, which will tell you something about the design.
But it won’t tell you what you really need to know. For example, in this typical SRAM critical path, you will see that the distribution is asymmetrical.
If you look at the high-sigma values, it is not strictly a normal distribution. If you’re interested in 6-sigma or 7-sigma, because you have a high number of SRAM cells or designs implemented on your chip, then running just a basic 10K simulation is an issue.
You really need to get to the high-sigma simulations to see where your pitfalls are, where your issues are, and how to fix them.
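As a rough illustration of why a 10K run cannot reach 6-sigma, the sketch below computes the one-sided tail probability of a standard normal distribution and the expected number of samples that land beyond it. The Gaussian assumption is only a guide here, since the talk's point is that real distributions are asymmetrical; the function names are illustrative.

```python
import math


def upper_tail_probability(sigma: float) -> float:
    """One-sided probability of exceeding `sigma` standard deviations (normal)."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


def expected_failures(sigma: float, n_samples: int) -> float:
    """Expected number of samples beyond `sigma` in a run of n_samples."""
    return n_samples * upper_tail_probability(sigma)


if __name__ == "__main__":
    for sigma in (3.0, 6.0, 7.0):
        p = upper_tail_probability(sigma)
        print(f"{sigma:.0f}-sigma: tail probability {p:.3e}, "
              f"expected hits in 10K run {expected_failures(sigma, 10_000):.3e}, "
              f"in 10B run {expected_failures(sigma, 10_000_000_000):.3e}")
```

Under this assumption a 10,000-sample run expects on the order of 1e-5 events beyond 6-sigma, so it essentially never sees one, while 10 billion samples expect roughly ten.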
So what we were looking for in a solution was support for high-sigma analysis: running up to 10 billion samples in a Monte Carlo analysis, for large memories, with a low runtime. The runtime that we consider low is on the order of hours for a high number of samples. We also wanted to determine whether the circuit has an issue, whether that is a pass/fail check or a failure against a timing criterion that we set out in the beginning; to look for outliers in the distribution, whether they are driven by pFET or nFET device parameters or by specific devices in the design; and to identify areas for optimization and for reducing the impact of process variation.
When we engaged with Solido last year, what we found is that Solido was able to provide a lot of these things. Let me run you through one case study. This case is a ratioed-logic design.
If you run a typical 1K or 10K simulation, you wouldn’t see any of these big outliers. This simulation is based on 10,000 samples analyzed, and you don’t see any failures here. Using Solido, we got a speedup, because Solido only needed to simulate 7,000 samples to get the accuracy of 100,000 simulations in a typical brute-force Monte Carlo run.
And 100,000 samples can barely be simulated with a conventional simulator.
And there were only around 6 (outliers). This is something that Solido has been able to provide for us, and that you can visualize very well: what your distribution looks like and what your normal quantile plot looks like.
For a typical circuit designer, you’re more used to the probability density function. This one is supposed to look like a Gaussian distribution, and clearly it does not. If you look at where the timing target was set, it’s basically in the middle here. And the number of timing outliers, out of 100,000 samples, is not a lot. But it’s still a significant speedup.
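For readers less familiar with the normal quantile view mentioned above, here is a minimal sketch of how such a plot is built and why a non-Gaussian distribution stands out in it. It uses only the Python standard library and idealized, deterministic data. The delay values and the `tail_excess_in_sigmas` helper are hypothetical and are not taken from the tool or the design discussed in the talk.

```python
import math
from statistics import NormalDist, mean, stdev


def normal_quantile_points(samples):
    """Pair each sorted sample with the standard normal quantile of its rank."""
    ordered = sorted(samples)
    n = len(ordered)
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / n), value)
            for i, value in enumerate(ordered)]


def tail_excess_in_sigmas(samples):
    """Largest sample's z-score minus the z-score a Gaussian would give its rank.

    Near zero for normal data; clearly positive for a long upper tail.
    """
    theoretical_z, largest = normal_quantile_points(samples)[-1]
    observed_z = (largest - mean(samples)) / stdev(samples)
    return observed_z - theoretical_z


if __name__ == "__main__":
    n = 2_000
    # Idealized, deterministic "samples" built from normal quantiles.
    z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    gaussian_delays = [100.0 + 5.0 * v for v in z]          # symmetric
    skewed_delays = [100.0 * math.exp(0.5 * v) for v in z]  # long upper tail
    print("gaussian tail excess:", tail_excess_in_sigmas(gaussian_delays))
    print("skewed tail excess:  ", tail_excess_in_sigmas(skewed_delays))
```

On a normal quantile plot, Gaussian data falls on a straight line, so a curve that bends away at the top is the visual signature of the heavy tail that a mean and a sigma alone would hide.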
What we’re really interested in is what happens when you run 10 billion samples. You see here that the circuit doesn’t scale well to a large number of samples and high sigma, and there’s a design flaw here. This particular test case ran up to 10 billion samples, but the simulation time is based on only about 13K samples actually simulated, so the runtimes were still very reasonable for us.
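The sample counts quoted in the talk give a quick sense of the reduction in simulation effort. The short sketch below only divides the samples covered by the samples actually simulated; it is not a wall-clock speedup, since any tool overhead is ignored, and "about 13K" is approximate.

```python
def simulation_reduction(samples_covered: int, samples_simulated: int) -> float:
    """Ratio of Monte Carlo samples covered to samples actually simulated."""
    if samples_simulated <= 0:
        raise ValueError("samples_simulated must be positive")
    return samples_covered / samples_simulated


if __name__ == "__main__":
    # Figures quoted in the talk; "about 13K" is approximate.
    print(f"100K case: ~{simulation_reduction(100_000, 7_000):.1f}x fewer simulations")
    print(f"10B case:  ~{simulation_reduction(10_000_000_000, 13_000):,.0f}x fewer simulations")
```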
It was something that we were not able to do in the past, and not only did we see failures, we saw big outliers. The transient scale and the value scale here are not the same as before; they were changed to fit everything on the same graph. You will find a large number of outliers outside the design target.
We found issues that were observed in the silicon, and we had to fix them. We weren’t able to expose this with our typical “before” simulation method, because running a billion or 10 billion samples would take forever.
So this is another capability in Solido that was very useful for us. It isolated the impact of particular devices on the design. We say that this is the design criterion, that we are looking at when things push out (an increasing delay instead of a minimized delay), and ask: which devices have the biggest impact on this? (Solido) highlights them in order. If you look at the circuit itself, it matches exactly what you would expect: why things are not scaling correctly and where the oversight was. So it’s very consistent, and we can actually see these things, e.g. that you really have to optimize this particular device and then look at another one.
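The talk does not describe how the tool computes its device ranking, so the sketch below is not Solido's method. It is one generic way to order devices by impact, assuming you have per-sample variation values for each device and the resulting delays: rank by the absolute Pearson correlation between each device's variation and the delay. The device names, shifts, and delay model are all made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_devices_by_impact(device_shifts, delays):
    """Order device names by |correlation| between their variation and delay."""
    scores = {name: abs(pearson(shifts, delays))
              for name, shifts in device_shifts.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical per-sample parameter shifts for three made-up devices.
    shifts = {
        "M1": [1, -1, 1, -1, 1, -1, 1, -1],
        "M2": [1, 1, -1, -1, 1, 1, -1, -1],
        "M3": [1, 1, 1, 1, -1, -1, -1, -1],
    }
    # Made-up delay model: M2 dominates, then M1, then M3.
    delays = [100 + 1.0 * a + 4.0 * b + 0.5 * c
              for a, b, c in zip(shifts["M1"], shifts["M2"], shifts["M3"])]
    print(rank_devices_by_impact(shifts, delays))
```

A linear correlation like this only captures first-order effects; it is meant to show the idea of an ordered list of devices to optimize first, not to reproduce a production analysis.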
There were other circuits where we found that design parameters like VT were more impactful than the devices, but this result is particular to this design. So this part is very useful. And to go back: what happens after we optimize the design by cleaning up these particular devices?
It doesn’t really show the same story. This is also 10 billion samples, so you can compare it against the earlier graph of the probability density function for 10 billion samples. It hasn’t gotten back to a full normal distribution, but even compared at 10 billion samples it is a much improved design, and it met our design criteria.
And if you compare against the previous design, where we had about 100K samples analyzed, this ended up being a much tighter distribution. Our design goal was right around here. What happened was that our sigma came down, as did the spread of the overall distribution, so everything just tightened up.
This is really the power that we were observing.
We’re going to put this into all of our designs going forward as our signoff, to say, “You just can’t run brute force Monte Carlo anymore. You want to run high sigma analysis.”
And going into the smaller technologies, we have a very large increase in the number of devices in our design so this is basically what we’re targeting. It shows how important it is to look at device variation this way.
Panelist Questions & Answers
Question: In terms of deployment methodology, when deploying these tools into your organizations how have you done that? Have you used a particular method or what have your experiences been?
In our case, we found an issue in the design – there was a hole in our design. And that means there is a hole in our methodology. We found where the hole in the methodology was. We looked around to see if there was a commercial solution that would fit that hole, and close it.
And Solido was able to address it. We put in Solido. And to tell you the truth – as we were doing the evaluation and going through putting in the tool, we had to change simulators. And that didn’t stop us at all. We did a simulator change and full evaluation, went with that, and Solido had no problems adapting from one simulator to another one. The runtime and everything was just fine. We basically said, “Here’s a hole, and we need someone to fill it.” This addressed it.
Question: When is it important to deploy variation-aware design methodology? Is it at a specific node, or is it a specific power? What’s the compelling event?
We found issues at the 28 nanometer node. But the fact that we had a hole was indicative that there was a hole in our methodology throughout our technology nodes.
We did look at Solido a long time ago. We looked around when we were setting up the methodology and it wasn’t mature at that point.
But when we came back this time, we knew that this hole was fairly significant and that we had to fill it. The technology was very mature and was something we could just plug in right away.
So I would say it wasn’t a technology-related thing. Variation has existed all the way back, and so has any hole in your methodology. It is a question of whether you are lucky enough to stumble onto an issue or not.
Question: Jeff and Sifuei commented a little on usability. I was curious from the other panelists’ point of view, what were your thoughts on that?
Really, it was a very easy tool to use. We wanted to get it going so that we could test corner cases, and it very quickly turned into: How can we integrate it into our methodology? How can we get it into a turnkey solution and into the rest of everything that we do?
The bring-up time was very, very low for pretty much everybody. I’ve used it myself, and it’s very easy to use. The graphical interface really stood out: you would be surprised what you find when you are just reviewing things visually, versus only seeing the mean and the sigma, which is what you expect from your simulator.