DAC 2017: EDA Powered by Machine Learning

One of the things I wanted to talk about is that a lot of people are buzzing about machine learning right now. We're excited about it. We're trying to apply it. There are lots of in-house projects that are looking to apply methods that already exist to solve big problems.

We can smell a machine learning problem. We know because data acquisition is really expensive, and we've got a ton of data. Now what methods do we apply to solve these problems? That's the hard part. One thing that doesn't work very well is just opening a textbook or taking a course, and then applying the methods directly from there to solve problems.

And the reason for that is that engineering problems require a different angle. So, what I describe Solido as doing is really “machine learning for engineering applications”, and I’m going to talk a little bit about what differentiates just standard machine learning from machine learning for engineering applications.

There are three areas here:

  • Massive data. Certainly, we have a lot of data in any machine learning application. The thing that's kind of unique about our data in engineering is that a lot of it is streaming data. We collect historic data, but we're also collecting data in real time as we're running our analyses, and we need to be able to analyze it in real time.
  • The second thing is that this stuff is complicated. We can make very few assumptions about the nature of the data here. There are a lot of dimensions, there are all kinds of tricky things there.
  • And the third thing is that in engineering, our bridges need to stand, our chips need to work, and we can’t really bet our designs on estimates. We must have good evidence that our answers are correct.

I’m going to talk about these three things which I think really characterize the difference between standard machine learning and machine learning for engineering.

Massive data.  We certainly have massive data, as I said, in every machine learning type area. But what we’ve got here is:  Imagine a whole bunch of SPICE simulators, 2000 of them, working in parallel on solving a problem, for a chip we’ve never seen before, on a manufacturing process that we’ve never seen before.

Certainly, we can gather some information for how things have behaved in the past and start to shape our models from that. But there’s a lot of real-time data happening here. What we’re doing essentially is real-time machine learning. We’re streaming in information, and building models in real time.

So, what we need to be able to do this are things like:

  • Optimized streaming parsers — things that can read this data efficiently as it's coming in.
  • Parallelizable algorithms — a lot of the machine learning technologies that are out there don't parallelize that well. They run on single CPUs or a small number of CPUs, and we need this to run on thousands of CPUs to keep up with the rate at which data is coming in. (There's a minimal sketch of this kind of streaming, mergeable computation after this list.)
  • We also need scalable cluster management — being able to distribute and dispatch all these jobs while bringing everything together into a single central model is a very hard problem in itself.
  • We need automated recovery and repair — things go wrong when you're streaming real-time data, and you need to be able to say, it's okay that that went wrong, and I'm going to figure out what to do about it. Sometimes you get incorrect answers coming in, which can pollute the models you're building; being able to filter that out, adapt, and correct for it is hard. This is something you must do in this space, and it's very hard to do.
  • Then also being able to figure out what went wrong and being able to debug real-time streaming data is quite challenging.
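To make the streaming and parallelism points above concrete, here is a minimal Python sketch (not Solido's implementation, and with illustrative names) of the kind of computation that works in this setting: each worker folds simulation results into running statistics as they arrive, and per-worker partial results can be merged into one central model.

```python
# Hypothetical sketch: each worker keeps running statistics over the results it
# streams in, and partial results are merged into one central model.
class StreamingStats:
    """Running count/mean/variance (Welford) that can be merged across workers."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        # Fold one streamed simulation result into the statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        # Combine statistics from another worker (parallel merge, Chan et al.).
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        self.mean += delta * other.n / n
        self.m2 += other.m2 + delta * delta * self.n * other.n / n
        self.n = n
        return self

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


# Example: 2000 workers stream in results in parallel; merge into one model.
import random

workers = [StreamingStats() for _ in range(2000)]
for w in workers:
    for _ in range(50):                   # results arriving in real time
        w.update(random.gauss(0.0, 1.0))  # stand-in for a SPICE measurement

central = StreamingStats()
for w in workers:
    central.merge(w)
print(central.n, central.mean, central.variance)
```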

So that’s the first thing you must be able to deal with when you’re doing machine learning for engineering — certainly in the chip space.

The second problem here is complexity.

Imagine a chip, where you’ve got lots of transistors. If you’re looking at process variations, a space that we live in, each one of those transistors has models for how it varies. So, for a chip that’s got thousands of transistors, you might have tens of thousands of variables.

These all interact. These aren’t simple standalone type of responses that we’re looking at. They all interact and they interact in tricky ways — non-linear interactions, interactions with discontinuities.

What we need to handle this, is technology that first, can do an effective design of experiments. Given this large space, how do you start to collect some information about it? You can’t just go run everything, because now you’ve defeated the purpose. We must pick some places to start. So good design of experiments technology is really important.
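As one illustration of what a design-of-experiments step can look like, here is a small Latin hypercube sampler in Python. This is a generic textbook technique shown only for flavor, not a claim about Solido's specific method, and the budget and dimension counts are made up.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, seed=None):
    """Latin hypercube sample in the unit hypercube [0, 1]^n_dims.

    Each dimension is split into n_samples equal bins and every bin is hit
    exactly once, so a small simulation budget is spread evenly across a
    very high-dimensional space.
    """
    rng = np.random.default_rng(seed)
    # One random point inside each bin, per dimension.
    points = (np.arange(n_samples)[:, None] + rng.random((n_samples, n_dims))) / n_samples
    # Shuffle the bin order independently in each dimension.
    for d in range(n_dims):
        rng.shuffle(points[:, d])
    return points

# Example: 500 initial simulations over 20,000 process-variation variables.
doe = latin_hypercube(500, 20_000, seed=0)
print(doe.shape)  # (500, 20000)
```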

The second thing is we need advanced supervised learning techniques. Basically, this is a number of different ways of modeling data, some of which are very accurate, some of which are very fast and very scalable. These are different methods that we can apply within a single solution; we don't just use one. Sometimes we'll need to work with multiple types of models that filter down into the thing that we want.
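As a hypothetical sketch of that "multiple model types filtering down" idea, using off-the-shelf scikit-learn models purely as stand-ins: a fast linear model screens the whole candidate space cheaply, and a slower, more accurate nonlinear model is trained only on the region that survives the screen. The data and model choices below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-in data: x are variation variables, y is a measured circuit output.
x = rng.standard_normal((2000, 200))
y = x[:, 0] ** 2 + 0.5 * x[:, 1] + 0.1 * rng.standard_normal(2000)

# Stage 1: a fast, scalable linear model screens the whole space cheaply.
fast = Ridge(alpha=1.0).fit(x, y)
pred = fast.predict(x)
worst = np.argsort(pred)[-200:]  # keep only the predicted near-worst cases

# Stage 2: a slower, more accurate nonlinear model on the surviving region.
accurate = RandomForestRegressor(n_estimators=100, random_state=0)
accurate.fit(x[worst], y[worst])
print(accurate.predict(x[worst][:5]))
```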

Intelligent screening and filtering: One thing we don't want to do is throw away important variables — variables that will be important under some conditions. They don't look important in our initial experiments, but as we start zeroing in on our areas of interest, they might become important.

We've sometimes got tens of thousands or hundreds of thousands of dimensions we need to filter. So how do we make sure we don't filter out the wrong things?
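One conservative way to approach that, sketched below with an importance-ranking model and a deliberately loose cut, is to keep a generous superset of variables and re-screen as the search zeroes in. The threshold, importance measure, and data are assumptions for illustration, not a description of Solido's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def screen_variables(x, y, keep_fraction=0.99):
    """Rank variables by importance and keep the smallest set that covers
    keep_fraction of the total importance. The cut is deliberately loose so
    variables that only matter in some regions are unlikely to be dropped;
    screening would be repeated as the search zeroes in."""
    forest = RandomForestRegressor(n_estimators=200, max_features="sqrt",
                                   random_state=0).fit(x, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    cumulative = np.cumsum(forest.feature_importances_[order])
    n_keep = int(np.searchsorted(cumulative, keep_fraction)) + 1
    return order[:n_keep]

# Stand-in data: only a couple of 2,000 variables really matter here, but we
# keep a generous superset rather than risk filtering out the wrong things.
rng = np.random.default_rng(1)
x = rng.standard_normal((800, 2000))
y = 3 * x[:, 7] - 2 * x[:, 42] ** 2 + 0.1 * rng.standard_normal(800)
kept = screen_variables(x, y)
print(len(kept), 7 in kept, 42 in kept)
```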

One of the things that's important for handling this data complexity is having a benchmarking infrastructure: a quick way to test what you're doing against things that happen in the real world. We've got 20,000 test cases that run automatically, which we can run in whole or in part overnight on large clusters, and that's very useful.

The last thing is just a big tool box. If you have lots of tools, a whole bunch of different tools that you can use and you know how to put them together, you’ve probably got a good basis for being able to handle this kind of complexity that we have in this space. It’s not an easy space to apply machine learning.

The third thing — and this is probably the most important thing — if you’re going to implement a machine learning technology in a production flow, you must have the right answer. If the answer is not right, people won’t use it.

So, it’s great if you can say, “I’ll make something 100 or 1000 times faster, or save you weeks.” But if at the end of the day there’s a risk that you might have a re-spin as a result, or a risk that you might be massively over-margining still, people won’t adopt that solution.

How do you do that, given that with machine learning techniques we're essentially building a big estimator? This isn't the real answer anymore — we've got a big estimator. You can't just give people that and say, "just trust it — the data around that region looks like it supports this answer." That's not good enough for a lot of engineering decisions.

What we need is accuracy-aware modeling techniques. And this is hard — there aren’t that many accuracy-aware modeling techniques in the world today. They are not in textbooks; you must invent them, you must come up with them.
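To give a flavor of what accuracy-aware can mean, here is a generic illustration (emphatically not one of the invented production techniques referred to above): attach an uncertainty estimate to every prediction, in this case via a simple bootstrap ensemble, so downstream logic can refuse to trust an answer when the band around it is too wide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: a cheap surrogate trained on a handful of "simulation" results.
x = rng.uniform(-3, 3, size=60)
y = np.sin(x ** 2) + 0.05 * rng.standard_normal(60)

def fit_bootstrap_models(x, y, n_models=200, degree=7):
    """Fit many models to bootstrap resamples; the spread of their predictions
    acts as a rough accuracy estimate attached to every answer."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), len(x))  # resample with replacement
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

models = fit_bootstrap_models(x, y)
x_query = np.linspace(-3, 3, 5)
preds = np.array([np.polyval(m, x_query) for m in models])
mean, spread = preds.mean(axis=0), preds.std(axis=0)
for m, s in zip(mean, spread):
    print(f"prediction {m:+.3f} +/- {2 * s:.3f}  (flag it if the band is too wide)")
```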

We need active learning approaches where we can incrementally figure out where areas of interest are. Usually they’re around our worst cases. Show me where my chip is likely to fail, and then go get lots of resolution in that area. We want to actively do direct experiments into areas of interest, and having active learning methods that are good at targeting problem areas is super-important.
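A bare-bones version of that loop might look like the sketch below, where simulate() is a made-up stand-in for SPICE and the model, pool size, and acquisition rule are all illustrative assumptions: fit a model to the results so far, pick the candidates it predicts are closest to failure (inflated by its own uncertainty), simulate only those, and repeat.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def simulate(points):
    """Stand-in for a batch of SPICE runs (the expensive part in real life)."""
    return np.sin(3 * points[:, 0]) + points[:, 1] ** 2

# A large pool of candidate variation points; we can only afford to simulate a few.
pool = rng.uniform(-1, 1, size=(5000, 2))
seen = rng.choice(len(pool), size=20, replace=False)
x, y = pool[seen], simulate(pool[seen])

for round_ in range(5):
    model = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(x, y)
    mean, std = model.predict(pool, return_std=True)
    # Target the predicted worst cases, inflated by model uncertainty, so we
    # get lots of resolution exactly where the design is most likely to fail.
    score = mean + 2.0 * std
    picks = np.argsort(score)[-10:]
    x = np.vstack([x, pool[picks]])
    y = np.concatenate([y, simulate(pool[picks])])
    print(f"round {round_}: worst simulated value so far = {y.max():.3f}")
```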

The third thing — and this might be the most important thing that I'm going to say today — is self-verifying algorithms. And the reason for this is, if you can't prove to an engineer that the answer is right, they're not going to take the answer. Even if you can describe the technology and it has been right a lot of times before, maybe thousands of times, how do they know it's right in their case? If the algorithm can't prove its correctness, people are really hesitant to use that technology and bet their design on it.

So, if you can design algorithms that are verifiable, and can in fact implement the verification as part of the technology, so that you don't just give an answer but prove at run time that it's the correct answer, that's really compelling to people, and that's what makes this work in production.
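As a toy illustration of the idea (not the actual technology): before reporting a model-predicted worst case, re-run the real simulator at that point and only mark the answer as verified if the simulation confirms the prediction within tolerance; otherwise, hand back the evidence so the flow can add simulations and refit. The function and field names below are hypothetical.

```python
import numpy as np

def report_worst_case(model_predict, simulate, candidates, rel_tol=0.02):
    """Predict the worst case with the model, then check it against the real
    simulator before reporting it. A toy stand-in for run-time self-verification."""
    predictions = model_predict(candidates)
    worst_idx = int(np.argmax(predictions))
    predicted = predictions[worst_idx]
    measured = simulate(candidates[worst_idx:worst_idx + 1])[0]  # one real run

    relative_error = abs(measured - predicted) / max(abs(measured), 1e-12)
    verified = relative_error <= rel_tol
    # If the model's answer does not survive verification, it is not reported
    # as correct; the caller gets the evidence and can refine the model.
    return {
        "point": candidates[worst_idx],
        "simulated_value": measured,
        "model_prediction": predicted,
        "verified": verified,
        "relative_error": relative_error,
    }
```

In production you would of course confirm with more than one run, but the shape of the idea is the same: the tool proves its own answer at run time instead of asking to be trusted.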

Being able to prove it through lots and lots of data helps too. If you have lots of cases where you can show that it has worked, not just on your data but on customer data or data onsite — real production data that you know is based on your chips, your processes, and your design practices — then you can believe it a lot more. So being able to run hundreds or thousands of cases, and show that it gives the right answer again and again compared to brute force, is nice for building confidence as well.

So those are really the three main problem areas: we've got massive data, we've got data complexity, and we've got correctness. And we need technologies above and beyond what's in the machine learning textbooks and research today to build machine learning based solutions for engineering.

Solido has been at this for 12 years — I was at Solido for 11 of those years. We've implemented these solutions into two product lines.

  • The first one is Variation Designer, that’s a production tool that uses a lot of different machine learning techniques for differentiating against the status quo. People don’t come to Solido because we have the same solution as everyone else; they come to us because our solutions are 10 times or 100 times or 1000 times faster, but are still accurate and provable. You can still make engineering decisions based on the results.
  • The second set of tools is the newer one we just announced this year. It’s called the Machine Learning (ML) Characterization Suite. This is basically applying a lot of what we know, a lot of our tools in our tool box, to the problem of library timing characterization.

We can speed up that process by weeks as well, with machine learning technologies. This is for standard cell, memory, and I/O type of problems.

Another thing we're doing, that we just announced recently, is ML Labs. We know there are a lot of other problems in this space that our customers have, problems that aren't yet solved by machine learning technologies. And there are a lot of initiatives to apply machine learning technologies to try and solve them.

What we want to do is act as the glue: take customer problems, collaborate with customers on them, run a proof of concept using real data, and make sure it's a reasonable direction. Then ultimately bring new products to market in new spaces. So that's a new initiative we've got at Solido as well, on the machine learning side.

That was just a little bit of information about machine learning for engineering, sharing some of the things that have made it successful in production, and things you really need to think about if you're looking to apply machine learning solutions to solve your problems.

Thanks.

Q&A

Q. Related to yield improvement and pessimism reduction: What does Solido do specifically with the same BSIM models and variation blocks that others don't do, and what do you improve there exactly?

Jeff: Yeah, that's a good question. When designers are faced with uncertainty, they'll over-margin. Chips need to work. I think the best way to address this is that to make designers want to reduce margins, you must give them a lot more certainty in the quality of the answers. With machine learning technologies, you can get a lot more coverage.

Let's say you've got a fixed schedule: you can get a lot more coverage of the space, which gives you a lot more information and a lot more confidence that you've found all the worst cases and know how everything behaves. Then you can make much more aggressive design decisions. That's the biggest thing that leads to over-margin reduction.

[Solido has] tools that analyze that; we've got tools that help people fix it as well. But it's really about that confidence in reducing uncertainty — that's where the gold is, because designers make more aggressive decisions when they think they've got the right answer.