Novel Applications of Deep Learning Hidden Features for Adaptive Testing

Bingjun Xiao, University of California, Los Angeles, Los Angeles, CA, xiao@cs.ucla.edu
Jinjun Xiong, IBM Research, Yorktown Heights, NY, jinjun@us.ibm.com
Yiyu Shi, University of Notre Dame, Notre Dame, IN, yshi4@nd.edu

ABSTRACT

Adaptive test of integrated circuits (ICs) promises to increase the quality and yield of products with reduced manufacturing test cost compared to traditional static test flows. Two of the most widely used techniques are Statistical Process Control (SPC) and Part Average Testing (PAT), whose capabilities to capture the complex correlation between test measurements and the underlying IC's physical and electrical properties are, however, limited. Based on recent progress in machine learning, this paper proposes a novel deep learning based method for adaptive test. Compared to most machine learning techniques, deep learning has the distinctive advantage of being able to capture the underlying key features automatically from data without manual intervention. In this paper, we start from a trained deep neural network (DNN) with a much higher accuracy than the conventional test flow for pass and fail prediction. We further develop two novel applications that leverage the features learned by the DNN: one to enable partial testing, i.e., making pass/fail decisions without finishing the entire test flow, and the other to enable dynamic test ordering, i.e., changing the sequence of tests adaptively. Experimental results show significant improvements in the accuracy and effectiveness of our proposed method.

I. INTRODUCTION

Chip testing is an important step in the manufacturing of integrated circuits. By measuring various parameters of chips and setting up screening criteria, defective chips are separated from good chips. A test flow can make different choices of testing items and screening criteria, but a good test flow should minimize the total cost of the testing items while maintaining yield.

A test flow of a chip usually contains 20-100 test items. Conventional flows and their derivations apply screening to outliers at each step [1, 2, 3, 4]. However, the measurement results of these test items are actually determined by the physical facts of a chip, and thus are highly correlated. That is to say, the boundary between good chips and defective chips does not lie on the orthogonal lines independently determined by any two test items. Instead, the boundary is a curve whose shape can be arbitrary depending on the relationship between the two test items. Overlooking this underlying relationship is a big limitation of conventional test flows, as illustrated in Fig. 1. If we apply passing criteria at each test step separately, we will lose the good chips that happen to fail accidentally at only one or two test steps. Instead, we should consider the fact that tests are highly correlated and that a chip passing one test is highly likely to pass others. This calls for the dynamic adjustment of the criteria of a test based on the results of other tests and the underlying properties of the process that relate tests together.

Fig. 1. Conventional test flow that performs screening at each step (manufactured chips pass Test 1 through Test |m|; failing chips are discarded at each step, and the remaining chips are shipped).

Because of the above-mentioned issues, recent work tried to explore the relationship among test items [5, 6].
Based on expert understanding of the physical mechanisms of fabricated chips, they group related test items and calculate a hyperplane via a support vector machine (SVM) for each test group to separate outliers from good chips. This approach is, however, not scalable, as deciding which test items to group together requires human selection and domain knowledge from chip experts. In addition, adaptive testing is usually used to reduce test time and cost [7, 8, 9, 10]. In this scheme, the test order and content per die are modified based on real-time analysis of the measured data. Human selection of test items to be grouped becomes prohibitive in adaptive testing since decisions need to be made on the fly.

The underlying relationships among test items can be learned from historical test data with the help of deep neural networks (DNNs) [11]. Inspired by this trend, some researchers started to use DNNs in chip testing [12, 13, 14]. However, the usage of DNNs has been limited to a binary classifier for chip disposition. It only brings better precision but does not help optimize the test flow. We believe that more opportunities can be explored from the chip insights automatically gained via DNNs. Based on the relations among test items represented by a trained DNN, we can estimate a chip's status from fewer test items and bypass other items to reduce the test cost. We can also dynamically change the test order to measure the most important test items so as to converge to the disposition decision of the chip as soon as possible. In this work, we use DNNs to explore these opportunities in chip testing. The contributions of this paper include:
1. We make use of the underlying features gained by the hidden layers of DNNs to reduce the number of test items while giving the best estimation of the chip status.

2. We build an adaptive test flow that automatically combines the hidden features with online testing results, selects the best test item to be measured next, and converges to the final prediction result faster.

II. AUTOMATED LEARNING VIA ARTIFICIAL NEURAL NETWORK

DNNs are based on artificial neural networks (ANNs) but with deep layer structures. An ANN is a general nonlinear model that makes no assumption about the particular model underlying the training data [11]. It mimics the behavior of the brain's nervous system and organizes neurons into networks. Therefore, classification problems, pattern recognition, or more general black-box modeling can be solved by ANNs [12, 13, 14]. Note that the theories developed in this work are applicable to general ANNs including DNNs. Though we use two-layered ANNs to simplify some examples for demonstration purposes, all of these examples can be easily extended to DNNs based on the recursive structure of ANNs.

The ANN architecture is detailed in Fig. 2(a). It contains a set of neurons and synapses in different layers. The input layer at the leftmost part of Fig. 2(a) copies the measurement data $m$ of all the test items to its neurons. The hidden layers in the middle contain the variables $h_1, h_2, \ldots, h_k$ that can be learned from training and correspond to the hidden features of chips. The output layer at the rightmost part of Fig. 2(a) is the classification result $r$ of chips, i.e., good chips or defective chips.

Fig. 2. Artificial neural network: (a) layered network structure with parameter matrices $L_1, \ldots, L_{k+1}$; (b) an example neuron computing $y = f(8x_1 + 8x_2 - 12)$.

The neurons in adjacent layers are fully connected by weighted synapses with trainable network parameters. Each neuron receives excitations from all the preceding neurons. The next step is to pass the accumulated excitation to the next layer via an activation function $f(\cdot)$. The most common type of activation function is the s-shaped sigmoid $f(x) = 1/(1+e^{-x})$. If the accumulated excitation $x$ is strong enough, $f(x)$ will be close to 1 (active state), and the neuron will further excite all the neurons in the succeeding layer. In summary, the classification function $r(m)$ is a multi-stage linear combination followed by the non-linear activation (writing $g^k = f(h^k)$ for the activated hidden variables):

$$r(m) = f\left(\sum_{j=0}^{|h_k|} l^{k+1}_j g^k_j\right) = f\left(\sum_{j=0}^{|h_k|} l^{k+1}_j f\left(\sum_{i=0}^{|h_{k-1}|} l^k_{ji} h^{k-1}_i\right)\right) = \cdots \quad (1)$$

where $|h_k|$ and $|m|$ are the numbers of elements in the hidden-variable vector $h_k$ and the measurement data vector $m$, respectively. If we write the linear combinations as matrix-vector multiplications, the equation can be simplified as

$$r(m) = f\left(L_{k+1} f\left(L_k f\left(L_{k-1} \cdots f(L_1 m) \cdots\right)\right)\right) \quad (2)$$

Note that when $f(\cdot)$ is applied to a vector, it means applying $f(\cdot)$ to each element of the vector.
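To make Eq. (2) concrete, below is a minimal NumPy sketch of the forward pass; the layer shapes and the two-layered usage example are illustrative assumptions, not the exact network used in our experiments.

```python
import numpy as np

def sigmoid(x):
    # Element-wise s-shaped activation f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def ann_forward(m, layers):
    """Evaluate r(m) = f(L_{k+1} f(L_k ... f(L_1 m) ...)) as in Eq. (2).

    m      -- measurement vector over all test items
    layers -- list of weight matrices [L_1, ..., L_{k+1}]
    """
    g = m
    for L in layers:
        g = sigmoid(L @ g)  # linear combination, then non-linear activation
    return g

# Example: a two-layered ANN with 20 test items and 8 hidden variables
# (both sizes are hypothetical here).
rng = np.random.default_rng(0)
L1, L2 = rng.normal(size=(8, 20)), rng.normal(size=(1, 8))
r = ann_forward(rng.normal(size=20), [L1, L2])  # classification output r
```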
III. TEST FLOW IMPROVEMENT BASED ON ANN

As discussed in Section I, the direct result of introducing ANNs to chip testing is the improvement of defect prediction [12, 13, 14, 15]. However, this is just a starting point. We can use the automatically learned knowledge to do more. We further build our test data analytics framework based on ANNs as shown in Fig. 3. Based on the large volume of historical testing data, we use an artificial neural network to learn the underlying hidden features. These hidden features can then be used in different applications. In addition to improving defect prediction for yield, we will demonstrate how to use this information to predict defects from partial measurements for cost reduction in Section IV. Furthermore, we propose adaptive testing with real-time analysis of measurements via the learned model in Section V.

Fig. 3. An overview of our test data analytics framework based on ANN (historical data feed a neural network whose underlying features serve defect prediction for yield, defect prediction from partial measurements for cost reduction, and adaptive test with real-time analysis of measurements).

IV. PREDICTION BASED ON PARTIAL MEASUREMENT

Each test item contributes to product cost, and therefore we want to measure as few test items as possible. The opportunity is that all the measurement data of the test items are determined by some common hidden features of a chip. Based on the learned underlying models, we can bypass some test items and estimate the hidden features from partial measurements, as shown in Fig. 4. If this works, both test cost and test time can be saved.

Fig. 4. Chip testing with partial measurements.

To realize this methodology, we first need to estimate how hidden variables determine measurements based on historical data. We group the measurement data vectors $m$ of all the historical training samples into a matrix $M'$. Each column of $M'$ represents a training sample, while each row represents a measurement item across all the samples. Without loss of generality, we start from $M'$ and the learned hidden variables in the first neuron layer $H' = L_1 M'$, where $L_1$ is the matrix of network parameters in Fig. 2. All the methods discussed below can be easily extended to the other neuron layers in a neural network due to the recursive topology of the network.

We denote $m = Ah$ to express how the hidden variables $h$ of a chip determine its measurements $m$. We use the historical data $M'$ and $H'$ to optimize the mapping $A$ such that the total error function is minimized, i.e.,

$$\min_{A_i} \|M'_i - A_i H'\|^2, \quad i = 1, 2, \ldots, |m| \quad (3)$$

where $A_i$ is the $i$th row of $A$, and $M'_i$ is the $i$th row of $M'$, which corresponds to the $i$th measurement item. This problem formulation means that for each test item $i$, we can perform a separate optimization to minimize its estimation error over all the historical samples. This problem can be solved via the pseudoinverse.
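As a concrete illustration, every row of $A$ can be fit at once with the Moore-Penrose pseudoinverse; the sketch below assumes `M_hist` ($|m| \times n$) and `H_hist` ($|h| \times n$) hold the historical measurements and the corresponding first-layer hidden variables over $n$ samples.

```python
import numpy as np

def fit_mapping(M_hist, H_hist):
    """Fit A in m = A h over historical data, per Eq. (3).

    Each row A_i minimizes ||M'_i - A_i H'||^2; stacking all rows, the
    least-squares solution for every test item is A = M' pinv(H').
    """
    return M_hist @ np.linalg.pinv(H_hist)

# Hypothetical usage: H_hist comes from the trained first layer, H' = L1 M'.
# A = fit_mapping(M_hist, L1 @ M_hist)
```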
After we obtain an optimal mapping $A$, for a chip under test we want to estimate the hidden variables $h$ in the first neuron layer using the partial measurements $\tilde{m}$, a subset of the full measurement vector $m$. With the full measurements, we have the equation group

$$m = Ah \quad (4)$$

Now we have only the partial measurements $\tilde{m}$, where some elements of $m$ are missing. We can ignore the equations in the rows corresponding to these missing elements and obtain the partial equation group

$$\tilde{m} = \tilde{A}h \quad (5)$$

Here we denote by $\tilde{A}$ the partial mapping of $A$ where some rows are removed due to the missing elements. In reality, there can be very few elements in the partial measurements $\tilde{m}$, which makes Eq. (5) underdetermined. Motivated by this, we propose to give the best statistical estimation of $h$ by maximum a posteriori (MAP) estimation [16].

To solve Eq. (5), we first need to define a so-called prior distribution for $h$ [16]. Intuitively, the prior distribution represents our prior knowledge about $h$ before seeing any measurement data. Such prior information helps us to further constrain the underdetermined linear equation $\tilde{m} = \tilde{A}h$ in Eq. (5) so that a meaningful solution can be uniquely found. We assume that the hidden variables in the vector $h$ are the sources of everything and generally follow a Gaussian distribution:

$$h \sim \mathcal{N}(h_{\mathrm{past}}, \Sigma(h_{\mathrm{past}})) \quad (6)$$

where $h_{\mathrm{past}}$ is the mean vector and $\Sigma(h_{\mathrm{past}})$ is the covariance matrix of the hidden variables. Both can be learned from historical data. Then the probability density function (PDF) of $h$ is

$$P(h) = \frac{1}{\sqrt{(2\pi)^{|h|}\,|\Sigma(h_{\mathrm{past}})|}} \exp\left(-\frac{1}{2}(h - h_{\mathrm{past}})^{T} \Sigma(h_{\mathrm{past}})^{-1} (h - h_{\mathrm{past}})\right) \quad (7)$$

where $|h|$ is the total number of hidden variables.

The key idea of MAP is to find the optimal solution $h$ that maximizes the posterior distribution, i.e., the conditional PDF $P(h \mid \tilde{m} = \tilde{A}h)$. Namely, given the partial measurement data $\tilde{m}$, it aims to find the solution $h$ that is most likely to occur. Based on Bayes' theorem [16], the posterior distribution $P(h \mid \tilde{m})$ is proportional to the prior distribution $P(h)$ times the likelihood function $P(\tilde{m} \mid h)$:

$$P(h \mid \tilde{m}) = \frac{P(h)\, P(\tilde{m} \mid h)}{P(\tilde{m})} \quad (8)$$

Theoretically, the likelihood is a Dirac delta function:

$$P(\tilde{m} \mid h) = \begin{cases} \infty & (\tilde{m} = \tilde{A}h) \\ 0 & (\tilde{m} \neq \tilde{A}h) \end{cases} \quad (9)$$

Hence, maximizing the posterior probability in Eq. (8) is equivalent to maximizing the prior probability $P(h)$ subject to the constraint $\tilde{m} = \tilde{A}h$:

$$\min_h \ (h - h_{\mathrm{past}})^{T} \Sigma(h_{\mathrm{past}})^{-1} (h - h_{\mathrm{past}}) \quad \text{subject to} \quad \tilde{A}h = \tilde{m} \quad (10)$$

In this optimization problem, the objective function is quadratic and the constraints are linear equality constraints. Hence this is a simple quadratic program with a known analytic solution. By introducing Lagrange multipliers and seeking the extremum of the Lagrangian, the solution to the problem is given by the linear system

$$\begin{bmatrix} \Sigma(h_{\mathrm{past}})^{-1} & \tilde{A}^{T} \\ \tilde{A} & 0 \end{bmatrix} \begin{bmatrix} h \\ \lambda \end{bmatrix} = \begin{bmatrix} \Sigma(h_{\mathrm{past}})^{-1} h_{\mathrm{past}} \\ \tilde{m} \end{bmatrix} \quad (11)$$

where $\lambda$ is the set of Lagrange multipliers that come out alongside the optimal $h$.
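A minimal sketch of solving the KKT system of Eq. (11) follows; the names `Sigma_h`, `h_past`, `A_part`, and `m_part` are illustrative stand-ins for $\Sigma(h_{\mathrm{past}})$, $h_{\mathrm{past}}$, $\tilde{A}$, and $\tilde{m}$.

```python
import numpy as np

def map_noiseless(h_past, Sigma_h, A_part, m_part):
    """Equality-constrained MAP estimate of Eqs. (10)-(11)."""
    n_h, n_m = h_past.shape[0], m_part.shape[0]
    P = np.linalg.inv(Sigma_h)  # prior precision Sigma(h_past)^{-1}
    # Assemble the saddle-point system [[P, A~^T], [A~, 0]] [h; lambda] = [P h_past; m~]
    K = np.block([[P, A_part.T],
                  [A_part, np.zeros((n_m, n_m))]])
    rhs = np.concatenate([P @ h_past, m_part])
    sol = np.linalg.solve(K, rhs)
    return sol[:n_h]  # optimal h; sol[n_h:] holds the Lagrange multipliers
```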
In reality, we find that measurement data usually contain noise. We denote by $\Sigma(\tilde{m}_{\mathrm{past}})$ the noise intensity learned from the historical data $M'$. Then the likelihood $P(\tilde{m} \mid h)$ is replaced by

$$\tilde{m} = \tilde{A}h + \mathcal{N}(0, \Sigma(\tilde{m}_{\mathrm{past}})) \quad (12)$$

In this case, the conditional probability $P(h \mid \tilde{m})$ can be expressed as

$$P(h \mid \tilde{m}) = \frac{P(h)\, P(\tilde{m} \mid h)}{P(\tilde{m})} = \frac{1}{Z(\tilde{m})} \exp\Big(-\frac{1}{2}(h - h_{\mathrm{past}})^{T} \Sigma(h_{\mathrm{past}})^{-1} (h - h_{\mathrm{past}}) - \frac{1}{2}(\tilde{m} - \tilde{A}h)^{T} \Sigma(\tilde{m}_{\mathrm{past}})^{-1} (\tilde{m} - \tilde{A}h)\Big) \quad (13)$$

where $Z(\tilde{m})$ is the normalization function. Note that $Z(\tilde{m})$ can be ignored during the optimization of Eq. (8) since it is independent of the optimization variable $h$. Then the MAP problem can be rewritten as

$$\min_h \ (h - h_{\mathrm{past}})^{T} \Sigma(h_{\mathrm{past}})^{-1} (h - h_{\mathrm{past}}) + (\tilde{m} - \tilde{A}h)^{T} \Sigma(\tilde{m}_{\mathrm{past}})^{-1} (\tilde{m} - \tilde{A}h) \quad (14)$$

This is an unconstrained quadratic program and can be easily transformed into a linear least-squares problem like Eq. (3). It can be solved via the pseudoinverse.
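For illustration, setting the gradient of the quadratic objective in Eq. (14) to zero gives a single linear system; in the sketch below, `Sigma_m` stands for the noise covariance $\Sigma(\tilde{m}_{\mathrm{past}})$.

```python
import numpy as np

def map_noisy(h_past, Sigma_h, A_part, m_part, Sigma_m):
    """Noise-aware MAP estimate of Eq. (14) in closed form."""
    P_h = np.linalg.inv(Sigma_h)          # prior precision
    P_m = np.linalg.inv(Sigma_m)          # measurement-noise precision
    lhs = P_h + A_part.T @ P_m @ A_part   # also equals Var(h|m~)^{-1}, cf. Eq. (17)
    rhs = P_h @ h_past + A_part.T @ P_m @ m_part
    return np.linalg.solve(lhs, rhs)      # posterior mode of the hidden variables
```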
V. ADAPTIVE TESTING

We further find that the selection of partial measurements has a high impact on the defect prediction error. The challenge is that different chips have different optimal sets of partial measurements. If one hidden feature of a chip deviates far from the standard, it will dominate the result, and the measurements related to this hidden feature become more important. So we need to customize the test flow during chip testing, as shown in Fig. 5. At each test step of a chip, given the hidden variables of the chip estimated from the measurements that have been performed, we choose the best measurement to perform in the next step so that the defect prediction error is minimized.

Fig. 5. Adaptive test flow (select test, take measurements, estimate hidden features, predict defects, in a loop driven by historical data).

The key step in the adaptive test flow is the selection of the measurement at each step. The optimization goal can be formulated as: given the current measurement set $\tilde{m}$, find a new measurement candidate $m_i$ to form an incremental set of measurements $\hat{m} = \tilde{m} \cup \{m_i\}$, such that the variance of the defect prediction is minimized, i.e.,

$$\min_{\hat{m}} \ \mathrm{Var}(r \mid \hat{m}) \quad (15)$$

The optimization methodology is shown in Fig. 6. We enumerate candidates over each unmeasured test and calculate the prediction variance based on the measured data and the learned features. If a candidate improves the current solution, we update the optimal measurement set found so far. Then we continue to enumerate the next candidate. The key part of this flow is to calculate the prediction variance given a candidate $\hat{m}$.

Fig. 6. Methodology of test step optimization (enumerate a candidate over each unmeasured test, predict the result variance from the measured data and the features and statistics learned from historical data, and update the optimal measurement set whenever a better solution is found).

We first estimate the variance of the hidden variables for a candidate $\hat{m}$. We already know from Section IV that

$$P(h \mid \hat{m}) = \frac{1}{Z(\hat{m})} \exp\Big(-\frac{1}{2}(h - h_{\mathrm{past}})^{T} \Sigma(h_{\mathrm{past}})^{-1} (h - h_{\mathrm{past}}) - \frac{1}{2}(\hat{m} - \hat{A}h)^{T} \Sigma(\hat{m}_{\mathrm{past}})^{-1} (\hat{m} - \hat{A}h)\Big) \quad (16)$$

From this, we know that

$$\mathrm{Var}(h \mid \hat{m}) = \left(\Sigma(h_{\mathrm{past}})^{-1} + \hat{A}^{T} \Sigma(\hat{m}_{\mathrm{past}})^{-1} \hat{A}\right)^{-1} \quad (17)$$

which is the variance of the hidden variables. Next, we calculate the prediction variance from the hidden-variable variance. We know that the prediction result $r$ in Fig. 2 is

$$r = f\left(L_{k+1} f\left(L_k f\left(L_{k-1} \cdots f(h) \cdots\right)\right)\right) \quad (18)$$

and the derivative of the sigmoid function $f(x)$ is

$$\frac{d f(x)}{dx} = f(x)(1 - f(x)) \quad (19)$$

Then we have

$$\frac{d r(h)}{dh} = r(1-r) \prod_{i=k}^{1} \Big(L_{i+1} \cdot \big(f(h_i)(1 - f(h_i))\big)\Big) \quad (20)$$

which builds the relation between the prediction result and the hidden variables. Though the neural network is nonlinear, we can approximate the variance of the prediction result by linearization in the small region around the current measurements and have

$$\mathrm{Var}(r \mid \hat{m}) = \left(\frac{d r(h)}{dh}\bigg|_{h=h_{\mathrm{est}}}\right)^{T} \mathrm{Var}(h \mid \hat{m}) \left(\frac{d r(h)}{dh}\bigg|_{h=h_{\mathrm{est}}}\right) \quad (21)$$

where $h_{\mathrm{est}}$ is the estimate of the hidden variables calculated from the current measurements via the approach in Section IV. We use this method to predict the result variance in Fig. 6.
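Putting Eqs. (15), (17), and (21) together, one greedy selection step of Fig. 6 could look like the sketch below; it assumes a diagonal noise model `sigma2_m` for simplicity and takes the linearized gradient `dr_dh` of Eq. (20), evaluated at the current estimate $h_{\mathrm{est}}$, as given.

```python
import numpy as np

def next_test_item(measured, A, Sigma_h, sigma2_m, dr_dh):
    """Pick the unmeasured test that minimizes Var(r | m^), per Eq. (15).

    measured -- set of indices of already-measured test items
    A        -- full |m| x |h| mapping from hidden variables to measurements
    sigma2_m -- per-item noise variances (diagonal noise assumed here)
    dr_dh    -- gradient vector of Eq. (20) at the current estimate h_est
    """
    P_h = np.linalg.inv(Sigma_h)
    best_i, best_var = None, np.inf
    for i in range(A.shape[0]):            # enumerate each unmeasured test
        if i in measured:
            continue
        rows = sorted(measured | {i})      # candidate set m^ = m~ U {m_i}
        A_hat = A[rows, :]
        P_m = np.diag(1.0 / sigma2_m[rows])
        var_h = np.linalg.inv(P_h + A_hat.T @ P_m @ A_hat)  # Eq. (17)
        var_r = dr_dh @ var_h @ dr_dh                       # Eq. (21)
        if var_r < best_var:               # keep the best solution found so far
            best_i, best_var = i, var_r
    return best_i
```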
VI. EXPERIMENTS

A. Experiment Settings

We set up our experimental flow as shown in Fig. 7. The flow starts from historical data which contain basic information such as the wafer ID of each chip and its location on the wafer. The historical data also contain measured parameters commonly used in industrial testing, such as IDDQ, PSRO, FMAX, etc. User feedback, i.e., the existence of defects found in final tests and customer returns, is also included. We divide the historical data into two parts. One part is used as training data for the ANN. We use the forward/backward propagation algorithm to iteratively update the synapse weights in the ANN from the training data. The hidden features are also calculated at the same time. The other part of the historical data is used for validation. We apply the trained ANN with the learned hidden features to this part of the data and compare the defect predictions by the ANN with the ground truths specified by the historical data.

Fig. 7. Flow of ANN validation on chip testing (historical data feed deep learning of hidden features via forward/backward propagation; the trained network's defect predictions are validated against historical facts).

We use synthetic data in our experiments. Our data set contains 100 wafers with 176 chips on each wafer, and each chip comes with 20 measured parameters. Fig. 8 shows the distribution of the defect rate averaged over the 100 wafers. In this synthesis, we use independent random variables to simulate the physical states of chips. We also use randomly generated nonlinear functions to determine all the chip measurements from the physical states, and we further add Gaussian noise to mimic real measurement data. One big advantage of using synthetic data is that we can have a golden oracle to validate the proposed algorithms. This validation cannot be easily done on real data, but we are also actively working on applying our algorithms to real data.

Fig. 8. Example of wafer data used in our experiments (wafer map and average yield of 100 wafers).
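As an illustration of this setup, a generator in the spirit described above might look like the following; the number of hidden physical states, the choice of tanh as the nonlinearity, and the noise level are assumptions for the sketch, not the exact settings of our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_chips = 100 * 176       # 100 wafers x 176 chips per wafer
n_states, n_meas = 8, 20  # hypothetical hidden physical states; 20 test items

H = rng.normal(size=(n_states, n_chips))  # independent random physical states
W = rng.normal(size=(n_meas, n_states))   # random mixing for nonlinear functions
M = np.tanh(W @ H)                        # nonlinear map from states to measurements
M += 0.05 * rng.normal(size=M.shape)      # Gaussian noise mimicking real data
```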
B. Defect Prediction

We first validate the direct benefit of ANNs in chip testing: defect prediction improvement. We compare four test flows. The first one is the 'oracle' flow, which knows all the defective chips in advance. This flow provides the upper bound for any testing approach. The second one is a 'baseline' built on top of the conventional test flows as discussed in Section I. This flow applies passing criteria at each test step. The third one is the 'logistic' flow based on logistic regression, a widely used machine learning method. We expect that though both logistic regression and ANNs can automatically learn insights from data, ANNs can learn more and give better predictions. The fourth one is the 'neural' flow based on our ANN training. This flow uses historical data to improve the current and later testing.

The experimental results are shown in Fig. 9. We summarize the average chip yield on the 50 validation wafers using the four different test flows. We can see that while the 'baseline' flow overkills many good chips and leads to low yield, our 'neural' flow achieves results close to the 'oracle' upper bound. This means that ANN training does capture the hidden features within chips that relate measurement data to exposed defects. We also observe that our 'neural' flow achieves better results than the 'logistic' flow. The reason is that an ANN is a more powerful model than traditional regression models (including logistic regression). Note that ANN training was often limited by overfitting in the past due to high model complexity and small data volume, but this limitation is gone in the era of big data.

Fig. 9. Wafer yield improvement in our analytics flow based on ANN training (yield per wafer ID for the base, oracle, neural, and logistic flows).

C. Removing Test Items

To validate our approach in Section IV, we randomly select $t$ measurements out of the total 20 in the experiments. We sweep $t$ from 1 to 20, and for each $t$ we repeat the random selection 100 times. For each selection, we use the approach discussed above to estimate the hidden variables and then predict defective chips.

The result is shown in Fig. 10. Here we plot both the mean and variance of the 100 repeated random selections for each $t$. We can see that as the number of measurements increases, the average error rate of defect prediction converges to the lower bound. This means that we do not need all the measurements if we do not pursue the lower bound of the error rate.
Fig. 10. Experimental results of chip testing with partial measurements (prediction error vs. number of measurements).

We can safely reduce the number of measurements from 20 to as low as 14 without a large penalty on the prediction error. We also see that there are large variations in the prediction error. These variations come from the randomness of the measurement selection. This means that the selection of measurements matters, which leaves room for further investigation. If we can select the optimal measurement set for each chip during testing, the error rate can be reduced significantly, e.g., from 18% to 8% at $t = 10$. This measurement selection problem is further optimized in adaptive testing in Section V.

D. Adaptive Testing

We implement the adaptive testing flow in Fig. 5. The updated experimental results are shown in Fig. 11, compared against the prior flow with predefined measurement sets. We see that our adaptive testing no longer has variance on the curve. It eliminates the uncertainty of the prediction error brought by the measurement selection. In addition, our adaptive testing performs consistently better than the average of the prior testing flow with predefined measurement sets. It shows the great potential of combining online measurement results with the learning results of historical data to co-optimize chip testing.

Fig. 11. Experimental results of adaptive testing (prediction error vs. number of measurements for the predefined and adaptive flows, with a 1.7x gap annotated).

VII. CONCLUSIONS AND FUTURE WORK

In this work, we start from using a deep neural network as a binary classifier for chip testing and go beyond it. We successfully extend the insights learned by the neural network to estimate chip quality from partial measurements, and to improve the test flow in combination with online analysis.

As this is the first work on applying artificial neural networks to chip testing beyond binary classification, our primary goal is to identify the new research opportunities and potential benefits presented in this work. Many opportunities exist for further improvement. For example, during the test cost optimization, we have not yet considered the variation of test cost across different measurements. Two cheap measurements might gain more information than one expensive measurement. This kind of interesting research problem can be further investigated.

REFERENCES

[1] H. Ayari, F. Azais, S. Bernard, M. Comte, V. Kerzerho, O. Potin, and M. Renovell, "Making predictive analog/RF alternate test strategy independent of training set size," in International Test Conference, Nov. 2012, pp. 1-9.
[2] M. Gao, P. Lisherness, and K.-T. (Tim) Cheng, "Adaptive test selection for post-silicon timing validation: A data mining approach," in International Test Conference, Nov. 2012, pp. 1-7.
[3] A. Nahar, K. Butler, J. M. Carulli, and C. Weinberger, "Quality improvement and cost reduction using statistical outlier methods," in International Conference on Computer Design, Oct. 2009, pp. 64-69.
[4] B. Seshadri, P. Gupta, Y. T. Lin, and B. Cory, "Systematic defect screening in controlled experiments using volume diagnosis," in International Test Conference, Nov. 2012, pp. 1-7.
[5] N. Sumikawa, D. G. Drmanac, L.-C. Wang, L. Winemberg, and M. S. Abadir, "Forward prediction based on wafer sort data — a case study," in International Test Conference, Sep. 2011, pp. 1-10.
[6] D. Drmanac, N. Sumikawa, L. Winemberg, L.-C. Wang, and M. S. Abadir, "Multidimensional parametric test set optimization of wafer probe data for predicting in field failures and setting tighter test limits," in Design, Automation and Test in Europe, Mar. 2011, pp. 1-6.
[7] K. R. Gotkhindikar, "A die-level adaptive test scheme for real-time test reordering and elimination," Master's thesis, Portland State University, 2012.
[8] ITRS, "Adaptive Test," Tech. Rep., 2013. [Online]. Available: http://www.semi.org/en/sites/semi.org/files/docs/ITRS_AdaptiveTest_WhitePaper2013.pdf
[9] S. Biswas and R. D. S. Blanton, "Improving the accuracy of test compaction through adaptive test update," in International Test Conference, 2008.
[10] E. Yilmaz, S. Ozev, and K. M. Butler, "Adaptive test flow for mixed-signal/RF circuits using learned information from device under test," in International Test Conference, 2010.
[11] H. White, Artificial Neural Networks: Approximation and Learning Theory. Oxford: Blackwell, 1992.
[12] L. Wu, J. Zhang, and G. Zhang, "A fuzzy neural network approach for die yield prediction of wafer fabrication line," in International Conference on Fuzzy Systems and Knowledge Discovery, 2009, pp. 198-202.
[13] T. S. Kim, Y. G. Jang, J. I. Lee, K. J. Lee, B. Y. Kim, and C. H. Cho, "Yield prediction models for optimization of high-speed micro-processor manufacturing processes," in International Electronics Manufacturing Technology Symposium, 2000, pp. 368-373.
[14] S. Ellouz, P. Gamand, C. Kelma, B. Vandewiele, and B. Allard, "Combining internal probing with artificial neural networks for optimal RFIC testing," in International Test Conference, Oct. 2006, pp. 1-9.
[15] D. Maliuk, H.-G. Stratigopoulos, H. Huang, and Y. Makris, "Analog neural network design for RF built-in self-test," in International Test Conference, 2010, pp. 1-10.
[16] Y. Anzai, Pattern Recognition and Machine Learning. Elsevier, 1992.