Support Vector Machines for Pattern Classification
Shigeo Abe
Second Edition
Advances in Pattern Recognition For further volumes: http://www.springer.com/series/4205
Shigeo Abe

Support Vector Machines for Pattern Classification

Second Edition

Springer
Prof. Dr. Shigeo Abe
Kobe University, Graduate School of Engineering
1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
abe@kobe-u.ac.jp

Series Editor
Prof. Sameer Singh, PhD
Research School of Informatics, Loughborough University, Loughborough, UK

ISSN 1617-7916
ISBN 978-1-84996-097-7
e-ISBN 978-1-84996-098-4
DOI 10.1007/978-1-84996-098-4
Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2010920369

© Springer-Verlag London Limited 2005, 2010

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Preface

Preface to the Second Edition

Since the introduction of support vector machines, we have witnessed huge developments in the theory, models, and applications of what are called kernel-based methods: advances in generalization theory, kernel classifiers and regressors and their variants, various feature selection and extraction methods, and a wide variety of applications such as pattern classification and regression in biology, medicine, and chemistry, as well as computer science.

In Support Vector Machines for Pattern Classification, Second Edition, I try to reflect the development of kernel-based methods since 2005. In addition, I have included a more intensive performance comparison of classifiers and regressors, added new references, and corrected many errors in the first edition. The major modifications of, and additions to, the first edition are as follows:

Symbols: I have changed the symbol of the mapping function to the feature space from g(x) to the more commonly used φ(x) and its associated kernel from H(x, x′) to K(x, x′).

1.3 Data Sets Used in the Book: I have added publicly available two-class data sets, microarray data sets, multiclass data sets, and regression data sets.

1.4 Classifier Evaluation: Evaluation criteria for classifiers and regressors are discussed.

2.3.2 Kernels: Mahalanobis kernels, graph kernels, etc., are added.

2.3.6 Empirical Feature Space: The high-dimensional feature space is treated implicitly via kernel tricks. This is an advantage and also a disadvantage because we treat the feature space without knowing its structure. The empirical feature space is equivalent to the feature space in that it gives the same kernel value as that of the feature space. The introduction of the empirical feature space greatly enhances the interpretability and manipulability of the feature space.
2.8.4 Effect of Model Selection by Cross-Validation: In realizing high generalization ability of a support vector machine, selection of kernels and their parameter values, i.e., model selection, is very important. Here I discuss how cross-validation, one of the most widely used model selection methods, works to generate a support vector machine with high generalization ability (a small illustrative sketch follows this list).

3.4 I have deleted the section "Sophisticated Architecture" because it does not work.

4.3 Sparse Support Vector Machines: Based on the idea of the empirical feature space, sparse support vector machines, which realize smaller numbers of support vectors than support vector machines do, are discussed.

4.4 Performance Comparison of Different Classifiers: The performance of some types of support vector machines is compared using benchmark data sets.

4.8 Learning Using Privileged Information: Incorporating prior knowledge into support vector machines is very useful in improving the generalization ability. Here, one such approach proposed by Vapnik is explained.

4.9 Semi-supervised Learning: I have explained the difference between semi-supervised learning and transductive learning.

4.10 Multiple Classifier Systems: Committee machines in the first edition are renamed, and new material is added.

4.11 Multiple Kernel Learning: A weighted sum of kernels with positive weights is also a kernel and is called a multiple kernel. A learning method for support vector machines with multiple kernels is discussed.

5.6 Steepest Ascent Methods and Newton's Methods: Steepest ascent methods in the first edition are renamed Newton's methods, and steepest ascent methods are explained in Section 5.6.1.

5.7 Batch Training by Exact Incremental Training: A batch training method based on incremental training is added.

5.8 Active Set Training in Primal and Dual: Training methods in the primal or dual form by variable-size chunking are added.

5.9 Training of Linear Programming Support Vector Machines: Three decomposition techniques for linear programming support vector machines are discussed.

6 Kernel-Based Methods: Chapter 8, Kernel-Based Methods, in the first edition is placed just after Chapter 5, Training Methods, and kernel discriminant analysis is added.

11.5.3 Active Set Training: Active set training discussed in Section 5.8 is extended to function approximation.

11.7 Variable Selection: Variable selection for support vector regressors is added.
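To make the role of cross-validation in model selection (Section 2.8.4) concrete, the following is a minimal sketch using scikit-learn; the library, the data set, and the parameter grid are illustrative assumptions on my part, not the tools or benchmark data used in this book.

```python
# Minimal sketch of model selection by cross-validation for an RBF-kernel SVM.
# Assumptions: scikit-learn is available; the data set and parameter grid are
# placeholders, not the benchmarks or parameter ranges used in the book.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate values of the margin parameter C and the RBF kernel width gamma.
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.001, 0.01, 0.1]}
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Five-fold cross-validation selects the (C, gamma) pair with the best estimated
# generalization ability; the selected model is then refit on all training data.
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_train, y_train)
print("selected parameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

The point of the sketch is only the mechanism: each candidate kernel parameter setting is scored by its average validation accuracy over the folds, and the setting with the best score is used for the final classifier.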
Preface to the First Edition

I was shocked to see a student's report on performance comparisons between support vector machines (SVMs) and fuzzy classifiers that we had developed with our best endeavors. Classification performance of our fuzzy classifiers was comparable, but in most cases inferior, to that of support vector machines. This tendency was especially evident when the numbers of class data were small. I shifted my research efforts from developing fuzzy classifiers with high generalization ability to developing support vector machine-based classifiers.

This book focuses on the application of support vector machines to pattern classification. Specifically, we discuss the properties of support vector machines that are useful for pattern classification applications, several multiclass models, and variants of support vector machines. To clarify their applicability to real-world problems, we compare the performance of most models discussed in the book using real-world benchmark data. Readers interested in the theoretical aspects of support vector machines should refer to books such as [1–4].

Three-layer neural networks are universal classifiers in that they can classify any labeled data correctly if there are no identical data in different classes [5, 6]. In training multilayer neural network classifiers, network weights are usually corrected so that the sum-of-squares error between the network outputs and the desired outputs is minimized. But because the decision boundaries between classes acquired by training are not directly determined, classification performance for unknown data, i.e., the generalization ability, depends on the training method. And it degrades greatly when the number of training data is small and there is no class overlap.

On the other hand, in training support vector machines the decision boundaries are determined directly from the training data so that the separating margins of the decision boundaries are maximized in the high-dimensional space called the feature space. This learning strategy, based on statistical learning theory developed by Vapnik [1, 2], minimizes the classification errors of the training data and the unknown data. Therefore, the generalization abilities of support vector machines and other classifiers differ significantly, especially when the number of training data is small. This means that if some mechanism to maximize the margins of decision boundaries is introduced to non-SVM-type classifiers, their performance degradation will be prevented when the class overlap is scarce or nonexistent.1

1 To improve the generalization ability of a classifier, a regularization term, which controls the complexity of the classifier, is added to the objective function.

In the original support vector machine, an n-class classification problem is converted into n two-class problems, and in the ith two-class problem we determine the optimal decision function that separates class i from the remaining classes.
In classification, if one of the n decision functions classifies an unknown data sample into a definite class, it is classified into that class. In this formulation, if more than one decision function classifies a data sample into a definite class, or if no decision function classifies the data sample into a definite class, the data sample is unclassifiable (a small code sketch of this rule appears at the end of this preface).

Another problem of support vector machines is slow training. Because support vector machines are trained by solving a quadratic programming problem with the number of variables equal to the number of training data, training is slow for a large number of training data.

To resolve unclassifiable regions for multiclass support vector machines, we propose fuzzy support vector machines and decision-tree-based support vector machines. To accelerate training, in this book we discuss two approaches: selection of important data for training support vector machines before training, and training by decomposing the optimization problem into two subproblems. To improve the generalization ability of non-SVM-type classifiers, we introduce the ideas of support vector machines into those classifiers: neural network training that incorporates margin maximization and a kernel version of a fuzzy classifier with ellipsoidal regions [6, pp. 90–93, 119–139].

In Chapter 1, we discuss two types of decision functions: direct decision functions, in which the class boundary is given by the curve where the decision function vanishes, and indirect decision functions, in which the class boundary is given by the curve where two decision functions take on the same value.

In Chapter 2, we discuss the architecture of support vector machines for two-class classification problems. First we explain hard-margin support vector machines, which are used when the classification problem is linearly separable, namely, when the training data of the two classes are separated by a single hyperplane. Then, introducing slack variables for the training data, we extend hard-margin support vector machines so that they are applicable to inseparable problems. There are two types of support vector machines: L1 soft-margin support vector machines and L2 soft-margin support vector machines. Here, L1 and L2 denote the linear sum and the square sum of the slack variables that are added to the objective function for training. Then we investigate the characteristics of the solutions extensively and survey several techniques for estimating the generalization ability of support vector machines.
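To make the L1/L2 distinction concrete, the following is the standard primal formulation of the two soft-margin models for M training pairs (x_i, y_i) with y_i in {+1, -1}; the notation here is generic and may differ in detail from that used in the book.

```latex
% L1 soft-margin SVM: the linear sum of the slack variables \xi_i is penalized.
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;
  \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{M}\xi_i
\quad \text{s.t.}\quad
  y_i\bigl(\mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_i)+b\bigr) \ge 1-\xi_i,\;\;
  \xi_i \ge 0,\;\; i=1,\dots,M.

% L2 soft-margin SVM: the square sum of the slack variables is penalized;
% the nonnegativity constraints on \xi_i become redundant and can be dropped.
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;
  \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{2}\sum_{i=1}^{M}\xi_i^2
\quad \text{s.t.}\quad
  y_i\bigl(\mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}_i)+b\bigr) \ge 1-\xi_i,\;\; i=1,\dots,M.
```

Here φ(x) is the mapping to the feature space and the margin parameter C > 0 controls the trade-off between maximizing the margin and tolerating training errors.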
In Chapter 3, we discuss some methods for multiclass problems: one-against-all support vector machines, in which each class is separated from the remaining classes; pairwise support vector machines, in which one class is separated from another class; the use of error-correcting output codes for resolving unclassifiable regions; and all-at-once support vector machines, in which the decision functions for all the classes are determined at once. To resolve unclassifiable regions, in addition to error-correcting codes, we discuss fuzzy support vector machines with membership functions and decision-tree-based support vector machines. To compare several methods for multiclass problems, we show performance evaluation of these methods for the benchmark data sets.

Since support vector machines were proposed, many variants of support vector machines have been developed. In Chapter 4, we discuss some of them: least-squares support vector machines, whose training results in solving a set of linear equations; linear programming support vector machines; robust support vector machines; and so on.

In Chapter 5, we discuss some training methods for support vector machines. Because we need to solve a quadratic optimization problem with the number of variables equal to the number of training data, it is impractical to solve a problem with a huge number of training data. For example, for 10,000 training data, 800 MB of memory is necessary to store the Hessian matrix in double precision. Therefore, several methods have been developed to speed up training. One approach reduces the number of training data by preselecting the training data. The other is to speed up training by decomposing the problem into two subproblems and repeatedly solving one subproblem while fixing the other, exchanging the variables between the two subproblems.

Optimal selection of features is important in realizing high-performance classification systems. Because support vector machines are trained so that the margins are maximized, they are said to be robust against nonoptimal features. In Chapter 7, we discuss several methods for selecting optimal features and show, using some benchmark data sets, that feature selection is important even for support vector machines. Then we discuss feature extraction, which transforms input features by linear and nonlinear transformations.

Some classifiers need clustering of training data before training. But support vector machines do not require clustering because mapping into a feature space results in clustering in the input space. In Chapter 8, we discuss how we can realize support vector machine-based clustering.

One of the features of support vector machines is that, by mapping the input space into the feature space, nonlinear separation of class data is realized. Thus conventional linear models become nonlinear if the linear models are formulated in the feature space. They are usually called kernel-based methods. In Chapter 6, we discuss typical kernel-based methods: kernel least squares, kernel principal component analysis, and the kernel Mahalanobis distance.

The concept of maximum margins can be used for conventional classifiers to enhance generalization ability. In Chapter 9, we discuss methods for maximizing margins of multilayer neural networks, and in Chapter 10 we discuss maximum-margin fuzzy classifiers with ellipsoidal regions and polyhedral regions.

Support vector machines can be applied to function approximation. In Chapter 11, we discuss how to extend support vector machines to function approximation and compare the performance of the support vector machine with that of other function approximators.
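As a small illustration of the one-against-all rule and its unclassifiable regions described earlier in this preface, the following sketch is hypothetical code (the function name and decision values are invented for illustration, not taken from the book): a sample is assigned to class i only when the ith decision function alone is positive and is reported as unclassifiable otherwise, which is exactly the situation the fuzzy and decision-tree-based variants of Chapter 3 are designed to resolve.

```python
import numpy as np

def one_against_all_predict(decision_values):
    """Classify samples from the decision values of n one-against-all SVMs.

    decision_values: array of shape (n_samples, n_classes); entry (k, i) is the
    value of the ith decision function for the kth sample. A sample is assigned
    to class i only when the ith decision function alone is positive; if none or
    more than one is positive, the sample is unclassifiable (labeled -1 here).
    """
    decision_values = np.asarray(decision_values)
    positive = decision_values > 0
    labels = np.full(decision_values.shape[0], -1)
    definite = positive.sum(axis=1) == 1           # exactly one positive decision function
    labels[definite] = positive[definite].argmax(axis=1)
    return labels

# Example with three classes and four samples (values are made up).
D = np.array([[ 1.2, -0.3, -0.8],    # only D_0 positive: class 0
              [-0.5,  0.7, -0.1],    # only D_1 positive: class 1
              [ 0.4,  0.6, -0.9],    # two positive values: unclassifiable
              [-0.2, -0.4, -0.6]])   # no positive value: unclassifiable
print(one_against_all_predict(D))    # prints: [ 0  1 -1 -1]
```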
Acknowledgments

I am grateful to those who are involved in the research project, conducted at the Graduate School of Engineering, Kobe University, on neural, fuzzy, and support vector machine-based classifiers and function approximators, for their efforts in developing new methods and programs. Discussions with Dr. Seiichi Ozawa were always helpful. Special thanks are due to them and to current graduate and undergraduate students: T. Inoue, K. Sakaguchi, T. Takigawa, F. Takahashi, Y. Hirokawa, T. Nishikawa, K. Kaieda, Y. Koshiba, D. Tsujinishi, Y. Miyamoto, S. Katagiri, T. Yamasaki, T. Kikuchi, K. Morikawa, Y. Kamata, M. Ashihara, Y. Torii, N. Matsui, Y. Hosokawa, T. Nagatani, K. Morikawa, T. Ishii, K. Iwamura, T. Kitamura, S. Muraoka, S. Takeuchi, Y. Tajiri, and R. Yabuwaki; and a then PhD student, T. Ban.

I thank A. Ralescu for having used my draft version of the book as a graduate course text and for having given me many useful comments. Thanks are also due to V. Kecman, H. Nakayama, S. Miyamoto, J. A. K. Suykens, F. Anouar, G. C. Cawley, H. Motoda, A. Inoue, F. Schwenker, N. Kasabov, B.-L. Lu, and J. T. Dearmon for their valuable discussions and useful comments.

This book includes my students' and my papers published in journals and proceedings of international conferences. I must thank the many anonymous reviewers of our papers for their constructive comments and for pointing out errors and missing references, which improved the papers and, consequently, this book.

The Internet was a valuable source of information in writing the book. In preparing the second edition, I extensively used SCOPUS (https://www.scopus.com/home.url) to check well-cited papers and tried to include those papers that are relevant to my book. Most of the papers listed in References were obtained from the Internet, from either authors' home pages or free downloadable sites such as

ESANN: http://www.dice.ucl.ac.be/esann/proceedings/electronicproceedings.htm
JMLR: http://jmlr.csail.mit.edu/papers/