1.3 Data Sets Used in the Book

Table 1.2 Benchmark data sets for microarray problems

Data                            Inputs   Training data   Test data   Classes
Breast cancer (1) [25]           3,226              14           8         2
Breast cancer (2) [25]           3,226              14           8         2
Breast cancer (3) [26]          24,188              78          19         2
Breast cancer (s) [25]           3,226              14           8         2
Colon cancer [27]                2,000              40          20         2
Hepatocellular carcinoma [28]    7,129              33          27         2
High-grade glioma [29]          12,625              21          29         2
Leukemia [30]                    7,129              38          34         2
Prostate cancer [31]            12,600             102          34         2

The thyroid data [35, 36] include 15 digital features, and more than 92% of the data belong to one class. Thus a recognition rate lower than 92% is useless, because a classifier that always predicts the majority class already reaches that rate.

The blood cell classification [37] involves classifying optically screened white blood cells into 12 classes using 13 features. This is a very difficult problem; class boundaries for some classes are ambiguous because the classes are defined according to the growth stages of white blood cells.

Hiragana-50 and hiragana-105 data [38, 7] were gathered from Japanese license plates. The original grayscale images of hiragana characters were transformed into (5 × 10)-pixel and (7 × 15)-pixel images, respectively, with the grayscale range being from 0 to 255. The training and test data were then generated by applying grayscale shift, position shift, and random noise addition to the images. To reduce the number of input variables of the hiragana-105 data (7 × 15 = 105), the hiragana-13 data [38, 7] were generated by calculating the 13 central moments for the (7 × 15)-pixel images [39, 38].

The Satimage data [36] have 36 inputs: 3 × 3 pixels, each with four spectral values, in a satellite image. The task is to classify the center pixel into one of six classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil.

The USPS data [40] are handwritten numerals in (16 × 16)-pixel grayscale images. They were scanned from envelopes by the United States Postal Service.

The MNIST data [41, 42] are handwritten numerals consisting of (28 × 28)-pixel inputs with 256 grayscale levels; they are often used to compare the performance of support vector machines and other classifiers.

Table 1.4 lists the data sets for function approximation used in the book. For all the problems in the table, the number of outputs is 1.

The Mackey–Glass differential equation [43] generates time series data with a chaotic behavior and is given by

\frac{dx(t)}{dt} = \frac{0.2\, x(t - \tau)}{1 + x^{10}(t - \tau)} - 0.1\, x(t),    (1.22)
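As an illustration of how a series can be generated from (1.22), the following is a minimal sketch that integrates the equation with the forward Euler method. It is not the procedure used to produce the benchmark data: the function name `mackey_glass`, the delay τ = 17 (a value often used in the literature to obtain chaotic behavior), the step size `dt = 0.1`, and the constant initial history `x0 = 1.2` are all assumptions, since none of them is given in the text above.

```python
def mackey_glass(n_steps, tau=17.0, dt=0.1, x0=1.2):
    """Forward-Euler integration of
    dx/dt = 0.2 x(t - tau) / (1 + x(t - tau)**10) - 0.1 x(t).

    tau, dt, and x0 are illustrative assumptions, not values from the text.
    """
    delay = int(round(tau / dt))  # delay expressed in integration steps
    # Assumed constant history x(t) = x0 on the interval [-tau, 0].
    x = [x0] * (delay + 1)
    for _ in range(n_steps):
        x_tau = x[-1 - delay]     # delayed value x(t - tau)
        x_now = x[-1]             # current value x(t)
        dx = 0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x_now
        x.append(x_now + dt * dx)
    # Drop the artificial history; keep x(0), x(dt), ..., x(n_steps * dt).
    return x[delay:]


series = mackey_glass(n_steps=5000)
```

Sampling `series` at a fixed interval (for example every tenth value, i.e., one time unit) gives a scalar time series. A smaller `dt` or a higher-order integrator would reduce the discretization error of this sketch.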