K.-j. Kim, H. Ahn / Expert Systems with Applications 34 (2008) 1200–1209

In this paper, we try to apply hybrid K-means clustering and genetic algorithms to carry out an exploratory segmentation of an online shopping market. To find the most effective clustering method for this kind of data, we adopt a number of clustering methods and compare their performance using our suggested performance criteria. In addition, we validate the usefulness of our proposed model in a real-world application.

The rest of this paper is organized as follows: the next section reviews two traditional clustering algorithms, K-means and the self-organizing map (SOM), along with the performance criteria. Section 3 proposes the GA approach to optimizing K-means clustering, and Section 4 describes the data and the experiments; the empirical results are also summarized and discussed there. In the final section, conclusions and the limitations of this study are presented.
2. Clustering algorithms

Cluster analysis is an effective tool in scientific or managerial inquiry. It groups a set of data in a d-dimensional feature space so as to maximize the similarity within clusters and minimize the similarity between different clusters. Various clustering methods exist and are currently in wide use. Among them, we apply two popular methods, K-means and SOM, as well as a novel hybrid method, to market segmentation. Before briefly describing each method, the following assumptions are necessary: a given population consists of n elements described by m attributes, and it is partitioned into K clusters. X_i = (x_i1, x_i2, ..., x_im) represents the vector of the m attributes of element i.

2.1. K-means clustering algorithm

The K-means method is a widely used clustering procedure that searches for a nearly optimal partition with a fixed number of clusters. It uses an iterative hill-climbing algorithm. The process of K-means clustering is as follows:

(1) The initial seeds, with the chosen number of clusters K, are selected, and an initial partition is built by using the seeds as the centroids of the initial clusters.
(2) Each record is assigned to the nearest centroid, thus forming a cluster.
(3) Keeping the same number of clusters, the new centroid of each cluster is calculated.
(4) Iterate Steps (2) and (3) until the clusters stop changing or the stop conditions are satisfied.

The K-means algorithm has been popular because of its ease and simplicity of application. However, it also has some drawbacks. First, it does not deal well with overlapping clusters, and clusters can be pulled off center by outliers. Also, the clustering result may depend on the initial seeds, yet there is no mechanism to optimize them. In addition, K-means may converge to a local minimum under certain conditions because it works as a hill-climbing strategy.
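The four steps above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the function name `kmeans`, the `init` parameter, and the random seeding of initial centroids are our own assumptions.

```python
import numpy as np

def kmeans(X, K, max_iter=100, init=None, seed=0):
    """Plain K-means following Steps (1)-(4) of Section 2.1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial seeds as the first centroids
    # (random records here, unless explicit seeds are supplied).
    centroids = X[rng.choice(len(X), size=K, replace=False)] if init is None else init.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each record to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each cluster's centroid (keep old one if a cluster empties).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 4: stop when the clusters (centroids) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Passing explicit seeds through `init` makes the dependence on initial seeds, noted as a drawback above, easy to observe: different seed sets can yield different final partitions.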
As described in the next section, a GA can find optimal clusters by minimizing an extrinsic clustering metric.

2.2. Self-organizing map (SOM)

SOM is a clustering algorithm based on an unsupervised neural network model. Since it was suggested by Kohonen (1982), it has been applied in many studies because of its good performance. The basic SOM model consists of two layers, an input layer and an output layer. When the training set is presented to the network, the values flow forward through the network to the units in the output layer. The neurons in the output layer are arranged in a grid, and the units in the output layer compete with each other; the one with the highest value wins. The process of SOM is as follows:

(1) Initialize the weights to small random numbers.
(2) Put an input sample, X_i, into the SOM network, and calculate the distances between the weight vectors, W_j = (w_j1, w_j2, ..., w_jm), and the input sample, X_i. Then select the neuron whose distance to X_i is the shortest, as in Eq. (1); the selected neuron is called the 'winner':

    Min D(j) = Σ_i (w_ji − x_i)²    (1)

(3) Update the weights of the winner as well as its neighbors by the following equation, where a is the learning rate:

    W_j^NEW = W_j + a‖X_i − W_j‖    (2)

(4) Iterate Steps (2) and (3) until the stop criterion is satisfied.

In spite of several excellent applications, SOM has some limitations that hinder its performance. In particular, just like other neural-network-based algorithms, SOM has no mechanism for determining the number of clusters, the initial weights, or the stopping conditions.

2.3. Criteria for performance comparison of clustering algorithms

We compare the performance of the clustering algorithms using 'intraclass inertia'. Intraclass inertia is a measure of how compact each cluster is in the m-dimensional space of numeric attributes. It is the average of the distances between the cluster means and the observations in each cluster.
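Steps (1)–(4) of the SOM procedure can be sketched as below. This is a minimal illustrative version, not the paper's setup: the grid shape, epoch count, decaying learning rate and neighborhood radius are our own assumptions, and the weight update uses the common vector form a·(X_i − W_j) rather than the norm written in Eq. (2).

```python
import numpy as np

def train_som(X, grid_shape=(3, 3), epochs=10, lr=0.5, seed=0):
    """Minimal SOM training loop following Steps (1)-(4) of Section 2.2 (sketch)."""
    rng = np.random.default_rng(seed)
    n_units = grid_shape[0] * grid_shape[1]
    m = X.shape[1]
    # Step 1: initialize the weights to small random numbers.
    W = rng.normal(scale=0.1, size=(n_units, m))
    # Grid coordinates of each output neuron, used to find neighbors.
    coords = np.array([(r, c) for r in range(grid_shape[0])
                               for c in range(grid_shape[1])], dtype=float)
    for epoch in range(epochs):
        a = lr * (1 - epoch / epochs)               # decaying learning rate
        radius = max(grid_shape) * (1 - epoch / epochs)  # shrinking neighborhood
        for x in X:
            # Step 2: the winner minimizes D(j) = sum_i (w_ji - x_i)^2, as in Eq. (1).
            j = np.sum((W - x) ** 2, axis=1).argmin()
            # Step 3: move the winner and its grid neighbors toward the sample.
            grid_dist = np.linalg.norm(coords - coords[j], axis=1)
            mask = grid_dist <= radius
            W[mask] += a * (x - W[mask])
        # Step 4: iterate until the stop criterion (here, a fixed epoch budget).
    return W
```

Note that the grid shape, epoch budget, and decay schedule must all be fixed by the user in advance, which is exactly the limitation discussed above: SOM itself provides no mechanism to choose them.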
Eq. (3) represents the intraclass inertia for the given K clusters (Michaud, 1997):

    F(K) = (1/n) Σ_K n_K I_K = (1/n) Σ_K Σ_{i∈C_K} Σ_{p=1}^{m} (X_ip − X̄_Kp)²    (3)
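Eq. (3) can be computed directly: sum the squared deviations of each observation from its cluster mean over all attributes and all clusters, then divide by n. The function name `intraclass_inertia` below is our own; this is an illustrative sketch of the criterion, assuming squared Euclidean deviations as in Eq. (3).

```python
import numpy as np

def intraclass_inertia(X, labels):
    """Intraclass inertia per Eq. (3): (1/n) * sum over clusters K of
    sum_{i in C_K} sum_p (X_ip - mean_Kp)^2. Lower values mean tighter clusters."""
    n = len(X)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]           # C_K: observations in cluster k
        center = members.mean(axis=0)      # cluster mean over the m attributes
        total += np.sum((members - center) ** 2)
    return total / n
```

Since lower intraclass inertia indicates more compact clusters, the clustering algorithms compared in this paper can be ranked by evaluating this quantity on their resulting partitions.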