Expert Systems with Applications 34 (2008) 1200–1209

A recommender system using GA K-means clustering in an online shopping market

Kyoung-jae Kim a,*, Hyunchul Ahn b

a Department of Management Information Systems, Dongguk University, 3-26 Pil-Dong, Jung-Gu, Seoul 100-715, South Korea
b Graduate School of Management, Korea Advanced Institute of Science and Technology, 207-43 Cheongrangri-Dong, Dongdaemun-Gu, Seoul 130-722, South Korea

* Corresponding author. Tel.: +82 2 2260 3324; fax: +82 2 2260 3684. E-mail address: kjkim@dongguk.edu (K.-j. Kim).

Abstract

The Internet is emerging as a new marketing channel, so understanding the characteristics of online customers' needs and expectations is considered a prerequisite for activating the consumer-oriented electronic commerce market. In this study, we propose a novel clustering algorithm based on genetic algorithms (GAs) to effectively segment the online shopping market. In general, GAs are believed to be effective on NP-complete global optimization problems, and they can provide good near-optimal solutions in reasonable time. Thus, we believe that a clustering technique with GA can provide a way of finding the relevant clusters more effectively. The research in this paper applied K-means clustering whose initial seeds are optimized by GA, which is called GA K-means, to a real-world online shopping market segmentation case. In this study, we compared the results of GA K-means to those of a simple K-means algorithm and self-organizing maps (SOM). The results showed that GA K-means clustering may improve segmentation performance in comparison to other typical clustering algorithms. In addition, our study validated the usefulness of the proposed model as a preprocessing tool for recommendation systems.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Recommender system; Genetic algorithms; Self-organizing maps; Market segmentation; Case-based reasoning

1. Introduction

Since the Internet has become popular, the consumer-oriented electronic commerce market has grown so huge that now companies are convinced of the importance of understanding this new emerging market. It is becoming more important for the companies to analyze and understand the needs and expectations of their online users or customers because the Internet is one of the most effective media to gather, disseminate and utilize the information about its users or customers. Thus, it is easier to extract knowledge out of the shopping process to create new business opportunities under the Internet environment.

Market segmentation is one of the ways in which such knowledge can be represented. It attempts to discover the classes in which the consumers can be naturally grouped, according to the information available (Velido, Lisboa, & Meehan, 1999). It can be the basis for effective targeting and predicting prospects through the identification of the proper segments. Although much of the marketing literature has proposed various market segmentation techniques, clustering techniques are frequently used in practice (Wedel & Kamakura, 1998). In addition, K-means clustering is the most frequently used market segmentation technique among the clustering techniques (Gehrt & Shim, 1998; Kuo, Chang, & Chien, 2004). However, the major drawback of K-means clustering is that it often falls in local optima and the result largely depends on the initial cluster centers. Prior studies pointed out this limitation and tried to integrate K-means clustering and global search techniques including genetic algorithms (see Babu & Murty, 1993; Kuo, Liao, & Tu, 2005; Maulik & Bandyopadhyay, 2000; Murthy & Chowdhury, 1996; Pena, Lozano, & Larranaga, 1999).
In this paper, we try to apply hybrid K-means clustering and genetic algorithms to carry out an exploratory segmentation of an online shopping market. To find the most effective clustering method for those kinds of data, we adopt a number of clustering methods and compare the performance of each clustering method by using our suggested performance criteria. In addition, we validate the usefulness of our proposed model in a real-world application.

The rest of this paper is organized as follows: the next section reviews two traditional clustering algorithms, K-means and self-organizing map (SOM), along with the performance criteria. Section 3 proposes the GA approach to optimize the K-means clustering, and Section 4 describes the data and the experiments; in that section, the empirical results are also summarized and discussed. In the final section, conclusions and the limitations of this study are presented.

2. Clustering algorithms

Cluster analysis is an effective tool in scientific or managerial inquiry. It groups a set of data in d-dimensional feature space to maximize the similarity within the clusters and minimize the similarity between two different clusters. There are various clustering methods, and they are currently widely used. Among them, we apply two popular methods, K-means and SOM, and a novel hybrid method to market segmentation. Before providing a brief description of each method, the following assumptions are necessary. A given population consists of n elements described by m attributes, and it is partitioned into K clusters. X_i = (x_{i1}, x_{i2}, ..., x_{im}) represents the vector of the m attributes of element i.

2.1. K-means clustering algorithm

The K-means method is a widely used clustering procedure that searches for a nearly optimal partition with a fixed number of clusters. It uses an iterative hill-climbing algorithm. The process of K-means clustering is as follows (a code sketch appears after this subsection):

(1) The initial seeds with the chosen number of clusters, K, are selected, and an initial partition is built by using the seeds as the centroids of the initial clusters.
(2) Each record is assigned to the nearest centroid, thus forming a cluster.
(3) Keeping the same number of clusters, the new centroid of each cluster is calculated.
(4) Steps (2) and (3) are iterated until the clusters stop changing or the stop conditions are satisfied.

The K-means algorithm has been popular because of its ease and simplicity of application. However, it also has some drawbacks. First, it does not deal well with overlapping clusters, and the clusters can be pulled out of center by outliers. Also, the clustering result may depend on the initial seeds, but there is no mechanism to optimize the initial seeds. In addition, it may converge to a local minimum under certain conditions because it works as a hill-climbing strategy. As described in the next section, GA can find optimal clusters by minimizing the extrinsic clustering metric.
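As a concrete illustration of steps (1)–(4), the following is a minimal K-means sketch in Python with NumPy. It is a generic reconstruction, not the authors' implementation (they used SPSS and Excel VBA, Section 4.1); the toy data and the random choice of initial seeds are placeholders.

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Plain K-means: random initial seeds, then alternate assignment/update."""
    rng = np.random.default_rng(rng)
    # Step (1): pick k random records as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step (2): assign each record to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (3): recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step (4): stop when the clusters no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage with synthetic data (placeholder for the shopping-mall features).
X = np.random.default_rng(0).random((100, 5))
labels, centroids = kmeans(X, k=5, rng=0)
```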
2.2. Self-organizing map (SOM)

SOM is a clustering algorithm based on the unsupervised neural network model. Since it was suggested by Kohonen (1982), it has been applied in many studies because of its good performance. The basic SOM model consists of two layers, an input layer and an output layer. When the training set is presented to the network, the values flow forward through the network to the units in the output layer. The neurons in the output layer are arranged in a grid, and the units in the output layer compete with each other; the one with the highest value wins.

The process of SOM is as follows (a code sketch appears after this subsection):

(1) Initialize the weights as small random numbers.
(2) Put an input sample, X_i, into the SOM network, and calculate the distances between the weight vectors, W_j = (w_{j1}, w_{j2}, ..., w_{jm}), and the input sample, X_i. Then, select the neuron whose distance from X_i is the shortest, as in Eq. (1); the selected neuron is called the 'winner':

    \min_j D(j) = \sum_i (w_{ji} - x_i)^2    (1)

(3) Update the weights of the winner as well as its neighbors by the following equation, where \alpha is the learning rate:

    W_j^{NEW} = W_j + \alpha \| X_i - W_j \|    (2)

(4) Iterate Steps (2) and (3) until the stop criterion is satisfied.

In spite of several excellent applications, SOM has some limitations that hinder its performance. Especially, just like other neural network based algorithms, SOM has no mechanism to determine the number of clusters, the initial weights, and the stopping conditions.
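A compact sketch of this procedure, under assumed settings (a 5 x 5 output grid, a radius-1 neighborhood, a decaying learning rate), may make steps (1)–(4) concrete. The update is implemented in the conventional vector form W_j <- W_j + alpha * (X_i - W_j); this is an illustration, not the authors' NeuroShell configuration.

```python
import numpy as np

def train_som(X, n_units=25, alpha=0.5, epochs=10, rng=None):
    """SOM sketch: competitive winner selection plus a neighborhood update."""
    rng = np.random.default_rng(rng)
    # Step (1): initialize the weights as small random numbers.
    W = rng.random((n_units, X.shape[1])) * 0.1
    side = int(np.sqrt(n_units))  # output units arranged on a square grid
    coords = np.array([(i // side, i % side) for i in range(n_units)])
    for _ in range(epochs):
        for x in X:
            # Step (2): the winner is the unit whose weights are nearest (Eq. (1)).
            winner = np.argmin(((W - x) ** 2).sum(axis=1))
            # Step (3): move the winner and its grid neighbors toward the input.
            neighbors = np.abs(coords - coords[winner]).sum(axis=1) <= 1
            W[neighbors] += alpha * (x - W[neighbors])
        alpha *= 0.9  # assumed learning-rate decay schedule
    return W

W = train_som(np.random.default_rng(1).random((100, 5)))
```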
2.3. Criteria for performance comparison of clustering algorithms

We compare the performance of the clustering algorithms using 'intraclass inertia'. Intraclass inertia is a measure of how compact each cluster is in the m-dimensional space of numeric attributes. It is the average of the distances between the means and the observations in each cluster. Eq. (3) gives the intraclass inertia for the given K clusters (Michaud, 1997):

    F(K) = \frac{1}{n} \sum_{K} n_K I_K = \frac{1}{n} \sum_{K} \sum_{i \in C_K} \sum_{P=1}^{m} (X_{iP} - \bar{X}_{KP})^2    (3)

where n is the number of total observations, C_K is the set of the Kth cluster, X_{iP} is the value of attribute P for observation i, and \bar{X}_{KP} is the mean of attribute P in the Kth cluster, that is, \bar{X}_{KP} = \frac{1}{n_K} \sum_{i \in C_K} X_{iP}.
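Eq. (3) translates directly into code. The sketch below assumes `X` holds the observations row-wise and `labels` holds the cluster assignment produced by any of the algorithms above.

```python
import numpy as np

def intraclass_inertia(X, labels):
    """Eq. (3): mean over all n observations of the squared distance
    between each observation and its own cluster centroid."""
    n = len(X)
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]         # C_K, the records of cluster K
        centroid = members.mean(axis=0)  # X_bar_KP for every attribute P
        total += ((members - centroid) ** 2).sum()
    return total / n

# Lower inertia means more compact clusters, which is why GA K-means
# later uses it as the fitness to minimize.
X = np.random.default_rng(2).random((100, 5))
labels = np.random.default_rng(2).integers(0, 5, size=100)
print(intraclass_inertia(X, labels))
```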
3. GA K-means clustering algorithm

As indicated in Section 2.1, the K-means algorithm does not have any mechanism for choosing appropriate initial seeds. However, selecting different initial seeds may generate huge differences in clustering results, especially when the target sample contains many outliers. In addition, random selection of initial seeds often causes the clustering quality to fall into local optimization (Bradley & Fayyad, 1998). So, it is very important to select appropriate initial seeds in the traditional K-means clustering method. To overcome this critical limitation, we propose GA as the optimization tool for the initial seeds in the K-means algorithm. We call our proposed model the GA K-means clustering method.

Genetic algorithms are stochastic search techniques that can search large and complicated spaces. They are based on biological principles, including natural genetics and evolution. In particular, GAs are suitable for parameter optimization problems with an objective function subject to various hard and soft constraints (Shin & Han, 1999). The GA basically explores a complex space in an adaptive way, guided by the biological evolution mechanisms of selection, crossover, and mutation. This algorithm uses natural selection – survival of the fittest – to solve optimization problems (Kim, 2004).

The first step of the GA is problem representation. The problem must be represented in a suitable form to be handled by the GA. Thus, the problem is described in terms of genetic code, like DNA chromosomes; the GA often works with a form of binary coding. Once the problem is coded as chromosomes, a population – a set of chromosomes – is initialized. Each chromosome within a population gradually evolves through biological operations. There are no general rules for determining the population size, but population sizes of 100–200 are commonly used in GA research. Once the population size is chosen, the initial population is randomly generated (Bauer, 1994). After the initialization step, each chromosome is evaluated by a fitness function. According to the value of the fitness function, the chromosomes associated with the fittest individuals will be reproduced more often than those associated with unfit individuals (Davis, 1994).

The GA works with three operators that are iteratively used. The selection operator determines which individuals may survive (Hertz & Kobler, 2000). The crossover operator allows the search to fan out in diverse directions looking for attractive solutions, and permits chromosomal material from different parents to be combined in a single child. There are three popular crossover methods: single-point, two-point, and uniform. The single-point crossover makes only one cut in each chromosome and selects two adjacent genes on the chromosome of a parent; the two-point crossover involves two cuts in each chromosome; the uniform crossover allows two parent strings to produce two children and permits great flexibility in the way strings are combined. In addition, the mutation operator arbitrarily alters one or more components of a selected chromosome. Mutation randomly changes a gene on a chromosome and provides the means for introducing new information into the population. Finally, the GA tends to converge on an optimal or near-optimal solution through these operators (Wong & Tan, 1994).

The GA is popularly used to improve the performance of artificial intelligence techniques. Recently, the GA has been employed to mitigate the limitations of typical clustering algorithms, including the K-means algorithm. Lletí, Ortiz, Sarabia, and Sánchez (2004) proposed the use of an integrated genetic algorithm and K-means clustering to select relevant input variables. The method proposed in their study was based on the use of GA to select those variables important for clarifying the presence of groups in data. Other studies tried to integrate GA and the K-means algorithm to overcome the drawback of possible convergence to a local minimum solution (see Babu & Murty, 1993; Kuo, An, Wang, & Chung, 2006; Maulik & Bandyopadhyay, 2000; Murthy & Chowdhury, 1996; Pena et al., 1999).

3.1. The process of the GA K-means algorithm

GA K-means uses GA to select optimal or sub-optimal initial seeds in K-means clustering. Fig. 1 shows the overall framework of GA K-means clustering.

[Fig. 1. The overall framework of GA K-means: (1) design the structure of chromosomes and define the fitness function; (2) generate the initial population; (3) perform the K-means process according to the given initial seeds; (4) calculate the intraclass inertia as the fitness function for each chromosome; (5) apply genetic operators and produce a new generation; (6) check the stopping criteria, returning to (3) if they are not satisfied; (7) finish.]

The detailed explanation of the process of GA K-means clustering is as follows (a code sketch appears after the list):

(1) In the first step, the system generates the initial population that will be used to find the global optimum initial seeds. The values of the chromosomes for the population are initiated into random values before the search process. To enable GA to find the optimal initial seeds, we should design the structure of a chromosome properly as a form of binary strings. A detailed explanation of the chromosome design is presented in Section 3.2.

(2) The second step is the process of K-means clustering, as described in Section 2.1. In our study, the maximum number of iterations for centroid adjustment in K-means is set at 10. After the process of K-means, the value of the fitness function is updated. To find optimal initial seeds, we applied 'intraclass inertia' as the fitness function. That is, the objective of our GA K-means is to find the initial seeds for which the intraclass inertia is minimized after K-means clustering finishes.

(3) In this stage, GA performs genetic operations, such as selection, crossover and mutation, on the current population. By doing this, a new generation is produced. After that, Steps (2) and (3) are repeated until the stopping conditions are satisfied.
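A minimal sketch of this loop is given below, with scikit-learn's `KMeans` standing in for the inner K-means of step (2) (capped at `max_iter=10`, as above) and `inertia_ / n` serving as the Eq. (3) fitness. Several GA details are simplified assumptions: chromosomes are real-coded seed matrices rather than the binary strings of Section 3.2, and tournament selection with elitism stands in for a selection scheme the paper does not specify. The population size, crossover rate, mutation rate, and generation count follow Section 4.1; this is a reconstruction, not the authors' Evolver/VBA implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((300, 5))  # placeholder data, min-max scaled to [0, 1]
K, POP, GENS, CX_RATE, MUT_RATE = 5, 200, 20, 0.7, 0.1

def fitness(seeds):
    # Step (2): a short K-means run from the candidate seeds, scored by
    # intraclass inertia (Eq. (3) = total squared distance / n).
    km = KMeans(n_clusters=K, init=seeds, n_init=1, max_iter=10).fit(X)
    return km.inertia_ / len(X)

# Each chromosome encodes K candidate centroids (real-coded for brevity).
pop = rng.random((POP, K, X.shape[1]))
for _ in range(GENS):
    scores = np.array([fitness(ind) for ind in pop])
    new_pop = [pop[scores.argmin()]]  # elitism: keep the current best
    while len(new_pop) < POP:
        # Tournament selection: the fitter of two random individuals survives.
        a, b = rng.integers(POP, size=2)
        p1 = pop[a] if scores[a] < scores[b] else pop[b]
        a, b = rng.integers(POP, size=2)
        p2 = pop[a] if scores[a] < scores[b] else pop[b]
        child = p1.copy()
        if rng.random() < CX_RATE:  # uniform crossover mixes the parents
            mask = rng.random(child.shape) < 0.5
            child[mask] = p2[mask]
        mut = rng.random(child.shape) < MUT_RATE  # per-gene mutation
        child[mut] = rng.random(mut.sum())
        new_pop.append(child)
    pop = np.array(new_pop)

best = pop[np.argmin([fitness(ind) for ind in pop])]
print("intraclass inertia of best seeds:", fitness(best))
```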
3.2. Chromosome encoding

The system searches the space to find optimal or near-optimal values of the features that constitute the centroids for each cluster. To apply GA to search for these optimal values, they have to be coded on a chromosome as a form of binary strings. The structure of the chromosomes for GA K-means is presented in Fig. 2.

[Fig. 2. The structures of the chromosomes for GA K-means. The data set consists of two types of features: Type 1, binary selections (0 or 1), each encoded as a single bit, and Type 2, continuous values ranging from 0 to 1, each encoded as a 14-bit string; one such segment is encoded for each cluster's seed.]

As shown in Fig. 2, the length of each chromosome for GA K-means depends on the types of the features. In the case of the binary selection variables, whose value is 0 or 1, the length of each feature is just 1 bit. But the features that have continuous values require more bits to express them in a chromosome. Here, we set the continuous feature values to be as precise as 1/10,000. When we apply min-max normalization to the continuous features, they become values ranging from 0 to 1. In this case, 14 binary bits are required to express them with 1/10,000th precision because 8192 = 2^{13} < 10000 \le 2^{14} = 16384. These 14-bit binary numbers are transformed into decimal floating-point numbers ranging from 0 to 1 by applying Eq. (4):

    x' = \frac{x}{2^{14} - 1} = \frac{x}{16383}    (4)

where x is the decimal number of the binary code for each feature weight (Michalewicz, 1996). For example, the binary code for the value of the 37th feature of the first sample chromosome – the chromosome representing Cluster 1 – in Fig. 2 is (10110010010110)_2. Its decimal value is (11414)_10, and it is interpreted as 11414/16383 = 0.696697796 \approx 0.6967.

In this study, we use a real-world data set from an Internet shopping mall that consists of 36 selection variables and 6 continuous variables. For the design of the chromosome for this data set, (36 x 1 + 6 x 14) x K bits are required, where K is the number of clusters.
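The decoding of Eq. (4) and the worked example above can be verified with a few lines of Python; the helper name is ours, not the paper's.

```python
def decode_gene(bits: str) -> float:
    """Decode a 14-bit binary string into a float in [0, 1] via Eq. (4)."""
    x = int(bits, 2)        # binary -> decimal, e.g. (10110010010110)2 -> 11414
    return x / (2**14 - 1)  # divide by 16383

# The example from the text: feature 37 of the Cluster 1 chromosome.
print(decode_gene("10110010010110"))  # 0.696697... ~= 0.6967

# Chromosome length for this data set: 36 binary bits + 6 x 14 bits per
# cluster seed, i.e. (36*1 + 6*14) * K bits; for K = 5 that is 600 bits.
print((36 * 1 + 6 * 14) * 5)
```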
4. Experimental design and results

4.1. Experimental design

We apply three clustering algorithms – simple K-means, SOM and GA K-means – to our data. We try to segment the Internet users into 5 clusters (that is, K = 5). In the case of SOM, we set the learning rate (\alpha) at 0.5.

For the controlling parameters of the GA search, the population size is set at 200 organisms. The value of the crossover rate is set at 0.7, while the mutation rate is set at 0.1. This study performs the crossover using a uniform crossover routine. The uniform crossover method is considered better at preserving the schema, and can generate any schema from the two parents, while the single-point and two-point crossover methods may bias the search with the irrelevant position of the features. For the mutation method, this study generates a random number between 0 and 1 for each of the features in the organism. If a gene gets a number that is less than or equal to the mutation rate, then that gene is mutated. As the stopping condition, only 4000 trials (20 generations) are permitted.

Simple K-means was conducted using SPSS for Windows 11.0, and SOM using NeuroShell 2 R4.0. GA K-means was developed using Microsoft Excel 2002 and Palisade Software's Evolver Version 4.06, and the K-means algorithm for GA K-means was implemented in VBA (Visual Basic for Applications) in Microsoft Excel 2002.

4.2. Experimental results

In this study, we used real-world Internet shopping mall data for assessing the performance of the proposed model. The research data was collected from an online diet portal site in Korea which contains all kinds of services for online diets, such as providing information, community services and a shopping mall. In the case of a diet shopping mall, customers generally have a clear objective, and they have a strong demand for personalized care and services. Thus, they are usually open-minded about providing their personal information to get appropriate service. As a result, the target company of our research possesses detailed and accurate customer information, and it has a desire to use the information as a source of direct marketing for selling their products. Consequently, we tried to build a recommendation model for the users of this web site.

[Fig. 3. The system architecture of the recommendation system: GA K-means clustering partitions the user data base into five clusters; the system then determines the nearest cluster for the target customer, searches the most similar neighbors in the corresponding cluster using case-based reasoning, and generates recommendation results referencing the nearest neighbors' purchased items through a web interface.]
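The architecture in Fig. 3 can be summarized by a short sketch. The Euclidean similarity, the neighborhood size, and the frequency-based ranking of items are assumptions for illustration; the paper does not specify the CBR similarity function or these parameters.

```python
import numpy as np

def recommend(target, X, labels, centroids, purchases, n_neighbors=5):
    """Fig. 3 pipeline: nearest cluster -> nearest neighbors in it -> their items."""
    # Step 1: determine the nearest cluster for the target customer.
    cluster = np.linalg.norm(centroids - target, axis=1).argmin()
    members = np.where(labels == cluster)[0]
    # Step 2: search the most similar neighbors within that cluster only,
    # which is what keeps the CBR search space small.
    dists = np.linalg.norm(X[members] - target, axis=1)
    neighbors = members[dists.argsort()[:n_neighbors]]
    # Step 3: recommend the items the nearest neighbors purchased,
    # ranked by how often they occur among the neighbors.
    items = [item for u in neighbors for item in purchases.get(u, [])]
    return sorted(set(items), key=items.count, reverse=True)

# Toy usage with placeholder data structures.
rng = np.random.default_rng(3)
X = rng.random((50, 5)); labels = rng.integers(0, 5, 50)
centroids = np.array([X[labels == k].mean(axis=0) for k in range(5)])
purchases = {u: [f"item{rng.integers(10)}" for _ in range(2)] for u in range(50)}
print(recommend(X[0], X, labels, centroids, purchases))
```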
There are two prevalent approaches for recommendation – collaborative filtering (CF) and content-based (CB) methods. Both approaches have weaknesses as well as strengths; however, CF has been used more frequently in real-world applications.

CF works by collecting user feedback in the form of ratings for items in a given domain and exploiting similarities and differences among the profiles of several users in determining how to recommend items. In general, CF can make accurate recommendations in homogeneous domains, and it requires lower computational complexity when the collected data set is small. Moreover, it can recommend items in novel territories because it determines the items to be recommended by considering only the similarity between users, not the characteristics of products (Konstan et al., 1997; Pazzani, 1999).

Nonetheless, it also has many critical limitations. Usually, it cannot recommend newly added items, and it requires high computational complexity as the user base increases. Moreover, it mainly relies on annotations by users and provides meaningless results in heterogeneous domains such as food. However, the most critical problem is that it cannot make any recommendations when customers have not bought items repeatedly, which is generally called the 'sparsity problem' (Cho & Kim, 2004; Cho, Kim, & Kim, 2002; Good et al., 1999; Herlocker, Konstan, & Riedl, 2000; Kim, Cho, Kim, Kim, & Suh, 2002; Resnick, Iacovou, Suchak, & Bergstrom, 1994).

These limitations are particularly critical for Internet shopping malls with specialized items. In the case of our target shopping mall, which is specialized for dieters, 80.44% of total buyers had purchased just one time at the time of this research.

Table 1
Features and their descriptions

Code      Description                                                                Range
AGE       Age                                                                        Continuous (yrs)
ADD0      Residences are located in Seoul (the capital of South Korea)               0: False/1: True
ADD1      Residences are located in big cities                                       0: False/1: True
ADD2      Residences are located in the places other than Seoul and big cities       0: False/1: True
OCCU0     Customers work for companies                                               0: False/1: True
OCCU1     Customers are housewives                                                   0: False/1: True
OCCU2     Customers are students                                                     0: False/1: True
OCCU3     Customers run their own businesses                                         0: False/1: True
OCCU4     Customers are jobless                                                      0: False/1: True
SEX       Gender                                                                     0: Male/1: Female
MARRIED   Customers got married                                                      0: False/1: True
LOSS1     Customers' need to lose weight around their bellies                        0: Not exist/1: Exist
LOSS2     Customers' need to lose weight around their hips                           0: Not exist/1: Exist
LOSS3     Customers' need to lose weight around their waists                         0: Not exist/1: Exist
LOSS4     Customers' need to lose weight around their faces                          0: Not exist/1: Exist
LOSS5     Customers' need to lose weight around their backs                          0: Not exist/1: Exist
LOSS6     Customers' need to lose weight around their legs and thighs                0: Not exist/1: Exist
LOSS7     Customers' need to lose weight around their arms                           0: Not exist/1: Exist
LOSS8     Customers' need to lose weight around their chests                         0: Not exist/1: Exist
LOSS9     Customers' need to lose weight around their shoulders                      0: Not exist/1: Exist
PUR0      Customers diet for beauty                                                  0: False/1: True
PUR1      Customers diet for urgent purpose such as employment, marriage             0: False/1: True
PUR2      Customers diet for the reasons other than PUR0, PUR1                       0: False/1: True
HEIGHT    Height                                                                     Continuous (m)
WEIGHT    Weight                                                                     Continuous (kg)
WAIST     Waist                                                                      Continuous (inch)
OBJ       Customers' target weight to reduce                                         Continuous (kg)
BMI       Body mass index (BMI) is the measure of body fat based on height and
          weight that applies to both adult men and women. It is calculated as
          BMI (kg/m^2) = weight (kg) / (height (m))^2                                Continuous (kg/m^2)
D1        Customers experienced 'edema'                                              0: False/1: True
D2        Customers experienced 'constipation'                                       0: Not exist/1: Exist
D3        Customers experienced 'hypertension' or 'high blood fat'                   0: Not exist/1: Exist
D4        Customers experienced 'anemia' or 'malnutrition'                           0: Not exist/1: Exist
E01       Prior experience with 'functional diet food'                               0: Not exist/1: Exist
E02       Prior experience with 'diet drugs'                                         0: Not exist/1: Exist
E03       Prior experience with 'diuretics' or the drugs generating 'diarrhea'       0: Not exist/1: Exist
E04       Prior experience with 'hunger'                                             0: Not exist/1: Exist
E05       Prior experience with 'folk remedies' such as 'one food diet' and
          'Denmark diet'                                                             0: Not exist/1: Exist
E06       Prior experience with 'fasting treatment'                                  0: Not exist/1: Exist
E07       Prior experience with 'diet clinic'                                        0: Not exist/1: Exist
E08       Prior experience with 'oriental diet treatment'                            0: Not exist/1: Exist
E09       Prior experience with 'toning system'                                      0: Not exist/1: Exist
E10       Prior experience with other diet methods other than E01-E09                0: Not exist/1: Exist
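As a preprocessing note, the chromosome encoding of Section 3.2 presumes min-max normalized continuous features. A sketch of that step over a feature matrix shaped like Table 1 (36 binary columns plus the 6 continuous ones; the column ordering and data are placeholders) might look as follows.

```python
import numpy as np

# Feature matrix: 36 binary columns (already 0/1) followed by the 6 continuous
# columns AGE, HEIGHT, WEIGHT, WAIST, OBJ, BMI (placeholder data, 3298 customers).
rng = np.random.default_rng(4)
X = np.hstack([rng.integers(0, 2, (3298, 36)), rng.random((3298, 6)) * 100])

# Min-max normalization of the continuous block, mapping each column to [0, 1].
cont = X[:, 36:]
X[:, 36:] = (cont - cont.min(axis=0)) / (cont.max(axis=0) - cont.min(axis=0))
```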
Thus, it is impossible to apply CF in our target shopping mall because the buying behavior of the customers cannot be patternized. Consequently, we applied a simple case-based reasoning (CBR) model in order to predict the items which are likely to be purchased by users. However, CBR also has a limitation that is similar to CF, that is, the requirement of high computational complexity as the user base increases. To resolve the problem, clustering methods have been applied as a preprocessing tool for CBR (Kim & Han, 2001). In this study, we used our proposed clustering algorithm, GA K-means, as a preprocessing tool for CBR. Fig. 3 shows the system architecture of the recommendation system using GA K-means and CBR.

The collected data in this case included 3298 customers. It contained users' demographic variables and the items that they had purchased. We collected a total of 42 independent variables including demographic and other personal information. Table 1 presents detailed information on the features.

In order to determine the most appropriate clustering algorithm for this data set, we applied simple K-means, SOM, and GA K-means all together, and focused on finding the algorithm which produced the best result.

Table 2
Intraclass inertia of each clustering method

Clustering method     K-means    SOM       GA K-means
Intraclass inertia    2.1467     2.1340    2.1288

Table 3
t-values of the paired-samples t-test

            SOM        GA K-means
K-means     5.534^a    8.233^a
SOM                    3.588^a

^a Significant at the 1% level.

Table 4
The result of Chi-square analysis

Code       K-means                   SOM                       GA K-means
           Chi-square    Asy. Sig.   Chi-square    Asy. Sig.   Chi-square    Asy. Sig.
ADD0         96.368      0.000        1348.831     0.000         10.235      0.037
ADD1         13.468      0.009         270.957     0.000          9.278      0.055
ADD2        148.332      0.000         516.466     0.000         14.492      0.006
OCCU0        57.055      0.000          12.248     0.016         72.685      0.000
OCCU1       450.737      0.000          83.418     0.000        167.636      0.000
OCCU2      1079.419      0.000         372.907     0.000       1458.239      0.000
OCCU3         1.937      0.747          13.663     0.008         20.005      0.000
OCCU4       603.448      0.000         202.115     0.000        727.388      0.000
SEX         167.612      0.000         214.696     0.000        184.098      0.000
MARRIED     661.327      0.000         216.406     0.000        351.134      0.000
LOSS1       757.606      0.000         478.591     0.000        593.393      0.000
LOSS2       445.300      0.000         390.210     0.000        416.854      0.000
LOSS3      1075.998      0.000         724.511     0.000        740.688      0.000
LOSS4       380.218      0.000         443.821     0.000        473.003      0.000
LOSS5       696.827      0.000        1305.146     0.000       1189.162      0.000
LOSS6       339.398      0.000         414.516     0.000        377.712      0.000
LOSS7       646.072      0.000         658.696     0.000        611.935      0.000
LOSS8       220.424      0.000         389.492     0.000        447.096      0.000
LOSS9       522.761      0.000        1145.626     0.000       1065.225      0.000
PUR0       1555.298      0.000        1666.533     0.000       1665.739      0.000
PUR1        141.533      0.000          89.095     0.000        134.972      0.000
PUR2       1772.369      0.000        1643.852     0.000       2162.927      0.000
D1          144.186      0.000         157.686     0.000        156.771      0.000
D2           96.115      0.000         169.692     0.000         99.773      0.000
D3           21.630      0.000          19.825     0.001         28.130      0.000
D4           20.424      0.000          30.589     0.000         37.742      0.000
E01         158.875      0.000         105.290     0.000        110.969      0.000
E02         237.783      0.000         113.755     0.000         97.106      0.000
E03          54.677      0.000          38.590     0.000         47.942      0.000
E04         153.452      0.000         116.398     0.000        144.575      0.000
E05          86.949      0.000          90.472     0.000         72.661      0.000
E06          27.896      0.000          15.372     0.004         17.544      0.002
E07          14.877      0.005          16.113     0.003         26.981      0.000
E08          51.238      0.000          25.048     0.000         32.696      0.000
E09          38.677      0.000          30.369     0.000         18.063      0.001
E10          27.819      0.000          18.089     0.001         21.357      0.000

Table 5
The result of ANOVA

Code      K-means                SOM                    GA K-means
          F-value    Asy. Sig.   F-value    Asy. Sig.   F-value    Asy. Sig.
AGE       93.444     0.000       209.880    0.000       217.018    0.000
HEIGHT    31.631     0.000        22.773    0.000        23.624    0.000
WEIGHT    23.181     0.000        26.752    0.000        29.990    0.000
OBJ       19.671     0.000        21.144    0.000        23.581    0.000
BMI       37.419     0.000        41.808    0.000        54.352    0.000
WAIST     25.189     0.000        31.245    0.000        40.390    0.000

[Fig. 4. Clustering result of GA K-means.]
We used intraclass inertia as a performance measure. Table 2 shows the summarized performance of the three methods. As can be seen, GA K-means is the best among the three comparative methods for this data set.

To determine whether the differences of intraclass inertia between the proposed model and the other comparative models are statistically significant or not, we applied the t-test for paired samples. As explained in Section 2.3, intraclass inertia is the mean of the distances between the cluster center and each sample, so we can apply a statistical test for comparing the means of two samples.

[Fig. 5. Sample screens of the prototype system: (a) the input screen; (b) the result screen for recommendation.]

Table 6
The result of the survey

Recommendation method                 Average    Standard deviation
Randomly-chosen model                 3.76       1.342
Proposed model (GA K-means + CBR)     4.51       1.010
The paired-samples t-test is usually applied when the two sets of values come from the same sample, such as in a pre-test/post-test situation. It is sometimes called the t-test for correlated or dependent samples (Green, Salkind, & Akey, 2000). In this study, the null hypothesis is H0: INi - INj = 0 and the alternative hypothesis is Ha: INi - INj ≠ 0, where i = 1, 2 and j = 2, 3, and INk denotes the intraclass inertia (i.e., the average distance between the cluster center and each sample) for clustering method k. Table 3 shows the results of the paired-samples t-test. As shown in Table 3, GA K-means outperforms all of the comparative models, including simple K-means and SOM, at the 1% statistical significance level.

In addition, Chi-square analysis and ANOVA (analysis of variance) were applied to examine the discriminant power of the three clustering methods. Table 4 presents the results of the Chi-square analysis, and Table 5 presents the results of the ANOVA. These results show that the five segments produced by each of the three clustering methods differed significantly with respect to almost all of the independent variables. Thus, we conclude that GA K-means may be the most appropriate preprocessing tool for this data set. Fig. 4 displays the clustering result of GA K-means with two variables (AGE and WAIST), indicating that there are clearly five clusters.

Using the result of GA K-means clustering and the CBR algorithm, we constructed the recommendation system for the target shopping mall. The system was developed as a Web-based system using Microsoft ASP (Active Server Pages). Fig. 5 shows the sample screens of our prototype system.

To validate the usefulness of the recommendation model using GA K-means and CBR, our prototype system generated two kinds of recommendation results: one was generated randomly and the other was generated by our model combining GA K-means and CBR. The order in which the two results were presented was randomized in order to offset the main testing effect, that is, the effect of a prior observation on a later observation. To measure satisfaction with each recommendation result, we added a survey function that recorded the satisfaction level on a 7-point Likert scale. Using the prototype system, we conducted the survey for one month (April 2005) and collected 100 responses in total. Table 6 shows the results of the survey. As shown in Table 6, the respondents were quite satisfied (4.51 points) with the results recommended by our proposed model, whereas they felt the randomly recommended results were mediocre (3.76 points).

To examine whether the difference between the satisfaction levels of the two recommendation methods is statistically significant, we again applied the paired-samples t-test. Here the null hypothesis was H0: μ1 - μ2 = 0 and the alternative hypothesis was Ha: μ1 - μ2 ≠ 0, where μ1 is the satisfaction level of the proposed model and μ2 is that of the random model. Table 7 shows the results of the paired-samples t-test.

Table 7
The results of the paired-samples t-test

Paired differences                                                       t-value   Degrees of freedom   Sig. level (2-tailed)
Mean   Standard deviation   Standard error mean   95% CI of difference
                                                  Lower      Upper
0.75   1.604                0.1604                0.432      1.068       4.675     99                   0.0000

As shown in Table 7, the satisfaction level of the proposed model is higher than that of the random model at the 1% statistical significance level. This shows that GA K-means, as a preprocessing tool for a prediction model such as CBR, may support satisfactory results while also reducing the search space for training.
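As a concrete illustration of the tests reported in Tables 3 and 7, the following sketch runs a paired-samples t-test with SciPy. The scores below are synthetic stand-ins generated for the example, not the actual survey responses.

```python
# Paired-samples t-test of the kind used for Tables 3 and 7.
# The Likert scores here are synthetic, not the real survey data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 100  # the survey collected 100 paired responses

# Hypothetical 7-point Likert scores for the two recommendation methods.
proposed = np.clip(np.round(rng.normal(4.5, 1.0, n)), 1, 7)
random_rec = np.clip(np.round(rng.normal(3.8, 1.3, n)), 1, 7)

# H0: mean(proposed - random_rec) = 0 vs. Ha: the mean difference != 0.
t_stat, p_value = stats.ttest_rel(proposed, random_rec)
print(f"mean difference = {np.mean(proposed - random_rec):.3f}")
print(f"t = {t_stat:.3f}, df = {n - 1}, two-tailed p = {p_value:.4f}")
```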
5. Conclusions

This study suggests a new clustering algorithm, GA K-means. We applied it to a real-world case of market segmentation in electronic commerce and found that GA K-means might result in better segmentation than traditional clustering algorithms, including simple K-means and SOM, from the perspective of intraclass inertia. In addition, we empirically examined the usefulness of GA K-means as a preprocessing tool for a recommendation model.

However, this study has some limitations. Although we suggest intraclass inertia as a criterion for performance comparison, it is uncertain whether it is a complete measure for comparing clustering algorithms; future research should therefore develop more effective comparison measures. Moreover, we arbitrarily set the number of clusters to five in this study. Few studies have proposed a mechanism for determining the optimal number of clusters, so it has usually been set heuristically; methods for tuning the number of clusters should thus be one focus of future research. In addition, GA K-means needs to be applied to other domains in order to validate the generalizability of the proposed model.

References

Babu, G. P., & Murty, M. N. (1993). A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recognition Letters, 14(10), 763–769.
Bauer, R. J. (1994). Genetic algorithms and investment strategies. New York: John Wiley & Sons.
Bradley, P. S., & Fayyad, U. M. (1998). Refining initial points for K-means clustering. In Proceedings of the 15th international conference on machine learning (pp. 91–99).
Cho, Y. H., & Kim, J. K. (2004). Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 233–246.
Cho, Y. H., Kim, J. K., & Kim, S. H. (2002). A personalized recommender system based on Web usage mining and decision tree induction. Expert Systems with Applications, 23(3), 329–342.
Davis, L. (1994). Handbook of genetic algorithms. New York: Van Nostrand Reinhold.
Gehrt, K. C., & Shim, S. (1998). A shopping orientation segmentation of French consumers: implications for catalog marketing. Journal of Interactive Marketing, 12(4), 34–46.
Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B., Herlocker, J., & Riedl, J. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the 16th national conference of the American Association of Artificial Intelligence (AAAI-99) (pp. 439–446). Orlando, FL, USA.
Green, S. B., Salkind, N. J., & Akey, T. M. (2000). Using SPSS for Windows (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Herlocker, J. L., Konstan, J. A., & Riedl, J. (2000). Explaining collaborative filtering recommendations. In Proceedings of the ACM conference on computer supported cooperative work (pp. 241–250). Philadelphia, USA.
Hertz, A., & Kobler, D. (2000). A framework for the description of evolutionary algorithms. European Journal of Operational Research, 126(1), 1–12.
Kim, K. (2004). Toward global optimization of case-based reasoning systems for financial forecasting. Applied Intelligence, 21(3), 239–249.
Kim, J. K., Cho, Y. H., Kim, W. J., Kim, J. R., & Suh, J. H. (2002). A personalized recommendation procedure for Internet shopping support. Electronic Commerce Research and Applications, 1(3–4), 301–313.
Kim, K.-S., & Han, I. (2001). The cluster-indexing method for case-based reasoning using self-organizing maps and learning vector quantization for bond rating cases. Expert Systems with Applications, 21(3), 147–156.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., & Riedl, J. (1997). GroupLens: applying collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77–87.
Kuo, R. J., An, Y. L., Wang, H. S., & Chung, W. J. (2006). Integration of self-organizing feature maps neural network and genetic K-means algorithm for market segmentation. Expert Systems with Applications, 30(2), 313–324.
Kuo, R. J., Chang, K., & Chien, S. Y. (2004). Integration of self-organizing feature maps and genetic-algorithm-based clustering method for market segmentation. Journal of Organizational Computing and Electronic Commerce, 14(1), 43–60.
Kuo, R. J., Liao, J. L., & Tu, C. (2005). Integration of ART2 neural network and genetic K-means algorithm for analyzing Web browsing paths in electronic commerce. Decision Support Systems, 40(2), 355–374.
Lletí, R., Ortiz, M. C., Sarabia, L. A., & Sánchez, M. S. (2004). Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes. Analytica Chimica Acta, 515(1), 87–100.
Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33(9), 1455–1465.
Michalewicz, Z. (1996). Genetic algorithms + data structures = evolution programs (3rd ed.). Berlin: Springer-Verlag.
Michaud, P. (1997). Clustering techniques. Future Generation Computer Systems, 13(2–3), 135–147.
Murthy, C. A., & Chowdhury, N. (1996). In search of optimal clusters using genetic algorithms. Pattern Recognition Letters, 17(8), 825–832.
Pazzani, M. J. (1999). A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13(5–6), 393–408.
Peña, J. M., Lozano, J. A., & Larrañaga, P. (1999). An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognition Letters, 20(10), 1027–1040.
Resnick, P., Iacovou, N., Suchak, M., & Bergstrom, P. (1994). GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the ACM conference on computer supported cooperative work (pp. 175–186).
Shin, K. S., & Han, I. (1999). Case-based reasoning supported by genetic algorithms for corporate bond rating. Expert Systems with Applications, 16(2), 85–95.
Velido, A., Lisboa, P. J. G., & Meehan, K. (1999). Segmentation of the on-line shopping market using neural networks. Expert Systems with Applications, 17(4), 303–314.
Wedel, M., & Kamakura, W. A. (1998). Market segmentation: concepts and methodological foundations. Boston: Kluwer Academic Publishers.
Wong, F., & Tan, C. (1994). Hybrid neural, genetic and fuzzy systems. In G. J. Deboeck (Ed.), Trading on the edge (pp. 245–247). New York: John Wiley & Sons.