center and each sample, thus we can apply the statistical test for comparing means of two samples. The paired-samples t-test is usually applied when the two sets of values are from the same sample, such as in a pre-test/post-test situation. It is sometimes called the t-test for correlated samples or dependent samples (Green, Salkind, & Akey, 2000). In this study, the null hypothesis is H0: INi − INj = 0 where i = 1, 2 and j = 2, 3, while the alternative hypothesis is Ha: INi − INj ≠ 0 where i = 1, 2 and j = 2, 3. INk denotes the intraclass inertia (i.e., the average distance between the cluster center and each sample) for clustering method k. Table 3 shows the results of the paired-samples t-test. As shown in Table 3, GA K-means outperforms all of the comparative models, including simple K-means and SOM, at the 1% statistical significance level.
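To make the test concrete, the sketch below computes each sample's distance to its assigned cluster center under two clustering runs and applies a paired-samples t-test to the paired distances. This is an illustration only: the synthetic data, the use of two plain K-means configurations in place of GA K-means, simple K-means, and SOM, and all parameter choices are our assumptions, not the paper's setup.

```python
# Illustrative sketch (not the authors' code): paired-samples t-test on
# per-sample center distances, whose means are the intraclass inertias.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=5, cluster_std=2.0, random_state=0)

def center_distances(X, model):
    """Distance from each sample to the center of its assigned cluster."""
    labels = model.fit_predict(X)
    return np.linalg.norm(X - model.cluster_centers_[labels], axis=1)

# Two configurations standing in for the compared methods: a well-tuned
# K-means versus a deliberately under-optimized run.
d1 = center_distances(X, KMeans(n_clusters=5, n_init=50, random_state=0))
d2 = center_distances(X, KMeans(n_clusters=5, n_init=1, init="random",
                                max_iter=1, random_state=1))

print("intraclass inertia:", d1.mean(), d2.mean())   # IN_1 vs. IN_2
t, p = ttest_rel(d1, d2)                             # H0: IN_1 - IN_2 = 0
print(f"t = {t:.3f}, two-tailed p = {p:.4f}")
```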
In addition, Chi-square analysis and ANOVA (analysis of variance) were also applied to examine the discriminant power of the three clustering methods. Table 4 presents the results of the Chi-square analysis, and Table 5 presents the results of the ANOVA. These results show that the five segments produced by all three clustering methods differed significantly with respect to almost all of the independent variables. Thus, we can conclude that GA K-means may be the most appropriate preprocessing tool for this data set. Fig. 4 displays the clustering result of GA K-means with two variables (AGE and WAIST), indicating that there are clearly five clusters.

Using the result of GA K-means clustering and the CBR algorithm, we constructed the recommendation system for the target shopping mall. The system was developed as a Web-based system using Microsoft ASP (Active Server Pages). Fig. 5 shows the sample screens of our prototype system.
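The paper gives no implementation details beyond ASP, but the core idea, using the cluster assignment to narrow CBR retrieval to one customer segment, can be sketched as follows. The feature data, the use of plain K-means in place of GA K-means, and all names and parameters here are hypothetical.

```python
# Minimal sketch (assumptions, not the authors' implementation): a clustering
# step narrows CBR retrieval to one segment, so the nearest-neighbor search
# runs over only a fraction of the case base.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
cases = rng.normal(size=(1000, 4))   # hypothetical customer features
                                     # (e.g., AGE, WAIST, ...)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(cases)

def recommend(query, k=3):
    """Retrieve the k most similar past cases from the query's segment only."""
    segment = km.predict(query.reshape(1, -1))[0]
    pool = np.where(km.labels_ == segment)[0]        # reduced search space
    nn = NearestNeighbors(n_neighbors=k).fit(cases[pool])
    _, idx = nn.kneighbors(query.reshape(1, -1))
    return pool[idx[0]]                              # indices of retrieved cases

print(recommend(rng.normal(size=4)))
```

Restricting retrieval to one segment is also what "reduces search space for training" refers to below: with five balanced segments, the nearest-neighbor search scans roughly one fifth of the case base.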
To validate the usefulness of the recommendation model using GA K-means and CBR, our prototype system generated two kinds of recommendation results: one was generated randomly, and the other was generated using our model that combined GA K-means and CBR. The sequence of the visual presentation of these two results was changed randomly in order to offset the main testing effect, that is, the effect of a prior observation on a later observation. To measure satisfaction with each recommendation result, we added a survey function that measured the satisfaction level on a 7-point Likert scale. Using the prototype system, we conducted the survey for one month (April 2005). As a result, we collected 100 responses in total. Table 6 shows the results of the survey. As shown in Table 6, the respondents replied that they were quite satisfied (4.51 points) with the results recommended by our proposed model, although they felt the randomly recommended results were mediocre (3.76 points).

To examine whether the difference between the satisfaction levels of the two recommendation results is statistically significant, we applied the paired-samples t-test. In the test, the null hypothesis was H0: μ1 − μ2 = 0, while the alternative hypothesis was Ha: μ1 − μ2 ≠ 0, where μ1 was the satisfaction level of the proposed model and μ2 was the level for the random model. Table 7 shows the results of the paired-samples t-test.

Table 7
The results of the paired-samples t-test

Paired differences:
  Mean: 0.75
  Standard deviation: 1.604
  Standard error of the mean: 0.1604
  95% confidence interval of the difference: 0.432 (lower) to 1.068 (upper)
t-Value: 4.675
Degrees of freedom: 99
Sig. level (2-tailed): 0.0000

As shown in Table 7, the satisfaction level of the proposed model is higher than that of the random model at the 1% statistical significance level. This shows that GA K-means, as a preprocessing tool for a prediction model such as CBR, may support satisfactory results and also reduce the search space for training.
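As a reader-side sanity check (not part of the paper), the t-value and confidence interval in Table 7 follow directly from the reported mean and standard deviation of the paired differences with n = 100:

```python
# Reproducing the Table 7 statistics from the reported summary values.
from scipy import stats

n = 100
mean_diff, sd_diff = 0.75, 1.604           # mean / SD of paired differences
se = sd_diff / n ** 0.5                    # standard error of the mean, 0.1604

t_value = mean_diff / se                   # ~4.675, df = n - 1 = 99
p_two_tailed = 2 * stats.t.sf(t_value, df=n - 1)   # ~0.0000

t_crit = stats.t.ppf(0.975, df=n - 1)      # critical value for the 95% CI
ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)   # ~(0.432, 1.068)

print(f"t = {t_value:.3f}, p = {p_two_tailed:.6f}, 95% CI = {ci}")
```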
5. Conclusions

This study suggests a new clustering algorithm, GA K-means. We applied it to a real-world case of market segmentation in electronic commerce, and found that GA K-means might result in better segmentation than traditional clustering algorithms such as simple K-means and SOM from the perspective of intraclass inertia. In addition, we empirically examined the usefulness of GA K-means as a preprocessing tool for a recommendation model.

However, this study has some limitations. Although we suggest intraclass inertia as a criterion for performance comparison, it is uncertain whether it is a complete measure for comparing the performance of clustering algorithms. Consequently, efforts to develop effective measures for comparing clustering algorithms should be made in future research.

Moreover, we arbitrarily set the number of clusters to five in this study. Unfortunately, few studies have proposed a mechanism for determining the optimal number of clusters, so it has usually been determined heuristically. Thus, attempts to adjust the number of clusters should be one of the focuses of future research. In addition, GA K-means needs to be applied to other domains in order to validate the generalizability of the proposed model.