正在加载图片...
K.j. Kim, H. Ahn/ Expert Systems with Applications 34(2008)1200-1209 where n is the number of total observations, CK is set of the adjacent genes on the chromosome of a parent. The two- Kth cluster, Xip is the value of the attribute P for observa- point crossover involves two cuts in each chromosome. tion i, and Xkp is the mean of the attribute P in the Kth The uniform crossover allows two parent strings to cluster that is XkP=m∑ produce two children. It permits great flexibility in the way strings are combined. In addition, the mutation oper- 3. GA K-means clustering algorithm ator arbitrarily alters one or more components of a selected chromosome. Mutation randomly changes a gene on a As indicated in Section 2.1, the K-means algorithm does chromosome. It provides the means for introducing new lot have any mechanism for choosing appropriate initial information into the population. Finally, the Ga tends to seeds. However, selecting different initial seeds may gener- converge on an optimal or near-optimal solution through ate huge differences in clustering results, especially when these operators(Wong Tan, 1994) the target sample contains outliers. In addition, ran- The Ga is popularly used to improve the performance dom selection of initial seeds often causes the clustering of artificial intelligence techniques. Recently, the Ga has quality to fall into local optimization(Bradley Fayyad, been employed to mitigate the limitations of typical cluster 1998). So, it is very important to select appropriate initial ing algorithms including the K-means algorithm. Lleti seeds in the traditional K-means clustering method. To Ortiz, Sarabia, and Sanchez(2004)proposed the use of overcome this critical limitation, we propose Ga as the the integrated genetic algorithm and K-means clustering optimization tool of the initial seeds in the K-means algo- to select relevant input variables. The method proposed rithm. We call our proposed model the GA K-means clus- in their study was based on the use of Ga to select those tering method. variables important for clarifying the presence of groups Genetic algorithms are stochastic search techniques that in data. Other studies tried to integrate Ga and the k can search large and complicated spaces. It is based on means algorithm for overcoming the drawback of the pos iology including natural genetics and evolutionary princi- sibility of convergence in a local minimum solution(see ple. In particular, GAs are suitable for parameter optimiza- Babu Murty, 1993; Kuo, An, Wang, Chung, 2006: tion problems with an objective function subject to various Maulik Bandyopadhyay, 2000; Murthy Chowdhury, hard and soft constraints (Shin Han, 1999). The GA 1996: Pena et al., 1999) basically explores a complex space in an adaptive way, guided by the biological evolution of selection, crossover, 3. 1. The process of the GA K-means algorithm and mutation. This algorithm uses natural selection-sur- vival of the fittest- to solve optimization problems(Kim GA K-means uses Ga to select optimal nitial seeds in K-means clustering. Fig. I shows the overall The first step of the Ga is problem representation. The framework of GA K-means clustering problem must be represented in a suitable form to be han The detailed explanation on the process of GA K-means dled by the GA. Thus, the problem is described in terms of clustering is as follows genetic code, like DNA chromosomes. The GA often ng. If the problems are (1) In the first step, the system generates the initial pop- coded as chromosomes, a population-a set of chromo- ulation that would be used to find global optimum somes-is initialized. Each chromosome within a popula- initial seeds The values of the chromosomes for the tion gradually evolves through biological operations population are initiated into random values before There are no general rules for determining the population the search process. To enable Ga to find the optimal size. But, population sizes of 100-200 are commonly used initial seeds, we should design the structure of a chro GA research. Once the population size is chosen, the mosome properly as a form of binary strings. A tial population is randomly generated (Bauer, 1994).After detailed explanation on chromosome design is pre- the initialization step, each chromosome is evaluated by a sented in Section 3.2 fitness function. According to the value of the fitness func- (2) The second step is the process of K-means clustering, tion the chromosomes associated with the fittest individu which is mentioned in Section 2. 1. In our study, the als will be reproduced more often than those associated maximum number of iterations for centroid adjust with unfit individuals(Davis, 1994) ment in K-means is set at 10. After the process of The GA works with three operators that are iteratively K-means, the value of the fitness function is updated used. The selection operator determines which individuals To find optimal initial seeds, we applied ' intraclass may survive(hertz Kobler, 2000). The crossover opera inertia'as a fitness function. That is, the objective tor allows the search to fan out in diverse directions look. of our ga K-means is to find the initial seeds where g for attractive solutions, and permits chromosoma intraclass intertia is minimized after K-means cluster material from different parents to be combined in a single ing finishes child. There are three popular crossover methods: single- (3)In this stage, Ga performs genetic operations such point, two point, and uniform. The single-point crossover selection. crossover and mutation. on the current makes only one cut in each chromosome and selects two population. By doing this, a new generation is pro-where n is the number of total observations, CK is set of the Kth cluster, XiP is the value of the attribute P for observa￾tion i, and X KP is the mean of the attribute P in the Kth cluster that is X KP ¼ 1 nK P i2CK XiP . 3. GA K-means clustering algorithm As indicated in Section 2.1, the K-means algorithm does not have any mechanism for choosing appropriate initial seeds. However, selecting different initial seeds may gener￾ate huge differences in clustering results, especially when the target sample contains many outliers. In addition, ran￾dom selection of initial seeds often causes the clustering quality to fall into local optimization (Bradley & Fayyad, 1998). So, it is very important to select appropriate initial seeds in the traditional K-means clustering method. To overcome this critical limitation, we propose GA as the optimization tool of the initial seeds in the K-means algo￾rithm. We call our proposed model the GA K-means clus￾tering method. Genetic algorithms are stochastic search techniques that can search large and complicated spaces. It is based on biology including natural genetics and evolutionary princi￾ple. In particular, GAs are suitable for parameter optimiza￾tion problems with an objective function subject to various hard and soft constraints (Shin & Han, 1999). The GA basically explores a complex space in an adaptive way, guided by the biological evolution of selection, crossover, and mutation. This algorithm uses natural selection – sur￾vival of the fittest – to solve optimization problems (Kim, 2004). The first step of the GA is problem representation. The problem must be represented in a suitable form to be han￾dled by the GA. Thus, the problem is described in terms of genetic code, like DNA chromosomes. The GA often works with a form of binary coding. If the problems are coded as chromosomes, a population – a set of chromo￾somes – is initialized. Each chromosome within a popula￾tion gradually evolves through biological operations. There are no general rules for determining the population size. But, population sizes of 100–200 are commonly used in GA research. Once the population size is chosen, the ini￾tial population is randomly generated (Bauer, 1994). After the initialization step, each chromosome is evaluated by a fitness function. According to the value of the fitness func￾tion, the chromosomes associated with the fittest individu￾als will be reproduced more often than those associated with unfit individuals (Davis, 1994). The GA works with three operators that are iteratively used. The selection operator determines which individuals may survive (Hertz & Kobler, 2000). The crossover opera￾tor allows the search to fan out in diverse directions look￾ing for attractive solutions, and permits chromosomal material from different parents to be combined in a single child. There are three popular crossover methods: single￾point, two-point, and uniform. The single-point crossover makes only one cut in each chromosome and selects two adjacent genes on the chromosome of a parent. The two￾point crossover involves two cuts in each chromosome. The uniform crossover allows two parent strings to produce two children. It permits great flexibility in the way strings are combined. In addition, the mutation oper￾ator arbitrarily alters one or more components of a selected chromosome. Mutation randomly changes a gene on a chromosome. It provides the means for introducing new information into the population. Finally, the GA tends to converge on an optimal or near-optimal solution through these operators (Wong & Tan, 1994). The GA is popularly used to improve the performance of artificial intelligence techniques. Recently, the GA has been employed to mitigate the limitations of typical cluster￾ing algorithms including the K-means algorithm. Lletı´, Ortiz, Sarabia, and Sa´nchez (2004) proposed the use of the integrated genetic algorithm and K-means clustering to select relevant input variables. The method proposed in their study was based on the use of GA to select those variables important for clarifying the presence of groups in data. Other studies tried to integrate GA and the K￾means algorithm for overcoming the drawback of the pos￾sibility of convergence in a local minimum solution (see Babu & Murty, 1993; Kuo, An, Wang, & Chung, 2006; Maulik & Bandyopadhyay, 2000; Murthy & Chowdhury, 1996; Pena et al., 1999). 3.1. The process of the GA K-means algorithm GA K-means uses GA to select optimal or sub-optimal initial seeds in K-means clustering. Fig. 1 shows the overall framework of GA K-means clustering. The detailed explanation on the process of GA K-means clustering is as follows: (1) In the first step, the system generates the initial pop￾ulation that would be used to find global optimum initial seeds. The values of the chromosomes for the population are initiated into random values before the search process. To enable GA to find the optimal initial seeds, we should design the structure of a chro￾mosome properly as a form of binary strings. A detailed explanation on chromosome design is pre￾sented in Section 3.2. (2) The second step is the process of K-means clustering, which is mentioned in Section 2.1. In our study, the maximum number of iterations for centroid adjust￾ment in K-means is set at 10. After the process of K-means, the value of the fitness function is updated. To find optimal initial seeds, we applied ‘intraclass inertia’ as a fitness function. That is, the objective of our GA K-means is to find the initial seeds where intraclass intertia is minimized after K-means cluster￾ing finishes. (3) In this stage, GA performs genetic operations such as selection, crossover and mutation, on the current population. By doing this, a new generation is pro- 1202 K.-j. Kim, H. Ahn / Expert Systems with Applications 34 (2008) 1200–1209
<<向上翻页向下翻页>>
©2008-现在 cucdc.com 高等教育资讯网 版权所有