2009 International Conference on Computer Technology and Development

Document Clustering using Concept Space and Cosine Similarity Measurement

Lailil Muflikhah
Department of Computer and Information Science, Universiti Teknologi Petronas, Bandar Seri Iskandar, Tronoh, Perak, Malaysia
Brawijaya University
laililmf@gmail.com

Baharum Baharudin
Department of Computer and Information Science, Universiti Teknologi Petronas, Bandar Seri Iskandar, Tronoh, Perak, Malaysia
baharbh@gmail.com

Abstract— Document clustering is related to the concept of data clustering, one of the data mining tasks and a form of unsupervised classification. It is often applied to huge data sets in order to partition them based on similarity. It was initially used in Information Retrieval to improve the precision and recall of queries. Clustering is easy for small data sets whose attributes contain the important items; for large collections, document clustering is very useful in information retrieval applications, reducing retrieval time while keeping precision and recall high. We therefore propose to integrate an information retrieval method with document clustering through a concept space approach. The method, known as Latent Semantic Indexing (LSI), uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). Its aim is to reduce the matrix dimension by finding patterns in the document collection with reference to co-occurring terms. Each method is applied to the term-document weights of the vector space model (VSM) before clustering the documents with the fuzzy c-means algorithm. Besides the reduction of the term-document matrix, this research also uses cosine similarity measurement in place of Euclidean distance within fuzzy c-means. As a result, the proposed method performs better than the existing method, with an F-measure of around 0.91 and entropy around 0.51.

Keywords— data mining; clustering; LSI; SVD; PCA; fuzzy c-means; euclidean distance; cosine similarity

I. INTRODUCTION

Data mining is a technique for extracting patterns from hidden information: it finds and describes structural patterns in a data collection, as a tool to help explain that data and make predictions from it. Generally, data mining tasks fall into two major categories: predictive tasks, which aim to predict the value of a particular attribute based on the values of other attributes, and descriptive tasks, which aim to derive patterns (correlations, trends, clusters, trajectories, and anomalies) [1]. Clustering is a method to organize a large data collection automatically by partitioning a data set so that objects in the same cluster are more similar to one another than to objects in other clusters. Document clustering applies this to large text collections. In the field of Information Retrieval (IR), document clustering is used to automatically group documents that belong to the same topic in order to support the user's browsing of retrieval results [2]. Experimental evidence shows that IR applications can benefit from the use of document clustering [3], and document clustering has long been used as a tool to improve retrieval performance and to navigate large data sets. Clustering methods can be classified into hierarchical methods and partitioning methods, and partitioning clustering can be further divided into hard clustering and overlapping (fuzzy) clustering. Any given document may contain multiple subjects or categories.
This issue motivates the use of a fuzzy clustering approach, which allows a document to appear in multiple clusters. It differs from the hard clustering methods, in which a document belongs to exactly one cluster and no more, implying well-defined boundaries between the clusters. Many researchers have approached document clustering with hard clustering methods. One study showed that the Bisecting K-Means algorithm outperforms both the basic K-Means algorithm and Agglomerative Hierarchical Clustering when using a cosine-based similarity measure [4]. Other research obtained better results by using the variance value in the grouping process [5], or by using Euclidean distance [6]. Furthermore, there are several applications of the Fuzzy C-Means algorithm: it has been applied to cluster symbolic data, with better-quality results than a hierarchical (hard clustering) method [7], and it has also been applied in text mining [8].

When grouping or clustering documents, the problem is that a very large number of terms (words) is needed to represent the full text in a vector space, and lexical matching at the term level is often inaccurate. A word may have several meanings, and several words may share one meaning, which causes matching to return irrelevant documents. It is difficult to judge which documents belong to the same cluster for a specific category without selecting terms that are meaningful or correlated with one another. Therefore, this research uses an Information Retrieval concept, Latent Semantic Indexing (LSI). In this method, the documents are projected onto a small subspace of the vector space and clustered there; a new abstract vector space is created that captures the important terms [2].
In this paper, we first describe information retrieval using the LSI concept. We then describe document similarity as the basic concept of clustering, together with one clustering algorithm applied to document clustering, Fuzzy C-Means. After that, the methodology used to implement document clustering with LSI and cosine similarity is embedded into an experimental evaluation, and the results are analyzed to draw conclusions.

II. INFORMATION RETRIEVAL

Information retrieval is the means of searching for the information a user desires. Unrelated documents may be retrieved simply because terms occur in them accidentally and, on the other hand, related documents may be missed because no term in the document occurs in the query. Therefore, retrieval can be based on concepts rather than on terms, by first mapping items to a "concept space" and ranking by similarity, as shown in Fig. 1 [2].

Figure 1. Using concepts for information retrieval

Fig. 1 shows a middle layer based on concepts (c1 and c2), to which both queries and documents map, instead of directly relating documents and terms as in vector retrieval. In this view, a query with term t3 maps to concept c2 and returns d2, d3 and d4 in the answer set, based on the observation that they relate to concept c2, without requiring that the documents contain term t3.

A. Latent Semantic Indexing (LSI)

Latent semantic indexing (LSI) was originally introduced to find patterns in a document collection and thereby improve accuracy in Information Retrieval. It uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA) to decompose the original document-representation matrix A and to retain only the k largest singular values from the singular value matrix $\Sigma$. Keeping only the largest singular values in $\Sigma$ means keeping the corresponding columns in the two other matrices, U and $V^T$. The choice of k determines how many of the "important concepts" the ranking is based on; the assumption is that concepts with small singular values in $\Sigma$ are better regarded as "noise" and can thus be ignored. LSI can therefore be depicted by how the sizes of the involved matrices shrink when only the first k singular values are kept for computing the ranking, and by the resulting positions of terms and documents in the matrix [9]. Fig. 2 depicts how LSI finds the patterns in a document collection.

Figure 2. Description of LSI in a data collection

B. Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a method that can find the patterns in a matrix and identify which words and documents are similar to each other. From the term (t) x document (d) matrix A, it creates new matrices U, $\Sigma$ and V such that $A = U \Sigma V^T$, as illustrated in Fig. 3 [10].

Figure 3. The relative sizes of the three matrices when t > d

In Fig. 3, U has orthogonal, unit-length columns ($U^T U = I$) and holds the left singular vectors; V also has orthogonal, unit-length columns ($V^T V = I$) and holds the right singular vectors; and $\Sigma$ is the diagonal matrix (k x k) of singular values, where k is the rank of A ($\le \min(t, d)$). In general, the matrices of $A = U \Sigma V^T$ must all be of full rank. The amount of dimension reduction must be chosen correctly in order to represent the real structure of the data [2].

C. Principal Component Analysis (PCA)

Principal component analysis is a method for finding the k "principal axes", an orthonormal coordinate system that captures most of the variance in the data. Basically, PCA amounts to Singular Value Decomposition (SVD) applied to the covariance matrix, using the eigenvectors and eigenvalues of that matrix [11, 12].

III. DOCUMENT DISSIMILARITY

The dissimilarity of data objects (documents) is given by the distance between a document taken as cluster center and the others. The Euclidean distance d between two points x and y, in one-, two-, three-, or higher-dimensional space, is given by equation (1):

$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$    (1)

where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the k-th attributes (components) of x and y [1].
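To make the LSI projection concrete, the following is a minimal NumPy sketch (not the authors' code): it keeps the k largest singular values of a term-document matrix and maps each document into concept space via $V_k = A^T U_k \Sigma_k^{-1}$, and it includes the Euclidean distance of equation (1). The matrix values are made up for illustration, and k is assumed not to exceed the rank of A.

```python
import numpy as np

def lsi_project(A, k):
    """Project a term-document matrix A (terms x docs) onto a k-dimensional
    concept space via truncated SVD, i.e. V_k = A^T . U_k . Sigma_k^{-1}
    (Section II); assumes k <= rank(A) so the kept singular values are > 0."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U . diag(s) . Vt
    U_k = U[:, :k]                  # k left singular vectors (term patterns)
    S_k_inv = np.diag(1.0 / s[:k])  # inverse of the k largest singular values
    return A.T @ U_k @ S_k_inv      # one k-dimensional row per document

def euclidean(x, y):
    """Euclidean distance of equation (1)."""
    return np.sqrt(np.sum((x - y) ** 2))

# Toy 5-terms x 4-documents matrix with made-up TF-IDF-like weights.
A = np.array([[0.0, 1.2, 0.9, 0.0],
              [1.1, 0.0, 0.0, 1.3],
              [0.8, 0.0, 0.2, 1.0],
              [0.0, 0.9, 1.1, 0.1],
              [0.3, 0.4, 0.0, 0.2]])
V = lsi_project(A, k=2)
print(V.shape)                # (4, 2): each document now lives in concept space
print(euclidean(V[0], V[1]))  # distance between documents 0 and 1
```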
In contrast, the similarity between data objects (documents) corresponds to a small distance within one cluster. Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. A measure of similarity for document clustering is the cosine of the angle between two vectors, as in equation (2) [1]:

$\cos(d_i, d_j) = d_i^T d_j / (\|d_i\| \, \|d_j\|)$    (2)

where $d_i$ and $d_j$ are two different documents.

IV. FUZZY C-MEANS CLUSTERING

There are various fuzzy clustering algorithms; one simple fuzzy clustering technique is the fuzzy c-means algorithm (FCM) [13], which marked the birth of the fuzzy method. FCM is known to produce a reasonable partitioning of the original data in many cases (see [14]) and converges very quickly compared to some other approaches. Moreover, FCM converges to a minimum or saddle point of the objective function from any initialization; this may be either a local or the global minimum [15]. In principle, the algorithm is based on minimization of the objective function J(X; U, V). Generally, the objective function sums the dissimilarities weighted by the membership degrees $\mu_{ik}$, as in equation (3) [16]:

$J(X; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} (\mu_{ik})^m \, d^2(V_i, X_k)$    (3)

where d is the distance, V is the cluster center and X is the data (document-term matrix).

V. METHODOLOGY

The proposed method uses the LSI concept in order to obtain a small vector space. The detailed steps for document clustering are as follows (a code sketch of step 5 is given after the list):

1. Document preprocessing, which includes case folding, parsing, removal of stop words and stemming.
2. Removing the terms whose global frequency in the corpus is less than 2, as well as terms that occur in more than half of the total number of documents.
3. Representing the full-text documents as a term-document matrix A (vector space model) using TF-IDF term weights.
4. Mapping the term-document matrix A to the document matrix V in concept space using the LSI approach: from $A = U \cdot \Sigma \cdot V^T$ we get $V^T = \Sigma^{-1} \cdot U^T \cdot A$ (using $U^T U = I$, a property of the SVD matrices), and hence $V = A^T \cdot U \cdot \Sigma^{-1}$.
5. Implementing the Fuzzy C-Means clustering algorithm on the document matrix V as the representation of the document collection, using cosine similarity in place of Euclidean distance, with the dissimilarity defined as $1 - \cos(d_i, d_j)$. This dissimilarity is applied inside the objective function of the Fuzzy C-Means algorithm.
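As an illustration of step 5, here is a minimal NumPy sketch of fuzzy c-means with the cosine dissimilarity $1 - \cos(d_i, d_j)$ substituted for Euclidean distance in the objective of equation (3). The random initialization, the standard membership update, the weighted-mean center update and the clamp on near-zero dissimilarities are implementation assumptions not spelled out in the text; the defaults m = 1.1 and eps = 0.001 mirror the parameters reported in Section VII.

```python
import numpy as np

def cosine_dissim(X, V):
    """Dissimilarity 1 - cos(d_i, v_j) between document rows X (n x k) and
    cluster centers V (c x k); assumes nonzero rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return 1.0 - Xn @ Vn.T                          # n x c matrix in [0, 2]

def fuzzy_cmeans_cosine(X, c, m=1.1, eps=1e-3, max_iter=100, seed=0):
    """Fuzzy c-means minimizing the objective of equation (3), with the
    cosine dissimilarity above in place of squared Euclidean distance.
    Returns memberships U (n x c) and centers V (c x k)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)               # memberships sum to 1 per doc
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.T.sum(axis=1, keepdims=True)  # weighted-mean centers
        D = np.maximum(cosine_dissim(X, V), 1e-9)   # clamp: avoid 0 ** negative
        W = D ** (-2.0 / (m - 1.0))                 # standard membership update
        U_new = W / W.sum(axis=1, keepdims=True)
        if np.max(np.abs(U_new - U)) < eps:         # stop when memberships settle
            return U_new, V
        U = U_new
    return U, V
```

Note that the weighted-mean center update is the one implied by the squared-distance objective; keeping it unchanged while swapping in the cosine dissimilarity is a common heuristic rather than a derived optimum.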
VI. PERFORMANCE EVALUATION

To assess the quality of the clustering, two measurements are used: F-measure and entropy [17]. The basic idea comes from information retrieval: each cluster is treated as if it were the result of a query, and each class as if it were the desired set of documents for that query. The F-measure then involves the precision and recall of each cluster j with respect to each class i, as follows:

$\mathrm{Recall}(i, j) = n_{ij} / n_i$ ;  $\mathrm{Precision}(i, j) = n_{ij} / n_j$    (4)

where $n_{ij}$ is the number of documents with class label i in cluster j, $n_i$ is the number of documents with class label i, and $n_j$ is the number of documents in cluster j. The F-measure of cluster j and class i is then obtained as in equation (5):

$F(i, j) = \dfrac{2 \cdot \mathrm{Recall}(i, j) \cdot \mathrm{Precision}(i, j)}{\mathrm{Recall}(i, j) + \mathrm{Precision}(i, j)}$    (5)

A higher F-measure indicates a more accurate clustering, in terms of both precision and recall.

Another measurement, related to the internal quality of the clustering, is the entropy E of cluster j, which can be formulated as:

$E_j = -\sum_i P(i, j) \log P(i, j)$    (6)

where P(i, j) is the probability that a document has class label i and is assigned to cluster j. The total entropy of the clustering is then obtained by summing the entropies of each cluster, weighted by cluster size:

$E = \sum_j \dfrac{n_j}{n} E_j$    (7)

where $n_j$ is the size of cluster j and n is the total number of documents in the corpus. The lower the entropy, the higher the internal quality of the clustering. A sketch of both measures in code follows.
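Both measures can be computed from the class-cluster contingency counts $n_{ij}$. The sketch below is an illustration under stated assumptions, not the authors' code: fuzzy memberships are hardened by taking each document's largest membership, and the per-pair F(i, j) values are aggregated by taking, for each class, the best cluster, weighted by class size, which is the convention of [17].

```python
import numpy as np

def contingency(labels, clusters, n_classes, n_clusters):
    """n_ij counts: documents of class i assigned to cluster j."""
    M = np.zeros((n_classes, n_clusters))
    for i, j in zip(labels, clusters):
        M[i, j] += 1
    return M

def f_measure(M):
    """Overall F-measure from equations (4)-(5): per class, take the best
    cluster's F(i, j) and weight by class size (the convention of [17])."""
    n = M.sum()
    recall = M / np.maximum(M.sum(axis=1, keepdims=True), 1)     # n_ij / n_i
    precision = M / np.maximum(M.sum(axis=0, keepdims=True), 1)  # n_ij / n_j
    F = np.where(M > 0, 2 * recall * precision / (recall + precision + 1e-12), 0.0)
    return float(np.sum(M.sum(axis=1) / n * F.max(axis=1)))

def total_entropy(M):
    """Total entropy from equations (6)-(7): per-cluster entropies E_j
    weighted by cluster size n_j / n."""
    n = M.sum()
    nj = M.sum(axis=0)                          # cluster sizes
    P = M / np.maximum(nj, 1)                   # P(i, j) = n_ij / n_j
    Ej = -np.sum(P * np.log(np.where(P > 0, P, 1.0)), axis=0)  # 0 log 0 := 0
    return float(np.sum(nj / n * Ej))

# Usage: harden the fuzzy memberships first, e.g.
#   clusters = U.argmax(axis=1)
#   M = contingency(labels, clusters, n_classes, n_clusters)
```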
VII. EXPERIMENTAL RESULT

We evaluated the performance of the proposed method using data sets taken from the 20 Newsgroups collection [18]. The data is made up of four groups, according to data volume: Binary2, Multi5, Multi7 and Multi10. They consist of short news postings on various topics, which serve as the class reference in the clustering process. The Binary2 dataset contains 200 documents (10215 terms), with 100 documents per topic; the other datasets likewise have 100 documents per topic. The Multi5 dataset contains 500 documents (18367 terms), the Multi7 dataset contains 700 documents (18661 terms), and the Multi10 dataset contains 1000 documents (25627 terms). Each dataset is clustered separately according to its number of topics. The first step is document preprocessing, which reduces the volume density of the data; the result after preprocessing is shown in Table I.

TABLE I. DESCRIPTION OF DATASETS AFTER PREPROCESSING

Dataset   Topics                                        Docs   Terms
Binary2   talk.politics.mideast, talk.politics.misc     200    7432
Multi5    comp.graphics, rec.motorcycles,               500    13646
          rec.sport.baseball, sci.space,
          talk.politics.mideast
Multi7    alt.atheism, comp.sys.mac.hardware,           700    13959
          misc.forsale, rec.autos, rec.sport.hockey,
          sci.electronics, talk.politics.guns
Multi10   alt.atheism, comp.sys.mac.hardware,           1000   18655
          misc.forsale, rec.autos, rec.sport.hockey,
          sci.crypt, sci.electronics, sci.med,
          sci.space, talk.politics.guns

After that, the term weights are represented with the TF-IDF method in a matrix (vector space model). By applying the LSI method, the dimension of the term-document matrix is reduced, as shown in Table II. The size of the reduction depends on the selected k-rank, with the optimum condition found in the interval [2...50].

TABLE II. DIMENSION REDUCTION OF THE TERM-DOCUMENT MATRIX USING LSI

Data set   Total documents   Pattern-terms (SVD)   Pattern-terms (PCA)
Binary2    200               18                    26
Multi5     500               22                    22
Multi7     700               28                    24
Multi10    1000              20                    24

The document collection is then clustered with the Fuzzy C-Means algorithm, using fuzziness parameter m = 1.1, error rate ε = 0.001, and the cluster number c set to the number of topics (classes). Applying the LSI methods (SVD and PCA), the distribution of the documents in the Binary2 dataset and the positions of the cluster centers at a certain k-rank are illustrated in Fig. 4; the two methods yield different distributions of the dataset in clustering.

Figure 4. Dataset and cluster center distribution using the LSI approach (left: SVD method on the Binary2 dataset; right: PCA method on the Binary2 dataset)

VIII. DISCUSSION

The k-rank means that there are k patterns in each document collection. To determine the effect of the k-rank on cluster quality and to find the optimum condition (highest performance), we applied the methods over a range of k-ranks. For the Binary2 dataset, SVD and PCA were applied with k = 1, 2, 3, ..., 50; the results are shown in Figs. 5 and 6. A sketch of this sweep in code is given below.

Figure 5. Performance of clustering for Binary2 using SVD (precision, recall, F-measure and entropy versus k-rank)
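For completeness, a hypothetical driver for this sweep, reusing the sketches from the earlier sections (lsi_project, fuzzy_cmeans_cosine, contingency, f_measure and total_entropy); it is not self-contained and simply wires those pieces together for k = 1, ..., 50.

```python
# Hypothetical driver reusing the earlier sketches; A is the TF-IDF
# terms x docs matrix and `labels` holds each document's topic index.
def sweep_k_rank(A, labels, c, k_values=range(1, 51)):
    """Repeat the pipeline for each k-rank and collect the two quality
    measures, mirroring the k = 1..50 sweep reported for Binary2."""
    results = []
    for k in k_values:
        V = lsi_project(A, k)                   # documents in concept space
        U, _ = fuzzy_cmeans_cosine(V, c, m=1.1, eps=1e-3)
        clusters = U.argmax(axis=1)             # harden fuzzy memberships
        M = contingency(labels, clusters, c, c)
        results.append((k, f_measure(M), total_entropy(M)))
    return results                              # pick k with the best F-measure
```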
Figure 6. Performance of clustering for Binary2 using PCA (precision, recall, F-measure and entropy versus k-rank)

The clustering performance of each LSI method reaches its optimum at a different k-rank: for Binary2, the optimum for SVD is at k-rank = 18, and for PCA at k-rank = 6. The same procedure was applied to the other data sets: Multi5, Multi7 and Multi10. Furthermore, the performance of document clustering without LSI and with LSI, including cosine similarity measurement in place of Euclidean distance, is compared in Fig. 7 for all data sets.

Figure 7. Performance comparison of document clustering (precision, recall, F-measure and entropy, with Euclidean distance and with cosine similarity, for Binary2, Multi5, Multi7 and Multi10, without LSI and with SVD and PCA)

Fig. 7 shows that the accuracy of document clustering without LSI is very low, especially for the large-volume data sets (Multi7 and Multi10) when using Euclidean distance. Applied to the Multi7 dataset, it achieves precision = 0.464, recall = 0.324 and F-measure = 0.381, with a high entropy of 2.337. The same happens with the Multi10 dataset, which has the worst performance: precision = 0.410, recall = 0.341, F-measure = 0.372 and entropy = 2.422.

In contrast, when the LSI approach is used and cosine similarity replaces Euclidean distance in the Fuzzy C-Means algorithm, both the external and the internal quality of the clusters are very high. The Multi7 dataset obtains an F-measure of 0.906 (SVD) and 0.903 (PCA) with entropy 0.597 (SVD) and 0.609 (PCA), and Multi10 obtains an F-measure of 0.888 (SVD) and 0.887 (PCA) with entropy 0.710 (SVD) and 0.732 (PCA). The performance on the remaining data sets also increases, though not significantly.

IX. CONCLUSION

Document clustering can be performed using a concept space and cosine similarity. This yields a significant reduction of the term-document matrix dimension with respect to the k-rank (the total number of patterns). The average performance is also very high, with an F-measure of about 0.91 and entropy of about 0.51. The improvement is significant when applied to large data volumes (the Multi7 and Multi10 datasets), reaching more than a 50% increase.

REFERENCES
1. Pang-Ning Tan, M. Steinbach, and Vipin Kumar, Introduction to Data Mining. Pearson International ed. 2006: Pearson Education, Inc.
2. M.A. Hearst and J.O. Pedersen, Reexamining the cluster hypothesis. 1996: In Proceedings of SIGIR '96.
3. Jardine, N. and van Rijsbergen, C.J., The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval. Vol. 7. 1971.
4. Steinbach M., Karypis G., Kumar V., A Comparison of Document Clustering Techniques. 2000, University of Minnesota.
5. Savaresi, S.M., D.L. Boley, S. Bittanti and G. Gazzaniga, Cluster Selection in Divisive Clustering Algorithms. 2002.
6. Larose, D.T., An Introduction to Data Mining. Discovering Knowledge in Data. 2005: Wiley & Sons, Inc.
7. El-Sonbaty, Y. and Ismail, M.A., Fuzzy Clustering for Symbolic Data. IEEE Transactions on Fuzzy Systems, 1998. 6.
8. Rodrigues, M.E.S.M. and Sacks, L., A Scalable Hierarchical Fuzzy Clustering Algorithm for Text Mining. in The 5th International Conference on Recent Advances in Soft Computing. 2004.
9. Aberer, K., EPFL-SSC, L.d.s.d.i. repartis, Editor. 2003.
10. S. Deerwester, et al., Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990. 41: p. 391-407.
11. Smith, L., A Tutorial on Principal Component Analysis. 2002.
12. Shlens, J., A Tutorial on Principal Component Analysis. 2009.
13. Bezdek, J.C., Fuzzy Mathematics in Pattern Classification. 1973, Cornell University: Ithaca, New York.
14. Bezdek, J.C., Pattern Recognition with Fuzzy Objective Function Algorithms. 1981: Plenum Press.
15. Hathaway, R., Bezdek, J., and Tucker, W., An Improved Convergence Theory for the Fuzzy ISODATA Clustering Algorithms. Analysis of Fuzzy Information, 1987. 3 (Boca Raton: CRC Press): p. 123-132.
16. Sadaaki Miyamoto, Hidetomo Ichihashi, and Katsuhiro Honda, Algorithms for Fuzzy Clustering: Methods in c-Means Clustering with Applications. Studies in Fuzziness and Soft Computing, Vol. 229. 2008: Springer.
17. Bjornar Larsen and Chinatsu Aone, Fast and Effective Text Mining Using Linear-time Document Clustering, in KDD-99. 1999: San Diego, California.
18. http://kdd.ics.uci.edu/databases/20newgroups.html