k-means++: The Advantages of Careful Seeding

David Arthur*        Sergei Vassilvitskii†

* Stanford University. Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
† Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer k and a set of n data points in R^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center.

Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., φ/φ_OPT is unbounded even when n and k are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point p as a center with probability proportional to p's contribution to the overall potential. Letting φ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, E[φ] ≤ 8(ln k + 2)φ_OPT.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm.
Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work

As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice, despite its limited accuracy.
The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with a running time of O(n^{kd}). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunately impractical even for relatively small n, k and d.

Kanungo et al. [17] proposed an O(n³ε^{−d}) algorithm that is (9 + ε)-competitive. However, n³ compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on d can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on d, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability O(1) approximation using a technique called successive sampling. They match our running time of O(nkd), but only if k is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.

Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting φ_{OPT,k} denote the optimal potential for a k-clustering on a given data set, they prove k-means++ is O(1)-competitive in the case where φ_{OPT,k}/φ_{OPT,k−1} ≤ ε². The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for k.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than O(log k)-competitive, and on very well formed data sets, it improves to being O(1)-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.

2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms.

For the k-means problem, we are given an integer k and a set of n data points X ⊂ R^d. We wish to choose k centers C so as to minimize the potential function,

    φ = Σ_{x∈X} min_{c∈C} ‖x − c‖².

Choosing these centers implicitly defines a clustering – for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.

Throughout the paper, we will let C_OPT denote the optimal clustering for a given instance of the k-means problem, and we will let φ_OPT denote the corresponding potential. Given a clustering C with potential φ, we also let φ(A) denote the contribution of A ⊂ X to the potential (i.e., φ(A) = Σ_{x∈A} min_{c∈C} ‖x − c‖²).
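For concreteness, the potential φ and the restricted potential φ(A) can be computed directly from their definitions. The short NumPy sketch below is an illustration only; it is not taken from the authors' C++ implementation [3], and the function names are made up for this example.

```python
# Illustrative sketch (not from the paper): evaluating the k-means
# potential phi for a point set X and a set of centers C.
import numpy as np

def potential(X, C):
    """phi = sum over x in X of min_{c in C} ||x - c||^2."""
    # Squared distances from every point to every center, shape (n, |C|).
    sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).sum()

def potential_of_subset(X, C, mask):
    """phi(A): contribution of the subset A (given as a boolean mask) to phi."""
    return potential(X[mask], C)

# Example: 5 random points in R^2, with the first 2 points used as centers.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
C = X[:2].copy()
print(potential(X, C))
```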
2.1 The k-means algorithm

The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.

1. Arbitrarily choose k initial centers C = {c₁, ..., c_k}.
2. For each i ∈ {1, ..., k}, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i.
3. For each i ∈ {1, ..., k}, set c_i to be the center of mass of all points in C_i: c_i = (1/|C_i|) Σ_{x∈C_i} x.
4. Repeat Steps 2 and 3 until C no longer changes.

It is standard practice to choose the initial centers uniformly at random from X. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease φ, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease φ, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let S be a set of points with center of mass c(S), and let z be an arbitrary point. Then,

    Σ_{x∈S} ‖x − z‖² − Σ_{x∈S} ‖x − c(S)‖² = |S| · ‖c(S) − z‖².

Monotonicity for Step 3 follows from taking S to be a single cluster and z to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.
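Steps 1 through 4 translate directly into code. The sketch below is again only an illustration (not the authors' implementation): it seeds uniformly at random, as described above, and then alternates the assignment and center-of-mass steps; by Lemma 2.1, φ never increases from one iteration to the next.

```python
# Illustrative sketch (not the authors' code): Lloyd's algorithm with the
# standard uniform-at-random seeding described in Section 2.1.
import numpy as np

def lloyd(X, centers, max_iter=100):
    """Steps 2-4: alternate assignment and center-of-mass updates."""
    centers = centers.copy()
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center.
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster.
        new_centers = centers.copy()
        for i in range(len(centers)):
            if np.any(labels == i):          # leave empty clusters unchanged
                new_centers[i] = X[labels == i].mean(axis=0)
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

def uniform_seeding(X, k, rng):
    """Standard practice: k initial centers drawn uniformly from the data."""
    return X[rng.choice(len(X), size=k, replace=False)]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
centers, labels = lloyd(X, uniform_seeding(X, 3, rng))
```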
2.2 The k-means++ algorithm

The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let D(x) denote the shortest distance from a data point x to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center c₁ uniformly at random from X.
1b. Choose the next center c_i, selecting c_i = x′ ∈ X with probability D(x′)² / Σ_{x∈X} D(x)².
1c. Repeat Step 1b until we have chosen a total of k centers.
2–4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "D² weighting".
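Steps 1a–1c are equally short in code. The sketch below illustrates D² weighting (the names are again illustrative, not the authors' code); running the Lloyd iterations from the previous sketch on the returned centers gives the complete k-means++ algorithm.

```python
# Illustrative sketch (not the authors' code): D^2 seeding, Steps 1a-1c.
import numpy as np

def d2_seeding(X, k, rng):
    """Choose k centers; each new center x' is drawn with probability
    D(x')^2 / sum_x D(x)^2, where D(x) is the distance from x to the
    nearest center chosen so far."""
    n = len(X)
    centers = [X[rng.integers(n)]]              # Step 1a: uniform first center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)    # current D(x)^2 values
    for _ in range(1, k):                       # Steps 1b-1c
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
        # Update D(x)^2 to account for the newly added center.
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
init = d2_seeding(X, 3, rng)
# centers, labels = lloyd(X, init)   # Lloyd iterations from the sketch in Section 2.1
```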
3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)φ_OPT.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease φ. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of C_OPT from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ(A)] = 2φ_OPT(A).

Proof. Let c(A) denote the center of mass of the data points in A. By Lemma 2.1, we know that since C_OPT is optimal, it must be using c(A) as the center corresponding to the cluster A. Using the same lemma again, we see E[φ(A)] is given by,

    Σ_{a₀∈A} (1/|A|) · Σ_{a∈A} ‖a − a₀‖²
      = (1/|A|) Σ_{a₀∈A} ( Σ_{a∈A} ‖a − c(A)‖² + |A| · ‖a₀ − c(A)‖² )
      = 2 Σ_{a∈A} ‖a − c(A)‖²,

and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with D² weighting.

Lemma 3.2. Let A be an arbitrary cluster in C_OPT, and let C be an arbitrary clustering. If we add a random center to C from A, chosen with D² weighting, then E[φ(A)] ≤ 8φ_OPT(A).

Proof. The probability that we choose some fixed a₀ as our center, given that we are choosing our center from A, is precisely D(a₀)² / Σ_{a∈A} D(a)². Furthermore, after choosing the center a₀, a point a will contribute precisely min(D(a), ‖a − a₀‖)² to the potential. Therefore,

    E[φ(A)] = Σ_{a₀∈A} [ D(a₀)² / Σ_{a∈A} D(a)² ] · Σ_{a∈A} min(D(a), ‖a − a₀‖)².

Note by the triangle inequality that D(a₀) ≤ D(a) + ‖a − a₀‖ for all a, a₀. From this, the power-mean inequality¹ implies that D(a₀)² ≤ 2D(a)² + 2‖a − a₀‖². Summing over all a, we then have that D(a₀)² ≤ (2/|A|) Σ_{a∈A} D(a)² + (2/|A|) Σ_{a∈A} ‖a − a₀‖², and hence, E[φ(A)] is at most,

    (2/|A|) · Σ_{a₀∈A} [ Σ_{a∈A} D(a)² / Σ_{a∈A} D(a)² ] · Σ_{a∈A} min(D(a), ‖a − a₀‖)²
      + (2/|A|) · Σ_{a₀∈A} [ Σ_{a∈A} ‖a − a₀‖² / Σ_{a∈A} D(a)² ] · Σ_{a∈A} min(D(a), ‖a − a₀‖)².

In the first expression, we substitute min(D(a), ‖a − a₀‖)² ≤ ‖a − a₀‖², and in the second expression, we substitute min(D(a), ‖a − a₀‖)² ≤ D(a)². Simplifying, we then have,

    E[φ(A)] ≤ (4/|A|) · Σ_{a₀∈A} Σ_{a∈A} ‖a − a₀‖² = 8φ_OPT(A).

The last step here follows from Lemma 3.1.

¹ The power-mean inequality states for any real numbers a₁, ..., a_m that Σ a_i² ≥ (1/m)(Σ a_i)². It follows from the Cauchy-Schwarz inequality. We are only using the case m = 2 here, but we will need the general case for Lemma 3.3.

We have now shown that seeding by D² weighting is competitive as long as it chooses centers from each cluster of C_OPT, which completes the first half of our argument. We now use induction to show the total error in general is at most O(log k).

Lemma 3.3. Let C be an arbitrary clustering. Choose u > 0 "uncovered" clusters from C_OPT, and let X_u denote the set of points in these clusters. Also let X_c = X − X_u. Now suppose we add t ≤ u random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential. Then, E[φ′] is at most,

    (φ(X_c) + 8φ_OPT(X_u)) · (1 + H_t) + ((u − t)/u) · φ(X_u).

Here, H_t denotes the harmonic sum, 1 + 1/2 + ··· + 1/t.

Proof. We prove this by induction, showing that if the result holds for (t − 1, u) and (t − 1, u − 1), then it also holds for (t, u). Therefore, it suffices to check t = 0, u > 0 and t = u = 1 as our base cases.

If t = 0 and u > 0, the result follows from the fact that 1 + H_t = (u − t)/u = 1. Next, suppose t = u = 1. We choose our one new center from the one uncovered cluster with probability exactly φ(X_u)/φ. In this case, Lemma 3.2 guarantees that E[φ′] ≤ φ(X_c) + 8φ_OPT(X_u). Since φ′ ≤ φ even if we choose a center from a covered cluster, we have,

    E[φ′] ≤ (φ(X_u)/φ) · (φ(X_c) + 8φ_OPT(X_u)) + (φ(X_c)/φ) · φ
          ≤ 2φ(X_c) + 8φ_OPT(X_u).

Since 1 + H_t = 2 here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases. First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly φ(X_c)/φ. Note that this new center can only decrease φ. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with t decreased by one. It follows that our contribution to E[φ′] in this case is at most,

    (φ(X_c)/φ) · ( (φ(X_c) + 8φ_OPT(X_u)) · (1 + H_{t−1}) + ((u − t + 1)/u) · φ(X_u) ).

On the other hand, suppose we choose our first center from some uncovered cluster A. This happens with probability φ(A)/φ. Let p_a denote the probability that we choose a ∈ A as our center, given the center is somewhere in A, and let φ_a denote φ(A) after we choose a as our center. Once again, we apply our inductive hypothesis, this time adding A to the set of covered clusters, as well as decreasing both t and u by 1. It follows that our contribution to E[φ′] in this case is at most,

    (φ(A)/φ) · Σ_{a∈A} p_a ( (φ(X_c) + φ_a + 8φ_OPT(X_u) − 8φ_OPT(A)) · (1 + H_{t−1}) + ((u − t)/(u − 1)) · (φ(X_u) − φ(A)) )
      ≤ (φ(A)/φ) · ( (φ(X_c) + 8φ_OPT(X_u)) · (1 + H_{t−1}) + ((u − t)/(u − 1)) · (φ(X_u) − φ(A)) ).

The last step here follows from the fact that Σ_{a∈A} p_a φ_a ≤ 8φ_OPT(A), which is implied by Lemma 3.2.

Now, the power-mean inequality implies that Σ_{A⊂X_u} φ(A)² ≥ (1/u) · φ(X_u)². Therefore, if we sum over all uncovered clusters A, we obtain a potential contribution of at most,

    (φ(X_u)/φ) · (φ(X_c) + 8φ_OPT(X_u)) · (1 + H_{t−1}) + (1/φ) · ((u − t)/(u − 1)) · ( φ(X_u)² − (1/u) · φ(X_u)² )
      = (φ(X_u)/φ) · ( (φ(X_c) + 8φ_OPT(X_u)) · (1 + H_{t−1}) + ((u − t)/u) · φ(X_u) ).

Combining the potential contribution to E[φ′] from both cases, we now obtain the desired bound:

    E[φ′] ≤ (φ(X_c) + 8φ_OPT(X_u)) · (1 + H_{t−1}) + ((u − t)/u) · φ(X_u) + (φ(X_c)/φ) · (φ(X_u)/u)
          ≤ (φ(X_c) + 8φ_OPT(X_u)) · (1 + H_{t−1} + 1/u) + ((u − t)/u) · φ(X_u).

The inductive step now follows from the fact that 1/u ≤ 1/t.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)φ_OPT.

Proof. Consider the clustering C after we have completed Step 1. Let A denote the C_OPT cluster in which we chose the first center. Applying Lemma 3.3 with t = u = k − 1 and with A being the only covered cluster, we have,

    E[φ] ≤ (φ(A) + 8φ_OPT − 8φ_OPT(A)) · (1 + H_{k−1}).

The result now follows from Lemma 3.1, and from the fact that H_{k−1} ≤ 1 + ln k.
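Theorem 3.1 bounds the expected potential immediately after the seeding step, before any Lloyd iterations. As an informal sanity check (not an experiment from the paper), one can compare the average seeded potential against the bound on synthetic data; since the potential at the planted centers is at least φ_OPT, the theorem implies the inequality tested below.

```python
# Illustrative sanity check (not from the paper) of the Theorem 3.1 bound
# on synthetic data.  phi at the planted centers is an upper bound on
# phi_OPT, so the theorem implies
#     E[phi_seed] <= 8(ln k + 2) * phi(planted centers).
import numpy as np

def d2_seeding(X, k, rng):
    # Same D^2 seeding as the sketch in Section 2.2.
    n = len(X)
    centers = [X[rng.integers(n)]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)

def potential(X, C):
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()

rng = np.random.default_rng(3)
k, d, per_cluster = 10, 5, 100
planted = rng.uniform(0, 100, size=(k, d))          # typically well separated centers
X = np.vstack([c + rng.normal(size=(per_cluster, d)) for c in planted])

trials = [potential(X, d2_seeding(X, k, rng)) for _ in range(50)]
ratio = np.mean(trials) / potential(X, planted)
print(f"mean seeding ratio {ratio:.2f} vs. bound 8(ln k + 2) = {8 * (np.log(k) + 2):.2f}")
```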
4 A matching lower bound

In this section, we show that the D² seeding used by k-means++ is no better than Ω(log k)-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix k, and then choose n, ∆, δ with n ≫ k and ∆ ≫ δ. We construct X with n points. First choose k centers c₁, c₂, ..., c_k such that ‖c_i − c_j‖² = ∆² − ((n − k)/n) · δ² for all i ≠ j. Now, for each c_i, add data points x_{i,1}, x_{i,2}, ..., x_{i,n/k} arranged in a regular simplex with center c_i, side length δ, and radius √((n − k)/(2n)) · δ. If we do this in orthogonal dimensions for each i, we then have,

    ‖x_{i,i′} − x_{j,j′}‖ = δ if i = j, and ∆ otherwise.

We prove our seeding technique is in expectation Ω(log k) worse than the optimal clustering in this case. Clearly, the optimal clustering has centers {c_i}, which leads to an optimal potential of φ_OPT = ((n − k)/2) · δ². Conversely, using an induction similar to that of Lemma 3.3, we show D² seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of C_OPT from which we have not chosen a center).

Lemma 4.1. Let C be an arbitrary clustering on X with k − t ≥ 1 centers, but with u clusters from C_OPT uncovered. Now suppose we add t random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential. Furthermore, let α = (n − k²)/n, β = (∆² − 2kδ²)/∆² and H′_u = Σ_{i=1}^{u} (k − i)/(ki). Then, E[φ′] is at least,

    α^{t+1} · ( nδ² · (1 + H′_u) · β + ((n/k) · ∆² − 2nδ²) · (u − t) ).

Proof. We prove this by induction on t. If t = 0, note that,

    φ′ = φ = (n − u · (n/k) − k) · δ² + u · (n/k) · ∆².

Since n − u · (n/k) ≥ n/k, we have (n − u · (n/k) − k)/(n − u · (n/k)) ≥ (n/k − k)/(n/k) = α. Also, α, β ≤ 1. Therefore,

    φ′ ≥ α · ( (n − u · (n/k)) · δ² · β + u · (n/k) · ∆² ).

Finally, since nδ²u ≥ u · (n/k) · δ² · β and nδ²u ≥ nδ² · H′_u · β, we have,

    φ′ ≥ α · ( nδ² · (1 + H′_u) · β + ((n/k) · ∆² − 2nδ²) · u ).

This completes the base case.

We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,

    u · (n/k) · ∆² / ( u · (n/k) · ∆² + (k − u) · (n/k) · δ² − (k − t)δ² )
      ≥ u∆² / ( u∆² + (k − u)δ² )
      ≥ α · u∆² / ( u∆² + (k − u)δ² ).

Applying our inductive hypothesis with t and u both decreased by 1, we obtain a potential contribution from this case of at least,

    [ u∆² / ( u∆² + (k − u)δ² ) ] · α^{t+1} · ( nδ² · (1 + H′_{u−1}) · β + ((n/k) · ∆² − 2nδ²) · (u − t) ).

The probability that our first center is chosen from a covered cluster is,

    ( (k − u) · (n/k) · δ² − (k − t)δ² ) / ( u · (n/k) · ∆² + (k − u) · (n/k) · δ² − (k − t)δ² )
      ≥ [ ( (k − u) · (n/k) · δ² − (k − t)δ² ) / ( (k − u) · (n/k) · δ² ) ] · [ (k − u)δ² / ( u∆² + (k − u)δ² ) ]
      ≥ α · (k − u)δ² / ( u∆² + (k − u)δ² ).

Applying our inductive hypothesis with t decreased by 1 but with u constant, we obtain a potential contribution from this case of at least,

    [ (k − u)δ² / ( u∆² + (k − u)δ² ) ] · α^{t+1} · ( nδ² · (1 + H′_u) · β + ((n/k) · ∆² − 2nδ²) · (u − t + 1) ).

Therefore, E[φ′] is at least,

    α^{t+1} · ( nδ² · (1 + H′_u) · β + ((n/k) · ∆² − 2nδ²) · (u − t) )
      + [ α^{t+1} / ( u∆² + (k − u)δ² ) ] · ( (k − u)δ² · ((n/k) · ∆² − 2nδ²) − u∆² · (H′_u − H′_{u−1}) · nδ² · β ).

However, H′_u − H′_{u−1} = (k − u)/(ku) and β = (∆² − 2kδ²)/∆², so

    u∆² · (H′_u − H′_{u−1}) · nδ² · β = (k − u)δ² · ((n/k) · ∆² − 2nδ²),

and the result follows.

As in the previous section, we obtain the desired result by specializing the induction.

Theorem 4.1. D² seeding is no better than 2(ln k)-competitive.

Proof. Suppose a clustering with potential φ is constructed using k-means++ on X described above. Apply Lemma 4.1 with u = t = k − 1 after the first center has been chosen. Noting that 1 + H′_{k−1} = 1 + Σ_{i=1}^{k−1} (1/i − 1/k) = H_k > ln k, we then have,

    E[φ] ≥ α^k · β · nδ² · ln k.

Now, fix k and δ but let n and ∆ approach infinity. Then α and β both approach 1, and the result follows from the fact that φ_OPT = ((n − k)/2) · δ².
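For concreteness, the instance X used in this section can be generated explicitly. The sketch below is one possible embedding (the paper only prescribes the pairwise distances): each cluster occupies its own block of n/k coordinates holding a centered regular simplex of side length δ, and one extra coordinate per cluster separates the cluster centers by the required amount.

```python
# Illustrative sketch (not from the paper): one concrete embedding of the
# lower-bound instance X.  Cluster i occupies its own block of n/k
# coordinates (a centered regular simplex of side length delta), plus one
# private coordinate that pushes the cluster centers apart as required.
import numpy as np

def lower_bound_instance(k, n, delta, Delta):
    assert n % k == 0 and Delta ** 2 > (n - k) / n * delta ** 2
    m = n // k                                   # points per cluster
    # Centered regular simplex with m vertices in R^m, side length delta.
    simplex = (delta / np.sqrt(2)) * (np.eye(m) - np.ones((m, m)) / m)
    # Offset s in a private coordinate per cluster so that
    # ||c_i - c_j||^2 = 2 s^2 = Delta^2 - (n - k)/n * delta^2.
    s = np.sqrt((Delta ** 2 - (n - k) / n * delta ** 2) / 2)
    X = np.zeros((n, k * m + k))
    for i in range(k):
        rows = slice(i * m, (i + 1) * m)
        X[rows, i * m:(i + 1) * m] = simplex     # within-cluster geometry
        X[rows, k * m + i] = s                   # between-cluster separation
    return X

X = lower_bound_instance(k=5, n=50, delta=1.0, Delta=100.0)
# Sanity check: distances are delta within a cluster and Delta across clusters.
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
print(np.unique(np.round(D, 6)))                 # approximately {0, delta, Delta}
```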
However,and so Proof.Let c denote the center of A in Copr.Then, u△2.(H'(u-H'(u-1))·n62.3 Eo9(41= 可∑∑Ia-or 1 a0∈AaeA =k-2.(A2-2n) ∑∑a-cf+lm-的 20-1 aoEAaEA and the result follows. 2®8r(A). As in the previous section,we obtain the desired The second step here follows from the triangle inequality result by specializing the induction. and the power-mean inequality. THEOREM 4.1.D2 seeding is no better than 2(Ink)- The rest of our upper bound analysis carries competitive. through without change,except that in the proof of Lemma 3.2,we lose a factor of 2-1 from the power- Proof.Suppose a clustering with potential o is con- mean inequality,instead of just 2.Putting everything structed using k-means++on described above.Ap- together,we obtain the general theorem. ply Lemma 4.1 with u=t=k-1 after the first THEOREM 5.1.If c is constructed with Dt seeding, center has been chosen.Noting that 1+1= 1+∑(仔-)=H>lnk,we then have,. then the corresponding potential function satisfies, Eo4]≤24(mk+2)o8r E[l≥aB.nd2.lnk. 6 Empirical results In order to evaluate k-means++in practice,we have Now,fix k and 6 but let n and A approach infinity.implemented and tested it in C++[3].In this section, Then a and B both approach 1,and the result follows we discuss the results of these preliminary experiments. from the fact that We found that D2 seeding substantially improves both the running time and the accuracy of k-means. 5 Generalizations Although the k-means algorithm itself applies only 6.1 Datasets We evaluated the performance of in vector spaces with the potential function k-means and k-means++on four datasets. mineecl-c2,we note that our seeding tech- The first dataset,Norm25,is synthetic.To generate nique does not have the same limitations.In this sec-it,we chose 25 "true"centers uniformly at random tion,we discuss extending our results to arbitrary met- from a 15-dimensional hypercube of side length 500. ric spaces with the more general potential function, We then added points from Gaussian distributions of φlg=∑x mincecl-c for≥1.In particular, variance 1 around each true center.Thus,we obtained note that the case of e=1 is the k-medians potential a number of well separated Gaussians with the the true function. centers providing a good approximation to the optimal These generalizations require only one change to clustering. the algorithm itself.Instead of using D2 seeding,we We chose the remaining datasets from real-world switch to De seeding-i.e.,we choose zo as a center examples off the UC-Irvine Machine Learning Reposi- with probability D(zo)e tory.The Cloud dataset [7]consists of 1024 points in 10 For the analysis,the most important change ap- dimensions,and it is Philippe Collard's first cloud cover database.The Intrusion dataset [18]consists of 494019 pears in Lemma 3.1.Our original proof uses an inner product structure that is not available in the general points in 35 dimensions,and it represents features avail- able to an intrusion detection system.Finally,the Spam case.However,a slightly weaker result can be proven dataset [25]consists of 4601 points in 58 dimensions, using only the triangle inequality. and it represents features available to an e-mail spam detection system. LEMMA 5.1.Let A be an arbitrary cluster in CoPT,and For each dataset,we tested k =10.25,and 50. let C be the clustering with just one center,which is cho- sen uniformly at random from A.Then,Ell(A)] 248r(4). 6.2 Metrics Since we were testing randomized seed- ing processes,we ran 20 trials for each case.We report
5 Generalizations

Although the k-means algorithm itself applies only in vector spaces with the potential function φ = Σ_{x∈X} min_{c∈C} ‖x − c‖², we note that our seeding technique does not have the same limitations. In this section, we discuss extending our results to arbitrary metric spaces with the more general potential function, φ^[ℓ] = Σ_{x∈X} min_{c∈C} ‖x − c‖^ℓ for ℓ ≥ 1. In particular, note that the case of ℓ = 1 is the k-median potential function.

These generalizations require only one change to the algorithm itself. Instead of using D² seeding, we switch to D^ℓ seeding – i.e., we choose x₀ as a center with probability D(x₀)^ℓ / Σ_{x∈X} D(x)^ℓ.

For the analysis, the most important change appears in Lemma 3.1. Our original proof uses an inner product structure that is not available in the general case. However, a slightly weaker result can be proven using only the triangle inequality.

Lemma 5.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ^[ℓ](A)] ≤ 2^ℓ · φ^[ℓ]_OPT(A).

Proof. Let c denote the center of A in C_OPT. Then,

    E[φ^[ℓ](A)] = (1/|A|) Σ_{a₀∈A} Σ_{a∈A} ‖a − a₀‖^ℓ
               ≤ (2^{ℓ−1}/|A|) Σ_{a₀∈A} Σ_{a∈A} ( ‖a − c‖^ℓ + ‖a₀ − c‖^ℓ )
               = 2^ℓ · φ^[ℓ]_OPT(A).

The second step here follows from the triangle inequality and the power-mean inequality.

The rest of our upper bound analysis carries through without change, except that in the proof of Lemma 3.2, we lose a factor of 2^{ℓ−1} from the power-mean inequality, instead of just 2. Putting everything together, we obtain the general theorem.

Theorem 5.1. If C is constructed with D^ℓ seeding, then the corresponding potential function φ^[ℓ] satisfies E[φ^[ℓ]] ≤ 2^{2ℓ}(ln k + 2)φ^[ℓ]_OPT.
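Switching from D² to D^ℓ seeding changes a single exponent in the seeding sketch from Section 2.2. The version below is again illustrative only; ℓ = 2 recovers k-means++ seeding, and ℓ = 1 is the natural choice for the k-median potential.

```python
# Illustrative sketch (not from the paper): D^l seeding for general l >= 1.
# l = 2 recovers the D^2 seeding of k-means++; l = 1 matches the k-median
# potential discussed above.
import numpy as np

def dl_seeding(X, k, l, rng):
    """Choose x0 as a center with probability D(x0)^l / sum_x D(x)^l."""
    n = len(X)
    centers = [X[rng.integers(n)]]
    dist = np.linalg.norm(X - centers[0], axis=1)     # current D(x) values
    for _ in range(1, k):
        weights = dist ** l
        centers.append(X[rng.choice(n, p=weights / weights.sum())])
        dist = np.minimum(dist, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
medians_init = dl_seeding(X, k=4, l=1, rng=rng)       # seeding for k-median
```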
6 Empirical results

In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that D² seeding substantially improves both the running time and the accuracy of k-means.

6.1 Datasets

We evaluated the performance of k-means and k-means++ on four datasets.

The first dataset, Norm25, is synthetic. To generate it, we chose 25 "true" centers uniformly at random from a 15-dimensional hypercube of side length 500. We then added points from Gaussian distributions of variance 1 around each true center. Thus, we obtained a number of well separated Gaussians with the true centers providing a good approximation to the optimal clustering.

We chose the remaining datasets from real-world examples off the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard's first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features available to an intrusion detection system. Finally, the Spam dataset [25] consists of 4601 points in 58 dimensions, and it represents features available to an e-mail spam detection system.

For each dataset, we tested k = 10, 25, and 50.

6.2 Metrics

Since we were testing randomized seeding processes, we ran 20 trials for each case. We report the minimum and the average potential (actually divided by the number of points), as well as the mean running time. Our implementations are standard with no special optimizations.
           Average φ                 Minimum φ                 Average T
  k     k-means    k-means++     k-means    k-means++     k-means    k-means++
 10    1.365·10⁵     8.47%      1.174·10⁵     0.93%         0.12      46.72%
 25    4.233·10⁴    99.96%      1.914·10⁴    99.92%         0.90      87.79%
 50    7.750·10³    99.81%      1.474·10¹     0.53%         2.04      −1.62%

Table 1: Experimental results on the Norm25 dataset (n = 10000, d = 15). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means: 100% · (1 − k-means++ value / k-means value).

           Average φ                 Minimum φ                 Average T
  k     k-means    k-means++     k-means    k-means++     k-means    k-means++
 10    7.921·10³    22.33%      6.284·10³    10.37%         0.08      51.09%
 25    3.637·10³    42.76%      2.550·10³    22.60%         0.11      43.21%
 50    1.867·10³    39.01%      1.407·10³    23.07%         0.16      41.99%

Table 2: Experimental results on the Cloud dataset (n = 1024, d = 10). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

           Average φ                 Minimum φ                 Average T
  k     k-means    k-means++     k-means    k-means++     k-means    k-means++
 10    3.387·10⁸    93.37%      3.206·10⁸    94.40%        63.94      44.49%
 25    3.149·10⁸    99.20%      3.100·10⁸    99.32%       257.34      49.19%
 50    3.079·10⁸    99.84%      3.076·10⁸    99.87%       917.00      66.70%

Table 3: Experimental results on the Intrusion dataset (n = 494019, d = 35). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

           Average φ                 Minimum φ                 Average T
  k     k-means    k-means++     k-means    k-means++     k-means    k-means++
 10    3.698·10⁴    49.43%      3.684·10⁴    54.59%         2.36      69.00%
 25    3.288·10⁴    88.76%      3.280·10⁴    89.58%         7.36      79.84%
 50    3.183·10⁴    95.35%      2.384·10⁴    94.30%        12.20      75.76%

Table 4: Experimental results on the Spam dataset (n = 4601, d = 58). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.
6.3 Results

The results for k-means and k-means++ are displayed in Tables 1 through 4. We list the absolute results for k-means, and the percentage improvement achieved by k-means++ (e.g., a 90% improvement in the running time is equivalent to a factor 10 speedup). We observe that k-means++ consistently outperformed k-means, both by achieving a lower potential value, in some cases by several orders of magnitude, and also by having a faster running time. The D² seeding is slightly slower than uniform seeding, but it still leads to a faster algorithm since it helps the local search converge after fewer iterations.

The synthetic example is a case where standard k-means does very badly. Even though there is an "obvious" clustering, the uniform seeding will inevitably merge some of these clusters, and the local search will never be able to split them apart (see [12] for further discussion of this phenomenon). The careful seeding method of k-means++ avoided this problem altogether, and it almost always attained the optimal clustering on the synthetic dataset.

The difference between k-means and k-means++ on the real-world datasets was also substantial. In every case, k-means++ achieved at least a 10% accuracy improvement over k-means, and it often performed much better. Indeed, on the Spam and Intrusion datasets, k-means++ achieved potentials 20 to 1000 times smaller than those achieved by standard k-means. Each trial also completed two to three times faster, and each individual trial was much more likely to achieve a good clustering.

7 Conclusion and future work

We have presented a new way to seed the k-means algorithm that is O(log k)-competitive with the optimal clustering. Furthermore, our seeding technique is as fast and as simple as the k-means algorithm itself, which makes it attractive in practice. Towards that end, we ran preliminary experiments on several real-world datasets, and we observed that k-means++ substantially outperformed standard k-means in terms of both speed and accuracy.

Although our analysis of the expected potential E[φ] achieved by k-means++ is tight to within a constant factor, a few open questions still remain. Most importantly, it is standard practice to run the k-means algorithm multiple times, and then keep only the best clustering found. This raises the question of whether k-means++ achieves asymptotically better results if it is allowed several trials. For example, if k-means++ is run 2^k times, our arguments can be modified to show it is likely to achieve a constant approximation at least once. We ask whether a similar bound can be achieved for a smaller number of trials.

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible. Unfortunately, our proofs do not carry over to this scenario. It would be interesting to see a comparable (or better) asymptotic result proven here.

Finally, we are currently working on a more thorough experimental analysis. In particular, we are measuring the performance of not only k-means++ and standard k-means, but also other variants that have been suggested in the theory community.

Acknowledgements

We would like to thank Rajeev Motwani for his helpful comments.

References
[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155–165, New York, NY, USA, 2004. ACM Press.
[2] D. Arthur and S. Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In Symposium on Foundations of Computer Science, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip.
[4] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG '06: Proceedings of the twenty-second annual Symposium on Computational Geometry. ACM Press, 2006.
[5] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[6] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 30–39, New York, NY, USA, 2003. ACM Press.
[7] Philippe Collard's cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data.
[8] Sanjoy Dasgupta. How fast is k-means? In Bernhard Schölkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, page 735. Springer, 2003.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 50–58, New York, NY, USA, 2003. ACM Press.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9–33, 2004.
[11] Frédéric Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281–291, 2005.
[12] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
[13] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC '04: Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing, pages 291–300, New York, NY, USA, 2004. ACM Press.
[14] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 877–885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[15] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093–1105, 1999.
[16] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the tenth annual Symposium on Computational Geometry, pages 332–339, New York, NY, USA, 1994. ACM Press.
[17] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89–112, 2004.
[18] KDD Cup 1999 dataset. http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html.
[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454–462, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982.
[21] Jiří Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61–84, 2000.
[22] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Adnan Darwiche and Nir Friedman, editors, UAI, pages 344–351. Morgan Kaufmann, 2002.
[23] A. Meyerson. Online facility location. In FOCS '01: Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, page 426, Washington, DC, USA, 2001. IEEE Computer Society.
[24] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Symposium on Foundations of Computer Science, 2006.
[25] Spam e-mail database. http://www.ics.uci.edu/~mlearn/databases/spambase/.