K-Means 聚类算法
一般的, 我们用 \(C_i\) 既聚类的中心来代表一个聚类
Euclidean distance(\(in R^2\)): \(dist(a, b)=\sqrt{(a_x-b_x)^2+(a_y-b_y)^2}\)
蔟内变差 (用于衡量一个蔟 \(C_i\) 的质量):\(E=\sum_{i=1}^k\sum_{p\in{C_i}}dist(p,c_i)^2\)
Question: 既然蔟内变差可以恒量一个蔟的质量, 那么枚举完所有的可能, 然后再采用最佳的划分不就好了吗?
Answer: 不可以, 这是个 NP 完全问题, 即使固定蔟的个数 #cluster 和空间维度 #dimension, 依然开销巨大, 所以我们要采用 Greedy Algorithm(贪心方法)
算法: k-mean 用于划分 k-mean 算法, 其中每个蔟的中心都用蔟中所有的对象均值来表示
- Input:
- k: number of cluster
- D: Data set
- Output:k cluster
- Approach:
- Select k object as cluster center randomly from D;
- repeat
- Based on the distance between a obj(or points) with centroids, partition them into the most similar cluster;
- Update the mean of very cluster as the new centroids;
- until it doesn't change
代码部分
- import numpy as np
- import pandas as pd
- data = pd.read_csv('../trip.csv')
- data = data.iloc[:, 2:]
- class KMeans():
- def __init__(self, data, k, r):
- self.data = data
- self.k = k
- self.label = np.empty(data.shape[0])
- self.E = np.empty(data.shape[0])
- self.centroids = self.__init(data, k, r)
- self.__convergence = False
- def __distance(self, x, y):
- '''
- Return Euclid distance of x and y.
- '''
- return sum((x-y)**2)**(1/2)
- def __init(self, data, k_cluster, random_state):
- '''
- Select k objs as centroids from data set.
- '''
- rs = np.random.RandomState(random_state)
- num, dim = data.shape
- rand_id = rs.randint(num, size=k_cluster)
- centroids = np.empty((k_cluster, dim))
- for i, v in enumerate(rand_id):
- centroids[i, :] = data.iloc[v, :]
- return centroids
- def __converge(self, pre):
- return True if False not in (pre==self.centroids) else False
- def __update(self):
- for i in range(self.k):
- centroids[i] = data.loc[label==i].mean(axis=0)
- def fit(self):
- while not self.__convergence:
- pre_centroids = self.centroids.copy()
- num, dim = data.shape
- for i in range(num):
- min_dist, neighbour = np.inf, -1
- for j in range(self.k):
- dist = self.__distance(self.data.iloc[i,:], self.centroids[j])
- if dist < min_dist:
- min_dist, neighbour = dist, j
- self.label[i] = j
- self.E[i] = self.__distance(self.data.iloc[i,:], self.centroids[j])**2
- self.__convergence = self.__converge(pre_centroids)
来源: http://www.bubuko.com/infodetail-3457406.html