K-means 聚类算法 - Python

K-Means 聚类算法

一般的, 我们用 \(C_i\) 既聚类的中心来代表一个聚类

Euclidean distance(\(in R^2\)): \(dist(a, b)=\sqrt{(a_x-b_x)^2+(a_y-b_y)^2}\)

蔟内变差 (用于衡量一个蔟 \(C_i\) 的质量):\(E=\sum_{i=1}^k\sum_{p\in{C_i}}dist(p,c_i)^2\)

Question: 既然蔟内变差可以恒量一个蔟的质量, 那么枚举完所有的可能, 然后再采用最佳的划分不就好了吗?

Answer: 不可以, 这是个 NP 完全问题, 即使固定蔟的个数 #cluster 和空间维度 #dimension, 依然开销巨大, 所以我们要采用 Greedy Algorithm(贪心方法)

算法: k-mean 用于划分 k-mean 算法, 其中每个蔟的中心都用蔟中所有的对象均值来表示

Input:
k: number of cluster
D: Data set
Output:k cluster
Approach:
Select k object as cluster center randomly from D;
repeat
Based on the distance between a obj(or points) with centroids, partition them into the most similar cluster;
Update the mean of very cluster as the new centroids;
until it doesn't change

代码部分

import numpy as np
import pandas as pd
data = pd.read_csv('../trip.csv')
data = data.iloc[:, 2:]
class KMeans():
def __init__(self, data, k, r):
    self.data = data
    self.k = k
    self.label = np.empty(data.shape[0])
    self.E = np.empty(data.shape[0])
    self.centroids = self.__init(data, k, r)
    self.__convergence = False
def __distance(self, x, y):
    '''
    Return Euclid distance of x and y.
    '''
    return sum((x-y)**2)**(1/2)
def __init(self, data, k_cluster, random_state):
    '''
    Select k objs as centroids from data set.
    '''
    rs = np.random.RandomState(random_state)
    num, dim = data.shape
    rand_id = rs.randint(num, size=k_cluster)
    centroids = np.empty((k_cluster, dim))
    for i, v in enumerate(rand_id):
        centroids[i, :] = data.iloc[v, :]
    return centroids
def __converge(self, pre):
    return True if False not in (pre==self.centroids) else False
def __update(self):
    for i in range(self.k):
        centroids[i] = data.loc[label==i].mean(axis=0)
def fit(self):
    while not self.__convergence:
        pre_centroids = self.centroids.copy()
        num, dim = data.shape
        for i in range(num):
            min_dist, neighbour = np.inf, -1
            for j in range(self.k):
                dist = self.__distance(self.data.iloc[i,:], self.centroids[j])
                if dist < min_dist:
                    min_dist, neighbour = dist, j
                    self.label[i] = j
            self.E[i] = self.__distance(self.data.iloc[i,:], self.centroids[j])**2
        self.__convergence = self.__converge(pre_centroids)

来源: http://www.bubuko.com/infodetail-3457406.html

与本文相关文章

暂无,快来抢沙发吧！