# K-Means Clustering From Scratch in Python [Algorithm Explained]

> Original: <https://www.askpython.com/python/examples/k-means-clustering-from-scratch>

K-Means is a very popular clustering technique: an unsupervised learning algorithm used to find groups (clusters) of similar data points in a given dataset.

In this article, we will implement the K-Means clustering algorithm from scratch using the [Numpy module](https://www.askpython.com/python-modules/numpy/python-numpy-module).
## The 5 Steps of the K-Means Clustering Algorithm

**Step 1.** Randomly pick k data points as our initial centroids.

**Step 2.** Find the distance between each data point in the training set and the k centroids (we use Euclidean distance here).

**Step 3.** Assign each data point to its closest centroid based on the distances found.

**Step 4.** Update each centroid's position by taking the mean of the points in its cluster.

**Step 5.** Repeat steps 2 to 4 until the centroids stop changing (the implementation below runs for a fixed number of iterations instead).

We can use techniques like the elbow method to pick the optimal value of K (the number of clusters).
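The elbow method itself is not implemented in this article, but a minimal sketch is easy to write: cluster the data for a range of K values and look for the point where the within-cluster sum of squares (inertia) stops dropping sharply. The sketch below uses sklearn's `KMeans` and a small synthetic dataset, both chosen here for illustration only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Small synthetic dataset with 3 obvious clusters (illustrative only)
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

# Compute the within-cluster sum of squares (inertia) for each candidate k
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)
    inertias.append(km.inertia_)

# Inertia always shrinks as k grows; the "elbow" is where the decrease
# slows sharply -- for this dataset that should be around k = 3
print(inertias)
```

Plotting `inertias` against `range(1, 8)` makes the elbow visible as a sharp bend in the curve.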
## Implementing the K-Means Clustering Algorithm

Now let's implement the steps above in code. Import the numpy module, then walk through the rest of the code here to see how K-Means clustering is implemented.
```py
# Importing required modules
import numpy as np
from scipy.spatial.distance import cdist

# Function to implement the steps given in the previous section
def kmeans(x, k, no_of_iterations):
    # Step 1: randomly choose k data points as the initial centroids
    idx = np.random.choice(len(x), k, replace=False)
    centroids = x[idx, :]

    # Step 2: find the distance between the centroids and all the data points
    distances = cdist(x, centroids, 'euclidean')

    # Step 3: assign each point to the centroid with the minimum distance
    points = np.array([np.argmin(i) for i in distances])

    # Step 4: repeat the above steps for a defined number of iterations
    for _ in range(no_of_iterations):
        centroids = []
        for idx in range(k):
            # Update each centroid as the mean of the cluster assigned to it
            temp_cent = x[points == idx].mean(axis=0)
            centroids.append(temp_cent)

        centroids = np.vstack(centroids)  # Updated centroids
        distances = cdist(x, centroids, 'euclidean')
        points = np.array([np.argmin(i) for i in distances])

    return points
```
The function above returns an array of cluster labels, one for each data point in our training set.
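As a quick sanity check (not part of the original article), the function can be run on a tiny synthetic dataset of two well-separated blobs; every point within a blob should receive the same label:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Same function as in the article
def kmeans(x, k, no_of_iterations):
    idx = np.random.choice(len(x), k, replace=False)
    centroids = x[idx, :]  # Step 1
    distances = cdist(x, centroids, 'euclidean')  # Step 2
    points = np.array([np.argmin(i) for i in distances])  # Step 3
    for _ in range(no_of_iterations):  # Step 4
        centroids = []
        for idx in range(k):
            centroids.append(x[points == idx].mean(axis=0))
        centroids = np.vstack(centroids)
        distances = cdist(x, centroids, 'euclidean')
        points = np.array([np.argmin(i) for i in distances])
    return points

# Two well-separated blobs: 5 points near (0, 0) and 5 near (100, 100)
np.random.seed(42)  # makes np.random.choice inside kmeans deterministic
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(100, 0.1, (5, 2))])

labels = kmeans(x, 2, 50)
print(labels)
```

The exact label values (0 or 1) depend on which points were picked as initial centroids, so only the grouping is meaningful, not the numbers themselves.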
## Testing the K-Means Clustering Algorithm

We will test our function on the digits dataset (built into the sklearn module). You can refer to [this](https://www.askpython.com/python/examples/plot-k-means-clusters-python) article to learn more about plotting K-Means clusters.
```py
# Loading the required modules
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Defining our function
def kmeans(x, k, no_of_iterations):
    # Step 1: randomly choose k data points as the initial centroids
    idx = np.random.choice(len(x), k, replace=False)
    centroids = x[idx, :]

    # Step 2: find the distance between the centroids and all the data points
    distances = cdist(x, centroids, 'euclidean')

    # Step 3: assign each point to the centroid with the minimum distance
    points = np.array([np.argmin(i) for i in distances])

    # Step 4: repeat the above steps for a defined number of iterations
    for _ in range(no_of_iterations):
        centroids = []
        for idx in range(k):
            # Update each centroid as the mean of the cluster assigned to it
            temp_cent = x[points == idx].mean(axis=0)
            centroids.append(temp_cent)

        centroids = np.vstack(centroids)  # Updated centroids
        distances = cdist(x, centroids, 'euclidean')
        points = np.array([np.argmin(i) for i in distances])

    return points

# Load the data and reduce it to 2 dimensions for visualization
data = load_digits().data
pca = PCA(2)
df = pca.fit_transform(data)

# Applying our function
label = kmeans(df, 10, 1000)

# Visualize the results
u_labels = np.unique(label)
for i in u_labels:
    plt.scatter(df[label == i, 0], df[label == i, 1], label=i)
plt.legend()
plt.show()
```
![Plotting K-Means Clustering](img/5b66cb6d22feecd6b65e310d65f8120b.png)
Plotting K-Means Clusters

The output looks promising. Our implementation works.
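Beyond eyeballing the plot, we can also check the result numerically (this check is an addition, not part of the original article) by comparing our labels against sklearn's own `KMeans` on the same PCA-reduced digits data, using `adjusted_rand_score`, which ignores the arbitrary numbering of cluster labels. The version below also keeps the old centroid whenever a cluster goes empty, a small robustness tweak on the article's function that prevents NaN centroids:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def kmeans(x, k, no_of_iterations):
    idx = np.random.choice(len(x), k, replace=False)
    centroids = x[idx, :]
    distances = cdist(x, centroids, 'euclidean')
    points = np.array([np.argmin(i) for i in distances])
    for _ in range(no_of_iterations):
        new_centroids = []
        for ci in range(k):
            members = x[points == ci]
            # Guard (tweak on the article's version): keep the old centroid
            # if its cluster went empty, so mean() never produces NaN
            new_centroids.append(members.mean(axis=0) if len(members) else centroids[ci])
        centroids = np.vstack(new_centroids)
        distances = cdist(x, centroids, 'euclidean')
        points = np.array([np.argmin(i) for i in distances])
    return points

np.random.seed(0)
df = PCA(2).fit_transform(load_digits().data)

ours = kmeans(df, 10, 100)
ref = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(df)

# 1.0 means identical partitions; values near 0.0 mean chance-level agreement
score = adjusted_rand_score(ours, ref)
print(round(score, 2))
```

The two partitions will not match perfectly, since K-Means converges to different local optima depending on initialization, but a score well above zero confirms the implementation behaves like standard K-Means.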
## Conclusion

In this article, we created a K-Means clustering algorithm from scratch in Python. We also covered the steps behind the K-Means algorithm and finally tested our implementation on the Digits dataset. You can read about the theory of the K-Means clustering algorithm on its [Wikipedia page](https://en.wikipedia.org/wiki/K-means_clustering).

Happy learning!