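Unsupervised learning is a class of machine learning techniques for finding patterns in data. The input data given to an unsupervised algorithm is unlabeled: only the input variables (the independent variables, X) are provided, with no corresponding output variable (the dependent variable). In unsupervised learning, the algorithm is left to discover interesting structure in the data on its own.
Yann LeCun, a leading figure in artificial intelligence research, explains that unsupervised learning lets machines learn by themselves, without being explicitly told whether everything they do is right or wrong. In his view, this is the key to achieving true artificial intelligence!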
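Supervised Learning vs. Unsupervised Learning
In supervised learning, the system tries to learn from previously given examples. (In unsupervised learning, the system tries to find patterns directly in the examples it is given.) So if the dataset is labeled, the task is a supervised learning problem; if the dataset is unlabeled, it is an unsupervised learning problem.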
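To make the distinction concrete, the following minimal sketch fits one supervised and one unsupervised model on the same iris measurements. The choice of `KNeighborsClassifier` and `KMeans` (and their parameters) is an assumption made for illustration only; the point is that the supervised estimator receives the labels `y`, while the unsupervised one sees `X` alone.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model is given both the features X and the labels y.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)

# Unsupervised: the model is given only the features X.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)

supervised_pred = clf.predict(X[:5])    # predicted species labels
unsupervised_pred = km.predict(X[:5])   # cluster indices (arbitrary numbering)
print(supervised_pred, unsupervised_pred)
```

Note that the cluster indices returned by the unsupervised model carry no meaning beyond group membership, whereas the classifier's output uses the label values it was trained on.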
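The figure above shows an example of supervised learning, in which a regression technique finds the best-fit curve through the features. In unsupervised learning, by contrast, the input data is partitioned according to its features, and predictions are made according to the cluster a data point belongs to.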
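As a rough illustration of these two kinds of prediction, the sketch below uses synthetic data (an assumption; the data behind the figure is not available). `np.polyfit` fits a least-squares line for the supervised case, and `KMeans` assigns a new point to the nearest cluster for the unsupervised case. All variable names and constants are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised (regression): learn y from x using known target values.
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)
slope, intercept = np.polyfit(x, y, deg=1)   # least-squares best-fit line
y_new = slope * 4.0 + intercept               # prediction for x = 4

# Unsupervised (clustering): group points using their features only.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(30, 2))
points = np.vstack([blob_a, blob_b])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# A new point is "predicted" by the cluster whose centre is nearest.
new_cluster = km.predict([[4.8, 5.1]])[0]
print(slope, intercept, y_new, new_cluster)
```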
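Important Terminology
Feature: an input variable used in making predictions.
Prediction: the model's output when given an input example.
Example: one row of a dataset. An example contains one or more features and possibly a label.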
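These terms map directly onto the iris dataset that ships with scikit-learn. The short sketch below is only illustrative; the variable names `features`, `example`, and `label` are chosen here and are not part of any API.

```python
from sklearn.datasets import load_iris

iris = load_iris()

features = iris.feature_names   # names of the input variables
example = iris.data[0]          # one example = one row of the dataset
label = iris.target[0]          # the label attached to that example

print(features)
print(example)
print(iris.target_names[label])
```

In an unsupervised setting, only `iris.data` would be passed to the algorithm; the labels are kept aside and used, if at all, to inspect the result afterwards.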
- # Importing Modules
- from sklearn import datasets
- import matplotlib.pyplot as plt
- # Loading dataset
- iris_df = datasets.load_iris()
- # Available methods on dataset
- print(dir(iris_df))
- # Features
- print(iris_df.feature_names)
- # Targets
- print(iris_df.target)
- # Target Names
- print(iris_df.target_names)
- label = {0: 'red', 1: 'blue', 2: 'green'}
- # Dataset Slicing
- x_axis = iris_df.data[:, 0] # Sepal Length
- y_axis = iris_df.data[:, 2] # Petal Length
- # Plotting
- plt.scatter(x_axis, y_axis, c=iris_df.target)
- plt.show()
- ['DESCR', 'data', 'feature_names', 'target', 'target_names']
- ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
- [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
- ['setosa' 'versicolor' 'virginica']
- # Importing Modules
- from sklearn import datasets
- from sklearn.cluster import KMeans
- # Loading dataset
- iris_df = datasets.load_iris()
- # Declaring Model
- model = KMeans(n_clusters=3)
- # Fitting Model
- model.fit(iris_df.data)
- # Predicting a single input
- predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])
- # Prediction on the entire data
- all_predictions = model.predict(iris_df.data)
- # Printing Predictions
- print(predicted_label)
- print(all_predictions)
- [0]
- [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2]
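The cluster indices printed above are arbitrary: K-Means has no knowledge of the species numbering, so cluster `2` here need not correspond to target `2`. One way to inspect the correspondence is to cross-tabulate the clusters against the known species, as in the sketch below. The use of `pd.crosstab` and `adjusted_rand_score`, and the fixed `random_state`, are choices made for this illustration rather than part of the listing above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris_df = datasets.load_iris()
model = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = model.fit_predict(iris_df.data)

# Cross-tabulate cluster indices against the known species names.
species = iris_df.target_names[iris_df.target]
table = pd.crosstab(clusters, species,
                    rownames=['cluster'], colnames=['species'])
print(table)

# Adjusted Rand index: 1.0 means identical groupings, whatever the numbering.
ari = adjusted_rand_score(iris_df.target, clusters)
print(ari)
```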
- # Importing Modules
- from scipy.cluster.hierarchy import linkage, dendrogram
- import matplotlib.pyplot as plt
- import pandas as pd
- # Reading the DataFrame
- seeds_df = pd.read_csv(
- "https://raw.githubusercontent.com/vihar/unsupervised-learning-with-python/master/seeds-less-rows.csv")
- # Remove the grain species from the DataFrame, save for later
- varieties = list(seeds_df.pop('grain_variety'))
- # Extract the measurements as a NumPy array
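- samples = seeds_df.values
- # Perform hierarchical clustering on the samples
- # (method='complete' is assumed here)
- mergings = linkage(samples, method='complete')
- # Plot the dendrogram, specifying the keyword arguments
- # labels=varieties, leaf_rotation=90, and leaf_font_size=6
- dendrogram(mergings,
-            labels=varieties,
-            leaf_rotation=90,
-            leaf_font_size=6,
-            )
- plt.show()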
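A dendrogram is drawn from a linkage matrix, which records the order in which samples are merged. The sketch below shows what that matrix looks like and how it could be cut into a fixed number of flat clusters. It uses the iris measurements instead of the seeds CSV so that it runs without a download, and both `method='complete'` and the cut at three clusters are assumptions made for illustration.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris

samples = load_iris().data

# Each row of the linkage matrix records one merge:
# [cluster index a, cluster index b, merge distance, size of the new cluster].
mergings = linkage(samples, method='complete')
print(mergings.shape)

# Cut the tree so that at most 3 flat clusters remain.
flat_labels = fcluster(mergings, t=3, criterion='maxclust')
print(sorted(set(flat_labels)))
```

With n samples the linkage matrix has n - 1 rows, one per merge, which is why the dendrogram always ends in a single root.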
- # Importing Modules
- from sklearn import datasets
- from sklearn.manifold import TSNE
- import matplotlib.pyplot as plt
- # Loading dataset
- iris_df = datasets.load_iris()
- # Defining Model
- model = TSNE(learning_rate=100)
- # Fitting Model
- transformed = model.fit_transform(iris_df.data)
- # Plotting 2D t-SNE
- x_axis = transformed[:, 0]
- y_axis = transformed[:, 1]
- plt.scatter(x_axis, y_axis, c=iris_df.target)
- plt.show()
- # Importing Modules
- from sklearn.datasets import load_iris
- import matplotlib.pyplot as plt
- from sklearn.cluster import DBSCAN
- from sklearn.decomposition import PCA
- # Load Dataset
- iris = load_iris()
- # Declaring Model
- dbscan = DBSCAN()
- # Fitting
- dbscan.fit(iris.data)
- # Transforming Using PCA
- pca = PCA(n_components=2).fit(iris.data)
- pca_2d = pca.transform(iris.data)
- # Plot based on Class
- for i in range(0, pca_2d.shape[0]):
-     if dbscan.labels_[i] == 0:
-         c1 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='r', marker='+')
-     elif dbscan.labels_[i] == 1:
-         c2 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='g', marker='o')
-     elif dbscan.labels_[i] == -1:
-         c3 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='b', marker='*')
- plt.legend([c1, c2, c3], ['Cluster 1', 'Cluster 2', 'Noise'])
- plt.title('DB
Source: http://zhuanlan.51cto.com/art/201805/574750.htm