8. Unsupervised Learning
8.1 Clustering
8.1.1 Unsupervised Learning: Introduction
Applications of clustering:
- Market segmentation
- Social network analysis
- Organize computing clusters
- Astronomical data analysis
In unsupervised learning, the data has no labels.
8.1.2 K-Means Algorithm
Input:
K (number of clusters)
Training set {x1,x2,...xm}
Algorithm:
Randomly initialize K cluster centroids u1, u2, ..., uK.
Then repeat:
Cluster assignment step:
for i = 1 to m: c(i) := index (from 1 to K) of the cluster centroid closest to xi
Move centroid step:
for k = 1 to K: uk := average (mean) of the points assigned to cluster k
8.1.3 Optimization objective
Minimize the distortion (cost function):
J(c(1), ..., c(m), u1, ..., uK) = 1/m * sum_{i=1..m} ||xi - u_c(i)||^2
where u_c(i) is the centroid of the cluster to which xi is currently assigned.
Therefore the K-means algorithm can also be written as:
- Randomly initialize K cluster centroids u1, u2, ..., uK
Repeat:
for i = 1 to m: c(i) := index (from 1 to K) of the cluster centroid closest to xi
for k = 1 to K: uk := average (mean) of the points assigned to cluster k
(the cluster assignment step minimizes J over c(1), ..., c(m); the move centroid step minimizes J over u1, ..., uK)
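As a rough illustration, here is a minimal NumPy sketch of these two alternating steps; the names `run_kmeans`, `X`, and `centroids` are made up for this example and are not from the course:

```python
import numpy as np

def run_kmeans(X, K, num_iters=10):
    """Minimal K-means sketch: X is an (m, n) data matrix, K the number of clusters."""
    m, n = X.shape
    # Random initialization: pick K distinct training examples as the initial centroids
    centroids = X[np.random.choice(m, K, replace=False)].astype(float)
    for _ in range(num_iters):
        # Cluster assignment step: c[i] := index of the centroid closest to X[i]
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Move centroid step: each centroid moves to the mean of its assigned points
        for k in range(K):
            if np.any(c == k):
                centroids[k] = X[c == k].mean(axis=0)
    return c, centroids
```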
8.1.4 Random initialization
Random initialization:
First, make sure K < m;
Then randomly pick K training examples;
Set u1, u2, ..., uK equal to these K examples.
For i = 1 to 100 {
  Randomly initialize K-means.
  Run K-means. Get c1, c2, ..., cm, u1, ..., uK.
  Compute the cost function (distortion) J.
}
Pick the clustering that gave the lowest cost J.
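The loop above can be sketched as follows, reusing the hypothetical `run_kmeans` helper from the earlier sketch; the `distortion` and `best_of_many_inits` names are likewise assumptions for illustration:

```python
def distortion(X, c, centroids):
    # Cost J: average squared distance between each point and its assigned centroid
    return np.mean(np.sum((X - centroids[c]) ** 2, axis=1))

def best_of_many_inits(X, K, num_inits=100):
    # Run K-means from many random initializations and keep the lowest-cost clustering
    best_c, best_centroids, best_J = None, None, np.inf
    for _ in range(num_inits):
        c, centroids = run_kmeans(X, K)   # random initialization happens inside
        J = distortion(X, c, centroids)
        if J < best_J:
            best_c, best_centroids, best_J = c, centroids, J
    return best_c, best_centroids, best_J
```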
8.1.5 Choosing the number of clusters
At present the best approach is still to choose K manually;
Elbow method: plot the cost J against the number of clusters K and look for an "elbow"; not every dataset produces a clear elbow, and sometimes the curve just decreases smoothly.
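A rough sketch of computing the cost curve for the elbow method, again reusing the hypothetical helpers above:

```python
def elbow_curve(X, max_K=10):
    # Compute the distortion J for K = 1..max_K; plot J against K and look for an "elbow"
    Js = []
    for K in range(1, max_K + 1):
        c, centroids, J = best_of_many_inits(X, K, num_inits=20)
        Js.append(J)
    return Js
```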
Sometimes you're running K-means to get clusters for some later/downstream purpose.
In that case, evaluate K-means based on how well it performs for that later purpose.
8.2 Dimensionality Reduction
This introduces the second type of unsupervised learning algorithm: dimensionality reduction.
Its two main categories of applications are:
- Data Compression
- Data Visualization
8.2.1 Data Compression
Reduce data from 2D to 1D.
Reduce data from 3D to 2D (project the points in 3D space onto a plane).
8.2.2 Data Visualization
Data visualization makes it easier to inspect the data and to understand the relationships within it.
Reduce the highly correlated features, lowering the original N-dimensional data to K dimensions,
where K <= N; typically K = 2 or 3, so the data can be plotted and analyzed visually.
8.3 Principal Component Analysis
8.3.1 Principal Component Analysis Problem Formulation
Reduce from 2 dimensions to 1 dimension:
Find a direction (a vector v1) onto which to project the data so as to minimize the projection error.
Reduce from n dimensions to k dimensions:
Find k vectors v1, v2, ..., vk onto which to project the data, so as to minimize the projection error.
PCA is not linear regression: linear regression minimizes the vertical distances (errors in predicting y), whereas PCA minimizes the orthogonal (perpendicular) projection distances.
8.3.2 Principal Component Analysis algorithm
Data Preprocessing:
Training set: x1,x2,…xm
Preprocessing (feature scaling / mean normalization):
uj = 1/m * sum_{i=1..m} xj(i)   (the mean of feature j over the training set)
Replace each xj(i) with xj(i) - uj.
If different features are on different scales, also scale the features so they have comparable ranges of values.
Principal Component Analysis (PCA) algorithm
Reduce data from n dimensions to k dimensions:
Compute the "covariance matrix": Sigma = 1/m * sum_{i=1..m} x(i) * x(i)T   (an n x n matrix)
Compute the "eigenvectors" of matrix Sigma: [U, S, V] = svd(Sigma)
Ureduce = U(:, 1:k)   (the first k columns of U)
z = Ureduce' * x
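The Octave-style steps above could be expressed in NumPy roughly as follows; `pca_project` and its variable names are assumptions for illustration, with `np.linalg.svd` playing the role of `svd(Sigma)`:

```python
import numpy as np

def pca_project(X, k):
    """Sketch of the PCA steps above: X is an (m, n) data matrix, k the target dimension."""
    # Mean normalization (feature scaling could also be applied here if needed)
    mu = X.mean(axis=0)
    X_norm = X - mu
    # Covariance matrix: Sigma = (1/m) * X_norm^T * X_norm, an n x n matrix
    m = X_norm.shape[0]
    Sigma = (X_norm.T @ X_norm) / m
    # Singular value decomposition; the columns of U are the principal directions
    U, S, Vt = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]            # first k eigenvectors (Ureduce)
    Z = X_norm @ U_reduce          # projected data z, shape (m, k)
    return Z, U_reduce, mu, S
```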
8.4 Applying PCA
8.4.1 Reconstruction from Compressed Representation
x_approx = Ureduce * z   (an approximation of the original x, back in n dimensions)
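Continuing the hypothetical `pca_project` sketch above, reconstruction could look like this (the mean `mu` is added back only because that sketch subtracted it during normalization):

```python
def pca_reconstruct(Z, U_reduce, mu):
    # Map the compressed representation z back to the original space: x_approx = Ureduce * z
    return Z @ U_reduce.T + mu
```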
8.4.2 Choosing the number of principal components
Choosing k:
(1) Average squared projection error: 1/m * sum_{i=1..m} ||x(i) - x_approx(i)||^2
(2) Total variation in the data: 1/m * sum_{i=1..m} ||x(i)||^2
Typically, choose k to be the smallest value such that (1)/(2) <= 0.01,
i.e. "99% of variance is retained".
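One way to sketch this selection rule directly from definitions (1) and (2), assuming mean-normalized data; the helper name `choose_k` is hypothetical:

```python
def choose_k(X, threshold=0.01):
    # Pick the smallest k so that (average squared projection error) / (total variation) <= threshold
    mu = X.mean(axis=0)
    X_norm = X - mu
    m, n = X_norm.shape
    Sigma = (X_norm.T @ X_norm) / m
    U, S, Vt = np.linalg.svd(Sigma)
    total_variation = np.mean(np.sum(X_norm ** 2, axis=1))
    for k in range(1, n + 1):
        Z = X_norm @ U[:, :k]              # project onto the first k directions
        X_approx = Z @ U[:, :k].T          # reconstruct
        projection_error = np.mean(np.sum((X_norm - X_approx) ** 2, axis=1))
        if projection_error / total_variation <= threshold:
            return k
    return n
```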
8.4.3 Advice for applying PCA
Supervised learning speedup
Compression:
- Reduce memory/disk needed to store data
- Speed up learning algorithm
Visualization
Bad use of PCA: To prevent overfitting
With fewer features, the model is less likely to overfit.
This might work OK, but it isn't a good way to address overfitting; use regularization instead.
Before implementing PCA, first try running whatever you want to do with the original/raw data xi. Only if that doesn't do what you want should you implement PCA and consider using zi.