Getting Started with AI: Notes on Andrew Ng's “Machine Learning” Course, Week 8

8、Unsupervised Learning

8.1 Clustering

8.1.1 Unsupervised Learning: Introduction

Applications of clustering:

  1. Market segmentation
  2. Social network analysis
  3. Organize computing clusters
  4. Astronomical data analysis

In unsupervised learning, the data has no labels.

8.1.2 K-Means Algorithm

Input:

K (number of clusters)
Training set {x1,x2,...xm}

Algorithm:

Randomly initialize K cluster centroids u1, u2, ..., uK.

Then repeat:

  1. Cluster assignment step:

    for i = 1 to m
      c(i) := index (from 1 to K) of the cluster centroid closest to x(i)

  2. Move centroid step:

    for k = 1 to K
      uk := average (mean) of the points assigned to cluster k
    
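These two steps map directly onto simple array operations. Below is a minimal Octave sketch of one K-means run, assuming the training examples are stored as the rows of an m x n matrix X; the function name runKMeans and the max_iters parameter are illustrative, not from the course.

    function [c, mu] = runKMeans(X, K, max_iters)
      m = size(X, 1);
      mu = X(randperm(m, K), :);            % initialize centroids to K random training examples
      for iter = 1:max_iters
        % cluster assignment step: assign each example to its closest centroid
        for i = 1:m
          d = sum((mu - X(i, :)).^2, 2);    % squared distance from x(i) to every centroid
          [~, c(i)] = min(d);
        end
        % move centroid step: each centroid becomes the mean of its assigned points
        for k = 1:K
          mu(k, :) = mean(X(c == k, :), 1); % (a cluster left empty would need re-initializing)
        end
      end
    end

A typical call would be something like [c, mu] = runKMeans(X, 3, 10);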

8.1.3 Optimization objective

Minimize the distortion (cost function):

    J(c(1), ..., c(m), u1, ..., uK) = (1/m) * sum_{i=1..m} ||x(i) - u_c(i)||^2

where u_c(i) is the centroid of the cluster to which example x(i) is currently assigned.
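Given the c and mu produced by a K-means run (for example the runKMeans sketch above), the distortion can be computed in one line; this snippet assumes the same row-matrix layout for X:

    J = mean(sum((X - mu(c, :)).^2, 2));   % average squared distance of each example to its assigned centroid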

The K-means algorithm can therefore also be written as:

  1. Randomly initialize K cluster centroids u1, u2, ..., uK
  2. Repeat:

    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid closest to x(i)
        [minimizes J with respect to c(1), ..., c(m), holding the centroids fixed]

    for k = 1 to K
        uk := average (mean) of the points assigned to cluster k
        [minimizes J with respect to u1, ..., uK, holding the assignments fixed]

8.1.4 Random initialization

Random initialization:

First, require K < m;

Then randomly pick K training examples;

Set u1, u2, ..., uK equal to these K examples.

for i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c(1), ..., c(m), u1, ..., uK.
    Compute cost function (distortion) J(c(1), ..., c(m), u1, ..., uK)
}

Pick the clustering that gave the lowest cost (distortion) J.
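A sketch of this loop in Octave, reusing the hypothetical runKMeans helper from above and keeping the run with the lowest distortion:

    bestJ = Inf;
    for i = 1:100
      [c, mu] = runKMeans(X, K, 10);           % random initialization happens inside runKMeans
      J = mean(sum((X - mu(c, :)).^2, 2));     % distortion of this run
      if J < bestJ
        bestJ = J;  bestC = c;  bestMu = mu;   % keep the best clustering found so far
      end
    end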

8.1.5 Choosing the number of clusters

For now, the best approach is still to choose the number of clusters K manually.

Elbow method: plot the distortion J against the number of clusters K and look for an "elbow" in the curve. This does not work in every case; sometimes the curve decreases smoothly and there is no clear elbow.

Sometimes, you're running K-means to get clusters to use for some later/downstream purpose.

Evaluate K-means based on a metric for how well it performs for that later purpose.
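An illustrative sketch of the elbow plot, again reusing the hypothetical runKMeans helper: run K-means for a range of K values and plot the resulting distortion.

    Ks = 1:10;
    for k = Ks
      [c, mu] = runKMeans(X, k, 10);
      Js(k) = mean(sum((X - mu(c, :)).^2, 2));   % distortion for this choice of K
    end
    plot(Ks, Js, '-o');
    xlabel('K (number of clusters)');
    ylabel('distortion J');                      % look for the "elbow" in this curve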

8.2 Dimensionality Reduction

This week introduces a second type of unsupervised learning algorithm: dimensionality reduction.

Its applications fall into two main categories:

  1. Data Compression
  2. Data Visualization

8.2.1 Data Compression

Reduce the data from 2D to 1D (project the 2D points onto a line).

Reduce the data from 3D to 2D (project the 3D points onto a plane).

8.2.2 Data Visualization

Data visualization makes it easier to inspect the data and understand the relationships within it.

Reduce highly correlated features, bringing the original N-dimensional data down to K dimensions.

K <= N, and typically K = 2 or 3 so the result can be plotted and inspected visually.

8.3 Principal Component Analysis

8.3.1 Principal Component Analysis Problem Formulation

Reduce from 2 dimensions to 1 dimension:

Find a direction (a vector v1) onto which to project the data so as to minimize the projection error.

Reduce from n dimensions to k dimensions:

Find k vectors v1, v2, ..., vk onto which to project the data, so as to minimize the projection error.

PCA is not linear regression: linear regression measures the error vertically (along the y-axis), whereas PCA measures the orthogonal (perpendicular) projection distance.


8.3.2 Principal Component Analysis algorithm

Data Preprocessing:

Training set: x1,x2,…xm

Preprocessing(feature scaling/mean normalization):

uj = (1/m) * sum_{i=1..m} xj(i)     (mean of feature j over all m examples)
xj(i) := xj(i) - uj

If different features are on different scales, scale them to have comparable ranges of values.
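In Octave, with the examples stored as the rows of X, this preprocessing might look like the sketch below (dividing by the standard deviation is one common choice of feature scaling):

    mu = mean(X);                        % 1 x n vector of per-feature means
    sigma_scale = std(X);                % 1 x n vector of per-feature standard deviations
    X_norm = (X - mu) ./ sigma_scale;    % mean normalization plus feature scaling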

Principal Component Analysis (PCA) algorithm

Reduce data from n dimensions to k dimensions

compute “covariance matrix”: Sigma = (1/m) * sum_{i=1..m} x(i) * x(i)'   (an n x n matrix)

compute “eigenvectors” of matrix sigma: [U,S,V] = svd(sigma)

Ureduce = U(:,1:k)

z = Ureduce’ * x
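Putting these steps together, a minimal Octave sketch, assuming X_norm from the preprocessing step above (examples as rows) and a chosen k:

    m = size(X_norm, 1);
    Sigma = (1/m) * (X_norm' * X_norm);   % n x n covariance matrix
    [U, S, V] = svd(Sigma);               % columns of U are the principal directions
    Ureduce = U(:, 1:k);                  % keep the first k eigenvectors
    Z = X_norm * Ureduce;                 % m x k matrix whose rows are the projections z(i)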

8.4 Applying PCA

8.4.1 Reconstruction from Compressed Representation

Xapprox = Ureduce*z
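For a single example this is the column-vector form above; in the row-matrix layout used in the sketches here, the same reconstruction is:

    X_approx = Z * Ureduce';   % m x n approximation of the (normalized) original data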

8.4.2 Choosing the number of principal components

Choosing k:

(1) Average squared projection error: (1/m) * sum_{i=1..m} ||x(i) - xapprox(i)||^2

(2) Total variation in the data: (1/m) * sum_{i=1..m} ||x(i)||^2

Typically, choose k to be the smallest value so that (1)/(2) <= 0.01

“99% of variance is retained”
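Rather than recomputing the projection for every candidate k, the course notes that this ratio can be read off the singular values returned by svd; a sketch using the S matrix from the PCA step above:

    s = diag(S);                           % singular values of Sigma
    for k = 1:length(s)
      retained = sum(s(1:k)) / sum(s);     % fraction of variance retained with k components
      if retained >= 0.99                  % equivalent to (1)/(2) <= 0.01
        break;                             % smallest k that retains 99% of the variance
      end
    end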

8.4.3 Advice for applying PCA

Supervised learning speedup

Compression:

  1. Reduce memory/disk needed to store data
  2. Speed up learning algorithm

Visualization

Bad use of PCA: To prevent overfitting

Using fewer features might seem less likely to overfit.

This might work OK, but it isn't a good way to address overfitting: PCA throws away some information without looking at the labels y. Use regularization instead.

Before implementing PCA, first try running whatever you want to do with the original/raw data x(i). Only if that doesn't do what you want should you implement PCA and consider using z(i).