9、Anomaly detection
9.1、Density Estimation
9.1.1、Problem motivation
Density estimation: decide whether a test example is anomalous.
Anomaly detection examples:
Fraud detection
xi = features of user i's activities. Model p(x) from data. Identify unusual users by checking which have p(x) < ε.
Manufacturing
Monitoring computers in a data center.
xi = features of machine i: memory use, number of disk accesses/sec, CPU load, ...
9.1.2、Gaussian distribution
9.1.3、Anomaly detection algorithm
- Choose features xi that might be indicative of anomalous examples.
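- Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$:
$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$, $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$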
- Given a new example x, compute p(x). Flag x as an anomaly if p(x) < ε.
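The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes that p(x) is the product of independent per-feature Gaussian densities, which these notes do not spell out. The function names (`fit_gaussian`, `density`, `is_anomaly`) and the toy data are hypothetical.

```python
import math

def fit_gaussian(X):
    """Estimate the per-feature mean and variance (1/m normalization)."""
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def density(x, mu, sigma2):
    """p(x) as a product of univariate Gaussian densities (assumed model)."""
    p = 1.0
    for xj, mj, s2 in zip(x, mu, sigma2):
        p *= math.exp(-((xj - mj) ** 2) / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return p

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x as anomalous when its density falls below the threshold."""
    return density(x, mu, sigma2) < epsilon

# Toy training set: 3 examples, 2 features (made-up values).
X_train = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mu, sigma2 = fit_gaussian(X_train)

far_point_flagged = is_anomaly([10.0, 20.0], mu, sigma2, epsilon=1e-3)
mean_point_flagged = is_anomaly([2.0, 4.0], mu, sigma2, epsilon=1e-3)
```

With this toy data, a point far from the training examples gets a very small density and is flagged, while the point at the mean is not. A feature with zero variance would cause a division by zero here, so real data would need a guard for that case.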
9.1.4、Developing and evaluating an anomaly detection system
- The importance of real-number evaluation
For example, with 10000 good engines and 20 flawed engines, we can split the data as follows:
Training set: 6000 good engines
CV: 2000 good engines, 10 anomalous
Test: 2000 good engines, 10 anomalous
- Algorithm evaluation
The algorithm can be evaluated with the F1-score, and the CV set can also be used to choose the parameter ε.
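One way to choose ε on the CV set is to try several candidate thresholds and keep the one with the best F1-score. The sketch below assumes that labels use y = 1 for anomalous and y = 0 for normal, and that the densities p(x) for the CV examples have already been computed. The function names, the candidate list, and the CV values are made up for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR / (P + R), with anomalies as the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv, y_cv, candidates):
    """Return the candidate threshold with the highest F1 on the CV set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical CV densities and labels (1 = anomalous).
p_cv = [0.30, 0.25, 0.20, 0.001, 0.15, 0.002]
y_cv = [0, 0, 0, 1, 0, 1]
best_eps, best_f1 = select_epsilon(p_cv, y_cv, candidates=[0.0005, 0.005, 0.18])
```

F1 is used instead of accuracy because the classes are heavily skewed: a classifier that always predicts "normal" would score high accuracy on a split like the one above while detecting nothing.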
9.1.5、Anomaly detection vs. supervised learning
Anomaly detection:
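- Very small number of positive examples.
- Large number of negative examples.
- Many different types of anomalies, so it is hard to learn what anomalies look like from the positive examples.
- Future anomalies may look nothing like the anomalous examples seen so far.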
Supervised Learning:
- Large number of positive and negative examples.
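- There are enough positive examples to learn what positives look like, and future positive examples are likely to be similar to those in the training set.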
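9.1.6、Multivariate Gaussian distribution
The shape of the multivariate Gaussian distribution is controlled by the mean vector μ and the covariance matrix Σ.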
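To make the role of μ and Σ concrete, here is a small sketch of the multivariate Gaussian density restricted to two dimensions, so that the determinant and inverse can be written by hand with only the standard library. The function name `mvn_density_2d` and the example matrices are illustrative assumptions, not part of the original notes.

```python
import math

def mvn_density_2d(x, mu, sigma):
    """Density of a 2-D Gaussian with mean mu and 2x2 covariance sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    # For n = 2 the normalizer (2*pi)^(n/2) * |Sigma|^(1/2) is 2*pi*sqrt(det).
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

mu = [0.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.8], [0.8, 1.0]]  # positively correlated features

p_along = mvn_density_2d([1.0, 1.0], mu, correlated)     # follows the correlation
p_against = mvn_density_2d([1.0, -1.0], mu, correlated)  # goes against it
```

With the identity covariance the two test points are equally likely. With the correlated Σ, the point that goes against the correlation gets a much lower density, which is the kind of anomaly a per-feature model cannot express.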
9.2、Recommender systems
9.2.1、Content-based recommendation
Problem formulation:
- r(i,j) = 1 if user j has rated movie i (0 otherwise)
- y(i,j) = rating by user j on movie i (defined only if r(i,j) = 1)
- θ(j) = parameter vector for user j
- x(i) = feature vector for movie i
- For user j and movie i, the predicted rating is (θ(j))^T x(i)
- m(j) = number of movies rated by user j
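The notation above maps directly onto a dot product. The sketch below computes a predicted rating (θ(j))^T x(i) and counts m(j) from the indicator r(i,j). The feature layout (an intercept term followed by two genre scores) and all numeric values are hypothetical.

```python
def predict_rating(theta_j, x_i):
    """Predicted rating of movie i by user j: (theta_j)^T x_i."""
    if len(theta_j) != len(x_i):
        raise ValueError("theta_j and x_i must have the same length")
    return sum(t * x for t, x in zip(theta_j, x_i))

def movies_rated(R, j):
    """m(j): number of movies rated by user j, where R[i][j] = r(i,j)."""
    return sum(row[j] for row in R)

# Hypothetical features: [intercept, romance score, action score].
x_movie = [1.0, 0.9, 0.1]
theta_user = [0.0, 5.0, 0.0]  # this user cares only about romance

rating = predict_rating(theta_user, x_movie)

# Hypothetical indicator matrix: 3 movies (rows) by 2 users (columns).
R = [[1, 0],
     [1, 1],
     [0, 1]]
m_0 = movies_rated(R, 0)
```

In the content-based setting the movie features x(i) are given and only the per-user parameters θ(j) are learned.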
9.2.2、Collaborative filtering
9.2.3、Implementation tips
Mean normalization: compute the mean rating, then subtract it from all the ratings.
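A minimal sketch of this normalization step follows. It assumes the mean is taken per movie, over only the ratings that actually exist (r(i,j) = 1). The notes do not state this, so treat it as one reasonable reading. The names `mean_normalize`, `Y`, and `R` are illustrative.

```python
def mean_normalize(Y, R):
    """Subtract each movie's mean rating from its observed ratings.

    Y[i][j] is the rating of movie i by user j; R[i][j] is 1 if it exists.
    Unrated entries are left at 0.0 and do not contribute to the mean.
    """
    means = []
    Y_norm = []
    for y_row, r_row in zip(Y, R):
        rated = [y for y, r in zip(y_row, r_row) if r == 1]
        mean = sum(rated) / len(rated) if rated else 0.0
        means.append(mean)
        Y_norm.append([y - mean if r == 1 else 0.0
                       for y, r in zip(y_row, r_row)])
    return Y_norm, means

# Toy data: 2 movies (rows) by 3 users (columns); 0.0 marks "not rated".
Y = [[5.0, 4.0, 0.0],
     [0.0, 1.0, 2.0]]
R = [[1, 1, 0],
     [0, 1, 1]]
Y_norm, means = mean_normalize(Y, R)
```

The stored means are added back when making a prediction, so a user with no ratings ends up with the movie's average rating as the default prediction.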