Getting Started with AI: Notes on Andrew Ng's "Machine Learning" Course, Week 6

6、Advice for Applying Machine Learning

6.1、Evaluating a Learning Algorithm

6.1.1、Deciding What to Try Next

Which of the following statements about diagnostics are true? (BCD)

A. It’s hard to tell what will work to improve a learning algorithm, so the best approach is to go with gut feeling and just see what works.

B. Diagnostics can give guidance as to what might be more fruitful things to try to improve a learning algorithm.

C. Diagnostics can be time-consuming to implement and try, but they can still be a very good use of your time.

D. A diagnostic can sometimes rule out certain courses of action (changes to your learning algorithm) as being unlikely to improve its performance significantly.

6.1.2、Evaluating a Hypothesis

How do we evaluate a hypothesis? This provides the basis for the later discussion of how to avoid overfitting and underfitting.

Suppose an implementation of linear regression (without regularization) is badly overfitting the training set. In this case, we would expect:

The training error to be low, and the test error to be high.

Split the data into a training set and a test set, usually in a 70:30 ratio, selected at random. If the data is not already in random order, shuffle it yourself before taking the first 70%.
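A minimal Python sketch of this split (the function name, fraction, and seed are illustrative, not from the course):

```python
import random

def train_test_split(data, train_frac=0.7, seed=0):
    """Shuffle the examples, then take the first 70% for training
    and the remaining 30% for testing."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = data[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

examples = list(range(10))      # stand-in for (x, y) pairs
train, test = train_test_split(examples)
# 7 training examples, 3 test examples
```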

Misclassification error(0/1 misclassification error)
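The 0/1 misclassification error counts a prediction as wrong whenever the thresholded hypothesis disagrees with the label. A small sketch (the 0.5 threshold is the usual convention for logistic regression):

```python
def zero_one_error(h_values, labels, threshold=0.5):
    """Fraction of examples misclassified: predict 1 when
    h(x) >= threshold, else 0, and compare against the true label y."""
    wrong = sum(1 for h, y in zip(h_values, labels)
                if (1 if h >= threshold else 0) != y)
    return wrong / len(labels)

# One wrong prediction out of four -> test error 0.25
error = zero_one_error([0.9, 0.2, 0.6, 0.8], [1, 0, 0, 1])
```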

6.1.3、Model Selection and Training/Validation/Test Sets

For a given data set, how do we decide which polynomial degree fits best? And how do we choose the regularization parameter lambda for a learning algorithm?

This is the model selection problem (generalization error).

We split the data set into three parts: the training set (60%), the cross validation set (20%), and the test set (20%).

Training error

Cross validation error

Test error

We use the cross validation set to select the model: evaluate how each hypothesis performs on the cross validation set, and choose the model with the lowest cross validation error.
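The degree-selection procedure can be sketched as follows, with toy quadratic data and NumPy's `polyfit` standing in for the course's regression machinery:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = 1.5 * x ** 2 + rng.normal(scale=0.1, size=x.size)  # toy quadratic data

# 60% / 20% / 20% split into train / cross validation / test
idx = rng.permutation(x.size)
tr, cv, te = idx[:36], idx[36:48], idx[48:]

def mse(coeffs, subset):
    """Mean squared error of the fitted polynomial on a subset."""
    return float(np.mean((np.polyval(coeffs, x[subset]) - y[subset]) ** 2))

# Fit one hypothesis per polynomial degree on the training set,
# then pick the degree whose cross validation error is lowest ...
best_d = min(range(1, 7), key=lambda d: mse(np.polyfit(x[tr], y[tr], d), cv))
# ... and report generalization error on the untouched test set.
test_error = mse(np.polyfit(x[tr], y[tr], best_d), te)
```

Because the degree was chosen to minimize Jcv, Jcv is an optimistic estimate; that is why the test set is kept aside for the final error.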

Consider the model selection procedure where we choose the degree of polynomial using a cross validation set. For the final model, we might generally expect Jcv to be lower than Jtest because:

An extra parameter(d,the degree of the polynomial) has been fit to the cross validation set.

6.2、Bias vs. Variance

6.2.1、Diagnosing Bias vs. Variance

High bias: underfitting

High variance: overfitting

[Figure 6_1]

6.2.2、Regularization and Bias/Variance

[Figure 6_2]
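The same cross-validation idea applies to choosing lambda: fit a regularized hypothesis for each candidate value and keep the one with the lowest cross validation error. A sketch using the closed-form regularized normal equation (toy data; for brevity it also regularizes the intercept term, unlike the lecture convention):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=x.size)  # toy data
X = np.vander(x, 9)        # degree-8 polynomial features (prone to overfit)
tr = np.arange(0, 40, 2)   # illustrative even/odd train/CV split
cv = np.arange(1, 40, 2)

def ridge_fit(A, b, lam):
    """theta = (A^T A + lam*I)^(-1) A^T b -- regularized normal equation
    (the intercept column is regularized too, for simplicity)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)

def cv_error(lam):
    theta = ridge_fit(X[tr], y[tr], lam)
    return float(np.mean((X[cv] @ theta - y[cv]) ** 2))

# Small lambda -> high variance; large lambda -> high bias.
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(lambdas, key=cv_error)
```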

6.2.3、Learning Curves

How to tell whether a hypothesis suffers from a bias problem or a variance problem.

High Bias
If a learning algorithm is suffering from high bias, getting more training data will not help much.

[Figure 6_3]

It is therefore very useful to know whether your hypothesis has a high bias or a high variance problem.

High variance
If a learning algorithm is suffering from high variance, getting more training data is likely to help.

[Figure 6_4]
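A learning curve can be computed by training on the first m examples for increasing m and recording both errors; a sketch on toy linear data (names and data are mine, not from the course):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)
y = 3 * x + 1 + rng.normal(scale=0.3, size=x.size)  # toy linear data
order = rng.permutation(100)
tr, cv = order[:70], order[70:]                      # CV set stays fixed

def learning_curve_point(m):
    """Fit a line on the first m training examples and return
    (J_train on those m examples, J_cv on the full CV set)."""
    sub = tr[:m]
    coeffs = np.polyfit(x[sub], y[sub], 1)
    j_train = float(np.mean((np.polyval(coeffs, x[sub]) - y[sub]) ** 2))
    j_cv = float(np.mean((np.polyval(coeffs, x[cv]) - y[cv]) ** 2))
    return j_train, j_cv

# Typically J_train grows and J_cv shrinks as m increases; where the
# two curves level off (close together vs. with a gap) distinguishes
# high bias from high variance.
curve = [learning_curve_point(m) for m in (2, 10, 30, 70)]
```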

6.2.4、Deciding What to Do Next Revisited

[Figure 6_5]

6.3、Machine Learning System Design

6.3.1、Prioritizing What to Work On

6.3.2、Error Analysis

Recommended approach:

1、Implement a basic algorithm as quickly as possible, then evaluate it on the cross validation set;

2、Plot learning curves to see whether more data or more features would improve accuracy;

3、Error analysis: manually examine the examples the algorithm misclassified, and look for any systematic trends.

Error analysis may not tell you whether a given change is likely to improve performance; often the only way to find out is to try it and see whether it works.

6.3.3、Error Metrics for Skewed Classes

Error rate alone is not enough to judge an algorithm's performance on skewed classes; we also need precision and recall.
[Figure 6_6]
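Precision and recall can be computed directly from the confusion counts, taking y = 1 as the rare class. This sketch also shows why raw error is misleading on skewed data:

```python
def precision_recall(predictions, labels):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), with y = 1 as the
    rare positive class; both default to 0 when undefined."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A classifier that always predicts 0 gets only 10% error on this
# skewed set, yet its recall is 0: it never finds a single positive.
labels = [1] + [0] * 9
always_zero = [0] * 10
p, r = precision_recall(always_zero, labels)  # (0.0, 0.0)
```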

6.3.4、Trading Off Precision and Recall

We can compute the F1 score to automatically select the better-performing algorithm.

F1 = 2*(P*R)/(P+R);
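Using F1 rather than the plain average of P and R penalizes degenerate classifiers; a sketch with hypothetical (P, R) pairs of my own:

```python
def f1_score(p, r):
    """F1 = 2PR / (P + R); defined as 0 when both P and R are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical candidate algorithms as (precision, recall) pairs.
# C behaves like "always predict 1": perfect recall, terrible
# precision. Its plain average (P + R)/2 = 0.51 looks decent,
# but F1 (about 0.04) correctly ranks it last.
candidates = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}
best = max(candidates, key=lambda k: f1_score(*candidates[k]))  # "A"
```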

6.3.5、Data for Machine Learning

Banko and Brill made the following observation in 2001:

It's not who has the best algorithm that wins, 
It's who has the most data.

This observation is not always correct:

When there are too few features to make a prediction (for example, predicting house prices from size alone), even much more data may not improve the algorithm's performance;

When the algorithm has many parameters (for example, logistic or linear regression with many features, or a neural network with many hidden units), a larger training set helps reduce overfitting as much as possible.