Andrew Ng's Machine Learning Notes - (12) Support Vector Machines
Large Margin Classification
Optimization Objective
The Support Vector Machine (or SVM) is a powerful algorithm that is widely used in both industry and academia. Compared to both logistic regression and neural networks, the SVM sometimes gives a cleaner and more powerful way of learning complex non-linear functions.
Recall the cost of a single example in logistic regression:

$$-\left(y \log h_\theta(x) + (1-y)\log\left(1 - h_\theta(x)\right)\right) = -y \log\frac{1}{1+e^{-\theta^T x}} - (1-y)\log\left(1 - \frac{1}{1+e^{-\theta^T x}}\right)$$

Plotted as a function of $z = \theta^T x$, the $y=1$ term is a smooth curve that approaches zero as $z$ grows large, and the $y=0$ term approaches zero as $z$ becomes very negative.

With the SVM, we make a small change to these two curves: we use a piecewise-linear cost $\text{cost}_1(z)$ (flat at zero for $z \geq 1$) instead of $-\log\frac{1}{1+e^{-z}}$, and $\text{cost}_0(z)$ (flat at zero for $z \leq -1$) instead of $-\log\left(1 - \frac{1}{1+e^{-z}}\right)$.

So, for logistic regression, we minimize:

$$\min_\theta\; \frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\left(-\log h_\theta(x^{(i)})\right) + (1-y^{(i)})\left(-\log\left(1-h_\theta(x^{(i)})\right)\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

And for the Support Vector Machine, we change it to:

$$\min_\theta\; C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$

We dropped the $\frac{1}{m}$ in both terms and multiplied the first term by a constant $C$, which plays the role of $\frac{1}{\lambda}$. Scaling an objective by a positive constant does not change the minimizing $\theta$, so both forms give the same optimal parameters.
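The lecture only sketches $\text{cost}_1$ and $\text{cost}_0$ as piecewise-linear curves; assuming the common hinge shapes $\max(0, 1-z)$ and $\max(0, 1+z)$, the SVM objective could be computed like this (a minimal sketch, not the course's Octave code):

```python
import numpy as np

def cost1(z):
    # Piecewise-linear cost for y = 1: zero when z >= 1 (assumed hinge shape).
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Piecewise-linear cost for y = 0: zero when z <= -1 (assumed hinge shape).
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    """SVM objective: C * sum of per-example costs + (1/2) * sum_j theta_j^2 (j >= 1)."""
    z = X @ theta                        # theta^T x for every example
    hinge = y * cost1(z) + (1 - y) * cost0(z)
    reg = 0.5 * np.sum(theta[1:] ** 2)   # do not regularize theta_0
    return C * np.sum(hinge) + reg
```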
Large Margin Intuition
Let's see what decision boundary the SVM learns. Recall its optimization objective:

$$\min_\theta\; C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^T x^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T x^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$

If we set $C$ to a very large value (for example $C = 100{,}000$), then to minimize this objective we are highly motivated to choose $\theta$ so that the first term equals zero. Our goal becomes:

$$\min_\theta\; \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.} \quad \theta^T x^{(i)} \geq 1 \;\text{ if } y^{(i)}=1, \qquad \theta^T x^{(i)} \leq -1 \;\text{ if } y^{(i)}=0$$
We will get a very interesting decision boundary. The Support Vector Machine will choose a decision boundary that does a great job of separating the positive and negative examples (the black line below instead of the others).

We see that the black decision boundary has a larger minimum distance from any of the training examples, whereas the magenta and green lines come awfully close to the training examples.

Mathematically, the black decision boundary keeps a larger distance from the examples (the blue lines). That distance is called the margin of the support vector machine, and it gives the SVM a certain robustness, because it tries to separate the data with as large a margin as possible. So the support vector machine is sometimes also called a large margin classifier.
If $C$ is very large, the SVM will be sensitive to outliers. For example, with a very large $C$ and an outlier at the bottom left, the learning algorithm would switch to the magenta decision boundary. However, if $C$ is not too large, we still get the black one, as wanted:
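To see the effect of $C$ concretely, here is a small sketch using scikit-learn's `SVC` with a linear kernel (a library choice assumed here, not the course's Octave tooling), fitting the same toy data, including one outlier, with a large and a small `C` and printing the resulting boundaries for comparison:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two clusters plus one positive outlier sitting among the negatives.
X = np.array([[3.0, 3.0], [4.0, 4.0], [3.5, 4.5],   # positives
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0],   # negatives
              [1.5, 1.0]])                           # outlier labeled positive
y = np.array([1, 1, 1, 0, 0, 0, 1])

for C in (100000.0, 1.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    print(f"C={C:>8}: boundary {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```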
Mathematics Behind Large Margin Classification
Vector Inner Product
Let's say we have two vectors:

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, \qquad v = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}$$

We define their vector inner product as:

$$u^T v = u_1 v_1 + u_2 v_2$$

And we can also get the inner product geometrically:

$$u^T v = p \cdot \|u\|$$

where $p$ is the (signed) length of the projection of $v$ onto $u$, and $\|u\| = \sqrt{u_1^2 + u_2^2}$ is the length of the vector $u$.
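A tiny numeric check of the two formulas, with made-up vectors:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

algebraic = u @ v                        # u1*v1 + u2*v2 = 10.0
p = v @ (u / np.linalg.norm(u))          # signed length of the projection of v onto u
geometric = p * np.linalg.norm(u)        # p * ||u|| = 10.0

print(algebraic, geometric)              # both give the same inner product
```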
SVM Decision Boundary
When $C$ is very large, the training objective reduces to:

$$\min_\theta\; \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{s.t.} \quad \theta^T x^{(i)} \geq 1 \;\text{ if } y^{(i)}=1, \qquad \theta^T x^{(i)} \leq -1 \;\text{ if } y^{(i)}=0$$

If we make the simplification $\theta_0 = 0$ and $n = 2$, then $\frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\left(\theta_1^2 + \theta_2^2\right) = \frac{1}{2}\|\theta\|^2$.

Using the projection view of the inner product, $\theta^T x^{(i)} = p^{(i)}\cdot\|\theta\|$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto $\theta$. So the problem can be rewritten as:

$$\min_\theta\; \frac{1}{2}\|\theta\|^2 \quad \text{s.t.} \quad p^{(i)}\cdot\|\theta\| \geq 1 \;\text{ if } y^{(i)}=1, \qquad p^{(i)}\cdot\|\theta\| \leq -1 \;\text{ if } y^{(i)}=0$$

For a positive example, the constraint $p^{(i)}\cdot\|\theta\| \geq 1$ must hold; to minimize $\frac{1}{2}\|\theta\|^2$ we want $\|\theta\|$ to be small, which in turn forces $p^{(i)}$ to be as large as possible. So we get a large $p^{(i)}$.

For a negative example, similarly, we get a $p^{(i)}$ whose absolute value is large.

Since $\theta$ is perpendicular to the decision boundary, $|p^{(i)}|$ is the distance from $x^{(i)}$ to that boundary, so a large $|p^{(i)}|$ gives us a large margin.
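As a small numeric illustration of that trade-off (made-up points; for a fixed direction of $\theta$ the smallest feasible $\|\theta\|$ is $1/\min_i |p^{(i)}|$, so a direction with larger projections needs a smaller $\|\theta\|$):

```python
import numpy as np

# Two positive and two negative toy examples (theta_0 = 0).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])   # +1/-1 labels, for symmetry of the constraints

def min_theta_norm(direction):
    """Smallest ||theta|| with theta along `direction` satisfying y * theta^T x >= 1."""
    d = direction / np.linalg.norm(direction)
    p = X @ d                   # signed projections of each example onto the direction
    if np.any(y * p <= 0):      # this direction does not even separate the data
        return np.inf
    return 1.0 / np.min(y * p)

print(min_theta_norm(np.array([1.0, 1.0])))  # ~0.35: large worst-case projection -> small ||theta||
print(min_theta_norm(np.array([1.0, 0.2])))  # ~0.64: smaller projections -> larger ||theta|| needed
```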
Kernels
Kernels and Similarity
We can develop complex nonlinear classifiers by adapting support vector machines. The main technique for this is something called kernels.
Let's say we have a training set that looks like this:

Recall that previously we would work with lots of polynomial features: the hypothesis predicts $y = 1$ if $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 + \cdots \geq 0$, and predicts $y = 0$ otherwise.

Another way of writing that is to predict $y = 1$ if:

$$\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots \geq 0$$

where $f_1 = x_1,\; f_2 = x_2,\; f_3 = x_1 x_2,\; f_4 = x_1^2,\; f_5 = x_2^2,\; \ldots$

If we have a lot of features, there will be too many polynomial terms, which can become very computationally expensive. So is there a different or better choice of features to plug into this sort of hypothesis form? In other words, how do we define a new set of features $f_1, f_2, f_3, \ldots$ that avoids the expensive high-order polynomial terms?
Here is one idea:
To simplify the description and make it more intuitive, let's say we have only $x_1$ and $x_2$, and we are going to define only three new features ($f_1, f_2, f_3$).

Leaving $x_0$ alone, we manually pick a few points from the $x_1$-$x_2$ plane; in this example they are denoted $l^{(1)}, l^{(2)}, l^{(3)}$. We call these points landmarks.

Then, given an $x$, we compute each new feature depending on its proximity to the landmarks, as follows:

$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$$
The mathematical term for the similarity function is the kernel function. There are many kernel functions to choose from, and the specific one we're using here is called the Gaussian kernel.

Note: $\|x - l^{(i)}\|$ means the length (Euclidean norm) of the vector $x - l^{(i)}$.
Given $x$, for each landmark $l^{(i)}$ we get a feature $f_i$:

- If $x$ is near $l^{(i)}$: $f_i = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right) \approx \exp(0) = 1$
- If $x$ is far from $l^{(i)}$: $f_i = \exp\left(-\frac{(\text{large number})^2}{2\sigma^2}\right) \approx 0$

We can also see that if $\sigma^2$ is large, then as we move away from $l^{(i)}$, the value of $f_i$ falls off much more slowly.
So, given this definition of the features, let's see what sort of hypothesis we can learn:

We can see that, in this example, the hypothesis predicts "1" only when $x$ is near $l^{(1)}$ or $l^{(2)}$; if $x$ is far from these two landmarks, it predicts "0".
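Here is a minimal sketch of the Gaussian similarity feature, with example values chosen for $x$, the landmarks and $\sigma$ (all assumed for illustration):

```python
import numpy as np

def gaussian_kernel(x, landmark, sigma):
    # f = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

landmarks = [np.array([3.0, 5.0]), np.array([0.0, 1.0])]
x_near = np.array([3.1, 5.2])    # close to the first landmark -> f close to 1
x_far = np.array([10.0, 10.0])   # far from both landmarks -> f close to 0

for sigma in (1.0, 3.0):         # a larger sigma makes f fall off more slowly
    f_near = [gaussian_kernel(x_near, l, sigma) for l in landmarks]
    f_far = [gaussian_kernel(x_far, l, sigma) for l in landmarks]
    print(f"sigma={sigma}: f(near)={np.round(f_near, 3)}, f(far)={np.round(f_far, 3)}")
```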
SVM with Kernels
We have talked about the process of picking a few landmarks. But how do we choose the landmarks $l^{(i)}$? In practice, we simply put a landmark at the location of each of our $m$ training examples. Here is what we are going to do:
The Support Vector Machine With Kernels Algorithm
Given a training set: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$

Choose the landmarks to be the training examples: $l^{(1)} = x^{(1)},\; l^{(2)} = x^{(2)},\; \ldots,\; l^{(m)} = x^{(m)}$

For an example $x$, compute: $f_i = \text{similarity}(x, l^{(i)})$

Group the $f_i$ into a vector: $f = [f_0, f_1, \ldots, f_m]^T$, in which $f_0 = 1$ is an extra feature added by convention.

Hypothesis:

- Given $x$, compute the features $f \in \mathbb{R}^{m+1}$
- Predict $y = 1$ if $\theta^T f \geq 0$ (where $\theta \in \mathbb{R}^{m+1}$)
Training:

$$\min_\theta\; C\sum_{i=1}^{m}\left[ y^{(i)}\,\text{cost}_1(\theta^T f^{(i)}) + (1-y^{(i)})\,\text{cost}_0(\theta^T f^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$

Notice: instead of the $\theta^T x^{(i)}$ that we used before, we now use $\theta^T f^{(i)}$. And here our $n$ is equal to $m$, since there is one feature per landmark.

Mathematical tricks in practice: ignoring $\theta_0$, we can write $\sum_{j=1}^{n}\theta_j^2 = \theta^T\theta$. Moreover, for computational efficiency when dealing with a large training set, implementations actually use $\theta^T M \theta$ instead of $\theta^T\theta$, where $M$ is a matrix that depends on the kernel; this slightly different distance metric allows the software to scale to much bigger training sets.
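A sketch of the feature construction and the prediction step, assuming the Gaussian kernel above and a hypothetical already-trained parameter vector `theta` (in practice a library does the training for us, as discussed below):

```python
import numpy as np

def kernel_features(x, landmarks, sigma):
    """Map x to f = [1, f_1, ..., f_m] using Gaussian similarity to each landmark."""
    sims = [np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2)) for l in landmarks]
    return np.concatenate(([1.0], sims))   # f_0 = 1 by convention

def predict(x, theta, landmarks, sigma):
    # Predict 1 if theta^T f >= 0, else 0.
    f = kernel_features(x, landmarks, sigma)
    return int(theta @ f >= 0)

# The landmarks are simply the training examples themselves.
X_train = np.array([[1.0, 1.0], [2.0, 2.5], [8.0, 8.0]])
landmarks = list(X_train)
theta = np.array([-0.5, 1.0, 1.0, 0.0])   # hypothetical, already-trained parameters

print(predict(np.array([1.5, 1.5]), theta, landmarks, sigma=1.0))  # near l1/l2 -> 1
print(predict(np.array([9.0, 9.0]), theta, landmarks, sigma=1.0))  # far from l1/l2 -> 0
```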
SVM parameters
Here are the parameters we need to choose when using the algorithm above:

- $C$ ($= \frac{1}{\lambda}$)
  - Large $C$: lower bias, higher variance (small $\lambda$)
  - Small $C$: higher bias, lower variance (large $\lambda$)
- $\sigma^2$
  - Large $\sigma^2$: features $f_i$ vary more smoothly. Higher bias, lower variance.
  - Small $\sigma^2$: features $f_i$ vary less smoothly. Lower bias, higher variance.
E.g. suppose we train an SVM and find that it overfits our training data. A reasonable next step is to decrease $C$ or increase $\sigma^2$.
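For concreteness, if we use scikit-learn (an assumption; the course itself uses Octave packages), its RBF kernel is written as $\exp(-\gamma\|x - l\|^2)$, so $\gamma = \frac{1}{2\sigma^2}$ and the bias/variance knobs map like this:

```python
from sklearn.svm import SVC

def make_gaussian_svm(C, sigma):
    # Large C or small sigma -> lower bias, higher variance (risk of overfitting).
    # Small C or large sigma -> higher bias, lower variance (risk of underfitting).
    gamma = 1.0 / (2.0 * sigma ** 2)
    return SVC(kernel="rbf", C=C, gamma=gamma)

clf_flexible = make_gaussian_svm(C=100.0, sigma=0.3)   # prone to overfit
clf_smooth   = make_gaussian_svm(C=1.0,   sigma=3.0)   # prone to underfit
```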
SVM in Practice
Using An SVM
It is not recommended to write our own software to solve for the SVM parameters $\theta$; we should call well-tested library functions to do that.
When using an SVM software package (e.g. liblinear, libsvm) to solve for the parameters $\theta$, we need to specify:

- Choice of parameter $C$;
- Choice of kernel (similarity function):
  - No kernel (a "linear kernel", essentially the familiar linear classifier): predict $y = 1$ if $\theta^T x \geq 0$. Use this when $n$ is large and $m$ is small (no kernel helps avoid overfitting).
  - Gaussian kernel: $f_i = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$, where $l^{(i)} = x^{(i)}$ (need to choose $\sigma^2$). Use this when $n$ is small and $m$ is large (the Gaussian kernel can fit a more complex nonlinear decision boundary).
If we decide to use a Gaussian kernel, here is what we are going to do:
- Provide a kernel (similarity) function that, given a pair of vectors $x_1, x_2$, returns $\exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right)$.
Note: Do perform feature scaling before using the Gaussian kernel.
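A sketch of that workflow with scikit-learn (an assumed library choice), scaling the features before applying the Gaussian (RBF) kernel:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: the two feature scales differ wildly, which would distort ||x - l||^2.
X = np.array([[1.0, 2000.0], [2.0, 3000.0], [8.0, 500.0], [9.0, 800.0]])
y = np.array([1, 1, 0, 0])

model = make_pipeline(
    StandardScaler(),                    # feature scaling before the Gaussian kernel
    SVC(kernel="rbf", C=1.0, gamma=0.5)  # gamma = 1 / (2 * sigma^2)
)
model.fit(X, y)
print(model.predict(np.array([[1.5, 2500.0]])))  # expected to fall in the "1" cluster
```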
Other choices of kernel
Not all similarity functions make valid kernels. (They need to satisfy a technical condition called Mercer's Theorem to make sure the SVM packages' optimizations run correctly and do not diverge.)
There are many off-the-shelf kernels available, for example:
- Polynomial kernel: $k(x, l) = (x^T l + c)^d$; we need to choose the constant $c$ and the degree $d$. It usually performs worse than the Gaussian kernel and is typically used only when $x$ and $l$ are all strictly non-negative (see the short sketch after this list).
- More esoteric: string kernel (for when the input is a text string), chi-square kernel, histogram intersection kernel, ...
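For reference, a one-line sketch of that polynomial kernel, with hypothetical values for $c$ and $d$:

```python
import numpy as np

def polynomial_kernel(x, l, c=1.0, d=3):
    # k(x, l) = (x^T l + c)^d; c and d here are hypothetical example values.
    return (x @ l + c) ** d

print(polynomial_kernel(np.array([1.0, 2.0]), np.array([0.5, 1.5])))  # (3.5 + 1)^3 = 91.125
```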
Multi-class classification

Many SVM packages already have built-in multi-class classification functionality. Otherwise, we can use the one-vs-all method: train $K$ SVMs, one to distinguish each class $y = i$ from the rest, obtaining $\theta^{(1)}, \ldots, \theta^{(K)}$, and pick the class $i$ with the largest $(\theta^{(i)})^T x$.
Logistic regression vs. SVMs

$n$ = number of features ($x \in \mathbb{R}^{n+1}$);
$m$ = number of training examples;

- If $n$ is large (relative to $m$), e.g. $n = 10{,}000$, $m = 10\text{--}1000$:
  - Use logistic regression, or SVM without a kernel ("linear kernel")
- If $n$ is small and $m$ is intermediate, e.g. $n = 1\text{--}1000$, $m = 10\text{--}10{,}000$:
  - Use SVM with a Gaussian kernel
- If $n$ is small and $m$ is large, e.g. $n = 1\text{--}1000$, $m = 50{,}000+$:
  - Create/add more features, then use logistic regression or SVM without a kernel

Note: a neural network is likely to work well for most of these settings, but may be slower to train.
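As a compact restatement of these guidelines (just an illustrative helper with assumed thresholds, not anything from the course code):

```python
def suggest_classifier(n_features, m_examples):
    """Rough rule of thumb from the notes above for picking a model."""
    if n_features >= m_examples:         # n large relative to m
        return "logistic regression, or SVM with a linear kernel"
    if m_examples <= 10000:              # n small, m intermediate (assumed cutoff)
        return "SVM with a Gaussian kernel"
    # n small, m large
    return "create/add more features, then logistic regression or SVM with a linear kernel"

print(suggest_classifier(n_features=10000, m_examples=500))
print(suggest_classifier(n_features=100, m_examples=5000))
print(suggest_classifier(n_features=100, m_examples=200000))
```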