Машинное обучение (12) Машины опорных векторов

машинное обучение

Это 30-й день моего участия в августовском испытании обновлений. Узнайте подробности события:Испытание августовского обновления

Заметки о машинном обучении Эндрю Нг - (12) машины опорных векторов

Large Margin Classification

Optimization Objective

The Support Vector Machine (or SVM) is a powerful algorithm that is widely used in both industry and academia. Compared to both logistic regression and neural networks, the SVM somethimes gives a cleaner and more powerful way of learning complex non-linear functions.

What the cost of example in our logistic regression is as:

ylog(11+eθTx)(1y)log(111+eθTx)-y\log\Big(\frac{1}{1+e^{-\theta^Tx}}\Big)-(1-y)\log\Big(1-\frac{1}{1+e^{-\theta^Tx}}\Big)

It looks like this:

屏幕快照 2019-09-19 23.23.39

With SVM, we'd like to make a little change of them so that:

image-20191029161946046

We set a cost1\textrm{cost}_1 instead of log(h)-\log(h) and set a cost0\textrm{cost}_0 instead of log(1h)-\log(1-h).

So, for logistic regression, we are going to:

minθ1mi=1m[y(i) log(hθ(x(i)))(1y(i)) log(1hθ(x(i)))]+λ2mj=1nθj2\min_\theta\frac{1}{m} \sum_{i=1}^m \large[ -y^{(i)}\ \log (h_\theta (x^{(i)})) - (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2

And for Support Vector Machine, we change it to:

minθCi=1m[y(i)cost1(θTx(i))+(1y(i)) cost0(θTx(i))]+12j=1nθj2\min_\theta C\sum_{i=1}^m \large[ y^{(i)} \mathop{\textrm{cost}_1}(\theta^Tx^{(i)}) + (1 - y^{(i)})\ \mathop{\textrm{cost}_0}(\theta^Tx^{(i)})\large] + \frac{1}{2}\sum_{j=1}^n \theta_j^2

We dropped the 1m\frac{1}{m} in both terms, then multiplied a C=λC=\lambda to them. In this case we can promise that we can get the same min θ\theta with those two operators.

image-20191029164223055

Large Margin Intuition

Let's say the SVM decision boundary:

minθCi=1m[y(i)cost1(θTx(i))+(1y(i)) cost0(θTx(i))]+12j=1nθj2\min_\theta C\sum_{i=1}^m \large[ y^{(i)} \mathop{\textrm{cost}_1}(\theta^Tx^{(i)}) + (1 - y^{(i)})\ \mathop{\textrm{cost}_0}(\theta^Tx^{(i)})\large] + \frac{1}{2}\sum_{j=1}^n \theta_j^2

If we set CC a very large value, 100000100000 for example, when this optimization objective , we're going to be highly motivated to choose a value, so that the first term is equal to zero. And our goal become:

minθ12j=1nθj2s.t.θTx(i)1if y(i)=1θTx(i)1if y(i)=0\begin{array}{l} \min_\theta \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \\\\ s.t. \quad \begin{array}{l} \theta^Tx^{(i)} \ge 1 & \textrm{if } y^{(i)}=1\\ \theta^Tx^{(i)} \le -1 & \textrm{if } y^{(i)}=0 \end{array} \end{array}

We will get a very intersting decision boundary. The Support Vector Machines will choose a decision boundary that does a great job of separating the positive and negative examples (the black line below install of others).

image-20191029171503318

we see that the black decision boundary has some larger minimum distance from any of my training examples, whereas the magenta and the green lines come awfully close to the training examples.

And mathematically, what that does is, this black decision boundary has a larger distance (the blue lines).That distance is called the margin of the support vector machine and this gives the SVM a certain robustness, because it tries to separate the data with as a large a margin as possible. So the support vector machine is sometimes also called a large margin classifier.

If CC is very large than the SVM will be sensitive to outliers. For example, we a very large CC, 100000100000 maybe, with a outliers at the left bottom, the learning algorithm will get the magenta decision boundary. However, if CC is not too large, we are going to get the black one as wanted:

image-20191029173209265

Mathematics Behind Large Margin Classification

Vector Inner Product

Let's say we have two vectors:

u=[u1u2],v=[v1v2]u=\left[\begin{array}{l}u1 \\ u2\end{array}\right],\qquad v=\left[\begin{array}{l}v1 \\ v2\end{array}\right]

We define their Vector Inner Product as:

Inner Product=uTv=vTu\textrm{Inner Product} = u^Tv=v^Tu

And we can also get the inner product by getting:

u=length of vector u=u12+u22p=length of projection of v onto u (signed)uTv=pu\begin{array}{ccl} ||u|| &=& \textrm{length of vector } u= \sqrt{u_1^2+u_2^2}\\ p &=& \textrm{length of projection of } v \textrm{ onto } u \textrm{ (signed)} \\ \\ u^Tv &=& p \cdot ||u|| \end{array}

image-20191031111759471

SVM Decision Boundary

When the CC is very large, we can get this:

minθ12j=1nθj2=12(j=1nθj2)2=12θ2s.t.θTx(i)1if y(i)=1θTx(i)1if y(i)=0\begin{array}{l} \min_\theta \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 =\frac{1}{2}\Big(\sqrt{\sum_{j=1}^n\theta_j^2}\Big)^2 =\frac{1}{2}||\theta||^2 \\\\ s.t. \quad \begin{array}{l} \theta^Tx^{(i)} \ge 1 & \textrm{if } y^{(i)}=1\\ \theta^Tx^{(i)} \le -1 & \textrm{if } y^{(i)}=0 \end{array} \end{array}

If we make a simplification as θ0=0,n=2\theta_0=0, n=2, so that θ=[θ1,θ2]T\theta=[\theta_1,\theta_2]^T, then our θTx(i)=p(i)θ=θ1x1(i)+θ2x2(i)\theta^Tx^{(i)}=p^{(i)}\cdot ||\theta||=\theta_1x_1^{(i)}+\theta_2x_2^{(i)}.

image-20191031123156678

So, our description can be:

minθ12j=1nθj2=12(j=1nθj2)2=12θ2s.t.p(i)θ1if y(i)=1p(i)θ1if y(i)=0where p(i) is the projection of x(i) onto the vector θ.\begin{array}{l} \min_\theta \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 =\frac{1}{2}\Big(\sqrt{\sum_{j=1}^n\theta_j^2}\Big)^2 =\frac{1}{2}||\theta||^2 \\\\ s.t. \quad \begin{array}{l} p^{(i)}\cdot ||\theta|| \ge 1 & \textrm{if } y^{(i)}=1\\ p^{(i)}\cdot ||\theta|| \le -1 & \textrm{if } y^{(i)}=0 \end{array}\\\\ \textrm{where $p^{(i)}$ is the projection of $x^{(i)}$ onto the vector $\theta$.} \end{array}

In the condition of y=1y=1, to min the θ\theta, we should promise θ||\theta|| is small as well as make p(i)θ1p^{(i)}\cdot ||\theta|| \ge 1 as possible. So, we get a large p(i)p^{(i)}.

For the negative example, similarly, we will get a p(i)p^{(i)} whos absolute value is large.

屏幕快照 2019-10-31 12.58.43

As we seen, a large p(i)p^{(i)} gives us a large margin.

kernels

Kernels and Similarity

We can develop complex nonlinear classifiers by adapting support vector machines. The main technique for this is something called kernels.

Let's say if we a training set that looks this:

image-20191102114127170

Recall the past, we will work with lots of polynomial features to hypothesis y=1y=1 if θ0+θ1x1+θ2x2+θ3x1x2+θ4x12+θ5x22+0\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1x_2+\theta_4x_1^2+\theta_5x_2^2+\cdots\ge 0 and predict 00 otherwise.

Another way of that is:

Predict y=1ifθ0+θ1f1+θ2f2+θ3f3+0\begin{array}{lcl} \textrm{Predict } y=1 &\textrm{if}&\theta_0+\theta_1f_1+\theta_2f_2+\theta_3f_3+\cdots\ge 0 \end{array}

Where f1=x1,f2=x2,f3=x1x2,f4=x12,...f_1=x_1,f_2=x_2,f_3=x_1x_2,f_4=x_1^2,...

If we have a lot of features, then there will be too many polynomial terms, which can become very computationally expensive. So is there a different or a better choice of the features that we can use to plug into this sort of hypothesis form, or in another words, how to define a set of new f1,f2,f3f_1,f_2,f_3 to avoid high polynomial calculations?

Here is one idea:

image-20191102124615797

To simplify the description and make it more intuitional, let's say we have only x0,x1,x2x_0,x_1,x_2 and we are going to define only three new features (f1,f2,f3f_1,f_2,f_3).

Leave x0x_0 alone, we manually pick a few points (x1(i),x2(i))(x_1^{(i)},x_2^{(i)}) from a plot of x1x_1-x2x_2, in this example they are notated as l(1),l(2),l(3)l^{(1)},l^{(2)},l^{(3)}. We call this points landmarks.

And then, given a xx, we can compute new feature depending on proximity to landmarks l(1),l(2),l(3)l^{(1)},l^{(2)},l^{(3)} as follow:

fi=similarity(x,l(i))=exp(xl(i)22о2)=exp(j=1n(xjlj(i))22о2)f_i = \mathop{\textrm{similarity}}(x,l^{(i)}) = \exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2}) = \exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2})

The mathematical term for the simiarity function is kernel function. There are many formula of kernel functions can be chosen. And the specific kernel we're using here is actually called a Gaussian Kernel.

Note. w||w|| means the length of the vector ww.

Given xx, for each l(i)l^{(i)} we can get a fif_i:

  • If x is near l(i)l^{(i)}:
filimxl(i)exp(xl(i)22о2)=exp(022о2)=1f_i \approx \lim_{x\to l^{(i)}}\exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2}) = \exp(-\frac{0^2}{2\sigma^2}) = 1
  • If xx far from l(i)l^{(i)}:
filimxl(i)+exp(xl(i)22о2)=exp(22о2)=0f_i \approx \lim_{||x-l^{(i)}||\to +\infin}\exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2}) = \exp(-\frac{\infin^2}{2\sigma^2}) = 0

image-20191102132852867

We can see that: If sigma squared is large, as we move away from l(1)l^{(1)}, the value of f1f_1 falls away much more slowly.

So, given this definition of the features, let's see what source of hypothesis we can learn:

image-20191102133600603

We can see that, in this example, only given the xx near l(1)l^{(1)} or l(2)l^{(2)} will it predict "1", if the xx if far from this two landmarks, it will predict "0".

SVM with Kernels

We have talked about the process of picking a few landmarks. But how to choose our l(i)l^{(i)}s? Actually we will simply choose the location of our landmarks to be exactly near the locations of our m training examples. Here is what we are going to do:


The Support Vector Machine With Kernels Algorithm

Given a set of training data: (x(1),y(1)),(x(2),y(2)),...,(x(m),y(m))(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})

Choose landmarks: l(i)=x(i)for i=1,2,,ml^{(i)}=x^{(i)} \quad \textrm{for }i=1,2,\cdots,m

For example xx: fi=similarity(x,l(i))for i=1,2,,mf_i=\mathop{\textrm{similarity}}(x,l^{(i)}) \quad \textrm{for }i=1,2,\cdots,m

Group fif_is into a vector: f=[f0,f1,f2,,fm]Tf=[f_0,f_1,f_2,\cdots,f_m]^T, in which the f01f_0 \equiv 1 is a extra feature for convention.

Hypothesis:

  • Given xx, compute features fеRm+1f\in\R^{m+1}
  • Predict y=1y=1 if θTf=θ0f0+θ1f1+0\theta^Tf=\theta_0f_0+\theta_1f_1+\cdots\ge 0 (where θеRm+1\theta\in\R^{m+1})

Training:

minθCi=1m[y(i)cost1(θTf(i))+(1y(i)) cost0(θTf(i))]+12j=1nθj2\min_\theta C\sum_{i=1}^m \large[ y^{(i)} \mathop{\textrm{cost}_1}(\theta^Tf^{(i)}) + (1 - y^{(i)})\ \mathop{\textrm{cost}_0}(\theta^Tf^{(i)})\large] + \frac{1}{2}\sum_{j=1}^n \theta_j^2

Notice: instead of θTx(i)\theta^Tx^{(i)} that we use before, now we use θTf(i)\theta^Tf^{(i)}. And here our nn is equal to mm.

Mathematical tricks in practice: Ignoring the θ0\theta_0 (s.t. θ=[θ1,,θm]T\theta=[\theta_1,\cdots,\theta_m]^T), we can get jθj2=θTθ\sum_j\theta_j^2=\theta^T\theta. Moreover, for mathematical convenience when deal with a large training set, we gonna use θTMθ\theta^T M \theta instead of θTθ\theta^T\theta.


SVM parameters

Here are parameters that we are going to choose when use the algorithm above:

  • CC (=1m\frac{1}{m})
    • Large CC: Lower bias, high variance (small λ\lambda)
    • Small CC: Higher bias, low variance (large λ\lambda)
  • о2\sigma^2
    • Large о2\sigma^2: Features fif_i vary more smoothly. Higher bias, lower variance.屏幕快照 2019-11-02 15.51.46
    • Small о2\sigma^2: Features fif_i vary less smoothly. Lower bias, higher variance.屏幕快照 2019-11-02 15.51.50

E.g. Suppose we train an SVM and find it overfits our training data. A reasonable next step is going to decrease CC or increase о2\sigma^2.

SVM in Practice

Using An SVM

We are not recommended writing our own software to solve for the parameters via a SVM. We should call some library functions to do that.

When using SVM software package (e.g. liblinear, libsvm) to solve for parameters θ\theta, we need to specify:

  • Choice of parameter CC;
  • Choice of kernel (similarity function):
    • No kernel ("linear kernel", do our common logistic regression): Predict y=1y=1 if θTx0\theta^Tx\ge0. For large nn and small mm case (no kernel can avoid overfitting).
    • Gaussian kernel: fi=exp(xl(i)22о2)f_i=\exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2}), where l(i)=x(i)l^{(i)}=x^{(i)}.(Need to choose о2\sigma^2). For small nn and large mm case (Gaussian kernel can fit a more complex nonlinear decision boundary).

If we decide to use a Gaussian kernel, here is what we are going to do:

  • Provide a kernel (similarity) function:
    function f = kernel(x1, x2)f=exp(x1x222о2)return\begin{array}{l} \textrm{function f = kernel(x1, x2)}\\ \qquad \textrm{f} = \exp(-\frac{||\textrm{x1}-\textrm{x2}||^2}{2\sigma^2})\\ \textrm{return} \end{array}
    Note: Do perform feature scaling before using the Gaussian kernel.

Other choices of kernel

Not all similarity functions similarity(x,l)\mathop{\textrm{similarity}}(x,l) make valid kernels. (Need to satisfy technical condiction called "Mercer's Theorem" to make sure SVM packages' optimizations run correctly, and do not diverge).

There are many off-the-shelf kernels available, for example:

  • Polynomial kernel: k(x,l)=(XTl+constant)degreek(x,l)=(X^Tl+constant)^{degree}, need to choose the constantconstant and degreedegree. Usually performs worse than Gaussian kernel. Only for XX and ll are all strictly non negative.
  • More esoteric: String kernel (for input is text string), chi-square kernel, histogram intersection kernel, ...

Multi-class classification

image-20191102171432072

Logistic regression v.s. SVMs

nn = number of features (xеRn+1x\in\R^{n+1});

mm = number of training examples;

  • If nn is large (relative to mm): (nmn\ge m, e.g. n=10,000,m=101000n=10,000, m=10 \sim 1000)
    • Use logistic regression, or SVM without a kernel ("linear kernel")
  • If nn is small, mm is intermediate: (e.g. n=11000,m=50,000+n=1\sim1000,m=50,000+)
    • Use SVM with Gaussian kernel
  • If nn is small, mm is large:
    • Create/add more features, then use logistic regression or SVM without a kernel

Note: Neural network likely to work well for most of these settings, but may be slower to train.