CODES / fit / svm

Train a Support Vector Machine

Contents

  • Syntax
  • Description
  • Kernels
  • Solvers
  • Weighted formulation
  • Training Options
  • Evaluation and Post-Processing
  • Mini tutorials
  • References

Syntax

Description

A Support Vector Machine (SVM) is a machine learning classification tool. Its core idea relies on the definition of an optimal decision function. In the linear case, the decision function is a hyperplane defined through the following quadratic optimization problem:

$$\begin{array}{rll}\mathop{\min}\limits_{\beta,b}&\frac{1}{2}||\beta||^2\\\textbf{s.t.}&1-l^{(i)}(\beta^T\mathbf{x}^{(i)}+b)\leq0&\forall i=\{1,\ldots,n\}\end{array}$$

where $n$ is the number of training samples and the optimal separating hyperplane is defined by $\beta^T\mathbf{x}+b=0$. The constant $b$ (sometimes referred to as $\beta_0$) is called the bias, $\mathbf{x}^{(i)}$ is the $i\textrm{\textsuperscript{th}}$ training sample, and $l^{(i)}$ is the $i\textrm{\textsuperscript{th}}$ label. By convention, the labels $l^{(i)}$ are derived from the function values $y^{(i)}$ such that:

$$l^{(i)}=\left\{\begin{array}{ll}+1&\textrm{if }y^{(i)}>0\\-1&\textrm{if }y^{(i)}\leq0\end{array}\right.$$

Note: in the case of a response $f$, and a threshold $f_0$, the classification is performed on the sign of a limit state function $y$ defined as $y(x)=f(x)-f_0$ or $y(x)=f_0-f(x)$.
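For illustration, a minimal, toolbox-independent MATLAB sketch of this labeling convention (the response values and threshold are made up):

    % Labeling convention: classify on the sign of the limit state y(x) = f(x) - f0
    f  = [0.3; 1.2; -0.5; 2.0; 0.9];   % made-up responses f(x) at five samples
    f0 = 1;                            % made-up threshold
    y  = f - f0;                       % limit state function values
    l  = ones(size(y));                % labels l
    l(y <= 0) = -1;                    % l = +1 if y > 0, l = -1 if y <= 0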

In the case of non-separable data, the optimization problem is infeasible and needs to be relaxed. This is done through the introduction of slack variables $\xi_i$:

$$\begin{array}{rll}\mathop{\min}\limits_{\beta,b,\xi}&\frac{1}{2}||\beta||^2+\frac{C}{L}\displaystyle\sum_{i=1}^n\xi_i^L\\\textbf{s.t.}&1-l^{(i)}(\beta^T\mathbf{x}^{(i)}+b)-\xi_i\leq0&\forall i=\{1,\ldots,n\}\\&\xi_i\geq0&\forall i=\{1,\ldots,n\}\end{array}$$

where $L$ is either 1 or 2 and determines the loss function used. In the remainder of this help, $L=1$ (L1 SVM) is used; the derivation extends easily to the L2 SVM. Using the Karush-Kuhn-Tucker (KKT) conditions, one can show that:

$$\left\{\begin{array}{rcl}0&=&\displaystyle\sum_{i=1}^n\alpha_il^{(i)}\\\beta&=&\displaystyle\sum_{i=1}^n\alpha_il^{(i)}\mathbf{x}^{(i)}\\\alpha_i&=&C-\mu_i\end{array}\right.$$

where $\alpha_i$ and $\mu_i$ are the $i\textrm{\textsuperscript{th}}$ Lagrange multipliers associated with the first and second sets of constraints, respectively. This leads to the dual problem:

$$\begin{array}{rll}\mathop{\max}\limits_{\alpha}&\displaystyle\sum_{i=1}^n\alpha_i-\frac{1}{2}\displaystyle\sum_{i=1}^n\displaystyle\sum_{j=1}^n\alpha_i\alpha_jl^{(i)}l^{(j)}\mathbf{x}^{(i)T}\mathbf{x}^{(j)}\\\textbf{s.t.}&0\leq\alpha_i\leq C&\forall i=\{1,\ldots,n\}\\&\displaystyle\sum_{i=1}^n\alpha_il^{(i)}=0\end{array}$$

Finally, the so-called kernel trick consists of recognizing that the scalar product $\mathbf{x}^{(i)T}\mathbf{x}^{(j)}$ can be replaced by a kernel function $K_\theta$ with parameter $\theta$, so that the dual problem becomes:

$$\begin{array}{rll}\mathop{\max}\limits_{\alpha}&\displaystyle\sum_{i=1}^n\alpha_i-\frac{1}{2}\displaystyle\sum_{i=1}^n\displaystyle\sum_{j=1}^n\alpha_i\alpha_jl^{(i)}l^{(j)}K_\theta(\mathbf{x}^{(i)},\mathbf{x}^{(j)})\\\textbf{s.t.}&0\leq\alpha_i\leq C&\forall i=\{1,\ldots,n\}\\&\displaystyle\sum_{i=1}^n\alpha_il^{(i)}=0\end{array}$$
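This dual is a standard quadratic program. As an illustrative sketch only (this is not the toolbox's own solver, and it assumes the Optimization Toolbox's quadprog is available), it can be solved for a Gaussian kernel, with the bias recovered from the KKT conditions, as follows:

    % Illustrative sketch: kernelized L1 dual solved with quadprog (assumes Optimization Toolbox)
    rng(0)
    n = 40;
    x = rand(n,2);                                    % training samples
    l = sign(sum(x,2)-1); l(l==0) = 1;                % labels in {-1,+1}
    C = 10; theta = 0.5;                              % cost and kernel parameter
    D2 = sum(x.^2,2)+sum(x.^2,2)'-2*(x*x');           % pairwise squared distances
    K  = exp(-D2/(2*theta^2));                        % Gaussian kernel matrix
    H  = (l*l').*K;                                   % H_ij = l_i l_j K(x_i,x_j)
    alpha = quadprog(H,-ones(n,1),[],[],l',0,zeros(n,1),C*ones(n,1));
    sv    = alpha > 1e-6;                             % support vectors
    free  = sv & alpha < C-1e-6;                      % on-margin support vectors
    b     = mean(l(free)-K(free,sv)*(alpha(sv).*l(sv)));  % bias from the KKT conditions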

For an in-depth discussion of Support Vector Machine and Machine Learning, the interested reader is referred to Vapnik (2000).

Slack variables, bias, Lagrange multipliers, and indices of the support vectors can be obtained using: svm.xis, svm.bias, svm.alphas, and svm.SVs.
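For example (a hedged sketch: the constructor call CODES.fit.svm(x,y) is an assumption based on the CODES / fit / svm breadcrumb; only the attribute names come from this page):

    % Hedged sketch: constructor signature assumed; attribute names as listed above
    x = rand(30,2);                 % training samples
    y = sum(x,2) - 1;               % function values, classified on their sign
    svm = CODES.fit.svm(x,y);       % train an SVM with default options (assumed call)
    xi     = svm.xis;               % slack variables
    b      = svm.bias;              % bias
    alpha  = svm.alphas;            % Lagrange multipliers
    sv_idx = svm.SVs;               % indices of the support vectors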

Kernels

Two kernels are available, a linear kernel ('lin'):

$$K(x,y)=x^Ty$$

and a Gaussian kernel ('gauss'):

$$K_\theta(x,y)=\exp\left[-\frac{||x-y||^2}{2\theta^2}\right]$$
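For reference, a short toolbox-independent MATLAB sketch evaluating both kernels on a sample set:

    % Kernel matrices for a sample set x (n x d) and Gaussian parameter theta
    x = rand(20,3);
    theta = 0.8;
    K_lin = x*x';                               % linear kernel: K(x,y) = x'*y
    D2 = sum(x.^2,2)+sum(x.^2,2)'-2*(x*x');     % pairwise squared distances
    K_gauss = exp(-D2/(2*theta^2));             % Gaussian kernel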

Solvers

Three solvers are implemented: 'primal', 'dual', and 'libsvm' (see the 'solver' training option). The primal solver directly solves the primal problem:

$$\begin{array}{rll}\mathop{\min}\limits_{\beta,b,\xi}&\frac{1}{2}||\beta||^2+\frac{C}{L}\displaystyle\sum_{i=1}^n\xi_i^L\\\textbf{s.t.}&1-l^{(i)}(\beta^T\mathbf{x}^{(i)}+b)-\xi_i\leq0&\forall i=\{1,\ldots,n\}\\&\xi_i\geq0&\forall i=\{1,\ldots,n\}\end{array}$$

Ideas to extend the primal solver can be found in Chapelle (2007).

The 'dual' and 'libsvm' solvers work on the dual problem, whose form depends on the loss:

$$\begin{array}{rll}\textrm{If }L=1\ :\ \mathop{\max}\limits_{\alpha}&\displaystyle\sum_{i=1}^n\alpha_i-\frac{1}{2}\displaystyle\sum_{i=1}^n\displaystyle\sum_{j=1}^n\alpha_i\alpha_jl^{(i)}l^{(j)}K_\theta(\mathbf{x}^{(i)},\mathbf{x}^{(j)})\\\textbf{s.t.}&0\leq\alpha_i\leq C&\forall i=\{1,\ldots,n\}\\&\displaystyle\sum_{i=1}^n\alpha_il^{(i)}=0\end{array}$$

$$\begin{array}{rll}\textrm{If }L=2\ :\ \mathop{\max}\limits_{\alpha}&\displaystyle\sum_{i=1}^n\alpha_i-\frac{1}{2}\displaystyle\sum_{i=1}^n\displaystyle\sum_{j=1}^n\alpha_i\alpha_jl^{(i)}l^{(j)}\left(K_\theta(\mathbf{x}^{(i)},\mathbf{x}^{(j)})+\frac{\delta_{ij}}{C}\right)\\\textbf{s.t.}&0\leq\alpha_i&\forall i=\{1,\ldots,n\}\\&\displaystyle\sum_{i=1}^n\alpha_il^{(i)}=0\end{array}$$
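In other words, the L2 dual is the L1 dual with the kernel matrix shifted by $\delta_{ij}/C$ and the upper bound on $\alpha$ removed. Continuing the illustrative quadprog sketch from the Description section (reusing n, l, K, and C defined there):

    % L2-SVM dual: same QP with K replaced by K + I/C and no upper bound on alpha
    K2 = K + eye(n)/C;                          % K_theta(x_i,x_j) + delta_ij/C
    alpha2 = quadprog((l*l').*K2,-ones(n,1),[],[],l',0,zeros(n,1),[]);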

Weighted formulation

Osuna et al. (1997) and Vapnik (2000) introduced different cost coefficients (i.e., weights) for the different classes in the SVM formulation. The corresponding linear formulation is:

$$\begin{array}{rll}\mathop{\min}\limits_{\beta,b,\xi}&\frac{1}{2}||\beta||^2+C^+\displaystyle\sum_{i:l^{(i)}=+1}\xi_i+C^-\displaystyle\sum_{i:l^{(i)}=-1}\xi_i\\\textbf{s.t.}&1-l^{(i)}(\beta^T\mathbf{x}^{(i)}+b)-\xi_i\leq0&\forall i=\{1,\ldots,n\}\\&\xi_i\geq0&\forall i=\{1,\ldots,n\}\end{array}$$

where $C^+$ and $C^-$ are the cost coefficients for the +1 and -1 classes, respectively, and $N^+$ and $N^-$ are the numbers of samples in the +1 and -1 classes. The coefficients are typically chosen as (Chang and Lin, 2011):

$$\begin{array}{r}C^+ = C \times w^+\\C^- = C \times w^-\end{array}$$

where $C$ is the common cost coefficient for both classes, and $w^+$ and $w^-$ are the weights for the +1 and -1 classes, respectively. The weights are typically chosen as:

$$w^+ = 1$$

$$w^- = \frac{N^+}{N^-}$$
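For instance (illustrative MATLAB with made-up labels):

    % Default class weights from sample imbalance
    l = [ones(8,1); -ones(32,1)];    % made-up labels: 8 samples in class +1, 32 in class -1
    C = 10;                          % common cost coefficient
    w_plus  = 1;                     % w+ = 1
    w_minus = sum(l==1)/sum(l==-1);  % w- = N+/N-
    C_plus  = C*w_plus;              % C+ = C*w+
    C_minus = C*w_minus;             % C- = C*w-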

Training Options

The following options can be passed as param/value pairs (defaults in braces {}):

  • 'scale' ({'square'}, 'circle', 'none'): scaling method for the inputs (cf. Scaling for details).
  • 'UseParallel' (logical, {false}): switch to use parallel settings.
  • 'theta' (numeric, {[ ]}): value of the kernel parameter; if [ ], it is calibrated (see 'param_select').
  • 'kernel' ({'gauss'}, 'lin'): kernel type, see Kernels.
  • 'solver' ({'libsvm'}, 'dual', 'primal'): solver used for the SVM optimization problem, see Solvers.
  • 'loss' ({1}, 2): loss function.
  • 'C' (positive numeric, {[ ]}): cost parameter; if [ ], it is calibrated.
  • 'weight' (logical, {false}): if true, train a weighted SVM, see Weighted formulation.
  • 'w_plus' (numeric, {1}): weight for +1 samples; if left at its default while 'weight' is true, the weight is computed from the sample imbalance.
  • 'w_minus' (numeric, {1}): weight for -1 samples; if left at its default while 'weight' is true, the weight is computed from the sample imbalance.
  • 'param_select' ({'fast'}, 'loo', 'cv', 'chapelle', 'Nsv', 'loo_bal', 'cv_bal', 'chapelle_bal', 'Nsv_bal', 'auc', 'cv_auc', 'stiffest'): kernel parameter calibration strategy, see svm (parameters selection). Kernel parameters are optimized such that they either minimize (maximize in the case of the AUC) the selected metric or satisfy a heuristic ('fast' or 'stiffest').
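For example, a hedged training call (the CODES.fit.svm(x,y,...) constructor and its param/value interface are assumptions based on the breadcrumb; the option names and values are those of the list above):

    % Hedged sketch: constructor assumed; options taken from the list above
    x = rand(50,2);                      % training samples
    y = 0.8 - sum(x.^2,2);               % function values, classified on their sign
    svm = CODES.fit.svm(x,y,...
        'kernel','gauss',...             % Gaussian kernel
        'theta',[],...                   % empty kernel parameter: calibrated
        'C',100,...                      % cost parameter
        'loss',1,...                     % L1 loss
        'weight',true,...                % weighted SVM for unbalanced samples
        'param_select','cv');            % calibration by cross-validation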

Evaluation and Post-Processing

Capabilities of an svm object.

Mini tutorials

  • A mini tutorial of the capabilities of the svm class.
  • A presentation of parameters selection techniques for svm.
  • An illustration of the svm path.

References

  • Chang C.-C., Lin C.-J. (2011) LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), Article 27.
  • Chapelle O. (2007) Training a support vector machine in the primal. Neural Computation, 19(5), 1155-1178.
  • Osuna E., Freund R., Girosi F. (1997) Support vector machines: Training and applications. A.I. Memo No. 1602, Massachusetts Institute of Technology.
  • Vapnik V.N. (2000) The Nature of Statistical Learning Theory, 2nd edition. Springer.

Copyright © 2015 Computational Optimal Design of Engineering Systems (CODES) Laboratory. University of Arizona.
