Mathematical formulation

AOCL API Guide (68552)

Document ID
68552
Release Date
2025-12-29
Version
5.2 English

SVM can have different formulations depending on the task at hand. The most common ones are:

  • Support Vector Classification (SVC): Uses a parameter \(C\) to balance margin maximization and classification errors, seeking a decision boundary that maximizes the margin while penalizing misclassifications.

  • Nu-Support Vector Classification (NuSVC): Employs a parameter \(\nu\) to constrain the fraction of training errors and the fraction of support vectors, providing an alternative classification formulation to SVC.

  • Support Vector Regression (SVR): Fits a function that allows for a margin of tolerance \(\epsilon\) around the predicted values, penalizing data points lying outside this tolerance to achieve a trade-off between model complexity and fitting errors.

  • Nu-Support Vector Regression (NuSVR): Similar to SVR but utilizes a parameter \(\nu\) to control the allowable support vectors and fitting errors, offering a different mechanism to balance model complexity and error tolerance.

In the simplest case (binary classification), given training samples \(\{ (x_i, y_i) \}_{i=1}^{n}\) with \(y_i \in \{-1, +1\}\), SVC in its primal form poses the following problem:

\[\begin{split}\min_{\beta,\,b,\,\xi_i} \frac{1}{2} \|\beta\|^2 &+ C \sum_{i=1}^{n} \xi_i \\ \text{subject to} \quad y_i (\beta^\top \phi(x_i) + b ) &\ge 1 - \xi_i,\quad \xi_i \ge 0,\end{split}\]

where \(\beta\) represents the vector of coefficients, also known as the weights, associated with each input feature, \(b\) is the bias term, \(\phi\) is a (possibly nonlinear) mapping defined by a kernel, \(C > 0\) is a regularization parameter controlling the trade-off between margin maximization and misclassification, and \(\xi_i\) (known as a slack variable) measures how far the \(i\)-th sample falls on the wrong side of its margin boundary.

Minimizing the norm of the coefficients leads to maximizing the margin between classes. Meanwhile, slack variables introduce a penalty for margin violations: if a data point lies on the correct side of the margin, \(\xi_i\) is 0; if it falls inside the margin or is misclassified, \(\xi_i\) is greater than 0. The optimal hyperplane is determined by a subset of the training samples known as support vectors, which lie on or within the margin.
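As a minimal illustration of the primal objective, the sketch below evaluates the slack variables \(\xi_i = \max(0,\, 1 - y_i(\beta^\top x_i + b))\) and the resulting objective value for a hand-picked linear model on toy data; the data, \(\beta\), and \(b\) here are purely illustrative assumptions, not solver output.

```python
import numpy as np

# Hypothetical toy data: 2-D points with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 1.0], [-1.0, -1.5], [-2.0, -1.0], [0.2, 0.1]])
y = np.array([1, 1, -1, -1, 1])

# Assumed linear model parameters; in practice these come from the solver.
beta = np.array([1.0, 1.0])
b = 0.0
C = 1.0

# Slack xi_i = max(0, 1 - y_i (beta^T x_i + b)): zero for points beyond
# the correct side of the margin, positive for margin violations.
margins = y * (X @ beta + b)
xi = np.maximum(0.0, 1.0 - margins)

# Primal objective: (1/2) ||beta||^2 + C * sum(xi).
objective = 0.5 * beta @ beta + C * xi.sum()
```

Only the last point (with margin 0.3 < 1) incurs a positive slack, so the penalty term is driven entirely by that margin violation.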

Since the feature space can be high-dimensional or the data might not be linearly separable, it is more convenient to solve the dual version of this problem. This formulation focuses on the dual coefficients \(\alpha_i\) rather than explicit feature transformations. The corresponding dual problem is:

\[\begin{split}\min_{\alpha} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \\ \text{subject to} \quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{n} \alpha_i y_i = 0,\end{split}\]

where \(K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)\) is the kernel function. This eliminates the need to perform an explicit mapping \(\phi(x)\), allowing SVMs to handle both linear and nonlinear relationships in the data. The decision function is then given by
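The kernel trick can be made concrete with a small check: for the homogeneous degree-2 polynomial kernel \(K(x, z) = (x^\top z)^2\) in two dimensions, the explicit map \(\phi(v) = (v_1^2,\, \sqrt{2}\, v_1 v_2,\, v_2^2)\) satisfies \(K(x, z) = \phi(x)^\top \phi(z)\), so the kernel value is obtained without ever forming \(\phi\). This is a generic illustration, not tied to any particular AOCL routine.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the homogeneous degree-2 polynomial
    # kernel in 2-D: phi(v) = (v1^2, sqrt(2) v1 v2, v2^2).
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Kernel evaluated directly on the inputs...
k_direct = (x @ z) ** 2
# ...matches the inner product in the explicitly mapped feature space.
k_mapped = phi(x) @ phi(z)
```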

\[f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b,\]

and predictions are made based on the sign of \(f(x)\).
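The decision function above can be sketched directly. The support vectors, labels, dual coefficients, bias, and the RBF kernel choice below are all illustrative assumptions (these quantities would normally be produced by the trained model):

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2), a common kernel choice.
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical trained quantities: support vectors, their labels,
# dual coefficients alpha_i, and bias b.
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
sv_labels = np.array([1, -1])
alpha = np.array([0.8, 0.8])
bias = 0.0

def decision_function(x):
    # f(x) = sum_i alpha_i y_i K(x_i, x) + b
    return sum(a * yi * rbf_kernel(sv, x)
               for a, yi, sv in zip(alpha, sv_labels, support_vectors)) + bias

# A query point close to the positive support vector gets a positive score.
x_new = np.array([0.9, 1.1])
score = decision_function(x_new)
label = 1 if score >= 0 else -1
```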

Multi-class classification with SVMs is approached using a one-vs-one strategy: every pair of classes defines a binary classification submodel, giving \(\frac{n_{\mathrm{class}} \times (n_{\mathrm{class}}-1)}{2}\) submodels in total. Each submodel learns to distinguish between its two classes, and the final label is determined by aggregating the binary decisions through a majority vote.
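The one-vs-one aggregation described above can be sketched as follows; the per-pair outcomes are made-up placeholders standing in for the individual submodels' predictions on one sample:

```python
from itertools import combinations

# Three classes (0, 1, 2) yield n_class * (n_class - 1) / 2 pairwise submodels.
n_class = 3
pairs = list(combinations(range(n_class), 2))  # (0,1), (0,2), (1,2)

# Assumed winner of each pairwise submodel for a single sample.
pair_winners = {(0, 1): 1, (0, 2): 2, (1, 2): 1}

# Aggregate by majority vote over the pairwise decisions.
votes = [0] * n_class
for pair in pairs:
    votes[pair_winners[pair]] += 1
predicted = votes.index(max(votes))
```

Here class 1 wins two of the three pairwise contests, so the vote selects it as the final label.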

Note

In da_svm_decision_function_? you can access decision function values in both OvO (one-vs-one) and OvR (one-vs-rest) shapes. In the one-vs-rest (OvR) approach, \(n_{\mathrm{class}}\) subproblems are defined, each separating a single class from all remaining classes. Note that in our case, OvR decision values are derived from OvO, so this setting does not affect the underlying training process.

For regression, a similar formulation is used, minimizing errors within a margin of tolerance \(\epsilon\) around the regression function.
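The \(\epsilon\)-insensitive loss that underlies this tolerance can be sketched in a few lines: residuals inside the \(\epsilon\)-tube around the prediction cost nothing, while residuals outside it are penalized linearly. The targets and predictions below are illustrative values only.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # Loss is zero inside the epsilon tube, |y - f(x)| - eps outside it.
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

y_true = np.array([1.00, 1.05, 2.00])
y_pred = np.array([1.02, 1.30, 1.85])
losses = eps_insensitive_loss(y_true, y_pred)
```

The first residual (0.02) lies inside the tube and contributes no loss; the other two exceed \(\epsilon = 0.1\) and are penalized by the amount of the excess.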