Bayesian Support Vector Machine Hyperparameter Tuning - User's Guide

  1. Getting Started
    1. CLAPACK Libraries
    2. Compiling
    3. Running
    4. Data File Structure
    5. Output Files
    6. Plotting the Results
  2. Program Options (Parameters)
    1. Run Parameters
    2. SVM, HMC (Hybrid Monte Carlo Simulation) Parameters
    3. Initial Hyperparameter Values
    4. Gradient Ascent Parameters
      1. Convergence
      2. Initial Gradient Ascent Rates
      3. Maximum Step Sizes
      4. Min/Max Hyperparameter Values
      5. Rate Checking
  3. Code Overview
  4. Bibliography
  5. Authors' Contact Info

Getting Started

This document describes the installation and use of the Bayesian Support Vector Machine Hyperparameter Tuning (BSVMHT) program.

CLAPACK Libraries

The BSVMHT program requires the CLAPACK (C Linear Algebra Package) library version 3.0, which is free software. The BSVMHT distribution includes pre-compiled versions for x86 Linux in the 'lib' directory. If you need to compile the library yourself, source code is available from:

http://www.netlib.org/clapack (or use the following direct link to the .tgz download: clapack.tgz)

The required libraries are blas (the linear algebra engine), lapack, and F77.

The included Makefile assumes that the libraries are located in the 'lib' directory. If you wish to move the libraries to another location (e.g. /usr/lib) you must include the appropriate paths in the Makefile.
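For example, if you moved the libraries to /usr/local/lib, the linker settings in the Makefile would need to change along these lines (the variable names below are illustrative only - check the distributed Makefile for the names it actually uses):

LIBDIR = /usr/local/lib
LDFLAGS = -L$(LIBDIR) -llapack -lblas -lF77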

Compiling

After the CLAPACK libraries are installed (or if you are using the default installation), you can compile the program from the 'src' directory with

make OPT=-O

You can use whatever optimization option your compiler supports instead of '-O'.

Running

To run the program you must provide at least one argument, the data file. For example, from the bsvmht directory,

bin/bsvmht data/splice_01

This will run the program with the default parameters.

Data File Structure

The data file structure required by the program is:

Output Files

The program creates two output files each time it is run. The names of the output files will be

datafile-name.rundate_runtime.txt
datafile-name.rundate_runtime.dat

The '.txt' file contains essentially a running commentary of the trial in (more or less) human-readable form. Some of this verges on debugging information. This file is useful for seeing how far a run has progressed at any given point. (For example, you can 'tail -f outputfile.txt' and watch the run progress.) The '.dat' file gives the values of the hyperparameters, the hyperparameter gradients, the current error on the test set, and the current gradient ascent rates in tab-delimited format suitable for plotting. The layout of the data file is given in the commented header.

Plotting the Results

The BSVMHT distribution package includes a sample Matlab script to plot the results of a trial. To use the plot function, start Matlab in the 'matlab' directory (or add that directory to your Matlab path) and type:

plot_grad_tune(dataDir, runName, numDims, save_type)

where

  1. dataDir = the directory holding the '.dat' file produced by the run, without a trailing '/'
  2. runName = the base name for the run output, not including the '.dat' ending
  3. numDims = the number of dimensions in the data set
  4. save_type = (optional) the format in which to save the plots, e.g. 'jpeg', 'epsc'. If you include this parameter then the plots will be closed by the script.
For example, to create the plots and save as jpeg:

plot_grad_tune('../data', 'splice_01.200705_134915', 60, 'jpeg')

Or to create the plots but not save them (i.e. leave them open):

plot_grad_tune('../data', 'splice_01.200705_134915', 60, '')


Program Options (Parameters)

The BSVMHT program includes a number of parameters controlling different aspects of a run. To override the default parameters, include the name of a parameter file on the command line after the data file. That is:

bin/bsvmht data/splice_01 params/myparams

The structure of the parameter file is fixed: the lines appear in a fixed order, and the parameters on each line appear in a fixed order. You may specify only some of the parameters - any parameters left unspecified will use the default values. Because the parameters are in a fixed order, if you wish to specify a parameter late in the ordering you must also specify all parameters that precede it. (This is not particularly inconvenient, because the most frequently used parameters are at the beginning of the order and the least frequently used parameters are at the end.)

Comment lines in the parameter files should begin with '#'. This is useful when defining your own parameter files.
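For example, a minimal parameter file that overrides only the run parameters (line 1) and leaves everything else at the defaults might look like this (values illustrative):

# run parameters: normalize, at most 40 steps, no random split
1 40 0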

The distribution package contains examples of parameter files that you can modify:

  1. params/run : Specifies only parameters pertaining to the run, e.g. run length, whether or not to use a validation set for early stopping, whether or not to randomly reorganize the data
  2. params/nys : Specifies the run parameters, and also parameters for the Hybrid Monte Carlo Simulation - e.g. whether to use the full HMC or the Nystrom approximation, the number of samples and eigenvalues for the Nystrom approximation, etc.
  3. params/start : Specifies the run parameters, the HMC parameters, and also the starting values for the SVM hyperparameters
  4. params/full : Specifies the run parameters, the HMC parameters, the SVM hyperparameters, and also a number of parameters controlling the gradient ascent algorithm
Complete descriptions of each parameter are in the sections below.

Run Parameters

Line 1 of the parameter file contains the run parameters. These are, in order:
  1. Whether or not to normalize the input data to zero mean, unit variance: 1=normalize, 0=do not normalize. (Normally data should be normalized - only set this option to zero if your data is already normalized.) (Default = Normalize)
  2. Maximum number of gradient ascent steps : an integer value. Usually between 20 and 50 is enough, but this can vary for different data sets. (Default = 50)
  3. Random data split - 1=random split, 0=no random split. Using this option results in a randomly chosen training set. (Default = No)
  4. Number of training points - This parameter can be set to either an integer value or a fraction less than 1. If the value is an integer it specifies the number of training points exactly. If it is a fraction it indicates what fraction of the total number of data points should be used for training. (Default = 1/3 of the data points)
  5. Number of points to use as a validation set - If set, this parameter will withhold the specified number of points from the training data for use as a validation set. That is, the SVM will not be trained on these points, and the error on the validation points will be reported at each step. This can be used as an early stopping criterion. (Default = 0)
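For example, a line 1 reading

1 30 1 0.5 100

would (reading left to right) normalize the data, allow at most 30 gradient ascent steps, randomly split the data, use half of the data points for training, and withhold 100 of the training points as a validation set. (Illustrative values only.)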

SVM, HMC (Hybrid Monte Carlo Simulation) Parameters

Line 2 of the parameter file contains the parameters for the SVM optimization and the HMC. These are, in order:
  1. SVM loss type - 1=hinge loss, 2=quadratic loss. These are the only options. (Default = hinge)
  2. HMC type - 0=full HMC, 1=Nystrom approximation HMC. (Default = Nystrom)
  3. # of samples for the Nystrom approximation - (Default = 200)
  4. # of eigenvalues for the Nystrom approximation - (Default = 20)
  5. Epsilon for the HMC - (Default = 0.01)
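For example, a line 2 reading

1 1 200 20 0.01

selects hinge loss and the Nystrom approximation HMC with 200 samples, 20 eigenvalues, and epsilon 0.01 (these happen to be the default values).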

Initial Hyperparameter Values

Note: the kernel hyperparameter search takes place on a logarithmic scale. For this reason you must specify the natural logarithm of the K0, Koff and Length Scale kernel parameters in the parameter file.

Line 3 of the parameter file gives initial values for the noise parameter (C), the kernel amplitude (K0), and the kernel offset (Koff). You can specify any values for these parameters. The default values are:
C=1
K0=1 (=e^0)
Koff=0.05 (~e^-3)

Line 4 of the parameter file gives initial values for the length scale parameters. You can specify any number of length scales - if you specify fewer than the dimension of the data set, the last value specified will be reused (i.e. if you specify a single value it will be used for all of the length scales.)

The default is for the length scales to scale with the square root of the number of input features, i.e. the dimensionality of the input vectors: l_i = sqrt(dim)
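For example, to start a 60-dimensional problem at C=1, K0=2, Koff=0.05, and a common length scale of sqrt(60), lines 3 and 4 would contain the natural logarithms of the kernel parameters (values rounded):

1.0 0.693 -3.0
2.05

since ln(2) ~ 0.693, ln(0.05) ~ -3.0, and ln(sqrt(60)) ~ 2.05.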

The hyperparameter initialization uses two "codes" to initialize any parameter to a random value, or to initialize a length scale to be scaled with the dimension of the inputs:

Gradient Ascent Parameters

The remaining parameters control the adaptive gradient ascent, which is described in reference 3 in the bibliography.

Convergence

Line 5 of the parameter file contains parameters controlling the condition for convergence of the gradient ascent. There are two possible conditions for convergence:

  1. The average of the (normalized) gradients has decreased to be smaller than a fixed fraction of its peak value during the run (parameter named Grad_Convergence_Ratio)
  2. All of the hyperparameters have changed by less than a given fraction (Param_Convergence_Ratio) in a specified number of most recent steps (Param_Convergence_Steps)
The format for the line with the convergence parameters is:

Grad_Convergence_Ratio Param_Convergence_Ratio Param_Convergence_Steps

The default values are:
Grad_Convergence_Ratio = 0.15
Param_Convergence_Ratio = 0.01
Param_Convergence_Steps = 5
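For example, a line 5 reading

0.15 0.01 5

reproduces the defaults: converge when the average normalized gradient falls below 15% of its peak value, or when no hyperparameter has changed by more than 1% over the last 5 steps.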

Note that a run will also terminate when the maximum number of steps is reached, but this is not considered convergence.

Initial Gradient Ascent Rates

Line 6 of the parameter file gives the initial gradient ascent rates. The program includes separate rates for the noise parameter ("C"), the kernel offset ("Koff"), the kernel amplitude ("K0"), and the length scales (a single rate is used for all of the length scale gradients). These rates are adapted separately during the gradient ascent, as described in reference 3 in the bibliography.

The format for the parameter line is:

Koff_rate K0_rate LengthScale_rate C_rate

The default values are:
Koff_rate = 1.0
K0_rate = 10.0
LengthScale_rate = log(sqrt(dim))*10.0
C_rate = 1.0
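For example, for a 60-dimensional data set the default rates would correspond to a line 6 of approximately

1.0 10.0 20.5 1.0

since log(sqrt(60))*10.0 ~ 20.5.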

Maximum Step Sizes

Line 7 of the parameter file gives maximum step sizes for each hyperparameter. If the current gradient and ascent rate give a step larger than the maximum value, the step is truncated to the maximum. Including reasonable maximum step sizes provides a useful additional constraint on the adaptive gradient ascent. The format for the parameter line is:

Koff_maxstep K0_maxstep LengthScale_maxstep C_maxstep

The default values are:
Koff_maxstep = 0.1
K0_maxstep = 2.0
LengthScale_maxstep = sqrt(dim)
C_maxstep = 1.0
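For example, for a 60-dimensional data set the default maximum step sizes would correspond to a line 7 of approximately

0.1 2.0 7.75 1.0

since sqrt(60) ~ 7.75.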

Min/Max Hyperparameter Values

Lines 8 and 9 of the parameter file contain maximum and minimum allowed values for each hyperparameter, respectively. If the current step attempts to make a hyperparameter greater/less than the maximum/minimum, the value will be set to the maximum/minimum. Including reasonable maximum and minimum values provides a useful additional constraint on the adaptive gradient ascent. The format for these parameter lines is:

Koff_max/min K0_max/min LengthScale_max/min C_max/min

NOTE: The Maximum/Minimum hyperparameter values in the parameter file must be given IN LOGARITHMIC SCALE for the kernel hyperparameters (Koff, K0, Length Scales)

The default maximum values are:
Koff_max = e^2.303 ~ 10
K0_max = e^3.92 ~ 50
LengthScale_max = e^3.92 ~ 50
C_max = 20.0

The default minimum values are:
Koff_min = e^-6.91 ~ 0.001
K0_min = e^-6.91 ~ 0.001
LengthScale_min = e^-6.91 ~ 0.001
C_min = 0.1
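For example, lines 8 and 9 reproducing the default maxima and minima would read

2.303 3.92 3.92 20.0
-6.91 -6.91 -6.91 0.1

with the kernel hyperparameter limits given in log scale and the C limits given directly.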

Rate Checking

Line 10 of the parameter file contains parameters that control the adaptive changes in the ascent rates. There are two parts to checking the ascent rate: checking the change in the amplitude of the gradient (which applies to all hyperparameters) and checking the change in the angle of the gradient (which applies to the length scale gradients).

The format for the rate checking parameter line is:

Max_Grad_Angle_Cos Min_Grad_Angle_Cos Min_Grad_Change Max_Grad_Change Rate_Divide Rate_Multiply

In checking the change in the amplitude of the gradient, the criterion is that the gradient should change gradually (not too fast, not too slow). If the gradient changes only very slightly, the ascent rate should be increased (by multiplying by the Rate_Multiply parameter); this threshold is given by the Min_Grad_Change parameter. If the change in the gradient is too large, and particularly if the direction of the gradient changes, then the ascent rate is too large and should be decreased (by dividing by the Rate_Divide parameter). The default values for checking the change in the amplitude of the gradient are:

Min_Grad_Change = 0.1 (i.e. changes should be greater than 10%)
Max_Grad_Change = -1.0 (i.e. changes where the sign of the gradient stays the same are always allowed, but changes where the sign flips must involve an amplitude change of less than 100%)

For the length scale ascent rate an additional check is made on the direction of successive length scale gradients. If the change in direction is too large, the ascent rate is lowered (by dividing by the Rate_Divide parameter). If there is virtually no change in direction, the rate is increased (by multiplying by the Rate_Multiply parameter). The thresholds for the minimum/maximum angle changes are specified as cosines of the angles, because for high-dimensional length scale gradient vectors the cosine is what can be easily calculated: the dot product of two successive length scale gradient vectors divided by the product of their amplitudes (see the sketch at the end of this section). The default values for checking the change in the direction of the length scale gradient vectors are:

Max_Grad_Angle_Cos = -0.1 (i.e. cos for the maximum angle is -0.1)
Min_Grad_Angle_Cos = 0.995 (i.e. cos for the minimum angle is 0.995)

The default values for ascent rate division and multiplication are:

Rate_Divide = 2.0
Rate_Multiply = 1.1

(i.e. by default ascent rates are cut faster than they are increased, a standard procedure in adaptive gradient ascent.)
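For example, a line 10 reproducing the defaults would read

-0.1 0.995 0.1 -1.0 2.0 1.1

To make the direction check concrete, here is a minimal sketch in C of the cosine test described above. The function and variable names are hypothetical, not taken from the BSVMHT source; the actual implementation may differ:

#include <math.h>

/* Sketch: adapt the length scale ascent rate from the angle between
 * two successive length scale gradient vectors. */
double adapt_lengthscale_rate(double rate,
                              const double *prev_grad,
                              const double *cur_grad,
                              int dim,
                              double max_angle_cos,   /* default -0.1  */
                              double min_angle_cos,   /* default 0.995 */
                              double rate_divide,     /* default 2.0   */
                              double rate_multiply)   /* default 1.1   */
{
    double dot = 0.0, norm_prev = 0.0, norm_cur = 0.0, cos_angle;
    int i;

    for (i = 0; i < dim; i++) {
        dot       += prev_grad[i] * cur_grad[i];
        norm_prev += prev_grad[i] * prev_grad[i];
        norm_cur  += cur_grad[i]  * cur_grad[i];
    }

    /* Cosine of the angle between the two gradient vectors:
     * dot product divided by the product of the amplitudes. */
    cos_angle = dot / (sqrt(norm_prev) * sqrt(norm_cur));

    if (cos_angle < max_angle_cos)
        return rate / rate_divide;    /* direction changed too much: slow down */
    if (cos_angle > min_angle_cos)
        return rate * rate_multiply;  /* direction nearly unchanged: speed up */
    return rate;
}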


Code Overview

Unfortunately, not all of the code is commented. However, the most mathematically complex areas have the most detailed comments, and in other areas of the code functions and variables have verbose, descriptive names that make the code reasonably readable. The code is divided into the following files:

Bibliography

  1. P. Sollich (2000), "Probabilistic Methods for Support Vector Machines", NIPS 12, pp. 349-355.
  2. P. Sollich (2002), "Bayesian Methods for Support Vector Machines: Evidence and Predictive Class Probabilities", Machine Learning 46, pp. 21-52.
  3. C. Gold and P. Sollich (2003), "Model Selection for Support Vector Machine Classification", Neurocomputing 55, pp. 221-249.
  4. C. Gold and P. Sollich (2005), "Fast Bayesian Support Vector Machine Parameter Tuning with the Nystrom Method", Proceedings of the International Joint Conference on Neural Networks.
  5. C. Gold and P. Sollich (2005), "Bayesian Approach to Parameter Tuning and Feature Selection for Support Vector Machine Classifiers", Neural Networks, in press.

All papers are downloadable from the SVM section of the full publications list.


Authors' Contact Info

If you have questions, comments, or bug reports, please contact us:

Peter Sollich (peter.sollich<at>kcl.ac.uk)
Carl Gold (carl<at>klab.caltech.edu)

We want to hear about your results using the program! Good luck!