Machine Learning Day

Lab 2.A: Regularized Least Squares (RLS)

This lab is about linear Regularized Least Squares for classification or regression.

Getting Started

  • Get the code files and add their directory to the MATLAB path (or set it as the current working directory).
  • Use the editor to write, save, run, and debug longer scripts and functions.
  • Use the command window to try out commands, inspect variables, and look up function usage.
  • Use plot (for 1D data), imshow or imagesc (for 2D matrices), and scatter or scatter3 to visualize variables of different types.
  • Work your way through the examples below by following the instructions.

1. Data Generation

Start by generating data from a mixture of Gaussians using the provided function MixGauss.

  1. Generate a 2-dimensional, 2-class training set (Xtr, Ytr), with classes centered at (-0.5,-0.5) and (0.5,0.5), both with variance 0.5 (5 points per class). Adjust the output labels Ytr to be {1,-1}, e.g. using Ytr(Ytr==2)=-1.
  2. Generate a corresponding test set (Xte, Yte) of 200 points per class from the same distribution.
  3. Add noise to the generated data by randomly flipping a percentage of the labels (e.g. 10%), using the provided function flipLabels. You will obtain new training (Ytrn) and test (Yten) label vectors. A sketch of the full pipeline follows this list. Plot the various datasets using scatter, e.g.:
    figure; hold on;
    scatter(Xtr(Ytr==1,1), Xtr(Ytr==1,2), '.r');
    scatter(Xtr(Ytr==-1,1), Xtr(Ytr==-1,2), '.b');
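
A minimal sketch of the full data-generation pipeline. The MixGauss and flipLabels signatures shown in the comments are assumptions; check the provided files, which may differ:

    % Assumed: [X, Y] = MixGauss(means, sigmas, n), with means d-by-p (one column
    % per class), sigmas 1-by-p (check whether these are variances or standard
    % deviations), n points per class; labels come out as 1, 2, ...
    [Xtr, Ytr] = MixGauss([-0.5 0.5; -0.5 0.5], [0.5 0.5], 5);    % training set
    [Xte, Yte] = MixGauss([-0.5 0.5; -0.5 0.5], [0.5 0.5], 200);  % test set
    Ytr(Ytr == 2) = -1;   % relabel class 2 as -1
    Yte(Yte == 2) = -1;
    % Assumed: Yn = flipLabels(Y, p) flips a fraction p of the labels
    % (check whether the argument is a fraction or a percentage)
    Ytrn = flipLabels(Ytr, 0.1);
    Yten = flipLabels(Yte, 0.1);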

2. Linear RLS

  1. Complete the code in the functions regularizedLSTrain and regularizedLSTest for training and testing a regularized least squares classifier (a minimal sketch follows this list).
  2. Try the functions on the 2-dimensional, 2-class data generated in 1.1-1.3. Pick a "reasonable" lambda and check the effects of regularization and of noise. Plot the data in a way that visualizes the obtained results (e.g. a scatter plot with the misclassified points labeled differently, similar to Lab 1) and evaluate the classification performance by comparing the estimated outputs to the true ones.
    Note: To visualize the separating function (and thus which areas of the 2D plane are associated with each class) you can use the function separatingFRLS. Superimpose the training data (Xtr, Ytr) and the test data (Xte, Yte) on separate plots to analyze the generalization properties of the solution.
  3. Perform parameter selection using hold-out cross-validation to select lambda in the range {exp(-10), ..., exp(0)}, using the provided holdoutCVRLS. Plot the training and validation errors for the different values of lambda; apply the best model to the test set (Xte) and check the classification error (see the sketch after this list); show the separating function and the generalization properties of the solution.
  4. Repeat the full procedure (data generation, parameter selection, test) multiple times and compare the test error of RLS with that of ordinary least squares (OLS), i.e. with lambda = 0. Does regularization improve classification performance?
  5. (Optional) Repeat the classification experiment of 2.2 for a high-dimensional dataset. Generate the same classes as in Section 1, with the Gaussians now residing in a d-dimensional space. How would you choose the class mean vectors? As indicative values, try d = 10, p ~ 0.1, N ~ 10. Check what happens when varying lambda, the input space dimension d (i.e., the effect of the "distance" between points), and the amount of noise. Perform parameter selection using hold-out cross-validation for lambda in a reasonable range (using holdoutCVRLS) and find the generalization error of the best model.
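
A minimal sketch of the two functions, assuming Xtr is n-by-d with one sample per row, Ytr is an n-by-1 vector of +/-1 labels, and the common convention of scaling the regularizer by n (the provided templates may use a different convention, e.g. a Cholesky factorization):

    function w = regularizedLSTrain(Xtr, Ytr, lambda)
    % Solve the regularized normal equations (Xtr'*Xtr + lambda*n*I) w = Xtr'*Ytr
        [n, d] = size(Xtr);
        w = (Xtr' * Xtr + lambda * n * eye(d)) \ (Xtr' * Ytr);
    end

    function Ypred = regularizedLSTest(w, Xte)
    % Linear predictions; take the sign for binary classification
        Ypred = sign(Xte * w);
    end

(In MATLAB, each function goes in its own .m file named after the function.)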
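
A sketch of the surrounding model selection and evaluation; here bestLambda is a placeholder for the output of the provided holdoutCVRLS, whose exact signature is not assumed:

    lambdas = exp(-10:0);            % candidate grid {exp(-10), ..., exp(0)}
    % ... run holdoutCVRLS over lambdas to obtain bestLambda ...
    w      = regularizedLSTrain(Xtr, Ytrn, bestLambda);
    Ypred  = regularizedLSTest(w, Xte);
    errRLS = mean(Ypred ~= Yten);    % fraction of misclassified test points
    % OLS baseline for 2.4: the same pipeline with lambda = 0
    wOLS   = regularizedLSTrain(Xtr, Ytrn, 0);
    errOLS = mean(regularizedLSTest(wOLS, Xte) ~= Yten);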

3. (Optional)

  1. Modify the regularizedLSTrain and regularizedLSTest functions to incorporate an offset in the linear model (i.e., y = <w,x> + b); a sketch follows this list. Compare the solutions with and without offset on a 2-class dataset where the classes are centered at (0,0) and (1,1), each with variance 0.35.
  2. Repeat 2.2 and 2.4 for different configurations (change training set size, mean vectors position/variance, percentage of noise).
  3. Modify the regularizedLSTrain and regularizedLSTest functions to handle multiclass problems (see the one-vs-all sketch below).
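
One standard way to add the offset is to append a constant feature to the inputs. A minimal sketch under the same conventions as above (the function names here are illustrative; note that this simple variant also penalizes b, while an alternative is to center the data and leave b unregularized):

    function [w, b] = regularizedLSTrainOffset(Xtr, Ytr, lambda)
    % Fit y = <w,x> + b by augmenting each input with a constant 1 feature
        n  = size(Xtr, 1);
        Xa = [Xtr, ones(n, 1)];
        wa = (Xa' * Xa + lambda * n * eye(size(Xa, 2))) \ (Xa' * Ytr);
        w  = wa(1:end-1);
        b  = wa(end);
    end

    function Ypred = regularizedLSTestOffset(w, b, Xte)
    % Predictions of the affine model
        Ypred = sign(Xte * w + b);
    end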
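
For the multiclass extension, one common choice (not prescribed by the lab) is one-vs-all: encode the labels as a +/-1 indicator matrix, solve one RLS problem per class, and predict by the largest score. A sketch assuming integer labels 1, ..., T:

    function W = regularizedLSTrainMulti(Xtr, Ytr, lambda)
    % One-vs-all RLS: one linear predictor (column of W) per class
        [n, d] = size(Xtr);
        T = max(Ytr);
        Yenc = -ones(n, T);
        Yenc(sub2ind([n T], (1:n)', Ytr)) = 1;   % +1 for the true class
        W = (Xtr' * Xtr + lambda * n * eye(d)) \ (Xtr' * Yenc);
    end

    function Ypred = regularizedLSTestMulti(W, Xte)
    % Predict the class with the largest linear score
        [~, Ypred] = max(Xte * W, [], 2);
    end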