Machine Learning Day

Lab 1: Local Methods and Cross-validation

This lab covers local methods for binary classification on synthetic data. The goal is to gain familiarity with the k-Nearest Neighbors (kNN) algorithm and a practical understanding of the bias-variance trade-off.

Getting Started

  • Get the code file and add its directory to the MATLAB path (or set it as the current/working directory).
  • Use the editor to write/save and run/debug longer scripts and functions.
  • Use the command window to try/test commands, view variables and see the use of functions.
  • Use plot (for 1D data), imshow and imagesc (for 2D matrices), and scatter and scatter3 to visualize variables of different types.
  • Work your way through the examples below, by following the instructions.

1. Generate Classification Data

  1. Call the function MixGauss with appropriate parameters and produce a dataset with four classes and 30 samples per class: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.3. The function call should look like [Xtr,Ytr] = MixGauss(...)
  2. Use the function scatter to plot the points in the 2D plane.
  3. Manipulate the data to obtain a 2-class problem in which data on opposite corners share the same class, with labels +1 and -1. Hint: if you produced the data following the order of the centers given above, you can use a mapping like Ytr = 2*(1/2-mod(Ytr, 2));
  4. Similarly generate a "test set" [Xte, Yte] drawn from the same distribution (start with 200 samples per class) and plot it using scatter.
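The steps above can be sketched as follows. This is only an illustration: it assumes a signature MixGauss(means, sigmas, n) with the class centers as columns of means, a per-class sigma vector, and the number of samples per class; type help MixGauss to check the actual interface and whether sigmas are variances or standard deviations.

```matlab
% Sketch of steps 1-4 (assumed MixGauss interface -- verify with 'help MixGauss').
means  = [0 0 1 1;    % x-coordinates of the four centers
          0 1 1 0];   % y-coordinates: (0,0), (0,1), (1,1), (1,0)
sigmas = sqrt(0.3)*ones(1, 4);   % variance 0.3, assuming sigmas are std deviations
[Xtr, Ytr] = MixGauss(means, sigmas, 30);   % 30 samples per class

scatter(Xtr(:,1), Xtr(:,2), 25, Ytr);       % visualize the four classes

% Collapse to two classes: opposite corners share the same label
Ytr = 2*(1/2 - mod(Ytr, 2));                % labels become +1 / -1

% Test set drawn from the same distribution
[Xte, Yte] = MixGauss(means, sigmas, 200);
Yte = 2*(1/2 - mod(Yte, 2));
figure; scatter(Xte(:,1), Xte(:,2), 25, Yte);
```

With the centers in this order, classes on opposite corners differ by 2, so mod(·,2) maps them to the same value regardless of whether labels start at 0 or 1.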

2. Core - kNN classifier

The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label of its k closest examples in the training set.
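The rule can be sketched directly; the provided kNNClassify implements something along these lines, and the snippet below is only an illustration for two classes with labels +1/-1:

```matlab
% Minimal kNN sketch (illustrative only; use the provided kNNClassify in practice).
% Xtr: n x d training points, Ytr: n x 1 labels in {-1,+1}, Xte: m x d test points.
function Ypred = myKNN(Xtr, Ytr, Xte, k)
    m = size(Xte, 1);
    Ypred = zeros(m, 1);
    for i = 1:m
        % squared Euclidean distances from the i-th test point to all training points
        d2 = sum((Xtr - Xte(i,:)).^2, 2);   % implicit expansion (R2016b+)
        [~, idx] = sort(d2);
        % with +/-1 labels the majority vote reduces to a sign; an odd k avoids ties
        Ypred(i) = sign(sum(Ytr(idx(1:k))));
    end
end
```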

  1. Study the code of the function kNNClassify (for a quick reference, type help kNNClassify).
  2. Use kNNClassify on the 2-class data generated at step 1. Pick a "reasonable" k.
  3. Plot the data to visualize the obtained results. A possible way is to mark the wrongly classified points with different colors/markers:
    scatter(Xte(:,1), Xte(:,2), 25, Yte); % color points based on "true" label
    hold on; % keep the previous plot so the markers overlay it
    sel = (Ypred ~= Yte);
    scatter(Xte(sel, 1), Xte(sel, 2), 25, Yte(sel), 'X'); % wrongly predicted test points
  4. Evaluate the classification performance by comparing the predicted outputs with the true ones:
    sum(Ypred ~= Yte)/Nt % Nt is the number of test points
  5. To visualize the separating function (and thus visualize what areas of the 2D plane are associated with each class) you can use the provided function separatingFkNN and again plot the test points with scatter.
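The idea behind this kind of visualization can be sketched by classifying a dense grid of points and coloring the plane accordingly. The snippet assumes the interface kNNClassify(Xtr, Ytr, k, Xgrid); check help kNNClassify for the actual argument order, and prefer the provided separatingFkNN for the lab itself.

```matlab
% Sketch of a decision-region plot (assumed kNNClassify interface -- verify
% with 'help kNNClassify'; separatingFkNN does this for you).
[xg, yg] = meshgrid(linspace(-1, 2, 200), linspace(-1, 2, 200));
Xgrid = [xg(:), yg(:)];                        % grid points as rows
Zg = kNNClassify(Xtr, Ytr, k, Xgrid);          % predicted label at each grid point
imagesc([-1 2], [-1 2], reshape(Zg, size(xg))); % class regions as background
set(gca, 'YDir', 'normal');                     % put the y axis back the usual way
hold on;
scatter(Xte(:,1), Xte(:,2), 25, Yte, 'filled'); % overlay the test points
```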

3. Parameter selection - What is a good value for k?

So far we have used an arbitrary choice of k.

  1. Perform a hold-out cross-validation procedure on the available training data for a large range of candidate values of k (e.g., k = 1, 3, 5, ..., 41). Repeat the hold-out experiment rep = 10 times, using at each iteration p = 30% of the training set for validation. You can use the provided function holdoutCVkNN (type help holdoutCVkNN for an example of use). Plot the training and validation errors for the different values of k. How would you now answer the question "what is the best value of k"? Note: with the parameters rep = 10 and p = 0.3, the hold-out procedure may be quite unstable.
  2. How is the selected value affected by p (the percentage of points held out) and by rep (the number of repetitions, e.g., 1, 5, 30, 50, 100)? What does a large number of repetitions provide?
  3. Apply the model selected by cross-validation (i.e., the best k) to the test set (Xte) and check whether the classification error improves over the result of step 2.4.
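The hold-out procedure can be sketched as below. It assumes the interface kNNClassify(Xtr, Ytr, k, X) for the classifier; in the lab, the provided holdoutCVkNN packages this loop for you (see help holdoutCVkNN for its actual inputs and outputs).

```matlab
% Hold-out CV sketch (illustrative; holdoutCVkNN implements this for you).
kRange = 1:2:41;              % candidate values of k
rep = 10; p = 0.3;            % repetitions and hold-out fraction
n = size(Xtr, 1); nVal = round(p*n);
valErr = zeros(numel(kRange), 1);
for r = 1:rep
    perm = randperm(n);                    % random split at each repetition
    iVal = perm(1:nVal); iTr = perm(nVal+1:end);
    for j = 1:numel(kRange)
        Yp = kNNClassify(Xtr(iTr,:), Ytr(iTr), kRange(j), Xtr(iVal,:));
        valErr(j) = valErr(j) + mean(Yp ~= Ytr(iVal))/rep;   % average over reps
    end
end
[~, jBest] = min(valErr);
kBest = kRange(jBest);
plot(kRange, valErr); xlabel('k'); ylabel('validation error');
```

Averaging over repetitions reduces the variance of the error estimate, which is why a single split (rep = 1) tends to pick an unstable k.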

4. (Optional)

  1. Dependence on training set size: evaluate the performance as the size of the training set grows, e.g., n = {3, 5, 20, 50, 100, 300, 500, ...}. How would you choose a good range for k as n changes? Repeat the validation and test multiple times. What can you say about the stability of the solution/performance?
  2. Try classifying more difficult datasets generated through MixGauss, for instance by increasing the variance or by adding noise, i.e., randomly flipping labels in the training set.
  3. Modify the function kNNClassify to handle a) multi-class problems and b) regression problems.
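For point 2, label noise can be injected with a few lines; the fraction q below is an arbitrary choice for illustration:

```matlab
% Sketch: flip a fraction q of the training labels to simulate label noise.
q = 0.1;                               % fraction of labels to flip (assumption)
n = numel(Ytr);
flip = randperm(n, round(q*n));        % indices of the labels to corrupt
YtrNoisy = Ytr;
YtrNoisy(flip) = -YtrNoisy(flip);      % works for +/-1 labels
```

Retraining on YtrNoisy and comparing validation curves against the clean case shows how the best k shifts as the problem gets harder.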