Machine Learning Day
Lab 1: Local Methods and Cross-validation
This lab is about local methods for binary classification on synthetic data. The goal is to gain some familiarity with the k-Nearest Neighbors (kNN) algorithm and a practical understanding of the bias-variance trade-off.
Getting Started
- Get the code file and add its directory to the MATLAB path (or set it as the current/working directory).
- Use the editor to write, save, run, and debug longer scripts and functions.
- Use the command window to try out commands, inspect variables, and check how functions are used.
- Use plot (for 1D data), imshow and imagesc (for 2D matrices), and scatter and scatter3 (for point clouds) to visualize variables of different types; a short example follows this list.
- Work your way through the examples below, following the instructions.
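For instance, a quick way to try these visualization commands on toy data (the variable names here are only for illustration):
x = linspace(0, 2*pi, 100);
plot(x, sin(x))                                  % a 1D signal
M = rand(20);
figure; imagesc(M); colorbar                     % a 2D matrix rendered as an image
P = randn(100, 2);
figure; scatter(P(:,1), P(:,2), 25, 'filled')    % a 2D point cloud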
1. Generate Classification Data
- Call the function MixGauss with appropriate parameters to produce a dataset with four classes and 30 samples per class: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.3. The function call should look like
[Xtr, Ytr] = MixGauss(...)
- Use the function scatter to plot the points in the 2D plane.
- Manipulate the data to obtain a 2-class problem where data on opposite corners share the same class, with labels +1 and -1. Hint: if you produced the data following the order of the centers given above, you can use a mapping like
Ytr = 2*(1/2 - mod(Ytr, 2));
- Similarly, generate a "test set" [Xte, Yte] drawn from the same distribution (start with 200 samples per class) and plot it using scatter. A sketch of the whole data-generation step follows this list.
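Putting the pieces of step 1 together, a possible script is sketched below. The exact signature of MixGauss is an assumption here (centers as columns, per-class standard deviations, samples per class, labels 0-3 in column order); check help MixGauss and adapt.
centers = [0 0 1 1; 0 1 1 0];        % corners (0,0), (0,1), (1,1), (1,0), one per column
sigmas  = sqrt(0.3)*ones(1, 4);      % assumes MixGauss expects standard deviations;
                                     % pass 0.3 directly if it expects variances
[Xtr, Ytr] = MixGauss(centers, sigmas, 30);    % 30 training samples per class
figure; scatter(Xtr(:,1), Xtr(:,2), 25, Ytr)   % 4-class training set

Ytr = 2*(1/2 - mod(Ytr, 2));         % merge opposite corners: labels become +1/-1

[Xte, Yte] = MixGauss(centers, sigmas, 200);   % 200 test samples per class
Yte = 2*(1/2 - mod(Yte, 2));
figure; scatter(Xte(:,1), Xte(:,2), 25, Yte)   % 2-class test set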
2. Core - kNN classifier
The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label of its k closest examples in the training set.
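To make the rule concrete, here is a minimal sketch of the idea (saved as kNNSketch.m); it is an illustration only, not the provided kNNClassify, whose code you should study instead. For each test point it computes the distances to all training points, takes the k smallest, and outputs the sign of the label sum, which is a majority vote for +1/-1 labels.
function Ypred = kNNSketch(Xtr, Ytr, k, Xte)
% Minimal kNN for binary labels +1/-1 (pick an odd k to avoid ties).
n = size(Xte, 1);
Ypred = zeros(n, 1);
for i = 1:n
    d = sum((Xtr - Xte(i,:)).^2, 2);       % squared distances to all training points
    [~, idx] = sort(d);                    % neighbor indices, closest first
    Ypred(i) = sign(sum(Ytr(idx(1:k))));   % majority vote via the sign of the label sum
end
end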
- Study the code of the function kNNClassify (for quick reference, type help kNNClassify).
- Use kNNClassify on the 2-class data generated at step 1. Pick a "reasonable" k.
- Plot the data to visualize the obtained results. A possible way is to plot the wrongly classified points using different colors/markers:
scatter(Xte(:,1), Xte(:,2), 25, Yte);                 % color points based on "true" label
hold on;                                              % keep the first plot visible
sel = (Ypred ~= Yte);                                 % wrongly predicted test points
scatter(Xte(sel,1), Xte(sel,2), 25, Yte(sel), 'X');
hold off;
- Evaluate the classification performance by comparing the estimated outputs to the true ones:
sum(Ypred ~= Yte)/Nt    % Nt is the number of test points
- To visualize the separating function (and thus see which areas of the 2D plane are associated with each class), you can use the provided function separatingFkNN and again plot the test points with scatter. A self-contained sketch of the same idea is given after this list.
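To see what the decision regions look like without the provided helper, you can classify every node of a grid covering the plane and display the result with imagesc. This is only a sketch of the idea behind separatingFkNN, using the kNNSketch function from above (or kNNClassify, depending on its signature); the value of k is an arbitrary example.
k = 5;                                       % the k chosen in the previous step
[xg, yg] = meshgrid(linspace(-1, 2, 200), linspace(-1, 2, 200));
Zg = kNNSketch(Xtr, Ytr, k, [xg(:), yg(:)]); % classify every grid node
figure;
imagesc(linspace(-1, 2, 200), linspace(-1, 2, 200), reshape(Zg, size(xg)));
set(gca, 'YDir', 'normal');                  % imagesc flips the y-axis by default
hold on;
scatter(Xte(:,1), Xte(:,2), 25, Yte, 'filled');
hold off;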
3. Parameter selection - What is a good value for k?
So far we have considered an arbitrary choice of k.
- Perform a hold-out cross-validation procedure on the available training data for a large range of candidate values of k (e.g., k = 1, 3, 5, ..., 41). Repeat the hold-out experiment rep=10 times, using at each iteration p=30% of the training set for validation. You can use the provided function holdoutCVkNN (type help holdoutCVkNN for an example of use); a hand-rolled sketch of the same procedure follows this list. Plot the training and validation errors for the different values of k. How would you now answer the question "what is the best value for k"? Note: with the parameters rep=10 and p=0.3, the hold-out procedure may be quite unstable.
- How is the selected value affected by p (the percentage of points held out) and by rep (the number of repetitions, e.g., 1, 5, 30, 50, 100)? What does a large number of repetitions provide?
- Apply the model obtained by cross-validation (i.e., the best k) to the test set (Xte) and check whether the classification error improves over the test error computed in step 2.
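If you want to see what holdoutCVkNN presumably does under the hood, a hand-rolled hold-out loop might look like the sketch below (again using kNNSketch; the provided function may differ in interface and details).
ks = 1:2:41;  rep = 10;  p = 0.3;            % candidate k values, repetitions, hold-out fraction
n = size(Xtr, 1);  nval = round(p*n);
trErr = zeros(rep, numel(ks));  valErr = zeros(rep, numel(ks));
for r = 1:rep
    perm = randperm(n);                      % fresh random split at each repetition
    Iv = perm(1:nval);  It = perm(nval+1:end);
    for j = 1:numel(ks)
        Yp = kNNSketch(Xtr(It,:), Ytr(It), ks(j), Xtr(It,:));
        trErr(r, j) = mean(Yp ~= Ytr(It));   % error on the training part
        Yp = kNNSketch(Xtr(It,:), Ytr(It), ks(j), Xtr(Iv,:));
        valErr(r, j) = mean(Yp ~= Ytr(Iv));  % error on the held-out part
    end
end
figure; plot(ks, mean(trErr), ks, mean(valErr));
legend('training error', 'validation error'); xlabel('k');
[~, best] = min(mean(valErr));  bestK = ks(best);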
4. (Optional)
- Dependence on training-set size: evaluate the performance as the size of the training set grows, e.g., n = {3, 5, 20, 50, 100, 300, 500, ...}. How would you choose a good range for k as n changes? Repeat the validation and test multiple times. What can you say about the stability of the solution/performance?
- Try classifying more difficult datasets generated with MixGauss, for instance by increasing the variance or by adding noise through randomly flipping some training labels.
- Modify the function kNNClassify to handle (a) multi-class problems and (b) regression problems; a sketch of the regression variant follows this list.
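As a starting point for the last item, the binary sketch above changes very little: for regression, replace the majority vote by the average of the neighbors' outputs; for multi-class, replace it by the most frequent label among the neighbors. The variant below is an illustration, not the required modification of kNNClassify itself.
function Ypred = kNNRegressSketch(Xtr, Ytr, k, Xte)
% kNN for regression: predict the mean output of the k nearest neighbors.
n = size(Xte, 1);
Ypred = zeros(n, 1);
for i = 1:n
    [~, idx] = sort(sum((Xtr - Xte(i,:)).^2, 2));
    Ypred(i) = mean(Ytr(idx(1:k)));          % average instead of a vote
    % multi-class alternative: Ypred(i) = mode(Ytr(idx(1:k)));
end
end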