Machine Learning Day
Lab 1: k-Nearest Neighbors and Cross-validation
This lab is about local methods for binary classification and model selection. The goal is to gain familiarity with a basic local method, k-Nearest Neighbors (k-NN), and to get some practical insight into the bias-variance trade-off. In addition, it explores a basic method for model selection: choosing the parameter k through cross-validation (CV).
Getting Started
- Get the code file, add the directory to MATLAB path (or set it as current/working directory).
- Use the editor to write/save and run/debug longer scripts and functions.
- Use the command window to try/test commands, view variables and see the use of functions.
- Use `plot` (for 1D data), `imshow` and `imagesc` (for 2D matrices), and `scatter` and `scatter3` to visualize variables of different types (see the short example after this list).
- Work your way through the examples below, following the instructions.
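For instance, here are a few quick visualization commands to try in the command window (a minimal sketch; the variable names are just placeholders):

```matlab
x = linspace(0, 2*pi, 100);                       % 1D data
plot(x, sin(x));                                  % line plot of a 1D signal

M = rand(20);                                     % a random 2D matrix
figure; imagesc(M); colorbar;                     % display it as a color-coded image

P = randn(100, 2);                                % 100 random 2D points
figure; scatter(P(:, 1), P(:, 2), 25, 'filled');  % scatter plot of the points
```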
1. Data generation
- Use the function `MixGauss` with appropriate parameters to produce a dataset with four classes and 30 samples per class: the classes must live in 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.3.
- Obtain a 2-class training set `[X, Y]` by having data on opposite corners share the same class, with labels +1 and -1. Example: if you generated the data following the order above, you can use a mapping like `Y = 2*(1/2 - mod(Y, 2));`
- Generate a test set `[Xte, Yte]` from the same distribution, starting with 200 samples per class.
- Visualize both sets using `scatter` (a possible sketch of these steps follows this list).
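A possible sketch of this step is below. It assumes the provided `MixGauss(means, sigmas, n)` takes the class centers (one per column), a vector of per-class spreads, and the number of samples per class, and returns the data matrix together with integer class labels; check `help MixGauss` for the actual signature and for whether the spread argument is a variance or a standard deviation.

```matlab
% Class centers on the corners of the unit square, one center per column,
% in the order (0,0), (0,1), (1,1), (1,0).
means  = [0 0 1 1;
          0 1 1 0];
sigmas = 0.3 * ones(1, 4);   % spread of each Gaussian (check MixGauss's convention)

[X, Y]     = MixGauss(means, sigmas, 30);    % training set: 30 samples per class
[Xte, Yte] = MixGauss(means, sigmas, 200);   % test set: 200 samples per class

% Merge opposite corners into two classes with labels +1 / -1.
Y   = 2 * (1/2 - mod(Y, 2));
Yte = 2 * (1/2 - mod(Yte, 2));

% Visualize the training and test sets, colored by label.
figure; scatter(X(:, 1), X(:, 2), 25, Y, 'filled');       title('Training set');
figure; scatter(Xte(:, 1), Xte(:, 2), 25, Yte, 'filled'); title('Test set');
```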
2. kNN classification
The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label of its k closest examples in the training set. Study the code of the function `kNNClassify` (for a quick reference, type `help kNNClassify`).
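To make the rule concrete, here is a minimal, self-contained sketch of the kNN prediction step. This is an illustrative placeholder, not the provided `kNNClassify`, which you should still read:

```matlab
function Yp = knnPredictSketch(Xtr, Ytr, Xte, k)
% Minimal kNN prediction rule (illustrative sketch, not kNNClassify).
% Uses implicit expansion, so it needs MATLAB R2016b or later.
    nte = size(Xte, 1);
    Yp  = zeros(nte, 1);
    for i = 1:nte
        d = sum((Xtr - Xte(i, :)).^2, 2);   % squared distances to all training points
        [~, idx] = sort(d, 'ascend');       % neighbor indices, closest first
        Yp(i) = mode(Ytr(idx(1:k)));        % most frequent label among the k nearest
    end
end
```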
- Use `kNNClassify` to generate predictions `Yp` for the 2-class data generated in Section 1. Pick a "reasonable" k.
- Evaluate the classification performance (prediction error) by comparing the estimated labels `Yp` to the true labels `Yte`: `err = sum(Yp ~= Yte)/length(Yte);`
- Visualize the obtained results, e.g. by plotting the wrongly classified points using different colors/markers:

```matlab
markerSize = 25;                                        % e.g.
scatter(Xte(:, 1), Xte(:, 2), markerSize, Yte);         % color points by "true" label
l = (Yp ~= Yte);                                        % flag the wrong predictions
hold on;                                                % keep the first plot
scatter(Xte(l, 1), Xte(l, 2), markerSize, Yp(l), 'x');  % mark them with crosses
```

- Use the provided function `separatingFkNN` to visualize the separating function, i.e. the areas of the 2D plane that the classifier associates with each class. Overlay the test points using `scatter` (a hand-rolled alternative is sketched after this list).
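If you want to see what a plot like the one produced by `separatingFkNN` computes, the decision regions can also be drawn by hand by classifying a dense grid of points. This sketch assumes `kNNClassify` takes arguments in the order `(Xtr, Ytr, k, Xte)` and assumes a plotting range covering the data; check `help kNNClassify` for the actual interface:

```matlab
% Build a grid covering the data region (the range [-1, 2] is an assumption).
gridPts  = linspace(-1, 2, 200);
[xg, yg] = meshgrid(gridPts, gridPts);
Xgrid    = [xg(:), yg(:)];

% Classify every grid point; the argument order of kNNClassify is assumed here.
k = 5;
Ygrid = kNNClassify(X, Y, k, Xgrid);

% Color the plane by predicted class and overlay the test points.
figure;
imagesc(gridPts, gridPts, reshape(Ygrid, size(xg)));
set(gca, 'YDir', 'normal');       % imagesc flips the y-axis by default
hold on;
scatter(Xte(:, 1), Xte(:, 2), 25, Yte, 'filled');
```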
3. Parameter selection: what is a good value for k?
So far we considered an arbitrary choice for k. You will now use the provided function `holdoutCVkNN` for model selection (type `help holdoutCVkNN` for an example of use).
- Perform hold-out cross-validation using a percentage of the training set for validation. Note: for the suggested parameters `rep = 10` and `pho = 0.3`, the hold-out procedure may be quite unstable.
- Use a large range of candidate values for k (e.g. k = 1, 3, 5, ..., 21).
- Repeat the process `rep` times, using at each iteration a random fraction `pho` of the training set for validation. Try `rep = 10, pho = 0.3`.
- Plot the training and validation errors for the different values of k.
- How would you now answer the question "what is the best value for k"?
- How is the value of k affected by `pho` (the percentage of points held out) and by `rep` (the number of repetitions, e.g. 1, 5, 30, 50, 100)? What does a large number of repetitions provide?
- Apply the model obtained by cross-validation (i.e., the best k) to the test set and check whether the classification error improves over the result of Part 2. (A hand-rolled sketch of the whole procedure follows this list.)
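If you prefer to see the procedure spelled out rather than rely on `holdoutCVkNN` directly, below is a minimal hold-out loop over the validation error. It is a sketch, not the provided function, which may differ in interface and details; the `kNNClassify` argument order is again assumed, and the training-error curve can be added analogously by predicting on `X(iTr, :)`:

```matlab
kList  = 1:2:21;                % candidate values for k
rep    = 10;                    % number of hold-out repetitions
pho    = 0.3;                   % fraction of the training set held out for validation

n      = size(X, 1);
nVal   = round(pho * n);
valErr = zeros(rep, numel(kList));

for r = 1:rep
    perm = randperm(n);                     % random split of the training set
    iVal = perm(1:nVal);                    % held-out validation indices
    iTr  = perm(nVal+1:end);                % remaining training indices
    for j = 1:numel(kList)
        Yp = kNNClassify(X(iTr, :), Y(iTr), kList(j), X(iVal, :));  % assumed order
        valErr(r, j) = mean(Yp ~= Y(iVal));
    end
end

meanErr   = mean(valErr, 1);                % average validation error per k
[~, best] = min(meanErr);
bestK     = kList(best);

figure; plot(kList, meanErr, 'o-'); xlabel('k'); ylabel('validation error');
```

Averaging over repetitions reduces the variance of the validation estimate, which is why a larger `rep` tends to give a more stable choice of k.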
4. (Optional)
- Dependence on training size: Evaluate the performance as the size of the training set grows, e.g., n = {50, 100, 300, 500,...}. How would you choose a good range for k as n changes? What can you say about the stability of the solution? Check by repeating the validation multiple times.
- Try classifying more difficult datasets, for instance, by increasing the variance or adding noise by randomly flipping the labels on the training set.
- Modify the function `kNNClassify` to handle a) multi-class problems and b) regression problems (a hint for the regression case is sketched below).
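As a hint for the regression variant: the only change to the prediction rule is to replace the majority vote with an average of the neighbors' target values. A sketch, using the same placeholder names as the Section 2 sketch (note that a `mode`-based vote, as used there, already handles the multi-class case):

```matlab
function Yp = knnRegressSketch(Xtr, Ytr, Xte, k)
% kNN for regression: predict the mean target value of the k nearest neighbors.
    nte = size(Xte, 1);
    Yp  = zeros(nte, 1);
    for i = 1:nte
        d = sum((Xtr - Xte(i, :)).^2, 2);   % squared Euclidean distances
        [~, idx] = sort(d, 'ascend');       % neighbor indices, closest first
        Yp(i) = mean(Ytr(idx(1:k)));        % average instead of majority vote
    end
end
```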