MLCC - Laboratory 1 - Local methods
This lab is about local methods for binary classification on synthetic data. The goal is to get familiar with the kNN algorithm and to get a practical grasp of what we have discussed in class. Follow the instructions below, and think hard before you call the instructors!
1. Warm up - data generation
Open the Matlab file MixGauss.m.
- 1.A The function MixGauss(means, sigmas, n) generates datasets where the distribution of each class is a Gaussian with the given mean and variance.
- 1.B Have a look at the code or, for quick help, type "help MixGauss" in the Matlab shell.
- 1.C Type the following commands in the Matlab shell:
[X1,Y1] = MixGauss([[0;0],[1;1]],[0.5,0.25],1000); % semicolon suppresses echoing the whole matrix
figure(1); scatter(X1(:,1),X1(:,2),25,Y1); % type "help scatter" to see what the parameters mean
title('dataset 1')
- 1.D Now generate a more complex dataset following the instructions below.
- Call MixGauss with appropriate parameters to produce a dataset with four classes: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.2.
[X2,C]=MixGauss(....)
- Use the Matlab function "scatter" to plot the points.
- Manipulate the data so as to obtain a 2-class problem where data on opposite corners share the same class (if you produced the data following the centers order given above, you may use the function "mod" for a quick result: Y2=mod(C,2)). A complete call sequence is sketched after this list.
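One possible way to fill in the call above (a sketch: it assumes MixGauss labels the classes in the order the means are listed, takes one sigma per class as in 1.C, and that the last argument is the number of points per class; check "help MixGauss" to confirm):
means = [[0;0],[0;1],[1;1],[1;0]]; % centers as columns, ordered around the square
sigmas = 0.2*ones(1,4);            % the "variance 0.2" from the text, passed as the sigma parameter
[X2,C] = MixGauss(means,sigmas,100); % e.g. 100 points per class
figure(2); scatter(X2(:,1),X2(:,2),25,C); title('dataset 2 - four classes')
Y2 = mod(C,2);                     % opposite corners now share a label
figure(3); scatter(X2(:,1),X2(:,2),25,Y2); title('dataset 2 - two classes')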
- 1.E Following the same procedure as in section 1.D, generate a new set of data from the same distribution to be used as a test set: (X2t,Y2t).
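A minimal sketch, reusing means and sigmas from the snippet above:
[X2t,Ct] = MixGauss(means,sigmas,100); % a fresh, independent draw from the same distribution
Y2t = mod(Ct,2);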
2. Core - kNN classifier
The k-Nearest Neighbors (kNN) algorithm assigns to a test point the most frequent label among its k closest examples in the training set.
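To make the rule concrete, here is a minimal, unoptimized sketch of such a classifier (illustration only; kNNClassify, provided with the lab, is the reference implementation and may differ). Save it as myKNN.m; it uses implicit expansion, available from Matlab R2016b:
function Ypred = myKNN(Xtr,Ytr,Xte,k)
% For each row of Xte, return the most frequent label among its k
% nearest training points in Euclidean distance.
m = size(Xte,1);
Ypred = zeros(m,1);
for i = 1:m
    d = sum((Xtr - Xte(i,:)).^2, 2); % squared distances to all training points
    [~,idx] = sort(d);               % nearest first
    Ypred(i) = mode(Ytr(idx(1:k)));  % majority vote among the k nearest
end
end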
- 2.A Have a look at the code of the function kNNClassify (for a quick reference, type "help kNNClassify" in the Matlab shell).
- 2.B Use kNNClassify on the 2-class data generated in section 1.D. Pick a "reasonable" k; a possible call is sketched below.
- 2.C Think of how to plot the data to get a glimpse of the results. A possible way is (here Xt, Yt stand for your test set, e.g. X2t, Y2t):
figure;
scatter(Xt(:,1),Xt(:,2),25,Yt,'filled'); % plot test points (filled circles), a different color for each "true" label
hold on
scatter(Xt(:,1),Xt(:,2),25,Yest); % plot test points (empty circles), a different color for each estimated label
- 2.D To evaluate the classification performance, compare the estimated outputs with the true ones generated earlier. In Matlab:
sum(Yest~=Yt)./Nt % Nt is the number of test points
- 2.E To visualize the separating function (and thus get a more general view of which areas are associated with each class) you may use the routine separatingF (type "help separatingF" in the Matlab shell; if you still have doubts on how to use it, have a look at the code).
3. Parameter selection - What is a good value for k?
So far we have considered an arbitrary k...
- 3.A Perform a hold-out cross-validation procedure on the available training data. You may want to use the function holdoutCV available in the zip file (here again, type "help holdoutCV" in the Matlab shell; you will find a useful usage example there). Plot the training and validation errors for the different values of k. A hand-rolled version of the procedure is sketched below.
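For illustration, a minimal hand-rolled hold-out loop (the provided holdoutCV presumably packages something like this; the kNNClassify argument order is the same assumption as in 2.B):
p = 0.3; rep = 10;   % fraction held out and number of repetitions
klist = 1:2:21;      % candidate values of k
n = size(X2,1);
valErr = zeros(numel(klist),1);
for r = 1:rep
    perm = randperm(n);
    nval = round(p*n);
    iVal = perm(1:nval); iTr = perm(nval+1:end);
    for j = 1:numel(klist)
        Yp = kNNClassify(X2(iTr,:),Y2(iTr),klist(j),X2(iVal,:)); % assumed signature
        valErr(j) = valErr(j) + sum(Yp~=Y2(iVal))/(nval*rep);    % average over repetitions
    end
end
figure; plot(klist,valErr,'-o'); xlabel('k'); ylabel('validation error');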
- 3.B Now, can you answer the question "what is the best value for k"?
- 3.C What happens with different values of p (percentage of points held out) and rep (number of repetitions of the experiment)?
- 3.D Test the model selected by cross-validation: apply kNN (with the best k) to a separate test set (e.g., X2t generated before) and check whether the classification error improves with respect to what you got in section 2.D. A sketch follows below.
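A sketch, reusing klist and valErr from the loop in 3.A and the same assumed kNNClassify signature:
[~,jBest] = min(valErr); kBest = klist(jBest); % k with the lowest validation error
Yest = kNNClassify(X2,Y2,kBest,X2t);           % train on all available training data
testErr = sum(Yest~=Y2t)/numel(Y2t)            % compare with the error obtained in 2.D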
4. If you have time - More experiments
- 4.A Evaluate the results as the size of the training set grows: n = 10, 20, 50, 100, 300, ... (of course k needs to be chosen accordingly; a possible loop is sketched below).
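A possible outer loop, reusing means and sigmas from 1.D (again assuming the last argument of MixGauss counts points per class):
for n = [10 20 50 100 300]
    [Xn,Cn] = MixGauss(means,sigmas,ceil(n/4)); % roughly n training points in total
    Yn = mod(Cn,2);
    % ... select k on (Xn,Yn) as in 3.A, then measure the error on (X2t,Y2t)
end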
- 4.B Generate more complex datasets with the MixGauss function, for instance by choosing a larger variance in the data generation step.
- 4.C You may also add noise to the data by randomly flipping the labels on the training set (vary the percentage of flipped labels). One way to do this is sketched below.
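A label-flipping sketch, assuming the 2-class labels are {0,1} as produced by mod(C,2):
pFlip = 0.1;                           % fraction of labels to flip (example value)
nTr = numel(Y2);
idx = randperm(nTr,round(pFlip*nTr));  % labels chosen uniformly at random
Y2noisy = Y2;
Y2noisy(idx) = 1 - Y2noisy(idx);       % flip 0 <-> 1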