MLCC - Laboratory 1 - Local methods
This lab is about local methods for binary classification on synthetic data. The goal is to get familiar with the kNN algorithm and to get a practical grasp of what we have discussed in class. Follow the instructions below, and think hard before you call the instructors!
1. Warm up - data generation
Open the Matlab file MixGauss.m.
- 1.A The function MixGauss(means, sigmas, n) generates datasets where the distribution of each class is a Gaussian with the given mean and variance.
- 1.B Have a look at the code or, for quick help, type "help MixGauss" in the Matlab shell.
- 1.C Type the following commands in the Matlab shell:
[X1,Y1] = MixGauss([[0;0],[1;1]],[0.5,0.25],1000); % semicolon suppresses echoing the whole matrix
figure(1); scatter(X1(:,1),X1(:,2),25,Y1); % type "help scatter" to see what the parameters mean
title('dataset 1')
- 1.D Now generate a more complex dataset following the instructions below.
- Call MixGauss with appropriate parameters to produce a dataset with four classes: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.2.
[X2,C]=MixGauss(....)
- Use the Matlab function "scatter" to plot the points.
- Manipulate the data so as to obtain a 2-class problem where data on opposite corners share the same class (if you produced the data following the centers order given above, you may use the function "mod" for a quick result: Y2=mod(C,2)). A complete call sequence is sketched after this list.
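One possible way to fill in the call above (a sketch: it assumes MixGauss labels the classes in the order the means are listed, takes one sigma per class as in 1.C, and that the last argument is the number of points per class; check "help MixGauss" to confirm):
means = [[0;0],[0;1],[1;1],[1;0]]; % centers as columns, ordered around the square
sigmas = 0.2*ones(1,4);            % the "variance 0.2" from the text, passed as the sigma parameter
[X2,C] = MixGauss(means,sigmas,100); % e.g. 100 points per class
figure(2); scatter(X2(:,1),X2(:,2),25,C); title('dataset 2 - four classes')
Y2 = mod(C,2);                     % opposite corners now share a label
figure(3); scatter(X2(:,1),X2(:,2),25,Y2); title('dataset 2 - two classes')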
- 1.E Following the same procedure as in section 1.D, generate a new set of data from the same distribution to be used as a test set: (X2t,Y2t).
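A minimal sketch, reusing means and sigmas from the snippet above:
[X2t,Ct] = MixGauss(means,sigmas,100); % a fresh, independent draw from the same distribution
Y2t = mod(Ct,2);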
2. Core - kNN classifier
The k-Nearest Neighbors (kNN) algorithm assigns to a test point the most frequent label among its k closest examples in the training set.
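To make the rule concrete, here is a minimal, unoptimized sketch of such a classifier (illustration only; kNNClassify, provided with the lab, is the reference implementation and may differ). Save it as myKNN.m; it uses implicit expansion, available from Matlab R2016b:
function Ypred = myKNN(Xtr,Ytr,Xte,k)
% For each row of Xte, return the most frequent label among its k
% nearest training points in Euclidean distance.
m = size(Xte,1);
Ypred = zeros(m,1);
for i = 1:m
    d = sum((Xtr - Xte(i,:)).^2, 2); % squared distances to all training points
    [~,idx] = sort(d);               % nearest first
    Ypred(i) = mode(Ytr(idx(1:k)));  % majority vote among the k nearest
end
end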
- 2.A Have a look at the code of the function kNNClassify (for a quick reference, type "help kNNClassify" in the Matlab shell).
- 2.B Use kNNClassify on the 2-class data generated in section 1.D. Pick a "reasonable" k; a possible call is sketched below.
- 2.C Think of how to plot the data to get a glimpse of the results. A possible way is (here Xt, Yt stand for your test set, e.g. X2t, Y2t):
figure;
scatter(Xt(:,1),Xt(:,2),25,Yt,'filled'); % plot test points (filled circles), a different color for each "true" label
hold on
scatter(Xt(:,1),Xt(:,2),25,Yest); % plot test points (empty circles), a different color for each estimated label
- 2.D To evaluate the classification performance, compare the estimated outputs with the true ones generated earlier. In Matlab:
sum(Yest~=Yt)./Nt % Nt is the number of test points
- 2.E To visualize the separating function (and thus get a more general view of which areas are associated with each class) you may use the routine separatingF (type "help separatingF" in the Matlab shell; if you still have doubts on how to use it, have a look at the code).
3. Parameter selection - What is a good value for k?
So far we have considered an arbitrary k...
- 3.A Perform a hold-out cross-validation procedure on the available training data. You may want to use the function holdoutCV available in the zip file (here again, type "help holdoutCV" in the Matlab shell; you will find a useful usage example there). Plot the training and validation errors for the different values of k. A hand-rolled version of the procedure is sketched below.
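For illustration, a minimal hand-rolled hold-out loop (the provided holdoutCV presumably packages something like this; the kNNClassify argument order is the same assumption as in 2.B):
p = 0.3; rep = 10;   % fraction held out and number of repetitions
klist = 1:2:21;      % candidate values of k
n = size(X2,1);
valErr = zeros(numel(klist),1);
for r = 1:rep
    perm = randperm(n);
    nval = round(p*n);
    iVal = perm(1:nval); iTr = perm(nval+1:end);
    for j = 1:numel(klist)
        Yp = kNNClassify(X2(iTr,:),Y2(iTr),klist(j),X2(iVal,:)); % assumed signature
        valErr(j) = valErr(j) + sum(Yp~=Y2(iVal))/(nval*rep);    % average over repetitions
    end
end
figure; plot(klist,valErr,'-o'); xlabel('k'); ylabel('validation error');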
- 3.B Now, can you answer the question "what is the best value for k"?
- 3.C What happens with different values of p (percentage of points held out) and rep (number of repetitions of the experiment)?
- 3.D Test the model selected by cross-validation: apply kNN (with the best k) to a separate test set (e.g., X2t generated before) and check whether the classification error improves with respect to what you got in section 2.D. A sketch follows below.
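A sketch, reusing klist and valErr from the loop in 3.A and the same assumed kNNClassify signature:
[~,jBest] = min(valErr); kBest = klist(jBest); % k with the lowest validation error
Yest = kNNClassify(X2,Y2,kBest,X2t);           % train on all available training data
testErr = sum(Yest~=Y2t)/numel(Y2t)            % compare with the error obtained in 2.D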
4. If you have time - More experiments
- 4.A Evaluate the results as the size of the training set grows: n = 10, 20, 50, 100, 300, ... (of course k needs to be chosen accordingly; a possible loop is sketched below).
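A possible outer loop, reusing means and sigmas from 1.D (again assuming the last argument of MixGauss counts points per class):
for n = [10 20 50 100 300]
    [Xn,Cn] = MixGauss(means,sigmas,ceil(n/4)); % roughly n training points in total
    Yn = mod(Cn,2);
    % ... select k on (Xn,Yn) as in 3.A, then measure the error on (X2t,Y2t)
end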
- 4.B Generate more complex datasets with the MixGauss function, for instance by choosing a larger variance in the data generation step.
- 4.C You may also add noise to the data by randomly flipping the labels on the training set (vary the percentage of flipped labels). One way to do this is sketched below.
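A label-flipping sketch, assuming the 2-class labels are {0,1} as produced by mod(C,2):
pFlip = 0.1;                           % fraction of labels to flip (example value)
nTr = numel(Y2);
idx = randperm(nTr,round(pFlip*nTr));  % labels chosen uniformly at random
Y2noisy = Y2;
Y2noisy(idx) = 1 - Y2noisy(idx);       % flip 0 <-> 1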