MLCC - Laboratory 1 - Local methods
This lab is about local methods for binary classification on synthetic
data. The goal of the lab is to get familiar with the kNN algorithm and to
get a practical grasp of what we have discussed in class. Follow the
instructions below. Think hard before you call the instructors!
1. Warm up - data generation
Open the Matlab file MixGauss.m.
- 1.A The function MixGauss(means, sigmas, n) generates a dataset [X,Y] where X is composed of mixed classes, each class generated according to a Gaussian distribution with the given mean and standard deviation. The points of X are indexed from 1 to n, and Y holds the label of each point.
Have a look at the code or, for quick help, type "help MixGauss" in the Matlab shell.
Hint: if the command help MixGauss fails, it probably means that your current working directory is not set up correctly.
- 1.B Type the following commands in the Matlab shell:
[X, Y] = MixGauss([[0;0],[1;1]],[0.5,0.25],50);
figure(1);
scatter(X(:,1),X(:,2),50,Y,'filled'); % type "help scatter" to see what the parameters mean
title('dataset 1');
- 1.C Now generate a more complex dataset following the instructions below.
This dataset will be referred to hereafter as the training dataset.
- Call MixGauss with appropriate parameters to produce a dataset with four classes: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with standard deviation 0.3. The number of points in the dataset is up to you.
[Xtr,C]=MixGauss(....)
- Use the Matlab function "scatter"
to plot the training dataset.
- Manipulate the data so as to obtain a 2-class problem where data on opposite corners share the same class (a complete sketch of this step follows the list).
If you produced the data following the order of the centers given above, you may do: Ytr = 2*mod(C,2)-1;
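For reference, one possible way to carry out the whole of 1.C, following the calling pattern of 1.B (the 100 points per call are an arbitrary choice, and the sketch assumes MixGauss accepts the centers as columns of a matrix, as in the 1.B example):
% Four Gaussian classes centered on the corners of the unit square,
% all with standard deviation 0.3
[Xtr, C] = MixGauss([[0;0],[0;1],[1;1],[1;0]], [0.3,0.3,0.3,0.3], 100);
figure; scatter(Xtr(:,1), Xtr(:,2), 50, C, 'filled'); title('training dataset');
% With the centers ordered as above, opposite corners end up with the same label
Ytr = 2*mod(C,2) - 1;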
- 1.D Following the same procedure as above (section 1.C), generate a new set of data [Xte,Yte] with the same distribution, hereafter called the test dataset.
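For example, reusing the same centers and standard deviations as for the training set (the point count is again up to you):
[Xte, Cte] = MixGauss([[0;0],[0;1],[1;1],[1;0]], [0.3,0.3,0.3,0.3], 100);
Yte = 2*mod(Cte,2) - 1; % same relabeling as in 1.C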
2. Core - kNN classifier
The k-Nearest Neighbors (kNN) algorithm assigns to a test point the most frequent label among its k closest examples in the training set.
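To fix ideas, here is a minimal sketch of the kNN rule for binary labels in {-1,+1}; the provided kNNClassify may differ in interface and implementation details, so treat this only as an illustration (in Matlab it would live in its own kNNSketch.m file):
function Ypred = kNNSketch(Xtr, Ytr, k, Xte)
% Illustrative kNN rule for labels in {-1,+1} (not the provided kNNClassify)
ntest = size(Xte, 1);
Ypred = zeros(ntest, 1);
for i = 1:ntest
    % squared Euclidean distances from the i-th test point to all training points
    d = sum(bsxfun(@minus, Xtr, Xte(i,:)).^2, 2);
    [~, idx] = sort(d);
    % majority vote among the k nearest labels (sign(0)=0 flags a tie, possible when k is even)
    Ypred(i) = sign(sum(Ytr(idx(1:k))));
end
end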
- 2.A Have a look at the code of the function kNNClassify (for a quick reference, type "help kNNClassify" at the Matlab command prompt).
- 2.B Use kNNClassify on the previously generated 2-class data (training set from 1.C, test set from 1.D). Pick a "reasonable" k; a possible call is sketched after this paragraph.
Below we propose three ways of evaluating the quality of the predictions made by the kNN method. Try them and observe the influence of the parameter k.
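A possible call, assuming kNNClassify takes the training points, the training labels, k, and the test points, and returns the predicted labels (check help kNNClassify for the actual interface):
k = 5;                                   % an arbitrary "reasonable" choice
Ypred = kNNClassify(Xtr, Ytr, k, Xte);   % predicted labels for the test points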
- 2.C1 [Evaluating the prediction] Plot the test data Xte twice: once with its true labels Yte, and once with the predicted labels Ypred.
A possible way is:
figure;
scatter(Xte(:,1),Xte(:,2),50,Yte,'filled'); % plot test points (filled circles), one color per "true" label
hold on
scatter(Xte(:,1),Xte(:,2),70,Ypred,'o'); % plot test points (empty circles), one color per estimated label
- 2.C2 [Evaluating the prediction] To compute the classification error (the fraction of misclassified points), compare the estimated outputs with the true ones:
sum(Ypred ~= Yte) ./ size(Yte, 1)
- 2.C3 [Evaluating the prediction] To visualize
the separating function, use the routine separatingFkNN.
You may type "help separatingFkNN" at the command prompt or look directly at the code.
3. Parameter selection - What is a
good value for k?
So far we considered an arbitrary k. We now introduce different approaches for selecting it.
- 3.A Perform a hold-out cross-validation procedure on the available training data.
You may want to use the function holdoutCVkNN available in the zip file (here again, type "help holdoutCVkNN" at the Matlab command prompt; you will find a useful usage example there).
Plot the training and validation errors for the different values of k (a spelled-out version of the procedure follows).
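If you want to see the procedure written out, below is a minimal single-split hold-out sketch against the kNNClassify interface assumed in section 2; holdoutCVkNN presumably does something similar, averaging over repeated splits:
p = 0.3;                                   % fraction of training points held out
n = size(Xtr, 1);
perm = randperm(n)';                       % column vector of shuffled indices
nval = round(p*n);
Iv = perm(1:nval);                         % validation indices
It = perm(nval+1:end);                     % indices kept for training
ks = 1:2:21;                               % candidate (odd) values of k
trErr = zeros(size(ks)); valErr = zeros(size(ks));
for j = 1:numel(ks)
    Yt = kNNClassify(Xtr(It,:), Ytr(It), ks(j), Xtr(It,:));
    Yv = kNNClassify(Xtr(It,:), Ytr(It), ks(j), Xtr(Iv,:));
    trErr(j)  = mean(Yt(:) ~= Ytr(It));    % error on the points used for training
    valErr(j) = mean(Yv(:) ~= Ytr(Iv));    % error on the held-out points
end
figure; plot(ks, trErr, ks, valErr);
xlabel('k'); ylabel('error'); legend('training error', 'validation error');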
- 3.B Add noise to the data by randomly flipping some of the training labels, and call the result Ytr_noisy.
You can use the function flipLabels to do that (a conceptual sketch follows). How does the validation error now behave with respect to k?
Note: keep track of the best k and the corresponding validation error; you will need them in 3.D.
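Check help flipLabels for its actual arguments; conceptually, flipping a fraction p of the labels amounts to something like:
p = 0.1;                                   % fraction of labels to flip (arbitrary choice)
Ytr_noisy = Ytr;
flip = randperm(numel(Ytr), round(p*numel(Ytr)));
Ytr_noisy(flip) = -Ytr_noisy(flip);        % valid because labels are in {-1,+1}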
- 3.C What happens
with different values of p (percentage of points held out) and rep
(number of repetitions of the experiment)?
- 3.D So far we have used the training set to obtain a classifier. We now want to evaluate its performance on an independent test set.
- Consider the test dataset [Xte,Yte] generated in section 1.D.
Add some noise to the dataset by randomly flipping some of the labels of Yte.
You can use the function flipLabels to create [Xte,Yte_noisy].
- Take the best k you obtained by hold-out cross validation in 3.B, and use it to get a prediction from Xtr, Ytr_noisy, Xte, as you did in part 2.
- Evaluate the prediction with respect to Yte_noisy (as you did in 2.C2), and compare the result to the validation error you obtained in 3.B (see the sketch after this list).
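Putting 3.D together, under the same assumed interfaces as above (bestK stands for whatever value of k you recorded in 3.B):
% flip a fraction of the test labels (or use flipLabels, see its help)
flip = randperm(numel(Yte), round(0.1*numel(Yte)));
Yte_noisy = Yte;
Yte_noisy(flip) = -Yte_noisy(flip);
% predict with the k selected in 3.B and measure the test error
Ypred = kNNClassify(Xtr, Ytr_noisy, bestK, Xte);
testErr = mean(Ypred(:) ~= Yte_noisy(:));
fprintf('test error with k = %d: %.3f\n', bestK, testErr);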
4. If you have time - More
experiments
- 4.A Evaluate the results as the size of the training set grows: n = 10, 20, 50, 100, 300, ... (of course, k needs to be chosen accordingly; one way to organize this is sketched below).
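One way to organize 4.A, again under the assumed interfaces (the square-root rule for k below is just a common heuristic; reselecting k by cross validation as in 3.A would be more principled):
ns = [10 20 50 100 300];                   % values of n from 4.A
testErr = zeros(size(ns));
centers = [[0;0],[0;1],[1;1],[1;0]];
for j = 1:numel(ns)
    [Xn, Cn] = MixGauss(centers, [0.3,0.3,0.3,0.3], ns(j));
    Yn = 2*mod(Cn,2) - 1;                  % same relabeling as in 1.C
    k = 2*floor(sqrt(ns(j))/2) + 1;        % sqrt heuristic, kept odd to avoid ties
    Yp = kNNClassify(Xn, Yn, k, Xte);
    testErr(j) = mean(Yp(:) ~= Yte(:));
end
figure; semilogx(ns, testErr);
xlabel('n'); ylabel('test error');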
- 4.B Generate more complex datasets with the MixGauss function, for instance by choosing larger standard deviations in the data-generation step.