This lab is about local methods for binary classification on synthetic
data. The goal of the lab is to get familiar with the kNN algorithm and
to get a practical understanding of what we discussed in class. Follow
the instructions below. Think hard before you call the instructors!
Extract the zip file in a
folder and set the MATLAB path to that folder.
1.A Call the function MixGauss with appropriate parameters to produce a dataset with four classes and 30 samples per class: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.3. The function call should look like [Xtr, Ytr] = MixGauss(...).
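A hedged sketch of the call follows; the signature MixGauss(means, sigmas, n) is an assumption here (with means a d-by-p matrix holding one class center per column, sigmas the per-class spread, and n the number of samples per class), so check "help MixGauss" for the actual interface:

```matlab
% Assumed signature: MixGauss(means, sigmas, n) -- verify with "help MixGauss".
means  = [0 0; 0 1; 1 1; 1 0]';   % 2-by-4 matrix, one column per class center
sigmas = 0.3 * ones(1, 4);        % spread per class; check in the help text
                                  % whether MixGauss expects a variance or a
                                  % standard deviation
[Xtr, Ytr] = MixGauss(means, sigmas, 30);   % 30 samples per class
```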
1.B Use the MATLAB function "scatter" to plot the points.
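For instance, coloring each training point by its class label:

```matlab
% Color the training points by class label (25 is the marker size)
scatter(Xtr(:,1), Xtr(:,2), 25, Ytr, 'filled');
title('Training set');
```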
1.C Manipulate the data so as to obtain a 2-class problem where data on opposite corners share the same class, with labels +1 and -1. Hint: if you produced the data following the order of centers given above, you can use a mapping like Ytr = 2*(1/2 - mod(Ytr, 2));
1.D Similarly generate a "test set" [Xte, Yte], drawn from the same distribution (start with 200 samples per class). Again, plot this dataset using scatter.
The k-Nearest Neighbors algorithm (kNN) assigns to a test point the most frequent label of its k closest examples in the training set.
2.A Have a look at the code of the function kNNClassify (for a quick reference, type "help kNNClassify" in the MATLAB command window).
2.B Use kNNClassify on the 2-class data generated at step 1. Pick a "reasonable" k.
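A sketch of the call, assuming the signature kNNClassify(Xtr, Ytr, k, Xte) (confirm with "help kNNClassify"):

```matlab
k = 5;                                   % an arbitrary "reasonable" choice
Ypred = kNNClassify(Xtr, Ytr, k, Xte);   % assumed signature; see the help text
```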
2.C Plot the data so as to visualize the obtained results. A possible way is to mark the wrongly classified points with different colors/markers.
2.D Compute the classification error on the test set, i.e., the fraction of misclassified test points.
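One possible sketch of such a plot (Ypred denotes the predictions on Xte obtained from kNNClassify, under the signature assumed above), which also yields the test classification error along the way:

```matlab
% Overlay the misclassified test points with a distinct marker
wrong = (Ypred ~= Yte);                           % logical mask of the errors
scatter(Xte(:,1), Xte(:,2), 25, Yte, 'filled');   % all test points, by true label
hold on
scatter(Xte(wrong,1), Xte(wrong,2), 60, 'k', 'x');  % black crosses on the errors
hold off
errRate = mean(wrong);                            % fraction of misclassified points
```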
2.E To visualize the separating function (and thus visualize what areas of the 2D plane are associated with each class) you can use the function separatingFkNN and again plot the test points with scatter.
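If you want to see what such a visualization does under the hood, a decision-region plot can also be sketched by hand on a dense grid; this bypasses separatingFkNN and relies only on kNNClassify, with the signature assumed above:

```matlab
% Evaluate the classifier on a grid covering the data and color by prediction
[xg, yg] = meshgrid(linspace(-1, 2, 100), linspace(-1, 2, 100));
grid2d = [xg(:), yg(:)];
Zg = kNNClassify(Xtr, Ytr, k, grid2d);            % assumed signature
contourf(xg, yg, reshape(Zg, size(xg)));          % colored decision regions
hold on
scatter(Xte(:,1), Xte(:,2), 25, Yte, 'filled');   % test points on top
hold off
```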
So far we considered an arbitrary choice for k.
3.A Perform a hold-out cross validation procedure on the available training data for a large range of candidate values of k (e.g., k = 1, 3, 5, ..., 41). Repeat the hold-out experiment rep = 10 times, using at each iteration p = 30% of the training set for validation. You can use the provided function holdoutCVkNN (type "help holdoutCVkNN" for an example of use). Plot the training and validation errors for the different values of k. How would you now answer the question "what is the best value for k"? Note: with the parameters rep = 10 and p = 0.3, the hold-out procedure may be quite unstable.
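A hedged sketch of the procedure; the input order and the output list of holdoutCVkNN used below are assumptions, so rely on "help holdoutCVkNN" for the actual interface:

```matlab
intK = 1:2:41;   % candidate values for k
rep  = 10;       % repetitions of the hold-out split
p    = 0.3;      % fraction of the training data held out for validation
% Assumed outputs: best k, mean/std of the validation error, mean/std of
% the training error over the rep repetitions
[bestK, Vm, Vs, Tm, Ts] = holdoutCVkNN(Xtr, Ytr, p, rep, intK);
plot(intK, Tm, 'b', intK, Vm, 'r');
legend('training error', 'validation error'); xlabel('k');
```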
3.B How is the chosen value of k affected by p (the percentage of points held out) and by rep, the number of repetitions of the experiment (e.g., 1, 5, 30, 50, 100)? What does a large number of repetitions provide?
3.C Apply the model obtained by cross validation (i.e., best k) to the test set (Xte) and see if there is an improvement on the classification error over the result of 2.D.
4.A Dependence on the training size: evaluate the performance as the size of the training set grows, e.g., n = 3, 5, 20, 50, 100, 300, 500, 1000, ... How would you choose a good range for k as n changes? Repeat the validation and test multiple times. What can you say about the stability of the solution/performance?
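One possible skeleton for this experiment; it reuses the signatures for MixGauss, holdoutCVkNN, and kNNClassify assumed in the earlier sketches:

```matlab
sizes   = [3 5 20 50 100 300 500 1000];   % samples per class
testErr = zeros(size(sizes));
for i = 1:numel(sizes)
    n = sizes(i);
    [Xtr, Ytr] = MixGauss(means, sigmas, n);   % assumed signature, as in 1.A
    Ytr = 2*(1/2 - mod(Ytr, 2));               % 2-class mapping from 1.C
    % Keep the candidate range for k below the training set size
    [bestK, ~, ~, ~, ~] = holdoutCVkNN(Xtr, Ytr, 0.3, 10, 1:2:min(n, 41));
    Ypred = kNNClassify(Xtr, Ytr, bestK, Xte); % assumed signature
    testErr(i) = mean(Ypred ~= Yte);           % test error for this n
end
plot(sizes, testErr); xlabel('n (samples per class)'); ylabel('test error');
```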
4.B Try classifying more difficult datasets generated through MixGauss, for instance by increasing the variance or by adding noise, i.e., randomly flipping some labels in the training set.
4.C Modify function kNNClassify to handle multi-class problems.
4.D Modify function kNNClassify to handle regression problems.
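In both modifications, the core change is how the labels of the k nearest neighbors are aggregated. A hedged stand-alone sketch of the two rules for a single test point x (Xtr, Ytr, and k are assumed to exist; the actual edit goes inside kNNClassify):

```matlab
% Distances from one test point x (1-by-d) to every training point;
% implicit expansion of Xtr - x requires MATLAB R2016b or later
d = sum((Xtr - x).^2, 2);        % squared Euclidean distances (n-by-1)
[~, idx] = sort(d);
neigh = Ytr(idx(1:k));           % labels of the k nearest neighbors

% 4.C multi-class: take the most frequent neighbor label (majority vote)
yClass = mode(neigh);

% 4.D regression: average the neighbors' real-valued targets
yReg = mean(neigh);
```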