Machine Learning Day

Lab: Real Data - Handwritten Digit Challenge

This lab is a challenge: apply a complete learning pipeline to a real dataset (images of handwritten digits).

  • Get the code, unzip it, and add the directory to the MATLAB path (or set it as your current/working directory).
  • Follow the instructions to work your way through both parts.

1. MNIST Data - Binary Classification

  1. Load the dataset "MNIST_3_5" using load('MNIST_3_5.mat'); this creates X and Y, the matrices of examples and labels respectively.
  2. Each row of X is a vectorized 28x28 grayscale image of a handwritten digit from the MNIST dataset. The positive examples are images of '5', the negative ones images of '3'. Visualize some examples from the dataset using the function visualizeExample.
  3. Analyze the eigenvalues of the Gram matrix for the polynomial kernel (e.g. use the MATLAB eig function) for different values of deg, and plot them using semilogy. What happens as deg increases (e.g. deg = 1, 2, ..., 10)? Why?
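To build intuition for this step, here is a Python/NumPy sketch of the same experiment on random toy data (the lab itself uses MATLAB's eig and semilogy on the MNIST images; the data here is only a stand-in):

```python
import numpy as np

# Toy stand-in for the image data: 40 examples with pixel-like values in [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 60))

for deg in (1, 2, 5, 10):
    K = (1.0 + X @ X.T) ** deg            # polynomial kernel Gram matrix
    ev = np.linalg.eigvalsh(K)[::-1]      # eigenvalues, largest first
    # In MATLAB you would plot ev with semilogy: the spectrum decays
    # faster (relative to the largest eigenvalue) as deg grows, so the
    # Gram matrix becomes increasingly ill-conditioned.
    print(f"deg={deg:2d}  max={ev[0]:.3e}  min/max={ev[-1] / ev[0]:.3e}")
```

Since the polynomial kernel is positive semidefinite, all eigenvalues are nonnegative up to numerical error; the interesting quantity is how quickly they decay.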
  4. Repeat Sec. 2 with fewer training points (e.g. 70, 50, 30, 20), chosen randomly, and with 5% of the labels flipped. How do the selected parameters vary with the number of points?
  5. Randomly split the dataset into a training and a test set, of n_train = 100 and n_test = 1000 points respectively:
    [Xtr, Ytr, Xts, Yts] = randomSplitDataset(X, Y, n_train, n_test);
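The idea behind such a split, sketched in Python/NumPy (the exact behavior of the MATLAB helper randomSplitDataset is an assumption; this only illustrates the concept):

```python
import numpy as np

def random_split(X, Y, n_train, n_test, seed=0):
    """Shuffle the examples, then take disjoint train and test subsets.
    Illustrative sketch of the idea behind randomSplitDataset."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    tr = perm[:n_train]
    ts = perm[n_train:n_train + n_test]
    return X[tr], Y[tr], X[ts], Y[ts]
```

Shuffling before slicing ensures the two subsets are disjoint and drawn without order bias.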
  6. Choose a learning algorithm from the ones you have already encountered (and experimented with), namely: kNN, regularized linear least squares, or regularized kernel least squares (Gaussian or polynomial kernel). Use the split computed above to compute the classification error on the test set with the function calcErr. To select the best tuning parameters, define a suitable range for the parameters and use hold-out cross-validation (via the functions holdoutCVkNN, holdoutCVRLS, holdoutCVKernRLS that you used before).
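The hold-out selection logic, sketched in Python/NumPy for the regularized linear least squares case (the interface and internals of the MATLAB holdoutCVRLS may differ; the names and the regularization scaling here are assumptions):

```python
import numpy as np

def holdout_cv_rls(X, Y, lambdas, val_frac=0.3, seed=0):
    """Pick the regularization parameter with the lowest validation error.
    Sketch of the idea behind holdoutCVRLS, not its actual interface."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(Y))
    n_val = int(round(val_frac * len(Y)))
    val, tr = perm[:n_val], perm[n_val:]
    d = X.shape[1]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        # Ridge solution on the training part: (X'X + lam*n*I) w = X'Y.
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * len(tr) * np.eye(d),
                            X[tr].T @ Y[tr])
        err = np.mean(np.sign(X[val] @ w) != np.sign(Y[val]))
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```

The key point is that the parameter is chosen on a validation part of the training set only; the test set is never touched during selection.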
  7. Predict the labels for the test set and visualize some of the misclassified examples:
    ind = find(sign(Ypred) ~= sign(Yts));
    idx = ind(randi(numel(ind)));
    figure; visualizeExample(Xts(idx,:));
  8. (Optional) Compute the mean and standard deviation of the test error over a number of random splits into training and test sets, for an increasing number of training examples (10, 20, 50, 100, 200, 400). For each split, compute the test error (n_test = 1000) by choosing suitable parameter range(s) and applying hold-out CV on the training set. What happens to the mean and standard deviation across different random splits as the number of training examples increases? How do these values depend on the hold-out CV settings (parameter range, number of repetitions, and the fraction of the training set used for validation)?

2. Challenge Data

The problem that you are given is the classification of images of handwritten digits (unknown set and distribution) and specifically the binary classification problem of discriminating '4' from '2'. The file Challenge_Train.mat contains the given training set (Xtr, Ytr), drawn from the unknown distribution, with each row of Xtr being a vectorized 16x16 grayscale image of a digit, and Ytr the vector of class labels. You will train a model on the given data, to be subsequently evaluated on a test set from the same distribution.

  1. Use your functions to select the algorithm and model parameters given the training set. In particular, the algorithm must be chosen among the following: k-Nearest Neighbors, Linear RLS, Kernel RLS.
  2. Once you are satisfied with your algorithm and chosen parameters, produce a script that solves the classification problem for a generic test set Xts. The output should be the binary prediction label vector Ypred on Xts. Script example:
    sigma = 0.1;
    lambda = 0.02;
    kernel = 'gaussian';
    w = regularizedKernLSTrain(Xtr, Ytr, kernel, sigma, lambda);
    Ypred = regularizedKernLSTest(w, Xtr, kernel, sigma, Xts);
    Note: The parameters (in the given example: sigma, lambda and the kernel type) must be constants (scalars and strings) and the script must contain only the train and test functions of one of the algorithms specified above.
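For intuition, the train/test pair in the example computes standard Gaussian-kernel regularized least squares. A Python/NumPy sketch of that computation follows (this is an assumption about what the MATLAB functions do internally, not their actual code):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise Gaussian kernel matrix between the rows of A and B."""
    d2 = np.maximum(np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
                    - 2.0 * A @ B.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def kern_rls_train(Xtr, Ytr, sigma, lam):
    """Solve (K + lam*n*I) c = Ytr for the coefficient vector c."""
    n = len(Ytr)
    K = gaussian_kernel(Xtr, Xtr, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), Ytr)

def kern_rls_test(c, Xtr, sigma, Xts):
    """Predict scores for the test points; take the sign for class labels."""
    return gaussian_kernel(Xts, Xtr, sigma) @ c
```

Note that prediction needs both the coefficients and the training inputs, which is why the MATLAB test function takes Xtr as an argument.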

Submission instructions:

  • Send your script through an email with subject: CBMM ML Challenge – Surname Name
  • Put the script in the body of the email.
  • Attach also a file with any code you used to select an algorithm and find the parameters.
  • Your script will be run on an unseen test set to compute the classification error, i.e., the score of your learning model for this challenge.