Machine Learning Day
Lab: Real Data - Handwritten Image Classification
This lab applies a full learning pipeline to a real dataset, namely images of handwritten digits. You will use a subset of the MNIST dataset for a binary classification task.
- Get the code file and add its directory to the MATLAB path (or set it as the current/working directory).
- Follow the instructions to work your way through the lab.
1. MNIST Data
- Load the dataset "MNIST_3_5", using
load('MNIST_3_5.mat');
to obtain X and Y, the matrices of examples and labels respectively.
- Each row of X is a vectorized 28x28 grayscale image of a handwritten digit from the MNIST dataset. The positive examples are '5' while the negative examples are '3'. Visualize some examples from the dataset using the function visualizeExample.
- Randomly split the dataset into a training set and a test set of n_train = 100 and n_test = 1000 points respectively, using randomSplitDataset.
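The random split and the row-to-image layout described in this section can be sketched in plain MATLAB. This is only an illustration under stated assumptions: it uses synthetic data in place of MNIST_3_5.mat, it does not call or reproduce randomSplitDataset or visualizeExample (their signatures are not given here), and the variable names idx_tr, idx_te and img are hypothetical.

```matlab
% Synthetic stand-in for the lab data.
% Assumption: X is n-by-784 (one vectorized 28x28 image per row), Y is n-by-1 in {-1,+1}.
rng(0);                                % fix the seed so the split is reproducible
n = 1200; d = 28*28;
X = rand(n, d);
Y = sign(randn(n, 1));
Y(Y == 0) = 1;                         % guard against an exact zero label

n_train = 100;
n_test  = 1000;

% Random split without overlap: permute the indices, then slice.
perm   = randperm(n);
idx_tr = perm(1:n_train);
idx_te = perm(n_train+1 : n_train+n_test);

Xtr = X(idx_tr, :);  Ytr = Y(idx_tr);
Xte = X(idx_te, :);  Yte = Y(idx_te);

% One row of X reshaped back into a 28x28 image.
% Assumption: column-major storage; the real data may need a transpose (img').
img = reshape(Xtr(1, :), 28, 28);
% To display it: figure; imagesc(img); colormap(gray); axis image off;
```

Slicing a single permutation guarantees that no example ends up in both sets, which is what makes the test error an honest estimate.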
2. Learning pipeline - Binary classification
- Choose a learning algorithm from the ones you have already used, namely: kNN, regularized linear least squares, or kernel regularized least squares (Gaussian or polynomial kernel). Compute the classification error on the test set using the function calcErr. To select the best tuning parameters, define a suitable range for the parameters and use hold-out cross-validation (using holdoutCVkNN, holdoutCVRLS, or holdoutCVKernRLS).
- Predict the labels for the test set and visualize some of the misclassified examples:
ind = find((sign(Yp)~=sign(Yte)));
idx = ind(randi(numel(ind)));
figure; visualizeExample(Xte(idx,:));
- (Optional) Compute the mean and standard deviation of the test error over a number of random splits into training and test sets, for an increasing number of training examples, e.g. [10, 400]. For each split, compute the test error by choosing suitable parameter range(s) and applying hold-out CV on the training set. What happens to the mean and standard deviation across splits when the number of examples in the training set increases? How do these values depend on the settings of hold-out CV (range of parameters, number of repetitions, and the fraction of the training set used as validation)?
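As a rough illustration of the pipeline in this section, the following sketch selects the regularization parameter of linear regularized least squares by hold-out cross-validation and then measures the test error. It is a minimal sketch under assumptions, not the lab's own implementation: it uses synthetic data, it does not call holdoutCVRLS or calcErr, the scaling of lambda by the number of points is one common convention that the lab code may not share, and names such as lambdas, val_frac and n_rep are hypothetical.

```matlab
% Synthetic binary classification problem (stand-in for the MNIST subset).
rng(1);
n = 200; d = 20;
w_true = randn(d, 1);
X = randn(n, d);
Y = sign(X * w_true + 0.5 * randn(n, 1));
Y(Y == 0) = 1;

% Split off a test set that is never used for parameter selection.
n_train = 100;
perm = randperm(n);
Xtr = X(perm(1:n_train), :);      Ytr = Y(perm(1:n_train));
Xte = X(perm(n_train+1:end), :);  Yte = Y(perm(n_train+1:end));

lambdas  = logspace(-4, 1, 12);   % candidate regularization parameters
val_frac = 0.3;                   % fraction of the training set held out for validation
n_rep    = 10;                    % number of random hold-out repetitions
n_val    = round(val_frac * n_train);

val_err = zeros(n_rep, numel(lambdas));
for r = 1:n_rep
    p  = randperm(n_train);
    iv = p(1:n_val);              % validation indices
    it = p(n_val+1:end);          % fitting indices
    for k = 1:numel(lambdas)
        % Regularized least squares on the fitting part only.
        A = Xtr(it, :)' * Xtr(it, :) + lambdas(k) * numel(it) * eye(d);
        w = A \ (Xtr(it, :)' * Ytr(it));
        % Fraction of misclassified validation points.
        val_err(r, k) = mean(sign(Xtr(iv, :) * w) ~= Ytr(iv));
    end
end

% Pick the lambda with the lowest mean validation error.
[~, best] = min(mean(val_err, 1));
lambda_best = lambdas(best);

% Refit on the full training set and evaluate once on the test set.
w  = (Xtr' * Xtr + lambda_best * n_train * eye(d)) \ (Xtr' * Ytr);
Yp = Xte * w;
test_err = mean(sign(Yp) ~= sign(Yte));
fprintf('lambda = %g, test error = %.3f\n', lambda_best, test_err);
```

For the optional exercise, the same steps could be wrapped in an outer loop over random splits and training set sizes, storing test_err for each run and applying mean and std to the stored values.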