Extract the zip file into a directory and add that directory, together with its subdirectories (which contain the data), to the MATLAB path. Follow the instructions below, and think and try hard before you call the instructors!
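One way to set the path from the command window is sketched below. It assumes that the current folder is the one where the zip file was extracted; labDir is a name introduced here for illustration only.

```matlab
% Assumption: the current folder is where the zip file was extracted.
labDir = pwd;
% genpath returns labDir together with all of its subfolders;
% addpath puts every one of them on the MATLAB search path.
addpath(genpath(labDir));
```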
1.A Load the dataset "MNIST_3_5", using: load('MNIST_3_5.mat'); to produce two matrices X and Y, the first containing examples and the second the labels.
1.B Each row of X is a vectorized 28x28 grayscale image of a handwritten digit from the MNIST dataset. The positive examples are '5' while the negative are '3'. Visualize some examples from the dataset by using the function visualizeExample.
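To see what "vectorized" means here, the following minimal sketch reshapes a 1x784 row back into a 28x28 image and displays it. It is not the provided visualizeExample function; it uses a random stand-in row, and the transpose before display is an assumption about the storage order of the pixels.

```matlab
% Synthetic stand-in for one row of X; with the real data use X(i, :).
x = rand(1, 28*28);
% Reshape the row vector back into a 28x28 image (column-major order).
img = reshape(x, 28, 28);
% Display in grayscale; the transpose is an assumption about pixel order.
imagesc(img');
colormap(gray);
axis image;
axis off;
```

If the digit appears rotated or mirrored on the real data, drop the transpose.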
1.C Randomly split the dataset into a training set and a test set, of n_train = 100 and n_test = 1000 points respectively:
[Xtr, Ytr, Xts, Yts] = randomSplitDataset(X, Y, n_train, n_test);
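For intuition, a random split of this kind can be sketched with randperm as below. This is not the code of randomSplitDataset, only an illustration of the idea on synthetic data. The names Xsyn and Ysyn and the small sizes are assumptions chosen so the snippet does not overwrite the loaded X and Y, although it does assign Xtr, Ytr, Xts and Yts.

```matlab
% Synthetic stand-in data: 50 examples, 4 features, labels in {-1, +1}.
Xsyn = randn(50, 4);
Ysyn = sign(randn(50, 1));
n_train = 10;
n_test = 20;

% Shuffle the row indices, then take two disjoint blocks of them.
idx = randperm(size(Xsyn, 1));
trIdx = idx(1:n_train);
tsIdx = idx(n_train+1 : n_train+n_test);

% Training and test sets share no example.
Xtr = Xsyn(trIdx, :);  Ytr = Ysyn(trIdx);
Xts = Xsyn(tsIdx, :);  Yts = Ysyn(tsIdx);
```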
1.D Choose a learning algorithm from the ones you have already encountered and experimented with, namely: kNN, regularized linear least squares, or regularized kernel least squares (Gaussian or polynomial kernel). Use the split of 1.C to compute the classification error on the test set with the function calcErr. To select the best tuning parameters, define a suitable range for each parameter and use hold-out cross-validation (with the functions holdoutCVkNN, holdoutCVRLS, holdoutCVKernRLS that you used before).
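The idea behind hold-out cross-validation can be sketched for regularized linear least squares as follows. This is a simplified illustration on synthetic data, not the implementation of holdoutCVRLS: it uses a single random hold-out split, and the validation fraction, the parameter grid and the scaling of lambda by the number of training points are all assumptions.

```matlab
% Synthetic stand-in training data: labels from a noisy linear rule.
n = 100; d = 5;
Xsyn = randn(n, d);
Ysyn = sign(Xsyn * randn(d, 1) + 0.1 * randn(n, 1));

% Hold out a fraction of the training points for validation.
valFrac = 0.3;                      % assumed validation fraction
idx = randperm(n);
nVal = round(valFrac * n);
vaIdx = idx(1:nVal);
trIdx = idx(nVal+1:end);

% Candidate regularization parameters on a logarithmic grid (assumed range).
lambdas = logspace(-4, 1, 10);
valErr = zeros(size(lambdas));
for k = 1:numel(lambdas)
    Xt = Xsyn(trIdx, :);
    % Regularized least squares: solve (Xt'*Xt + lambda*nTr*I) w = Xt'*y.
    w = (Xt' * Xt + lambdas(k) * numel(trIdx) * eye(d)) \ (Xt' * Ysyn(trIdx));
    % Fraction of validation points whose predicted sign is wrong.
    valErr(k) = mean(sign(Xsyn(vaIdx, :) * w) ~= Ysyn(vaIdx));
end

% Keep the parameter with the smallest validation error.
[bestErr, best] = min(valErr);
bestLambda = lambdas(best);
```

Repeating the split several times and averaging valErr before taking the minimum makes the selection less dependent on one particular split.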
1.E Predict the labels for the test set and visualize some of the misclassified examples.
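Finding the misclassified examples amounts to comparing predicted and true labels. The sketch below assumes labels in {-1, +1} and real-valued predictions that are thresholded at zero; Ytrue and Yscore are made-up stand-ins for Yts and the output of your chosen algorithm.

```matlab
% Synthetic stand-ins: true test labels and real-valued predictions.
Ytrue = [1; -1; 1; -1; 1];
Yscore = [0.8; 0.3; -0.2; -0.9; 0.1];

% Threshold the scores at zero to obtain labels in {-1, +1}.
Ypred = sign(Yscore);

% Row indices of the examples whose predicted label is wrong.
misIdx = find(Ypred ~= Ytrue);

% Fraction of misclassified test points.
testErr = numel(misIdx) / numel(Ytrue);
```

The rows Xts(misIdx, :) are then the images to inspect.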
1.F (Optional) Compute the mean and standard deviation of the test error over a number of random splits into training and test sets, for an increasing number of training examples (10, 20, 50, 100, 200, 400). For each split, compute the test error (n_test = 1000) by choosing suitable parameter range(s) and applying hold-out CV on the training set. What happens to the mean and standard deviation across different random splits when the number of examples in the training set increases? How do these values depend on the settings of hold-out CV (range of parameters, number of repetitions and the fraction of the training set used as validation)?
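The structure of this experiment can be sketched as two nested loops, one over training-set sizes and one over random splits. The code below runs on synthetic data and, to stay short, keeps the regularization parameter fixed instead of selecting it by hold-out CV on each split. The number of repetitions, the value of lambda and the data itself are assumptions.

```matlab
% Synthetic stand-in dataset with labels from a noisy linear rule.
N = 2000; d = 10;
Xsyn = randn(N, d);
Ysyn = sign(Xsyn * randn(d, 1) + 0.5 * randn(N, 1));

nTrainList = [10 20 50 100 200 400];
nTest = 1000;
nRep = 20;                          % assumed number of random splits
lambda = 1e-2;                      % fixed here; the exercise selects it by hold-out CV

errs = zeros(nRep, numel(nTrainList));
for j = 1:numel(nTrainList)
    nTr = nTrainList(j);
    for r = 1:nRep
        % Fresh random split into disjoint training and test indices.
        idx = randperm(N);
        tr = idx(1:nTr);
        ts = idx(nTr+1 : nTr+nTest);
        % Regularized least squares fit on this split.
        w = (Xsyn(tr, :)' * Xsyn(tr, :) + lambda * nTr * eye(d)) \ (Xsyn(tr, :)' * Ysyn(tr));
        % Test error: fraction of test points with the wrong predicted sign.
        errs(r, j) = mean(sign(Xsyn(ts, :) * w) ~= Ysyn(ts));
    end
end

meanErr = mean(errs, 1);            % mean test error per training-set size
stdErr = std(errs, 0, 1);           % standard deviation across splits
```

Plotting meanErr against nTrainList with errorbar(nTrainList, meanErr, stdErr) gives a compact view of both quantities.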
The problem you are given is the classification of images of handwritten digits, specifically the binary classification problem of discriminating '4' from '2'. The file Challenge_Train.mat contains the training set (Xtr, Ytr), where each row of Xtr is a vectorized 16x16 grayscale image of a digit and Ytr is the vector of class labels. You will train a model on the given data, which will subsequently