In this lab we will address the problem of data analysis with reference to a classification problem. Get the zip file and follow the instructions below. Think hard before you call the instructors!
1. Generate a 2-class dataset of D-dimensional points with N points per class. Start with N = 100 and D = 30 and create a train and a test set.
1.A The first two variables (dimensions) of each point will be generated by MixGauss, i.e., drawn from two Gaussian distributions with centroids (1, 1) and (-1, -1), both with sigma 0.7 (the first class has label Y = 1, the second Y = -1). Adjust the output labels of the classes to be {1, -1} respectively, e.g. using Ytr(Ytr==2) = -1.
1.C The remaining (D-2) variables will be generated as Gaussian noise with sigma_noise = 0.01.
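A minimal sketch of the generation step, assuming MixGauss takes a matrix of centroids, a vector of sigmas and the number of points per class (check 'help MixGauss' for the actual signature):

```matlab
% Assumed interface: [X, Y] = MixGauss(means, sigmas, n)
% means: 2 x k centroids (one column per class), sigmas: k x 1, n: points per class
N = 100; D = 30; sigma_noise = 0.01;

[Xtr, Ytr] = MixGauss([1 -1; 1 -1], [0.7; 0.7], N);   % training set
[Xts, Yts] = MixGauss([1 -1; 1 -1], [0.7; 0.7], N);   % test set

Ytr(Ytr == 2) = -1;   % relabel classes {1, 2} -> {1, -1}
Yts(Yts == 2) = -1;

% Append the (D-2) pure-noise dimensions to the 2 informative ones
Xtr = [Xtr, sigma_noise * randn(2*N, D-2)];
Xts = [Xts, sigma_noise * randn(2*N, D-2)];
```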
2.A Compute the principal components of the training set, using the provided function PCA (see help PCA).
2.B Plot the first component of X_proj (as a line, using plot), the first 2 (scatter(X_proj(:,1), X_proj(:,2), 25, Ytr);) and the first 3 components (scatter3(X_proj(:,1), X_proj(:,2), X_proj(:,3), 25, Ytr);). Reason about the meaning of the obtained plots and results. What is the effective dimensionality of this dataset?
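Assuming PCA returns the projected data (the hypothetical interface below should be checked against 'help PCA'), the three plots can be produced along these lines:

```matlab
% Hypothetical interface: X_proj = PCA(X, k) projects X onto the first k components
X_proj = PCA(Xtr, 3);

figure; plot(X_proj(:,1));                                          % 1st component as a line
figure; scatter(X_proj(:,1), X_proj(:,2), 25, Ytr);                 % first 2 components
figure; scatter3(X_proj(:,1), X_proj(:,2), X_proj(:,3), 25, Ytr);   % first 3 components
```

Coloring the points by Ytr shows how well the classes separate along the leading components.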
3.A Standardize the data matrix, so that each column has mean 0 and standard deviation 1 (use vectorized implementations for speed!). Use the statistics of the train set Xtr (mean and standard deviation) to standardize the corresponding test set Xts.
Useful commands:
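A vectorized sketch of the standardization, applying the train statistics to both sets:

```matlab
m = mean(Xtr);   % 1 x D vector of column means of the training set
s = std(Xtr);    % 1 x D vector of column standard deviations

% Subtract the train mean and divide by the train std, column-wise.
% Implicit expansion handles the row-vector broadcast; on older MATLAB
% versions use repmat(m, size(Xtr, 1), 1) instead.
Xtr = (Xtr - m) ./ s;
Xts = (Xts - m) ./ s;   % note: train statistics, not test statistics
```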
3.B Use the orthogonal matching pursuit algorithm (type 'help OMatchingPursuit') with T repetitions, to obtain T-1 coefficients for a sparse approximation of the training set Xtr. Plot the resulting coefficients w using stem(1:D, w). What is the output when setting T = 3, and what is the interpretation of the indices of the first active dimensions (coefficients)?
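Assuming OMatchingPursuit takes the data, the labels and the number of repetitions T (check 'help OMatchingPursuit' for the real signature), the sparse coefficients can be computed and inspected with:

```matlab
T = 3;                               % number of repetitions
w = OMatchingPursuit(Xtr, Ytr, T);   % hypothetical signature: returns a D x 1 vector
figure; stem(1:D, w);                % nonzero entries mark the active dimensions
```

The indices of the nonzero coefficients identify which of the D dimensions the algorithm selects as informative.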
3.C Check the predicted labels on the training (and test) set when approximating the output using w:
Ypred = sign(Xts * w);
err = calcErr(Yts, Ypred);
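If calcErr is not among the provided files, a plausible implementation (a sketch of the assumed behavior, not necessarily the provided one) simply measures the fraction of misclassified points:

```matlab
function err = calcErr(Y, Ypred)
% CALCERR Fraction of points where prediction and ground truth disagree.
%   Assumes labels in {1, -1}; sign() guards against non-binary inputs.
    err = mean(sign(Y) ~= sign(Ypred));
end
```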
4.A Compare the results of sections 2 and 3 when choosing