Machine Learning Day

Lab 3: Dimensionality reduction and feature selection

In this lab we will address the problem of dimensionality reduction through Principal Component Analysis (PCA) and feature selection through Orthogonal Matching Pursuit (OMP) on a synthetic dataset.

  • Get the code, unzip it and add the directory to the MATLAB path (or set it as the current/working directory).
  • Work your way through the examples below, by following the instructions.

1. Data Generation

Generate a 2-class dataset of D-dimensional points with N points for each class. Start with N = 100 and D = 30 and create a training set and a test set.

  1. The first two variables (dimensions) of each point will be generated by MixGauss, i.e., drawn from two Gaussian distributions with centroids (1, 1) and (-1, -1) and standard deviation 0.7 (the first associated with class Y = 1, the second with Y = -1). Adjust the output labels of the classes to be {1, -1} respectively, e.g. using Ytr(Ytr==2) = -1.
  2. You can plot these two variables in the 2D plane using scatter():
    scatter(X2tr(:,1), X2tr(:,2), 25, Ytr);
    scatter(X2ts(:,1), X2ts(:,2), 25, Yts);
  3. The remaining (D-2) variables will be generated as Gaussian noise with standard deviation sigma_noise = 0.01:
    Xtr_noise = sigma_noise*randn(2*N, D-2);
    Xts_noise = sigma_noise*randn(2*N, D-2);
  4. The final train and test data matrices are obtained by concatenating the informative and noise dimensions (a complete data-generation sketch is given after this list):
    Xtr = [X2tr Xtr_noise];
    Xts = [X2ts Xts_noise];
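
The sketch below puts the whole data-generation step together. The MixGauss call signature used here (centroids as columns of the first argument, one sigma per class, N points per class, labels in {1, 2}) is an assumption; check help MixGauss for the actual interface. Everything else follows the snippets above.

    % Data generation (Section 1) -- sketch, assuming MixGauss(means, sigmas, n)
    N = 100;            % points per class
    D = 30;             % total number of dimensions
    sigma_noise = 0.01; % std of the pure-noise dimensions

    % first two (informative) dimensions; columns of the first argument are the centroids
    [X2tr, Ytr] = MixGauss([1 -1; 1 -1], [0.7; 0.7], N);
    [X2ts, Yts] = MixGauss([1 -1; 1 -1], [0.7; 0.7], N);
    Ytr(Ytr==2) = -1;   % relabel the second class as -1
    Yts(Yts==2) = -1;

    % remaining D-2 dimensions: pure Gaussian noise
    Xtr_noise = sigma_noise*randn(2*N, D-2);
    Xts_noise = sigma_noise*randn(2*N, D-2);

    % final data matrices
    Xtr = [X2tr Xtr_noise];
    Xts = [X2ts Xts_noise];

    % quick look at the two informative dimensions
    figure; scatter(X2tr(:,1), X2tr(:,2), 25, Ytr); title('train');
    figure; scatter(X2ts(:,1), X2ts(:,2), 25, Yts); title('test');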

2. Principal Component Analysis

  1. Compute the principal components of the training set, using the provided function PCA (a usage sketch is given after this list).
  2. Plot the first component of X_proj (as a line, using plot), the first two components (scatter(X_proj(:,1), X_proj(:,2), 25, Ytr);) and the first three components (scatter3(X_proj(:,1), X_proj(:,2), X_proj(:,3), 25, Ytr);). Reason about the meaning of the obtained plots and results. What is the effective dimensionality of this dataset?
  3. Visualize the square roots of the first k = 10 eigenvalues, and the coefficients of the eigenvector associated with the largest eigenvalue, e.g. scatter(1:D, abs(V(:,1))). Do the same for the eigenvectors associated with the second and third largest eigenvalues.
  4. Repeat the above steps on datasets generated with different values of sigma_noise in {0, 0.01, 0.1, 0.5, 0.7, 1:0.2:2}. To what extent is data visualization by PCA affected by noise?
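
The following sketch shows one way to produce the plots requested above. The signature [V, d, X_proj] = PCA(X, k) (eigenvectors V, eigenvalues d, projected data X_proj) is an assumption about the provided function; check help PCA for the actual interface.

    % PCA on the training set (Section 2) -- sketch, assuming [V, d, X_proj] = PCA(X, k)
    k = 10;
    [V, d, X_proj] = PCA(Xtr, k);

    % projections onto the first 1, 2 and 3 principal components
    figure; plot(X_proj(:,1));                                        % 1st component as a line
    figure; scatter(X_proj(:,1), X_proj(:,2), 25, Ytr);               % first 2 components
    figure; scatter3(X_proj(:,1), X_proj(:,2), X_proj(:,3), 25, Ytr); % first 3 components

    % square roots of the first k eigenvalues
    figure; plot(1:k, sqrt(d(1:k)), 'o-'); xlabel('component'); ylabel('sqrt(eigenvalue)');

    % coefficients of the leading eigenvectors
    figure; scatter(1:D, abs(V(:,1)));   % eigenvector of the largest eigenvalue
    figure; scatter(1:D, abs(V(:,2)));   % second largest
    figure; scatter(1:D, abs(V(:,3)));   % third largest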

3. Variable Selection

  1. Standardize the data matrix so that each column has mean 0 and standard deviation 1 (use a vectorized implementation for speed!). Use the statistics of the training set Xtr (mean and standard deviation) to standardize the corresponding test set Xts. Useful commands:
    m = mean(Xtr); % mean of each column/dimension (computed on the training set)
    s = std(Xtr); % standard deviation of each column/dimension
    Xtr(i,:) = Xtr(i,:) - m; % remove the mean of each dimension for data point i
    Xtr = Xtr - ones(2*N,1)*m; % vectorized mean removal for the entire matrix
    Xtr(i,:) = Xtr(i,:) ./ s; % scale each dimension of data point i by its std
    Xtr = Xtr ./ (ones(2*N,1)*s); % vectorized scaling for the entire matrix
    Xts = (Xts - ones(2*N,1)*m) ./ (ones(2*N,1)*s); % standardize the test set with the training statistics
  2. Use the orthogonal matching pursuit algorithm (type help OMatchingPursuit) with T iterations, to obtain T-1 coefficients for a sparse approximation of the training output based on the (standardized) training data. Plot the resulting coefficient vector w using stem(1:D,w). What is the output when setting T = 3, and how do you interpret the indices of these first active dimensions (non-zero coefficients)?
  3. Check the predicted labels on the training (and test) set when approximating the output using w:
    Ypred = sign(Xts * w);
    err = calcErr(Yts, Ypred);
    How do the training and test errors change with the number of iterations of the method? Plot both errors on the same figure for increasing T (see the sketch after this list).
  4. Apply hold-out cross-validation on the training set, through the provided holdoutCVOMP, to find the optimal number of iterations in the range intIter = 2:D (indicative values: perc = 0.5, nrip = 30).
    [it, Vm, Vs, Tm, Ts] = holdoutCVOMP(Xtrn, Ytr, perc, nrip, intIter);
    Plot the training and validation errors on the same figure. How do the training and validation errors behave as the number of iterations increases?
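
A sketch covering the standardization, the train/test error curves for increasing T, and the cross-validation call is given below. Here the standardized matrices are named Xtrn and Xtsn (matching the holdoutCVOMP call above); the call w = OMatchingPursuit(Xtrn, Ytr, T) and the assumption that Tm and Vm contain the mean training and validation errors are guesses about the provided functions' interfaces, so check their help text before running it.

    % Variable selection with OMP (Section 3) -- sketch
    % standardize train and test sets using training statistics only
    m = mean(Xtr);
    s = std(Xtr);
    Xtrn = (Xtr - ones(2*N,1)*m) ./ (ones(2*N,1)*s);
    Xtsn = (Xts - ones(2*N,1)*m) ./ (ones(2*N,1)*s);

    % train and test error as a function of the number of OMP iterations T
    Trange = 2:D;
    errTr = zeros(size(Trange));
    errTs = zeros(size(Trange));
    for i = 1:numel(Trange)
        w = OMatchingPursuit(Xtrn, Ytr, Trange(i)); % assumed interface
        errTr(i) = calcErr(Ytr, sign(Xtrn * w));
        errTs(i) = calcErr(Yts, sign(Xtsn * w));
    end
    figure; plot(Trange, errTr, 'b', Trange, errTs, 'r');
    legend('train error', 'test error'); xlabel('T');

    % coefficients for a small number of iterations, e.g. T = 3
    w3 = OMatchingPursuit(Xtrn, Ytr, 3);
    figure; stem(1:D, w3);

    % model selection by hold-out cross-validation on the training set
    perc = 0.5; nrip = 30; intIter = 2:D;
    [it, Vm, Vs, Tm, Ts] = holdoutCVOMP(Xtrn, Ytr, perc, nrip, intIter);
    figure; plot(intIter, Tm, 'b', intIter, Vm, 'r'); % assuming Tm/Vm are mean train/validation errors
    legend('training error', 'validation error'); xlabel('number of iterations');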

4. (Optional)

Compare the results of Parts 2 and 3 and evaluate the benefits of the two methods (dimensionality reduction via PCA vs. feature selection via OMP) when choosing N >> D, N ≈ D and N << D.