MLCC - Laboratory 3 - Dimensionality reduction and feature selection

In this laboratory we will explore data analysis (dimensionality reduction and feature selection) in the context of a classification problem.
Follow the instructions below. Think hard before you call the instructors!

Download:

zipfile (unzip it in a local folder)

1. Warm up - data generation

You will generate a dataset of D-dimensional points (N points for each class). A good starting point is N = 100, D = 30.

1.A For each point, the first two variables are generated by MixGauss: the points are drawn from two Gaussian distributions with the same sigma = 1 and centroids (3, 3) and (-3, -3) respectively (the first with label Y = 1, the second with Y = -1).

[X2,…]=MixGauss(….);

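1.B The remaining D-2 variables will be generated as Gaussian noise (randn draws samples with mean 0 and standard deviation 1, scaled here by sigma_noise):
X_noise = sigma_noise * randn(2*N, D-2);

Compose the final data matrix by placing the two blocks side by side, so that X has 2*N rows and D columns:
X = [X2, X_noise];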
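
The data layout of steps 1.A and 1.B can be sketched as follows. This is only an illustration: it does not call MixGauss (whose signature is given by its own help text) and instead draws the two informative variables directly with randn. The names Xpos and Xneg and the value sigma_noise = 1 are assumptions made for the example.

```matlab
% Sketch of steps 1.A-1.B using randn as a stand-in for MixGauss.
N = 100;            % points per class
D = 30;             % total number of variables
sigma = 1;          % standard deviation of the two informative variables
sigma_noise = 1;    % standard deviation of the noise variables (assumed value)

% Two informative variables: class +1 around (3,3), class -1 around (-3,-3)
Xpos = 3 + sigma * randn(N, 2);
Xneg = -3 + sigma * randn(N, 2);
X2 = [Xpos; Xneg];                  % 2N x 2, classes stacked vertically
Y = [ones(N, 1); -ones(N, 1)];      % labels in the same row order

% D-2 pure-noise variables, one row per point
X_noise = sigma_noise * randn(2*N, D-2);

% Horizontal concatenation: every row is one D-dimensional point
X = [X2, X_noise];
disp(size(X));
```

With this layout, rows 1 to N belong to class +1 and rows N+1 to 2*N to class -1, which is the ordering the plotting commands in 1.D rely on.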

1.C Standardize the data matrix, so that each column has mean 0 and standard deviation 1

m = mean(X); % see "help mean": it computes the mean of each column
for i = 1:2*N
X(i,:) = X(i,:) - m;
end
s = std(X);
for i = 1:2*N
X(i,:) = X(i,:) ./ s;
end
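
The two loops above can also be written without explicit iteration. The following is a minimal sketch on toy data, assuming bsxfun is available (it is in MATLAB and in Octave); the variable Xs and the toy matrix are illustrative only.

```matlab
% Vectorized standardization: same result as the two loops, no explicit iteration.
X = 5 + 2 * randn(200, 30);     % toy data with non-zero mean and non-unit spread

m = mean(X, 1);                 % 1 x D row vector of column means
s = std(X, 0, 1);               % 1 x D row vector of column standard deviations

% Subtract the mean from every row, then divide every row by the std
Xs = bsxfun(@rdivide, bsxfun(@minus, X, m), s);

% Largest deviations from the targets (mean 0, std 1); both are close to zero
disp(max(abs(mean(Xs, 1))));
disp(max(abs(std(Xs, 0, 1) - 1)));
```

In recent MATLAB releases the same operation can be written as (X - m) ./ s thanks to implicit expansion; bsxfun is used here only because it also works in older versions.
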
1.D You may want to plot your data (relevant variables only):
figure; hold on; % keep both classes on the same axes
scatter(X(1:N,1),X(1:N,2),'r');
scatter(X(N+1:2*N,1),X(N+1:2*N,2),'b');

2. Variable selection

2.A Use the orthogonal matching pursuit algorithm (type 'help OMatchingPursuit')
2.B You may want to check the predicted labels on the training set:
Y_lr = sign(X * w);
err = double(sum(Y ~= Y_lr))/size(Y,1)
and plot the coefficients w (for example with stem(w)).
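
To see what the algorithm in 2.A does, here is a minimal standalone sketch of the general orthogonal matching pursuit idea. It is not the provided OMatchingPursuit function, whose interface is described by its help text. The number of iterations T, the index list I and the toy data generation (randn in place of MixGauss) are assumptions made for the example.

```matlab
% Toy data as in section 1 (randn used as a stand-in for MixGauss)
N = 100; D = 30;
Xpos = 3 + randn(N, 2);
Xneg = -3 + randn(N, 2);
X = [[Xpos; Xneg], randn(2*N, D-2)];
X = bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 1)), std(X, 0, 1));
Y = [ones(N, 1); -ones(N, 1)];

T = 2;              % number of variables to select (assumed value)
w = zeros(D, 1);    % coefficient vector, non-zero only on selected variables
r = Y;              % residual, initially the labels themselves
I = [];             % indices of the selected variables

for t = 1:T
    % Score each variable by how much of the residual it explains
    score = (X' * r).^2 ./ sum(X.^2, 1)';
    score(I) = -Inf;                 % never pick the same variable twice
    [~, j] = max(score);
    I = [I, j];

    % Refit least squares on all selected variables, then update the residual
    w = zeros(D, 1);
    w(I) = X(:, I) \ Y;
    r = Y - X * w;
end

Y_pred = sign(X * w);
err = mean(Y ~= Y_pred);             % training error
disp(I);
disp(err);
```

The refit on all selected variables at each iteration is what distinguishes orthogonal matching pursuit from plain matching pursuit, which only updates the coefficient of the newly chosen variable.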

3. Principal Component Analysis

3.A Compute the data principal components
3.B Plot the first two components using the function scatter (as usual, with different colors for the two classes);
then try with the first 3 components (see 'help scatter3'). Reason about the meaning of the results you obtain.
3.C Plot the coefficients (eigenvector) associated with the largest eigenvalue
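
One possible way to carry out steps 3.A to 3.C is through the eigendecomposition of the covariance matrix of the standardized data. The sketch below makes that assumption (if the lab archive provides its own PCA routine, its interface may differ) and again uses randn in place of MixGauss; the names C, V, lambda, Z and v1 are illustrative.

```matlab
% Toy data as in section 1 (randn used as a stand-in for MixGauss)
N = 100; D = 30;
Xpos = 3 + randn(N, 2);
Xneg = -3 + randn(N, 2);
X = [[Xpos; Xneg], randn(2*N, D-2)];
X = bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 1)), std(X, 0, 1));

% Covariance matrix of the standardized data (D x D)
n = size(X, 1);
C = (X' * X) / (n - 1);
C = (C + C') / 2;               % enforce exact symmetry against round-off

% Eigenvectors are the principal directions; sort by decreasing eigenvalue
[V, L] = eig(C);
[lambda, order] = sort(diag(L), 'descend');
V = V(:, order);

Z = X * V(:, 1:3);              % projections on the first three components
v1 = V(:, 1);                   % coefficients associated with the largest eigenvalue
disp(lambda(1:3)');
```

Here Z(1:N,:) holds the projections of the first class and Z(N+1:end,:) those of the second, so they can be passed to scatter (or scatter3) with different colors, and v1 can be plotted to inspect which original variables contribute to the first component.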

4. If you have time - More experiments

4.A Analyse the results you obtain in sections 2 and 3 when you choose

N >> D
N ~ D
N << D
and evaluate the benefits of the two different analyses.
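
A possible starting point for this comparison is sketched below. It is only one way to organize the experiment: the values in Ns, the summary quantities mass and first_pick, and the use of randn in place of MixGauss are assumptions made for the example.

```matlab
D = 30;
Ns = [1000, 30, 5];             % N >> D, N ~ D, N << D (assumed values)
mass = zeros(size(Ns));         % weight of the first PC on the two relevant variables
first_pick = zeros(size(Ns));   % variable most correlated with the labels

for k = 1:numel(Ns)
    N = Ns(k);

    % Data as in section 1 (randn used as a stand-in for MixGauss)
    Xpos = 3 + randn(N, 2);
    Xneg = -3 + randn(N, 2);
    X = [[Xpos; Xneg], randn(2*N, D-2)];
    X = bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 1)), std(X, 0, 1));
    Y = [ones(N, 1); -ones(N, 1)];

    % PCA through the SVD: columns of V are the principal directions
    [~, ~, V] = svd(X, 'econ');
    v1 = V(:, 1);
    mass(k) = sum(v1(1:2).^2);  % in [0,1], since v1 has unit norm

    % First greedy selection step: columns have equal norm after standardization
    [~, first_pick(k)] = max(abs(X' * Y));

    fprintf('N = %4d: weight of v1 on variables 1-2 = %.2f, first pick = %d\n', ...
            N, mass(k), first_pick(k));
end
```

Running this several times for each regime gives a feeling for how stable the two analyses are when few points are available.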