MLCC - Laboratory 3 - Dimensionality reduction and feature selection
In this laboratory we will address the problem of data analysis in the
context of a classification problem.
Follow the instructions below. Think hard before you call the instructors!
Download:
- zipfile
(unzip it in a local folder)
1. Warm up - data generation
You will generate a dataset of D-dimensional points (N points for each class).
A good starting point is N=100, D=30.
- 1.A For each point, the first two variables will be generated by MixGauss,
i.e. drawn from two Gaussian distributions with the same sigma = 1 and
centroids (3, 3) and (-3, -3) respectively (the first with label Y=1, the
second with Y=-1).
[X2,…]=MixGauss(….);
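- 1.B The remaining variables will be generated as Gaussian noise (mean 0,
standard deviation sigma_noise = 1, as in randn):
sigma_noise = 1;
X_noise = sigma_noise * randn(2*N,D-2);
Compose the final data matrix by placing the two blocks side by side (both
have 2*N rows, so they are concatenated horizontally):
X = [X2, X_noise];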
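If you want to check the expected shapes before calling MixGauss, steps 1.A and 1.B can be sketched with randn alone. This is only a stand-in under stated assumptions: it does not reproduce MixGauss (whose arguments are given in its help text), and the variable names means and sigma are chosen here for illustration.

```matlab
% Stand-in for MixGauss (assumption: NOT the lab's function).
% Draws N points per class around each centroid using randn.
N = 100;            % points per class
D = 30;             % total number of variables
sigma = 1;          % standard deviation of the two relevant variables
sigma_noise = 1;    % standard deviation of the noise variables

means = [3 3; -3 -3];                          % one centroid per row
X2 = [repmat(means(1,:), N, 1) + sigma * randn(N, 2); ...
      repmat(means(2,:), N, 1) + sigma * randn(N, 2)];
Y  = [ones(N, 1); -ones(N, 1)];                % labels: +1 first, then -1

X_noise = sigma_noise * randn(2*N, D-2);       % irrelevant variables
X = [X2, X_noise];                             % 2N-by-D data matrix
```

The first N rows belong to the class with Y=1 and the last N rows to the class with Y=-1, which is the ordering the plotting step in 1.D relies on.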
- 1.C Standardize the data matrix, so that each column has mean 0 and
standard deviation 1
m = mean(X);
(see "help mean": it computes the mean of each column)
for i = 1:2*N
X(i,:) = X(i,:) - m;
end
s = std(X);
for i = 1:2*N
X(i,:) = X(i,:) ./ s;
end
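The two loops in 1.C can also be written without explicit iteration. The sketch below uses bsxfun, which applies the row vectors m and s to every row of X; the placeholder matrix and the name Xs are assumptions made only to keep the example self-contained.

```matlab
% Placeholder data (assumption): replace with your own 2N-by-D matrix X.
X = 2 * randn(200, 30) + 5;

m = mean(X);     % 1-by-D row vector of column means
s = std(X);      % 1-by-D row vector of column standard deviations

% Subtract m from every row, then divide every row by s.
Xs = bsxfun(@rdivide, bsxfun(@minus, X, m), s);

% Sanity check: each column should now have mean ~0 and std ~1.
disp(max(abs(mean(Xs))));
disp(max(abs(std(Xs) - 1)));
```

Both versions give the same result; the loop version makes the row-by-row operation explicit, while the vectorized one is shorter.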
- 1.D You may want to plot your data (relevant variables only)
scatter(X(1:N,1),X(1:N,2),'r');
hold on;
scatter(X(N+1:2*N,1),X(N+1:2*N,2),'b');
hold off;
(without "hold on" the second scatter replaces the first one)
2. Variable selection
- 2.A Use the orthogonal matching pursuit algorithm (type 'help OMatchingPursuit')
- 2.B You may want to check the predicted labels on the training set
Y_lr = sign(X * w);
err = double(sum(Y ~= Y_lr))/size(Y,1)
and plot the coefficients w (for example with stem(w))
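To see what the algorithm in section 2 does internally, here is a minimal sketch of orthogonal matching pursuit written as a plain script. It is not the lab's OMatchingPursuit function: the number of iterations T, the randn-based data generation, and all variable names are assumptions made for illustration.

```matlab
% Synthetic data as in section 1 (randn stand-in, an assumption).
N = 100; D = 30;
X = [[3 + randn(N, 2); -3 + randn(N, 2)], randn(2*N, D-2)];
X = bsxfun(@rdivide, bsxfun(@minus, X, mean(X)), std(X));
Y = [ones(N, 1); -ones(N, 1)];

T = 2;               % number of OMP iterations (assumed value)
w = zeros(D, 1);     % final coefficient vector, zero outside the selected set
r = Y;               % current residual
I = [];              % indices of the selected variables

for t = 1:T
    % Correlation of each column with the residual, scaled by column norm.
    c = abs(X' * r) ./ sqrt(sum(X.^2, 1))';
    c(I) = 0;                     % never pick the same variable twice
    [~, j] = max(c);
    I = [I, j];                   % add the best variable to the active set
    wI = X(:, I) \ Y;             % least squares restricted to the active set
    r = Y - X(:, I) * wI;         % update the residual
end
w(I) = wI;

Y_lr = sign(X * w);                       % predicted training labels
err = sum(Y ~= Y_lr) / numel(Y);          % training error
```

Calling stem(w) afterwards shows which variables were selected. With this data you would expect the nonzero coefficients to fall on the first two variables, although with few points a noise variable can occasionally be picked.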
3. Principal Component Analysis
- 3.A Compute the principal components of the data
- 3.B Plot the first two components using the function scatter (and, as
usual, different colors for the two classes);
then try with the first 3 components (see "help scatter3"). Reflect on the
meaning of the results you obtain
- 3.C Plot the coefficients (eigenvector) associated with the largest eigenvalue
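One possible way to carry out 3.A to 3.C is through the eigendecomposition of the covariance matrix of the standardized data. The sketch below assumes that route (an SVD of X would be an equally valid choice); the data generation and the names C, V, lambda and Z are assumptions for illustration.

```matlab
% Synthetic standardized data as in section 1 (randn stand-in, an assumption).
N = 100; D = 30;
X = [[3 + randn(N, 2); -3 + randn(N, 2)], randn(2*N, D-2)];
X = bsxfun(@rdivide, bsxfun(@minus, X, mean(X)), std(X));

% Covariance matrix (X is already centered); symmetrize against round-off.
C = (X' * X) / (size(X, 1) - 1);
C = (C + C') / 2;

% Eigenvectors sorted by decreasing eigenvalue.
[V, L] = eig(C);
[lambda, order] = sort(diag(L), 'descend');
V = V(:, order);

% Projection of the data onto the first three principal directions.
Z = X * V(:, 1:3);
```

For 3.B, scatter(Z(1:N,1), Z(1:N,2), 'r'), then hold on and scatter(Z(N+1:2*N,1), Z(N+1:2*N,2), 'b') plots the two classes; for 3.C, stem(V(:,1)) shows the coefficients of the first eigenvector.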
4. If you have time - More experiments
- 4.A Analyse the results you obtain in sections 2 and 3 when you choose
- N >> D
- N ~ D
- N << D
and evaluate the benefits of the two different analyses
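As a starting point for these experiments, the following sketch repeats the PCA step for three sample sizes and measures how much of the first principal direction lies on the two relevant variables. The values in Ns, the randn-based data generation, and the summary quantity mass are all assumptions chosen for illustration; the same loop can be adapted to the variable selection of section 2.

```matlab
D = 30;
Ns = [1000, 30, 5];          % N >> D, N ~ D, N << D (assumed values)
mass = zeros(size(Ns));      % share of PC1 on variables 1 and 2, per setting

for k = 1:numel(Ns)
    N = Ns(k);
    % Synthetic standardized data as in section 1 (randn stand-in).
    X = [[3 + randn(N, 2); -3 + randn(N, 2)], randn(2*N, D-2)];
    X = bsxfun(@rdivide, bsxfun(@minus, X, mean(X)), std(X));

    % Right singular vectors of centered X are the principal directions.
    [~, ~, V] = svd(X, 'econ');

    % V(:,1) has unit norm, so this value lies between 0 and 1.
    mass(k) = sum(V(1:2, 1).^2);
end

disp([Ns; mass]);            % first row: N, second row: share on variables 1-2
```

A value close to 1 means the first component is essentially built from the two relevant variables; lower values indicate that noise variables contribute more as N shrinks relative to D.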