You will generate a training and a test set of D-dimensional points (N points for each class), with N=100 and D=30. Only two of those dimensions will be meaningful; the others will be irrelevant noise.
N=100;
D=30;
1.A For each point, the first two variables will be generated by MixGauss, drawn from two Gaussian distributions with centroids (1, 1) and (-1, -1) and standard deviation 0.7 (the first class with Y=1, the second with Y=-1):
[Xtr, Ytr] = MixGauss(...);
Ytr(Ytr==2) = -1;
[Xts, Yts] = MixGauss(...);
Yts(Yts==2) = -1;
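A plausible full call, assuming MixGauss(means, sigmas, n) takes the class centroids as the columns of means, one standard deviation per class, and the number of points per class (check "help MixGauss" for the actual interface):
% Assumed signature: means as columns, one std per class, N points per class
[Xtr, Ytr] = MixGauss([[1;1], [-1;-1]], [0.7, 0.7], N);
Ytr(Ytr==2) = -1;   % relabel the second class from 2 to -1
[Xts, Yts] = MixGauss([[1;1], [-1;-1]], [0.7, 0.7], N);
Yts(Yts==2) = -1;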
1.B You may want to plot the relevant variables of the data:
scatter(Xtr(:,1), Xtr(:,2), 50, Ytr, 'filled');
hold on;
scatter(Xts(:,1), Xts(:,2), 50, Yts);
1.C The remaining D-2 variables will be generated as Gaussian noise:
sigma_noise = 0.01;
Xts_noise = sigma_noise*randn(2*N, D-2);
To compose the final data matrix, run:
Xts = [Xts, Xts_noise];
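The training matrix needs the same D-2 noise columns, so that both sets are D-dimensional; a minimal sketch:
Xtr_noise = sigma_noise*randn(2*N, D-2);
Xtr = [Xtr, Xtr_noise];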
2.A Compute the data principal components (see help PCA)
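A sketch of the call; the names d and X_proj are the ones used in the steps below, while the output order and the input k (number of components) are assumptions to be checked against help PCA:
k = D;                        % number of principal components to compute (assumed input)
[V, d, X_proj] = PCA(Xtr, k); % assumed outputs: eigenvectors V, eigenvalues d, projections X_proj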
2.B Plot the first two components of X_proj using the following line:
scatter(X_proj(:,1), X_proj(:,2), 50, Ytr, 'filled');
2.C Try now with the first 3 components, by using
scatter3(X_proj(:,1), X_proj(:,2), X_proj(:,3), 50, Ytr, 'filled');
Reason about the meaning of the results you obtain.
2.D Display the sqrt of the first 10 eigenvalues (disp(sqrt(d(1:10)))). Plot the coefficients (eigenvector) associated with the largest eigenvalue:
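For instance, assuming the eigenvectors are returned as the columns of V, as in the sketch of 2.A:
plot(V(:,1));   % coefficients of the first principal component, one per variable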
2.E Repeat the above steps with datasets generated using different values of sigma_noise (0, 0.01, 0.1, 0.5, 0.7, 1, 1.2, 1.4, 1.6, 2). To what extent is data visualization by PCA affected by the noise?
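A possible way to automate the comparison, reusing the assumed PCA call from 2.A (only the noise columns need to be regenerated, since the relevant variables are the first two columns):
for sigma_noise = [0, 0.01, 0.1, 0.5, 0.7, 1, 1.2, 1.4, 1.6, 2]
    Xn = [Xtr(:,1:2), sigma_noise*randn(2*N, D-2)];   % relevant variables + fresh noise
    [V, d, X_proj] = PCA(Xn, D);                      % assumed signature, as in 2.A
    figure; scatter(X_proj(:,1), X_proj(:,2), 50, Ytr, 'filled');
    title(['sigma\_noise = ', num2str(sigma_noise)]);
end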
3.A Use the data generated in section 1. Standardize the data matrix, so that each column has mean 0 and standard deviation 1:
m = mean(Xtr);   % (see "help mean": it computes the mean of each column)
s = std(Xtr);
for i = 1:2*N
    Xtr(i,:) = Xtr(i,:) - m;
end
for i = 1:2*N
    Xtr(i,:) = Xtr(i,:) ./ s;
end
Do the same for Xts, by using m and s computed on Xtr.
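A minimal sketch, reusing the m and s computed on the training set:
for i = 1:2*N
    Xts(i,:) = (Xts(i,:) - m) ./ s;   % standardize with training statistics
end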
3.B Use the orthogonal matching pursuit algorithm (type 'help OMatchingPursuit')
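A sketch of the call, under the assumption that the function takes the training data and a number of iterations T and returns the coefficient vector w used in the next step (check help OMatchingPursuit for the actual signature):
T = 5;                              % example number of iterations
w = OMatchingPursuit(Xtr, Ytr, T);  % assumed interface: (X, Y, T) -> w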
3.C You may want to check the predicted labels on the test set:
Ypred = sign(Xts * w);
err = calcErr(Yts, Ypred);
and plot the coefficients w with scatter(1:D, abs(w)).
How does the error change with the number of iterations of the method?
3.D By using the method holdoutCVOMP, find the best number of iterations with intIter = 2:D (and, for instance, perc = 0.75 and nrip = 20).
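One plausible form of the call; the outputs Tm and Vm (mean training and validation errors) are the ones plotted below, while the exact signature and output order should be checked with help holdoutCVOMP:
intIter = 2:D;   % candidate numbers of iterations
perc = 0.75;     % fraction of the data used for training in each split
nrip = 20;       % number of repetitions
[it, Vm, Vs, Tm, Ts] = holdoutCVOMP(Xtr, Ytr, perc, nrip, intIter);   % assumed output order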
Moreover, plot the training and validation error with the following lines:
plot(intIter, Tm, 'r');
hold on;
plot(intIter, Vm, 'b');
hold off;
What is the behavior of the training and the validation errors with respect to the number of iterations?
3.E Try to increase the number of relevant variables d = 3, 5, ... (and the corresponding standard deviation of the Gaussians) around the centroids
ones(d,1);   % vector of all 1s
-ones(d,1);  % vector of all -1s
and see how this change is reflected in the cross-validation.
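A generation sketch, reusing the assumed MixGauss interface from 1.A (the value of sigma is an example, to be varied together with d):
d = 3;        % number of relevant variables (try 3, 5, ...)
sigma = 0.7;  % per-class standard deviation, to be adjusted as well
[Xtr, Ytr] = MixGauss([ones(d,1), -ones(d,1)], [sigma, sigma], N);
Ytr(Ytr==2) = -1;
Xtr = [Xtr, sigma_noise*randn(2*N, D-d)];   % pad with D-d noise columns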
4.A Analyse the results you obtain in sections 2 and 3 once you choose
N >> D
N ~ D
N << D
and evaluate the benefits of the two different analyses.