Machine Learning Day
Lab 0: Data Generation
This first (optional) lab is focused on getting started with MATLAB/Octave and working with data for ML. The goal is to provide basic familiarity with MATLAB syntax, along with some preliminary data generation, processing and visualization.
MATLAB/Octave resources
The labs are designed for MATLAB/Octave. Below you can find a number of resources to get you started.
- MATLAB getting started tutorial for an introduction to the environment, syntax and conventions.
- MATLAB has very thorough documentation, both online and built in. In the command window, type `help functionName` (check use) or `doc functionName` (pull up documentation).
- Built-in tutorials: in the command window, enter `demo`.
- Comprehensive MATLAB reference and introduction (pdf)
- MIT Open CourseWare: Introduction to MATLAB
- Stanford/Coursera Octave Tutorial (video)
- Writing Fast MATLAB Code (pdf): Profiling, JIT, vectorization, etc.
- Stack Overflow: MATLAB tutorial for programmers
Getting Started
- Get the code file and add its directory to the MATLAB path (or set it as the current/working directory).
- Use the editor to write/save and run/debug longer scripts and functions.
- Use the command window to try/test commands, view variables and see the use of functions.
- Use `plot` (for 1D), `imshow`, `imagesc` (for 2D matrices), `scatter`, `scatter3` to visualize variables of different types.
- Work your way through the examples below, by following the instructions.
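
As a quick orientation for the visualization functions listed above, here is a minimal sketch on toy data. The variables `t`, `y`, `M` and `P` are made up for illustration and are not part of the lab files.

```matlab
% Toy data (illustrative values, not from the lab files)
t = linspace(0, 2*pi, 100);      % 1 x 100 row vector
y = sin(t);                      % 1D signal
M = magic(8);                    % 8 x 8 matrix
P = randn(200, 3);               % 200 random points in 3D

figure; plot(t, y);              % 1D data: line plot
xlabel('t'); ylabel('sin(t)');

figure; imagesc(M); colorbar;    % 2D matrix shown as a color-scaled image

figure; scatter(P(:,1), P(:,2), 25, P(:,3));   % 2D points, colored by the 3rd column
figure; scatter3(P(:,1), P(:,2), P(:,3));      % 3D point cloud
```

Typing `help scatter` (or any of the other names) in the command window lists the remaining options, such as marker size and color.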
1. Optional - MATLAB Warm-up
- Create a column vector `v = [1; 2; 3]` and a row vector `u = [1,2,3]`
  - What happens with the command `v'`? What is the corresponding algebraic/matrix operation?
  - Create `z = [5;4;3]` and try basic numerical operations of addition and subtraction with `v`.
  - What happens with `u + z`?
- Create the matrices `A = [1 2 3; 4 5 6; 7 8 9]` and `B = A'`
  - What kind of matrix is `C = A + B`?
  - Explore what happens with `A(:,1)`, `A(1,:)`, `A(2:3,:)` and `A(:)`.
- Use the product operator `*`
  - What happens with `2*u`, `u*2`, `2*v`?
  - What happens with `u*v` and `v*u`, why? With `A*v`, `u*A` and `A*u`?
  - Use `size` and/or `length` functions to find the dimensions of vectors and matrices.
- Use the element-wise operators `.*` and `./`, e.g., `u.*z` and `z./u`
  - What happens with `v.*z` and `v./z`?
  - Why aren't `A*A` and `A.*A` the same?
- Use the functions `zeros`, `ones`, `rand`, `randn`
  - Create 3 x 5 matrices of all zeros, of all ones, of random numbers uniformly distributed between 2 and 3, and of random numbers drawn from a Gaussian with variance 2.
- Use the functions `eye` and `diag`
  - Create a 3 x 3 identity matrix and a matrix whose diagonal is the vector `v`.
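
One possible way to work through the warm-up is sketched below. The variable names beyond `u`, `v`, `z`, `A`, `B` and `C` are illustrative choices; try each command yourself before reading the comments.

```matlab
v = [1; 2; 3];          % 3 x 1 column vector
u = [1, 2, 3];          % 1 x 3 row vector
z = [5; 4; 3];

vt = v';                % transpose: turns the column into a 1 x 3 row
s  = v + z;             % element-wise sum of two columns: [6; 6; 6]
% u + z mixes a row and a column. MATLAB R2016b or later and Octave expand
% both to 3 x 3; older MATLAB versions report a dimension mismatch.

A = [1 2 3; 4 5 6; 7 8 9];
B = A';
C = A + B;              % A + A' is symmetric, i.e. C equals C'

inner = u * v;          % (1 x 3)*(3 x 1): scalar (inner product), 14
outer = v * u;          % (3 x 1)*(1 x 3): 3 x 3 matrix (outer product)
Av    = A * v;          % 3 x 1 column
uA    = u * A;          % 1 x 3 row
% A * u is an error: inner dimensions (3 and 1) do not agree.

prodElem = v .* z;      % element-wise product: [5; 8; 9]
AA  = A * A;            % matrix product
AeA = A .* A;           % squares each entry, generally different from AA

Z = zeros(3, 5);
O = ones(3, 5);
U = 2 + rand(3, 5);           % uniform between 2 and 3
G = sqrt(2) * randn(3, 5);    % zero-mean Gaussian with variance 2 (std sqrt(2))

I3 = eye(3);            % 3 x 3 identity
Dv = diag(v);           % 3 x 3 matrix with v on the diagonal
```

Note that scaling `randn` by `sqrt(2)`, not by 2, gives variance 2, because variance scales with the square of the multiplier.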
2. Core - Data generation
The function `MixGauss(means, sigmas, n)` generates datasets where the distribution of each class is an isotropic Gaussian with a given mean and variance, according to the values in the matrices/vectors `means` and `sigmas`. Study the function code or type `help MixGauss` in the MATLAB command window. The function `scatter` can be used to plot points in 2D.
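
To make the idea of a mixture of isotropic Gaussians concrete, here is a minimal stand-in written as a plain script. It is not the lab's `MixGauss` code, which may differ. In particular, the sketch assumes that `n` is the number of points per class, that labels run from 1 to the number of classes, and that each entry of `sigmas` is a standard deviation.

```matlab
% Hypothetical stand-in for MixGauss; the lab's own function may differ.
means  = [[0;0], [1;1]];   % d x p: one column per class mean
sigmas = [0.5, 0.25];      % 1 x p: one spread value per class
n      = 1000;             % assumed: number of points per class

[d, p] = size(means);
X = zeros(n*p, d);         % one row per point
C = zeros(n*p, 1);         % class label of each row
for i = 1:p
    rows = (i-1)*n + (1:n);
    % Isotropic Gaussian: same spread in every direction around the mean.
    % sigmas(i) is treated as a standard deviation here (assumption);
    % use sqrt(sigmas(i)) instead if it denotes a variance.
    X(rows, :) = repmat(means(:, i)', n, 1) + sigmas(i) * randn(n, d);
    C(rows) = i;
end
```

Check `help MixGauss` to see which of these assumptions hold for the function provided with the lab.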
- Generate and visualize a simple dataset:
[X, C] = MixGauss([[0;0], [1;1]], [0.5, 0.25], 1000);
figure; scatter(X(:,1), X(:,2), 25, C);
- Generate more complex datasets:
  - 4-class dataset: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.2.
  - 2-class dataset: manipulate the data to obtain a 2-class problem where data on opposite corners share the same class. Hint: if you generated the data following the suggested center order, you can use the function `mod` to quickly obtain two labels, e.g. `Y = mod(C, 2)`.
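
The two datasets above could be produced along the following lines. To stay self-contained, the sketch samples the points directly with `randn` and does not call `MixGauss`. The number of points per class and the labels 1 to 4 are assumptions for illustration.

```matlab
% Corners in the suggested order: (0,0), (0,1), (1,1), (1,0)
means = [0 0 1 1;
         0 1 1 0];                 % 2 x 4: one column per class
sigma = sqrt(0.2);                 % standard deviation for variance 0.2
n = 100;                           % points per class (illustrative)
p = size(means, 2);

C = kron((1:p)', ones(n, 1));      % labels 1..4, n copies of each
X = means(:, C)' + sigma * randn(n*p, 2);   % each point = its class mean + noise

% Opposite corners share a label: classes 1 and 3 -> 1, classes 2 and 4 -> 0
Y = mod(C, 2);

figure; scatter(X(:,1), X(:,2), 25, Y);
```

With variance 0.2 the four clouds overlap heavily, and the resulting 2-class problem cannot be separated by a single straight line.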
3. Optional - Extra practice
- Generate datasets of larger variances, higher dimensionality of input space etc.
- Add noise to the data by flipping the labels of random points.
- For a dataset compute the distances among all input points (use vectorization in your code, avoid using a `for` loop). How does the mean distance change with the number of dimensions?
- Generate regression data: Consider a regression model defined by a linear function with coefficients `w` and Gaussian noise of level (SNR) `delta`.
  - Create a MATLAB function with input the number of points `n`, the number of dimensions `D`, the D-dimensional vector `w` and the scalar `delta` and output an (n x D) matrix `X` and an (n x 1) vector `Y`.
  - Plot the underlying (linear) function and the noisy output on the same figure.
  - Test/visualize the 1-D and 2-D cases, but make the function generic to account for higher dimensional data.
- Generate regression data using a 1-D model with a non-linear function.
- Generate a dataset (either for regression or for classification) where most of the input variables are "noise", i.e., they are unrelated to the output.
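
A few of the exercises above can be sketched as a plain script. The sizes, the 10% flip rate, the choice of `w`, and the reading of `delta` as the standard deviation of additive noise are all assumptions made for illustration. The regression lines are meant to be wrapped in your own function file.

```matlab
n = 200; D = 5;                       % illustrative sizes
X = randn(n, D);                      % n points in D dimensions

% Label noise: flip a random 10% of binary labels in {0, 1}.
Y = double(rand(n, 1) > 0.5);
idx = randperm(n);
idx = idx(1:round(0.1*n));            % indices of the labels to flip
Ynoisy = Y;
Ynoisy(idx) = 1 - Ynoisy(idx);

% All pairwise Euclidean distances without a for loop, using
% ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a'*b.
sq   = sum(X.^2, 2);                                      % n x 1 squared norms
D2   = repmat(sq, 1, n) + repmat(sq', n, 1) - 2*(X*X');   % n x n squared distances
Dist = sqrt(max(D2, 0));              % clip small negatives caused by round-off
meanDist = sum(Dist(:)) / (n*(n-1));  % average over pairs of distinct points

% Linear regression data: Yr = Xr*w + noise.
w      = (1:D)';                      % D x 1 coefficients (arbitrary choice)
delta  = 0.1;                         % assumed: standard deviation of the noise
Xr     = rand(n, D);                  % n x D inputs
Yclean = Xr * w;                      % underlying linear function
Yr     = Yclean + delta * randn(n, 1);  % n x 1 noisy outputs
```

Rerunning the distance part for several values of `D` and plotting `meanDist` against `D` is one way to approach the question about dimensionality.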