Machine Learning Day

Lab 0: Data Generation

This lab is about getting started with MATLAB and working with data. The goal is to provide familiarity with MATLAB syntax, along with some preliminary data generation, processing and visualization.

Getting Started

  • Get the code file and add its directory to the MATLAB path (or set it as the current/working directory).
  • Use the editor to write/save and run/debug longer scripts and functions.
  • Use the command window to try/test commands, view variables and see the use of functions.
  • Use plot (for 1D), imshow, imagesc (for 2D matrices), scatter, scatter3 to visualize variables of different types.
  • Work your way through the examples below, by following the instructions.

MATLAB/Octave resources

  • MATLAB getting started tutorial for an introduction to the environment, syntax and conventions.
  • MATLAB has very thorough documentation, both online and built in. In the command window, type: help functionName (check use) or doc functionName (pull up documentation).
  • Built in tutorials: in the command window enter: playbackdemo('GettingStartedwithMATLAB','toolbox/matlab/demos/html')
  • Comprehensive MATLAB reference and introduction: (pdf)
  • MATLAB Tutorials and Learning Resources. Note the section with university-authored tutorials.
  • Writing Fast MATLAB Code (pdf): Profiling, JIT, vectorization, etc.
  • Stack Overflow: MATLAB tutorial for programmers
  • MIT Open CourseWare: Introduction to MATLAB
  • Octave Programming Tutorial (Wikibook)
  • Stanford/Coursera Octave Tutorial (video)

1. MATLAB warm-up

  1. Create a column vector v = [1; 2; 3] and a row vector u = [1,2,3]
    • What happens with the command v'? What is the corresponding algebraic/matrix operation for this?
    • Create z = [5;4;3] and try basic numerical operations of addition and subtraction.
    • What happens with u + z?
  2. Create the matrices A = [1 2 3; 4 5 6; 7 8 9] and B = A'
    • What kind of matrix is C = A + B?
    • Explore what happens with A(:,1), A(1,:), A(2:3,:) and A(:).
  3. Use the product operator *
    • What happens with 2*u, u*2, 2*v?
    • What happens with u*v and v*u, why? With A*v, u*A and A*u?
    • Use size and/or length functions to find the dimensions of vectors and matrices.
  4. Use the element-wise operators .* and ./, e.g., u.*z and z./u
    • What happens with v.*z and v./z?
    • Why aren't A*A and A.*A the same?
  5. Use the functions zeros, ones, rand, randn
    • Create 3 x 5 matrices of all zeros, of all ones, of random numbers uniformly distributed between 2 and 3, and of random numbers drawn from a Gaussian with variance 2.
  6. Use the functions eye and diag
    • Create a 3 x 3 identity matrix and a matrix whose diagonal is the vector v.
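
As a reference for checking your own answers, the warm-up steps above can be sketched as follows. This is a minimal sketch, not an official solution: the variable names inner, outer, elemw, isSym, Z, O, U, G, I3 and Dv are chosen here for illustration, and the Gaussian matrix is assumed to be zero-mean, since the exercise only specifies the variance.

```matlab
% Vectors from step 1
v = [1; 2; 3];              % 3 x 1 column vector
u = [1, 2, 3];              % 1 x 3 row vector
z = [5; 4; 3];              % 3 x 1 column vector

% Matrix product versus element-wise product (steps 3 and 4)
inner = u * v;              % (1 x 3)*(3 x 1): scalar inner product
outer = v * u;              % (3 x 1)*(1 x 3): 3 x 3 outer product
elemw = v .* z;             % element-wise product of two 3 x 1 vectors

% A + A' is always symmetric (step 2)
A = [1 2 3; 4 5 6; 7 8 9];
C = A + A';
isSym = isequal(C, C');

% 3 x 5 matrices (step 5)
Z = zeros(3, 5);
O = ones(3, 5);
U = 2 + rand(3, 5);         % rand is uniform on (0,1); shifting by 2 gives (2,3)
G = sqrt(2) * randn(3, 5);  % assumption: zero mean; std sqrt(2) gives variance 2

% Identity and diagonal matrices (step 6)
I3 = eye(3);
Dv = diag(v);               % 3 x 3 matrix with v on the diagonal
```

Note that randn draws samples with unit variance, so scaling by sqrt(2) rather than by 2 is what yields variance 2.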

2. Core - Data generation

  1. The function MixGauss(means, sigmas, n) generates datasets where the distribution of each class is an isotropic Gaussian with a given mean and variance, according to the values in the matrices/vectors means and sigmas. Study the function code or type help MixGauss in the MATLAB command window.
  2. Generate a simple dataset through the following commands:
    [X1, Y1] = MixGauss([[0;0], [1;1]], [0.5, 0.25], 1000);
    figure(1); scatter(X1(:,1), X1(:,2), 25, Y1);
  3. Generate a more complex dataset:
    • Call MixGauss with appropriate parameters and produce a dataset with four classes: the classes must live in the 2D space and be centered on the corners of the unit square (0,0), (0,1), (1,1), (1,0), all with variance 0.2. Use scatter to plot the points.
    • Manipulate the data to obtain a 2-class problem where data on opposite corners share the same class. Hint: if you generated the data following the suggested center order, you can use the function mod to quickly obtain two labels, e.g. Y = mod(C, 2), where C is the vector of four-class labels.
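
To make the four-corner construction concrete without relying on the provided MixGauss code, the sketch below generates comparable data inline. It is a stand-in, not the lab's implementation: it assumes the entries of sigmas are standard deviations (hence sqrt(0.2) for variance 0.2) and that classes are labeled 1 to 4 in center order. Check help MixGauss for the conventions the real function uses.

```matlab
n = 1000;                                 % points per class
means  = [[0;0], [0;1], [1;1], [1;0]];    % one center per column, suggested order
sigmas = sqrt(0.2) * ones(1, 4);          % assumption: standard deviations
[d, p] = size(means);                     % d = 2 dimensions, p = 4 classes

% Four-class labels 1..4, with n consecutive rows per class
C = kron((1:p)', ones(n, 1));

% Isotropic Gaussian samples: class center plus scaled standard normal noise
s = sigmas(C);
s = s(:);                                 % force a (p*n) x 1 column
X = means(:, C)' + repmat(s, 1, d) .* randn(p * n, d);

% Opposite corners share a label: classes 1 and 3 map to 1, classes 2 and 4 to 0
Y = mod(C, 2);
```

Running scatter(X(:,1), X(:,2), 25, Y) afterwards colors the points by the two-class label. If the real function labels classes from 0 instead of 1, mod still pairs opposite corners; only which pair receives label 0 changes.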

3. Optional - Extra practice

  1. Generate more complex datasets using MixGauss, for instance by choosing larger variances, higher dimensionality of input space etc.
  2. Add noise to the data by randomly flipping the labels on the training set (in this case the error rate is the percentage of flipped labels).
  3. Given a dataset, compute the distances among all input points (use vectorization in your code and avoid "for" loops): how does the mean distance change with the number of dimensions?
  4. Generate regression data from a regression model defined by a linear function with Gaussian noise. Create a MATLAB function that takes as input the number of points n, the number of dimensions D, a D-dimensional coefficient vector w and a noise level delta. The output should be an (n x D) matrix X and an (n x 1) vector Y. You might want to test/visualize 1-D and 2-D cases, but make the function generic to account for higher dimensional data. Plot the underlying (linear) function and the noisy output on the same figure.
  5. Generate regression data for a 1-D regression model defined by a nonlinear function.
  6. Generate a dataset (either for regression or for classification) where most of the input variables are "noise", in the sense that they are unrelated to the output.
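
Three of the exercises above (label flipping, vectorized pairwise distances, and linear regression data) can be sketched in a few lines each. This is one possible approach under stated assumptions, not a reference solution: labels are assumed to be 0/1, delta is taken to be the standard deviation of the noise, the regression inputs are drawn uniformly on (0,1), and all variable names other than n, D, w, delta, X and Y are illustrative.

```matlab
% Label noise: flip a fraction pFlip of binary labels
n = 200;
Y = double(rand(n, 1) > 0.5);         % toy 0/1 labels (assumed label set)
pFlip = 0.1;                          % fraction of labels to flip
idx = randperm(n);
idx = idx(1:round(pFlip * n));        % distinct indices to flip
Ynoisy = Y;
Ynoisy(idx) = 1 - Ynoisy(idx);

% Pairwise Euclidean distances without loops, using
% ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2*xi'*xj
D = 5;
X = randn(n, D);
sq = sum(X.^2, 2);                    % n x 1 vector of squared norms
dist2 = repmat(sq, 1, n) + repmat(sq', n, 1) - 2 * (X * X');
dist2 = max(dist2, 0);                % clip tiny negatives caused by round-off
dist2(1:n+1:end) = 0;                 % exact zeros on the diagonal
dist = sqrt(dist2);
meanDist = sum(dist(:)) / (n * (n - 1));   % mean over pairs with i ~= j

% Linear regression data: Yr = Xr*w + Gaussian noise
w = randn(D, 1);                      % D-dimensional coefficient vector
delta = 0.1;                          % assumption: noise standard deviation
Xr = rand(n, D);                      % assumption: inputs uniform on (0,1)
Yr = Xr * w + delta * randn(n, 1);
```

Repeating the distance computation for increasing D and recording meanDist gives the curve asked for in exercise 3. The regression lines can be moved into a function file with inputs n, D, w and delta to match exercise 4.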