Self-Taught Learning

1 Overview
2 Learning features
3 On pre-processing the data
4 On the terminology of unsupervised feature learning

Overview

Assuming that we have a sufficiently powerful learning algorithm, one of the most reliable ways to get better performance is to give the algorithm more data. This has led to the aphorism that in machine learning, "sometimes it's not who has the best algorithm that wins; it's who has the most data."

One can always try to get more labeled data, but this can be expensive. In particular, researchers have already gone to extraordinary lengths to use tools such as AMT (Amazon Mechanical Turk) to get large training sets. While having large numbers of people hand-label lots of data is probably a step forward compared to having large numbers of researchers hand-engineer features, it would be nice to do better. In particular, the promise of self-taught learning and unsupervised feature learning is that if we can get our algorithms to learn from unlabeled data, then we can easily obtain and learn from massive amounts of it. Even though a single unlabeled example is less informative than a single labeled example, if we can get tons of the former---for example, by downloading random unlabeled images/audio clips/text documents off the internet---and if our algorithms can exploit this unlabeled data effectively, then we might be able to achieve better performance than the massive hand-engineering and massive hand-labeling approaches.

In self-taught learning and unsupervised feature learning, we give our algorithms a large amount of unlabeled data with which to learn a good feature representation of the input. If we are trying to solve a specific classification task, we then take this learned feature representation and whatever (perhaps small amount of) labeled data we have for that task, and apply supervised learning on the labeled data to solve it.

These ideas probably have the most powerful effects in problems where we have a lot of unlabeled data, and a smaller amount of labeled data. However, they typically give good results even if we have only labeled data (in which case we usually perform the feature learning step using the labeled data, but ignoring the labels).

Learning features

We have already seen how an autoencoder can be used to learn features from unlabeled data. Concretely, suppose we have an unlabeled training set $\{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)} \}$ with $m_u$ unlabeled examples. (The subscript "u" stands for "unlabeled.") We can then train a sparse autoencoder on this data (perhaps with appropriate whitening or other pre-processing):

[Figure: the sparse autoencoder network]

Having trained the parameters $W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}$ of this model, given any new input $x$, we can now compute the corresponding vector of activations $a$ of the hidden units. As we saw previously, this often gives a better representation of the input than the original raw input $x$. We can also visualize the algorithm for computing the features/activations $a$ as the following neural network:

[Figure: the sparse autoencoder with its output layer removed, mapping the input $x$ to the hidden activations $a$]

This is just the sparse autoencoder that we previously had, with the final layer removed.

Now, suppose we have a labeled training set $\{ (x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots, (x_l^{(m_l)}, y^{(m_l)}) \}$ of $m_l$ examples. (The subscript "l" stands for "labeled.") We can now find a better representation for the inputs. In particular, rather than representing the first training example as $x_l^{(1)}$, we can feed $x_l^{(1)}$ as the input to our autoencoder, and obtain the corresponding vector of activations $a_l^{(1)}$. To represent this example, we can simply replace the original feature vector with $a_l^{(1)}$. Alternatively, we can concatenate the two feature vectors together, getting a representation $(x_l^{(1)}, a_l^{(1)})$.

Thus, our training set now becomes $\{ (a_l^{(1)}, y^{(1)}), (a_l^{(2)}, y^{(2)}), \ldots, (a_l^{(m_l)}, y^{(m_l)}) \}$ (if we use the replacement representation, and use $a_l^{(i)}$ to represent the $i$-th training example), or $\{ ((x_l^{(1)}, a_l^{(1)}), y^{(1)}), ((x_l^{(2)}, a_l^{(2)}), y^{(2)}), \ldots, ((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)}) \}$ (if we use the concatenated representation). In practice, the concatenated representation often works better; but for memory or computational reasons, we will sometimes use the replacement representation as well.

Finally, we can train a supervised learning algorithm such as an SVM, logistic regression, etc. to obtain a function that makes predictions on the $y$ values. Given a test example $x_{\rm test}$, we would then follow the same procedure: first feed it to the autoencoder to get $a_{\rm test}$. Then, feed either $a_{\rm test}$ or $(x_{\rm test}, a_{\rm test})$ to the trained classifier to get a prediction.
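The replacement and concatenated representations can be sketched in a few lines of MATLAB. This is only an illustration under stated assumptions: `W1` and `b1` are random stand-ins for the first-layer parameters of a trained autoencoder, and the sizes `n`, `k`, and `m` are made up for the example.

```matlab
% Toy sizes (assumptions for illustration only)
n = 6;   % input dimension
k = 4;   % number of hidden units
m = 3;   % number of labeled examples

sigmoid = @(z) 1 ./ (1 + exp(-z));

W1 = 0.01 * randn(k, n);   % stand-in for trained weights W^(1)
b1 = zeros(k, 1);          % stand-in for trained biases  b^(1)
Xl = rand(n, m);           % labeled inputs, one example per column

% Hidden-layer activations, one column per labeled example
Al = sigmoid(bsxfun(@plus, W1 * Xl, b1));

replacementFeatures  = Al;         % k x m: activations only
concatenatedFeatures = [Xl; Al];   % (n + k) x m: raw input stacked on activations

disp(size(replacementFeatures));
disp(size(concatenatedFeatures));
```

Either matrix can then be passed, together with the labels, to the supervised learner. The concatenated form costs `n` extra rows per example, which is the memory trade-off mentioned above.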

On pre-processing the data

During the feature learning stage where we were learning from the unlabeled training set $\{ x_u^{(1)}, x_u^{(2)}, \ldots, x_u^{(m_u)} \}$, we may have computed various pre-processing parameters. For example, one may have computed a mean value of the data and subtracted off this mean to perform mean normalization, or used PCA to compute a matrix $U$ to represent the data as $U^T x$ (or used PCA whitening or ZCA whitening). If this is the case, then it is important to save these pre-processing parameters and to use the same parameters during the labeled training phase and the test phase, so that we always transform the data the same way before feeding it into the autoencoder. In particular, if we have computed a matrix $U$ using the unlabeled data and PCA, we should keep the same matrix $U$ and use it to pre-process the labeled examples and the test data. We should not re-estimate a different $U$ matrix (or data mean for mean normalization, etc.) using the labeled training set, since that might result in a dramatically different pre-processing transformation, which would make the input distribution to the autoencoder very different from what it was actually trained on.
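As a rough sketch of this rule, the snippet below estimates a mean and a PCA basis on the unlabeled data only and then reuses them unchanged for the labeled and test data. The data matrices and their sizes are invented for the example, and the variable names (`mu`, `U`, `XlRot`, and so on) are hypothetical rather than taken from the exercise code.

```matlab
% Toy data (assumed sizes); columns are examples
Xu    = rand(5, 200);   % unlabeled set used for feature learning
Xl    = rand(5, 20);    % labeled training set
Xtest = rand(5, 10);    % test set

% Estimate the pre-processing parameters on the unlabeled data only
mu    = mean(Xu, 2);                  % per-feature mean
Xu0   = bsxfun(@minus, Xu, mu);       % mean-normalized unlabeled data
sigma = Xu0 * Xu0' / size(Xu0, 2);    % covariance matrix
[U, S, V] = svd(sigma);               % columns of U form the PCA basis

% Reuse the saved mu and U at every later stage; do not re-estimate them
XuRot    = U' * Xu0;
XlRot    = U' * bsxfun(@minus, Xl, mu);
XtestRot = U' * bsxfun(@minus, Xtest, mu);
```

In a real pipeline `mu` and `U` would be saved alongside the autoencoder parameters (for instance with `save`) so that the test phase can load exactly the same transformation.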

On the terminology of unsupervised feature learning

There are two common unsupervised feature learning settings, depending on what type of unlabeled data you have. The more general and powerful setting is the self-taught learning setting, which does not assume that your unlabeled data $x_u$ has to be drawn from the same distribution as your labeled data $x_l$. The more restrictive setting, where the unlabeled data comes from exactly the same distribution as the labeled data, is sometimes called the semi-supervised learning setting. This distinction is best explained with an example, which we now give.

Suppose your goal is a computer vision task where you'd like to distinguish between images of cars and images of motorcycles; so, each labeled example in your training set is either an image of a car or an image of a motorcycle. Where can we get lots of unlabeled data? The easiest way would be to obtain some random collection of images, perhaps downloaded off the internet. We could then train the autoencoder on this large collection of images, and obtain useful features from them. Because here the unlabeled data is drawn from a different distribution than the labeled data (i.e., perhaps some of our unlabeled images may contain cars/motorcycles, but not every image downloaded is either a car or a motorcycle), we call this self-taught learning.

In contrast, if we happen to have lots of unlabeled images lying around that are all images of either a car or a motorcycle, but where the data is just missing its label (so you don't know which ones are cars, and which ones are motorcycles), then we could use this form of unlabeled data to learn the features. This setting---where each unlabeled example is drawn from the same distribution as your labeled examples---is sometimes called the semi-supervised setting. In practice, we often do not have this sort of unlabeled data (where would you get a database of images where every image is either a car or a motorcycle, but just missing its label?), and so in the context of learning features from unlabeled data, the self-taught learning setting is more broadly applicable.

Exercise:Self-Taught Learning

1 Overview
2 Dependencies
3 Step 1: Generate the input and test data sets
4 Step 2: Train the sparse autoencoder
5 Step 3: Extracting features
6 Step 4: Training and testing the logistic regression model
7 Step 5: Classifying on the test set

Overview

In this exercise, we will use the self-taught learning paradigm with the sparse autoencoder and softmax classifier to build a classifier for handwritten digits.

You will be building upon your code from the earlier exercises. First, you will train your sparse autoencoder on an "unlabeled" training dataset of handwritten digits. This produces features that are penstroke-like. We then extract these learned features from a labeled dataset of handwritten digits. These features will then be used as inputs to the softmax classifier that you wrote in the previous exercise.

Concretely, for each example in the labeled training dataset $x_l$, we forward propagate the example to obtain the activation of the hidden units $a^{(2)}$. We now represent this example using $a^{(2)}$ (the "replacement" representation), and use this as the new feature representation with which to train the softmax classifier.

Finally, we also extract the same features from the test data to obtain predictions.

In this exercise, our goal is to distinguish between the digits from 0 to 4. We will use the digits 5 to 9 as our "unlabeled" dataset with which to learn the features; we will then use a labeled dataset with the digits 0 to 4 with which to train the softmax classifier.

In the starter code, we have provided a file stlExercise.m that will help walk you through the steps in this exercise.

Dependencies

The following additional files are required for this exercise:

MNIST Dataset
Support functions for loading MNIST in Matlab
Starter Code (stl_exercise.zip)

You will also need your code from the following exercises:

Exercise:Sparse Autoencoder
Exercise:Vectorization
Exercise:Softmax Regression

If you have not completed the exercises listed above, we strongly suggest you complete them first.

Step 1: Generate the input and test data sets

Download and decompress stl_exercise.zip, which contains starter code for this exercise. Additionally, you will need to download the datasets from the MNIST Handwritten Digit Database for this project.

Step 2: Train the sparse autoencoder

Next, use the unlabeled data (the digits from 5 to 9) to train a sparse autoencoder, using the same sparseAutoencoderCost.m function as you had written in the previous exercise. (From the earlier exercise, you should have a working and vectorized implementation of the sparse autoencoder.) For us, the training step took less than 25 minutes on a fast desktop. When training is complete, you should get a visualization of pen strokes like the image shown below:

[Figure: visualization of the learned features, which resemble pen strokes]

Informally, the features learned by the sparse autoencoder should correspond to penstrokes.

Step 3: Extracting features

After the sparse autoencoder is trained, you will use it to extract features from the handwritten digit images.

Complete feedForwardAutoencoder.m to produce a matrix whose columns correspond to activations of the hidden layer for each example, i.e., the vector $a^{(2)}$ corresponding to activation of layer 2. (Recall that we treat the inputs as layer 1.)

After completing this step, calling feedForwardAutoencoder.m should convert the raw image data to hidden unit activations $a^{(2)}$.
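The computation inside feedForwardAutoencoder.m might look roughly like the sketch below. It assumes the parameter vector is packed as `[W1(:); W2(:); b1; b2]` with `W1` of size hiddenSize x visibleSize; the `W1` part agrees with the reshape used for visualization in the exercise script, but the position of `b1` is an assumption about the starter code, so check it against your own sparseAutoencoderCost.m. The toy `theta` and `data` here are random and only serve to make the snippet runnable.

```matlab
% Toy setup (random values, illustration only)
hiddenSize  = 3;
visibleSize = 5;
theta = 0.1 * randn(2 * hiddenSize * visibleSize + hiddenSize + visibleSize, 1);
data  = rand(visibleSize, 7);   % 7 examples, one per column

% Unpack the first-layer parameters (assumed layout: [W1(:); W2(:); b1; b2])
W1 = reshape(theta(1:hiddenSize * visibleSize), hiddenSize, visibleSize);
b1Start = 2 * hiddenSize * visibleSize + 1;
b1 = theta(b1Start : b1Start + hiddenSize - 1);

% Forward propagate to layer 2: a2 = sigmoid(W1 * x + b1) for every column x
sigmoid = @(z) 1 ./ (1 + exp(-z));
activation = sigmoid(bsxfun(@plus, W1 * data, b1));   % hiddenSize x numExamples
```

Only the encoder half of the network is used; `W2` and `b2` are needed during training but play no role when extracting features.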

Step 4: Training and testing the logistic regression model

Use your code from the softmax exercise (softmaxTrain.m) to train a softmax classifier using the training set features (trainFeatures) and labels (trainLabels).

Step 5: Classifying on the test set

Finally, complete the code to make predictions on the test set (testFeatures) and see how your learned features perform! If you've done all the steps correctly, you should get an accuracy of about 98%.

As a comparison, when raw pixels are used (instead of the learned features), we obtained a test accuracy of only around 96% (for the same train and test sets).

CS294A/CS294W Self-taught Learning Exercise

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  self-taught learning. You will need to complete code in feedForwardAutoencoder.m
%  You will also need to have implemented sparseAutoencoderCost.m and
%  softmaxCost.m from previous exercises.
%

======================================================================

STEP 0: Here we provide the relevant parameter values that will
allow your sparse autoencoder to get good filters; you do not need to
change the parameters below.

inputSize  = 28 * 28;
numLabels  = 5;      % digits 0 to 4
hiddenSize = 400;
sparsityParam = 0.05; % desired average activation of the hidden units.
                     % (This was denoted by the Greek letter rho, which looks like a lower-case "p",
                     %  in the lecture notes).
lambda = 3e-3;       % weight decay parameter
beta = 3;            % weight of sparsity penalty term
maxIter = 400;

======================================================================

STEP 1: Load data from the MNIST database

This loads our training and test data from the MNIST database files.
We have sorted the data for you, so you will not have to change it.

% Load MNIST database files
mnistData   = loadMNISTImages('train-images.idx3-ubyte');
mnistData=mnistData(:,1:1000);
mnistLabels = loadMNISTLabels('train-labels.idx1-ubyte');
mnistLabels=mnistLabels(1:1000);
%mnistLabels(mnistLabels==0) = 10; % Remap 0 to 10

% testData   = loadMNISTImages('t10k-images.idx3-ubyte');
% testData=testData(:,1:1000);
% testLabels = loadMNISTLabels('t10k-labels.idx1-ubyte');
% testLabels=testLabels(1:1000);
% testLabels(testLabels==0) = 10; % Remap 0 to 10


% % %debug
% trainData=mnistData(:,1:100);
% mnistLabels=mnistLabels(1:100);

% Simulate a labeled and an unlabeled set
labeledSet   = find(mnistLabels >= 0 & mnistLabels <= 4);
unlabeledSet = find(mnistLabels >= 5);

% Added line: keep only the first third of the unlabeled examples
unlabeledSet = unlabeledSet(1:floor(end/3));

numTest = round(numel(labeledSet)/2); % half of the labeled samples for testing (not used below)
% Use the first third of the labeled examples for training and the
% second third for testing
numTrain = round(numel(labeledSet)/3);
trainSet = labeledSet(1:numTrain);
testSet  = labeledSet(numTrain+1:2*numTrain);

unlabeledData = mnistData(:, unlabeledSet); % Why do these two statements raise an error when run together?
%unlabeledData=trainData;
% pack;
trainData   = mnistData(:, trainSet);
trainLabels = mnistLabels(trainSet)' + 1; % Shift Labels to the Range 1-5

% mnistData2 = mnistData;
testData   = mnistData(:, testSet);
testLabels = mnistLabels(testSet)' + 1;   % Shift Labels to the Range 1-5

% Output Some Statistics
fprintf('# examples in unlabeled set: %d\n', size(unlabeledData, 2));
fprintf('# examples in supervised training set: %d\n\n', size(trainData, 2));
fprintf('# examples in supervised testing set: %d\n\n', size(testData, 2));

======================================================================

STEP 2: Train the sparse autoencoder
This trains the sparse autoencoder on the unlabeled training
images.

%  Randomly initialize the parameters
theta = initializeParameters(hiddenSize, inputSize);

----------------- YOUR CODE HERE ----------------------

Find opttheta by running the sparse autoencoder on
unlabeledTrainingImages

tic
opttheta = theta;
addpath minFunc/
options.Method = 'cg';
options.maxIter = maxIter;
options.display = 'on';
% Train on the unlabeled data, as the exercise specifies
[opttheta, loss] = minFunc( @(p) sparseAutoencoderCost(p, ...
      inputSize, hiddenSize, ...
      lambda, sparsityParam, ...
      beta, unlabeledData), ...
      theta, options);
toc

-----------------------------------------------------

% Visualize weights
W1 = reshape(opttheta(1:hiddenSize * inputSize), hiddenSize, inputSize);
display_network(W1');

%%======================================================================

STEP 3: Extract Features from the Supervised Dataset

You need to complete the code in feedForwardAutoencoder.m so that the
following command will extract features from the data.

trainFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
                                       trainData);
totalElem=size(trainFeatures,1)*size(trainFeatures,2);
ave=sum(trainFeatures(:))/totalElem
ind=find(abs(trainFeatures)<0.1);
sparseRate=numel(ind)/totalElem


%trainFeatures2=trainFeatures;
%trainFeatures2(ind)=0;


%display_network(trainFeatures);

testFeatures = feedForwardAutoencoder(opttheta, hiddenSize, inputSize, ...
                                       testData);
totalElem2=size(testFeatures,1)*size(testFeatures,2);
ave2=sum(testFeatures(:))/totalElem2
ind2=find(abs(testFeatures)<0.1);
sparseRate2=numel(ind2)/totalElem2
%%======================================================================

ave =

    0.0534


sparseRate =

    0.9056


ave2 =

    0.0449


sparseRate2 =

    0.9019

STEP 4: Train the softmax classifier

softmaxModel = struct;

----------------- YOUR CODE HERE ----------------------

Use softmaxTrain.m from the previous exercise to train a multi-class
classifier.

%  Use lambda = 1e-4 for the weight regularization for softmax
lambda = 1e-4;
inputSize = hiddenSize;
numClasses = numel(unique(trainLabels)); % unique returns the distinct elements of a vector, sorted

% You need to compute softmaxModel using softmaxTrain on trainFeatures and
% trainLabels

tic
options.maxIter = 100;
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...
                            trainFeatures, trainLabels, options);
toc
%numClasses

-----------------------------------------------------

STEP 5: Testing

----------------- YOUR CODE HERE ----------------------

Compute Predictions on the test set (testFeatures) using softmaxPredict and softmaxModel

[pred] = softmaxPredict(softmaxModel, testFeatures);

-----------------------------------------------------

% Classification Score
fprintf('Test Accuracy: %f%%\n', 100*mean(pred(:) == testLabels(:)));

% (note that we shift the labels by 1, so that digit 0 now corresponds to
%  label 1)
%
% Accuracy is the proportion of correctly classified images
% The result for our implementation was:
%
% Accuracy: 98.3%
%
%

Published with MATLAB® 7.11
