
Lab 5 – Perceptrons, Deep Net, and Convolutional Neural Net¶

In this lab, we introduce how to implement a perceptron, a deep neural network (DNN), and a convolutional neural network (CNN). Although it is not good practice, we use all the data to both train and test our models for the purpose of demonstration. We also present you with code that works but yields poor results. We expect you to spot these issues and improve the code. Exercises are also provided at the end of each section to improve your technical skills.

Setup¶

Make sure that the following code is executed before any other section of this lab.

# To support both python 2 and 3
from __future__ import division, print_function, unicode_literals

# Common imports
import os
import numpy as np

# These two lines are required to use Tensorflow 1
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# To plot nice figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Clear TensorFlow's default graph and reset the random seeds
def reset_graph(seed=None):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

A Perceptron¶

In this section, we will use an artificial neuron (aka perceptron) to perform binary classification on linearly separable data. Specifically, we will use a portion of the Iris dataset; the description of this dataset can be found at http://scikit-learn.org/stable/datasets/index.html#iris-dataset.

from sklearn.datasets import load_iris

# get dataset
iris = load_iris()
X = iris.data[:, (2, 3)]  # use only petal length and petal width
y = (iris.target == 0).astype(int) # classify them as either setosa or not setosa (np.int was removed in NumPy 1.24)

# visualise the data
axes = [0, 5, 0, 2]
plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs", label="Not Iris-Setosa")
plt.plot(X[y==1, 0], X[y==1, 1], "yo", label="Iris-Setosa")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="lower right", fontsize=14)
plt.axis(axes)
plt.show()

Clearly, this task can be easily done by using a linear classifier. Could you visualise the linear decision boundary on the figure above? Where should it be?

Now, let's move on to implementing a perceptron by using Scikit-learn.

from sklearn.linear_model import Perceptron

# initialise and train a perceptron
pct = Perceptron(max_iter=100, random_state=None)
pct.fit(X, y)

Notice that there are many parameters that you can tweak later on. You can have a look at the description of each parameter in the Scikit-Learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html
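For intuition about what training a perceptron involves, here is a minimal sketch of the classic perceptron learning rule in plain NumPy. It is not Scikit-learn's actual implementation; the function names `train_perceptron` and `predict_perceptron` and the toy data are hypothetical and exist only for this illustration.

```python
import numpy as np

def train_perceptron(X, y, n_epochs=100, lr=1.0):
    """Classic perceptron rule: weights change only on misclassified samples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        n_errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0  # step activation
            update = lr * (yi - y_hat)           # 0 when the prediction is right
            if update != 0:
                w += update * xi
                b += update
                n_errors += 1
        if n_errors == 0:  # every sample is classified correctly
            break
    return w, b

def predict_perceptron(X, w, b):
    return (X @ w + b >= 0).astype(int)

# Hypothetical toy data: two linearly separable clusters
X_toy = np.array([[1.0, 0.2], [1.5, 0.3], [1.2, 0.1],
                  [4.5, 1.5], [5.0, 1.8], [4.8, 1.4]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

w, b = train_perceptron(X_toy, y_toy)
print("weights:", w, "bias:", b)
print("predictions:", predict_perceptron(X_toy, w, b))
```

Because the toy data are linearly separable, the loop stops as soon as a full pass produces no mistakes.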

Next, we will extract the decision boundary from the model. Below we show a general way of extracting a decision boundary with any model. Note that it can be very computationally expensive if the feature space is large.

# sample the whole feature space on a grid and predict each sampled point
x0, x1 = np.meshgrid(
        np.linspace(axes[0], axes[1], 10).reshape(-1, 1),
        np.linspace(axes[2], axes[3], 10).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]
y_predict = pct.predict(X_new)
zz = y_predict.reshape(x0.shape)

# plot the datapoints again
plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs", label="Not Iris-Setosa")
plt.plot(X[y==1, 0], X[y==1, 1], "yo", label="Iris-Setosa")

# get a nice color
from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#9898ff', '#fafab0'])

# plot the predicted samples of feature space
plt.contourf(x0, x1, zz, cmap=custom_cmap)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="lower right", fontsize=14)
plt.axis(axes)
plt.show()
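The remark about computational cost can be made concrete with a small sketch. A regular grid with the same number of samples along every feature axis needs that number raised to the power of the number of features. The helper `grid_size` and the figure of 50 points per axis are assumptions chosen for illustration.

```python
import numpy as np

def grid_size(points_per_axis, n_features):
    """Number of points in a regular grid over the feature space."""
    return points_per_axis ** n_features

for n_features in (2, 4, 10):
    print(n_features, "features ->", grid_size(50, n_features), "grid points")

# With two features the grid is still cheap to build and predict on
g0, g1 = np.meshgrid(np.linspace(0, 5, 50), np.linspace(0, 2, 50))
X_grid = np.c_[g0.ravel(), g1.ravel()]
print(X_grid.shape)
```

This is why grid sampling is practical for two-feature plots like the one above but not for high-dimensional inputs.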

Exercise 1

The decision boundary of a single perceptron is a single straight line, but the plot above shows otherwise! Fix this plot. (Hint: you need to sample the feature space more densely)
Try running the two snippets above, in which the perceptron is initialised, trained, and plotted (cells [3] and [4]), multiple times. Do you always get the same decision boundary? Why?
A single perceptron is no different from a linear classifier, which can be described by a straight-line equation. Retrieve the formula for it. Verify that it is correct by comparing it with the plot above. (Hint: have a look at the list of attributes in the online documentation)

Activation Functions¶

There are many activation functions that can be used in a perceptron. Different functions result in different behaviours, and consequently different pros & cons. Though we will not go into details, it is beneficial for you to know some popular activation functions.

$$ \text{heaviside} (z) = \begin{cases} 1 & \quad \text{if } z \geq 0 \\ 0 & \quad \text{otherwise} \end{cases} $$

$$ \text{logit} (z) = \frac{1}{1 + e^{-z}} $$

$$ \text{relu} (z) = \max{\left( 0 , z \right)} $$

$$ \text{leaky_relu} (z, \alpha) = \max{\left( \alpha z , z \right)} $$

$$ \text{elu} (z, \alpha) = \begin{cases} \alpha \left( e^z - 1 \right) & \quad \text{if } z < 0 \\ z & \quad \text{otherwise} \end{cases} $$

Exercise 2 Complete the cell below with the code for the activation functions listed (see equations). Note that they must be able to process NumPy arrays as well.

def heaviside(z): # modify this function. Hint: Use astype(z.dtype)
    return 0

def logit(z): # modify this function. Hint: Use np.exp()
    return 0

def relu(z): # modify this function. Hint: Use np.maximum()
    return 0

def leaky_relu(z, alpha): # modify this function and set default alpha to 0.01
    return 0

def elu(z, alpha=1): # No need to modify this function!
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

def selu(z, # No need to modify this function!
         scale=1.0507009873554804934193349852946,
         alpha=1.6732632423543772848170429916717):
    return scale * elu(z, alpha)

z = np.linspace(-5, 5, 200)

plt.figure(figsize=(11,11))

plt.subplot(221)
plt.plot(z, np.sign(z), "r-", linewidth=2, label="Step")
plt.plot(z, np.tanh(z), "b:", linewidth=2, label="Tanh")
plt.plot(z, heaviside(z), "y--", linewidth=2, label="Heaviside")
plt.plot(z, logit(z), "g-.", linewidth=2, label="Logit")
plt.grid(True)
plt.legend(loc="lower right", fontsize=14)
plt.title("Activation Functions", fontsize=14)
plt.axis([-5, 5, -1.2, 1.2])

plt.subplot(222)
plt.plot(z, relu(z), "m-", linewidth=2, label="ReLU")
plt.plot(z, leaky_relu(z, 0.05), "k:", linewidth=2, label="Leaky_ReLU")
plt.plot(z, elu(z), "y--", linewidth=2, label="ELU")
plt.plot(z, selu(z), "g-.", linewidth=2, label="SELU")
plt.grid(True)
plt.legend(loc="lower right", fontsize=14)
plt.title("Activation Functions", fontsize=14)
plt.axis([-5, 5, -2, 2])

plt.show()

You should be able to see the following characteristics from the graph:

Step function and Heaviside function are quite similar except for their output ranges.
Similarly, the hyperbolic tangent and the logit/sigmoidal function are nearly the same except for their output ranges.
Lastly, all variants of ReLU functions behave differently only when the input sum of a perceptron is lower than zero.

Note that different functions have different sensitivity to the perceptron input.
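One way to quantify this sensitivity is to look at the slope of each activation function at different inputs. The sketch below estimates the slope with a central finite difference for `np.tanh` and for the `elu` function given above; the helper `numerical_derivative` and the chosen sample points are assumptions made for illustration.

```python
import numpy as np

def elu(z, alpha=1):  # same definition as in the exercise cell above
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z_points = np.array([-4.0, -1.0, 0.5, 4.0])
d_tanh = numerical_derivative(np.tanh, z_points)
d_elu = numerical_derivative(elu, z_points)

for zi, dt, de in zip(z_points, d_tanh, d_elu):
    print("z = %5.1f   d tanh/dz = %.4f   d elu/dz = %.4f" % (zi, dt, de))
```

The slope of tanh shrinks towards zero for large positive and negative inputs (it saturates on both sides), whereas the slope of ELU stays at 1 for positive inputs and only decays on the negative side.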

Multi-Layer Perceptron (MLP) with Scikit-Learn¶

In this section, we introduce how to implement a multilayer perceptron (MLP) with Scikit-learn. Note that Scikit-learn's MLP is not suitable for very large neural networks.

from sklearn.datasets import load_iris

# get dataset if you haven't
iris = load_iris()
X = iris.data[:, (2, 3)]  # use only petal length and petal width
y = (iris.target == 0).astype(int) # classify them as either setosa or not setosa (np.int was removed in NumPy 1.24)

from sklearn.neural_network import MLPClassifier

# Initialise a multi-layer perceptron
mlp = MLPClassifier(max_iter=1, learning_rate_init=0.01, random_state=None, warm_start=True)
mlp

Note the MLP's parameters that you can play with. For a description of each parameter, have a look at the online documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html.

Now, we will show what the decision boundary looks like and how it changes after each training epoch. (Note that you must generate a new MLP every time before you run the code below)

# Pre-define the axes for plotting
axes = [0, 7, 0, 3]

# Pre-generate a grid of sampling points
x0, x1 = np.meshgrid(
        np.linspace(axes[0], axes[1], 200).reshape(-1, 1),
        np.linspace(axes[2], axes[3], 200).reshape(-1, 1),
    )

# Now, show the change after fitting epoch by epoch
for epochs in range(0,30):
    
    # Fit the model
    mlp.fit(X, y)
    
    # Plot the dataset
    plt.figure(figsize=(10, 4))
    plt.plot(X[y==1, 0], X[y==1, 1], "yo", label="Iris-Setosa")
    plt.plot(X[y==0, 0], X[y==0, 1], "bs", label="Not Iris-Setosa")
    
    # Use the model to sample predictions over the whole feature space
    y_predict = mlp.predict(np.c_[x0.ravel(), x1.ravel()])
    zz = y_predict.reshape(x0.shape)
    
    # get a nice color
    from matplotlib.colors import ListedColormap
    custom_cmap = ListedColormap(['#9898ff', '#fafab0'])
    
    # Use contour plot again
    plt.contourf(x0, x1, zz, cmap=custom_cmap)
    plt.xlabel("Petal length", fontsize=14)
    plt.ylabel("Petal width", fontsize=14)
    plt.legend(loc="upper left", fontsize=14)
    plt.axis(axes)
    plt.show()
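The loop above relies on `max_iter=1` together with `warm_start=True`, so each call to `fit` continues from the previous weights for one more epoch. As a complement to the plots, the sketch below records the training loss after each call through the `loss_` attribute. The names `mlp_demo`, `X_iris`, and `y_iris`, and the fixed `random_state=42`, are assumptions chosen so that the sketch does not overwrite the variables used in this section.

```python
import warnings
import numpy as np
from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

iris_demo = load_iris()
X_iris = iris_demo.data[:, (2, 3)]              # petal length and petal width
y_iris = (iris_demo.target == 0).astype(int)    # setosa vs. not setosa

mlp_demo = MLPClassifier(max_iter=1, learning_rate_init=0.01,
                         random_state=42, warm_start=True)

losses = []
with warnings.catch_warnings():
    # max_iter=1 makes every fit() call report that it has not converged yet
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    for epoch in range(30):
        mlp_demo.fit(X_iris, y_iris)    # continues from the previous weights
        losses.append(mlp_demo.loss_)   # training loss after this epoch

print("first loss:", losses[0], "last loss:", losses[-1])
print("training accuracy:", mlp_demo.score(X_iris, y_iris))
```

Plotting `losses` against the epoch index gives a quick numerical view of the same progress that the sequence of decision-boundary figures shows visually.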

Exercise 3

What is the structure of this MLP? How many neurons in each layer?
Try different numbers of neurons in the hidden layer. Observe any difference during and after the training.
Try different activation functions such as logistic or hyperbolic tangent function. Observe any difference in the resulting plot.
Try a stochastic gradient descent optimiser, and configure the learning rate and momentum accordingly. Observe any difference during and after the training.

(Hint: Refer to the online documentation on http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)

(Deeper) Neural Net for MNIST on TensorFlow¶

In this section, we will construct and train a deeper neural network with TensorFlow to perform classification. To train a large number of neurons, we would generally need a large dataset. So we will use MNIST from now on. Though it is not a good practice, we will use the whole dataset to train our neural net (for demonstration purposes).

# Load and use all digits in MNIST
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
digits = np.concatenate((X_train, X_test))
labels = np.concatenate((y_train, y_test))

# Pre-processing the data
t_digits = digits.astype(np.float32).reshape(-1, 28*28) / 255.0
t_labels = labels.astype(np.int32)

Next, we will define a function to construct a layer of fully-connected neurons. This is more convenient than individually creating each neuron or perceptron.

n_inputs = 28*28  # Total number of pixels in each MNIST's digit
n_hidden1 = 300 # Number of neurons in 1st hidden layer
n_hidden2 = 100 # Number of neurons in 2nd hidden layer
n_outputs = 10 # Number of neurons in output layer

reset_graph() # as we defined in the beginning of this notebook

# Create TensorFlow's placeholders for t_digits and t_labels
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None,), name="y")  # 1-D vector of labels

# Define a function to create a layer of fully-connected neurons
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z
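To see what a single fully-connected layer computes without the TensorFlow graph machinery, here is a rough NumPy sketch of the same forward pass, Z = XW + b followed by an optional activation. The function name `dense_forward`, the random demo inputs, and the use of `np.tanh` as the activation are assumptions for illustration only.

```python
import numpy as np

def dense_forward(X, W, b, activation=None):
    """Forward pass of one fully-connected layer: activation(X @ W + b)."""
    Z = X @ W + b  # (batch, n_inputs) @ (n_inputs, n_neurons) -> (batch, n_neurons)
    return activation(Z) if activation is not None else Z

rng = np.random.default_rng(0)
n_in, n_out = 28 * 28, 300

X_demo = rng.random((5, n_in))                                 # 5 fake flattened images
W_demo = rng.normal(scale=2 / np.sqrt(n_in), size=(n_in, n_out))
b_demo = np.zeros(n_out)

H = dense_forward(X_demo, W_demo, b_demo, activation=np.tanh)
print(H.shape)
```

Each row of the result holds the outputs of all 300 neurons for one input sample, which is why stacking layers only requires that the number of columns of one layer matches the number of inputs of the next.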

We then use this function to generate a layer of neurons that connect to either the input or the previous layer.

# Construct MLP with two layers
hidden1 = neuron_layer(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
logits = neuron_layer(hidden2, n_outputs, name="outputs")

# Or uncomment the lines below to use TensorFlow's premade layers instead of our function
#hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
#hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
#logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

To train this network, we need to define a loss function and choose an optimiser. After everything is constructed in TensorFlow, we run a session to execute it as usual. The code below also demonstrates how to save and restore the trained model for later use.

# Use mean softmax cross entropy as a loss function
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy, name="loss")

# Use gradient descent to train MLP
training_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)

# Define accuracy measure
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

# Initialise and run TensorFlow's computation graph of MLP
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    for epoch in range(50):
        sess.run(training_op, feed_dict={X: t_digits, y: t_labels})
        acc_batch = accuracy.eval(feed_dict={X: t_digits, y: t_labels})
        print(epoch, "Accuracy:", acc_batch)
    
    # save the trained model
    save_path = tf.train.Saver().save(sess, "./trained_mnist_ann.ckpt")
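To make the loss and accuracy used above more concrete, the following NumPy sketch computes a sparse softmax cross-entropy from raw logits and integer labels, along with a top-1 accuracy. It illustrates the underlying arithmetic and is not TensorFlow's implementation; the function names and the demo values are hypothetical. One known difference is that `tf.nn.in_top_k` counts tied top scores as correct, whereas `np.argmax` picks the first maximum.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-sample cross-entropy between softmax(logits) and integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest logit is at the true class."""
    return np.mean(np.argmax(logits, axis=1) == labels)

# Hypothetical logits for 3 samples and 3 classes
logits_demo = np.array([[2.0, 0.5, 0.1],
                        [0.2, 0.1, 3.0],
                        [1.0, 0.5, 1.5]])
labels_demo = np.array([0, 2, 1])

per_sample = sparse_softmax_cross_entropy(logits_demo, labels_demo)
print("per-sample loss:", per_sample)
print("mean loss:", per_sample.mean())
print("accuracy:", top1_accuracy(logits_demo, labels_demo))
```

The third sample is misclassified, so it contributes the largest loss and lowers the accuracy to two out of three.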

# pick one digit at random
rnd_id = np.random.randint(0, len(digits))

# show the digit
plt.figure()
plt.imshow(digits[rnd_id])
plt.colorbar()
plt.grid(False)

# load the trained model and use to predict
with tf.Session() as sess:
    tf.train.Saver().restore(sess, "./trained_mnist_ann.ckpt")
    Z = logits.eval(feed_dict={X: t_digits[rnd_id].reshape(1, 28*28)})
    y_pred = np.argmax(Z, axis=1)
print("Predicted class: ", y_pred)
print("Actual class: ", labels[rnd_id])

Rerun the cell above multiple times to see how accurate our trained model is. You should be able to see that the resulting accuracy is very low and that the training is somewhat time-consuming.

Exercise 4 Modify and tune the neural net such that the training time is reduced but the accuracy is still acceptably high. You should try the following:

Change the structure of the network by adding/removing a hidden layer or increasing/reducing number of neurons.
Change the activation function of the hidden layers.
Choose different optimisation algorithms such as tf.train.MomentumOptimizer(), tf.train.RMSPropOptimizer(), and tf.train.AdamOptimizer(). Don't forget to change the training parameters accordingly.

Do you observe any effect on the accuracy during the tuning? What is the best model that you can achieve?

Convolutional Neural Network (CNN) with TensorFlow¶

We now move on to convolutional neural net (CNN). The idea behind this architecture originated from a study on the animal visual cortex.

# Load and use all digits in MNIST if you have directly jumped to this section
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
digits = np.concatenate((X_train, X_test))
labels = np.concatenate((y_train, y_test))

# Pre-processing the data
t_digits = digits.astype(np.float32).reshape(-1, 28*28) / 255.0
t_labels = labels.astype(np.int32)

# MNIST's specification
height = 28
width = 28
channels = 1

As usual, we begin by creating a TensorFlow computation graph, a loss function, and an optimiser for training the CNN.

reset_graph() # as we defined in the beginning of this notebook

# Create TensorFlow's placeholders for digits and labels
X = tf.placeholder(tf.float32, shape=[None, height * width], name="X")
X_reshaped = tf.reshape(X, shape=[-1, height, width, channels])
y = tf.placeholder(tf.int32, shape=[None], name="y")

# Construct 2D convolutional layers
conv1 = tf.layers.conv2d(X_reshaped, filters=20, kernel_size=3, strides=1, 
                         padding="SAME", activation=tf.nn.relu, name="conv1")
conv2 = tf.layers.conv2d(conv1, filters=40, kernel_size=3, strides=2, 
                         padding="SAME", activation=tf.nn.relu, name="conv2")

# Create a max pooling layer
pool3 = tf.nn.max_pool(conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")
pool3_flat = tf.reshape(pool3, shape=[-1, 40 * 7 * 7])

# Followed by layer of fully-connected neurons
fc1 = tf.layers.dense(pool3_flat, 50, activation=tf.nn.relu, name="fc1")
logits = tf.layers.dense(fc1, 10, name="output")

# Use mean softmax cross entropy as a loss function
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
loss = tf.reduce_mean(xentropy)

# Use Adam Optimiser to train CNN 
training_op = tf.train.AdamOptimizer().minimize(loss, 
                                                aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
# (Change the aggregation_method to tf.AggregationMethod.EXPERIMENTAL_TREE or DEFAULT if it doesn't work)

# Define accuracy measure
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
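The `40 * 7 * 7` used when flattening `pool3` follows from how each layer changes the spatial size of its input. The sketch below reproduces that arithmetic with the standard output-size rules for "SAME" and "VALID" padding; the helper name `conv_output_size` is an assumption made for this illustration.

```python
import math

def conv_output_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer along one axis."""
    if padding == "SAME":
        return math.ceil(n / stride)
    if padding == "VALID":
        return math.ceil((n - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

size = 28                                                               # MNIST height and width
size = conv_output_size(size, kernel=3, stride=1, padding="SAME")       # conv1
size_after_conv1 = size
size = conv_output_size(size, kernel=3, stride=2, padding="SAME")       # conv2
size_after_conv2 = size
size = conv_output_size(size, kernel=2, stride=2, padding="VALID")      # pool3
size_after_pool3 = size

n_flat = 40 * size_after_pool3 * size_after_pool3   # 40 feature maps from conv2
print(size_after_conv1, size_after_conv2, size_after_pool3, n_flat)
```

If you change a stride, a kernel size, or the number of filters while tuning the network, the flattened size passed to `tf.reshape` has to be recomputed in the same way.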

Then, we proceed to executing the TensorFlow computation graph, which will train our CNN.

**_Note that training a CNN is generally time- and memory-consuming. It is very likely that your PC will either slow down or freeze. If this is the case, click the stop button above, wait for a couple of minutes, restart your Python kernel, and jump down to the exercise below._**

# Define a function to make training batches
# This is useful when your PC doesn't have much memory
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

# Train the CNN batch by batch
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    for epoch in range(10):
        for X_batch, y_batch in shuffle_batch(t_digits, t_labels, 50): 
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: t_digits, y: t_labels})
        print(epoch, "Accuracy:", acc_batch)
    
    # save the trained model
    save_path = tf.train.Saver().save(sess, "./trained_mnist_cnn.ckpt")
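To check how `shuffle_batch` behaves, the sketch below runs it on a small hypothetical dataset of 23 samples with a requested batch size of 5. The function is copied from the cell above so that the sketch is self-contained; the demo arrays are assumptions.

```python
import numpy as np

def shuffle_batch(X, y, batch_size):  # same function as in the cell above
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

# Hypothetical dataset whose size is not a multiple of the batch size
X_small = np.arange(23).reshape(-1, 1)
y_small = np.arange(23)

batches = list(shuffle_batch(X_small, y_small, 5))
sizes = [len(y_b) for _, y_b in batches]
seen = np.sort(np.concatenate([y_b for _, y_b in batches]))

print("number of batches:", len(batches))
print("batch sizes:", sizes)
print("every sample used exactly once:", np.array_equal(seen, y_small))
```

Because `np.array_split` spreads the leftover samples across the batches, some batches can be slightly larger than `batch_size`, and no sample is dropped. Note also that the function raises an error when the dataset is smaller than `batch_size`, since the number of batches would then be zero.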

# pick one digit at random to test the CNN's prediction
rnd_id = np.random.randint(0, len(digits))

# visualise the digit
plt.figure()
plt.imshow(digits[rnd_id])
plt.colorbar()
plt.grid(False)

# load the trained model and use to predict
with tf.Session() as sess:
    tf.train.Saver().restore(sess, "./trained_mnist_cnn.ckpt")
    Z = logits.eval(feed_dict={X: t_digits[rnd_id].reshape(1, 28*28)})
    y_pred = np.argmax(Z, axis=1)

print("Predicted class: ", y_pred)
print("Actual class: ", labels[rnd_id])

Exercise 5

Visualise and/or draw on your paper this convolutional neural net to figure out its current structure.
Tune the model such that the accuracy is acceptably good, the required memory is low, and the training time is small.

Overfitting¶

'With 4 parameters I can fit an elephant and with 5 I can make him wiggle his trunk.' John von Neumann, cited by Enrico Fermi in Nature 427

Do not forget that an overfitted model will not perform well in the real world. It is therefore important for you to know how to prevent this issue with neural networks in general.

Exercise 6

Recall the characteristic of overfitted models with respect to their performance on the training and test sets.
Restore this notebook back to its original state and then modify the code above to partition the MNIST dataset into training set and test set.
Further modify the training phase of deep net and/or CNN to use only the training set and evaluate accuracy or loss on both datasets.
On deep net and/or CNN for MNIST above, implement one or a combination of the regularisation techniques listed below. Observe any difference or change in performance during training:

4.1. Early stopping, where you stop training your model if there is no further significant improvement of performance on your test set. (Hint: regularly check the performance on both sets and always store the best model)

4.2. $l_1$ or $l_2$ regularisation, by correctly specifying TensorFlow parameters. (Hint: Look for 'kernel_regularizer' in the online documentation)

4.3. Dropout, where each neuron has a probability of being turned off at each training step during the training phase (Hint: apply tf.layers.dropout() to the input layer and/or any hidden layer's output, but NOT the output of the output layer)
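As a conceptual aid for the dropout item, the NumPy sketch below shows the idea of "inverted" dropout on a block of activations: a random mask zeroes some units and the survivors are rescaled so that the expected activation is unchanged. It illustrates the idea only and is not the TensorFlow code the exercise asks for; the function name `dropout_sketch` and the demo values are hypothetical.

```python
import numpy as np

def dropout_sketch(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                       # no units are dropped at test time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob        # rescale to keep the expected value

rng = np.random.default_rng(0)
h = np.ones((1000, 100))                         # fake hidden-layer outputs

h_train = dropout_sketch(h, rate=0.5, rng=rng, training=True)
h_test = dropout_sketch(h, rate=0.5, rng=rng, training=False)

print("fraction dropped:", np.mean(h_train == 0))
print("mean activation while training:", h_train.mean())
print("unchanged at test time:", np.array_equal(h_test, h))
```

A fresh mask is drawn on every call, which is why dropout acts at each training step rather than once per epoch.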

Sidenote¶

There are many high-level APIs that you can use to quickly create and deploy Machine Learning prototypes. They are very useful, but it is difficult to make non-standard changes to their implementations of Machine Learning models. If you are interested, have a look at the following:

Estimators: https://www.tensorflow.org/guide/estimators
Keras: https://www.tensorflow.org/guide/keras
Eager execution: https://www.tensorflow.org/guide/eager

Reference¶

Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.