
Coding a convolutional neural network from scratch in TensorFlow

Neural networks are nowadays usually coded in high-level programming interfaces provided by packages like Keras. Here, we are going to construct a convolutional neural network (CNN) in the lower-level TensorFlow interface, which allows a better understanding of the underlying mathematical procedures, as well as greater flexibility.

In past articles, we have constructed deep neural networks in a variety of ways: using the common Keras programming interface, but also from scratch and with basic auto-differentiator functions in R and in C++ with an R wrapper. In Python, we can code deep neural networks using the TensorFlow package, which allows for relatively low-level coding (i.e., little dependency on complex functions) and provides auto-differentiation capacity. TensorFlow is also the backend of the more high-level Keras, which allows writing and training a neural network in a few lines of code, but tends to hamper the workflow when a greater degree of flexibility in model design is desired and effort needs to be directed to understanding function syntax.

Here, we will code a convolutional neural network (CNN) for the classification of hand-written numeric characters (a subset of the MNIST benchmark data set) into one of two classes (we only work with the classes "3" and "6"). The design of the CNN will be kept relatively simple, with three hidden representations (resulting from three convolutional layers), in order to keep computational demands within limits. More complex architectures usually achieve higher classification accuracy, but for the simple task of differentiating between two classes on images of relatively simple structure, the architecture selected here should be sufficient.

We start by loading the MNIST image data set from the TensorFlow (tf) package folders into session memory (when this is done for the very first time, the process may take a little longer, as the data set is first downloaded from the internet). We then subset this data set to keep only the images belonging to the two classes mentioned above (i.e., those with class labels 3 and 6). Furthermore, we normalize the images (by dividing them by 255, the maximum gray value of an 8-bit image), crop the last pixel row and column (reducing the images from 28x28 to 27x27 pixels), and convert them to "tensors". A tensor is basically another expression for a higher-order array, but refers here specifically to TensorFlow objects, i.e. objects of a class that can be traced by the auto-differentiator capabilities of the TensorFlow package (unlike e.g. NumPy arrays or built-in Python lists).

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# load the MNIST data set; mnist[0] contains the training images and labels
mnist = tf.keras.datasets.mnist.load_data()

# keep only the threes and sixes, scale to [0, 1], crop the last pixel row
# and column (28x28 -> 27x27), and convert each image to a tensor
img_list_base = mnist[0][0][(mnist[0][1] == 3) | (mnist[0][1] == 6),:,:]
img_list_base = [tf.convert_to_tensor((img_list_base[i,:(-1),:(-1)] / 255).astype('float32')) for i in range(len(img_list_base))]

Next, we subset the array of class labels (which is loaded together with the images when loading the MNIST data set) to keep only those elements corresponding to the images of threes and sixes. We then set up an array filled with zeros, with as many rows as there are images and as many columns as there are classes (two in this example); for every image of a three, we put a "1" into the left column of the corresponding row, and for every image of a six, into the right column. This is the generation of one-hot-encoded labels, where each class is designated by the position of the "1" in a vector of length two. In this form, the class labels can later be used to calculate the classification loss (working directly with the numbers 3 and 6 is not recommended, as they carry only categorical meaning in classification, not truly numeric meaning).

# one-hot-encode the class labels: column 0 marks a three, column 1 a six
class_list_base = mnist[0][1][(mnist[0][1] == 3) | (mnist[0][1] == 6)]
class_arr_base = np.zeros([len(img_list_base),2])
class_arr_base[class_list_base == 3,0] = 1
class_arr_base[class_list_base == 6,1] = 1
class_arr_base = [tf.convert_to_tensor(class_arr_base[i,:].astype('float32')) for i in range(len(class_arr_base))]

We set up the CNN parameters (or "weights") as TensorFlow "variables" (which means that we can later assign new values to them, which is useful when updating the parameters). We group the parameters into layer-specific arrays: three arrays for the convolution operations (i.e., the repeated multiplication of a small group of parameters with varying portions of the input) and one array for a "dense"-layer operation (i.e., the multiplication of the input with a full parameter array in which every input element has its own dedicated parameters). Each convolutional parameter array has three axes, where the first axis indicates the number of parameter matrices in the layer (here, five in each layer), and the second and third axes indicate the "spatial" extent of each parameter matrix (i.e., the number of input dimensions covered in the vertical and horizontal directions; these vary here between three and seven). The parameter array associated with the "dense" operation matches the output of the third convolutional layer in shape, as it must take this output as input, but contains one additional axis whose length equals the number of classes to be differentiated (i.e., two).

n_class = tf.constant(2) # number of classes to differentiate

# convolutional parameter arrays: five parameter matrices per layer, with
# spatial extents of 3x3, 5x5 and 7x7, drawn from Glorot-uniform bounds
conv_1 = tf.random.uniform([5,3,3], minval = -np.sqrt(6/(5*3*3+1)), maxval = np.sqrt(6/(5*3*3+1)))
conv_1 = tf.Variable(conv_1)

conv_2 = tf.random.uniform([5,5,5], minval = -np.sqrt(6/(5*5*5+1)), maxval = np.sqrt(6/(5*5*5+1)))
conv_2 = tf.Variable(conv_2)

conv_3 = tf.random.uniform([5,7,7], minval = -np.sqrt(6/(5*7*7+1)), maxval = np.sqrt(6/(5*7*7+1)))
conv_3 = tf.Variable(conv_3)

# dense-layer parameter array: one parameter set per class, matching the
# output of the third convolutional layer (five maps of 15x15 pixels for
# the 27x27 inputs used here)
dense_1 = tf.random.uniform([n_class, 5, (np.shape(mnist[0][0])[1]-1)-2-4-6, (np.shape(mnist[0][0])[2]-1)-2-4-6],
                            minval = -np.sqrt(6/(n_class*5*((np.shape(mnist[0][0])[1]-1)-2-4-6)**2+n_class)), 
                            maxval = np.sqrt(6/(n_class*5*((np.shape(mnist[0][0])[1]-1)-2-4-6)**2+n_class)))
dense_1 = tf.Variable(dense_1)

All parameters are drawn from specific Glorot-uniform distributions, i.e. from U(-l, l), where the bound l is set by the dimensionality of the input and that of the output (this constraint has been shown to improve DNN training compared to sampling from a simple fixed-range uniform distribution).
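To make the bound explicit, here is a minimal sketch (the fan-in and fan-out assignments are our assumption for the first layer; the bounds used in the code above follow a slight variant based on the layer's parameter count plus one):

# Glorot-uniform bound: U(-l, l) with l = sqrt(6 / (fan_in + fan_out))
fan_in, fan_out = 3 * 3, 5 # assumption: 3x3 filter inputs feeding five output maps
limit = np.sqrt(6 / (fan_in + fan_out))
w_example = tf.random.uniform([5, 3, 3], minval = -limit, maxval = limit)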

We now need to prepare a function that prepares an image for its processing in the CNN, that is, a reshaping and replication of portions of the image. To understand this requirement, it is necessary to remember that in a convolutional layer, a small set of parameters, arranged as a matrix (e.g. a 3x3 or a 5x5 matrix), is moved step-by-step "over" the input image (or over a hidden representation). In programming terms, this would require a loop (where the parameter matrix moves ahead one pixel per iteration), resulting in a list or a list-turned-array in Python. When parameters are being traced, as TensorFlow does, this would however be a highly inefficient operation that can easily lead to memory exhaustion in even slightly complex network architectures. The reason for this is that TensorFlow internally constructs a mathematical graph (which we have also done ourselves in C++ implementations of the CNN), and it does so for every Python object created while tracing parameters for auto-differentiation (more on the tracing later). A list is here treated as a container of Python objects, and as the lists get quite long when looping over pixel rows and columns, the number of graphs built also becomes quite large.

The trick is to format each input image in such a way that the multiplication of parameters with input data (or hidden representations) can be expressed as a single algebraic operation per convolutional layer, without the use of a loop. (It should be noted that this algebraic operation, in essence the calculation of a sum-product, or dot product, is in TensorFlow still executed as a series of loops, but in a faster lower-level language like C++.) To this end, we simply do the looping over the image without applying the multiplication with parameters: we loop over the image with a nested loop, one pixel per row and one pixel per column at a time, and take a subset of 3x3 pixels each time. We end up with an array with four axes, where the first two axes equal the number of pixels per row and per column of the original image, each reduced by two (as we cover the current and two adjacent pixels in each iteration of the loop). The last two axes have length three each, and thus represent the dimensionality of the subset of pixels taken at each iteration of the nested loop. This process is approximately visualized in the figure below. The array thus contains partial replications of the input image, stored in a more complex arrangement than the original image.

inp_img = img_list_base[0]

# cut the sample image into overlapping 3x3 patches, moving one pixel at a time
img_proc = np.array([[inp_img[j:(j+3),k:(k+3)]
                      for k in range((img_list_base[0]).get_shape()[1]-2)] 
                      for j in range((img_list_base[0]).get_shape()[0]-2)])
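A quick sanity check of the resulting shape (assuming the 27x27 images produced by the cropping above):

print(np.shape(img_proc)) # expected: (25, 25, 3, 3)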

The array just created would allow for efficient processing of the input image in the first convolutional layer. But we also need to account for the second and third convolutional layers. Hence we need to transform the array even further, to ensure that once it has passed the first convolutional layer, the resulting output is in exactly the right shape to be processed by the second convolutional layer, and so on. That means that we now simulate the second convolution operation, again without the multiplication of input with parameters: similar to the step before, we "slide" over the first two axes of the array created in the first step, using a nested loop, in steps of one cell per iteration. In each iteration, we "cut" the array created in the first step along its first and second axes, now subsetting not just a simple matrix but a 4D piece of the former array, concretely a 5x5x3x3 piece. The resulting array has six axes: the first two equal the number of rows and columns of the input image minus two minus four (accounting for the two sets of subsetting operations), the third and fourth equal the lengths of the first and second axes of the slices taken for the second convolutional layer (5x5), and the fifth and sixth axes equal the lengths of the axes of the slices taken for the first convolutional layer (3x3). The process is approximately visualized in the figure below, although, for simplicity, the first and second axes have length three there, not five.

img_proc_1 = np.array([[img_proc[j:(j+5),k:(k+5)] 
                for k in range((img_list_base[0]).get_shape()[1]-2-4)] 
                for j in range((img_list_base[0]).get_shape()[0]-2-4)])
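Again, a brief shape check under the same 27x27 assumption:

print(np.shape(img_proc_1)) # expected: (21, 21, 5, 5, 3, 3)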

Finally, the array just prepared must be transformed one more time, such that the output generated in the second convolution step can be fed directly into the third convolutional layer. The logic is the same as before: we move iteratively over the first two axes of the array just created, and create 6D slices by cutting along the first two axes (making cuts of 7x7). The result is an 8D array, with the first two axes equaling the number of rows and columns of the input image minus two minus four minus six, the third and fourth axes equaling the lengths of the first and second axes of the slices taken for the third convolutional layer (7x7), and so on. The resulting array has the following quality: it can be fed into the first convolutional layer, and the output will match exactly the input requirements of the second convolutional layer. The output of that layer will then exactly match the input shape required by the third convolutional layer. And the output of that layer can then directly be used to make a class prediction.

img_proc_2 = [[img_proc_1[j:(j+7),k:(k+7)] 
                for k in range((img_list_base[0]).get_shape()[1]-2-4-6)] 
                for j in range((img_list_base[0]).get_shape()[0]-2-4-6)]
                
def prep_inp(inp_img):
    # first pass: overlapping 3x3 patches for the first convolutional layer
    img_proc = np.array([[inp_img[j:(j+3),k:(k+3)]
                      for k in range(inp_img.get_shape()[1]-2)] 
                      for j in range(inp_img.get_shape()[0]-2)])

    # second pass: overlapping 5x5 cuts for the second convolutional layer
    img_proc_1 = np.array([[img_proc[j:(j+5),k:(k+5)] 
                for k in range(inp_img.get_shape()[1]-2-4)] 
                for j in range(inp_img.get_shape()[0]-2-4)])
    
    # third pass: overlapping 7x7 cuts for the third convolutional layer
    img_proc_2 = [[img_proc_1[j:(j+7),k:(k+7)] 
                for k in range(inp_img.get_shape()[1]-2-4-6)] 
                for j in range(inp_img.get_shape()[0]-2-4-6)]
    
    return tf.convert_to_tensor(img_proc_2)
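As a quick check (again assuming 27x27 inputs), the fully prepared tensor should have eight axes:

print(prep_inp(img_list_base[0]).shape) # expected: (15, 15, 7, 7, 5, 5, 3, 3)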

We can now take this function and use it to transform a sample image. Then, we take this transformed image and the first convolutional parameter array, and calculate their dot product along a specific set of axes. What does that mean? It means that we "match" certain axes of the convolutional parameter array and of the transformed input, in this case the axes of length three (these are the second and third axes of the parameter array, and the seventh and eighth axes of the input). We multiply the parameters with the input along these axes. As both the parameter array and the input have additional axes (in the case of the parameter array, we have five convolutional 3x3 matrices), the multiplication is internally repeated over the length of each of these additional axes ("internally" means that (a) the repetitions are accounted for through the symbolic syntax of algebra and (b) they are executed by translating the Python code into loops written in e.g. C++). After multiplication, the products are summed over all the axes involved in their calculation; that is the definition of the dot product, here implemented via the tf.tensordot function, to which the indices of the axes involved must be provided. The output of this operation has a new shape compared to the input: the input's seventh and eighth axes are removed, since these were the ones over which the dot product was computed. But one additional axis, of length five, has appeared at the first position: this results from the fact that the convolutional parameter array had one additional axis, representing the fact that it contained five 3x3 parameter matrices. Nevertheless, the output has a shape that is perfectly suited for feeding it into the second convolutional layer.

img_proc = prep_inp(img_list_base[0])

int_rep_1 = tf.tensordot(conv_1, img_proc, axes = [[1,2],[6,7]])
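To verify the axis bookkeeping, we can inspect the shape of the result before feeding it onward: the two patch axes of length three are gone, and the filter axis of length five now sits in front.

print(int_rep_1.shape) # expected: (5, 15, 15, 7, 7, 5, 5)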

Before we do just that, we apply the ReLU activation function, a non-linear (here threshold-like) function intended to enable the CNN to "learn" to scan for more complex content in the input data. The implementation simply sets all negative outputs of the first convolution operation to zero and keeps all outputs > 0 unchanged. To this end, we use the tf.maximum function to compare the output (element-wise) against zero as the enforced minimum value. After running the output of the first convolutional layer through the ReLU activation function, we refer to it as the activated output.

int_rep_1 = tf.maximum(int_rep_1, tf.constant(0.))
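For reference, the built-in tf.nn.relu performs exactly this element-wise maximum with zero, so the line above could equivalently be written as:

int_rep_1 = tf.nn.relu(int_rep_1)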

We now perform the second convolution operation, which largely follows the first in structure. However, since a new axis was added in front of the processed image during the first convolution operation (due to the fact that not just one but five parameter matrices were applied, and hence five dot products calculated), we need to make a structural change to the second convolutional parameter array: we add one "placeholder" axis in between the first and second axes (i.e., at index #1). This axis will match the novel axis introduced in the first convolution operation, such that we can include it in the calculation of the dot products between the new parameter array and the previous output. We use the tf.expand_dims function to insert the placeholder axis into the parameter array, and the tf.repeat function to increase its length (by replicating the parameters along that axis) such that it matches the length of the novel axis introduced in the first convolution operation. Now, when calculating the dot product between the second convolutional parameter array and the output of the first layer, we map the added placeholder axis (index #1) to the novel axis of the output (axis #0), and the columns and rows of the convolutional parameter matrices (axes #2 and #3 of the parameter array) to the last two axes of the output (axes #5 and #6). As a result of the dot-product calculation, these axes vanish in the output of the second convolutional layer. But again, one axis is added at the front, owing to the fact that the convolutional parameter array consists of several convolutional parameter matrices (indexed by the first axis of the second parameter array).

int_rep_2 = tf.tensordot(tf.repeat(tf.expand_dims(conv_2, 1), 5, axis = 1), int_rep_1, axes = [[1,2,3], [0,5,6]])
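Another shape check confirms that the matched axes have vanished and a new filter axis of length five has been prepended:

print(int_rep_2.shape) # expected: (5, 15, 15, 7, 7)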

After the second convolution operation, we again perform the ReLU activation, as described above. Then we repeat the convolution and activation operations for the third convolutional layer, just as described above for the second convolutional layer (here, though, the convolutional matrices at axes #2 and #3 of the convolutional parameter array match axes #3 and #4 of the output of the second convolution operation, as this output has altogether fewer axes than the output of the first convolution operation).

int_rep_2 = tf.maximum(int_rep_2, tf.constant(0.))
int_rep_3 = tf.tensordot(tf.repeat(tf.expand_dims(conv_3, 1), 5, axis = 1), int_rep_2, axes = [[1,2,3], [0,3,4]])
int_rep_3 = tf.maximum(int_rep_3, tf.constant(0.))
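After the third convolution, only the filter axis and the two spatial axes remain:

print(int_rep_3.shape) # expected: (5, 15, 15)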

After the three convolution operations, we compute the (pre-activated) classification prediction. To this end, we multiply the "dense" parameter array with the output of the third convolutional layer. We need to add one axis to the latter using the tf.expand_dims function (the replication along this axis then happens via broadcasting), as the parameter array has one more axis than the convolution output: since we predict a class vector of length two (for two classes), we have one parameter set for each of its two elements. After multiplication, we sum the products over the second, third and fourth axes, such that we obtain a vector of two elements (corresponding to the first axis of the parameter array and the axis just added to the convolution output).

pred_pre = tf.math.reduce_sum(dense_1[:,:,:,:] * tf.expand_dims(int_rep_3[:,:,:],0), axis = [1,2,3])
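A final shape check shows that we are left with one pre-activated score per class:

print(pred_pre.shape) # expected: (2,)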

The process of adding and "removing" axes (via dot-product calculation) is visualized approximately in the figure below.

We now "activate" this vector via the so-called SoftMax activation function, which ensures that all its elements are positive and sum to one. In operatioal mode, the vector index with the larger value is the predicted class index, i.e. ideally, one of the vector elements would be 1.0 and the other 0.0 (as the input data are noisy and complex, this separation is almost always only approximated, however). Finally, we compute the classification loss, here expressed as the binary cross-entropy loss function, by providing the true class vector for the respective input image. The loss is computed element-wise, i.e. for each class (for every element of the output vector) separately, but we require a scalar loss to fit the CNN' parameters, hence we use the tf.reduce_sum function to calculate the loss sum.

pred = tf.exp(pred_pre) / tf.math.reduce_sum(tf.exp(pred_pre))

class_truth = class_arr_base[0]

loss = tf.math.reduce_sum(-(class_truth * tf.math.log(pred) + 
                                          (tf.constant(1.0) - class_truth) * tf.math.log(tf.constant(1.0) - pred)))
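As a side note, the SoftMax could also be computed with the built-in tf.nn.softmax, which handles numerical stability internally (the manual exponential form above can overflow for very large pre-activations):

pred = tf.nn.softmax(pred_pre) # equivalent to the manual form above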

Now, we embed the above operations (the three convolution operations with their ReLU activations, the dense-layer computation, the SoftMax activation of the class prediction, and the loss calculation) in the context of a so-called TensorFlow "gradient tape", which means that an auto-differentiator is applied to trace the variables we would like to fit (i.e., the parameters of the convolutional and dense layers). The auto-differentiator "observes" the sequence of mathematical operations that depend (directly or indirectly) on these variables, and prepares for calculating partial derivatives with respect to them. Outside the context of the gradient tape, we then compute the gradient of the scalar loss with respect to the various parameter arrays, using the gradient function of the "tape". That is, here we tell the auto-differentiator of which quantity to calculate the partial derivatives (in principle, partial derivatives could be calculated of any intermediate computation depending on the variables). As an aside, in Template Model Builder, the application of the gradient tape and the gradient computation would roughly correspond to the usage of the MakeADFun function on the model's C++ script.

with tf.GradientTape(persistent = True, watch_accessed_variables = False) as tape:
        tape.watch([conv_1, conv_2, conv_3, dense_1])
        
        int_rep_1 = tf.tensordot(conv_1, img_proc, axes = [[1,2],[6,7]])
        int_rep_1 = tf.maximum(int_rep_1, tf.constant(0.))
        int_rep_2 = tf.tensordot(tf.repeat(tf.expand_dims(conv_2, 1), 5, axis = 1), int_rep_1, axes = [[1,2,3], [0,5,6]])
        int_rep_2 = tf.maximum(int_rep_2, tf.constant(0.))
        int_rep_3 = tf.tensordot(tf.repeat(tf.expand_dims(conv_3, 1), 5, axis = 1), int_rep_2, axes = [[1,2,3], [0,3,4]])
        int_rep_3 = tf.maximum(int_rep_3, tf.constant(0.))
        
        pred_pre = tf.math.reduce_sum(dense_1[:,:,:,:] * tf.expand_dims(int_rep_3[:,:,:],0), axis = [1,2,3])
        
        pred = tf.exp(pred_pre) / tf.math.reduce_sum(tf.exp(pred_pre))
        
        loss = tf.math.reduce_sum(-(class_truth * tf.math.log(pred) + 
                                          (tf.constant(1.0) - class_truth) * tf.math.log(tf.constant(1.0) - pred)))
                                          
grad_c1 = tape.gradient(loss, conv_1)
grad_c2 = tape.gradient(loss, conv_2)
grad_c3 = tape.gradient(loss, conv_3)
grad_d1 = tape.gradient(loss, dense_1)
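As an aside, the four gradients can also be obtained in a single call by passing a list of variables to tape.gradient, in which case the tape would not need to be persistent:

grad_c1, grad_c2, grad_c3, grad_d1 = tape.gradient(loss, [conv_1, conv_2, conv_3, dense_1])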

Finally, we embed the entire process (the gradient tape watching the class prediction and loss calculation, and the calculation of the gradients) in a Python function that we can feed with any input, i.e., a processed image and a true class vector. The function returns a list of all the gradients, and of the loss, to be used later for updating the parameters and recording a loss "timeline". We decorate the function with @tf.function, which tells TensorFlow to construct a single mathematical graph out of all the computational steps contained in the function (in a lower-level language), rather than constructing a separate graph for every operation. This speeds up execution markedly, although in some situations it can lead to excessive build-up of memory usage.

@tf.function
def one_step(img_proc, class_truth):

    with tf.GradientTape(persistent = True, watch_accessed_variables = False) as tape:
        tape.watch([conv_1, conv_2, conv_3, dense_1])
        
        int_rep_1 = tf.tensordot(conv_1, img_proc, axes = [[1,2],[6,7]])
        int_rep_1 = tf.maximum(int_rep_1, tf.constant(0.))
        int_rep_2 = tf.tensordot(tf.repeat(tf.expand_dims(conv_2, 1), 5, axis = 1), int_rep_1, axes = [[1,2,3], [0,5,6]])
        int_rep_2 = tf.maximum(int_rep_2, tf.constant(0.))
        int_rep_3 = tf.tensordot(tf.repeat(tf.expand_dims(conv_3, 1), 5, axis = 1), int_rep_2, axes = [[1,2,3], [0,3,4]])
        int_rep_3 = tf.maximum(int_rep_3, tf.constant(0.))
        
        pred_pre = tf.math.reduce_sum(dense_1[:,:,:,:] * tf.expand_dims(int_rep_3[:,:,:],0), axis = [1,2,3])
        
        pred = tf.exp(pred_pre) / tf.math.reduce_sum(tf.exp(pred_pre))
        
        loss = tf.math.reduce_sum(-(class_truth * tf.math.log(pred) + 
                                          (tf.constant(1.0) - class_truth) * tf.math.log(tf.constant(1.0) - pred)))
    
    grad_c1 = tape.gradient(loss, conv_1)
    grad_c2 = tape.gradient(loss, conv_2)
    grad_c3 = tape.gradient(loss, conv_3)
    grad_d1 = tape.gradient(loss, dense_1)

    return([grad_c1, grad_c2, grad_c3, grad_d1, loss])

Now we fit the CNN by iteratively adjusting its parameters: first, we loop over all input images and prepare (i.e., reformat) each one as described above. The reformatted images are appended to a list for later use. Then we loop over the number of iterations of parameter adjustment (here, rather arbitrarily, set to 50) and, in an inner loop, over the images (we limit their number to 50 as well, to keep computation time and memory demands within manageable limits). For each image, we store the loss gradients with respect to the convolutional and dense-layer parameters in an array, and we sum the classification loss over all images. Once the gradients for every image have been computed (i.e., the inner loop has finished), we calculate the average gradient for every parameter. To this end, we use the tf.math.reduce_mean function and set the axis argument to 0, in order to indicate that we wish to calculate parameter-specific averages, not an overall average over all parameters and images. Then we update the parameters by subtracting the gradients, multiplied by the pre-defined learning rate (set here to 0.001), from their current values (we use the assign function to over-write the old parameter values).

lr = tf.constant(1e-3) # learning rate

n_imgs = 50 # set the number of images to process in one iteration
n_iter = 50 # set the number of iterations

grads_arr_c1 = tf.Variable(tf.zeros([n_imgs,5,3,3]))
grads_arr_c2 = tf.Variable(tf.zeros([n_imgs,5,5,5]))
grads_arr_c3 = tf.Variable(tf.zeros([n_imgs,5,7,7]))
grads_arr_d1 = tf.Variable(tf.zeros([n_imgs,2,5,15,15]))
# empty arrays for storing the loss gradients calculated per image

loss_curve = np.zeros(n_iter) # empty vector for storing the loss
# value (over all images) per iteration

"""
process the images for use in the CNN
"""
inp_imgs = []
for i in range(n_imgs):
    inp_imgs.append(prep_inp(img_list_base[i]))

"""
fit the CNN
"""
for h in range(n_iter): # loop over iterations
    loss = tf.constant(0.)
    
    for i in range(n_imgs): # loop over all images
        grads = one_step(inp_imgs[i], class_arr_base[i])
        
        grads_arr_c1[i].assign(grads[0])
        grads_arr_c2[i].assign(grads[1])
        grads_arr_c3[i].assign(grads[2])
        grads_arr_d1[i].assign(grads[3])
        
        loss += grads[4]
        
    loss_curve[h] = loss
    print(loss)
        
    grads_c1 = tf.math.reduce_mean(grads_arr_c1, axis = 0)
    grads_c2 = tf.math.reduce_mean(grads_arr_c2, axis = 0)
    grads_c3 = tf.math.reduce_mean(grads_arr_c3, axis = 0)
    grads_d1 = tf.math.reduce_mean(grads_arr_d1, axis = 0)
    # calculate average gradients over all images
    
    conv_1.assign(conv_1 - lr * grads_c1)
    conv_2.assign(conv_2 - lr * grads_c2)
    conv_3.assign(conv_3 - lr * grads_c3)
    dense_1.assign(dense_1 - lr * grads_d1)
    # update the parameters
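The update rule used here is plain gradient descent. As a sketch of an alternative (not used in this article), the same step could be delegated to one of the built-in Keras optimizers, which would also make it easy to switch to adaptive schemes like Adam:

opt = tf.keras.optimizers.SGD(learning_rate = 1e-3)
opt.apply_gradients(zip([grads_c1, grads_c2, grads_c3, grads_d1],
                        [conv_1, conv_2, conv_3, dense_1]))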

Once the fitting loop has finished, we plot the loss curve using the recorded loss values, and find that the loss decreases nicely over the iterations, indicating an improvement in classification accuracy. Normally, we would also need to check the classification accuracy on an independent set of test images (not used in the fitting process) in order to assess the CNN's ability to generalize, but we skip this step here for simplicity.

plt.plot(loss_curve)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.show()