Next, we need to define the parameters to be used within the model. Parameter values will be provided by the R session calling the C++ script. TMB will compute partial derivatives of the terminal scalar output of the model (i.e. the loss) for each input declared as a parameter, which naturally requires computing time and -power. Hence, you should be clear about what components of your model you would like to optimize (i.e. fit), and which ones not (the latter then need to be provided as DATA_ instead). In the case of the CNN, it is pretty clear that we want to optimize the convolutional matrices and vectors of "weights" that are used to process input images. Within the graph of the CNN, we generate three intermediate representations of the input as a product of convolution operations, hence we declare three stacks of convolutional matrices as parameter arrays (conv_x), one for each convolutional layer. Every stack contains several convolutional matrices; the idea is that each matrix gets optimized to check for a recurring spatial feature in the data or a previous intermediate data representation. Finally, we also declare one parameter vector of weights (dense_1), which is used to process the final intermediate data representation and calculate the predicted classification (a vector of two elements). All input parameters must be assigned to a specific type of object named PARAMETER_XXXX; again, the suffix depends on the structure of the parameter object. Within the optimization process, we first provide semi-random values as initial "guess values", run the model script, generate the overall classification loss, update the parameter values in the R session, re-run the C++ script with the updated parameter values, and repeat the procedure until we deem classification performance (i.e. classification loss) good enough (low enough). Parameter-type inputs thus always change over time, data-type inputs not necessarily.
Now, we define some objects that will be used throughout the model computations. In particular, we define three arrays that will contain the intermediate data representations generated by the convolution operations (int_rep_x). In each convolutional layer, five convolutional matrices, i.e. feature filters, will be applied, hence the second axis of each of these arrays is of length 5 (the first one is reserved for indexing the various inputs supplied to the model). The third and fourth axes, i.e. the "rows" and "columns" differ between the layers, as the data representations become smaller from layer to layer as an effect of the convolution operations. The first intermediate representation will have 25 rows and columns (the summation-multiplication of the convolutional operation maps nine input dimensions to one output dimension and hence leads to a reduction of two rows and columns relative to the original input image). The second and third intermediate representations are reduced by four and six further rows and columns, respectively (here, larger convolutional matrices are employed and hence more dimensions of the input are mapped to a single output dimension). Finally, we also define two matrices that are to contain the almost-final model prediction (pred_pre, from before the softmax activation) and the final model prediction (pred from after the softmax activation). Applying the softmax activation requires the pre-activation model output to be known for all classes (two in this case), and can hence not be conducted in one step, which forces the need for two separate matrices here. We also define two scalars (intitially set to zero); one will be a "helper" object in the softmax operation (enum_softm), the other is the scalar loss that is going to be the final output of the C++ script that is required to fit the CNN (loss).
We now move on to the actual computational graph that makes up the CNN model. The graph is written in multiple nested loops, as we need to loop over several input images, convolutional matrices etc. The utilization of loops here may come as a surprise to users of interpretated languages like R and Python, as these are generally deemed inefficient and replaced by operations from linear algebra, like matrix multiplication. In a compiled language like C++, however, loops are sufficient; in fact operations like matrix multiplication are translated to C++ loops during the interpretation step. We start by looping over a sequence of input images. In the present case, we limit the number to 12 due to memory constraints (ideally, we would loop over all available training images during one call of the C++ script).
Within this loop, we now compute the first intermediate representation of the data. To this end, we write a nested loop, where we first iterate over all but the two last rows (index "i") and then over all ut the two last columns (index "j") of the input image. We omit the final two rows and columns as a parameter matrix covers nine cells (i.e. three rows and three columns), and in the following formulation of the inner loops, we always address the current and the two following rows and columns for a given cell "ij". We then iterate over all matrices in the parameter stack used to calculate the first intermediate data representation (index "m"), and then over the rows (index "k") and columns (index "l") of the parameter matrix. Hence we multiply the cell of the parameter stack "m,k,l" with the section of the input image "i+k,j+l" and add this to the cell of the intermediate data representation "m,i,j". Hence the products of the parameter matrix (a 3x3 matrix) with the "covered" section of the input image (also a 3x3 matrix) are summed to generate one cell (one scalar value) of the intermediate data representation.
Having completed the calculation of he firs intermediate data representation, we now compute the second intermediate data representation from the first one and from a new stack of parameter matrices. The set of procedures is almost the same as before. However, we now loop over all but the final four rows and columns of the first intermediate data representation (which has the shape of a matrix stack). That number of rows and columns is provided as input data, and equals the number of rows and columns of the input image minus two. We here omit four instead of (as before) two rows and columns, since we apply larger (5x5) parameter matrices in this second layer of the CNN. A more prominent difference from the procedures of calculating the first intermediate data representation, we now have to incorporate an extra nested loop (index "n"). The reason is that the first intermediate data representation is a stack of matrices, i.e. an array, and not a single matrix like the input image itself. Thus, we now actually multiply a 5x5 parameter matrix with a 5x5-subset of each entry in the matrix stack of the first intermediate data representation, and sum over all entries. The result is one cell in one matrix of the second intermediate data representation. Furthermore, we condition each multiplication on the particular cell of the first data representation being positive. This is the implementation of the ReLU activation function, a tool adding a non-linear element to the CNN calculations that is frequently used to enable the "learning of" and detection of more complex patterns in data.
The calculation of the third intermediate representation of the data from the second intermediate representation and the stack of third-layer parameter matrices is very similar to the steps before. The only alteration is that now we use even larger parameter matrices (7x7 matrices), which we need to account for when looping over the rows and columns of the second intermediate data representation, where we need to omit the six terminal rows and columns when looping. The choice of the sizes of the parameter matrices here is quite arbitrary and intended to keep script length to a readable extent. The actual "best" CNN architecture for a given dataset is often determined through extensive experimentation.
Finally, we calculate the class prediction from the third intermediate representation of the data. The class prediction will be a vector whose number of elements equals the number of classes between which we differentiate, in the present case two. To predict the norm and direction of the vector, we employ a dense or fully-connected layer. This means that instead of employing a set of parameter matrices "scanning" for recurring elements in a previous data representation (or in the input data), we now use a set of parameters where we have one unique parameter for every connection between the cells of the matrix stack comprising the third representation and the elements of the class vector. Thus we loop now over both classes and then over all rows and all columns of the matrices of the third representation and then over the different matrices, and sum up all products for each class.
We then exponentiate each vector element of the class vector and add them to each-other. And then we exponentiate each vector element again and divide it by the sum calculated in the prior step. This is the so-called softmax activation function that ensures that the elements in the class vector sum to one. The interpretation of the class vector is then that the index of the vector element with the greater value is the index of the class predicted. (The difference between the magnitudes of the vector elements is sometimes used to infer the "certainty" of the prediction - i.e. a 0.6-0.4 ratio being less certain than a 0.9-0.1 ratio - still, very "certain" predictions might in some cases also be very wrong).
Finally, we calculate the loss between prediction and ground truth. The ground truth is a one-hot-encoded vector, i.e. a vector with a one at the true class index and a zero at the other class index, and is supplied as input data to the model script when calling it whithin the R session using TMB. We apply the binary-crossentropy loss function, which is commonly used in binary-classification problems (a further explanation of the details of this function is omitted here for the sake of simplicity). When running the model script, we add up the scalar loss for each input image over all input images. The idea here is that the CNN, or more precisely its parameters, should be adapted to enable an accurate classification of all training images. (Also, wrong classifications with particularly strong classification certainty are naturally weighted more strongly than mis-classifications with weak certainty due to the loss being based on the actual magnitudes of the elements of the prediction vector).
Having completed the loop over all input images and thus having obtained the final loss value and filled all arrays with the class predictions and intermediate data representations, we write four lines termed ADREPORT that will tell the TMB package to make the listed arrays accessible from within the R session - it means that we will actually be able to view the class predictions and to inspect the intermediate data representations. The former enables a practical implementation of the CNN for a classification task on unseen data, the latter allows investigating the model script for bugs in case the fitting process goes awry. As a final line, we add return loss. This is a very important line, as this is the scalar value with respect to which TMB will compute partial derivatives of the parameters. Putting the correct scalar in this line is therefore important to ensure that the CNN is fitted sensibly. You can find the complete C++ script here.
Now that we have finished writing the C++ model script, which we save to disk under the file name CNN.cpp, we can start an R session and write the code required for running and fitting the CNN. The R code shown here is written in a way that supports "supervised" code execution within an IDE like RStudio, but could be modified relatively easily to be run as a program. We first need to load a number of packages: The TMB package for compiling and running the model script, and for computing partial derivatives of the loss with respect to the CNN parameters and the dslabs package to download and access the MNIST dataset of images. After loading the packages, we load the MNIST dataset into the R-session memory using the read_mnist() function. When doing this for the first time, we do not provide any further arguments to the function, which means that we force a download of the dataset from the internet (a good internet connection is therefore required). Once we have downloaded the MNIST dataset, a number of compressed folders exist in the Downloads directory, and the next time we start an R session and want to load the dataset into R-session memory, we can add this directory as an argument to the read_mnist() function. This means that we will not require an internet connection for running the R session, and also speeds up the loading of the dataset into the R session. The MNIST dataset is an object comprised of four lists, one larger and one smaller set of images of hand-written letters, and one longer and one shorter vector of corresponding class names (labels "0" to "9"). The longer lists / vectors are the training data, and the shorter ones are test data that we will use to independently validate the classification performance of the fitted CNN. The MNIST dataset is a benchmark dataset for testing the strength of image-classification algorithms, and it is also comparatively easy to fit strong-performing models onto this dataset. However, for reasons of replicability and simplicity, we select this dataset for the current demonstration.
We then use the compile function of the TMB package to compile the C++ model script. This means the script is translated into machine-language code for later execution. This process can take a bit longer than a normal R command, but should not take excessively long. If the C++ code was written exactly as above, the compilation should complete successfully (should return a "0" in the R console) and not throw error messages. If you make modifications to the C++ script, you might require a couple of rounds of modifying the code and trying to compile it before you are successful, as writing C++ code can relatively easily lead to errors and especially oversights. After successful compilation, we then use the nested dyn.load() and dynlib() functions to make the compiled model script accessible in the R session.
Now we set up the parameter arrays that will later be passed to the C++ script to be used within the CNN. We first set up an empty array to be used for calculating the first intermediate data representation. It is a stack of five matrices of three rows and three columns each. We initialize the parameters by sampling from a so-called Glorot-uniform distribution, the upper and lower limits of which are defined by the dimensionality of the data before and after the calculation of the first intermediate representation, basically speaking. This distribution is widely used for the initialization of parameters of CNNs and DNNs in general. The number of samples drawn from the distribution of course equals the number of parameters required for the computation of the first intermediate data representation, i.e. the first convolutional layer of the CNN. The parameter arrays for the second and third convolutional layers (for the computation of the second and third intermediate data representations) are set up in a similar manner, but in the definition of the upper and lower bounds of the Glorot-uniform distribution we now need to account for the increased number of rows and columns of the respective parameter matrices (5 and 7 instead of 3 in the first layer).
Finally, we set up the parameter array for the final dense or fully-connected layer which predicts the class index from the third intermediate representation of the data. Since we "connect" every element of the class vector with every element of the third intermediate data representation, the parameter array has the form "number of classes, number of stacked matrices in the third data representation, number of rows in the third data representation, number of columns in the third data representation". The number of samples drawn and the boundaries of the Glorot-uniform distribution are set correspondingly. We create a named list of all parameter arrays, which will later be used as input when calling the model script using TMB.
We now subset the MNIST data set loaded earlier, as our example centers around a binary classification task, which means that we need to limit ourselves to data belonging to two classes (two digits) only. Here, we pick the digits "3" and "6". We subset the array of training images, the vector of training labels, the array of test images and the list of test labels such that they only contain images and correspondin labels belonging to the classes associated with the digits "3" and "6". The images originally come in vectorized format, but we need them in matrix format (i.e. the typical format of images that makes them perceivable by human vision). Hence we reformat the image arrays accordingly, which yields an array of stacked square matrices with 28 rows and columns each (i.e. images with 28x28 pixels). We have designed the CNN to accept images of dimensions 27x27 pixels, hence we here cut off the terminal row and the terminal column of each image. Finally, we divide the images by the scalar value 255, which is the maximum number of grayscales an image can usually have. This procedure limits the pixel-value range to a maximum of 1 (and a minimum of zero), which reduces the risk of calculating infinitely large values when processing the images through the CNN.
As a next step, we create two matrices that will contain the one-hot-encoded-vector form of the class labels. Recall that the class labels as imported into the R session are actual digits. We need to transform these into vectors of two elements (a zero and a one) with the index of the "one" indicating the class. Hence we generate a matrix with as many rows as there are training images and two columns, set all values to zero, and set the values in the first column to 1 where the class is a "three" and set the values in the second column to 1 where the class is a "six". We do the same for the test labels.
Next, we set the learning rate that will be used in the process of fitting the CNN to update its parameters via gradient descent. The learning rate determines how "boldly" the parameters should be changed in response to loss gradients. It is set here to 0.0005, a value that has already been optimized through experimentation. Normally, learning rates between 0.0001 and 0.001 are suitable for fitting CNNs; a low value risks a slow progress in searching for optimal CNN parameterization, a high value risks an over-shooting of an optimal parameterization. We also set a number of hyper-parameter values that are part of the Adam optimizer algorithm. An optimizer guides the parameter updates in the fitting process in a more sophisticated manner than pure gradient descent by "tracing" the fitting trajectory to some extent. The Adam optimizer is a state-of-the-art optimizer for fitting CNNs; however it is not necessary to know the meaning of its various hyper-parameters.
Now, we make some final preparations for running the model-fitting iterations. We set the number of iterations to a fixed value, in the present example to 50 (which is an arbitrary limit used here only for simplicity). We also set up an empty array that will contain the predictions of the CNN. As such, it has three axes: One to account for the different fitting iterations, one to account for the number of test images for which predictions will be made (set here to 12 for keeping computing time low but should be set to the total number of available test images in practice), and one to account for the dimensionality of the predictions (a vector of two elements). Further we set up an empty vector the number of elements of which matches the number of iterations. This will contain the prediction loss summed over all test images, and will allow to plot a loss curve over fitting iterations later on.
We can now finally start the fitting of the CNN. The scheme is as follows: We pass a set of images (from the total set of training images) to the CNN, run the CNN (and thus obtain the classification of these images plus the total classification loss, i.e. the binary binary crossentropy), generate the partial derivatives of the loss of the CNN with respect to every parameter of the CNN, update the hyper-parameters of the Adam optimizer via gradient descent using the partial derivatives, update the CNN parameters using the updated Adam optimizer, pass the updated parameters to the CNN, pass a set of test images to the CNN, run the CNN (with the updated parameters) on the test images, record the CNN's prediction on the test images, as well as the total classification loss, and start at the beginning of the sequence again. (Here, we provide a new set of training images by moving further ahead in the total set of training images. This is not essential, but it "encourages" the CNN to not over-fit on a limited set of training images). The single procedures will be explained further in the next section.
First, we create a named list of data-input objects to be passed to the model script. These are the number of input images, the number of rows (and columns) of each input image and of the intermediate data representations, as well as the number of classes, and the stack of input images and stack of one-hot-encoded vectors (i.e. labels) themselves. Only the image stack and the stack of labels are updated with each iteration in the fitting loop (a sequence of index values in steps of 12 is used to pass the first twelve images and labels in the first iteration, the next twelve in the next iteration, etc.) The model script is then run with this list of input data and the list of parameter arrays created above using the MakeADFun() function from the TMB package. This function not only runs the code itself that we have written ourselves, but it also computes th partial derivatives of the loss with respect to every CNN parameter (the code for that was generated internally during the above step of compiling the C++ script). The gradients are then extracted from the output object generated from the call to the function. The hyper-parameters of the Adam optimizer are then updated using the gradients; there is one set of hyper-parameters for each array of CNN parameters, and each set is updated using the corresponding subset of gradients. Finally, the CNN parameters are updated using the updated opimizer hyper-parameters and the learning rate; this is then the actual step of the gradient descent, as we update the parameters in the opposite direction of their respective gradients.
Next, we apply the updated CNN parameters on our set of independent test images and -labels. To this end, we make a copy of the list of input data and replace the image stack and the labels stack with the test data. We feed this list to the MakeADFun() function, and this time set the function argument ADreport (which by default is set to FALSE) to TRUE. This means that we want to be able to access those model objects which we have named in the ADREPORT function within the C++ script (by defaultm, they are not accessible). We extract the class prediction from the output object. Since the values in the predicted class vector can have any value between zero and one, we employ a logical rule by which every value smaller than 0.5 is set to zero and every value equal to or larger than 0.5 is set to one. Thus, we can interpret the CNN prediction as a genuine classification. We compute the scalar sum of absolute differences between the true class vectors and these predicted class vectors and record it in order to keep track of the test-loss dynamics. We then move on to the next fitting iteration, and work with the next stack of training images. Even though we have quite an extensive set of test data, we only use twelve images here in this illustrative example to keep computing time within reasonable limits. For a full assessment of model performance, it would, however, be necessary to review performance on a larger, preferably the full, set of test images.
Even though we challenge the fitting procedure by supplying a completely new stack of training images at each fitting iteration, we observe that the test loss does decrease markedly over time. The process is not linear, and the "zig-zag" dynamics indicate that it takes some time before a parameterization is achieved that can classify both training- and test images reliably well.
The approach shown here is somewhat slow when compared to the computing speed achieved by packages specifically geared towards deep-learning tasks, e.g. TensorFlow. But it is a lot faster than the CNN written in R, and still allows a lot of flexibility for experimenting with the CNN design, the optimization algorithm, the loss function etc.