**Building a convolutional neural network (CNN) from scratch**

Convolutional neural networks (CNNs) are a particularly complex type of neural network used in machine vision. Here, we will build one from scratch in order to shed light on these often black-box-like algorithms.

CNNs are a specific type of deep neural network (DNN) primarily used for machine-vision tasks. They are frequently employed for classification or clustering of images, but derivatives are also found in the domain of "creative artificial intelligence". Two properties distinguish them from ordinary DNNs: they maintain information about the spatial structure of the data (thereby emulating, to some extent, the visual apparatus of animals), and they make efficient, repeated use of a reduced set of parameters, which is important when dealing with very high-dimensional data like images.

CNNs apply small groups of parameters, usually arranged in matrices of e.g. 3 * 3 or 5 * 5, to sets of input variables of equal size and extent. As the input as a whole is usually larger than the parameter matrix, the matrix is "slid" over the input until every "location" of the input has been covered. When placed over a certain input set, the parameters are multiplied with the input variables covered by them, and the resulting products are summed up. A non-linear function (e.g. a sigmoid function) is applied to the sum, and the function output then represents one dimension of the next hidden representation of the data. In that regard, the CNN works just like a common DNN, where parameters are multiplied with scalar input variables, the products are summed up and "activated", and the function output represents one dimension (or variable) of the next hidden representation of the data.

The difference in the CNN is that a common set of parameters is applied to different sets of input dimensions. The CNN is thus able to "detect" recurring patterns in the input data (or in the hidden representations), a common trait of many images. Furthermore, the mapping to hidden representations that maintain a square structure means that spatial information from the original data is preserved throughout the abstraction process. One layer in a CNN usually employs several parameter matrices in order to enable the learning of a set of useful patterns (useful, e.g., in terms of predicting the most probable class of an image). While often represented as matrices in textbook examples, the single parameter sets are actually not true matrices but three-dimensional arrays, whose shape depends on the number of "slices" of the input: in the case of an RGB image, the number of slices is three; in the case of a hidden representation, it depends on the number of parameter matrices applied in the previous layer.
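The sliding sum-product-and-activation described above can be condensed into a few lines of R. The following is a minimal stand-alone sketch (the function and variable names are illustrative and not part of the model code developed later):

```r
# Minimal "valid" convolution (stride 1, no padding) with sigmoid activation
sigmoid = function(x) 1 / (1 + exp(-x))

conv2d = function(input, kernel){
  k = nrow(kernel)                 # kernel is assumed square, e.g. 3 * 3
  out_dim = nrow(input) - k + 1    # each axis shrinks by k - 1
  out = matrix(0, out_dim, out_dim)
  for(i in 1:out_dim){
    for(j in 1:out_dim){
      patch = input[i:(i+k-1), j:(j+k-1)]       # local input set under the kernel
      out[i,j] = sigmoid(sum(patch * kernel))   # sum-product, then activation
    }
  }
  out
}

inp = matrix(runif(27*27), 27, 27)
dim(conv2d(inp, matrix(1/9, 3, 3)))  # 25 25
```

Note that the very same kernel values are reused at every location, which is exactly the parameter sharing that lets a CNN detect recurring patterns.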

Here, we are going to implement a CNN from scratch in R, i.e. without the high-level functions of common deep-learning frameworks like Keras. The only part where we make use of built-in functions is the computation of the partial derivatives of the CNN's parameters: here we will use the *deriv()* function that ships with base R (in the *stats* package; more on that later). The CNN that we are about to build is fairly simplistic in nature, for two reasons: i) we want to be able to maintain an overview of all the computations in what is a structurally complicated type of model, and ii) we want to keep the number of parameters low in order to achieve manageable computation times without requiring specialized graphics processors. Our CNN will, in the end, neither be state-of-the-art in terms of performance, nor will it actually be useful for applied classification tasks (we will only train it on a single image, which will already cost us a great deal of computing time). It is simply intended to provide an overview of the structure and functioning of the essential components of a CNN.

Our CNN will receive a grey-scale image of dimensions 27 * 27 pixels. A single matrix of 3 * 3 parameters is slid over the image, and a sigmoid activation function is applied to the sum-products of parameters and inputs (this set of operations constitutes the first *layer* of the CNN), resulting in the first *hidden representation* of the data, i.e. an array of 25 * 25 * 1 dimensions. We then slide a matrix of 5 * 5 parameters over the first hidden representation, again apply a sigmoid activation function, and obtain the second hidden representation, an array of 21 * 21 * 1 dimensions. Finally, we slide a matrix of 7 * 7 parameters over the second hidden representation, once more apply a sigmoid activation function to the sum-products, and obtain the third hidden representation, an array of 15 * 15 * 1 dimensions. This last step leads to a fairly large reduction of dimensionality between two adjacent layers, but this is necessary in order to keep the number of model parameters manageable (the larger a data representation, the more parameters are required).
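The chain of shapes above follows from the rule that a "valid" convolution with a k * k parameter matrix at stride one shrinks each axis by k - 1. A quick stand-alone check (the function name is illustrative):

```r
# Output size per axis of a "valid" convolution: n - k + 1 (stride 1, no padding)
conv_out = function(n, k) n - k + 1

conv_out(27, 3)  # 25 (first hidden representation)
conv_out(25, 5)  # 21 (second hidden representation)
conv_out(21, 7)  # 15 (third hidden representation)
```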

As a next step, we "flatten" the third hidden representation of the data, which means that we re-arrange the matrix into a vector (of length 15 * 15 = 225 dimensions). This is a necessary step to enable the connection between the relatively high-dimensional, spatially structured data representations and the prediction scalar (we perform a binary classification in the present example). We do lose the spatial information at this point, but given that we have three hidden representations that *have* maintained spatial information and whose parameters are able to learn spatial patterns, our CNN should be well-equipped to handle its assigned task. We connect the "flattened" representation directly to the output scalar, which means that we apply a parameter vector to that representation: we multiply one parameter with every dimension of the flattened representation and again apply a sigmoid activation function to the sum-product (this set of procedures is known as a *fully-connected* or *dense* layer). The result is the scalar prediction of our CNN; the sigmoid function ensures that the scalar lies in the range zero to one, and for our classification task we can apply the simple rule that a prediction below 0.5 indicates class zero and a prediction above 0.5 indicates class one.
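Flattening and the dense layer amount to only a handful of operations. A minimal sketch with random stand-in values (all names here are illustrative, not the model code):

```r
sigmoid = function(x) 1 / (1 + exp(-x))

repr = matrix(runif(15*15), 15, 15)     # stand-in for the third hidden representation
flat = as.vector(repr)                  # "flatten": 15 * 15 = 225 dimensions
w = runif(225, -0.1, 0.1)               # hypothetical dense-layer parameters
pred = sigmoid(sum(w * flat))           # scalar prediction in (0, 1)
class_pred = ifelse(pred > 0.5, 1, 0)   # decision rule described above
```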

Connecting the flattened representation directly to the prediction scalar is not particularly good practice, as we omit the learning of potentially important high-order patterns in the data. However, as *dense* layers are much more parameter-heavy than convolutional layers (*dense* layers do not learn recurring patterns within a given data point), we have to limit the size and complexity of our CNN at this point, as we do not plan on invoking a high-performance graphics processor. For the same reason, we set up our model for binary instead of multi-class classification, as the latter would require the computation of an output *vector* and thus a multiple of the number of *dense*-layer parameters (each dimension of the prediction vector would be connected to each element of the flattened representation via one parameter).

Let us now look at an implementation of the above in R code: We first load a single picture using the *readImage()* function from the *OpenImageR* package, in this case the image of a copepod. We force the image to a size of 27 * 27 pixels. Though this step could stretch an image that is not square, building a CNN for images of non-uniform shape is very difficult to realize (a solution could be to pad the image with background-colored pixels in order to make it square). We also normalize the image by dividing the pixel values by 255, the maximum pixel value commonly found in digital images. Normalization forces all pixel values into the range zero to one and typically improves the fitting of CNN parameters. Finally, we subset the first RGB layer of the image, as we want to build a relatively simple CNN for grey-scale images.

```r
library('OpenImageR')
library('abind')

img = readImage('img.jpg')
img = resizeImage(img, 27, 27)
img = img / 255
img = img[,,1]
image(img)  # displays the image in the graphics device
```

Next, we create a matrix of indices, i.e. integer values, the number of which equals the number of pixels in the input image, i.e. 27 * 27. This is a necessary step, since we need to assign every input pixel to a unique object in order to allow for automatic differentiation later on, and the indices are required for naming these objects in a systematic manner. Since the first parameter matrix covers 3 * 3 pixels at a time (e.g., in the first instance, three from the first row, three from the second row and three from the third row), we cannot simply assign indices by looping sequentially through all rows and columns of the input image. Instead, we first set up a vector of three elements (numbers 1-3) and duplicate it twice, each time adding the maximum value of the original vector or the previous duplicate. The three vectors are concatenated column-by-column, yielding a 3 * 3 matrix containing the values 1-9. Now, as in the previous step, the matrix is duplicated, and the maximum value of the first matrix or the previous duplicate is added. Again, the matrices are concatenated side-by-side, now yielding a larger matrix with three rows. The number of duplications equals the number of columns of the image divided by three, as the concatenated matrices all have three columns; the resulting super-matrix thus has 27 columns. In a final step, the wide super-matrix is duplicated, and the maximum value of the original super-matrix or the preceding duplicate is added. The duplicated super-matrices, whose number equals the number of image rows divided by three (as each duplicate has three rows), are concatenated bottom-to-bottom, yielding a square final super-matrix of 27 columns and 27 rows.

```r
base_mat = c(1,2,3)
for(i in 1:3){
  if(i == 1){
    base_mat_full = base_mat
  } else {
    base_mat_full = cbind(base_mat_full, base_mat+max(base_mat_full))
  }
}
base_mat = base_mat_full
for(i in 1:(ncol(img)/3)){
  if(i == 1){
    base_mat_full = base_mat
  } else {
    base_mat_full = abind(base_mat_full, base_mat+max(base_mat_full), along = 2)
  }
}
base_mat = base_mat_full
for(i in 1:(nrow(img)/3-1)){  # one iteration fewer, as the first pass already binds two copies
  if(i == 1){
    base_mat_full = abind(base_mat, base_mat+max(base_mat), along = 1)
  } else {
    base_mat_full = abind(base_mat_full, base_mat+max(base_mat_full), along = 1)
  }
}
```

We now loop over all rows and all columns of the input image and assign every input pixel to a discrete object, named by index value from the corresponding position in the matrix generated above. The naming scheme is "inp_index".

```r
sapply(seq(1,nrow(img)), function(x){
  sapply(seq(1,ncol(img)), function(y){
    eval(parse(text = paste0("inp_",base_mat_full[x,y]," <<- img[",x,",",y,"]")))
  })
})
```

In the next step, we set up the initial parameter values for the first parameter matrix (i.e., those corresponding to the first layer of the CNN). We do this by sampling from the *Glorot-uniform* distribution, a uniform distribution over the range between sqrt(6/(n(in)+n(out))) and its negative, where "n(in)" is the number of input dimensions and "n(out)" the number of output dimensions of the layer. In our case, since we only consider a local field of the input, we have nine input dimensions and one output dimension (in CNNs, one would usually consider the entire input and the entire output). Hence, the initial parameter values are sampled from the uniform distribution bounded by -sqrt(6/(9+1)) and +sqrt(6/(9+1)).
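In vectorized form, this initialization scheme looks as follows (a stand-alone sketch with an illustrative function name; the model code below creates each parameter as a separate named object instead):

```r
# Glorot-uniform initialization: uniform on [-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]
glorot_uniform = function(n, fan_in, fan_out){
  lim = sqrt(6 / (fan_in + fan_out))
  runif(n, min = -lim, max = lim)
}

w = glorot_uniform(9, fan_in = 9, fan_out = 1)  # nine 3 * 3 kernel parameters
range(w)  # all values lie within +/- sqrt(6/10), i.e. roughly +/- 0.775
```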

```r
lim_l1 = sqrt(6/(3*3+1))
sapply(seq(1,9*1,1), function(y){
  eval(parse(text = paste0('l1m1w',y,' <<- runif(1, min = -lim_l1, max = lim_l1)')))
})
```

Before we apply the parameters to calculate the first intermediate representation of the data (the output of the first layer), we set up another matrix of indices that we will use to name the discrete objects to which we will assign the single components of the first intermediate representation. At this step, we already need to decide on the window size (i.e. the number of parameters) of the second layer of the CNN, since we need to distribute the indices accordingly in our new index matrix. We will apply a matrix of 5 * 5 parameters in the second layer. This is a conscious choice, since the first hidden representation of the data will be a matrix of 25 * 25 dimensions (when looping over all columns and rows of the input image with the 3 * 3 parameter matrix of the first layer, two columns and two rows "vanish" between the input data and the first representation, a result of the calculation of sum-products), and 25 is evenly divisible by 5. As a result, in the same manner as we generated the first index matrix, we now concatenate five column vectors with five elements each, then concatenate five of these matrices side-by-side, and then concatenate five of these super-matrices bottom-to-bottom.

```r
base_mat = c(1,2,3,4,5)
for(i in 1:5){
  if(i == 1){
    base_mat_full_2 = base_mat
  } else {
    base_mat_full_2 = cbind(base_mat_full_2, base_mat+max(base_mat_full_2))
  }
}
base_mat = base_mat_full_2
for(i in 1:((ncol(img)-2)/5)){
  if(i == 1){
    base_mat_full_2 = base_mat
  } else {
    base_mat_full_2 = abind(base_mat_full_2, base_mat+max(base_mat_full_2), along = 2)
  }
}
base_mat = base_mat_full_2
for(i in 1:((nrow(img)-2)/5-1)){
  if(i == 1){
    base_mat_full_2 = abind(base_mat, base_mat+max(base_mat), along = 1)
  } else {
    base_mat_full_2 = abind(base_mat_full_2, base_mat+max(base_mat_full_2), along = 1)
  }
}
```

Now, finally, we calculate the first hidden representation of the data utilizing the parameters defined above: We loop over all rows and columns of the input image (except for the last two elements of each of the two axes, since our parameter matrix always covers the current cell of the image plus the two adjacent ones along each axis). Each component of the hidden representation, which is here treated as a discrete object, receives a unique name defined by the second index matrix, or, more specifically, by the element of the second index matrix corresponding to the current row-column combination of the input image. We initially assign a zero to that object. We then build a string, using a loop over the components of the parameter matrix of the first layer, in which we construct a sub-string representing a multiplication of parameter and input value. The string components are concatenated with the addition symbol. Finally, the fully concatenated string is pasted into another string that symbolically applies a sigmoid function.

```r
invisible(sapply(seq(1,nrow(img)-2), function(x){
  sapply(seq(1,ncol(img)-2), function(y){
    eval(parse(text = paste0('interm_l1_',base_mat_full_2[x,y],' <<- "0"')))
    sapply(seq(1,9*1,1), function(z){
      eval(parse(text = (paste0('interm_l1_',base_mat_full_2[x,y],' <<- paste0(interm_l1_',base_mat_full_2[x,y],', " + l1m1w',z,' * inp_',base_mat_full[x:(x+2),y:(y+2)][z], '")'))))
    })
    eval(parse(text = paste0('interm_l1_',base_mat_full_2[x,y],' <<- paste0("1 / (1 + exp(-(",interm_l1_',base_mat_full_2[x,y],',")))")')))
  })
}))
eval(parse(text = interm_l1_1))
```

Now, we want to take a look at the first intermediate representation. For that, we first set up an empty array the dimensions of which match those of the second index matrix (i.e., the number of columns of the input image minus two and the number of rows of the input image minus two). We then fill in that matrix with the single components of the intermediate representation (which are all separate objects at the moment). We then use the *image()* function to visualize that matrix. We can see that the intermediate representation is an abstraction of the original image. It is not simply a visualization at a lower resolution, but certain areas are more emphasized than others.

```r
repr_arr = array(dim = c(nrow(img)-2, nrow(img)-2, 1))
max_n = max(base_mat_full_2)
for(i in 1:max_n){
  coords = which(base_mat_full_2==i, arr.ind = T)[1,]
  eval(parse(text = eval(parse(text = paste0('paste0("repr_arr[coords[1], coords[2], 1] = ", interm_l1_',i,')')))))
}
par(mfrow = c(2,2))
image(img[,])
image(repr_arr[,,1])
```

Next, we set up the initial parameter values for the second parameter matrix (i.e., those corresponding to the second layer of the CNN). We apply the same methodology as in the setting of parameter values for the first parameter matrix. Since the first intermediate representation of the data is a matrix of 25 * 25 elements, it makes sense to work with a parameter matrix of a size of which the representation is an integer multiple. Here, we will work with a matrix of 5 * 5 parameters, and the parameters defining the Glorot-uniform distribution are hence n(in) = 25 and n(out) = 1.

```r
lim_l2 = sqrt(6/(5*5+1))
sapply(seq(1,25*1,1), function(y){
  eval(parse(text = paste0('l2m1w',y,' <<- runif(1, min = -lim_l2, max = lim_l2)')))
})
```

Before we generate the second intermediate representation of the data using the parameters just initialized, we again set up a new matrix of indices that will be used to name the objects to which the single components of the second intermediate representation will be assigned in the next step. Again, we need to decide on the window size (number of parameters) of the third layer of the CNN. We will use a window size of 7 * 7 parameters: the second intermediate representation (the result of computing the sum-products of inputs and parameters and applying an activation function) will be a matrix of 21 * 21 elements (the input to the second layer is a matrix of 25 * 25 elements, and the parameter matrix applied is of size 5 * 5, hence four columns and four rows will disappear when looping over all rows and columns at a step size of one and summing over 5 * 5 elements), and 21 can be evenly divided by 7. The new index matrix is generated in the same manner as before (see above); the maximum index will be 21 * 21 = 441.

```r
base_mat = c(1,2,3,4,5,6,7)
for(i in 1:7){
  if(i == 1){
    base_mat_full_3 = base_mat
  } else {
    base_mat_full_3 = cbind(base_mat_full_3, base_mat+max(base_mat_full_3))
  }
}
base_mat = base_mat_full_3
for(i in 1:((ncol(repr_arr)-4)/7)){
  if(i == 1){
    base_mat_full_3 = base_mat
  } else {
    base_mat_full_3 = abind(base_mat_full_3, base_mat+max(base_mat_full_3), along = 2)
  }
}
base_mat = base_mat_full_3
for(i in 1:((nrow(repr_arr)-4)/7-1)){
  if(i == 1){
    base_mat_full_3 = abind(base_mat, base_mat+max(base_mat), along = 1)
  } else {
    base_mat_full_3 = abind(base_mat_full_3, base_mat+max(base_mat_full_3), along = 1)
  }
}
```

Now we compute the second hidden representation of the data as before, using the first hidden representation and the parameters of the second parameter matrix as input. Again, we apply a sigmoid activation function on the sum-products of input data and parameters. As before, each element of the second hidden representation is assigned to its own discrete object. Again, we visualize the second hidden representation by filling an empty matrix with the values stored in these objects. We observe an abstract version of the original input image (more strongly abstracted than the first hidden representation).

```r
invisible(sapply(seq(1,nrow(repr_arr)-4), function(x){
  sapply(seq(1,ncol(repr_arr)-4), function(y){
    eval(parse(text = paste0('interm_l2_',base_mat_full_3[x,y],' <<- "0"')))
    sapply(seq(1,25*1,1), function(z){
      eval(parse(text = (paste0('interm_l2_',base_mat_full_3[x,y],' <<- paste0(interm_l2_',base_mat_full_3[x,y],', " + l2m1w',z,' * ", (interm_l1_',base_mat_full_2[x:(x+4),y:(y+4)][z], '))'))))
    })
    eval(parse(text = paste0('interm_l2_',base_mat_full_3[x,y],' <<- paste0("1 / (1 + exp(-(",interm_l2_',base_mat_full_3[x,y],',")))")')))
  })
}))
eval(parse(text = interm_l2_1))

repr_arr_2 = array(dim = c(nrow(repr_arr)-4, nrow(repr_arr)-4, 1))
max_n = max(base_mat_full_3)
for(i in 1:max_n){
  coords = which(base_mat_full_3==i, arr.ind = T)[1,]
  eval(parse(text = eval(parse(text = paste0('paste0("repr_arr_2[coords[1], coords[2], 1] = ", interm_l2_',i,')')))))
}
image(repr_arr_2[,,1])
```

Now we set the initial values for the third parameter matrix (i.e., those corresponding to the third layer of the CNN). The third parameter matrix has a size of 7 * 7, and thus the parameters of the Glorot-uniform distribution are n(in) = 49 and n(out) = 1.

```r
lim_l3 = sqrt(6/(7*7+1))
sapply(seq(1,49*1,1), function(y){
  eval(parse(text = paste0('l3m1w',y,' <<- runif(1, min = -lim_l3, max = lim_l3)')))
})
```

Again we set up a matrix of indices for naming the components of the third representation of the data generated in the following step. By applying the third parameter matrix on the second representation of the data, which has a size of 21 * 21, we will "lose" six rows and six columns (through the calculation of sum-products), so the third representation of the data will be a matrix of size 15 * 15. It makes sense to divide this matrix into sub-matrices of 5 * 5, even though we will not apply a convolutional layer (i.e. a parameter matrix) in the next step. Accordingly, we construct our third index matrix out of smaller matrices with size 5 * 5.

```r
base_mat = c(1,2,3,4,5)
for(i in 1:5){
  if(i == 1){
    base_mat_full_4 = base_mat
  } else {
    base_mat_full_4 = cbind(base_mat_full_4, base_mat+max(base_mat_full_4))
  }
}
base_mat = base_mat_full_4
for(i in 1:((ncol(repr_arr_2)-6)/5)){
  if(i == 1){
    base_mat_full_4 = base_mat
  } else {
    base_mat_full_4 = abind(base_mat_full_4, base_mat+max(base_mat_full_4), along = 2)
  }
}
base_mat = base_mat_full_4
for(i in 1:((nrow(repr_arr_2)-6)/5-1)){
  if(i == 1){
    base_mat_full_4 = abind(base_mat, base_mat+max(base_mat), along = 1)
  } else {
    base_mat_full_4 = abind(base_mat_full_4, base_mat+max(base_mat_full_4), along = 1)
  }
}
```

Now, as before, we compute the third hidden representation of the data, from the second hidden representation (the input) and the parameters of the third parameter matrix. And again, we apply a sigmoid activation function on the sum-products of the input-parameter combinations. Again, we assign each component to its own discrete object, now using the third index matrix. We visualize the third hidden representation of the data as before, observing a further abstraction of the original data.

```r
invisible(sapply(seq(1,nrow(repr_arr_2)-6), function(x){
  sapply(seq(1,ncol(repr_arr_2)-6), function(y){
    eval(parse(text = paste0('interm_l3_',base_mat_full_4[x,y],' <<- "0"')))
    sapply(seq(1,49*1,1), function(z){
      eval(parse(text = (paste0('interm_l3_',base_mat_full_4[x,y],' <<- paste0(interm_l3_',base_mat_full_4[x,y],', " + l3m1w',z,' * ", (interm_l2_',base_mat_full_3[x:(x+6),y:(y+6)][z], '))'))))
    })
    eval(parse(text = paste0('interm_l3_',base_mat_full_4[x,y],' <<- paste0("1 / (1 + exp(-(",interm_l3_',base_mat_full_4[x,y],',")))")')))
    print(c(x,y))  # progress indicator
  })
}))
eval(parse(text = interm_l3_1))

repr_arr_3 = array(dim = c(nrow(repr_arr_2)-6, nrow(repr_arr_2)-6, 1))
max_n = max(base_mat_full_4)
for(i in 1:max_n){
  coords = which(base_mat_full_4==i, arr.ind = T)[1,]
  eval(parse(text = eval(parse(text = paste0('paste0("repr_arr_3[coords[1], coords[2], 1] = ", interm_l3_',i,')')))))
}
image(repr_arr_3[,,1])
```

We now apply the last set of parameters to the third hidden representation of the data in order to generate the class prediction of our model (technically, that prediction, in the form of a scalar value, is thus also a "representation" of the original input data, albeit a severely condensed one). This time, however, we proceed differently, as we no longer work with a convolutional layer: We instead set up a set of parameter values whose number equals the number of elements in the third hidden representation of the data. The Glorot-uniform distribution from which we draw the initial estimates is defined by that number (n(in)) and by n(out) = 1, since we only make a binary class prediction (when there are more than two classes, n(out) equals the number of classes). We then loop through all elements of the third hidden representation, as before making use of the fact that the names of the objects corresponding to that representation are indexed numerically. While looping, we generate a large character string that represents a sum-product of parameters and inputs (i.e., the single components of the third hidden representation).

```r
len_l3 = max(base_mat_full_4)
len_l4 = 1  # size of output vector (in this case, the prediction is binary, so the output is actually a scalar)
lim_l4 = sqrt(6/(len_l3+len_l4))
sapply(seq(1,len_l3), function(x){
  sapply(seq(1,len_l4), function(y){
    eval(parse(text = paste0('l4w',x,'_',y,' <<- runif(1, min = -lim_l4, max = lim_l4)')))
  })
})
invisible(sapply(seq(1,len_l4), function(x){
  eval(parse(text = paste0('fin_',x,' <<- "0"')))
  sapply(seq(1,len_l3), function(y){
    eval(parse(text = paste0('fin_',x,' <<- paste0(fin_',x,', " + l4w',y,'_',x,' * ", (interm_l3_',y, '))')))
    print(c(x,y))
  })
}))
```

We then embed the complete character string into an expression representing a sigmoid activation function (after all, the prediction should lie in the range zero to one, with a value > 0.5 indicating a prediction of class "1" instead of class "0"). We evaluate this expression to obtain the class prediction. We then also extend the expression with the binary-cross-entropy loss function, which quantifies the deviation of the prediction from the "true" class label. In our case, we assume the true class label to be "1", for which the loss reduces to -(y_true * log(y_pred)); the term -(1 - y_true) * log(1 - y_pred) of the full loss vanishes.

```r
eval(parse(text = 'pred <<- paste0("1 / (1 + exp(-(",fin_1,")))")'))
eval(parse(text = 'fin_1 <<- paste0("1 / (1 + exp(-(",fin_1,")))")'))
eval(parse(text = 'fin_1 <<- paste0("-(1 * log(",fin_1,"))")'))  # "1" = true class label

eval(parse(text = pred))   # prediction
eval(parse(text = fin_1))  # binary-cross-entropy loss
```

The predicted value of 0.XX would be interpreted as a prediction of class "1", but it is still fairly far away from 1, so the prediction is somewhat uncertain. We can reduce that uncertainty by utilizing the binary-cross-entropy loss to compute parameter-specific partial loss derivatives and updating the parameter values accordingly.
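This behavior of the loss is easy to verify numerically: for a true label of 1, the binary cross-entropy reduces to -log(y_pred), which shrinks as the prediction approaches 1 (a small stand-alone check, separate from the model code):

```r
# Binary cross-entropy for a single prediction
bce = function(y_true, y_pred) -(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))

bce(1, 0.6)   # ~0.511
bce(1, 0.9)   # ~0.105
bce(1, 0.99)  # ~0.010
```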

Indeed, we intend to do just that. We are going to utilize the chain rule to implement the back-propagation algorithm, and to that end, we will calculate the partial derivatives of the outputs of the various layers i) with respect to the parameters contained within these layers and ii) with respect to the input that these layers receive. The various partial derivatives will later be combined mathematically to calculate the partial derivatives of the classification loss (essentially the output of the final layer) with respect to all the parameters in the network. We begin by calculating the derivative of the classification loss with respect to the parameters in the final layer, i.e. the fully-connected or "dense" layer. To that end, we apply the *deriv()* function to an expression of the mathematical operations in that layer, and specify, in a second argument to that function, that the partial derivatives should be calculated with respect to the parameters of that layer. In a second step, we use the same function and expression, but an altered second argument, to calculate the partial derivatives of the classification loss with respect to the input to that layer (i.e. the single components of the third hidden representation of the input data, the output of the third convolutional layer, with which the parameters of the final layer are multiplied).
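The mechanics of *deriv()* can be seen on a toy case before it is applied to the full dense-layer expression: for f(w) = 1/(1 + exp(-(w * x))), the analytical derivative with respect to w is x * f * (1 - f):

```r
# deriv() turns an expression into one that, when evaluated, carries the
# partial derivatives as the "gradient" attribute
d = deriv(expression(1 / (1 + exp(-(w * x)))), c("w"))

w = 0.5; x = 2
res = eval(d)
grad = attributes(res)$gradient  # numeric value of d f / d w

# analytical check
f = 1 / (1 + exp(-(w * x)))
x * f * (1 - f)  # matches grad
```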

```r
sapply(seq(1,len_l4), function(x){
  xpr_a = paste0(sapply(seq(1,len_l3,1), function(y){
    eval(parse(text = paste0('paste0("l4w',y,'_',x,' * interm_l3_',y,'")')))
  }), collapse = ' + ')
  xpr_a = paste0('1 / (1 + exp(-(',xpr_a,')))')
  xpr_a = paste0('-(1 * log(',xpr_a,'))')
  xpr_b = paste0(sapply(seq(1,len_l3), function(y){paste0('"l4w',y,'_',x,'"')}), collapse = ', ')
  xpr_b = paste0('c(',xpr_b,')')
  xpr_1 = paste0('dout_wl4_',x,' <<- deriv(expression(',xpr_a,'), ', xpr_b, ')')
  eval(parse(text = xpr_1))  # derivative of loss w.r.t. layer-4 weights
  xpr_b = paste0(sapply(seq(1,len_l3), function(y){paste0('"interm_l3_',y,'"')}), collapse = ', ')
  xpr_b = paste0('c(',xpr_b,')')
  xpr_2 = paste0('dout_interml3_',x,' <<- deriv(expression(',xpr_a,'), ', xpr_b, ')')
  eval(parse(text = xpr_2))  # derivative of loss w.r.t. layer-3 output
  print(x)
})
sapply(seq(1,len_l3,1), function(y){
  eval(parse(text = paste0('interm_l3_',y,' <<- eval(parse(text = paste0(interm_l3_',y,')))')))
})
```

Next, we compute the partial derivatives of the output of the third convolutional layer with respect to the parameters in that layer, to those in the second convolutional layer, and to those in the first convolutional layer. These derivatives will later be combined, via the chain rule, with the partial derivatives of the classification loss with respect to the output of the third convolutional layer, in order to generate the derivatives of the loss with respect to the layer-1, layer-2 and layer-3 parameters. We avoid a further, potentially confusing disentanglement of derivative calculations (e.g., calculating the derivative of layer-2 outputs with respect to layer-2 parameters and combining this with derivatives of layer-3 outputs with respect to layer-2 outputs in order to generate derivatives of layer-3 outputs with respect to layer-2 parameters) because the convolutional layers employ a fairly small number of parameters. Calculating partial derivatives of the layer-3 output across a distance of several layers is therefore not particularly computationally strenuous, compared to calculating partial loss derivatives with respect to the parameters of early layers in the network, which would involve calculations over the entire (large) set of parameters in the fully-connected layer (layer number four). Separating the computation of partial loss derivatives with respect to the layer-3 output from the calculation of partial derivatives of the layer-3 output with respect to the parameters in the earlier layers means that the former step must only be conducted once, which saves a tremendous amount of time.

```r
sapply(seq(1,nrow(repr_arr_2)-6), function(x){
  sapply(seq(1,ncol(repr_arr_2)-6), function(y){
    xpr_a = paste0(sapply(seq(1,49*1,1), function(z){
      eval(parse(text = paste0('paste0("l3m1w',z,' * ", (interm_l2_',base_mat_full_3[x:(x+6),y:(y+6)][z],'))')))
    }), collapse = ' + ')
    xpr_a = paste0('1 / (1 + exp(-(',xpr_a,')))')
    xpr_b = paste0(sapply(seq(1,49*1,1), function(z){paste0('"l3m1w',z,'"')}), collapse = ', ')
    xpr_b = paste0('c(',xpr_b,')')
    xpr_1 = paste0('dinterm_l3m1_',x,'_',y,'_wl3 <<- deriv(expression(',xpr_a,'), ', xpr_b, ')')
    invisible(eval(parse(text = xpr_1)))  # derivative of layer-3 output w.r.t. layer-3 weights
    xpr_b = paste0(sapply(seq(1,25*1,1), function(z){paste0('"l2m1w',z,'"')}), collapse = ', ')
    xpr_b = paste0('c(',xpr_b,')')
    xpr_1 = paste0('dinterm_l3m1_',x,'_',y,'_wl2 <<- deriv(expression(',xpr_a,'), ', xpr_b, ')')
    invisible(eval(parse(text = xpr_1)))  # derivative of layer-3 output w.r.t. layer-2 weights
    xpr_b = paste0(sapply(seq(1,9*1,1), function(z){paste0('"l1m1w',z,'"')}), collapse = ', ')
    xpr_b = paste0('c(',xpr_b,')')
    xpr_1 = paste0('dinterm_l3m1_',x,'_',y,'_wl1 <<- deriv(expression(',xpr_a,'), ', xpr_b, ')')
    invisible(eval(parse(text = xpr_1)))  # derivative of layer-3 output w.r.t. layer-1 weights
    print(c(x,y))
  })
})

# save.image('CNN_Session.R')  # saves the R session to disk
# load('CNN_Session.R')        # loads the R session
```

The code listed above itself takes considerable time to run, though, so it is advisable to let it run e.g. overnight or on an external server, and to save the R session to a file to enable a continuation of the work at a later date.

Now, we finally calculate the loss gradients for the parameters in the convolutional layers from the partial derivatives of the classification loss with respect to the output of the final convolutional layer (the output is the result of a mathematical graph of operations involving all three convolutional layers) and the partial derivatives of this output with respect to the parameters of the three convolutional layers. The calculation is indeed only a simple multiplication of these two types of derivatives, as per the chain rule. As we have obtained partial derivatives of the loss with respect to all components of the output of the third convolutional layer, and since the parameters of all convolutional layers contribute to every one of these components, we loop over the single components and iteratively sum up the partial derivatives for each parameter.
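The chain-rule multiplication used here can be illustrated on a toy composite function (a stand-alone sketch with illustrative names): for L = h^2 with h = sigmoid(w * x), dL/dw is the product of dL/dh and dh/dw, which *deriv()* confirms:

```r
sigmoid = function(z) 1 / (1 + exp(-z))

w = 0.3; x = 2
h = sigmoid(w * x)
dL_dh = 2 * h            # derivative of the outer function L = h^2
dh_dw = x * h * (1 - h)  # derivative of the inner function h = sigmoid(w * x)
dL_dw = dL_dh * dh_dw    # chain rule: multiply the two partial derivatives

# cross-check with automatic differentiation
d = deriv(expression((1 / (1 + exp(-(w * x))))^2), c("w"))
attributes(eval(d))$gradient  # matches dL_dw
```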

```r
grads_w4 = attributes(eval(dout_wl4_1))$gradient  # gradients for parameters in the fully-connected layer
grads_w3 = 0
grads_w2 = 0
grads_w1 = 0
sapply(seq(1,nrow(repr_arr_2)-6), function(x){
  sapply(seq(1,ncol(repr_arr_2)-6), function(y){
    eval(parse(text = paste0('grads_w3 <<- grads_w3 + attributes(eval(dout_interml3_1))$gradient[,"interm_l3_',base_mat_full_4[x,y],'"] * attributes(eval(dinterm_l3m1_',x,'_',y,'_wl3))$gradient')))
    eval(parse(text = paste0('grads_w2 <<- grads_w2 + attributes(eval(dout_interml3_1))$gradient[,"interm_l3_',base_mat_full_4[x,y],'"] * attributes(eval(dinterm_l3m1_',x,'_',y,'_wl2))$gradient')))
    eval(parse(text = paste0('grads_w1 <<- grads_w1 + attributes(eval(dout_interml3_1))$gradient[,"interm_l3_',base_mat_full_4[x,y],'"] * attributes(eval(dinterm_l3m1_',x,'_',y,'_wl1))$gradient')))
  })
})
```

Finally, we update the parameters of the fully-connected and the convolutional layers using their current values, the loss gradients, and a learning rate. We set the learning rate to 0.1, which is relatively high and might lead to sub-optimal results in the long run, but is effective for immediately demonstrating the effect of a parameter update. The update formula is *new_parameter_value = current_parameter_value - learning_rate * loss_gradient*.

```r
lr = 1e-1
for(i in 1:length(grads_w4)){
  eval(parse(text = paste0(attributes(grads_w4)$dimnames[[2]][i], ' <<- ', attributes(grads_w4)$dimnames[[2]][i], ' - lr * grads_w4[,"',attributes(grads_w4)$dimnames[[2]][i],'"]')))
}
for(i in 1:length(grads_w3)){
  eval(parse(text = paste0(attributes(grads_w3)$dimnames[[2]][i], ' <<- ', attributes(grads_w3)$dimnames[[2]][i], ' - lr * grads_w3[,"',attributes(grads_w3)$dimnames[[2]][i],'"]')))
}
for(i in 1:length(grads_w2)){
  eval(parse(text = paste0(attributes(grads_w2)$dimnames[[2]][i], ' <<- ', attributes(grads_w2)$dimnames[[2]][i], ' - lr * grads_w2[,"',attributes(grads_w2)$dimnames[[2]][i],'"]')))
}
for(i in 1:length(grads_w1)){
  eval(parse(text = paste0(attributes(grads_w1)$dimnames[[2]][i], ' <<- ', attributes(grads_w1)$dimnames[[2]][i], ' - lr * grads_w1[,"',attributes(grads_w1)$dimnames[[2]][i],'"]')))
}
```

All parameter values are now updated, and we can run the cycle of operations again from the start, beginning with computing the first hidden representation of the input as the output of the first convolutional layer. Comparing the visualizations of the intermediate representations resulting from the three convolutional layers with those from the first iteration, we do indeed observe some changes. We also obtain a lower classification loss.