
Building a deep neural network from scratch

Deep neural networks for classification or regression are typically constructed via the Keras API in Python, which is so user-friendly that it essentially acts as a black box. Here, we will look at a much less opaque approach using the deriv() function in R.

Deep neural networks (DNNs) are at the core of modern classifier systems colloquially referred to as devices using "Artificial Intelligence". Yet the inner mechanisms of the models behind these devices remain a mystery to many. This is partly because, with modern application-programming interfaces (APIs) like Keras, one does not need intricate knowledge of how such models are optimized. Here, we are going to build a simple DNN on our own, using only the deriv() function in R. This lets us construct every step of the assembly and training of a DNN ourselves, rather than relying on a "black-box" API. Please be aware that this article is not intended to teach the inner mechanisms of DNNs and their training; we only look at the implementation of model construction and training here.

deriv() is a base-R function that calculates the derivative of a function with respect to the parameters contained in it. Calculating derivatives is important when training a DNN, as in most machine-learning models: what we actually need are the gradients of the model's loss function with respect to its parameters, and the gradient of a function is given by its first derivatives. Expressions passed to deriv() can be arbitrarily complex, but may only contain relatively basic mathematical operations; more specialized functions like sum() or mean() cannot be used.
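
To get a feeling for how deriv() is used before we apply it to a whole network, consider a small toy example (not related to the iris model below): the derivative of x^2 + 3 * x with respect to x is 2 * x + 3, and deriv() returns an expression that computes both the value and this gradient.

d_simple = deriv(~ x^2 + 3 * x, c('x'))   # expression for value and gradient
x = 2
eval(d_simple)                            # value 10, with attribute "gradient" equal to 2 * 2 + 3 = 7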

Let us now build our DNN within the context of the deriv() function. What we actually construct is the loss function: it contains the DNN itself, takes an observation, computes the prediction, and returns a scalar value. This scalar loss is the key to calculating the partial derivatives with respect to the parameters of the DNN. Our DNN will have three layers: one input layer that receives a four-dimensional input, one hidden layer with three nodes, and one output layer with three nodes. The input and the first hidden layer will be connected with 4 * 3 = 12 weights, plus three biases, one for each node in the first hidden layer. The first hidden layer and the output layer (referred to below as the second hidden layer) will be connected with 3 * 3 = 9 weights, plus three biases, one for each node in the output layer. Formally, we will treat each bias as an additional weight that is applied to a pseudo-input-dimension or pseudo-node of constant value 1. The summed products of input data and weights, plus the bias, will be activated with a sigmoid function in the first hidden layer, and with a softmax function in the output layer. The softmax function simply ensures that the outputs of all nodes in the final layer sum to 1. The final output of the DNN, that is, a vector of three values that sum to one, will be compared to the ground truth, a one-hot-encoded vector of three values, i.e. a vector with two zeros and a single one.
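
Before expanding everything by hand, it may help to see the two activation functions in isolation. The helper names sigmoid() and softmax() below are only for illustration; they cannot be used inside the deriv() expression, where the formulas have to be spelled out explicitly.

sigmoid = function(x) 1 / (1 + exp(-x))    # squashes any real number into the interval (0, 1)
softmax = function(x) exp(x) / sum(exp(x)) # rescales a vector so that its elements sum to 1

sigmoid(0)            # 0.5
softmax(c(1, 2, 3))   # approximately 0.09, 0.24, 0.67; the three values sum to 1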

Since deriv() does not allow aggregating operators like sum(), and since we do not want to enter the subject of matrix calculations at this point, we will have to write out a lot of repetition in our code in order to get from the multivariate DNN output and ground truth to the final scalar loss. We will look at single expressions first, and then gradually piece everything together. We begin with the calculation for one single node of the first hidden layer: since we have four-dimensional input data, we have four input values, termed in1a to in1d. Each of these is multiplied with one parameter, or weight, as in x1a1 * in1a; x1a1 to x1d1 are the names of the weights. The products are added up. Then we add the bias, which in this example is formally treated like a weight, hence it is multiplied with a scalar 1 instead of an input value. The bias is called x1e1. Finally, the entire sum is activated with a sigmoid function of the form 1 / (1 + exp(-x)), where x is the sum of the input-weight products and the bias.

1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1)))

Since we have three nodes, this procedure must be repeated for each node. Note that the weights for each node are unique, as is the bias. So, for the second node, we have weights x1a2 to x1d2, and for the final node we have weights x1a3 to x1d3. The inputs are the same for every node, however, since every input value is connected to every node in the first hidden layer. So we still have in1a to in1d for the second and third nodes.

1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1)))
1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1)))
1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1)))

Now, we come to the second hidden layer, or output layer. This layer, like the first hidden layer, has three nodes. In each node, the output of each of the nodes of the previous layer, i.e. the result of the sigmoid activation, is multiplied with one weight. For the first node in the output layer, these are the weights x2a1 to x2c1. Again, the products are added. A bias, x2d1, which is again treated as a weight and thus multiplied with a scalar 1, is added, as well.

x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
x2d1 * 1

Now that we have arrived at the output layer, the summed products plus the bias are not activated with the sigmoid function, but with the softmax function. Softmax ensures that the outputs of all nodes of the output layer sum to one. Therefore, the (non-activated) outputs of all three nodes must be computed every time we want to calculate the activation of one specific node. The formula for the softmax function is exp(x_i) / (exp(x_1) + ... + exp(x_i) + ... + exp(x_I)), where I is the number of output nodes and x_i is the non-activated output of the i-th output node.

exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
      x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
      x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
      x2d1 * 1)) /
(exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
       x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
       x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
       x2d1 * 1)) + 
 exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
       x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
       x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
       x2d2 * 1)) +
 exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
       x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
       x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
       x2d3 * 1)))

Now, in order to get the gradient of the loss function, we need to calculate the loss for the classification task, the categorical cross-entropy. This is defined as the negative sum, over all classes, of the product of an element of the true one-hot-encoded class vector with the logarithm of the corresponding element of the predicted class vector: -sum(y_true_i * log(y_pred_i)). Therefore, we take the (natural) logarithm of each output node and multiply it with the corresponding true value (either a zero or a one, with the one indicating that the corresponding index in the ground-truth vector is the true class index). Then we add the three products together and multiply the sum by -1.

-(
  (y_1 * log(
  exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
        x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
        x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
        x2d1 * 1)) /
  (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d1 * 1)) + 
   exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d2 * 1)) +
   exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d3 * 1)))
    )) +
  (y_2 * log(
  exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
        x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
        x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
        x2d2 * 1)) /
   (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d1 * 1)) + 
    exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d2 * 1)) +
    exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d3 * 1)))
    )) +
  (y_3 * log(
  exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
        x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
        x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
        x2d3 * 1)) /
   (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d1 * 1)) + 
      exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d2 * 1)) +
      exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d3 * 1)))
    ))
)
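
Outside of deriv(), where sum() and vector arithmetic are available, the same loss can be written far more compactly; the following sketch (with placeholder vectors y_true and y_pred that are not used elsewhere in this article) is only meant to make the correspondence between the long expression above and the usual cross-entropy formula explicit.

cross_entropy = function(y_true, y_pred) -sum(y_true * log(y_pred))   # categorical cross-entropy for one observation

cross_entropy(c(0, 1, 0), c(0.2, 0.7, 0.1))   # equals -log(0.7), approximately 0.357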

Now we insert all of this into the deriv() function, marking it as an expression to be evaluated later by putting a tilde in front of it. After the expression, we pass a character vector of the parameter names for which gradients should be calculated. In our case, we want the gradients for every weight and every bias in the neural network, so we list all their names here, exactly as they are used in the expression.

dv = deriv(~
  -(
  (y_1 * log(
  exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
        x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
        x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
        x2d1 * 1)) /
  (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d1 * 1)) + 
   exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d2 * 1)) +
   exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d3 * 1)))
    )) +
  (y_2 * log(
  exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
        x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
        x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
        x2d2 * 1)) /
   (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d1 * 1)) + 
    exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d2 * 1)) +
    exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d3 * 1)))
    )) +
  (y_3 * log(
  exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
        x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
        x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
        x2d3 * 1)) /
   (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d1 * 1)) + 
      exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d2 * 1)) +
      exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
         x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
         x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
         x2d3 * 1)))
    ))
  ),
  c('x1a1', 'x1b1', 'x1c1', 'x1d1', 'x1e1',
    'x1a2', 'x1b2', 'x1c2', 'x1d2', 'x1e2',
    'x1a3', 'x1b3', 'x1c3', 'x1d3', 'x1e3',
    'x2a1', 'x2b1', 'x2c1', 'x2d1',
    'x2a2', 'x2b2', 'x2c2', 'x2d2',
    'x2a3', 'x2b3', 'x2c3', 'x2d3')
)
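
Note that dv is not a function but an unevaluated R expression; it only computes a loss and gradients once we eval() it in an environment in which the inputs, the ground truth and all parameters exist.

class(dv)   # "expression"
dv          # prints the generated code, including the "gradient" attribute that will hold the partial derivatives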

Now that we have set the neural network up for training, we can initialize its parameters: for each parameter, we draw a random value from a standard normal distribution. This is not quite the ideal way to initialize DNN parameters (in practice, this is commonly done with a Glorot, or Xavier, uniform distribution), but the classification task at hand is relatively easy, so normally distributed starting values should suffice in our case. When the expression contained in the deriv() object is evaluated, it will draw on the parameter values we initialize here.

x1a1 = rnorm(1)
x1b1 = rnorm(1)
x1c1 = rnorm(1)
x1d1 = rnorm(1)
x1e1 = rnorm(1)
x1a2 = rnorm(1)
x1b2 = rnorm(1)
x1c2 = rnorm(1)
x1d2 = rnorm(1)
x1e2 = rnorm(1)
x1a3 = rnorm(1)
x1b3 = rnorm(1)
x1c3 = rnorm(1)
x1d3 = rnorm(1)
x1e3 = rnorm(1)

x2a1 = rnorm(1)
x2b1 = rnorm(1)
x2c1 = rnorm(1)
x2d1 = rnorm(1)
x2a2 = rnorm(1)
x2b2 = rnorm(1)
x2c2 = rnorm(1)
x2d2 = rnorm(1)
x2a3 = rnorm(1)
x2b3 = rnorm(1)
x2c3 = rnorm(1)
x2d3 = rnorm(1)
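
For reference, a Glorot-style (Xavier) uniform initialization, as mentioned above, could look like the following sketch; the function name glorot_uniform() is ours, and we stick with the normal draws above for the rest of the article. The limit of the uniform distribution depends on the number of incoming and outgoing connections of the layer.

glorot_uniform = function(n_in, n_out){
  limit = sqrt(6 / (n_in + n_out))    # Glorot-uniform limit
  runif(1, min = -limit, max = limit)
}

x1a1_alt = glorot_uniform(4, 3)   # e.g. a weight between the four-dimensional input and the three-node hidden layer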

Next, we prepare the data for training the neural network. In this demonstration we keep it simple and do not split the data into training and validation sets, but use all available data for training. We will work with the well-known iris data set: four attributes measured on 150 individual plants of the genus Iris that belong to three different species. The goal of our neural network is to classify the plants using the attribute values as input (hence the four input nodes of our neural network). Using randomly shuffled indices, we randomize the sequence in which the data are presented to the neural network - this affects the learning process (for a general statement about the network's ability to learn to classify these data, we would, strictly speaking, need to run multiple training instances with different randomizations). Then we standardize the data along each attribute, so that every attribute has a mean of zero and a standard deviation of one. This reduces the disproportionate influence of numerically large attributes, which often improves the training process.

inds = sample(seq(1,nrow(iris),1), nrow(iris), replace = F)
iris = iris[inds,]

my_iris = iris[,1:4]

for(i in 1:4){
  my_iris[,i] = (my_iris[,i] - mean(my_iris[,i])) / sd(my_iris[,i])
}
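
As a quick, purely optional cross-check, the standardization loop is equivalent to base R's scale() function:

my_iris_alt = as.data.frame(scale(iris[,1:4]))     # same standardization in one call
max(abs(as.matrix(my_iris) - scale(iris[,1:4])))   # effectively zero (floating-point noise)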

In order to train the model, we need to evaluate its prediction against some ground truth. The output of the neural network is a vector of three values that sum to one; the index of the largest value is the predicted class index. To generate a ground truth, we create a matrix with three columns and 150 rows; each row is the ground truth for one individual plant and contains two zeros and a single "1". The column index of the "1" is the true class index. Accordingly, we assign the "1" to the first column for all individuals of the first species, to the second column for all individuals of the second species, and so on.

y_mat = matrix(nrow = nrow(iris), ncol = length(unique(iris$Species)))
y_mat[,] = 0
y_mat[iris$Species == unique(iris$Species)[1],1] = 1
y_mat[iris$Species == unique(iris$Species)[2],2] = 1
y_mat[iris$Species == unique(iris$Species)[3],3] = 1
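
A small sanity check (not required for training) confirms that every row contains exactly one "1" and that the column index of the "1" corresponds to the species; max.col() returns the column index of the largest value in each row.

stopifnot(all(rowSums(y_mat) == 1))   # exactly one "1" per row
table(max.col(y_mat), iris$Species)   # class index versus species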

Now we start training the neural network with a first sample. We select the input vector and the ground truth of the first row in the data and assign them to scalar objects whose names match those in the neural-network expression designed above. Now that parameter values, inputs and ground truth are set up, we can simply evaluate the neural-network expression with eval(dv). This immediately executes the expression contained in the deriv() object dv. The loss and the gradient for each parameter (weights and biases) are returned.

in1a = my_iris[1,1]
in1b = my_iris[1,2]
in1c = my_iris[1,3]
in1d = my_iris[1,4]

y_1 = y_mat[1,1]
y_2 = y_mat[1,2]
y_3 = y_mat[1,3]

grds = eval(dv)

loss = grds[1]              # the scalar loss (the value of the evaluated expression)
grds = attributes(grds)     # the "gradient" attribute holds the 27 partial derivatives
grds = unlist(grds)         # flatten into a plain numeric vector, ordered as in the deriv() call

We can now update the parameters of the neural network in order to make a directed adaptation to the data, based on the loss incurred with the present weight and bias values. We need to specify a step size that determines the magnitude of the parameter changes; this is commonly referred to as the learning rate. Here we set it to 0.1, which is fairly large. On the other hand, we have a relatively easy classification task, so a large learning rate will not do much harm and lets us quickly observe the effects of parameter adaptation. Each parameter is updated by subtracting the product of learning rate and gradient from its previous value. This amounts to "moving" in the opposite direction of the gradient, at a rate determined by the learning rate.

lr = 1e-1

x1a1 = x1a1 - lr * grds[1]
x1b1 = x1b1 - lr * grds[2]
x1c1 = x1c1 - lr * grds[3]
x1d1 = x1d1 - lr * grds[4]
x1e1 = x1e1 - lr * grds[5]
x1a2 = x1a2 - lr * grds[6]
x1b2 = x1b2 - lr * grds[7]
x1c2 = x1c2 - lr * grds[8]
x1d2 = x1d2 - lr * grds[9]
x1e2 = x1e2 - lr * grds[10]
x1a3 = x1a3 - lr * grds[11]
x1b3 = x1b3 - lr * grds[12]
x1c3 = x1c3 - lr * grds[13]
x1d3 = x1d3 - lr * grds[14]
x1e3 = x1e3 - lr * grds[15]

x2a1 = x2a1 - lr * grds[16]
x2b1 = x2b1 - lr * grds[17]
x2c1 = x2c1 - lr * grds[18]
x2d1 = x2d1 - lr * grds[19]
x2a2 = x2a2 - lr * grds[20]
x2b2 = x2b2 - lr * grds[21]
x2c2 = x2c2 - lr * grds[22]
x2d2 = x2d2 - lr * grds[23]
x2a3 = x2a3 - lr * grds[24]
x2b3 = x2b3 - lr * grds[25]
x2c3 = x2c3 - lr * grds[26]
x2d3 = x2d3 - lr * grds[27]
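
As an aside, the 27 explicit update lines could also be written more compactly by looping over the parameter names with get() and assign(). This is only a sketch of an alternative (do not run it in addition to the explicit updates above, or the parameters would be updated twice); the order of par_names matches the order of the parameter names passed to deriv(), and hence the order of the gradients in grds.

par_names = c(paste0('x1', letters[1:5], rep(1:3, each = 5)),   # x1a1 ... x1e3
              paste0('x2', letters[1:4], rep(1:3, each = 4)))   # x2a1 ... x2d3

for(k in seq_along(par_names)){
  assign(par_names[k], get(par_names[k]) - lr * grds[k])   # same update rule as above
}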

Now we repeat these procedures - calculating loss and gradients by evaluating the neural-network expression, and updating the parameters according to these gradients - iteratively for every individual data point. This means that the parameters get updated 150 times. We nest this loop within an outer loop, i.e. we repeat the 150 iterations several times, in this exemplary case five times. We record the losses in order to plot the loss trajectory over the number of iterations.

loss_list = loss

for(u in 1:5){
  for(j in 1:nrow(iris)){
    in1a = my_iris[j,1]
    in1b = my_iris[j,2]
    in1c = my_iris[j,3]
    in1d = my_iris[j,4]
    
    y_1 = y_mat[j,1]
    y_2 = y_mat[j,2]
    y_3 = y_mat[j,3]
    
    grds = eval(dv)
    loss = grds[1]
    grds = attributes(grds)   # the "gradient" attribute holds the partial derivatives
    grds = unlist(grds)
    
    loss_list = c(loss_list, loss)
    cat(paste0(loss, '\n'))
    
    x1a1 = x1a1 - lr * grds[1]
    x1b1 = x1b1 - lr * grds[2]
    x1c1 = x1c1 - lr * grds[3]
    x1d1 = x1d1 - lr * grds[4]
    x1e1 = x1e1 - lr * grds[5]
    x1a2 = x1a2 - lr * grds[6]
    x1b2 = x1b2 - lr * grds[7]
    x1c2 = x1c2 - lr * grds[8]
    x1d2 = x1d2 - lr * grds[9]
    x1e2 = x1e2 - lr * grds[10]
    x1a3 = x1a3 - lr * grds[11]
    x1b3 = x1b3 - lr * grds[12]
    x1c3 = x1c3 - lr * grds[13]
    x1d3 = x1d3 - lr * grds[14]
    x1e3 = x1e3 - lr * grds[15]
    
    x2a1 = x2a1 - lr * grds[16]
    x2b1 = x2b1 - lr * grds[17]
    x2c1 = x2c1 - lr * grds[18]
    x2d1 = x2d1 - lr * grds[19]
    x2a2 = x2a2 - lr * grds[20]
    x2b2 = x2b2 - lr * grds[21]
    x2c2 = x2c2 - lr * grds[22]
    x2d2 = x2d2 - lr * grds[23]
    x2a3 = x2a3 - lr * grds[24]
    x2b3 = x2b3 - lr * grds[25]
    x2c3 = x2c3 - lr * grds[26]
    x2d3 = x2d3 - lr * grds[27]
  }
}

Plotting the loss trajectory, we find that the loss generally decreases as the number of iterations increases. The trajectory is not smooth, since we see the prediction loss for every individual data point, and some data points are simply more difficult to classify than others. Adapting the parameter values after each individual prediction is not necessarily the best or most efficient scheme for training a neural network (or any model trained via gradient descent), but our makeshift formulation of the neural network and of its loss does not easily allow gradient calculation for accumulated losses (it would require writing a large amount of additional repetitive code). Still, we can be quite happy to observe that the loss does decrease markedly.

plot(loss_list, type = 'l')
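
Since the raw, per-observation losses are quite noisy, it can additionally help to plot a running mean of the loss; the window width of 25 iterations below is an arbitrary choice.

plot(stats::filter(loss_list, rep(1/25, 25)), type = 'l')   # centered moving average of the loss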

Finally, we can use the trained parameters to generate and visualize predictions (this was not possible with the previous formulation of the neural network, since it directly returned the loss, not the prediction itself). To this end, we extract the parts of the neural network that return the outputs of the three nodes of the output layer. We embed them in a loop that iterates over all 150 data points and writes the values returned by the three output nodes into a matrix with three columns (corresponding to the three output nodes, and thus to the three classes) and 150 rows (corresponding to the individual data points).

preds_mat = matrix(nrow = nrow(iris), ncol = 3)

for(j in 1:nrow(iris)){
  in1a = my_iris[j,1]
  in1b = my_iris[j,2]
  in1c = my_iris[j,3]
  in1d = my_iris[j,4]
  
  a1 = exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
               x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
               x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
               x2d1 * 1)) /
    (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
            x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
            x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
            x2d1 * 1)) + 
       exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
              x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
              x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
              x2d2 * 1)) +
       exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
              x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
              x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
              x2d3 * 1)))
  
  b1 = exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
               x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
               x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
               x2d2 * 1)) /
    (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
            x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
            x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
            x2d1 * 1)) + 
       exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
              x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
              x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
              x2d2 * 1)) +
       exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
              x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
              x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
              x2d3 * 1)))
  
  c1 = exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
               x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
               x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
               x2d3 * 1)) /
    (exp((x2a1 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
            x2b1 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
            x2c1 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
            x2d1 * 1)) + 
       exp((x2a2 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
              x2b2 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
              x2c2 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
              x2d2 * 1)) +
       exp((x2a3 * 1/(1+exp(-(x1a1 * in1a + x1b1 * in1b + x1c1 * in1c + x1d1 * in1d + x1e1 * 1))) +
              x2b3 * 1/(1+exp(-(x1a2 * in1a + x1b2 * in1b + x1c2 * in1c + x1d2 * in1d + x1e2 * 1))) +
              x2c3 * 1/(1+exp(-(x1a3 * in1a + x1b3 * in1b + x1c3 * in1c + x1d3 * in1d + x1e3 * 1))) +
              x2d3 * 1)))
  
  preds_mat[j,] = c(a1, b1, c1)
}

This predictions matrix has the same structure as the ground-truth matrix generated above (before model training). This means that we can now directly compare the predicted class index (the position of the largest output value among the three output nodes in each row of the predictions matrix) with the true class index (the position of the "1" in each row of the ground-truth matrix). We will do so by calculating the squared difference between each cell of the ground-truth matrix and the corresponding cell of the predictions matrix.
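
In addition to the squared-difference comparison, we can also compute the plain classification accuracy directly from the two matrices; max.col() again returns the row-wise index of the largest value.

pred_class = max.col(preds_mat)    # predicted class index per plant
true_class = max.col(y_mat)        # true class index per plant
mean(pred_class == true_class)     # fraction of correctly classified plants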

Instead of going for some complicated visualization, we will simply look at the squared-differences matrix from a bird's-eye perspective using the image() function. This converts the matrix into a tile plot, where each tile corresponds to one matrix cell, and the coloration of a tile corresponds to the magnitude of the value it contains. In the following plots, red tiles contain a "1", yellow tiles contain a zero, and tiles of intermediate colors contain values between zero and one.

image((y_mat - preds_mat)**2)

We can easily see that the majority of plant samples were correctly classified by our neural network, since most tiles are a pale yellow, indicating only a minor difference between the ground truth (zero or one) and the output values of the neural network (values between zero and one). Compare this to the following plot, which shows the differences between the ground truth and the predictions of an untrained neural network:

It is easy to see that the training has strongly improved the predictive capability of the neural network.

So, we have successfully completed the task of building and training a neural network from scratch. While it requires a lot of typing, and while the resulting code can look confusing at first, we can trace every operation that makes the neural network and its training work in intricate detail. Only the application of the chain rule for calculating the gradients remains a bit of a black box, since the deriv() function does this job for us. On the other hand, the derivation rules would have to be specified individually for each calculation step in the neural network, so there is not much point in writing them out manually for demonstration purposes.