Implementing a Deep Neural Network in Keras - Step by Step

Deep Neural Networks are coming to dominate the field of Machine Learning, seeing increased use in classification, regression and optimization tasks. Their implementation might appear mysterious to some, yet implementing them with the Keras API is actually fairly straightforward.

So what is Keras? Keras is a high-level Deep-Learning API written in Python. Created by noted Machine-Learning researcher Francois Chollet, it provides easy access to Tensorflow functions, which perform the actual mathematical operations that a running Deep Neural Net (DNN) consists of. For example, the multiplication of inputs with weights, and the complex calculations involved in computing the partial derivatives for backpropagation (a method for calculating the loss-dependent gradient for every single neuron in a DNN), require many lines of code in Tensorflow, but can be written in a single line in Keras. This makes Keras accessible even to people with limited knowledge of the subject matter. Here, we are going to take a look at how one can implement image classification using a deep Convolutional Neural Network (CNN) written in Keras.

The first thing to do when writing a CNN application is of course to import the required packages. I am detailing here the implementation using Tensorflow 1. As of this writing, Tensorflow 2 is already available, though the older version is still supported; the exact names and functions may therefore be subject to change over time. We require the numpy, os and pandas packages, which will be used primarily to investigate the folders containing our images (os) and to analyze and save the training classification results. The pyplot module of the matplotlib package is also imported to do some visualizations. Furthermore, we import the confusion_matrix function from the scikit-learn package for some very specific analytic visualization that will be described later. Of course, we also need to import some functions from the Keras package in order to set up our CNN. Finally, we require the ImageDataGenerator function, which will allow us to import images batch-wise from disk to memory (by the way, you can see in the import commands that Keras now comes as part of the Tensorflow package, a tribute to the dominance of that Deep-Learning framework; originally, Keras also supported competing backends such as Theano and CNTK).

import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

from tensorflow.keras import layers, optimizers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

Next, we need to specify the directories containing the images that we want to feed to our model for training and validation. We can define a base directory that contains three sub-directories: a training folder containing all the training images, a validation folder and a test folder. The images must be stored in named folders in each of these directories, where each folder corresponds to one class (a minimal sketch of this layout follows the code below). The Keras generator functions will automatically scan the training directory and recognize the classes present. Note that no directories or files other than the class folders should be present in these directories, and that only image files should be contained in the class folders. The training directory should contain the majority of the manually-classified images available (e.g. 80 %), since we want to train the classifier with the highest possible image diversity so that it can generalize as well as possible in the end. The validation images (e.g. about 10 % of all images) are used to assess model generalizability - since they are not used for training the model, they provide a means of testing the field performance of the classifier. The performance of the model on these images as compared to its performance on the training images will indicate whether it has over-adapted to the latter and thus lost generalizability. The test images (also about 10 % of all images) are used to assess larger structural uncertainties in the classifier design. Since they are fully independent of the validation dataset, they can be used to test whether the classifier design, or the training specifications, have been adapted too strongly to the training and validation datasets.

base_dir = '/path/to/images/'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')
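
To make the expected layout concrete, here is a minimal sketch of the directory tree the generators assume; the class-folder names are purely hypothetical placeholders for your own classes:

# expected directory layout (class-folder names are hypothetical examples)
#
# /path/to/images/
#     train/
#         class_a/        img_001.jpg, img_002.jpg, ...
#         class_b/        ...
#     validation/
#         class_a/        ...
#         class_b/        ...
#     test/
#         class_a/        ...
#         class_b/        ...

print(os.listdir(train_dir))  # quick check: should list only the class folders, e.g. ['class_a', 'class_b']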

Next, we need to set some parameter values that will be required by later functions and primarily relate to the structure of the input data and the training procedure. We set the number of images to be included in one data batch to be passed to the model during training or prediction as train_batch, val_batch and test_batch for training, validation and test images, respectively. The practice of supplying batches of images to the classifier is a compromise between the impossibility of loading all images to memory at once (which does not apply to very small images like the MNIST hand-written digits) and the desire to supply all images at once for a smoother training procedure. Indeed, the choice of batch size does have an influence on the training performance, as the loss surface for classification changes with changing training data. Supplying only one image at a time to the model leads to a constantly changing loss surface, which would likely cause the gradient-based training process to go way off course such that convergence would be far out of reach. When supplying all images at once, the loss surface stays constant, easing the training process. When supplying batches of images, the loss surface changes to some extent, making the training challenging. Given that CNNs are highly-parameterized models, making the training more challenging can potentially provide stronger generalizing capabilities to the model. When supplying all images at once, the model might in the worst case converge on the global loss minimum for the training images, which is by no means necessarily the global loss minimum for all conceivable images. Convergence on an approximate loss minimum for the training images might result in a model that is not too over-fitted to the training images and therefore more robust to field application.

train_batch = 20
val_batch = 20
test_batch = 1

We also supply the edge length of the images supplied to the model. When importing images via an image-generator function, they are forced into square format regardless of their original shape. With the "in_size" object we determine the resolution that the images are resized to. Here, you should consider that it takes more time to process images of high resolution, but that a lower resolution can lead to a loss of detail important for classification (on the other hand, it can also become difficult for a model to learn relevant features when it is supplied with images of overly high resolution relative to the simplicity of the classification task). We also supply the number of classes via the "out_size" object, which will be used to construct the final layer of the CNN, the layer that provides the classification prediction. It is determined by simply counting the number of class folders in the training-set directory.

in_size = 64
out_size = len(os.listdir(train_dir))

Next, we provide the number of training, validation and test images by summing the number of images in the class folders over all classes. This is necessary to calculate the number of training, validation and test steps (train_step etc.), which are simply these numbers of images divided by the batch size and rounded up to the next whole number (the generator-based training and prediction functions expect an integer number of steps). These values are required when calling fit_generator or predict_generator for model training and prediction, respectively. They ensure that all available training and validation images are used in one training epoch. Finally, we provide the "verbose" object, which will be supplied as an argument to the training and prediction functions to indicate whether the training or prediction progress should be printed (1) or not (0).

# count the images in each class folder and sum over all classes
n_train_imgs = np.sum([len(os.listdir(os.path.join(train_dir, d))) for d in os.listdir(train_dir)])
n_val_imgs = np.sum([len(os.listdir(os.path.join(validation_dir, d))) for d in os.listdir(validation_dir)])
n_test_imgs = np.sum([len(os.listdir(os.path.join(test_dir, d))) for d in os.listdir(test_dir)])

# round up so that every image is seen once per epoch; the fit/predict functions expect integer step counts
train_step = int(np.ceil(n_train_imgs / train_batch))
val_step = int(np.ceil(n_val_imgs / val_batch))
test_step = int(np.ceil(n_test_imgs / test_batch))

verbose = 1

Next, we define a custom image-data generator by supplying arguments to the ImageDataGenerator function provided by Keras. This function loads a batch of images (usually .png or .jpeg files) from disk into memory and processes these images according to a given set of rules. Setting up an image-data generator requires two steps: First, we call the function ImageDataGenerator and provide some arguments that will make some modifications to the loaded images. We provide the rescale argument, which will normalize all pixel values so that they lie in the range 0 to 1. In the original images, each pixel channel can have a maximum value of 255, the largest value that can be stored in the eight bits typically used per channel. Model fitting, however, works best if the values of all variables in the data are in a similar, and relatively small, range. Therefore, values are typically standardized to lie in the range -1 to 1, or normalized to lie in the range 0 to 1. The latter is done here by dividing the values of all pixels by the maximum possible pixel value, i.e. 255. The other arguments perform augmentation operations on the loaded images. These include e.g. image rotation, shifting, zooming and shearing, and are implemented in order to avoid over-fitting when training the CNN. Successful CNN training requires a large amount of data in order to create a well-generalizing model. Manually annotated images are, however, rare - even hundreds of images per class can be insufficient to create a good classifier. Therefore, image augmentation "teases" the CNN by supplying slightly altered versions of the input data in each training epoch - instead of supplying a larger number of original data, we supply artificially-created "new" data to avoid over-fitting. It should be added here that image augmentation can likely not completely replace the benefits of providing more original data for training, since specific traits of the original image will always persist even in the augmented copies, be it certain lighting or color schemes, or any other image property left unchanged. Note that we set up two image-data generators, train_datagen and val_datagen, and that we do not supply the augmentation arguments to the latter. That is because during model validation, we want to see how the model performs under "field conditions", i.e. when confronted with new real images, not altered copies (image augmentation returns changed images, but not necessarily realistic images!).

train_datagen = ImageDataGenerator(
      rescale=1./255,
      rotation_range=40,
      width_shift_range=0.2,
      height_shift_range=0.2,
      shear_range=0.2,
      zoom_range=0.2,
      horizontal_flip=True,
      fill_mode='nearest')
      
val_datagen = ImageDataGenerator(rescale=1./255)

In the second step, we call the flow_from_directory method of the image-data generators. This method requires some arguments that relate to the process of loading the images from disk into memory. We provide the directory containing the images (which we set up above as train_dir and validation_dir), as well as the batch size, so the function knows how many images should be loaded from disk at once. We also provide the target size that the images should assume when loaded into the Python session. Here, we provide only the edge lengths (we use the same value twice, which generates a square image, the standard input shape for a CNN). The default settings of the flow_from_directory function load a three-channel image (a color image); therefore, we do not provide the value of the third dimension of the data structure (3) in the target_size argument. If we explicitly wanted to load one-channel grey-scale images, we would have to provide an additional argument named color_mode. Note that we also provide the class_mode argument, which is set to categorical. This argument ensures that the data generator will also yield a class label for each image in the form of a one-hot-encoded vector (where the "hot" element has the value 1; its position in the vector is the class index). The number of classes is inferred from the listing of class directories in the training directory, which explains why the training images must be contained in class folders on disk. A different typical setting for class_mode is binary, which is used in cases where only two classes exist. In that case, the labels are scalars, which assume either the value 0 or 1, depending on the class. Note also that in classification tasks that do not concern image data, we would have to generate class labels ourselves by setting up an array of one-hot-encoded vectors.

train_generator = train_datagen.flow_from_directory(
        train_dir, # this is the target directory
        target_size=(in_size, in_size),
        batch_size=train_batch,
        class_mode='categorical')

validation_generator = val_datagen.flow_from_directory(
        validation_dir,
        target_size=(in_size, in_size),
        batch_size=val_batch,
        class_mode='categorical')
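
As a quick, purely optional sanity check, we can draw one batch from the training generator and inspect its shape; the image array should have the shape (train_batch, in_size, in_size, 3), and the labels should be one-hot-encoded vectors of length out_size:

# optional sanity check (not part of the training procedure): inspect one batch
x_batch, y_batch = next(train_generator)
print(x_batch.shape)  # (20, 64, 64, 3): a batch of normalized RGB images
print(y_batch.shape)  # (20, number of classes): one one-hot-encoded label per image
print(y_batch[0])     # e.g. [0. 0. 1. 0.]: the position of the 1 is the class index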

Now, finally, we construct the actual CNN. Here, we define it as a function, so that we retain an overview of the components that make up the classifier model. It consists of several layer objects that reference each other: layer l2 references the input layer x, layer l3 references l2, and so on. Here, the connections are implemented in a strictly linear manner: one layer is connected to one directly preceding layer and one directly subsequent layer. In practice, it is also possible to design more intricate constructions of branching layers and residual connections between layers that do not directly follow each other. Such models often have more advanced learning capabilities than the standard architecture presented here. The only rule in connecting layers is that there can be no cyclic connections; it is impossible for one layer to receive input from a subsequent layer. The stacking of layers is what makes the "magic" of CNNs: by connecting relatively simple functional objects in an intelligent manner, a powerful, very complex function is constructed that can solve complicated non-linear classification or regression tasks. While the architecture of the CNN is specified by the model designer, the process of training up-weights and down-weights connections in the net, effectively distilling the function best suited for the task at hand from the "function framework" provided (note that this often does not go as smoothly as described here - the model may well distill the wrong function, especially when provided with an unsuitable "framework" or when training data are lacking or are unsuited for the task). Indeed, most of the layers in our CNN implement a series of simple linear equations that are combined through summation and transformed using a non-linear activation.

def CNN():
    x = layers.Input(shape=[in_size, in_size, 3])
    l2 = layers.Conv2D(12, (2, 2), padding = 'valid', activation = 'relu')(x)
    l3 = layers.Conv2D(12, (2, 2), padding = 'valid', activation = 'relu')(l2)
    l4 = layers.MaxPool2D()(l3)
    l5 = layers.Conv2D(25, (2, 2), padding = 'valid', activation = 'relu')(l4)
    l6 = layers.Conv2D(25, (2, 2), padding = 'valid', activation = 'relu')(l5)
    l7 = layers.MaxPool2D()(l6)
    l8 = layers.Conv2D(50, (2, 2), padding = 'valid', activation = 'relu')(l7)
    l9 = layers.Conv2D(50, (2, 2), padding = 'valid', activation = 'relu')(l8)
    l10 = layers.MaxPool2D()(l9)
    l11 = layers.Conv2D(100, (2, 2), padding = 'valid', activation = 'relu')(l10)
    l12 = layers.Conv2D(100, (2, 2), padding = 'valid', activation = 'relu')(l11)
    l13 = layers.MaxPool2D()(l12)
    l14 = layers.Flatten()(l13)
    l15 = layers.Dense(750, activation = 'relu')(l14)
    l16 = layers.Dense(250, activation = 'relu')(l15)
    y = layers.Dense(out_size, activation = 'softmax')(l16)
    
    return models.Model(inputs=[x], outputs=[y])

Let's have a look at a single layer, the layer l3. This is a convolutional layer, which means that it emulates a component of the process of human vision by sliding parameterized filters of limited spatial extent over its input. The design of the layer ensures that spatial information contained in the input is maintained, and that recurring features in the input are recognized as such. The fact that we are dealing with a convolutional layer is easily visible from the layer type: layers.Conv2D. In the programming context, this means that we make use of the Conv2D function of the layers class provided by Keras. We provide a set of arguments to this function to make the layer more specific:

The first argument denotes the number of filters contained in this layer, i.e. the number of filter matrices (or filter kernels) that are slid over the layer input, or, put differently, the number of features that the layer scans for in the input. In this case, we use twelve filter matrices.

The next argument, the tuple (2,2) denotes the shape of each of the filter matrices: They are square, measuring two by two neurons in size. Given the image input size of 64 pixels edge length, this means that the filter matrices are scanning for relatively small features. This is a typical aspect of convolutional layers close to the input layer, which learn to detect fairly basic, general shapes. You may note that the later convolutional layers also apply filter matrices of the same extent. Since the input has already been compressed once these layers come into action, the relative size of the filter matrices is higher, enabling the learning of more complex, less-general features.

The third argument, padding, denotes whether the convolution operation should reduce the spatial extent of the hidden representation or not. "Valid" applies the convolution operation only to the dimensions that actually exist in the hidden representation (or in the original image), and since the operation is the calculation of a dot product from the values of several dimensions (2 x 2 in our case), the spatial extent of the output representation will be smaller than that of the input representation. Setting this argument to "same" means that some extra dimensions are appended at the edges of the hidden representation (these typically take values of zero). This artificial increase in dimensionality means that the output representation will be equal in spatial extent to the unaltered input. The choice between these two modes depends on the data at hand; it is advisable to experiment with both modes to find the best-fitting model architecture. Usage of the "valid" mode means that the dimensionality of the data representation is reduced faster; using only the "same" mode means that only the pooling layers contribute to reducing the spatial extent of the representations. The short sketch below illustrates the difference in output shape.
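
This is a minimal, stand-alone illustration on a dummy input (not part of the model defined above); the exact printed format depends on the Tensorflow version:

# illustration of the two padding modes on a dummy 64 x 64 RGB input
dummy = layers.Input(shape=[64, 64, 3])
print(layers.Conv2D(12, (2, 2), padding='valid')(dummy).shape)  # (None, 63, 63, 12): spatial extent shrinks
print(layers.Conv2D(12, (2, 2), padding='same')(dummy).shape)   # (None, 64, 64, 12): spatial extent preserved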

The final argument, activation, describes the activation function to be applied to the dot products generated by the layer. The activation function introduces non-linearity to the CNN; for example, the ReLU function used here sets all negative values to zero. The usage of non-linear functions is what gives CNNs their power: the connection of several such non-linearities produces a "super-function" that can separate data that are not linearly separable. The choice of activation function has a major effect on the ease with which the loss gradient generated during the training process can propagate back through all layers: using a CNN with a large number of layers and a poorly-chosen activation function can lead to the gradient becoming very small at layers far away from the output layer; in effect, these layers become more or less untrainable. The ReLU activation is the standard choice for hidden layers in modern CNNs. Finally, note how layer l3 is connected to layer l2 by calling that layer in an extra bracket after the function bracket used to supply the arguments for the layer design.
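
As an aside, the ReLU function itself is trivially simple; the following one-liner (plain numpy, illustration only) shows its effect on a few values:

# ReLU clips negative values to zero and leaves positive values unchanged
v = np.array([-2.0, -0.5, 0.0, 1.5])
print(np.maximum(v, 0))  # [0.  0.  0.  1.5]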

The other layers in the CNN function are constructed similarly to the convolutional layers:

The max-pooling (MaxPool2D) and flattening layers do not receive arguments, since these are not parameterized layers. In the case of the max-pooling layers, we can optionally specify the width of the pooling "window", i.e. the number of input dimensions that contribute to each pooling calculation, similar to how we specify the width of the convolutional filters. In this example, we keep the default width of two neurons, resulting in a pooling window of size 2 x 2.

The fully-connected, or Dense layers close to the CNN output only receive the number of neurons and the activation function as input arguments. Here, we should specify the number of neurons, i.e. the dimensionality of the hidden data representations resulting from applying the layer, such that the transition from the dimensionality of the flattening output to the final CNN output (the class vector) is relatively smooth. When the change in dimensionality between the output of the flattening layer and that of the first subsequent Dense layer is too abrupt, training may not be possible, even though the CNN was correctly set up. It can be helpful to construct the CNN in multiple steps, each time checking the dimensionality of the hidden representations (as well as the total number of weights, which, when too high, may also make training impossible if the computational resources required are not available) by calling the model summary (see below) and making changes where necessary.

The output layer, which is a Dense layer as well, should use the softmax activation function when training a multi-class classifier, which ensures that the values in the output vector sum to one, i.e. represent probabilities that can be evaluated against the one-hot-encoded true class labels to calculate the loss. Of course, its number of neurons should equal the number of classes, so we pass the out_size object set up earlier.
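
A quick numerical illustration of the softmax (plain numpy, illustrative values only): the raw layer outputs are exponentiated and normalized, so the result sums to one and can be read as class probabilities.

# softmax illustration: raw scores -> probabilities that sum to one
scores = np.array([2.0, 1.0, 0.1])
probs = np.exp(scores) / np.sum(np.exp(scores))
print(probs, probs.sum())  # approximately [0.659 0.242 0.099], summing to 1.0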

The model-input layer, called Input, requires us to specify the dimensionality of the data structure at hand. In our case, we need to provide three values, the width, height and the number of color channels of the image. These must equal the dimensionality specified when we set up the image-data generators; otherwise, the input provided by the generator and the input expected by the CNN don't match, and an error message will be displayed. The input layer is not parameterized; it simply serves to initialize the input data to be processed by the actual CNN.

While we have just written a function that will generate a CNN from a set of layers, we now have to call that function to actually create a CNN that can be trained. As you can see from the return command in the function, the model is initialized by calling the Model function of the models class of Keras. Two arguments are passed to this function: a list of inputs, which refers to the input layers of the CNN, and a list of outputs, which refers to the output layers of the CNN. In our case, we only have one input and one output layer (termed x and y, respectively), but Keras offers the possibility to have multiple inputs and outputs, which can be useful when setting up CNNs for more complex tasks.

Also note that the fact that we have defined the model layers in a function environment means that every time the function is called to construct a CNN, these layers are constructed from scratch. Had they been defined outside of the function context, i.e. in the global environment, the layers themselves, and not just the model, would change during training. Constructing a new model from these layers would then mean constructing a model with pre-trained layers. This can be useful from a practical point of view, since it is often easier to train a new model when its weights are already adapted to a similar task (i.e., transfer learning is invoked). However, from an analytical, comparative point of view, the fact that all model components change during training, even when they exist as seemingly solitary objects, can easily create a big mess if care is not taken.

model = CNN()
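
As mentioned above, this is a good point to inspect the architecture, the output shape of every hidden representation and the total number of trainable weights; Keras provides the summary method for exactly this purpose:

model.summary()  # prints every layer, its output shape and its number of weights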

Before we can finally start training the model, we have to compile it. This means that we have to make some specifications in terms of how the model will be trained, or, more specifically, how its performance will be assessed and how the outcome of this "assessment" will be used to guide the adaptation of the model weights.

The first argument we provide to the compile function, which is an attribute of our Model-class object, is the loss function to be used. This loss function will compute the deviation of the CNN prediction from the ground truth, i.e. observed values, human-defined class labels and so on. This loss will be passed back to the CNN and propagated through its layers to update their weights according to a gradient-descent procedure. It is therefore important to be clear about what the loss for a given task actually is, and how to formulate it mathematically. In regression, one typically uses the mean squared error, i.e. the squared difference between observation and model prediction, averaged over all data points. In classification, however, we typically use a form of the cross-entropy function; since we are dealing with more than two classes, we use the categorical crossentropy. This loss function measures the divergence between a true and a predicted distribution. In our case, the true distribution is the one-hot-encoded class vector, which represents the label assigned to a particular image, and the predicted distribution is the output of the final layer of the CNN. To avoid comparing the "sharp-edged" one-hot-encoded class vector with the more "fuzzy" predicted class vector, which could result in an overly harsh penalization of the CNN, one could arguably apply an argmax function to the output layer, effectively constructing a predicted "sharp-edged" one-hot-encoded class vector. However, with the standard Keras tools, this is not so easily implemented; thus we treat this consideration as purely theoretical for now.
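
To make the loss tangible, here is a minimal numpy sketch (illustrative values only) of the categorical cross-entropy for a single image: it reduces to the negative log of the probability the model assigned to the true class.

# categorical cross-entropy for one image (illustrative values only)
y_true = np.array([0.0, 1.0, 0.0])        # one-hot label: the second class is the true class
y_pred = np.array([0.2, 0.7, 0.1])        # predicted class probabilities (softmax output)
print(-np.sum(y_true * np.log(y_pred)))   # -log(0.7), approximately 0.357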

The second argument we supply to the compile function is the optimizer. The optimizer is an algorithm that applies the loss calculated by the loss function to update the parameters, or weights, of the CNN. To do this, it calculates the loss gradient with respect to a particular weight in the CNN; since the CNN is constructed as a quasi-hierarchical computational graph, the so-called backpropagation algorithm is used to calculate the gradient value for weights at specific positions in the CNN. Thus, there is one particular gradient value for every weight in the CNN, and this value depends on the position of the weight in the CNN. The loss gradient is applied to update the CNN weights; effectively, each weight is updated in the counter-direction of its particular gradient value. The magnitude of the change depends on the magnitude of the gradient and on the learning rate, a scalar value that scales the gradient value up or down.

The learning rate is supplied as an argument to the optimizer function that we call within the compile function. It typically assumes values smaller than one; a good initial choice is 1e-3. The learning rate is probably the most critical hyperparameter in setting up a CNN! Before any other hyperparameters are changed to find out whether they have an effect on training performance, the learning rate should be checked. Often, the range of working learning rates is very small, i.e. spanning an order of magnitude or less. We can also specify the decay argument, which gradually decreases the learning rate. This has the effect of ensuring a smoother convergence of the model training; when no decay is specified, it is possible that the loss minimum is never found, or that it takes a long time to find it, since the loss surface is traversed in steps whose magnitude is not adapted to cues signaling proximity to the loss minimum. 1e-3 or 1e-4 are good default values for the learning-rate decay. The choice of the optimizer algorithm is one further decision the CNN designer has to make. In Deep Learning, one uses a different suite of algorithms than in standard optimization procedures (names like Nelder-Mead or nlminb might come to mind if you have experience in that field). Today's most widely-used optimizer for training a CNN is the Adam optimizer, which features several routines that assess the training progress relative to the loss and learning rate, and adapts the learning rate in a process that is separate from the general learning-rate decay. Nevertheless, it might be worth trying out some of the older optimizers like RMSprop or SGD in specific cases.
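
The basic update rule and the decay schedule can be sketched in a few lines of plain Python. The decay formula below mirrors how the standard Keras optimizers apply the decay argument (the learning rate shrinks with the number of parameter updates); treat it as an approximation of what the library does internally, and the values as purely illustrative:

# sketch of gradient descent with learning-rate decay (illustrative only)
lr, decay = 1e-3, 1e-3
for iteration in range(3):
    lr_t = lr / (1.0 + decay * iteration)  # decayed learning rate for this update
    # each weight would be updated as: weight = weight - lr_t * gradient_of_loss_wrt_weight
    print(iteration, lr_t)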

We can finally supply a list of performance metrics to the metrics argument of the compile function. This is an optional argument which will result in a performance metric being calculated and displayed as training progresses. Note that this information is not used to update the CNN weights; only the loss is used for that purpose. The performance metrics only give us a chance to monitor the training with regard to the desired outcome, which is especially useful when the loss function is somewhat cryptic. In our case, we pass a list with a single component, the string accuracy, which will result in the classification accuracy being displayed during training. Note that only the accuracy for a batch of images will be displayed, and that accuracy alone is not the ideal metric for representing classification performance, especially when the training set is not homogeneous. Nevertheless, it provides a rough general performance measure that can help us to evaluate relative performance gains from changing hyperparameter values or the CNN architecture.

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(lr=1e-3, decay=1e-3),
              metrics=['accuracy'])

Finally, we commence model training by calling the fit_generator function of the Keras model object. We use fit_generator here instead of the simple fit function, since we temporarily load the input data batch-by-batch from disk into the Python environment rather than loading all training data at once. We need to pass several arguments to this function:

First, we need to name the image-data-generator object that we use to load the images from disk and to preprocess them. This object was defined earlier, and we named it "train_generator".

Further, we need to provide the number of steps that the model training will perform in one epoch. Recall that one epoch is one iteration over all available training images, and that we don't supply single images, but rather batches of several images at once. Thus, the argument steps_per_epoch takes the total number of training images divided by the batch size. We had already defined this value as "train_step", and supply it here.

Then, we specify the number of epochs with the epochs argument. This is a somewhat difficult choice in the beginning, since we have no good clue about how long the model will need to converge, and when model over-fitting may set in. Thus, we should start with a number that is more likely to be too large, monitor the training process, and then reduce the number of epochs such that training stops at or shortly after over-fitting commences. This is visible from the on-line print-out of the training process as the point where the validation loss stops decreasing (and eventually starts rising) while the training loss keeps falling, typically accompanied by the validation accuracy falling behind the training accuracy. Note that validation loss and accuracy are only calculated and printed at the end of an epoch, so we can only determine the onset of over-fitting at the epoch level. This approach of determining the optimum training duration does mean that training must be performed at least twice before a final trained model can be saved. Usually, it takes much longer, though, since you will need to, or want to, experiment with hyper-parameter settings for optimizing model performance. When using a pre-trained base of layers, a point not addressed in this post, you might also want to experiment with the scheme of "unfreezing", that is, setting to trainable, the layers of this base: in which epoch would you want to unfreeze how many layers? Would you re-freeze some layers in some epoch?

Finally, we need to provide values for two arguments relating to the validation data, i.e. those data held back to validate the generalizability of the trained model. For validation_data we call our "validation_generator", which, unlike the "train_generator", accesses the image folders containing the validation images, and does not perform any augmentation procedures on these (remember that the idea behind using validation data is to test the model on realistic data, while in training, we use data augmentation to mitigate over-fitting from a lack of data).

The validation_steps argument is very similar to the steps_per_epoch argument: here we supply the total number of validation images divided by the validation batch size; this value was defined earlier as "val_step". We assign the fitting procedure to an object called "history". This does not mean that, once training has finished, "history" is the trained model; instead, the object named "model" has been altered in place by the training process. "history" is a recording of the training and validation accuracies and losses calculated at the end of each epoch. It can thus be used to visualize and analyze the training trajectories, and to derive model adjustments related to architecture or hyper-parameter settings from them.

history = model.fit_generator(
      train_generator,
      steps_per_epoch=train_step,
      epochs=14,
      validation_data=validation_generator,
      validation_steps=val_step)
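
The recorded trajectories can be accessed through the history attribute of the returned object; a quick way to peek at them looks like the sketch below (the exact key names, e.g. 'acc' versus 'accuracy', depend on the Keras version; detailed visualization is left for the follow-up post):

# the History object stores one value per epoch for each metric
print(history.history.keys())       # e.g. dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])
print(history.history['val_loss'])  # validation loss per epoch - useful for spotting over-fitting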

Finally, after training has concluded, we save the model to disk. It is stored as a .h5 file, a file in Hierarchical Data Format. This file contains both the architecture and the trained weights of the CNN. In case we are dealing with a CNN consisting only of layer types pre-specified by Keras, it is possible to load the entire model, that is, architecture, weights and optimizer, by calling models.load_model(path). In case we have defined custom layers (not described in this post), we would have to first set up the CNN architecture and build a model with random initial weights, and then "fill" it with the weights stored on disk by calling model.load_weights(path), where model is the previously built "empty" model. In this case, the state of the optimizer is not restored, however. While it is formally better to continue training with the previous optimizer state, it is not straightforward to load that state when working with custom layers. However, in my experience, re-initializing the optimizer before continuing training is not too detrimental to training performance.

model.save("/path/to/model.h5")

# model = models.load_model("/path/to/model.h5")
# model.load_weights("/path/to/model.h5")

# model.compile(loss='categorical_crossentropy',
#               optimizer=optimizers.Adam(lr=1e-3, decay=1e-3),
#               metrics=['accuracy'])

# history = model.fit_generator(
#       train_generator,
#       steps_per_epoch=train_step,
#       epochs=2,
#       validation_data=validation_generator,
#       validation_steps=val_step)

After training has finished, we can use the model to make predictions, for example on our test dataset. For this purpose, we define a new test_generator for loading the images, which calls upon the val_datagen generator established earlier, thereby ensuring that images are not augmented. Our test_generator only differs from the validation_generator in the path, which now points to the folders containing the test images, and in the batch size, which was earlier defined to be one for test images. Also, we now set the shuffle argument to False. Shuffling during training (the default) was a further measure against "uneven" learning (note that with a learning rate subject to decay and further alterations through the optimizer, the sequence in which images are supplied does have an effect on training). Setting shuffle to False for predictions ensures that we can keep track of the order in which the images were supplied, which helps to allocate a given prediction (i.e., a quasi-one-hot-encoded class vector) to its input. This, of course, is of utmost importance for the applied use of the classifier model.

test_generator = val_datagen.flow_from_directory(
        test_dir,
        target_size=(in_size, in_size),
        batch_size=test_batch,
        class_mode='categorical',
        shuffle=False)

preds = model.predict_generator(test_generator, steps = test_step, verbose = verbose)
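
Because shuffling is disabled, the order of the predictions matches the order of test_generator.filenames, and the true class indices are available via test_generator.classes. Below is a minimal sketch of matching predictions to their inputs, and of the confusion matrix for which we imported the scikit-learn function earlier (rows correspond to true classes, columns to predicted classes):

# convert the predicted probability vectors to class indices and match them to their inputs
pred_classes = np.argmax(preds, axis=1)
results = pd.DataFrame({'file': test_generator.filenames,
                        'true': test_generator.classes,
                        'predicted': pred_classes})
print(results.head())
print(confusion_matrix(test_generator.classes, pred_classes))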

This concludes my description of the setup and training of a CNN. As you can see, this is a fairly complex yet rather straightforward procedure. It is likely that in the coming years, this process will be further streamlined and extended with diagnostics, until it might be operable just like the fitting process in generalized linear and non-linear models, and in general optimization. A further post will discuss the possibilities of visualizing the training history and the CNN predictions.