**Analyzing and Visualizing Classifier Predictions - Step by Step**

Designing and training a Deep Neural Network is one part in the process of developing a classifier application. However, it is also important to visualize its performance to judge its quality.

In my last post, I described how to design and train a Deep Neural Network, more specifically, a deep Convolutional Neural Network (CNN) for image classification. Let us now look at how to visualize the training history, and the predictions made by the model.

We assume that all required variables and folder paths have already been set up, and that a trained model exists in the global environment of our Python session.

Our first goal is to visualize the training history of our model, i.e. the trajectory of loss and accuracy over time, or, more correctly, over epochs of training. As you may recall, we recorded the training history in a "container" object named *history*. This object has multiple attributes; we, however, are only interested in its *history* attribute (that is, *history.history*). We turn this attribute into a separate object and transform it into a pandas data-frame, so we can easily subset named variables. Our *history* data-frame contains the variables *loss*, *val_loss*, *acc* and *val_acc*, i.e. the training and validation loss, and the training and validation accuracy. (Depending on the Keras version and on how the metric was named when compiling the model, the accuracy columns may instead be called *accuracy* and *val_accuracy*.)

To plot these variables, we call the *plot* function from the *matplotlib.pyplot* module (imported into the session under the abbreviation *plt*). As the first argument, we pass a numerical sequence with one element per epoch, which is the variable plotted on the abscissa, or x-axis. The number of epochs equals the number of rows in our data-frame (its "length", as returned by the *len* function). As the second argument (which will be plotted on the ordinate, or y-axis), we pass a variable of our choice, here the training loss, by subsetting the respective column of our *history* data-frame. We then call the *plot* function a second time, but now pass the validation-loss variable as the second argument. We also pass a third argument, the character *r*, which sets the color to red. Since we want to plot both the training and the validation loss in one figure, and since the default color in matplotlib is blue, we set off the validation loss by using the color red. Note that when executing the two calls to *plot*, both trajectories will be drawn into one plot. To close this plot and start a new one, we need to call *plt.show()* at the end.

This plot is fairly simple; further features, like axis labels, can be added by calling further *plt* functions (like *plt.xlabel*) before *plt.show()* is called. However, high-end visualization is much easier in the R programming language using e.g. the *ggplot2* package. Here, we just want to take a quick look at the training history for our own information, and not generate publishable plots. For the latter purpose, we could easily save the *history* data-frame to disk as a csv file, and generate more advanced plots in R. In the same manner as the loss, we also plot the trajectories of training and validation accuracy.

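    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Turn the recorded training history into a data-frame
    train_hist = pd.DataFrame(history.history)

    # Training loss (default color) and validation loss (red)
    plt.plot(np.arange(len(train_hist)), train_hist['loss'])
    plt.plot(np.arange(len(train_hist)), train_hist['val_loss'], 'r')
    plt.show()

    # Training accuracy (default color) and validation accuracy (red)
    plt.plot(np.arange(len(train_hist)), train_hist['acc'])
    plt.plot(np.arange(len(train_hist)), train_hist['val_acc'], 'r')
    plt.show()

    # Save the history to disk, e.g. for plotting in R
    train_hist.to_csv("/path/to/history.csv")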

Now, we want to plot the model predictions for our test dataset, in order to assess the model's performance in more detail than is possible from the trajectories of overall loss and accuracy. To do this, we first need to generate the predictions. Assuming all required functions and arguments are already set up and defined, we call the *predict_generator* function of our model, which takes the images received from a generator object (which loads batches of images from disk into memory) and feeds them to the CNN as input data to be processed. (In newer TensorFlow/Keras releases, *predict_generator* is deprecated in favor of *predict*, which accepts generators as well.) As the first argument, we pass our image-data generator (named "test_generator" in this case). Second, we state the number of *steps*, i.e. the number of batches that will be drawn from the generator. This depends on the batch size, i.e. the number of images that are loaded simultaneously into memory, and on the total number of images for which predictions are to be made. The number of steps is the latter number divided by the former (rounded up if the division leaves a remainder), and is in this case already defined as the object *n_test_imgs*, since we use a batch size of one. The final argument we pass is *verbose*, which indicates whether a progress bar should be printed while predictions are made. It is here set to the previously defined object "verbose", and generally takes the integer values 0 (silent) or 1 (progress bar).

    preds = model.predict_generator(test_generator, steps=n_test_imgs, verbose=verbose)
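For batch sizes larger than one, the number of steps has to be rounded up so that a final, partially filled batch is still processed. The following is a minimal sketch of that calculation; the helper name `prediction_steps` and the image counts are made up for illustration, and it assumes that the generator yields a smaller last batch rather than dropping it.

```python
import math


def prediction_steps(n_images: int, batch_size: int) -> int:
    """Number of batches needed so that every image is predicted once."""
    if n_images < 0 or batch_size < 1:
        raise ValueError("n_images must be >= 0 and batch_size must be >= 1")
    # Round up so that a final, partially filled batch is not skipped.
    return math.ceil(n_images / batch_size)


# Batch size of one, as in the text: one step per image.
steps_single = prediction_steps(1000, 1)

# Batch size of 32: 31 full batches plus one partial batch.
steps_batched = prediction_steps(1000, 32)
```

With a batch size of one, the result is simply the number of images, which is why *n_test_imgs* can be passed directly in the call above.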

The predictions now exist in the format of quasi-one-hot-encoded class vectors. That means that each prediction is a vector whose number of elements corresponds to the number of classes that the model is to differentiate between. The values of these n elements sum to one for each vector, so a probability is assigned to each class for each prediction. This means that the CNN actually does not make one hard-cut prediction; rather, it makes a somewhat fuzzy prediction: there is one vector element, i.e. class, with the maximum probability assigned to it, but the actual value of this maximum probability can differ considerably among predictions. The other classes also have probabilities assigned to them, meaning that the model finds it less likely that the image belongs to these classes, but does not rule out the possibility, either.

In applied usage of CNNs, this characteristic can be interpreted as the model being more or less certain about its prediction. A "certainty threshold", or a set of class-specific thresholds (based on known class-specific performance weaknesses and strengths), can be implemented to set aside predictions that are likely wrong for later manual validation.

For now, however, it is important for us to know in what format the predictions exist. In order to work with them, we need to transform them into scalar values, i.e. into hard-cut predictions. We determine the *argumentum maximum*, i.e. the index of the vector element with the highest probability, by employing the *argmax* function of the *numpy* module. Using an abbreviated loop (a list comprehension), we apply it to every prediction (the number of predictions is given by the *length*, or number of rows, of the predictions matrix, in which every row is one class vector, or prediction). The resulting list is turned into a numpy array for future convenience.

    preds = [np.argmax(preds[i]) for i in range(len(preds))]
    preds = np.array(preds)
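The "certainty threshold" idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the probability matrix `probs` and the cut-off `threshold` are made up, and the thresholding has to be applied to the raw class vectors, i.e. before they are replaced by hard class indices.

```python
import numpy as np

# Made-up class-probability vectors for four images and three classes;
# each row sums to one, like the output of a softmax layer.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
])

threshold = 0.6  # hypothetical cut-off, to be tuned per application

hard_preds = np.argmax(probs, axis=1)  # index of the most probable class per row
certainty = np.max(probs, axis=1)      # probability assigned to that class
needs_review = certainty < threshold   # True where the model is "uncertain"

# Predictions that pass the threshold; the rest would go to manual validation.
accepted = hard_preds[~needs_review]
```

Class-specific thresholds could be handled in the same way by replacing the scalar `threshold` with an array that is indexed by `hard_preds`.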

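To check the validity of the predictions, we require the true class labels. We can access these through the *labels* attribute of our test-images generator. The labels returned already have the format of scalar indices, and are thus directly comparable to the predictions. Note that this comparison is only meaningful if the generator yields the images in a fixed order, i.e. was created without shuffling; otherwise, the order of the predictions will not match the order of the labels.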

    obs = test_generator.labels

In most applications, the performance of a classifier model is not measured by simply determining the ratio of correctly classified images to the total number of images. Performance metrics are often task-specific, but we usually want to get a good idea of class-specific performance. In the case of ecological applications, we might for example want to know how well the classifier can predict the relative abundances of different taxa (biological groups) of organisms. In this case, it is not even necessary that the number of correct predictions per class is very high, but rather that the predicted relative abundances (i.e. values between zero and 100 percent) are close to the true abundances. To implement such a comparison, we can calculate the number of predictions of a specific class index divided by the total number of predictions, and we can do the same for the true labels. We implement these operations as loops over the unique elements, which are the unique class indices, in both the predictions vector ("preds") and the true-labels vector ("obs").

    preds_relative = [sum(preds == i) / len(preds) for i in np.unique(preds)]
    obs_relative = [sum(obs == i) / len(obs) for i in np.unique(obs)]
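One caveat of looping over the unique elements is that a class that never occurs in the predictions (or in the labels) is simply left out, so the two abundance vectors can end up with different lengths. A possible alternative, sketched here with made-up data and a hypothetical helper named `relative_abundance`, is to count with `np.bincount` and a fixed number of classes, so that absent classes get an abundance of zero.

```python
import numpy as np


def relative_abundance(class_indices, n_classes):
    """Share of each class index in `class_indices`, including absent classes."""
    counts = np.bincount(class_indices, minlength=n_classes)
    return counts / counts.sum()


# Made-up example with three classes; class 2 is never predicted.
obs_demo = np.array([0, 0, 1, 1, 2, 2, 2, 1])
preds_demo = np.array([0, 0, 1, 1, 1, 0, 1, 1])

obs_share = relative_abundance(obs_demo, n_classes=3)
preds_share = relative_abundance(preds_demo, n_classes=3)

# Absolute difference between true and predicted abundance per class.
abs_error = np.abs(obs_share - preds_share)
```

Because both vectors now have one entry per class, they can be compared element by element, for instance through the absolute error per class.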

We can then compare them in a barplot using the *plt.bar* function of the *pyplot* module of the *matplotlib* package. We plot the true relative abundances in blue, and the predicted abundances in red. For the latter, we also reduce the width of the bars using the *width* argument, in order to avoid over-plotting. In order to use the real class names instead of the numeric class indices, we extract these names using the *keys* function from the *class_indices* attribute of our test-images generator, and pass them as the x-axis argument in our calls to the barplot function.

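    # Class names, in the order of their class indices
    classnames = np.array(list(test_generator.class_indices.keys()))

    # True abundances (default color), predicted abundances (red, narrower bars)
    plt.bar(classnames[np.unique(obs)], obs_relative)
    plt.bar(classnames[np.unique(preds)], preds_relative, color='r', width=0.4)
    plt.xticks(rotation=90)
    plt.show()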

Of further interest are the confusion rates between the different classes. These can help us find out which classes are frequently confused with one another, and to make corresponding changes to the training dataset, e.g. by increasing the number of training images for those classes. A confusion analysis is easily done using the function *confusion_matrix* from the *scikit-learn* package. We simply need to supply the true class indices and our predicted class indices to this function; no further inputs are required. For purposes of plotting, we can provide the class names instead of the class indices, by subsetting the class-names vector with our class indices.

    from sklearn.metrics import confusion_matrix

    # Translate class indices into class names
    preds_named = np.array(classnames[preds])
    obs_named = np.array(classnames[obs])

    # Rows: true classes, columns: predicted classes
    conf_mat = confusion_matrix(obs_named, preds_named)
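To make explicit what the confusion matrix contains, here is a minimal sketch that builds one by hand with numpy and turns the raw counts into per-class confusion rates. The data and the helper name `confusion_counts` are made up; the orientation (rows are true classes, columns are predicted classes) follows the one used by scikit-learn.

```python
import numpy as np


def confusion_counts(obs, preds, n_classes):
    """Count matrix with true classes in rows and predicted classes in columns."""
    mat = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when an (obs, pred) pair repeats.
    np.add.at(mat, (obs, preds), 1)
    return mat


# Made-up true labels and predictions for three classes.
obs_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2])
preds_demo = np.array([0, 0, 1, 1, 1, 2, 0, 2])

counts = confusion_counts(obs_demo, preds_demo, n_classes=3)

# Divide each row by its sum: entry [i, j] is then the share of images
# that truly belong to class i and were predicted as class j.
rates = counts / counts.sum(axis=1, keepdims=True)
```

Row-normalized rates are often easier to compare between classes than raw counts when the classes differ strongly in their number of test images.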

Plotting the confusion matrix is somewhat complicated when using *matplotlib*. The following shows how the plotting can be implemented:

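    fig, ax = plt.subplots(1, 1)
    img = ax.imshow(conf_mat)

    # One tick per class on both axes, labeled with the class names
    ax.set_xticks(np.arange(len(classnames)))
    ax.set_yticks(np.arange(len(classnames)))
    x_label_list = classnames.tolist()
    ax.set_xticklabels(x_label_list, rotation=90)
    ax.set_yticklabels(x_label_list)

    # Rows (y-axis) are true classes, columns (x-axis) are predicted classes
    ax.set_xlabel("Predicted class")
    ax.set_ylabel("True class")
    fig.colorbar(img)
    plt.show()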

The plotted confusion matrix makes it easy to see which classes are frequently confused with one another: off the main diagonal, bright tiles point to a high and dark tiles to a low number of confusions (with the default color map), while the diagonal itself holds the correctly classified images. We can use this information for comparisons with the class-specific relative amounts of training images, and to check whether some classes are morphologically very similar to each other. From the results of this analysis, we can then try to optimize our training dataset by adding training images for classes that are easily confused.

We can now calculate two metrics frequently used in the assessment of classifier models: recall and precision. Both metrics are class-specific; their average or weighted average over classes can be used to evaluate the classifier's general performance. Recall is the number of images correctly classified as belonging to one class divided by the total number of images in that class. It tells us how good a model is at finding all images that belong to a class. A non-class-specific recall is equal to the model accuracy, i.e. the total number of correctly predicted images divided by the total number of images at hand. We calculate recall values by subsetting the predictions for which the observations have a specific class index, and counting the number of cases in which the prediction also has that class index. This value is then divided by the total number of observations that have this specific class index.

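    # Recall per class: correct predictions among all images truly in class i
    # (looping over all class names keeps the result aligned with classnames)
    recall = [np.sum(preds[obs == i] == i) / np.sum(obs == i) for i in range(len(classnames))]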

Precision is the number of predictions correctly classified as belonging to one class divided by the total number of images classified as belonging to that class (i.e. both correct and false predictions). It tells us how rarely an image of another class is mistaken for this particular class, i.e. how frequently we can assume that a prediction of this class is correct. We calculate it by subsetting the predictions for which the observations have a specific class index, and counting the number of cases in which the prediction also has that class index. This value is then divided by the total number of predictions that have this specific class index.

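    # Precision per class: correct predictions among all images predicted as class i
    precision = [np.sum(preds[obs == i] == i) / np.sum(preds == i) for i in range(len(classnames))]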
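Both metrics can also be read directly off a confusion matrix, which may help to see how they relate to each other and to the overall accuracy. The sketch below uses a made-up count matrix (rows are true classes, columns are predicted classes); all variable names are chosen for illustration only.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes.
conf_demo = np.array([
    [8, 2, 0],
    [1, 6, 3],
    [2, 0, 18],
])

correct = np.diag(conf_demo)   # correctly classified images per class
support = conf_demo.sum(axis=1)  # number of images truly in each class

recall_demo = correct / support                   # divide by row sums
precision_demo = correct / conf_demo.sum(axis=0)  # divide by column sums

# Simple and support-weighted averages of recall over classes.
macro_recall = recall_demo.mean()
weighted_recall = np.average(recall_demo, weights=support)

# Overall accuracy: all correct predictions divided by all images.
accuracy = correct.sum() / conf_demo.sum()
```

In this example, the recall averaged with the class sizes as weights coincides with the overall accuracy, whereas the simple average treats small and large classes alike.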

Both metrics can again be visualized as barplots, with the class names on the x-axis and the recall or precision values (ranging between zero and one) on the y-axis.

    plt.bar(classnames, recall)
    plt.show()

    plt.bar(classnames, precision)
    plt.show()

Often, the precision pattern (over classes) will follow the recall pattern to some extent; strongly diverging values are rather rare, because both metrics indirectly show how well the model has defined class boundaries in the image hyper-space. In comparison to some performance metrics more common in regression modeling, recall could be thought of as somewhat related to an overall metric of fit (i.e. the amount of explained variance being reflected as the amount of correctly classified images), while precision hints at the significance of the found class boundaries by indicating how frequently the estimated class boundaries are "violated" by the ground truth. (This does not mean that these performance and statistical metrics are indeed equal!)

The performance metrics and visualization designs shown are the basics for properly assessing the performance of a classifier model, not only that of a CNN classifier, but also that of support-vector machines, classification trees, simple binomial classifiers and all other classification algorithms. Different applications may, however, require the calculation of additional or modified performance metrics.