Ex Data, Scientia


The Hessian matrix and parameter fitting

When you fit a complex statistical model to data, you may sometimes encounter a warning message referring to unfulfilled properties of the so-called "Hessian matrix", with the final parameter estimates not being trustworthy as a result. Here we will investigate the meaning of the Hessian matrix and why it is important in the statistical-fitting process.

In order to understand the Hessian matrix in the context of statistical fitting, we need to remember that the automatic step of generating fitted values for the parameters of statistical models (like GLMs) is an optimization process (i.e. before any manual check of model diagnostics): A loss value between observations and model predictions is calculated for some initial (often random or guessed) selection of parameter values. The first derivative of the loss with respect to each parameter then gives the local slope ("trend") of the loss, i.e. how the loss changes when that parameter is changed slightly. The current loss itself is just one single data point on a loss surface. Model fitting is usually terminated when the loss no longer decreases by more than a certain (small) amount in response to changing the model parameters.
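
As a minimal sketch of this procedure (the toy data x and y, the single parameter theta, the learning rate and the tolerance are purely illustrative and unrelated to the example developed below), gradient descent on a simple squared-error loss with such a stopping rule could look like this:

# toy data with a known linear relationship (purely illustrative)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50, sd = 0.1)

loss <- function(theta) sum((y - theta * x)^2)          # squared-error loss
grad <- function(theta) sum(-2 * x * (y - theta * x))   # its first derivative

theta <- 0        # initial (guessed) parameter value
lr <- 1e-3        # learning rate
old_loss <- loss(theta)

repeat{
  theta <- theta - lr * grad(theta)      # step against the local slope of the loss
  new_loss <- loss(theta)
  if(old_loss - new_loss < 1e-8) break   # stop once the loss barely decreases anymore
  old_loss <- new_loss
}

theta   # close to the true slope of 2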

However, in addition to that, another check is performed to determine whether one has really reached a loss minimum or whether the loss could be decreased further by altering the model parameters. This check concerns the local curvature of the loss function, which should be convex (curving upward like a bowl) or indifferent ("flat") with respect to every model parameter. If that is the case, the loss would not decrease, and would most likely increase, when slightly changing the model parameters, i.e. with the current parameterization we would sit at the "bottom of a bowl", that is, at a loss minimum. The model is then said to have successfully "converged".

Convexity is typically checked by computing the so-called Hessian matrix. The Hessian matrix is a square matrix containing the second partial derivatives of the loss function, evaluated at the current point, with respect to pairs of parameters. Figuratively speaking, one cell of the Hessian matrix might tell us how "the way the loss changes when increasing parameter alpha changes when we increase parameter beta". The diagonal cells of the Hessian matrix are mono-parametric, i.e. such a cell contains the second derivative of the loss with respect to one and the same parameter taken twice (or "how the change of the loss with alpha changes when increasing alpha further"). It is these diagonal cells that we will focus on when determining convexity: Ideally, all these second derivatives should be positive; then the loss function locally curves upward in the direction of every parameter. (Strictly speaking, positive definiteness or semidefiniteness is a property of the whole matrix, namely that no eigenvalue is negative; nonnegative diagonal elements are a necessary part of that property and serve as a simple check here.) What is required for declaring convergence is only a positive-semidefinite Hessian; in terms of the diagonal check, the diagonal elements must not be negative, but they may be zero.
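
As a small, self-contained illustration of what this check means (the matrix values below are made up and unrelated to the model fitted later): for a symmetric Hessian, positive semidefiniteness means that no eigenvalue is negative, while nonnegative diagonal elements are the weaker, simpler condition we will track later in this post.

hess <- matrix(c(2.0, 0.5,
                 0.5, 1.0), nrow = 2, byrow = TRUE)   # hypothetical 2x2 Hessian

eigen(hess, symmetric = TRUE)$values             # both eigenvalues are positive here
all(eigen(hess, symmetric = TRUE)$values >= 0)   # TRUE -> positive-(semi)definite
all(diag(hess) >= 0)                             # the simpler diagonal check used below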

It should be noted that achieving local convexity is a necessary condition for having reached a loss minimum, but it is not a sufficient indication that the best possible parameterization has been found: the loss might still be decreased by larger changes of the model parameters. This is because the local curvature of the loss function is just that - local. A loss function can be very complex, composed of e.g. several local loss minima of different depths, saddle points and areas of varying steepness. The local curvature acts like a quadratic polynomial fitted to the immediate surroundings of the point on the loss surface corresponding to the current parameterization, and provides no information on the entirety of the loss surface. Hence local curvature is only useful as an extra check on the quality of the model parameterization.
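
A one-dimensional toy example makes this limitation concrete (the function f below is invented purely for illustration and has nothing to do with the model fitted later): it has two local minima of different depths, and the curvature check passes at the shallower one even though a lower minimum exists elsewhere.

f  <- function(x) x^4 - 2 * x^2 + 0.5 * x     # two local minima of different depths
d2 <- function(x) 12 * x^2 - 4                # second derivative of f

x_shallow <- optimize(f, c(0, 2))$minimum     # shallower (local) minimum
x_deep    <- optimize(f, c(-2, 0))$minimum    # deeper (global) minimum

d2(x_shallow) > 0            # TRUE: locally convex at the shallower minimum ...
f(x_shallow) > f(x_deep)     # TRUE: ... yet it is not the lowest point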

Nevertheless, employing the convexity check can be a powerful tool in model design. Very complex models, like the often highly parameterized mixed-effects statistical models, require a lot of data for successful fitting. When data are lacking, as is often the case especially in field research, a gradient-descent algorithm may reach what appears to be a loss minimum, but the properties of the Hessian matrix will tell a different story, hence the common and somewhat dreaded warning, "Hessian not positive-semidefinite". In this case, the failure to reach a loss minimum with respect to every parameter indicates that the model design is too complex for the amount and quality of the data at hand, and suggests collecting additional data (especially to achieve more balance across the levels of categorical grouping variables), or employing a simpler model design (though that might conflict with prior knowledge on the study design).
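
In practice, one rarely computes such a Hessian by hand for a mixed model; packages like lme4 run these checks internally and only report the outcome. Below is a rough sketch of how one might probe a fitted model; the formula is just an example on the sleepstudy data that ships with lme4, and the assumption that the numerically estimated Hessian is exposed via fit@optinfo$derivs may vary between lme4 versions.

library('lme4')

# example model on the built-in sleepstudy data (for illustration only)
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

isSingular(fit)   # TRUE would indicate a boundary (degenerate) fit

# assumption: the numeric Hessian of the objective is stored in optinfo;
# at a proper optimum, none of its eigenvalues should be clearly negative
H <- fit@optinfo$derivs$Hessian
eigen(H, symmetric = TRUE)$values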

We now simulate a fitting operation and, once we believe to have found a loss minimum, use the Hessian matrix to scrutinize that belief. We will predict a variable from the mtcars data set via an arbitrary function of two predictor variables from the same data set, with the function designed to be relatively complex in order to make the fitting (optimization) procedure not quite straightforward, i.e. to introduce a non-trivial loss function (in applied statistics, you would start from a rather simple model formulation, however). We start by picking the variable mpg (miles per gallon) as the predicted variable and drat (rear axle ratio) and wt (weight) as the predictor variables, and standardizing them (in order to give them equal weight in the fitting process). We also set up two parameters to be optimized in the fitting process, and initialize them to 3 and -3, respectively (arbitrary choices).

library('tidyverse')
library('viridis')

in1a <- (mtcars$drat-mean(mtcars$drat))/sd(mtcars$drat)
in1b <- (mtcars$wt-mean(mtcars$wt))/sd(mtcars$wt)
y <- (mtcars$mpg-mean(mtcars$mpg))/sd(mtcars$mpg)

par_1 <- 3
par_2 <- -3

We then create a string defining the optimization problem underlying the fitting task: The model, i.e. the prediction function, takes the form mpg_predicted = 1 / (1 + exp(-(par_1 * drat * 1 / (1 + exp(-par_2 * wt))))). Obviously, this function is non-linear, and its design is also very arbitrary given the simple prediction task. The loss for one data point is then (mpg - mpg_predicted)^2, i.e. the squared difference between observation and prediction. The loss function whose minimum we wish to find has this loss, summed over all data points, as the dependent variable and par_1 and par_2 as the independent variables. Since we are going to use the D() function, which can only differentiate expressions built from basic mathematical operators and functions (and, in particular, cannot handle a sum over a data vector), we need to write out the loss term for each data point and concatenate these terms into one single character string. We do so by filling a vector with the single strings using a loop, and then concatenating them with the paste0() function. Note that the collapse argument is set to " + ", in order to mathematically join the single strings into one long equation defining the optimization problem.

xpr_vec <- vector(length = nrow(mtcars))

for(j in 1:nrow(mtcars)){
  xpr_vec[j] <- paste0("(",y[j]," - 1/(1 + exp(-(par_1 * ",in1a[j]," * 1/(1 + exp(-(par_2 * ",in1b[j],")))))))**2")
}

xpr <- paste0(xpr_vec, collapse = ' + ')

We parse the character string (i.e. we turn it into an expression that can be evaluated as an R command), and then use the D() (for "derivative") function to generate an expression that lets us calculate the partial derivative of the loss equation with respect to a particular term of the equation, i.e. in most cases a parameter to be optimized. D() performs symbolic differentiation of R expressions, which is useful for us since the rules of differentiation are rather comprehensive and difficult to remember. We generate one such expression for each of the parameters par_1 and par_2; when evaluated, these return the first partial derivative of the loss with respect to that parameter and thus allow gradient-descent optimization. We generate four additional expressions for calculating the second partial derivatives of the loss: one for each combination of the two first-derivative expressions and the two parameters. This means, for example, that we calculate the partial derivative with respect to par_1 of the first partial derivative with respect to par_1, and so on. Technically, we simply apply the D() function to the expressions used to calculate the first partial derivatives, i.e. a second partial derivative of the loss is just the first partial derivative of a first partial derivative of the loss.

f <- parse(text = xpr)

f1_par_1 <- D(f, "par_1")
f1_par_2 <- D(f, "par_2")

f2_par_1par_1 <- D(f1_par_1, 'par_1')
f2_par_1par_2 <- D(f1_par_1, 'par_2')
f2_par_2par_1 <- D(f1_par_2, 'par_1')
f2_par_2par_2 <- D(f1_par_2, 'par_2')

We now set up the iterative optimization: First, we fix the number of iterations at 1000. We could later decide to increase that number, but in the interest of keeping computation time low, we will check loss dynamics and convergence after this number of iterations already. Also, we set the learning rate, i.e. the step size by which to update the equation parameters in response to the loss gradients (i.e. the first partial derivatives of the loss function). Then we start the optimization loop: In each iteration, we evaluate all the expressions generated above, i.e. the loss equation itself, the expressions calculating the first partial derivatives, and those calculating the second partial derivatives, and save the results to arrays in order to record and later visualize the optimization process. The second partial derivatives are saved in the shape of the Hessian matrix, i.e. as entries of a three-dimensional array, where the first axis represents the iterations and the second and third axes the rows and columns of the Hessian. The second partial derivatives that address one single parameter go to the cells [i,1,1] and [i,2,2], i.e. the diagonal of the matrix. We also record whether the Hessian matrix passes the convexity check by our diagonal criterion: We sum up the signs of the elements of the Hessian's diagonal. When the sum is 1 or greater, both diagonal elements are nonnegative and at least one is strictly positive, which we take as a positive or positive-semidefinite Hessian; when it is 0 or lower, at least one diagonal element is negative (or both are zero), the check fails, and convergence has formally not been achieved.

n_iter <- 1000   # number of optimization iterations
lr <- 1e-2       # learning rate (step size of the parameter updates)

loss_vec <- vector(length = n_iter)            # loss per iteration
grads_arr <- matrix(nrow = n_iter, ncol = 2)   # first partial derivatives per iteration
hess_arr <- array(dim = c(n_iter, 2, 2))       # Hessian entries per iteration
hess_semdef <- vector(length = n_iter)         # sum of signs of the Hessian diagonal
par_arr <- array(dim = c(n_iter, 2))           # parameter values per iteration

for(i in 1:n_iter){
  # current loss for the current parameter values
  loss_vec[i] <- eval(f)
  
  # first partial derivatives (gradient) of the loss
  grads_arr[i,1] <- eval(f1_par_1)
  grads_arr[i,2] <- eval(f1_par_2)
  
  # second partial derivatives, stored in the shape of the Hessian matrix
  hess_arr[i,1,1] <- eval(f2_par_1par_1)
  hess_arr[i,1,2] <- eval(f2_par_1par_2)
  hess_arr[i,2,1] <- eval(f2_par_2par_1)
  hess_arr[i,2,2] <- eval(f2_par_2par_2)
  
  # sum of the signs of the Hessian's diagonal: 2 means both second derivatives
  # are positive; 0 or lower means the convexity check fails
  hess_semdef[i] <- sum(sign(diag(hess_arr[i,,])))
  
  ### ... ###
}

We also record the current parameter values, and then update them via the gradient-descent technique. That is, we move against the direction of the local loss gradient, which is given by the parameter-specific first derivatives of the loss function (remember that a first derivative always gives the [local] rate of change, or slope, of a function): We subtract the product of the parameter-specific first derivative and the learning rate from the current parameter value. The change of the parameter values equals a slight move on the loss "surface" (a term for the loss function when it has more than one independent variable).

for(i in 1:n_iter){
  ### ... ###
  
  # record the current parameter values, then take a gradient-descent step
  par_arr[i,] <- c(par_1, par_2)
  
  par_1 <- par_1 - lr * eval(f1_par_1)
  par_2 <- par_2 - lr * eval(f1_par_2)
}

After the 1000 iterations of parameter updates, we plot both the loss vector and the convexity vector against the iterations. We find that the loss first followed a slightly dome-shaped (concave) course, then decreased roughly exponentially, and finally approached an asymptote, suggesting that a minimum has been found. We also find that the Hessian matrix failed the convexity check early on (the sum of the signs of its diagonal was 0, i.e. at least one diagonal element was not positive) and then turned positive-definite (the sum of the signs of its diagonal was 2). This happened about the time the loss curve switched from its dome-shaped (slightly concave) phase to the negative-exponential (convex) phase. Thus we have a good indication that a true loss minimum was indeed found here. (The presence of other (possibly lower) loss minima cannot be ruled out, however, especially for very complex loss functions.)

plot(loss_vec, type = 'l', xlab = 'Iteration', ylab = 'Loss')
plot(hess_semdef, type = 'l', xlab = 'Iteration', ylab = 'Hessian convexity')

Now, we want to investigate the loss surface in more detail. As the loss function is of moderate complexity, it is computationally feasible to calculate the loss value for every combination of values of the parameters par_1 and par_2 from pre-defined ranges. We generate a sequence of values for par_1 from -5 to 18 (which contains the minimum and maximum values encountered during the optimization process) of length 1000, and a sequence of equal length for par_2 from -3 to 5. Since we do not wish to calculate partial derivatives this time, we do not need to construct a long character string (as before) to calculate the loss. We loop over the value vector for par_2, and for each of its values calculate the summed loss over all data points for every value of the par_1 vector in one step by means of an outer-product calculation (from linear algebra). (The details here are not important; what matters is that the summed loss is calculated for every combination of values from the par_1 vector and the par_2 vector, which could also be achieved with a nested loop over both vectors.)

par_1_sample <- seq(-5, 18, length.out = 1000)
par_2_sample <- seq(-3, 5, length.out = 1000)

loss_arr <- array(dim = c(1000,1000))

for(i in 1:1000){
  # for the i-th value of par_2, compute the summed loss over all data points
  # for every value of par_1 at once (outer product of data and parameter vectors)
  loss_arr[,i] <- colSums((y - 1 / (1 + exp(-in1a %*% t(par_1_sample) * 1 / (1 + exp(-(par_2_sample[i] * in1b))))))**2)
}

After some data wrangling, we plot a heat map of the loss surface against par_1 on x and par_2 on y, and overlay the optimization track on this surface as a sequence of points, with red denoting iterations where the Hessian failed the convexity check (a convexity value of 0) and green indicating a positive-definite Hessian (a value of 2). Interestingly, we see that the loss surface has two main features: Loss decreases strongly from low to high values of par_1, with a steep decline around a par_1 value of 0. For values of par_1 above 0, loss decreases from both high and low values of par_2 towards a minimum around a par_2 value of zero. For values of par_1 below 0, values of par_2 around 0 instead appear to correspond to a loss maximum. With the parameter values initialized to 3 and -3 for par_1 and par_2, respectively, as above, the optimization process had relatively little trouble finding the loss "valley", and the route of gradient descent is relatively straight. Indeed, a single point-like loss minimum does not really seem to exist, as the loss appears to be equally low for all values of par_1 larger than roughly 0 combined with a par_2 value of 0.

loss_arr <- as.data.frame(loss_arr)
names(loss_arr) <- par_2_sample
loss_arr$par_1 <- par_1_sample
loss_arr <- loss_arr%>%gather(-par_1, key = 'par_2', value = 'loss')
loss_arr$par_2 <- as.numeric(loss_arr$par_2)

ggplot() + 
  geom_tile(data = loss_arr, aes(par_1, par_2, fill = loss)) + 
  geom_point(data = NULL, aes(par_arr[,1], par_arr[,2], color = factor(hess_semdef, levels = c(0,2,-2))), show.legend = F) + 
  scale_fill_viridis('Loss', option = 'plasma') + 
  scale_color_manual(values = rainbow(3)) + 
  xlab('Parameter #1') + ylab('Parameter #2') + theme_bw() + theme(text = element_text(size = 15))
  
ggplot() + 
  geom_line(data = loss_arr, aes(par_2, loss, color = loss, group = par_1), show.legend = F) + 
  geom_point(data = NULL, aes(par_arr[hess_semdef==0,2], loss_vec[hess_semdef==0]), show.legend = F, color = rainbow(3)[1]) + 
  geom_point(data = NULL, aes(par_arr[hess_semdef==2,2], loss_vec[hess_semdef==2]), show.legend = F, color = rainbow(3)[2]) + 
  geom_point(data = NULL, aes(par_arr[hess_semdef==-2,2], loss_vec[hess_semdef==-2]), show.legend = F, color = rainbow(3)[3]) + 
  scale_color_viridis('Loss', option = 'plasma') + 
  xlab('Parameter #2') + ylab('Loss') + theme_bw() + theme(text = element_text(size = 15))

ggplot() + 
  geom_line(data = loss_arr, aes(par_1, loss, color = loss, group = par_2), show.legend = F) + 
  geom_point(data = NULL, aes(par_arr[hess_semdef==0,1], loss_vec[hess_semdef==0]), show.legend = F, color = rainbow(3)[1]) + 
  geom_point(data = NULL, aes(par_arr[hess_semdef==2,1], loss_vec[hess_semdef==2]), show.legend = F, color = rainbow(3)[2]) + 
  geom_point(data = NULL, aes(par_arr[hess_semdef==-2,1], loss_vec[hess_semdef==-2]), show.legend = F, color = rainbow(3)[3]) + 
  scale_color_viridis('Loss', option = 'plasma') + 
  xlab('Parameter #1') + ylab('Loss') + theme_bw() + theme(text = element_text(size = 15))  

Finally, we want to find out what happens if we initialize the parameters differently, to -4 and 4 for par_1 and par_2, respectively. This makes the optimization procedure more challenging, as the initial par_1 now lies in an area of high loss on the loss surface, and the steep decline towards the area of lower loss at positive values of par_1 must first be traversed. We perform all operations as above, and find that the loss again decreases steeply at first and then approaches what appears to be an asymptotic minimum. However, the convexity dynamics show that convergence has not been reached after the 1000 iterations, since the Hessian still fails the convexity check (a convexity value of 0). Hence, the loss dynamics gave a false impression of a loss minimum having been reached, and the convexity check was indeed crucial to notice that. Interestingly, during the optimization process an area of the loss surface corresponding to convexity was briefly encountered.

# ...  
  
par_1 <- -4
par_2 <- 4

# ...

Checking the optimization trajectory on the fully resolved loss-surface heat map gives more insight into why convergence was not achieved after 1000 iterations: The trajectory did indeed first traverse the steep loss decline along par_1, and convexity was achieved for a brief time in the area of exponential loss decrease along par_1. A phase of full concavity (negative-definite Hessian, indicated by blue color) was also encountered in the upper regions of the area of steep loss decrease. After having traversed the loss decline, however, the optimization procedure did not manage to find the loss decline along par_2 within the 1000 iterations; since the loss surface is rather flat along par_2 where the trajectory ends up after the traversal, the trajectory veers off in the wrong direction (away from the loss "valley"). It is likely that the trajectory would change course again after more optimization iterations. For now, we find that checking the Hessian matrix was indeed useful in keeping us from stopping the optimization procedure too early and accepting inadequate model parameters.