Computing partial derivatives via auto-differentiation - a comparison of approaches in R and Python

Computing partial derivatives of a function with respect to its parameters is a key procedure in optimization and thus a basic requirement for many machine-learning tasks. Here, we are going to compare the different methods of obtaining partial derivatives available in R and Python.

Calculating partial derivatives of a function is not a straightforward task, as it requires comprehensive knowledge of the rules of differentiation, which form a fairly large body of knowledge. Fortunately, the programming languages R and Python offer functions for calculating derivatives automatically (a process that is carried out in background computations anyway), the so-called auto-differentiator functions. These functions are often rather hidden from the average programmer, as they tend to be the backbone of higher-level functions like optimizers. However, knowing how to obtain function derivatives can be a powerful skill for implementing machine-learning tasks without having to rely on black-box-like library functions, and it thus offers a high degree of flexibility to the programmer.

Several different auto-differentiator functions have been developed over the years, and each of them has its specific advantages and disadvantages. Here, we are going to look at the auto-differentiators deriv, optimr and TMB (Template Model Builder) in R, and at SymPy, TensorFlow and Keras in Python. optimr and Keras are not pure auto-differentiators but rather optimization libraries; we include them anyway, as they still offer a useful degree of flexibility (especially when compared to relatively constrained model-fitting functions like nls() and glm() in R, which only work for optimization problems of a very distinct design) and feature some advantages over the pure auto-differentiators for many applied tasks.

We are going to calculate derivatives for two variables of a simple mathematical function whose core component resembles a linear regression model: one variable is multiplied by a constant, and the other variable is added to the product. The sum is then inserted into a sigmoid activation function (1 / (1 + exp(-s)), where "s" is the sum). Finally, the squared difference from another constant is calculated. This mathematical graph represents the most basic form of a neural network with mean-squared-error loss conceivable. The two variables are named param and intc here; the two constants are x and y.
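Written out in full, the graph thus computes

loss = (y - 1 / (1 + exp(-(intc + param * x))))^2

and the quantities we are after are the partial derivatives of this loss with respect to param and intc.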

For calculating derivatives with deriv() in R, we first need to generate a character string of the mathematical graph. To this end, we first formulate the part resembling a linear-regression model as a character string and subsequently paste it together with other character strings, using the paste0() function, to incorporate the non-linear activation and the squared-difference term. We then paste this string into the functional formulation deriv( ~ ..., c("param", "intc")) (where "..." is replaced by the character string of the mathematical graph). We parse and evaluate this character string in order to execute it, using the functions parse() and eval(). The result is an unevaluated R expression containing lines of code that calculate the partial derivatives of the function output with respect to param and intc; this expression was generated by the auto-differentiator. Finally, we evaluate the expression with eval() and obtain the partial derivatives for param and intc as scalar values. The entire procedure is somewhat tedious, especially since we could have used the deriv() function directly instead of pasting together a complex character string and then evaluating it. However, it comes in handy when the mathematical graph to be used is very complex and may be constructed with loops etc. (deep neural networks are an example).

mod_out = 'intc + param * x'
mod_out = paste0('1 / (1 + exp(-(', mod_out, ')))')
mod_out = paste0('(y - ', mod_out, ')^2')

param = 3.5
intc = 5.0
x = 2.5
y = 6.5

drv = paste0('deriv(~ ', mod_out, ', c("param", "intc"))')

drv = eval(parse(text = drv))

eval(drv)
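
The evaluated expression returns the value of the graph with the partial derivatives attached as the attribute "gradient", so they can also be extracted programmatically, e.g.:

res = eval(drv)
attr(res, 'gradient')  # partial derivatives with respect to param and intc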

We can directly optimize the mathematical graph by using the R function optimr(). Here, we write the graph as a function()-type object, the first argument of which is a vector of the parameters to be optimized. We then call optimr() and provide a vector of initial parameter values (here arbitrarily set to 3.5 for param and 5.0 for intc) as the first argument, the name of the function to be optimized ("opt_fn") as the second argument, and values for the constants used in the mathematical graph. The optimr() function will internally calculate partial derivatives of the "opt_fn" function with respect to the parameters param and intc, and will also update the parameters using these partial derivatives (also termed gradients). Parameter updates are conducted until convergence is reached, i.e. until the loss minimum of the function to be optimized has (likely) been found.

library('optimr')

opt_fn = function(parameters, x, y){
  param = parameters[1]
  intc = parameters[2]
  
  mod_out = intc + param * x
  mod_out = 1 / (1 + exp(-mod_out))
  mod_out = (y - mod_out)^2
  
  return(mod_out)
}

optimr(par = c(3.5, 5.0), fn = opt_fn, x = 2.5, y = 6.5)
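
Assuming that optimr() mirrors the return structure of base R's optim(), the optimized parameter values and the final loss can then be retrieved from the returned list, e.g.:

fit = optimr(par = c(3.5, 5.0), fn = opt_fn, x = 2.5, y = 6.5)
fit$par    # optimized values of param and intc
fit$value  # loss at the optimum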

We can also calculate partial derivatives for the parameters, and optionally optimize the loss function (i.e. the graph), using the R package TMB (which stands for Template Model Builder). Utilization of this package requires writing the graph in a C++ script, while everything else, i.e. loading the TMB package, compiling the C++ script and generating the partial derivatives, is done in R. Within the C++ script, here named "Derivs_R.cpp", we first need to declare the constants x and y, as well as the parameters param and intc. (Initial) values for these constants and parameters will be passed from R when the compiled script is later used via TMB. We then write our mathematical graph (which is assigned to the object "mod_out").

#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_SCALAR(x);
  DATA_SCALAR(y);
  
  PARAMETER(param);
  PARAMETER(intc);
  
  Type mod_out = 0;
  
  mod_out += intc + param * x;
  mod_out = 1 / (1 + exp(-mod_out));
  mod_out = pow(y - mod_out, 2);
  
  return mod_out;
}

We compile this script with the compile() function of TMB and load it into the R session via the dyn.load() function. We create a named list of the constants (named "data") and a named list of the parameters (named "parameters"). We use the function MakeADFun() to obtain the partial derivatives of the mathematical graph with respect to param and intc. These are available via the gr() function of the resulting object, which also provides the actual output of the mathematical graph via the function fn(). The C++ script can also be altered to return any intermediate or additional output that is not used in gradient calculation (via the TMB macro ADREPORT). The gradients obtained can be used for manual optimization, but can also be passed directly to an optimizer, e.g. the R function optimr(). Here, we pass the parameter vector, the graph-output function and the gradient function directly to optimr() and use that function only to update the parameters and to implement specific optimization routines, until the loss minimum is (approximately) reached.

library('TMB')

compile('Derivs_R.cpp')
dyn.load(dynlib('Derivs_R'))

data = list('x' = x, 'y' = y)
parameters = list('param' = param, 'intc' = intc)

obj = MakeADFun(data, parameters, DLL = 'Derivs_R')

obj$gr()
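
The output of the graph itself (i.e. the loss at the current parameter values) can be queried in the same way via the fn() element of the object:

obj$fn()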

################################################################################

fit = optimr(obj$par, obj$fn, obj$gr)
sdr = sdreport(obj)
sdr

Now let us look at gradient calculation in Python. First, we are going to look at an implementation with the package SymPy. SymPy appears to have some incompatibilities with other auto-differentiation packages like TensorFlow, so it seems best to set up separate working environments for the two packages. The procedure for calculating derivatives with SymPy is similar to that of the deriv() function in R. Here, we need to initially declare all constants (x and y) and parameters (param and intc) as so-called symbols. Then, as with deriv(), we construct character strings representing the mathematical graph. Next, we use the SymPy function diff() to generate an expression that represents the calculations necessary for computing the partial derivatives; we need to name the symbol with respect to which we wish to differentiate when calling diff(). We then transform the expression into a character string (using the basic str() function) and replace our symbols (x, y, param and intc) with actual scalar values. Finally, we evaluate the strings using eval() (i.e. we execute the Python commands "contained" in the character strings) and obtain our partial derivatives.

from sympy import symbols, diff, exp

param = symbols('param')
intc = symbols('intc')
x = symbols('x')
y = symbols('y')

mod_out = 'intc + param * x'
mod_out = '1 / (1 + exp(-(' + mod_out + ')))'
mod_out = '(y - ' + mod_out + ')**2'

drv_param = diff(mod_out, param)
drv_param = str(drv_param)
drv_intc = diff(mod_out, intc)
drv_intc = str(drv_intc)

param = 3.5
intc = 5.0
x = 2.5
y = 6.5

eval(drv_param)
eval(drv_intc)
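
As a side note, the string detour is not strictly necessary: the same derivatives can be obtained by keeping the graph fully symbolic and substituting the scalar values at the end via SymPy's subs() method. A minimal sketch (with freshly created symbols, since param, intc, x and y have been re-bound to scalars above):

p, i, xv, yv = symbols('param intc x y')
expr = (yv - 1 / (1 + exp(-(i + p * xv))))**2
grad_param = diff(expr, p)
grad_param.subs({p: 3.5, i: 5.0, xv: 2.5, yv: 6.5}).evalf()  # derivative w.r.t. param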

Another possibility for calculating partial derivatives in Python is TensorFlow in conjunction with Keras. TensorFlow is a library for implementing large-scale machine-learning tasks, like image recognition, efficiently in Python; hence computing partial derivatives of a function is a fundamental feature of TensorFlow. Keras is a high-level interface to TensorFlow that allows fast construction of (very) complex models like deep neural networks. In our first example, however, we only require TensorFlow. We first have to declare x and y as TensorFlow constants and param and intc as TensorFlow variables. This is necessary for TensorFlow to determine which scalars of the mathematical graph should be traced for the computation of partial derivatives. We then invoke a so-called GradientTape, which is basically a way of constructing a mathematical graph that is "traced", i.e. for which auto-differentiation is invoked to construct expressions for the computation of partial derivatives. The interesting thing here is that the mathematical graph is not written as a character string, but as a series of regular Python commands; the only constraint is that the corresponding lines must be indented under the line invoking the GradientTape. Note also that any mathematical operations that would normally be done with base Python functions or with NumPy must be done with TensorFlow functions (e.g. exp()), which can occasionally differ from their NumPy equivalents. Another interesting fact is that TensorFlow allows the utilization of functions that are not "pure" mathematical operations, like mean(), as long as they are written using the TensorFlow functions. For obtaining the partial derivatives, we use the gradient() method of the tape, to which we pass the name of the final object generated by the mathematical graph and a list of the variables with respect to which the partial derivatives are to be calculated.

import tensorflow as tf

param = tf.Variable(3.5)
intc = tf.Variable(5.0)

x = tf.constant(2.5)
y = tf.constant(6.5)

with tf.GradientTape(persistent = True) as tape:
    mod_out = intc + param * x
    mod_out = 1 / (1 + tf.exp(-(mod_out)))
    mod_out = (y - mod_out)**2
    
tape.gradient(mod_out, [param, intc])
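
Since the tape was created with persistent = True, its gradient() method can be called more than once, e.g. separately for each variable:

tape.gradient(mod_out, param)  # partial derivative with respect to param only
tape.gradient(mod_out, intc)   # partial derivative with respect to intc only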

We can also use TensorFlow with Keras directly for optimization. To this end, we need to re-write our mathematical graph as a sequence of Keras "layers" - the term "layer" is part of the deep-learning vocabulary and basically describes a recurring sequence of operations. In our case, since our mathematical graph actually is a neural network in its simplest form, the translation to the layer-based syntax is easy. We define one input layer, "lay_in", to which we pass the shape of the input constant as argument (a single scalar value, hence [1,]). We define one layer which represents the multiplication of x with param and the addition of intc, which is the typical operation of a Dense layer. Here, we only specify the dimensionality of the layer output as an argument, since in deep neural networks, layers typically have multi-dimensional output; in the present case, however, we simply set the argument to 1. We also specify that a bias should be applied (bias is the deep-learning term for intercept). To connect this part of the graph, termed "lay_hd", with the constant x, we need to add the name of the input layer in brackets after the Dense layer. Note that we use internal Keras functions here to initialize param and intc, and that these two parameters are not actually defined as objects by us, but are created directly by the layers.Dense() function. Finally, we define an Activation() layer and specify that the activation to be implemented should be of the sigmoid type. We connect this layer, named "lay_out", to the previous parts of the graph by adding the name of the previous layer in brackets. We then combine all three parts of the graph by using the models.Model() function of Keras, to which we pass the names of the first and the last layer as arguments.

lay_in = tf.keras.layers.Input([1,])
lay_hd = tf.keras.layers.Dense(1, use_bias = True)(lay_in)
lay_out = tf.keras.layers.Activation('sigmoid')(lay_hd)

mod = tf.keras.models.Model(lay_in, lay_out)
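
Before continuing, we can verify that the partial graph was assembled as intended, e.g. by printing a summary; it should list a single Dense layer holding two trainable parameters (param and intc), followed by the sigmoid Activation layer:

mod.summary()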

We now have two options for continuing. First, we can invoke a GradientTape, as above. Within the context of the tape, we have to apply the graph in its current form to x (note that up to now, we have only specified that a scalar constant is part of the graph, but we have not specified that this scalar is x). To this end, we simply call the graph (which we named "mod" here) and need to set the argument training to True; calling the graph simply means that the expression contained in it is evaluated. We also have to compute the squared difference between the graph output and y. Once that is done, the mathematical graph is complete. We can now compute the partial derivatives of the graph output with respect to the parameters, which are contained in the trainable_weights attribute of the partial graph termed mod (note that parameters are termed "weights" in deep-learning parlance). It becomes obvious here that the idea of a mathematical graph is somewhat diluted in the Keras syntax, as certain components of the graph are treated as separate Keras procedures (i.e. the mathematical operations on a placeholder constant, the incorporation of the actual constant x, and the calculation of the loss). Keras differentiates between a model, the application of the model to input data (x) and the evaluation of model performance (calculation of the loss), rather than regarding all these operations as one single graph. From the perspective of trying to understand the computation of partial derivatives, the latter perspective appears more advantageous, however.

with tf.GradientTape(persistent = True) as tape:
    mod_out = mod(tf.reshape(x, [1,]), training = True)
    mod_out = (y - mod_out)**2
    
tape.gradient(mod_out, mod.trainable_weights)
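
These gradients could now also be used for a manual parameter update, e.g. via one of the Keras optimizer classes; a minimal sketch (the learning rate is chosen arbitrarily here):

grads = tape.gradient(mod_out, mod.trainable_weights)
opt = tf.keras.optimizers.SGD(learning_rate = 0.1)
opt.apply_gradients(zip(grads, mod.trainable_weights))  # one gradient-descent step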

The second option we have for utilizing the partial graph "mod" is to optimize directly or, in deep-learning parlance, "to train the model". To this end, we need to "compile" the partial graph using the Keras compile() function, where we specify the optimization algorithm and the loss type. As optimizer, we here select SGD, which stands for stochastic gradient descent and is the most basic form of optimization algorithm. We specify MSE as loss, which stands for mean-squared error and is identical to calculating the squared difference between the output of the activation function and y (when multiple data points, i.e. multiple x and y, are involved, the mean of the squared differences is computed in addition). We then invoke the fit() method of the partial graph to bring it all together: now, the graph is completed by inserting the constants x and y (these need to be specified as arguments to fit()) and by adding the calculation of the squared difference, which is the final or top-level operation of the graph. fit() also immediately computes the partial derivatives of the output of the graph with respect to its variables and applies these derivatives to update the variables, i.e. it optimizes the graph. These procedures are thus similar to invoking the optimr() function in R, though the mathematical graph is more fully visible in optimr().

mod.compile(optimizer = 'SGD', loss = 'MSE')
mod.fit(tf.reshape(x, [1,]), y = tf.reshape(y, [1,]))
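
To verify that fit() has indeed updated param (the weight of the Dense layer) and intc (its bias), we can inspect the current parameter values:

mod.get_weights()  # current values of the Dense layer's weight (param) and bias (intc)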

We have thus compared a series of procedures for calculating partial derivatives of a mathematical function (synonymously termed a graph). We have observed that some options, like using deriv() or SymPy, provide the user with a good understanding of the concept of mathematical graphs, and with the insight that a model, its predictions and the predictive loss must be considered as one sequence of operations in order to optimize the model parameters, but are a bit inconvenient to use for applied tasks, especially larger-scale ones. Other options like TensorFlow enable rapid optimization of functions at very large scale, but tend to hide the concept of the mathematical graph from the user. This can lead to a frustrating experience when trying to optimize functions that do not align with the neural-network templates, as indicated by the apparently large variety of user questions about difficulties in obtaining gradients found on the internet. Template Model Builder provides a good understanding of the concept of the mathematical graph, as model predictions and loss calculation must be part of the C++ script, but the requirement of coding in C++ and the difficulty of tracing back errors is likely off-putting to many users. optimr is relatively user-friendly for users proficient in R, but is mostly geared towards immediate optimization rather than the computation of partial derivatives and custom optimization. It thus appears that a user-friendly implementation of computing partial derivatives that also provides a good didactic introduction to the concept is still waiting to be published...