**Computing partial derivatives via auto-differentiation - a comparison of approaches in R and Python**

Computing partial derivatives of a function with respect to its parameters is a key procedure in optimization and thus a basic requirement for many machine-learning tasks. Here, we are going to compare the different methods of obtaining partial derivatives available in R and Python.

Calculation of partial derivatives of a function is not a straightforward task, as it requires comprehensive knowledge about the rules of derivation, which is a fairly large body of knowledge. Fortunately, the programming languages R and Python offer functions for calculating derivatives automatically (a process that is being done in background computer operations anyway), the so-called auto-differentiator functions. These functions are often quite hidden to the average programmer, as they tend to be the back-bone of higher-order functions like optimizers. However, knowing how to obtain function derivatives can be a powerful skill for implementing machine-learning tasks without having to rely on black-box-like library functions and thus offers a high degree of flexibility for the programmer.

Several different auto-differntiator functions have been developed over the years, and each of them has its specific advantages and disadvantages. Here, we are going to look at the auto-differentiators *deriv*, *optimr* and *TMB* (*Template Model Builder*) in R and at *SymPy*, *Tensorflow* and *Keras* in Python. *optimr* and *Keras* are not pure auto-differentiators but rather optimization libraries, but we include them anyway as they do still offer a useful degree of flexibility (especially when compared to the relatively constrained model-fitting functions like *nls()* and *glm()* in R, which only work for otimization problems of a very distinct design) and feature some advantages over the pure auto-differentiators for many applied tasks.

We are going to calculate derivatives for two variables in a simple mathematical function the core component of which basically resembles a linear regression model model: One variable is multiplied with a constant, the other variable is added to the product. The sum is then inserted into a sigmoid activation function (1 / (1 + exp(-s)), where "s" is the sum). Finally, the squared difference from another constant is calculated. This mathematical graph represents the most basic form of a neural network with mean-squared-error loss conceivable. The two variables are given the names *param* and *intc* here, the two constants are *x* and *y*.

For calculating derivatives with *deriv()* in R, we first need to generate a character string of the mathematical graph. To this end, we first formulate the part resembling a linear-regression model as a character string and subsequently paste it together with other character strings to incorporate the non-linear activation and the squared-difference term, using the *paste0()* function. We then paste this string together with the functional formulation *deriv( ~ ..., c("param", "intc"))* (where "..." is replaced by the character string of the mathematical graph). We parse and evaluate this character string in order to execute it, using the functions *parse()* and *eval()*. The result is an unevaluated R expression which contains lines of code that can calculate the partial derivatives of the function output with respect to *param* and *intc*. This expression was generated by the auto-differentiator. Finally, we evaluate the expression with *eval* and obtain the partial derivatives for *param* and *intc* as scalar values. The entire procedure is somewhat tedious, especially since we could have directly used the *deriv()* function without pasting together a complex character string and the evaluating it. However, it comes in handy when the mathematical graph to be used is very complex and may be constructed with loops etc. (deep neural networks are an example).

mod_out = 'intc + param * x' mod_out = paste0('1 / (1 + exp(-(', mod_out, ')))') mod_out = paste0('(y - ', mod_out, ')^2') param = 3.5 intc = 5.0 x = 2.5 y = 6.5 drv = paste0('deriv(~ ', mod_out, ', c("param", "intc"))') drv = eval(parse(text = drv)) eval(drv)

We can directly optimize the mathematical graph by using the R function *optimr()*. Here, we write the graph a a *function()*-type object, the first argument of which is a vector of the parameters to be optimized. We then call the function *optimr()* and provide a vector of initial parameter values (here arbitrarily set to 3.5 for *param* and 5.0 for *intc*) as first argument, the name of the function to be optimized ("opt_fn") as second argument, and values for the constants used in the mathematical graph. The *optimr()* function will directly calculate partial derivatives of the "opt_fn" function with respect to the parameters *param* and *intc*, and will also update the parameters using the partial derivatives (also termed *gradients*). Parameters updates are conducted until convergence is reached, i.e. until the loss minimum of the function to be optimized has (likely) been reached.

library('optimr') opt_fn = function(parameters, x, y){ param = parameters[1] intc = parameters[2] mod_out = intc + param * x mod_out = 1 / (1 + exp(-mod_out)) mod_out = (y - mod_out)^2 return(mod_out) } optimr(par = c(3.5, 5.0), fn = opt_fn, x = 2.5, y = 6.5)

We can also calculate partial derivatives for the parameters and optionally optimize the loss function (i.e. the graph) using the package *TMB* (which stands for *Template Model Builder*) in R. Utilization of this package requires writing the graph in a C++ script, and writing everything else, i.e. calling the *TMB* package and compiling the C++ script and generating the partial derivatives in R. Within the C++ script, here named "Derivs_R.cpp", we need to first initialize the constants *x* and *y*, as well as the parameters *param* and *intc*. (Initial) values for these parameters will be passed when the script is later compiled via *TMB*. We then write our mathematical graph (which is assigned to the object "mod_out").

#includetemplate Type objective_function ::operator() () { DATA_SCALAR(x); DATA_SCALAR(y); PARAMETER(param); PARAMETER(intc); Type mod_out = 0; mod_out += intc + param * x; mod_out = 1 / (1 + exp(-mod_out)); mod_out = pow(y - mod_out, 2); return mod_out; }

We compile this script with the *compile()* function of *TMB* and load it into the R session via the *dyn.load()* function. We create a named list of the constants (named "data") and a named list of the parameters (named "parameters). We use the function *MakeADFun* to obtain the partial derivatives of the mathematical graph with respect to *x* and *y*. These are stored in the *gr()* attribute of the *MakeADFun* object. That object also contains the actual output of the mathematical graph as the attribute *fn()*. The C++ script can also be altered to return any intermediate or additional output that is not used in gradient calculation (via the C++ function *ADREPORT*). The gradients obtained can be used for manual optimization, but can also be directly passed to an optimizer using e.g. the R function *optimr()*. Here, we pass the parameters, the graph output and the gradients directly to *optimr* and use the function only to update the prarameters and to implement specific optimization routines, until the loss minimum is (approximately) reached.

library('TMB') compile('Derivs_R.cpp') dyn.load(dynlib('Derivs_R')) data = list('x' = x, 'y' = y) parameters = list('param' = param, 'intc' = intc) obj = MakeADFun(data, parameters, DLL = 'Derivs_R') obj$gr() ################################################################################ fit = optimr(obj$par, obj$fn, obj$gr) sdr=sdreport(obj) sdr

Now let us look at gradient calculation in Python: First, we are going to look at an implementation with the package *SymPy*. *SymPy* appears to have some incompatibility with other auto-differentation packages like *TensorFlow*, so it appears best to set up separate working environments for utilizing both packages. The procedures for calculating derivatives with *SymPy* are similar to those in R when using the *deriv()* function. Here, we need to initially declare all constants (*x* and *y*) and parameters (*param* and *intc*) as so-called *symbols*. Then, as with *deriv()*, we construct character strings representing the mathematical graph. Next, we use the *SymPy* function *diff()* to generate an expression that represents the calculations necessary for calculating the partial derivatives. We need to name the symbol(s) for which we wish to generate partial derivatives when calling *diff()*. We then transform the expression into a character string (using the basic *str()* function). We then replace our symbols (*x*, *y*, *param* and *intc*) with actual scalar values. Finally, we evaluate the strings using *eval()* (i.e. we execute the Python commands "contained" in the character strings) and obtain our partial derivatives.

from sympy import symbols, diff from sympy import sin, exp param = symbols('param') intc = symbols('intc') x = symbols('x') y = symbols('y') mod_out = 'intc + param * x' mod_out = '1 / (1 + exp(-(' + mod_out + ')))' mod_out = '(y - ' + mod_out + ')**2' drv_param = diff(mod_out, param) drv_param = str(drv_param) drv_intc = diff(mod_out, intc) drv_intc = str(drv_intc) param = 3.5 intc = 5.0 x = 2.5 y = 6.5 eval(drv_param) eval(drv_intc)

Another possibility for calculating partial derivatives in Python is *Tensorflow* in conjunction with *Keras*. *Tensorflow* is a library for implementing large-scale machine-learning tasks, like image recognition, efficiently in Python; hence computing partial derivatives of a function is a fundamental feature of Tensorflow. *Keras* is a high-level interface to *Tensorflow* that allows fast construction of (very) complex models like deep neural networks. In our first example, we only require *Tensorflow*, however. We first have to declare *x* and *y* as *Tensorflow* constants and *param* and *intc* as *Tensorflow* variables. This is necessary in order for *Tensorflow* to determine which scalars of the mathematical graph should be traced for computation of partial derivatives. We then invoke a so-called *GradientTape*, which is basically a way of constructing a mathematical graph that is "traced", i.e. for which auto-differentation is invoked to construct expressions for the computation of partial derivatives. The interesting thing here is that the mathematical graph is not written as a character string, but as a series of regular Python commands; the only constraint is that the corresponding lines must be written indented under the line invoking the *GradientTape*. Note also that any mathematical operations that would normally be done with the base Python functions or with *NumPy* must be done with *TensorFlow* functions (e.g. *exp()*), which can occasionally differ from their *NumPy* equivalents. Another interesting fact here is that *TensorFlow* allows the utilization of functions that are not "pure" mathematical operations, like *mean()*, as long as they are written using the *TensorFlow* functions. For obtaining the partial derivatives, we utilize the *gradient()* functional attribute of the *Tape*, where we need to pass the name of the final object generated by the mathematical graph and a list of the variables for which the partial derivatives are to be calculated as arguments.

import tensorflow as tf param = tf.Variable(3.5) intc = tf.Variable(5.0) x = tf.constant(2.5) y = tf.constant(6.5) with tf.GradientTape(persistent = True) as tape: mod_out = intc + param * x mod_out = 1 / (1 + tf.exp(-(mod_out))) mod_out = (y - mod_out)**2 tape.gradient(mod_out, [param, intc])

We can also use *TensorFlow* with *Keras* directly for optimization. To this end, we need to re-write our mathematical graph as a sequence of *Keras* "layers" - the term "layer" is inherent to the Deep-Learning vocabulary and basically describes a recurring sequence of operations. In our case, since our mathematical graph acually *is* a neural network in its simplest form, the translation to the layer-based synthax is easy. We define one input layer, "lay_in", where we pass the shape of the input constant as argument (a single scalar value, hence *[1,]*). We define one layer which represents the multiplication of *x* with *param* and the addition of *intc*, which is the typical operation of a *Dense* layer. Here, we only specify the dimensionality of the layer output as an argument, since in deep neural networks, layers typically have multi-dimensional output. In the present case, however, we set the argument simply to *1*. We also specify that a *bias* should be applied (*bias* is the Deep-Learning term for *intercept*). To connect this part of the graph, termed "lay_hd", with the constant *x*, we need to add the name of the *input* layer in brackets after the *Dense* layer. Note that we use internal *Keras* functions here to initialize *param* and *intc*, and that these two parameters are not actually defined as objects by us, but are invoked directly by the *layers.Dense()* function. Finally, we define an *Activation()* layer and specify that the activation to be implemented should be of the *sigmoid* type. We connect this layer, named "la_out", to the previous parts of the graph by adding the name of the previous layer in brackets. We then need to combine all three parts of the graph by using the *models.Model()* function of *Keras*, where we need to pass the names of the first and the last layer as arguments.

lay_in = tf.keras.layers.Input([1,]) lay_hd = tf.keras.layers.Dense(1, use_bias = True)(lay_in) lay_out = tf.keras.layers.Activation('sigmoid')(lay_hd) mod = tf.keras.models.Model(lay_in, lay_hd)

We now have two options for continuing: First, we can invoke a *GradientTape*, as above. Within tbe context of the *Tape*, we have to apply the graph in its current form on *x* (note that up to now, we have only specified that a scalar constant is part of the graph, but we haven not specified that that scalar is *x*). To this end, we simply call the graph (which we named "mod"here) and need to set the argument *training* to *True*; calling the graph simply means that the expression contained is evaluated. We also have to compute the squared difference between *x* and *y*. Once that is done, the mathematical graph is complete. We can now compute the partial derivatives of the graph output with respect to the parameters, which are contained within the *trainable_weights* attribute of the partial graph termed *mod* (note that parameters are termed "weights" in Deep-Learning parlance). It becomes obvious here that the idea of a mathematical graph is somewhat reduced in the *Keras* synthax, as certain components of the graph are treated as separate *Keras* procedures (i.e. the mathmatical operations on a placeholder constant, the incorporation of the actual constant *x* and the calculation of the loss). *Keras* differentiates between a *model*, the application of the model to input data (*x*) and the evaluation of model performance (calculation of the loss) rather than regarding all these operations as one single graph. From the perspective of trying to understand the computation of partial derivatives, the latter perspective appears more advantageous, however.

with tf.GradientTape(persistent = True) as tape: mod_out = mod(tf.reshape(x, [1,]), training = True) mod_out = (y - mod_out)**2 tape.gradient(mod_out, mod.trainable_weights)

The second option we have to utilize the partial graph "mod" is to optimize directly, or, in Deep-Learning parlance, "to train the model". To this end, we need to "compile" the partial graph using the *Keras* *compile()* function, where we specify the optimization algorithm and the loss type. As optimizer, we here select *SGD*, which stands for *stochastic gradient descent* and is the most basic from of optimization algorithm. We specify *MSE* as loss, which stands for *mean-squared error* and is identical to calculating the squared difference between the output of the activation function and *y* (when multiple data points, i.e., multiple *x* and *y*, are involved, the mean of the squared differences is computed in addition). We then invoke the *fit()* functional attribute of the partial graph to bring it all together: Now, the graph is completed by inserting the constants *x* and *y* (these need to be specified as arguments to *fit()*) and by adding the calculation of the squared difference, which is the final or top-level operation of the graph. *fit()* also immediately computes the partial derivatives of the output of the graph to its variables and applies these derivatives to update the variables, i.e. it optimizes the graph. These procedures are thus similar to invoking the *optimr()* function in R, though the mathematical graph is better visible in its full extent in *optimr()*.

mod.compile(optimizer = 'SGD', loss = 'MSE') mod.fit(tf.reshape(x, [1,]), y = tf.reshape(y, [1,]))

We have thus compared a series of procedures for calculating partial derivatives of a mathematical function (synonymically termed a graph). We have observed that some options, like using *deriv()* or *SymPy*, provide the user with a good understanding of the concept of mathematical graphs, and with the knowledge that a model, its precictions and the predicitive loss must be considered as one sequence of operations in order to optimize the model parameters, but are a bit inconvenient to use for applied tasks, especially larger-scale tasks. Other options like *TensorFlow* enable rapid optimization of functions of very large scale, but tend to hide the concept of the mathematical graph from the user. This distinction could lead to frustrating experience when trying to optimize functions that do not align with the neural-network templates, as indicated by the apparently large variety of user questions regarding difficulties obtaining gradients found on the internet. *Template Model Builder* provides a good understanding of the concept of the mathematical graph, as model predictions and loss calculation must be part of the C++ script, but the requirement of coding in C++ and the difficulty of tracing back errors is likely off-putting to many users. *optimr* is relatively user-friendly to users proficient in R, but is mostly used for immediate optimization and not for computation of partial derivatives and custom optimization. It thus appears that a user-friendly implementation of computing partial derivatives that also provides a good didactive basis on the concept is still waiting to be published...