ExDataScientia

Common errors in coding with C++

Porting model code to C++ can greatly reduce model running time. Coding in C++, however, has some pitfalls that are not straightforward to overcome. Here, we look at three common sources of coding errors in C++.

When running large numerical models or graph-based models like deep neural networks, it is essential to code efficiently to keep running time within reasonable bounds. One common approach is to write the core model in a fast, compiled programming language like C++ instead of the more commonly used, slower interpreted languages R or Python. Because C++ code normally cannot be run directly in the console of a development environment (like RStudio), it is relatively easy to write a C++ script containing errors that prevent it from running at all or from running correctly. Discovering and removing these errors typically requires one or several rounds of "debugging", in which the script is modified, compiled, run, and modified again.

The three types of error are compilation errors, run-time errors, and logical / mathematical errors. Compilation errors are caused by invalid syntax and make compilation of the C++ script fail, which prevents the script from being run at all. For completeness, it should be mentioned that every C++ script needs to be compiled before it can be run; that is, the code is translated into a file written in machine language, which can afterwards be fed with input and run. (In interpreted languages like R, simply speaking, each line of code is processed anew by the interpreter whenever that line is run, i.e. not just once, leading to slower running time.) A compilation error may occur because our knowledge of C++ syntax is limited or because of an oversight. This easily happens to beginners, and also because IDEs like RStudio often do not point out such errors as they do for e.g. R code, where one may (subconsciously) depend on the little warning signs and red underlining that appear when coding. Depending on system specifics, the compilation error message printed to the console may help to identify the location and source of the error.

The following is a small C++ script with a syntax error (written here for compilation with the R package Template Model Builder (TMB); what the script does is largely irrelevant):

#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_INTEGER(n_obs);
  DATA_VECTOR(x);
  DATA_VECTOR(y);
  
  PARAMETER(a);
  PARAMETER(b);
  
  Type loss = 0;
  
  a = exp(a);
  
  for(int i = 0; i < n_obs; ++i){
    loss += (y(i) - log(a * x(i)) + b)^2;
  }
  
  return loss;
}

The syntax error lies in the power expression inside the for loop (the line starting with loss +=). C++ has no exponentiation operator: ^ is the bitwise XOR operator, which is defined for integers and does not compute a power, so a power must be written as pow(a, b) instead of a^b. Therefore, trying to compile this script will result in an error message. You can save the script above as Cpp_Errors.cpp and try to compile it in an R session using the following code:

library('TMB')

a = 2.3
b = 5.3
x = runif(10, 1, 12)
y = log(a * x) + b
y = y + runif(length(y), 0, 0.3)

compile('Cpp_Errors.cpp') # compilation command

If you correct the flawed line in the C++ script, from loss += (y(i) - log(a * x(i)) + b)^2; to loss += pow(y(i) - log(a * x(i)) + b, 2);, compilation should run successfully.
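To make the difference between ^ and pow concrete, here is a minimal sketch in plain C++ (without TMB). The function names xor_of and power_of are made up for this illustration. With integer operands, ^ compiles but performs a bitwise XOR, which can make this mistake hard to spot in other scripts.

```cpp
#include <cmath>

// Bitwise XOR on integers: this compiles, but it is not a power.
// In binary, 2 is 10 and 3 is 11, so 2 ^ 3 gives 01, i.e. 1.
int xor_of(int base, int exponent) {
    return base ^ exponent;
}

// Exponentiation via the standard library function std::pow.
double power_of(double base, double exponent) {
    return std::pow(base, exponent);
}

// The following line would not compile, because ^ is not defined for double:
// double broken = 2.0 ^ 3;
```

Here xor_of(2, 3) gives 1, whereas power_of(2.0, 3.0) gives 8.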

Run-time errors cannot be detected during compilation, because they depend on the values the program works with when it is executed; typically they concern subsetting (indexing) operations. A typical run-time error is subsetting e.g. a vector with an index value that is larger than the number of elements in that vector, which can happen when writing a "for" loop without paying attention to the mismatch between the number of iterations and the length of the vector. Scripts flawed in this manner will usually compile successfully, but executing them will typically crash the R session from which the script is started. Recognizing the existence of a run-time error is therefore easy, but spotting its exact location in the script is not, which makes run-time errors one of the most annoying types of error to debug. A common strategy is to comment out parts of the script, re-compile and run it, and then repeat the process, successively un-commenting sections until the run-time error occurs; this pinpoints the error, which can then be corrected.

In the following, we try to run the successfully compiled C++ script with flawed input data:

library('TMB')

a = 2.3
b = 5.3
x = runif(10, 1, 12)
y = log(a * x) + b
y = y + runif(length(y), 0, 0.3)

compile('Cpp_Errors.cpp') # compilation command
dyn.load(dynlib('Cpp_Errors'))

data = list('n_obs' = length(y) + 1, 'x' = x, 'y' = y)
parameters = list('a' = -1, 'b' = 1)

obj = MakeADFun(data, parameters, DLL = 'Cpp_Errors') # run the C++ script from within the R session

If you look at the C++ script above, you will see that it contains a loop whose number of iterations equals the integer n_obs, which we pass as input data to the C++ program in the R script above. Within the loop, the vectors x and y are subsetted with the index i, so n_obs should equal the length of y. In the above R script, however, we have set n_obs to length(y) + 1. When we execute the C++ program, it will therefore attempt to subset y at an index beyond its length (since C++ indices start at 0, the last iteration uses i = 10, i.e. an 11th element of a vector that has only 10 elements). As a result, running the C++ program via the MakeADFun function will typically crash the R session.
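One way to turn such a crash into a readable error is to validate the loop bound before indexing. The following is a minimal sketch in plain C++ using std::vector rather than TMB's vector type. The helper checked_sum is hypothetical and only illustrates the idea; it is not part of TMB.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: sums the first n_obs elements of y,
// but refuses to run if n_obs does not fit the length of y.
double checked_sum(const std::vector<double>& y, int n_obs) {
    if (n_obs < 0 || static_cast<std::size_t>(n_obs) > y.size()) {
        throw std::invalid_argument("n_obs does not match the length of y");
    }
    double total = 0.0;
    for (std::size_t i = 0; i < static_cast<std::size_t>(n_obs); ++i) {
        // at() checks the index and throws std::out_of_range
        // instead of reading past the end of the vector.
        total += y.at(i);
    }
    return total;
}
```

With a vector of 10 elements, checked_sum(y, 10) returns the sum, while checked_sum(y, 11) throws an exception that names the problem instead of crashing.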

Finally, there can be "logical" errors that will (usually) not affect the successful compilation or running of a C++ script. However, such errors may lead to unexpected output, e.g. NaN ("not a number") values or positive or negative infinity. Such errors are mostly oversights of mathematical rules, e.g. that the square root and the logarithm of a negative number are undefined (returning NaN when computed), or that the logarithm of zero is negative infinity. Such oversights can occur e.g. in an optimization procedure, where one typically does not wish to bound the range of values that the parameter of interest can take (i.e. both positive and negative values are allowed), but where only positive values lead to sensible output (e.g. in a model where the logarithm of an expression involving the parameter is taken). In this case, one would correct the script by exponentiating the parameter before it enters the model. As with run-time errors, the strategy of choice is to comment out most of the script and iteratively un-comment it to pinpoint the error location, unless the mathematical oversight can be spotted immediately.
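The special values mentioned above can be reproduced and detected in a few lines of plain C++. This is a minimal sketch that assumes default floating-point settings (i.e. no compiler options such as fast-math); the function names are chosen for this illustration only.

```cpp
#include <cmath>

// Returns true when a computed value is NaN or positive/negative infinity.
bool is_invalid(double value) {
    return std::isnan(value) || std::isinf(value);
}

// Logarithm of a negative number: undefined, computed as NaN.
double log_of_negative() { return std::log(-1.0); }

// Logarithm of zero: negative infinity.
double log_of_zero() { return std::log(0.0); }

// Square root of a negative number: undefined, computed as NaN.
double sqrt_of_negative() { return std::sqrt(-1.0); }
```

A check like is_invalid can be applied to intermediate results while debugging, to find the first step at which a value stops being a finite number.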

The following is the C++ script with the syntax error corrected and with a logical error introduced. (Make sure that the re-compilation of the C++ script is done properly. To that end, the files with the suffixes .o and .so (or .dll on Windows), generated during the previous compilation in the R session directory, should be deleted; the R working environment should also be cleared and the R session re-started with Ctrl+Shift+F10):

#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_INTEGER(n_obs);
  DATA_VECTOR(x);
  DATA_VECTOR(y);
  
  PARAMETER(a);
  PARAMETER(b);
  
  Type loss = 0;
  
  for(int i = 0; i < n_obs; ++i){
    loss += pow(y(i) - log(a * x(i)) + b, 2); // syntax error corrected
  }
  
  return loss;
}

Note that in the above script, in comparison to the version shown earlier, we have omitted the line a = exp(a);. This matters, as a is part of the argument of a logarithm later in the script. As you can see in the above R script, the value we pass for a is -1. Without the exponentiation before the logarithm, the logarithm of a negative number is computed, which is undefined (a NaN value). Hence, when we execute the command obj = MakeADFun(data, parameters, DLL = 'Cpp_Errors') in the R session (with n_obs set to length(y), so that the run-time error described above does not occur) and look at the script output via obj$fn(), we will obtain a NaN value.
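The effect of the omitted exponentiation can also be checked outside R. The sketch below is a plain-double stand-in for the objective function, not TMB code: the function loss_value and its flag transform_a are assumptions made for this illustration, and the loss expression is copied as-is from the script above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Plain-double stand-in for the objective function (not TMB code).
// If transform_a is true, a is exponentiated first, so that log()
// receives a positive argument whenever x is positive.
double loss_value(const std::vector<double>& x, const std::vector<double>& y,
                  double a, double b, bool transform_a) {
    if (transform_a) {
        a = std::exp(a);
    }
    double loss = 0.0;
    // Loop only over indices that are valid for both vectors.
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
        loss += std::pow(y[i] - std::log(a * x[i]) + b, 2);
    }
    return loss;
}
```

With positive x and a = -1, calling loss_value with transform_a set to false yields NaN, because NaN propagates through the sum, whereas setting transform_a to true yields a finite number.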

The following is the fully corrected C++ script:

#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_INTEGER(n_obs);
  DATA_VECTOR(x);
  DATA_VECTOR(y);
  
  PARAMETER(a);
  PARAMETER(b);
  
  Type loss = 0;
  
  a = exp(a); // logical error corrected
  
  for(int i = 0; i < n_obs; ++i){
    loss += pow(y(i) - log(a * x(i)) + b, 2); // syntax error corrected
  }
  
  return loss;
}

And here is the fully corrected R script:

library('TMB')

a = 2.3
b = 5.3
x = runif(10, 1, 12)
y = log(a * x) + b
y = y + runif(length(y), 0, 0.3)

compile('Cpp_Errors.cpp') # compilation command
dyn.load(dynlib('Cpp_Errors'))

data = list('n_obs' = length(y), 'x' = x, 'y' = y) # runtime error corrected
parameters = list('a' = -1, 'b' = 1)

obj = MakeADFun(data, parameters, DLL = 'Cpp_Errors') # run the C++ script from within the R session

print(obj$fn())

The above should not result in any compilation, run-time or logical errors; hence, the R command obj$fn() should return a finite real scalar value.
