ExDataScientia

Classes in Python and R

Classes in Python and R are somewhat peculiar but very useful constructs. In Python, they essentially bundle several functions whose output is available on demand; in R, they can enforce an organized workflow between multiple programmers working on a common project.

When writing a function in a programming language like Python or R, one major limitation is that the function will have a very specific purpose. To address several related purposes, one could build one or several nested if-else statements into such a function, such that the function's behavior depends on user input. Another, more straightforward solution available in Python is the use of so-called classes.

A class in Python can be described as a template for objects that bundle several functions (called methods), all of which operate on the same input and in some cases also on one another's output. The functions are either explicitly called by the user or are called automatically when one function depends on another's output. Using a class always involves creating an object: with a plain function, we simply apply the function to some input (some object, e.g. a numeric vector) and receive an immediate output, whereas with a class, we use the input to initialize an object of the class that we have defined, and then we can do something with this object (i.e., apply the class's functions to the input).

Let us now take a look at an implementation of a class for the calculation of various statistical properties of an input. We begin by writing the keyword class, followed by the name that we want to give to that class (in this case "descr_stat_1"). This means that the following lines will define a class termed "descr_stat_1", i.e. we are going to define all the properties that an object of this class will have. The next block of code is indented to indicate that it belongs to the class definition. This first bit of code is the so-called __init__ function. It is part of nearly every Python class, and its purpose is to initialize an object of the class we define here, given some input. You will notice that this function requires two arguments: self and "numvec". The latter is a numeric vector and can be understood as a classical type of input that would also be required by a normal stand-alone function. self, on the other hand, is not provided by the user when creating an object of the class; Python passes it automatically, and it refers to the object being created. Within the function, it is used to "appropriate" the user input(s) by storing them on the object (self.numvec = numvec), such that they are available to all the subsequent functions in the class. This is important, as otherwise the functions of this class would look for the variable "numvec" outside of the class context (i.e., in the global environment). Note, thus, that the user input is not declared in the line defining the class name, but rather in the line defining the __init__ function.
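As a minimal sketch of the step just described, the class header and its __init__ function might look as follows. The class name "descr_stat_1" and the argument "numvec" are taken from the text; the example input and the variable name "obj" are only assumptions for illustration, and a plain Python list stands in for the numeric vector.

```python
class descr_stat_1:
    def __init__(self, numvec):
        # "Appropriate" the user input: store it on the object so that
        # every later function of the class can reach it via self.numvec
        self.numvec = numvec


# Hypothetical input, used only for illustration
example_input = [3.0, 5.0, 8.0]

# Only numvec is supplied by the user; Python fills in self automatically
obj = descr_stat_1(example_input)
print(obj.numvec)
```

The class has no other functions yet, so all we can do at this point is inspect the stored input.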

The next part of our custom class is a function calculating the mean of the input vector. Note that, as part of the class, the only argument the function requires is self, which means that the function accesses the objects appropriated by the class. The function generates the object "outp" by calculating the mean of self.numvec, i.e. it looks up this vector among the inputs stored on the object and does not look outside of the class environment. The "outp" object is then returned as the output of the function. We create a second function in the same manner, which calculates the standard deviation of the input vector appropriated by the class.
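Put together, the class described so far could be sketched as below. The name "calc_mean" appears in the text; "calc_sd" is a hypothetical name for the second function, and the use of NumPy's np.mean and np.std (the latter with its default population formula) is an assumption rather than necessarily the original implementation.

```python
import numpy as np


class descr_stat_1:
    def __init__(self, numvec):
        # Store the input on the object for use by the other functions
        self.numvec = numvec

    def calc_mean(self):
        # Look up the vector on the object, not in the global environment
        outp = np.mean(self.numvec)
        return outp

    def calc_sd(self):
        # Hypothetical name; np.std defaults to the population
        # standard deviation (ddof=0)
        outp = np.std(self.numvec)
        return outp


# Small hand-checkable input, used only for illustration
check_obj = descr_stat_1(np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(check_obj.calc_mean(), check_obj.calc_sd())
```

If the sample standard deviation is wanted instead, np.std accepts ddof=1.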

Let us now make use of the class just created. We generate a vector of 20 random numbers drawn from the range 0 to 100. This will serve as the input from which to calculate descriptive statistics. Then, we create an object of the class that we just defined. We do this by calling "descr_stat_1" like a function and supplying the numeric vector as input. As described above, the __init__ function of the class requires this input, and omitting it will lead to an error message. Note also that simply typing the name of the object just created will return nothing except a relatively cryptic message detailing the "identity" of the object. Thus, the functions of the class are not automatically executed. Instead, we need to explicitly call them following the syntax object_name.function_name(), in this case e.g. "stat_calc_1.calc_mean()".
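The usage described above might look roughly like this. The object name "stat_calc_1" is taken from the text; the random-number generator, its seed and the uniform distribution are assumptions, as the text only states that 20 numbers are drawn from the range 0 to 100.

```python
import numpy as np


class descr_stat_1:
    def __init__(self, numvec):
        self.numvec = numvec

    def calc_mean(self):
        return np.mean(self.numvec)


# 20 random numbers between 0 and 100 (seed chosen arbitrarily)
rng = np.random.default_rng(seed=1)
numvec = rng.uniform(0, 100, size=20)

# Create an object of the class by calling the class like a function
stat_calc_1 = descr_stat_1(numvec)

# Printing the object only shows a default description of its identity
print(stat_calc_1)

# The functions have to be called explicitly
print(stat_calc_1.calc_mean())

# Omitting the required input raises an error
try:
    descr_stat_1()
except TypeError as err:
    missing_input_error = str(err)
    print("TypeError:", missing_input_error)
```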

In this way, the object behaves like any object of a non-custom class. For example, our numeric vector is an object of class numpy.ndarray (from the numpy package), and we can access any of the functions stored in this class by the numpy developers, e.g. the max function by writing "numvec.max()". Our own custom class thus adds to the pantheon of classes already supplied by Python and its packages. Furthermore, classes are often intercompatible; in the present case, our custom class can deal with numpy-class objects. Conversely, we could apply e.g. the np.sqrt() function to the output of a descr_stat_1 object (i.e., to the value returned by one of the class's functions).
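To illustrate this interplay, the following sketch calls functions that belong to the numpy array itself and then passes the output of our custom class to a numpy function. The input values and the variable name "root_of_mean" are hypothetical and chosen so that the results are easy to check by hand.

```python
import numpy as np


class descr_stat_1:
    def __init__(self, numvec):
        self.numvec = numvec

    def calc_mean(self):
        return np.mean(self.numvec)


numvec = np.array([2.0, 8.0, 17.0])

# The vector is itself an object of a class with its own functions
print(type(numvec))      # <class 'numpy.ndarray'>
print(numvec.max())      # largest value
print(numvec.argmax())   # position of the largest value

# Our custom class accepts the numpy array as input ...
stat_calc_1 = descr_stat_1(numvec)

# ... and numpy functions accept the values our class returns
root_of_mean = np.sqrt(stat_calc_1.calc_mean())
print(root_of_mean)
```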

Now, we can also refer to existing (custom) classes when writing a new class. For demonstration, we will write another class as done above and then a third class referring to the first and second classes. The second class, named "descr_stat_2", will contain functions for calculating another set of descriptive statistics, that is, the median, 25th percentile, 75th percentile, minimum and maximum of a given numeric vector. Structurally, it mirrors the "descr_stat_1" class defined above, though it contains six instead of three functions (counting __init__).
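A possible sketch of this second class is given below. Only the class name "descr_stat_2" comes from the text; the function names (calc_median, calc_q25, calc_q75, calc_min, calc_max) are hypothetical, and the NumPy functions used are an assumption about how the statistics could be computed.

```python
import numpy as np


class descr_stat_2:
    def __init__(self, numvec):
        # Same pattern as in descr_stat_1: store the input on the object
        self.numvec = numvec

    def calc_median(self):
        return np.median(self.numvec)

    def calc_q25(self):
        # 25th percentile (NumPy's default linear interpolation)
        return np.percentile(self.numvec, 25)

    def calc_q75(self):
        # 75th percentile
        return np.percentile(self.numvec, 75)

    def calc_min(self):
        return np.min(self.numvec)

    def calc_max(self):
        return np.max(self.numvec)


# Small hand-checkable input, used only for illustration
stat_calc_2 = descr_stat_2(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
print(stat_calc_2.calc_median(), stat_calc_2.calc_q25(), stat_calc_2.calc_q75())
```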

The third class, named "stat_summary", deviates structurally a bit from the first and second classes: the __init__ function now not only appropriates the input, the numeric vector, for the class functions, but also initializes two objects by passing the appropriated input vector to the "descr_stat_1" and "descr_stat_2" classes. This means that any object of the "stat_summary" class will have access to all the functions of the other two classes, i.e. the third class depends on and builds upon the first and second classes. For this to work, we need to refer to the already appropriated numeric vector, "self.numvec", when initializing the two objects.

Next, we define a function named make_summary that will generate an overview of the statistical properties of the input vector. As in the other two classes, the only argument required by this function is self, which gives access to the appropriated input vector and to the initialized objects of classes "descr_stat_1" and "descr_stat_2". The overview is here created as a Python dictionary, essentially a named list of objects. Within the dictionary, we call the functions of the initialized class objects, which in turn calculate the mean, standard deviation etc. of the appropriated input vector. (Certainly, to make proper use of the advantages of a class over a function, we could add more functions, e.g. a plotting function for visualizing the descriptive statistics, but for the sake of simplicity, we will here stick to the single function.)

We create an object of the custom class stat_summary by passing the same numeric vector as before as input. As with the other two classes, simply typing the name of the object will return only a short cryptic description of the object, i.e. the function contained in the class is not automatically executed. Instead, we need to explicitly call the function on the object, i.e. in this case object_name.make_summary(). The function is then applied to the input vector, or more specifically to the internally initialized objects of the descr_stat_1 and descr_stat_2 classes, and the dictionary of descriptive statistics is returned as output.
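The third class and its use could be sketched as follows. The class name "stat_summary" and the function name make_summary come from the text; the attribute names (stats_1, stats_2), the function names of the two helper classes other than calc_mean, the dictionary keys and the object name "summary_obj" are all hypothetical. The two helper classes are repeated in compact form so that the example runs on its own.

```python
import numpy as np


class descr_stat_1:
    def __init__(self, numvec):
        self.numvec = numvec

    def calc_mean(self):
        return np.mean(self.numvec)

    def calc_sd(self):
        return np.std(self.numvec)


class descr_stat_2:
    def __init__(self, numvec):
        self.numvec = numvec

    def calc_median(self):
        return np.median(self.numvec)

    def calc_q25(self):
        return np.percentile(self.numvec, 25)

    def calc_q75(self):
        return np.percentile(self.numvec, 75)

    def calc_min(self):
        return np.min(self.numvec)

    def calc_max(self):
        return np.max(self.numvec)


class stat_summary:
    def __init__(self, numvec):
        self.numvec = numvec
        # Initialize objects of the other two classes from the
        # already appropriated input vector
        self.stats_1 = descr_stat_1(self.numvec)
        self.stats_2 = descr_stat_2(self.numvec)

    def make_summary(self):
        # Collect the statistics in a dictionary by calling the
        # functions of the two internal objects
        return {
            "mean": self.stats_1.calc_mean(),
            "sd": self.stats_1.calc_sd(),
            "median": self.stats_2.calc_median(),
            "q25": self.stats_2.calc_q25(),
            "q75": self.stats_2.calc_q75(),
            "min": self.stats_2.calc_min(),
            "max": self.stats_2.calc_max(),
        }


summary_obj = stat_summary(np.array([1.0, 2.0, 3.0, 4.0, 5.0]))
summary = summary_obj.make_summary()
print(summary)
```

Here the third class holds objects of the first two classes as attributes rather than inheriting from them, which matches the description of initializing both objects inside __init__.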

In R, the term class has a somewhat different definition than in Python. The so-called S4 classes discussed here (R also has the simpler S3 classes, among other class systems) act like templates for named lists that enforce that input of specific data types is put into the list. They are therefore most useful for ensuring internal consistency in a programming project that might be handled by multiple people with otherwise different programming styles. Furthermore, they can be useful for ensuring that a function that requires a set of certain inputs (of certain data types) actually receives all these inputs when called - which helps when that function is to be used in various contexts. However, unlike Python classes, R S4 classes do not contain functions themselves; functions that operate on S4 objects are defined separately, outside of the class definition.

As an example, we will here create a class that provides all the input required by a function that calculates and displays the performance of simple machine-learning models. The function will require two data sets, a training set and a test set, each consisting of a numeric vector representing the predicted variable and a data frame containing the predictor variables as columns. It will of course also require the model object to compute the model predictions.

We name that class "mod_perf_inp" (for "model-performance input"), and use the function setClass (from the methods package that ships with R) to construct it. As the first argument to this function, we provide the intended name of the class, and then a list of names for the so-called slots of the class. The slots are the named inputs, each with a declared type, that an object of our custom class holds. Functionally, this means that any value we (or any other user of this class) supply for a slot is checked against the declared type, which in turn ensures that a function applied to objects of this class will find its input under the expected names and in the expected format. (Strictly speaking, R fills a slot that is omitted with an empty default of the declared type instead of raising an error, so it remains our responsibility to supply all slots.) This way, using classes can serve to create an effective workflow, even between multiple programmers working on a common project.

In our case, we will work with a list of five slots: x_train, the training set of predictor variables; y_train, the corresponding training vector of target values (to be predicted); x_test and y_test, the test set; and mod, the model object that makes the prediction. For all slots except mod, we enforce a specific input type: a data frame for x_train and x_test, and a numeric vector for y_train and y_test. This means that attempting to create an object of class mod_perf_inp while providing e.g. a matrix instead of a data frame to the slot x_train will result in an error message, which ensures that functions further downstream receive exactly the input they require, in the format they require. The input type of mod is left deliberately vague as a list, as the machine-learning models we are going to test belong to a variety of different object types.

We now set up training and test data frames, and corresponding target vectors, by randomly subsetting the mtcars data set, where we can predict a car's efficiency based on various attributes of the car. 80 % of the 30 data points in the data set are used as training data, and the remaining 20 % as test data.

We then train three different machine-learning models to predict the variable mpg (miles per gallon) from the variables wt (weight) and drat (rear-axle ratio). These two predictors were pre-selected for the purpose of this exercise from the relatively large number of predictor variables in the mtcars data set because they predict mpg well with simple models and without data transformations, not by statistical variable-selection theory. We train a simple linear-regression model (using the function lm), a generalized additive model (GAM, using the gam function of the mgcv package) and a boosted regression-tree model (using the gbm function of the package of the same name).

We then create an object of our custom class mod_perf_inp for each of the three models by providing the training and test data sets (and target values), and the model object, packaged into a list. The three functions lm, gam and gbm all return objects of different types, whereas we declared only one acceptable type for the mod slot when setting up our custom class, so we need to resort to this workaround. In terms of effective programming, it is not too desirable, as some models might not respond to the functions we are planning to apply to them. (Luckily for us, we know in advance that the predict function we wish to apply does work on each of the three model types.)

m1_perf_inp = mod_perf_inp(x_train = mtcars_subs_train[,c('wt','drat')],
                            y_train = mtcars_subs_train$mpg,
                            x_test = mtcars_subs_test[,c('wt','drat')],
                            y_test = mtcars_subs_test$mpg,
                            mod = list(m1))

m2_perf_inp = mod_perf_inp(x_train = mtcars_subs_train[,c('wt','drat')],
                            y_train = mtcars_subs_train$mpg,
                            x_test = mtcars_subs_test[,c('wt','drat')],
                            y_test = mtcars_subs_test$mpg,
                            mod = list(m2))

m3_perf_inp = mod_perf_inp(x_train = mtcars_subs_train[,c('wt','drat')],
                            y_train = mtcars_subs_train$mpg,
                            x_test = mtcars_subs_test[,c('wt','drat')],
                            y_test = mtcars_subs_test$mpg,
                            mod = list(m3))

Finally, we write a function that calculates the prediction error on the training and test data for a machine-learning model. We write it to accept only input in the form of our new custom class (mod_perf_inp). Within the function, we need to extract the "contents" of the class object using the @ symbol. This is different from classical subsetting of lists and data frames in R, where the $ symbol is used. The operations carried out by the function are, in the following order: i) calculate model predictions for the training data from the model object (extracted from the list we packaged it into) and the data frame of training predictor data supplied by the custom-class input, ii) calculate the training loss as the sum of squared differences between these predictions and the vector of training target values, also supplied by the custom-class input, iii) calculate the test predictions as above, from the supplied data frame of test predictor data, iv) calculate the test loss as above from the supplied vector of test target values. The function returns a named list containing the training and test losses.
