**Implementing k-means clustering in C++**

Machine-learning analyses are often conducted in the Python or R programming languages, which are familiar to non-computer-scientists. C++, on the other hand, produces much faster-running code, but is comparatively tedious to write. Here, we are going to implement a classical clustering algorithm in C++.

The programming languages R and Python are commonly used for tasks related to machine learning, modeling and statistical analyses in the environmental sciences and elsewhere. They are relatively easy to use and, in particular, to debug, thanks to the many convenience functions available in the "base" working environment and to sophisticated integrated development environments. However, these languages are comparatively slow in execution - a fact that does not become particularly obvious in small-scale routine work, but that can become an issue in larger-scale tasks involving hundreds to thousands of sequential processing steps. R and Python are so-called *interpreted* languages: user-written code is not compiled to machine code ahead of time, but is executed step by step by an interpreter (which is itself typically implemented in C or C++, as are many of the fast built-in functions these languages offer). In many situations, this is actually helpful, since it relieves the user from coding in C or C++ directly (and makes coding less error-prone). However, this workflow is not always the most efficient one. Programs that consist of many lines of R or Python code, and that do not mainly call fast "black-box" functions from imported (compiled) packages, are likely to run faster when written directly in C++ (a so-called *compiled* language), as the per-operation interpretation overhead disappears.

Writing C++ code, however, can be challenging, especially for more complex scripts. A major reason is that the core C++ language provides few convenience functions beyond elementary mathematical operations - there is, for example, no ready-made function for calculating a mean or a standard deviation (though the standard library does offer building blocks such as *std::min()*). Furthermore, debugging C++ code is somewhat more tedious than debugging R or Python code, as C++ scripts must be compiled in their entirety before they can be run for testing. Interactive error tracing as offered by R and Python is less immediately available (dedicated debuggers such as *gdb* exist, but are more cumbersome to use), so it can be necessary to comment out parts of the C++ code and check line by line, or code block by code block, for errors (by re-compiling and running the code with sections commented and un-commented).

A few general rules regarding C++ code should be mentioned before we look at a proper example of a C++ script. First, every line of code that is not the start or end of a loop or function ends with a semicolon. During compilation, the semicolon is interpreted as a delimiter between two statements, i.e. two different operations. In R and Python code, one generally does not use a semicolon or any other symbol to declare the end of a line (though using a semicolon in R should not result in an error message, either). Second, commenting a line of code is possible in C++, but is not done with a hash symbol; instead, a double forward slash ("//") is used. Third, while loops are generally avoided in the R and Python programming languages due to their inefficiency there, they are a core element of C++ code. That is primarily because vectorized operations (e.g. matrix multiplications or array-scalar operations) are not part of the core C++ language. What is written in R or Python as a vectorized operation is expressed in C++ as a loop that iterates through every element of the vector, matrix or array. Because C++ code is compiled directly to machine code, rather than being interpreted from a higher-level language, such loops are efficient. Fourth, it is necessary to declare variables before using them. That is, we cannot directly assign, e.g., the result of a mathematical operation to a previously unseen object, as in R or Python. Instead, we first need to declare said variable (the more appropriate term than "object") with its expected data type and shape (e.g. a double, an integer, a string, an array, and so on). C++ allocates memory for these variables, and their values can then be changed via mathematical operations or by assigning (altered) values from other variables. Finally, C++ code is usually written as a function called *main()* that returns a zero; since C++ code is executed as a compiled program, the *main()* function marks the component of the script that is translated into machine code and serves as the entry point of the program.

Let us now look at an implementation of the K-Means clustering algorithm in C++. As a brief reminder: the K-Means clustering algorithm estimates the centers (multidimensional means) of a fixed (assumed) number of clusters by iteratively assigning each data point to the nearest current estimate of a cluster center (in terms of Euclidean distance) and then updating said centers by averaging over the assigned data points. The number of clusters can decrease over the iterations (in cases where no data points are assigned to a cluster center), but it cannot increase. The implementation in C++ takes many more lines of code than an implementation in R or Python, but ultimately runs much faster. We start by including a number of headers required for reading the data set from the hard drive and for the clustering itself, though the number of functions drawn from them is quite small. They are imported via the *#include* directive; this operation actually happens outside the *main()* context.

```cpp
#include <iostream>
#include <fstream>
#include <string>
#include <stdlib.h>
#include <cmath>
#include <algorithm> // for std::min() and std::max(), used during standardization
#include <random>
```

The first step - and actually a challenge - is to load the data set to which we wish to apply the clustering algorithm. We are going to work with the "iris" data set, which is available in the base R environment, but (to my knowledge) not directly from C++. Therefore, I exported the "iris" data set from R as a .txt file. The content of this file must be loaded into the context of the C++ program to make use of the data set. We first need to declare a number of variables that we are going to use when writing code for loading the .txt file. Using the *std* namespace of the standard library, we declare "my_file" as an *ifstream* variable, i.e. a variable that can access a readable file on the hard drive. We also declare the *std* strings "my_line" and "token", which represent a single line of the .txt file and a single component of that line (as defined by the delimiter used when writing the file), respectively. These variables are required for extracting the content of the .txt file into the context of the C++ program. We further declare the string "delimiter" and the strings "spec_0" and "spec_1", which we need for splitting the lines read from the .txt file properly, and for checking for the different species included in the data set (we want to check later on whether the detected clusters match the "true" clusters, i.e. the three different *Iris* species). Finally, we declare an array "my_array" with 150 rows and four columns (the "iris" data set contains 150 data points, four columns with measurements on flower leaves and one column with species names), which we will fill with the plant-measurement data extracted from the .txt file, and a vector ("spec_vec") with 150 elements, which we will fill with zeros, ones and twos depending on the species extracted.

```cpp
int main(){
    std::ifstream my_file;
    std::string my_line;
    std::string token;
    std::string delimiter = ",";
    std::string spec_0 = "\"setosa\"";
    std::string spec_1 = "\"versicolor\"";
    double my_array[150][4];
    int spec_vec[150];
    // ...
}
```

Next, we open the .txt file and use a loop to read it line by line, split each line according to the delimiter defined above, paste the first four values of the line into our array, and insert a zero, one or two into the "spec_vec" vector depending on whether the fifth value in the line is the string *setosa*, *versicolor* or *virginica*. Instead of a *for* loop we use a *while* loop that keeps reading lines from the file as long as the condition "my_file.good()" is fulfilled, i.e. the variable *my_file* is constantly being checked for "containing more lines" while line after line is read. Note that the index variable "ind" starts at -1, so that the first line of the file (the column headers) is skipped before any values are recorded.

```cpp
int main(){
    // ...
    int ind;
    ind = -1;
    my_file.open("/home/jan/iris.txt");
    while(my_file.good()){
        std::getline(my_file, my_line);
        // logical && (not bitwise &); the upper bound of 150 keeps "ind" within the array
        if(ind > -1 && ind < 150){
            for(int i = 0; i < 5; ++i){
                token = my_line.substr(0, my_line.find(delimiter));
                if(i < 4){
                    double var_val = atof(token.c_str());
                    my_array[ind][i] = var_val;
                } else {
                    if(token == spec_0){
                        spec_vec[ind] = int(0);
                    } else {
                        if(token == spec_1){
                            spec_vec[ind] = int(1);
                        } else {
                            spec_vec[ind] = int(2);
                        }
                    }
                }
                my_line = my_line.erase(0, my_line.find(delimiter) + delimiter.length());
            }
        }
        ind = ind + 1;
    }
    // ...
}
```

Next, we standardize the numeric data. In C++, this is more difficult to do than in R or Python, since we require the means and standard deviations of the four variables, which we have to compute ourselves rather than using a function. We loop over the four variables and for each variable first compute the sum over the values of all data points by looping over the 150 rows of the data set. Then we divide the sum by 150 to obtain the mean. Next, we compute the standard deviation by summing the squared difference of the data and the variable-specific mean over all data points. The *pow()* function is used to obtain the square of the difference. Finally, we compute the square-root of the sum divided by 150 to obtain the standard deviation. Then we can standardize the data by subtracting the variable-specific mean from each data point and dividing the difference by the variable-specific standard deviation.

```cpp
int main(){
    // ...
    double my_array_std[150][4];
    double data_means[4];
    double data_sds[4];
    double data_mins[4];
    double data_maxs[4];
    for(int i = 0; i < 4; ++i){
        double my_vec[150];
        double my_vec_sum = 0;
        double for_sd = 0;
        data_mins[i] = my_array[0][i];
        data_maxs[i] = my_array[0][i];
        for(int j = 0; j < 150; ++j){
            my_vec[j] = my_array[j][i];
            my_vec_sum += my_vec[j];
            // determine the minimum and maximum of the given variable over all data points
            data_mins[i] = std::min(data_mins[i], my_vec[j]);
            data_maxs[i] = std::max(data_maxs[i], my_vec[j]);
        }
        data_means[i] = my_vec_sum / 150;
        for(int j = 0; j < 150; ++j){
            for_sd += pow(my_vec[j] - data_means[i], 2);
        }
        data_sds[i] = sqrt(for_sd / 150);
        for(int j = 0; j < 150; ++j){
            my_array_std[j][i] = (my_vec[j] - data_means[i]) / data_sds[i];
        }
    }
    // ...
}
```

We generate the initial cluster centers, the number of which we set to 10 (this number can decrease while running the clustering algorithm, but it cannot increase). We declare a matrix named "clust_vec", with the number of rows equaling the number of clusters, and the number of columns equaling the number of dimensions of the data (four). This matrix is filled with random initial values. We loop over all clusters and all variables and, in each step, generate a distribution object from which to sample a value. The properties of the distribution depend on the variable at hand; here we use a uniform distribution bounded by the standardized minimum and maximum of the given variable (the minima and maxima were determined alongside the standardization of the data above, by looping through the data and comparing each data point to the current minimum (or maximum); see code chunk above). We then use a *std::mt19937* random-number generator to draw a random value from the distribution.

```cpp
int main(){
    // ...
    int num_clusts = 10;
    double clust_vec[num_clusts][4];
    // one generator, seeded once outside the loops, so that successive draws are independent
    auto numgen = std::mt19937 {0};
    for(int i = 0; i < num_clusts; ++i){
        for(int j = 0; j < 4; ++j){
            auto distr = std::uniform_real_distribution<double> {(data_mins[j] - data_means[j]) / data_sds[j],
                                                                 (data_maxs[j] - data_means[j]) / data_sds[j]};
            clust_vec[i][j] = distr(numgen);
        }
    }
    // ...
}
```

We then calculate the Euclidean distance between cluster center and data point for each data point and each cluster over the four variables (strictly speaking, the squared Euclidean distance; omitting the square-root does not change which center is closest). To this end, we calculate the squared difference between cluster center and data point using the *pow()* function, sum these differences over the four variables and record the sum in an empty matrix (with as many rows as data points and as many columns as clusters). While looping over the ten cluster centers, for each data point, we record the sum of squared differences and the index of the corresponding cluster whenever that sum is lower than the previous record (the record is initialized with the first sum of squared differences and the index of the first cluster, i.e. zero). Both values are recorded in vectors with as many elements as there are data points. The purpose of this operation is to find, for each data point, the cluster center (or rather, its index) that is closest in terms of Euclidean distance.

```cpp
int main(){
    // ...
    double clust_diffs[150][num_clusts];
    int mindiff_idxs[150];
    double tot_loss = 0;
    for(int i = 0; i < 150; ++i){
        for(int j = 0; j < num_clusts; ++j){
            clust_diffs[i][j] = 0;
        }
    }
    double best_clust_diff[150];
    for(int i = 0; i < 150; ++i){
        best_clust_diff[i] = 0;
    }
    for(int i = 0; i < 150; ++i){
        for(int j = 0; j < num_clusts; ++j){
            for(int k = 0; k < 4; ++k){
                clust_diffs[i][j] += pow(clust_vec[j][k] - my_array_std[i][k], 2);
                tot_loss += pow(clust_vec[j][k] - my_array_std[i][k], 2);
            }
            if(j == 0){
                // initially assume the first cluster to be the best "matching" one
                mindiff_idxs[i] = j;
                best_clust_diff[i] = clust_diffs[i][j];
            } else {
                // compare the loss for the jth cluster with that of the best cluster found so far
                if(clust_diffs[i][j] < best_clust_diff[i]){
                    // record the jth cluster if it matches better than the previous best match
                    mindiff_idxs[i] = j;
                    best_clust_diff[i] = clust_diffs[i][j];
                }
            }
        }
    }
    tot_loss = tot_loss / num_clusts;
    // ...
}
```

Next, we count the number of clusters remaining after the assignment of data points to them. Note that we started with a relatively large number of clusters - hence, it is likely that not all cluster centers have data points assigned to them, which reduces the number of clusters in the cluster analysis. We set up an integer "num_new_clusts" with value zero and then loop over all clusters. In an inner loop over all data points, as soon as the index of the current cluster appears in the vector that recorded the cluster index of lowest Euclidean distance, we add one to "num_new_clusts" and break the inner loop, as we only want to know whether or not the current cluster has data assigned to it, not how many data points were assigned. The outer loop then continues with the next cluster.

```cpp
int main(){
    // ...
    int num_new_clusts = 0;
    for(int i = 0; i < num_clusts; ++i){
        for(int j = 0; j < 150; ++j){
            if(i == mindiff_idxs[j]){
                num_new_clusts += 1;
                break;
            }
        }
    }
    // ...
}
```

Next, using an almost identical loop, we determine the index value of every remaining cluster. We set up a vector with as many elements as there are remaining clusters (as determined in the previous step). We loop over all original clusters, and for each original cluster we loop over all data points, and assign the cluster index to the vector once it has been found within the vector that recorded the cluster index of lowest Euclidean distance for each data point. (Note that for this operation, we require an additional index "k" that is increased by one whenever a new cluster index has been found in order to move "forward" in the vector recording the unique cluster indices).

```cpp
int main(){
    // ...
    int unique_new_clusts[num_new_clusts];
    int k = 0;
    for(int i = 0; i < num_clusts; ++i){
        for(int j = 0; j < 150; ++j){
            if(i == mindiff_idxs[j]){
                unique_new_clusts[k] = mindiff_idxs[j];
                k += 1;
                break;
            }
        }
    }
    // ...
}
```

Finally, we update the cluster centers: for each of the remaining clusters, we first sum up the values of the data points assigned to the given cluster center (separately for each variable, of course). Within the same loop, we also count the number of data points assigned to the given cluster center by increasing a counter (initialized with zero) whenever that cluster center is encountered. For both operations, as above, we use the recorded cluster indices of lowest Euclidean distance to each data point. For this purpose, we loop over all remaining clusters and all four variables. We then divide the summed-up values for every remaining cluster and every variable by the number of data points assigned, and thus obtain the mean value over all assigned data points for each variable and cluster. These are the new estimates of the cluster centers. Note that for practical reasons, we record the new estimates of the cluster centers (and before that the summed values and the counts of occurrence of each cluster) in matrices and vectors whose number of rows (or elements) matches the original number of clusters (ten). When running the clustering algorithm iteratively in the next step, it would not be possible to generate and use a new set of matrices and vectors matching the number of remaining clusters in each iteration, as variables declared inside a loop go out of scope at the end of each iteration and are thus discarded.

```cpp
int main(){
    // ...
    float new_clusts_pre[num_clusts][4];
    float new_clusts[num_clusts][4];
    int clust_counter[num_clusts];
    for(int i = 0; i < num_new_clusts; ++i){
        clust_counter[i] = 0;
    }
    for(int i = 0; i < num_new_clusts; ++i){
        for(int j = 0; j < 4; ++j){ // four variables (columns), independent of the number of clusters
            new_clusts_pre[i][j] = 0;
            new_clusts[i][j] = 0;
        }
    }
    for(int j = 0; j < num_new_clusts; ++j){
        for(int k = 0; k < 4; ++k){
            for(int i = 0; i < 150; ++i){
                if(unique_new_clusts[j] == mindiff_idxs[i]){
                    if(k == 0){
                        clust_counter[j] += 1; // count each assigned data point only once, not once per variable
                    }
                    new_clusts_pre[j][k] += my_array_std[i][k];
                }
            }
        }
    }
    for(int i = 0; i < num_new_clusts; ++i){
        for(int j = 0; j < 4; ++j){
            new_clusts[i][j] = new_clusts_pre[i][j] / clust_counter[i];
        }
    }
    // ...
}
```

We save the clustering results to disk by creating and opening a writable, tab-delimited .txt file. First, we add a header line with column names. Then, for every data point, we add a line containing a zero (the index of the first clustering iteration), the data-point index, the values of the four variables (on their original scale), the species code (zero, one or two, as recorded at the beginning of the script), the index of the closest cluster center, and the total summed squared differences between data points and cluster centers over all data (see code annotations for details on the computation).

```cpp
int main(){
    // ...
    std::ofstream my_file_out;
    my_file_out.open("iris_clustering_results.txt");
    // header line
    my_file_out << "iteration" << '\t' << "index" << '\t' << "Sepal.Length" << '\t' << "Sepal.Width" << '\t'
                << "Petal.Length" << '\t' << "Petal.Width" << '\t' << "species" << '\t' << "cluster" << '\t'
                << "loss" << '\n';
    for(int i = 0; i < 150; ++i){
        my_file_out << 0 << '\t' << i << '\t' << my_array[i][0] << '\t' << my_array[i][1] << '\t'
                    << my_array[i][2] << '\t' << my_array[i][3] << '\t' << spec_vec[i] << '\t'
                    << mindiff_idxs[i] << '\t' << tot_loss << '\n';
    }
    // ...
}
```

Now, we repeat all the above procedures - allocating the standardized data points to the (updated) cluster centers and updating the cluster centers as the means of the allocated data points - a number of times. Usually, these steps are repeated until no marked changes in the cluster centers occur anymore. In this illustrative example, we limit the number of iterations to 20, which should be sufficient to reach that state, at least in this relatively simple case. In order to ensure that previously discarded clusters are ignored - remember that we need to work with recording matrices and vectors whose extent matches the original, maximum number of clusters (ten) - we fill these matrices and vectors with a large sentinel value (here 999) that will certainly not come to be regarded as a component of a "closest" cluster center. We then only overwrite those rows that correspond to remaining cluster indices with actual data (e.g. with the sums or means of the data points assigned to a given cluster center). These two operations ensure that cluster centers discarded in previous iterations (because no data points were assigned to them) do not influence the clustering in later iterations.

```cpp
int main(){
    // ...
    for(int h = 0; h < 20; ++h){
        double tot_loss = 0;
        // reset the distance records; 999 serves as a sentinel for "not yet computed"
        for(int i = 0; i < 150; ++i){
            for(int j = 0; j < num_new_clusts; ++j){
                clust_diffs[i][j] = 999;
            }
        }
        for(int i = 0; i < 150; ++i){
            best_clust_diff[i] = 0;
        }
        // assign each data point to the closest remaining cluster center
        for(int i = 0; i < 150; ++i){
            for(int j = 0; j < num_new_clusts; ++j){
                for(int k = 0; k < 4; ++k){
                    if(k == 0){
                        if(clust_diffs[i][j] == 999){
                            clust_diffs[i][j] = 0;
                        }
                    }
                    clust_diffs[i][j] += pow(new_clusts[j][k] - my_array_std[i][k], 2);
                    tot_loss += pow(new_clusts[j][k] - my_array_std[i][k], 2);
                }
                if(j == 0){
                    mindiff_idxs[i] = j;
                    best_clust_diff[i] = clust_diffs[i][j];
                } else {
                    if(clust_diffs[i][j] < best_clust_diff[i]){
                        mindiff_idxs[i] = j;
                        best_clust_diff[i] = clust_diffs[i][j];
                    }
                }
            }
        }
        tot_loss = tot_loss / num_new_clusts;
        // write the results of this iteration to the output file
        for(int i = 0; i < 150; ++i){
            my_file_out << h+1 << '\t' << i << '\t' << my_array[i][0] << '\t' << my_array[i][1] << '\t'
                        << my_array[i][2] << '\t' << my_array[i][3] << '\t' << spec_vec[i] << '\t'
                        << mindiff_idxs[i] << '\t' << tot_loss << '\n';
        }
        // count the clusters that still have data points assigned to them
        num_clusts = num_new_clusts;
        num_new_clusts = 0;
        for(int i = 0; i < num_clusts; ++i){
            for(int j = 0; j < 150; ++j){
                if(i == mindiff_idxs[j]){
                    num_new_clusts += 1;
                    break;
                }
            }
        }
        // determine the indices of the remaining clusters
        int unique_new_clusts[num_new_clusts];
        int k = 0;
        for(int i = 0; i < num_clusts; ++i){
            for(int j = 0; j < 150; ++j){
                if(i == mindiff_idxs[j]){
                    unique_new_clusts[k] = mindiff_idxs[j];
                    k += 1;
                    break;
                }
            }
        }
        // update the cluster centers as the means of the assigned data points
        for(int i = 0; i < num_new_clusts; ++i){
            clust_counter[i] = 0;
        }
        for(int i = 0; i < num_new_clusts; ++i){
            for(int j = 0; j < 4; ++j){ // four variables (columns), independent of the number of clusters
                new_clusts_pre[i][j] = 999;
                new_clusts[i][j] = 999;
            }
        }
        for(int j = 0; j < num_new_clusts; ++j){
            for(int k = 0; k < 4; ++k){
                for(int i = 0; i < 150; ++i){
                    if(unique_new_clusts[j] == mindiff_idxs[i]){
                        if(k == 0){
                            clust_counter[j] += 1; // count each assigned data point only once, not once per variable
                        }
                        if(new_clusts_pre[j][k] == 999){
                            new_clusts_pre[j][k] = 0;
                        }
                        new_clusts_pre[j][k] += my_array_std[i][k];
                    }
                }
            }
        }
        for(int i = 0; i < num_new_clusts; ++i){
            for(int j = 0; j < 4; ++j){
                new_clusts[i][j] = new_clusts_pre[i][j] / clust_counter[i];
            }
        }
    }
    my_file_out.close();
    return 0;
}
```

After the loop, we add one line that closes the opened .txt file (to which lines were added in every iteration of the loop), and a line that reads *return 0* (see code chunk above). To run this script, we first have to save it (using the file ending ".cpp") and compile it. Compilation is performed on the command line (i.e., in the console) using the *g++* compiler. The result of the compilation is an executable file (on Windows typically with the file ending ".exe", on Linux usually without an ending). The program is then run from the command line with the common notation *./filename* (this works on Linux, and similarly on other Unix-like operating systems). Note that the executable file should be placed in the same directory as the *iris.txt* file before usage. You will see when running the executable that it finishes in an instant - significantly faster than running a comparable script in R or Python. On the downside, de-bugging such a C++ program is much more tedious work, so it is best to write C++ programs instead of R or Python scripts when you know exactly what you are doing, and when execution speed is a limiting factor.

The above code in its entirety can be found here.