The domains of data science
Data science is an umbrella term for several domains of analytical or prediction-oriented techniques whose common grounds may not be immediately visible. Here, we are going to take a broad look at these domains and their relationships.
The domains of data science can be arranged in two broad classes. The first class contains analytical and explorative techniques, i.e. procedures to test for the existence of an underlying relationship, difference or mechanism. These are typically used in biological or medical research questions in order to check whether a working hypothesis is supported or refuted by experimental data. This class contains the domains of statistical tests and regression, which are driven by one or multiple hypotheses, as well as correlation tests, principal-component analysis and clustering, which are purely explorative techniques, i.e. better suited to inform the creation of hypotheses. The second broad class contains techniques that are geared towards prediction, often in a context where the prediction causes some action to be taken or is incorporated into a model that can simulate future developments. These techniques are thus often used in engineering tasks, but also in general planning tasks, e.g. in environmental management. This second class contains the domains of classification, non-linear modeling and also general-purpose optimization. Some further, relatively specific techniques like time-series analyses and reinforcement learning may also be found in these broad classes. The single domains, and even the two broad classes, are not distinctly separated from each other. There are a number of techniques that stand in between two or more domains, and their traditional names may obscure their role as "domain-linking" techniques.
At the base of almost every technique and domain lies an optimization problem. In some techniques, especially the more prediction-oriented ones, this is rather obvious: Clearly, it is desirable to improve classification or prediction success. In parameter-based techniques, where, very basically speaking, an input is mapped to an output via multiplication with and / or addition to real-valued parameters, this is achieved by adjusting these parameters in a directed way. The goal is to obtain the smallest difference between prediction (the term prediction is here meant as anything that can be compared to a human-defined truth, i.e. it also includes classification output) and observation. This is often termed as "finding the (global) loss minimum on a loss surface". What is not too often mentioned is that this loss surface, and the loss minimum, are properties of a loss function. The loss function features as its independent variable one or more parameters (parameters are always numerical and real-valued), and as its dependent variable the divergence (or loss) between expectation (or observation) and the prediction made by a model (or technique) that incorporates a specific value of these parameters. (The loss is usually summed or averaged over all observations, and the calculation of the loss can be done in various ways, with a typical way being the computation of the squared difference between observation and prediction.) The optimization problem is then to find the place where the loss function has its minimum. This minimum exists at those parameter values that, when incorporated into the model (or technique) at hand, create the smallest divergence possible between observation and prediction.
Ideally, this divergence would be zero, but it hardly ever is, since observations may be erroneous (this is mostly the case in analytical techniques), or the model or technique has not been specified adequately (this happens in both analytical and prediction-oriented techniques, and is an avoidable human error). You may remember from calculus that finding an extremum of a function can theoretically be achieved by computing its first derivative, setting it to zero and solving for the independent variable. However, since in most cases a model or technique uses several parameters, it is not that easy to find an extremum. Iterative, approximating techniques like gradient descent or genetic algorithms are used instead to estimate the minimum of a loss function. The point-of-interest on which these algorithms converge is often checked for characteristics of a function minimum by computing the so-called Hessian matrix, which contains the second derivatives of the loss function with respect to each combination of its independent variables (i.e. the parameters of the model or technique). Calculating the second derivative of a function at an extremum is done to find out whether that extremum is a minimum (second derivative would be positive) or a maximum; in functions with multiple independent variables, where a Hessian matrix is computed instead of a single second derivative, other types of points-of-interest can occur, e.g. saddle points. It is almost always undesirable when a gradient-descent algorithm converges on such a point, since it means that a function minimum has not been found. In order to count as an acceptable point of convergence, the Hessian matrix should have only positive (or at least non-negative) eigenvalues; the matrix is then said to be positive-definite (or positive-semidefinite, respectively). Note that this is a statement about the eigenvalues of the matrix, not about its individual entries. A statistician or modeler dreads the moment when a gradient-descent algorithm returns the message "convergence failed. Hessian is not positive-semidefinite". 
In that case, the Hessian matrix has at least one negative eigenvalue, and it is clear that a minimum of the loss function was not found. While some adjustments to the gradient-descent algorithm can be made to try to overcome this problem, the message should also be taken as a clue that perhaps the data used in the model or technique are too noisy or too sparse, or that the model or technique was mis-specified.
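The chain described above (loss function, gradient descent, Hessian check) can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under stated assumptions: the data are synthetic, the model is a straight line with two parameters, the loss is the mean squared difference, and the names `loss`, `gradient` and `learning_rate` are chosen here for illustration and do not come from any particular library.

```python
import numpy as np

# Synthetic data (made up for illustration): y = 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=200)

# Design matrix: one column for the slope, one column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

def loss(params):
    """Mean squared difference between observation and prediction."""
    return np.mean((y - X @ params) ** 2)

def gradient(params):
    """First derivatives of the loss with respect to each parameter."""
    return -2.0 * X.T @ (y - X @ params) / len(y)

# Plain gradient descent: repeatedly step against the gradient.
params = np.zeros(2)
learning_rate = 0.1  # assumed step size, small enough for this problem
for _ in range(2000):
    params = params - learning_rate * gradient(params)

# For the squared loss of a linear model the Hessian is constant: 2 * X^T X / n.
hessian = 2.0 * X.T @ X / len(y)
eigenvalues = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
is_minimum = bool(np.all(eigenvalues > 0))

print("estimated slope and intercept:", params)
print("loss at the point of convergence:", loss(params))
print("Hessian eigenvalues:", eigenvalues)
print("all eigenvalues positive (a minimum):", is_minimum)
```

The check looks at the eigenvalues of the Hessian rather than at its entries; a matrix can contain negative entries and still have only positive eigenvalues.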
Information about the convergence of a gradient-descent algorithm is not always given by functions contained in the R and Python programming languages, nor is convergence, i.e. the successful finding of the minimum of the loss function for a given model, an indication that this model achieves optimum prediction performance (in applied models like classifiers) or that the assumptions in an analytical model (e.g. in statistical regression) that are being tested by optimizing the model on the available data are valid. This is easy to see and understand in the case of applied models: If the prediction or classification performance of one model is superior to that of another (even though a valid minimum was found for both models' loss functions), then the superior model is, of course, to be preferred (at least if the performance criteria have been correctly specified. If, for example, a classifier performs well on artificial test data, but fails in case studies, then it is likely not the optimum model for the job). In the analytical case, it is often more difficult to put model performance into context, since analytical models frequently yield relatively poor prediction performance due to lack of available data or measurement error. Still, it is arguably more important here to find the optimum model, since (scientific) statements derived from an optimized analytical model can have a large impact on e.g. policy making. Only an analytical model that has been designed with clear rationale and that has been selected over others (that are equally plausible from a scientific point-of-view) after rigorous performance testing with respect to multiple indicators will make a valid scientific statement.
Various tools exist to check model performance, both with respect to the optimization problem, and with respect to the design question. The most common tool is the calculation of the loss between prediction and observation (or machine classification and expert classification), often the squared difference between the two (if they are scalar metrics) summed over all data points. This single value is the value of the loss-function minimum of the model, and is often already a good indication of whether it is suited for the case at hand or not (i.e., if the model assumptions to be tested in an analytical model are valid or not). In the output of the optimization of regression models, a related value named explained variance, or R^2, appears: it gives the proportion of variance in the data that is explained by the model, i.e. it relates the loss between prediction and observation to the variation present in the data (which can be the result of the mechanism to be tested for existence using the model, and of additional measurement error). For analytical models, in particular analytical regression models, there exist further quality indicators, since the optimality of the model is so critical, and since models leading to different scientific outcomes can often have loss minima of similar magnitude. These include residual-distribution plots and quantile-quantile plots, which are visually analyzed by the model designer for the existence of residual patterns. Existence of these patterns indicates that the model performance increases or decreases (with respect to an explanatory variable) in a directed way, implying the existence of an additional or alternative mechanism that had not yet been considered. In effect, it indicates a wrong model design that would have potentially led to a wrong scientific statement if ignored.
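As a rough illustration of these indicators, the following Python sketch computes the summed squared loss, the explained variance R^2 and a very crude residual check for a straight-line fit. The data are synthetic, and the binned residual means are only a stand-in, assumed here for brevity, for the visual inspection of residual plots described above.

```python
import numpy as np

# Synthetic data (made up for illustration): a linear effect plus measurement error.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
y = 0.5 * x + 3.0 + rng.normal(0.0, 1.0, size=x.size)

# Least-squares fit of a straight line; polyfit returns the slope first.
slope, intercept = np.polyfit(x, y, deg=1)
prediction = slope * x + intercept
residuals = y - prediction

# Loss at the minimum: squared differences summed over all data points.
ss_res = np.sum(residuals ** 2)
# Total variation of the data around their mean.
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Crude pattern check: mean residual in five consecutive bins along x.
# Bin means that drift systematically away from zero hint at a missing mechanism.
bin_means = [chunk.mean() for chunk in np.array_split(residuals, 5)]

print("summed squared loss:", ss_res)
print("explained variance R^2:", r_squared)
print("mean residual per bin:", np.round(bin_means, 3))
```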
A further performance metric to consider in both applied and analytical models is the generalizability of the model. Theoretically, a model can be built to reduce the minimum of its loss function to zero by supplying a very large number of parameters. In regression, high-order polynomials can achieve this. This is not a desired behaviour, though, since it means that the model will almost certainly fail when applied, or will not inform any useful analytical statement. In some models or techniques, dealing with a large number of parameters is the norm, and here it is particularly important to check for the generalizability of the model design, especially when data are relatively scarce. Such models include Neural Networks (function approximators that "distill" an unknown function from a large number of net-like arranged parameters and are often used in classification of complex data like images), complex Generalized Additive Models (analytical models applied to complex nonlinear systems with unknown mechanisms or to systems including continuous background variables that are over-parameterized to "smooth them out") and mixed-effects models (analytical models applied to measurements with categorical background variables, e.g. individual-patient-effects in clinical studies). Here, one often keeps a validation dataset that is not used in the optimization procedure. It is instead used for testing model generalizability once the optimization process has been completed. If prediction or classification performance on the validation data is good, then the model is likely to perform well in application, or to give a valid analytical statement despite its high number of parameters. In regression models (also simple ones), it is also common to compute the Akaike Information Criterion of the model after optimization; basically speaking, it integrates prediction loss and the number of parameters into one indicator value, with lower values indicating the preferable model. 
It is particularly useful when it is not feasible to keep a validation dataset (due to a lack of data), or when the number of parameters does not differ very much between "competing" models.
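The trade-off between loss and number of parameters can be sketched as follows in Python. Everything here is an illustrative assumption: the data are synthetic, the competing models are polynomials of degree 1 and 8, a third of the data is held back for validation, and the AIC is computed in its least-squares form, n * ln(mean squared loss) + 2k, which is valid only up to an additive constant and under the assumption of normally distributed errors.

```python
import numpy as np

# Synthetic data (made up for illustration): a linear effect plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=30)
y = 1.5 * x + rng.normal(0.0, 0.3, size=30)

# Hold back a third of the data; it is never used in the optimization.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def fit_and_score(degree):
    """Fit a polynomial on the training data and score it in three ways."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_loss = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    val_loss = np.mean((y_val - np.polyval(coeffs, x_val)) ** 2)
    n, k = len(y_train), degree + 1  # k counts the polynomial coefficients
    # Least-squares form of the AIC (up to a constant, Gaussian errors assumed).
    aic = n * np.log(train_loss) + 2 * k
    return train_loss, val_loss, aic

results = {degree: fit_and_score(degree) for degree in (1, 8)}
for degree, (train_loss, val_loss, aic) in results.items():
    print(f"degree {degree}: training loss {train_loss:.4f}, "
          f"validation loss {val_loss:.4f}, AIC {aic:.2f}")
```

The more flexible polynomial can never have a larger training loss than the straight line, since it contains the straight line as a special case. Whether it also does better on the held-back data, or in terms of AIC, is the actual question of generalizability.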
Checking model performance and indicators of good model design in analytical models relates directly to the interpretation of one often misunderstood metric: the p-value. A significant p-value is, so to say, "found gold" in the perception of a scientist, since its usual interpretation is evidence (not proof) for the existence of a mechanism that had been considered and that was to be tested by the model at hand. Formally, the p-value is the probability of obtaining a result at least as extreme as the one observed if the mechanism did not exist. A significant p-value is therefore commonly read as suggesting that the same analytical outcome would likely ensue (for example, species A is bigger than species B, or variable X has a strong positive effect on the response variable) when a new sample of data is drawn from the same natural population (this refers not to a population in the biological sense, but rather to the pool of data from which a sample was taken in a survey). In classification, it could be interpreted as meaning that the same classification boundary would be drawn (or a single property of the boundary would be the same - classifier parameters are not easy to interpret). Showing a significant p-value is therefore often taken as showing that the mechanism considered is true, and therefore used as major support for making a scientific statement. Still, one has to consider that, when generalizing, the p-value shows whether some parameter, or rather the parameter and the variable it is attached to, was critical in the optimization of the model's loss function, and would have been as critical if a different sample of data from the same population had been used in the optimization process. As seen above, many things need to be considered in the design of an analytical model. If these critical checks were not made, or if the model was not designed with a clear rationale in mind, then the p-value is practically worthless. In this case, the p-value was purely informed by the data and by the model design, but there is no way of telling if the model does reflect a process existing in nature. 
Keep in mind that the goal of an optimization algorithm is not to "discover" an underlying truth in nature; it is to find the lowest point of the loss function for a given model. Therefore, no p-value should be communicated without stating the rationale for model design, and without showing the quality of its design and of its prediction performance.
Now let us go deeper into the single domains of data science, and let us start with the analytical tool of regression models. Regression models are a statistical technique used to determine the validity of an assumed causality, or a causal relationship. They therefore differ from the (later discussed) correlation, which is only used to discover or determine the validity of a perceived relationship in data. This assumed causality is formulated in a very generalized way; since we only assume its existence, we do not have a mechanistic equation that clearly explains the effect of one or several variables on the response variable. Such mechanistic models, whether representing a natural law (more often encountered in physics) or an "approximative" truth (i.e. an underlying mechanism agreed upon to be valid but ignoring case-specific sub-processes, more often found in ecology), are the result of decade-long research work and debate. Here, we only want to test if variables affect another one in any (logically conceivable) way at all. A regression model therefore maps the input variables, or explanatory variables, to the output variable, or explained variable, using a function that multiplies each variable with a parameter (in other domains sometimes referred to as weights), sums these products, and finally adds a scalar, which in regression is called intercept (in other domains bias). Sometimes, when the explained variable is not normally distributed, a transformative function (sometimes called an activation function), e.g. exponentiation, is applied afterwards. The parameters determine the form of the function (increasing vs. decreasing, and so on), while the intercept maps the function values to the correct value range of the response variable. In some cases, the regression function may also include an interaction term: Here, the mapping of one particular input variable to the explained variable changes depending on the value of a second input variable, i.e. 
the parameter value related to the first variable changes depending on the second variable. Here, an additional parameter is multiplied with the product of both input variables; this is equivalent to the parameter of one input variable being itself a function of the second variable (https://stats.stackexchange.com/questions/56784/how-to-interpret-the-interaction-term-in-lm-formula-in-r). In yet other cases, a categorical variable may be incorporated as an explanatory variable. Then, the intercept is formulated for the first category, and an additional intercept (to be added to the first one) is given for each further category. Here, we have found the first hybrid case that connects two domains: Categorical explanatory variables are actually a typical property of statistical tests (more on that later); this type of regression is therefore sometimes termed "ANCOVA" (in relation to the ANOVA test). When incorporating an interaction between a continuous (i.e. numerical) and a categorical variable, a separate parameter related to the continuous variable exists for each category; the regression model then almost consists of separate functions for each category.
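The structure of such a regression function can be made concrete with a small Python sketch. It is an illustration under assumptions: the data are synthetic, the variable names `x1`, `x2` and `group` are hypothetical, the categorical variable has only two categories, and the parameters are obtained with NumPy's least-squares solver rather than with a dedicated regression package.

```python
import numpy as np

# Synthetic data (made up for illustration).
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
group = rng.integers(0, 2, size=n)  # categorical variable coded as 0 / 1

# Assumed "true" relationship, including an interaction and a category offset.
y = (1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + 3.0 * group
     + rng.normal(0.0, 0.2, size=n))

# One column per parameter: intercept, two slopes, the interaction term
# (product of both inputs) and the additional intercept of the second category.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, group])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

names = ["intercept", "x1", "x2", "x1:x2", "group offset"]
for name, value in zip(names, coef):
    print(f"{name:>12s}: {value: .3f}")

def slope_of_x1(x2_value):
    """Effective parameter of x1, which depends on the value of x2."""
    return coef[1] + coef[3] * x2_value

print("slope of x1 when x2 = 0:", slope_of_x1(0.0))
print("slope of x1 when x2 = 2:", slope_of_x1(2.0))
```

The helper `slope_of_x1` shows the interaction directly: the parameter attached to the first variable is itself a linear function of the second variable.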
As described above, regression models used for statistical testing must be extensively checked for prediction performance and validity of design before any output (i.e. regarding parameter direction, magnitude and significance) can be used to support a scientific hypothesis. These checks include the amount of explained variance (i.e. the amount of deviation of the data from their mean that is due to the effect described by the model), as well as the distribution of the model residuals. If one or both (or further diagnostics) are deemed unsatisfactory, the common approach is to try to make improvements by excluding variable interactions or entire variables themselves. This is often done by backward selection, where first interaction terms, then whole variables are removed from the model until diagnostics are deemed to be satisfactory. Very low amounts of explained variance or very poor diagnostics can be indicative of mis-specified distributions of either the explained or the explanatory variable, or both. This means that it is necessary to transform this / these variable(s), e.g. by log-transformation or logit-transformation. The transformed values are then hopefully closer to a normal distribution. The normal distribution of variables (or of transformed variables) makes the gradient-based fitting mechanism easier and more stable, hence its necessity in regression modeling (by the way, it is often also useful or required to work with normally-distributed variables in other domains of data science).
Two derivatives of regression modeling represent further connections to other domains of data science: I) The Generalized Additive Models (GAMs) introduce more freedom to the regression function to be fitted by segmenting the data along one variable and building a regression function for each segment (which theoretically gives an overall better fit, since the single functions are fitted to data that are "close" to each other and therefore often share a similarity). These separate functions, often also called regression splines, are connected to each other, forming an overall function similar to a polynomial. The connections are usually called knots (sometimes also nodes), and the number of knots is determined by the model designer. This is a very basic description of the idea behind the GAMs; the reality is a bit more complex. The price for introducing multiple regression functions is a tendency for over-fitting the data, and for losing interpretability of the model and its parameters. GAMs, or rather their spline components, therefore fall into the domain of purely explorative approaches like kernel-density estimation / cluster detection, rather than into the analytical domain that seeks to support existing hypotheses statistically. Since GAMs allow one to be selective about which explanatory variables are modeled using splines, it is also possible to use GAMs in an analytical manner by incorporating explanatory variables without several splines / knots. These variables, or rather their parameterization, then remain available for scientific interpretation. In the analytical approach, this method can be used to "smooth out" background noise, e.g. a latitudinal gradient that one is practically certain affects the explained variable to some degree, but is not at all of interest in the scientific hypothesis to be tested, which may center around other explanatory variables. The effect of the background variables would then be reduced by assigning several knots to them, i.e. 
by fitting them with several regression splines. Of course, one must be aware that this technique can have its pitfalls, especially when one is not very certain about the influence of a background variable. Smoothing out a background variable that in reality does not have much of an effect would create a wrong baseline for analyzing the effect of variables of interest. Care must thus be taken when applying GAMs in this manner. The GAM approach has so far only been implemented for regression models, but might theoretically also work for mechanistic models, where the relationship between variables of interest is quite well known and one seeks good parameter values for predictive purposes (see below).
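The basic idea of connected regression segments can be sketched in Python as follows. This is a deliberately simplified stand-in and not how GAM software works internally: it uses piecewise-linear segments built from a so-called truncated power basis, four connection points (knots) placed at equal distances, synthetic data, and an ordinary least-squares fit without the smoothness penalty that real GAM implementations typically add. The function name `linear_spline_basis` is hypothetical.

```python
import numpy as np

# Synthetic data (made up for illustration): a clearly non-linear relationship.
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=200))
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.size)

def linear_spline_basis(x, knots):
    """Intercept, overall slope, and one slope change per connection point."""
    columns = [np.ones_like(x), x]
    columns += [np.maximum(0.0, x - knot) for knot in knots]
    return np.column_stack(columns)

# Four interior connection points, chosen by the model designer.
knots = np.linspace(0.0, 2.0 * np.pi, 6)[1:-1]
B = linear_spline_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
spline_loss = np.mean((y - B @ coef) ** 2)

# A single straight line for comparison.
line = np.column_stack([np.ones_like(x), x])
line_coef, *_ = np.linalg.lstsq(line, y, rcond=None)
line_loss = np.mean((y - line @ line_coef) ** 2)

print("number of parameters (spline):", B.shape[1])
print("mean squared loss, spline:", spline_loss)
print("mean squared loss, straight line:", line_loss)
```

Each additional connection point adds one parameter, which is where both the better fit and the tendency towards over-fitting come from.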
II) Linear Mixed-Effects models also allow one to account for undesirable side-effects of background variables that are often the result of an experiment design that cannot set these background variables to constant values. In mixed-effects models, these variables are exclusively categorical (unlike in GAMs, where they are exclusively numeric / continuous). Typical background variables include experiment gear (like individual fish tanks that might have individual effects on their inhabitants) or an individual patient effect in medical studies where a medicine and a placebo are tested on the same person. Mixed-effects models therefore represent a statistical test for dependent data, more specifically an alternative to the ANOVA or the Kruskal-Wallis analysis of variance (see below) for dependent data. Here, the transitionary nature between statistical regression and statistical tests becomes obvious again. Mixed-effects models are therefore useful for extracting the most information from a somewhat "flawed" experiment while sacrificing as little credibility as possible. In applied usage (i.e. with focus on the accuracy of a prediction, not the validity of a scientific statement), they are relatively rarely used. Mixed-effects models can be defined with varying degrees of complexity: The dependency of the measurements on a background variable can be expressed via an intercept for each category of the background variable, or via an intercept plus slope for each category. This high degree of complexity means that mixed-effects models are relatively highly parameterized. Therefore, they are best used when a lot of data are available (especially if there are many background variables or many categories per background variable). In any case, great care should be taken when making use of them. 
There are also mixed-effects models that exist as derivatives of mechanistic models; they might be important for testing the validity of the mechanism on a set of data in case it is unclear whether that mechanism exists in the case at hand, and when background variables are present.
Now, let us move to the next domain, statistical tests. Statistical tests are a purely analytical tool; even though their use involves making predictions, they have no use in applied data science. As already hinted above in the case of "hybrid" approaches (e.g. ANCOVA), they can be understood as regression models that contain only categorical explanatory variables, and a numerical response variable. The term regression should be abandoned, though, since it implies a causal relationship between numeric (continuous) variables. However, the principle of the response variable being a function of (categorical) variable(s) is the same. Statistical tests are normally used to generate objective support for differences or similarities between several categories of which measurements have been taken. These can include, for example, different species for which body length has been measured, or different medical treatments for which effectiveness in curing a disease has been measured. "Category" is then the explanatory variable, and length or effectiveness is the explained variable. Oftentimes, analysts look for a significant difference between the categories, which, in the context of the test, manifests itself as a significant effect of the "category" variable. Since this variable is, of course, of categorical nature, the "effect" is not represented as a parameter that is multiplied with the explanatory variable to calculate a prediction of the response variable. Rather, it appears as multiple intercepts, one for each category, which, simply speaking, "predict" the magnitude of the response variable. The effect of the "category" variable is usually deemed significant when the intercepts, which equal the mean of the response variable for each category (rank-based tests compare medians or rank sums instead), are sufficiently far apart from one another relative to the dispersion of response-variable values around them. This becomes apparent through a significant p-value, as discussed above.
As in analytical regression, possibly even more so here, it is necessary to make sure that the choice of test is reasonable before the p-value can be used for interpretation. Unlike in many other data-science domains, the predictive power, i.e. the loss between observation and prediction, is rarely assessed when using statistical tests. In general, the validity of test choice, or "model design", as one would call it in a wider context, is usually ensured before it is applied, not in a post-hoc diagnostic manner. The most important points to check in choosing a statistical test are whether the response variable is normally distributed or not (if not, it is normally not transformed, but instead a different test is applied), whether the magnitude of dispersion of data around the median of the response variable is different between the different categories (this is not tested in statistical regression, since the explanatory variables are continuous - however, when the degree of dispersion changes in a directed way with an explanatory variable, a modification of model design might be necessary) and whether the data are dependent or not, i.e. whether there are interfering categorical background variables. The first two points can be checked visually or tested with specific statistical tests of their own; the last must be decided by the test chooser / model designer, and is sometimes difficult to determine. For each outcome of these checks, there are different tests. For a normally-distributed, independent response variable with homogeneous dispersion between the categories to be compared, it is possible to apply the simple linear-regression function in R. If one of these conditions is not fulfilled, it is necessary to apply a specific test function (these are often named after the authors who have conceived these tests). 
When dealing with more than two categories, and it is of interest not only whether some, but also which of them differ with regard to the response variable, it is necessary to apply a post-hoc test. When dealing with non-independent data, the analyst quickly enters the realm of mixed-effects models (discussed above). Interestingly, the reverse of the formula of a statistical test - predicting a category from numerical variables - is the technique of classification, a strongly applied and hardly analytical domain.
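The view of a statistical test as a regression with a purely categorical explanatory variable can be illustrated with a short Python sketch. The species, sample sizes and body lengths are made up, the response is assumed to be normally distributed with equal dispersion in both categories, and only the test statistic of the classical two-sample t-test is computed by hand (no p-value), to keep the example free of additional packages.

```python
import numpy as np

# Hypothetical body lengths of two species (synthetic data).
rng = np.random.default_rng(5)
length_a = rng.normal(10.0, 1.0, size=40)
length_b = rng.normal(12.0, 1.0, size=40)

# "Regression" with one categorical explanatory variable, coded as 0 / 1.
y = np.concatenate([length_a, length_b])
is_b = np.concatenate([np.zeros(40), np.ones(40)])
X = np.column_stack([np.ones_like(y), is_b])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
intercept_a, offset_b = coef

# The fitted intercepts coincide with the category means.
print("intercept for A:", intercept_a, "| mean of A:", length_a.mean())
print("intercept for B:", intercept_a + offset_b, "| mean of B:", length_b.mean())

# Two-sample t statistic: distance of the means relative to the dispersion.
n_a, n_b = len(length_a), len(length_b)
pooled_var = ((n_a - 1) * length_a.var(ddof=1)
              + (n_b - 1) * length_b.var(ddof=1)) / (n_a + n_b - 2)
t_stat = (length_b.mean() - length_a.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
print("t statistic:", t_stat)
```

In this least-squares setting, the additional intercept of the second category is exactly the difference between the two category means, which is the quantity the test evaluates.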
Classification aims to find logical rules in the data that allow an optimum assignment of data points to one of several categories. Compared to the other domains of data science, classification has a strongly enforcing nature in that the class assignments to be predicted by the classifier model are given by the model engineer, and are not subject to a research hypothesis that seeks to prove the validity of the given classes. Thus, no exploratory or analytical investigation is the reason for designing a classifier model, but only classification performance. Classification is thus probably "closest" to the basic optimization problem that is the foundation for almost all techniques used in the domains of data science. The simplest case of classification, binary classification, can be regarded as a special case of regression, i.e. regression with absence-presence data. This type of regression attempts to predict the presence (represented by value 1, which is also the numerical analogue to the logical "TRUE") or absence (represented by a 0) of a condition with several input (explanatory) variables. The goal of this binomial regression is thus to find thresholds in the input variables: input values below the threshold yield prediction of one category, and values above it yield prediction of the other. When dealing with a multitude of input variables, there is a threshold for every (relevant) input variable (there may be redundant variables, or variables that do not contribute to classification success and would be excluded in model refinement), and the combination of these thresholds is termed a separating hyperplane (in reference to a hyperspace, i.e. a mathematical space with more than three dimensions). When there are more than two classes to be separated, there is also more than one separating hyperplane; the number of hyperplanes then depends on the scheme used: with one class serving as a reference, it is one less than the number of classes, whereas one-versus-rest schemes use one hyperplane per class. 
Multi-class classification cannot make use of a single scalar as objective values; rather, multi-class "labels" are represented by so-called one-hot-encoded vectors, i.e. vectors that contain as many elements as there are classes, with one element being a one, and all others being zero. The index of the one-value in the vector varies by class, and the classification task is to predict the correct vector index for each input. When classifying very complex data, like images (which are essentially giant matrices or matrix stacks), the predicted one-hot vector is formally still a function of the input. However, since this classification is often achieved by using multi-layer Neural Networks, the prediction process can also be interpreted as a gradual reduction of dimensionality of the input by filtering out irrelevant information (this is what the function parameters that are being optimized on the data do), up to the point where dimensionality has been reduced to the one-hot encoded vector (the output of the classifier function).
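One-hot encoding and the prediction of a vector index can be sketched in a few lines of Python. The classifier itself is left out: the `scores` array below stands in for the raw output of some hypothetical three-class classifier, and the softmax normalisation and the cross-entropy loss are common choices assumed here for illustration, not something prescribed by the text above.

```python
import numpy as np

def one_hot(labels, n_classes):
    """One row per data point, with a single one at the index of its class."""
    encoded = np.zeros((len(labels), n_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def softmax(scores):
    """Turn raw scores into values between 0 and 1 that sum to one per row."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exponentiated = np.exp(shifted)
    return exponentiated / exponentiated.sum(axis=1, keepdims=True)

# Expert-defined classes of three data points.
labels = np.array([0, 2, 1])
targets = one_hot(labels, n_classes=3)

# Hypothetical raw outputs of a classifier for the same three data points.
scores = np.array([[2.0, 0.1, -1.0],
                   [0.0, 0.5, 3.0],
                   [0.2, 1.5, 0.3]])
probabilities = softmax(scores)

# The predicted class is the index of the largest element of each output vector.
predicted = probabilities.argmax(axis=1)

# Loss between one-hot targets and predictions (cross-entropy, an assumed choice).
cross_entropy = -np.mean(np.sum(targets * np.log(probabilities), axis=1))

print("one-hot targets:\n", targets)
print("predicted class indices:", predicted)
print("cross-entropy loss:", cross_entropy)
```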
A technique that can be considered a close "relative", or a more explorative derivative, of classification is clustering. While classification imposes a specific grouping of the data onto the optimization of a classifier model, i.e. the definition of suitable class boundaries, clustering seeks to detect patterns (or clusters) in the data. It is thus a part of what is often called unsupervised learning, in contrast to the more applied supervised learning that includes classification, prediction-targeted regression (or prediction-targeted modeling in general) and general-purpose optimization. The only constraint on exploration posed when applying clustering algorithms is the number of clusters that should be found. However, there are also supporting techniques for determining a fitting number of clusters: these include repeating the clustering algorithm multiple times with various random initializations and checking the consistency of the placement of cluster means (when they are placed at approximately similar positions in every replicate, the number of clusters set could be considered appropriate), or using multiple clustering algorithms and comparing the outcomes. More advanced techniques consider the geometric outline of the clusters (which depends mainly on the data-points at the border of a cluster) found by an algorithm; when certain rules about these outlines are fulfilled, the number of clusters can be considered appropriate. Clustering is a fully explorative technique that may often be used to characterize a domain for which a more applied technique (e.g. classification) will later be designed. 
For example, a corporation might first want to find out what kinds of customers it is serving by clustering them according to buying behaviour and other features, and might then design a classifier that takes these features (or a reduced set, after sorting out irrelevant or redundant features; see below) as input in order to create specific advertising for single customer groups. As alluded to, clustering is indeed often performed to detect useful features for classification. For example, in image classification with deep neural networks, a so-called auto-encoder, which is optimized to reconstruct its input, is often used to "engineer" image features that can be used for classification later on. In clustering, the parameters to be optimized (aside from the number of clusters, which is set more arbitrarily and is rather a hyper-parameter) are the centers of the clusters (i.e. vectors, with one value per variable used in the clustering), which are moved in the space of variables present in the data. Clustering algorithms are often rather simple in design; for example, the popular K-means algorithm works by iteratively assigning each data-point to the closest cluster center (the centers are initially set at random), and then updating each cluster center by computing the mean of the data assigned to it. A loss function or a proper gradient-descent technique is not obviously used here, though the operations performed could, in a way, be interpreted as using a loss (the summed squared distance between each cluster center and the data-points assigned to it) and a gradient (updating the cluster center by relating its current "position" to this "loss"). Aside from the simple K-means, there also exist hierarchical clustering techniques that detect major clusters and sub-clusters in the data. Other explorative techniques related to clustering are kernel-density estimation, which can be used for filtering out signals in the data (i.e. smoothing), and principal-component analysis, which is useful for determining the features (or variables) that are most useful for differentiating between various groups in the data.
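The two alternating K-means steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the data points, the fixed random seed and the helper names are assumptions made for the example, and the "loss" computed at the end is the summed squared distance between each data-point and its cluster center, which is one way to read the procedure as an optimization.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, n_iter=100, seed=0):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initial centers: k random data-points
    for _ in range(n_iter):
        # Assignment step: every point joins the group of its closest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            groups[nearest].append(p)
        # Update step: every center moves to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


def within_cluster_loss(centers, groups):
    """Summed squared distance of all points to the center of their group."""
    return sum(squared_distance(p, c) for c, g in zip(centers, groups) for p in g)


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centers, groups = k_means(points, k=2)
print(sorted(centers), within_cluster_loss(centers, groups))
```

Repeating the call with different values of `seed` and comparing the resulting centers is one simple way to check the consistency of the cluster placement.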
Kernel-density estimation can be considered as one further step towards pure data exploration, as this method requires even fewer prior assumptions than most clustering techniques. Rather, it can be considered as a way of emphasizing relationships and trends in the data. When every data-point is treated as a cluster center, and the summed distance to all other data-points is measured (either a true geometric distance or the value of a probability-density function, for which one further attribute must be specified: the standard deviation, or the covariance matrix in multivariate cases, also referred to as the bandwidth), it becomes apparent which data-points are more similar to each other than others. Kernel-density estimation can thus be utilized for an initial exploration of the data, followed by approaches with more concrete aims like cluster analysis. The technique is sometimes also referred to as smoothing, especially in the context of visualizing data and kernel-density estimates. Principal-component analysis (PCA), on the other hand, seeks to detect the variables that contribute the most to the differences in the data, i.e. those variables that contribute a lot to the arrangement of data into clusters. Fundamentally, PCA works like regression, where each variable is assigned several parameters, or, as they are called in this context, loadings. The loadings are iteratively updated to project the data into a latent space where the variability in the data is maximized. This latent space is defined as a set of orthogonal axes within the hyper-space taken up by the data. These axes are referred to as the principal components. In typical PCA visualizations, the first two principal components are plotted as the Cartesian axes of a plot, and the data and remaining principal components are drawn into this plot according to their alignment to these two components. The properties of these axes are determined by the loadings, with each variable affecting each axis. PCA is often used to characterize those variables that are of greatest importance in describing differences in the data. This can have many practical uses, including being a basis for refining research strategies (i.e. to find out which variables are important to take measurements of), or for omitting redundant variables in classification techniques (which can reduce computational workload and the risk of model over-fitting).
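To make the idea of iteratively updated loadings concrete, here is a minimal sketch that finds the first principal component by power iteration on the covariance matrix. Power iteration is an assumption chosen for illustration; common library implementations typically rely on an eigen-decomposition or a singular-value decomposition instead. The data and function names are hypothetical.

```python
def mean_center(data):
    """Subtract the column mean from every variable."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance_matrix(centered):
    """Sample covariance matrix of mean-centered data (rows are observations)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, n_iter=200):
    """Repeatedly multiply a vector by the covariance matrix and re-normalize.

    The vector converges to the loadings of the first principal component,
    i.e. the direction along which the variance of the data is largest.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical data: the second variable is twice the first, the third barely varies.
data = [(1.0, 2.0, 0.5), (2.0, 4.0, 0.4), (3.0, 6.0, 0.6), (4.0, 8.0, 0.5)]
cov = covariance_matrix(mean_center(data))
loadings = first_component(cov)

# Variance along the component, as a share of the total variance.
d = len(cov)
variance_along = sum(loadings[i] * cov[i][j] * loadings[j]
                     for i in range(d) for j in range(d))
explained_share = variance_along / sum(cov[i][i] for i in range(d))
print(loadings, explained_share)
```

In this example the third variable receives a loading close to zero, which is the kind of result that would suggest treating it as a candidate for omission.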
There exist numerous other explorative and applied techniques, some of which are relatively simple algorithms that divide or scan the data space according to some specified rules. These include classification and regression trees (CART), which split up the data in a hierarchical manner so as to make each fragment as homogeneous as possible with respect to the target variable, and the patient-rule-induction method, which attempts to discover patterns in the data by repeatedly drawing boxes around subsets of the data, decreasing the extent of these boxes so as to increase the mean of the data included therein. These algorithms are relatively "naive", as they do not make use of specific functions or features, and are therefore most useful for (scenario) discovery, but not as solid statistical tools or as tools for making predictions for inputs beyond the value ranges covered by the data. They are typically used in the field of data mining.
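The box-shrinking idea behind the patient-rule-induction method can be sketched as a greedy "peeling" loop. This is a deliberately simplified illustration under stated assumptions: it removes one layer of distinct values per step instead of a fixed fraction of the data, it has no step for re-expanding the box afterwards, and the data, the `min_points` threshold and the function names are hypothetical.

```python
def inside(rows, box):
    """Return the (x, y) rows whose x lies within the box bounds."""
    return [(x, y) for x, y in rows
            if all(lo <= v <= hi for v, (lo, hi) in zip(x, box))]


def mean_y(rows):
    """Mean of the target value y over the given rows."""
    return sum(y for _, y in rows) / len(rows)


def peel(rows, min_points=3):
    """Shrink a box around the data as long as the mean of y inside increases."""
    n_dim = len(rows[0][0])
    box = [(min(x[d] for x, _ in rows), max(x[d] for x, _ in rows))
           for d in range(n_dim)]
    while True:
        current = inside(rows, box)
        best_box, best_mean = None, mean_y(current)
        for d in range(n_dim):
            values = sorted({x[d] for x, _ in current})
            if len(values) < 2:
                continue
            lo, hi = box[d]
            # Candidate peels: drop the lowest or the highest layer in dimension d.
            for new_bounds in ((values[1], hi), (lo, values[-2])):
                candidate = box[:d] + [new_bounds] + box[d + 1:]
                kept = inside(rows, candidate)
                if len(kept) >= min_points and mean_y(kept) > best_mean:
                    best_box, best_mean = candidate, mean_y(kept)
        if best_box is None:          # no peel improves the mean any further
            return box, best_mean
        box = best_box


# Hypothetical data on an 8 x 4 grid: y is high only for 3 <= x0 <= 5 and x1 >= 2.
rows = [((float(i), float(j)), 10.0 if 3 <= i <= 5 and j >= 2 else 1.0)
        for i in range(8) for j in range(4)]
box, box_mean = peel(rows)
print(box, box_mean)
```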
A final group of techniques returns very close to the optimization problem underlying almost all data-science techniques: general-purpose optimization attempts to optimize the parameters of a complex system represented by a model. Instead of being components of a relatively simple and / or well-structured function, like a basic linear model or a neural net, the parameters are distributed throughout the model in a set of inter-dependent equations. Examples of this are models of industrial production pathways where there are many "levers" that could be adjusted to make production more efficient, or a complex fisheries-management model where one tries to parameterize key processes in order to understand how a stock has responded to fishing pressure over time. The goal in such applications is not always to fit a model to observed data; in the first example, parameters are optimized so as to improve a completely artificial scenario that is already deemed to be a very good representation of reality, since the production processes were exclusively designed by humans. Parameter optimization in such complex systems is often more difficult than in simple "one-line models", since the systems are often highly non-linear, the number of parameters can be high, and there may be rules that are only very indirectly related to the parameters, e.g. the shut-down of some component when one variable crosses some threshold value. It can often be helpful to reduce the number of parameters by mapping some of them to the output of a function constructed around a core parameter to be optimized. This is particularly useful in cases where groups of parameters have a common functional trait, like fishing mortalities for different age classes in a fish-stock assessment model. Also, introducing scaling functions so that all parameters are optimized on a similar order of magnitude can ease the optimization process, as a fixed step size in gradient descent then leads to changes of similar effect for all parameters.
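Both ideas, reducing a group of parameters to a function of a few core parameters and rescaling parameters to a similar magnitude, can be sketched as follows. The logistic shape of the age dependence, the parameter names and all numerical values are assumptions made for illustration and are not taken from any specific assessment model.

```python
import math


def fishing_mortality_at_age(ages, f_full, age_50, steepness):
    """Map three core parameters to one fishing mortality per age class.

    Assumes a logistic increase with age: mortality reaches f_full for old
    fish and half of f_full at age_50. Instead of one free parameter per age
    class, only three parameters remain to be optimized.
    """
    return [f_full / (1.0 + math.exp(-steepness * (a - age_50))) for a in ages]


# Hypothetical scaling factors that bring all parameters to a magnitude near one,
# so that an optimizer step of fixed size has a comparable effect on each of them.
SCALES = {"f_full": 0.1, "age_50": 1.0, "steepness": 1.0}


def to_natural(scaled):
    """Convert parameters from the optimizer's scaled units to natural units."""
    return {name: value * SCALES[name] for name, value in scaled.items()}


ages = list(range(1, 11))                       # ten age classes
scaled_params = {"f_full": 4.0, "age_50": 3.0, "steepness": 1.5}
f_at_age = fishing_mortality_at_age(ages, **to_natural(scaled_params))
print([round(f, 3) for f in f_at_age])
```

An optimizer would then work on the three scaled values only, while the model itself keeps receiving one mortality value per age class.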
Related to this are non-linear mechanistic models, which are essentially mathematical formulations of e.g. biological, chemical or ecological processes, though they are not as complex as the systems described previously (thus they can be considered "one-line models"). The goal in fitting the parameters of such models is seldom to seek proof for the existence of an effect, as in statistical regression models, but rather to set the models up for making reasonable predictions, e.g. for future projections in the context of more complex numerical models. The existence of the mechanisms that such models describe is usually not in doubt; rather, these mechanisms represent e.g. an ecological "rule" or physical "law" that is the subject of universal agreement. Fitting such models is usually easier when reasonable starting values are provided to the gradient-descent algorithm; compared to simpler linear regression models, the complexity of such non-linear models makes achieving convergence more difficult. Alternatively, if possible, a linearization of the model, i.e. a rearrangement of its equation to a formulation more similar to the "intercept-plus-slope" form of a generalized linear model, can be helpful. Still, care must be taken when fitting the model proves to be difficult; very noisy data can lead to mis-estimation of key parameters, as parameter estimation is ultimately only forced to reduce the divergence between observed and predicted data, and not to yield the parameters of a "true" (e.g. ecological) mechanism. This is especially true for mechanisms that are known to be simplifications of more complex processes that are not fully understood, e.g. in the case of fish-stock recruitment. Here, it is crucial to check the parameter values estimated by model-fitting against expert understanding of the system at hand, and to make manual corrections if necessary.
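As an illustration of linearization, the sketch below uses the Ricker stock-recruitment curve, R = a * S * exp(-b * S). This model is an assumption chosen here as an example and is not named in the text above. Dividing by S and taking logarithms gives log(R / S) = log(a) - b * S, which has the "intercept-plus-slope" form and can be fitted by ordinary least squares. The data are synthetic and noise-free.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


def fit_ricker_linearized(stock, recruits):
    """Estimate a and b of R = a * S * exp(-b * S) via log(R / S) = log(a) - b * S."""
    log_ratio = [math.log(r / s) for s, r in zip(stock, recruits)]
    intercept, slope = fit_line(stock, log_ratio)
    return math.exp(intercept), -slope


# Synthetic data generated from known (hypothetical) parameter values.
true_a, true_b = 5.0, 0.01
stock = [50.0, 100.0, 150.0, 200.0, 300.0]
recruits = [true_a * s * math.exp(-true_b * s) for s in stock]

a_hat, b_hat = fit_ricker_linearized(stock, recruits)
print(a_hat, b_hat)
```

With noisy data, the logarithmic transformation changes how errors are weighted, so estimates obtained this way may be better used as starting values for a fit of the original non-linear equation than as final results.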
In summary, it becomes apparent that many domains and techniques in the realm of data science are inter-connected, which is not surprising given that they are all ultimately based on the optimization problem inherent to fitting a model to measured data. A further commonality of all techniques is that the design of the models or algorithms employed must be critically checked in order to make a valid analysis or to create an effective application. This is because all techniques are geared towards solving the optimization problem to the greatest possible extent, and not towards revealing a truth inherent to the context that the data were taken from, or towards solving a task like classification in a way that is helpful for the user. The beauty of the world of data science is, in essence, that it has an internal logic that is unquestionable and, unlike in most other sciences, not constrained by our ability to take measurements or samples. But in its applied form, it must always be used with conscience and care, since reality is much too complex to be compressed into a set of data that is completely free of bias and ambiguity.