ExDataScientia

Dimensionality of Data vs Structure of Data

When you start working with complex data such as images, it is often not easy to recognize the dimensionality of the data and the structure of the data, or to tell one from the other.

To be honest, when I took my very first steps with image data (as part of a task in machine-vision-based classification), I did not even realize that images were data at all. Coming from a background in biology and ecology, I was familiar with rather low-dimensional data with a nicely ordered vector structure, such as sets of physical variables from a research cruise. It was only when I considered digital images, which are made up of pixels, i.e. clearly separated "cells", that I started to understand images as data.

As mentioned, the vast majority of data in the natural sciences take the form of one-dimensional vectors, arranged as rows in a data frame or matrix. Each vector is thus a set of measurements of different variables at one specific instance, e.g. a time-step or a geographical location. Each element of the vector has a name and a meaning; for example, a vector could be made up of one salinity value, one temperature value and one oxygen-concentration value. This vector has three variables, and could also be described as a three-dimensional data-point. This data-point occupies one specific position in a three-dimensional space, where each axis represents one particular variable: salinity, temperature and oxygen concentration. The other vectors, or rows, of the data frame occupy different positions in this space. The proximity of the data-points to each other hints at the existence of clusters in the data; the arrangement of the data-points relative to two or more axes can indicate a possible correlation between variables.
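To make the idea of rows as data-points a bit more tangible, here is a minimal sketch in plain Python. The measurement values are invented purely for illustration, and the helper `pearson` is a hypothetical name written out by hand, not a function from any particular library.

```python
import math

# Each row is one data-point: (salinity, temperature, oxygen concentration).
# The numbers are made up purely for illustration.
rows = [
    (35.0, 12.0, 6.0),
    (35.1, 12.2, 5.9),
    (30.0, 4.0, 9.0),
]

# The dimensionality of the data is the number of variables per row.
n_dimensions = len(rows[0])

# Proximity between data-points: Euclidean distance in the variable space.
d_near = math.dist(rows[0], rows[1])
d_far = math.dist(rows[0], rows[2])


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Columns are variables: pull out temperature and oxygen across all rows.
temperature = [row[1] for row in rows]
oxygen = [row[2] for row in rows]
r_temp_oxy = pearson(temperature, oxygen)

print(n_dimensions, round(d_near, 3), round(d_far, 3), round(r_temp_oxy, 3))
```

In this toy example, the first two rows lie close together while the third lies far away, which is the kind of pattern that hints at clusters; the strongly negative coefficient reflects how the points line up relative to the temperature and oxygen axes.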

Imagining vectors as data-points in a multi-dimensional space (a very high-dimensional one is referred to as a hyper-space) is often helpful when trying to understand clustering and classification, where terms like "separating hyper-plane" are dropped. However, in the present case, it is sufficient to understand that data always have a dimensionality, and that this dimensionality depends on the number of variables measured. It is also important to remember that one data-point is exactly one row-vector (at least when the data are one-dimensional in structure and can thus occupy a two-dimensional "container" like a matrix). Two row-vectors are two data-points; two measurements of one variable are just two scalar elements, each belonging to a unique data-point.

Now that we have clarified the meaning of the dimensionality of the data, we should take a look at the structures that data can assume. This is a topic that is rarely addressed outside of the data sciences, since most data-sets contain one-dimensional data anyway. The one-dimensional data type is termed a vector, in some rare cases also a rank-1 tensor. In vectors, there is no spatial relationship between the elements, and often the order of the elements in the vector is arbitrary as well. There are exceptions, though, especially when dealing with lagged instances of a variable measured over time, where these time-lags constitute variables of their own. In some applications, it is necessary to maintain the natural sequence of lagged variables.

Zero-dimensional data are rather rare. They are referred to as scalars and, by their nature, are made up of a single number (or character). Finding a data-set that contains only a single variable is likely difficult, since data analyses and statistical analyses almost always concern finding relationships involving multiple variables. At the very least, such single-variable data likely have a temporal component, which can then easily be used to generate artificial time-lag variables.
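Generating time-lag variables from a single measured variable can be sketched as follows. This is only one of several possible ways to do it; the function name `make_lagged_rows` and the temperature values are assumptions made up for this illustration.

```python
def make_lagged_rows(series, n_lags):
    """Turn a 1-D time series into rows of (x[t], x[t-1], ..., x[t-n_lags])."""
    rows = []
    for t in range(n_lags, len(series)):
        # Most recent value first, then progressively older values.
        rows.append(tuple(series[t - lag] for lag in range(n_lags + 1)))
    return rows


# A single temperature variable measured at six time-steps (invented values).
temperature = [10.0, 10.5, 11.0, 10.8, 10.2, 9.9]

lagged = make_lagged_rows(temperature, n_lags=2)
for row in lagged:
    print(row)
```

Each resulting row is again a one-dimensional vector, here with three variables, but this time the order of the elements carries meaning, because it follows the time axis.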

In addition to data with a one-dimensional structure, there also exist data with higher-dimensional structures. Consider a gray-scale image: the pixels are arranged in a matrix, a two-dimensional structure that has some meaning to our visual perception. Were the pixels arranged in one dimension only, i.e. in vector format, we would have a hard time grasping what is contained in the image. With very simple images, like those of hand-written digits, we might have a chance to interpret a vectorized image by looking for periodically repeated "motifs"; in that case, we might have better luck trying to interpret the vector with our ears!
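The loss of visual meaning through vectorization is easy to try out. The following sketch uses a tiny, invented 5x5 image of a plus sign and two hypothetical helpers, `flatten` and `unflatten`, written in plain Python.

```python
# A tiny 5x5 gray-scale "image" of a plus sign (1 = bright, 0 = dark).
image = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def flatten(matrix):
    """Concatenate the rows of a matrix into one long vector."""
    return [pixel for row in matrix for pixel in row]


def unflatten(vector, width):
    """Cut a vector back into rows of the given width."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]


vector = flatten(image)
print(vector)  # the same 25 values, but the plus sign is hard to see

restored = unflatten(vector, width=5)
for row in restored:
    print(row)  # printed row by row, the plus sign is visible again
```

No value is lost by flattening; only the arrangement changes. Getting the matrix back requires knowing the original width, which is exactly the structural information the vector no longer carries.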

It is important to understand that when dealing with images, each pixel forms a variable, or an element in the matrix (or vector, if the structure of the image has been reduced). This means that even if we only deal with small images with a height and width of 28 pixels, we are actually dealing with 784-dimensional data-points! In a gray-scale image, each data-point is structured as a matrix, or, to be more general, a two-dimensional array. Multiple images may be stacked "on top of each other", forming a three-dimensional storage structure. The relationship between dimensions in the matrix is of strong relevance, especially in more complex images; as described above, this spatial arrangement is absolutely necessary for a human to comprehend, or better, perceive the image as what it is. For a clustering or classification algorithm, it is usually also very beneficial to maintain the original two-dimensional structure of the data: success rates and efficiency both tend to improve in this way.
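The difference between the structure of a data-point (its shape) and its dimensionality (the number of variables) can be checked with a few lines of plain Python. The helper `shape` is a hypothetical function for regular nested lists; in practice, an array library would typically provide this information directly.

```python
import math


def shape(nested):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# One 28x28 gray-scale image, filled with zeros as a placeholder.
image = [[0] * 28 for _ in range(28)]

image_shape = shape(image)            # structure of one data-point
n_variables = math.prod(image_shape)  # dimensionality of one data-point

# Stacking five such images adds a leading axis that indexes the images.
stack = [[[0] * 28 for _ in range(28)] for _ in range(5)]
stack_shape = shape(stack)

print(image_shape, n_variables, stack_shape)
```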

In the data sciences, we are not limited to two or three dimensions, as we are working in a very theoretical domain (that is still very close to practice). Color images, for example, constitute data of three-dimensional structure; formally speaking, they are three-dimensional arrays (sometimes referred to as rank-3 tensors). Here, it is the color channels (red, green and blue) that represent the third dimension in the structure of the data. A color video can be considered a four-dimensional array; it is essentially a stack of color images. One could argue that, for the same reason, a black-and-white video should formally be the same as a color image, since the video is three-dimensional in structure (a stack of matrices). However, that is only half of the truth, since in videos, the sequence of images is of fundamental importance for understanding the video. Thus, as in any time series, there is a meaning to the sequence of matrices, and one could easily reshape the three-dimensional array into a higher-dimensional one in order to incorporate time-lagged variables.
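The structures mentioned here can be compared side by side using nothing but their shapes. The concrete sizes (28 pixels, 10 frames) and the axis order (frames, height, width, channels) are assumptions chosen for this sketch; other orderings are equally valid.

```python
import math

# Hypothetical shapes; the axis order (frames, height, width, channels)
# is an assumption, other orderings are equally valid.
examples = {
    "gray-scale image": (28, 28),
    "color image": (28, 28, 3),
    "gray-scale video": (10, 28, 28),
    "color video": (10, 28, 28, 3),
}

summary = {}
for name, shape in examples.items():
    rank = len(shape)               # dimensionality of the structure
    n_variables = math.prod(shape)  # dimensionality of the data-point
    summary[name] = (rank, n_variables)
    print(f"{name:17s} shape={shape} rank={rank} variables={n_variables}")
```

The color image and the gray-scale video both come out with three axes, yet the extra axis means something different in each case: color channels in one, time in the other. The shape alone does not tell us which.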

Just like data with a one-dimensional structure, multi-dimensionally-structured data are also a set of data-points in a multi-dimensional space. Given the higher dimensionality of their structures, and the spatial relationships between variables resulting from this, the positioning of the data-points with respect to certain groups of axes has a distinct meaning. The axes represent the variables in the data (or the elements in a matrix); thus, if there is a spatial relationship between variables that is perceivable as a feature (like an edge or a circle), then the positioning of the data-point in the multi-dimensional space relative to certain axes "expresses" this relationship (in simpler data, a loose analogy might be the correlation between variables). This thought experiment can become much more complicated when considering the hierarchy of visual features, and issues like transformation variance (i.e., effects of photographing the same object at different rotational angles, or even from different perspectives).

One last point addresses the storage of data with higher-than-one-dimensional structures. Of course, these cannot be contained in a nice, simple two-dimensional "container" like a matrix or a data frame. Instead, they have to be stored in multi-dimensional containers - multi-dimensional arrays - themselves. These typically have one more dimension than the structure of the data they contain: color images would thus be stored in a four-dimensional array. Here, the additional dimension simply serves to index the data-points (so that we can take a single image - or a subset - from the container). By convention, this dimension is typically the leading dimension, i.e. it comes first when indexing the array to take a subset. Very large data-sets, such as collections of big images, are usually not loaded into the working environment of a Python or R session all at once; the computer's memory often does not allow for this. Special "image-data generators" solve this problem by loading small batches of images from disk into memory, one batch at a time.
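The principle behind such a generator can be sketched in plain Python. This is not the interface of any particular library: `load_image` is a hypothetical stand-in that fabricates a tiny image instead of reading a file, and the file names are invented, so the sketch runs without any data on disk.

```python
def load_image(path):
    """Hypothetical stand-in for a real image reader.

    Returns a 2x2 gray-scale image whose pixel values depend on the path
    length, so that no files are actually needed to run the sketch.
    """
    value = len(path)
    return [[value, value], [value, value]]


def image_batches(paths, batch_size):
    """Yield lists of at most `batch_size` images, loading them lazily."""
    for start in range(0, len(paths), batch_size):
        # Only the images of the current batch are created at this point.
        yield [load_image(p) for p in paths[start:start + batch_size]]


# Hypothetical file names; nothing is read from disk here.
paths = [f"img_{i}.png" for i in range(5)]

batch_sizes = []
for batch in image_batches(paths, batch_size=2):
    # The leading axis indexes the images in the batch: batch[0] is one image.
    first_image = batch[0]
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Because the function is a generator, each batch is only built when the loop asks for it, and the previous batch can be discarded before the next one is loaded. The last batch may be smaller than the others when the number of images is not a multiple of the batch size.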

Thinking about the dimensionality of data, and especially about their structure, is a great way of introducing oneself to the world of data science. Suddenly, you may start to perceive the world around you differently: as impressively complex arrangements of super-high-dimensional data!

Ex Data, Scientia