Ex Data, Scientia

Welcome!

Welcome to the exciting world of data science!

Here, you will find information on topics covering data analysis, programming, statistics and visualization. The articles will primarily concern the ideas behind the various concepts, and their application. The mathematical side will be treated less rigorously, so don't expect to find a lot of formulas or textbook-style definitions. Instead, this site is primarily intended to be a fun "playground" to share all kinds of astonishing insights, in the hope that they may be of interest or use to the reader.

This website is not complete from the start. Rather, articles will be added over time, to hopefully become a comprehensive resource for applied data science at some point. This also means that the sequence in which articles are released does not follow a strict tutorial-like schedule. Some articles will require the reader to have more pre-existing knowledge than others. Articles published later may explain a topic only briefly addressed in a previous one.

So, without further ado, grab a cup of coffee, lean back and enjoy browsing the site. And don't forget to come back regularly to find new exciting insights!

Happy reading!

03-04-2024: The Hessian matrix and parameter fitting

When you fit a complex statistical model to data, you may sometimes encounter a warning message referring to unfulfilled properties in the so-called "Hessian matrix", with the final parameter estimates not being trustworthy as a result. Here we will investigate the meaning of the Hessian matrix and why it is important in the statistical-fitting process. ...read on

03-04-2024: Mixed-effects models in R

Describing a complex system accurately using statistics can be challenging. In many situations, measurements are confounded by uncontrollable factors, e.g. different sampling sites, which may have an effect on the behaviour of the system or on parts thereof. Such cases call for the fitting of mixed-effects models, the appropriate design of which may, however, be complicated. ...read on

17-01-2024: Building a CNN from scratch in C++

We have already constructed a convolutional neural network (CNN) in R, but fitting and running it took an impractically long time. Hence we will now construct the CNN in the programming language C++, which allows for much faster execution. ...read on

06-06-2023: Computing partial derivatives via auto-differentiation - a comparison of approaches in R and Python

Computing partial derivatives of a function with respect to its parameters is a key procedure in optimization and thus a basic requirement for many machine-learning tasks. Here, we are going to compare the different methods of obtaining partial derivatives available in R and Python. ...read on

30-05-2023: Building a convolutional neural network (CNN) from scratch

Convolutional neural networks (CNNs) are a particularly complex type of neural network utilized in machine vision. Here, we will build one from scratch in order to shed light on these often black-box-like algorithms. ...read on

30-05-2023: Implementing k-means clustering in C++

Machine-learning tasks are often conducted in the Python or R programming languages familiar to non-computer-scientists. C++, on the other hand, enables much faster-running code, but is relatively tedious to write. Here, we are going to implement a classical clustering algorithm in C++. ...read on

05-02-2023: Programming the backpropagation algorithm from scratch

Complex neural networks use the backpropagation algorithm for efficient fitting of their parameters. Since its actual meaning and functioning are often taken as a given, we are going to take a closer look. ...read on

12-11-2022: Support-vector machine for data classification

The support-vector machine is a common algorithm for the classification of data. Here, we take a closer look at the properties of a linear binary support-vector machine. ...read on

30-10-2022: Building classification and regression trees from scratch

Classification and regression trees are a common technique for "mining" complex data sets for information. Here, in order to shine some light on these often black-box-like algorithms, we have a thorough look at some custom-written trees. ...read on

24-10-2022: Fitting numerical models with Template Model Builder

Fitting the parameters of a complex numerical model can be a daunting task. The Template Model Builder (TMB) package for R is designed for optimizing the parameters in an efficient way. Here, we will look at its implementation using a simple artificial example. ...read on

27-03-2022: Working with expressions in R

Working with expressions in R can be a powerful tool when the application of loops or functions is not useful. Here, we will explore the usage of expressions in an easy-to-follow example. ...read on

13-02-2022: Fitting non-linear regression models

Fitting non-linear regression models can be quite a daunting task from a programming perspective, especially when their complexity increases. Here, we are going to look at four methods of fitting such models. ...read on

06-02-2022: Building a deep neural network from scratch

Deep neural networks for classification or regression are typically constructed via the Keras API in Python, which is so user-friendly that it is essentially a black box. Here we will look at a much less opaque approach using the deriv() function in R. ...read on

01-01-2022: Hierarchical clustering

Hierarchical clustering is an attractive method for assigning data to multiple clusters simultaneously, and thereby overcomes constraints posed by more traditional approaches. ...read on

01-01-2022: The domains of data science

Data science is an umbrella term for several domains of analytical or prediction-oriented techniques whose common grounds may not be immediately visible. Here, we are going to take a broad look at these domains and their relationships. ...read on

01-01-2022: 3D plots with rgl

Three-dimensional (3D) plots have a bad standing in scientific literature due to the difficulty of their interpretation, but can be useful to visualize complex relationships in an educational context. Here, we look at creating 3D surface plots in R. ...read on

02-09-2021: E-Learning with swirl and swirlify

E-Learning has been an important cornerstone in teaching programs on programming languages and statistics, not just since the Covid pandemic. Here, we are going to look at how to design e-learning lessons with the swirlify package in R. ...read on

14-08-2021: Preventing RStudio from Freezing

Sometimes, you may encounter a situation in which you open the IDE RStudio with scripts still open from a previous session, and it becomes unresponsive immediately. Read on to find out how to solve this issue! ...read on

13-06-2021: Analyzing and Visualizing Classifier Predictions - Step by Step

Designing and training a Deep Neural Network is one part in the process of developing a classifier application. However, it is also important to visualize its performance to judge its quality. ...read on

24-05-2021: Implementing a Deep Neural Network in Keras - Step by Step

Deep Neural Networks are on their way to dominating the field of Machine Learning, seeing increased use in classification, regression and optimization tasks. Their implementation might appear as a mystery to some, yet the implementation in the Keras API is actually fairly straightforward. ...read on

30-03-2021: Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are today's gold standard for image classification and Machine Vision in general. By simulating the procedures in which visual input is processed in the human brain, CNNs often outperform traditional Deep Neural Networks. ...read on

16-03-2021: Using Reticulate for R-Python interaction

The programming languages R and Python have very complementary strengths and weaknesses. Integrating the functions of both languages for working on a specific task can thus be a beneficial venture, and is enabled through the R package reticulate. ...read on

09-03-2021: Dimensionality of Data vs Structure of Data

When starting to work with complex data like images, it is often not easy to recognize the dimensionality of the data and the structure of the data, and to tell one apart from the other. ...read on

07-03-2021: Cluster Analysis with Auto-Encoders

While cluster analysis has traditionally been implemented with relatively simple algorithms like K-Means and Expectation-Maximization, the relatively recent emergence of Deep Neural Networks in applied data science has brought a new, more complex method to the field: the auto-encoder. ...read on

28-02-2021: Data Investigation with Kernel-Density Estimation

Kernel-density estimation (KDE) is a methodology to detect patterns in (often multi-variate) data without imposing the constraint of pre-defining the existence of a certain number of clusters. Basically speaking, KDE tries to detect "commonness" in the data. ...read on

28-02-2021: Clustering with the Expectation-Maximization Algorithm

Expectation-Maximization (EM) is a common clustering algorithm based on probability-density calculations. It is a common alternative to the K-means clustering algorithm. ...read on

28-02-2021: How to get started with Python

Finding the right entry-way into programming Python is not as straightforward as one might think. There are a number of tricks that make working with Python really convenient, though. ...read on

27-02-2021: Customize your computer with bash scripts

Bash scripts, that is, scripts bearing the file-name ending ".sh", offer a convenient way of writing executable protocols or even customizing your computer to your needs. ...read on

25-02-2021: Three ways of implementing a loop

Loops are an essential part of many programming applications, from simple file-operation algorithms to complex numerical models. While loops can be inefficient, some operations clearly depend on their use. ...read on

21-02-2021: K-means clustering

K-Means clustering is one of the most intuitive clustering techniques due to the simplicity and elegance of its design. ...read on

Ex Data Scientia − what does that actually mean?

Ex data, scientia is Latin and translates to "from the data, knowledge" (to be fair, the case form "data" is likely not correct in Latin grammar, but the term "data science" is so common today that a different formulation would have been less understandable to non-Latin speakers). Essentially, it means that we can discover a whole lot of information by just analyzing data in the right ways. This can reduce the amount of data to be gathered to gain insight, e.g. by research surveys, or open up entirely new business fields, as in the branch of Machine Vision.