Welcome to the exciting world of data science!
Here, you will find information on topics covering data analysis, programming, statistics and visualization. The articles will primarily concern the ideas behind the various concepts, and their application. The mathematical side will be treated less intensely, so don't expect to find a lot of formulas or textbook-style definitions. Instead, this site is primarily intended to be a fun "playground" to share all kinds of astonishing insights, in the hope that they may be of interest or use to the reader.
This website is not complete from the start. Rather, articles will be added over time, to hopefully become a comprehensive resource for applied data science at some point. This also means that the sequence in which articles are released follows a tutorial-like schedule. Some articles will require the reader to have more pre-existing knowledge than others. Articles published later may explain a topic only briefly addressed in a previous one.
So, without further ado, grab a cup of coffee, lean back and enjoy browsing the site. And don't forget to come back regularly to find new exciting insights!
Happy reading!
When you fit a complex statistical model to data, you may sometimes encounter a warning message referring to unfulfilled properties of the so-called "Hessian matrix", with the final parameter estimates not being trustworthy as a result. Here we will investigate the meaning of the Hessian matrix and why it is important in the statistical-fitting process. ...read on
Describing a complex system accurately using statistics can be challenging. In many situations, measurements are confounded by uncontrollable factors, e.g. different sampling sites, which may have an effect on the behaviour of the system or on parts thereof. Such cases call for the fitting of mixed-effect models, the appropriate design of which may, however, be complicated. ...read on
We have already constructed a convolutional neural network (CNN) in R, but fitting and running it took an impractically long time. Hence we will now construct the CNN in the programming language C++, which allows for much faster execution. ...read on
Computing partial derivatives of a function with respect to its parameters is a key procedure in optimization and thus a basic requirement for many machine-learning tasks. Here, we are going to compare the different methods of obtaining partial derivatives available in R and Python. ...read on
Convolutional neural networks (CNNs) are a particularly complex type of neural network utilized in machine vision. Here, we will build one from scratch in order to shed light on these often black-box-like algorithms. ...read on
Machine-learning tasks are often carried out in the Python or R programming languages, which are familiar to non-computer-scientists. C++, on the other hand, enables much faster-running code, but is relatively tedious to write. Here, we are going to implement a classical clustering algorithm in C++. ...read on
Complex neural networks use the backpropagation algorithm for efficient fitting of their parameters. Since its actual meaning and functioning are often treated somewhat as a given, we are going to take a closer look. ...read on
The support-vector machine is a common algorithm for the classification of data. Here, we take a closer look at the properties of a linear binary support-vector machine. ...read on
Classification and regression trees are a common technique for "mining" complex data sets for information. Here, in order to shine some light on these often black-box-like algorithms, we have a thorough look at some custom-written trees. ...read on
Fitting the parameters of a complex numerical model can be a daunting task. The Template Model Builder (TMB) package for R is designed for optimizing the parameters in an efficient way. Here, we will look at its implementation on an easy artificial example. ...read on
Working with expressions in R can be a powerful tool when the application of loops or functions is not useful. Here, we will explore the usage of expressions in an easy-to-follow example. ...read on
Fitting non-linear regression models can be quite a daunting task from a programming perspective, especially when their complexity increases. Here, we are going to look at four methods of fitting such models. ...read on
Deep neural networks for classification or regression are typically constructed via the Keras API in Python, which is so user-friendly that it is essentially a black box. Here we will look at a much less opaque approach using the deriv() function in R. ...read on
Hierarchical clustering is an attractive method for assigning data to multiple clusters simultaneously, and thereby overcomes constraints posed by more traditional approaches. ...read on
Data science is an umbrella term for several domains of analytical or prediction-oriented techniques whose common grounds may not be immediately visible. Here, we are going to take a broad look at these domains and their relationships. ...read on
Three-dimensional (3D) plots have a bad standing in scientific literature due to the difficulty of their interpretation, but can be useful to visualize complex relationships in an educational context. Here, we look at creating 3D surface plots in R. ...read on
E-learning has been an important cornerstone of teaching programs on programming languages and statistics, and not just since the Covid pandemic. Here, we are going to look at how to design e-learning lessons with the swirlify package in R. ...read on
Sometimes, you may encounter a situation in which you open the IDE RStudio with scripts still opened, and it becomes unresponsive immediately. Read on to find out how to solve this issue! ...read on
Designing and training a Deep Neural Network is one part of the process of developing a classifier application. However, it is also important to visualize its performance to judge its quality. ...read on
Deep Neural Networks are on the way to dominating the field of Machine Learning, seeing increased use in classification, regression and optimization tasks. How they work might appear as a mystery to some, yet their implementation in the Keras API is actually fairly straightforward. ...read on
Convolutional Neural Networks (CNNs) are today's gold standard for image classification and Machine Vision in general. By simulating the procedures by which visual input is processed in the human brain, CNNs often outperform traditional Deep Neural Networks. ...read on
The programming languages R and Python have very complementary strengths and weaknesses. Integrating the functions of both languages for working on a specific task can thus be a beneficial venture, and is enabled through the R package reticulate. ...read on
When starting to work with complex data like images, it is often not easy to recognize the dimensionality of the data and the structure of the data, or to tell one from the other. ...read on
While cluster analysis has traditionally been implemented with relatively simple algorithms like K-Means and Expectation-Maximization, the relatively recent emergence of Deep Neural Networks in applied data science has brought a new, more complex method to the field: the auto-encoder. ...read on
Kernel-density estimation (KDE) is a methodology to detect patterns in (often multi-variate) data without imposing the constraint of pre-defining the existence of a certain number of clusters. Basically speaking, KDE tries to detect "commonness" in the data. ...read on
Expectation-Maximization (EM) is a common clustering algorithm based on probability-density calculations. It is a frequently used alternative to the K-Means clustering algorithm. ...read on
Finding the right entryway into programming in Python is not as straightforward as one might think. There are a number of tricks that make working with Python really convenient, though. ...read on
Bash scripts, that is, scripts bearing the file-name ending ".sh", offer a convenient way of writing executable protocols or even customizing your computer to your needs. ...read on
Loops are an essential part of many programming applications, from simple file-operation algorithms to complex numerical models. While loops can be inefficient, some operations clearly depend on their use. ...read on
K-Means clustering is one of the most intuitive clustering techniques due to the simplicity and elegance of its design. ...read on
Ex data, scientia is Latin and translates to "from the data, knowledge" (to be fair, the case form "data" is likely not correct in Latin grammar, but the term "data science" is so common today that a different formulation would have been less understandable to non-Latin speakers). Essentially, it means that we can discover a whole lot of information by just analyzing data in the right ways. This can reduce the amount of data that needs to be gathered to gain insight, e.g. through research surveys, or open up entirely new business fields, as in the branch of Machine Vision.