Bayesian statistics as an extension of machine learning methods
- By Björn Piepenburg
- Bayes' theorem, conditional probabilities, posterior, probabilities, statistics, Thomas Bayes
Thomas Bayes was born in London at the beginning of the 18th century, the son of a vicar, and became a vicar himself after studying theology. His other interests were logic and statistics, which he pursued in his spare time. His major scientific contribution is the so-called Bayes' theorem, which was only published three years after his death:
$P(A_k|E)=\frac{P(A_k)\cdot P(E|A_k)}{\sum_{i=1}^n P(A_i)\cdot P(E|A_i)}$

where the events $A_1, \dots, A_n$ form a partition of the sample space and $E$ is the observed evidence.
As an example of its application, consider a medical rapid test that gives a positive result for 95% of sick people ($P(positive|sick)=0.95$). For 2% of healthy people, the test also falsely gives a positive result ($P(positive|healthy)=0.02$). The disease affects 2% of the population ($P(sick)=0.02$ and, correspondingly, $P(healthy)=0.98$), and everyone can be tested. Question: if a person tests positive, what is the probability that they actually have the disease?
$P(sick|positive)=\frac{P(sick)\cdot P(positive|sick)}{P(sick)\cdot P(positive|sick)+P(healthy)\cdot P(positive|healthy)}=\frac{0.02\cdot 0.95}{0.02\cdot 0.95+0.98\cdot 0.02}\approx 0.49$
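To make the calculation concrete, here is a minimal Python sketch of the computation above (the variable names are chosen for illustration):

```python
# Prior probabilities
p_sick = 0.02
p_healthy = 1 - p_sick

# Test characteristics (likelihoods)
p_pos_given_sick = 0.95     # sensitivity
p_pos_given_healthy = 0.02  # false-positive rate

# Bayes' theorem: P(sick | positive)
evidence = p_sick * p_pos_given_sick + p_healthy * p_pos_given_healthy
p_sick_given_pos = p_sick * p_pos_given_sick / evidence

print(f"P(sick | positive) = {p_sick_given_pos:.2%}")  # ~49.22%
```

Despite the seemingly reliable test, fewer than half of the positive results actually indicate the disease, because healthy people vastly outnumber sick ones.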
Based on Bayes' theorem, Bayesian statistics has developed; it is used in inductive statistics and machine learning to estimate parameters and test hypotheses. For this purpose, the parameters are initially assigned assumed distributions (so-called prior distributions). These distributions are then adapted iteratively to the problem using statistics from samples or the results of experiments (the prior distributions become posterior distributions).
One example that appears frequently in the literature is the experimental determination of the win probabilities of one-armed bandits. Take, for example, three bandits with different (unknown) win probabilities (the outcome of a game is either a win or no win, with a constant win amount). Since we have no prior knowledge, we assume a beta distribution with parameters $a=1$ and $b=1$ (which corresponds to a uniform distribution) for each win probability. To determine the posterior distributions, we iteratively select a bandit (depending on the experience already gained) and update its win-probability distribution according to the outcome of the game. The procedure can be stopped once the distributions of the three bandits no longer change significantly.
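A minimal Python sketch of this procedure, assuming Thompson sampling as the selection strategy (the post only says "depending on the experience already gained", so the strategy is an assumption); the true win probabilities below match those in the figures:

```python
import numpy as np

rng = np.random.default_rng(42)

true_p = [0.2, 0.5, 0.75]  # unknown win probabilities (blue, green, red)
a = np.ones(3)             # beta parameter a (prior: a=1)
b = np.ones(3)             # beta parameter b (prior: b=1)

for game in range(200):
    # Thompson sampling: draw one sample per bandit from its current
    # posterior and play the bandit with the largest sample
    samples = rng.beta(a, b)
    k = np.argmax(samples)

    # Play bandit k; the outcome is win (1) or no win (0)
    win = rng.random() < true_p[k]

    # Conjugate update: a Beta(a, b) prior becomes Beta(a+1, b) after
    # a win and Beta(a, b+1) after a loss
    if win:
        a[k] += 1
    else:
        b[k] += 1

# Posterior means after 200 games
print(a / (a + b))
```

The beta distribution is the conjugate prior of the win/no-win (Bernoulli) outcome, which is why each update is a simple increment of $a$ or $b$.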
The following figures show the results of the described experiment after 5, 10, 20, 50, 100 and 200 games. The actual win probability of the blue bandit is 0.2, that of the green bandit 0.5 and that of the red bandit 0.75. You can see the development from the prior distribution (all win probabilities equally likely) to the posterior distributions.
In addition to the estimated win probabilities, the figures show the spread of the estimates. This spread can be interpreted as the certainty (or uncertainty) of the estimated win probability. To make this additional information usable in different applications, machine learning and artificial intelligence algorithms are being extended with Bayesian statistical approaches.
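This spread can be quantified directly from the posterior, for example as a credible interval; a short sketch using SciPy (the win/loss counts are illustrative, not taken from the figures):

```python
from scipy import stats

# Posterior after, say, 30 wins and 10 losses on a bandit with prior Beta(1, 1)
a, b = 1 + 30, 1 + 10

posterior = stats.beta(a, b)
mean = posterior.mean()
lo, hi = posterior.interval(0.95)  # central 95% credible interval

print(f"estimated win probability: {mean:.2f} ({lo:.2f} - {hi:.2f})")
```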
To illustrate: imagine you have a problem that needs to be solved on the basis of data. Experience shows that empirical data is subject to a certain amount of scatter, is flawed, partially incomplete and, in short, not unambiguous. You use this data to train your model, and the result is a single value that apparently represents the correct answer to your problem. But how can that be if the data basis is not unambiguous? The solution algorithm must therefore be adapted so that these data problems are reflected in the result.

To achieve this, adapted solution methods are currently being developed for the relevant machine learning algorithms; they take the scatter in the data into account in every calculation step and output a distribution as the result. One example is artificial neural networks in which not only the outputs but also the network weights are replaced by posterior distributions. In addition to the solution methods, the interfaces for further processing results in the form of distributions must also be adapted. For example, in a research project funded by the mFund, we used Bayesian statistics to determine bid prices on the basis of a data base that was weak in places.
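To sketch what "weights as distributions" means in practice, here is a minimal, hypothetical example: a tiny one-layer network whose weights are independent Gaussians (the means and standard deviations below are made up for illustration; a real Bayesian neural network would learn them, e.g. via variational inference). The prediction is then itself a distribution, obtained by sampling the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned posterior over the weights of a 1-layer network:
# each weight is an independent Gaussian (mean, standard deviation)
w_mean, w_std = np.array([0.8, -0.3]), np.array([0.1, 0.05])
b_mean, b_std = 0.2, 0.1

x = np.array([1.5, 2.0])  # a single input

# Monte Carlo over the weight posterior: each draw gives one prediction,
# so the output is a distribution rather than a point estimate
n_samples = 10_000
w = rng.normal(w_mean, w_std, size=(n_samples, 2))
b = rng.normal(b_mean, b_std, size=n_samples)
y = w @ x + b

print(f"predictive mean: {y.mean():.3f}, predictive std: {y.std():.3f}")
```

The predictive standard deviation plays the same role as the spread in the bandit figures above: it tells downstream applications how much trust to place in the estimate.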