
# The Online Algorithmic Complexity Calculator

## Overview

This introductory video provides a brief overview for non-experts in the area explaining how both the Coding Theorem Method ($\textit{CTM}$) and the Block Decomposition Method ($\textit{BDM}$) approximate universal measures of algorithmic complexity.

The Online Algorithmic Complexity Calculator (OACC) is a tool developed by the Algorithmic Nature Group to provide reliable estimations of non-computable functions. This is achieved through various numerical methods based upon the mathematical theory of algorithmic probability and algorithmic randomness. The estimations have applications in a wide range of disciplines, from molecular biology and cognitive science to time series research (e.g. finance) and graph theory (for a list of examples, see Publications).

The OACC provides numerical approximations (upper bounds) of Algorithmic (Kolmogorov-Chaitin) Complexity (AC) for short strings ($\textit{CTM}$), for strings of any length ($\textit{BDM}_{1D}$), and for binary arrays ($\textit{BDM}_{2D}$), which can represent the adjacency matrices of unweighted graphs. These techniques are not only an alternative to the widespread use of lossless compression algorithms to approximate AC, but true approaches to AC. Lossless compression algorithms (e.g. BZIP2, GZIP, LZ) that are based upon entropy rate are not more related to AC than Shannon Entropy itself, which is unable to compress anything but statistical regularities.

## Advantages of CTM and BDM over Entropy and Lossless Compression

The following plots show quantifications of the power of $\textit{CTM}$ and $\textit{BDM}$ to identify objects that look statistically random (high Entropy) but are actually causal in nature, as they are produced by short computer programs and thus recursively produced by a generation mechanism that Entropy and lossless compression algorithms based on entropy rate overlook.

A: X axis: Plot of values for a random sample of binary strings of length 100 bits sorted by increasing number of non-zeros. Y axis: Normalized values between 0 and 1 for each of the 3 measures ($\textit{BDM}$, Compress and Entropy) used to approximate algorithmic complexity $K$. The lossless compression curve (here the algorithm used is Compress, but similar behaviour occurs with LZ, LZW, etc.) closely follows the classical Bernoulli distribution for the Entropy of the set of strings. This comes as no surprise, because implementations of lossless compression algorithms are in effect Entropy-rate estimators (up to a fixed window size). However, the $\textit{BDM}$ distribution approximates the expected theoretical semi-exponential distribution for $K$, assigning lower values to strings that are causally generated but look statistically random to Entropy and compression.
B: Same test with bitstrings of length 1000 ($\textit{BDM}$ starts approximating Shannon Entropy if $\textit{CTM}$ is not updated).
C: $\textit{BDM}$ provides a much finer-grained distribution, whereas Shannon Entropy and lossless compression algorithms retrieve only a limited number (about 5) of differentiated 'complexity' values for all strings of length 12.
D: Number of strings (of length 12) among all $2^{12}$ bitstrings with maximum or near-maximum Shannon Entropy but low algorithmic complexity as estimated by $\textit{CTM}$/$\textit{BDM}$. Arrows pointing to what we call 'causal gaps' show the power gained by using $\textit{CTM}$/$\textit{BDM}$ over Shannon Entropy and compression at identifying causally generated strings and objects that may not have any statistical patterns but can be recursively/algorithmically produced with a short computer program. More information on $\textit{CTM}$ is available in this paper and on Entropy and $\textit{BDM}$ in this other one.
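To make the point of these panels concrete, the following minimal sketch computes the symbol-frequency Shannon Entropy of two 12-bit strings. The function name and the two example strings are illustrative assumptions, not taken from the calculator. Both strings contain six 1s, so Entropy assigns them the same maximal value, even though the first is produced by a trivial repeating rule.

```python
from collections import Counter
from math import log2

def shannon_entropy(s):
    """Shannon Entropy in bits per symbol, from symbol frequencies only."""
    n = len(s)
    return sum(-(c / n) * log2(c / n) for c in Counter(s).values())

periodic = "010101010101"   # generated by the rule "repeat 01"
irregular = "110100100011"  # same number of 1s, no obvious rule

# Both strings have exactly six 0s and six 1s, so their Entropy is identical.
print(shannon_entropy(periodic), shannon_entropy(irregular))
```

A frequency-based measure cannot separate these two strings; distinguishing them requires a measure that is sensitive to how the string can be generated, which is what $\textit{CTM}$ and $\textit{BDM}$ aim to provide.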

## Choosing evaluation parameters for Block Decomposition

If the string that you want to evaluate is shorter than $13$ symbols, you should use the Coding Theorem Method ($\textit{CTM}$) to estimate its algorithmic complexity. Otherwise, you should use the Block Decomposition Method ($\textit{BDM}$), which requires specifying block size and block overlap values if you do not want to use the default (optimal) values. The key is to always compare values obtained with the same block size and block overlap, and not values obtained with different ones (as the size and overlap may under- or over-estimate complexity).

The following is a $\textit{BDM}$ partitioning example for block size $= 6$ and block overlap $= 1$ as an illustration of the meaning of block size and block overlap in the estimation of a complexity of a long string:

$\textit{BDM}$ is defined by $$\textit{BDM} = \sum_{i=1}^{n} \left[ \textit{CTM}(\textit{block}_i) + \log_{2}(|\textit{block}_i|) \right],$$ where the sum runs over the $n$ distinct blocks, $\textit{CTM}(\textit{block}_i)$ denotes the approximated algorithmic complexity of the pattern in the block of symbols, and $|\textit{block}_i|$ denotes the number of occurrences (multiplicity) of the block.
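The definition above can be sketched in a few lines of Python, reading the sum as running over the distinct blocks, each contributing its $\textit{CTM}$ value plus the base-2 logarithm of its multiplicity. The lookup table `TOY_CTM` holds made-up values for illustration only; they are not the calculator's precomputed $\textit{CTM}$ values, and the function name `bdm` is likewise an assumption.

```python
from collections import Counter
from math import log2

# Hypothetical CTM values in bits, for illustration only.
TOY_CTM = {"111001": 15.0, "010111": 14.5}

def bdm(blocks, ctm):
    """Sum CTM(block) + log2(multiplicity) over the distinct blocks."""
    multiplicity = Counter(blocks)
    return sum(ctm[block] + log2(count) for block, count in multiplicity.items())

# "111001" occurs twice, so it contributes 15.0 + log2(2); "010111" occurs once.
print(bdm(["111001", "010111", "111001"], TOY_CTM))  # 30.5
```

Note that a repeated block adds only $\log_2$ of its multiplicity rather than its full $\textit{CTM}$ value again, which is how repetition lowers the estimate.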

For $\textit{BDM}$, optimal parameters are usually the largest possible block size ($= 12$) with no overlapping ($= 0$). You should always pick the largest available block size, as it provides better approximations to algorithmic complexity. In contrast, the smallest block size ($=1$) approximates Shannon Entropy. You can pick any overlap value that is smaller than your block size. For example, say your string is $111001010111$ and you use block size $=6$. If the overlap is $0$, then $\textit{BDM}$ will look up the known $\textit{CTM}$ values of $111001$ and $010111$ and add them up, outputting $29.9515$ bits. Alternatively, you can choose block overlap $=1$, for which the strings whose $\textit{CTM}$ values are added are $111001$, $101011$, and $010111$. This second evaluation will output $32.9672$ bits.
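As a rough illustration of what block size and block overlap mean, here is a minimal partitioning sketch. It assumes that consecutive blocks start `block_size - overlap` symbols apart and that an incomplete final block is simply dropped; the calculator's own treatment of the end of the string may differ, and the function name `partition` is hypothetical.

```python
def partition(s, block_size, overlap=0):
    """Split s into blocks of block_size that share `overlap` symbols.

    Assumption: an incomplete block at the end of the string is discarded.
    """
    if not 0 <= overlap < block_size:
        raise ValueError("overlap must be non-negative and smaller than block_size")
    step = block_size - overlap
    return [s[i:i + block_size] for i in range(0, len(s) - block_size + 1, step)]

# The non-overlapping case from the text: two blocks of size 6.
print(partition("111001010111", 6))       # ['111001', '010111']

# With overlap 2, each block shares its last two symbols with the next one.
print(partition("abcdefgh", 4, overlap=2))  # ['abcd', 'cdef', 'efgh']
```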

Overlapping helps to deal with leftovers of the block partitioning if the string length is not a multiple of the block size; otherwise, leftovers shorter than the block size are discarded and are not considered in the complexity estimation. This paper shows different schemes to deal with boundary conditions. The online calculator only deals with the most 'basic' of these schemes, but we have proven that the error is bounded, and thus the output values are reliable for most comparative purposes. In general, overlapping blocks produce overestimations of complexity, and non-overlapping blocks lead to an underestimation only for objects with dimensions that are not a multiple of the block size.

You should always compare results with the same chosen parameters (unless you estimate the error as we did in this paper and then make corrections or take the deviations into consideration).

As for matrices, the same rule holds. Strings are currently supported over binary and non-binary alphabets, but arrays are currently supported only in binary. One can always translate any alphabet into binary, at the cost of some loss of precision due to the extra granularity introduced in the translation.
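For binary arrays the same idea applies in two dimensions. The sketch below cuts a matrix into non-overlapping square blocks and drops rows and columns that do not fill a complete block; the function name `partition_2d` and this boundary choice are assumptions made for illustration, not a description of the calculator's internals.

```python
from collections import Counter

def partition_2d(matrix, block_size):
    """Cut a binary matrix into non-overlapping block_size x block_size blocks.

    Assumption: rows and columns that do not fill a whole block are discarded.
    """
    rows, cols = len(matrix), len(matrix[0])
    blocks = []
    for r in range(0, rows - block_size + 1, block_size):
        for c in range(0, cols - block_size + 1, block_size):
            block = tuple(tuple(matrix[r + i][c:c + block_size])
                          for i in range(block_size))
            blocks.append(block)
    return blocks

# Adjacency matrix of a 4-node cycle graph (unweighted, undirected).
cycle = [[0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0]]

# All four 2x2 blocks are identical, so there is one distinct block
# with multiplicity 4.
print(Counter(partition_2d(cycle, 2)))
```

Blocks are returned as tuples of tuples so that they can be counted or used as keys in a lookup table of block complexities.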

## Estimating Algorithmic Probability (AP)

For more technical details, please read the papers listed in the Bibliography section below. In a nutshell, we calculate a function $D(n,m)(s)$, which estimates the Algorithmic Probability of a string $s$ from a set of halting Turing machines with $n$ states and $m$ symbols denoted by $(n,m)$. We use the standard model of Turing machines used by Tibor Rado in the definition of the Busy Beaver problem, but we have also proven that radical changes to the model produce similar estimations. Beyond the known values of the Busy Beaver problem, we have also shown that educated choices of halting times can be made to reach certainty up to any arbitrary statistical significance level.

Formally, $$D(n, m)(s)=\frac{|\{(T,b)\in(n, m)\times \{0,1,\cdots,m-1\} : T(b) = s\}|}{|\{(T,b)\in(n, m)\times \{0,1,\cdots,m-1\} : T(b) \textit{ halts }\}|},$$ where $T(b)$ represents the output, upon halting, of Turing machine $T$ when run on a blank tape filled with symbol $b$, and $|A|$ represents the cardinality of the set $A$.
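The ratio above can be illustrated with a small brute-force sketch for very small machine spaces. Several details here are assumptions made for the sake of a runnable example: the halting transition writes a symbol and stays in place, the output is taken to be the segment of tape visited by the head, and all function names are hypothetical. The production computations described in this page use a dedicated simulator and far larger spaces.

```python
from collections import Counter
from itertools import product

HALT = 0  # label for the halting state; working states are 1..n

def machines(n, m=2):
    """Yield every transition table with n working states and m symbols."""
    halting = [(w, 0, HALT) for w in range(m)]            # write, stay, halt
    moving = [(w, d, q) for w in range(m)
              for d in (-1, 1) for q in range(1, n + 1)]  # write, move, continue
    keys = [(q, a) for q in range(1, n + 1) for a in range(m)]
    for rules in product(halting + moving, repeat=len(keys)):
        yield dict(zip(keys, rules))

def run(table, blank, max_steps):
    """Return the visited tape segment if the machine halts in time, else None."""
    tape, pos, state, lo, hi = {}, 0, 1, 0, 0
    for _ in range(max_steps):
        write, move, next_state = table[(state, tape.get(pos, blank))]
        tape[pos] = write
        if next_state == HALT:
            return "".join(str(tape.get(i, blank)) for i in range(lo, hi + 1))
        pos += move
        lo, hi = min(lo, pos), max(hi, pos)
        state = next_state
    return None  # treated as non-halting within the step budget

def output_distribution(n, m=2, max_steps=10):
    """Estimate D(n, m): frequency of each output among halting (T, b) pairs."""
    counts = Counter()
    for table in machines(n, m):
        for blank in range(m):
            out = run(table, blank, max_steps)
            if out is not None:
                counts[out] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

# One working state: a machine either halts at once or runs forever,
# so the only outputs are the single symbols "0" and "1".
print(output_distribution(1))  # {'0': 0.5, '1': 0.5}
```

Calling `output_distribution(2, max_steps=6)` explores the 10,000 two-state tables of this formalism in the same way; the step budget plays the role of the Busy Beaver bound or, for larger spaces, of an educated timeout.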

For $(n,2)$ with $n < 5$, the known Busy Beaver values give the maximum number of steps that a machine can run before halting. But for $n \geq 5$ or for $(n,m)$ with $m > 2$, no Busy Beaver values are known, and the size of the machine spaces makes a complete exploration to calculate $D(n,m)$ impossible for arbitrary $n$ and $m$. An educated choice of timeouts can nevertheless be made and samples produced (see the Bibliography section below).

## Approximating Algorithmic Complexity (K) by CTM and BDM

The function $D(n,m)$ is an approximation to Levin's Universal Distribution $\mathfrak{m}(s)$, and it can be used to approximate $K(s)$, the algorithmic complexity of string $s$ by using the Coding Theorem, $$K(s) \simeq -\log_2 \mathfrak{m}(s)$$
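As a minimal numerical illustration of the Coding Theorem relation above, the following sketch converts an estimated probability into a complexity estimate in bits. The function name and the example probabilities are assumptions chosen for illustration, not values produced by the calculator.

```python
from math import log2

def ctm_bits(probability):
    """Complexity estimate in bits from an output probability: -log2(p)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must lie in (0, 1]")
    return -log2(probability)

# A string produced by half of all halting machines costs 1 bit;
# halving the probability adds one more bit.
print(ctm_bits(0.5), ctm_bits(0.25))  # 1.0 2.0
```

Strings that are produced by many small machines therefore receive low complexity values, and rarely produced strings receive high ones.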

The greater the value of $n$ we use to calculate $D(n,m)$, the better the approximation to $K(s)$ for a string $s$ over an alphabet of $m$ symbols. Due to the uncomputability of $D(n,m)$, we work with samples and runtime cutoffs. For the simulation of Turing machines we use a C++ simulator running on a medium-sized supercomputer.

$\textit{BDM}$ extends the power of $\textit{CTM}$ as explained in this paper.

## Approximating Bennett's Logical Depth by CTM and BDM

The $\textit{CTM}$ allows us to build an empirical distribution of computer program outputs, from the smallest programs to larger ones. In addition, once a string is generated for the first time among all the computer programs of the smallest size, we know which Turing machine is the smallest one producing that string, and also the runtime of that Turing machine. We take this runtime as an estimation of Bennett's Logical Depth ($LD$) by $\textit{CTM}$, and also extend the power of $\textit{CTM}$ to estimate $LD$ with a multiplicative variation of the $\textit{BDM}$ formula. Despite the fact that $LD$ is neither lower nor upper semi-computable (and therefore truly non-computable), estimations by $\textit{CTM}$ and $\textit{BDM}$ do produce the characteristic concave distribution that assigns algorithmically random strings lower logical depth, thereby conforming with the theoretical expectation, unlike Shannon Entropy and lossless compression:

Unlike approximations to algorithmic complexity by lossless compression (top left plot), $LD$-based values using $\textit{CTM}$ and $\textit{BDM}$ conform to theoretical expectation regarding $LD$ behaviour. The behaviour is confirmed among all the $\textit{BDM}$ variations according to different boundary condition schemes, and it could be reproduced neither with Shannon Entropy (or Entropy rate) nor with lossless compression algorithms. Further information is available in this paper.

## Calculating your ability to behave randomly by producing a random-looking sequence and grid

In an article published by PLoS Computational Biology we have shown that the ability to produce algorithmic randomness peaks at age 25. The article was widely covered by the world media. You can test your own personal ability using this calculator. The results in the paper are given in what are called '$\textit{z-scores}$'. To obtain your result and be able to compare it to those in the paper, you only need to apply the following formula to the output that you obtain from this calculator: your $\textit{z-score} = (K-m)/s$, where $K$ is the output that you will obtain from this calculator for a sequence (in the tab 'For short strings' choose $\textit{CTM}$ value) or for a grid (in the tab 'For binary arrays' choose $\textit{BDM}$ value). The values for $m$ and $s$, and the parameters to choose in the calculator appear in the following table:

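Your $\textit{z-score}$ calculation

| item | task | $m$ | $s$ | alphabet | overlap | length |
|------|-------|-------|------|----------|---------|--------|
| 1 | Toss | 32.5 | 1.56 | 2 | 0 | 12 |
| 2 | Guess | 34.7 | 1.07 | 5 | 0 | 10 |
| 3 | Spot | 41.41 | 0.93 | 9 | 0 | 10 |
| 4 | Roll | 36.62 | 1.04 | 6 | 0 | 10 |
| 5 | Grid | 17.01 | 1.08 | 2D | 0 | 3 x 3 |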
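The $\textit{z-score}$ formula can be applied with a short script such as the following sketch. The values of $m$ and $s$ are copied from the table above; the dictionary and function names are hypothetical, and the sample value of $K$ is made up for illustration.

```python
# (m, s) per task, copied from the table above.
NORMS = {
    "Toss": (32.5, 1.56),
    "Guess": (34.7, 1.07),
    "Spot": (41.41, 0.93),
    "Roll": (36.62, 1.04),
    "Grid": (17.01, 1.08),
}

def z_score(k, task):
    """Return (K - m) / s for the calculator output K of the given task."""
    m, s = NORMS[task]
    return (k - m) / s

# Example with a made-up calculator output for the Toss task.
print(round(z_score(33.28, "Toss"), 2))  # 0.5
```

A positive score means the sequence or grid was estimated to be more complex than the reference mean for that task, and a negative score means it was less complex.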

## Graph complexity and perturbation analysis to move networks towards and away from randomness

Our algorithmic causal calculus article explains the details of the methods, and we have prepared an animated video with some of the basic ideas behind these methods in an application to steering biological (genetic regulatory) networks:

The method implemented in this version of the calculator does not allow the removal of nodes in networks of fewer than 5 nodes.

## Numerical Limitations

The numerical limitations of CTM are the ultimate incomputability of the universal distribution and the constant involved in the invariance theorem. We have nevertheless quantified this constant and estimated it to be apparently under control, with values quickly converging to 0 even in the face of computational model variations. Our papers cover these limitations, and their consequences should be taken into account.

For BDM, the limitations are explored in this paper and they are related to boundary conditions and to the limitations of CTM itself. The paper also shows that when CTM is not updated, BDM starts approximating Shannon Entropy in the long range, yet the local estimations of CTM shed light on the algorithmic causal nature of even large objects.

For the algorithmic perturbation analysis of graphs and networks, as described in our algorithmic causal calculus article and implemented in this calculator, the current version (3.0) does not correct for graph isomorphisms. This has to be taken into consideration by taking the minimum value of the information shift among all nodes/edges in the same group orbit. The reason is the limited ability of the Block Decomposition Method (BDM) to deal with the boundaries of adjacency matrices whose dimensions are not a multiple of the BDM block size (e.g. 4x4). In version 3.5 we aim to make this correction automatically by calculating the graph automorphism group using publicly available computer programs such as nauty or saucy, which you can use yourself to make these corrections in the meantime. In any case, we have demonstrated that the error vanishes for large networks, and we have also proven that the approximation of algorithmic complexity by BDM of a labelled graph (an instance of its automorphism group) is a good approximation of the unlabelled version, because one can calculate the group from any instance with a brute-force program of (small) fixed length (even when computationally intractable). So, in practice, the approximations provided are still reasonably good, robust and stable, even if with some irregularities for some elements.

## Version history and future of the OACC

We keep expanding the calculator to wider horizons of methodological and numerical capabilities:

• Version 1: Estimations of $K$ for short binary strings
• Version 2: Expanded $\textit{CTM}$ and $\textit{BDM}$ capabilities for non-binary strings and arrays
• Version 2.5 (current version): Estimations of Bennett's logical depth based on $\textit{CTM}$
• Version 3: Algorithmic-information dynamics of strings, arrays and networks
• Version 4: Algorithmic complexity of models, algorithmic feature selection, algorithmic dimensionality reduction and model generator.

## Bibliography

Delahaye J.-P. and Zenil, H. (2012) Numerical Evaluation of the Complexity of Short Strings: A Glance Into the Innermost Structure of Algorithmic Randomness. Applied Mathematics and Computation 219, pp. 63-77.

Soler-Toscano F., Zenil H., Delahaye J.-P. and Gauvrit N. (2014) Calculating Kolmogorov Complexity from the Output Frequency Distributions of Small Turing Machines. PLoS ONE 9(5): e96223.

Zenil H., Soler-Toscano F., Dingle K. and Louis A. (2014) Correlation of Automorphism Group Size and Topological Properties with Program-size Complexity Evaluations of Graphs and Complex Networks. Physica A: Statistical Mechanics and its Applications 404, pp. 341-358.

Zenil H., Soler-Toscano F., Kiani N.A., Hernández-Orozco S., Rueda-Toicen A. (2016) A Decomposition Method for Global Evaluation of Shannon Entropy and Local Estimations of Algorithmic Complexity, arXiv:1609.00110

Zenil H., Kiani N.A., Marabita F., Deng Y., Elias S., Schmidt A., Ball G., Tegnér J. (2017) An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems, arXiv:1709.05429 [q-bio.OT]

See the Publications section for even more articles, details and applications to other areas and connections to other measures.