A Crash course on proving the Halting Problem

Explained in an informally rigorous way

Haihan Lan
Towards Data Science

--

A plan for Charles Babbage’s Analytical Engine circa 1840, which would have been a Turing complete mechanical computer had it ever been built. CC BY 4.0

During a lesson, the Master explained the “I”:

“The I is the voice of the circumference,” he said.

When asked by a student to explain what he meant, the Master said, “It is a voice inside your head.”

“I don’t have a voice inside my head,” thought the student, and he raised his hand to tell the Master.

The Master stopped the student and said, “The voice that just said you have no voice in your head is the I.”

And then the students were enlightened.

Suppose Jeff Bezos announced on Twitter: “I will offer $1 billion to the person who can write a program that can test any and all other conceivable programs to see whether, on some input, they produce an output after a finite time or continue running forever.”

You immediately start clacking away in your dark-themed IDE, and proceed swiftly to write the Python class structure for the project. But in the back of your mind you feel a certain sense of foreboding, as if some esoteric knowledge from the aether is quietly warning you of the impending futility.

As it turns out, it is in general impossible to write a program or algorithm that determines whether an arbitrary program or algorithm (including itself) halts on some arbitrary input or continues running forever. Moreover, it is also impossible to determine whether said programs halt or run forever on all possible inputs. In this article we will talk about the former fact.

For the sake of discussion, let’s assume that our algorithms are perfect and will always produce the correct result in addition to halting. We can think of a general algorithm (informally) as behaving similarly to a giant lookup table (a function) that maps some input x to its corresponding output y = f(x), and does this predictably 100% of the time. However, unlike a lookup table, the algorithm can consist of decision statements (if, else), loops and recursion in the classic programming sense, so that it is a ‘dynamic’ process with start and end states, which computes some output y on inputs x or continues running forever. The above problem is known as the halting problem and was famously proved by Alan Turing in 1936 to be uncomputable under the formal definition of algorithms that he invented and its associated computational model, now popularly called the Turing machine. Indeed, not all functions are computable, so our algorithms are distinct from lookup-table functions even though the above analogy is useful.

To get an appreciation of why this is the case, put on your thinking caps and we’ll discuss some of the ideas behind computability and the halting problem in an informally rigorous way. In fact, we need not even formally define what an algorithm is, nor a computational model such as a Turing machine. We will only make simple arguments about general computability.

Let’s begin with some definitions. For now, all the inputs we feed to our algorithms or programs are drawn from the natural numbers N, so our inputs consist of single whole numbers such as 2, 3, 12, 23 or 2^82589933 − 1, and combinations thereof (in the sense of passing multiple arguments x, y, z to f(x, y, z)). The idea is that all the unique inputs we care to feed into our algorithm or function can be encoded by natural numbers.
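As a concrete sketch of how multiple inputs can be encoded as one natural number, here is the classic Cantor pairing function (the name `cantor_pair` is mine, not from the article):

```python
def cantor_pair(x: int, y: int) -> int:
    """Bijectively encode a pair of naturals (x, y) as a single natural."""
    return (x + y) * (x + y + 1) // 2 + y

# Every pair maps to a distinct code, so a two-argument f(x, y)
# can always be viewed as a one-argument f(n).
codes = {cantor_pair(x, y) for x in range(20) for y in range(20)}
```

Nesting the pairing gives encodings of triples, quadruples, and so on.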

Definition 1: A function p(x) = y with x, y ∈ N is called partially computable (p.c.) if there is an algorithm that halts when y is defined but does not halt (diverges) when y is undefined. This algorithm computes the values of y. Intuitively, a partial function p is a function that is not required to map every element x of the domain X to an element y of the co-domain Y.

An example of a partially computable function is non-negative integer division with remainder, defined in the following with a, b ∈ N:

a ÷ b = p(a, b) = { φ(a, b) = (q, r) if b ≠ 0; undefined if b = 0 }

In the above, the entity φ is the actual algorithm or procedure used to compute the result of integer division with remainder, which is the function p. We can easily recall from elementary school that the rules of division with remainder are well defined and work for all non-negative integers a and b except when b is 0. Regular non-negative integer division without remainder is also a partial function.
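A minimal Python sketch of this partial function, where the undefined case is modeled by literal divergence (an infinite loop), might look like the following (the name `divmod_partial` is mine):

```python
def divmod_partial(a: int, b: int):
    """Partial computable division with remainder for non-negative a, b.

    Halts with (q, r) when b != 0; diverges (loops forever) when b == 0,
    mirroring the undefined case of the partial function p(a, b).
    """
    if b == 0:
        while True:      # undefined case: the algorithm never halts
            pass
    q, r = 0, a
    while r >= b:        # repeated subtraction, as in elementary school
        q, r = q + 1, r - b
    return q, r
```

Only ever call it with b ≠ 0; the b = 0 branch is intentionally a hang, not an error.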

Definition 2: A function f is total computable, or simply computable, if it is defined on its entire domain of inputs (a subset of N), and there exists an algorithm that halts on every element of the domain of f.

We can make general integer division with remainder computable with a few tricks:

a ÷ b = p(a, b) = { Φ(a, b) = (q, r) if b ≠ 0; NaN if b = 0 }

Notice that we output NaN if b is 0, and we upgraded our algorithm Φ to handle negative integers as well. Technically* our function is defined for the entire domain of a, b ∈ Z, which, by the work of the famous mathematician Georg Cantor, has the same cardinality as the natural numbers N; that is, it is countably infinite.

*On the subject of technicalities: they are the bread and butter of proof methods in math and first-order logic, since definitions in math are fairly precise, so best just to accept them. The proofs later on will rely directly on technicalities of the definitions that we ourselves created.

Definition 3: For a certain algorithm φₖ, a set (of inputs) is called computably enumerable (c.e.; the term recursively enumerable is also popular) if it is the domain of a p.c. function p = φₖ. That is, a set X is c.e. via φₖ if φₖ halts on exactly the elements x of X and diverges on all other inputs.

Computable enumerability is attributed to specific sets via specific algorithms. Therefore we can say that the set of pairs (a, b) with b ≠ 0 is c.e. via the division-with-remainder algorithm, since those are exactly the inputs on which that algorithm halts. As well, the φₖ are unique to a certain model of computation, but all models of deterministic computation with the right features, even ones that differ slightly from each other, are equivalent in the Turing sense. So we can say φₖ is the kth algorithm in the set of all possible algorithms in a computational model T.
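The shape of an algorithm witnessing a c.e. set is a ‘semi-decider’: it halts on members and searches forever on non-members. As an illustrative sketch (perfect squares are of course fully decidable; the point here is only the halting behavior, and the name `semi_decide_square` is mine):

```python
def semi_decide_square(n: int) -> bool:
    """Halts (returning True) iff n is a perfect square.

    On non-squares the unbounded search never gives up, so the function
    diverges: the set of perfect squares is the *domain* of this
    partial function, which makes that set c.e. via this algorithm."""
    k = 0
    while True:
        if k * k == n:
            return True
        k += 1
```

Call it only on members unless you are happy to wait forever.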

Definition 4: A subset of N called X is computable or recursive if it has a computable characteristic function or indicator function:

I(x) = { 1 if x is in X; 0 if x is not in X }

So the indicator function must halt and give a true or false result on the entire domain N.
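For instance, the set of even numbers is computable because its indicator function is total, as in this sketch (the name `indicator_even` is mine):

```python
def indicator_even(x: int) -> int:
    """Total computable characteristic function of the even numbers:
    halts with 1 or 0 on every natural number x."""
    return 1 if x % 2 == 0 else 0
```

Contrast this with the semi-decider shape above: an indicator is never allowed to diverge.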

Now that we have built our playground, it’s time to have some fun. Before going into the proof of why the halting problem is not computable, let’s warm up with the following question:

In a town where the barber shaves all and only the men who do not shave themselves, who shaves the barber? Wives, bachelorettes, mothers, children and out-of-towners do not get to shave anyone in this town.

If the barber does not shave himself, then he must shave himself; and conversely, if the barber shaves himself, then by definition he cannot! This is an applied version of Russell’s paradox, which asks whether the set of all sets not containing themselves exists. Russell’s paradox is a direct consequence of the naive notion of a ‘set of all sets’, before such a concept was discarded from the axioms of standard set theory. Another interesting logical contradiction is the famous Catch-22:

Applying for a job requires work experience, but to obtain work experience one requires a job.

It turns out we can use these contradictions to show that the opposite of some premise (a logical hypothesis) or compound premise must be true: if the premises we have logically deduce a contradiction, the negation of the original premise must be true. This is the famous proof-by-contradiction technique. For example, the existence of Russell’s paradox implies that a set of all sets cannot exist. Note that many important theorems in computability theory, somewhat like in machine learning, are non-constructive. In other words, the theorem may postulate the existence of some certain thing, or rule out the existence of others, but not actually provide an example of the thing, so the proof may be unintuitive and downright weird, but that’s how computability works.

The setup of the halting problem is such that it leads to a contradiction; let’s define a halting set H as the set of all programs or algorithms that halt on some arbitrary inputs. The proof would not be as profound if we only argued about certain subsets of H. Intuitively this is an acceptable, albeit naive as we’ll soon see, definition. Recall that in our computational model T all k-indexed algorithms are unique.

Definition 5: H := { (k, x) such that an algorithm or program φₖ halts when run on input x }

Theorem 1: H is not computable.

Proof:

Assume H is (total) computable; then it must have a computable indicator function I(k, x) by Def. 4. This indicator outputs a 1 if a program halts on x and a 0 otherwise:

I(k, x) = { 1 if φₖ halts on x; 0 otherwise}.

Suppose we make a program φ in the following way:

φₘ(x) := {1 if I(m, x) = 0; undefined if I(m, x) = 1}

A pseudo-code for the function φₘ would look like the following:

def φ_m(x):
    if I(m, x) == 1:   # the indicator predicts φ_m halts on x...
        while True:    # ...so loop forever instead
            pass
    return 1           # the indicator predicts divergence, so halt

φₘ is in H (since φₘ halts on some input x), and is also p.c. (remember, only the indicator needs to be total computable). Now observe what happens when we apply the indicator function to φₘ:

We want to determine if φₘ halts on x. If the indicator I(m, x) determines that φₘ halts on x, then by definition of φₘ, it must not halt (φₘ’s return value is undefined). If the indicator determines that φₘ does not halt, then φₘ must halt and return 1, by definition. There is no rule against recursive self-reference in this case, since the definition of H is broad enough to include such programs. Since we reach a contradiction, we cannot assume that H is computable.

To show that H is c.e. is left as an exercise to the reader (hint: can we build a p.c. indicator function for H?).

This is analogous to Russell’s paradox in that our (naive) definition of the set H allows such a φₘ, where φₘ halts if and only if φₘ does not halt. The stipulation of all programs in the definition of H is what allows this contradiction to happen. This might seem like circular reasoning, and in an abstract way it is, but our definition of H admits such a contradiction when we assume that H is computable. Therefore, to preserve the consistency of H, we must reject it as being computable.
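The diagonal construction above can be sketched in runnable Python. The names `make_adversary` and `pessimist` are mine; `halts` stands in for the hypothetical indicator, which of course cannot actually exist:

```python
def make_adversary(halts):
    """Given any candidate halting tester, build a program it misjudges."""
    def adversary():
        if halts(adversary):   # tester claims "halts" -> loop forever
            while True:
                pass
        return 1               # tester claims "diverges" -> halt at once
    return adversary

# A (necessarily wrong) candidate tester that claims every program diverges:
pessimist = lambda prog: False
adv = make_adversary(pessimist)
# pessimist says adv diverges, yet adv() returns 1 and halts.
```

Any candidate tester you supply is defeated the same way: the adversary does the opposite of whatever the tester predicts about it.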

Of course in real life, if we were careful to avoid writing programs similar to φₘ, then our halting tester can achieve practical performance on realistic programs, limited only by time. The implications of the halting problem are certainly remarkable. Conceivably we can build a meta-system to see if a subsystem of halt checkers runs into the halting problem. But then, how many meta-systems do we need to cover the systems below them? The situation becomes like an infinite Matryoshka doll. It is fascinating that humans can have an awareness of this situation from outside of the infinite regress of these halt-checking meta-systems, but can also switch into arbitrary levels of these systems at will, and reason about them from both the outside and the inside. Indeed, this is what some would consider a hallmark attribute of consciousness. It is this power that allowed the likes of Kurt Gödel, Alan Turing, Alonzo Church and others to prove several profoundly deep and connected theorems concerning logic, computation and the foundations of math in the 1930s.

It would be interesting to see if we can build a deterministic or even machine learning based halting tester that analyzes CPU instructions to determine with a certain probability if a program is doing useful calculations or has run into an unproductive infinite loop*. If you enjoyed this article please check out my other ones, and if you found any glaring errors or logical gaps, please let me know.

*If you have venture capital and are interested in investing in the idea, please contact me.

Addendum:

In anticipation of complaints of the above proof being ‘hand-wavy’ or too informal (after all, we took recursive self-reference in stride and didn’t quite define how the indicator would check if φₘ halts or not), I will present a somewhat more rigorous proof below.

We will use the intuitive definition of a Turing machine T as a machine that runs a deterministic procedure, meeting the following conditions:

  • capable of loops and conditional branches and is stateful
  • capable of running subprograms
  • capable of data duplication (such as copying data into the subprograms)
  • capable of halting or running forever

This machine T will compute the function I(m, x) if it is computable. A lack of the precise definition of T will not impede our understanding of the proof too much.

Theorem 2 (Theorem 1 restated): I(m, x) is not computable.

Proof: Once again, assume I(m, x) is computable. One thing we can do to avoid weird recursive self-reference is to feed the source code of φₘ to I as input, i.e. I(φₘ, φₘ). In the Turing machine definition, programs φₘ and their corresponding source code are just unique, and really large, natural numbers. Mechanically, there is nothing wrong with feeding a really large natural number to some program in H, even if that number is the unique ‘source code number’ of φₘ itself. There is no recursion where the function calls itself in this case.
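The idea that source code is ‘just a really large natural number’ can be sketched directly. This is a toy Gödel numbering; the function names and the prefix-byte trick (which makes decoding unambiguous) are my own, not a standard construction from the article:

```python
def source_number(src: str) -> int:
    """Encode program text as a unique natural number."""
    data = b'\x01' + src.encode('utf-8')   # prefix preserves leading bytes
    return int.from_bytes(data, 'big')

def number_to_source(n: int) -> str:
    """Decode the natural number back into the original program text."""
    data = n.to_bytes((n.bit_length() + 7) // 8, 'big')
    return data[1:].decode('utf-8')

code = "def f(x): return x + 1"
n = source_number(code)   # a very large natural number
```

The encoding is invertible, so passing a program its own ‘source code number’ is just passing it an ordinary (if huge) natural number.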

Now we define a p.c. function h (our candidate halt checker) in the following:

h(φ) := {0 if I’(φ, φ) = 0; undefined otherwise}

Notice we have a different indicator function I’. Our definition of computable only speaks of the existence of computable indicators; they need not all be the same, but they must behave the same (consistency). The program that computes h(φ) is in H. Thus I(m, x) would be able to check whether h(φ) halts. We feed the source of h to I(m, x), i.e. I(h, h). Think of I(h, h) as our 1st-level halting meta-checker.

Now run h on its own source code. By definition, h(h) returns 0 precisely when I’(h, h) = 0; but then h(h) has halted while the indicator claims it diverges. Conversely, h(h) diverges precisely when I’(h, h) = 1, in which case the indicator claims it halts. Since consistent indicators cannot disagree with what actually happens, and by the definition of computable we cannot have a program that both halts and diverges at the same time (consider h(φ) a subprogram of I(h, h)), we reach a contradiction. Therefore I(m, x) is not computable.

Links to my other articles:

  1. Tensorflow and custom loss functions
  2. Random forests
  3. Softmax classification
  4. Climate analysis
  5. Hockey riots and extreme values
