Ex falso quodlibet

Ceci n'est pas un blog

Testing semantic preservation

This post describes the general idea at the origin of one of my pet projects.

We can see a compiler as a piece of software that translates text written in one language (the source language) into another (the target language).

Compilers are a very important part of the tooling of the programming craft, because they allow us to program machines using high-level languages that provide nice abstractions. That may be why we often have very strong feelings about the compilers of our favourite programming languages (yes, plural).

A compiler being a program, the cold hard truth is that it will contain bugs. This means that, as a translator, a compiler can make mistakes. I know, that's very unfortunate; we want to trust our beloved compilers!

The seminal work on compiler correctness is McCarthy and Painter's 1967 paper "Correctness of a Compiler for Arithmetic Expressions". The property of correctness is formulated with the help of a semantics based on an abstract machine: a compiler is correct if there is an equivalence relation between the state of the memory after interpreting the source program and the state of the memory after interpreting the compiled program.

A few years later, Lockwood Morris came up with a more abstract idea on how to prove the correctness of a compiler. The following diagram should commute:

source         compile          target
language ---------------------> language
   |                               |
   |                               |
   |source                         |target
   |semantics                      |semantics
   |                               |
   V                               V
source         decode           target
meaning <---------------------- meaning

Here, the notion of semantics is abstract: we are free to choose how to express it. For example, Morris used a denotational semantics. A bit more than 30 years later (2009), when introducing CompCert, the first realistic verified compiler, Xavier Leroy used the same idea but with an operational semantics (in the flavour of Plotkin's Structural Operational Semantics; McCarthy and Painter were also using an operational semantics, but one based on an abstract machine à la Peter Landin), focusing on observable behaviour. (This history is explained in some more detail in Patterson & Ahmed's functional pearl.)

Leroy introduces and defines the notion of semantic preservation in section 2.1 of the CompCert paper. It is defined in terms of a simulation relation between the observable behaviour of the source program and the observable behaviour of the compiled program. The relation of interest for us is forward simulation (we are clearly cutting some corners here; the curious reader should read the paper, and the papers cited above too if interested in the historical aspect):

∀ B, S ↓ B ⇒ C ↓ B
 where
  B : Behaviour
  S : Source program
  C : Compiled program

This reads: for all observable behaviours B, if we observe the behaviour B from the source program S, then we will observe the same behaviour B from the compiled program C.

Now, how does this apply to a compiler? Let's turn to section 2.2 of the same paper, where we can find the definition of a verified compiler (again, corners; don't hesitate to read the paper):

∀ S, C, Comp(S) = OK(C) ⇒ S ≈ C
 where
  ≈ is the semantic preservation relation

This reads: the compiler Comp is verified if, for all source programs S and all compiled programs C, the fact that the compilation of S succeeds and outputs C implies that the semantics of S is preserved in C.

Up until now, we were looking at how to prove the correctness of a compiler. But what I want to do is to test a compiler for the semantic preservation property, using property-based testing.

The usual trick when translating a formal specification like the one above into a property to be randomly tested is to turn the universal quantifiers into generators. I like this idea very much because it reminds me of how universal quantification is interpreted in the dialogical logic framework (it is a similarity, not an identity).

Here, we have a lot of universal quantifiers!

Obviously, we don't want to generate a source program, a compiled program, and a behaviour, then check that the generated source program compiles into the generated compiled one and that its observable behaviour is the one we generated, before checking that the compiled program displays the same behaviour. (Contrary to the dialogical logic framework, we don't have an ideal proponent and an ideal opponent always making the best choice; we have random generators, so there is very little chance that we would ever generate a source program and the corresponding compiled one, not to mention the corresponding behaviour.)

Luckily enough, we don't have to! If we manage to generate correct source programs, we can use the compiler to obtain the compiled programs. And if we write a semantics for the source and the target languages, we can obtain the observable behaviours from them.
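To make this concrete, here is a minimal Haskell sketch of the whole loop on a toy instance, in the spirit of McCarthy and Painter's arithmetic expressions: a tiny source language, a compiler to a stack machine, a semantics for each side, and the preservation check. All names here are illustrative, not the actual project's code.

```haskell
-- Toy instance of the forward-simulation check: arithmetic expressions
-- compiled to a stack machine. Everything is illustrative.

data Expr  = Lit Int | Add Expr Expr
data Instr = Push Int | IAdd

-- Source semantics: here the observable behaviour is just the final value.
evalExpr :: Expr -> Int
evalExpr (Lit n)   = n
evalExpr (Add a b) = evalExpr a + evalExpr b

-- The compiler under test.
compile :: Expr -> [Instr]
compile (Lit n)   = [Push n]
compile (Add a b) = compile a ++ compile b ++ [IAdd]

-- Target semantics: run the instructions on a stack.
exec :: [Instr] -> [Int] -> [Int]
exec []            s           = s
exec (Push n : is) s           = exec is (n : s)
exec (IAdd : is)   (x : y : s) = exec is (y + x : s)
exec (IAdd : _)    _           = error "stack underflow"

-- Semantic preservation for one program: S ↓ B ⇒ C ↓ B.
preserves :: Expr -> Bool
preserves e = exec (compile e) [] == [evalExpr e]
```

A property-based testing library (QuickCheck, for instance) would then generate random `Expr` values and check `preserves` on each of them.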

So, from one question (how do we test a compiler for semantic preservation?), we now have two new questions (that's progress!):

  1. how do we write a generator for correct source programs?
  2. how do we write a semantics computing the observable behaviour?

The second question is the easiest one (at least in theory; there are a lot of mistakes to be made, or to avoid, depending on your state of mind). We want a semantics that computes the observable behaviour: that's an interpreter that logs the elements of this observable behaviour along the way.
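For instance, if the observable behaviour is the sequence of printed values, the interpreter simply accumulates that sequence as it goes. A minimal Haskell sketch (the mini-language and all names are hypothetical, not the project's actual ones):

```haskell
-- A sketch of a semantics-as-interpreter that logs the observable
-- behaviour. Here the behaviour is the list of printed values.

data Expr = Num Int | Plus Expr Expr
data Prog = Print Expr | Seq Prog Prog

type Behaviour = [Int]   -- the trace of observable events

eval :: Expr -> Int
eval (Num n)    = n
eval (Plus a b) = eval a + eval b

-- The interpreter returns the observable behaviour of the program.
interp :: Prog -> Behaviour
interp (Print e) = [eval e]
interp (Seq p q) = interp p ++ interp q
```

Checking semantic preservation then amounts to comparing two such traces, one per interpreter.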

The first question is a bit more difficult to answer. Let's first define what we mean by correct programs. I want to write (and test) a compiler for a statically typed programming language (something like System F). So, here, correct means well-typed. So the question is: how do we write a generator of well-typed lambda terms?

Once again, some smart people have already worked on this question! We can take a look at the paper from Pałka et al., "Testing an Optimising Compiler by Generating Random Lambda Terms".

The general idea is to read the typing rules (which are deduction rules) backward, which makes a lot of sense: from the premises, we can infer the conclusion, but if we want to generate the conclusion, we need to generate the premises. This also means that the generator needs to take a type as input: generating a premise means generating a term of a given type.
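For example, take the application rule:

  Γ ⊢ f : σ → τ     Γ ⊢ x : σ
  -----------------------------
        Γ ⊢ f x : τ

Read forward, it says: if f has type σ → τ and x has type σ, then the application f x has type τ. Read backward, to generate a term of type τ we pick some type σ, generate a function of type σ → τ and an argument of type σ, and apply one to the other.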

This all sounds very fun, but there are a lot of traps along the way (which is also a lot of fun, let's be honest with ourselves!).

First, a term of a given type can be generated in multiple ways. Let's say we need to generate a term of type int. We can do that by looking at the environment and picking an existing variable of the right type (zero, for example). Or we can build a function application with a function returning an int and an argument of the right type.

OK, not a big deal. Everybody who has done a bit of property-based testing knows about the combinator that picks one generator from a list of generators. The subtlety here is that the list of generators depends on the type of the term we want to generate: for example, we shouldn't try to use a generator of lambda abstractions if the target type is not an arrow type.

Now, let's say we've decided to go the function-application way. We need to generate a function and an argument. Maybe we'll decide to use a function application again to generate the argument, and again at the next step, etc. Now we are in a situation where the generator can run forever, or take an unreasonable amount of time to build an unreasonably large term.

That's where they introduce a size limit (which is, at least nowadays, a pretty standard technique when generating recursive values). But they also introduce backtracking, because in the particular case of generating an abstract syntax tree, the size limit means the generator can fail.
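Putting these pieces together, here is a small Haskell sketch of such a type-directed generator for simply typed lambda terms, with a size bound and backtracking. For simplicity it tries the candidates in a fixed order and always picks int as the argument type of applications; a real generator would choose among the applicable candidates at random (a QuickCheck `Gen`, not a `Maybe`). Everything here is illustrative.

```haskell
import Data.Maybe (catMaybes, listToMaybe)

data Ty   = TInt | Arrow Ty Ty                   deriving (Eq, Show)
data Term = Zero | Var Int | Lam Ty Term | App Term Term
                                                 deriving (Eq, Show)

-- genTerm env size ty: build a term of type `ty` in environment `env`,
-- reading the typing rules backward. `Nothing` means the size budget
-- ran out before any candidate succeeded.
genTerm :: [Ty] -> Int -> Ty -> Maybe Term
genTerm env size ty = listToMaybe (catMaybes candidates)
  where
    candidates =
      -- variable rule: any variable of the right type already in scope
      [ Just (Var i) | (i, t) <- zip [0 ..] env, t == ty ]
      -- a constant of the base type
      ++ [ Just Zero | ty == TInt ]
      -- abstraction rule: only applicable when the target is an arrow
      ++ [ Lam a <$> genTerm (a : env) (size - 1) b
         | Arrow a b <- [ty], size > 0 ]
      -- application rule: generate a function and an argument; if either
      -- recursive call fails, this candidate is skipped (backtracking)
      ++ [ App <$> genTerm env (size - 1) (Arrow TInt ty)
               <*> genTerm env (size - 1) TInt
         | size > 0 ]

-- A tiny type checker, handy to test that generated terms are well typed.
typeOf :: [Ty] -> Term -> Maybe Ty
typeOf _   Zero      = Just TInt
typeOf env (Var i)   = lookup i (zip [0 ..] env)
typeOf env (Lam a b) = Arrow a <$> typeOf (a : env) b
typeOf env (App f x) = case (typeOf env f, typeOf env x) of
  (Just (Arrow a b), Just a') | a == a' -> Just b
  _                                     -> Nothing
```

Here `genTerm [] 3 (Arrow TInt TInt)` yields `Lam TInt (Var 0)`, while `genTerm [] 0 (Arrow TInt TInt)` fails: with no size budget left, none of the candidates applies.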

And that's just the beginning! The paper also explains how to make sure that functions from the environment actually get used, and how to handle polymorphism.

In conclusion, in order to test a compiler for semantic preservation with property-based testing, we need to write interpreters for the intermediate languages that log the elements of the observable behaviour, and to randomly generate well-typed terms of the source language.

We'll also need a language and a compiler for that language, which is the first step of the project (a lot of design and implementation choices to make).