[llvmlinux] Heavy experiments

Renato Golin renato.golin at linaro.org
Wed Jul 24 11:04:51 UTC 2013

Hi Marcelo,

We're probably leading ourselves towards a long discussion... While this is
good for both of us as a learning tool, it can generate stress. So I want
to begin by asking you to ignore if my answers seem a little on the harsh
side. I can assure you it's unintentional and a limitation of my
communication skills.

I, too, had the idea that LLVM IR was this powerful language where I could
infer almost everything from the original code/intentions if I looked hard
enough. Sadly, reality wasn't that nice to me, as it won't be with you. You
may escape most of it by only looking to C code execution, and it might
work well for a very small area of a very specific case, but still, it'd be
good if you know beforehand the limitations of the IR so to avoid

All I'm saying here is based on what I've tried myself and seen other
people trying, and that might have been a long ago, and my memory is really
bad. Don't take everything at face value, though, since a lot has changed
in the last 4 years, but be aware that a lot of things in the IR world only
look consistent because no one have tried differently. Things match by
accident on some cases...

On 24 July 2013 02:08, Marcelo Sousa <marceloabsousa at gmail.com> wrote:

> I dont' agree and in fact it feels to me that there is almost a new field
> in software verification over LLVM IR. Just to start, there are several
> examples of bounded model checkers, symbolic execution procedures that
> operate over LLVM IR. Are you somehow implying that this analysis is no
> good?

Software verification in IR is a good thing, semantic verification of
languages in IR is not. The function that transforms AST into IR is
surjective, not injective.

 - Sign information is on instructions, not types
> This is not a semantic problem, because you can refine the LLVM IR type
> system. In fact, that is something I am doing.

No, you can't. While some type representations might look better on paper,
or on a little project, it doesn't behave on the grand scheme of things,
namely the support for multiple ABIs on multiple architectures and multiple
language needs. IR is no silver bullet.

 - Structures are flattened out, or cast to arrays
> The structural information is still there otherwise it's not a sound
> compilation. I need to see real examples of C code where the original
> structures cannot be retrieved anymore.

There are many examples, but one that happens fairly regularly is when the
structure is cast to an int array to do ByVal lowering. Of course, C++
loses a lot more semantics than C.

>   - Multiple structures are commoned up if they're structurally equivalent
> Not sure what the problem is here. If LLVM IR can identify that two
> structs are structurally equivalent for free, that seems good for me.

In the case where a function operates in one type and another function
operates on the other, it'll be hard to reason about which group each
function belongs. Again, more of a problem to C++ and higher languages.

 - Function calls are changed (ByVal, ABI specific, architecture specific)
> Not sure why is this a problem also for verification. In fact, again I get
> a much more reliable execution model.

I thought you were doing semantic analysis. For execution analysis, this is

  - Intrinsics are expanded into IR code
>  - Perfectly valid code is shrunk into intrinsics (optimizing, legalizing,
>> atomic ops, pragmas)
> Looking at some IR of the kernel I'm not sure if there is heavy use
> of Intrinsics. Nevertheless, this is one criticism for LLVM IR, that it's
> not suppose to have any intrinsics at all but be able to generate
> appropriate optimized assembly code through patterns.

This is a strength, not a weakness. LLVM performs better than some
commercial compilers because of that.

>  Note that I'm being a somewhat purist here, i.e. LLVM is supposed to be
> Virtual Machine. There are almost more intrinsics than instructions, and
> that's not good for any perspective.

This is a common misconception. LLVM is NOT supposed to be a Virtual
Machine. It's just a misnomer, because it was originally intended to make
it easier for Chris to finish his PhD. LLVM is a compilation
infrastructure, of which being a virtual machine is a very (very) small
part, and not even that well supported any more.

 - Vectorization can explode the amount of IR beyond recognition
> Are you referring to vectorization as a compiler transformation? Please
> elaborate and if possible provide an example of this. Nevertheless, I doubt
> that this happens in the particular case of the Linux Kernel.

Yes, and I agree, it's possible that the Kernel has nothing worth
vectorizing. It's just an example where your tool wouldn't work at all.

  - Inlining shuffles code and will hide any analysis you do on the inlinee
> I fail to see this point if your analysis is sound, i.e. your analysis
> results of the inlinee when analyzed independently are more general than
> the ones when you analyze the function with the inlining. This may affect
> precision of the inlinee independently but the analysis is still precise in
> the inlining.
> For example, in a type-based analysis this can result that the inlinee
> would have a polymorphic type and the function that has the body of the
> inlinee is just a specialization.

When the function is static, LLVM tends not to keep the out-of-line copy on
O2 or higher.

 - PHI nodes, basic blocks for loops and switches can be recognized, but
>> they don't mean the same things anymore
> So what do they mean? Sure there is a small-step operational semantics of
> phi instruction which does not mean that there is a loop, but if you can
> identify a pattern that is a loop.

Loops can be destroyed by loop unroll, or created during vectorization

> However, note that i'm not claiming that there is a bijection between a
> particular C program and the LLVM IR code but I do claim a bisimulation
> relation with C + Arch and LLVM IR + Arch', where Arch' is a subset of
> Arch, and they really complete the *real/concrete* execution model with the
> information about the architecture.

The transformation between AST and IR, and further transformations from IR
to optimized IR are not symmetric, associative, nor transitive. They're
pattern-match based and do behave erratically quite frequently. LLVM is not
a mathematical tool, it's a real world code with real world defects because
it has real world expectations.

Some annotation is kept on the types (TBAA, for ex.) to be able to infer
high-level information about the type, but as with any metadata in LLVM,
that can (and will) be discarded by optimization passes. Inlining will only
make it worse.

>   - Target-specific lowering does destroy the IR, making it impossible to
>> be portable, sometimes even to sub-architectures
> Can you get me some concrete examples of this?

ByVal is one. Procedure call standard is another. Intrinsics are yet
another. You can't even throw an ARMv7-hf function into an ARMv5 target,
even both being ARM.

> this is useful for software verification that considers architectural
> information. This is exactly what I'm trying to do with the Linux Kernel at
> the moment: i'm just trying to show that for x86 certain behaviours that
> are arch-specific due to inline assembly instructions do not occur.

This is a very important analysis, and I can see that most of what I said
is irrelevant to your project. However, be careful on unintentional
target-specific code, like the ones I mentioned above.

>   - C++ gets flattened out to C in the Clang AST, and then to
>> target-specific in IR, it's virtually impossible to attach any meaning to it
> If that's the case, what is the argument to claim the "capability of
> representing ‘all’ high-level languages cleanly"?

If you're referring to Clang's AST, that's a complex issue. Even though the
AST has all the information you need about C++ concepts, the members become
functions, the classes become PODs with pointers to VTs and VTTs, etc. The
information is all there, in metadata or lists of members, but it'll be
hard to reproduce the original code back in C++. However, printing it as a
C code would be trivial.

All C++ front-ends I know do something like that (including IntelCC and
ARMCC, that both use EDG). This is required for you to be able to deal with
C and C++ code on the same file and inter-operate between the languages

> I agree with you than it's an hard task to have a back-end to C++ but
> semantically is not "virtually impossible".

Well, anything is impossible... ;)

  - SCEV, DCE, CSE and undefined behaviour will *also* change IR beyond
> recognition
> I don't recognize SCEV, DCE, CSE.

Sorry, too long in the field and I'm already behaving like them... ;)

SCEV = Scalar Evolution (constant folding at compile time, among other
DCE = Dead Code Elimination (removal of complete sections of the code if
unused, or if undefined)
CSE = Common Subexpression Evaluation (fuse common code into one)

> About undefined behaviour, are you talking about *undef*? We should really
> have a discussion just about the semantics of *undef*, or write a paper
> about "Defining undef".

Many papers, more likely... ;)

Here's a good blog post about it by Chris:

> Again, I agree that is possible that transformations *can* possibly change
> IR which makes lifting to a similar C program w.r.t the original very hard,
> but *sound* transformations will not change its behaviour - semantic
> preservation.

In practice, undefined behaviour will make the compiler treat the code
around it as not-worthy, and will eliminate it, or change everything that
uses it to undef, and "flatten-out" your code.

> Moreover, note that by performing analysis at the IR level you are not
> only able to identify potential real bugs in your "high-level" program but
> you also are doing compiler certification. Understanding the origin of the
> error might be hard but I find the analysis particularly valuable.

I'm not saying it's wrong, I'm just saying it's a daunting task. There are
too many bugs in LLVM that don't generate bad code, and I fear you'll be
hitting plenty of them.

 - IR has no knowledge of the pre-compiler syntax
> I'm not sure of why this is relevant for verification. Can you provide an
> example?

I thought you were doing static analysis (since you mentioned
clang-analyser), but for execution analysis, I agree, it is irrelevant.

For static analysis, it's quite relevant, since the typedefs and macros are
different depending on the arch (if you have things like #ifdef __arm__)
and the program can behave completely different based on that. Some macros
(of ill -written programs) can define the size of buffers, number of
elements and parametrization of template classes, which all contribute to
static analysis confusion.

>  I need to be further convinced that an Intermediate Representation is
>> NOT a Representation of the code.
I'll let time convince you... ;)

Not to mention the sheer size of the kernel, that will make any small task
>> a pain.
> I have a lightweight type analysis that I believe is scalable to verify
> the kernel in a couple of minutes. I can verify modules within a second.

That's great! Do you get meaningful results from your analysis on a large
scale software? I'd expect loads of reports on many different types and
places, and that can be hard to skim through. Even being fast to run, might
not be fast to understand the output.

I don't know the details of Clang's AST but I need to be convinced. In
> fact, I know of examples where an analysis at the LLVM IR can catch errors
> (semantics errors of the original program) that currently clang-analyse
> does not.

I believe you. But the existence of a few points does not guarantee the
existence of the curve they pass through. ;)

I'll let you get on with your work as you originally intended, in hope that
all my fears will be unfounded or just smartly avoided.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linuxfoundation.org/pipermail/llvmlinux/attachments/20130724/a0e73df8/attachment-0001.html>

More information about the LLVMLinux mailing list