[llvmlinux] Heavy experiments

Wed Jul 24 00:08:59 UTC 2013

On 24 July 2013 00:11, Marcelo Sousa <marceloabsousa at gmail.com> wrote:

> First of all, I want to understand if you're stating that analyzing the
> Linux Kernel through LLVM IR is not a good approach or analyzing LLVM IR in
> general is no good.
>

The latter.

> I do not understand what you mean by "the AST is gone on the IR level". I
> can argue that in a naive compilation the IR code is structurally similar
> to the original C code, in fact, the SSA transformation is "functionalizing".
>

That's not entirely true. All the IR needs to do is to "behave" like the
original code, but most of the semantics is gone for good.

Examples:
 - Sign information is on instructions, not types
 - Structures are flattened out, or cast to arrays
 - Multiple structures are commoned up if they're structurally equivalent
 - Function calls are changed (ByVal, ABI specific, architecture specific)
 - Intrinsics are expanded into IR code
 - Perfectly valid code is shrunk into intrinsics (optimizing, legalizing,
atomic ops, pragmas)
 - Vectorization can explode the amount of IR beyond recognition
 - Inlining shuffles code and will hide any analysis you do on the inlinee
 - PHI nodes, basic blocks for loops and switches can be recognized, but
they don't mean the same things anymore
 - Target-specific lowering destroys the IR's portability, sometimes even
across sub-architectures
 - C++ gets flattened out to C in the Clang AST, and then to target-specific
code in IR, so it's virtually impossible to attach any meaning to it
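
To make the first item concrete: LLVM's integer types (i32 and friends) carry
no signedness, and it is the instruction (sdiv vs. udiv, for instance) that
decides how the bits are read. The toy model below is only an illustration in
Python, not LLVM code; the helper names are made up for this sketch.

```python
MASK32 = 0xFFFFFFFF  # an "i32" here is just 32 bits, with no sign attached


def as_signed(bits):
    """Read a 32-bit pattern as a two's-complement signed integer."""
    bits &= MASK32
    return bits - (1 << 32) if bits & 0x80000000 else bits


def udiv(a, b):
    """Model of an unsigned divide: operands are read as unsigned."""
    return ((a & MASK32) // (b & MASK32)) & MASK32


def sdiv(a, b):
    """Model of a signed divide: truncates toward zero, result is 32 bits."""
    x, y = as_signed(a), as_signed(b)
    q = abs(x) // abs(y)
    if (x < 0) != (y < 0):
        q = -q
    return q & MASK32


# The same bit pattern gives different answers depending on the operation,
# so an analysis has to recover signedness from how each value is used.
bits = 0xFFFFFFF8
signed_result = as_signed(sdiv(bits, 2))
unsigned_result = udiv(bits, 2)
```

A verifier working on IR therefore cannot ask "is this variable signed?"; it
can only ask which operations were applied to it.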

I could go on forever, but you get the idea. IR is NOT a representation of
your code. At best, it's a high-level assembly, and that's what it was
designed for, and for that, it does a pretty good job.

> Not entirely sure what you mean in this paragraph. I believe that the sort
> of information that you lose is because LLVM IR has design faults, not
> necessarily because of transformation to the LLVM IR. Perhaps you can
> elaborate on what sort of information you lose besides the annoying
> implicit unsignedness that is relevant for verification and the fact that
> it may be harder to identify higher-abstraction constructs like for-loops.
>

Not design flaws, design. It wasn't designed to represent any particular
language and it won't be a 1-to-1 match to any of them.

> Can you provide a reference to this work? At this point, I'm really not
> sure what you mean by "this kind of experiments".
>

Unfortunately not, it was an internal pet project. Basically, the idea was
to have a Python script add pass flags to llc, discard the bad results
(can't compile, can't run) and cross the good ones, optimizing for
performance. It was a simple experiment, but it showed with great clarity
the interaction between the most used (and most powerful) passes.
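
The original script is not available, so the following is only a rough sketch
of that kind of search loop, under stated assumptions: the pass names are
invented placeholders, and toy_evaluate stands in for the real step of
compiling with the chosen flags and timing the result. It returns None for a
bad result and a lower-is-better score otherwise.

```python
import random


def random_candidate(passes, rng):
    """Pick a random ordered subset of the available passes."""
    return tuple(rng.sample(passes, rng.randint(1, len(passes))))


def cross(a, b, rng):
    """Combine a prefix of one good candidate with a suffix of another."""
    head = a[:rng.randint(0, len(a))]
    tail = [p for p in b[rng.randint(0, len(b)):] if p not in head]
    child = tuple(head) + tuple(tail)
    return child if child else a


def search(passes, evaluate, generations=10, population=8, seed=0):
    """Discard failing candidates, keep the best ones and cross them."""
    rng = random.Random(seed)
    pool = [random_candidate(passes, rng) for _ in range(population)]
    best = None
    for _ in range(generations):
        scored = []
        for cand in pool:
            score = evaluate(cand)
            if score is not None:  # None means "can't compile, can't run"
                scored.append((score, cand))
        if not scored:  # everything failed, so start over with fresh guesses
            pool = [random_candidate(passes, rng) for _ in range(population)]
            continue
        scored.sort(key=lambda sc: sc[0])
        if best is None or scored[0][0] < best[0]:
            best = scored[0]
        survivors = [c for _, c in scored[:max(2, population // 2)]]
        pool = list(survivors)
        while len(pool) < population:
            pool.append(cross(rng.choice(survivors), rng.choice(survivors), rng))
    return best


# Hypothetical pass names; a real run would use actual flags instead.
PASSES = ["fast_a", "fast_b", "slow_c", "broken", "noop_d"]


def toy_evaluate(cand):
    """Stand-in for compile-and-time: 'broken' fails, 'fast_*' help."""
    if "broken" in cand:
        return None
    return 10.0 - 3.0 * len(set(cand) & {"fast_a", "fast_b"}) + 0.1 * len(cand)


best = search(PASSES, toy_evaluate)
```

In a real experiment the evaluation step dominates the cost, since every
candidate means a full compile and run.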

> Surely I can apply certain levels of analysis (intra-procedural,
> inter-procedural and even inter-modular) to verify components of the
> kernel. The hard problem is how to verify several components in a
> concurrent setting.
>

Not to mention the sheer size of the kernel, which will make even a small
task a pain.

> Another question: Is one of the goals of the google summer project to apply
> the clang-analyzer to several versions of the kernel or just the latest one?
>

I don't know. But AFAIK, the Clang Analyser is not done in IR, but in the
Clang AST (where it should be, because of the problems I mentioned). So,
it's quite possible that you don't need to generate IR at all.

Any semantic analysis should be done on Clang's AST, which is quite rich
and powerful. IR is most suitable for language/target-independent
transformations.

Hope that clears up a bit... ;)

cheers,
--renato