[llvmlinux] Heavy experiments

Marcelo Sousa marceloabsousa at gmail.com
Wed Jul 24 19:43:40 UTC 2013


Hi Renato,


On Wed, Jul 24, 2013 at 4:04 AM, Renato Golin <renato.golin at linaro.org> wrote:

> Hi Marcelo,
>
> We're probably leading ourselves towards a long discussion... While this
> is good for both of us as a learning tool, it can generate stress. So I
> want to begin by asking you to ignore if my answers seem a little on the
> harsh side. I can assure you it's unintentional and a limitation of my
> communication skills.
>

Not a problem. I am interested in enumerating all these issues and
formalizing them.


> I, too, had the idea that LLVM IR was this powerful language where I could
> infer almost everything from the original code/intentions if I looked hard
> enough. Sadly, reality wasn't that nice to me, as it won't be with you. You
> may escape most of it by only looking to C code execution, and it might
> work well for a very small area of a very specific case, but still, it'd be
> good if you know beforehand the limitations of the IR so to avoid
> frustrations.
>
> All I'm saying here is based on what I've tried myself and seen other
> people trying, and that might have been a long ago, and my memory is really
> bad. Don't take everything at face value, though, since a lot has changed
> in the last 4 years, but be aware that a lot of things in the IR world only
> look consistent because no one has tried differently. Things match by accident
> in some cases...
>

Things matching by accident in some cases is, to me, a sign of a buggy
implementation in LLVM itself.


>
>
> On 24 July 2013 02:08, Marcelo Sousa <marceloabsousa at gmail.com> wrote:
>
>> I don't agree, and in fact it feels to me that there is almost a new field
>> in software verification over LLVM IR. Just to start, there are several
>> examples of bounded model checkers, symbolic execution procedures that
>> operate over LLVM IR. Are you somehow implying that this analysis is no
>> good?
>>
>
> Software verification in IR is a good thing, semantic verification of
> languages in IR is not. The function that transforms AST into IR is
> surjective, not injective.
>

Sure, it is surjective, but that's not a bad thing. It only means that
multiple C programs map to the same IR (and ultimately the same assembly
instructions), and there is nothing wrong with that if the compilation is sound.
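
As a hand-written sketch of why I think the many-to-one mapping is harmless
(the function name is mine): two different C bodies, say 'return x + x;' and
'return 2 * x;', are typically canonicalized to the same IR at -O1 and above,
and a sound analysis of that one IR function covers both source programs:

; Sketch: instcombine canonicalizes both 'x + x' and '2 * x' into the
; same shift, so the AST -> IR map is many-to-one by design.
define i32 @double_it(i32 %x) {
entry:
  %r = shl i32 %x, 1
  ret i32 %r
}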


>
>
>   - Sign information is on instructions, not types
>>>
>> This is not a semantic problem, because you can refine the LLVM IR type
>> system. In fact, that is something I am doing.
>>
>
> No, you can't. While some type representations might look better on paper,
> or on a little project, it doesn't behave on the grand scheme of things,
> namely the support for multiple ABIs on multiple architectures and multiple
> language needs. IR is no silver bullet.
>

Alright, here I do not follow. Do we agree that you can lift the sign
information from instructions into types? Can you give me an example of where
multiple ABIs actually influence the design decision that sign information
goes on instructions?
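
So that we are talking about the same thing, here is a minimal hand-written
sketch (function names are mine) of how signedness currently lives in the
opcode rather than in the i32 type, which is exactly the information I want
to lift into a refined type system:

; The i32 type itself carries no sign; the front-end encodes signedness
; in the opcode it picks (sdiv vs. udiv, ashr vs. lshr, icmp slt vs. ult).
define i32 @div_signed(i32 %a, i32 %b) {
entry:
  %r = sdiv i32 %a, %b        ; signed division
  ret i32 %r
}

define i32 @div_unsigned(i32 %a, i32 %b) {
entry:
  %r = udiv i32 %a, %b        ; unsigned division, same operand types
  ret i32 %r
}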


>
>   - Structures are flattened out, or cast to arrays
>>>
>> The structural information is still there otherwise it's not a sound
>> compilation. I need to see real examples of C code where the original
>> structures cannot be retrieved anymore.
>>
>
> There are many examples, but one that happens fairly regularly is when the
> structure is cast to an int array to do ByVal lowering. Of course, C++
> loses a lot more semantics than C.
>

I'm not sure whether ByVal lowering is an optimization that you can simply
choose not to use, but as for the general case of casts, you run into
problems if you don't extend the type system so that you keep a history of
the types a value has had.
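
For reference, my (possibly incomplete) understanding of the ByVal case is
the following hand-written sketch, using the typed-pointer syntax of current
LLVM; the struct and function names are mine:

%struct.Point = type { i32, i32 }

; The C signature 'int firstCoord(struct Point p)' is lowered so that the
; aggregate is passed as a pointer with the byval attribute; the caller
; makes a hidden copy, and the original "pass the struct by value" intent
; now lives only in that attribute.
define i32 @firstCoord(%struct.Point* byval %p) {
entry:
  %fld = getelementptr %struct.Point* %p, i32 0, i32 0
  %v = load i32* %fld
  ret i32 %v
}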


>
>
>>   - Multiple structures are commoned up if they're structurally
>>> equivalent
>>>
>> Not sure what the problem is here. If LLVM IR can identify that two
>> structs are structurally equivalent for free, that seems good for me.
>>
>
> In the case where a function operates in one type and another function
> operates on the other, it'll be hard to reason about which group each
> function belongs. Again, more of a problem to C++ and higher languages.
>

Again, hard is different from undecidable. You would need to provide a
convincing example of this situation.

>
>
>   - Function calls are changed (ByVal, ABI specific, architecture
>>> specific)
>>>
>> Not sure why is this a problem also for verification. In fact, again I
>> get a much more reliable execution model.
>>
>
> I thought you were doing semantic analysis. For execution analysis, this
> is fine.
>
>
>   - Intrinsics are expanded into IR code
>>>
>>  - Perfectly valid code is shrunk into intrinsics (optimizing,
>>> legalizing, atomic ops, pragmas)
>>>
>> Looking at some IR of the kernel I'm not sure if there is heavy use
>> of Intrinsics. Nevertheless, this is one criticism of LLVM IR: that it's
>> not supposed to have any intrinsics at all, but to be able to generate
>> appropriate optimized assembly code through patterns.
>>
>
> This is a strength, not a weakness. LLVM performs better than some
> commercial compilers because of that.
>

For me, this is obviously a weakness of the language. I'm quite sure that
you can achieve the same optimization levels without intrinsics, and I don't
understand (and I'm not sure the LLVM developers document) the semantic
difference between an LLVM IR instruction and an intrinsic (other than one
being native and the other a *special* call).
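
To make the comparison concrete, here is a hand-written sketch (function name
is mine) of the kind of *special* call I mean, using the llvm.memcpy
intrinsic as it appears in current IR:

; An intrinsic is declared like an external function, but its semantics are
; defined by LLVM's documentation rather than by any IR body.
declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

define void @copy16(i8* %dst, i8* %src) {
entry:
  ; looks like an ordinary call, but means "copy 16 bytes, align 1,
  ; not volatile" only because the intrinsic's spec says so
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 16, i32 1, i1 false)
  ret void
}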


>
>
>>  Note that I'm being somewhat of a purist here, i.e. LLVM is supposed to be a
>> Virtual Machine. There are almost more intrinsics than instructions, and
>> that's not good for any perspective.
>>
>
> This is a common misconception. LLVM is NOT supposed to be a Virtual
> Machine. It's just a misnomer, because it was originally intended to make
> it easier for Chris to finish his PhD. LLVM is a compilation
> infrastructure, of which being a virtual machine is a very (very) small
> part, and not even that well supported any more.
>

Sure, it is a misconception, but I want to change LLVM IR to be a proper
virtual machine. By the way, speaking of things from the thesis that I don't
see anymore: what happened to the whole *run-time optimization* story for LLVM?


>
>   - Vectorization can explode the amount of IR beyond recognition
>>>
>> Are you referring to vectorization as a compiler transformation? Please
>> elaborate and if possible provide an example of this. Nevertheless, I doubt
>> that this happens in the particular case of the Linux Kernel.
>>
>
> Yes, and I agree, it's possible that the Kernel has nothing worth
> vectorizing. It's just an example where your tool wouldn't work at all.
>

Regardless, I'm slightly confused: you seem to be implying that
vectorization in LLVM is unsound.
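
For the record, my mental model of the transformation is roughly this sketch
(function names are mine): the vectorizer rewrites scalar operations into
operations on vector values, so the instruction mix and shape change even
when the semantics are preserved:

; Scalar form: one add per element.
define i32 @add1(i32 %a, i32 %b) {
entry:
  %r = add i32 %a, %b
  ret i32 %r
}

; Vectorized form: one add covers four elements at a time.
define <4 x i32> @add4(<4 x i32> %a, <4 x i32> %b) {
entry:
  %r = add <4 x i32> %a, %b
  ret <4 x i32> %r
}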


>
>   - Inlining shuffles code and will hide any analysis you do on the
>>> inlinee
>>>
>> I fail to see this point if your analysis is sound, i.e. your analysis
>> results of the inlinee when analyzed independently are more general than
>> the ones when you analyze the function with the inlining. This may affect
>> precision of the inlinee independently but the analysis is still precise in
>> the inlining.
>> For example, in a type-based analysis this can result that the inlinee
>> would have a polymorphic type and the function that has the body of the
>> inlinee is just a specialization.
>>
>
> When the function is static, LLVM tends not to keep the out-of-line copy
> on O2 or higher.
>

It may be that for my analysis it's not important that LLVM removes that
particular function, and arguably for a sound analysis it's irrelevant.


>
>   - PHI nodes, basic blocks for loops and switches can be recognized, but
>>> they don't mean the same things anymore
>>>
>> So what do they mean? Sure there is a small-step operational semantics of
>> phi instruction which does not mean that there is a loop, but if you can
>> identify a pattern that is a loop.
>>
>
> Loops can be destroyed by loop unroll, or created during vectorization
> (pre/pos).
>

Well, here we can talk from two perspectives. It's obvious that loop
unrolling can make our analysis incomplete, and that's a *requirement* for
bounded model checking, for example; but if you know the upper bound of the
loop and you unroll it completely, it's still possible to identify the loop, right?
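
To be concrete about what I mean by "identify the loop", this is the shape I
have in mind, written by hand rather than taken from compiler output (names
are mine): the phi nodes and the back-edge are the only markers; nothing in
the IR literally says "for loop":

; Sums 0 + 1 + ... + (n-1), assuming n > 0 for simplicity.
define i32 @sum_to_n(i32 %n) {
entry:
  br label %loop

loop:
  %i   = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %acc = phi i32 [ 0, %entry ], [ %acc.next, %loop ]
  %acc.next = add i32 %acc, %i
  %i.next   = add i32 %i, 1
  %done = icmp eq i32 %i.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret i32 %acc.next
}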


>
>
>
>>  However, note that i'm not claiming that there is a bijection between a
>> particular C program and the LLVM IR code but I do claim a bisimulation
>> relation with C + Arch and LLVM IR + Arch', where Arch' is a subset of
>> Arch, and they really complete the *real/concrete* execution model with the
>> information about the architecture.
>>
>
> The transformation between AST and IR, and further transformations from IR
> to optimized IR are not symmetric, associative, nor transitive. They're
> pattern-match based and do behave erratically quite frequently. LLVM is not
> a mathematical tool, it's a real world code with real world defects because
> it has real world expectations.
>

The optimization/real-world-expectations argument is acceptable, and that's
the motivation for performing analysis on LLVM IR: when you write a C program
you make certain assumptions that LLVM can break, and it's important to
identify which abstractions are sound and which are not.


>
> Some annotation is kept on the types (TBAA, for ex.) to be able to infer
> high-level information about the type, but as with any metadata in LLVM,
> that can (and will) be discarded by optimization passes. Inlining will only
> make it worse.
>

The point is to identify which optimization passes those are, and say: "Look,
if you apply this optimization you're going down a dark path, because it is
not semantics-preserving."

>
>
>
>>   - Target-specific lowering does destroy the IR, making it impossible
>>> to be portable, sometimes even to sub-architectures
>>>
>> Can you get me some concrete examples of this?
>>
>
> ByVal is one. Procedure call standard is another. Intrinsics are yet
> another. You can't even throw an ARMv7-hf function into an ARMv5 target,
> even both being ARM.
>

I'm still not sure I fully understand what you are saying.

>
>
>
>>  this is useful for software verification that considers architectural
>> information. This is exactly what I'm trying to do with the Linux Kernel at
>> the moment: I'm just trying to show that for x86 certain behaviours that
>> are arch-specific due to inline assembly instructions do not occur.
>>
>
> This is a very important analysis, and I can see that most of what I said
> is irrelevant to your project. However, be careful on unintentional
> target-specific code, like the ones I mentioned above.
>
>
>
>
>>    - C++ gets flattened out to C in the Clang AST, and then to
>>> target-specific in IR, it's virtually impossible to attach any meaning to it
>>>
>> If that's the case, what is the argument to claim the "capability of
>> representing ‘all’ high-level languages cleanly"?
>>
>
> If you're referring to Clang's AST, that's a complex issue. Even though
> the AST has all the information you need about C++ concepts, the members
> become functions, the classes become PODs with pointers to VTs and VTTs,
> etc. The information is all there, in metadata or lists of members, but
> it'll be hard to reproduce the original code back in C++. However, printing
> it as a C code would be trivial.
>
> All C++ front-ends I know do something like that (including IntelCC and
> ARMCC, that both use EDG). This is required for you to be able to deal with
> C and C++ code on the same file and inter-operate between the languages
> safely.
>
>
>
>>  I agree with you that it's a hard task to have a back-end to C++, but
>> semantically it is not "virtually impossible".
>>
>
> Well, anything is impossible... ;)
>

I think your arguments are stronger for C++ than for C, for example, although
I'm of the opinion that conceptually you could still analyze C++ in LLVM IR.


>
>
>   - SCEV, DCE, CSE and undefined behaviour will *also* change IR beyond
>> recognition
>> I don't recognize SCEV, DCE, CSE.
>>
>
> Sorry, too long in the field and I'm already behaving like them... ;)
>
> SCEV = Scalar Evolution (constant folding at compile time, among other
> things)
> DCE = Dead Code Elimination (removal of complete sections of the code if
> unused, or if undefined)
> CSE = Common Subexpression Elimination (fuse common code into one)
>

These transformations are (or can be made) sound, so I don't see any problem here.
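
As a tiny hand-written sketch of why I consider them sound in principle
(function names are mine): folding the constants and eliminating the dead
instructions preserves the observable behaviour of the function:

; Before: a computation whose result is a compile-time constant.
define i32 @before() {
entry:
  %a = add i32 2, 3
  %b = mul i32 %a, 4
  ret i32 %b
}

; After constant folding + dead code elimination: same observable
; behaviour, but the intermediate instructions are gone.
define i32 @after() {
entry:
  ret i32 20
}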

>
>
>
>>  About undefined behaviour, are you talking about *undef*? We should
>> really have a discussion just about the semantics of *undef*, or write a
>> paper about "Defining undef".
>>
>
> Many papers, more likely... ;)
>
> Here's a good blog post about it by Chris:
> http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
>
>
>
>>  Again, I agree that it is possible that transformations *can* possibly
>> change IR which makes lifting to a similar C program w.r.t the original
>> very hard, but *sound* transformations will not change its behaviour -
>> semantic preservation.
>>
>
> In practice, undefined behaviour will make the compiler treat the code
> around it as not-worthy, and will eliminate it, or change everything that
> uses it to undef, and "flatten-out" your code.
>

Undefined behavior needs to be properly formalized in LLVM so that people
are clear on the semantics of the IR.
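
As a minimal starting point for that formalization, here is the kind of
hand-written case I have in mind (function name is mine): an undef operand
lets the optimizer pick any bit pattern, so the function below may
legitimately return anything:

define i32 @from_undef() {
entry:
  ; undef means "some bit pattern of type i32, chosen by the compiler";
  ; instcombine is free to fold this to any i32 value it likes.
  %x = add i32 undef, 1
  ret i32 %x
}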


>
>
>
>>  Moreover, note that by performing analysis at the IR level you are not
>> only able to identify potential real bugs in your "high-level" program but
>> you also are doing compiler certification. Understanding the origin of the
>> error might be hard but I find the analysis particularly valuable.
>>
>
> I'm not saying it's wrong, I'm just saying it's a daunting task. There are
> too many bugs in LLVM that don't generate bad code, and I fear you'll be
> hitting plenty of them.
>

Having a verification tool for LLVM sounds great to me!

>
>
>   - IR has no knowledge of the pre-compiler syntax
>> I'm not sure of why this is relevant for verification. Can you provide an
>> example?
>>
>
> I thought you were doing static analysis (since you mentioned
> clang-analyser), but for execution analysis, I agree, it is irrelevant.
>
> For static analysis, it's quite relevant, since the typedefs and macros
> are different depending on the arch (if you have things like #ifdef
> __arm__) and the program can behave completely different based on that.
> Some macros (of ill-written programs) can define the size of buffers,
> number of elements and parametrization of template classes, which all
> contribute to static analysis confusion.
>

That's not really true, in the sense that the *code* is actually going to be
executed on some *machine*, so I am interested in verifying the expansion
for that particular architecture and not all the boilerplate code for other
architectures. I want to stress this point.


>
>
>
>>  I need to be further convinced that an Intermediate Representation is
>>> NOT a Representation of the code.
>>>
>>
> I'll let time convince you... ;)
>
>
>  Not to mention the sheer size of the kernel, that will make any small
>>> task a pain.
>>>
>> I have a lightweight type analysis that I believe is scalable to verify
>> the kernel in a couple of minutes. I can verify modules within a second.
>>
>
> That's great! Do you get meaningful results from your analysis on a large
> scale software? I'd expect loads of reports on many different types and
> places, and that can be hard to skim through. Even being fast to run, might
> not be fast to understand the output.
>

I'm still running experiments with different variations, for example to
reduce the number of false positives. But you're right in claiming that, in
the case of the Linux kernel, it is hard to even map the LLVM IR code back
to the C code, because you get generic kernel code and architecture code
mixed with driver code; and that mapping is quite important for a precise
analysis.


>
>
>  I don't know the details of Clang's AST but I need to be convinced. In
>> fact, I know of examples where an analysis at the LLVM IR can catch errors
>> (semantics errors of the original program) that currently clang-analyse
>> does not.
>>
>
> I believe you. But the existence of a few points does not guarantee the
> existence of the curve they pass through. ;)
>
> I'll let you get on with your work as you originally intended, in hope
> that all my fears will be unfounded or just smartly avoided.
>

As I told you, I'm very interested in gathering ALL possible sources of
fear about verifying IR code. Every analysis has limitations, but it's
important to be clear about what they are. We can continue this discussion
somewhere else or even meet up; I'm looking for LLVM IR experts who can
raise these issues.

Regards,
Marcelo


>
> cheers,
> --renato
>