[Celinux-dev] Proposal: combine qemu's tcg with tcc to produce new embedded compiler (qcc).

Rob Landley rob at landley.net
Mon Dec 21 05:52:30 UTC 2009


The QEMU project has a fairly general purpose "Tiny Code Generator" which is 
capable of producing machine code for every target QEMU supports.  This code 
generator is well maintained (by the qemu development community), operates 
extremely rapidly (producing code "on the fly"), and supports a large and 
increasing number of platforms, even distinguishing many specific variants 
within each platform (the qemu -cpu options).


Before QEMU, Fabrice Bellard's previous open source project was the Tiny C 
Compiler (tcc), which was notable for its small size (approximately 100k for a 
combined compiler/assembler/linker), its self-contained nature (not requiring 
external packages such as binutils), its speed of compilation (millions of 
lines of source code per second even on low-end hardware), and its "-run" mode 
(allowing use of C as a scripting language by starting a source file with 
"#!/usr/bin/tcc -run" and setting the executable bit on the source file).

Tinycc provides almost full c99 support (most notably missing complex number 
support and variable extent arrays).  In 2004, tinycc became the only open 
source compiler other than gcc to compile a working LInux kernel (albeit in 
limited circumstances).  Fabrice Bellard created "tccboot", a proof-of-concept 
project in which tcc was used to boot a Linux kernel directly from source 
code.  The tccboot ISO image booted directly into a modified tcc binary bundled 
with a modified subset of the 2.4 Linux kernel source.  It compiled this source 
to create a vmlinux, which it then executed.

QEMU actually started as an offshoot of tcc.  When fabrice looked into 
providing multiple output formats for tcc (to support targets other than 32-
bit x86), he started playing with multiple input formats as well, such as 
pages of existing machine code.  The result was qemu, which is actually a 
"dynamic recompiler" rather than an emulator.

The TCC project stalled when QEMU expanded to take up all Fabrice's time, and 
the project remained moribund for several years.  (Recently the original tcc 
has been relaunched as a windows-centric project, but its current maintainer 
has shown little to no interest in Linux or non-x86 targets.)

The tcc codebase as Fabrice left it provides an almost complete c99 compiler.  
Combined with qemu's code generator, this could provide a small fast compiler 
capable of running on and producing output for a wide range of embedded 
hardware.

Creating a "qcc" from tcc and tcg would involve:

1) Turn tcc into a "swiss army knife" executable (like busybox) so it its 
individual functions could be called as cc, ld, as, strip, cpp, and so on.

1A) optional - use the Firmware Linux ccwrap.c code to increase understanding 
of gcc command line options.

1B) optional - add missing utilities such as readelf, objdump, objcopy...

2) Refactor the code to untangle preprocessor, compiler, assembler, and linker 
functions.

3) Replace existing target code generation with qemu's tcg.

4) Add support infrastructure for targets supported by tcg (assembly code 
parser, ELF header information) 

5) Add missing functionality needed to build unmodified Linux 2.6, BusyBox, and 
uClibc.  (The linux kernel needs variable extent arrays, simple dead code 
elimination, assembly output (at least via objdump) to generate
asm-offsets.h...)

Rob

P.S.  See attached for some design work I did earlier this year.
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
-------------- next part --------------
QCC - QEMU C Compiler.

  Use QEMU's Tiny Code Generator as a backend for a compiler based on my old
  fork of Fabrice Bellard's tinycc project.

Why?

  QEMU's TCG provides support for many different targets (x86, x86-64, arm,
  mips, ppc, sh4, sparc, alpha, m68k, cris).  It has an active development
  community upgrading and optimizing it.

  QEMU application emulation also provides existing support for various ELF
  executable and library formats, so linking logic can presumably be merged.
  (See elf.h at the top of qemu.)  QEMU is also likely to grow coff and pxe
  support in future.

Building a self-bootstrapping system:

  My Firmware Linux project builds the smallest self-bootstrapping system
  I could come up with using the following existing packages:

    gcc, binutils, make, bash, busybox, uClibc, linux

  This new compiler should replace both binutils and gcc above.  (As a smoke
  test, the new system should still be able to build all seven packages.)

  To build those packages, FWL needs the following commands from the host
  toolchain.  (It can build everything else from source, but building these
  without already having them is a chicken and egg problem.)

    ar as nm cc gcc make ld /bin/bash

  The reason it needs "gcc" is that the linux and uClibc packages assume
  their host compiler is named "gcc", and call that name instead of cc even
  when it's not there.  (You can mostly override this by specifying HOSTCC=$CC
  on the make command line, although a few places need actual source patches.)

  Ignoring gcc, make, and bash, this leaves "ar, as, nm, cc, and ld" as
  commands qcc needs to provide for a minimal self-bootstrapping system.

  Note that the above set of tools is specifically enough to build a fresh
  compiler.  When building a linux kernel, creating a bzImage requires objcopy,
  building qemu requires strip, etc.

What commands does the current gcc/binutils combo provide?

  gcc 4.1 provides the commands:
    cc/gcc - C compiler
    cpp - C preprocessor (equivalent to cc -E)
    gcov - coverage tester (optional debugging tool)

    Of these, cc is required, cpp is low hanging fruit, and gcov is probably
    unnecessary.

  Binutils provides:
    ar - archiver, creates .a files.
    ranlib - generate index to .a archive (equivalent to ar -s)
    as - assembler
    ld - linker
    strip - discard symbols from object files (equilvalent to ld -S)
    nm - list symbols from ELF files.
    size - show ELF section sizes
    objdump - show contents of ELF files
    objcopy - copy/translate ELF files
    readelf - show contents of ELF files
    addr2line - convert addresses to filename/line number (optional debug tool)
    strings - show printable characters from binary file
    gprof - profiling support (optional)
    c++filt - C++ and Java, not C.
    windmc, dlltool - Windows only (why is it installed on Linux?)
    nlmconv - Novell Netware only (why is this installd on Linux?)

    Of these, ar, as, ld, and nm are needed, ranlib, strip, addr2line, and
    size are low hanging fruit, size, objdump, obcopy, and readelf are
    variants of the same logic as nm, and gprof, c++filt, windmc, dlltool,
    and nlmconv are probably unnecessary.

Standards:

  The following utilities have SUSv4 pages describing their operation, at
  http://www.opengroup.org/onlinepubs/9699919799/utilities

    ar, c99, nm, strings

  This means the following don't:

    ld, cpp, as, ranlib, strip, size, readelf, objdump, objcopy, addr2line

  (There isn't a "cc" standard, but you can probably use "c99" for that.)

Existing code:

  multiplexer:

    The compiler must be provide several different names, yet the same
    functionality must be callable from a single compiler executable,
    assembling when it encounters embedded assembler, passing on linker
    options via "-Wl," to the linking stage, and so on.

    The easy way to do this is for the qcc executable to be a swiss-army-knife
    executable, like busybox.  It needs a command multiplexer which can figure
    out which name it was called under and change behavior appropriately, to
    act as a compiler, assembler, linker, and so on.

    This multiplexer should accept arbitrary prefixes, so cross compiler names
    such as "i686-cc" work.  This means instead of matching entire known names,
    the multiplexer should checks that commands _end_  with recognized strings.
    (This would not only allow it to be called as both "qcc" and "cc", but
    would have the added bonus of making "gcc" work like "cc" as well.)

    Both busybox and tinycc already handle this.  Pretty straightforward.

  cc/c99 - front-end option parsing

    Both tinycc's options.c and ccwrap.c (in FWL) handle command line option
    parsing, in different ways.  Both take as input the same command line
    syntax as gcc, which is more or less the c99 command line syntax from
    SUSv4:

      http://www.opengroup.org/onlinepubs/9699919799/utilities/c99.html

    What ccwrap.c does is rewrite a gcc command line to turn "cc hello.c"
    into a big long command line with -L and -I entries, explicitly specifying
    header and library paths, the need to link against standard libraries
    such as libc, and to link against crt1.o and such as appropriate.

    Such a front end option parser could perform such command line rewriting
    and then call a "cc1" that contains no built-in knowledge about standard
    paths or libraries.  This would neatly centralize such behavior, and
    if the rewritten command line could actually be extracted it could be
    tested against other compilers (such as gcc) to help debugging.

    Note that adding distcc or ccache support to such a wrapper is a fairly
    straightforward item for future expansion.

    The option parser needs to distinguish "compiling" from "linking".

      When compiling, the option parser needs to specify two include paths;
      one for the compiler (varargs.h, defaulting to ../qcc/include) and
      one for the system (stdio.h, defaulting to ../include).

      When linking, the option parser needs to specify the compiler library
      path (where libqcc.a lives, defaulting to ../qcc/lib), the system
      library path (where libc.a lives, defaulting to ../lib), and add
      explicit calls to link in the standard libraries and the startup/exit
      code.  Currently, ccwrap.c does all this.

    Note that these default paths aren't relative to the current directory
    (which can't change or files listed on the command line wouldn't be found),
    but relative to the directory where the qcc executable lives.  This allows
    the compiler to be relocatable, and thus extracted into a user's home
    directory and called from there.  (The user's home directory name cannot
    be known at compile time.)  The defaults can also be specified as absolute
    paths when the compiler is configured.

    The current ccwrap.c also modifies the $PATH (so gcc's front-end can
    shell out to tools such as its own "cc1" and "ld"), and supports C++.
    Although qcc doesn't need either of these, both are useful for shelling
    out to another compiler (such as gcc).

    The wrapper can split "compiling and linking" lines into two commands,
    either saving intermediate results in the /tmp directory or forking and
    using pipes.  (That way cc1 doesn't need to know anything about linking.)
    Optionally, the compiler can initialize the same structures used by the
    linker, but is the speed/complexity tradeoff here worth it?

    Note that "-run" support is actually a property of the linker.

  cpp - preprocessor

    This performs macro substitution, like "qcc -E".

  cc1 - compiler

    This compiles C source code.  Specifically, it converts one or more .c
    files into to a single .o file, for a specific target.

    Generating assembly output is best done by running the binary tcg output
    through a disassembler.  Keep it orthogonal.

  ld - linker
    This needs to be able to read .o, .a, and .so files, and produce ELF
    executables and .so files.  It should also support linker scripts.

    This needs to "#include <elf.h>", which non-linux hosts won't always have
    but which qemu has it's own copy of already.

  ar - library archiver
    This is a wimpy archiver.  It creates .a files from .o files
    (and extracts .o files from .a files).  It's a flat archive, with no
    subdirectories.

    Busybox has partial support for this (still read-only, last I checked).

    The ranlib command indexes these archives.

    SUSv4 has a standards document for this command:

      http://www.opengroup.org/onlinepubs/9699919799/utilities/ar.html

  as - assembler
    Tinycc has an x86 assembler.  It should be genericized.

  nm - name list

    For some reason, gcc won't build without this.

    SUSv4 has a standards document for this command:

      http://www.opengroup.org/onlinepubs/9699919799/utilities/nm.html


More information about the Celinux-dev mailing list