[lsb-discuss] Ways for us to work more efficiently
Wichmann, Mats D
mats.d.wichmann at intel.com
Thu Jan 24 14:08:43 PST 2008
> Yes, it means more formalism, and yes it means people will
> need to write more. But (a) as we add more people to the project,
> we need a bit more formalism so that we can all work together
> efficiently, and (b) it will allow us to be able to more
> accurately predict when we will be able to make a release ---
> and more importantly know whether we are in danger of
> slipping release deadlines unless we cut features, or add more
> people, or both.
There's nothing particularly wrong with any of this. I do have
a bunch of random comments, though.
It would take us closer to the way the project operated some
years ago; many of the more formal procedures were pared down
over time as staff dwindled and we ended up with more overhead
than various people thought was needed. As a very basic example,
we had separate project teams for distinct functional areas -
this now lives on as a memory in bugzilla categories, and in the
distinct package signing keys that were used by the various teams.
Some of the other oddities are the result of natural "bitrot" -
for example, under cvs the project repository was broken up
the same way as described above, by functional area corresponding
to the teams. With the move to bazaar, it was felt that since
it doesn't handle sub-trees at all well (you need to check out
the whole tree, and a commit anywhere in the tree serves as
an update to the whole tree) it was better to break things up
per component instead - this is where we got separate trees
for each test suite. The bit-rot part is that "what to retrieve"
probably didn't get very well documented, as you noted.
We absolutely need a getting started page. We had one (I know
this because I "got it started" when the lsb-desktop team was
forming), and I'm darned if I know where it went to.
In general, the wiki is appallingly stale, and I assert that
at least part of the reason is that it's now too hard to use,
so you don't get the same level of "drive-by updates" that we
used to - the whole purpose of a wiki. The authentication
mechanism is extremely slow (e.g. it takes several minutes for
me to complete a login), and, for some reason nobody has ever
explained to me, it times out your login very frequently. So
you'll have a window open which will indicate you're logged in,
you click to edit, are told you have to log in to edit pages,
you go to the login page, and then many times (not always,
I've never been able to sort out the exact conditions) your
place is lost and you have to navigate back. If I could leave
my wiki tab in firefox sitting there and just toss in some
text when the motivation struck rather than going through the
song and dance, at least I would keep things more up to date.
That said - a getting started page is needed, but long experience
has taught us that it's still necessary to have a mentor.
I'm happy to do that for anyone who wants one.
We do have a wiki page on using the LSB bugzilla, and we can
certainly update that. That one at least hasn't vanished:
> Concept that the LSB specification, build tools, etc., should be always
> buildable, and ready for release. This means daily builds and regression
> test suites, so we can discover problems earlier. Michael
> Schultheiss' work is the framework of what we need, but we need to
> deploy it on enough machines so we can be automatically running it every
> day on all of our architectures and on as many distributions as
> possible. This is going to be a huge test matrix, and it is almost
> certain that we do not have enough development machines to support this.
> So one of the things we need to do is to figure out how many resources
> this will need, so we can start trying to request the necessary hardware
> and rack space so we can do this kind of exhaustive testing --- and not
> just with the current enterprise versions, but also for the development
> "community distribution" versions of the enterprise distros (i.e.,
> Fedora, Open SuSE, Debian unstable, etc.)
Always buildable, and ready for release, are clearly two different
points - both very important. During 2007 we made really good progress
on the former; we forced all code into the autobuild system
(before 2007, test suites were not autobuilt). In the process we
strained the autobuild setup, so we may have to look at rearchitecting
things a bit. On probably 25% of days, some package failed to build
somewhere for reasons that were not code bugs - missing dependencies
that could not be expressed in an automatic way (somebody had to install
foo-devel manually on each of the seven autobuilders); wedged resources;
low disk space; machines down; leftover bits from a manually initiated
build that the autobuilder didn't have permission to overwrite, thus
failing; etc. There's an autobuilder status page for those who don't
and I'm happy to say it's all green today :-)
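To make the non-code failure modes listed above a bit more concrete,
here is a minimal sketch of a pre-build sanity check in Python. It is
not what the autobuilders actually run; the function name preflight,
its parameters, and the idea of checking for a file that a -devel
package would install are all assumptions made for illustration.

```python
import os
import shutil


def preflight(workdir, min_free_bytes, required_files=()):
    """Return a list of problems found; an empty list means OK to build.

    workdir, min_free_bytes and required_files are hypothetical
    parameters for this sketch, not settings of any real autobuilder.
    """
    if not os.path.isdir(workdir):
        return [f"{workdir}: build directory does not exist"]

    problems = []

    # Low disk space.
    free = shutil.disk_usage(workdir).free
    if free < min_free_bytes:
        problems.append(
            f"{workdir}: {free} bytes free, need {min_free_bytes}")

    # Dependencies that had to be installed by hand, detected here by
    # looking for a file they are expected to provide.
    for path in required_files:
        if not os.path.exists(path):
            problems.append(f"{path}: required file missing")

    # Leftover bits from a manual build that this user cannot overwrite.
    for root, dirs, files in os.walk(workdir):
        for name in dirs + files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            if not os.access(full, os.W_OK):
                problems.append(f"{full}: not writable by this user")

    return problems
```

Running something like this before each nightly build would at least
let the status page distinguish "the machine was not ready" from "the
code did not build".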
Setting that aside, we actually spent very little time with packages
not buildable due to code bugs. We don't autobuild the specification
books - I'm not sure it would be valuable - but we spent zero days with
the spec not buildable.
Ready for release, of course, was a completely different story. Yes,
testing feedback from lots of different distros, released &&
is crucial. I've been hoping we could have some of those resources
be external to LF; it would be pretty tricky to set up and maintain
the right farm and I think would cost quite a bit of admin time where
we're doing both nightly updates of the distro and of the tests (with
the former not infrequently requiring a reboot for a new kernel). We
have trouble keeping our range of seven autobuild architectures,
spread over five machines (one of which is a donated resource at IBM)
up and running reliably, and those do not have the distro component
updated as part of the procedure. Maybe it's not practical, but I'd
love to see partnerships where at least in-development distros would
host an instance of the autotester and push results back.
Even if you get all this data, the hard part is analysis - you could
argue that we now have access to too much data. We don't yet have
very good comparative tools. If you have two test journals from
two different runs, you can easily do a comparison with tjreport.
But as the number of tests has swelled such that underneath the
nice dtk-manager interface it's anything but monolithic - a current
full certification-ready run will produce 27 distinct journal files -
it's not that easy to identify a failure that is unique to one or
a subset of distros, and then drill down on what's going on. I'll
give just one example - we've been flummoxed by changes in behavior
in Gtk, where suddenly tests start returning different results in
later releases. The test journal for this component doesn't include
versioning information for the components, so once you've identified
which distributions are failing you have to start digging to see if
you can identify a common thread such as gtk-2.14 "fails", anything
earlier "works". THEN you can start analysis of whether there was
an incompatible behavior change upstream, or whether the test made
bad assumptions, or....
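As a rough illustration of the drill-down described above, here is a
minimal sketch in Python. It assumes the journals have already been
parsed into a mapping of distro -> {test name -> result}, and that
component versions have been collected separately per distro (the
journals don't carry them). The data layout, the function names and
the sample distro and test names are all hypothetical; none of this
is part of tjreport or any other LSB tool.

```python
from collections import defaultdict


def failures_by_distro_set(results, failing=("FAIL", "UNRESOLVED")):
    """Group tests by the exact set of distros on which they fail.

    results maps distro name -> {test name -> result string}.
    Tests that fail on every distro are left out, since the point is
    to find failures unique to one distro or a subset of them.
    """
    all_distros = frozenset(results)
    failed_on = defaultdict(set)
    for distro, tests in results.items():
        for test, outcome in tests.items():
            if outcome in failing:
                failed_on[test].add(distro)

    groups = defaultdict(list)
    for test, distros in failed_on.items():
        key = frozenset(distros)
        if key != all_distros:
            groups[key].append(test)
    return {key: sorted(tests) for key, tests in groups.items()}


def version_split(versions, failing_distros):
    """Return (versions seen only where it fails, only where it passes).

    versions maps distro name -> version string of one component.
    """
    bad = {versions[d] for d in failing_distros}
    good = {v for d, v in versions.items() if d not in failing_distros}
    return sorted(bad - good), sorted(good - bad)


# Made-up sample data, for illustration only.
results = {
    "distro-a": {"gtk/test1": "PASS", "gtk/test2": "PASS"},
    "distro-b": {"gtk/test1": "FAIL", "gtk/test2": "PASS"},
    "distro-c": {"gtk/test1": "FAIL", "gtk/test2": "PASS"},
}
gtk_versions = {"distro-a": "2.12", "distro-b": "2.14", "distro-c": "2.14"}

for distros, tests in failures_by_distro_set(results).items():
    only_bad, only_good = version_split(gtk_versions, distros)
    print(sorted(distros), tests, only_bad, only_good)
```

A clean split like this only gives you the "common thread"; deciding
whether upstream changed behavior or the test made bad assumptions
still has to be done by hand.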
I guess I'll stop there, the ISPRAS folks may have more comments
as they're doing just this kind of work with the tests they're
developing. Easy, it's not.
Conclusion: to go from autobuilt-nightly to ready-to-release-nightly
I do think we need to come up with much more sophisticated data
analysis mechanisms that can help us understand what that night's
auto-test runs mean.