[Ksummit-discuss] [CORE TOPIC] stable workflow

Sun Jul 10 06:19:39 UTC 2016

On Sat, Jul 9, 2016 at 6:34 PM, James Bottomley
<James.Bottomley at hansenpartnership.com> wrote:
> [duplicate ksummit-discuss@ cc removed]
> On Sat, 2016-07-09 at 15:49 +0000, Trond Myklebust wrote:
>> > On Jul 9, 2016, at 06:05, James Bottomley <
>> > James.Bottomley at HansenPartnership.com> wrote:
>> >
>> > On Fri, 2016-07-08 at 17:43 -0700, Dmitry Torokhov wrote:
>> > > On Sat, Jul 09, 2016 at 02:37:40AM +0200, Rafael J. Wysocki
>> > > wrote:
>> > > > I tend to think that all known bugs should be fixed, at least
>> > > > because once they have been fixed, no one needs to remember
>> > > > about them any more. :-)
>> > > >
>> > > > Moreover, minor fixes don't really introduce regressions that
>> > > > often
>> > >
>> > > Famous last words :)
>> >
>> > Actually, beyond the humour, the idea that small fixes don't
>> > introduce regressions must be our most annoying anti-pattern.  The
>> > reality is that a lot of so called fixes do introduce bugs.  The
>> > way this happens is that a lot of these "obvious" fixes go through
>> > without any deep review (because they're obvious, right?) and the
>> > bugs noisily turn up slightly later.  The way this works is usually
>> > that some code rearrangement is sold as a "fix" and later turns out
>> > not to be equivalent to the prior code ... sometimes in incredibly
>> > subtle ways. I think we should all be paying much more than lip
>> > service to the old adage "If it ain't broke don't fix it".
>>
>> The main problem with the stable kernel model right now is that we
>> have no set of regression tests to apply. Unless someone goes in and
>> actually tests each and every stable kernel affected by that “Cc:
>> stable” line, then regressions will eventually happen.
>>
>> So do we want to have another round of “how do we regression test the
>> kernel” talks?
>
> If I look back on our problems, they were all in device drivers, so
> generic regression testing wouldn't have picked them up, in fact most
> would need specific testing on the actual problem device.  So, I don't
> really think testing is the issue, I think it's that we commit way too
> many "obvious" patches.  In SCSI we try to gate it by having a
> mandatory Reviewed-by: tag before something gets in, but really perhaps
> we should insist on Tested-by: as well ... that way there's some
> guarantee that the actual device being modified has been tested.

Having worked on one of the projects that tried to track -stable but
got internal pushback against doing so, it came down to this:

The in-house developers on a certain subsystem didn't trust the
upstream maintainers to not regress their drivers -- in particular
they had seen some painful regressions on older chipsets when newer
hardware support was picked up. Esoteric bugs that had been fixed with
the help of the support team weren't folded properly into the
upstream sources; or when they were, they looked sufficiently different
that when -stable came around the developers didn't want to revert to
that version; or they hadn't yet been picked up upstream and other
fixes were now touching the same code, which seemed risky. They had a
code base that worked for the use cases they cared about (with the fix
applied that the support team had provided), and very little interest
in risking a regression from switching to the upstream version.

In hindsight, I think the specific problems seen had later been solved
through other means, but the reluctance to keep upreving to -stable
was hard to get rid of once someone had gotten burnt by it, and it
didn't seem worth it at the time.

Instead, what the team started doing was using -stable as a source for
fixes -- when looking at a bug, the first thing you looked for was
whether someone had touched that code/subsystem in -stable. It's not ideal
in the sense that you have to hit the bug and someone has to look at
it, but it was the state we ended up in on that project. It means
-stable still has substantial value even though it's not merged
directly.
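
As a rough illustration of that "check -stable first" lookup, here is a
minimal sketch, not the tooling the team actually used. It assumes a
local clone of the linux-stable tree; the function names, the tags
(v4.4, v4.4.14) and the path (drivers/mmc/host) are placeholders picked
for the example. It only wraps a plain `git log <base>..<stable> -- <paths>`.

```python
import subprocess


def stable_log_command(base_tag, stable_tag, paths):
    """Build the git command that lists -stable commits touching `paths`.

    base_tag   -- release the in-house tree is based on (e.g. "v4.4")
    stable_tag -- newer -stable release to search (e.g. "v4.4.14")
    paths      -- files or directories of the code under investigation
    All three are illustrative inputs, not anything from the project.
    """
    if not paths:
        raise ValueError("need at least one path to narrow the search")
    # "A..B" selects commits reachable from B but not from A; the "--"
    # separates the revision range from the path filters.
    return ["git", "log", "--oneline", f"{base_tag}..{stable_tag}", "--", *paths]


def stable_commits(repo_dir, base_tag, stable_tag, paths):
    """Run the command in `repo_dir` and return one summary line per commit."""
    cmd = stable_log_command(base_tag, stable_tag, paths)
    result = subprocess.run(
        cmd, cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Print the command rather than running it, so this works without a clone.
    print(" ".join(stable_log_command("v4.4", "v4.4.14", ["drivers/mmc/host"])))
```

Each line that comes back is a candidate fix to read and, if it matches
the bug at hand, cherry-pick by hand into the in-house tree.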

-Olof