[Ksummit-discuss] [CORE TOPIC] stable workflow

Dan Williams dan.j.williams at intel.com
Sun Jul 10 02:27:00 UTC 2016


On Sat, Jul 9, 2016 at 6:56 PM, James Bottomley
<James.Bottomley at hansenpartnership.com> wrote:
> On Sun, 2016-07-10 at 01:43 +0000, Trond Myklebust wrote:
>> > On Jul 9, 2016, at 21:34, James Bottomley <
>> > James.Bottomley at HansenPartnership.com> wrote:
>> >
>> > [duplicate ksummit-discuss@ cc removed]
>> > On Sat, 2016-07-09 at 15:49 +0000, Trond Myklebust wrote:
>> > > > On Jul 9, 2016, at 06:05, James Bottomley <
>> > > > James.Bottomley at HansenPartnership.com> wrote:
>> > > >
>> > > > On Fri, 2016-07-08 at 17:43 -0700, Dmitry Torokhov wrote:
>> > > > > On Sat, Jul 09, 2016 at 02:37:40AM +0200, Rafael J. Wysocki
>> > > > > wrote:
>> > > > > > I tend to think that all known bugs should be fixed, at
>> > > > > > least
>> > > > > > because once they have been fixed, no one needs to remember
>> > > > > > about them any more. :-)
>> > > > > >
>> > > > > > Moreover, minor fixes don't really introduce regressions
>> > > > > > that
>> > > > > > often
>> > > > >
>> > > > > Famous last words :)
>> > > >
>> > > > Actually, beyond the humour, the idea that small fixes don't
>> > > > introduce regressions must be our most annoying anti-pattern.
>> > > >  The
>> > > > reality is that a lot of so called fixes do introduce bugs.
>> > > >  The
>> > > > way this happens is that a lot of these "obvious" fixes go
>> > > > through
>> > > > without any deep review (because they're obvious, right?) and
>> > > > the
>> > > > bugs noisily turn up slightly later.  The way this works is
>> > > > usually
>> > > > that some code rearrangement is sold as a "fix" and later turns
>> > > > out
>> > > > not to be equivalent to the prior code ... sometimes in
>> > > > incredibly
>> > > > subtle ways. I think we should all be paying much more than lip
>> > > > service to the old adage "If it ain't broke don't fix it".
>> > >
>> > > The main problem with the stable kernel model right now is that
>> > > we
>> > > have no set of regression tests to apply. Unless someone goes in
>> > > and
>> > > actually tests each and every stable kernel affected by that “Cc:
>> > > stable” line, then regressions will eventually happen.
>> > >
>> > > So do we want to have another round of “how do we regression test
>> > > the
>> > > kernel” talks?
>> >
>> > If I look back on our problems, they were all in device drivers, so
>> > generic regression testing wouldn't have picked them up, in fact
>> > most
>> > would need specific testing on the actual problem device.  So, I
>> > don't
>> > really think testing is the issue, I think it's that we commit way
>> > too
>> > many "obvious" patches.  In SCSI we try to gate it by having a
>> > mandatory Reviewed-by: tag before something gets in, but really
>> > perhaps
>> > we should insist on Tested-by: as well ... that way there's some
>> > guarantee that the actual device being modified has been tested.
>>
>> That guarantees that it has been tested on the head of the kernel
>> tree, but it doesn’t really tell you much about the behaviour when it
>> hits the stable trees.
>
> The majority of stable regressions are actually patches with subtle
> failures even in the head, so testing on the head properly would have
> eliminated them.  I grant there are some problems where the backport
> itself is flawed but the head works (usually because of missing
> intermediate stuff) but perhaps by insisting on a Tested-by: before
> backporting, we can at least eliminate a significant fraction of
> regressions.
>
>>  What I’m saying is that we really want some form of unit testing
>> that can be run to perform a minimal validation of the patch when it
>> hits the older tree.
>>
>> Even device drivers have expected outputs for a given input that can
>> be validated through unit testing.
>
> Without the actual hardware, this is difficult ...

...but not impossible.  There is certainly opportunity to test more
code paths than we do today with unit testing approaches.  For example,
tools/testing/nvdimm/ simulates "interesting" values in an ACPI NFIT
table and does not need a physical platform.  Yes, there will always
be a class of bugs that can only be reproduced with hardware.
However, I've tested USB host controller TRB handling code with unit
tests for conditions that are difficult to reproduce with actual
hardware.  I think there is room for improvement in device driver
unit testing.
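
To make the idea concrete, here is a minimal sketch of testing
driver-side parsing against a simulated firmware table.  It is not
taken from tools/testing/nvdimm/ and does not use the real ACPI NFIT
layout: the record format and the names parse_regions() and
fake_table() are assumptions made up for illustration.

```python
import struct
from typing import List, Tuple

# Hypothetical record layout, for illustration only. This is NOT the real
# ACPI NFIT format: each record is a little-endian u64 base and u64 length.
RECORD = struct.Struct("<QQ")


def parse_regions(table: bytes) -> List[Tuple[int, int]]:
    """Driver-side logic under test: decode (base, length) records.

    Rejects truncated tables, zero-length regions, and overlapping regions.
    """
    if len(table) % RECORD.size != 0:
        raise ValueError("truncated table")

    regions = [
        RECORD.unpack_from(table, offset)
        for offset in range(0, len(table), RECORD.size)
    ]

    for _base, length in regions:
        if length == 0:
            raise ValueError("zero-length region")

    # Sort by base address so only neighbouring regions need comparing.
    ordered = sorted(regions)
    for (base_a, len_a), (base_b, _len_b) in zip(ordered, ordered[1:]):
        if base_a + len_a > base_b:
            raise ValueError("overlapping regions")

    return regions


def fake_table(*regions: Tuple[int, int]) -> bytes:
    """Test-side helper: build a simulated table, no hardware required."""
    return b"".join(RECORD.pack(base, length) for base, length in regions)
```

A test can then feed in values that would be awkward to provoke on a
real platform, for example parse_regions(fake_table((0x1000, 0x2000),
(0x2000, 0x1000))), and check that the overlap is rejected.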

