[cgl_discussion] [Fwd: [Dcl_tech_board] Linux Kernel Crash Dump (LKCD) evaluation]

John Cherry cherry at osdl.org
Fri Apr 11 17:25:04 PDT 2003


Steve Hemminger (of DCL fame) wrote up a nice synopsis of the Linux
Kernel Crash Dump (LKCD) project and where it is going.  There are a
number of reasons that LKCD is not gaining mainline acceptance
(dependency on kexec, arch specific, too invasive, bloat, etc.).  Linus
has stated that he would consider a network-only crash dump, but that is
not where LKCD is heading.

Steve is proposing a mini crash dump project that has a chance of
mainline acceptance.  This would be beneficial to DCL, CGL, and the
community at large.

Please give Steve your feedback on this proposal.  I know there has been
some work going on with network dumps, so if you feel that an existing
project might be a good baseline for the mini crash dump project, please
sync up with Steve.

Thanks,
John

-----Forwarded Message-----

> From: Stephen Hemminger <shemminger at osdl.org>
> To: dcl_tech_board at osdl.org, dcl_steering at osdl.org
> Subject: [Dcl_tech_board] Linux Kernel Crash Dump (LKCD) evaluation
> Date: 11 Apr 2003 16:34:52 -0700
> 
> 	Linux Kernel Crash Dump Evaluation
> 
> Crash dump is an important diagnostic tool for production systems. Commercial
> customers rely on binary distributions from vendors; these vendors need tools
> like crash dumps to provide timely support.
> 
> A version of crash dump was ported by SGI and it became the Linux Kernel
> Crash Dump (LKCD) project.  This code has failed to gain community acceptance. 
> 
> 
> Kernel acceptance
> 
> The full text of the October 2002 discussion is in the Addendum.  The
> last attempt to submit LKCD to the kernel failed, and Linus expressed
> the opinion that LKCD is a "vendor-driven" thing [Linus1]. He seemed
> willing to accept a network dump, since net drivers fail less [Linus2].
> Red Hat supplies a network dump, but has requirements for disk dumps as
> well [Dave1].
> 
> OSDL can act to help facilitate a useful crash dump solution, which
> aligns with the "vendor-driven" perspective.  The problem is that the
> existing LKCD project may not be the right mechanism.
> 
> LKCD team
> 
> The LKCD project is active with development for OSDL DCL member
> companies. Individuals from IBM, Intel, and OSDL regularly contribute
> to the current CVS tree. The distro vendors (the real customers) don't
> seem to be involved.
> 
> Issues with LKCD
> 
> * Not heading towards closure
> 
> LKCD has grown in size and complexity since October.  Support
> for kexec (save to memory) has been added, as well as other changes.
> It seems like the project has given up on getting into 2.6.
> 
> Current additions to LKCD are all sound, but are not heading the
> project towards integration in the standard kernel.  The most recent
> is saving the crash dump to reserved memory and then saving it on
> reboot.  This makes sense on some machines with lots of memory, but it
> isn't something that the average binary distribution vendor will end
> up using for support.
> 
> The de-facto strategy is to try to target a smaller solution based on
> the memory dump.  This is not a bad concept, but it means that acceptance
> is now dependent on Linus (and the distros) accepting the kexec patch.
> 
> The active developers on the mailing list responded positively to the
> suggestion of a pared-down crash dump, but no action has been taken
> in that regard.
> 
> * Integration touches too many places
> 
> Far too many files in the main kernel need to be patched.  None of
> these are big patches, but they hit "sensitive places":
>  1. Scheduler needs a change to its main loop to allow other CPUs to dump.
>  2. VM needs an additional flag to keep track of kernel memory usage.
>  3. Makefile changes to get type information.
>  4. SMP IPI additions to capture other processors' state.
>  5. Extensions for reserving memory at boot.
> 
> * Current disk dump is unreliable
>  
> In order to dump to disk, LKCD goes through the normal block device
> layer, which means it must re-enable interrupts and re-schedule.
> Also, since disk drivers are often a source of failures, it risks
> double faulting by using the same code path.
> 
> * Non-IA32 platform support missing on 2.5
> 
> Since it is a side project, no one has updated LKCD to work on
> non-i386 kernels.  Also, since so many places in the main kernel get
> touched, it is a non-trivial port.
> 
> * LKCD interface
> 
> The interface is through /dev/dump. Linus doesn't like pseudo-devices and
> prefers /proc and eventually sysfs for such things. There is a /proc
> interface to LKCD, but the utilities use ioctls on /dev/dump.
> 
> * Bloat
> 
> LKCD supports a plethora of options for dump devices, compression types,
> how much memory to dump, and so on.  This leads to LKCD being tagged as
> bloat.  If LKCD is to work on customer-installed systems, it has to have
> a simple setup.
> 
> Suggested alternative
> 
> Start a project to create a mini-crash dump that has a chance of
> acceptance.
> 
>  * Address the basic requirements of a binary enterprise
> 	system distribution.
> 
>  * Use existing code if possible
> 	- Network only crash dump 
> 	- Rusty's IDE mini-oopser
> 
>  * Use existing dump format to save rewriting analysis tools 
> 
> 
> Addendum
> ============================================================
> 
> [Linus1]
> 
> From: Linus Torvalds <torvalds at transmeta.com>
> To: "Matt D. Robinson" <yakker at aparity.com>
> cc: Rusty Russell <rusty at rustcorp.com.au>, <linux-kernel at vger.kernel.org>, <lkcd-general at lists.sourceforge.net>, <lkcd-devel at lists.sourceforge.net>
> Subject: [lkcd-general] Re: What's left over.
> Date: Thu, 31 Oct 2002 07:46:08 -0800 (PST)
> Sender: lkcd-general-admin at lists.sourceforge.net
> 
> 
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
> 
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > > 
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> > 
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.
> 
> That's fine. And since they are paid to support it, they can apply the 
> patches.  
> 
> What I'm saying by "vendor driven" is that it has no relevance for the 
> standard kernel, and since it has no relevance to that, then I have no 
> incentives to merge it. The crash dump is only useful with people who 
> actively look at the dumps, and I don't know _anybody_ outside of the 
> specialized vendors you mention who actually do that.
> 
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).
> 
> Horse before the cart and all that thing.
> 
> People have to realize that my kernel is not for random new features. The
> stuff I consider important are things that people use on their own, or
> stuff that is the base for other work. Quite often I want vendors to merge
> patches _they_ care about long long before I will merge them (examples of
> this are quite common, things like reiserfs and ext3 etc).
> 
> THAT is what I mean by vendor-driven. If vendors decide they really want
> the patches, and I actually start seeing noises on linux-kernel or getting
> requests for it being merged from _users_ rather than developers, then
> that means that the vendor is on to something.
> 
> 		Linus
> -----------------------------------
> [Linus2]
> 
> From: Linus Torvalds <torvalds at transmeta.com>
> To: "Matt D. Robinson" <yakker at aparity.com>
> cc: Rusty Russell <rusty at rustcorp.com.au>, <linux-kernel at vger.kernel.org>, <lkcd-general at lists.sourceforge.net>, <lkcd-devel at lists.sourceforge.net>
> Subject: [lkcd-general] Re: What's left over.
> Date: Thu, 31 Oct 2002 09:25:21 -0800 (PST)
> Sender: lkcd-general-admin at lists.sourceforge.net
> 
> 
> [ Ok, this is a really serious email. If you don't get it, don't bother 
>   emailing me. Instead, think about it for an hour, and if you still don't 
>   get it, ask somebody you know to explain it to you. ]
> 
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> > 
> > Sure, but why should they have to?  What technical reason is there
> > for not including it, Linus?
> 
> There are many:
> 
>  - bloat kills:
> 
> 	My job is saying "NO!"
> 
> 	In other words: the question is never EVER "Why shouldn't it be
> 	accepted?", but it is always "Why do we really not want to live 
> 	without this?"
> 
>  - included features kill off (potentially better) projects.
> 
> 	There's a big "inertia" to features. It's often better to keep 
> 	features _off_ the standard kernel if they may end up being
> 	further developed in totally new directions.
> 
> 	In particular when it comes to this project, I'm told about
> 	"netdump", which doesn't try to dump to a disk, but over the net.
> 	And quite frankly, my immediate reaction is to say "Hell, I
> 	_never_ want the dump touching my disk, but over the network
> 	sounds like a great idea".
> 
> To me this says "LKCD is stupid". Which means that I'm not going to apply 
> it, and I'm going to need some real reason to do so - ie being proven 
> wrong in the field.
> 
> (And don't get me wrong - I don't mind getting proven wrong. I change my 
> opinions the way some people change underwear. And I think that's ok).
> 
> > I completely don't understand your reasoning here.
> 
> Tough. That's YOUR problem.
> 
> 		Linus
> -----------------------------------
> [Dave1]
> From: Dave Anderson <anderson at redhat.com>
> To: Linus Torvalds <torvalds at transmeta.com>
> CC: "Matt D. Robinson" <yakker at aparity.com>, Rusty Russell <rusty at rustcorp.com.au>, linux-kernel at vger.kernel.org, lkcd-general at lists.sourceforge.net, lkcd-devel at lists.sourceforge.net
> Subject: [lkcd-general] Re: What's left over.
> Date: Thu, 31 Oct 2002 15:59:34 -0500
> Sender: lkcd-general-admin at lists.sourceforge.net
> X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.9-e.3.genterprise i686)
> 
> 
> On Thu, 31 Oct 2002, Linus Torvalds wrote:
> 
> >  - included features kill off (potentially better) projects.
> >
> >         There's a big "inertia" to features. It's often better to keep
> >         features _off_ the standard kernel if they may end up being
> >         further developed in totally new directions.
> >
> >         In particular when it comes to this project, I'm told about
> >         "netdump", which doesn't try to dump to a disk, but over the net.
> >         And quite frankly, my immediate reaction is to say "Hell, I
> >         _never_ want the dump touching my disk, but over the network
> >         sounds like a great idea".
> >
> > To me this says "LKCD is stupid". Which means that I'm not going to apply
> > it, and I'm going to need some real reason to do so - ie being proven
> > wrong in the field.
> >
> > (And don't get me wrong - I don't mind getting proven wrong. I change my
> > opinions the way some people change underwear. And I think that's ok).
> 
> It would be most unfortunate if the existence of netdump is used as a
> reason to deny LKCD's inclusion, or to simply dismiss LKCD as stupid.
> 
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> 
> > We want to see this in the kernel, frankly, because it's a pain
> > in the butt keeping up with your kernel revisions and everything
> > else that goes in that changes.  And I'm sure SuSE, UnitedLinux and
> > (hopefully) Red Hat don't want to spend their time having to roll
> > this stuff in each and every time you roll a new kernel.
> 
> While Red Hat advocates Ingo's netdump option, we have customer
> requests that are requiring us to look at LKCD disk-based dumps as an
> alternative, co-existing dump mechanism.  Since the two methods are not
> mutually exclusive, LKCD will never kill off netdump -- nor certainly
> vice-versa.  We're all just looking for a better means to be able to
> provide support to our customers, not to mention its value as a
> development aid.
> 
> Dave Anderson
> Red Hat, Inc.
> 
> 
> 
> _______________________________________________
> Dcl_tech_board mailing list
> Dcl_tech_board at lists.osdl.org
> http://lists.osdl.org/mailman/listinfo/dcl_tech_board



