[cgl_discussion] [Fwd: [Dcl_tech_board] Linux Kernel Crash Dump (LKCD) evaluation]
John Cherry
cherry at osdl.org
Fri Apr 11 17:25:04 PDT 2003
Steve Hemminger (of DCL fame) wrote up a nice synopsis of the Linux
Kernel Crash Dump (LKCD) project and where it is going. There are a
number of reasons that LKCD is not gaining mainline acceptance
(dependency on kexec, arch-specific code, too invasive, bloat, etc.). Linus
has stated that he would consider a network-only crash dump, but that is
not where LKCD is heading.
Steve is proposing a mini crash dump project that has a chance of
mainline acceptance. This would be beneficial to DCL, CGL, and the
community at large.
Please give Steve your feedback on this proposal. I know there has been
some work going on with network dumps, so if you feel that an existing
project might be a good baseline for the mini crash dump project, please
sync up with Steve.
Thanks,
John
-----Forwarded Message-----
> From: Stephen Hemminger <shemminger at osdl.org>
> To: dcl_tech_board at osdl.org, dcl_steering at osdl.org
> Subject: [Dcl_tech_board] Linux Kernel Crash Dump (LKCD) evaluation
> Date: 11 Apr 2003 16:34:52 -0700
>
> Linux Kernel Crash Dump Evaluation
>
> Crash dump is an important diagnostic tool for production systems. Commercial
> customers rely on binary distributions from vendors; these vendors need tools
> like crash dumps to provide timely support.
>
> A version of crash dump was ported by SGI and it became the Linux Kernel
> Crash Dump (LKCD) project. This code has failed to gain community acceptance.
>
>
> Kernel acceptance
>
> The full text of the discussion from Oct '02 is in the Addendum. The last
> attempt to submit LKCD to the kernel failed, and Linus expressed the opinion
> that LKCD is a "vendor-driven" thing [Linus1]. He seemed willing to accept
> network dump, since net drivers fail less [Linus2]. Red Hat supplies network
> dump, but has requirements for disk as well [Dave1].
>
> OSDL can help facilitate a useful crash dump solution, which
> aligns with the "vendor-driven" perspective. The problem is that the
> existing LKCD project may not be the right mechanism.
>
> LKCD team
>
> The LKCD project is active with development for OSDL DCL member
> companies. Individuals from IBM, Intel, and OSDL regularly contribute
> to the current CVS tree. The distro vendors (the real customers) don't
> seem to be involved.
>
> Issues with LKCD
>
> * Not heading towards closure
>
> LKCD has grown in size and complexity since October. Support
> for kexec (save to memory) has been added, as well as other changes.
> It seems like the project has given up on getting it into 2.6.
>
> Current additions to LKCD are all sound, but are not moving the
> project towards integration in the standard kernel. The most recent
> is saving the crash dump to reserved memory and writing it out on reboot.
> This makes sense on some machines with lots of memory, but it isn't
> something the average binary distribution vendor will end up using for
> support.
>
> The de facto strategy is to target a smaller solution based on
> the memory dump. This is not a bad concept, but it means that acceptance
> is now dependent on Linus (and the distros) accepting the kexec patch.
>
> Response on the mailing list to the suggestion of a pared-down crash
> dump was positive from the active developers, but no action has been
> taken in that regard.
>
> * Integration touches too many places
>
> Far too many files in the main kernel need to be patched. None of
> these are big patches, but they hit "sensitive places":
> 1. Scheduler needs a change to its main loop to allow other CPUs to dump.
> 2. VM needs an additional flag to keep track of kernel memory usage.
> 3. Makefile changes to get type information.
> 4. SMP IPI additions to capture other processors' state.
> 5. Extensions for reserving memory at boot.
>
> * Current disk dump is unreliable
>
> In order to dump to disk, LKCD goes through the normal block device
> layer, which means it must re-enable interrupts and
> re-schedule. Also, since disk drivers are often a source of failures,
> it risks double faulting by using the same code path.
>
> * Non-IA32 platform support missing on 2.5
>
> Since it is a side project, no one has updated LKCD to work on
> non-i386 kernels. Also, since so many places get touched in the main
> kernel, it is a non-trivial port.
>
> * LKCD interface
>
> The interface is through /dev/dump. Linus doesn't like pseudo-devices and
> prefers /proc and eventually sysfs for such things. There is a /proc
> interface to LKCD, but the utilities use ioctls on /dev/dump.
>
> * Bloat
>
> LKCD supports a plethora of options for dump devices, compression types,
> how much memory to dump, and so on. This leads to LKCD being tagged as bloat.
> If LKCD is to work on customer-installed systems, it has to have a simple setup.
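
The interface and bloat points above both argue for a small, text-based control surface instead of ioctls on a pseudo-device. As a rough illustration only, here is a user-space sketch of a handler that accepts one "key=value" line, the kind of string a /proc-style file would receive from a shell `echo`. Nothing here is LKCD code: the struct, the function name and the two keys (`device`, `enabled`) are hypothetical, chosen just to show that a deliberately short whitelist keeps the setup simple.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical settings; not taken from LKCD. */
struct dump_config {
    char device[32];  /* name of the dump target (assumed field) */
    int  enabled;     /* 0 = off, 1 = on (assumed field) */
};

/*
 * Apply one "key=value" line to cfg.
 * Returns 0 on success, -1 for an unknown key or a malformed value.
 * Unknown keys are rejected on purpose: the option set stays small.
 */
int dump_config_apply(struct dump_config *cfg, const char *line)
{
    const char *eq;
    const char *val;
    size_t klen, vlen;

    assert(cfg != NULL && line != NULL);

    eq = strchr(line, '=');
    if (eq == NULL)
        return -1;

    klen = (size_t)(eq - line);
    val = eq + 1;
    vlen = strcspn(val, "\r\n");  /* ignore a trailing newline */

    if (klen == 6 && strncmp(line, "device", 6) == 0) {
        if (vlen == 0 || vlen >= sizeof cfg->device)
            return -1;
        memcpy(cfg->device, val, vlen);
        cfg->device[vlen] = '\0';
        return 0;
    }

    if (klen == 7 && strncmp(line, "enabled", 7) == 0) {
        if (vlen != 1 || (val[0] != '0' && val[0] != '1'))
            return -1;
        cfg->enabled = val[0] - '0';
        return 0;
    }

    return -1;
}
```

A caller would pass each written line to `dump_config_apply()` and report an error for anything outside the whitelist, so a request such as "compression=gzip" is simply refused in this sketch.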
>
> Suggested alternative
>
> Start a project to create a mini crash dump that has a chance of
> acceptance.
>
> * Address the basic requirements of a binary enterprise
> system distribution.
>
> * Use existing code if possible
> - Network only crash dump
> - Rusty's IDE mini-oopser
>
> * Use existing dump format to save rewriting analysis tools
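
To make the network-only idea in the list above concrete, here is a minimal user-space sketch of how a memory image could be cut into fixed-size, self-describing packets and reassembled by a receiver in any arrival order. It is not the netdump or LKCD wire format: the packet layout, `CHUNK_SIZE`, `DUMP_MAGIC` and all function names are assumptions made for illustration, and a real sender would also need polled network I/O and explicit byte ordering, both omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 1024u        /* payload bytes per packet (assumed) */
#define DUMP_MAGIC 0x44554d50u  /* hypothetical marker, ASCII "DUMP" */

/* Hypothetical packet layout; fields are in host byte order here. */
struct dump_packet {
    uint32_t magic;   /* sanity check for the receiver */
    uint32_t seq;     /* chunk number, starting at 0 */
    uint32_t offset;  /* byte offset of the payload in the image */
    uint32_t len;     /* valid bytes in payload */
    uint8_t  payload[CHUNK_SIZE];
};

/* Number of packets needed to carry mem_len bytes. */
uint32_t dump_packet_count(size_t mem_len)
{
    return (uint32_t)((mem_len + CHUNK_SIZE - 1) / CHUNK_SIZE);
}

/* Sender: fill pkt with chunk seq of mem. Returns 1, or 0 past the end. */
int dump_make_packet(const uint8_t *mem, size_t mem_len,
                     uint32_t seq, struct dump_packet *pkt)
{
    size_t offset = (size_t)seq * CHUNK_SIZE;
    size_t len;

    assert(mem != NULL && pkt != NULL);
    if (offset >= mem_len)
        return 0;

    len = mem_len - offset;
    if (len > CHUNK_SIZE)
        len = CHUNK_SIZE;

    pkt->magic = DUMP_MAGIC;
    pkt->seq = seq;
    pkt->offset = (uint32_t)offset;
    pkt->len = (uint32_t)len;
    memcpy(pkt->payload, mem + offset, len);
    return 1;
}

/* Receiver: place a packet into image. Returns 0, or -1 if rejected. */
int dump_store_packet(uint8_t *image, size_t image_len,
                      const struct dump_packet *pkt)
{
    assert(image != NULL && pkt != NULL);
    if (pkt->magic != DUMP_MAGIC || pkt->len > CHUNK_SIZE)
        return -1;
    if (pkt->offset > image_len || pkt->len > image_len - pkt->offset)
        return -1;
    memcpy(image + pkt->offset, pkt->payload, pkt->len);
    return 0;
}
```

Because every packet carries its own offset, the receiver needs no state beyond the image buffer, and lost chunks can be requested again by sequence number. The reassembled image could then be written out in an existing dump format so that current analysis tools keep working.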
>
>
> Addendum
> ============================================================
>
> [Linus1]
>
> From: Linus Torvalds <torvalds at transmeta.com>
> To: "Matt D. Robinson" <yakker at aparity.com>
> cc: Rusty Russell <rusty at rustcorp.com.au>, <linux-kernel at vger.kernel.org>, <lkcd-general at lists.sourceforge.net>, <lkcd-devel at lists.sourceforge.net>
> Subject: [lkcd-general] Re: What's left over.
> Date: Thu, 31 Oct 2002 07:46:08 -0800 (PST)
> Sender: lkcd-general-admin at lists.sourceforge.net
>
>
> On Wed, 30 Oct 2002, Matt D. Robinson wrote:
>
> > Linus Torvalds wrote:
> > > > Crash Dumping (LKCD)
> > >
> > > This is definitely a vendor-driven thing. I don't believe it has any
> > > relevance unless vendors actively support it.
> >
> > There are people within IBM in Germany, India and England, as well as
> > a number of companies (Intel, NEC, Hitachi, Fujitsu), as well as SGI
> > that are PAID to support this.
>
> That's fine. And since they are paid to support it, they can apply the
> patches.
>
> What I'm saying by "vendor driven" is that it has no relevance for the
> standard kernel, and since it has no relevance to that, then I have no
> incentives to merge it. The crash dump is only useful with people who
> actively look at the dumps, and I don't know _anybody_ outside of the
> specialized vendors you mention who actually do that.
>
> I will merge it when there are real users who want it - usually as a
> result of having gotten used to it through a vendor who supports it. (And
> by "support" I do not mean "maintain the patches", but "actively uses it"
> to work out the users problems or whatever).
>
> Horse before the cart and all that thing.
>
> People have to realize that my kernel is not for random new features. The
> stuff I consider important are things that people use on their own, or
> stuff that is the base for other work. Quite often I want vendors to merge
> patches _they_ care about long long before I will merge them (examples of
> this are quite common, things like reiserfs and ext3 etc).
>
> THAT is what I mean by vendor-driven. If vendors decide they really want
> the patches, and I actually start seeing noises on linux-kernel or getting
> requests for it being merged from _users_ rather than developers, then
> that means that the vendor is on to something.
>
> Linus
> -----------------------------------
> [Linus2]
>
> From: Linus Torvalds <torvalds at transmeta.com>
> To: "Matt D. Robinson" <yakker at aparity.com>
> cc: Rusty Russell <rusty at rustcorp.com.au>, <linux-kernel at vger.kernel.org>, <lkcd-general at lists.sourceforge.net>, <lkcd-devel at lists.sourceforge.net>
> Subject: [lkcd-general] Re: What's left over.
> Date: Thu, 31 Oct 2002 09:25:21 -0800 (PST)
> Sender: lkcd-general-admin at lists.sourceforge.net
>
>
> [ Ok, this is a really serious email. If you don't get it, don't bother
> emailing me. Instead, think about it for an hour, and if you still don't
> get it, ask somebody you know to explain it to you. ]
>
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
> >
> > Sure, but why should they have to? What technical reason is there
> > for not including it, Linus?
>
> There are many:
>
> - bloat kills:
>
> My job is saying "NO!"
>
> In other words: the question is never EVER "Why shouldn't it be
> accepted?", but it is always "Why do we really not want to live
> without this?"
>
> - included features kill off (potentially better) projects.
>
> There's a big "inertia" to features. It's often better to keep
> features _off_ the standard kernel if they may end up being
> further developed in totally new directions.
>
> In particular when it comes to this project, I'm told about
> "netdump", which doesn't try to dump to a disk, but over the net.
> And quite frankly, my immediate reaction is to say "Hell, I
> _never_ want the dump touching my disk, but over the network
> sounds like a great idea".
>
> To me this says "LKCD is stupid". Which means that I'm not going to apply
> it, and I'm going to need some real reason to do so - ie being proven
> wrong in the field.
>
> (And don't get me wrong - I don't mind getting proven wrong. I change my
> opinions the way some people change underwear. And I think that's ok).
>
> > I completely don't understand your reasoning here.
>
> Tough. That's YOUR problem.
>
> Linus
> -----------------------------------
> [Dave1]
> From: Dave Anderson <anderson at redhat.com>
> To: Linus Torvalds <torvalds at transmeta.com>
> CC: "Matt D. Robinson" <yakker at aparity.com>, Rusty Russell <rusty at rustcorp.com.au>, linux-kernel at vger.kernel.org, lkcd-general at lists.sourceforge.net, lkcd-devel at lists.sourceforge.net
> Subject: [lkcd-general] Re: What's left over.
> Date: Thu, 31 Oct 2002 15:59:34 -0500
> Sender: lkcd-general-admin at lists.sourceforge.net
> X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.9-e.3.genterprise i686)
>
>
> On Thu, 31 Oct 2002, Linus Torvalds wrote:
>
> > - included features kill off (potentially better) projects.
> >
> > There's a big "inertia" to features. It's often better to keep
> > features _off_ the standard kernel if they may end up being
> > further developed in totally new directions.
> >
> > In particular when it comes to this project, I'm told about
> > "netdump", which doesn't try to dump to a disk, but over the net.
> > And quite frankly, my immediate reaction is to say "Hell, I
> > _never_ want the dump touching my disk, but over the network
> > sounds like a great idea".
> >
> > To me this says "LKCD is stupid". Which means that I'm not going to apply
> > it, and I'm going to need some real reason to do so - ie being proven
> > wrong in the field.
> >
> > (And don't get me wrong - I don't mind getting proven wrong. I change my
> > opinions the way some people change underwear. And I think that's ok).
>
> It would be most unfortunate if the existance of netdump is used as a
> reason to deny LKCD's inclusion, or to simply dismiss LKCD as stupid.
>
> On Thu, 31 Oct 2002, Matt D. Robinson wrote:
>
> > We want to see this in the kernel, frankly, because it's a pain
> > in the butt keeping up with your kernel revisions and everything
> > else that goes in that changes. And I'm sure SuSE, UnitedLinux and
> > (hopefully) Red Hat don't want to spend their time having to roll
> > this stuff in each and every time you roll a new kernel.
>
> While Red Hat advocates Ingo's netdump option, we have customer
> requests that are requiring us to look at LKCD disk-based dumps as an
> alternative, co-existing dump mechanism. Since the two methods are not mutually
> exclusive, LKCD will never kill off netdump -- nor certainly vice-versa. We're
> all just looking for a better means to be able to
> provide support to our customers, not to mention its value as a
> development aid.
>
> Dave Anderson
> Red Hat, Inc.
>
>
>