[Openais] [CLM] Crash if saClmClusterTrackStop() is not called before saClmFinalize()

Jerome Flesch jerome.flesch at netasq.com
Tue Oct 6 00:42:35 PDT 2009


On Mon, Oct 05, 2009 at 04:28:35PM -0700, Steven Dake wrote:
> I did the following:
> 
> installed openais and corosync on two nodes
> added in /etc/corosync/services.d a file called "clm" containing
> service {
> name: clm
> ver: 0
> }
> 
> ran corosync on two nodes
> ran corotests 2 on two nodes
> 
> no segfault.  May be timing related (race) but I had a look over the
> serialization code in ipc and it looks correct.
> 

I forgot to specify that if you want to use my tests to reproduce this
problem, you must disable the code between "#ifndef CRASH_CLM" and "#endif" in
test_cpg_multiplayer.c (
http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L156
). I've added the call to saClmClusterTrackStop() so I can continue testing.
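
For readers who do not want to open the repository, here is a minimal sketch of the guarded sequence. It is not the code from test_cpg_multiplayer.c. It assumes the block between "#ifndef CRASH_CLM" and "#endif" contains only the saClmClusterTrackStop() call, and it replaces the real CLM calls with hypothetical stub_* stand-ins that merely record the call order, so that it compiles without the OpenAIS headers.

```c
/* Sketch only. The stub_* functions are hypothetical stand-ins for
 * saClmInitialize(), saClmClusterTrack(), saClmClusterTrackStop() and
 * saClmFinalize(); they only record the order in which they are called. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CALLS 8

/* Names of the calls made so far, in order. */
static const char *call_log[MAX_CALLS];
static size_t call_count;

static void record(const char *name)
{
    assert(call_count < MAX_CALLS);
    call_log[call_count++] = name;
}

static void stub_clm_initialize(void) { record("initialize"); }
static void stub_clm_track(void)      { record("track"); }
static void stub_clm_track_stop(void) { record("track_stop"); }
static void stub_clm_finalize(void)   { record("finalize"); }

/* Position of a call in the log, or -1 if it was never made. */
int call_index(const char *name)
{
    for (size_t i = 0; i < call_count; i++) {
        if (strcmp(call_log[i], name) == 0)
            return (int)i;
    }
    return -1;
}

/* CLM usage at the start of the test suite. Building with -DCRASH_CLM
 * (assumed meaning of the macro) leaves out the track-stop call, which
 * is the sequence reported to crash Corosync on the tracking node. */
void clm_setup_and_teardown(void)
{
    stub_clm_initialize();
    stub_clm_track();        /* SA_TRACK_CURRENT | SA_TRACK_CHANGES in the report */
#ifndef CRASH_CLM
    stub_clm_track_stop();   /* workaround: stop tracking before finalize */
#endif
    stub_clm_finalize();
}
```

Compiled without -DCRASH_CLM, the log reads initialize, track, track_stop, finalize. With -DCRASH_CLM the track_stop entry is absent, which corresponds to the call order from the original report.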

Also, it's possibly a race condition. I'm doing most of my tests on Qemu VMs,
and since I haven't been able to make KVM work on my FreeBSD machine, they are
quite slow.

PS: I'm cc-ing back the OpenAIS mailing-list, I didn't un-cc it intentionally :/

> I ran on i386.
> 
> We need more info wrt reproduction.
> 
> Regards
> -steve
> 
> On Mon, 2009-10-05 at 15:18 -0500, Ryan O'Hara wrote:
> > I downloaded your app and compiled it. I also wrote my own test app
> > that just does saClmInitialize, saClmClusterTrack, and
> > saClmFinalize. I can't recreate the problem with my test app or your
> > test app.
> > 
> > When exactly are you killing node "b"? I think I need precise
> > instructions on what to do to recreate it. Also, I am just running
> > "corotests 2", FYI.
> > 
> > Ryan
> > 
> > PS - I am cc'ing Steve Dake.
> > 
> > 
> > 
> > On Mon, Oct 05, 2009 at 05:59:23PM +0200, Jerome Flesch wrote:
> > > On Mon, Oct 05, 2009 at 09:30:26AM -0500, Ryan O'Hara wrote:
> > > > On Mon, Oct 05, 2009 at 09:16:02AM -0500, Ryan O'Hara wrote:
> > > > > On Mon, Oct 05, 2009 at 03:49:42PM +0200, Jerome Flesch wrote:
> > > > > > Hello,
> > > > > 
> > > > > Hi, Jerome.
> > > > > 
> > > > > > I'm still stress-testing Corosync/Openais (trunk) on FreeBSD, and I've found out a tiny bug:
> > > > > > 
> > > > > > On peer A, my test program calls:
> > > > > > - saClmInitialize()
> > > > > > - saClmClusterTrack(SA_TRACK_CURRENT | SA_TRACK_CHANGES)
> > > > > > - saClmFinalize()
> > > > > > - (does various tests with CPG ..)
> > > > > > Next, when I shut down/kill Corosync on peer B, Corosync on peer A segfaults.
> > > > 
> > > > Node A segfaults, correct? See below.
> > > > 
> > > > > Can you provide the exact test program? I'd like to see all the
> > > > > details of each API call. Are you using a test program from the
> > > > > openais tree or did you write your own.
> > > > > 
> > > 
> > > I wrote my own. My goal is to test Corosync (CPG) / Openais (CLM) on a FreeBSD
> > > cluster as much as possible and to be able to compare the results with the ones
> > > from a Debian cluster as quickly as possible. To do that, I have a script
> > > dispatching Corosync, Openais, and the test program on a bunch of virtual
> > > machines (or real machines, depending on the settings), and then starting
> > > corosync and the test program.
> > > 
> > > I've created a public git repository:
> > > git clone git://github.com/jflesch/Corotests.git corotests
> > > 
> > > The code related to CLM that you are looking for is the following:
> > > http://github.com/jflesch/Corotests/blob/master/test_cpg_multiplayer.c#L134
> > > CLM is only used during the initialization of this test suite.
> > > 
> > > PS: I did this code on my work time, so legally, the copyright belongs to my
> > > company (Netasq). However, I just got the authorization to share it (and patches
> > > are welcome, of course :)
> > > 
> > > 
> > > > > > When my test program calls saClmClusterTrackStop() before saClmFinalize,
> > > > > > Corosync doesn't crash on peer B. From that and the stacktrace
> > > > > > (joined below)
> > > > 
> > > > OK. I re-read this email and I am a bit confused. Here you say it
> > > > crashed on node B. Above you said node A segfaults. Can you clarify?
> > > > 
> > > Oops, my bad. So the crash happens on node A (the one where my test program called
> > > saClmInitialize(), saClmClusterTrack() and saClmFinalize()) when I kill node B.
> > > 
> > > 
> > > 
> > > > > > I guess it tries to signal the change in the cluster to a program that is not
> > > > > > connected anymore (-> missing disconnection notification to CLM ?). I also
> > > > > > guess it means that Corosync will segfault if the client itself crashes.
> > > > > 
> > > > > I'm guessing that a callback is sent to node A. If I understand, you
> > > > > are enabling tracking on group A, correct? If CLM is anything like MSG
> > > > > service (and I think it is with respect to how tracking works),
> > > > > enabling tracking will generate callbacks on membership changes *to
> > > > > the node that enabled tracking.
> > > > 
> > > > Sorry. I was trying to write a reply while in a meeting and I forgot
> > > > to finish this thought.
> > > > 
> > > > If a callback is being sent to node A after it has already called
> > > > finalize, I believe it should be a no-op. I think it would be better
> > > > if tracking callbacks weren't sent at all if the node that enabled
> > > > tracking calls Finalize, but how CLM handles these things is an
> > > > implementation detail that I will look into.
> > > > 
> 
> 
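
The behaviour suggested at the end of the quoted thread (no tracking callbacks once the tracking client has finalized) can be illustrated with a toy model. This is not Corosync or OpenAIS code: the conn structure and every model_* function are hypothetical, and the sketch only assumes that the server keeps a per-connection tracking flag.

```c
/* Toy model, NOT Corosync/OpenAIS code: all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CONNS 4

struct conn {
    bool connected;   /* false once the client has finalized */
    bool tracking;    /* true between track and track-stop */
    int  notified;    /* membership callbacks delivered so far */
};

static struct conn conns[MAX_CONNS];

/* Client connects and enables membership tracking. */
void model_track(size_t id)
{
    assert(id < MAX_CONNS);
    conns[id].connected = true;
    conns[id].tracking = true;
}

/* Client explicitly stops tracking. */
void model_track_stop(size_t id)
{
    assert(id < MAX_CONNS);
    conns[id].tracking = false;
}

/* Finalize also drops any tracking still registered for the connection,
 * so a forgotten track-stop cannot leave a stale entry behind. */
void model_finalize(size_t id)
{
    assert(id < MAX_CONNS);
    conns[id].tracking = false;
    conns[id].connected = false;
}

/* Called when a peer leaves: notify live trackers only and return the
 * number of callbacks delivered. */
int model_membership_change(void)
{
    int delivered = 0;
    for (size_t i = 0; i < MAX_CONNS; i++) {
        if (!conns[i].connected || !conns[i].tracking)
            continue;   /* no-op for finalized or non-tracking clients */
        conns[i].notified++;
        delivered++;
    }
    return delivered;
}
```

In this model, a client that calls model_track() and then model_finalize() without model_track_stop() receives nothing on the next membership change, while a client that is still tracking is notified as usual.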


