[Openais] RE: Library core dump using evt.

Kristen Smith kjsmith at nortel.com
Mon Jan 9 12:11:06 PST 2006


Steve,

When we shutdown the following 3 items are called:

   saEvtEventUnsubscribe(EVTHANDLER->channelHandle,SUBSCRIPTION_ID);
   saEvtChannelClose(EVTHANDLER->channelHandle);
   saEvtFinalize(EVTHANDLER->evtHandle);

After further research, I see how a race condition could have occurred
between different threads I have. I was not stopping one of my writer
threads before I did the close, so I believe a close and an
saEvtAllocate might have occurred at the same time (on 2 different
threads) in this particular case. I have fixed my closing of the thread
so that it closes down before I close down EVT - this should prevent the
race condition from showing up.

I still have the core file and can dump out any variables you might want
to see if you are interested.

Kristen

-----Original Message-----
From: Steven Dake [mailto:sdake at mvista.com] 
Sent: Monday, January 09, 2006 2:01 PM
To: Bajpai, Muni [RICH1:B670:EXCH]
Cc: scd at broked.org; openais at lists.osdl.org; Mark Haverkamp; Smith,
Kristen [RICH1:B670:EXCH]
Subject: RE: Library core dump using evt.


Muni
please explain the exact calls you use to shutdown the evt service.  Is
it channel close followed by evtFinalize?

The description you state should still work with threads with openais
implementation...  It is designed to properly handle this scenario.

I'll try to write a test program that duplicates this scenario.

Thanks
-steve

On Mon, 2006-01-09 at 11:22 -0600, Muni Bajpai wrote:
> Hey Steve,
> 
> We still can't reproduce the issue but our theory on the sequence of 
> events is this
> 
> There is thread (EvtThread) that wakes up every 3 seconds and sends 
> out an evt message and there is a Main thread (MainThread) that ties 
> our application and its children threads together.
> 
> 1.) Main Thread instructed to shutdown and starts the process of 
> closing down the ckpt, evt, application.
> 2.) Timer pops and the EvtThread tries to send a message. Obviously 
> this is not thread safe and the race condition occurs where in the 
> middle of the EventAllocate call the HandleDatabase is shutdown by the

> MainThread.
> 
> We can't reproduce this but have added checks to the EvtThread to be 
> more MainThread state aware.
> 
> I would still recommend the previous patch to at least not let a core 
> dump happen in the case of erroneous handleDatabase access.
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com]
> Sent: Monday, January 09, 2006 2:56 AM
> To: Bajpai, Muni [RICH1:B670:EXCH]
> Cc: scd at broked.org; openais at lists.osdl.org; Mark Haverkamp; Smith,
> Kristen [RICH1:B670:EXCH]
> Subject: Re: Library core dump using evt.
> 
> On Sun, 2006-01-08 at 13:27 -0600, Muni Bajpai wrote:
> > Hey Steve,Mark
> > 
> >  
> > 
> > Kristen found an issue where the library dumped core in the process 
> > of shutting down. As of now we can't reproduce it but the issue was 
> > in saHandleInstancePut being called from saEvtEventAllocate. Looks 
> > like the handle was corrupted or the database closed. In either case
> > handleDatabase->handles[handle].check caused the issue and my best
> > guess is accessing the array after its been deallocated.
> > 
> >  
> > 
> > There is  a check in saHandleInstancePut that I added to the other 
> > to access functions. This seems like a good error check anyways. Now

> > obviously the calling functions should check the return code which 
> > is not the case but at least we won't have invalid memory access.
> > 
> 
> Well its been awhile since we have found any sorts of bug in the lib 
> directory.  Good work on the bug find Muni :)  Question, was the 
> application multithreaded?  The handle array should always be valid, 
> unless HandleInstanceDestructor has been called previously.  This only

> happens when the reference count reaches zero.
> 
> Another option is that the reference counting could be putting when it

> shouldn't (perhaps some error condition that is incorrectly handled in

> the library).
> 
> Mark could you look over the evt library code and see if you see any 
> problem?
> 
> I'd like to fix the bug rather then apply the workaround you submitted

> Muni...  This way we know its fixed...
> 
> The first rule is that put should never be called, unless a get is 
> called on it first.  The second rule is that get and put are matched 
> on all calls.  The third rule is that after a create, if a get fails, 
> a destroy should take place on the handle and an error returned.  I 
> know its kind of onerous, but threads suck :)  If this is wrong, all 
> kinds of things could break.
> 
> Could you give more information on the failure?
> 
> Was a multi-threaded application signaled to exit during runtime?
> 
> One other possibility is that the application itself is corrupting the

> handle database.  Have you tried running your apps through valgrind to

> determine any invalid memory accesses?
> 
> It would be interesting to know which handle database put was the 
> result of the crash - do you know the answer to this?  (ie was it the 
> channel database, or the evt database).
> 
> Finally mark I think this code is wrong:
>     error = saHandleCreate(&event_handle_db, sizeof(*edi),
>             eventHandle);
>     if (error != SA_AIS_OK) {
>         goto alloc_put2;
>     }
> 
>    error = saHandleInstanceGet(&event_handle_db, *eventHandle,
>                     (void*)&edi);
>     if (error != SA_AIS_OK) {
>         goto alloc_put2;
> 
> ^^^^^ this could looks suspect - shouldn't it be goto alloc_put3 which

> destroys the created event handle and puts the other two handle and 
> returns an error?  At minimum this is a leak during error condition 
> but probably not the cause of Kristen's crash...
> 
>     }
> 
> 
> ......
> 
> alloc_put2:
>     saHandleInstancePut (&evt_instance_handle_db,
> eci->eci_instance_handle);
> alloc_put1:
>     saHandleInstancePut (&channel_handle_db, edi->edi_channel_handle);
> 
> More thinking out loud the event could be closed before the channel is

> closed.  The specs allow this, but only after all references in the 
> library to the channel and evt handle are put.
> 
> Does any of this speculation sound like something that could happen in

> your application?
> 
> Regards
> -steve
> 
> >  
> > 
> > Thanks
> > 
> >  
> > 
> > Muni
> > 
> >  
> > 
> > diff -uNr --exclude=svn --exclude=.svn --exclude=SCCS 
> > --exclude=BitKeeper --exclude=ChangeSet --exclude=init 
> > --exclude=LICENSE --exclude=Makefile --exclude=man 
> > --exclude=README.devmap --exclude=SECURITY --exclude=TODO 
> > --exclude=CHANGELOG --exclude=conf --exclude=loc 
> > --exclude=Makefile.samples --exclude=QUICKSTART 
> > --exclude=.cdtproject --exclude=.project --exclude=nortel.patch 
> > openais/branches/picacho/lib/util.c picacho_patch/lib/util.c
> > 
> > --- openais/branches/picacho/lib/util.c 2006-01-06 17:23:32 -06:00
> > 
> > +++ picacho_patch/lib/util.c    2006-01-08 13:23:15 -06:00
> > 
> > @@ -608,6 +608,11 @@
> > 
> >         uint32_t handle = inHandle & 0xffffffff;
> > 
> >  
> > 
> >         pthread_mutex_lock (&handleDatabase->mutex);
> > 
> > +
> > 
> > +       if (handle >= (SaUint64T)handleDatabase->handleCount) {
> > 
> > +                       error = SA_AIS_ERR_BAD_HANDLE;
> > 
> > +                       goto error_exit;
> > 
> > +       }
> > 
> >  
> > 
> >         if (check != handleDatabase->handles[handle].check) {
> > 
> >                 error = SA_AIS_ERR_BAD_HANDLE;
> > 
> > @@ -673,6 +678,11 @@
> > 
> >         uint32_t handle = inHandle & 0xffffffff;
> > 
> >  
> > 
> >         pthread_mutex_lock (&handleDatabase->mutex);
> > 
> > +
> > 
> > +       if (handle >= (SaUint64T)handleDatabase->handleCount) {
> > 
> > +                       error = SA_AIS_ERR_BAD_HANDLE;
> > 
> > +                       goto error_exit;
> > 
> > +       }
> > 
> >  
> > 
> >         if (check != handleDatabase->handles[handle].check) {
> > 
> >                 error = SA_AIS_ERR_BAD_HANDLE;
> > 
> > 
> 
> 






More information about the Openais mailing list