[Openais] RE: Library core dump using evt.

Steven Dake sdake at mvista.com
Mon Jan 9 14:49:31 PST 2006


On Mon, 2006-01-09 at 14:11 -0600, Kristen Smith wrote:
> Steve,
> 
> When we shutdown the following 3 items are called:
> 
>    saEvtEventUnsubscribe(EVTHANDLER->channelHandle,SUBSCRIPTION_ID);
>    saEvtChannelClose(EVTHANDLER->channelHandle);
>    saEvtFinalize(EVTHANDLER->evtHandle);
> 
> After further research, I see how a race condition could have occurred
> between different threads I have. I was not stopping one of my writer
> threads before I did the close, so I believe a close and an
> saEvtAllocate might have occurred at the same time (on 2 different
> threads) in this particular case. I have fixed my closing of the thread

The APIs, atleast in openais, are designed to be raceless in threaded
environments.  

So in 1 thread you are executing this set of functions, in another
thread you are executing EventAllocate?  Is there another thread also
executing a dispatch function call?  This should all work properly with
the current ref counting.

I will write a test program to duplicate your scenario and spam this
over and over to see if I can get a crash.

Could you describe the list of function calls you execute in the
remaining thread that crashes?

Regards
-steve


> so that it closes down before I close down EVT - this should prevent the
> race condition from showing up.
> 
> I still have the core file and can dump out any variables you might want
> to see if you are interested.
> 
> Kristen
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Monday, January 09, 2006 2:01 PM
> To: Bajpai, Muni [RICH1:B670:EXCH]
> Cc: scd at broked.org; openais at lists.osdl.org; Mark Haverkamp; Smith,
> Kristen [RICH1:B670:EXCH]
> Subject: RE: Library core dump using evt.
> 
> 
> Muni
> please explain the exact calls you use to shutdown the evt service.  Is
> it channel close followed by evtFinalize?
> 
> The description you state should still work with threads with openais
> implementation...  It is designed to properly handle this scenario.
> 
> I'll try to write a test program that duplicates this scenario.
> 
> Thanks
> -steve
> 
> On Mon, 2006-01-09 at 11:22 -0600, Muni Bajpai wrote:
> > Hey Steve,
> > 
> > We still can't reproduce the issue but our theory on the sequence of 
> > events is this
> > 
> > There is thread (EvtThread) that wakes up every 3 seconds and sends 
> > out an evt message and there is a Main thread (MainThread) that ties 
> > our application and its children threads together.
> > 
> > 1.) Main Thread instructed to shutdown and starts the process of 
> > closing down the ckpt, evt, application.
> > 2.) Timer pops and the EvtThread tries to send a message. Obviously 
> > this is not thread safe and the race condition occurs where in the 
> > middle of the EventAllocate call the HandleDatabase is shutdown by the
> 
> > MainThread.
> > 
> > We can't reproduce this but have added checks to the EvtThread to be 
> > more MainThread state aware.
> > 
> > I would still recommend the previous patch to at least not let a core 
> > dump happen in the case of erroneous handleDatabase access.
> > 
> > Thanks
> > 
> > Muni
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com]
> > Sent: Monday, January 09, 2006 2:56 AM
> > To: Bajpai, Muni [RICH1:B670:EXCH]
> > Cc: scd at broked.org; openais at lists.osdl.org; Mark Haverkamp; Smith,
> > Kristen [RICH1:B670:EXCH]
> > Subject: Re: Library core dump using evt.
> > 
> > On Sun, 2006-01-08 at 13:27 -0600, Muni Bajpai wrote:
> > > Hey Steve,Mark
> > > 
> > >  
> > > 
> > > Kristen found an issue where the library dumped core in the process 
> > > of shutting down. As of now we can't reproduce it but the issue was 
> > > in saHandleInstancePut being called from saEvtEventAllocate. Looks 
> > > like the handle was corrupted or the database closed. In either case
> > > handleDatabase->handles[handle].check caused the issue and my best
> > > guess is accessing the array after its been deallocated.
> > > 
> > >  
> > > 
> > > There is  a check in saHandleInstancePut that I added to the other 
> > > to access functions. This seems like a good error check anyways. Now
> 
> > > obviously the calling functions should check the return code which 
> > > is not the case but at least we won't have invalid memory access.
> > > 
> > 
> > Well its been awhile since we have found any sorts of bug in the lib 
> > directory.  Good work on the bug find Muni :)  Question, was the 
> > application multithreaded?  The handle array should always be valid, 
> > unless HandleInstanceDestructor has been called previously.  This only
> 
> > happens when the reference count reaches zero.
> > 
> > Another option is that the reference counting could be putting when it
> 
> > shouldn't (perhaps some error condition that is incorrectly handled in
> 
> > the library).
> > 
> > Mark could you look over the evt library code and see if you see any 
> > problem?
> > 
> > I'd like to fix the bug rather then apply the workaround you submitted
> 
> > Muni...  This way we know its fixed...
> > 
> > The first rule is that put should never be called, unless a get is 
> > called on it first.  The second rule is that get and put are matched 
> > on all calls.  The third rule is that after a create, if a get fails, 
> > a destroy should take place on the handle and an error returned.  I 
> > know its kind of onerous, but threads suck :)  If this is wrong, all 
> > kinds of things could break.
> > 
> > Could you give more information on the failure?
> > 
> > Was a multi-threaded application signaled to exit during runtime?
> > 
> > One other possibility is that the application itself is corrupting the
> 
> > handle database.  Have you tried running your apps through valgrind to
> 
> > determine any invalid memory accesses?
> > 
> > It would be interesting to know which handle database put was the 
> > result of the crash - do you know the answer to this?  (ie was it the 
> > channel database, or the evt database).
> > 
> > Finally mark I think this code is wrong:
> >     error = saHandleCreate(&event_handle_db, sizeof(*edi),
> >             eventHandle);
> >     if (error != SA_AIS_OK) {
> >         goto alloc_put2;
> >     }
> > 
> >    error = saHandleInstanceGet(&event_handle_db, *eventHandle,
> >                     (void*)&edi);
> >     if (error != SA_AIS_OK) {
> >         goto alloc_put2;
> > 
> > ^^^^^ this could looks suspect - shouldn't it be goto alloc_put3 which
> 
> > destroys the created event handle and puts the other two handle and 
> > returns an error?  At minimum this is a leak during error condition 
> > but probably not the cause of Kristen's crash...
> > 
> >     }
> > 
> > 
> > ......
> > 
> > alloc_put2:
> >     saHandleInstancePut (&evt_instance_handle_db,
> > eci->eci_instance_handle);
> > alloc_put1:
> >     saHandleInstancePut (&channel_handle_db, edi->edi_channel_handle);
> > 
> > More thinking out loud the event could be closed before the channel is
> 
> > closed.  The specs allow this, but only after all references in the 
> > library to the channel and evt handle are put.
> > 
> > Does any of this speculation sound like something that could happen in
> 
> > your application?
> > 
> > Regards
> > -steve
> > 
> > >  
> > > 
> > > Thanks
> > > 
> > >  
> > > 
> > > Muni
> > > 
> > >  
> > > 
> > > diff -uNr --exclude=svn --exclude=.svn --exclude=SCCS 
> > > --exclude=BitKeeper --exclude=ChangeSet --exclude=init 
> > > --exclude=LICENSE --exclude=Makefile --exclude=man 
> > > --exclude=README.devmap --exclude=SECURITY --exclude=TODO 
> > > --exclude=CHANGELOG --exclude=conf --exclude=loc 
> > > --exclude=Makefile.samples --exclude=QUICKSTART 
> > > --exclude=.cdtproject --exclude=.project --exclude=nortel.patch 
> > > openais/branches/picacho/lib/util.c picacho_patch/lib/util.c
> > > 
> > > --- openais/branches/picacho/lib/util.c 2006-01-06 17:23:32 -06:00
> > > 
> > > +++ picacho_patch/lib/util.c    2006-01-08 13:23:15 -06:00
> > > 
> > > @@ -608,6 +608,11 @@
> > > 
> > >         uint32_t handle = inHandle & 0xffffffff;
> > > 
> > >  
> > > 
> > >         pthread_mutex_lock (&handleDatabase->mutex);
> > > 
> > > +
> > > 
> > > +       if (handle >= (SaUint64T)handleDatabase->handleCount) {
> > > 
> > > +                       error = SA_AIS_ERR_BAD_HANDLE;
> > > 
> > > +                       goto error_exit;
> > > 
> > > +       }
> > > 
> > >  
> > > 
> > >         if (check != handleDatabase->handles[handle].check) {
> > > 
> > >                 error = SA_AIS_ERR_BAD_HANDLE;
> > > 
> > > @@ -673,6 +678,11 @@
> > > 
> > >         uint32_t handle = inHandle & 0xffffffff;
> > > 
> > >  
> > > 
> > >         pthread_mutex_lock (&handleDatabase->mutex);
> > > 
> > > +
> > > 
> > > +       if (handle >= (SaUint64T)handleDatabase->handleCount) {
> > > 
> > > +                       error = SA_AIS_ERR_BAD_HANDLE;
> > > 
> > > +                       goto error_exit;
> > > 
> > > +       }
> > > 
> > >  
> > > 
> > >         if (check != handleDatabase->handles[handle].check) {
> > > 
> > >                 error = SA_AIS_ERR_BAD_HANDLE;
> > > 
> > > 
> > 
> > 
> 
> 




More information about the Openais mailing list