[Openais] RE: FW: Evt Deadlock

Steven Dake sdake at mvista.com
Tue Jan 24 14:42:50 PST 2006


On Tue, 2006-01-24 at 16:21 -0600, Muni Bajpai wrote:
> OK, I see the extra handleInstancePut, but I still don't see how that
> could cause a lock. But as I had asked Mark earlier, maybe the new
> changes in January will take care of the issue.
> 

I can't think of a scenario where an extra InstancePut could cause a
deadlock.  Still broken :)

If you can get us some debuggable cores and stack backtraces from the
various threads, that would really help us find the source of the
deadlock if it occurs in the future.
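
For reference, the failure mode discussed further down in this thread (an
error path that returns while the handle database mutex is still held) can
be sketched as follows.  This is only an illustration under stated
assumptions: struct demo_handle_db and every demo_* function are
hypothetical names made up for the sketch, not the openais sources, and
pthread_mutex_trylock is used so the stuck mutex can be observed without
actually hanging the caller.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a handle database (not openais code). */
struct demo_handle_db {
    pthread_mutex_t mutex;
    uint32_t check;          /* check value expected for the one demo handle */
    int pending_removal;
};

/* Error path returns without unlocking: the kind of bug described in the thread. */
int demo_destroy_leaky(struct demo_handle_db *db, uint32_t check)
{
    pthread_mutex_lock(&db->mutex);
    if (check != db->check) {
        return -1;           /* BUG: the mutex is still held here */
    }
    db->pending_removal = 1;
    pthread_mutex_unlock(&db->mutex);
    return 0;
}

/* Every path, including the error path, goes through the unlock. */
int demo_destroy_fixed(struct demo_handle_db *db, uint32_t check)
{
    int error = 0;

    pthread_mutex_lock(&db->mutex);
    if (check != db->check) {
        error = -1;
        goto error_exit;
    }
    db->pending_removal = 1;

error_exit:
    pthread_mutex_unlock(&db->mutex);
    return error;
}

/* Returns 1 if the mutex is currently held, 0 if it could be taken. */
int demo_mutex_is_stuck(struct demo_handle_db *db)
{
    int rc = pthread_mutex_trylock(&db->mutex);

    if (rc == 0) {
        pthread_mutex_unlock(&db->mutex);
        return 0;
    }
    return rc == EBUSY;
}

/* Destroys a handle with a bad check value, then reports whether the mutex is stuck. */
int demo_run(int use_fixed)
{
    struct demo_handle_db db;
    int stuck;

    pthread_mutex_init(&db.mutex, NULL);
    db.check = 42u;
    db.pending_removal = 0;

    if (use_fixed) {
        demo_destroy_fixed(&db, 7u);   /* 7 != 42, so the error path is taken */
    } else {
        demo_destroy_leaky(&db, 7u);
    }
    stuck = demo_mutex_is_stuck(&db);

    if (stuck) {
        /* The leaky path left this thread holding the lock; release it. */
        pthread_mutex_unlock(&db.mutex);
    }
    pthread_mutex_destroy(&db.mutex);
    return stuck;
}
```

Calling demo_run(0) exercises the leaky variant and demo_run(1) the fixed
one.  In a real multithreaded application the next pthread_mutex_lock on
the stuck mutex would block forever, which is why backtraces from all
threads are needed to see who is waiting on what.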

Regards
-steve

> Thanks
> 
> Muni 
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Tuesday, January 24, 2006 3:30 PM
> To: Bajpai, Muni [RICH1:B670:EXCH]
> Cc: scd at broked.org; openais at lists.osdl.org; Smith, Kristen
> [RICH1:B670:EXCH]
> Subject: RE: FW: Evt Deadlock
> 
> On Tue, 2006-01-24 at 14:34 -0600, Muni Bajpai wrote:
> > Steve, defect 1029 was merged Jan 20th. This issue is from a December
> > 21st picacho view, i.e. 0.70.
> > 
> Yes 1029 may have fixed this problem.
> 
> > I agree with the gdb technique but unfortunately this time the tester
> > was in too much of a hurry and just sigkilled it.
> > 
> > In the 0.70 view I don't see how HandleDestroy can leave a mutex
> > locked.
> > 
> Here is the code:
>     if (check != handleDatabase->handles[handle].check) {
>         error = SA_AIS_ERR_BAD_HANDLE;
>         goto error_exit;    /* <-- here is the error condition */
>     }
> 
>     handleDatabase->handles[handle].state = SA_HANDLE_STATE_PENDINGREMOVAL;
> 
> error_exit:
>     pthread_mutex_unlock (&handleDatabase->mutex);
> 
>     saHandleInstancePut (handleDatabase, inHandle);
> 
> On that error path, the extra saHandleInstancePut operates on an invalid
> handle, and that could cause the problem.
> 
> 
> 
> > I doubt this is reproducible on demand, so this info is all we have for
> > now.
> > 
> > Will try to get this reproduced though
> > 
> > Thanks
> > 
> > Muni
> > 
> > SaErrorT
> > saHandleDestroy (
> >     struct saHandleDatabase *handleDatabase,
> >     SaUint64T inHandle)
> > {
> >     SaAisErrorT error = SA_AIS_OK;
> >     uint32_t check = inHandle >> 32;
> >     uint32_t handle = inHandle & 0xffffffff;
> > 
> >     pthread_mutex_lock (&handleDatabase->mutex);
> > 
> >     if (check != handleDatabase->handles[handle].check) {
> >         error = SA_AIS_ERR_BAD_HANDLE;
> >         goto error_exit;
> >     }
> > 
> >     handleDatabase->handles[handle].state =
> > SA_HANDLE_STATE_PENDINGREMOVAL;
> > 
> > error_exit:
> >     pthread_mutex_unlock (&handleDatabase->mutex);
> > 
> >     saHandleInstancePut (handleDatabase, inHandle);
> > 
> >     return (error);
> > }
> > 
> > -----Original Message-----
> > From: Steven Dake [mailto:sdake at mvista.com] 
> > Sent: Tuesday, January 24, 2006 1:27 PM
> > To: Bajpai, Muni [RICH1:B670:EXCH]
> > Cc: scd at broked.org; openais at lists.osdl.org
> > Subject: Re: FW: Evt Deadlock
> > 
> > Muni,
> > 
> > If this happens again, instruct your testers to send a SIGSEGV to your
> > application via kill.  Make sure to ulimit -c unlimited.  Then you can
> > use gdb to debug the core created and we can see what call paths the
> > deadlock occurs upon.  You can use "info threads" to list the threads
> > and the "thread" command to switch between them (thread 1, thread 2,
> > etc.).  This is the technique I used to find the AMF crash.
> > 
> > This information would help us considerably in finding which locks are
> > contended (or whether it is actually a mutex that is contended).
> > 
> > Also defect 1029 (merged) could result in this deadlock situation if
> > the check failed in the handle destroy.  It would leave the handle
> > database mutex locked in an error condition (an invalid handle was
> > passed to saHandleDestroy).  Later accesses to this mutex would lock up
> > the multithreaded app.  This would point to another problem you may be
> > having in a caller to saHandleDestroy.  It sure would be nice to know
> > where that HandleDestroy call failed (the call stack), as it points at
> > a bug in the evt library if this is the cause of the deadlock.  One
> > rule we have is that handles passed to saHandleDestroy should always
> > be valid.
> > 
> > If you want to help find the source of this handle destroy problem in
> > 0.70.1 please apply the attached patch to your 0.70.1 and make sure to
> > save your core/sources if the assert occurs.
> > 
> > Mark, I'd take a second look at your saHandleDestroy calls as they may
> > have some kind of problem.
> > 
> > Regards
> > -steve
> > 
> > On Tue, 2006-01-24 at 08:54 -0600, Muni Bajpai wrote:
> > > Steve, 
> > > 
> > > Posting to the group as well.
> > > 
> > > -----Original Message-----
> > > From: Bajpai, Muni [RICH1:B670:EXCH] 
> > > Sent: Monday, January 23, 2006 4:25 PM
> > > To: 'Mark Haverkamp'
> > > Subject: RE: Evt Deadlock
> > > 
> > > So we have one evt thread writing events, and then there is this
> > > thread in question, which was dispatching and then was told to exit
> > > by the application.
> > > 
> > > So it is definitely possible that a lock was held by the other
> > > thread, which at regular time intervals calls:
> > > saEvtEventAllocate
> > > saEvtEventAttributesSet
> > > saEvtEventPublish
> > > saEvtEventFree
> > > 
> > > I'll do some more research too
> > > 
> > > Thanks
> > > 
> > > Muni
> > > 
> > > 
> > > -----Original Message-----
> > > From: Mark Haverkamp [mailto:markh at osdl.org] 
> > > Sent: Monday, January 23, 2006 4:12 PM
> > > To: Bajpai, Muni [RICH1:B670:EXCH]
> > > Subject: Re: Evt Deadlock
> > > 
> > > On Mon, 2006-01-23 at 15:37 -0600, Muni Bajpai wrote:
> > > > Hey Mark,
> > > > 
> > > >  
> > > > 
> > > > One of our testers came up with this issue after running about 24
> > > > hours of traffic. This is the version without your Evt fixes, which
> > > > I just merged and have started testing. What I wanted to know is
> > > > whether this issue is fixed by your changes. Basically we were in
> > > > shutdown mode and were trying to do an saEvtChannelClose.
> > > > 
> > > 
> > > This particular thing wasn't addressed by my previous fixes.  
> > > 
> > > I don't see where something could have the event handle database
> > > locked forever since it is taken and released inside the handle
> > > functions.  Do you know what the other threads were doing at the
> > > time?  Is it possible that some other thread was killed while it held
> > > the mutex?  Anyway, I'll keep looking at the code and see if I can
> > > figure out how it could deadlock.
> > > 
> > > Mark.
> > > 
> > > 
> > > 
> > > > Looks like saEvtEventFree is deadlocked on
> > > > 
> > > > error = saHandleInstanceGet(&event_handle_db, eventHandle,
> > > >             (void*)&edi);
> > > > 
> > > > #0  0xb747e2ab in saEvtEventFree (eventHandle=0) at evt.c:1378
> > > > #1  0xb747f67c in chanHandleInstanceDestructor (instance=0x80bd14c) at evt.c:266
> > > > #2  0xb74785c7 in saHandleInstancePut (handleDatabase=0xb74801c0, inHandle=7222815479134420992) at util.c:687
> > > > #3  0xb747dbac in saEvtChannelClose (channelHandle=7222815479134420992) at evt.c:1074
> > > > #4  0x08054b02 in EvtHandler::cleanupEVT (this=0x80b76ec) at EvtHandler.cpp:1013
> > > > #5  0x0805ce7d in HalManager::shutdown (this=0xb3fe3bb0, reason=0x808756c "The heartbeat to the Sig has failed.") at HalManager.cpp:1062
> > > > #6  0x0806c82f in SigHandler::handle_exception (this=0x80c5e40) at SigHandler.cpp:907
> > > > #7  0xb754f5e6 in ACE_Select_Reactor_Notify::dispatch_notify () from /opt/mcp/lib/libACE.so.5.3.1
> > > > #8  0xb754f6b2 in ACE_Select_Reactor_Notify::handle_input () from /opt/mcp/lib/libACE.so.5.3.1
> > > > #9  0xb754f47e in ACE_Select_Reactor_Notify::dispatch_notifications () from /opt/mcp/lib/libACE.so.5.3.1
> > > > #10 0xb7542b83 in ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::dispatch_notification_handlers () from /opt/mcp/lib/libACE.so.5.3.1
> > > > #11 0xb7542a57 in ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::dispatch () from /opt/mcp/lib/libACE.so.5.3.1
> > > > #12 0xb753fff4 in ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::handle_events () from /opt/mcp/lib/libACE.so.5.3.1
> > > > #13 0xb754d6e8 in ACE_Reactor::run_reactor_event_loop () from /opt/mcp/lib/libACE.so.5.3.1
> > > > #14 0x0805978c in main (argc=3, argv=0xbfffeff4) at halMain.cpp:976
> > > > Current language:  auto; currently c
> > > > 
> > > > 
> 
> 



