[Openais] RE: FW: Evt Deadlock

Muni Bajpai muniba at nortel.com
Tue Jan 24 14:21:57 PST 2006


OK, I see the extra saHandleInstancePut, but I still don't see how that could
cause a lock. As I asked Mark earlier, maybe the new changes from January
will take care of the issue.

Thanks

Muni 

-----Original Message-----
From: Steven Dake [mailto:sdake at mvista.com] 
Sent: Tuesday, January 24, 2006 3:30 PM
To: Bajpai, Muni [RICH1:B670:EXCH]
Cc: scd at broked.org; openais at lists.osdl.org; Smith, Kristen [RICH1:B670:EXCH]
Subject: RE: FW: Evt Deadlock

On Tue, 2006-01-24 at 14:34 -0600, Muni Bajpai wrote:
> Steve, defect 1029 was merged Jan 20th. This issue is from a December
> 21st picacho view, i.e. 0.70.
> 
Yes, 1029 may have fixed this problem.

> I agree with the gdb technique, but unfortunately this time the tester
> was in too much of a hurry and just SIGKILLed it.
> 
> In the 0.70 view I don't see how saHandleDestroy can leave a mutex locked.
> 
Here is the code:

    if (check != handleDatabase->handles[handle].check) {
        error = SA_AIS_ERR_BAD_HANDLE;
        goto error_exit;    /* <-- here is the error condition */
    }

    handleDatabase->handles[handle].state = SA_HANDLE_STATE_PENDINGREMOVAL;

error_exit:
    pthread_mutex_unlock (&handleDatabase->mutex);

    saHandleInstancePut (handleDatabase, inHandle);

On the error path saHandleInstancePut is still called, so the extra handle
instance put operates on an invalid data area and could cause the problem.
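A minimal sketch of the error path without the extra put, under stated assumptions: this is a hypothetical, simplified handle database (the names handle_db, handle_put, handle_destroy, HANDLE_* and DB_* are invented for illustration and are NOT the openais util.c API). It also adds a bounds check and a state check that the quoted code does not have, and models the reference count only roughly. Whether the extra put actually explains the hang is exactly what this thread leaves open.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, simplified handle database; these names are NOT the
 * openais util.c API, only a model of the logic discussed above. */
enum handle_state { HANDLE_EMPTY, HANDLE_ACTIVE, HANDLE_PENDINGREMOVAL };
enum db_error { DB_OK, DB_ERR_BAD_HANDLE };

struct handle_entry {
    enum handle_state state;
    uint32_t check;      /* must match the upper 32 bits of the handle */
    int ref_count;       /* creation reference plus outstanding gets */
};

struct handle_db {
    pthread_mutex_t mutex;
    uint32_t count;      /* number of slots in handles[] */
    struct handle_entry *handles;
};

/* Drop one reference; the slot is released when the count reaches zero.
 * Callers must only pass handles that were already validated. */
static void handle_put(struct handle_db *db, uint64_t in_handle)
{
    uint32_t handle = (uint32_t)(in_handle & 0xffffffffu);

    pthread_mutex_lock(&db->mutex);
    assert(handle < db->count && db->handles[handle].ref_count > 0);
    if (--db->handles[handle].ref_count == 0)
        db->handles[handle].state = HANDLE_EMPTY;
    pthread_mutex_unlock(&db->mutex);
}

/* Mark a handle for removal and drop its creation reference.  Unlike the
 * quoted 0.70 code, the error path returns without calling handle_put(). */
static enum db_error handle_destroy(struct handle_db *db, uint64_t in_handle)
{
    uint32_t check = (uint32_t)(in_handle >> 32);
    uint32_t handle = (uint32_t)(in_handle & 0xffffffffu);

    pthread_mutex_lock(&db->mutex);
    if (handle >= db->count ||
        db->handles[handle].state != HANDLE_ACTIVE ||
        db->handles[handle].check != check) {
        pthread_mutex_unlock(&db->mutex);
        return DB_ERR_BAD_HANDLE;   /* no reference count is touched */
    }
    db->handles[handle].state = HANDLE_PENDINGREMOVAL;
    pthread_mutex_unlock(&db->mutex);

    handle_put(db, in_handle);      /* reached only for a valid handle */
    return DB_OK;
}
```

With this shape a stale or forged handle costs only a DB_ERR_BAD_HANDLE return; the mutex is released on both paths and no other slot's reference count is decremented.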



> I doubt this is reproducible on demand, so this info is all we have for
> now.
> 
> We will try to get this reproduced, though.
> 
> Thanks
> 
> Muni
> 
> SaErrorT
> saHandleDestroy (
>     struct saHandleDatabase *handleDatabase,
>     SaUint64T inHandle)
> {
>     SaAisErrorT error = SA_AIS_OK;
>     uint32_t check = inHandle >> 32;
>     uint32_t handle = inHandle & 0xffffffff;
> 
>     pthread_mutex_lock (&handleDatabase->mutex);
> 
>     if (check != handleDatabase->handles[handle].check) {
>         error = SA_AIS_ERR_BAD_HANDLE;
>         goto error_exit;
>     }
> 
>     handleDatabase->handles[handle].state = SA_HANDLE_STATE_PENDINGREMOVAL;
> 
> error_exit:
>     pthread_mutex_unlock (&handleDatabase->mutex);
> 
>     saHandleInstancePut (handleDatabase, inHandle);
> 
>     return (error);
> }
> 
> -----Original Message-----
> From: Steven Dake [mailto:sdake at mvista.com] 
> Sent: Tuesday, January 24, 2006 1:27 PM
> To: Bajpai, Muni [RICH1:B670:EXCH]
> Cc: scd at broked.org; openais at lists.osdl.org
> Subject: Re: FW: Evt Deadlock
> 
> Muni,
> 
> If this happens again, instruct your testers to send a SIGSEGV to your
> application via kill. Make sure to run ulimit -c unlimited in the shell
> that starts the application. Then you can use gdb to debug the resulting
> core and we can see on which call paths the deadlock occurs. You can use
> the "info threads" command to list the threads, "thread N" to switch
> between them (thread 1, 2, etc.), and "thread apply all bt" to get a
> backtrace of every thread. This is the technique I used to find the AMF
> crash.
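The ulimit step can also be done by the application itself at startup. A minimal sketch, assuming a POSIX system: the hypothetical helper enable_core_dumps() raises the soft RLIMIT_CORE limit to the hard limit with getrlimit()/setrlimit(). It only reaches "unlimited" if the hard limit is already RLIM_INFINITY; raising the hard limit needs privileges.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
 * limit, the in-process counterpart of "ulimit -c".  The result is only
 * "unlimited" if the hard limit is RLIM_INFINITY.  Returns 0 on success. */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* Re-read the limit to confirm the new soft value took effect. */
    struct rlimit now;
    if (getrlimit(RLIMIT_CORE, &now) != 0)
        return -1;
    assert(now.rlim_cur == rl.rlim_max);
    return 0;
}
```

The resulting core is loaded with gdb <program> <core>. On Linux, /proc/sys/kernel/core_pattern additionally controls where, or whether, the core file is written.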
> 
> This information would help us considerably in finding which locks are
> contended (or whether it is actually a mutex that is contended).
> 
> Also, defect 1029 (merged) could result in this deadlock situation if
> the check failed in saHandleDestroy. It would leave the handle database
> mutex locked in an error condition (an invalid handle was passed to
> saHandleDestroy). Later accesses to this mutex would lock up the
> multithreaded app. This would point to another problem you may be
> having in a caller of saHandleDestroy. It sure would be nice to know
> where that saHandleDestroy call failed (the call stack), as it points at
> a bug in the evt library if this is the cause of the deadlock. One rule
> we have is that handles passed to saHandleDestroy should always be
> valid.
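A hypothetical reduction of the failure mode described for defect 1029 (not the actual patch, which is not shown in this thread): destroy_leaky() returns from its error path with the database mutex still held, so every later locker would block forever; destroy_fixed() routes every return through a single unlock, as the goto error_exit in the 0.70 code quoted above does. The names db_mutex, destroy_leaky, destroy_fixed and db_is_locked are invented for illustration; db_is_locked() uses pthread_mutex_trylock() so the difference can be observed without hanging.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the handle database mutex. */
static pthread_mutex_t db_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Error path returns early: the mutex is left locked. */
static int destroy_leaky(int handle_is_valid)
{
    pthread_mutex_lock(&db_mutex);
    if (!handle_is_valid)
        return -1;                  /* BUG: db_mutex is still held */
    /* ... mark the handle for removal ... */
    pthread_mutex_unlock(&db_mutex);
    return 0;
}

/* Single exit path: every return goes through the unlock. */
static int destroy_fixed(int handle_is_valid)
{
    int error = 0;

    pthread_mutex_lock(&db_mutex);
    if (!handle_is_valid) {
        error = -1;
        goto error_exit;
    }
    /* ... mark the handle for removal ... */
error_exit:
    pthread_mutex_unlock(&db_mutex);
    return error;
}

/* Returns 1 if db_mutex is held (by any thread, including the caller). */
static int db_is_locked(void)
{
    int rc = pthread_mutex_trylock(&db_mutex);

    if (rc == 0) {
        pthread_mutex_unlock(&db_mutex);
        return 0;
    }
    assert(rc == EBUSY);
    return 1;
}
```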
> 
> If you want to help find the source of this handle destroy problem in
> 0.70.1, please apply the attached patch to your 0.70.1 and make sure to
> save your core/sources if the assert occurs.
> 
> Mark, I'd take a second look at your saHandleDestroy calls, as they may
> have some kind of problem.
> 
> Regards
> -steve
> 
> On Tue, 2006-01-24 at 08:54 -0600, Muni Bajpai wrote:
> > Steve, 
> > 
> > Posting to the group as well.
> > 
> > -----Original Message-----
> > From: Bajpai, Muni [RICH1:B670:EXCH] 
> > Sent: Monday, January 23, 2006 4:25 PM
> > To: 'Mark Haverkamp'
> > Subject: RE: Evt Deadlock
> > 
> > So we have one evt thread writing events, and then there is this thread
> > in question, which was dispatching and was then told to exit by the
> > application.
> > 
> > So it is definitely possible that a lock was held by the other thread,
> > which does the following at regular time intervals:
> > saEvtEventAllocate
> > saEvtEventAttributesSet
> > saEvtEventPublish
> > saEvtEventFree
> > 
> > I'll do some more research too
> > 
> > Thanks
> > 
> > Muni
> > 
> > 
> > -----Original Message-----
> > From: Mark Haverkamp [mailto:markh at osdl.org] 
> > Sent: Monday, January 23, 2006 4:12 PM
> > To: Bajpai, Muni [RICH1:B670:EXCH]
> > Subject: Re: Evt Deadlock
> > 
> > On Mon, 2006-01-23 at 15:37 -0600, Muni Bajpai wrote:
> > > Hey Mark,
> > > 
> > >  
> > > 
> > > One of our testers came up with this issue after running about 24
> > > hours of traffic. This is the version without your Evt fixes, which I
> > > just merged and have started testing. What I wanted to know is whether
> > > this issue is fixed by your changes. Basically, we were in shutdown
> > > mode and were trying to do an saEvtChannelClose.
> > > 
> > 
> > This particular thing wasn't addressed by my previous fixes.  
> > 
> > I don't see where something could have the event handle database locked
> > forever, since it is taken and released inside the handle functions. Do
> > you know what the other threads were doing at the time? Is it possible
> > that some other thread was killed while it held the mutex? Anyway, I'll
> > keep looking at the code and see if I can figure out how it could
> > deadlock.
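Mark's question about a thread killed while holding the mutex can be illustrated in isolation. A minimal sketch, assuming a POSIX.1-2008 system: with an ordinary mutex the lock simply stays held after its owner exits, which is a permanent deadlock for later lockers; a robust mutex (pthread_mutexattr_setrobust with PTHREAD_MUTEX_ROBUST) instead hands the next locker EOWNERDEAD so it can repair state and call pthread_mutex_consistent(). A thread returning without unlocking stands in for a killed thread, and the helper names are hypothetical. This is a standalone illustration, not a change proposed for openais.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Thread body that locks the mutex and exits without unlocking it,
 * standing in for a thread that dies while holding the lock. */
static void *die_holding_lock(void *arg)
{
    pthread_mutex_lock((pthread_mutex_t *)arg);
    return NULL;
}

/* Initialise m as an ordinary mutex, or as a robust one if robust != 0. */
static void init_mutex(pthread_mutex_t *m, int robust)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    if (robust)
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
}

/* Let another thread die holding m, then try to take m.  Returns the
 * pthread_mutex_trylock() result: EBUSY for an ordinary mutex (it stays
 * locked forever), EOWNERDEAD for a robust one, which is then repaired. */
static int lock_after_owner_died(pthread_mutex_t *m)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, die_holding_lock, m);

    assert(rc == 0);
    pthread_join(t, NULL);

    rc = pthread_mutex_trylock(m);
    if (rc == EOWNERDEAD) {
        /* Shared state guarded by m would be checked/repaired here. */
        pthread_mutex_consistent(m);
        pthread_mutex_unlock(m);
    }
    return rc;
}
```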
> > 
> > Mark.
> > 
> > 
> > 
> > > Looks like saEvtEventFree is deadlocked on:
> > > 
> > >     error = saHandleInstanceGet(&event_handle_db, eventHandle,
> > >         (void*)&edi);
> > > 
> > > #0  0xb747e2ab in saEvtEventFree (eventHandle=0) at evt.c:1378
> > > #1  0xb747f67c in chanHandleInstanceDestructor (instance=0x80bd14c) at evt.c:266
> > > #2  0xb74785c7 in saHandleInstancePut (handleDatabase=0xb74801c0, inHandle=7222815479134420992) at util.c:687
> > > #3  0xb747dbac in saEvtChannelClose (channelHandle=7222815479134420992) at evt.c:1074
> > > #4  0x08054b02 in EvtHandler::cleanupEVT (this=0x80b76ec) at EvtHandler.cpp:1013
> > > #5  0x0805ce7d in HalManager::shutdown (this=0xb3fe3bb0, reason=0x808756c "The heartbeat to the Sig has failed.") at HalManager.cpp:1062
> > > #6  0x0806c82f in SigHandler::handle_exception (this=0x80c5e40) at SigHandler.cpp:907
> > > #7  0xb754f5e6 in ACE_Select_Reactor_Notify::dispatch_notify ()
> > >    from /opt/mcp/lib/libACE.so.5.3.1
> > > #8  0xb754f6b2 in ACE_Select_Reactor_Notify::handle_input ()
> > >    from /opt/mcp/lib/libACE.so.5.3.1
> > > #9  0xb754f47e in ACE_Select_Reactor_Notify::dispatch_notifications ()
> > >    from /opt/mcp/lib/libACE.so.5.3.1
> > > #10 0xb7542b83 in ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::dispatch_notification_handlers ()
> > >    from /opt/mcp/lib/libACE.so.5.3.1
> > > #11 0xb7542a57 in ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::dispatch ()
> > >    from /opt/mcp/lib/libACE.so.5.3.1
> > > #12 0xb753fff4 in ACE_Select_Reactor_T<ACE_Select_Reactor_Token_T<ACE_Token> >::handle_events ()
> > >    from /opt/mcp/lib/libACE.so.5.3.1
> > > #13 0xb754d6e8 in ACE_Reactor::run_reactor_event_loop ()
> > >    from /opt/mcp/lib/libACE.so.5.3.1
> > > #14 0x0805978c in main (argc=3, argv=0xbfffeff4) at halMain.cpp:976
> > > Current language:  auto; currently c
> > > 
> > > 
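The 64-bit handle in frames #2 and #3 can be decoded the same way saHandleDestroy splits inHandle: check value in the upper 32 bits, index into handles[] in the lower 32. A small sketch (decode_handle is a hypothetical helper, not an openais function) applied to inHandle=7222815479134420992 gives check 1681692777 and slot 0. The decode alone does not show whether that handle was still valid when saHandleInstancePut was called.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the split done in saHandleDestroy:
 * upper 32 bits = check value, lower 32 bits = slot index. */
static void decode_handle(uint64_t in_handle, uint32_t *check, uint32_t *slot)
{
    *check = (uint32_t)(in_handle >> 32);
    *slot = (uint32_t)(in_handle & 0xffffffffu);

    /* The split is lossless: recombining gives back the original handle. */
    assert((((uint64_t)*check << 32) | *slot) == in_handle);
}
```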





