[Openais] totempg reentrancy

Muni Bajpai muniba at nortel.com
Mon Jan 23 08:50:54 PST 2006


Steve,

Thanks. I agree that it is not an issue we have encountered so far,
despite the fact that we have a multithreaded app (different threads for
the EVT, CKPT and CLM drivers). The CLM driver just listens, so in
essence we have EVT and CKPT putting messages out for delivery. EVT does
that only once every 500 milliseconds, while CKPT varies with the
traffic load and can go as low as one every 10 milliseconds.

I still don't understand your statement on multiple threads, as aisexec
is single threaded, unless you meant enabling the multithreading option
in the conf, which we don't do.

Do you think there could be a use case in the above scenario that could
cause the reentrancy issue?

Thanks

Muni

-----Original Message-----
From: Steven Dake [mailto:sdake at mvista.com] 
Sent: Sunday, January 22, 2006 8:01 PM
To: Bajpai, Muni [RICH1:B670:EXCH]
Cc: scd at broked.org; Mark Haverkamp; openais at lists.osdl.org
Subject: RE: [Openais] totempg reentrancy

Muni
I didn't want to add the mutex code to the 0.70 release since it may
impact performance/stability and isn't well tested.  As is, the
condition of reentrancy cannot occur because messages are first queued,
then only delivered when the token is received.  It is not possible to
queue messages at the same time that the token is received, because
there is only one thread of execution.  I have spent at least a week on
this problem and have at least proved it to myself :)  The only place
this could occur is when the token callback is executed on received
tokens.  We only use token callbacks on sent tokens for now.  And if
this were really a problem, I would expect to have seen other
unexplained crashes or malformed messages over the last 6 months, but
totem has been stable throughout that time.
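
To make the queue-then-deliver argument concrete, here is a minimal
single-threaded sketch.  All names (pending_queue, queue_mcast,
on_token_received, demo_handler) are hypothetical and are not taken from
the totem source, and delivering only the messages that were queued when
the token arrived is an assumption of the sketch.  It only illustrates
why a handler that requests a mcast during delivery does not disturb the
delivery in progress when there is one thread of execution.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the totem source. */
#define QUEUE_MAX 16
#define MSG_MAX   64

struct pending_queue {
    char   msgs[QUEUE_MAX][MSG_MAX];
    size_t head;   /* index of the next message to deliver */
    size_t count;  /* number of messages currently queued */
};

typedef void (*deliver_fn)(const char *msg, void *ctx);

/* Queue a message for a later multicast.
 * Returns 0 on success, -1 if the queue is full or the message is too long. */
int queue_mcast(struct pending_queue *q, const char *msg)
{
    assert(q != NULL && msg != NULL);
    if (q->count == QUEUE_MAX || strlen(msg) >= MSG_MAX) {
        return -1;
    }
    size_t tail = (q->head + q->count) % QUEUE_MAX;
    strcpy(q->msgs[tail], msg);
    q->count++;
    return 0;
}

/* Called when the token arrives.  Delivers only the messages that were
 * queued at entry; anything a handler queues during delivery waits for
 * the next token (an assumption of this sketch).
 * Returns the number of messages delivered. */
size_t on_token_received(struct pending_queue *q, deliver_fn deliver, void *ctx)
{
    size_t n = q->count;  /* snapshot taken before any handler runs */
    for (size_t i = 0; i < n; i++) {
        char copy[MSG_MAX];
        strcpy(copy, q->msgs[q->head]);
        q->head = (q->head + 1) % QUEUE_MAX;
        q->count--;
        deliver(copy, ctx);  /* the handler may call queue_mcast here */
    }
    return n;
}

struct demo_ctx {
    struct pending_queue *q;
    int delivered;
};

/* Example handler: counts deliveries and, on the first one only,
 * queues a follow-up message from inside the delivery path. */
void demo_handler(const char *msg, void *ctx)
{
    struct demo_ctx *d = ctx;
    (void)msg;
    if (d->delivered++ == 0) {
        queue_mcast(d->q, "follow-up");
    }
}
```

With two messages queued, the first call to on_token_received delivers
both and leaves the follow-up queued for the next token.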

So I think this is a non-issue for you, but for people using
totem_pg.a/so in a multithreaded program trying to mcast from multiple
threads, they are going to have problems.  This is a feature I'd rather
support for wilson.
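
For the multithreaded case, the general shape of a mutex-based fix might
look like the sketch below.  This is not the trunk change for 1045:
unsafe_mcast, frag_buffer, frag_len, mcast_calls and locked_mcast are
hypothetical stand-ins for a multicast routine that keeps static state,
and the sketch simply serializes callers with a POSIX mutex.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a multicast routine that is not thread safe
 * because it keeps its working state in static data. */
static char          frag_buffer[256];
static size_t        frag_len;
static unsigned long mcast_calls;

static int unsafe_mcast(const char *data, size_t len)
{
    if (len > sizeof(frag_buffer)) {
        return -1;
    }
    memcpy(frag_buffer, data, len);  /* shared static state */
    frag_len = len;
    mcast_calls++;
    return 0;
}

static pthread_mutex_t mcast_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize callers so only one thread at a time touches the static
 * state.  This protects against concurrent calls from different
 * threads; it does not make the routine reentrant from the same thread,
 * since relocking a default mutex from its owner is not safe. */
int locked_mcast(const char *data, size_t len)
{
    pthread_mutex_lock(&mcast_lock);
    int res = unsafe_mcast(data, len);
    pthread_mutex_unlock(&mcast_lock);
    return res;
}
```

The cost is one lock and unlock per mcast on every call, including in
single-threaded use, which is the kind of performance impact mentioned
above.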

But if you need this enhancement in Picacho, I can consider it in a few
weeks, after further testing of the trunk change for 1045.  My general
approach for Picacho has been to avoid enhancements and only port
bugfixes (with the makefile install being the only outstanding
exception).

Regards
-steve

On Sun, 2006-01-22 at 19:44 -0600, Muni Bajpai wrote:
> Hey Steve,
> 
> Did you port the fix for defect 1045 back to picacho? I don't know if
> that was the intention, but I couldn't see the attached changes in
> picacho.
> 
> Thanks
> 
> Muni
> 
> -----Original Message-----
> From: openais-bounces at lists.osdl.org
> [mailto:openais-bounces at lists.osdl.org] On Behalf Of Steven Dake
> Sent: Friday, January 20, 2006 7:30 PM
> To: Mark Haverkamp
> Cc: openais at lists.osdl.org
> Subject: Re: [Openais] totempg reentrancy
> 
> 
> On Fri, 2006-01-20 at 16:47, Mark Haverkamp wrote:
> > On Fri, 2006-01-20 at 15:40 -0800, Mark Haverkamp wrote:
> > > On Fri, 2006-01-20 at 16:28 -0700, Steven Dake wrote:
> > > > On Fri, 2006-01-20 at 15:18 -0800, Mark Haverkamp wrote:
> > > > > On Fri, 2006-01-20 at 15:21 -0700, Steven Dake wrote:
> > > > > > > I found during debugging AMF some strange behavior in the
> > > > > > > totempg layer.  I tracked it down to the fact that
> > > > > > > totempg_mcast (or msg_mcast) is not reentrant, meaning it is
> > > > > > > not possible to call a mcast from a message handler that was
> > > > > > > delivered a message.
> > > > > > > 
> > > > > > > This happens within the AMF quite often, and may also happen
> > > > > > > within the CKPT and EVT resynchronization.  Muni do you know
> > > > > > > for sure it happens in ckpt resync?
> > > > > > 
> > > > > > I think this is something we will have to fix before we 
> > > > > > finally release 0.70.1.
> > > > > > 
> > > > > > I have attached a patch which fixes the problem for trunk.  
> > > > > > Could we get some review then I'll work up something for 
> > > > > > picacho?
> > > > > > 
> > > > > > I have thought through this patch and it appears to solve 
> > > > > > multiple levels of reentrancy as well, but I could use more 
> > > > > > eyes and brains to think about the problem.
> > > > > 
> > > > > How can the code get here and this be true?
> > > > > 
> > > > > 
> > > > > if (reentrant_call == 1) {
> > > > > 	goto start_over_reentrant;
> > > > > }
> > > > > 
> > > > > > It looks like if reentrant_call is 1 on entry, it goes to
> > > > > > reentrant_mcast: and reentrant_call is set to zero. Otherwise,
> > > > > > if reentrant_call is set to one before totemmrp_mcast, it is
> > > > > > set back to zero just after the call.
> > > > > 
> > > > > 
> > > > > Put a printf in it and see if it's executed :)
> > > > > 
> > > > > Yes, it took me 3 days to figure out exactly what was happening;
> > > > > it's pretty complicated.
> > > > 
> > > > Basically the way it happens is this:
> > > > 
> > > > > msg_mcast is called by one of the service handlers for some
> > > > > request, maybe from a library.  That service handler then queues
> > > > > a message.  The message is then delivered.  When that message is
> > > > > delivered, the delivery handler requests a message to be mcast
> > > > > while the msg_mcast is still processing a previous request.
> > > > 
> > > > > The problem is, we are already within the mcast routine (which
> > > > > is then in the msg handler, which then calls the mcast routine),
> > > > > which screws up all of the fragmentation buffer and other static
> > > > > data that is necessary to track the state of the totempg.
> > > > 
> > > > > So this patch first "finishes the job" on that last message and
> > > > > then starts over on the new message requested.  It also seems to
> > > > > now pass testing.
> > > > 
> > > > > For an interesting test to prove that we are indeed reentrant,
> > > > > put a printf right after totemmrp_mcast and run amf.  Sometimes it
> > > > > will not be printed, because amf will on delivery of a message
> > > > > recall the function.
> > > 
> > > > I think that you are saying that the call to totemmrp_mcast can
> > > > cause mcast_msg to get called again.  If that is true, mcast_msg
> > > > will see reentrant_call == 1 at the start and goto
> > > > reentrant_mcast: which sets reentrant_call = 0.  I still don't see
> > > > how we can get to line 801 with reentrant_call == 1.
> > 
> > Is there a chance that
> > 
> > 		res = totemmrp_mcast (iovecs, 3, guarantee);
> > reentrant_mcast:
> > 		reentrant_call = 0;
> > 
> > should be
> > 
> > 		res = totemmrp_mcast (iovecs, 3, guarantee);
> > 		reentrant_call = 0;
> > reentrant_mcast:
> > 
> 
> Mark, I got to the bottom of this by putting an assert in the
> reentrancy bit of code.  This allowed me to stop in a debugger and view
> the various threads.  I found there were multiple threads within
> msg_mcast at the same time.  Since totempg isn't thread safe, this is
> what was causing the problem.
> 
> I also thought more about how totem works and don't believe it is
> possible for msg_mcast to be reentered by the totem code.
> 
> This patch should fix up the problem for the short term, where longer
> term we may want to think of multi-threading and the totempg library.
> 
> Regards
> -steve
> 
> 
> 
> 
> 
> 
> > > _______________________________________________
> > > Openais mailing list
> > > Openais at lists.osdl.org 
> > > https://lists.osdl.org/mailman/listinfo/openais
> 





