[Openais] proposal for better end to end flow control

Steven Dake sdake at redhat.com
Wed Mar 31 15:49:25 PDT 2010


On Thu, 2010-04-01 at 11:23 +1300, Tim Beale wrote:
> Hi Steve,
> 
> End-to-end flow control is something I'd really love to see. It sounds like
> your proposal won't fix all the problems we're seeing with flow control though.
> 
> A problem we've seen is kind of permanent congestion - the receiver gets a
> burst of several hundred CPG messages queued up and never really recovers. The
> sender continues sending enough CPG messages that the receiver never clears out
> its queue, but doesn't run out of memory either. The receiver's queue could
> hover in this state indefinitely. On our setup, a healthcheck mechanism detects
> the receiver has locked up (some operations are blocking due to flow control
> congestion) and eventually restarts the process.
> (As an interim workaround for this on our setup, I fudged the token backlog
> calculation to gradually force the sender to backoff, so the sender's totem
> message queue fills up and it starts getting TRY_AGAIN errors).
> 
> I was wondering whether end-to-end flow control at the CPG group level is a
> possible/feasible option that'd solve both this case and the oom one? E.g. in
> the CPG library code it sends an internal message to notify the rest of the CPG
> group whenever the flow control status for an application changes?
> 

Tim,

I tried the proposal I suggested, and it turns out that sending the
message doesn't really fix the memory allocation problem (the system
still OOMs), because the messages can be out of date with respect to the
current flow control state.  Angus and I have brainstormed this problem
for quite some time (over 6 months :), and I rediscovered a patch he
created a long time ago, though I wasn't sure I wanted to introduce it.

Essentially, his patch holds on to the token when a node's IPC queues
are congested.  I don't like this, though, because the token is used for
recovery and healthchecking, so altering its behavior is problematic.

I did take this idea as a basis for some work in one of my dev trees.
I'll give a brief rundown of how it works currently.

coroipcs keeps a count of how much memory is allocated by dispatch
messages.  If the memory allocated is greater than a maximum (currently
128 MB), coroipcs tells totem to muck with the flow control parameters in
the token, which stops regular messages from being ordered.  If the
node's memory allocation drops below a minimum (64 MB), coroipcs tells
totem to stop mucking with the flow control values.

This mechanism allows us to limit corosync to a specific memory
footprint (since coroipcs dispatch messages are one of the few
allocators of memory during normal runtime).
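For reference, here is a minimal standalone sketch of the high/low
watermark accounting described above.  The 128 MB / 64 MB constants match
the numbers mentioned, but every function and variable name below is
invented for illustration and does not correspond to the real
coroipcs/totem interfaces:

/*
 * Minimal sketch of watermark-based flow control accounting.
 * All names are illustrative only, not the actual corosync code.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

#define DISPATCH_ALLOC_MAX (128 * 1024 * 1024) /* engage flow control */
#define DISPATCH_ALLOC_MIN (64 * 1024 * 1024)  /* disengage flow control */

static size_t dispatch_bytes;        /* memory held by queued dispatch msgs */
static bool flow_control_engaged;

/* stand-ins for telling totem to alter the token's flow control fields */
static void totem_flow_control_engage(void)    { printf("engage\n"); }
static void totem_flow_control_disengage(void) { printf("disengage\n"); }

static void dispatch_msg_queued(size_t bytes)
{
	dispatch_bytes += bytes;
	if (!flow_control_engaged && dispatch_bytes > DISPATCH_ALLOC_MAX) {
		flow_control_engaged = true;
		totem_flow_control_engage();
	}
}

static void dispatch_msg_delivered(size_t bytes)
{
	dispatch_bytes -= bytes;
	if (flow_control_engaged && dispatch_bytes < DISPATCH_ALLOC_MIN) {
		flow_control_engaged = false;
		totem_flow_control_disengage();
	}
}

int main(void)
{
	/* simulate a slow consumer: queue 130 MB, then drain 70 MB */
	for (int i = 0; i < 130; i++)
		dispatch_msg_queued(1024 * 1024);
	for (int i = 0; i < 70; i++)
		dispatch_msg_delivered(1024 * 1024);
	return 0;
}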

To correct for applications that block for too long (usually because
they have failed in some kind of deadlock), I am planning to have each
IPC connection register a timeout value, after which its connection will
be terminated (the application gets back ERR_LIBRARY).
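
To make that concrete, here is a small standalone sketch of one way the
per-connection timeout could work.  The struct fields and helper names
are invented for illustration; the actual coroipcs code will look
different:

/*
 * Sketch of a per-connection dispatch timeout.  Illustrative names only.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

struct ipc_conn {
	const char *name;
	uint64_t dispatch_timeout_ns;   /* registered by the connection */
	uint64_t blocked_since_ns;      /* 0 while not congested */
};

static uint64_t now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* stand-in for dropping the connection; the library would see ERR_LIBRARY */
static void ipc_conn_terminate(struct ipc_conn *c)
{
	printf("terminating stuck connection %s\n", c->name);
}

/* called periodically while the connection's dispatch queue is congested */
static void ipc_conn_congested_poll(struct ipc_conn *c)
{
	uint64_t now = now_ns();

	if (c->blocked_since_ns == 0) {
		c->blocked_since_ns = now;
		return;
	}
	if (now - c->blocked_since_ns > c->dispatch_timeout_ns)
		ipc_conn_terminate(c);
}

/* called when the connection drains its queue again */
static void ipc_conn_uncongested(struct ipc_conn *c)
{
	c->blocked_since_ns = 0;
}

int main(void)
{
	struct ipc_conn c = { "example", 1000000ULL /* 1 ms */, 0 };

	ipc_conn_congested_poll(&c);   /* starts the clock */
	ipc_conn_congested_poll(&c);   /* still within the timeout */
	/* a later poll after the timeout elapses would terminate it */
	ipc_conn_uncongested(&c);
	return 0;
}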

Could you send me your backlog backoff calculation code (or, preferably,
a tarball of your source tree)?  I'd like to see what you have.

Also, I found some kind of lockup bug in IPC dispatch under really heavy
load; it is fixed in my current rework.

I would send you a patch, but it's against a really old version and I
haven't rebased the work against current trunk yet.

Regards
-steve


> Regards,
> Tim


