[Openais] Re: ipc rewrite

Steven Dake sdake at redhat.com
Wed Apr 26 11:13:11 PDT 2006


On Wed, 2006-04-26 at 10:48 -0700, Mark Haverkamp wrote:
> On Tue, 2006-04-25 at 16:16 -0700, Steven Dake wrote:
> > The events dropped might be because of priority inversion of the
> > subscription and publish tests.  They should be set to sched-rr:1.  Look
> > at evsbench.  Eventually this will be resolved in a later patch so that
> > priorities are automatically determined.  Let me know what tests you are
> > running to get the "lockup" and I'll see what is wrong with the ipc.
> > 
> > evsbench seems to work properly which is the only way I tested this..
> > 
> > What was the test case for the double free?
> > 
> > With the new code, it will be difficult to run aisexec within gdb
> > because the ipc code will often call pthread_kill to interrupt the poll
> > when the outbound kernel queue is full (this interrupts gdb too sigh).
> > I'd recommend ulimit -c unlimited to create core files and then use
> > gdb ./aisexec corefile
> > 
> > you can use thread 1, thread 2, etc to get to different threads and get
> > backtraces.
> > 
> > I realize this adds extra complication for the developers but it should
> > pay off in the end.
> 
> 
> I think I have a clue as to what is going on.  I added some debug in the
> areas where events were queued for delivery and when they were requested
> by the application.  It seems that somehow my event count variable is
> getting out of sync with how many events are on the queue.  I see from
> the stack trace that clone is called and the delivery function is called
> by another thread.  I am guessing (since I don't have any mutexes in the
> event code) that there are races now in the various event processing and
> delivery functions that can cause inconsistencies in my data structures.
> Does this sound reasonable?
> 
> Mark.
> 
> 

While this is possible, I have tried to avoid any need for mutex
protection in the services themselves.  If a service does need its own
mutex protection at this point, that is a bug.

This is done in the following ways (still to be checked for errors).  All
incoming I/O requests are serialized via pthread_mutex_lock in
prioritized_poll_thread.  It is possible for conn_info_outq_flush to be
called at the same time as the other (totem) thread is delivering a
message to the library caller via openais_conn_send_response.
Therefore, these two functions take a mutex so that they cannot access
the conn_info structure simultaneously.  It is also possible for
msg_mcast and the callback token to enter the totem code at the same
time, so that path is protected by a mutex as well.
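
A minimal sketch of that kind of protection, not the actual openais
code: two threads share one per-connection structure, one queueing
responses and one flushing them, and both take the same mutex.  The
names conn_sketch, deliver_sketch, flush_sketch and conn_sketch_run are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <pthread.h>

#define SKETCH_ROUNDS 100000

/* Hypothetical stand-in for per-connection state; not the openais conn_info. */
struct conn_sketch {
    pthread_mutex_t mutex;
    long queued;   /* responses currently waiting in the outbound queue */
    long sent;     /* total responses ever queued */
    long flushed;  /* total responses ever flushed */
};

/* Plays the role of the thread that delivers responses to the library. */
static void *deliver_sketch(void *arg)
{
    struct conn_sketch *conn = arg;

    for (int i = 0; i < SKETCH_ROUNDS; i++) {
        pthread_mutex_lock(&conn->mutex);
        conn->queued++;
        conn->sent++;
        pthread_mutex_unlock(&conn->mutex);
    }
    return NULL;
}

/* Plays the role of the thread that flushes the outbound queue. */
static void *flush_sketch(void *arg)
{
    struct conn_sketch *conn = arg;

    for (int i = 0; i < SKETCH_ROUNDS; i++) {
        pthread_mutex_lock(&conn->mutex);
        if (conn->queued > 0) {
            conn->queued--;
            conn->flushed++;
        }
        pthread_mutex_unlock(&conn->mutex);
    }
    return NULL;
}

/* Runs both threads; returns 1 if the counters are still consistent. */
int conn_sketch_run(void)
{
    struct conn_sketch conn = { .queued = 0, .sent = 0, .flushed = 0 };
    pthread_t deliver, flush;
    int rc;

    rc = pthread_mutex_init(&conn.mutex, NULL);
    assert(rc == 0);
    rc = pthread_create(&deliver, NULL, deliver_sketch, &conn);
    assert(rc == 0);
    rc = pthread_create(&flush, NULL, flush_sketch, &conn);
    assert(rc == 0);

    pthread_join(deliver, NULL);
    pthread_join(flush, NULL);
    pthread_mutex_destroy(&conn.mutex);

    return conn.sent == SKETCH_ROUNDS &&
           conn.queued == conn.sent - conn.flushed;
}
```

Without the lock and unlock calls the two counters can drift apart, which
is the same symptom as an event count that no longer matches the queue.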

One possibility: do you use timers?

It might be possible for the timer code to execute something in the main
thread while the service handler is being executed in the poll thread.
Does this match the stack backtrace you might have seen?

Can you tell me the functions you are looking at or send me your debug
printfs so I can take a look?
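
One way to gather that information is a debug check that records which
thread first touched the event data and complains when a different
thread shows up.  This is only a sketch under assumptions:
evt_debug_sketch, evt_debug_check and evt_debug_demo are hypothetical
names, not part of the event service, and the tracker itself is
unsynchronized, so it is a diagnostic aid rather than a fix.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical debug tracker; not part of the openais event service. */
struct evt_debug_sketch {
    int have_owner;     /* set once the first caller has been recorded */
    pthread_t owner;    /* thread that touched the event data first */
    int foreign_calls;  /* calls seen from any other thread */
};

/* Returns 1 (and logs) when called from a thread other than the first caller. */
int evt_debug_check(struct evt_debug_sketch *dbg, const char *where)
{
    pthread_t self = pthread_self();

    if (!dbg->have_owner) {
        dbg->owner = self;
        dbg->have_owner = 1;
        return 0;
    }
    if (!pthread_equal(dbg->owner, self)) {
        dbg->foreign_calls++;
        fprintf(stderr, "%s: event data touched from a second thread\n", where);
        return 1;
    }
    return 0;
}

/* Stands in for code, such as a timer callback, that runs in another thread. */
static void *other_thread_sketch(void *arg)
{
    evt_debug_check(arg, "other_thread_sketch");
    return NULL;
}

/* Calls the check from two threads in turn; returns the foreign call count. */
int evt_debug_demo(void)
{
    struct evt_debug_sketch dbg = { 0 };
    pthread_t other;
    int rc;

    evt_debug_check(&dbg, "evt_debug_demo");  /* first caller becomes owner */
    rc = pthread_create(&other, NULL, other_thread_sketch, &dbg);
    assert(rc == 0);
    pthread_join(other, NULL);

    return dbg.foreign_calls;
}
```

Dropping a call like evt_debug_check(&dbg, __func__) into the queueing
and delivery functions would show whether they really run in more than
one thread.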

Regards
-steve






