[cgl_discussion] Proposal for the implementation of Robust Mutexes (ala Sun's POSIX extension)

Fri Mar 14 11:08:46 PST 2003

> -----Original Message-----
> From: Joe DiMartino [mailto:joe at osdl.org]
> 
> I like the idea to have the mutex remain in EOWNERDEAD state until all
> have a chance to fix it, however there is a slight snag.
> 
> ....
> 
> Here are the snags: First, there is no pthread_mutex_inconsistent_np()
> which will set the state to ENOTRECOVERABLE.  Even if there were such

That's easy to fix: we add it. This is all non-standard stuff anyway, so
we are free to add whatever we see fit. It would be interesting to propose
it to POSIX once we have an agreed-upon solution, though.

> a call, how would any of the surviving possible owners know that all
> other such owners have had a go at fixing it?  Imagine a busted mutex
> with 3 queued requests.  The first gets ownership, can't fix it and
> lets go (still EOWNERDEAD).  What does it do next - re-queue?  It most
> likely needs this mutex to complete whatever it's working on.  Whether
> it re-queues or not, the remaining two queued survivors eventually get
> their turn to fix it, and if they can't, the final one still doesn't
> know that everyone else has had a go.  So this mutex will remain forever
> in the EOWNERDEAD state.

Sure, that's a problem, but I think it is up to the application(s) to
implement a policy to work around it.

The way I would solve it at the application level would be splitting
programs into A) cannot fix consistency problems, B) can fix consistency
problems. I would make it so that only 1 program is in group B.

Whenever a program in group A finds a consistency problem, it signals
program B and retries the lock for a maximum amount of time or number of
tries, maybe waiting for a broadcast from B before each retry. When it
retries: if it gets ENOTRECOVERABLE, bang, it bails out; on EOWNERDEAD, it
waits and tries again until the timeout hits; if the mutex is back to
normal, it keeps going ...
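
A rough sketch of that group A policy in C, run against a toy model of the
mutex rather than a real one. Everything named here (model_mutex,
client_lock, the fixer_* stand-ins) is made up for illustration, and the
model assumes the proposed behaviour where unlocking an unrepaired mutex
leaves it in EOWNERDEAD:

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the proposed semantics; not a real pthread mutex. */
enum mstate { M_OK, M_OWNERDEAD, M_NOTRECOVERABLE };
struct model_mutex { enum mstate state; int locked; };

/* Lock: fails once the mutex is dead, otherwise grants ownership and
 * reports EOWNERDEAD if a previous owner died while holding it. */
int model_lock(struct model_mutex *m)
{
    if (m->state == M_NOTRECOVERABLE)
        return ENOTRECOVERABLE;
    m->locked = 1;
    return m->state == M_OWNERDEAD ? EOWNERDEAD : 0;
}

/* Unlock: the state is left alone (assumption taken from the proposal),
 * so an unrepaired mutex stays EOWNERDEAD for the next locker. */
void model_unlock(struct model_mutex *m)
{
    m->locked = 0;
}

/* Hypothetical stand-ins for "signal B, then wait for its broadcast". */
typedef void (*notify_fn)(struct model_mutex *m);
void fixer_repairs(struct model_mutex *m)  { m->state = M_OK; }
void fixer_condemns(struct model_mutex *m) { m->state = M_NOTRECOVERABLE; }
void fixer_silent(struct model_mutex *m)   { (void)m; }

enum client_result { CLIENT_LOCKED, CLIENT_BAIL_OUT, CLIENT_TIMED_OUT };

/* Group A policy: never repair, just notify the fixer and retry. */
enum client_result client_lock(struct model_mutex *m,
                               notify_fn notify_and_wait, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        int rc = model_lock(m);
        if (rc == 0)
            return CLIENT_LOCKED;       /* normal: caller holds the lock */
        if (rc == ENOTRECOVERABLE)
            return CLIENT_BAIL_OUT;     /* fixer gave up on it */
        /* EOWNERDEAD: we own it but cannot fix it, so let go and ask B. */
        model_unlock(m);
        notify_and_wait(m);
    }
    return CLIENT_TIMED_OUT;            /* still EOWNERDEAD after all tries */
}
```

With a mutex that starts out as { M_OWNERDEAD, 0 }, calling
client_lock(&m, fixer_repairs, 3) ends with the client holding the lock,
fixer_condemns makes it bail out, and fixer_silent runs into the retry
limit with the mutex still in EOWNERDEAD.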

A second solution, more in line with your suggested scenario, is that a
fixer program tries to lock, gets EOWNERDEAD, tries to fix it, fails and
passes it on. Then it retries; if it is still EOWNERDEAD, it kills it and
bails out.
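
The fixer side of this scenario could look roughly like the following,
again against a toy model instead of a real mutex. All names are
hypothetical; model_mark_dead() stands in for the proposed
pthread_mutex_inconsistent_np(), and reading "kill it" as "mark the mutex
ENOTRECOVERABLE" is an assumption:

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model of the proposed semantics; not a real pthread mutex. */
enum mstate { M_OK, M_OWNERDEAD, M_NOTRECOVERABLE };
struct model_mutex { enum mstate state; };

int model_lock(struct model_mutex *m)
{
    if (m->state == M_NOTRECOVERABLE)
        return ENOTRECOVERABLE;
    return m->state == M_OWNERDEAD ? EOWNERDEAD : 0;
}

/* Unlock never changes the state in this model (assumed from the proposal). */
void model_unlock(struct model_mutex *m) { (void)m; }

/* Stand-in for the proposed pthread_mutex_inconsistent_np(). */
void model_mark_dead(struct model_mutex *m)  { m->state = M_NOTRECOVERABLE; }
/* Stand-in for whatever call declares the protected data consistent. */
void model_mark_fixed(struct model_mutex *m) { m->state = M_OK; }

/* Hypothetical hooks: our own repair attempt, and what the other
 * queued fixers manage to do once we pass the mutex on. */
typedef bool (*repair_fn)(void);
typedef void (*others_fn)(struct model_mutex *m);
bool repair_works(void) { return true; }
bool repair_fails(void) { return false; }
void others_fix(struct model_mutex *m)  { model_mark_fixed(m); }
void others_fail(struct model_mutex *m) { (void)m; }

enum fixer_result { FIXER_CLEAN, FIXER_REPAIRED, FIXER_KILLED };

enum fixer_result fixer_lock(struct model_mutex *m,
                             repair_fn repair, others_fn others_turn)
{
    int rc = model_lock(m);
    if (rc == 0)
        return FIXER_CLEAN;             /* nothing to fix, lock is held */
    if (rc == ENOTRECOVERABLE)
        return FIXER_KILLED;            /* somebody already killed it */

    /* First pass with EOWNERDEAD: try to repair the data ourselves. */
    if (repair()) {
        model_mark_fixed(m);
        return FIXER_REPAIRED;          /* lock is held, data is sane */
    }
    model_unlock(m);                    /* pass it on, still EOWNERDEAD */
    others_turn(m);

    /* Second pass: if nobody else fixed it either, kill it and bail out. */
    rc = model_lock(m);
    if (rc == 0)
        return FIXER_CLEAN;
    if (rc == EOWNERDEAD) {
        model_mark_dead(m);
        model_unlock(m);
    }
    return FIXER_KILLED;
}
```

The point of the second pass is that the mutex does not stay in EOWNERDEAD
forever: a fixer that sees the same broken state twice is the one that
decides to give up on it.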

My point with this system is not that three guys can all try to fix it
(well, it kind of is that too). My point is that it gives you the
flexibility to implement any type of solution, be it Sun's or a more
elaborate one, without being limited.

Do you agree with this?

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own
(and my fault)