[cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthrea d requirements

Fri Feb 14 16:35:30 PST 2003

Hi George

> -----Original Message-----
> From: george anzinger [mailto:george at mvista.com]

> 
> At least for the forsee abe future the TID will be bounded in the 
> kernel to something less than 24 bits, but you really only need 1.  I 
> would use the sign bit to avoid the shift.  Here is the way 
> you do it, 
> assuming a compare and replace instruction.  Assume the pi_mutex is a 
> structure so we are not limited to just the one word.  On entry we 
> prefetch the TID (should be done when the thread first awakes, me 
> thinks).  Entry is:
> 
> 	if (mutex->entry_wd == 0){       // this line and the next is
> 		mutex->entry_wd = TID;  /// the compare and swap
>                  return ;		// we now own the mutex
> 	}else{
> 		call wait(mutex);
>    	}
> 	return;

What worries me about this approach is the window between the CAS and when
somebody (eg: before calling the wait) sets the sign bit. If at that moment
the owner preempts and unlocks, we open a window for a low priority thread
to cause a priority inversion by claiming and acquiring the futex just then.

Then you have the complexity in the kernel side. The caller that just went
to the kernel will see that the futex changed the value since he last saw it
until he entered the kernel path - if he were to queue without checking the
value, it would deadlock until two tasks contended and one went into the
kernel. And still it is not guaranteed he would make it.

So, in that situation, it is when he sees the value changes and returns with
-EWOULDDEADLOCK, so he retries the claim. Now multiply this by many that
happen to be intermixed (it's call paths) in the same situation -- ugly.

If you set the sign bit at the same time than you do the compare-and-swap
operation [or the equivalen composite operation], then you avoid that
problem. You also need the sign bit to avoid the priority inversion, so
win-win-situation, the only looser is the implementor [in this case, me :)].

> If not, the current task needs to be put in the list (headed at the 
> mutex) of waiters.  This should be in priority order.  If the current 

To speed it up and make it as O(1) as possible, the list is made like in
your rt scheduler and in Ingo's scheduler (priority array of queues).
Drawback: it is huge, the futex_q grows up to more than 1K). With slabs and
a recycling hack, it reduces the pressure on the VM and boosts performance,
though.

> task ends up at the head of this list the current owner i.e.
> "entry_wd" needs to be boosted in priority and if he is 
> waiting on any 
> other pi_mutex ies those owners need to also be boosted (here is the 
> recursion).  This means that there needs to be location unique to the 
> task, but findable by other tasks, that contains a pointer to the 
> mutex he is waiting for, if any.
>
> [...]

Piece of cake - a list rooted at the task structure that links all the
futexes the task holds does the trick; the first guy who goes down to wait
(or the first one who tries to boost it) will register the owner with the
futex. Then walking it is easy; also need to pay attention and put hooks
into setsched() and priority setting functions ... need to check that out
deeper.

> And it all depends on the atomic compare and swap, a native 
> instruction on the later x86 machines and available to some degree on 
> other platforms.  In any case a rather fast kernel call to emulate 
> such is always possible.

RISCs seem to be ok. UP machines [i386, ARMs] can hack it in user space
using algorithm similar to that of the RISCS without the memory protection
and assuming only the CPU is touching that part of memory and that IRQs
won't touch it ... maybe there are too many assumptios, but I believe
glibc/sysdeps/arm/atomicity.h has that already.

Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
fault]