[cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthrea d requirements

george anzinger george at mvista.com
Wed Feb 5 14:31:00 PST 2003

Perez-Gonzalez, Inaky wrote:
>>Given that one can put together an atomic compare-and-swap in user 
>>space, I think one can enter an uncontested pi-mutex with just that, 
>>i.e. compare / modify + a test for success.  This assumes 
>>that one has 
>>a unique id fits in the given word size, plus one bit extra (in the 
>>kernel I used the "current" pointer and the least bit).  Exit of the 
>>area is the same overhead as entry.
> Yep, I also thought of this, but I reached the conclusion that it cannot be
> done - I could be wrong:
> The only ID we can allow ourselves is the TID, which does not offer ANY
> guarantee that is going to always have bits so-and-so unused so we can put
> the futex count in there - I mean, I know I won't have 4 G tasks in a
> system, but ... Ok, I assume I can << 2 it and it'd be ok.

At least for the forsee abe future the TID will be bounded in the 
kernel to something less than 24 bits, but you really only need 1.  I 
would use the sign bit to avoid the shift.  Here is the way you do it, 
assuming a compare and replace instruction.  Assume the pi_mutex is a 
structure so we are not limited to just the one word.  On entry we 
prefetch the TID (should be done when the thread first awakes, me 
thinks).  Entry is:

	if (mutex->entry_wd == 0){       // this line and the next is
		mutex->entry_wd = TID;  /// the compare and swap
                 return ;		// we now own the mutex
		call wait(mutex);

exit is:

	if (mutex->entry_wd == TID){   // this line and the next is
		mutex->entry_wd = 0;  // the compare and swap
                 return ;		// we have freed the mutex
		call wake_waiter(mutex);

So much for the uncontended stuff.

wait(mutex)  is the slow path,
	first it needs to be atomic over the mutex so it would do a lock on 
an additional word in the mutex structure.  It would then check to see 
if the mutex is now free (like it happened on the way in).  If so it 
just sets the TID in the right place and returns.

If not, the current task needs to be put in the list (headed at the 
mutex) of waiters.  This should be in priority order.  If the current 
task ends up at the head of this list the current owner i.e.
"entry_wd" needs to be boosted in priority and if he is waiting on any 
other pi_mutex ies those owners need to also be boosted (here is the 
recursion).  This means that there needs to be location unique to the 
task, but findable by other tasks, that contains a pointer to the 
mutex he is waiting for, if any.

Also, while still atomic, the sign bit (or what ever) should be set in 
the "entry_wd".  This is to get the owners exit to fail into 

wake_waiter(mutex) needs also to be atomic over the mutex.  It removes 
  the first waiter from the list.  If there is none, just free the 
mutex and exit, making sure the priority is re adjusted on the way out 
(after releasing the atomic lock).  Now how do you do that?  Well, you 
need a linked list of owned pi_mutex ies kept in priority order.  Thus 
the task removes the top one and sets its priority to match the new 
head of the list.  Note that the task does not put its owned mutex ies 
in this list, but the first contender does (keeps the work in the slow 
path).  This means that this code needs to update the new owner to 
show he owns the mutex, needs to check and see if there is more 
contention, and if so do all the right things.  When done, the 
entry_wd will contain the new owners TID with the sign bit set IF 
there is further contention (i.e. the waiting list is not empty).  If 
there is contention, the new owner will have this mutex inserted in 
its owner list (a flag in the mutex struct to so indicate might be 
useful here, but I think it is not really needed).

I am sure I missed some of the gore, but that mostly covers it.

And it all depends on the atomic compare and swap, a native 
instruction on the later x86 machines and available to some degree on 
other platforms.  In any case a rather fast kernel call to emulate 
such is always possible.
> futex = (tid << 2) | (whatever_count & 0x3)
> Now, can we count on that being truly atomic? 
> The thing is, for example, initially the lock is released [lock == 0] and we
> try to acquire it:
> atomic {
>   if (lock == 0)
>     lock = (TID << 2) | 0x1; /* we got it */
>        /* Problem 1: if getting TID 
>           means going to the kernel, we are screwed;
>           let's hope the library caches it. */
>   else {
>     futex++;
>     futex_wait (&lock);
>   }
> }
> So, this is easy; how do we put that into assembler? [pseudo c/inlinegcc/asm
> borrowed and scrambled from NPTL's lowlevellock.h)
>     lock xaddl %0, %3
>     testl      %0, %0
>     jz         uncontended
> contended:
>     leal       %3, %%ecx
>     call       ___lll_mutex_lock_wait
> uncontended:
> : "=a" (result), "=&c" (ignore1), 
>   "=&d" (ignore2), "=m" (futex)
> : "0" (1), "3" (futex)
> : "memory"
> Ok, how it works is: %3 is the futex [0 - unlocked; 1 - locked, no waiters;
> 2 - locked, one or more waiters]; the first line is the juicy one; it does,
> as you guys know:
> temp = %0 + %3
> %0 = %3
> %3 = temp
> that is:
> temp = 1 + futex
> reg[0] = futex
> futex = temp
> Now if the futex was unlocked, this will set it to one and we jump to
> unconteded, as the testl on reg[0] will tell us it was zero and now we got
> it; if it was locked, no waiters, it moves to 2 and we need to go down to
> the kernel to wait [call ___lll_...]; same thing, if it was in 2, goes to 3
> and we go to the kernel - when coming back, it will be put to 2 [so it does
> not keep going up forever]. So, we need three bits, at least [and this is
> not really true; given the right conditions, it could go up to a lot until
> we reset it to two, so this needs attention -- for now let's assume it will
> stay bounded].
> Now, how do we add the TID in there? This does not work:
> movl        (TID << 2) | 1, %ecx
> lock xaddl  %ecx, futex 
> As we will overwrite the TID of the owner of the futex if it is in states 1
> and/or 2. We cannot replace the value because the operation would not be
> atomic and somebody could jump in the middle.
> This would not help either:
> movl $1,%ecx
> lock xaddl %ecx,futex
> testl %ecx,%ecx
> jnz contention
> ; Uncontended, we got it
> xorl (TID << 2), futex
> For states 1 and 2 it works, but not for 0; in this case, from the time when
> the lock is acquired and the count goes from zero to 1, there is a window of
> three instructions (testl, jnz, xorl) until we set the TID, so that an
> incoming higher priority lock contender hitting it and raising our priority
> if need arises would find an invalid TID.
> So, here my assembler knowledge gets drained, and I think
> pseudo-I-came-up-with-it-assembler:
> movl $1, %ecx
> movl futex, %ebx
> movl (TID << 2) | 1, %edx
> lock add-or-load %ebx,%ecx,%edx
> jnz contention ; go to the kernel
> .... wait ...
> where add-or-load means [replace futex with %ebx]:
> if (futex == 0) {
>   futex = %edx;
>   ZF = 1;
> } else {
>   futex += %ecx;
>   ZF = 0;
> }
> But I guess this is asking for $1M or a date with Claudia Schiffer ... I
> don't think there is a similar instruction in any architecture ...
> So that's why I don't know how to do the identifier-embedded-in-the-word
> thing; I'd love to be proved wrong and shown a way to go.
>>I think we should think of this as instruction emulation where we go 
>>to the kernel VERY quickly (both in and out).  If we don't allow 
>>interruption, we would not need to save much context.  The major over 
>>head would be the system call instruction.
> Do you mean go to the kernel even in the uncontested case? I think this is
> what you meant, even if it is taking the fast path of it just being an
> interrupt, but I am sure the answer from anybody else would be something in
> the lines of "no way" - they want the uncontested case to stay in user
> space.
> Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
> fault]

George Anzinger   george at mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

More information about the cgl_discussion mailing list