[cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthrea d requirements

george anzinger george at mvista.com
Sun Feb 16 00:06:38 PST 2003

Perez-Gonzalez, Inaky wrote:
>>Your suggestion that a low priority thread could get in is not really 
>>relevant.  This could happen if the thread were a few instruction 
>>earlier also.  No problem, just do the pi-mutex boost and carry on.
> This does not work.
> For starters, it is a plain priority inversion. There was 
> somebody waiting with higher priority and the lock was 
> given to a lower priority thread. That is a bug.
> Also, it can potentially create a deadlock: thread A acquires,
> threads B, C and D are waiting and thread A sets to UNLOCK before
> calling the kernel_wake() [for the sake of simplicity, let's just
> say we only wake up one waiter, but it could happen with two or
> more]. 

I think you misunderstand the CAS instruction.  If there are waiters, 
the compare will fail and the swap will not be done.  The only hole is 
if there are no waiters and a contender (which will soon be a waiter) 
fails in the CAS.  The owner still has the mutex, if he releases AND 
another contender comes along there is a race for who will get it 
between the one who failed the CAS and the new contender which may 
take the lock without involving the kernel.  I contend that this is 
just too close to call and either outcome is ok.  Note that in a UP 
machine, the higher priority task will get it anyway, as it has enough 
priority to beat the other contender.

If I understand your example, A has the mutex and there are waiters 
when A unlocks.  This should cause another CAS failure since A will be 
looking for "A" and not "A"+flag.  The mutex stays locked and the 
kernel then gives it (IN THE KERNEL, i.e. atomicly) to the highest 

> Just before calling the kernel, thread E, (low or high
> priority) atomically LOCKs from UNLOCKED, so it does not even know
> that there are waiters in the kernel. Now thread B is woken up by
> A in the kernel_wake() path and just before checking the *futex
> and setting the waiter flag or whatever, it is killed [or delayed
> and later killed]--the thing is it does not have the chance to
> claim the *futex and mark it as having waiters present in the
> kernel; E unlocks, without knowing about C and D, who will be
> waiting until frogs grow beards.

The key is that, once there are waiters, the flag is set in the mutex 
and the exit CAS will fail, transfering control to the kernel.

The only race is between a contender who is on the way to the kernel 
and a second contender who just happens along just as A  unlocks.  So 
there will only be these three tasks (or I suppose other could be 
added, either calling the kernel or trying to lock the free mutex. 
Only one of the many will get the mutex, the rest will all meet in the 
kernel).  Now the kernel will see one of three things:
1) A holds the lock, still
2) The mutex is unlocked, or
3) C holds the lock

 From the kernels point of view 1 and 2 are the same, just queue the 
new waiter.  2 is easy, give the mutex to B and keep on trucking.
> I know there is an slim chance and that you have to align seven
> planets for it to happen, but the chance is there; minimal, but
> present. And I don't want to be blamed in the future for a
> deadlock that risked some critical system.

Murphy says this will happen ONLY after all tests pass and the product 
is shipped :)
>>No, he should not enter the kernel with the expectation that it is in 
>>any state, as I said above.
> Does the case I exposed invalidate this?

> Think also that the idea behind that was to separate the actual
> implementation of the *futex count from the futex implementation,
> to allow architectures to choose the most effective one. For this
> I am requiring compare-and-swap, at least. Maybe this can be 
> stretched in ... hmmm ...
>>But you really want to check for NULL and replace with self (now the 
>>owner).  The sign bit would only get set if there is already 
>>an owner. 
>>  I suppose you could do a two stage loop something like:
>>  xx:	if (mutex->entry_wd == 0){       // this line and the next is
>>		mutex->entry_wd = TID;  /// the compare and swap
>>                  return ;		// we now own the mutex
>>	}else{
>>		if(mutex->entry_wd == what it was) {
>>			mutex->entry_wd = what it was | flag;
>>			call waitmutex(mutex);
>>		}else{
>>			goto xx;
>>		}
>>	}
>>The problem with this code is that an exit by the owner can race to 
>>the kernel with the contender.  And, of course, there is the third 
> Not if your unlock code is like this:
>     int unlock (int *lock, int locker_id)
>     {
>       int old_value;
>       atomic {  /* normal compare and swap */
>         if (*lock == locker_id) {
>           old_value = *lock;
>           *lock = 0;
>         }
>       }
>       if (old_value == locker_id) 
>         return 0;
>       if (*lock == (locker_id | WAITERS_PRESENT))
>         return kernel_wake_up (lock, locker_id);
>       /* Error, we did not own the lock! */
>       return -EWOULDBLOCK;
>     }
> At least I cannot see no races; I have looked to it left and right 
> and can't find any ...
>>possibility, i.e. that the mutex flag is already set which the above 
>>does not handle.
> It does. If the mutex flag (the waiters present flag) is already 
> set when we step on xx, we just set it again and fall down to the 
> kernel. In ia32, it'd be something like this [it's been long since
> I don't write asm, so I could have the arguments swapped and stuff]:

If the compare fails the cmpxchg fails.  You don't need to do anything 
to reset it.  THAT is what prevents the race.
>       1: movl          locker_id, %ecx
>          movl          0, %eax
>          lock cmpxchg  futex, %ecx
>          jz            we_got_it_locked_we_are_owners
>          movl          %eax, %ecx
>          orl           WAITERS_PRESENT, %ecx
>          lock cmpxchg  futex, %ecx
>          jnz           1
>        wait_in_kernel:
>          ...
>>>To speed it up and make it as O(1) as possible, the list is 
>>made like in
>>>your rt scheduler and in Ingo's scheduler (priority array 
>>of queues).
>>>Drawback: it is huge, the futex_q grows up to more than 
>>1K). With slabs and
>>>a recycling hack, it reduces the pressure on the VM and 
>>boosts performance,
>>I don't think this is worth it.  Unless you expect to stack up more 
>>than a couple of tasks a priority ordered doubly linked list 
>>should be 
>>just fine.  Remember, with the O(1) you still need the bit 
>>map and you 
>>need to search it to remove a waiter.  What you gain on queue 
>>insertion you will loose on removal.
> Well, the bitmap search is still faster than walking a list,
> specially if you have bitsearch asm operations (cache wise).
> I agree is kind of overkill; but heck, they are asking me for
> O(1) as much as possible, trading memory for it, so there I
> am going.
> OTOH, not having prioarrays when you have complex inheritance
> graphs really brings down performance. Sorted insertion in an
> small list is an small it; in many it starts to be noticed.
> However, that part is encapsulated and could be changed as 
> wished, as an option to the admin.
>>priority lists should be under the same spin lock, or 
>>possibly one for 
>>all priority lists on a given cpu, but then you have per cpu priority 
>>lists in schedule() but need strict priority lists in the 
>>mutexes, the 
>>head spins, must be time for the week end.
> This reminds me of some further down the road improvements
> I am considering, like having a futex hash bound to each
> CPU (NUMA should benefit from this). Then they would migrate
> from CPU hash table to CPU hash table as they are requested
> on different CPUs ... blah blah
> Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
> fault]

George Anzinger   george at mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml

More information about the cgl_discussion mailing list