[cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthrea d requirements

Perez-Gonzalez, Inaky inaky.perez-gonzalez at intel.com
Sat Feb 15 18:36:15 PST 2003

> Your suggestion that a low priority thread could get in is not really 
> relevant.  This could happen if the thread were a few instruction 
> earlier also.  No problem, just do the pi-mutex boost and carry on.

This does not work.

For starters, it is a plain priority inversion. There was 
somebody waiting with higher priority and the lock was 
given to a lower priority thread. That is a bug.

Also, it can potentially create a deadlock: thread A acquires,
threads B, C and D are waiting and thread A sets to UNLOCK before
calling the kernel_wake() [for the sake of simplicity, let's just
say we only wake up one waiter, but it could happen with two or
more]. Just before calling the kernel, thread E, (low or high
priority) atomically LOCKs from UNLOCKED, so it does not even know
that there are waiters in the kernel. Now thread B is woken up by
A in the kernel_wake() path and just before checking the *futex
and setting the waiter flag or whatever, it is killed [or delayed
and later killed]--the thing is it does not have the chance to
claim the *futex and mark it as having waiters present in the
kernel; E unlocks, without knowing about C and D, who will be
waiting until frogs grow beards.

I know there is an slim chance and that you have to align seven
planets for it to happen, but the chance is there; minimal, but
present. And I don't want to be blamed in the future for a
deadlock that risked some critical system.

> No, he should not enter the kernel with the expectation that it is in 
> any state, as I said above.

Does the case I exposed invalidate this?

Think also that the idea behind that was to separate the actual
implementation of the *futex count from the futex implementation,
to allow architectures to choose the most effective one. For this
I am requiring compare-and-swap, at least. Maybe this can be 
stretched in ... hmmm ...

> But you really want to check for NULL and replace with self (now the 
> owner).  The sign bit would only get set if there is already 
> an owner. 
>   I suppose you could do a two stage loop something like:
>   xx:	if (mutex->entry_wd == 0){       // this line and the next is
> 		mutex->entry_wd = TID;  /// the compare and swap
>                   return ;		// we now own the mutex
> 	}else{
> 		if(mutex->entry_wd == what it was) {
> 			mutex->entry_wd = what it was | flag;
> 			call waitmutex(mutex);
> 		}else{
> 			goto xx;
> 		}
> 	}
> The problem with this code is that an exit by the owner can race to 
> the kernel with the contender.  And, of course, there is the third 

Not if your unlock code is like this:

    int unlock (int *lock, int locker_id)
      int old_value;
      atomic {  /* normal compare and swap */
        if (*lock == locker_id) {
          old_value = *lock;
          *lock = 0;
      if (old_value == locker_id) 
        return 0;
      if (*lock == (locker_id | WAITERS_PRESENT))
        return kernel_wake_up (lock, locker_id);

      /* Error, we did not own the lock! */
      return -EWOULDBLOCK;

At least I cannot see no races; I have looked to it left and right 
and can't find any ...

> possibility, i.e. that the mutex flag is already set which the above 
> does not handle.

It does. If the mutex flag (the waiters present flag) is already 
set when we step on xx, we just set it again and fall down to the 
kernel. In ia32, it'd be something like this [it's been long since
I don't write asm, so I could have the arguments swapped and stuff]:

      1: movl          locker_id, %ecx
         movl          0, %eax
         lock cmpxchg  futex, %ecx
         jz            we_got_it_locked_we_are_owners
         movl          %eax, %ecx
         orl           WAITERS_PRESENT, %ecx
         lock cmpxchg  futex, %ecx
         jnz           1

> > To speed it up and make it as O(1) as possible, the list is 
> made like in
> > your rt scheduler and in Ingo's scheduler (priority array 
> of queues).
> > Drawback: it is huge, the futex_q grows up to more than 
> 1K). With slabs and
> > a recycling hack, it reduces the pressure on the VM and 
> boosts performance,
> > though.
> I don't think this is worth it.  Unless you expect to stack up more 
> than a couple of tasks a priority ordered doubly linked list 
> should be 
> just fine.  Remember, with the O(1) you still need the bit 
> map and you 
> need to search it to remove a waiter.  What you gain on queue 
> insertion you will loose on removal.

Well, the bitmap search is still faster than walking a list,
specially if you have bitsearch asm operations (cache wise).

I agree is kind of overkill; but heck, they are asking me for
O(1) as much as possible, trading memory for it, so there I
am going.

OTOH, not having prioarrays when you have complex inheritance
graphs really brings down performance. Sorted insertion in an
small list is an small it; in many it starts to be noticed.

However, that part is encapsulated and could be changed as 
wished, as an option to the admin.

> priority lists should be under the same spin lock, or 
> possibly one for 
> all priority lists on a given cpu, but then you have per cpu priority 
> lists in schedule() but need strict priority lists in the 
> mutexes, the 
> head spins, must be time for the week end.

This reminds me of some further down the road improvements
I am considering, like having a futex hash bound to each
CPU (NUMA should benefit from this). Then they would migrate
from CPU hash table to CPU hash table as they are requested
on different CPUs ... blah blah

Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my

More information about the cgl_discussion mailing list