[cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthrea d requirements

Fri Feb 14 23:18:37 PST 2003

Perez-Gonzalez, Inaky wrote:
> Hi George
> 
> 
>>-----Original Message-----
>>From: george anzinger [mailto:george at mvista.com]
> 
> 
>>At least for the forsee abe future the TID will be bounded in the 
>>kernel to something less than 24 bits, but you really only need 1.  I 
>>would use the sign bit to avoid the shift.  Here is the way 
>>you do it, 
>>assuming a compare and replace instruction.  Assume the pi_mutex is a 
>>structure so we are not limited to just the one word.  On entry we 
>>prefetch the TID (should be done when the thread first awakes, me 
>>thinks).  Entry is:
>>
>>	if (mutex->entry_wd == 0){       // this line and the next is
>>		mutex->entry_wd = TID;  /// the compare and swap
>>                 return ;		// we now own the mutex
>>	}else{
>>		call wait(mutex);
>>   	}
>>	return;
> 
> 
> What worries me about this approach is the window between the CAS and when
> somebody (eg: before calling the wait) sets the sign bit. If at that moment
> the owner preempts and unlocks, we open a window for a low priority thread
> to cause a priority inversion by claiming and acquiring the futex just then.

My guess is that the best way to handle this is to not even pass the 
current value to the kernel, but just to wait till it has it all 
locked down.  It would then take a look an see what the problem is. 
There are three possibilities:
a) It is free (i.e. the owner exited while the contender was on the 
way to the kernel), so lock it to the requester and return success.
b) It is locked, set the contention flag and set up the queues, etc.
c) It is locked and contended, add the new contender to the queues.

Your suggestion that a low priority thread could get in is not really 
relevant.  This could happen if the thread were a few instruction 
earlier also.  No problem, just do the pi-mutex boost and carry on.

> 
> Then you have the complexity in the kernel side. The caller that just went
> to the kernel will see that the futex changed the value since he last saw it
> until he entered the kernel path - if he were to queue without checking the
> value, it would deadlock until two tasks contended and one went into the
> kernel. And still it is not guaranteed he would make it.

No, he should not enter the kernel with the expectation that it is in 
any state, as I said above.
> 
> So, in that situation, it is when he sees the value changes and returns with
> -EWOULDDEADLOCK, so he retries the claim. Now multiply this by many that
> happen to be intermixed (it's call paths) in the same situation -- ugly.
> 
> If you set the sign bit at the same time than you do the compare-and-swap
> operation [or the equivalen composite operation], then you avoid that
> problem. You also need the sign bit to avoid the priority inversion, so
> win-win-situation, the only looser is the implementor [in this case, me :)].

But you really want to check for NULL and replace with self (now the 
owner).  The sign bit would only get set if there is already an owner. 
  I suppose you could do a two stage loop something like:

  xx:	if (mutex->entry_wd == 0){       // this line and the next is
		mutex->entry_wd = TID;  /// the compare and swap
                  return ;		// we now own the mutex
	}else{
		if(mutex->entry_wd == what it was) {
			mutex->entry_wd = what it was | flag;
			call waitmutex(mutex);
		}else{
			goto xx;
		}
	}

The problem with this code is that an exit by the owner can race to 
the kernel with the contender.  And, of course, there is the third 
possibility, i.e. that the mutex flag is already set which the above 
does not handle.

All in all, the original code seems best.  Even in the kernel, where 
the lock down is just a spinlock, there is a race with that code.  The 
solution, in what I wrote, was to just not worry about what got me 
into the lock down, but rather to, once locked down, examine the mutex 
for all possible combinations and act accordingly.
> 
> 
>>If not, the current task needs to be put in the list (headed at the 
>>mutex) of waiters.  This should be in priority order.  If the current 
> 
> 
> To speed it up and make it as O(1) as possible, the list is made like in
> your rt scheduler and in Ingo's scheduler (priority array of queues).
> Drawback: it is huge, the futex_q grows up to more than 1K). With slabs and
> a recycling hack, it reduces the pressure on the VM and boosts performance,
> though.

I don't think this is worth it.  Unless you expect to stack up more 
than a couple of tasks a priority ordered doubly linked list should be 
just fine.  Remember, with the O(1) you still need the bit map and you 
need to search it to remove a waiter.  What you gain on queue 
insertion you will loose on removal.
> 
> 
>>task ends up at the head of this list the current owner i.e.
>>"entry_wd" needs to be boosted in priority and if he is 
>>waiting on any 
>>other pi_mutex ies those owners need to also be boosted (here is the 
>>recursion).  This means that there needs to be location unique to the 
>>task, but findable by other tasks, that contains a pointer to the 
>>mutex he is waiting for, if any.
>>
>>[...]
> 
> 
> Piece of cake - a list rooted at the task structure that links all the
> futexes the task holds does the trick; the first guy who goes down to wait
> (or the first one who tries to boost it) will register the owner with the
> futex. Then walking it is easy; also need to pay attention and put hooks
> into setsched() and priority setting functions ... need to check that out
> deeper.

Yes, that works.  I think I was thinking of doing more of the work in 
user space, but that does not make a lot of since since you need the 
hard lock down only the kernel can provide.

Yeah, you need an effective priority and the actual priority.  The 
task runs at its effective priority while its actual priority is the 
lowest it will ever run.  I also had a pointer in the task structure 
to the priority management code for the given priority list the task 
was in at the moment.  This is a bit messy because it can change while 
you are trying to change it.  One then starts wondering if all the 
priority lists should be under the same spin lock, or possibly one for 
all priority lists on a given cpu, but then you have per cpu priority 
lists in schedule() but need strict priority lists in the mutexes, the 
head spins, must be time for the week end.

-g
> 
> 
> 
>>And it all depends on the atomic compare and swap, a native 
>>instruction on the later x86 machines and available to some degree on 
>>other platforms.  In any case a rather fast kernel call to emulate 
>>such is always possible.
> 
> 
> RISCs seem to be ok. UP machines [i386, ARMs] can hack it in user space
> using algorithm similar to that of the RISCS without the memory protection
> and assuming only the CPU is touching that part of memory and that IRQs
> won't touch it ... maybe there are too many assumptios, but I believe
> glibc/sysdeps/arm/atomicity.h has that already.
> 
> Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
> fault]
> 
> 

-- 
George Anzinger   george at mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml