[cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthrea d requirements

Tue Feb 18 16:34:34 PST 2003

All views expressed in this email are those of the individual sender. 

-----Original Message-----
From: george anzinger [mailto:george at mvista.com] 
Sent: Tuesday, February 18, 2003 4:14 PM
To: Perez-Gonzalez, Inaky
Cc: 'Corey Minyard'; 'Pradeep Kathail'; cgl discussion; Chen, Terence;
Abbas, Mohamed
Subject: Re: [cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthrea d
requirements

Perez-Gonzalez, Inaky wrote:
>>>For starters, it is a plain priority inversion. There was 
>>>somebody waiting with higher priority and the lock was 
>>>given to a lower priority thread. That is a bug.
>>>
>>>Also, it can potentially create a deadlock: thread A acquires,
>>>threads B, C and D are waiting and thread A sets to UNLOCK before
>>>calling the kernel_wake() [for the sake of simplicity, let's just
>>>say we only wake up one waiter, but it could happen with two or
>>>more]. 
>>
>>I think you misunderstand the CAS instruction.  If there are waiters, 
>>the compare will fail and the swap will not be done.
> 
> 
> Oh man! I don't know if I am confusing you or you are confusing me, but
sure
> I don't get this - please help me straighten it out.
> 
> Ok, if there are waiters, *futex == LID0 | WP [LID0 = Locker ID of the
> owner, WP = Waiters Present bit]. Using the assembly code below, I do
> 
> ecx = MY_LID   ; MY locker ID
> eax = 0
> atomic {
>   if (*futex == 0) {
>     ZF = 1;
>     *futex = ecx; // *futex = MY_LID
>   } 
>   else { /* We take this path */
>     ZF = 0;
>     eax = *futex;  // eax is LID0 | WP
>   }
> }
> // So now ZF = 0, eax = *futex = LID0 | WP, ecx = MY_LID
> jz got_it // ZF = 0, so we don't take the jump
> ecx = eax;  // ecx = LID0 | WP
> ecx |= WP;  // mark again the WP, although its there in this case

I think this is an error.  I understand that you want to mark 
contention on the futex as soon as possible, however, I think it will 
cause problems if you mark it BEFORE you have set up the kernel 
structures to handle the exit from a contented futex.  Better, I 
think, to just leave it to the kernel to mark it contended once it 
either has  all the structures in place, or at least will have prior 
to letting any one else see the flag.  All that is at risk here is 
that the current holder will exit and leave the futex free (or that 
another will then grab it).  If you mark it here, you will have to 
handle the case of the holder beating the contender to the kernel and 
not knowing what to do...

When waiter calls the futex_wait it will pass the value of mutex. Inside the
kernel it will check this value with the futex's lock, if they are different
it will return to user space to try to lock it again. So I don't think it is
a problem.

> atomic {
>   if (*futex == (LID0 | WP)) { /* if still same as after previous CAS */
>     ZF = 1;
>     *futex = ecx;  // *futex = LID0 | WP, effectively a nop
>   }
>   else {
>     ZF = 0;
>     eax = *futex;
>   }
> }
> jnz beginning // oops, it changed, try again
> 
> Now, sure the first cas fails, as it is intended; but I have the *futex in
> eax (as it failed) to verify in the next CAS that it did not change. Now,
> the second CAS will only fail if the value of *futex changes (and thus
what
> I am trying to write back is already invalid because it was based on
that),
> so I retry from the beginning.
> 
> 
>>The only hole is 
>>if there are no waiters and a contender (which will soon be a waiter) 
>>fails in the CAS.  The owner still has the mutex, if he releases AND 
>>another contender comes along there is a race for who will get it 
>>between the one who failed the CAS and the new contender which may 
>>take the lock without involving the kernel.  I contend that this is
> 
> 
> This is what I don't see. The contender that failed the first CAS
(contender
> 1) will go to the second CAS. In the second CAS he is assuming that *futex
> equals to LID0 (owner's LID, as it saw it in the first CAS), but the
> 'hijacking' contender (contender 2) already set it to his LID2; so now,
the
> second CAS of contender 1 will compare *futex to the value he assumes it
> should have, LID0, but it has LID2, so it fails to set it to LID0|WP and
he
> starts all over again. If it was what he saw in the failed 1st CAS, then
he
> sets it to LID0|WP and goes to the kernel to wait.

This would work IF you can handle a contender who sees the WP getting 
to the kernel prior to the one who sets it...
> 
> He depends on the value staying the same, not on changing, so they cannot
> race. Or I still cannot see it and I need you to please shake me and
explain
> it to me as if I were stupid [please keep in mind I have been working
almost
> all Saturday ...]
>  
> 
>>just too close to call and either outcome is ok.  Note that in a UP 
>>machine, the higher priority task will get it anyway, as it 
>>has enough 
>>priority to beat the other contender.
>>
>>If I understand your example, A has the mutex and there are waiters 
>>when A unlocks.  This should cause another CAS failure since 
>>A will be 
>>looking for "A" and not "A"+flag.  The mutex stays locked and the 
>>kernel then gives it (IN THE KERNEL, i.e. atomicly) to the highest 
>>contender.
> 
> 
> Exactly, but the unlock first CAS's only for A (the fast unlock path); if
> that fails - AKA, either there are waiters (A+flag) or it is not our
futex.
> So next thing is test for A+flag - if it is so, go to the kernel (and
sure,
> the kernel will do the ownership transfer atomically to the highest
> contender, setting the flag if there are more).
> 
>  
> 
>>>Just before calling the kernel, thread E, (low or high
>>>priority) atomically LOCKs from UNLOCKED, so it does not even know
>>>that there are waiters in the kernel. Now thread B is woken up by
>>>A in the kernel_wake() path and just before checking the *futex
>>>and setting the waiter flag or whatever, it is killed [or delayed
>>>and later killed]--the thing is it does not have the chance to
>>>claim the *futex and mark it as having waiters present in the
>>>kernel; E unlocks, without knowing about C and D, who will be
>>>waiting until frogs grow beards.
>>
>>The key is that, once there are waiters, the flag is set in the mutex 
>>and the exit CAS will fail, transfering control to the kernel.
>>
>>The only race is between a contender who is on the way to the kernel 
>>and a second contender who just happens along just as A  unlocks.  So 
> 
> 
> This is why you pass, along with the kernel_wait(), the value you saw for
> the futex last time you did a CAS. If when you are about to lock, it has
> changed, then you go back to user space to contend again, so you avoid
this
> problem.

Actually I don't think this is a problem.  If the kernel has every 
thing locked down it can just go from what it finds.  I don't really 
see the point of even passing in the value that sent him to the kernel.
> 
> 
>>there will only be these three tasks (or I suppose other could be 
>>added, either calling the kernel or trying to lock the free mutex. 
>>Only one of the many will get the mutex, the rest will all 
>>meet in the 
>>kernel).  Now the kernel will see one of three things:
>>1) A holds the lock, still
>>2) The mutex is unlocked, or
>>3) C holds the lock
>>
>> From the kernels point of view 1 and 2 are the same, just queue the 
>>new waiter.  2 is easy, give the mutex to B and keep on trucking.
>>
>>>I know there is an slim chance and that you have to align seven
>>>planets for it to happen, but the chance is there; minimal, but
>>>present. And I don't want to be blamed in the future for a
>>>deadlock that risked some critical system.
>>
>>Murphy says this will happen ONLY after all tests pass and 
>>the product 
>>is shipped :)
>>
>>>
>>>>No, he should not enter the kernel with the expectation 
>>
>>that it is in 
>>
>>>>any state, as I said above.
>>>
>>>
>>>Does the case I exposed invalidate this?
>>
>>No.
>>
>>>Think also that the idea behind that was to separate the actual
>>>implementation of the *futex count from the futex implementation,
>>>to allow architectures to choose the most effective one. For this
>>>I am requiring compare-and-swap, at least. Maybe this can be 
>>>stretched in ... hmmm ...
>>>
>>>
>>>
>>>>But you really want to check for NULL and replace with self 
>>
>>(now the 
>>
>>>>owner).  The sign bit would only get set if there is already 
>>>>an owner. 
>>>> I suppose you could do a two stage loop something like:
>>>>
>>>> xx:	if (mutex->entry_wd == 0){       // this line 
>>
>>and the next is
>>
>>>>		mutex->entry_wd = TID;  /// the compare and swap
>>>>                 return ;		// we now own the mutex
>>>>	}else{
>>>>		if(mutex->entry_wd == what it was) {
>>>>			mutex->entry_wd = what it was | flag;
>>>>			call waitmutex(mutex);
>>>>		}else{
>>>>			goto xx;
>>>>		}
>>>>	}
>>>>
>>>>The problem with this code is that an exit by the owner can race to 
>>>>the kernel with the contender.  And, of course, there is the third 
>>>
>>>
>>>Not if your unlock code is like this:
>>>
>>>    int unlock (int *lock, int locker_id)
>>>    {
>>>      int old_value;
>>>      atomic {  /* normal compare and swap */
>>>        if (*lock == locker_id) {
>>>          old_value = *lock;
>>>          *lock = 0;
>>>        }
>>>      }
>>>      if (old_value == locker_id) 
>>>        return 0;
>>>    
>>>      if (*lock == (locker_id | WAITERS_PRESENT))
>>>        return kernel_wake_up (lock, locker_id);
>>>
>>>      /* Error, we did not own the lock! */
>>>      return -EWOULDBLOCK;
>>>    }
>>>
>>>At least I cannot see no races; I have looked to it left and right 
>>>and can't find any ...
>>>
>>>
>>>
>>>>possibility, i.e. that the mutex flag is already set which 
>>
>>the above 
>>
>>>>does not handle.
>>>
>>>
>>>It does. If the mutex flag (the waiters present flag) is already 
>>>set when we step on xx, we just set it again and fall down to the 
>>>kernel. In ia32, it'd be something like this [it's been long since
>>>I don't write asm, so I could have the arguments swapped and stuff]:
>>
>>If the compare fails the cmpxchg fails.  You don't need to do 
>>anything 
>>to reset it.  THAT is what prevents the race.

>>
>>>      1: movl          locker_id, %ecx
>>>         movl          0, %eax
>>>         lock cmpxchg  futex, %ecx
>>>         jz            we_got_it_locked_we_are_owners
>>>         movl          %eax, %ecx
>>>         orl           WAITERS_PRESENT, %ecx
>>>         lock cmpxchg  futex, %ecx
>>>         jnz           1
>>>       wait_in_kernel:
>>>         ...
> 
> 
> 
> 
> Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or
my
> fault]
> 
> 

-- 
George Anzinger   george at mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml