[cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthread requirements

george anzinger george at mvista.com
Tue Feb 18 17:18:57 PST 2003


Abbas, Mohamed wrote:
> 
> All views expressed in this email are those of the individual sender. 
> 
> 
> -----Original Message-----
> From: george anzinger [mailto:george at mvista.com] 
> Sent: Tuesday, February 18, 2003 4:14 PM
> To: Perez-Gonzalez, Inaky
> Cc: 'Corey Minyard'; 'Pradeep Kathail'; cgl discussion; Chen, Terence;
> Abbas, Mohamed
> Subject: Re: [cgl_discussion] Question for TEMs/ISVs/OEMs regarding pthread
> requirements
> 
> Perez-Gonzalez, Inaky wrote:
> 
>>>>For starters, it is a plain priority inversion. There was 
>>>>somebody waiting with higher priority and the lock was 
>>>>given to a lower priority thread. That is a bug.
>>>>
>>>>Also, it can potentially create a deadlock: thread A acquires,
>>>>threads B, C and D are waiting and thread A sets to UNLOCK before
>>>>calling the kernel_wake() [for the sake of simplicity, let's just
>>>>say we only wake up one waiter, but it could happen with two or
>>>>more]. 
>>>
>>>I think you misunderstand the CAS instruction.  If there are waiters, 
>>>the compare will fail and the swap will not be done.
>>
>>
>>Oh man! I don't know if I am confusing you or you are confusing me, but
>>sure I don't get this - please help me straighten it out.
>>
>>Ok, if there are waiters, *futex == LID0 | WP [LID0 = Locker ID of the
>>owner, WP = Waiters Present bit]. Using the assembly code below, I do
>>
>>ecx = MY_LID   ; MY locker ID
>>eax = 0
>>atomic {
>>  if (*futex == 0) {
>>    ZF = 1;
>>    *futex = ecx; // *futex = MY_LID
>>  } 
>>  else { /* We take this path */
>>    ZF = 0;
>>    eax = *futex;  // eax is LID0 | WP
>>  }
>>}
>>// So now ZF = 0, eax = *futex = LID0 | WP, ecx = MY_LID
>>jz got_it // ZF = 0, so we don't take the jump
>>ecx = eax;  // ecx = LID0 | WP
>>ecx |= WP;  // mark the WP again, although it's already there in this case
> 
> 
> I think this is an error.  I understand that you want to mark 
> contention on the futex as soon as possible, however, I think it will 
> cause problems if you mark it BEFORE you have set up the kernel 
> structures to handle the exit from a contended futex.  Better, I 
> think, to just leave it to the kernel to mark it contended once it 
> either has all the structures in place, or at least will have prior 
> to letting anyone else see the flag.  All that is at risk here is 
> that the current holder will exit and leave the futex free (or that 
> another will then grab it).  If you mark it here, you will have to 
> handle the case of the holder beating the contender to the kernel and 
> not knowing what to do...
> 
> When the waiter calls futex_wait it will pass the value of the mutex.
> Inside the kernel it will check this value against the futex's lock; if
> they are different it will return to user space to try to lock it again.
> So I don't think it is a problem.
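
For reference, this matches the stock futex convention: FUTEX_WAIT takes
the value the caller last saw, and the kernel atomically re-checks it
before sleeping, failing with EWOULDBLOCK if it has changed.  A minimal
sketch of such a user-space wrapper (the wrapper itself is illustrative,
not code from this thread):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <errno.h>

    /* Sleep until woken, but only if *uaddr still holds 'expected'.
     * Returns -EWOULDBLOCK if the value changed in the meantime, in
     * which case the caller goes back to user space and contends
     * again. */
    static int futex_wait (int *uaddr, int expected)
    {
            if (syscall (SYS_futex, uaddr, FUTEX_WAIT, expected, NULL) < 0)
                    return -errno;
            return 0;
    }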

What I am saying is that the value is not important.  All the kernel 
needs is a pointer to it.  Once the kernel has everything locked down 
it can look at the value and decide what to do.  I don't think the 
kernel needs to return to the user until the lock is granted in his 
favor.  What are the possibilities?

a) The lock is held by another user and is uncontended, i.e. this is 
the first contention. (normal)
b) The lock is held by another user and is contended, i.e. this is a 
second or higher contender (also normal)
c) The lock is held by the caller (this is a recursion error and 
should cause termination of the caller with extreme prejudice)
d) The lock is not held (i.e. the holder has gone away while the call 
was making its way to the kernel, fine, give the lock to the caller 
and continue)

I think that covers it.  Knowing what the state was that caused entry 
into the kernel is basically a big so what.  One of the above will be 
correct and we can continue.  I suppose we need to also contend with 
the case:
e) The lock is garbage (should handle the same as c)

The issue with e) is how to notice that it is garbage.  Just what does 
a valid lock look like?  We also need to contend with locks that are 
held by processes that have terminated.  Since, on the high speed 
path, the kernel does not know about the lock, how do we handle locks 
that are left by such processes, or must the process "register" a 
mutex before first use?  This would allow the kernel to check each 
mutex on exit to ensure that the exiting task does not hold it.
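
To make that concrete, a rough sketch of the kernel-side decision, once 
it has the futex queue locked down, might look like the following.  All 
the names (futex_q, queue_me, the return codes) are made up for 
illustration, and the encoding follows the scheme in this thread: 0 == 
free, otherwise the owner's LID, optionally OR'ed with a waiters bit:

    #include <errno.h>

    #define WAITERS_PRESENT 0x80000000
    enum wait_result { LOCK_GRANTED, QUEUED };

    struct futex_q;                                /* per-futex wait queue */
    extern void queue_me (struct futex_q *, int);  /* hypothetical helper  */

    static int futex_wait_decide (int *uaddr, int caller, struct futex_q *fq)
    {
            int v = *uaddr;  /* stable: we hold the queue lock here */

            if (v == 0) {                          /* d) holder went away  */
                    *uaddr = caller;               /* grant to the caller  */
                    return LOCK_GRANTED;
            }
            if ((v & ~WAITERS_PRESENT) == caller)  /* c) recursion         */
                    return -EDEADLK;               /* kill the caller      */

            /* a) first or b) Nth contention: the waiters bit is set only
             * now, with the structures in place, so the owner's unlock
             * CAS-to-zero is guaranteed to fail and trap to the kernel.
             * Case e) (garbage) is the hard one; detecting it needs a
             * validity convention, as noted above. */
            *uaddr = v | WAITERS_PRESENT;
            queue_me (fq, caller);
            return QUEUED;
    }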
> 
> 
>>atomic {
>>  if (*futex == (LID0 | WP)) { /* if still same as after previous CAS */
>>    ZF = 1;
>>    *futex = ecx;  // *futex = LID0 | WP, effectively a nop
>>  }
>>  else {
>>    ZF = 0;
>>    eax = *futex;
>>  }
>>}
>>jnz beginning // oops, it changed, try again
>>
>>Now, sure the first CAS fails, as it is intended; but I have the *futex in
>>eax (as it failed) to verify in the next CAS that it did not change. Now,
>>the second CAS will only fail if the value of *futex changes (and thus what
>>I am trying to write back is already invalid because it was based on that),
>>so I retry from the beginning.
>>
>>
>>
>>>The only hole is 
>>>if there are no waiters and a contender (which will soon be a waiter) 
>>>fails in the CAS.  The owner still has the mutex; if he releases AND 
>>>another contender comes along there is a race for who will get it 
>>>between the one who failed the CAS and the new contender which may 
>>>take the lock without involving the kernel.  I contend that this is
>>
>>
>>This is what I don't see. The contender that failed the first CAS (contender
>>1) will go to the second CAS. In the second CAS he is assuming that *futex
>>equals LID0 (the owner's LID, as he saw it in the first CAS), but the
>>'hijacking' contender (contender 2) already set it to his LID2; so now, the
>>second CAS of contender 1 will compare *futex to the value he assumes it
>>should have, LID0, but it has LID2, so it fails to set it to LID0|WP and he
>>starts all over again. If it was what he saw in the failed 1st CAS, then he
>>sets it to LID0|WP and goes to the kernel to wait.
> 
> 
> This would work IF you can handle a contender who sees the WP getting 
> to the kernel prior to the one who sets it...
> 
>>He depends on the value staying the same, not on changing, so they cannot
>>race. Or I still cannot see it and I need you to please shake me and explain
>>it to me as if I were stupid [please keep in mind I have been working almost
>>all Saturday ...]
>> 
>>
>>
>>>just too close to call and either outcome is ok.  Note that in a UP 
>>>machine, the higher priority task will get it anyway, as it 
>>>has enough 
>>>priority to beat the other contender.
>>>
>>>If I understand your example, A has the mutex and there are waiters 
>>>when A unlocks.  This should cause another CAS failure since 
>>>A will be 
>>>looking for "A" and not "A"+flag.  The mutex stays locked and the 
>>>kernel then gives it (IN THE KERNEL, i.e. atomically) to the highest 
>>>contender.
>>
>>
>>Exactly, but the unlock first CAS's only for A (the fast unlock path); if
>>that fails - AKA, either there are waiters (A+flag) or it is not our futex.
>>So the next thing is to test for A+flag - if so, go to the kernel (and sure,
>>the kernel will do the ownership transfer atomically to the highest
>>contender, setting the flag if there are more).
>>
>> 
>>
>>
>>>>Just before calling the kernel, thread E (low or high
>>>>priority) atomically LOCKs from UNLOCKED, so it does not even know
>>>>that there are waiters in the kernel. Now thread B is woken up by
>>>>A in the kernel_wake() path and just before checking the *futex
>>>>and setting the waiter flag or whatever, it is killed [or delayed
>>>>and later killed]--the thing is it does not have the chance to
>>>>claim the *futex and mark it as having waiters present in the
>>>>kernel; E unlocks, without knowing about C and D, who will be
>>>>waiting until frogs grow beards.
>>>
>>>The key is that, once there are waiters, the flag is set in the mutex 
>>>and the exit CAS will fail, transferring control to the kernel.
>>>
>>>The only race is between a contender who is on the way to the kernel 
>>>and a second contender who just happens along just as A  unlocks.  So 
>>
>>
>>This is why you pass, along with the kernel_wait(), the value you saw for
>>the futex last time you did a CAS. If, when you are about to lock, it has
>>changed, then you go back to user space to contend again, so you avoid this
>>problem.
> 
> 
> Actually I don't think this is a problem.  If the kernel has everything 
> locked down it can just go from what it finds.  I don't really 
> see the point of even passing in the value that sent him to the kernel.
> 
>>
>>>there will only be these three tasks (or I suppose others could be 
>>>added, either calling the kernel or trying to lock the free mutex. 
>>>Only one of the many will get the mutex, the rest will all 
>>>meet in the 
>>>kernel).  Now the kernel will see one of three things:
>>>1) A holds the lock, still
>>>2) The mutex is unlocked, or
>>>3) C holds the lock
>>>
>>>From the kernel's point of view 1 and 3 are the same, just queue the 
>>>new waiter.  2 is easy, give the mutex to B and keep on trucking.
>>>
>>>
>>>>I know there is a slim chance and that you have to align seven
>>>>planets for it to happen, but the chance is there; minimal, but
>>>>present. And I don't want to be blamed in the future for a
>>>>deadlock that risked some critical system.
>>>
>>>Murphy says this will happen ONLY after all tests pass and 
>>>the product 
>>>is shipped :)
>>>
>>>
>>>>>No, he should not enter the kernel with the expectation that it is in
>>>>>any state, as I said above.
>>>>
>>>>
>>>>Does the case I exposed invalidate this?
>>>
>>>No.
>>>
>>>
>>>>Think also that the idea behind that was to separate the actual
>>>>implementation of the *futex count from the futex implementation,
>>>>to allow architectures to choose the most effective one. For this
>>>>I am requiring compare-and-swap, at least. Maybe this can be 
>>>>stretched in ... hmmm ...
>>>>
>>>>
>>>>
>>>>
>>>>>But you really want to check for NULL and replace with self (now the
>>>>>owner).  The sign bit would only get set if there is already 
>>>>>an owner. 
>>>>>I suppose you could do a two stage loop something like:
>>>>>
>>>>>xx:	if (mutex->entry_wd == 0){       // this line and the next is
>>>>>		mutex->entry_wd = TID;  // the compare and swap
>>>>>                return ;		// we now own the mutex
>>>>>	}else{
>>>>>		if(mutex->entry_wd == what it was) {
>>>>>			mutex->entry_wd = what it was | flag;
>>>>>			call waitmutex(mutex);
>>>>>		}else{
>>>>>			goto xx;
>>>>>		}
>>>>>	}
>>>>>
>>>>>The problem with this code is that an exit by the owner can race to 
>>>>>the kernel with the contender.  And, of course, there is the third 
>>>>
>>>>
>>>>Not if your unlock code is like this:
>>>>
>>>>   int unlock (int *lock, int locker_id)
>>>>   {
>>>>     int old_value;
>>>>     atomic {  /* normal compare and swap */
>>>>       old_value = *lock;  /* CAS always returns the old value */
>>>>       if (old_value == locker_id)
>>>>         *lock = 0;
>>>>     }
>>>>     if (old_value == locker_id) 
>>>>       return 0;
>>>>   
>>>>     if (*lock == (locker_id | WAITERS_PRESENT))
>>>>       return kernel_wake_up (lock, locker_id);
>>>>
>>>>     /* Error, we did not own the lock! */
>>>>     return -EWOULDBLOCK;
>>>>   }
>>>>
>>>>At least I cannot see any races; I have looked at it left and right 
>>>>and can't find any ...
>>>>
>>>>
>>>>
>>>>
>>>>>possibility, i.e. that the mutex flag is already set which the above
>>>>>does not handle.
>>>>
>>>>
>>>>It does. If the mutex flag (the waiters present flag) is already 
>>>>set when we step on xx, we just set it again and fall down to the 
>>>>kernel. In ia32, it'd be something like this [it's been a long time
>>>>since I last wrote asm, so I could have the arguments swapped and stuff]:
>>>
>>>If the compare fails the cmpxchg fails.  You don't need to do 
>>>anything 
>>>to reset it.  THAT is what prevents the race.
> 
> 
> 
>>>>     1: movl          locker_id, %ecx
>>>>        movl          $0, %eax
>>>>        lock cmpxchg  %ecx, futex
>>>>        jz            we_got_it_locked_we_are_owners
>>>>        movl          %eax, %ecx
>>>>        orl           $WAITERS_PRESENT, %ecx
>>>>        lock cmpxchg  %ecx, futex
>>>>        jnz           1b
>>>>      wait_in_kernel:
>>>>        ...
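
For comparison, the same two-CAS sequence can be sketched portably in C.
This assumes GCC-style __sync_val_compare_and_swap builtins and the
LID/WAITERS_PRESENT encoding from this thread; futex_wait_in_kernel() is
a hypothetical stand-in for the actual syscall, assumed to return 0 when
the kernel grants the lock in the waiter's favor:

    #define WAITERS_PRESENT 0x80000000

    extern int futex_wait_in_kernel (int *futex, int seen); /* hypothetical */

    static void futex_lock (int *futex, int my_lid)
    {
            int seen, marked;

            for (;;) {
                    /* First CAS: 0 -> MY_LID takes a free lock outright. */
                    seen = __sync_val_compare_and_swap (futex, 0, my_lid);
                    if (seen == 0)
                            return;       /* we got it, we are owners */

                    /* Second CAS: write back exactly what we saw, with
                     * the waiters bit OR'ed in.  It succeeds only if
                     * *futex is unchanged; a hijacker or an unlock in
                     * between makes it fail and we start over. */
                    marked = seen | WAITERS_PRESENT;
                    if (__sync_val_compare_and_swap (futex, seen, marked)
                        == seen
                        && futex_wait_in_kernel (futex, marked) == 0)
                            return;       /* kernel granted us the lock */
                    /* CAS failed or we were told to retry: contend again */
            }
    }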
>>
>>
>>
>>
>>Inaky Perez-Gonzalez -- Not speaking for Intel - opinions are my own [or my
>>fault]
>>
>>
> 
> 

-- 
George Anzinger   george at mvista.com
High-res-timers:  http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml



