Virtualizing /proc/sys/kernel/random/boot_id per container?

Tue Sep 4 08:42:45 UTC 2012

On 09/03/2012 11:48 PM, Eric W. Biederman wrote:
> Glauber Costa <glommer at parallels.com> writes:
> 
>> On 08/31/2012 04:13 AM, Eric W. Biederman wrote:
>>> "Daniel P. Berrange" <berrange at redhat.com> writes:
>>>
>>>> On Thu, Aug 30, 2012 at 03:15:17PM -0700, Eric W. Biederman wrote:
>>>>> "Daniel P. Berrange" <berrange at redhat.com> writes:
>>>>>
>>>>>> One of the features that SystemD folks have asked us to fix in LXC, is
>>>>>> to make sure that /proc/sys/kernel/random/boot_id changes each time a
>>>>>> container is started.
>>>>>
>>>>> There may be a good reason for this.  Most of what I have seen of
>>>>> kernel requests from the direction of SystemD is that while there
>>>>> may be a real problem, their imagined solution is usually not a
>>>>> particularly good one.  So a description of the problem is needed.
>>>>>
>>>>> Justifying something with just "SystemD wants this" is a good way
>>>>> to get a nack.
>>>>
>>>> SystemD records log messages for all system services in their journal.
>>>> They can show you all log messages for the current service execution,
>>>> all log messages for a service since system boot, or all log messages
>>>> ever. The boot_id value is used as a unique tag to allow grouping of
>>>> the log messages per system boot. When we run systemd inside a container
>>>> we want to get that grouping of log messages generated by services inside
>>>> the container, to take account of the container boot, not the host boot.
>>>> Hence the desire to have the boot_id value reflect when a container is
>>>> booted.
>>>
>>> Since SystemD post-dates containers and since the logging feature is not
>>> currently in wide use, that use case is completely non-persuasive.
>>>
>>> So far this just sounds like a plain SystemD bug and something that can
>>> be easily changed at this point in time.
>>>
>>> It has been a long time but my fuzzy memory says that the original
>>> boot_id justification was based on use cases that could not be solved
>>> any other way.
>>>
>>> My memory says it was this thread https://lkml.org/lkml/1999/5/31/233
>>> that inspired the implementation of boot_id.  However reading the
>>> current emacs source code it appears emacs gave up before boot_id
>>> was implemented and stats /var/run/random-seed (which we seem to
>>> have removed) or looks in wtmp or utmp for the latest boot record.
>>>
>>> I did a quick grep through the binaries on my system and I could not
>>> find anything using /proc/sys/kernel/random/boot_id.
>>>
>>> That suggests to me that the proper solution is to actually just remove
>>> boot_id.
>>>
>>> Hmm.  And then there is another interesting detail.  What should boot_id
>>> return after the processes have migrated from one system to another?
>>>
>>
>> Since this would be a per-boot id, this clearly has to be carried over
>> with migration, along with all the tons of data we already carry.
> 
> The twist of course is what does a boot mean.  If we are really after
> machine boots then the current behavior is correct.
> 
> Looking back in the archives the desired behavior appears to be a value
> that can be used to see if a pid value must be stale.
> 
> As a stale pid detector boot_id is pretty lousy.  Pids can still be
> reused.
> 
> Still a role as a stale pid detector makes it clear which namespace
> boot_id should be in and how we should treat boot_id upon migration.
> 
> You can only serve as a stale pid detector if you are in the pid
> namespace.
> 
> So at this point patches are welcome.  Hopefully with a summary
> of the discussion.
> 
> Eric
> 

Your discussion about boot_id being a limited solution is totally valid.
But it is orthogonal to the question of whether or not a container
should have it.
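
For reference, the "stale pid detector" use mentioned above can be
sketched roughly as follows. This is only an illustration: the record
format (boot_id and pid on one line) and the helper names are made up
here, not taken from any existing tool.

```python
BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the boot_id string with surrounding whitespace stripped."""
    with open(path) as f:
        return f.read().strip()


def write_pid_record(record_path, boot_id, pid):
    """Store boot_id and pid together (hypothetical record format)."""
    with open(record_path, "w") as f:
        f.write(f"{boot_id} {pid}\n")


def pid_record_is_stale(record_path, current_boot_id):
    """Return True if the record was written under a different boot_id.

    A matching boot_id does NOT prove the pid still names the same
    process, since pids can be reused within a single boot.
    """
    with open(record_path) as f:
        recorded_boot_id, _pid = f.read().split()
    return recorded_boot_id != current_boot_id
```

With a per-pid-namespace boot_id, a record written inside one container
run would be flagged as stale after the container restarts, which is the
behaviour being asked for.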

I took a look at this, and I think the kernel should be in a perfect
position to do it. FUSE has been welcome so far for things that are
really ill-defined in the kernel, such as data coming from cgroups,
which have no concept of visibility.

boot_id as a pid namespace id is a very well-defined concept. We just
need an interface to set it, so that it stays stable across migration.
Maybe we can accept writes to this file as valid, provided the pid
namespace has only the init process.

Then any tool could clone, mount proc, set this id, and continue
normally. Any objections?
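
A rough sketch of the "set this id" step from the tool's side, assuming
the write interface proposed above existed. It does not today: as far as
is known the file is read-only, so on a current kernel the write is
expected to fail and the function returns False. The function name and
the UUID validation are assumptions made for illustration. The clone and
proc mount steps are left out.

```python
import uuid

BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def set_boot_id(new_id, path=BOOT_ID_PATH):
    """Try to write new_id to the boot_id file.

    Returns True on success and False if the write is refused.
    Raises ValueError if new_id is not a valid UUID string.
    Writing to boot_id is the PROPOSED interface, not existing behaviour.
    """
    # Normalise to the canonical lowercase, hyphenated UUID form.
    value = str(uuid.UUID(new_id))
    try:
        with open(path, "w") as f:
            f.write(value + "\n")
    except OSError:
        # Covers a read-only file, missing proc mount, or no permission.
        return False
    return True
```

A migration tool would call this from the new pid namespace, right after
mounting proc and before starting anything else, passing the id saved
from the source host.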