[PATCH v21 011/100] eclone (11/11): Document sys_eclone

Randy Dunlap randy.dunlap at oracle.com
Wed May 5 14:14:47 PDT 2010


On Sat,  1 May 2010 10:14:53 -0400 Oren Laadan wrote:

> From: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
> 
> This gives a brief overview of the eclone() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev at linux.vnet.ibm.com>
> Acked-by: Serge E. Hallyn <serue at us.ibm.com>
> Acked-by: Oren Laadan  <orenl at cs.columbia.edu>
> ---
>  Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 348 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/eclone
> 
> diff --git a/Documentation/eclone b/Documentation/eclone
> new file mode 100644
> index 0000000..c2f1b4b
> --- /dev/null
> +++ b/Documentation/eclone
> @@ -0,0 +1,348 @@
> +
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> +		pid_t * __user pids)
> +
> +	In addition to doing everything that clone() system call does, the

	                                that the clone()

> +	eclone() system call:
> +
> +		- allows additional clone flags (31 of 32 bits in the flags
> +		  parameter to clone() are in use)
> +
> +		- allows user to specify a pid for the child process in its
> +		  active and ancestor pid namespaces.
> +
> +	This system call is meant to be used when restarting an application
> +	from a checkpoint. Such restart requires that the processes in the
> +	application have the same pids they had when the application was
> +	checkpointed. When containers are nested, the processes within the
> +	containers exist in multiple pid namespaces and hence have multiple
> +	pids to specify during restart.
> +
> +	The @flags_low parameter is identical to the 'clone_flags' parameter
> +	in existing clone() system call.

	in the existing

> +
> +	The fields in 'struct clone_args' are meant to be used as follows:
> +
> +	u64 clone_flags_high:
> +
> +		When eclone() supports more than 32 flags, the additional bits
> +		in the clone_flags should be specified in this field. This
> +		field is currently unused and must be set to 0.
> +
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +		These two fields correspond to the 'child_stack' fields in
> +		clone() and clone2() (on IA64) system calls. The usage of
> +		these two fields depends on the processor architecture.
> +
> +		Most architectures use ->child_stack to pass-in a stack-pointer

		                                     to pass in

> +		itself and don't need the ->child_stack_size field. On these
> +		architectures the ->child_stack_size field must be 0.
> +
> +		Some architectures, eg IA64, use ->child_stack to pass-in the

		                    e.g.                        to pass in

> +		base of the region allocated for stack. These architectures
> +		must pass in the size of the stack-region in ->child_stack_size.

		                             stack region

Seems unfortunate that different architectures use the fields differently.

> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +		These two fields correspond to the 'parent_tid_ptr' and
> +		'child_tid_ptr' fields in the clone() system call

		                                      system call.

> +
> +	u32 nr_pids;
> +
> +		nr_pids specifies the number of pids in the @pids array
> +		parameter to eclone() (see below). nr_pids should not exceed
> +		the current nesting level of the calling process (i.e if the

		                                                  i.e.

> +		process is in init_pid_ns, nr_pids must be 1, if process is
> +		in a pid namespace that is a child of init-pid-ns, nr_pids
> +		cannot exceed 2, and so on).
> +
> +	u32 reserved0;
> +	u64 reserved1;
> +
> +		These fields are intended to extend the functionality of the
> +		eclone() in the future, while preserving backward compatibility.
> +		They must be set to 0 for now.

The struct does not have a reserved1 field AFAICT.

> +	The @cargs_size parameter specifes the sizeof(struct clone_args) and
> +	is intended to enable extending this structure in the future, while
> +	preserving backward compatibility.  For now, this field must be set
> +	to the sizeof(struct clone_args) and this size must match the kernel's
> +	view of the structure.
> +
> +	The @pids parameter defines the set of pids that should be assigned to
> +	the child process in its active and ancestor pid namespaces. The
> +	descendant pid namespaces do not matter since a process does not have a
> +	pid in descendant namespaces, unless the process is in a new pid
> +	namespace in which case the process is a container-init (and must have
> +	the pid 1 in that namespace).
> +
> +	See CLONE_NEWPID section of clone(2) man page for details about pid

	                         of the clone(2)

> +	namespaces.
> +
> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> +	@pids list is fewer than the nesting level of the process, the pids
> +	are applied from youngest namespace. i.e if the process is nested in

	                 the youngest namespace. I.e.

> +	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> +	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> +	have a pid of '0' (the kernel will assign a pid in those namespaces).
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, eclone() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).
> +
> +	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
> +		specify the pids in this call (if pids are not specifed
> +		CAP_SYS_ADMIN is not required).
> +
> +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
> +		the current nesting level of parent process

		                                    process.

> +
> +	EINVAL	Not all specified clone-flags are valid.
> +
> +	EINVAL	The reserved fields in the clone_args argument are not 0.
> +
> +	EINVAL	The child_stack_size field is not 0 (on architectures that
> +		pass in a stack pointer in ->child_stack field)

		                                         field).

> +
> +	EBUSY	A requested pid is in use by another process in that namespace.
> +
> +---


Is this example program meant to build only on i386?

On x86_64 I get:

eclone-syscall-test.c: In function 'do_clone':
eclone-syscall-test.c:166: warning: assignment makes pointer from integer without a cast
/tmp/cc0OrhU3.o: In function `do_clone':
eclone-syscall-test.c:(.text+0x173): undefined reference to `setup_stack'
eclone-syscall-test.c:(.text+0x1e2): undefined reference to `eclone'


> +/*
> + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
> + * the current pid namespace. The child gets the usual "random" pid in any
> + * ancestor pid namespaces.
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <signal.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <wait.h>
> +#include <sys/syscall.h>
> +
> +#define __NR_eclone		337
> +#define CLONE_NEWPID            0x20000000
> +#define CLONE_CHILD_SETTID      0x01000000
> +#define CLONE_PARENT_SETTID     0x00100000
> +#define CLONE_UNUSED		0x00001000
> +
> +#define STACKSIZE		8192
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef int pid_t;
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +	u32 nr_pids;
> +
> +	u32 reserved0;
> +};
> +
> +#define exit		_exit
> +
> +/*
> + * Following eclone() is based on code posted by Oren Laadan at:
> + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
> + */
> +#if defined(__i386__) && defined(__NR_eclone)
> +
> +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
> +		int *pids)
> +{
> +	long retval;
> +
> +	__asm__ __volatile__(
> +		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
> +		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
> +		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
> +		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
> +
> +		 "pushl %%ebp\n\t"	/* save value of ebp */
> +		 "int $0x80\n\t"	/* Linux/i386 system call */
> +		 "testl %0,%0\n\t"	/* check return value */
> +		 "jne 1f\n\t"		/* jump if parent */
> +
> +		 "popl %%esi\n\t"	/* get subthread function */
> +		 "call *%%esi\n\t"	/* start subthread function */
> +		 "movl %2,%0\n\t"
> +		 "int $0x80\n"		/* exit system call: exit subthread */
> +		 "1:\n\t"
> +		 "popl %%ebp\t"		/* restore parent's ebp */
> +
> +		:"=a" (retval)
> +
> +		:"0" (__NR_eclone),
> +		 "i" (__NR_exit),
> +		 "m" (flags_low),
> +		 "m" (clone_args),
> +		 "m" (args_size),
> +		 "m" (pids)
> +		);
> +
> +	if (retval < 0) {
> +		errno = -retval;
> +		retval = -1;
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
> +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
> +{
> +	void *stack_base;
> +	void **stack_top;
> +
> +	stack_base = malloc(size + size);
> +	if (!stack_base) {
> +		perror("malloc()");
> +		exit(1);
> +	}
> +
> +	stack_top = (void **)((char *)stack_base + (size - 4));
> +	*--stack_top = child_arg;
> +	*--stack_top = child_fn;
> +
> +	return stack_top;
> +}
> +#endif
> +
> +/* gettid() is a bit more useful than getpid() when messing with clone() */
> +int gettid()
> +{
> +	int rc;
> +
> +	rc = syscall(__NR_gettid, 0, 0, 0);
> +	if (rc < 0) {
> +		printf("rc %d, errno %d\n", rc, errno);
> +		exit(1);
> +	}
> +	return rc;
> +}
> +
> +#define CHILD_TID1	377
> +#define CHILD_TID2	1177
> +#define CHILD_TID3	2799
> +
> +struct clone_args clone_args;
> +void *child_arg = &clone_args;
> +int child_tid;
> +
> +int do_child(void *arg)
> +{
> +	struct clone_args *cs = (struct clone_args *)arg;
> +	int ctid;
> +
> +	/* Verify we pushed the arguments correctly on the stack... */
> +	if (arg != child_arg)  {
> +		printf("Child: Incorrect child arg pointer, expected %p,"
> +				"actual %p\n", child_arg, arg);
> +		exit(1);
> +	}
> +
> +	/* ... and that we got the thread-id we expected */
> +	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
> +	if (ctid != CHILD_TID1) {
> +		printf("Child: Incorrect child tid, expected %d, actual %d\n",
> +				CHILD_TID1, ctid);
> +		exit(1);
> +	} else {
> +		printf("Child got the expected tid, %d\n", gettid());
> +	}
> +	sleep(2);
> +
> +	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
> +	exit(0);
> +}
> +
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +		unsigned int flags_low, int nr_pids, pid_t *pids_list)
> +{
> +	int rc;
> +	void *stack;
> +	struct clone_args *ca = &clone_args;
> +	int args_size;
> +
> +	stack = setup_stack(child_fn, child_arg, STACKSIZE);
> +
> +	memset(ca, 0, sizeof(*ca));
> +
> +	ca->child_stack		= (u64)(unsigned long)stack;
> +	ca->child_stack_size	= (u64)0;
> +	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
> +	ca->nr_pids		= nr_pids;
> +
> +	args_size = sizeof(struct clone_args);
> +	rc = eclone(flags_low, ca, args_size, pids_list);
> +
> +	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
> +				rc, errno);
> +	return rc;
> +}
> +
> +/*
> + * Multiple pid_t pid_t values in pids_list[] here are just for illustration.
> + * The test case creates a child in the current pid namespace and uses only
> + * the first value, CHILD_TID1.
> + */
> +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
> +int main()
> +{
> +	int rc, pid, status;
> +	unsigned long flags;
> +	int nr_pids = 1;
> +
> +	flags = SIGCHLD|CLONE_CHILD_SETTID;
> +
> +	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
> +
> +	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
> +
> +	rc = waitpid(pid, &status, __WALL);
> +	if (rc < 0) {
> +		printf("waitpid(): rc %d, error %d\n", rc, errno);
> +	} else {
> +		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
> +			 gettid(), rc, status);
> +
> +		if (WIFEXITED(status)) {
> +			printf("\t EXITED, %d\n", WEXITSTATUS(status));
> +		} else if (WIFSIGNALED(status)) {
> +			printf("\t SIGNALED, %d\n", WTERMSIG(status));
> +		}
> +	}
> +	return 0;
> +}
> -- 


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


More information about the Containers mailing list