[v11][PATCH 9/9] Document clone_with_pids() syscall

Sukadev Bhattiprolu sukadev at us.ibm.com
Wed Nov 4 21:42:04 PST 2009


From: Sukadev Bhattiprolu <sukadev at us.ibm.com>
Subject: [v11][PATCH 9/9] Document clone_with_pids() syscall

This gives a brief overview of the clone_with_pids() system call.  We should
eventually describe more details in the existing clone(2) man page or in
a new man page.

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-independent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
	- [Oren Laadan] Fix some typos and clarify the order of pids in the
	  @pids parameter.

Changelog[v10]:
	- Rename clone3() to clone_with_pids() and fix some typos.
	- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the updated/
	  renamed 'struct clone_args'.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Signed-off-by: Sukadev Bhattiprolu <sukadev at us.ibm.com>
Acked-by: Oren Laadan  <orenl at cs.columbia.edu>
---
 Documentation/clone_with_pids |  332 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 332 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/clone_with_pids

diff --git a/Documentation/clone_with_pids b/Documentation/clone_with_pids
new file mode 100644
index 0000000..80e9b20
--- /dev/null
+++ b/Documentation/clone_with_pids
@@ -0,0 +1,332 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack_base;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+	u64 reserved1;
+};
+
+
+clone_with_pids(u32 flags_low, struct clone_args * __user cargs,
+		int cargs_size, pid_t * __user pids)
+
+	In addition to doing everything that the clone() system call does,
+	the clone_with_pids() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows the user to specify a pid for the child process in
+		  its active and ancestor pid namespaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint.  Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in the existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high:
+
+		When clone_with_pids() supports more than 32 clone flags, the
+		additional bits in the clone_flags should be specified in this
+		field.  This field is currently unused and must be set to 0.
+
+	u64 child_stack_base;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' parameters
+		of the clone() and clone2() (on IA64) system calls.
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' parameters of the clone() system call.
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to clone_with_pids() (see below). nr_pids must
+		not exceed the current nesting level of the calling process
+		(i.e., if the process is in init_pid_ns, nr_pids cannot
+		exceed 1; if the process is in a pid namespace that is a
+		child of init_pid_ns, nr_pids cannot exceed 2, and so on).
+
+	u32 reserved0;
+	u64 reserved1;
+
+		These fields are intended to extend the functionality of
+		clone_with_pids() in the future, while preserving backward
+		compatibility. They must be set to 0 for now.
+
+	The @cargs_size parameter specifies the size of 'struct clone_args'
+	and is intended to enable extending this structure in the future
+	while preserving backward compatibility.  For now, this parameter
+	must be set to sizeof(struct clone_args), and this size must match
+	the kernel's view of the structure.
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid namespaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See the CLONE_NEWPID section of the clone(2) man page for details
+	about pid namespaces.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	The pids in @pids are ordered from the oldest pid namespace, in
+	pids[0], to the youngest, in pids[nr_pids-1]. If @pids specifies
+	fewer pids than the nesting level of the process, the pids are
+	applied starting from the youngest namespace, i.e., if the process is
+	nested in a level-6 pid namespace and @pids specifies only 3 pids, the
+	3 pids are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed
+	to have a pid of '0' (the kernel will assign a pid in those namespaces).
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, clone_with_pids() returns -1 and sets 'errno' to one of
+	the following values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
+		specify the pids in this call (if pids are not specified,
+		CAP_SYS_ADMIN is not required).
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of the parent process.
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+
+	EBUSY	A requested pid is in use by another process in that namespace.
+
+---
+/*
+ * Example clone_with_pids() usage - With nr_pids = 1, create a child with pid
+ * CHILD_TID1 in the caller's pid namespace. With nr_pids = 2 in a child of
+ * init_pid_ns, request CHILD_TID1 in init_pid_ns and CHILD_TID2 in the child.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+
+#define __NR_clone_with_pids	337
+#define CLONE_NEWPID            0x20000000
+#define CLONE_CHILD_SETTID      0x01000000
+#define CLONE_PARENT_SETTID     0x00100000
+#define CLONE_UNUSED		0x00001000
+
+#define STACKSIZE	8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+        u64 clone_flags_high;
+
+        u64 child_stack_base;
+        u64 child_stack_size;
+
+        u64 parent_tid_ptr;
+        u64 child_tid_ptr;
+
+        u32 nr_pids;
+
+        u32 reserved0;
+        u64 reserved1;
+};
+
+#define exit		_exit
+
+/*
+ * Following clone_with_pids() is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_clone_with_pids)
+
+int clone_with_pids(int flags_low, struct clone_args *clone_args, int args_size,
+		int *pids)
+{
+	long retval;
+
+	__asm__  __volatile__(
+		 "movl %0, %%ebx\n\t"		/* flags -> 1st (ebx) */
+		 "movl %1, %%ecx\n\t"		/* clone_args -> 2nd (ecx)*/
+		 "movl %2, %%edx\n\t"		/* args_size -> 3rd (edx) */
+		 "movl %3, %%edi\n\t"		/* pids -> 4th (edi)*/
+		 "pushl %%ebp\n\t"		/* save value of ebp */
+		:
+		:"b" (flags_low),
+		 "c" (clone_args),
+		 "d" (args_size),
+		 "D" (pids)
+		);
+
+	__asm__ __volatile__(
+		 "int $0x80\n\t"	/* Linux/i386 system call */
+		 "testl %0,%0\n\t"	/* check return value */
+		 "jne 1f\n\t"		/* jump if parent */
+		 "popl %%ebx\n\t"	/* get subthread function */
+		 "call *%%ebx\n\t"	/* start subthread function */
+		 "movl %2,%0\n\t"
+		 "int $0x80\n"		/* exit system call: exit subthread */
+		 "1:\n\t"
+		 "popl %%ebp\t"		/* restore parent's ebp */
+		:"=a" (retval)
+		:"0" (__NR_clone_with_pids), "i" (__NR_exit)
+		:"ebx", "ecx", "edx"
+		);
+
+	if (retval < 0) {
+		errno = -retval;
+		retval = -1;
+	}
+	return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg)
+{
+	void *child_stack;
+	void **new_stack;
+
+	child_stack = malloc(STACKSIZE);
+	if (!child_stack) {
+		perror("malloc()");
+		exit(1);
+	}
+	child_stack = (char *)child_stack + (STACKSIZE - 4);
+
+	new_stack = (void **)child_stack;
+	*--new_stack = child_arg;
+	*--new_stack = child_fn;
+
+	return new_stack;
+}
+
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+	int rc;
+
+	rc = syscall(__NR_gettid, 0, 0, 0);
+	if (rc < 0) {
+		printf("rc %d, errno %d\n", rc, errno);
+		exit(1);
+	}
+	return rc;
+}
+
+#define CHILD_TID1	377
+#define CHILD_TID2	25
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+	struct clone_args *cs = (struct clone_args *)arg;
+	int ctid;
+
+	/* Verify we pushed the arguments correctly on the stack... */
+	if (arg != child_arg)  {
+		printf("Child: Incorrect child arg pointer, expected %p,"
+				" actual %p\n", child_arg, arg);
+		exit(1);
+	}
+
+	/* ... and that we got the thread-id we expected */
+	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
+	if (ctid != CHILD_TID1) {
+		printf("Child: Incorrect child tid, expected %d, actual %d\n",
+				CHILD_TID1, ctid);
+		exit(1);
+	}
+	sleep(3);
+
+	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+	exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg,
+		unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+	int rc;
+	void *stack;
+	struct clone_args *ca = &clone_args;
+	int args_size;
+
+	stack = setup_stack(child_fn, child_arg);
+
+	memset(ca, 0, sizeof(*ca));
+
+	ca->child_stack_base	= (u64)(unsigned long)stack;
+	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
+	ca->nr_pids		= nr_pids;
+
+	args_size = sizeof(struct clone_args);
+	rc = clone_with_pids(flags_low, ca, args_size, pids_list);
+
+	printf("[%d, %d]: clone_with_pids() returned %d, error %d\n",
+		 getpid(), gettid(), rc, errno);
+
+	return rc;
+}
+
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2 };
+int main(void)
+{
+	int rc, pid, status;
+	unsigned long flags;
+	int nr_pids = 1;
+
+	flags = SIGCHLD|CLONE_PARENT_SETTID|CLONE_CHILD_SETTID;
+
+	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+	rc = waitpid(pid, &status, __WALL);
+	if (rc < 0) {
+		printf("waitpid(): rc %d, error %d\n", rc, errno);
+	} else {
+		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+			 gettid(), rc, status);
+
+		if (WIFEXITED(status)) {
+			printf("\t EXITED, %d\n", WEXITSTATUS(status));
+		} else if (WIFSIGNALED(status)) {
+			printf("\t SIGNALED, %d\n", WTERMSIG(status));
+		}
+	}
+}
-- 
1.6.0.4
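
The structural rules the document places on 'struct clone_args' (reserved
fields and clone_flags_high must be 0, @cargs_size must equal
sizeof(struct clone_args), nr_pids is bounded by the nesting level) can be
sketched in user space as below. This is only an illustration under stated
assumptions: check_clone_args() and its ns_level argument are hypothetical
names that do not exist in the patch, and returning -EINVAL for a size
mismatch or a non-zero clone_flags_high is an assumption, since the document
only lists EINVAL for bad flags, reserved fields and nr_pids.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout documented in Documentation/clone_with_pids. */
struct clone_args {
	uint64_t clone_flags_high;
	uint64_t child_stack_base;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
	uint64_t reserved1;
};

/*
 * Hypothetical helper (not part of the patch): return 0 if @ca passes the
 * documented structural checks, -EINVAL otherwise. @ns_level is the caller's
 * pid namespace nesting level, with 0 meaning init_pid_ns.
 */
static int check_clone_args(const struct clone_args *ca, int cargs_size,
			    unsigned int ns_level)
{
	/* Assumption: a size mismatch is reported as EINVAL. */
	if (cargs_size != (int)sizeof(*ca))
		return -EINVAL;

	/* Currently unused and reserved fields must all be 0. */
	if (ca->clone_flags_high || ca->reserved0 || ca->reserved1)
		return -EINVAL;

	/* At most one pid per namespace from init_pid_ns down to the caller. */
	if (ca->nr_pids > ns_level + 1)
		return -EINVAL;

	return 0;
}
```

A caller would typically memset() the structure to 0, fill in the stack and
tid pointers plus nr_pids, and run a check like this before issuing the
system call, so that a failure can be attributed to the pids themselves
(EPERM, EBUSY) and not to the layout of the argument block.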

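The ordering rule for @pids (pids[0] belongs to the oldest namespace covered,
pids[nr_pids-1] to the caller's own, and any older levels default to 0) can
be sketched as a small mapping from array index to nesting level.
pids_by_level() is a hypothetical helper written for this note, not code
from the patch, and it only models the rule as the document states it.

```c
/*
 * Hypothetical helper (not part of the patch): fill @out[0..ns_level] with
 * the pid requested at each nesting level, where index 0 is init_pid_ns and
 * index @ns_level is the caller's active pid namespace. Levels not covered
 * by @pids get 0, meaning "let the kernel pick the next available pid".
 * Returns -1 if @nr_pids exceeds the number of levels, 0 otherwise.
 */
static int pids_by_level(const int *pids, unsigned int nr_pids,
			 unsigned int ns_level, int *out)
{
	unsigned int nr_levels = ns_level + 1;
	unsigned int first, i;

	if (nr_pids > nr_levels)
		return -1;

	/* Oldest level that receives an explicit pid from @pids. */
	first = nr_levels - nr_pids;

	for (i = 0; i < nr_levels; i++)
		out[i] = (i < first) ? 0 : pids[i - first];

	return 0;
}
```

For the level-6 case used in the document, passing three pids fills levels
4, 5 and 6 in that order and leaves levels 0 through 3 at 0.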