container disk quota

jeff.liu at oracle.com
Wed May 30 14:58:54 UTC 2012


Hello All,

Following Glauber's comments on container disk quota, it should be bound to the mount
namespace rather than to a cgroup.

In my trials, this approach works fine in combination with the userland quota utilities.
However, IMHO some changes are needed in the user tools as well.

The patchset is still at a very early stage; I am posting it now to get feedback
from you.

Hopefully I can explain my ideas clearly.

Kernel part:
* Container quota can be enabled independently of VFS quota or any particular file system quota.
  Per-user/group quotas are kept in memory instead of being saved in separate files as with general
  quota. There is no need to remount the rootfs inside the container with the usual quota mount
  options; quota can be enabled and disabled directly through quotaon/quotaoff.

* Always honor the underlying file system quota check first, i.e. the exported container quota
  billing routines take effect only after the file system quota check has passed, if both are
  enabled at the same time.  Hence space allocation or inode creation inside a container will
  fail if the outside quota limits are exceeded.

* Make use of the general VFS Q_XXXX quota control flags.

* Add a new disk quota structure, together with its operations, to the mount namespace data
  structure.  It should only be allocated and initialized at the CLONE stage for a container.

* Modify quotactl(2) to check whether it is called from inside a container.
  This is implemented by checking whether the quota device name is "rootfs" (as for an lxc guest)
  or whether the current pid namespace is not the initial one.  If so, do the mount namespace
  quotactl if required; otherwise go to the normal quotactl procedure.

* Introduce a new quota format "QFMT_NS" for containers.  The userland tools use it to identify
  the quota format, so that quotacheck performs the container quota IO initialization and the
  subsequent operations.  This flag is returned when Q_GETQINFO is issued.

* Export a couple of container quota billing routines to the underlying
  file systems that want them.  They take effect if container quota is enabled
  in the kernel configuration; otherwise they are just inline functions without
  much overhead.

* Also, I have not handled a couple of things yet.
  . I think the container quota code should be isolated in Jan's fs/quota/ directory.
  . There are dozens of helper routines in general quota, e.g.
    struct if_dqblk <-> struct fs_disk_quota conversion,
    dquot space and inode billing.
    They could be refactored into shared routines to some extent.
  . quotastats(8) has not been taught to be aware of containers yet.
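
The "file system quota first, container quota second" ordering from the list above
can be sketched as a small userspace model.  This is only an illustration under
stated assumptions: struct quota_acct, quota_charge(), quota_uncharge() and
charge_blocks() are hypothetical names that are not taken from the patchset, a hard
limit of 0 is taken to mean "no limit", and rolling back the outer charge when the
container check fails is an assumption about how the two levels would stay consistent.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accounting record; not a structure from the patchset. */
struct quota_acct {
    bool     enabled;   /* is this quota level switched on? */
    uint64_t used;      /* blocks currently charged */
    uint64_t hardlimit; /* 0 means "no limit" (assumption) */
};

/* Charge nr blocks to one level; refuse if the hard limit would be exceeded. */
static int quota_charge(struct quota_acct *q, uint64_t nr)
{
    if (!q->enabled)
        return 0;
    if (q->hardlimit &&
        (q->used > q->hardlimit || nr > q->hardlimit - q->used))
        return -EDQUOT;
    q->used += nr;
    return 0;
}

/* Undo a charge that was previously accepted by quota_charge(). */
static void quota_uncharge(struct quota_acct *q, uint64_t nr)
{
    if (!q->enabled)
        return;
    assert(q->used >= nr);
    q->used -= nr;
}

/*
 * The file system quota is checked first; the container (mount namespace)
 * quota is consulted only once the outer check has passed.  If the
 * container check fails, the outer charge is rolled back (assumption).
 */
static int charge_blocks(struct quota_acct *fs, struct quota_acct *ns,
                         uint64_t nr)
{
    int ret = quota_charge(fs, nr);

    if (ret)
        return ret;
    ret = quota_charge(ns, nr);
    if (ret)
        quota_uncharge(fs, nr);
    return ret;
}
```

With both levels enabled, a request fails with -EDQUOT as soon as either level is
out of room, and a failed request leaves both counters unchanged.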

Changes in quota userland utility:
* Introduce a new quota format string "lxc" to all quota control utilities, to
  let each utility know that the user wants to run container quota control, e.g.:
  quotacheck -cvugm -F "lxc" /
  quotaon -u -F "lxc" /
  ....

* Currently, I manually create the underlying device (by editing the cgroup
  device access list and running mknod /dev/sdaX x x) for the rootfs
  inside containers, so that the mount point caching routine succeeds when
  quotacheck is executed against the "/" directory.  Actually, this step can be
  omitted here.

* Add a new quotaio_lxc.c[.h] for container quota IO.  It is basically the same as
  the VFS quotaio logic; I just want to isolate the container code here.
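
One way a utility could map the new "lxc" format string onto a format identifier is
a small lookup table.  The sketch below is illustrative only: enum example_fmt,
fmt_table and name_to_fmt() are hypothetical names, the enum values are placeholders
(the real value of QFMT_NS is whatever the patchset defines), and the table is not
the quota tools' actual one.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder identifiers for illustration; not the real QFMT_* values. */
enum example_fmt {
    EX_FMT_UNKNOWN = -1,
    EX_FMT_VFSOLD,
    EX_FMT_VFSV0,
    EX_FMT_NS           /* stands in for the proposed QFMT_NS */
};

struct fmt_entry {
    const char      *name;
    enum example_fmt fmt;
};

static const struct fmt_entry fmt_table[] = {
    { "vfsold", EX_FMT_VFSOLD },
    { "vfsv0",  EX_FMT_VFSV0 },
    { "lxc",    EX_FMT_NS },    /* new container format string */
};

/* Translate the argument of -F into a format id, or EX_FMT_UNKNOWN. */
static enum example_fmt name_to_fmt(const char *name)
{
    size_t i;

    if (name == NULL)
        return EX_FMT_UNKNOWN;
    for (i = 0; i < sizeof(fmt_table) / sizeof(fmt_table[0]); i++)
        if (strcmp(fmt_table[i].name, name) == 0)
            return fmt_table[i].fmt;
    return EX_FMT_UNKNOWN;
}
```

For example, name_to_fmt("lxc") yields EX_FMT_NS, and an unrecognized string
yields EX_FMT_UNKNOWN so the utility can reject it early.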

Issues:
* How to detect, in a reasonable way, that quotactl(2) is called from a container.

* Do we need to make container quota work for a cgroup combined with unshare(1)?
  Right now the patchset mainly works for lxc guests.  IMHO, it can be used outside
  a guest if the user wants.  In that case, the quota limits can take effect
  across different underlying file systems if they have exported the quota billing
  routines.

* As the config option for printing warning info to the TTY has been marked
  obsolete, do we still need to support it?

* The warning info format for sending it through the netlink interface.
  VFS quota fills a device parameter into the warning; how should we define the
  format for containers?

* The hash table definitions (hash table size) for dquot caching of each type
  follow kernel/user.c; maybe it is better to define a separate array for
  performance.  Of course, that all depends on my current implementation being
  on the right track. :)
 
* Container quota statistics: should they be calculated and exposed in
  /proc/fs/quota?  If the underlying file system also has quotas enabled, they
  would be mixed up, so how about adding a new proc file like "ns_quota" there?

* Memory reclaim triggered by kswapd.
  As all dquots are cached in memory, when the user executes quotaoff maybe
  I need to handle the case where quota is disabled but the dquots are still
  kept in memory.  Also, add another routine that disables quota and removes
  all dquots from memory, to free the memory directly.

* Project quota (i.e. tree quota) support.
  The current implementation has no project quota support, but it would not
  be complex to add on top of the current code; adding a new parameter to
  ns_dquot_alloc_block(), etc. should be enough.
  However, XFS supports project quota setup through the xfs tools, and I have
  seen a patchset for this feature on the ext4 mailing list.  Is it possible
  to provide a unified interface and implementation in the quota tools in the
  future?
  AFAICS, project quota can be set up in a container, because we can
  fetch the super block from the path passed in.  Hence the ioctl(2) needed
  by the underlying file system can be invoked.

* Security checks for mount namespace quotactl(2).
  In this version, I only do a basic security check to see whether the caller
  has the proper permissions.  I think I must be missing a lot on this
  point.
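
For the detection question at the top of this list, the check as currently described
(device name "rootfs", or a pid namespace other than the initial one) can be modeled
as a pure decision function.  This is a userspace sketch, not the kernel code:
enum quotactl_path and pick_quotactl_path() are hypothetical names, and the
in_init_pid_ns flag stands in for whatever test the kernel would do on the calling
task.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Which quotactl implementation a call should be routed to. */
enum quotactl_path {
    QUOTACTL_NORMAL,    /* regular VFS quotactl procedure */
    QUOTACTL_MNT_NS     /* mount namespace (container) quotactl */
};

/*
 * Treat the call as coming from a container when the quota device is
 * named "rootfs" or when the caller is not in the initial pid namespace;
 * otherwise fall through to the normal procedure.
 */
static enum quotactl_path pick_quotactl_path(const char *special,
                                             bool in_init_pid_ns)
{
    if (special != NULL && strcmp(special, "rootfs") == 0)
        return QUOTACTL_MNT_NS;
    if (!in_init_pid_ns)
        return QUOTACTL_MNT_NS;
    return QUOTACTL_NORMAL;
}
```

Both inputs are heuristics rather than a definite "this is a container" signal,
which is why a more reasonable detection method is still an open issue.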

Testing:
The patchset currently lacks tests; I have only done a few checks to make sure the
basic operations work.

First of all, we need to invoke quotacheck with the "--no-remount" option (-m),
since the rootfs inside the container guest cannot be remounted:
root@debian:~/# quotacheck -cvugm -F "lxc" /
quotacheck: quotacheck: Scanning rootfs [/] done
quotacheck: Old user file name could not been determined. Usage will not be subtracted.
quotacheck: Old group file name could not been determined. Usage will not be subtracted.
quotacheck: Old user file name could not been determined. Usage will not be subtracted.
quotacheck: Old group file name could not been determined. Usage will not be subtracted.
quotacheck: Checked 3370 directories and 39434 files

By default, user/group quota is off:
root@debian:~/# quotaon -u -F "lxc" -p /
user quota on / (rootfs) is off

root@debian:~/# quotaon -g -F "lxc" -p /
group quota on / (rootfs) is off

Turn them on:
root@debian:~/# quotaon -u -F "lxc" /
root@debian:~/# quotaon -g -F "lxc" /
root@debian:~/# quotaon -u -F "lxc" -p /
user quota on / (rootfs) is on
root@debian:~/# quotaon -g -F "lxc" -p /
group quota on / (rootfs) is on

Edit the quota.  The soft/hard limits for both space and inodes are zero by default;
configure them to the desired values:
root@debian:~/# edquota -u -F "lxc" /
Disk quotas for user jeff (uid 1000):
  Filesystem                   blocks       soft       hard     inodes     soft     hard
  rootfs                      2025740    2025840    2026000      42786    42790    42800

The configuration is saved properly:
root@debian:~/# repquota -u -F "lxc" /
Block grace time: 00:00; Inode grace time: 00:00
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      44       0       0             20     0     0       
jeff      -- 2025740 2025840 2026000          42786 42790 42800  

Check the block and inode limits:
root@debian:~/# su - jeff
jeff@debian:/$ dd if=/dev/zero of=abc bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 1.19014 s, 8.8 MB/s
root@debian:~/# repquota -u -F "lxc" /
Jeff *** report() type=0 handle index=0
*** Report for user quotas on device rootfs
Block grace time: 00:00; Inode grace time: 00:00
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      44       0       0             20     0     0       
jeff      +- 2025980 2025840 2026000  7days   42786 42790 42800       

root@debian:~/# repquota -g -F "lxc" /
*** Report for group quotas on device rootfs
Block grace time: 00:00; Inode grace time: 00:00
                        Block limits                File limits
Group           used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --    8564       0       0            390     0     0       
adm       --     220       0       0              6     0     0       
tty       --       0       0       0              1     0     0       
utmp      --       4       0       0              1     0     0       
jeff      -- 2021268       0       0          42716     0     0  

root@debian:~/# su - jeff
jeff@debian:/$ dd if=/dev/zero of=test_space bs=1M count=100
dd: writing `test_space': Disk quota exceeded
11+0 records in
10+0 records out
10506240 bytes (11 MB) copied, 1.24721 s, 8.4 MB/s

root@debian:~/# repquota -u -F "lxc" /
Jeff *** report() type=0 handle index=0
*** Report for user quotas on device rootfs
Block grace time: 00:00; Inode grace time: 00:00
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      44       0       0             20     0     0       
jeff      +- 2026000 2025840 2026000  7days   42786 42790 42800   

root@debian:~/# su - jeff
jeff@debian:/$ for ((i=0; i<20; i++)); do touch test_file_cnt.$i; done
touch: cannot touch `test_file_cnt.14': Disk quota exceeded
touch: cannot touch `test_file_cnt.16': Disk quota exceeded
touch: cannot touch `test_file_cnt.18': Disk quota exceeded

root@debian:~/# repquota -u -F "lxc" /
Block grace time: 00:00; Inode grace time: 00:00
                        Block limits                File limits
User            used    soft    hard  grace    used  soft  hard  grace
----------------------------------------------------------------------
root      --      44       0       0             20     0     0       
jeff      ++ 2026000 2025840 2026000  6days   42800 42790 42800  7days
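
The two-character column after each name in the reports above can be read as one flag
for blocks followed by one for inodes, where '+' means usage is above a non-zero soft
limit.  A minimal sketch of that reading, plus the remaining room under a hard limit,
is given below; over_soft() and headroom() are hypothetical helper names and are not
functions from the quota tools.

```c
#include <stdint.h>

/* '+' when usage is above a non-zero soft limit, '-' otherwise. */
static char over_soft(uint64_t used, uint64_t soft)
{
    return (soft != 0 && used > soft) ? '+' : '-';
}

/*
 * How many more units (blocks or inodes) fit under the hard limit.
 * A hard limit of 0 means "no limit" and is reported as UINT64_MAX.
 */
static uint64_t headroom(uint64_t used, uint64_t hard)
{
    if (hard == 0)
        return UINT64_MAX;
    return used >= hard ? 0 : hard - used;
}
```

With the figures from the reports, jeff starts at 42786 inodes against a hard limit of
42800, so headroom(42786, 42800) is 14.  That matches the first touch failure at
test_file_cnt.14, after files 0 to 13 were created.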

Any comments are appreciated, have a nice day!

-Jeff

