[cgl_discussion] [cgl_valid] Simulating a system failure to force a filesystem recovery

Lynch, Rusty rusty.lynch at intel.com
Wed Aug 7 10:52:58 PDT 2002


If at all possible, the test case needs to be implementation agnostic, so
this raises a few questions about what a "resilient" filesystem really means.

What exactly should be guaranteed by a resilient filesystem?
------------------------------------------------------------

The different implementations of journaling filesystems guarantee different
things, and in some cases the level of guaranteed recovery is configurable
(where greater performance is traded off for a safer level of file
recovery.)  So if the system were to crash while writing a bunch of data to
disk:

* Should we be guaranteed that all of the data from each of the write system
calls that returned is on disk with no corruption?
* Should we be guaranteed that only the files that were not opened (or maybe
not being written to) at the time of the crash are on disk with no
corruption?
* Something else?
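
One way to make the first question testable, as a rough sketch only: have
the workload record every write that returned, then after the reboot check
that those bytes are really in the file.  The names logged_write and
verify_acknowledged are made up for illustration.  The sketch assumes the
acknowledged-write log is kept somewhere that survives the crash (another
machine, a serial console capture) and that the logged writes do not
overlap.

```python
import hashlib
import os


def logged_write(fd, offset, data, ack_log):
    """Write data at offset and, only after the call returns, log it."""
    written = os.pwrite(fd, data, offset)
    ack_log.append({
        "offset": offset,
        "length": written,
        "sha256": hashlib.sha256(data[:written]).hexdigest(),
    })
    return written


def verify_acknowledged(path, ack_log):
    """Return the logged writes whose bytes are missing or different."""
    try:
        f = open(path, "rb")
    except FileNotFoundError:
        # The whole file is gone, so every acknowledged write was lost.
        return list(ack_log)
    bad = []
    with f:
        for entry in ack_log:
            f.seek(entry["offset"])
            data = f.read(entry["length"])
            if hashlib.sha256(data).hexdigest() != entry["sha256"]:
                bad.append(entry)
    return bad
```

An empty result from verify_acknowledged after recovery would correspond to
the strongest guarantee in the list above; a non-empty one shows which
acknowledged writes did not survive.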

How slow is too slow for disk recovery after a crash?
------------------------------------------------------

I know we talk about fsck in the requirements, but isn't that really an
implementation detail?  I could write the Lame File System (LFS) and have it
do an fsck type of recovery (i.e. O(n) in the size of the filesystem) after
a crash.  It wouldn't be fsck, but it most definitely would not meet our
requirements.

It seems like what we are really talking about is a maximum time for
recovery, or maybe a maximum time for so many interrupted write operations.
One simple test case would involve timing a startup after a graceful
shutdown and then comparing that time to a startup after an ungraceful
shutdown (with all kinds of file operations happening during the crash).
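
A minimal sketch of that comparison, assuming the startup times have already
been measured somehow (how to measure them is left open here) and that the
requirement is expressed as a maximum number of extra seconds.  The function
names and the max_extra_s budget are hypothetical; the requirements do not
give a number.

```python
import statistics


def recovery_overhead(clean_boot_s, crash_boot_s):
    """Median extra seconds a startup takes after an ungraceful shutdown."""
    return statistics.median(crash_boot_s) - statistics.median(clean_boot_s)


def meets_budget(clean_boot_s, crash_boot_s, max_extra_s):
    """True if the post-crash startup penalty stays within the budget."""
    return recovery_overhead(clean_boot_s, crash_boot_s) <= max_extra_s
```

Using the median of several runs rather than a single pair of boots keeps
one unusually slow startup from deciding the result.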

	-rusty

-----Original Message-----
From: Randy.Dunlap [mailto:rddunlap at osdl.org]
Sent: Wednesday, August 07, 2002 10:24 AM
To: Andy Pfiffer
Cc: Fleischer, Julie N; 'cgl_discussion at osdl.org'
Subject: Re: [cgl_discussion] [cgl_valid] Simulating a system failure to
force a filesystem recovery


On 7 Aug 2002, Andy Pfiffer wrote:

| On Wed, 2002-08-07 at 09:55, Fleischer, Julie N wrote:
| > Validation -
| > As part of testing a resilient file system, I want a test case where I am
| > sure that I have simulated a system failure so that on startup fsck (I
| > believe) must be performed.  In addition, it would be even better if that
| > fsck could have to repair something (i.e., the system failure happened in
| > the middle of a logical write).
| >
| > Does anyone know how I can do this reliably?
|
| As far as triggering an fsck, for non-journaled filesystems that are
| listed in /etc/fstab and automatically mounted on reboot, all you need
| to do is use reboot(2) with LINUX_REBOOT_CMD_RESTART without a previous
| unmount.
|
| You could probably arrange to reliably cause enough dirty state to be
| stuck in the buffer cache that some form of repair would always be
| attempted.
|
| You might try this: create a new directory, and in that new directory,
| randomly create, write, rewrite, rename, and unlink a few hundred files
| and directories. Make sure it runs for a few seconds (like 3), and then
| call reboot().
|
| Safety tip: don't do this on an ext2-based filesystem that you expect to
| be sane when the system reboots. ;^)
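
The recipe quoted above could be sketched roughly as follows.  This is an
illustration under assumptions, not a tested validation tool: churn and
crash_reboot are made-up names, the workload only touches regular files (no
subdirectories), and crash_reboot calls the C library's reboot() wrapper
through ctypes, which needs root.  It restarts the machine without syncing
or unmounting, so only run it on a system you are prepared to damage.

```python
import ctypes
import os
import random
import time

LINUX_REBOOT_CMD_RESTART = 0x01234567  # value from <linux/reboot.h>


def churn(root, seconds=3.0, max_files=300, seed=None):
    """Randomly create, rewrite, rename and unlink files under root."""
    rng = random.Random(seed)
    os.makedirs(root, exist_ok=True)
    live = []    # paths that currently exist
    serial = 0   # used to generate unique file names
    ops = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        action = rng.choice(("create", "rewrite", "rename", "unlink"))
        if not live or (action == "create" and len(live) < max_files):
            serial += 1
            path = os.path.join(root, "f%06d" % serial)
            with open(path, "wb") as f:
                f.write(os.urandom(rng.randint(1, 8192)))
            live.append(path)
        elif action == "rename":
            serial += 1
            i = rng.randrange(len(live))
            new_path = os.path.join(root, "f%06d" % serial)
            os.rename(live[i], new_path)
            live[i] = new_path
        elif action == "unlink":
            os.unlink(live.pop(rng.randrange(len(live))))
        else:
            # "rewrite", or "create" once max_files has been reached
            with open(rng.choice(live), "wb") as f:
                f.write(os.urandom(rng.randint(1, 8192)))
        ops += 1
    return ops, live


def crash_reboot():
    """Restart immediately: no sync, no unmount.  Requires root."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.reboot(LINUX_REBOOT_CMD_RESTART) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Calling churn() on a directory in the filesystem under test and then
crash_reboot() right away is meant to leave dirty data in the buffer cache
at the moment of the restart.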

For ext2/3fs, you could also use debugfs to muck with the filesystem
metadata...
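
As a hedged illustration of the debugfs idea: the sketch below builds a
debugfs command that clears one inode and an e2fsck command that checks the
result without repairing it.  The helper names, the choice of the clri
request, and the image and file paths are assumptions for the example; it is
meant for a scratch image or spare partition, never a filesystem you care
about.

```python
import subprocess


def debugfs_clear_inode_cmd(image, path_in_fs):
    """argv for debugfs: open read-write and clear the inode of one file."""
    return ["debugfs", "-w", "-R", "clri " + path_in_fs, image]


def e2fsck_check_cmd(image):
    """argv for e2fsck: force a full check, answer 'no' to every repair."""
    return ["e2fsck", "-f", "-n", image]


def damage_and_check(image, path_in_fs):
    """Damage the metadata, then return e2fsck's exit status."""
    subprocess.run(debugfs_clear_inode_cmd(image, path_in_fs), check=True)
    # A nonzero status is expected if e2fsck notices the damage.
    return subprocess.run(e2fsck_check_cmd(image)).returncode
```

Unlike the reboot approach, this damages a known piece of metadata on an
unmounted filesystem, so the repair step has something specific to find.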

-- 
~Randy

_______________________________________________
cgl_discussion mailing list
cgl_discussion at lists.osdl.org
http://lists.osdl.org/mailman/listinfo/cgl_discussion


