[cgl_discussion] [cgl_valid] Simulating a system failure to force a filesystem recovery

Fleischer, Julie N julie.n.fleischer at intel.com
Wed Aug 7 11:11:47 PDT 2002


I understand these concerns.

Here's what I was planning to test, which should cover the corresponding
requirement in the CGL Requirements document.

What should be guaranteed by a resilient filesystem?
====================================================
==>I was going to test for no corruption, since all transactions should be
logged before they are applied, so that they can always be replayed on
recovery.  Maybe someone else can offer more input here if this is in error.
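
To make that concrete, something along the lines of the sketch below is what
I'm picturing (completely untested; the directory, file count, size, and
pattern are just placeholders).  In "write" mode it fills files with a known
pattern, fsync()s each one, and only records a file in a manifest once its
data is safely on disk; after the simulated crash and recovery, "check" mode
verifies that every file named in the manifest still holds the pattern.

/* corruption_check.c -- minimal sketch, not a real harness.  "write"
 * mode fills files with a known pattern, fsync()s each one, and only
 * then records its name in a manifest; "check" mode, run after the
 * crash and recovery, verifies every file named in the manifest still
 * holds the pattern.  Directory, counts, and sizes are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define NFILES  200
#define FSIZE   4096
#define TESTDIR "/mnt/test"             /* hypothetical mount point */

int main(int argc, char **argv)
{
    char buf[FSIZE], path[256];
    int i, fd;

    memset(buf, 0xA5, sizeof(buf));     /* the known pattern */

    if (argc > 1 && strcmp(argv[1], "write") == 0) {
        FILE *mf = fopen(TESTDIR "/manifest", "a");
        if (!mf)
            exit(1);
        for (i = 0; i < NFILES; i++) {
            snprintf(path, sizeof(path), TESTDIR "/data.%d", i);
            fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0 || write(fd, buf, FSIZE) != FSIZE || fsync(fd) != 0)
                exit(1);
            close(fd);
            fprintf(mf, "%s\n", path);  /* record only completed files */
            fflush(mf);
            fsync(fileno(mf));
        }
        fclose(mf);
    } else {                            /* "check" mode after recovery */
        char line[256], got[FSIZE];
        FILE *mf = fopen(TESTDIR "/manifest", "r");
        while (mf && fgets(line, sizeof(line), mf)) {
            line[strcspn(line, "\n")] = '\0';
            fd = open(line, O_RDONLY);
            if (fd < 0 || read(fd, got, FSIZE) != FSIZE ||
                memcmp(got, buf, FSIZE) != 0) {
                printf("CORRUPT or MISSING: %s\n", line);
                exit(1);
            }
            close(fd);
        }
        printf("all recorded files intact\n");
    }
    return 0;
}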

How slow is too slow for disk recovery?
=======================================
==>Here's where I was planning on making my timings.  I was treating a
resilient filesystem as taking O(1) time to recover, i.e., just enough time
to complete (or roll back) the transactions that were in flight when the
crash hit.
A non-resilient filesystem has to examine all the metadata on disk to
recover, so it takes O(n) time, where n is proportional to the filesystem
size.
So I'd need to test on a filesystem large enough that I can detect this
difference.
Basically, then, this is just a test that the resilient filesystem
implements what it claims to implement, nothing more about how fast it
really implements it.  For the 1.0 requirements document, this seemed
acceptable.
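
For the timing piece, a rough sketch of what I had in mind is below
(untested; the device, mount point, and fs type are placeholders).  On a
journaling filesystem the log replay happens when the partition is mounted,
so timing mount(2) after the crash is a reasonable proxy for recovery time;
for the non-resilient comparison you would time the fsck pass the boot
scripts run before mounting instead.  Running it against a small and a much
larger filesystem should show whether recovery time stays roughly constant
or grows with filesystem size.

/* time_recovery.c -- minimal sketch: time how long the post-crash mount
 * of the test partition takes.  On a journaling filesystem the log
 * replay happens as part of this mount, so the elapsed time is a usable
 * proxy for recovery time; for a non-resilient filesystem you would
 * time the fsck pass instead.  Device, mount point, and fs type are
 * placeholders; must run as root. */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/time.h>

int main(void)
{
    struct timeval t0, t1;
    double elapsed;

    gettimeofday(&t0, NULL);
    if (mount("/dev/hdb1", "/mnt/test", "ext3", 0, NULL) != 0) {
        perror("mount");
        return 1;
    }
    gettimeofday(&t1, NULL);

    elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("mount (including recovery) took %.3f seconds\n", elapsed);
    return 0;
}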

fsck note
=========
Note that I also agree the requirements document wording may be incorrect.
Somehow, the filesystem needs to be checked on reboot.  The requirements say
fsck will not be used, but I don't think that matters.  In fact, fsck
*could* be used.  The difference is that with a resilient fs, fsck will take
O(1) time, and with a non-resilient one, O(n).

- Julie

-----Original Message-----
From: Lynch, Rusty [mailto:rusty.lynch at intel.com]
Sent: Wednesday, August 07, 2002 10:53 AM
To: 'cgl_discussion at osdl.org'
Subject: RE: [cgl_discussion] [cgl_valid] Simulating a system failure to
force a filesystem recovery


If at all possible, the test case needs to be implementation-agnostic, so
this raises a few questions about what a "resilient" filesystem really means.

What exactly should be guaranteed by a resilient filesystem?
------------------------------------------------------------

The different implementations of journaling filesystems guarantee different
things, and in some cases the level of guaranteed recovery is configurable
(where greater performance is traded off for a safer level of file
recovery.)  So if the system were to crash while writing a bunch of data to
disk:

* Should we be guaranteed that all of the data from each of the write system
calls that returned is on disk with no corruption?
* Should we be guaranteed that only the files that were not opened (or maybe
not being written to) at the time of the crash are on disk with no
corruption?
* Something else?

How slow is too slow for disk recovery after a crash?
------------------------------------------------------

I know we talk about fsck in the requirements, but isn't that really an
implementation detail?  I could write the Lame File System (LFS) and have it
do an fsck-style recovery (i.e., O(n)) after a crash.  It wouldn't be fsck,
but it most definitely would not meet our requirements.

It seems like what we are really talking about is a maximum time for
recovery, or maybe a maximum time to recover from some number of interrupted
write operations.  One simple test case would involve timing a startup after
a graceful shutdown and then comparing that time to a startup after an
ungraceful shutdown (with all kinds of file operations happening during the
crash).
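
One cheap way to get that number might be a small probe run early in the
boot sequence, right after the filesystems come up, that just logs
/proc/uptime -- something like the sketch below (untested; the log path is a
placeholder).  Comparing the value logged after a clean reboot with the
value logged after a crash gives the graceful-vs-ungraceful difference.

/* boot_probe.c -- rough sketch: run this from an init script right
 * after the filesystems are mounted; it appends the current
 * /proc/uptime reading to a log.  Comparing the value logged after a
 * clean reboot with the value after a crash gives the graceful vs.
 * ungraceful difference.  The log path is a placeholder. */
#include <stdio.h>

int main(void)
{
    double up = 0.0;
    FILE *p = fopen("/proc/uptime", "r");
    FILE *log = fopen("/var/log/boot-times.log", "a");

    if (!p || !log || fscanf(p, "%lf", &up) != 1)
        return 1;
    fprintf(log, "filesystems ready %.2f seconds after boot\n", up);
    fclose(log);
    fclose(p);
    return 0;
}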

	-rusty

-----Original Message-----
From: Randy.Dunlap [mailto:rddunlap at osdl.org]
Sent: Wednesday, August 07, 2002 10:24 AM
To: Andy Pfiffer
Cc: Fleischer, Julie N; 'cgl_discussion at osdl.org'
Subject: Re: [cgl_discussion] [cgl_valid] Simulating a system failure to
force a filesystem recovery


On 7 Aug 2002, Andy Pfiffer wrote:

| On Wed, 2002-08-07 at 09:55, Fleischer, Julie N wrote:
| > Validation -
| > As part of testing a resilient file system, I want a test case where
| > I am sure that I have simulated a system failure so that on startup
| > fsck (I believe) must be performed.  In addition, it would be even
| > better if that fsck could have to repair something (i.e., the system
| > failure happened in the middle of a logical write).
| >
| > Does anyone know how I can do this reliably?
|
| As far as triggering an fsck, for non-journaled filesystems that are
| listed in /etc/fstab and automatically mounted on reboot, all you need
| to do is use reboot(2) with LINUX_REBOOT_CMD_RESTART without a previous
| unmount.
|
| You could probably arrange to reliably cause enough dirty state to be
| stuck in the buffer cache that some form of repair would always be
| attempted.
|
| You might try this: create a new directory, and in that new directory,
| randomly create, write, re-write, re-name, and unlink a few 100 files
| and directories. Make sure it runs for a few seconds (like 3), and then
| call reboot().
|
| Safety tip: don't do this on an ext2-based filesystem that you expect to
| be sane when the system reboots. ;^)
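
Roughly, that could look like the sketch below (untested; the scratch
directory, file count, and 3-second run time are arbitrary, and it has to
run as root on a filesystem you don't mind losing):

/* crash_churn.c -- rough sketch of the above: create, write, rename,
 * and unlink files in a scratch directory for about three seconds,
 * then reboot without syncing or unmounting so the filesystem is left
 * dirty.  Needs root; only run it on a throwaway filesystem.  The
 * directory name and counts are arbitrary. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/reboot.h>

#define CHURN_DIR "/mnt/scratch/churn"  /* placeholder scratch directory */

int main(void)
{
    char a[256], b[256], buf[1024];
    time_t start = time(NULL);
    int i = 0, fd;

    memset(buf, 'x', sizeof(buf));
    mkdir(CHURN_DIR, 0755);

    while (time(NULL) - start < 3) {    /* churn for ~3 seconds */
        snprintf(a, sizeof(a), CHURN_DIR "/f.%d", i);
        snprintf(b, sizeof(b), CHURN_DIR "/g.%d", i);
        fd = open(a, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, buf, sizeof(buf));
            close(fd);
        }
        rename(a, b);
        if (i % 3 == 0)                 /* unlink some, leave some */
            unlink(b);
        i = (i + 1) % 200;              /* cycle through a few hundred names */
    }

    /* no sync(), no umount() -- leaving dirty state is the point */
    reboot(RB_AUTOBOOT);                /* same value as LINUX_REBOOT_CMD_RESTART */
    perror("reboot");                   /* only reached if the call failed */
    return 1;
}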

For ext2/3fs, you could also use debugfs to muck with the filesystem
metadata...

-- 
~Randy

_______________________________________________
cgl_discussion mailing list
cgl_discussion at lists.osdl.org
http://lists.osdl.org/mailman/listinfo/cgl_discussion


