[PATCH] btrfs: fix rw device counting in __btrfs_free_extra_devids

Desmond Cheong Zhi Xi desmondcheongzx at gmail.com
Sun Jul 25 06:19:52 UTC 2021


On 22/7/21 1:59 am, David Sterba wrote:
> On Thu, Jul 15, 2021 at 06:34:03PM +0800, Desmond Cheong Zhi Xi wrote:
>> Syzbot reports a warning in close_fs_devices that happens because
>> fs_devices->rw_devices is not 0 after calling btrfs_close_one_device
>> on each device.
>>
>> This happens when a writeable device is removed in
>> __btrfs_free_extra_devids, but the rw device count is not decremented
>> accordingly. So when close_fs_devices is called, the removed device is
>> still counted and we get an off by 1 error.
>>
>> Here is one call trace that was observed:
>>    btrfs_mount_root():
>>      btrfs_scan_one_device():
>>        device_list_add();   <---------------- device added
>>      btrfs_open_devices():
>>        open_fs_devices():
>>          btrfs_open_one_device();   <-------- rw device count ++
>>      btrfs_fill_super():
>>        open_ctree():
>>          btrfs_free_extra_devids():
>> 	  __btrfs_free_extra_devids();  <--- device removed
>> 	  fail_tree_roots:
>> 	    btrfs_close_devices():
>> 	      close_fs_devices();   <------- rw device count off by 1
>>
>> Fixes: cf89af146b7e ("btrfs: dev-replace: fail mount if we don't have replace item with target device")
> 
> What this patch did in the last hunk was the rw_devices decrement, but
> conditional:
> 
> @@ -1080,9 +1071,6 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>                  if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>                          list_del_init(&device->dev_alloc_list);
>                          clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
> -                       if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> -                                     &device->dev_state))
> -                               fs_devices->rw_devices--;
>                  }
>                  list_del_init(&device->dev_list);
>                  fs_devices->num_devices--;
> ---
> 
> 
>> @@ -1078,6 +1078,7 @@ static void __btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices,
>>   		if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state)) {
>>   			list_del_init(&device->dev_alloc_list);
>>   			clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
>> +			fs_devices->rw_devices--;
>>   		}
>>   		list_del_init(&device->dev_list);
>>   		fs_devices->num_devices--;
> 
> So should it be reinstated in the original form? The rest of
> cf89af146b7e handles unexpected device replace item during mount.
> 
> Adding the decrement is correct, but right now I'm not sure about the
> corner case when the device has the BTRFS_DEV_STATE_REPLACE_TGT bit set.
> The state machine of the device bits and counters is not trivial so
> fixing it one way or the other could lead to further syzbot reports if
> we don't understand the issue.
> 

Hi David,

Thanks for raising this issue. I took a closer look and I think we don't 
have to reinstate the original form because it's a historical artifact.

The short version of the story is that going by the intention of 
__btrfs_free_extra_devids, we skip removing the replace target device. 
Hence, by the time we've reached the decrement in question, the device 
is not the replace target device and the BTRFS_DEV_STATE_REPLACE_TGT bit 
should not be set.

But we should also try to understand the original intention of the code. 
The check in question was first introduced in commit 8dabb7420f01 
("Btrfs: change core code of btrfs to support the device replace 
operations"):
> @@ -536,7 +553,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
>                 if (device->writeable) {
>                         list_del_init(&device->dev_alloc_list);
>                         device->writeable = 0;
> -                       fs_devices->rw_devices--;
> +                       if (!device->is_tgtdev_for_dev_replace)
> +                               fs_devices->rw_devices--;
>                 }
>                 list_del_init(&device->dev_list);
>                 fs_devices->num_devices--;

If we take a trip back in time to this commit we see that 
btrfs_dev_replace_finishing added the target device to the alloc list 
without incrementing the rw_devices count. So this check was likely 
originally meant to prevent under-counting of rw_devices.

However, the situation has changed, following various fixes to 
rw_devices counting. Commit 63dd86fa79db ("btrfs: fix rw_devices miss 
match after seed replace") added an increment to rw_devices when 
replacing a seed device with a writable one in btrfs_dev_replace_finishing:
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index eea26e1b2fda..fb0a7fa2f70c 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -562,6 +562,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>         if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>                 fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>         list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> +       if (src_device->fs_devices->seeding)
> +               fs_info->fs_devices->rw_devices++;
>  
>         /* replace the sysfs entry */
>         btrfs_kobj_rm_device(fs_info, src_device);

This was later simplified in commit 82372bc816d7 ("Btrfs: make the logic 
of source device removing more clear"), which decrements rw_devices in 
btrfs_rm_dev_replace_srcdev if the replaced device was writable. As a 
result, rw_devices could be incremented in btrfs_dev_replace_finishing 
without any checks:
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index e9cbbdb72978..6f662b34ba0e 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -569,8 +569,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>         if (fs_info->fs_devices->latest_bdev == src_device->bdev)
>                 fs_info->fs_devices->latest_bdev = tgt_device->bdev;
>         list_add(&tgt_device->dev_alloc_list, &fs_info->fs_devices->alloc_list);
> -       if (src_device->fs_devices->seeding)
> -               fs_info->fs_devices->rw_devices++;
> +       fs_info->fs_devices->rw_devices++;
>  
>         /* replace the sysfs entry */
>         btrfs_kobj_rm_device(fs_info, src_device);
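
As a rough userspace sketch of the accounting after 82372bc816d7 (not 
kernel code; the replace_model_* names are hypothetical and only mirror 
the behaviour described above), the target always adds one to 
rw_devices and the source only subtracts one if it was writable:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the rw_devices counter in btrfs_fs_devices. */
struct replace_model_fs_devices {
	int rw_devices;
};

/* Models the unconditional increment in btrfs_dev_replace_finishing(). */
void replace_model_add_tgtdev(struct replace_model_fs_devices *fs)
{
	fs->rw_devices++;
}

/*
 * Models btrfs_rm_dev_replace_srcdev(): the count only drops if the
 * replaced device was writable (a seed device is not).
 */
void replace_model_rm_srcdev(struct replace_model_fs_devices *fs,
			     bool src_writeable)
{
	if (src_writeable)
		fs->rw_devices--;
}

/* Returns rw_devices after one replace, starting from rw_before. */
int replace_model_finish(int rw_before, bool src_writeable)
{
	struct replace_model_fs_devices fs = { rw_before };

	assert(rw_before >= 0);
	replace_model_add_tgtdev(&fs);
	replace_model_rm_srcdev(&fs, src_writeable);
	return fs.rw_devices;
}
```

In this model, replacing a writable source leaves the count unchanged, 
while replacing a seed source raises it by one.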

Thus, given the current state of the code base, the original check is 
now incorrect, because we want to decrement rw_devices as long as the 
device is being removed from the alloc list.
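
To illustrate, here is a minimal userspace sketch of the failing mount 
path from the syzbot report. It is not kernel code: the mount_model_* 
names are hypothetical, and the model only tracks the writeable state 
and the rw_devices counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for btrfs_device and btrfs_fs_devices. */
struct mount_model_device {
	bool writeable;	/* models BTRFS_DEV_STATE_WRITEABLE */
	bool removed;	/* dropped from the device list */
};

struct mount_model_fs_devices {
	int rw_devices;
};

/* Models btrfs_open_one_device() for a writeable device. */
void mount_model_open(struct mount_model_fs_devices *fs,
		      struct mount_model_device *dev)
{
	dev->writeable = true;
	fs->rw_devices++;
}

/* Models the removal hunk in __btrfs_free_extra_devids(). */
void mount_model_free_extra(struct mount_model_fs_devices *fs,
			    struct mount_model_device *dev, bool decrement)
{
	if (dev->writeable) {
		dev->writeable = false;
		if (decrement)
			fs->rw_devices--;
	}
	dev->removed = true;
}

/* Models close_fs_devices() visiting one device still on the list. */
void mount_model_close(struct mount_model_fs_devices *fs,
		       struct mount_model_device *dev)
{
	if (dev->removed)
		return;
	if (dev->writeable) {
		dev->writeable = false;
		fs->rw_devices--;
	}
}

/* Returns rw_devices left over after open, free-extra and close. */
int mount_model_leftover_rw(bool decrement)
{
	struct mount_model_fs_devices fs = { 0 };
	struct mount_model_device kept = { false, false };
	struct mount_model_device extra = { false, false };

	mount_model_open(&fs, &kept);
	mount_model_open(&fs, &extra);
	mount_model_free_extra(&fs, &extra, decrement);
	mount_model_close(&fs, &kept);
	mount_model_close(&fs, &extra);
	return fs.rw_devices;
}
```

With the decrement in place the model ends at 0. Without it, the 
removed device is still counted and the model ends at 1, which matches 
the off-by-one that the warning in close_fs_devices catches.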

To further convince ourselves of this, we can take a closer look at the 
relation between the device with devid BTRFS_DEV_REPLACE_DEVID and the 
BTRFS_DEV_STATE_REPLACE_TGT bit for devices.

BTRFS_DEV_STATE_REPLACE_TGT is set in two places:
- btrfs_init_dev_replace_tgtdev
- btrfs_init_dev_replace

In btrfs_init_dev_replace_tgtdev, the BTRFS_DEV_STATE_REPLACE_TGT bit is 
set for a device allocated with devid BTRFS_DEV_REPLACE_DEVID.

In btrfs_init_dev_replace, the BTRFS_DEV_STATE_REPLACE_TGT bit is set 
for the target device found with devid BTRFS_DEV_REPLACE_DEVID.

In both cases, the BTRFS_DEV_STATE_REPLACE_TGT bit is set only for the 
device with devid BTRFS_DEV_REPLACE_DEVID.

It follows that if a device does not have devid BTRFS_DEV_REPLACE_DEVID, 
then the BTRFS_DEV_STATE_REPLACE_TGT bit will not be set.

With commit cf89af146b7e ("btrfs: dev-replace: fail mount if we don't 
have replace item with target device"), we skip removing the device in 
__btrfs_free_extra_devids as long as the devid is BTRFS_DEV_REPLACE_DEVID:
> -               if (device->devid == BTRFS_DEV_REPLACE_DEVID) {
> -                       /*
> -                        * In the first step, keep the device which has
> -                        * the correct fsid and the devid that is used
> -                        * for the dev_replace procedure.
> -                        * In the second step, the dev_replace state is
> -                        * read from the device tree and it is known
> -                        * whether the procedure is really active or
> -                        * not, which means whether this device is
> -                        * used or whether it should be removed.
> -                        */
> -                       if (step == 0 || test_bit(BTRFS_DEV_STATE_REPLACE_TGT,
> -                                                 &device->dev_state)) {
> -                               continue;
> -                       }
> -               }
> +               /*
> +                * We have already validated the presence of BTRFS_DEV_REPLACE_DEVID,
> +                * in btrfs_init_dev_replace() so just continue.
> +                */
> +               if (device->devid == BTRFS_DEV_REPLACE_DEVID)
> +                       continue;

Given the discussion above, any device that gets past the 
device->devid == BTRFS_DEV_REPLACE_DEVID check is not the replace 
target device, and does not have the BTRFS_DEV_STATE_REPLACE_TGT bit set.

So the original check for the BTRFS_DEV_STATE_REPLACE_TGT bit before 
decrementing rw_devices is not just incorrect at this point, it's also 
redundant.
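
The redundancy argument can be sketched in userspace as follows. This 
is only a model under stated assumptions: the tgt_model_* names are 
hypothetical, TGT_MODEL_REPLACE_DEVID is assumed to have the same value 
as BTRFS_DEV_REPLACE_DEVID, and every device is treated as "extra" so 
that the loop reaches the removal hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to match BTRFS_DEV_REPLACE_DEVID; illustrative only. */
#define TGT_MODEL_REPLACE_DEVID 0ULL

struct tgt_model_device {
	unsigned long long devid;
	bool writeable;		/* models BTRFS_DEV_STATE_WRITEABLE */
	bool replace_tgt;	/* models BTRFS_DEV_STATE_REPLACE_TGT */
	bool removed;
};

/*
 * The only way to set the bit in this model. It mirrors the two set
 * sites, which both act on the device with the replace devid.
 */
void tgt_model_set_replace_tgt(struct tgt_model_device *dev)
{
	assert(dev->devid == TGT_MODEL_REPLACE_DEVID);
	dev->replace_tgt = true;
}

/*
 * Models the loop in __btrfs_free_extra_devids() after cf89af146b7e.
 * Returns how many times rw_devices would be decremented.
 */
int tgt_model_free_extra(struct tgt_model_device *devs, size_t n,
			 bool check_tgt_bit)
{
	int decrements = 0;

	for (size_t i = 0; i < n; i++) {
		struct tgt_model_device *dev = &devs[i];

		/* The replace target is skipped before any accounting. */
		if (dev->devid == TGT_MODEL_REPLACE_DEVID)
			continue;

		if (dev->writeable) {
			dev->writeable = false;
			/* The old form also tested the target bit here. */
			if (!check_tgt_bit || !dev->replace_tgt)
				decrements++;
		}
		dev->removed = true;
	}
	return decrements;
}

/* One replace target, one writeable device, one read-only device. */
int tgt_model_count_decrements(bool check_tgt_bit)
{
	struct tgt_model_device devs[3] = {
		{ TGT_MODEL_REPLACE_DEVID, true, false, false },
		{ 1, true, false, false },
		{ 2, false, false, false },
	};

	tgt_model_set_replace_tgt(&devs[0]);
	return tgt_model_free_extra(devs, 3, check_tgt_bit);
}
```

Both variants produce the same number of decrements in this model, 
because no device that reaches the removal hunk can have the target 
bit set.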

Of course, I would hate to introduce a hard-to-find bug with a bad 
analysis, so any thoughts on this would be appreciated.

Best wishes,
Desmond

