arm-smmu-v3 high cpu usage for NVMe

John Garry john.garry at huawei.com
Fri May 22 14:52:30 UTC 2020


On 20/03/2020 10:41, John Garry wrote:

+ Barry, Alexandru

>>>>>     PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)
>>>>> --------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>       27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
>>>>>       11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
>>>>>        6.35%  [kernel]          [k] _raw_spin_unlock_irq
>>>>>        2.65%  [kernel]          [k] get_user_pages_fast
>>>>>        2.03%  [kernel]          [k] __slab_free
>>>>>        1.55%  [kernel]          [k] tick_nohz_idle_exit
>>>>>        1.47%  [kernel]          [k] arm_lpae_map
>>>>>        1.39%  [kernel]          [k] __fget
>>>>>        1.14%  [kernel]          [k] __lock_text_start
>>>>>        1.09%  [kernel]          [k] _raw_spin_lock
>>>>>        1.08%  [kernel]          [k] bio_release_pages.part.42
>>>>>        1.03%  [kernel]          [k] __sbitmap_get_word
>>>>>        0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
>>>>>        0.91%  [kernel]          [k] fput_many
>>>>>        0.88%  [kernel]          [k] __arm_lpae_map
>>>>>

Hi Will, Robin,

I'm just getting around to looking at this topic again. Here's the current 
picture for my NVMe test:

perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
  Overhead  Shared Object  Symbol
    75.91%  [kernel]       [k] arm_smmu_cmdq_issue_cmdlist
     3.28%  [kernel]       [k] arm_smmu_tlb_inv_range
     2.42%  [kernel]       [k] arm_smmu_atc_inv_domain.constprop.49
     2.35%  [kernel]       [k] _raw_spin_unlock_irqrestore
     1.32%  [kernel]       [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
     1.20%  [kernel]       [k] aio_complete_rw
     0.96%  [kernel]       [k] enqueue_task_fair
     0.93%  [kernel]       [k] gic_handle_irq
     0.86%  [kernel]       [k] _raw_spin_lock_irqsave
     0.72%  [kernel]       [k] put_reqs_available
     0.72%  [kernel]       [k] sbitmap_queue_clear

* Only certain CPUs run the DMA unmap in my scenario, cpu0 being one of 
them.

My colleague Barry has similar findings for some other scenarios.

So we tried the latest perf NMI support wip patches, and noticed a few 
hotspots (see 
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate 
and 
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt) 
when running some NVMe traffic:

- initial cmpxchg to get a place in the queue (a trimmed paraphrase of 
this step follows the list)
	- when more CPUs get involved, we start failing at an exponential rate

  arm_smmu_cmdq_issue_cmdlist():
    0.00 :        ffff8000107a3500:       cas     x4, x2, [x27]
   26.52 :        ffff8000107a3504:       mov     x0, x4

- the queue locking
- polling cmd_sync
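
For reference, the space-allocation step in arm_smmu_cmdq_issue_cmdlist() 
looks roughly like this (paraphrased from memory, with the irq 
save/restore and error handling trimmed, so treat it as a sketch rather 
than the exact driver code):

	/* 1. Allocate some space in the queue */
	llq.val = READ_ONCE(cmdq->q.llq.val);
	do {
		u64 old;

		/* Back off while the queue has no room for this batch */
		while (!queue_has_space(&llq, n + sync))
			arm_smmu_cmdq_poll_until_not_full(smmu, &llq);

		head.cons = llq.cons;
		head.prod = queue_inc_prod_n(&llq, n + sync) |
			    CMDQ_PROD_OWNED_FLAG;

		/* This is the cas that shows up hot in the annotation */
		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
		if (old == llq.val)
			break;

		llq.val = old;
	} while (1);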

Some ideas to optimise:

a. initial cmpxchg
So this cmpxchg could be considered unfair. In addition, with all the 
contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged 
around the system.
Maybe we can implement something similar to the idea of queued/ticketed 
spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after 
the initial cmpxchg fails, to be released by its leader, which in turn 
releases subsequent followers; see the sketch below.
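
For illustration, something along the lines below (completely untested, 
all names invented; it is essentially the standard MCS construction the 
kernel already has in mcs_spinlock): a CPU that loses the race parks on 
its own per-CPU node, and only the initial xchg touches the shared 
cacheline. The "leader hands out slots to its followers" refinement 
would sit on top of this.

struct cmdq_mcs_node {
	struct cmdq_mcs_node	*next;
	int			locked;
};

static DEFINE_PER_CPU(struct cmdq_mcs_node, cmdq_mcs_nodes);

static void cmdq_mcs_lock(struct cmdq_mcs_node **tail)
{
	struct cmdq_mcs_node *node = this_cpu_ptr(&cmdq_mcs_nodes);
	struct cmdq_mcs_node *prev;

	node->next = NULL;
	node->locked = 0;

	/* The only RMW on the shared cacheline */
	prev = xchg(tail, node);
	if (!prev)
		return;				/* uncontended */

	WRITE_ONCE(prev->next, node);
	/* Spin on our own node until the predecessor hands over */
	smp_cond_load_acquire(&node->locked, VAL);
}

static void cmdq_mcs_unlock(struct cmdq_mcs_node **tail)
{
	struct cmdq_mcs_node *node = this_cpu_ptr(&cmdq_mcs_nodes);
	struct cmdq_mcs_node *next = READ_ONCE(node->next);

	if (!next) {
		/* No successor yet: try to leave the queue empty */
		if (cmpxchg_release(tail, node, NULL) == node)
			return;
		/* A successor is joining; wait for it to link itself in */
		while (!(next = READ_ONCE(node->next)))
			cpu_relax();
	}
	/* Hand over to the next waiter */
	smp_store_release(&next->locked, 1);
}

With the prod update done under such a handoff, the cmpxchg retry loop 
could disappear; the queue-full check is then the remaining problem, 
which is where b. comes in.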

b. Drop the queue_full checking in certain circumstances
If we cannot theoretically fill the queue, then stop the checking for 
queue full or similar. This should also help current problem of a., as 
the less time between cmpxchg, the less chance of failing (as we check 
queue available space between cmpxchg attempts).

So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we 
always issue a cmd_sync for a batch (regardless of whether requested), 
then we should never fill the queue (I think); a sketch of this check 
follows.
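
As a sketch of the check (untested; CMDQ_MAX_BATCH_ENTRIES and 
cmdq->never_fills are invented names, and num_possible_cpus() stands in 
for "nr_available_cpus"):

static void arm_smmu_cmdq_check_never_fills(struct arm_smmu_device *smmu)
{
	struct arm_smmu_cmdq *cmdq = &smmu->cmdq;
	u32 depth = 1 << cmdq->q.llq.max_n_shift;

	/*
	 * If every CPU had a maximum-sized batch plus its CMD_SYNC in the
	 * queue at once and there were still room, then no submitter can
	 * ever see a full queue, and the queue-space check between
	 * cmpxchg attempts can be dropped.
	 */
	cmdq->never_fills = depth > num_possible_cpus() *
				    (CMDQ_MAX_BATCH_ENTRIES + 1);
}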

c. Don't do queue locking in certain circumstances
If we implement (and support) b. and also support MSI polling, then I 
don't think that the queue locking is required; see the sketch below.
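
i.e. something like this in the submission path (untested, reusing the 
invented never_fills flag from b., and assuming CMD_SYNC completion is 
signalled by MSI so the exclusive cons-update path is never needed):

	if (!cmdq->never_fills)
		arm_smmu_cmdq_shared_lock(cmdq);

	/* ... write the commands, set the valid map, advance prod ... */

	if (!cmdq->never_fills)
		arm_smmu_cmdq_shared_unlock(cmdq);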

d. Some more minor ideas: move forward as soon as the "owner" stops 
gathering, to reduce the time spent advancing prod and hopefully the 
cmd_sync polling time; and use a smaller word size for the valid bitmap 
operations - 32b atomic operations may be more efficient overall than 
64b, as the valid range being checked is mostly < 16 bits from my 
observation. A rough sketch of the latter is below.
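
For the valid-map part, an untested sketch of a 32b variant 
(valid_map32 is an invented field; the real helper works on unsigned 
long words, and a range crossing a word boundary would still need the 
caller to loop):

static void cmdq_set_valid_bits_32(atomic_t *valid_map32, u32 idx, u32 nbits)
{
	u32 word = idx / 32;
	u32 bit = idx % 32;
	u32 mask = GENMASK(min(bit + nbits, 32U) - 1, bit);

	/*
	 * Toggle the valid bits within a single 32-bit word; with most
	 * ranges under 16 bits, one 32-bit RMW would usually be enough.
	 */
	atomic_xor(mask, &valid_map32[word]);
}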

Let me know your thoughts or any other ideas.

Thanks,
John


