[Ksummit-2013-discuss] [ATTEND] Linux VM Infrastructure to support Memory Power Management

Arjan van de Ven arjan at linux.intel.com
Mon Jul 29 14:14:39 UTC 2013


On 7/28/2013 8:32 PM, Johannes Weiner wrote:
>> But we have more than just a binary on-or-off switch, as I mentioned above.
>> Also, we'll most likely have situations where we can just move pages
>> around to save power, way more frequently than opportunities to evacuate
>> and power-off entire regions. So the former usecase is pretty important,
>> IMHO.
> I'm just wondering how far this gets us.

there are also some dangers here: many of these things are tradeoffs, not easy to get right,
and quite often only valid for a few years (since the tradeoffs are ultimately hardware driven,
and this sort of thing tends to change in the longer run)

A few things to consider:
* Running a CPU core to move stuff around may be expensive or cheap, depending on the CPU one picked
   (a rough cost sketch follows this list)
* DIMMs are getting bigger (and with interleaving, the minimum unit of saving may be a few DIMMs grouped together),
   so the amount of work may be quite large
* On just about all modern hardware, memory can go into self-refresh (SR) when all CPUs are idle. SR draws
   quite a bit less power than active memory. This makes using the CPU to do memory work doubly expensive:
   it burns CPU power and keeps the memory out of SR
* Most modern hardware can put DIMMs into a lower power state (CKE and the like) when they are not accessed.
   CKE-like states aren't as low power as SR, but it can still add up. So compaction/grouping onto fewer DIMMs
   may be a benefit if it means we're not accessing some of the DIMMs for a while
* CPU caches can be huge, which can shield a lot of DIMM activity (allowing the DIMMs to go to lower power
   states) or hide NUMA effects... but other systems have small caches
* As Matthew said... it also depends a lot on storage speed. With NVMe and even faster storage,
   the value of a page in the pagecache is clearly different on such a system than when using spinning rust
   or a glorified USB stick
* There are huge differences in power levels between DDR3, DDR3L, LPDDR and likely DDR4,
   whenever that shows up. These differences will likely mean different tradeoffs.
   E.g. super-low-power memory with super-slow storage (eMMC) clearly calls for different VM tradeoffs
   than higher-power memory with very fast storage. Getting this to auto-tune is important but non-trivial
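
To make the first bullet concrete, here is a back-of-envelope model of the migrate-then-power-off
tradeoff, written as plain userspace C. Every number in it is an invented placeholder, not a
measurement; the real values depend on the exact CPU, DIMM type and interleaving, which is exactly
the point above.

/*
 * Back-of-envelope energy model: migrate pages off a memory region,
 * then power the region down.  All constants are illustrative
 * assumptions only.
 */
#include <stdio.h>

int main(void)
{
	double region_bytes     = 4.0e9;  /* assumed: 4 GB region (a few interleaved DIMMs) */
	double copy_bandwidth   = 8.0e9;  /* assumed: 8 GB/s effective migration bandwidth */
	double cpu_active_watts = 20.0;   /* assumed: extra package power while copying */
	double dimm_saved_watts = 3.0;    /* assumed: region power (active/CKE) minus off */

	double copy_seconds = region_bytes / copy_bandwidth;
	double copy_joules  = copy_seconds * cpu_active_watts;
	/* how long the region must stay off before the copy pays for itself */
	double breakeven_seconds = copy_joules / dimm_saved_watts;

	printf("copy takes %.1f s and costs %.1f J\n", copy_seconds, copy_joules);
	printf("break-even after %.1f s powered off\n", breakeven_seconds);
	return 0;
}

The interesting output is the break-even time: if the region won't stay powered off at least that
long, moving the pages burned more energy than it saved. And the model is charitable; it doesn't
even charge for keeping the rest of memory out of self-refresh while copying.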


There is one other aspect we should think about: memory power does not normally depend on what bits are in memory[1].
Freeing memory might be the wrong thing to do, since such freeing is speculative: it only pays off if we actually
achieve memory power savings some time later...
... what if the VM kept the page, unmapped of course, and only once the memory's power has actually been cut
(i.e. the content is lost) do we mark the content as invalid, potentially only when we're asked to
map the content again (using some sort of generation number or whatever). This could avoid doing a lot of the
heavier work when it doesn't pay off, and make it cheap to declare a whole range suddenly no longer valid.
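
To sketch what that could look like, here is a minimal model with invented names; nothing below is
an existing kernel interface:

/*
 * Sketch of lazy invalidation via generation numbers.  Each
 * power-managed region keeps a counter that is bumped when its
 * contents are lost; a kept-but-unmapped page records the counter
 * value it was stashed under and is validated only at re-map time.
 */
#include <stdio.h>
#include <stdbool.h>

struct mem_region {
	unsigned long generation;	/* bumped whenever contents are lost */
};

struct stashed_page {
	struct mem_region *region;
	unsigned long generation;	/* region generation when unmapped */
};

/* cheap: invalidating a whole range is a single increment */
static void region_power_off(struct mem_region *r)
{
	r->generation++;
}

/* checked only when someone asks to map the content again */
static bool stashed_page_still_valid(const struct stashed_page *p)
{
	return p->generation == p->region->generation;
}

int main(void)
{
	struct mem_region r = { .generation = 0 };
	struct stashed_page p = { .region = &r, .generation = r.generation };

	region_power_off(&r);	/* power actually went away */
	printf("content still valid? %s\n",
	       stashed_page_still_valid(&p) ? "yes" : "no");
	return 0;
}

The point is the asymmetry: power-off is one increment no matter how many pages were stashed in the
region, while the per-page check is deferred to a re-map that may never happen.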


[1] Except in virtual machines, where "all zeroes" content allows the hypervisor to de-duplicate better and,
as a result, give more memory to other VMs, which can then run more efficiently