MMU gang wars: the TLB drive-by shootdown

eqvinox · on May 17, 2020

There is a very glaring omission (or call it gloss-over) in this article - whether and when a free() even causes the allocator to actually release memory to the OS. JVMs are kinda famous for their "om nom nom" attitude on this, but you can't transfer from ptmalloc (glibc, whose man page is cited regarding >=128k mmap) to tcmalloc and jemalloc.

Also, on Linux, an allocator may use madvise(MADV_FREE) instead of munmap(); this tells the kernel that the data in a page is no longer needed but can be unmapped at the kernel's discretion, possibly even at different times for different threads (e.g. when doing a task switch for an unrelated reason.)

[EDIT: this doesn't actually work, see below. Sorry.]

kentonv · on May 17, 2020

IIRC MADV_FREE doesn't actually let the kernel avoid shootdowns. From the man page (http://man7.org/linux/man-pages/man2/madvise.2.html):

    MADV_FREE (since Linux 4.5)
       The application no longer requires the pages in the range
       specified by addr and len.  The kernel can thus free these
       pages, but the freeing could be delayed until memory pressure
       occurs.  For each of the pages that has been marked to be
       freed but has not yet been freed, the free operation will be
       canceled if the caller writes into the page. (...)

The problem is with this bit about a write to the page canceling the free... This side effect needs to be implemented in a page fault. But if the page is still in the TLB, then writes to it won't fault. So... you need a TLB shootdown. :(

I do wonder why there isn't an API for "lazy munmap()"... it would behave exactly like munmap(), except that the pages might remain accessible in other threads until the end of their timeslices, when the kernel can apply queued TLB flushes. It seems to me likely that 99.9% of uses of munmap() could be switched to this lazy operation without introducing any bugs. I'm not a kernel programmer, though. Maybe there's some reason this doesn't work, or maybe the performance improvements aren't worth the effort?

cyphar · on May 18, 2020

> I do wonder why there isn't an API for "lazy munmap()"... it would behave exactly like munmap(), except that the pages might remain accessible in other threads until the end of their timeslices, when the kernel can apply queued TLB flushes.

That does exist, it's called MADV_DONTNEED and most operating systems have implemented it. However, MADV_DONTNEED on Linux was incorrectly implemented and (from memory) would always result in the page being unmapped immediately -- making it roughly equivalent to MADV_FREE. To quote the man page:

> All of the advice values listed here have analogs in the POSIX-specified posix_madvise(3) function, and the values have the same meanings, with the exception of MADV_DONTNEED.

kentonv · on May 18, 2020

AFAICT the POSIX behavior for MADV_DONTNEED is just a hint that the memory won't be accessed soon.

https://pubs.opengroup.org/onlinepubs/009695399/functions/po... says:

    POSIX_MADV_DONTNEED
        Specifies that the application expects that it will not access the specified
        range in the near future.

https://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2 says:

     MADV_DONTNEED    Allows the VM system to decrease the in-memory priority
        of pages in the specified address range. Consequently,
        future references to this address range are more likely
        to incur a page fault.

Neither of these suggest that the contents of the memory content can be discarded (as Linux does), only that it can be swapped out.

So it doesn't appear that this provides the "lazy unmap to avoid shootdown" behavior on any OS.

comex · on May 18, 2020

After reading that I wondered what happens on Linux if you actually call posix_madvise(POSIX_MADV_DONTNEED). The answer, from glibc:

    /* We have one problem: the kernel's MADV_DONTNEED does not
       correspond to POSIX's POSIX_MADV_DONTNEED.  The former simply
       discards changes made to the memory without writing it back to
       disk, if this would be necessary.  The POSIX behavior does not
       allow this.  There is no functionality mapping the POSIX behavior
       so far so we ignore that advice for now.  */
    if (advice == POSIX_MADV_DONTNEED)
      return 0;

davidtgoldblatt · on May 18, 2020

> Maybe there's some reason this doesn't work, or maybe the performance improvements aren't worth the effort?

I work on an allocator, and I've lobbied (Linux) kernel people for this over the years. I think everyone mostly agrees it'd be useful (and, the TLB shootdown cost is indeed often nontrivial), but the current architecture of the various subsystem's you'd need to touch makes it hard to do.

As allocators move more and more to a hugepages-first world, I think we'll see fewer and fewer benefits from faster unmappings (instead, we'll just spend more effort making sure we fill in the holes in existing in-use ranges).

markjdb · on May 18, 2020

> I do wonder why there isn't an API for "lazy munmap()"

You don't really need a separate API. The kernel can implement munmap() lazily, it just needs to also ensure that the address range isn't reused until a TLB shootdown is completed. LATR is a system that leverages this to implement lazy TLB shootdowns for munmap(): http://www.cs.yale.edu/homes/abhishek/kumar-asplos18.pdf

BeeOnRope · on May 18, 2020

Wouldn't this break the existing semantics of mmap, which is after the mmap call returns, other threads in the same process will fault if the they access the mapping?

Sure, avoiding re-use prevents some types of bugs but not others.

londons_explore · on May 18, 2020

Id be interested to know if any program relies on freeing memory in one thread, then (with some lock for synchronization) deliberately accessing it from another thread causing a segfault.

Sure, it's possible, but if no programs do it, it might be worth breaking the behaviour for the performance increase.

kentonv · on May 18, 2020

If the program is just using mmap/munmap for memory allocation, then I agree with you.

But some programs do crazier things. Imagine a program that implements mmap of a file entirely in userspace. For example, say I have a file that's stored on a distributed filesystem that dosen't have kernel drivers. So I catch SIGSEGV in userspace and populate pages on-demand by fetching them from the remote server. Later on, the page hasn't been used in a while, so I munmap() it to reclaim memory. I expect that any further uses after this point (especially writes, which maybe I need to sync back?) will raise a new SIGSEGV which I can handle.

I think something like that, unfortunately, could be broken by changing all munmaps to be lazy...

markjdb · on May 19, 2020

I was wondering how this scheme might break applications that do clever things with a SIGBUS/SIGSEGV handler. Do you happen to know of any OSS that actually does something like this?

saagarjha · on May 21, 2020

FUSE?

eqvinox · on May 17, 2020

> The problem is with this bit about a write to the page canceling the free... This side effect needs to be implemented in a page fault.

On some architectures, yes, but x86 has the dirty bit in the PTE for this; if it's cleared you're out easy (but that's unlikely); if it's already set I honestly don't know how costly it is to clear. You don't necessarily need to shoot the TLB on other CPUs though since the mapping is still valid; if there is some easier way to get the D flag cleared you can use that.

> But if the page is still in the TLB, then writes to it won't fault.

They can, a TLB entry has the full range of PTE flags to work with, e.g. a page may be flagged read-only for the kernel to do copy-on-write handling. I have no clue how exactly the dirty flag is set though :/ — I only know the function is indeed provided by the MMU itself, not a page fault.

kentonv · on May 18, 2020

> if there is some easier way to get the D flag cleared you can use that.

Is there though? I didn't know about the dirty bit, so obviously I don't know much about x86 TLB, but I'm having trouble imagining any reasonable implementation where clearing the dirty bit wouldn't itself require a shootdown.

Have you seen somewhere claim that MADV_FREE allows avoiding a shootdown? The man page doesn't say that and I am pretty sure I heard before that it does not, but would be very interested to know if that's incorrect.

eqvinox · on May 18, 2020

Intel docs say this on CPU behaviour:

https://software.intel.com/content/www/us/en/develop/downloa...

[4.8 ACCESSED AND DIRTY FLAGS]

Memory-management software may clear these flags when a page or a paging structure is initially loaded into physical memory. These flags are “sticky,” meaning that, once set, the processor does not clear them; only software can clear them.

A processor may cache information from the paging-structure entries in TLBs and paging-structure caches (see Section 4.10). This fact implies that, if software changes an accessed flag or a dirty flag from 1 to 0, the processor might not set the corresponding bit in memory on a subsequent access using an affected linear address (see Section 4.10.4.3). See Section 4.10.4.2 for how software can ensure that these bits are updated as desired.

[4.10.4 Invalidation of TLBs and Paging-Structure Caches]

[4.10.4.2 Recommended Invalidation]

... so, I guess, yeah, this doesn't really help :|

eqvinox · on May 18, 2020

> Have you seen somewhere claim that MADV_FREE allows avoiding a shootdown?

I haven't, and after looking at mm/madvise.c [ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... ] I guess it doesn't. Also, the unmapping happens immediately.

MADV_DONTNEED doesn't help either, it also seems to invalidate immediately.

That only leaves MADV_REMOVE, which AFAIK doesn't work on anonymous mappings...

sigh

[Edit: good thing I don't write allocators :D]

zozbot234 · on May 17, 2020

> I do wonder why there isn't an API for "lazy munmap()"... it would behave exactly like munmap(), except that the pages might remain accessible in other threads until the end of their timeslices, when the kernel can apply queued TLB flushes.

You would effectively need to run different address spaces in different threads of the "same" process, which might interoperate badly with whatever guarantees the kernel provides or relies on elsewhere based on the assumption of having a unified address space for the whole process. Though I absolutely agree that this is worthwhile, it probably needs to be designed carefully and can't just transparently replace any and all uses of munmap().

eqvinox · on May 17, 2020

Hmm - both maps and unmaps are already non-atomic though; there's gonna be some however minor span of time where different CPUs will have different state regarding a page that is being updated. IPIs don't grind the entire system to a halt to guarantee everyone is getting the same update at the same time.

I.e. you can already have a syscall on thread A racing against the incoming IPI on thread B's mmap/munmap...

new_realist · on May 18, 2020

Sure, but the munmap() won’t return to the caller until the effects are visible in all processors.

kentonv · on May 18, 2020

Imagine a program that, instead of calling munmap() directly, spins up a separate thread to call it, with the main thread proceeding on its way.

Doesn't such a program today already observe all the weirdness of the page "disappearing" from different threads at different points in time?

What difference would there be with a lazy mmap(), except that the window during which these inconsistencies are observable would be longer?

new_realist · on May 18, 2020

Lazy munmap() is definitely possible, there just needs to be more kernel to userspace bookkeeping. The weirdness you reference would only be visible in a use after free scenario, which is a bug, and therefore the behavior is undefined anyway.

But, reallocation is simplified with a synchronous munmap()—the memory allocator knows (through shared memory communication channels) that the virtual pages can be safely reallocated once the call returns; with a lazy approach some other mechanism needs to inform the allocator when all cores have been flushed (and thus it’s safe to make a new mapping), or else do a shootdown in mmap(). I think Solaris might have done something similar.

It’s safe to take a TLB miss on a remapping, but it’s not safe to reallocate memory and then inadvertently use an old, cached mapping. The synchronous design assumes that allocations are in the critical path and should be fast, but deallocations are not and can be slow. I think the original logic also assumed that virtual address space was scarce. These days I think a lazy unmap is probably worth it as a way of encouraging more efficient reuse of physical memory. Virtual address space is now plentiful. Note that, for security, a physical page might still result in a synchronous shoot down if it’s needed by another process quickly enough.

new_realist · on May 18, 2020

By the way, I love your new bandwidth-delay-product-based flow control logic in Cap’n Proto!

kentonv · on May 18, 2020

Thanks! (Though it'll be more exciting when it does its own BDP calculation rather than stealing the kernel's socket buffer size. :) )

joe_the_user · on May 17, 2020

Every once in a while I get involuntarily dragged into heated debates about whether reusing memory is better for performance than freeing it.

I couldn't comment on all the instances the article talks about. But this way of asking the question seems to me to hide the problem. It seems simpler to say "what memory allocation algorithm should you use?" Which is to say, "does your knowledge of your application's memory needs and memory performance trump all the effort and knowledge that went into creating the memory allocator of the operating system you're using?". And so then you get into the massive number of technical considerations the article and others might raise.

Memory allocation is a weird thing, it's an algorithm but it's often taken as a given in programming languages and discussions of algorithms.

bitcharmer · on May 17, 2020

Hi, author here. I'm not sure if I follow your argument. This article doesn't touch on allocators at all (ie. SLUB vs SLAB). It focuses solely on the cost of freeing memory which TLB-shootdowns are a notable part of.

I even mention it at the beginning:

> Regardless of the method by which your program acquired memory there are side effects of freeing/reclaiming it. This post focuses on the impact of so called TLB-shootdowns.

Hope this helps.

joe_the_user · on May 17, 2020

As I understand things, allocating and freeing memory pretty much forms a single system. Especially, if I have a system where I "manually" allocate 10 meg for my use, never free it but use an internal method to mark the memory free or used, I will still have issues with caching and virtual memory based on my use of the memory. IE, reusing memory effectively creating a "roll your own" free and allocate functions.

And in general, how contiguously you allocate memory plays a big part in whether freed memory can be easily discarded from the cache. If you get the heap to be exactly like a stack, then the cache shouldn't have problems. But I'll admit I'm not an expert and I could be missing something.

rocqua · on May 18, 2020

This article isn't so much about free-ing, but about unmapping memory. It could well be that you have an allocator that decides not to un-map free-ed memory so that it can quickly be re-used later.

That said, as per https://linux.die.net/man/3/malloc (and the article) the default implementation of free will (in some cases) unmap memory.

It is this un-mapping of memory that causes other threads to be affected. Because those threads should get a seg-fault after the unmapping if they try to access that memory.

userbinator · on May 17, 2020

Which is to say, "does your knowledge of your application's memory needs and memory performance trump all the effort and knowledge that went into creating the memory allocator of the operating system you're using?".

The OS's memory allocator will likely be optimised for good all-round performance on some benchmark which may or may not be representative of your application.

teknopaul · on May 18, 2020

Re: "does your knowledge of your application's memory needs and memory performance trump all the effort and knowledge that went into creating the memory allocator of the operating system you're using?".

Almost always.

You can almost always answer the question "am I going to need memory again after this free()".

If memory is tied to a request or some other object you almost always know that.

You can almost always make a good guess at how much data a repeated operation will need, thus how much memory to reuse.

You almost always know the benefit of caching data in memory that is otherwise on disk. Often the OS is better at this than the developer thinks!

You almost always know if data is likely to be needed in memory again. When you free() you lose the space and the data. Quite often you can benefit from a managed free() that you can ask for the memory again still populated if nothing else needed the space it in the end.

You almost always know the security concequences of reusing memory. e.g. the OS cannot tell if calloc or malloc is needed. Secure(ish) allocators are a thing.

You almost always know if a memory block might grow or is fixed size.

You almost always know when you need an i/o buffer that can be reused.

You almost always know if smart pointers are beneficial or not.

In C its perfectly normal to allocate on the stack rather than the heap because you know better than malloc().

You often know more than the allocator about memory alignment.

I think you could almost always do better than the OS allocator when writing servers.

I'd agree its often not worth the effort.

Unless you are writing lots of C. free() is such a pain.

You can almost always write better, safer, faster memory allocators than ad hoc use of malloc() and free().

brandmeyer · on May 17, 2020

This whole rigmarole is necessary for a single reason: TLBs don't participate in the cache coherency system.

Uh, why is that? If they did participate, then the mere act of writing to the cache line(s) which change the mapping would implicitly invalidate all of the associated entries in all of the system's TLBs. (Handwave, handwave), maybe you still end up needing a barrier similar to the instruction barrier needed when altering the content of executable pages.

What's the downside? Is it just power? Or is there something more fundamental about the TLB structure that makes it impractical?

eqvinox · on May 17, 2020

I'm guessing what you mean is for the TLB to get a notification when the physical memory that contains the page mapping it was loaded from is changed.

(As opposed to, some weird contortion the other way around where you touch the cache for the mapping that got changed.)

It'd probably require holding the cacheline for the TLB entry in the actual cache in at least MESI "S" state. For 5 levels of page tables. And you can't do non-flushing changes (e.g. dirty bit) anymore. I'm no CPU designer but it sounds complicated and bug prone...

brandmeyer · on May 18, 2020

> I'm guessing what you mean is for the TLB to get a notification when the physical memory that contains the page mapping it was loaded from is changed.

Yes, that's right.

Let us conceive of an inclusive cache architecture, just for the sake of argument. L2 already maintains a directory listing of all the lines which are present in L1I and L1D, and forwards cache coherency traffic to those caches based on the messages it receives. Expanding this to the ITLB and DTLB would be exactly the same circuits. The hard parts are already done.

But I think you're onto something with the hardware-managed dirty bit. Its much less general than the S->E->M transition, and doesn't need the Exclusive state at all.

I've done a bit more digging in the background. Turns out that ARMv8's TLB invalidate instructions are broadcasted operations. You don't need IPIs to execute them on every core. So processors do have some snoopy and/or directory management hardware interface for managing TLB shootdown. It just isn't as general-purpose as the rest of the cache maintenance hardware.

eqvinox · on May 18, 2020

A separate broadcast makes much more sense on the complexity front... if it were tied to the cache, the TLB wouldn't really care about anything other than specific transitions. It doesn't need the cacheline contents, and one cacheline would generally hold 8 or 16 PTEs (depending on PTE and cacheline size.) You'd also necessarily be binding to the PTE's physical address since page table walks occur on physical addresses.

With a broadcast, on the other hand, you get the specific transition event you need, and you can choose whether to broadcast the physical PTE, physical data or virtual (+ASID) address, or both. You can also have an acknowledgement returned if needed.

gpderetta · on May 18, 2020

Exactly. It is certainly possible to make TLB coherent: IIRC IBM mainframe architecture has coherent TLBs. But probably has a non trivial cost to implement and and hasn't been done yet for most common architectures. I wouldn't be surprised if it became a thing at some point i t the near future though, as it can be a win for high core count CPUs.

ip26 · on May 17, 2020

I would love to hear from someone with the real answer but I've always assumed it is legacy baggage from the old days. The impact on performance has probably never been big enough to compel the entire x86 ecosystem to make them coherent.

monocasa · on May 17, 2020

I imagine it's because the hardware doesn't have enough context to know which virtual address spaces exist in which cores' TLBs. ASIDs were only added in the virtualization instructions on x86, and I can't think of an OS where those are the same namespace across cores.

A side reason is that it'd probably heavily complicate the top level TLBs to have "please flush yourself" coming from anywhere other than the core they're attached to, and those are in the critical path between L1 and L2.

Totally spitballing here FWIW, I haven't been part of the design process for a TLB.

ncmncm · on May 18, 2020

It is because where it matters, it is too often easy to avoid the problem by better system design. And, it doesn't show up in benchmarks, in part because benchmarks don't unmap, or because they are well designed systems that neatly sidestep the problem.

cryptonector · on May 18, 2020

Nice write-up. I've long believed that malloc(4192) == mmap() is a very bad idea for this reason -- in fact, let your heap get giant pages, damn it. Also, fork() turns out to be pretty evil. fork() is OK when you're forking worker processes very early in a daemon's life, but for most other uses it's just a very bad idea -- use vfork() or posix_spawn() instead.

There are just too many problems with fork(). That paper about vfork() being harmful had it exactly backwards. It is fork() that is harmful. It has safety issues that make using it safely sufficiently hard as to not be worth it in many cases, but the real killer is fork()'s copy semantics, which just kills performance.

(A colleague of mine has used fork() in signal handlers to call abort() on the child side as a way of getting core dumps from live processes without killing them. That's pretty neat, and one of the very few uses of fork() I would endorse.)

forrestthewoods · on May 18, 2020

The anti-windows nipple rubbing was... weird. And not helpful.

As a primarily Windows guy I’d rather just know what, if any, differences there are. =/

idoncarfb · on May 18, 2020

Is this the only thing from the article you found worth pointing out? Also, no one here had a problem with that except you. What does it tell you?

forrestthewoods · on May 20, 2020

That everyone here uses Linux and MacBooks.

pete23 · on May 18, 2020

I like the nipple rubbing!

ncmncm · on May 18, 2020

All of this is moot if you have only the one thread. If you need parallelism, and can afford its complexity, separate processes sharing only what must be shared can eliminate your TLB exposure. Single-writer mappings eliminate cache storms.

All of this is moot if you allocate all your memory upfront.

All of this is moot if you can identify junctures when a stall is free, and cluster your shootdowns at such times.

adwn · on May 18, 2020

> If you need parallelism, and can afford its complexity, separate processes sharing only what must be shared can eliminate your TLB exposure.

Then you need to ensure that all shared data structures are exclusively allocated from a dedicated memory pool which is mapped to all processes. Oh, and you must make sure that you either don't use any pointers within those datastructures, or that the shared memory pool is mapped to the same address in all processes. And no pointers must point from within to outside that memory pool.

In theory, it is possible. In practice, this would preclude any kind of non-trivial parallelism.

ncmncm · on May 18, 2020

I see that you just haven't done it yet.

"Allocation" in this context just means open(), mmap(). A "shared memory pool" is a thing I would have no use for.

But I can assure you that you can have excellent, non-trivial parallelism with separate processes and chosen shared memory pages -- much moreso than with threads that must battle one another for access to locks, queues, and "pools". I routinely get order-of-magnitude performance improvement by reorganizing this way.

The single-writer ring buffer is the component that makes it all work. The environment might seem stark, but in exchange you can have exactly 0% concurrency overhead. Not wasting 90% on pool management and thread contention means other optimizations become meaningful. And, you can start and stop the processes independently.

rawoke083600 · on May 18, 2020

Awesome article... Extra points for posting on blogger.com :)

7532yahoogmail · on May 20, 2020

A hell of a good write up. I learned tons here.

lonelappde · on May 18, 2020

The title is absolutely obnoxious.

idoncarfb · on May 18, 2020

That's only if you're devoid of sense of humor.