
- RE:First: selection/aggregation-centric-only -- see my comment on irregularity. If there is little compute, CPU vs GPU is moot (go tiny ARM..), but as soon as there is, the scope of vectorizable compute is ever-increasing. In my world, we do a lot of graph + ML + string, which are all now GPU-friendly. They were all iffy when I first started with GPUs. Feel free to throw up examples you're skeptical of. The list is shrinking, it's pretty wild..

- RE:Gap, sort of. Ultimately, it's still typically there, though, in three important ways. The set of interesting compute that genuinely needs a CPU architecture ("super-speculative thread execution on highly branchy code...") is increasingly small relative to workloads. Multi-core CPU vs. single GPU is more like 2-10X for most tuned code: most 100X claims are apples/oranges b/c of that. Once you get beyond those workloads, 100X becomes real again for multi-GPU / multi-node b/c of the bandwidth. Yeah, your real-time font library might still win out on CPU SIMD, but you have to dig for stuff like that, while the more data/compute you have, the more this stuff matters & the easier it gets.

- RE:scale, storing 10PB in CPU RAM is also expensive, so we're back to streaming... and thus back to where GPUs increasingly win. Even if you could afford that in CPU RAM, you can probably afford making it accessible to the GPUs too, and then save not just on the hw, but on the power (which becomes the dominant cost). Your example of large-scale & real-time spatiotemporal data seems to lean heavily towards GPU, all the way from ETL to analytics to ML. It's still hard to write that GPU code as the frameworks are all nascent, so I wouldn't fault anyone for doing CPU on production systems here for another few years.

- RE:real-time: the writing is on the wall, mostly around (again) getting the unnecessary CPU bandwidth bottleneck out of the way in HW, and (harder) the efforts to use that in SW.



A critical aspect being ignored is the economics. Highly optimized analytical database code saturates bandwidth on a surprisingly cheap CPU (usually lower-mid range). While a GPU may be 2-4x faster for some operations, it usually pencils out to be at least as expensive operationally, never mind CapEx, for the same workload performance as just using CPUs. This has been a reliable pattern. In which case, why wouldn't you just use CPUs? When you build systems at scale, these cost models are a routine part of the design specs because the bills are steep.
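
A back-of-envelope version of that cost argument, in Python with entirely made-up numbers (swap in your own pricing and measured speedups; none of these figures come from the discussion above):

    # Hypothetical cost model: is a 2-4x faster GPU node actually cheaper per job?
    cpu_node_cost_per_hr = 2.00   # assumed mid-range CPU box that saturates its bandwidth
    gpu_node_cost_per_hr = 6.00   # assumed comparable node with one datacenter GPU attached
    gpu_speedup = 3.0             # assumed GPU speedup on the same scan/aggregate workload

    cpu_cost_per_job = cpu_node_cost_per_hr * 1.0          # 1 hour of work on the CPU
    gpu_cost_per_job = gpu_node_cost_per_hr / gpu_speedup  # same job, 3x faster on the GPU

    print(f"CPU: ${cpu_cost_per_job:.2f}/job   GPU: ${gpu_cost_per_job:.2f}/job")
    # With these numbers the GPU job costs the same despite being 3x faster,
    # which is the "pencils out to at least as expensive" point.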

No one I know of stores 10PB in RAM. A good CPU database kernel will run out of PCIe lanes driving large gangs of NVMe devices at theoretical throughput without much effort. The performance for most workloads is indistinguishable from in-memory, but at a fraction of the cost. It would be slower to insert GPUs anywhere in this setup. (In modern database kernels generally, "in-memory" offers few performance benefits because storage has so much bandwidth that a state-of-the-art scheduler can exploit.) An interesting open research question is the extent to which we can radically reduce, or even eliminate, cache memory, since state-of-the-art schedulers can keep query execution fed off disk for the most part, even in mixed workloads. Write sparsity still recommends a decent amount of cache for mixed workloads, but probably much less than Bélády's optimality algorithm superficially implies.
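
Rough bandwidth arithmetic behind the "indistinguishable from in-memory" claim, using assumed round numbers rather than measurements:

    # Assumed figures: ~7 GB/s per PCIe 4.0 x4 NVMe drive, 16 drives on one socket,
    # and per-socket DRAM bandwidth in the ~200 GB/s class.
    nvme_gbps = 7.0
    drives = 16
    dram_gbps = 200.0

    storage_gbps = nvme_gbps * drives
    print(f"aggregate NVMe scan bandwidth: {storage_gbps:.0f} GB/s vs DRAM ~{dram_gbps:.0f} GB/s")
    # ~112 GB/s of raw read bandwidth from storage is within striking distance of DRAM,
    # which is why a good scheduler can keep query execution fed off disk.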

Almost nothing is CPU-bound in databases these days in reasonable designs, not even highly compressive data model representations, parsing, or computational geometry. Which is great! A lot of analytics is join-intensive, but that is more about latency-hiding than computation. I would argue that the biggest bottleneck at the frontier right now is network handling, and GPUs don't help with that, though FPGAs/ASICs might.

I'm not sure how a GPU would help with operational real-time. Is it even possible to parse, process, and index tens of millions of new complex records per second over the wire, concurrent with running multiple ad hoc queries, on a GPU? I've done this many times on a CPU, but I've never seen a GPU database that came within an order of magnitude of that in practice, and I've used a few different GPU databases plus some custom bits. GPUs work better in a batch world.
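
For a sense of scale, a quick calculation of the ingest rate being described (the record size is an assumption, not a measurement):

    # "Tens of millions of new complex records per second" at an assumed wire size.
    records_per_sec = 30_000_000
    bytes_per_record = 500          # assumed average size of a complex record

    ingest_gbps = records_per_sec * bytes_per_record / 1e9
    print(f"~{ingest_gbps:.0f} GB/s of sustained parse+index ingest, concurrent with queries")
    # ~15 GB/s is already a large fraction of a single GPU's ~32 GB/s PCIe 4.0 x16
    # host link, before any query traffic, so the ingest path dominates the design.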

I use GPUs, just not for analytical databases. I am biased in that GPU databases have consistently failed to deliver credible results across many workloads in my experience, and I understand at a technical level why they didn't live up to their marketing. Every time one gets stood up in a lab, and I see many of them, it fails to distinguish itself versus a state-of-the-art CPU-based architecture. Most of them actually underperform in absolute terms. Almost everyone I know who has designed and delivered a production GPU database kernel eventually abandoned it because CPUs were consistently better in real-world environments.

GPU capabilities are improving, but I have seen limited progress in directions that address the underlying issues. They just aren't built to be used that way, and there are other applications for which they are exceptionally well suited that we wouldn't want to sacrifice for database purposes. CPU developments like AVX-512 get you surprisingly close to the practical utility of a GPU for databases without the weaknesses.
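
As a rough illustration of the CPU-SIMD side of that comparison, here is a vectorized columnar filter + aggregate in NumPy, whose kernels dispatch to whatever SIMD the chip offers (AVX2/AVX-512 where available); the column names and sizes are made up:

    import numpy as np

    n = 50_000_000
    price = np.random.rand(n).astype(np.float32)
    qty = np.random.randint(1, 10, n).astype(np.int32)

    mask = price > 0.9                         # predicate evaluated as a SIMD scan
    revenue = (price[mask] * qty[mask]).sum()  # vectorized multiply + reduction
    print(f"selected {mask.sum():,} rows, revenue {revenue:,.0f}")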

Anyway, this is a really big conversation. It doesn't fit in the margin of an HN post. :-)


Yeah, so it sounds like what you consider analytical workloads are I/O-bound, not compute-bound. A traditional GPU (or your proposal of ASIC/FPGA) almost by definition doesn't matter there: external systems can't feed them. No argument on that, Spark won't replace your data warehouse's lookup engine :)

Assuming the analytic workload does have some compute, however, that's more of a comment about traditional systems having bad bandwidth than about GPUs themselves. GPUs are already built for latency hiding, so it's more like CPUs are playing catch-up to them. Two super interesting things have been happening here, IMO:

- Nvidia finally got tired of waiting for the rest of the community to expose enough bandwidth. $-wise, they bought Mellanox and are now trying for Arm. In practice, this means providing more storage->device and network->device bandwidth through improved hw+sw like https://developer.nvidia.com/blog/gpudirect-storage (rough sketch of that path after this list). I'm not privy to hyperscaler discussions on bandwidth across racks/nodes, but the outside trend does seem to be "more bandwidth", and I've been watching for straight-from-network ingest from cloud providers.

- Price drops for more hetero hardware. E.g., the T4 on AWS is ~6X cheaper than the V100 in exchange for fewer choices in stuff like 64b vs 32b, yet has similar memory and perf within that sweet spot. Nvidia pushes folks to DGX (big SSD -> local multi-GPU), which works for some scales, but in the wild, I see people often land on single-T4 / many-node once you take network bandwidth + cost into consideration in larger and more balanced systems.
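
To make the storage->device path in the first bullet concrete, a minimal sketch assuming the RAPIDS kvikio Python bindings for cuFile/GPUDirect Storage (the file path is hypothetical, and the read transparently falls back to a host bounce buffer when GDS isn't enabled):

    import cupy as cp
    import kvikio

    buf = cp.empty(256 * 1024 * 1024, dtype=cp.uint8)  # 256 MB device-resident buffer
    f = kvikio.CuFile("/data/part-0000.bin", "r")      # hypothetical file path
    nbytes = f.read(buf)                               # NVMe -> GPU memory read
    f.close()
    print(f"read {nbytes} bytes straight into GPU memory")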

For our own workloads, we don't trust GPU DBs enough to get it right, so we found RAPIDS to be a nicer sweet spot where even junior devs can code it (Python dataframes, sketched below), while perf people can predictably tune it and appropriately plug in the latest & greatest. Out-of-memory / streaming / etc. only became a thing starting in ~December, e.g., see the recent https://github.com/rapidsai/tpcx-bb results + writeups, so it's been a wild couple of years. We still stick to single-GPU / in-memory for our workloads as we care about sub-second, but have been experimenting & architecting for that as the above smooths out for our use (and to help our customers who have different-shaped workloads). I've been impressed by stuff like the T4-many-node experience as layers like dask-cudf and blazingsql build up.
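
A minimal sketch of that single-GPU dataframe style with cudf (the parquet file and column names are hypothetical):

    import cudf

    df = cudf.read_parquet("events.parquet")            # hypothetical input file
    out = (
        df[df["latency_ms"] < 250]                      # vectorized filter on the GPU
          .groupby("customer_id")
          .agg({"latency_ms": "mean", "bytes": "sum"})  # groupby-aggregate on the GPU
    )
    print(out.head())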


> GPUs work better in a batch world.

This categorical insight should be front and center of any discussion about the relative merits of using GPUs vs CPUs.


That seems inaccurate. GPUs are used for stuff like real-time games, video, and ML inference: the hardware is explicitly built for heavy streaming.

I would agree that productive GPU data frameworks in streaming modes are nascent, e.g., https://medium.com/rapids-ai/gpu-accelerated-stream-processi... .



