
> GPU databases are brilliant for cases where the working set can live entirely within the GPU's memory. For most applications with much larger (or more dynamic) working sets, the PCIe bus becomes a significant performance bottleneck. This is their traditional niche.

> That said, I've heard anecdotes from people I trust that heavily optimized use of CPU vector instructions is competitive with GPUs for database use cases.

This comment is important, imo. It's also relevant to applied ML inference in applications: the memory needs can grow quite a bit, and the data transfer cost, plus GPU memory being far smaller than system RAM, becomes very real very fast.

I'm also not sure I understand the scale of the use case, where it's described, or how it compares to the big data tools mentioned.



Disclosure: I work on Google Cloud (but you don’t need to rent GPUs from us).

Absolutely, but as last year’s discussion highlights, a bunch of GPUs connected via NVLINK kind of gives you the aggregate memory of the set for some of these database applications (large-scale ML training has also gone this way).

That’s why our A100 system design is 16 A100s in a single host. 16x40 GB gives you 640 GB of aggregate memory, which is pretty attractive for many applications.

The question, as always, is cost vs. benefit. If there's something that a GPU-backed <noun> can do that you "couldn't" with a large Intel/AMD CPU box, or that is a large integer multiple cheaper, it's probably worth the development effort.


Interesting. Much of my work here is on production GCP workloads. We've landed on C2s and CPU optimizations for our inference engine but hadn't really considered NVLINK. Now wondering if I can distribute batch inference across multiple GPUs.

We'll be on premium support soon ... hoping we can get access to folks like yourself for some of this.


I should clarify: A100s are likely a bad price/performance trade-off for inference. The T4 part is designed for that, but doesn't have NVLINK.

Do you have models >16 GB that you’re trying to do real-time inference against?

Feel free to send me an email regardless! (In my profile)

Edit: https://news.ycombinator.com/item?id=23800049 was my writeup for a recent Ask HN about cost efficient inference.


A100s support MIG (multi instance GPUs), so you can carve each A100 up into 7 separate GPUs for inference of small models. If your inference workload is too small to feed this beast all at once, it can be pretty handy: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/inde...
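For illustration, here's a minimal sketch (my own, not from the thread) of how a fleet of small inference workers might each get pinned to one MIG slice by setting CUDA_VISIBLE_DEVICES before CUDA initializes. The UUIDs, the CuPy dependency, and the toy workload are all assumptions:

    import multiprocessing as mp
    import os

    # Placeholder MIG identifiers -- copy the real ones from `nvidia-smi -L`.
    MIG_SLICES = [
        "MIG-<uuid-of-slice-0>",
        "MIG-<uuid-of-slice-1>",
        # ... up to 7 slices per A100
    ]

    def worker(mig_uuid, data):
        # Pin this process to a single MIG slice; must happen before CUDA init.
        os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuid
        import cupy as cp  # imported after the env var so it binds to the slice
        x = cp.asarray(data)
        return float(cp.linalg.norm(x))  # stand-in for real inference work

    if __name__ == "__main__":
        inputs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
        with mp.Pool(processes=len(MIG_SLICES)) as pool:
            print(pool.starmap(worker, zip(MIG_SLICES, inputs)))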


Yeah, the sweet spots are landing as:

* CPUs: medium data, and queries that are small / slow / irregular

* GPUs: general analytics over data that is small (in-memory) or large / streaming (replacing Spark):

Data perspective:

-- small data (100MB - 512GB): all fits in GPU memory, so the question is whether queries are boring ("select username from django_table" is better in psql) or compute-heavy ("select price where ..." is better in GPU SQL); see the sketch after this breakdown

-- medium data: data sits in CPU RAM / SSD with the compute nodes, in a preorganized / static fashion, e.g., a time series DB. Too much data for GPU RAM, yet enough for local SSD, so the PCIe bus is the bottleneck (8-32 GB/s)

-- large data (ex: 10TB spread across S3 buckets) + streaming (ex: 10GB/s netflow): you'll be waiting on network bandwidth anyway, so a 10GB/s network link -> 10GB/s PCIe -> GPU wins out over the CPU equivalent. Good chance that, instead of the pricey multi-GPU V100s/A100s, you'll want a fleet of wimpy T4 GPUs.

As high-bandwidth network/disk<>GPU hardware rolls out and libraries automate its use, the current medium-data sweet spot of CPU analytics systems goes away.
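To make the small-data split above concrete, here is a rough cuDF (RAPIDS) sketch of the compute-heavy flavor of query that favors the GPU; the file, column names, and thresholds are made up for illustration:

    import cudf

    # A few GB of trades easily fits in GPU memory for the "small data" case.
    df = cudf.read_parquet("trades.parquet")            # hypothetical file
    hot = df[(df["price"] > 100.0) & (df["qty"] > 10)]  # filter runs on the GPU
    summary = hot.groupby("symbol")["price"].mean()     # aggregation on the GPU
    print(summary.sort_values(ascending=False).head(10))

    # A lookup like "select username from django_table" has no arithmetic to
    # amortize the PCIe transfer, so a plain psql instance wins that one.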

Compute perspective:

-- The category of 'irregular' (non-vectorizable) compute has been steadily shrinking for the last ~30 years as it's an important + fun topic for CS people. Even CPU systems now try to generally optimize for bulk-fetches (cacheline, ...) & SIMD-compute (e.g., SIMD over columns), and that inherently can only go so far until it's effectively a GPU alg on worse hw.

I see other areas in practice like crazy-RAM CPU boxes and FPGA/ASIC systems that I'm intentionally skipping as these end up pretty tailored, while my breakdown above is increasingly common for 'commodity HPC'.


I don't think the "large data" case holds true, and I would not expect it to be economical to use GPUs for that.

First, this essentially limits the scope of "analytics" to selection/aggregation-centric operations, which are memory-bandwidth-bound. Many types of high-value analytic workloads and data models don't look like that. Even when 90% of your workload is optimal for GPUs, I've often seen the pattern that the last 10% is poor enough that it largely offsets the benefit.
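As a back-of-the-envelope illustration of that 90/10 point (the numbers are my own assumptions, not the commenter's measurements):

    # Assumed figures: 10x GPU speedup on the friendly 90% of the workload,
    # and a 3x slowdown on the awkward remaining 10%.
    friendly_fraction = 0.90
    gpu_speedup = 10.0
    tail_slowdown = 3.0

    effective = 1.0 / (friendly_fraction / gpu_speedup
                       + (1.0 - friendly_fraction) * tail_slowdown)
    print(f"effective end-to-end speedup: {effective:.2f}x")  # ~2.6x, not 10x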

Also, GPUs have better memory bandwidth than CPUs but people overlook that CPUs can use their limited memory bandwidth more efficiently for the same abstract workload, so the performance gap is smaller than memory-bandwidth numbers alone would suggest.

Second, 10TB is tiny; this is around the top-end of what we consider "small data" at most companies where I work. For example, in the very broad domain of sensor and spatiotemporal analytics, we tend to use 10 petabytes as the point where data becomes "large" currently, and data sets this size are ubiquitous. This data is stored with the compute when at all possible for obvious reasons -- it ends up looking more like your "medium" case in practice, albeit across a small-ish number of machines. The cost of processing tens of petabytes of data on GPUs would be prohibitive.

Lastly, a growing percentage of analytics at every scale is operational real-time, so new data needs to be integrated into the analytical data model approximately instantly. GPUs are not good at this type of architecture.

GPUs have their use cases, but their Achilles' heel is that their performance sweet spot is too narrow for many (most?) real-world analytic workloads, and for some workload patterns the performance can be much worse than CPUs. CPUs provide much more consistent and predictable performance across diverse and changing workload requirements, which is a valuable property even if it has worse performance for some workloads. Databases give considerable priority to minimizing performance variability because users do.


- RE:First: selection/aggregation-centric-only -- see my comment on irregularity. If there is little compute, CPU vs GPU is moot (go tiny ARM..), but as soon as there is, the scope of vectorizable compute is ever-increasing. In my world, we do a lot of graph + ML + string, which are all now GPU-friendly. They were all iffy when I first started with GPUs. Feel free to throw up examples you're skeptical of. The list is shrinking, it's pretty wild..

- RE:Gap, sort of. Ultimately, it's still typically there, though, in three important ways. The set of interesting compute that actually needs that CPU architecture is increasingly small relative to workloads ("super-speculative thread execution on highly branchy..."). Multi-core CPU vs. single GPU is more like 2-10X for most tuned code: most 100X claims are apples/oranges b/c of that. When you get beyond those workloads, 100X becomes real again for multi-GPU / multi-node b/c of the bandwidth. Yeah, your real-time font library might still win out on CPU SIMD, but you have to dig for stuff like that, while the more data/compute, the more this stuff matters & gets easier.

- RE:scale, storing 10PB in CPU RAM is also expensive, so we're back to streaming... and thus back to where GPUs increasingly win. Even if you could afford that in CPU RAM, you can probably afford making that accessible to the GPUs too, and then save not just on the hw, but the power (which becomes the dominant cost.) Your example of large-scale & real-time spatiotemporal data seems very much leaning towards GPU, all the way from ETL to analytics to ML. It's still hard to write that GPU code as the frameworks are all nascent, so I wouldn't fault anyone for doing CPU on production systems here for another few years.

- RE:real-time: the writing is on the wall, mostly around (again) getting the unnecessary CPU bandwidth bottleneck out of the way in HW, and (harder) the effort to use that in SW.


A critical aspect being ignored is the economics. Highly optimized analytical database code saturates bandwidth on a surprisingly cheap CPU (usually lower-mid range). While a GPU may be 2-4x faster for some operations, it usually pencils out to be at least as expensive operationally, never mind CapEx, for the same workload performance as just using CPUs. This has been a reliable pattern. In which case, why wouldn't you just use CPUs? When you build systems at scale, these cost models are a routine part of the design specs because the bills are steep.

No one stores 10PB in RAM that I know of. A good CPU database kernel will run out of PCIe lanes driving large gangs of NVMe devices at theoretical throughput without much effort. The performance for most workloads is indistinguishable from in-memory, but at a fraction of the cost. It would be slower to insert GPUs anywhere in this setup. (In modern database kernels generally, "in-memory" offers few performance benefits because storage has so much bandwidth that a state-of-the-art scheduler can exploit.) An interesting open research question is the extent to which we can radically reduce, or even eliminate, cache memory, since state-of-the-art schedulers can keep query execution fed off disk for the most part, even in mixed workloads. Write sparsity still recommends a decent amount of cache for mixed workloads, but probably much less than Bélády's optimality algorithm superficially implies.
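Rough arithmetic behind the "run out of PCIe lanes" point, with ballpark numbers of my own rather than anything from the comment:

    # All figures are rough assumptions for illustration.
    nvme_read_gbps = 3.2      # per-drive sequential read (PCIe 3.0 x4 NVMe)
    drives = 8                # a modest gang of NVMe devices
    storage_gbps = nvme_read_gbps * drives   # ~25.6 GB/s off storage

    cpu_mem_gbps = 120.0      # usable memory bandwidth, mid-range server CPU
    pcie3_x16_gbps = 16.0     # the single link a GPU would sit behind

    print(f"storage fan-in: {storage_gbps:.1f} GB/s, "
          f"CPU memory: {cpu_mem_gbps:.0f} GB/s, "
          f"one GPU link: {pcie3_x16_gbps:.0f} GB/s")
    # The storage fan-in already exceeds a single GPU's PCIe link, which is the
    # point: routing that stream through a GPU adds a hop without adding headroom.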

Almost nothing is CPU-bound in databases these days in reasonable designs, not even highly compressive data model representations, parsing, or computational geometry. Which is great! A lot of analytics is join-intensive, but that is more about latency-hiding than computation. I would argue that the biggest bottleneck at the frontier right now is network handling, and GPUs don't help with that, though FPGAs/ASICs might.

I'm not sure how a GPU would help with operational real-time. Is it even possible to parse, process, and index tens of millions of new complex records per second over the wire, concurrent with running multiple ad hoc queries, on a GPU? I've done this many times on a CPU, but I've never seen a GPU database that came within an order of magnitude of that in practice, and I've used a few different GPU databases plus some custom bits. GPUs work better in a batch world.

I use GPUs, just not for analytical databases. I am biased in that GPU databases have consistently failed to deliver on credible workloads across many scenarios in my experience, and I understand at a technical level why they didn't live up to their marketing. Every time one gets stood up in a lab, and I see many of them, it fails to distinguish itself versus a state-of-the-art CPU-based architecture. Most of them actually underperform in absolute terms. Almost everyone I know that has designed and delivered a production GPU database kernel eventually abandoned it because CPUs were consistently better in real-world environments.

GPU capabilities are improving, but I have seen limited progress in directions that address the underlying issues. They just aren't built to be used that way, and there are other applications for which they are exceedingly well suited that we wouldn't want to sacrifice for database purposes. CPU developments like AVX-512 get you surprisingly close to the practical utility of a GPU for databases without the weaknesses.

Anyway, this is a really big conversation. It doesn't fit in the margin of an HN post. :-)


Yeah, so it sounds like you are thinking about I/O-bound workloads for what you consider analytical workloads, not compute-bound ones. A traditional GPU (or your proposal of ASIC/FPGA) doesn't matter almost by definition: external systems can't feed them. No argument there; Spark won't replace your data warehouse's lookup engine :)

Assuming the analytic workload does have some compute, however, that's more of a comment about traditional systems having bad bandwidth than about GPUs themselves. GPUs are already built for latency hiding, so it's more like CPUs are playing catch-up to them. Two super interesting things have been happening here, IMO:

- Nvidia finally got tired of waiting for the rest of the community to expose enough bandwidth. $-wise, they bought Mellanox and are now trying for ARM. In practice, this means providing more storage->device and network->device bandwidth through improved hw+sw like https://developer.nvidia.com/blog/gpudirect-storage (a small sketch follows this list). I'm not privy to hyperscaler discussions on bandwidth across racks/nodes, but the outside trend does seem to be "more bandwidth", and I've been watching for straight-from-network ingest from cloud providers.

- Price drops for more hetero hardware. E.g., the T4 on AWS is ~6X cheaper than the V100 in exchange for less choice in stuff like 64b vs 32b, yet has similar memory and perf within that sweet spot. Nvidia pushes folks to DGX (big SSD -> local multi-GPU), which works for some scales, but in the wild I see people often land on single-T4 / many-node once you take network bandwidth + cost into consideration in larger and more balanced systems.
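As a concrete (and hedged) illustration of that storage->device path, here is roughly what it looks like from Python via the RAPIDS kvikio bindings to cuFile/GPUDirect Storage; the library choice, file path, and buffer size are my own assumptions, not something from the thread:

    import cupy as cp
    import kvikio

    # 256 MB destination buffer already resident in GPU memory.
    gpu_buf = cp.empty(256 * 1024 * 1024, dtype=cp.uint8)

    f = kvikio.CuFile("/data/columns.bin", "r")  # hypothetical column file
    nbytes = f.read(gpu_buf)  # DMA from NVMe into device memory, bypassing host RAM
    f.close()                 # (kvikio falls back to a bounce buffer without GDS)
    print(f"read {nbytes} bytes straight into GPU memory")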

For our own workloads, we don't trust GPU DBs enough to get it right, so we found RAPIDS to be a nicer sweet spot where even junior devs can code it (Python dataframes), while perf people can predictably tune it and appropriately plug in the latest & greatest. Out-of-memory / streaming / etc. only became a thing starting in ~December, e.g., see the recent https://github.com/rapidsai/tpcx-bb results + writeups, so it's been a wild couple of years. We still stick to single-GPU / in-memory for our workloads as we care about sub-second, but have been experimenting & architecting for when the above smooths out for our use (and to help our customers who have different-shaped workloads). I've been impressed by stuff like the T4-many-node experience as layers like dask-cudf and BlazingSQL build up.
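For reference, a rough sketch of what that dask-cudf layering looks like in practice (one worker per local GPU, e.g., a box of T4s); the paths and column names are illustrative, not from our workloads:

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    import dask_cudf

    cluster = LocalCUDACluster()   # spins up one Dask worker per visible GPU
    client = Client(cluster)

    ddf = dask_cudf.read_parquet("s3://bucket/events/*.parquet")  # hypothetical data
    out = (ddf[ddf["status"] == 200]
           .groupby("user_id")["latency_ms"]
           .mean()
           .compute())             # gathered back as a cuDF Series
    print(out.head())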


> GPUs work better in a batch world.

This categorical insight should be front and center of any discussion about the relative merits of using GPUs vs CPUs.


That seems inaccurate. GPUs are used for stuff like real-time games, video, and ML inference: the hardware is explicitly built for heavy streaming.

I would agree that productive GPU data frameworks in streaming modes are nascent, e.g., https://medium.com/rapids-ai/gpu-accelerated-stream-processi... .


The next-gen Nvidia 30x0 series can have direct access to SSDs without hitting the CPU. In that case, would they be any worse than CPUs on any workloads? I guess you could still have larger RAM amounts on the CPU side, albeit usually with slower RAM.


That's for gaming (a new DirectX feature); GPUDirect Storage is supported on older GPUs as well for compute. https://developer.nvidia.com/blog/gpudirect-storage/


GPUDirect is only on the data center cards afaik.


Interesting. Then the only difference is that a CPU easily has 256 GB of RAM, while a high-end GPU typically has 16 GB or so?

Maybe Nvidia will start to create ML-type cards with memory expansion options at some point.


Hmm - in our case all of the embeddings we process exist on SSD already. Idk enough here honestly but will see what I can learn.


Wow nvidia is selling SBCs now?


No, there's a new thing about giving GPUs some kind of DMA to storage. And it's pointless on HDDs, so it's only discussed in terms of SSDs.

Microsoft is bringing the DirectStorage API from Xbox to Windows; Nvidia calls theirs RTX IO. I think they're the same class of idea, like Vulkan vs. Metal.

They do have SBCs, I think, but other than being the basis for the Nintendo Switch I haven't heard much about them.




