
The question is even fuzzier: how much of that will be massive data parallelism via an on-node GPU? We are already reaching a point where a single machine with 4 CUDA cards can outperform a distributed cluster of 1000 nodes on some applications (the trick would be to eventually do both).

The memory hierarchy is our enemy here: the reason GPUs have done so well is that they schedule memory just as much as (if not more than) they schedule computation. If you are going to go through the trouble of coordinating threads to share caches (assuming this is possible at all), you might have a GPU-friendly problem.



1. Coordinating threads on a GPU means using shared memory effectively, which is required to get decent performance on almost any application (see the sketch after this list). With many cores and hardware threads sharing cache on CPUs, it is also important for many applications if you insist on getting the best performance. For example, it is mandatory in order to reach 90% efficiency on HPL on Blue Gene/Q. Of course, not content with merely doing clever things to make their machines look good, IBM had John patent his cooperative prefetch: http://www.google.com/patents/US8490071

2. GPUs have less "close" memory (registers/shared memory/cache) per vector lane than current CPUs. This means that to get high efficiency, you have to find fine-grained parallelism without additional overhead. This is hard and it is common to see GPU algorithms make more round-trips to global memory than analogous CPU algorithms, eating into any benefits in raw bandwidth.

3. GPUs have a relatively narrow range of problem sizes in which they perform well. You typically have to use a significant fraction of device memory to expose enough parallelism to keep all the cores busy, and yet the device has limited memory compared to the CPU. On a GPU-heavy configuration, you have placed 90% of your compute next to 10% of your memory, connected by a straw (the PCI bus) to the rest of the memory and the network. That is not a recipe for a versatile machine. CPUs give you vastly more flexibility in turn-around time (e.g., strong scaling at >50% efficiency over a factor of 1000 rather than 10). GPU performance results usually choose a problem size that fills device memory, but science/engineering is often not that convenient.

4. Even for problems in which GPUs perform optimally (like DGEMM or HPL), the energy-efficiency advantage is only about 2x. See http://green500.org for example. Note that Blue Gene/Q is a CPU architecture that delivers the same energy efficiency as the GPU-heavy Titan. Also note that Haswell improves Intel's efficiency by 2x over Sandy Bridge. The 1000x myth needs to die.

5. Enterprise GPUs (those with ECC) and Xeon Phi (MIC) are expensive ($3k-4k MSRP) relative to CPUs, and still need a host in almost all configurations. In performance tests, normalize-by-shrinkwrap (comparing per-device sticker price alone) needs to die. Normalize by total acquisition cost or by total energy consumption (always including the host, memory, and network as applicable).
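
To make points 1 and 2 concrete, here is a minimal CUDA sketch of a tiled matrix multiply, in which a block of threads cooperatively stages tiles in shared memory so that each value fetched from global memory is reused many times. The kernel name, tile size, and row-major layout are my own illustrative choices, not anything from the posts above.

    #include <cuda_runtime.h>

    #define TILE 16  // tile edge; arbitrary illustrative choice

    // Tiled C = A * B for square N x N row-major matrices.
    // Launch with dim3 block(TILE, TILE) and a grid covering ceil(N/TILE) in each dimension.
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
            // Cooperative load: each thread brings one element of each tile into shared memory,
            // so every global load is reused TILE times by the block.
            int aCol = t * TILE + threadIdx.x;
            int bRow = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
            __syncthreads();  // whole block must see the tiles before using them

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();  // don't overwrite tiles while other threads still read them
        }

        if (row < N && col < N)
            C[row * N + col] = acc;
    }

Drop the shared-memory staging and every thread streams its operands straight from global memory, which is exactly the extra-round-trip penalty described in point 2.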


1. Yes, that's the whole point. The GPU makes mandatory what we muck around with on CPUs.

2. Every GPU generation adds more on-chip resources like cache. Every two years the story shifts further in favor of GPUs.

3. My colleague trains huge multi-gigabyte models on GPUs, so it's not impossible. Terabytes of data are still the domain of MPI and, increasingly, MapReduce.

4. This is really not a concern for us, nor for anyone who is in it for the performance (6 hours vs. 6 days).

5. The Phi is still kind of a joke. Tesla is quite competitive in terms of pricing.


Can you point to a problem where a GPU actually outperforms a CPU by 250x and the CPU is not being criminally underused? I have tried to find such examples and never found one.

Unless, maybe, you mean the communication costs are the real bottleneck in such cases? In which case I don't see the relevance of the GPU angle.


Yes, distributed clusters are limited by communication costs, not computational power. Parallel computing in general is limited by communication costs, even on a single node (e.g., the time it takes to service a cache miss). Minimizing communication is important in both cases.
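
To put a rough number on the "even on a single node" part, here is a back-of-envelope sketch; the miss latency and per-core flop rate are assumptions plugged in for illustration, not measurements from this thread.

    #include <cstdio>

    int main(void)
    {
        // Assumed ballpark figures (illustrative only):
        const double miss_latency_s = 100e-9; // ~100 ns to service a cache miss from DRAM
        const double core_flop_rate = 10e9;   // ~10 Gflop/s sustained by one core

        // Arithmetic a core could have done while one uncovered miss is in flight.
        const double flops_per_miss = miss_latency_s * core_flop_rate;
        printf("One uncovered cache miss costs roughly %.0f flops of lost work\n",
               flops_per_miss);  // ~1000 with these numbers
        return 0;
    }

With numbers in that range, a single uncovered miss costs on the order of a thousand arithmetic operations, which is why minimizing communication dominates on-node as well as across the network.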

DNN training is one problem where the GPU solution vastly outperforms the distributed HPC solution.


GPUs are useful for algebraic geometry. The speedup is not quite 250x, but something like 80x or so.

http://www.mpi-inf.mpg.de/~emeliyan/phd_thesis.pdf


Switching one of the fancier-looking PS4 titles over to software rendering might do it. :)



