Hacker News | bradcray's comments

@yubblegum: I'm unfairly biased towards Chapel (positively), so won't try to characterize HN's opinion on it. But I did want to note that while Chapel's original and main reason for being is HPC, now that everyone lives in a parallel-computing world, users also benefit from using Chapel in desktop environments where they want to do multicore and/or GPU programming. One such example is covered in this interview with an atmospheric science researcher for whom it has replaced Python as his go-to desktop language: https://chapel-lang.org/blog/posts/7qs-dias/


Thank you Brad! I was in fact wondering about GPU use myself. Does it work with Apple's M# GPUs?

Btw, I was looking at the docs for GPUs [1], and some unsolicited feedback from a potential user is that the setup process needs to become less painful. For example, yesterday I installed it via brew, but then hit the setup page for GPUs and noted that I now needed to build from source.

(Back in the day, one reason some of Sun's efforts to extend Java's fiefdom faltered was the friction of setup for (iirc) things like Applets, etc. I think Chapel deserves a far wider audience.)

[1]: https://chapel-lang.org/docs/technotes/gpu.html#setup (for others - you obviously know the link /g)

p.s. just saw your comment from last year - dropping it here for others: https://news.ycombinator.com/item?id=39032481


@yubblegum: I'm afraid we don't have an update on support for Apple GPUs since last year's comment. While it comes up from time to time, nobody has opened an issue for it yet (please feel encouraged to!), and it isn't something we've had the chance to prioritize, as a lot of our recent work has focused on improving tooling support and addressing user requests.

I'll take your feedback about simplifying GPU-based installs back to our team, and have noted it on this thematically related issue: https://github.com/chapel-lang/chapel/issues/25187#issuecomm...


The ~10-minute video for this talk is here, if anyone's interested in the narrative behind the slides: https://www.youtube.com/watch?v=U8KM8wv32js


That doesn't seem extreme to me, as I generally feel similarly. If you (or other readers) are genuinely interested in using Chapel with Metal, please open an issue on our GitHub repository capturing your request, as that would be valuable to us.

Just to make sure it didn’t get lost, note that it is possible to develop GPU code in Chapel on a MacBook using the cpu-as-device mode Engin mentions above, and then deploy it on NVIDIA GPUs on production systems by recompiling. This is how I develop/debug GPU computations in Chapel.
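To make that workflow concrete, here's a minimal sketch (my assumption of a representative program, using the `CHPL_GPU=cpu` setting described in the GPU technote): the same source runs in cpu-as-device mode on a laptop and as a real kernel launch when recompiled for an NVIDIA system.

```chapel
config const n = 1000;

on here.gpus[0] {
  var A: [1..n] int;

  // Order-independent, so eligible to compile into a GPU kernel;
  // with CHPL_GPU=cpu it simply executes on the CPU instead.
  foreach i in 1..n do
    A[i] = i*i;

  writeln(A[n]);
}
```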


These are great questions, and ones we’re very curious about as well. I don’t believe that our current Chapel team has much experience programming NNs and LLMs, having focused on other areas. That said, I’m also not aware of any intrinsic barriers to implementing such algorithms in a portable way within Chapel, potentially calling out to vendor-optimized implementations when available and appropriate.

If you, or others, would be interested in exploring this topic, we’d be very interested in either partnering with you or supporting your efforts.

(Also see Engin's response about programming tensor cores for some thematically related thoughts: https://news.ycombinator.com/item?id=39020703 )


Chapel was designed for the high performance computing community where programmers often want full control over mapping their computations to their hardware resources without needing to rely on techniques like virtualization or runtime load balancing, which can obscure key details. That said, higher-level abstractions can be (and have been) written in Chapel to insulate many computations from these system-level details, such as distributed arrays and iterators. Users of these higher-level features need not worry about the details of the underlying locales. We refer to this as Chapel's support for multiresolution programming.
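As a small illustration of that higher level (a sketch assuming the `blockDist` factory routines from recent Chapel releases), a user can compute over a distributed array without ever naming a locale:

```chapel
use BlockDist;

config const n = 1_000_000;

// A block-distributed domain and array spanning all locales; no
// locale appears explicitly in the user's code.
const D = blockDist.createDomain({1..n});
var A: [D] real;

// The distribution's parallel iterator runs each chunk of the
// iteration space on the locale that owns it.
forall i in D do
  A[i] = i / 2.0;
```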

Of course, other communities may prefer different approaches due to differing needs and constraints.


What you primarily want in HPC is control over where your data is stored. That is subtly different from where your computations are performed. E.g., an HPC computation may use N heterogeneous devices and require fine-grained control over how data is communicated between those devices. The examples with "locales" are too blunt to handle such scenarios.


We agree that the placement of data is important for HPC programmers to control. Locales are the means of controlling such placement in Chapel, whether directly (as in this article’s simple examples) or via abstractions like distributed arrays (whose implementations rely on locales).

Once the data is created, computations can be executed with affinity to a specific variable in a data-driven manner using patterns like `on myVar do foo(myVar, anotherVar)`. Alternatively, an abstraction can hide such details from the user's concern and control the affinity within its implementation, as the parallel iterator implementing `forall elem in MyDistributedArray` does.
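For instance (a minimal sketch of my own; it assumes the program is run on at least two locales):

```chapel
var x = 42;  // stored in locale 0's memory

on Locales[1] {
  // This task runs on locale 1, but can still compute with
  // affinity to x's locale in a data-driven way:
  on x do
    writeln("running where x lives: locale ", here.id);
}
```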


According to the article, locales control where the code is running, not where the data is stored. Maybe that is implied in some cases, such that if you create data in one locale that is also where it is stored, but it tells you nothing about how data created in one locale and accessed in another locale is handled (or even whether that's allowed). As you mention other Chapel features that I don't know about, they may fill in the gaps. My only point of contention is that the locale feature is poorly thought out and not a good way to address HPC needs.


Locales do control where the data is stored. For example:

  var HostArr: [1..10] int;  // allocated on the host memory
  
  on here.gpus[0] {
    // now we are on a GPU sublocale...
    var DevArr:[1..10] int;  // allocated on the device memory
    ...
  }

In the near term, we are planning to publish our 2nd GPU blog post, where we will discuss how to move data between the device and the host.
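Until then, here is a minimal sketch of one way such movement can be expressed (my assumption: whole-array assignment between host and device arrays performs the copy, as cross-locale array assignments do generally):

```chapel
var HostArr: [1..10] int;
HostArr = 1;  // scalar assignment fills the host array

on here.gpus[0] {
  var DevArr: [1..10] int;
  DevArr = HostArr;         // host-to-device copy via array assignment
  foreach d in DevArr do    // order-independent, so GPU-eligible
    d *= 2;
  HostArr = DevArr;         // device-to-host copy
}
```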


@ColonelPhantom: Thanks very much for your questions. The following are answers I'm relaying from Engin Kayraklioglu, who heads up the Chapel GPU effort:

Re Intel support: That's definitely in our plans. However, there are also many other areas we are actively working on to add more features, fix bugs, and improve performance. When prioritizing, we typically make decisions based on what our current and potential users might need in the language. Frankly, we are not seeing a big push for Intel GPU support so far, so currently it is not near the top of our priorities. If you (or other readers) have any input on that matter, where a lack of Intel support might be a blocker for trying out Chapel and/or its GPU support, definitely let us know.

Re implicit serialization: To clarify: the serialization based on order-dependence is not implicit. Users should use a `for` loop if their loop is order-dependent and `foreach` (or `forall`) if their loop is order-independent. In other words, the Chapel compiler doesn't make decisions about order-dependence. In particular, for GPU execution, a `for` loop will never turn into a GPU kernel.

There are, however, some cases where a `foreach` does not turn into a kernel. You may be referring to those cases, but that's not related to order-dependence. Some Chapel features cannot execute on a GPU. If your `foreach` loop's body uses any of those features, then it will not be launched as a kernel even though `foreach` signals order-independence. Now, a subset of the features that make an order-independent loop GPU-ineligible are there because we haven't gotten a chance to properly address them yet. Another subset will remain blockers for a longer time, maybe forever. For example, your `foreach` loop could be calling an external host function.
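A minimal sketch of the distinction (my own illustrative loops, not from the thread; the `foreach` is kernel-eligible, the `for` never is):

```chapel
config const n = 1024;

on here.gpus[0] {
  var A: [1..n] int;

  // Order-independent: eligible to become a GPU kernel.
  foreach i in 1..n do
    A[i] = i*i;

  // Order-dependent (a prefix sum): runs serially, never a kernel.
  for i in 2..n do
    A[i] += A[i-1];
}
```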


Sorry for what now appears to be a double-post. Engin had just registered for HN and hadn't seen his reply go through, so he asked me to relay it.

Re-reading this Q+A this morning, I also wanted to clarify one thing, which is that when a 'foreach' or 'forall' does end up being executed on the CPU, that doesn't mean it has been serialized. 'foreach' loops on the CPU are candidates for vectorization while 'forall' loops typically result in multicore task-parallelism with each task also being a candidate for vectorization.


I would say Chapel was created less to replace MPI and more to provide a higher-level alternative to it that is amenable to compiler optimization.


Those interested in the intersection between Python, HPC, and data science may want to take a look at Arkouda, which is a Python package for data science at massive scales (TB of memory) at interactive rates (seconds), powered by Chapel:

* https://github.com/Bears-R-Us/arkouda

* https://twitter.com/ChapelLanguage/status/168858897773200179...


My answer would be that Chapel supports a partitioned global namespace such that a variable within the lexical scope of a given statement can be referenced whether it is local to that CPU's memory, stored on a remote compute node, or stored within a GPU's memory (say). The compiler and runtime implement the communication on the programmer's behalf and take steps to optimize away unnecessary communication. Other key features include first-class support for creating parallel tasks in high-level ways, including parallel loops.
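A minimal sketch of that global namespace (my illustration; it assumes the program runs on multiple locales):

```chapel
var x = 10;  // allocated in locale 0's memory

on Locales[numLocales-1] {
  // This code runs on the last locale, yet 'x' is still in lexical
  // scope; the compiler and runtime implement the remote read.
  writeln(x + 1);
}
```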


My understanding is that the Julia Petaflops run executed a Julia program per node, communicating via MPI. For some, that's probably obvious/expected for HPC; for others, it might not be considered "pure Julia".


That's how it works in any language on a supercomputer. MPI is pretty much the only game in town for inter-node communication.


@cbkeller: Though MPI is dominant in HPC by a very large margin, it's definitely not the only game in town. SHMEM is an MPI alternative with a smaller but very dedicated following. UPC, Fortran 2008, UPC++, and Chapel are all alternatives that support inter-node communication without relying on MPI or explicit library communication calls. Chapel has the additional advantage of not imposing the SPMD programming model on the user and supporting asynchronous dynamic tasking.

It's my understanding that Julia aspires to join this group of languages if it is able to do so, which is why the Petaflops announcement was originally enticing to me, and then became somewhat less so once I learned that it was relying on MPI.

